# HR data EDA

Welcome to this notebook. I will perform an exploratory data analysis (EDA) of the training data.

# Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading and preparing data

In [None]:
hr_frame = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')

In [None]:
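# dropna is left disabled, so rows with missing values stay in the frame
# hr_frame.dropna(inplace=True)
hr_frame.reset_index(inplace=True, drop=True)
hr_frame.info()
# enrollee_id is only an identifier, so it is dropped before the analysis
hr_frame.drop("enrollee_id", axis=1, inplace=True)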

In [None]:
hr_frame.head()

# City development index analysis

### The mean city development index is 0.84, the median is 0.91, and the standard deviation is 0.11. Half of the values are at or below 0.91 and half are at or above it. The standard deviation is small relative to the mean, so most values lie close to the average.
### This means that most candidates come from well-developed cities.
### Most candidates come from city_103, followed by city_21, city_16, city_114, and city_160.

In [None]:
city_dev_index = hr_frame['city_development_index'].sort_values()
city_dev_index.head()
print("Mean:", city_dev_index.mean())
print("Median:", city_dev_index.median())
print("Standard deviation:", city_dev_index.std())
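A mean below the median, as reported above, usually points to a left-skewed distribution: a few low values pull the mean down. The sketch below illustrates this on a small, made-up series (`toy_index` and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical index values, chosen only to illustrate the pattern
toy_index = pd.Series([0.92, 0.92, 0.91, 0.90, 0.89, 0.62, 0.55])

# describe() returns count, mean, std, min, quartiles and max in one call
summary = toy_index.describe()

# A mean below the median hints at a tail of low values
mean_below_median = bool(toy_index.mean() < toy_index.median())

# Sample skewness is negative when the tail is on the left
skewness = toy_index.skew()

print(summary)
print("Mean below median:", mean_below_median)
print("Skewness:", round(skewness, 2))
```

The same calls should work on `hr_frame['city_development_index']` directly.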

In [None]:
hr_frame['city'].value_counts().head(5).plot(kind='bar')

# Gender analysis

### There is a big difference between the genders: men dominate the candidate list.
### There are 8073 men and 804 women. 78 people identify as another gender.

In [None]:
print(hr_frame['gender'].value_counts())
hr_frame['gender'].value_counts().plot(kind='bar')
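`value_counts()` leaves out missing values by default, so if the gender column has gaps, the counts cover only candidates with a recorded gender. The sketch below shows how to include the missing values and how to get shares instead of counts. The series `toy_gender` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical gender column with two missing entries
toy_gender = pd.Series(["Male", "Male", "Female", np.nan, "Male", "Other", np.nan])

# Default behaviour: missing values are ignored
counts_known = toy_gender.value_counts()

# dropna=False adds a separate row for the missing values
counts_all = toy_gender.value_counts(dropna=False)

# normalize=True returns fractions of the non-missing values
shares = toy_gender.value_counts(normalize=True)

print(counts_known)
print(counts_all)
print(shares)
```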

# Relevant experience analysis

### Around 10800 candidates have relevant experience and 3633 do not. Broken down by gender, candidates mostly have relevant experience.
### About 25% of men have no relevant experience. Women fare slightly worse: 30% of them have no relevant experience.

In [None]:
hr_frame[['relevent_experience', 'gender']].value_counts().plot(kind='barh')

In [None]:
gender_exp = hr_frame[['relevent_experience', 'gender']].value_counts()
gender_exp

In [None]:
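# Positional lookups: this assumes value_counts() returned the rows in the order
# (has experience, no experience) for Male, then Female, then Other; check gender_exp above
male_no_exp_prec = (gender_exp.iloc[1]*100)/(gender_exp.iloc[0]+gender_exp.iloc[1])
female_no_exp_prec = (gender_exp.iloc[3]*100)/(gender_exp.iloc[2]+gender_exp.iloc[3])
other_no_exp_prec = (gender_exp.iloc[5]*100)/(gender_exp.iloc[4]+gender_exp.iloc[5])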
print(male_no_exp_prec)
print(female_no_exp_prec)
print(other_no_exp_prec)
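The per-gender percentages can also be computed without relying on row positions. This is a minimal sketch using `pd.crosstab` with `normalize="index"`, run on a made-up frame. The category labels `Has relevent experience` and `No relevent experience` are assumptions about how the column is encoded.

```python
import pandas as pd

# Hypothetical rows; the experience labels are assumed, not confirmed by the notebook
has_exp = "Has relevent experience"
no_exp = "No relevent experience"
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Other", "Other"],
    "relevent_experience": [has_exp, has_exp, has_exp, no_exp,
                            has_exp, no_exp, no_exp, no_exp],
})

# normalize="index" turns each row (gender) into fractions that sum to 1
share = pd.crosstab(toy["gender"], toy["relevent_experience"], normalize="index") * 100

# Percentage of each gender without relevant experience, looked up by label
no_exp_share = share[no_exp]
print(no_exp_share)
```

Label-based lookups keep working if the order of the value counts changes.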

# Enrolled university analysis

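### Most candidates are not enrolled in a university course.
### 3757 candidates are enrolled in a full-time course and 1198 in a part-time course.
### Enrolled candidates make up about 26% of all candidates with a recorded enrollment status, which is about 36 enrolled candidates for every 100 who are not enrolled.

In [None]:
hr_frame['enrolled_university'].value_counts()

In [None]:
hr_frame['enrolled_university'].value_counts().plot(kind='barh')

In [None]:
# value_counts() sorts by frequency: position 0 is the largest group, positions 1 and 2 are the enrolled groups
enrollment_counts = hr_frame['enrolled_university'].value_counts()
attended = enrollment_counts.iloc[1] + enrollment_counts.iloc[2]
# Divide by all recorded values so the result is a share of candidates
percent_of_attended = (100*attended)/enrollment_counts.sum()
percent_of_attended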

# Education level analysis

In [None]:
hr_frame['education_level'].value_counts().sort_values().plot(kind='barh')

# Major discipline analysis

### The vast majority of candidates specialize in the STEM discipline (14492 candidates). 
### The next specializations are:
### * Humanities (669)
### * Other (381)
### * Business Degree (327)
### * Arts (253)
### * No Major (223)

### Most candidates who specialize in STEM have a Graduate education level.
### Data Science relies heavily on math and science, which likely explains why so many candidates specialize in STEM.


In [None]:
hr_frame['major_discipline'].value_counts().sort_values().plot(kind='barh')

In [None]:
hr_frame[['major_discipline', 'education_level']].value_counts().head(5).sort_values().plot(kind='barh')
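To look at the education levels within a single discipline, one option is to filter the rows first and then count. The sketch below does this for STEM on a made-up frame. The labels `Masters` and `Phd` are hypothetical placeholders for other education levels.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "major_discipline": ["STEM", "STEM", "STEM", "Humanities", "Arts", "STEM"],
    "education_level": ["Graduate", "Graduate", "Masters", "Graduate", "Masters", "Phd"],
})

# Keep only STEM rows, then count their education levels
is_stem = toy["major_discipline"] == "STEM"
stem_levels = toy.loc[is_stem, "education_level"].value_counts()

# idxmax() returns the label with the highest count
top_level = stem_levels.idxmax()

print(stem_levels)
print("Most common level among STEM candidates:", top_level)
```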

# Experience analysis

### About 3,300 candidates have over 20 years of experience. There are also many candidates with 5 years of experience.


In [None]:
exp_val_counts = hr_frame['experience'].value_counts()

In [None]:
plt.figure(figsize=(10, 10))
hr_frame['experience'].value_counts().plot(kind='barh')

### Grouping the values into bands shows that the band with up to 5 years of experience is the largest, which may reflect the increasing popularity of Data Science.

In [None]:
# Sum the counts per experience band; the first band includes '<1' and the last includes '>20'
one_to_five = exp_val_counts[['<1', '1', '2', '3', '4', '5']].sum()
six_to_ten = exp_val_counts[['6', '7', '8', '9', '10']].sum()
eleven_to_fifteen = exp_val_counts[['11', '12', '13', '14', '15']].sum()
sixteen_to_twenty = exp_val_counts[['16', '17', '18', '19', '20', '>20']].sum()

print("Candidates with experience between less than one and five years: ", one_to_five)
print("Candidates with experience between six and ten years: ", six_to_ten)
print("Candidates with experience between eleven and fifteen years: ", eleven_to_fifteen)
print("Candidates with experience between sixteen and more than twenty years: ", sixteen_to_twenty)
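The same bands can be built without listing every label by hand. This is a rough sketch using `pd.cut` on a made-up series. Mapping `<1` to 0 and `>20` to 21 is an assumption made here only so the values can be treated as numbers.

```python
import pandas as pd

# Hypothetical experience values in the same string format as the column
toy_experience = pd.Series(["<1", "3", "5", "7", "10", "12", "15", "18", ">20", ">20"])

# Assumption for this sketch: '<1' becomes 0 years and '>20' becomes 21 years
years = toy_experience.replace({"<1": "0", ">20": "21"}).astype(int)

# Right-closed bins: (-1, 5], (5, 10], (10, 15], (15, 21]
bands = pd.cut(
    years,
    bins=[-1, 5, 10, 15, 21],
    labels=["<1 to 5", "6 to 10", "11 to 15", "16 to >20"],
)

# sort_index() keeps the bands in their natural order instead of ordering by count
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

Missing experience values would need to be dropped or filled before the `astype(int)` step.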

# Company size analysis

### Most candidates work in small companies (50 to 99 or 100 to 500 employees).

In [None]:
hr_frame['company_size'].value_counts().plot(kind='barh')

# Company type analysis

### Most candidates work for a private limited company (9817). Next are Funded Startup (1001) and Public Sector (955).

In [None]:
hr_frame['company_type'].value_counts()

In [None]:
hr_frame['company_type'].value_counts().plot(kind='barh')

# Last new job analysis

### The most common gap between a candidate's previous and current job is 1 year.

In [None]:
hr_frame['last_new_job'].value_counts().plot(kind='barh')

#### Thank you for reading my notebook.
#### It is my first notebook. If you enjoyed it and you think it is helpful, please give feedback :)