<div style="width:100%; margin-left:auto; margin-right:auto;">
    <img src=mackenzie_logo.png style="height:70px; float:left; margin-top:0px;"/>
    <img src=intel_logo.png style="height:70px; float:right; margin-top:0px;"/>
</div>

<div style="margin-top:120px;">
    <h1 style="text-align:center;">Human Resources Analytics: A Descriptive Analysis</h1>
</div>

<div style="width:100%; margin-top:30px;">
    <p style="text-align:center;">William Walter da Silva</p>
</div>


### Abstract

In past decades, having the best machines was enough to be competitive or to dominate an industrial sector. Nowadays, the company with the most engaged and productive employees has a better chance of winning the market competition. For this reason, companies cannot afford to lose important employees, and when that begins to happen they need to understand why in order to prevent it. The Human Resources Analytics dataset is used here to explain the first steps on the data analysis path. This first part shows how to become familiar with the dataset by performing a descriptive analysis. Techniques such as exploratory data analysis (EDA) allow us to present the data in a more meaningful way, applying general statistical methods and exploratory graphics that allow a simpler interpretation before applying a machine learning algorithm.

## 1. The Human Resources Dataset

   Human Resources Analytics is a simulated dataset from <a href="https://www.kaggle.com/ludobenistant/hr-analytics">Kaggle</a>, and the focus is to understand why the best and most experienced employees are leaving the company.
   By exploring this dataset it is possible to extract good insights into problems that the Human Resources department deals with daily.
   In many industries, retaining the best employees is a question of long-term strategy. Losing them can affect the company's growth or put it at financial risk, mainly if the employees leave to work for a competitor.
   


## 2. Exploratory Data Analysis (EDA)

Exploratory data analysis employs a variety of techniques (mostly statistical graphics) before making inferences from data. It is essential to examine all variables in the dataset to:

   * Catch mistakes
   * Generate hypotheses
   * See patterns in the data
   * Extract important variables
   * Detect outliers and anomalies
   * Gain deep familiarity with the dataset
   * Refine selection of features that will be used to build the machine learning models.

Take care not to skip the EDA process, because skipping it can lead to inaccurate models, or to accurate models built on the wrong data. This dataset contains 14999 observations and 10 attributes, described below:

              Variables   |  Descriptions
    __________________________________________________________________
    
    satisfaction_level    |  Satisfaction Level
    last_evaluation       |  Last evaluation
    number_project        |  Number of projects
    average_montly_hours  |  Average monthly hours
    time_spend_company    |  Time spent at the company
    Work_accident         |  Whether they have had a work accident
    left                  |  Whether the employee has left
    promotion_last_5years |  Whether had a promotion in the last 5 years
    sales                 |  Departments (column sales)
    salary                |  Salary

## 3. Preprocessing the dataset
   
   Before starting the process, it is important to be clear about what kind of problem we are dealing with, because in many cases it is not so simple to identify. A good understanding of the problem helps to choose the right data mining and machine learning techniques to make the right predictions. Thus, the first step is to preprocess the data to look for missing, incomplete or noisy values, because in the real world raw data can be collected from many sources, such as sensors, websites, public data and many others.

To start preprocessing the dataset, it is necessary to import some useful Python libraries.
   
* NumPy: a fundamental package for numerical computing, with linear algebra and random number capabilities.
  See: www.numpy.org/

* Pandas: a package for working with relational data in the form of tables.
  See: pandas.pydata.org/

In [None]:
import numpy as np
import pandas as pd 

#### Load the data

To load the dataset we use a Pandas function called **read_csv**, which reads CSV (comma-separated values) files and converts them into a DataFrame.

In [None]:
data = pd.read_csv('../input/HR_comma_sep.csv')

Another useful method is **info**, which shows a summary of the dataset, such as the number of observations, the columns, the variable types and the total memory usage. The dataset has 14999 observations and 10 columns, with no null values. The variables comprise 2 float, 6 integer and 2 object columns.

In [None]:
data.info()
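
The null-value check that **info** reports can also be made explicit. The sketch below is a minimal illustration on a small made-up frame (the `toy` name and its values are assumptions, not the HR data); it counts missing values per column and fully duplicated rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration).
toy = pd.DataFrame({
    "satisfaction_level": [0.38, np.nan, 0.11, 0.72],
    "salary": ["low", "medium", None, "low"],
    "left": [1, 0, 1, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that are exact copies of an earlier row.
n_duplicates = toy.duplicated().sum()
print("duplicated rows:", n_duplicates)
```

Run on the real DataFrame, the same two calls (`data.isna().sum()` and `data.duplicated().sum()`) give a quick view of incomplete or repeated records.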

Let's see the first 5 rows of the dataset. The **head** method returns the first *N* rows of the DataFrame, and the **tail** method returns the last *N* rows.

In [None]:
data.head(5)

In [None]:
data.tail(5)

**sample** is an easy way to quickly look at a few random rows.

In [None]:
data.sample(10)

#### Variables transformations

To plot some statistical graphics and for better understanding, we make some transformations in the variables:

* sales: Rename to department
* salary: Convert the type of the variable from categorical to numerical.  

In [None]:
# Rename the column sales to department
data.rename(columns={'sales': 'department'}, inplace = True)

# Convert salary variable type to numeric
data['salary'] = data['salary'].map({'low':1, 'medium':2, 'high':3})
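
One thing to keep in mind with **map** is that any label missing from the dictionary becomes NaN. The sketch below uses a small made-up Series (the `salary_labels` values, including the inconsistent `"Medium"`, are assumptions for illustration) to show how unmapped labels can be detected after the conversion.

```python
import pandas as pd

# Made-up labels; "Medium" is deliberately inconsistent with the mapping keys.
salary_labels = pd.Series(["low", "medium", "high", "Medium"])

# Same mapping as in the notebook: labels not found in the dict become NaN.
salary_codes = salary_labels.map({"low": 1, "medium": 2, "high": 3})

# Labels that were not converted.
unmapped = salary_labels[salary_codes.isna()]
print(unmapped.tolist())
```

If the list of unmapped labels is empty, every category was converted as intended.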

## 4. Descriptive Analysis

Descriptive analysis is used to simplify and summarize the main characteristics of the dataset. In other words, it shows what kind of information the dataset holds. The Pandas method **describe** generates descriptive statistics that summarize the central tendency, dispersion and shape of the dataset. Applying this method to the Human Resources dataset reveals some important insights:
* Approximately 24% of the employees left the company.
* The average satisfaction level is around 62% and the average performance is around 72%.
* Employees work on 4 projects on average, with about 200 hours worked per month.


In [None]:
data.describe()
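
The share of employees who left can be read from **describe** because `left` is a 0/1 column, so its mean is the proportion of ones. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up 0/1 column: 2 of 8 employees left.
toy = pd.DataFrame({"left": [1, 0, 0, 0, 1, 0, 0, 0]})

summary = toy.describe()

# The "mean" row of a binary column is the share of rows equal to 1.
share_left = summary.loc["mean", "left"]
print(f"share that left: {share_left:.0%}")
```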


###   4.1 How many employees work in each department?

Depending on how many employees work in each department, you can learn more about the type of company segment.

In [None]:
print(data['department'].value_counts())

### 4.2 How many employees per salary range?

The employees' salary is divided into Low (1), Medium (2) and High (3), distributed as follows:

In [None]:
print(data['salary'].value_counts())

### 4.3 How many employees per salary range and department?

In [None]:
table = data.pivot_table(values="satisfaction_level", index="department", columns="salary",aggfunc=np.count_nonzero)
table
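
The same kind of count table can also be built with `pd.crosstab`, which counts rows directly and does not need a value column. This is a sketch of an alternative on made-up data (the `toy` frame is an assumption), not a replacement for the pivot table above.

```python
import pandas as pd

# Made-up departments and salary codes (1=low, 2=medium, 3=high).
toy = pd.DataFrame({
    "department": ["sales", "sales", "IT", "IT", "IT", "hr"],
    "salary": [1, 2, 1, 1, 3, 2],
})

# Rows: department, columns: salary level, cells: number of employees.
counts = pd.crosstab(toy["department"], toy["salary"])
print(counts)
```

Combinations that do not occur are filled with 0 rather than NaN.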

### 4.4 How to plot graphics?

In descriptive analysis it is very useful to represent the data with graphics.
For that, it is necessary to import the following libraries:

* Matplotlib: a plotting library, useful for drawing statistical graphics.
  See: www.matplotlib.org

* Seaborn: a library based on Matplotlib that can draw attractive statistical graphics.
  See: seaborn.pydata.org/index.html

In [None]:
%matplotlib inline 

import matplotlib.pyplot as plt

import seaborn as sns

sns.set()

#### Boxplot

A boxplot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.
[Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1]

<center><img src=http://i63.tinypic.com/24dgb6h.jpg width=600px/></center>
<p style="text-align:center;"><strong>Figure 1.</strong> Boxplot example</p>

A boxplot is a good statistical graphic for analyzing the dataset and identifying *outliers*. An outlier is an observation that lies an abnormal distance from the other values; the analyst has to decide what is considered abnormal.
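
A common convention for "abnormal", and the default whisker rule in Matplotlib and Seaborn boxplots, is to flag points farther than 1.5 times the interquartile range (IQR) from the quartiles. The sketch below applies that rule to made-up tenure values (the `tenure` array is an assumption for illustration); the exact quartile values can differ slightly between percentile methods.

```python
import numpy as np

# Made-up years in the company.
tenure = np.array([2, 3, 3, 3, 4, 4, 5, 10])

# Lower and upper quartiles and the interquartile range.
q1, q3 = np.percentile(tenure, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = tenure[(tenure < lower_fence) | (tenure > upper_fence)]
print(outliers)
```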

The boxplots below give information about the data distributions:

* Satisfaction level and last evaluation have left-skewed (negative) distributions.
* Number of projects has a right-skewed (positive) distribution.
* Average monthly hours has a symmetric distribution.
   
Analyzing the distribution of the variables is important because many statistical tests assume a normal distribution.
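
Skewness can also be checked numerically. The sketch below uses the Pandas `Series.skew` method on two made-up samples (both Series are assumptions for illustration): a negative value indicates a left-skewed distribution and a positive value a right-skewed one.

```python
import pandas as pd

# Made-up sample with a long tail to the left (one very low value).
left_skewed = pd.Series([0.10, 0.70, 0.80, 0.85, 0.90, 0.95])

# Made-up sample with a long tail to the right (one very high value).
right_skewed = pd.Series([2, 2, 3, 3, 4, 7])

skew_left = left_skewed.skew()
skew_right = right_skewed.skew()
print("left-skewed sample:", round(skew_left, 2))
print("right-skewed sample:", round(skew_right, 2))
```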

In [None]:
f, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True)

plt.subplots_adjust(wspace=0.5)  # adjust the space between the plots

sns.despine(left=True)

# Vertical boxplot of satisfaction_level to check for outliers
# (passing the column as y draws the box vertically)
sns.boxplot(y='satisfaction_level', data=data, ax=axes[0, 0])

# Vertical boxplot of last_evaluation to check for outliers
sns.boxplot(y='last_evaluation', data=data, ax=axes[0, 1])

# Vertical boxplot of number_project to check for outliers
sns.boxplot(y='number_project', data=data, ax=axes[1, 0])

# Vertical boxplot of average_montly_hours to check for outliers
sns.boxplot(y='average_montly_hours', data=data, ax=axes[1, 1]);

# A ; at the end of the last line suppresses the printed output

The boxplot below shows that, unlike the variables above, time_spend_company has outliers. From it we can conclude:

* The longest-serving employees have been with the company for 10 years, which suggests that it is a relatively young company.
* Most of the employees have been with the company for 3 or 4 years.

In [None]:
plt.figure(figsize=(4,5))
sns.boxplot(y='time_spend_company', data=data);

### 4.5 Correlation Analysis

Correlation is a very useful statistical measure that describes the degree of relationship between two variables. Let's look at the table and the heat map below to see which relationships are present in the data.

In the heat map it is possible to see:

* A negative correlation (-0.39) between satisfaction_level and left.
* The highest positive correlation is between number_project and average_montly_hours (0.42).
* last_evaluation is moderately correlated with number_project (0.35) and average_montly_hours (0.34).
* Work_accident (-0.15) and salary (-0.16) have a weak negative correlation with left.

In [None]:
# numeric_only=True (pandas 1.5+) leaves out the text column department
corr = data.corr(numeric_only=True)
corr
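
By default, `corr` computes the Pearson correlation coefficient. The sketch below, on made-up values (the `toy` frame is an assumption for illustration), computes the coefficient by hand from its definition and compares it with the Pandas result.

```python
import numpy as np
import pandas as pd

# Made-up values: higher satisfaction goes with staying (left = 0).
toy = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.4, 0.1, 0.7, 0.2],
    "left": [0, 0, 1, 1, 0, 1],
})
x = toy["satisfaction_level"]
y = toy["left"]

# Pearson r: sum of products of deviations divided by the
# square root of the product of the sums of squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same coefficient computed by pandas.
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```

A value close to -1 or 1 means a strong linear relationship; a value close to 0 means little or no linear relationship.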

In [None]:
sns.set(style='white')

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy versions

mask[np.triu_indices_from(mask)] = True

# Create the figure
f, ax = plt.subplots(figsize=(13,8))

cmap = sns.diverging_palette(10,220, as_cmap=True)

# Draw the heat map with the mask
ax = sns.heatmap(corr, mask=mask, cmap=cmap, vmax= .5, annot=True, annot_kws= {'size':11}, square=True, xticklabels=True, yticklabels=True, linewidths=.5, 
           cbar_kws={'shrink': .5}, ax=ax)
ax.set_title('Correlation between variables', fontsize=20);


## 5. Hypothesis

Now let's extract some more information and test some hypotheses.

### 5.1 How many employees left the company?

In [None]:
print(data['left'].value_counts()[1],"employees left the company")

In [None]:
# The plot shows the number of employees that stayed and that left the company.
plt.figure(figsize=(4,5))
ax = sns.countplot(x='left', data=data)
total = float(len(data))
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}'.format(height/total),
            ha="center") 
plt.title('Stayed or Left', fontsize=14);

### First Hypothesis

The first hypothesis is that salary is the reason why the employees left the company.
Let's see if this is correct.

In [None]:
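# factorplot was renamed to catplot in newer Seaborn versions
j = sns.catplot(x='salary', y='left', kind='bar', data=data)
plt.title('Employees that left by salary level', fontsize=14)
# salary was mapped to 1=low, 2=medium, 3=high, so the ticks run from Low to High
j.set_xticklabels(['Low', 'Medium', 'High']);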

The graphic "Salaries by department" shows the distribution of the salaries by department.

* Most of the employees in the sales department have low or medium salaries; this may be because in some companies the sales commission is paid separately.
* The technical department comes second in the number of employees receiving low and medium salaries.

In [None]:
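# catplot replaces factorplot, and height replaces the old size argument
h = sns.catplot(x='salary', hue='department', kind='count', height=5, aspect=1.5, data=data, palette='Set1')
plt.title("Salaries by department", fontsize=14)
# salary was mapped to 1=low, 2=medium, 3=high, so the ticks run from Low to High
h.set_xticklabels(['Low', 'Medium', 'High']);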

In the graphic "Salary Comparison":

   * The management department has the biggest difference between the salary of the employees who stayed and those who left.
   * There is no large difference in the other departments.
    
Salary therefore looks too weak to be the main reason why the employees left the company.

In [None]:
sns.set()
plt.figure(figsize=(10,5))
sns.barplot(x='department', y='salary', hue='left', data=data)
plt.title('Salary Comparison', fontsize=14);

### Second Hypothesis

Is it a dangerous job?

The second hypothesis is that employees leave the company because the work is not safe.

In [None]:
sns.catplot(x='Work_accident', y='left', kind='bar', data=data)  # catplot replaces factorplot
plt.title('Employees that had work accident', fontsize=14);

About 14% of the employees had a work accident. Despite this high number of accidents, only 169 of the employees who left the company had had a work accident, so this hypothesis is discarded.

In [None]:
print(data.Work_accident.sum())
print(data.Work_accident.mean())
print((data[data['left']==1]['Work_accident']).sum())
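
The height of each bar in the plot above is the mean of `left` within the group, which is the leaving rate for that group. The sketch below computes these rates directly on made-up data (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Made-up records: 4 employees without an accident, 4 with one.
toy = pd.DataFrame({
    "Work_accident": [0, 0, 0, 0, 1, 1, 1, 1],
    "left":          [1, 1, 0, 0, 0, 0, 0, 1],
})

# Mean of the 0/1 column left per group = share of the group that left.
leaving_rate = toy.groupby("Work_accident")["left"].mean()
print(leaving_rate)
```

Comparing the two rates shows whether employees who had an accident left more or less often than the others.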

### Third Hypothesis

Is this company a good place to grow professionally?

In [None]:
sns.catplot(x='promotion_last_5years', y='left', kind='bar', data=data)  # catplot replaces factorplot
plt.title('Employees who have been promoted in the last 5 years', fontsize=14);

In the last five years only 319 employees were promoted, which is equivalent to 2% of all employees.
This may be a problem: if it is difficult to get promoted, many employees become unmotivated and start looking for a new job.

In [None]:
print(data.promotion_last_5years.sum())
print(data.promotion_last_5years.mean())

**Years in the company**

In the graphic 'Years in the company' we can identify some important characteristics.

* Employees with 7 or more years in the company did not leave, maybe because over the years they have become more comfortable and less interested in looking for a new challenge at another company.
* The problem starts when the employees have more than 3 years in the company and gets worse when they reach 5 years.
* It is too early to say that the difficulty of getting promoted is the main reason why employees leave; more research is needed.

In [None]:
plt.figure(figsize =(7,5))
bins = np.linspace(1.0, 11,10)
plt.hist(data[data['left']==1]['time_spend_company'], bins=bins, alpha=1, label='Employees Left')
plt.hist(data[data['left']==0]['time_spend_company'], bins=bins, alpha = 0.5, label = 'Employee Stayed')
plt.grid(axis='x')
plt.xticks(np.arange(2,11))
plt.xlabel('time_spend_company')
plt.title('Years in the company', fontsize=14)
plt.legend(loc='best');

### Performance Analysis

There are 2 distinct groups of employees: one with poor performance and another with high performance.
It is natural that employees who don't work well leave the company, but the main problem is that the high-performance employees are leaving too, and it is necessary to understand why.

In [None]:
plt.figure(figsize =(7,7))
bins = np.linspace(0.305, 1.0001, 14)
plt.hist(data[data['left']==1]['last_evaluation'], bins=bins, alpha=1, label='Employees Left')
plt.hist(data[data['left']==0]['last_evaluation'], bins=bins, alpha = 0.5, label = 'Employee Stayed')
plt.title('Employees Performance', fontsize=14)
plt.xlabel('last_evaluation')
plt.legend(loc='best');

Among the employees with few projects who left, 98% also had poor performance.

And 95% of the employees with 5 or more projects who left the company had the highest performance.

3 or 4 seems to be the best number of projects.

In [None]:
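# Employees with 2 projects who left, with an evaluation of 0.62 or below
poor_performance_left = data[(data.last_evaluation <= 0.62) & (data.number_project == 2) & (data.left == 1)]
print('poor_performance_left:', len(poor_performance_left))

# Employees with 2 projects who also left, but with an evaluation above 0.62
# (renamed: this filter selects employees who left, not employees who stayed)
better_performance_left = data[(data.last_evaluation > 0.62) & (data.number_project == 2) & (data.left == 1)]
print('better_performance_left:', len(better_performance_left))

print('\n')

# Employees with 5 or more projects and an evaluation above 0.8, split by left
# (the 0.8 threshold for those who left mirrors the filter for those who stayed)
high_performance_left = data[(data.last_evaluation > 0.8) & (data.number_project >= 5) & (data.left == 1)]
high_performance_stayed = data[(data.last_evaluation > 0.8) & (data.number_project >= 5) & (data.left == 0)]
print('high_performance_left:', len(high_performance_left))
print('high_performance_stayed:', len(high_performance_stayed))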

plt.figure(figsize =(7,5))
bins = np.linspace(1.5,7.5, 7)
plt.hist(data[data['left']==1]['number_project'], bins=bins, alpha=1, label='Employees Left')
plt.hist(data[data['left']==0]['number_project'], bins=bins, alpha = 0.5, label = 'Employee Stayed')
plt.title('Number of projects', fontsize=14)
plt.xlabel('number_project')
plt.legend(loc='best');
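
A percentage such as the share of poor performers among the leavers with 2 projects can be computed as the mean of a boolean condition. The sketch below uses made-up records (the `toy` frame is an assumption; the 0.62 threshold is the one used in the cell above) and is only an illustration of the calculation, not a reproduction of the reported figures.

```python
import pandas as pd

# Made-up records for illustration.
toy = pd.DataFrame({
    "last_evaluation": [0.50, 0.55, 0.48, 0.90, 0.52, 0.57],
    "number_project":  [2, 2, 2, 2, 3, 2],
    "left":            [1, 1, 1, 1, 1, 0],
})

# Employees with 2 projects who left.
leavers_two_projects = toy[(toy.number_project == 2) & (toy.left == 1)]

# Share of that group with an evaluation of 0.62 or below.
share_poor = (leavers_two_projects.last_evaluation <= 0.62).mean()
print(f"{share_poor:.0%} of leavers with 2 projects had a low evaluation")
```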

### Working hours

Again, there are 2 groups of employees. A group that works fewer hours and another that works more hours compared to the average hours worked.

In [None]:
plt.figure(figsize =(7,5))
bins = np.linspace(80,315, 15)
plt.hist(data[data['left']==1]['average_montly_hours'], bins=bins, alpha=1, label='Employees Left')
plt.hist(data[data['left']==0]['average_montly_hours'], bins=bins, alpha = 0.5, label = 'Employee Stayed')
plt.title('Working Hours', fontsize=14)
plt.xlabel('average_montly_hours')
plt.xlim((70,365))
plt.legend(loc='best');

It is clear that employees with 6 or more projects work on average 20% more hours.

In [None]:
# Select the hours column before averaging so the text column department is not aggregated
groupby_number_projects = data.groupby('number_project')['average_montly_hours'].mean()
print(groupby_number_projects)
plt.figure(figsize=(7,5))
groupby_number_projects.plot();
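
To express the group averages as a percentage above or below the company average, each group mean can be divided by the overall mean. A minimal sketch on made-up values (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up monthly hours for employees with 2, 4 and 6 projects.
toy = pd.DataFrame({
    "number_project":       [2, 2, 4, 4, 6, 6],
    "average_montly_hours": [150, 170, 200, 200, 240, 240],
})

# Company-wide average and average per number of projects.
overall_mean = toy["average_montly_hours"].mean()
mean_by_projects = toy.groupby("number_project")["average_montly_hours"].mean()

# Relative difference from the company average (0.20 means 20% more hours).
relative_to_overall = mean_by_projects / overall_mean - 1
print(relative_to_overall)
```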

With the information above, the employees that left the company can be grouped as:

* Employees with 2 projects who worked fewer hours than the company average.
* Employees with 5 or more projects who worked at least 20% more hours than the average.

In [None]:
work_less_hours_left = data[(data.average_montly_hours < 200) & (data.number_project == 2) & (data.left == 1)]
print('work_less_hours_left:',len(work_less_hours_left))

work_more_hours_left = data[(data.average_montly_hours > 240) & (data.number_project >=5 ) & (data.left == 1)]
print('work_more_hours_left:',len(work_more_hours_left))


### Satisfaction Level

It is possible to see 3 interesting peaks in the satisfaction levels of the employees that left the company.

* A peak of employees who are totally disappointed.
* Another peak at 0.4, representing a group with a satisfaction level below the average.
* Another group in the range 0.7 to 0.9, with employees that left despite their high satisfaction.

In [None]:
plt.figure(figsize =(7,5))
bins = np.linspace(0.006,1.000, 15)
plt.hist(data[data['left']==1]['satisfaction_level'], bins=bins, alpha=1, label='Employees Left')
plt.hist(data[data['left']==0]['satisfaction_level'], bins=bins, alpha = 0.5, label = 'Employee Stayed')
plt.title('Employees Satisfaction', fontsize=14)
plt.xlabel('satisfaction_level')
plt.xlim((0,1.05))
plt.legend(loc='best');

#### Average satisfaction for years in the company

In [None]:
# Select the satisfaction column before averaging so the text column department is not aggregated
groupby_time_spend = data.groupby('time_spend_company')['satisfaction_level'].mean()
groupby_time_spend
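
The tenure with the lowest average satisfaction can be picked out of such a grouped result with `idxmin`. The sketch below uses made-up values (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Made-up satisfaction levels for employees with 2 to 5 years in the company.
toy = pd.DataFrame({
    "time_spend_company": [2, 2, 3, 3, 4, 4, 5, 5],
    "satisfaction_level": [0.70, 0.70, 0.60, 0.66, 0.40, 0.50, 0.60, 0.62],
})

# Average satisfaction per number of years in the company.
mean_by_tenure = toy.groupby("time_spend_company")["satisfaction_level"].mean()

# Index label (number of years) of the smallest average.
lowest_tenure = mean_by_tenure.idxmin()
print("lowest average satisfaction at", lowest_tenure, "years")
```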

#### When do the employees become dissatisfied?

The next results clearly show the drop in satisfaction when employees are working on 6 or more projects.

In [None]:
sns.set()
sns.set_context("talk")
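# catplot replaces factorplot; kind='point' keeps the point plot that factorplot drew by default
ax = sns.catplot(x="number_project", y="satisfaction_level", col="time_spend_company", col_wrap=4, kind='point', height=3, color='blue', sharex=False, data=data)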
ax.set_xlabels('Number of Projects');

Let's see why the most valuable employees tend to leave.

The employees who left with high performance, 4 or more years in the company and 5 or more projects:
* Had a low satisfaction level,
* Worked more hours,
* Had not been promoted in the last five years.


In [None]:
func_living = data[(data.last_evaluation >= 0.70) | (data.time_spend_company >=4) | (data.number_project >= 5)]

# numeric_only=True (pandas 1.5+) leaves out the text column department
corr2 = func_living.corr(numeric_only=True)

sns.set(style='white')

mask = np.zeros_like(corr2, dtype=bool)  # np.bool was removed from recent NumPy versions

mask[np.triu_indices_from(mask)] = True

# Insert the graphic
f, ax = plt.subplots(figsize=(13,8))

cmap = sns.diverging_palette(10,220, as_cmap=True)

#Draw heat map mask
ax = sns.heatmap(corr2, mask=mask, cmap=cmap, vmax= .5, annot=True, annot_kws= {'size':11}, square=True, xticklabels=True, yticklabels=True, linewidths=.5, 
           cbar_kws={'shrink': .5}, ax=ax)
ax.set_title('Correlation: Why Valuable Employees Tend to Leave', fontsize=20);
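
Note that the filter in the cell above joins its conditions with `|` (or), so it keeps employees who meet at least one of the criteria. If the intention is to keep only employees who meet all three, the conditions would be joined with `&` (and) instead. The sketch below shows the difference on made-up records (the `toy` frame is an assumption; the thresholds are the ones used above).

```python
import pandas as pd

# Made-up records for illustration.
toy = pd.DataFrame({
    "last_evaluation":    [0.9, 0.9, 0.5, 0.8],
    "time_spend_company": [5, 2, 6, 4],
    "number_project":     [6, 3, 2, 5],
})

high_evaluation = toy.last_evaluation >= 0.70
long_tenure = toy.time_spend_company >= 4
many_projects = toy.number_project >= 5

# At least one criterion holds.
meets_any = toy[high_evaluation | long_tenure | many_projects]

# All three criteria hold.
meets_all = toy[high_evaluation & long_tenure & many_projects]

print(len(meets_any), "rows meet at least one criterion")
print(len(meets_all), "rows meet all criteria")
```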

### Summary of the Exploratory Data Analysis

* It is a relatively young company: most employees have been with the company for 3 or 4 years, and the longest-serving employees have been there for 10 years.
* The biggest difference in salary between those who stayed and those who left was found in the management department. In the other departments, although the salaries of those who stayed are higher on average, the difference is not large.
* About 14% of the employees had a work accident, and only 169 of the employees who left had had one, so accidents don't seem to be related to employees leaving.
* In five years only 2% of the employees were promoted. It is possible that many employees become unmotivated and start planning to leave.
* Employees with 7 or more years in the company did not leave. Employees with 5 years in the company are more likely to leave.
* There are 2 distinct groups of employees that left: one with poor performance and 2 projects, and another with high performance and 5 or more projects. It is not necessary to retain all the employees; the focus is on keeping the employees with high performance.
* The employees with 4 years in the company have the lowest average satisfaction level of the whole company (0.47).
* Satisfaction drops when the employees are working on 5 or more projects. 3 or 4 projects seems to be ideal, regardless of the time spent in the company.
* The employees with 5 or more projects that left also worked at least 20% more hours than the company average.
* The satisfaction levels of the employees that left fall into three groups: totally disappointed, below the average satisfaction, and satisfied.


## Let's connect

**LinkedIn:**<a href="https://www.linkedin.com/in/williamwsilva">https://www.linkedin.com/in/williamwsilva</a>

**This is the first part of the analysis. If you want to leave comments on improvements to this notebook or just talk about data science, I'll be happy to connect with you.**