# Human Resources Study

## Data visualization techniques, Exploratory Data Analysis

### About the data set

The CSV describes a fictitious company. The core data set contains: **names**, **DOBs**, **age**, **gender**, **marital status**, **date of hire**, **reasons for termination**, **department**, whether they are **active or terminated**, **position title**, **pay rate**, **manager name**, **performance score**, **absences**, **most recent performance review date**, and **employee engagement score**.

Many points of interest can be explored through these variables. Good management depends on extracting the relevant information, so it is important for companies to study this kind of dataset and, when needed, make new decisions.
To help our company evolve, data analysis can address many questions:
- Is the performance score unequal between different areas of the company?
- Can we detect any relationship between who a person works for and their performance score?
- What is the overall diversity profile of the organization?
- What are our best recruiting sources if we want to ensure a diverse organization?
- Are there areas of the company where pay is not equitable?
- Can we predict who is going to leave the company and who isn't? What level of accuracy can we achieve on this?
- Which profiles tend to accumulate absences or to be dissatisfied with working in the company?


In [None]:
!pip install seaborn --upgrade

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

First, let's import the csv dataset.

In [None]:
df = pd.read_csv('../input/human-resources-data-set/HRDataset_v14.csv')

It's also important to check the data type of each variable, how many observations we have, and whether there are missing values. Here, no imputation will be needed.

In [None]:
df.info()
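
A per-column count of missing values makes that check explicit. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real CSV, so the column contents are assumptions; the same two lines can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the HR data (values are invented for illustration).
toy = pd.DataFrame({
    "EmpID": [1, 2, 3, 4],
    "ManagerID": [22.0, np.nan, 4.0, 22.0],
    "DateofTermination": [None, "2016-04-01", None, None],
    "Salary": [62000, 58000, 71000, 64500],
})

# Number of missing values per column, keeping only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Columns that are missing by design (for example a termination date for someone still employed) do not call for imputation, which is one way to read the statement above.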

In [None]:
# Print the unique values of each text column, skipping names and dates, which are not informative here
for c in df.columns:
    if df[c].dtype == object and (c not in ('Employee_Name', 'DOB', 'DateofHire', 'DateofTermination', 'LastPerformanceReview_Date', 'ManagerName')):
        print(c, df[c].unique())
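
Some category labels in this dataset carry trailing spaces (the cells further down hard-code `'Production       '` and `"M "`). One way to handle this is to strip whitespace once, up front. The sketch below shows that on a made-up frame; it is not what the rest of this notebook does, and applying it to `df` would mean the hard-coded lists and palettes below need the trimmed labels as well.

```python
import pandas as pd

# Toy frame reproducing the padded labels seen in the real data.
toy = pd.DataFrame({
    "Department": ["Production       ", "IT/IS", "Sales"],
    "Sex": ["M ", "F", "F"],
    "Salary": [62000, 58000, 71000],
})

# Strip leading and trailing whitespace from the text columns.
for col in ["Department", "Sex"]:
    toy[col] = toy[col].str.strip()

print(toy["Department"].unique(), toy["Sex"].unique())
```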

In [None]:
df.head()

# Exploratory Data Analysis

### *Is the performance score unequal between different areas of the company? Can we detect any relationship between who a person works for and their performance score?*

First, let's recall the unique values of the PerformanceScore feature:

In [None]:
df['PerformanceScore'].unique()

How is PerformanceScore distributed in the dataset?

In [None]:
sns.catplot(x='PerformanceScore', data=df, kind="count",height=5, aspect=1.5)

In [None]:
sns.catplot(y='PerformanceScore', hue='Sex', data=df, kind="count", height=5, aspect=1.5)

In [None]:
sns.catplot(y='PerformanceScore', hue='Department', data=df, kind="count", height=5, aspect=1.5)

In [None]:
df['Department'].unique()

In [None]:
perfs = ['Exceeds', 'Needs Improvement', 'PIP', 'Fully Meets']
dps = ['Production       ', 'IT/IS', 'Software Engineering',
       'Admin Offices', 'Executive Office', 'Sales']
palette1 ={"IT/IS": "C0", "Production       ": "C1", "Software Engineering": "C2", "Admin Offices":"C3", "Sales": "C4", "Executive Office":"C5"}
palette2 ={"Exceeds": "C0", "Needs Improvement": "C1", "PIP": "C2", "Fully Meets":"C3"}

The histogram below shows the conditional distribution of Department given PerformanceScore. It gives a sense of the company's overall performance, but since the Production department is much bigger than the other departments, it is more informative to look at the distribution of PerformanceScore given Department, which is what the second histogram does.

In [None]:
plt.figure(figsize=(15, 10))
for ps in perfs:
    sns.histplot(x='PerformanceScore', hue='Department', multiple='stack', shrink=.9, stat='probability',palette=palette1, data=df[df['PerformanceScore']==ps])

In [None]:
plt.figure(figsize=(15, 10))
for dp in dps:
    sns.histplot(x='Department', hue='PerformanceScore', multiple='stack', shrink=.9, stat='probability',palette=palette2, data=df[df['Department']==dp])
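
The two conditional distributions plotted above can also be read as tables. A minimal sketch with `pd.crosstab`, on a made-up frame (the rows of `toy` are invented): `normalize="index"` gives the distribution of PerformanceScore given Department, and `normalize="columns"` gives the distribution of Department given PerformanceScore.

```python
import pandas as pd

# Toy data: 3 Sales employees and 3 IT/IS employees (invented values).
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT/IS", "IT/IS", "IT/IS", "Sales"],
    "PerformanceScore": ["PIP", "Fully Meets", "Exceeds",
                         "Fully Meets", "Fully Meets", "Fully Meets"],
})

# P(PerformanceScore | Department): each row sums to 1.
score_given_dept = pd.crosstab(
    toy["Department"], toy["PerformanceScore"], normalize="index")

# P(Department | PerformanceScore): each column sums to 1.
dept_given_score = pd.crosstab(
    toy["Department"], toy["PerformanceScore"], normalize="columns")

print(score_given_dept.round(2))
print(dept_given_score.round(2))
```

The first table is the one that corrects for department size, since every department is rescaled to sum to 1.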

The Sales department needs to be watched, since it is the department with the largest proportion of employees on a performance improvement plan (PIP). Such plans are set up to address failures to meet specific job goals or to correct behavior-related concerns.

How many managers are there in the company?

In [None]:
print(len(df['ManagerName'].unique()), "unique managers are currently working in the company :", df['ManagerName'].unique())

Since PerformanceScore is a categorical, ordinal feature, it is good practice to encode it numerically for machine learning purposes. The scikit-learn preprocessing module provides a LabelEncoder class for this, but it numbers the categories in sorted (alphabetical) order, so we can also do the encoding manually to decide which numerical value each category gets. Later, we'll use the replace method for that. For the moment, we keep the text labels for plotting.

In [None]:
df_copy = df.copy()
#df_copy['PerformanceScore'].replace({'Exceeds':3, 'Fully Meets':2, 'Needs Improvement':1, 'PIP':0}, inplace=True)

# convert the float ManagerID field to string and remove the .0 at the end of each value
df_copy['ManagerID'] = df['ManagerID'].astype(str)
df_copy['ManagerID'] = df_copy['ManagerID'].apply(lambda x : x.split('.')[0])
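
An alternative to `replace` that keeps the labels for plotting while still carrying the order is an ordered pandas categorical. The sketch below is one possible approach, not the one used later in this notebook; the `toy` frame and the `PerformanceCode` column name are made up, and the order follows the commented-out mapping above (PIP lowest, Exceeds highest).

```python
import pandas as pd

toy = pd.DataFrame({
    "PerformanceScore": ["Fully Meets", "PIP", "Exceeds", "Needs Improvement"],
})

# Explicit order from worst to best, matching the manual mapping 0..3.
order = ["PIP", "Needs Improvement", "Fully Meets", "Exceeds"]
toy["PerformanceScore"] = pd.Categorical(
    toy["PerformanceScore"], categories=order, ordered=True)

# Integer codes follow the position of each label in `order`.
toy["PerformanceCode"] = toy["PerformanceScore"].cat.codes
print(toy)
```

With an ordered categorical, comparisons such as `toy["PerformanceScore"] >= "Fully Meets"` also respect the declared order.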

In [None]:
sns.catplot(y='ManagerID', x='PerformanceScore', kind='swarm', data=df_copy[df_copy['ManagerID']!='nan'], height=10, aspect=1)

In [None]:
# Count of each performance score per manager (returned as a multi-indexed Series)
df_copy.groupby('ManagerID')['PerformanceScore'].value_counts()

In [None]:
ids = ['30', '4', '20', '16', '39', '11', '10', '19', '12', '7', '14', '18', '3', '2', '1', '17', '5', '21', '6', '15', '13', '9', '22']

Plotting the distribution of performance scores for each manager shows how each manager's team performs:

In [None]:
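plt.figure(figsize=(15, 10))
# One stacked bar per manager; filtering per manager normalizes each bar to 1.
# The loop variable is named manager_id to avoid shadowing the built-in id().
for manager_id in ids:
    sns.histplot(x='ManagerID', data=df_copy[df_copy['ManagerID']==manager_id], hue='PerformanceScore', stat='probability', multiple='stack', shrink=0.8, palette=palette2)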

Which managers supervise the most employees in the company?

In [None]:
ManagersIds = df_copy.groupby('ManagerID')['PerformanceScore'].count().sort_values(ascending=False).index

In [None]:
sns.catplot(y='ManagerID', hue='PerformanceScore', kind='count', data=df_copy, order=ManagersIds, palette=palette2, height=10, aspect=1)

Let's order them by decreasing number of good results:

In [None]:
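# Order managers by number of good results: score groups are sorted alphabetically ('Exceeds' first),
# and within each score the managers are sorted by decreasing count; unique() keeps first appearances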
ManagersIds = df_copy.groupby('PerformanceScore')['ManagerID'].value_counts().reset_index(name='count')['ManagerID'].unique()

In [None]:
sns.catplot(y='ManagerID', hue='PerformanceScore', kind='count', data=df_copy, palette=palette2, order=ManagersIds, height=10, aspect=1)
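
The ordering above ranks managers by their number of 'Exceeds' results first. If "good results" is taken to mean 'Exceeds' or 'Fully Meets' (an assumption made here, not something the dataset defines), the order can be computed explicitly. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data: three managers with different mixes of results (invented values).
toy = pd.DataFrame({
    "ManagerID": ["4", "4", "4", "20", "20", "7"],
    "PerformanceScore": ["Exceeds", "Fully Meets", "PIP",
                         "Fully Meets", "Needs Improvement", "PIP"],
})

# Assumption: these two scores count as "good".
good = ["Exceeds", "Fully Meets"]
is_good = toy["PerformanceScore"].isin(good)

# Number of good results per manager, largest first.
good_counts = is_good.groupby(toy["ManagerID"]).sum().sort_values(ascending=False)
manager_order = good_counts.index.tolist()
print(good_counts)
```

`manager_order` can then be passed as the `order` argument of `sns.catplot`, in the same way as `ManagersIds`.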

It's also interesting to see how salaries are distributed across individuals with different performance scores and different managers:

In [None]:
sns.catplot(x='ManagerID', y='Salary', hue='PerformanceScore', kind='box', data=df_copy, height=10, aspect=3)

In [None]:
df_copy.head()

### *What is the overall diversity profile of the organization?*

In [None]:
sns.countplot(y=df['Sex'])

In [None]:
sns.histplot(x='Sex', data=df, stat='probability')

In [None]:
sns.countplot(y=df['Department'], hue=df['Sex'])

In [None]:
sns.catplot(x='MaritalDesc', hue='Department', data=df, kind="count",height=7, aspect=1)

In [None]:
races = df.groupby('RaceDesc')['EmpID'].count().sort_values(ascending=False).index
df.groupby('RaceDesc')['EmpID'].count().sort_values(ascending=False)

In [None]:
sns.catplot(y='RaceDesc', data=df, kind='count', order=races,height=8, aspect=1)

In [None]:
df['Sex'].unique()

In [None]:
palette3 ={"M ": "C0", "F": "C1"}
races = ['White', 'Black or African American', 'Two or more races', 'Asian', 'Hispanic', 'American Indian or Alaska Native']
palette4 ={'White':"C0", 'Black or African American':"C1", 'Two or more races':"C2", 'Asian':"C3", 'Hispanic':"C4", 'American Indian or Alaska Native':"C5"}

In [None]:
plt.figure(figsize=(15, 8))
for r in races:
    sns.histplot(x='RaceDesc', hue="Sex", multiple="stack", data=df[df['RaceDesc']==r], palette=palette3, stat='probability', shrink=.8)

In [None]:
plt.figure(figsize=(15, 8))
sns.histplot(x='Department', hue='RaceDesc', multiple='stack', data=df)

In [None]:
dps = ['Sales', 'IT/IS', 'Software Engineering',
       'Admin Offices', 'Executive Office','Production       ']
plt.figure(figsize=(15, 8))
for d in dps:
    sns.histplot(x='Department', hue='RaceDesc', palette=palette4, stat='probability', multiple='stack', shrink=0.9, data=df[df['Department']==d])

### *What are our best recruiting sources if we want to ensure a diverse organization?*

In [None]:
df['RecruitmentSource'].unique()

In [None]:
sns.catplot(y='RecruitmentSource', kind='count', order=df.groupby('RecruitmentSource')['EmpID'].count().sort_values(ascending=False).index, data=df)

In [None]:
sources = ['Other', 'LinkedIn', 'Google Search', 'Employee Referral','Diversity Job Fair', 'On-line Web application', 'CareerBuilder', 'Website', 'Indeed']

In [None]:
plt.figure(figsize=(20, 8))
for s in sources:
    sns.histplot(x='RecruitmentSource', hue='RaceDesc', data=df[df['RecruitmentSource']==s], stat='probability', palette=palette4,shrink=0.9, multiple='stack')
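
The stacked bars can be complemented with a small summary table per source. The sketch below, on made-up data, reports the number of hires, the number of distinct groups hired, and the share of the largest group; a lower largest share suggests a more mixed intake. These three columns (`hires`, `groups`, `largest_share`) are illustrative choices, not an established diversity metric, and sources with very few hires should be read with caution.

```python
import pandas as pd

# Toy data: 4 hires via LinkedIn, 3 via Indeed (invented values).
toy = pd.DataFrame({
    "RecruitmentSource": ["LinkedIn", "LinkedIn", "LinkedIn", "LinkedIn",
                          "Indeed", "Indeed", "Indeed"],
    "RaceDesc": ["White", "White", "White", "Asian",
                 "White", "Asian", "Hispanic"],
})

# Share of each group within each source (rows sum to 1).
mix = pd.crosstab(toy["RecruitmentSource"], toy["RaceDesc"], normalize="index")

summary = pd.DataFrame({
    "hires": toy["RecruitmentSource"].value_counts(),
    "groups": (mix > 0).sum(axis=1),
    "largest_share": mix.max(axis=1),
}).sort_values("largest_share")
print(summary)
```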

### *Are there areas of the company where pay is not equitable?*

First, let's take a quick look at the distribution of salaries in the company. We can plot it by count:

In [None]:
sns.displot(x=df['Salary'], height=6, aspect=1)

In [None]:
df['Salary'].hist(bins=50,color='darkred',alpha=0.3)

Or by density:

In [None]:
sns.displot(x=df['Salary'], kind='kde', height=6, aspect=1)

Box plots give us information about the quartiles (Q1, Q2 or median, Q3) and outliers:

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x=df['Salary'])
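
The quantities a box plot draws can be computed directly. The sketch below uses a made-up salary series and the common 1.5 × IQR rule for flagging outliers, which is also the default whisker length in seaborn (`whis=1.5`).

```python
import pandas as pd

# Toy salaries with one obviously extreme value (invented numbers).
salaries = pd.Series(
    [45000, 52000, 55000, 58000, 60000, 63000, 66000, 70000, 250000],
    name="Salary",
)

# Quartiles: Q1, median (Q2) and Q3.
q1, median, q3 = salaries.quantile([0.25, 0.5, 0.75])

# Interquartile range and the 1.5 * IQR fences.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the ones drawn individually as outliers.
outliers = salaries[(salaries < lower) | (salaries > upper)]
print(q1, median, q3, outliers.tolist())
```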

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(y=df['Sex'], x=df['Salary']/1000)

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(y=df['Sex'], x=df['Salary']/1000, hue=df['MarriedID'])

In [None]:
sns.catplot(y='Department', x='Salary', data=df, kind="box", height = 6, aspect = 1.5)

Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution:

In [None]:
sns.catplot(y='Department', x='Salary', data=df, kind="violin", height = 6, aspect = 1.5)

In [None]:
sns.catplot(y='Department', x='Salary', col='Sex', data=df, kind="box", height = 10, aspect = 1)
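
To put numbers on what the box plots suggest, the median salary can be tabulated per department and sex. This is a minimal sketch on made-up data: the `gap_M_minus_F` column name is invented here, and the toy frame uses trimmed labels (`"M"`, `"F"`), whereas the real column contains `"M "` with a trailing space. A raw median gap does not control for position or seniority, so it is a starting point rather than evidence of inequity.

```python
import pandas as pd

# Toy data: two departments, two employees per sex in each (invented values).
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales",
                   "IT/IS", "IT/IS", "IT/IS", "IT/IS"],
    "Sex": ["F", "F", "M", "M", "F", "F", "M", "M"],
    "Salary": [60000, 64000, 70000, 74000, 80000, 90000, 82000, 88000],
})

# Median salary for each (Department, Sex) pair, one column per sex.
medians = toy.pivot_table(
    index="Department", columns="Sex", values="Salary", aggfunc="median")

# Difference between the male and female medians within each department.
medians["gap_M_minus_F"] = medians["M"] - medians["F"]
print(medians)
```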