# Human Resources Data Set

**Introduction:**

This data set contains employee information such as names, dates of birth, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score. I am going to explore this data to draw some observations and answer a few questions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
df = pd.read_csv('../input/human-resources-data-set/HRDataset_v14.csv')
df.head()

# Descriptive analysis

# (A) Data Wrangling

Assessment & Cleaning

In [None]:
df.shape[0]

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.describe()
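
The assessment cells above do not count missing values per column directly. A minimal sketch of such a check, run on a small made-up frame standing in for `df` (the values and the name `toy` are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    'Salary': [62506, 104437, np.nan, 64991],
    'TermReason': ['N/A-StillEmployed', None, 'hours', 'career change'],
    'Sex': ['M', 'F', 'F', 'M'],
})

# Count missing values per column, keeping only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data the same two lines could be applied to `df` to decide whether any cleaning is needed before plotting.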

# (B) Data Exploration

**Univariate**

**What is the gender of most employees?**

In [None]:
df['Sex'].value_counts().plot(kind='pie',autopct='%1.1f%%');
plt.title('Employees gender');

**What are the performance scores of the employees?**

In [None]:
bas_color = sb.color_palette()[0]
ordered = df['PerformanceScore'].value_counts().index
sb.countplot(data=df, x= 'PerformanceScore', color=bas_color, order = ordered);
plt.title('Employee Performance');

**How many employees have been terminated, and why?**

In [None]:
plt.figure(figsize = [20, 5])
bas_color = sb.color_palette()[0]
ordered = df['TermReason'].value_counts().index
sb.countplot(data=df, x= 'TermReason', color=bas_color, order = ordered);
plt.title('Termination Reasons');
plt.xticks(rotation=30);

**What is the salary distribution in the company?**

In [None]:
plt.figure(figsize = [20, 5])
bas_color = sb.color_palette()[0]
plt.hist(data=df, x= 'Salary', color=bas_color);
plt.title('Employees Salaries');

**What is the scale of Employees Satisfaction?**

In [None]:
df['EmpSatisfaction'].unique()

In [None]:
bas_color = sb.color_palette()[0]
Orderr = df['EmpSatisfaction'].value_counts().index
sb.countplot(data=df, x='EmpSatisfaction', color =bas_color, order=Orderr);
plt.title('Employees satisfaction rate');

**Bivariate**

**What is the relationship between Recruitment Source & Performance Score?**

In [None]:
plt.figure(figsize = [15, 5])
Ord = df['RecruitmentSource'].value_counts().index
sb.countplot(data= df, x='RecruitmentSource', hue ='PerformanceScore', order=Ord);
plt.xticks(rotation=30);
plt.title('The relationship between Recruitment Source & Performance Score');
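
Raw counts favour the most heavily used sources, so it can help to look at the share of each performance score within a source. A minimal sketch using `pd.crosstab` with `normalize='index'`, on a made-up frame (the rows below are illustrative assumptions, not values from the dataset):

```python
import pandas as pd

# Toy stand-in for df; rows are invented for illustration only.
toy = pd.DataFrame({
    'RecruitmentSource': ['Indeed', 'Indeed', 'LinkedIn',
                          'LinkedIn', 'LinkedIn', 'CareerBuilder'],
    'PerformanceScore': ['Fully Meets', 'Exceeds', 'Fully Meets',
                         'Fully Meets', 'PIP', 'Exceeds'],
})

# Each row sums to 1: the share of every performance score within a source.
shares = pd.crosstab(toy['RecruitmentSource'], toy['PerformanceScore'],
                     normalize='index')
print(shares.round(2))
```

Comparing shares rather than counts keeps a rarely used source from looking weak simply because few people were hired through it.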

**What is the relationship between Employee Satisfaction & Special Projects Count?**

In [None]:
sb.regplot(data = df, x = 'EmpSatisfaction', y = 'SpecialProjectsCount');
tick_x = [0, 1, 2, 3, 4, 5]
plt.xticks(tick_x);
plt.title('Employees Satisfaction & Special Projects Count');

In [None]:
plt.figure(figsize = [10, 3])
Ord = df['EmploymentStatus'].value_counts().index
sb.countplot(data= df, x='EmploymentStatus', hue ='Sex', order=Ord);
plt.title('Employees gender & Employment Status');
plt.legend(['Male', 'Female']);

In [None]:
plt.figure(figsize = [15, 5])
Ord = df['Position'].value_counts().index
sb.countplot(data= df, x='Position', hue ='EmploymentStatus', order=Ord);
plt.title('Position & Employment Status');
plt.xticks(rotation=90);
plt.legend(['Active','Voluntarily Terminated','Terminated for Cause'] ,loc='upper right');

**Multivariate**

**The relationship between DaysLate & Absences & PerformanceScore.**

In [None]:
sb.pairplot(df, x_vars=["DaysLateLast30", "Absences", "PerformanceScore"],
    y_vars=["DaysLateLast30", "Absences","PerformanceScore"], height=2.5, aspect=1.75);

**The departments with the best-performing and most satisfied employees.**

In [None]:
Perfor_Satis = df.query('PerformanceScore == "Exceeds" & EmpSatisfaction == 5')
Perfor_Satis

In [None]:
# Look up the departments of the rows selected above instead of hard-coding their indices
top_Dep = df.loc[Perfor_Satis.index, ['Department']]
top_Dep
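
A list of rows does not show directly which department has the most top performers. A minimal sketch that counts them per department with `value_counts`, on a made-up frame (the rows and the names `toy`, `top` and `dept_counts` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for df; rows are invented for illustration only.
toy = pd.DataFrame({
    'Department': ['Production', 'IT/IS', 'Production',
                   'Sales', 'Production', 'Production'],
    'PerformanceScore': ['Exceeds', 'Exceeds', 'Fully Meets',
                         'Exceeds', 'Exceeds', 'Exceeds'],
    'EmpSatisfaction': [5, 5, 5, 4, 5, 5],
})

# Same filter as in the notebook: top performance and top satisfaction.
top = toy.query('PerformanceScore == "Exceeds" and EmpSatisfaction == 5')

# Number of such employees per department, largest first.
dept_counts = top['Department'].value_counts()
print(dept_counts)
```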

**The departments with the lowest-performing and least satisfied employees.**

In [None]:
lw_Perfor_lwSatis = df.query('PerformanceScore == "PIP" & EmpSatisfaction == 1')
lw_Perfor_lwSatis

In [None]:
# Look up the departments of the rows selected above instead of hard-coding their indices
low_Dep = df.loc[lw_Perfor_lwSatis.index, ['Department']]
low_Dep

**The relationship between Employment Status & Performance Score & Employees Satisfaction.**

In [None]:
# Map the ordinal performance labels to numeric ratings (1 = lowest, 4 = highest)
performance_map = {'Exceeds': 4, 'Fully Meets': 3, 'Needs Improvement': 2, 'PIP': 1}
df['Performance_rate'] = df['PerformanceScore'].map(performance_map)

In [None]:
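def mean_poly(x, y, bins = 10, **kwargs):
    """Plot the mean of y within each bin of x as a line."""
    if isinstance(bins, int):
        bins = np.linspace(x.min(), x.max(), bins+1)
    # Use the bins passed in, not the global bin_edges
    bin_centers = (bins[1:] + bins[:-1]) / 2
    data_bins = pd.cut(x, bins, right = False,
                     include_lowest = True)
    # observed=False keeps empty bins so that means lines up with bin_centers
    means = y.groupby(data_bins, observed=False).mean()
    plt.errorbar(x = bin_centers, y = means, **kwargs)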

bin_edges = np.arange(0.5, df['Performance_rate'].max()+1, 1)
g = sb.FacetGrid(data = df, hue = 'EmploymentStatus', height = 5)
g.map(mean_poly, "Performance_rate", "EmpSatisfaction", bins = bin_edges)
g.set_ylabels('EmpSatisfaction')
plt.title('The relationship between Employment Status & Performance Score & Employees Satisfaction.')
g.add_legend();

**The relationship between Department & Performance Score & Employees Satisfaction.**

In [None]:
# Select the numeric column before averaging; mean() over text columns fails in recent pandas
means = df.groupby(['Department', 'PerformanceScore'])['EmpSatisfaction'].mean()
means = means.reset_index(name = 'EmpSatisfaction_avg')
means = means.pivot(index = 'PerformanceScore', columns = 'Department',
                            values = 'EmpSatisfaction_avg')
sb.heatmap(means, annot = True, fmt = '.3f',
           cbar_kws = {'label' : 'mean(EmpSatisfaction)'});
plt.title('The relationship between Department & Performance Score & Employees Satisfaction.');

# Prediction analysis

In [None]:
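# get_dummies returns its columns in alphabetical order
# ('Active', 'Terminated for Cause', 'Voluntarily Terminated'), so pick the
# target column by name instead of relying on position.
status_dummies = pd.get_dummies(df['EmploymentStatus'], dtype=int)
df['Terminated'] = status_dummies['Terminated for Cause']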

In [None]:
df = df.drop('GenderID', axis = 1)

In [None]:
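# dtype=int gives 0/1 columns, which the regression below needs (recent pandas defaults to bool)
df[['F', 'M']] = pd.get_dummies(df['Sex'], dtype=int)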

In [None]:
import statsmodels.api as sm

In [None]:
df['intercept'] = 1
Reg_model = sm.Logit(df['Terminated'], df[['intercept','MarriedID','F','FromDiversityJobFairID']])
Result = Reg_model.fit()
Result.summary2()

In [None]:
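# Exponentiate the fitted coefficients to get odds ratios. The conclusions use
# 0.3430 (MarriedID), 0.1010 (F) and 1.3002 (FromDiversityJobFairID).
np.exp(Result.params)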
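
Exponentiated logit coefficients are odds ratios, not percentages. A minimal sketch of the conversion, using the three coefficient values behind the figures quoted in the conclusions (0.3430, 0.1010 and 1.3002); the dictionary names are only illustrative:

```python
import numpy as np

# Log-odds coefficients as used in this notebook.
coefs = {'MarriedID': 0.3430, 'F': 0.1010, 'FromDiversityJobFairID': 1.3002}

odds_ratios = {}
pct_change = {}
for name, beta in coefs.items():
    # exp(beta) is the multiplicative change in the odds for a one-unit increase.
    odds_ratios[name] = np.exp(beta)
    # The same effect expressed as a percentage change in the odds.
    pct_change[name] = (odds_ratios[name] - 1) * 100
    print(f'{name}: odds ratio = {odds_ratios[name]:.2f}, '
          f'change in odds = {pct_change[name]:+.0f}%')
```

An odds ratio of about 1.4 therefore means the odds are roughly 40% higher, not 1.4% higher. These are point estimates only; the p-values and confidence intervals in the model summary indicate whether an effect is distinguishable from zero.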

**Conclusions:**

- The dataset contains 311 employees: 56.6% are female and 43.4% are male.
- The performance evaluation shows that most of the employees fully meet the requirements and there is a good percentage of those who exceed these requirements or expectations.
- The most common reasons for leaving, whether by resignation or termination, are another opportunity, dissatisfaction, better offers, and the attendance and working-hours policy.
- Most salaries fall between 4.5k and 6k.
- Most employees have an average or above average satisfaction rate.
- The most used recruitment sources are Indeed and LinkedIn, and employees hired through them have good performance scores. Two other sources, the online web application and CareerBuilder, are the best in terms of employee performance, although they are used far less.
- The relationship between Employees Satisfaction & Special Projects Count is very slightly positive.
- The percentage of employees with both very low performance and very low satisfaction is very small.
- According to the logistic regression, married employees have about 1.4 times the odds of being terminated for cause, females about 1.1 times the odds of males, and employees hired through the Diversity Job Fair about 3.7 times the odds.
