### STEPS PERFORMED IN THIS ANALYSIS:
 
 1.) Read in the data and review the fields and values.
 
 2.) Use Pivot Tables to analyze the key turnover variable, "left", against the other variables in the data.

 3.) Create different DataFrames for employees who stayed and who left for analysis.
 
 4.) Create Histograms of values to visualize the differences in staying/leaving in key variables.
 
 5.) Convert the salary text field with "low"/"medium"/"high" values to numeric for model analysis.
 
 6.) Convert the Categorical/Text Field 'Department' into separate fields using the get_dummies command.
 
 7.) Separate the feature variables from the prediction variable.
 
 8.) Run train_test_split command to prepare data for training and testing.
 
 9.) Run Logistic Regression, SVC, and Random Forest models for prediction.
 
 10.) Evaluate Results - Random Forest is the best of the three models for this data, with a test accuracy of about 99%.



In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

In [None]:
# Create HR file DataFrame, Look at Header Info:
df_hr = pd.read_csv('/kaggle/input/hr-analytics/HR_comma_sep.csv')
df_hr.head(5)

In [None]:
# Look at Features, Types, and N/A Values if any:
df_hr.info()

## Create Pivot Tables to Summarize Turnover ("left") Data by Variable:

### Categorical Variables:

In [None]:
# Time Spent With Company:
print(pd.pivot_table(df_hr, index = 'left', columns = 'time_spend_company', values = 'Department' ,aggfunc ='count'))

In [None]:
# Turnover by Department:
print(pd.pivot_table(df_hr, index = 'left', columns = 'Department', values = 'salary' ,aggfunc ='count'))

In [None]:
# Turnover by Salary Level:
print(pd.pivot_table(df_hr, index = 'left', columns = 'salary', values = 'Department' ,aggfunc ='count'))

In [None]:
# Turnover by whether or not someone had an accident:
print(pd.pivot_table(df_hr, index = 'left', columns = 'Work_accident', values = 'Department' ,aggfunc ='count'))

In [None]:
# Turnover by Promotion in the last 5 years:
print(pd.pivot_table(df_hr, index = 'left', columns = 'promotion_last_5years', values = 'Department' ,aggfunc ='count'))
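
The pivot tables above show raw counts, which are hard to compare when the groups differ in size. One possible complement is `pd.crosstab` with `normalize="columns"`, which turns each column into shares that stayed or left. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not the HR data); the same call should work on `df_hr['left']` and any categorical column.

```python
import pandas as pd

# Toy stand-in for df_hr; the values are made up for illustration only.
toy = pd.DataFrame({
    "left":   [0, 0, 0, 1, 1, 0, 1, 0],
    "salary": ["low", "low", "medium", "low", "medium", "high", "low", "high"],
})

# Raw counts, comparable to the count-based pivot tables above.
counts = pd.crosstab(toy["left"], toy["salary"])

# Share of each salary level that stayed (0) or left (1); each column sums to 1.
rates = pd.crosstab(toy["left"], toy["salary"], normalize="columns")

print(counts)
print(rates)
```

Reading the `left == 1` row of `rates` gives a turnover rate per category, which is easier to compare across departments or salary levels than the counts alone.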

### Continuous Variables:

In [None]:
print(pd.pivot_table(df_hr, index = 'left', values = ['number_project','average_montly_hours','last_evaluation','satisfaction_level']))

In [None]:
# Create separate DataFrames for employees who "left" and those who "stayed"
df_left = df_hr[df_hr.left == 1]
df_stay = df_hr[df_hr.left == 0]

In [None]:
df_left.info()

In [None]:
df_stay.info()

## Create Histograms to See Value Ranges of Key Fields

In [None]:
plt.xlabel('Job Satisfaction Value')
plt.ylabel('Number of Employees')
plt.title('REPORTED JOB SATISFACTION RANKINGS - EMPLOYEES WHO LEFT')
plt.hist(df_left.satisfaction_level)

In [None]:
plt.xlabel('Job Satisfaction Value')
plt.ylabel('Number of Employees')
plt.title('REPORTED JOB SATISFACTION RANKINGS - EMPLOYEES WHO STAYED')
plt.hist(df_stay.satisfaction_level,color='red')

## Convert Text Salary Field to Numeric: the field is ordinal, so the salary levels are converted to numeric values.

In [None]:
# Add a Numeric Field for Salary
df_hr['salary_num'] = 0

In [None]:
# Assign Numeric Values for Salary Levels
df_hr.loc[df_hr['salary'] == 'low', 'salary_num'] = 1
df_hr.loc[df_hr['salary'] == 'medium', 'salary_num'] = 2
df_hr.loc[df_hr['salary'] == 'high', 'salary_num'] = 3
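
An alternative to the three `.loc` assignments is a single `Series.map` call with a dictionary. The sketch below runs on a made-up frame named `toy`, and the dictionary name `salary_order` is also an assumption; the level-to-number mapping matches the one used above.

```python
import pandas as pd

# Toy stand-in for the salary column; values are illustrative only.
toy = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Ordinal encoding: the same low/medium/high -> 1/2/3 mapping as above.
salary_order = {"low": 1, "medium": 2, "high": 3}
toy["salary_num"] = toy["salary"].map(salary_order)

# Labels missing from the dictionary would become NaN, so this check
# flags unexpected salary values instead of silently leaving a default.
assert toy["salary_num"].notna().all()
print(toy)
```

One design difference: initializing the column to 0 and filling it with `.loc` leaves any unexpected label at 0, while `map` marks it as missing.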

In [None]:
df_hr.head(5)

In [None]:
# Confirm Turnover by Salary Number is the same as by Salary Level:
print(pd.pivot_table(df_hr, index = 'left', columns = 'salary_num', values = 'Department' ,aggfunc ='count'))

In [None]:
# Turnover by Salary Level - Same Values:
print(pd.pivot_table(df_hr, index = 'left', columns = 'salary', values = 'Department' ,aggfunc ='count'))

In [None]:
#  Remove Salary Text Field
df_hr = df_hr.drop('salary', axis=1)
df_hr.head(10)

## Convert Categorical/Text Field 'Department' into separate fields using the get_dummies Command

In [None]:
df_dum = pd.get_dummies(df_hr.Department)

In [None]:
df_dum

In [None]:
df_merged = pd.concat([df_hr,df_dum],axis='columns')

In [None]:
df_merged

In [None]:
# Drop the original 'Department' text field and one dummy column ('technical'), which is redundant given the other dummies
df_final = df_merged.drop(['Department','technical'],axis='columns')
df_final
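
The create, concatenate, and drop steps above can also be collapsed into one `pd.get_dummies` call using its `columns` and `drop_first` parameters. This is a sketch on a made-up frame named `toy`, not the HR data. Note that `drop_first=True` drops the first category in sorted order rather than 'technical', so the reference department differs from the cells above, and the new columns carry a `Department_` prefix.

```python
import pandas as pd

# Toy stand-in with a few department labels; values are illustrative only.
toy = pd.DataFrame({
    "Department": ["sales", "technical", "IT", "sales"],
    "left": [1, 0, 0, 1],
})

# Encode 'Department' in place, dropping one dummy column to avoid redundancy.
encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Either approach leaves one fewer dummy column than there are departments; which category is dropped only changes what the remaining columns are measured against.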

## Create Separate DataFrames for Dependent and Independent Variables

In [None]:
df_features = df_final.drop('left', axis=1)

In [None]:
df_features.head(5)

In [None]:
df_dependant = df_final.left

In [None]:
df_dependant.head(6000)

## Create Training and Testing Datasets - Run and Score ML Models: Logistic, SVC, and Random Forest

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows for testing; the remaining 75% are used for training
X_train, X_test, y_train, y_test = train_test_split(df_features, df_dependant, test_size=0.25)
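
Because `train_test_split` shuffles randomly, the scores further down can change slightly from run to run, and the share of leavers in the test set can drift from the share in the full data. A possible refinement, sketched below on made-up data, is to pass `stratify` and `random_state`. The names `X`, `y`, and the seed value 42 are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% of which belong to the positive class.
X = pd.DataFrame({"x": range(100)})
y = pd.Series([1] * 20 + [0] * 80)

# stratify=y keeps the class ratio equal in both splits;
# random_state makes the shuffle repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

On the notebook's data the equivalent call would pass `stratify=df_dependant`.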

In [None]:
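from sklearn.linear_model import LogisticRegression
# max_iter is raised from the default of 100 because the features are unscaled,
# which can keep the solver from converging within the default limit
model = LogisticRegression(max_iter=1000)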

In [None]:
model.fit(X_train,y_train)

In [None]:
model.predict(X_test)

In [None]:
y_test

In [None]:
model.score(X_test, y_test)
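
`model.score` reports overall accuracy, which can look reasonable even when a model misses many of the employees who left. A confusion matrix and the recall for the "left" class give a fuller picture. The sketch below uses synthetic data from `make_classification` with an assumed 75/25 class balance; the variable names are hypothetical, and on the notebook's data the same metric calls would take `y_test` and `model.predict(X_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the HR data (illustrative only).
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.75, 0.25], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred)
# Share of true positives (class 1) that the model identified.
leaver_recall = recall_score(y_te, pred)

print(cm)
print(leaver_recall)
```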

In [None]:
from sklearn.svm import SVC
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)
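
An RBF-kernel `SVC` is sensitive to feature scale, and the HR features mix small ranges (such as `satisfaction_level`) with much larger ones (such as `average_montly_hours`). One common option is to standardize the features inside a pipeline before fitting. The sketch below uses synthetic data with one column deliberately inflated; it does not claim any particular score on the HR data, and the name `scaled_svm` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; one feature is inflated to mimic mixed scales.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] *= 200
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The scaler is fit on the training split only, then applied to the test split.
scaled_svm = make_pipeline(StandardScaler(), SVC(gamma="auto"))
scaled_svm.fit(X_tr, y_tr)
scaled_score = scaled_svm.score(X_te, y_te)

print(scaled_score)
```

Wrapping the scaler in the pipeline avoids leaking test-set statistics into training.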

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
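
Each score above comes from a single random split, so small differences between models may partly reflect which rows landed in the test set. Cross-validation averages over several splits and is one way to make the comparison more stable. The sketch below uses synthetic data and compares two of the model types; the names `candidates` and `cv_means` are hypothetical, and the numbers it prints say nothing about the HR data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the HR data (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=0
)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=40, random_state=0),
}

# Mean 5-fold accuracy per model, with the spread across folds.
cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On the notebook's data, `X` and `y` would be `df_features` and `df_dependant`.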

# Conclusion: Random Forest is the Best of the Three Models, with a Test Accuracy of About 99%.