<a href="https://colab.research.google.com/github/sabudev/CAIF/blob/main/CAIF_Module_III_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stay, Don't Leave! 
## Predicting employee churn at a hi-tech startup

Human capital is arguably the most important asset of any company, especially as the world moves towards a knowledge economy. The cost of employee departures is typically high: departing employees take significant tribal knowledge with them, and employees with niche skills are often difficult to replace. Employee churn not only reduces productivity and delays timelines, but may also damage the company's brand image. This case study examines why employees are leaving this startup and builds a predictive model to estimate each employee's probability of leaving, so that high-risk employees can be identified. Once the HR department knows who might leave the company, it can take steps to retain the employees it wants to keep.



This case focuses on:
1. Exploratory data analysis
2. Classification
3. Model evaluation

**Load the Data files**

In [None]:
# Remove any stale copy first so that the clone below starts from a clean state
! rm -rf EP290

In [None]:
! git clone https://github.com/vibsabhishek/EP290.git

## Load the important libraries for EDA

In [None]:
#Import modules
import pandas  # for dataframes
import numpy as np
import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
from scipy.stats import gaussian_kde

## Load and explore the dataset

In [None]:
hr_data=pandas.read_csv('EP290/HR_tech_startup.csv')

In [None]:
hr_data.head()

**Q1. Generate the summary statistics for the data**


In [None]:
#Fill code here

In [None]:
hr_data.describe()

**Q2. Generate the correlation between different features in the data**

In [None]:
#Fill code here

In [None]:
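# numeric_only=True skips text columns such as salary, which would otherwise raise an error in recent pandas versions
hr_data.corr(numeric_only=True)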
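
When reading the correlation matrix with churn in mind, it can help to pull out only the column for `left` and sort it. The sketch below does this on a small synthetic stand-in frame called `demo` (a hypothetical name; the values are generated for illustration and are not taken from `HR_tech_startup.csv`). With the real data, `demo` would be replaced by `hr_data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for hr_data (illustrative values only)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n),
    "average_montly_hours": rng.integers(96, 310, n),
    "salary": rng.choice(["low", "medium", "high"], n),
})
# Assumption for the demo: low satisfaction makes leaving more likely
demo["left"] = (demo["satisfaction_level"] + rng.normal(0, 0.2, n) < 0.45).astype(int)

# Correlation of every numeric feature with the outcome, sorted ascending
corr_with_left = (
    demo.corr(numeric_only=True)["left"]
    .drop("left")
    .sort_values()
)
print(corr_with_left)
```

Features at the top and bottom of the sorted list are the ones most strongly (linearly) associated with leaving.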

## Visualize the data

### Histogram for number of projects by different employees

In [None]:
num_projects=hr_data.groupby('number_project').count()
plt.bar(num_projects.index.values, num_projects['satisfaction_level'])
plt.xlabel('Number of Projects')
plt.ylabel('Number of Employees')
plt.show()

### Histogram for salary levels

In [None]:
salary = hr_data['salary'].value_counts()
salary.plot(kind='barh')
plt.xlabel('Number of Employees')
plt.ylabel('Salary')
plt.show()

**Q3. Plot the distribution of last evaluation**

In [None]:
#Fill code here

In [None]:
hr_data["last_evaluation"].plot.hist()

**Q4. What department is the largest based on employee strength?**

In [None]:
#Fill code here

In [None]:
department_data = hr_data["Departments"].value_counts()
department_data.plot(kind='barh')
plt.xlabel('Number of Employees')
plt.ylabel('Department')
plt.show()

Which department?

## Understanding why employees are leaving the company

**Q5. Plot the distribution of the working hours.**



In [None]:
#Fill code here

In [None]:
hr_data["average_montly_hours"].plot.hist()


What does this histogram tell you? - Write your answer here. 

**Q6. Are the working hours correlated with job satisfaction and the proclivity to leave? What does this indicate?**

In [None]:
#Fill code here

In [None]:
hr_data["average_montly_hours"].corr(hr_data["satisfaction_level"])

In [None]:
hr_data["average_montly_hours"].corr(hr_data["left"])

What does the previous analysis tell you?

**Q7. Show these relationships visually.**

In [None]:
#Fill code here that shows the relationship between working hours and job satisfaction 

In [None]:
plt.scatter(hr_data["average_montly_hours"], hr_data["satisfaction_level"])

In [None]:
# Recent seaborn versions require x, y and hue to be passed as keyword arguments
sns.scatterplot(data=hr_data, x="average_montly_hours", y="satisfaction_level", hue="left")

In [None]:
# color by density

x = hr_data["average_montly_hours"]
y = hr_data["satisfaction_level"]

# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)

# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]

fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50)
plt.show()


In [None]:
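# Same scatter plot, coloured by department (column name as used in the rest of this notebook)
sns.scatterplot(data=hr_data, x="average_montly_hours", y="satisfaction_level", hue="Departments")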


**Q8. Perform analysis to determine who is leaving and why?** 

Are there distinct employee segments that are inclined to leave the company? Are the reasons for their departures similar? What insights can you draw from the exploratory analysis? (Hint: see how the `left` variable is associated with other features.)
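
One possible way to follow the hint is to compare the average feature values of stayers and leavers, and to look at the share of leavers within each category. The sketch below uses a tiny hand-made frame named `demo` (hypothetical values, not the real dataset); the same calls would apply to `hr_data`.

```python
import pandas as pd

# Tiny hand-made stand-in for hr_data (illustrative values only)
demo = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.90, 0.41, 0.65, 0.10],
    "number_project": [2, 5, 7, 4, 3, 2, 4, 6],
    "average_montly_hours": [157, 262, 272, 223, 180, 153, 200, 290],
    "salary": ["low", "medium", "medium", "low", "high", "low", "high", "low"],
    "left": [1, 0, 1, 0, 0, 1, 0, 1],
})

# Average of each numeric feature for stayers (0) and leavers (1)
profile = demo.groupby("left").mean(numeric_only=True)
print(profile)

# Share of stayers and leavers within each salary band (each row sums to 1)
leave_rate = pd.crosstab(demo["salary"], demo["left"], normalize="index")
print(leave_rate)
```

Large gaps between the two rows of `profile`, or between the rows of `leave_rate`, point to features worth plotting in more detail.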

# Machine Learning - Classification

## Import ML libraries
sklearn is the workhorse ML package in Python

In [None]:
# Import the preprocessing module, which provides LabelEncoder for converting categorical variables into integer codes
from sklearn import preprocessing as prep

#Import train_test_split function
from sklearn.model_selection import train_test_split

#Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, SVR  # SVC is the support vector classifier; SVR is its regression counterpart
from sklearn.neural_network import MLPClassifier

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

In [None]:
#creating labelEncoder
le = prep.LabelEncoder()

# Converting categorical variables into numeric variables
# (LabelEncoder assigns integer codes in sorted order of the labels)
hr_data['salary']=le.fit_transform(hr_data['salary'])
hr_data['Departments']=le.fit_transform(hr_data['Departments'])
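
Because `LabelEncoder` sorts the labels before numbering them, the resulting codes do not necessarily follow the natural order of a variable such as salary. The sketch below illustrates this on a short series and shows an explicit mapping as one alternative. The level names `low`, `medium` and `high` and the dictionary `salary_order` are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed salary levels, for illustration only
salary = pd.Series(["low", "medium", "high", "low"])

# LabelEncoder numbers the labels in sorted (alphabetical) order
le = LabelEncoder()
codes = le.fit_transform(salary)
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # high=0, low=1, medium=2

# Alternative: an explicit mapping that preserves the pay order
salary_order = {"low": 0, "medium": 1, "high": 2}
ordinal = salary.map(salary_order)
print(ordinal.tolist())
```

Tree-based models are largely indifferent to this ordering, whereas linear models such as logistic regression treat the codes as magnitudes, so the choice of encoding can affect them.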

### Split the data into predictors and outcome

In [None]:
# Names of the predictor columns
X_labels = ['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'promotion_last_5years', 'Departments', 'salary']
X=hr_data[X_labels]  # predictors
y=hr_data['left']    # outcome

### Split dataset into training set and test set

Note that here we are choosing 70% of the data to train the model and 30% of the data to test the model. Later on, you will see how changing this parameter affects the model performance.
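
As a quick sanity check of what a 70/30 split produces, the sketch below splits a synthetic array and prints the resulting sizes and class shares. The `stratify` argument is an optional addition that is not used in this notebook; it keeps the proportion of leavers roughly equal in both parts. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 25% of them labelled as leavers
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 25 + [0] * 75)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))      # 70 30
print(y_tr.mean(), y_te.mean())  # both close to 0.25
```
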

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training and 30% test

### Build a Logistic Regression model

In [None]:
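#Create Logistic Regression Classifier
# (max_iter is raised from the default of 100 so the solver can converge on unscaled features)
lr = LogisticRegression(max_iter=1000)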

#Train the LR model 
lr.fit(X_train, y_train)

#Predict the outcome for Test data
lr_y_pred = lr.predict(X_test)

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, lr_y_pred))

# Model Precision
print("Precision:",metrics.precision_score(y_test, lr_y_pred))

# Model Recall
print("Recall:",metrics.recall_score(y_test, lr_y_pred))

# Confusion matrix
print("Confusion matrix:\n", metrics.confusion_matrix(y_test, lr_y_pred))

**Q9. What is the accuracy of the Logistic Regression model? How many times does it incorrectly predict that someone will leave when they actually stay?**

**Fill answer here**
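
The count asked for here is the number of false positives, which can be read directly off the confusion matrix. The sketch below shows how the four cells map to counts, using hand-made labels (`y_true` and `y_pred` are hypothetical and unrelated to the model's actual predictions).

```python
from sklearn import metrics

# Hand-made labels for illustration: 1 = left, 0 = stayed
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

print("False positives (predicted to leave, but stayed):", fp)
print("False negatives (predicted to stay, but left):", fn)
print("Precision:", tp / (tp + fp))
print("Recall:", tp / (tp + fn))
```

With the notebook's variables, `y_true` and `y_pred` would be `y_test` and `lr_y_pred`.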

**Q10. Vary the test_size parameter in the train_test_split function and see how the accuracy changes. Plot the accuracy against this parameter.**

In [None]:
#Fill code here
test_sizes = []  # fill in the test_size values you tried, separated by commas
accuracies = []  # fill in the accuracy you got for each value, in the same order

plt.plot(test_sizes, accuracies)

In [None]:
#Fill code here
train_sizes = [.5, .6, .7, .8, .9]
accuracies = [0.60, 0.79, 0.85, 0.75, 0.70]

plt.plot(train_sizes, accuracies)

plt.xlabel("Training Size")
plt.ylabel("Accuracy")

plt.show()
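
Instead of typing the accuracies in by hand, the values can also be collected in a loop. The following is a minimal sketch under the assumption that a synthetic dataset from `make_classification` stands in for the HR data; `X_demo`, `y_demo` and the chosen list of test sizes are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic stand-in for X and y (not the HR data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

test_sizes = [0.1, 0.2, 0.3, 0.4, 0.5]
accuracies = []
for size in test_sizes:
    # Re-split, re-train and re-score for every test_size value
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=size, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(metrics.accuracy_score(y_te, model.predict(X_te)))

for size, acc in zip(test_sizes, accuracies):
    print(f"test_size={size:.1f}  accuracy={acc:.3f}")
```

The two lists can then be passed to `plt.plot(test_sizes, accuracies)` as in the cells above.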


In [None]:
#Create Decision Tree Classifier
dt = DecisionTreeClassifier()

#Train the DT model 
dt.fit(X_train, y_train)

#Predict the outcome for Test data
dt_y_pred = dt.predict(X_test)

**Q11. Compare the Decision Tree Classifier with the Logistic Regression (use a 70/30 split for training/test)**

In [None]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, dt_y_pred))

# Model Precision
print("Precision:",metrics.precision_score(y_test, dt_y_pred))

# Model Recall
print("Recall:",metrics.recall_score(y_test, dt_y_pred))

# Confusion matrix
print("Confusion matrix:\n", metrics.confusion_matrix(y_test, dt_y_pred))
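
Beyond accuracy, a fitted decision tree also reports how much each feature contributed to its splits, which can help connect the model back to the question of why employees leave. The sketch below shows the idea on synthetic data; the column names `f0` to `f3` are placeholders, not the HR features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; the column names are placeholders
labels = ["f0", "f1", "f2", "f3"]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=labels)

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# feature_importances_ sums to 1; a larger value means the feature
# accounted for more of the total impurity reduction in the tree
importances = pd.Series(tree.feature_importances_, index=labels)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's variables, the equivalent would use `dt.feature_importances_` together with `X_labels`.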

**Q12. Compare different models such as Naive Bayes (GaussianNB), SVM, and MLPClassifier. Which model performs best on this data set?**

In [None]:
#Fill code here

nb = GaussianNB()

#Fill in code for training, testing and measuring accuracy

# SVR is a regression model; classification needs the support vector classifier SVC
from sklearn.svm import SVC
svm = SVC()

#Fill in code for training, testing and measuring accuracy

nn = MLPClassifier()

#Fill in code for training, testing and measuring accuracy
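
One way to keep such a comparison tidy is to put the models in a dictionary and loop over them, so each one is trained and scored on the same split. This is only a sketch: the data comes from `make_classification` rather than the HR file, and the `max_iter` and `random_state` settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

# Synthetic stand-in for X and y (not the HR data)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

models = {
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
    "MLPClassifier": MLPClassifier(max_iter=1000, random_state=42),
}

# Train every model on the same training set and score it on the same test set
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = metrics.accuracy_score(y_te, model.predict(X_te))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

SVM and neural network models are sensitive to feature scale, so on the real data it may be worth standardizing the predictors before comparing them.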


