# Project Name: Analysis of Why Employees Leave Their Jobs

**Education:** Education level (Bachelors, Masters, and PhD)

**JoiningYear:** Year the employee started work

**City:** Hometowns of Employees

**PaymentTier:** Salary Tier

**Age:** Ages of Employees

**Gender:** Gender (Male, Female)

**EverBenched:** Indicates whether an employee has ever been given a temporary assignment outside his/her regular responsibilities (Yes or No)

**ExperienceInCurrentDomain:** Experience in the company (in years)

**LeaveOrNot:** Whether the employee left the job (1: left, 0: did not leave)


# Imported Libraries



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import pylab

#Convert Categorical to Numerical Value
from sklearn.preprocessing import LabelEncoder 
#Splitting Data
from sklearn.model_selection import train_test_split,cross_val_score
#Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
#Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,ConfusionMatrixDisplay

# Load Dataset

In [None]:
df = pd.read_csv("/kaggle/input/indian-employee-dataset/Employee.csv")
df.head(10)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

# Exploratory Data Analysis

## Cleaning Data

***Are there any null values in any column in the data?***

In [None]:
df.isna().sum()

***Are there any duplicate rows in the data? Delete these lines.***

In [None]:
df.duplicated().sum()

***There are duplicate rows. They do not affect the exploratory analysis, but I decided not to delete them because removing them changes the model results.***
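
A minimal sketch of how duplicate rows could be counted and, if desired, dropped. The toy `demo` frame is made up for illustration and is not the employee data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the employee data.
demo = pd.DataFrame({
    "City": ["Pune", "Pune", "Bangalore", "Pune"],
    "Age": [25, 25, 30, 25],
    "LeaveOrNot": [1, 1, 0, 1],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(demo.duplicated().sum())

# drop_duplicates() keeps only the first occurrence of each row.
deduplicated = demo.drop_duplicates()

print(n_duplicates, demo.shape, deduplicated.shape)
```

Comparing the two shapes shows how many rows a model would lose if the duplicates were removed.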

In [None]:
df.shape

**Are there outlier values in any column?**

In [None]:
num_cols = df.select_dtypes(include=['int64', 'float64'])
plt.figure(figsize=(20,10))
sns.boxplot(data=num_cols)
plt.show()

In [None]:
df['PaymentTier'].value_counts()
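
The boxplot flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be sketched numerically. `count_iqr_outliers` is a hypothetical helper and both sample series are made up; for a discrete column such as a tier, rarely used values can be flagged even though they are valid categories.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())


# Made-up values: one extreme age, and a tier column dominated by tier 3.
ages = pd.Series([24, 25, 26, 27, 28, 29, 30, 60])
tiers = pd.Series([3, 3, 3, 3, 3, 3, 2, 1])

print(count_iqr_outliers(ages), count_iqr_outliers(tiers))
```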

## Univariate Analysis

In [None]:
df['LeaveOrNot'].value_counts()

In [None]:
sns.countplot(data=df, x=df['LeaveOrNot'], color="pink")
plt.show()

### Categorical Data

In [None]:
sns.countplot(x=df['Education'], hue='LeaveOrNot', data=df, palette='Blues')
plt.xlabel('Education', fontsize=10)
plt.show()

***Almost half of the employees with a master's degree have left their jobs, while the majority of employees with a bachelor's degree or a doctorate continue to work at the company.***

In [None]:
sns.countplot(x= df['City'], hue='LeaveOrNot', data=df ,palette='Blues')
plt.xlabel('City', fontsize=10)
plt.show()

***While half of the employees from Pune City have left their jobs, most of the employees from Bangalore and New Delhi have remained with the company.***

In [None]:
sns.countplot(x= df['Gender'], hue='LeaveOrNot', data=df ,palette='Blues')
plt.xlabel('Gender', fontsize=10)
plt.show()

***While only about half of the female employees continue to work at the company, the majority of male employees have decided to stay.***

In [None]:
sns.countplot(x= df['EverBenched'], hue='LeaveOrNot', data=df ,palette='Blues')
plt.xlabel('EverBenched', fontsize=10)
plt.show()

**The number of employees taking on extra duties is small, but we can say that the EverBenched feature has an impact on employee turnover. Almost half of the employees who took on extra duties left their jobs.**
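
The proportions read off the count plots above can also be computed directly. This is a minimal sketch on a made-up `demo` frame: because `LeaveOrNot` is coded 0/1, its mean within each group is the share of employees who left.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset.
demo = pd.DataFrame({
    "EverBenched": ["No", "No", "No", "No", "Yes", "Yes"],
    "LeaveOrNot": [0, 0, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = fraction of that group that left.
leave_rate = demo.groupby("EverBenched")["LeaveOrNot"].mean()
print(leave_rate)
```

The same pattern applies to any categorical column, such as Education, City, or Gender.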

### Numerical Data

In [None]:
df['JoiningYear'].value_counts()

In [None]:
sns.set(rc={"figure.figsize":(6,4)})
sns.histplot(data=df, x= df['JoiningYear'],hue= 'LeaveOrNot', kde=True, color="red",multiple="stack")

***Here we see that the people who left their jobs mostly joined in 2015, 2017, and 2018.
However, comparing within each joining year, most of the people who started working in 2017 did not leave their jobs, whereas almost all of the employees who started in 2018 left.
We can say that the years with the lowest leaving rate were 2012 and 2016.***

In [None]:
sns.set(rc={"figure.figsize":(4,4)})
sns.histplot(data=df, x= df['PaymentTier'],hue= 'LeaveOrNot', kde=False,bins=3,multiple="stack")

***Here we can see that more people stay at the company when the PaymentTier is 3, which is the bar spanning roughly 2.30-3.00 on the histogram axis.***
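
The 2.30-3.00 range comes from the histogram binning rather than from the data. A small sketch, assuming PaymentTier takes only the values 1, 2, and 3: splitting the range 1 to 3 into three equal-width bins gives edges near 1.67 and 2.33, so the last bar contains only tier 3.

```python
import numpy as np

# Assumption: the tiers are the integers 1, 2 and 3.
tiers = np.array([1, 2, 3])

# Equal-width bin edges over the data range, as used for bins=3.
edges = np.histogram_bin_edges(tiers, bins=3)
print(edges)
```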

In [None]:
sns.set(rc={"figure.figsize":(8,4)})
sns.histplot(data=df, x= df['Age'],hue= 'LeaveOrNot', kde=True,bins=20,multiple="stack")

***According to this graph, when employees are compared with peers of the same age, people between the ages of 24-27 are more likely to change jobs (>50%), while in the remaining age groups more people prefer to stay in their jobs.***
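
One way to check an age-band claim like this is to bin ages with `pd.cut` and take the leave rate per band. This is a sketch on made-up data; the band edges and labels are assumptions chosen to isolate the 24-27 group.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset.
demo = pd.DataFrame({
    "Age": [22, 24, 25, 26, 27, 30, 33, 36],
    "LeaveOrNot": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Right-inclusive bins: (0, 23], (23, 27], (27, 100].
age_band = pd.cut(demo["Age"], bins=[0, 23, 27, 100],
                  labels=["<=23", "24-27", "28+"])

# Share of employees who left within each age band.
rate_by_age = demo.groupby(age_band, observed=True)["LeaveOrNot"].mean()
print(rate_by_age)
```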

In [None]:
sns.histplot(data=df, x= df['ExperienceInCurrentDomain'],hue= 'LeaveOrNot', kde=True,bins=7,multiple="stack")

In [None]:
pd.crosstab(df['ExperienceInCurrentDomain'],df['LeaveOrNot'])

***For employees with 0-3 years of experience, the leaving rate is between 35-38% for each year of experience. The leaving rate is 31% for employees with 4-5 years of experience, and 20% and 25% for employees with 6 and 7 years of experience, respectively. We can say that employee loyalty increases somewhat as the time spent at the company increases.***
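
The percentages above can be produced directly from the cross-tabulation by normalizing each row. A minimal sketch on made-up data, using the `normalize="index"` option of `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset.
demo = pd.DataFrame({
    "ExperienceInCurrentDomain": [0, 0, 0, 0, 5, 5, 5, 5, 5],
    "LeaveOrNot":                [1, 1, 0, 0, 1, 0, 0, 0, 0],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the column labelled 1 is the leaving rate for that experience level.
rates = pd.crosstab(demo["ExperienceInCurrentDomain"],
                    demo["LeaveOrNot"], normalize="index")
print(rates)
```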

## Bivariate Analysis

In [None]:
# This plot is motivated by the heat map review further below.

plt.figure(figsize=(8, 6))
sns.barplot(x='Gender', y='PaymentTier', data=df,palette='Blues_r')
plt.title('PaymentTier by Gender Plot')
plt.xlabel('Gender')
plt.ylabel('PaymentTier')
plt.show()

***Considering the observations from the heat map (below), I also wanted to examine the Gender vs. PaymentTier plot.***

***Although the difference is small, the average PaymentTier is higher for men than for women.***
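
Because the difference in the bar plot is small, a contingency-table test is one way to judge whether Gender and PaymentTier are associated at all. This is only a sketch on made-up data and was not part of the original analysis; it uses `scipy.stats.chi2_contingency` on a `pd.crosstab` table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data, not the employee dataset.
demo = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 6,
    "PaymentTier": [3, 3, 3, 3, 2, 1, 3, 3, 2, 2, 1, 1],
})

# Rows: gender, columns: payment tier, cells: counts.
table = pd.crosstab(demo["Gender"], demo["PaymentTier"])

# A small p-value would suggest the tier distribution differs by gender.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```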

In [None]:
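# Create both panels in a single figure (a separate plt.figure call would leave an empty figure).
fig, axes = plt.subplots(1, 2, figsize=(15, 10))
sns.histplot(y="Education", x="City", data=df, ax=axes[0], cmap='Greens')
sns.histplot(y="Education", x="JoiningYear", data=df, ax=axes[1], cmap='Reds')
# Rotate the y tick labels and hide the y-axis label on both panels.
for ax in axes:
    ax.tick_params(axis='y', labelrotation=90)
    ax.set_ylabel('')
plt.show()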

***Again based on the heat map (below), I suspected that the City and JoiningYear columns are related to the Education level. Looking at the Education charts, most of the employees with a bachelor's degree are from Bangalore. Among employees living in the capital, New Delhi, the number of employees with bachelor's and master's degrees is almost the same.
Most of the employees started working in 2017, and most of them had a bachelor's degree; the majority of employees with a master's degree also started working in 2017. The number of employees with a doctorate is almost the same for each year.***

### Label Encoding for Categorical Columns

In [None]:
le=LabelEncoder()
df["Education"]= le.fit_transform(df["Education"])
df["City"]=le.fit_transform(df["City"])
df["Gender"]=le.fit_transform(df["Gender"])
df["EverBenched"]=le.fit_transform(df["EverBenched"])

In [None]:
df.head()
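
A short sketch of what the label encoder does to a column, using made-up city values. `LabelEncoder` sorts the distinct labels and numbers them from 0, so the codes reflect alphabetical order rather than any meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for illustration.
cities = ["Pune", "Bangalore", "New Delhi", "Pune"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(codes, mapping)
```

Note that when a single encoder object is refitted on several columns, its `classes_` attribute only describes the last column it was fitted on.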

**We will use a heat map chart to understand the relationship between columns. This can help us understand the data.**

In [None]:
plt.figure(figsize=(9,5))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="rocket_r")
plt.show()

**As we can see in the chart, the most strongly correlated pair of columns is Gender and PaymentTier, which suggests that gender is related to salary level. However, neither Gender nor PaymentTier has much of a relationship with leaving the company.**

**The features most correlated with leaving the company are JoiningYear, with a correlation of 0.20, and City, with a correlation of 0.18. These two features are also correlated with Education (0.14 and 0.15, respectively).**
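
Instead of reading the values off the heat map, the correlations with the target can be listed and ranked. A minimal sketch on made-up data; the column values are hypothetical and only the pattern matters.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset.
demo = pd.DataFrame({
    "JoiningYear": [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2018],
    "Age": [30, 25, 41, 28, 35, 26, 33, 29],
    "LeaveOrNot": [0, 0, 0, 1, 0, 0, 1, 1],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and sort by absolute strength.
target_corr = (
    demo.corr()["LeaveOrNot"]
    .drop("LeaveOrNot")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```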

***The plots based on ideas from the heat map appear above it because the heat map required label-encoded columns, while those plots use the original category labels.***

# Model

### Split the Data into Train and Test Set

In [None]:
X=df.drop(['LeaveOrNot'],axis=1)
y=df['LeaveOrNot']

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.15,random_state=45)

**I used 85% of the data to train the models and 15% to test them. The accuracy and error of each model are shown below.**
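
A small sketch of the 85/15 split on synthetic arrays, to show the resulting sizes. The `stratify=y` argument is an optional addition that the split above does not use; it keeps the share of leavers similar in the train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30% of them labelled 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# stratify=y is an assumption for illustration, not part of the original split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=45, stratify=y
)

print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```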

**While modeling the data with machine learning, I used the two methods we learned in the course. These methods are the Decision Tree Method and the KNN Method.**
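
A single train/test split can favor one model by chance. As a hedged sketch, the two methods could also be compared with `cross_val_score`, which is already imported at the top of the notebook. The data here is synthetic, generated with `make_classification`, and `random_state=0` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, not the employee dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# 5-fold cross-validation: each model is trained and scored five times.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```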


In [None]:
def classification_model(model,X_train,X_test,y_train,y_test):
    # Fit the model, then report accuracy, error, and the confusion matrix on the test set.
    print(f'{model}')
    model.fit(X_train,y_train)
    model_pred=model.predict(X_test)
    print(f'The {model} accuracy score {accuracy_score(y_test,model_pred)*100:.2f}')
    print(" ")
    # For 0/1 labels, the mean absolute error equals the share of misclassified rows.
    print("Mean Absolute Error : ",mean_absolute_error(y_test,model_pred))
    print(f'\n the confusion_matrix {confusion_matrix(y_test,model_pred)}')
    print("-*"*30)
    ConfusionMatrixDisplay.from_predictions(y_test,model_pred,cmap="Reds")
    plt.show()

In [None]:
models={
    'tree':DecisionTreeClassifier(),
    'knn':KNeighborsClassifier(),
}

In [None]:
# Fit and evaluate each model in turn.
for model in models.values():
    classification_model(model,X_train,X_test,y_train,y_test)

**I wanted to add the confusion matrix, which I saw in some notebooks I reviewed on Kaggle, to my project. In the confusion matrix charts, "True label" represents the actual values in the 15% of the data used for testing, while "Predicted label" represents the predictions of the model.**
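
To make the confusion matrix easier to read, here is a sketch of how the usual metrics follow from its four cells. The label vectors are made up. For 0/1 labels, the mean absolute error printed by the evaluation function is simply one minus the accuracy.

```python
from sklearn.metrics import confusion_matrix, mean_absolute_error

# Made-up true labels and predictions for ten test rows.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of correct predictions
precision = tp / (tp + fp)        # of predicted leavers, how many left
recall = tp / (tp + fn)           # of actual leavers, how many were found
mae = mean_absolute_error(y_true, y_pred)

print(cm)
print(accuracy, precision, recall, mae)
```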