# Employee Attrition Rate using Regression

## Introduction

Artificial intelligence is commonly used across industries to automate processes, gather business insights, and speed up work. You will use Python to study how AI is applied in real-life scenarios and how it actually impacts industries.

Employees are the most important asset of an organization, and successful employees contribute a lot to it. In this notebook, we will use AI to predict the attrition rate of employees, which indicates how well a company retains its staff.

## Context

We will work with a dataset of employee attrition rates, collected by HackerEarth and uploaded to [Kaggle](https://www.kaggle.com/blurredmachine/hackerearth-employee-attrition). We will use regression to predict attrition rates and see how well our model performs.



## Use Python to open CSV files

We will use [scikit-learn](https://scikit-learn.org/stable/) and [pandas](https://pandas.pydata.org/) to work with our dataset. Scikit-learn is a machine learning library that provides efficient tools for predictive data analysis. Pandas is a popular Python library for data science. It offers powerful and flexible data structures that make data manipulation and analysis easier.


## Import Libraries


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

from sklearn.linear_model import LinearRegression 
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error

### Importing the Dataset

The dataset contains employee attrition rates. Let us load the training and test sets.



In [2]:
train = pd.read_csv("[Dataset]_Module11_Train_(Employee).csv") 
test = pd.read_csv("[Dataset]_Module11_Test_(Employee).csv")
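To see what `pd.read_csv` returns without needing the course files, here is a minimal sketch that reads a tiny in-memory CSV. The sample rows and values are hypothetical and only borrow a few column names from this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real files have many more rows and columns.
sample_csv = io.StringIO(
    "Employee_ID,Time_of_service,Attrition_rate\n"
    "EID_1,4.0,0.18\n"
    "EID_2,,0.07\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # the empty field is read as NaN, so the column is float
```

Empty fields become `NaN`, which is why the notebook checks for missing values before training.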

## Task 1: Print the columns of the training set

In [3]:
#yourcodehere
print(train.columns)

In [4]:
print(train.shape)
train.head()

### Data Description

Let us see how the data is distributed. We can visualize the mean, max, and min value of each column alongside other characteristics.

In [5]:
train.describe()


## Task 2: Get information about the training dataset using the describe function

In [6]:
#yourcodehere
train.describe()

In [7]:
train.isna().sum()

In [8]:
# Let's see if the training set has any missing values
train.isna().any()
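The difference between `isna().sum()`, `isna().any()`, and `dropna` is easier to see on a small frame. The sketch below uses a hypothetical four-row table, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries.
toy = pd.DataFrame({
    "Time_of_service": [4.0, np.nan, 10.0, 7.0],
    "Post_Level": [1, 2, 2, 3],
    "Attrition_rate": [0.18, 0.07, np.nan, 0.25],
})

missing_per_column = toy.isna().sum()  # count of NaN values in each column
has_missing = toy.isna().any()         # True for columns with at least one NaN
complete_rows = toy.dropna(axis=0)     # axis=0 drops rows that contain a NaN

print(missing_per_column)
print(has_missing)
print(complete_rows.shape)
```

Note that `dropna(axis=0)` removes rows, not columns, so every column survives but the row count shrinks.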

### Data Visualization

Now, let us plot the correlation matrix to see how related the features are.

In [9]:
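plt.figure(figsize=(18,10))
# Correlate numeric columns only; this skips any non-numeric columns (numeric_only needs pandas 1.5+)
cor = train.corr(numeric_only=True)
sns.heatmap(cor, annot=True, cmap=plt.cm.Accent)
# Save before plt.show(); with inline plotting the figure is closed once it is shown
plt.savefig("main_correlation.png")
plt.show()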
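To read a correlation matrix, it helps to look at a case where the answers are known in advance. This is a minimal sketch on hypothetical toy columns, assuming pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Hypothetical toy data: b rises with a, c falls as a rises, label is text.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
    "label": ["w", "x", "y", "z"],
})

# Pearson correlation between every pair of numeric columns.
toy_cor = toy.corr(numeric_only=True)
print(toy_cor)
```

Values near 1 mean two columns move together, values near -1 mean they move in opposite directions, and values near 0 mean there is little linear relationship. The text column is left out of the matrix.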

### Preparing the model

Now we will finalize the data for the training and prepare the model.

In [None]:
#Attrition_rate is the label or output to be predicted
#features will be used to predict Attrition_rate
label = ["Attrition_rate"]
features = ['VAR7','VAR6','VAR5','VAR1','VAR3','growth_rate','Time_of_service','Time_since_promotion','Travel_Rate','Post_Level','Education_Level']


In [None]:
featured_data = train.loc[:,features+label]
#We will drop the rows which have missing values using the dropna function (axis=0 drops rows)
featured_data = featured_data.dropna(axis=0)
featured_data.shape

In [None]:
X = featured_data.loc[:,features]
y = featured_data.loc[:,label]

In [None]:
#Here the data is split 45% for training and 55% for testing, because test_size is 0.55
#The test size is this large because the model still scores well on a bigger test set. The right split depends on the model:
#if the results hold up, a developer may choose a bigger test set. Experimentation is the key to understanding the model.
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1,test_size=0.55)
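A quick way to confirm which side of the split gets 55% is to run `train_test_split` on synthetic arrays and count the rows. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y: 1000 rows, 2 feature columns.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1, test_size=0.55
)

# test_size is the fraction held out, so the test part is the larger one here.
print("train rows:", len(X_tr))
print("test rows:", len(X_te))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.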

In [None]:
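#df = Ridge(alpha=0.000001)
# Fit an ordinary least-squares model (note: df is the model here, not a DataFrame)
df = LinearRegression()
df.fit(X_train,y_train)
# y_train is a one-column DataFrame, so y_pred has shape (n_samples, 1)
y_pred = df.predict(X_test)
# Flatten to a list of predictions rounded to 5 decimals
c = y_pred[:, 0].round(5).tolist()
# Keep at most the first 3000 predictions
pf=c[:3000]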
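The imports at the top also bring in `Ridge` and `Lasso`, and the commented-out line hints at trying alternatives. One possible way to compare them is sketched below on synthetic data. The data, the `alpha` values, and the names `X_syn`, `y_syn`, and `candidates` are all assumptions for illustration, not results from the attrition dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 3))
y_syn = 0.2 + 0.05 * X_syn[:, 0] - 0.03 * X_syn[:, 1] + rng.normal(scale=0.01, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1, test_size=0.55)

# Illustrative hyperparameters; in practice these would be tuned.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.001),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: MSE = {results[name]:.6f}")
```

A lower mean squared error on the held-out rows suggests a better fit, although differences between linear models are often small on data like this.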


In [None]:
#Let's print the score now: 100 * (1 - MSE), floored at 0 (an error-based score, not a classification accuracy)
score = 100* max(0, 1-mean_squared_error(y_test, y_pred))
print(score)
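To get a feel for how this score behaves, here is a small worked example on hypothetical numbers. It assumes the targets lie between 0 and 1, which is an assumption about the data rather than something shown above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true rates and model predictions.
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_hat = np.array([0.20, 0.20, 0.20, 0.40])

# Squared errors are 0.01, 0, 0.01, 0, so their mean is 0.005.
mse_model = mean_squared_error(y_true, y_hat)
score_model = 100 * max(0, 1 - mse_model)

# Baseline: always predict the mean of the true values.
y_mean = np.full_like(y_true, y_true.mean())
mse_baseline = mean_squared_error(y_true, y_mean)
score_baseline = 100 * max(0, 1 - mse_baseline)

print(score_model, score_baseline)
```

Because errors on a 0 to 1 scale are small once squared, even the constant baseline scores close to 100. Comparing the model against such a baseline gives a fairer picture than the raw score alone.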

In [None]:
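#Predicting on the test set
#Note: pf holds predictions for X_test (rows held out from train), not for the rows of test,
#so we predict on the test features directly. Filling missing values with training means is an assumption.
X_submit = test[features].fillna(X_train.mean())
dff = pd.DataFrame({'Employee_ID':test['Employee_ID'],'Attrition_rate':df.predict(X_submit)[:,0].round(5)})
dff.head()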

## Task 3: Print the first 20 rows of predictions


In [None]:
#yourcodehere
print(dff.head(20))

### Conclusion

In this notebook, we have seen how AI can be used by companies to predict which employees would be loyal to them. We have built a linear regression model to predict the attrition rate.