- **Business understanding**

The goal is to build a Machine Learning model that predicts whether an employee will stay or leave the company. This is a classification problem. We will use two of the most widely used algorithms to solve it: Logistic Regression and a Neural Network.

**Turnover** is the renewal of a company’s workforce through staff departures and recruitment. It is a valuable indicator that readily reflects the work environment within the company. Machine Learning can help analyze and predict this rate, and thus support the right decisions.

In [None]:
# Import the algorithm libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Preprocessing, feature selection and model selection utilities
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import model_selection

# Data manipulation and visualization libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Metrics used to evaluate our models
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, confusion_matrix, classification_report

In [None]:
data = pd.read_csv('/kaggle/input/ibm-attrition-analysis/WA_Fn-UseC_-HR-Employee-Attrition.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.columns

In [None]:
# List of numeric variables
num_vars = [var for var in data.columns if data[var].dtypes != 'O']

print('Number of numeric variables: ', len(num_vars))

# show
data[num_vars].head()

In [None]:
# List of categorical variables
cat_vars = [var for var in data.columns if data[var].dtypes == 'O']

print('Number of categorical variables: ', len(cat_vars))

# show
data[cat_vars].head()

In [None]:
# Display the number of distinct categories for each categorical variable
for var in cat_vars:
    print(var, len(data[var].unique()), ' categories')

In [None]:
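# Print both values: a notebook cell only displays its last expression
print("Mode:", data["Age"].mode().tolist())
print("Mean:", data["Age"].mean())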

In [None]:
# Compute the number and percentage of employees who left (move) and who stayed (stay)
move = data[data['Attrition'] == "Yes"]
stay = data[data['Attrition'] == "No"]
print("moves: %i (%.1f%%)" %(len(move),(len(move)) / len(data)*100))
print("stay: %i (%.1f%%)" %(len(stay),(len(stay)) / len(data)*100))
print("Total: %i" %len(data))

The company recorded only 237 departures (**16.1%**) against 1233 employees who stayed (**83.9%**), out of a total of 1470. We can therefore already say that the dataset is imbalanced. In the following sections, if necessary, we will use the **SMOTE** oversampling technique to balance the dataset and obtain better **accuracy**.
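To make the imbalance concrete, the short sketch below reuses the counts printed above (237 departures, 1233 stays) and computes the accuracy that a trivial model would reach by always predicting “stay”. The variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the output above
n_move = 237   # employees who left
n_stay = 1233  # employees who stayed
n_total = n_move + n_stay

# Share of each class in the dataset
move_rate = n_move / n_total
stay_rate = n_stay / n_total

# A model that always predicts "stay" is right for every employee who stays,
# so its accuracy equals the share of the majority class.
baseline_accuracy = stay_rate

print("move: %.1f%%, stay: %.1f%%" % (move_rate * 100, stay_rate * 100))
print("majority-class baseline accuracy: %.1f%%" % (baseline_accuracy * 100))
```

Any model worth keeping should therefore beat roughly 84% accuracy, which is why accuracy alone is a weak yardstick on this dataset.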

- **Data Viz**

As a Data Scientist, your job is not only to interpret and analyze the data, but also to communicate and present your findings. That’s why it’s very important for you to have those skills.

In [None]:
# Pie chart of the share of employees who stay vs. leave
ax = (data['Attrition'].value_counts() * 100.0 / len(data))\
.plot.pie(autopct='%.1f%%', labels=['Stay', 'Move'], figsize=(5, 5), fontsize=12)
ax.set_ylabel('Attrition', fontsize=12)
ax.set_title('% HR statistics', fontsize=12)

Let’s create some interesting visualizations of how attrition relates to the other variables.

In [None]:
pd.crosstab(data["Department"],data["Attrition"]).plot(kind='bar')
plt.title('Attrition by Department')
plt.xlabel('Department')
plt.ylabel('Frequency')
plt.show()

We observe more departures in the R&D department. This is largely explained by the fact that this department accounts for more than half of the employees.

In [None]:
data["Department"].value_counts()
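
Because departments differ in size, raw counts can be misleading; comparing the attrition rate within each department is often more informative. The sketch below illustrates this on a small made-up DataFrame (the rows are hypothetical, only the column names mirror the dataset), using `pd.crosstab` with `normalize='index'` so that each row sums to 1.

```python
import pandas as pd

# Hypothetical toy data: only the column names mirror the real dataset
toy = pd.DataFrame({
    "Department": ["R&D", "R&D", "R&D", "R&D", "Sales", "Sales", "HR", "HR"],
    "Attrition":  ["No",  "No",  "No",  "Yes", "No",    "Yes",   "No", "Yes"],
})

# Raw counts: every department has one departure, but the departments differ in size
counts = pd.crosstab(toy["Department"], toy["Attrition"])

# Row-normalized table: share of "No" / "Yes" within each department
rates = pd.crosstab(toy["Department"], toy["Attrition"], normalize="index")

print(counts)
print(rates)
```

The same call on the real data would be `pd.crosstab(data["Department"], data["Attrition"], normalize="index")`.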

In [None]:
table=pd.crosstab(data["EducationField"], data["Attrition"])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Attrition by Education Field')
plt.xlabel('Education Field')
plt.ylabel('Proportion of Employees')
plt.show()

- **Data Processing**


An important part of data science is the manual collection and cleaning of data, a process also known as Data Wrangling. Although exciting, it is a tedious task that can take up 80% of a Data Scientist’s work.

In [None]:
# Count the missing (NaN) values in each column
data_nan = pd.isnull(data).sum()
data_nan

Our dataset is clean. Honestly, I didn’t expect that. In the real world, you probably won’t have data this clean.

Thank you IBM!!!!

In [None]:
# Look for outliers with a boxplot of each log-transformed numeric variable
def find_outliers(df, var):
    df = df.copy()

    # log(0) is undefined, so skip variables that contain zeros
    if 0 in df[var].unique():
        return

    df[var] = np.log(df[var])
    df.boxplot(column=var)
    plt.title(var)
    plt.ylabel(var)
    plt.show()

for var in num_vars:
    find_outliers(data, var)

Some variables have extreme values. This is the case for DailyRate and EmployeeNumber. However, handling outliers requires an understanding of the business and is often subjective. In my limited experience, I have always kept these values because they did not hurt the performance of my models, all the more so when the features are scaled. In our case, we decide to keep them.
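
For readers who prefer a number to a plot, a boxplot flags as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to a small made-up sample; the function name `iqr_outliers` and the values are hypothetical.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample with one obviously extreme value
sample = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(sample))
```

Counting the flagged points per column gives a quick way to decide which variables deserve a closer look before choosing to keep or drop anything.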

- **Transforming the data in the right format**

We use the LabelEncoder technique to convert each categorical variable into integer codes.

In [None]:

lb = LabelEncoder() 
data['Attrition'] = lb.fit_transform(data['Attrition'])
data['BusinessTravel'] = lb.fit_transform(data['BusinessTravel'])
data['Department'] = lb.fit_transform(data['Department'])
data['EducationField'] = lb.fit_transform(data['EducationField'])
data['Gender'] = lb.fit_transform(data['Gender'])
data['JobRole'] = lb.fit_transform(data['JobRole'])
data['MaritalStatus'] = lb.fit_transform(data['MaritalStatus'])
data['Over18'] = lb.fit_transform(data['Over18'])
data['OverTime'] = lb.fit_transform(data['OverTime'])

In [None]:
data.info()

Good: all the categorical variables have been converted to a numeric type.
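
It helps to know which integer each category received, since the later metrics depend on it. As a minimal sketch on a made-up list of labels (not the real column), `LabelEncoder` sorts the classes and numbers them from 0, so “No” becomes 0 and “Yes” becomes 1; the fitted `classes_` attribute exposes the mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the Attrition column
labels = ["Yes", "No", "No", "Yes", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_[i] is the original category encoded as the integer i
mapping = {category: int(code) for code, category in enumerate(encoder.classes_)}

print(mapping)
print(encoded.tolist())
```

Because the notebook reuses a single encoder for every column, `lb.classes_` only reflects the last column fitted; a separate encoder per column would be needed to invert the codes later.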

In [None]:
data.head()

- Feature Selection

In [None]:
X = data.drop(["Attrition"], axis = 1)
y = data["Attrition"]

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# Function for standardizing our data (zero mean, unit variance)

def normalisation(train_df, test_df):
    sc_X = StandardScaler()
    # Fit the scaler on the training set only, then apply it to both sets
    train_df = sc_X.fit_transform(train_df)
    test_df = sc_X.transform(test_df)
    return train_df, test_df


- Machine Learning

In this section, we will create our **ML models** and also **evaluate them**. 

Let’s go.

In [None]:
# Split our dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
#Normalize
X_train, X_test = normalisation(X_train, X_test)

- Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier_log =  LogisticRegression(solver='liblinear', C = 0.38, max_iter = 200, random_state = 0)
classifier_log.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_log = classifier_log.predict(X_test)
y_pred_proba = classifier_log.predict_proba(X_test)[:,1]

In [None]:
#score
accurancy_log = round(classifier_log.score(X_test, y_test) * 100)
print(str(accurancy_log )+ ' %')

It’s just amazing. The model has an accuracy of **89%**. 

However, you should always compare the training and test scores to check that there is no large gap between them, which would indicate overfitting. I also advise you to explore the cross_val_score function.
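
As a starting point for that exploration, here is a minimal sketch of `cross_val_score` on synthetic data. The dataset comes from `make_classification` and only imitates the class imbalance of the HR data; the names `X_demo`, `y_demo` and `model` are hypothetical, and the hyperparameters are simply copied from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the HR data (roughly 84% / 16%)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=42
)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training part of every fold, so nothing leaks from the held-out part
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", C=0.38, max_iter=200, random_state=0),
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print("Mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

A small standard deviation across folds suggests that the score does not depend much on one particular train/test split.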

- Evaluate LR

In [None]:
# Confusion matrix
cm_log = confusion_matrix(y_test, y_pred_log)
sns.heatmap(cm_log, annot=True,fmt='3.0f',cmap="cubehelix")

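The model was wrong about 33 employees. Taking “stay” as the positive class, these are what data science calls **false positives** (predicted to stay, but they left). Nevertheless, the 313 employees correctly predicted to stay are **true positives**. Note that because Attrition was encoded as 1 for “Yes”, scikit-learn’s metrics use label 1, that is leaving, as the positive class by default.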
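
To read such a matrix without ambiguity, it helps to name the four cells explicitly. The sketch below uses scikit-learn’s layout for a binary problem, `[[tn, fp], [fn, tp]]` with label 1 as the positive class, and derives accuracy, precision and recall by hand. The counts are made up for illustration and are not the notebook’s results.

```python
# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
cm = [[90, 5],
      [10, 15]]

(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of correct predictions
precision = tp / (tp + fp)     # among predicted positives, how many are right
recall = tp / (tp + fn)        # among actual positives, how many are found

print("accuracy=%.3f precision=%.3f recall=%.3f" % (accuracy, precision, recall))
```

With the notebook’s own matrix, `tn, fp, fn, tp = cm_log.ravel()` unpacks the same four cells.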

In [None]:
#Classification Report
print(classification_report(y_test,y_pred_log))

In [None]:
#ROC AUC
probs = classifier_log.predict_proba(X_test)
probs = probs[:, 1]

auc_log = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc_log*100))

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

An **AUC of 78.24%** is not bad at all given the size of the dataset.
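
One way to interpret the AUC is as the probability that a randomly chosen employee who left receives a higher predicted score than a randomly chosen employee who stayed. The sketch below computes it that way on made-up scores, counting ties as half; the function name `pairwise_auc` is hypothetical, and on the same inputs the result should agree with `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = left) and predicted probabilities of leaving
y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

print("AUC: %.3f" % pairwise_auc(y_true, y_score))
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot above, that is, a ranking no better than chance.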

- Neural Network

In [None]:
# Multi-layer perceptron with three hidden layers of 14 units each.
# No random_state is set, so the results vary slightly from run to run.
mlp = MLPClassifier(hidden_layer_sizes=(14,14,14), activation='relu', solver='adam', max_iter=100)
mlp.fit(X_train,y_train)
y_pred_mlp = mlp.predict(X_test)
y_proba_mlp = mlp.predict_proba(X_test)[:,1]

In [None]:
accurancy_neural = round(mlp.score(X_test, y_test) * 100, 2)
print(str(accurancy_neural) + ' %')

- Evaluate Neural Network


In [None]:
# Confusion matrix of the neural network
cm_mlp = confusion_matrix(y_test, y_pred_mlp)
sns.heatmap(cm_mlp, annot=True,fmt='3.0f',cmap="cubehelix")

The neural network predicts the employees who will leave better than logistic regression does: it correctly identifies 18 of them (**true negatives**, taking “stay” as the positive class). In fact, it was wrong about only 30 employees.

However, is this the model to keep?

In [None]:
#classification report
print(classification_report(y_test,y_pred_mlp))

In [None]:
#ROC AUC
probs = mlp.predict_proba(X_test)
probs = probs[:, 1]
auc_mlp = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc_mlp*100))


fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')

plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

- Summary

In [None]:
Result = pd.DataFrame({
    'Model': ['Logistic Regression', 'Neural Network'],
    'Score': [accurancy_log, accurancy_neural]})
Result.sort_values(by='Score', ascending=False)

In [None]:
# Compare our predictions with the true values
Result_logistic = pd.DataFrame({
    "True data": y_test,
    "Predict data": y_pred_log,
    "Proba data": y_pred_proba
})
#Result_logistic.head(40)

- Conclusion

In view of our results, we might think that logistic regression is the best model. However, I encourage you to analyze the results in depth to identify the best model, in particular by comparing all the metrics (confusion matrix, ROC curve and classification_report).
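
One possible way to run that comparison is to gather several metrics per model in a single structure. The sketch below is only an illustration: the helper `summarize` and the two sets of predictions are hypothetical, chosen so that both models reach the same accuracy while differing on precision and recall.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def summarize(name, y_true, y_pred, y_proba):
    """Collect the main binary-classification metrics for one model."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1": f1_score(y_true, y_pred, zero_division=0),
        "AUC": roc_auc_score(y_true, y_proba),
    }

# Hypothetical labels and predictions for two models (1 = left)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
model_a = summarize("Model A", y_true,
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
                    [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.3, 0.45, 0.4, 0.9])
model_b = summarize("Model B", y_true,
                    [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
                    [0.1, 0.2, 0.1, 0.3, 0.6, 0.7, 0.4, 0.8, 0.6, 0.9])

for row in (model_a, model_b):
    print(row)
```

In the notebook, the equivalent calls would be `summarize("Logistic Regression", y_test, y_pred_log, y_pred_proba)` and `summarize("Neural Network", y_test, y_pred_mlp, y_proba_mlp)`.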

As for me, I prefer not to give my point of view. I am open to your suggestions, contributions and advice. Thank you for reading to the end.

AIDARA Chamsedine

Student in MBA 2 Big Data

Data scientist at Expresso Senegal(Telco Company)

aidarachamsedine10@gmail.com