# IBM Attrition

Kaggle

https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

| Feature | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Education | Below College | College | Bachelor | Master | Doctor |
| EnvironmentSatisfaction | Low | Medium | High | Very High | |
| JobInvolvement | Low | Medium | High | Very High | |
| JobSatisfaction | Low | Medium | High | Very High | |
| PerformanceRating | Low | Good | Excellent | Outstanding | |
| RelationshipSatisfaction | Low | Medium | High | Very High | |
| WorkLifeBalance | Bad | Good | Better | Best | |
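The integer codes listed above can be mapped back to their labels when readable plots or group-bys are needed. The following is a minimal sketch on a small made-up frame, not the Kaggle file; `label_maps` and `toy` are illustrative names.

```python
import pandas as pd

# Code-to-label dictionaries copied from the listing above (two features shown).
label_maps = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}

# Toy rows standing in for the real data (assumption for illustration).
toy = pd.DataFrame({"Education": [1, 3, 5], "WorkLifeBalance": [4, 2, 1]})

# Add a readable "<column>Label" column next to each coded column.
for col, mapping in label_maps.items():
    toy[col + "Label"] = toy[col].map(mapping)

print(toy)
```

The coded columns are left untouched, so they can still be fed to a model as ordinal features.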

### Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
%matplotlib inline

### Import Dataset

In [None]:
df = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')

### View Dataset Information (i.e. null, int, object, shape)

In [None]:
df.info()

### Preview Dataset

In [None]:
df.head()

### Check for Null Values

In [None]:
df.isnull().sum()

### Look for Constant Features

In [None]:
df.nunique().sort_values()

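### Drop Constant and Identifier Features

In [None]:
# EmployeeCount, Over18 and StandardHours hold a single value; EmployeeNumber is a per-row ID
df = df.drop(columns=['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'])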
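The columns to drop can also be picked out programmatically from `nunique()` instead of by eye. A rough sketch on a made-up frame; `constant_cols` and `id_like_cols` are hypothetical names, and the toy values are not from the dataset.

```python
import pandas as pd

# Toy frame (assumption): one constant column, one per-row identifier, one real feature.
toy = pd.DataFrame({
    "EmployeeCount": [1, 1, 1, 1],
    "EmployeeNumber": [101, 102, 103, 104],
    "Age": [30, 41, 30, 52],
})

n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()        # same value in every row
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()  # different value in every row

reduced = toy.drop(columns=constant_cols + id_like_cols)
print(constant_cols, id_like_cols, reduced.columns.tolist())
```

A continuous feature can also be unique in every row, so the identifier list is only a set of candidates to review, not something to drop blindly.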

### Analyse Dataset

In [None]:
df.describe()

### Plot HeatMap to Pick Out Highly Correlated Features

In [None]:
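plt.subplots(figsize=(30, 30))
# numeric_only=True skips the string columns, which newer pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.8);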

In [None]:
# drop highly correlated columns
df = df.drop(columns=['JobLevel','MonthlyIncome', 'TotalWorkingYears',
                      'YearsInCurrentRole', 'YearsWithCurrManager', 
                      'YearsSinceLastPromotion'])
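Reading a 30x30 heatmap by eye is error-prone, so the highly correlated pairs can also be listed directly. A small sketch on made-up numbers; the 0.8 threshold is an illustrative assumption, not a value used in this notebook.

```python
from itertools import combinations

import pandas as pd

# Toy numeric frame (assumption): "b" is almost a multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})

threshold = 0.8  # illustrative cut-off
corr = toy.corr().abs()

# Visit each unordered column pair once and keep those above the threshold.
high_pairs = [
    (x, y) for x, y in combinations(corr.columns, 2)
    if corr.loc[x, y] > threshold
]
print(high_pairs)
```

For each pair found this way, one of the two columns is a candidate for dropping.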

### Replace Target Column Obj with Int

In [None]:
# Note the encoding: 'Yes' (employee left) becomes 0 and 'No' (employee stayed) becomes 1
df['Attrition'] = df['Attrition'].map({'Yes': 0, 'No': 1}).astype('int64')

### Check if Data is Balanced

In [None]:
df.Attrition.value_counts()
# The target column is imbalanced: far fewer employees left ('Yes') than stayed ('No')

### Replace Features Obj with Int

In [None]:
print(df.Attrition.value_counts())

print(df.BusinessTravel.value_counts())
print(df.Department.value_counts())
print(df.EducationField.value_counts())
print(df.Gender.value_counts())
print(df.JobRole.value_counts())
print(df.MaritalStatus.value_counts())
print(df.OverTime.value_counts())

In [None]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
colnames_to_encode = ['Attrition','BusinessTravel', 'Department', 'EducationField', 
                      'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
for c in colnames_to_encode:
    df[c] = label_encoder.fit_transform(df[c])
print(df.Attrition.value_counts())
print(df.BusinessTravel.value_counts())
print(df.Department.value_counts())
print(df.EducationField.value_counts())
print(df.Gender.value_counts())
print(df.JobRole.value_counts())
print(df.MaritalStatus.value_counts())
print(df.OverTime.value_counts())
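Reusing one `LabelEncoder` for every column means only the last column's mapping is still available afterwards. A possible variation, sketched on a made-up frame, keeps one encoder per column so each integer code can be traced back to its category; `encoders` and `code_books` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Category values exist in the dataset; the rows themselves are made up.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Married"],
})

encoders = {}  # column name -> fitted encoder
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, and the position of each class is its integer code.
code_books = {col: dict(enumerate(enc.classes_.tolist())) for col, enc in encoders.items()}
print(code_books)
```

The codes follow alphabetical order, so they carry no meaningful ranking for nominal features such as `JobRole`.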


### Move Target to the Last Column for Convenience

In [None]:
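target = df.Attrition
df = df.drop(labels=['Attrition'], axis=1)
df.insert(len(df.columns), 'Attrition', target)  # append as the last column
df.head()

In [None]:
X = df.iloc[:, :-1].values   # every column except the target
y = df.iloc[:, -1].values    # target is the last column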

### Train Test Split

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

### Normalise Data

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

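### Use SMOTE to Oversample, as There Are Far Fewer "Yes" than "No" in the "Attrition" Column

In [None]:
from imblearn.over_sampling import SMOTE
# Recent imbalanced-learn releases use sampling_strategy (formerly ratio) and fit_resample (formerly fit_sample)
sm = SMOTE(random_state=10, sampling_strategy=1.0)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)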
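To give a feel for what SMOTE does, the sketch below reproduces only its core idea in NumPy: each synthetic row is a random point on the line between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the imbalanced-learn implementation; `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per row

    base_idx = rng.integers(0, len(X_min), size=n_new)  # random base points
    nb_choice = rng.integers(0, k, size=n_new)          # which neighbour to use
    nb_idx = neighbours[base_idx, nb_choice]
    gaps = rng.random((n_new, 1))                       # interpolation weights in [0, 1)
    return X_min[base_idx] + gaps * (X_min[nb_idx] - X_min[base_idx])

# Four made-up minority samples on the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

Because the new rows are interpolated, they always fall between existing minority samples, which is why oversampling is applied to the training split only.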

### Use RFE to Select the Most Important Features for Logistic Regression

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', random_state=9)
rfe = RFE(model, n_features_to_select=5)  # must be passed by keyword in current scikit-learn
fit = rfe.fit(X_train_sm, y_train_sm)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

### Use .info() to See Which Column Each Feature Ranking Refers To

In [None]:
df.info()
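Matching ranking positions against the `.info()` listing by hand is easy to get wrong, so the ranks can be paired with column names directly. A sketch on synthetic data, where `f0`..`f7` are placeholder names; in this notebook `df.columns[:-1]` would play the role of `feature_names`, assuming `X` was built from `df.iloc[:, :-1]`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption); column names are placeholders.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
feature_names = [f"f{i}" for i in range(X_toy.shape[1])]

rfe_toy = RFE(LogisticRegression(solver="lbfgs", max_iter=1000), n_features_to_select=5)
rfe_toy.fit(X_toy, y_toy)

# Pair each column name with its RFE rank (1 = selected) and selection flag.
ranking = pd.DataFrame({
    "feature": feature_names,
    "selected": rfe_toy.support_,
    "rank": rfe_toy.ranking_,
}).sort_values("rank")
print(ranking)

selected_names = ranking.loc[ranking["selected"], "feature"].tolist()
```

The boolean `support_` mask can also index the feature arrays themselves, for example `X_toy[:, rfe_toy.support_]`, to keep only the selected columns.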

### Drop Less Important Features

In [None]:
# Note: this only changes df. X_train_sm and X_test were built earlier and still contain every feature.
df = df.drop(columns=['BusinessTravel', 'DailyRate', 'Department',
                      'DistanceFromHome', 'Education', 'EducationField',
                      'NumCompaniesWorked', 'Gender', 'HourlyRate',
                      'JobRole', 'JobSatisfaction', 'MonthlyRate', 'PercentSalaryHike',
                      'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
                      'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany'])

### Use SMOTE Features and Target to Predict

In [None]:
model.fit(X_train_sm, y_train_sm)
y_pred = model.predict(X_test)

### Confusion Matrix

In [None]:
from sklearn import metrics
metrics.confusion_matrix(y_test,y_pred)
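Since this notebook encodes attrition "Yes" as 0, the first row and column of the matrix describe the employees who left. The sketch below, on hand-made labels rather than the model's output, shows how recall and precision for that class fall out of the matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (assumption). As in this notebook, 0 = attrition "Yes", 1 = "No".
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1, 1])

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

leavers_caught = cm[0, 0]  # true 0, predicted 0
leavers_missed = cm[0, 1]  # true 0, predicted 1
false_alarms = cm[1, 0]    # true 1, predicted 0

recall_leavers = leavers_caught / (leavers_caught + leavers_missed)
precision_leavers = leavers_caught / (leavers_caught + false_alarms)
print(cm, recall_leavers, precision_leavers)
```

In this toy case two of the three leavers are caught (recall 2/3), but only two of the five flagged employees actually leave (precision 0.4).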

### High Recall, Low Precision

#### This means the company can flag most of the employees who are likely to quit and intervene early, at the cost of also flagging some who would have stayed.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

### Write Predictions Back to a CSV File for Download

In [None]:
y_hats = model.predict_proba(X_test)

In [None]:
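# The model was trained on standardised features, so scale X with the fitted scaler before predicting
y_hats2 = model.predict(sc_X.transform(X))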

df['y_hats'] = y_hats2

df.to_csv('data1.csv')