# Supervised Machine Learning Models - Logistic Regression - Exercise Solution

#### Exercise

In this exercise you will use the **WA_Fn-UseC_-HR-Employee-Attrition.csv** dataset to predict employee attrition. The data was obtained from https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset and contains the following variables:

| Variable | Definition |
| --- | --- |
| Attrition | 'Yes' if the employee leaves the company, 'No' if the employee stays with the company |
| EmployeeNumber | Unique identifier for each employee |
| Age | Age in years of the employee |
| BusinessTravel | Frequency of business travel: 'Travel_Frequently', 'Travel_Rarely' or 'Non-Travel' |
| DailyRate | Daily rate of pay for the employee |
| Department | Department the employee belongs to: 'Sales', 'Research & Development' or 'Human Resources' |
| DistanceFromHome | Distance from employee's home to workplace |
| Education | Level of education: 1 'Below College', 2 'College', 3 'Bachelor', 4 'Master', 5 'Doctor' |
| EducationField | Field of study in which the employee obtained their highest education: 'Life Sciences', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources', 'Engineering', 'Arts', or 'Other' |
| EmployeeCount | Number of employees in the company |
| EnvironmentSatisfaction | Employee's level of satisfaction with their work environment: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| Gender | Employee's gender: 'Male' or 'Female' |
| HourlyRate | Hourly rate of pay for the employee |
| JobInvolvement | Employee's level of job involvement: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| JobLevel | Employee's job level: 1 'Entry Level', 2 'Intermediate Level', 3 'Managerial Level', 4 'Director Level', 5 'Executive Level' |
| JobRole | Employee's job role: 'Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources' |
| JobSatisfaction | Employee's level of job satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| MaritalStatus | Employee's marital status: 'Single', 'Married' or 'Divorced' |
| MonthlyIncome | Monthly income of the employee |
| MonthlyRate | Monthly rate of pay for the employee |
| NumCompaniesWorked | Number of companies the employee has worked for |
| Over18 | Whether the employee is over 18 years old: 'Y' or 'N' |
| OverTime | Whether the employee works overtime: 'Yes' or 'No' |
| PercentSalaryHike | Percentage increase in salary for the employee |
| PerformanceRating | Employee's performance rating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding' |
| RelationshipSatisfaction | Employee's level of satisfaction with their relationships at work: 1 'Low', 2 'Medium', 3 'High', 4 'Very High' |
| StandardHours | Standard number of working hours for the company |
| StockOptionLevel | Employee's level of stock options: 0 'None', 1 'Low', 2 'Medium', 3 'High' |
| TotalWorkingYears | Total number of years the employee has worked |
| TrainingTimesLastYear | Number of times the employee received training last year |
| WorkLifeBalance | Employee's level of work-life balance: 1 'Bad', 2 'Good', 3 'Better', 4 'Best' |
| YearsAtCompany | Number of years the employee has worked at the company |
| YearsInCurrentRole | Number of years the employee has been in their current role |
| YearsSinceLastPromotion | Number of years since the employee's last promotion |
| YearsWithCurrManager | Number of years the employee has been working under their current manager |

Use logistic regression to predict whether the employee will leave the company (i.e., **Attrition**) using all possible independent variables in the model. Train the model on a random sample of 75% of the observations in the dataset, and test the model on the remaining 25% of the observations in the dataset. How well did the model perform?

In [1]:
import pandas as pd
import numpy as np
import warnings
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 10000)
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:,.3f}'.format

In [2]:
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2


In [3]:
df.describe()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.924,802.486,9.193,2.913,1.0,1024.865,2.722,65.891,2.73,2.064,2.729,6502.931,14313.103,2.693,15.21,3.154,2.712,80.0,0.794,11.28,2.799,2.761,7.008,4.229,2.188,4.123
std,9.135,403.509,8.107,1.024,0.0,602.024,1.093,20.329,0.712,1.107,1.103,4707.957,7117.786,2.498,3.66,0.361,1.081,0.0,0.852,7.781,1.289,0.706,6.127,3.623,3.222,3.568
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,1009.0,2094.0,0.0,11.0,3.0,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,2.0,1.0,491.25,2.0,48.0,2.0,1.0,2.0,2911.0,8047.0,1.0,12.0,3.0,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,3.0,1.0,1020.5,3.0,66.0,3.0,2.0,3.0,4919.0,14235.5,2.0,14.0,3.0,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,4.0,1.0,1555.75,4.0,83.75,3.0,3.0,4.0,8379.0,20461.5,4.0,18.0,3.0,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,5.0,1.0,2068.0,4.0,100.0,4.0,5.0,4.0,19999.0,26999.0,9.0,25.0,4.0,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [4]:
# 'EmployeeCount' has all values equal to 1 and 'StandardHours' has all values equal to 80,
# so neither carries any information. 'EmployeeNumber' is an identifier and should have
# no relation with Attrition. Let's remove all three columns.
df.drop(['EmployeeCount', 'StandardHours', 'EmployeeNumber'], axis=1, inplace=True)

In [5]:
# Create an indicator variable equal to 1 if Attrition is 'Yes' and equal to 0 if Attrition is 'No'
df['Attrition'] = np.where(df['Attrition'] == 'Yes', 1, 0)
df.describe()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.924,0.161,802.486,9.193,2.913,2.722,65.891,2.73,2.064,2.729,6502.931,14313.103,2.693,15.21,3.154,2.712,0.794,11.28,2.799,2.761,7.008,4.229,2.188,4.123
std,9.135,0.368,403.509,8.107,1.024,1.093,20.329,0.712,1.107,1.103,4707.957,7117.786,2.498,3.66,0.361,1.081,0.852,7.781,1.289,0.706,6.127,3.623,3.222,3.568
min,18.0,0.0,102.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,1009.0,2094.0,0.0,11.0,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,0.0,465.0,2.0,2.0,2.0,48.0,2.0,1.0,2.0,2911.0,8047.0,1.0,12.0,3.0,2.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,0.0,802.0,7.0,3.0,3.0,66.0,3.0,2.0,3.0,4919.0,14235.5,2.0,14.0,3.0,3.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,0.0,1157.0,14.0,4.0,4.0,83.75,3.0,3.0,4.0,8379.0,20461.5,4.0,18.0,3.0,4.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1.0,1499.0,29.0,5.0,4.0,100.0,4.0,5.0,4.0,19999.0,26999.0,9.0,25.0,4.0,4.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0


In [6]:
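# Use ChatGPT
# Create dummy variables for categorical columns.
# drop_first=True omits one level per variable to avoid perfect collinearity with the constant.
# 'Over18' has a single level ('Y'), so it produces no dummy column at all.

df = pd.get_dummies(df, columns=['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime'], drop_first=True, dtype=int)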
df.describe()

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,Gender_Male,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.924,0.161,802.486,9.193,2.913,2.722,65.891,2.73,2.064,2.729,6502.931,14313.103,2.693,15.21,3.154,2.712,0.794,11.28,2.799,2.761,7.008,4.229,2.188,4.123,0.188,0.71,0.654,0.303,0.412,0.108,0.316,0.056,0.09,0.6,0.035,0.176,0.069,0.099,0.054,0.199,0.222,0.056,0.458,0.32,0.283
std,9.135,0.368,403.509,8.107,1.024,1.093,20.329,0.712,1.107,1.103,4707.957,7117.786,2.498,3.66,0.361,1.081,0.852,7.781,1.289,0.706,6.127,3.623,3.222,3.568,0.391,0.454,0.476,0.46,0.492,0.311,0.465,0.23,0.286,0.49,0.185,0.381,0.254,0.298,0.227,0.399,0.416,0.231,0.498,0.467,0.451
min,18.0,0.0,102.0,1.0,1.0,1.0,30.0,1.0,1.0,1.0,1009.0,2094.0,0.0,11.0,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,30.0,0.0,465.0,2.0,2.0,2.0,48.0,2.0,1.0,2.0,2911.0,8047.0,1.0,12.0,3.0,2.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,36.0,0.0,802.0,7.0,3.0,3.0,66.0,3.0,2.0,3.0,4919.0,14235.5,2.0,14.0,3.0,3.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,43.0,0.0,1157.0,14.0,4.0,4.0,83.75,3.0,3.0,4.0,8379.0,20461.5,4.0,18.0,3.0,4.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
max,60.0,1.0,1499.0,29.0,5.0,4.0,100.0,4.0,5.0,4.0,19999.0,26999.0,9.0,25.0,4.0,4.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
# Ask ChatGPT: What is a list comprehension?
# After get_dummies the original categorical columns no longer exist,
# so only the dependent variable 'Attrition' needs to be excluded.

indep_vars = [x for x in df.columns if x != 'Attrition']
print(indep_vars)

['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager', 'BusinessTravel_Travel_Frequently', 'BusinessTravel_Travel_Rarely', 'Department_Research & Development', 'Department_Sales', 'EducationField_Life Sciences', 'EducationField_Marketing', 'EducationField_Medical', 'EducationField_Other', 'EducationField_Technical Degree', 'Gender_Male', 'JobRole_Human Resources', 'JobRole_Laboratory Technician', 'JobRole_Manager', 'JobRole_Manufacturing Director', 'JobRole_Research Director', 'JobRole_Research Scientist', 'JobRole_Sales Executive', 'JobRole_Sales Representative', 'MaritalStatus_Married', 'MaritalStatus_
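A list comprehension builds a list in one expression by looping over an iterable and, optionally, filtering its items. The sketch below uses a short, made-up column list (not the full dataset) to show that the comprehension gives the same result as an explicit loop.

```python
# Hypothetical, shortened column list used only for illustration
columns = ['Age', 'Attrition', 'DailyRate', 'OverTime_Yes']

# List comprehension: keep every column except the dependent variable
indep_vars = [x for x in columns if x != 'Attrition']

# Equivalent explicit loop
indep_vars_loop = []
for x in columns:
    if x != 'Attrition':
        indep_vars_loop.append(x)

print(indep_vars)                     # ['Age', 'DailyRate', 'OverTime_Yes']
print(indep_vars == indep_vars_loop)  # True
```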

In [8]:
# Create X and y DataFrames; '_const' is a column of ones that serves as the intercept in sm.Logit

X = df[indep_vars].assign(_const=1)
y = df[['Attrition']]

display(X.head())
display(y.head())

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,Gender_Male,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes,_const
0,41,1102,1,2,2,94,3,2,4,5993,19479,8,11,3,1,0,8,0,1,6,4,0,5,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1
1,49,279,8,1,3,61,2,2,2,5130,24907,1,23,4,4,1,10,3,3,10,7,1,7,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1
2,37,1373,2,2,4,92,2,1,3,2090,2396,6,15,3,2,0,7,3,3,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,1,1,1
3,33,1392,3,4,4,56,3,1,3,2909,23159,1,11,3,3,0,8,3,3,8,7,3,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,1
4,27,591,2,1,1,40,3,1,2,3468,16632,9,12,3,4,1,6,3,3,2,2,2,2,0,1,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1


Unnamed: 0,Attrition
0,1
1,0
2,1
3,0
4,0


In [9]:
# Split into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123456, stratify=y)

print(f'# Observations in X_train: {len(X_train)}')
print(f'# Observations in y_train: {len(y_train)}')
print(f'# Observations in X_test: {len(X_test)}')
print(f'# Observations in y_test: {len(y_test)}')

display(X_train.head())
display(y_train.head())

# Observations in X_train: 1102
# Observations in y_train: 1102
# Observations in X_test: 368
# Observations in y_test: 368


Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,Gender_Male,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes,_const
241,32,976,26,4,3,100,3,2,4,4465,12069,0,18,3,1,0,4,2,3,3,2,2,2,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,1
769,26,921,1,1,1,66,2,1,3,2007,25265,1,13,3,3,2,5,5,3,5,3,1,3,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
1455,40,1322,2,4,3,52,2,1,3,2809,2725,2,14,3,4,0,8,2,3,2,2,2,2,0,1,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1
1089,37,674,13,3,1,47,3,2,4,4285,3031,1,17,3,1,0,10,2,3,10,8,3,7,0,1,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1
1007,29,337,14,1,3,84,3,3,4,7553,22930,0,12,3,1,0,9,1,3,8,7,7,7,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1


Unnamed: 0,Attrition
241,0
769,0
1455,0
1089,0
1007,1
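The split sizes printed above follow from the 25% test share: 0.25 × 1470 = 367.5, which is not a whole number. The arithmetic below is a sketch that assumes scikit-learn rounds a fractional test size up and assigns the remaining rows to training; under that assumption it reproduces the 1102/368 split.

```python
import math

n_samples = 1470   # rows in the dataset
test_size = 0.25   # share reserved for testing

# Assumption: the test count is rounded up and training gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 1102 368
```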


In [10]:
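# Estimate the logistic regression model and create predictions in the out-of-sample testing set.
# C is the inverse regularization strength; C=1e9 makes the default L2 penalty negligible.

model = LogisticRegression(C=1e9, max_iter=10000).fit(X_train, y_train.values.ravel())
y_test = y_test.copy()  # work on a copy so the new column is not written to a slice
y_test['Attrition_p'] = model.predict(X_test)

# Attach the predictions to df by index; rows in the training set get NaN
predictions = y_test[['Attrition_p']]
df = pd.merge(df, predictions, left_index=True, right_index=True, how='left')
df.head()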

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MonthlyIncome,MonthlyRate,NumCompaniesWorked,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,BusinessTravel_Travel_Frequently,BusinessTravel_Travel_Rarely,Department_Research & Development,Department_Sales,EducationField_Life Sciences,EducationField_Marketing,EducationField_Medical,EducationField_Other,EducationField_Technical Degree,Gender_Male,JobRole_Human Resources,JobRole_Laboratory Technician,JobRole_Manager,JobRole_Manufacturing Director,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Married,MaritalStatus_Single,OverTime_Yes,Attrition_p
0,41,1,1102,1,2,2,94,3,2,4,5993,19479,8,11,3,1,0,8,0,1,6,4,0,5,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,
1,49,0,279,8,1,3,61,2,2,2,5130,24907,1,23,4,4,1,10,3,3,10,7,1,7,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,
2,37,1,1373,2,2,4,92,2,1,3,2090,2396,6,15,3,2,0,7,3,3,0,0,0,0,0,1,1,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,1,1,1.0
3,33,0,1392,3,4,4,56,3,1,3,2909,23159,1,11,3,3,0,8,3,3,8,7,3,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,
4,27,0,591,2,1,1,40,3,1,2,3468,16632,9,12,3,4,1,6,3,3,2,2,2,2,0,1,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,


In [11]:
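# NOT REQUIRED -- But to view the model summary output, we can use sm.Logit
# Note 'converged: False' below: the optimizer stopped at its iteration limit (35), so read the estimates with caution.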

model = sm.Logit(y_train, X_train).fit()
print(model.summary())

         Current function value: 0.284289
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:              Attrition   No. Observations:                 1102
Model:                          Logit   Df Residuals:                     1057
Method:                           MLE   Df Model:                           44
Date:                Fri, 29 Nov 2024   Pseudo R-squ.:                  0.3571
Time:                        23:09:53   Log-Likelihood:                -313.29
converged:                      False   LL-Null:                       -487.29
Covariance Type:            nonrobust   LLR p-value:                 6.748e-49
                                        coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Age                                  -0.0418      0.017     -2.456      0.014      -0.075      -0.

In [12]:
# Evaluate the model

conf_mat = confusion_matrix(y_test['Attrition'], y_test['Attrition_p'])

# For 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = conf_mat.ravel()

Accuracy = (TN + TP)/(TN + TP + FN + FP)
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)

print(f'True Negative : {TN}')
print(f'True Positive : {TP}')
print(f'False Negative: {FN}')
print(f'False Positive: {FP}')

print()

print(f'Accuracy    : {Accuracy:.3f}')
print(f'Sensitivity : {Sensitivity:.3f}')
print(f'Specificity : {Specificity:.3f}')

True Negative : 297
True Positive : 25
False Negative: 34
False Positive: 12

Accuracy    : 0.875
Sensitivity : 0.424
Specificity : 0.961
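One way to answer the exercise question is to compare the model with a naive benchmark. The sketch below reuses the four counts printed above; the extra measures (baseline accuracy, precision, balanced accuracy) are additions for illustration and are not part of the original solution.

```python
# Counts copied from the confusion matrix output above
TN, TP, FN, FP = 297, 25, 34, 12
n = TN + TP + FN + FP

# Accuracy of always predicting 'No' (the majority class)
baseline_accuracy = (TN + FP) / n

# Share of predicted leavers who actually left
precision = TP / (TP + FP)

# Average of sensitivity and specificity
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
balanced_accuracy = (sensitivity + specificity) / 2

print(f'Baseline accuracy : {baseline_accuracy:.3f}')  # 0.840
print(f'Precision         : {precision:.3f}')          # 0.676
print(f'Balanced accuracy : {balanced_accuracy:.3f}')  # 0.692
```

With these counts, the accuracy of 0.875 is only modestly above the roughly 0.840 obtained by always predicting 'No', and the model identifies fewer than half of the employees who actually left (sensitivity 0.424).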


In [13]:
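# Regress actual attrition on the predicted label in the testing set to check how informative the predictions are
model = sm.Logit(y_test['Attrition'], y_test[['Attrition_p']].assign(_const=1)).fit()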
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.361086
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:              Attrition   No. Observations:                  368
Model:                          Logit   Df Residuals:                      366
Method:                           MLE   Df Model:                            1
Date:                Fri, 29 Nov 2024   Pseudo R-squ.:                  0.1797
Time:                        23:09:53   Log-Likelihood:                -132.88
converged:                       True   LL-Null:                       -162.00
Covariance Type:            nonrobust   LLR p-value:                 2.326e-14
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Attrition_p     2.9013      0.395      7.343      0.000       2.127       3.676
_const         -2.1674    
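
With a single 0/1 regressor and an intercept, the logit model is saturated, so its maximum likelihood estimates can be recovered directly from the confusion matrix counts. The intercept is the log-odds of attrition among employees predicted to stay, and the slope is the change in log-odds for employees predicted to leave. The check below is a sketch using the counts from the evaluation step; it reproduces the coefficients in the summary above up to rounding.

```python
import math

# Counts from the confusion matrix in the evaluation step
TN, TP, FN, FP = 297, 25, 34, 12

# Log-odds of actual attrition when the model predicts 'stay' (Attrition_p = 0)
const = math.log(FN / TN)

# Log-odds of actual attrition when the model predicts 'leave' (Attrition_p = 1)
log_odds_leave = math.log(TP / FP)

# The slope on Attrition_p is the difference between the two log-odds
slope = log_odds_leave - const

print(f'_const      : {const:.4f}')  # -2.1674
print(f'Attrition_p : {slope:.4f}')  # 2.9013
```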