
Challenge

Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

    Vanilla logistic regression
    Ridge logistic regression
    Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn LogisticRegression class has a "penalty" argument that takes either 'l1' or 'l2' as a value. Note that 'l1' only works with a solver that supports it, such as 'liblinear' or 'saga'.
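As a rough illustration of that hint, the sketch below fits the three variants on synthetic data. Everything in it (the `make_classification` data, the `demo_` names, the values of `C`, and the choice to approximate "vanilla" with a very large `C`) is an assumption made for the example, not something prescribed by the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# liblinear supports both 'l1' and 'l2'. A very large C means almost no penalty.
demo_models = {
    "vanilla": LogisticRegression(penalty="l2", C=1e6, solver="liblinear"),
    "ridge": LogisticRegression(penalty="l2", C=0.1, solver="liblinear"),
    "lasso": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
}

demo_results = {}
for name, model in demo_models.items():
    # Scale first so the penalty treats every feature on the same footing.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(Xd_train, yd_train)
    coefs = pipe[-1].coef_.ravel()
    demo_results[name] = {
        "test_accuracy": pipe.score(Xd_test, yd_test),
        "zero_coefs": int((coefs == 0).sum()),
    }
    print(name, demo_results[name])
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty. The L1 model is the only one of the three that can set coefficients exactly to zero.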

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

Record your work and reflections in a notebook to discuss with your mentor.





IMPORT EVERYTHING: because why not!

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline
sns.set_style('white')

IMPORT THE DATA: find shape, null values, column names, and get the head.

In [None]:
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

print(df.shape)
print(df.isnull().sum().sum())  # total number of missing values in the data
print(df.columns)
df.head()

FIND OUT THE DISTRIBUTION OF THE DEPENDENT VARIABLE

In [None]:
df['Attrition'].value_counts()

PRODUCE DUMMY FEATURES FOR SEVERAL INDEPENDENT VARIABLES AND SAVE THEM INTO A NEW DATAFRAME

In [None]:
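# Categorical columns to one-hot encode (this includes the target, 'Attrition').
objlist = df[['Attrition', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']]
# Join the dummies column-wise; DataFrame.append would stack them as extra rows instead.
df2 = pd.concat([df, pd.get_dummies(objlist, drop_first=True, dtype=int)], axis=1)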
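To see what `pd.get_dummies(..., drop_first=True)` produces, and how the result lines up column-wise with the original frame, here is a small sketch on a made-up four-row frame. The `toy` data is invented for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Made-up rows; only the column names mirror the HR dataset.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "Human Resources", "Research & Development", "Sales"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# which avoids perfectly collinear dummy columns.
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_dummies.columns.tolist())

# Joining on axis=1 keeps one row per employee and adds the dummy columns.
toy_wide = pd.concat([toy, toy_dummies], axis=1)
print(toy_wide.shape)
```

The row count stays at four, which is the property to check for after building the real dummy frame.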


CHECK THE NEW DATAFRAME SHAPE

In [None]:
df2.shape

DROP THE ORIGINAL CATEGORICAL VARIABLES (INCLUDING 'Attrition')

In [None]:
df3 = df2.drop(columns=objlist.columns)

FILL IN EMPTY CELLS WITH 0

In [None]:
df3 = df3.fillna(0)
df3.head()


CHECK DISTRIBUTION ON DEPENDENT VARIABLE

In [None]:
df3['Attrition_Yes'].value_counts()

ASSIGN DEPENDENT AND INDEPENDENT VARIABLES TO y AND X, THEN SET UP THE TRAIN/TEST SPLIT

In [None]:
y = df3['Attrition_Yes']
X = df3[['Age', 'DailyRate', 'Department_Research & Development', 'Department_Sales',
        'DistanceFromHome', 'Education', 'EducationField_Life Sciences',
        'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical Degree',
       'EmployeeCount',  'EnvironmentSatisfaction',
       'Gender_Male', 'HourlyRate', 'JobInvolvement', 'JobLevel',
       'JobRole_Human Resources', 'JobRole_Laboratory Technician',
       'JobRole_Manager', 'JobRole_Manufacturing Director',
       'JobRole_Research Director', 'JobRole_Research Scientist',
       'JobRole_Sales Executive', 'JobRole_Sales Representative',
       'JobSatisfaction', 'MaritalStatus_Married', 'MaritalStatus_Single',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 
       'OverTime_Yes', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)



IMPORT AND UTILIZE AN UNDER-SAMPLING METHOD (RandomUnderSampler) TO HELP CREATE A BALANCED DISTRIBUTION IN THE DEPENDENT VARIABLE

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

rus = RandomUnderSampler(random_state=0)
# Resample the training split only, so no test rows leak into the fitted models.
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print(sorted(Counter(y_resampled).items()))
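For intuition about what random under-sampling does, here is a minimal NumPy sketch. The helper `undersample_majority` is hypothetical and written only for this illustration; it is not imbalanced-learn's implementation, just the same idea of keeping every class at the minority-class count.

```python
import numpy as np

def undersample_majority(y, seed=0):
    """Return sorted row positions that keep every class at the minority-class count."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep positions without replacement from each class.
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Toy labels: 8 negatives and 2 positives.
y_toy = np.array([0] * 8 + [1] * 2)
idx = undersample_majority(y_toy)
print(np.bincount(y_toy[idx]))  # prints [2 2]
```

The price of the balanced classes is that most majority-class rows are thrown away, which is worth weighing when the dataset is small.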


RUN THE LOGISTIC REGRESSION ON THE VARIABLES

In [None]:
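# liblinear supports the L1 penalty; C=1000 makes the penalty very weak.
lr = LogisticRegression(penalty='l1', C=1000, solver='liblinear')

# Fit the model.
fit = lr.fit(X_resampled, y_resampled)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)

# Predict on the held-out test set only.
pred_y_sklearn = lr.predict(X_test)

print('\n Confusion matrix (rows: predicted, columns: actual)')
print(pd.crosstab(pred_y_sklearn, y_test))

print('\n Percentage accuracy')
print(lr.score(X_test, y_test))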
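Accuracy alone can hide how a model treats the rarer class, so it may help to read precision and recall off the same kind of crosstab. The sketch below uses made-up label arrays (`y_true_toy`, `y_pred_toy`) purely to show the arithmetic.

```python
import numpy as np
import pandas as pd

# Made-up labels: 1 = attrition, 0 = no attrition.
y_true_toy = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred_toy = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# Rows are predictions, columns are actual values.
table = pd.crosstab(
    pd.Series(y_pred_toy, name="predicted"),
    pd.Series(y_true_toy, name="actual"),
)
tp = table.loc[1, 1]  # predicted 1, actually 1
fp = table.loc[1, 0]  # predicted 1, actually 0
fn = table.loc[0, 1]  # predicted 0, actually 1
tn = table.loc[0, 0]  # predicted 0, actually 0

accuracy = (tp + tn) / table.values.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Here accuracy is 0.75 while precision and recall are both 2/3, which shows how the headline number can look better than the model's handling of the positive class.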

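COMPARE WITH RUNNING A RIDGE REGRESSION ON THE SAME VARIABLES

In [None]:
# Note: linear_model.Ridge is a linear (not logistic) regression,
# so .score() returns R^2 rather than classification accuracy.
ridgeregr = linear_model.Ridge(alpha=10, fit_intercept=False)
ridgeregr.fit(X_resampled, y_resampled)
print(ridgeregr.score(X_train, y_train))
print(ridgeregr.score(X_test, y_test))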

In [None]:
# Ridge regression: tune alpha with 3-fold cross-validation.
ridge = Ridge()

par = {'alpha': [0, 1, 2, 3]}

ridge_regressor = GridSearchCV(ridge, par, scoring='neg_mean_squared_error', cv=3)
ridge_regressor.fit(X_resampled, y_resampled)

print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

# With this scoring, .score() returns the negative mean squared error, not accuracy.
print('\n Negative mean squared error on the test set')
print(ridge_regressor.score(X_test, y_test))
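The same grid-search idea can be applied to a ridge (L2) logistic regression, where the tuned parameter is `C` and the score is a classification metric. This is only a sketch under stated assumptions: the synthetic data, the `C` grid, the 5-fold split, and the use of accuracy as the criterion are all choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix.
X_cv, y_cv = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=1
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", solver="liblinear"),
)

# make_pipeline names each step after its lowercased class name.
c_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, c_grid, scoring="accuracy", cv=5)
search.fit(X_cv, y_cv)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Swapping `penalty="l2"` for `penalty="l1"` gives the lasso counterpart with the same grid, which keeps the comparison between the two models like for like.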






COMPARE WITH A LASSO REGRESSION ON THE SAME VARIABLES

In [None]:
# Lasso is also a linear regression here, so .score() returns R^2 rather than accuracy.
lass = linear_model.Lasso(alpha=0.35)
lassfit = lass.fit(X_resampled, y_resampled)

print(lass.score(X_test, y_test))
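Because `Lasso.score` reports R^2, it is not directly comparable with the accuracy printed for the logistic model. One possible way to put the two on the same scale, sketched here on synthetic data, is to threshold the continuous predictions; the 0.5 cut-off and the `alpha` value are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.metrics import accuracy_score

# Synthetic binary outcome with 15 features.
X_r, y_r = make_classification(
    n_samples=400, n_features=15, n_informative=5, random_state=2
)

lasso_demo = Lasso(alpha=0.05).fit(X_r, y_r)

# R^2 of the continuous predictions against the 0/1 labels.
r2 = lasso_demo.score(X_r, y_r)

# Turn continuous predictions into class labels with an assumed 0.5 cut-off.
labels = (lasso_demo.predict(X_r) >= 0.5).astype(int)
acc = accuracy_score(y_r, labels)

print(round(r2, 3), round(acc, 3))
```

The two numbers measure different things, so a modest R^2 can sit alongside a reasonable accuracy. An L1-penalized `LogisticRegression` avoids the thresholding step altogether.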