## Advanced regression

Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach).

The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: scikit-learn's LogisticRegression class has a "penalty" argument that takes either 'l1' (lasso) or 'l2' (ridge) as a value.
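
As a rough illustration of the hint, the sketch below builds all three variants from the same class by changing only the penalty and C. The synthetic data, the C values, and the names X_demo, y_demo, models, and zero_counts are assumptions made for this example, not part of the challenge. The liblinear solver is used because it accepts both penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own engineered features).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=15, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# The liblinear solver accepts both penalties. The C values are illustrative only.
models = {
    "vanilla": LogisticRegression(penalty="l2", C=1e10, solver="liblinear"),
    "ridge": LogisticRegression(penalty="l2", C=0.1, solver="liblinear"),
    "lasso": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
}

zero_counts = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    # Count coefficients that the penalty shrank exactly to zero.
    zero_counts[name] = int((model.coef_ == 0).sum())
    print(f"{name}: train accuracy {model.score(X_demo, y_demo):.3f}, "
          f"zeroed coefficients {zero_counts[name]}")
```

Smaller C means stronger regularization. With an L1 penalty some coefficients can be driven exactly to zero, which is why lasso is often read as a form of feature selection.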

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

In [1]:
import warnings
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
sns.set_style('white')

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress a harmless SciPy warning.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

warnings.filterwarnings("ignore", category=FutureWarning)

The annual County Health Rankings data (http://www.countyhealthrankings.org/explore-health-rankings) provides health and demographic data for every county in the US. It combines both quantitative data (e.g. life expectancy) and qualitative data (e.g. physical distress frequency) to give a snapshot of each county's health profile. This study ignores most demographic data and focuses on medically related categories. While demographics no doubt play a role in health, their impact is more difficult to interpret.

In [2]:
full_data = pd.read_csv('../data/chr/County_Health_Rankings.csv',index_col=0)

In [3]:
full_data.head()

FIPS,State,County,Life Expectancy,Age-Adjusted Mortality,Child Mortality Rate,Infant Mortality Rate,% Frequent Physical Distress,% Frequent Mental Distress,% Diabetic,HIV Prevalence Rate,% Food Insecure,% Limited Food Access,Drug Overdose Mortality Rate,Motor Vehicle Mortality Rate,% Insufficient Sleep,% Uninsured Adults,% Uninsured Children,% Disconnected Youth,Household Income,% Free or Reduced Lunch,Homicide Rate,Firearm Fatalities Rate,% Homeowners,% Severe Housing Cost Burden,Population,% < 18,% 65 and over,% African American,% American Indian/Alaskan Native,% Asian,% Native Hawaiian/Other Pacific Islander,% Hispanic,% Non-Hispanic White,% Not Proficient in English,% Female,% Rural,Premature Death Years,% Fair/Poor Health,Physically Unhealthy Days,Mentally Unhealthy Days,% Low Birth Wt,% Smokers,% Obese,Food Environment Index,% Physically Inactive,% With Access to Exercise,% Excessive Drinking,% Alcohol-Impaired Deaths,Chlamydia Rate,Teen Birth Rate,% Uninsured,PCP Ratio,Dentist Ratio,MHP Ratio,Preventable Hosp. Rate,% Mammograph Screened,% Flu Vaccinated,HS Graduation Rate,% Some College,% Unemployed,% Children in Poverty,80/20 Income Ratio,% Single-Parent Households,Social Association Rate,Violent Crime Rate,Injury Death Rate,Average Daily Particulate Matter 2.5,Presence of drinking water violation,% Severe Housing Problems,% Drive Alone,% Long Commute - Drives Alone
1001,Alabama,Autauga,76.3,439.0,53.0,8.0,13,13,14,226.0,13,12.0,10.0,20.0,36,11.0,2.0,8.0,58343.0,48.0,5.0,18.0,73,13.0,55504,23.9,15.1,19.3,0.5,1.3,0.1,2.9,74.5,1,51.3,42.0,0.159,18,4.2,4.3,8.0,19,38,7.2,31,69.0,17,29.0,341.2,27.0,9.0,2409,3084,6167,6599.0,44.0,41.0,90.0,61,3.9,19.0,4.6,25.0,12.6,272.0,74.0,11.7,No,15,86,38
1003,Alabama,Baldwin,78.6,348.0,47.0,6.0,13,13,11,164.0,12,5.0,16.0,15.0,33,14.0,3.0,8.0,56607.0,45.0,3.0,14.0,73,13.0,212628,21.8,19.9,9.0,0.8,1.2,0.1,4.6,83.0,0,51.5,42.3,0.034,18,4.1,4.2,8.0,17,31,8.0,24,72.0,17,32.0,338.8,30.0,11.0,1372,2006,1096,3833.0,45.0,45.0,86.0,66,4.0,15.0,4.5,25.0,10.7,204.0,69.0,10.3,Yes,14,85,41
1005,Alabama,Barbour,75.8,470.0,77.0,,16,15,18,436.0,23,11.0,,21.0,39,17.0,3.0,12.0,32490.0,74.0,7.0,15.0,63,14.0,25270,20.8,18.8,47.9,0.7,0.5,0.2,4.2,46.0,1,47.2,67.8,0.379,26,5.1,4.6,11.0,22,44,5.6,28,54.0,13,30.0,557.9,45.0,13.0,2597,2808,12635,4736.0,46.0,37.0,81.0,37,5.9,50.0,5.8,57.0,8.5,414.0,73.0,11.5,No,15,83,34
1007,Alabama,Bibb,73.9,564.0,112.0,15.0,13,13,15,192.0,16,3.0,22.0,25.0,38,12.0,3.0,,45795.0,65.0,8.0,21.0,75,9.0,22668,20.6,16.0,21.5,0.4,0.2,0.1,2.6,74.3,0,46.5,68.4,0.52,20,4.4,4.3,11.0,20,38,7.6,35,16.0,16,27.0,302.1,45.0,10.0,1742,3778,11334,5998.0,44.0,39.0,84.0,48,4.4,27.0,4.3,30.0,10.2,89.0,100.0,11.2,No,11,86,49
1009,Alabama,Blount,74.6,502.0,76.0,6.0,14,14,14,95.0,11,3.0,25.0,26.0,36,16.0,3.0,15.0,48253.0,53.0,7.0,20.0,79,8.0,58013,23.3,17.8,1.5,0.6,0.3,0.1,9.6,86.9,2,50.7,90.0,0.188,21,4.5,4.7,8.0,20,34,8.5,29,23.0,15,22.0,114.3,36.0,12.0,4439,4834,9669,4162.0,36.0,38.0,93.0,54,4.0,19.0,4.1,30.0,9.0,483.0,105.0,11.7,No,10,87,60


In [4]:
# Select medically-related fields
data = pd.DataFrame(index=full_data.index)
data['Life Expectancy'] = full_data['Life Expectancy']
data['Physical Distress Pct'] = full_data['% Frequent Physical Distress']
data['Mental Distress Pct'] = full_data['% Frequent Mental Distress']
data['Diabetic Pct'] = full_data['% Diabetic']
data['HIV Rate'] = full_data['HIV Prevalence Rate']
data['Food Insecure Pct'] = full_data['% Food Insecure']
data['Insufficient Sleep Pct'] = full_data['% Insufficient Sleep']
data['Houshold Income'] = full_data['Household Income']
data['Youth Pct'] = full_data['% < 18']
data['Elderly Pct'] = full_data['% 65 and over']
data['Female Pct'] = full_data['% Female']
data['Poor Health Pct'] = full_data['% Fair/Poor Health']
data['Physically Unhealthy Days'] = full_data['Physically Unhealthy Days']
data['Mentally Unhealthy Days'] = full_data['Mentally Unhealthy Days']
data['Low Birth Weight Pct'] = full_data['% Low Birth Wt']
data['Smoker Pct'] = full_data['% Smokers']
data['Obesity Pct'] = full_data['% Obese']
data['Inactive Pct'] = full_data['% Physically Inactive']
data['Exercise Availability Pct'] = full_data['% With Access to Exercise']
data['Excess Drinker Pct'] = full_data['% Excessive Drinking']
data['Chlamydia Rate'] = full_data['Chlamydia Rate']
data['Teen Birth Rate'] = full_data['Teen Birth Rate']
data['PCP Ratio'] = full_data['PCP Ratio']
data['Dentist Ratio'] = full_data['Dentist Ratio']
data['MHP Ratio'] = full_data['MHP Ratio']
# Positional lookup of '% Mammograph Screened' (16th column from the end); returns a Series
data['Mammograph Pct'] = full_data.iloc[:, -16]
data['Flu Vaccinated Pct'] = full_data['% Flu Vaccinated']

In [5]:
# Check whether data cleaning is needed
data.dtypes

Life Expectancy              float64
Physical Distress Pct          int64
Mental Distress Pct            int64
Diabetic Pct                   int64
HIV Rate                     float64
Food Insecure Pct              int64
Insufficient Sleep Pct         int64
Houshold Income              float64
Youth Pct                    float64
Elderly Pct                  float64
Female Pct                   float64
Poor Health Pct                int64
Physically Unhealthy Days    float64
Mentally Unhealthy Days      float64
Low Birth Weight Pct         float64
Smoker Pct                     int64
Obesity Pct                    int64
Inactive Pct                   int64
Exercise Availability Pct    float64
Excess Drinker Pct             int64
Chlamydia Rate               float64
Teen Birth Rate              float64
PCP Ratio                     object
Dentist Ratio                 object
MHP Ratio                     object
Mammograph Pct               float64
Flu Vaccinated Pct           float64
dtype: object

In [6]:
# Handle non-numeric columns
data['PCP Ratio'] = pd.to_numeric(data['PCP Ratio'], errors='coerce')
data['Dentist Ratio'] = pd.to_numeric(data['Dentist Ratio'], errors='coerce')
data['MHP Ratio'] = pd.to_numeric(data['MHP Ratio'], errors='coerce')
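
To see what errors='coerce' does, here is a minimal sketch on made-up values. The strings in raw are hypothetical, and the actual non-numeric entries in the ratio columns may look different. Anything that cannot be parsed becomes NaN, which the later dropna() call then removes.

```python
import pandas as pd

# Hypothetical raw values; the real columns may contain other non-numeric strings.
raw = pd.Series(["2409", "1372", "n/a", "3,084"], dtype=object)

# Unparseable entries are replaced with NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print("values coerced to NaN:", int(clean.isna().sum()))
```

It is worth counting the coerced values before dropping rows, since a column with many unparseable entries can silently shrink the data set.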

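Several features are strongly collinear (for example, Physical Distress Pct and Physically Unhealthy Days correlate at 0.974). All of them are kept so that the models, particularly the regularized ones, can handle the redundancy.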

In [7]:
data.corr()

Unnamed: 0,Life Expectancy,Physical Distress Pct,Mental Distress Pct,Diabetic Pct,HIV Rate,Food Insecure Pct,Insufficient Sleep Pct,Houshold Income,Youth Pct,Elderly Pct,Female Pct,Poor Health Pct,Physically Unhealthy Days,Mentally Unhealthy Days,Low Birth Weight Pct,Smoker Pct,Obesity Pct,Inactive Pct,Exercise Availability Pct,Excess Drinker Pct,Chlamydia Rate,Teen Birth Rate,PCP Ratio,Dentist Ratio,MHP Ratio,Mammograph Pct,Flu Vaccinated Pct
Life Expectancy,1.0,-0.667,-0.689,-0.642,-0.106,-0.598,-0.526,0.625,-0.159,0.032,-0.122,-0.637,-0.668,-0.635,-0.477,-0.698,-0.545,-0.628,0.36,0.539,-0.326,-0.675,-0.179,-0.258,-0.134,0.346,0.224
Physical Distress Pct,-0.667,1.0,0.938,0.619,0.141,0.684,0.66,-0.687,0.085,-0.075,0.055,0.92,0.974,0.87,0.512,0.806,0.415,0.54,-0.307,-0.644,0.351,0.68,0.202,0.254,0.1,-0.433,-0.261
Mental Distress Pct,-0.689,0.938,1.0,0.66,0.118,0.7,0.673,-0.658,0.027,-0.035,0.148,0.837,0.942,0.941,0.501,0.808,0.432,0.541,-0.284,-0.651,0.335,0.599,0.174,0.229,0.066,-0.373,-0.19
Diabetic Pct,-0.642,0.619,0.66,1.0,0.12,0.54,0.601,-0.561,-0.074,0.221,0.208,0.597,0.637,0.652,0.478,0.615,0.66,0.748,-0.399,-0.653,0.16,0.472,0.241,0.312,0.226,-0.167,-0.137
HIV Rate,-0.106,0.141,0.118,0.12,1.0,0.365,0.346,-0.071,-0.084,-0.164,-0.055,0.236,0.089,0.058,0.416,0.083,0.022,0.087,0.006,-0.098,0.519,0.12,0.02,-0.012,-0.008,-0.044,-0.102
Food Insecure Pct,-0.598,0.684,0.7,0.54,0.365,1.0,0.522,-0.605,-0.029,-0.06,0.097,0.676,0.648,0.606,0.612,0.584,0.377,0.448,-0.259,-0.536,0.528,0.503,0.129,0.156,0.061,-0.297,-0.239
Insufficient Sleep Pct,-0.526,0.66,0.673,0.601,0.346,0.522,1.0,-0.314,-0.011,-0.255,0.103,0.653,0.682,0.689,0.538,0.639,0.432,0.478,-0.119,-0.45,0.329,0.391,0.151,0.198,0.058,-0.202,0.022
Houshold Income,0.625,-0.687,-0.658,-0.561,-0.071,-0.605,-0.314,1.0,0.086,-0.291,0.017,-0.671,-0.652,-0.559,-0.407,-0.596,-0.456,-0.576,0.428,0.534,-0.194,-0.627,-0.199,-0.262,-0.174,0.304,0.392
Youth Pct,-0.159,0.085,0.027,-0.074,-0.084,-0.029,-0.011,0.086,1.0,-0.587,0.18,0.169,0.04,-0.05,-0.08,0.085,0.18,0.066,-0.076,-0.06,0.243,0.345,0.041,-0.004,0.028,-0.223,-0.094
Elderly Pct,0.032,-0.075,-0.035,0.221,-0.164,-0.06,-0.255,-0.291,-0.587,1.0,0.074,-0.145,-0.06,-0.021,-0.057,-0.141,-0.059,0.119,-0.175,-0.161,-0.369,-0.085,0.012,0.068,0.088,0.129,-0.194
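
A full correlation matrix is hard to scan by eye. One possible way to pull out only the strongly correlated pairs is sketched below. The helper high_corr_pairs, the 0.9 threshold, and the toy frame are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame with one nearly duplicated column (hypothetical values).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.1, size=200),
    "b": rng.normal(size=200),
})

result = high_corr_pairs(toy)
print(result)
```

Calling high_corr_pairs(data) on the frame built above would list candidates for removal or merging, such as the distress and unhealthy-days measures.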


A simple measure of health is life expectancy. To create a binary outcome, counties are divided into those above and below the average life expectancy (an unweighted mean across counties, not population-adjusted).

In [8]:
# Use above/below average life expectancy as the binary target
X = data.dropna()
Y = (X['Life Expectancy'] > X['Life Expectancy'].mean()).astype(int)
X = X.drop(columns='Life Expectancy')
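
Splitting at the mean does not guarantee a 50/50 class balance, so it helps to check the class shares before reading accuracy scores. The sketch below applies the same thresholding to the five life expectancy values shown in the head() output. The names life and target are illustrative.

```python
import pandas as pd

# Life expectancy for the first five counties in the head() output above.
life = pd.Series([76.3, 78.6, 75.8, 73.9, 74.6], name="Life Expectancy")

# 1 if the county is above the (unweighted) mean, 0 otherwise.
target = (life > life.mean()).astype(int)

# The share of each class is the accuracy a majority-class guess would reach.
print(target.value_counts(normalize=True))
```

Running the same value_counts call on Y gives the baseline that any of the three models should beat.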

In [9]:
# Create training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
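
The split above is random on every run, so scores will shift slightly between executions. A hedged sketch of a reproducible, class-balanced split follows. The random_state value and the toy arrays are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 60/40 class split (hypothetical).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("test size:", len(y_te), "positive share:", y_te.mean())
```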

Vanilla Logistic Regression:

In [10]:
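log_reg = linear_model.LogisticRegression(C=1e40) # arbitrarily high C effectively disables regularization
log_fit = log_reg.fit(X_train, Y_train)

# 10-fold cross-validation on the full data set; the estimator is cloned and refit for each fold
cv_score = cross_val_score(log_fit, X, Y, cv=10)
# Report mean accuracy and twice the standard deviation across folds
print("Vanilla Logistic Regression Accuracy: %0.5f (+/- %0.5f)" % (cv_score.mean(), cv_score.std() * 2))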

Vanilla Logistic Regression Accuracy: 0.85173 (+/- 0.11117)


Ridge Logistic Regression:

In [11]:
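# Fit ridge (L2) logistic regression, choosing C by 10-fold cross-validation on the training set
ridge_reg = linear_model.LogisticRegressionCV(cv=10,solver='liblinear',penalty='l2',max_iter=1000).fit(X_train,Y_train)
ridge_reg.score(X_test,Y_test)  # held-out accuracy (not displayed, since it is not the cell's last expression)
# Cross-validate on the test split only, using the default number of folds
cv_score = cross_val_score(ridge_reg,X_test,Y_test)
print("Ridge Logistic Regression Accuracy: %0.5f (+/- %0.5f)" % (cv_score.mean(), cv_score.std() * 2))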

Ridge Logistic Regression Accuracy: 0.84213 (+/- 0.02063)
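
Because the penalty acts on coefficient size, features measured on very different scales (household income in dollars next to percentages) are penalized unevenly. One common remedy, sketched here on synthetic data, is to standardize inside a pipeline so the scaler is refit on each training fold. The data, the name ridge_pipe, and the choice of cv=5 are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a dollar-valued feature.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_demo[:, 0] *= 50_000

# Scaling happens inside the pipeline, so it never sees data it is scored on.
ridge_pipe = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(cv=5, penalty="l2", solver="liblinear", max_iter=1000),
)
ridge_pipe.fit(X_demo, y_demo)

# np.ravel copes with C_ being stored either as an array or as a scalar.
chosen_C = float(np.ravel(ridge_pipe[-1].C_)[0])
train_acc = ridge_pipe.score(X_demo, y_demo)
print(f"selected C: {chosen_C:.4g}, training accuracy: {train_acc:.3f}")
```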


Lasso Logistic Regression:

In [12]:
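# Fit lasso (L1) logistic regression, choosing C by 10-fold cross-validation on the training set
lasso_reg = linear_model.LogisticRegressionCV(cv=10,solver='liblinear',penalty='l1',max_iter=1000).fit(X_train,Y_train)
lasso_reg.score(X_test,Y_test)  # held-out accuracy (not displayed, since it is not the cell's last expression)
# Cross-validate on the test split only, using the default number of folds
cv_score = cross_val_score(lasso_reg,X_test,Y_test)
print("Lasso Logistic Regression Accuracy: %0.5f (+/- %0.5f)" % (cv_score.mean(), cv_score.std() * 2))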

Lasso Logistic Regression Accuracy: 0.82897 (+/- 0.02057)


Conclusions:

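Of the three models, the vanilla logistic regression yields the highest mean accuracy (0.852) but also by far the widest spread in cross-validation scores (+/- 0.111). Its accuracy is about 0.010 higher than the ridge regression's and about 0.023 higher than the lasso regression's. The ridge and lasso cross-validation scores vary by a similar, much smaller amount (about +/- 0.021 each), so of the three, the ridge regression offers the best combination of accuracy and robustness. One caveat: the vanilla model was cross-validated with 10 folds on the full data set, while the ridge and lasso models were cross-validated on the test split only, so the scores are not strictly comparable.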
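
A comparison like this is easier to trust when every model is scored on identical folds. The following is a sketch of one way to do that, not the procedure used for the numbers reported here. The synthetic data, the C values, and the names folds, candidates, and results are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with the real X and Y).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, random_state=1
)

# One fixed set of folds, reused for every model.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

candidates = {
    "vanilla": LogisticRegression(penalty="l2", C=1e10, solver="liblinear"),
    "ridge": LogisticRegression(penalty="l2", C=0.1, solver="liblinear"),
    "lasso": LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
}

results = {}
for name, estimator in candidates.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=folds)
    results[name] = (scores.mean(), scores.std(), len(scores))
    print(f"{name}: {scores.mean():.3f} (+/- {2 * scores.std():.3f})")
```

With shared folds, differences in the mean and spread reflect the models rather than the particular rows each one happened to be scored on.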