# Challenge
Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). The models should be:

1) Vanilla logistic regression

2) Ridge logistic regression

3) Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn `LogisticRegression` class has a `penalty` argument that takes either 'l1' or 'l2' as a value.
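
As a minimal sketch of that hint, the three models can be set up side by side. The data here is synthetic (`make_classification` is a stand-in, not the challenge dataset), and the `C` values are illustrative assumptions. The `liblinear` solver is specified because it supports both penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; swap in your own features and target).
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)

models = {
    # A very large C makes the penalty negligible, approximating an unpenalized fit.
    "vanilla": LogisticRegression(penalty="l2", C=1e9, solver="liblinear"),
    # L2 penalty (ridge) shrinks coefficients toward zero.
    "ridge": LogisticRegression(penalty="l2", C=1.0, solver="liblinear"),
    # L1 penalty (lasso) can set some coefficients exactly to zero.
    "lasso": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
}

train_accuracy = {}
for name, model in models.items():
    model.fit(X, y)
    train_accuracy[name] = model.score(X, y)
    print(name, round(train_accuracy[name], 3))
```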

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

Record your work and reflections in a notebook to discuss with your mentor.

In [3]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [7]:
#https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
df = pd.read_csv('~/thinkful_mac/thinkful_large_files/breast_cancer_data.csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


# Attribute Information:
#1) ID number
#2) Diagnosis (M = malignant, B = benign)
#3-32 Ten real-valued features are computed for each cell nucleus:
#a) radius (mean of distances from center to points on the perimeter)
#b) texture (standard deviation of gray-scale values)
#c) perimeter
#d) area
#e) smoothness (local variation in radius lengths)
#f) compactness (perimeter^2 / area - 1.0)
#g) concavity (severity of concave portions of the contour)
#h) concave points (number of concave portions of the contour)
#i) symmetry
#j) fractal dimension ("coastline approximation" - 1)

# Additional Information
This dataset has 10 base measurements, each reported in three versions (mean, standard error, and 'worst', meaning the mean of the three largest values), for 30 features in total.

Because many of these features are closely related (radius, perimeter, and area all measure size, for example), the data should show some multicollinearity. That means ridge regression should come in handy here, and we could see larger-than-usual coefficients in the regular logistic regression model.
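
To illustrate why size-related features end up collinear, here is a small sketch on synthetic values (assumed radii with 2% noise, not the real measurements). Perimeter and area are both functions of radius, so their pairwise correlations are close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic nuclei (an assumption, not the real data): radii plus small noise.
radius = rng.uniform(6.0, 28.0, size=200)
perimeter = 2 * np.pi * radius * (1 + rng.normal(0, 0.02, size=200))
area = np.pi * radius**2 * (1 + rng.normal(0, 0.02, size=200))

# Rows are variables: radius, perimeter, area.
corr = np.corrcoef([radius, perimeter, area])
print(np.round(corr, 3))
```

On the actual DataFrame, `df.corr()` gives the same kind of matrix for all features at once.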

In [8]:
#Convert diagnosis to binary from M (malignant) and B (benign)
df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

In [9]:
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

In [10]:
#print(df['Unnamed: 32'].unique())
#All values in this column are NaN, so drop it, and also drop the ID column

df.drop('Unnamed: 32', axis = 1, inplace = True)
df.drop('id', axis = 1, inplace = True)

In [12]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [33]:
from sklearn.linear_model import LogisticRegression

#REGULAR LOGISTIC REGRESSION

predictors = df.iloc[:, 1:]
targets = df['diagnosis']

#Set the C value very high such that there is essentially no regularization taking place (C is inverse of regularization strength)
lr = LogisticRegression(C = 1e9)

# Fit the model.
fit = lr.fit(predictors, targets)

print('Overall Train Percentage Accuracy')
print(lr.score(predictors, targets),'\n')

X = predictors
y = targets


from sklearn.model_selection import KFold

test_no = 1
kf = KFold(n_splits=5) 

acc = []

for train_index, test_index in kf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    #Refit the model defined above on this fold's training split
    lr.fit(X_train, y_train)
    fold_acc = lr.score(X_test, y_test)
    print('Accuracy: Fold', test_no)
    print(fold_acc)
    acc.append(fold_acc)
    test_no = test_no + 1

print('\nAverage CV accuracy: \n',np.mean(acc))    

#Store the parameter estimates (from the fit on the last fold's training split).
origparams = np.append(lr.coef_, lr.intercept_)
print('\nParameter estimates: \n')
print(origparams)

Overall Train Percentage Accuracy
0.9753954305799648 

Accuracy: Fold 1
0.9210526315789473
Accuracy: Fold 2
0.9649122807017544
Accuracy: Fold 3
0.9736842105263158
Accuracy: Fold 4
0.9736842105263158
Accuracy: Fold 5
0.9557522123893806

Average CV accuracy: 
 0.9578171091445429

Parameter estimates: 

[-4.80658975 -0.03294645  0.13337399  0.02657818  0.91058688  0.89125386
  1.73859178  1.68164055  1.01864149  0.01391087  0.14729246 -2.94092035
  0.50738884  0.10992636  0.14041741 -0.79178585 -0.86491466  0.18585906
  0.11011728 -0.15773857 -0.06919903  0.45854925  0.10663872  0.01677184
  1.92657661  2.16735943  3.78477668  3.35400848  2.80998972  0.23977778
 -1.22668934]
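
For reference, the role of `C` can be read off the objective that scikit-learn documents for penalized logistic regression. It is written here for labels $y_i \in \{0, 1\}$, up to constant scaling and solver-specific details:

$$
\min_{w,\,b}\; r(w) + C \sum_{i=1}^{n} \left[ -y_i \log \hat{p}_i - (1 - y_i) \log\left(1 - \hat{p}_i\right) \right],
\qquad \hat{p}_i = \frac{1}{1 + e^{-(x_i^\top w + b)}}
$$

with $r(w) = \tfrac{1}{2}\lVert w \rVert_2^2$ for the 'l2' penalty and $r(w) = \lVert w \rVert_1$ for 'l1'. A large $C$ weights the data-fit term heavily, so the penalty has little effect, which is why `C = 1e9` approximates an unpenalized fit.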



In [34]:
#Now try ridge regression to see if adding a regularization penalty can help improve accuracy

#RIDGE REGRESSION (L2)

from sklearn.linear_model import LogisticRegressionCV

#LogisticRegressionCV picks C by cross-validation over a grid of candidate values (10 by default)
lr = LogisticRegressionCV(penalty='l2', solver = 'liblinear', cv = 5)

# Fit the model.
fit = lr.fit(predictors, targets)

print('Overall Percentage Accuracy')
print(lr.score(predictors, targets),'\n')

X = predictors
y = targets


from sklearn.model_selection import KFold

test_no = 1
kf = KFold(n_splits=5) 

acc = []

for train_index, test_index in kf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    #Refit the model defined above on this fold's training split
    lr.fit(X_train, y_train)
    fold_acc = lr.score(X_test, y_test)
    print('Accuracy: Fold', test_no)
    print(fold_acc)
    acc.append(fold_acc)
    test_no = test_no + 1

print('\nAverage CV accuracy: \n',np.mean(acc)) 
    
#Store the parameter estimates (from the fit on the last fold's training split).
ridgeparams = np.append(lr.coef_, lr.intercept_)
print('\nParameter estimates: \n')
print(ridgeparams)

Overall Percentage Accuracy
0.961335676625659 

Accuracy: Fold 1
0.9122807017543859
Accuracy: Fold 2
0.9736842105263158
Accuracy: Fold 3
0.9736842105263158
Accuracy: Fold 4
0.9649122807017544
Accuracy: Fold 5
0.9469026548672567

Average CV accuracy: 
 0.9542928116752056

Parameter estimates: 

[-2.41673843  0.03787475 -0.01877681  0.0069718   0.19397394  0.49799715
  0.72981043  0.40337224  0.24143464  0.03929995  0.01610378 -1.71540785
 -0.06201418  0.11453601  0.02151266 -0.00315313  0.04596985  0.04851242
  0.05267721 -0.01000586 -1.48318618  0.30102184  0.22551172  0.02496025
  0.37927817  1.57091275  2.07514229  0.82097632  0.81770313  0.17115496
 -0.43111529]
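
The output above does not show which `C` the cross-validation selected. A minimal sketch of how to inspect it follows, using synthetic stand-in data (an assumption) rather than the notebook's DataFrame. `Cs_` holds the candidate grid and `C_` the selected value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in data (an assumption; use predictors/targets in the notebook).
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)

cv_model = LogisticRegressionCV(Cs=10, penalty="l2", solver="liblinear", cv=5)
cv_model.fit(X, y)

# np.ravel handles C_ whether it is stored as a scalar or a length-1 array.
selected_C = float(np.ravel(cv_model.C_)[0])
print("candidate C values:", cv_model.Cs_)
print("selected C:", selected_C)
```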


In [35]:
#LASSO REGRESSION (L1)

#LogisticRegressionCV searches a grid of C values and keeps the one with the best cross-validated score

from sklearn.linear_model import LogisticRegressionCV

#The L1 penalty needs a solver that supports it, such as liblinear
lr = LogisticRegressionCV(penalty='l1', solver = 'liblinear', cv = 5)

# Fit the model.
fit = lr.fit(predictors, targets)

print('Overall Percentage Accuracy')
print(lr.score(predictors, targets),'\n')

X = predictors
y = targets


from sklearn.model_selection import KFold

test_no = 1
kf = KFold(n_splits=5) 

acc = []

for train_index, test_index in kf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    #Refit the model defined above on this fold's training split
    lr.fit(X_train, y_train)
    fold_acc = lr.score(X_test, y_test)
    print('Accuracy: Fold', test_no)
    print(fold_acc)
    acc.append(fold_acc)
    test_no = test_no + 1

print('\nAverage CV accuracy: \n',np.mean(acc)) 
    
#Store the parameter estimates (from the fit on the last fold's training split).
lassoparams = np.append(lr.coef_, lr.intercept_)
print('\nParameter estimates: \n')
print(lassoparams)

Overall Percentage Accuracy
0.9859402460456942 

Accuracy: Fold 1
0.956140350877193
Accuracy: Fold 2
0.9385964912280702
Accuracy: Fold 3
0.9649122807017544
Accuracy: Fold 4
0.9824561403508771
Accuracy: Fold 5
0.9823008849557522

Average CV accuracy: 
 0.9648812296227295

Parameter estimates: 

[-2.68117973e+00  2.90861566e-01 -3.38374265e-01  3.45263294e-02
  0.00000000e+00 -1.72265722e+01  4.86805790e+00  6.69499875e+01
 -9.87821851e+00  0.00000000e+00  2.17671461e+00 -1.23688593e+00
 -4.17604240e-01  1.96898203e-01  0.00000000e+00 -2.76322789e+01
 -3.92602835e+01  0.00000000e+00  0.00000000e+00  0.00000000e+00
  1.83615593e-02  2.45361853e-01  1.07326959e-01  1.64501828e-02
  4.83319433e+01 -4.11880539e+00  1.10407917e+01  4.79138185e+01
  6.98151052e+00  0.00000000e+00 -7.95429525e+00]
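
One way to read the L1 estimates above is to list which coefficients were driven exactly to zero. The sketch below copies the printed values, treats the last entry as the intercept, and assumes the feature order matches the `df.columns` listing shown earlier.

```python
# Feature names in the order shown by df.columns (id, diagnosis, Unnamed: 32 removed).
base = ["radius", "texture", "perimeter", "area", "smoothness", "compactness",
        "concavity", "concave points", "symmetry", "fractal_dimension"]
features = [f"{name}_{suffix}" for suffix in ("mean", "se", "worst") for name in base]

# L1 parameter estimates copied from the printed output; the last value is the intercept.
lasso_params = [
    -2.68117973e+00, 2.90861566e-01, -3.38374265e-01, 3.45263294e-02,
    0.0, -1.72265722e+01, 4.86805790e+00, 6.69499875e+01,
    -9.87821851e+00, 0.0, 2.17671461e+00, -1.23688593e+00,
    -4.17604240e-01, 1.96898203e-01, 0.0, -2.76322789e+01,
    -3.92602835e+01, 0.0, 0.0, 0.0,
    1.83615593e-02, 2.45361853e-01, 1.07326959e-01, 1.64501828e-02,
    4.83319433e+01, -4.11880539e+00, 1.10407917e+01, 4.79138185e+01,
    6.98151052e+00, 0.0, -7.95429525e+00,
]
coefs, intercept = lasso_params[:-1], lasso_params[-1]

# Features whose coefficient the L1 penalty set exactly to zero.
dropped = [name for name, c in zip(features, coefs) if c == 0.0]
print(f"{len(dropped)} of {len(coefs)} coefficients are exactly zero:")
for name in dropped:
    print(" ", name)
```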


# Discussion: 

I chose to use the LogisticRegressionCV class because it has built-in cross-validation that selects the regularization parameter C from a grid of candidate values.

Overall, I would opt for the lasso (L1) model fit with LogisticRegressionCV. It had the highest average cross-validation accuracy (96.5%, versus 95.8% for the unregularized model and 95.4% for ridge) and the narrowest spread across folds, with accuracy ranging from ~94% to ~98%.
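
The comparison can be checked directly from the per-fold accuracies printed above. This sketch copies those numbers and summarizes each model with the standard library; the dictionary keys are labels chosen here for illustration.

```python
from statistics import mean, stdev

# Per-fold test accuracies copied from the three outputs above.
fold_accuracy = {
    "vanilla": [0.9210526315789473, 0.9649122807017544, 0.9736842105263158,
                0.9736842105263158, 0.9557522123893806],
    "ridge": [0.9122807017543859, 0.9736842105263158, 0.9736842105263158,
              0.9649122807017544, 0.9469026548672567],
    "lasso": [0.956140350877193, 0.9385964912280702, 0.9649122807017544,
              0.9824561403508771, 0.9823008849557522],
}

summary = {
    name: {"mean": mean(scores), "std": stdev(scores),
           "min": min(scores), "max": max(scores)}
    for name, scores in fold_accuracy.items()
}

for name, s in summary.items():
    print(f"{name:8s} mean={s['mean']:.4f} std={s['std']:.4f} "
          f"range=[{s['min']:.4f}, {s['max']:.4f}]")

# Model with the highest mean cross-validation accuracy.
best = max(summary, key=lambda n: summary[n]["mean"])
print("best by mean accuracy:", best)
```

With only five folds and differences of about one percentage point, the ranking is suggestive rather than conclusive.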

As a next step, one could fit a random forest to identify the most important features, then re-run the standard and ridge regression models on the reduced feature set to see whether test performance improves.
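
A rough sketch of that idea, again on synthetic stand-in data (an assumption): rank features by a random forest's `feature_importances_`, then keep the top few. The cutoff of 5 features and the forest settings are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; use predictors/targets in the notebook).
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Column indices sorted from most to least important.
ranked = np.argsort(forest.feature_importances_)[::-1]
top_k = ranked[:5]  # illustrative cutoff
X_reduced = X[:, top_k]

print("top feature indices:", top_k)
print("reduced shape:", X_reduced.shape)
```

To keep the evaluation honest, the ranking would need to be computed inside each training fold rather than on the full dataset.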