## Day 35 Lecture 1 Assignment

In this assignment, we will learn about gradient boosting. We will load a dataset describing patient survival after breast cancer surgery, fit a gradient boosted classifier to it, and analyze the resulting model.

In [19]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [20]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [21]:
cancer.head()

|   | age | op_year | nodes | survival |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |


Check for missing data and remove all rows containing missing data

In [22]:
# answer below:
# dropna() returns a new DataFrame, so assign the result back to keep it.
cancer = cancer.dropna()
cancer


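|   | age | op_year | nodes | survival |
|---|---|---|---|---|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |
| ... | ... | ... | ... | ... |
| 301 | 75 | 62 | 1 | 1 |
| 302 | 76 | 67 | 0 | 1 |
| 303 | 77 | 65 | 3 | 1 |
| 304 | 78 | 65 | 1 | 2 |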
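
The prompt asks for a check as well as a removal. A minimal sketch of both steps follows, using a made-up toy frame (`toy` and its values are hypothetical and only reuse the column names from `cancer`), so it runs without downloading the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as `cancer`.
toy = pd.DataFrame({
    'age': [30, 31, np.nan, 45],
    'op_year': [64, 59, 65, 60],
    'nodes': [1, 2, 0, np.nan],
    'survival': [1, 1, 2, 2],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() does not modify the frame in place; keep the returned copy.
toy_clean = toy.dropna()
print(len(toy), '->', len(toy_clean), 'rows')
```

On the real data the same two calls would be applied to `cancer`; if every count is zero, `dropna()` leaves the frame unchanged.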


Adjust the target variable so that it has values of either 0 or 1

In [25]:
# Recode 2 (died within 5 years) as 0 so the target takes the values 0 and 1.
cancer.loc[cancer.survival == 2, "survival"] = 0

In [26]:
cancer.describe()

|   | age | op_year | nodes | survival |
|---|---|---|---|---|
| count | 306.0 | 306.0 | 306.0 | 306.0 |
| mean | 52.457516 | 62.852941 | 4.026144 | 0.735294 |
| std | 10.803452 | 3.249405 | 7.189654 | 0.441899 |
| min | 30.0 | 58.0 | 0.0 | 0.0 |
| 25% | 44.0 | 60.0 | 0.0 | 0.0 |
| 50% | 52.0 | 63.0 | 1.0 | 1.0 |
| 75% | 60.75 | 65.75 | 4.0 | 1.0 |
| max | 83.0 | 69.0 | 52.0 | 1.0 |


In [27]:
cancer[cancer['survival'] == 0]

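|   | age | op_year | nodes | survival |
|---|---|---|---|---|
| 7 | 34 | 59 | 0 | 0 |
| 8 | 34 | 66 | 9 | 0 |
| 24 | 38 | 69 | 21 | 0 |
| 34 | 39 | 66 | 0 | 0 |
| 43 | 41 | 60 | 23 | 0 |
| ... | ... | ... | ... | ... |
| 286 | 70 | 58 | 4 | 0 |
| 293 | 72 | 63 | 0 | 0 |
| 299 | 74 | 65 | 3 | 0 |
| 304 | 78 | 65 | 1 | 0 |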


Split the data into train and test (20% in test)

In [29]:
# Set the target column, the test-set fraction, and the DataFrame to split.
# These could later become the parameters of a helper function (dataframe, col, size=0.2).

target = 'survival'
SIZE = 0.2
df = cancer

y = df[target]
X = df.drop(columns=[target])

from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=SIZE)
print('There are {:d} training samples and {:d} test samples'.format(X_train.shape[0], X_test.shape[0]))


There are 244 training samples and 62 test samples
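
The split above is random on every run, and about 73% of the patients are in class 1. One possible variation is sketched below on synthetic data: `stratify` keeps the class ratio the same in both parts, and `random_state` makes the split repeatable. The seed value 42 and the generated `toy` frame are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `cancer` (306 rows, roughly 73% in class 1).
rng = np.random.default_rng(0)
n = 306
toy = pd.DataFrame({
    'age': rng.integers(30, 84, size=n),
    'op_year': rng.integers(58, 70, size=n),
    'nodes': rng.integers(0, 20, size=n),
})
toy['survival'] = (rng.random(n) < 0.73).astype(int)

X_toy = toy.drop(columns=['survival'])
y_toy = toy['survival']

# stratify preserves the class proportions; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print('train share of class 1:', round(y_tr.mean(), 3))
print('test share of class 1: ', round(y_te.mean(), 3))
```

With a fixed seed, the confusion matrices in later cells would stay the same between runs, which makes the learning-rate comparisons easier to interpret.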


Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [31]:
# answer below:
# Gradient boosting with 5-deep trees and a learning rate of 0.01.
# n_estimators is left at its default (100 boosting iterations).
from sklearn import ensemble

params = {'max_depth': 5,
          'learning_rate': 0.01}

# The features are not scaled: tree-based models split on thresholds,
# so standardizing the inputs is not required here.
# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

predict_train = clf.predict(X_train)
predict_test = clf.predict(X_test)


In [32]:
clf.score(X_train, y_train)

0.8770491803278688

In [33]:
clf.score(X_test, y_test)

0.7258064516129032

Print the confusion matrix for the test data. What do you notice about our predictions?

In [35]:
# answer below:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

confusion_matrix(y_train, predict_train)

array([[ 34,  29],
       [  1, 180]])

In [36]:
confusion_matrix(y_test, predict_test)

array([[ 3, 15],
       [ 2, 42]])

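scikit-learn puts the true classes in rows and the predicted classes in columns, so with labels 0 and 1 the matrix reads [[TN, FP], [FN, TP]], taking 1 (survived) as the positive class. On the test set there are 42 true positives and only 2 false negatives, which is where most of the accuracy comes from.

There are few class-0 patients, and the model does poorly on them: 15 of the 18 are predicted as survivors (false positives). The test accuracy of 45/62 ≈ 0.73 is barely above the 44/62 ≈ 0.71 obtained by always predicting class 1.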
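
To keep the four cells straight, the test matrix printed above can be unpacked by name. This is a small sketch using only the numbers already shown; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix from the cell above (learning_rate=0.01).
cm_test = np.array([[3, 15],
                    [2, 42]])

# Rows are true labels, columns are predicted labels, so with labels
# [0, 1] the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm_test.ravel()

accuracy = (tp + tn) / cm_test.sum()
recall_class1 = tp / (tp + fn)                  # share of survivors found
recall_class0 = tn / (tn + fp)                  # share of non-survivors found
majority_baseline = (fn + tp) / cm_test.sum()   # always predicting class 1

print(f'accuracy          {accuracy:.3f}')
print(f'recall, class 1   {recall_class1:.3f}')
print(f'recall, class 0   {recall_class0:.3f}')
print(f'majority baseline {majority_baseline:.3f}')
```

The gap between the two recall values is the imbalance noted above: the model finds almost all survivors but only a small share of non-survivors.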

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [37]:
# answer below:
params = {'max_depth': 5,          
          'learning_rate': 1}

# Initialize and fit the model with a learning rate of 1.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

predict_train = clf.predict(X_train)
predict_test = clf.predict(X_test)


In [38]:
confusion_matrix(y_train, predict_train)

array([[ 63,   0],
       [  2, 179]])

In [39]:
confusion_matrix(y_test, predict_test)

array([[ 5, 13],
       [10, 34]])

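With a learning rate of 1 the balance shifts toward the other class: the model now predicts class 0 for 15 test patients instead of 5, so false negatives rise from 2 to 10 while false positives only drop from 15 to 13. The training set is fit almost perfectly (2 errors out of 244), but test accuracy falls to 39/62 ≈ 0.63, which points to overfitting.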

In [40]:
# answer below:
params = {'max_depth': 5,          
          'learning_rate': 0.5}

# Initialize and fit the model with a learning rate of 0.5.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

predict_train = clf.predict(X_train)
predict_test = clf.predict(X_test)


In [41]:
confusion_matrix(y_train, predict_train)

array([[ 61,   2],
       [  0, 181]])

In [54]:
cm = confusion_matrix(y_test, predict_test)

I still see quite a bit of bias toward the majority class (1).

In [51]:
cm[0][0:2]

array([ 5, 13])
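
The three learning rates can also be compared in one loop rather than by refitting cell by cell. The sketch below does this on synthetic data (the `X_syn`/`y_syn` arrays, the seeds, and the noise level are all assumptions), so its numbers will not match the ones reported in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, noisy, imbalanced binary problem with three features.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = (X_syn[:, 0] + rng.normal(scale=1.5, size=300) > -0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

results = {}
for lr in (0.01, 0.5, 1.0):
    model = GradientBoostingClassifier(max_depth=5, learning_rate=lr,
                                       random_state=0)
    model.fit(X_tr, y_tr)
    results[lr] = {
        'train_acc': model.score(X_tr, y_tr),
        'test_acc': model.score(X_te, y_te),
        'test_cm': confusion_matrix(y_te, model.predict(X_te), labels=[0, 1]),
    }
    print(lr, round(results[lr]['train_acc'], 3),
          round(results[lr]['test_acc'], 3))
    print(results[lr]['test_cm'])
```

A large gap between training and test accuracy at the higher learning rates would be the same overfitting pattern seen in the matrices above.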

In [53]:
precision_score(y_train, predict_train)

1.0

In [55]:
recall_score(y_train, predict_train)

1.0

In [58]:
from sklearn.metrics import classification_report

print(classification_report(y_train, predict_train))

              precision    recall  f1-score   support

           0       1.00      0.97      0.98        63
           1       0.99      1.00      0.99       181

    accuracy                           0.99       244
   macro avg       0.99      0.98      0.99       244
weighted avg       0.99      0.99      0.99       244
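
The per-class rows of the report can be reproduced from a confusion matrix. The sketch below assumes the report corresponds to the training matrix printed for a learning rate of 0.5, `[[61, 2], [0, 181]]`, and rebuilds label arrays with those counts. Note that `precision_score` and `recall_score` describe class 1 unless `pos_label` is changed.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Label arrays reconstructed from the counts [[61, 2], [0, 181]].
y_true = np.array([0] * 63 + [1] * 181)
y_pred = np.array([0] * 61 + [1] * 2 + [1] * 181)

# pos_label defaults to 1, i.e. the "survived" class.
prec_1 = precision_score(y_true, y_pred)
rec_1 = recall_score(y_true, y_pred)

# Pass pos_label=0 to score the minority class instead.
prec_0 = precision_score(y_true, y_pred, pos_label=0)
rec_0 = recall_score(y_true, y_pred, pos_label=0)

print(f'class 1: precision {prec_1:.2f}, recall {rec_1:.2f}')
print(f'class 0: precision {prec_0:.2f}, recall {rec_0:.2f}')
```

Running the same calls on `y_test` and `predict_test` would show how much of this carries over to unseen data.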



Perform a grid search for the optimal learning rate. Instead of accuracy, use a metric that will help your model predict the positive class.

In [59]:
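# answer below:
from sklearn.model_selection import GridSearchCV

# Note: with 0/1 labels, neg_mean_absolute_error is the negative error rate,
# so it ranks models exactly as accuracy does. A class-focused scorer such as
# 'recall', 'precision', or 'f1' would target the positive class directly.
parameters = {'learning_rate':[0.8, 1, 0.2], 'max_features':['auto', 'sqrt', 'log2']}
result = GridSearchCV(clf, parameters, scoring='neg_mean_absolute_error')
result.fit(X_train, y_train)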


GridSearchCV(cv=None, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.5,
                                                  loss='deviance', max_depth=5,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
      

In [67]:
result.best_params_

{'learning_rate': 0.8, 'max_features': 'sqrt'}
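
A search that scores on the positive class could look like the sketch below. It uses `scoring='recall'` on synthetic data; the grid values, `cv=5`, the tree settings, and the `X_syn`/`y_syn` arrays are assumptions chosen to keep the example small, not the settings used above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem with three features.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = (X_syn[:, 0] + rng.normal(scale=1.0, size=200) > -0.5).astype(int)

param_grid = {'learning_rate': [0.01, 0.1, 0.5]}

# 'recall' scores each candidate by the share of class-1 samples it finds.
search = GridSearchCV(
    GradientBoostingClassifier(max_depth=3, n_estimators=50, random_state=0),
    param_grid,
    scoring='recall',
    cv=5,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Recall alone can reward a model that predicts class 1 for almost everyone, so `'f1'` or `'precision'` may be a better choice when false positives matter too.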

List the feature importances for the model with the optimal learning rate.

In [64]:
print(result.best_estimator_.feature_importances_)

[0.46905728 0.20526279 0.32567993]
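
The raw array is easier to read with feature names attached. The sketch below pairs the printed values with the column order of `X` (`age`, `op_year`, `nodes`, taken from `cols` with the target removed) and sorts them.

```python
import pandas as pd

# Column order of X and the importances printed above.
feature_names = ['age', 'op_year', 'nodes']
importances = [0.46905728, 0.20526279, 0.32567993]

# Label each value and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In the notebook itself, `X_train.columns` and `result.best_estimator_.feature_importances_` could be passed in directly instead of the hard-coded lists.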
