## Day 35 Lecture 1 Assignment

In this assignment, we will practice gradient boosting. We will load a dataset describing patient survival after breast cancer surgery, fit a gradient boosting classifier to it, and analyze the resulting model.

In [None]:
%matplotlib inline

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)
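
The file has no header row, which is why `names=cols` is passed to `read_csv`. As a minimal offline sketch of the same call, the snippet below parses an inline string instead of the URL. The five rows are the ones shown by `cancer.head()` in this notebook; the variable names `raw` and `sample` are illustrative only.

```python
import io

import pandas as pd

# First five records of the dataset, as displayed by cancer.head().
raw = "30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n31,65,4,1\n"

cols = ['age', 'op_year', 'nodes', 'survival']

# names=cols supplies the column labels because the data has no header line.
sample = pd.read_csv(io.StringIO(raw), names=cols)
print(sample)
```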

In [None]:
cancer.head()

   age  op_year  nodes  survival
0   30       64      1         1
1   30       62      3         1
2   30       65      0         1
3   31       59      2         1
4   31       65      4         1


Check for missing data and remove any rows that contain missing values.

In [None]:
cancer.isnull().sum()

age         0
op_year     0
nodes       0
survival    0
dtype: int64

In [None]:
# answer below:
# There are no missing values
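
This dataset has no missing values, so nothing needs to be removed. For reference, here is a minimal sketch of what the removal step could look like if some were present. The frame `toy` and its `NaN` entries are invented for illustration and are not part of the Haberman data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names; the NaN values are invented.
toy = pd.DataFrame({
    'age': [30, 31, np.nan, 34],
    'op_year': [64, 59, 65, 60],
    'nodes': [1, 2, 0, np.nan],
    'survival': [1, 1, 2, 2],
})

# Count missing values per column, as in the cell above.
missing_per_column = toy.isnull().sum()

# Drop every row that has at least one missing value.
toy_clean = toy.dropna()

print(missing_per_column)
print(toy_clean.shape)
```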


Adjust the target variable so that it has values of either 0 or 1

In [None]:
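# answer below:
# Note: map() returns a new Series and this result is never assigned, so
# cancer['survival'] keeps its original labels 1 and 2 in the cells that follow.
# To persist the recoding, use: cancer['survival'] = cancer['survival'].map({1: 1, 2: 0})
cancer.survival.map({1:1, 2:0})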

0      1
1      1
2      1
3      1
4      1
      ..
301    1
302    1
303    1
304    0
305    0
Name: survival, Length: 306, dtype: int64
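
The predictions printed later in this notebook still contain the label 2, which suggests the recoded values were displayed but not stored. A small sketch of the difference, using an invented five-row frame named `toy`:

```python
import pandas as pd

# Invented stand-in for the survival column.
toy = pd.DataFrame({'survival': [1, 1, 2, 1, 2]})

# map() returns a new Series; the frame is unchanged until the result is assigned.
mapped = toy['survival'].map({1: 1, 2: 0})
unchanged_labels = sorted(toy['survival'].unique())

# Assigning the result back makes the recoding persist.
toy['survival'] = mapped
recoded_labels = sorted(toy['survival'].unique())

print(unchanged_labels, recoded_labels)
```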

Split the data into train and test (20% in test)

In [None]:
# answer below:
from sklearn.model_selection import train_test_split
X = cancer.drop('survival', axis=1)
y = cancer['survival']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=30)
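
The classes in this dataset are imbalanced (the test set used below ends up with 48 samples of label 1 and 14 of label 2). The split above does not stratify. One option, shown here only as a sketch on synthetic data, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The arrays `X` and `y` below are invented stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 75 samples labelled 1 and 25 labelled 2 (invented).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 75 + [2] * 25)

# stratify=y preserves the 3:1 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=30, stratify=y
)

test_counts = {label: int((y_test == label).sum()) for label in (1, 2)}
print(test_counts)
```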


Create a gradient boosting classifier with a learning rate of 0.01 and a max depth of 5. Report the accuracy on the test set.

In [None]:
# answer below:
from sklearn.ensemble import GradientBoostingClassifier

gbr = GradientBoostingClassifier(learning_rate=.01, max_depth=5)
gbr.fit(X_train,y_train)
print("Gradient Boosting Accuracy: ",gbr.score(X_test,y_test))


Gradient Boosting Accuracy:  0.7741935483870968
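
The learning rate scales how much each new tree contributes to the ensemble. To see its effect per boosting stage, `staged_predict` yields the predictions after each added tree. The sketch below uses synthetic data from `make_classification` rather than the cancer data, and `n_estimators=50` is an arbitrary choice to keep it fast, so the numbers will not match this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-feature problem (a stand-in, not the Haberman data).
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

staged_accuracy = {}
for rate in (0.01, 0.5):
    model = GradientBoostingClassifier(learning_rate=rate, max_depth=5,
                                       n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    # One accuracy value per boosting stage (after 1 tree, 2 trees, ...).
    staged_accuracy[rate] = [accuracy_score(y_te, pred)
                             for pred in model.staged_predict(X_te)]

for rate, scores in staged_accuracy.items():
    print(rate, "first stage:", scores[0], "last stage:", scores[-1])
```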


Print the confusion matrix for the test data. What do you notice about our predictions?

In [None]:
y_pred_train = gbr.predict(X_train)
y_pred_test = gbr.predict(X_test)

In [None]:
y_pred_test

array([1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1])

In [None]:
# answer below:
confusion_matrix(y_test,y_pred_test)


array([[46,  2],
       [12,  2]])
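
To make the pattern in this matrix explicit, the sketch below derives accuracy, per-class recall, and predicted label counts from the numbers above. It assumes the usual scikit-learn layout: rows are true labels and columns are predicted labels, in sorted order (1, then 2).

```python
import numpy as np

# Confusion matrix reported above: rows = true labels (1, 2), columns = predicted labels (1, 2).
cm = np.array([[46, 2],
               [12, 2]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / actual, per true label
predicted_counts = cm.sum(axis=0)                # how often each label was predicted

print(f"accuracy: {accuracy:.4f}")
print(f"recall for label 1: {recall_per_class[0]:.4f}")
print(f"recall for label 2: {recall_per_class[1]:.4f}")
print(f"predicted label counts: {predicted_counts}")
```

The model predicts label 1 for 58 of the 62 test samples, so accuracy stays high while only 2 of the 14 patients with label 2 are identified.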

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [None]:
gbr = GradientBoostingClassifier(learning_rate=1, max_depth=5)
gbr.fit(X_train,y_train)
print("Gradient Boosting Accuracy: ",gbr.score(X_test,y_test))
y_pred_test = gbr.predict(X_test)

Gradient Boosting Accuracy:  0.6935483870967742


In [None]:
confusion_matrix(y_test,y_pred_test)

array([[37, 11],
       [ 8,  6]])

In [None]:
# answer below:
gbr = GradientBoostingClassifier(learning_rate=.5, max_depth=5)
gbr.fit(X_train,y_train)
print("Gradient Boosting Accuracy: ",gbr.score(X_test,y_test))
y_pred_test = gbr.predict(X_test)

Gradient Boosting Accuracy:  0.7096774193548387


In [None]:
confusion_matrix(y_test,y_pred_test)

array([[40,  8],
       [10,  4]])
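
Placing the three confusion matrices side by side makes the trade-off easier to read. The sketch below recomputes accuracy and the recall for label 2 from the matrices printed in this notebook, again assuming rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrices copied from the outputs above, keyed by learning rate.
matrices = {
    0.01: np.array([[46, 2], [12, 2]]),
    0.5: np.array([[40, 8], [10, 4]]),
    1.0: np.array([[37, 11], [8, 6]]),
}

summary = {}
for rate, cm in matrices.items():
    accuracy = np.trace(cm) / cm.sum()
    recall_label_2 = cm[1, 1] / cm[1].sum()  # share of true label-2 samples predicted as 2
    summary[rate] = (accuracy, recall_label_2)
    print(f"learning_rate={rate}: accuracy={accuracy:.4f}, recall(label 2)={recall_label_2:.4f}")
```

On this split, raising the learning rate lowers accuracy but increases the number of label-2 samples that are caught (2, then 4, then 6 out of 14).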

Perform a grid search for the optimal learning rate. Instead of accuracy, use a metric that will help your model predict the positive class.

In [None]:
# answer below:
from sklearn.model_selection import GridSearchCV

params = {'learning_rate': [.001, .01, .1, .25, .5, .75]}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid=params, cv=5, verbose=1, scoring='recall')
grid.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    1.8s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
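
With `scoring='recall'`, scikit-learn computes recall for `pos_label=1` by default. Since the labels here are still 1 and 2, that is the recall for label 1, which a model can maximize by predicting 1 everywhere. If the goal is to find label 2, one option is a custom scorer built with `make_scorer`. The sketch below uses invented arrays and a `DummyClassifier` purely to show the difference; the name `recall_for_label_2` is illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score

# Invented labels: four samples of label 1, two of label 2.
y_true = np.array([1, 1, 1, 1, 2, 2])
y_pred = np.array([1, 1, 1, 1, 1, 2])

recall_label_1 = recall_score(y_true, y_pred)               # default pos_label=1
recall_label_2 = recall_score(y_true, y_pred, pos_label=2)  # recall for label 2

# Scorer that could be passed as scoring=recall_for_label_2 to GridSearchCV.
recall_for_label_2 = make_scorer(recall_score, pos_label=2)

# A classifier that always predicts the majority label (1) finds no label-2 samples.
X = np.zeros((6, 1))
dummy = DummyClassifier(strategy='most_frequent').fit(X, y_true)
dummy_score = recall_for_label_2(dummy, X, y_true)

print(recall_label_1, recall_label_2, dummy_score)
```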
         

List the feature importances for the model with the optimal learning rate.

In [None]:
# answer below:
grid.best_params_


1.0

In [None]:
grid.best_score_

1.0
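
The cells above report the search results but not the feature importances the prompt asks for. A fitted `GridSearchCV` (with the default `refit=True`) exposes the winning model as `best_estimator_`, whose `feature_importances_` can be paired with the column names. The sketch below runs on synthetic data with the same three column names, so its values say nothing about the cancer data; the smaller grid, `cv=3`, and `n_estimators=20` are arbitrary choices to keep it quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data reusing the notebook's feature names.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=['age', 'op_year', 'nodes'])

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={'learning_rate': [0.01, 0.1, 0.5]},
    cv=3,
    scoring='recall',
)
search.fit(X, y)

# Importances of the refitted best model, labelled by feature and sorted.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(search.best_params_)
print(importances)
```

In this notebook the equivalent call would be `grid.best_estimator_.feature_importances_` with `X_train.columns` as the index.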