## Day 35 Lecture 1 Assignment

In this assignment, we will learn about gradient boosting. We will use a dataset describing patient survival after breast cancer surgery, loaded below, and analyze the models fitted to it.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [3]:
cancer.head()

   age  op_year  nodes  survival
0   30       64      1         1
1   30       62      3         1
2   30       65      0         1
3   31       59      2         1
4   31       65      4         1


Check for missing data and remove all rows containing missing data.

In [4]:
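# answer below:
# info() lists the non-null count per column; every column shows 306 of 306,
# so there are no rows with missing data to remove.
cancer.info()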


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       306 non-null    int64
 1   op_year   306 non-null    int64
 2   nodes     306 non-null    int64
 3   survival  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB
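
`info()` reports non-null counts, but an explicit per-column count of missing values can be easier to read, and `dropna()` performs the removal the prompt asks for. A minimal sketch on a small made-up frame (the `toy` frame and its values are hypothetical, not the Haberman data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value; not the Haberman data.
toy = pd.DataFrame({
    'age': [30, 31, np.nan, 34],
    'op_year': [64, 59, 65, 60],
    'nodes': [1, 2, 0, 4],
    'survival': [1, 1, 2, 1],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna().reset_index(drop=True)
print(len(toy), '->', len(toy_clean))
```

On the `cancer` frame, the same `isna().sum()` call would return zero for every column, consistent with the 306 non-null counts above, so `dropna()` would leave it unchanged.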


Adjust the target variable so that it has values of either 0 or 1.

In [5]:
# answer below:
# Recode the target: 1 (survived 5 years or longer) stays 1, 2 (died within 5 years) becomes 0
cancer.survival = cancer.survival.map({1:1, 2:0})
cancer.survival

0      1
1      1
2      1
3      1
4      1
      ..
301    1
302    1
303    1
304    0
305    0
Name: survival, Length: 306, dtype: int64

Split the data into train and test (20% in test).

In [6]:
# answer below:
from sklearn.model_selection import train_test_split

X = cancer.drop('survival', axis=1)
y = cancer.survival

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
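
The split above is random, so the accuracies and confusion matrices in this notebook will differ from run to run. One possible way to make the split repeatable and keep the class ratio similar in both parts is sketched below. The data here is a synthetic stand-in (the `X_demo` and `y_demo` names, the class counts, and `random_state=0` are all assumptions for illustration), not the split used for the results in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 306 rows with an imbalanced 0/1 target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'age': rng.integers(30, 84, size=306),
    'op_year': rng.integers(58, 70, size=306),
    'nodes': rng.integers(0, 53, size=306),
})
y_demo = pd.Series(
    np.r_[np.ones(230, dtype=int), np.zeros(76, dtype=int)], name='survival'
)

# random_state makes the split repeatable; stratify keeps the class ratio
# roughly equal in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```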

Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [7]:
# answer below:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(learning_rate=0.01, max_depth=5)

gbc.fit(X_train, y_train)

print(
    f'Train accuracy: {gbc.score(X_train, y_train)}\n'
    f'Test accuracy: {gbc.score(X_test, y_test)}\n'
)

Train accuracy: 0.8401639344262295
Test accuracy: 0.7903225806451613



Print the confusion matrix for the test data. What do you notice about our predictions?

In [14]:
# answer below:
from sklearn.metrics import confusion_matrix

y_test_pred = gbc.predict(X_test)

confusion_matrix(y_test, y_test_pred)



array([[ 4, 11],
       [ 2, 45]])

In [None]:
# Only 4 true negatives against 11 false positives: the model predicts
# the majority class (survived) for most patients who actually died.
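
To read the matrix more concretely, its four cells can be unpacked and turned into rates. This is a small sketch that hard-codes the matrix printed above rather than recomputing it; the variable names are illustrative:

```python
import numpy as np

# Test-set confusion matrix printed above
# (rows: true class, columns: predicted class).
cm = np.array([[4, 11],
               [2, 45]])

# For binary 0/1 labels, scikit-learn's layout flattens to tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)                    # share of survivors identified
specificity = tn / (tn + fp)               # share of non-survivors identified
majority_baseline = (tp + fn) / cm.sum()   # accuracy of always predicting 1

print(f'accuracy={accuracy:.3f} recall={recall:.3f} '
      f'specificity={specificity:.3f} baseline={majority_baseline:.3f}')
```

The accuracy of about 0.79 is only slightly above the roughly 0.76 obtained by always predicting survival, and the low specificity shows that most of the error falls on patients who did not survive.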

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [15]:
# answer below:

gbc2 = GradientBoostingClassifier(learning_rate=1, max_depth=5)
gbc3 = GradientBoostingClassifier(learning_rate=0.5, max_depth=5)

gbc2.fit(X_train, y_train)
gbc3.fit(X_train, y_train)

y_test_pred2 = gbc2.predict(X_test)
y_test_pred3 = gbc3.predict(X_test)

print(f'Learning rate: {gbc2.learning_rate}')
print(confusion_matrix(y_test, y_test_pred2))

print(f'Learning rate: {gbc3.learning_rate}')
print(confusion_matrix(y_test, y_test_pred3))



Learning rate: 1
[[ 2 13]
 [11 36]]
Learning rate: 0.5
[[ 4 11]
 [10 37]]


In [None]:
# Compared with the learning rate 0.01 model:
# the learning rate 1 model is worse at predicting both classes;
# the learning rate 0.5 model is worse at predicting the positive class.

Perform a grid search for the optimal learning rate. Instead of accuracy, use a metric that will help your model predict the positive class.

In [39]:
# answer below:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

gbc4 = GradientBoostingClassifier(max_depth=5, n_iter_no_change=10)

param_grid = {
    'learning_rate':[0.9, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001]
}

grid = GridSearchCV(gbc4, param_grid, scoring='recall', n_jobs=-1, cv=5)
grid.fit(X_train, y_train)

y_test_pred4 = grid.best_estimator_.predict(X_test)

print(f'Learning rate: {grid.best_estimator_.learning_rate}')
print(confusion_matrix(y_test, y_test_pred4))

Learning rate: 0.001
[[ 0 15]
 [ 0 47]]
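
The matrix above shows a model that predicts class 1 for every test row. That maximizes recall but says nothing about the negative class. The sketch below rebuilds label vectors that match the printed matrix (an assumption made only to reproduce its counts) and compares recall with a few other metrics from `sklearn.metrics`:

```python
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             precision_score, recall_score)

# Labels matching the matrix above: 15 rows whose true class is 0,
# 47 rows whose true class is 1, and a prediction of 1 for all 62 rows.
y_true = [0] * 15 + [1] * 47
y_pred = [1] * 62

recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)

print(f'recall={recall:.3f} precision={precision:.3f} '
      f'f1={f1:.3f} balanced_accuracy={balanced:.3f}')
```

Recall is perfect while balanced accuracy sits at chance level. If the goal is a model that still separates the two classes, passing a scoring string such as `'f1'` or `'balanced_accuracy'` to the search may be a reasonable alternative to `'recall'`.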


In [46]:
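from scipy import stats
# Note: gbc5 (max_depth=3) is defined here but never used;
# the search below runs on gbc4 (max_depth=5).
gbc5 = GradientBoostingClassifier(max_depth=3, n_iter_no_change=10)

# stats.uniform(loc, scale) samples learning rates from [0.0001, 0.9001]
param_grid = {
    'learning_rate':stats.uniform(0.0001, 0.9)
}

grid2 = RandomizedSearchCV(gbc4, param_grid, scoring='recall', n_jobs=-1, n_iter=100)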
grid2.fit(X_train, y_train)

y_test_pred5 = grid2.best_estimator_.predict(X_test)

print(f'Learning rate: {grid2.best_estimator_.learning_rate}')
print(confusion_matrix(y_test, y_test_pred5))

Learning rate: 0.012557095292209797
[[ 0 15]
 [ 0 47]]
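
A note on the sampling distribution used in the randomized search: `stats.uniform` takes `(loc, scale)`, so small learning rates are drawn only rarely. Learning rates are often searched on a log scale instead. The sketch below compares the two; the `loguniform` bounds are illustrative assumptions, not values from this assignment:

```python
from scipy import stats

# scipy's uniform takes (loc, scale), so this covers [0.0001, 0.9001].
uniform_dist = stats.uniform(0.0001, 0.9)
# loguniform takes the two bounds (a, b) directly; these bounds are illustrative.
log_dist = stats.loguniform(1e-4, 0.9)

uniform_draws = uniform_dist.rvs(size=1000, random_state=0)
log_draws = log_dist.rvs(size=1000, random_state=0)

# Share of sampled learning rates below 0.01 under each distribution.
uniform_small = (uniform_draws < 0.01).mean()
log_small = (log_draws < 0.01).mean()
print('uniform    < 0.01:', uniform_small)
print('loguniform < 0.01:', log_small)
```

Either distribution object can be passed as a value in the parameter dictionary given to `RandomizedSearchCV`, since the search only needs an object with an `rvs` method.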


List the feature importances for the model with the optimal learning rate.

In [49]:
# answer below:

pd.Series(grid2.best_estimator_.feature_importances_, index=X_train.columns)


age        0.416221
op_year    0.052890
nodes      0.530889
dtype: float64
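
The values above are impurity-based importances, which are computed from the training data and sum to 1. Permutation importance on held-out data is one possible cross-check. The sketch below uses synthetic data in which only `nodes` determines the label (the data, the labeling rule, and the hyperparameters are assumptions for illustration, not results from the Haberman data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: only 'nodes' drives the label here.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'age': rng.integers(30, 84, size=300),
    'op_year': rng.integers(58, 70, size=300),
    'nodes': rng.integers(0, 30, size=300),
})
y_demo = (X_demo['nodes'] < 10).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = GradientBoostingClassifier(
    learning_rate=0.1, max_depth=3, random_state=0
).fit(X_tr, y_tr)

# Impurity-based importances come from the fitted trees themselves.
impurity = pd.Series(model.feature_importances_, index=X_tr.columns)

# Permutation importances measure the drop in held-out score
# when one column at a time is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
permutation = pd.Series(perm.importances_mean, index=X_te.columns)

print(pd.DataFrame({'impurity': impurity, 'permutation': permutation}))
```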