## Day 35 Lecture 1 Assignment

In this assignment, we will learn about gradient boosting. We will use the Haberman dataset, which describes patient survival after breast cancer surgery, and analyze a gradient boosting model fit to it. The data is loaded below.

In [0]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix

sns.set()

In [0]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [3]:
cancer.head()

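|   | age | op_year | nodes | survival |
|---|-----|---------|-------|----------|
| 0 | 30 | 64 | 1 | 1 |
| 1 | 30 | 62 | 3 | 1 |
| 2 | 30 | 65 | 0 | 1 |
| 3 | 31 | 59 | 2 | 1 |
| 4 | 31 | 65 | 4 | 1 |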


Check for missing data and remove all rows containing missing data

In [9]:
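# answer below:
cancer.info()
# info() reports 306 non-null values in every column, so dropna() is only a safeguard here
cancer = cancer.dropna()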


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age         306 non-null int64
op_year     306 non-null int64
nodes       306 non-null int64
survival    306 non-null int64
dtypes: int64(4)
memory usage: 9.7 KB
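
`info()` only shows non-null counts, so a missing value has to be inferred by comparing against the row total. A more direct check counts missing values per column and then drops the affected rows. The sketch below uses a small made-up frame (the `toy` values are hypothetical, not taken from the dataset) so that the effect of `dropna()` is visible.

```python
import numpy as np
import pandas as pd

# Toy frame with the assignment's column names; the values are made up.
toy = pd.DataFrame({
    'age': [30, 31, np.nan, 34],
    'op_year': [64, 59, 65, 60],
    'nodes': [1, 2, 0, np.nan],
    'survival': [1, 1, 2, 1],
})

# Count missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean.shape)
```

On the real `cancer` frame the same two calls should report zero missing values and leave all 306 rows in place.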


Adjust the target variable so that it has values of either 0 or 1

In [11]:
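# answer below:
# Recode 1/2 to 0/1: 0 = survived 5 years or longer, 1 = died within 5 years
cancer['survival'] = cancer['survival'].map({1: 0, 2: 1})
cancer.survival.nunique()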


2

Create a dummy variable from the number of nodes

In [0]:
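# answer below:
# One-hot encode each distinct node count (as a string category), dropping the first level
nodes_dummy = pd.get_dummies(cancer['nodes'].astype(str), drop_first=True)
# Append the dummy columns; the original numeric 'nodes' column is kept as well
cancer_df = pd.concat([cancer, nodes_dummy], axis=1)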
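
The prompt can also be read as asking for a single dummy variable rather than one column per node count. A minimal sketch of that reading, assuming the indicator should flag patients with at least one positive node (the `has_nodes` name and the toy values are hypothetical):

```python
import pandas as pd

# Made-up node counts for illustration.
toy = pd.DataFrame({'nodes': [0, 3, 0, 1, 10]})

# 1 when at least one positive axillary node was detected, otherwise 0.
toy['has_nodes'] = (toy['nodes'] > 0).astype(int)
print(toy)
```

A single indicator keeps the feature space small, while the one-hot version above lets the model treat each count separately at the cost of many sparse columns.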

Split the data into train and test (20% in test)

In [0]:
# answer below:
X = cancer_df.drop('survival', axis=1)
Y = cancer_df['survival']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2)
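
Because the split above is unseeded, the scores and confusion matrices will vary from run to run. The sketch below shows one way to make a split reproducible and to keep the class balance equal in both parts, using `random_state` and `stratify` on made-up data (the seed value and the toy arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 samples of class 0 and 20 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the 80/20 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(len(y_te), y_te.mean())
```

Stratifying matters more on small, imbalanced datasets like this one, where an unlucky split can leave very few minority-class rows in the test set.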

Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [45]:
# answer below:
gbc = GradientBoostingClassifier(learning_rate=0.01, max_depth=5, n_estimators=50)
gbc.fit(X_train, y_train)

print('Train score:', gbc.score(X_train, y_train))
print('Test score:', gbc.score(X_test, y_test))


Train score: 0.8360655737704918
Test score: 0.8064516129032258
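
With a learning rate as small as 0.01, each tree contributes very little, so it can help to watch how accuracy changes as trees are added. A minimal sketch using `staged_predict`, which yields the ensemble's predictions after each boosting stage; it runs on synthetic data from `make_classification` (an assumption, so the numbers will not match the cancer results).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Haberman dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=5, n_estimators=50, random_state=0
)
model.fit(X_tr, y_tr)

# One test accuracy per boosting stage (after 1 tree, 2 trees, ..., 50 trees).
staged_acc = [accuracy_score(y_te, pred) for pred in model.staged_predict(X_te)]
print(len(staged_acc), staged_acc[-1])
```

The last entry equals `model.score(X_te, y_te)`; if the curve is still rising at the final stage, more estimators or a larger learning rate may be worth trying.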


Print the confusion matrix for the test data. What do you notice about our predictions?

In [47]:
# answer below:
confusion_matrix(y_test, gbc.predict(X_test))


array([[45,  3],
       [ 9,  5]])
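
To make the pattern in this matrix concrete, per-class recall and precision can be computed directly from it. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in sorted label order.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[45, 3],
               [9, 5]])

accuracy = np.trace(cm) / cm.sum()                    # (45 + 5) / 62
recall_per_class = cm.diagonal() / cm.sum(axis=1)     # [45/48, 5/14]
precision_per_class = cm.diagonal() / cm.sum(axis=0)  # [45/54, 5/8]

print('accuracy:', accuracy)
print('recall:', recall_per_class)
print('precision:', precision_per_class)
```

The accuracy of 50/62 matches the reported test score, but recall for the second (minority) class is only 5/14: the model predicts the majority class for most patients, so overall accuracy hides how poorly the rarer outcome is detected.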

Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [26]:
# answer below:
gbc = GradientBoostingClassifier(learning_rate=1, max_depth=5, n_estimators=100)
gbc.fit(X_train, y_train)

confusion_matrix(y_test, gbc.predict(X_test))



array([[37, 11],
       [ 9,  5]])

In [55]:
# Note: this cell uses learning_rate=0.01 (with 100 trees), not the 0.5 named in the prompt
gbc = GradientBoostingClassifier(learning_rate=.01, max_depth=5, n_estimators=100)
gbc.fit(X_train, y_train)

confusion_matrix(y_test, gbc.predict(X_test))

array([[45,  3],
       [ 9,  5]])
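
Several learning rates can be compared in one loop instead of one cell per value. This is a sketch on synthetic data from `make_classification` (an assumption, so the matrices will differ from those above); the rates 0.01, 0.5 and 1.0 are the ones mentioned in this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Haberman dataset.
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

matrices = {}
for rate in [0.01, 0.5, 1.0]:
    model = GradientBoostingClassifier(
        learning_rate=rate, max_depth=5, n_estimators=100, random_state=0
    )
    model.fit(X_tr, y_tr)
    # labels=[0, 1] fixes the row/column order of the matrix.
    matrices[rate] = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1])
    print('learning_rate =', rate)
    print(matrices[rate])
```

Fixing `random_state` in both the split and the model keeps the comparison about the learning rate alone rather than about random variation between runs.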

Perform a grid search for the optimal learning rate.

In [50]:
# answer below:
params = {'learning_rate': [0.005, 0.01, 0.1], 'max_depth': [1, 3, 5, 10]}
grid = GridSearchCV(gbc, param_grid=params, scoring='accuracy', cv=3,
                    return_train_score=True, n_jobs=3)
grid.fit(X, Y)
print(grid.best_params_)
print(grid.best_score_)



{'learning_rate': 0.1, 'max_depth': 3}
0.7516339869281046
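
`best_params_` shows only the winning combination. To see how close the other settings came, the `cv_results_` attribute can be turned into a table. A minimal sketch on synthetic data with a smaller grid (the data, grid values and `n_estimators=20` are assumptions chosen to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the Haberman dataset.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {'learning_rate': [0.01, 0.1], 'max_depth': [1, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=param_grid, scoring='accuracy', cv=3,
)
search.fit(X_syn, y_syn)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ['param_learning_rate', 'param_max_depth', 'mean_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
```

If several combinations score within a small margin of each other, the simpler one (shallower trees) is often the safer choice on a dataset this small.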


List the feature importances for the model with the optimal learning rate.

In [51]:
# answer below:
pd.DataFrame({
    'columns': X.columns,
    'importance score': grid.best_estimator_.feature_importances_,
}).sort_values(by='importance score', ascending=False)


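|    | columns | importance score |
|----|---------|------------------|
| 2 | nodes | 0.364138 |
| 0 | age | 0.293407 |
| 1 | op_year | 0.134605 |
| 32 | 9 | 0.026238 |
| 27 | 5 | 0.022355 |
| 25 | 4 | 0.018951 |
| 7 | 13 | 0.018482 |
| 18 | 23 | 0.017542 |
| 9 | 15 | 0.015951 |
| 3 | 1 | 0.011893 |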