## Day 35 Lecture 1 Assignment

In this assignment, we will learn about gradient boosting. We will use a dataset describing patient survival after breast cancer surgery, loaded below, and analyze a model fit to it.

In [0]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [0]:
# Attributes:
# Age of patient at time of operation (numerical)
# Patient's year of operation (year - 1900, numerical)
# Number of positive axillary nodes detected (numerical)
# Survival status (class attribute)
#  -- 1 = the patient survived 5 years or longer
#  -- 2 = the patient died within 5 years

cols = ['age', 'op_year', 'nodes', 'survival']
cancer = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/haberman.data', names=cols)

In [0]:
cancer.head()

Check for missing data and remove all rows containing missing data

In [0]:
# answer below:
cancer.dropna(inplace=True)
cancer.info()
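
Because `dropna` leaves a complete frame unchanged without saying so, it can help to count the missing values first. A minimal sketch on a small made-up frame (the `toy` values are invented for illustration and are not rows from the Haberman file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the column names of `cancer`;
# the values are made up so that two rows contain a gap.
toy = pd.DataFrame({
    'age': [30, 31, np.nan, 34],
    'op_year': [64, 59, 65, 60],
    'nodes': [1, 2, 4, np.nan],
    'survival': [1, 1, 1, 2],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy.dropna()
print(len(toy), '->', len(toy_clean), 'rows')
```

Running the same two steps on `cancer` shows whether `dropna` removed anything at all.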

Adjust the target variable so that it has values of either 0 or 1

In [3]:
pd.get_dummies(cancer['survival'], prefix='survival', drop_first=False)


,survival_1,survival_2
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
301,1,0
302,1,0
303,1,0
304,0,1


In [4]:
# answer below:
cancer_dummies = pd.concat([cancer.drop(columns='survival'), pd.get_dummies(cancer['survival'], prefix = 'survival', drop_first=True)], axis=1)

cancer_dummies.head()       

,age,op_year,nodes,survival_2
0,30,64,1,0
1,30,62,3,0
2,30,65,0,0
3,31,59,2,0
4,31,65,4,0


In [0]:
# cancer['survival_fixed'] = cancer['survival'].apply(lambda x: 1 if x == 2 else 0)
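
The commented-out line above recodes the label directly instead of going through dummy columns. A minimal sketch of the same idea with `Series.map`, assuming the coding from the attribute list (1 = survived, 2 = died); the `labels` series is made up for illustration:

```python
import pandas as pd

# Hypothetical label column using the dataset's 1/2 coding.
labels = pd.Series([1, 1, 2, 1, 2], name='survival')

# Map 1 (survived) to 0 and 2 (died) to 1.
died = labels.map({1: 0, 2: 1})

# The same 0/1 column falls out of get_dummies with drop_first=True.
dummy = pd.get_dummies(labels, prefix='survival', drop_first=True)['survival_2'].astype(int)

print(died.tolist())
print((died == dummy).all())
```

Both routes give a target where 1 means the patient died within 5 years, which is the class the confusion matrices later treat as positive.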

Create a dummy variable from the number of nodes

In [0]:
# answer below:
#cancer_dummies = pd.concat([cancer.drop(columns='nodes'), pd.get_dummies(cancer['nodes'], drop_first=True)], axis=1)

Split the data into train and test (20% in test)

In [0]:
cancer_dummies.head()

In [0]:
cancer_dummies.info()

In [0]:
# answer below:
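from sklearn.model_selection import train_test_split
X = cancer_dummies.drop(columns='survival_2')
y = cancer_dummies['survival_2']

# No test_size is passed, so the default 25% hold-out applies rather than the 20% requested;
# pass test_size=0.2 to match the prompt. The results recorded below come from this default split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=35)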


In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the training rows only so test-set statistics do not leak into preprocessing.
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
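
A hedged sketch of a 20% hold-out as the prompt describes it, with the split stratified on the label and the scaler statistics taken from the training rows only. The data here are synthetic stand-ins (`X_demo`, `y_demo` are hypothetical names, not the Haberman data), and stratification is an added choice, not something the assignment requires:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 rows, 3 numeric features, roughly 25% positives.
rng = np.random.default_rng(35)
X_demo = rng.normal(size=(300, 3))
y_demo = (rng.random(300) < 0.25).astype(int)

# test_size=0.2 holds out 20%; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=35
)

# Learn mean and standard deviation from the training rows, then apply to both parts.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
```

Tree-based models such as gradient boosting split on feature thresholds, so scaling is not expected to change their predictions; it matters more if the same split is reused for scale-sensitive models.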

Create a gradient boosted classification algorithm with a learning rate of 0.01 and max depth of 5. Report the accuracy.

In [7]:
# answer below:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate= 0.01, max_depth=5)
gbc.fit(X_train, y_train)

gbc.score(X_test,y_test)

0.7402597402597403

Print the confusion matrix for the test data. What do you notice about our predictions?

In [8]:
# answer below:
# confusion matrix: rows are actual classes, columns are predicted classes
from sklearn import metrics
y_pred = gbc.predict(X_test)
print(metrics.confusion_matrix(y_test, y_pred))

[[54  3]
 [17  3]]
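
To make the pattern in this matrix concrete, the counts can be turned into accuracy, recall, and precision for class 1 (died within 5 years). A small sketch that uses only the four numbers printed above, read with scikit-learn's convention of actual classes in rows and predicted classes in columns:

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[54, 3],
               [17, 3]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()        # share of all test rows predicted correctly
recall_1 = tp / (tp + fn)              # share of actual class-1 rows that were found
precision_1 = tp / (tp + fp)           # share of class-1 predictions that were right

# Accuracy of always predicting the most common actual class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy, 4), round(recall_1, 2), round(precision_1, 2), round(majority_baseline, 4))
```

The accuracy matches the 0.7403 reported by `gbc.score`, but it equals the majority-class baseline of 57/77, and only 3 of the 20 patients in class 1 are identified. The model is mostly predicting class 0.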


Print the confusion matrix for a learning rate of 1 and a learning rate of 0.5. What do you see now that stands out to you in the confusion matrix?

In [9]:
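# answer below:
from sklearn import metrics

# Compare the two learning rates from the prompt at the same depth as before.
# Note: the matrix recorded below came from GradientBoostingClassifier(learning_rate=1, max_depth=0.5),
# where 0.5 was passed as max_depth by mistake; rerun this cell to get one matrix per learning rate.
for lr in (1, 0.5):
    gbc = GradientBoostingClassifier(learning_rate=lr, max_depth=5)
    gbc.fit(X_train, y_train)
    print(lr, gbc.score(X_test, y_test))
    y_pred = gbc.predict(X_test)
    print(metrics.confusion_matrix(y_test, y_pred))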

[[57  0]
 [20  0]]


In [0]:
# def conf_matrix(y_true, y_pred):
#   data = confusion_matrix(y_true, y_pred)
#   index = ['Actual_0', 'Actual_1']
#   columns = ['Predicted_0', 'Predicted_1']
#   return pd.DataFrame(data, index, columns)
  
# conf_matrix(y_test, y_pred)

Perform a grid search for the optimal learning rate.

In [10]:
learning_range = np.logspace(-3,1,5)
max_depth = np.arange(1,10,2)
n_estimators = np.logspace(1,4,4).astype(int)  # astype returns a new array, so assign the result
print(learning_range)
print(max_depth)
print(n_estimators)

[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01]
[1 3 5 7 9]
[   10   100  1000 10000]


In [18]:
# answer below:
from sklearn.model_selection import GridSearchCV

# params = {'learning_rate': learning_range, 'max_depth': max_depth, 'n_estimators': n_estimators}
params = {'learning_rate': [0.0025,0.05], 'max_depth': [4,5], 'n_estimators': [500,550]}
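# The grid values override the learning_rate, max_depth and n_estimators set on gbc.
clf = GridSearchCV(gbc, param_grid=params, cv=3)
# The search runs on the full X, y, so best_score_ is a 3-fold cross-validated accuracy
# over all rows and the earlier train/test split plays no part in it.
clf.fit(X, y)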

print(clf.best_params_)
print(clf.best_score_)

{'learning_rate': 0.0025, 'max_depth': 4, 'n_estimators': 550}
0.7647058823529411
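
One possible variation is to search on the training portion only and keep the test rows for a final check. A minimal sketch under stated assumptions: the data are synthetic (`make_classification` stands in for the Haberman features), the grid is deliberately tiny so it runs quickly, and the names `X_demo`, `y_demo`, and `search` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in with 3 features (not the Haberman data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    weights=[0.75, 0.25], random_state=35
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=35
)

# Small illustrative grid; the values are placeholders, not recommendations.
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [1, 3],
    'n_estimators': [20, 50],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=35), param_grid=param_grid, cv=3)
search.fit(X_tr, y_tr)  # cross-validation only ever sees the training rows

print(search.best_params_)
print(search.best_score_)         # mean cross-validated accuracy on the training rows
print(search.score(X_te, y_te))   # accuracy of the refit best model on held-out rows
```

With `refit=True` (the default), `search` retrains the best configuration on all training rows, so `search.score` and `search.best_estimator_` refer to that refit model.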


List the feature importances for the model with the optimal learning rate.

In [19]:
# answer below:
pd.DataFrame({'columns': X.columns, 'importance scores':clf.best_estimator_.feature_importances_}).sort_values(by='importance scores', ascending=False)


,columns,importance scores
2,nodes,0.493285
0,age,0.374442
1,op_year,0.132273
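
scikit-learn normalizes these impurity-based importances so they sum to 1, which lets each score be read as a share of the total. A short sketch using only the three values printed above:

```python
import pandas as pd

# Importance scores copied from the table above.
importances = pd.Series({'nodes': 0.493285, 'age': 0.374442, 'op_year': 0.132273})

total = importances.sum()
share_pct = (importances * 100).round(1)  # each feature's share in percent

print(round(total, 6))
print(share_pct)
```

Read this way, the number of positive nodes accounts for about half of the total importance, age for a bit over a third, and the year of operation for the remainder.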
