# Week 4 Problem 2

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says YOUR CODE HERE. Do not write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then restart the kernel and run all cells (Restart & Run All).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

5. When you are ready to submit your assignment, go to Dashboard → Assignments and click the Submit button. Your work is not submitted until you click Submit.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. If your code does not pass the unit tests, it will not pass the autograder.



# Due Date: 6 PM, February 12, 2018


In [82]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from time import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import neighbors
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_frame_equal, assert_index_equal
from nose.tools import assert_false, assert_equal, assert_almost_equal, assert_true, assert_in, assert_is_not

# Suppress all warnings (including several Pandas warnings) to keep the output clean
import warnings
warnings.filterwarnings("ignore")

# Set global figure properties
import matplotlib as mpl
mpl.rcParams.update({'axes.titlesize' : 20,
                     'axes.labelsize' : 18,
                     'legend.fontsize': 16})

# Set default seaborn plotting style
sns.set_style('white')


# Breast Cancer Dataset

We will use scikit-learn's built-in breast cancer dataset, which describes individual breast cancer cases. The dataset has 569 samples and 30 features. We will use only the first 10 features to build models that predict whether an individual case is malignant (harmful) or benign (non-harmful).

The code below loads the dataset into a pandas DataFrame and adds a column called target, which records whether each case was determined to be a malignant or a benign tumor. Note: In this dataset, a malignant tumor has a value of 0 and a benign tumor has a value of 1.

We will create 3 different models using different classification techniques and tune them with grid search (exhaustive for the first two models, randomized for the third).

In [83]:
# Load in the dataset as a Pandas DataFrame
data = load_breast_cancer()
cancer_data = pd.DataFrame(data.data, columns=data.feature_names)
cancer_data['target'] = data.target
# View the label distribution
print(cancer_data.target.value_counts(ascending=True))

features = cancer_data[cancer_data.columns[:10]]
labels = cancer_data.target
# Count the number of features
print("Number of features:", len(features.columns))

test_frac = 0.4
X_train, X_test, y_train, y_test = train_test_split(features, labels, 
                                                    test_size=test_frac, random_state=40)

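# random_state only has an effect when shuffle=True; newer scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted
skf = StratifiedKFold(n_splits=5)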


0    212
1    357
Name: target, dtype: int64
Number of features: 10
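
The classes above are imbalanced (212 malignant versus 357 benign), which is why a stratified splitter is used for cross-validation. The sketch below illustrates the idea on a synthetic label vector with the same class counts. The `demo_` names are hypothetical and are not part of the assignment, and the real `skf` is applied to `y_train` rather than to these labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same class counts as the printed distribution
demo_labels = np.array([0] * 212 + [1] * 357)
# Placeholder feature matrix; only the labels matter for stratification
demo_X = np.zeros((len(demo_labels), 1))

demo_skf = StratifiedKFold(n_splits=5)
overall_frac = demo_labels.mean()  # overall fraction of benign (1) cases

# Fraction of benign cases in the held-out part of each fold
fold_fracs = [demo_labels[test_idx].mean()
              for _, test_idx in demo_skf.split(demo_X, demo_labels)]

print("overall benign fraction: %.3f" % overall_frac)
for i, frac in enumerate(fold_fracs):
    print("fold %d benign fraction: %.3f" % (i, frac))
```

Each fold's benign fraction should stay close to the overall fraction, so every validation fold sees roughly the same class balance as the full data.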


# Problem 1

In the code cell below, we will create a k-Nearest Neighbors (kNN) model to classify the tumor as malignant or benign. We will scale the features before fitting the model, because kNN relies on distances between samples and scaling can have a large impact on its predictions. We will use 'skf' for cross-validation. Earlier we used loops to build models for different numbers of nearest neighbors. Here, we will use Grid Search to try different values of k.


**Remember: You don't have to perform the search itself. Your function should return the unfitted GridSearchCV instance; the grid search is actually performed when fit is called on that instance with the training data.**


In [84]:
def knn(k_vals):
    '''
    Create a Pipeline that scales the features with StandardScaler and then
    fits a KNeighborsClassifier.

    Create a Grid Search cross validator (cv=skf) for the pipeline, where param_grid 
    is a dictionary with n_neighbors as the hyperparameter and k_vals as its values.
    
    Parameters
    ----------
    k_vals : range of nearest neighbors values passed as a numpy array
    
    Returns
    -------
    Grid search cross validator instance which has the Pipeline, parameter grid containing neighbor values 
    and cross-validation = 'skf' as parameters.
    '''
    # Create pipeline of scaler and estimator
    knnp = Pipeline([('ss', StandardScaler()),
                     ('knn', neighbors.KNeighborsClassifier())])
    
    # Create a dictionary of hyperparameters and values
    params = dict(knn__n_neighbors=k_vals)

    # Create grid search cross validator
    gse = GridSearchCV(estimator=knnp, param_grid=params, cv=skf)
    
    return gse
    

In [85]:
k_vals = np.arange(1,101,2)
gse = knn(k_vals)
gse.fit(X_train, y_train)
assert_equal(isinstance(gse, GridSearchCV), True)
best_score = float('%4.3f' % round(gse.best_score_, 3))
assert_almost_equal(0.935, best_score, places = 3)
test_score = float('%4.3f' % round(gse.score(X_test, y_test), 3))
assert_almost_equal(0.934, test_score, places = 3)
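
When the estimator is a Pipeline, hyperparameters are addressed as `<step name>__<parameter>`, which is why the grid above uses `knn__n_neighbors`. The following is a minimal sketch of inspecting a fitted search. It assumes the same data preparation as above but uses a much smaller, illustrative grid, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same preparation as above: first 10 features, 40% held out for testing
demo_data = load_breast_cancer()
demo_X, demo_y = demo_data.data[:, :10], demo_data.target
demo_X_tr, demo_X_te, demo_y_tr, demo_y_te = train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=40)

demo_pipe = Pipeline([('ss', StandardScaler()),
                      ('knn', KNeighborsClassifier())])

# Small illustrative grid; the key is '<step name>__<parameter name>'
demo_search = GridSearchCV(demo_pipe,
                           param_grid={'knn__n_neighbors': [1, 5, 15]},
                           cv=StratifiedKFold(n_splits=5))
demo_search.fit(demo_X_tr, demo_y_tr)

# The winning setting is reported under the same prefixed key
print(demo_search.best_params_)
# best_estimator_ is the whole pipeline, refit on the full training data
demo_best_k = demo_search.best_estimator_.named_steps['knn'].n_neighbors
print(demo_best_k)
```

Because the scaler is inside the pipeline, it is refit on the training part of every fold, so no information from the validation part leaks into the scaling.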


# Problem 2

In the code cell below, we will create a Decision Tree model to classify the tumor as malignant or benign. However, we will **not** scale the features before fitting the model. We will use 'skf' for cross-validation. We will use Grid Search to try different values of max_depth and max_features.

**Note:** Since a Decision Tree doesn't take much time to build, we will try all the possible combinations.

**Remember: You don't have to perform the search itself. Your function should return the unfitted GridSearchCV instance; the grid search is actually performed when fit is called on that instance with the training data.**


In [88]:
def tree(depth, features):
    '''
    Create a Grid Search cross validator (cv=skf) for DecisionTreeClassifier with random_state=40.
    The parameter grid will be multi-dimensional and will contain max_depth and max_features as hyperparameters.
    
    Parameters
    ----------
    depth : range of max_depth values passed as a numpy array
    features : range of max_features values passed as a numpy array
    
    Returns
    -------
    A Grid search cross validator instance for DecisionTreeClassifier
    '''
    # Create decision tree estimator
    dtc = DecisionTreeClassifier(random_state = 40)   
    
    # Create a dictionary of hyperparameters and values      
    params = {'max_depth': depth,
              'max_features': features}
 
    # Create grid search cross validator
    gse = GridSearchCV(estimator=dtc, param_grid=params, cv=skf)
    
    return gse
    

In [89]:
depth = np.arange(1,10)
features = np.arange(1,10)

gse = tree(depth, features)
gse.fit(X_train, y_train)
assert_equal(isinstance(gse,GridSearchCV), True)
gbe = gse.best_estimator_
assert_equal(isinstance(gbe,DecisionTreeClassifier), True)
assert_equal(gse.best_estimator_.max_depth, 3)
assert_equal(gse.best_estimator_.max_features, 6)
best_score = float('%4.3f' % round(gse.best_score_, 3))
assert_almost_equal(0.927, best_score, places = 3)
test_score = float('%4.3f' % round(gse.score(X_test, y_test), 3))
assert_almost_equal(0.921, test_score, places = 3)
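
To see what "all possible combinations" means for the test cell above, the sketch below counts the candidates in the two-dimensional grid. It uses plain Python ranges as stand-ins for `np.arange(1,10)`, and the variable names are illustrative only.

```python
from itertools import product

# Stand-ins for depth = np.arange(1,10) and features = np.arange(1,10)
depth_values = range(1, 10)
feature_values = range(1, 10)
n_folds = 5  # skf uses n_splits=5

# Every (max_depth, max_features) pair is one candidate model
candidates = list(product(depth_values, feature_values))

# Each candidate is fit once per cross-validation fold
cv_fits = len(candidates) * n_folds

print(len(candidates), cv_fits)  # prints: 81 405
```

With the default `refit=True`, GridSearchCV additionally fits the best candidate once more on the full training set. That cost is negligible for a single decision tree but grows quickly for slower models.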




# Problem 3

In the code cell below, we will create a Gradient Boosting model to classify the tumor as malignant or benign. We will **not** scale the features before fitting the model. We will use 'skf' for cross-validation. We will try different values of learning_rate, n_estimators and max_features. Since Gradient Boosting involves multiple iterations, trying all possible combinations can be computationally very expensive. Hence we will use a Randomized Grid Search cross validator, which evaluates only a sample of the combinations.

**Note:** Although we are using Randomized Grid Search, the code cell containing tests might take some time to execute.

**Remember: You don't have to perform the search itself. Your function should return the unfitted RandomizedSearchCV instance; the randomized search is actually performed when fit is called on that instance with the training data.**


In [100]:
def GBM(learning_rate, n_estimators, max_features, sample):
    '''
    Create a Randomized Grid Search cross validator (cv=skf and random_state=40) for 
    GradientBoostingClassifier with random_state=40.
    
    The parameter grid will be multi-dimensional and will contain learning_rate, n_estimators 
    and max_features as hyperparameters.
    
    Parameters
    ----------
    learning_rate : range of learning_rate values passed as a numpy array
    n_estimators : range of n_estimators values passed as a numpy array
    max_features : range of max_features values passed as a numpy array
    sample : Number of parameter settings that are sampled
    
    Returns
    -------
    A Randomized Grid search cross validator instance for GradientBoostingClassifier
    '''
    gbtc = GradientBoostingClassifier(random_state=40)

    # Create parameter grid, by using explicit dictionary
    # (named param_dist so it does not shadow the pandas alias pd)
    param_dist = {'learning_rate': learning_rate,
                  'n_estimators': n_estimators,
                  'max_features': max_features}
 
    # Create randomized search cross validator (not fitted here)
    rscv = RandomizedSearchCV(gbtc, param_distributions=param_dist,
                              n_iter=sample, random_state=40, cv=skf)
    
    return rscv

In [101]:
learning_rate = np.arange(0.01,0.05,0.01)
n_estimators = np.arange(200,401,100)
max_features = np.arange(3,8,2)
tgse = GBM(learning_rate, n_estimators, max_features, 10)
tgse.fit(X_train, y_train)
tgbe = tgse.best_estimator_
best_score = float('%4.3f' % round(tgse.best_score_, 3))
test_score = float('%4.3f' % round(tgse.score(X_test, y_test), 3))

assert_equal(isinstance(tgbe, GradientBoostingClassifier), True)
assert_equal(isinstance(tgse, RandomizedSearchCV), True)
assert_almost_equal(0.938, best_score, places = 3)
assert_almost_equal(0.956, test_score, places = 3)
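
The saving from randomized search can be made concrete with scikit-learn's `ParameterSampler`, which draws parameter settings from a grid. The sketch below uses plain lists as stand-ins for the numpy arrays in the test cell, and the `demo_` names are hypothetical. It is meant only to show how many settings are tried, not to reproduce the exact candidates evaluated above.

```python
from sklearn.model_selection import ParameterSampler

# Stand-ins for the arrays passed to GBM() in the test cell
demo_grid = {'learning_rate': [0.01, 0.02, 0.03, 0.04],
             'n_estimators': [200, 300, 400],
             'max_features': [3, 5, 7]}

# Size of the full grid: 4 * 3 * 3 combinations
n_total = 1
for values in demo_grid.values():
    n_total *= len(values)

# Draw 10 settings, mirroring n_iter=10 in the test cell
demo_sampled = list(ParameterSampler(demo_grid, n_iter=10, random_state=40))

# When every parameter is given as a list, settings are drawn without
# replacement, so no combination appears twice
demo_unique = {tuple(sorted(s.items())) for s in demo_sampled}

print(len(demo_sampled), "of", n_total, "combinations sampled")
for setting in demo_sampled[:3]:
    print(setting)
```

Only 10 of the 36 combinations are cross-validated, which is where the reduction in run time comes from. The trade-off is that the best combination in the full grid may not be among the sampled ones.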

