# Week 6 Problem 2

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says YOUR CODE HERE. Do not write your answer anywhere other than where it says YOUR CODE HERE. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then restart the kernel and run all cells (Restart & Run All).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

5. When you are ready to submit your assignment, go to Dashboard → Assignments and click the Submit button. Your work is not submitted until you click Submit.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. If your code does not pass the unit tests, it will not pass the autograder.



# Due Date: 6 PM, February 26, 2018


In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import seaborn as sns
from time import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import math
from sklearn import metrics
from sklearn.decomposition import NMF
from sklearn.decomposition import FastICA
from sklearn.decomposition import MiniBatchSparsePCA
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_frame_equal, assert_index_equal
from nose.tools import assert_false, assert_equal, assert_almost_equal, assert_true, assert_in, assert_is_not

# Suppress warnings (note: this silences every warning, not only Pandas ones)
import warnings
warnings.filterwarnings("ignore")

# Set global figure properties
import matplotlib as mpl
mpl.rcParams.update({'axes.titlesize' : 20,
                     'axes.labelsize' : 18,
                     'legend.fontsize': 16})

# Set default seaborn plotting style
sns.set_style('white')


## Breast Cancer Dataset

We will be using the built-in breast cancer dataset, which describes individual breast cancer cases. This dataset has 569 samples and 30 features. We will reduce the dimensionality of these features and build classifiers (k-nearest neighbors and a decision tree) that predict whether an individual case is malignant (harmful) or benign (non-harmful).

The code below imports the dataset as a pandas DataFrame. It also adds a column called target, which records whether each tumor was determined to be malignant or benign. Note: In this dataset, a malignant tumor has a value of 0 and a benign tumor has a value of 1.
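
As a quick check on the label encoding described above, here is a minimal sketch that reads the class names straight from the dataset object. The variable names `bunch` and `name_by_code` are illustrative only and are not part of the assignment.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

# target_names is ordered by label code: index 0 is malignant, index 1 is benign
name_by_code = {code: str(name) for code, name in enumerate(bunch.target_names)}
print(name_by_code)

# Shape of the raw feature matrix: (samples, features)
print(bunch.data.shape)
```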



In [53]:
# Load in the dataset as a Pandas DataFrame
data = load_breast_cancer()
cancer_data = pd.DataFrame(data.data, columns=data.feature_names)
cancer_data['target'] = data.target
# View the label distribution
print(cancer_data.target.value_counts(ascending=True))

features = cancer_data[cancer_data.columns]
labels = cancer_data.target
# Count the number of features
print("Number of features:", len(features.columns))

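# Without shuffle=True the folds are deterministic, so no random_state is needed
# (newer scikit-learn versions raise an error if one is passed without shuffling)
skf = StratifiedKFold(n_splits=5)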


0    212
1    357
Name: target, dtype: int64
Number of features: 31


# Problem 1

We can see that the Breast Cancer dataset above has 30 independent features. In this problem, we will perform PCA on this dataset to reduce the number of features.
Performing PCA on unscaled variables gives disproportionately large loadings to variables with high variance, so a principal component ends up depending mainly on the highest-variance variable. This is undesirable.
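
To illustrate why scaling matters, here is a small NumPy-only sketch on synthetic data (the two features and the helper `first_component` are hypothetical, chosen only for this illustration). It compares the first principal direction before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales (hypothetical data)
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X = np.column_stack([small, large])

def first_component(data):
    """Return the unit-length direction of greatest variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    return eigvecs[:, -1]

# Unscaled: the first component points almost entirely along the large-scale feature
pc_raw = first_component(X)

# Standardized (zero mean, unit variance): neither feature dominates by scale alone
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pc_scaled = first_component(X_scaled)

print(np.abs(pc_raw))
print(np.abs(pc_scaled))
```

In the unscaled case the loading on the large-scale feature is close to 1, while after standardization the two loadings have equal magnitude.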

Create a function that performs PCA on the features passed as a parameter. A scale parameter specifies whether to standardize the features first. The function should return the number of new variables (principal components) whose cumulative explained variance is at most the fraction **'n'** (for example, 0.9 for 90 percent), along with the transformed feature set (after PCA).
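
The counting step can be sketched on its own with made-up numbers. The ratios below are hypothetical, not values from the breast cancer data, and the cutoff is chosen to avoid landing exactly on a cumulative total.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest
ratios = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

# Running total of variance explained by the first k components
cumulative = np.cumsum(ratios)

# Count the leading components whose running total stays at or below the cutoff
n = 0.92
count = int(np.sum(cumulative <= n))
print(cumulative, count)
```

Because the ratios are sorted in decreasing order and the running total only grows, counting the entries at or below the cutoff is the same as counting the leading components.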

In [54]:
def pca(features, n , scale):
    '''
    Fit PCA model on features and return the transformed dataset along with the number of variables
    (after PCA) explaining n percentage variance.
    
    Parameters
    ----------
    features : variables on which PCA is to be applied
    n : percentage cutoff of cumulative variance explained by variables after PCA 
    scale : A boolean value(if true, then transform variables using StandardScaler else do nothing)
    
    Returns
    -------
    A tuple of 2 containing(in respective order) :
    Number of variables explaining variance less than or equal to n(just the length), 
    The transformed data after PCA using fit_transform.
    
    Hint
    ----
    Find the cumulative variance explained and then take the length of (cumulative variance<=n).
    Remember to transform the dataset after PCA (if scale is True, there will be 2 transformations).
    '''
    
    # Optionally standardize the features (zero mean, unit variance) before PCA
    if scale:
        features = StandardScaler().fit_transform(features)

    # Fit the PCA model and project the data onto the principal components
    model = PCA()
    features = model.fit_transform(features)

    # Count the leading components whose cumulative explained variance is at most n
    cum_var = np.cumsum(model.explained_variance_ratio_)
    length = int(np.sum(cum_var <= n))

    return length, features
    

In [56]:
length, features_reduced = pca(features, 0.9, True)
assert_equal(length, 6)
assert_equal(isinstance(features_reduced, np.ndarray), True)
assert_almost_equal(features_reduced[0][0], 9.22577, 3)
length1, features_reduced1 = pca(features, 0.995, False)
assert_equal(length1, 1)
assert_almost_equal(features_reduced1[0][0], 1160.1427, 2)


# Problem 2

First subset the transformed feature set to its first 'length' columns, so that the new feature space has 'length' dimensions. Then create a train-test split with test_size=0.3 and random_state=23.

In the code cell below, we will create a KNeighborsClassifier to classify the tumor as malignant or benign. We will use 'skf' for cross-validation and Grid Search to try the different neighbor counts in **k_vals**.
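
For orientation, here is a minimal sketch of the same pattern on synthetic stand-in data. The dataset from `make_classification`, the neighbor range, and the names `folds` and `grid` are assumptions made for this illustration and do not reproduce the assignment's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical; not the breast cancer features)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=23)

# Stratified folds keep the class proportions similar in every fold
folds = StratifiedKFold(n_splits=5)

# Try odd neighbor counts so that binary votes cannot tie
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': np.arange(1, 21, 2)},
                    cv=folds)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```

After fitting, `best_params_` holds the neighbor count with the highest mean cross-validation accuracy, and `score` evaluates the refit best estimator on the held-out test set.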



In [79]:
def knn(k_vals, length, features_reduced, labels):
    '''
    Subset the features based on the length parameter passed in the function i.e. the new feature space should 
    be only of 'length' dimensions.

    Perform a train-test split with this reduced feature set and labels with the following parameters:
    test_size=0.3, random_state=23

    Create a Grid Search cross validator(cv=skf) for KNeighborsClassifier where param_grid will be a dictionary 
    containing n_neighbors as hyperparameter and k_vals as values. 
    
    Parameters
    ----------
    k_vals : range of nearest neighbors value passed as a numpy array
    length : number of dimensions that the feature set(transformed after PCA) should be reduced to.
    features_reduced : the reduced features set obtained after PCA transformation.
    labels : the original labels from the data
    
    Returns
    -------
    Training features as a multi dimensional numpy array (contains 70% of the features)
    Testing features as a multi dimensional numpy array (contains 30% of the features)
    Training labels as pandas series (contains 70% of the labels)
    Testing labels as pandas series (contains 30% of the labels)
    Grid search cross validator instance which has the knn as estimator, parameter grid containing neighbor values 
    and cross-validation = 'skf' as parameters.
    
    Hint
    ----
    features_reduced is a 2D numpy array containing 569 observations and 31 features. 
    Make sure to subset this properly(i.e. don't subset the observations instead of features)  
    '''
    #subset the features
    data_reduced = features_reduced[:, :length]
    
    #train-test-split
    X_train, X_test, y_train, y_test \
       = train_test_split(data_reduced, labels, test_size = 0.3, random_state = 23)
    
    #create an estimator 
    knn = KNeighborsClassifier()
    
    # Create a dictionary of hyperparameters and values
    params = dict(n_neighbors=k_vals)
    
    #create a grid search cross validator
    gse = GridSearchCV(estimator = knn, param_grid = params, cv = skf)
    
    return X_train, X_test, y_train, y_test, gse

In [80]:
k_vals = np.arange(1,101,2)
f_train, f_test, l_train, l_test, gknn = knn(k_vals, length, features_reduced, labels)
gknn.fit(f_train, l_train)
bknn=gknn.best_estimator_
assert_equal(type(f_train), type(np.ndarray(0)))
assert_equal(type(f_test), type(np.ndarray(0)))
assert_equal(type(l_train), type(pd.core.series.Series()))
assert_equal(type(l_test), type(pd.core.series.Series()))
assert_equal(isinstance(gknn, GridSearchCV), True)
assert_equal(isinstance(bknn, KNeighborsClassifier), True)
assert_almost_equal(0.972, gknn.best_score_, 2)
assert_almost_equal(0.982, gknn.score(f_test, l_test), 2)
assert_equal(len(f_train[0,:]), 6)


# Problem 3

In this problem, we will create a function that creates an instance of one of the following methods: PCA, NMF, FactorAnalysis, FastICA, or MiniBatchSparsePCA. Use **n_components** as the hyperparameter. Your function should return the method instance along with the transformed dataset.

***Hint***: Use fit_transform to transform the dataset.
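
As a sketch of what the hint means, the example below uses PCA on random data (the array `X` is hypothetical) to show that `fit_transform` gives the same result as calling `fit` and then `transform` on the same data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # hypothetical data: 50 samples, 5 features

# One call: learn the components and project the data onto them
one_step = PCA(n_components=3).fit_transform(X)

# Two calls: the same projection, with the fitted model kept for reuse on new data
model = PCA(n_components=3).fit(X)
two_step = model.transform(X)

print(one_step.shape)
print(np.allclose(one_step, two_step))
```

Keeping the fitted model around, as in the two-call form, is useful when the same transformation must later be applied to data that was not used for fitting.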

In [81]:
def Reduction(method, features, n):
    '''
    This function will use method parameter to create an instance for one of the following methods:
    PCA, NMF, FactorAnalysis, FastICA and MiniBatchSparsePCA and return the transformed dataset along with method used.
    
    Parameters
    ----------
    method : one of either PCA, NMF, FactorAnalysis, FastICA or MiniBatchSparsePCA
    features : feature set that should be transformed based on the method used.
    n : n_components 
    
    Returns
    -------
    A tuple of 2 containing the instance of the method created and the transformed features based on 
    model used respectively
    '''

    # Map each method name to its scikit-learn class
    methods = {'PCA': PCA,
               'NMF': NMF,
               'FactorAnalysis': FactorAnalysis,
               'FastICA': FastICA,
               'MiniBatchSparsePCA': MiniBatchSparsePCA}

    # Any unrecognized name falls back to MiniBatchSparsePCA
    reducer = methods.get(method, MiniBatchSparsePCA)

    # Create the instance and transform the features
    model = reducer(n_components=n)
    features = model.fit_transform(features)

    return model, features

In [82]:
pca_model, X1 = Reduction("PCA", features, 3)
nmf_model, X2 = Reduction("NMF", features, 3)
fa_model, X3 = Reduction("FactorAnalysis", features, 3)
fica_model, X4 = Reduction("FastICA", features, 3)
sp_model, X5 = Reduction("MiniBatchSparsePCA", features, 3)

assert_equal(type(pca_model), PCA)
assert_equal(type(nmf_model), NMF)
assert_equal(type(fa_model), FactorAnalysis)
assert_equal(type(fica_model), FastICA)
assert_equal(type(sp_model), MiniBatchSparsePCA)
assert_almost_equal(X1[0][0], 1160.14274385, 0)

f_train, f_test, l_train, l_test \
    = train_test_split(X2, labels, 
                        test_size=0.3, random_state=23)
model = DecisionTreeClassifier(random_state=40, max_depth = 5)
model.fit(f_train, l_train)
predicted = model.predict(f_test)

assert_almost_equal(X2[0][0], 12.84946978, 0)
assert_almost_equal(0.9532, metrics.accuracy_score(l_test, predicted), 2)

