# Week 3 Problem 1

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Write your answer only where it says `YOUR CODE HERE`; anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Due Date: 6 PM, February 5, 2018

In [1]:
import os
from nose.tools import assert_equal, assert_true, assert_almost_equal
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# load_boston is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.datasets import load_breast_cancer
# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

## Breast Cancer Dataset
For this assignment we will use scikit-learn's built-in breast cancer dataset, which describes individual breast cancer cases. The dataset has 569 samples and 30 features. We will use these features and a Random Forest classifier to build a model that predicts whether an individual case is malignant (harmful) or benign (non-harmful). Throughout the assignment, we will vary how many features the model considers at each split and then rank the features by importance.

The code below imports the dataset as a pandas DataFrame and previews the first few rows. It also adds a column called `classification` that records whether each case was determined to be a malignant or benign tumor. **Note: In this dataset, a malignant tumor has a value of 0 and a benign tumor has a value of 1.**

In [2]:
'''
NOTE: Make sure to load this data set before completing the assignment
'''
# Load in the dataset as a Pandas DataFrame
data = load_breast_cancer()
data_df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the class label (0 = malignant, 1 = benign) as a new column
data_df['classification'] = data.target

# Preview the first few lines
data_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,classification
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
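
To double-check the label encoding noted above, the short sketch below looks up the class names that ship with the dataset and counts the samples per class. The names `cancer`, `label_names`, and `counts` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the same built-in dataset used in this notebook
cancer = load_breast_cancer()

# target_names[i] is the human-readable name for integer label i
label_names = {i: str(name) for i, name in enumerate(cancer.target_names)}
print(label_names)

# Count how many samples carry each integer label
counts = np.bincount(cancer.target)
for label, count in enumerate(counts):
    print(f"{label_names[label]}: {count} samples")
```

The classes are not evenly balanced, which is worth keeping in mind when reading accuracy figures later in the notebook.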


In [3]:
'''
Separate the dataset into training and testing data using the `train_test_split` function.
The training and testing data are used in all subsequent questions, so make sure to finish this
question before proceeding to the other questions.

- Set the test size to 0.3
- Set the random_state to 23
- Use data_df_clean and labels variables below as parameters to the `train_test_split` function
'''

data_df_clean = data_df[data_df.columns[:-1]]
labels = data.target

x_train, x_test, y_train, y_test = train_test_split(data_df_clean, labels,
                                                   test_size = 0.3,
                                                   random_state = 23)
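
As a quick sanity check on a split like the one above, the sketch below repeats it on the raw arrays and inspects the resulting shapes. The `demo_` names are illustrative and chosen so they do not overwrite the notebook's `x_train`, `x_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Raw feature matrix (569 x 30) and label vector
features, target = load_breast_cancer(return_X_y=True)

# Same settings as the assignment: 30% test data, random_state 23
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    features, target, test_size=0.3, random_state=23)

print(demo_x_train.shape, demo_x_test.shape)

# A fixed random_state makes the split reproducible across runs
_, repeat_x_test, _, _ = train_test_split(
    features, target, test_size=0.3, random_state=23)
print(np.array_equal(demo_x_test, repeat_x_test))
```

With 569 samples and `test_size = 0.3`, the test set is rounded up to 171 rows, leaving 398 rows for training.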


## Question 1 - Construct a Random Forest Classifier

Complete the function `get_rfc` below, which returns a random forest classifier trained on the breast cancer dataset. The function parameters are `n_estimators` and `max_features`. Create a random forest classifier using the values passed into `get_rfc`, then fit the classifier to the training data. **NOTE: you will need to set the random_state to 23 for the classifier.**

In [4]:
def get_rfc(n_estimators, max_features):
    '''
    Return a Random Forest Classifier fitted to the training data
    
    Parameters
    ----------
    n_estimators: An integer, the number of trees in the forest
    max_features: An integer, the number of features considered at each split
    
    Returns
    -------
    A fitted RandomForestClassifier object
    '''
    rfc = RandomForestClassifier(n_estimators = n_estimators, 
                                 max_features = max_features,
                                 random_state = 23)
    
    # Fit the estimator to the training data (x_train and y_train are defined above)
    rfc.fit(x_train, y_train)
    
    return rfc

In [5]:
rfc_model = get_rfc(10, 10)
assert_true(isinstance(rfc_model, RandomForestClassifier))
assert_equal(rfc_model.n_estimators, 10)
assert_equal(rfc_model.max_features, 10)
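
The reason for fixing `random_state` can be illustrated with a small sketch: two forests built with the same seed and settings should produce identical predictions, which is what lets an autograder compare against known values. The helper `fit_forest` and the variable names here are hypothetical and separate from the graded `get_rfc`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

features, target = load_breast_cancer(return_X_y=True)
tr_x, te_x, tr_y, te_y = train_test_split(
    features, target, test_size=0.3, random_state=23)

def fit_forest(seed):
    """Hypothetical helper: fit a 10-tree forest with the given seed."""
    forest = RandomForestClassifier(n_estimators=10, max_features=10,
                                    random_state=seed)
    return forest.fit(tr_x, tr_y)

# Two independent fits with the same seed
first = fit_forest(23)
second = fit_forest(23)

# Identical seeds should give identical trees, predictions, and importances
same_predictions = np.array_equal(first.predict(te_x), second.predict(te_x))
same_importances = np.allclose(first.feature_importances_,
                               second.feature_importances_)
print(same_predictions, same_importances)
```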

## Question 2 - Find the most accurate number of `max_features`

In some scenarios, `auto` might not be the value for the `max_features` parameter that yields the model with the highest prediction accuracy. We want to find the integer number of features that gives the highest prediction accuracy. Our previous `get_rfc` function lets us build models with different values of `max_features`.

Complete the function `find_best_max_features`, which takes two parameters, `data_df` and `labels`, iterates `max_features` from 1 to 30 (the number of features in the dataset), and determines the value that yields the highest prediction accuracy. Return a 2-tuple `(number_of_features, max_accuracy)` containing the best `max_features` value and the accuracy it achieved.

To find the prediction accuracy of a model, call the `score()` method on the `RandomForestClassifier` object, passing in the testing data as parameters. **You will need to multiply the return value by 100 and return the `max_accuracy` as a percent instead of a decimal.**

**NOTE: For the `get_rfc` function call, use 10 as the value for the `n_estimators` parameter. The autograder will check that you have called `get_rfc`**
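
As a rough illustration of what `score()` reports for a classifier (the fraction of test samples predicted correctly), the sketch below computes the same quantity by hand on toy labels and converts it to a percent. The function `accuracy_percent` and the toy lists are hypothetical and not part of the assignment.

```python
def accuracy_percent(y_true, y_pred):
    """Hypothetical helper: percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return 100.0 * correct / len(y_true)

# Toy labels: 4 of the 5 predictions match
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]

print(accuracy_percent(toy_true, toy_pred))  # 80.0
```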

In [23]:
def find_best_max_features(x = data_df, y = labels):
    '''
    Return the highest predictive accuracy and the respective max_features
    
    Parameters
    ----------
    x: A pandas DataFrame (kept for the required signature; not used directly)
    y: An array of labels (kept for the required signature; not used directly)
    
    Returns
    -------
    number_of_features, max_accuracy: A 2-tuple of integer, float
    '''
    # Accuracy (in percent) for each value of max_features, in order
    score_list = []
    
    # Try every value from 1 to 30 inclusive (range excludes its upper bound)
    for i in range(1, 31):
        # Build the random forest; get_rfc fits on the global training split
        rfc = get_rfc(10, i)
        
        # Compute the accuracy on the test split as a percentage
        score = 100.0 * rfc.score(x_test, y_test)
        score_list.append(score)
        
    # Pick the highest accuracy; list index i corresponds to max_features = i + 1
    max_accuracy = max(score_list)
    number_of_features = score_list.index(max_accuracy) + 1
    
    return number_of_features, max_accuracy
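
One detail of the selection step is how ties are handled: `list.index` returns the first match, so when several `max_features` values reach the same top accuracy, the smallest one is reported. A minimal sketch with made-up scores (the numbers are hypothetical, not results from this dataset):

```python
# Hypothetical accuracies (in percent) for max_features = 1, 2, 3, 4
toy_scores = [91.2, 95.3, 94.7, 95.3]

best_score = max(toy_scores)

# index() finds the first occurrence; add 1 because max_features starts at 1
best_k = toy_scores.index(best_score) + 1

print(best_k, best_score)  # 2 95.3
```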

In [24]:
(max_features, accuracy) = find_best_max_features()
assert_true(accuracy > 90.0)
assert_true(1 <= max_features <= 30)

In [25]:
# Check that `get_rfc` is used by solutions where it has been explicitly required.

orig_get_rfc = get_rfc
del get_rfc

# With get_rfc deleted, a solution that calls it should fail
try:
    find_best_max_features()

# If a NameError is thrown, that means get_rfc has been used
except NameError:
    pass

# If no error is thrown, that means get_rfc has not been used
else:
    raise AssertionError("get_rfc has not been used in find_best_max_features")

# Restore the original function
finally:
    get_rfc = orig_get_rfc
    del orig_get_rfc
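
The check above relies on Python looking up global names at call time rather than at definition time. The sketch below shows the same idea in isolation, using hypothetical functions `helper`, `uses_helper`, `ignores_helper`, and `calls_helper` that are not part of the assignment.

```python
def helper():
    return 42

def uses_helper():
    # Looks up the global name `helper` each time it runs
    return helper()

def ignores_helper():
    # Computes the same result without calling helper
    return 42

def calls_helper(func):
    """Return True if func() raises NameError while `helper` is deleted."""
    global helper
    saved = helper
    del helper  # temporarily remove the global name
    try:
        func()
    except NameError:
        return True
    else:
        return False
    finally:
        helper = saved  # always restore the original function

print(calls_helper(uses_helper))
print(calls_helper(ignores_helper))
```

A limitation of this approach is that any other `NameError` raised inside the tested function would also be read as evidence that the helper was called.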

## Question 3 - Rank feature importance

Complete the function `rank_feature_names` below, which returns a list of 2-tuples of (`feature_name`, `feature_importance`) ranked from the most important feature to the least important. The function takes one parameter, `n_estimators`, which is passed to `get_rfc` to build the rfc model. Set the `max_features` argument of `get_rfc` to the length of the `feature_names` variable. The tests expect each importance as a percentage (the raw value multiplied by 100). **Hint: You can access the feature importances for a model by using the `feature_importances_` attribute of the RandomForestClassifier object**.

In [38]:
def rank_feature_names(n_estimators):
    '''
    Return a list of 2-tuples of (feature_name, feature_importance) 
    that is ranked from the most important feature to least important
    
    Parameters
    ----------
    n_estimators: An integer
    
    Returns
    -------
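    A list of 2-tuples where each tuple is (string, float); importances are in percent
    '''
    
    feature_names = data.feature_names
    
    # Build the rfc model, letting every split consider all features
    rfc_model = get_rfc(n_estimators, len(feature_names))
    
    # Pair each feature name with its importance (in percent) and sort ascending
    sort_data = sorted(zip(feature_names, 100*(rfc_model.feature_importances_)), key=lambda x: x[1])
    # Reverse so the most important feature comes first
    results = list(reversed(sort_data))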
    
    return results
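
The ranking step can be tried on its own with made-up numbers. The sketch below pairs hypothetical importances with a few feature names, scales them to percent, and sorts them from most to least important; the values are illustrative and not taken from a fitted model.

```python
# Hypothetical feature names and raw importances (fractions that sum to 1)
toy_names = ["mean radius", "mean texture", "worst area"]
toy_importances = [0.25, 0.05, 0.70]

# Pair each name with its importance in percent, then sort descending
ranked = sorted(
    zip(toy_names, (100 * value for value in toy_importances)),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.1f}%")
```

Passing `reverse=True` to `sorted` is an alternative to sorting ascending and reversing afterwards; the two can order exactly tied values differently, which rarely matters for floating-point importances.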

In [40]:
rankings = rank_feature_names(10)
assert_equal(len(rankings), 30)
assert_equal(rankings[0][0], 'worst concave points')
assert_almost_equal(rankings[0][1], 33.933, places=3)