# Module 6 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
2. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and restart the kernel and run all cells (Restart & Run all).
3. Do not change the title (i.e. file name) of this notebook.
4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).
5. All work must be your own. If you do use any code from another source (such as a course notebook or a website), you need to properly cite the source.

-----

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import sklearn.preprocessing
from sklearn.model_selection import train_test_split

import helper as hp
from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

-----

## Loading Breast Cancer Data

In this assignment, we will work with a breast cancer data set to build predictive models. Before building a model, we first load the data into the assignment notebook and randomly sample several rows. The second Code cell creates the features and labels.

-----

In [None]:
# Load data
df = pd.read_csv('./breast-cancer-wisconsin.data')
df.sample(5)

In [None]:
# Create features and labels
labels = df['class']
features = df.drop('class', axis=1)

-----

## Problem 1: Variance Thresholding

In this problem, you will create and implement a function to perform variance thresholding. Specifically, you must complete the following tasks:

- Define a function called `VarianceThreshold` that accepts three arguments: `features`, `labels`, and `threshold`, in this order.
- Create a [`VarianceThreshold`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) feature selector by using the scikit-learn library.
  - Use the default parameters for this selector, except set the `threshold` parameter to the `threshold` argument provided to your function.
- Apply this feature selector to fit and transform the feature and label arguments provided to your function.
- Return the selected features and the variances of the individual features, in this order.

**TIP**: Notice that the function you write will be defined as `VarianceThreshold`. Thus, you should not import `VarianceThreshold` from `sklearn.feature_selection`, since your own definition would shadow it. Instead, use the _dot_ operator to reference the `VarianceThreshold` class from the `feature_selection` module in the scikit-learn library.
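
As a rough illustration of the selector described above, the following sketch applies variance thresholding to a small, made-up array rather than to the assignment data. The array `X_toy` and the threshold of `1.0` are assumptions chosen only for this example; the class is referenced through the `feature_selection` module, as the tip suggests.

```python
import numpy as np
from sklearn import feature_selection

# Hypothetical toy data: column 0 is constant, column 1 varies a little,
# and column 2 varies a lot.
X_toy = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
])

# Reference the class through the module (dot operator) so that a
# user-defined function with the same name would not shadow it.
selector = feature_selection.VarianceThreshold(threshold=1.0)

# Keep only the columns whose variance exceeds the threshold.
X_kept = selector.fit_transform(X_toy)

# Per-feature variances computed during fitting.
toy_variances = selector.variances_

# Boolean mask marking which input columns were kept.
kept_mask = selector.get_support()

print(toy_variances)
print(kept_mask)
print(X_kept.shape)
```

Here only the last column should survive, because its variance is the only one above the threshold.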

-----

In [None]:
from sklearn import feature_selection

### YOUR CODE HERE

In [None]:
# Select features and compute relevant variances
selected_features, variances = VarianceThreshold(features, labels, 6)

# Test function
assert_equal(hp.vt_sf, selected_features.flatten().tolist())
assert_equal(variances.flatten().tolist(), [384635052873.4785, 7.945044792052974,
     9.381357331041032, 8.918538272070725, 8.193702316667704, 4.934873062387323,
     13.258254749844047, 5.992227040723361, 9.30512830956357, 2.9977641487794995])

-----

## Problem 2: Recursive Feature Elimination

In this problem, you will create and implement a function to perform recursive feature elimination. Specifically, you must complete the following tasks:

- Write a function called `RFE` that accepts four parameters: `features`, `labels`, `rs`, and `n`, in this order.
- Create a random forest classifier and assign the `random_state` parameter for this estimator to the `rs` argument passed into your function.
- Create a Recursive Feature Elimination (RFE) selector by using the scikit-learn library's [RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) object.
  - Use the random forest classifier as the estimator for the `RFE` object.
  - Set the `n_features_to_select` parameter to `n`, which was passed as an argument into your function.
- Fit the RFE estimator by using the features and labels.
- Return the RFE model.

When completing this problem, you should reference the **TIP** provided with Problem 1: Variance Thresholding.
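
To make the roles of `ranking_` and `support_` concrete, here is a minimal sketch of recursive feature elimination on synthetic data. Everything in it (the arrays `X_toy` and `y_toy`, the choice of ten trees, and selecting two features) is an assumption for illustration and is unrelated to the assignment data or its expected results.

```python
import numpy as np
from sklearn import ensemble, feature_selection

# Hypothetical synthetic data: four features, of which only the first
# two determine the binary label.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# A random forest supplies the feature importances that RFE uses to
# decide which feature to drop at each step.
forest = ensemble.RandomForestClassifier(n_estimators=10, random_state=0)

# Reference RFE through the module so that a user-defined function
# named RFE would not shadow it.
rfe_toy = feature_selection.RFE(forest, n_features_to_select=2)
rfe_toy.fit(X_toy, y_toy)

# support_ is a boolean mask of the selected features; ranking_ assigns
# rank 1 to selected features and larger ranks to eliminated ones.
n_kept = int(rfe_toy.support_.sum())
ranks = rfe_toy.ranking_

print(rfe_toy.support_)
print(ranks)
```

With the default step of one feature per iteration, the selected features all share rank 1 and each eliminated feature receives a distinct higher rank, which is the pattern visible in the test cell for this problem.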

-----

In [None]:
from sklearn import ensemble

### YOUR CODE HERE

In [None]:
# Create RFE estimator
rfe1 = RFE(features, labels, rs=0, n=1)

# Test function results
assert_equal(rfe1.ranking_.tolist(), [9, 7, 2, 1, 8, 5, 4, 3, 6, 10])

for i in range(1, len(features.columns)+1):
    rfe = RFE(features, labels, rs=0, n=i)
    print('Performing Recursive Feature Selection. Choosing', i, 'features.')
    if i == 1:
        assert_equal(rfe.ranking_.tolist(), [9, 7, 2, 1, 8, 5, 4, 3, 6, 10])
        assert_equal(rfe.support_.tolist(), [False, False, False, True, False, 
                                             False, False, False, False, False])
    elif i == 2:
        assert_equal(rfe.ranking_.tolist(), [8, 6, 1, 1, 7, 4, 3, 2, 5, 9])
        assert_equal(rfe.support_.tolist(), [False, False, True, True, False, 
                                             False, False, False, False, False])
    elif i == 3:
        assert_equal(rfe.ranking_.tolist(), [7, 5, 1, 1, 6, 3, 2, 1, 4, 8])
        assert_equal(rfe.support_.tolist(), [False, False, True, True, False, 
                                             False, False, True, False, False])
    elif i == 4:
        assert_equal(rfe.ranking_.tolist(), [6, 4, 1, 1, 5, 2, 1, 1, 3, 7])
        assert_equal(rfe.support_.tolist(), [False, False, True, True, False, 
                                             False, True, True, False, False])
    elif i == 5:
        assert_equal(rfe.ranking_.tolist(), [5, 3, 1, 1, 4, 1, 1, 1, 2, 6])
        assert_equal(rfe.support_.tolist(), [False, False, True, True, False, 
                                             True, True, True, False, False])
    elif i == 6:
        assert_equal(rfe.ranking_.tolist(), [4, 2, 1, 1, 3, 1, 1, 1, 1, 5])
        assert_equal(rfe.support_.tolist(), [False, False, True, True, False, 
                                             True, True, True, True, False])
    elif i == 7:
        assert_equal(rfe.ranking_.tolist(), [3, 1, 1, 1, 2, 1, 1, 1, 1, 4])
        assert_equal(rfe.support_.tolist(), [False, True, True, True, False, True, 
                                             True, True, True, False])
    elif i == 8:
        assert_equal(rfe.ranking_.tolist(), [2, 1, 1, 1, 1, 1, 1, 1, 1, 3])
        assert_equal(rfe.support_.tolist(), [False, True, True, True, True, True, 
                                             True, True, True, False])
    elif i == 9:
        assert_equal(rfe.ranking_.tolist(), [1, 1, 1, 1, 1, 1, 1, 1, 1, 2])
        assert_equal(rfe.support_.tolist(), [True, True, True, True, True, True, 
                                             True, True, True, False])
    else:
        assert_equal(rfe.ranking_.tolist(), [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
        assert_equal(rfe.support_.tolist(), [True, True, True, True, True, True, 
                                             True, True, True, True])


-----

For the final problem below, you will use the car data to relate car price to car specifications. In the following Code cells, we first load these data and sample several instances, then remove the non-numeric features, and finally extract the features (`X`) and labels (`y`).
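
The cleaning step described above can be sketched on a tiny, made-up table. The column names and values in `toy` are hypothetical, and `select_dtypes` is used here as an alternative way to keep only numeric columns; the notebook itself filters on `df.dtypes` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one text column and two numeric columns.
toy = pd.DataFrame({
    "make": ["a", "b", "c"],
    "horsepower": [100, 120, 110],
    "price": [13000.0, 16000.0, 12500.0],
})

# Keep only the numeric columns (an alternative to comparing dtypes
# against object).
numeric_only = toy.select_dtypes(include="number")

# Split into a feature table and a flat label array.
X_toy = numeric_only.drop("price", axis=1)
y_toy = np.ravel(numeric_only["price"])

print(list(X_toy.columns))
print(y_toy.shape)
```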

-----

In [None]:
# Load data
df = pd.read_csv('./imports-85.data')
df.sample(5)

In [None]:
# Eliminate non-numeric features
df_simple = df[df.columns[df.dtypes!=object]]
df_simple.head()

In [None]:
# Create feature and label arrays
X = df_simple.drop("price", axis=1)
y = np.ravel(df_simple["price"])

-----

## Problem 3: Principal Component Analysis

For this problem, you will complete the function `fit_pca`, provided below, to perform principal component analysis on a provided DataFrame, `df`. To complete this task, you must create an instance of the `PCA` estimator in the scikit-learn library, specifying the number of components to keep (which is passed into the `fit_pca` function via the `n_c` parameter). Using this estimator, you should fit and transform the DataFrame to compute a NumPy array that contains the reduced set of features.
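
As a hedged illustration of the fit-and-transform pattern, the sketch below runs `PCA` on a random DataFrame. The DataFrame `toy_df`, its column names, and the choice of two components are assumptions made for this example only and say nothing about the car data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy DataFrame with 50 rows and 5 numeric columns.
rng = np.random.RandomState(0)
toy_df = pd.DataFrame(rng.normal(size=(50, 5)), columns=list("abcde"))

# Keep the two directions of largest variance.
pca_toy = PCA(n_components=2)

# fit_transform centers the data, learns the components, and projects
# the rows onto them, returning a NumPy array by default.
reduced_toy = pca_toy.fit_transform(toy_df)

print(reduced_toy.shape)
print(pca_toy.explained_variance_ratio_)
```

Because PCA centers the data before projecting, each column of the reduced array has a mean of approximately zero.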

-----

In [None]:
from sklearn.decomposition import PCA

def fit_pca(df, n_c):
    """
    Perform principal component analysis
    
    Parameters
    ----------
    df: A pandas DataFrame containing the relevant features
    n_c: The number of principal components to keep
    
    Returns
    -------
    reduced: A NumPy array containing the PCA reduced features
    """
    
    ### YOUR CODE HERE
    
    return reduced

In [None]:
# Compute reduced features on car data
pca_data = fit_pca(X, n_c=X.shape[1])

# Test function
assert_almost_equal(sum(pca_data[0]), 0.60937, places=4)
assert_equal(pca_data.shape, (205, 15))

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 