# Week 6 Problem 4

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Due Date: 6 PM, February 26, 2018

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from numpy.testing import assert_array_equal
from nose.tools import assert_equal, assert_true, assert_almost_equal, assert_is_instance, assert_is_not
from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_regression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC
from sklearn.datasets import load_boston
from sklearn.decomposition import PCA
# Suppress all warnings to keep the notebook output clean
import warnings
warnings.filterwarnings("ignore")

## Boston Dataset
For this assignment we will use scikit-learn's built-in Boston housing dataset, which describes areas of Boston and their house prices. The dataset has 506 samples and 13 features. Each record contains data such as the crime rate, the average number of rooms per dwelling, and other factors. The code below loads the dataset into a pandas DataFrame and previews the first few rows.

In [2]:
'''
NOTE: Make sure to load this data set before completing the assignment
'''
# Load the Boston housing dataset
data = load_boston()

# Convert the feature matrix to a pandas DataFrame with named columns
df = pd.DataFrame(data.data, columns=data.feature_names)

# Preview the first few rows
df.head()

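| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |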


## Question 1

In this question, we will return the features that two feature selection strategies, `SelectKBest` and `SelectPercentile`, have in common. As covered in the reading, both strategies select the highest-scoring features in the dataset: `SelectKBest` keeps a fixed number `k` of them, while `SelectPercentile` keeps a given percentage.

- Use the `features` and `labels` function parameters to fit a `SelectKBest` and a `SelectPercentile` feature selection strategy.
- For `SelectKBest`, use `'all'` for the `k` parameter.
- For `SelectPercentile`, use `mutual_info_regression` as the `score_func` parameter and `20` as the `percentile` parameter.
- You can access the scores for a feature selection (FS) strategy through the `scores_` attribute of the object returned by the `fit()` call.
- **Hint**: The scores are stored in the same order as the features, so index `i` of the `scores_` attribute corresponds to index `i` of the `feature_names` function parameter. For example, the score at index 0 of the result of `select_percentile.fit()` belongs to the feature name at index 0 of `feature_names`.
- Return the features that appear in the top `k` of both `SelectPercentile` and `SelectKBest` as a list of strings, ordered by their `SelectKBest` rank.
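
The index alignment described in the hint can be illustrated with a minimal sketch. It does not use the Boston data: the feature matrix, the target, and the names `a` to `d` are synthetic assumptions made up for illustration. The target is rounded so that its values repeat, which keeps the default `SelectKBest` scoring function (`f_classif`) well defined.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_regression

# Synthetic stand-in data (an assumption, not the Boston dataset):
# only columns "b" and "d" influence the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))
y = np.round(3.0 * X[:, 1] + 3.0 * X[:, 3])  # rounded so target values repeat
names = np.array(["a", "b", "c", "d"])

# scores_[i] is the score of column i, i.e. of names[i]
skb_scores = SelectKBest(k="all").fit(X, y).scores_
sp_scores = SelectPercentile(mutual_info_regression, percentile=20).fit(X, y).scores_

# Rank columns by score (highest first) and map the indices back to names
k = 2
top_skb = names[np.argsort(skb_scores)[::-1][:k]]
top_sp = names[np.argsort(sp_scores)[::-1][:k]]

# Keep the SelectKBest order, filtered to names also in the SelectPercentile top k
overlap = [str(name) for name in top_skb if name in top_sp]
print(overlap)
```

Because both score arrays follow the column order of `X`, a single `argsort` per strategy is enough to recover the feature names.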

In [67]:
def get_overlapping_features(features, labels, k, feature_names):
    '''    
    Get the features that appear in the top k of both SelectKBest and SelectPercentile
    
    Parameters
    ----------
    features: A numpy.ndarray
    labels: A numpy.ndarray
    k: An int
    feature_names: A list of strings
    
    Returns
    -------
    overlapping_features: A list of strings
    '''
    
    # The result follows the ordering of the top k SelectKBest scores.
    # Convert to an array so the names can be indexed with an array of indices.
    feature_names = np.asarray(feature_names)

    # Fit SelectKBest and take the names of the top k scores, highest first
    skb = SelectKBest(k='all')
    score_skb = skb.fit(features, labels).scores_
    index1 = score_skb.argsort()[::-1][:k]
    name_skb = feature_names[index1]

    # Fit SelectPercentile and take the names of its top k scores the same way
    sp = SelectPercentile(mutual_info_regression, percentile=20)
    score_sp = sp.fit(features, labels).scores_
    index2 = score_sp.argsort()[::-1][:k]
    name_sp = feature_names[index2]

    # Keep the SelectKBest names that also appear in the SelectPercentile top k
    overlapping_features = [name for name in name_skb if name in name_sp]

    return overlapping_features

In [69]:
features = data.data
labels = data.target
overlapping_features = get_overlapping_features(features, labels, 4, data.feature_names)
assert_equal(len(overlapping_features), 3)
assert_equal(overlapping_features, ['LSTAT', 'RM', 'NOX'])
overlapping_features = get_overlapping_features(features, labels, features.shape[1], data.feature_names)
assert_equal(len(overlapping_features), len(data.feature_names))
assert_equal(sorted(overlapping_features), sorted(data.feature_names))

## Question 2

In this question, you will use Principal Component Analysis (PCA) to return the weights of the component with the highest `explained_variance_ratio_` as a dictionary, together with that ratio.

- Use the `features` parameter to fit a `PCA` model.
- Pass the `num_c` function parameter to the `PCA` constructor as the `n_components` parameter.
- Return a 2-tuple of a dictionary and the highest `explained_variance_ratio_` value (as a percentage, i.e. multiplied by 100).
- The dictionary should map strings to floats: the keys are the individual feature names, and the values are the weights of each feature in the component with the highest `explained_variance_ratio_`.
- **Hint: The highest ratio is at index 0 of the `explained_variance_ratio_` array. Correspondingly, the weights for each feature name are at index 0 of the `components_` attribute.**
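
The relationship between `explained_variance_ratio_` and `components_` in the hint can be sketched on synthetic data. The feature matrix and the names `x`, `y`, `z` below are assumptions chosen for illustration, with one column given a much larger variance than the others so the first component is easy to interpret.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (an assumption): column "x" has by far the largest variance.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 3)) * np.array([10.0, 1.0, 0.1])
names = ["x", "y", "z"]

pca = PCA(n_components=3).fit(X)

# The ratios are sorted in decreasing order, so index 0 holds the largest one.
top_ratio = pca.explained_variance_ratio_[0] * 100

# Row 0 of components_ holds one weight per input feature for that component.
top_weights = dict(zip(names, pca.components_[0]))

print(round(top_ratio, 2))
print(top_weights)
```

In this sketch the first component is dominated by `x`, so its weight has an absolute value close to 1. The sign of a component is arbitrary, which is why only the magnitude is meaningful here.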

In [86]:
def best_decomposer(features, num_c, feature_names):
    '''
    Returns the corresponding weights of each feature as a dictionary with respect to the highest
    explained_variance_ratio_ using PCA
    
    Parameters
    ----------
    features: A numpy.ndarray
    num_c: An int
    feature_names: A numpy.ndarray
    
    Returns
    -------
    A 2-tuple of a dictionary and a float 
    '''
    #fit the PCA
    pca = PCA(n_components=num_c)
    pca.fit(features)
    
    highest_ratio = pca.explained_variance_ratio_[0] * 100
    weights = pca.components_[0, :]
    
    #create a dictionary using feature_names and weights
    dictionary = dict(zip(feature_names, weights))
    
    return dictionary, highest_ratio

In [87]:
features = data.data
weights, highest_ratio = best_decomposer(features, 13, data.feature_names)
assert_true(isinstance(weights, dict))
assert_true(isinstance(highest_ratio, float))
assert_almost_equal(weights['AGE'], 0.0836, places=4)
assert_almost_equal(weights['CRIM'], 0.0291, places=4)
assert_almost_equal(highest_ratio, 80.5815, places=4)

## Question 3

In this question, you will create a Machine Learning Pipeline that contains a Feature Union.

- Create a `FeatureUnion` object that contains a `SelectKBest` and a `SelectPercentile` feature selection strategy.
- Use `percentile=10` for `SelectPercentile`.
- Use `k='all'` for the `k` parameter of `SelectKBest`.
- For the pipeline, combine the feature union from above with a `LinearSVC` model with `random_state=23`, and return the `Pipeline` object.
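
To see what a feature union of these two strategies produces, here is a minimal sketch on synthetic classification data. The data, the step names `sp`, `skb`, `feature_union`, and `svc`, and the variable names are all assumptions for illustration. A `FeatureUnion` concatenates the outputs of its transformers column-wise, so the combined matrix is wider than the input.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Synthetic classification data (an assumption): 10 features, binary labels.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

union = FeatureUnion([
    ("sp", SelectPercentile(percentile=10)),  # keeps the top 10% of columns
    ("skb", SelectKBest(k="all")),            # keeps every column
])

# The selected columns from "sp" are placed next to all columns from "skb".
combined = union.fit_transform(X, y)
print(combined.shape)

# The union can then act as the first step of a pipeline ending in a classifier.
model = Pipeline([("feature_union", union), ("svc", LinearSVC(random_state=23))])
model.fit(X, y)
print(model.score(X, y))
```

Note that `k` takes the string `"all"`. Passing the built-in function `all` instead would not be recognized as a valid value when the selector is fitted.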

In [90]:
def get_pipeline():
    '''
    Get a pipeline that contains both a FeatureUnion made of a SelectKBest and SelectPercentile FS strategies
    and a LinearSVC model with random_state=23
    
    Parameters
    ----------
    
    Returns
    -------
    A Pipeline object
    '''  
    # SelectPercentile must be the first FS strategy in the FeatureUnion
    # and SelectKBest the second.
    
    fu = FeatureUnion([("sp", SelectPercentile(percentile=10)),
                       ("skb", SelectKBest(k='all'))])
    
    # Feature selection as part of a pipeline
    pl = Pipeline([('feature_union', fu),
                   ('svc', LinearSVC(random_state=23))])
    
    return pl

In [91]:
pipeline = get_pipeline()
assert_true(isinstance(pipeline.get_params()['feature_union'], FeatureUnion))
assert_equal(len(pipeline.get_params()['feature_union'].transformer_list), 2)
fs_1, fs_2 = pipeline.get_params()['feature_union'].transformer_list[0], pipeline.get_params()['feature_union'].transformer_list[1]
assert_true(isinstance(fs_1[1], SelectPercentile) or isinstance(fs_2[1], SelectPercentile))
assert_true(isinstance(fs_1[1], SelectKBest) or isinstance(fs_2[1], SelectKBest))