# Week 1 Problem 1

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Due Date: 6 PM, January 22, 2017

In [1]:
import os
from nose.tools import assert_equal, assert_true
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.externals import joblib
# We do this to ignore several specific warnings
import warnings
warnings.filterwarnings("ignore")

## Boston Dataset
For this assignment we will be using the built-in dataset about the Boston area and its house prices. This dataset has 506 samples and 13 features. Each record contains data about the crime rate, the average number of rooms per dwelling, and other factors. The following code imports the dataset as a pandas dataframe and previews a few sample data points.

In [2]:
'''
NOTE: Make sure to load this data set before completing the assignment
'''
# Load the dataset
from sklearn.datasets import load_boston

data = load_boston()

# Print the dataset description
print(data.DESCR)

# Convert the dataset to a Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Preview the first few rows
df.head()

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

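| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |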


## Question 1 - Review Pandas

The purpose of this notebook is to review and implement some of the introductory machine learning concepts from Week 1. In particular, this problem reviews the Pandas library, which is used for large-scale data processing in Python. Complete the function `get_top_k_crime_rates`, which takes two parameters, `crime_rate` and `k`, and returns the top `k` records whose crime rate is over the `crime_rate` threshold. For example, given the first 5 records as the sample dataframe, a `crime_rate` threshold of 0.01, and a `k` value of 2, records 4 and 3 would be returned, in that order, since these are the 2 records with the highest crime rates over 0.01.

**NOTE: Filter the records by the CRIM attribute and the corresponding threshold in order to retrieve the records with crime rates higher than the threshold. Also make sure to cast the final object to a pd.DataFrame object as the autograder will be checking for this. The records should also be in descending order.**

**HINT: A useful resource to sort values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html**
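
The filter, sort, and truncate steps described in the note can be sketched on a toy dataframe. This is only an illustration, not the graded solution: the name `toy` is hypothetical, and its values are the CRIM column of the five records previewed earlier.

```python
import pandas as pd

# Toy data: the CRIM values of the first five Boston records previewed earlier
toy = pd.DataFrame({"CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905]})

threshold, k = 0.01, 2  # the illustrative values used in the example text

# 1) the boolean mask keeps rows above the threshold
# 2) sort_values orders them from highest to lowest CRIM
# 3) head(k) keeps the first k rows
top_k = toy[toy["CRIM"] > threshold].sort_values(by="CRIM", ascending=False).head(k)

print(top_k.index.tolist())  # [4, 3]
```

Each step returns a `pd.DataFrame`, so no extra cast is needed in this sketch.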

In [3]:
def get_top_k_crime_rates(crime_rate, k):
    '''
    Return the top k records that are over the crime_rate threshold
    
    Parameters
    ----------
    crime_rate: A double that represents the crime_rate threshold
    k: An integer that represents the number of records to return
    
    Returns
    -------
    crime_rate_values: A pandas.DataFrame that contains the top k records over the crime_rate threshold
    '''
    crime_rate_values = None
    crime_rate_values = df[df['CRIM'] > crime_rate].sort_values(by = 'CRIM', ascending = False)[0:k]
    return crime_rate_values

In [4]:
assert_true(isinstance(get_top_k_crime_rates(20.0, 2), pd.DataFrame))
assert_equal(len(get_top_k_crime_rates(20.0, 2)), 2)
assert_equal(get_top_k_crime_rates(20.0, 2).iloc[0]['CRIM'], 88.9762)

## Question 2 - Data Scaling

In the next few problems, you will explore how to split your data into training and testing sets using the `train_test_split` function in the scikit-learn library. Another important aspect of preprocessing is scaling the data across all features so that one feature with a much larger span does not dominate the algorithm.

As discussed in the lesson, there are several scaling methods. In this question, we will implement the range technique. If you look at the statistics for the various features, it is evident that not all of the features are normally distributed. Some features, like `B`, have a much higher maximum and a much larger range than many of the other features.
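
To make the range technique concrete, here is a minimal sketch of the underlying arithmetic, assuming the usual min-max definition x' = (x - min) / (max - min) applied per feature. The array `X` is hypothetical toy data, not part of the Boston dataset.

```python
import numpy as np

# Hypothetical toy feature matrix: column 0 has a small span, column 1 a large one
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

col_min = X.min(axis=0)  # per-feature minimum
col_max = X.max(axis=0)  # per-feature maximum

# Range (min-max) scaling: x' = (x - min) / (max - min), applied column by column
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, both columns span 0 to 1, so neither dominates a distance-based computation simply because of its units.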

Complete the function `data_scale_range`, which takes two parameters, `d_train` and `d_test` (the training and test data respectively), and returns `d_train_scaled` and `d_test_scaled` (as a 2-tuple), the scaled training and test data respectively. Use a `MinMaxScaler` to preprocess the data and apply the range scaling method. **Resource: http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range**
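
As a rough sketch of the pattern on made-up data (the array `X` and the variable names are assumptions for illustration only), the scaler is typically fitted on the training split and then reused, unchanged, on the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 10 samples, 2 features on very different scales
rng = np.random.RandomState(0)
X = np.column_stack([rng.uniform(0, 1, 10), rng.uniform(0, 500, 10)])

X_train, X_test = train_test_split(X, test_size=0.3, random_state=23)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the test split is scaled with the training minimum and maximum, its values can fall slightly outside the 0 to 1 range; that is expected with this approach.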

In [7]:
def data_scale_range(d_train, d_test):
    '''
    Return the training and testing data scaled from 0 to 1 using the range technique
    
    Parameters
    ----------
    d_train: A Pandas dataframe representing the training data
    d_test: A Pandas dataframe representing the testing data
    
    Returns
    -------
    d_train_scaled, d_test_scaled: A 2-tuple of numpy.ndarray objects
    containing the scaled training and testing data respectively
    '''
    d_train_scaled, d_test_scaled = None, None
    
    # Fit the scaler on the training data only, then apply it to both sets
    min_max_scaler = MinMaxScaler()
    d_train_scaled = min_max_scaler.fit_transform(d_train)
    d_test_scaled = min_max_scaler.transform(d_test)
    
    return d_train_scaled, d_test_scaled

In [8]:
d_train, d_test, _, _ = train_test_split(df, data.target, test_size=0.3, random_state=23)
d_train_scaled, d_test_scaled = data_scale_range(d_train, d_test)
assert_equal(len(d_train), len(d_train_scaled))
assert_equal(len(d_test), len(d_test_scaled))
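# Compare row by row: each scaled row should differ from its original
for val, val_sc in zip(d_train.values, d_train_scaled):
    assert_true((val != val_sc).any())
for val, val_sc in zip(d_test.values, d_test_scaled):
    assert_true((val != val_sc).any())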

## Question 3 - Dimensionality Reduction

In this question, we will perform dimensionality reduction using PCA, as shown in this week's lesson. The Boston dataset we have been using thus far has 13 dimensions. Our goal in this question is to reduce the dimensionality from 13 to the number of components specified as a parameter (`num_components`) to the function `dimensionality_reduction()`.

Complete the function `dimensionality_reduction()` that takes in two parameters `num_components` and `data`. Perform a Principal Component Analysis (PCA) on the `data` and return the transformed data. **Note: You do not need to convert the transformed data into a pandas DataFrame. Simply transforming the data is sufficient.**
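
For orientation, here is a small sketch of PCA on synthetic data, assuming scikit-learn's `PCA` class. The array `X` and its shape are made up for illustration and are unrelated to the Boston data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))  # hypothetical data: 50 samples, 5 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # fit the model and project the data in one call

print(X_reduced.shape)  # (50, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```

`fit_transform` is equivalent to calling `fit` followed by `transform` on the same data; either form works for this question.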

In [9]:
def dimensionality_reduction(num_components, data):
    '''
    Return the reduced data with a dimensionality size of num_components
    
    Parameters
    ----------
    num_components: An integer representing the reduced dimensionality size
    data: A numpy.ndarray representing the data values
    
    Returns
    -------
    transformed_data: A numpy.ndarray representing the data with a reduced dimensionality size
    '''
    transformed_data = None
    
    from sklearn.decomposition import PCA
    # define number of components
    pca = PCA(n_components=num_components)

    # Fit model to the data
    pca.fit(data)

    # Compute the transformed data (rotation to PCA space)
    transformed_data = pca.transform(data)
    
    return transformed_data

In [10]:
reduced_data = dimensionality_reduction(6, df.values)
# Use `row` as the loop variable to avoid overwriting the global `data` object
for row in reduced_data:
    assert_true(len(row) == 6)

## Question 4 - Model Persistence
Complete the function `persist_PCA_model` below, which takes one parameter, `filename`, creates a PCA model with the number of components set to **6**, and writes this PCA model to the specified `filename`. Make sure to use the joblib library to write the PCA model.
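
A minimal round-trip sketch of model persistence follows. It assumes the standalone `joblib` package is installed (recent scikit-learn releases no longer ship `sklearn.externals.joblib`), and the file name `pca_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib  # assumption: the standalone joblib package is available
from sklearn.decomposition import PCA

pca = PCA(n_components=6)  # unfitted model, as in the question

# Write the model to a temporary directory, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pca_model.pkl")  # hypothetical file name
    joblib.dump(pca, path)
    restored = joblib.load(path)

print(restored.n_components)  # 6
```

Loading the file returns an equivalent `PCA` object, which is how a persisted model would be reused in a later session.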

In [12]:
def persist_PCA_model(filename):
    '''
    Create a PCA model of dimensionality size 6 and write this model to the specified filename
    
    Parameters
    ----------
    filename: A string describing the filename to write the PCA model to
    
    Returns
    -------
    None
    '''
    # Create an (unfitted) PCA model with 6 components
    pca = PCA(n_components=6)

    # Serialize the model with joblib (imported at the top of the notebook)
    with open(filename, 'wb') as fout:
        joblib.dump(pca, fout)

In [13]:
persist_PCA_model('test_model.pkl')
assert_true(os.path.exists('test_model.pkl'))
!rm test_model.pkl