# Week 5 Problem 4

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run All*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Due Date: 6 PM, February 19, 2018

In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from nose.tools import assert_equal, assert_true, assert_almost_equal, assert_is_instance, assert_is_not
from sklearn.gaussian_process.kernels import RBF, DotProduct, Matern, WhiteKernel
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.datasets import load_boston

# We do this to ignore several specific warnings
import warnings
warnings.filterwarnings("ignore")

## Boston Dataset
For this assignment we will use scikit-learn's built-in Boston housing dataset, which describes areas of Boston and their house prices. The dataset has 506 samples and 13 features. Each record contains data such as the crime rate and the average number of rooms per dwelling. The code below loads the dataset into a pandas DataFrame and previews the first few rows.

In [4]:
'''
NOTE: Make sure to load this data set before completing the assignment
'''
# Load the dataset (a scikit-learn Bunch object)

data = load_boston()

# Build a pandas DataFrame from the feature matrix
df = pd.DataFrame(data.data, columns=data.feature_names)

# Preview the first few lines
df.head()

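|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |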


## Question 1

Create a function called `get_kernel` that returns a kernel instance based on the `kernel_label` string passed into the function.

`kernel_label` is one of the following strings, each mapping to a kernel class:
- "rbf" : [Radial Basis Function](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html)
- "dot_product": [DotProduct](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.DotProduct.html)
- "matern": [Matern](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.Matern.html)
- "white_kernel" : [WhiteKernel](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.WhiteKernel.html)

Note: use the default parameters for all of the kernels (i.e., you do not need to specify any kernel parameters).

In [7]:
def get_kernel(kernel_label):
    '''    
    Get the respective kernel function based on the `kernel_label` parameter.
    
    Parameters
    ----------
    kernel_label: A String specifying the type of kernel
    
    Returns
    -------
    kernel: A gaussian_process.kernels instance
    '''
    if kernel_label == 'rbf':
        kernel = RBF()
    elif kernel_label == 'dot_product':
        kernel = DotProduct()
    elif kernel_label == 'matern':
        kernel = Matern()
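    elif kernel_label == 'white_kernel':
        kernel = WhiteKernel()
    else:
        raise ValueError("Unknown kernel_label: {!r}".format(kernel_label))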

    return kernel

In [8]:
assert_true(isinstance(get_kernel('rbf'), RBF))
assert_true(isinstance(get_kernel('dot_product'), DotProduct))
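
The label-to-kernel mapping described in this question can also be written as a dictionary lookup instead of an `if`/`elif` chain. The following is only a minimal sketch of that alternative: the names `KERNEL_CLASSES` and `get_kernel_by_lookup` are hypothetical and are not part of the assignment, and raising a `ValueError` for an unknown label is an assumption about the desired behavior.

```python
from sklearn.gaussian_process.kernels import RBF, DotProduct, Matern, WhiteKernel

# Hypothetical lookup table (name chosen for this sketch): label -> kernel class
KERNEL_CLASSES = {
    "rbf": RBF,
    "dot_product": DotProduct,
    "matern": Matern,
    "white_kernel": WhiteKernel,
}


def get_kernel_by_lookup(kernel_label):
    """Return a default-parameter kernel instance for a known label."""
    try:
        kernel_class = KERNEL_CLASSES[kernel_label]
    except KeyError:
        # Assumption: an unknown label should fail loudly rather than fall back
        raise ValueError("Unknown kernel label: {!r}".format(kernel_label)) from None
    # Calling the class with no arguments uses the default kernel parameters
    return kernel_class()


print(get_kernel_by_lookup("matern"))
```

Adding a new kernel then only requires a new dictionary entry, which is one reason this pattern is sometimes preferred over a growing chain of conditions.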

## Question 2

In this question, we will create a Gaussian process regressor that predicts the crime rate (CRIM) in parts of Boston from the remaining features.

- Use `train_test_split` to split `independent_data` and `dependent_data` into training and testing data. Use a `random_state` of 23 and a test size of `0.3`.
- For the `kernel` parameter of the `GaussianProcessRegressor`, define a custom kernel: the sum of an `RBF` kernel and a `WhiteKernel`. For the `RBF` kernel, use a `length_scale` of 1, and for the `WhiteKernel`, use a `noise_level` of 12. Also use a `random_state` of 23 for the `GaussianProcessRegressor`.
- Fit the model to the training data, and return the fitted Gaussian process model.

In [9]:
def gaussian_regressor(independent_data, dependent_data):
    '''
    Fit a Gaussian process regressor that predicts CRIM from the other features.
    
    Parameters
    ----------
    independent_data: A pandas.core.frame.DataFrame
    dependent_data: A pandas.core.series.Series
    
    Returns
    -------
    A fitted GaussianProcessRegressor object
    '''
    # Amount held out for testing
    frac = 0.3

    # Split data into training and testing sets
    ind_train, ind_test, dep_train, dep_test = \
        train_test_split(independent_data, dependent_data, test_size=frac, random_state=23)

    # Define custom kernel (RBF + white noise)
    krnl = RBF(length_scale=1) + WhiteKernel(noise_level=12)

    # Create Regressor with specified properties
    model = GaussianProcessRegressor(kernel=krnl, random_state=23)
    model = model.fit(ind_train, dep_train)
    
    return model

In [10]:
dependent_data = df.CRIM
independent_data = df.drop('CRIM', axis=1)
gaussian_model = gaussian_regressor(independent_data, dependent_data)
assert_true(isinstance(gaussian_model, GaussianProcessRegressor))
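
To see what adding an `RBF` kernel and a `WhiteKernel` does, it can help to evaluate the kernels on a few points. The sketch below uses three made-up 2-D points (not the Boston data) and the same `length_scale=1` and `noise_level=12` as this question. It checks that the summed kernel's matrix is the element-wise sum of the two individual matrices, so the white-noise term only raises the diagonal (here to 1 + 12 = 13).

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Three made-up 2-D points, used only for illustration
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

rbf = RBF(length_scale=1)
white = WhiteKernel(noise_level=12)
combined = rbf + white  # a sum kernel, as used in this question

K_rbf = rbf(X)       # similarity between points; 1.0 on the diagonal
K_white = white(X)   # noise_level on the diagonal, 0 elsewhere
K_sum = combined(X)  # kernel matrix of the combined kernel

# The combined matrix equals the element-wise sum of the two parts
print(np.allclose(K_sum, K_rbf + K_white))

# Diagonal entries: RBF contributes 1, WhiteKernel contributes 12
print(np.diag(K_sum))
```

The off-diagonal entries come from the `RBF` term alone, which is why the white-noise term is often read as modelling independent observation noise.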

## Question 3

In this question, we will determine which kernel, when summed with a `WhiteKernel` with `noise_level=13`, generates the highest score for a `GaussianProcessRegressor`. The kernels we will consider are `rbf`, `dot_product`, and `matern`.

- Iterate through all three kernel sums with `WhiteKernel(noise_level=13)`. The first kernel is `rbf`, `dot_product`, or `matern`, and the second kernel (`WhiteKernel`) is added to each of them, giving (RBF + White), (DotProduct + White), and (Matern + White).
- Use the `get_kernel` function from Question 1 to retrieve the kernel instance for the first kernel. This will be checked in the unit tests.
- Create a `GaussianProcessRegressor` with `random_state=23` for each two-kernel combination, and `fit` the model with the `ind_train` and `dep_train` parameters that are passed into the function.
- Use the `score` method of the `GaussianProcessRegressor` to compute the score with the `ind_test` and `dep_test` parameters passed into the function.
- Finally, keep track of the `best_score` and the `best_process` while iterating through the combinations, and return a 2-tuple of (`best_score`, `best_process`). `best_score` is the highest score, and `best_process` is the fitted regressor that achieved it.
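
For regressors in scikit-learn, `score` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$, so a higher score is better, 1 is a perfect fit, and the score can be negative. The sketch below computes this quantity by hand with NumPy on made-up numbers; the function name `r_squared` is hypothetical and only meant to illustrate the formula.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up targets and three made-up sets of predictions
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # exact predictions
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # worse than the mean

print(perfect, mean_only, worse)
```

Because the score can be negative, initializing a running best score to 0 could hide every candidate; comparing against the first computed score avoids that.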

In [19]:
def get_best_kernel(ind_train, ind_test, dep_train, dep_test):
    '''
    Get the best 2-kernel combination with WhiteKernel based on the dataset passed into the function.
    
    Parameters
    ----------
    ind_train: A pandas.core.frame.DataFrame
    ind_test: A pandas.core.frame.DataFrame
    dep_train: A pandas.core.series.Series
    dep_test: A pandas.core.series.Series
    
    Returns
    -------
    A 2-tuple of the best gaussian_process score and the respective GaussianProcessRegressor object
    '''
    # Candidate kernels: each base kernel from get_kernel plus white noise
    kernels = [get_kernel('rbf') + WhiteKernel(noise_level=13),
               get_kernel('dot_product') + WhiteKernel(noise_level=13),
               get_kernel('matern') + WhiteKernel(noise_level=13)]
    best_score, best_process = None, None

    # Fit and score one regressor per kernel combination
    for k in kernels:
        gpr = GaussianProcessRegressor(kernel=k, random_state=23)
        gpr = gpr.fit(ind_train, dep_train)

        # Compute the score (R^2) on the held-out data
        score = gpr.score(ind_test, dep_test)

        # Keep the highest-scoring fitted model
        if best_score is None or score > best_score:
            best_score, best_process = score, gpr

    return best_score, best_process

In [20]:
dependent_data = df.CRIM
independent_data = df.drop('CRIM', axis=1)
ind_train, ind_test, dep_train, dep_test = train_test_split(independent_data,
                                                 dependent_data,
                                                 test_size=0.3,
                                                 random_state=23)
best_score, best_regressor = get_best_kernel(ind_train, ind_test, dep_train, dep_test)
assert_true(best_regressor is not None)
assert_true(best_score != 0)
assert_true(best_regressor.random_state == 23)
assert_true(best_regressor.kernel.get_params()['k2__noise_level'] == 13)

# Check that get_best_kernel calls get_kernel, as the question requires.
orig_get_kernel = get_kernel
del get_kernel

try:
    get_best_kernel(ind_train, ind_test, dep_train, dep_test)

# A NameError means get_kernel was used
except NameError:
    pass

# No error means get_kernel was not used
else:
    raise AssertionError("get_kernel has not been used in get_best_kernel")

# Restore the original function
finally:
    get_kernel = orig_get_kernel
    del orig_get_kernel
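
The check above relies on a general trick: temporarily remove a global name, call the function under test, and treat a `NameError` as evidence that the function looked that name up. A minimal, self-contained sketch of the same idea is shown below. The helper `uses_global` and the toy functions are hypothetical and are not part of the assignment or of any testing library, and the sketch assumes the name exists in the function's global namespace.

```python
def uses_global(func, name, *args, **kwargs):
    """Return True if calling func raises NameError once `name` is removed."""
    namespace = func.__globals__   # globals of the module defining func
    saved = namespace.pop(name)    # temporarily remove the name
    try:
        func(*args, **kwargs)
    except NameError:
        return True                # func tried to look up the missing name
    else:
        return False               # func ran fine without it
    finally:
        namespace[name] = saved    # always restore the original object


# Toy functions, made up for this illustration
def helper(x):
    return x + 1


def with_helper(x):
    return helper(x) * 2


def without_helper(x):
    return (x + 1) * 2


print(uses_global(with_helper, "helper", 3))
print(uses_global(without_helper, "helper", 3))
```

Restoring the name in a `finally` clause matters: without it, a failure in the middle of the check would leave the name missing for every later cell.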
    