# Module 1 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
2. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and restart the kernel and run all cells (Restart & Run all).
3. Do not change the title (i.e. file name) of this notebook.
4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).
5. All work must be your own. If you use any code from another source (such as a course notebook or a website), you must properly cite the source.

-----

In [None]:
import pandas as pd
import numpy as np

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

-----

## Predicting the Price of a Car

In this assignment, we will use a partially cleaned dataset to build a predictive model. Before we attempt to build a model, we must first load the data and select the independent feature and the dependent label. The first Code cell below reads the data from a CSV file and displays several random instances. The second Code cell selects the independent feature (`make`) and the dependent label (`price`), and displays the first few instances in this new DataFrame.

-----

In [None]:
df = pd.read_csv('./imports-85.data')
df.sample(5)

In [None]:
df_simple = df[['make', 'price']]
df_simple.head()

-----

## Problem 1: Creating the Training and Testing Datasets

The `data_split` function shown below accepts a parameter called `data`, which is a DataFrame. Your task is to use the `train_test_split` function from the scikit-learn library to split this DataFrame into two separate DataFrames (a training set and a testing set).

To complete this process, do the following:
- Split `data` into a training set and a testing set, each a separate DataFrame.
- The `test_size` argument in `train_test_split` should be set to the `size` parameter.
- The `random_state` argument in `train_test_split` should be set to the `random_state` parameter.
- Return both the training set and the testing set, in this order.
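
As a rough illustration of how `train_test_split` behaves, here is a minimal sketch on a small, made-up DataFrame. The `toy` data and its column names are assumptions for illustration only and are unrelated to the cars dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (not the cars dataset): 8 rows, 2 columns.
toy = pd.DataFrame({'x': range(8), 'y': [v * 2.0 for v in range(8)]})

# test_size=0.25 holds out a quarter of the rows for testing;
# random_state fixes the shuffle so the split is reproducible.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# The training set comes first, then the testing set.
print(len(toy_train), len(toy_test))
```

When a DataFrame is passed in, both returned pieces are DataFrames, and every original row ends up in exactly one of them.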

-----

In [None]:
from sklearn.model_selection import train_test_split

def data_split(data, size=0.25, random_state=0):
    '''    
    Split data into training and testing sets.
    
    Parameters
    ---------
    data: pandas DataFrame
    size: fraction of the data to hold out for the testing set
    random_state: random seed for random number generator

    Returns
    -------
    Two DataFrames: train and test, in that order.
    '''
    
    ### YOUR CODE HERE

In [None]:
# Call function
train, test = data_split(df_simple)

# Test Data types
assert_equal(type(train), pd.DataFrame, msg="train is not a DataFrame")
assert_equal(type(test), pd.DataFrame, msg="test is not a DataFrame")

# Converting to NumPy arrays for sklearn
train_X = train.make.values.reshape(len(train.make),1)
train_y = train.price.values
test_X = test.make.values.reshape(len(test.make), 1)
test_y = test.price.values

# Test dependent values
assert_almost_equal(np.sum(test_y), 689311.38805, places=2)
assert_almost_equal(np.sum(train_y), 2018150.12935, places=2)

# Test independent values
assert_equal(train_X[1][0], 8)
assert_equal(train_X[45][0], 13)
assert_equal(test_X[25][0], 18)
assert_equal(test_X[51][0], 12)

-----

## Problem 2: Performing Linear Regression

Your task for this problem is to build and use the scikit-learn library's `LinearRegression` estimator to make predictions on the cars dataset. The framework for a regression function is provided below. It takes two NumPy arrays containing the features (`trainX`) and the labels (`trainY`), and an optional Boolean flag, `fit_intercept`, which indicates whether an intercept term should be fit as part of the linear regression. To complete this function, you must explicitly:
- Create a `LinearRegression` estimator by using scikit-learn.
- Fit the `LinearRegression` estimator using `trainX` and `trainY`.
- Return the resulting estimator.
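
The create-then-fit pattern can be sketched on made-up data as follows. The `toy_X`, `toy_y`, and `toy_model` names are hypothetical and chosen for illustration; the points lie exactly on the line y = 3x + 1 so the fitted parameters are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on y = 3x + 1.
# scikit-learn expects a 2-D feature array, hence the reshape.
toy_X = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
toy_y = 3.0 * toy_X.ravel() + 1.0

# Create the estimator, then fit it to the features and labels.
toy_model = LinearRegression(fit_intercept=True)
toy_model.fit(toy_X, toy_y)

# The slope and intercept recovered from the data.
print(toy_model.coef_, toy_model.intercept_)
```

With `fit_intercept=False`, the fitted line would instead be forced through the origin.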

-----

In [None]:
from sklearn.linear_model import LinearRegression

def regression(trainX, trainY, fit_intercept=False):
    '''
    Compute a linear regression model for given training data set.
    
    Parameters
    ---------
    trainX: the training independent features
    trainY: the training dependent labels
    fit_intercept: optional Boolean flag to indicate if an intercept should be fit

    Returns
    -------
    The fitted linear regression model
    '''
    
    ### YOUR CODE HERE

In [None]:
lr = regression(train_X, train_y, fit_intercept=True)

assert_equal(type(lr), type(LinearRegression()))
assert_equal(lr.get_params(), {'copy_X': True, 'fit_intercept': True, \
                               'n_jobs': 1, 'normalize': False})

-----

## Problem 3: Checking R2 Score on Testing Dataset

For this problem, you will compute the R2 score given a model, the independent features (`X`), and the dependent labels (`y`). Using the function template provided below, you must explicitly:
- Compute the R2 score for the supplied model.
- Return the resulting score.
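
For reference, the R2 score is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below, on made-up data with hypothetical names (`toy_X`, `toy_y`, `toy_model`), compares a fitted estimator's built-in `score` method against that definition. It is one possible way to obtain the value, not necessarily the approach the assignment intends.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that is close to, but not exactly, linear.
toy_X = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
toy_y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

toy_model = LinearRegression().fit(toy_X, toy_y)

# For regression estimators, score() returns the R2 value.
builtin = toy_model.score(toy_X, toy_y)

# The same quantity computed from the definition.
pred = toy_model.predict(toy_X)
ss_res = np.sum((toy_y - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)  # total sum of squares
manual = 1.0 - ss_res / ss_tot

print(builtin, manual)
```

A score near 1 means the model explains most of the variance in the labels, while a score near 0 means it does little better than always predicting the mean.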

-----

In [None]:
def r2_score(model, X, y):
    '''
    Compute the R2 score for a given model and data set.
    
    Parameters
    ---------
    model: linear regression model
    X: NumPy array containing independent data (features)
    y: NumPy array containing dependent data (labels)

    Returns
    -------
    A float containing the model score
    '''

    ### YOUR CODE HERE

In [None]:
# Compute the score
score = r2_score(lr, train_X, train_y)

# Test the score
assert_almost_equal(0.010956, score, places=2)

-----

## Problem 4: Model Persistence

Complete the function `persist_model`, which accepts two parameters: `name` and `model`. This function will persist the provided `model` into a new file specified by `name`. To persist the machine learning model, you should use the joblib library. By using the function template provided below, you must explicitly:
- Open the file with the provided name for writing.
- Save the model to this file.
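
A minimal round-trip sketch of saving and reloading an estimator with joblib is shown below. It assumes the standalone `joblib` package is installed (this notebook itself imports joblib through `sklearn.externals`), and the `toy_model` estimator and `toy_model.pkl` filename are hypothetical.

```python
import os
import tempfile

import joblib  # assumption: the standalone joblib package is available
from sklearn.linear_model import LinearRegression

# Hypothetical toy model fit on points lying on y = 2x + 1.
toy_model = LinearRegression(fit_intercept=True)
toy_model.fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])

# Work inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_model.pkl')  # hypothetical filename

    # Open the file for binary writing and dump the model into it.
    with open(path, 'wb') as fout:
        joblib.dump(toy_model, fout)

    # Open the file for binary reading and load the model back.
    with open(path, 'rb') as fin:
        restored = joblib.load(fin)

print(restored.fit_intercept, restored.coef_)
```

The restored object carries both the constructor settings and the fitted coefficients, so it can make predictions without refitting.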

-----

In [None]:
import os
from sklearn.externals import joblib

def persist_model(name, model):
    '''
    Write a model to the specified file.

    Parameters
    ---------   
    name: A string containing the filename to which the model should be written.
    model: The model that should be saved to the file.

    Returns
    -------
    Nothing
    '''
   
    ### YOUR CODE HERE

In [None]:
# Save model to temporary file
persist_model('test_model.pkl', lr)

# Does the file exist?
assert_true(os.path.exists('test_model.pkl'))

# Test model recreation
with open('test_model.pkl', 'rb') as fin:
    new_model = joblib.load(fin)
    assert_equal(new_model.fit_intercept, True)

# Remove the temporary file
!rm test_model.pkl

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 