# Week 1 Problem 3

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. **If your code does not pass the unit tests, it will not pass the autograder.**

# Due Date: 6 PM, January 22, 2018

In [9]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, normalize
from sklearn.model_selection import train_test_split
from nose.tools import assert_false, assert_equal, assert_almost_equal, assert_true

### Reading in the Auto-MPG Data Set

The code cell below reads in the [Auto MPG dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) and replaces every row whose horsepower value is '?' with NumPy NaNs, so that those rows can be dropped later. This dataset comes from [UCI's Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

The dependent variable in this dataset is: mpg  

The independent variables are: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.

In [10]:
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'model year', 'origin', 'car name']
df = pd.read_fwf('/home/data_scientist/data/misc/auto-mpg.data', header=None, names=names)
df.loc[df['horsepower'] == '?'] = np.nan
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 | "chevrolet chevelle malibu" |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 | "buick skylark 320" |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 | "plymouth satellite" |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 | "amc rebel sst" |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 | "ford torino" |
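The cell above sets whole rows to NaN wherever horsepower is '?'. As a rough alternative sketch (not the notebook's method), the marker can be converted in the affected column only. The `toy` frame below is a hypothetical stand-in for a few auto-mpg rows, and `errors='coerce'` is an assumption that any non-numeric string should become NaN.

```python
import pandas as pd

# Hypothetical miniature frame standing in for a few auto-mpg rows.
toy = pd.DataFrame({
    'mpg': [18.0, 25.0, 21.0],
    'horsepower': ['130.0', '?', '95.0'],  # strings, because of the '?' marker
})

# Convert the column to floats; anything that is not a number
# (here only '?') becomes NaN instead of raising an error.
toy['horsepower'] = pd.to_numeric(toy['horsepower'], errors='coerce')

n_missing = int(toy['horsepower'].isna().sum())
print(toy.dtypes)
print(n_missing)
```

With this approach the other columns of an affected row stay intact, and a later `dropna()` still removes the row.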


### Problem 3.1: Preprocessing Data  

Complete the cell below by finishing the function *preprocess_data*. The function should drop all rows containing NaNs and [encode](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) all of the categorical variables in the data *(be sure to keep the dataframe header names)*.

Sample output of the first five rows for the correct solution:

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 | 49 |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 | 34 |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 | 232 |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 | 13 |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 | 160 |
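
To see how the integer codes in the car name column come about, here is a small sketch of `LabelEncoder` on a hypothetical list of four names (`car_names` is made up for illustration). The encoder sorts the distinct labels and assigns each one its position in that sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of car names; 'amc rebel sst' appears twice.
car_names = ['ford torino', 'amc rebel sst', 'buick skylark 320', 'amc rebel sst']

le = LabelEncoder()
codes = le.fit_transform(car_names)  # learn the labels and encode in one step

print(list(le.classes_))  # distinct labels in sorted order
print(codes.tolist())     # index of each name within classes_

# The mapping is reversible.
decoded = le.inverse_transform(codes).tolist()
```

Repeated names share a code, which is why the largest code can be smaller than the number of rows.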


In [11]:
def preprocess_data(data):
    '''
    Preprocess data by removing all NaNs and encoding all categorical variables.
    
    Parameters
    ----------
    data: dataframe containing auto-mpg dataset
    
    Returns
    -------
    processed data inside of a dataframe with the same headers
    '''
    
    # Drop every row that contains a NaN (modifies the input frame in place)
    data.dropna(inplace=True)

    # Encode the only categorical column, 'car name', as integer labels
    le = LabelEncoder()
    data['car name'] = le.fit_transform(data['car name'])

    return data

In [12]:
preprocessed_data = preprocess_data(df)

assert_equal(type(preprocessed_data), pd.core.frame.DataFrame, msg='Return a dataframe!')
assert_equal(len(preprocessed_data), 392, msg='Make sure you dropped the NaNs and nothing else!')

    
assert_equal(max(preprocessed_data['car name']), 300, msg='Encode all Categorical Variables!')
assert_equal(min(preprocessed_data['car name']), 0, msg='Encode all Categorical Variables!')

### Problem 3.2: Splitting your data
Finish the function *split_data* in the cell below by performing an 80/20 split on your preprocessed data from your solution in Problem 3.1.

For this problem, use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. Set the random_state argument to 0. Use 80% of the preprocessed data for training and the remaining 20% for testing. Lastly, *in this order*, return the independent variables (features) to be used for training, the features to be used for testing, the dependent variable (labels) to be used for training, and the labels to be used for testing.
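
As a minimal sketch of what the split does, the example below runs `train_test_split` on a hypothetical ten-row frame (`toy`, with invented values) rather than the real dataset. It checks the 80/20 sizes and shows that a fixed `random_state` gives the same split on every call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame: 'mpg' is the label, the rest are features.
toy = pd.DataFrame({
    'mpg': np.arange(10, 20, dtype=float),
    'weight': np.arange(10) * 100.0,
    'cylinders': [4, 4, 6, 6, 8, 8, 4, 6, 8, 4],
})
features = toy.drop('mpg', axis=1)
labels = toy['mpg']

# Two independent calls with the same seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    features, labels, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)           # 8 training rows, 2 testing rows
print(X_te.index.equals(X_te2.index))   # same rows selected both times
```

Features come back as DataFrames and labels as Series, with row indices kept aligned between them.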


In [5]:
def split_data(data):
    '''
    Parameters
    ----------

    data: pandas dataframe containing preprocessed dataset
    
    Returns
    -------
    Two pandas DataFrames followed by two pandas Series, in this order:
    training features, testing features,
    training labels, testing labels
    '''
    
    # Split the DataFrame into a DataFrame of features and a Series of labels
    data_x = data[data.columns[1:]]  # every column except the first one (mpg)
    data_y = data.mpg

    # 80/20 train/test split; random_state=0 makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        data_x, data_y, test_size=0.2, random_state=0)
        
    return X_train, X_test, y_train, y_test

In [6]:
X_train, X_test, y_train, y_test = split_data(preprocessed_data.copy())

assert_equal(type(y_test), pd.core.series.Series,
             msg='Testing labels should be returned as a series')

assert_equal(type(y_train), pd.core.series.Series,
             msg='Training labels should be returned as a series')

assert_equal(type(X_train), pd.core.frame.DataFrame,
             msg='Training features should be returned as a dataframe')

assert_equal(type(X_test), pd.core.frame.DataFrame,
             msg='Testing features should be returned as a dataframe')

assert_equal(y_test.iloc[4], 33.8, msg='Make sure you used the random_state argument properly')
assert_equal(y_train.iloc[104], 37.0, msg='Make sure you used the random_state argument properly')
assert_equal(X_train.iloc[104].tolist()[3], 2434.0, msg='Make sure you used the random_state argument properly')


### Problem 3.3: Performing Linear Regression on the AUTO MPG Dataset
In the code cell below do the following:  
- Create a linear regression model using scikit-learn with the default parameters and assign it to a variable called *model*.
- Fit your model on the training features and labels (stored in *X_train* and *y_train*).
- Lastly, compute the R^2 score on the testing features and labels (stored in *X_test* and *y_test*) and store the result in a variable called *score*.
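
For reference, the R^2 value returned by `score` is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below checks this on synthetic data; the coefficients, noise level, and variable names (`reg`, `X_fit`, `X_eval`) are all invented for illustration and have nothing to do with the Auto MPG results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features, plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=50)

# Simple hold-out split: first 40 rows to fit, last 10 to evaluate.
X_fit, X_eval = X[:40], X[40:]
y_fit, y_eval = y[:40], y[40:]

reg = LinearRegression().fit(X_fit, y_fit)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares).
residual_ss = np.sum((y_eval - reg.predict(X_eval)) ** 2)
total_ss = np.sum((y_eval - y_eval.mean()) ** 2)
manual_r2 = 1.0 - residual_ss / total_ss

builtin_r2 = reg.score(X_eval, y_eval)
print(np.isclose(manual_r2, builtin_r2))
```

A score of 1.0 means perfect predictions on the evaluation set; a model that always predicts the mean of the evaluation labels scores 0.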

In [7]:
# Create and fit our linear regression model to training data
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)

# Compute score and display result (Coefficient of Determination)
score = model.score(X_test, y_test)

In [8]:
assert_true(isinstance(model, LinearRegression), msg='model should be a LinearRegression instance')
assert_almost_equal(.836, score, places=2)