# Module 2 Assignment

A few things you should keep in mind when working on assignments:

1. Run the first code cell to import modules needed by this assignment before proceeding to problems.
2. Make sure you fill in any place that says `# YOUR CODE HERE`. Do not write your answer anywhere else other than where it says `# YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.
3. Each problem has an autograder cell below the answer cell. Run the autograder cell to check your answer. If there's anything wrong in your answer, the autograder cell will display error messages.
4. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select Kernel, and Restart & Run all. If the notebook runs through the last code cell without an error message, you've answered all problems correctly.
5. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

-----

# Run Me First!

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from nose.tools import assert_equal, assert_almost_equal, assert_true, assert_is_instance

-----

# Predicting MPG

In this assignment, we will use the mpg dataset to build a regression model. Before we attempt to build a model, we must first load the data, which is read here from `data/mpg.csv`.

Please run the next Code cell before proceeding to Problem 1.

-----

In [2]:
# Load the MPG dataset
mpg = pd.read_csv('data/mpg.csv')
mpg.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


---

# Problem 1: Create Dependent and Independent Variables

For this problem you will use the DataFrame **mpg** defined above.

To complete the task, do the following:
1. Use `dmatrices` from the `patsy` module to create the independent variable **x** and the dependent variable __y__ from the DataFrame mpg.
2. In the patsy formula, use 'mpg' as the dependent variable.
3. Use displacement, acceleration, and model_year as independent variables.
4. Treat model_year as a categorical feature (enclose it in `C()`).
5. Set the `return_type` argument of `dmatrices` to 'dataframe'.

After this problem, there are two new variables defined, **x** and __y__.
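
The way `C()` expands a categorical column can be seen on a small example. The sketch below uses a hypothetical toy DataFrame (the names `toy`, `target`, `engine_size`, and `year` are made up for illustration and are not part of the assignment) to show how patsy builds the design matrix: one `Intercept` column, one indicator column per category except the first, and then the numeric features.

```python
import pandas as pd
import patsy

# Hypothetical toy data: three distinct years and one numeric feature
toy = pd.DataFrame({
    "target":      [10.0, 12.0, 15.0, 11.0, 16.0, 13.0],
    "engine_size": [3.0, 2.5, 1.8, 2.9, 1.6, 2.4],
    "year":        [70, 71, 72, 70, 72, 71],
})

# Left of "~" is the dependent variable; right of "~" are the features.
# C(year) tells patsy to treat the integer column as categorical.
y_toy, x_toy = patsy.dmatrices(
    "target ~ engine_size + C(year)", data=toy, return_type="dataframe"
)

# Year 70 is the reference level, so it gets no indicator column of its own
print(list(x_toy.columns))
print(x_toy.shape)
```

The same counting explains the shape the autograder expects for **x**: the output header shows indicator columns for years 71 through 82 (12 columns), plus `Intercept`, `displacement`, and `acceleration`, for 15 columns in total.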

-----

In [3]:
import patsy as pts

### BEGIN SOLUTION
# Create dependent and independent variables
y, x = pts.dmatrices('mpg ~ displacement + acceleration + C(model_year)', data=mpg, return_type='dataframe')
### END SOLUTION

In [4]:
assert_equal(type(x), pd.core.frame.DataFrame, msg="x is not a DataFrame")
assert_equal(x.shape, (392, 15), 
             msg="The independent features are not correct. Make sure to treat model_year as categorical feature")
# 2 random rows of the independent variables
x.sample(2)


Unnamed: 0,Intercept,C(model_year)[T.71],C(model_year)[T.72],C(model_year)[T.73],C(model_year)[T.74],C(model_year)[T.75],C(model_year)[T.76],C(model_year)[T.77],C(model_year)[T.78],C(model_year)[T.79],C(model_year)[T.80],C(model_year)[T.81],C(model_year)[T.82],displacement,acceleration
155,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,350.0,14.0
284,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,302.0,13.4


-----

# Problem 2: Create the Training and Testing Datasets

This problem works on the variables **x** and __y__ created in Problem 1.

Split the independent and dependent variables into training and testing sets.

To complete this process, do the following:

- Name the training and testing independent variables x_train and x_test.
- Name the training and testing dependent variables y_train and y_test.
- The `test_size` argument in `train_test_split` should be set to 0.3.
- The `random_state` argument in `train_test_split` should be set to 23.

After this problem, there are 4 new variables defined, **x_train, x_test, y_train** and __y_test__.
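
To see what `test_size` and `random_state` do, the sketch below splits a stand-in array of 392 row indices (the number of rows in **x**) instead of the real data. The names `rows`, `train_rows`, and `test_rows` are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 392 rows of x; only the row count matters here
rows = np.arange(392)

# 30% of 392 is 117.6, which scikit-learn rounds up to 118 test rows
train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=23)
print(len(train_rows), len(test_rows))

# The same random_state reproduces exactly the same split
train_again, test_again = train_test_split(rows, test_size=0.3, random_state=23)
print(np.array_equal(test_rows, test_again))
```

This is why the autograder can check both the size of x_test and a specific value in y_test: with a fixed `random_state`, the shuffle is deterministic.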

-----

In [5]:
from sklearn.model_selection import train_test_split

### BEGIN SOLUTION
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=23)
### END SOLUTION

In [6]:
assert_equal(x_test.shape, (118, 15), 
             msg="The testing set size is not correct")
assert_equal(y_test.values[0][0], 24.0,
             msg="The testing dependent variable is not correct. Make sure you set random_state to 23")
# 2 random rows of the training independent variables
x_train.sample(2)

Unnamed: 0,Intercept,C(model_year)[T.71],C(model_year)[T.72],C(model_year)[T.73],C(model_year)[T.74],C(model_year)[T.75],C(model_year)[T.76],C(model_year)[T.77],C(model_year)[T.78],C(model_year)[T.79],C(model_year)[T.80],C(model_year)[T.81],C(model_year)[T.82],displacement,acceleration
187,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,304.0,13.9
353,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,120.0,18.3


---

# Problem 3: Performing Linear Regression

Your task for this problem is to build the scikit-learn library's `LinearRegression` estimator so that it can make predictions on the mpg dataset. To complete this problem, you must explicitly:
- Create a `LinearRegression` estimator **lin_model** by using scikit-learn. Accept default values for all arguments.
- Fit the `LinearRegression` estimator using x_train and y_train created in problem 2.

After this problem, there will be a trained linear regression model **lin_model**.
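
The create-then-fit pattern can be tried on data where the answer is known in advance. The sketch below is a minimal illustration on hypothetical toy data generated from y = 3x + 2 (the names `x_demo`, `y_demo`, and `toy_model` are not part of the assignment), so the fitted slope and intercept should recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data following y = 3x + 2
x_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y_demo = 3.0 * x_demo.ravel() + 2.0

# Step 1: create the estimator with default arguments
toy_model = LinearRegression()

# Step 2: fit it to the training data
toy_model.fit(x_demo, y_demo)

# The learned parameters are stored on the fitted estimator
print(toy_model.coef_, toy_model.intercept_)

# A fitted estimator can predict for unseen inputs
print(toy_model.predict(np.array([[10.0]])))
```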

-----

In [7]:
from sklearn.linear_model import LinearRegression

### BEGIN SOLUTION
# Create and fit our linear regression model to training data
lin_model = LinearRegression()
lin_model.fit(x_train, y_train)
### END SOLUTION

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [8]:
assert_equal(type(lin_model), type(LinearRegression()), msg="lin_model is not defined as a LinearRegression model")
assert_equal(lin_model.get_params()['fit_intercept'], True,
            msg="lin_model is not created with all default argument values")

---

# Problem 4: Checking R2 Score on Testing Dataset

For this problem, you will compute the R2 score of lin_model.
To complete this problem, you must explicitly:
- Compute the R2 score using the `score` method of lin_model.
- Use x_test and y_test to calculate the score.
- Assign the score to the variable **r2_score**.

After this problem, there will be a new variable r2_score defined.
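
For a regressor, `score` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below computes that quantity by hand with NumPy on hypothetical values (`y_true`, `y_pred`, and `r2_manual` are illustrative names, not assignment variables) to show what the number measures.

```python
import numpy as np

# Hypothetical true values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far the predictions are from the truth
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R2 is the fraction of the variance in y_true explained by the predictions
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)
```

Here SS_res is 0.75 and SS_tot is 20, so the result is 0.9625. A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean.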

-----

In [9]:
### BEGIN SOLUTION
r2_score = lin_model.score(x_test, y_test)
### END SOLUTION

In [10]:
assert_almost_equal(r2_score, 0.6713752019536277, msg="R2 score is not correct")
