# Homework 06: Spark and Least Squares Linear Regression

## Introduction

In this assignment, you will implement distributed least squares linear regression using Apache Spark. As in Lab09, we will use a service called Databricks to develop and run code. Databricks simplifies setting up Apache Spark and the cloud, and it provides a limited amount of free cloud computing. Outside this assignment, you can always run Apache Spark code on your own computer or in the cloud without Databricks.


In [2]:
# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib notebook

from client.api.notebook import Notebook
ok = Notebook('hw6.ok')

Assignment: hw6
OK, version v1.13.9



In [3]:
ok.auth(force=False) # Change False to True if you are getting errors authenticating

Successfully logged in as shrey@berkeley.edu


## Question 1. Understanding Least Squares Regression


In the first part of this homework, we explore some properties of multiple regression.  In particular, the goals are to

* Interpret the parameters in simple and multiple linear regression
* Understand how the correlation between the explanatory variables can affect the coefficients
* Observe how the correlation between explanatory variables can affect the standard error of the coefficients


We will also introduce the tools in scikit-learn for fitting linear models. Note that these tools are not used in the second part of this assignment, where you implement linear least squares using Spark.

In [4]:
# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import display, Latex, Markdown

%matplotlib notebook

### Creating the Data

We generate two sets of data, a response vector `Y` and a two-column design matrix `X`. 

* In the first data set, the columns of `X` are correlated with each other as well as being correlated with `Y`.  
* In the second data set, the columns of `X` are uncorrelated with each other and both columns are correlated with `Y`.   

The following code creates the first data set. 

In [5]:
n = 100
p = 2

mean = [0, 0, 0]
cov = [[1, 0.7, 0.7], [0.7, 1, 0.9], [0.7, 0.9, 1]]

np.random.seed(1141)
v, u, Y = np.random.multivariate_normal(mean, cov, n).T
X = np.array([u, v]).T

#### Question1a 
Find the mean and standard deviation of `Y`

In [6]:
mean_Y = sum(Y)/len(Y)
sd_Y = np.std(Y)

In [7]:
_ = ok.grade('q01a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/48QYB6
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



#### Three-dimensional plot

Create a 3D plot of `Y` and `X`. 
Take the following plot for a spin (literally).  Drag across the plot to spin it. Notice that we added the origin in red.

In [8]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)
# Mark the origin in red (the marker must be passed by keyword in a 3D scatter)
ax.scatter([0], [0], [0], marker="o", color='red')
ax.set_xlabel(r"$X_0$ axis")
ax.set_ylabel(r'$X_1$ axis')
ax.set_zlabel('Y axis')



#### Question1b
Spin the plot to examine the range of $X_0$, $X_1$ and $Y$. State whether each statement is true or false.

1. The ranges of $X_0$ and $X_1$ are both from about -2 to 2
1. Together $X_0$ and $X_1$ nearly fill their respective plane.
1. The response $Y$ appears correlated with both $X_0$ and $X_1$

In [9]:
Q1b_answer = '''

1. True, as we see from the plot of X0 and X1. However, there is one outlier.
1. True. On both axes there are values throughout the range of the respective axis/plane.
1. True. All points move in the same direction, so they are positively correlated.
'''

display(Markdown(Q1b_answer))



1. True, as we see from the plot of X0 and X1. However, there is one outlier.
1. True. On both axes there are values throughout the range of the respective axis/plane.
1. True. All points move in the same direction, so they are positively correlated.


#### Question1c 
In addition to the 3D plot, examine the three pairwise scatter plots:

* `Y` and the first column of `X`
* `Y` and the second column of `X`
*  the two columns of `X`

Arrange your 3 plots in a 2 by 2 grid (with one empty facet).

Label your axes so that you can tell which plot is which.

In [12]:
plt.figure(figsize=(8,9))
ax = plt.subplot(2,2,1)
ax.scatter(X[:,0], Y)
plt.xlabel('First Column of X')
plt.ylabel('Y')
# Y against the second column of X (x and y match the axis labels below)
bx = plt.subplot(2,2,2)
bx.scatter(X[:,1], Y)
plt.xlabel('Second Column of X')
plt.ylabel('Y')
cx = plt.subplot(2,2,3)
cx.scatter(X[:,0], X[:,1])
plt.xlabel('First Column of X')
plt.ylabel('Second Column of X')


Note that it is difficult to see how $Y$ depends on both $X_0$ and $X_1$ together in the pairwise plots.  

#### Question1d 
Use `np.corrcoef` to find the correlation matrix of all pairwise correlation 
coefficients between $Y$, $X_0$ and $X_1$.
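
By default `np.corrcoef` treats each row of its input as one variable (`rowvar=True`), so a list of three equal-length arrays gives a 3 by 3 matrix. A minimal sketch on synthetic data follows; the arrays `a`, `b`, and `c` are made up for illustration and are not the homework data.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.normal(size=500)
b = a + 0.5 * rng.normal(size=500)   # built from a, so correlated with a
c = rng.normal(size=500)             # generated independently of a and b

# Each list element is one variable; entry [i, j] is the correlation
# between variable i and variable j.
corr_demo = np.corrcoef([a, b, c])
print(corr_demo.shape)
print(np.round(corr_demo, 2))
```

The diagonal is all ones and the matrix is symmetric, so only the three entries above the diagonal carry information.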

In [13]:
list1 = [X[:,0], X[:,1], Y]
corr = np.corrcoef(list1)

In [14]:
_ = ok.grade('q01d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/qj7Mz3
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Fitting a least squares linear model 

Let's compare the coefficients of the least squares fit for the following models

* $Y$ as a linear function of $X_0$
* $Y$ as a linear function of $X_1$
* $Y$ as a linear function of $X_0$ and $X_1$


#### Question1e
Use `linear_model` from scikit-learn to fit the models and examine the coefficients.
Do not fit an intercept term in any of the three models.

In [15]:
# Reshape each column (and Y) into the 2-D arrays that scikit-learn expects
A = X[:,1].reshape(-1, 1)  # second column of X
B = X[:,0].reshape(-1, 1)  # first column of X
C = Y.reshape(-1, 1)

# Fit Y to the first column of X
model_1 = linear_model.LinearRegression(fit_intercept=False).fit(B,C)

# Fit Y to the second column of X
model_2 = linear_model.LinearRegression(fit_intercept=False).fit(A,C)

# Fit Y to X
model_3 = linear_model.LinearRegression(fit_intercept=False).fit(X,C)
# C is 2-D, so coef_ has shape (1, 2); keep only the row of coefficients
model_3.coef_ = model_3.coef_[0]

In [16]:
_ = ok.grade('q01e')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/nZxNwD
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Co-plots

Compare the coefficients from the simple linear fit to the coefficients in the two-variable fit. Notice that the coefficient for $X_1$ has changed quite a bit. It is $0.71$ in the single variable model and only $0.21$ in the two-variable model.

The coefficients in the two-variable model depend on the presence of the other explanatory variables in the model. 

In this case since $X_0$ is in the model and it is very highly correlated with $Y$, then $X_1$ does not explain much additional variation in $Y$. That is, given $X_0$, the relationship between $Y$ and $X_1$ is not as strong as the relationship between $Y$ and $X_1$ without knowledge of $X_0$.

We can see this when we plot $Y$ on $X_1$ for subgroups of the data where $X_0$ is roughly constant. 
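
The dependence described above can also be checked at the population level, using the covariance matrix that generated the first data set. The sketch below is a side calculation, not part of the assignment: it assumes zero means and unit variances (as in the generating code) and uses hypothetical names `cov_x`, `cov_xy`, `beta_single`, and `beta_joint`.

```python
import numpy as np

# Population covariances from the first data set, ordered as (X_0, X_1).
# In the generating code X_0 is `u` and X_1 is `v`.
cov_x = np.array([[1.0, 0.7],
                  [0.7, 1.0]])      # covariance matrix of (X_0, X_1)
cov_xy = np.array([0.9, 0.7])       # Cov(X_0, Y), Cov(X_1, Y)

# Single-variable slopes: Cov(X_j, Y) / Var(X_j)
beta_single = cov_xy / np.diag(cov_x)

# Two-variable slopes: solve cov_x @ beta = cov_xy
beta_joint = np.linalg.solve(cov_x, cov_xy)

print(beta_single)
print(np.round(beta_joint, 3))
```

Under these assumptions the population slope for $X_1$ drops from 0.7 on its own to about 0.14 once $X_0$ is included, while the slope for $X_0$ only falls from 0.9 to about 0.80. The fitted values from a sample of 100 points will differ somewhat from these population figures.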

#### Question1f

Create four scatter plots of the relationship between $Y$ and $X_1$, conditioned on $X_0$. 
To do this, bin $X_0$ into the following categories: -4 to -1, -1 to 0, 0 to 1, and 1 to 4.
For each subset of records, make a scatter plot of $Y$ against $X_1$. In your plot be sure to

* Keep the $Y$ limits the same on all 4 plots
* Keep the $X_1$ limits the same on all 4 plots
* Provide a title that indicates which subgroup of records is being plotted

In [18]:
bins = [-4, -1, 0, 1, 4]

x0 = X[:,0]
x1 = X[:,1]

# One panel per X_0 bin; sharex/sharey keeps the X_1 and Y limits identical
fig, axes = plt.subplots(2, 2, figsize=(8,9), sharex=True, sharey=True)
for ax, lo, hi in zip(axes.flat, bins[:-1], bins[1:]):
    in_bin = (x0 > lo) & (x0 < hi)  # records whose X_0 falls in this bin
    ax.scatter(x1[in_bin], Y[in_bin])
    ax.set_title('{} to {}'.format(lo, hi))
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('Y')


#### Question1g
How does the relationship between $Y$ and $X_1$ change from the pairwise plot made in Q1c to these plots? State whether each statement is true or false.

1. There is a stronger linear relationship between $Y$ and $X_1$ in the plot in Q1c than in the group of 4 plots
1. Each of the above 4 plots shows a similar strength of relationship between $Y$ and $X_1$
1. The average level of $Y$ is about the same in all 4 plots

In [19]:
Q1g_answer = '''

1. False
1. True
1. False


'''

display(Markdown(Q1g_answer))



1. False
1. True
1. False




#### Question1h

Lastly, we examine the multiple correlation coefficient from the regression.

The multiple $R^2$, the square of the multiple correlation coefficient, is the ratio of the explained variation in $Y$ (i.e., the variation in $Y$ that has been explained by the linear fit, or the variation in $\hat{Y}$) to the total variation in $Y$. It is similar in spirit to the correlation coefficient from lab, but it applies to the multiple regression case. 

Compute the multiple $R^2$ for the 2-variable regression. To do this, 

* Compute the predicted values, $\hat{Y}$
* Compute the ratio of the explained variation $||\hat{Y} - \bar{Y}||^2$ to the total variation $||Y - \bar{Y}||^2$ using `r2_score`
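
The two steps above can be sketched on synthetic data. `r2_score` computes $1 - ||Y - \hat{Y}||^2 / ||Y - \bar{Y}||^2$, which coincides with the explained-to-total ratio for a least squares fit that includes an intercept. The data and the names `X_demo`, `y_demo`, `ratio`, and `score` are made up for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=200)

# Least squares fit WITH an intercept, so the two formulas agree
fit = linear_model.LinearRegression(fit_intercept=True).fit(X_demo, y_demo)
y_hat = fit.predict(X_demo)

explained = np.sum((y_hat - y_demo.mean()) ** 2)   # ||Y_hat - Y_bar||^2
total = np.sum((y_demo - y_demo.mean()) ** 2)      # ||Y - Y_bar||^2
ratio = explained / total
score = r2_score(y_demo, y_hat)                    # 1 - SS_residual / SS_total
print(ratio, score)
```

With `fit_intercept=False`, as used in this homework, the two quantities are not guaranteed to match exactly, because the residuals need not sum to zero.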

In [21]:
reg = linear_model.LinearRegression(fit_intercept = False).fit(X,Y)
Y_hat = reg.predict(X)
multiple_R2 = r2_score(Y,Y_hat)

In [22]:
_ = ok.grade('q01h')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/lOxDD5
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Uncorrelated explanatory variables

Now repeat the investigation that you have done above with a different data set. Compare the plots for these data to the plots that you made with the first set of data.

First, run the following code chunk to create the data set.

In [23]:
np.random.seed(21141)
mean = [0, 0, 0]
cov = [[1, 0.7, 0.7], [0.7, 1, 0.], [0.7, 0., 1]]

Y, u, v = np.random.multivariate_normal(mean, cov, n).T
X = np.array([u, v]).T

#### Make the 3D plot of $Y$ and $X$.

In [24]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)
# Mark the origin in red (the marker must be passed by keyword in a 3D scatter)
ax.scatter([0], [0], [0], marker="o", color='red')
ax.set_xlabel(r"$X_0$ axis")
ax.set_ylabel(r'$X_1$ axis')
ax.set_zlabel('Y axis')


#### Make three pairwise plots

* `Y` and the first column of `X`
* `Y` and the second column of `X`
*  the two columns of `X`

Arrange your 3 plots in a 2 by 2 grid (with one empty facet).

Label your axes so that you can tell which plot is which.

In [62]:
plt.figure(figsize=(8,9))
plt.subplot(2,2,1)
plt.scatter(X[:,0], Y)
plt.xlabel('First Column of X')
plt.ylabel('Y')
plt.subplot(2,2,2)
plt.scatter(X[:,1], Y)
plt.xlabel('Second Column of X')
plt.ylabel('Y')
plt.subplot(2,2,3)
plt.scatter(X[:,0], X[:,1])
plt.xlabel('First Column of X')
plt.ylabel('Second Column of X')


#### Compute the pairwise correlation coefficients

In [63]:
list11 = [X[:,0], X[:,1], Y]
corrs =  np.corrcoef(list11)

#### Co-plots

Create scatter plots of the relationship between $Y$ and $X_1$, conditioned on $X_0$. Bin $X_0$ into the following categories: -4 to -1, -1 to 0, 0 to 1, and 1 to 4. 

In [66]:
bins = [-4, -1, 0, 1, 4]

x0 = X[:,0]
x1 = X[:,1]

# One panel per X_0 bin; sharex/sharey keeps the X_1 and Y limits identical
fig, axes = plt.subplots(2, 2, figsize=(8,9), sharex=True, sharey=True)
for ax, lo, hi in zip(axes.flat, bins[:-1], bins[1:]):
    in_bin = (x0 > lo) & (x0 < hi)  # records whose X_0 falls in this bin
    ax.scatter(x1[in_bin], Y[in_bin])
    ax.set_title('{} to {}'.format(lo, hi))
    ax.set_xlabel('$X_1$')
    ax.set_ylabel('Y')


#### Fitting the least squares linear models 

As before, fit the following models and compare the coefficients:

* $Y$ as a linear function of $X_0$
* $Y$ as a linear function of $X_1$
* $Y$ as a linear function of $X_0$ and $X_1$

Do not fit an intercept term in any of the three models.

In [64]:
# Reshape each column (and Y) into the 2-D arrays that scikit-learn expects
A = X[:,1].reshape(-1, 1)  # second column of X
B = X[:,0].reshape(-1, 1)  # first column of X
C = Y.reshape(-1, 1)

# Fit Y to the first column of X
model_1_second = linear_model.LinearRegression(fit_intercept=False).fit(B,C)

# Fit Y to the second column of X
model_2_second = linear_model.LinearRegression(fit_intercept=False).fit(A,C)

# Fit Y to X
model_3_second = linear_model.LinearRegression(fit_intercept=False).fit(X,C)


#### Find the multiple correlation coefficient for the 2-variable model

In [65]:
reg = linear_model.LinearRegression(fit_intercept = False).fit(X,Y)
Y_hat_second = reg.predict(X)
# Score the predictions for the second data set (not Y_hat from the first one)
multiple_R2_second = r2_score(Y, Y_hat_second)

#### Question1i
Now it's time to compare your findings of the two data sets.

Answer the following questions.

1. In the 3D plot, consider the spread of points in the $X_0$, $X_1$ plane. Do the two sets of data fill this plane similarly?
1. Compare the pairwise scatter plots of ($X_0$, $Y$) and ($X_1$, $Y$), and ($X_0$, $X_1$). Two of the pairs should look roughly the same for the different data sets and one should look different. Which one is different across the two data sets? How is it different? 
1. Examine the 4 co-plots for the second set of data. Is the slope of the linear relationship for these plots roughly the same? Is the strength of the relationship roughly the same? How does the linear relationship in these 4 plots compare to the relationship observed between $X_1$ and $Y$ without conditioning on $X_0$?
1. Compare the 4 co-plots for the two sets of data. How are they different? How are they the same?
1. Consider how the single-variable and two-variable coefficients change in the regressions for the second data set. How is this change different from the change observed for the first data set?
1. Compare the multiple $R^2$ of the two-variable regression for the two data sets. Do you think this $R^2$ gives any indication of whether the two-variable regression would have different coefficients for the explanatory variables than the one-variable regression?


In [None]:
Q1i_answer = '''

1. Yes, both of the spreads cover most of the ranges of X0 and X1, with more points concentrated in the center.
2. The (X_0, X_1) plot differs across the two data sets: in the first set it has a positive trend, but in the second set it has no real trend and is more scattered.
3. The slope of the linear relationship for the 4 co-plots is roughly the same, but the relationship is weaker in the first and fourth plots. The linear relationship in these 4 plots is very similar to the relationship between X1 and Y without conditioning on X0. 
4. The co-plots for the first set of data didn't show as much strength in their linear relationships, but they were similar to the co-plots for the second set of data in that they have an increasing slope.
5. Write your answer here, replacing this text.
6. R^2 is larger for the second data set.
'''

display(Markdown(Q1i_answer))

# Question 2

In this question we will use Apache Spark to compute the statistics needed to solve the ordinary least squares linear regression problem.

**Note: Apache Spark can already estimate a wide range of models, including linear regression. However, we will be doing this by hand (for practice).**
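
One way to see which statistics are needed: the normal equations $X^T X \theta = X^T Y$ only involve $X^T X$ and $X^T Y$, and both are sums over rows, so each partition of the data can compute its own partial sums before they are added together. The sketch below simulates that pattern locally with NumPy, where Python's `map` and `functools.reduce` stand in for the distributed operations. It is an illustration under those assumptions, not the Databricks notebook's code, and all names in it are hypothetical.

```python
import numpy as np
from functools import reduce

rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo.dot(theta_true) + 0.1 * rng.normal(size=1000)

# Pretend the rows are spread over 4 partitions
partitions = np.array_split(np.arange(1000), 4)

def partial_stats(idx):
    """Per-partition contributions to X^T X and X^T y."""
    Xp, yp = X_demo[idx], y_demo[idx]
    return Xp.T.dot(Xp), Xp.T.dot(yp)

def combine(a, b):
    """Add two (X^T X, X^T y) pairs elementwise."""
    return a[0] + b[0], a[1] + b[1]

XtX, Xty = reduce(combine, map(partial_stats, partitions))

# Solve the normal equations on the small aggregated matrices
theta_hat = np.linalg.solve(XtX, Xty)
print(theta_hat)
```

Only the small matrices `XtX` and `Xty` need to be brought together in one place; the full data set never does.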


## Setup

Step 1 is to create a Databricks account.  Go [here](https://accounts.cloud.databricks.com/registration.html#signup/community) to sign up.  Use your @berkeley email address. If you have already signed up before (in lab), go to [this](https://community.cloud.databricks.com/) page to login directly.

After you sign up, sign in to your Databricks account, then click Workspace -> Users -> `<your-username>@berkeley.edu`.    Click on the arrow pointing down beside your email address and select **`Import`**.  Import the `hw06.dbc` file from the folder containing this notebook.

![Importing](https://github.com/DS-100/sp17-materials/blob/master/sp17/hw/hw7/importing_notebooks.png?raw=true)

This will create a Databricks notebook file.  Open it.

The rest of this assignment is primarily conducted in the Databricks notebook.  However, this notebook contains the OK tests you can use to check your work, and it contains the invocations to submit your assignment when you're done.  Follow the instructions in the Databricks notebook to download your results in a form that the tests here will understand.

**Issue:**
1. Databricks Cloud runs Python 2.7, so you won't be able to use the `@` operator (as in `X.T @ Y`). Instead, you can use `X.T.dot(Y)`.
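
A small sketch of the workaround, using made-up arrays `X_demo` and `Y_demo`: both the `.dot` method and the `np.dot` function work without the `@` operator.

```python
import numpy as np

X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
Y_demo = np.array([1.0, 0.0, -1.0])

xty_method = X_demo.T.dot(Y_demo)      # method form, no @ operator needed
xty_func = np.dot(X_demo.T, Y_demo)    # equivalent function form
print(xty_method)
```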

## Question 2a

Complete question 2a and paste your answer below:

In [25]:
size_of_diamonds = 3192560

In [26]:
_ = ok.grade('q02a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/GZv7ly
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 2b
Complete question 2b and paste your answer below:

In [40]:
number_of_rows = 53941

In [41]:
_ = ok.grade('q02b')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/0gRw0G
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 2c

The size of the training data after constructing a 90/10 train test split:
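
If the split assigns each row to the training set independently with probability 0.9, the training size is only approximately 90% of the rows and varies with the random seed. The sketch below illustrates this with NumPy on a hypothetical row count. It is an assumption about how such a split behaves, not the Databricks notebook's code.

```python
import numpy as np

rng = np.random.RandomState(7)
n_rows = 10000                      # hypothetical row count

# Each row goes to training with probability 0.9, independently of the others
is_train = rng.uniform(size=n_rows) < 0.9
n_train = int(is_train.sum())
n_test = n_rows - n_train
print(n_train, n_test)
```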

In [71]:
number_of_rows_in_training = 48647

In [72]:
_ = ok.grade('q02c')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Submit... 100% complete
Submission successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/submissions/7615VA
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 2d

The average price of diamonds in the training data:

In [73]:
avg_price_of_diamonds_in_training = 3921.282674834518


In [74]:
_ = ok.grade('q02d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Submit... 100% complete
Submission successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/submissions/r08wV6
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 3a
The value of $\theta$

In [None]:
theta = [ 7865.4432434, -147.61542311,  -101.347086456, 12609.53723419]

In [None]:
_ = ok.grade('q03a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




## Question 3b
The weight of `carat` is much larger than the other two. Could we say it is the dominating feature?
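
One caveat that may be worth keeping in mind when reading coefficient sizes: a least squares weight is expressed per unit of its feature, so its magnitude depends on the feature's scale. A minimal sketch on synthetic data follows; the features `x_small` and `x_large` and the helper `fit` are hypothetical and unrelated to the diamonds data.

```python
import numpy as np

rng = np.random.RandomState(1)
x_small = rng.uniform(0.2, 2.0, size=300)      # feature on a small scale
x_large = rng.uniform(50.0, 70.0, size=300)    # feature on a large scale
y_demo = 4000.0 * x_small + 30.0 * x_large + rng.normal(scale=10.0, size=300)

def fit(features):
    """Least squares without an intercept, via the normal equations."""
    A = np.column_stack(features)
    return np.linalg.solve(A.T.dot(A), A.T.dot(y_demo))

theta_raw = fit([x_small, x_large])
# Same information, but the first feature is expressed in units 100 times smaller
theta_rescaled = fit([100.0 * x_small, x_large])
print(theta_raw)
print(theta_rescaled)
```

Rescaling the first feature by 100 divides its weight by 100 and leaves the predictions unchanged, so comparing raw weights across features measured on different scales can be misleading.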

In [None]:
"""Yes, as a diamond's carat is its most important feature. The other two features in theta are negative because their values in the actual dataset are larger than the values for carat, so they should be weighted accordingly."""

## Question 3c
Compute the RMSE for $\theta$ estimated using carat, depth, table.
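
For reference, the root mean squared error of a linear model is $\sqrt{\frac{1}{n}\sum_{i}(y_i - x_i^T\theta)^2}$. A minimal NumPy sketch on a tiny made-up example follows; the helper `rmse_of` and the arrays are hypothetical, and the homework itself computes this quantity in Spark.

```python
import numpy as np

def rmse_of(theta, X, y):
    """Root mean squared error of the linear predictions X.dot(theta)."""
    residuals = y - X.dot(theta)
    return np.sqrt(np.mean(residuals ** 2))

X_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
theta_demo = np.array([2.0, 3.0])
y_demo = np.array([2.0, 3.0, 7.0])   # last prediction is 5, so its residual is 2

rmse_demo = rmse_of(theta_demo, X_demo, y_demo)
print(rmse_demo)
```

Here the residuals are 0, 0, and 2, so the RMSE is $\sqrt{4/3}$.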

In [None]:
rmse = 1522.3543543265463

In [None]:
_ = ok.grade('q03c')
_ = ok.backup()

## Question 3d
Compute the improved RMSE using more features.

In [57]:
rmse_improved = 1453.434234

In [58]:
_ = ok.grade('q03d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/48Q056
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 3e
Compute the improved test RMSE using additional one-hot features.
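
One-hot encoding turns a categorical column into one indicator column per level. A minimal NumPy sketch follows; the column name `cut` and its values are hypothetical examples, not taken from the assignment's data.

```python
import numpy as np

cut = np.array(['Good', 'Ideal', 'Fair', 'Ideal', 'Good'])  # hypothetical values

# np.unique returns the sorted levels and, for each row, the index of its level
levels, codes = np.unique(cut, return_inverse=True)

# Row i of the identity matrix is the indicator vector for level i
one_hot = np.eye(len(levels))[codes]
print(levels)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of that row's level.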

In [67]:
test_rmse = 1344.53426521

In [68]:
_ = ok.grade('q03e')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed




Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/VmAEOv
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



# Submitting your assignment
Congratulations, you're done with this homework!

Run the next cell to run all the tests at once.

In [70]:
_ = ok.grade_all()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------

Now, run the cell below to submit your assignment to OkPy. The autograder should email you shortly with your autograded score. The autograder will only run once every 30 minutes.

**If you're failing tests on the autograder but pass them locally**, you should simulate the autograder by doing the following:

1. In the top menu, click Kernel -> Restart and Run all.
2. Run the cell above to run each OkPy test.

**You must make sure that you pass all the tests when running steps 1 and 2 in order.** If you are still failing autograder tests, you should double check your results.

In [69]:
_ = ok.submit()


Saving notebook... Saved 'hw6.ipynb'.
Submit... 100% complete
Submission successful for user: shrey@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/submissions/qj7v62
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit

