# Day 5 Worksheet: Homework 2 Overview

### What is Homework 2 Part 2 about?
Homework 2 Part 2 contains a lot of code, but most of it demonstrates how certain functions and libraries can be used to build and use linear regression models.

In [None]:
# all the necessary imports for this notebook
import numpy as np
import pandas as pd

## 1. Understanding Array Shapes

Consider the following dataset, where the $x$ values represent the feature, and the $y$ values represent the response.

\begin{equation*}
(x , y) = \{(1, 0), (2, 2), (3, 4), (5, 5)\}
\end{equation*}

How do we represent this dataset in Python? With NumPy arrays!

In [None]:
# First, we separate the dataset into x and y, representing them as numpy arrays
x = np.array([1,2,3,5])
y = np.array([0,2,4,5])
x

array([1, 2, 3, 5])

In [None]:
x.shape

(4,)

### What does this shape mean?
A shape of $(n,m)$ means the array is a 2d-array with $n$ rows and $m$ columns.

However, a shape of the form $(n,)$ means the array is a 1d-array with $n$ elements.

When we have a numpy array, we can use the `.reshape(n,m)` method to change it into a different shape. The new shape must hold the same total number of elements, so $n \times m$ has to equal the size of the original array.

In [None]:
# Let's change our 1d-array x into a 2d-array with one value per row
x_reshaped = x.reshape(4,1)
x_reshaped

array([[1],
       [2],
       [3],
       [5]])
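
As a small sketch of the same idea, the cell below checks the number of dimensions with `.ndim` and uses `-1` as a placeholder so that NumPy infers the missing dimension from the array's size. The names `x_col` and `x_row` are illustrative only and are not part of the homework.

```python
import numpy as np

x = np.array([1, 2, 3, 5])
print(x.ndim, x.shape)          # 1 (4,)

# -1 asks NumPy to work out the number of rows from the array's size
x_col = x.reshape(-1, 1)
print(x_col.ndim, x_col.shape)  # 2 (4, 1)

# the same trick gives a single row with four columns
x_row = x.reshape(1, -1)
print(x_row.shape)              # (1, 4)
```

Writing `reshape(-1, 1)` instead of `reshape(4, 1)` can be convenient because the code keeps working when the number of data points changes.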

### Why reshaping arrays is important:
The `sklearn` library has built-in models we can use such as

```python
lin_reg = LinearRegression().fit(X, y)
```
However, `X` must be a **2d-array**, not a 1d-array. (There is no such constraint for `y`.)

This means that for our example,
```python
lin_reg = LinearRegression().fit(x, y)
```
would not work, because `x` has shape `(4,)`, which is a 1d-array.

Instead, we would need to use our reshaped array `x_reshaped` with shape `(4,1)`, which is a 2d-array.
```python
lin_reg = LinearRegression().fit(x_reshaped, y)
```
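
To see the constraint in action, here is a minimal sketch that tries both versions on the example dataset. It assumes `scikit-learn` is installed and that a 1d `X` is rejected with a `ValueError`, which is the behavior of current versions. The flag `fit_1d_failed` is an illustrative name, not something from the homework.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 5])
y = np.array([0, 2, 4, 5])

# Fitting with the 1d-array x is expected to be rejected
try:
    LinearRegression().fit(x, y)
    fit_1d_failed = False
except ValueError as err:
    fit_1d_failed = True
    print("1d X was rejected:", str(err).splitlines()[0])

# Fitting with the reshaped 2d-array works
lin_reg = LinearRegression().fit(x.reshape(-1, 1), y)
print("slope:", lin_reg.coef_, "intercept:", lin_reg.intercept_)
```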

### Your Turn:
Consider the following dataset:
\begin{equation*}
(x , y) = \{(2, 3), (4, 5), (5, 8)\}
\end{equation*}
Create two objects `x1` and `y1` that represent this dataset using numpy arrays, where `x1` is in a shape that is compatible with `sklearn`.

`x1` should have shape `(3,1)` and `y1` should have shape `(3,)`.

In [None]:
### Your code here ###
x1 = np.array([2,4,5]).reshape(3,1)
y1 = np.array([3,5,8])
x1.shape, y1.shape

## 2. Working with Real Datasets
We will work with the `student_scores` dataset, which we have seen before.

In [None]:
student_scores = pd.read_csv('https://drive.google.com/uc?id=1oakZCv7g3mlmCSdv9J8kdSaqO5_6dIOw')
student_scores.head(10)

### Your turn: Create your feature 2d-array `X` with the "Hours" values, and the response 1d-array `y` with the "Scores" values

#### `X` should have shape `(25,1)`, and `y` should have shape `(25,)`
Remember that you can access a column of a Pandas dataframe with `df.column_name` or `df['column_name']`, but this column is a Pandas object. It must be converted into a numpy array before you can reshape it. You can do this with:
```python
np.array(df.column_name)
```
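
The conversion can be sketched on a tiny, made-up dataframe. The values below are hypothetical and are not taken from `student_scores`; `df_demo`, `X_demo`, and `y_demo` are illustrative names. The `.to_numpy()` method on a column is an alternative to `np.array(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
df_demo = pd.DataFrame({"Hours": [1.0, 2.0, 3.0], "Scores": [10, 20, 30]})

# Column -> numpy array -> 2d-array with one feature value per row
X_demo = np.array(df_demo.Hours).reshape(-1, 1)

# The response can stay 1d; .to_numpy() also converts a column to a numpy array
y_demo = df_demo["Scores"].to_numpy()

print(X_demo.shape, y_demo.shape)  # (3, 1) (3,)
```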


In [None]:
### Your code here ###
X = np.array(student_scores.Hours).reshape(25,1)
y = np.array(student_scores.Scores)  # convert the Pandas column to a numpy array
X.shape, y.shape

### Your Turn: Fit a Linear Regression model to your data, and print out the test MSE

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
lin_reg = LinearRegression().fit(X_train,y_train)
y_pred_test = lin_reg.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)  # signature is (y_true, y_pred)
print('Test MSE is:', mse_test)
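
For reference, the mean squared error is the average of the squared differences between the true and predicted values, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below computes it by hand with NumPy on small, made-up arrays (`y_true_demo` and `y_pred_demo` are hypothetical values, not results from the model above).

```python
import numpy as np

# Hypothetical true and predicted values, for illustration only
y_true_demo = np.array([3, 5, 8])
y_pred_demo = np.array([2, 5, 10])

# Errors are 1, 0, -2; their squares are 1, 0, 4
squared_errors = (y_true_demo - y_pred_demo) ** 2

# MSE is the mean of the squared errors: (1 + 0 + 4) / 3
mse_demo = squared_errors.mean()
print("MSE by hand:", mse_demo)
```

On the same inputs, `mean_squared_error(y_true_demo, y_pred_demo)` should give the same number.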