# Lecture 12: September 1st, 2023

### 🎃 🍁 Happy September!

Remember! Monday is a holiday, so there will be no lecture or discussion :( I'll post some videos that you can watch on Tuesday so that we don't get too far behind.

## Performance measures for regression

Let's consider the following dataset, which only has four points.

In [1]:
import pandas as pd
import altair as alt
import numpy as np

In [2]:
df = pd.DataFrame({
    "x":np.arange(4),
    "y":[0,2,-10,6]},
)
df

Unnamed: 0,x,y
0,0,0
1,1,2
2,2,-10
3,3,6


* Plot `df` using Altair to see how the data looks.

In [3]:
alt.Chart(df).mark_circle(size=150).encode(
    x="x",
    y="y"
)

* Our goal is to figure out which of the following linear models best fits our data.
    * Line A: $f(x) = 2x$
    * Line B: $f(x) = 0.6x - 1.4$

* Add columns to `df` corresponding to these lines. Name the new columns “lineA” and “lineB”.

In [4]:
#correct, but overly complicated
df["lineA"] = df["x"].map(lambda x: 2*x)

In [5]:
#better
df["lineA"] = 2*df["x"]

In [6]:
#Now, we do the same thing for line B
df["lineB"] = 0.6*df["x"] - 1.4

In [7]:
df

Unnamed: 0,x,y,lineA,lineB
0,0,0,0,-1.4
1,1,2,2,-0.8
2,2,-10,4,-0.2
3,3,6,6,0.4


Line A and Line B represent two different possible linear models for our data.

* Plot the data together with these lines, using the color red for Line A and the color black for Line B. Use a base chart so that you are not repeating the same code three times.

The idea of a base chart is that we first include the data that's common to all of the charts.

In [8]:
base = alt.Chart(df).encode(
    x="x"
)

Notice that `base` on its own raises an error if we try to display it: it's not a valid chart, because it has no mark (nothing like `mark_circle` or `mark_line`).

In [9]:
base

SchemaValidationError: '{'data': {'name': 'data-e85beb7529ff9a23a83e49d3894e7618'}, 'encoding': {'x': {'field': 'x', 'type': 'quantitative'}}}' is an invalid value.

'mark' is a required property

alt.Chart(...)

In [10]:
#Notice this is the same as the first plot from today
c1 = base.mark_circle(size=150).encode(y="y")
c1

In [11]:
#Now let's plot lineA
c2 = base.mark_line(color="red").encode(y="lineA")
c2

In [12]:
#Now let's plot lineB
c3 = base.mark_line(color="black").encode(y="lineB")
c3

Now, let's see these three charts all together:

In [13]:
c1 + c2 + c3

* Which line fits the data better?

There's no single correct answer to this question! If you said black or red, we could justify either. It's going to depend on how we measure "better".

* Using scikit-learn, find the line of best fit (Linear Regression) for this data. How does it compare to the above lines?

In [14]:
from sklearn.linear_model import LinearRegression

In [15]:
reg = LinearRegression()

In [16]:
reg.fit(df[["x"]],df["y"])

In [17]:
df

Unnamed: 0,x,y,lineA,lineB
0,0,0,0,-1.4
1,1,2,2,-0.8
2,2,-10,4,-0.2
3,3,6,6,0.4


Why did I set a target of "y" and not something like "lineA"? Recall that we already have the equations for lineA and lineB. We want to find the line of best fit for our data, which we do not know the equation for.

In [18]:
#coeff
reg.coef_

array([0.6])

In [19]:
reg.intercept_

-1.4

Notice, this gives us lineB! This is the black line in our above plot, and it definitely appears to be a little worse of a fit than the red line...but sklearn is telling us this is the best line of all possible lines. What's going on?

Recall, `LinearRegression` minimizes mean squared error (MSE). So, if we compute the MSE for both of the lines, we should see that line B performs better than line A.
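As a quick check of the claim that least squares lands on Line B, here is a minimal sketch that computes the MSE-minimizing line for the four points by hand, using the standard closed-form formulas for a single feature. The variable names are just for illustration; this is not how scikit-learn computes the fit internally.

```python
import numpy as np

# The four data points from the example above.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, -10.0, 6.0])

# Closed-form least-squares line for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
# intercept = mean_y - slope * mean_x
x_centered = x - x.mean()
y_centered = y - y.mean()
slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
intercept = y.mean() - slope * x.mean()

print(slope, intercept)  # approximately 0.6 and -1.4, i.e. Line B
```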

* Import the `mean_squared_error` function from sklearn.metrics. Which of our lines fits the data better according to this metric?

__Important point:__ When computing errors (or loss functions), a smaller number is better.

In [20]:
from sklearn.metrics import mean_squared_error

In [21]:
mean_squared_error(df["y"],df["lineA"])

49.0

In [22]:
mean_squared_error(df["y"],df["lineB"])

34.300000000000004

According to MSE, Line B performs better. Recall, this is what `LinearRegression` minimizes.

As a reminder, here is how MSE is computed:

In [23]:
df

Unnamed: 0,x,y,lineA,lineB
0,0,0,0,-1.4
1,1,2,2,-0.8
2,2,-10,4,-0.2
3,3,6,6,0.4


In [24]:
#compute MSE for lineA versus y by hand
((df["lineA"] - df["y"])**2).mean()

49.0

In [25]:
#now for lineB
((df["lineB"] - df["y"])**2).mean()

34.300000000000004

* Import the `mean_absolute_error` function from sklearn.metrics. Which of our lines fits the data better according to this metric?

MAE works almost exactly the same as MSE, except now we take the absolute value of each difference instead of squaring it to deal with possible negative values.
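Written out as formulas, the two measures look like this. The notation is an assumption for illustration (it was not used in lecture): $n$ is the number of data points, $y_i$ are the observed values, and $f(x_i)$ are the values predicted by the line.

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2,
\qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\bigl|y_i - f(x_i)\bigr|
$$

For Line A, the differences $y_i - f(x_i)$ are $0, 0, -14, 0$, so

$$
\text{MSE} = \frac{0 + 0 + 196 + 0}{4} = 49,
\qquad
\text{MAE} = \frac{0 + 0 + 14 + 0}{4} = 3.5
$$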

In [26]:
from sklearn.metrics import mean_absolute_error

In [27]:
mean_absolute_error(df["y"],df["lineA"])

3.5

In [28]:
mean_absolute_error(df["y"],df["lineB"])

4.9

Notice now that lineA performs better, since it has a smaller MAE.

In [29]:
#let's do this computation by hand
((df["lineA"] - df["y"]).abs()).mean()

3.5
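The by-hand computation for Line B is not shown above, so here is a small sketch that computes both measures for both lines with plain NumPy. The helper names `mse` and `mae` are made up for this example; they are not scikit-learn functions.

```python
import numpy as np

x = np.arange(4)
y = np.array([0.0, 2.0, -10.0, 6.0])
line_a = 2 * x            # Line A: f(x) = 2x
line_b = 0.6 * x - 1.4    # Line B: f(x) = 0.6x - 1.4

def mse(y_true, y_pred):
    # mean of the squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # mean of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

for name, pred in [("lineA", line_a), ("lineB", line_b)]:
    print(name, "MSE:", mse(y, pred), "MAE:", mae(y, pred))
```

Line B wins on MSE and Line A wins on MAE, which matches the scikit-learn results above.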

Question from the chat: Which method for error is more precise? 

No good answer; it's going to depend on what you're looking for in a model. One comment is that you might be tempted to think that MAE is better in this case because it produces smaller numbers than MSE, but this is not the case: the two measures are in different units (MSE is in squared units of y), so their values can't be compared directly. Another thing I've seen before (and something you can even see in this example) is that MSE tends to be more sensitive to outliers; this might be something to consider, depending on your data.
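To make the outlier comment concrete, here is a rough sketch using the same four points. It looks at how much the single point (2, -10) contributes under each measure, and then refits a least-squares line without that point. `np.polyfit` is used here as a stand-in for `LinearRegression`; that swap is a choice made for this example.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, -10.0, 6.0])

# Differences between the data and Line A, f(x) = 2x.
# Only the point (2, -10) is off the line.
residuals = y - 2 * x
print(residuals ** 2)     # the outlier contributes 196 to the sum of squares
print(np.abs(residuals))  # but only 14 to the sum of absolute values

# Refit the least-squares line after dropping the point (2, -10).
keep = x != 2
slope, intercept = np.polyfit(x[keep], y[keep], deg=1)
print(slope, intercept)   # close to 2 and 0, which is Line A
```

Squaring makes the one large miss dominate the total, which is why the least-squares line gets pulled away from Line A when the outlier is present.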

Main takeaways from this portion of the lecture: different performance measures will lead to different definitions of "best" fit. In theory, the smallest possible MSE/MAE would be zero, but in this case it's impossible since no line can go through all 4 data points. This is similar to real-life examples as well: we never expect our data to be perfectly linear.

In [30]:
mean_squared_error(df["y"],df["y"])

0.0

## More on Polynomial Regression

In the last lecture, we saw how to fit our data to a degree three polynomial. This took quite a bit of work, so the focus of this portion of the lecture will be on how to automate some of those steps.

### Generating Random Data

In [31]:
deg = 3
rng = np.random.default_rng(seed=27)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i

df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))

[-5  1 -3 -2]


In [32]:
df

Unnamed: 0,x,y_true,y
0,-3.0,19.0,23.824406
1,-2.9,15.648,10.237108
2,-2.8,12.584,16.919087
3,-2.7,9.796,8.955196
4,-2.6,7.272,6.323695
5,-2.5,5.0,10.602832
6,-2.4,2.968,0.784105
7,-2.3,1.164,-5.234227
8,-2.2,-0.424,-2.771499
9,-2.1,-1.808,-7.792136


Let's pick through the above code in a little more depth

* I'm trying to generate data that fits a cubic polynomial. Notice `deg = 3` makes it easy to change the code if I want to pick a polynomial of a different degree.
* The line `m = rng.integers...` is picking `deg+1` random integers from -5 to 4 (the `high=5` endpoint is excluded by default). These will serve as the coefficients for our cubic polynomial: $c_3x^3 + c_2x^2 + c_1x + c_0$, where `m[i]` plays the role of $c_i$.
* Then we create a DataFrame that stores x values.
* Lines 7-8 of the code build the polynomial for each value of x. The column "y_true" has perfectly cubic data.
* The last part of the code adds random noise to our data so that it is not perfectly cubic. Notice we use normally distributed random numbers instead of uniformly distributed random numbers (between 0 and 1) so that the noise is not all within 1 unit of the actual data.
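As a sanity check on the polynomial-building step described in the list above, here is a short sketch that rebuilds the noise-free values with plain NumPy arrays and compares them with NumPy's own polynomial evaluator. Using `numpy.polynomial.polynomial.polyval` is a choice for this example, not something from lecture.

```python
import numpy as np
from numpy.polynomial import polynomial as P

m = np.array([-5, 1, -3, -2])  # coefficients printed above: c_0, c_1, c_2, c_3
x = np.arange(-3, 2, 0.1)

# The loop from the cell above, written with a NumPy array instead of a column.
y_loop = np.zeros_like(x)
for i in range(len(m)):
    y_loop += m[i] * x**i

# polyval expects coefficients in increasing order of degree, which matches m.
y_polyval = P.polyval(x, m)

print(np.allclose(y_loop, y_polyval))
print(y_polyval[0])  # value of the polynomial at x = -3
```

The value at x = -3 should agree with the first entry of the "y_true" column.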


Notice, we can improve the code above by using enumerate.

In [33]:
m

array([-5,  1, -3, -2])

Recall that enumerate pairs the values of m with corresponding indices.

In [34]:
list(enumerate(m))

[(0, -5), (1, 1), (2, -3), (3, -2)]

In [35]:
help(enumerate)

Help on class enumerate in module builtins:

class enumerate(object)
 |  enumerate(iterable, start=0)
 |  
 |  Return an enumerate object.
 |  
 |    iterable
 |      an object supporting iteration
 |  
 |  The enumerate object yields pairs containing a count (from start, which
 |  defaults to zero) and a value yielded by the iterable argument.
 |  
 |  enumerate is useful for obtaining an indexed list:
 |      (0, seq[0]), (1, seq[1]), (2, seq[2]), ...
 |  
 |  Methods defined here:
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __next__(self, /)
 |      Implement next(self).
 |  
 |  __reduce__(...)
 |      Return state information for pickling.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  __class_getitem__(...) from builtins.type
 |      See PEP 585
 |  
 |  --------------------------------------------------------

In [36]:
deg = 3
rng = np.random.default_rng(seed=27)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y_true"] = 0
for c in enumerate(m):
    df["y_true"] += c[1]*df["x"]**c[0]

df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))

[-5  1 -3 -2]


In [37]:
for c in enumerate(m):
    print(c)

(0, -5)
(1, 1)
(2, -3)
(3, -2)
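Since each item produced by `enumerate` is a pair, another option is to unpack it into two names in the `for` line, which avoids the `c[0]` and `c[1]` indexing. A minimal sketch, evaluating the polynomial at the single value x = -3 (the names `power` and `coef` are arbitrary):

```python
m = [-5, 1, -3, -2]  # the coefficients c_0, c_1, c_2, c_3 from above
x = -3.0

total = 0.0
# Each pair (index, value) is unpacked into power and coef.
for power, coef in enumerate(m):
    total += coef * x**power

# The same sum as a single expression.
total_one_line = sum(coef * x**power for power, coef in enumerate(m))

print(total, total_one_line)
```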


* Plot the noisy data so we have an idea of how it looks. Call the resulting chart `c1`.

In [38]:
c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
c1

Notice that our data looks roughly cubic, but not perfectly so. This is a good thing! We'll see if sklearn can estimate the curve :)

### PolynomialFeatures

* Demonstrate the `PolynomialFeatures` class from `sklearn.preprocessing` by evaluating it on the “x” column in `df`. Use a `degree` value of 3.
Note: this class uses `transform` rather than `predict` in the final step.

We still have our usual workflow:
import > instantiate > fit > transform

In [39]:
from sklearn.preprocessing import PolynomialFeatures

In [40]:
poly = PolynomialFeatures(degree=3)

In [41]:
type(poly)

sklearn.preprocessing._polynomial.PolynomialFeatures

This is a new-to-us object from sklearn! 

In [42]:
df

Unnamed: 0,x,y_true,y
0,-3.0,19.0,23.824406
1,-2.9,15.648,10.237108
2,-2.8,12.584,16.919087
3,-2.7,9.796,8.955196
4,-2.6,7.272,6.323695
5,-2.5,5.0,10.602832
6,-2.4,2.968,0.784105
7,-2.3,1.164,-5.234227
8,-2.2,-0.424,-2.771499
9,-2.1,-1.808,-7.792136


In [43]:
poly.fit(df[["x"]])

In [44]:
#gives an error :(
poly.predict(df[["x"]])

AttributeError: 'PolynomialFeatures' object has no attribute 'predict'

In [45]:
#notice this result is kind of difficult to understand
poly.transform(df[["x"]])

array([[ 1.00000000e+00, -3.00000000e+00,  9.00000000e+00,
        -2.70000000e+01],
       [ 1.00000000e+00, -2.90000000e+00,  8.41000000e+00,
        -2.43890000e+01],
       [ 1.00000000e+00, -2.80000000e+00,  7.84000000e+00,
        -2.19520000e+01],
       [ 1.00000000e+00, -2.70000000e+00,  7.29000000e+00,
        -1.96830000e+01],
       [ 1.00000000e+00, -2.60000000e+00,  6.76000000e+00,
        -1.75760000e+01],
       [ 1.00000000e+00, -2.50000000e+00,  6.25000000e+00,
        -1.56250000e+01],
       [ 1.00000000e+00, -2.40000000e+00,  5.76000000e+00,
        -1.38240000e+01],
       [ 1.00000000e+00, -2.30000000e+00,  5.29000000e+00,
        -1.21670000e+01],
       [ 1.00000000e+00, -2.20000000e+00,  4.84000000e+00,
        -1.06480000e+01],
       [ 1.00000000e+00, -2.10000000e+00,  4.41000000e+00,
        -9.26100000e+00],
       [ 1.00000000e+00, -2.00000000e+00,  4.00000000e+00,
        -8.00000000e+00],
       [ 1.00000000e+00, -1.90000000e+00,  3.61000000e+00,
      

* To make the result easier to read, put the transformed data into a pandas DataFrame named `df_poly`.

In [46]:
df_poly = pd.DataFrame(poly.transform(df[["x"]]))
df_poly

Unnamed: 0,0,1,2,3
0,1.0,-3.0,9.0,-27.0
1,1.0,-2.9,8.41,-24.389
2,1.0,-2.8,7.84,-21.952
3,1.0,-2.7,7.29,-19.683
4,1.0,-2.6,6.76,-17.576
5,1.0,-2.5,6.25,-15.625
6,1.0,-2.4,5.76,-13.824
7,1.0,-2.3,5.29,-12.167
8,1.0,-2.2,4.84,-10.648
9,1.0,-2.1,4.41,-9.261


In [47]:
df

Unnamed: 0,x,y_true,y
0,-3.0,19.0,23.824406
1,-2.9,15.648,10.237108
2,-2.8,12.584,16.919087
3,-2.7,9.796,8.955196
4,-2.6,7.272,6.323695
5,-2.5,5.0,10.602832
6,-2.4,2.968,0.784105
7,-2.3,1.164,-5.234227
8,-2.2,-0.424,-2.771499
9,-2.1,-1.808,-7.792136


Notice that this is taking each x value and raising it to the corresponding power: column 0 is $x^0 = 1$, column 1 is $x$, column 2 is $x^2$, and column 3 is $x^3$. We did this same thing last lecture, but by hand. Appreciate how much time this saved us!

### Using `Pipeline` to combine multiple steps

* Import the `Pipeline` class from `sklearn.pipeline`.

In [48]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

* Make an instance of this `Pipeline` class. Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and the constructor (like `PolynomialFeatures(???)`)

In [49]:
#instantiate
#I think what went wrong in lecture was the indentation
pipe = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=3)), 
        ("reg", LinearRegression())
    ]
)

* Fit this object to the data.

In [50]:
pipe.fit(df[["x"]],df["y"])

When we call `fit`, the pipeline first fits and transforms the data using `PolynomialFeatures`, and then fits `LinearRegression` on the transformed data.

In [51]:
pipe["reg"].coef_

array([ 0.        ,  3.33524206, -3.03442169, -2.32952623])
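Two things are worth checking here: that the pipeline really is the two steps done one after the other, and why the first coefficient is 0. The sketch below uses noise-free cubic data (an assumption made so the expected coefficients are known exactly) in a made-up DataFrame `df_demo`. With the default settings, `PolynomialFeatures` adds a column of ones and `LinearRegression` fits its own intercept, so the constant term ends up in `intercept_` and the column of ones gets a coefficient of 0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free cubic with coefficients c_0..c_3 = -5, 1, -3, -2.
df_demo = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df_demo["y"] = -5 + df_demo["x"] - 3 * df_demo["x"]**2 - 2 * df_demo["x"]**3

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("reg", LinearRegression()),
])
pipe.fit(df_demo[["x"]], df_demo["y"])

# The same two steps done by hand.
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(df_demo[["x"]])
reg = LinearRegression().fit(X_poly, df_demo["y"])

print(np.allclose(pipe.predict(df_demo[["x"]]), reg.predict(X_poly)))
print(pipe["reg"].intercept_)  # the constant term c_0
print(pipe["reg"].coef_)       # entry 0 belongs to the column of ones
```

In the lecture output above, the last three coefficients are only approximately 1, -3, -2 because the model was fit to the noisy "y" column.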

In [52]:
pipe["poly"]

* Call the predict method, and add the resulting values to a new column named “pred”.

In [53]:
df["pred"] = pipe.predict(df[["x"]])

In [54]:
df

Unnamed: 0,x,y_true,y,pred
0,-3.0,19.0,23.824406,20.317732
1,-2.9,15.648,10.237108,16.359172
2,-2.8,12.584,16.919087,12.745261
3,-2.7,9.796,8.955196,9.462022
4,-2.6,7.272,6.323695,6.495478
5,-2.5,5.0,10.602832,3.831652
6,-2.4,2.968,0.784105,1.456566
7,-2.3,1.164,-5.234227,-0.643757
8,-2.2,-0.424,-2.771499,-2.483293
9,-2.1,-1.808,-7.792136,-4.07602


* Plot the resulting predictions using a red line. Name the chart `c2`.

In [55]:
c2 = alt.Chart(df).mark_line(color="red").encode(
    x="x",
    y="pred"
)

* Plot the true values using a dashed black line, using `strokeDash=[10,5]` as an argument to `mark_line`. Name the chart `c3`.

In [56]:
c3 = alt.Chart(df).mark_line(color="black",strokeDash=[10,5]).encode(
    x="x",
    y="y_true"
)

* Layer these plots on top of the above scatter plot `c1`.

In [57]:
c1+c2+c3

## Overfitting
For this portion, we'll be working with the simulated data in `sim_data.csv`. The true underlying function is of the form $f(x) = c_2x^2 + c_1x + c_0$. The true outputs are stored in the “y_true” column. We've hidden this true data by adding some random noise to each output and put the result in the “y” column.

In [58]:
df = pd.read_csv("sim_data.csv")
df.head()

Unnamed: 0,x,y_true,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,-3.329208,-18.207589,-117.4849,-3.329208,11.083626,-36.899694,122.846756,-408.982395,1361.587441,-4533.00773,15091.33,-50242.16,167266.6
1,6.465018,74.160562,73.954907,6.465018,41.796463,270.214901,1746.944309,11294.027098,73016.09297,472050.384357,3051814.0,19730040.0,127555000.0
2,-4.478046,-7.670062,-13.810089,-4.478046,20.052899,-89.79781,402.118751,-1800.706392,8063.646628,-36109.383086,161699.5,-724097.8,3242544.0
3,2.043272,-7.925152,19.461182,2.043272,4.17496,8.53058,17.430295,35.614834,72.770792,148.690523,303.8152,620.7771,1268.416
4,4.850593,36.485466,22.37523,4.850593,23.528255,114.125996,553.578791,2685.185564,13024.743051,63177.731115,306449.5,1486462.0,7210222.0


This is the same type of example that we did in the previous part of lecture with the cubic polynomial. Now, we just have a degree 2 polynomial instead.

![](over-under.png)

[Image Source](https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/#)

* Bias: This is an error caused by overly simplistic assumptions about our data. It fails to capture underlying complexity of our data. In this situation, our model tends to be easy to understand, but somehow fails to capture key aspects/nuances of the data.

* Variance: These are errors due to our model being too sensitive to noise and random fluctuations in the data. It might learn to model a training set well because it memorizes the random differences, but it struggles with data it has never seen before. The model fails to learn the underlying patterns of the data.

* Make a layer chart with the true "y" values and the noisy "y" values.

In [59]:
c1 = alt.Chart(df).mark_circle(color="red").encode(
    x="x",
    y="y"
)
c2 = alt.Chart(df).mark_circle(color="black").encode(
    x="x",
    y="y_true"
)
c1+c2

## Dividing into a training set and a test set

What is training data? Training data is the data that we show to our model for fitting. We hold back the rest of the data for testing how well our model works; the testing data is data that our model was not fit on -- it is brand new!

Key Idea: Fit to a training set, predict or evaluate on a test set.

We partition our data so that some of it is saved for training, and the rest is used for testing.

In [60]:
max_deg = 10
cols = [f"x{i}" for i in range(1, max_deg+1)]

In [61]:
cols

['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']

In [63]:
from sklearn.model_selection import train_test_split

In [64]:
X_train, X_test, y_train, y_test = train_test_split(df[cols],df["y"],train_size=0.24, random_state=6)

This saves 24% of the rows of `df[cols]` for training. The remaining rows will be used for testing.
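Here is a small sketch of what the split does, using a made-up 50-row DataFrame (`toy`) rather than `sim_data.csv`. The output names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen so they don't overwrite the lecture variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 50 rows, the same number of rows as df.
toy = pd.DataFrame({"x": np.arange(50), "y": np.arange(50) ** 2})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x"]], toy["y"], train_size=0.24, random_state=6
)

print(len(X_tr), len(X_te))                 # 12 and 38
print(set(X_tr.index) & set(X_te.index))    # set(): no row is in both parts
print(X_tr.index.equals(y_tr.index))        # inputs and outputs stay aligned
```

The rows are shuffled before splitting, which is why the index of the training set is not in order; fixing `random_state` makes the shuffle reproducible.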

In [65]:
df[cols]

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,-3.329208,11.083626,-36.899694,122.846756,-408.982395,1361.587441,-4533.008,15091.33,-50242.16,167266.6
1,6.465018,41.796463,270.214901,1746.944309,11294.027098,73016.09297,472050.4,3051814.0,19730040.0,127555000.0
2,-4.478046,20.052899,-89.79781,402.118751,-1800.706392,8063.646628,-36109.38,161699.5,-724097.8,3242544.0
3,2.043272,4.17496,8.53058,17.430295,35.614834,72.770792,148.6905,303.8152,620.7771,1268.416
4,4.850593,23.528255,114.125996,553.578791,2685.185564,13024.743051,63177.73,306449.5,1486462.0,7210222.0
5,-0.578506,0.334669,-0.193608,0.112004,-0.064795,0.037484,-0.0216848,0.01254479,-0.007257238,0.004198356
6,-6.557409,42.999609,-281.966007,1848.966339,-12124.427923,79504.828902,-521345.7,3418677.0,-22417660.0,147001800.0
7,-5.656363,31.994443,-180.972185,1023.644381,-5790.104257,32750.931784,-185251.2,1047848.0,-5927008.0,33525310.0
8,0.108538,0.011781,0.001279,0.000139,1.5e-05,2e-06,1.774525e-07,1.92604e-08,2.090492e-09,2.268986e-10
9,4.556122,20.758251,94.577131,430.904978,1963.255801,8944.833631,40753.76,185679.1,845976.7,3854373.0


In [66]:
len(df[cols])

50

In [67]:
len(X_train)

12

In [68]:
X_train

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
13,-8.982275,80.681271,-724.701394,6509.467488,-58469.829551,525192.110405,-4717420.0,42373170.0,-380607500.0,3418721000.0
11,6.449735,41.599085,268.303082,1730.483838,11161.162581,71986.543546,464294.1,2994574.0,19314210.0,124571600.0
1,6.465018,41.796463,270.214901,1746.944309,11294.027098,73016.09297,472050.4,3051814.0,19730040.0,127555000.0
25,0.371593,0.138081,0.05131,0.019066,0.007085,0.002633,0.0009782985,0.0003635287,0.0001350847,5.01965e-05
16,-3.942982,15.547104,-61.301947,241.712453,-953.067768,3757.928725,-14817.44,58424.91,-230368.3,908338.2
45,-4.45765,19.870645,-88.576385,394.842539,-1760.069919,7845.776,-34973.72,155900.6,-694950.5,3097846.0
15,5.481667,30.048674,164.716823,902.92278,4949.522041,27131.631828,148726.6,815269.5,4469036.0,24497770.0
42,-9.80488,96.135668,-942.598671,9242.066675,-90617.352934,888492.254125,-8711560.0,85415800.0,-837491600.0,8211505000.0
20,-0.690041,0.476157,-0.328568,0.226726,-0.15645,0.107957,-0.07449486,0.05140454,-0.03547126,0.02447664
35,4.85223,23.544134,114.241552,554.326269,2689.718466,13051.132219,63327.09,307277.6,1490982.0,7234585.0


In [69]:
y_train

13      2.574974
11    130.940520
1      73.954907
25      1.134863
16    -35.901888
45    -26.760377
15     24.441828
42    131.476240
20     61.591699
35    -13.988943
9       1.089803
10     -7.251101
Name: y, dtype: float64

In [70]:
X_test

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
49,6.068927,36.83187,223.529915,1356.586642,8233.024737,49965.622715,303237.7,1840327.0,11168810.0,67782700.0
40,-0.364001,0.132497,-0.048229,0.017555,-0.00639,0.002326,-0.0008466826,0.0003081936,-0.0001121829,4.083471e-05
38,-1.602675,2.568568,-4.11658,6.597541,-10.573716,16.946233,-27.15931,43.52755,-69.76053,111.8035
23,4.441058,19.722998,87.59098,388.996638,1727.556699,7672.179799,34072.6,151318.4,672013.7,2984452.0
7,-5.656363,31.994443,-180.972185,1023.644381,-5790.104257,32750.931784,-185251.2,1047848.0,-5927008.0,33525310.0
0,-3.329208,11.083626,-36.899694,122.846756,-408.982395,1361.587441,-4533.008,15091.33,-50242.16,167266.6
6,-6.557409,42.999609,-281.966007,1848.966339,-12124.427923,79504.828902,-521345.7,3418677.0,-22417660.0,147001800.0
34,-1.384337,1.91639,-2.65293,3.67255,-5.084049,7.038039,-9.74302,13.48763,-18.67143,25.84755
14,-9.026659,81.480574,-735.497357,6639.083877,-59928.746571,540956.362638,-4883029.0,44077430.0,-397872000.0,3591455000.0
31,6.89315,47.515521,327.53163,2257.72476,15562.836137,107276.968808,739476.3,5097321.0,35136600.0,242201900.0


In [71]:
50*0.24

12.0

### Performing Polynomial Regression for Each Degree

In [72]:
#not doing anything yet, just showing how degree changes
for i in range(1,max_deg+1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    print(sub_cols)

['x1']
['x1', 'x2']
['x1', 'x2', 'x3']
['x1', 'x2', 'x3', 'x4']
['x1', 'x2', 'x3', 'x4', 'x5']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']


### Evaluating the Performance, Part 1

In [73]:
mse_training_dict = {}
for i in range(1,max_deg+1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    reg.fit(X_train[sub_cols],y_train)
    mse_training_dict[i] = mean_squared_error(y_train, reg.predict(X_train[sub_cols]))

In [74]:
mse_training_dict

{1: 3006.2441403930584,
 2: 1845.1659476797647,
 3: 1820.05419658703,
 4: 590.386208100543,
 5: 589.0984843939517,
 6: 523.8762601493493,
 7: 373.27949678005854,
 8: 206.85098887525973,
 9: 204.9933133278071,
 10: 101.4717525424262}

Looking at the training MSE for each degree, we'd think that degree 10 is the best fit. There is a big issue, however! We are evaluating on the same rows that we used for fitting, so a model that overfits (one that memorizes the noise in the training set) is rewarded with a low MSE.

### Evaluating the Performance, Part 2

To get a meaningful MSE, we should always evaluate on a test set (not used during training).

In [91]:
mse_test_dict = {}
for i in range(1,max_deg+1):
    sub_cols = cols[:i]
    reg = LinearRegression()
    reg.fit(X_train[sub_cols],y_train)
    df[f"Pred{i}"] = reg.predict(df[sub_cols])
    mse_test_dict[i] = mean_squared_error(y_test, reg.predict(X_test[sub_cols]))

In [92]:
mse_test_dict

{1: 4960.848889431734,
 2: 3724.5263419375206,
 3: 3742.5488620098768,
 4: 6252.874102524008,
 5: 6253.4123849934285,
 6: 7016.575636379848,
 7: 5609.641936194314,
 8: 9481.527367001438,
 9: 12142.080032238073,
 10: 103395.23803044269}

Notice now that degree 2 is where we have the smallest MSE (unlike in Part 1, where the training MSE pointed to degree 10). This is encouraging, because we know that the true underlying data has degree 2. There is some randomness here, though: if you play around with the random states, we sometimes get degree 3 or 4 as the minimum for MSE.
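Rather than reading the minimum off by eye, we can ask Python for it. The sketch below copies the values printed above (rounded to two decimals) into two new dictionaries, whose names are made up for this example, and picks the degree with the smallest error in each.

```python
# Values copied (rounded) from mse_training_dict and mse_test_dict above.
mse_training = {1: 3006.24, 2: 1845.17, 3: 1820.05, 4: 590.39, 5: 589.10,
                6: 523.88, 7: 373.28, 8: 206.85, 9: 204.99, 10: 101.47}
mse_test = {1: 4960.85, 2: 3724.53, 3: 3742.55, 4: 6252.87, 5: 6253.41,
            6: 7016.58, 7: 5609.64, 8: 9481.53, 9: 12142.08, 10: 103395.24}

# min with key=dict.get returns the key whose value is smallest.
best_train_degree = min(mse_training, key=mse_training.get)
best_test_degree = min(mse_test, key=mse_test.get)
print(best_train_degree, best_test_degree)  # 10 2

# How many times larger the test error is than the training error.
for degree in mse_test:
    ratio = mse_test[degree] / mse_training[degree]
    print(degree, round(ratio, 1))
```

The gap between test and training error is modest for low degrees and becomes very large by degree 10.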

### Plotting the polynomial fits

Before plotting, we add a column which tells us whether a point was in the training set, or the testing set.

In [96]:
df["In_train"] = "test"

In [97]:
X_train.index

Int64Index([13, 11, 1, 25, 16, 45, 15, 42, 20, 35, 9, 10], dtype='int64')

In [98]:
df.loc[X_train.index,"In_train"] = "train"


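The labeling step can also be done in one line with a boolean mask. This is a sketch on a made-up five-row DataFrame (`toy`), where `train_index` plays the role of `X_train.index`; it shows the two-step version from above next to the `np.where` alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; train_index stands in for X_train.index.
toy = pd.DataFrame({"x": [0.5, 1.5, 2.5, 3.5, 4.5]})
train_index = pd.Index([3, 0])

# Two-step version: label everything "test", then overwrite the training rows.
toy["In_train"] = "test"
toy.loc[train_index, "In_train"] = "train"

# One-step version: isin gives True for rows whose index is in train_index.
toy["In_train_alt"] = np.where(toy.index.isin(train_index), "train", "test")

print(toy)
```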
In [99]:
c = alt.Chart(df).mark_circle().encode(
    x="x", 
    y="y",
    color="In_train"
)

In [100]:
c_true = alt.Chart(df).mark_line(color="black").encode(
    x="x",
    y="y_true",
)

In [101]:
chart_list = []
for i in range(1,max_deg+1):
    c_temp = alt.Chart(df).mark_line(color="red", clip=True).encode(
        x="x",
        y=alt.Y(f"Pred{i}").scale(domain=(-100,300)),
    )
    chart_list.append(c_temp)

In [102]:
all_charts = [c+c_true+d for d in chart_list]

In [103]:
alt.vconcat(*all_charts)

We didn't get to see these plots during lecture today, but here are some explanations. We'll go through this all together next lecture. A few comments: 
* Notice that in degree 1, the training and test errors are both high. This is a sign of underfitting, and is exhibited by the line not fitting our data well.
* In degree 2 and 3, the red curve matches the secret underlying polynomial (black curve) pretty well.
* In higher degrees, the training errors are smaller, but the test errors are larger. Notice from the plots that the red curves follow the orange points (the training data) very closely. This is a sign of overfitting.
