# Week 6 Wednesday

## Announcements

* HW5 due Wednesday.
* HW6 is posted (due next Wednesday).


## Plan:
* Using `Pipeline` to combine multiple steps

In [None]:
import numpy as np
import pandas as pd
import altair as alt

### Generating random data

Here we make some data that follows a random polynomial.  Can we use scikit-learn to estimate the underlying polynomial?

Here are some comments about the code:
* It's written so that if you change `deg` to another integer, the rest should work the same.
* The "y_true" column values follow a degree 3 polynomial exactly.
* The "y" column values are obtained by adding random noise to the "y_true" values.
* We use two different `size` keyword arguments, one for getting the coefficients, and one for getting a different random value for each row in the DataFrame.
* It's better to use normally distributed random values, rather than uniformly distributed values in [0,1], so that the data points are not all within a band of width 1 from the true polynomial.
* In general in Python, if you find yourself writing `range(len(???))`, you're probably not writing your code in a "Pythonic" way.  We will see an elegant way to replace `range(len(???))` below.
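To make the comment about noise concrete, here is a minimal sketch comparing the two kinds of random values. The seed and the variable names `uniform_noise` and `normal_noise` are assumptions chosen for this illustration; they are not part of the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for this sketch

n_rows = 50
uniform_noise = rng.random(size=n_rows)          # uniform on [0, 1)
normal_noise = rng.normal(scale=5, size=n_rows)  # mean 0, standard deviation 5

# Uniform noise never leaves [0, 1), so it only nudges each point upward by less than 1.
print(uniform_noise.min(), uniform_noise.max())

# Normal noise is centered at 0, so points can land well above or below the curve.
print(normal_noise.min(), normal_noise.max())
```

The `scale` argument controls how far the noisy points typically stray from the true polynomial.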

In [None]:
np.arange(-3, 2, 0.1)

array([-3.00000000e+00, -2.90000000e+00, -2.80000000e+00, -2.70000000e+00,
       -2.60000000e+00, -2.50000000e+00, -2.40000000e+00, -2.30000000e+00,
       -2.20000000e+00, -2.10000000e+00, -2.00000000e+00, -1.90000000e+00,
       -1.80000000e+00, -1.70000000e+00, -1.60000000e+00, -1.50000000e+00,
       -1.40000000e+00, -1.30000000e+00, -1.20000000e+00, -1.10000000e+00,
       -1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01,  2.66453526e-15,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01,
        1.00000000e+00,  1.10000000e+00,  1.20000000e+00,  1.30000000e+00,
        1.40000000e+00,  1.50000000e+00,  1.60000000e+00,  1.70000000e+00,
        1.80000000e+00,  1.90000000e+00])

In [None]:
deg = 3
rng = np.random.default_rng(seed=27)
# random integers in [-5,5)
m = rng.integers(low=-5, high=5, size=deg+1)
print(m)
# Create a DataFrame with a column "x" running from -3 up to (but not including) 2 in steps of 0.1
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})

# Calculate the true polynomial values y_true
df["y_true"] = 0
for i in range(len(m)):
    df["y_true"] += m[i]*df["x"]**i
# Add noise to generate y
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))

[-5  1 -3 -2]


At the end of that process, here is how `df` looks.

In [None]:
df

|   | x | y_true | y |
|---|---|---|---|
| 0 | -3.0 | 19.0 | 23.824406 |
| 1 | -2.9 | 15.648 | 10.237108 |
| 2 | -2.8 | 12.584 | 16.919087 |
| 3 | -2.7 | 9.796 | 8.955196 |
| 4 | -2.6 | 7.272 | 6.323695 |
| 5 | -2.5 | 5.0 | 10.602832 |
| 6 | -2.4 | 2.968 | 0.784105 |
| 7 | -2.3 | 1.164 | -5.234227 |
| 8 | -2.2 | -0.424 | -2.771499 |
| 9 | -2.1 | -1.808 | -7.792136 |


Aside: If you are using `range(len(???))` in Python, there is almost always a more elegant way to accomplish the same thing.

* Rewrite the code above using `enumerate(m)` instead of `range(len(m))`.

Recall that `m` holds the four randomly chosen coefficients for our true polynomial.  Why couldn't we use just `for c in m:` above?  Because we needed to know both the value in `m` and its index.  For example, we needed to know that `-3` corresponded to the `x**2` column (`m[2]` is `-3`).

This pattern is so common in Python that a built-in function, `enumerate`, is provided for it.  When we iterate through `enumerate(m)`, pairs of elements are returned: the index and the value.  For example, in our case `m = [-5,  1, -3, -2]`, so the first pair returned will be `(0, -5)`, the next pair will be `(1, 1)`, the next pair will be `(2, -3)`, and the last pair will be `(3, -2)`.  We assign the values in these pairs to `i` and `c`, respectively.
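The pairs described above can be inspected directly. This small sketch uses a plain Python list in place of the NumPy array `m`; the name `pairs` is introduced only for illustration.

```python
m = [-5, 1, -3, -2]  # the coefficients printed above, as a plain list

# Each iteration unpacks one (index, value) pair into i and c.
for i, c in enumerate(m):
    print(f"index {i}: coefficient {c} multiplies x**{i}")

# Collecting the pairs into a list shows exactly what enumerate produces.
pairs = list(enumerate(m))
print(pairs)
```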

In [None]:
# Create a DataFrame with a column "x" running from -3 up to (but not including) 2 in steps of 0.1
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})

# Calculate the true polynomial values y_true
df["y_true"] = 0
for i, c in enumerate(m):  # c is m[i]
    df["y_true"] += c*df["x"]**i
# Add noise to generate y
df["y"] = df["y_true"] + rng.normal(scale=5, size=len(df))
df

|   | x | y_true | y |
|---|---|---|---|
| 0 | -3.0 | 19.0 | 19.658465 |
| 1 | -2.9 | 15.648 | 19.004528 |
| 2 | -2.8 | 12.584 | 14.544002 |
| 3 | -2.7 | 9.796 | 6.982885 |
| 4 | -2.6 | 7.272 | 7.80607 |
| 5 | -2.5 | 5.0 | 8.32215 |
| 6 | -2.4 | 2.968 | -1.289692 |
| 7 | -2.3 | 1.164 | -0.827392 |
| 8 | -2.2 | -0.424 | 6.149442 |
| 9 | -2.1 | -1.808 | -0.183967 |


* Here is what the data looks like.

Based on the values in `m` above, we know these points are approximately following the curve $y = -2x^3 - 3x^2 + x - 5$.  For example, because the leading coefficient is negative, we expect the outputs to trend downward overall as `x` increases, which seems to match what we see in the plotted data.

In [None]:
c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)

c1
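As a sanity check on the curve $y = -2x^3 - 3x^2 + x - 5$ stated above, the sketch below evaluates the polynomial with NumPy's `numpy.polynomial.polynomial.polyval` and compares it to the explicit loop. Using `polyval` is an alternative introduced here, not something the notebook does; the names `y_poly` and `y_loop` are only for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

m = np.array([-5, 1, -3, -2])  # coefficients in increasing degree, as printed above
x = np.arange(-3, 2, 0.1)

# P.polyval expects coefficients ordered from the constant term upward,
# which matches how m is indexed in the loop (m[i] multiplies x**i).
y_poly = P.polyval(x, m)

# The same computation written as a sum over (index, coefficient) pairs.
y_loop = sum(c * x**i for i, c in enumerate(m))

print(y_poly[0])  # value of the polynomial at x = -3
```

At `x = -3` this gives $-5 - 3 - 27 + 54 = 19$, matching the first row of the "y_true" column.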

### Polynomial regression using `PolynomialFeatures`

* Use the `include_bias` keyword argument in `PolynomialFeatures` so we do not get a column for $x^0$

Notice how these values correspond to powers of the "x" column, including the zero-th power (all 1s).

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
pd.DataFrame(poly.fit_transform(df[['x']]))

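|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | -3.0 | 9.0 | -27.0 |
| 1 | 1.0 | -2.9 | 8.41 | -24.389 |
| 2 | 1.0 | -2.8 | 7.84 | -21.952 |
| 3 | 1.0 | -2.7 | 7.29 | -19.683 |
| 4 | 1.0 | -2.6 | 6.76 | -17.576 |
| 5 | 1.0 | -2.5 | 6.25 | -15.625 |
| 6 | 1.0 | -2.4 | 5.76 | -13.824 |
| 7 | 1.0 | -2.3 | 5.29 | -12.167 |
| 8 | 1.0 | -2.2 | 4.84 | -10.648 |
| 9 | 1.0 | -2.1 | 4.41 | -9.261 |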


Let's get rid of the column of all `1` values.  We do this by setting `include_bias=False` when we instantiate the `PolynomialFeatures` object.

We can perform the `fit` and `transform` steps together in a single call using `fit_transform`.

In [None]:
poly = PolynomialFeatures(degree=3, include_bias=False)
df_pow = pd.DataFrame(poly.fit_transform(df[['x']]))
df_pow

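|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -3.0 | 9.0 | -27.0 |
| 1 | -2.9 | 8.41 | -24.389 |
| 2 | -2.8 | 7.84 | -21.952 |
| 3 | -2.7 | 7.29 | -19.683 |
| 4 | -2.6 | 6.76 | -17.576 |
| 5 | -2.5 | 6.25 | -15.625 |
| 6 | -2.4 | 5.76 | -13.824 |
| 7 | -2.3 | 5.29 | -12.167 |
| 8 | -2.2 | 4.84 | -10.648 |
| 9 | -2.1 | 4.41 | -9.261 |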
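To see that `fit_transform` really is just `fit` followed by `transform`, here is a minimal sketch that does it both ways and compares the results. The names `poly_a`, `poly_b`, `two_step` and `one_step` are made up for this comparison.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})

# Two separate calls: fit records the input layout, transform computes the powers.
poly_a = PolynomialFeatures(degree=3, include_bias=False)
poly_a.fit(df[["x"]])
two_step = poly_a.transform(df[["x"]])

# One combined call on a fresh object.
poly_b = PolynomialFeatures(degree=3, include_bias=False)
one_step = poly_b.fit_transform(df[["x"]])

print(np.allclose(two_step, one_step))
```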


* Name the columns using the `get_feature_names_out` method of the `PolynomialFeatures` object.

In [None]:
poly.get_feature_names_out()

array(['x', 'x^2', 'x^3'], dtype=object)

In [None]:
df_pow.columns = poly.get_feature_names_out()
df_pow

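|   | x | x^2 | x^3 |
|---|---|---|---|
| 0 | -3.0 | 9.0 | -27.0 |
| 1 | -2.9 | 8.41 | -24.389 |
| 2 | -2.8 | 7.84 | -21.952 |
| 3 | -2.7 | 7.29 | -19.683 |
| 4 | -2.6 | 6.76 | -17.576 |
| 5 | -2.5 | 6.25 | -15.625 |
| 6 | -2.4 | 5.76 | -13.824 |
| 7 | -2.3 | 5.29 | -12.167 |
| 8 | -2.2 | 4.84 | -10.648 |
| 9 | -2.1 | 4.41 | -9.261 |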


* Concatenate the "y" and "y_true" columns from `df` onto the end of `df_pow` using `pd.concat((???, ???), axis=???)`.  Name the result `df_both`.

Notice how we use `axis=1`, because the column labels are changing but the row labels are staying the same.

In [None]:
df_both = pd.concat((df_pow, df[["y", "y_true"]]), axis=1)
df_both

|   | x | x^2 | x^3 | y | y_true |
|---|---|---|---|---|---|
| 0 | -3.0 | 9.0 | -27.0 | 19.658465 | 19.0 |
| 1 | -2.9 | 8.41 | -24.389 | 19.004528 | 15.648 |
| 2 | -2.8 | 7.84 | -21.952 | 14.544002 | 12.584 |
| 3 | -2.7 | 7.29 | -19.683 | 6.982885 | 9.796 |
| 4 | -2.6 | 6.76 | -17.576 | 7.80607 | 7.272 |
| 5 | -2.5 | 6.25 | -15.625 | 8.32215 | 5.0 |
| 6 | -2.4 | 5.76 | -13.824 | -1.289692 | 2.968 |
| 7 | -2.3 | 5.29 | -12.167 | -0.827392 | 1.164 |
| 8 | -2.2 | 4.84 | -10.648 | 6.149442 | -0.424 |
| 9 | -2.1 | 4.41 | -9.261 | -0.183967 | -1.808 |
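The role of the `axis` argument is easier to see on a tiny example. The two small DataFrames below (`left` and `right`) are made up for this sketch and are not part of the lecture data.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]})
right = pd.DataFrame({"y": [10, 20]})

# axis=1: rows are matched by index label and the columns are placed side by side.
side_by_side = pd.concat((left, right), axis=1)

# axis=0: the frames are stacked vertically, and missing entries become NaN.
stacked = pd.concat((left, right), axis=0)

print(side_by_side)
print(stacked)
```

With `axis=1` the result has 2 rows and 2 columns, which is the behavior we want for `df_both`.  With `axis=0` it has 4 rows, half of whose entries are missing.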


* Find the "best" coefficient values for modeling $y \approx c_3 x^3 + c_2 x^2 + c_1 x + c_0$.


In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df_both[["x","x^2","x^3"]], df_both["y"])

* How do these values compare to the true coefficient values?


In [None]:
reg.coef_

array([ 1.6446191 , -2.03654891, -1.83956523])

In [None]:
reg.intercept_

-6.358623201499859

The true values follow the polynomial $y = -2x^3 - 3x^2 + x - 5$.  In our case, we have found approximately $-1.8 x^3 - 2x^2 + 1.6 x - 6.4$.  Given how much random noise we added, these two sequences of coefficients are reasonably similar.

We will see a more concise way to do all of these steps below, using `Pipeline`.

### Using `Pipeline` to combine multiple steps

The above process is a little awkward.  We can achieve the same thing with much less code by using another data type defined by scikit-learn, `Pipeline`.  (The tradeoff is that it is less explicit what is happening.)

* Import the `Pipeline` class from `sklearn.pipeline`.

In [None]:
from sklearn.pipeline import Pipeline

* Make an instance of this `Pipeline` class.  Pass to the constructor a list of length-2 tuples, where each tuple provides a name for the step (as a string) and an instance of the object for that step (like `PolynomialFeatures(???)`).

In [None]:
pipe = Pipeline(
    [
        ('poly', PolynomialFeatures(degree=3, include_bias=False)),
        ('reg', LinearRegression())
    ]
)

* Fit this object to the data.

This is where we really benefit from Pipeline.  The following call of `pipe.fit` first fits and transforms the data using `PolynomialFeatures`, and then fits that transformed data using `LinearRegression`.

In [None]:
pipe.fit(df[["x"]], df['y'])
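To check the description of what `pipe.fit` does, the sketch below runs the manual two-step procedure and the `Pipeline` version on the same data and compares the fitted coefficients. The data here is a stand-in generated with an arbitrary seed, not the notebook's exact `df`, and `X_pow` is a name introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: the same true cubic as above plus normally distributed noise.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y"] = (-5 + df["x"] - 3 * df["x"]**2 - 2 * df["x"]**3
           + rng.normal(scale=5, size=len(df)))

# Manual route: transform the input, then fit the regression on the result.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_pow = poly.fit_transform(df[["x"]])
reg = LinearRegression().fit(X_pow, df["y"])

# Pipeline route: one fit call performs both steps in order.
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("reg", LinearRegression()),
])
pipe.fit(df[["x"]], df["y"])

print(reg.coef_, pipe["reg"].coef_)
```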

* Do the coefficients match what we found above?  Use the `named_steps` attribute, or just use the name directly.

You might try accessing `pipe.coef_`, but that raises an error.  It's not the `Pipeline` object itself that has the fit coefficients, but the `LinearRegression` object within it.

In [None]:
pipe.coef_

AttributeError: 'Pipeline' object has no attribute 'coef_'

The information is recorded in a Python dictionary stored in the `named_steps` attribute of our `Pipeline` object.

In [None]:
pipe.named_steps

{'poly': PolynomialFeatures(degree=3, include_bias=False),
 'reg': LinearRegression()}
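There are a few equivalent ways to reach a step inside a `Pipeline`. The sketch below shows three of them returning the very same object; the variable names on the left are introduced only for this comparison.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("reg", LinearRegression()),
])

by_dict_key = pipe.named_steps["reg"]  # dictionary-style lookup in named_steps
by_name = pipe["reg"]                  # indexing the pipeline by step name
by_position = pipe[-1]                 # indexing the pipeline by position (last step)

print(by_dict_key is by_name, by_name is by_position)
```

Indexing by name (`pipe["reg"]`) is the shortest, and it is the form used in the cells that follow.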

The point of all that is, now that we know how to access the `LinearRegression` object, we can get its `coef_` attribute just like usual when performing linear regression.  (Remember that this attribute only exists after we call `fit`.)

Notice that these are the same values we found above (the intercepts differ only by floating-point rounding in the last digits).  It's worth looking over both procedures and noticing how much shorter the procedure using `Pipeline` is.

In [None]:
pipe['reg'].coef_

array([ 1.6446191 , -2.03654891, -1.83956523])

In [None]:
pipe['reg'].intercept_

-6.3586232014998565

* Call the predict method, and add the resulting values to a new column in `df` named "pred".

The following simple code evaluates our "best fit" degree three polynomial $-1.8 x^3 - 2x^2 + 1.6 x - 6.4$ at every value in the "x" column.  Notice how we don't need to explicitly type `"x^2"` or anything like that; the polynomial part of this polynomial regression happens "under the hood".

In [None]:
df["pred"] = pipe.predict(df[["x"]])
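To make the "under the hood" step visible, this sketch evaluates the fitted cubic by hand from `coef_` and `intercept_` and compares the result with `pipe.predict`. As before, the data is a stand-in generated with an arbitrary seed rather than the notebook's exact `df`, and `by_hand` is a name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data: the true cubic from above plus normally distributed noise.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"x": np.arange(-3, 2, 0.1)})
df["y"] = (-5 + df["x"] - 3 * df["x"]**2 - 2 * df["x"]**3
           + rng.normal(scale=5, size=len(df)))

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("reg", LinearRegression()),
])
pipe.fit(df[["x"]], df["y"])

# Evaluate c_0 + c_1 x + c_2 x^2 + c_3 x^3 explicitly.
x = df["x"].to_numpy()
coef = pipe["reg"].coef_            # [c_1, c_2, c_3]
intercept = pipe["reg"].intercept_  # c_0
by_hand = intercept + coef[0] * x + coef[1] * x**2 + coef[2] * x**3

pred = pipe.predict(df[["x"]])
print(np.allclose(by_hand, pred))
```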

* Plot the resulting predictions using a red line.  Name the chart `c2`.

This curve does lie exactly on a cubic polynomial, approximately $-1.8 x^3 - 2x^2 + 1.6 x - 6.4$.  This is our cubic polynomial of "best fit" (meaning the Mean Squared Error between the data and this polynomial is minimized).  For the given data, measured by Mean Squared Error, this polynomial fits the data "better" than the true underlying polynomial $-2x^3 - 3x^2 + x - 5$.

In [None]:
c2 = alt.Chart(df).mark_line(color = "red").encode(
    x = "x",
    y = "pred"
)
c2
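The claim that the fitted cubic has a smaller Mean Squared Error on the data than the true polynomial can be checked numerically. The sketch below is an assumption-laden stand-in: it uses NumPy's `polyfit` in place of the `Pipeline` (both solve the same least-squares problem), an arbitrary seed, and a helper `mse` defined only for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

true_coefs = np.array([-5, 1, -3, -2])  # increasing degree, as in m
x = np.arange(-3, 2, 0.1)

rng = np.random.default_rng(seed=0)     # arbitrary seed for this sketch
y = P.polyval(x, true_coefs) + rng.normal(scale=5, size=len(x))

# Least-squares cubic fit; coefficients come back in increasing degree.
fit_coefs = P.polyfit(x, y, deg=3)

def mse(coefs):
    """Mean Squared Error between the noisy data and the given polynomial."""
    return np.mean((y - P.polyval(x, coefs)) ** 2)

mse_fit = mse(fit_coefs)
mse_true = mse(true_coefs)
print(mse_fit, mse_true)
```

Because the true polynomial is itself a cubic, and the least-squares fit minimizes the error over all cubics, `mse_fit` can never exceed `mse_true` on the data used for fitting.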

* Plot the true values using a dashed black line, using `strokeDash=[10,5]` as an argument to `mark_line`.  Name the chart `c3`.

Don't focus too much on the `strokeDash=[10,5]` part; I just wanted to show you an example of an option that exists.  Here the dashes are made with 10 black pixels followed by a gap of 5 pixels.

This curve represents the true underlying polynomial that we used to generate the data (before adding the random noise to it).

In [None]:
c3 = alt.Chart(df).mark_line(color = "black", strokeDash = [10,5]).encode(
    x = "x",
    y = "y_true"
)
c3

* Layer these plots on top of the above scatter plot `c1`.

Notice how similar our two polynomial curves are.  If we had used more data points or a smaller standard deviation for our random noise, we would expect the curves to be even closer to each other.

In [None]:
c1 + c2 + c3
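The effect of the noise level on how close the fitted curve is to the true one can be sketched as follows. This example only varies the noise scale (not the number of data points), reuses one fixed noise draw so the comparison is like-for-like, and uses NumPy's `polyfit` as a stand-in for the `Pipeline`; the helper `coef_error` and the seed are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

true_coefs = np.array([-5, 1, -3, -2])  # increasing degree, as in m
x = np.arange(-3, 2, 0.1)
y_true = P.polyval(x, true_coefs)

rng = np.random.default_rng(seed=0)     # arbitrary seed for this sketch
z = rng.normal(size=len(x))             # one fixed draw, rescaled below

def coef_error(scale):
    """Largest absolute gap between fitted and true coefficients."""
    fitted = P.polyfit(x, y_true + scale * z, deg=3)
    return np.abs(fitted - true_coefs).max()

err_large = coef_error(5)    # the noise level used in this notebook
err_small = coef_error(0.5)  # ten times less noise
print(err_large, err_small)
```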
