<h2> Linear regression </h2>

In this notebook, we're going to fit some linear regressions and see whether they can be used to build a solid predictive model. We'll analyze weather data for Merced sourced from the [National Weather Service](https://www.weather.gov/wrh/climate?wfo=hnx); the file `march_weather.csv` contains the high and low temperatures for each day of March 2024.

March is a transitional month, when the weather is shifting fairly quickly from winter toward summer, so we might guess that a linear model could predict the temperature. In this case our independent variable *x* will be the number of days into March and our dependent variable *y* will be the temperature on that day. Let's get everything initialized:

In [None]:
# We have to import a bunch of things to get the data read.
# Just run this cell as it is.

import csv, datascience
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots

plots.style.use('fivethirtyeight')
weather_table = datascience.Table.read_table('march_weather.csv')
weather_table

In [None]:
# Now let's set up the dates and the high and low temperatures.
# Again, just run this cell:
dates = list(range(1, 32))
highs = weather_table.column('High temperature')
lows = weather_table.column('Low temperature')

After running this cell, you should have three arrays: one that's just a list of days from 1 to 31 inclusive and two with temperature data. Our goal is to find the least squares regression fits for the high temperatures and the low temperatures. That is, we're going to find the coefficients $\alpha$ and $\beta$ so that the line $y = \alpha + \beta x$ is the best fit for whatever data set we're working with. You might find it helpful to refer back to equations 22.1 and 22.2 in the textbook; here is some code that will get you started:

In [None]:
def sum_of_squares(array):
    s = 0
    for a in array:
        s += a*a
    return s

def sum_of_products(array_one, array_two):
    # This is assuming they're already the same length.
    s = 0
    for k in range(len(array_one)):
        s += array_one[k] * array_two[k]
    return s
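
To see how sums like these feed into a fit, here is a minimal sketch on made-up toy data (not the weather data). It assumes the standard closed-form least squares expressions, $\beta = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$ and $\alpha = \dfrac{\sum y_i - \beta\sum x_i}{n}$; the textbook's equations 22.1 and 22.2 may arrange these differently. The variable names are illustrative only, and `np.polyfit` is used purely as a cross-check.

```python
import numpy as np

# Toy data, made up for illustration: points lying exactly on y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

n = len(x)
sum_x = x.sum()
sum_y = y.sum()
sum_xx = np.sum(x * x)   # the same quantity sum_of_squares(x) computes
sum_xy = np.sum(x * y)   # the same quantity sum_of_products(x, y) computes

# Closed-form least squares slope (beta) and intercept (alpha).
beta = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
alpha = (sum_y - beta * sum_x) / n

# Cross-check: a degree-1 polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

print(alpha, beta)
print(intercept, slope)
```

Because the toy points lie exactly on a line, both approaches should recover an intercept of 1 and a slope of 2 (up to floating-point rounding in the `polyfit` result). The same steps apply with `dates` as *x* and `highs` or `lows` as *y*.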

Finally, here's one last piece of code: it will make a scatter plot with the temperature data and put the regression fit on top of it in a different color. All you need to do to run it is insert appropriate values for `a` and `b`, which stand for the coefficients $\alpha$ and $\beta$; right now, they're just given as `...`.

In [None]:
# This code plots a regression fit along with a dataset.

data = highs             # Change the data you're plotting against as needed.
a = ...                  # Change this to be the value you find.
b = ...                  # Change this to be the value you find.

# Do not alter the remaining code in this block:
plots.scatter(range(1, 32), data)
plots.scatter(range(1, 32), [a + b*d for d in range(1, 32)])

## Questions 

* **Question 1**: Find the best fit for the high temperature data and use the given code to plot it along with the data.
* **Question 2**: Find the best fit for the low temperature data and use the given code to plot it along with the data.
* **Question 3**: Compute the sum of squares of the residuals for both lines. Which model has a better fit? Does this match the graphs?
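
For Question 3, a residual is the gap between an observed value and the value the line predicts, $y_i - (\alpha + \beta x_i)$. One possible way to total the squared residuals is sketched below on made-up toy data; the function name `residual_sum_of_squares` and the toy values are hypothetical and not part of the provided code.

```python
def residual_sum_of_squares(xs, ys, a, b):
    """Return the sum of squared residuals y - (a + b*x) over paired data."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)   # observed minus predicted
        total += residual * residual
    return total

# Toy data, made up for illustration only.
toy_x = [1, 2, 3]
toy_y = [2, 4, 7]

# Compare two candidate lines on the same data.
rss_first = residual_sum_of_squares(toy_x, toy_y, a=0, b=2)   # line y = 2x
rss_second = residual_sum_of_squares(toy_x, toy_y, a=1, b=2)  # line y = 1 + 2x

print(rss_first, rss_second)
```

On the toy data the first line has the smaller total, so it sits closer to the points. For the weather data, both fits cover the same 31 days, so their totals can be compared directly.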

<h2> Submitting this to Gradescope </h2>

Once you've finished modifying your notebook and answering the questions, you'll need to submit it to Gradescope along with your other homework. To do this, generate a PDF file by clicking `File -> Save and Export Notebook as... -> PDF`. Then upload that PDF to Gradescope and submit it to the assignment `Jupyter 10 - Regression`. As always, if you have any questions or run into any issues, you can
* ask during discussion,
* email your TA or instructor,
* or bring them to student hours!

Congratulations! This was the final Jupyter notebook of the semester.