# Curve fitting

Another way to summarize data is to find an equation that approximates the relationship between the variables.  This is known as "curve fitting", and when used effectively it can be a useful tool for summarizing data and making predictions.


## Reminder: visualize your data first!

Here is a [fun interactive example](http://johnburnmurdoch.github.io/cityddj/anscombe) of several datasets that share the same summary statistics but look very different when plotted.  (The demo doesn't seem to work in Firefox; try Chrome instead.)
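
As a rough numerical illustration, the sketch below uses two of the four datasets from Anscombe's quartet, the classic example of this effect and, judging by its URL, the one the demo is built on.  The values are typed in by hand and the variable names are only for illustration.  It prints the mean, variance, and correlation for each dataset.

```python
import numpy as np

# Two of the four datasets from Anscombe's quartet (they share the same x-values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    r = np.corrcoef(x, y)[0, 1]  # correlation between x and y
    print(f"dataset {name}: mean(y) = {y.mean():.2f}, "
          f"var(y) = {y.var(ddof=1):.2f}, r = {r:.2f}")
```

The printed numbers agree to two decimal places, yet a scatter plot shows that dataset I is a noisy line while dataset II follows a smooth curve.  The statistics alone cannot tell you that.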

## Linear fit

The simplest type of "curve fit" is a "linear fit", i.e., a line.  To demonstrate this, we'll start by generating some random data with a linear trend:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

N = 100 # Number of points
m = 0.5 # Slope
b = 4 # Y-intercept
xrange = 10

np.random.seed(0)

x = np.random.random(N) * xrange # Generate random X-values between 0 and xrange
y = m * x + b + np.random.randn(N)

plt.scatter(x, y)
plt.show()

Let's try to find the line that best represents this data.  Not surprisingly, [NumPy has tools for this built in](https://numpy.org/doc/stable/reference/generated/numpy.polynomial.polynomial.Polynomial.fit.html#numpy.polynomial.polynomial.Polynomial.fit).

`Polynomial.fit` performs a least-squares fit.  We won't go into the mathematical derivation here, but you can read [all about it on Wikipedia](https://en.wikipedia.org/wiki/Least_squares).
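
For the special case of a straight line, the least-squares answer has a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means.  The sketch below applies that formula to the same synthetic data as above, regenerated here so the block runs on its own.  The names `m_hat` and `b_hat` are just illustrative.

```python
import numpy as np

# Same synthetic data as in the cell above
np.random.seed(0)
x = np.random.random(100) * 10
y = 0.5 * x + 4 + np.random.randn(100)

# Closed-form least-squares solution for a line y = m*x + b
x_mean, y_mean = x.mean(), y.mean()
m_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b_hat = y_mean - m_hat * x_mean

print(f"slope = {m_hat:.3f}, intercept = {b_hat:.3f}")
```

The estimates should land near the true values (0.5 and 4) but not exactly on them, because the data contains noise.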

In [None]:
from numpy.polynomial import Polynomial

# We pass three arguments to Polynomial.fit:
# x - a NumPy vector of x-coordinates
# y - a NumPy vector of y-coordinates
# deg - the degree of the polynomial (1 for a line, 2 for a quadratic, etc.)
fit = Polynomial.fit(x, y, deg=1)

# Extract the coefficients from the polynomial
coef = fit.convert().coef

# Plot the data with the calculated fit
# Your code here...


After this code runs, the variable `coef` contains the coefficients of the polynomial, ordered from lowest degree to highest.  In the case of a linear fit, there are two coefficients, since the equation is a line, $y = mx + b$: `coef[0]` is the intercept $b$ and `coef[1]` is the slope $m$.
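
To check which coefficient is which, it helps to fit noise-free data whose answer is known in advance.  The sketch below uses made-up points on the line $y = 2x + 1$ (the names `x_demo`, `y_demo`, and `fit_demo` are only for illustration).  It also shows why the `convert()` call matters: `Polynomial.fit` rescales the x-values internally, so the raw `coef` attribute describes the line in terms of the rescaled variable rather than `x`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noise-free points on the line y = 2x + 1, so the expected answer is known
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2 * x_demo + 1

fit_demo = Polynomial.fit(x_demo, y_demo, deg=1)

# Coefficients in terms of x, lowest degree first: [intercept, slope]
b_demo, m_demo = fit_demo.convert().coef
print(b_demo, m_demo)   # close to 1 and 2

# Without convert(), the coefficients refer to x rescaled onto [-1, 1]
print(fit_demo.coef)    # close to [4, 3], not [1, 2]
```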


Write code to plot the linear fit line on top of the data.  Remember that if you make multiple calls to `plot()` or `scatter()` before showing the plot, matplotlib will put all the data together for you on one plot.

*Challenge*: Experiment with different input coefficients and parameters.  How do they affect the data?  How do they affect the accuracy of the fit?

*Challenge*: Try using different seeds for the random-number generator.  How much does the fit vary?


## Polynomial fit

We can do a higher-order polynomial fit just as easily:

In [None]:
# Generate some random data to fit a curve to
N = 100 # Number of points
xrange = 10

np.random.seed(8)

x = np.random.random(N) * xrange
y = -0.18 * x**3 + 2 * x**2 - 1.5 * x + 2 + 4 * np.random.randn(N)

plt.scatter(x, y)
plt.show()

Instead of extracting the coefficients and calculating the function manually, you can simply pass an array of x-coordinates to the `fit` object returned by `Polynomial.fit()`: `array_of_y_values = fit(array_of_x_values)`.
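
As a small sanity check that the two approaches agree, the sketch below fits a quadratic to noise-free toy data (made up for illustration, as are the names `x_toy`, `fit_toy`, and `x_curve`) and evaluates it both ways.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noise-free toy data on y = x**2 - 3x + 2 (made up for illustration)
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = x_toy**2 - 3 * x_toy + 2

fit_toy = Polynomial.fit(x_toy, y_toy, deg=2)

# Evenly spaced, sorted x-values at which to evaluate the fitted curve
x_curve = np.linspace(x_toy.min(), x_toy.max(), 50)

# Option 1: call the fit directly
y_direct = fit_toy(x_curve)

# Option 2: extract the coefficients and evaluate by hand
c0, c1, c2 = fit_toy.convert().coef
y_manual = c0 + c1 * x_curve + c2 * x_curve**2

print(np.allclose(y_direct, y_manual))  # True
```

Evaluating on sorted, evenly spaced x-values such as `np.linspace` output is a reasonable habit when drawing the curve with `plot()`, since the random x-values in the data are not in order and would produce a zigzag line.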

* Write code to fit a 2nd-order (quadratic) polynomial to the data above, and plot it on top of the data.
* Try a 3rd-order polynomial (cubic)
* What happens if you increase the order even higher?  What are the advantages and disadvantages of doing this?


In [None]:
# Your code here...

## CDC COVID deaths

Data is taken from https://covid.cdc.gov/covid-data-tracker/#trends_dailydeaths.  You can download an up-to-date copy by clicking "Data Table for Daily Death Trends - The United States" and then clicking "Download Data".  The CDC provides the data as a simple CSV file.
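
The cell below reads the file by splitting each line on commas.  Python's standard `csv` module is an alternative worth knowing, since it also copes with quoted fields that contain commas.  The sketch below runs on a tiny made-up sample rather than the real file: the header text, column names, dates, and counts are all placeholders, and the layout (four header rows, newest day first, count in the third column) is simply the one the cell below assumes.

```python
import csv
import io

# Made-up sample in the assumed layout: four header rows, then one row per day
# (newest first) with the daily count in the third column
sample = io.StringIO(
    "header line 1\n"
    "header line 2\n"
    "header line 3\n"
    "place,date,count\n"
    "United States,Jan 3 2021,30\n"
    "United States,Jan 2 2021,20\n"
    "United States,Jan 1 2021,10\n"
)

reader = csv.reader(sample)
for _ in range(4):       # skip the header rows
    next(reader)

sample_deaths = [int(row[2]) for row in reader]
sample_deaths.reverse()  # oldest first
print(sample_deaths)     # [10, 20, 30]
```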

In [None]:
# Assume that there is one data point per day, in reverse chronological order
deaths = []
with open('data_table_for_daily_death_trends__the_united_states.csv') as datafile:
    # First four rows are header info
    for i in range(4):
        datafile.readline()

    # The rest of the file is data; the third column is the daily death count
    for line in datafile:
        bits = line.split(',')
        deaths.append(int(bits[2]))

# Flip the data so oldest is first
deaths.reverse()

# Grab just the part from the first death until mid-May
deaths = deaths[30:120]

# Plot
days = range(len(deaths))
plt.bar(days, deaths)
plt.show()

In [None]:
# Fit a cubic to this data
# Based on this model, when will Covid be over?

*Challenge*: We're slicing only a few months of data (from spring 2020), but there is more to explore.  Look at the rest and try fitting other curves!