# tylerharter / cs301-projects

# Lab 10b: Fit Lines

In this lab, you'll learn three things:

1. how to draw a fit line
2. what numpy is, and its relationship to pandas
3. how to compute a fit line

## Drawing a Fit

Scatter plots are a good way to visualize correlations. We'll often want to overlay the scattered points with a line that represents the approximate relationship. This can help us see the pattern underlying the noise.

Let's construct a DataFrame with two related columns: tree age and tree height. We'll imagine that `height ≈ age * 2.3` (or `height = age * 2.3 + NOISE`) and randomly generate data that follows this relationship. Complete and run the following to generate 100 random trees:

```python
import random
from pandas import DataFrame

ages = []
heights = []
for i in range(????):
    age = random.uniform(1, 10)
    noise = random.uniform(-1.5, 1.5)
    height = age * 2.3 + noise
    ages.append(age)
    heights.append(????)

trees = DataFrame({"age":ages, ????:heights})
```

The above code has some randomness, so you'll get different numbers each time you run it, but it should look something like this:

![tree-df.png](tree-df.png)

Complete the following to plot a scatter of the data:

```python
# tree scatter
import matplotlib
matplotlib.rcParams["font.size"] = ????

ax = trees.plot.????(x="age", y="heights", c=????, xlim=0, ylim=0)
ax.set_xlabel(????)
ax.????("Height (feet)")
```

Fill in the `????` parts to get a plot that looks something like this:

![scatter.png](scatter.png)

If we want to draw a fit line, we need to add some fitted height values to match the age values we already have in the DataFrame. Let's use a slope of 2.

```python
trees["height-fitted"] = trees["age"] * 2
```

It should look something like this:

![tree-df-2.png](tree-df-2.png)

Notice there's some difference between `heights` and `height-fitted`. The former holds the actual values; the latter is what the height would be if it were always exactly twice the age, with no noise. Let's plot the fit line on top of our scatter:

```python
# tree scatter
import matplotlib
matplotlib.rcParams["font.size"] = ????

ax = trees.plot.????(x="age", y="heights", c=????, xlim=0, ylim=0)
ax.set_xlabel(????)
ax.????("Height (feet)")

trees.plot.line(ax=ax, x="age", y="height-fitted", color="red")
```

Note that the above cell is the same as the earlier example, with the addition of just the `trees.plot.line` line at the end, so you could copy from earlier rather than fill in all the `????` parts. It should look like this:

![scatter-2.png](scatter-2.png)

So drawing a fit line is easy. Of course, we just made up the slope. We need to do some linear algebra to compute the slope and intercept in a meaningful way. We'll use the `numpy` module for this.

## Numpy

Numpy is the most popular way to represent matrices in Python and do linear algebra. Import it:

`import numpy as np`

The main data structure in numpy is the array; it is used to represent vectors and matrices. Try creating one:

`np.array([1,2,3,4,5,6,7,8])`

To create a matrix, you can start with a vector, then impose some structure on it with a call to `.reshape(ROWS, COLS)`.

Create a matrix with 2 rows and 4 columns:

```python
matrix = np.array([1,2,3,4,5,6,7,8]).reshape(2, 4)
matrix
```

Now create a matrix with 4 rows and 2 columns:

```python
matrix = np.array([1,2,3,4,5,6,7,8]).reshape(4, 2)
matrix
```

The output of the above looks like this:

```
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
```

As you can see, a numpy array looks like a list of lists; indeed, you can access it as such (the following gets 7):

`matrix[3][0]`

Complete the following to get `6` from `matrix`:

`matrix[????][????]`
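As a side note, numpy arrays also accept the row and column together as `array[row, col]`, which reaches the same element as the chained list-style indexing. The sketch below uses a separate, made-up array called `grid` (a hypothetical name, so it doesn't give away the exercise above) and also prints `.shape` to confirm the layout that `reshape` produced:

```python
import numpy as np

# A made-up 3x3 array for illustration (not part of the lab data)
grid = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90]).reshape(3, 3)

# List-of-lists style: pick row 1 first, then column 2 within that row
chained = grid[1][2]

# numpy style: give the row and column together
direct = grid[1, 2]

# .shape reports (rows, columns)
print(grid.shape, chained, direct)
```

Both styles count rows and columns from zero, just like regular Python lists.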

Pandas is closely integrated with numpy, so it is easy to convert a Pandas DataFrame to a numpy array, using the `.values` attribute. Try it:

`trees.values`

You should see something like this:

```
array([[ 4.74371708, 12.00074986,  9.48743415],
       [ 9.29054235, 20.08030328, 18.58108469],
       [ 6.01233576, 13.31622645, 12.02467151],
       [ 3.97135042,  8.41919774,  7.94270085],
       [ 4.13624753,  8.87260787,  8.27249507],
       [ 4.02866154,  8.24494275,  8.05732308],
       [ 1.92147542,  5.71251348,  3.84295085],
       [ 7.36679907, 18.38501823, 14.73359814],
       [ 4.74236776,  9.74468899,  9.48473553],
       ...
       [ 2.6725559 ,  6.03215439,  5.3451118 ]])
```
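To see the conversion on something small and predictable, here is a minimal sketch with a made-up three-row DataFrame called `small` (a hypothetical name; the real `trees` data is random). It assumes a reasonably recent version of pandas, where `.to_numpy()` is available as another way to get the same array:

```python
import numpy as np
from pandas import DataFrame

# A tiny made-up DataFrame (the real `trees` data is random)
small = DataFrame({"age": [1.0, 2.0, 3.0], "heights": [2.5, 4.4, 7.1]})

as_array = small.values          # 2-D numpy array, one row per DataFrame row
also_array = small.to_numpy()    # method form that returns the same data

print(type(as_array))
print(as_array.shape)            # (rows, columns)
print(as_array[2][1])            # row 2, column 1 (the third height)
```

Note that the column names are not part of the array; only the numbers carry over.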

## Computing a Fit

We'll use something called the Least Squares Method to find a fit line for our trees data (https://en.wikipedia.org/wiki/Least_squares). For CS 301, you only need to understand it at an intuitive level. Imagine we attached a movable line to every point in our data with a spring, as in the following:

![springs.png](springs.png)

The line will naturally settle so as to minimize the total tension in all the springs. Imagine that the tension in a spring is the square of the distance the spring is stretched (e.g., stretching a spring twice as far increases the tension in that spring by 4 times).
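To make the spring picture concrete, the sketch below adds up the squared stretch for a few candidate lines. The points and the helper name `total_tension` are made up for illustration, and the stretch is measured vertically (the gap between a point's actual `y` and the line's `y` at the same `x`), which is how ordinary least squares measures it:

```python
import numpy as np

# Made-up points that lie close to the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def total_tension(slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    stretched = y - (slope * x + intercept)
    return float((stretched ** 2).sum())

# Try a few candidate lines through the origin; smaller is a better fit
for slope in [1.0, 2.0, 3.0]:
    print(slope, total_tension(slope, 0))
```

The slope of 2 gives a far smaller total than 1 or 3, which matches the intuition that the line "settles" where the springs are least stretched overall.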

Rather than get into the math of computing how the line will settle, we'll use the `np.linalg.lstsq(...)` function in numpy.

To set up the problem, imagine that we want to find some coefficients to get an approximate formula that relates some of the columns in a DataFrame. For a simple `y = m*x + n` line, the relationship on the DataFrame might look like this:

![columns.png](columns.png)

Note that every coefficient (even the one for the line's intercept) needs to be multiplied by a column (that's just what the `lstsq` function expects), so we have a dummy column containing just ones for `n`. As you can see, we're trying to relate the `y` column Series (an output) to a `DataFrame` of values (the inputs) from which we want to estimate `y`.
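The role of the dummy column can be checked by hand. In this minimal sketch, the DataFrame `inputs_demo` and the hand-picked coefficients `m = 2` and `n = 5` are made up for illustration; multiplying each row by the coefficients and adding gives exactly `m*x + n`:

```python
import numpy as np
from pandas import DataFrame

# Made-up inputs: an x column plus the dummy column of ones
inputs_demo = DataFrame({"x": [1.0, 2.0, 3.0], "one": [1, 1, 1]})

# Hypothetical coefficients, chosen by hand: m = 2 and n = 5
coefficients = np.array([2.0, 5.0])

# Each row becomes x*m + 1*n, which is exactly the line y = m*x + n
y_demo = inputs_demo.values @ coefficients
print(y_demo)
```

Without the column of ones there would be nothing for `n` to multiply, so the fitted line would be forced through the origin.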

Let's add the dummy column to our `trees` DataFrame and pull out the inputs and output:

```python
output = trees["heights"]
trees["one"] = 1
inputs = trees[["age", "one"]]
```

If we have a DataFrame `df`, then `df[list_of_columns]` will create a DataFrame that has a subset of the original columns (as specified in the list), so `inputs` will look something like this (`age` is the `x` in this case):

![inputs.png](inputs.png)

OK, now we're ready to crunch some numbers:

```python
result = np.linalg.lstsq(inputs, output, rcond=None)
result
```

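Notice we're passing our `inputs` DataFrame and `output` Series; numpy can work with these Pandas types. The `rcond=None` is an unimportant detail for our purposes (it tells numpy to use its current default cutoff for very small singular values, and it avoids a warning in some numpy versions), so you should always pass it. `result` will look something like this: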

```
(array([2.27595611, 0.10250293]),
 array([80.70169711]),
 2,
 array([64.20719236,  3.56646379]))
```

Notice it's a tuple with four values, as described here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html. According to the documentation, the tuple is like this: `(coefficients, residuals, rank, singular_values)`. Here, we only care about the coefficients, so let's pull those out:

```python
slope = result[0][0]
intercept = result[0][1]
slope, intercept
```
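One way to build confidence in the indexing is to run the same steps on data where the answer is known in advance. The sketch below is an illustration only: the data lies exactly on `y = 3x + 1`, the `_check` variable names are made up, and `np.polyfit` (a different numpy function that fits a polynomial, here of degree 1) is used as an independent cross-check rather than as part of the lab's method:

```python
import numpy as np

# Made-up data lying exactly on y = 3x + 1, so the expected answer is known
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 * x + 1

# Same setup as above: an x column and a dummy column of ones
inputs_check = np.column_stack([x, np.ones(len(x))])
coefficients, residuals, rank, singular_values = np.linalg.lstsq(inputs_check, y, rcond=None)
slope_check, intercept_check = coefficients

# np.polyfit with degree 1 fits a line too, returning (slope, intercept)
poly_slope, poly_intercept = np.polyfit(x, y, 1)

print(slope_check, intercept_check)
print(poly_slope, poly_intercept)
```

Both approaches should report a slope very close to 3 and an intercept very close to 1, possibly with tiny floating-point differences. The coefficients come back in the same order as the input columns, which is why the slope comes first here.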

Let's use this slope and intercept to fill in the `height-fitted` column correctly now:

```python
trees["height-fitted"] = trees["age"] * slope + intercept
```

Let's conclude by re-plotting the scatter data and fit line:

```python
import matplotlib
matplotlib.rcParams["font.size"] = 16

ax = trees.plot.scatter(x="age", y="heights", c="black", xlim=0, ylim=0)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Height (feet)")

trees.plot.line(ax=ax, x="age", y="height-fitted", color="red")
```