# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

# Linear Regression
What are our learning objectives for this lesson?
* Calculate a least squares linear regression line

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Linear Regression
In scatter plots, it can be nice to "fit a line"
<img src="https://raw.githubusercontent.com/GonzagaCPSC322/U3-Data-Analysis/master/figures/linear_regression_example.png" width="600"/>

* this can be done via linear regression
* we're going to look at a simple approach called "Least Squares"

The basic idea: Given a set of points, find a line that "best" fits the points
* i.e., find values for $m$ (slope) and $b$ (intercept) such that $y = mx + b$ best fits the points

In least squares linear regression:
* find the $m$ and $b$ that minimize the sum of the squared (vertical) distances from the line to the measured data points
* once we find $m$, finding $b$ isn't difficult

The basic least squares approach:
1. Calculate the mean $\bar{x}$ of the $x$ values and the mean $\bar{y}$ of the $y$ values
    * note the line must go through the point ($\bar{x}$, $\bar{y}$)
2. Calculate the slope using the means (where $n$ is the number of data points):
$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
3. Calculate the $y$ intercept as $b = \bar{y} - m\bar{x}$
     * or, $\bar{y} = m\bar{x} + b$ ... since we know it must go through ($\bar{x}$, $\bar{y}$)
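
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive implementation: the function name `compute_slope_intercept` and the sample points are made up for this example, and the code assumes `xs` and `ys` are equal-length lists of numbers with at least two distinct $x$ values.

```python
def compute_slope_intercept(xs, ys):
    """Return (m, b) for the least squares line through paired xs and ys."""
    n = len(xs)
    # Step 1: the means (true division, not integer division)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    # Step 2: the slope from the formula above
    numerator = sum((xs[i] - x_avg) * (ys[i] - y_avg) for i in range(n))
    denominator = sum((xs[i] - x_avg) ** 2 for i in range(n))
    m = numerator / denominator
    # Step 3: the intercept, since the line passes through (x_avg, y_avg)
    b = y_avg - m * x_avg
    return m, b

# hypothetical points that lie exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = compute_slope_intercept(xs, ys)
print(m, b)  # prints 2.0 1.0
```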
     
The correlation coefficient $r$ helps check how strong the linear relationship is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
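* note the denominator has the same form as the numerator, but with each difference squared to strip away the signs, so $r$ always falls between -1 and 1
* if the relationship is perfectly linear (positive slope), then the result is 1
* if the relationship is perfectly inverse linear (negative slope), then the result is -1
* if there is no linear relationship, the result is 0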
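
As a rough sketch, the formula for $r$ can be computed directly with the standard library. The function name `compute_correlation` and the data sets are assumptions made for illustration. The last data set is $y = x^2$ on symmetric $x$ values, which gives $r = 0$ even though $y$ is fully determined by $x$: $r$ only measures the *linear* relationship.

```python
import math

def compute_correlation(xs, ys):
    """Return the correlation coefficient r for paired xs and ys."""
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    numerator = sum((xs[i] - x_avg) * (ys[i] - y_avg) for i in range(n))
    denominator = math.sqrt(
        sum((x - x_avg) ** 2 for x in xs) * sum((y - y_avg) ** 2 for y in ys)
    )
    return numerator / denominator

xs = [0, 1, 2, 3, 4]
print(compute_correlation(xs, [1, 3, 5, 7, 9]))  # perfectly linear: 1.0
print(compute_correlation(xs, [9, 7, 5, 3, 1]))  # perfectly inverse linear: -1.0
print(compute_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x^2: 0.0
```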

An alternative formula for the slope (where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$):
$$m = r \frac{\sigma_y}{\sigma_x}$$
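
One way to convince yourself the two slope formulas agree is to compute both on the same data. The sketch below uses made-up points and `statistics.pstdev` from the standard library, which computes the population standard deviation (dividing by $n$); the variable names are illustrative only.

```python
import math
import statistics

# hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_avg = sum(xs) / n
y_avg = sum(ys) / n

# the three sums shared by the slope and correlation formulas
sxy = sum((xs[i] - x_avg) * (ys[i] - y_avg) for i in range(n))
sxx = sum((x - x_avg) ** 2 for x in xs)
syy = sum((y - y_avg) ** 2 for y in ys)

m_direct = sxy / sxx               # slope from the means
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
m_alt = r * statistics.pstdev(ys) / statistics.pstdev(xs)  # slope via r

# the two slopes match up to floating point rounding
print(math.isclose(m_direct, m_alt))
```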

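The covariance can also be used to assess correlation (a positive value means $x$ and $y$ tend to increase together, a negative value means one tends to decrease as the other increases):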
$$cov = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$
* covariance can also be used to calculate the correlation coefficient:
$$r = \frac{cov}{\sigma_x \sigma_y}$$
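
A small sketch of both formulas, assuming made-up data and a hypothetical helper named `compute_covariance`. Because the covariance above divides by $n$, the standard deviations must also divide by $n$ (population standard deviation), which is what `statistics.pstdev` computes.

```python
import math
import statistics

def compute_covariance(xs, ys):
    """Return the covariance of paired xs and ys (dividing by n)."""
    n = len(xs)
    x_avg = sum(xs) / n
    y_avg = sum(ys) / n
    return sum((xs[i] - x_avg) * (ys[i] - y_avg) for i in range(n)) / n

# hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
cov = compute_covariance(xs, ys)
# correlation coefficient from the covariance and standard deviations
r = cov / (statistics.pstdev(xs) * statistics.pstdev(ys))
print(cov, r)
```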

The standard error is also used to help check the fit:
$$stderr = \sqrt{\frac{\sum_{i=1}^{n}(y_i - y_i^\prime)^2}{n}}$$
* where $y_i^\prime = mx_i + b$ is the "predicted" value and $y_i$ is the actual value
* $(y_i - y_i^\prime)$ is called a "residual"
* note the standard error is essentially the standard deviation of the residuals
* the lower the value, the "better" the fit
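
The standard error formula can be sketched as below. The function name `compute_std_error` and the data are assumptions for illustration; the values `m = 0.6` and `b = 2.2` were worked out by hand from the slope and intercept formulas for these particular points.

```python
import math

def compute_std_error(xs, ys, m, b):
    """Return the standard error of the line y = m*x + b on paired xs and ys."""
    n = len(xs)
    # residual = actual value minus predicted value
    residuals = [ys[i] - (m * xs[i] + b) for i in range(n)]
    return math.sqrt(sum(res ** 2 for res in residuals) / n)

# hypothetical sample data and its least squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
stderr = compute_std_error(xs, ys, 0.6, 2.2)
print(stderr)  # roughly 0.69

# points lying exactly on y = 2x + 1 have no residuals
print(compute_std_error([0, 1, 2], [1, 3, 5], 2, 1))  # prints 0.0
```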

There are more checks along these lines (e.g., looking at the distribution of the "residuals")

Some general hints for calculating values associated with linear regression
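* use `numpy.std(xs)` to calculate standard deviation (by default it divides by $n$, which matches the formulas above)
* beware integer division (e.g., `sum(xs) // n` truncates the result; use `sum(xs) / n`)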
* use list comprehensions, e.g.: 
```
sum([(xs[i] - x_avg) * (ys[i] - y_avg) for i in range(n)])
```

Q: What does it mean if there is a strong (linear) correlation?
* one of the attributes is (potentially) redundant because it is implied by the other
* one is a good predictor for the other ... good if one is a class label
* i.e., regression is one way to make predictions (more later)