**Note: Please make your own copy of this notebook to run and execute, thank you!**

1.   Go to the menu tab on the top left corner
2.   Click on "File"
3.   Under the File tab menu click on "Save a copy in Drive..."

# Introduction to Univariate Linear Regression

Suppose we wish to analyze the relationship between a vehicle's weight and its fuel economy, or between the price of a slice of pizza and the volume of pizza produced. How might we analyze the relationship between these variables? In this lesson, we'll cover univariate linear regression, which is a statistical approach for finding a relationship between an independent variable $x$ and a dependent variable $y$.

So how does univariate linear regression help us find answers to these questions? As humans, we can plot our data and perhaps observe a general pattern or trend. Univariate linear regression works in a similar way, where it tries to find the best line that fits our data, often called the "line of best fit". 

We'll cover how univariate linear regression is used, how it makes predictions from the equation of a line, how it uses data to find the best fit line, and finally how to program it from scratch.

[Univariate Linear Regression Glossary](https://docs.google.com/document/d/10SHSjLqaU__uw-k3q4iFji70ScCsD0CNANujoEuTCLg/edit):

Provided as a list of terms and definitions used in the Univariate Linear Regression lesson to help keep track of the content.

# Example of Univariate Linear Regression - Saving to Buy a Phone

Suppose we currently have 200 dollars saved up and we wish to buy a used phone for 395 dollars. How long would it take for us to pay off this phone if we work as a part-time server for 12 hours per week when part of our income comes from tips? Unfortunately, we do not have a concrete answer since our saving depends on a lot of variables outside of our direct control. However, we can look at our prior income and savings to get an idea of how much money we will be saving in the future.


Assuming our work schedule and the phone price are fixed, and taking living expenses and taxes into account, we can plot our savings against the hours we worked and discover that we save roughly 3 dollars an hour. That is, we keep 3 dollars on average for each hour we work.

Using this information we can estimate that if we work roughly 65 hours (a little over 5 weeks) at our current saving rate, we will save about 195 dollars, which we can add to the 200 dollars we already have to pay the full price of the phone.
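
The arithmetic behind this estimate can be sketched in a few lines of Python. The helper name `hours_to_goal` and the variable names are hypothetical and only serve this illustration; the figures are the ones from the example above.

```python
# Figures taken from the phone example above
phone_price = 395     # dollars
saved = 200           # dollars already saved
saving_rate = 3       # dollars kept per hour worked (on average)
hours_per_week = 12

# Hypothetical helper: hours of work needed to cover the remaining cost
def hours_to_goal(price, already_saved, rate):
    return (price - already_saved) / rate

hours = hours_to_goal(phone_price, saved, saving_rate)
weeks = hours / hours_per_week
print(hours, round(weeks, 2))
```

The same helper can be reused with different starting savings or a different weekly schedule when working through the questions below.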

**Questions:**
*   If we currently have 250 dollars saved, how many hours would we need to work to pay for the phone in full?
*   Assume we had 50 dollars saved, but our weekly schedule was 40 hrs a week, how long will it take us to make the full payment?

# Making Predictions with Univariate Linear Regression

Now that we have a concrete idea of how univariate regression can be used, let's convert it into a generalizable formula which we can use to make predictions. In univariate linear regression, we can write the relationship between the independent variable and the dependent variable as $y = mx + b$ where $m$ is the slope and $b$ is our intercept.

[Introduction to Simple Linear Regression](https://www.youtube.com/watch?v=owI7zxCqNY0):

In statistics we usually do not know the true relationship between $x$ and $y$, so we use the common notation $\hat{y} = f(x) = \beta_1\ x + \beta_0$ to estimate and model this relationship with our data. The $\hat{y}$ represents our predicted or estimated value of $y$, and the $\beta$ symbols (known as parameters) play the same role as our slope and intercept. To get a better idea of what each term means and how it is calculated, check out the first 7:35 minutes of the above video (we will discuss errors and how to fit the model later in this lesson) offered by the INCAE international business school.

In machine learning we typically call $\hat{y}$ our prediction, $x$ our feature, and parameters $w$ weights. Now let's rewrite our regression equation in machine learning notation.

**Regression Line:**

$$y\ estimate: \hat{y}$$

$$ feature: x$$

$$slope\ estimate\ (weight\ 1): w_1$$

$$intercept\ estimate\ (weight\ 0): w_0$$

$$\hat{y} = f(x) = w_1\ x + w_0$$

To make the above equation more concrete let's briefly go back to our phone example. If we fill in the theoretical equation with the data we gathered from our bank records, $w_0$ becomes the money we have already saved, $w_1$ the saving rate, $x$ the hours we plan to work, and the estimate $\hat{y}$ the total amount of money we expect to have saved.
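
As a quick check, plugging the numbers from the phone example into the regression line (200 dollars already saved, 3 dollars saved per hour, 65 hours of work) gives back the price of the phone:

$$w_1 = 3,\ x = 65,\ w_0 = 200$$

$$\hat{y} = w_1\ x + w_0 = 3(65) + 200 = 195 + 200 = 395$$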

Now, let's step through an example of how to calculate the estimated output $\hat{y}$ of a regression where the linear equation is $\hat{y} = 2x$ and our input feature is $x = 4$. In this case there is no intercept. Instead we just have a slope $w_1$ of 2, which means that each time the input increases by 1, the predicted output increases by 2.

**Example:**

$$w_1 = 2,\ x = 4,\ w_0 = 0$$

$$\hat{y} = w_1(x) + w_0 = 2 (x) + 0$$

$$\hat{y} = 2(4) + 0 = 8 + 0 = 8$$

Now that we have an idea of how to use a univariate linear regression to make a prediction, let's go ahead and convert this equation into a small function in Python so in the future we can have the computer do all the heavy lifting for us.

In [0]:
# Build custom univariate linear regression predictive function
def regression(slope, x, intercept):
  return slope * x + intercept

Now let's check that our regression line works as predicted.

In [0]:
# Supply univariate linear regression model with slope, intercept, and input value
slope = 2
x = 4
intercept = 0
print(regression(slope, x, intercept))

**Problem Sets:**
*   Calculate the predicted y value of a regression line where the slope is 3, the intercept is 0 and the input value is 5
*   Calculate the predicted y value of a regression line where the slope is 2, the intercept is -1 and the input value is 1
*   Calculate the predicted y value of a regression line where the slope is 3, the intercept is 2 and the input value is 4

# Fitting a Univariate Linear Regression Line

As you may have noticed, our formula only works if we know the slope and intercept of our model, but what if we don't have this information? Before we can even find our slope and intercept we need to ask ourselves an even more important question. How can we determine if the model is accurately capturing our data? In other words, is our model a good fit? Once we have a criterion for evaluating a model, we can use it to find an optimal slope and intercept.

[Squared Error of Regression Line](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/squared-error-of-regression-line):

While we can intuitively draw lines through data, we'll need to instruct our computer how to evaluate our regression. One approach we can try is to ***minimize*** the error generated by our model by asking how far off each prediction is from the actual outcome. We can do so by measuring the difference between our model's prediction $\hat{y}$ and the true value $y$ using the sum of squared errors method, which Sal Khan discusses further in the video above.

**Sum of Squared Errors:**

$$SSE = \sum_{i = 1}^{N}(y_i - \hat{y}_i)^2$$


Now that we know how to calculate the sum of squared errors, how do we minimize it to find the best fit line for our linear regression? 

Say for example we have a linear equation where the slope is 1 and the intercept is 0. How can we tell if this is a good model? Well, we can compare the predicted values $\hat{y}$ to the true values $y$ in our recorded dataset. That is, for each point in the dataset we subtract the predicted value from the true value, square the difference, and sum the squared errors.

**Regression Model 1:**

$$\hat{y} = 1(x)$$

$$\hat{y} = \{1,\ 2,\ 3,\ 4,\ 5\},\ y = \{2.1,\ 3.9,\ 5.8,\ 7.9,\ 10.2\}$$

$$SSE = (2.1-1)^2 + (3.9-2)^2 + (5.8-3)^2 + (7.9-4)^2 + (10.2-5)^2 = 54.91$$

This linear equation ends up having a much higher squared error value than a second model with a slope of 2 and an intercept of 0. If we measure the errors of that second model (each individual error is otherwise known as a residual), we get a sum of squared errors of 0.11 for our dataset. The error is small but not zero because of the small size of the dataset and a small amount of random noise in the data. When we compute this in code, floating-point arithmetic may display a value very slightly different from 0.11.

**Regression Model 2:**

$$\hat{y} = 2(x)$$

$$\hat{y} = \{2,\ 4,\ 6,\ 8,\ 10\},\ y = \{2.1,\ 3.9,\ 5.8,\ 7.9,\ 10.2\}$$

$$SSE = (2.1-2)^2 + (3.9-4)^2 + (5.8-6)^2 + (7.9-8)^2 + (10.2-10)^2 = 0.11$$

Let's convert the sum of squared error function into Python code.

In [0]:
# Sum of Squared Error between the true values and the model's predictions
def sum_sq_error(y_true, y_pred):
  return sum((y_true - y_pred)**2)

Now, let's verify our code with the example we calculated above.

In [0]:
import numpy as np

# Hypothetical predicted values generated by the model
y_pred = np.array([2, 4, 6, 8, 10])
# Labeled data outcomes to compare how well our model is performing (with some amount of random noise)
y_test = np.array([2.1, 3.9, 5.8, 7.9, 10.2])

# Output the sum of squared errors between the model and the data
print(sum_sq_error(y_test, y_pred))

Okay, so now that we have an error formula, how can we use it to find the optimal slope and intercept (our weights) that will reduce our errors? Well, we can analyze how the SSE behaves as the weights change and see if that helps us determine which weights to choose.
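
To get a feel for how the SSE behaves, here is a minimal sketch that tries a range of candidate slopes (with the intercept held at 0) on the small dataset from the examples above and reports the slope with the lowest SSE. The helper `sse_for_slope` and the grid of candidate slopes are assumptions made for this illustration, not part of the lesson's model.

```python
import numpy as np

# Dataset from the SSE examples above
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 5.8, 7.9, 10.2])

# Hypothetical helper: SSE of the line y_hat = slope * x (intercept fixed at 0)
def sse_for_slope(slope, x, y):
    y_hat = slope * x
    return np.sum((y - y_hat) ** 2)

# Try slopes from 1.0 to 3.0 in steps of 0.1
candidate_slopes = np.linspace(1.0, 3.0, 21)
errors = np.array([sse_for_slope(m, x, y) for m in candidate_slopes])

# The slope on the grid with the smallest error
best_slope = candidate_slopes[np.argmin(errors)]
print(best_slope, errors.min())
```

The errors shrink as the slope approaches 2 and grow again past it, so the SSE traces out a bowl shape. Testing values by hand like this is slow and only as precise as the grid, which is why a more direct method is useful.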

For example, say that we are throwing a baseball in the air and we represent its trajectory using a parabola. How might we find the point at which the baseball is at its highest after it is released? Initially, the ball starts out in our hand, then moves up before falling down again. From this information, we can assume the ball is at its highest when it is no longer moving up but has not yet started to fall. In other words, its velocity (the rate of change of the ball's height) is zero. By writing down a formula for the ball's velocity and setting it to zero, we can solve for the moment the ball reaches its highest point.
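
As a sketch of this idea, assume the ball is released at height $h_0$ with upward speed $v_0$ and that gravity $g$ is the only force acting on it (these symbols are introduced here only for illustration). Its height over time is a parabola, and its velocity is the rate of change of that height:

$$h(t) = h_0 + v_0 t - \frac{1}{2} g t^2$$

$$velocity: h'(t) = v_0 - g t$$

Setting the velocity to zero and solving for $t$ gives the moment the ball is at its highest point, which we can plug back into $h(t)$:

$$v_0 - g t = 0 \Rightarrow t = \frac{v_0}{g}$$

$$h\left(\frac{v_0}{g}\right) = h_0 + \frac{v_0^2}{2g}$$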


In calculus, we call the rate of change a ***[derivative](https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-1-new/ab-2-2/v/calculus-derivatives-1-new-hd-version)***, which Sal Khan goes into in more detail. While we won't go into all the proofs, in a nutshell, instead of manually testing various weight values, we can take the derivative of the sum of squared errors and set it to zero to find our ideal weight parameters. Note: if you would like to see how these equations are derived please refer to the extra resources section at the end of this lesson.

**Slope and Intercept:**

$$slope\ estimate: w_1 = \frac{ss_{xy}}{ss_{xx}}$$

$$intercept\ estimate: w_0 = \bar{y} - w_1\bar{x}$$

**Sum of Squares Deviation (used to find the slope):**

$$ss_{xx} = \sum_{i=1}^{N}(x_i - \bar{x})^2$$

$$ss_{yy} = \sum_{i=1}^{N}(y_i - \bar{y})^2$$

$$ss_{xy} = \sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$
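
For readers comfortable with a little calculus, here is a condensed sketch of where the slope and intercept formulas come from (the full proofs are in the extra resources). All sums run over every data point. We write the SSE in terms of the weights, take the derivative with respect to each weight, and set it to zero:

$$SSE(w_0, w_1) = \sum_{i}(y_i - w_1 x_i - w_0)^2$$

$$\frac{\partial SSE}{\partial w_0} = -2\sum_{i}(y_i - w_1 x_i - w_0) = 0 \Rightarrow w_0 = \bar{y} - w_1\bar{x}$$

$$\frac{\partial SSE}{\partial w_1} = -2\sum_{i}x_i(y_i - w_1 x_i - w_0) = 0$$

Substituting $w_0 = \bar{y} - w_1\bar{x}$ into the second equation, and using the fact that deviations from the mean sum to zero, leaves:

$$\sum_{i}(x_i - \bar{x})(y_i - \bar{y}) = w_1\sum_{i}(x_i - \bar{x})^2 \Rightarrow w_1 = \frac{ss_{xy}}{ss_{xx}}$$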

**Example:**

$$X = \{x_1,\ x_2,\ x_3,\ x_4,\ x_5\} = \{1,\ 2,\ 3,\ 4,\ 5\}$$

$$\bar{x} = 3$$

$$ss_{xx} = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 10$$

$$Y = \{y_1,\ y_2,\ y_3,\ y_4,\ y_5\} = \{2.1,\ 3.9,\ 5.8,\ 7.9,\ 10.2\}$$

$$\bar{y} = 5.98$$

$$ss_{yy} = (2.1-5.98)^2 + (3.9-5.98)^2 + (5.8-5.98)^2 + (7.9-5.98)^2 + (10.2-5.98)^2 = 40.908$$

$$ss_{xy} = (-2)(-3.88) + (-1)(-2.08) + (0)(-0.18) + (1)(1.92) + (2)(4.22) = 20.2$$


Once again let's go ahead and code these formulas into Python and verify they calculate the examples above.

In [0]:
# Sum of Squared Deviation for single variable x
def ssxx(x):
  return sum((x-np.mean(x))**2)

In [0]:
# Sum of Squared Deviation for two variables x and y
def ssxy(x, y):
  xmean = np.mean(x)
  ymean = np.mean(y)
  return sum((x-xmean)*(y-ymean))

In [0]:
x = np.array([1, 2, 3, 4, 5])
y_pred = np.array([2, 4, 6, 8, 10])
y_test = np.array([2.1, 3.9, 5.8, 7.9, 10.2])

print(ssxx(x))
print(ssxx(y_test))
print(ssxy(x, y_test))
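
With these sums in hand, the slope and intercept formulas can be applied to the example dataset. The sketch below is self-contained, so it recomputes the sums rather than reusing the functions above, and it uses NumPy's `np.polyfit` (a degree 1 polynomial fit) purely as an independent cross-check. The variable names `w1`, `w0`, `slope_check`, and `intercept_check` are chosen here for illustration.

```python
import numpy as np

# Example dataset from above
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 5.8, 7.9, 10.2])

# Sums of squared deviations
ss_xx = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))

# Slope and intercept estimates from the formulas above
w1 = ss_xy / ss_xx
w0 = y.mean() - w1 * x.mean()
print(w1, w0)

# Cross-check: np.polyfit with degree 1 returns [slope, intercept]
slope_check, intercept_check = np.polyfit(x, y, 1)
print(slope_check, intercept_check)
```

Working it by hand gives $w_1 = 20.2 / 10 = 2.02$ and $w_0 = 5.98 - 2.02(3) = -0.08$, so the best fit line is close to, but not exactly, the $\hat{y} = 2x$ line used earlier.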

**Reflection:**

*   What is the purpose of calculating the sum of squared errors?
*   In general how do we go about trying to find the ideal slope and intercept of a univariate linear regression line?

# Building Our Univariate Linear Regression Model

Now that we can calculate the slope and intercept formulas, let's create our linear regression model from scratch and make a new prediction.

In [0]:
# Define the univariate linear regression
class univariate_linear_regression_model:
  # Initialize slope and intercept (weight parameters)
  def __init__(self):
    self.b = 0.0
    self.i = 0.0
  
  # Sum of Squared Deviation for single variable x
  def ssxx(self, x):
    return sum((x-np.mean(x))**2)
  
  # Sum of Squared Deviation for two variables x and y
  def ssxy(self, x, y):
    xmean = np.mean(x)
    ymean = np.mean(y)
    return sum((x-xmean)*(y-ymean))
    
  # Train the regression model by estimating the slope and intercept from the data
  def train(self, x, y):
    
    # Verify the features and labels of the dataset are of the same size
    assert(len(x) == len(y)) 
    
    # Calculate the slope
    ss_xy = self.ssxy(x, y)
    ss_xx = self.ssxx(x)
    self.b = ss_xy/ss_xx
    
    # Calculate the intercept
    mux = np.mean(x)
    muy = np.mean(y)
    self.i = muy - self.b*mux
	
  # Return the predicted value(s) based on the feature and weight parameters
  def predict(self, x):
    # np.asarray accepts a single number as well as a list or array of inputs
    return self.b * np.asarray(x) + self.i

Alright, let's run the model on some toy data and see if our predictions match our data.

In [0]:
# Dataset to train model
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1.1, 1.9, 2.8, 4, 5.2, 5.8, 6.9, 8.1, 9, 9.9])

# Initialize our model
reg = univariate_linear_regression_model()

# Train our model with the data
reg.train(x, y)

# Make a prediction
print(reg.predict(np.array([3])))

**Reflection:**

*   What might happen if we tried to fit our univariate linear regression model on a dataset that was nonlinear (for example an exponential relationship)?
*   What applications do you think univariate linear regression might be useful for, what do you think are some of its limitations?

# Evaluating Our Univariate Linear Regression

Now that we ran a basic regression model how do we know how effective it is? Earlier we talked about how SSE measures the overall error generated by our model. However, this is not an optimal metric to measure performance since the SSE value is highly dependent on the size of the dataset without giving us a measurement of its effectiveness. It also, unfortunately, tells us nothing about how strong the correlation is between our independent and dependent variables.
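
To see how the SSE depends on the size of the dataset, here is a small sketch that evaluates the same predictions on the example data once, and then on the same data repeated twice. The function is redefined so the block runs on its own, and repeating the data with `np.tile` is just a device for this illustration.

```python
import numpy as np

# Sum of squared errors between true values and predictions
def sum_sq_error(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

# Example data and the predictions of the line y_hat = 2x
y_true = np.array([2.1, 3.9, 5.8, 7.9, 10.2])
y_pred = np.array([2, 4, 6, 8, 10])

# The same points repeated twice: same model, same per-point errors
y_true_twice = np.tile(y_true, 2)
y_pred_twice = np.tile(y_pred, 2)

sse_once = sum_sq_error(y_true, y_pred)
sse_twice = sum_sq_error(y_true_twice, y_pred_twice)
print(sse_once, sse_twice)
```

The model is no better or worse on the larger dataset, yet the SSE doubles. This is why we want a metric that is normalized by the variability in the data.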

[Coefficient of Correlation](https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/correlation-coefficient-r/v/calculating-correlation-coefficient-r):

One way to measure this is the coefficient of correlation, $r$. It measures the strength of the linear relationship between two variables $x$ and $y$, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation. The $r$ value is calculated by multiplying each point's standardized $x$ and $y$ values (commonly known in statistics as [Z-Scores](https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/measuring-position/v/z-score-introduction)) and summing the products, which Sal will discuss in more detail in the video provided.

**Coefficient of Correlation:**

$$r = \frac{1}{N-1}\sum_{i = 1}^{N}{Z_{x_i}}{Z_{y_i}} =\frac{1}{N-1}\sum_{i = 1}^{N}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{ss_{xy}}{\sqrt{ss_{xx}ss_{yy}}}$$

**Example:**

$$\bar{x} = 3,\ \bar{y} = 5.98$$

$$ss_{xx} = 10,\ ss_{yy} = 40.908,\ ss_{xy} = 20.2$$

$$r = \frac{20.2}{\sqrt{(10)(40.908)}} = 0.998728$$

Now let's once again convert this into code.

In [0]:
def coefficient_correlation(x, y):
  ss_xy = ssxy(x, y)
  ss_xx = ssxx(x)
  ss_yy = ssxx(y)
  r = (1.0 * ss_xy)/np.sqrt(ss_xx*ss_yy)
  return r

In [0]:
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1.1, 1.9, 2.8, 4, 5.2, 5.8, 6.9, 8.1, 9, 9.9])
print(coefficient_correlation(x,y))
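
The formula above gives two equivalent ways to compute $r$: from Z-scores and from the sums of squared deviations. The sketch below computes both on the five-point example dataset and compares them with NumPy's built-in `np.corrcoef`. The variable names are chosen for this illustration, and the Z-scores use the sample standard deviation (`ddof=1`) to match the $N-1$ in the formula.

```python
import numpy as np

# Five-point example dataset
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 5.8, 7.9, 10.2])
n = len(x)

# Version 1: average product of Z-scores (sample standard deviation, ddof=1)
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
r_z = np.sum(z_x * z_y) / (n - 1)

# Version 2: sums of squared deviations
ss_xx = np.sum((x - x.mean()) ** 2)
ss_yy = np.sum((y - y.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
r_ss = ss_xy / np.sqrt(ss_xx * ss_yy)

# Version 3: NumPy's correlation matrix; entry [0, 1] is the correlation of x with y
r_np = np.corrcoef(x, y)[0, 1]

print(r_z, r_ss, r_np)
```

All three values should agree up to floating-point rounding and match the hand-worked example of roughly 0.9987.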

[R-squared or coefficient of determination](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/r-squared-or-coefficient-of-determination):

Alright, we can now determine whether the correlation between the independent and dependent variables is positive or negative, but how much of the variation in the dependent variable does our model actually account for? To determine how reliable our regression model is, we can use another metric called the coefficient of determination, represented by the symbol $r^2$. The $r^2$ compares the explained sample variability to the total sample variability in our data to determine how much of the underlying data our model actually describes.

**Coefficient of Determination:**

$$r^2 = \frac{ss_{yy} - sse}{ss_{yy}} = 1 - \frac{sse}{ss_{yy}}$$

**Example**:

$$ss_{yy} = 40.908,\ sse = 0.11$$

$$r^2 = 1 - \frac{0.11}{40.908} = 0.997311$$

In other words, our model explains roughly 99.7% of the variance in our data. Now let's convert our metric into code we can use later.

In [0]:
def coefficient_determination(y_true, y_pred):
  ss_yy = ssxx(y_true)
  sse = sum_sq_error(y_true, y_pred)
  r2 = 1 - (1.0*sse/ss_yy)
  return r2

In [0]:
y_pred = np.array([2, 4, 6, 8, 10])
y_test = np.array([2.1, 3.9, 5.8, 7.9, 10.2])
print(coefficient_determination(y_test, y_pred))

We can use the $r^2$ metric to help us evaluate models during training. In fact, sklearn has its own internal $r^2$ method.

In [0]:
from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))
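
The name $r^2$ is not a coincidence: for the least-squares best fit line, the coefficient of determination equals the square of the coefficient of correlation. The sketch below checks this on the five-point example dataset, fitting the line with the slope and intercept formulas from earlier. The variable names are chosen for this illustration.

```python
import numpy as np

# Five-point example dataset
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 5.8, 7.9, 10.2])

# Sums of squared deviations
ss_xx = np.sum((x - x.mean()) ** 2)
ss_yy = np.sum((y - y.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))

# Least-squares slope and intercept, then the fitted predictions
w1 = ss_xy / ss_xx
w0 = y.mean() - w1 * x.mean()
y_hat = w1 * x + w0

# Coefficient of determination from the SSE of the fitted line
sse = np.sum((y - y_hat) ** 2)
r2_from_sse = 1 - sse / ss_yy

# Coefficient of correlation, squared
r = ss_xy / np.sqrt(ss_xx * ss_yy)
print(r2_from_sse, r ** 2)
```

This equality only holds for the best fit line. The hand-picked line $\hat{y} = 2x$ used in the example above is close to the best fit line but not identical, so its $r^2$ comes out slightly lower.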

**Reflection:**

*   What is the difference between a positive and negative correlation in a dataset? How would this be reflected in our r values?
*   If the r squared value of a model was close to zero what would this imply? Is this a model you would use? Why, why not?
*   What might be a reason why we might not get a perfect fit between the observed data and the best fit line?



# Summary

In this lesson, we covered how univariate linear regression is used to find the relationship between an independent variable x and a dependent variable y. We can use observed data to find the equation of a best fit line that minimizes the squared error between the observations and the line's predictions. To measure how well our regression line fits, we can use the R Squared metric, which measures how much of the variance in the data is explained by the model. Assuming we have a good fit, we can then use the model to help us make informed predictions on future unseen observations.

# Extra Resources

For those that have some familiarity with calculus and would like to learn more about how to find the optimal weight parameters for univariate linear regression, check out the following several videos where Sal Khan explains them in more detail:


*   [Proof (part 1) minimizing squared error to regression line](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/proof-part-1-minimizing-squared-error-to-regression-line)
*   [Proof (part 2) minimizing squared error to regression line](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/proof-part-2-minimizing-squared-error-to-line)
*   [Proof (part 3) minimizing squared error to regression line](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/proof-part-3-minimizing-squared-error-to-regression-line)
*   [Proof (part 4) minimizing squared error to regression line](https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/proof-part-4-minimizing-squared-error-to-regression-line)