# Simple Linear Regression

Simple linear regression is an approach for predicting a continuous response using a single feature. It takes the following form:
$y = \beta_0 + \beta_1x$
- y is the response
- x is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for x

$\beta_0$ and $\beta_1$ are called the model coefficients:
- We must learn the values of these coefficients to create our model.
- Once we've learned these coefficients, we can use the model to predict the response (Sales, in the advertising example used here).

![Estimating coefficients](img/easyway.png)

![Estimating coefficients](img/example.png)

![Estimating coefficients](img/example2.png)

## Estimating ("learning") model coefficients
- Coefficients are estimated during the model fitting process using the least squares criterion.
- We find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").

![Estimating coefficients](img/estimating_coefficients.png)

In this diagram:

- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.
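To make the least squares criterion concrete, here is a minimal sketch in plain Python. It uses the standard closed-form solution for a single feature (slope = sum of co-deviations divided by sum of squared deviations of x). The helper names `fit_line` and `sum_squared_residuals` and the five data points are made up for illustration; they are not part of the advertising example.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the observations and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (not the advertising data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Nudging either coefficient away from the fitted value increases the error
worse_slope = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
worse_intercept = sum_squared_residuals(xs, ys, b0 + 0.1, b1)

print("intercept:", b0, "slope:", b1)
print("SSR at fit:", best)
print("SSR with nudged slope:", worse_slope)
print("SSR with nudged intercept:", worse_intercept)
```

Any other line through these points gives a larger sum of squared residuals than the fitted one, which is what "least squares" means.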

![Slope-intercept](img/slope_intercept.png)

How do the model coefficients relate to the least squares line?

- $\beta_0$ is the **intercept** (the value of $y$ when $x$=0)
- $\beta_1$ is the **slope** (the change in $y$ divided by change in $x$)


Linear regression is highly **parametric**, meaning that it relies heavily on the underlying shape of the data. If the data fall along a line, then linear regression will do well. If the data do not fall in line (get it?), linear regression is likely to fail.
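A small sketch of this point, using made-up data and hypothetical helper names: a straight line fits perfectly linear data with zero error, but on U-shaped data the best line is flat and does no better than always predicting the mean.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def squared_error(ys, preds):
    """Sum of squared differences between observations and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Made-up data: x runs from -5 to 5
xs = [float(x) for x in range(-5, 6)]
datasets = {
    "straight": [1 + 2 * x for x in xs],  # exactly linear
    "curved": [x ** 2 for x in xs],       # U-shaped, not linear
}

results = {}
for name, ys in datasets.items():
    b0, b1 = fit_line(xs, ys)
    y_mean = sum(ys) / len(ys)
    line_error = squared_error(ys, [b0 + b1 * x for x in xs])
    mean_error = squared_error(ys, [y_mean] * len(ys))
    results[name] = (b1, line_error, mean_error)
    print(name, "slope:", b1, "line error:", line_error, "mean-only error:", mean_error)
```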


Let's estimate the model coefficients for the advertising data:

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In [2]:
# read data into a DataFrame
data = pd.read_csv('advertising.csv')

In [3]:
### Statsmodels ###

# create a fitted model
lm = smf.ols(formula='Sales ~ TV', data=data).fit()

# print the coefficients
lm.params

Intercept    6.974821
TV           0.055465
dtype: float64

In [4]:
from sklearn.linear_model import LinearRegression

In [5]:
### Scikit-learn ###

# create x and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales

# instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)

print(linreg.intercept_) # y value when x is 0
print(linreg.coef_) # slope (gradient): change in y (Sales) per unit change in x (TV advertising)

6.9748214882298925
[0.05546477]


## Interpreting model coefficients
How do we interpret the TV coefficient ($\beta_1$)?

- A "unit" increase in TV ad spending is **associated with** a 0.0554 "unit" increase in Sales.
- Meaning: An additional $1,000 spent on TV ads is **associated with** an increase in sales of 55.4 widgets.
- This is not a statement of **causation**.

If an increase in TV ad spending were associated with a **decrease** in sales, $\beta_1$ would be **negative**.


## Using the model for prediction

Let's say that there was a new market where the TV advertising spend was **$50,000**. What would we predict for the Sales in that market? Because TV spend is recorded in thousands of dollars, x is 50.

$$y = \beta_0 + \beta_1x$$
$$y = 6.9748 + 0.0554 \times 50$$

In [6]:
# manually calculate the prediction
6.9748 + 0.0554*50

9.7448

In [7]:
### Statsmodels ###

# you have to create a dataframe since the statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})

# predict for new observation
lm.predict(X_new)

0    9.74806
dtype: float64

In [8]:
### Scikit-learn ###

# predict for a new observation
linreg.predict(X_new)

array([9.74806001])
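The hand calculation (9.7448) and the two model predictions (9.74806) differ slightly. This appears to be because the hand calculation uses coefficients cut to four decimal places. A quick check, reusing the intercept and slope printed by scikit-learn earlier (the variable names here are only illustrative):

```python
# Coefficients as printed by scikit-learn earlier in this document
intercept = 6.9748214882298925
slope = 0.05546477

tv_spend = 50  # thousands of dollars

# Hand calculation with shortened coefficients, as in the cell above
rounded_prediction = 6.9748 + 0.0554 * tv_spend

# Same formula with the longer printed coefficients
fuller_prediction = intercept + slope * tv_spend

print(rounded_prediction)
print(fuller_prediction)
print(fuller_prediction - rounded_prediction)
```

The gap is small here, but it grows with x, so it is usually safer to let the fitted model object do the prediction.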

## Does the scale of the features matter?

In [9]:
data['TV_dollars'] = data.TV * 1000
data.head()

|   | TV | Radio | Newspaper | Sales | TV_dollars |
|---|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 | 230100.0 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 | 44500.0 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 | 17200.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 | 151500.0 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 | 180800.0 |


In [10]:
### Scikit-learn ###

# create x and y
feature_cols = ['TV_dollars']
X = data[feature_cols]
y = data.Sales

# Instantiate and fit
linreg = LinearRegression()
linreg.fit(X, y)

# Print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

6.9748214882298925
[5.54647705e-05]
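Comparing this output with the earlier fit, the intercept is unchanged and the coefficient is divided by 1,000, so the scale of the feature changes how the coefficient reads but not what the model predicts. The sketch below checks this on made-up numbers in plain Python; the data and the `fit_line` helper are hypothetical and stand in for the advertising data and `LinearRegression`.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Made-up data: spend in thousands of dollars, and the same spend in dollars
spend_thousands = [10.0, 50.0, 120.0, 200.0, 290.0]
spend_dollars = [x * 1000 for x in spend_thousands]
sales = [7.5, 9.9, 13.4, 18.2, 23.1]

b0_k, b1_k = fit_line(spend_thousands, sales)
b0_d, b1_d = fit_line(spend_dollars, sales)

# Predict for the same market, expressed in each unit
pred_k = b0_k + b1_k * 50
pred_d = b0_d + b1_d * 50_000

print("intercepts:", b0_k, b0_d)
print("slopes:", b1_k, b1_d)
print("predictions:", pred_k, pred_d)
```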


## Bias and Variance

Linear regression is a low variance/high bias model:
- Low variance: under repeated sampling from the underlying population, the line will stay roughly in the same place.
- High bias: the line will rarely fit the data well, since it can only capture a straight-line relationship.
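The low-variance claim can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "population" is invented (a true line of 7 + 0.05x plus Gaussian noise), and the sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def draw_sample(n):
    """Draw n points from an invented population: y = 7 + 0.05x + noise."""
    xs = [random.uniform(0, 300) for _ in range(n)]
    ys = [7 + 0.05 * x + random.gauss(0, 3) for x in xs]
    return xs, ys


random.seed(0)  # fixed seed so the run is repeatable

# Refit the line on 200 independent samples of 50 points each
slopes = []
for _ in range(200):
    xs, ys = draw_sample(50)
    _, slope = fit_line(xs, ys)
    slopes.append(slope)

mean_slope = statistics.mean(slopes)
spread = statistics.stdev(slopes)

print("mean of fitted slopes:", mean_slope)
print("standard deviation of fitted slopes:", spread)
```

The fitted slopes cluster tightly around the true value of 0.05, which is what "the line will stay roughly in the same place" looks like numerically.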

A closely related concept is confidence intervals.

## Confidence intervals

Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: if the population from which this sample was drawn were sampled 100 times, approximately 95 of the resulting confidence intervals would contain the "true" coefficient.

In [11]:
lm.conf_int()

|   | 0 | 1 |
|---|---|---|
| Intercept | 6.33874 | 7.610903 |
| TV | 0.051727 | 0.059203 |
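The repeated-sampling interpretation can be checked with a simulation. The sketch below makes several assumptions that are not in the original: the population is invented, and the interval uses the normal approximation (slope plus or minus 1.96 standard errors) rather than the t distribution that Statsmodels uses, which is a close match for samples of 100 points.

```python
import math
import random


def slope_interval(xs, ys, z=1.96):
    """Approximate 95% confidence interval for the least squares slope."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance uses n - 2 because two coefficients were estimated
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return slope - z * std_err, slope + z * std_err


random.seed(1)  # fixed seed so the run is repeatable

TRUE_SLOPE = 0.05  # the "true" coefficient of the invented population
trials = 1000
covered = 0
for _ in range(trials):
    xs = [random.uniform(0, 300) for _ in range(100)]
    ys = [7 + TRUE_SLOPE * x + random.gauss(0, 3) for x in xs]
    low, high = slope_interval(xs, ys)
    if low <= TRUE_SLOPE <= high:
        covered += 1

coverage = covered / trials
print("share of intervals containing the true slope:", coverage)
```

The share should land close to 0.95, matching the "approximately 95 out of 100" reading above.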


### Hypothesis testing and p-values

General process for hypothesis testing:
- You start with a null hypothesis and an alternative hypothesis (that is opposite the null).
- You check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.

For model coefficients, here is the conventional hypothesis test:
- Null hypothesis: there is no relationship between TV ads and Sales (and thus $\beta_1$ equals zero).
- Alternative hypothesis: there is a relationship between TV ads and Sales (and thus $\beta_1$ is not equal to zero).

How do we test this hypothesis?
- The p-value is the probability of observing a relationship at least as strong as the one in our data if the null hypothesis were true (that is, purely by chance).
- If the 95% confidence interval for a coefficient does not include zero, the p-value will be less than 0.05, and we will reject the null (and thus believe the alternative).
- If the 95% confidence interval includes zero, the p-value will be greater than 0.05, and we will fail to reject the null.

In [12]:
# printing the p-values for the model coefficients
lm.pvalues

Intercept    5.027719e-54
TV           7.927912e-74
dtype: float64

Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response.
In this case, the p-value for TV is far less than 0.05, and so we believe that there is a relationship between TV ads and Sales.

Note that we generally ignore the p-value for the intercept.
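One way to build intuition for what a p-value measures is a permutation test. This is a different technique from the t-test that Statsmodels uses for `lm.pvalues`, and the data below are invented. Shuffling the response breaks any real link to the feature, so the shuffled fits show what slopes arise purely by chance.

```python
import random


def fit_slope(xs, ys):
    """Return the least squares slope."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


random.seed(2)  # fixed seed so the run is repeatable

# Invented data with a real relationship between x and y
xs = [random.uniform(0, 300) for _ in range(50)]
ys = [7 + 0.05 * x + random.gauss(0, 3) for x in xs]
observed = abs(fit_slope(xs, ys))

# Under the null hypothesis the pairing of x and y is arbitrary,
# so refit on shuffled responses and count slopes at least as strong
n_permutations = 1000
at_least_as_strong = 0
shuffled = ys[:]
for _ in range(n_permutations):
    random.shuffle(shuffled)
    if abs(fit_slope(xs, shuffled)) >= observed:
        at_least_as_strong += 1

# Adding 1 to both counts keeps the estimate from being exactly zero
p_value = (at_least_as_strong + 1) / (n_permutations + 1)
print("permutation p-value:", p_value)
```

A small value means slopes this strong almost never appear when there is no relationship, which is the reasoning behind rejecting the null.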

## How well does the model fit the data?

R-squared:
 - A common way to evaluate the overall fit of a linear model.
 - Defined as the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model.
 - Also defined as the reduction in error over the null model, which is the model that simply predicts the mean of the observed response.
 - Between 0 and 1, and higher is better.
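The "reduction in error over the null model" definition can be written out directly as 1 minus the ratio of the model's squared error to the null model's squared error. A minimal sketch on made-up data, with hypothetical helper names:

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def r_squared(ys, preds):
    """1 - (squared error of the predictions / squared error of the null model)."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)  # error of always predicting the mean
    return 1 - ss_res / ss_tot


# Made-up illustrative data (not the advertising data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
line_r2 = r_squared(ys, [b0 + b1 * x for x in xs])

# The null model predicts the mean for every observation
y_mean = sum(ys) / len(ys)
null_r2 = r_squared(ys, [y_mean] * len(ys))

print("R-squared of the fitted line:", line_r2)
print("R-squared of the null model:", null_r2)
```

The null model scores exactly 0 by construction, and a line that explains most of the variation scores close to 1.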

In [14]:
### Statsmodels ###
lm.rsquared

0.8121757029987414

In [16]:
from sklearn import metrics

In [17]:
### Scikit-learn ###
y_pred = linreg.predict(X)
metrics.r2_score(y, y_pred)

0.8121757029987414

 - The threshold for a good R-squared value is highly dependent on the particular domain.
 - R-squared is more useful as a tool for comparing models.