In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Linear Regression

**Regression** is a statistical method for modeling the relationship between one dependent variable $y$ **(the response)** and a set of independent variables $x_i$ **(the features)**.

### Classification vs Regression problems:

- **Classification problem**: Predict a categorical (discrete) response. 
- **Regression problem**: Predict a continuous response.

### Form of Linear Regression

A linear model assumes a linear relationship between the response $y$ and the features $x_i$:

$y = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n$,

where:

$y$ is the response 

$\theta_0$ is the **bias** (aka the intercept)

$\theta_1$ is the **coefficient** for $x_1$ (the first feature)

$\theta_2$ is the **coefficient** for $x_2$ (the second feature)

$\vdots$

$\theta_n$ is the **coefficient** for $x_n$ (the nth feature)

The model coefficients $\theta_i$ are "learned" during the model fitting step using the "least squares" criterion. 
Then, the fitted model can be used to make predictions!
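
Once the coefficients are known, a prediction is just the bias plus a weighted sum of the features. The sketch below illustrates this with NumPy; the coefficient and feature values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients for a model with two features
theta_0 = 4.0                    # bias (intercept)
theta = np.array([0.5, -2.0])    # coefficients for x_1 and x_2

# Hypothetical feature matrix: one row per observation, one column per feature
X = np.array([[2.0, 1.0],
              [0.0, 0.0],
              [10.0, 3.0]])

# y = theta_0 + theta_1 * x_1 + theta_2 * x_2, evaluated for every row at once
y_hat = theta_0 + X @ theta
print(y_hat)  # [3. 4. 3.]
```

The second row has all features equal to zero, so its prediction is exactly the bias term.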

### Training a linear model

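To train a linear model, one finds the model coefficients $\theta_i$ that minimize the **Root Mean Square Error (RMSE)** on the training data. Because the square root is an increasing function, this is equivalent to minimizing the sum of squared errors, which is the "least squares" criterion.

$$
\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\hat{y}_i- y_i\right)^2}
$$

where the $y_i$ values are the actual values of the response variable, and the $\hat{y}_i$ values are the predicted values. Here $n$ denotes the number of training observations (not the number of features, as in the model equation above).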
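
As a rough illustration of what "minimizing the RMSE" means, the sketch below fits a one-feature model with NumPy's least squares solver and checks that nudging the fitted coefficients never lowers the training RMSE. The data are synthetic, and the helper `rmse` and the perturbations are hypothetical names and values introduced only for this example.

```python
import numpy as np

def rmse(theta, X_design, y):
    """Root mean square error of the linear model with coefficients theta."""
    residuals = X_design @ theta - y
    return np.sqrt(np.mean(residuals ** 2))

# Synthetic data (an assumption for illustration): y is roughly 2 + 1.5 * x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=40)

# A column of ones lets the first coefficient play the role of the bias
X_design = np.column_stack([np.ones_like(x), x])

# Least squares solution: the coefficients with the smallest training RMSE
theta_ls, *_ = np.linalg.lstsq(X_design, y, rcond=None)
best = rmse(theta_ls, X_design, y)

# Moving away from the least squares solution cannot reduce the RMSE
perturbed = [rmse(theta_ls + np.array(d), X_design, y)
             for d in ([0.5, 0.0], [0.0, -0.1], [-0.3, 0.05])]
print(best, perturbed)
```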

### Example: Sales Prediction

In [None]:
# load the data
url = 'https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv'
sales = pd.read_csv(url, index_col=0)
sales.head()

What are the **features**?

- **TV**: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- **Radio**: advertising dollars spent on Radio (in thousands of dollars)
- **Newspaper**: advertising dollars spent on Newspaper (in thousands of dollars)
    
What is the **response**?

- **Sales**: sales of a single product in a given market (in thousands of items)

### Visualize the relationship between the features and the response

A scatterplot can help determine if two variables are related in some systematic way.

In [None]:
plt.figure(figsize=(12,5))
plt.plot(sales['TV'],sales['Sales'],'o')
plt.xlabel('TV', fontsize=20)
plt.ylabel('Sales', fontsize=20)

In [None]:
plt.figure(figsize=(12,5))
plt.plot(sales['Radio'],sales['Sales'],'o')
plt.xlabel('Radio', fontsize=20)
plt.ylabel('Sales', fontsize=20)

In [None]:
plt.figure(figsize=(12,5))
plt.plot(sales['Newspaper'],sales['Sales'],'o')
plt.xlabel('Newspaper', fontsize=20)
plt.ylabel('Sales', fontsize=20)

**Goal:** Train a linear model that predicts sales from the amount spent on advertising in each medium.

$y = \theta_0 + \theta_1 \times \mathrm{TV} + \theta_2 \times \mathrm{Radio} + \theta_3 \times \mathrm{Newspaper}$

### Linear Regression in scikit-learn

In [None]:
# feature matrix X / target vector y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = sales[feature_cols]
y = sales.Sales

In [None]:
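# train / test split (by default, 25% of the rows are held out for testing;
# the split is random, so pass random_state=<int> for reproducible results)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)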

In [None]:
from sklearn.linear_model import LinearRegression

# initialize
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

### Interpreting linear regression coefficients

In [None]:
# coefficients
print(linreg.coef_)

In [None]:
# pair the feature names with the coefficients
coeffs = pd.DataFrame(linreg.coef_, feature_cols, columns=['coefficient'])
coeffs

In [None]:
coeffs.plot(kind='bar')

In [None]:
# bias term
linreg.intercept_

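How do we interpret the TV coefficient (about 0.046; the exact value depends on the random train/test split)?

- For a given amount of Radio and Newspaper ad spending, a "unit" (1000 dollars) increase in TV ad spending is associated with an increase of about 0.046 "units" in Sales, i.e. roughly 46 additional items sold.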
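
This reading of a coefficient can be checked numerically: raising one feature by one unit while holding the others fixed changes the prediction by exactly that feature's coefficient. The sketch below uses synthetic data as a stand-in for the advertising table; the names `X_toy`, `y_toy`, `base`, and `bumped` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the advertising data (an assumption): 3 features, 50 rows
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 3))
y_toy = 3.0 + 0.05 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X_toy, y_toy)

# Two inputs that differ by one unit in the first feature only
base = np.array([[100.0, 20.0, 30.0]])
bumped = base + np.array([[1.0, 0.0, 0.0]])

# The change in the prediction matches the first coefficient
change = model.predict(bumped)[0] - model.predict(base)[0]
print(change, model.coef_[0])
```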

### Making predictions

In [None]:
y_test_pred = linreg.predict(X_test)
y_test_pred

### Model evaluation 

**Root Mean Squared Error (RMSE)** is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

where

- $n$ is the number of observations being evaluated (here, the size of the test set)
- $y_i$ is the observed response and $\hat{y}_i$ is the prediction for $y_i$

In [None]:
from sklearn.metrics import mean_squared_error
# mean_squared_error returns the MSE; take the square root to get the RMSE
mean_squared_error(y_test, y_test_pred) ** 0.5

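The RMSE is expressed in the same units as the response. Since Sales is measured in thousands of items (not dollars), an RMSE of 2, for example, would mean that the predictions are typically off by about 2000 items.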
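
It is easy to confuse the two metrics: `mean_squared_error` returns the mean of the squared errors, which is in squared units, and the RMSE is its square root. A small sketch with hypothetical values makes the difference concrete.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted sales (in thousands of items)
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 12.0, 13.0, 22.0])

# Squared errors are 1, 0, 4, 4, so their mean is 2.25 (squared units)
mse = mean_squared_error(y_true, y_pred)

# The square root brings the metric back to the units of y
rmse = np.sqrt(mse)
print(mse, rmse)  # 2.25 1.5
```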

### Visualize the prediction (only for small datasets)

In [None]:
# plot predicted values against observed values
plt.plot(y_test, y_test_pred, 'o')
plt.xlabel('actual')
plt.ylabel('predicted')

In [None]:
# plot the first 30 predictions
plt.figure(figsize=(12,7))
plt.plot(y_test[:30].to_numpy(),'b-.o', label='observed sales')
plt.plot(y_test_pred[:30],'r-.o', label='predicted sales')
plt.ylabel('sales',fontsize=20)
plt.legend(fontsize=20)

### Adding polynomial features

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

In [None]:
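# use a fresh LinearRegression so that fitting the pipeline
# does not overwrite the linreg model fitted above
pipe = Pipeline(steps=[
    ('poly_features', PolynomialFeatures(degree=3, include_bias=False)),
    ('reg', LinearRegression())
])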

In [None]:
pipe.fit(X_train,y_train)

In [None]:
# original features
feature_cols

In [None]:
# polynomial features
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
poly_features_names = pipe['poly_features'].get_feature_names_out(feature_cols)
poly_features_names

In [None]:
# pair the feature names with the coefficients
coeffs = pd.DataFrame(pipe['reg'].coef_,poly_features_names, columns=['coefficient'])
coeffs 

In [None]:
coeffs.plot(kind='bar')

In [None]:
y_test_pred = pipe.predict(X_test)

In [None]:
plt.plot(y_test, y_test_pred, 'o')
plt.xlabel('actual')
plt.ylabel('predicted')

In [None]:
# RMSE of the polynomial model on the test set (square root of the MSE)
mean_squared_error(y_test, y_test_pred) ** 0.5

In [None]:
# plot the first 30 predictions
plt.figure(figsize=(12,7))
plt.plot(y_test[:30].to_numpy(),'b-.o', label='observed sales')
plt.plot(y_test_pred[:30],'r-.o', label='predicted sales')
plt.ylabel('sales',fontsize=20)
plt.legend(fontsize=20)