# Linear Regression

## Advertising Dataset

In [None]:
import pandas as pd

url="http://www.souravsengupta.com/ml2017/resources/Advertising.csv"
advertising = pd.read_csv(url)

The data is, for the most part, read in correctly. But let's delete the `Unnamed: 0` column, which looks like a leftover row index and seems unnecessary.

In [None]:
advertising.head(5)

In [None]:
advertising.columns

In [None]:
del advertising['Unnamed: 0']

In [None]:
advertising.head(5)

Generating a scatter plot of two variables in the dataframe (TV on the x-axis, Sales on the y-axis):

In [None]:
%matplotlib inline
advertising.plot.scatter(x='TV', y='Sales')

### scikit-learn vs. statsmodels

We'll use the `scikit-learn` module once again.

`scikit-learn` is geared toward building _predictive_ models, whereas statisticians are often more interested in interpreting a fitted model. In this regard, the `scikit-learn` linear models are a little limited: they don't report inferential statistics such as p-values.

Also see:
* `statsmodels`, which also computes things like p-values, confidence intervals, etc.
* http://statsmodels.sourceforge.net/

### Fitting a simple linear model using scikit-learn...

http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py

In [None]:
from sklearn import linear_model

# Create linear regression object
reg = linear_model.LinearRegression()

# Reshape your data from a Series, into a 1d array, and then into a 2d array 

# Train the model: only TV and Sales
reg.fit(advertising['TV'].values.reshape(-1,1), advertising['Sales'].values.reshape(-1,1))
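
With a single predictor, the least squares slope and intercept have a closed form, which is a handy way to see what the fit is doing. A minimal sketch on synthetic data (the arrays `x` and `y` are stand-ins, not the real TV and Sales columns), cross-checked against `np.polyfit`:

```python
import numpy as np

# Synthetic stand-ins for TV and Sales (illustration only)
rng = np.random.default_rng(42)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Closed-form least squares estimates for one predictor
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against numpy's own degree-1 polynomial fit
check_slope, check_intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
print(check_slope, check_intercept)
```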

Outputting the model's **coefficients**, **mean squared error**, and **r^2**:

In [None]:
import numpy as np

# The coefficients
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((reg.predict(advertising['TV'].values.reshape(-1,1)) - advertising['Sales'].values.reshape(-1,1)) ** 2))
# R^2 (coefficient of determination): 1 is perfect prediction
print('r^2 Variance score: %.2f' % reg.score(advertising['TV'].values.reshape(-1,1), advertising['Sales'].values.reshape(-1,1)))

### Drawing the model's line of best fit on top of the previous scatter plot

In [None]:
import matplotlib.pyplot as plt

# Plot outputs
plt.scatter(advertising['TV'], advertising['Sales'],  color='black')
plt.plot(advertising['TV'], reg.predict(advertising['TV'].values.reshape(-1,1)), color='blue', linewidth=3)

### Calculating RSS (Residual Sum of Squares) and RSE (Residual Standard Error)

We have our model's predictions. (Note that it's a `numpy` array.)

In [None]:
predicts = reg.predict(advertising['TV'].values.reshape(-1,1))
print(type(predicts))
predicts

... and we have the true Sales values, which are a `pandas` Series ...

In [None]:
print(type(advertising['Sales']))
advertising['Sales']

... predicts is an array of one-item arrays ...

In [None]:
predicts

... reshaping it into a single row (shape `(1, n)`, which is still a 2d array) ...

In [None]:
predicts.reshape(1,-1)

... using `tolist()` to get a nested Python list, then `[0]` to take its only row as a flat list ...

In [None]:
predicts.reshape(1,-1).tolist()[0]
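
The shapes involved are easy to mix up, so here is a small sketch with a made-up four-element column standing in for `predicts`. It also shows `ravel()`, which flattens to a true 1d array in one step:

```python
import numpy as np

# A stand-in for `predicts`: a column of 4 predictions, shape (4, 1)
column = np.array([[1.0], [2.0], [3.0], [4.0]])

row = column.reshape(1, -1)   # still 2d, shape (1, 4)
flat = column.ravel()         # genuinely 1d, shape (4,)

print(column.shape, row.shape, flat.shape)
print(row.tolist())    # nested list: [[1.0, 2.0, 3.0, 4.0]]
print(flat.tolist())   # flat list:   [1.0, 2.0, 3.0, 4.0]
```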

Comparing each prediction against the true value.

In [None]:
# rss: residual sum of squares

rss = sum((y_hat - y) ** 2 for y_hat, y in zip(predicts.reshape(1,-1).tolist()[0], advertising['Sales'].tolist()))
rss

In [None]:
# rse: residual standard error
import math
rse = math.sqrt(rss / (len(predicts) - 2))
rse
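
The same two quantities can be computed without Python lists. A sketch with made-up numbers (`y_hat` and `y_true` are hypothetical, not model output); note that the column of predictions has to be flattened first, otherwise numpy broadcasts the subtraction into an n-by-n matrix:

```python
import numpy as np

# Hypothetical predictions (column vector) and true values (1d), for illustration
y_hat = np.array([[2.5], [0.0], [2.0], [8.0]])
y_true = np.array([3.0, -0.5, 2.0, 7.0])

# Without flattening, (4, 1) - (4,) broadcasts to a (4, 4) matrix
print((y_hat - y_true).shape)

# Flatten the predictions so the subtraction is element-wise
residuals = y_hat.ravel() - y_true
rss = np.sum(residuals ** 2)

# RSE for simple regression: n - 2 degrees of freedom (slope and intercept)
n = len(y_true)
rse = np.sqrt(rss / (n - 2))
print(rss, rse)
```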

In [None]:
# percentage error: RSE as a fraction of mean Sales (multiply by 100 for a percent)
pe = rse / advertising['Sales'].mean()
pe

In [None]:
# r^2 statistic
reg.score(advertising['TV'].values.reshape(-1,1), advertising['Sales'].values.reshape(-1,1))
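
The r^2 statistic is one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch with made-up values (`y_true` and `y_hat` are hypothetical) that computes it by hand:

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

rss = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r_squared = 1 - rss / tss
print(r_squared)
```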

## Multiple Linear Regression

In [None]:
reg.fit(advertising.loc[:,['TV','Radio','Newspaper']], advertising['Sales'].values.reshape(-1,1))

# The predictor columns, in the same order as the coefficients
print('Columns: \n', ['TV', 'Radio', 'Newspaper'])
# The coefficients
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((reg.predict(advertising.loc[:,['TV','Radio','Newspaper']]) - advertising['Sales'].values.reshape(-1,1)) ** 2))
# R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % reg.score(advertising.loc[:,['TV','Radio','Newspaper']], advertising['Sales'].values.reshape(-1,1)))
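
With several predictors, the fit is still ordinary least squares, now on a design matrix with one column per predictor plus a column of ones for the intercept. A minimal sketch using `np.linalg.lstsq` on synthetic data (the three columns of `X` and the values in `true_coefs` are invented for illustration):

```python
import numpy as np

# Synthetic data with three predictors (stand-ins for TV, Radio, Newspaper)
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 3))
true_coefs = np.array([0.05, 0.2, 0.0])
y = 3.0 + X @ true_coefs + rng.normal(0, 1, size=200)

# Prepend a column of ones so the first fitted value is the intercept
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, coefs = beta[0], beta[1:]
print(intercept, coefs)
```

The recovered coefficients should land close to `true_coefs`, with the third one near zero because that predictor has no effect in the synthetic data.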

In [None]:
reg.fit(advertising['TV'].values.reshape(-1,1), advertising['Sales'].values.reshape(-1,1))
print('TV Model: \n', reg.coef_, reg.intercept_)
reg.fit(advertising['Radio'].values.reshape(-1,1), advertising['Sales'].values.reshape(-1,1))
print('Radio Model: \n', reg.coef_, reg.intercept_)
reg.fit(advertising['Newspaper'].values.reshape(-1,1), advertising['Sales'].values.reshape(-1,1))
print('Newspaper Model: \n', reg.coef_, reg.intercept_)

Correlation Matrix Plot:

In [None]:
plt.matshow(advertising.corr())

In [None]:
# correlation matrix
advertising.corr()
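
By default, `corr()` reports the Pearson correlation for each pair of columns. For a single pair, it can be sketched by hand as follows (the arrays `a` and `b` are made up for illustration):

```python
import numpy as np

# Two hypothetical columns, for illustration only
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson correlation: co-variation scaled by the spread of each variable
a_dev, b_dev = a - a.mean(), b - b.mean()
r = np.sum(a_dev * b_dev) / np.sqrt(np.sum(a_dev ** 2) * np.sum(b_dev ** 2))

print(r)
print(np.corrcoef(a, b)[0, 1])  # numpy's built-in, for comparison
```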

## Credit Dataset

In [None]:
url = 'http://www-bcf.usc.edu/~gareth/ISL/Credit.csv'

credit = pd.read_csv(url)
credit.head(5)

In [None]:
del credit['Unnamed: 0']

Scatter Plot Matrix:

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(credit, alpha=0.2, figsize=(6, 6))

`scikit-learn`'s `LinearRegression` won't work out of the box on categorical predictors, such as the Gender variable.

In [None]:
# ValueError exception is raised
reg.fit(credit['Gender'].values.reshape(-1,1), credit['Balance'].values.reshape(-1,1))
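
The failure comes from the string labels: a least squares fit needs numbers, and a label like 'Male' cannot be converted to a float. A small numpy-only sketch of the same kind of conversion error (the labels are made up, and this is not the exact code path inside scikit-learn):

```python
import numpy as np

# Hypothetical string labels shaped like a single-column predictor
labels = np.array(['Male', 'Female', 'Female']).reshape(-1, 1)

try:
    labels.astype(float)   # the kind of conversion a numeric model needs
    converted = True
except ValueError as err:
    converted = False
    print('ValueError:', err)
```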

### Dummy Variables (Categorical -> Numeric)

Creating a dummy variable and inserting it into the dataframe:
* 1 = Male
* 0 = Female

In [None]:
# strip() guards against stray whitespace around the labels in the raw CSV
dummy = [1 if x.strip() == 'Male' else 0 for x in credit['Gender']]
dummy

In [None]:
credit.insert(loc=0, column='GenderDummy', value=dummy)

In [None]:
credit.head(5)

In [None]:
reg.fit(credit['GenderDummy'].values.reshape(-1,1), credit['Balance'].values.reshape(-1,1))
# The coefficients
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)
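
With a single 0/1 dummy as the only predictor, the fitted numbers have a direct reading: the intercept is the average response for the group coded 0, and the coefficient is the difference between the two group averages. A sketch with made-up values (`dummy` and `balance` below are hypothetical, not the Credit data):

```python
import numpy as np

# Hypothetical 0/1 dummy and response values, for illustration only
dummy = np.array([0, 0, 0, 1, 1, 1])
balance = np.array([500.0, 520.0, 540.0, 480.0, 510.0, 540.0])

# Least squares line through the two groups
slope, intercept = np.polyfit(dummy, balance, deg=1)

mean_0 = balance[dummy == 0].mean()   # average for the group coded 0
mean_1 = balance[dummy == 1].mean()   # average for the group coded 1
print(intercept, mean_0)
print(slope, mean_1 - mean_0)
```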

`pandas` can automatically create dummy variables.

Observe how each level gets its own new variable.

In [None]:
credit_with_dummies = pd.get_dummies(credit, columns = ['Gender', 'Ethnicity'] )
credit_with_dummies
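
When the model has an intercept, one level of each categorical variable is redundant, because the dummies for a variable always sum to 1. A common approach is to drop one level and treat it as the baseline. A sketch using the `drop_first` option of `pd.get_dummies` on a tiny made-up frame (`df` and its values are hypothetical):

```python
import pandas as pd

# A tiny hypothetical frame, for illustration only
df = pd.DataFrame({
    'Ethnicity': ['Asian', 'Caucasian', 'African American', 'Asian'],
    'Balance': [300, 900, 580, 964],
})

# drop_first=True omits the first level of each encoded column
encoded = pd.get_dummies(df, columns=['Ethnicity'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Here the dropped level is 'African American', so the remaining coefficients would be read as differences from that group.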

In [None]:
#reg.fit(credit_with_dummies.iloc[:,12:15], credit_with_dummies['Balance'].values.reshape(-1,1))
#reg.fit(credit_with_dummies.loc[:,['Ethnicity_African American','Ethnicity_Asian','Ethnicity_Caucasian']], credit_with_dummies['Balance'].values.reshape(-1,1))
reg.fit(credit_with_dummies.loc[:,['Ethnicity_African American','Ethnicity_Asian']], credit_with_dummies['Balance'].values.reshape(-1,1))
# The coefficients
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)

In [None]:
reg.fit(credit_with_dummies.iloc[:,1:5], credit_with_dummies['Balance'].values.reshape(-1,1))
# The coefficients
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)

In [None]:
credit_with_dummies.iloc[:,12:15]