# Introduction to Linear Regression

In this notebook, we will use regression to fit a model to data.   

Linear regression is a very simple approach to supervised learning. In particular, it is a useful tool for predicting a quantitative response. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches, linear regression is still a useful and widely used statistical learning method.  
Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later tutorials, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.


In [2]:
%matplotlib inline

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress all warnings to keep the notebook output clean
import warnings
warnings.filterwarnings("ignore")

sns.set(style="white")

In [3]:
# Load Data Set
df = sns.load_dataset('tips')

# Display several random rows:
df.sample(5)

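| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 235 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
| 197 | 43.11 | 5.0 | Female | Yes | Thur | Lunch | 4 |
| 207 | 38.73 | 3.0 | Male | Yes | Sat | Dinner | 4 |
| 105 | 15.36 | 1.64 | Male | Yes | Sat | Dinner | 2 |
| 19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 |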


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [6]:
df.describe().T

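| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| total_bill | 244.0 | 19.785943 | 8.902412 | 3.07 | 13.3475 | 17.795 | 24.1275 | 50.81 |
| tip | 244.0 | 2.998279 | 1.383638 | 1.0 | 2.0 | 2.9 | 3.5625 | 10.0 |
| size | 244.0 | 2.569672 | 0.9511 | 1.0 | 2.0 | 2.0 | 3.0 | 6.0 |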


In [8]:
df.groupby('sex').describe().T

| | sex | Male | Female |
|---|---|---|---|
| total_bill | count | 157.0 | 87.0 |
| total_bill | mean | 20.744076 | 18.056897 |
| total_bill | std | 9.246469 | 8.009209 |
| total_bill | min | 7.25 | 3.07 |
| total_bill | 25% | 14.0 | 12.75 |
| total_bill | 50% | 18.35 | 16.4 |
| total_bill | 75% | 24.71 | 21.52 |
| total_bill | max | 50.81 | 44.3 |
| tip | count | 157.0 | 87.0 |
| tip | mean | 3.089618 | 2.833448 |


In [9]:
(df['tip']/df['total_bill']).describe()

count    244.000000
mean       0.160803
std        0.061072
min        0.035638
25%        0.129127
50%        0.154770
75%        0.191475
max        0.710345
dtype: float64

## Simple Linear Regression

Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$. Using our tips data, suppose we wish to model the linear relationship between the tips and total bill. We can write this as follows:

$y_i = f(x_i) + \epsilon_i = \beta x_i + \alpha + \epsilon_i$

where $f(x_i) = \beta x_i + \alpha$ is the model prediction and $\epsilon_i$ accounts for the difference between the model and the data for each data point $(x_i, y_i)$. 
In the code below, `linregress` (short for linear regression) finds the best-fit linear relationship by minimizing the _least-squares_, or $L_2$-norm, criterion.

In [13]:
# Perform simple linear regression
from scipy import stats as sts

# Extract one-dimensional numpy arrays for our features
x = df.total_bill.to_numpy() # independent variable
y = df.tip.to_numpy() # dependent variable

# Compute linear model
slope, intercept, r_value, p_value, slope_std_error = sts.linregress(x, y)

# Display Results
print(f'Best fit line: y ={slope:4.2f} x + {intercept:5.3f}')
print(f'Pearsonr correlation = {r_value:5.3f}\n')

Best fit line: y =0.11 x + 0.920
Pearsonr correlation = 0.676
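
As a rough illustration of how the fitted line is used, the sketch below plugs a few bill amounts into the rounded coefficients printed above. The bill amounts are hypothetical values chosen for illustration, not rows from the data set, and because the slope is rounded to two decimals the predictions are only approximate.

```python
import numpy as np

# Rounded coefficients copied from the printed output above
slope, intercept, r_value = 0.11, 0.920, 0.676

# Hypothetical bill amounts (illustrative only, not taken from the data)
bills = np.array([10.0, 20.0, 40.0])

# Evaluate the fitted line: tip = slope * bill + intercept
predicted_tips = slope * bills + intercept

for bill, tip in zip(bills, predicted_tips):
    print(f"bill = {bill:5.2f} -> predicted tip = {tip:4.2f}")

# In simple linear regression, the squared Pearson correlation is the
# fraction of the variance in y explained by the linear model
r_squared = r_value ** 2
print(f"r^2 = {r_squared:.3f}")
```

Squaring the correlation gives a value a little under one half, so the total bill alone explains less than half of the variation in tips.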



If we have a perfect model, these errors, $\epsilon_i$, are all zero, and $y_i = f(x_i)$. In real life, however, the error terms rarely vanish: even if the underlying relationship is perfectly linear, noise creeps into the measurement process. 

As a result, in this example we wish to determine the model parameters $\beta$ and $\alpha$ that minimize the errors $\epsilon_i$. We could perform this process iteratively, trying different values for the model parameters and measuring the resulting error. This approach is often used in machine learning, where we define a **cost function** that we seek to minimize by selecting the best model parameters. 
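
The trial-and-error idea can be sketched as a brute-force grid search. This is a minimal illustration, not the method used by `linregress`: the data are synthetic (a known line plus noise), and the names `sse_cost`, `x_demo`, and `y_demo` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def sse_cost(beta, alpha, x, y):
    """Sum of squared residuals for the line y = beta * x + alpha."""
    residuals = y - (beta * x + alpha)
    return np.sum(residuals ** 2)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(scale=0.5, size=x_demo.size)

# Candidate parameter values on a coarse grid (step 0.05 in each direction)
betas = np.linspace(0, 4, 81)
alphas = np.linspace(-2, 4, 121)

# Evaluate the cost function for every (beta, alpha) pair
costs = np.array([[sse_cost(b, a, x_demo, y_demo) for a in alphas] for b in betas])

# Pick the pair with the smallest cost
i, j = np.unravel_index(np.argmin(costs), costs.shape)
best_beta, best_alpha = betas[i], alphas[j]

print(f"grid-search estimate: beta = {best_beta:.2f}, alpha = {best_alpha:.2f}")
```

A grid search like this scales poorly with the number of parameters, which is one reason practical methods rely on closed-form solutions or gradient-based optimizers instead.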

In the case of a simple linear model, we have several potential cost (or loss) functions that we could seek to minimize, but we will use the common _l2-norm_: $\epsilon_i^2 = \left( y_i - f(x_i) \right)^2$, where $f(x_i)$ is defined by our model parameters. We demonstrate this approach visually in the following code block, where we minimize the sum of the squared model residuals, $\sum_i \epsilon_i^2$, by finding the best model parameters, $\hat{\beta}$ and $\hat{\alpha}$.
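
For reference, minimizing the sum of squared residuals has a well-known closed-form solution for the simple linear model. Writing $\bar{x}$ and $\bar{y}$ for the sample means (notation introduced here for convenience), setting the partial derivatives of $\sum_i \left( y_i - \beta x_i - \alpha \right)^2$ with respect to $\alpha$ and $\beta$ to zero gives:

$$
\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
$$

The second expression says that the best-fit line always passes through the point of means $(\bar{x}, \bar{y})$.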

---

In [14]:
# Load Anscombe's Quartet
adf = sns.load_dataset('anscombe')

# Select only the first data set
adfi = adf[adf.dataset == 'I']

In [15]:
# Extract the x & y columns as NumPy arrays;
# we call them xx and yy to avoid conflict with the tips data
xx = adfi['x'].to_numpy()
yy = adfi['y'].to_numpy()

# Number of data points
n = xx.shape[0]

# Determine mean values:
mux = np.mean(xx)
muy = np.mean(yy)

# Determine best fit model parameters (from simple linear regression)
# Slope: sum of cross-deviations divided by sum of squared x-deviations
beta = np.sum((xx - mux) * (yy - muy)) / np.sum((xx - mux)**2)
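
The closed-form slope can be paired with the matching intercept and cross-checked against an independent fit. The sketch below is one way to do that, assuming the standard published values for Anscombe data set I, which are hard-coded so the example runs without downloading anything. The `_demo` names are hypothetical and chosen to avoid clobbering the notebook's own variables.

```python
import numpy as np

# Anscombe data set I, hard-coded (assumed identical to the seaborn copy)
xx_demo = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
yy_demo = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                    7.24, 4.26, 10.84, 4.82, 5.68])

# Sample means
mux_demo = xx_demo.mean()
muy_demo = yy_demo.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations
beta_demo = (np.sum((xx_demo - mux_demo) * (yy_demo - muy_demo))
             / np.sum((xx_demo - mux_demo) ** 2))

# Intercept: the fitted line passes through (mean x, mean y)
alpha_demo = muy_demo - beta_demo * mux_demo

# Cross-check with NumPy's degree-1 polynomial fit, which returns
# coefficients from highest degree to lowest: [slope, intercept]
slope_np, intercept_np = np.polyfit(xx_demo, yy_demo, deg=1)

print(f"closed form: beta = {beta_demo:.4f}, alpha = {alpha_demo:.4f}")
print(f"np.polyfit : beta = {slope_np:.4f}, alpha = {intercept_np:.4f}")
```

Both approaches solve the same least-squares problem, so the two printed lines should agree to within floating-point error.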

In [17]:
muy

7.500909090909093