
# Linear Regression

## Linear Regression Algorithm

Linear regression calculates an intercept and slope (weights) for a line that minimizes the sum of squared errors between the line and the data points.

The formula for linear regression is as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where $y$ is the target variable, $\beta_0$ is the intercept, $\beta_1$ to $\beta_p$ are the weights, and $x_1$ to $x_p$ are the $p$ features.

The algorithm is as follows:

- Initialize the weights.
- Calculate the predicted values.
- Calculate the error.
- Update the weights.
- Repeat the steps above until convergence.
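
The steps above describe an iterative fit without naming the update rule. A minimal sketch, assuming the rule is gradient descent on mean squared error; `fit_line_gd`, `learning_rate`, and `n_iters` are illustrative names, and a fixed number of iterations stands in for a real convergence check:

```python
import numpy as np

def fit_line_gd(x, y, learning_rate=0.005, n_iters=20_000):
    """Fit y ~ intercept + slope * x by gradient descent on mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    intercept, slope = 0.0, 0.0            # 1. initialize the weights
    for _ in range(n_iters):               # 5. repeat (fixed count, assumed here)
        y_pred = intercept + slope * x     # 2. calculate the predicted values
        error = y_pred - y                 # 3. calculate the error
        # 4. update the weights along the negative gradient of the MSE
        intercept -= learning_rate * 2 * error.mean()
        slope -= learning_rate * 2 * (error * x).mean()
    return intercept, slope

# First dataset of Anscombe's quartet
x_demo = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_demo = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

gd_intercept, gd_slope = fit_line_gd(x_demo, y_demo)
print(gd_intercept, gd_slope)
```

With unscaled features the step size has to be small; a noticeably larger `learning_rate` can make the updates diverge instead of converging.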



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Load Anscombe's quartet
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
anscombe = pd.DataFrame({'x': x, 'y1': y1, 'y2': y2, 'y3': y3, 'x4': x4, 'y4': y4})

anscombe

In [None]:
# plot x y1
fig, ax = plt.subplots(figsize=(3, 3))
anscombe.plot.scatter(x='x', y='y1', ax=ax, color='k')

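Let's fit a line to x and y1. With a single feature, the least-squares slope and intercept have a closed-form solution, so no iteration is needed.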

Calculate the slope:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Calculate the intercept:

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

Model Equation:

$$y = \beta_0 + \beta_1 x$$

In [None]:
# slope

x1 = anscombe['x']
y1 = anscombe['y1']
slope = ((x1 - x1.mean())*(y1 - y1.mean())).sum() / ((x1 - x1.mean())**2).sum()
slope

In [None]:
# intercept

intercept = y1.mean() - slope * x1.mean()
intercept

In [None]:
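# Scatter plot of x against y1
fig, ax = plt.subplots(figsize=(3, 3))
anscombe.plot.scatter(x='x', y='y1', ax=ax, color='k')
# Overlay the fitted line; separate names avoid overwriting the data in x1 and y1
x_line = np.linspace(4, 14, 100)
y_line = slope * x_line + intercept
ax.plot(x_line, y_line, color='r')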

In [None]:
# To predict y for a new x (say x = 10), find 10 on the x-axis, go up to the fitted line, and read off the y value there.
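
Reading a value off the line is the same as plugging x into the model equation. A small self-contained sketch, recomputing the slope and intercept from the first Anscombe dataset; the `_demo` names and `x_new` are illustrative:

```python
import numpy as np

x_vals = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_vals = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Closed-form least-squares slope and intercept
x_centered = x_vals - x_vals.mean()
slope_demo = (x_centered * (y_vals - y_vals.mean())).sum() / (x_centered ** 2).sum()
intercept_demo = y_vals.mean() - slope_demo * x_vals.mean()

# Prediction for a new x: evaluate the model equation y = b0 + b1 * x
x_new = 10
y_new = intercept_demo + slope_demo * x_new
print(f"predicted y at x={x_new}: {y_new:.2f}")
```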

## Examples with Scikit-learn

In [None]:
from sklearn.linear_model import LinearRegression

# scikit-learn expects X to be 2-D: double brackets return a DataFrame,
# while single brackets would return a 1-D Series
x1 = anscombe[['x']]
y1 = anscombe['y1']
y2 = anscombe['y2']
y3 = anscombe['y3']

lr1 = LinearRegression()
lr1.fit(x1, y1)



In [None]:
lr1.coef_  # Attributes with a trailing underscore are learned during fitting

In [None]:
lr1.intercept_

In [None]:
lr2 = LinearRegression()
lr2.fit(x1, y2)
lr3 = LinearRegression()
lr3.fit(x1, y3)

In [None]:
# plot 1, 2 and 3 in different colors
fig, axs = plt.subplots(1, 3, figsize=(9, 3))
anscombe.plot.scatter(x='x', y='y1', ax=axs[0], color='k')
axs[0].plot(x1, lr1.predict(x1), color='#aaa')
axs[0].set_ylim(3, 13)
anscombe.plot.scatter(x='x', y='y2', ax=axs[1], color='b')
axs[1].plot(x1, lr2.predict(x1), color='#55a')
axs[1].set_ylim(3, 13)
anscombe.plot.scatter(x='x', y='y3', ax=axs[2], color='g')
axs[2].plot(x1, lr3.predict(x1), color='#5a5')
axs[2].set_ylim(3, 13)
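
The panels look very different, yet the fitted lines are nearly the same, which is the usual point of Anscombe's quartet. A sketch that checks this numerically, using `np.polyfit` in place of scikit-learn for brevity (a substitution, not the method used above) and also including the fourth dataset, which is loaded earlier but not fitted:

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "y1": (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "y2": (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "y3": (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "y4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (xs, ys) in quartet.items():
    # A degree-1 polyfit returns the coefficients as [slope, intercept]
    slope_q, intercept_q = np.polyfit(xs, ys, 1)
    fits[name] = (slope_q, intercept_q)
    print(f"{name}: slope={slope_q:.3f}, intercept={intercept_q:.3f}")
```

Near-identical coefficients for such different point clouds are a reminder to plot the data before trusting a fitted line.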

## Real world example with Aircraft Elevators

From the dataset's website: "This data set is also obtained from the task of controlling a F16 aircraft, although the target variable and attributes are different from the ailerons domain. In this case the goal variable is related to an action taken on the elevators of the aircraft."


In [None]:
# https://www.openml.org/search?type=data&sort=runs&id=216&satatus=active
from datasets import load_dataset
elevators = load_dataset('inria-soda/tabular-benchmark', data_files='reg_num/elevators.csv')

In [None]:
elev = elevators['train'].to_pandas()
elev


In [None]:
X = elev.drop(columns=['Goal'])
y = elev['Goal']

lr_elev = LinearRegression()
lr_elev.fit(X, y)

In [None]:
lr_elev.coef_


In [None]:
lr_elev.intercept_

In [None]:
pd.Series(lr_elev.coef_, index=X.columns).sort_values().plot.barh(figsize=(8, 6))

In [None]:
# score is R^2 - the proportion of variance explained by the model
lr_elev.score(X, y)
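
To make the $R^2$ definition concrete, here is a sketch that computes it by hand as $1 - SS_{res} / SS_{tot}$. It uses the small Anscombe dataset rather than the elevators data so that it runs on its own; the variable names are illustrative:

```python
import numpy as np

x_vals = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_vals = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Least-squares line and its predictions
slope_demo, intercept_demo = np.polyfit(x_vals, y_vals, 1)
y_hat = intercept_demo + slope_demo * x_vals

ss_res = ((y_vals - y_hat) ** 2).sum()          # variation left unexplained by the line
ss_tot = ((y_vals - y_vals.mean()) ** 2).sum()  # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")
```

For a single-feature least-squares fit, this value equals the squared correlation between x and y.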

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
mean_absolute_error(y, lr_elev.predict(X)), mean_squared_error(y, lr_elev.predict(X))

In [None]:
lr_elev.predict(X.iloc[[0]])

In [None]:
y.iloc[0]

## Assumptions of Linear Regression

- Linear relationship between the features and target variable
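- No multicollinearity - the features are not strongly correlated with one another
- Homoscedasticity - the variance of the residuals is the same across all fitted values
- Normally distributed residuals - and no extreme outliers, which can pull the fitted line toward them

Also, you will generally want to scale the features before running linear regression. For ordinary least squares this does not change the predictions or $R^2$, but it puts the coefficients on a comparable scale.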

In [None]:
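# Standardize the features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it in a DataFrame to keep the column names
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
X_scaled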

In [None]:
X_scaled.describe()

In [None]:
lr_std = LinearRegression()
lr_std.fit(X_scaled, y)
lr_std.score(X_scaled, y)

In [None]:
pd.Series(lr_std.coef_, index=X.columns).sort_values().plot.barh(figsize=(8, 6))
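
Standardizing rescales the coefficients of ordinary least squares but leaves its predictions unchanged. A sketch on synthetic data (the data, the `ols` helper, and all names below are made up for illustration), using `np.linalg.lstsq` instead of scikit-learn so that it is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X_demo = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])
y_demo = 3.0 + 2.0 * X_demo[:, 0] + 0.005 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

def ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, predictions)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta

# Standardize by hand: subtract the mean, divide by the population std
mu, sigma = X_demo.mean(axis=0), X_demo.std(axis=0)
X_std = (X_demo - mu) / sigma

beta_raw, pred_raw = ols(X_demo, y_demo)
beta_std, pred_std = ols(X_std, y_demo)

print("raw coefficients:         ", beta_raw[1:])
print("standardized coefficients:", beta_std[1:])
```

Each standardized coefficient is the raw coefficient multiplied by that feature's standard deviation, which is why the bar chart of coefficients changes while the score does not.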

In [None]:
!pip install xgboost

In [None]:
# try with XGBoost
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X, y)
xgb.score(X, y)

## Challenge: Linear Regression

Make a model to predict how much Titanic passengers paid for their tickets with Linear Regression. (Only use the numeric columns for the model.)

In [None]:
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/titanic3.xls'
raw = pd.read_excel(url)
raw

## Solution: Linear Regression

In [None]:
def tweak_titanic(df):
    return (df
            .loc[:, ['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare']]
            .dropna()
    )

tweak_titanic(raw)

In [None]:
# Predict fare from the other numeric columns
titanic = tweak_titanic(raw)
X = titanic.drop(columns=['fare'])
y = titanic['fare']

In [None]:
# Fit a linear regression model
lr = LinearRegression()
lr.fit(X, y)

In [None]:
lr.score(X, y)

In [None]:
xgb = XGBRegressor()
xgb.fit(X, y)
xgb.score(X, y)

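# XGBoost scores higher here, but both scores are computed on the training data,
# so this alone does not show that it generalizes better.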
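
A score computed on the data a model was trained on favors flexible models. A sketch of a fairer comparison with a held-out test set, on synthetic data so that it runs on its own. `DecisionTreeRegressor` is a stand-in here for a flexible model such as XGBoost (an assumption made to avoid the extra dependency), and the data and variable names are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic, truly linear data with noise
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

lin = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Compare R^2 on the training data with R^2 on data the models have not seen
lin_train, lin_test = lin.score(X_train, y_train), lin.score(X_test, y_test)
tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
print(f"linear: train={lin_train:.3f}, test={lin_test:.3f}")
print(f"tree:   train={tree_train:.3f}, test={tree_test:.3f}")
```

The unpruned tree memorizes the training set, so its training score says little about new data. The same split could be applied to the Titanic features before comparing `lr` and `xgb`.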