In [None]:
from IPython.core.display import HTML
with open("../style.css", "r") as file:
    css = file.read()
HTML(css)

# Simple Linear Regression with SciKit-Learn

We import the module `pandas`.  This module implements so-called <em style="color:blue;">data frames</em> and is more convenient than the module `csv` for reading a <tt>csv</tt> file. 

In [None]:
import pandas as pd

The data we want to read is contained in the <tt>csv</tt> file `'cars.csv'`.  

In [None]:
cars = pd.read_csv('cars.csv')
cars

The variable `cars` contains a so-called [data frame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
We want to convert the columns containing `mpg` and `displacement` into **NumPy** arrays.  

In [None]:
import numpy as np

X = np.array(cars['displacement'])
Y = np.array(cars['mpg'])

We convert <em style="color:blue;">cubic inches</em> into <em style="color:blue;">litres</em>.

In [None]:
X = 0.0163871 * X

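**SciKit-Learn** expects the features as a two-dimensional array of shape `(n_samples, n_features)`.  Therefore, we have to reshape the array `X` into a matrix with a single column.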
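
As a minimal sketch of what this reshaping does, the following example uses a small array of made-up values (the name `x` and its contents are assumptions for illustration, not taken from `cars.csv`).  Passing `-1` as the first dimension is an alternative to `len(X)` that lets NumPy infer the number of rows.

```python
import numpy as np

# Hypothetical toy values standing in for the displacement column.
x = np.array([1.5, 2.0, 3.2])

# A 1-D array of n values becomes an (n, 1) matrix:
# n samples with one feature each.
x_matrix = x.reshape(-1, 1)   # -1 lets NumPy infer the number of rows

print(x.shape)         # (3,)
print(x_matrix.shape)  # (3, 1)
```

Both spellings, `np.reshape(x, (len(x), 1))` and `x.reshape(-1, 1)`, produce the same matrix.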

In [None]:
X = np.reshape(X, (len(X), 1))
X

We convert <em style="color:blue;">miles per gallon</em> into <em style="color:blue;">kilometres per litre</em>.

In [None]:
Y = 1.60934 / 3.78541 * Y

We convert <em style="color:blue;">kilometres per litre</em> into <em style="color:blue;">litres per 100 kilometres</em>.
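
Taken together, the two conversions amount to a single formula.  The helper below is only a sketch: the function name `mpg_to_litres_per_100km` and the constant names are hypothetical, and the constants assume US gallons, as in the conversion factor used in this notebook.

```python
KM_PER_MILE = 1.60934
LITRES_PER_GALLON = 3.78541  # US gallon, as assumed in this notebook

def mpg_to_litres_per_100km(mpg):
    """Convert miles per gallon into litres per 100 kilometres."""
    km_per_litre = KM_PER_MILE / LITRES_PER_GALLON * mpg
    return 100 / km_per_litre

# A car that drives 30 miles on one gallon needs
# roughly 7.84 litres for 100 kilometres.
print(round(mpg_to_litres_per_100km(30.0), 2))
```

Note that the relationship is reciprocal: a higher `mpg` value corresponds to a lower fuel consumption.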

In [None]:
Y = 100 / Y

We plot fuel consumption versus engine displacement.

In [None]:
import matplotlib.pyplot as plt
import seaborn           as sns
%matplotlib inline

plt.figure(figsize=(12, 10))
sns.set(style='whitegrid')
plt.scatter(X, Y, c='b', s=4) # 'b' is blue color
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption vs Engine Displacement')
plt.show()

We import the module `linear_model` from **SciKit-Learn**:

In [None]:
import sklearn.linear_model as lm

We create a <em style="color:blue;">linear model</em>.

In [None]:
model = lm.LinearRegression()

We *train* this model using the data we have.

In [None]:
M = model.fit(X, Y)

The model `M` represents a linear relationship between `X` and `Y` of the form
$$ \texttt{Y} = \vartheta_0 + \vartheta_1 \cdot \texttt{X} $$
We extract the coefficients $\vartheta_0$ and $\vartheta_1$.
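
For a single feature, the least-squares coefficients also have a well-known closed form: $\vartheta_1$ is the covariance of `X` and `Y` divided by the variance of `X`, and $\vartheta_0$ follows from the means.  The sketch below applies these formulas to made-up data that lie exactly on the line $y = 2 + 3 \cdot x$; the variable names are assumptions chosen for this illustration.

```python
import numpy as np

# Made-up data lying exactly on the line y = 2 + 3 * x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of co-deviations divided by sum of squared deviations of x.
theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the regression line passes through the point of means.
theta0 = y_mean - theta1 * x_mean

print(theta0, theta1)  # recovers the intercept 2 and the slope 3
```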

In [None]:
ϑ0 = M.intercept_
ϑ0

In [None]:
ϑ1 = M.coef_[0]
ϑ1

Let's check the quality of our linear model.  The *coefficient of determination* $R^2$ is computed by the method `score`.
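
For reference, $R^2$ is defined as $1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}$, where $\mathrm{SS}_{\mathrm{res}}$ is the sum of squared residuals and $\mathrm{SS}_{\mathrm{tot}}$ is the sum of squared deviations from the mean.  The following sketch computes it by hand; the function name `r_squared` and the sample values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1 - ss_res / ss_tot

# Hypothetical observations and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

print(r_squared(y_true, y_pred))  # approximately 0.98
```

A value close to 1 means that the model explains most of the variation in the data.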

In [None]:
model.score(X, Y)

The values for $\vartheta_0$, $\vartheta_1$, and $R^2$ are, unsurprisingly, the same values that we had already computed with the notebook `Simple-Linear-Regression.ipynb`.

The next cell suppresses a deprecation warning raised by one of the libraries.  Note that it silences all other warnings as well.

In [None]:
import warnings
warnings.filterwarnings('ignore')

Let's plot the regression line together with the data.

In [None]:
xMax = X.max() + 0.2  # scalar maximum; max(X) would return a one-element array
plt.figure(figsize=(12, 10))
sns.set(style='whitegrid')
plt.scatter(X, Y, c='b', s=4)
plt.plot([0, xMax], [ϑ0, ϑ0 + ϑ1 * xMax], c='r')
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption vs. Engine Displacement')
plt.show()