In [None]:
from IPython.display import HTML
with open("../style.css", "r") as file:
    css = file.read()
HTML(css)

# Simple Linear Regression

We need to read our data from a <tt>csv</tt> file.  The module `csv` offers a number of functions for reading and writing a <tt>csv</tt> file.

In [None]:
import csv

The data we want to read is contained in the <tt>csv</tt> file `cars.csv`, which is located in the subdirectory `Python`.  In this file, the first column contains the *miles per gallon*, while the *engine displacement* is given in the third column.  The next cell lets us peek at this file: the command `cat` works on macOS and Linux, while `type` serves as the fallback on Windows.

In [None]:
!cat cars.csv || type cars.csv

In order to read the file we use the class `DictReader` from the module [csv](https://docs.python.org/3/library/csv.html).
A `DictReader` object yields a dictionary for every data row of the `csv` file.  The keys of this dictionary are the column headers of the `csv` file, and the values are strings.
When reading this file, we convert *miles per gallon* into *km per litre* and *cubic inches* into *litres*.

In [None]:
with open('cars.csv') as handle:
    reader       = csv.DictReader(handle, delimiter=',')
    kpl          = [] # kilometer per litre
    displacement = [] # engine displacement
    for row in reader:
        x = float(row['displacement']) * 0.0163871
        y = float(row['mpg']) * 1.60934 / 3.78541
        print(f'{row["name"]:35s}: displacement = {x:5.2f} litres, kpl = {y:5.2f} km per litre')
        displacement.append(x)  
        kpl         .append(y)
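
To see what `DictReader` yields without depending on the file, the following minimal sketch reads a small in-memory sample.  The two rows, the column `cylinders`, and the car names are made up for illustration; only the headers `mpg`, `displacement`, and `name` are taken from the cell above.

```python
import csv
import io

# Hypothetical two-row sample; the values are invented and not taken from cars.csv.
sample = io.StringIO(
    "mpg,cylinders,displacement,name\n"
    "20.0,8,300.0,example car a\n"
    "30.0,4,100.0,example car b\n"
)

CUBIC_INCH_IN_LITRES = 0.0163871  # same conversion factors as in the cell above
KM_PER_MILE          = 1.60934
LITRES_PER_GALLON    = 3.78541

rows = []
for row in csv.DictReader(sample):
    # every row is a dictionary keyed by the column headers; all values are strings
    litres    = float(row['displacement']) * CUBIC_INCH_IN_LITRES
    kpl_value = float(row['mpg']) * KM_PER_MILE / LITRES_PER_GALLON
    rows.append((row['name'], litres, kpl_value))

for name, litres, kpl_value in rows:
    print(f'{name:15s}: {litres:5.2f} litres, {kpl_value:5.2f} km per litre')
```

Since `DictReader` delivers every value as a string, the explicit call of `float` is needed before any arithmetic is done.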

Now `kpl` is a list of floating point numbers specifying the 
<em style="color:blue;">fuel efficiency</em>, while the list `displacement` 
contains the corresponding <em style="color:blue;">engine displacements</em> 
measured in litres.  We display these values for the first 5 cars.

In [None]:
kpl[:5]

In [None]:
displacement[:5]

The number of data pairs of the form $\langle x, y \rangle$ that we have read is stored in the variable `m`.

In [None]:
m = len(displacement)
m

In order to be able to plot the *fuel efficiency* versus the *engine displacement* we turn the 
lists `displacement` and `kpl` into `numpy` arrays.  This is also useful in order to compute the coefficients $\vartheta_0$ and $\vartheta_1$ later.

In [None]:
import numpy             as np
import matplotlib.pyplot as plt
import seaborn           as sns

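The vector `X` contains the engine displacements in litres.  Since <em style="color:blue;">kilometres per litre</em> is the **inverse** of the fuel consumption, the vector `Y` contains the fuel consumption in litres per 100 km, which is $100$ divided by the kilometres per litre: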

In [None]:
X = np.array(displacement)

In [None]:
Y = np.array([100 / y for y in kpl])
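
The two conversion steps can also be combined into one function.  The following is a minimal sketch; the helper `mpg_to_litres_per_100km` is a hypothetical name that is not used elsewhere in this notebook, and the conversion factors are the ones from the cell that reads the file.

```python
KM_PER_MILE       = 1.60934
LITRES_PER_GALLON = 3.78541

def mpg_to_litres_per_100km(mpg):
    """Convert miles per gallon into litres per 100 km (hypothetical helper)."""
    km_per_litre = mpg * KM_PER_MILE / LITRES_PER_GALLON
    return 100 / km_per_litre

consumption_20 = mpg_to_litres_per_100km(20.0)
consumption_40 = mpg_to_litres_per_100km(40.0)
print(consumption_20, consumption_40)
```

Because the relation is an inverse one, doubling the miles per gallon halves the fuel consumption.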

In [None]:
plt.figure(figsize=(12, 10))
sns.set(style='whitegrid')
plt.scatter(X, Y, c='b', s=4) # 'b' is short for blue
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption vs. Engine Displacement')

We compute the average engine displacement according to the formula:
$$ \bar{\mathbf{x}} = \frac{1}{m} \cdot \sum\limits_{i=1}^m x_i $$ 

In [None]:
xMean = np.mean(X)
xMean

We compute the average fuel consumption according to the formula:
$$ \bar{\mathbf{y}} = \frac{1}{m} \cdot \sum\limits_{i=1}^m y_i $$ 

In [None]:
yMean = np.mean(Y)
yMean

The coefficient $\vartheta_1$ is computed according to the formula:
$$ \vartheta_1 = \frac{\sum\limits_{i=1}^m \bigl(x_i - \bar{\mathbf{x}}\bigr) \cdot \bigl(y_i - \bar{\mathbf{y}}\bigr)}{
                       \sum\limits_{i=1}^m \bigl(x_i - \bar{\mathbf{x}}\bigr)^2}  
$$

In [None]:
ϑ1 = np.sum((X - xMean) * (Y - yMean)) / np.sum((X - xMean) ** 2)
ϑ1

The coefficient $\vartheta_0$ is computed according to the formula:
$$ \vartheta_0 = \bar{\mathbf{y}} - \vartheta_1 \cdot \bar{\mathbf{x}} $$ 

In [None]:
ϑ0 = yMean - ϑ1 * xMean
ϑ0
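
As a sanity check, the closed-form coefficients can be compared with the result of `np.polyfit`, which fits a polynomial of a given degree by least squares.  The sketch below uses synthetic data that is made up for illustration, so its numbers say nothing about `cars.csv`.

```python
import numpy as np

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 7.0, size=50)
y = 4.0 + 2.0 * x + rng.normal(0.0, 1.0, size=50)

# closed-form solution, using the same formulas as the cells above
x_mean, y_mean = np.mean(x), np.mean(y)
theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
theta0 = y_mean - theta1 * x_mean

# np.polyfit returns the coefficients with the highest power first
slope, intercept = np.polyfit(x, y, 1)

assert np.isclose(theta1, slope)
assert np.isclose(theta0, intercept)
```

The same comparison should work with the arrays `X` and `Y` of this notebook in place of `x` and `y`.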

Let us plot the line $y(x) = \vartheta_0 + \vartheta_1 \cdot x$ together with our data:

In [None]:
xMax = max(X) + 0.2
plt.figure(figsize=(12, 10))
sns.set(style='whitegrid')
plt.scatter(X, Y, c='b')
plt.plot([0, xMax], [ϑ0, ϑ0 + ϑ1 * xMax], c='r')
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption versus Engine Displacement')

We see that the data points scatter considerably around the line, so apparently the engine displacement explains only a part of the fuel consumption.  In order to compute the coefficient of determination, i.e. the statistic $R^2$, we first compute the *total sum of squares* `TSS` according to the following formula:
$$ \mathtt{TSS} := \sum\limits_{i=1}^m \bigl(y_i - \bar{\mathbf{y}}\bigr)^2 $$

In [None]:
TSS = np.sum((Y - yMean) ** 2)
TSS

Next, we compute the *residual sum of squares* `RSS` as follows:
$$ \mathtt{RSS} := \sum\limits_{i=1}^m \bigl(\vartheta_1 \cdot x_i + \vartheta_0 - y_i\bigr)^2 $$
    

In [None]:
RSS = np.sum((ϑ1 * X + ϑ0 - Y) ** 2)
RSS

Now $R^2$ is calculated via the formula:
$$ R^2 = 1 - \frac{\mathtt{RSS}}{\mathtt{TSS}}$$

In [None]:
R2 = 1 - RSS/TSS
R2
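
For a simple linear regression with an intercept, $R^2$ coincides with the square of the Pearson correlation coefficient of the two variables.  The following sketch checks this on synthetic data that is made up for illustration; `np.corrcoef` returns the correlation matrix, and its off-diagonal entry is the correlation coefficient.

```python
import numpy as np

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 7.0, size=50)
y = 4.0 + 2.0 * x + rng.normal(0.0, 2.0, size=50)

# least-squares line, using the same formulas as the cells above
theta1 = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sum((x - np.mean(x)) ** 2)
theta0 = np.mean(y) - theta1 * np.mean(x)

tss = np.sum((y - np.mean(y)) ** 2)           # total sum of squares
rss = np.sum((theta1 * x + theta0 - y) ** 2)  # residual sum of squares
r2  = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]                   # Pearson correlation coefficient
assert np.isclose(r2, r ** 2)
```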

It seems that about $75\%$ of the variance of the fuel consumption is explained by the engine displacement.  We can get a better model of the fuel consumption if we use more variables for explaining it.  For example, the weight of a car also influences its fuel consumption.
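
How a second variable enters the model can be sketched with `np.linalg.lstsq`, which solves the least-squares problem for a matrix of features.  Everything in the block below is an assumption made for illustration: the data is synthetic, the names `disp_demo`, `weight_demo`, and `cons_demo` are hypothetical, and the numbers do not describe the cars in `cars.csv`.

```python
import numpy as np

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(2)
n_cars      = 100
disp_demo   = rng.uniform(1.0, 7.0, size=n_cars)                       # litres
weight_demo = 800 + 200 * disp_demo + rng.normal(0, 150, size=n_cars)  # kg
cons_demo   = 3.0 + 1.0 * disp_demo + 0.004 * weight_demo + rng.normal(0, 1.0, size=n_cars)

def r_squared(features, y):
    """R^2 of the least-squares fit of y on the given feature vectors plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + list(features))
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((A @ coeffs - y) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1 - rss / tss

r2_one = r_squared([disp_demo], cons_demo)               # displacement only
r2_two = r_squared([disp_demo, weight_demo], cons_demo)  # displacement and weight
print(r2_one, r2_two)
```

On the training data, adding a feature can never decrease $R^2$, since the smaller model is a special case of the larger one.  Whether the improvement is meaningful has to be judged separately.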