In [None]:
from IPython.display import HTML
with open("../style.css", "r") as file:
    css = file.read()
HTML(css)

# General Linear Regression

In this case study we investigate how much of the <em style="color:blue">fuel consumption</em> of a car can be explained by the 

- number of cylinders,
- engine displacement,
- horsepower,
- weight,
- acceleration, and
- the year the car was introduced to the market.

This data is given in the `CSV` file `cars.csv`.  In this file, the engine displacement is given in *cubic inches* and the weight is given in *pounds*.  The fuel consumption is specified as *miles per gallon* and the acceleration is given as the number of seconds until the car reaches 60 miles per hour.

The module `csv` offers a number of functions for reading and writing <tt>csv</tt> files.

In [None]:
import csv

Below we read the file `cars.csv` and store the miles-per-gallon values in the list `Y`, while the number of cylinders, the engine displacement,
the power, the weight, the acceleration, and the building year are stored as the list of lists `X`. The list `Features` holds the names of the corresponding columns. Note also that we append the constant feature $1$ to every list in `X`.

In [None]:
with open('cars.csv') as input_file:
    reader   = csv.DictReader(input_file, delimiter=',')
    Features = ['cyl', 'displacement', 'hp' ,'weight', 'acc', 'year']
    X        = []
    Y        = []
    for row in reader:
        Y.append(float(row['mpg']))
        X.append([float(row[f]) for f in Features] + [1.0])
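To see what the loop above produces without needing `cars.csv`, the following sketch runs the same logic on an in-memory sample. The two data rows are made up for illustration, and the names `sample_text`, `feature_names`, `X_demo`, and `Y_demo` are hypothetical; only the column names are taken from the cell above.

```python
import csv
import io

# Hypothetical two-row sample; the values are invented for illustration only.
sample_text = (
    "mpg,cyl,displacement,hp,weight,acc,year\n"
    "20.0,8,300.0,150.0,3500.0,12.0,70\n"
    "30.0,4,100.0,80.0,2200.0,16.0,76\n"
)

feature_names = ['cyl', 'displacement', 'hp', 'weight', 'acc', 'year']
X_demo = []
Y_demo = []
# io.StringIO lets csv.DictReader treat the string like an open file.
for row in csv.DictReader(io.StringIO(sample_text)):
    Y_demo.append(float(row['mpg']))
    # Six numeric features followed by the constant feature 1.0.
    X_demo.append([float(row[f]) for f in feature_names] + [1.0])

print(X_demo[0])
print(Y_demo)
```

Every row of `X_demo` has seven entries: the six features plus the trailing constant, which lets the linear model learn an intercept.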

The number of data pairs of the form $\langle \textbf{x}, y \rangle$ that we have read is stored in the variable `m`.

In [None]:
m = len(Y)
m

For efficiency reasons we transform the *feature matrix* `X`, which currently is a list of lists, into a `NumPy` array.

In [None]:
import numpy as np

In [None]:
X

In [None]:
X = np.array(X)

Note that every row in this matrix contains the data corresponding to a single car.

Since <em style="color:blue">miles per gallon</em> is the inverse of the <em style="color:blue">fuel consumption</em>, we redefine the vector `Y` as the elementwise reciprocal of the `mpg` values that are currently stored in `Y`.  The fuel consumption is then measured in gallons per mile.

In [None]:
Y = 1 / np.array(Y)  # elementwise reciprocal: gallons per mile
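As a quick sanity check of this conversion, the sketch below takes the reciprocal of three made-up `mpg` values. The names `mpg_demo` and `consumption_demo` are hypothetical and are chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np

# Made-up miles-per-gallon values, for illustration only.
mpg_demo = np.array([20.0, 25.0, 40.0])

# Dividing a scalar by an array is applied elementwise.
consumption_demo = 1 / mpg_demo  # gallons per mile
print(consumption_demo)
```

A car that drives 25 miles on one gallon needs 1/25 = 0.04 gallons per mile, so a higher `mpg` value corresponds to a lower fuel consumption.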

The weight vector `w` is specified via the *normal equation*:
$$ (X^\top \cdot X) \cdot \textbf{w} = X^\top \cdot \textbf{y} $$ 
This linear equation can be solved for `w` using the function `np.linalg.solve`.  Note that the *transpose* of the matrix `X` can be computed by writing `X.T`.
Furthermore, *matrix multiplication* of two `NumPy` arrays `A` and `B` is written as `A @ B`.

In [None]:
%%time
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)
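One way to gain confidence in the normal equation is to compare its solution with `np.linalg.lstsq`, which solves the same least squares problem without forming $X^\top \cdot X$ explicitly. The sketch below does this on synthetic data; the names `A`, `b`, `w_true`, `w_normal`, and `w_lstsq` as well as the random data are assumptions made for this illustration and are not part of the case study.

```python
import numpy as np

# Synthetic data: 50 samples, 3 random features plus a constant column.
rng = np.random.default_rng(0)
A = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
w_true = np.array([2.0, -1.0, 0.5, 3.0])
b = A @ w_true + 0.01 * rng.normal(size=50)  # targets with small noise

# Solution via the normal equation, as in the cell above.
w_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Solution via NumPy's least squares routine.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(w_normal)
print(np.allclose(w_normal, w_lstsq))
```

For a well-conditioned matrix both approaches agree up to rounding. Forming $X^\top \cdot X$ squares the condition number of the problem, so `np.linalg.lstsq` is usually the safer choice when the features are nearly collinear.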

The <em style="color:blue">residual sum of squares</em> is given by the following sum:
$$ \texttt{RSS} = \sum\limits_{i=1}^m \Bigl(\bigl(\textbf{x}^{(i)}\bigr)^\top \cdot \textbf{w} - y_i\Bigr)^2 $$
Here $\textbf{x}^{(i)}$ is the $i$-th row of the matrix $X$, while $y_i$ is the $i$-th component of the vector $\textbf{y}$.
The expression $\bigl(\textbf{x}^{(i)}\bigr)^\top \cdot \textbf{w}$ is the predicted value of the linear model, while $y_i$ is the actual value.
As the feature matrix $X$ is defined as
$$ X = \left( \begin{array}{c}
              \bigl(\textbf{x}^{(1)}\bigr)^\top \\
              \vdots \\
              \bigl(\textbf{x}^{(m)}\bigr)^\top
              \end{array}
       \right)
$$
we can compute the variable `RSS` as follows:

In [None]:
RSS = np.sum((X @ w - Y) ** 2)
RSS
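To illustrate that the vectorized expression computes the same value as the sum in the formula, the following sketch evaluates both on a tiny hand-checkable example. The arrays `X_toy`, `w_toy`, and `y_toy` are made up for this purpose.

```python
import numpy as np

# Toy data: one feature plus the constant column.
X_toy = np.array([[1.0, 1.0],
                  [2.0, 1.0],
                  [3.0, 1.0]])
w_toy = np.array([2.0, 0.0])       # model: prediction = 2 * x
y_toy = np.array([2.0, 5.0, 5.0])  # actual values

# Predictions are 2, 4, 6, so the residuals are 0, -1, 1
# and the residual sum of squares is 0 + 1 + 1 = 2.

# Literal translation of the sum over all rows.
rss_loop = sum((X_toy[i] @ w_toy - y_toy[i]) ** 2 for i in range(len(y_toy)))

# Vectorized version: one matrix-vector product, then an elementwise square.
rss_vec = np.sum((X_toy @ w_toy - y_toy) ** 2)

print(rss_loop, rss_vec)
```

The vectorized form avoids the Python-level loop, which matters once the number of rows grows.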

We compute the <em style="color:blue">average fuel consumption</em> $\bar{\mathbf{y}}$ according to the formula:
$$ \bar{\mathbf{y}} = \frac{1}{m} \cdot \sum\limits_{i=1}^m y_i $$ 

In [None]:
yMean = np.mean(Y)
yMean

We compute the <em style="color:blue">total sum of squares</em> `TSS` according to the following formula:
$$ \mathtt{TSS} := \sum\limits_{i=1}^m \bigl(y_i - \bar{\mathbf{y}}\bigr)^2 $$

In [None]:
TSS = np.sum((Y - yMean) ** 2)
TSS

Now the <em style="color:blue">proportion of explained variance</em> $R^2$ is calculated via the formula:
$$ R^2 = 1 - \frac{\mathtt{RSS}}{\mathtt{TSS}}$$

In [None]:
R2 = 1 - RSS/TSS
R2
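The three steps above can be bundled into one small helper. The function `r_squared` below is a hypothetical sketch, not part of the original notebook. Two boundary cases serve as a sanity check: a perfect prediction yields $R^2 = 1$, while always predicting the mean yields $R^2 = 0$.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of explained variance, 1 - RSS / TSS (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_pred - y_true) ** 2)           # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - rss / tss

# Made-up target values for the sanity check.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r_squared(y_toy, y_toy)                       # RSS is 0
r2_mean = r_squared(y_toy, np.full(4, np.mean(y_toy)))     # RSS equals TSS
print(r2_perfect, r2_mean)
```

Note that the helper assumes the target values are not all identical, since `TSS` would then be zero.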

It looks like we can explain about $88\%$ of the variance in the fuel consumption by the data given in our `CSV` file.  Given that our data does not contain any parameters describing the air resistance, we cannot hope to do much better.