# Topic 1 -- Linear Regression

Welcome everyone! In this workshop, we will learn about one of the fundamental algorithms of supervised learning: linear regression. This notebook contains a step-by-step guide to implementing linear regression in scikit-learn, as well as a collection of pictures and interactive graphs that will give you an intuition for the idea behind linear regression.

## Table of Contents 
1. [What is Linear Regression?](#what-is-linear-regression?)
    - [The Car Price Problem](#the-car-price-problem)
    - [Finding the Optimal $w$ and $b$](#finding-optimal)
    - [Gradient Descent](#gradient-descent)
    
    
2. [Training a Linear Regression Model](#training)
    - [Loading and Visualizing the Dataset](#loading)
    - [Fitting the Linear Model](#fitting)
    - [Observations](#obs1)
    
    
3. [Linear Regression with Multiple Variables](#multivar)
    - [Choosing Features](#choosing)
    - [Training our Multi-Variate Linear Regression Model](#training2)
    - [Observations](#obs2)
    
    
4. [Polynomial Regression](#poly)
    - [Intuition](#intuition)
    - [Preparing our Features and Fitting our Model](#features2)
    - [Observations](#obs3)

### Before we begin...

There is one module we need to import. This module contains some code to display the graphs.

In [1]:
from disp_utils import *

## What is Linear Regression?<a name="what-is-linear-regression?"/>

Linear regression is one of the simplest supervised learning methods for tackling **regression problems**: problems where you try to predict a **continuous output** from one or more input features. One example is predicting someone's height from their age, ethnicity, and biological sex; height can take any value within some reasonable range (reasonable as in there are no humans that are 10 meters tall). Another is estimating the age of a sample from the amount of carbon-14 it contains. Problems like these can be approached with linear regression.

### The Car Price Problem<a name="the-car-price-problem"/>

Say you are given a **dataset** containing the **mileage** of cars as well as their respective prices. Your task is to **train** a model from this dataset so that in the future, you can input a certain mileage, and the model will return the price of that car. The dataset looks something like this:


In [2]:
show_price_vs_mileage()

Disregarding some outliers, you can see that there is a downward trend in the data. Now the question is ***How do we predict the price of a car given its mileage in km?*** Linear regression will try to find a function that best **fits** the data. From here on, we will use the word "**fit**" to refer to the training of a machine learning model. And that is really what "training an ML algorithm" is doing: it is trying to fit a function to the training data! For the most basic linear regression problem, we have one input variable $x$ and one output variable $y$. The function that linear regression will be fitting is shown below:


<div style="text-align: center">
    <div>&nbsp;</div>
    $\hat{y} = wx + b$
</div>

Where:
- $\hat{y}$ is your hypothesis function, AKA the function you are trying to fit
- $x$ is the input data
- $w$ is the weight (slope)
- $b$ is the bias (y-intercept)
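To make the formula concrete, here is a minimal sketch of the hypothesis function in NumPy. The function name `predict` and the values `w_example` and `b_example` are made up for illustration; they are not fitted to the car dataset.

```python
import numpy as np

def predict(x, w, b):
    """Evaluate the hypothesis y_hat = w * x + b for one input or an array of inputs."""
    return w * np.asarray(x, dtype=float) + b

# Illustrative parameter values only (assumed, not learned from data)
w_example = -0.02     # price change per km
b_example = 11000.0   # predicted price at 0 km

mileages = np.array([0, 100_000, 300_000])
prices = predict(mileages, w_example, b_example)
print(prices)
```

Changing `w_example` tilts the line and changing `b_example` shifts it up or down, which is the same thing the sliders do.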

In linear regression, the weights and bias are the **parameters** we get to tinker with. By changing the weights and bias, you change the nature of the function. See for yourself by playing around with the weight and bias sliders below, and try to fit the function to the data:


In [3]:
show_pvm_with_sliders()


By playing around with the sliders, you have fit a function to the data. This is basically what all supervised learning algorithms are doing, just that usually they are fitting much, **MUCH** more complicated functions. Once you fit a function, you can use that function to make predictions! For example, if you input a mileage of 300,000 km, this function will output a price of around \$5000.

### Finding the Optimal $w$ and $b$ <a name="finding-optimal"/>

You might have the question: *How does a machine figure out which $w$ and $b$ are the correct values?* We will need a way of measuring how "wrong" a hypothesis function is. This measurement of error is done with a **cost function**. The cost function for linear regression is shown below:

<div style="text-align: center">
    <div>&nbsp;</div>
    $C(w, b) = {1\over m} \sum\limits_{i=1}^m (\hat{y}^{(i)}-y^{(i)})^2$
</div>

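Here, $m$ is the number of training examples, $\hat{y}^{(i)}$ is the prediction for the $i$-th example, and $y^{(i)}$ is its true value. This may look a little complicated, so we will just refer to this cost function as $MSE$, which stands for **Mean Squared Error**. It is the most common way of measuring the cost for linear regression.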

#### Intuition of the cost function

This cost function takes the vertical distance from each point to the line, squares it, and averages the result over all points. This is useful because the farther the line is from the points, the greater these distances, and therefore the greater the error. By minimizing the error, you can find the best-fitting $w$ and $b$.
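As a small illustration of the cost function, the sketch below computes the MSE for two candidate lines on a toy dataset. The data and the helper name `mse` are made up for this example.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error between predictions y_hat and true values y."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

# Toy data (assumed for illustration): the points lie exactly on y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

cost_good = mse(2.0 * x + 0.0, y)  # w=2, b=0 passes through every point
cost_bad = mse(1.0 * x + 0.0, y)   # w=1, b=0 undershoots every point
print(cost_good, cost_bad)
```

The line that passes through every point has zero cost, while the worse line has a strictly larger cost.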

### Gradient Descent <a name="gradient-descent"/>

Gradient descent is a powerful algorithm that **iteratively** tries to minimize the cost function. To understand gradient descent, let's take a look at the cost function for this car price example:

<img src="images/gradient-descent.png" alt="image cannot be displayed" style="width: 700px">

To minimize the cost function, we want to travel **down** the slope towards the minimum. And that's how gradient descent gets its name! Gradient is a fancy word for "slope", and we descend along the gradient to find the point of least error.
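The idea of walking downhill can be sketched in a few lines of NumPy. This is a minimal illustration on toy data, not the procedure used later in this notebook: the names `gradient_descent`, `learning_rate`, and `n_steps` are assumptions. The gradients come from differentiating the MSE with respect to $w$ and $b$.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_steps=2000):
    """Fit y_hat = w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        error = (w * x + b) - y            # y_hat - y for every example
        grad_w = 2.0 * np.mean(error * x)  # dC/dw
        grad_b = 2.0 * np.mean(error)      # dC/db
        w -= learning_rate * grad_w        # step downhill
        b -= learning_rate * grad_b
    return w, b

# Toy data (assumed): points generated from y = 3x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0

w_fit, b_fit = gradient_descent(x, y)
print(w_fit, b_fit)
```

On this toy data the loop recovers values very close to $w = 3$ and $b = 1$. With inputs as large as the mileages in the car dataset, the features would likely need to be rescaled first, otherwise a fixed learning rate like this one would overshoot.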

---


## Training a Linear Regression Model <a name="training">

Now that we have gone over the basic concepts of linear regression, we will construct a basic linear regression model in scikit-learn to predict the price of cars given their mileage. The first step is to import the necessary libraries:

- **NumPy**: a powerful array and linear algebra library
- **pandas**: provides dataframes for organizing and inspecting data
- **scikit-learn** (imported as `sklearn`): machine learning framework used to train our linear regression model
- **Matplotlib** and **Bokeh**: data visualization libraries

Let's load these modules:


In [4]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook, push_notebook

pd.set_option('display.max_columns', None)

### Loading and Visualizing the Dataset <a name="loading">

The first step in training a linear regression (or any machine learning) algorithm is to figure out what kind of data we have to work with. Let's read in `cars.csv` and load it into a pandas dataframe.

In [5]:
# Loads the dataset and displays it
cars_dataset = pd.read_csv("datasets/cars.csv")
cars_dataset.head(10)

Unnamed: 0,manufacturer_name,model_name,transmission,color,odometer_value,year_produced,engine_fuel,engine_has_gas,engine_type,engine_capacity,body_type,has_warranty,state,drivetrain,price_usd,is_exchangeable,location_region,number_of_photos,up_counter,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,duration_listed
0,Subaru,Outback,automatic,silver,190000,2010,gasoline,False,gasoline,2.5,universal,False,owned,all,10900.0,False,Минская обл.,9,13,False,True,True,True,False,True,False,True,True,True,16
1,Subaru,Outback,automatic,blue,290000,2002,gasoline,False,gasoline,3.0,universal,False,owned,all,5000.0,True,Минская обл.,12,54,False,True,False,False,True,True,False,False,False,True,83
2,Subaru,Forester,automatic,red,402000,2001,gasoline,False,gasoline,2.5,suv,False,owned,all,2800.0,True,Минская обл.,4,72,False,True,False,False,False,False,False,False,True,True,151
3,Subaru,Impreza,mechanical,blue,10000,1999,gasoline,False,gasoline,3.0,sedan,False,owned,all,9999.0,True,Минская обл.,9,42,True,False,False,False,False,False,False,False,False,False,86
4,Subaru,Legacy,automatic,black,280000,2001,gasoline,False,gasoline,2.5,universal,False,owned,all,2134.11,True,Гомельская обл.,14,7,False,True,False,True,True,False,False,False,False,True,7
5,Subaru,Outback,automatic,silver,132449,2011,gasoline,False,gasoline,2.5,universal,False,owned,all,14700.0,True,Минская обл.,20,56,False,True,False,False,False,True,False,True,True,True,67
6,Subaru,Forester,automatic,black,318280,1998,gasoline,False,gasoline,2.5,universal,False,owned,all,3000.0,True,Минская обл.,8,147,False,True,False,False,True,True,False,False,True,True,307
7,Subaru,Legacy,automatic,silver,350000,2004,gasoline,False,gasoline,2.5,sedan,False,owned,all,4500.0,False,Брестская обл.,7,29,False,True,True,False,False,False,False,False,False,True,73
8,Subaru,Outback,automatic,grey,179000,2010,gasoline,False,gasoline,2.5,universal,False,owned,all,12900.0,False,Минская обл.,17,33,False,True,True,True,True,True,True,True,True,True,87
9,Subaru,Forester,automatic,silver,571317,1999,gasoline,False,gasoline,2.5,universal,False,owned,all,4200.0,True,Минская обл.,8,11,False,True,True,False,False,True,False,False,False,True,43


Taking a look at our dataset, we can see many features such as `odometer_value`, `year_produced`, `engine_capacity`, etc. For our first linear regression model, we will only focus on the columns `odometer_value` and `price_usd`.

In [6]:
cars_dataset[['odometer_value', 'price_usd']].head(10)

Unnamed: 0,odometer_value,price_usd
0,190000,10900.0
1,290000,5000.0
2,402000,2800.0
3,10000,9999.0
4,280000,2134.11
5,132449,14700.0
6,318280,3000.0
7,350000,4500.0
8,179000,12900.0
9,571317,4200.0


Now that we have isolated the two variables that we want, we can turn them into NumPy arrays to train on. These arrays should be split into a **training set**, which the model learns from, and a **test set**, which is held back to check how well the model does on data it has not seen.

In [7]:
from sklearn.model_selection import train_test_split

# obtains np.arrays of training features and labels
features = cars_dataset['odometer_value'].to_numpy()
labels = cars_dataset['price_usd'].to_numpy()

# train test split
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.1)

# Reshapes X_train and X_test into 2D arrays of shape (n_samples, 1), the layout scikit-learn expects
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)


print(f"The shape of X_train is {X_train.shape}")
print(f"The shape of Y_train is {Y_train.shape}")
print(f"The shape of X_test is {X_test.shape}")
print(f"The shape of Y_test is {Y_test.shape}")

The shape of X_train is (34677, 1)
The shape of Y_train is (34677,)
The shape of X_test is (3854, 1)
The shape of Y_test is (3854,)
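The shapes printed above follow from `reshape(-1, 1)`, which turns a flat array of `n` values into `n` rows with one column each. A small sketch on three mileage values taken from the table shows the effect; the variable names are made up for this example.

```python
import numpy as np

# Three odometer values from the first rows of the dataset
mileage = np.array([190000, 290000, 402000])

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column
column = mileage.reshape(-1, 1)

print(mileage.shape)  # 1D: one axis
print(column.shape)   # 2D: rows are samples, columns are features
```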


### Fitting the Linear Model <a name="fitting">

Now the data is ready for use in training. The first linear regression model is going to be a simple, **single variable** model with no fancy bells and whistles on board. Training will be done using scikit-learn's `LinearRegression` class, which provides an easy, abstracted interface to work with.

In [8]:
from sklearn.linear_model import LinearRegression

# initialization of an object of class LinearRegression
linreg = LinearRegression()

# Fitting the training set
linreg.fit(X_train, Y_train)

# Printing the R2 score
print(f'R2: {linreg.score(X_test, Y_test)}')

R2: 0.16521050331528475


The $R^2$ score is very low, which means the model explains only a small fraction of the variation in price. That's terrible! Maybe plotting the predictions made by our model alongside the test set data can give us a clue about what's going on...

In [9]:
def display_predictions(X_gt, Y_gt, X_pred, Y_pred):
    """
    Displays predictions vs ground truth
    Args: 
        X_gt, Y_gt: ground truth data and labels
        X_pred, Y_pred: test input and predicted outputs
    """
    
    p = figure(width=600,
               height=400,
               x_range=(0, 600000),
               y_range=(0, 50000),
               title="Price vs Mileage Evaluation",
               x_axis_label="Mileage (km)",
               y_axis_label="Price")
    # plot ground truth as blue dots
    X_gt = X_gt.flatten()
    p.circle(X_gt, Y_gt, color='blue')
    
    # plot prediction as red dots
    X_pred = X_pred.flatten()
    p.circle(X_pred, Y_pred, color='red')
    
    # output_notebook is required for outputting a plot to jupyter notebook
    output_notebook()
    show(p)

In [10]:
# X_pred is a range of numbers from 0 to 500000 with an interval of 50
X_pred = np.arange(0, 500000, 50).reshape(-1, 1)
Y_pred = linreg.predict(X_pred)

display_predictions(X_test, Y_test, X_pred, Y_pred)

### Observations <a name="obs1">

By plotting our testing data alongside our model's predictions, we are able to see the problem: while the fitted line follows the overall trend, **there is too much spread in our data** for a single line to capture. Mileage alone leaves most of the variation in price unexplained, which is the likely reason our model reports such a low $R^2$. Let's keep track of our observations in the observation table below:

| Model | Observation | R2 |
| :----- | :----------- | :-------- |
| Basic Linear Regression | Large spread in data | 0.16 |
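The `R2` value reported by `linreg.score` is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on toy numbers to show what the two extremes mean; the helper name `r2_score_manual` and the data are made up for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy labels (assumed for illustration)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_score_manual(y_true, y_true)                       # predictions match exactly
baseline = r2_score_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
print(perfect, baseline)
```

A score of 1 means a perfect fit, and a score of 0 means the model does no better than always predicting the average price. A score of 0.16 is much closer to the second case.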

---

## Linear Regression with Multiple Variables <a name="multivar">

So far, we have seen linear regression with a single variable in action. The results were not too spectacular, and that is because the price of a car is a function of **multiple features**. For example, the price is determined not only by **mileage**, but also by **engine size**, the **year** it was produced, and whatever **special features** it might have.

For multivariate linear regression, we will extend what we learned in single variable linear regression to produce a more powerful machine learning algorithm. With multiple input variables, multivariate linear regression would aim to fit a hypothesis function as shown:

<div style="text-align: center">
    <div>&nbsp;</div>
    $\hat{y} = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$
</div>
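With several features, the same hypothesis can be written as a matrix-vector product. The following is a minimal sketch with made-up numbers; `predict_multi`, `X`, `w`, and `b` are illustrative names, and the weights are not fitted to any data.

```python
import numpy as np

def predict_multi(X, w, b):
    """Evaluate y_hat = w_1*x_1 + ... + w_n*x_n + b for every row of X."""
    return X @ w + b

# Two examples with three features each (values assumed for illustration)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
w = np.array([0.5, -1.0, 2.0])  # one weight per feature
b = 10.0                        # a single shared bias

y_hat = predict_multi(X, w, b)
print(y_hat)
```

Each row of `X` is one example and each column is one feature, which is the same layout used for the training arrays in this notebook.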

### Choosing Features <a name="choosing">

Let's once again take a look at our dataset, as well as all the columns it currently has.

In [11]:
cars_dataset.head(20)

Unnamed: 0,manufacturer_name,model_name,transmission,color,odometer_value,year_produced,engine_fuel,engine_has_gas,engine_type,engine_capacity,body_type,has_warranty,state,drivetrain,price_usd,is_exchangeable,location_region,number_of_photos,up_counter,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,duration_listed
0,Subaru,Outback,automatic,silver,190000,2010,gasoline,False,gasoline,2.5,universal,False,owned,all,10900.0,False,Минская обл.,9,13,False,True,True,True,False,True,False,True,True,True,16
1,Subaru,Outback,automatic,blue,290000,2002,gasoline,False,gasoline,3.0,universal,False,owned,all,5000.0,True,Минская обл.,12,54,False,True,False,False,True,True,False,False,False,True,83
2,Subaru,Forester,automatic,red,402000,2001,gasoline,False,gasoline,2.5,suv,False,owned,all,2800.0,True,Минская обл.,4,72,False,True,False,False,False,False,False,False,True,True,151
3,Subaru,Impreza,mechanical,blue,10000,1999,gasoline,False,gasoline,3.0,sedan,False,owned,all,9999.0,True,Минская обл.,9,42,True,False,False,False,False,False,False,False,False,False,86
4,Subaru,Legacy,automatic,black,280000,2001,gasoline,False,gasoline,2.5,universal,False,owned,all,2134.11,True,Гомельская обл.,14,7,False,True,False,True,True,False,False,False,False,True,7
5,Subaru,Outback,automatic,silver,132449,2011,gasoline,False,gasoline,2.5,universal,False,owned,all,14700.0,True,Минская обл.,20,56,False,True,False,False,False,True,False,True,True,True,67
6,Subaru,Forester,automatic,black,318280,1998,gasoline,False,gasoline,2.5,universal,False,owned,all,3000.0,True,Минская обл.,8,147,False,True,False,False,True,True,False,False,True,True,307
7,Subaru,Legacy,automatic,silver,350000,2004,gasoline,False,gasoline,2.5,sedan,False,owned,all,4500.0,False,Брестская обл.,7,29,False,True,True,False,False,False,False,False,False,True,73
8,Subaru,Outback,automatic,grey,179000,2010,gasoline,False,gasoline,2.5,universal,False,owned,all,12900.0,False,Минская обл.,17,33,False,True,True,True,True,True,True,True,True,True,87
9,Subaru,Forester,automatic,silver,571317,1999,gasoline,False,gasoline,2.5,universal,False,owned,all,4200.0,True,Минская обл.,8,11,False,True,True,False,False,True,False,False,False,True,43


You'll probably notice that many columns hold non-numeric values. We will not use most of them, but every column we do use must be numeric. To keep things simple, we will use **4 features**: `odometer_value`, `year_produced`, `engine_capacity`, and `drivetrain`. The first three are already numeric, while `drivetrain` is categorical and has to be converted.

In [12]:
# Labels stay unchanged
labels_multi = cars_dataset['price_usd'].to_numpy()

# one-hot encode drivetrain, then collapse each row back to a single integer code
# (the index of its category) so that the column becomes numeric
drivetrain_dummies = pd.get_dummies(cars_dataset['drivetrain']).to_numpy()
cars_dataset["drivetrain_num"] = np.argmax(drivetrain_dummies, axis=1)

# obtain features array from pandas dataframe
features_multi = cars_dataset[['odometer_value', 'year_produced', 'engine_capacity', 'drivetrain_num']].to_numpy()

print(f'features_multi shape: {features_multi.shape}')
print(f'labels_multi shape: {labels_multi.shape}')

features_multi shape: (38531, 4)
labels_multi shape: (38531,)
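To see what the `drivetrain` conversion does, here is a small sketch on a toy column. The category values `"all"`, `"front"`, and `"rear"` are assumptions for illustration (only `"all"` appears in the rows shown above), and the names `toy`, `dummies`, and `codes` are made up.

```python
import numpy as np
import pandas as pd

# Toy categorical column (values assumed for illustration)
toy = pd.DataFrame({"drivetrain": ["all", "front", "rear", "front"]})

# One indicator column per category, in sorted order
dummies = pd.get_dummies(toy["drivetrain"])

# argmax picks the position of the active indicator in each row,
# giving one integer code per category
codes = np.argmax(dummies.to_numpy(), axis=1)

print(list(dummies.columns))
print(codes)
```

One thing to keep in mind with this approach is that integer codes imply an order between categories that may not be meaningful. Keeping the indicator columns from `get_dummies` as separate features is a common alternative.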


### Training our Multi-Variate Linear Regression Model <a name="training2">

With our features chosen, we can now train our linear regression model. The training process is quite similar to before: we will create a train-test split and then train using `LinearRegression`.


In [13]:
# Train test split
X_train_m, X_test_m, Y_train_m, Y_test_m = train_test_split(features_multi, labels_multi, test_size=0.1)

# convert all NaN to 0
X_train_m = np.nan_to_num(X_train_m, nan=0)
X_test_m = np.nan_to_num(X_test_m, nan=0)

# initializing and fitting linear model
linreg_m  = LinearRegression()
linreg_m.fit(X_train_m, Y_train_m)


print(f'R2: {linreg_m.score(X_test_m, Y_test_m)}')

R2: 0.6145927596193856


### Observations <a name="obs2">

This is looking much better! Because our price depends on **multiple features**, it only makes sense to perform multi-variate linear regression on it. Let's fill out our observations table with our new findings.

| Model | Observation | R2 |
| :----- | :----------- | :-------- |
| Basic Linear Regression | Large spread in data | 0.16 |
| Multi-variate Linear Regression | Much better score, closer fit, but can do better | 0.61 |
    
---

## Polynomial Regression <a name="poly">

Polynomial regression takes linear regression with multiple features one step further by introducing new features that are existing features raised to some power. This allows us to fit non-linear functions, and since many real-world problems involve non-linear relationships, polynomial regression lets us train models that fit those relationships better.

### Intuition <a name="intuition">

Say we used the single variable linear regression example from above. We can turn that linear regression problem into a polynomial regression problem by introducing a few more features:


<div style="text-align: center">
    <div>&nbsp;</div>
    $\hat{y} = w_1x_1 + w_2x_2 + w_3x_3 + b$
</div>

However, instead of letting $x_2$ and $x_3$ be independent features, we will define them in terms of $x_1$:

<div style="text-align: center">
    <div>&nbsp;</div>
    $x_2 = x_1^2, \ \ \ \ x_3 = x_1^3$
</div>

Therefore, the final form of the hypothesis function is as shown. Note that we renamed $x_1$ as just $x$:

<div style="text-align: center">
    <div>&nbsp;</div>
    $\hat{y} = w_1x + w_2x^2 + w_3x^3 + b$
</div>
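The substitution above can be sketched directly in NumPy: build the extra columns $x^2$ and $x^3$ from $x$, then apply the ordinary linear formula to the widened feature matrix. The helper name `cubic_features` and the weights are made up for illustration.

```python
import numpy as np

def cubic_features(x):
    """Stack x, x^2 and x^3 as three feature columns."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x ** 2, x ** 3])

x = np.array([1.0, 2.0, 3.0])
X_poly = cubic_features(x)       # shape (3, 3): one row per example

# Illustrative weights and bias (assumed, not fitted)
w = np.array([1.0, 0.5, -0.1])   # w_1, w_2, w_3
b = 2.0

# Still a linear model in the new features, but a cubic curve in x
y_hat = X_poly @ w + b
print(X_poly)
print(y_hat)
```

The model stays linear in its parameters, which is why the same fitting procedure as before still applies.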

With polynomial regression, we are able to fit more complicated non-linear functions to our data. The image below shows an example of a function that is possible with polynomial regression. Note the non-linearity shown by the curves in the surface:

<img src="images/Polynomial-Regression.png" alt="Cannot display image" style="width:700px">



See for yourself by playing around with these sliders. Try to find a combination of $w_1$, $w_2$, and $w_3$ that fits the data.

In [14]:
show_poly()


### Preparing our Features and Fitting our Model<a name="features2">

To train a polynomial regression model, we first need to obtain the training data, and then convert each feature into many polynomial features. Luckily, scikit-learn provides a handy class called `PolynomialFeatures` that will transform our linear features into polynomial ones. Here, we will create a polynomial transformer with a degree of 3.

In [15]:
from sklearn.preprocessing import PolynomialFeatures

# Used to convert existing features into polynomial features
poly = PolynomialFeatures(degree=3)

# Transforming the input data: fit the transformer on the training set only,
# then reuse it unchanged on the test set
X_train_p = poly.fit_transform(X_train_m)
X_test_p = poly.transform(X_test_m)

# labels do not change from before
Y_train_p = Y_train_m
Y_test_p = Y_test_m

linreg_p = LinearRegression()
linreg_p.fit(X_train_p, Y_train_p)

print(f'R2: {linreg_p.score(X_test_p, Y_test_p)}')



R2: 0.7770001484468039


### Observations<a name="obs3">

We have achieved our highest $R^2$ value yet! Using polynomial regression, we were able to fit a non-linear function much closer to our data than we were able to do with linear functions. Finally, let's fill out our observation table:

| Model | Observation | R2 |
| :----- | :----------- | :-------- |
| Basic Linear Regression | Large spread in data | 0.16 |
| Multi-variate Linear Regression | Much better score, closer fit, but can do better | 0.61 |
| Polynomial Regression | Even higher score on the test data than multi-variate linear regression | 0.77 |

## $\mathcal{Fin}$

Congrats on finishing the first notebook on our Beginner AI Course! Join the instructor in our next activity as we put what we learned about linear regression to practice!

<img src="images/umaru.png" alt="Cannot display Image" style="width:700px">
