# Lab 05: Linear Regression
In Lab 05, we will write a simple gradient descent algorithm to learn weights for housing price prediction. We will only use one variable (area), but we will train over a whole dataset. 

By the end of this lab, you should know very clearly how Gradient Descent works for single-variable linear regression.

**SUBMISSION**

You should submit a `lab05-yourname.py` file on Moodle, NOT a Jupyter Notebook, which answers the EXERCISE QUESTIONS (using comments) and which solves the ASSIGNMENT below. PLEASE DO NOT USE ARABIC IN THE FILENAME.

**LAB CLASS SOLUTIONS**

You may use the code that is covered in class time, but you _must_ (re-)type it yourself!!  So, during lab class, I recommend that you open a `lab05-yourname.py` file in VSCode and try to run bits of it as we go along.

**DUE DATE**

30th April, 2025 -- 1 week

**GRADING**

This Lab is worth 12.5% of your overall course grade. Completeness, correct output/answers, and style are all part of the criteria.

**LATE WORK**

Late work will be penalized by 25 points. However, it can be submitted until the day of the Final Exam.

In [None]:
import numpy as np
import pandas as pd
import kagglehub
from random import random
from matplotlib import pyplot as plt

## Walkthrough: Import Data from Kaggle
We take our data from the [Housing Prices Dataset on Kaggle](https://www.kaggle.com/datasets/yasserh/housing-prices-dataset/data). It lists the `price` of each house alongside numeric features like `area` or `bedrooms` and categorical features like `basement` and `airconditioning`.

In [None]:
path = kagglehub.dataset_download("yasserh/housing-prices-dataset")
df = pd.read_csv(path + "/Housing.csv")
print(f"Imported housing prices dataset with {df.shape[0]} rows")

In [None]:
housing_prices = pd.DataFrame()
housing_prices['price_1M'] = df['price'].apply(lambda x:x/1000000)
housing_prices['area_100m'] = df['area'].apply(lambda x:x/1000)
housing_prices

### Exercise questions
**Q1**: What did we just do with the data in the code above? (Mark your answer by replacing the `_` with an X.)
```
    _ Regularize
    _ Normalize
    _ Calculate the gradient
    _ Determine the error
```

**Q2**: What kind of problem can we use the `housing_prices` variable to solve?
```
    _ Estimating a 'price' (numeric) based on 'area' (numeric)
    _ Estimating a 'price' (numeric) based on many variables
    _ Estimating the 'area' (numeric) based on many variables
    _ Classifying whether houses are desirable, based on 'price' and 'area'
```

## Walkthrough: Set random weights (set up for gradient descent)
In linear regression (as in many supervised learning techniques), we:

* Choose some random weights/parameters to start
* Calculate the gradient 
* Update the weights

We will do this with Gradient Descent using the Mean Squared Error (MSE) loss function (also known as L2 loss).
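
As a reminder of what the MSE measures, here is a minimal sketch on made-up numbers. The arrays `y_true` and `y_pred` are illustrative assumptions, not values from the housing dataset.

```python
import numpy as np

# Toy values, made up for illustration (not from the housing dataset)
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Mean Squared Error: the average of the squared differences
# between predictions and true values
mse = np.mean((y_pred - y_true) ** 2)
print(mse)
```

The squared differences here are 0.25, 0 and 1, so the loss is their average. A smaller MSE means the predictions sit closer to the true values.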

In [None]:
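# weights[0] is the bias (intercept) w_0, weights[1] is the slope w_1
# Each starts as a random value in [-0.5, 0.5)
weights = [random()-0.5 for _ in range(2)]
init_weights = weights.copy()  # keep the starting values for plotting
print(f"Initial weights: {weights}")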

plt.scatter(housing_prices['area_100m'],
            housing_prices['price_1M'])
x_line = np.linspace(housing_prices['area_100m'].min(),
                     housing_prices['area_100m'].max(), 100)
y_init = init_weights[1] * x_line + init_weights[0]
plt.plot(x_line, y_init, color='red', label="Original Weights")
plt.xlabel("Area (100m²)")
plt.ylabel("Price (1M$)")
plt.title("Housing Prices with regression of randomized initial weights")
plt.show()

### Exercise questions
**Q3**: What does the plot above show us?
```
    _ Categorical variables can be plotted as different points on a scatterplot
    _ Randomized starting points need only minor tweaking
    _ Prices tend to be higher for bigger houses
    _ A poorly fit red line means that the linear hypothesis space is inappropriate for this data
```

## Assignment (REQUIRED): Implement the gradient descent updates
In the previous section, we defined a weight $w_1$ (`weights[1]`) and a bias term $w_0$ (`weights[0]`). Your goal now is to update those weights by comparing your predicted prices to the actual prices. Essentially, we are fitting a line of the form $y=mx+b$, where $w_1$ plays the role of $m$ and $w_0$ the role of $b$. Let's call the `price_1M` our $y$ and the `area_100m` our $x$. When we want to make a prediction, we put a "hat" on it, so _predicted_ price is written $\hat{y}$.

**Step 1**: Calculate the predicted price
$$
\hat{y}=w_1 x + w_0
$$
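
Step 1 can be computed for every house at once. The sketch below uses hypothetical weights and areas chosen only to make the arithmetic easy; in the lab you would use `weights` and the `area_100m` column instead.

```python
import numpy as np

# Hypothetical weights and areas, chosen only for easy arithmetic
w0, w1 = 0.5, 2.0
x = np.array([1.0, 2.0, 3.0])

# NumPy broadcasts the scalars over the whole array,
# so this gives one prediction per house without a loop
y_hat = w1 * x + w0
print(y_hat)  # [2.5 4.5 6.5]
```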

**Step 2**: Calculate the average gradients. In class we saw the gradients for a single sample: $\frac{\partial L}{\partial w_0} = 2(\hat{y}-y)$ and $\frac{\partial L}{\partial w_1} = 2(\hat{y}-y)\cdot x$.

Here, though, we have many values of $x$ (we can say it is "vector-valued" and write it as $\mathbf{x}$), but we want to calculate just one weight update for all of them. So average the values above across all $n$ samples, where $\hat{y}_i$ and $y_i$ are the predicted and actual prices for sample $x_i$.
$$
\begin{aligned}
\frac{\partial L}{\partial w_0} &= \frac{1}{n}\sum_{i=1}^{n} 2(\hat{y}_i-y_i)\\
\frac{\partial L}{\partial w_1} &= \frac{1}{n}\sum_{i=1}^{n} 2(\hat{y}_i-y_i)\cdot x_i
\end{aligned}
$$

**Step 3**: Calculate the updated weights. We mark updated values with a "prime" ($w^\prime$ instead of $w$). Remember to include the learning rate, which you should set to $\eta=0.01$ for this Lab.
$$
\begin{aligned}
w_0^\prime &= w_0 - \eta \frac{\partial L}{\partial w_0}\\
w_1^\prime &= w_1 - \eta \frac{\partial L}{\partial w_1}
\end{aligned}
$$
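
To check your understanding of Steps 1 to 3, here is one update worked by hand. The numbers are made up for illustration (they are not from the housing dataset): two samples $(x, y) = (1, 2)$ and $(2, 3)$, starting weights $w_0 = 0$ and $w_1 = 1$, and $\eta = 0.01$.

The predictions for the two samples are:
$$
\hat{y} = 1\cdot 1 + 0 = 1 \quad\text{and}\quad \hat{y} = 1\cdot 2 + 0 = 2
$$
Both errors $\hat{y}-y$ equal $-1$. Averaging over the $n=2$ samples gives the two gradients:
$$
\text{gradient for } w_0:\quad \frac{1}{2}\big(2(-1) + 2(-1)\big) = -2
$$
$$
\text{gradient for } w_1:\quad \frac{1}{2}\big(2(-1)\cdot 1 + 2(-1)\cdot 2\big) = -3
$$
The updated weights are then:
$$
w_0^\prime = 0 - 0.01\cdot(-2) = 0.02, \qquad w_1^\prime = 1 - 0.01\cdot(-3) = 1.03
$$

Both weights move up because the line was predicting too low. If you feed these two samples to your own code, it should reproduce these numbers after one epoch.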

Then repeat Steps 1 to 3 a total of 1000 times (i.e., for 1000 epochs), using the updated weights each time.

Your task is to put this into code!

In [None]:
# Set the learning rate
learning_rate = 0.01

## YOUR CODE HERE ##


## Extra: Plot the updated regression line and the training curve


In [None]:
## YOUR CODE HERE ##


To get a plot of the loss, you need to have tracked the loss over each epoch. Go back and add code that calculates the mean squared error into your loop, at the end of each epoch. Here's `aima-python`'s implementation of the mean squared error calculation:
```python
def mean_squared_error_loss(x, y):
    return (1.0 / len(x)) * sum((_x - _y) ** 2 for _x, _y in zip(x, y))
```
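
A quick sanity check of this function on made-up numbers may help before you add it to your loop. The lists `predicted` and `actual` are illustrative assumptions, not values from the housing dataset.

```python
def mean_squared_error_loss(x, y):
    return (1.0 / len(x)) * sum((_x - _y) ** 2 for _x, _y in zip(x, y))

# Toy values, made up for illustration
predicted = [2.5, 4.5, 6.5]
actual = [2.0, 5.0, 6.5]

# Squared errors are 0.25, 0.25 and 0, so the loss is their average
loss = mean_squared_error_loss(predicted, actual)
print(loss)
```

The function works on plain lists as well as on NumPy arrays or pandas columns, since it only needs to iterate over pairs of values.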

Then, once you've kept a list of the loss values at each epoch, plot them below.

In [None]:
## YOUR CODE HERE ##
