<a href="https://colab.research.google.com/github/wiemila/ML-courses/blob/main/Copy_of_DNN_Lab_1_MSLE_student_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Academy of Innovative Applications of Digital Technologies. Operational Programme Digital Poland 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Project co-financed by the European Union under the European Regional Development Fund
Operational Programme Digital Poland 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation"
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)" (Academy of Innovative Applications of Digital Technologies)
    </center>

# Linear regression

In this exercise, you will use linear regression to predict flat (apartment) prices. Training will be handled via gradient descent. We will:
* have multiple features (i.e. variables used to make the prediction),
* employ some basic feature engineering,
* work with a non-standard loss function.

Let's start by obtaining the data.

In [2]:
!wget --no-verbose -O mieszkania.csv https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
!wget --no-verbose -O mieszkania_test.csv https://www.dropbox.com/s/dbrj6sbxb4ayqjz/mieszkania_test.csv?dl=1
!head mieszkania.csv mieszkania_test.csv

2025-10-08 14:40:55 URL:https://uc6b79701c12f1dcb99f87c6e84b.dl.dropboxusercontent.com/cd/0/inline/Cy3xhOYST9awmfbngpVyr9rsRYJP5kSHGdjCBesZsIkuGt-nKFeuTQwCZ66N_AO_kYC7XdML-4FdtchXqvgqSoDuHceVQ8I4pBFWyq_YrDFU3O9YnK-mbhmIQaVcqGJU9fY/file?dl=1 [6211/6211] -> "mieszkania.csv" [1]
2025-10-08 14:40:56 URL:https://uccc5a5337e3468af15a2f117098.dl.dropboxusercontent.com/cd/0/inline/Cy0CEelbDSo1AjjwHT65ZbSD7tA7hvaMUWx-ZarvSRshyDU50NIyspS57UWJogvwkAFdMtIwtizvbuYw22XPwztHxGTYUDX_f8EPCirrA7ovT7QP3qUCtNylP0HF6wWxYdo/file?dl=1 [6247/6247] -> "mieszkania_test.csv" [1]
==> mieszkania.csv <==
m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
104,mokotowo,2,2,1940,1,780094
43,ochotowo,1,1,1970,1,346912
128,grodziskowo,3,2,1916,1,523466
112,mokotowo,3,2,1920,1,830965
149,mokotowo,3,3,1977,0,1090479
80,ochotowo,2,2,1937,0,599060
58,ochotowo,2,1,1922,0,463639
23,ochotowo,1,1,1929,0,166785
40,mokotowo,1,1,1973,0,318849

==> mieszkania_test.csv <==
m2,dzielnica,ilość_sypialni,ilość_

Each row in the data represents a separate flat. Our goal is to use the data from `mieszkania.csv` to create a model that can predict a flat's price (i.e. `cena`) given its features (i.e. `m2`, `dzielnica`, `ilość_sypialni`, ...).

We should use only `mieszkania.csv` (dubbed the training dataset) to make our decisions and create the model. The (only) purpose of `mieszkania_test.csv` is to test our model on **unseen** data.

In [3]:
%matplotlib inline

from typing import Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from tqdm.auto import tqdm

NDArray = np.ndarray[Any, Any]

np.set_printoptions(precision=4, suppress=True)
np.random.seed(357)

## Loading and converting data

Let's start by loading the data and showing the range of prices we're working with.

In [4]:
def load(path: str) -> tuple[NDArray, NDArray]:
    """
    Returns (x, y) where:
    - x: input features, shape (n_apartments, n_features)
    - y: price, shape (n_apartments,)
    """
    data = pd.read_csv(path)
    y = data["cena"].to_numpy()
    x = data.loc[:, data.columns != "cena"].to_numpy()
    return x, y

In [5]:
x_train, y_train = load("mieszkania.csv")
x_test, y_test = load("mieszkania_test.csv")

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(200, 6) (200,)
(200, 6) (200,)


In [6]:
print(np.min(y_train), np.max(y_train), np.mean(y_train))

102572 1102309 507919.49


In [7]:
x_train[:3]

array([[104, 'mokotowo', 2, 2, 1940, 1],
       [43, 'ochotowo', 1, 1, 1970, 1],
       [128, 'grodziskowo', 3, 2, 1916, 1]], dtype=object)

We'll need to convert features to floats.

In [8]:
# Convert column 1 from str to (ordinal) int.
# (One-hot encoding would be better, but ordinal is OK for today.)
label_encoder = LabelEncoder()
label_encoder.fit(x_train[:, 1])
x_train[:, 1] = label_encoder.transform(x_train[:, 1])
x_test[:, 1] = label_encoder.transform(x_test[:, 1])

# Convert ints to float.
x_train = x_train.astype(np.float64)
x_test = x_test.astype(np.float64)

In [9]:
x_train[:3]

array([[ 104.,    1.,    2.,    2., 1940.,    1.],
       [  43.,    2.,    1.,    1., 1970.,    1.],
       [ 128.,    0.,    3.,    2., 1916.,    1.]])

## The loss and constant models

Our predictions should minimize the so-called *mean squared logarithmic error*:
$$
MSLE = \frac{1}{n} \sum_{i=1}^n (\log(1+y_i) - \log(1+p_i))^2,
$$
where $y_i$ is the ground truth, and $p_i$ is our prediction.

Let's implement the loss function first.

In [10]:
def mse(ys: NDArray, ps: NDArray) -> np.float64:
    assert ys.shape == ps.shape
    return np.mean((ys - ps) * (ys - ps))

In [11]:
def msle(ys: NDArray, ps: NDArray) -> np.float64:
    assert ys.shape == ps.shape
    return np.mean((np.log1p(ys) - np.log1p(ps)) ** 2)

The simplest model predicts the same constant for each instance. Test your implementation of `msle` by evaluating a model that always outputs the mean price.

In [22]:
####################################################
# TODO: Compute msle for outputting the mean price #
####################################################
mean_price = np.mean(y_train)
# Predictions must have the same shape as the targets,
# hence a constant array filled with the mean price.
mean_preds = np.full(y_test.shape, mean_price)
print(mse(y_test, mean_preds))
print(msle(y_test, mean_preds))

86180713197.04451
0.4284115392580848


Recall that outputting the mean minimizes $MSE$. However, we're now dealing with $MSLE$.

Think of a constant that should result in the lowest $MSLE$.

In [19]:
#############################################
# TODO: Find this constant and compute msle #
#############################################
def example_plot():
    # Scan candidate constants between the cheapest and the most expensive flat.
    costs = np.linspace(np.min(y_train), np.max(y_train), 100)
    losses = [msle(y_train, np.full(y_train.shape, c)) for c in costs]

    minimizer = costs[np.argmin(losses)]
    optimized_loss = np.min(losses)
    print(f"MSLE = {optimized_loss} at cost = {minimizer}")
    plt.plot(costs, losses)
    plt.plot([minimizer], [optimized_loss], marker="o")


example_plot()

In [None]:
# Compute this value here directly instead of reading it off the plot.
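
One way to get the constant without scanning a grid, as a sketch: setting the derivative of $MSLE$ with respect to a constant prediction $c$ to zero gives $\log(1+c) = \frac{1}{n}\sum_{i=1}^n \log(1+y_i)$, so $c = \exp\left(\frac{1}{n}\sum_{i=1}^n \log(1+y_i)\right) - 1$. The helper name `best_constant_msle` and the toy prices below are illustrative assumptions, not part of the lab.

```python
import numpy as np


def msle(ys, ps):
    """Mean squared logarithmic error, as defined earlier in the notebook."""
    return np.mean((np.log1p(ys) - np.log1p(ps)) ** 2)


def best_constant_msle(ys):
    """Constant c minimizing MSLE: log(1 + c) equals the mean of log(1 + y)."""
    return np.expm1(np.mean(np.log1p(ys)))


# Toy prices (made up for illustration only).
toy_prices = np.array([100_000.0, 400_000.0, 900_000.0])
c = best_constant_msle(toy_prices)
loss_at_c = msle(toy_prices, np.full_like(toy_prices, c))
loss_at_mean = msle(toy_prices, np.full_like(toy_prices, toy_prices.mean()))
print(c, loss_at_c, loss_at_mean)
```

On the real data, the same call on `y_train` could be compared against the minimum found on the plot.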

## Linear regression (standard)

Now, let's implement training of a standard linear regression model via gradient descent.

In [27]:
def train(
    x: NDArray, y: NDArray, alpha: float = 1e-7, n_iterations: int = 100000
) -> tuple[NDArray, np.float64]:
    """Linear regression (which optimizes MSE). Returns (weights, bias)."""

    # B is batch size (number of observations).
    # F is number of (input) features.
    B, F = x.shape
    assert y.shape == (B,)

    # Start from the mean of the targets passed in (not the global y_train).
    bias = np.mean(y)
    weights = np.random.uniform(low=-1 / np.sqrt(F), high=+1 / np.sqrt(F), size=(F,))

    for i in tqdm(range(n_iterations)):
        preds = weights @ x.T + bias  # (F,) @ (F, B) -> (B,)

        if i % 1000 == 0:
            # Log the loss that is actually being optimized here (MSE).
            print("loss", mse(y, preds))

        # Gradients of MSE = mean((preds - y)^2) w.r.t. bias and weights.
        errors = preds - y
        grad_bias = 2 * np.mean(errors)
        grad_weights = 2 * (errors @ x) / B
        bias -= alpha * grad_bias
        weights -= alpha * grad_weights
    return weights, bias


weights, bias = train(x_train, y_train)
# Shapes: (F,) @ (F, B) -> (B,), which is why x_test has to be transposed.
preds_test = weights @ x_test.T + bias
print("test MSLE:", msle(y_test, preds_test))

## Linear regression (MSLE)

Note that the loss function that the algorithm optimizes (i.e. $MSE$) differs from $MSLE$. We've already seen that this may result in a suboptimal solution.

How can you change the setting so that we optimize $MSLE$ instead?

Hint:
<sub><sup><sub><sup><sub><sup>
Be lazy. We don't want to change the algorithm.
Use the chain rule and previous computations to get formulas for the gradient.
</sup></sub></sup></sub></sup></sub>
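
One possible reading of the hint, as a sketch rather than the reference solution: with predictions $p_i = w \cdot x_i + b$, the chain rule gives $\frac{\partial MSLE}{\partial p_i} = \frac{2}{n} \cdot \frac{\log(1+p_i) - \log(1+y_i)}{1+p_i}$, and the gradients with respect to $w$ and $b$ follow exactly as in the $MSE$ case. The function name `msle_gradients` and the toy data are assumptions made for illustration, and the formula requires $p_i > -1$.

```python
import numpy as np


def msle_gradients(x, y, weights, bias):
    """Gradients of MSLE for predictions p = x @ weights + bias (assumes p > -1)."""
    n = len(y)
    preds = x @ weights + bias
    # dMSLE/dp_i = (2 / n) * (log(1 + p_i) - log(1 + y_i)) / (1 + p_i)
    d_preds = 2.0 / n * (np.log1p(preds) - np.log1p(y)) / (1.0 + preds)
    # p_i is linear in the parameters, so dp_i/dw = x_i and dp_i/db = 1.
    grad_weights = d_preds @ x
    grad_bias = np.sum(d_preds)
    return grad_weights, grad_bias


# Toy data (made up) with strictly positive predictions.
rng = np.random.default_rng(0)
x_toy = rng.uniform(1.0, 5.0, size=(8, 3))
y_toy = rng.uniform(10.0, 50.0, size=8)
w_toy = np.array([1.0, 2.0, 3.0])
b_toy = 4.0
grad_w, grad_b = msle_gradients(x_toy, y_toy, w_toy, b_toy)
print(grad_w, grad_b)
```

These gradients could replace the $MSE$ gradients inside an otherwise unchanged gradient descent loop. The learning rate would need retuning, since the gradient magnitudes differ by many orders of magnitude.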

In [None]:
def train_msle(
    x: NDArray, y: NDArray, alpha: float = 1e+4, n_iterations: int = 50000
) -> tuple[NDArray, NDArray]:
    """Gradient descent for MSLE."""

    #############################################
    # TODO: Optimize msle and compare the error #
    #############################################


weights, bias = train_msle(x_train, y_train)
preds_test = ... # TODO #
print("test MSLE: ", msle(y_test, preds_test))

  0%|          | 0/50000 [00:00<?, ?it/s]

loss MSE=1.96e+11, MSLE=1.37
loss MSE=4.69e+10, MSLE=0.168
loss MSE=2.97e+10, MSLE=0.0906
loss MSE=2.16e+10, MSLE=0.0651
loss MSE=1.82e+10, MSLE=0.057
loss MSE=1.67e+10, MSLE=0.0544
loss MSE=1.61e+10, MSLE=0.0536
loss MSE=1.58e+10, MSLE=0.0533
loss MSE=1.56e+10, MSLE=0.0532
loss MSE=1.55e+10, MSLE=0.0532
loss MSE=1.55e+10, MSLE=0.0531
loss MSE=1.54e+10, MSLE=0.0531
loss MSE=1.54e+10, MSLE=0.0531
loss MSE=1.54e+10, MSLE=0.0531
loss MSE=1.54e+10, MSLE=0.0531
loss MSE=1.54e+10, MSLE=0.0531
loss MSE=1.54e+10, MSLE=0.0531
test MSLE:  0.08069423124815929


## Feature engineering

Without any feature engineering our model approximates the price as a linear combination of original features:
$$
\text{price} \approx w_1 \cdot \text{area} + w_2 \cdot \text{district} + \dots.
$$
Let's now introduce some interactions between the variables. For instance, consider the following formula:
$$
\text{price} \approx w_1 \cdot \text{area} \cdot \text{avg. price in the district per sq. meter} + w_2 \cdot \dots + \dots.
$$
Here, we model the price with far greater granularity, and we may expect to see more accurate results.

Add some feature engineering to your model. Be sure to play with the data and not with the algorithm's code.

Think how to make sure that your model is capable of capturing the $w_1 \cdot \text{area} \cdot \text{avg. price...}$ part, without actually computing the averages.

Note that you may need to change the learning rate substantially.

Hint:
<sub><sup><sub><sup><sub><sup>
Is having a binary encoding for each district and multiplying it by area enough?
</sup></sub></sup></sub></sup></sub>

Hint 2:
<sub><sup><sub><sup><sub><sup>
Why not multiply everything together? I.e. (A,B,C) -> (AB,AC,BC).
</sup></sub></sup></sub></sup></sub>
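
As a minimal sketch of the idea in the first hint: if each district gets its own binary column, and each of those columns is multiplied by the area, the weight learned for a "district times area" column plays the role of that district's price per square meter. The function name `add_district_area_features`, the column positions, and the toy rows are assumptions for illustration. The sketch also assumes districts are already encoded as integers `0..n_districts-1`, as done by the `LabelEncoder` earlier.

```python
import numpy as np


def add_district_area_features(x, area_col=0, district_col=1, n_districts=None):
    """Replace the ordinal district code by one-hot and (one-hot * area) columns."""
    districts = x[:, district_col].astype(int)
    if n_districts is None:
        n_districts = districts.max() + 1
    one_hot = np.eye(n_districts)[districts]        # shape (B, n_districts)
    area_per_district = one_hot * x[:, [area_col]]  # shape (B, n_districts)
    rest = np.delete(x, district_col, axis=1)       # all columns except the district code
    return np.concatenate([rest, one_hot, area_per_district], axis=1)


# Toy rows: (area, district code, bedrooms).
x_toy = np.array([[104.0, 1.0, 2.0], [43.0, 2.0, 1.0], [128.0, 0.0, 3.0]])
x_fe = add_district_area_features(x_toy, n_districts=3)
print(x_fe.shape)  # (3, 8)
```

Passing `n_districts` explicitly keeps the training and test matrices the same width even if some district is missing from one of them. The new columns have very different scales, which is one reason the learning rate may need to change.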

In [None]:
###############################################
# TODO: Implement the feature engineering part #
###############################################

In [None]:
##############################################################
# TODO: Test your solution on the training and test datasets #
##############################################################

# Validation

In this exercise you will implement a validation pipeline: split the non-test set into train and validation sets and select the best model based on validation results.

So far you tested your model against the training and test datasets. As you should observe, there's a gap between the results. By validating your model, you should be able to better anticipate the test time performance and compare different models and hyperparameters on datasets they are not over-fitted to.

Implement the basic validation method, i.e. a random split. Test it with your model from Exercise MSLE.

In [None]:
x_train_val, y_train_val = x_train, y_train
x_test, y_test = x_test, y_test


def random_split(
    x: NDArray, y: NDArray, val_ratio: float = 0.2
) -> tuple[tuple[NDArray, NDArray], tuple[NDArray, NDArray]]:
    """Returns (x_train, y_train), (x_val, y_val)."""

    idxs = np.random.permutation(len(x))

    ######################################################
    # TODO: Implement the basic validation split method. #
    ######################################################




(x_train, y_train), (x_val, y_val) = random_split(x_train_val, y_train_val)

len(x_train), len(x_val), len(x_test)

(160, 40, 200)
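
A random split could look roughly like the following sketch. The name `random_split_sketch` and the `seed` parameter are assumptions added here for reproducibility. They are not part of the stub above, which relies on the global NumPy seed.

```python
import numpy as np


def random_split_sketch(x, y, val_ratio=0.2, seed=None):
    """Returns (x_train, y_train), (x_val, y_val) from a random permutation."""
    rng = np.random.default_rng(seed)
    idxs = rng.permutation(len(x))
    n_val = int(round(len(x) * val_ratio))
    # The first n_val shuffled indices form the validation set, the rest train.
    val_idxs, train_idxs = idxs[:n_val], idxs[n_val:]
    return (x[train_idxs], y[train_idxs]), (x[val_idxs], y[val_idxs])


# Toy data: 200 rows whose single feature equals the target.
x_toy = np.arange(200, dtype=float).reshape(-1, 1)
y_toy = np.arange(200, dtype=float)
(x_tr, y_tr), (x_va, y_va) = random_split_sketch(x_toy, y_toy, val_ratio=0.2, seed=0)
print(len(x_tr), len(x_va))  # 160 40
```

Indexing `x` and `y` with the same index arrays keeps each flat's features and price aligned after shuffling.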

In [None]:
#############################################################
# TODO: compare MSLE on training, validation, and test sets #
#############################################################

## Cross-validation

For random-split validation to be reliable, a significant chunk of the training data may have to be held out. Cross-validation gets around this problem by letting every sample serve for validation in one fold and for training in the others.

![alt-text](https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png)

Let's now implement the method. Make sure that:
* number of partitions is a parameter,
* the method is not limited to `mieszkania.csv`,
* the method is not limited to one specific model.

In [None]:
####################################
# TODO: Implement cross-validation #
####################################
def kfold(x: NDArray, y: NDArray, n_folds: int = 5, shuffle: bool = False) -> list[float]:
    """Returns losses for each fold."""



losses = kfold(x_train_val, y_train_val, n_folds=3, shuffle=False)
print(f"k-fold loss: {np.mean(losses):.4f} +- {np.std(losses):.4f}")
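
One way to satisfy the three requirements above is to pass the model in as functions, as in the sketch below. The name `kfold_sketch` and the `fit`, `predict`, `loss`, and `seed` parameters are assumptions that go beyond the stub's signature. The least-squares model is only a stand-in to keep the example self-contained.

```python
import numpy as np


def kfold_sketch(x, y, fit, predict, loss, n_folds=5, shuffle=False, seed=None):
    """Returns the validation loss for each of the n_folds folds."""
    idxs = np.arange(len(x))
    if shuffle:
        idxs = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idxs, n_folds)  # fold sizes differ by at most one
    losses = []
    for k in range(n_folds):
        val_idxs = folds[k]
        train_idxs = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(x[train_idxs], y[train_idxs])
        losses.append(float(loss(y[val_idxs], predict(model, x[val_idxs]))))
    return losses


def fit_lstsq(x, y):
    """Stand-in model: ordinary least squares with a bias column."""
    design = np.c_[x, np.ones(len(x))]
    return np.linalg.lstsq(design, y, rcond=None)[0]


def predict_lstsq(coef, x):
    return np.c_[x, np.ones(len(x))] @ coef


def mse(ys, ps):
    return np.mean((ys - ps) ** 2)


# Toy data that is exactly linear, so every fold's loss should be near zero.
x_toy = np.arange(12, dtype=float).reshape(-1, 1)
y_toy = 2.0 * x_toy[:, 0] + 1.0
losses = kfold_sketch(x_toy, y_toy, fit_lstsq, predict_lstsq, mse, n_folds=3)
print(losses)
```

With `shuffle=False` the folds are contiguous blocks of rows, which matters whenever the file is sorted in some meaningful order.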


## Investigating input data

Recall that sometimes validation may be tricky, e.g. significant class imbalance, having a small number of subjects, geographically clustered instances...

What could in theory go wrong here with random, unstratified partitions? Think about potential solutions and investigate the data in order to check whether these problems arise here.

In [None]:
##############################
# TODO: Investigate the data #
##############################
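
Two checks that might be worth running here, sketched under assumptions: whether categories such as districts are spread similarly across partitions, and whether identical rows occur in two partitions (which would leak information). The helper names `category_shares` and `count_shared_rows` and the toy arrays are illustrative only.

```python
import numpy as np


def category_shares(values):
    """Fraction of samples falling into each category."""
    cats, counts = np.unique(values, return_counts=True)
    return {c: n / len(values) for c, n in zip(cats.tolist(), counts.tolist())}


def count_shared_rows(a, b):
    """Number of rows of `a` that also appear, exactly, in `b`."""
    rows_b = {tuple(row) for row in b.tolist()}
    return sum(tuple(row) in rows_b for row in a.tolist())


# Toy stand-ins for the district column of two partitions.
districts_train = np.array([0, 0, 1, 1, 1, 2])
districts_val = np.array([1, 1, 1, 1])
print(category_shares(districts_train))
print(category_shares(districts_val))

# Toy stand-ins for two feature matrices sharing one identical row.
toy_a = np.array([[104.0, 1.0], [43.0, 2.0]])
toy_b = np.array([[43.0, 2.0], [80.0, 2.0]])
print(count_shared_rows(toy_a, toy_b))
```

If the shares differ strongly between partitions, a stratified split by district is one possible remedy.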