<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60"></center>

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych. Program Operacyjny Polska Cyfrowa na lata 2014-2020
<hr>

<center><img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'></center>

<center>
Projekt współfinansowany ze środków Unii Europejskiej w ramach Europejskiego Funduszu Rozwoju Regionalnego
Program Operacyjny Polska Cyfrowa na lata 2014-2020,
Oś Priorytetowa nr 3 "Cyfrowe kompetencje społeczeństwa" Działanie  nr 3.2 "Innowacyjne rozwiązania na rzecz aktywizacji cyfrowej"
Tytuł projektu:  „Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)”
    </center>

# Linear regression

In this exercise, you will use linear regression to predict flat (apartment) prices. Training will be handled via gradient descent. We will:
* have multiple features (i.e. variables used to make the prediction),
* employ some basic feature engineering,
* work with a non-standard loss function.

Let's start by obtaining the data.

In [None]:
!wget --no-verbose -O mieszkania.csv https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
!wget --no-verbose -O mieszkania_test.csv https://www.dropbox.com/s/dbrj6sbxb4ayqjz/mieszkania_test.csv?dl=1
!head mieszkania.csv mieszkania_test.csv

2025-11-03 01:08:33 URL:https://ucd675835e0f7c284f17db66cb72.dl.dropboxusercontent.com/cd/0/inline/C0Z3sJW5VuInDGt7bX-LVr7WKLdU8fHanjSDr7w4SIXdFP6pW-F-AUjVzQLDG3ZjiKbUp8IqZuK3reaQN148EhcUqxZt_Af_6umPiJFklRrlCK535wD0KSIQVFGJnMlC5Dw/file?dl=1 [6211/6211] -> "mieszkania.csv" [1]
2025-11-03 01:08:35 URL:https://uc2fb86affb5c8290b654e45ed5d.dl.dropboxusercontent.com/cd/0/inline/C0Zatcyb93Fbceq4oscOpcWLCLg607XszkDPBoj9iQvreDA_mu41a6oTa48Cpvpdc3djiaHjhRmrEWyS_pmhFDK5KjCQ1ksF-sHbNh3ZIhMFcM0WoABHh0qJ-ySEqmqWvEU/file?dl=1 [6247/6247] -> "mieszkania_test.csv" [1]
==> mieszkania.csv <==
m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
104,mokotowo,2,2,1940,1,780094
43,ochotowo,1,1,1970,1,346912
128,grodziskowo,3,2,1916,1,523466
112,mokotowo,3,2,1920,1,830965
149,mokotowo,3,3,1977,0,1090479
80,ochotowo,2,2,1937,0,599060
58,ochotowo,2,1,1922,0,463639
23,ochotowo,1,1,1929,0,166785
40,mokotowo,1,1,1973,0,318849

==> mieszkania_test.csv <==
m2,dzielnica,ilość_sypialni,ilość_

Each row in the data represents a separate flat. Our goal is to use the data from `mieszkania.csv` to create a model that can predict a flat's price (i.e. `cena`) given its features (i.e. `m2,dzielnica,ilosc_sypialni,...`).

We should use only `mieszkania.csv` (dubbed the training dataset) to make our decisions and create the model. The (only) purpose of `mieszkania_test.csv` is to test our model on **unseen** data.

In [None]:
%matplotlib inline

from typing import Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from tqdm.auto import tqdm

NDArray = np.ndarray[Any, Any]

np.set_printoptions(precision=4, suppress=True)
np.random.seed(357)

## Loading and converting data

Let's start by loading the data and showing the range of prices we're working with.

In [None]:
def load(path: str) -> tuple[NDArray, NDArray]:
    """
    Returns (x, y) where:
    - x: input features, shape (n_apartments, n_features)
    - y: price, shape (n_apartments,)
    """
    data = pd.read_csv(path)
    y = data["cena"].to_numpy()
    x = data.loc[:, data.columns != "cena"].to_numpy()
    return x, y

In [None]:
x_train, y_train = load("mieszkania.csv")
x_test, y_test = load("mieszkania_test.csv")

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(200, 6) (200,)
(200, 6) (200,)


In [None]:
print(np.min(y_train), np.max(y_train), np.mean(y_train))

102572 1102309 507919.49


In [None]:
x_train[:3]

array([[104, 'mokotowo', 2, 2, 1940, 1],
       [43, 'ochotowo', 1, 1, 1970, 1],
       [128, 'grodziskowo', 3, 2, 1916, 1]], dtype=object)

We'll need to convert features to floats.

In [None]:
# Convert column 1 from str to (ordinal) int.
# (One-hot encoding would be better, but ordinal is OK for today.)
label_encoder = LabelEncoder()
label_encoder.fit(x_train[:, 1])
x_train[:, 1] = label_encoder.transform(x_train[:, 1])
x_test[:, 1] = label_encoder.transform(x_test[:, 1])

# Convert ints to float.
x_train = x_train.astype(np.float64)
x_test = x_test.astype(np.float64)

In [None]:
x_train[:3]

array([[ 104.,    1.,    2.,    2., 1940.,    1.],
       [  43.,    2.,    1.,    1., 1970.,    1.],
       [ 128.,    0.,    3.,    2., 1916.,    1.]])

## The loss and constant models

Our predictions should minimize the so-called *mean squared logarithmic error*:
$$
MSLE = \frac{1}{n} \sum_{i=1}^n (\log(1+y_i) - \log(1+p_i))^2,
$$
where $y_i$ is the ground truth, and $p_i$ is our prediction.

Let's implement the loss function first.

In [None]:
def mse(ys: NDArray, ps: NDArray) -> np.float64:
    assert ys.shape == ps.shape
    return np.mean((ys - ps) * (ys - ps))

In [None]:
def msle(ys: NDArray, ps: NDArray) -> np.float64:
    assert ys.shape == ps.shape
    n = ys.shape[0]
    return np.sum((np.log(1+np.array(ys)) - np.log(1+np.array(ps)))**2, axis=0) / n

The simplest model is predicting the same constant for each instance. Test your implementation of msle against outputing the mean price.

In [None]:
###################################################
# TODO: Compute msle for outputing the mean price #
###################################################
# ys = y_train
# n = ys.shape[0]
# print(n)
# ps = np.full(y_train.shape, 10)
# print(np.sum((np.log(1+ys) - np.log(1+ps))**2, axis=0) / n)
print(msle(y_train, np.full(y_train.shape, 10)))

112.23740241402163


Recall that outputing the mean minimizes $MSE$. However, we're now dealing with $MSLE$.

Think of a constant that should result in the lowest $MSLE$.

In [None]:
#############################################
# TODO: Find this constant and compute msle #
#############################################
import scipy

print(scipy.stats.mstats.gmean(y_train))
print(msle(y_train, np.full(y_train.shape, np.mean(y_train))))
print(msle(y_train, np.full(y_train.shape, scipy.stats.mstats.gmean(y_train))))

431435.2744452117
0.3915253538257009
0.3648896122136122


## Linear regression (standard)

Now, let's implement training of a standard linear regression model via gradient descent.

In [None]:
import tqdm
from collections.abc import Callable


def predict(weights: NDArray, x: NDArray, bias: np.float64 = 0.0) -> NDArray:
  return np.array(np.dot(x, weights) + bias)

def train(
    x: NDArray, y: NDArray, loss_f: Callable, alpha: float = 1e-7, n_iterations: int = 100000
) -> tuple[NDArray, np.float64]:
    """Linear regression (which optimizes MSE). Returns (weights, bias)."""

    # B is batch size (number of observations).
    # F is number of (input) features.
    B, F = x.shape
    assert y.shape == (B,)
    weights = np.random.random(F,).astype(np.float64)
    bias = 0.0

    for i in tqdm.tqdm(range(n_iterations)):
      predictions = predict(weights, x)
      loss = loss_f(y, predictions)

      # Calculate gradients
      d_weights = (2/B) * np.dot(x.T, (predictions - y))
      d_bias = (2/B) * np.sum(predictions - y)

      # Update weights and bias
      weights -= alpha * d_weights
      bias -= alpha * d_bias

    return weights, bias


weights, bias = train(x_train.astype(np.float64), y_train, mse)
preds_test = predict(weights, x_test.astype(np.float64)) # TODO #
print("test MSLE:", msle(y_test, preds_test))

100%|██████████| 100000/100000 [00:03<00:00, 30155.66it/s]

test MSLE: 0.0803803429858911





**by Gemini**

Let's derive the gradients for the Mean Squared Error (MSE) loss function.

The MSE loss is:
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - p_i)^2 $$
where $y_i$ is the true value, $p_i$ is the prediction, and $n$ is the number of samples.

Our prediction $p_i$ for a linear model is:
$$ p_i = \mathbf{w}^T \mathbf{x}_i + b $$
where $\mathbf{w}$ is the vector of weights, $\mathbf{x}_i$ is the vector of features, and $b$ is the bias.

Using the chain rule, the gradient with respect to weights ($\mathbf{w}$) is:
$$ \frac{\partial MSE}{\partial \mathbf{w}} = \frac{\partial MSE}{\partial p_i} \cdot \frac{\partial p_i}{\partial \mathbf{w}} $$
$$ \frac{\partial MSE}{\partial p_i} = -\frac{2}{n} (y_i - p_i) $$
$$ \frac{\partial p_i}{\partial \mathbf{w}} = \mathbf{x}_i $$
Summing over all samples:
$$ \frac{\partial MSE}{\partial \mathbf{w}} = \sum_{i=1}^n \left( -\frac{2}{n} (y_i - p_i) \right) \mathbf{x}_i = \frac{2}{n} \sum_{i=1}^n (p_i - y_i) \mathbf{x}_i $$
In matrix form:
$$ \frac{\partial MSE}{\partial \mathbf{w}} = \frac{2}{n} \mathbf{X}^T (\mathbf{p} - \mathbf{y}) $$

The gradient with respect to bias ($b$) is:
$$ \frac{\partial MSE}{\partial b} = \frac{\partial MSE}{\partial p_i} \cdot \frac{\partial p_i}{\partial b} $$
$$ \frac{\partial MSE}{\partial p_i} = -\frac{2}{n} (y_i - p_i) $$
$$ \frac{\partial p_i}{\partial b} = 1 $$
Summing over all samples:
$$ \frac{\partial MSE}{\partial b} = \sum_{i=1}^n \left( -\frac{2}{n} (y_i - p_i) \right) \cdot 1 = \frac{2}{n} \sum_{i=1}^n (p_i - y_i) $$

These gradients are used to update the weights and bias in gradient descent:
$$ \mathbf{w} := \mathbf{w} - \alpha \frac{\partial MSE}{\partial \mathbf{w}} $$
$$ b := b - \alpha \frac{\partial MSE}{\partial b} $$

## Linear regression (MSLE)

Note that the loss function that the algorithms optimizes (i.e $MSE$) differs from $MSLE$. We've already seen that this may result in a suboptimal solution.

How can you change the setting so that we optimze $MSLE$ instead?



```
# This is formatted as code
```

Hint:
<sub><sup><sub><sup><sub><sup>
Be lazy. We don't want to change the algorithm.
Use the chain rule and previous computations to get formulas for the gradient.
</sup></sub></sup></sub></sup></sub>

In [None]:
#TODO
def train_msle(
    x: NDArray, y: NDArray, alpha: float = 1e+4, n_iterations: int = 50000
) -> tuple[NDArray, NDArray]:
    weight, bias = train(x, y, msle)
    return weight, bias


weights, bias = train_msle(x_train, y_train)
preds_test = predict(weights, x_test) + bias
print("test MSLE: ", msle(y_test, preds_test))

100%|██████████| 100000/100000 [00:09<00:00, 10741.39it/s]

test MSLE:  0.08037778178926828





## Feature engineering

Without any feature engineering our model approximates the price as a linear combination of original features:
$$
\text{price} \approx w_1 \cdot \text{area} + w_2 \cdot \text{district} + \dots.
$$
Let's now introduce some interactions between the variables. For instance, let's consider a following formula:
$$
\text{price} \approx w_1 \cdot \text{area} \cdot \text{avg. price in the district per sq. meter} + w_2 \cdot \dots + \dots.
$$
Here, we model the price with far greater granularity, and we may expect to see more acurate results.

Add some feature engineering to your model. Be sure to play with the data and not with the algorithm's code.

Think how to make sure that your model is capable of capturing the $w_1 \cdot \text{area} \cdot \text{avg. price...}$ part, without actually computing the averages.

Note that you may need to change the learning rate substantially.

[link text](https://)Hint:
<sub><sup><sub><sup><sub><sup>
Is having a binary encoding for each district and multiplying it by area enough?
</sup></sub></sup></sub></sup></sub>

Hint 2:
<sub><sup><sub><sup><sub><sup>
Why not multiply everything together? I.e. (A,B,C) -> (AB,AC,BC).
</sup></sub></sup></sub></sup></sub>

In [None]:
import pandas as pd
mieszkania = pd.read_csv("mieszkania.csv")
mieszkania.head()

Unnamed: 0,m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
0,104,mokotowo,2,2,1940,1,780094
1,43,ochotowo,1,1,1970,1,346912
2,128,grodziskowo,3,2,1916,1,523466
3,112,mokotowo,3,2,1920,1,830965
4,149,mokotowo,3,3,1977,0,1090479


In [None]:
##############################################################
# TODO: Test your solution on the training and test datasets #
##############################################################
mieszkania_dataset = mieszkania.copy()
mieszkania_dataset = pd.get_dummies(mieszkania_dataset, columns=['dzielnica'])
mieszkania_dataset["total_rooms"] = mieszkania_dataset["ilość_łazienek"] + mieszkania_dataset["ilość_sypialni"]
mieszkania_dataset["area_per_room"] = mieszkania_dataset["m2"] / mieszkania_dataset["total_rooms"]
# mieszkania_dataset["mul1"] = mieszkania_dataset["m2"] * mieszkania_dataset["total_rooms"]
# mieszkania_dataset["mul2"] = mieszkania_dataset["m2"] * mieszkania_dataset["rok_budowy"]
# mieszkania_dataset["mul3"] = mieszkania_dataset["total_rooms"] * mieszkania_dataset["rok_budowy"]
mieszkania_dataset.head()

Unnamed: 0,m2,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena,dzielnica_grodziskowo,dzielnica_mokotowo,dzielnica_ochotowo,dzielnica_wolowo,total_rooms,area_per_room
0,104,2,2,1940,1,780094,False,True,False,False,4,26.0
1,43,1,1,1970,1,346912,False,False,True,False,2,21.5
2,128,3,2,1916,1,523466,True,False,False,False,5,25.6
3,112,3,2,1920,1,830965,False,True,False,False,5,22.4
4,149,3,3,1977,0,1090479,False,True,False,False,6,24.833333


In [None]:
all_columns = ["m2", "ilość_sypialni",
"ilość_łazienek", "rok_budowy", "parking_podziemny",
                                                   "dzielnica_grodziskowo",     "dzielnica_mokotowo",   "dzielnica_ochotowo",   "dzielnica_wolowo",
                                                   "total_rooms", "area_per_room"]
mieszkania_dataset_x = mieszkania_dataset[all_columns]
mieszkania_dataset_y = mieszkania_dataset["cena"].to_numpy()
mieszkania_dataset_x.head()

Unnamed: 0,m2,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,dzielnica_grodziskowo,dzielnica_mokotowo,dzielnica_ochotowo,dzielnica_wolowo,total_rooms,area_per_room
0,104,2,2,1940,1,False,True,False,False,4,26.0
1,43,1,1,1970,1,False,False,True,False,2,21.5
2,128,3,2,1916,1,True,False,False,False,5,25.6
3,112,3,2,1920,1,False,True,False,False,5,22.4
4,149,3,3,1977,0,False,True,False,False,6,24.833333


In [None]:
mieszkania_test = pd.read_csv("mieszkania_test.csv")
mieszkania_test_dataset = mieszkania.copy()
mieszkania_test_dataset = pd.get_dummies(mieszkania_test_dataset, columns=['dzielnica'])
mieszkania_test_dataset["total_rooms"] = mieszkania_test_dataset["ilość_łazienek"] + mieszkania_test_dataset["ilość_sypialni"]
mieszkania_test_dataset["area_per_room"] = mieszkania_test_dataset["m2"] / mieszkania_test_dataset["total_rooms"]
mieszkania_test_x = mieszkania_test_dataset[all_columns]
mieszkania_test_y = mieszkania_test_dataset["cena"].to_numpy()
mieszkania_test_x.head()

Unnamed: 0,m2,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,dzielnica_grodziskowo,dzielnica_mokotowo,dzielnica_ochotowo,dzielnica_wolowo,total_rooms,area_per_room
0,104,2,2,1940,1,False,True,False,False,4,26.0
1,43,1,1,1970,1,False,False,True,False,2,21.5
2,128,3,2,1916,1,True,False,False,False,5,25.6
3,112,3,2,1920,1,False,True,False,False,5,22.4
4,149,3,3,1977,0,False,True,False,False,6,24.833333


# Validation

In this exercise you will implement a validation pipeline: split the non-test set into train and validation sets and select the best model based on validation results.

So far you tested your model against the training and test datasets. As you should observe, there's a gap between the results. By validating your model, you should be able to better anticipate the test time performance and compare different models and hyperparameters on datasets they are not over-fitted to.

Implement the basic validation method, i.e. a random split. Test it with your model from Exercise MSLE.

In [None]:
a = np.array([0, 2, 4, 6, 8, 10])
a[[2, 0, 3]][:2]

array([4, 0])

In [None]:
# x_train_val, y_train_val = train_x, train_y
# x_test, y_test = test_x, test_y


def random_split(
    x: NDArray, y: NDArray, val_ratio: float = 0.1
) -> tuple[tuple[NDArray, NDArray], tuple[NDArray, NDArray]]:
    """Returns (x_train, y_train), (x_val, y_val)."""

    assert x.shape[0] == y.shape[0]
    N = x.shape[0]
    idxs = np.random.permutation(N)
    cut = int(N*(1-val_ratio))
    print(idxs.shape, cut)
    train_x, train_y = x.iloc[idxs[:cut]].to_numpy().astype(np.float32), y[idxs[:cut]].astype(np.float32)
    test_x, test_y = x.iloc[idxs[cut:]].to_numpy().astype(np.float32), y[idxs[cut:]].astype(np.float32)
    print(train_x[:10], train_y[:10])
    print(test_x[:10], test_y[:10])
    return (train_x, train_y), (test_x, test_y)




(x_train, y_train), (x_val, y_val) = random_split(mieszkania_dataset_x, mieszkania_dataset_y)

len(x_train), len(x_val), len(x_test)

(200,) 180
[[  62.        2.        1.     1993.        1.        0.        1.
     0.        0.        3.       20.6667]
 [  65.        2.        1.     1960.        0.        0.        0.
     0.        1.        3.       21.6667]
 [  60.        2.        1.     1932.        0.        1.        0.
     0.        0.        3.       20.    ]
 [  15.        1.        1.     1997.        0.        0.        0.
     0.        1.        2.        7.5   ]
 [  39.        1.        1.     1912.        0.        1.        0.
     0.        0.        2.       19.5   ]
 [  83.        2.        2.     1967.        0.        0.        0.
     1.        0.        4.       20.75  ]
 [  24.        1.        1.     1943.        1.        0.        0.
     1.        0.        2.       12.    ]
 [ 101.        2.        2.     1999.        1.        0.        1.
     0.        0.        4.       25.25  ]
 [  21.        1.        1.     1961.        1.        1.        0.
     0.        0.        2.      

(180, 20, 200)

In [None]:
x_train

array([[62.    ,  2.    ,  1.    , ...,  0.    ,  3.    , 20.6667],
       [65.    ,  2.    ,  1.    , ...,  1.    ,  3.    , 21.6667],
       [60.    ,  2.    ,  1.    , ...,  0.    ,  3.    , 20.    ],
       ...,
       [91.    ,  2.    ,  2.    , ...,  0.    ,  4.    , 22.75  ],
       [44.    ,  1.    ,  1.    , ...,  0.    ,  2.    , 22.    ],
       [59.    ,  2.    ,  1.    , ...,  0.    ,  3.    , 19.6667]],
      dtype=float32)

In [None]:
#############################################################
# TODO: compare MSLE on training, validation, and test sets #
#############################################################
#TODO


weights, bias = train_msle(x_train, y_train)
preds_test = predict(weights, mieszkania_test_x.to_numpy().astype(np.float32))
print("test MSLE: ", msle(mieszkania_test_y.astype(np.float32), preds_test))
preds_val = predict(weights, x_val)
print("validation MSLE: ", msle(y_val, preds_val))
preds_train = predict(weights, x_train)
print("train MSLE: ", msle(y_train, preds_train))

100%|██████████| 100000/100000 [00:03<00:00, 26053.37it/s]

test MSLE:  0.053436449213132085
validation MSLE:  0.052519757801817536
train MSLE:  0.05353830381438927





In [83]:
weights

array([6323.5822,  161.5827,  122.9876,    4.3451,  144.0636, -672.8208,
        496.2983,  517.4224, -345.4785,  284.6141, -340.4433])

## Cross-validation

To make the random split validation reliable, a significant chunk of training data may be needed. To get over this problem, one may apply cross-validation.

![alt-text](https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png)

Let's now implement the method. Make sure that:
* number of partitions is a parameter,
* the method is not limited to `mieszkania.csv`,
* the method is not limited to one specific model.

In [None]:
####################################
# TODO: Implement cross-validation #
####################################
def kfold(x: NDArray, y: NDArray, n_folds: int = 5, shuffle: bool = False) -> list[float]:
    """Returns losses for each fold."""



losses = kfold(x_train_val, y_train_val, n_folds=3, shuffle=False)
print(f"k-fold loss: {np.mean(losses):.4f} +- {np.std(losses):.4f}")


## Investigating input data

Recall that sometimes validation may be tricky, e.g. significant class imbalance, having a small number of subjects, geographically clustered instances...

What could in theory go wrong here with random, unstratified partitions? Think about potential solutions and investigate the data in order to check whether these problems arise here.

In [None]:
##############################
# TODO: Investigate the data #
##############################