Column description:\
**city** - the name of the city where the property is located\
**type** - type of the building\
**squareMeters** - the size of the apartment in square meters\
**rooms** - number of rooms in the apartment\
**floor / floorCount** - the floor where the apartment is located and the total number of floors in the building\
**buildYear** - the year when the building was built\
**latitude, longitude** - geographic coordinates of the property\
**centreDistance** - distance from the city centre in km\
**poiCount** - number of points of interest within 500 m of the apartment (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)\
**[poiName]Distance** - distance to the nearest point of interest (schools, clinics, post offices, kindergartens, restaurants, colleges, pharmacies)\
**ownership** - the type of property ownership\
**condition** - the condition of the apartment\
**has[features]** - whether the property has key features such as assigned parking space, balcony, elevator, security, storage room\
**price** - offer price in Polish Zloty

apartments_pl_YYYY_MM.csv: sale price


In [None]:
import glob
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor  # price is continuous, so a regressor is needed
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Import data
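
The loading cell is not included in this section. A minimal sketch, assuming the monthly `apartments_pl_YYYY_MM.csv` files sit in a local `data/` folder (the folder name and the `load_apartments` helper are hypothetical), could concatenate them into a single DataFrame:

```python
import glob

import pandas as pd


def load_apartments(pattern):
    """Read every CSV matching `pattern` and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Hypothetical location; adjust to wherever the CSV files are stored.
# df = load_apartments("data/apartments_pl_*.csv")
```

Sorting the paths keeps the months in chronological order, because the `YYYY_MM` suffix sorts lexicographically.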

# Build model
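
The cells below rely on `X_train`, `X_test`, `y_train`, and `y_test`, which are not defined in this section. One way to create them, assuming the data lives in a DataFrame whose `price` column is the target, is sketched here. The `split_features_target` helper, the 80/20 split, and `random_state=42` are illustrative choices, not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df, target="price", test_size=0.2, random_state=42):
    """Separate the target column and make a reproducible train/test split."""
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Tiny synthetic frame, only to show the call; the values are made up.
demo = pd.DataFrame({
    "squareMeters": [30.0, 45.0, 60.0, 75.0, 90.0],
    "rooms": [1, 2, 3, 3, 4],
    "price": [300_000, 450_000, 600_000, 750_000, 900_000],
})
X_tr, X_te, y_tr, y_te = split_features_target(demo)
```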

Calculate the baseline MAE, i.e. the error of always predicting the mean training price.

In [None]:
y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)
baseline_mae = mean_absolute_error(y_train, y_pred_baseline)  # signature is (y_true, y_pred)
print("Mean apt price:", y_mean)
print("Baseline MAE:", baseline_mae)

Mean apt price: 936036.5090266876
Baseline MAE: 302454.0242212934


## Linear Regression

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

ValueError: could not convert string to float: 'blockOfFlats'
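
The `ValueError` above arises because `LinearRegression` needs numeric input, while columns such as `type` hold strings like `'blockOfFlats'`. A possible fix is to one-hot encode the non-numeric columns inside a pipeline using `OneHotEncoder`, `ColumnTransformer`, and `make_pipeline`. The `build_linear_pipeline` helper and the toy data below are hypothetical, and the sketch assumes missing values have already been handled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


def build_linear_pipeline(X):
    """One-hot encode the non-numeric columns of X, pass the rest through, then fit OLS."""
    categorical = X.select_dtypes(exclude="number").columns.tolist()
    encoder = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",  # numeric columns are left unchanged
    )
    return make_pipeline(encoder, LinearRegression())


# Toy data with made-up values, only to show that string columns no longer fail.
X_demo = pd.DataFrame({
    "type": ["blockOfFlats", "tenement", "blockOfFlats", "apartmentBuilding"],
    "squareMeters": [40.0, 55.0, 62.0, 80.0],
})
y_demo = pd.Series([400_000, 520_000, 610_000, 900_000])

linear_pipeline = build_linear_pipeline(X_demo)
linear_pipeline.fit(X_demo, y_demo)
demo_pred = linear_pipeline.predict(X_demo)
```

Keeping the encoder inside the pipeline means the same transformation is applied at `predict` time, and `handle_unknown="ignore"` avoids errors when the test set contains a category not seen during training.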

In [None]:
y_train_pred = model.predict(X_train)
y_train_mae = mean_absolute_error(y_train, y_train_pred)
print("Train MAE:", y_train_mae)

In [None]:
y_test_pred = model.predict(X_test)
y_test_mae = mean_absolute_error(y_test, y_test_pred)
print("Test MAE:", y_test_mae)

In [None]:
model.coef_

## Ridge

In [None]:
new_model = Ridge()
new_model.fit(X_train, y_train)  # fit before calling predict

In [None]:
y_train_pred = new_model.predict(X_train)
y_train_mae = mean_absolute_error(y_train, y_train_pred)
print("Train MAE:", y_train_mae)

In [None]:
param_grid = {'alpha': np.logspace(-3, 3, 50)}

grid_search = GridSearchCV(new_model, param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

In [None]:
best_alpha = grid_search.best_params_['alpha']
best_score = -grid_search.best_score_  # Convert back to positive MAE

print(f"Best alpha: {best_alpha}")
print(f"Best CV MAE: {best_score:.2f}")

In [None]:
# new_model uses Ridge's default alpha (1.0), not the tuned value
y_test_pred = new_model.predict(X_test)
y_test_mae = mean_absolute_error(y_test, y_test_pred)
print("Test MAE:", y_test_mae)

In [None]:
ridge_best = Ridge(alpha=best_alpha)
ridge_best.fit(X_train, y_train)

# Evaluate on the test set
y_pred = ridge_best.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)
print(f"Test MAE: {test_mae:.2f}")
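
With its default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set and exposes it as `best_estimator_`, so refitting by hand has an equivalent shortcut. The sketch below checks this on synthetic data; the data and the reduced alpha grid are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem with made-up coefficients.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(60, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    Ridge(),
    {"alpha": np.logspace(-3, 3, 7)},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_syn, y_syn)

# best_estimator_ is already fitted on all of X_syn with the best alpha.
tuned = search.best_estimator_
manual = Ridge(alpha=search.best_params_["alpha"]).fit(X_syn, y_syn)

# Both routes should give the same coefficients.
same = np.allclose(tuned.coef_, manual.coef_)
```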

In [None]:
ridge_best.coef_  # coefficients of the tuned model

## Decision Tree

Placeholder
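
Until this section is filled in, the following is only a possible starting point, not the notebook's own implementation. Since `price` is a continuous target, the matching estimator is `DecisionTreeRegressor`. The synthetic data and `max_depth=4` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded feature matrix; values are made up.
rng = np.random.default_rng(0)
X_syn = rng.uniform(20, 120, size=(200, 1))  # e.g. square meters
y_syn = 10_000 * X_syn[:, 0] + rng.normal(scale=20_000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# max_depth is an arbitrary illustrative value that limits overfitting.
tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X_tr, y_tr)

tree_test_mae = mean_absolute_error(y_te, tree.predict(X_te))
mean_mae = mean_absolute_error(y_te, np.full(len(y_te), y_tr.mean()))
print(f"Tree test MAE: {tree_test_mae:.2f} (mean baseline: {mean_mae:.2f})")
```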

## Random Forest

Placeholder
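
As with the previous section, this is only a hedged sketch of what a random forest baseline might look like, using `RandomForestRegressor`. The synthetic data, `n_estimators=100`, and the uninformative second feature are assumptions added to show how `feature_importances_` can be read.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
size = rng.uniform(20, 120, size=300)
noise_col = rng.normal(size=300)  # carries no signal about the target
X_syn = np.column_stack([size, noise_col])
y_syn = 10_000 * size + rng.normal(scale=20_000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

forest_test_mae = mean_absolute_error(y_te, forest.predict(X_te))

# Impurity-based importances; they sum to 1 across features.
importances = forest.feature_importances_
```

On the real data, the importances would be read against the column names produced by the encoding step.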