# Data Cleaning: Transactions

In [2]:
import os
from typing import Union

import joblib
import pandas as pd
import skops.io as sio

from helpers import (
    cross_validation_with_pipeline, train_and_validate_model, validate_impute
)
from helpers import (
    CACHE_DATA_DIR, TRANSFORMED_DATA_DIR, IMPUTER_MODEL_DIR
)

In [3]:
df_transactions_encoded = pd.read_parquet(TRANSFORMED_DATA_DIR / 'transactions_KL_ckpt4_encoded.parquet')
df_transactions_encoded

Unnamed: 0,township_BANDAR BARU SRI PETALING,township_TAMAN TUN DR ISMAIL,township_DAMANSARA HEIGHTS (BUKIT DAMANSARA),township_TAMAN BUKIT MALURI,township_KEPONG BARU,township_OVERSEAS UNION GARDEN,township_HAPPY GARDEN,township_TAMAN MIDAH,township_ALAM DAMAI,township_TAMAN SRI SINAR,...,tenure_FREEHOLD,floors,rooms,land_area,built_up,price_psf,price,year,month,day
0,1,0,0,0,0,0,0,0,0,0,...,0,1.0,,2196.0,,342.0,750000.0,2023,6,9
1,1,0,0,0,0,0,0,0,0,0,...,0,2.0,,753.0,,398.0,300000.0,2023,6,1
2,1,0,0,0,0,0,0,0,0,0,...,0,2.5,,3197.0,,188.0,600000.0,2023,5,29
3,1,0,0,0,0,0,0,0,0,0,...,0,2.0,,753.0,,531.0,400000.0,2023,5,25
4,1,0,0,0,0,0,0,0,0,0,...,0,2.5,,4801.0,,250.0,1200000.0,2023,5,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
265265,0,0,0,0,0,0,0,0,0,0,...,0,1.0,2.0,493.0,493.0,71.0,35000.0,1990,11,13
265266,0,0,0,0,0,0,0,0,0,0,...,1,1.0,3.0,1454.0,1454.0,150.0,218025.0,2005,1,10
265267,0,0,0,0,0,0,0,0,0,0,...,1,1.0,3.0,593.0,593.0,194.0,115000.0,2008,2,25
265268,0,0,0,0,0,0,0,0,0,0,...,0,1.0,2.0,1193.0,1193.0,197.0,235000.0,2009,8,10


In [4]:
df_transactions_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265270 entries, 0 to 265269
Columns: 1905 entries, township_BANDAR BARU SRI PETALING to day
dtypes: float64(6), int32(3), int64(1896)
memory usage: 3.8 GB


### Concluding Remarks from Data Cleaning 1: Recap
1. The following data cleaning steps have been performed:
    - Removed address column
    - Changed fraction to decimal
    - Removed commas in numerical values
    - Removed units in numerical values
    - Removed exact duplicates
    - Removed outliers using HDBSCAN, based on:
        - Continuous variables: `land_area`, `built_up`, `price_psf` (3D)
        - Ordinal variables: `floors`, `rooms` (2D)
2. Investigated missing values in `built_up` and `rooms`
    - Investigated correlation and association between features to determine which features to use to impute missing values
3. Encoded features for imputation using one-hot encoding

Next, we should proceed to impute the missing values in `built_up` and `rooms`.

## Imputing missing values

Based on literature, the following imputation methods have been identified:
1. Random forest imputation (Jager et al., 2021) for MCAR, MAR and MNAR data in various domains
2. Multiple imputation by deterministic regression (Donlen, 2022) for MCAR data in real estate domain
3. MissForest (Waljee et al., 2013) for MCAR data in medical domain
4. Predictive mean matching, PMM (Heidt, 2019) for MAR data in medical domain
5. KNN imputation (Jadhav et al., 2019) for MCAR, MAR and MNAR data in UCI dataset

However, when filtered by domain (real estate), only three methods are identified:
1. Random forest imputation
2. KNN imputation
3. Multiple imputation by deterministic regression

These are machine learning approaches to imputation: each feature with missing values is treated as the target variable, and the features without missing values serve as the independent variables. To get a more reliable estimate of each method's performance, we use cross-validation:
1. Split the dataset into train and test, where train is the data with labels (target observed) and test is the data without labels (target missing)
2. Split the train dataset into train and validation sets
3. Cross validate the train data:
    - Create a pipeline with scaler and model
    - Run cross validation with scoring
    - Output both train and validation scores
    - Return the pipeline and cross validation results
4. Train and evaluate the model with validation data
    - Train the pipeline with train data
    - Predict the validation data
    - Evaluate the model with validation data
5. Evaluate the pipeline with validation data and print out the metrics
6. Predict the test data
7. Return the imputed dataset and the fitted pipeline

References:
- Jager et al. (2021): https://www.frontiersin.org/articles/10.3389/fdata.2021.693674/full
- Donlen (2022): https://egrove.olemiss.edu/cgi/viewcontent.cgi?article=3744&context=hon_thesis
- Waljee et al. (2013): https://bmjopen.bmj.com/content/3/8/e002847.citation-tools
- Heidt (2019): https://dc.etsu.edu/cgi/viewcontent.cgi?article=5014&context=etd
- Jadhav et al. (2019): https://www.tandfonline.com/doi/full/10.1080/08839514.2019.1637138
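
The cross-validation and validation steps listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the actual `helpers.cross_validation_with_pipeline` implementation, which is not shown in this notebook. The function name `cross_validate_pipeline_sketch`, the scorer list and the toy data are assumptions; the scorer names were chosen to match the columns of the result tables in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed scorer list, mirroring the metric columns reported in this notebook
REGRESSION_SCORING = [
    "r2",
    "neg_root_mean_squared_error",
    "neg_mean_absolute_percentage_error",
    "neg_median_absolute_error",
]


def cross_validate_pipeline_sketch(pipeline, X, y, cv=5):
    """Hypothetical stand-in for helpers.cross_validation_with_pipeline."""
    return cross_validate(
        pipeline, X, y, cv=cv, scoring=REGRESSION_SCORING, return_train_score=True
    )


# Synthetic data: the target depends linearly on two features plus noise
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = 100 + 3 * X["a"] - 2 * X["b"] + rng.normal(scale=0.1, size=200)

# Split the labelled data into train and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline with scaler and model
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Cross-validate on the train split, reporting both train and test fold scores
cv_results = pd.DataFrame(cross_validate_pipeline_sketch(pipeline, X_train, y_train))

# Fit on the full train split and score on the held-out validation split
val_r2 = pipeline.fit(X_train, y_train).score(X_val, y_val)
```

`cv_results` has one row per fold, with `fit_time`, `score_time` and a `test_*`/`train_*` pair per scorer, which is the same layout as the tables shown later in this notebook.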

In [5]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

random_state = 42

Model = Union[RandomForestRegressor, KNeighborsRegressor, RandomForestClassifier, KNeighborsClassifier]

### Multiple imputation

Scikit-learn ships built-in imputers such as `KNNImputer` and `IterativeImputer`. They also replace the `fancyimpute` package, which is no longer maintained. Note that with its default settings `IterativeImputer` returns a single imputed value per missing entry.

In [6]:
model_path = IMPUTER_MODEL_DIR / 'bayesianridge_multi_imputer.joblib'

if os.path.exists(model_path):
    bayesianridge_multi_imputer = joblib.load(model_path)
else:
    bayesianridge_multi_imputer = IterativeImputer(random_state=random_state, initial_strategy='median', skip_complete=True)
    bayesianridge_multi_imputer = bayesianridge_multi_imputer.fit(df_transactions_encoded)
    joblib.dump(bayesianridge_multi_imputer, model_path, compress=('lzma', 9))

df_transactions_bayesianridge_multi_imputed = pd.DataFrame(bayesianridge_multi_imputer.transform(df_transactions_encoded), columns=df_transactions_encoded.columns)
df_transactions_bayesianridge_multi_imputed.head()

Unnamed: 0,township_BANDAR BARU SRI PETALING,township_TAMAN TUN DR ISMAIL,township_DAMANSARA HEIGHTS (BUKIT DAMANSARA),township_TAMAN BUKIT MALURI,township_KEPONG BARU,township_OVERSEAS UNION GARDEN,township_HAPPY GARDEN,township_TAMAN MIDAH,township_ALAM DAMAI,township_TAMAN SRI SINAR,...,tenure_FREEHOLD,floors,rooms,land_area,built_up,price_psf,price,year,month,day
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,2.893189,2196.0,1130.0115,342.0,750000.0,2023.0,6.0,9.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,3.079764,753.0,1272.862928,398.0,300000.0,2023.0,6.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.5,3.232031,3197.0,1521.765384,188.0,600000.0,2023.0,5.0,29.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,3.091576,753.0,1256.558519,531.0,400000.0,2023.0,5.0,25.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.5,3.83752,4801.0,2202.912522,250.0,1200000.0,2023.0,5.0,22.0


Multiple imputation with Bayesian ridge took 10m41s to train and impute the data. With the fitted model cached, imputation alone took 1m32s.

In [None]:
model_path = IMPUTER_MODEL_DIR / 'linreg_multi_imputer.joblib'

if os.path.exists(model_path):
    linreg_multi_imputer = joblib.load(model_path)
else:
    linreg_multi_imputer = IterativeImputer(estimator=LinearRegression(), random_state=random_state, initial_strategy='median', skip_complete=True)
    linreg_multi_imputer = linreg_multi_imputer.fit(df_transactions_encoded)
    joblib.dump(linreg_multi_imputer, model_path, compress=('lzma', 9))

df_transactions_linreg_multi_imputed = pd.DataFrame(linreg_multi_imputer.transform(df_transactions_encoded), columns=df_transactions_encoded.columns)
df_transactions_linreg_multi_imputed.head()

In [7]:
model_path = IMPUTER_MODEL_DIR / 'rf_multi_imputer.joblib'

if os.path.exists(model_path):
    rf_multi_imputer = joblib.load(model_path)
else:
    rf_multi_imputer = IterativeImputer(estimator=RandomForestRegressor(), random_state=random_state, initial_strategy='median', skip_complete=True)
    rf_multi_imputer = rf_multi_imputer.fit(df_transactions_encoded)
    joblib.dump(rf_multi_imputer, model_path, compress=('lzma', 9))

df_transactions_rf_multi_imputed = pd.DataFrame(rf_multi_imputer.transform(df_transactions_encoded), columns=df_transactions_encoded.columns)
df_transactions_rf_multi_imputed.head()

Unnamed: 0,township_BANDAR BARU SRI PETALING,township_TAMAN TUN DR ISMAIL,township_DAMANSARA HEIGHTS (BUKIT DAMANSARA),township_TAMAN BUKIT MALURI,township_KEPONG BARU,township_OVERSEAS UNION GARDEN,township_HAPPY GARDEN,township_TAMAN MIDAH,township_ALAM DAMAI,township_TAMAN SRI SINAR,...,tenure_FREEHOLD,floors,rooms,land_area,built_up,price_psf,price,year,month,day
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,3.0,2196.0,1051.85,342.0,750000.0,2023.0,6.0,9.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,2.44,753.0,749.25,398.0,300000.0,2023.0,6.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.5,3.7,3197.0,1955.62,188.0,600000.0,2023.0,5.0,29.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.0,2.4,753.0,784.7,531.0,400000.0,2023.0,5.0,25.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.5,4.34,4801.0,3356.08,250.0,1200000.0,2023.0,5.0,22.0


Multiple imputation with random forest took 206m49s to train and impute the data. With the fitted model cached, imputation alone took 2m33s.

In [8]:
# model_path = IMPUTER_MODEL_DIR / 'knn_multi_imputer.joblib'

# if os.path.exists(model_path):
#     knn_multi_imputer = joblib.load(model_path)
# else:
#     knn_multi_imputer = IterativeImputer(estimator=KNeighborsRegressor(), random_state=random_state, initial_strategy='median', skip_complete=True)
#     knn_multi_imputer = knn_multi_imputer.fit(df_transactions_encoded)
#     joblib.dump(knn_multi_imputer, model_path, compress=('lzma', 9))

# df_transactions_knn_multi_imputed = pd.DataFrame(knn_multi_imputer.transform(df_transactions_encoded), columns=df_transactions_encoded.columns)
# df_transactions_knn_multi_imputed.head()

In [9]:
# model_path = IMPUTER_MODEL_DIR / 'knn_imputer.joblib'

# if os.path.exists(model_path):
#     knn_imputer = joblib.load(model_path)
# else:
#     knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
#     knn_imputer = knn_imputer.fit(df_transactions_encoded)
#     joblib.dump(knn_imputer, model_path, compress=('lzma', 5))

# knn_imputed = knn_imputer.transform(df_transactions_encoded)
# df_transactions_knn_imputed = pd.DataFrame(knn_imputed, columns=df_transactions_encoded.columns)
# df_transactions_knn_imputed.head()

The KNN imputer did not finish imputing the data even after 610 minutes, and `IterativeImputer` with `KNeighborsRegressor` raised a `MemoryError`. Both cells above are therefore left commented out.

### Manual imputation

#### Imputing `built_up`

The steps are:
1. Remove `rooms` from the dataset as it has too many missing values
2. Cross validate for `built_up` using:
    - Random forest imputation
    - KNN imputation
3. Use the better model to impute `built_up`

In [11]:
target = 'built_up'

# Split the dataset into train and test, where train are the data with labels and test are the data without labels
data_path = CACHE_DATA_DIR / 'encoded_transactions_train_built_up.parquet'
if os.path.exists(data_path):
    df_train = pd.read_parquet(data_path)
else:
    df_train = df_transactions_encoded[df_transactions_encoded[target].notna()].drop(columns=['rooms']).dropna()
    df_train.to_parquet(data_path)

data_path = CACHE_DATA_DIR / 'encoded_transactions_test_built_up.parquet'
if os.path.exists(data_path):
    df_test = pd.read_parquet(data_path)
else:
    df_test = df_transactions_encoded[df_transactions_encoded[target].isna()].drop(columns=['rooms'])
    df_test.to_parquet(data_path)

# Split the train dataset into train and validation
X_train, X_val, y_train, y_val = train_test_split(df_train.drop(columns=[target]), df_train[target], test_size=0.2, random_state=random_state)

##### Cross validation for `built_up` using various techniques

In [8]:
# Create a pipeline with scaler and model
rf_pipeline_built_up = Pipeline([('scaler', StandardScaler()), ('model', RandomForestRegressor(random_state=random_state, n_jobs=4))])

# Cross validate the pipeline
if os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_rf_built_up.joblib'):
    cv_results_built_up = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_rf_built_up.joblib')
else:
    cv_results_built_up = cross_validation_with_pipeline(rf_pipeline_built_up, X_train, y_train, 'regression')
    joblib.dump(cv_results_built_up, IMPUTER_MODEL_DIR / 'cv_results_rf_built_up.joblib')

pd.DataFrame(cv_results_built_up)

Unnamed: 0,fit_time,score_time,test_r2,train_r2,test_neg_root_mean_squared_error,train_neg_root_mean_squared_error,test_neg_mean_absolute_percentage_error,train_neg_mean_absolute_percentage_error,test_neg_median_absolute_error,train_neg_median_absolute_error
0,1221.663679,11.541488,0.824293,0.973771,-357.268651,-136.370194,-4652190000000000.0,-1715361000000000.0,-0.0,-0.0
1,1200.971122,9.686063,0.798486,0.975722,-379.701082,-131.454077,-5023980000000000.0,-1716289000000000.0,-0.0,-0.0
2,1233.660341,5.176621,0.854139,0.973795,-316.718636,-137.235538,-2803023000000000.0,-1677708000000000.0,-0.0,-0.0
3,1236.589329,6.362303,0.844816,0.973615,-323.399088,-138.03593,-5624051000000000.0,-1652899000000000.0,-0.0,-0.0
4,285.546735,1.64555,0.784114,0.977032,-404.786024,-126.8785,-3922569000000000.0,-1042910000000000.0,-0.0,-0.0


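Cross-validation with random forest took 55m35s. The results are mixed: test R² ranges from 0.78 to 0.85 against roughly 0.97 on the training folds, test RMSE is substantial (about 317 to 405), and MAPE is extremely large (on the order of 10^15). The median absolute error, however, is 0 on every fold.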
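
MAPE values on the order of 10^15 usually mean that some true target values are zero or very close to zero: scikit-learn's `mean_absolute_percentage_error` divides each absolute error by the true value, clipped to machine epsilon, so a single such row dominates the mean. Whether this is what happens in this dataset has not been checked here; the toy arrays below are made up purely to show the effect, and why the median absolute error can still be 0.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, median_absolute_error

# Made-up values: one true value is zero, and only that row is mispredicted
y_true = np.array([0.0, 1000.0, 1200.0, 1500.0])
y_pred = np.array([50.0, 1000.0, 1200.0, 1500.0])

# The zero denominator is replaced by machine epsilon, so the ratio explodes
mape = mean_absolute_percentage_error(y_true, y_pred)

# The median ignores the single large error because most rows are exact
median_ae = median_absolute_error(y_true, y_pred)

print(f"MAPE: {mape:.3e}")
print(f"Median AE: {median_ae}")
```

If zeros in the target turn out to be the cause, MAPE is not a meaningful metric for this column and the RMSE, MAE and median absolute error are the ones to compare.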

In [9]:
# Create a pipeline with scaler and model
knn_pipeline_built_up = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsRegressor(n_jobs=4))])

# Cross validate the pipeline
if os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_knn_built_up.joblib'):
    cv_results_built_up = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_knn_built_up.joblib')
else:
    cv_results_built_up = cross_validation_with_pipeline(knn_pipeline_built_up, X_train, y_train, 'regression')
    joblib.dump(cv_results_built_up, IMPUTER_MODEL_DIR / 'cv_results_knn_built_up.joblib')

pd.DataFrame(cv_results_built_up)

Unnamed: 0,fit_time,score_time,test_r2,train_r2,test_neg_root_mean_squared_error,train_neg_root_mean_squared_error,test_neg_mean_absolute_percentage_error,train_neg_mean_absolute_percentage_error,test_neg_median_absolute_error,train_neg_median_absolute_error
0,115.349414,842.074621,0.728349,0.807001,-444.228462,-369.920969,-3509313000000000.0,-3596327000000000.0,-65.8,-51.4
1,67.173904,835.366826,0.670352,0.815942,-485.639491,-361.94947,-4940215000000000.0,-3815043000000000.0,-66.6,-51.6
2,115.66784,844.57877,0.728067,0.805778,-432.449172,-373.617202,-4629208000000000.0,-4027341000000000.0,-65.7,-51.4
3,114.721992,863.657962,0.714709,,-438.48894,,-7069854000000000.0,,-66.0,
4,17.214992,0.0,,,,,,,,


Cross-validation with KNN took around 148m4s and then ran out of memory at the fourth fold: fold 3 has no train scores and fold 4 has no scores at all. On the folds that completed, KNN performed worse than random forest, with higher RMSE, MAPE and median absolute error.

##### Check model performance on validation data

Let's check IterativeImputer models on validation data.

In [15]:
bayesianridge_pred = df_transactions_bayesianridge_multi_imputed[target].loc[y_val.index]  # y_val.index holds labels, so select with .loc

validate_impute(y_val, bayesianridge_pred, 'regression')

Results for validation set:
R2 score: 1.0
RMSE score: 0.0
MAPE score: 0.0
MAE score: 0.0
Median AE score: 0.0


In [None]:
linreg_pred = df_transactions_linreg_multi_imputed[target].loc[y_val.index]  # y_val.index holds labels, so select with .loc

validate_impute(y_val, linreg_pred, 'regression')

In [16]:
rf_pred = df_transactions_rf_multi_imputed[target].loc[y_val.index]  # y_val.index holds labels, so select with .loc

validate_impute(y_val, rf_pred, 'regression')

Results for validation set:
R2 score: 1.0
RMSE score: 0.0
MAPE score: 0.0
MAE score: 0.0
Median AE score: 0.0


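Both models score perfectly, but this is expected rather than informative: every `y_val` row has an observed `built_up` value, and `IterativeImputer` leaves observed values unchanged, so this check compares the data with itself. It does not tell us how well either imputer fills in genuinely missing values.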
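
The sketch below uses synthetic data (all variable names are hypothetical) to illustrate why scoring an `IterativeImputer` on rows whose target was observed returns perfect metrics, and one way to get an informative score instead: hide some known values before imputing, then compare the imputed values with the hidden truth.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import r2_score

# Synthetic data: column 2 is a noisy linear function of columns 0 and 1
rng = np.random.default_rng(42)
X_full = rng.normal(size=(300, 3))
X_full[:, 2] = 2 * X_full[:, 0] - X_full[:, 1] + rng.normal(scale=0.1, size=300)

# Rows 0..49 play the role of genuinely missing values
X_missing = X_full.copy()
X_missing[:50, 2] = np.nan

# Observed entries pass through the imputer unchanged
X_imputed = IterativeImputer(random_state=42).fit_transform(X_missing)
observed_unchanged = np.allclose(X_imputed[50:, 2], X_missing[50:, 2])

# Hold-out check: additionally hide known values in rows 50..99
holdout = np.arange(50, 100)
X_masked = X_missing.copy()
X_masked[holdout, 2] = np.nan

# Impute the masked data and score only the rows whose truth we hid
X_eval = IterativeImputer(random_state=42).fit_transform(X_masked)
holdout_r2 = r2_score(X_full[holdout, 2], X_eval[holdout, 2])

print("Observed values unchanged:", observed_unchanged)
print(f"Hold-out R2: {holdout_r2:.3f}")
```

Applied to this notebook, the same idea would mean setting `built_up` to NaN for the `y_val` rows before fitting the imputer, which would make its scores comparable with the pipeline results on the validation set.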

In [11]:
model_path = IMPUTER_MODEL_DIR / 'rf_pipeline_built_up.joblib'

if os.path.exists(model_path):
    rf_pipeline_built_up = joblib.load(model_path)
    _ = train_and_validate_model(rf_pipeline_built_up, X_train, y_train, X_val, y_val, 'regression')
else:
    rf_pipeline_built_up = train_and_validate_model(rf_pipeline_built_up, X_train, y_train, X_val, y_val, 'regression')
    joblib.dump(rf_pipeline_built_up, model_path, compress=('lzma', 9))

Results for validation set:
R2 score: 0.8080609693884588
RMSE score: 381.9099647033172
MAPE score: 3890573305511410.5
MAE score: 63.87615023388515
Median AE score: 0.0


Random forest took 6m8s to train on the training data and predict the validation data, and 2.5 seconds to load the fitted model and predict the validation data.

In [12]:
model_path = IMPUTER_MODEL_DIR / 'knn_pipeline_built_up.joblib'

if os.path.exists(model_path):
    knn_pipeline_built_up = joblib.load(model_path)
    _ = train_and_validate_model(knn_pipeline_built_up, X_train, y_train, X_val, y_val, 'regression')
else:
    knn_pipeline_built_up = train_and_validate_model(knn_pipeline_built_up, X_train, y_train, X_val, y_val, 'regression')
    joblib.dump(knn_pipeline_built_up, model_path, compress=('lzma', 5))

Results for validation set:
R2 score: 0.6834473458154234
RMSE score: 490.45856274893976
MAPE score: 4185698477203167.0
MAE score: 158.56516414749206
Median AE score: 63.200000000000045


KNN took 4m39s to train on the training data and predict the validation data, and 4m54s to load the fitted model and predict the validation data.

Comparing random forest and KNN on both cross validation and validation data, random forest performed better than KNN.

##### Impute `built_up` using the best model

In [13]:
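def impute_with_model(pipeline: Pipeline, df_train: pd.DataFrame, df_test: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill `target` in `df_test` with pipeline predictions and stack the result under `df_train`."""
    # Work on a copy so the caller's DataFrame is not mutated
    df_test = df_test.copy()

    # df_test must not contain the target column or any missing values at this point
    df_test[target] = pipeline.predict(df_test)

    return pd.concat([df_train, df_test])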

In [14]:
df_test = df_test.drop(columns=[target]).dropna()
df_transactions_built_up_imputed = impute_with_model(rf_pipeline_built_up, df_train, df_test, target='built_up')
df_transactions_built_up_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265270 entries, 252 to 75120
Columns: 1904 entries, township_BANDAR BARU SRI PETALING to day
dtypes: float64(5), int32(3), int64(1896)
memory usage: 3.8 GB


##### Join the imputed `built_up` data with the original data

In [15]:
df_transactions_built_up_imputed = df_transactions_built_up_imputed.join(df_transactions_encoded['rooms'])
df_transactions_built_up_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265270 entries, 252 to 75120
Columns: 1905 entries, township_BANDAR BARU SRI PETALING to rooms
dtypes: float64(6), int32(3), int64(1896)
memory usage: 3.8 GB


In [16]:
# Check for missing values in all columns
for column in df_transactions_built_up_imputed.columns:
    isna_count = df_transactions_built_up_imputed[column].isna().sum()
    if isna_count > 0:
        print(column, isna_count)

rooms 29217


#### Imputing `rooms`

The steps are:
1. Remove `rooms` classes with 5 or fewer samples so that CV can be performed
2. Cross validate for `rooms` using:
    - Random forest imputation
    - KNN imputation
3. Use the better model to impute `rooms`

In [17]:
target = 'rooms'

# Split the dataset into train and test, where train are the data with labels and test are the data without labels
data_path = CACHE_DATA_DIR / 'encoded_transactions_train_rooms.parquet'
if os.path.exists(data_path):
    df_train = pd.read_parquet(data_path)
else:
    df_train = df_transactions_built_up_imputed[df_transactions_built_up_imputed[target].notna()]
    df_train = df_train.groupby(target).filter(lambda x: len(x) > 5).dropna()  # Keep only room counts with more than 5 samples so that CV can be performed
    df_train.to_parquet(data_path)

data_path = CACHE_DATA_DIR / 'encoded_transactions_test_rooms.parquet'
if os.path.exists(data_path):
    df_test = pd.read_parquet(data_path)
else:
    df_test = df_transactions_built_up_imputed[df_transactions_built_up_imputed[target].isna()]
    df_test.to_parquet(data_path)

# Split the train dataset into train and validation
X_train, X_val, y_train, y_val = train_test_split(df_train.drop(columns=[target]), df_train[target], test_size=0.2, random_state=random_state)

##### Cross validation for `rooms` using various techniques

In [18]:
# Create a pipeline with scaler and model
rf_pipeline_rooms = Pipeline([('scaler', StandardScaler()), ('model', RandomForestClassifier(random_state=random_state, n_jobs=4))])

# Cross validate the pipeline
if not os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_rf_rooms.joblib'):
    cv_results_rooms = cross_validation_with_pipeline(rf_pipeline_rooms, X_train, y_train, 'regression')
    joblib.dump(cv_results_rooms, IMPUTER_MODEL_DIR / 'cv_results_rf_rooms.joblib')
else:
    cv_results_rooms = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_rf_rooms.joblib')

pd.DataFrame(cv_results_rooms)

Unnamed: 0,fit_time,score_time,test_r2,train_r2,test_neg_root_mean_squared_error,train_neg_root_mean_squared_error,test_neg_mean_absolute_percentage_error,train_neg_mean_absolute_percentage_error,test_neg_median_absolute_error,train_neg_median_absolute_error
0,28.685828,0.0,,,,,,,,
1,251.693279,10.347663,0.429146,0.99697,-0.734488,-0.05248,-16458690000000.0,-178897700000.0,-0.0,-0.0
2,51.241544,0.0,,,,,,,,
3,241.38886,6.02706,0.456886,0.997788,-0.696532,-0.045157,-17889880000000.0,-119265100000.0,-0.0,-0.0
4,232.80478,6.727474,0.506015,0.99731,-0.666395,-0.04976,-19082540000000.0,-29816280000.0,-0.0,-0.0


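Cross-validation took 8m5s to complete. Some `rooms` classes have only a single transaction, so classes with 5 or fewer samples were excluded before cross-validation. Note that the scores above are regression metrics, because the pipeline was cross-validated with the `'regression'` setting even though the model is a classifier. Two of the five folds also returned no scores.

The cross-validation results show that the model overfits: train R² is about 0.997 on every completed fold, while test R² is only 0.43 to 0.51.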

In [19]:
# Create a pipeline with scaler and model
knn_pipeline_rooms = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsClassifier(n_jobs=4))])

# Cross validate the pipeline
if not os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_knn_rooms.joblib'):
    cv_results_rooms = cross_validation_with_pipeline(knn_pipeline_rooms, X_train, y_train, 'regression')
    joblib.dump(cv_results_rooms, IMPUTER_MODEL_DIR / 'cv_results_knn_rooms.joblib')
else:
    cv_results_rooms = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_knn_rooms.joblib')

pd.DataFrame(cv_results_rooms)

Unnamed: 0,fit_time,score_time,test_r2,train_r2,test_neg_root_mean_squared_error,train_neg_root_mean_squared_error,test_neg_mean_absolute_percentage_error,train_neg_mean_absolute_percentage_error,test_neg_median_absolute_error,train_neg_median_absolute_error
0,47.168992,423.701747,0.382646,0.51151,-0.74502,-0.670505,-18724250000000.0,-12672000000000.0,-0.0,-0.0
1,98.414525,406.607987,0.37565,,-0.768132,,-14431170000000.0,,-0.0,
2,51.702898,425.574222,0.360377,0.472983,-0.777078,-0.692177,-15266040000000.0,-13864570000000.0,-0.0,-0.0
3,35.375724,0.0,,,,,,,,
4,59.482107,0.0,,,,,,,,


##### Check model performance on validation data

In [18]:
bayesianridge_pred = df_transactions_bayesianridge_multi_imputed[target].loc[y_val.index]  # y_val.index holds labels, so select with .loc

validate_impute(y_val, bayesianridge_pred, 'classification')

Results for validation set:
Accuracy score: 1.0
Balanced accuracy score: 1.0
Macro F1 score: 1.0
Weighted F1 score: 1.0
Macro precision score: 1.0
Weighted precision score: 1.0
Macro recall score: 1.0
Weighted recall score: 1.0


In [19]:
rf_pred = df_transactions_rf_multi_imputed[target].loc[y_val.index]  # y_val.index holds labels, so select with .loc

validate_impute(y_val, rf_pred, 'classification')

Results for validation set:
Accuracy score: 1.0
Balanced accuracy score: 1.0
Macro F1 score: 1.0
Weighted F1 score: 1.0
Macro precision score: 1.0
Weighted precision score: 1.0
Macro recall score: 1.0
Weighted recall score: 1.0


In [20]:
model_path = IMPUTER_MODEL_DIR / 'rf_pipeline_rooms.joblib'

if not os.path.exists(model_path):
    rf_pipeline_rooms = train_and_validate_model(rf_pipeline_rooms, X_train, y_train, X_val, y_val, 'classification')
    joblib.dump(rf_pipeline_rooms, model_path, compress=('lzma', 9))
else:
    rf_pipeline_rooms = joblib.load(model_path)
    _ = train_and_validate_model(rf_pipeline_rooms, X_train, y_train, X_val, y_val, 'classification')

Results for validation set:
Accuracy score: 0.8350705478581416
Balanced accuracy score: 0.33657015622976827
Macro F1 score: 0.376981223121691
Weighted F1 score: 0.831617850387697
Macro precision score: 0.4883131102823608
Weighted precision score: 0.831007007869572
Macro recall score: 0.33657015622976827
Weighted recall score: 0.8350705478581416




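The pipeline took 4m54s to train and predict. Accuracy is 0.835, but balanced accuracy (0.337) and macro F1 (0.377) are low, which indicates that the model predicts the common room counts well and the rare ones poorly.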
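
The gap between accuracy and balanced accuracy in the results above is typical of imbalanced classes. The toy labels below are made up for illustration: a model that always predicts the majority class scores well on accuracy, while balanced accuracy, the mean of per-class recall, exposes the missed minority classes.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: class 3 dominates, classes 1 and 5 are rare
y_true = [3] * 8 + [1, 5]

# A degenerate model that always predicts the majority class
y_pred = [3] * 10

# Fraction of correct predictions: 8 out of 10
accuracy = accuracy_score(y_true, y_pred)

# Mean of per-class recall: (0 + 1 + 0) / 3
balanced = balanced_accuracy_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(f"Balanced accuracy: {balanced:.3f}")
```

For `rooms`, this suggests that the macro-averaged metrics are the ones to watch when comparing imputation models, since the weighted ones are dominated by the most frequent room counts.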

In [21]:
model_path = IMPUTER_MODEL_DIR / 'knn_pipeline_rooms.joblib'

if not os.path.exists(model_path):
    knn_pipeline_rooms = train_and_validate_model(knn_pipeline_rooms, X_train, y_train, X_val, y_val, 'classification')
    joblib.dump(knn_pipeline_rooms, model_path, compress=('lzma', 9))
else:
    knn_pipeline_rooms = joblib.load(model_path)
    _ = train_and_validate_model(knn_pipeline_rooms, X_train, y_train, X_val, y_val, 'classification')

Results for validation set:
Accuracy score: 0.804902334646837
Balanced accuracy score: 0.3300829740211839
Macro F1 score: 0.3648131769531161
Weighted F1 score: 0.802039007879012
Macro precision score: 0.4476826414403684
Weighted precision score: 0.8007987364640089
Macro recall score: 0.3300829740211839
Weighted recall score: 0.804902334646837




The pipeline took 5m15s to train and predict. KNN is competitive with random forest but does not beat it on any metric.

##### Impute `rooms` using the best model

In [22]:
df_test = df_test.drop(columns=[target]).dropna()
df_transactions_rooms_imputed = impute_with_model(rf_pipeline_rooms, df_train, df_test, target='rooms')
df_transactions_rooms_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265225 entries, 252 to 75120
Columns: 1905 entries, township_BANDAR BARU SRI PETALING to rooms
dtypes: float64(6), int32(3), int64(1896)
memory usage: 3.8 GB


In [23]:
for column in df_transactions_rooms_imputed.columns:
    isna_count = df_transactions_rooms_imputed[column].isna().sum()
    if isna_count > 0:
        print(column, isna_count)

In [None]:
df_transactions_rooms_imputed.to_parquet(TRANSFORMED_DATA_DIR / 'transactions_KL_ckpt5_imputed_manual.parquet')

In [20]:
df_transactions_bayesianridge_multi_imputed.to_parquet(TRANSFORMED_DATA_DIR / 'transactions_KL_ckpt5_multi_imputed_bayesianridge.parquet')
df_transactions_rf_multi_imputed.to_parquet(TRANSFORMED_DATA_DIR / 'transactions_KL_ckpt5_multi_imputed_rf.parquet')

We will use the Bayesian-ridge-imputed data in the subsequent steps.