# Data Cleaning: Transactions

In [1]:
import os
from typing import Union

import joblib
import numpy as np
import pandas as pd
import skops.io as sio
from tqdm import tqdm

import helpers
from helpers import (
    CHARTS_DIR, RAW_DATA_DIR, IMPUTER_MODEL_DIR
)

import plotly.express as px
import plotly.graph_objects as go

In [2]:
df_transactions_encoded = pd.read_parquet(RAW_DATA_DIR / 'transactions_KL_ckpt4_encoded.parquet')
df_transactions_encoded

Unnamed: 0,township,spa_date,address,building_type,tenure,floors,rooms,land_area,built_up,price_psf,price
0,BANDAR BARU SRI PETALING,2023-06-09,"✕✕✕, JALAN PIKRAMA",TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,1,,"2,196 ft²",,342,750000
1,BANDAR BARU SRI PETALING,2023-06-01,"✕✕. ✕✕, JALAN PERLAK 3",TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2,,753 ft²,,398,300000
2,BANDAR BARU SRI PETALING,2023-05-29,"✕✕ ✕, JALAN 12/149L",TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2½,,"3,197 ft²",,188,600000
3,BANDAR BARU SRI PETALING,2023-05-25,"✕✕. ✕✕✕, JALAN PASAI",TERRACE HOUSE - INTERMEDIATE,LEASEHOLD,2,,753 ft²,,531,400000
4,BANDAR BARU SRI PETALING,2023-05-22,"✕✕, JALAN SRI PETALING 5",SEMI-D,LEASEHOLD,2½,,"4,801 ft²",,250,1200000
...,...,...,...,...,...,...,...,...,...,...,...
294562,HERITAGE STATION HOTEL,1990-11-13,"✕✕✕-✕✕✕, BB WANGSA MAJU",FLAT,LEASEHOLD,1,2,493 ft²,493 ft²,71,35000
294563,IDAMAN PUTERI,2005-01-10,"✕✕-✕, JALAN GOMBAK",CONDOMINIUM,FREEHOLD,1,3,1454 ft²,1454 ft²,150,218025
294564,KELAB LE CHATEAU II,2008-02-25,"✕-✕✕-✕, JALAN KIARA 3",CONDOMINIUM,FREEHOLD,1,3,593 ft²,593 ft²,194,115000
294565,MUTIARA SENTUL CONDOMINIUM,2009-08-10,"✕-✕-✕, OFF JALAN SENTUL",APARTMENT,LEASEHOLD,1,2,1193 ft²,1193 ft²,197,235000


In [3]:
df_transactions_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294567 entries, 0 to 294566
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   township       294567 non-null  object        
 1   spa_date       294567 non-null  datetime64[ns]
 2   address        294546 non-null  object        
 3   building_type  294567 non-null  object        
 4   tenure         294567 non-null  object        
 5   floors         294567 non-null  object        
 6   rooms          294567 non-null  object        
 7   land_area      294567 non-null  object        
 8   built_up       294567 non-null  object        
 9   price_psf      294567 non-null  object        
 10  price          294567 non-null  object        
dtypes: datetime64[ns](1), object(10)
memory usage: 24.7+ MB


### Concluding Remarks from Data Cleaning 1: Recap
1. The following data cleaning steps have been performed:
    - Removed address column
    - Changed fraction to decimal
    - Removed commas in numerical values
    - Removed units in numerical values
    - Removed exact duplicates
    - Removed outliers using HDBSCAN, based on:
        - Continuous variables: `land_area`, `built_up`, `price_psf` (3D)
        - Ordinal variables: `floors`, `rooms` (2D)
2. Investigated missing values in `built_up` and `rooms`
    - Investigated correlation and association between features to determine which features to use to impute missing values
3. Encoded features for imputation using one hot encoding

Next, we should proceed to impute the missing values in `built_up` and `rooms`.

## Imputing missing values

Based on the literature, the following imputation methods have been identified:
1. Random forest imputation (Jager et al., 2021) for MCAR, MAR and MNAR data in various domains
2. Multiple imputation by deterministic regression (Donlen, 2022) for MCAR data in the real estate domain
3. MissForest (Waljee et al., 2013) for MCAR data in the medical domain
4. Predictive mean matching, PMM (Heidt, 2019) for MAR data in the medical domain
5. KNN imputation (Jadhav et al., 2019) for MCAR, MAR and MNAR data in UCI datasets

However, when filtered by domain (real estate), only three methods are identified:
1. Random forest imputation
2. KNN imputation
3. Multiple imputation by deterministic regression

These are machine learning approaches to imputation: the feature with missing values is treated as the target variable, and the features without missing values are treated as the independent variables. To get a more reliable picture of how each imputation method performs, we use cross validation:
1. Split the dataset into train and test, where train is the data with labels and test is the data without labels
2. Split the train dataset into train and validation
3. Cross validate the train data:
    - Create a pipeline with scaler and model
    - Run cross validation with scoring
    - Output both train and validation scores
    - Return the pipeline and cross validation results
4. Train and evaluate the model with validation data
    - Train the pipeline with train data
    - Predict the validation data
    - Evaluate the model with validation data
5. Evaluate the pipeline with validation data and print out the metrics
6. Predict the test data
7. Return the imputed dataset and the fitted pipeline
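
The steps above can be sketched end to end on a small synthetic dataset. This is a minimal illustration only: the `df_toy` frame, its column names and the reduced `n_estimators=50` are assumptions made for the example and are not part of the transactions data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: the target depends on two features
rng = np.random.default_rng(0)
df_toy = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df_toy["target"] = 3 * df_toy["x1"] - 2 * df_toy["x2"] + rng.normal(scale=0.1, size=300)
df_toy.loc[df_toy.index[:60], "target"] = np.nan  # 60 rows need imputing

# Step 1: rows with labels form the train set, rows without labels the test set
df_labelled = df_toy[df_toy["target"].notna()]
df_unlabelled = df_toy[df_toy["target"].isna()].drop(columns=["target"])

# Step 2: hold out part of the labelled rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    df_labelled.drop(columns=["target"]), df_labelled["target"],
    test_size=0.2, random_state=0,
)

# Step 3: cross validate a scaler + model pipeline on the train split
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
toy_cv = cross_validate(toy_pipeline, X_tr, y_tr, cv=5, scoring="r2", return_train_score=True)

# Steps 4-5: fit on the train split and evaluate on the validation split
toy_pipeline.fit(X_tr, y_tr)
toy_val_r2 = r2_score(y_va, toy_pipeline.predict(X_va))

# Steps 6-7: predict the unlabelled rows and stitch the dataset back together
df_unlabelled = df_unlabelled.assign(target=toy_pipeline.predict(df_unlabelled))
df_toy_imputed = pd.concat([df_labelled, df_unlabelled]).sort_index()

print(pd.DataFrame(toy_cv)[["train_score", "test_score"]])
print(f"Validation R2: {toy_val_r2:.3f}")
```

Comparing `train_score` with `test_score` per fold is what makes overfitting visible before any value is imputed.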

References:
- Jager et al. (2021): https://www.frontiersin.org/articles/10.3389/fdata.2021.693674/full
- Donlen (2022): https://egrove.olemiss.edu/cgi/viewcontent.cgi?article=3744&context=hon_thesis
- Waljee et al. (2013): https://bmjopen.bmj.com/content/3/8/e002847.citation-tools
- Heidt (2019): https://dc.etsu.edu/cgi/viewcontent.cgi?article=5014&context=etd
- Jadhav et al. (2019): https://www.tandfonline.com/doi/full/10.1080/08839514.2019.1637138

In [93]:
from math import sqrt

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
    median_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score
)
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline

random_state = 42

Model = Union[RandomForestRegressor, KNeighborsRegressor, RandomForestClassifier, KNeighborsClassifier]

#### Imputing `built_up`

The steps are:
1. Remove `rooms` from the dataset as it has too many missing values
2. Cross validate for `built_up` using:
    - Random forest imputation
    - KNN imputation
3. Use the better model to impute `built_up`

In [94]:
target = 'built_up'

# Split the dataset into train and test, where train are the data with labels and test are the data without labels
df_train = df_transactions_encoded[df_transactions_encoded[target].notna()].drop(columns=['rooms']).dropna()
df_test = df_transactions_encoded[df_transactions_encoded[target].isna()].drop(columns=['rooms'])

# Split the train dataset into train and validation
X_train, X_val, y_train, y_val = train_test_split(df_train.drop(columns=[target]), df_train[target], test_size=0.2, random_state=random_state)

In [96]:
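def cross_validation_with_pipeline(pipeline: Pipeline, X_train: pd.DataFrame, y_train: pd.Series, task: str) -> dict:
    """Run 5-fold cross validation and return the train and validation scores for each fold."""

    # Pick the scorers that match the task
    if task == 'regression':
        scoring = ('r2', 'neg_root_mean_squared_error', 'neg_mean_absolute_percentage_error', 'neg_median_absolute_error')
    elif task == 'classification':
        # The plain 'f1', 'precision', 'recall' and 'roc_auc' scorers only support binary targets,
        # so the weighted / one-vs-rest variants are used for multiclass targets
        scoring = ('accuracy', 'balanced_accuracy', 'f1_weighted', 'precision_weighted', 'recall_weighted', 'roc_auc_ovr_weighted')
    else:
        scoring = None  # Fall back to the estimator's default scorer

    cv_results = cross_validate(pipeline, X_train, y_train, cv=5, scoring=scoring, return_train_score=True, n_jobs=4)

    return cv_results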

##### Cross validation for `built_up` using various techniques

In [115]:
# Create a pipeline with scaler and model
rf_pipeline_built_up = Pipeline([('scaler', StandardScaler()), ('model', RandomForestRegressor(random_state=random_state, n_jobs=4))])

# Cross validate the pipeline
if not os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_rf_built_up.joblib'):
    cv_results_built_up = cross_validation_with_pipeline(rf_pipeline_built_up, X_train, y_train, 'regression')
    joblib.dump(cv_results_built_up, IMPUTER_MODEL_DIR / 'cv_results_rf_built_up.joblib')
else:
    cv_results_built_up = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_rf_built_up.joblib')

pd.DataFrame(cv_results_built_up)

Results for cross validation...


Unnamed: 0,fit_time,score_time,test_r2,train_r2,test_neg_root_mean_squared_error,train_neg_root_mean_squared_error,test_neg_mean_absolute_percentage_error,train_neg_mean_absolute_percentage_error,test_neg_median_absolute_error,train_neg_median_absolute_error
0,1221.663679,11.541488,0.824293,0.973771,-357.268651,-136.370194,-4652190000000000.0,-1715361000000000.0,-0.0,-0.0
1,1200.971122,9.686063,0.798486,0.975722,-379.701082,-131.454077,-5023980000000000.0,-1716289000000000.0,-0.0,-0.0
2,1233.660341,5.176621,0.854139,0.973795,-316.718636,-137.235538,-2803023000000000.0,-1677708000000000.0,-0.0,-0.0
3,1236.589329,6.362303,0.844816,0.973615,-323.399088,-138.03593,-5624051000000000.0,-1652899000000000.0,-0.0,-0.0
4,285.546735,1.64555,0.784114,0.977032,-404.786024,-126.8785,-3922569000000000.0,-1042910000000000.0,-0.0,-0.0


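The cross validation using random forest took 55m35s. The results were not great: test R2 ranged from 0.78 to 0.85 while train R2 stayed around 0.97 to 0.98, which indicates overfitting. The MAPE values are on the order of 10^15 and the median absolute error is exactly 0 in every fold, so neither metric is informative here.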
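
The MAPE values in the table above are on the order of 10^15. One plausible explanation, not verified against the data here, is that some `built_up` targets are zero or close to zero: scikit-learn's `mean_absolute_percentage_error` divides by the absolute true value, clipped to machine epsilon, so a single zero target inflates the score enormously. The toy arrays below are made up to show the effect.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical targets: one true value is exactly zero
y_true_toy = np.array([0.0, 1000.0, 1500.0])
y_pred_toy = np.array([10.0, 1000.0, 1500.0])

# The zero target makes the relative error explode
mape_all = mean_absolute_percentage_error(y_true_toy, y_pred_toy)

# Restricting the metric to non-zero targets gives a readable number
nonzero = y_true_toy != 0
mape_nonzero = mean_absolute_percentage_error(y_true_toy[nonzero], y_pred_toy[nonzero])

print(f"MAPE with zero target:    {mape_all:.3e}")
print(f"MAPE without zero target: {mape_nonzero:.3e}")
```

If zeros do turn out to be present in the target, checking `(y_train == 0).sum()` before scoring would confirm it.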

In [118]:
# Create a pipeline with scaler and model
knn_pipeline_built_up = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsRegressor(n_jobs=4))])

# Cross validate the pipeline
if not os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_knn_built_up.joblib'):
    cv_results_built_up = cross_validation_with_pipeline(knn_pipeline_built_up, X_train, y_train, 'regression')
    joblib.dump(cv_results_built_up, IMPUTER_MODEL_DIR / 'cv_results_knn_built_up.joblib')
else:
    cv_results_built_up = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_knn_built_up.joblib')

pd.DataFrame(cv_results_built_up)

MemoryError: Unable to allocate 2.13 GiB for an array with shape (1896, 151045) and data type int64

The cross validation using KNN ran for around 30m and ended with the `MemoryError` shown above, so no KNN scores were obtained.
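
The shape in the `MemoryError` above, `(1896, 151045)` with dtype `int64`, matches the 1,896 `int64` one-hot columns of the encoded frame. The arithmetic below reproduces the reported 2.13 GiB and shows how much a smaller integer dtype would need. The downcast on `df_onehot_toy` is a hypothetical illustration; whether it is enough for the KNN pipeline is untested, since `StandardScaler` converts its input to floats anyway.

```python
import numpy as np
import pandas as pd

# Shape reported in the MemoryError: (columns, rows)
n_columns, n_rows = 1896, 151045


def array_gib(dtype) -> float:
    """Size in GiB of an array with the reported shape and the given dtype."""
    return n_columns * n_rows * np.dtype(dtype).itemsize / 2**30


gib_int64 = array_gib(np.int64)
gib_uint8 = array_gib(np.uint8)
print(f"int64: {gib_int64:.2f} GiB, uint8: {gib_uint8:.2f} GiB")

# Hypothetical one-hot frame: 0/1 indicators fit comfortably in uint8
df_onehot_toy = pd.DataFrame(
    {"township_A": [1, 0, 0], "township_B": [0, 1, 0]}, dtype="int64"
)
df_onehot_small = df_onehot_toy.astype("uint8")

bytes_before = df_onehot_toy.memory_usage(index=False).sum()
bytes_after = df_onehot_small.memory_usage(index=False).sum()
print(f"toy frame: {bytes_before} bytes -> {bytes_after} bytes")
```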

##### Check model performance on validation data

In [98]:
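def validate_model(pipeline: Pipeline, X_train: pd.DataFrame, y_train: pd.Series, X_val: pd.DataFrame, y_val: pd.Series, task: str) -> Pipeline:
    """Fit the pipeline on the train split, print metrics for the validation split and return the fitted pipeline."""

    pipeline = pipeline.fit(X_train, y_train)
    y_val_pred = pipeline.predict(X_val)

    print("Results for validation set:")
    if task == 'regression':
        print(f"R2 score: {r2_score(y_val, y_val_pred)}")
        print(f"RMSE score: {sqrt(mean_squared_error(y_val, y_val_pred))}")
        print(f"MAPE score: {mean_absolute_percentage_error(y_val, y_val_pred)}")
        print(f"Median AE score: {median_absolute_error(y_val, y_val_pred)}")
    elif task == 'classification':
        # average='weighted' is required for multiclass targets; the default 'binary' would raise
        print(f"Accuracy score: {accuracy_score(y_val, y_val_pred)}")
        print(f"Balanced accuracy score: {balanced_accuracy_score(y_val, y_val_pred)}")
        print(f"F1 score: {f1_score(y_val, y_val_pred, average='weighted')}")
        print(f"Precision score: {precision_score(y_val, y_val_pred, average='weighted')}")
        print(f"Recall score: {recall_score(y_val, y_val_pred, average='weighted')}")
        # Multiclass ROC AUC needs class probabilities rather than predicted labels;
        # this assumes every class seen during training also appears in y_val
        y_val_proba = pipeline.predict_proba(X_val)
        print(f"ROC AUC score: {roc_auc_score(y_val, y_val_proba, multi_class='ovr', average='weighted', labels=pipeline.classes_)}")

    return pipeline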

In [99]:
rf_pipeline_built_up = validate_model(rf_pipeline_built_up, X_train, y_train, X_val, y_val, 'regression')

Results for validation set:
R2 score: 0.8080609693884588
RMSE score: 381.9099647033172
MAPE score: 3890573305511410.5
Median AE score: 0.0


In [None]:
knn_pipeline_built_up = validate_model(knn_pipeline_built_up, X_train, y_train, X_val, y_val, 'regression')



##### Impute `built_up` using the best model

In [100]:
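def impute_with_model(pipeline: Pipeline, df_train: pd.DataFrame, df_test: pd.DataFrame, target: str) -> pd.DataFrame:
    """Predict the missing target for df_test with a fitted pipeline and return the train and test rows combined."""

    # df_test must contain the same feature columns the pipeline was fitted on
    y_test_pred = pipeline.predict(df_test)

    # assign() returns a new frame, so the caller's df_test is not modified in place
    df_test = df_test.assign(**{target: y_test_pred})
    df_imputed = pd.concat([df_train, df_test])

    return df_imputed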

In [101]:
df_test = df_test.drop(columns=[target]).dropna()
df_transactions_built_up_imputed = impute_with_model(rf_pipeline_built_up, df_train, df_test, target='built_up')
df_transactions_built_up_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265270 entries, 253 to 76762
Columns: 1904 entries, township_BANDAR BARU SRI PETALING to day
dtypes: float64(5), int32(3), int64(1896)
memory usage: 3.8 GB


##### Join the imputed `built_up` data with the original data

In [107]:
df_transactions_built_up_imputed = df_transactions_built_up_imputed.join(df_transactions_encoded['rooms'])
df_transactions_built_up_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265270 entries, 253 to 76762
Columns: 1905 entries, township_BANDAR BARU SRI PETALING to rooms
dtypes: float64(6), int32(3), int64(1896)
memory usage: 3.8 GB
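
The join above relies on index alignment: `DataFrame.join` performs a left join on the index by default, so every row of the imputed frame is kept and receives the `rooms` value stored under the same index label in the original frame. A small sketch with made-up labels and values:

```python
import pandas as pd

# Hypothetical frames: the left index is unordered, like the imputed frame's
df_left_toy = pd.DataFrame({"built_up": [900.0, 1200.0]}, index=[253, 7])
rooms_toy = pd.Series([2.0, None, 3.0], index=[7, 253, 99], name="rooms")

# Left join on the index: label 99 is dropped, label 253 keeps its missing value
df_joined_toy = df_left_toy.join(rooms_toy)
print(df_joined_toy)
```

This only works because the index of the encoded frame was preserved through the split, imputation and concatenation steps; a `reset_index` anywhere in between would silently misalign the rows.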


In [108]:
# Check for missing values in all columns
for column in df_transactions_built_up_imputed.columns:
    isna_count = df_transactions_built_up_imputed[column].isna().sum()
    if isna_count > 0:
        print(column, isna_count)

rooms 29217


#### Imputing `rooms`

The steps are:
1. Remove `rooms` classes with 5 or fewer samples so that CV can be performed
2. Cross validate for `rooms` using:
    - Random forest imputation
    - KNN imputation
3. Use the better model to impute `rooms`
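
The first step, dropping rare classes, can be sketched on a toy frame. The data and the `min_samples` threshold below are illustrative assumptions; the notebook's own filter uses `len(x) > 5`. The second variant gives the same result without calling a Python lambda per group, which may matter on a frame with hundreds of thousands of rows.

```python
import pandas as pd

# Hypothetical data: room count 7 appears only twice
df_rooms_toy = pd.DataFrame({
    "rooms": [3] * 8 + [2] * 6 + [7] * 2,
    "built_up": range(16),
})
min_samples = 5  # assumed threshold for this example

# Variant 1: groupby + filter keeps whole groups that are large enough
df_rooms_kept = df_rooms_toy.groupby("rooms").filter(lambda g: len(g) >= min_samples)

# Variant 2: map each row to its class frequency and use a boolean mask
class_counts = df_rooms_toy["rooms"].map(df_rooms_toy["rooms"].value_counts())
df_rooms_kept_fast = df_rooms_toy[class_counts >= min_samples]

print(df_rooms_kept["rooms"].value_counts())
```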

In [109]:
target = 'rooms'

# Split the dataset into train and test, where train are the data with labels and test are the data without labels
df_train = df_transactions_built_up_imputed[df_transactions_built_up_imputed[target].notna()]
df_train = df_train.groupby(target).filter(lambda x : len(x) > 5).dropna() # Keep only rooms classes with more than 5 samples so that CV can be performed
df_test = df_transactions_built_up_imputed[df_transactions_built_up_imputed[target].isna()]

# Split the train dataset into train and validation
X_train, X_val, y_train, y_val = train_test_split(df_train.drop(columns=[target]), df_train[target], test_size=0.2, random_state=random_state)

##### Cross validation for `rooms` using various techniques

In [110]:
# Create a pipeline with scaler and model
rf_pipeline_rooms = Pipeline([('scaler', StandardScaler()), ('model', RandomForestClassifier(random_state=random_state, n_jobs=4))])

# Cross validate the pipeline
if not os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_rf_rooms.joblib'):
    cv_results_rooms = cross_validation_with_pipeline(rf_pipeline_rooms, X_train, y_train, 'regression')
    joblib.dump(cv_results_rooms, IMPUTER_MODEL_DIR / 'cv_results_rf_rooms.joblib')
else:
    cv_results_rooms = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_rf_rooms.joblib')

pd.DataFrame(cv_results_rooms)

Results for cross validation...
fit_time: [294.35619926 324.97581124 287.97160578 324.420753   112.68837237]
score_time: [10.89190555  8.71300745 13.77816391  8.32677245  2.44974971]
test_r2: [0.35317275 0.43742745 0.48609734 0.45900313 0.50926415]
train_r2: [0.99724572 0.99693067 0.99773382 0.99773768 0.99708288]
test_neg_root_mean_squared_error: [-0.75864412 -0.72678535 -0.69506655 -0.69731085 -0.66276899]
train_neg_root_mean_squared_error: [-0.0502898  -0.05273184 -0.04530322 -0.04552191 -0.05171767]
test_neg_mean_absolute_percentage_error: [-1.92018098e+13 -1.65779600e+13 -1.59816305e+13 -1.81284167e+13
 -1.77706190e+13]
train_neg_mean_absolute_percentage_error: [-1.78898849e+11 -1.78898849e+11 -1.78898849e+11 -1.19265899e+11
 -1.78898849e+11]
test_neg_median_absolute_error: [-0. -0. -0. -0. -0.]
train_neg_median_absolute_error: [-0. -0. -0. -0. -0.]


The cross validation took 8m5s to complete. Some classes have only one transaction, so `rooms` values with 5 or fewer samples were excluded from cross validation.

The cross validation results show that the model overfits: train R2 is about 0.997 while test R2 ranges from 0.35 to 0.51.
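
The scores above are regression metrics (R2, RMSE, MAPE) computed on the class labels predicted by a `RandomForestClassifier`, because the task passed to the helper was `'regression'`. As a hedged alternative, the sketch below scores a classifier pipeline with multiclass classification scorers on synthetic data; the data from `make_classification`, the reduced `n_estimators=50` and the choice of scorers are assumptions for illustration, not results for `rooms`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical 3-class problem standing in for the room counts
X_clf_toy, y_clf_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

clf_pipeline_toy = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Scorers that accept multiclass targets
clf_scoring = ("accuracy", "balanced_accuracy", "f1_weighted")

clf_cv_toy = cross_validate(
    clf_pipeline_toy, X_clf_toy, y_clf_toy,
    cv=5, scoring=clf_scoring, return_train_score=True,
)

print(pd.DataFrame(clf_cv_toy)[["train_accuracy", "test_accuracy", "test_f1_weighted"]])
```

Treating `rooms` as an ordinal regression target is also defensible; the point is only that the estimator type and the metrics should agree.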

In [None]:
# Create a pipeline with scaler and model
knn_pipeline_rooms = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsClassifier(n_jobs=4))])

# Cross validate the pipeline
if not os.path.exists(IMPUTER_MODEL_DIR / 'cv_results_knn_rooms.joblib'):
    cv_results_rooms = cross_validation_with_pipeline(knn_pipeline_rooms, X_train, y_train, 'regression')
    joblib.dump(cv_results_rooms, IMPUTER_MODEL_DIR / 'cv_results_knn_rooms.joblib')
else:
    cv_results_rooms = joblib.load(IMPUTER_MODEL_DIR / 'cv_results_knn_rooms.joblib')

pd.DataFrame(cv_results_rooms)



##### Check model performance on validation data

In [111]:
rf_pipeline_rooms = validate_model(rf_pipeline_rooms, X_train, y_train, X_val, y_val, 'regression')

Results for validation set:
R2 score: 0.5527093748549827
RMSE score: 0.5689410437602002
MAPE score: 15647437373178.346
Median AE score: 0.0


The pipeline took 2m27s to train and predict. However, the results were not good: R2 on the validation set was only 0.55, with an RMSE of about 0.57 rooms.

In [None]:
knn_pipeline_rooms = validate_model(knn_pipeline_rooms, X_train, y_train, X_val, y_val, 'regression')


##### Impute `rooms` using the best model

In [112]:
df_test = df_test.drop(columns=[target]).dropna()
df_transactions_rooms_imputed = impute_with_model(rf_pipeline_rooms, df_train, df_test, target='rooms')
df_transactions_rooms_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265225 entries, 253 to 76762
Columns: 1905 entries, township_BANDAR BARU SRI PETALING to rooms
dtypes: float64(6), int32(3), int64(1896)
memory usage: 3.8 GB


In [113]:
for column in df_transactions_rooms_imputed.columns:
    isna_count = df_transactions_rooms_imputed[column].isna().sum()
    if isna_count > 0:
        print(column, isna_count)