# Introduction

Now it is time to train different models and see how they perform. We will:
- test different models,
- test performance,
- tune hyperparameters,
- evaluate results,
- visualize results.

In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

<span style="color: lightgreen; font-weight: bold;"> 1. RandomForestRegression model </span>

In [31]:
from Models import RandomForestModel

Let's initialize and train the model.

In [32]:
data = pd.read_csv('02_Prepared_data.csv')

y = data['GrAppv']
X = data.drop(columns=['GrAppv'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

RFmodel = RandomForestModel()
RFmodel.train(X_train, y_train)

In [33]:
# Performance evaluation

mse, r2 = RFmodel.evaluate(X_test, y_test)
print(f"MSE: {mse:.2f}")
print(f"RMSE: {mse**0.5:.2f}")
print(f"R^2: {r2:.4f}")

MSE: 38137922708.04
RMSE: 195289.33
R^2: 0.4732


The model explains about 47% of the variance in loan amounts. This is a moderate fit, but far from highly accurate.
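To make these numbers concrete, here is a minimal sketch of how MSE, RMSE, and R^2 are typically computed. The helper `regression_metrics` and the toy values are illustrative assumptions; the notebook's own `RFmodel.evaluate` may be implemented differently.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot

# Toy values for illustration only (not taken from the loan data).
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  R^2: {r2:.4f}")
```

R^2 compares the model's squared error with that of a baseline that always predicts the mean, so a value near 0.47 means the model removes a little under half of that baseline error.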

An RMSE of about 195k is very high. A likely cause is that we trained the model on raw loan amounts, which are heavily skewed.
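One quick way to check a skew claim like this is to compare the skewness of the raw values with the skewness after a log transform. The sketch below uses made-up loan amounts and a hypothetical `skewness` helper; in the notebook, the same check could be run on `y_train`.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

# Hypothetical loan amounts: several small loans and one very large one.
amounts = [5_000, 10_000, 20_000, 25_000, 50_000, 2_000_000]
log_amounts = [math.log1p(a) for a in amounts]

raw_skew = skewness(amounts)
log_skew = skewness(log_amounts)
print(f"skew (raw):   {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```

A large positive skew means a few very large values dominate the squared-error metrics, which inflates MSE and RMSE.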

Let's try to improve the model by training it on a log-transformed target.
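The idea can be sketched as follows: fit on `log1p(y)`, then map predictions back with `expm1` so they are in the original units. `LogTargetWrapper` and `MeanRegressor` are hypothetical names used only for illustration, and the wrapper assumes a `fit`/`predict` interface, whereas the notebook's `RandomForestModel` exposes `train` and `evaluate`.

```python
import math

class LogTargetWrapper:
    """Fit a fit/predict-style regressor on log1p(y); predict on the original scale."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        # log1p(y) = log(1 + y), defined for y >= 0 and safe at y == 0.
        self.model.fit(X, [math.log1p(v) for v in y])
        return self

    def predict(self, X):
        # expm1 inverts log1p, returning predictions in the original units.
        return [math.expm1(p) for p in self.model.predict(X)]


class MeanRegressor:
    """Stand-in model for the demo: always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [[0], [1], [2]]
y = [9.0, 99.0, 999.0]  # log1p(y) is log(10), log(100), log(1000)
wrapped = LogTargetWrapper(MeanRegressor()).fit(X, y)
preds = wrapped.predict(X)
print(preds[0])
```

Metrics should be computed after the back-transformation, otherwise the new RMSE is in log units and cannot be compared with the value above. scikit-learn also offers `sklearn.compose.TransformedTargetRegressor`, which applies a `func` and `inverse_func` pair around a regressor in the same way.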