# Exercise
This dataset describes how much money a person can borrow against a mortgage on their home. It includes the following features:
* Gender: Gender of the borrower (including two values 'F' and 'M')
* Age: Age of the customer applying for a loan (including positive integer values)
* Income (USD): Customer's income in USD (value is a positive number)
* Income Stability: The level of the customer's income stability (including two values, 'Low' and 'High')
* Property Age: Age of the house in days (value is a positive number)
* Property Location: Location of the house (including 'Rural', 'Urban', and 'Semi-Urban')
* Property Price: The value of the house in USD (including positive real values)
* Loan Sanction Amount (USD): Amount that customers can borrow in USD (target value)

Based on practice sample #1, proceed as follows:
1. Read the data
2. Visualize some information about the data
3. Normalize the data to train a linear regression model
4. Train a linear regression model and show the model's intercept and coefficients
5. Learn from the scikit-learn documentation how to use Ridge, Lasso, and ElasticNet, then compare the error of all 3 algorithms with Linear Regression (https://scikit-learn.org/stable/index.html)
6. Try a polynomial of order 2 and compare it with the previous results. What happens to the result if the order n is chosen too high?


In [1]:
# mount data from google drive to colab
from google.colab import drive
drive.mount('/content/drive')

#import library
import pandas as pd # pandas
import numpy as np # numpy
import time
import os
import matplotlib.pyplot as plt
import seaborn as sns


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Prepare and Analyze Data

1. Load Dataset
2. Analyze Dataset
3. Preprocess data (type, null, missing, ...)
4. Feature Engineering

## Load Dataset

In [2]:
# read data using Pandas DataFrame
def read_dataset(path):
    # Todo: read_csv from a path and return a DataFrame
    df = pd.read_csv(path)
    display(df.head())
    display(df.describe())
    return df

In [3]:
_ROOT = "/content/drive/MyDrive/TH_NMMH/Week 01" # Path to your file

df = read_dataset(path=os.path.join(_ROOT, 'final_house_loan.csv'))
#ToDo: Show histogram of dataframe

| | Gender | Age | Income (USD) | Income Stability | Property Age | Property Location | Property Price | Loan Sanction Amount (USD) |
|---|---|---|---|---|---|---|---|---|
| 0 | F | 19 | 1641.25 | Low | 1651.25 | Rural | 59641.82 | 21026.420753 |
| 1 | M | 29 | 1989.71 | Low | 1990.71 | Urban | 179858.51 | 60595.183366 |
| 2 | F | 37 | 1849.91 | Low | 1856.91 | Rural | 117297.62 | 39181.648002 |
| 3 | M | 65 | 2735.18 | High | 2747.18 | Rural | 354417.72 | 128497.710865 |
| 4 | F | 62 | 4741.78 | High | 4740.78 | Urban | 82049.8 | 39386.919336 |


| | Age | Income (USD) | Property Age | Property Price | Loan Sanction Amount (USD) |
|---|---|---|---|---|---|
| count | 47297.0 | 47265.0 | 47263.0 | 47297.0 | 47297.0 |
| mean | 40.000063 | 2586.684384 | 2586.611058 | 135088.0 | 46487.229765 |
| std | 16.086128 | 1558.768809 | 1558.842286 | 94578.75 | 32549.905634 |
| min | 18.0 | 372.7 | 370.7 | 7859.62 | 254.586578 |
| 25% | 24.0 | 1653.74 | 1652.82 | 62504.08 | 21782.822159 |
| 50% | 40.0 | 2245.48 | 2244.81 | 113093.6 | 38822.132402 |
| 75% | 55.0 | 3128.56 | 3128.38 | 181954.6 | 62612.236905 |
| max | 65.0 | 54662.75 | 54647.75 | 1077967.0 | 366131.165218 |
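The `#ToDo: Show histogram of dataframe` step in the cell above is left open. One possible way to fill it in is pandas' built-in `DataFrame.hist`, sketched here on a small made-up frame. The helper name `plot_numeric_histograms` and the demo values are illustrative assumptions, not part of the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_numeric_histograms(frame: pd.DataFrame, bins: int = 30):
    """Draw one histogram per numeric column and return the array of Axes."""
    numeric = frame.select_dtypes(include="number")
    axes = numeric.hist(bins=bins, figsize=(10, 6))
    plt.tight_layout()
    return axes


# Hypothetical stand-in for the loan data, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 66, size=500),
    "Income (USD)": rng.lognormal(mean=7.7, sigma=0.5, size=500),
    "Gender": rng.choice(["F", "M"], size=500),  # non-numeric, so it is skipped
})
axes = plot_numeric_histograms(demo)
```

With the notebook's data the call would presumably be `plot_numeric_histograms(df)`.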


## Data Analysis

In [4]:
# Data analysis
# Todo: analyze your data here

In [5]:
DISPLAY = False
plt.style.use('ggplot')

In [6]:
categorical_col = df.select_dtypes(include='object')

if DISPLAY:
  # One count plot per categorical column (number of rows in each category)
  for col in categorical_col:
    sns.countplot(data=df, x=col)
    plt.title(col)
    plt.show()

In [7]:
numeric_col = df.select_dtypes(include='number')

if DISPLAY:
  for i, col in enumerate(numeric_col):
    sns.histplot(data=numeric_col, x=col, kde=True)
    plt.title(col)
    plt.show()

In [8]:
if DISPLAY:
  for i, col in enumerate(numeric_col):
    sns.boxplot(data=df, x=col, y='Gender', hue='Gender')
    plt.title(col)
    plt.show()

In [9]:
display(df.isnull().sum())

Gender                         0
Age                            0
Income (USD)                  32
Income Stability              12
Property Age                  34
Property Location              3
Property Price                 0
Loan Sanction Amount (USD)     0
dtype: int64

## Preprocessing

In [10]:
def replace_null(data: pd.DataFrame, column: str, type: str = 'drop'):
  # Handle missing values in `column`: drop the rows, or fill with the mean or the mode
  match type.lower():
    case 'drop':
      data.dropna(subset=[column], axis=0, inplace=True)
    case 'mean':
      data[column] = data[column].fillna(data[column].mean())
    case 'mode':
      # mode() returns a Series; take the first mode so every gap is filled
      data[column] = data[column].fillna(data[column].mode().iloc[0])
  return data
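One detail of mode imputation is worth a small illustration: `Series.mode()` returns a Series (there can be several modes), and `fillna` aligns a Series argument on the index. The toy Series below is an assumption made for demonstration. It shows that passing the whole result fills only rows whose index labels match, while passing the first mode as a scalar fills every gap.

```python
import pandas as pd

s = pd.Series(["Low", None, "High", "Low", None])

modes = s.mode()                             # Series with index [0] holding "Low"
filled_by_series = s.fillna(modes)           # aligned on index: labels 1 and 4 stay missing
filled_by_scalar = s.fillna(modes.iloc[0])   # scalar: every missing value becomes "Low"

print(filled_by_series.isna().sum())  # 2
print(filled_by_scalar.isna().sum())  # 0
```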

In [11]:
def remove_outliers(data: pd.DataFrame, column: str, threshold: float = 1.5):
  Q1, Q3 = data[column].quantile([0.25, 0.75])
  IQR = Q3 - Q1

  upper_bound = Q3 + (threshold * IQR)
  lower_bound = Q1 - (threshold * IQR)
  data_filtered = data[data[column] <= upper_bound]
  data_filtered = data_filtered[data_filtered[column] >= lower_bound]

  return data_filtered
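To make the interquartile-range rule concrete, here is a worked example on eight made-up values (the numbers are assumptions chosen so the arithmetic is easy to follow). A point is kept when it lies within 1.5 IQR of the quartiles.

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1, q3 = values.quantile([0.25, 0.75])   # 12.75 and 16.5 (linear interpolation)
iqr = q3 - q1                            # 3.75
lower = q1 - 1.5 * iqr                   # 7.125
upper = q3 + 1.5 * iqr                   # 22.125

# Keep only the points inside [lower, upper]; 100 falls outside and is dropped.
kept = values[values.between(lower, upper)]
print(kept.tolist())  # [10, 12, 13, 14, 15, 16, 18]
```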

In [12]:
def preprocessing_data(df):
    # --- (Optional) Drop null datapoints or fill missing data
    # Keep your data the same if you don't want to customize it
    output_df = df.copy(deep=True)
    output_df = replace_null(data=output_df, column='Income (USD)')
    output_df = replace_null(data=output_df, column='Income Stability')
    output_df = replace_null(data=output_df, column='Property Age')
    output_df = replace_null(data=output_df, column='Property Location')
    print('__________________Checking remaining null values_________________')
    display(output_df.isnull().sum())
    print(f'\nBefore handling null, df.shape = {df.shape}')
    print(f'\nAfter handling null, df.shape = {output_df.shape}')
    print('_____________________________________________________________')

    # Remove outliers from every numeric column (this includes the target column)
    for col in output_df.select_dtypes(include='number').columns:
      output_df = remove_outliers(data=output_df, column=col)
    return output_df

In [13]:
df = preprocessing_data(df)

__________________Checking remaining null values_________________


Gender                        0
Age                           0
Income (USD)                  0
Income Stability              0
Property Age                  0
Property Location             0
Property Price                0
Loan Sanction Amount (USD)    0
dtype: int64


Before handling null, df.shape = (47297, 8)

After handling null, df.shape = (47251, 8)
_____________________________________________________________


## Feature Engineering

In [14]:
categorical_col = categorical_col.columns

In [15]:
def normalize_data(df):
  # Todo: normalize data into numerical data

  # Separate numerical and categorical columns
  numerical_cols = df.select_dtypes(include=[np.number])
  categorical_cols = df.select_dtypes(include=[object])

  # Handle numerical columns (e.g., Min-Max Scaling)
  numerical_cols = (numerical_cols - numerical_cols.min()) / (numerical_cols.max() - numerical_cols.min())

  # Handle categorical columns (e.g., One-Hot Encoding)
  # drop_first avoids the dummy variable trap; dtype=int keeps the columns as 0/1
  categorical_cols = pd.get_dummies(categorical_cols, drop_first=True, dtype=int)

  # Combine normalized dataframes
  df_normalized = pd.concat([numerical_cols, categorical_cols], axis=1)

  return df_normalized


In [16]:
# Heatmap of feature correlations (seaborn is already imported as sns in the first cell)

df = normalize_data(df.copy())

if DISPLAY:
  sns.heatmap(df.corr()) # Show heatmap after normalized data

In [17]:
display(df.head())

| | Age | Income (USD) | Property Age | Property Price | Loan Sanction Amount (USD) | Gender_M | Income Stability_Low | Property Location_Semi-Urban | Property Location_Urban |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.021277 | 0.272677 | 0.275282 | 0.156941 | 0.186052 | 0 | 1 | 0 | 0 |
| 1 | 0.234043 | 0.34758 | 0.348256 | 0.521292 | 0.540466 | 1 | 1 | 0 | 1 |
| 2 | 0.404255 | 0.317529 | 0.319493 | 0.331684 | 0.348666 | 0 | 1 | 0 | 0 |
| 4 | 0.93617 | 0.939143 | 0.939443 | 0.224855 | 0.350505 | 0 | 0 | 0 | 1 |
| 5 | 0.212766 | 0.330914 | 0.332879 | 0.063139 | 0.087788 | 0 | 1 | 0 | 1 |


# Apply machine learning model

## Train-test split

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [19]:
def prepare_X_y(df):
    # Split data into X and y. Return two dataframes
    X = df.drop(columns=['Loan Sanction Amount (USD)']) # Todo: Select features
    y = df['Loan Sanction Amount (USD)'] # Todo: Select label
    return X, y

X, y = prepare_X_y(df)

In [20]:
def split_train_test(X, y, train_size=0.7):
    # Split X and y into a train set and a test set with sklearn's train_test_split.
    # train_size is the proportion of the train set; random_state is fixed for reproducibility.
    trainX, testX, trainY, testY = train_test_split(X, y, train_size=train_size, random_state=24, shuffle=True)
    print('Training: ' + str(trainX.shape))
    print('Test: ' + str(testX.shape))

    return trainX, testX, trainY, testY

In [21]:
TRAIN_SIZE = 0.7

trainX, testX ,trainY, testY = split_train_test(X, y, train_size=TRAIN_SIZE)

Training: (30096, 8)
Test: (12899, 8)
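The `normalize_data` step rescales the whole table, target included, before the split, so the minimum and maximum used for scaling also come from rows that end up in the test set. A common alternative fits scikit-learn's `MinMaxScaler` on the training rows only and reuses it for the test rows. The sketch below uses made-up data; the frame `demo` and its column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features, only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.uniform(500, 5000, size=200),
    "price": rng.uniform(1e4, 5e5, size=200),
})

train, test = train_test_split(demo, train_size=0.7, random_state=24)

# Learn min and max from the training rows only.
scaler = MinMaxScaler().fit(train)

train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)

print(train_scaled.min().round(3).tolist(), train_scaled.max().round(3).tolist())
```

Test values can fall slightly outside [0, 1] under this scheme, which is expected because the test rows did not contribute to the fitted range.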


## Basic Linear Regression

In [22]:
from sklearn.linear_model import LinearRegression


def build_linear_model(X, y):
    # Todo: use sklearn model and config your parameters
    model = LinearRegression()
    # Todo: fit your model with X, y
    model.fit(X, y)
    return model

model = build_linear_model(trainX, trainY)
# Evaluate on the training set
pred = model.predict(trainX)
print(f"mean absolute error of linear model on train set: {mean_absolute_error(y_pred=pred, y_true=trainY)}\n")
# Evaluate on the test set
pred = model.predict(testX)
print(f"mean absolute error of linear model on test set: {mean_absolute_error(y_pred=pred, y_true=testY)}\n")

print(f'Coefficient of model: {model.coef_}') # print coefficient
print()
print(f'Intercept of model: {model.intercept_}') # print intercept_


mean absolute error of linear model on train set: 0.0008123334446083406

mean absolute error of linear model on test set: 0.0007647815902235622

Coefficient of model: [-0.03336023  0.02169647  0.02143111  0.99287175 -0.00204323 -0.10013464
 -0.00256543 -0.00119619]

Intercept of model: 0.1195941586912431


In [33]:
from sklearn.linear_model import Lasso


def build_lasso_model(X, y):
    # Todo: use sklearn model and config your parameters
    model = Lasso(alpha=1e-7)
    # Todo: fit your model with X, y
    model.fit(X, y)
    return model

model = build_lasso_model(trainX, trainY)
# Evaluate on the training set
pred = model.predict(trainX)
print(f"mean absolute error of lasso model on train set: {mean_absolute_error(y_pred=pred, y_true=trainY)}\n")
# Evaluate on the test set
pred = model.predict(testX)
print(f"mean absolute error of lasso model on test set: {mean_absolute_error(y_pred=pred, y_true=testY)}\n")

print(f'Coefficient of model: {model.coef_}') # print coefficient
print()
print(f'Intercept of model: {model.intercept_}') # print intercept_


mean absolute error of lasso model on train set: 0.000812723071403332

mean absolute error of lasso model on test set: 0.0007655140925335709

Coefficient of model: [-3.33576894e-02  4.30603584e-02  6.87312858e-05  9.92871044e-01
 -2.04398474e-03 -1.00130364e-01 -2.56458697e-03 -1.19610182e-03]

Intercept of model: 0.11959865685900645


In [34]:
from sklearn.linear_model import Ridge


def build_ridge_model(X, y):
    # Todo: use sklearn model and config your parameters
    model = Ridge(alpha=1e-5)
    # Todo: fit your model with X, y
    model.fit(X, y)
    return model

model = build_ridge_model(trainX, trainY)
# Evaluate on the training set
pred = model.predict(trainX)
print(f"mean absolute error of ridge model on train set: {mean_absolute_error(y_pred=pred, y_true=trainY)}\n")
# Evaluate on the test set
pred = model.predict(testX)
print(f"mean absolute error of ridge model on test set: {mean_absolute_error(y_pred=pred, y_true=testY)}\n")

print(f'Coefficient of model: {model.coef_}') # print coefficient
print()
print(f'Intercept of model: {model.intercept_}') # print intercept_


mean absolute error of ridge model on train set: 0.0008123334373267389

mean absolute error of ridge model on test set: 0.000764781575617036

Coefficient of model: [-0.03336023  0.02169644  0.02143115  0.99287174 -0.00204323 -0.10013464
 -0.00256543 -0.00119619]

Intercept of model: 0.11959415962282299


In [35]:
from sklearn.linear_model import ElasticNet


def build_elastic_model(X, y):
    # Todo: use sklearn model and config your parameters
    model = ElasticNet(alpha=1e-6)
    # Todo: fit your model with X, y
    model.fit(X, y)
    return model

model = build_elastic_model(trainX, trainY)
# Evaluate on the training set
pred = model.predict(trainX)
print(f"mean absolute error of elastic net model on train set: {mean_absolute_error(y_pred=pred, y_true=trainY)}\n")
# Evaluate on the test set
pred = model.predict(testX)
print(f"mean absolute error of elastic net model on test set: {mean_absolute_error(y_pred=pred, y_true=testY)}\n")

print(f'Coefficient of model: {model.coef_}') # print coefficient
print()
print(f'Intercept of model: {model.intercept_}') # print intercept_

mean absolute error of elastic net model on train set: 0.0008128132470392836

mean absolute error of elastic net model on test set: 0.0007656385454372306

Coefficient of model: [-3.33500760e-02  4.30484272e-02  7.56526033e-05  9.92853731e-01
 -2.04227868e-03 -1.00119981e-01 -2.56104722e-03 -1.19232230e-03]

Intercept of model: 0.11959029265342697
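The four cells above repeat the same fit-and-score steps, so the comparison asked for in step 5 can also be collected into one table. The sketch below does this on synthetic data from `make_regression` (an assumption, so the numbers will not match the notebook's), reusing the alpha values from the cells above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 8 features, like the notebook's X.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=24)

models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=1e-7),
    "Ridge": Ridge(alpha=1e-5),
    "ElasticNet": ElasticNet(alpha=1e-6),
}

rows = []
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    rows.append({
        "model": name,
        "train_mae": mean_absolute_error(y_tr, estimator.predict(X_tr)),
        "test_mae": mean_absolute_error(y_te, estimator.predict(X_te)),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison)
```

With alphas this small the penalty terms are nearly zero, so the regularized models are expected to land very close to plain least squares, which is consistent with the near-identical errors printed above. Larger alphas would show more of a difference.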


## Polynomial Transform

When the relationship between the features and the target is not linear, a linear regression fitted directly on the original features cannot capture it. In many such cases the relationship can be approximated by a polynomial function. Scikit-Learn supports converting data features to polynomial features through ``PolynomialFeatures``.

$$
y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots
$$

The formula above maps the value $x$ from one dimension to several ($x, x^2, x^3, \dots$). The model is still linear in the coefficients $a_i$, so linear regression can be used to find more complex relationships between $x$ and $y$.

In [26]:
#Linear Regression with Polynomial Transform
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def build_pipeline(X, y):
    # Use make_pipeline to chain PolynomialFeatures with a regression model, then fit it on the dataset
    poly_model = make_pipeline(PolynomialFeatures(2, include_bias=False), LinearRegression())
    poly_model.fit(X, y)

    return poly_model

poly_model = build_pipeline(trainX, trainY)
# Evaluate on the training set
poly_pred = poly_model.predict(trainX)
print("mean absolute error of linear model (with poly transform) on train set: ", mean_absolute_error(y_pred=poly_pred, y_true=trainY), '\n' )

# Evaluate on the test set
poly_pred = poly_model.predict(testX)
print("mean absolute error of linear model (with poly transform) on test set: ", mean_absolute_error(y_pred=poly_pred, y_true=testY))

mean absolute error of linear model (with poly transform) on train set:  0.0009633787146546368 

mean absolute error of linear model (with poly transform) on test set:  0.0009182937695706259
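The last question of the exercise, what happens when the order is too high, can be explored with a small experiment. The sketch below uses a made-up one-dimensional problem (a noisy sine curve, an assumption unrelated to the loan data) and fits polynomials of increasing degree. The training error typically keeps shrinking while the test error stops improving or grows, which is the usual sign of overfitting. The number of columns also grows quickly: with 8 input features, degree 2 gives 44 columns and degree 5 gives 1286.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def make_noisy_sine(n):
    """Hypothetical data: y = sin(2*pi*x) plus Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0.0, 0.2, size=n)
    return x, y


x_train, y_train = make_noisy_sine(30)
x_test, y_test = make_noisy_sine(200)

errors = {}
for degree in (1, 2, 5, 15):
    candidate = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())
    candidate.fit(x_train, y_train)
    errors[degree] = (
        mean_absolute_error(y_train, candidate.predict(x_train)),
        mean_absolute_error(y_test, candidate.predict(x_test)),
    )
    print(f"degree={degree:2d}  train MAE={errors[degree][0]:.4f}  test MAE={errors[degree][1]:.4f}")
```

The same loop could be run on `trainX`/`testX` with `build_pipeline` adapted to take the degree as a parameter, keeping in mind that high degrees on 8 features become expensive quickly.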
