## Abstract
Overfitting is one of the most common problems we face when building a machine learning model. When it happens, your model works **perfectly on the training data** but **badly in real situations**. Overfitting is therefore an essential problem that we must know how to detect and avoid.

## What is overfitting
As mentioned above, overfitting is a situation in which your model works very well on the training data but not on other (unseen) data. This happens when the model is too complex, which makes it learn the "noise" (outliers) in the training data instead of the general pattern. In the example below, the model fits the training samples (red) perfectly but performs badly on the test/validation samples. In this example, I used a linear regression model on polynomial features with $degree=16$:
$$y=\sum_{i=0}^{16}w_i x^i$$

![overfit-example](https://i.imgur.com/bx589P7.png)
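
The effect in the figure can be reproduced with a small sketch. It is not the code that produced the image: the target function (`sin(2*pi*x)`), the noise level, and the sample sizes are all assumptions chosen for illustration. It fits polynomials of degree 3 and degree 16 to the same 20 noisy points and compares the error on the training points with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (an assumed toy target)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # unseen data from the same distribution

results = {}
for degree in (3, 16):
    # Polynomial.fit rescales x internally, which keeps the least-squares fit stable
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (rmse(y_train, poly(x_train)), rmse(y_test, poly(x_test)))
    print(f"degree={degree:2d}  train RMSE={results[degree][0]:.3f}  "
          f"test RMSE={results[degree][1]:.3f}")
```

The degree-16 model has almost as many weights as there are training points, so it can follow the noise: its training error is lower than that of the degree-3 model, while its error on unseen points is clearly higher than its own training error.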

## Detect overfitting
The key to recognizing overfitting is the difference in error between the training set and the testing/validation set. A simple approach is to split the dataset into two parts, a train set and a validation set, for instance 80% for training and 20% for validation. (This is called the `percentage split` strategy.)

![train-test-loss](https://i.imgur.com/aiemPZC.png)

From the image above, we can see that in the first stage both the train loss and the validation loss decrease. However, if we continue to train the model, the validation loss starts to increase while the train loss keeps decreasing.
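
The point where the two curves start to diverge can be located programmatically. The sketch below assumes we have recorded one validation loss per epoch; the loss values are made up for illustration, and `early_stopping_epoch` and its `patience` parameter are hypothetical names, not part of any library.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping the scan once the
    validation loss has not improved for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch

# Made-up curves: train loss keeps falling, validation loss turns upward
train_losses = [1.00, 0.70, 0.50, 0.38, 0.30, 0.24, 0.19, 0.15, 0.12, 0.10]
val_losses = [1.05, 0.78, 0.62, 0.55, 0.53, 0.56, 0.61, 0.68, 0.76, 0.85]

best = early_stopping_epoch(val_losses, patience=3)
gap = val_losses[best] - train_losses[best]
print("best epoch:", best, "train/validation gap:", round(gap, 2))
```

Training past `best` only widens the gap between the two losses, which is the signature of overfitting described above.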

Furthermore, the `percentage split` strategy tends to underestimate the model, because only part of the data is used for training and the result depends on a single split. In practice, we usually use `k-fold cross validation` to evaluate overfitting. With `k-fold`, we split our dataset into `k` equal-size folds. We then run the validation `k` times; each time we pick one fold to be the test set and use the others as the train set. Finally, we calculate the mean value of the loss:

![k-fold](https://i.imgur.com/QzfsD8P.jpg)
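
The procedure can be sketched with NumPy alone. The data below is synthetic (an assumed linear relationship plus noise), and `k_fold_rmse` is a hypothetical helper written for this illustration; in practice scikit-learn's `KFold` and `cross_val_score` do the same job.

```python
import numpy as np

def k_fold_rmse(X, y, k=5, seed=0):
    """Return the validation RMSE of ordinary least squares on each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # k nearly equal folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add a column of ones so the model can learn an intercept
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        w, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        err = A_test @ w - y[test_idx]
        scores.append(float(np.sqrt(np.mean(err ** 2))))
    return scores

# Synthetic data: 100 samples, 3 features, small Gaussian noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, size=100)

scores = k_fold_rmse(X, y, k=5)
mean_rmse = float(np.mean(scores))
print("RMSE per fold:", np.round(scores, 3), "mean:", round(mean_rmse, 3))
```

Every sample is used for validation exactly once, so the mean is less sensitive to one lucky or unlucky split than a single 80/20 partition.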

## How to reduce overfitting
- Enlarge the training set. Some examples:
    - Collect more data (often not practical)
    - Data augmentation: widely used for image problems (shift, scale, etc.)
    - Use a GAN model to generate images
- Use regularization to penalize the weights (weight decay)
- Use dropout (mostly in deep learning)
- Prune the tree (in decision trees)
- Use the VC dimension
- Use early stopping
- Do feature selection
- Use ensemble methods

**In this document, I use a `linear regression` model and reduce overfitting with `feature selection`.**

Using feature selection, we can reduce the complexity of our model (that is, the number of input dimensions).

# Feature selection using Pearson correlation coefficient

When using a `linear regression` model, we can use the `Pearson correlation coefficient` (`PCC`) to do the feature selection. The `PCC` is a measure of **linear correlation** between **two sets of data**. It is the ratio between their `covariance` and the `product of their standard deviations`:
$$\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_X\sigma_Y}$$
This means that if our two sets:
- Form an upward line, then $\rho=1$
- Form a downward line, then $\rho=-1$
- Are scattered randomly, then $\rho\approx 0$
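
As a quick check of the formula, the sketch below computes $\rho$ directly from the covariance and the standard deviations and compares it with NumPy's `np.corrcoef`. The `pearson` helper and the sample values are made up for this illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return float(cov / (x.std() * y.std()))         # population standard deviations

x = np.arange(10, dtype=float)
upward = pearson(x, 2 * x + 1)     # points on an upward line
downward = pearson(x, -3 * x + 5)  # points on a downward line

# Arbitrary values with no exact linear relationship
z = np.array([4.0, 1.0, 7.0, 3.0, 9.0, 2.0, 8.0, 5.0, 6.0, 0.0])
manual = pearson(x, z)
reference = float(np.corrcoef(x, z)[0, 1])

print(upward, downward, manual, reference)
```

The slope of the line does not matter, only its direction: any upward line gives $1$ and any downward line gives $-1$.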

To understand it clearly, look at the image below:

![pearson-example](https://i.imgur.com/Ga36VPp.png)

The Pearson correlation only captures the `linear` relationship. If the two sets form a non-linear shape, the PCC can also be close to $0$ even though the two sets are clearly related:

![pearson-strange](https://i.imgur.com/uWZjK5i.png)
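
A parabola is a simple instance of this. In the sketch below (values chosen for illustration), `y` is fully determined by `x`, yet the correlation is zero up to rounding because the positive and negative halves cancel out.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)  # symmetric around zero
y = x ** 2                       # y depends on x, but not linearly

r = float(np.corrcoef(x, y)[0, 1])
print("PCC of x and x^2:", r)
```

A PCC near zero therefore rules out a linear relationship only, not a relationship in general.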

> Note: The PCC only works for `numeric` attributes.

The Pearson correlation therefore helps us select features when using a linear regression model. We only select attributes whose correlation with the target is stronger than $0.5$ in absolute value (that is, $\rho > 0.5$ or $\rho < -0.5$).
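
The selection rule can be sketched on a toy table. The column names (`area`, `age`, `noise`, `price`) and the way the data is generated are assumptions made for this example; they are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
age = rng.normal(size=n)
noise = rng.normal(size=n)  # unrelated to the target
price = 3.0 * area - 3.0 * age + rng.normal(0.0, 1.0, size=n)

toy = pd.DataFrame({"area": area, "age": age, "noise": noise, "price": price})

# Correlation of every attribute with the target, target itself excluded
corr_with_target = toy.corr()["price"].drop("price")

# Keep attributes with a strong positive or negative correlation
selected = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
print(corr_with_target.round(2))
print("selected:", selected)
```

Taking the absolute value keeps `age`, whose correlation with `price` is strongly negative, and discards `noise`, whose correlation is near zero.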

# Housing price problem
**Problem statement**

Given a dataset containing many attributes of a house that is for sale, the problem is to predict the price of that house.
- Input: House's attributes
- Output: House's price

**Dataset**

I use the dataset from the Kaggle contest [House prices - Advanced regression techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).


## Load and explore data

**Import libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings("ignore")

**Load data**

In [None]:
df = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

**Data analysis**

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe().T

**Number of missing cells**


In [None]:
print("Missing value by column")
print("-"*20)
print(df.isna().sum())
print("-"*20)
print("Total:",df.isna().sum().sum())

**Drop missing columns**

Because the data contains many missing values, I must drop the `NaN` values in order to use `linear regression`. We have two options here:
- Drop the columns that have `NaN`
- Drop the rows that have `NaN`

If we look in detail, we can see that we cannot drop by row, since every row has at least one `NaN` and our data would be empty. So, I choose to drop the columns.
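
The difference between the two options is easy to see on a toy frame. The column names below exist in the housing dataset, but the three rows are illustrative values only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, np.nan, "Pave"],
    "PoolQC": [np.nan, "Gd", np.nan],
    "SalePrice": [208500, 181500, 223500],
})

by_row = toy.dropna(axis=0)  # drop every row that contains a NaN
by_col = toy.dropna(axis=1)  # drop every column that contains a NaN

print("drop by row:   ", by_row.shape)
print("drop by column:", by_col.shape)
```

Every row has a `NaN` in at least one of the sparse columns, so dropping by row leaves nothing, while dropping by column keeps all rows and only loses the sparse attributes.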

In [None]:
df = df.dropna(axis=1)

print("Missing value by column")
print("-"*20)
print(df.isna().sum())
print("-"*20)
print("Total:",df.isna().sum().sum())

## Without using feature selection

**Create data**

In [None]:
X = df.drop(["Id","SalePrice"], axis=1)
y = df["SalePrice"]

**Normalize**

As you may already know, we need to normalize the data so that the attributes have comparable magnitudes. There are many choices of normalization; two common ones are:
- `Z-score`
- `Min-max`

In this project, I use `Z-score` normalization in order to handle the outliers better.
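
For reference, the two normalizations are $z = (x - \mu)/\sigma$ and $x' = (x - x_{min})/(x_{max} - x_{min})$. The sketch below applies both to a made-up column that contains one outlier.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

# Z-score: zero mean, unit standard deviation (population std, as StandardScaler uses)
z_score = (values - values.mean()) / values.std()

# Min-max: rescale into the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

print("z-score:", np.round(z_score, 3))
print("min-max:", np.round(min_max, 3))
```

With min-max, the outlier fixes the upper end of the range, so the four ordinary values are squeezed into a narrow band near $0$. The z-score output is not bounded, so the outlier simply shows up as a large value.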

In [None]:
col_types = X.dtypes
numeric_col = col_types[col_types!='object'].index

scaler = StandardScaler()
X[numeric_col] = scaler.fit_transform(X[numeric_col])
X.head()

**One-hot encoding**

Our data contains both `numeric` and `nominal` attributes. To use `linear regression`, all the data must be `numeric`, so I use one-hot encoding to convert all the `nominal` attributes into `numeric` ones.
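
On a toy frame (the column names exist in the dataset, the three rows are illustrative), one-hot encoding replaces each nominal column with one indicator column per category and leaves the numeric columns untouched:

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Heating": ["GasA", "GasW", "GasA"],
})

encoded = pd.get_dummies(toy)
print(encoded)
print(list(encoded.columns))
```

This is also why the number of features grows so much after encoding: every distinct category becomes its own column.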

In [None]:
X = pd.get_dummies(X)
X.head()

**Split data**

As mentioned above, I will split the data into two sets to detect the overfitting:
- Train set (80%)
- Test set (20%)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Train model**

In [None]:
model1 = LinearRegression()
model1.fit(X_train, y_train)

**Evaluation**

To measure the performance of our model, I use the `root mean squared error (RMSE)`, which is a common metric for evaluating a `linear regression` model.

In [None]:
def RMSE(target,pred):
    return np.sqrt(mean_squared_error(target,pred))
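
Written out, the metric is $\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$. A small worked example with made-up numbers, using NumPy only (`rmse_manual` is a name introduced just for this check):

```python
import numpy as np

def rmse_manual(target, pred):
    """Square root of the mean of the squared errors."""
    target = np.asarray(target, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((target - pred) ** 2)))

target = [3.0, 5.0, 7.0]
pred = [2.0, 5.0, 9.0]

# errors: 1, 0, -2 -> squared: 1, 0, 4 -> mean: 5/3 -> root: about 1.291
value = rmse_manual(target, pred)
print(value)
```

Because of the square root, the result is in the same unit as the target, so here it can be read directly as an error in sale price.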

In [None]:
pred_train = model1.predict(X_train)
pred_test = model1.predict(X_test)

print("Train RMSE:",RMSE(y_train, pred_train))
print("Test RMSE:",RMSE(y_test, pred_test))

**Discussion**

From the result, we can see that overfitting happened, since the `RMSE` on:
- `Training set`: is quite acceptable
- `Test set`: is considerably larger

This is evidence that, with $215$ features, the `linear regression` model is overfitting.

## Using feature selection

**Calculate pearson correlation**

In [None]:
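# Pearson correlation is only defined for numeric columns, so select them explicitly
# (recent pandas versions raise an error if text columns are passed to corr())
correlation = df.select_dtypes(include="number").corr()
correlation_price = correlation["SalePrice"]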

**Plot with heat map**

In [None]:
plt.figure(figsize=(13,10))
sns.heatmap(correlation, cmap="rainbow")
plt.title("Correlations Between Variables", size=15)
plt.show()

**Feature selection**

In the `Pearson correlation coefficient` section, I said that we only choose the attributes with a strong positive or negative correlation with the target (`SalePrice`). Furthermore, the PCC only works for `numeric` attributes, so we select the `nominal` attributes using our prior knowledge.

Looking at the rightmost column of the heatmap, we can see some typical selected attributes:
- `OverallQual`
- `YearBuilt`
- `YearRemodAdd`

From my own knowledge, I also pick some `nominal` attributes such as `Utilities`, `Heating`, etc. For instance, it is obvious that the utilities (e.g. electricity, gas, water) also affect the price of a house. (This is called domain knowledge.)

In [None]:
positive_attributes = (correlation_price > 0.50)
negative_attributes = (correlation_price < -0.50)
numeric_col = list(correlation_price[positive_attributes | negative_attributes].index)
category_col = ["Utilities","Heating","KitchenQual","SaleCondition","LandSlope"]

important_cols = numeric_col + category_col

# "Id" is an identifier, not a feature; list.remove raises ValueError if it is absent
try:
    important_cols.remove("Id")
except ValueError:
    print("Column [Id] not in `important_cols`")

**Create data**

In [None]:
# `important_cols` still contains the target, so drop it from the features
X = df[important_cols].drop(columns="SalePrice")
y = df["SalePrice"]
numeric_col.remove("SalePrice")

**Normalize**

In [None]:
scaler = StandardScaler()
X[numeric_col] = scaler.fit_transform(X[numeric_col])
X.head()

**One-hot encoding**

In [None]:
X = pd.get_dummies(X)
X.head()

**Split data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Train model**

In [None]:
model2 = LinearRegression()
model2.fit(X_train, y_train)

**Evaluation**

In [None]:
pred_train = model2.predict(X_train)
pred_test = model2.predict(X_test)

print("Train RMSE:",RMSE(y_train, pred_train))
print("Test RMSE:",RMSE(y_test, pred_test))

**Discussion**

It can be seen that after doing the feature selection, the result on the test set is much better. Instead of using all $215$ features, we select only $31$ features. There are many other feature selection and dimensionality reduction algorithms that we can apply to improve our model, such as PCA, ICA, IDA, etc.

# For submission

**Import imputer and load data**

In [None]:
from sklearn.impute import SimpleImputer

test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

**Apply feature selection**

In [None]:
feature_cols = important_cols.copy()
feature_cols.remove("SalePrice")
# Copy so that later in-place edits do not modify a view of the original `test` frame
X_final = test[feature_cols].copy()
X_final.info()

**Impute missing data**

In [None]:
imp = SimpleImputer(missing_values=np.nan,strategy="most_frequent")
X_final[:] = imp.fit_transform(X_final)
X_final.info()

**Normalize**

In [None]:
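# Reuse the scaler fitted on the training features (`numeric_col` has the same
# columns in the same order), so the test data is scaled with the training
# mean and standard deviation instead of its own statistics
X_final[numeric_col] = scaler.transform(X_final[numeric_col].astype(float))
X_final.head()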

**One-hot encoding and re-index with model**

In [None]:
X_final = pd.get_dummies(X_final)
X_final = X_final.reindex(columns=X_train.columns,fill_value=0)
X_final.head()
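
The re-index step matters because the test data may contain categories that the training data did not, and the other way round. A toy sketch with illustrative category values (`toy_train` and `toy_test` are names made up for this example):

```python
import pandas as pd

toy_train = pd.get_dummies(pd.DataFrame({"Heating": ["GasA", "GasW", "Wall"]}))
toy_test = pd.get_dummies(pd.DataFrame({"Heating": ["GasA", "Floor"]}))

# Keep exactly the training columns, in the training order:
# columns missing from the test data are created and filled with 0,
# columns unknown to the model ("Heating_Floor") are dropped
aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)

print(list(toy_test.columns))
print(list(aligned.columns))
```

Without this step, the model would receive a different number of columns than it was trained on and `predict` would fail.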

**Predictions**

In [None]:
pred = model2.predict(X_final)
test['SalePrice'] = pred
result = test[["Id","SalePrice"]]
print(result)

result.to_csv("submission.csv",index=False)