# Preface

This notebook briefly discusses the effect of regularization on regression models and gives examples of how to evaluate them. The data come from the Kaggle dataset: https://www.kaggle.com/anmolkumar/house-price-prediction-challenge

# Import Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
df = pd.read_csv('../input/house-price-prediction-challenge/train.csv')
df.head()

# Data Understanding

In [None]:
df.info()

In [None]:
df.describe()

# Modeling

## Dataset Splitting 

The data are split into features and target, and then into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop(columns=['TARGET(PRICE_IN_LACS)','LONGITUDE','LATITUDE','ADDRESS'])
y = df['TARGET(PRICE_IN_LACS)']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

Little feature engineering is done here: only a quick feature selection, followed by preprocessing that depends on whether a feature is numeric or categorical. Numeric features are expanded into polynomial terms, transformed with the Yeo-Johnson method, and scaled with a standard scaler. Categorical features are encoded.
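As a rough illustration of what the numeric branch does, the sketch below runs the same three steps on synthetic data. The two toy columns, the sample size, the lower degree of 2, and `include_bias=False` are assumptions chosen to keep the output small and well behaved; they are not taken from the dataset or from the notebook's pipeline.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer, StandardScaler

# Synthetic stand-in for two positive numeric features (assumed values, not the real data).
rng = np.random.default_rng(0)
X_toy = rng.uniform(low=1.0, high=5.0, size=(50, 2))

# Same three steps as the numeric pipeline, with degree 2 to keep the output readable.
toy_pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('transform', PowerTransformer(method='yeo-johnson')),
    ('scaler', StandardScaler()),
])

X_out = toy_pipe.fit_transform(X_toy)

# Two inputs expand to five columns: x1, x2, x1^2, x1*x2, x2^2.
print(X_out.shape)
# After scaling, each column has mean close to 0 and standard deviation close to 1.
print(X_out.mean(axis=0).round(6), X_out.std(axis=0).round(6))
```

With `degree=5` and the default bias column, the same two inputs would expand to 21 columns, which is why a high degree gives a linear model so much room to overfit.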

## Preprocessing

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, PolynomialFeatures, PowerTransformer, StandardScaler
from sklearn.compose import ColumnTransformer

In [None]:
num_pipe = Pipeline([
    ('poly',PolynomialFeatures(degree=5)),
    ('transform',PowerTransformer(method='yeo-johnson')),
    ('scaler',StandardScaler())
])

cat_pipe = Pipeline([
    ('encoder',OrdinalEncoder())
])

In [None]:
X_train.columns

In [None]:
prepro = ColumnTransformer([
    ('numeric',num_pipe,['SQUARE_FT','BHK_NO.']),
    ('categoric',cat_pipe,['POSTED_BY','UNDER_CONSTRUCTION','RERA','BHK_OR_RK','READY_TO_MOVE','RESALE'])
])

## Learning 

In [None]:
from sklearn.linear_model import LinearRegression, ElasticNet
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

To see the effect of regularization, two models are compared at the training stage: a Linear Regression model with 5th-degree polynomial features, and an Elastic Net Regression model with the same polynomial degree. Elastic Net adds two hyperparameters: the penalty weight (`alpha`), which controls the strength of regularization, and the ratio (`l1_ratio`) that balances the l1 and l2 norm terms.
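The objective that scikit-learn documents for `ElasticNet` can be written out directly, which makes the roles of the two hyperparameters explicit. The function below is a minimal sketch; `elastic_net_objective` and the toy arrays are illustrative names and values, not part of the notebook.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio, intercept=0.0):
    """Loss from the scikit-learn ElasticNet documentation:
    1/(2n) * ||y - Xw - b||^2
    + alpha * l1_ratio * ||w||_1
    + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
    """
    n_samples = X.shape[0]
    residual = y - (X @ w + intercept)
    data_term = (residual @ residual) / (2 * n_samples)
    l1_term = alpha * l1_ratio * np.abs(w).sum()
    l2_term = 0.5 * alpha * (1 - l1_ratio) * (w @ w)
    return data_term + l1_term + l2_term

# Toy data that the weights [1, 2] fit exactly, so only the penalty remains.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])
w_demo = np.array([1.0, 2.0])

no_penalty = elastic_net_objective(X_demo, y_demo, w_demo, alpha=0.0, l1_ratio=0.5)
pure_l1 = elastic_net_objective(X_demo, y_demo, w_demo, alpha=1.0, l1_ratio=1.0)
pure_l2 = elastic_net_objective(X_demo, y_demo, w_demo, alpha=1.0, l1_ratio=0.0)

print(no_penalty)  # 0.0: perfect fit and no penalty
print(pure_l1)     # 3.0: |1| + |2|
print(pure_l2)     # 2.5: 0.5 * (1^2 + 2^2)
```

Setting `alpha=0` recovers ordinary least squares, while `l1_ratio` moves the penalty between a pure l2 (ridge-like) term at 0 and a pure l1 (lasso-like) term at 1.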

### Using Linear Regression (Polynomial) 

In [None]:
param_linreg = {
    'algo__fit_intercept':[True,False],
}

pipe_linreg = Pipeline([
    ('prep',prepro),
    ('algo',LinearRegression())
])

In [None]:
model_linreg = GridSearchCV(pipe_linreg,param_linreg,cv=3,n_jobs=-1,verbose=1)
model_linreg.fit(X_train,y_train)

print("Train data R squared score: ", model_linreg.score(X_train,y_train))
print("Test data R squared score: ", model_linreg.score(X_test,y_test))

Polynomial degree 5 was chosen deliberately to show what an overfit model looks like. For the Linear Regression model, the R-squared score on the training data is higher than on the test data, by about 0.2. The model fits the training data relatively well but performs worse on the test data; in other words, it overfits.
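The gap between training and test scores can be explored on synthetic data as well. The sketch below uses NumPy's polynomial fitting rather than the notebook's pipeline; the noisy linear data, the seed, the 20/10 split, and the `r2_toy` helper are all assumptions made for illustration.

```python
import numpy as np

def r2_toy(y_true, y_hat):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Noisy linear relationship with only a few samples (assumed toy data).
rng = np.random.default_rng(42)
x_all = rng.uniform(-1.0, 1.0, size=30)
y_all = 2.0 * x_all + rng.normal(scale=0.5, size=30)

x_tr, y_tr = x_all[:20], y_all[:20]
x_te, y_te = x_all[20:], y_all[20:]

toy_scores = {}
for degree in (1, 5):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    toy_scores[degree] = (
        r2_toy(y_tr, np.polyval(coefs, x_tr)),
        r2_toy(y_te, np.polyval(coefs, x_te)),
    )
    print(f"degree={degree}: train R2={toy_scores[degree][0]:.3f}, "
          f"test R2={toy_scores[degree][1]:.3f}")
```

The training score cannot go down when the degree increases, because the lower-degree model is a special case of the higher-degree one. The test score has no such guarantee, which is why a large gap between the two is read as a sign of overfitting.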

### Using Elastic Net Regression 

In [None]:
param_enet = {
    'algo__fit_intercept':[True,False],
    'algo__alpha':np.logspace(start=-4,stop=2),
    'algo__l1_ratio':np.linspace(start=0,stop=1)
}

pipe_enet = Pipeline([
    ('prep',prepro),
    ('algo',ElasticNet())
])

In [None]:
model_enet = RandomizedSearchCV(pipe_enet,param_enet,cv=3,n_iter=100,n_jobs=-1,verbose=1,random_state=42)
model_enet.fit(X_train,y_train)

print(model_enet.best_params_)
print("Train data R squared score: ", model_enet.score(X_train,y_train))
print("Test data R squared score: ", model_enet.score(X_test,y_test))

For the Elastic Net Regression model, the R-squared score on the training data is about the same as on the test data. This suggests that the l1 and l2 norm terms in Elastic Net reduce the model's tendency to overfit.
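One way to see the penalty at work is to fit Elastic Net with increasing `alpha` on the same polynomial features and look at the coefficients. The synthetic data, the fixed `l1_ratio=0.5`, and the three `alpha` values below are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Noisy linear toy data (assumed, not the housing data).
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1.0, 1.0, size=(40, 1))
y_demo = 2.0 * x_demo[:, 0] + rng.normal(scale=0.3, size=40)

# Degree-5 polynomial features, standardized so the penalty treats them on one scale.
X_poly = StandardScaler().fit_transform(
    PolynomialFeatures(degree=5, include_bias=False).fit_transform(x_demo)
)

coef_l1_norm = {}
for alpha in (1e-4, 1e-1, 1e3):
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=100_000)
    enet.fit(X_poly, y_demo)
    coef_l1_norm[alpha] = np.abs(enet.coef_).sum()
    print(f"alpha={alpha:g}: sum|coef|={coef_l1_norm[alpha]:.4f}, "
          f"nonzero={np.count_nonzero(enet.coef_)}")
```

With a very large `alpha` the l1 term drives every coefficient to exactly zero and the model predicts only the intercept, while a very small `alpha` behaves close to unregularized linear regression. The search over `algo__alpha` and `algo__l1_ratio` is looking for a setting between these extremes.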

### Using XGBoost Regressor 

In [None]:
param_xgb = {
    'algo__max_depth':np.arange(1,11),
    'algo__learning_rate':np.logspace(-2,0),
    'algo__n_estimators':np.arange(100,200),
    'algo__gamma':np.arange(1,11),
    'algo__reg_alpha':np.logspace(-3,1),
    'algo__reg_lambda':np.logspace(-3,1)
}

pipe_xgb = Pipeline([
    ('prep',prepro),
    ('algo',XGBRegressor(n_jobs=-1,random_state=42))
])

In [None]:
model_xgb = RandomizedSearchCV(pipe_xgb,param_xgb,cv=3,n_iter=100,n_jobs=-1,verbose=1,random_state=42)
model_xgb.fit(X_train,y_train)

print(model_xgb.best_params_)
print("Train data R squared score: ", model_xgb.score(X_train,y_train))
print("Test data R squared score: ", model_xgb.score(X_test,y_test))

A gradient boosting algorithm is tried here to improve on the linear models. The XGB Regressor yields a significantly higher R-squared score, so this model is the one evaluated with additional metrics in the next section.

# Evaluation

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
y_pred = model_xgb.predict(X_test)

In [None]:
mae = mean_absolute_error(y_test,y_pred)
mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

In [None]:
print("The model performance for testing set")
print("--------------------------------------")
print('MAE is {}'.format(mae))
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))
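For reference, the three metrics reduce to a few lines of NumPy. The arrays below are made-up values used only to show the arithmetic; the variable names are illustrative.

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 11.0])

errors = actual - predicted                      # [1, 0, -1, -2]

mae_manual = np.mean(np.abs(errors))             # (1 + 0 + 1 + 2) / 4 = 1.0
mse_manual = np.mean(errors ** 2)                # (1 + 0 + 1 + 4) / 4 = 1.5

ss_res = np.sum(errors ** 2)                     # 6.0
ss_tot = np.sum((actual - actual.mean()) ** 2)   # mean is 6, so 9 + 1 + 1 + 9 = 20.0
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 6/20 = 0.7

print(mae_manual, mse_manual, r2_manual)
```

MAE is in the same units as the target, MSE is in squared units and weights large errors more heavily, and R-squared is unitless: it compares the model's squared error with that of always predicting the mean.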

In [None]:
# Make Trendline

y_test = pd.DataFrame(y_test)
y_pred = pd.DataFrame(y_pred)
lm = LinearRegression()
lm.fit(y_test,y_pred)
y_trend = lm.predict(y_test)

In [None]:
y_trend = pd.DataFrame(y_trend)

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(15,5))

# Actual vs predicted, with the fitted trendline in green
ax1.scatter(y_test,y_pred)
ax1.plot(y_test['TARGET(PRICE_IN_LACS)'],y_trend[0],color='green')
ax1.set_xlabel('Actual')
ax1.set_ylabel('Predicted')
ax1.set_title('Actual vs Predicted')

# Residuals of predicted regressed on actual; recent seaborn versions require keyword arguments
sns.residplot(x=y_test['TARGET(PRICE_IN_LACS)'].to_numpy(),y=y_pred[0].to_numpy(),ax=ax2)
ax2.set_xlabel('Actual')
ax2.set_ylabel('Residual')
ax2.set_title('Residual Plot')

plt.show()

The actual vs predicted plot and the residual plot are used to check whether the model is reasonably good. In the results obtained, the residual plot shows a roughly symmetrical and stationary distribution, and the actual vs predicted plot shows a relatively strong trend, which supports the conclusion that the model is reasonably good. However, some predictions are clearly off and there are still outliers. Performance could be improved further by removing outliers, doing model-based feature selection, or carrying out more in-depth exploratory data analysis.

# Recap

This notebook has shown that the weight penalty acts as regularization, reducing the model's tendency to overfit. It has also shown how to evaluate a regression model with the MAE, MSE, and R-squared metrics. Actual vs predicted and residual plots can be used to illustrate the model's performance and to suggest strategies for improving it.