# **Baseline Model**

We train a baseline model on all the features and evaluate its performance.

---


In [42]:
# import the required libraries
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd


## **Load the data**


In [43]:
# Use input_path rather than input, which would shadow the Python built-in
input_path = '../data/clean-data/flats-house-cleaned-v5.csv'

df = pd.read_csv(input_path)

print("Shape of the data : ",df.shape)
df.head()

Shape of the data :  (3540, 13)


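| | property_type | sector | bedRoom | bathroom | balcony | agePossession | built_up_area | servant room | store room | furnishing_type | luxury_category | floor_category | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 97.0 | 3 | 3 | 3.0 | 3.0 | 1310.0 | 0 | 0 | 2 | 2.0 | 2.0 | 1.5 |
| 1 | 0.0 | 19.0 | 3 | 3 | 2.0 | 3.0 | 1500.0 | 1 | 0 | 0 | 2.0 | 1.0 | 0.85 |
| 2 | 0.0 | 27.0 | 2 | 2 | 1.0 | 1.0 | 800.0 | 0 | 0 | 1 | 1.0 | 1.0 | 0.6 |
| 3 | 0.0 | 119.0 | 2 | 1 | 0.0 | 1.0 | 500.0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.3 |
| 4 | 0.0 | 13.0 | 3 | 4 | 3.0 | 3.0 | 1315.0 | 1 | 0 | 2 | 0.0 | 2.0 | 1.55 |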


### **Applying Linear Regression for baseline Model**

To apply a Linear Regression model, we will perform three tasks:

**Task1**: Since we are using linear models, we need to one-hot encode the ordinally encoded categorical features:

- `sector`
- `balcony`
- `agePossession`
- `furnishing_type`
- `luxury_category`
- `floor_category`
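
As a minimal sketch of what Task1 does, the toy column below (illustrative values, not taken from the dataset) is one-hot encoded with the same `drop='first'` setting used later in this notebook. A linear model would read integer codes such as 0, 1, 2 as evenly spaced magnitudes; one-hot encoding instead gives each category its own 0/1 column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy ordinally encoded column with three categories (illustrative values only)
codes = np.array([[0], [1], [2], [1]])

# drop='first' removes the column for category 0, so two columns remain
encoder = OneHotEncoder(drop='first')
one_hot = encoder.fit_transform(codes).toarray()

print(one_hot)
```

Category 0 becomes the all-zero row, which acts as the reference level for the model's coefficients.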


**Task2**: We need to apply standard scaling to the numerical features:

- `property_type`
- `bedRoom`
- `bathroom`
- `built_up_area`
- `servant room`
- `store room`
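
Standard scaling subtracts each column's mean and divides by its standard deviation. The sketch below, on a toy column of built-up areas (illustrative values only), checks that `StandardScaler` matches this manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy built-up areas (illustrative values only)
area = np.array([[500.0], [800.0], [1310.0], [1500.0]])

scaled = StandardScaler().fit_transform(area)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
manual = (area - area.mean(axis=0)) / area.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 (up to floating-point error), so features measured on very different scales contribute comparably.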



**Task3**: Since the target column `price` is right-skewed, we apply a log transformation (`log1p`) to bring its distribution closer to normal.
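
The effect of the log transformation can be sketched on a few toy prices with one large outlier (illustrative values only). `np.log1p` computes `log(1 + x)` and `np.expm1` reverses it, which is how predictions are mapped back to the original price scale later on.

```python
import numpy as np

# Toy right-skewed prices with one large value (illustrative values only)
prices = np.array([0.3, 0.6, 0.85, 1.5, 1.55, 12.0])

log_prices = np.log1p(prices)      # log(1 + price)
recovered = np.expm1(log_prices)   # exp(x) - 1 undoes log1p

# The gap between the largest value and the median shrinks after the transform
print(prices.max() - np.median(prices))
print(log_prices.max() - np.median(log_prices))

# The round trip returns the original prices
print(np.allclose(recovered, prices))
```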


In [62]:
X = df.drop(columns=['price'])
y = df['price']

print("Shape of X : ",X.shape)
print("Shape of y : ",y.shape)

Shape of X :  (3540, 12)
Shape of y :  (3540,)


**NOTE**: `OneHotEncoder` raises an error if it encounters a category in the test set that was not present in the training set.
To avoid this, either make sure every category appears in both sets, or handle unknown categories by passing `handle_unknown='ignore'` to `OneHotEncoder`.
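
A minimal sketch of this behaviour, using made-up sector codes: the encoder is fitted on three codes and then asked to transform a code it has never seen. The sketch leaves out `drop='first'` to keep the output easy to read.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy sector codes: the "test" data contains a code (99) never seen during fit
train_sector = np.array([[13], [19], [27]])
test_sector = np.array([[19], [99]])

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_sector)

# The unseen code 99 is encoded as a row of all zeros instead of raising an error
encoded = encoder.transform(test_sector).toarray()
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError`.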


In [71]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='sklearn.preprocessing._encoders')

# Task1 & Task2 : Column transformer that scales the numerical features and one-hot encodes the categorical ones

columns_to_encode = ['sector', 'balcony', 'agePossession', 'furnishing_type', 'luxury_category', 'floor_category']
column_to_scale = ['property_type', 'bedRoom', 'bathroom', 'built_up_area', 'servant room', 'store room']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), column_to_scale),
        ('cat', OneHotEncoder(drop='first',handle_unknown='ignore'), columns_to_encode)
    ], 
    remainder='passthrough'
)

# Task3 : Applying the log1p transformation to the target variable
y_transformed = np.log1p(y)

# pipeline1 : preprocessing steps followed by Linear Regression
pipeline1 = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# pipeline2 : the same preprocessing steps followed by SVR with an RBF kernel
pipeline2 = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', SVR(kernel='rbf'))
])

# K-fold cross-validation, executing the pipeline
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
scores1 = cross_val_score(pipeline1, X, y_transformed, cv=kfold, scoring='r2')
scores2 = cross_val_score(pipeline2, X, y_transformed, cv=kfold, scoring='r2')

print("***\n\n Linear Regression : ***")
print("Mean of the scores : ",scores1.mean())
print("Standard Deviation of the scores : ",scores1.std())

print("***\n\n SVM with RBF Kernel : ***")
print("Mean of the scores : ",scores2.mean())
print("Standard Deviation of the scores : ",scores2.std())

***

 Linear Regression : ***
Mean of the scores :  0.8574332627994418
Standard Deviation of the scores :  0.026612661684479906
***

 SVM with RBF Kernel : ***
Mean of the scores :  0.886260010170018
Standard Deviation of the scores :  0.014531612196396282


In [75]:
pipeline1

In [76]:
pipeline2

 - **With Linear Regression as the baseline, we got a mean R2 score of 0.857 with a standard deviation of 0.027 across the 10 folds (on the log-transformed target), which is good for a baseline model.**

 - **With SVM with an RBF kernel, we got a mean R2 score of 0.886 with a standard deviation of 0.015, which is both higher and more stable across folds.**

- **Next, we check the `Mean Absolute Error`, because this is a prediction task: a good R2 score should come with a low MAE as well.**

In [72]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_transformed, test_size=0.2, random_state=42)

# Fitting the pipeline on the training data
pipeline1.fit(X_train, y_train)
pipeline2.fit(X_train, y_train)

# Predicting the target variable for the test set
y_pred1 = pipeline1.predict(X_test)
y_pred2 = pipeline2.predict(X_test)

# Applying the inverse of the log1p transformation to the predictions using expm1
# expm1(x) is equivalent to exp(x) - 1, which reverses the log1p transformation
y_pred1 = np.expm1(y_pred1)
y_pred2 = np.expm1(y_pred2)

# Calculating the mean absolute error between the actual and predicted values
mae1 = mean_absolute_error(np.expm1(y_test), y_pred1)
mae2 = mean_absolute_error(np.expm1(y_test), y_pred2)

print("Mean Absolute Error for Linear Regression : ",mae1)
print("Mean Absolute Error for SVM with RBF Kernel : ",mae2)

Mean Absolute Error for Linear Regression :  0.5855329477859486
Mean Absolute Error for SVM with RBF Kernel :  0.5257976803686002


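 - On this test split, our Linear Regression baseline has a Mean Absolute Error of about 0.5855, i.e. 58.55 Lakhs. Its predictions are off by about 58.55 Lakhs on average, either above or below the actual price. An error of that size on a property actually priced at 100 Lakhs would mean a prediction of about 41.45 or 158.55 Lakhs.

 - On the same split, our SVM with RBF kernel has a Mean Absolute Error of about 0.5258, i.e. 52.58 Lakhs. An error of that size on a property actually priced at 100 Lakhs would mean a prediction of about 47.42 or 152.58 Lakhs.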
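
To make the meaning of MAE concrete, here is a small worked example on toy numbers (illustrative values, not taken from the dataset). Each error is taken as an absolute value, so over-predictions and under-predictions count the same, and the result is in the same units as `price`.

```python
# Toy actual and predicted prices (illustrative values only)
actual = [1.00, 0.60, 1.50, 0.30]
predicted = [1.20, 0.50, 1.10, 0.60]

# Absolute error per property: over- and under-predictions both count as positive
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the plain average of those absolute errors
mae = sum(abs_errors) / len(abs_errors)
print(round(mae, 2))
```

Here two predictions are too high and two are too low, yet all four contribute to the average, so MAE says how far off the model is but not in which direction.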

**We have now chosen SVR as the baseline model, and we are going to improve it by tuning the hyperparameters.**

**END**

---