# House Price Prediction Problem

This notebook takes a comprehensive approach to predicting the price of a house.

Approach taken:

* Initial Data Exploration
* Exploratory Data Analysis
* Data Cleaning
    - Handling missing values
* Visualizing data distributions and relationships
* Handling Outliers
* Handling Categorical Variables
* Feature Selection
* Model Training
    - Splitting Dataset
    - Hyperparameter Tuning
    - Fitting model with best parameters
    - Predicting on testing data
    - Evaluating and comparing performance of different models

In the modelling stage, I trained several machine learning algorithms (Logistic Regression, XGBoost Regressor, and Random Forest Regressor) and performed hyperparameter tuning for each of them.

Moreover, I experimented with the CatBoost Regressor, which naturally handles categorical data. 

It's important to note that while this comprehensive approach generally aids in creating robust models, the small size of the dataset, limited to 1460 entries, might have impacted the overall performance of the models.

Overall, this project served as an excellent application of various data science concepts and provided valuable insights into the workings of different machine learning models.

## Data Loading and Exploration

##### Importing necessary libraries

In [None]:
# Importing libraries
from math import sqrt

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from category_encoders import HashingEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
# Metrics used later by eval_metrics()
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from catboost import CatBoostRegressor

##### Loading training dataset

In [None]:
# Loading and reading training dataset as pandas dataframe
df = pd.read_csv("data.csv")
df

##### Initial Exploration of the dataset

In [None]:
# Displaying the first 5 rows of the dataframe
df.head() 

In [None]:
# Checking the dimension of the dataframe
rows, cols = df.shape
print(f"There are {rows} rows and {cols} columns in the dataframe.")


In [None]:
# Getting overview of the dataset
df.info() 

In [None]:
# Statistical details of the dataset (numerical variables only)
df.describe()

By default, `df.describe()` leaves categorical columns out of its statistical summary. We can still summarize the categorical variables by passing only those columns to `describe()`:

In [None]:
# getting statistical details of categorical variables
cat_vars = df.dtypes[df.dtypes == "object"].index
df[cat_vars].describe()
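
An equivalent shortcut is to let `describe()` select the columns itself through its `include` argument. The sketch below uses a small made-up frame (the values are illustrative and not taken from the dataset) so that it runs on its own:

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Pave", "Grvl", "Pave"], dtype="object"),
    "LotArea": [8450, 9600, 11250, 9550],
})

# include="object" restricts the summary to object-typed (string) columns
cat_summary = toy.describe(include="object")
print(cat_summary)
```

The summary reports the count, the number of unique categories, the most frequent category (`top`) and its frequency (`freq`) for each selected column.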

## Data Cleaning

##### Handling missing values

In [None]:
# counting null values in each column
df.isnull().sum()

Next, I drop the columns that are missing a large share of their values, because filling them with the mean, median, or another statistic could be misleading. I use a 30% threshold: any column with more than 30% missing values is dropped.

In [None]:
# Drop columns with more than 30% missing values
threshold = 0.3  

# Calculate the percentage of missing values in each column
missing_percentages = df.isnull().mean()

# Get the list of columns to drop
columns_to_drop = missing_percentages[missing_percentages > threshold].index

# Drop the columns from the DataFrame
df_dropped = df.drop(columns=columns_to_drop)

# Drop the 'Id' column 
df_dropped = df_dropped.drop(['Id'], axis=1)

df_dropped

In [None]:
# categorical and numerical columns
cat_columns = df_dropped.select_dtypes(include=['object']).columns
num_columns = df_dropped.select_dtypes(include=[np.number]).columns

print(f"There are {len(cat_columns)} categorical variables.")
print(f"There are {len(num_columns)} numerical variables.")


In [None]:
# Finding out which of the categorical and numerical columns have null values separately
cat_null_columns = [col for col in cat_columns if df_dropped[col].isnull().any()]
num_null_columns = [col for col in num_columns if df_dropped[col].isnull().any()]

print(f"Categorical columns with null values: {cat_null_columns}")
print(f"Total number of categorical columns with null values: {len(cat_null_columns)}")

print(f"\nNumerical columns with null values: {num_null_columns}")
print(f"Total number of numerical columns with null values: {len(num_null_columns)}")


Filling missing values:
* Mode for categorical variables
* Mean for numerical variables

In [None]:
# Filling missing values in categorical columns with mode
most_freq_imp = SimpleImputer(strategy="most_frequent")
for column in cat_null_columns:
    df_dropped[column] = most_freq_imp.fit_transform(df_dropped[[column]]).ravel()

In [None]:
# Making sure that all the categorical missing values are filled
df_dropped.columns[df_dropped.isnull().any()]

In [None]:
# Filling missing values in numerical columns with mean
for column in num_null_columns:
    # Assign the result back instead of using inplace=True on a column selection
    df_dropped[column] = df_dropped[column].fillna(df_dropped[column].mean())

In [None]:
# Count missing values across all cells of the dataframe
still_null = df_dropped.isnull().sum().sum()
print(f"There are {still_null} null values.")

## Visualizing data distributions and relationships

##### Histograms for numerical variables

In [None]:
# Histograms for all numerical variables
fig = plt.figure(figsize=(14, 18))
for i, col in enumerate(num_columns):
    plt.subplot(10, 4, i + 1)
    df_dropped[col].hist()
    plt.title(col)

fig.tight_layout()
plt.show()

From the above histograms we can observe that some variables are roughly normally distributed (symmetric), whereas others are skewed.
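
The visual impression can also be quantified with the sample skewness. The following is a minimal sketch on synthetic data (the two columns are made-up stand-ins, not columns of the dataset): values near 0 suggest a symmetric distribution, while clearly positive values indicate a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly symmetric column, one right-skewed column
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=1460),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1460),
})

# Sample skewness per column
skewness = toy.skew()
print(skewness)
```

On the notebook's data, the analogous call would be `df_dropped[num_columns].skew()`.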

##### Bar charts for categorical variables

In [None]:
# Bar charts for categorical variables
fig = plt.figure(figsize=(14, 24))
for i, col in enumerate(cat_columns):
    plt.subplot(10, 4, i + 1)
    df_dropped[col].value_counts().plot(kind='bar')
    plt.title(col)

fig.tight_layout()
plt.show()

In the above bar plots, we can observe the frequencies of each category for all categorical variables.

##### Heatmap for correlation between numerical variables

In [None]:
# Correlation heatmap for numerical variables
plt.figure(figsize=(16, 16))
sns.heatmap(df_dropped[num_columns].corr(), annot=True, square=True, cmap='coolwarm', annot_kws={"size": 5})
plt.show()

The heatmap shows that the pairwise correlations between numerical variables range from moderately negative (about -0.4) to strongly positive (about 0.8).

## Handling Outliers

##### Box plots for numerical variables

In [None]:
df_clean = df_dropped.copy()

In [None]:
# Box plots for all numerical variables
fig = plt.figure(figsize=(14, 18))
for i, col in enumerate(num_columns):
    plt.subplot(10, 4, i + 1)
    sns.boxplot(x=df_clean[col])
    plt.title(col)

fig.tight_layout()
plt.show()

The box plots show many points beyond the whiskers, that is, more than 1.5 times the interquartile range away from the quartiles. Removing outliers with the IQR method discards nearly half of the data, and the Z-score method with a threshold of 3 standard deviations discards around 30% (I tried both). After checking the histograms and the data description, I concluded that the points that look like outliers in the box plots are rare but plausible values. So, instead of removing them, I decided to keep all of them. The code I used for the IQR and Z-score methods is shown below:

IQR method:

In [None]:
"""
def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    mask = (df[col] >= Q1 - 1.5 * IQR) & (df[col] <= Q3 + 1.5 * IQR)
    return df.loc[mask]

for col in num_columns:
    df_clean = remove_outliers(df_clean, col)
"""

Z-score method:

In [None]:
"""
# Define a threshold
threshold = 3

# For each numeric column, if the zscore is greater than threshold, remove it
for col in num_columns:
    df_clean = df_clean[(np.abs(zscore(df_clean[col])) < threshold)]
"""

In [None]:
df_clean
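
The large row loss comes from applying the filter one column at a time, so a row is dropped as soon as it is extreme in any single variable. The sketch below illustrates how this compounds. It uses synthetic right-skewed data rather than the housing data, and the columns `x0` to `x9` and the helper names `iqr_mask`, `z_mask` and `rows_kept` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 independent right-skewed columns, 1460 rows (made-up data)
toy = pd.DataFrame(
    rng.lognormal(mean=0.0, sigma=1.0, size=(1460, 10)),
    columns=[f"x{i}" for i in range(10)],
)

def iqr_mask(s):
    # True for values inside the 1.5 * IQR fences
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

def z_mask(s, threshold=3):
    # True for values within `threshold` standard deviations of the mean
    return ((s - s.mean()) / s.std(ddof=0)).abs() < threshold

def rows_kept(df, mask_fn):
    # Apply the filter one column at a time, recomputing statistics each pass
    for col in df.columns:
        df = df[mask_fn(df[col])]
    return len(df)

iqr_kept = rows_kept(toy, iqr_mask)
z_kept = rows_kept(toy, z_mask)
print(f"IQR filter keeps {iqr_kept} of {len(toy)} rows")
print(f"Z-score filter keeps {z_kept} of {len(toy)} rows")
```

Even though each column flags only a small share of its values, the rows removed accumulate across columns, and the IQR rule is the stricter of the two on skewed data.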

## Handling Categorical Variables (Encoding)

In [None]:
# all categorical columns in the dataframe
cat_columns

There are 37 categorical variables in total. Some have a specific order or hierarchy in their categories, some have no inherent order, and some have only two categories. Based on that, we need to choose a suitable encoder for each categorical variable. With the help of the data description, I determined the type of each categorical variable (ordered or not) and divided the variables into three groups, each encoded with a different encoder.

In [None]:
# variables with specific order in the categories
ord_enc_cols = ['LotShape', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 
        'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 
        'GarageQual', 'GarageCond']

# variables without any order in the categories
one_hot_enc_cols = ['MSZoning', 'Street', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 
       'GarageFinish', 'PavedDrive', 'SaleType', 'SaleCondition']

# variables with only 2 categories
bin_enc_cols = ['CentralAir']

In [None]:
# Defining the encoders
ord_enc = OrdinalEncoder()
bin_enc = LabelEncoder()

In [None]:
# Encoding variables with a specific order in the categories
for col in ord_enc_cols:
    df_clean[col] = ord_enc.fit_transform(df_clean[[col]])

df_clean
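
One caveat: with its default settings, `OrdinalEncoder` assigns codes in sorted (alphabetical) order, which need not match the quality ranking. A minimal sketch of passing the order explicitly follows. It assumes the worst-to-best scale `Po < Fa < TA < Gd < Ex` from the data description, and it uses a toy column rather than the real dataframe.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column using the quality codes from the data description
toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa", "Gd"]})

# Default behaviour: categories are sorted alphabetically (Ex, Fa, Gd, TA)
default_codes = OrdinalEncoder().fit_transform(toy[["ExterQual"]]).ravel()

# Explicit order from worst to best, so larger codes mean better quality
# (assumed scale; adjust per column according to the data description)
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
ordered_enc = OrdinalEncoder(categories=[quality_order])
ordered_codes = ordered_enc.fit_transform(toy[["ExterQual"]]).ravel()

print(default_codes)
print(ordered_codes)
```

Each ordinal column would need its own category list, since the scales differ between columns such as `LotShape` and `BsmtExposure`.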

In [None]:
# Encoding variables without any order in the categories
df_clean = pd.get_dummies(df_clean, columns=one_hot_enc_cols)

In [None]:
# Encoding binary variables
for col in bin_enc_cols:
    df_clean[col] = bin_enc.fit_transform(df_clean[col])

In [None]:
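# Converting True and False values (the dummy columns) to 1s and 0s,
# leaving the float columns untouched so their values are not truncated
bool_cols = df_clean.select_dtypes(include="bool").columns
df_clean = df_clean.astype({col: int for col in bool_cols})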

In [None]:
df_encoded = df_clean.copy()
df_encoded

In [None]:
df_encoded.columns

## Feature Selection

For feature selection, I tried the Recursive Feature Elimination (RFE) method, but because the dimensionality of the data is high (203 variables) it was computationally expensive. So, I am using the Random Forest importance method, which measures the importance of a feature by calculating the total reduction in the split criterion brought by that feature (also known as Gini importance or Mean Decrease in Impurity).

In [None]:
model = RandomForestRegressor(random_state=1)
model.fit(df_encoded.drop('SalePrice', axis=1), df_encoded['SalePrice'])

# Get the importance of each feature
importance = model.feature_importances_

# Map feature importance values with their corresponding feature names
feature_importance = pd.Series(importance, index=df_encoded.drop('SalePrice', axis=1).columns).sort_values(ascending=False)

# Keeping features which have importance of more than 0.001
selected_features = feature_importance[feature_importance > 0.001].index


In [None]:
# Print the selected features
selected_features
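
Impurity-based importances are normalised to sum to 1, so the 0.001 cutoff means a feature must account for more than 0.1% of the total impurity reduction. As a rough illustration on synthetic data (the features `f0` to `f9` are made up and unrelated to the housing columns):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only 3 of which drive the target
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=1)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Importances across all features add up to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Same rule as above: keep features whose importance exceeds the cutoff
kept = importances[importances > 0.001].index
print(importances.round(3))
print(f"Kept {len(kept)} of {X.shape[1]} features")
```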

In [None]:
# Creating a new dataframe with the selected features and the target variable
df_final = pd.concat([df_encoded[selected_features], df_encoded[['SalePrice']]], axis=1)
df_final

## Model Training

##### Splitting the dataset

In [None]:
# Function to split the dataset
def split_dataset(df, target_var, random_state=42):
    features = df.drop(target_var, axis=1)  # Features
    target = df[target_var]  # Target variable
    # Fixed seed so the split, and the metrics reported later, are reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=random_state
    )
    return X_train, X_test, y_train, y_test  # Returning train and test data

In [None]:
# Splitting the df_final dataset
X_train, X_test, y_train, y_test = split_dataset(df_final, "SalePrice")

##### Training the model

I am going to train various models (Logistic Regression, Random Forest, XGBoost) and compare their performance on the 'df_final' dataframe, where categorical variables are encoded and features are selected using the Random Forest feature importance method.

Then I will train CatBoostRegressor on the dataframe as it was before encoding the categorical features and before feature selection. CatBoostRegressor can be trained on categorical variables directly once it is told which columns are categorical.

Logistic Regression

In [None]:
# Defining the parameters to be tuned
param_grid_lr = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Create a Logistic Regression estimator
estimator_lr = LogisticRegression(solver='liblinear')

# Create the GridSearchCV object
grid_search_lr = GridSearchCV(estimator=estimator_lr, param_grid=param_grid_lr, cv=5)

# Fit the GridSearchCV object to the data
grid_search_lr.fit(X_train, y_train)

# Getting the best parameters
best_params_lr = grid_search_lr.best_params_

In [None]:
# Printing the best parameters for LogisticRegression
best_params_lr

In [None]:
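# Create a new LogisticRegression with the best hyperparameters
# (same solver as in the grid search, so the tuned C applies to the same model)
best_lr_model = LogisticRegression(solver='liblinear', **best_params_lr)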

# Fit the model with the training data
best_lr_model.fit(X_train, y_train)
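
Note that `LogisticRegression` is a classifier, so when it is fitted on `SalePrice` it treats every distinct price as a separate class. For a continuous target, a regularised linear regressor is a common baseline. The sketch below is not part of the workflow above. It swaps in scikit-learn's `Ridge`, tunes its `alpha` (which plays a role loosely inverse to `C`) with the same `GridSearchCV` pattern, and uses synthetic data so that it runs on its own. All names ending in `_ridge` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for df_final (made-up values)
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# alpha is Ridge's regularisation strength (larger means stronger shrinkage)
param_grid_ridge = {"alpha": [0.01, 0.1, 1, 10, 100]}
grid_search_ridge = GridSearchCV(Ridge(), param_grid_ridge, cv=5, scoring="r2")
grid_search_ridge.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already refitted on X_tr
best_ridge = grid_search_ridge.best_estimator_
ridge_r2 = best_ridge.score(X_te, y_te)
print(grid_search_ridge.best_params_, round(ridge_r2, 3))
```

On the real data, linear models usually benefit from scaled features, which this sketch omits for brevity.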

Random Forest

In [None]:
# Defining the parameters to be tuned
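param_grid_rf = {
    'n_estimators': [100, 200],
    # 1.0 uses all features, which is what 'auto' meant for regressors
    # before that option was removed from scikit-learn
    'max_features': [1.0, 'sqrt'],
    'max_depth': list(range(6, 12))
}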

# Create a Random Forest estimator
estimator_rf = RandomForestRegressor()

# Create the GridSearchCV object
grid_search_rf = GridSearchCV(estimator=estimator_rf, param_grid=param_grid_rf, cv=5)

# Fit the GridSearchCV object to the data
grid_search_rf.fit(X_train, y_train)

# Getting the best parameters
best_params_rf = grid_search_rf.best_params_

In [None]:
# Printing the best parameters for RandomForestRegressor
best_params_rf

In [None]:
# Create a new RandomForestRegressor with the best hyperparameters
best_rf_model = RandomForestRegressor(**best_params_rf)

# Fit the model with the training data
best_rf_model.fit(X_train, y_train)

XGBoost

In [None]:
# Defining the parameters to be tuned
param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [7, 10],
    'colsample_bytree': [0.5, 0.7, 1],
    'gamma': [0.0, 0.1, 0.2]
}

# Create a XGBoost estimator
estimator_xgb = xgb.XGBRegressor()

# Create the GridSearchCV object
grid_search_xgb = GridSearchCV(estimator=estimator_xgb, param_grid=param_grid_xgb, cv=5)

# Fit the GridSearchCV object to the data
grid_search_xgb.fit(X_train, y_train)

# Getting the best parameters
best_params_xgb = grid_search_xgb.best_params_

In [None]:
# Printing the best parameters for XGBRegressor
best_params_xgb

In [None]:
# Create a new XGBRegressor with the best hyperparameters
best_xgb_model = xgb.XGBRegressor(**best_params_xgb)

# Fit the model with the training data
best_xgb_model.fit(X_train, y_train)

##### Predicting on testing dataset with above models

In [None]:
# Logistic Regression
lr_pred = best_lr_model.predict(X_test)

# Random Forest
rf_pred = best_rf_model.predict(X_test)

# XGBoost
xgb_pred = best_xgb_model.predict(X_test)

##### Evaluating the models' performance

In [None]:
# Function to evaluate performance of the model based on different metrics
def eval_metrics(y_test, pred):
    # Calculate Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, pred)
    print(f"Mean Absolute Error (MAE): {mae}")

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, pred)
    print(f"Mean Squared Error (MSE): {mse}")

    # Calculate Root Mean Squared Error (RMSE)
    rmse = sqrt(mse)
    print(f"Root Mean Squared Error (RMSE): {rmse}")

    # Calculate R^2 Score
    r2 = r2_score(y_test, pred)
    print(f"R^2 Score: {r2}")
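
To make the four metrics concrete, here is a small worked example computed by hand with NumPy. The three prices and predictions are made up purely so that the arithmetic is easy to follow.

```python
import numpy as np

# Made-up prices and predictions, small enough to check by hand
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true                         # [10, -10, 30]
mae = np.mean(np.abs(errors))                    # (10 + 10 + 30) / 3
mse = np.mean(errors ** 2)                       # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                              # square root of the MSE
ss_res = np.sum(errors ** 2)                     # 1100
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20000
r2 = 1 - ss_res / ss_tot                         # 1 - 1100 / 20000 = 0.945

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same units as the target (here, the sale price), while R^2 is unitless and compares the model against always predicting the mean.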

In [None]:
# Logistic Regression
print("Logistic Regression performance:\n")
eval_metrics(y_test, lr_pred)

In [None]:
# Random Forest
print("Random Forest performance:\n")
eval_metrics(y_test, rf_pred)

In [None]:
# XGBoost Regression
print("XGBoost performance:\n")
eval_metrics(y_test, xgb_pred)

## Training CatBoostRegressor with categorical variables

Here, I am going to train a CatBoostRegressor model without encoding the categorical variables and without doing feature selection. So I am using the 'df_dropped' dataframe, as it is the one from before the categorical variables were encoded.

In [None]:
# Print df_dropped dataframe
df_dropped

In [None]:
# Splitting the dataframe into training and testing
X_train_01, X_test_01, y_train_01, y_test_01 = split_dataset(df_dropped, "SalePrice")

In [None]:
# Storing the indices of categorical variables by excluding float and int data types
categorical_features_indices = np.where((X_train_01.dtypes != np.float64) & (X_train_01.dtypes != np.int64))[0]

In [None]:
# Parameters grid to tune
param_grid_cat = {
    'depth': [6, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [50, 100, 200]
}

# CatBoost regressor
estimator_cat = CatBoostRegressor(loss_function='RMSE', cat_features=categorical_features_indices)

# Grid search
grid_search_cat = GridSearchCV(estimator=estimator_cat, param_grid=param_grid_cat, cv=3, scoring='neg_root_mean_squared_error')

# Fitting
grid_search_cat.fit(X_train_01, y_train_01)

# Get the best parameters
best_params_cat = grid_search_cat.best_params_

In [None]:
# Printing the best parameters for CatBoostRegressor
best_params_cat

In [None]:
# Create a new CatBoostRegressor with the best hyperparameters
best_cat_model = CatBoostRegressor(**best_params_cat, cat_features=categorical_features_indices)

# Fit the model with the training data
best_cat_model.fit(X_train_01, y_train_01)

In [None]:
# Predicting on testing dataset
cat_pred = best_cat_model.predict(X_test_01)

In [None]:
# Performance Evaluation of CatBoostRegressor
print("CatBoost performance:\n")
eval_metrics(y_test_01, cat_pred)