# Description:

The objective is to predict house prices from the other features provided. The dataset contains 18 columns with descriptive information about each house, such as bedrooms, bathrooms, and sqft_above. Features like these are expected to be positively related to the target variable, price. Several machine learning algorithms are trained and compared to arrive at the final model.

## Step 0 : Importing Libraries and Dataset 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import  GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error

import warnings
warnings.filterwarnings("ignore")

: 

In [None]:
#Importing Dataset
data = pd.read_csv("data.csv")

: 

## Step 1: Descriptive Statistics

In [None]:
#Previewing Data
data.head()

: 

In [None]:
#Checking number of rows and column in Dataset
data.shape

: 

In [None]:
#Checking for data types of the column in dataset
data.info()

: 

In [None]:
# Analysing the Descriptive Statistics of the Data
data.describe().T

: 

In [None]:
#Checking for Missing Values in the Data
data.isnull().sum()

: 

### Observations:
1. There are 4600 records and 18 columns in the dataset.
2. The columns are of int, float, and object data types.
3. There are no NaN values in the dataset.
4. price is the target variable.

## Step 2: Data Visualization

In [None]:
plt.figure(figsize=(14,8))
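# numeric_only=True restricts the correlation to numeric columns (text columns such as city cannot be correlated)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='viridis')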
plt.show()

: 

In [None]:
# Histograms for numerical features
numerical_features = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]
for feature in numerical_features:
    plt.figure(figsize=(8, 6))
    sns.histplot(data[feature], kde=True, bins=30, color='skyblue')
    plt.title(f'Histogram of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()

: 

In [None]:
sns.countplot(x='condition', data=data, palette='viridis')
plt.xlabel('Condition')
plt.ylabel('Count')
plt.title('House Condition Count')
plt.show()

: 

In [None]:
# Bar plots for categorical features
categorical_features = ["waterfront", "view", "condition"]
for feature in categorical_features:
    plt.figure(figsize=(8, 6))
    sns.countplot(x=feature, data=data, palette='Set2')
    plt.title(f'Bar Plot of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.show()

: 

In [None]:
# Pair plot for numerical features
sns.pairplot(data[numerical_features])
plt.suptitle('Pair Plot of Numerical Features')
plt.show()

: 

In [None]:
#Visualizing Outlier in the price using boxplot
sns.boxplot(x=data['price'])
plt.show()

: 

### Visualization 
- Histograms show how the data is distributed in bedrooms, bathrooms, sqft_living, and sqft_lot.
- A pair plot is used to look for patterns and relationships between these numerical features.
- A heat map shows the correlation between the different features.
- A box plot is used to visualize the outliers in price.
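
The box plot marks points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a rough sketch of the same rule in numbers, the snippet below computes those fences for a small made-up price series. The helper name `iqr_bounds` and the sample values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the 1.5 * IQR outlier rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical prices; the last value is deliberately extreme.
sample_prices = pd.Series([300_000, 350_000, 400_000, 450_000, 500_000, 5_000_000])
lower, upper = iqr_bounds(sample_prices)

# Keep only the values that fall outside the fences.
outliers = sample_prices[(sample_prices < lower) | (sample_prices > upper)]
print(f"fences: {lower:.0f} to {upper:.0f}")
print("outliers:", outliers.tolist())
```

Applied to `data['price']`, the same helper would give a count of flagged rows to compare against what the box plot shows.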

## Step 3: Data Preprocessing

In [None]:
# Checking if any duplicate values are present in the dataset
data.duplicated().any()

: 

In [None]:
df1 = data.copy()

: 

In [None]:
#Replacing zero prices with NaN so that those rows can be dropped
df1['price'] = df1['price'].replace(0, np.nan)

: 

In [None]:
#Checking the total number of NaN values after replacing the zeros with NaN
df1.isnull().sum()

: 

In [None]:
#Checking the shape of the dataframe before dropping the NaN rows
df1.shape

: 

In [None]:
#Dropping the rows that contain NaN values
df1.dropna(inplace=True)

: 

In [None]:
#Checking the shape after dropping the NaN rows
df1.shape

: 

In [None]:
# Sorting the dataframe by price in descending order to get the outliers on top
df1 = df1.sort_values(by='price', ascending=False).reset_index()

: 

In [None]:
df1.head()

: 

In [None]:
# Dropping the top 3 rows, which hold the outlier prices
df1 = df1[3:].reset_index(drop=True)

: 

In [None]:
#Checking for the updated dataframe
df1.head()

: 

In [None]:
# Resetting the row index of the dataframe so that it starts at 1
df1.index = pd.RangeIndex(start=1, stop=len(df1)+1, step=1)

: 

In [None]:
# Dropping the leftover 'index' column created by the earlier reset_index()
df1.drop(columns='index', inplace=True)

: 

In [None]:
# Checking the index number in the dataframe
df1

: 

### Changes Made
1. As our dataset did not contain any null values, we checked for duplicated rows.
2. No duplicate rows are present in our dataset.
3. We then checked for any 0 values in the target variable.
4. We found 49 zero values in price (the target variable).
5. We replaced all the 0 values with NaN (Not a Number).
6. After replacing the zeros with NaN, we dropped all rows that contain NaN values.
7. We sorted the dataframe by price in descending order to bring the outliers to the top.
8. We dropped the top 3 rows containing the highest prices, as they are most likely outliers.
9. The final dataset now has 4548 rows and 18 features.
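
The cleaning steps above can be collected into one function. The following is a minimal sketch on a made-up frame; the function name `clean_prices`, the `n_top` parameter, and the toy values are assumptions for illustration. Unlike the cells above, it drops NaN only in the price column, which gives the same result here because no other column has missing values.

```python
import numpy as np
import pandas as pd


def clean_prices(frame: pd.DataFrame, n_top: int = 3) -> pd.DataFrame:
    """Drop zero-price rows, then drop the n_top most expensive rows."""
    cleaned = frame.copy()
    # Zero is not a plausible sale price, so treat it as missing.
    cleaned["price"] = cleaned["price"].replace(0, np.nan)
    cleaned = cleaned.dropna(subset=["price"])
    # Sort so the highest prices come first, then skip them.
    cleaned = cleaned.sort_values(by="price", ascending=False)
    return cleaned.iloc[n_top:].reset_index(drop=True)


# Hypothetical data: two zero prices and one extreme price.
toy = pd.DataFrame({
    "price": [0, 250_000, 9_000_000, 300_000, 0, 275_000],
    "bedrooms": [3, 2, 6, 3, 4, 2],
})
zero_count = int((toy["price"] == 0).sum())
result = clean_prices(toy, n_top=1)
print("zero prices:", zero_count)
print(result)
```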

## Step 4: Frequency Encoding

In [None]:
df1.columns

: 

In [None]:
# Selecting the categorical columns for Frequency encoding
columns_encoded = ['city']

: 

In [None]:
# Calculate frequency of each category in the 'city' column
city_frequency = df1['city'].value_counts(normalize=True)

: 

In [None]:
# Map the frequency values to the 'city' column
df1['city_encoded'] = df1['city'].map(city_frequency)

: 

In [None]:
# Display the first few rows of the DataFrame with the encoded 'city' feature
print(df1[['city', 'city_encoded']].head())

: 

In [None]:
df1.columns

: 

In [None]:
#Dropping the city column from the data
df1.drop(columns='city', inplace=True)

: 

In [None]:
#Checking that the city column has been dropped
df1.columns

: 

In [None]:
#Checking the final Shape of the Dataset
df1.shape

: 

In [None]:
df1.to_csv('data_Encoded.csv', index=False)

: 

### Necessary Changes Made
1. As our dataset has a categorical feature (city), we used frequency encoding to convert it to numbers.
2. Most machine learning algorithms need numeric inputs. Frequency encoding replaces each city with the share of rows it accounts for, which keeps the feature in a single column.
3. The city column is nominal: its values have no numeric meaning and no natural order, so an ordinal encoding does not fit and frequency encoding is used instead.
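
As a small worked example of what the mapping does, the sketch below applies the same two pandas calls (`value_counts(normalize=True)` and `map`) to five made-up rows. The city values are illustrative only.

```python
import pandas as pd

# Hypothetical city column with three distinct values.
cities = pd.Series(["Seattle", "Seattle", "Bellevue", "Seattle", "Redmond"], name="city")

# Share of rows per city: Seattle 3/5, Bellevue 1/5, Redmond 1/5.
frequency = cities.value_counts(normalize=True)

# Replace each city name with its share.
encoded = cities.map(frequency)
print(pd.DataFrame({"city": cities, "city_encoded": encoded}))
```

One limitation is visible even in this tiny case: Bellevue and Redmond occur equally often, so they receive the same encoded value and the model can no longer tell them apart.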

## Step 5: Data Modelling

In [None]:
# Selecting Features for training and testing
X = df1.iloc[:, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 17]].values
Y = df1.iloc[:, 1].values

: 

In [None]:
# Splitting X and Y
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.20, random_state = 42)

: 

In [None]:
# Checking dimensions
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

: 

### Random Forest Regression Model

In [None]:
# Random Forest Model
random_forest = RandomForestRegressor(n_estimators=100, random_state=42)

: 

In [None]:
# Fit the model for training data
random_forest.fit(X_train, Y_train)

: 

In [None]:
# Make prediction on the test data
rand_pred = random_forest.predict(X_test)  

: 

### Gradient Boosting Model

In [None]:
# Initializing the Gradient Boosting Regressor
gradient_boost = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

: 

In [None]:
# Fitting the training data in the model
gradient_boost.fit(X_train, Y_train)

: 

In [None]:
# Making predictions on the test data
grad_pred = gradient_boost.predict(X_test)

: 

### Support Vector Regressor Machine Model

In [None]:
# Initializing the Support Vector Regressor
svr_model = SVR(C=1.0, epsilon=0.2)

: 

In [None]:
# Fitting the model on the training data
svr_model.fit(X_train, Y_train)

: 

In [None]:
# Make predictions on the test data
svr_pred = svr_model.predict(X_test)

: 

### Linear Regression Model

In [None]:
# Initialize the Linear Regression model
linear_regression = LinearRegression()

: 

In [None]:
# Fit the model on the training data
linear_regression.fit(X_train, Y_train)

: 

In [None]:
# Make predictions on the test data
lr_pred = linear_regression.predict(X_test)

: 

### Model Used for the House Price Prediction
#### 1. We first split the dataset into two parts, a training set and a testing set, in a ratio of 8:2.
#### 2. Random Forest Regression :
Random Forest Regression is an ensemble method that trains multiple decision trees and averages their results. Averaging makes the model less prone to overfitting than a single tree, and it can handle a large number of features, making it a good choice for complex data.
#### 3. Gradient Boosting Regression :
Gradient Boosting is another ensemble method that builds multiple weak prediction models, typically decision trees, in a stage-wise fashion, with each new model fitted to correct the errors of the previous ones. It is known for its efficiency and accuracy.
#### 4. Support Vector Regression (SVR) :
Support Vector Regression (SVR) applies the principle of Support Vector Machines (SVM) to regression problems. It tries to fit a function that keeps the data points within a threshold (epsilon) of the predicted values. It can be effective when the data has multiple features with complex relationships.
#### 5. Linear Regression :
Despite its simplicity, linear regression can serve as a baseline model for price prediction. It assumes a linear relationship between the independent variables and the target variable. It can provide a quick and easy way to understand the influence of each feature on the house prices.
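
SVR is sensitive to the scale of both the features and the target, and the notebook imports `StandardScaler` without using it. The sketch below shows one possible way to add scaling, on synthetic data rather than the house dataset. Wrapping the model in `make_pipeline` and `TransformedTargetRegressor` is an assumption of this example, not something the cells above do, and whether it improves the SVR scores on the real data has not been checked here.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: two features on a large scale and a price-like target.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 5000, size=(200, 2))
y_demo = 100 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(0, 1000, size=200)

# Features are standardized inside the pipeline; the target is standardized
# before fitting and predictions are converted back to the original units.
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2)),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_demo, y_demo)
demo_pred = scaled_svr.predict(X_demo[:5])
print(demo_pred)
```

On the notebook's data the same object could be fitted with `X_train` and `Y_train` in place of the synthetic arrays, so that the scalers learn their statistics from the training split only.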


## Step 6: Model Evaluation 

In [None]:
# Calculating Mean Squared Error (MSE) for random forest Regression 
rf_mse = mean_squared_error(Y_test, rand_pred)
print(f"Mean Squared Error(MSE) for Random Forest Regression: {rf_mse: .2f}\n")

# Calculate Mean Squared Error (MSE) for Gradient Boosting Regression
gb_mse = mean_squared_error(Y_test, grad_pred)
print(f"Mean Squared Error(MSE) for Gradient Boosting Regression: {gb_mse: .2f}\n")

# Calculate Mean Squared Error (MSE) for Support Vector Regression 
svr_mse = mean_squared_error(Y_test, svr_pred)
print(f"Mean Squared Error(MSE) for Support Vector Regression: {svr_mse: .2f}\n")

# Calculate Mean Squared Error (MSE) for Linear Regression
lr_mse = mean_squared_error(Y_test, lr_pred)
print(f"Mean Squared Error(MSE) for Linear Regression: {lr_mse: .2f}")

: 

In [None]:
# Calculating R-Squared Score for Random Forest Regression 
rf_r2 = r2_score(Y_test, rand_pred)
print(f"R-Squared Score for Random Forest Regression: {rf_r2: .2f}\n")

# Calculating R-Squared Score for Gradient Boosting Regression 
gb_r2 = r2_score(Y_test, grad_pred)
print(f"R-Squared Score for Gradient Boosting Regression: {gb_r2: .2f}\n")

# Calculating R-Squared Score for Support Vector Regression
svr_r2 = r2_score(Y_test, svr_pred)
print(f"R-Squared Score for Support Vector Regression: {svr_r2: .2f}\n")

# Calculating R-Squared Score for Linear Regression
lr_r2 = r2_score(Y_test, lr_pred)
print(f"R-Squared Score for Linear Regression: {lr_r2: .2f}")

: 

In [None]:
# RMSE is computed as the square root of MSE; the squared=False argument
# was deprecated in later scikit-learn releases, so np.sqrt is used instead.

# Calculating Root Mean Squared Error (RMSE) for Random Forest Regression
rf_rmse = np.sqrt(mean_squared_error(Y_test, rand_pred))
print(f"Root Mean Squared Error(RMSE) for Random Forest Regression: {rf_rmse: .2f}\n")

# Calculating Root Mean Squared Error (RMSE) for Gradient Boosting Regression
gb_rmse = np.sqrt(mean_squared_error(Y_test, grad_pred))
print(f"Root Mean Squared Error(RMSE) for Gradient Boosting Regression: {gb_rmse: .2f}\n")

# Calculating Root Mean Squared Error (RMSE) for Support Vector Regression
svr_rmse = np.sqrt(mean_squared_error(Y_test, svr_pred))
print(f"Root Mean Squared Error(RMSE) for Support Vector Regression: {svr_rmse: .2f}\n")

# Calculating Root Mean Squared Error (RMSE) for Linear Regression
lr_rmse = np.sqrt(mean_squared_error(Y_test, lr_pred))
print(f"Root Mean Squared Error(RMSE) for Linear Regression: {lr_rmse: .2f}")

: 

In [None]:
# Calculating Mean Absolute Error (MAE) for random forest Regression 
rf_mae = mean_absolute_error(Y_test, rand_pred)
print(f"Mean Absolute Error(MAE) for Random Forest Regression: {rf_mae: .2f}\n")

# Calculating Mean Absolute Error (MAE) for Gradient Boosting Regression
gb_mae = mean_absolute_error(Y_test, grad_pred)
print(f"Mean Absolute Error(MAE) for Gradient Boosting Regression: {gb_mae: .2f}\n")

# Calculating Mean Absolute Error (MAE) for Support Vector Regression 
svr_mae = mean_absolute_error(Y_test, svr_pred)
print(f"Mean Absolute Error(MAE) for Support Vector Regression: {svr_mae: .2f}\n")

# Calculating Mean Absolute Error (MAE) for Linear Regression
lr_mae = mean_absolute_error(Y_test, lr_pred)
print(f"Mean Absolute Error(MAE) for Linear Regression: {lr_mae: .2f}")

: 

In [None]:
# mean_absolute_percentage_error returns a fraction (0.25 means 25%),
# so each value is multiplied by 100 before printing it with a % sign.

# Calculating Mean Absolute Percentage Error (MAPE) for Random Forest Regression
rf_mape = mean_absolute_percentage_error(Y_test, rand_pred)
print(f"Mean Absolute Percentage Error(MAPE) for Random Forest Regression: {rf_mape * 100: .2f}% \n")

# Calculating Mean Absolute Percentage Error (MAPE) for Gradient Boosting Regression
gb_mape = mean_absolute_percentage_error(Y_test, grad_pred)
print(f"Mean Absolute Percentage Error(MAPE) for Gradient Boosting Regression: {gb_mape * 100: .2f}% \n")

# Calculating Mean Absolute Percentage Error (MAPE) for Support Vector Regression
svr_mape = mean_absolute_percentage_error(Y_test, svr_pred)
print(f"Mean Absolute Percentage Error(MAPE) for Support Vector Regression: {svr_mape * 100: .2f}% \n")

# Calculating Mean Absolute Percentage Error (MAPE) for Linear Regression
lr_mape = mean_absolute_percentage_error(Y_test, lr_pred)
print(f"Mean Absolute Percentage Error(MAPE) for Linear Regression: {lr_mape * 100: .2f}%")

: 

## Step 7: Model Selection

- From the evaluation metrics above, Random Forest Regression appears to be the most suitable model for this machine learning problem.
- MSE and RMSE are based on squared differences between actual and predicted values, so they penalize large errors heavily. R2 measures the share of the variance in price that the model explains. MAE and MAPE report the average absolute error, in price units and relative to the actual price.
- In the evaluation above, the Random Forest Regression model has the best performance on MSE, R2, RMSE, MAE, and MAPE.
- From this we conclude that the Random Forest model is the best fit for this dataset.
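
To make the five metrics concrete, the sketch below computes all of them at once for a tiny hand-checkable example. The helper name `regression_report` and the four sample values are assumptions for illustration; they are not results from the models above.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)


def regression_report(y_true, y_pred) -> dict:
    """Collect the evaluation metrics used in this notebook in one dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": mean_absolute_error(y_true, y_pred),
        # scikit-learn returns MAPE as a fraction, so convert to percent.
        "MAPE_percent": 100 * mean_absolute_percentage_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }


# Hypothetical values with errors of 10, -10, 30 and 0.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])
report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Here the single error of 30 contributes 900 of the 1100 total squared error, which is why MSE and RMSE react more strongly to large misses than MAE does. Calling the helper once per model (for example with `Y_test` and `rand_pred`) and putting the dicts in a DataFrame would give a side-by-side comparison table.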

### Conclusion :

### The Random Forest Model is Selected for this Regression Problem.

In [None]:
%pip install scikit-learn==1.2.1

: 