# **Pakistan House Price Prediction**
---

## **Importing Data**

In [1]:
# importing libraries
import numpy as np
import pandas as pd

# disable warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# reading data
df = pd.read_csv('house_prices.csv')
df.head()

,Unnamed: 0,property_type,price,location,city,baths,purpose,bedrooms,Area_in_Marla
0,0,Flat,10000000,G-10,Islamabad,2,For Sale,2,4.0
1,1,Flat,6900000,E-11,Islamabad,3,For Sale,3,5.6
2,2,House,16500000,G-15,Islamabad,6,For Sale,5,8.0
3,3,House,43500000,Bani Gala,Islamabad,4,For Sale,4,40.0
4,4,House,7000000,DHA Defence,Islamabad,3,For Sale,3,8.0


In [3]:
df.tail()

,Unnamed: 0,property_type,price,location,city,baths,purpose,bedrooms,Area_in_Marla
99494,168435,Flat,7500000,Bahria Town Karachi,Karachi,3,For Sale,3,8.0
99495,168436,House,8800000,Bahria Town Karachi,Karachi,4,For Sale,3,8.0
99496,168438,House,14000000,Bahria Town Karachi,Karachi,3,For Sale,3,8.0
99497,168439,House,14000000,Bahria Town Karachi,Karachi,4,For Sale,4,14.0
99498,168445,House,9000000,Bahria Town Karachi,Karachi,3,For Sale,3,9.4


## **Data Cleaning and Preprocessing**

In [4]:
# dropping unnecessary columns
df.drop(["Unnamed: 0"], axis = 1, inplace = True)

In [5]:
df.head()

,property_type,price,location,city,baths,purpose,bedrooms,Area_in_Marla
0,Flat,10000000,G-10,Islamabad,2,For Sale,2,4.0
1,Flat,6900000,E-11,Islamabad,3,For Sale,3,5.6
2,House,16500000,G-15,Islamabad,6,For Sale,5,8.0
3,House,43500000,Bani Gala,Islamabad,4,For Sale,4,40.0
4,House,7000000,DHA Defence,Islamabad,3,For Sale,3,8.0


### **Checking for null values**

In [6]:
df.isna().sum()

property_type,0
price,0
location,0
city,0
baths,0
purpose,0
bedrooms,0
Area_in_Marla,0


In [7]:
df.city.value_counts()

city,count
Karachi,37066
Lahore,26221
Islamabad,22243
Rawalpindi,11738
Faisalabad,2231


In [8]:
df.property_type.value_counts()

property_type,count
House,58169
Flat,26658
Upper Portion,8539
Lower Portion,5549
Penthouse,255
Room,241
Farm House,88


In [9]:
df.location.value_counts()

location,count
DHA Defence,11787
Bahria Town Karachi,6697
Bahria Town Rawalpindi,5257
Bahria Town,4437
Gulistan-e-Jauhar,3532
...,...
Times Residency,1
CBR Town Phase 2,1
Montgomery Road,1
Sahianwala,1


In [10]:
df.purpose.value_counts()

purpose,count
For Sale,70947
For Rent,28552


### **Dropping duplicate values**

In [11]:
df = df.drop_duplicates().reset_index(drop = True)

### **Feature Engineering**

In [12]:
# converting area from marla to square feet (1 marla = 272.25 sq ft)
df['area'] = df['Area_in_Marla'] * 272.25
df.drop('Area_in_Marla',axis=1, inplace = True)

df.head()

,property_type,price,location,city,baths,purpose,bedrooms,area
0,Flat,10000000,G-10,Islamabad,2,For Sale,2,1089.0
1,Flat,6900000,E-11,Islamabad,3,For Sale,3,1524.6
2,House,16500000,G-15,Islamabad,6,For Sale,5,2178.0
3,House,43500000,Bani Gala,Islamabad,4,For Sale,4,10890.0
4,House,7000000,DHA Defence,Islamabad,3,For Sale,3,2178.0
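
The conversion above treats one marla as 272.25 sq ft. As a quick sanity check under that assumption, the sketch below uses a hypothetical helper, `marla_to_sqft`, to convert the marla values of the first four listings shown earlier.

```python
# Assumed conversion factor, taken from the cell above: 1 marla = 272.25 sq ft
SQFT_PER_MARLA = 272.25

def marla_to_sqft(marla: float) -> float:
    """Convert an area in marla to square feet (hypothetical helper)."""
    return marla * SQFT_PER_MARLA

# Areas in marla of the first four listings shown by df.head()
sample_marla = [4.0, 5.6, 8.0, 40.0]
sample_sqft = [round(marla_to_sqft(m), 1) for m in sample_marla]
print(sample_sqft)
```

The printed values should match the `area` column of the corresponding rows in the output above.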


In [13]:
# re-ordering the columns (target last) and shortening their names
df = df[["property_type", "location", "city", "purpose", "baths", "bedrooms", "area", "price"]]
df.columns = ["type", "location", "city", "purpose", "baths", "beds", "area", "price"]

df.head()

,type,location,city,purpose,baths,beds,area,price
0,Flat,G-10,Islamabad,For Sale,2,2,1089.0,10000000
1,Flat,E-11,Islamabad,For Sale,3,3,1524.6,6900000
2,House,G-15,Islamabad,For Sale,6,5,2178.0,16500000
3,House,Bani Gala,Islamabad,For Sale,4,4,10890.0,43500000
4,House,DHA Defence,Islamabad,For Sale,3,3,2178.0,7000000


In [14]:
df.type.value_counts()

type,count
House,35014
Flat,17170
Upper Portion,5221
Lower Portion,3702
Penthouse,239
Room,213
Farm House,82


### **Categorizing Features**

In [15]:
# categorical columns
cat_cols = ["type", "location", "city", "purpose"]

# numerical columns
num_cols = ["area", "baths", "beds"]

In [16]:
from sklearn.preprocessing import LabelEncoder


encoders = {}
# fit a separate label encoder on each categorical column
for column in cat_cols:
    label_encoder = LabelEncoder()
    df[column] = label_encoder.fit_transform(df[column])
    # keep the fitted encoder so the same mapping can be reused later
    encoders[column] = label_encoder

df.head()

,type,location,city,purpose,baths,beds,area,price
0,1,452,1,1,2,2,1089.0,10000000
1,1,382,1,1,3,3,1524.6,6900000
2,2,457,1,1,6,5,2178.0,16500000
3,2,198,1,1,4,4,10890.0,43500000
4,2,327,1,1,3,3,2178.0,7000000
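
`LabelEncoder` raises a `ValueError` when asked to transform a value it did not see during fitting, which matters for a column like `location` where many values occur only once. One possible alternative, which is not what this notebook does, is scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`. The sketch below uses a small made-up frame; the choice of `-1` as the code for unknown values is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for two categorical columns of the notebook's dataframe
train = pd.DataFrame({
    "type": ["Flat", "House", "House"],
    "city": ["Islamabad", "Islamabad", "Karachi"],
})

# Unknown categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# "Penthouse" and "Lahore" were not present when the encoder was fitted
new_rows = pd.DataFrame({"type": ["Flat", "Penthouse"], "city": ["Lahore", "Karachi"]})
encoded = encoder.transform(new_rows)
print(encoded)
```

Tree-based models can usually cope with a sentinel code like this, but whether `-1` is a sensible value depends on the model that consumes it.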


In [17]:
# save the encoders as a pickle file
import pickle
with open("encoders.pkl", "wb") as f:
    pickle.dump(encoders, f)

In [18]:
from sklearn.preprocessing import StandardScaler


scaler_data = {}
# fit a separate standard scaler on each numerical column
for column in num_cols:
    scaler = StandardScaler()
    # df[[column]] selects the column as a 2D frame, which the scaler requires
    df[column] = scaler.fit_transform(df[[column]])
    # keep the fitted scaler so the same scaling can be reused later
    scaler_data[column] = scaler

# display the first few rows of the dataframe
df.head()

,type,location,city,purpose,baths,beds,area,price
0,1,452,1,1,-0.980008,-1.023639,-0.563772,10000000
1,1,382,1,1,-0.305714,-0.25647,-0.367333,6900000
2,2,457,1,1,1.717169,1.277868,-0.072673,16500000
3,2,198,1,1,0.36858,0.510699,3.856117,43500000
4,2,327,1,1,-0.305714,-0.25647,-0.072673,7000000


In [19]:
# save the scalers as a pickle file
with open("scalers.pkl", "wb") as f:
    pickle.dump(scaler_data, f)
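
The pickled encoders and scalers are presumably meant to be reloaded when pricing a new listing. The sketch below shows one way that could look. It is an illustration under assumptions: the toy data, the in-memory `pickle.dumps` round trip (used here instead of files so the example is self-contained), and the `preprocess` helper are all hypothetical.

```python
import pickle
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy training frame standing in for the notebook's df (values are illustrative)
train = pd.DataFrame({
    "city": ["Islamabad", "Karachi", "Lahore", "Karachi"],
    "area": [1089.0, 2178.0, 1524.6, 2178.0],
})

# Fit one encoder/scaler per column, then serialize to bytes instead of files
encoders = {"city": LabelEncoder().fit(train["city"])}
scalers = {"area": StandardScaler().fit(train[["area"]])}
blob_enc, blob_scl = pickle.dumps(encoders), pickle.dumps(scalers)

def preprocess(listing: dict, encoders: dict, scalers: dict) -> pd.DataFrame:
    """Apply already-fitted encoders and scalers to one raw listing."""
    row = pd.DataFrame([listing])
    for col, enc in encoders.items():
        row[col] = enc.transform(row[col])            # transform only, never refit
    for col, scl in scalers.items():
        row[col] = scl.transform(row[[col]]).ravel()  # 2D in, 1D back into the column
    return row

# At inference time: restore the objects and transform a new listing
loaded_enc, loaded_scl = pickle.loads(blob_enc), pickle.loads(blob_scl)
new_row = preprocess({"city": "Lahore", "area": 2178.0}, loaded_enc, loaded_scl)
print(new_row)
```

The important detail is that only `transform` is called on new data, so a listing is encoded and scaled with the statistics learned at training time.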

### **Splitting Data into Feature and Target Variable**

In [20]:
X = df.drop('price', axis = 1)
y = df['price']

## **Machine Learning**

### **Splitting Data into Test and Train**

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
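
Two properties of the split above are worth noting: `train_test_split` is called without `random_state`, so the scores change from run to run, and the encoders and scalers were fitted on the full dataframe before splitting, so test-set statistics reach the preprocessing. A possible alternative, sketched here on synthetic data rather than the real listings, is to fix the seed and put the preprocessing inside a scikit-learn `Pipeline`. The column choices, the synthetic price formula, and the use of `OrdinalEncoder` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for the raw (unencoded, unscaled) listings
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "type": rng.choice(["Flat", "House"], size=n),
    "city": rng.choice(["Islamabad", "Karachi", "Lahore"], size=n),
    "area": rng.uniform(500, 5000, size=n),
    "beds": rng.integers(1, 7, size=n),
})
raw["price"] = raw["area"] * 5000 + raw["beds"] * 250_000

X, y = raw.drop(columns="price"), raw["price"]
# Fixed seed so the split, and therefore the scores, are repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["type", "city"]),
    ("num", StandardScaler(), ["area", "beds"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("reg", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# fit() learns encoder categories and scaler statistics from X_train only
model.fit(X_train, y_train)
print(f"R-squared on held-out rows: {model.score(X_test, y_test):.3f}")
```

A fitted pipeline like this can also be pickled as a single object, which avoids keeping separate encoder and scaler files in sync with the model.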

### **1. Decision Tree Regressor**

In [28]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Train the DecisionTreeRegressor model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on the test set
y_pred = dt_model.predict(X_test)

# Evaluate the model
mae_dt = mean_absolute_error(y_test, y_pred)
mse_dt = mean_squared_error(y_test, y_pred)
rmse_dt = np.sqrt(mse_dt)
r2_dt = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae_dt}')
print(f'Mean Squared Error: {mse_dt}')
print(f'Root Mean Squared Error: {rmse_dt}')
print(f'R-squared: {r2_dt}')

Mean Absolute Error: 2244718.4434984005
Mean Squared Error: 18756180178776.586
Root Mean Squared Error: 4330840.585703494
R-squared: 0.8260381905798786


### **2. Random Forest Regressor**

In [29]:
from sklearn.ensemble import RandomForestRegressor

# Train the RandomForestRegressor model
rfr_model = RandomForestRegressor(random_state=42)
rfr_model.fit(X_train, y_train)

# Predict on the test set
y_pred = rfr_model.predict(X_test)

# Evaluate the model
mae_rfr = mean_absolute_error(y_test, y_pred)
mse_rfr = mean_squared_error(y_test, y_pred)
rmse_rfr = np.sqrt(mse_rfr)
r2_rfr = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae_rfr}')
print(f'Mean Squared Error: {mse_rfr}')
print(f'Root Mean Squared Error: {rmse_rfr}')
print(f'R-squared: {r2_rfr}')

Mean Absolute Error: 2000681.6318901249
Mean Squared Error: 13577140060177.936
Root Mean Squared Error: 3684717.0936420527
R-squared: 0.874073301220919


### **3. Gradient Boosting Regressor**

In [30]:
from sklearn.ensemble import GradientBoostingRegressor

# Train the GradientBoostingRegressor model
gbr_model = GradientBoostingRegressor(random_state=42)
gbr_model.fit(X_train, y_train)

# Predict on the test set
y_pred_gbr = gbr_model.predict(X_test)

# Evaluate the model
mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mse_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)

print('Gradient Boosting Regressor Results:')
print(f'Mean Absolute Error: {mae_gbr}')
print(f'Mean Squared Error: {mse_gbr}')
print(f'Root Mean Squared Error: {rmse_gbr}')
print(f'R-squared: {r2_gbr}')

Gradient Boosting Regressor Results:
Mean Absolute Error: 2862601.4668208617
Mean Squared Error: 21364918456707.418
Root Mean Squared Error: 4622220.078783291
R-squared: 0.8018423880866878


### **4. XGB Regressor**

In [31]:
import xgboost as xgb

# Train the XGBoost Regressor model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mse_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print('XGBoost Regressor Results:')
print(f'Mean Absolute Error: {mae_xgb}')
print(f'Mean Squared Error: {mse_xgb}')
print(f'Root Mean Squared Error: {rmse_xgb}')
print(f'R-squared: {r2_xgb}')

XGBoost Regressor Results:
Mean Absolute Error: 2137504.75
Mean Squared Error: 13522593382400.0
Root Mean Squared Error: 3677307.898775951
R-squared: 0.8745791912078857


In [32]:
with open("xgb_model.pkl", "wb") as f:
    pickle.dump(xgb_model, f)

## **Testing with Example**

In [26]:
test = X_test.iloc[211].values.reshape(1, -1)

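# printing predicted prices
# note: the recorded output came from a run where this first line called xgb_model,
# so the Decision Tree value shown there repeats the XGBoost prediction
print(f'House Price predicted with Decision Tree Regressor: {(dt_model.predict(test)).astype(int)}')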
print(f'House Price predicted with Random Forest Regressor: {(rfr_model.predict(test)).astype(int)}')
print(f'House Price predicted with Gradient Boost Regressor: {(gbr_model.predict(test)).astype(int)}')
print(f'House Price predicted with XGBoost Regressor: {(xgb_model.predict(test)).astype(int)}')
print('--------------------------------------------------------------')
print(f'The actual House Price is: {y_test.iloc[211]}')

House Price predicted with Decision Tree Regressor: [10307438]
House Price predicted with Random Forest Regressor: [10911064]
House Price predicted with Gradient Boost Regressor: [10308253]
House Price predicted with XGBoost Regressor: [10307438]
--------------------------------------------------------------
The actual House Price is: 13800000


In [27]:
# ±13% band around the actual price of test row 211 (not around a model prediction)
print(f'The property price in the desired location will range from Rs. {y_test.iloc[211] - 0.13*y_test.iloc[211]} to Rs. {y_test.iloc[211] + 0.13*y_test.iloc[211]}')

The property price in the desired location will range from Rs. 12006000.0 to Rs. 15594000.0
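
The range printed above is centred on the actual price of the test row. For an unseen listing only the prediction is available, so a range would have to be centred on that instead. A minimal sketch, assuming the same 13% margin and using a hypothetical helper `price_band`:

```python
def price_band(predicted_price: float, pct: float = 0.13) -> tuple:
    """Return (low, high) bounds at +/- pct around a predicted price (hypothetical helper)."""
    return predicted_price * (1 - pct), predicted_price * (1 + pct)

# XGBoost prediction for test row 211, taken from the output above
low, high = price_band(10_307_438)
print(f"Estimated range: Rs. {low:,.0f} to Rs. {high:,.0f}")
```

For this row the upper bound stays below the actual price of Rs. 13,800,000, which suggests that a fixed 13% margin does not always cover the model's error.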


**Let's analyze the regression results and determine which model to choose:

**Understanding the Metrics**

* **Mean Absolute Error (MAE):** The average absolute difference between predicted and actual values. Lower is better.
* **Mean Squared Error (MSE):** The average squared difference between predicted and actual values. Lower is better.
* **Root Mean Squared Error (RMSE):** The square root of MSE. Lower is better. It's in the same units as the target variable, making it easier to interpret.
* **R-squared (R²):** Represents the proportion of the variance in the dependent variable that is predictable from the independent variables. Higher is better (closer to 1).
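
As an illustration of these definitions, the four metrics can be computed by hand with NumPy. The prices below are made-up values, not rows from the dataset.

```python
import numpy as np

# Illustrative prices in rupees, not taken from the dataset
y_true = np.array([10_000_000, 6_900_000, 16_500_000, 7_000_000], dtype=float)
y_hat = np.array([9_000_000, 7_900_000, 14_500_000, 7_000_000], dtype=float)

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                    # average absolute error
mse = np.mean(errors ** 2)                       # average squared error
rmse = np.sqrt(mse)                              # back in rupees
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(f"MAE: {mae:,.0f}  MSE: {mse:,.0f}  RMSE: {rmse:,.0f}  R²: {r2:.3f}")
```

Because the errors are squared before averaging, the single 2,000,000 miss pulls RMSE above MAE. This is why the two metrics can rank models differently.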

**Analyzing the Results**

* **Decision Tree Regressor:**
    * MAE: 2244718.44
    * MSE: 18756180178776.59
    * RMSE: 4330840.59
    * R²: 0.826
* **Random Forest Regressor:**
    * MAE: 2000681.63
    * MSE: 13577140060177.94
    * RMSE: 3684717.09
    * R²: 0.874
* **Gradient Boosting Regressor:**
    * MAE: 2862601.47
    * MSE: 21364918456707.42
    * RMSE: 4622220.08
    * R²: 0.802
* **XGB Regressor:**
    * MAE: 2137504.75
    * MSE: 13522593382400.00
    * RMSE: 3677307.90
    * R²: 0.875
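
To make the comparison explicit, the reported metrics can be ranked programmatically. The sketch below copies the rounded values from the list above into a plain dictionary; nothing is recomputed from the data.

```python
# Metrics copied from the results listed above
results = {
    "Decision Tree":     {"mae": 2244718.44, "rmse": 4330840.59, "r2": 0.826},
    "Random Forest":     {"mae": 2000681.63, "rmse": 3684717.09, "r2": 0.874},
    "Gradient Boosting": {"mae": 2862601.47, "rmse": 4622220.08, "r2": 0.802},
    "XGBoost":           {"mae": 2137504.75, "rmse": 3677307.90, "r2": 0.875},
}

# Rank models from best to worst on each metric
by_rmse = sorted(results, key=lambda name: results[name]["rmse"])
by_mae = sorted(results, key=lambda name: results[name]["mae"])
by_r2 = sorted(results, key=lambda name: results[name]["r2"], reverse=True)

print("By RMSE:", by_rmse)
print("By MAE: ", by_mae)
print("By R²:  ", by_r2)
```

The RMSE and R² rankings agree, while the MAE ranking swaps the top two models.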

**Model Selection**

Based on the metrics:

* **XGB Regressor and Random Forest Regressor** are the top performers. They have the lowest RMSE and MSE, and the highest R-squared values.
* **XGB Regressor** has a slight edge with the lowest RMSE and the highest R-squared, while **Random Forest Regressor** has the lowest MAE.
* **Gradient Boosting Regressor** shows the worst performance of all four models on every metric (R² of 0.802).
* **Decision Tree Regressor** ranks third (R² of 0.826), ahead of Gradient Boosting despite being a single tree.

**Why Choose XGB Regressor?**

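1.  **High R-squared:**
    * An R² of 0.875 means XGBoost explains about 87.5% of the variance in price on the test split. This is the highest of the four models, though only marginally above Random Forest (0.874).
2.  **Low RMSE:**
    * The RMSE of 3677307.90 is the lowest of the four models. Because RMSE penalizes large errors more heavily, this points to fewer large misses; on MAE, Random Forest (2000681.63) is ahead of XGBoost (2137504.75).
3.  **Low MSE:**
    * The MSE of 13522593382400.00 is likewise the lowest. It is the square of the RMSE, so it ranks the models the same way rather than adding independent evidence.
4.  **Balance of Bias and Variance:**
    * XGBoost combines boosting with regularization, which helps control overfitting. This is a general property of the method; the notebook does not measure it directly.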
5.  **Performance and Speed:**
    * XGBoost is generally fast and efficient, even with large datasets.

**Recommendation:**

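* **XGB Regressor** is the best choice based on these metrics, with the lowest RMSE and the highest R² on this test split. Its margin over Random Forest is small (R² of 0.875 vs 0.874) and Random Forest has the lower MAE, so the two models are close on this data.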

**Important Note:**

* Always consider the context of your problem and the specific requirements of your application when choosing a model.
* Cross-validating the models gives a more reliable comparison than a single train/test split and helps reveal overfitting.
**
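
The cross-validation mentioned in the note can be sketched with scikit-learn's `cross_val_score`. The example below runs on synthetic data and uses a `RandomForestRegressor` as a stand-in model; the fold count, seed, and data are assumptions, and the same pattern should apply to the notebook's `X`, `y`, and any of its four models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the notebook's X and y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 5 shuffled folds with a fixed seed so the scores are repeatable
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# One R-squared score per fold; the spread shows how much a single split can vary
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(f"R-squared per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the gap between two models' mean scores is smaller than the standard deviation across folds, the data gives little basis for preferring one over the other.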