<a href="https://colab.research.google.com/github/shemanto27/End-to-End-BD-House-Price-Prediction/blob/main/BD_House_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step-0: Setup and Project Details 📜


## Project Details
- Title: Bangladesh House Price Prediction
- Problem Statement:

  The real estate market in Bangladesh is rapidly growing, yet pricing decisions often lack accurate data-driven insights. This project aims to predict house prices using machine learning techniques, helping stakeholders make informed decisions and promoting transparency in the housing market.
  
- Objectives:
  - Predict house prices based on features like location, size, number of rooms, etc.
  - Build a user-friendly interface or API for users to estimate house prices

- Target Audience:
  - Homebuyers
  - Real estate agents
  - Property developers
  - Government and urban planners

- Data Source:
  The data was collected from this [Kaggle dataset](https://www.kaggle.com/datasets/durjoychandrapaul/house-price-bangladesh).
- About the Dataset:

  This dataset contains property listings from various cities across Bangladesh, specifically including Dhaka, Chattogram, Cumilla, Narayanganj City, and Gazipur, with prices listed in Bangladeshi Taka (৳). The dataset provides valuable insights into various features of the properties, including the number of bedrooms, bathrooms, floor number, floor area in square feet, and their respective prices. The data has been collected from a real estate website, offering a comprehensive view of the housing market across these key cities in Bangladesh.

- Feature Description
  - Title: The title or description of the property listing.

  - Bedrooms: The number of bedrooms in the property.

  - Bathrooms: The number of bathrooms in the property.

  - Floor_no: The floor number on which the property is located.

  - Occupancy_status: Indicates whether the property is vacant or occupied.

  - Floor_area: The total floor area of the property in square feet.

  - City: The city where the property is located. This dataset includes listings from Dhaka, Chattogram, Cumilla, Narayanganj City, and Gazipur.

  - Price_in_taka: The listing price of the property in Bangladeshi Taka (৳).

  - Location: The specific location or address within the city.

## Plan:
- This is a **regression problem**.
- Regression algorithms to test: LinearRegression, SVR, DecisionTreeRegressor, RandomForestRegressor, KNeighborsRegressor, GradientBoostingRegressor, XGBoost
- Evaluation metrics: mean_squared_error, mean_absolute_error, r2_score
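
As a rough illustration of the three evaluation metrics listed in the plan, the sketch below computes them with scikit-learn on a handful of made-up prices. The arrays `y_true` and `y_pred` are hypothetical and do not come from the dataset or from any trained model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical prices in taka, invented purely for illustration
y_true = np.array([5_000_000.0, 7_500_000.0, 12_000_000.0, 3_700_000.0])
y_pred = np.array([5_500_000.0, 7_000_000.0, 11_000_000.0, 4_000_000.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error, in taka squared
mae = mean_absolute_error(y_true, y_pred)  # average absolute error, in taka
rmse = np.sqrt(mse)                        # square root of MSE, back in taka
r2 = r2_score(y_true, y_pred)              # share of variance explained by the predictions

print(f"MSE:  {mse:,.0f}")
print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R2:   {r2:.4f}")
```

MAE and RMSE are expressed in taka, which tends to make them easier to explain to the target audience than MSE.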

## Project setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set the Seaborn style
sns.set_style("darkgrid")  # Other options: "whitegrid", "dark", "white", "ticks"
sns.set_palette("viridis")  # Other options: "rocket", "mako", "flare", "magma"


In [None]:
# # Install MLflow and Pyngrok
# !pip install mlflow -q
# !pip install pyngrok -q

In [None]:
# import mlflow
# import subprocess
# from pyngrok import ngrok, conf
# import getpass

In [None]:
# # Define MLflow backend store URI with SQLite
# MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"
# # Start the MLflow UI as a background process using subprocess
# subprocess.Popen(["mlflow", "ui", "--backend-store-uri", MLFLOW_TRACKING_URI])

In [None]:
# # Set the tracking URI
# mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
# # Set or create an experiment. mlflow will create an experiment if it doesn't exist
# mlflow.set_experiment("BD House Price Prediction")

In [None]:
# # Configure ngrok and expose the MLflow UI
# print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
# conf.get_default().auth_token = getpass.getpass()
# # Expose the MLflow UI on port 5000
# port=5000
# public_url = ngrok.connect(port).public_url
# print(f' * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{port}\"')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/BD House Price Prediction/house_price_bd.csv')

# Step-1: Data Preprocessing 🧹
- Imputation (handle missing values), remove duplicates and outliers
- Data transformation: encoding (categorical → numeric), standardization or normalization
- Data splitting
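
One possible way to wire these steps together is a scikit-learn `ColumnTransformer`. The sketch below runs on a tiny made-up frame (`toy`) whose column names mirror the dataset; the values, the choice of scaler and encoder, and splitting before fitting are all assumptions for illustration, not the notebook's actual method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame that mimics the dataset's columns (values are illustrative only)
toy = pd.DataFrame({
    "Bedrooms": [3, 2, None, 4, 3, 2],
    "Floor_area": [1500, 800, 1080, None, 1400, 950],
    "City": ["dhaka", "chattogram", "gazipur", "dhaka", "chattogram", "cumilla"],
    "Price_in_taka": [7_500_000, 3_700_000, 4_320_000, 28_000_000, 8_500_000, 2_900_000],
})
X = toy.drop(columns="Price_in_taka")
y = toy["Price_in_taka"]

# Split first so that medians, scaling statistics and categories come from training rows only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=42)

# Numeric columns: median imputation followed by standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encoding; cities unseen during fit become all-zero rows
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["Bedrooms", "Floor_area"]),
    ("cat", categorical, ["City"]),
])

X_train_prepared = preprocess.fit_transform(X_train)  # learn statistics on the training split
X_test_prepared = preprocess.transform(X_test)        # reuse them on the held-out split

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformer on the training split only keeps information from the test rows out of the imputed and scaled values.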

In [None]:
df.sample(10)

Unnamed: 0,Title,Bedrooms,Bathrooms,Floor_no,Occupancy_status,Floor_area,City,Price_in_taka,Location
3472,"5 Katha Plot In Bproperty Village, Narayanganj...",,,,vacant,3600.0,narayanganj-city,"৳7,625,000","Rupganj, Narayanganj"
2772,Apartment Of 1500 Sq Ft For Sale In Hamjarbag,2.0,2.0,3.0,vacant,1500.0,chattogram,"৳6,750,000","Hamjarbag, 7 No. West Sholoshohor Ward"
3840,"Built With Modern Amenities, Check This 1080 S...",3.0,2.0,5.0,vacant,1080.0,gazipur,"৳4,320,000","Joydebpur, Gazipur Sadar Upazila"
2340,Great News! In The Suitable Location Of Sholok...,3.0,2.0,7.0,vacant,1000.0,chattogram,"৳5,500,000","Badurtala, Sholokbahar"
2440,Ready 1550 Sq Ft Flat Is Now For Sale In Doubl...,3.0,3.0,2.0,vacant,1550.0,chattogram,"৳9,335,000","South Agrabad, Double Mooring"
2077,1400 Sq Ft Flat Is Up For Sale At 7 No. West S...,3.0,3.0,1.0,vacant,1400.0,chattogram,"৳8,500,000","Mohammad Pur Road, 7 No. West Sholoshohor Ward"
1653,Look At This Nice 2200 Sq Ft Flat Is Up For Sa...,3.0,3.0,8.0,vacant,2200.0,chattogram,"৳12,000,000","Lake Valley R/A, 9 No. North Pahartali Ward"
1729,A Nice And Comfortable 1750 Sq Ft Flat Is Up F...,3.0,3.0,3.0,vacant,1750.0,chattogram,"৳7,875,000","Hamjarbag, 7 No. West Sholoshohor Ward"
571,"Picture Yourself, Residing In This Well Constr...",4.0,5.0,2.0,vacant,2258.0,dhaka,"৳28,000,000","Sector 13, Uttara"
390,"In the location of Mirpur, very close to Medic...",2.0,2.0,5.0,vacant,800.0,dhaka,"৳3,700,000","Section 6, Mirpur"


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3865 entries, 0 to 3864
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Title             3865 non-null   object 
 1   Bedrooms          2864 non-null   float64
 2   Bathrooms         2864 non-null   float64
 3   Floor_no          3181 non-null   object 
 4   Occupancy_status  3766 non-null   object 
 5   Floor_area        3766 non-null   float64
 6   City              3865 non-null   object 
 7   Price_in_taka     3865 non-null   object 
 8   Location          3859 non-null   object 
dtypes: float64(3), object(6)
memory usage: 271.9+ KB


In [None]:
df['City'].unique()

array(['dhaka', 'chattogram', 'cumilla', 'narayanganj-city', 'gazipur'],
      dtype=object)

In [None]:
df['Occupancy_status'].unique()

array(['vacant', 'occupied', nan], dtype=object)

In [None]:
df.isnull().sum()

Unnamed: 0,0
Title,0
Bedrooms,1001
Bathrooms,1001
Floor_no,684
Occupancy_status,99
Floor_area,99
City,0
Price_in_taka,0
Location,6


In [None]:
df.duplicated().sum()

934

In [None]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

In [None]:
df.drop(['Title','Occupancy_status'], axis=1, inplace=True)

In [None]:
df.head()

Unnamed: 0,Bedrooms,Bathrooms,Floor_no,Floor_area,City,Price_in_taka,Location
0,3.0,4.0,3,1960.0,dhaka,"৳39,000,000","Gulshan 1, Gulshan"
1,3.0,3.0,1,1705.0,dhaka,"৳16,900,000","Lake Circus Road, Kalabagan"
2,3.0,3.0,6,1370.0,dhaka,"৳12,500,000","Shukrabad, Dhanmondi"
3,3.0,3.0,4,2125.0,dhaka,"৳20,000,000","Block L, Bashundhara R-A"
4,3.0,3.0,4,2687.0,dhaka,"৳47,500,000","Road No 25, Banani"


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2931 entries, 0 to 3862
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Bedrooms       2100 non-null   float64
 1   Bathrooms      2100 non-null   float64
 2   Floor_no       2356 non-null   object 
 3   Floor_area     2842 non-null   float64
 4   City           2931 non-null   object 
 5   Price_in_taka  2931 non-null   object 
 6   Location       2925 non-null   object 
dtypes: float64(3), object(4)
memory usage: 183.2+ KB


In [None]:
df.isnull().sum()

Unnamed: 0,0
Bedrooms,831
Bathrooms,831
Floor_no,575
Floor_area,89
City,0
Price_in_taka,0
Location,6


In [None]:
df.isnull().mean() * 100

Unnamed: 0,0
Bedrooms,28.352098
Bathrooms,28.352098
Floor_no,19.617878
Floor_area,3.036506
City,0.0
Price_in_taka,0.0
Location,0.204708


In [None]:
df['Price_in_taka'] = df['Price_in_taka'].replace('[৳,]', '', regex=True)

In [None]:
df['Price_in_taka'].sample()

Unnamed: 0,Price_in_taka
2907,6680000
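
The price cleanup can be checked in isolation. This is a minimal sketch on two hand-typed strings in the same format as the listing prices; the series `prices` is hypothetical, and casting to `int64` explicitly is a suggestion rather than what the notebook does.

```python
import pandas as pd

# Hand-typed examples in the same format as the Price_in_taka column
prices = pd.Series(["৳7,625,000", "৳12,000,000"])

# Remove the currency sign and the thousands separators, then cast to a 64-bit integer
cleaned = prices.str.replace(r"[৳,]", "", regex=True).astype("int64")

print(cleaned.tolist())
```

Naming `int64` explicitly keeps the integer width the same on every platform, whereas plain `int` can map to a 32-bit type on some setups.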


In [None]:
df[df['Floor_no'].str.contains('th ', case=False, na=False)]

Unnamed: 0,Bedrooms,Bathrooms,Floor_no,Floor_area,City,Price_in_taka,Location
2674,3.0,3.0,4th to 8th Backside,1250.0,chattogram,3800000,"Dakshin Kattali, 11 No. South Kattali Ward"
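
The row above holds a floor range rather than a single number, so the way it is cleaned matters. Deleting every non-digit character joins 4 and 8 into 48, whereas extracting the first number keeps 4. A minimal sketch on made-up values; which floor best represents a range is a judgment call, assumed here to be the lowest one.

```python
import pandas as pd

# Made-up Floor_no values: a plain number, a range, and a missing entry
floors = pd.Series(["3", "4th to 8th Backside", None])

# Option A: delete every character that is not a digit or a dot
stripped = pd.to_numeric(
    floors.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)

# Option B: keep only the first run of digits found in each entry
first_number = pd.to_numeric(
    floors.str.extract(r"(\d+)", expand=False), errors="coerce"
)

print(pd.DataFrame({"raw": floors, "stripped": stripped, "first_number": first_number}))
```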


In [None]:
# Keep only the first number in each entry. Stripping all non-digit characters would
# turn a range such as "4th to 8th Backside" into 48; taking the lower floor is an assumption.
df['Floor_no'] = df['Floor_no'].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False)
df['Floor_no'] = pd.to_numeric(df['Floor_no'], errors='coerce')

In [None]:
df['Price_in_taka'] = df['Price_in_taka'].astype(int)
df['City'] = df['City'].astype(str)
df['Floor_no'] = df['Floor_no'].astype(np.float64)

In [None]:
# mean_Bedrooms = df['Bedrooms'].mean()
# mean_Bathrooms = df['Bathrooms'].mean()
# mean_Floor_no = df['Floor_no'].mean()
# mean_Floor_area = df['Floor_area'].mean()

In [None]:
# median_Bedrooms = df['Bedrooms'].median()
# median_Bathrooms = df['Bathrooms'].median()
# median_Floor_no = df['Floor_no'].median()
# median_Floor_area = df['Floor_area'].median()

In [None]:
# df['Bedrooms_mean'] = df['Bedrooms'].fillna(mean_Bedrooms)
# df['Bathrooms_mean'] = df['Bathrooms'].fillna(mean_Bathrooms)
# df['Floor_no_mean'] = df['Floor_no'].fillna(mean_Floor_no)
# df['Floor_area_mean'] = df['Floor_area'].fillna(mean_Floor_area)

In [None]:
# df['Bedrooms_median'] = df['Bedrooms'].fillna(median_Bedrooms)
# df['Bathrooms_median'] = df['Bathrooms'].fillna(median_Bathrooms)
# df['Floor_no_median'] = df['Floor_no'].fillna(median_Floor_no)
# df['Floor_area_median'] = df['Floor_area'].fillna(median_Floor_area)

In [None]:
# # kernel density estimate (KDE) plot: visualize the distribution of data in a dataset

# plt.figure(figsize=(10, 8))

# # Create subplots(ROW,COL,INDEX)
# plt.subplot(2, 2, 1)
# sns.kdeplot(df['Bedrooms'], label='Bedrooms')
# sns.kdeplot(df['Bedrooms_median'], label='Bedrooms_median')
# sns.kdeplot(df['Bedrooms_mean'], label='Bedrooms_mean')
# plt.legend()

# plt.subplot(2, 2, 2)
# sns.kdeplot(df['Bathrooms'], label='Bathrooms')
# sns.kdeplot(df['Bathrooms_median'], label='Bathrooms_median')
# sns.kdeplot(df['Bathrooms_mean'], label='Bathrooms_mean')
# plt.legend()

# plt.subplot(2, 2, 3)
# sns.kdeplot(df['Floor_no'], label='Floor_no')
# sns.kdeplot(df['Floor_no_median'], label='Floor_no_median')
# sns.kdeplot(df['Floor_no_mean'], label='Floor_no_mean')
# plt.legend()

# plt.subplot(2, 2, 4)
# sns.kdeplot(df['Floor_area'], label='Floor_area')
# sns.kdeplot(df['Floor_area_median'], label='Floor_area_median')
# sns.kdeplot(df['Floor_area_mean'], label='Floor_area_mean')
# plt.legend()


# plt.tight_layout()
# plt.show()

In [None]:
# print('Original variable variance: ', df['Bedrooms'].var())
# print('Variance after median imputation: ', df['Bedrooms_median'].var())
# print('Variance after mean imputation: ', df['Bedrooms_mean'].var())
# print("")

# print('Original variable variance: ', df['Bathrooms'].var())
# print('Variance after median imputation: ',df['Bathrooms_median'].var())
# print('Variance after mean imputation: ', df['Bathrooms_mean'].var())
# print("")

# print('Original variable variance: ', df['Floor_no'].var())
# print('Variance after median imputation: ', df['Floor_no_median'].var())
# print('Variance after mean imputation: ', df['Floor_no_mean'].var())
# print("")

# print('Original variable variance: ', df['Floor_area'].var())
# print('Variance after median imputation: ',df['Floor_area_median'].var())
# print('Variance after mean imputation: ', df['Floor_area_mean'].var())
# print("")


In [None]:
df.dropna(subset=['Location'], inplace=True)

In [None]:
from sklearn.impute import SimpleImputer

imputer_median = SimpleImputer(strategy='median')
imputer_mean = SimpleImputer(strategy='mean')


In [None]:
# Fill the remaining gaps in each numeric column with that column's median
num_cols = ['Bedrooms', 'Bathrooms', 'Floor_no', 'Floor_area']
df[num_cols] = imputer_median.fit_transform(df[num_cols])

In [None]:
df.isnull().sum()

Unnamed: 0,0
Bedrooms,0
Bathrooms,0
Floor_no,0
Floor_area,0
City,0
Price_in_taka,0
Location,0
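
Filling roughly 28% of a column with one constant narrows its spread, which is worth keeping in mind when reading later plots and model results. The sketch below illustrates the effect on synthetic data; the random values and the 28% missing rate are assumptions chosen to resemble the Bedrooms column, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic bedroom counts between 1 and 5, with about 28% of entries blanked out
bedrooms = pd.Series(rng.integers(1, 6, size=1000).astype(float))
bedrooms[rng.random(1000) < 0.28] = np.nan

# Median imputation: every missing entry receives the same value
imputer = SimpleImputer(strategy="median")
filled = pd.Series(imputer.fit_transform(bedrooms.to_frame()).ravel())

var_before = bedrooms.var()  # pandas skips NaN by default
var_after = filled.var()

print(f"Variance before imputation: {var_before:.3f}")
print(f"Variance after imputation:  {var_after:.3f}")
```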


## 🔍 Observations & Approach:
- Title and Occupancy_status are of no use for prediction --> drop the columns ✅
- NaN (missing values) appears in most features. Bedrooms and Bathrooms each have 831 missing values (about 28% of the rows after deduplication), too many to drop --> impute
- Most features are stored as objects; numeric ones such as Floor_no and Price_in_taka --> convert to numeric
- Bedrooms, Bathrooms and Floor_area are stored as floats although the values shown are whole numbers --> convert to integer (the NaN values must be handled first)
- The price column contains the special characters ৳ and , --> strip them ✅
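
The integer conversion mentioned above fails while NaN is still present, which is why the missing values have to be handled first. A minimal sketch on a made-up series; the nullable `Int64` dtype shown at the end is an alternative not used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up bedroom counts with one missing entry
bedrooms = pd.Series([3.0, np.nan, 2.0, 3.0])

# A plain integer cast cannot represent NaN, so pandas raises an error
try:
    bedrooms.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Option A: fill the gap first (median here), then cast
as_int = bedrooms.fillna(bedrooms.median()).astype("int64")

# Option B: the nullable integer dtype keeps the gap as <NA>
as_nullable = bedrooms.astype("Int64")

print(cast_failed)
print(as_int.tolist())
print(as_nullable.tolist())
```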

