## Boston House Price Prediction App



#### Business Understanding And Project Overview

- The goal of this project is to develop a Boston House Price Prediction App using scraped real estate data from Realtor.com. 

- The app will help potential homebuyers, sellers, and real estate agents estimate property prices based on key features such as location, square footage, number of bedrooms, and other relevant factors. 

- By analyzing pricing trends and influential factors, the tool will provide data-driven insights for informed decision-making, ensuring fair market valuations. 

- Success will be measured by model accuracy (R² > 0.85) and user adoption, with potential expansion to other markets. 

- The project follows the CRISP-DM methodology, starting with business objectives before progressing to data analysis, modeling, and deployment.

#### Analytical Questions

1. What are the key factors influencing house prices in Boston?

2. How does location (neighborhood/zip code) affect pricing trends?

3. Are there seasonal trends in Boston’s real estate market?

4. How does the number of bedrooms and bathrooms impact price?

5. What is the price per square foot distribution across different areas?

6. Can we predict price outliers (luxury vs. budget homes)?

7. How well do different ML models (Linear Regression, Random Forest, XGBoost) perform in predicting prices?

In [2]:
## Libraries required for this project
import pandas as pd

In [3]:
## Data Loading
listings = pd.read_csv("properties_regex.csv")

In [None]:
## Checking the first 10 rows of our data
listings.head(10)

,name,price,url,image,bedrooms,sqft,address,city,state,zip
0,"300 Summer St Apt 62, Boston, MA 02210",523793,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/c992713d11f1c3fdd375bbed...,2.0,1557.0,300 Summer St Apt 62,Boston,MA,2210
1,"25 Addington Rd, Boston, MA 02132",685000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/5fdedb8d9f4636980353b729...,4.0,1479.0,25 Addington Rd,Boston,MA,2132
2,"39 Hancock St, Boston, MA 02114",5995000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/0711ad284ccdd5cc04f54cf3...,5.0,5177.0,39 Hancock St,Boston,MA,2114
3,"107 West St, Boston, MA 02136",470000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/fcd3f7febefe8494afbdebe2...,4.0,2156.0,107 West St,Boston,MA,2136
4,"34-36 Juniper St Unit 3, Roxbury, MA 02119",579000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/0543eb621484bfe5fe80586c...,2.0,1240.0,34-36 Juniper St Unit 3,Roxbury,MA,2119
5,"79-81 Rossmore Rd Unit 3, Boston, MA 02130",749000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/60a83e077ee7cdb1c96bc4a7...,3.0,1146.0,79-81 Rossmore Rd Unit 3,Boston,MA,2130
6,"99 Brookley Rd Apt 3, Boston, MA 02130",949000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/018ec779439c5d836488494d...,3.0,1372.0,99 Brookley Rd Apt 3,Boston,MA,2130
7,"22 Alaska St, Boston, MA 02119",1175000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/125255acec84400f8a3bb0b1...,6.0,3018.0,22 Alaska St,Boston,MA,2119
8,"78 Park St, Boston, MA 02132",999000,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/d9ecb16358c6980f22de7a4e...,8.0,3674.0,78 Park St,Boston,MA,2132
9,"132 Chelsea St, Boston, MA 02128",1199900,https://www.realtor.com/realestateandhomes-det...,https://ap.rdcpix.com/9ee635442e8e2f966b871dab...,8.0,2800.0,132 Chelsea St,Boston,MA,2128
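
The `zip` column in the preview shows four-digit values such as 2210, while the `name` column gives 02210. This is what happens when pandas infers an integer type for ZIP codes and drops the leading zero. The sketch below shows two possible ways to keep the five-digit form. The in-memory CSV text is an assumption about how the file might look, built from two addresses in the preview, and `sample_csv`, `as_int`, `as_str`, and `restored` are illustrative names.

```python
import io
import pandas as pd

# Two addresses copied from the preview, written as they might appear in a CSV.
# The real layout of properties_regex.csv is an assumption here.
sample_csv = io.StringIO(
    "address,city,state,zip\n"
    "300 Summer St Apt 62,Boston,MA,02210\n"
    "25 Addington Rd,Boston,MA,02132\n"
)

# Default parsing infers an integer column, which drops the leading zero.
as_int = pd.read_csv(sample_csv)

# Rewind the buffer and read the same text with zip forced to a string.
sample_csv.seek(0)
as_str = pd.read_csv(sample_csv, dtype={"zip": str})

# If the zeros are already gone, they can be restored by left-padding to 5 digits.
restored = as_int["zip"].astype(str).str.zfill(5)

print(as_int["zip"].tolist())
print(as_str["zip"].tolist())
print(restored.tolist())
```

Treating ZIP codes as strings also keeps them out of numeric summaries, where a mean or standard deviation of ZIP codes has no useful meaning.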


### Data Cleaning

In [4]:
## checking for missing values
listings.isnull().sum()

name        0
price       0
url         0
image       0
bedrooms    5
sqft        5
address     0
city        0
state       0
zip         0
dtype: int64

In [7]:
### checking for duplicates
listings.duplicated().value_counts()

False    336
Name: count, dtype: int64
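
`listings.duplicated()` only flags rows that match in every column. With scraped data, the same property could in principle appear twice with a small difference in one field, and a full-row check would not catch that. As a minimal sketch, a second check on a single key column could look like the following. The helper `count_repeated_keys` and the toy frame are hypothetical. The first toy row comes from the preview, and the second is made up for illustration only.

```python
import pandas as pd

def count_repeated_keys(df: pd.DataFrame, key: str = "name") -> int:
    """Count rows whose `key` value already appeared in an earlier row."""
    return int(df.duplicated(subset=[key]).sum())

# Toy frame: the first row is from the preview; the second repeats the same
# address with a made-up price, purely to show a case full-row matching misses.
toy = pd.DataFrame({
    "name": [
        "25 Addington Rd, Boston, MA 02132",
        "25 Addington Rd, Boston, MA 02132",
    ],
    "price": [685000, 690000],  # 690000 is hypothetical
})

full_row_dupes = int(toy.duplicated().sum())   # compares every column
key_dupes = count_repeated_keys(toy, "name")   # compares the address only

print(full_row_dupes, key_dupes)
```

On the real data this would be called as `count_repeated_keys(listings, "name")`. Whether `name` or `url` is the better key depends on how the scraper builds those fields.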

In [None]:
## checking for the data types of the various columns
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      336 non-null    object 
 1   price     336 non-null    int64  
 2   url       336 non-null    object 
 3   image     336 non-null    object 
 4   bedrooms  331 non-null    float64
 5   sqft      331 non-null    float64
 6   address   336 non-null    object 
 7   city      336 non-null    object 
 8   state     336 non-null    object 
 9   zip       336 non-null    int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 26.4+ KB


In [None]:

## checking the statistical summary of the listing data
listings.describe()

,price,bedrooms,sqft,zip
count,336.0,331.0,331.0,336.0
mean,2165353.0,3.145015,2226.05136,2128.604167
std,5198853.0,2.162378,5558.444337,26.30387
min,40000.0,0.0,175.0,2108.0
25%,599974.2,2.0,893.5,2116.0
50%,849000.0,3.0,1297.0,2127.0
75%,1398000.0,4.0,2391.5,2131.0
max,65000000.0,16.0,98205.0,2467.0


Insights and Observations
- There are 5 missing values in each of `bedrooms` and `sqft`. We will fill them with the column mean, which keeps the structure of the data intact. Dropping the rows is not a good option because the dataset is already small.
- We have 336 rows of data, and none of them are duplicates.
- The data has 10 columns.
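
The mean-fill step described above can be sketched as follows. The helper `fill_missing` is a hypothetical name, not part of the project code. The demo frame uses three `bedrooms` and `sqft` pairs from the preview plus one missing row added for illustration. A `median` option is included as an assumption, since the median is less sensitive to extreme values.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, columns, strategy: str = "mean") -> pd.DataFrame:
    """Return a copy of df with NaNs in `columns` replaced by the column mean or median."""
    if strategy not in {"mean", "median"}:
        raise ValueError("strategy must be 'mean' or 'median'")
    out = df.copy()
    for col in columns:
        # pandas skips NaN values when computing mean() and median().
        fill_value = out[col].mean() if strategy == "mean" else out[col].median()
        out[col] = out[col].fillna(fill_value)
    return out

# First three rows are from the preview; the last row is a missing entry
# inserted for illustration.
demo = pd.DataFrame({
    "bedrooms": [2.0, 4.0, 5.0, None],
    "sqft": [1557.0, 1479.0, 5177.0, None],
})

filled = fill_missing(demo, ["bedrooms", "sqft"], strategy="mean")
print(filled)
```

On the real data the call would be `fill_missing(listings, ["bedrooms", "sqft"])`. The function returns a copy, so the original frame stays available for comparison.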

From the statistical summary, we can see some insights as well
-  High price variability: the mean price is about 2.17M, while the maximum is 65M.
-  Suspiciously high `sqft` outliers: the mean is about 2,200 sqft, while the maximum is about 98,000 sqft.
-  Bedrooms range from 0 to 16. A 16-bedroom listing could be a mansion or a data error.
- The data is not normally distributed. For `price` and `sqft` the mean is far above the median (about 2.17M vs 849K for price, and about 2,226 vs 1,297 for sqft). This means the data is right skewed: a few very large entries pull the distribution to the right. We will plot a boxplot to visualize the distribution in the coming steps.
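
The mean-versus-median comparison can also be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `skew_summary`. It runs on the ten `price` and `sqft` values shown in the preview, not on the full dataset, so the numbers it prints are not the project's results.

```python
import pandas as pd

def skew_summary(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Tabulate mean, median, and sample skewness for the given numeric columns."""
    return pd.DataFrame({
        "mean": df[columns].mean(),
        "median": df[columns].median(),
        "skew": df[columns].skew(),  # positive values indicate a longer right tail
    })

# The ten price and sqft values from the preview rows.
preview = pd.DataFrame({
    "price": [523793, 685000, 5995000, 470000, 579000,
              749000, 949000, 1175000, 999000, 1199900],
    "sqft": [1557.0, 1479.0, 5177.0, 2156.0, 1240.0,
             1146.0, 1372.0, 3018.0, 3674.0, 2800.0],
})

summary = skew_summary(preview, ["price", "sqft"])
print(summary)
```

A mean above the median together with a positive skew value is consistent with a right-skewed distribution. On the full data the call would be `skew_summary(listings, ["price", "bedrooms", "sqft"])`.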

