**Project- Part A: Airbnb Price Prediction and Insights**
**1. Overview**
Airbnb provides a platform for property owners to rent out their spaces to travelers. Pricing a
listing effectively is critical for maximizing revenue while staying competitive in the market.
For hosts, understanding what factors influence the price of their listings is essential.
This project aims to build a machine learning model to predict the price of Airbnb listings
based on various features such as property type, room type, location, amenities, and host
characteristics. By analyzing these factors, this project will provide actionable insights to
Airbnb hosts to optimize their listing prices.
**2. Problem Statement**
The primary objective of this project is to develop a regression model that predicts the price
of an Airbnb listing. Using features such as property type, room type, number of reviews,
location, and amenities, the model will aim to estimate the price accurately.
The insights derived from this analysis will help Airbnb hosts understand the key drivers of
price, enabling them to make data-driven decisions for pricing their properties. Additionally,
the project will help Airbnb refine its recommendations for pricing to improve host and guest
satisfaction.


**Part 1: Data Exploration and Preprocessing**

In [3]:
#Step 1 : Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import datetime
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Step 2 : Load the Dataset

df = pd.read_csv('/content/Airbnb_data - airbnb_data.csv')

#Look at the dataset

print(df.shape)
df.head()

(11684, 29)


Unnamed: 0,id,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,...,latitude,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,thumbnail_url,zipcode,bedrooms,beds
0,6901257,5.010635,Apartment,Entire home/apt,"{""Wireless Internet"",""Air conditioning"",Kitche...",3,1.0,Real Bed,strict,True,...,40.696524,-73.991617,Beautiful brownstone 1-bedroom,Brooklyn Heights,2.0,100.0,https://a0.muscache.com/im/pictures/6d7cbbf7-c...,11201,1.0,1.0
1,6304928,5.129899,Apartment,Entire home/apt,"{""Wireless Internet"",""Air conditioning"",Kitche...",7,1.0,Real Bed,strict,True,...,40.766115,-73.98904,Superb 3BR Apt Located Near Times Square,Hell's Kitchen,6.0,93.0,https://a0.muscache.com/im/pictures/348a55fe-4...,10019,3.0,3.0
2,7919400,4.976734,Apartment,Entire home/apt,"{TV,""Cable TV"",""Wireless Internet"",""Air condit...",5,1.0,Real Bed,moderate,True,...,40.80811,-73.943756,The Garden Oasis,Harlem,10.0,92.0,https://a0.muscache.com/im/pictures/6fae5362-9...,10027,1.0,3.0
3,13418779,6.620073,House,Entire home/apt,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",4,1.0,Real Bed,flexible,True,...,37.772004,-122.431619,Beautiful Flat in the Heart of SF!,Lower Haight,0.0,,https://a0.muscache.com/im/pictures/72208dad-9...,94117,2.0,2.0
4,3808709,4.744932,Apartment,Entire home/apt,"{TV,Internet,""Wireless Internet"",""Air conditio...",2,1.0,Real Bed,moderate,True,...,38.925627,-77.034596,Great studio in midtown DC,Columbia Heights,4.0,40.0,,20009,0.0,1.0


In [5]:
# step 3 : Basic Data Info

df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11684 entries, 0 to 11683
Data columns (total 29 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      11684 non-null  int64  
 1   log_price               11684 non-null  float64
 2   property_type           11684 non-null  object 
 3   room_type               11684 non-null  object 
 4   amenities               11684 non-null  object 
 5   accommodates            11684 non-null  int64  
 6   bathrooms               11649 non-null  float64
 7   bed_type                11684 non-null  object 
 8   cancellation_policy     11684 non-null  object 
 9   cleaning_fee            11684 non-null  bool   
 10  city                    11684 non-null  object 
 11  description             11684 non-null  object 
 12  first_review            9152 non-null   object 
 13  host_has_profile_pic    11650 non-null  object 
 14  host_identity_verified  11650 non-null

Unnamed: 0,id,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,...,latitude,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,thumbnail_url,zipcode,bedrooms,beds
count,11684.0,11684.0,11684,11684,11684,11684.0,11649.0,11684,11684,11684,...,11683.0,11683.0,11683,10620,11683.0,9008.0,10359,11526.0,11673.0,11666.0
unique,,,26,3,11302,,,5,5,2,...,,,11660,505,,,10359,538.0,,
top,,,Apartment,Entire home/apt,{},,,Real Bed,strict,True,...,,,Cozy Corner,Williamsburg,,,https://a0.muscache.com/im/pictures/74187103/7...,11211.0,,
freq,,,7729,6518,101,,,11356,5106,8613,...,,,2,466,,,1,265.0,,
mean,11283580.0,4.77952,,,,3.137367,1.227917,,,,...,38.457386,-92.478067,,,21.020885,94.021092,,,1.251863,1.70204
std,6093213.0,0.715266,,,,2.13028,0.571162,,,,...,3.070548,21.762602,,,38.334355,8.094685,,,0.829106,1.235574
min,3362.0,0.0,,,,1.0,0.0,,,,...,33.339327,-122.508663,,,0.0,20.0,,,0.0,1.0
25%,6262226.0,4.317488,,,,2.0,1.0,,,,...,34.132382,-118.343465,,,1.0,92.0,,,1.0,1.0
50%,12347860.0,4.70048,,,,2.0,1.0,,,,...,40.664879,-76.99508,,,6.0,96.0,,,1.0,1.0
75%,16426980.0,5.192957,,,,4.0,1.0,,,,...,40.745799,-73.954055,,,24.0,100.0,,,1.0,2.0


In [6]:
# Step 4 : Check Missing Vlaues

print(df.isnull().sum())
#percentage of missing values

missing = df.isnull().mean().sort_values(ascending=False)*100
missing[missing>0]

id                           0
log_price                    0
property_type                0
room_type                    0
amenities                    0
accommodates                 0
bathrooms                   35
bed_type                     0
cancellation_policy          0
cleaning_fee                 0
city                         0
description                  0
first_review              2532
host_has_profile_pic        34
host_identity_verified      34
host_response_rate        2932
host_since                  34
instant_bookable             1
last_review               2526
latitude                     1
longitude                    1
name                         1
neighbourhood             1064
number_of_reviews            1
review_scores_rating      2676
thumbnail_url             1325
zipcode                    158
bedrooms                    11
beds                        18
dtype: int64


Unnamed: 0,0
host_response_rate,25.094146
review_scores_rating,22.903115
first_review,21.670661
last_review,21.619308
thumbnail_url,11.340294
neighbourhood,9.10647
zipcode,1.352277
bathrooms,0.299555
host_identity_verified,0.290996
host_has_profile_pic,0.290996


In [7]:
# step 5 :Data Cleaning Steps

#Drop Columns that are not useful or have too many missing values

df.drop(columns=['id','description','thumbnail_url','first_review','last_review','host_since','zipcode','name'],inplace=True)

#Fill missing numeric values

df['bathrooms'].fillna(df['bathrooms'].mean(),inplace=True)
df['bedrooms'].fillna(df['bedrooms'].median(),inplace=True)
df['beds'].fillna(df['beds'].median(),inplace=True)
df['review_scores_rating'].fillna(df['review_scores_rating'].median(),inplace=True)


#fill missing categorical vlaues

for col in ['neighbourhood','host_response_rate','host_has_profile_pic','host_identity_verified']:
  df[col].fillna(df[col].mode()[0],inplace=True)

#Count number of amenities

df['amenities_count'] = df['amenities'].apply(lambda x: len(str(x).split(',')))
df.drop(columns=['amenities'],inplace=True)


#quick look on data and basic data info

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11684 entries, 0 to 11683
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   log_price               11684 non-null  float64
 1   property_type           11684 non-null  object 
 2   room_type               11684 non-null  object 
 3   accommodates            11684 non-null  int64  
 4   bathrooms               11684 non-null  float64
 5   bed_type                11684 non-null  object 
 6   cancellation_policy     11684 non-null  object 
 7   cleaning_fee            11684 non-null  bool   
 8   city                    11684 non-null  object 
 9   host_has_profile_pic    11684 non-null  object 
 10  host_identity_verified  11684 non-null  object 
 11  host_response_rate      11684 non-null  object 
 12  instant_bookable        11683 non-null  object 
 13  latitude                11683 non-null  float64
 14  longitude               11683 non-null

Unnamed: 0,log_price,property_type,room_type,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,host_has_profile_pic,...,host_response_rate,instant_bookable,latitude,longitude,neighbourhood,number_of_reviews,review_scores_rating,bedrooms,beds,amenities_count
0,5.010635,Apartment,Entire home/apt,3,1.0,Real Bed,strict,True,NYC,t,...,100%,f,40.696524,-73.991617,Brooklyn Heights,2.0,100.0,1.0,1.0,9
1,5.129899,Apartment,Entire home/apt,7,1.0,Real Bed,strict,True,NYC,t,...,100%,t,40.766115,-73.98904,Hell's Kitchen,6.0,93.0,3.0,3.0,15
2,4.976734,Apartment,Entire home/apt,5,1.0,Real Bed,moderate,True,NYC,t,...,100%,t,40.80811,-73.943756,Harlem,10.0,92.0,1.0,3.0,19
3,6.620073,House,Entire home/apt,4,1.0,Real Bed,flexible,True,SF,t,...,100%,f,37.772004,-122.431619,Lower Haight,0.0,96.0,2.0,2.0,15
4,4.744932,Apartment,Entire home/apt,2,1.0,Real Bed,moderate,True,DC,t,...,100%,t,38.925627,-77.034596,Columbia Heights,4.0,40.0,0.0,1.0,12


In [8]:
# Step 6 : Encode categorical variable

categorical_columns = ['property_type','room_type','cancellation_policy','city','bed_type','neighbourhood']

df_encoded = pd.get_dummies(df,columns = categorical_columns,drop_first = True)

#convert boolean columns to binary

boolean_columns = ['cleaning_fee','host_has_profile_pic','host_identity_verified','instant_bookable']

for col in boolean_columns:
    df_encoded[col] = df_encoded[col].map({'t': 1, 'f': 0})

# Handle "host_response_rate" conver "100%" To 1.0

df_encoded['host_response_rate'] = df_encoded['host_response_rate'].str.replace('%','').astype(float)/100.0


In [9]:
# Step 7 : Prepare Features & Target

from sklearn.model_selection import train_test_split

x = df_encoded.drop ('log_price',axis=1)
y = df_encoded['log_price']

# split data

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 42)

In [10]:
# Step 8 : Train a Random Forest Regression Model

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

model = RandomForestRegressor(n_estimators = 100,random_state =42)

model.fit(x_train,y_train)

y_pred = model.predict(x_test)

In [11]:
# Step 9 : Evaluate Model Performance

rmse = np.sqrt(mean_squared_error(y_test,y_pred))
mae = mean_absolute_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

print('RMSE:',rmse)
print('MAE:',mae)
print('R2:',r2)


RMSE: 0.4177302554159272
MAE: 0.3063734562420839
R2: 0.6639861537981037


**Acceptable Performance Metrics**
*     
Achieved via RMSE, MAE, and R² metrics from your Random Forest model.
RMSE should be low indicates less prediction error, R² should be high like close to 1.

