# Airbnb NYC Rental Price Prediction
## 1. Domain Understanding
As of August 2019, the data set contains almost 50 thousand Airbnb listings in NYC. The purpose of this task is to predict the price of NYC Airbnb rentals based on the data provided and any external dataset(s) with relevant information.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#loading the data
#checking its shape to get an idea of its size
df = pd.read_csv('AB_NYC_2019.csv')
df.shape

(48895, 16)

## 2. Data Exploration and Collection

In [3]:
#checking type of all columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [4]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


id and name

In [4]:
#checking summary statistics of all numerical columns
df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


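Most numerical columns look plausible, but not all: price ranges from 0 to 10,000 against a median of 106, and minimum_nights reaches 1,250 against a median of 3. These extremes are outliers and should be kept in mind when modelling.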
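
Whether a column contains outliers can be checked with the 1.5 × IQR rule. The sketch below applies it to price, using the quartiles printed by df.describe() above. The factor 1.5 is a common convention and an assumption here, not something the data set prescribes.

```python
# Quartiles and maximum of price, copied from the df.describe() output above
q1, q3, max_price = 69.0, 175.0, 10000.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers

print(upper_fence)             # 334.0
print(max_price > upper_fence) # True
```

On the full data, `(df['price'] > upper_fence).sum()` would count how many listings fall above the fence.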

In [5]:
#checking the unique values of neighbourhood_group for unwanted characters
df['neighbourhood_group'].unique()

array(['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'],
      dtype=object)

In [6]:
#checking the unique values of neighbourhood for unwanted characters
df['neighbourhood'].unique()

array(['Kensington', 'Midtown', 'Harlem', 'Clinton Hill', 'East Harlem',
       'Murray Hill', 'Bedford-Stuyvesant', "Hell's Kitchen",
       'Upper West Side', 'Chinatown', 'South Slope', 'West Village',
       'Williamsburg', 'Fort Greene', 'Chelsea', 'Crown Heights',
       'Park Slope', 'Windsor Terrace', 'Inwood', 'East Village',
       'Greenpoint', 'Bushwick', 'Flatbush', 'Lower East Side',
       'Prospect-Lefferts Gardens', 'Long Island City', 'Kips Bay',
       'SoHo', 'Upper East Side', 'Prospect Heights',
       'Washington Heights', 'Woodside', 'Brooklyn Heights',
       'Carroll Gardens', 'Gowanus', 'Flatlands', 'Cobble Hill',
       'Flushing', 'Boerum Hill', 'Sunnyside', 'DUMBO', 'St. George',
       'Highbridge', 'Financial District', 'Ridgewood',
       'Morningside Heights', 'Jamaica', 'Middle Village', 'NoHo',
       'Ditmars Steinway', 'Flatiron District', 'Roosevelt Island',
       'Greenwich Village', 'Little Italy', 'East Flatbush',
       'Tompkinsville', 'Asto

Neighbourhood has many distinct categories and needs further analysis.

In [7]:
#checking the unique values of room_type for unwanted characters
df['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

## 3. Data Cleaning

In [8]:
#checking duplicated values
df.duplicated().sum()

0

In [9]:
#checking null value
df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [11]:
#filling null values of reviews_per_month with 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

In [10]:
#converting last_review string to date
import datetime
df['last_review']=pd.to_datetime(df['last_review'])

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              48895 non-null  int64         
 1   name                            48879 non-null  object        
 2   host_id                         48895 non-null  int64         
 3   host_name                       48874 non-null  object        
 4   neighbourhood_group             48895 non-null  object        
 5   neighbourhood                   48895 non-null  object        
 6   latitude                        48895 non-null  float64       
 7   longitude                       48895 non-null  float64       
 8   room_type                       48895 non-null  object        
 9   price                           48895 non-null  int64         
 10  minimum_nights                  48895 non-null  int64         
 11  nu

In [None]:
#creating new feature last_review_days (days since the last review)
today = datetime.datetime(2020, 6, 19)
df['last_review_days'] = today-df['last_review']

In [None]:
#checking the stats of last_review_days
df['last_review_days'].describe()

The count is 38843 out of 48895 rows, so more than 10,000 values are null. A null is treated as if the listing was last reviewed longer ago than any listing in the data, so it is assigned a value above the existing maximum. Since the highest value is 3371, nulls will be filled with 7000.

In [None]:
#converting to numeric
df['last_review_days'] = pd.to_numeric(df['last_review_days'].dt.days, downcast='integer')

In [None]:
#checking the stats of last_review_days
df['last_review_days'].describe()

In [None]:
#filling null values with 7000
df['last_review_days'] = df['last_review_days'].fillna(7000)
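
The steps above (date difference, conversion to days, sentinel fill) can be sketched on a toy frame. The toy dates and the `never_reviewed` flag column are assumptions for illustration and are not part of the original notebook. Keeping such a flag is one way to let a model tell a real value apart from the 7000 sentinel.

```python
import datetime
import pandas as pd

# Toy data: two reviewed listings and one with no last_review date
toy = pd.DataFrame({'last_review': pd.to_datetime(['2019-07-05', '2018-10-19', None])})

today = datetime.datetime(2020, 6, 19)        # same reference date as the notebook
days = (today - toy['last_review']).dt.days   # NaN where last_review is missing
toy['last_review_days'] = days.fillna(7000)   # sentinel larger than any observed value
toy['never_reviewed'] = days.isna()           # hypothetical flag column

print(toy)
```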

In [None]:
#dropping unnecessary feature
df = df.drop(['id','name','host_id','host_name','last_review'],axis=1)

In [None]:
neighbourhood_count = df.groupby('neighbourhood')['neighbourhood'].agg('count').sort_values(ascending=False)
neighbourhood_count

There are 221 unique values of neighbourhood. Encoding all of them would give the model too many dimensions.

In [None]:
#counting neighbourhoods with fewer than 500 listings; these will be converted to 'Other'
len(neighbourhood_count[neighbourhood_count<500])

In [None]:
#converting neighbourhoods with fewer than 500 listings to 'Other'
rare = neighbourhood_count[neighbourhood_count<500].index
df['neighbourhood'] = df['neighbourhood'].where(~df['neighbourhood'].isin(rare), 'Other')

In [None]:
#checking unique values of neighbourhood
len(df['neighbourhood'].unique())
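
The grouping of rare categories can be seen on a small example. This is a minimal sketch on a toy Series; the threshold of 2 is chosen only so the effect is visible (the notebook code uses 500).

```python
import pandas as pd

# Toy column of neighbourhood names
s = pd.Series(['Harlem', 'Harlem', 'Midtown', 'Inwood', 'Harlem', 'Midtown'])

counts = s.value_counts()
rare = counts[counts < 2].index             # categories below the threshold
grouped = s.where(~s.isin(rare), 'Other')   # keep frequent values, replace rare ones

print(grouped.tolist())
# ['Harlem', 'Harlem', 'Midtown', 'Other', 'Harlem', 'Midtown']
```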

## 4. Feature Engineering

In [None]:
#checking linear correlation between the numerical features
cor=df.corr(numeric_only=True)
plt.figure(figsize=(8,6))
ax=sns.heatmap(cor,annot=True,cmap='coolwarm')
bot,top = ax.get_ylim()
ax.set_ylim(bot+0.5, top-0.5)
plt.show()

All features correlate weakly with price, which suggests that the relationship with price is not linear.
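
A weak Pearson correlation does not by itself show what kind of relationship exists. One additional check is a rank (Spearman) correlation, which picks up monotonic but non-linear relationships. The sketch below uses synthetic data, not the Airbnb data, and computes Spearman as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows exponentially with x, so the relationship is
# monotonic but far from linear
toy = pd.DataFrame({'x': np.arange(1, 21)})
toy['y'] = np.exp(toy['x'])

pearson = toy['x'].corr(toy['y'])                  # linear correlation
spearman = toy['x'].rank().corr(toy['y'].rank())   # correlation of the ranks

print(pearson, spearman)
```

On the notebook's data, `df.corr(method='spearman', numeric_only=True)` should give the corresponding matrix, assuming a pandas version that supports `numeric_only`.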

In [None]:
df.columns

In [None]:
num=['latitude', 'longitude','minimum_nights', 'number_of_reviews','reviews_per_month',
     'calculated_host_listings_count','availability_365', 'last_review_days']
cat=['neighbourhood_group','neighbourhood','room_type']

In [None]:
#univariate F-test (f_regression) to check the importance of numerical features in relation to price
x = df[num]
y = df['price']

from sklearn.feature_selection import f_regression
fvalue,pval = f_regression(x,y)
for i in range(len(x.columns)):
    print(x.columns[i],pval[i]) 

Assuming a significance level of 5%, all numerical features are important because their p-values are lower than 5%.
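
How the p-value rule works can be seen on synthetic data where the answer is known. This is a sketch with made-up features named `informative` and `noise`; note that f_regression only tests for a linear dependency between each feature and the target.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
informative = rng.normal(size=500)   # feature that drives the target
noise = rng.normal(size=500)         # feature unrelated to the target
X = np.column_stack([informative, noise])
y = 3.0 * informative + rng.normal(size=500)

f_values, p_values = f_regression(X, y)
for name, p in zip(['informative', 'noise'], p_values):
    print(name, p, 'keep' if p < 0.05 else 'drop')
```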

In [None]:
#chi-square analysis to check the importance of categorical features in relation to price
#encoding string-type categorical features as integers
xcat = df[cat].copy()
from sklearn.preprocessing import LabelEncoder
xcat['neighbourhood_group']= LabelEncoder().fit_transform(xcat['neighbourhood_group'])
xcat['neighbourhood']= LabelEncoder().fit_transform(xcat['neighbourhood'])
xcat['room_type']= LabelEncoder().fit_transform(xcat['room_type'])

#running chi square analysis
from sklearn.feature_selection import chi2
cval,pval = chi2(xcat,y)
for i in range(len(cat)):
    print(cat[i],' ',pval[i])

Assuming a significance level of 5%, neighbourhood_group is unimportant because its p-value is higher than 5%.
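
One caveat: sklearn's chi2 is meant for non-negative features such as counts or booleans tested against a class label, while price is continuous and label-encoded integers carry an arbitrary order. An alternative check, swapped in here as an assumption and not the notebook's method, is a one-way ANOVA of price across the categories. The sketch uses synthetic prices.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'room_type': np.repeat(['Entire home/apt', 'Private room', 'Shared room'], 100),
    'price': np.concatenate([
        rng.normal(200, 30, 100),   # synthetic prices, not taken from the data set
        rng.normal(90, 30, 100),
        rng.normal(60, 30, 100),
    ]),
})

# One array of prices per category, then test whether the group means differ
groups = [g['price'].to_numpy() for _, g in toy.groupby('room_type')]
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)
```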

In [None]:
#dropping the unimportant feature (neighbourhood_group) and the target
x= df.drop(['neighbourhood_group', 'price'],axis=1)
y=df['price']

In [None]:
x.head()

In [None]:
#one-hot encoding neighbourhood and room_type
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#sparse_threshold=0 makes the transformer return a dense array
encoder = ColumnTransformer([('ohe',OneHotEncoder(),['neighbourhood','room_type'])],remainder='passthrough',sparse_threshold=0)
encoder.fit(x)
x2 = encoder.transform(x)
x2 = pd.DataFrame(x2)
x2.head()

## 5. Preprocessing of Data

In [None]:
#splitting test and train data (using the encoded features x2)
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(x2,y,test_size=0.2, random_state=5)

## 6. Apply Machine Learning Algorithm

In [None]:
#training a decision tree regressor (price is a continuous target)
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=8,max_leaf_nodes=30)
model.fit(xtrain,ytrain)

## 7. Performance Analysis

In [None]:
from sklearn import metrics
from sklearn.metrics import r2_score
print(r2_score(ytrain,model.predict(xtrain)))
print(r2_score(ytest,model.predict(xtest)))

- low R² score
- the model seems to be overfitting
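
Overfitting shows up as a gap between the train and test R². The sketch below illustrates this on synthetic data with a regression tree, comparing a shallow tree with a fully grown one. The data, depths, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=1000)   # noisy synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=5)

gaps = {}
for depth in (3, None):   # shallow tree vs. fully grown tree
    tree = DecisionTreeRegressor(max_depth=depth, random_state=5).fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, tree.predict(X_tr))
    test_r2 = r2_score(y_te, tree.predict(X_te))
    gaps[depth] = train_r2 - test_r2   # a large gap points to overfitting
    print(depth, train_r2, test_r2)
```

A larger gap for the fully grown tree indicates it memorises noise in the training data.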

In [None]:
#building a new model with different depth and leaf-size limits
from sklearn.tree import DecisionTreeRegressor
model2 = DecisionTreeRegressor(max_depth=25,min_samples_leaf=10,min_samples_split=20,random_state=5)
model2.fit(xtrain,ytrain)

In [None]:
print(r2_score(ytrain,model2.predict(xtrain)))
print(r2_score(ytest,model2.predict(xtest)))

### Random Forest

In [None]:
#training a random forest regressor
from sklearn.ensemble import RandomForestRegressor
model5=RandomForestRegressor(n_estimators=50, max_depth=20,min_samples_leaf=15,min_samples_split=40,oob_score=True)
model5.fit(xtrain,ytrain)

In [None]:
print(r2_score(ytrain,model5.predict(xtrain)))
print(r2_score(ytest,model5.predict(xtest)))