# Machine Learning Applications for Airbnb Data

### Group 3 - Dhruv Shah, Jenn Hong, Setu Shah, Sonya Dreyer

---



• State the problem

• Tell us who cares about this problem and why

• Describe your data – where it came from, what it contains

• Present some interesting descriptive analyses (plots/tables) that motivates your exercise

• Present your main results

• Which methods worked best for your problem?

• What were the challenges you faced? Tell us about the biggest challenge you faced and how you overcame it (or tried but did not; that's fine too, since not every problem has a solution).

• Conclude – what have you learnt that can be put to practice?

# Data Cleaning

---



In [1]:
# Import preprocessing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Download the file

!wget 'https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip'

--2023-11-19 22:28:11--  https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip
Resolving maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)... 52.217.226.1, 3.5.28.122, 3.5.29.188, ...
Connecting to maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)|52.217.226.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91005234 (87M) [application/zip]
Saving to: ‘Airbnb+Data.zip’


2023-11-19 22:28:12 (57.8 MB/s) - ‘Airbnb+Data.zip’ saved [91005234/91005234]



In [3]:
# Unzip the file

!unzip Airbnb+Data.zip

Archive:  Airbnb+Data.zip
   creating: Airbnb Data/
  inflating: Airbnb Data/Listings.csv  
  inflating: Airbnb Data/Listings_data_dictionary.csv  
  inflating: Airbnb Data/Reviews.csv  
  inflating: Airbnb Data/Reviews_data_dictionary.csv  


In [4]:
# Load the data frames

listings =  pd.read_csv('/content/Airbnb Data/Listings.csv', encoding = 'latin1', low_memory = False)

reviews = pd.read_csv('/content/Airbnb Data/Reviews.csv', encoding = 'latin1', low_memory = False)

In [5]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279712 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   listing_id                   279712 non-null  int64  
 1   name                         279539 non-null  object 
 2   host_id                      279712 non-null  int64  
 3   host_since                   279547 non-null  object 
 4   host_location                278872 non-null  object 
 5   host_response_time           150930 non-null  object 
 6   host_response_rate           150930 non-null  float64
 7   host_acceptance_rate         166625 non-null  float64
 8   host_is_superhost            279547 non-null  object 
 9   host_total_listings_count    279547 non-null  float64
 10  host_has_profile_pic         279547 non-null  object 
 11  host_identity_verified       279547 non-null  object 
 12  neighbourhood                279712 non-null  object 
 13 

In [6]:
# Converting to datetime

listings.host_since = pd.to_datetime(listings.host_since)
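Once `host_since` is a datetime, it can be turned into numeric or categorical features that a model can use. The sketch below is one possible approach on a toy frame. The `reference_date` value and the new column names `host_tenure_days` and `host_since_month` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for `listings`; the dates are made up for illustration.
toy = pd.DataFrame({"host_since": ["2011-12-03", "2013-11-29", None]})
toy["host_since"] = pd.to_datetime(toy["host_since"])

# Hypothetical reference date: ideally the date the data was collected.
reference_date = pd.Timestamp("2021-01-01")

# Host tenure in whole days; missing dates stay missing (NaN).
toy["host_tenure_days"] = (reference_date - toy["host_since"]).dt.days

# Calendar month in which the host joined (1 to 12), usable as a category.
toy["host_since_month"] = toy["host_since"].dt.month

print(toy)
```

A tenure column of this kind could join the numerical attributes, while the month could be treated as a categorical attribute.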

In [7]:
# Converting to out-of-10 scale

listings.review_scores_rating = listings.review_scores_rating / 10

In [8]:
# Converting prices to USD

cities = listings['city'].unique()
# NOTE: rates are paired with cities by position, so they must follow the order returned by unique()
exchange_rates = [1.0808, 1, 0.028388, 0.20328, 0.65462, 0.039480, 1.0808, 0.12777, 0.0493, 0.053215] # update these numbers before fitting models
currency_map = dict(zip(cities, exchange_rates))

listings['usd_price'] = listings['price'] * listings['city'].map(currency_map) # create new column (vectorized instead of a row-wise apply)
listings.drop('price', axis = 1, inplace = True) # drop original column
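The conversion above pairs each city with a rate by position, which silently breaks if the order of `unique()` changes. A more defensive variant keys the rates by city name, as sketched below. The city names and rates here are placeholders chosen for illustration, not verified values from the dataset.

```python
import pandas as pd

# Toy data; city names and rates are placeholders, not verified values.
toy = pd.DataFrame({
    "city": ["Paris", "New York", "Paris", "Tokyo"],
    "price": [50.0, 100.0, 80.0, 9000.0],
})

# Hypothetical explicit mapping: USD per one unit of the local currency.
usd_per_local = {"Paris": 1.08, "New York": 1.0}

rates = toy["city"].map(usd_per_local)

# Cities missing from the mapping surface as NaN instead of a wrong rate.
unmapped = toy.loc[rates.isna(), "city"].unique().tolist()
print("Cities without a rate:", unmapped)

toy["usd_price"] = toy["price"] * rates
print(toy)
```

Checking `unmapped` before dropping the original price column makes a missing or misspelled city visible early.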

In [9]:
# Converting to numerical category

# Potentially problematic -> NULL values are converted to zero, the same as 'f'

bool_columns = ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable']
for col in bool_columns:
    listings[col] = (listings[col] == 't').astype(int) # 1 if 't', otherwise 0
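If collapsing NULL into zero turns out to matter, the missing values can be kept visible instead. The sketch below shows two options on a toy column. The names `superhost_nullable`, `superhost_missing`, and `superhost_filled` are hypothetical.

```python
import pandas as pd

# Toy column with one missing value, mimicking the 't'/'f' flags.
toy = pd.DataFrame({"host_is_superhost": ["t", "f", None, "t"]})

# Map 't'/'f' to 1/0; anything else (including missing) becomes <NA>.
flag = toy["host_is_superhost"].map({"t": 1, "f": 0}).astype("Int64")

# Option A: keep the gap so a later imputation step can handle it.
toy["superhost_nullable"] = flag

# Option B: fill with 0 but record that the value was originally missing.
toy["superhost_missing"] = flag.isna().astype(int)
toy["superhost_filled"] = flag.fillna(0).astype(int)

print(toy)
```

Option B keeps the model input fully numeric while still letting the model distinguish "not a superhost" from "unknown".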

In [10]:
# We can or should drop listing_id, host_id, property_type, neighbourhood, and district
# We can drop district because it is populated only for New York; all other rows are NULL
# We should drop name and possibly host_location (unless we can figure out how to extract a precise location from it; latitude and longitude can be used to create clusters like in the lab)

columns_to_drop = ['listing_id', 'host_id', 'property_type', 'neighbourhood', 'district', 'name', 'host_location']

listings = listings.drop(columns=columns_to_drop)
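The comment above mentions using latitude and longitude to create clusters. A minimal sketch of that idea with k-means follows, using synthetic coordinates. The cluster count, the centres, and the column name `geo_cluster` are assumptions for illustration; the lab's exact method is not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic coordinates scattered around two made-up centres.
coords = np.vstack([
    rng.normal(loc=[48.86, 2.35], scale=0.01, size=(50, 2)),
    rng.normal(loc=[40.71, -74.00], scale=0.01, size=(50, 2)),
])
toy = pd.DataFrame(coords, columns=["latitude", "longitude"])

# n_clusters=2 suits the toy data only; real data needs tuning.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["geo_cluster"] = kmeans.fit_predict(toy[["latitude", "longitude"]])

print(toy["geo_cluster"].value_counts())
```

Because the listings span several cities, clustering within each city may give more useful neighbourhood-like groups than clustering all coordinates at once.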

In [11]:
# We need to possibly impute values (or drop columns) for host response time/rate, host_acceptance_rate, and some of the ratings columns [Iterative Imputer]

listings.isnull().sum() / len(listings) * 100


host_since                      0.058989
host_response_time             46.040928
host_response_rate             46.040928
host_acceptance_rate           40.429799
host_is_superhost               0.000000
host_total_listings_count       0.058989
host_has_profile_pic            0.000000
host_identity_verified          0.000000
city                            0.000000
latitude                        0.000000
longitude                       0.000000
room_type                       0.000000
accommodates                    0.000000
bedrooms                       10.523324
amenities                       0.000000
minimum_nights                  0.000000
maximum_nights                  0.000000
review_scores_rating           32.678255
review_scores_accuracy         32.788368
review_scores_cleanliness      32.771208
review_scores_checkin          32.809104
review_scores_communication    32.779073
review_scores_location         32.810534
review_scores_value            32.814109
instant_bookable
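The iterative imputer mentioned in the comment above could look roughly like the sketch below, which models each column with missing values as a function of the others. The toy values are invented, and the settings (`max_iter=10`, `random_state=0`) are arbitrary choices rather than tuned ones.

```python
import numpy as np
import pandas as pd
# This import is required to enable the experimental IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric frame with gaps; values are invented for illustration.
toy = pd.DataFrame({
    "host_response_rate": [1.0, 0.9, np.nan, 0.5, 0.8, np.nan],
    "host_acceptance_rate": [0.95, 0.85, 0.7, np.nan, 0.8, 0.6],
    "review_scores_rating": [10.0, 9.5, 9.0, 8.0, np.nan, 8.5],
})

imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(imputed.round(3))
```

Imputed values are regression estimates and can fall outside the observed range; the `min_value` and `max_value` parameters of `IterativeImputer` can bound them.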

In [12]:
# Drop rows with null values in host_since, host_total_listings_count, and bedrooms, since each of these columns has less than 15% null values
columns_to_dropna = ['host_since', 'host_total_listings_count', 'bedrooms']

listings.dropna(subset=columns_to_dropna, inplace=True)

In [13]:
listings.host_response_time.unique()

array([nan, 'within a few hours', 'within a day', 'within an hour',
       'a few days or more'], dtype=object)

In [14]:
# Option 1: Impute with the most common response time
#common_response_time = listings['host_response_time'].mode()[0]

#listings['host_response_time'].fillna(common_response_time, inplace=True)


In [15]:
# Option 2: Impute with 'unknown'
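listings['host_response_time'] = listings['host_response_time'].fillna('unknown') # assign back; in-place fillna on a selected column is deprecated in recent pandas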

In [16]:
listings.host_response_rate.unique()

array([ nan, 1.  , 0.  , 0.5 , 0.67, 0.9 , 0.86, 0.83, 0.8 , 0.75, 0.88,
       0.79, 0.89, 0.94, 0.71, 0.95, 0.3 , 0.25, 0.6 , 0.33, 0.43, 0.2 ,
       0.84, 0.4 , 0.92, 0.17, 0.97, 0.87, 0.38, 0.7 , 0.58, 0.76, 0.78,
       0.44, 0.91, 0.1 , 0.14, 0.57, 0.98, 0.69, 0.81, 0.82, 0.96, 0.73,
       0.85, 0.46, 0.22, 0.29, 0.63, 0.77, 0.56, 0.99, 0.93, 0.23, 0.13,
       0.36, 0.12, 0.47, 0.55, 0.06, 0.72, 0.08, 0.09, 0.62, 0.65, 0.64,
       0.28, 0.53, 0.39, 0.59, 0.41, 0.27, 0.31, 0.74, 0.03, 0.52, 0.68,
       0.11, 0.04, 0.54, 0.61, 0.21, 0.07, 0.45, 0.42, 0.51, 0.48, 0.15,
       0.01, 0.19, 0.24, 0.05])

In [17]:
# Impute with the average response and acceptance rate (since the values in the column are between 0 and 1, this seems reasonable)

mean_response_rate = listings['host_response_rate'].mean()
mean_acceptance_rate = listings['host_acceptance_rate'].mean()

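# Assign back instead of using inplace=True on a selected column (deprecated in recent pandas)
listings['host_response_rate'] = listings['host_response_rate'].fillna(mean_response_rate)
listings['host_acceptance_rate'] = listings['host_acceptance_rate'].fillna(mean_acceptance_rate)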
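With roughly 40% or more of these rates missing, mean imputation hides which rows were originally empty, and a mean computed on the full data leaks test-set information into training. One possible alternative, sketched below on toy values, fits a `SimpleImputer` on the training rows only and appends missing-value indicator columns. The positional split stands in for `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy rates with gaps; values are invented for illustration.
toy = pd.DataFrame({
    "host_response_rate": [1.0, np.nan, 0.8, 0.6, np.nan, 0.9, 0.7, 1.0],
    "host_acceptance_rate": [0.9, 0.5, np.nan, 0.7, 0.8, np.nan, 0.6, 1.0],
})

# Simple positional split standing in for train_test_split.
train, test = toy.iloc[:6], toy.iloc[6:]

# add_indicator=True appends one 0/1 column per feature that had gaps in fit.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_imputed = imputer.fit_transform(train)  # means learned from train only
test_imputed = imputer.transform(test)        # the same means reused on test

print(imputer.statistics_)
print(train_imputed.shape, test_imputed.shape)
```

The first two output columns hold the imputed rates and the last two flag rows whose value was originally missing.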

In [18]:
listings['review_scores_rating'].unique()

array([10. ,  9.8,  9.9,  9.3,  9.6,  9.7,  9.5,  9. ,  8.8,  9.2,  8. ,
        9.4,  6. ,  9.1,  8.5,  8.7,  8.9,  7.5,  8.6,  8.4,  8.3,  7. ,
        8.2,  8.1,  7.2,  5. ,  4. ,  7.7,  7.1,  6.7,  4.8,  2. ,  5.6,
        nan,  7.3,  7.6,  2.7,  7.8,  6.8,  6.4,  7.4,  5.3,  5.2,  7.9,
        6.5,  4.7,  6.9,  5.7,  5.8,  3.3,  3. ,  6.6,  5.4,  4.5,  6.3,
        5.5,  3.6,  3.1,  4.9,  6.2,  4.3,  3.5,  6.1,  4.4])

In [24]:
# Impute review scores with medians (the scores sit on a bounded 0-to-10 scale, and the median is more robust than the mean when the distribution is skewed or has outliers)

review_score_columns = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
                        'review_scores_communication', 'review_scores_location', 'review_scores_value']

# Fill each column with its own median in one vectorized step
listings[review_score_columns] = listings[review_score_columns].fillna(listings[review_score_columns].median())

In [25]:
listings.isnull().sum() / len(listings) * 100

host_since                     0.0
host_response_time             0.0
host_response_rate             0.0
host_acceptance_rate           0.0
host_is_superhost              0.0
host_total_listings_count      0.0
host_has_profile_pic           0.0
host_identity_verified         0.0
city                           0.0
latitude                       0.0
longitude                      0.0
room_type                      0.0
accommodates                   0.0
bedrooms                       0.0
amenities                      0.0
minimum_nights                 0.0
maximum_nights                 0.0
review_scores_rating           0.0
review_scores_accuracy         0.0
review_scores_cleanliness      0.0
review_scores_checkin          0.0
review_scores_communication    0.0
review_scores_location         0.0
review_scores_value            0.0
instant_bookable               0.0
usd_price                      0.0
dtype: float64

In [26]:
listings

Unnamed: 0,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_total_listings_count,host_has_profile_pic,host_identity_verified,city,latitude,...,maximum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,usd_price
0,2011-12-03,unknown,0.866905,0.825717,0,1.0,1,0,Paris,48.88668,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,57.2824
1,2013-11-29,unknown,0.866905,0.825717,0,1.0,1,1,Paris,48.88617,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,129.6960
2,2014-07-31,unknown,0.866905,0.825717,0,1.0,1,0,Paris,48.88112,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,96.1912
3,2013-12-17,unknown,0.866905,0.825717,0,1.0,1,1,Paris,48.84571,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,62.6864
4,2014-12-14,unknown,0.866905,0.825717,0,1.0,1,0,Paris,48.85500,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,64.8480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279707,2015-04-13,unknown,0.866905,0.825717,0,1.0,1,1,Paris,48.82701,...,7,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,129.6960
279708,2013-11-27,unknown,0.866905,0.825717,0,1.0,1,1,Paris,48.89309,...,15,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,64.8480
279709,2012-04-27,unknown,0.866905,0.825717,0,1.0,1,1,Paris,48.88699,...,30,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,54.0400
279710,2015-07-16,unknown,0.866905,0.825717,0,1.0,1,1,Paris,48.86687,...,18,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,113.4840


# Preprocessing

---



In [None]:
# Jenn's one hot encoding for amenities
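The amenities encoding is still a placeholder above. As a rough sketch of one way to do it, the example below assumes each `amenities` value is a JSON-style list stored as a string; that format is an assumption and should be checked against the data first. The `amenity_` prefix is hypothetical.

```python
import json

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows; assumes amenities are stored as JSON-style list strings.
toy = pd.DataFrame({
    "amenities": [
        '["Wifi", "Kitchen", "Heating"]',
        '["Wifi"]',
        '[]',
    ]
})

# Parse each string into a Python list (fails loudly on invalid JSON).
amenity_lists = toy["amenities"].apply(json.loads)

# One 0/1 column per distinct amenity (a multi-hot encoding).
mlb = MultiLabelBinarizer()
amenity_dummies = pd.DataFrame(
    mlb.fit_transform(amenity_lists),
    columns=[f"amenity_{name}" for name in mlb.classes_],
    index=toy.index,
)

print(amenity_dummies)
```

On the full dataset the number of distinct amenities may be large, so keeping only the most frequent ones is a common way to limit the column count.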

In [None]:
# Splitting the data into training and test sets to estimate generalization error

from sklearn.model_selection import train_test_split

X = listings.drop("usd_price", axis=1)
y = listings["usd_price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Building preprocessing pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn import set_config
set_config(display='diagram')

# host_since is not included yet (it could be split by month if we decide to use it)
cat_attribs = ['host_response_time', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'city', 'room_type', 'amenities', 'instant_bookable']

# latitude and longitude are excluded
num_attribs = ['host_response_rate', 'host_acceptance_rate', 'host_total_listings_count', 'accommodates', 'bedrooms', 'review_scores_rating', 'review_scores_accuracy', 'minimum_nights',
               'maximum_nights', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value']

# Columns not listed in either group are dropped, because ColumnTransformer defaults to remainder='drop'
preprocess_pipeline = ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), cat_attribs),
        ("num", StandardScaler(), num_attribs),
    ])

preprocess_pipeline
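To show how a preprocessing step like this plugs into a model, the sketch below chains a small `ColumnTransformer` with a linear regression on toy data. It deliberately swaps `drop="first"` for `handle_unknown="ignore"`, so a category unseen during training does not raise an error at prediction time. The choice of `LinearRegression`, the toy values, and the reduced column lists are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy listings; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Rome", "Paris", "Rome"],
    "room_type": ["Entire place", "Private room", "Private room",
                  "Entire place", "Entire place", "Private room"],
    "accommodates": [2, 1, 2, 4, 3, 1],
    "bedrooms": [1, 1, 1, 2, 2, 1],
    "usd_price": [120.0, 40.0, 60.0, 150.0, 140.0, 35.0],
})
X, y = toy.drop(columns="usd_price"), toy["usd_price"]

preprocess = ColumnTransformer([
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "room_type"]),
    ("num", StandardScaler(), ["accommodates", "bedrooms"]),
])

# Preprocessing and model are fitted together, on the training data only.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X, y)

# "Tokyo" was never seen during fit, yet prediction still works.
new_listing = pd.DataFrame({
    "city": ["Tokyo"], "room_type": ["Private room"],
    "accommodates": [2], "bedrooms": [1],
})
print(model.predict(new_listing))
```

Wrapping both steps in one `Pipeline` means the scaler and encoder learn their parameters from `X_train` alone when the pipeline is fitted there, which keeps the test set untouched.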