# Machine Learning Applications for Airbnb Data

### Group 3 - Dhruv Shah, Jenn Hong, Setu Shah, Sonya Dreyer

---



• State the problem

• Tell us who cares about this problem and why

• Describe your data – where it came from, what it contains

• Present some interesting descriptive analyses (plots/tables) that motivate your exercise

• Present your main results

• Which methods worked best for your problem?

• What were the challenges you faced? Tell us about the biggest challenge you faced and how you overcame it (or tried but did not; that's fine too, not every problem has a solution).

• Conclude – what have you learnt that can be put to practice?

# Data Cleaning

---



In [None]:
# Import preprocessing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Download the file

!wget 'https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip'

--2023-11-18 19:42:28--  https://maven-datasets.s3.amazonaws.com/Airbnb/Airbnb+Data.zip
Resolving maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)... 54.231.230.89, 52.217.102.228, 3.5.28.19, ...
Connecting to maven-datasets.s3.amazonaws.com (maven-datasets.s3.amazonaws.com)|54.231.230.89|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91005234 (87M) [application/zip]
Saving to: ‘Airbnb+Data.zip’


2023-11-18 19:42:30 (49.1 MB/s) - ‘Airbnb+Data.zip’ saved [91005234/91005234]



In [None]:
# Unzip the file

!unzip Airbnb+Data.zip

Archive:  Airbnb+Data.zip
   creating: Airbnb Data/
  inflating: Airbnb Data/Listings.csv  
  inflating: Airbnb Data/Listings_data_dictionary.csv  
  inflating: Airbnb Data/Reviews.csv  
  inflating: Airbnb Data/Reviews_data_dictionary.csv  


In [None]:
# Load the data frames

listings =  pd.read_csv('/content/Airbnb Data/Listings.csv', encoding = 'latin1', low_memory = False)

reviews = pd.read_csv('/content/Airbnb Data/Reviews.csv', encoding = 'latin1', low_memory = False)

In [None]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279712 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   listing_id                   279712 non-null  int64         
 1   name                         279539 non-null  object        
 2   host_id                      279712 non-null  int64         
 3   host_since                   279547 non-null  datetime64[ns]
 4   host_location                278872 non-null  object        
 5   host_response_time           150930 non-null  object        
 6   host_response_rate           150930 non-null  float64       
 7   host_acceptance_rate         166625 non-null  float64       
 8   host_is_superhost            279547 non-null  object        
 9   host_total_listings_count    279547 non-null  float64       
 10  host_has_profile_pic         279547 non-null  object        
 11  host_identity_verified    

In [None]:
# Converting to datetime

listings.host_since = pd.to_datetime(listings.host_since)

In [None]:
# Rescaling review_scores_rating to an out-of-10 scale, in line with the other review scores

listings.review_scores_rating = listings.review_scores_rating / 10

In [None]:
# Converting prices to USD

cities = listings['city'].unique()
# NOTE: rates are paired with cities by position, so this relies on the order returned by unique() (order of first appearance in the data); verify the pairing
exchange_rates = [1.0808, 1, 0.028388, 0.20328, 0.65462, 0.039480, 1.0808, 0.12777, 0.0493, 0.053215] # update these numbers before fitting models
currency_map = dict(zip(cities, exchange_rates))

listings['usd_price'] = listings['price'] * listings['city'].map(currency_map) # create new column (vectorised lookup instead of a row-wise apply)
listings.drop('price', axis = 1, inplace = True) # drop original column
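A minimal sketch of a more defensive variant of the conversion, assuming an explicit city-to-rate dictionary is maintained by hand. The toy data, the city names, and the rates below are placeholders for illustration only, not the values used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; city names and prices are illustrative only
toy = pd.DataFrame({
    "city": ["Paris", "New York", "Paris", "Atlantis"],
    "price": [100.0, 80.0, 50.0, 10.0],
})

# Explicit city -> USD rate mapping (placeholder values, not real rates)
rate_by_city = {"Paris": 1.08, "New York": 1.0}

# Vectorised lookup; cities missing from the mapping come back as NaN
rates = toy["city"].map(rate_by_city)

# Collect cities without a rate so they can be reported before modelling
unmapped = sorted(toy.loc[rates.isna(), "city"].unique())

toy["usd_price"] = toy["price"] * rates
```

Keying the rates by city name means the result does not depend on row order, and `unmapped` makes a missing rate visible instead of silently producing a wrong price.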

In [None]:
# Encoding the 't'/'f' flags as 1/0

# Potentially problematic -> NULL values are also converted to 0, so they become indistinguishable from 'f'

listings.host_is_superhost = listings.host_is_superhost.eq('t').astype(int)
listings.host_has_profile_pic = listings.host_has_profile_pic.eq('t').astype(int)
listings.host_identity_verified = listings.host_identity_verified.eq('t').astype(int)
listings.instant_bookable = listings.instant_bookable.eq('t').astype(int)
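If the missing flags should stay missing rather than become 0, one option is pandas' nullable integer dtype. The sketch below uses a hypothetical toy Series and compares both encodings; it is an illustration of the trade-off, not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical toy flags, including one missing value
flags = pd.Series(["t", "f", None, "t"])

# Encoding used above: anything that is not 't' (including missing) becomes 0
as_binary = flags.eq("t").astype(int)

# Alternative: map 't'/'f' explicitly and keep missing values as <NA>
as_nullable = flags.map({"t": 1, "f": 0}).astype("Int8")
```

With `as_nullable`, the missing entries can later be imputed or dropped explicitly instead of being counted as `'f'`.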

In [None]:
# Notes on columns to drop or impute:

# listing_id, host_id, property, neighbourhood: can (or should) be dropped

# Districts: can be dropped, since it only contains districts of New York and is NULL everywhere else

# name: should be dropped; host_location possibly as well, unless we can extract a precise location from it (latitude and longitude can be used to create clusters, as in the lab)

# All host locations within each country have been mapped to the most prominent city in that country

# host_response_time, host_response_rate, host_acceptance_rate and some of the ratings columns: values may need to be imputed (e.g. with IterativeImputer), or the columns dropped
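A minimal sketch of the imputation idea from the notes above, using scikit-learn's `IterativeImputer` on a hypothetical toy frame. The values are made up, and the choice of columns and settings (`max_iter=10`, the default estimator) is an assumption rather than a tuned configuration.

```python
import numpy as np
import pandas as pd

# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data with gaps in each numeric column
toy = pd.DataFrame({
    "host_response_rate": [0.9, np.nan, 0.5, 1.0, np.nan, 0.7],
    "host_acceptance_rate": [0.8, 0.6, np.nan, 0.95, 0.4, 0.65],
    "review_scores_rating": [9.5, 8.0, 7.0, np.nan, 6.5, 8.5],
})

# Each column with missing values is modelled from the other columns in turn
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
```

In the real workflow the imputer would be fit on the training split only and then applied to the test split, so that no information leaks from the test data.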

In [None]:
listings.head(2)

Unnamed: 0,listing_id,name,host_id,host_since,host_location,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_total_listings_count,...,maximum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,usd_price
0,281420,"Beautiful Flat in le Village Montmartre, Paris",1466919,2011-12-03,"Paris, Ile-de-France, France",,,,0,1.0,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,57.2824
1,3705183,39 mÃÂ² Paris (Sacre CÃâur),10328771,2013-11-29,"Paris, Ile-de-France, France",,,,0,1.0,...,1125,10.0,10.0,10.0,10.0,10.0,10.0,10.0,0,129.696


In [None]:
listings.neighbourhood.value_counts()

I Centro Storico     14874
Sydney                8074
Copacabana            7712
Cuauhtemoc            7626
Buttes-Montmartre     7237
                     ...  
Lighthouse Hill          1
Willowbrook              1
Magalhaes Bastos         1
Woodrow                  1
Agua Santa               1
Name: neighbourhood, Length: 660, dtype: int64
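With 660 distinct neighbourhoods, many of them appearing only once, a coarser location feature may be easier to work with. The sketch below clusters latitude and longitude with k-means on hypothetical toy coordinates; the number of clusters and the column name `geo_cluster` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy coordinates forming two well-separated groups
coords = pd.DataFrame({
    "latitude": [48.85, 48.86, 48.87, 40.71, 40.72, 40.73],
    "longitude": [2.35, 2.34, 2.36, -74.00, -74.01, -73.99],
})

# Assign each listing to a location cluster; n_clusters would need tuning
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
coords["geo_cluster"] = kmeans.fit_predict(coords[["latitude", "longitude"]])
```

Since the dataset spans several cities, clustering within each city separately may be more informative than a single global clustering.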

# Preprocessing

---



In [None]:
# Jenn's one hot encoding for amenities
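One possible shape for the amenities encoding, sketched on hypothetical toy rows. It assumes each `amenities` value is a JSON-style list stored as a string (for example `'["Wifi", "Kitchen"]'`); that format should be checked against the actual column before reusing this.

```python
import json

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy rows; the string-encoded list format is an assumption
toy = pd.DataFrame({
    "amenities": ['["Wifi", "Kitchen"]', '["Wifi"]', "[]"],
})

# Parse each string into a Python list of amenity names
parsed = toy["amenities"].apply(json.loads)

# One indicator column per distinct amenity (multi-hot encoding)
mlb = MultiLabelBinarizer()
amenity_flags = pd.DataFrame(
    mlb.fit_transform(parsed), columns=mlb.classes_, index=toy.index
)
```

A multi-hot encoding gives one column per amenity, whereas a plain one-hot encoding of the raw string would create a separate category for every distinct combination of amenities.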

In [None]:
# Splitting the data into training and test sets to estimate generalization error

from sklearn.model_selection import train_test_split

X = listings.drop("usd_price", axis=1)
y = listings["usd_price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((223769, 32), (55943, 32), (223769,), (55943,))

In [None]:
# Building preprocessing pipeline

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn import set_config
set_config(display='diagram')

cat_attribs = ['host_response_time', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'city', 'room_type', 'amenities', 'instant_bookable'] # TODO: decide whether host_since (e.g. binned by month) belongs here

num_attribs = ['host_response_rate', 'host_acceptance_rate', 'host_total_listings_count', 'accommodates', 'bedrooms', 'review_scores_rating', 'review_scores_accuracy', 'minimum_nights',
               'maximum_nights', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value'] # excluding latitude and longitude

preprocess_pipeline = ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), cat_attribs),
        ("num", StandardScaler(), num_attribs),
    ])

preprocess_pipeline
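A minimal sketch of how a pipeline like this could be fit on the training split and reused on the test split, shown on hypothetical toy frames. The `handle_unknown="ignore"` setting is an assumption added here so that a category seen only at test time does not raise an error; it is not part of the pipeline defined above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy splits; "Shared room" appears only in the test data
train = pd.DataFrame({
    "room_type": ["Entire place", "Private room", "Entire place"],
    "accommodates": [2, 4, 6],
})
test = pd.DataFrame({"room_type": ["Shared room"], "accommodates": [4]})

toy_pipeline = ColumnTransformer(
    [
        # Unseen categories are encoded as all zeros instead of raising
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
        ("num", StandardScaler(), ["accommodates"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

# Learn categories and scaling statistics from the training data only
train_prepared = toy_pipeline.fit_transform(train)

# Reuse the fitted transformers on the test data
test_prepared = toy_pipeline.transform(test)
```

Calling `fit_transform` on the training set and only `transform` on the test set keeps the test data out of the fitted statistics, which is what the train/test split is meant to protect.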