In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [2]:
df = pd.read_csv('listings.csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


The aim is to identify possibly relevant variables from the dataset that may differentiate between superhost listings and nonsuperhost listings for classification purposes. 

In addition, the criteria AirBnB uses to confer superhost status will be ignored, as the objective is to see whether any other variables help distinguish superhosts from non-superhosts. As such, the four criteria below (adapted from the AirBnB website) and their corresponding variables will not be used as predictors for this model:
- response rate (host_response_rate)
- cancellation rate (host_acceptance_rate?)
- overall rating (review_scores_rating)
- completed at least 10 trips / 3 reservations that total at least 100 nights (may not have a corresponding column)

For this model, image variables are not used as predictors, as they would require additional preprocessing to become useful features. Some of the textual variables, however, were used by taking their character count. Note that this dataset caps most textual variables at 1000 characters, truncating the text for some listings, so quite a few listings have a character count of exactly 1000.

Also, since the dataset does not come with an exact description of each column, some possibly useful predictors (e.g. guests_included, extra_people) were omitted because I was unsure what exactly they meant.

Neighbourhood was omitted, as the distribution of superhosts and non-superhosts can already be observed visually on a map of Seattle.

Extract the desired variables to be utilised (some may be dropped later; not all will be in the model)

In [3]:
df=df[['price','amenities','accommodates','security_deposit','cleaning_fee','availability_365','calculated_host_listings_count','reviews_per_month','minimum_nights','cancellation_policy','host_identity_verified','instant_bookable','host_is_superhost','host_since','last_review','first_review', "host_verifications", 'last_scraped', 'calendar_updated','number_of_reviews','host_response_time','space',"description",'neighborhood_overview','notes','host_about']]

In [4]:
df.shape

(3818, 26)

In [5]:
df.dtypes

price                              object
amenities                          object
accommodates                        int64
security_deposit                   object
cleaning_fee                       object
availability_365                    int64
calculated_host_listings_count      int64
reviews_per_month                 float64
minimum_nights                      int64
cancellation_policy                object
host_identity_verified             object
instant_bookable                   object
host_is_superhost                  object
host_since                         object
last_review                        object
first_review                       object
host_verifications                 object
last_scraped                       object
calendar_updated                   object
number_of_reviews                   int64
host_response_time                 object
space                              object
description                        object
neighborhood_overview             

In [6]:
df.isna().sum()/len(df)*100

price                              0.000000
amenities                          0.000000
accommodates                       0.000000
security_deposit                  51.126244
cleaning_fee                      26.977475
availability_365                   0.000000
calculated_host_listings_count     0.000000
reviews_per_month                 16.422211
minimum_nights                     0.000000
cancellation_policy                0.000000
host_identity_verified             0.052383
instant_bookable                   0.000000
host_is_superhost                  0.052383
host_since                         0.052383
last_review                       16.422211
first_review                      16.422211
host_verifications                 0.000000
last_scraped                       0.000000
calendar_updated                   0.000000
number_of_reviews                  0.000000
host_response_time                13.698271
space                             14.903091
description                     

For security_deposit and cleaning_fee, I'll assume that a null value means the listing charges no such fee. Thus I'll fill these with '$0.0' (the columns are still strings at this point).

In [7]:
df['security_deposit'] = df['security_deposit'].fillna('$0.0')
df['cleaning_fee'] = df['cleaning_fee'].fillna('$0.0')

In [8]:
df.isna().sum()/len(df)*100

price                              0.000000
amenities                          0.000000
accommodates                       0.000000
security_deposit                   0.000000
cleaning_fee                       0.000000
availability_365                   0.000000
calculated_host_listings_count     0.000000
reviews_per_month                 16.422211
minimum_nights                     0.000000
cancellation_policy                0.000000
host_identity_verified             0.052383
instant_bookable                   0.000000
host_is_superhost                  0.052383
host_since                         0.052383
last_review                       16.422211
first_review                      16.422211
host_verifications                 0.000000
last_scraped                       0.000000
calendar_updated                   0.000000
number_of_reviews                  0.000000
host_response_time                13.698271
space                             14.903091
description                     

Before I drop NaN values for the other columns, I'll first obtain the character count of each textual variable, counting missing text as 0 characters.

In [9]:
df['space'] = df['space'].astype(str)

In [10]:
for i in df["space"].index:
    if (df.at[i,"space"] == 'nan'):
        df.at[i,"space_char_count"] = 0
    else:
        df.at[i,"space_char_count"] = len(df.at[i,"space"])

In [11]:
df['space_char_count'].head()

0    1000.0
1    1000.0
2    1000.0
3       0.0
4     488.0
Name: space_char_count, dtype: float64

In [12]:
df['description'] = df['description'].astype(str)

In [13]:
for i in df["description"].index:
    if (df.at[i,"description"] == 'nan'):
        df.at[i,"description_char_count"] = 0
    else:
        df.at[i,"description_char_count"] = len(df.at[i,"description"])

In [14]:
df['description_char_count'].head()

0    1000.0
1    1000.0
2    1000.0
3     243.0
4    1000.0
Name: description_char_count, dtype: float64

In [15]:
df['neighborhood_overview'] = df['neighborhood_overview'].astype(str)

In [16]:
for i in df["neighborhood_overview"].index:
    if (df.at[i,"neighborhood_overview"] == 'nan'):
        df.at[i,"neighborhood_overview_char_count"] = 0
    else:
        df.at[i,"neighborhood_overview_char_count"] = len(df.at[i,"neighborhood_overview"])

In [17]:
df['neighborhood_overview_char_count'].head()

0      0.0
1    167.0
2    669.0
3      0.0
4    492.0
Name: neighborhood_overview_char_count, dtype: float64

In [18]:
df['notes'] = df['notes'].astype(str)

In [19]:
for i in df["notes"].index:
    if (df.at[i,"notes"] == 'nan'):
        df.at[i,"notes_char_count"] = 0
    else:
        df.at[i,"notes_char_count"] = len(df.at[i,"notes"])

In [20]:
df['notes_char_count'].head()

0       0.0
1    1000.0
2     155.0
3       0.0
4       9.0
Name: notes_char_count, dtype: float64

In [21]:
df['host_about'] = df['host_about'].astype(str)

In [22]:
for i in df["host_about"].index:
    if (df.at[i,"host_about"] == 'nan'):
        df.at[i,"host_about_char_count"] = 0
    else:
        df.at[i,"host_about_char_count"] = len(df.at[i,"host_about"])

In [23]:
df['host_about_char_count'].head()

0    372.0
1     74.0
2    343.0
3      0.0
4    354.0
Name: host_about_char_count, dtype: float64
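The five loops above all follow the same pattern, so they could be collapsed into one vectorised step. The snippet below is only a sketch on a made-up toy frame (`toy` and its values are assumptions, not the listings data). It relies on `Series.str.len()` returning NaN for missing text, so it has to run on the original columns, before any `astype(str)` conversion.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "space": ["Cozy room", None, "Loft"],
    "notes": [None, "No smoking", None],
})

text_cols = ["space", "notes"]
for col in text_cols:
    # .str.len() gives NaN for missing text; fillna(0) turns that into a count of 0.
    toy[col + "_char_count"] = toy[col].str.len().fillna(0).astype(int)

print(toy[["space_char_count", "notes_char_count"]])
```

Unlike the loops, this version stores the counts as integers rather than floats.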

Now we can drop the textual columns

In [24]:
df = df.drop(['space', 'description', 'neighborhood_overview', 'notes', 'host_about'], axis = 1)

Next, drop the remaining rows containing NaN values

In [25]:
df.isna().sum()/len(df)*100

price                                0.000000
amenities                            0.000000
accommodates                         0.000000
security_deposit                     0.000000
cleaning_fee                         0.000000
availability_365                     0.000000
calculated_host_listings_count       0.000000
reviews_per_month                   16.422211
minimum_nights                       0.000000
cancellation_policy                  0.000000
host_identity_verified               0.052383
instant_bookable                     0.000000
host_is_superhost                    0.052383
host_since                           0.052383
last_review                         16.422211
first_review                        16.422211
host_verifications                   0.000000
last_scraped                         0.000000
calendar_updated                     0.000000
number_of_reviews                    0.000000
host_response_time                  13.698271
space_char_count                  

In [26]:
df = df.dropna()

Final check for Null values

In [27]:
df.isnull().any()

price                               False
amenities                           False
accommodates                        False
security_deposit                    False
cleaning_fee                        False
availability_365                    False
calculated_host_listings_count      False
reviews_per_month                   False
minimum_nights                      False
cancellation_policy                 False
host_identity_verified              False
instant_bookable                    False
host_is_superhost                   False
host_since                          False
last_review                         False
first_review                        False
host_verifications                  False
last_scraped                        False
calendar_updated                    False
number_of_reviews                   False
host_response_time                  False
space_char_count                    False
description_char_count              False
neighborhood_overview_char_count  

Next up is data processing

In [28]:
df.dtypes

price                                object
amenities                            object
accommodates                          int64
security_deposit                     object
cleaning_fee                         object
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                  object
host_identity_verified               object
instant_bookable                     object
host_is_superhost                    object
host_since                           object
last_review                          object
first_review                         object
host_verifications                   object
last_scraped                         object
calendar_updated                     object
number_of_reviews                     int64
host_response_time                   object
space_char_count                    float64
description_char_count          

In [29]:
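# Strip the currency symbol and any thousands separators before converting to float
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)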

Create a new column, 'price_per_person', computed as 'price' / 'accommodates'

In [30]:
df['price_per_person'] = df['price'] / df['accommodates']

Drop 'accommodates' as we won't need it anymore

In [31]:
df = df.drop(['accommodates'], axis = 1)

In [32]:
df.dtypes

price                               float64
amenities                            object
security_deposit                     object
cleaning_fee                         object
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                  object
host_identity_verified               object
instant_bookable                     object
host_is_superhost                    object
host_since                           object
last_review                          object
first_review                         object
host_verifications                   object
last_scraped                         object
calendar_updated                     object
number_of_reviews                     int64
host_response_time                   object
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count

In [33]:
df['security_deposit'] = df['security_deposit'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

In [34]:
df['cleaning_fee'] = df['cleaning_fee'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
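Since price, security_deposit and cleaning_fee all need the same cleaning, the conversion could be factored into one helper. This is a minimal sketch: `parse_currency` is a hypothetical name and the sample values are made up. Passing `regex=False` makes sure `'$'` is treated as a literal character rather than a regular-expression anchor, whatever the pandas version.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Turn strings such as '$1,250.00' into floats (hypothetical helper)."""
    cleaned = (
        series.str.replace("$", "", regex=False)   # drop the currency symbol
              .str.replace(",", "", regex=False)   # drop thousands separators
    )
    return cleaned.astype(float)

fees = pd.Series(["$95.00", "$1,250.00", "$0.0"])  # made-up sample values
print(parse_currency(fees).tolist())
```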

In [35]:
df.dtypes

price                               float64
amenities                            object
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                  object
host_identity_verified               object
instant_bookable                     object
host_is_superhost                    object
host_since                           object
last_review                          object
first_review                         object
host_verifications                   object
last_scraped                         object
calendar_updated                     object
number_of_reviews                     int64
host_response_time                   object
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count

In [37]:
df['cancellation_policy'].value_counts()

strict      1169
moderate    1023
flexible     679
Name: cancellation_policy, dtype: int64

For ordered categorical variable 'cancellation_policy', change each category to a number where 1 is the least strict / most flexible and 3 is the strictest

In [36]:
dict_1 = {'flexible' : 1, 'moderate': 2, 'strict': 3}


In [40]:
df['cancellation_policy'] = df['cancellation_policy'].map(dict_1)
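One thing to keep in mind with `Series.map` and a dictionary is that any category missing from the dictionary silently becomes NaN. The sketch below shows a quick check for that on made-up values (the `super_strict` label is an assumption used only to trigger the case).

```python
import pandas as pd

policy_order = {"flexible": 1, "moderate": 2, "strict": 3}

# Made-up sample; "super_strict" is deliberately absent from the dictionary.
policies = pd.Series(["strict", "flexible", "moderate", "super_strict"])

encoded = policies.map(policy_order)

# Categories that the dictionary did not cover end up as NaN after mapping.
unmapped = policies[encoded.isna()].unique()
print(encoded.tolist())
print(unmapped)
```

If `unmapped` is empty, every category was covered and the encoded column can safely be treated as an integer scale.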

In [42]:
df.dtypes

price                               float64
amenities                            object
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                   int64
host_identity_verified               object
instant_bookable                     object
host_is_superhost                    object
host_since                           object
last_review                          object
first_review                         object
host_verifications                   object
last_scraped                         object
calendar_updated                     object
number_of_reviews                     int64
host_response_time                   object
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count

For the Boolean variables, which are stored as 't' / 'f' strings, map 't' to 1 and 'f' to 0 for use in the models later

In [43]:
dict_2 = {'t' : 1, 'f' : 0}

In [47]:
df['host_identity_verified'] = df['host_identity_verified'].map(dict_2)

In [48]:
df['instant_bookable'] = df['instant_bookable'].map(dict_2)

In [49]:
df['host_is_superhost'] = df['host_is_superhost'].map(dict_2)

In [50]:
df.dtypes

price                               float64
amenities                            object
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                   int64
host_identity_verified                int64
instant_bookable                      int64
host_is_superhost                     int64
host_since                           object
last_review                          object
first_review                         object
host_verifications                   object
last_scraped                         object
calendar_updated                     object
number_of_reviews                     int64
host_response_time                   object
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count

In [51]:
import datetime

#change the dates into a recognisable date format

df['host_since']=pd.to_datetime(df['host_since'])
df['first_review']=pd.to_datetime(df['first_review'])
df['last_review']=pd.to_datetime(df['last_review'])
df['last_scraped']=pd.to_datetime(df['last_scraped'])

Create a column 'listing_duration_days' (the time between the first and the last review) as a proxy for how long the listing has been active

In [54]:
df['listing_duration_days'] = df['last_review'] - df['first_review']
df['listing_duration_days'].head()

0   1523 days
1    862 days
2    400 days
4   1201 days
6    679 days
Name: listing_duration_days, dtype: timedelta64[ns]

Convert to int type

In [55]:
df["listing_duration_days"] = df["listing_duration_days"].dt.days
df['listing_duration_days'].head()

0    1523
1     862
2     400
4    1201
6     679
Name: listing_duration_days, dtype: int64

Create a column 'hosting_duration_days' (the time from 'host_since' to the last review) as a relative indicator of how long the host has been hosting (experience)

In [56]:
df = df.assign(hosting_duration_days = df['last_review'] - df['host_since'])

In [57]:
df['hosting_duration_days'].head()

0   1605 days
1   1041 days
2    448 days
4   1425 days
6   1286 days
Name: hosting_duration_days, dtype: timedelta64[ns]

In [58]:
df["hosting_duration_days"] = df["hosting_duration_days"].dt.days
df['hosting_duration_days'].head()

0    1605
1    1041
2     448
4    1425
6    1286
Name: hosting_duration_days, dtype: int64
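Both duration columns come from the same two steps: subtract two datetime columns, then take `.dt.days`. A compact sketch of that on a one-row toy frame (the dates are made up for illustration):

```python
import pandas as pd

# One made-up listing; the real notebook uses the parsed listings columns.
toy = pd.DataFrame({
    "host_since": ["2015-01-01"],
    "first_review": ["2015-02-01"],
    "last_review": ["2015-03-01"],
})
for col in toy.columns:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting datetimes gives a timedelta; .dt.days keeps the whole days as integers.
toy["listing_duration_days"] = (toy["last_review"] - toy["first_review"]).dt.days
toy["hosting_duration_days"] = (toy["last_review"] - toy["host_since"]).dt.days
print(toy[["listing_duration_days", "hosting_duration_days"]])
```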

In [59]:
df.dtypes

price                                      float64
amenities                                   object
security_deposit                           float64
cleaning_fee                               float64
availability_365                             int64
calculated_host_listings_count               int64
reviews_per_month                          float64
minimum_nights                               int64
cancellation_policy                          int64
host_identity_verified                       int64
instant_bookable                             int64
host_is_superhost                            int64
host_since                          datetime64[ns]
last_review                         datetime64[ns]
first_review                        datetime64[ns]
host_verifications                          object
last_scraped                        datetime64[ns]
calendar_updated                            object
number_of_reviews                            int64
host_response_time             

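'calendar_updated' holds relative phrases (for example 'today', 'yesterday', or a number followed by a unit of time), so it has to be converted into a usable form. First, a leading 'a ' (as in 'a week ago') is replaced with '1 '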

In [60]:
# Anchor the pattern so only a leading "a " (e.g. "a week ago") becomes "1 "
df["calendar_updated"] = df["calendar_updated"].str.replace(r'^a ', '1 ', regex=True)

In [61]:
from dateutil.relativedelta import relativedelta

In [62]:
def get_past_date(str_days_ago):
    # Take the scrape date from the first remaining row instead of relying on index label 0
    day_scraped = df["last_scraped"].iloc[0]
    splitted = str_days_ago.split()
    if len(splitted) == 1 and splitted[0].lower() == 'today':
        return str(day_scraped.isoformat())
    elif len(splitted) == 1 and splitted[0].lower() == 'yesterday':
        date = day_scraped - relativedelta(days=1)
        return str(date.isoformat())
    elif len(splitted) == 1 and splitted[0].lower() == 'never':
        return "never"
    elif splitted[1].lower() in ['day', 'days', 'd']:
        date = day_scraped - relativedelta(days=int(splitted[0]))
        return str(date.isoformat())
    elif splitted[1].lower() in ['wk', 'wks', 'week', 'weeks', 'w']:
        date = day_scraped - relativedelta(weeks=int(splitted[0]))
        return str(date.isoformat())
    elif splitted[1].lower() in ['mon', 'mons', 'month', 'months', 'm']:
        date = day_scraped - relativedelta(months=int(splitted[0]))
        return str(date.isoformat())
    elif splitted[1].lower() in ['yrs', 'yr', 'years', 'year', 'y']:
        date = day_scraped - relativedelta(years=int(splitted[0]))
        return str(date.isoformat())
    else:
        return "Wrong Argument format"

In [63]:
df["calendar_updated_temp"] = df["calendar_updated"].map(get_past_date)

Drop rows containing 'never' (only 3 rows)

In [67]:
df = df[df.calendar_updated_temp != 'never']

In [68]:
df["calendar_updated_temp"].value_counts()

2016-01-04T00:00:00    614
2015-12-21T00:00:00    280
2015-12-14T00:00:00    243
2015-12-28T00:00:00    225
2016-01-03T00:00:00    204
2015-11-04T00:00:00    180
2015-12-07T00:00:00    172
2016-01-01T00:00:00    137
2015-11-30T00:00:00    125
2015-12-31T00:00:00    118
2015-10-04T00:00:00    114
2015-12-30T00:00:00    113
2015-11-23T00:00:00     86
2016-01-02T00:00:00     81
2015-09-04T00:00:00     49
2015-11-16T00:00:00     45
2015-08-04T00:00:00     33
2015-12-29T00:00:00     22
2015-06-04T00:00:00     10
2015-07-04T00:00:00     10
2015-03-04T00:00:00      3
2015-05-04T00:00:00      2
2015-04-04T00:00:00      1
2014-11-04T00:00:00      1
Name: calendar_updated_temp, dtype: int64

In [70]:
df["calendar_updated_temp"]=pd.to_datetime(df["calendar_updated_temp"])


Create a new column, 'days_since_calendar_updated', to replace 'calendar_updated', where the data is in a more usable form

In [71]:
df['days_since_calendar_updated'] = df['last_scraped'] - df["calendar_updated_temp"]

In [72]:
df['days_since_calendar_updated'] = df['days_since_calendar_updated'].dt.days

In [73]:
df['days_since_calendar_updated'].head()

0    28
1     0
2    35
4    49
6    35
Name: days_since_calendar_updated, dtype: int64
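The conversion above goes from phrase to ISO string to datetime to day count. As an alternative, the day count could be computed directly, with the scrape date passed in as an argument rather than read from `df`. This is only a sketch: `days_since_update` and `UNITS` are hypothetical names, and the scrape date 2016-01-04 is taken from the `last_scraped` values shown earlier.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Singular unit word -> relativedelta keyword argument.
UNITS = {"day": "days", "week": "weeks", "month": "months", "year": "years"}

def days_since_update(text, scraped):
    """Days between `scraped` and a relative phrase such as '3 weeks ago'.

    Returns None for 'never' or anything that cannot be parsed.
    """
    words = text.lower().split()
    if words == ["today"]:
        return 0
    if words == ["yesterday"]:
        return 1
    if len(words) < 2:
        return None
    amount = "1" if words[0] == "a" else words[0]
    unit = UNITS.get(words[1].rstrip("s"))
    if unit is None or not amount.isdigit():
        return None
    # Months and years are calendar-based, so step back on the calendar first.
    past = scraped - relativedelta(**{unit: int(amount)})
    return (scraped - past).days

scraped = date(2016, 1, 4)
print(days_since_update("a week ago", scraped))
print(days_since_update("2 months ago", scraped))
print(days_since_update("never", scraped))
```

Returning None for unparseable phrases means they show up as missing values that can be dropped, instead of as a string that would later fail to parse as a date.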

In [74]:
df.dtypes

price                                      float64
amenities                                   object
security_deposit                           float64
cleaning_fee                               float64
availability_365                             int64
calculated_host_listings_count               int64
reviews_per_month                          float64
minimum_nights                               int64
cancellation_policy                          int64
host_identity_verified                       int64
instant_bookable                             int64
host_is_superhost                            int64
host_since                          datetime64[ns]
last_review                         datetime64[ns]
first_review                        datetime64[ns]
host_verifications                          object
last_scraped                        datetime64[ns]
calendar_updated                            object
number_of_reviews                            int64
host_response_time             

Drop the unneeded columns:

In [75]:
df = df.drop(['calendar_updated_temp','calendar_updated'], axis = 1)

In [76]:
df.dtypes

price                                      float64
amenities                                   object
security_deposit                           float64
cleaning_fee                               float64
availability_365                             int64
calculated_host_listings_count               int64
reviews_per_month                          float64
minimum_nights                               int64
cancellation_policy                          int64
host_identity_verified                       int64
instant_bookable                             int64
host_is_superhost                            int64
host_since                          datetime64[ns]
last_review                         datetime64[ns]
first_review                        datetime64[ns]
host_verifications                          object
last_scraped                        datetime64[ns]
number_of_reviews                            int64
host_response_time                          object
space_char_count               

Since 'host_verifications' is stored as a string that looks like a list, count the number of entries in each 'list' by counting the commas (an empty 'list' counts as 0)

In [77]:
for i in df["host_verifications"].index:
    if len(df.at[i,"host_verifications"]) > 2 :
        count = 1
        for char in df.at[i,"host_verifications"]:
            if char == ',':
                count+=1
    else:
        count = 0
    df.at[i,"no_of_host_verifications"] = count
    count = 0

In [78]:
df["no_of_host_verifications"].value_counts()

5.0    1188
4.0     716
6.0     528
3.0     265
7.0     132
2.0      23
8.0      15
1.0       1
Name: no_of_host_verifications, dtype: int64
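The comma-counting loop could also be written as a small function applied with `Series.map`. The sketch below assumes the values look like `['email', 'phone']` or `{TV,Internet}`; that format, the `count_items` name and the sample values are assumptions. It also treats a literal `None` string as zero entries, which the loop above would count as one.

```python
import pandas as pd

def count_items(text):
    """Count comma-separated entries in a bracketed string (assumed format)."""
    inner = text.strip("[]{} ")
    if not inner or inner.lower() == "none":
        return 0
    # n commas separate n + 1 entries.
    return inner.count(",") + 1

# Made-up sample values in the assumed format.
verifications = pd.Series(["['email', 'phone', 'reviews']", "[]", "['email']"])
counts = verifications.map(count_items)
print(counts.tolist())
```

Like the loop, this over-counts if an entry itself contains a comma.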

In [79]:
df = df.drop(['host_verifications'], axis = 1)

Next, check for and remove any anomalies. Because **at least 100 days have to elapse before superhost can be awarded**, I will exclude listings from new accounts: their hosts have not yet had the chance to qualify, so their non-superhost status is not an accurate indicator of whether they meet the superhost requirements.

In [80]:
df['account_duration_days'] = df['last_scraped'] - df['host_since']

In [81]:
df["account_duration_days"] = df["account_duration_days"].dt.days

In [82]:
df['evaluation_period_elapsed'] = df['account_duration_days'] >= 100

In [83]:
pd.crosstab(df["evaluation_period_elapsed"],df["host_is_superhost"])

host_is_superhost,0,1
evaluation_period_elapsed,Unnamed: 1_level_1,Unnamed: 2_level_1
False,65,0
True,2084,719


As one can see, none of the superhosts have an AirBnB account that is less than 100 days old. I will drop the rows where the account is less than 100 days old:

In [86]:
df = df[df.account_duration_days >= 100]

In [87]:
pd.crosstab(df["evaluation_period_elapsed"],df["host_is_superhost"])

host_is_superhost,0,1
evaluation_period_elapsed,Unnamed: 1_level_1,Unnamed: 2_level_1
True,2084,719
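To make the boundary explicit: the filter keeps accounts that are exactly 100 days old and drops anything younger. A tiny sketch on made-up values:

```python
import pandas as pd

# Made-up account ages around the 100-day threshold.
toy = pd.DataFrame({
    "account_duration_days": [30, 99, 100, 450],
    "host_is_superhost": [0, 0, 0, 1],
})

# >= keeps the boundary value 100 and removes 30 and 99.
mask = toy["account_duration_days"] >= 100
kept = toy[mask]
print(kept["account_duration_days"].tolist())
```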


In [88]:
df.dtypes

price                                      float64
amenities                                   object
security_deposit                           float64
cleaning_fee                               float64
availability_365                             int64
calculated_host_listings_count               int64
reviews_per_month                          float64
minimum_nights                               int64
cancellation_policy                          int64
host_identity_verified                       int64
instant_bookable                             int64
host_is_superhost                            int64
host_since                          datetime64[ns]
last_review                         datetime64[ns]
first_review                        datetime64[ns]
last_scraped                        datetime64[ns]
number_of_reviews                            int64
host_response_time                          object
space_char_count                           float64
description_char_count         

In [89]:
df = df.drop(['evaluation_period_elapsed','last_scraped','first_review','last_review','host_since'], axis = 1)

In [90]:
df.dtypes

price                               float64
amenities                            object
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                   int64
host_identity_verified                int64
instant_bookable                      int64
host_is_superhost                     int64
number_of_reviews                     int64
host_response_time                   object
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count    float64
notes_char_count                    float64
host_about_char_count               float64
price_per_person                    float64
listing_duration_days                 int64
hosting_duration_days                 int64
days_since_calendar_updated     

In [91]:
pd.crosstab(df['host_response_time'],df["host_is_superhost"], normalize='columns')

host_is_superhost,0,1
host_response_time,Unnamed: 1_level_1,Unnamed: 2_level_1
a few days or more,0.012956,0.001391
within a day,0.196257,0.075104
within a few hours,0.301823,0.269819
within an hour,0.488964,0.653686
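With `normalize='columns'`, each column of the crosstab sums to 1, so the values are the share of non-superhosts (column 0) and of superhosts (column 1) that fall into each response-time category. A small sketch on made-up rows shows the effect:

```python
import pandas as pd

# Four made-up listings: two non-superhosts (0) and two superhosts (1).
toy = pd.DataFrame({
    "host_response_time": ["within an hour", "within an hour",
                           "within a day", "within an hour"],
    "host_is_superhost": [0, 1, 0, 1],
})

shares = pd.crosstab(toy["host_response_time"], toy["host_is_superhost"],
                     normalize="columns")
print(shares)
```

Here half of the non-superhosts respond within an hour, against all of the superhosts, and each column adds up to 1.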


'host_response_time' is also an ordered categorical variable, so it is likewise converted to a number: 1 for 'within an hour' (the fastest), 2 for 'within a few hours', 3 for 'within a day' and 4 for 'a few days or more' (the slowest).

In [92]:
dict_3 = {'within an hour' : 1, 'within a few hours': 2, 'within a day': 3, 'a few days or more': 4}

In [97]:
df['host_response_time'] = df['host_response_time'].map(dict_3)

In [98]:
df.dtypes

price                               float64
amenities                            object
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                   int64
host_identity_verified                int64
instant_bookable                      int64
host_is_superhost                     int64
number_of_reviews                     int64
host_response_time                    int64
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count    float64
notes_char_count                    float64
host_about_char_count               float64
price_per_person                    float64
listing_duration_days                 int64
hosting_duration_days                 int64
days_since_calendar_updated     

Lastly, for amenities, a count of the amenities for each listing is obtained first:

In [99]:
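# Count amenities per listing: entries are comma-separated, so count = commas + 1.
for i in df["amenities"].index:
    amenities = df.at[i, "amenities"]
    if len(amenities) > 2:
        # More than two characters means at least one amenity is listed
        count = amenities.count(",") + 1
    else:
        # Two characters or fewer is treated as an empty amenities list
        count = 0
    df.at[i, "amenities_count"] = count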
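
As an alternative to the explicit loop, the same count can be sketched with vectorized string methods. This assumes, as the `len(...) > 2` check suggests, that each `amenities` value is a brace-wrapped, comma-separated string such as `{TV,Internet}` and that an empty list is `{}`. The `toy` data below is invented for illustration.

```python
import pandas as pd

# Invented examples of the assumed amenities format.
toy = pd.DataFrame({"amenities": ['{TV,"Cable TV",Internet}', "{}", "{Kitchen}"]})

# Number of commas plus one gives the number of entries...
counts = toy["amenities"].str.count(",") + 1
# ...except for strings of length <= 2 (e.g. "{}"), which hold no amenities.
counts = counts.where(toy["amenities"].str.len() > 2, 0)

toy["amenities_count"] = counts.astype(int)
print(toy["amenities_count"].tolist())
```

Unlike the loop, this version produces an integer column directly, so no later float-to-int conversion would be needed for it.
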

Some preliminary calculations also suggest that certain amenities are more common among superhosts than among non-superhosts. Each of these is extracted into a new 1/0 (true/false) column.

In [101]:
df["Dog(s)"] = df["amenities"].map(lambda x: 1 if "Dog(s)" in x else 0)

In [102]:
pd.crosstab(df["Dog(s)"],df["host_is_superhost"], normalize = 'columns')

host_is_superhost,0,1
Dog(s),Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.885797,0.808067
1,0.114203,0.191933


In [103]:
df["Pets live on this property"] = df["amenities"].map(lambda x: 1 if "Pets live on this property" in x else 0)

In [104]:
pd.crosstab(df["Pets live on this property"],df["host_is_superhost"], normalize = 'columns')

host_is_superhost,0,1
Pets live on this property,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0.793186,0.682893
1,0.206814,0.317107
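
The two flag columns follow the same pattern, so they could also be built in one pass. The sketch below is one possible way to do that on invented data; it uses `Series.str.contains` with `regex=False` rather than the `lambda` above, since "Dog(s)" contains parentheses that a regular expression would interpret as a group.

```python
import pandas as pd

# Invented listings; only the two amenity names come from the notebook.
toy = pd.DataFrame({
    "amenities": [
        '{TV,"Dog(s)","Pets live on this property"}',
        "{TV}",
        '{"Pets live on this property"}',
    ],
    "host_is_superhost": [1, 0, 1],
})

for name in ["Dog(s)", "Pets live on this property"]:
    # Plain substring match, encoded as 1/0.
    toy[name] = toy["amenities"].str.contains(name, regex=False).astype(int)

# Share of each flag value within each superhost group (columns sum to 1).
shares = pd.crosstab(toy["Dog(s)"], toy["host_is_superhost"], normalize="columns")
print(shares)
```
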


In [105]:
df.dtypes

price                               float64
amenities                            object
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                   int64
host_identity_verified                int64
instant_bookable                      int64
host_is_superhost                     int64
number_of_reviews                     int64
host_response_time                    int64
space_char_count                    float64
description_char_count              float64
neighborhood_overview_char_count    float64
notes_char_count                    float64
host_about_char_count               float64
price_per_person                    float64
listing_duration_days                 int64
hosting_duration_days                 int64
days_since_calendar_updated     

In [106]:
df = df.drop(["amenities"], axis = 1)

Finally, convert the count columns from float to int:

In [107]:
df["amenities_count"] = df["amenities_count"].astype(int)

df["space_char_count"] = df["space_char_count"].astype(int)

df["description_char_count"] = df["description_char_count"].astype(int)

df["neighborhood_overview_char_count"] = df["neighborhood_overview_char_count"].astype(int)

df["notes_char_count"] = df["notes_char_count"].astype(int)

df["host_about_char_count"] = df["host_about_char_count"].astype(int)

In [109]:
df["no_of_host_verifications"] = df["no_of_host_verifications"].astype(int)
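
The repeated conversions can also be written as a single `astype` call with a dictionary. A minimal sketch on an invented frame, assuming the count columns contain no missing values (casting NaN to `int` raises an error, hence the check):

```python
import pandas as pd

# Invented values; column names mirror a few of those in the notebook.
toy = pd.DataFrame({
    "amenities_count": [10.0, 16.0],
    "space_char_count": [1000.0, 488.0],
    "price": [85.0, 150.0],
})

count_cols = ["amenities_count", "space_char_count"]

# Casting NaN to int raises, so check for missing values first.
has_missing = bool(toy[count_cols].isna().any().any())

if not has_missing:
    # Only the listed columns change dtype; "price" stays float.
    toy = toy.astype({col: "int64" for col in count_cols})

print(toy.dtypes)
```
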

In [112]:
df = df.rename(columns={"Dog(s)": "Dog(s) present"})

In [113]:
df.dtypes

price                               float64
security_deposit                    float64
cleaning_fee                        float64
availability_365                      int64
calculated_host_listings_count        int64
reviews_per_month                   float64
minimum_nights                        int64
cancellation_policy                   int64
host_identity_verified                int64
instant_bookable                      int64
host_is_superhost                     int64
number_of_reviews                     int64
host_response_time                    int64
space_char_count                      int64
description_char_count                int64
neighborhood_overview_char_count      int64
notes_char_count                      int64
host_about_char_count                 int64
price_per_person                    float64
listing_duration_days                 int64
hosting_duration_days                 int64
days_since_calendar_updated           int64
no_of_host_verifications        

cancellation_policy, host_identity_verified, instant_bookable, host_is_superhost, host_response_time, Dog(s) present and Pets live on this property are categorical variables that have been encoded as integer labels.

In [114]:
df.to_csv('~/Downloads/cleaned_df.csv', index=False, header=True)

As an additional step, the character counts of the text variables are discretized into bins.

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [2]:
df = pd.read_csv('cleaned_df.csv')
df.head()

Unnamed: 0,price,security_deposit,cleaning_fee,availability_365,calculated_host_listings_count,reviews_per_month,minimum_nights,cancellation_policy,host_identity_verified,instant_bookable,...,host_about_char_count,price_per_person,listing_duration_days,hosting_duration_days,days_since_calendar_updated,no_of_host_verifications,account_duration_days,amenities_count,Dog(s) present,Pets live on this property
0,85.0,0.0,0.0,346,2,4.07,1,2,1,0,...,372,21.25,1523,1605,28,4,1607,10,0,0
1,150.0,100.0,40.0,291,6,1.48,2,3,1,0,...,74,37.5,862,1041,0,6,1047,16,0,0
2,975.0,1000.0,300.0,220,2,1.15,4,3,1,0,...,343,88.636364,400,448,35,5,571,21,1,1
3,450.0,700.0,125.0,365,1,0.89,1,3,1,0,...,354,75.0,1201,1425,49,5,1497,13,0,0
4,80.0,150.0,0.0,346,1,2.46,3,2,1,0,...,229,40.0,679,1286,35,4,1314,7,0,0


In [3]:
df['host_about_char_count_grouped'] = pd.cut(df['host_about_char_count'], range(0, 1001, 50), precision=0, include_lowest = True)


Note, however, that 'host_about' does not have a 1000-character limit.

In [5]:
df.dtypes

price                                float64
security_deposit                     float64
cleaning_fee                         float64
availability_365                       int64
calculated_host_listings_count         int64
reviews_per_month                    float64
minimum_nights                         int64
cancellation_policy                    int64
host_identity_verified                 int64
instant_bookable                       int64
host_is_superhost                      int64
number_of_reviews                      int64
host_response_time                     int64
space_char_count                       int64
description_char_count                 int64
neighborhood_overview_char_count       int64
notes_char_count                       int64
host_about_char_count                  int64
price_per_person                     float64
listing_duration_days                  int64
hosting_duration_days                  int64
days_since_calendar_updated            int64
no_of_host

In [4]:
df['space_char_count_grouped'] = pd.cut(df['space_char_count'], range(0, 1001, 50), precision=0, include_lowest = True)


In [5]:
df['description_char_count_grouped'] = pd.cut(df['description_char_count'], range(0, 1001, 50), precision=0, include_lowest = True)


In [6]:
df['neighborhood_overview_char_count_grouped'] = pd.cut(df['neighborhood_overview_char_count'], range(0, 1001, 50), precision=0, include_lowest = True)


In [7]:
df['notes_char_count_grouped'] = pd.cut(df['notes_char_count'], range(0, 1001, 50), precision=0, include_lowest = True)


In [8]:
df = df.drop(['host_about_char_count_grouped'], axis = 1)

'host_about' does not have a 1000-character limit, however, so it is not discretized for now.
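
The reason the 0 to 1000 bins do not suit an unbounded field can be seen on a small example: `pd.cut` leaves any value outside the outermost bin edges unassigned (NaN). The lengths below are invented, with 1500 standing in for a text longer than 1000 characters.

```python
import pandas as pd

lengths = pd.Series([0, 372, 1000, 1500])  # 1500 is a made-up over-limit value

binned = pd.cut(lengths, range(0, 1001, 50), include_lowest=True)

# Values outside the outermost edges are not assigned to any bin.
out_of_range = binned.isna().tolist()
print(out_of_range)
```
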

In [9]:
df.dtypes

price                                        float64
security_deposit                             float64
cleaning_fee                                 float64
availability_365                               int64
calculated_host_listings_count                 int64
reviews_per_month                            float64
minimum_nights                                 int64
cancellation_policy                            int64
host_identity_verified                         int64
instant_bookable                               int64
host_is_superhost                              int64
number_of_reviews                              int64
host_response_time                             int64
space_char_count                               int64
description_char_count                         int64
neighborhood_overview_char_count               int64
notes_char_count                               int64
host_about_char_count                          int64
price_per_person                             f

In [10]:
df.to_csv('~/Downloads/cleaned_df_v2.csv', index=False, header=True)
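
One thing to keep in mind when saving at this stage: the `_grouped` columns created by `pd.cut` without `labels` hold interval categories, and a CSV file stores them as text. A minimal sketch with invented data, using an in-memory buffer in place of a real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"notes_char_count": [0, 82, 1000]})
toy["notes_char_count_grouped"] = pd.cut(
    toy["notes_char_count"], range(0, 1001, 50), precision=0, include_lowest=True
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The interval bins come back as plain text, not as a categorical.
first_value = restored["notes_char_count_grouped"].iloc[0]
print(type(first_value).__name__, restored["notes_char_count_grouped"].tolist())
```
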

The non-discretized character-count variables for the text fields may be dropped later.

In [1]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [2]:
df = pd.read_csv('cleaned_df_final.csv')
df.head()

Unnamed: 0,host_is_superhost,host_identity_verified,Dog(s) present,Pets live on this property,cancellation_policy,no_of_host_verifications,host_response_time,number_of_reviews,amenities_count,calculated_host_listings_count,...,days_since_calendar_updated,account_duration_days,space_char_count,neighborhood_overview_char_count,notes_char_count,host_about_char_count,space_char_count_grouped,description_char_count_grouped,neighborhood_overview_char_count_grouped,notes_char_count_grouped
0,0,1,0,0,2,4,2,207,10,2,...,28,1607,1000,0,0,372,20,20,1,1
1,1,1,0,0,3,6,1,43,16,6,...,0,1047,1000,167,1000,74,20,20,4,20
2,0,1,1,1,3,5,2,20,21,2,...,35,571,1000,669,155,343,20,20,14,4
3,0,1,0,0,3,5,1,38,13,1,...,49,1497,488,492,9,354,10,20,10,1
4,1,1,0,0,2,4,1,58,7,1,...,35,1314,1000,95,82,229,20,20,2,2


In [3]:
df.columns

Index(['host_is_superhost', 'host_identity_verified', 'Dog(s) present',
       'Pets live on this property', 'cancellation_policy',
       'no_of_host_verifications', 'host_response_time', 'number_of_reviews',
       'amenities_count', 'calculated_host_listings_count',
       'reviews_per_month', 'price', 'security_deposit', 'cleaning_fee',
       'availability_365', 'listing_duration_days', 'hosting_duration_days',
       'price_per_person', 'days_since_calendar_updated',
       'account_duration_days', 'space_char_count',
       'neighborhood_overview_char_count', 'notes_char_count',
       'host_about_char_count', 'space_char_count_grouped',
       'description_char_count_grouped',
       'neighborhood_overview_char_count_grouped', 'notes_char_count_grouped'],
      dtype='object')

Encode each bin with a numerical label: **1 = 0-50, 2 = 51-100, 3 = 101-150, ..., 20 = 951-1000**
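
For integer character counts between 0 and 1000, this labelling is equivalent to the closed form label = max(1, ceil(count / 50)). The sketch below is only a cross-check of that arithmetic against `pd.cut` on every integer length in the range; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

lengths = pd.Series(range(0, 1001))  # every integer length from 0 to 1000

# Bin labels as produced by pd.cut with 20 equal-width bins.
cut_labels = pd.cut(
    lengths, range(0, 1001, 50), include_lowest=True, labels=list(range(1, 21))
).astype(int)

# Closed-form equivalent for integer lengths in [0, 1000].
formula_labels = np.maximum(1, np.ceil(lengths / 50)).astype(int)

print((cut_labels == formula_labels).all())
```
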

In [16]:
df['space_char_count_grouped'] = pd.cut(df['space_char_count'], range(0, 1001, 50), precision=0, include_lowest=True,
                                        labels=list(range(1, 21)))  # labels 1..20, one per 50-character bin


In [21]:
df['notes_char_count_grouped'] = pd.cut(df['notes_char_count'], range(0, 1001, 50), precision=0, include_lowest=True,
                                        labels=list(range(1, 21)))  # labels 1..20, one per 50-character bin


In [33]:
df['neighborhood_overview_char_count_grouped'] = pd.cut(df['neighborhood_overview_char_count'], range(0, 1001, 50), precision=0, include_lowest=True,
                                                        labels=list(range(1, 21)))  # labels 1..20, one per 50-character bin

In [35]:
df2 = pd.read_csv('cleaned_df_v2.csv')

In [39]:
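# df has no 'description_char_count' column, so the counts are taken from df2 (cleaned_df_v2.csv);
# the assignment matches rows of df and df2 by index.
df['description_char_count_grouped'] = pd.cut(df2['description_char_count'], range(0, 1001, 50), precision=0, include_lowest=True,
                                              labels=list(range(1, 21)))  # labels 1..20, one per 50-character bin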
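
The cell above fills a column of `df` from a different frame, `df2`. pandas aligns such an assignment on the index rather than on row position, so it only does what is intended when both frames carry the same rows under the same index labels. A minimal sketch of that behaviour on invented data, with a check one might run first (`left` and `right` are hypothetical stand-ins for `df` and `df2`):

```python
import pandas as pd

left = pd.DataFrame({"a": [10, 20, 30]})                       # index 0, 1, 2
right = pd.DataFrame({"b": [1.0, 2.0, 3.0]}, index=[2, 1, 0])  # same labels, reversed order

# Guard: both frames carry the same index labels.
same_rows = left.index.sort_values().equals(right.index.sort_values())

# Assignment matches rows by index label, not by position.
left["b"] = right["b"]
print(same_rows, left["b"].tolist())
```

If the two frames had been filtered or re-sorted differently, labels present in only one of them would produce NaN in the new column.
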


In [43]:
df.to_csv('~/Downloads/cleaned_df_v3.1.csv', index=False, header=True)

After feature selection:

In [5]:
df = df.drop(['host_identity_verified', 'Dog(s) present','Pets live on this property','description_char_count_grouped','cancellation_policy'], axis = 1)


In [6]:
df.shape

(2803, 23)
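
Assuming the frame loaded in this session has the same 28 columns as the `df.columns` output shown earlier, the shape is consistent: 28 columns minus the 5 dropped leaves 23. A small sketch of the same bookkeeping on a toy frame (the column names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({"keep_1": [1, 2], "keep_2": [3, 4], "drop_me": [5, 6]})
to_drop = ["drop_me"]

n_before = toy.shape[1]
# drop raises a KeyError by default if a listed column is absent.
toy = toy.drop(columns=to_drop)

print(toy.shape, n_before - len(to_drop))
```
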

In [7]:
df.to_csv('~/Downloads/cleaned_df_dropped_fts.csv', index=False, header=True)