# 3. Data Modeling

## 3.1 Import Data

**Analysis Questions:**

Q1. From a traveler's perspective, does a "superhost" enhance the guest experience?

Q2. What features have the most influence on the success and profitability of an Airbnb listing from an investor's standpoint?

Q3. How significantly do customer reviews influence the booking frequency of a listing?

In [1]:
# Import packages
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore')
import statsmodels.stats.multitest as multi


import func

pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 100)

In [2]:
# Import cleaned dataframes
df_listings = pd.read_pickle('../data/listings.pkl')
df_reviews = pd.read_pickle('../data/reviews.pkl')

In [3]:
# One-hot encode categorical features
cat_cols = df_listings.select_dtypes('category').columns.tolist()

for col in cat_cols:
    dummies = pd.get_dummies(df_listings[col], prefix=col)
    df_listings = pd.concat([df_listings, dummies], axis=1)
    df_listings.drop(col, axis=1, inplace=True)
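
The loop above can also be written as a single `pd.get_dummies` call. The following is a minimal sketch on a toy frame; the column names and values are made up for illustration and are not taken from the listings data.

```python
import pandas as pd

# Toy frame standing in for df_listings (illustrative values only).
toy = pd.DataFrame({
    "price": [100, 80, 120],
    "room_type": pd.Categorical(["Private room", "Shared room", "Private room"]),
    "bed_type": pd.Categorical(["Real Bed", "Futon", "Real Bed"]),
})

cat_cols = toy.select_dtypes("category").columns.tolist()

# Passing columns= encodes every listed column with a "<col>_" prefix
# and drops the original columns in the same step.
encoded = pd.get_dummies(toy, columns=cat_cols)

print(encoded.columns.tolist())
```

This avoids mutating the frame inside the loop, at the cost of being slightly less explicit about what happens to each column.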

In [4]:
# Merge listings with reviews; cast listing_id to str to match the join key `id`
df_reviews['listing_id'] = df_reviews['listing_id'].astype(str)
df_full = df_listings.merge(df_reviews, left_on='id', right_on='listing_id')
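
The cast on `listing_id` suggests that `id` is stored as a string in the listings frame, so both join keys need the same dtype. The sketch below uses toy frames to show the same pattern with two optional sanity checks, `validate` and `indicator`, which the original merge does not use. It also uses `how="left"` instead of the default inner join, purely so that unmatched listings stay visible.

```python
import pandas as pd

# Toy stand-ins for df_listings and df_reviews (illustrative values only).
listings = pd.DataFrame({"id": ["1", "2", "3"], "price": [100, 80, 120]})
reviews = pd.DataFrame({
    "listing_id": [1, 1, 3],
    "comments": ["great", "ok", "nice"],
})

# Align the key dtype before merging (string on both sides).
reviews["listing_id"] = reviews["listing_id"].astype(str)

# validate= raises if a listing id is duplicated on the left side;
# indicator=True adds a _merge column showing where each row came from.
merged = listings.merge(
    reviews,
    left_on="id",
    right_on="listing_id",
    how="left",
    validate="one_to_many",
    indicator=True,
)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` are listings without any review. An inner join, as used in the cell above, would drop them silently.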

## 3.2 From a traveler's perspective, does a "superhost" enhance the guest experience?

In [5]:
df_listings.columns

Index(['host_is_superhost', 'host_about', 'host_response_time',
       'host_response_rate', 'host_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'calculated_host_listings_count', 'id',
       'name',
       ...
       'room_type_Shared room', 'bed_type_Airbed', 'bed_type_Couch',
       'bed_type_Futon', 'bed_type_Pull-out Sofa', 'bed_type_Real Bed',
       'cancellation_policy_flexible', 'cancellation_policy_moderate',
       'cancellation_policy_strict', 'cancellation_policy_super_strict_30'],
      dtype='object', length=116)

In [6]:
group = 'host_is_superhost'
num_cols = df_listings.select_dtypes(exclude=['object']).columns
test_results = []

for col in num_cols:
    # Skip the grouping column itself
    if col == group:
        continue
    test_results.append(func.bootstrap_t_pvalue(df_listings, group, col))

# Collect the bootstrap t-test results in a DataFrame
test_results = pd.DataFrame(test_results, columns=['feature', 'pvalue', 'statistics'])
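
`func.bootstrap_t_pvalue` lives in a local module that is not shown here, so its exact implementation is unknown. As a rough illustration of what a bootstrap t-test can look like, the sketch below resamples two groups under the null hypothesis of equal means and compares the resampled Welch t statistics with the observed one. The function names, signature, and toy data are assumptions, not the notebook's actual helper.

```python
import numpy as np


def welch_t(a, b):
    """Welch's t statistic for two 1-D samples."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def bootstrap_t_pvalue_sketch(a, b, n_boot=2000, seed=0):
    """Hypothetical helper: two-sided bootstrap p-value for a mean difference."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t_obs = welch_t(a, b)

    # Shift both groups to the pooled mean so the null hypothesis holds.
    pooled_mean = np.concatenate([a, b]).mean()
    a0 = a - a.mean() + pooled_mean
    b0 = b - b.mean() + pooled_mean

    t_boot = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        t_boot[i] = welch_t(rng.choice(a0, a0.size), rng.choice(b0, b0.size))

    # Share of resampled statistics at least as extreme as the observed one.
    pvalue = float(np.mean(np.abs(t_boot) >= abs(t_obs)))
    return pvalue, float(t_obs)


# Toy data: two groups of review scores with different means.
data_rng = np.random.default_rng(1)
superhost_scores = data_rng.normal(95, 5, 200)
other_scores = data_rng.normal(90, 5, 200)

p, t = bootstrap_t_pvalue_sketch(superhost_scores, other_scores)
print(p, t)
```

With a finite number of resamples, the smallest nonzero p-value this estimator can return is `1 / n_boot`. A reported p-value of exactly 0.0 therefore means that no resample was as extreme as the observed statistic, not that the true p-value is zero.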

In [12]:
# Label significance after Bonferroni correction
multitest_result = multi.multipletests(test_results.pvalue, method="bonferroni")
test_results['significant'], test_results['adjusted_pvalue'] = multitest_result[0], multitest_result[1]
(test_results
 .sort_values(['significant', 'pvalue'], ascending=[False, True])
 .style.bar(subset='statistics', align='zero', color=['#d65f5f', '#5fba7d']))

| | feature | pvalue | statistics | significant | adjusted_pvalue |
|---|---|---|---|---|---|
| 1 | host_response_rate | 0.0 | 14.092996 | True | 0.0 |
| 3 | host_has_profile_pic | 0.0 | 2.452137 | True | 0.0 |
| 4 | host_identity_verified | 0.0 | 7.996837 | True | 0.0 |
| 17 | extra_people | 0.0 | 4.115656 | True | 0.0 |
| 20 | number_of_reviews | 0.0 | 10.096419 | True | 0.0 |
| 21 | review_scores_rating | 0.0 | 23.96347 | True | 0.0 |
| 22 | review_scores_accuracy | 0.0 | 21.52845 | True | 0.0 |
| 23 | review_scores_cleanliness | 0.0 | 20.308684 | True | 0.0 |
| 24 | review_scores_checkin | 0.0 | 17.516334 | True | 0.0 |
| 25 | review_scores_communication | 0.0 | 18.979648 | True | 0.0 |
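
For reference, the Bonferroni method multiplies each p-value by the number of tests (capped at 1) and rejects a null hypothesis when the raw p-value is at most `alpha / m`. The plain-Python sketch below mirrors that arithmetic on made-up p-values; it is an illustration only, not a replacement for `multipletests`.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return (reject flags, adjusted p-values) under Bonferroni correction."""
    m = len(pvalues)
    # Adjusted p-value: raw p-value times the number of tests, capped at 1.
    adjusted = [min(p * m, 1.0) for p in pvalues]
    # Reject when the raw p-value is at most the per-test threshold alpha / m.
    reject = [p <= alpha / m for p in pvalues]
    return reject, adjusted


# Made-up p-values for four hypothetical features.
raw = [0.001, 0.02, 0.04, 0.3]
reject, adjusted = bonferroni(raw)
print(reject)
print(adjusted)
```

A p-value of 0.0 stays at 0.0 after the correction, which is why the `pvalue` and `adjusted_pvalue` columns above are identical for the rows shown.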


In [9]:
# About 43% of the significant features are amenities.
# test_results.feature[test_results.pvalue < .05].isin(unqiue_amenities).sum() / np.sum(test_results.significant == True)