#Analysis of the Airbnb data of London

Airbnb is an online marketplace and hospitality service with millions of listings. In this project, we analyze the Airbnb data of London and try to use the available information to predict the review scores of each listing. The data sets are available [here](http://insideairbnb.com/get-the-data.html).

##Reading data

There are three files: the calendar, the listings, and the reviews. Of these, the listings data provide detailed information on the amenities, prices, descriptions, etc. of all the listings. Thus, we focus on the listings data for the moment.

In [1]:
import pandas as pd


listings = pd.read_csv('../Airbnb/src/data_sets/listings.csv.gz')
listings = listings.rename(columns={'id': 'listing_id'})

##Clean and transform data

As we want to predict the review scores, we remove the listings whose review score is missing and bin the scores into five groups. Then we clean the data and transform the format of some of the variables that have a (relatively) strong relationship with the review scores.

Notice that special care has to be taken regarding the host verifications and amenities: Each item contains a list of features of each listing in a string format; thus, we first extract all the features in that string and then convert them into multiple variables of binary values where `1` indicates the presence of this feature and `0` indicates absence. Also, there are two amenities regarding missing translations; these are removed from the amenities.

In [2]:
import numpy as np


resp = 'review_scores_rating'  # Set response
listings = listings[listings[resp].notnull()].reset_index(drop=True)  # Remove rows with a missing response
listings[resp] = pd.cut(listings[resp], np.arange(0, 120, 20), labels=np.arange(5),
                        include_lowest=True)  # Bin response into five levels
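
To make the binning concrete, here is a minimal sketch on a handful of made-up scores (the values in `scores` are hypothetical, not taken from the data set). It uses the same edges and labels as the cell above and shows how the interval boundaries are treated.

```python
import numpy as np
import pandas as pd

# Hypothetical review scores on the 0-100 scale
scores = pd.Series([0, 20, 21, 79.5, 80, 100])

# Edges 0, 20, ..., 100 define five right-closed intervals;
# include_lowest=True also closes the first interval on the left, so a score of 0 is kept
binned = pd.cut(scores, np.arange(0, 120, 20), labels=np.arange(5), include_lowest=True)
print(binned.tolist())
```

Because the intervals are closed on the right, a score of exactly 20 or 80 falls into the lower of the two neighbouring groups.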

In [3]:
def transform_zipcode(df, variable='zipcode'):
    """Transform zipcode to outward code
    
    :param df: data frame to be transformed
    :param variable: zipcode
    :return: transformed data frame
    """
    df = df.copy()
    df[variable] = df[variable].fillna('').str.lower().str.replace(r' +', '', regex=True) \
        # Reformat zipcode
    pat = df[variable].str.match(r'^[a-z]{1,2}[0-9][a-z0-9]?([0-9][a-z]{2})?$') \
        # Find zipcode of correct format
    df.loc[~pat, variable] = ''  # Remove wrongly recorded zipcode
    df[variable] = df[variable].apply(lambda x: x if len(x) <= 4 else x[:-3]) \
        # Extract outward code (this step is redundant)
    df[variable] = df[variable].str.extract(r'([a-z]{1,2})', expand=False) \
        # Extract postcode area
    return df


listings = transform_zipcode(listings)
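
As an illustration of what the steps in `transform_zipcode` do, the sketch below repeats them on a few made-up entries (the strings in `raw` are hypothetical). It assumes a pandas version in which `str.replace` accepts the `regex` keyword.

```python
import pandas as pd

# Hypothetical raw entries: two full postcodes, one outward code, one invalid string, one missing
raw = pd.Series(['SW1A 1AA', 'e1 6an', 'EC1V', 'London', None])

# Lower-case and remove spaces
cleaned = raw.fillna('').str.lower().str.replace(r' +', '', regex=True)

# Blank out anything that does not look like a postcode or an outward code
valid = cleaned.str.match(r'^[a-z]{1,2}[0-9][a-z0-9]?([0-9][a-z]{2})?$')
cleaned[~valid] = ''

# Keep only the leading letters, i.e. the postcode area
area = cleaned.str.extract(r'([a-z]{1,2})', expand=False)
print(area.tolist())
```

Entries that fail the format check come out as missing values after the extraction; in the pipeline they are filled with empty strings again when the labels are binarized.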

In [4]:
from sklearn.preprocessing import LabelBinarizer


def transform_label(df, variables=None):
    """Binarize labels using numerical values
    
    :param df: data frame to be transformed
    :param variables: list of variables to be binarized
    :return: transformed data frame
    """
    df = df.copy()
    label_binarizer = LabelBinarizer()
    transformed_list = []
    
    for variable in variables:
        df[variable] = df[variable].fillna('')  # Reformat variables
        transformed = label_binarizer.fit_transform(df[variable])  # Binarize variables
        
        # Set column names of the transformed data
        if transformed.shape[1] == 1:
            transformed_columns = [variable]
        else:
            columns = pd.Series(label_binarizer.classes_).str.lower() \
                .str.replace(r'[&\-/ ]', '_', regex=True).str.replace(r'_+', '_', regex=True)
            transformed_columns = [variable + '.' + column for column in columns]
        
        transformed = pd.DataFrame(transformed, columns=transformed_columns)
        transformed_list.append(transformed)  # Add transformed data to the transformed list
        df = df.drop(variable, axis=1)  # Remove original data
        
    df = pd.DataFrame(pd.concat([df] + transformed_list, axis=1))  # Concatenate transformed data
    return df


vars_label = [
    'experiences_offered',
    'host_response_time',
    'zipcode',
    'property_type',
    'room_type',
    'bed_type',
    'cancellation_policy',
    'host_is_superhost',
    'is_location_exact',
    'requires_license',
    'instant_bookable',
    'require_guest_profile_picture',
    'require_guest_phone_verification'
]
listings = transform_label(listings, vars_label)
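
The branch on `transformed.shape[1]` in `transform_label` exists because `LabelBinarizer` returns a single column for a two-class variable and one column per class otherwise. A small sketch with made-up values (the label strings are only illustrative) shows both cases.

```python
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()

# Two classes: a single 0/1 column, with 1 marking the class that sorts last ('t')
two = binarizer.fit_transform(['t', 'f', 't'])

# Three classes: one indicator column per class, ordered as in classes_
three = binarizer.fit_transform(
    ['Entire home/apt', 'Private room', 'Shared room', 'Private room'])

print(two.shape, three.shape)
print(list(binarizer.classes_))
```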

In [5]:
def transform_host_since(df):
    """Transform host start date to duration
    
    :param df: data frame to be transformed
    :return: transformed data frame
    """
    df = df.copy()
    df['host_since'] = pd.to_datetime(df['host_since'], yearfirst=True) \
        # Convert to datetime format
    df['host_for'] = (df['host_since'].max() - df['host_since']).dt.days \
        # Calculate the duration of hosting
    return df


listings = transform_host_since(listings)
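
Note that the duration is measured relative to the most recent start date in the data, not relative to today. A minimal sketch with hypothetical dates:

```python
import pandas as pd

# Hypothetical host start dates, including a missing one
host_since = pd.to_datetime(pd.Series(['2016-01-01', '2017-01-01', None]), yearfirst=True)

# Days between each start date and the latest start date in the series
host_for = (host_since.max() - host_since).dt.days
print(host_for.tolist())
```

The newest host gets a duration of zero, and a missing start date stays missing so that it can be imputed later.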

In [6]:
from sklearn.feature_extraction.text import CountVectorizer


def transform_text(df, variables=None, max_features=100):
    """Transform variables consisting of text into a sparse format
    
    :param df: data frame to be transformed
    :param variables: list of variables to be transformed
    :param max_features: maximum number of features
    :return: transformed data frame
    """
    df = df.copy()
    count_vectorizer = CountVectorizer(stop_words='english', max_features=max_features)
    transformed_list = []
    
    for variable in variables:
        df[variable] = df[variable].fillna('')
        transformed = count_vectorizer.fit_transform(df[variable])
        
        # Set column names of the transformed data
        columns = sorted(count_vectorizer.vocabulary_.keys())
        transformed_columns = [variable + '.' + column for column in columns]
        
        transformed = pd.DataFrame(transformed.toarray(), columns=transformed_columns)
        transformed_list.append(transformed)  # Add transformed data to the transformed list
        df = df.drop(variable, axis=1)  # Remove original data
        
    df = pd.DataFrame(pd.concat([df] + transformed_list, axis=1))  # Concatenate transformed data
    return df


vars_text = ['host_verifications']
listings = transform_text(listings, vars_text)
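
To see how the list-like strings turn into indicator columns, the sketch below applies the same `CountVectorizer` settings to three made-up verification strings (the contents of `docs` are hypothetical).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical host_verifications entries, stored as strings
docs = ["['email', 'phone', 'reviews']", "['phone']", "['email', 'jumio']"]

vectorizer = CountVectorizer(stop_words='english', max_features=100)
counts = vectorizer.fit_transform(docs).toarray()

# Columns of the count matrix follow the sorted vocabulary
columns = sorted(vectorizer.vocabulary_.keys())
print(columns)
print(counts.tolist())
```

The entries are counts, which coincide with 0/1 indicators as long as no token is repeated within a listing. If repeats were possible, passing `binary=True` to `CountVectorizer` would be one way to force indicator values.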

In [7]:
def transform_percent(df, variables=None):
    """Transform strings of percentages to decimals
    
    :param df: data frame to be transformed
    :param variables: list of variables to be transformed
    :return: transformed data frame
    """
    df = df.copy()
    for variable in variables:
        df[variable] = df[variable].str.strip('%')
        df[variable] = df[variable].astype(np.float64) / 100
    return df


vars_percent = ['host_response_rate']
listings = transform_percent(listings, vars_percent)

In [8]:
def transform_price(df, variables=None):
    """Transform strings of price to numerals
    
    :param df: data frame to be transformed
    :param variables: list of variables to be transformed
    :return: transformed data frame
    """
    df = df.copy()
    for variable in variables:
        df[variable] = df[variable].str.strip('$').str.replace(',', '') \
            # Remove dollar signs and thousands separators
        df[variable] = df[variable].astype(np.float64)
    return df


vars_price = ['price']
listings = transform_price(listings, vars_price)
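
The two string conversions above can be checked on a few made-up values (both series below are hypothetical):

```python
import numpy as np
import pandas as pd

rates = pd.Series(['95%', '100%'])          # hypothetical response rates
prices = pd.Series(['$1,250.00', '$85.00'])  # hypothetical prices

# Strip the percent sign and rescale to the unit interval
rate_values = rates.str.strip('%').astype(np.float64) / 100

# Strip the dollar sign, drop thousands separators, then convert
price_values = prices.str.strip('$').str.replace(',', '', regex=False).astype(np.float64)

print(rate_values.tolist(), price_values.tolist())
```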

In [9]:
def transform_amenities(df, variable='amenities'):
    """Extract amenities
    
    :param df: data frame to be transformed
    :param variable: amenities
    :return: transformed data frame
    """
    df = df.copy()
    df[variable] = df[variable].str.replace(r'[:\-\./ ]', '_', regex=True) \
        .str.replace(r'[\(\)]', '', regex=True).str.replace(r'_+', '_', regex=True) \
        .str.lower()  # Reformat amenities
    df = transform_text(df, [variable])  # Transform amenities using transform_text
    columns_to_remove = [column for column in df.columns if 'missing' in column]
    df = pd.DataFrame(df.drop(columns_to_remove, axis=1))  # Remove unwanted amenities
    return df


listings = transform_amenities(listings)

After transforming the format of the variables, we remove those with a high percentage of missing values and/or little or no relationship with the review scores.

In [10]:
listings = listings.drop([
    'listing_id',
    'listing_url',
    'scrape_id',
    'last_scraped',
    'name',
    'summary',
    'space',
    'description',
    'neighborhood_overview',
    'notes',
    'transit',
    'access',
    'interaction',
    'house_rules',
    'thumbnail_url',
    'medium_url',
    'picture_url',
    'xl_picture_url',
    'host_url',
    'host_name',
    'host_since',
    'host_location',
    'host_about',
    'host_acceptance_rate',
    'host_thumbnail_url',
    'host_picture_url',
    'host_neighbourhood',
    'host_listings_count',
    'host_total_listings_count',
    'host_has_profile_pic',
    'host_identity_verified',
    'street',
    'neighbourhood',
    'neighbourhood_cleansed',
    'neighbourhood_group_cleansed',
    'city',
    'state',
    'market',
    'smart_location',
    'country_code',
    'country',
    'latitude',
    'longitude',
    'is_location_exact',
    'square_feet',
    'weekly_price',
    'monthly_price',
    'security_deposit',
    'cleaning_fee',
    'guests_included',
    'extra_people',
    'minimum_nights',
    'maximum_nights',
    'calendar_updated',
    'has_availability',
    'availability_30',
    'availability_60',
    'availability_90',
    'availability_365',
    'calendar_last_scraped',
    'first_review',
    'last_review',
    'review_scores_accuracy',
    'review_scores_cleanliness',
    'review_scores_checkin',
    'review_scores_communication',
    'review_scores_location',
    'review_scores_value',
    'requires_license',
    'license',
    'jurisdiction_names',
    'instant_bookable',
    'require_guest_profile_picture',
    'require_guest_phone_verification',
    'reviews_per_month'
], axis=1)

Notice that there are many variables regarding host verifications and amenities, and some of them may be redundant for predicting the review scores. For example, almost every host is verified by phone, so this variable may be of little significance. Therefore, we perform $\chi^{2}$ tests to reduce the feature space, keeping only the features whose $p$-value does not exceed $0.05$. However, the analysis of significance should be performed using the training data only; thus, we first impute the data and split it into the train and test sets.

In [11]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


imputer = SimpleImputer(strategy='median')  # Replace missing values with the median
listings_imputed = imputer.fit_transform(listings)

train, test = train_test_split(listings_imputed)  # Split the data into train and test sets
train = pd.DataFrame(train, columns=listings.columns)
test = pd.DataFrame(test, columns=listings.columns)

In [12]:
from itertools import compress
from sklearn.feature_selection import chi2


def remove_redundant_features(train, test, variables=None, resp='review_scores_rating'):
    """Perform chi-squared test to remove sparse features of low significance
    
    :param train: train data frame to be transformed
    :param test: test data frame to be transformed
    :param variables: list of ``sparse`` variables to be transformed
    :param resp: response
    :return: transformed data frame
    """
    train = train.copy()
    test = test.copy()
    for variable in variables:
        variables_list = [item for item in train.columns if variable + '.' in item] \
            # Find sparse features
        tmp = train[variables_list + [resp]]  # Form data frame of sparse feature and response
        _, p_val = chi2(tmp[variables_list], tmp[resp])  # Perform chi-squared test
        variables_list = list(compress(variables_list, (p_val > 0.05))) \
            # Find sparse features of low significance
        train = train.drop(variables_list, axis=1)  # Remove sparse features of low significance
        test = test.drop(variables_list, axis=1)  # Remove the same features from the test set
    return train, test


train, test = remove_redundant_features(train, test, ['host_verifications', 'amenities'])
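
The selection rule can be illustrated on a tiny synthetic example (all arrays below are made up): one indicator that coincides with the class and one that is spread evenly across the classes. The 0.05 threshold mirrors the one used in `remove_redundant_features`.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Synthetic binary response: 200 samples of class 0 followed by 200 of class 1
y = np.repeat([0, 1], 200)

informative = (y == 1).astype(int)  # present exactly when the class is 1
noise = np.tile([0, 1], 200)        # present in half of each class

X = np.column_stack([informative, noise])
_, p_values = chi2(X, y)

# Keep a feature only if its p-value does not exceed 0.05
keep = p_values <= 0.05
print(p_values, keep)
```

The evenly spread indicator matches its expected frequencies exactly, so its test statistic is zero and it is dropped, while the indicator tied to the class is kept.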

##Predict the review scores

After removing the redundant variables, we train a random forest classifier to predict the review scores. However, since the review scores are imbalanced (reviewers tend to give high scores), we upsample every class to the size of the largest class before training the classifier.

In [13]:
from sklearn.utils import resample

train_resampled_list = []
max_samples = train[resp].value_counts().max() \
    # Find the class with maximum number of samples
for i in range(5):
    resampled = resample(train[train[resp] == i], n_samples=max_samples)  # Upsample
    train_resampled_list.append(resampled)  # Add upsampled data to the transformed list

train_resampled = pd.DataFrame(pd.concat(train_resampled_list, ignore_index=True)) \
    # Concatenate upsampled data

X_train = train_resampled.drop([resp], axis=1)
y_train = train_resampled[resp]
X_test = test.drop([resp], axis=1)
y_test = test[resp]
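
The upsampling step can be sketched on a toy frame (the data in `toy` and the fixed `random_state` are assumptions added for reproducibility; the cell above does not set a seed).

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: four samples of class 0 and two of class 1
toy = pd.DataFrame({'x': range(6), 'label': [0, 0, 0, 0, 1, 1]})

largest = toy['label'].value_counts().max()  # size of the largest class

# Draw `largest` rows with replacement from every class, then stack the pieces
balanced = pd.concat(
    [resample(toy[toy['label'] == k], n_samples=largest, random_state=0)
     for k in sorted(toy['label'].unique())],
    ignore_index=True)

print(balanced['label'].value_counts().to_dict())
```

Since `resample` draws with replacement by default, rows of the smaller classes appear several times in the balanced frame.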

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss


clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
log_loss(y_test, y_pred)

1.3014864494413203
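
As a rough point of reference for reading this number, the sketch below computes the logarithmic loss of a classifier that always assigns probability $1/5$ to each of the five classes. The labels in `y_true` are made up; for such a uniform guess the loss does not depend on them and equals $\ln 5 \approx 1.609$.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 2, 3, 4]        # hypothetical labels, one per class
uniform = np.full((5, 5), 0.2)  # probability 1/5 for every class

baseline = log_loss(y_true, uniform, labels=[0, 1, 2, 3, 4])
print(baseline, np.log(5))
```

This is only the uniform-guess reference; a baseline that predicts the class frequencies of the training data would be a stricter comparison.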

We can use `GridSearchCV` to tune the hyperparameters of the estimator on the train set. Because the evaluation metric is the logarithmic loss rather than the default accuracy, we have to supply a scorer based on the logarithmic loss, negated so that larger values are better.

In [15]:
from sklearn.model_selection import GridSearchCV


estimator = RandomForestClassifier()
param_grid = {
    'n_estimators': [250, 500],
    'min_samples_leaf': [1, 5, 10]
}
scoring = 'neg_log_loss'  # Built-in scorer: negated logarithmic loss on predicted probabilities

clf = GridSearchCV(estimator, param_grid, scoring=scoring, cv=5)
clf.fit(X_train, y_train)
clf.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=250, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
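
For readers who want to try the search on something small first, here is a sketch on synthetic data. The data set from `make_classification`, the reduced grid, and the fixed seeds are all assumptions for illustration; the scorer is scikit-learn's built-in `'neg_log_loss'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic three-class problem
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {'min_samples_leaf': [1, 5]},
    scoring='neg_log_loss',  # negated log loss, so larger is better
    cv=3,
)
search.fit(X, y)

# best_score_ is the negated loss; flip the sign to read it as a log loss
print(search.best_params_, -search.best_score_)
```

The attribute `search.cv_results_` holds the mean score for every grid point, which helps to judge how much the chosen setting actually differs from the alternatives.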

Finally, we use the best estimator to predict the review scores of the test set.

In [16]:
y_pred = clf.best_estimator_.predict_proba(X_test)
log_loss(y_test, y_pred)

0.54126133477354821

To further enhance the prediction accuracy, the reviews data can be incorporated as well. It may contain useful information such as the number of reviews made by return customers and the keywords leading to high review scores. This will be considered in the second part of this project.