### Predicting Price of Listings

Author: Alexandru Papiu

Let's take a look at the Airbnb listings and see whether we can accurately predict the asking price from the information in each listing.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import Ridge, RidgeCV, Lasso
from sklearn import metrics
from sklearn.model_selection import cross_val_score, train_test_split
import xgboost as xgb



%config InlineBackend.figure_format = 'png'

In [None]:
train = pd.read_csv("../input/listings.csv")

Let's just keep some columns since there are so many of them:

In [None]:
columns_to_keep = ["price", "neighbourhood_cleansed", "bedrooms",
                   "property_type", "room_type", "name", "summary",
                   "amenities", "latitude", "longitude", "number_of_reviews",
                   "require_guest_phone_verification", "minimum_nights"]

train = train[columns_to_keep]

In [None]:
train.head(3)

Let's clean up the data a bit. We will define a function called clean that tidies up some of the columns:

In [None]:
def clean(train):
    train = train.copy()  # work on a copy so the caller's frame is not mutated

    train["bedrooms"] = train["bedrooms"].fillna(0.5)  # these are studios
    train["summary"] = train["summary"].fillna("")
    train["bedrooms"] = train["bedrooms"].astype("str")

    # replace unpopular property types with "Other"
    popular_types = train["property_type"].value_counts().head(6).index.values
    train.loc[~train.property_type.isin(popular_types), "property_type"] = "Other"

    # make price numeric; regex=True so "[$,]" is read as a character class
    train["price"] = train["price"].str.replace("[$,]", "", regex=True).astype("float")
    # eliminate crazy prices:
    train = train[train["price"] < 600]

    return train

In [None]:
train = clean(train)
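To see what the price cleanup does, here is a minimal sketch on a few made-up price strings (they are illustrative, not taken from the dataset). It assumes a pandas version where `str.replace` takes an explicit `regex` argument.

```python
import pandas as pd

# Hypothetical raw price strings, formatted like "$1,250.00".
raw = pd.Series(["$1,250.00", "$85.00", "$599.00"])

# With regex=True, "[$,]" is a character class matching "$" or ",".
prices = raw.str.replace("[$,]", "", regex=True).astype("float")

# Same cutoff as in clean(): keep only listings priced under 600.
kept = prices[prices < 600]
print(kept.tolist())
```

Without `regex=True`, recent pandas versions treat the pattern as a literal string, so nothing is stripped and the conversion to float fails.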

### EDA:

Let's look at the distribution of prices:

In [None]:
train["price"].hist(bins = 30)
train["price"].std()

### Price by number of bedrooms:

In [None]:
(train.pivot(columns = "bedrooms", values = "price")
         .plot.hist(bins = 30, stacked = True))

In [None]:
sns.barplot(x = "bedrooms", y = "price", data = train)

### Price by room_type:

In [None]:
(train.pivot(columns = "room_type", values = "price")
         .plot.hist(bins = 30, stacked = False, alpha = 0.8))

In [None]:
train.groupby("room_type")["price"].mean()

As expected, the room type matters a lot: private rooms average around $92, while entire homes average around $213.

### Pre-Processing:

This is a pretty interesting dataset since it contains very diverse data types: we have numerical, categorical and text data. Let's split them into two groups, since text data needs to be processed differently:

In [None]:
y = train["price"]
train_num_cat = train[["neighbourhood_cleansed", "bedrooms",
                   "property_type", "room_type", "latitude", "longitude",
                   "number_of_reviews", "require_guest_phone_verification",
                    "minimum_nights"]]

train_text = train[["name", "summary", "amenities"]]

Now let's one hot encode the categorical data:

In [None]:
X_num = pd.get_dummies(train_num_cat)
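Here is a small sketch of what `pd.get_dummies` does on a mixed frame. The toy rows are made up for illustration. String columns are expanded into indicator columns, while numeric columns pass through unchanged. This is also why `bedrooms` was cast to a string earlier: it gets treated as a category rather than a number.

```python
import pandas as pd

# Hypothetical toy frame with two string columns and one numeric column.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "bedrooms": ["1.0", "2.0", "0.5"],   # strings, as produced by clean()
    "latitude": [42.35, 42.36, 42.34],   # numeric, left untouched
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded.shape)
```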

In [None]:
train_text.head()

Amenities are interesting. I will follow the same idea as in a previous script [here](https://www.kaggle.com/residentmario/d/airbnb/boston/modeling-prices) and one-hot encode the amenities. Turns out you can do this with one line of pandas, sweet!

In [None]:
train.amenities = train.amenities.str.replace("[{}]", "", regex=True)
amenity_ohe = train.amenities.str.get_dummies(sep = ",")

In [None]:
amenity_ohe.head(3)
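To make the amenities encoding concrete, here is a minimal sketch on three made-up amenity strings (the brace-wrapped format is assumed to match the raw column). Each distinct comma-separated token becomes its own 0/1 column.

```python
import pandas as pd

# Hypothetical amenity strings in the "{a,b,c}" format.
amenities = pd.Series(["{TV,Wifi,Kitchen}", "{Wifi}", "{}"])

# Strip the braces, then split on commas and one-hot encode the tokens.
stripped = amenities.str.replace("[{}]", "", regex=True)
ohe = stripped.str.get_dummies(sep=",")
print(ohe)
```

A listing with no amenities simply ends up as a row of zeros.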

What shall we do with the name and summary data? Let's just concatenate the two and then create a bag of words from them.

In [None]:
train["text"] = train["name"].str.cat(train["summary"], sep = " ")

In [None]:
vect = CountVectorizer(stop_words = "english", min_df = 10)
X_text = vect.fit_transform(train["text"])
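As a quick illustration of the bag-of-words step, here is a sketch on three made-up listing titles. It uses `min_df = 2` instead of 10 only because the toy corpus is tiny. Stop words and words that appear in fewer than `min_df` documents are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical listing texts, for illustration only.
docs = [
    "Cozy room near the park",
    "Sunny room in the city",
    "Cozy studio in the city",
]

# min_df=2 keeps only words that occur in at least two documents.
toy_vect = CountVectorizer(stop_words="english", min_df=2)
X_toy = toy_vect.fit_transform(docs)

print(sorted(toy_vect.vocabulary_))
print(X_toy.toarray())
```

The result is a sparse matrix with one row per listing and one column per surviving word.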

### Models:

Now let's build some models! But first let's define some helper functions:

In [None]:
#metric:
def rmse(y_true, y_pred):
    return(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))

#evaluates rmse on a validation set:
def eval_model(model, X, y, state = 3):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state = state)
    preds = model.fit(X_tr, y_tr).predict(X_val)
    return rmse(y_val, preds)
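As a sanity check on the metric, here is a tiny worked example with made-up numbers. The errors are 10, -10 and 0, so the mean squared error is 200/3 and the RMSE is its square root, roughly 8.16. The metric is symmetric in its two arguments.

```python
import numpy as np
from sklearn import metrics

def rmse(y_true, y_pred):
    return np.sqrt(metrics.mean_squared_error(y_true, y_pred))

# Hypothetical prices and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])

value = rmse(y_true, y_pred)
print(value)

# Swapping the arguments gives the same result.
print(rmse(y_pred, y_true) == value)
```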

OK, so we have three different matrices:

In [None]:
(X_num.shape, X_text.shape, amenity_ohe.shape)

In [None]:
#this is numeric + amenities:
X = np.hstack((X_num, amenity_ohe))

#this is all of them:
X_full = np.hstack((X_num, amenity_ohe, X_text.toarray()))
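Calling `.toarray()` on the bag-of-words matrix can use a lot of memory when the vocabulary is large. A possible alternative, sketched here on small stand-in arrays (the names `dense_part` and `text_part` are hypothetical), is to keep everything sparse with `scipy.sparse.hstack`. Scikit-learn's `Ridge` accepts sparse input, and I'd expect xgboost to as well, though that is worth checking for your version.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the dense features (X_num, amenity_ohe) and the sparse text matrix (X_text).
dense_part = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_part = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 0], [0, 0, 3]]))

# Convert the dense block to sparse, then stack column-wise without densifying the text.
X_sparse = sparse.hstack([sparse.csr_matrix(dense_part), text_part]).tocsr()

print(X_sparse.shape)
```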

In [None]:
models_rmse = [eval_model(xgb.XGBRegressor(), X_num, y),
 eval_model(xgb.XGBRegressor(), X, y),
 eval_model(Ridge(), X_num, y),
 eval_model(Ridge(), X, y)]

In [None]:
models_rmse = pd.Series(models_rmse, index = ["xgb_num", "xgb_ame", "ridge", "ridge_ame"] )

In [None]:
models_rmse

In [None]:
models_rmse.plot(kind = "barh")

OK, so it looks like xgboost is doing a little better than ridge, and adding the amenities helps a bit (it lowers the RMSE by around 1). However, this is only on one validation set.

To test our models more in depth, we will do repeated train-validation splits (note this is not exactly cross-validation) and then see how the errors are distributed. We also add a baseline model that always predicts the mean of the training prices.

In [None]:
results = []
for i in range(30):
    # a fresh random split on every iteration
    X_tr, X_val, y_tr, y_val = train_test_split(X_num, y)
    # baseline: always predict the mean training price
    y_baseline = [np.mean(y_tr)]*len(y_val)

    model = Ridge(alpha = 5)
    preds_ridge = model.fit(X_tr, y_tr).predict(X_val)

    model = xgb.XGBRegressor()
    preds_xgb = model.fit(X_tr, y_tr).predict(X_val)

    results.append((rmse(y_val, y_baseline),
                    rmse(y_val, preds_ridge),
                    rmse(y_val, preds_xgb)
                    ))
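The repeated random splitting above can also be written with scikit-learn's `ShuffleSplit` passed as the `cv` argument of `cross_val_score`. This is a sketch on synthetic data (the `X_demo` and `y_demo` arrays are made up so the snippet runs on its own), not the setup used for the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 30 independent random splits, 25% held out each time (train_test_split's default).
splitter = ShuffleSplit(n_splits=30, test_size=0.25, random_state=0)
scores = cross_val_score(Ridge(alpha=5), X_demo, y_demo,
                         cv=splitter, scoring="neg_root_mean_squared_error")

# The scorer returns negative RMSE, so flip the sign.
rmse_per_split = -scores
print(rmse_per_split.mean(), rmse_per_split.std())
```

Unlike k-fold cross-validation, the validation sets drawn this way can overlap between iterations.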

In [None]:
results = pd.DataFrame(results, columns = ["baseline", "ridge", "xgb"])
results.plot.hist(bins = 15, alpha = 0.5)

In [None]:
pd.DataFrame([results.mean(), results.std()])

Clearly both ridge and xgboost beat the baseline by a decent margin. It also looks like the xgboost model performs slightly better than ridge regression, good to know! To check whether the difference in RMSE is statistically significant we could do a permutation test or a t-test, but I'll skip that for now.
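For reference, here is a minimal sketch of what such a permutation test could look like: a paired sign-flip test on the per-split RMSE differences. The function name and the RMSE values below are made up to exercise the code. They are not results from this notebook. Since repeated random splits share data, the splits are not independent, so treat any p-value from this as approximate.

```python
import numpy as np

def paired_sign_flip_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean difference of a and b."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diffs.mean())
    # Under the null hypothesis, the sign of each paired difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return (np.sum(permuted >= observed) + 1) / (n_perm + 1)

# Made-up per-split RMSE values for two hypothetical models.
model_a_rmse = [70.1, 69.5, 71.2, 70.8, 69.9, 70.4, 71.0, 70.2]
model_b_rmse = [68.9, 68.7, 70.1, 69.5, 69.0, 69.2, 70.3, 69.1]

p_value = paired_sign_flip_test(model_a_rmse, model_b_rmse)
print(p_value)
```

With the real data, the two inputs would be the `ridge` and `xgb` columns of `results`.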