In this analysis, I will try different feature engineering and modeling approaches to predict the destination a user will book on AirBnb. Because there are multiple country destinations represented in the dataset, this is a multi-class classification problem. 

I was interested in this AirBnb problem because it presents an unbalanced class-representation problem: a large majority of the outcome variable is concentrated in two classes, with NDF and the US representing 58% and 29% respectively. Furthermore, the problem also includes an additional dataset (sessions) which contains additional information about user behavior on the site. I thought it would be interesting to engineer features from this data to see if it could improve the performance of a model to predict user booking destinations. 

To address the class imbalance, I test various ensemble classifiers and hyperparameter configurations. In particular, I compare a boosted ensemble approach (using AdaBoost) with an averaging approach (using ExtraTrees). I also engineer features from the sessions data and measure whether they improve performance. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import warnings
from tabulate import tabulate
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Below I read in two dataframes: 

1. train_users is the dataset with information about users and the outcome variable ('country_destination')
2. sessions is a transactional dataset with entries for user behavior on the site (e.g. clicks, searches, etc.)

I will first try modeling with just the train_users data to see if country_destination can be predicted based on user attributes. Then, I will do additional feature engineering on the sessions data to see if including this information improves model performance. For example, does knowing how many times a user searched on the site increase the likelihood of correctly predicting his or her ultimate destination choice? 

In [None]:
train_users = pd.read_csv('/kaggle/input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip')
sessions=pd.read_csv('/kaggle/input/airbnb-recruiting-new-user-bookings/sessions.csv.zip')
train_users.head()

I will use scikit-learn pipelines to build a pre-processing and classification pipeline to make predictions. Before starting with the pipelines, it is useful to look at the datatypes of the predictor variables.

In [None]:
train_users.dtypes

I need to remove the id column and outcome variable from the dataset. Also, I need to split it into a train & test set.

In [None]:
from sklearn.model_selection import train_test_split

def get_split(df):
    X = df.drop(columns=['country_destination', 'id'])
    y = df['country_destination']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return(X_train,X_test,y_train,y_test)

X_train,X_test,y_train,y_test=get_split(train_users)
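Since the destination classes are so unbalanced, a purely random split can leave rare destinations under-represented in the test set. Below is a minimal sketch of a stratified, reproducible variant. The `toy` frame and the `get_stratified_split` name are made up for illustration and are not part of the analysis itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for train_users
toy = pd.DataFrame({
    "id": range(100),
    "age": [20 + i % 40 for i in range(100)],
    "country_destination": ["NDF"] * 60 + ["US"] * 30 + ["FR"] * 10,
})

def get_stratified_split(df, test_size=0.2, seed=0):
    X = df.drop(columns=["country_destination", "id"])
    y = df["country_destination"]
    # stratify=y keeps the class proportions the same in both splits;
    # random_state makes the split reproducible between runs
    return train_test_split(X, y, test_size=test_size,
                            stratify=y, random_state=seed)

X_tr, X_te, y_tr, y_te = get_stratified_split(toy)
test_counts = y_te.value_counts()
print(test_counts)
```

With a 60/30/10 class mix and a 20% test split, each class should keep its share in the test set.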


For the scikit pre-processing pipeline, I will have a separate procedure for numeric and categorical column/feature types. Therefore, below I will create a list of the column names for each type.

In [None]:
#get columns by type
def get_coltypes(df):
    numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = df.select_dtypes(include=['object']).columns
    return numeric_features,categorical_features

numeric_features,categorical_features=get_coltypes(X_train)


### Scikit Pipeline  
1. First, I will define a transformer for each column type (numeric/categorical). 
2. Then, I will put them together into a preprocessing object and specify the columns captured above to which the transformers should be applied.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

#define transformers as pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
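To make the effect of these two transformers concrete, here is a small sketch that applies the same kind of preprocessor to a made-up four-row frame. The `toy` data and the `toy_preprocessor` name are illustrative assumptions; the steps mirror the ones defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one numeric and one categorical column,
# each with a missing value
toy = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "gender": ["FEMALE", "MALE", np.nan, "FEMALE"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["gender"]),
])

out = toy_preprocessor.fit_transform(toy)
# densify in case the result comes back as a sparse matrix
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# 1 scaled numeric column + 3 one-hot columns (FEMALE, MALE, missing)
print(out.shape)
```

The missing age is replaced by the median (35), which is also the column mean after imputation, so it scales to 0. The missing gender becomes its own "missing" category.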

With the pre-processing part of the pipeline defined, now I will add the classification step. 
1. First, I define two classifiers I want to compare. Here I use AdaBoost and ExtraTrees classifiers.
2. For each classifier, I apply the pipeline: pre-processing and fitting the classifier. Then the model is scored on the test data. 

Here I will compare an AdaBoost classifier with an ExtraTrees classifier. AdaBoost is a "meta-estimator" that initially fits a "base" classifier (here a decision tree) on the original dataset and then fits additional copies of the classifier on the same dataset but adjusts weights of incorrect classifications to target these more difficult cases.
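The re-weighting idea can be sketched in a few lines of NumPy. This follows one common multi-class formulation (SAMME); scikit-learn's implementation differs in details such as the learning rate, so treat it as an illustration rather than the library's exact code. The function name `samme_reweight` is my own.

```python
import numpy as np

def samme_reweight(weights, misclassified, n_classes):
    """One round of a SAMME-style sample re-weighting (illustrative)."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)
    # weighted error rate of the current weak learner
    err = weights[misclassified].sum() / weights.sum()
    # learner weight; log(K - 1) is the multi-class correction term
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)
    # up-weight only the samples the learner got wrong, then renormalize
    new_weights = weights * np.exp(alpha * misclassified)
    return new_weights / new_weights.sum(), alpha

# Five samples with equal weight; the weak learner misses only the last one
w0 = np.full(5, 0.2)
miss = [False, False, False, False, True]
w1, alpha = samme_reweight(w0, miss, n_classes=2)
print(w1)
```

After one round the single misclassified sample carries half of the total weight, so the next copy of the base classifier is pushed to get it right.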

ExtraTrees (from sklearn.ensemble) is also a "meta-estimator" that fits randomized decision trees ("extra-trees") on sub-samples of the dataset and then uses averaging to improve predictive accuracy and control over-fitting.

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.ensemble import AdaBoostClassifier,ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
# classifiers = [
#     AdaBoostClassifier(),
#     ExtraTreesClassifier()
#     ]

def make_preds(classifier):
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])
    model=pipe.fit(X_train, y_train)
    print(classifier)
    print("model score: ",pipe.score(X_test, y_test))
    y_pred = pipe.predict(X_test)
    return model,y_pred

ada_model,ada_pred=make_preds(AdaBoostClassifier())
et_model,et_pred=make_preds(ExtraTreesClassifier())

Below I will write a function to compare the distribution of predictions produced by each model above to the actual distribution of the outcome variable (country destination) in the dataset. 
It first gets the value counts of each country in the original and predicted datasets. It then adds % columns. 

In [None]:
def compare_preds(y_pred):
    preds_df = pd.DataFrame(data = y_pred, columns = ['y_pred'], index = X_test.index.copy())
    df_out = pd.merge(y_test, preds_df, how = 'left', left_index = True, right_index = True)
    preds_summary=df_out.apply(pd.Series.value_counts).fillna(0)
    preds_summary['cdest_pct'] = preds_summary.country_destination / preds_summary.country_destination.sum()
    preds_summary['predicted_pct'] = preds_summary.y_pred / preds_summary.y_pred.sum()
    return preds_summary.reset_index().sort_values('country_destination',ascending=False)

ada_preds_df=compare_preds(ada_pred)
et_preds_df=compare_preds(et_pred)

ada_preds_df

From the above we can see that AdaBoost only ever predicts NDF or US as destinations. The percent of predictions for NDF is roughly the same as in the dataset but the US is over-predicted. 
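A cross-tabulation of actual against predicted destinations shows where the minority classes end up when a model only predicts two labels. The sketch below uses made-up labels, not the model output from this analysis.

```python
import pandas as pd

# Made-up labels for illustration only
y_true = pd.Series(["NDF", "NDF", "US", "US", "FR", "IT"], name="actual")
y_hat = pd.Series(["NDF", "NDF", "US", "NDF", "US", "US"], name="predicted")

# rows: actual destination, columns: predicted destination
table = pd.crosstab(y_true, y_hat)
print(table)
```

In this toy case the FR and IT rows have all of their counts in the US column, which is the kind of pattern that inflates the US share of predictions.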

Below I gather variables for plotting with ggplot (the one part of R I can't give up). In Python it is available through the plotnine package.

In [None]:
ada_plot=pd.melt(ada_preds_df,id_vars=['index'], value_vars=['country_destination','y_pred','cdest_pct','predicted_pct'])
ada_plot=ada_plot[ada_plot['variable'].str.contains("pct")]

et_plot=pd.melt(et_preds_df,id_vars=['index'], value_vars=['country_destination','y_pred','cdest_pct','predicted_pct'])
et_plot=et_plot[et_plot['variable'].str.contains("pct")]

ada_plot.head()

In [None]:
from plotnine import *

(ggplot(ada_plot)+
    aes(x='index',y='value')+
    geom_col()+
    facet_wrap('variable')+
    xlab("country")+
    ylab("percent"))


In [None]:
print(et_preds_df)
(ggplot(et_plot)+
    aes(x='index',y='value')+
    geom_col()+
    facet_wrap('variable')+
    xlab("country")+
    ylab("percent"))

Interestingly, although the Extra Trees classifier was less accurate overall, it does have some predictions for all classes and more closely resembles the distribution of the outcome variable in the original dataset. 

So why does the extra trees classifier have predictions for each class while AdaBoost does not? According to the scikit learn [documentation](https://scikit-learn.org/stable/modules/ensemble.html#forest), ExtraTrees (like RandomForest) is a "perturb-and-combine technique" specifically designed for trees, which means a diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ExtraTrees ensemble is then constructed as the averaged prediction of the individual classifiers.
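The averaging step can be checked directly on a small example. This sketch uses synthetic data from `make_classification` (not the AirBnb data) and compares the mean of the individual trees' class probabilities with what the ensemble reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic 3-class data, purely for illustration
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

et = ExtraTreesClassifier(n_estimators=25, max_depth=3, random_state=0).fit(X, y)

# Average the class probabilities of the individual randomized trees
per_tree = np.stack([tree.predict_proba(X) for tree in et.estimators_])
averaged = per_tree.mean(axis=0)

# Compare with the probabilities reported by the ensemble itself
matches = np.allclose(averaged, et.predict_proba(X))
print(matches)
```

Because each tree contributes a full probability vector, a minority class can still win for a given row once the votes are averaged, even if most single trees would not pick it.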

### Additional Feature Engineering from sessions data

Can adding additional features from user sessions history improve performance?

In [None]:
#base new features: total seconds and total number of actions per user
sess_feat = sessions.loc[:, ['user_id', 'secs_elapsed', 'action']] \
    .groupby('user_id') \
    .agg(total_secs=('secs_elapsed', 'sum'),
         total_actions=('action', 'count'))

sess_feat.head()



Now I will get the top 10 actions and append a count of each action for each user.

In [None]:
#get top 10 actions, ranked by the number of rows with a non-null action_type
#(nlargest already sorts, so no separate sort_values is needed)
top_actions=sessions \
    .groupby('action')\
    .count().nlargest(10,'action_type').reset_index()
print(top_actions['action'])


sessions=sessions.loc[sessions['action'].isin(top_actions['action'])]



In [None]:
# count each (user, action) pair, then pivot to wide with unstack
user_actions=sessions.groupby(['user_id', 'action']) \
        .size().unstack('action',fill_value=0).reset_index()

# reset_index() above yields a 'user_id' column, not 'index',
# so this drops the count column for the action named 'index'
user_actions=user_actions.drop(columns=['index'])

user_actions.head()
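For readers less familiar with the groupby/size/unstack pattern, here is the same idea on a made-up six-row log. The `toy_sessions` frame is hypothetical.

```python
import pandas as pd

# Hypothetical mini sessions log
toy_sessions = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "c", "c"],
    "action":  ["search", "search", "show", "show", "search", "update"],
})

wide = (toy_sessions
        .groupby(["user_id", "action"])
        .size()                            # one count per (user, action) pair
        .unstack("action", fill_value=0)   # actions become columns, gaps become 0
        .reset_index())
print(wide)
```

Each user ends up on one row with one count column per action, which is the shape needed to join back onto a one-row-per-user table.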

Joining the additional features (user_actions) back to train_users is a two-part join: 
1. First, join so that every user id in train_users appears in user_actions. Users with no session rows get 0 for every count, as they have not completed the actions.

In [None]:
user_actions=train_users[['id']].merge(user_actions,right_on="user_id",left_on="id",how="left").fillna(0)

user_actions=user_actions.drop('user_id',axis=1)

# add session features to user_actions
user_actions=user_actions.merge(sess_feat,left_on="id",right_on="user_id",how="left").fillna(0)

user_actions.head()
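The left join plus `fillna(0)` step can be seen on a tiny example. The `users` and `actions` frames below are made up; the point is only that a user with no session rows survives the join and gets a zero count.

```python
import pandas as pd

# Hypothetical users; "u3" never shows up in the session-derived table
users = pd.DataFrame({"id": ["u1", "u2", "u3"]})
actions = pd.DataFrame({"user_id": ["u1", "u2"], "search": [4, 1]})

joined = (users
          .merge(actions, left_on="id", right_on="user_id", how="left")
          .drop(columns="user_id")   # drop the duplicate key before filling
          .fillna(0))                # users without sessions get a 0 count
print(joined)
```

Note that the count column becomes a float after the join, because the missing value is introduced before it is filled.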

Check for all users...

In [None]:
assert user_actions['id'].nunique() == train_users['id'].nunique(), "Uh oh.."

2. Then join again to update train users to contain the additional features in user_actions

In [None]:
#join to train df
train_users=train_users \
    .merge(user_actions,on="id",how="left")


Now that I have a "new" dataset with additional features, I have to re-implement the pre-processing and classification on the new train_users df. Luckily I have a convenient pipeline!! I can just call the functions/steps I defined above. 
1. First update X_train, X_test, etc. to reflect additional features added
2. Then get new list of column names by type
3. Then apply pipeline to updated data and column types

I will again use AdaBoost and ExtraTrees to see if there is any improvement. However, since AdaBoost only predicted two classes using the default parameters, I will raise n_estimators from its default of 50 to 100.

In [None]:
#1
X_train,X_test,y_train,y_test=get_split(train_users)

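#2
numeric_features,categorical_features=get_coltypes(X_train)
# rebuild the preprocessor with the new column lists; otherwise it keeps the
# lists captured earlier and silently drops the new session features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])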

#3
ada_model,ada_pred=make_preds(AdaBoostClassifier(n_estimators=100))
et_model,et_pred=make_preds(ExtraTreesClassifier())

#preds summary df (see above)
ada_preds_df=compare_preds(ada_pred)
et_preds_df=compare_preds(et_pred)


Print a classification report for each model: precision, recall and F1 per destination, plus the number of test rows for each destination (the support column).

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, ada_pred))
print(classification_report(y_test, et_pred))
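With classes this unbalanced, overall accuracy is dominated by the largest classes, so it helps to look at metrics that weight every class equally. A small sketch with made-up labels (not the results of this analysis):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 8 NDF, 1 US, 1 FR
y_true = ["NDF"] * 8 + ["US", "FR"]
# A model that only ever predicts the majority class
y_majority = ["NDF"] * 10

acc = accuracy_score(y_true, y_majority)
# balanced accuracy is the mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true, y_majority)
# macro F1 averages the per-class F1 scores without weighting by class size
macro_f1 = f1_score(y_true, y_majority, average="macro", zero_division=0)

print(acc, bal_acc, macro_f1)
```

Here plain accuracy is 0.8 while balanced accuracy is only one third, because recall is 1 for NDF and 0 for the other two classes.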

**Feature Importance**

In [None]:
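headers = ["name", "score"]
# importances are reported per transformed column (after one-hot encoding),
# so take the names from the fitted preprocessor rather than X_train.columns
# (get_feature_names_out needs a reasonably recent scikit-learn)
ada_names = ada_model['preprocessor'].get_feature_names_out()
et_names = et_model['preprocessor'].get_feature_names_out()
ada_values = sorted(zip(ada_names, ada_model['classifier'].feature_importances_), key=lambda x: -x[1])
et_values = sorted(zip(et_names, et_model['classifier'].feature_importances_), key=lambda x: -x[1])

print(tabulate(ada_values, headers, tablefmt="plain"))
print(tabulate(et_values, headers, tablefmt="plain"))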