# Airbnb New User Bookings

## Step 1: Frame the Problem

- <b>Objective: </b>In this challenge, we are given a list of users along with information about their activity on the website, such as the date the account was created, the time the user was first active, and the country in which the user made a booking. We also have some personal information about each user. Our task is to build a machine learning model that predicts the country of a new user's first booking destination.


- <b>Data: </b>The training dataset contains the following features:
    - id: user id
    - date_account_created: the date of account creation
    - timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
    - date_first_booking: date of first booking
    - gender
    - age
    - signup_method: whether the user signed up from the website or through Facebook, Gmail, etc.
    - signup_flow: the page a user came to sign up from
    - language: international language preference
    - affiliate_channel: what kind of paid marketing
    - affiliate_provider: where the marketing is e.g. google, craigslist, other
    - first_affiliate_tracked: the first marketing the user interacted with before signing up
    - signup_app
    - first_device_type
    - first_browser
    - country_destination: this is the target variable. There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'. 
    <br>Note: 
        - 'other' means there was a booking, but to a country not included in the list
        - 'NDF' means there wasn't a booking.


- Three other files are provided along with the train and test datasets:
    1. sessions.csv - this file contains the web session logs for each user
    2. countries.csv - summary statistics of destination countries in this dataset and their locations
    3. age_gender_bkts.csv - summary statistics of users' age group, gender, country of destination

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv('../input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip')
print(data.shape)
data.head()

## Step 2: Data Exploration

In [None]:
data_explore = data.copy()

In [None]:
data_explore = data_explore.drop(['id'], axis=1)

In [None]:
data_explore.info()

In [None]:
dac = np.vstack(data_explore.date_account_created.astype(str).apply(lambda x: list(map(int, x.split('-')))).values)
data_explore['dac_year'] = dac[:,0]
data_explore['dac_month'] = dac[:,1]
data_explore['dac_day'] = dac[:,2]
data_explore = data_explore.drop(['date_account_created'], axis=1)
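The year/month/day split above can also be written with pandas' datetime accessor. The following is only a sketch on a toy frame (the three dates are made-up values, not rows from the dataset), and it assumes the dates are in `YYYY-MM-DD` format as in the cell above.

```python
import pandas as pd

# Toy frame standing in for the real training data (hypothetical values).
toy = pd.DataFrame({"date_account_created": ["2010-06-28", "2011-05-25", "2014-01-02"]})

# Parse the strings once, then read the parts through the .dt accessor.
dac = pd.to_datetime(toy["date_account_created"], format="%Y-%m-%d")
toy["dac_year"] = dac.dt.year
toy["dac_month"] = dac.dt.month
toy["dac_day"] = dac.dt.day
toy = toy.drop(columns=["date_account_created"])

print(toy)
```

Parsing with `pd.to_datetime` also fails loudly on malformed dates, which the string split would only catch if a part is not an integer.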

In [None]:
data_explore[data_explore['country_destination']!='NDF']['date_first_booking'].isna().sum()

- This confirms that the first booking date is never missing when a booking was made.

In [None]:
data_explore.date_first_booking = data_explore.date_first_booking.fillna('2000-01-01')
first_booking = np.vstack(data_explore.date_first_booking.astype(str).apply(lambda x: list(map(int, x.split('-')))).values)
data_explore['first_booking_year'] = first_booking[:,0]
data_explore['first_booking_month'] = first_booking[:,1]
data_explore['first_booking_day'] = first_booking[:,2]
data_explore = data_explore.drop(['date_first_booking'], axis=1)

In [None]:
data_explore.nunique()

In [None]:
data_explore.describe()

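- The maximum age value is 2014, which is not valid. Hence I will replace all such values (anything above 1000) with an age drawn at random from the range 28 to 42, and fill the missing ages in the same way.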

In [None]:
data_explore.isna().sum()

In [None]:
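age_values = data_explore.age.values
# np.random.randint(28, 43) returns a single integer in [28, 42], so every invalid
# age receives the same drawn value; missing ages get a second, separately drawn value.
data_explore['age'] = np.where(age_values>1000, np.random.randint(28, 43), age_values)
data_explore['age'] = data_explore['age'].fillna(np.random.randint(28, 43))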

data_explore['first_affiliate_tracked'] = data_explore['first_affiliate_tracked'].fillna(data_explore['first_affiliate_tracked'].mode().values[0])
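An alternative to the random replacement is to use the median of the valid ages for both the invalid and the missing entries. This is a minimal sketch on a made-up series, not the cleaning actually applied in this notebook; the 1000 threshold is the same one used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: two impossible values and one missing entry.
ages = pd.Series([25.0, 2014.0, 40.0, np.nan, 31.0, 1949.0])

# Treat implausible values as missing, then fill every gap with the
# median of the remaining valid ages.
valid = ages.where(ages <= 1000)
median_age = valid.median()
cleaned = valid.fillna(median_age)

print(median_age)
print(cleaned.tolist())
```

Computing the median after masking the invalid values keeps entries such as 2014 from distorting the statistic itself.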

- There are several categorical columns. Let's explore them.

In [None]:
data_explore['language'].value_counts()[:10]

In [None]:
def plot_histogram(data):
    ax = plt.gca()
    counts, _, patches = ax.hist(data)
    for count, patch in zip(counts, patches):
        if count>0:
            ax.annotate(str(int(count)), xy=(patch.get_x(), patch.get_height()+5))
    if data.name:
        plt.xlabel(data.name)

In [None]:
plt.figure(figsize=(8, 5))
plot_histogram(data_explore['age'])
plt.xlim(15, 100)
plt.show()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 3, 1)
grp = data_explore[['gender', 'age']].groupby(by='gender').count()
plt.pie(grp.values, labels=list(grp.index), shadow=True, startangle=0,
        autopct='%1.1f%%', wedgeprops={'edgecolor':'black'})
plt.title('Gender')
plt.subplot(1, 3, 2)
grp = data_explore[['dac_year', 'age']].groupby(by='dac_year').count()
plt.pie(grp.values, labels=list(grp.index), shadow=True, startangle=0,
        autopct='%1.1f%%', wedgeprops={'edgecolor':'black'})
plt.title('Account Created: Year')
plt.subplot(1, 3, 3)
grp = data_explore[['dac_month', 'age']].groupby(by='dac_month').count()
plt.pie(grp.values, labels=list(grp.index), shadow=True, startangle=0,
        autopct='%1.1f%%', wedgeprops={'edgecolor':'black'})
plt.title('Account Created: Month')
plt.show()

In [None]:
ax = sns.countplot(x='affiliate_channel', data=data_explore)
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xticks(rotation=-45)
plt.show()

In [None]:
plt.figure(figsize=(16, 7))
plt.subplot(1, 2, 1)
grp = data_explore[['affiliate_channel', 'age']].groupby(by='affiliate_channel').count()
plt.pie(grp.values, labels=list(grp.index), shadow=True, startangle=0,
        autopct='%1.1f%%', wedgeprops={'edgecolor':'black'})
plt.title('Affiliate Channels')
plt.subplot(1, 2, 2)
ax = sns.countplot(x='affiliate_provider', data=data_explore)
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.xticks(rotation=-45)
plt.xlim(-0.5, 10.5)
plt.title('Affiliate Providers')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='country_destination', data=data_explore)
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))

- In the further analysis I will focus on the users who have made a booking.

In [None]:
data_explore_booked = data_explore[data_explore['country_destination']!='NDF']
data_explore.shape, data_explore_booked.shape

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plot_histogram(data_explore_booked['dac_year'])
plt.subplot(1, 2, 2)
plot_histogram(data_explore_booked['dac_month'])
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plot_histogram(data_explore_booked['first_booking_year'])
plt.subplot(1, 2, 2)
plot_histogram(data_explore_booked['first_booking_month'])
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plot_histogram(data_explore_booked[data_explore_booked['country_destination']=='US']['first_booking_year'])
plt.title('# of Booking in US')
plt.subplot(1, 2, 2)
plot_histogram(data_explore_booked[data_explore_booked['country_destination']=='US']['first_booking_month'])
plt.title('# of Booking in US')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plot_histogram(data_explore_booked[data_explore_booked['country_destination']=='FR']['first_booking_year'])
plt.title('# of Booking in France')
plt.subplot(1, 2, 2)
plot_histogram(data_explore_booked[data_explore_booked['country_destination']=='FR']['first_booking_month'])
plt.title('# of Booking in France')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='country_destination', hue='gender', data=data_explore_booked)
plt.title('Gender distribution across destination countries')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='first_booking_year', hue='gender', data=data_explore_booked[data_explore_booked['country_destination']=='US'])
plt.title('# of Travellers to USA')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='first_booking_year', hue='gender', data=data_explore_booked[data_explore_booked['country_destination']=='FR'])
plt.title('# of Travellers to France')
plt.show()

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='country_destination', y='age', hue='gender', data=data_explore_booked)
plt.ylim(15, 60)
plt.legend(loc='lower right')
plt.show()

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='dac_year', y='age', hue='gender', data=data_explore_booked)
plt.ylim(15, 60)
plt.legend(loc='lower right')
plt.show()

- Observations:
    - The median age of people creating accounts decreases over the years, which suggests that the website is attracting more young people.
    - In all years, the median age of females is higher than that of males.
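Observations like these can be checked numerically instead of being read off the box plot. The sketch below uses a small made-up frame with the same column names (`dac_year`, `gender`, `age`); the numbers are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names match the notebook.
toy = pd.DataFrame({
    "dac_year": [2012, 2012, 2012, 2013, 2013, 2013],
    "gender": ["FEMALE", "FEMALE", "MALE", "FEMALE", "MALE", "MALE"],
    "age": [34, 36, 33, 31, 30, 28],
})

# Median age per signup year and gender, one column per gender.
median_age = toy.groupby(["dac_year", "gender"])["age"].median().unstack("gender")

print(median_age)
```

Running the same `groupby` on `data_explore_booked` would give the exact medians behind the plot.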

## Step 3: Data Preprocessing

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [None]:
X = data.drop(columns=['country_destination']).copy()
y = data['country_destination'].copy()

label_enc = LabelEncoder()
y = label_enc.fit_transform(y)
X.shape, y.shape

In [None]:
label_enc.classes_
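`LabelEncoder` assigns integer codes in sorted order of the labels, and `classes_` is what maps the codes back to country names later on. A small sketch with a handful of made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["US", "NDF", "FR", "US", "other"])

# classes_ is sorted; uppercase strings sort before lowercase ones.
print(list(enc.classes_))
print(codes.tolist())
print(list(enc.inverse_transform([0, 3])))
```

Because the codes depend on the sort order, the same fitted encoder has to be reused when decoding predictions.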

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

In [None]:
cat_attrs = ['gender', 'language', 'affiliate_channel', 'affiliate_provider']

- I will drop the columns that describe the user's first activity on the website, the device used, and the date of the first booking, since I consider that information redundant for making predictions.

In [None]:
pre_process = ColumnTransformer([('drop_cols', 'drop', ['id', 'date_first_booking', 'date_account_created', 'signup_method', 'timestamp_first_active', 
                                                        'signup_app', 'first_device_type', 'first_browser', 'first_affiliate_tracked', 'signup_flow']),
                                 ('num_imputer', SimpleImputer(strategy='median'), ['age']),
                                 ('cat_imputer', SimpleImputer(strategy='most_frequent'), cat_attrs)], remainder='passthrough')

X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)
X_train_transformed.shape, X_test_transformed.shape

In [None]:
X_train_transformed = pd.DataFrame(X_train_transformed, columns=['age', 'gender', 'language', 'affiliate_channel', 'affiliate_provider'])
X_test_transformed = pd.DataFrame(X_test_transformed, columns=['age', 'gender', 'language', 'affiliate_channel', 'affiliate_provider'])
X_train_transformed.shape, X_test_transformed.shape
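The column names passed to `pd.DataFrame` above rely on the output order of `ColumnTransformer`: columns come out in the order the transformers are listed, not in the order of the input frame. The sketch below illustrates this on a toy frame with made-up values and a reduced set of columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame: 'gender' comes before 'age' in the input.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gender": ["MALE", np.nan, "MALE"],
    "age": [30.0, np.nan, 40.0],
})

ct = ColumnTransformer([
    ("drop_cols", "drop", ["id"]),
    ("num_imputer", SimpleImputer(strategy="median"), ["age"]),
    ("cat_imputer", SimpleImputer(strategy="most_frequent"), ["gender"]),
])
out = ct.fit_transform(toy)

# Output columns follow the transformer order: age first, then gender.
print(out)
```

This is why the list of names starts with `'age'` followed by the categorical attributes.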

## Step 4: Modelling

- The dataset contains many categorical features. One-hot encoding them would increase the dimensionality and, in turn, the training time.
- For a dataset with many categorical features, CatBoost is a suitable algorithm: it handles categorical features automatically using various statistical methods.
- Evaluation metric will be Normalized Discounted Cumulative Gain (NDCG).
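In this problem every user has exactly one true destination, so the metric simplifies considerably. As a sketch, using the usual DCG definition, where $rel_i$ is 1 if the country predicted at rank $i$ is the true one and 0 otherwise (the symbols are my own notation):

$$
DCG_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad NDCG_k = \frac{DCG_k}{IDCG_k}
$$

With a single relevant country, $IDCG_k = 1$, so a user whose true country is predicted at rank $r \le k$ scores $1 / \log_2(r + 1)$: 1.0 at rank 1, about 0.63 at rank 2, 0.5 at rank 3, and 0 if the true country is not among the top $k$ predictions.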

In [None]:
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

Thanks to the [NDCG Scorer](https://www.kaggle.com/davidgasquez/ndcg-scorer) kernel, from which the NDCG scorer function is taken.

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import make_scorer
# ndcg_score is defined below; the scorer is created after that definition.

def dcg_score(y_true, y_score, k=5):
    """Discounted cumulative gain (DCG) at rank K.

    Parameters
    ----------
    y_true : array, shape = [n_samples]
        Ground truth (true relevance labels).
    y_score : array, shape = [n_samples, n_classes]
        Predicted scores.
    k : int
        Rank.

    Returns
    -------
    score : float
    """
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])

    gain = 2 ** y_true - 1

    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gain / discounts)


def ndcg_score(ground_truth, predictions, k=5):
    """Normalized discounted cumulative gain (NDCG) at rank K.

    Normalized Discounted Cumulative Gain (NDCG) measures the performance of a
    recommendation system based on the graded relevance of the recommended
    entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal
    ranking of the entities.

    Parameters
    ----------
    ground_truth : array, shape = [n_samples]
        Ground truth (true labels represented as integers).
    predictions : array, shape = [n_samples, n_classes]
        Predicted probabilities.
    k : int
        Rank.

    Returns
    -------
    score : float

    Example
    -------
    >>> ground_truth = [1, 0, 2]
    >>> predictions = [[0.15, 0.55, 0.2], [0.7, 0.2, 0.1], [0.06, 0.04, 0.9]]
    >>> score = ndcg_score(ground_truth, predictions, k=2)
    1.0
    >>> predictions = [[0.9, 0.5, 0.8], [0.7, 0.2, 0.1], [0.06, 0.04, 0.9]]
    >>> score = ndcg_score(ground_truth, predictions, k=2)
    0.6666666666
    """
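    predictions = np.asarray(predictions)
    lb = LabelBinarizer()
    # Fit on the class indices (one column per class), not on the number of samples
    lb.fit(range(predictions.shape[1]))
    T = lb.transform(ground_truth)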

    scores = []

    # Iterate over each y_true and compute the DCG score
    for y_true, y_score in zip(T, predictions):
        actual = dcg_score(y_true, y_score, k)
        best = dcg_score(y_true, y_true, k)
        score = float(actual) / float(best)
        scores.append(score)

    return np.mean(scores)


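# NDCG scorer function. Note: `needs_proba=True` is deprecated since scikit-learn 1.4;
# newer releases expect `response_method='predict_proba'` instead.
ndcg_scorer = make_scorer(ndcg_score, needs_proba=True, k=5)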
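Since each user has exactly one true class, the score can also be computed without binarizing the labels. The function below is my own vectorized sketch (the name `ndcg_at_k_single_label` is hypothetical and is not part of the referenced kernel); it is checked against the two examples from the docstring above. It may differ from the loop version when probabilities are tied, because ties are broken in a different order.

```python
import numpy as np

def ndcg_at_k_single_label(y_true, proba, k=5):
    """Mean NDCG@k when every sample has exactly one correct class."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    # Class indices per row, highest probability first, truncated to k.
    top_k = np.argsort(-proba, axis=1)[:, :k]
    # True where the correct class sits at a given rank.
    hits = top_k == y_true[:, None]
    # Discount 1 / log2(rank + 1) for ranks 1..k.
    discounts = 1.0 / np.log2(np.arange(top_k.shape[1]) + 2)
    return float((hits * discounts).sum(axis=1).mean())

ground_truth = [1, 0, 2]
perfect = [[0.15, 0.55, 0.2], [0.7, 0.2, 0.1], [0.06, 0.04, 0.9]]
partial = [[0.9, 0.5, 0.8], [0.7, 0.2, 0.1], [0.06, 0.04, 0.9]]

print(ndcg_at_k_single_label(ground_truth, perfect, k=2))
print(ndcg_at_k_single_label(ground_truth, partial, k=2))
```

The normalizing term is always 1 in this setting, so no separate "best" DCG has to be computed.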

In [None]:
def grid_search(model, grid_param):
    print("Obtaining Best Model for {}".format(model.__class__.__name__))
    # Local name differs from the enclosing function to avoid shadowing it
    search = GridSearchCV(model, grid_param, cv=kf, scoring=ndcg_scorer, return_train_score=True, n_jobs=-1)
    search.fit(X_train_transformed, y_train)

    print("Best Parameters: ", search.best_params_)
    print("Best Score: ", search.best_score_)

    cvres = search.cv_results_
    print("Results for each run of {}...".format(model.__class__.__name__))
    for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
        print(train_mean_score, test_mean_score, params)

    return search.best_estimator_

In [None]:
results = []
    
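def performance_measures(model, store_results=True):
    # Both scores are 5-fold cross-validation estimates: the model is re-fitted on
    # folds of the given data instead of being evaluated in its already-fitted state.
    # `store_results` is currently unused.
    train_ndcg = cross_val_score(model, X_train_transformed, y_train, scoring=ndcg_scorer, cv=kf, n_jobs=-1)
    test_ndcg = cross_val_score(model, X_test_transformed, y_test, scoring=ndcg_scorer, cv=kf, n_jobs=-1)
    print("Mean Train NDCG: {}\nMean Test NDCG: {}".format(train_ndcg.mean(), test_ndcg.mean()))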

In [None]:
def plot_feature_importance(feature_columns, importance_values,top_n_features=0):
    feature_imp = [ col for col in zip(feature_columns, importance_values)]
    feature_imp.sort(key=lambda x:x[1], reverse=True)

    if top_n_features:
        imp = pd.DataFrame(feature_imp[0:top_n_features], columns=['feature', 'importance'])
    else:
        imp = pd.DataFrame(feature_imp, columns=['feature', 'importance'])
    plt.figure(figsize=(10, 8))
    sns.barplot(y='feature', x='importance', data=imp, orient='h')
    plt.title('Most Important Features', fontsize=16)
    plt.ylabel("Feature", fontsize=16)
    plt.xlabel("")
    plt.show()

In [None]:
from catboost import CatBoostClassifier


catboost_grid_params = [{'iterations':[500, 1000, 1500], 'depth':[4, 6, 8, 10],}]

catboost_clf = CatBoostClassifier(task_type="GPU", loss_function='MultiClass', bagging_temperature=0.3, 
                                  cat_features=[1, 2, 3, 4], random_state=42, verbose=0)

grid_search_results = catboost_clf.grid_search(catboost_grid_params,
            X_train_transformed,
            y_train,
            cv=5,
            partition_random_seed=42,
            calc_cv_statistics=True,
            search_by_train_test_split=True,
            refit=True,
            shuffle=True,
            stratified=None,
            train_size=0.8,
            verbose=0,
            plot=False)

In [None]:
grid_search_results['params']

In [None]:
catboost_clf.is_fitted()

In [None]:
catboost_clf.feature_importances_

In [None]:
plot_feature_importance(['age', 'gender', 'language', 'affiliate_channel', 'affiliate_provider'], catboost_clf.feature_importances_)

In [None]:
performance_measures(catboost_clf, store_results=False)

## Step 5: Prediction Analysis

- Let's evaluate the model's predictions on the overall dataset.

In [None]:
X_transformed = pre_process.transform(X)
# Multiclass predictions may come back as an (n, 1) array, so flatten before decoding
predicted_country = catboost_clf.predict(X_transformed).ravel()
predicted_country = label_enc.inverse_transform(predicted_country)
data['predicted_country'] = predicted_country

In [None]:
plt.figure(figsize=(10, 10))
plt.subplot(2, 1, 1)
ax = sns.countplot(x='country_destination', data=data)
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.subplot(2, 1, 2)
ax = sns.countplot(x='predicted_country', data=data)
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))

## Step 6: Make submission

In [None]:
final_model = Pipeline([('pre_process', pre_process),
                        ('catboost_clf', catboost_clf)])
final_model.fit(X_train, y_train)

In [None]:
test_data = pd.read_csv('../input/airbnb-recruiting-new-user-bookings/test_users.csv.zip')
test_data.head()

In [None]:
test_data.info()

In [None]:
predictions = final_model.predict_proba(test_data)

In [None]:
# Take the 5 classes with the highest probabilities for each user
id_test = list(test_data.id)
ids = []
countries = []
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    countries += label_enc.inverse_transform(np.argsort(predictions[i])[::-1])[:5].tolist()

In [None]:
output = pd.DataFrame(np.column_stack((ids, countries)), columns=['id', 'country'])
output.head()
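The same top-5 table can be built without a Python loop. This is a sketch on made-up probabilities for two users and six classes; `classes` stands in for `label_enc.classes_` and `proba` for the output of `predict_proba`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for label_enc.classes_, the test ids and predict_proba output.
classes = np.array(["ES", "FR", "IT", "NDF", "US", "other"])
ids = np.array(["u1", "u2"])
proba = np.array([
    [0.01, 0.05, 0.03, 0.60, 0.29, 0.02],
    [0.02, 0.10, 0.03, 0.20, 0.50, 0.15],
])

# Indices of the 5 most probable classes per user, best first.
top5 = np.argsort(-proba, axis=1)[:, :5]

# One row per (user, country) pair, users repeated 5 times each.
submission = pd.DataFrame({
    "id": np.repeat(ids, 5),
    "country": classes[top5].ravel(),
})

print(submission)
```

On the full test set this avoids calling `inverse_transform` once per user, which is where most of the time in the loop is likely spent.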

In [None]:
output.to_csv("./submission.csv", index=False)