# Airbnb New User Bookings

Instead of waking to overlooked "Do not disturb" signs, [Airbnb](www.airbnb.com) travelers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.

In this competition, Airbnb challenges you to predict in which country a new user will make his or her first booking.

## Evaluation

The evaluation metric for this competition is NDCG ([Normalized discounted cumulative gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)) @k where k=5. NDCG is calculated as:

$DCG_k=\sum_{i=1}^k\frac{2^{rel_i}-1}{\log_2(i+1)}$,

$nDCG_k=\frac{DCG_k}{IDCG_k}$,

where $rel_i$ is the relevance of the result at position $i$.

$IDCG_k$ is the maximum possible (ideal) $DCG$ for a given set of queries. All $NDCG$ calculations are relative values on the interval 0.0 to 1.0.

For each new user, you are to make a maximum of 5 predictions on the country of the first booking. The ground truth country is marked with relevance = 1, while the rest have relevance = 0.

For example, if the destination for a particular user is FR, then the predictions score as follows ($IDCG_k = 1$ here, so $NDCG$ equals $DCG$):

[ FR ] gives $NDCG = \frac{2^1-1}{\log_2(1+1)} = 1.0$

[ US, FR ] gives $NDCG = \frac{2^0-1}{\log_2(1+1)} + \frac{2^1-1}{\log_2(2+1)} = 0 + \frac{1}{1.58496} = 0.6309$
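
The worked example above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the metric as defined in this section, not the competition's official scorer. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the sketch assumes each prediction list contains no duplicate countries.

```python
import math

def dcg_at_k(relevances, k=5):
    """DCG over the first k relevance scores (positions are 1-based)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(predicted, truth, k=5):
    """NDCG@k for one user with a single true country."""
    # Relevance is 1 for the true country and 0 for everything else
    relevances = [1 if country == truth else 0 for country in predicted]
    # Ideal ranking puts the true country first, so IDCG is 1.0
    ideal = dcg_at_k([1], k)
    return dcg_at_k(relevances, k) / ideal

print(ndcg_at_k(["FR"], "FR"))        # 1.0
print(ndcg_at_k(["US", "FR"], "FR"))  # roughly 0.6309
```

Because only one country is relevant per user, the score depends solely on the position of the true country within the five predictions, which is why ordering the predictions by confidence matters.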

## User Data Exploration

In [1]:
# Import modules for handling and plotting the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Set plot styles
sns.set_style("whitegrid", {'ytick.major.size': 8.0})
sns.set_context("poster", font_scale=1.1)

In [3]:
# Load the data into DataFrames
train_users = pd.read_csv('train_users_2.csv')
test_users = pd.read_csv('test_users.csv')

In [7]:
# print() with format() gives the same output on Python 2 and Python 3
print("We have {} users in the training set and {} in the test set.".format(train_users.shape[0], test_users.shape[0]))
print("In total we have {} users.".format(train_users.shape[0] + test_users.shape[0]))

We have 213451 users in the training set and 62096 in the test set.
In total we have 275547 users.


Let's combine our training and test sets.

In [8]:
# Merge train and test users
users = pd.concat((train_users, test_users), axis=0, ignore_index=True)

# The 'id' column could be dropped here, since we are not making predictions yet;
# the drop is left commented out, so 'id' stays in the DataFrame
#users.drop('id',axis=1, inplace=True)

users.head()

| | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | date_first_booking | first_affiliate_tracked | first_browser | first_device_type | gender | id | language | signup_app | signup_flow | signup_method | timestamp_first_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | direct | direct | NaN | NDF | 2010-06-28 | NaN | untracked | Chrome | Mac Desktop | -unknown- | gxn3p5htnn | en | Web | 0 | facebook | 20090319043255 |
| 1 | seo | google | 38.0 | NDF | 2011-05-25 | NaN | untracked | Chrome | Mac Desktop | MALE | 820tgsjxq7 | en | Web | 0 | facebook | 20090523174809 |
| 2 | direct | direct | 56.0 | US | 2010-09-28 | 2010-08-02 | untracked | IE | Windows Desktop | FEMALE | 4ft3gnwmtx | en | Web | 3 | basic | 20090609231247 |
| 3 | direct | direct | 42.0 | other | 2011-12-05 | 2012-09-08 | untracked | Firefox | Mac Desktop | FEMALE | bjjt8pjhuk | en | Web | 0 | facebook | 20091031060129 |
| 4 | direct | direct | 41.0 | US | 2010-09-14 | 2010-02-18 | untracked | Chrome | Mac Desktop | -unknown- | 87mebub9p4 | en | Web | 0 | basic | 20091208061105 |


In [9]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 275547 entries, 0 to 275546
Data columns (total 16 columns):
affiliate_channel          275547 non-null object
affiliate_provider         275547 non-null object
age                        158681 non-null float64
country_destination        213451 non-null object
date_account_created       275547 non-null object
date_first_booking         88908 non-null object
first_affiliate_tracked    269462 non-null object
first_browser              275547 non-null object
first_device_type          275547 non-null object
gender                     275547 non-null object
id                         275547 non-null object
language                   275547 non-null object
signup_app                 275547 non-null object
signup_flow                275547 non-null int64
signup_method              275547 non-null object
timestamp_first_active     275547 non-null int64
dtypes: float64(1), int64(2), object(13)
memory usage: 33.6+ MB


Let's clean up the data to get it into a more useful format and deal with the missing values.

In [12]:
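# Replace the '-unknown-' gender placeholder with NaN
# (assigning back avoids relying on an in-place replace through attribute access)
users['gender'] = users['gender'].replace('-unknown-', np.nan)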

In [13]:
# calculate % of 'NaN' values
users_nan = (users.isnull().sum() / users.shape[0]) * 100
users_nan[users_nan > 0].drop('country_destination')

age                        42.412365
date_first_booking         67.733998
first_affiliate_tracked     2.208335
gender                     46.990169
dtype: float64

There is a significant amount of missing information for **age** (about 42%) and **gender** (about 47%), which we will need to handle in order to make a proper analysis. The feature **date_first_booking** has about 68% NaN values, in part because this feature is not present for the test users. For that reason we won't need it in the modeling part.
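
One way to check where the missing **date_first_booking** values come from is to compute the missing share separately for the training and test rows. The sketch below uses hypothetical miniature frames (`toy_train`, `toy_test`) rather than the competition files, and it assumes the test frame carries no booking information at all. The `split` column is an illustrative helper, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames, not the competition data
toy_train = pd.DataFrame({
    "id": ["a", "b", "c"],
    "date_first_booking": ["2014-01-05", np.nan, "2014-02-11"],
    "country_destination": ["US", "NDF", "FR"],
})
toy_test = pd.DataFrame({"id": ["d", "e"]})  # assumed: no booking columns

# Tag each row with its origin before concatenating
toy_users = pd.concat(
    [toy_train.assign(split="train"), toy_test.assign(split="test")],
    axis=0, ignore_index=True,
)

# Percentage of missing date_first_booking values within each split
missing_by_split = (
    toy_users["date_first_booking"].isnull()
    .groupby(toy_users["split"]).mean() * 100
)
print(missing_by_split)
```

On the real data, the same grouping could be applied to `users` after adding a comparable origin column before the `pd.concat` call.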

In [20]:
users['age'].describe()

count    158681.000000
mean         47.145310
std         142.629468
min           1.000000
25%          28.000000
50%          33.000000
75%          42.000000
max        2014.000000
Name: age, dtype: float64

It appears there is some errant data in the user ages. Firstly, Airbnb requires users to be at least 18 years old. Secondly, we had better call the Guinness Book of World Records, because we have found at least one person who is 2014 years old.

In [24]:
# Count users with implausible ages (NaN ages compare as False and are not counted)
too_young = (users['age'] < 18).sum()
too_old = (users['age'] > 110).sum()

print("There are {} users under the age of 18, and {} users over the age of 110.".format(too_young, too_old))

There are 188 users under the age of 18, and 850 users over the age of 110.


In [25]:
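# Treat implausible ages as missing. Note that these cut-offs (above 100, below 13)
# differ from the 110 and 18 used for the counts in the previous cell.
users.loc[users['age'] > 100, 'age'] = np.nan
users.loc[users['age'] < 13, 'age'] = np.nan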
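
The effect of this cleaning step is easiest to see on a handful of values. The following is a minimal sketch on a hypothetical Series (`toy_ages`), using `Series.between` and `Series.where` as an alternative way to express the same bounds of 13 and 100. It is not the cell that was run above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values outside the plausible range
toy_ages = pd.Series([5.0, 17.0, 33.0, 104.0, 2014.0, np.nan])

# Keep ages inside [13, 100] (bounds inclusive); everything else becomes NaN
cleaned = toy_ages.where(toy_ages.between(13, 100))
print(cleaned.tolist())
```

Values that were already NaN stay NaN, because `between` evaluates to False for them.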