# Past Kaggle Competition: Airbnb New User Bookings

## Download data
Data for this project are downloaded from the following link:<br/>
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data

I learned from the following Kaggle scripts:
1. https://www.kaggle.com/kevinwu06/airbnb-exploratory-analysis
2. https://www.kaggle.com/davidgasquez/user-data-exploration
3. https://www.kaggle.com/svpons/feature-engineering
4. https://www.kaggle.com/svpons/three-level-classification-architecture

The report format follows:
https://github.com/udacity/machine-learning/blob/master/projects/capstone/capstone_report_template.md

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import seaborn as sns
import datetime
from scipy import sparse
# import warnings
# warnings.filterwarnings('ignore')    # suppress warnings for a clean demo

# Show every row and column when displaying DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb

### Data Preprocessing

#### Users dataset (2 files)

In [2]:
train_users_data = pd.read_csv("train_users_2.csv")
test_users_data = pd.read_csv("test_users.csv")

In [3]:
num_rows_train, num_cols_train = train_users_data.shape
print("There are {:,} rows and {:,} columns in the train_users data.".format(num_rows_train, num_cols_train))
num_rows_test, num_cols_test = test_users_data.shape
print("There are {:,} rows and {:,} columns in the test_users data.".format(num_rows_test, num_cols_test))

There are 213,451 rows and 16 columns in the train_users data.
There are 62,096 rows and 15 columns in the test_users data.


#### User Web Sessions Record (1 file)

The sessions file contains, for each `user_id`, the action, action type, action detail, device type, and the time elapsed since the previous action. The sessions data only go back to January 1, 2014, so they cover just the most recent portion of the users data, which date back to 2010.

In [4]:
sessions_data = pd.read_csv("sessions.csv")

In [5]:
num_rows_sessions, num_cols_sessions = sessions_data.shape
print("There are {:,} rows and {:,} columns in the sessions data".format(num_rows_sessions, num_cols_sessions))

There are 10,567,737 rows and 6 columns in the sessions data


In [6]:
sessions_data = sessions_data[sessions_data.user_id.notnull()]

In [9]:
sessions_total_secs = sessions_data.loc[:,['user_id', 'secs_elapsed']].groupby('user_id').sum()

In [10]:
sessions_action_counts = sessions_data.loc[:,['user_id', 'action']].groupby('user_id').count()
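
The two per-user aggregates above can also be computed in one `groupby` pass with named aggregation. The following is a minimal sketch on a toy frame; the rows and the output column names `total_secs` and `action_count` are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sessions.csv; the values are invented for illustration.
toy_sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", None],
    "action": ["search", "show", "search", "show"],
    "secs_elapsed": [10.0, np.nan, 5.0, 7.0],
})

# Drop rows without a user, as in the cell above.
toy_sessions = toy_sessions[toy_sessions.user_id.notnull()]

# One groupby pass: total elapsed seconds (NaN is skipped by sum)
# and the number of non-null actions per user.
session_feats = toy_sessions.groupby("user_id").agg(
    total_secs=("secs_elapsed", "sum"),
    action_count=("action", "count"),
)
print(session_feats)
```

The result is indexed by `user_id`, like `sessions_total_secs` and `sessions_action_counts`.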

#### Datetime features & `country_destination`

In [8]:
train_users_data.loc[:,'date_account_created'] = pd.to_datetime(train_users_data.date_account_created)
test_users_data.loc[:,'date_account_created'] = pd.to_datetime(test_users_data.date_account_created)

In [9]:
# normalize() pins each value to midnight without a locale-dependent strftime('%x') round trip
train_users_data['date_account_created'] = pd.to_datetime(train_users_data.date_account_created).dt.normalize()
test_users_data['date_account_created'] = pd.to_datetime(test_users_data.date_account_created).dt.normalize()

In [10]:
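# Assumes timestamp_first_active is stored as an integer of the form YYYYMMDDHHMMSS
train_users_data['timestamp_first_active'] = pd.to_datetime(train_users_data.timestamp_first_active.astype(str), format='%Y%m%d%H%M%S')
test_users_data['timestamp_first_active'] = pd.to_datetime(test_users_data.timestamp_first_active.astype(str), format='%Y%m%d%H%M%S')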

#### Combine train & test data to preprocess together

In [11]:
id_test = test_users_data['id']
train_users_data.drop(['date_first_booking'], axis = 1, inplace = True)
test_users_data.drop(['date_first_booking'], axis = 1, inplace = True)

In [14]:
full_data = pd.concat([train_users_data, test_users_data], axis = 0, ignore_index = True)
full_data.drop(['country_destination'], axis = 1, inplace = True)

In [22]:
full_data.shape

(275547, 15)

#### Convert the continuous `Age` into categories

In [16]:
# set age outside of valid age range (10, 100] to nan
valid_age_index = full_data.age.apply(lambda x: 10 < x <= 100)
full_data.loc[~valid_age_index, 'age'] = np.nan

In [17]:
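# pd.cut defaults to right-closed bins such as (25, 30], so age 30 is labelled '25-29'
# and age 25 is labelled '20-24'; pass right=False if the bins should match the labels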
age_gender_bkts_data = pd.read_csv("age_gender_bkts.csv")
age_bins = np.arange(10., 105., 5.).tolist()
age_names = list(reversed(age_gender_bkts_data.age_bucket.unique()))[2:-1]
full_data['age_bucket_col'] = pd.cut(full_data.age, age_bins, labels = age_names)
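
To make the bin-edge convention concrete, here is a small sketch comparing the default `right=True` with `right=False` on a few ages. The labels are written out by hand here (an assumption for the example); the notebook reads them from `age_gender_bkts.csv`.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 29, 30, 100])
bins = np.arange(10., 105., 5.).tolist()      # edges 10, 15, ..., 100
# Hand-written labels '10-14', '15-19', ..., '95-99' (18 labels for 18 bins).
labels = ["{}-{}".format(lo, lo + 4) for lo in range(10, 100, 5)]

right_closed = pd.cut(ages, bins, labels=labels)               # bins like (25, 30]
left_closed = pd.cut(ages, bins, labels=labels, right=False)   # bins like [25, 30)

print(pd.DataFrame({"age": ages,
                    "right=True": right_closed,
                    "right=False": left_closed}))
```

With `right=False` the labels line up with the ages, but age 100 falls outside the last bin `[95, 100)` and becomes NaN, so the bins would need one more edge if that age should be kept.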

#### Add features engineered using `sessions` data

In [19]:
# sessions_total_secs['id'] = sessions_total_secs.index
# sessions_action_counts['id'] = sessions_action_counts.index

In [20]:
# full_data = pd.merge(full_data, sessions_total_secs, how = 'left', on = 'id')
# full_data = pd.merge(full_data, sessions_action_counts, how = 'left', on = 'id')
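
The merge in the commented-out cells can also be done without copying the index into an `id` column, by joining the users' `id` against the aggregates' index. This is only a sketch on toy frames with invented values; the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins; the values are invented for illustration.
toy_users = pd.DataFrame({"id": ["u1", "u2", "u3"]})
toy_total_secs = pd.DataFrame({"secs_elapsed": [10.0, 5.0]},
                              index=pd.Index(["u1", "u2"], name="user_id"))
toy_action_counts = pd.DataFrame({"action": [2, 1]},
                                 index=pd.Index(["u1", "u2"], name="user_id"))

# Left-join on the users' `id` against the aggregates' index, so users
# without any session rows are kept and get NaN in the new columns.
merged = (toy_users
          .merge(toy_total_secs, how="left", left_on="id", right_index=True)
          .merge(toy_action_counts, how="left", left_on="id", right_index=True))
print(merged)
```

Because the sessions data cover only recent users, most rows of the real `full_data` would end up with NaN in these columns after a left join.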

#### Add features from datetime feature

In [21]:
full_data.columns

Index(['affiliate_channel', 'affiliate_provider', 'age',
       'date_account_created', 'first_affiliate_tracked', 'first_browser',
       'first_device_type', 'gender', 'id', 'language', 'signup_app',
       'signup_flow', 'signup_method', 'timestamp_first_active',
       'age_bucket_col'],
      dtype='object')

In [23]:
# full_data.loc[:,'date_account_created'] = pd.to_datetime(full_data.date_account_created)
# full_data.loc[:,'date_account_created'] = full_data.date_account_created.apply(lambda x: pd.to_datetime(x.strftime('%x')))
# full_data.loc[:,'timestamp_first_active'] = pd.to_datetime(full_data.timestamp_first_active.apply(str))

full_data['year_account_created'] = full_data.date_account_created.apply(lambda x: x.year)
full_data['month_account_created'] = full_data.date_account_created.apply(lambda x: x.month)
full_data['dayinmonth_account_created'] = full_data.date_account_created.apply(lambda x: x.day)
# Monday is 1 and Sunday is 7
full_data['dayinweek_account_created'] = full_data.date_account_created.apply(lambda x: x.isoweekday())

In [24]:
# Whole-day gap between first activity and account creation (a Timedelta column)
full_data['days_delay'] = full_data.date_account_created - pd.to_datetime(full_data.timestamp_first_active).dt.normalize()
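
`days_delay` is a Timedelta column, which is why the one-hot feature names printed at the end of this notebook contain nanosecond counts. If plain integers are preferred, the delay can be converted to a number of days. A minimal sketch on a toy frame with invented dates:

```python
import pandas as pd

# Toy stand-in for full_data; the dates are invented for illustration.
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2014-01-03", "2014-01-01"]),
    "timestamp_first_active": pd.to_datetime(["2014-01-01 23:15:00",
                                              "2014-01-01 08:00:00"]),
})

# normalize() truncates each timestamp to midnight, so the difference
# is always a whole number of days.
delay = toy.date_account_created - toy.timestamp_first_active.dt.normalize()
toy["days_delay"] = delay.dt.days    # integer days instead of a Timedelta
print(toy)
```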

In [25]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

cal = calendar()
holidays = cal.holidays(start = full_data.date_account_created.min(), 
                        end = full_data.date_account_created.max())

full_data['Holiday'] = full_data.date_account_created.isin(holidays)
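
One detail worth knowing about this flag: `USFederalHolidayCalendar` returns the observed dates, so a holiday that falls on a weekend is flagged on the nearest workday instead. The sketch below uses three hand-picked dates to show this.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# July 4, 2014 was a Friday; July 4, 2015 was a Saturday,
# so it was observed on Friday, July 3, 2015.
dates = pd.Series(pd.to_datetime(["2014-07-04", "2015-07-03", "2015-07-04"]))

holidays = USFederalHolidayCalendar().holidays(start="2014-01-01",
                                               end="2015-12-31")
is_holiday = dates.isin(holidays)
print(pd.DataFrame({"date": dates, "Holiday": is_holiday}))
```

Accounts created on the calendar date of a weekend holiday are therefore not marked by the `Holiday` column.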

#### Label Encoding

In [62]:
# Cast the categorical column to plain strings (NaN becomes 'nan') so it can be label-encoded
full_data['age_bucket_col'] = full_data['age_bucket_col'].astype(str)

In [63]:
# LabelEncoder raises "TypeError: unorderable types: float() > str()" if NaNs remain, so fill them first.
# fillna raises "ValueError: fill value must be in categories" on a categorical column,
# which is why 'age_bucket_col' was cast to str above.
label_encoder = LabelEncoder()

cat_feats = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 
            'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser', 
             'year_account_created', 'month_account_created', 'dayinmonth_account_created', 
             'dayinweek_account_created', 'days_delay', 'Holiday', 'age_bucket_col']
LE_vars=[]
LE_map=dict()
for cat_var in cat_feats:
    print ("Label Encoding %s" % (cat_var))
    LE_var=cat_var+'_le'
    full_data[LE_var]=label_encoder.fit_transform(full_data[cat_var].fillna('NaN'))
    LE_vars.append(LE_var)
    LE_map[cat_var]=label_encoder.classes_
    
print("Label-encoded features: %s" % (LE_vars))

Label Encoding gender
Label Encoding signup_method
Label Encoding signup_flow
Label Encoding language
Label Encoding affiliate_channel
Label Encoding affiliate_provider
Label Encoding first_affiliate_tracked
Label Encoding signup_app
Label Encoding first_device_type
Label Encoding first_browser
Label Encoding year_account_created
Label Encoding month_account_created
Label Encoding dayinmonth_account_created
Label Encoding dayinweek_account_created
Label Encoding days_delay
Label Encoding Holiday
Label Encoding age_bucket_col
Label-encoded features: ['gender_le', 'signup_method_le', 'signup_flow_le', 'language_le', 'affiliate_channel_le', 'affiliate_provider_le', 'first_affiliate_tracked_le', 'signup_app_le', 'first_device_type_le', 'first_browser_le', 'year_account_created_le', 'month_account_created_le', 'dayinmonth_account_created_le', 'dayinweek_account_created_le', 'days_delay_le', 'Holiday_le', 'age_bucket_col_le']


#### One-hot Encoding

In [64]:
%%time
OHE = OneHotEncoder()    # sparse output is the default
OHE.fit(full_data[LE_vars])
OHE_sparse = OHE.transform(full_data[LE_vars])
# Note: var comes from cat_feats and has no '_le' suffix, so var[:-3] trims three
# characters of the real name ('gender' -> 'gen'); use var on its own for full names.
OHE_vars = [var[:-3] + '_' + str(level).replace(' ','_')
            for var in cat_feats for level in LE_map[var]]

print("OHE_sparse size :", OHE_sparse.shape)
print("One-hot encoded categorical feature samples : %s" % (OHE_vars[:100]))

OHE_sparse size : (275547, 373)
One-hot encoded categorical feature samples : ['gen_-unknown-', 'gen_FEMALE', 'gen_MALE', 'gen_OTHER', 'signup_met_basic', 'signup_met_facebook', 'signup_met_google', 'signup_met_weibo', 'signup_f_0', 'signup_f_1', 'signup_f_2', 'signup_f_3', 'signup_f_4', 'signup_f_5', 'signup_f_6', 'signup_f_8', 'signup_f_10', 'signup_f_12', 'signup_f_14', 'signup_f_15', 'signup_f_16', 'signup_f_20', 'signup_f_21', 'signup_f_23', 'signup_f_24', 'signup_f_25', 'langu_-unknown-', 'langu_ca', 'langu_cs', 'langu_da', 'langu_de', 'langu_el', 'langu_en', 'langu_es', 'langu_fi', 'langu_fr', 'langu_hr', 'langu_hu', 'langu_id', 'langu_is', 'langu_it', 'langu_ja', 'langu_ko', 'langu_nl', 'langu_no', 'langu_pl', 'langu_pt', 'langu_ru', 'langu_sv', 'langu_th', 'langu_tr', 'langu_zh', 'affiliate_chan_api', 'affiliate_chan_content', 'affiliate_chan_direct', 'affiliate_chan_other', 'affiliate_chan_remarketing', 'affiliate_chan_sem-brand', 'affiliate_chan_sem-non-brand', 'affiliate_cha

In [65]:
train_x = OHE_sparse[:num_rows_train]
test_x = OHE_sparse[num_rows_train:]
#train_y = target

#### Label encode `country_destination`

In [66]:
target = label_encoder.fit_transform(train_users_data['country_destination'])
country_code_map=dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

In [67]:
# map between country and its code
country_code_map

{'AU': 0,
 'CA': 1,
 'DE': 2,
 'ES': 3,
 'FR': 4,
 'GB': 5,
 'IT': 6,
 'NDF': 7,
 'NL': 8,
 'PT': 9,
 'US': 10,
 'other': 11}

In [68]:
xgb_params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'subsample': 0.5,
    'colsample_bytree': 0.5,
    'gamma': 0,
    'objective': 'multi:softprob',
    'eta': 0.3,
    'seed': 1234,
    'num_class': 12}

model = xgb.train(xgb_params, 
                  xgb.DMatrix(train_x, label = target)
                 )

preds = model.predict(xgb.DMatrix(test_x))

In [71]:
ids = []  #list of ids
cts = []  #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += label_encoder.inverse_transform(np.argsort(preds[i])[::-1])[:5].tolist()

In [72]:
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('sub.csv',index=False)
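
The per-user loop above can also be written as array operations, which may be faster on the full test set. This is a sketch with toy inputs: 2 users and 6 classes with invented probabilities, where `classes` plays the role of `label_encoder.classes_` and `toy_preds` the role of `preds`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the real preds has one row per test user and 12 columns.
classes = np.array(["AU", "CA", "DE", "NDF", "US", "other"])
toy_ids = pd.Series(["u1", "u2"])
toy_preds = np.array([[0.02, 0.03, 0.10, 0.50, 0.30, 0.05],
                      [0.01, 0.04, 0.05, 0.20, 0.60, 0.10]])

# Column indices of the 5 highest probabilities per row, best first.
top5 = np.argsort(-toy_preds, axis=1)[:, :5]

toy_sub = pd.DataFrame({
    "id": np.repeat(toy_ids.to_numpy(), 5),    # each id repeated 5 times
    "country": classes[top5].ravel(),          # row-major: user by user
})
print(toy_sub)
```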

In [75]:
OHE_vars[300:]

['days_de_30153600000000000_nanoseconds',
 'days_de_30585600000000000_nanoseconds',
 'days_de_30758400000000000_nanoseconds',
 'days_de_30931200000000000_nanoseconds',
 'days_de_31017600000000000_nanoseconds',
 'days_de_31276800000000000_nanoseconds',
 'days_de_31449600000000000_nanoseconds',
 'days_de_31708800000000000_nanoseconds',
 'days_de_31881600000000000_nanoseconds',
 'days_de_34128000000000000_nanoseconds',
 'days_de_34473600000000000_nanoseconds',
 'days_de_34646400000000000_nanoseconds',
 'days_de_37238400000000000_nanoseconds',
 'days_de_37843200000000000_nanoseconds',
 'days_de_38620800000000000_nanoseconds',
 'days_de_38707200000000000_nanoseconds',
 'days_de_38966400000000000_nanoseconds',
 'days_de_40262400000000000_nanoseconds',
 'days_de_41126400000000000_nanoseconds',
 'days_de_43804800000000000_nanoseconds',
 'days_de_44409600000000000_nanoseconds',
 'days_de_44582400000000000_nanoseconds',
 'days_de_48816000000000000_nanoseconds',
 'days_de_50716800000000000_nanose