
# Airbnb New User Bookings
**Predict user country destination**

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.

In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking. Kagglers who impress with their answer (and an explanation of how they got there) will be considered for an interview for the opportunity to join Airbnb.

# Important Notes
* countries.csv: summary statistics of destination countries in this dataset and their locations
* age_gender_bkts.csv: summary statistics of users' age group, gender, country of destination
* **'NDF'** --> no booking
* **country_destination** is our prediction target; it is not available in the test dataset.

**Submission File:**
For every user in the dataset, submission files should contain two columns: id and country. The destination country predictions must be ordered such that the most probable destination country goes first.

**Evaluation:**
For each new user, you are to make a maximum of 5 predictions on the country of the first booking. The ground truth country is marked with relevance = 1, while the rest have relevance = 0.

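For example, if for a particular user the destination is FR, then a prediction list that places FR first earns the full score, and the score decreases the further down the list FR appears.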
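The relevance scheme described above matches NDCG@5 (normalized discounted cumulative gain over the top five predictions), which is the metric this competition used. The following is a minimal sketch of that score for a single user, assuming exactly one true country per user; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and not part of any competition kit.

```python
import math

def dcg_at_k(predictions, truth, k=5):
    """Discounted gain for one user: 1 / log2(rank + 1) at the rank of the true country."""
    for rank, country in enumerate(predictions[:k], start=1):
        if country == truth:
            return 1.0 / math.log2(rank + 1)
    return 0.0  # true country not among the first k predictions

def ndcg_at_k(predictions, truth, k=5):
    """Normalize by the ideal DCG, which is 1.0 when the true country is ranked first."""
    ideal_dcg = 1.0
    return dcg_at_k(predictions, truth, k) / ideal_dcg

print(ndcg_at_k(["FR"], "FR"))        # true country ranked first
print(ndcg_at_k(["US", "FR"], "FR"))  # true country ranked second
print(ndcg_at_k(["US", "IT"], "FR"))  # true country missing
```

Ranking the true country first gives a score of 1.0, ranking it second gives 1 / log2(3), roughly 0.63, and omitting it gives 0.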



**Data Cleansing & Preparation Summary**

* gender: convert all "unknown" values into NaN
* age: keep values between 1 and 100 and set all others to NaN
* first_browser: convert all "unknown" values into NaN
* date_account_created: convert to datetime format
* timestamp_first_active: convert to datetime format

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from xgboost.sklearn import XGBClassifier

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# colors for the plot bars
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
          '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']        

# Load data

In [None]:
# convert date columns into datetime format while loading
train_df = pd.read_csv('../input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip', parse_dates=['timestamp_first_active','date_account_created','date_first_booking'])

test_df = pd.read_csv('../input/airbnb-recruiting-new-user-bookings/test_users.csv.zip', parse_dates=['timestamp_first_active','date_account_created','date_first_booking'])

# second copy of the train data without date parsing, used for some plots below
bookings_data = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip")

# Combining test and train datasets

In this competition, the train and test sets must be combined before encoding. Otherwise the number of dummy-variable columns differs between them, because some categories (certain browser types, for example) are present in the train data but not in the test data.

Without this step I got many errors like: **ValueError: X has 122 features per sample; expecting 155**

In [None]:
train_test_combin = pd.concat((train_df, test_df), axis = 0, ignore_index = True)
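To illustrate why combining helps, here is a small sketch on toy data. The two tiny frames and their browser values are made up for the example; only `pd.concat` and `pd.get_dummies` mirror what this notebook does.

```python
import pandas as pd

# Toy frames: "Opera" and "Safari" occur only in train, "Firefox" only in test
train = pd.DataFrame({"first_browser": ["Chrome", "Safari", "Opera"]})
test = pd.DataFrame({"first_browser": ["Chrome", "Firefox"]})

# Encoding each frame separately gives a different set of dummy columns
separate_train = pd.get_dummies(train, columns=["first_browser"])
separate_test = pd.get_dummies(test, columns=["first_browser"])
print(separate_train.shape[1], separate_test.shape[1])

# Encoding the combined frame gives one shared set of columns
combined = pd.concat((train, test), axis=0, ignore_index=True)
combined_dummies = pd.get_dummies(combined, columns=["first_browser"])

# Split back by position: the first len(train) rows are the train rows
train_enc = combined_dummies.iloc[:len(train)]
test_enc = combined_dummies.iloc[len(train):]
print(list(train_enc.columns) == list(test_enc.columns))
```

Encoded separately, the toy train set has three columns and the toy test set has two, which is the same kind of mismatch as in the error above. Encoded together, both halves share the same four columns.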

# Exploratory Data Analysis (EDA)

In [None]:
train_test_combin.head()

In [None]:
train_test_combin.shape

In [None]:
train_df.head()

In [None]:
train_df.shape

In [None]:
test_df.head()

In [None]:
test_df.shape

**Check for null values**

In [None]:
train_test_combin.isnull().sum()

# Data Visualization

**Min / Max Date in our data**

The data spans five calendar years, from 2010 to 2014.

In [None]:
print(train_test_combin.date_account_created.min())
print(train_test_combin.date_account_created.max())

**Signup method used by users**

In [None]:
train_test_combin['signup_method'].value_counts()

In [None]:
train_test_combin.signup_method.value_counts(dropna=False).plot(kind='bar', color=colors)

**Users Devices**

In [None]:
train_test_combin['first_device_type'].value_counts()

In [None]:
train_test_combin.first_device_type.value_counts(dropna=False).plot(kind='bar', color=colors)


**Detailed view with percentages**

In [None]:
plt.figure(figsize=(10,6))
# value_counts() sorts by frequency, so its index gives the bar order
cd_count_val = bookings_data['first_device_type'].value_counts()
sns.countplot(data = bookings_data, x = 'first_device_type', order = cd_count_val.index, color = sns.color_palette()[0])
plt.xlabel('Device Type')
plt.ylabel('Count')
plt.title('Users Devices')
plt.xticks(rotation=90)

# label each bar with its share of all rows
# (enumerate avoids integer lookups on a string-indexed Series)
for i, count in enumerate(cd_count_val):
    percentage ='{:0.1f}%'.format(100 * count / len(bookings_data))
    plt.text(i, count+1000, percentage, ha='center')

**Internet browsers used**

In [None]:
train_test_combin['first_browser'].value_counts()

> Google Chrome is the most popular internet browser.

**Top 10 browsers**

In [None]:
train_test_combin['first_browser'].value_counts().head(10).plot(kind='bar', color=colors)

**Gender and Age**
* The average age is around 35.
* There are more female users than male users.

In [None]:
train_test_combin['gender'].value_counts()

In [None]:
train_test_combin.groupby('gender').age.agg(['min','max','mean','count'])

**Filter age between 1-100 and set others to NaN**

> Some of the age values are not plausible (for example 2014, 188, etc.).

In [None]:
# set ages outside the 1-100 range to NaN
train_test_combin.loc[(train_test_combin.age < 1) | (train_test_combin.age > 100), 'age'] = np.nan
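The same filter can also be written with `Series.between` and `Series.where`. This is an alternative sketch on a made-up series of ages, not the code this notebook runs.

```python
import numpy as np
import pandas as pd

# Made-up ages, including implausible values and one missing value
ages = pd.Series([25, 2014, 188, 41, np.nan, 0.5])

# between() is inclusive on both ends by default;
# where() keeps values where the mask is True and writes NaN elsewhere
cleaned = ages.where(ages.between(1, 100))
print(cleaned.tolist())
```

Only 25 and 41 survive; the out-of-range values become NaN and the existing NaN stays NaN.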

In [None]:
train_test_combin.groupby('gender').age.agg(['min','max','mean','count'])

In [None]:
train_test_combin['gender'].value_counts().plot(kind='bar', color=colors)

**Gender percentages**

In [None]:
plt.figure(figsize=(10,6))
# value_counts() sorts by frequency, so its index gives the bar order
cd_count_val = train_test_combin['gender'].value_counts()
sns.countplot(data = train_test_combin, x = 'gender', order = cd_count_val.index, color = sns.color_palette()[0])
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender')
plt.xticks(rotation=90)

# label each bar with its share of all rows
# (enumerate avoids integer lookups on a string-indexed Series)
for i, count in enumerate(cd_count_val):
    percentage ='{:0.1f}%'.format(100 * count / len(train_test_combin))
    plt.text(i, count+1000, percentage, ha='center')

In [None]:
# Splitting date into Day-Month-Year
# account created
train_test_combin['dac_year'] = train_test_combin.date_account_created.dt.year
train_test_combin['dac_month'] = train_test_combin.date_account_created.dt.month
train_test_combin['dac_day'] = train_test_combin.date_account_created.dt.day

# Splitting date into Day-Month-Year
# time first active
train_test_combin['tfa_year'] = train_test_combin.timestamp_first_active.dt.year
train_test_combin['tfa_month'] = train_test_combin.timestamp_first_active.dt.month
train_test_combin['tfa_day'] = train_test_combin.timestamp_first_active.dt.day

In [None]:
train_test_combin.dac_year.value_counts(sort=False).plot(kind='bar', title='User Accounts Created Per Year')

**Countries visited by Users**

In [None]:
train_test_combin.country_destination.value_counts(normalize=True).plot(kind='bar',title='Countries Visited by Users')

> NDF: no destination found (the user made no booking)

> Among actual bookings, the US is the most visited country.

**Dropping unwanted columns**

* id: drop
* date_first_booking: drop
* date_account_created: drop
* timestamp_first_active: drop
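* missing values: left as NaN (the code in this notebook applies no fill; XGBoost handles missing values natively)
* gender, first_browser: replace -unknown- with NaN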

Dropping the raw date columns has no impact because they are not used directly for the prediction; the year, month, and day features derived from them earlier are kept.

In [None]:
X_train_data = train_df.drop(['id','date_account_created','date_first_booking','timestamp_first_active', 'country_destination'], axis=1)
X_test_data = test_df.drop(['id','date_account_created','date_first_booking','timestamp_first_active'], axis=1)
train_test_combin = train_test_combin.drop(['id', 'date_account_created','date_first_booking','timestamp_first_active','country_destination'], axis=1)


In [None]:
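# Replace '-unknown-' with NaN so it is treated as a missing value
# (assign the result back; chained inplace calls may not modify the DataFrame in recent pandas)
train_test_combin['gender'] = train_test_combin['gender'].replace('-unknown-', np.nan)
train_test_combin['first_browser'] = train_test_combin['first_browser'].replace('-unknown-', np.nan)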

# Create Features
Convert all categorical (string) values into numeric indicator columns using pandas get_dummies.

In [None]:
# Categorical columns to encode as features

features = ['gender', 'signup_method', 'signup_flow', 'language',
                'affiliate_channel', 'affiliate_provider',
                'first_affiliate_tracked', 'signup_app',
                'first_device_type', 'first_browser']

# get_dummies converts each categorical variable into dummy/indicator variables

train_test_combin = pd.get_dummies(train_test_combin,columns=features)

In [None]:
train_test_combin.head()

# xgboost

In [None]:
from xgboost.sklearn import XGBClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
train_df_rows_no = train_df.shape[0]

# Split the combined data back into train and test for the classifier,
# using the number of train rows as the split point

# convert the combined DataFrame into a float NumPy array
# (the cast keeps dummy columns and NaN values in one numeric array)
all_data_list = train_test_combin.to_numpy(dtype=float)
X_train = all_data_list[:train_df_rows_no]  # first 213451 rows: train
X_test = all_data_list[train_df_rows_no:]   # remaining rows: test

# Create labels: encode the country codes as integers
#labels = train_df['country_destination'].values
labler = LabelEncoder()
y = labler.fit_transform(train_df['country_destination'].values)

# Fit the classifier (gradient-boosted decision trees)
xgb = XGBClassifier(max_depth=6, learning_rate=0.3, n_estimators=22,
                    objective='multi:softprob', subsample=0.6, colsample_bytree=0.6, seed=0)
xgb.fit(X_train, y)
# class probabilities for each test user, one column per encoded country
y_pred = xgb.predict_proba(X_test)
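A short sketch of how the label encoding relates to the probability columns may help here. It uses a made-up list of five labels and a made-up probability row; `encoder` is an illustrative name, and only `LabelEncoder` itself comes from this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up destination labels for five users
labels = ["US", "NDF", "FR", "US", "NDF"]

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)

# classes_ holds the distinct labels in sorted order;
# each label's integer code is its position in classes_
print(list(encoder.classes_))
print(list(y_toy))

# A probability row from a classifier trained on these codes has one entry per class,
# in the same order, so a column index maps back to a country through classes_
toy_probs = np.array([0.1, 0.6, 0.3])
best_country = encoder.classes_[np.argmax(toy_probs)]
print(best_country)

# inverse_transform does the same lookup for an array of codes
print(list(encoder.inverse_transform([0, 2])))
```

Because the codes follow the sorted class order, column `j` of a `predict_proba` result corresponds to `classes_[j]`.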


In [None]:
# Take the 5 classes with the highest probabilities.
# The competition allows up to 5 predictions per user ID,
# so we keep the top 5 countries from each row of y_pred.

test_ids = test_df['id']
ids = []  # list of ids
cts = []  # list of countries
for i in range(len(test_ids)):
    idx = test_ids[i]
    ids += [idx] * 5
    # save the 5 most probable countries for this user, most probable first
    cts += labler.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

# Generate submission
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('submission.csv',index=False)
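The per-user loop can also be expressed with array operations. The sketch below is one possible vectorized variant on made-up data: `toy_classes`, `toy_probs`, and `toy_ids` are hypothetical stand-ins for `labler.classes_`, `y_pred`, and `test_df['id']`.

```python
import numpy as np

# Hypothetical stand-ins: class order, probabilities for two users, and their ids
toy_classes = np.array(["FR", "NDF", "US", "other", "IT", "GB"])
toy_probs = np.array([
    [0.05, 0.50, 0.30, 0.10, 0.03, 0.02],
    [0.40, 0.12, 0.20, 0.05, 0.15, 0.08],
])
toy_ids = np.array(["u1", "u2"])

# Sort each row by descending probability and keep the first 5 column indices
top5_idx = np.argsort(-toy_probs, axis=1)[:, :5]

# Fancy indexing maps column indices to country codes, shape (n_users, 5)
top5_countries = toy_classes[top5_idx]

# Repeat each id 5 times and flatten the countries row by row,
# which matches the long format of the submission file
ids_long = np.repeat(toy_ids, 5)
countries_long = top5_countries.ravel()

print(top5_countries[0].tolist())
print(top5_countries[1].tolist())
```

For the first toy user the order is NDF, US, other, FR, IT, and for the second it is FR, US, IT, NDF, GB. On the full test set this avoids calling `inverse_transform` once per user.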

**Please consider upvoting if you find this notebook useful!**

Thanks