# History Kaggle: Airbnb New User Bookings

## Download data
Data for this project are downloaded from the following link:<br/>
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data

## Data Exploration

We are given demographics of users (train_users_2.csv, test_users.csv), records of users' web sessions (sessions.csv), and some basic reference information (age_gender_bkts.csv, countries.csv) as input datasets.

In the cells below, I'll load each of the datasets, discuss its features, calculate statistics, and note any missing values or outliers.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import xgboost as xgb
from sklearn.model_selection import GridSearchCV



### Load Datasets (5 files in total)

#### Demographics of Users (2 files)

In [2]:
train_users_data = pd.read_csv("train_users_2.csv")
test_users_data = pd.read_csv("test_users.csv")

In [53]:
num_rows_train, num_cols_train = train_users_data.shape
print("There are {} rows and {} columns in the train_users data.".format(num_rows_train, num_cols_train))
num_rows_test, num_cols_test = test_users_data.shape
print("There are {} rows and {} columns in the test_users data.".format(num_rows_test, num_cols_test))

There are 213451 rows and 16 columns in the train_users data.
There are 62096 rows and 15 columns in the test_users data.


In [54]:
train_users_data.head(3)

| | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser | country_destination |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gxn3p5htnn | 2010-06-28 | 20090319043255 | NaN | -unknown- | NaN | facebook | 0 | en | direct | direct | untracked | Web | Mac Desktop | Chrome | NDF |
| 1 | 820tgsjxq7 | 2011-05-25 | 20090523174809 | NaN | MALE | 38.0 | facebook | 0 | en | seo | google | untracked | Web | Mac Desktop | Chrome | NDF |
| 2 | 4ft3gnwmtx | 2010-09-28 | 20090609231247 | 2010-08-02 | FEMALE | 56.0 | basic | 3 | en | direct | direct | untracked | Web | Windows Desktop | IE | US |


In [55]:
test_users_data.head(3)

| | id | date_account_created | timestamp_first_active | date_first_booking | gender | age | signup_method | signup_flow | language | affiliate_channel | affiliate_provider | first_affiliate_tracked | signup_app | first_device_type | first_browser |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5uwns89zht | 2014-07-01 | 20140701000006 | NaN | FEMALE | 35.0 | facebook | 0 | en | direct | direct | untracked | Moweb | iPhone | Mobile Safari |
| 1 | jtl0dijy2j | 2014-07-01 | 20140701000051 | NaN | -unknown- | NaN | basic | 0 | en | direct | direct | untracked | Moweb | iPhone | Mobile Safari |
| 2 | xx0ulgorjt | 2014-07-01 | 20140701000148 | NaN | -unknown- | NaN | basic | 0 | en | direct | direct | linked | Web | Windows Desktop | Chrome |


In [11]:
train_users_data.dtypes

id                          object
date_account_created        object
timestamp_first_active       int64
date_first_booking          object
gender                      object
age                        float64
signup_method               object
signup_flow                  int64
language                    object
affiliate_channel           object
affiliate_provider          object
first_affiliate_tracked     object
signup_app                  object
first_device_type           object
first_browser               object
country_destination         object
dtype: object

In [26]:
test_users_data.dtypes

id                          object
date_account_created        object
timestamp_first_active       int64
date_first_booking         float64
gender                      object
age                        float64
signup_method               object
signup_flow                  int64
language                    object
affiliate_channel           object
affiliate_provider          object
first_affiliate_tracked     object
signup_app                  object
first_device_type           object
first_browser               object
dtype: object
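
The `float64` dtype of `date_first_booking` in the test set suggests that the column holds no values at all there, and the rows printed earlier show both empty cells and the `-unknown-` placeholder in `gender`. A minimal sketch of how missing values could be counted per column follows. The small `users` frame is a stand-in built from the three training rows shown above, and the helper name `missing_fraction` is hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for train_users_data, using the three rows printed above.
users = pd.DataFrame({
    "id": ["gxn3p5htnn", "820tgsjxq7", "4ft3gnwmtx"],
    "date_first_booking": [np.nan, np.nan, "2010-08-02"],
    "gender": ["-unknown-", "MALE", "FEMALE"],
    "age": [np.nan, 38.0, 56.0],
})

def missing_fraction(df, placeholder="-unknown-"):
    """Fraction of missing entries per column, counting `placeholder` as missing.

    Treating "-unknown-" as missing is an assumption of this sketch.
    """
    cleaned = df.replace(placeholder, np.nan)
    return cleaned.isna().mean()

missing = missing_fraction(users)
print(missing.sort_values(ascending=False))
```

Calling `missing_fraction(train_users_data)` and `missing_fraction(test_users_data)` on the full tables would give the same kind of summary for every column.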

In [52]:
# Count distinct signup_flow values in each set and in both sets combined.
n_flow_train = train_users_data.signup_flow.nunique()
n_flow_test = test_users_data.signup_flow.nunique()
n_flow_all = pd.concat([train_users_data.signup_flow, test_users_data.signup_flow]).nunique()
print("There are {} unique values in signup_flow column in training data.".format(n_flow_train))
print("There are {} unique values in signup_flow column in testing data.".format(n_flow_test))
print("There are {} unique values in signup_flow column in all data.".format(n_flow_all))

There are 17 unique values in signup_flow column in training data.
There are 7 unique values in signup_flow column in testing data.
There are 18 unique values in signup_flow column in all data.


*Note: most of the columns are categorical variables. We will convert `timestamp_first_active` to a date-time format. According to the documentation provided by Kaggle, `signup_flow` refers to "the page a user came to signup up from". Since it takes only 18 unique values across all the data, we will treat it as a categorical variable as well.*
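
The two conversions described in the note can be sketched as follows. This is an illustration under assumptions, not the notebook's actual preprocessing. It assumes the integer in `timestamp_first_active` is laid out as `YYYYMMDDHHMMSS`, which matches the values printed earlier. The small `train` and `test` frames are stand-ins, and sharing one set of categories between them is a design choice made for this sketch.

```python
import pandas as pd

# Stand-ins for the user tables, using values printed earlier.
train = pd.DataFrame({
    "timestamp_first_active": [20090319043255, 20090523174809, 20090609231247],
    "signup_flow": [0, 0, 3],
})
test = pd.DataFrame({
    "timestamp_first_active": [20140701000006, 20140701000051, 20140701000148],
    "signup_flow": [0, 0, 0],
})

# One categorical dtype covering every signup_flow value seen in either table,
# so both tables encode the column identically.
all_flows = sorted(set(train["signup_flow"]) | set(test["signup_flow"]))
flow_dtype = pd.CategoricalDtype(categories=all_flows)

for df in (train, test):
    # Assumed layout: YYYYMMDDHHMMSS stored as an integer.
    df["timestamp_first_active"] = pd.to_datetime(
        df["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
    )
    df["signup_flow"] = df["signup_flow"].astype(flow_dtype)

print(train.dtypes)
```

Using a shared dtype means a value that appears in only one of the two tables still has a category in the other, which keeps any later encoding aligned.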

#### Age, Gender Statistics (1 file)

This file contains census data on the age and gender distribution of the 10 destination countries as of 2015.

In [28]:
age_gender_bkts_data = pd.read_csv("age_gender_bkts.csv")

In [29]:
num_rows_age, num_cols_age = age_gender_bkts_data.shape
print("There are {} rows and {} columns in the age_gender_bkts data".format(num_rows_age, num_cols_age))

There are 420 rows and 5 columns in the age_gender_bkts data


In [30]:
age_gender_bkts_data.year.unique()

array([ 2015.])

In [56]:
age_gender_bkts_data.country_destination.unique()

array(['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PT', 'US'], dtype=object)

In [31]:
age_gender_bkts_data.head(3)

| | age_bucket | country_destination | gender | population_in_thousands | year |
|---|---|---|---|---|---|
| 0 | 100+ | AU | male | 1.0 | 2015.0 |
| 1 | 95-99 | AU | male | 9.0 | 2015.0 |
| 2 | 90-94 | AU | male | 47.0 | 2015.0 |
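
The `age_bucket` labels are strings such as `95-99` and `100+`, so they have to be parsed before they can be compared with a user's numeric `age`. One possible approach, sketched here on the three rows shown above, extracts the lower bound of each bucket and computes each bucket's share of the population within its country and gender. The column names `age_lower` and `population_share` are introduced for this illustration only.

```python
import pandas as pd

# Stand-in for age_gender_bkts_data, using the three rows printed above.
bkts = pd.DataFrame({
    "age_bucket": ["100+", "95-99", "90-94"],
    "country_destination": ["AU", "AU", "AU"],
    "gender": ["male", "male", "male"],
    "population_in_thousands": [1.0, 9.0, 47.0],
})

# The lower bound is the leading run of digits in labels like "95-99" or "100+".
bkts["age_lower"] = (
    bkts["age_bucket"].str.extract(r"^(\d+)", expand=False).astype(int)
)

# Share of each bucket within its (country, gender) group.
group_total = bkts.groupby(["country_destination", "gender"])[
    "population_in_thousands"
].transform("sum")
bkts["population_share"] = bkts["population_in_thousands"] / group_total

print(bkts[["age_bucket", "age_lower", "population_share"]])
```

On the full file the shares would be computed over all age buckets of each country and gender, not just the three rows used here.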


#### Country Information (1 file)

This file contains the latitude, longitude, distance to the US, area, and language of the 10 destination countries.

In [32]:
countries_data = pd.read_csv("countries.csv")

In [34]:
num_rows_country, num_cols_country = countries_data.shape
print("There are {} rows and {} columns in the countries data".format(num_rows_country, num_cols_country))

There are 10 rows and 7 columns in the countries data


In [35]:
countries_data.head(3)

| | country_destination | lat_destination | lng_destination | distance_km | destination_km2 | destination_language | language_levenshtein_distance |
|---|---|---|---|---|---|---|---|
| 0 | AU | -26.853388 | 133.27516 | 15297.744 | 7741220.0 | eng | 0.0 |
| 1 | CA | 62.393303 | -96.818146 | 2828.1333 | 9984670.0 | eng | 0.0 |
| 2 | DE | 51.165707 | 10.452764 | 7879.568 | 357022.0 | deu | 72.61 |


In [57]:
countries_data.country_destination.unique()

array(['AU', 'CA', 'DE', 'ES', 'FR', 'GB', 'IT', 'NL', 'PT', 'US'], dtype=object)
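
Because `countries.csv` is keyed by `country_destination`, it can be attached to the user table with a left join. The sketch below is only an illustration: the user ids `u1` to `u3` are hypothetical, while the country rows are the three printed above. Labels that have no row in the countries table, such as `NDF`, end up with missing values in the joined columns.

```python
import pandas as pd

# Hypothetical users; only the destination labels matter for this sketch.
users = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "country_destination": ["NDF", "CA", "DE"],
})

# Stand-in for countries_data, using the three rows printed above.
countries = pd.DataFrame({
    "country_destination": ["AU", "CA", "DE"],
    "distance_km": [15297.744, 2828.1333, 7879.568],
    "destination_language": ["eng", "eng", "deu"],
    "language_levenshtein_distance": [0.0, 0.0, 72.61],
})

# Left join keeps every user; indicator=True records whether a match was found.
merged = users.merge(
    countries,
    on="country_destination",
    how="left",
    validate="many_to_one",
    indicator=True,
)

print(merged[["id", "country_destination", "distance_km", "_merge"]])
```

The `validate="many_to_one"` argument makes pandas raise an error if a country appears more than once in the right-hand table, which guards against accidentally duplicating user rows.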

#### Sessions Record (1 file)

The sessions file contains the action, action type, action detail, device type, and time elapsed from the previous action for each user_id.

In [59]:
sessions_data = pd.read_csv("sessions.csv")

In [62]:
num_rows_sessions, num_cols_sessions = sessions_data.shape
print("There are {} rows and {} columns in the sessions data".format(num_rows_sessions, num_cols_sessions))

There are 10567737 rows and 6 columns in the sessions data


In [61]:
sessions_data.head(3)

| | user_id | action | action_type | action_detail | device_type | secs_elapsed |
|---|---|---|---|---|---|---|
| 0 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 319.0 |
| 1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |
| 2 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 301.0 |
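
The sessions table has many rows per user, whereas the user tables have one row per `id`, so the session records need to be summarised per user before the two can be combined. A minimal sketch of such a summary, using the three rows shown above, is given below. The choice of summary statistics and the output column names (`n_actions`, `n_distinct_actions`, `total_secs`, `median_secs`) are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for sessions_data, using the three rows printed above.
sessions = pd.DataFrame({
    "user_id": ["d1mm9tcy42", "d1mm9tcy42", "d1mm9tcy42"],
    "action": ["lookup", "search_results", "lookup"],
    "action_type": [None, "click", None],
    "action_detail": [None, "view_search_results", None],
    "device_type": ["Windows Desktop"] * 3,
    "secs_elapsed": [319.0, 67753.0, 301.0],
})

# One row per user: how many actions, how varied, and how much time elapsed.
per_user = (
    sessions.groupby("user_id")
    .agg(
        n_actions=("action", "size"),
        n_distinct_actions=("action", "nunique"),
        total_secs=("secs_elapsed", "sum"),
        median_secs=("secs_elapsed", "median"),
    )
    .reset_index()
)

print(per_user)
```

The resulting frame could then be joined to the user tables by matching `user_id` against `id`.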
