<a href="https://colab.research.google.com/github/uyangas/Visualizations-in-Python/blob/main/AirBnB_new_user_booking_destination_plotly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AirBnB new user booking destination (Plotly)

https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data


`train_users.csv` - the training set of users
- `id`: user id
- `date_account_created`: the date of account creation
- `timestamp_first_active`: timestamp of the first activity. Note that it can be earlier than `date_account_created` or `date_first_booking` because a user can search before signing up
- `date_first_booking`: date of first booking
- `gender`
- `age`
- `signup_method`
- `signup_flow`: the page a user came to sign up from
- `language`: international language preference
- `affiliate_channel`: what kind of paid marketing
- `affiliate_provider`: where the marketing is e.g. google, craigslist, other
- `first_affiliate_tracked`: the first marketing the user interacted with before signing up
- `signup_app`
- `first_device_type`
- `first_browser`

- `country_destination`: the target variable to predict

`sessions.csv` - web sessions log for users

- `user_id`: to be joined with the column `id` in the users table
- `action`
- `action_type`
- `action_detail`
- `device_type`
- `secs_elapsed`

`countries.csv` - summary statistics of destination countries in this dataset and their locations

`age_gender_bkts.csv` - summary statistics of users' age group, gender, country of destination

`sample_submission.csv` - correct format for submitting your predictions


In [None]:
from google.colab import files
files.upload()

In [None]:
# create the kaggle config directory (-p: no error if it already exists)
! mkdir -p ~/.kaggle

In [None]:
# copy kaggle.json into the ~/.kaggle directory
! cp kaggle.json ~/.kaggle/

In [None]:
# change the permission of the file
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# download the dataset
! kaggle competitions download -c airbnb-recruiting-new-user-bookings --force

Downloading sample_submission_NDF.csv.zip to /content
  0% 0.00/478k [00:00<?, ?B/s]
100% 478k/478k [00:00<00:00, 60.7MB/s]
Downloading countries.csv.zip to /content
  0% 0.00/546 [00:00<?, ?B/s]
100% 546/546 [00:00<00:00, 1.03MB/s]
Downloading test_users.csv.zip to /content
  0% 0.00/1.03M [00:00<?, ?B/s]
100% 1.03M/1.03M [00:00<00:00, 69.9MB/s]
Downloading sessions.csv.zip to /content
 69% 41.0M/59.1M [00:02<00:01, 11.4MB/s]
100% 59.1M/59.1M [00:02<00:00, 27.9MB/s]
Downloading train_users_2.csv.zip to /content
100% 4.07M/4.07M [00:00<00:00, 17.4MB/s]

Downloading age_gender_bkts.csv.zip to /content
  0% 0.00/2.46k [00:00<?, ?B/s]
100% 2.46k/2.46k [00:00<00:00, 2.41MB/s]


In [None]:
# create the directory the data is unzipped into (same relative path used by unzip and read_csv below)
! mkdir -p airbnb_booking

In [None]:
! unzip countries.csv.zip -d airbnb_booking
! unzip train_users_2.csv.zip -d airbnb_booking
! unzip age_gender_bkts.csv.zip -d airbnb_booking
! unzip sessions.csv.zip -d airbnb_booking
! unzip test_users.csv.zip -d airbnb_booking

Archive:  countries.csv.zip
  inflating: airbnb_booking/countries.csv  
Archive:  train_users_2.csv.zip
  inflating: airbnb_booking/train_users_2.csv  
Archive:  age_gender_bkts.csv.zip
  inflating: airbnb_booking/age_gender_bkts.csv  
Archive:  sessions.csv.zip
  inflating: airbnb_booking/sessions.csv  
Archive:  test_users.csv.zip
  inflating: airbnb_booking/test_users.csv  


## 1. Preprocessing

In [176]:
# import necessary packages
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime

In [54]:
# load the datasets
countries = pd.read_csv("airbnb_booking/countries.csv")
train = pd.read_csv("airbnb_booking/train_users_2.csv")
test = pd.read_csv("airbnb_booking/test_users.csv")
age_gender = pd.read_csv("airbnb_booking/age_gender_bkts.csv")
sessions = pd.read_csv("airbnb_booking/sessions.csv")

In [55]:
# merge train and test datasets
test['country_destination'] = 'Test'
all = pd.concat([train, test], axis=0, ignore_index=True)

In [56]:
all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 275547 entries, 0 to 275546
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       275547 non-null  object 
 1   date_account_created     275547 non-null  object 
 2   timestamp_first_active   275547 non-null  int64  
 3   date_first_booking       88908 non-null   object 
 4   gender                   275547 non-null  object 
 5   age                      158681 non-null  float64
 6   signup_method            275547 non-null  object 
 7   signup_flow              275547 non-null  int64  
 8   language                 275547 non-null  object 
 9   affiliate_channel        275547 non-null  object 
 10  affiliate_provider       275547 non-null  object 
 11  first_affiliate_tracked  269462 non-null  object 
 12  signup_app               275547 non-null  object 
 13  first_device_type        275547 non-null  object 
 14  firs

In [63]:
# convert datetime
all['date_account_created'] = pd.to_datetime(all['date_account_created'])
all['date_first_booking'] = pd.to_datetime(all['date_first_booking'])
all['timestamp_first_active'] = pd.to_datetime(all['timestamp_first_active'], format='%Y%m%d%H%M%S')
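The conversion above can be checked on a small scale. The following is a minimal sketch that uses the values of the first two rows of the dataset; the frame name `sample` and the derived column `days_to_signup` are illustrative and not part of the notebook. The integer timestamps are cast to strings before parsing so that the format string is applied to the digits explicitly.

```python
import pandas as pd

# Two rows in the same layout as the dataset's date columns.
sample = pd.DataFrame({
    "date_account_created": ["2010-06-28", "2011-05-25"],
    "timestamp_first_active": [20090319043255, 20090523174809],
})

# ISO dates are parsed without an explicit format.
sample["date_account_created"] = pd.to_datetime(sample["date_account_created"])

# The activity timestamp is an integer laid out as YYYYMMDDHHMMSS.
sample["timestamp_first_active"] = pd.to_datetime(
    sample["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)

# Illustrative derived column: whole days between first activity and signup.
sample["days_to_signup"] = (
    sample["date_account_created"]
    - sample["timestamp_first_active"].dt.normalize()
).dt.days

print(sample)
```

A positive `days_to_signup` reflects the note in the data description: a user can be active before creating an account.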

In [100]:
print(f"Total number of users: {all.shape[0]}, Number of variables: {all.shape[1]}")

Total number of users: 275547, Number of variables: 16


In [64]:
all.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,2009-03-19 04:32:55,NaT,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,2009-05-23 17:48:09,NaT,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,2009-06-09 23:12:47,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,2009-10-31 06:01:29,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,2009-12-08 06:11:05,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US


### 1.1. Handling missing values

In [169]:
# plot the percentage of NA values in each column
fig = px.bar(x='Columns', 
             y='value', 
             data_frame=(all.isna().sum()/all.shape[0])\
             .reset_index()\
             .assign(Non_NA = lambda x: 1-x[0])\
             .rename({'index':'Columns',0:'NA_values'},axis=1)\
             .melt('Columns',['NA_values', 'Non_NA']),
             color='variable',
             orientation='v',
             category_orders={"variable": ["NA_values", "Non_NA"]},
             color_discrete_sequence=["rgb(0,184,184)", "rgb(227,227,227)"],
             hover_name='Columns',
             title='Percentage of NA values')

fig.update_layout(width=1000, height=400,
                  yaxis={'title':'Percentage', 'tickformat':'.0%'},
                  xaxis={'title':''},
                  plot_bgcolor='#fff')
fig.show()

The `date_first_booking`, `age` and `first_affiliate_tracked` columns contain `67.7%`, `42.4%` and `2.2%` NA values respectively. All other columns are complete.
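The same shares can be read as numbers instead of from the chart. The sketch below shows the idea on a hypothetical toy frame (`toy` is not part of the notebook), then repeats the arithmetic with the non-null counts reported by `all.info()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: half of the ages are missing, no gender is missing.
toy = pd.DataFrame({
    "age": [38.0, np.nan, 56.0, np.nan],
    "gender": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# isna() yields booleans; the column mean is the share of missing values.
na_pct = toy.isna().mean().mul(100).round(1)

# Same arithmetic using the row count and non-null counts from all.info().
n_rows = 275547
non_null = {
    "date_first_booking": 88908,
    "age": 158681,
    "first_affiliate_tracked": 269462,
}
reported = {col: round((1 - n / n_rows) * 100, 1) for col, n in non_null.items()}

print(na_pct.to_dict())
print(reported)
```

On the real data, `all.isna().mean()` would give the per-column share in one call.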

In [170]:
# explore NA values of date_first_booking
all[all.date_first_booking.isna()][['id','date_account_created','timestamp_first_active','date_first_booking','country_destination']].head(10)

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,country_destination
0,gxn3p5htnn,2010-06-28,2009-03-19 04:32:55,NaT,NDF
1,820tgsjxq7,2011-05-25,2009-05-23 17:48:09,NaT,NDF
11,om1ss59ys8,2010-01-05,2010-01-05 05:18:12,NaT,NDF
13,dy3rgx56cu,2010-01-05,2010-01-05 08:32:59,NaT,NDF
14,ju3h98ch3w,2010-01-07,2010-01-07 05:58:20,NaT,NDF
16,2dwbwkx056,2010-01-07,2010-01-07 21:51:25,NaT,NDF
18,cxlg85pg1r,2010-01-08,2010-01-08 01:56:41,NaT,NDF
23,jha93x042q,2010-01-11,2010-01-11 22:40:15,NaT,NDF
24,7i49vnuav6,2010-01-11,2010-01-11 23:08:08,NaT,NDF
26,bjg0m5otl3,2010-01-12,2010-01-12 15:54:20,NaT,NDF


In [93]:
# count the destination countries
all[all.date_first_booking.isna()]['country_destination'].value_counts()

NDF     124543
Test     62096
Name: country_destination, dtype: int64

Every row with a missing `date_first_booking` is either labelled `NDF` (the user did not place a booking) or belongs to the test set. A null value therefore carries meaning of its own, so imputing it would alter the meaning of this variable. The column is left as it is.
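One way to confirm this relationship in both directions is a cross-tabulation of the destination against a "has a booking date" flag. This is a minimal sketch on hypothetical rows; `toy`, `has_booking` and `summary` are illustrative names only, and the flag is meant as a diagnostic rather than a model feature.

```python
import pandas as pd

# Hypothetical rows mimicking the three situations seen in the data:
# a booked trip, a user without a booking (NDF), and a test-set row.
toy = pd.DataFrame({
    "date_first_booking": pd.to_datetime(["2010-08-02", None, None, "2012-09-08"]),
    "country_destination": ["US", "NDF", "Test", "other"],
})

# Flag rows that have a booking date, then count them per destination label.
toy["has_booking"] = toy["date_first_booking"].notna()
summary = pd.crosstab(toy["country_destination"], toy["has_booking"])

print(summary)
```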

Let's look at `age` column.

In [175]:
# descriptive statistics of age
all.age.describe()

count    158681.000000
mean         47.145310
std         142.629468
min           1.000000
25%          28.000000
50%          33.000000
75%          42.000000
max        2014.000000
Name: age, dtype: float64

The maximum value in the `age` column is 2014, which is clearly an error: nobody lives to be 2,000 years old, and even ages above 100 are implausible. Values such as 2014 may be a year entered in place of an age.

In [186]:
all['age'][all.age>100].value_counts().head(10)

105.0     1351
2014.0     710
110.0      228
104.0       52
101.0       40
102.0       39
2013.0      39
109.0       36
103.0       30
107.0       28
Name: age, dtype: int64
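One possible way to handle such values is to treat them as missing rather than drop the rows. The sketch below assumes plausibility bounds of 15 to 100 years; these bounds (`MIN_AGE`, `MAX_AGE`) are an assumption made for illustration, not something stated in the dataset documentation, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the kinds of outliers seen above.
ages = pd.Series([38.0, 105.0, 2014.0, np.nan, 1.0, 56.0], name="age")

# Assumed plausibility bounds (illustrative, not from the dataset docs).
MIN_AGE, MAX_AGE = 15, 100

# Keep values inside the bounds; everything else, including NaN, becomes NaN.
clean = ages.where(ages.between(MIN_AGE, MAX_AGE))

print(clean)
```

Setting implausible ages to NaN keeps the rest of each user's record available, and the age can then be handled together with the values that were missing from the start.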

In [182]:
fig = px.histogram(all[all.age<110],'age')
fig.show()

In [185]:
all[all.age>100].head(20)

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
388,v2x0ms9c62,2010-04-11,2010-04-11 06:56:02,2010-04-13,-unknown-,2014.0,basic,3,en,other,craigslist,untracked,Web,Windows Desktop,Firefox,FR
398,9ouah6tc30,2010-04-12,2010-04-12 23:15:34,2010-04-12,FEMALE,104.0,facebook,3,en,other,craigslist,linked,Web,iPhone,Mobile Safari,FR
627,dc3udjfdij,2010-05-19,2010-05-19 01:24:55,2010-06-16,-unknown-,105.0,basic,2,en,other,craigslist,omg,Web,Mac Desktop,Safari,FR
673,umf1wdk9uc,2010-05-25,2010-05-25 15:55:41,NaT,FEMALE,2014.0,basic,2,en,other,craigslist,untracked,Web,Mac Desktop,Safari,NDF
1040,m82epwn7i8,2010-07-14,2010-07-14 23:05:56,2010-07-15,MALE,2014.0,facebook,0,en,other,craigslist,untracked,Web,Mac Desktop,Chrome,US
1177,2th813zdx7,2010-07-25,2010-07-25 23:44:19,2010-07-26,MALE,2013.0,facebook,3,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US
1190,qc9se9qucz,2010-07-27,2010-07-27 00:20:29,2010-07-27,-unknown-,105.0,basic,2,en,other,craigslist,untracked,Web,Mac Desktop,Firefox,US
1200,3amf04n3o3,2010-07-27,2010-07-27 19:04:47,2010-07-29,FEMALE,2014.0,basic,2,en,direct,direct,untracked,Web,Windows Desktop,IE,US
1208,cguxptdi6h,2010-07-28,2010-07-28 03:44:15,2010-07-28,-unknown-,105.0,basic,3,en,direct,direct,untracked,Web,Mac Desktop,Firefox,US
1239,6vpmryt377,2010-07-30,2010-07-30 05:52:04,2010-07-30,FEMALE,2014.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,CA
