# Airbnb data exploration

*Following the guide at https://www.kaggle.com/davidgasquez/airbnb-recruiting-new-user-bookings/user-data-exploration/notebook*


Import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Render matplotlib figures inline in the notebook
%matplotlib inline

# Set figure aesthetics
sns.set_style("white", {'ytick.major.size': 10.0})
sns.set_context("poster", font_scale=1.1)

## 1. Loading data
Load data and check basic properties

In [2]:
# Load the data into DataFrames
train_users = pd.read_csv('./data/train_users_2.csv')
test_users = pd.read_csv('./data/test_users.csv')

# Print the number of users in each frame (DataFrame.shape is (rows, cols))
print("Loaded", train_users.shape[0], "train users and", test_users.shape[0], "test users.")

print("====[ Train users have", len(train_users.columns), "columns ]====\n",
      train_users.columns.tolist(), "\n")

print("====[ Test users have", len(test_users.columns), "columns ]====\n",
      test_users.columns.tolist(), "\n")

# Columns present in train_users but missing from test_users
print("Different column is ###", set(train_users.columns) - set(test_users.columns), "###")

Loaded 213451 train users and 62096 test users.
====[ Train users have 16 columns ]====
 ['id', 'date_account_created', 'timestamp_first_active', 'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser', 'country_destination'] 

====[ Test users have 15 columns ]====
 ['id', 'date_account_created', 'timestamp_first_active', 'date_first_booking', 'gender', 'age', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser'] 

Different column is ### {'country_destination'} ###
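The same column comparison can be written with pandas' own `Index.difference`, which skips the conversion to Python lists and sets. A minimal sketch on two toy frames; `train_demo` and `test_demo` are made up for illustration and are not the competition data:

```python
import pandas as pd

# Hypothetical toy frames standing in for train_users and test_users
train_demo = pd.DataFrame({
    "id": ["a", "b"],
    "age": [30.0, 41.0],
    "country_destination": ["US", "NDF"],
})
test_demo = pd.DataFrame({"id": ["c"], "age": [25.0]})

# Columns that exist in the train frame but not in the test frame
only_in_train = train_demo.columns.difference(test_demo.columns)
print(only_in_train.tolist())
```

On the real data this should report the single column `country_destination`, matching the output above.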


## 2. Merging data
Merge **train_users** and **test_users** into **users**

**Note that the 'country_destination' column will be 'NaN' for every row that comes from 'test_users'.**

In [11]:
# Merge train and test users
users = pd.concat((train_users, test_users), axis=0, ignore_index=True)

# Print the total number of users (DataFrame.shape is (rows, cols))
print("Merged into", users.shape[0], "users.")

Merged into 275547 users.
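To see why the merged frame carries NaN in `country_destination` for test users, here is a small sketch of the same `pd.concat` call on toy data. The frames `train_demo` and `test_demo` are hypothetical stand-ins, not the real files:

```python
import pandas as pd

# Hypothetical toy frames: only the train frame has the target column
train_demo = pd.DataFrame({"id": ["a", "b"], "country_destination": ["US", "NDF"]})
test_demo = pd.DataFrame({"id": ["c"]})

# Stack the rows; ignore_index=True renumbers the index 0..n-1
merged = pd.concat((train_demo, test_demo), axis=0, ignore_index=True)
print(merged)

# Rows that came from the test frame have no target value
print(merged["country_destination"].isnull().sum(), "row(s) without a destination")
```

Columns that are missing from one of the inputs are filled with NaN, so the number of NaN targets in the merged frame should equal the number of test rows.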


### Display the first five users (rows 0-4)

In [12]:
users.head(5)

| | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | date_first_booking | first_affiliate_tracked | first_browser | first_device_type | gender | id | language | signup_app | signup_flow | signup_method | timestamp_first_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | direct | direct | NaN | NDF | 2010-06-28 | NaN | untracked | Chrome | Mac Desktop | -unknown- | gxn3p5htnn | en | Web | 0 | facebook | 20090319043255 |
| 1 | seo | google | 38.0 | NDF | 2011-05-25 | NaN | untracked | Chrome | Mac Desktop | MALE | 820tgsjxq7 | en | Web | 0 | facebook | 20090523174809 |
| 2 | direct | direct | 56.0 | US | 2010-09-28 | 2010-08-02 | untracked | IE | Windows Desktop | FEMALE | 4ft3gnwmtx | en | Web | 3 | basic | 20090609231247 |
| 3 | direct | direct | 42.0 | other | 2011-12-05 | 2012-09-08 | untracked | Firefox | Mac Desktop | FEMALE | bjjt8pjhuk | en | Web | 0 | facebook | 20091031060129 |
| 4 | direct | direct | 41.0 | US | 2010-09-14 | 2010-02-18 | untracked | Chrome | Mac Desktop | -unknown- | 87mebub9p4 | en | Web | 0 | basic | 20091208061105 |


### Replace the gender value '-unknown-' with NaN

In [32]:
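# Assign the result back instead of calling replace(..., inplace=True) on the
# column, which may not update `users` in recent pandas versions
users['gender'] = users['gender'].replace('-unknown-', np.nan)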

In [33]:
users.head(5)

| | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | date_first_booking | first_affiliate_tracked | first_browser | first_device_type | gender | id | language | signup_app | signup_flow | signup_method | timestamp_first_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | direct | direct | NaN | NDF | 2010-06-28 | NaN | untracked | Chrome | Mac Desktop | NaN | gxn3p5htnn | en | Web | 0 | facebook | 20090319043255 |
| 1 | seo | google | 38.0 | NDF | 2011-05-25 | NaN | untracked | Chrome | Mac Desktop | MALE | 820tgsjxq7 | en | Web | 0 | facebook | 20090523174809 |
| 2 | direct | direct | 56.0 | US | 2010-09-28 | 2010-08-02 | untracked | IE | Windows Desktop | FEMALE | 4ft3gnwmtx | en | Web | 3 | basic | 20090609231247 |
| 3 | direct | direct | 42.0 | other | 2011-12-05 | 2012-09-08 | untracked | Firefox | Mac Desktop | FEMALE | bjjt8pjhuk | en | Web | 0 | facebook | 20091031060129 |
| 4 | direct | direct | 41.0 | US | 2010-09-14 | 2010-02-18 | untracked | Chrome | Mac Desktop | NaN | 87mebub9p4 | en | Web | 0 | basic | 20091208061105 |
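The effect of turning the '-unknown-' placeholder into NaN is easiest to check on a tiny column. The sketch below uses a hypothetical frame, `gender_demo`, rather than the real `users` data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with the same placeholder string as the real data
gender_demo = pd.DataFrame({"gender": ["-unknown-", "MALE", "FEMALE", "-unknown-"]})

# Before the replacement pandas sees no missing values at all
missing_before = gender_demo["gender"].isnull().sum()

# Swap the placeholder for a real missing-value marker
gender_demo["gender"] = gender_demo["gender"].replace("-unknown-", np.nan)
missing_after = gender_demo["gender"].isnull().sum()

print("missing before:", missing_before, "| missing after:", missing_after)
```

Once the placeholder is a real NaN, methods such as `isnull()`, `count()` and `dropna()` treat those rows as missing, which the next step relies on.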


In [35]:
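# Percentage of null/NaN values in each column:
#   .isnull().sum() counts the null/NaN entries,
#   .count() would count the non-null entries instead.
users_nan = (users.isnull().sum() / users.shape[0]) * 100

# Show only the columns that have missing values. 'country_destination' is
# dropped because it is NaN for every test user by construction.
users_nan[users_nan > 0].drop('country_destination')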
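As a worked example of the percentage calculation, the sketch below applies the same expression to a four-row toy frame. The frame `demo` and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 2 of 4 ages and 1 of 4 genders are missing
demo = pd.DataFrame({
    "age": [38.0, np.nan, 56.0, np.nan],
    "gender": ["MALE", np.nan, "FEMALE", "FEMALE"],
    "language": ["en", "en", "fr", "en"],
})

# Share of missing values per column, in percent
nan_pct = demo.isnull().sum() / demo.shape[0] * 100

# Keep only the columns that actually have missing values
with_missing = nan_pct[nan_pct > 0]
print(with_missing)
```

Here `age` comes out at 50% and `gender` at 25%, while `language` is filtered out because it has no missing values. `demo.isnull().mean() * 100` is a shorter way to get the same percentages.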

In [26]:
users.gender.shape[0]

275547
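Note that `shape[0]` counts every row, including those where the value is NaN, whereas `count()` counts only the non-null entries. A minimal sketch on a hypothetical Series, `gender_col`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
gender_col = pd.Series(["MALE", np.nan, "FEMALE"], name="gender")

total_rows = gender_col.shape[0]   # all rows, NaN included
non_null = gender_col.count()      # only rows that hold a value
print(total_rows, non_null, total_rows - non_null)
```

The difference between the two is the number of missing values, the same figure that `isnull().sum()` returns.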