# Introduction

In this challenge, we are given a list of users along with their demographics, web session records, and some summary statistics. We are asked to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes for the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. Note that 'NDF' differs from 'other': 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there was no booking.

The training and test sets are split by date. The test set contains the new users whose first activity is after 7/1/2014, and we predict a destination for each of them. The sessions dataset only dates back to 1/1/2014, while the users dataset dates back to 2010.
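The date split can be checked with a minimal sketch like the one below. It assumes that `timestamp_first_active` is encoded as `YYYYMMDDHHMMSS` (which the summary statistics later in this notebook suggest) and that the cutoff is inclusive of 7/1/2014. The rows are made-up toy values, not real records.

```python
import pandas as pd

# Toy rows (made up); the timestamps are assumed to follow YYYYMMDDHHMMSS
users = pd.DataFrame({
    "id": ["a", "b", "c"],
    "timestamp_first_active": [20140630235959, 20140701000000, 20140815120000],
})

# Parse the integer timestamps into proper datetimes
users["first_active"] = pd.to_datetime(
    users["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)

# Assumed cutoff: users first active on or after 2014-07-01 fall in the test period
cutoff = pd.Timestamp("2014-07-01")
users["in_test_period"] = users["first_active"] >= cutoff
print(users[["id", "first_active", "in_test_period"]])
```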

The datasets are:

* age_gender_bkts.csv
* countries.csv
* sample_submission_NDF.csv
* sessions.csv
* test_users.csv
* train_users_2.csv

We will go through the following steps:
* Fetch Airbnb data from https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data
* Merge test_users.csv and train_users_2.csv
* Clean the data
* Parse the data
* Collect the data in a pandas DataFrame
* Display the data
* Explore the data

We will also show the different sub-steps that can be taken to reach the presented solution.
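The merge step listed above can be sketched as follows. This is an illustration on toy frames rather than the exact merge used later: the column names are assumptions modelled on the competition files, and `pd.concat` is one reasonable way to stack the two user tables so they can be cleaned together.

```python
import pandas as pd

# Toy stand-ins for the two user files (values are made up)
train = pd.DataFrame({
    "id": ["u1", "u2"],
    "age": [34.0, None],
    "country_destination": ["US", "NDF"],  # label column, present only in training data
})
test = pd.DataFrame({
    "id": ["u3"],
    "age": [29.0],
})

# Stack the rows; columns missing from one frame are filled with NaN
all_users = pd.concat([train, test], axis=0, ignore_index=True, sort=False)

# Remember where the training rows end so the frames can be split again later
n_train = len(train)
print(all_users)
```

Keeping `n_train` makes it possible to recover the original split with `all_users.iloc[:n_train]` and `all_users.iloc[n_train:]` once cleaning is done.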

As we begin the study, we first need to ensure that the datasets are clean and that all irrelevant parts are removed or filled correctly.
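One simple way to see what needs removing or filling is to measure the share of missing values per column. The sketch below uses a made-up toy frame; the column names are borrowed from the user files, and dropping fully empty columns is only one possible cleaning choice.

```python
import pandas as pd

# Toy frame (made up): one partly missing column, one entirely empty column
sample = pd.DataFrame({
    "age": [34.0, None, 41.0, 28.0],
    "date_first_booking": [None, None, None, None],
})

# Fraction of missing values in each column
missing_share = sample.isna().mean()
print(missing_share)

# Drop columns that carry no information at all (every value missing)
cleaned = sample.dropna(axis=1, how="all")
print(list(cleaned.columns))
```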

We import the packages needed for the data analysis:
* numpy
* pandas
* matplotlib.pyplot
* seaborn



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

# Load the data into DataFrames
train_users = pd.read_csv("train_users_2.csv") # Labeled training set, used to build the model
test_users = pd.read_csv("test_users.csv") # Unlabeled test set, used to generate predictions

## Cleaning Data
### Basic Analysis of Data
In cells 21 to 24, we perform a basic analysis of the data to identify patterns and irrelevant parts.

* `.shape` returns the number of rows and columns in the dataset
* `.describe()` returns summary statistics for the numerical columns in the dataset
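As a quick illustration of these two calls, here is a minimal sketch on a made-up toy frame (the values are not from the Airbnb data). It shows that `.shape` is an attribute returning a `(rows, columns)` tuple, and that `.describe()` by default summarizes only numeric columns and excludes missing values from `count`.

```python
import pandas as pd

# Toy frame (made up) with two numeric columns and one text column
toy = pd.DataFrame({
    "age": [28.0, 34.0, None, 43.0],
    "signup_flow": [0, 0, 3, 25],
    "gender": ["MALE", "FEMALE", "OTHER", "FEMALE"],
})

# .shape is an attribute, not a method: (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# .describe() skips the text column and ignores the missing age in its count
summary = toy.describe()
print(summary)
```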

In [21]:
train_users.shape

(213451, 16)

In [22]:
test_users.shape

(62096, 15)

In [23]:
train_users.describe()

| | timestamp_first_active | age | signup_flow |
|---|---|---|---|
| count | 213451.0 | 125461.0 | 213451.0 |
| mean | 20130850000000.0 | 49.668335 | 3.267387 |
| std | 9253717000.0 | 155.666612 | 7.637707 |
| min | 20090320000000.0 | 1.0 | 0.0 |
| 25% | 20121230000000.0 | 28.0 | 0.0 |
| 50% | 20130910000000.0 | 34.0 | 0.0 |
| 75% | 20140310000000.0 | 43.0 | 0.0 |
| max | 20140630000000.0 | 2014.0 | 25.0 |


In [24]:
test_users.describe()

| | timestamp_first_active | date_first_booking | age | signup_flow |
|---|---|---|---|---|
| count | 62096.0 | 0.0 | 33220.0 | 62096.0 |
| mean | 20140810000000.0 | NaN | 37.616677 | 7.813885 |
| std | 80245850.0 | NaN | 74.440647 | 11.254291 |
| min | 20140700000000.0 | NaN | 1.0 | 0.0 |
| 25% | 20140720000000.0 | NaN | 26.0 | 0.0 |
| 50% | 20140810000000.0 | NaN | 31.0 | 0.0 |
| 75% | 20140910000000.0 | NaN | 40.0 | 23.0 |
| max | 20140930000000.0 | NaN | 2002.0 | 25.0 |
