This project/kernel is on hold while I attend to other projects and studies. The original intent was to run a two-stage classifier: one model to determine whether a booking was made, and a second to decide which destination was chosen. Instead, this kernel will be an exploratory page that shows my first steps in exploring an unfamiliar dataset.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import os
print(os.listdir("../input"))

# Introduction
Welcome to this notebook.
In this notebook I will explore the Airbnb competition dataset, gather insights, visualize any patterns, and build a general understanding of the data. The notebook will not be edited and formatted as a final draft, because it records my thoughts and steps as I work through the dataset. A polished version can be released upon request.

## The Dataset
We are given a list of users with information such as their demographics, web session records, and summary statistics. The purpose of this dataset is to predict where a new user is likely to make their first booking. All the users are from the US, and there are 12 possible outcomes: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. NDF and 'other' are separate classes.

As the directory listing above shows, we are given 6 datasets: countries, sessions, train_users_2, age_gender_bkts, test users, and sample submission. Train users and test users should hold the same type of information, with the exception of the country destination, which is what we predict. Countries should hold information on the destination countries, sessions the web session logs for the users, and age_gender_bkts summary statistics on users by age and gender bucket.

Since the information is spread across several datasets, we might have to join them together, or we can check whether one master dataset already holds everything so that we do not have to look through the derivatives. For a dataset this small, it is more efficient for me to work from one table than to join multiple tables each time I filter.
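
If a join does turn out to be necessary, it could look roughly like the sketch below. The toy frames stand in for train_users_2.csv and sessions.csv, and the column names (`id`, `user_id`, `secs_elapsed`) are my assumptions about the real files, not something confirmed above.

```python
import pandas as pd

# Toy stand-ins for the users and sessions tables (column names are assumed).
users = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "country_destination": ["NDF", "US", "FR"],
})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "secs_elapsed": [30.0, 120.0, 45.0],
})

# Collapse sessions to one row per user first, so the join
# does not duplicate user rows.
per_user = (
    sessions.groupby("user_id")
    .agg(n_actions=("secs_elapsed", "size"), total_secs=("secs_elapsed", "sum"))
    .reset_index()
)

# A left join keeps users that have no session records (u3 here).
merged = users.merge(per_user, how="left", left_on="id", right_on="user_id")
merged = merged.drop(columns="user_id")
print(merged)
```

Aggregating before merging keeps the result at one row per user, which is the shape a classifier would need.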

## Theory
How would a professional predict where a person may want to go based on their behavior? A simple predictor would look at which locations the user was viewing, along with the click-through rate or how far they explored each listing. Time elapsed might correlate positively with interest, but anomalies such as extremely long viewing times or near-instant bounces should be ignored, as they would distort the prediction. The number of return visits to a certain country page may also indicate interest in that country. I do not understand much about Airbnb's customers yet, so I will begin the data analysis.
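
As a rough sketch of the idea of ignoring extreme viewing times, the snippet below trims a toy session log before summarizing it per user. The column names and the 10th/90th percentile cutoffs are assumptions chosen for illustration, not values derived from the real data.

```python
import pandas as pd

# Hypothetical session log; column names are assumptions.
sess_toy = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "c"],
    "secs_elapsed": [1.0, 60.0, 90.0, 75.0, 100000.0, 50.0],
})

# Treat anything outside the 10th-90th percentile band as an anomaly
# (near-instant bounces and abandoned tabs). The cutoffs are arbitrary.
lo, hi = sess_toy["secs_elapsed"].quantile([0.10, 0.90])
kept = sess_toy[sess_toy["secs_elapsed"].between(lo, hi)]

# Per-user summaries on the trimmed log: average time and number of visits.
mean_secs = kept.groupby("user_id")["secs_elapsed"].mean()
visits = kept.groupby("user_id").size()
print(mean_secs)
print(visits)
```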

# Exploration

In [None]:
train = pd.read_csv('../input/train_users_2.csv')

In [None]:
train.info()

- The columns fall into a few groups: three time variables, an identifier, basic demographics, the signup flow position, affiliate information, and device and browser details.
- Dates are comparable, so they can be converted to integers or to a datetime type.
- Gender can be converted to dummy variables.
- Age is a float, which is interesting. It might mean that there are missing values.
- signup_flow is an integer giving the position in the flow where the customer signed up for an account. This suggests that the variable is ordinal.
- Language could show some relationship to location, as some people prefer to travel to places where there is no extreme language barrier. Others prefer the opposite.
- Since there are only 88908 first bookings, we can expect many country_destination rows to be NDF.
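
A minimal sketch of the date and gender conversions mentioned above, on toy rows. I am assuming that timestamp_first_active is stored as a `YYYYMMDDHHMMSS` integer and that date_account_created is a `YYYY-MM-DD` string; the derived column name `days_active_before_signup` is hypothetical.

```python
import pandas as pd

# Toy rows; the storage formats are assumptions about the real file.
toy = pd.DataFrame({
    "date_account_created": ["2010-06-28", "2011-05-25"],
    "timestamp_first_active": [20090319043255, 20090523174809],
    "gender": ["-unknown-", "MALE"],
})

# Parse both columns into proper datetimes.
toy["date_account_created"] = pd.to_datetime(toy["date_account_created"])
toy["timestamp_first_active"] = pd.to_datetime(
    toy["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)

# Drop the time of day so both columns are comparable as dates.
toy["days_active_before_signup"] = (
    toy["date_account_created"] - toy["timestamp_first_active"].dt.normalize()
).dt.days

# One indicator column per gender value.
gender_dummies = pd.get_dummies(toy["gender"], prefix="gender")
print(toy)
print(gender_dummies)
```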

In [None]:
train.head(20)

A quick look at the first 20 entries shows no simple visible pattern. I need to convert timestamp_first_active to YYYY-MM-DD so I can compare it to the date the account was created, and I am interested in combining these results with the web sessions to see whether more frequent visits are correlated with purchases. Another area to explore would be whether users are connected to one another: for instance, people who view a page and share the link with a friend are unlikely to both make a purchase, because only one of them needs to book.

## Cleaning

### Code


In [None]:
train.gender.value_counts()

In [None]:
train.language.value_counts()

In [None]:
# There is an outlier that is disrupting the graph
plt.hist(train.age.dropna(), bins=60)
plt.show()

In [None]:
# My guess is that birth years are being entered as "age".
# I will run an anomaly classifier and an age regression.
train.age.value_counts().tail(15)
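
Before building an anomaly classifier, a simple rule-based cleanup could serve as a baseline. The sketch below is a stand-in, not the classifier or regression mentioned above. It assumes that "ages" between 1900 and 2100 are really birth years, and that anything outside 15 to 100 is implausible; both thresholds are my assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows mixing plausible ages, birth years, and missing values.
toy = pd.DataFrame({
    "age": [34.0, 2014.0, 1949.0, 105.0, 5.0, np.nan],
    "date_account_created": pd.to_datetime([
        "2014-01-01", "2014-03-02", "2013-07-15",
        "2012-02-02", "2013-01-01", "2014-05-05",
    ]),
})

# Assumption: values in 1900..2100 are birth years, so convert them
# to an age as of the signup year.
signup_year = toy["date_account_created"].dt.year
looks_like_year = toy["age"].between(1900, 2100)
age_clean = toy["age"].where(~looks_like_year, signup_year - toy["age"])

# Assumption: ages outside 15..100 are implausible and become missing.
age_clean = age_clean.where(age_clean.between(15, 100))
print(age_clean)
```

The rows set to missing here could then be filled by the age regression later on.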

In [None]:
train.affiliate_provider.value_counts()

In [None]:
for aff in train.affiliate_provider.value_counts().index:
    print(aff)
    print(train.country_destination[train.affiliate_provider == aff].value_counts())
    
# It is not easy to tell what the connection is between affiliate providers and the country destination,
# so I guess we will be feeding the classification algorithm rather blindly for this Kaggle competition.
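
Raw counts are hard to compare across providers of very different sizes. One option, sketched here on toy data, is a row-normalized cross-tabulation so that each provider's destinations sum to 1.

```python
import pandas as pd

# Toy data using the same two column names as the loop above.
toy = pd.DataFrame({
    "affiliate_provider": ["direct", "direct", "google", "google", "google", "direct"],
    "country_destination": ["NDF", "US", "NDF", "NDF", "FR", "US"],
})

# normalize="index" turns each row into shares of that provider's users.
shares = pd.crosstab(
    toy["affiliate_provider"], toy["country_destination"], normalize="index"
)
print(shares.round(2))
```

On the real data, the same call on `train.affiliate_provider` and `train.country_destination` should give one comparable row per provider.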

In [None]:
train.affiliate_channel.value_counts()

In [None]:
train.signup_flow.value_counts()

## More data
After looking through the training dataset, I realized that there is a lot to accomplish. 
- Regression for the age
- A way to handle the unknown genders (almost 45% are unknown)
- Creating two date features: the time from account creation to first booking, and the time between first activity and account creation
- Binning signup_flow values and deciding how to treat them
- Creating a classifier of whether the person chose a country, then further deciding which country was chosen
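
The two-stage targets and one of the date features could be set up roughly as follows. This is a sketch on toy rows; the derived names `booked`, `stage2`, and `days_to_booking` are hypothetical.

```python
import pandas as pd

# Toy rows; a missing booking date goes together with the NDF class.
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2014-01-01", "2014-01-10", "2014-02-01"]),
    "date_first_booking": pd.to_datetime(["2014-01-05", None, "2014-02-01"]),
    "country_destination": ["US", "NDF", "FR"],
})

# Stage 1 target: did the user book anywhere at all?
toy["booked"] = toy["country_destination"].ne("NDF")

# Stage 2 training rows: only users who booked, labeled by country.
stage2 = toy.loc[toy["booked"], ["country_destination"]]

# Days from account creation to first booking (missing for non-bookers).
toy["days_to_booking"] = (
    toy["date_first_booking"] - toy["date_account_created"]
).dt.days
print(toy)
```

Note that a feature built from the booking date is probably only useful for analysis, since that date is not known at the time a prediction has to be made.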

Outside of the first dataset, we still have to look through the sessions, countries, and age_gender_bkts of the users. 

In [None]:
dem = pd.read_csv('../input/age_gender_bkts.csv')
coun = pd.read_csv('../input/countries.csv')
sess = pd.read_csv('../input/sessions.csv')

In [None]:
dem.info()

In [None]:
dem.head(20)

In [None]:
dem.year.value_counts()
# Year is not useful for t

In [None]:
coun.info()

In [None]:
coun

In [None]:
sess.info()