train_users.csv - the training set of users     
test_users.csv - the test set of users     
id: user id    
date_account_created: the date of account creation    
timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
date_first_booking: date of first booking    
gender    
age    
signup_method    
signup_flow: the page a user came to sign up from    
language: international language preference    
affiliate_channel: what kind of paid marketing     
affiliate_provider: where the marketing is e.g. google, craigslist, other    
first_affiliate_tracked: what's the first marketing the user interacted with before signing up    
signup_app    
first_device_type    
first_browser    
country_destination: this is the target variable you are to predict    
sessions.csv - web sessions log for users     
user_id: to be joined with the column 'id' in users table    
action     
action_type    
action_detail    
device_type     
secs_elapsed   
countries.csv - summary statistics of destination countries in this dataset and their locations    
age_gender_bkts.csv - summary statistics of users' age group, gender, country of destination    
sample_submission.csv - correct format for submitting your predictions    

In [None]:
import os

import numpy as np
import pandas as pd

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Training Data General Exploration

In [None]:
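# Use the dataset directory explicitly rather than the leftover loop variable `dirname`
df = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip")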
df.head()

In [None]:
df.columns

In [None]:
df.info()

### Takeaway

Our columns have one float, two integers, and 13 strings.     
Looking at the .head() sample, I see:
* id is string based.    
* date_account_created is a date in the form YYYY-MM-DD
* timestamp_first_active is an int, but is really a timestamp in the form YYYYMMDDHHMMSS, in 24-hour format.
* date_first_booking has a lot of missing entries, and is in YYYY-MM-DD form
* gender takes at least the values male, female, and -unknown-
* age is missing entries, and is a float. It is a float even though it is discrete because pandas stores missing values as NaN, which a plain integer column cannot hold
* signup_method is advertising data
* signup_flow is advertising data. This number is the page the user came to sign up on
* language is the language preference; it is stored as a string but is categorical
* affiliate_channel is advertising data
* affiliate_provider is advertising data
* first_affiliate_tracked is advertising data, is also missing data.
* signup_app is advertising data
* first_device_type is advertising data
* first_browser is advertising data
* country_destination is the target variable to predict for this competition. It has multiple classes


### Actionables
I will clean the data by changing the data types to a manageable form. For example, converting the date columns from strings to datetime objects, or categorizing categorical variables. 

## Change dates to datetime objects

In [None]:
df["date_account_created"] = pd.to_datetime(df.date_account_created)
df["date_account_created"].head()

In [None]:
df["timestamp_first_active"] = pd.to_datetime(df.timestamp_first_active, format="%Y%m%d%H%M%S")
df["timestamp_first_active"].head()

In [None]:
df["date_first_booking"] = pd.to_datetime(df.date_first_booking)
df["date_first_booking"].head()

## Change Categorical Variables
I will change gender and age. 

In [None]:
df.gender.value_counts()

In [None]:
gender_map = {
    "-unknown-" : 0,
    "FEMALE" : 1,
    "MALE" : 2,
    "OTHER" : 3
}
df["gender"] = df["gender"].map(gender_map)
df.gender.head()

In [None]:
df["age"] = df.age.fillna(0).astype(int)
df.age.head()

### Takeaways
I cleaned up two variables, but before moving on I would like to visualize the data first. 
The categorical names would be more helpful to me as is, since for this notebook I am not looking to apply machine learning tools. 
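
One way to keep the readable names while still having integer codes available is a pandas categorical. The following is only a sketch on a toy series; the category order is an assumption chosen to match the mapping used above, not something taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the gender column; the labels mirror those seen above.
gender = pd.Series(["-unknown-", "FEMALE", "MALE", "OTHER", "FEMALE"])

# An explicit category order (assumed here) fixes which integer code each label gets.
categories = ["-unknown-", "FEMALE", "MALE", "OTHER"]
gender_cat = gender.astype(pd.CategoricalDtype(categories=categories))

# Labels stay readable for plots, while integer codes are available when needed.
gender_codes = gender_cat.cat.codes
print(gender_cat.tolist())
print(gender_codes.tolist())
```

With this approach the plots keep labels such as FEMALE, and the codes can be pulled out later if a model needs numbers.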

### Actionables
Visualize the variables

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# pairplot builds its own figure, so a separate plt.figure call only creates an empty one
sns.pairplot(df)
plt.show()

In [None]:
df.describe()

### Takeaways
The age data is not well defined. We have ages in the 2000 range. These values should be dropped, since the current data is very dirty and I am unsure how much utility we can gain from observing them in any form. 

Similarly, the pairwise plot does not provide much information since almost all the columns are categorical in some way. 


### Actionables
I should further explore the data, and fix the age issue. Maybe something like, all ages greater than 100 should be set to 0.
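
As an alternative to setting implausible ages to 0, they could be marked as missing so that they do not look like real ages. This is a minimal sketch on toy data; the helper name `clean_age` and the bounds 15 and 90 are assumptions for illustration, not values settled on in this notebook.

```python
import numpy as np
import pandas as pd

def clean_age(age, low=15, high=90):
    """Return ages as floats, with values outside [low, high] replaced by NaN."""
    age = pd.to_numeric(age, errors="coerce")
    # between() is inclusive on both ends and is False for NaN, so NaN stays NaN.
    return age.where(age.between(low, high))

ages = pd.Series([25.0, 2014.0, np.nan, 105.0, 5.0, 90.0])
cleaned = clean_age(ages)
print(cleaned.tolist())
```

Using NaN keeps "unknown" separate from any numeric value, at the cost of leaving the column as a float.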

In [None]:
df[(df.age > 50) & (df.age<200)]["age"].hist()

In [None]:
df[(df.age > 80) & (df.age<120)]["age"].hist()

In [None]:
df.loc[df.age > 90, "age"] = 0

In [None]:
df.age.describe()

In [None]:
df.age[(df.age > 0) & (df.age < 18)].describe()

In [None]:
# As expected a lot of users under 18 are not choosing a destination
df.country_destination[(df.age > 0) & (df.age < 18)].value_counts()

In [None]:
# Look into the ages at which the cutoff occurs
df.age[(df.age > 0) & (df.age < 18) & (df.country_destination != "NDF")]

In [None]:
df.country_destination[(df.age >= 15) & (df.age < 18)].value_counts()

In [None]:
df.columns

In [None]:
for column in df.columns[6:]:
    print(df[column].value_counts(), "\n")

### Takeaways
Age:      
An age of 100 has a spike, so it might be a gag age. There was a dip in age close to 90, so I used that as the cutoff point. Anything higher will be treated as 0, which provides us no information. 
I looked at the age ranges in which a destination country was selected. 
Outside of those with NDF, it seems there is no simple heuristic to determine country destination. 


The value counts open up the floor to a lot of exploration. 
* Why do the numbers for signup_method not match those in signup_flow?
* Why is signup_flow at 25 so popular?
* Why do the browsers have a google earth, and a psvita browser?
* How many counts per id do we have? 
* Is there even enough information from a person's first interaction to determine where that person would travel to? Theoretically this does not make much sense. 
* How do we deal with the unbalanced classes and unbalanced data?
* Will we get more useful data after ignoring the age data that is too large or too small?

In [None]:
# legends = df.signup_method.unique()
# for i in range(df.shape[1] - 6): 
#     counts = df[df.columns[6 + i]].value_counts()
#     for legend in legends: 
#         counts = pd.concat([counts, df[df.signup_method == legend][df.columns[6 + i]].value_counts()], axis=1)
#     counts.fillna(0).astype(int)
#     plt.plot(counts.iloc[:, 1:], label=legends)
#     plt.xticks(rotation=45)
#     plt.legend()
#     plt.show()
    
# Redo but in seaborn
for i in range(len(df.columns) - 6):
    plt.figure(i)
    sns.countplot(x=df.columns[6 + i], hue='signup_method',data=df)
    plt.xticks(rotation=45)
    

### Takeaways
This suggests that the method of signing up does not really affect much. I say this based on the position of the lines in the graph relative to the Y axis. 

In [None]:
df.id.value_counts()

### Takeaways
Each id appears only once, so each row will be treated as a different person. 


# Prediction Variable Influence

I will check the effect of the 10 most frequent values in each column on the country destination. 
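
The filtering to the most frequent values can be pulled into a small helper. This is a sketch on toy data; `keep_top_values` is a hypothetical name, and `n` can be set to however many values should be kept.

```python
import pandas as pd

def keep_top_values(frame, column, n=10):
    """Return only the rows whose value in `column` is among its n most frequent values."""
    # value_counts() sorts by frequency, so the first n index entries are the top values.
    top = frame[column].value_counts().index[:n]
    return frame[frame[column].isin(top)]

toy = pd.DataFrame({
    "first_browser": ["Chrome", "Chrome", "Chrome", "Safari", "Safari", "IE"],
    "country_destination": ["NDF", "US", "NDF", "US", "NDF", "FR"],
})
subset = keep_top_values(toy, "first_browser", n=2)
print(subset["first_browser"].unique().tolist())
```

The result can be passed directly as the `data` argument of `sns.countplot`.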

In [None]:
df.columns

In [None]:
for i in range(len(df.columns)):
    plt.figure(i, figsize=(12, 8))
    sns.countplot(x=df.columns[i], hue='country_destination',
                  data=df[df[df.columns[i]].isin(
                      df[df.columns[i]].value_counts().index[:10]
                  )]
                 )
    plt.xticks(rotation=45)

### Takeaways
Other than in the date_first_booking plot, NDF was always the most popular destination.
This makes me more concerned about what NDF means. It could mean "No Destination Found"

In [None]:
df[df.country_destination == "NDF"].describe()

In [None]:
df.describe()

In [None]:
df[df.country_destination == "NDF"].date_first_booking.value_counts()

### Takeaways
I believe these three lines of code support the idea that NDF means no destination found. There were no recorded booking dates for these ids. Another small piece of information is that these people are also less likely to have a recorded gender or age. 
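
The same idea can be checked directly rather than read off the summaries. A minimal sketch on a toy frame that reuses the column names from the training set; the rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "country_destination": ["NDF", "US", "NDF", "FR"],
    "date_first_booking": [None, "2014-01-05", None, "2014-02-10"],
})
toy["date_first_booking"] = pd.to_datetime(toy["date_first_booking"])

is_ndf = toy["country_destination"] == "NDF"
# True if no NDF row has a booking date.
ndf_never_booked = toy.loc[is_ndf, "date_first_booking"].isna().all()
# True if every non-NDF row has a booking date.
booked_have_dates = toy.loc[~is_ndf, "date_first_booking"].notna().all()
print(ndf_never_booked, booked_have_dates)
```

If both checks hold on the real data, NDF and a missing date_first_booking carry the same information.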

In a previous section I looked into age and NDF. There seems to be nothing to infer about age and the destination. But it does not make sense that people aged 2 are selecting locations. This could be a result of differential privacy at work, or just a fault of a generated dataset. 

### Actionable
Let's look at how the new changed age is related to the normalized values of the country destination counts.

In [None]:
df[df.age == 0]["country_destination"].value_counts(normalize=True)

In [None]:
df["country_destination"].value_counts(normalize=True)

In [None]:
df[(df.age == 0) & (df.country_destination != "NDF")]["country_destination"].value_counts(normalize=True)

In [None]:
df[(df.country_destination != "NDF")]["country_destination"].value_counts(normalize=True)

### Takeaways
With the adjusted ages, where all ages over 90 are set to 0, it seems that users with age 0 are more likely to end up as NDF. I verified this by pulling the normalized values of the destinations without NDF: once NDF is excluded, the ratios for age 0 and for all users are very similar. 
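
The comparison of destination shares can also be done in one table. This is a sketch on made-up rows, assuming age 0 marks a missing or discarded age as above; the `age_group` column and its labels are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [0, 0, 0, 0, 30, 41, 25, 52],
    "country_destination": ["NDF", "NDF", "NDF", "US", "NDF", "US", "FR", "US"],
})
# Label each row by whether its age is the placeholder 0.
toy["age_group"] = toy["age"].eq(0).map({True: "age == 0", False: "age > 0"})

# normalize="index" makes each row of the table sum to 1.
shares = pd.crosstab(toy["age_group"], toy["country_destination"], normalize="index")
print(shares)
```

Each row gives the destination shares within one age group, so the NDF column can be compared side by side.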

This could be a flaw in the system, or an effect of the use of incognito. 
I can look into that by checking the browsers. 


In [None]:
df[(df.country_destination == "NDF") & (df.age == 0)]["first_browser"].value_counts()

In [None]:
age = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip")["age"]
age.head()

In [None]:
age = age.fillna(0).astype(int)
age.head()

In [None]:
df["age"] = age

In [None]:
df[(df.country_destination == "NDF") & (df.age >= 90)]["first_browser"].value_counts()

In [None]:
# Readjust the ages over 90 to be separate from 0
df.loc[df.age >= 90, "age"] = -1

In [None]:
df[(df.country_destination == "NDF")]["first_browser"].value_counts().head()

In [None]:
set(df.first_browser.unique()) - set(df[(df.country_destination == "NDF")]["first_browser"].unique())

### Takeaways
Visually there does not seem to be any big pattern tying age with first browser.     
There also does not seem to be that much of a relationship with NDF either.     

There are other datasets, but just looking at this dataset alone it is really hard to tell with theory what could tie into predicting not only a purchase but also the location of purchase. I am honestly shocked. 



In [None]:
for destination in df.country_destination.unique(): 
    plt.figure(destination, figsize=(10, 8))
    sns.countplot(x="signup_flow", data=df[df.country_destination == destination])
    plt.title(destination)
    plt.xticks(rotation=45)

### Takeaways
I zoomed in for clarity despite not seeing anything in the previous densely populated graphs. I continue with my conclusion that it is impossible for me to infer anything from these. I cannot discern a relationship between country destination and signup_flow, and I do not think I can find anything by zooming into the other multidimensional visualizations. 

# Age Gender Brackets General Exploration

In [None]:
import pandas as pd
import numpy as np
import os

df = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/age_gender_bkts.csv.zip")
df.head()

In [None]:
df.age_bucket.unique()

In [None]:
df.describe()

In [None]:
for elem in df.columns: 
    print(df[elem].value_counts(), "\n")

### Takeaways
This is a reference chart for 2015. The only value that really varies is the population in thousands. I suppose this can be used as a prediction for where people would go: would users of similar ages go to similar locations on Airbnb? 

I am unsure how to integrate this. One approach could be to check the age of the person applying, and then if the age is a popular value at the location then mark a boolean as True. 
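
A first step toward that kind of join would be mapping each user's age onto a bucket label. This is only a sketch: it assumes five-year labels such as `25-29` with a final `100+` bucket, which should be checked against the output of `df.age_bucket.unique()` before use, and `age_to_bucket` is a hypothetical helper.

```python
import pandas as pd

def age_to_bucket(age):
    """Map an age in years to a five-year bucket label such as '25-29' or '100+'."""
    if pd.isna(age) or age < 0:
        return None  # unknown or invalid ages get no bucket
    if age >= 100:
        return "100+"
    low = int(age) // 5 * 5
    return f"{low}-{low + 4}"

users = pd.DataFrame({"id": ["a", "b", "c"], "age": [27.0, 101.0, None]})
users["age_bucket"] = users["age"].map(age_to_bucket)
print(users)
```

Once users carry a bucket label, the bracket table can be merged in on the bucket and gender columns.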

# Countries General Exploration

In [None]:
import pandas as pd
import numpy as np
import os

df = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/countries.csv.zip")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df

In [None]:
df.shape

### Takeaways
This is a 10-row, 7-column dataset covering 10 countries: the destinations in the training dataset minus NDF and other. 
We have the physical distance from the US, and then the language distance from the US. The higher the number, the less similar the language is to English according to Levenshtein distance, since a distance of 0 means identical. It is also interesting to see that the area of the location is also included. 
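
As a reminder of how a Levenshtein distance behaves, here is a minimal sketch of the classic string edit distance. It is shown on single words only; how the dataset's language distance was actually computed is not stated here, so this illustrates the metric rather than reproducing the column.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("house", "house"))
print(levenshtein("house", "haus"))
```

Identical strings have distance 0, and the distance grows as the strings diverge.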

Based on the information presented here, I feel that there are some assumptions that we can make. One is that all the people who are included in this list started in the US. But other than that, I feel that this dataset is not very useful. What would the latitude and longitude of one point in the US even do? 

Not a useful dataset. 

# Sessions General Exploration

In [None]:
import pandas as pd
import numpy as np
import os

df = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/sessions.csv.zip")
df.head()

In [None]:
df.info()

In [None]:
# For some reason, info is not putting out the null values for these strings/floats
for elem in df.columns: 
    print(df[elem].isnull().value_counts(), "\n")

In [None]:
df.describe()

In [None]:
for elem in df.columns: 
    print(df[elem].value_counts(), "\n")

### Takeaway
This is a massive dataset. It seems to be a transactional dataset; however, there are no timestamps. 
* We have several users interacting thousands of times, and those interacting only once. 
* There are 359 actions for AirBnB, which seem to be related to 10 types of actions
* Those are related to 155 action details
* Then we have device type, which is not as diverse as the first browser from the training data. 
* Seconds elapsed ranges from 0 to well over an equivalent of 24 hours.


### Actionables
I think the first thing would be to match up the ids. Find out what is in here that is not in the other dataset, and vice versa.
Then I should think of a way to extract features from here that would be usable in conjunction with the training data. 
With a separate dataset like this, I am curious to see if any information is also in the testing data. If there is not then maybe looking in this sessions dataset is not very useful since there is no way to join the information to the other dataset other than to build heuristics. I would prefer not to do so since building heuristics would introduce unremovable bias. 
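
One possible shape for such features is one row per user, aggregated from the session log and joined onto the users table. This is a sketch on toy frames; the feature names `n_actions`, `n_devices`, and `total_secs` are hypothetical choices, not a recommendation of which features matter.

```python
import pandas as pd

sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", None],
    "action": ["search", "show", "search", "show"],
    "device_type": ["Mac Desktop", "iPhone", "Mac Desktop", "iPhone"],
    "secs_elapsed": [30.0, 120.0, None, 10.0],
})
users = pd.DataFrame({"id": ["u1", "u2", "u3"], "age": [31, 0, 45]})

# One row per user; rows with a missing user_id are dropped by groupby.
features = sessions.groupby("user_id").agg(
    n_actions=("action", "size"),
    n_devices=("device_type", "nunique"),
    total_secs=("secs_elapsed", "sum"),
).reset_index()

# A left join keeps users that have no session rows (their features are NaN).
merged = users.merge(features, how="left", left_on="id", right_on="user_id")
print(merged)
```

Because the join is keyed on the id, the same aggregation can be applied to train and test users alike.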

In [None]:
import pandas as pd
import numpy as np
import os

df = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/test_users.csv.zip")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### Takeaways
As expected, the data columns are identical to those of the training set. At a quick look, the data also seems more populated, with the only missing values occurring in age. Also, there are no values in date_first_booking, which is expected since we are trying to predict where the person makes the first booking.

### Actionables
Check the id overlap. 

In [None]:
import pandas as pd
import numpy as np
import os

train = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip")
sessions = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/sessions.csv.zip")
test = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/test_users.csv.zip")

In [None]:
sessions.columns

In [None]:
# Id in sessions and not in train
print("unique in sessions v train", len(set(sessions.user_id.unique()) - set(train.id.unique())))
# Id in train and not in sessions
print("unique in train v sessions", len(set(train.id.unique()) - set(sessions.user_id.unique())))
# Id in sessions and not in test
print("unique in sessions v test", len(set(sessions.user_id.unique()) - set(test.id.unique())))
# Id in test and not in sessions
print("unique in test v sessions", len(set(test.id.unique()) - set(sessions.user_id.unique())))
# Id overall v sessions
print("unique overall v sessions", len(set(pd.concat([train, test], axis=0).id.unique()) - set(sessions.user_id.unique())))
# Id sessions v overall
print("unique sessions v overall", len(set(sessions.user_id.unique()) - set(pd.concat([train, test], axis=0).id.unique())))

### Takeaways
It really seems like we have this dataset where we can find correlations between certain attributes of train and sessions which can be generalized to understanding the test data. 
There are a lot more unique ids overall, with most of the test users being in sessions. Only 428 users in test are not in sessions. 
I think with this exploration so far, I am now ready to move onto part 2 to apply other techniques to understanding the data. 
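
The overlap checks above could be wrapped in one helper so that each pair of id columns is compared the same way. A sketch on toy series; `id_overlap` is a hypothetical name, and dropping missing ids first is an assumption about how they should be treated, since a missing user_id would otherwise count as an extra unique id.

```python
import pandas as pd

def id_overlap(left_ids, right_ids):
    """Count ids found only in left, only in right, and in both."""
    left, right = set(left_ids.dropna()), set(right_ids.dropna())
    return {
        "only_left": len(left - right),
        "only_right": len(right - left),
        "both": len(left & right),
    }

train_ids = pd.Series(["a", "b", "c"])
session_ids = pd.Series(["b", "c", "c", "d", None])
print(id_overlap(train_ids, session_ids))
```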


# Conclusion

### What did I do
I did general exploration using info(), describe(), univariate visualizations and bivariate visualizations. 
From there I checked for any missing values, the datatypes for each dataset, and possible relationships based on theory. 

#### What worked
* I can tell that clustering will help make sense of this high dimensional data.
* There is ID overlap in sessions and the training dataset. 


#### What did not work
* There is still a lot of missing data for age, gender, and other variables.
* I cannot find a simple relationship between the country destination and other variables. 
* I cannot make sense of signup_flow or the other categorical variables. I zoomed in to one sign up and I could not get any more information than I did with the multivariate graphs.  


The dataset is pretty complex with not a lot of theory to back up the connection between the presented variables and country of destination prediction. We have a lot of categorical variables which would make general analysis complicated beyond using bar graphs to evaluate two dimensional changes.      

The next notebook will look to apply statistical techniques more capable of making sense of higher dimensional data. This will be done by changing the categorical variables to dummy variables and then running tools such as clustering to find groups. I will also see how to combine the sessions dataset to provide generalizable insights into the test dataset. 
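
As a preview of the dummy-variable step, here is a minimal sketch using `pd.get_dummies` on a toy frame. The rows are made up, and which columns to encode is an assumption left for the next notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "signup_method": ["basic", "facebook", "basic"],
    "first_browser": ["Chrome", "Safari", "Chrome"],
    "age": [31, 0, 45],
})

# Each listed column is replaced by one indicator column per distinct value.
dummies = pd.get_dummies(toy, columns=["signup_method", "first_browser"])
print(dummies.columns.tolist())
```

The resulting all-numeric frame is the kind of input that clustering tools expect.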

