# Introduction
In the part 1 exploration notebook, I left off with a lot of variables still to look into. The data was too complex to reason about in my head, so in this notebook I will apply clustering to make more sense of the high-dimensional data.

I will take both an unsupervised and a supervised approach, using K-means for the unsupervised part. For K-means, I first have to convert all the categorical columns in the training data to dummy variables. I will then iterate over different numbers of clusters and use the elbow method to help decide how many to keep. From there I can analyze the data further based on the assigned clusters.

Building on the assigned clusters, it may help classification to build smaller models per group, for example one for users whose id appears in the sessions dataset and one for those whose id does not. There may or may not be a difference; this notebook will explore that potential relationship. One caveat is that the categorical variables explored in the previous notebook are unbalanced: there are many long-tailed groups for dates, age, advertising data, and the target predictions.
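
As a minimal sketch of the split described above, the snippet below flags which users appear in the sessions data. The frames `toy_train` and `toy_sessions` and all their values are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Toy stand-ins for the real data; the ids and values are made up.
toy_train = pd.DataFrame({"id": ["a", "b", "c", "d"],
                          "country_destination": ["NDF", "US", "NDF", "FR"]})
toy_sessions = pd.DataFrame({"user_id": ["a", "a", "c"],
                             "action": ["search", "show", "search"]})

# True when the user has at least one recorded session row.
toy_train["has_session"] = toy_train["id"].isin(toy_sessions["user_id"])

# Compare the target distribution between the two groups.
split_counts = toy_train.groupby("has_session")["country_destination"].value_counts()
print(split_counts)
```

If the two groups show clearly different target distributions on the real data, that would be one argument for modeling them separately.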

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv(os.path.join(dirname, "train_users_2.csv.zip"))
# Cleaning code from the pt. 1 notebook. 
train["date_account_created"] = pd.to_datetime(train.date_account_created)
train["timestamp_first_active"] = pd.to_datetime(train.timestamp_first_active, format="%Y%m%d%H%M%S")
train["date_first_booking"] = pd.to_datetime(train.date_first_booking)
train.loc[train.age >= 90, "age"] = -1
train.loc[train.age <= 14, "age"] = 0


gender_dummies = pd.get_dummies(train.gender, prefix="gender")
train = pd.concat([train, gender_dummies], axis=1)
train = train.drop(["gender", "gender_-unknown-"], axis=1)


In [None]:
age_intervals = [-1, 0, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
age_dummies = pd.get_dummies(pd.cut(train.age, bins=age_intervals, right=False), prefix="age")
train = pd.concat([train, age_dummies], axis=1)
train = train.drop(["age"], axis=1)

### Actionables
Now that the ages have been binned into intervals and turned into dummies to smooth out performance, I will make dummies out of the other variables as well.

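Gender was converted to dummies because an ordinal encoding of gender does not make sense to me: a male is not twice the magnitude of a female. The -unknown- column was removed because it is perfectly collinear with the other gender dummies (it is 1 exactly when all the others are False/0). For reference, this is the ordinal mapping I decided against: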

```python
gender_map = {
    "-unknown-": -1,
    "FEMALE": 0,
    "MALE": 1,
    "OTHER": 2,
}
train["gender"] = train["gender"].map(gender_map)
```
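
The collinearity argument for dropping the -unknown- dummy can be checked on a toy column. This is only a sketch; `toy_gender` is a made-up series that reuses the category names from the data.

```python
import pandas as pd

# Toy gender column; the values mirror the categories named above.
toy_gender = pd.Series(["MALE", "FEMALE", "-unknown-", "OTHER", "FEMALE"])

dummies = pd.get_dummies(toy_gender, prefix="gender", dtype=int)

# Every row has exactly one active dummy, so the columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Dropping one column removes the redundancy; "-unknown-" rows become all zeros.
reduced = dummies.drop(columns="gender_-unknown-")
print(reduced)
```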

Next, I will look at the years.

In [None]:
train.date_account_created.dt.year.describe()

In [None]:
train.timestamp_first_active.dt.year.describe()

In [None]:
train["dac_year"] = train.date_account_created.dt.year - 2008
train["dac_month"] = train.date_account_created.dt.month
train["dac_day"] = train.date_account_created.dt.day

train[["dac_year", "dac_month", "dac_day"]].head(10)

In [None]:
train["tfa_year"] = train.timestamp_first_active.dt.year - 2008
train["tfa_month"] = train.timestamp_first_active.dt.month
train["tfa_day"] = train.timestamp_first_active.dt.day

train[["tfa_year", "tfa_month", "tfa_day"]].head(10)

In [None]:
# Delay in whole days between first activity and account creation
(train.date_account_created - train.timestamp_first_active).dt.days.value_counts()

### Takeaways
It is interesting that so many rows have a delay of -1 days. That would suggest that there are people who signed up before becoming active. I wonder what their first affiliates are.

I subtracted 2008 from the years since I will be clustering on these values. I would like to normalize them, but I think shrinking the numbers so they sit much closer to 0 is sufficient. I do not want negative values here, and I used 2008 because that was the year Airbnb started.


In [None]:
train[(train.date_account_created - train.timestamp_first_active).dt.days == -1]["affiliate_channel"].value_counts()

In [None]:
train.affiliate_channel.value_counts()

In [None]:
train.shape

### Takeaways
There is barely any difference between the two distributions, so maybe I am calculating the delay incorrectly. I think the cause is that timestamps fall during the day while account-creation dates are recorded at midnight of that day, so a same-day sign-up produces a negative fractional-day difference that gets rounded down to -1.
To balance this out, I will add 1 to the values. I am comfortable doing this because I did not find any values smaller than -1. If there were smaller values, it might suggest that people were tracked before their first time becoming active.
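
The rounding behavior can be reproduced with a single made-up user. This is a sketch with hypothetical timestamps, not rows from the data.

```python
import pandas as pd

# Hypothetical user: first active mid-afternoon, account created the same day.
first_active = pd.Series(pd.to_datetime(["2014-01-01 15:30:00"]))
account_created = pd.Series(pd.to_datetime(["2014-01-01"]))  # date only -> midnight

delta = account_created - first_active   # minus 15.5 hours
raw_days = delta.dt.days                 # floors toward negative infinity
adjusted_days = raw_days + 1             # same-day sign-ups land on 0

print(delta.iloc[0], raw_days.iloc[0], adjusted_days.iloc[0])
```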

I thought about adding seasons, but I am going to skip them because I expect them to be highly correlated with months, and since I will end up using a dimensionality reducer, a season feature would be largely redundant anyway.

In [None]:
# Number of days it took to create the account since becoming first active
train["delay_days"] = (train.date_account_created - train.timestamp_first_active).dt.days + 1

In [None]:
adv_columns = ["affiliate_channel", "affiliate_provider", "first_affiliate_tracked", "first_browser", "first_device_type", "signup_flow", "language", "signup_method", "signup_app"]
adv_dummies = pd.get_dummies(train[adv_columns])

train = pd.concat([train, adv_dummies], axis=1)
train = train.drop(adv_columns, axis=1)

### Takeaways
I think the data is done being processed. I will first run it through K-means and see what results we get. The first run will use 12 clusters since there are 12 possible country destinations. I will then use the elbow method to see whether the optimal number of clusters is the same.

I will explore the data with the resulting clusters, then move on to integrating the sessions data with the training and test data.
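
As a rough illustration of the elbow method, the sketch below fits K-means for several values of k on synthetic data and compares the inertia. The blob centers and sample size are assumptions chosen for the example; the real training data will not separate this cleanly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# The "elbow" is where adding clusters stops reducing inertia by much.
drops = -np.diff(inertias)
for k, drop in zip(range(2, 8), drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```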

In [None]:
print(train.columns, train.shape)

In [None]:
train = train.drop([
    "id", 
    "date_account_created", 
    "timestamp_first_active", 
    "date_first_booking"], 
    axis=1)
train.head(10)

In [None]:
train.country_destination.unique()

In [None]:
# output_map = {
#     'NDF': 0, 
#     'US': 1, 
#     'other': 2, 
#     'FR': 3, 
#     'CA': 4, 
#     'GB': 5, 
#     'ES': 6, 
#     'IT': 7, 
#     'PT': 8, 
#     'NL': 9,
#     'DE': 10, 
#     'AU': 11
# }
# train["ordinal_output"] = train["country_destination"].map(output_map)

output_dummies = pd.get_dummies(train.country_destination, prefix="output")
train = pd.concat([train, output_dummies], axis=1)
train = train.drop("country_destination", axis=1)

In [None]:
from sklearn.cluster import KMeans

model_12 = KMeans(n_clusters=12, random_state=42*42)
model_12 = model_12.fit(train)
predict_12 = model_12.predict(train)

In [None]:
model_12.inertia_

In [None]:
def data_proc(explore):
    age_intervals = [-1, 0, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
    
    explore["date_account_created"] = pd.to_datetime(explore.date_account_created)
    explore["timestamp_first_active"] = pd.to_datetime(explore.timestamp_first_active, format="%Y%m%d%H%M%S")
    explore["date_first_booking"] = pd.to_datetime(explore.date_first_booking)
    explore.loc[explore.age >= 90, "age"] = -1
    explore.loc[explore.age <= 14, "age"] = 0

    explore['age'] = pd.cut(explore.age, bins=age_intervals, right=False)

    explore["dac_year"] = explore.date_account_created.dt.year - 2008
    explore["dac_month"] = explore.date_account_created.dt.month
    explore["dac_day"] = explore.date_account_created.dt.day


    explore["tfa_year"] = explore.timestamp_first_active.dt.year - 2008
    explore["tfa_month"] = explore.timestamp_first_active.dt.month
    explore["tfa_day"] = explore.timestamp_first_active.dt.day

    explore["delay_days"] = (explore.date_account_created - explore.timestamp_first_active).dt.days + 1
    
    return explore


explore = pd.read_csv(os.path.join(dirname, "train_users_2.csv.zip"))
explore = data_proc(explore)
explore = pd.concat([explore, pd.Series(predict_12, name="Cluster")], axis=1)

In [None]:
explore.columns

In [None]:
for i in range(len(explore.columns) - 4):
    plt.figure(i, figsize=(12, 8))
    sns.countplot(x=explore.columns[4 + i], hue='Cluster',data=explore)
    plt.xticks(rotation=45)

### Takeaways
I think it is very interesting that several clusters have almost nothing in them, namely clusters 1, 4, 6, and 11. Why are these so small? Other than that, the remaining clusters seem to repeat the same frequencies across all the variables, which suggests that a lot of the variables are distributed independently of the clusters. This is frustrating because it makes it seem impossible to use the data in train alone to predict the output. It also suggests that the sessions dataset must be used.

There is a difference in the dates based on the clusters. I think that instead of spotting a relationship based on the country destination, which is only one column, the clustering algorithm found patterns based on the dates. 

I will look into that before using the elbow method and adding in the session data.
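
One way to check whether clusters follow the dates rather than the destination is a row-normalized cross-tabulation. The sketch below uses a made-up frame in which the clusters split perfectly by year and not at all by destination; the column names match the notebook, but the values are hypothetical.

```python
import pandas as pd

# Toy frame; cluster labels, years, and destinations are made up.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 1],
    "dac_year": [3, 3, 3, 6, 6, 6],
    "country_destination": ["NDF", "US", "NDF", "NDF", "US", "NDF"],
})

# Row-normalized shares: each row shows how one cluster spreads over a column.
by_year = pd.crosstab(toy["Cluster"], toy["dac_year"], normalize="index")
by_dest = pd.crosstab(toy["Cluster"], toy["country_destination"], normalize="index")
print(by_year, by_dest, sep="\n\n")
```

Rows that look alike across clusters, as in `by_dest` here, indicate that the clustering carries little information about that column.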

In [None]:
for i in range(len(explore.columns) - 4):
    plt.figure(i, figsize=(12, 8))
    sns.countplot(x=explore.columns[4 + i], data=explore[explore.Cluster == 0])
    plt.xticks(rotation=45)

In [None]:
for i in range(len(explore.columns) - 4):
    plt.figure(i, figsize=(12, 8))
    sns.countplot(x=explore.columns[4 + i], data=explore[explore.Cluster == 11])
    plt.xticks(rotation=45)

### Takeaways
I can definitely see a correlation with date for cluster 0. Cluster 11, on the other hand, being an outlier cluster, had more random-seeming values.

In terms of country destination, cluster 0 shows a few things. There is a pattern emerging between NDF and the other variables. I thought it was interesting that the age distribution of cluster 0 mirrors the general population shown in the previous exploration visualizations, which suggests that the listed age is not a good determinant of the output, at least for that cluster. Instead, the cluster showed a lot of first browsers that are not especially popular among users who created an account on their first day.

Another takeaway is that I should convert the delay_days to an interval in the same way I did for the ages. 
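
A small sketch of that binning, using made-up delays and a shortened list of bin edges (both are assumptions for illustration):

```python
import pandas as pd

toy_delay = pd.Series([0, 1, 6, 45, 400])   # made-up delays in days
bins = [0, 1, 2, 7, 30, 365]                # shortened version of the interval idea

# right=False makes each bin closed on the left: [0, 1), [1, 2), ...
binned = pd.cut(toy_delay, bins=bins, right=False)

# Values at or beyond the last edge (here 400) come back as NaN and need their own label.
print(binned)
print(binned.isna().tolist())
```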


In [None]:
def data_proc(explore):
    age_intervals = [-1, 0, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
    delay_days_intervals = [0, 1, 2, 3, 4, 5, 6, 7, 14, 30, 60, 90, 180, 365]
    
    explore["date_account_created"] = pd.to_datetime(explore.date_account_created)
    explore["timestamp_first_active"] = pd.to_datetime(explore.timestamp_first_active, format="%Y%m%d%H%M%S")
    explore["date_first_booking"] = pd.to_datetime(explore.date_first_booking)
    explore.loc[explore.age >= 90, "age"] = -1
    explore.loc[explore.age <= 14, "age"] = 0

    explore['age'] = pd.cut(explore.age, bins=age_intervals, right=False)
    

    explore["dac_year"] = explore.date_account_created.dt.year - 2008
    explore["dac_month"] = explore.date_account_created.dt.month
    explore["dac_day"] = explore.date_account_created.dt.day


    explore["tfa_year"] = explore.timestamp_first_active.dt.year - 2008
    explore["tfa_month"] = explore.timestamp_first_active.dt.month
    explore["tfa_day"] = explore.timestamp_first_active.dt.day

    explore["delay_days"] = (explore.date_account_created - explore.timestamp_first_active).dt.days + 1
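    delay_bins = pd.cut(explore.delay_days, bins=delay_days_intervals, right=False)
    # Delays of 365 days or more fall outside the bins and come back as NaN.
    # Fill them before casting to str; afterwards they would be the literal string "nan".
    explore["delay_days"] = delay_bins.astype(object).fillna(720).astype(str)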
    
    return explore

In [None]:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans(random_state=42*42)
visualizer = KElbowVisualizer(model, k=(1,10))
visualizer.fit(train)

### Takeaways
The transition seems to be around 3 to 5 clusters.

I will repeat the clustering without the destinations.

In [None]:
model = KMeans(random_state=42*42)
visualizer = KElbowVisualizer(model, k=(1,10))
visualizer.fit(train.drop(train.columns[train.columns.str.contains("output")], axis=1))

### Takeaways
After dropping the output dummies, the elbow seems to settle at 3 clusters.
Unfortunately, with only three clusters, I do not think I can easily get clear information about which factors contribute to determining the country destination.



In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
output = pca.fit_transform(train)


In [None]:
plt.scatter(output[:, 0], output[:, 1])

### Takeaways
For visualization, I ran PCA with n_components=2. I will not apply PCA before clustering because the number of columns is much smaller than the number of rows. I have not added the sessions data yet, and the data is not standardized, which may negatively affect the performance of PCA.

Looking at this graph, it also does not seem like clustering would give a good representation of the data. One possibility is a cluster for the densely populated region along the y axis, one for the points further up, and another that captures the outliers. The other possibility is that the densely populated region is split in two and the remaining values are captured as outliers. That would remind me of the previous clusters, where each cluster covered a different range of dates and three clusters caught all the lower-density points.
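
To illustrate why unscaled data can hurt PCA, the sketch below compares PCA on raw and standardized versions of two synthetic features. The data is random and the scales are assumptions picked to exaggerate the effect.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales (e.g. a 0/1 dummy and a day count).
X = np.column_stack([rng.integers(0, 2, size=200),
                     rng.normal(loc=0, scale=100, size=200)])

raw_pca = PCA(n_components=2).fit(X)
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

# Without scaling, the large-scale column dominates the first component.
print(raw_pca.explained_variance_ratio_)
print(scaled_pca.named_steps["pca"].explained_variance_ratio_)
```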

In [None]:
model_3 = KMeans(n_clusters=3, random_state=42*42)
model_3 = model_3.fit(train)
predict_3 = model_3.predict(train)
explore = pd.read_csv(os.path.join(dirname, "train_users_2.csv.zip"))
explore = data_proc(explore)
explore = pd.concat([explore, pd.Series(predict_3, name="Cluster")], axis=1)

In [None]:
plt.scatter(output[:, 0], output[:, 1], c=explore.Cluster, cmap="brg")
plt.show()

In [None]:
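feature_columns = explore.columns[4:]
for j in sorted(explore.Cluster.unique()):
    for column in feature_columns:
        # Open a fresh figure per (cluster, column) pair so plots never share a figure number.
        plt.figure(figsize=(12, 8))
        sns.countplot(x=column, data=explore[explore.Cluster == j])
        plt.title(f"Cluster {j}: {column}")
        plt.xticks(rotation=45)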

### Takeaways
These three clusters show much clearer differences than the 12 clusters did, which is expected. A few differences that I see are in the delay days, gender, and first browsers. But as I came to realize when exploring the previous clustering model, the clusters do not line up with the classes, so the data as it is will not be very useful to look into.

# Sessions Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sessions = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/sessions.csv.zip")

In [None]:
sessions.head()

In [None]:
sessions["id"] = sessions["user_id"]
sessions = sessions.drop("user_id", axis=1)
sessions.head()

In [None]:
for column in sessions.columns:
    print(sessions[column].isnull().value_counts())

In [None]:
for elem in sessions.columns: 
    print(sessions[elem].value_counts(), "\n")

In [None]:
single_event = sessions.id.value_counts()[sessions.id.value_counts() == 1].index

In [None]:
len(single_event)

In [None]:
train = pd.read_csv("/kaggle/input/airbnb-recruiting-new-user-bookings/train_users_2.csv.zip")

In [None]:
train[train.id.isin(single_event)]["country_destination"].value_counts()

In [None]:
train.loc[
    (train.id.isin(single_event)) & 
    (train.country_destination == "US")
]

### Takeaways
No obvious pattern. I am surprised that some people have only one session recorded despite having an account created, a period of inactivity, and then a first booking date. This could mean that the session data is incomplete or trimmed down. This period is too early for group booking to explain it, since that feature was introduced in 2017.

I need to decide which features from sessions to keep. Action and action detail have unbalanced sets of values, so the rare ones should be cut to reduce dimensions. Action detail and action type both have 1030k unknown actions, on top of 1126k missing values.

So I think I should set low-frequency values to 0 and then apply a StandardScaler. Unlike with age, where certain milestones might have a bigger impact, I think the relative values of the event counts are not as important.
I will look for a good cutoff range and then round it to a clean number.
An example of what I will cut is booking_response in action type, since there are only 4 occurrences overall.

For action, there are 359 different recorded actions; a lot of these will simply be trimmed off.

In [None]:
sessions.groupby(["id"])["action"].value_counts()

In [None]:
vc = sessions.action.value_counts()

In [None]:
vc[vc<5000].plot()

In [None]:
vc[vc<500].plot()

In [None]:
vc[vc<200].plot()

### Takeaways
In terms of which values to cut off entirely, a good cutoff seems to be around 1000, where the curvature of the curve seems greatest. Alternatively, I could choose the point right where the curve flattens out, which looks to be around 60.
A log transformation might help even out these counts, but I do not think it would be helpful here since these rarely seen categories are not a frequent part of a customer's journey.

On the other hand, there may be parts of this data that immediately show interest, such as reaching out to an Airbnb host, which would be incredibly useful for building a model of whether a person ends up as NDF or selects a location. It may not help with determining which location, but predicting the binary case of booking or not is possible.

In [None]:
vc[vc<200].head(30)

### Takeaways
From a manual look, I see a user-interest action such as requesting photography, and on the host side I see a lot of features showing that the owner is active. Since I will have to repeat this step with action type and action detail, I will remove these values and go with a lighter model. The total number of columns would still be under 10000 regardless, a small fraction of the number of rows, but since these long-tailed actions seem more host-centric than user-centric, I will move forward with removing them.

In [None]:
sessions.groupby(["id"])["action_detail"].value_counts()

In [None]:
advc = sessions.action_detail.value_counts()

In [None]:
advc[advc<10000].plot()

In [None]:
advc[advc<1000].plot()

In [None]:
advc[advc<200].plot()

### Takeaways
Similarly, the cutoff seems to be around 60. To err on the side of caution, I will set the cutoff to 70 for both. Then, as mentioned in previous takeaways, I will use the count of each category per id as a variable.
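
The trimming step can be sketched on a toy series. The action names are made up, and a threshold of 2 stands in for the cutoff of 70.

```python
import pandas as pd

# Made-up action log; the threshold of 2 stands in for the cutoff of 70.
toy_actions = pd.Series(["search", "search", "show", "show", "show", "rare_a", "rare_b"])
threshold = 2

counts = toy_actions.value_counts()
rare = counts[counts < threshold].index

# Blank out the rare categories so they are ignored by later value_counts calls.
trimmed = toy_actions.mask(toy_actions.isin(rare))
print(trimmed.value_counts())
```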

In [None]:
vc[vc < 70].index

In [None]:
advc[advc < 70].index

In [None]:
sessions.loc[sessions.action_type == "booking_response", "action_type"] = None
sessions.loc[sessions.action_type == "-unknown-", "action_type"] = None

sessions.loc[sessions.action.isin(vc[vc < 70].index), "action"] = None
sessions.loc[sessions.action == "-unknown-", "action"] = None

sessions.loc[sessions.action_detail.isin(advc[advc < 70].index), "action_detail"] = None
sessions.loc[sessions.action_detail == "-unknown-", "action_detail"] = None

sessions.loc[sessions.device_type == "-unknown-", "device_type"] = None

In [None]:
for elem in sessions.columns: 
    print(sessions[elem].value_counts(), "\n")

### Takeaways
Those adjustments have been made. The final two steps are to process the seconds and then convert everything into a dataframe that can be joined onto the original training data. Since the training data and the test data are in the same format, it will also be joined onto the test data to help predictions.

To process seconds, since it is ordinal and continuous, applying a log transformation and then binning it into ranges should retain the counts in a manageable format. To extract more information, I will create two variables per user: the average log time spent per event and the standard deviation of the log time.
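
A minimal sketch of the capped log transform and the two per-user features, on made-up rows. Here `np.log1p` is used as an equivalent way of writing ln(1 + x); the column names follow the sessions data, while the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sessions rows; column names follow the sessions data.
toy = pd.DataFrame({"id": ["a", "a", "b", "b"],
                    "secs_elapsed": [0.0, 99.0, 500000.0, 10.0]})

cap = 172800  # 48 hours in seconds
# clip caps extreme gaps; log1p computes ln(1 + x), so zero seconds maps to 0.
toy["log_seconds"] = np.log1p(toy["secs_elapsed"].clip(upper=cap))

per_user = toy.groupby("id")["log_seconds"].agg(["mean", "std"])
print(per_user)
```

Note that a user with a single event gets a missing standard deviation, because the sample standard deviation needs at least two values.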

In [None]:
# 1 is added because ln(0) is undefined.
# 172800 seconds is equivalent to 48 hours.
sessions.loc[sessions.secs_elapsed > 172800, "secs_elapsed"] = 172800
sessions["log_seconds"] = np.log(sessions.secs_elapsed + 1)

In [None]:
lsec_intervals = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
sessions["seconds_range"] = pd.cut(sessions.log_seconds, bins=lsec_intervals, right=False)

### Takeaways
Great, now I need to run a groupby and build a dataframe with all the previously mentioned information.
Per id, I will get the counts of each category and calculate two features, the mean and standard deviation, from log_seconds.
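
The per-id counting can be sketched on a toy frame: group by id, count each category, and pivot the counts into one column per category. The rows are made up, and the `device_type_` prefix is an illustrative naming choice.

```python
import pandas as pd

# Made-up sessions rows.
toy = pd.DataFrame({"id": ["a", "a", "a", "b"],
                    "device_type": ["Mac", "Mac", "iPhone", "Mac"]})

# Long format: one row per (id, device_type) pair with its count.
long_counts = toy.groupby("id")["device_type"].value_counts()

# Wide format: one row per id, one column per category, zeros where absent.
wide_counts = long_counts.unstack(fill_value=0).add_prefix("device_type_")
print(wide_counts)
```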

In [None]:
gb = sessions.groupby(["id"])
a_temp = gb["action"].value_counts()
at_temp = gb["action_type"].value_counts()
ad_temp = gb["action_detail"].value_counts()
dt_temp = gb["device_type"].value_counts()
sr_temp = gb["seconds_range"].value_counts()

In [None]:
action = a_temp.unstack().fillna(0).astype(int)
action_type = at_temp.unstack().fillna(0).astype(int)
action_detail = ad_temp.unstack().fillna(0).astype(int)
device_type = dt_temp.unstack().fillna(0).astype(int)
seconds_range = sr_temp.unstack().fillna(0).astype(int)
action_type.head()

In [None]:
lmean = gb["log_seconds"].mean()
lmean.head()

In [None]:
lstd = gb["log_seconds"].std()
lstd.head()

In [None]:
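# Start from one row per user id, then join each block of features onto it.
sess_join = pd.DataFrame(index=sessions.id.value_counts().index)
join_blocks = {
    "action": action, "action_type": action_type, "action_detail": action_detail,
    "device_type": device_type, "seconds_range": seconds_range,
}
# Prefix every block with its source name so identically named columns stay distinct.
for name, block in join_blocks.items():
    sess_join = sess_join.join(block.add_prefix(f"{name}_"))
# lmean and lstd are both named log_seconds, so give each a unique name first.
sess_join = sess_join.join(lmean.rename("log_seconds_mean"))
sess_join = sess_join.join(lstd.rename("log_seconds_std"))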

In [None]:
sess_join.head()

In [None]:
seconds_range.add_prefix("seconds_range_")

In [None]:
sess_join.columns

In [None]:
adv_dummies

In [None]:
pd.get_dummies(train.signup_flow.astype(str))

In [None]:
train[train == "-unknown-"] = np.nan

In [None]:
train.gender.value_counts()

### Takeaways
All the data is prepared. I will carry the data processing methods from this notebook into another notebook, where I will focus on running prediction methods.

I will use logistic regression, random forests, and XGBoost.
The first two are fairly explainable, and I am using XGBoost because it was all the rage on Kaggle several years ago.