# Explore Session Data

## Variable Description 
- user_id: joins to the 'id' column of the users table
- action_type: depth 1 (coarsest) action description
- action_detail: depth 2 action description
- action: depth 3 (finest) action description
- device_type: device used for the action
- secs_elapsed: seconds spent on the action
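
The join described for `user_id` could look roughly like the sketch below. The two frames are toy stand-ins, and the `country` column on the users side is a made-up placeholder; only the key names `user_id` and `id` come from the description above.

```python
import pandas as pd

# Toy stand-ins for the two tables; values and the 'country' column are made up.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", None],
    "action": ["lookup", "search_results", "show", "show"],
    "secs_elapsed": [319.0, 67753.0, 120.0, 45.0],
})
users = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "country": ["US", "FR", "US"],
})

# Left join keeps every session row; rows with no matching user get NaN.
merged = sessions.merge(users, how="left", left_on="user_id", right_on="id")
print(merged[["user_id", "action", "country"]])
```

A left join is used here so that session rows with a missing `user_id` are kept rather than silently dropped; an inner join would be the choice if only matched rows matter.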

## Summary of the exploratory analysis

- About 10.6M rows of user session data for 2014.
- Jan-Jun: train users, Jul-Sep: test users. 
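- Each row is a single action by a user, described at 3 hierarchical levels (action type - action detail - action),
  together with the seconds spent.
- 11 action types, 156 action details, and 360 actions. These counts come from `unique()`, which counts NaN as a value, so the non-null counts are 10, 155, and 359.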
- 65K out of 210K train users have session data. 
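- Almost all (99.3%) of the 62K test users have session data.
- About 10.7% of rows are missing action_type and action_detail; user_id, action, and secs_elapsed are each missing in roughly 1% of rows or fewer.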
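
The coverage figures above (65K of 210K train users is roughly 31%) can be computed with a membership check. This is a minimal sketch on toy data; the frame names `users` and `sessions` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy data: four users, two of whom appear in the session log.
users = pd.DataFrame({"id": ["u1", "u2", "u3", "u4"]})
sessions = pd.DataFrame({"user_id": ["u1", "u1", "u3", None]})

# Drop missing ids first so NaN never counts as a match.
session_ids = sessions["user_id"].dropna().unique()
has_session = users["id"].isin(session_ids)

coverage = has_session.mean()
print(f"{has_session.sum()} of {len(users)} users have session data ({coverage:.1%})")
```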

## TODO

- More manual data inspections, especially for session data. 
- Check which actions are intuitively more likely to lead to a booking.
- Can we infer whether a user is a 'host', a 'guest', or 'both'?
    - Possible approach: clustering
- What would be a good metric to measure 'user engagement'?
  Possible starting points are:
  - The total number of actions
  - The total number of seconds
  - The average seconds per action
  - Whether the user performed some 'key' actions, such as 'request booking' or 'manage posting', rather than only 'search' or 'show'.
- Use and merge data about country statistics and user statistics.  
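
The engagement starting points listed above could be computed per user with a single `groupby`. This is a sketch on toy data, not a settled metric: the `KEY_ACTIONS` set and its action names are hypothetical and would need to be checked against the real values in `action`.

```python
import pandas as pd

# Toy session log; values are made up for illustration.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["search", "show", "request_booking", "search", "show"],
    "secs_elapsed": [10.0, 30.0, 20.0, 5.0, None],
})

# Hypothetical list of 'key' actions; the real names must come from the data.
KEY_ACTIONS = {"request_booking", "manage_posting"}
sessions["is_key"] = sessions["action"].isin(KEY_ACTIONS)

engagement = sessions.groupby("user_id").agg(
    n_actions=("action", "size"),          # total number of actions
    total_secs=("secs_elapsed", "sum"),    # total seconds (NaN skipped)
    mean_secs=("secs_elapsed", "mean"),    # average seconds per timed action
    has_key_action=("is_key", "any"),      # performed at least one key action
)
print(engagement)
```

Note that `groupby` drops rows whose `user_id` is missing by default, and that `mean_secs` averages only over actions with a recorded `secs_elapsed`.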

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df_s = pd.read_csv('sessions.csv')

In [3]:
df_s.shape

(10567737, 6)

In [4]:
df_s.head()

| | user_id | action | action_type | action_detail | device_type | secs_elapsed |
|---|---|---|---|---|---|---|
| 0 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 319.0 |
| 1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |
| 2 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 301.0 |
| 3 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 22141.0 |
| 4 | d1mm9tcy42 | lookup | NaN | NaN | Windows Desktop | 435.0 |


In [5]:
df_s.isnull().sum()

user_id            34496
action             79626
action_type      1126204
action_detail    1126204
device_type            0
secs_elapsed      136031
dtype: int64

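- About 10.7% of action_type and action_detail values are missing (1,126,204 of 10,567,737 rows each).
- user_id (0.3%), action (0.8%), and secs_elapsed (1.3%) are each missing in roughly 1% of rows or fewer; device_type has no missing values.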
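
The percentages can be read off directly instead of dividing counts by hand, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a toy frame (values made up):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values.
df = pd.DataFrame({
    "user_id": ["u1", "u1", None, "u2"],
    "action_type": [np.nan, "click", np.nan, "view"],
    "device_type": ["Mac Desktop"] * 4,
})

# isnull() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(1))
```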

In [6]:
len(df_s.action_type.unique())

11
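
Because `action_type` contains missing values, `len(unique())` counts NaN as one of the values. The toy series below (values chosen only for illustration) shows how this differs from `nunique()`, which excludes NaN by default:

```python
import numpy as np
import pandas as pd

s = pd.Series(["view", "click", np.nan, "view", "-unknown-"])

n_with_nan = len(s.unique())            # NaN is counted as a value
n_without_nan = s.nunique()             # NaN is excluded by default
n_explicit = s.nunique(dropna=False)    # NaN is included again

print(n_with_nan, n_without_nan, n_explicit)
```

The same applies to the counts for `action_detail` and `action`, both of which also have missing values.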

In [7]:
df_s.action_type.value_counts()[:10]

view                3560902
data                2103770
click               1996183
-unknown-           1031170
submit               623357
message_post          87103
partner_callback      19132
booking_request       18773
modify                 1139
booking_response          4
Name: action_type, dtype: int64

In [8]:
len(df_s.action_detail.unique())

156

In [9]:
df_s.action_detail.value_counts()[:10]

view_search_results            1776885
p3                             1376550
-unknown-                      1031141
wishlist_content_update         706824
user_profile                    656839
change_trip_characteristics     487744
similar_listings                364624
user_social_connections         336799
update_listing                  269779
listing_reviews                 269021
Name: action_detail, dtype: int64

In [10]:
len(df_s.action.unique())

360

In [11]:
df_s.action.value_counts()[:10]

show                     2768278
index                     843699
search_results            725226
personalize               706824
search                    536057
ajax_refresh_subtotal     487744
update                    365130
similar_listings          364624
social_connections        339000
reviews                   320591
Name: action, dtype: int64
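
As part of further manual inspection, one possible check is whether the three levels really nest, i.e. whether each `action` maps to a single `action_type`. This is a sketch on toy data; the pairings of values below are made up and say nothing about the real data.

```python
import pandas as pd

# Toy rows; the action / action_type pairings are invented for illustration.
df = pd.DataFrame({
    "action": ["search_results", "search_results", "show", "show", "lookup"],
    "action_type": ["click", "click", "view", "data", None],
    "action_detail": ["view_search_results", "view_search_results", "p3", "p3", None],
})

# Number of distinct non-null action_type values seen for each action.
types_per_action = df.groupby("action")["action_type"].nunique()

# Actions that appear under more than one action_type break a strict hierarchy.
ambiguous = types_per_action[types_per_action > 1]
print(ambiguous)
```

Actions with a count of 0 have only missing `action_type` values, which is a separate case worth listing on its own.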