# Extreme Travelers

Early birds, night owls, and tireless/recurring itinerants: 
An exploratory analysis of extreme transit behaviors in Beijing, China

https://www.sciencedirect.com/science/article/pii/S0197397516301539

In [1]:
import os
os.chdir("/home/tales/dev/master/mdc_analysis/")
print("working dir", os.getcwd())

working dir /home/tales/dev/master/mdc_analysis


In [2]:
import pandas as pd

from src.dao import csv_dao
from src.dao import objects_dao
from src.similarity.extreme_travelers import early_bird, nigh_owl, tireless_intinerant
from src.similarity.extreme_travelers import sequence_report

## Loading User Data

In [3]:
users_srg = objects_dao.load_all_stop_region_group_object()

Loading user_id: 6189 - 1 out of 163
Loading user_id: 5936 - 2 out of 163
Loading user_id: 6087 - 3 out of 163
Loading user_id: 5973 - 4 out of 163
Loading user_id: 6085 - 5 out of 163
Loading user_id: 6074 - 6 out of 163
Loading user_id: 6012 - 7 out of 163
Loading user_id: 5982 - 8 out of 163
Loading user_id: 5948 - 9 out of 163
Loading user_id: 5974 - 10 out of 163
Loading user_id: 6090 - 11 out of 163
Loading user_id: 6199 - 12 out of 163
Loading user_id: 6068 - 13 out of 163
Loading user_id: 6024 - 14 out of 163
Loading user_id: 5976 - 15 out of 163
Loading user_id: 6094 - 16 out of 163
Loading user_id: 5941 - 17 out of 163
Loading user_id: 5995 - 18 out of 163
Loading user_id: 5962 - 19 out of 163
Loading user_id: 6093 - 20 out of 163
Loading user_id: 6033 - 21 out of 163
Loading user_id: 6079 - 22 out of 163
Loading user_id: 6038 - 23 out of 163
Loading user_id: 6175 - 24 out of 163
Loading user_id: 6042 - 25 out of 163
Loading user_id: 5924 - 26 out of 163
Loading user_id: 6083

<table align="left">
  <tr>
    <th>Label</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Early birds (EBs)</td>
    <td> First trip &lt; 6AM, more than two days in a week (60% of weekdays)</td>
  </tr>
  <tr>
    <td>Night owls (NOs)</td>
    <td> Last trip (boarding time) &gt; 10PM, more than two days in a week (60% of weekdays)</td>
  </tr>
  <tr>
    <td>Tireless itinerants (TIs)</td>
    <td> More than one and a half hours for one-way commuting (from the home location to job location) more than two days in a week</td>
  </tr>
  <tr>
    <td>Recurring itinerants (RIs)</td>
    <td> More than 30 trips in weekdays of a week (more than 6 trips per day)</td>
  </tr>
  <tr>
    <td>Average Beijingers (ABs)</td>
    <td> The “average” cardholders in the MDC Dataset</td>
  </tr>
</table>
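
The thresholds in the table can be turned into a per-user rate. The following is a minimal sketch, not the project's `early_bird` implementation (which lives in `src.similarity.extreme_travelers` and is not shown here). The function name `early_bird_rate`, its input format (a plain list of trip start datetimes), and the final 60% check are assumptions made for illustration.

```python
from datetime import datetime


def early_bird_rate(trip_starts, leaving_hour=6):
    """Fraction of observed weekdays whose first trip starts before leaving_hour.

    Hypothetical helper: trip_starts is assumed to be a list of datetimes,
    one per trip. Returns 0.0 when no weekday trips are observed.
    """
    first_trip = {}
    for ts in trip_starts:
        if ts.weekday() >= 5:  # skip Saturday (5) and Sunday (6)
            continue
        day = ts.date()
        # Keep only the earliest trip of each weekday.
        if day not in first_trip or ts < first_trip[day]:
            first_trip[day] = ts
    if not first_trip:
        return 0.0
    early_days = sum(1 for ts in first_trip.values() if ts.hour < leaving_hour)
    return early_days / len(first_trip)


# Toy data for one week (2009-09-14 is a Monday).
trips = [
    datetime(2009, 9, 14, 5, 30),  # Monday, early
    datetime(2009, 9, 14, 18, 0),  # Monday, later trip (not the first of the day)
    datetime(2009, 9, 15, 5, 45),  # Tuesday, early
    datetime(2009, 9, 16, 8, 10),  # Wednesday, not early
    datetime(2009, 9, 17, 5, 55),  # Thursday, early
    datetime(2009, 9, 18, 9, 0),   # Friday, not early
    datetime(2009, 9, 19, 4, 0),   # Saturday, excluded from the rate
]
rate = early_bird_rate(trips)
is_early_bird = rate >= 0.6  # three of five weekdays, the table's 60% threshold
print(rate, is_early_bird)
```

The same pattern would apply to night owls by keeping the latest trip of each weekday and comparing it against the 10PM threshold.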

## Extreme Travelers
Factor Analysis

In [4]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.palettes import Category20
output_notebook()

In [5]:
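def plot_result_multi_line(xs_list, ys_list, x_label, y_label, color_list=None, legend_list=None, title=""):
    # Draw several series on one figure. The list defaults are None rather than
    # mutable [] values; an omitted list falls back to plot_result's defaults.
    p = None

    for i in range(len(xs_list)):
        p = plot_result(xs_list[i],
                        ys_list[i],
                        x_label,
                        y_label,
                        color=color_list[i] if color_list else "darkblue",
                        legend=legend_list[i] if legend_list else None,
                        title=title,
                        p=p)

    return p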

def plot_result(xs, ys,  x_label, y_label, color="darkblue", legend=None, title="", p=None):
    xs = [float(x) for x in xs]
    ys = [float(y) for y in ys]

    if not p:
        p = figure(plot_width=500, plot_height=300, title=title, x_axis_label=x_label, y_axis_label=y_label)
    
    p.line(xs, ys, color=color, alpha=0.8, line_width=2)
    p.circle(xs, ys, color=color, fill_alpha=1, size=4, legend=legend)
#     p.legend.location = "bottom_right"

    return p

## Early Birds

In [6]:
try:
    eb_rates = pd.read_csv("notebooks/outputs/eb_rates.csv", index_col=0).to_dict()
    
except FileNotFoundError:
    eb_rates = {}

    for leaving_time in [5,6,7,8,9,10]:

        eb_rate = {}

        for user_id in users_srg.keys():
            try:
                eb_rate[user_id] = early_bird(users_srg[user_id], leaving_time=leaving_time)
            except ZeroDivisionError:
                eb_rate[user_id] = 0

        eb_rates[leaving_time] = eb_rate

    pd.DataFrame(eb_rates).to_csv("notebooks/outputs/eb_rates.csv")

eb_data = pd.DataFrame(eb_rates).median()
eb_data.index = eb_data.index.astype(int)
eb_data = eb_data.sort_index()
eb_data

5     0.103933
6     0.143258
7     0.184388
8     0.214421
9     0.250000
10    0.283505
11    0.304348
12    0.326531
13    0.352459
14    0.384615
dtype: float64
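
The `astype(int)` step above matters because the cached path and the computed path produce different key types: integer thresholds written with `to_csv` come back from `read_csv` as string column labels. The sketch below reproduces this with toy data and an in-memory buffer; the user ids and rate values are made up for illustration.

```python
import io

import pandas as pd

# Toy rates keyed by integer threshold, as in the computed (non-cached) path.
rates = {5: {"u1": 0.1, "u2": 0.3}, 10: {"u1": 0.2, "u2": 0.4}}

# Write to an in-memory buffer instead of notebooks/outputs/*.csv.
buffer = io.StringIO()
pd.DataFrame(rates).to_csv(buffer)
buffer.seek(0)

# Same call pattern as the notebook's cache read.
cached = pd.read_csv(buffer, index_col=0).to_dict()

print(list(rates.keys()))   # integer keys
print(list(cached.keys()))  # string keys after the round trip

# Without the cast, sorting would be lexicographic ("10" before "5").
medians = pd.DataFrame(cached).median()
medians.index = medians.index.astype(int)
medians = medians.sort_index()
print(medians.index.tolist())
```

This is also why a column lookup such as `["10"]` only works on data read back from the cache.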

In [7]:
p = plot_result(xs=eb_data.index.tolist(), 
                ys=eb_data.tolist(),
                x_label="leaving_time (h)",
                y_label="Rate",
                color=Category20[6][0],
                title="Frequency of Early-Birding for users (median)")
                
show(p)

## Night Owls

In [8]:
try:
    no_rates = pd.read_csv("notebooks/outputs/no_rates.csv", index_col=0).to_dict()
    
except FileNotFoundError:
    no_rates = {}

    for boarding_time in [11,12,13,14,15,16,17,18]:
        no_rate = {}

        for user_id in users_srg.keys():
            try:
                no_rate[user_id] = nigh_owl(users_srg[user_id], boarding_time=boarding_time)
            except ZeroDivisionError:
                no_rate[user_id] = 0

        no_rates[boarding_time] = no_rate
        
    pd.DataFrame(no_rates).to_csv("notebooks/outputs/no_rates.csv")
    
no_data = pd.DataFrame(no_rates).median().sort_index()
no_data.index = no_data.index.astype(int)
no_data = no_data.sort_index()
no_data

10    0.017391
11    0.016393
12    0.015480
13    0.014925
14    0.012658
15    0.009950
16    0.007843
17    0.004329
18    0.002608
dtype: float64

In [9]:
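# The 16 users with the highest night-owl rate at threshold 10.
# The column label is the string "10" because no_rates was read back from the CSV cache.
pd.DataFrame(no_rates)["10"].sort_values(ascending=False).head(16).index.tolist()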

[6103,
 6190,
 6074,
 6102,
 6078,
 5951,
 6183,
 6100,
 6077,
 6056,
 6181,
 6182,
 6198,
 6172,
 6062,
 5987]

In [10]:
p = plot_result(xs=no_data.index.tolist(), 
                ys=no_data.tolist(),
                x_label="boarding_time (h)",
                y_label="Rate",
                color=Category20[6][1],
                title="Frequency of Night-Owling for users (median)")
                
show(p)

## Tireless Itinerants

In [11]:
try:
    ti_rates = pd.read_csv("notebooks/outputs/ti_rates.csv", index_col=0).to_dict()

except FileNotFoundError:

    ti_rates = {}

    for commuting_time_m in [10, 30, 50, 70, 90, 110]:
        ti_rate = {}

        for user_id in users_srg.keys():
            try:
                ti_rate[user_id] = tireless_intinerant(users_srg[user_id], commuting_time_m=commuting_time_m)
            except ZeroDivisionError:
                ti_rate[user_id] = 0

        ti_rates[commuting_time_m] = ti_rate
        
    pd.DataFrame(ti_rates).to_csv("notebooks/outputs/ti_rates.csv")
    
ti_data = pd.DataFrame(ti_rates).median().sort_index()
ti_data.index = ti_data.index.astype(int)
ti_data = ti_data.sort_index()
ti_data

5      0.0
10     0.0
20     0.0
30     0.0
50     0.0
70     0.0
90     0.0
110    0.0
dtype: float64

In [12]:
p = plot_result(xs=ti_data.index.tolist(), 
                ys=ti_data.tolist(),
                x_label="commuting_time (min)",
                y_label="Rate",
                color=Category20[6][2],
                title="Frequency of Tireless Itineranting for users (median)")
                
show(p)

In [13]:
q50 = pd.DataFrame(ti_rates).quantile(0.5)
q60 = pd.DataFrame(ti_rates).quantile(0.6)
q70 = pd.DataFrame(ti_rates).quantile(0.7)
q80 = pd.DataFrame(ti_rates).quantile(0.8)
q90 = pd.DataFrame(ti_rates).quantile(0.9)

qs = [q50, q60, q70, q80, q90]

fixed_qs = []
for q in qs:
    q.index = q.index.astype(int)
    q = q.sort_index()
    fixed_qs.append(q)

colors = ["#FF0000", "#BF0000", "#800000", "#400000", "#000000"]
colors.reverse()

In [14]:
p = plot_result_multi_line(xs_list=[qn.index.tolist() for qn in fixed_qs], 
                           ys_list=[qn.tolist() for qn in fixed_qs],  
                           x_label="commuting_time (min)",
                           y_label="Rate",
                           color_list=colors, 
                           legend_list=["q50", "q60", "q70", "q80", "q90"], 
                           title="Frequency of Tireless Itineranting for users (quantiles)")

show(p)

It is possible that people leave their mobiles at home more often than I expected.
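
A median of zero does not mean that nobody has long commutes: it only means that at least half of the users have a rate of zero. Here is a small sketch with made-up rates, using the standard library's `statistics` module as a stand-in for `DataFrame.quantile`, showing how the upper quantiles can still expose the tail.

```python
import statistics

# Hypothetical per-user rates: most users never show a long commute.
rates = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05, 0.10, 0.20, 0.40]

median = statistics.median(rates)

# Nine cut points that split the data into deciles. method="inclusive"
# interpolates linearly between the observed minimum and maximum.
deciles = statistics.quantiles(rates, n=10, method="inclusive")
q80, q90 = deciles[7], deciles[8]

print(median, q80, q90)
```

With data shaped like this, the median line stays flat at zero while the q80 and q90 lines carry the signal.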

In [15]:
import random
# Pick a random user for ad-hoc inspection (unused below: the report is pinned to user "5928").
random_user = random.choice(list(users_srg.keys()))

print("Stop Region Sequence Report")

colnames = ["tags", "stay_time_h", "start_weekday", "start_date", "start_time", "end_date", "end_time"] 
sequence_report(users_srg["5928"])[colnames]

Stop Region Sequence Report


| sr | tags | stay_time_h | start_weekday | start_date | start_time | end_date | end_time |
|---|---|---|---|---|---|---|---|
| 5928_1 | [store] | 0.110833 | Tuesday | 2009-09-15 | 11:39:39 | 2009-09-15 | 11:46:18 |
| 5928_2 | [store] | 4.307222 | Tuesday | 2009-09-15 | 11:51:01 | 2009-09-15 | 16:09:27 |
| 5928_3 | [store] | 0.102500 | Wednesday | 2009-09-16 | 08:43:43 | 2009-09-16 | 08:49:52 |
| agg_5928_4 | [WORK] | 27.475278 | Wednesday | 2009-09-16 | 08:57:50 | 2009-09-17 | 12:26:21 |
| agg_5928_8 | [store] | 12.058333 | Thursday | 2009-09-17 | 15:47:17 | 2009-09-18 | 03:50:47 |
| agg_5928_10 | [HOME] | 3.189167 | Friday | 2009-09-18 | 15:58:29 | 2009-09-18 | 19:09:50 |
| agg_5928_13 | [museum] | 0.619167 | Saturday | 2009-09-19 | 06:20:21 | 2009-09-19 | 06:57:30 |
| 5928_15 | [supermarket, grocery_or_supermarket, food, st...] | 0.625556 | Saturday | 2009-09-19 | 10:22:17 | 2009-09-19 | 10:59:49 |
| 5928_16 | [post_office, finance] | 0.120000 | Saturday | 2009-09-19 | 12:25:18 | 2009-09-19 | 12:32:30 |
| 5928_17 | [clothing_store, store] | 0.142222 | Saturday | 2009-09-19 | 12:38:39 | 2009-09-19 | 12:47:11 |


In [16]:
sequence_report(users_srg["5928"]).loc["5928_1"]["sr_start_time"].item()

1253014779.0

In [17]:
sequence_report(users_srg["5928"]).loc["5928_16"]["sr_end_time"].item()

1253363550.0

In [18]:
1253014779 >= 1254122852, 1280517172 <= 1253363550

(False, False)
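
The `sr_start_time` and `sr_end_time` values above are Unix epoch seconds. As a quick sanity check, the sketch below converts them with the standard library. Interpreting them as UTC is an assumption, and `epoch_to_utc` is a helper name introduced here, but under that assumption the results line up with the `start_time` of `5928_1` and the `end_time` of `5928_16` in the report.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


start = epoch_to_utc(1253014779.0)  # sr_start_time of 5928_1
end = epoch_to_utc(1253363550.0)    # sr_end_time of 5928_16

fmt = "%Y-%m-%d %H:%M:%S"
print(start.strftime(fmt))  # 2009-09-15 11:39:39
print(end.strftime(fmt))    # 2009-09-19 12:32:30
```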