## Mobility analysis
This notebook compares the mobility of users who live in "treatment" zones (ZATs near the new cable car) with that of users who live in "control" zones (similar ZATs without a new cable car). It has two steps:

1) Compute the distance between home and work locations
2) Compute visits (minimum visit durations of 20, 30, and 60 minutes)

In [None]:
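from dotenv import load_dotenv
load_dotenv()

# Project-local helper modules (plot_user_on_map, geodesic_distance, ...)
from plotting import * 
from preprocess import *
from analysis import *

# Imported explicitly so the cells below do not rely on the star imports above
import glob
import os
import dask.dataframe as dd
import geopandas as gpd
import pandas as pd 
import pyarrow.dataset as ds
import skmob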

# Access environment variables and define other necessary variables
data_dir = os.getenv('WORKING_DIR')
meta_dir = f'{data_dir}metadata/'
pq_dir = f'{data_dir}data/parquet/in_study_area/pass_qc/in_zats/home_in_zats/'
hw_dir = f'{data_dir}data/parquet/in_study_area/pass_qc/home_work_locs/'
shp_name_zat = 'zat_stratum'
shapefile_zat = f'{meta_dir}income/{shp_name_zat}.shp'

### Compute distance between home and work locations for users
Note that three users have two home locations reported (`5f0bf98c-00de-4e2d-a8da-8d71d0da8031`, `64e27ab9-8c1c-4548-a3e8-26870ebbc5c8`, `41fd82d6-a492-4864-8068-2134fb176723`), and some users have two work locations. For these users we simply take the first location reported. As a quick check that these users have multiple locations only because they fall on ZAT boundaries, we plot the two home locations for the three users. (The plot is consistent with this explanation.)

In [None]:
home_locs_file = f'{hw_dir}home/selected_txt_control_home_locs_2200_0600_w_zats_for_users_pass_qc.csv'
work_locs_file = f'{hw_dir}work/all_work_locs_0800_1800_w_zats_for_users_pass_qc.csv'

home_df = pd.read_csv(home_locs_file, usecols=['uid', 'lat_home', 'lng_home', 'Area', 'MUNCod', 'NOMMun', 'ZAT_home', 'UTAM_home', 'stratum'])
work_df = pd.read_csv(work_locs_file, usecols=['uid', 'lat_work', 'lng_work', 'Area', 'MUNCod', 'NOMMun', 'ZAT_work', 'UTAM_work', 'stratum'])

uids_mult = ['5f0bf98c-00de-4e2d-a8da-8d71d0da8031', '64e27ab9-8c1c-4548-a3e8-26870ebbc5c8', '41fd82d6-a492-4864-8068-2134fb176723']
home_uids_mult = home_df[home_df['uid'].isin(uids_mult)]
print(home_uids_mult)

#map_obj, user_data = plot_user_on_map(shapefile_path=shapefile_zat, df=home_uids_mult, lat_col='lat_home', lng_col='lng_home', user_id=uids_mult[2])
#map_obj.save(f"{data_dir}figures/{uids_mult[2]}_home_locs_dups_example_zat_boundary.html")
#map_obj

In [None]:
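# Drop the duplicates (keeping the first home and the first work location per user) and merge the frames.
# merge() defaults to an inner join, so only users with both a home and a work location are kept.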
home_df = home_df.drop_duplicates(subset='uid')
work_df = work_df.drop_duplicates(subset='uid')

home_work_df = home_df.merge(work_df, on='uid', suffixes=('_home','_work'))
print(len(home_df), len(work_df), len(home_work_df))
home_work_df.head()

Calculate the distance between the home and work location for each user

In [None]:
home_work_df['distance_km'] = home_work_df.apply(geodesic_distance, axis=1)
home_work_df.head()
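`geodesic_distance` comes from the project's `analysis` module and is not shown in this notebook. As a rough illustration of what a row-wise home-to-work distance can look like, here is a minimal sketch that uses the haversine (great-circle) formula on a sphere instead of a true ellipsoidal geodesic, so its results can differ slightly from the project function. The names `haversine_km` and `home_work_distance_km`, the Earth radius constant, and the example coordinates are assumptions made for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def home_work_distance_km(row):
    """Row-wise helper (hypothetical); accepts a dict or a pandas Series with these keys."""
    return haversine_km(row['lat_home'], row['lng_home'], row['lat_work'], row['lng_work'])


# Hypothetical coordinates, for illustration only
example_row = {'lat_home': 4.60, 'lng_home': -74.08, 'lat_work': 4.65, 'lng_work': -74.05}
print(round(home_work_distance_km(example_row), 2))
```

Because the helper only indexes the row by column name, it could be passed to `home_work_df.apply(..., axis=1)` in the same way as `geodesic_distance`.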

In [None]:
home_work_w_dist_file = f'{hw_dir}home_work_locs_for_users_living_in_selected_txt_control_zats_w_distance_btwn_home_work.csv'
home_work_df.to_csv(home_work_w_dist_file, index=False)

### Load the pings data for the users living in the selected ZATs

Load (1) the shapefile with the ZAT and stratum information and (2) the user pings.

In [None]:
regions_gdf_zat = gpd.read_file(shapefile_zat)
regions_gdf_zat.plot(column='stratum')

In [None]:
# Read all parquet files of pings into a single DataFrame with a fresh 0..n-1 index
ping_files = glob.glob(pq_dir + '*.parquet')
ping_df = ds.dataset(ping_files, format="parquet").to_table().to_pandas()
ping_df = ping_df.reset_index(drop=True)
ping_df.head()

In [None]:
len(ping_df)

### Compute visits and other mobility variables
The pings data can also be "compressed" and put through other preprocessing and filtering steps before the visits are computed.

### Load and preprocess pings data
Note that we can print the parameters used by the preprocessing functions; for example, `print(ftdf.parameters)` shows the filtering parameters. Filtering takes ~28 minutes on the full dataset of pings from users in the selected ZATs.

In [None]:
# Convert the DataFrame into a TrajDataFrame and filter out outlier points
from skmob.preprocessing import detection, filtering  # detection is used in the stop cells below
tdf = skmob.TrajDataFrame(ping_df, user_id='uid', latitude='lat', longitude='lng', datetime='datetime')
ftdf = filtering.filter(tdf)  # slow on the full dataset (~28 minutes); consider saving the result
n_deleted_points = len(tdf) - len(ftdf)
print(f'Number of points removed by filtering: {n_deleted_points}')
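To make the idea of outlier filtering concrete, the sketch below drops pings that would imply an implausibly high travel speed from the previously kept ping. It is a simplified, single-user illustration and not the scikit-mobility implementation; the function name `drop_fast_points`, the 500 km/h threshold, and the example pings are assumptions of this sketch.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


def drop_fast_points(points, max_speed_kmh=500.0):
    """Keep pings whose speed from the previously kept ping is <= max_speed_kmh.

    `points` is a time-sorted list of (datetime, lat, lng) tuples for one user.
    """
    if not points:
        return []
    kept = [points[0]]
    for t, lat, lng in points[1:]:
        t_prev, lat_prev, lng_prev = kept[-1]
        hours = (t - t_prev).total_seconds() / 3600.0
        if hours <= 0:
            # Duplicate or out-of-order timestamp: skipped in this sketch
            continue
        speed_kmh = haversine_km(lat_prev, lng_prev, lat, lng) / hours
        if speed_kmh <= max_speed_kmh:
            kept.append((t, lat, lng))
    return kept


# Hypothetical pings: the second one jumps about 111 km in one minute
t0 = datetime(2020, 1, 1, 12, 0)
pings = [
    (t0, 4.600, -74.080),
    (t0 + timedelta(minutes=1), 5.600, -74.080),
    (t0 + timedelta(minutes=10), 4.601, -74.080),
]
print(len(pings) - len(drop_fast_points(pings)), 'point(s) removed')
```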

### Compute the stops 
We compute stops with minimum durations of 20, 30, and 60 minutes and write them out. For the filtered dataset, there are 1861875 stops of at least 60 minutes; this number drops to 1613447 if we exclude stops that span a gap of more than 24 hours (1440 minutes) between consecutive pings, which we treat as missing data. With the same exclusion, there are 2763819 stops of at least 20 minutes.

It took ~20 minutes to compute the stops for each minimum stopping time.

In [None]:
# Compute the stops 
number_min = 60
sftdf = detection.stay_locations(ftdf, stop_radius_factor=0.5, minutes_for_a_stop=number_min, spatial_radius_km=0.2, leaving_time=True)
print(f'The number of stops for {number_min} minutes in the filtered dataset is {len(sftdf)}')

In [None]:
# no_data_for_minutes=1440 (24 hours): a gap of more than 1440 minutes between consecutive pings
# is treated as missing data rather than counted as a stop.
calculate_visits_min_minutes(
    tdf=ftdf, 
    visit_durations=[20, 30, 60], 
    out_dir=f'{data_dir}data/parquet/in_study_area/pass_qc/in_zats/visits/',
    no_data_for_minutes=1440)
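The interaction between the minimum stop duration and the missing-data rule can be illustrated with a small stay-point detector. This is a simplified sketch for a single user, not the algorithm used by `detection.stay_locations` or by the project's `calculate_visits_min_minutes`; the function name `detect_stops`, the anchor-based clustering, and the example pings are assumptions of this sketch.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


def detect_stops(points, minutes_for_a_stop=60, radius_km=0.2, no_data_for_minutes=1440):
    """Return stops from a time-sorted list of (datetime, lat, lng) pings for one user."""
    stops = []
    i = 0
    while i < len(points):
        t_start, lat0, lng0 = points[i]
        j = i + 1
        # Grow the cluster while pings stay near the anchor ping and no long gap occurs
        while j < len(points):
            t_j, lat_j, lng_j = points[j]
            gap_min = (t_j - points[j - 1][0]).total_seconds() / 60
            too_far = haversine_km(lat0, lng0, lat_j, lng_j) > radius_km
            if too_far or gap_min > no_data_for_minutes:
                break
            j += 1
        cluster = points[i:j]
        duration_min = (cluster[-1][0] - t_start).total_seconds() / 60
        if duration_min >= minutes_for_a_stop:
            stops.append({
                'lat': sum(p[1] for p in cluster) / len(cluster),
                'lng': sum(p[2] for p in cluster) / len(cluster),
                'arrival': t_start,
                'leaving': cluster[-1][0],
            })
            i = j  # continue after the stop
        else:
            i += 1
    return stops


# Hypothetical pings: 90 minutes at one place, then two pings elsewhere three days apart
t0 = datetime(2020, 1, 1, 8, 0)
pings = [
    (t0, 4.60, -74.08),
    (t0 + timedelta(minutes=30), 4.60, -74.08),
    (t0 + timedelta(minutes=90), 4.60, -74.08),
    (t0 + timedelta(minutes=120), 4.70, -74.00),
    (t0 + timedelta(minutes=120, days=3), 4.70, -74.00),
]
print(len(detect_stops(pings)), len(detect_stops(pings, no_data_for_minutes=1e12)))
```

In this toy example the two pings three days apart only count as a stop when the gap rule is effectively disabled, which is the same kind of reduction described above for the full dataset.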

### Load in the visits df and visualize some of the stops

In [None]:
out_dir = f'{data_dir}data/parquet/in_study_area/pass_qc/in_zats/visits/'
# Alternative file:
# outfilename = f'{out_dir}users_living_in_sel_zat_visits_atleast_60min.csv'
outfilename = f'{out_dir}users_living_in_sel_zat_visits_atleast_60min_nodatafor_1440_minutes.csv'
visit_df = pd.read_csv(outfilename)
visit_df.head(10)

In [None]:
visit_df.uid.value_counts()

Map some of the visits

In [None]:
# Other user IDs to try:
# 'f12efcd6-a347-416c-8a39-93e6fb67f7aa', '324fe201-cce9-4395-91ab-ee421cdd34c9',
# 'f07076d8-32be-40f4-ad3a-e1ece90ec6f7', '00002eec-9e3e-4e4d-9822-4e4858a0de0c'
user_id = 'e846d741-5ee4-440f-b304-e4e3886c2210'
map_obj, user_data = plot_user_on_map(
    shapefile_path=shapefile_zat, 
    df=visit_df, 
    lat_col='lat', lng_col='lng', 
    user_id=user_id)
    
map_obj

### Compute other metrics (radius of gyration, etc)

In [None]:
# TODO: compute radius of gyration and other per-user mobility metrics
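As a starting point for this step, the radius of gyration of a user is the root-mean-square distance of their points from the centre of mass of those points. The sketch below is a minimal, single-user version under stated assumptions: the centre of mass is taken as the plain mean of latitudes and longitudes, distances are haversine distances, and the function name `radius_of_gyration_km` and the example points are hypothetical. scikit-mobility also provides a `radius_of_gyration` measure in `skmob.measures.individual`, which would likely be the more convenient choice on a `TrajDataFrame`.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0088 * math.asin(math.sqrt(a))


def radius_of_gyration_km(points):
    """Root-mean-square distance (km) of (lat, lng) points from their centre of mass."""
    n = len(points)
    if n == 0:
        raise ValueError('at least one point is required')
    # Centre of mass approximated by the mean latitude and mean longitude
    lat_c = sum(lat for lat, _ in points) / n
    lng_c = sum(lng for _, lng in points) / n
    mean_sq = sum(haversine_km(lat, lng, lat_c, lng_c) ** 2 for lat, lng in points) / n
    return math.sqrt(mean_sq)


# Hypothetical visit locations for one user
user_points = [(4.60, -74.08), (4.65, -74.05), (4.60, -74.08)]
print(round(radius_of_gyration_km(user_points), 3))
```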

### Visualizing trajectories and stops 

For a single user...

In [None]:
# work with one file initially 
#ping_file = ping_files[1]

# Work with data for one user initially (a user with many stops)
selected_user = '28a77f51-a6e4-4ca9-88a9-ba37edace695'
user_filter = ds.field('uid').isin([selected_user])
ping_df = ds.dataset(ping_files, format="parquet").to_table(filter=user_filter).to_pandas()

In [None]:
map_obj, user_data = plot_frac_data_on_map(shapefile_path=shapefile_zat, ddf=dd.from_pandas(ping_df, npartitions=2), frac=1.0)
map_obj

In [None]:
map_obj_tdf, user_data_tdf = plot_user_on_map(shapefile_path=shapefile_zat, df=tdf, lat_col='lat', lng_col='lng', user_id=selected_user)
map_obj_ftdf, user_data_ftdf = plot_user_on_map(shapefile_path=shapefile_zat, df=ftdf, lat_col='lat', lng_col='lng', user_id=selected_user)
map_obj_ftdf

In [None]:
# Compute the stops on the unfiltered (tdf) and filtered (ftdf) trajectories.
# Note: tdf and ftdf still hold all users here; only the plots below are restricted to selected_user.
number_min = 60
stdf = detection.stay_locations(tdf, stop_radius_factor=0.5, minutes_for_a_stop=number_min, spatial_radius_km=0.2, leaving_time=True)
sftdf = detection.stay_locations(ftdf, stop_radius_factor=0.5, minutes_for_a_stop=number_min, spatial_radius_km=0.2, leaving_time=True)
print(f'The number of stops for {number_min} minutes in the filtered dataset is {len(sftdf)}')

map_obj_stdf, user_data_stdf = plot_user_on_map(shapefile_path=shapefile_zat, df=stdf, lat_col='lat', lng_col='lng', user_id=selected_user)
map_obj_stdf.save(f"{data_dir}figures/{selected_user}_user_stops_{number_min}_minutes.html")

map_obj_sftdf, user_data_sftdf = plot_user_on_map(shapefile_path=shapefile_zat, df=sftdf, lat_col='lat', lng_col='lng', user_id=selected_user)
map_obj_sftdf.save(f"{data_dir}figures/{selected_user}_user_stops_{number_min}_minutes_filtered.html")