## Mobility analysis
The purpose of this notebook is to compare the mobility of users who live in the "treatment" zones (ZATs near the new cable car) and the "control" zones (similar ZATs with no new cable car).

1) Compute the distance between home and work locations
2) Compute visits (minimum visit times of 20 min, 30 min, and 1 hr)

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

working_dir = os.getenv("WORKING_DIR")
os.environ["USE_PYGEOS"] = "0"

from setup import *
from plotting import *
from preprocess import *
from analysis import *

import pandas as pd
import numpy as np
import skmob

c = read_config(
    f"{working_dir}configs/config_2018.yml"
)  # can switch to the config_2019.yml
(
    year,
    datatypes,
    initial_cols,
    sel_cols,
    final_cols,
    minlon,
    maxlon,
    minlat,
    maxlat,
) = get_config_vars(c=c, mode="preprocess")
min_days, min_pings = get_config_vars(c=c, mode="user_qc")

where = get_dirs(working_dir, year=year, min_days=min_days, min_pings=min_pings)

meta_dir = where.meta_dir

# Define the input and output directories (working_dir was already read from the environment above)
hw_dir = f"{where.pass_qc_dir}home_work_locs/"
pq_dir = f"{hw_dir}home/in_zats/"
pq_dir_out = f"{hw_dir}home/in_zats/visits/"
ensure_directory_exists(pq_dir_out)

In [None]:
# filter pings for simple visit approximation
simple_visit_approx = True

# compute visits with skmob
skmob_compute_visit = False
filter_tdf = False  # whether to filter the tdf with skmob (True for computed visits)

### Compute distance between home and work locations for users

In [None]:
home_locs_file = (
    f"{hw_dir}home/selected_txt_control_home_locs_w_zats_for_users_pass_qc.csv"
)
work_locs_file = (
    f"{hw_dir}work/selected_txt_control_work_locs_w_zats_for_users_pass_qc.csv"
)

home_df = pd.read_csv(
    home_locs_file,
    usecols=[
        "uid",
        "lat_home",
        "lng_home",
        "Area",
        "MUNCod",
        "NOMMun",
        "ZAT_home",
        "UTAM_home",
        "stratum",
    ],
)
work_df = pd.read_csv(
    work_locs_file,
    usecols=[
        "uid",
        "lat_work",
        "lng_work",
        "Area",
        "MUNCod",
        "NOMMun",
        "ZAT_work",
        "UTAM_work",
        "stratum",
    ],
)

In [None]:
# drop duplicate users (keeping the first home and first work location per user) and merge the frames
home_df = home_df.drop_duplicates(subset="uid")
work_df = work_df.drop_duplicates(subset="uid")

home_work_df = home_df.merge(work_df, on="uid", suffixes=("_home", "_work"))
print(len(home_df), len(work_df), len(home_work_df))
home_work_df.head()

Calculate the distance between the home and work location for each user
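For intuition, a great-circle (haversine) distance can be computed directly from the home and work coordinates. This is a minimal sketch under assumptions: `haversine_km` is a hypothetical helper, it treats the Earth as a sphere, and it is not necessarily what the project's `geodesic_distance` does (a true geodesic on an ellipsoid differs slightly).

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed spherical Earth)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy frame using the same column names as home_work_df (coordinates are made up)
toy_hw = pd.DataFrame(
    {
        "uid": ["u1", "u2"],
        "lat_home": [4.5737, 4.6493],
        "lng_home": [-74.1063, -74.0617],
        "lat_work": [4.6493, 4.6493],
        "lng_work": [-74.0617, -74.0617],
    }
)
# Vectorized over whole columns, so no row-wise apply is needed
toy_hw["distance_km"] = haversine_km(
    toy_hw["lat_home"], toy_hw["lng_home"], toy_hw["lat_work"], toy_hw["lng_work"]
)
print(toy_hw[["uid", "distance_km"]])
```

Because the function works on whole columns, it avoids the per-row overhead of `apply(..., axis=1)` on large frames.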

In [None]:
home_work_df["distance_km"] = home_work_df.apply(geodesic_distance, axis=1)
home_work_df.head()

In [None]:
home_work_w_dist_file = f"{hw_dir}home_work_locs_for_users_living_in_selected_txt_control_zats_w_distance_btwn_home_work.csv"
# home_work_df.to_csv(home_work_w_dist_file, index=False) # uncomment to rewrite file

### Load the pings data for the users living in the selected ZATs

Load (1) the shapefile with the ZAT and stratum information and (2) the user pings.

In [None]:
shapefile, regions_gdf_zat = get_shp(
    meta_dir=f"{meta_dir}income/", shp_name=c["meta"]["shp"]["zat"], load=True
)

# regions_gdf_zat.plot(column="stratum")

In [None]:
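# explicit imports in case the star-imports above do not already provide these names
import glob
import pyarrow.dataset as ds

# read every parquet file of pings into a single DataFrame with a clean 0..n-1 index
ping_files = glob.glob(pq_dir + "*.parquet")
ping_df = ds.dataset(ping_files, format="parquet").to_table().to_pandas()
ping_df = ping_df.reset_index(drop=True)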
ping_df.head(5)

### Examining common ping locations 

In [None]:
df = ping_df
counted_pairs = df.groupby(["lat", "lng"]).size().reset_index(name="counts")
sorted_counted_pairs = counted_pairs.sort_values("counts", ascending=False)
print(sorted_counted_pairs.head(20))
# sorted_counted_pairs['counts'].hist(bins=20)

# Specify the specific latitude and longitude pair you're interested in (e.g. most common pair)
# target_lat = 4.570431 #4.649300 (2019) #4.570431 (2018)
# target_lng = -74.095920 #-74.061699 (2019) #-74.095920 (2018)

# Filter the dataframe to retrieve rows with the specific lat, lng pair
# filtered_df = df[(df['lat'] == target_lat) & (df['lng'] == target_lng)]
# filtered_df.head(10)

In [None]:
len(ping_df)

### Compute visits and other mobility variables
The pings data can also be "compressed", and other preprocessing and filtering applied, before computing visits.

### Load and preprocess pings data
Note that we can print the parameters of the functions, for example for the filtering with `print(ftdf.parameters)`. Filtering takes ~28 minutes on the full dataset of pings from users in the selected ZATs.

In [None]:
# Convert the DataFrame into a TrajDataFrame and filter out outliers
from skmob.preprocessing import filtering

tdf = skmob.TrajDataFrame(
    ping_df, user_id="uid", latitude="lat", longitude="lng", datetime="datetime"
)
print(len(ping_df), len(tdf))

if filter_tdf:
    print("Filtering tdf...")
    ftdf = filtering.filter(tdf)  # takes quite some time - maybe save for later
    n_deleted_points = len(tdf) - len(ftdf)
    print(n_deleted_points)
    print(len(ping_df), len(ftdf))

2023 December 13: Elena and I are finding that some users with multiple pings across multiple days have only one visit in the visits computed post-filtering on the `ftdf`. To see whether their pings were simply being filtered out, I computed the visits with the `tdf`, but that doesn't fix the issue. As yo

It looks like even with the `tdf`, the 107 pings are being combined into one long visit: `00cc96ca-1549-4a8d-a3cf-a5ed1cb3b7b8,4.5737677,-74.106285,2018-07-05 15:59:13-05:00,d2g618tkq01t,10.0,83,RAFAEL URIBE URIBE,SAN JOSE,2018-11-13 21:45:41-05:00`. This appears to be because all the pings for this user fall within the radius of the same location.
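One rough way to check the "all pings fall within the same radius" explanation is to measure how far a user's pings spread from their first ping. The sketch below is not skmob's stop-detection algorithm; `max_spread_km` is a hypothetical helper, the haversine formula is an approximation, and the 0.2 km threshold is simply the `spatial_radius_km` value passed to `stay_locations` in this notebook.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed spherical Earth)


def max_spread_km(pings, lat_col="lat", lng_col="lng", time_col="datetime"):
    """Largest haversine distance (km) from a user's first ping to any of their pings."""
    pings = pings.sort_values(time_col)
    lat = np.radians(pings[lat_col].to_numpy(dtype=float))
    lng = np.radians(pings[lng_col].to_numpy(dtype=float))
    lat0, lng0 = lat[0], lng[0]  # anchor on the earliest ping
    a = (
        np.sin((lat - lat0) / 2) ** 2
        + np.cos(lat0) * np.cos(lat) * np.sin((lng - lng0) / 2) ** 2
    )
    return float((2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))).max())


# Toy pings clustered tightly around one location (hypothetical coordinates)
toy_pings = pd.DataFrame(
    {
        "lat": [4.5737677, 4.5738, 4.5740],
        "lng": [-74.106285, -74.1063, -74.1060],
        "datetime": pd.to_datetime(
            ["2018-07-05 15:59", "2018-08-01 09:00", "2018-11-13 21:45"]
        ),
    }
)
print(max_spread_km(toy_pings) <= 0.2)  # True means every ping sits inside a 0.2 km radius
```

Applied to the notebook's data, `max_spread_km(example_pings) <= 0.2` would be consistent with the single long visit if it returns `True`.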

In [None]:
from skmob.preprocessing import detection

selected_user = "00cc96ca-1549-4a8d-a3cf-a5ed1cb3b7b8"
example_pings = ping_df[ping_df["uid"] == selected_user].copy()
example_tdf = tdf[tdf["uid"] == selected_user].copy()
# print(example_pings.sort_values(by='datetime').head())
example_tdf = example_tdf.sort_values(by="datetime")  # sort_values is not in-place, so assign the result
example_stdf = detection.stay_locations(
    example_tdf,
    stop_radius_factor=0.5,
    minutes_for_a_stop=20.0,
    spatial_radius_km=0.2,
    leaving_time=True,
    no_data_for_minutes=1440,  # 1000000000000.0,
    min_speed_kmh=None,
)

print(example_stdf)
example_tdf.sort_values(by="datetime").head(5)

In [None]:
map_obj_tdf, user_data_tdf = plot_user_on_map(
    shapefile_path=shapefile,
    df=example_tdf,
    lat_col="lat",
    lng_col="lng",
    user_id=selected_user,
)

map_obj_tdf

### Simple method for calculating "visits"
A simpler approach is to keep only pings that are separated from a previous ping by at least a given time window, using the function below. This preserves most of the pings from the example user above. However, users with near-constant pings will likely have many of their pings filtered out, so it is important to keep such behavior in mind.
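The idea can be sketched in plain pandas as follows. This is only an illustration under assumptions: `keep_pings_spaced_by` is a hypothetical stand-in, not the project's `filter_pings_for_visit_approx`, and it assumes a ping is kept when it falls at least `time_window_minutes` after the last *kept* ping for the same user.

```python
import pandas as pd


def keep_pings_spaced_by(
    pings, time_window_minutes=20, uid_col="uid", time_col="datetime"
):
    """Keep, per user, only pings at least `time_window_minutes` after the last kept ping."""
    window = pd.Timedelta(minutes=time_window_minutes)
    # Sort by user and time; reset the index so each row has a unique label
    pings = pings.sort_values([uid_col, time_col]).reset_index(drop=True)
    kept_index = []
    for _, user_pings in pings.groupby(uid_col, sort=False):
        last_kept = None
        for idx, ts in user_pings[time_col].items():
            if last_kept is None or ts - last_kept >= window:
                kept_index.append(idx)
                last_kept = ts
    return pings.loc[kept_index].reset_index(drop=True)


# Toy data: user "a" has two pings too close to a kept ping, user "b" has a single ping
toy = pd.DataFrame(
    {
        "uid": ["a", "a", "a", "a", "a", "b"],
        "datetime": pd.to_datetime(
            [
                "2018-07-05 10:00",
                "2018-07-05 10:05",
                "2018-07-05 10:25",
                "2018-07-05 10:30",
                "2018-07-05 11:00",
                "2018-07-05 10:00",
            ]
        ),
    }
)
spaced = keep_pings_spaced_by(toy, time_window_minutes=20)
print(spaced)
```

A user who pings every few minutes keeps roughly one ping per window here, which is the over-filtering behavior to keep in mind.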

In [None]:
num_min = 20
filtered_pings = filter_pings_for_visit_approx(ping_df, time_window_minutes=num_min)
filtered_pings.head()

In [None]:
print(f"Writing files to {pq_dir_out}")

filtered_ping_file = f"{pq_dir_out}pings_seperated_by_{num_min}_minutes_for_users_living_in_selected_txt_control_zats.csv"
# filtered_pings.to_csv(filtered_ping_file, index=False) #uncomment to rewrite file

### Compute the stops with scikit-mobility
We will compute stops for 20, 30, and 60 minutes for the dataset and output those stops. 
It took ~20 minutes to compute the stops for each minimum stopping time.

>     number_min = 20
>     sftdf = detection.stay_locations(
>         ftdf,
>         stop_radius_factor=0.5,
>         minutes_for_a_stop=number_min,
>         spatial_radius_km=0.2,
>         leaving_time=True,
>     )
>     print(
>         f"The number of stops for {number_min} minutes in the filtered dataset is {len(sftdf)}"
>     )

For the `no_data_for_minutes` parameter, we use either 1e12 or 1440. 1440 is the number of minutes in 24 hours: when two consecutive pings are more than 1440 minutes apart, the gap is treated as missing data rather than counted as a stop, because the user's points may simply be missing. A value of 1e12 is so large that no gap is ever treated as missing data.
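To see how many gaps a given threshold would affect, one can flag consecutive pings that are further apart in time than `no_data_for_minutes`. The sketch below is a pandas illustration of that comparison only, not skmob's internal logic; `flag_long_gaps` and the `gap_minutes` and `long_gap` columns are hypothetical names.

```python
import pandas as pd


def flag_long_gaps(
    pings, no_data_for_minutes=1440, uid_col="uid", time_col="datetime"
):
    """Add the gap (in minutes) since each user's previous ping and flag gaps over the threshold."""
    pings = pings.sort_values([uid_col, time_col]).reset_index(drop=True)
    gap = pings.groupby(uid_col)[time_col].diff()  # NaT for each user's first ping
    pings["gap_minutes"] = gap.dt.total_seconds() / 60
    # Comparisons with NaN are False, so first pings are never flagged
    pings["long_gap"] = pings["gap_minutes"] > no_data_for_minutes
    return pings


# Toy user: a 2-hour gap followed by a 3-day gap
toy_gaps = pd.DataFrame(
    {
        "uid": ["a", "a", "a"],
        "datetime": pd.to_datetime(
            ["2018-07-05 10:00", "2018-07-05 12:00", "2018-07-08 12:00"]
        ),
    }
)
flagged_1440 = flag_long_gaps(toy_gaps, no_data_for_minutes=1440)
flagged_1e12 = flag_long_gaps(toy_gaps, no_data_for_minutes=1e12)
print(flagged_1440[["datetime", "gap_minutes", "long_gap"]])
```

With 1440, only the 3-day gap is flagged; with 1e12, nothing is flagged, which matches the description of the two options above.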

In [None]:
if skmob_compute_visit:
    if not filter_tdf:
        ftdf = tdf
    calculate_visits_min_minutes(
        tdf=ftdf,
        visit_durations=[20, 30, 60],
        out_dir=pq_dir_out,
        no_data_for_minutes=1e12,  # or 1440
    )