## Visit analysis
The purpose of this notebook is to compare the mobility of users who live in the "treatment" zones (ZATs near the new cable car) with that of users who live in the "control" zones (similar ZATs with no new cable car). We map the visits computed in the `mobility_analysis` notebook to the treatment and control group information and to the points of interest (POIs) obtained from Google Places.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

working_dir = os.getenv("WORKING_DIR")
os.environ["USE_PYGEOS"] = "0"

from setup import *
from plotting import *
from preprocess import *
from analysis import *

import pandas as pd
import numpy as np
import skmob

c = read_config(f"{working_dir}configs/config_2018.yml")
(
    year,
    datatypes,
    initial_cols,
    sel_cols,
    final_cols,
    minlon,
    maxlon,
    minlat,
    maxlat,
) = get_config_vars(c=c, mode="preprocess")
min_days, min_pings = get_config_vars(c=c, mode="user_qc")

where = get_dirs(working_dir, year=year, min_days=min_days, min_pings=min_pings)

meta_dir = where.meta_dir

In [None]:
# hw_dir = f'{working_dir}data/parquet/in_study_area/pass_qc/home_work_locs/'
# sel_zat_home_locs_meta_fp = f'{meta_dir}selected_txt_control_home_locs_2200_0600_w_zats_txt_group_for_users_pass_qc.csv'
# visit_dir= f'{working_dir}data/parquet/in_study_area/pass_qc/in_zats/visits/'

hw_dir = f"{where.pass_qc_dir}home_work_locs/home/"
visit_dir = f"{hw_dir}in_zats/visits/"
out_dir = f"{visit_dir}w_pois/"

for d in [visit_dir, out_dir]:
    ensure_directory_exists(d)

sel_zat_home_locs_meta_fp = f"{hw_dir}selected_txt_control_home_locs_w_zats_for_users_pass_qc_w_treatment_group.csv"

### Load home locations with ZATs and treatment group for users living in treatment and control ZATs

In [None]:
sel_zat_home_locs_meta = pd.read_csv(sel_zat_home_locs_meta_fp)
sel_zat_home_locs_meta.head()

### Load shapefiles with the POI information 
There are two shapefiles: one with a 15-meter buffer around the Google Places POIs and one with a 20-meter buffer.

In [None]:
# for POIs with 15m Buffer
shp_name, gdf_from_shp = get_shp_to_assign_poi(
    shp_dir=f"{meta_dir}places/Buffer Shapefiles/", config=c, radius=15, plot=True
)
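
To make the buffer idea concrete, here is a minimal sketch of what a 15 m or 20 m buffer means for a single visit. It assumes circular buffers and approximates the containment test with a great-circle distance check. The helper names, coordinates, and POI entries are hypothetical. The notebook itself relies on the prebuilt buffer shapefiles loaded by `get_shp_to_assign_poi`, not on this code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def pois_within_buffer(visit, pois, radius_m):
    """Return the names of POIs whose circular buffer contains the visit."""
    return [
        p["name"]
        for p in pois
        if haversine_m(visit["lat"], visit["lng"], p["lat"], p["lng"]) <= radius_m
    ]


# Hypothetical visit and POIs (made-up coordinates, for illustration only)
visit = {"lat": 4.57000, "lng": -74.15000}
pois = [
    {"name": "poi_a", "lat": 4.57010, "lng": -74.15000},  # roughly 11 m north
    {"name": "poi_b", "lat": 4.57016, "lng": -74.15000},  # roughly 18 m north
]

print(pois_within_buffer(visit, pois, radius_m=15))
print(pois_within_buffer(visit, pois, radius_m=20))
```

In this toy case the 20 m radius captures a POI that the 15 m radius misses, which is the trade-off between the two shapefiles: a wider buffer maps more visits to POIs but also increases the chance that one visit matches several POIs.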

### Load visits and map the visits to the POIs
Load the visit locations for users living within the treatment and control ZATs. Note that these are rough estimates and may need to be recomputed. 

In [None]:
visit_durations = [20, 30, 60]
min_minutes = 1e12  # or 1440

for i in tqdm(
    range(0, len(visit_durations)),
    desc=f"Mapping and writing user visits to POIs from {shp_name}",
):
    stopping_time = visit_durations[i]
    visits_filename = f"users_living_in_sel_zat_visits_atleast_{stopping_time}min_nodatafor_{min_minutes}_minutes"  # modify the file name as needed, e.g. the `_nodatafor_1440_minutes` suffix
    visits_fp = f"{visit_dir}{visits_filename}.csv"
    outfilename = f"{visits_filename}_w_poi_from_{shp_name}_shp"
    visit_df = read_visits(visits_fp, uid_treat_group_info=sel_zat_home_locs_meta)
    visits_w_poi_df = calc_write_visit_pois(
        visit_df=visit_df,
        regions_gdf=gdf_from_shp,
        cols_to_keep=(
            list(visit_df.columns)
            + [
                "name",
                "dup",
                "status",
                "category",
                "id",
                "BUFF_DIST",
                "ORIG_FID",
                "Shape_Area",
            ]
        ),
        out_dir=out_dir,
        subdir_name=shp_name,
        outfilename=outfilename,
    )
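
Because the file name is built with an f-string, `min_minutes` is rendered with Python's default formatting, so a float such as `1e12` and an integer such as `1440` produce differently shaped names. The sketch below only reproduces the string construction from the loop above. `build_visits_filename` is a hypothetical helper, not part of the project modules.

```python
def build_visits_filename(stopping_time, min_minutes):
    """Rebuild the visits file name the same way the loop above does."""
    return (
        f"users_living_in_sel_zat_visits_atleast_{stopping_time}min"
        f"_nodatafor_{min_minutes}_minutes"
    )


# Compare how the two candidate thresholds end up in the file name
for candidate in (1e12, 1440):
    print(build_visits_filename(20, candidate))
```

The float `1e12` is rendered as `1000000000000.0`, so the file on disk has to carry exactly that suffix for `read_visits` to find it.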

#### Some logs: 
**For the 15m Buffer**
- There were 3007082 visits in the file. 1157808 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_20min_w_poi_from_POI_Buffer15m_shp
- There were 2487677 visits in the file. 934178 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_30min_w_poi_from_POI_Buffer15m_shp
- There were 1861875 visits in the file. 643682 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_60min_w_poi_from_POI_Buffer15m_shp

**For the 20m Buffer**
- There were 3007082 visits in the file. 1967679 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_20min_w_poi_from_POI_Buffer20m_shp

- There were 2487677 visits in the file. 1583775 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_30min_w_poi_from_POI_Buffer20m_shp

- There were 1861875 visits in the file. 1089400 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_60min_w_poi_from_POI_Buffer20m_shp

#### Some logs for the files with the additional `_nodatafor_1440_minutes` filtering: 

**For the 15m Buffer**
- There were 2763819 visits in the file. 1080490 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_20min_nodatafor_1440_minutes_w_poi_from_POI_Buffer15m_shp

- There were 2242918 visits in the file. 856483 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_30min_nodatafor_1440_minutes_w_poi_from_POI_Buffer15m_shp

- There were 1613447 visits in the file. 564951 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_60min_nodatafor_1440_minutes_w_poi_from_POI_Buffer15m_shp

**For the 20m Buffer**
- There were 2763819 visits in the file. 1836594 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_20min_nodatafor_1440_minutes_w_poi_from_POI_Buffer20m_shp

- There were 2242918 visits in the file. 1452026 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_30min_nodatafor_1440_minutes_w_poi_from_POI_Buffer20m_shp

- There were 1613447 visits in the file. 955859 instances occured where visits mapped to named POIs (some visits may map to multiple POIs). Wrote data to users_living_in_sel_zat_visits_atleast_60min_nodatafor_1440_minutes_w_poi_from_POI_Buffer20m_shp
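
As a quick way to compare the two buffers, the logged counts can be turned into the number of POI matches per visit. The sketch below uses only the figures from the first set of logs above (the files without the `_nodatafor_1440_minutes` filter). `instances_per_visit` is a hypothetical helper.

```python
# (visits in file, visit-to-POI instances), copied from the logs above
logs = {
    ("15m", 20): (3007082, 1157808),
    ("15m", 30): (2487677, 934178),
    ("15m", 60): (1861875, 643682),
    ("20m", 20): (3007082, 1967679),
    ("20m", 30): (2487677, 1583775),
    ("20m", 60): (1861875, 1089400),
}


def instances_per_visit(n_visits, n_instances):
    """Average number of named-POI matches per visit in the file."""
    return n_instances / n_visits


for (buffer, minutes), (n_visits, n_instances) in logs.items():
    ratio = instances_per_visit(n_visits, n_instances)
    print(f"{buffer} buffer, >= {minutes} min: {ratio:.3f} POI matches per visit")
```

Since a single visit may map to several POIs, this ratio is matches per visit rather than the share of visits that matched at least one POI.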


### Load the visits and look at how many mapped to one or more POIs, group them by POI category

In [None]:
# uncomment to directly access the outputs for 2019 data (based on pre-config file structure) TO DO: remove after running with config or modifying file structure
# outdir_2019 = f"{c.run.in_dir}parquet/in_study_area/year=2019/pass_qc/in_zats/visits/w_pois/"
# outfilename_2019 = f'users_living_in_sel_zat_visits_atleast_20min_w_poi_from_{shp_name}_shp'
# vists_w_poi_df_fp = f'{outdir_2019}{shp_name}/{outfilename_2019}.csv'
# visits_w_poi_df = pd.read_csv(vists_w_poi_df_fp)
# visits_w_named_pois, visits_w_more_than_one_named_poi, grouped_category_proportions = calc_group_poi_visits(visits_w_poi_df)
# visits_w_named_pois.to_csv(f'{outdir_2019}{shp_name}/{outfilename_2019}_drop_null.csv', index=False)
# visits_w_named_pois.head()

In [None]:
# for 2018 data
stopping_time = 20
min_minutes = 1e12  # or 1440
outfilename = f"users_living_in_sel_zat_visits_atleast_{stopping_time}min_nodatafor_{min_minutes}_minutes_w_poi_from_{shp_name}_shp"
visits_w_poi_df_fp = f"{out_dir}{shp_name}/{outfilename}.csv"
visits_w_poi_df = pd.read_csv(visits_w_poi_df_fp)
(
    visits_w_named_pois,
    visits_w_more_than_one_named_poi,
    grouped_category_proportions,
) = calc_group_poi_visits(visits_w_poi_df)
visits_w_named_pois.to_csv(
    f"{out_dir}{shp_name}/{outfilename}_drop_null.csv", index=False
)
visits_w_named_pois.head()
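
The implementation of `calc_group_poi_visits` lives in the project modules and is not shown here. As a rough illustration of the three outputs it returns, the sketch below performs comparable steps on a toy table with plain pandas. The column names follow the notebook, the values are made up, and the exact logic of the real function may differ.

```python
import pandas as pd

# Toy visits table: one row per (visit, matched POI); unmatched visits have no name
visits = pd.DataFrame(
    {
        "uid": ["u1", "u1", "u1", "u2", "u2"],
        "datetime": [
            "2018-01-03 08:00",
            "2018-01-03 08:00",
            "2018-01-05 12:00",
            "2018-01-04 09:00",
            "2018-01-06 18:00",
        ],
        "Group": ["treatment", "treatment", "treatment", "control", "control"],
        "name": ["cafe", "bakery", None, "gym", None],
        "category": ["food", "food", None, "sport", None],
    }
)

# 1. Keep only rows where the visit matched a named POI
named = visits.dropna(subset=["name"])

# 2. Visits (user + start time) that matched more than one named POI
n_pois = named.groupby(["uid", "datetime"])["name"].transform("nunique")
multi = named[n_pois > 1]

# 3. Share of each POI category within each group (rows sum to 1)
category_props = pd.crosstab(named["Group"], named["category"], normalize="index")

print(named)
print(multi)
print(category_props)
```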

### Plot proportion of category of POI visits for control and treatment ZATs

In [None]:
grouped_cat_prop_csv_fp = f"{out_dir}{outfilename}_grouped_by_txt_category_proportions"
grouped_category_proportions.to_csv(f"{grouped_cat_prop_csv_fp}.csv")
grouped_category_proportions.head()

In [None]:
from plotting import *

In [None]:
plotfilename = f"{where.plot_dir}users_living_in_sel_zat_visits_atleast_{stopping_time}min_w_poi_from_{shp_name}_shp_grouped_by_txt_category_proportions"
plot_stacked_bar_from_csv(
    f"{grouped_cat_prop_csv_fp}.csv", out_file=plotfilename, colormap="Spectral"
)

### Visits by month

In [None]:
visits_w_poi_df.head()

In [None]:
cols_for_bar = [
    "uid",  #'lat_visit', 'lng_visit',
    "datetime",  #'leaving_datetime',
    "Group",
    "name",
    "category",
]

all_visits, all_visits_grouped = count_visits_by_month(
    visits_w_poi_df, cols=cols_for_bar, normalize=True
)
named_poi_visits, named_visits_grouped = count_visits_by_month(
    visits_w_named_pois, cols=cols_for_bar, normalize=True
)
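
For readers without access to the project modules, here is a minimal sketch of one plausible reading of `normalize=True`: monthly visit counts per group divided by the number of distinct users in that group. This is an assumption about what `count_normalized_nusers` means, and the toy data is made up. The real `count_visits_by_month` may, for example, normalize by the users active in each month instead.

```python
import pandas as pd

# Toy visits: treatment has two users, control has one
visits = pd.DataFrame(
    {
        "uid": ["u1", "u1", "u2", "u3", "u3", "u3"],
        "datetime": pd.to_datetime(
            [
                "2018-01-03",
                "2018-02-10",
                "2018-01-15",
                "2018-01-20",
                "2018-01-25",
                "2018-02-02",
            ]
        ),
        "Group": ["treatment"] * 3 + ["control"] * 3,
    }
)
visits["month"] = visits["datetime"].dt.month

# Visits per group and month
counts = visits.groupby(["Group", "month"]).size().rename("count").reset_index()

# Distinct users per group (assumed normalization denominator)
n_users = visits.groupby("Group")["uid"].nunique().rename("n_users").reset_index()

grouped = counts.merge(n_users, on="Group")
grouped["count_normalized_nusers"] = grouped["count"] / grouped["n_users"]
print(grouped)
```

Normalizing by group size matters when the treatment and control groups contain different numbers of users, since raw counts would then mostly reflect group size.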

#### Plot number of visits per month for each group

In [None]:
data = all_visits_grouped  # [all_visits_grouped['month']<12]
plot_dir = where.plot_dir
palette = sns.color_palette("Paired")

title = f"Total >{stopping_time} min visits per number of users per group by month for users living in selected ZATs"
plot_visits_bar(
    data=data,
    x="month",
    y="count_normalized_nusers",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

title = f"Total >{stopping_time} min visits by month for users living in treatment or control ZATs"
plot_visits_bar(
    data=data,
    x="month",
    y="count",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

data = named_visits_grouped  # [named_visits_grouped['month']<12]

title = f"Total >{stopping_time} min visits to named POIs per number of users per group by month for users living in selected ZATs"
plot_visits_bar(
    data=data,
    x="month",
    y="count_normalized_nusers",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

title = f"Total >{stopping_time} min visits to named POIs for users living in treatment or control ZATs"
plot_visits_bar(
    data=data,
    x="month",
    y="count",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

#### Plot the percentage of monthly visits that come from control group users versus treatment group users

In [None]:
all_visits_p, all_visits_grouped_p = count_visits_by_month(
    visits_w_poi_df, cols=cols_for_bar, as_proportion=True
)
named_poi_visits_p, named_visits_grouped_p = count_visits_by_month(
    visits_w_named_pois, cols=cols_for_bar, as_proportion=True
)
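
The percentage view can be sketched as each group's share of all visits within a month. The snippet below assumes that `as_proportion=True` produces a `percentage` column of this kind. The counts are made up and the real function may compute it differently.

```python
import pandas as pd

# Hypothetical monthly visit counts per group
monthly = pd.DataFrame(
    {
        "month": [1, 1, 2, 2],
        "Group": ["control", "treatment", "control", "treatment"],
        "count": [30, 10, 20, 20],
    }
)

# Each group's share of that month's visits, in percent
month_totals = monthly.groupby("month")["count"].transform("sum")
monthly["percentage"] = 100 * monthly["count"] / month_totals
print(monthly)
```

Within each month the percentages of the two groups sum to 100, so this view shows the balance between groups but hides changes in the overall number of visits.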

Note: For the 2019 data, we may want to merge the 2018 and 2019 files. Otherwise, the data points from the last day of December 2018 that appear in the 2019 data can be excluded with a filter such as `all_visits_grouped_p[all_visits_grouped_p['month']<12]`.

In [None]:
data = all_visits_grouped_p  # [all_visits_grouped_p['month']<12]
title = f"Percentage >{stopping_time} min visits made by users living in treatment versus control ZATs over time"
plot_visits_bar(
    data=data,
    x="month",
    y="percentage",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

data = named_visits_grouped_p  # [named_visits_grouped_p['month']<12]
title = f"Percentage >{stopping_time} min visits made to named POIs by users living in treatment versus control ZATs over time"
plot_visits_bar(
    data=data,
    x="month",
    y="percentage",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

### Make the same plots but only include users that have more than a particular number of visits

In [None]:
thresh_pois = 300
thresh_named_pois = 50

visits_w_poi_df_filtered = filter_users_by_minimum_visits(
    visit_df=visits_w_poi_df, visit_threshold=thresh_pois
)
visits_w_named_pois_filtered = filter_users_by_minimum_visits(
    visit_df=visits_w_named_pois, visit_threshold=thresh_named_pois
)
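
A minimal sketch of this kind of per-user filter is shown below. It assumes the threshold is inclusive (users with at least `visit_threshold` rows are kept) and that users are identified by a `uid` column. `keep_users_with_min_visits` is a hypothetical stand-in, not the project's `filter_users_by_minimum_visits`.

```python
import pandas as pd


def keep_users_with_min_visits(visit_df, visit_threshold, uid_col="uid"):
    """Keep all rows of users that have at least `visit_threshold` rows."""
    visits_per_user = visit_df.groupby(uid_col)[uid_col].transform("size")
    return visit_df[visits_per_user >= visit_threshold]


# Toy data: u1 has three visits, u2 has one
toy_visits = pd.DataFrame(
    {"uid": ["u1", "u1", "u1", "u2"], "name": ["cafe", "gym", None, "park"]}
)
filtered = keep_users_with_min_visits(toy_visits, visit_threshold=2)
print(filtered)
```

Using `transform` keeps the original rows and columns intact, so the filtered table can be passed to the same counting and plotting functions as the unfiltered one.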

In [None]:
# compute and graph filtered data
all_visits_p, all_visits_grouped_p = count_visits_by_month(
    visits_w_poi_df_filtered, cols=cols_for_bar, as_proportion=True
)
named_poi_visits_p, named_visits_grouped_p = count_visits_by_month(
    visits_w_named_pois_filtered, cols=cols_for_bar, as_proportion=True
)

data = all_visits_grouped_p  # [all_visits_grouped_p['month']<12]
title = f"Percentage >{stopping_time} min visits made by users living in treatment versus control ZATs (min {thresh_pois} visits per user)"
plot_visits_bar(
    data=data,
    x="month",
    y="percentage",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

data = named_visits_grouped_p  # [named_visits_grouped_p['month']<12]
title = f"Percentage >{stopping_time} min visits made to named POIs by users living in treatment versus control ZATs (min {thresh_named_pois} visits per user)"
plot_visits_bar(
    data=data,
    x="month",
    y="percentage",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

### Filter and then plot the normalized version of the data

In [None]:
all_visits_p, all_visits_grouped_p = count_visits_by_month(
    visits_w_poi_df_filtered, cols=cols_for_bar, normalize=True
)
named_poi_visits_p, named_visits_grouped_p = count_visits_by_month(
    visits_w_named_pois_filtered, cols=cols_for_bar, normalize=True
)

data = all_visits_grouped_p  # [all_visits_grouped_p['month']<12]
title = f"Total >{stopping_time} min visits per number of users per group by month for users living in selected ZATs (min {thresh_pois} visits per user)"
plot_visits_bar(
    data=data,
    x="month",
    y="count_normalized_nusers",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)

data = named_visits_grouped_p  # [named_visits_grouped_p['month']<12]
title = f"Total >{stopping_time} min visits to named POIs per number of users per group by month for users living in selected ZATs (min {thresh_named_pois} visits per user)"
plot_visits_bar(
    data=data,
    x="month",
    y="count_normalized_nusers",
    hue="Group",
    plot_dir=plot_dir,
    title=title,
    figsize=(6, 4),
)