## Home and work location mapping 
The purpose of this notebook is to map the home and work locations of the users that passed quality control (QC). Specifically, we want to map:

1) All the home locations of users in Bogota 
2) All the work locations of the users in Bogota
3) All the home and work locations of the users that live near the stations

In [None]:
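# Explicit imports for the names used below (the star imports may also provide them)
import glob
import os

import geopandas as gpd
import pandas as pd
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from dotenv import load_dotenv
from tqdm import tqdm

# Load the .env file before importing the local modules
load_dotenv()

from plotting import *
from preprocess import *

# Access environment variables and define other necessary variables.
# WORKING_DIR is expected to end with a trailing slash.
data_dir = os.getenv('WORKING_DIR')
meta_dir = f'{data_dir}metadata/'
pq_dir = f'{data_dir}data/parquet/in_study_area/pass_qc/'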

### Load the pings data and map each user's home location

Load (1) the shapefile with the ZAT and stratum information and (2) the user stats, which give the list of users.
Note that 701,961 users passed QC. This section takes quite some time to run (~111 minutes).

In [None]:
shp_name_zat = 'zat_stratum'
shapefile_zat = f'{meta_dir}income/{shp_name_zat}.shp'
regions_gdf_zat = gpd.read_file(shapefile_zat)
regions_gdf_zat.plot()

# data_dir already ends with a slash, so do not add another one
user_stats_filepath = f'{data_dir}data/user_stats/user_stats_2019_months1-8_60min_pings_10min_days_shp_filtered.csv'
user_stats_filtered = pd.read_csv(user_stats_filepath)
uids_pass_qc = list(user_stats_filtered['uid'])
user_stats_filtered.head()

For `start_time` and `end_time`, we originally chose 22:00 and 6:00 to map home locations. The same approach works for daytime hours, for instance to estimate work locations, which we did with 8:00 and 18:00. The `goal` parameter describes what information you hope to get by finding the location with the most pings within the time window. By default `goal='home'`; for work locations we set `goal='work'`, mainly so that the output files are labeled accurately.
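
As a rough illustration of the idea (this is not the implementation of `compute_home_lat_lngs_for_users`, and the helper name `most_pinged_location` and the toy pings are made up for this sketch), the most-pinged location inside a time window can be found as below. It assumes pings at the same place share identical `lat`/`lng` values; real GPS pings would first need rounding or binning. Note that a window such as 22:00 to 6:00 wraps around midnight, so it has to be treated as "after start OR before end".

```python
from datetime import datetime

import pandas as pd


def _minutes(hhmm):
    """Convert an 'H:MM' or 'HH:MM' string to minutes since midnight."""
    t = datetime.strptime(hhmm, "%H:%M")
    return t.hour * 60 + t.minute


def most_pinged_location(pings, start_time, end_time):
    """Hypothetical helper: per uid, the (lat, lng) with the most pings in the window."""
    minute_of_day = pings["datetime"].dt.hour * 60 + pings["datetime"].dt.minute
    start, end = _minutes(start_time), _minutes(end_time)
    if start <= end:
        # Daytime window, e.g. 8:00 to 18:00
        in_window = (minute_of_day >= start) & (minute_of_day <= end)
    else:
        # Overnight window that wraps around midnight, e.g. 22:00 to 6:00
        in_window = (minute_of_day >= start) | (minute_of_day <= end)
    counts = (
        pings[in_window]
        .groupby(["uid", "lat", "lng"])
        .size()
        .reset_index(name="n_pings")
    )
    # Keep the location with the highest ping count for each user
    top = counts.sort_values("n_pings", ascending=False, kind="stable").drop_duplicates("uid")
    return top.sort_values("uid").reset_index(drop=True)


# Toy data: u1 has 2 daytime pings at one place and 3 night pings at another
pings = pd.DataFrame({
    "uid": ["u1"] * 5 + ["u2"] * 2,
    "datetime": pd.to_datetime([
        "2019-01-07 09:00", "2019-01-07 10:00", "2019-01-07 23:00",
        "2019-01-07 23:30", "2019-01-08 01:00",
        "2019-01-07 12:00", "2019-01-08 02:00",
    ]),
    "lat": [4.60, 4.60, 4.70, 4.70, 4.70, 4.65, 4.61],
    "lng": [-74.08, -74.08, -74.10, -74.10, -74.10, -74.05, -74.07],
})

work = most_pinged_location(pings, "8:00", "18:00")
home = most_pinged_location(pings, "22:00", "6:00")
print(work)
print(home)
```

The same function serves both goals; only the window changes, which mirrors how `goal` is mainly a label for the outputs.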

In [None]:
goal='work'
start_time = '8:00'
end_time = '18:00'

In [None]:
out_dir = f'{pq_dir}home_work_locs/{goal}/'
cols = ['uid', 'datetime', 'lat', 'lng']
gdf_cols = ['Area', 'MUNCod', 'NOMMun', 'ZAT', 'UTAM', 'stratum']

# Reuse the column lists defined above instead of repeating them
compute_home_lat_lngs_for_users(uids_pass_qc=uids_pass_qc, regions_gdf=regions_gdf_zat, pq_dir=pq_dir,
                                out_dir=out_dir, num_users=50000, cols=cols, gdf_cols=gdf_cols,
                                start_time=start_time, end_time=end_time, goal=goal)

### Merge the home (or work) locs from all users to find those that live in the zats of interest 

#### Combine the data for all the users
Read in the files we just wrote, which contain the home (or work) location of every user (lat, lng, ZAT), to get the uids of the users that live (or work) in those areas.

In [None]:
out_dir = f'{pq_dir}home_work_locs/{goal}/'

In [None]:
locs_files = glob.glob(out_dir + '*.parquet')
all_locs_filepath = f'{out_dir}all_{goal}_locs_0800_1800_w_zats_for_users_pass_qc.csv'
hl_df = ds.dataset(locs_files, format="parquet").to_table().to_pandas()
hl_df = hl_df.rename(columns={'lat': f'lat_{goal}', 'lng': f'lng_{goal}', 'ZAT': f'ZAT_{goal}', 'UTAM': f'UTAM_{goal}'})
hl_df.to_csv(all_locs_filepath)
hl_df.head()

#### Read in zats file 
This file specifies the treatment and control zats. Make sure that the ZATs are the same data type in the `hl_df` and the `zats_tc` dataframes for proper filtering.

In [None]:
import numpy as np 

zats_tc_fp = f'{meta_dir}ZAT_treat_control.csv'
print(hl_df[f'ZAT_{goal}'].dtypes)
zats_tc = pd.read_csv(zats_tc_fp).astype('float64') 

# The two columns have different lengths, so drop the NaN padding
ztreat = zats_tc['ZATs Treatment group'].dropna().tolist()
zcontrol = zats_tc['ZAT Control group'].dropna().tolist()
zsel = ztreat + zcontrol
zats_tc.head()
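
To show why the data types matter, here is a small sketch with made-up ZAT ids (the frames `hl` and `tc` are hypothetical stand-ins for `hl_df` and `zats_tc`). When one side holds strings and the other floats, `isin` silently matches nothing; casting both to the same numeric type fixes the filter.

```python
import pandas as pd

# Hypothetical stand-ins: ZATs stored as strings on one side and floats on the other
hl = pd.DataFrame({"uid": ["a", "b", "c"], "ZAT_work": ["101", "202", "303"]})
tc = pd.DataFrame({
    "ZATs Treatment group": [101.0, 202.0],
    "ZAT Control group": [404.0, None],  # shorter column, padded with NaN
})

zsel = pd.concat([tc["ZATs Treatment group"], tc["ZAT Control group"]]).dropna().tolist()

# Mismatched dtypes: no error is raised, but nothing matches
naive = hl[hl["ZAT_work"].isin(zsel)]
print(len(naive))

# Cast the ZAT column to float64 so it agrees with the treatment/control lists
hl["ZAT_work"] = pd.to_numeric(hl["ZAT_work"]).astype("float64")
matched = hl[hl["ZAT_work"].isin(zsel)]
print(matched)
```

Printing the dtype of both columns before filtering, as the cell above does for `hl_df`, is a cheap way to catch this.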

#### Filter the users by the selected zat ids to get the uids of those users that live in those areas
Also write out a file with data on the homes, stratum, etc. of these users.

In [None]:
hl_df_zsel = hl_df[hl_df[f'ZAT_{goal}'].isin(zsel)].copy().reset_index(drop=True)

zsel_locs_filepath = f'{out_dir}selected_txt_control_{goal}_locs_0800_1800_w_zats_for_users_pass_qc.csv'
hl_df_zsel.to_csv(zsel_locs_filepath)
hl_df_zsel.head()

In [None]:
users_locs_in_zsels = list(hl_df_zsel['uid'])

There are 39,655 users that potentially live in one of the 96 ZATs, based on where they have the most pings between 22:00 and 6:00 (which is how skmob calculates home locations).
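
Counts like these can be checked directly on the filtered frame. The sketch below uses a small made-up frame (`locs` is a hypothetical stand-in for `hl_df_zsel` with `goal='home'`) and assumes one row per user.

```python
import pandas as pd

# Hypothetical stand-in for hl_df_zsel: one row per user with the ZAT of their location
locs = pd.DataFrame({
    "uid": ["a", "b", "c", "d"],
    "ZAT_home": [101.0, 101.0, 202.0, 303.0],
})

# Each user should appear only once before the uid list is used as a filter
assert locs["uid"].is_unique

n_users = locs["uid"].nunique()      # number of users in the selected ZATs
n_zats = locs["ZAT_home"].nunique()  # number of selected ZATs with at least one user
users_per_zat = locs.groupby("ZAT_home")["uid"].nunique().sort_values(ascending=False)

print(f"{n_users} users across {n_zats} ZATs")
print(users_per_zat)
```

If the number of ZATs found is lower than the number of selected ZATs, some ZATs have no users assigned to them.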

### Filter the ping data to only include pings from those users that live in the specified ZATs
Takes ~2.5 minutes to filter the data and write all the parquet files, and ~5 minutes to write the CSV.

In [None]:
pq_dirs_pass_qc = glob.glob(pq_dir + '*.parquet')
pq_dirs_pass_qc_names = [i.split(f'{pq_dir}')[1].split('.parquet')[0] for i in pq_dirs_pass_qc]

# Make sure the output directory exists before writing to it
in_zats_dir = f'{pq_dir}in_zats/{goal}_in_zats/'
os.makedirs(in_zats_dir, exist_ok=True)

for pq_path, pq_name in tqdm(zip(pq_dirs_pass_qc, pq_dirs_pass_qc_names),
                             total=len(pq_dirs_pass_qc), desc='Writing data for users that pass qc'):
    print(f'Filtering data for {pq_name}...')
    dataset = ds.dataset(pq_path, format="parquet")
    # Keep only the pings from users whose location falls in the selected ZATs
    table = dataset.to_table(filter=ds.field('uid').isin(users_locs_in_zsels))
    data_for_users_in_zats = f'{in_zats_dir}{pq_name}_selected_zats_{goal}.parquet'
    pq.write_table(table, data_for_users_in_zats)

#### Make a parquet and csv file for the pings for all the users in the ZATs
Time: ~1 minute for the parquet file and ~5 minutes for the csv file

In [None]:
in_zats_dir = f'{pq_dir}in_zats/{goal}_in_zats/'
in_zats_dir_pq = glob.glob(in_zats_dir + f'*{goal}.parquet')
print(in_zats_dir_pq)
# Combine all the filtered files into a single dataframe with a clean index
in_zats_pings_df = ds.dataset(in_zats_dir_pq, format="parquet").to_table().to_pandas().reset_index(drop=True)
in_zats_pings_df.head()

In [None]:
in_zats_pings_df_filepath = f'{in_zats_dir}pings_{goal}_locs_0800_1800_w_zats_for_users_pass_qc_living_in_selzats.csv'
in_zats_pings_df.to_csv(in_zats_pings_df_filepath)