## Conversion of tabular data to other formats for querying 
The purpose of this notebook is to test memory- and time-efficient methods for querying big data that doesn't fit into memory and for running downstream analyses. In this case, the data are large tabular files of mobile data. 
The two methods I will try are: 

1) To create a database [with SQLite](https://www.sqlite.org/index.html)
2) To re-write the data to a parquet format using [Apache Arrow](https://arrow.apache.org/docs/python/parquet.html)

(The second option honestly seems much faster and better than the first, at least so far in my hands.)

Other resources to look into: 
- [Apache Beam](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/sql_taxi.py)
- [mobilkit POI mapping](https://mobilkit.readthedocs.io/en/latest/examples/M4R_03_POI_visit_analysis.html) and related [docs](https://mobilkit.readthedocs.io/en/latest/mobilkit.spatial.html)
- [info](https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html) about reading parquet files with dask 

## Filtering data by QC parameters 
This notebook also calculates user stats and filters the data down to users that meet the minimum standard for downstream analysis. 

### Data reading and package imports

In [1]:
from dotenv import load_dotenv
load_dotenv()

import os
import glob
#from tqdm import tqdm, notebook
from tqdm.notebook import trange, tqdm

import numpy as np
import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import dask.dataframe as dd
import geopandas as gpd
from datetime import datetime as dt

import mobilkit #.loader.crop_spatial as mkcrop_spatial

# Access environment variables and define other necessary variables
data_dir = os.getenv('WORKING_DIR')
meta_dir = f'{data_dir}metadata/'

#data_2019 = f'{data_dir}data/year=2019/'
#data_folders = glob.glob((data_2019 + '*/'))

initial_cols=['device_id', 'id_type', 'latitude', 'longitude', 'horizontal_accuracy', 'timestamp',  'ip_address', 'device_os', 'country', 'unknown_2', 'geohash']
sel_cols = ["device_id","latitude","longitude","timestamp","geohash","horizontal_accuracy"]
final_cols = ["uid","lat","lng","datetime","geohash","horizontal_accuracy"]

# boundary box that roughly captures the larger county of Bogota
minlon = -74.453
maxlon = -73.992
minlat = 3.727
maxlat = 4.835

In [2]:
#### FUNCTIONS FOR DATA PROCESSING ####

def get_days(data_folder):
    """Assuming a directory organized as a month's worth of days with files in each directory like "day=01", etc """
    day_dirs = glob.glob((data_folder + '*/'))
    return day_dirs

def get_files(data_folder, day_dir):
    """Given a day directory day_dir (e.g. "day=01") inside data_folder, return the paths of that day's non-zipped mobile data files and the day's name."""
    day = day_dir.split(data_folder)[1]
    filepaths = [fn for fn in glob.glob(day_dir + '*') if not os.path.basename(fn).endswith('.gz')] # select all the non-zipped mobile data files
    return filepaths, day

def load_data(filepaths, initial_cols, sel_cols, final_cols): 
    """Load in the mobile data and specify the columns"""
    ddf = dd.read_csv(filepaths, names=initial_cols)
    ddf = ddf[sel_cols]
    ddf.columns = final_cols
    return ddf 

def convert_datetime(ddf: dd.DataFrame): #needs work
    """Process timestamp to datetime for dataframe with a "datetime" column with timestamp values. """
    ddf["datetime"] = dd.to_datetime(ddf["datetime"], unit='ms', errors='coerce')
    ddf["datetime"] = ddf["datetime"].dt.tz_localize('UTC').dt.tz_convert('America/Bogota')
    return ddf

def preprocess_mobile(ddf: dd.DataFrame, final_cols: list, minlon , maxlon, minlat, maxlat): #needs work
    """Select only those points within an area of interest and process timestamp to datetime 
    for dataframe with a "datetime" column with timestamp values."""
    ddf = find_within_box(ddf, minlon, maxlon, minlat, maxlat)
    ddf = convert_datetime(ddf)[final_cols]
    df = ddf.compute()
    return df

def find_within_box(ddf, minlon, maxlon, minlat, maxlat):
    """Quick way to filter out points not in a particular rectangular region."""
    box=[minlon,minlat,maxlon,maxlat]
    filtered_ddf = mobilkit.loader.crop_spatial(ddf, box).reset_index()
    return filtered_ddf

#### FUNCTIONS FOR PARQUET CONVERSION ####

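def write_to_pq(df, out_dir, filename): 
    """Write a pandas dataframe to {out_dir}{filename}.parquet (out_dir is expected to end with a slash)."""
    table_name = f'{out_dir}{filename}.parquet'
    table = pa.Table.from_pandas(df)
    pq.write_table(table, table_name)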

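def from_month_write_filter_days_to_pq(data_folder: str, month: str, year: str, out_dir:str):
    """For each day directory in a month folder, load and preprocess that day's data and write it to its own parquet file."""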
    day_dirs = glob.glob((data_folder + '*/'))
    for i in tqdm(range(0,len(day_dirs)), desc=f'Files from {year} {month} processed'): 
        filepaths, day = get_files(data_folder, day_dirs[i])
        day_name = day.split('/')[0]
        ddf = load_data(filepaths, initial_cols, sel_cols, final_cols)
        df = preprocess_mobile(ddf, final_cols, minlon, maxlon, minlat, maxlat)
        filename = f'bogota_area_{year}_{month}_{day_name}'
        write_to_pq(df, out_dir, filename)
    return
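
To make the two preprocessing steps above concrete, here is a minimal plain-pandas sketch of the bounding-box crop and the timestamp conversion on a few made-up pings. It is not what `mobilkit.loader.crop_spatial` does internally. The helper names `crop_to_box` and `to_local_time` are hypothetical, and it assumes (as `convert_datetime` does) that the raw timestamps are epoch milliseconds.

```python
import pandas as pd

# Toy pings; column names follow final_cols, the values are made up.
toy_pings = pd.DataFrame({
    "uid": ["a", "b", "c"],
    "lat": [4.60, 6.25, 4.71],
    "lng": [-74.08, -75.56, -74.07],
    "datetime": [1546300800000, 1546304400000, 1546308000000],  # assumed epoch ms
})

def crop_to_box(df, minlon, maxlon, minlat, maxlat):
    """Keep rows whose coordinates fall inside the rectangle (bounds inclusive)."""
    inside = df["lng"].between(minlon, maxlon) & df["lat"].between(minlat, maxlat)
    return df[inside].reset_index(drop=True)

def to_local_time(df, tz="America/Bogota"):
    """Parse epoch-millisecond timestamps as UTC and convert them to local time."""
    out = df.copy()
    out["datetime"] = pd.to_datetime(
        out["datetime"], unit="ms", errors="coerce", utc=True
    ).dt.tz_convert(tz)
    return out

# Same bounding box as defined in the first cell.
toy_bogota = to_local_time(crop_to_box(toy_pings, -74.453, -73.992, 3.727, 4.835))
print(toy_bogota)
```

User `b` falls outside the box and is dropped; the remaining timestamps come back timezone-aware in Bogota time.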

## Apache Arrow re-formatting

### Convert tabular data to parquet
For each day in each month, load the files for the Colombia mobile data, filter the pings that are roughly in the Bogota area, and process the datetime. Write each day as a parquet file with the year, month, and day information in the filename.

This process takes about 10-15 minutes per month for the larger months and about 1 minute for the smallest (month=3).

More information [here](https://arrow.apache.org/docs/python/parquet.html).

In [3]:
year = 'year=2019'
in_dir = f'{data_dir}data/'
data_year = f'{in_dir}{year}/'
data_folders = glob.glob((data_year + '*/'))

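# restrict to only folders where the data has not been processed yet
# (match the folder name exactly so that 'month=1' does not also exclude month=10, 11 and 12)
processed_months = {'month=1', 'month=2'}
data_folders = [folder for folder in data_folders if os.path.basename(os.path.normpath(folder)) not in processed_months]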

out_dir = f'{data_dir}data/parquet/bogota_area_raw/'

for i in range(0, len(data_folders)):
    data_folder = data_folders[i]
    month = data_folder.split(data_year)[1].split('/')[0]
    #from_month_write_filter_days_to_pq(data_folder, month, year, out_dir) 

### Make dataset from parquet
Load the parquet files as a dataset that is queryable.

More information [here](https://arrow.apache.org/docs/python/dataset.html#dataset).

In [4]:
pq_dir = f'{data_dir}data/parquet/bogota_area_raw/'
dataset = ds.dataset(pq_dir, format="parquet")

As the documentation puts it: "Creating a Dataset object does not begin reading the data itself. If needed, it only crawls the directory to find all the files and infers the dataset’s schema (by default from the first file)." 

Let's look at some properties of the dataset, including what files it would be based on and some metadata.

In [5]:
print(dataset.files)
print(dataset.schema.to_string(show_field_metadata=False))
fragments = list(dataset.get_fragments())
print(fragments)
#fragments.split_by_row_group()

['/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=01.parquet', '/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=02.parquet', '/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=03.parquet', '/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=04.parquet', '/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=05.parquet', '/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=06.parquet', '/Users/emilyrobitschek/git/ETH/SPUR/mobile_data_colombia/data/parquet/bogota_area_raw/month=1/bogota_area_year=2019_month=1_day=07.parquet', '/Use

Using the Dataset.to_table() method we can read the dataset (or a portion of it) into a pyarrow Table. Depending on the dataset size this can require a lot of memory, so it's best to consider filtering the dataset first. For instance, if we only want the user ID and datetime columns, we can execute the following:

In [6]:
def compute_user_stats(dataset):
    table = dataset.to_table(columns=['uid', 'datetime']).to_pandas()
    table_dd = dd.from_pandas(table, npartitions=20)
    user_stats = mobilkit.stats.userStats(table_dd).compute()
    return user_stats

def compute_user_stats_from_pq(pq_dir):
    table_dd = dd.read_parquet(pq_dir, columns=['uid', 'datetime'])
    user_stats = mobilkit.stats.userStats(table_dd).compute()
    return user_stats

# note this takes quite some time (~20 minutes for two months of data with 10 partitions)
#user_stats = compute_user_stats(dataset)

# the kernel crashed so I just decided to rewrite the function without the pandas conversion in the hopes it would work better. 
# (In the end it took 175 minutes to run but it ran). I will comment it out for now.

#user_stats = compute_user_stats_from_pq(pq_dir)
#user_stats.head()

### Write out user stats for the whole dataset for filtering

In [7]:
min_pings, min_days = 60, 10 
output_filepath = f'{data_dir}data/user_stats_2019_months1-8_60min_pings_10min_days.csv'
output_folder = f'{data_dir}data/agg_data/'

# ran this once and will comment out for now to not overwrite the file: 
#user_stats_filtered = user_stats[(user_stats['pings'] >= min_pings) & (user_stats['daysActive'] >= min_days)]
#print(f"Based on {min_pings} ping and {min_days} day minimum cutoffs, kept {len(user_stats_filtered)} of a total of {len(user_stats)} users for this dataset.")
#user_stats_filtered.to_csv(output_filepath, index=False)
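
For intuition about what the cutoffs do, here is a rough pandas sketch that counts pings and active days per user on made-up data and then applies the same kind of threshold. It only mimics the `pings` and `daysActive` columns; it is not how `mobilkit.stats.userStats` is implemented, and the toy cutoffs (3 pings, 2 days) are chosen just so the example has something to drop.

```python
import pandas as pd

# Made-up pings for two users, already in local time.
toy_events = pd.DataFrame({
    "uid": ["a", "a", "a", "b"],
    "datetime": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 17:30", "2019-01-02 09:00", "2019-01-01 12:00",
    ]).tz_localize("America/Bogota"),
})

# Count pings and distinct calendar days per user.
toy_stats = (
    toy_events.assign(day=toy_events["datetime"].dt.date)
    .groupby("uid")
    .agg(pings=("datetime", "size"), daysActive=("day", "nunique"))
    .reset_index()
)

# Toy cutoffs, analogous to min_pings and min_days above.
toy_min_pings, toy_min_days = 3, 2
toy_kept = toy_stats[
    (toy_stats["pings"] >= toy_min_pings) & (toy_stats["daysActive"] >= toy_min_days)
]
print(toy_kept)
```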

In [8]:
user_stats_filtered = dd.read_csv(output_filepath)
print(f"Based on {min_pings} ping and {min_days} day minimum cutoffs, kept {len(user_stats_filtered)} users for this dataset.")
user_stats_filtered.head()

Based on 60 ping and 10 day minimum cutoffs, kept 728535 users for this dataset.


| Unnamed: 0 | uid | min_day | max_day | pings | daysActive | daysSpanned | pingsPerDay | avg |
|---|---|---|---|---|---|---|---|---|
| 0 | 000a7743-1dd9-4a98-b902-9e681b7a33c7 | 2019-03-28 00:00:00-05:00 | 2019-08-31 00:00:00-05:00 | 2087 | 144 | 156 | [23, 6, 18, 33, 2, 58, 53, 77, 45, 62, 30, 29,... | 14.493056 |
| 1 | 0015c3e5-dd4d-4198-954e-bdc5f225f29a | 2019-03-31 00:00:00-05:00 | 2019-08-31 00:00:00-05:00 | 421 | 59 | 153 | [1, 5, 6, 3, 2, 2, 7, 3, 5, 8, 4, 1, 3, 2, 7, ... | 7.135593 |
| 2 | 0016a88e-c934-4e67-bb9f-d0d6282aa9ea | 2019-05-09 00:00:00-05:00 | 2019-08-04 00:00:00-05:00 | 133 | 21 | 87 | [9, 2, 9, 1, 1, 6, 2, 4, 5, 1, 1, 31, 43, 6, 2... | 6.333333 |
| 3 | 0020aadf-097e-47e7-b66b-163eebae1b43 | 2019-05-31 00:00:00-05:00 | 2019-08-31 00:00:00-05:00 | 2353 | 84 | 92 | [21, 75, 28, 24, 13, 4, 10, 12, 5, 2, 29, 5, 1... | 28.011905 |
| 4 | 002f09fc-f6c9-4144-a8d6-e5caec2afacd | 2019-03-30 00:00:00-05:00 | 2019-08-31 00:00:00-05:00 | 364 | 71 | 154 | [2, 3, 2, 2, 1, 1, 3, 2, 1, 6, 3, 1, 2, 3, 31,... | 5.126761 |


### Filter dataset by users that pass quality control 
Write out to parquet file for downstream analysis. (Reorganized the files into folders with 2 months of data each.)

In [9]:
pq_dir = f'{data_dir}data/parquet/bogota_area_raw/'
pq_dirs_months = glob.glob(pq_dir + '*')
pq_dirs_months_names = [i.split(f'{pq_dir}')[1] for i in pq_dirs_months]
uids_pass_qc= list(user_stats_filtered['uid'])

for i in tqdm(range(0,len(pq_dirs_months)), desc=f'Writing data for users that pass qc'):
    print(f'Filtering data for {pq_dirs_months_names[i]}...')
    dataset = ds.dataset(pq_dirs_months[i], format="parquet")
    table = dataset.to_table(filter=ds.field('uid').isin(uids_pass_qc))
    # this causes the kernel to crash when I ran it on all the data so I need to rewrite it to not load everything in memory all at once
    data_for_qcd_users = f'{pq_dir}bogota_area_year=2019_{pq_dirs_months_names[i]}_pass_qc.parquet'
    pq.write_table(table, data_for_qcd_users)

Writing data for users that pass qc:   0%|          | 0/8 [00:00<?, ?it/s]

Filtering data for month=7...
Filtering data for month=8...
Filtering data for month=1...
Filtering data for month=6...
Filtering data for month=3...
Filtering data for month=4...
Filtering data for month=5...
Filtering data for month=2...


##### Let's look at the data for August of 2019 for example

In [10]:
data_for_qcd_users = f'{pq_dir}bogota_area_year=2019_month=8_pass_qc.parquet'
qc_user_data = dd.read_parquet(data_for_qcd_users)
qc_user_data.head()


In [None]:
print(len(qc_user_data))

We can also filter by particular queries downstream, which is so cool! 