# Overview

This project aims to redevelop an existing codebase used to analyze travel patterns through GPS data. GPS data typically contains observations from a large number of individuals (ranging from hundreds of thousands to millions) over extended periods (from weeks to months or even years). This data includes two critical pieces of information: geographical locations (latitude and longitude) where individual mobile devices are observed, and the corresponding times of these observations. Using these data points, we can infer individuals' mobility patterns, specifically when and where they travel from one place to another.

The GPS data is often collected from mobile apps installed on users' devices. Third-party companies partner with these apps to provide the data to researchers in an anonymized form. Typically, sensitive location information (e.g., home, hospital, religious places) is replaced with the centroid latitude and longitude of the corresponding census block group to ensure privacy.

With the GPS data, our code can infer which GPS traces can be grouped together to represent a "stay," where the device is stationary for a certain period. This is achieved through the `IncrementalClustering` function. A GPS trace that is not part of a stay indicates that the device is moving. Once the stays are identified, their durations can be calculated using the `UpdateStayDuration` function. Our code also includes a function called `AddressOscillation` to handle oscillation. Oscillation occurs when a sequence of consecutive points starts and ends at the same location but appears to jump between different locations within a short time frame (seconds), even though the device itself is not moving.
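To make the notion of a stay concrete, here is a minimal sketch. It is not the project's `IncrementalClustering` implementation: it assumes time-ordered points for a single device, measures distance from the first point of the current group rather than from a cluster centroid, and uses hypothetical names (`haversine_km`, `label_stays`) and assumed units (kilometers and seconds).

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def label_stays(points, max_km=0.2, min_seconds=300):
    """Label each (unix_time, lat, lon) point with a stay id, or -1 if moving.

    Hypothetical helper: consecutive points form a stay when they all lie
    within max_km of the group's first point and span at least min_seconds.
    """
    labels = [-1] * len(points)
    stay_id = 0
    start = 0  # index of the first point in the current group
    for i in range(1, len(points) + 1):
        at_end = i == len(points)
        # Close the group at the end of the data or when the next point is too far.
        if at_end or haversine_km(points[start][1], points[start][2],
                                  points[i][1], points[i][2]) > max_km:
            duration = points[i - 1][0] - points[start][0]
            if i - start >= 2 and duration >= min_seconds:
                for j in range(start, i):
                    labels[j] = stay_id
                stay_id += 1
            start = i
    return labels
```

Points labeled `-1` correspond to traces that are not part of any stay, which the text above interprets as movement.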

The codebase contains other functions as well; this demo presents only three of them, combined into one workflow. The ultimate goal is to redesign the functions so that users can create their own workflows by combining them in different ways to analyze GPS data, and so that each function can run either independently or as part of a pipeline.
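One way such composable workflows could look is sketched below. This is an illustration only: `drop_low_accuracy`, `sort_by_user_and_time`, and `run_workflow` are hypothetical placeholders, not functions from this codebase, and the sketch assumes every step takes a DataFrame and returns a DataFrame.

```python
import pandas as pd


def drop_low_accuracy(df, max_unc=1000):
    """Hypothetical step: keep points whose uncertainty is at most max_unc."""
    return df[df["orig_unc"] <= max_unc].copy()


def sort_by_user_and_time(df):
    """Hypothetical step: order points by user, then by timestamp."""
    return df.sort_values(["user_ID", "unix_start_t"]).reset_index(drop=True)


def run_workflow(df, steps):
    """Apply (function, keyword-arguments) pairs in order; each step can also run alone."""
    for func, kwargs in steps:
        df = func(df, **kwargs)
    return df


example = pd.DataFrame({
    "user_ID": ["b", "a", "a"],
    "unix_start_t": [3, 2, 1],
    "orig_unc": [100, 5000, 200],
})
result = run_workflow(example, [
    (drop_low_accuracy, {"max_unc": 1000}),
    (sort_by_user_and_time, {}),
])
```

Because each step has the same DataFrame-in, DataFrame-out shape, users could reorder steps or call any one of them directly.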


In [3]:
import pandas as pd
import numpy as np
import geopy
from IncrementalClustering import IC
from UpdateStayDuration import USD
from AddressOscillation import AO
import os

# Workflow 2 (intended for GPS data)

**Incremental Clustering -> Stay Duration Calculator -> Address Oscillation -> Stay Duration Calculator**



## Data

In [4]:
data = pd.read_csv('input_file.csv')
# The functions currently require an input file with 10 columns in this order:
# unix_start_t, user_ID, mark_1, orig_lat, orig_long, orig_unc, stay_lat, stay_long, stay_unc, stay_dur
data.head(5)

Unnamed: 0,unix_start_t,user_ID,mark_1,orig_lat,orig_long,orig_unc,stay_lat,stay_long,stay_unc,stay_dur,stay_ind,human_start_t
0,1552522827,f98247efb013243c91c247aa6c6e119e71404a698005d1...,0,47.651589,-122.327397,900,-1,-1,-1,-1,-1,190313162027
1,1552850582,f98247efb013243c91c247aa6c6e119e71404a698005d1...,0,47.999315,-122.221824,122,-1,-1,-1,-1,-1,190317122302
2,1552851399,f98247efb013243c91c247aa6c6e119e71404a698005d1...,0,47.999042,-122.221574,104,-1,-1,-1,-1,-1,190317123639
3,1553827223,f98247efb013243c91c247aa6c6e119e71404a698005d1...,0,47.590755,-122.330222,103,-1,-1,-1,-1,-1,190328194023
4,1553840035,f98247efb013243c91c247aa6c6e119e71404a698005d1...,0,47.998033,-122.221373,128,-1,-1,-1,-1,-1,190328231355


## Incremental Clustering

This function updates the `stay_lat`, `stay_long`, and `stay_unc` entries. Which entries change depends on the user-supplied parameters.

In [7]:
inputFile = 'input_file.csv'
outputFile = 'output_file.csv'
spatial_constraint = 0.2
# dur_constraint = 30
dur_constraint = 0

In [8]:
IC(inputFile,outputFile,spatial_constraint,dur_constraint)

total number of users to be processed:  5995
number of chunks to be processed 6
Start processing bulk:  6  at time:  0603-15:33  memory:  95.9
End reading


  super()._check_params_vs_input(X, default_n_init=10)

Start processing bulk:  5  at time:  0603-15:33  memory:  95.4
End reading



Start processing bulk:  4  at time:  0603-15:34  memory:  95.5
End reading



Start processing bulk:  3  at time:  0603-15:35  memory:  95.5
End reading



Start processing bulk:  2  at time:  0603-15:35  memory:  95.2
End reading



Start processing bulk:  1  at time:  0603-15:36  memory:  95.3
End reading



In [9]:
IC_output = pd.read_csv('output_file.csv')
IC_output.head(5)

Unnamed: 0,unix_start_t,user_ID,mark_1,orig_lat,orig_long,orig_unc,stay_lat,stay_long,stay_unc,stay_dur,stay_ind,human_start_t
0,1555555649,9ead9956e271893cb3ce97feefab0cbd053fcd392234e1...,0,47.438999,-122.303658,142,47.438999,-122.303658,142,-1,-1,190417224729
1,1556559744,c0cb0a8ec6aec9c2e283bdeb9159bcff76378c5bada84f...,0,47.407275,-122.30579,4235,47.407275,-122.30579,4235,-1,-1,190429104224
2,1555455106,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.623806,-122.346104,110,47.623806,-122.346104,110,-1,-1,190416155146
3,1555455712,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.627091,-122.343147,800,47.627091,-122.343147,800,-1,-1,190416160152
4,1555456226,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.58181,-122.321538,116,47.58181,-122.321538,116,-1,-1,190416161026


In [10]:
IC_output.head(20)

Unnamed: 0,unix_start_t,user_ID,mark_1,orig_lat,orig_long,orig_unc,stay_lat,stay_long,stay_unc,stay_dur,stay_ind,human_start_t
0,1555555649,9ead9956e271893cb3ce97feefab0cbd053fcd392234e1...,0,47.438999,-122.303658,142,47.438999,-122.303658,142,-1,-1,190417224729
1,1556559744,c0cb0a8ec6aec9c2e283bdeb9159bcff76378c5bada84f...,0,47.407275,-122.30579,4235,47.407275,-122.30579,4235,-1,-1,190429104224
2,1555455106,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.623806,-122.346104,110,47.623806,-122.346104,110,-1,-1,190416155146
3,1555455712,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.627091,-122.343147,800,47.627091,-122.343147,800,-1,-1,190416160152
4,1555456226,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.58181,-122.321538,116,47.58181,-122.321538,116,-1,-1,190416161026
5,1555456467,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.534921,-122.298934,1200,47.534921,-122.298934,1200,-1,-1,190416161427
6,1555457380,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.534921,-122.298934,1200,47.534921,-122.298934,1200,-1,-1,190416162940
7,1555458302,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.536775,-122.303856,10484,47.536775,-122.303856,10484,-1,-1,190416164502
8,1555459170,5904f791972bac0751e8188470d59c04121904647dc0ce...,0,47.536775,-122.303856,2613,47.536775,-122.303856,2613,-1,-1,190416165930
9,1574615877,28e9af579993c0ec171453f4f7ca16ae8b7c0d24d5e979...,0,47.319824,-122.174438,1160,47.319929,-122.174408,1160,-1,-1,191124091757


## Stay Duration Calculator

In [6]:
inputFile = 'output_file.csv'
outputFile = 'output_file.csv'
duration_constraint = 30

In [7]:
USD(inputFile,outputFile,duration_constraint)

total number of users to be processed:  5995
number of chunks to be processed 6
Start processing bulk:  6  at time:  0515-13:39  memory:  72.4
End reading
Start processing bulk:  5  at time:  0515-13:39  memory:  72.5
End reading
Start processing bulk:  4  at time:  0515-13:39  memory:  72.1
End reading
Start processing bulk:  3  at time:  0515-13:39  memory:  72.5
End reading
Start processing bulk:  2  at time:  0515-13:39  memory:  73.1
End reading
Start processing bulk:  1  at time:  0515-13:39  memory:  73.0
End reading


In [9]:
Stay_Output = pd.read_csv('output_file.csv')
Stay_Output.head(5)                       

Unnamed: 0,unix_start_t,user_ID,mark_1,orig_lat,orig_long,orig_unc,stay_lat,stay_long,stay_unc,stay_dur,stay_ind,human_start_t
0,1555340403,840f331133114f8fed39a2ab38990fc6d22549c68a1382...,1,47.445557,-122.302233,165,-1,-1,-1,-1,-1,190415080003
1,1555341027,840f331133114f8fed39a2ab38990fc6d22549c68a1382...,1,47.445526,-122.302143,165,-1,-1,-1,-1,-1,190415081027
2,1555341073,840f331133114f8fed39a2ab38990fc6d22549c68a1382...,1,47.446869,-122.300242,165,-1,-1,-1,-1,-1,190415081113
3,1555045747,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.444895,-122.314696,1100,-1,-1,-1,-1,-1,190411220907
4,1555046007,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.439099,-122.302854,567,-1,-1,-1,-1,-1,190411221327


## Address Oscillation

In [10]:
inputFile = 'output_file.csv'
outputFile = 'output_file.csv'
duration_constraint = 30

In [11]:
AO(inputFile,outputFile,duration_constraint)

total number of users to be processed:  5994
number of chunks to be processed 6
Start processing bulk:  6  at time:  0515-13:43  memory:  73.6
End reading
Start processing bulk:  5  at time:  0515-13:43  memory:  73.1
End reading
Start processing bulk:  4  at time:  0515-13:43  memory:  73.2
End reading
Start processing bulk:  3  at time:  0515-13:44  memory:  73.3
End reading
Start processing bulk:  2  at time:  0515-13:44  memory:  73.4
End reading
Start processing bulk:  1  at time:  0515-13:44  memory:  73.4
End reading


In [12]:
Osc_Output = pd.read_csv('output_file.csv')
Osc_Output.head()

Unnamed: 0,unix_start_t,user_ID,mark_1,orig_lat,orig_long,orig_unc,stay_lat,stay_long,stay_unc,stay_dur,stay_ind,human_start_t
0,1555045747,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.444895,-122.314696,1100,-1,-1,-1,-1,-1,190411220907
1,1555046007,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.439099,-122.302854,567,-1,-1,-1,-1,-1,190411221327
2,1555046558,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.446042,-122.303631,128,-1,-1,-1,-1,-1,190411222238
3,1555046849,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.45205,-122.299567,700,-1,-1,-1,-1,-1,190411222729
4,1555047864,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.444427,-122.299931,700,-1,-1,-1,-1,-1,190411224424


## Stay Duration Calculator

In [13]:
inputFile = 'output_file.csv'
outputFile = 'output_file.csv'
duration_constraint = 30

In [14]:
USD(inputFile,outputFile,duration_constraint)

total number of users to be processed:  5995
number of chunks to be processed 6
Start processing bulk:  6  at time:  0515-13:45  memory:  73.0
End reading
Start processing bulk:  5  at time:  0515-13:45  memory:  73.3
End reading
Start processing bulk:  4  at time:  0515-13:45  memory:  73.3
End reading
Start processing bulk:  3  at time:  0515-13:46  memory:  73.9
End reading
Start processing bulk:  2  at time:  0515-13:46  memory:  74.0
End reading
Start processing bulk:  1  at time:  0515-13:46  memory:  73.9
End reading


In [15]:
Stay_Output_2 = pd.read_csv('output_file.csv')
Stay_Output_2.head(5)                       

Unnamed: 0,unix_start_t,user_ID,mark_1,orig_lat,orig_long,orig_unc,stay_lat,stay_long,stay_unc,stay_dur,stay_ind,human_start_t
0,1555340403,840f331133114f8fed39a2ab38990fc6d22549c68a1382...,1,47.445557,-122.302233,165,-1,-1,-1,-1,-1,190415080003
1,1555341027,840f331133114f8fed39a2ab38990fc6d22549c68a1382...,1,47.445526,-122.302143,165,-1,-1,-1,-1,-1,190415081027
2,1555341073,840f331133114f8fed39a2ab38990fc6d22549c68a1382...,1,47.446869,-122.300242,165,-1,-1,-1,-1,-1,190415081113
3,1555045747,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.444895,-122.314696,1100,-1,-1,-1,-1,-1,190411220907
4,1555046007,bc31ba2f775b09af722863f1a8f6c55823dab62230776c...,0,47.439099,-122.302854,567,-1,-1,-1,-1,-1,190411221327


# Current Status and Future Goals of the Code

## Current Status

The existing functions for analyzing travel patterns through GPS data require an input file with 10 specific columns: `unix_start_t`, `user_ID`, `mark_1`, `orig_lat`, `orig_long`, `orig_unc`, `stay_lat`, `stay_long`, `stay_unc`, and `stay_dur`. The order of these columns is crucial because the functions access them by index, so any change in column order breaks them.
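A minimal sketch of name-based access, as an alternative to the positional access described above. The helper `normalize_columns`, the choice of required columns, and the placeholder defaults are assumptions for illustration; the defaults simply mirror the `0` and `-1` values seen in the sample input.

```python
import pandas as pd

# Column order listed in this notebook for the current functions.
CANONICAL = ["unix_start_t", "user_ID", "mark_1", "orig_lat", "orig_long",
             "orig_unc", "stay_lat", "stay_long", "stay_unc", "stay_dur"]
# Assumption: the minimum set of columns a user must supply.
REQUIRED = ["unix_start_t", "user_ID", "orig_lat", "orig_long", "orig_unc"]
# Assumption: placeholder values for columns the user did not supply.
DEFAULTS = {"mark_1": 0, "stay_lat": -1, "stay_long": -1, "stay_unc": -1, "stay_dur": -1}


def normalize_columns(df, column_map=None):
    """Rename user columns, fill missing optional ones, and return canonical order."""
    renamed = df.rename(columns=column_map or {})
    missing = [name for name in REQUIRED if name not in renamed.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for name, default in DEFAULTS.items():
        if name not in renamed.columns:
            renamed[name] = default
    return renamed[CANONICAL]


raw = pd.DataFrame({
    "lon": [-122.327397], "lat": [47.651589], "id": ["u1"],
    "unix_time": [1552522827], "acc": [900],
})
clean = normalize_columns(raw, {
    "id": "user_ID", "unix_time": "unix_start_t",
    "lat": "orig_lat", "lon": "orig_long", "acc": "orig_unc",
})
```

Downstream code can then refer to `clean["orig_lat"]` rather than a column position, so the input column order no longer matters.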

## Future Goals
During the meeting, we agreed that the first, second, and fifth points have higher priority.

1. **Flexibility in Input Files**:
   - **Current Limitation**: Functions depend heavily on the column order of the input file.
   - **Desired Improvement**: Make the functions robust to changes in column order and handle missing columns gracefully. For instance, the user might provide an input file with only `user_ID`, `unix_time`, `orig_lat`, `orig_long`, `orig_unc`, and with different column names. 

2. **Performance Optimization**:
   - **Current Limitation**: Performance issues arise with large datasets.
   - **Desired Improvement**: Optimize runtime by examining and enhancing the imported functions. Suggestions include reducing the use of append operations inside loops and collecting items in a list for a single concatenation operation.

3. **Additional Functionalities**:
   - **Visualization**:
     - **Current Status**: No source code for visualization.
     - **Desired Improvement**: Add visualizations such as trajectory maps, stay point locations on a map, and statistical summaries of stay durations. These outputs can change when a different workflow is deployed or when the same workflow is run with different parameters, such as distance and duration thresholds.
   - **Inferred Home Locations**: Develop functionality to infer home locations based on trajectories, possibly on the block group level.
   - **Inferred Mode of Travel**: Implement functionality to infer the mode of travel, at least for walking.
   - **Group Travel Detection**: Create a method to infer if trajectories represent a group of individuals traveling together.

4. **Software and Dependency Management**:
   - **Clarify Software Requirements**: Specify the Python version (e.g., Python>=3.8) and list all required packages in a `requirements.txt` file (e.g., `numpy>=2.0.0b1`, `geopy`, `scikit-learn`).
   - **Documentation**: Provide a problem statement, user guide with a sample case (see case study integration), and clarify input and output data in the documentation.

5. **Refactoring the Framework**:
   - **Modularization**: Structure the code into clear, reusable modules and functions. Each module should handle a specific part of the workflow.
   - **Efficiency Improvements**: Enhance computational efficiency by leveraging built-in C++ APIs, similar to the approach in the Python path4gmns project. Focus on identifying and optimizing the most time-consuming low-level functions.

6. **Case Study Integration**:
   - **Current Status**: Case study data is not integrated.
   - **Desired Improvement**: Integrate case study data into the source code branch and provide a README file to introduce the data and guide users on how to use it.
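The suggestion under Performance Optimization (collect results in a list and concatenate once, instead of appending inside a loop) can be sketched as follows. The helper `process_by_user` and the per-user `transform` callback are hypothetical; the actual functions process users in chunks, and their internals are not shown in this notebook.

```python
import pandas as pd


def process_by_user(df, transform):
    """Apply transform to each user's rows and concatenate the results once."""
    parts = []  # a plain Python list; appending to it is cheap
    for _, user_rows in df.groupby("user_ID", sort=False):
        parts.append(transform(user_rows))
    if not parts:
        return df.iloc[0:0].copy()
    # A single concatenation avoids re-copying the growing result on every iteration.
    return pd.concat(parts, ignore_index=True)


points = pd.DataFrame({
    "user_ID": ["a", "b", "a"],
    "unix_start_t": [10, 20, 30],
})
counted = process_by_user(points, lambda rows: rows.assign(n_points=len(rows)))
```

Growing a DataFrame inside the loop copies all accumulated rows on each iteration, so the list-then-concat pattern tends to scale better as the number of users grows.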

## Stretch Goals (by SSEC):
Achieved alongside the refactoring of the old codebase (not part of the initial requirements).

1. **Added CI/CD Integration**:
   - Integrated GitHub Actions and pre-commit checks to run tests automatically and support continuous deployment of changes.


2. **Support For Command Line Integration**:
   - Added support for running the workflows from the command line, including optional arguments with default values and help text. The old code had no command-line support.

3. **Data Validation**:
   - Added an input validation layer that rejects invalid input values, such as negative numbers or characters passed for numeric arguments. The old code had no such validation.


4. **Testing Layer**:
   - Added unit tests and end-to-end workflow tests. The old codebase had no tests for any workflow or for any individual function (IC, TSC, USD, AO).
   - Integrated automated testing with the CI pipeline.


5. **IO Layer**:
   - Added an IO layer that accepts input in any file format that can be loaded into a pandas DataFrame (e.g., .xlsx, .csv).
   - Easily extensible to other formats that pandas can read.
   - No IO layer existed previously. (Added to the requirements after the initial SOW.)


6. **Suggested Algorithmic Improvement for TSC**:
   - Proposed an algorithmic change with the potential to make every workflow that includes Trace Segmentation more efficient.
   - Communicated the change clearly, with a detailed walkthrough of the old algorithm versus the new one, in meetings and in the GitHub Discussion.
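The command-line and data-validation items above could be combined roughly as in the sketch below. It uses Python's standard `argparse` module; the argument names, defaults, and validator functions are assumptions for illustration and are not taken from the refactored code.

```python
import argparse


def positive_float(text):
    """Validator: accept only numbers greater than zero."""
    try:
        value = float(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not a number")
    if value <= 0:
        raise argparse.ArgumentTypeError(f"{text!r} must be greater than zero")
    return value


def non_negative_int(text):
    """Validator: accept only whole numbers that are zero or larger."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not a whole number")
    if value < 0:
        raise argparse.ArgumentTypeError(f"{text!r} must not be negative")
    return value


def build_parser():
    """Build a parser with hypothetical argument names and defaults."""
    parser = argparse.ArgumentParser(description="Run a GPS workflow (illustrative).")
    parser.add_argument("input_file", help="path to the input data file")
    parser.add_argument("output_file", help="path for the output data file")
    parser.add_argument("--spatial-constraint", type=positive_float, default=0.2,
                        help="spatial threshold for clustering (default: 0.2)")
    parser.add_argument("--duration-constraint", type=non_negative_int, default=0,
                        help="minimum stay duration (default: 0)")
    return parser


args = build_parser().parse_args(["input_file.csv", "output_file.csv",
                                  "--spatial-constraint", "0.5"])
```

Passing a validator as `type=` lets `argparse` report bad values with a usage message before any workflow step runs.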

