<a href="https://colab.research.google.com/github/sachins301/UTA-Distributed-Computing/blob/main/UTA%20Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multithreaded Data Collection


## Package Installations
Install xmltodict, which converts XML documents into Python dictionaries.

In [1]:
!pip install xmltodict

Collecting xmltodict
  Downloading xmltodict-0.13.0-py2.py3-none-any.whl.metadata (7.7 kB)
Downloading xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.13.0


## Imports

In [2]:
import concurrent.futures
import json
import logging
import os
import time
from datetime import datetime

import requests
import xmltodict


Import the Colab userdata secrets helper and mount Google Drive in the runtime session.

In [3]:
from google.colab import userdata
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


Set up a logger whose format includes the thread name, so that each log line identifies the thread that wrote it.

In [4]:
# Create a custom logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

# Create handlers
# c_handler = logging.StreamHandler()
f_handler = logging.FileHandler('/content/utaproject.log', mode='w')
# c_handler.setLevel(logging.DEBUG)
f_handler.setLevel(logging.DEBUG)

# Create formatters and add it to handlers
# c_format = logging.Formatter('%(asctime)s - %(threadName)s - %(name)s - %(levelname)s - %(message)s')
f_format = logging.Formatter('%(asctime)s - %(threadName)s - %(name)s - %(levelname)s - %(message)s')
# c_handler.setFormatter(c_format)
f_handler.setFormatter(f_format)

# Add handlers to the logger
# logger.addHandler(c_handler)
logger.addHandler(f_handler)

# Correctly log messages
logger.info('Logging started')


INFO:__main__:Logging started
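
As a minimal sketch of what the threadName field adds, the example below runs two tasks in a small thread pool and captures the log lines in memory. The logger name thread_name_demo, the poller prefix, and the work function are illustrative assumptions and are not part of this notebook.

```python
import io
import logging
from concurrent.futures import ThreadPoolExecutor

# Log to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(threadName)s - %(levelname)s - %(message)s"))

demo_logger = logging.getLogger("thread_name_demo")  # hypothetical logger name
demo_logger.setLevel(logging.DEBUG)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo out of the root logger's output

def work(item):
    # Each record carries the name of the thread that emitted it.
    demo_logger.info("handling %s", item)
    return item

# thread_name_prefix makes pool threads easy to spot in the log.
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="poller") as pool:
    results = list(pool.map(work, ["4", "455"]))

log_lines = buffer.getvalue().splitlines()
print("\n".join(log_lines))
```

Each printed line starts with the name of a pool thread, which is what makes interleaved log output from several routes traceable.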


## API Connection
Connect to the UTA API endpoint using the requests package.
Parse the XML response into a JSON-compatible dictionary using xmltodict and the json library. The user token for the API is stored in Google Colab secrets. It is also necessary to handle exceptions properly so that the worker threads keep running.

In [5]:
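def retrieve_data(route_id, onward_calls = False):
  token = userdata.get('TOKEN')
  url = f"http://api.rideuta.com/SIRI/SIRI.svc/VehicleMonitor/ByRoute?route={route_id}&onwardcalls={onward_calls}&usertoken={token}"
  try:
    # A timeout keeps a stalled connection from blocking the worker thread
    # forever (30 seconds is an arbitrary choice).
    response = requests.get(url, timeout=30)
  except requests.RequestException as e:
    # Log only the exception type: the full message may contain the URL and token.
    logger.error(f"Request failed for route_id {route_id}: {type(e).__name__}")
    return None
  if response.status_code != 200:
    logger.error(f"Failed to retrieve data. Status code: {response.status_code}")
    return None
  logger.info(f"Data retrieved for route_id: {route_id}")
  # Parse the XML payload, then round-trip through json to get plain dicts and lists.
  xml_data = xmltodict.parse(response.text)
  json_data = json.loads(json.dumps(xml_data))
  return json_data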
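
The point about handling exceptions so that threads keep running can be sketched with a small wrapper that logs a failure and returns a default value instead of raising. The names call_safely and flaky_fetch are hypothetical helpers used only for illustration; flaky_fetch simulates a network call and does not contact the UTA API.

```python
import logging

log = logging.getLogger("safe_call_demo")  # hypothetical logger name

def call_safely(func, *args, default=None, **kwargs):
    """Run func and return its result; log and return `default` on any exception."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception records the traceback, so the failure is not silent.
        log.exception("Call to %s failed", getattr(func, "__name__", func))
        return default

def flaky_fetch(route_id):
    # Stand-in for a network call; fails for one made-up route id.
    if route_id == "bad":
        raise ConnectionError("simulated network failure")
    return {"route": route_id}

ok = call_safely(flaky_fetch, "4")
failed = call_safely(flaky_fetch, "bad")
print(ok, failed)
```

Returning None on failure matches the convention retrieve_data already uses for a non-200 response, so the caller needs only one check.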

Create a unique file name for each JSON file. File names have to be unique to avoid collisions between threads. A timestamp is one of the easiest and most robust ways to create a unique ID within a local system. We use the compact epoch time format.

In [6]:
def convert_to_epoch(timestamp):
    date_part, rest = timestamp.split('.')
    microseconds_part = rest[:-6]  # Extract microseconds part
    timezone_part = rest[-6:]      # Extract timezone part
    microseconds_part = microseconds_part[:6]  # Take only the first 6 digits for microseconds
    timestamp = date_part + '.' + microseconds_part + timezone_part
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    epoch_time = int(dt.timestamp())

    return str(epoch_time)
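
The function above assumes the timestamp always has a fractional-seconds part followed by a six-character UTC offset. As a hedged alternative sketch, the hypothetical helper to_epoch_seconds below trims the fraction with a regular expression instead of slicing, and also accepts timestamps without a fraction. The sample timestamps are made up to match the format the function assumes and are not taken from a real API response.

```python
import re
from datetime import datetime

def to_epoch_seconds(timestamp: str) -> str:
    """Convert an ISO-8601 timestamp with a UTC offset to epoch seconds (as a string)."""
    # %f accepts at most 6 fractional digits, so drop any digits beyond that.
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", timestamp)
    # Pick the format depending on whether a fractional part is present.
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z" if "." in trimmed else "%Y-%m-%dT%H:%M:%S%z"
    return str(int(datetime.strptime(trimmed, fmt).timestamp()))

# Made-up sample values: 7 fractional digits, and no fractional part at all.
with_fraction = to_epoch_seconds("2024-08-12T08:49:44.1234567-06:00")
without_fraction = to_epoch_seconds("2024-08-12T08:49:44-06:00")
print(with_fraction, without_fraction)  # both print 1723474184
```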

Save the JSON file using the combination of route ID and epoch time to avoid collisions.

In [7]:
def save_data(route_id, json_data):
  file_name = route_id + "-" + convert_to_epoch(json_data["Siri"]["ResponseTimestamp"])  + ".json"
  # Create the output folder if it does not exist yet; open() would otherwise fail.
  os.makedirs('/content/json_dumps', exist_ok=True)
  with open('/content/json_dumps/'+file_name, 'w', encoding='utf-8') as f:
    json.dump(json_data, f)
  logger.info(f"Data saved to {file_name}")
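
The naming scheme can be tried in isolation. The sketch below writes two payloads that share a timestamp but belong to different routes, using a temporary directory instead of /content/json_dumps. The helper save_json and the payloads are hypothetical and only illustrate the idea.

```python
import json
import tempfile
from pathlib import Path

def save_json(directory, route_id, epoch, payload):
    """Write payload to <directory>/<route_id>-<epoch>.json and return the path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    path = target_dir / f"{route_id}-{epoch}.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(payload, f)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # Same timestamp, different routes: the route id keeps the names distinct.
    first = save_json(Path(tmp) / "json_dumps", "4", "1723474184", {"Siri": {}})
    second = save_json(Path(tmp) / "json_dumps", "455", "1723474184", {"Siri": {}})
    names = sorted(p.name for p in (first, second))
    reloaded = json.loads(first.read_text(encoding="utf-8"))

print(names)
```

Because each route is polled by a single thread every few seconds, the pair of route ID and epoch second is enough to keep concurrent writers from targeting the same file.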

## Task Function for Asynchronous Execution Using ThreadPoolExecutor
Define a task function that the threads in a ThreadPoolExecutor execute asynchronously. Each call handles one item of route_id_list, the list of route IDs to poll concurrently.
### process_route Function:
This function processes a single route_id. It logs the start of processing, then loops indefinitely: it retrieves the latest data for the route, saves it if the request succeeded, and sleeps for 5 seconds before polling again.
### Error Handling:
A failed request is logged by retrieve_data and skipped, so the loop simply moves on to the next poll instead of stopping.

In [8]:
route_id_list = ['4', '455', '1', '2', '220']
def process_route(route_id):
  logger.info(f"Processing route_id: {route_id}")
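  while True:
    try:
      json_data = retrieve_data(route_id)
      if json_data:
        logger.info(f"Saving data for route_id: {route_id}")
        save_data(route_id, json_data)
    except Exception:
      # Log the traceback and keep polling; an uncaught exception would end this task silently.
      logger.exception(f"Error while processing route_id: {route_id}")
    time.sleep(5)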
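
The loop above has no stop condition. One possible variation, shown here as a sketch and not as the notebook's method, is to poll until a threading.Event is set. The names poll_until_stopped, fetch, and handle are hypothetical, and the short interval only keeps the demonstration fast.

```python
import threading
import time

def poll_until_stopped(fetch, handle, stop_event, interval=5.0):
    """Call fetch() and handle(result) repeatedly until stop_event is set."""
    polls = 0
    while not stop_event.is_set():
        try:
            result = fetch()
            if result is not None:
                handle(result)
        except Exception as exc:
            # A failed poll is reported and skipped; the loop keeps going.
            print(f"poll failed: {exc!r}")
        polls += 1
        # wait() pauses like time.sleep() but returns early once the event is set.
        stop_event.wait(interval)
    return polls

stop = threading.Event()
collected = []
worker = threading.Thread(
    target=poll_until_stopped,
    args=(lambda: {"tick": len(collected)}, collected.append, stop),
    kwargs={"interval": 0.01},
)
worker.start()
time.sleep(0.1)      # let the worker poll a few times
stop.set()           # ask the worker to finish
worker.join(timeout=2)
print(f"collected {len(collected)} results, worker alive: {worker.is_alive()}")
```

Using stop_event.wait(interval) instead of time.sleep(interval) means a shutdown request takes effect immediately instead of after the remaining sleep time.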

## ThreadPoolExecutor
ThreadPoolExecutor is a high-level interface from Python's concurrent.futures module for running tasks concurrently on a pool of threads. It handles creating, reusing, and shutting down the threads, so tasks can be run concurrently without managing each thread by hand.

### Code Walkthrough
In the code below, a ThreadPoolExecutor is instantiated without specifying the max_workers parameter, so it defaults to min(32, os.cpu_count() + 4). This means the number of threads in the pool will be either 32 or the number of available CPU cores plus 4, whichever is smaller. The pool runs the process_route function asynchronously on each item in route_id_list, so multiple routes are processed concurrently. Because each process_route call runs indefinitely, every route occupies one worker for the life of the program, so the pool needs at least as many workers as there are routes (five here).

The executor.map() function submits one process_route call per item in route_id_list, and the calls run concurrently on the threads in the pool. Threads suit this workload because it is I/O-bound: each task spends most of its time waiting on the network request, writing a file, or sleeping, so the threads rarely compete for the CPU.

The try-except block wraps an infinite loop with a time.sleep(1) call, which is meant to keep the main thread alive while the worker threads run. On a KeyboardInterrupt, the program logs a shutdown message. The process_route tasks themselves never return, so the worker threads keep polling until the runtime is stopped or restarted.

This structure suits workloads made of independent, mostly I/O-bound tasks, such as polling several network endpoints at once. The ThreadPoolExecutor handles thread creation and reuse, which removes most of the boilerplate of multithreading in Python.

In [None]:

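# The default number of max workers is min(32, os.cpu_count() + 4).
# No `with` block here: leaving it would wait for the tasks, which never return,
# so the loop below would never be reached.
executor = concurrent.futures.ThreadPoolExecutor()
executor.map(process_route, route_id_list)

try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    logger.info("Shutting down...")
    # Release the executor without waiting; the polling loops have no stop
    # condition, so restart the runtime to end the worker threads.
    executor.shutdown(wait=False)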
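
For comparison, here is a self-contained sketch of the same pattern with a cooperative shutdown. It assumes a shared threading.Event that each task checks between polls. The function poll_route is a stand-in for process_route that only counts its polls; it is hypothetical and makes no network requests.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

stop_event = threading.Event()

def poll_route(route_id):
    """Stand-in for process_route: counts polls until asked to stop."""
    polls = 0
    while not stop_event.is_set():
        polls += 1
        stop_event.wait(0.01)  # short interval so the demo finishes quickly
    return route_id, polls

routes = ["4", "455", "1"]
# One worker per route: each task runs until stopped, so a smaller pool
# would leave the remaining routes queued and never started.
with ThreadPoolExecutor(max_workers=len(routes)) as pool:
    futures = [pool.submit(poll_route, r) for r in routes]
    try:
        stop_event.wait(0.1)   # main thread idles here; Ctrl+C would land in except
    except KeyboardInterrupt:
        pass
    finally:
        stop_event.set()       # ask every worker to finish its current cycle
    finished = [f.result(timeout=2) for f in futures]

print(finished)
```

Because the tasks return once the event is set, leaving the with block no longer waits forever, and f.result() also surfaces any exception a task raised.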

In [None]:
# Default number of workers: available CPUs + 4, capped at 32
print(min(32, os.cpu_count() + 4))

6


In [None]:
# Zip the json_dumps folder
import shutil
shutil.make_archive('json_dumps', 'zip', '/content/json_dumps')

'/content/json_dumps.zip'

In [None]:
# Copy the archive to Google Drive
!cp /content/json_dumps.zip /content/drive/MyDrive/Colab\ Notebooks/

INFO:__main__:Data retrieved for route_id: 2
INFO:__main__:Saving data for route_id: 2
INFO:__main__:Data saved to 2-1723474184.json


In [None]:
# Open a saved JSON file; file_name must be set to the name of a JSON file in this folder
with open('/content/drive/MyDrive/Colab Notebooks/'+file_name) as f:
    data = json.load(f)