## **Advanced Analyitcs and Applications - Data Collection Strategy**

This Jupyter notebook `01_Data_Preparation.ipynb` is the first step in the "Advanced Analytics and Application" team project. Its primary purpose is to handle the initial data collection and preparation tasks. This includes:

* Setting up the Python environment and importing necessary libraries.
* Loading project configurations, including API keys and file paths.
* Fetching taxi trip data from the City of Chicago data portal using a custom API client `src/api/taxi.py`.
* Fetching weather data for Chicago using a custom API client `src/api/weather.py` for a relevant period.
* Storing the raw fetched data for subsequent processing and analysis.

The data collected here will form the basis for descriptive analytics, predictive modeling, and reinforcement learning tasks outlined in the project assignment.

##### **Table of Contents**

0. [Notebook Setup](#Notebook-Set-Up-and-Imports)
1. [Data Collection](#Data-Collection)
1.1. [Taxi Data Collection](#Taxi-Data)
1.2. [Weather Data Collection](#Weather-Data)
3. [References](#References)

##### **Notebook Set Up and Imports**

In [1]:
%%html
<style>
.dataframe th {
    font-family: "JetBrainsMono Nerd Font";
}
.dataframe td {
    font-family: "JetBrainsMono Nerd Font";
}
</style>

In [2]:
import importlib
import os
import pickle
import subprocess
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import yaml

In [3]:
sys.path.append(str(Path.cwd().parent))
from src.utils.notebook_setup import load_files, setup_notebook

style_manager = setup_notebook()

if str(Path().resolve()).split("/")[-1] == "AAA":
    print("already set repo root")
else:
    notebooks_dir = Path().resolve()
    repo_root = notebooks_dir.parent
    config_dir = repo_root / "config"
    data_dir = repo_root / "data"
    results_dir = data_dir / "results"
    raw_data_dir = data_dir / "raw"
    processed_data_dir = data_dir / "processed"

    with open(config_dir / "config.yaml", "r") as file:
        config = yaml.safe_load(file)
    
    os.chdir(repo_root)

## Data Collection
[Back to Table of Contents](#Table-of-Contents)

### Taxi Data

This section focuses on acquiring taxi trip data from the City of Chicago's Socrata open data portal. This is achieved using a custom-built API client.

**API Client Initialization**:

The ChicagoTaxiAPI class, defined in `src.api.taxi.py`. An instance of the ChicagoTaxiAPI is created. The necessary Socrata App Token for accessing the Chicago Taxi dataset is retrieved from the previously loaded config object (specifically from `config["CHICAGO_TAXI"].get("APP_TOKEN", None)`). Using an `APP_TOKEN` is crucial for avoiding strict rate limits imposed on unauthenticated API requests.

In [35]:
from src.api.taxi import ChicagoTaxiAPI

api = ChicagoTaxiAPI(
    app_token=config["CHICAGO_TAXI"].get("APP_TOKEN", None)
)

df_sample = api.fetch_data(
    select = (
        "trip_id, taxi_id, trip_start_timestamp, trip_end_timestamp, trip_seconds, "
        "trip_miles, pickup_census_tract, dropoff_census_tract, pickup_community_area, "
        "dropoff_community_area, fare, tips, tolls, extras, trip_total, payment_type, "
        "company, pickup_centroid_location, dropoff_centroid_location"
    ),
    where=(
        "pickup_centroid_location IS NOT NULL "
        "AND dropoff_centroid_location IS NOT NULL "
        "AND trip_start_timestamp IS NOT NULL "
        "AND trip_start_timestamp >= '2025-05-01T00:00:00' "
        "AND trip_start_timestamp <= '2025-05-02T00:00:00' "
    ),
    order="trip_start_timestamp DESC",
    limit=300_000,
)

api.close()

In [48]:
import folium
import geopandas as gpd
from shapely.geometry import LineString
from IPython.display import display

# Rebuild geometry
df_sample["geometry"] = df_sample.apply(
    lambda row: LineString([
        row["pickup_centroid_location"]["coordinates"],
        row["dropoff_centroid_location"]["coordinates"]
    ]) if row["pickup_centroid_location"] and row["dropoff_centroid_location"] else None,
    axis=1
)
gdf = gpd.GeoDataFrame(df_sample, geometry="geometry", crs="EPSG:4326").dropna(subset=["geometry"])

# Center map on chicago
m = folium.Map(location=[41.8781, -87.6298], zoom_start=11)

# Add trip lines
for line in gdf["geometry"]:
    folium.PolyLine(
        locations=[(lat, lon) for lon, lat in line.coords],
        color="blue", weight=2, opacity=0.5
    ).add_to(m)

# Add pickup and dropoff markers
for _, row in gdf.iterrows():
    pickup_coords = row["pickup_centroid_location"]["coordinates"]
    dropoff_coords = row["dropoff_centroid_location"]["coordinates"]
    
    folium.CircleMarker(
        location=(pickup_coords[1], pickup_coords[0]),  # lat, lon
        radius=4,
        color="green",
        fill=True,
        fill_color="green",
        fill_opacity=0.7,
        popup=f"Pickup: {row['trip_start_timestamp']}"
    ).add_to(m)
    
    folium.CircleMarker(
        location=(dropoff_coords[1], dropoff_coords[0]),  # lat, lon
        radius=4,
        color="red",
        fill=True,
        fill_color="red",
        fill_opacity=0.7,
        popup=f"Dropoff: {row['trip_end_timestamp']}"
    ).add_to(m)

# Show map in notebook
display(m)

In [None]:
df_batch_sample = api.fetch_batch_data(
    select = (
        "trip_id, taxi_id, trip_start_timestamp, trip_end_timestamp, trip_seconds, "
        "trip_miles, pickup_census_tract, dropoff_census_tract, pickup_community_area, "
        "dropoff_community_area, fare, tips, tolls, extras, trip_total, payment_type, "
        "company, pickup_centroid_location, dropoff_centroid_location"
    ),
    where=(
        "pickup_centroid_location IS NOT NULL "
        "AND dropoff_centroid_location IS NOT NULL "
        "AND trip_start_timestamp >= trip_end_timestamp "
        "AND trip_start_timestamp IS NOT NULL"
    ), 
    output_dir=raw_data_dir
)

In [19]:
df_batch_sample.tail()

Unnamed: 0,trip_id,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_centroid_location,dropoff_centroid_location,pickup_census_tract,dropoff_census_tract
1917884,8872040b5f9d93421781e46111fd4ed5c5be4e25,182738fe701aff980860adbe8c50444f6edfb7adcf06a9...,2024-01-01,2024-01-01,366.0,0.45,8.0,8.0,5.25,0.0,0.0,1.0,6.75,Credit Card,Star North Taxi Management Llc,"{'type': 'Point', 'coordinates': [-87.63330803...","{'type': 'Point', 'coordinates': [-87.63330803...",,
1917885,19167920814770e4d57260adef492b658415e6a1,13c0599d1bce4a6239d30c3feeba903749ab197df436dc...,2024-01-01,2024-01-01,488.0,1.95,32.0,33.0,8.5,0.0,0.0,1.0,9.5,Cash,Patriot Taxi Dba Peace Taxi Associat,"{'type': 'Point', 'coordinates': [-87.62519214...","{'type': 'Point', 'coordinates': [-87.62033462...",,
1917886,054bbff2c120ec8b40dbe90b6b15cf52847d46fd,6c1e4e8e25a1b47575b359c5a0844cf23c50e540a86ecd...,2024-01-01,2024-01-01,392.0,1.03,8.0,7.0,6.25,0.0,0.0,0.0,6.25,Cash,Flash Cab,"{'type': 'Point', 'coordinates': [-87.63330803...","{'type': 'Point', 'coordinates': [-87.64948872...",,
1917887,acb5dce9434bc8dc7fd511f0210c237b2bdab101,04c44d1bf8cc741f86f8ccdee5b64b65e5a6631d743450...,2024-01-01,2024-01-01,283.0,1.27,8.0,24.0,6.25,3.0,0.0,0.0,9.75,Credit Card,Flash Cab,"{'type': 'Point', 'coordinates': [-87.63330803...","{'type': 'Point', 'coordinates': [-87.67635598...",,
1917888,67cbf4af40b12db55b3a3e4efa09f358288c0cf4,57c40509cae37a0e5e536a657cdb7f8c6824314bc466a7...,2024-01-01,2024-01-01,0.0,0.0,7.0,7.0,3.25,0.0,0.0,7.0,10.25,Cash,Taxi Affiliation Services,"{'type': 'Point', 'coordinates': [-87.64948872...","{'type': 'Point', 'coordinates': [-87.64948872...",,


In [27]:
from src.api.weather import ChicagoWeatherAPI
from datetime import date

weather_api = ChicagoWeatherAPI()


today = date.today()
seven_days_ago = today - pd.Timedelta(days=7)
two_days_ago = today - pd.Timedelta(days=2)

historical_df = weather_api.get_historical_weather(
    start_date=seven_days_ago.strftime("%Y-%m-%d"),
    end_date=two_days_ago.strftime("%Y-%m-%d"),
    hourly_vars=["temperature_2m", "precipitation", "weather_code", "wind_speed_10m"],
    daily_vars=["temperature_2m_max", "temperature_2m_min", "precipitation_sum"] # request both
)

forecast_df = weather_api.get_forecast_weather(
    days=3,
    hourly_vars=["temperature_2m", "apparent_temperature", "precipitation_probability"],
    daily_vars=["sunrise", "sunset", "uv_index_max"]
)

weather_api.close()

In [28]:
historical_df

Unnamed: 0_level_0,temperature_2m,precipitation,weather_code,wind_speed_10m
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2025-05-15 00:00:00,13.7,0.0,3.0,5.4
2025-05-15 01:00:00,14.1,0.0,3.0,5.8
2025-05-15 02:00:00,15.1,0.0,2.0,7.0
2025-05-15 03:00:00,16.3,0.0,0.0,8.5
2025-05-15 04:00:00,17.2,0.0,0.0,9.0
...,...,...,...,...
2025-05-20 19:00:00,12.4,0.0,3.0,20.4
2025-05-20 20:00:00,,,,
2025-05-20 21:00:00,,,,
2025-05-20 22:00:00,,,,


In [24]:
forecast_df

Unnamed: 0_level_0,temperature_2m,apparent_temperature,precipitation_probability
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2025-05-22 00:00:00,8.1,5.9,30
2025-05-22 01:00:00,8.2,5.4,28
2025-05-22 02:00:00,8.0,5.3,15
2025-05-22 03:00:00,8.1,5.3,17
2025-05-22 04:00:00,8.3,6.1,13
...,...,...,...
2025-05-24 19:00:00,9.7,6.2,1
2025-05-24 20:00:00,8.9,5.7,4
2025-05-24 21:00:00,8.4,5.6,4
2025-05-24 22:00:00,8.2,5.7,4
