Table of Contents:
1. Import libraries & data.
2. Kepler.gl map: Trip flows between stations.  

In [1]:
# Import libraries

import pandas as pd
from keplergl import KeplerGl
import json
from pathlib import Path



In [2]:
# Set path
PROJECT_DIR = Path.cwd()
DATA_DIR = PROJECT_DIR / "data" / "processed"
csv_path = DATA_DIR / "citibike_2022_with_weather.csv"

In [12]:
# Rubric aggregation: count trips per (start, end) station pair

use_cols_min = ["start_station_name", "end_station_name"]

df_min = pd.read_csv(csv_path, usecols=use_cols_min)

# Create the required "1" column (a constant, so summing it counts trips)
df_min["trip"] = 1

# Aggregate to the required 3 columns
df_agg = (
    df_min.groupby(
        ["start_station_name", "end_station_name"],
        as_index=False
    )["trip"]
    .sum()
    .rename(columns={"trip": "trips"})
)

df_agg.head()

,start_station_name,end_station_name,trips
0,1 Ave & E 110 St,1 Ave & E 110 St,791
1,1 Ave & E 110 St,1 Ave & E 18 St,2
2,1 Ave & E 110 St,1 Ave & E 30 St,4
3,1 Ave & E 110 St,1 Ave & E 39 St,1
4,1 Ave & E 110 St,1 Ave & E 44 St,12


In [8]:
# Clean aggregation + coordinates (memory-safe)

use_cols = [
    "start_station_name", "end_station_name",
    "start_lat", "start_lng", "end_lat", "end_lng"
]

CHUNKSIZE = 200_000

pair_counts = {}          # (start_name, end_name) -> trips
start_coords = {}         # start_name -> (lat, lng)
end_coords = {}           # end_name   -> (lat, lng)

for chunk in pd.read_csv(csv_path, usecols=use_cols, chunksize=CHUNKSIZE):
    # drop rows missing a station name or a coordinate
    chunk = chunk.dropna(subset=use_cols)

    # reduce tiny coordinate variation
    chunk["start_lat"] = chunk["start_lat"].round(5)
    chunk["start_lng"] = chunk["start_lng"].round(5)
    chunk["end_lat"]   = chunk["end_lat"].round(5)
    chunk["end_lng"]   = chunk["end_lng"].round(5)

    # store one representative coordinate per station name (first seen)
    for s, lat, lng in zip(chunk["start_station_name"], chunk["start_lat"], chunk["start_lng"]):
        start_coords.setdefault(s, (lat, lng))
    for e, lat, lng in zip(chunk["end_station_name"], chunk["end_lat"], chunk["end_lng"]):
        end_coords.setdefault(e, (lat, lng))

    # count trips per (start, end)
    grouped = chunk.groupby(["start_station_name", "end_station_name"]).size()

    for (s, e), n in grouped.items():
        pair_counts[(s, e)] = pair_counts.get((s, e), 0) + int(n)

# build aggregated df
df_trips = (
    pd.DataFrame(
        [(s, e, trips) for (s, e), trips in pair_counts.items()],
        columns=["start_station_name", "end_station_name", "trips"]
    )
)

# add coordinates back
start_df = pd.DataFrame(
    [(k, v[0], v[1]) for k, v in start_coords.items()],
    columns=["start_station_name", "start_lat", "start_lng"]
)
end_df = pd.DataFrame(
    [(k, v[0], v[1]) for k, v in end_coords.items()],
    columns=["end_station_name", "end_lat", "end_lng"]
)

df_trips = df_trips.merge(start_df, on="start_station_name", how="left").merge(end_df, on="end_station_name", how="left")

In [9]:
# Quick check: top 10 station pairs by trip count
df_trips.sort_values("trips", ascending=False).head(10)

,start_station_name,end_station_name,trips,start_lat,start_lng,end_lat,end_lng
108994,Central Park S & 6 Ave,Central Park S & 6 Ave,12041,40.76597,-73.97651,40.76591,-73.97634
239258,7 Ave & Central Park South,7 Ave & Central Park South,8541,40.76674,-73.97907,40.76674,-73.97907
194183,Roosevelt Island Tramway,Roosevelt Island Tramway,8213,40.75728,-73.9536,40.75728,-73.9536
16914,Grand Army Plaza & Central Park S,Grand Army Plaza & Central Park S,7287,40.7644,-73.97371,40.7644,-73.97371
295751,Soissons Landing,Soissons Landing,7275,40.69232,-74.01487,40.69232,-74.01487
389220,W 21 St & 6 Ave,9 Ave & W 22 St,6345,40.74174,-73.99416,40.7455,-74.00197
367338,5 Ave & E 72 St,5 Ave & E 72 St,6037,40.77283,-73.96685,40.77283,-73.96685
364577,1 Ave & E 62 St,1 Ave & E 68 St,5826,40.76116,-73.96042,40.76501,-73.95818
64277,Yankee Ferry Terminal,Yankee Ferry Terminal,5759,40.68707,-74.01676,40.68707,-74.01676
310981,Broadway & W 58 St,Broadway & W 58 St,5509,40.76643,-73.98195,40.76695,-73.98169


In [10]:
# Drop self-loops
df_trips = df_trips[
    df_trips.start_station_name != df_trips.end_station_name
]

In [11]:
# Quick check: top 10 station pairs after removing self-loops
df_trips.sort_values("trips", ascending=False).head(10)

,start_station_name,end_station_name,trips,start_lat,start_lng,end_lat,end_lng
389220,W 21 St & 6 Ave,9 Ave & W 22 St,6345,40.74174,-73.99416,40.7455,-74.00197
364577,1 Ave & E 62 St,1 Ave & E 68 St,5826,40.76116,-73.96042,40.76501,-73.95818
261037,Norfolk St & Broome St,Henry St & Grand St,4883,40.71723,-73.98802,40.71421,-73.9811
443109,West St & Chambers St,Pier 40 - Hudson River Park,4584,40.71763,-74.01322,40.72771,-74.0113
303202,Yankee Ferry Terminal,Soissons Landing,4556,40.68707,-74.01676,40.69232,-74.01487
123137,North Moore St & Greenwich St,Vesey St & Church St,4523,40.7202,-74.0103,40.71222,-74.01047
128697,W 21 St & 6 Ave,W 22 St & 10 Ave,4410,40.74174,-73.99416,40.74692,-74.00452
220765,Henry St & Grand St,Norfolk St & Broome St,4324,40.71421,-73.9811,40.71723,-73.98802
56592,Soissons Landing,Yankee Ferry Terminal,4299,40.69232,-74.01487,40.68707,-74.01676
505813,Pier 40 - Hudson River Park,West St & Chambers St,4222,40.72771,-74.0113,40.71755,-74.01322


# Sanity checks & data cleanup explanation

- Aggregated trips by (start_station_name, end_station_name) using chunked reads to stay memory-safe.

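- Rounded station coordinates to 5 decimals (roughly 1 m) to reduce GPS jitter, and kept the first coordinate seen for each station name so that every station maps to a single point. The joins themselves are on station name. Start and end coordinates are collected separately, so the same station can differ slightly between the two (see Central Park S & 6 Ave in the first top-flows output).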

- Verified coordinate validity and locality (NYC bounds, no lat/lng anomalies).

- Ran a top-flows check and confirmed that the highest-volume pairs were initially dominated by self-loops (start = end). These are common in bike-share data and can reflect round trips as well as artifacts such as dock corrections or very short trips.

- Removed self-loops prior to visualization to avoid zero-length arcs and distorted scaling in Kepler.

- Re-checked top flows after filtering; remaining routes are short, plausible, and symmetric across nearby stations, indicating healthy aggregation.

Result: the dataset is now suitable for Kepler arc/line layers, with no zero-length arcs and no self-loop counts skewing the color scale.
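
The coordinate validity check mentioned in the list above is not shown in the notebook. A minimal sketch of one way to run it follows. The bounding box values are rough assumptions for the New York City area, and `coords_within_bounds` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Approximate bounding box around New York City (assumed values, adjust as needed)
NYC_BOUNDS = {"lat_min": 40.4, "lat_max": 41.0, "lng_min": -74.3, "lng_max": -73.6}


def coords_within_bounds(df: pd.DataFrame, bounds: dict = NYC_BOUNDS) -> pd.Series:
    """Return a boolean Series: True where both trip endpoints fall inside the box."""
    lat_ok = (
        df["start_lat"].between(bounds["lat_min"], bounds["lat_max"])
        & df["end_lat"].between(bounds["lat_min"], bounds["lat_max"])
    )
    lng_ok = (
        df["start_lng"].between(bounds["lng_min"], bounds["lng_max"])
        & df["end_lng"].between(bounds["lng_min"], bounds["lng_max"])
    )
    return lat_ok & lng_ok


# Tiny illustrative frame: the first row is copied from the top-flows output,
# the second row has a made-up out-of-range start point.
sample = pd.DataFrame({
    "start_lat": [40.74174, 0.0],
    "start_lng": [-73.99416, 0.0],
    "end_lat": [40.7455, 40.7455],
    "end_lng": [-74.00197, -74.00197],
})
mask = coords_within_bounds(sample)
print(mask.tolist())
```

Applied to the real data, `df_trips[~coords_within_bounds(df_trips)]` would list any rows that need a closer look.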

2. Kepler.gl map: Trip flows between stations.

In [16]:
# Initialize the map

map_1 = KeplerGl(
    height=650,
    data={"Trips": df_trips}
)
map_1

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter



## Kepler.gl Map Customization

The start and end station point layers were styled using a warm yellow–orange color
with reduced opacity to provide clear spatial context while remaining visually subtle.

Trip connections were visualized using a sequential color palette ranging from purple
to orange, mapped to the number of trips. This palette choice helps emphasize
high-volume routes while maintaining contrast against the dark basemap.

## Filtering for the most common trips in New York City

To identify the most common trips in New York City, I added a filter on the `trips` variable in Kepler.gl and increased the minimum threshold to remove low-frequency routes. This significantly reduced visual clutter and highlighted only the highest-volume station-to-station connections.

After filtering, the remaining routes cluster strongly in **Manhattan**, with particularly dense activity in **Midtown and Downtown Manhattan** and along the **Hudson River waterfront**. These areas appear especially busy, as many high-volume routes connect nearby stations within short distances. This pattern suggests frequent, repeat trips rather than occasional long-distance travel.

The prominence of Manhattan and waterfront-adjacent corridors is consistent with what is known about Citi Bike usage in New York City. These zones combine high station density, major employment centers, transit hubs, and popular recreational areas such as riverfront bike paths. Together, these factors help explain why these station pairs remain visible even after filtering for only the most common trips.
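
The same kind of filter can also be applied in pandas before the data reaches Kepler.gl. The sketch below is only an illustration: the threshold used in the map is not recorded in this notebook, so `min_trips` is an arbitrary example value, and `filter_top_flows` is a hypothetical helper.

```python
import pandas as pd


def filter_top_flows(df: pd.DataFrame, min_trips: int) -> pd.DataFrame:
    """Keep only station pairs with at least `min_trips` trips, busiest first."""
    kept = df[df["trips"] >= min_trips]
    return kept.sort_values("trips", ascending=False).reset_index(drop=True)


# Three rows copied from the top-flows output earlier in the notebook
flows = pd.DataFrame({
    "start_station_name": [
        "W 21 St & 6 Ave",
        "1 Ave & E 62 St",
        "Pier 40 - Hudson River Park",
    ],
    "end_station_name": [
        "9 Ave & W 22 St",
        "1 Ave & E 68 St",
        "West St & Chambers St",
    ],
    "trips": [6345, 5826, 4222],
})

# 5000 is an arbitrary example threshold, not the value used in the map
top = filter_top_flows(flows, min_trips=5000)
print(top[["start_station_name", "end_station_name", "trips"]])
```

Filtering in pandas keeps the threshold in the notebook itself, whereas a filter set in the Kepler.gl panel lives only in the saved map config.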

In [17]:
# Capture the current map configuration (layers, filters, styling)

config = map_1.config

In [18]:
# Export the map as html

map_1.save_to_html(
    file_name="NYC_CitiBike_Trips.html",
    read_only=False,
    config=config
)

Map saved to NYC_CitiBike_Trips.html!


In [19]:
# Save the config as a JSON file (json is already imported in the first cell)

with open("kepler_config.json", "w") as outfile:
    json.dump(config, outfile)
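
A saved config is mainly useful if it can be loaded again later. The sketch below shows the round trip under stated assumptions: `load_config` is a hypothetical helper, and `demo_config` is a placeholder dict standing in for the real config so that the example runs without a map.

```python
import json
import tempfile
from pathlib import Path


def load_config(path) -> dict:
    """Read a Kepler.gl config that was previously written with json.dump."""
    with open(path) as infile:
        return json.load(infile)


# Placeholder standing in for the real map config (assumption for the demo)
demo_config = {"version": "v1", "config": {}}

# Write to a temporary folder so the demo does not overwrite kepler_config.json
demo_path = Path(tempfile.mkdtemp()) / "kepler_config.json"
with open(demo_path, "w") as outfile:
    json.dump(demo_config, outfile)

restored = load_config(demo_path)
print(restored == demo_config)

# In the notebook, the restored dict could then be passed back to a new map:
# map_2 = KeplerGl(height=650, data={"Trips": df_trips}, config=restored)
```

When reusing a config this way, the dataset name passed in `data` (here `"Trips"`) should match the one the config was created with, otherwise the saved layers may not attach to the data.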