# Processing & Visualizations for Gridwise Data

## About this notebook

The purpose of this notebook is to parse and visualize the preliminary data provided by Gridwise in 2023. For questions about the design or implementation of this project, please email Katie.Rischpater@nrel.gov.

## File I/O, Dataframe setup

### File Imports

In [3]:
# Data Processing
import pandas as pd
import numpy as np
# from datetime import datetime, timedelta

# Geospatial Data
import geopandas as gpd
import geoplot
import geoplot.crs as gcrs

# Visualization
import matplotlib.pyplot as plt
import osmnx as ox
# import networkx as nx

### Load & Parse CSV 

This is currently configured to work with a single CSV; the syntax may change as more data is provided!

Relevant fields in the Gridwise_Data CSV include:
- Timestamps ('start_time', 'end_time')
- Trip category ('service_type'; we're interested in 'Rideshare')
- Driver id (driver_id)
- Lat / Long data, including:
    - start_block_group_internal_point_lat, start_block_group_internal_point_lng
    - end_block_group_internal_point_lat, end_block_group_internal_point_lng
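As a minimal sketch of how these fields might be loaded, the snippet below reads only a subset of columns and parses the timestamps while loading. The inline CSV, its values, and the names `SAMPLE_CSV`, `RELEVANT_COLUMNS`, and `extra_column` are made up for illustration; the real file may contain different or additional columns.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real Gridwise CSV.
SAMPLE_CSV = io.StringIO(
    "driver_id,service_type,start_time,end_time,extra_column\n"
    "d1,Rideshare,2023-10-30 08:00:00,2023-10-30 08:20:00,x\n"
    "d2,Delivery,2023-10-30 09:00:00,2023-10-30 09:15:00,y\n"
)

RELEVANT_COLUMNS = ["driver_id", "service_type", "start_time", "end_time"]

# usecols skips columns we do not need; parse_dates converts the timestamps on load
sample_df = pd.read_csv(
    SAMPLE_CSV,
    usecols=RELEVANT_COLUMNS,
    parse_dates=["start_time", "end_time"],
)

# Keep only rideshare trips
sample_rideshare_df = sample_df[sample_df["service_type"] == "Rideshare"]
print(sample_rideshare_df)
```

Parsing the timestamps at load time would make a later `pd.to_datetime` call unnecessary, at the cost of a slightly slower read.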

In [5]:
CSV_PATH = '../Gridwise_Data/Preliminary/gwan-los-angeles-10-30-trips.csv'
gridwise_df = pd.read_csv(CSV_PATH)
rideshare_df = gridwise_df[gridwise_df['service_type'] == 'Rideshare']
CITY_NAME = 'Los Angeles, California, USA'

# Inspect only rideshare trips with nonempty start, end, and request times
rideshare_with_data_df = rideshare_df.dropna(subset=['start_time', 'end_time', 'request_time']).reset_index(drop=True)

## Matplotlib Visualizations

### Number of Trips / User

In [None]:
# Count the number of rideshare trips driven by each driver
driverid_counts = rideshare_df['driver_id'].value_counts()
driver_id_counts_df = driverid_counts.reset_index()
driver_id_counts_df.columns = ['driver_id', 'count']

# Count how many drivers completed each number of trips
all_days_grouped_df = driver_id_counts_df.groupby('count')['driver_id'].nunique().reset_index()
all_days_grouped_df.columns = ['Number of Trips', 'Number of Users']

# Display data as bar graph
plt.bar(all_days_grouped_df['Number of Trips'], all_days_grouped_df['Number of Users'] )
plt.xlabel('Number of Trips') # Just Oct. 30th
plt.ylabel('Number of Drivers')
plt.title('Number of Drivers vs Number of Rideshare Trips')

plt.yticks(np.arange(0, all_days_grouped_df['Number of Users'].max()+1, step=5))
# plt.xlim(left=0) # Moves

plt.show()

### Number of Trips / Day


With the current preliminary dataset, this isn't informative, since all of the data comes from a single day. Once we have more data, this may prove useful!
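To illustrate what a per-day count could look like once multi-day data is available, here is a small sketch on made-up trips spanning two days. The frame `toy_trips_df` and its values are hypothetical and are not taken from the Gridwise data.

```python
import pandas as pd

# Made-up trips spanning two days; not taken from the Gridwise data.
toy_trips_df = pd.DataFrame({
    "driver_id": ["a", "a", "a", "b", "b"],
    "start_time": pd.to_datetime([
        "2023-10-30 08:00", "2023-10-30 12:00", "2023-10-31 09:00",
        "2023-10-30 10:00", "2023-10-31 11:00",
    ]),
})

# Trips per driver per calendar day
trips_per_day = (
    toy_trips_df
    .groupby(["driver_id", toy_trips_df["start_time"].dt.date])
    .size()
)

# Busiest single day for each driver
max_trips_per_driver = trips_per_day.groupby("driver_id").max()
print(max_trips_per_driver)
```

Here driver `a` has two trips on one day and one on the next, so the busiest-day count differs from the total trip count, which is the distinction this section is meant to capture.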

In [7]:
# Work on a copy so the datetime conversion does not modify rideshare_with_data_df
start_as_datetime_df = rideshare_with_data_df.copy()
start_as_datetime_df['start_time'] = pd.to_datetime(start_as_datetime_df['start_time'])

# Trips per driver per day, then each driver's busiest single day
daily_count_df = start_as_datetime_df.groupby(['driver_id', start_as_datetime_df['start_time'].dt.date])['start_time'].count()
max_trips_per_user_df = daily_count_df.groupby('driver_id').max().reset_index()
max_trips_per_user_df.columns = ['driver_id', 'MaxTripsInSingleDay']

# Number of drivers for each busiest-day trip count
single_day_grouped_df = max_trips_per_user_df.groupby('MaxTripsInSingleDay')['driver_id'].nunique().reset_index()
single_day_grouped_df.columns = ['Number of Trips', 'Number of Users']


## GeoPlot Visualizations 

### Fetch city graph

TODO:
- Because `geoplot` and `geopandas` have Kernel Density Estimation (KDE) plotting built in, I may switch to using one of those packages (as, from my understanding, OSMnx does not have KDE).
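For reference, the idea behind a KDE can be sketched with NumPy alone. This is a hand-rolled, unnormalized Gaussian kernel on a regular grid, not the implementation used by any of the packages above. The function name `gaussian_kde_grid`, the bandwidth of 0.01 degrees, and the random points are all assumptions for illustration, and degrees are treated as planar coordinates, which is only a rough approximation.

```python
import numpy as np

def gaussian_kde_grid(lons, lats, grid_size=50, bandwidth=0.01):
    """Evaluate an unnormalized Gaussian KDE on a regular lon/lat grid.

    bandwidth is in degrees and is an assumed value, not a tuned one.
    """
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    grid_lon = np.linspace(lons.min(), lons.max(), grid_size)
    grid_lat = np.linspace(lats.min(), lats.max(), grid_size)
    mesh_lon, mesh_lat = np.meshgrid(grid_lon, grid_lat)
    # Squared distance from every grid cell to every input point
    d2 = (mesh_lon[..., None] - lons) ** 2 + (mesh_lat[..., None] - lats) ** 2
    # Sum one Gaussian bump per point
    density = np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=-1)
    return mesh_lon, mesh_lat, density

# Made-up points scattered around downtown Los Angeles
rng = np.random.default_rng(0)
toy_lons = rng.normal(-118.25, 0.02, size=200)
toy_lats = rng.normal(34.05, 0.02, size=200)
mesh_lon, mesh_lat, density = gaussian_kde_grid(toy_lons, toy_lats)
```

The resulting grid could be drawn with `plt.contourf(mesh_lon, mesh_lat, density)`; a library KDE would additionally handle bandwidth selection and map projection.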

In [8]:
city_graph = ox.graph_from_place(CITY_NAME, network_type="drive")

### Convert lat/lon points of dataframe

TODO:
  - As written, this does not work! Will fix later.
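One thing worth ruling out when points do not land where expected is a latitude/longitude mix-up, since `gpd.points_from_xy` takes x (longitude) first and y (latitude) second. The helper below is a hypothetical sanity check (`looks_like_lat_lng` and `toy_df` are illustrative names): for Los Angeles, the longitude is around -118, which falls outside the valid latitude range, so swapped columns are easy to detect.

```python
import pandas as pd

def looks_like_lat_lng(df, lat_col, lng_col):
    """Return True if every row has a plausible latitude and longitude."""
    lat = pd.to_numeric(df[lat_col], errors="coerce")
    lng = pd.to_numeric(df[lng_col], errors="coerce")
    # between() is False for NaN, so unparseable values also fail the check
    return bool(lat.between(-90, 90).all() and lng.between(-180, 180).all())

# Made-up point near downtown Los Angeles
toy_df = pd.DataFrame({"lat": [34.05], "lng": [-118.25]})
correct_order_ok = looks_like_lat_lng(toy_df, "lat", "lng")
swapped_order_ok = looks_like_lat_lng(toy_df, "lng", "lat")
print(correct_order_ok, swapped_order_ok)
```

This check only catches swaps when the longitude magnitude exceeds 90 degrees, which holds for Los Angeles but not for every city.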

In [None]:
# Since we do this twice (For Origin / Destination), abstract this block to a function
def parse_points(category_to_parse):
    lat_string = category_to_parse + '_lat'
    lon_string = category_to_parse + '_lng'
    # Keep only rows that have both coordinates
    rideshare_points_df = rideshare_with_data_df.dropna(subset=[lat_string, lon_string]).reset_index(drop=True)
    # points_from_xy expects x (longitude) first, then y (latitude); coordinates are assumed to be WGS84 (EPSG:4326)
    point_geometry = gpd.points_from_xy(pd.to_numeric(rideshare_points_df[lon_string]), pd.to_numeric(rideshare_points_df[lat_string]))
    return gpd.GeoDataFrame(rideshare_points_df, geometry=point_geometry, crs="EPSG:4326")


# Calculation & Execution Functions
ORIGIN_COLUMN = 'start_block_group_internal_point'
DEST_COLUMN = 'end_block_group_internal_point'
PROJECTION = gcrs.AlbersEqualArea()

origin_points_gdf = parse_points(ORIGIN_COLUMN)
dest_points_gdf = parse_points(DEST_COLUMN)

print(origin_points_gdf)

gdf_city = ox.graph_to_gdfs(city_graph, edges=False)

ax = gdf_city.plot(figsize=(10,10), edgecolor='white')
origin_points_gdf.plot(ax=ax, color='red', markersize=10)
# geoplot.pointplot(origin_points_gdf, ax=ax, hue=origin_points_gdf.geometry.buffer(0.01).unary_union.convex_hull.area / origin_points_gdf.geometry.buffer(0.01).area, legend=True, legend_var='hue', cmap='inferno', legend_kwargs={'label': 'Point Density'})

### Future Work
- When looking at time between trips, consider filtering out 2/3-trip drivers
- Other ways to cluster:
  - TODO: Look at length of trips / driver
  - TODO: Look at time between trips (/ driver)
    - E.g., a driver does 15 trips, but across two shifts (5 in the morning, 10 in the evening); what do those look like?
    - From this, _then_ we can start clustering by shift, not just by one cluster
- Stronger variables to consider:
  - Time between rides
  - Time of day
  - Location of O-D pairs
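
A possible starting point for the time-between-trips idea is sketched below: compute the idle gap between consecutive trips for each driver, then start a new shift whenever the gap exceeds a threshold. The two-hour `SHIFT_GAP` is an assumed value rather than something derived from the data, and `toy_trips_df` is made up for illustration.

```python
import pandas as pd

SHIFT_GAP = pd.Timedelta(hours=2)  # assumed threshold, not derived from the data

# Made-up trips for one driver: two in the morning, two in the evening
toy_trips_df = pd.DataFrame({
    "driver_id": ["a"] * 4,
    "start_time": pd.to_datetime([
        "2023-10-30 07:00", "2023-10-30 07:40",
        "2023-10-30 17:00", "2023-10-30 17:30",
    ]),
    "end_time": pd.to_datetime([
        "2023-10-30 07:30", "2023-10-30 08:10",
        "2023-10-30 17:20", "2023-10-30 18:00",
    ]),
}).sort_values(["driver_id", "start_time"])

# Idle time between the end of one trip and the start of the driver's next
previous_end = toy_trips_df.groupby("driver_id")["end_time"].shift()
toy_trips_df["gap"] = toy_trips_df["start_time"] - previous_end

# A new shift starts on a driver's first trip or after a long gap
is_new_shift = toy_trips_df["gap"].isna() | (toy_trips_df["gap"] > SHIFT_GAP)
toy_trips_df["shift_id"] = (
    is_new_shift.astype(int).groupby(toy_trips_df["driver_id"]).cumsum()
)
print(toy_trips_df[["driver_id", "start_time", "gap", "shift_id"]])
```

With a `shift_id` per trip, clustering could then be run per driver and shift rather than per driver alone.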