
In [30]:
import pandas as pd
from scipy.stats import chi2_contingency

file_path = '/content/JC-202306-citibike-tripdata[1].csv'
citibike_data = pd.read_csv(file_path)

In [31]:
# Check for missing values
missing_data = citibike_data.isnull().sum()
print("Missing Data Count by Variable:")
print(missing_data)

Missing Data Count by Variable:
ride_id                 0
rideable_type           0
started_at              0
ended_at                0
start_station_name     55
start_station_id       55
end_station_name      361
end_station_id        361
start_lat               0
start_lng               0
end_lat                88
end_lng                88
member_casual           0
dtype: int64


The dataset has gaps: some rides are missing start or end station details, and some are missing end coordinates. I clean the data by dropping these incomplete rows and use the cleaned version for the rest of the analysis.

In [32]:
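# Drop incomplete rows; .copy() gives an independent DataFrame so that
# later column assignments do not trigger SettingWithCopyWarning
citibike_cleaned = citibike_data.dropna().copy()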
missing_data = citibike_cleaned.isnull().sum()
print("Missing Data Count by Variable:")
print(missing_data)

Missing Data Count by Variable:
ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    0
start_station_id      0
end_station_name      0
end_station_id        0
start_lat             0
start_lng             0
end_lat               0
end_lng               0
member_casual         0
dtype: int64


In [33]:
citibike_cleaned['route'] = citibike_cleaned['start_station_name'] + " to " + citibike_cleaned['end_station_name']
top5routes = citibike_cleaned['route'].value_counts().nlargest(5)
print("The 5 most popular routes and their quantities:")
print(top5routes)


The 5 most popular routes and their quantities:
route
Hoboken Terminal - Hudson St & Hudson Pl to Hoboken Ave at Monmouth St                          531
Grove St PATH to Marin Light Rail                                                               501
South Waterfront Walkway - Sinatra Dr & 1 St to South Waterfront Walkway - Sinatra Dr & 1 St    469
12 St & Sinatra Dr N to South Waterfront Walkway - Sinatra Dr & 1 St                            451
Marin Light Rail to Grove St PATH                                                               441
Name: count, dtype: int64




In [34]:
contingency_table = pd.crosstab(citibike_cleaned['member_casual'], citibike_cleaned['rideable_type'])
chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
print("Contingency Table:")
print(contingency_table)
print("p-value:", p_value)

Contingency Table:
rideable_type  classic_bike  docked_bike  electric_bike
member_casual                                          
casual                22143          433           5282
member                60506            0           8298
p-value: 0.0


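The p-value prints as 0.0, meaning it is too small to represent in floating point, so there is a statistically significant association between membership type and the type of bike used. Notably, all 433 docked-bike rides were taken by casual riders.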
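With about 96,700 rides in the table, even a weak association would produce a very small p-value, so it can help to look at an effect size as well. The following is a minimal sketch, not part of the original analysis, that recomputes the chi-square statistic by hand from the counts above and adds Cramér's V. The function name `chi_square_and_cramers_v` is hypothetical.

```python
import math

# Observed counts copied from the contingency table above.
# Column order: classic_bike, docked_bike, electric_bike
observed = {
    "casual": [22143, 433, 5282],
    "member": [60506, 0, 8298],
}


def chi_square_and_cramers_v(table):
    """Return (chi-square statistic, Cramér's V) for a dict of row -> counts."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)

    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            # Count expected in this cell if the two variables were independent
            expected = row_total * col_total / n
            chi2 += (obs - expected) ** 2 / expected

    # Cramér's V rescales chi-square to a 0-1 measure of association strength
    min_dim = min(len(rows), len(col_totals)) - 1
    v = math.sqrt(chi2 / (n * min_dim))
    return chi2, v


chi2, v = chi_square_and_cramers_v(observed)
print(f"chi-square: {chi2:.1f}, Cramér's V: {v:.3f}")
```

A value of V near 0 indicates a weak association and a value near 1 a strong one, which gives context that the p-value alone does not.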

In [39]:
citibike_cleaned['started_at'] = pd.to_datetime(citibike_cleaned['started_at'])
citibike_cleaned['ended_at'] = pd.to_datetime(citibike_cleaned['ended_at'])
citibike_cleaned['ride_duration'] = (citibike_cleaned['ended_at'] - citibike_cleaned['started_at']).dt.total_seconds() / 60
citibike_cleaned['start_hour'] = citibike_cleaned['started_at'].dt.hour
avg_ride_length_by_hour = citibike_cleaned.groupby('start_hour')['ride_duration'].mean()
print("Average Ride Length by Hour:")
print(avg_ride_length_by_hour)


Average Ride Length by Hour:
start_hour
0     13.283021
1     19.747754
2     19.915869
3     12.664678
4     12.956479
5      7.626480
6      6.951190
7      7.655251
8      8.983350
9     10.876922
10    14.180819
11    13.006059
12    12.491760
13    12.745132
14    12.280714
15    12.964654
16    11.685383
17    11.999161
18    11.005849
19    11.595289
20    13.042484
21    13.479206
22    13.685090
23    15.020393
Name: ride_duration, dtype: float64



Average ride duration varies noticeably by start hour. Based on the table above, the longest average rides for the month started in the early morning, between 1 a.m. and 3 a.m. One likely reason is the smaller number of rides at these hours, so each ride, and especially any unusually long one, carries more weight in the average. Hours around lunchtime and the typical evening commute (5 p.m.) showed far less variation, likely because many more rides occur at those times.
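One way to check the sample-size explanation is to report the ride count and the median next to the mean for each hour. The sketch below assumes the `start_hour` and `ride_duration` columns created earlier; the helper name `summarize_by_hour` and the toy data are hypothetical and are not taken from the Citi Bike file.

```python
import pandas as pd


def summarize_by_hour(rides: pd.DataFrame) -> pd.DataFrame:
    """Ride count, mean, and median duration (minutes) per start hour."""
    return rides.groupby("start_hour")["ride_duration"].agg(["count", "mean", "median"])


# Toy data for illustration only: one very long ride at 2 a.m.
toy = pd.DataFrame({
    "start_hour": [2, 2, 2, 8, 8, 8, 8],
    "ride_duration": [5.0, 6.0, 120.0, 7.0, 8.0, 9.0, 10.0],
})
summary = summarize_by_hour(toy)
print(summary)
```

If the early-morning hours show low counts and a median well below the mean, that would support the idea that a few long rides are driving the average. On the real data this would be called as `summarize_by_hour(citibike_cleaned)`.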

What is the most interesting analytical output you can generate, what specific number(s) would serve
as a “headline” for that output, and what are the potential implications for riders or administrators of the Citibike program?
- I would like to compute the distance of each ride. Using the start and end coordinates, the Haversine formula gives the straight-line (great-circle) distance in miles, which can be stored in a new column. Average miles per ride could be the headline number. Comparing distance with ride duration would give riders and administrators insight into traffic patterns, which could be used to optimize trips for members.
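The Haversine calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than part of the original analysis: the function name `haversine_miles` is hypothetical, the Earth is treated as a sphere with a mean radius of about 3,958.8 miles, and the example coordinates are made up.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    # Haversine formula: a is the squared half-chord length between the points
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Illustrative coordinates only, not taken from the dataset
example_distance = haversine_miles(40.7359, -74.0303, 40.7193, -74.0434)
print(f"{example_distance:.2f} miles")
```

On the cleaned data, one option is a row-wise call such as `citibike_cleaned.apply(lambda r: haversine_miles(r['start_lat'], r['start_lng'], r['end_lat'], r['end_lng']), axis=1)`. Note that round trips, which start and end at the same station, would get a distance of zero.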

What additional information would you like to have added to this dataset, and what questions would it
help you answer?
- I would add two pieces of information. First, I would add rider employment status. It would be interesting to see whether ride duration and start time are related to employment, which would give insight into what the bikes are mainly used for. Second, I would add the weather on the day of each ride. Weather likely affects whether a user takes a bike at all, so it would be interesting to see the relationship between weather and the number of trips.