# **Project Name**    - Ford Bike Sharing Project



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**


The Ford Bike Sharing Project explores how a bike-sharing service is used, relying on data visualization throughout. It analyzes data from the Ford GoBike System, a bike-sharing service operating in the San Francisco Bay Area. Data visualization is treated here as a core data analysis skill: it supports exploratory data analysis, guides data wrangling, and communicates findings.

The project moves from initial data investigation and wrangling to univariate, bivariate, and multivariate analyses. The Ford GoBike dataset records individual rides, and the analysis uses it to understand usage patterns and the factors that influence them. Key features include trip duration, start and end times, station information, user type, and rider demographics.

# **GitHub Link -**

https://github.com/vidyagowda1/ford-bike-sharing

# **Problem Statement**
The project aims to analyze the Ford GoBike System data to gain insights into bike-sharing usage patterns and the factors influencing them. Specifically, it seeks to address the need for a deeper understanding of trip characteristics, such as duration, and how these characteristics are affected by variables like weather and user type.

This analysis is crucial because trip duration is closely linked to the bike-sharing company's revenue. Therefore, identifying the key factors that affect trip duration can inform strategies to optimize service offerings, enhance user experience, and ultimately, improve the company's financial performance.

The project utilizes data visualization techniques to explore and present these relationships, highlighting the importance of effective data communication in deriving actionable conclusions. By examining the provided dataset, the project endeavors to provide answers to specific questions related to average trip times and the impact of external factors on bike usage.


#### **Define Your Business Objective?**

**Data-Driven Decision Making:**
* Utilize data analysis and visualization to inform strategic decisions about the bike-sharing service.

**Understanding User Behavior:**
* Analyze data to understand user behavior and patterns, particularly concerning trip duration and user type.

**Revenue Optimization:**
* Identify factors that influence trip duration, as it has a close relationship with revenue.
* Explore ways to increase revenue, such as attracting more customers and converting casual users to subscribers.

**Service Improvement:**
* Gain insights to improve the service and potentially increase ridership.

These points directly reflect the project's focus on using data to understand and improve the business aspects of a bike-sharing system.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Function to load a CSV file into a pandas DataFrame
def load_csv(file_path):
    try:
        return pd.read_csv(file_path)  # Load the CSV file
    except Exception as e:
        print(f"Error: {e}")  # Print error message if loading fails
        return None  # Return None in case of failure

# Define file paths for the dfs
fordbike_path = '/content/drive/MyDrive/Colab Notebooks/fordgobike-tripdata.csv.zip'  # File path for the ford-bike df
# Load the df using the load_csv function
fordbike_df = load_csv(file_path=fordbike_path)  # Load ford-bike df
# Display all columns
pd.set_option('display.max_columns', None)

### Dataset First View

In [None]:
# Dataset First Look
fordbike_df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Number of Rows =',fordbike_df.shape[0])
print('Number of Columns =',fordbike_df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
fordbike_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('Number of duplicates in dataset =',fordbike_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print('Number of missing values in dataset =',fordbike_df.isnull().sum().sum())

In [None]:
# Visualizing the missing values
# Calculating the number of missing values per column
missing_values = fordbike_df.isnull().sum()

# Plotting the missing values
fig, ax = plt.subplots(figsize=(16, 6))
bars = ax.bar(missing_values.index, missing_values.values, color='teal')

# Adding data labels on top of the bars
for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, yval, f'{yval}',
            ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')

# Adding labels and title
plt.title('Missing Values per Column', fontsize=16, fontweight='bold')
plt.xlabel('Columns', fontsize=12)
plt.ylabel('Number of Missing Values', fontsize=12)

# Rotating x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()

### What did you know about your dataset?

This dataset contains information on individual bike-sharing trips made in the greater San Francisco Bay Area. The dataset includes 94,802 entries, and it is used to analyze ride patterns, trip durations, and the influence of factors like user type, seasonality, and demographics.
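The missing-value check above can also be expressed as a per-column table of counts and percentages. The following is a minimal sketch on a small made-up frame; `toy_df` and `missing_summary` are hypothetical names used only for illustration, and the values are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for fordbike_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "duration_sec": [600, 720, 300, 1500],
    "member_birth_year": [1985.0, np.nan, 1992.0, np.nan],
    "member_gender": ["Male", None, "Female", "Other"],
})


def missing_summary(frame):
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(fordbike_df)` on the loaded data should give the same kind of table, which makes it easier to see which columns drive the total missing count.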



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
fordbike_df.columns

In [None]:
# Dataset Describe
fordbike_df.describe()

### Variables Description

duration_sec - Duration of the trip in seconds

start_time - Date and time when the trip started

end_time - Date and time when the trip ended

start_station_id - Unique ID of the starting station

start_station_name - Name of the starting station

start_station_latitude - Latitude of the starting station

start_station_longitude - Longitude of the starting station

end_station_id - Unique ID of the ending station

end_station_name - Name of the ending station

end_station_latitude - Latitude of the ending station

end_station_longitude - Longitude of the ending station

bike_id - Unique identifier for the bike used

user_type - Type of user Subscriber (member) or Customer (casual)

member_birth_year - Birth year of the user

member_gender - Gender of the user (Male, Female, or Other)

bike_share_for_all_trip - Indicates whether the trip was part of the Bike Share for All program (Yes/No)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print('Unique Values in dataset:\n')
print(fordbike_df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Create copy of the dataset
df = fordbike_df.copy()

# 1. Convert start_time and end_time to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

In [None]:
# 2. Create new columns for date, time, and duration in minutes
df['start_date'] = df['start_time'].dt.date
df['start_hour'] = df['start_time'].dt.hour
df['start_day'] = df['start_time'].dt.day_name()
df['start_month'] = df['start_time'].dt.month_name()
df['duration_min'] = df['duration_sec'] / 60  # Convert seconds to minutes


In [None]:
# 3. Handle missing values

# Dropping rows with missing member_birth_year and member_gender
df = df.dropna(subset=['member_birth_year', 'member_gender'])

# Checking missing values after handling
print("\nMissing values after handling:")
print(df.isnull().sum())

In [None]:
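# 4. Create age column from birth year
# Age is measured in the year of the trip, so it does not depend on when the notebook is run
df['age'] = df['start_time'].dt.year - df['member_birth_year']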

In [None]:
# 5. Remove outliers in trip duration
# This helps to focus analysis on realistic trip lengths
duration_cap = df['duration_min'].quantile(0.99)
df = df[df['duration_min'] <= duration_cap]

In [None]:
# 6. Converting birth year and age to int
df['member_birth_year'] = df['member_birth_year'].astype(int)
df['age'] = df['age'].astype(int)

In [None]:
# 7. Drop irrelevant columns
df.drop(['bike_id', 'start_station_id', 'end_station_id'], axis=1, inplace=True)

In [None]:
# Dataset shape after cleaning
print('Number of Rows =',df.shape[0])
print('Number of Columns =',df.shape[1])

In [None]:
# View final dataset
df.head()

### What all manipulations have you done and insights you found?

**Manipulations**

* **Datetime conversions:** start_time and end_time were converted to proper datetime formats, which allows easy extraction of time-based features.
* **Feature engineering:**
    * Trip duration in minutes (duration_min) was created by dividing duration_sec by 60.
    * New columns were extracted from start_time: start_date (calendar date), start_hour (hour of the day), start_day (day of the week), and start_month (month name).
    * Rider age was calculated from member_birth_year.
* **Handling missing data:** Null values were checked across all columns. Rows with nulls in member_gender or member_birth_year were removed to maintain data quality.
* **Filtering unusual records:** Trips longer than the 99th percentile of duration_min were removed so that the analysis focuses on realistic trip lengths.
* **Data type conversion:** member_birth_year and age were converted to the integer datatype.
* **Dropping columns:** bike_id, start_station_id, and end_station_id were dropped as irrelevant to this analysis.

**Insights**

* **Presence of outliers:** Some records showed unusually long trip durations, indicating the need for cleaning.
* **Missing demographic data:** Several records had missing gender or birth year, which could affect gender-based or age-based insights.
* **User and time patterns possible:** By extracting start_hour, start_day, and start_month, patterns in riding behavior over different times and seasons became analyzable.
* **Age distribution identified:** Calculating age from birth year allowed for deeper analysis into age groups of riders and their preferences.
* **Trip duration is a critical metric:** Creating duration_min provided the key measure to compare how trip length varies by age, gender, time of day, and user type.
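Several of the wrangling steps can be wrapped in one reusable function so they can be re-run on a fresh copy of the data. This is a sketch only: `prepare_trips` and its `cap_quantile` parameter are hypothetical names, the sample rows are made up, and the age and column-dropping steps are left out for brevity.

```python
import pandas as pd


def prepare_trips(raw, cap_quantile=0.99):
    """Apply a subset of the cleaning steps to a copy of `raw`."""
    out = raw.copy()
    # Parse timestamps and derive time-based features
    out["start_time"] = pd.to_datetime(out["start_time"])
    out["end_time"] = pd.to_datetime(out["end_time"])
    out["start_hour"] = out["start_time"].dt.hour
    out["start_day"] = out["start_time"].dt.day_name()
    out["duration_min"] = out["duration_sec"] / 60
    # Drop rows without demographic information
    out = out.dropna(subset=["member_birth_year", "member_gender"])
    # Remove trips above the chosen duration quantile
    cap = out["duration_min"].quantile(cap_quantile)
    out = out[out["duration_min"] <= cap].copy()
    out["member_birth_year"] = out["member_birth_year"].astype(int)
    return out.reset_index(drop=True)


# Made-up sample rows, for illustration only
raw = pd.DataFrame({
    "duration_sec": [300, 600, 900, 86000],
    "start_time": ["2019-02-01 08:05:00", "2019-02-02 17:30:00",
                   "2019-02-03 12:00:00", "2019-02-04 09:00:00"],
    "end_time": ["2019-02-01 08:10:00", "2019-02-02 17:40:00",
                 "2019-02-03 12:15:00", "2019-02-05 08:53:20"],
    "member_birth_year": [1990.0, 1985.0, None, 1979.0],
    "member_gender": ["Male", "Female", "Male", "Other"],
})

clean = prepare_trips(raw)
print(clean[["start_day", "start_hour", "duration_min"]])
```

In this toy input, the third row is dropped for its missing birth year and the very long fourth trip is removed by the quantile cap, leaving two rows.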

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:

# Setting visualization style
plt.style.use('fivethirtyeight')

#### Chart - 1-Distribution of Trip Duration (in minutes)

In [None]:
# Chart - 1 visualization code
# Histogram for trip duration
# Set figure size
plt.figure(figsize=(12, 6))
# Create histogram
sns.histplot(df['duration_min'], bins=50, color='teal', edgecolor='black', kde = True)
# Add labels and title
plt.title('Distribution of Trip Duration (in minutes)',fontsize = 16, fontweight='bold')
plt.xlabel('Duration (minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# Set font size for x & y axis value
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10)
# Show Plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram clearly shows how trip durations are distributed, helping identify common and extreme values.

##### 2. What is/are the insight(s) found from the chart?

Most trips are short (under 20 minutes), indicating the system is used for quick commutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, this insight supports optimizing bike availability for short-distance riders and efficient fleet rotation.

Negative Insight - No, it shows healthy usage patterns. Longer trips may need review, but not necessarily negative.
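The "under 20 minutes" reading of the histogram can be backed by a number. The sketch below computes the share of trips below a threshold on a made-up series; `share_below` is a hypothetical helper, and on the real data the input would be df['duration_min'].

```python
import pandas as pd

# Made-up trip durations in minutes, for illustration only
durations = pd.Series([4.0, 7.5, 9.0, 12.0, 18.0, 25.0, 40.0, 6.0],
                      name="duration_min")


def share_below(series, threshold_min):
    """Fraction of trips strictly shorter than `threshold_min` minutes."""
    return float((series < threshold_min).mean())


short_share = share_below(durations, 20)
median_min = durations.median()
print(f"Share under 20 min: {short_share:.0%}, median: {median_min} min")
```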

#### Chart - 2 - User Type Distribution

In [None]:
# Chart - 2 visualization code
# User Type Distribution
# Set figure size
plt.figure(figsize=(12, 6))
# Counting user types
user_type_counts = df['user_type'].value_counts()
# Create bar plot
sns.barplot(x=user_type_counts.index, y=user_type_counts.values, palette='viridis')
# Add data labels
for i, v in enumerate(user_type_counts.values):
    plt.text(i, v, str(v), ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')
# Add title and axis labels (each row in the dataset is one trip)
plt.title('Bar Chart of User Types', fontsize=16, fontweight='bold')
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
# Set font size for x & y axis value
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal to compare categorical counts such as Subscriber vs Customer.

##### 2. What is/are the insight(s) found from the chart?


Majority are Subscribers, indicating high user retention and loyalty.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, strong subscriber base helps in revenue predictability and customer lifetime value.

Negative Insight - No, fewer customers indicate potential for growth through short-term promotions.

#### Chart - 3 - Member Gender Distribution

In [None]:
# Chart - 3 visualization code
# Member Gender Distribution
# Count trips per gender once and reuse the result
gender_counts = df['member_gender'].value_counts()
# Create bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=gender_counts.index, y=gender_counts.values, palette='viridis')
# Add data labels
for i, v in enumerate(gender_counts.values):
    plt.text(i, v, str(v), ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')
# Add title and axis labels (each row in the dataset is one trip)
plt.title('Bar Chart of Member Gender Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()

##### 1. Why did you pick the specific chart?


A bar plot is ideal because it makes gender participation easy to compare at a glance.

##### 2. What is/are the insight(s) found from the chart?

Males dominate ridership, with females underrepresented.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, it highlights opportunity to target female riders through safety or awareness campaigns.

Negative Insight - Gender imbalance could be a sign of safety or comfort issues for female riders.

#### Chart - 4 - Member Age Distribution

In [None]:
# Chart - 4 visualization code
# Member Age Distribution
# Create histogram
plt.figure(figsize=(12, 6))
sns.histplot(df['age'], bins=50, color='teal', edgecolor='black', kde = True)
# Add labels and title
plt.title('Distribution of Rider Age',fontsize = 16, fontweight='bold')
plt.xlabel('Age', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
# Setting font size for xticks and yticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()

##### 1. Why did you pick the specific chart?

Histograms show spread and concentration of rider age.

##### 2. What is/are the insight(s) found from the chart?

Most riders are between 25-40 years old, suggesting a young working demographic.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, helps tailor services (e.g., mobile apps, offers) to a tech-savvy, commuter audience.

Negative Insight - May indicate under-engagement of seniors, not critical but room for inclusion.

#### Chart - 5 - Start Hour Distribution

In [None]:
# Chart - 5 visualization code
# Start Hour Distribution
hourly = df['start_hour'].value_counts().sort_index()
# Create bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=hourly.index, y=hourly.values, palette='viridis')
# Add data labels
for i, v in enumerate(hourly.values):
    plt.text(i, v, str(v), ha='center', va='bottom', fontsize=8, fontweight='bold', color='black')
# Add title and labels
plt.title('Start Hour Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Start Hour', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
# Setting font size for xticks and yticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()



##### 1. Why did you pick the specific chart?

Shows peak demand hours and usage patterns.



##### 2. What is/are the insight(s) found from the chart?

Peak usage occurs around 8 AM and between 5 and 6 PM, consistent with commute-driven behavior.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, helps optimize fleet positioning and maintenance outside peak times.

Negative Insight - No, but over-reliance on rush hours can risk under-utilization during other hours.
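To check how much demand is concentrated in the peak hours, the hourly counts can be ranked and summed. This is a sketch on made-up start hours; with the real data, `start_hours` would be df['start_hour'].

```python
import pandas as pd

# Made-up start hours, for illustration only
start_hours = pd.Series([8, 8, 8, 9, 17, 17, 17, 17, 18, 12, 13, 17],
                        name="start_hour")

hourly_counts = start_hours.value_counts().sort_index()
# Reindex to all 24 hours so that hours without trips appear as 0
hourly_counts = hourly_counts.reindex(range(24), fill_value=0)

# The two busiest hours and their combined share of all trips
top_hours = hourly_counts.nlargest(2).index.tolist()
peak_share = hourly_counts.loc[top_hours].sum() / hourly_counts.sum()
print(top_hours, round(peak_share, 2))
```

A high peak share supports the commute interpretation, while a low one would suggest demand is spread more evenly across the day.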

#### Chart - 6 - Start Day of Week Distribution

In [None]:
# Chart - 6 visualization code
# Start Day of Week Distribution
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Count trips per day and put the days in calendar order
day_counts = df['start_day'].value_counts().reindex(day_order, fill_value=0)
# Create bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=day_counts.index, y=day_counts.values, palette='muted')
# Add data labels
for i, v in enumerate(day_counts.values):
    plt.text(i, v, str(v), ha='center', va='bottom', fontsize=9, fontweight='bold', color='black')
# Add labels and title
plt.title('Start Day of Week Distribution', fontsize=16, fontweight='bold')
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
# Setting font size for xticks and yticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart effectively shows the distribution of rides across each day, making it easy to identify peak days and behavioral patterns.

##### 2. What is/are the insight(s) found from the chart?

Most rides happen on weekdays, especially Tuesday to Thursday, indicating heavy use for weekday commuting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, it informs operational planning—more bikes and support staff can be allocated during peak weekday hours.

Negative Insight - Lower usage on weekends may indicate missed leisure ride opportunities. It’s not a negative, but a business growth area.



#### Chart - 7 - Top 10 Start Stations

In [None]:
# Chart - 7 visualization code
# Top 10 Start Stations
top_starts = df['start_station_name'].value_counts().nlargest(10)
# Create bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_starts.values, y=top_starts.index, palette='viridis')
# Add data labels
for i, v in enumerate(top_starts.values):
    plt.text(v, i, str(v), ha='left', va='center', fontsize=9)
# Add labels and title
plt.title('Top 10 Start Stations', fontsize=16, fontweight='bold')
plt.xlabel('Number of Trips', fontsize=12)
plt.ylabel('Start Station', fontsize=12)
# Setting font size for xticks and yticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()





##### 1. Why did you pick the specific chart?

A horizontal bar chart displays the top-performing stations clearly, even with long names.

##### 2. What is/are the insight(s) found from the chart?


Certain stations dominate as popular starting points—often located in commercial or transit-dense areas.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, these stations can be prioritized for bike availability, rebalancing, and marketing.

Negative Insight - Over-reliance on a few stations might strain infrastructure; expansion of underutilized stations could balance usage.



#### Chart - 8 - Top 10 End Stations

In [None]:
# Chart - 8 visualization code
# Top 10 End Stations
top_ends = df['end_station_name'].value_counts().nlargest(10)
# Create bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_ends.values, y=top_ends.index, palette='viridis')
# Add data labels
for i, v in enumerate(top_ends.values):
    plt.text(v, i, str(v), ha='left', va='center', fontsize=9)
# Add labels and title
plt.title('Top 10 End Stations', fontsize=16, fontweight='bold')
plt.xlabel('Number of Trips', fontsize=12)
plt.ylabel('End Station', fontsize=12)
# Setting font size for xticks and yticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()


##### 1. Why did you pick the specific chart?

Like the previous chart, it highlights most common destinations using a clean, comparative format.

##### 2. What is/are the insight(s) found from the chart?

Popular drop-off points align closely with start stations, suggesting frequent round-trips or commuter corridors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes, end-station data helps optimize bike docking, availability, and customer satisfaction.

Negative Insight - If end stations are overloaded compared to start stations, it could cause operational imbalance (bike shortages).
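The imbalance between start and end stations can be measured as a net flow per station (arrivals minus departures). The sketch below uses made-up station names; the column names match the dataset, but `trips` and `net_flow` are illustrative names.

```python
import pandas as pd

# Made-up trips between three stations, for illustration only
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "C", "B"],
    "end_station_name":   ["B", "B", "C", "C", "C", "A"],
})

starts = trips["start_station_name"].value_counts()
ends = trips["end_station_name"].value_counts()

# Align on station name; a station missing from one side counts as 0
net_flow = ends.sub(starts, fill_value=0).astype(int).sort_values()
print(net_flow)
```

Stations with a strongly negative net flow tend to run out of bikes, and stations with a strongly positive one tend to fill up, so both ends of the sorted result are candidates for rebalancing.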




#### Chart - 9 - Trips by Day of Week (Split by User Type)

In [None]:
# Chart - 9 visualization code
# Define the order of the days of the week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Group the data by 'start_day' and 'user_type' to get the trip counts, and ensure day_order is followed
day_user = df.groupby(['start_day', 'user_type']).size().unstack().reindex(day_order)

# Create the line plot using Seaborn
plt.figure(figsize=(16, 5))  # Set the figure size
sns.lineplot(data=day_user, markers=True, dashes=False, linewidth=2)  # markers=True gives each user type its own marker

# Add labels and title
plt.title('Trips by Day of Week (Split by User Type)', fontsize=16, fontweight='bold')  # Title of the chart
plt.xlabel('Day of Week', fontsize=12)  # X-axis label
plt.ylabel('Number of Trips', fontsize=12)  # Y-axis label

# Setting font size for x and y ticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Align the legend properly
plt.legend(title='User Type', loc='center left', bbox_to_anchor=(1, 0.5), fontsize=12)

# Show plot
plt.tight_layout()  # Ensure everything fits well
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is ideal for visualizing trends over a categorical sequence like days of the week. Splitting by user type (e.g., Subscriber vs Customer) makes it easy to compare usage patterns across user segments.

##### 2. What is/are the insight(s) found from the chart?

Subscribers tend to use bikes heavily on weekdays (especially Tuesday to Thursday), suggesting regular commuting behavior.

Customers show more activity on weekends, indicating casual or recreational use.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact -

Yes, it allows Ford GoBike to tailor their services:

* Prioritize bike availability and support staff during weekdays for subscribers.
* Plan promotions, events, or tourist-targeted offers for weekends to attract customers.

Negative Insight -

The relatively low weekday engagement by customers could indicate a missed opportunity. By understanding this gap, the business can target strategies to encourage casual weekday riders (e.g., discounts or partnerships with local businesses).
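Because subscribers and customers differ a lot in total volume, raw counts can hide the weekday/weekend contrast. The sketch below normalizes within each user type using `pd.crosstab`; the rows are made up for illustration.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up trips, for illustration only
trips = pd.DataFrame({
    "start_day": ["Monday", "Tuesday", "Tuesday", "Saturday",
                  "Sunday", "Saturday", "Wednesday", "Sunday"],
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer",
                  "Customer", "Subscriber", "Customer", "Customer"],
})

# Share of each user type's trips that fall on each day (columns sum to 1)
share = (
    pd.crosstab(trips["start_day"], trips["user_type"], normalize="columns")
    .reindex(day_order, fill_value=0.0)
)
weekend_share = share.loc[["Saturday", "Sunday"]].sum()
print(weekend_share)
```

Comparing `weekend_share` between the two user types gives a single number for how much more weekend-oriented customers are than subscribers.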



#### Chart - 10 - Trip Duration by User Type (Box Plot)

In [None]:
# Chart - 10 visualization code
# Trip Duration by User Type
# Create box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='user_type', y='duration_min', data=df, palette='pastel')
# Add title and labels
plt.title('Trip Duration by User Type', fontsize=16, fontweight='bold')
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Duration (minutes)', fontsize=12)
# Set xticks and yticks font size
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is ideal for visualizing the distribution, spread, and outliers in trip durations across user types. It highlights medians, quartiles, and variability at a glance.

##### 2. What is/are the insight(s) found from the chart?

Customers tend to have longer and more variable trip durations compared to Subscribers, whose trips are shorter and more consistent.

Customers show more extreme outliers, possibly indicating exploratory or recreational rides.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact - Yes. It gives clear signals about user behavior:

* Subscribers likely use the service for short, regular commutes.
* Customers may benefit from targeted plans that accommodate longer, flexible trips, e.g., daily passes or weekend bundles.

Negative Insight - Not necessarily negative, but the high variability and outliers in customer trips may imply irregular usage patterns. If not planned for (e.g., ensuring bike availability for extended durations), it could strain inventory and impact customer experience.
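The quantities a box plot draws (median and interquartile range) can also be computed directly, which makes the comparison between user types explicit. This is a sketch on made-up durations; `iqr` is a hypothetical helper.

```python
import pandas as pd

# Made-up durations in minutes, for illustration only
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 5,
    "duration_min": [6.0, 8.0, 9.0, 10.0, 12.0,
                     10.0, 15.0, 20.0, 30.0, 60.0],
})


def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)


summary = trips.groupby("user_type")["duration_min"].agg(median="median", iqr=iqr)
print(summary)
```

A larger IQR for one group corresponds to the taller box in the plot, so the table and the chart can be read side by side.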

#### Chart - 11 - Trip Count by Hour (Split by Gender)

In [None]:
# Chart - 11 visualization code
# Group the data by 'start_hour' and 'member_gender' to count trips by hour and gender
hour_gender = df.groupby(['start_hour', 'member_gender']).size().unstack(fill_value=0)

# Create a grouped bar plot of trip count by hour, split by gender
# DataFrame.plot creates its own figure, so the size is set here instead of in a separate plt.figure call
hour_gender.plot(kind='bar', stacked=False, figsize=(18, 6), width=0.8)

# Add titles and labels
plt.title('Trip Count by Hour and Gender', fontsize=16, fontweight = 'bold')
plt.xlabel('Hour of Day', fontsize=12)  # X-axis label
plt.ylabel('Number of Trips', fontsize=12)  # Y-axis label

# Align the legend to the top-right corner
plt.legend(title='Gender', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=12)

# Set xticks and yticks font size
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Show the plot
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart provides a clear, comparative view of ride frequency across each hour of the day. Splitting it by gender adds a valuable demographic dimension to understand who rides when.

##### 2. What is/are the insight(s) found from the chart?

Peak trip counts are observed during morning (7–9 AM) and evening (4–6 PM), indicating commute patterns.

Male users dominate the rides across most hours, especially during commuting times.

Female and other gender categories show a smaller, more evenly spread riding pattern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact -

Yes, these patterns can inform:

* Operational decisions like when to rebalance bikes.
* Gender-targeted marketing (e.g., safety awareness or incentives for female riders during off-peak hours).
* Infrastructure planning to support commuter-heavy hours.

Negative Insight -

The gender imbalance (significantly fewer female riders) may point to safety concerns or a lack of targeted outreach. Addressing this could expand user diversity and grow ridership.

#### Chart - 12 - Age by User Type

In [None]:
# Chart - 12 visualization code
# Age by User Type
# Create box plot
plt.figure(figsize=(12, 6))
sns.boxplot(x='user_type', y='age', data=df, palette='pastel')
# Add title and labels
plt.title('Age by User Type', fontsize=16, fontweight='bold')
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Age', fontsize=12)
# Set xticks and yticks font size
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is effective for comparing age distributions across user types. It reveals medians, ranges, and outliers, making it easy to spot demographic trends and differences.

##### 2. What is/are the insight(s) found from the chart?

Subscribers generally fall into a narrower and younger age range (mostly mid-20s to mid-40s), indicating a strong working professional base.

Customers show a wider age range, including older users, suggesting more occasional or leisure-based use.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact -

Yes. This insight enables:

* Tailored marketing: focus professional commuting plans toward the dominant subscriber age group.
* Diversification strategies: design campaigns or services attractive to older or casual riders (e.g., guided tours, senior discounts).

Negative Insight -

The lack of younger or older subscribers might hint at unmet needs or barriers (e.g., pricing or accessibility). Addressing these could unlock new user segments and expand the service's reach.
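Age ranges such as "mid-20s to mid-40s" are easier to compare across user types once ages are grouped into bands. The sketch below uses `pd.cut` on made-up riders; the bin edges and labels are assumptions chosen for illustration, not values from the notebook.

```python
import pandas as pd

# Made-up riders, for illustration only
riders = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "age": [27, 34, 41, 22, 38, 63],
})

# Assumed age bands; intervals are right-inclusive, e.g. (24, 34]
bins = [0, 24, 34, 44, 54, 120]
labels = ["<25", "25-34", "35-44", "45-54", "55+"]
riders["age_band"] = pd.cut(riders["age"], bins=bins, labels=labels)

band_counts = pd.crosstab(riders["age_band"], riders["user_type"])
print(band_counts)
```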


#### Chart - 13 - Trip Duration by Gender

In [None]:
# Chart - 13 visualization code
# Trip Duration by Gender
# Create violin plot
plt.figure(figsize=(12, 6))
sns.violinplot(x='member_gender', y='duration_min', data=df, palette='pastel')
# Add labels and title
plt.title('Trip Duration by Gender', fontsize=16, fontweight='bold')
plt.xlabel('Gender', fontsize=12)
plt.ylabel('Duration (minutes)', fontsize=12)
# Set xticks and yticks font size
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show plot
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot combines the benefits of a box plot and a kernel density plot. It not only shows the spread and median of trip durations but also gives a visual sense of the distribution shape for each gender category. This helps in spotting asymmetry or concentration of trip durations.

##### 2. What is/are the insight(s) found from the chart?

Male and female riders have fairly similar median trip durations, but female riders show slightly more variation.

The "Other"/Unspecified gender group shows more dispersed durations and noticeable outliers, suggesting less consistent trip behavior or possibly a smaller sample size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact -

Yes. Understanding trip behavior by gender helps the company:

* Refine service offerings or communication strategies to better align with how different genders use the service.
* Recognize patterns that might indicate barriers or unique usage needs, which could be addressed to improve user experience.

Negative Insight -

The wider variability in trip durations for the "Other" category could indicate less predictability in how this group uses the system. This might point to a lack of tailored service offerings or accessibility gaps that could be limiting consistent usage.



#### Chart - 14 - Average Trip Duration by Start Day of Week

In [None]:
# Chart - 14 visualization code
# Average Trip Duration by Start Day of Week
avg_duration = df.groupby('start_day')['duration_min'].mean().reindex(day_order)
# Create bar plot
plt.figure(figsize=(14, 6))
sns.barplot(x=avg_duration.index, y=avg_duration.values, palette='muted')
# Add data labels
for i, v in enumerate(avg_duration.values):
    plt.text(i, v, str(round(v, 1)), ha='center', va='bottom', fontsize=9, color='black')
# Add labels and title
plt.title('Average Trip Duration by Start Day of Week', fontsize=16, fontweight='bold')
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Average Duration (minutes)', fontsize=12)
# Setting font size for xticks and yticks
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing average values across categorical variables. In this case, it clearly shows how trip duration varies from Monday to Sunday, helping uncover behavioral trends tied to the day of the week.

##### 2. What is/are the insight(s) found from the chart?

Weekends (Saturday and Sunday) show higher average trip durations compared to weekdays, indicating more leisure or exploratory rides.

Weekdays, especially Tuesday to Thursday, have shorter average durations, likely reflecting structured commute patterns.
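
The weekday versus weekend contrast can also be summarized directly. The sketch below uses a tiny synthetic frame; it assumes `start_day` holds full English day names, which may differ from how the notebook encodes them, and the `is_weekend` column is introduced here purely for illustration.

```python
import pandas as pd

# Tiny synthetic sample (illustrative values only)
demo = pd.DataFrame({
    "start_day": ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday", "Saturday"],
    "duration_min": [9.0, 8.0, 8.5, 8.0, 9.5, 14.0, 15.0, 16.0],
})

# Flag weekend trips, then compare mean, median and trip count per group
demo["is_weekend"] = demo["start_day"].isin(["Saturday", "Sunday"])
weekend_summary = (
    demo.groupby("is_weekend")["duration_min"]
    .agg(["mean", "median", "count"])
)
print(weekend_summary)
```

Reporting the median alongside the mean helps show whether a higher weekend average reflects generally longer rides or a handful of very long trips.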



#### Chart - 15 - Average Trip Duration by Start Hour

In [None]:
# Chart - 15 visualization code
# Average Trip Duration by Start Hour
# Calculate average trip duration by hour
avg_duration_by_hour = df.groupby('start_hour')['duration_sec'].mean().reset_index()

# Convert duration from seconds to minutes for easier understanding
avg_duration_by_hour['duration_min'] = avg_duration_by_hour['duration_sec'] / 60

# Create line plot
plt.figure(figsize=(16, 5))
plt.plot(avg_duration_by_hour['start_hour'], avg_duration_by_hour['duration_min'], marker='o', linestyle='-')
# Show a tick for every start hour (0-23)
plt.xticks(range(24))
# Add labels and title
plt.title('Average Trip Duration by Start Hour', fontsize=16, fontweight='bold')
plt.xlabel('Start Hour', fontsize=12)
plt.ylabel('Average Duration (minutes)', fontsize=12)
# Set xticks and yticks font size
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
# Show Plot
plt.show()

##### 1. Why did you pick the specific chart?

A line chart suits an ordered variable such as hour of day. Here it shows how average trip duration changes across the 24 hours, which makes peaks and dips throughout the day easy to spot.


##### 2. What is/are the insight(s) found from the chart?

Average trip duration tends to peak early morning (around 5-7 AM) and late at night (after 9 PM), likely due to leisure or non-commuting trips.

Midday to early evening (10 AM - 7 PM) shows more consistent, slightly shorter trip durations, likely influenced by work or errand-based usage.
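
Hourly averages at quiet hours rest on far fewer trips than those at commute hours, so it can help to look at the trip count next to each mean. This is a minimal sketch on synthetic data; `MIN_TRIPS` is a hypothetical threshold chosen for illustration, not a value from the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic trips: many at commute hours, very few at 3 AM
hours = np.concatenate([np.full(400, 8), np.full(400, 17), np.full(5, 3)])
demo = pd.DataFrame({
    "start_hour": hours,
    "duration_min": rng.lognormal(mean=2.2, sigma=0.6, size=hours.size),
})

# Mean, median and number of trips for every hour of the day (0-23)
hourly = (
    demo.groupby("start_hour")["duration_min"]
    .agg(mean="mean", median="median", trips="count")
    .reindex(range(24))
)

MIN_TRIPS = 30  # hypothetical cutoff below which an hourly mean is treated as noisy
hourly["reliable"] = hourly["trips"].fillna(0) >= MIN_TRIPS
print(hourly.dropna(subset=["trips"]))
```

Hours flagged as not reliable are the ones where an apparent peak in average duration may be driven by a few unusually long rides.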

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact -

Yes. Knowing when longer trips occur lets the company:

* Plan dynamic pricing or promotions during low-usage hours.
* Support fleet rebalancing efforts by anticipating when longer trips are likely.
* Design personalized offers based on user trip patterns (e.g., incentives for off-peak travel).

Negative Insight -

Possibly. If long trips during odd hours go unaddressed (e.g., no rebalancing or bike availability), they may lead to stockouts or delays, harming user satisfaction. These need to be monitored and managed proactively.



#### Chart - 18 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Selecting only numeric columns for correlation
numeric_cols = df[['duration_min', 'age', 'start_hour']]

# Correlation matrix
corr_matrix = numeric_cols.corr()

# Plot using seaborn
plt.figure(figsize=(14, 5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')   # Create heatmap
# Add title
plt.title('Correlation Heatmap', fontweight = 'bold', fontsize = 16)
# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is chosen to visualize the strength and direction of linear relationships between multiple numerical variables. It gives an immediate overview of which features are potentially influencing each other, and how strong that influence is.

##### 2. What is/are the insight(s) found from the chart?

* `duration_min` vs `age`: weak positive correlation. Older users don't take significantly longer or shorter trips.

* `duration_min` vs `start_hour`: near-zero correlation. There is no linear relationship between trip duration and hour of day; because the coefficient only measures linear association, this does not rule out a non-linear hourly pattern.

* `age` vs `start_hour`: slight negative correlation. Older users tend to start trips marginally earlier in the day, but the effect is very small.

Overall, none of the numerical variables shows a strong linear relationship with any other.
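
A near-zero correlation coefficient only says there is no linear trend. As a hedged illustration on synthetic data (the U-shaped pattern below is constructed for the example and is not taken from the Ford GoBike data), a variable can depend strongly on the hour while its Pearson correlation with the hour stays close to zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic U-shape: longer trips at both ends of the day, shorter around midday
hours = np.tile(np.arange(24), 50)
duration = 10 + 0.1 * (hours - 11.5) ** 2 + rng.normal(0, 1, size=hours.size)
demo = pd.DataFrame({"start_hour": hours, "duration_min": duration})

# Linear association between hour and duration
pearson = demo["start_hour"].corr(demo["duration_min"], method="pearson")

# Spread of the hourly means, which captures the non-linear pattern
hourly_mean = demo.groupby("start_hour")["duration_min"].mean()
hourly_range = hourly_mean.max() - hourly_mean.min()

print(round(pearson, 3), round(hourly_range, 1))
```

This is why a grouped view, such as average duration per hour, is a useful complement to the heatmap when one of the variables is cyclical.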

#### Chart - 19 - Pair Plot

In [None]:
# Pair Plot visualization code
# Sample the dataset to keep the pairplot fast and readable
sample_df = df[['duration_min', 'age', 'start_hour', 'member_gender']].dropna().sample(1000, random_state=42)
# sns.pairplot builds its own figure, so a separate plt.figure() call would only create an empty one
# Color the points by gender
sns.pairplot(sample_df, hue='member_gender', diag_kind='kde', palette='husl')
# Add title
plt.suptitle('Pairplot of Duration, Age, and Start Hour by Gender', y=1.02, fontweight = 'bold', fontsize = 16)
# Show plot
plt.show()


##### 1. Why did you pick the specific chart?

A pairplot is ideal when exploring pairwise relationships between multiple numerical features, especially when segmented by a categorical feature like gender. It allows us to visually assess distributions and potential correlations in a compact matrix form, while also comparing patterns across genders.

##### 2. What is/are the insight(s) found from the chart?

* Trip Duration vs Age: Both males and females show a wide spread of trip durations across age groups, with no visible strong trend, suggesting duration is not strongly age-dependent.

* Start Hour vs Age: No clear pattern, indicating trip start times are spread fairly evenly across age groups.

* Duration vs Start Hour: There's a slight concentration of shorter trips during peak hours for both genders, with no marked difference between them.

* Distribution Differences: The distributions of age and start hour are fairly consistent across genders. Trip duration appears slightly more varied for males, but the plot uses a 1,000-row sample, so small differences in spread should be read with caution.
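
Because the pair plot is drawn from a single 1,000-row sample, small differences in spread between groups can change from one draw to the next. The sketch below illustrates the idea on a synthetic population in which both groups share the same duration distribution by construction; the 75/25 split and all values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic population: identical duration distribution for both groups
population = pd.DataFrame({
    "member_gender": rng.choice(["Male", "Female"], size=20000, p=[0.75, 0.25]),
    "duration_min": rng.lognormal(mean=2.2, sigma=0.6, size=20000),
})

# Repeat the 1,000-row draw with different seeds and record the std per gender
records = []
for seed in range(20):
    sample = population.sample(1000, random_state=seed)
    records.append(sample.groupby("member_gender")["duration_min"].std())
std_by_seed = pd.DataFrame(records).reset_index(drop=True)

# Share of draws in which the male spread exceeds the female spread
male_wider_share = (std_by_seed["Male"] > std_by_seed["Female"]).mean()
print(std_by_seed.describe().loc[["min", "max"]])
print(male_wider_share)
```

If the ordering of the two spreads flips between seeds, the difference seen in any one pair plot is probably sampling noise; computing the statistic on the full data avoids the issue.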


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Focus on Subscriber Retention & Growth**

Since Subscribers form the majority of the user base and consistently show higher ride frequency and shorter trip durations (indicating utility usage), prioritize:

* Retention strategies: loyalty programs, renewal discounts, referral incentives.
* Conversion of casual customers to subscribers through trial passes and promotions.

**Optimize Bike Availability by Time and Location**

The analysis showed peak usage during morning and evening commute hours and at specific high-traffic stations. Use this to:

* Strategically rebalance bikes across top start/end stations.
* Implement dynamic reallocation schedules aligned with hourly usage trends.

**Enhance Targeting Based on Gender & Age Segments**

Though trip patterns across gender and age didn't show strong variation, segment-level marketing can still be valuable:

* Offer youth-focused campaigns and student discounts (as younger users have relatively higher usage).
* Develop safety or comfort features that appeal to female riders, encouraging more balanced usage.

**Improve Trip Experience Through App Features**

Since trip durations are generally short and predictable:

* Use in-app tools to recommend start/end stations based on user location and availability.
* Provide real-time updates on station capacity to avoid frustration and missed trips.

**Seasonal Promotions and Off-Peak Incentives**

Usage is lower in certain months or non-peak hours:

* Introduce seasonal offers or "Happy Hour" discounts to encourage off-peak usage.
* Partner with local events or businesses to bundle ride offers during less busy times.

**Monitor Underperforming Stations**

Identify stations with consistently low usage and:

* Investigate accessibility issues.
* Consider relocating those stations or marketing them better.
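
The rebalancing and low-usage recommendations above could start from a simple per-station tally. The following is a minimal sketch on synthetic trips; the column names `start_station_name` and `end_station_name` are assumed here rather than taken from this section, and the 10% cutoff is a hypothetical choice.

```python
import pandas as pd

# Synthetic trips between four stations (illustrative only)
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C", "D"],
    "end_station_name":   ["B", "B", "C", "A", "C", "C", "A"],
})

departures = trips["start_station_name"].value_counts()
arrivals = trips["end_station_name"].value_counts()

# Align both counts on the station name; stations missing from one side get 0
stations = (
    pd.DataFrame({"departures": departures, "arrivals": arrivals})
    .fillna(0)
    .astype(int)
)

# Positive net_flow: bikes accumulate at the station; negative: the station drains
stations["net_flow"] = stations["arrivals"] - stations["departures"]
stations["total"] = stations["arrivals"] + stations["departures"]

# Hypothetical cutoff: flag the least-used 10% of stations for review
low_usage = stations[stations["total"] <= stations["total"].quantile(0.10)]

print(stations.sort_values("net_flow"))
print(low_usage)
```

Stations with a strongly negative `net_flow` are candidates for receiving bikes during rebalancing, while those in `low_usage` are the ones to check for accessibility or visibility problems.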

# **Conclusion**

The Ford GoBike analysis uncovered clear patterns in user behavior and trip usage. Subscribers are the primary users, mostly riding during weekday peak hours for commuting, while customers prefer weekends and longer rides, indicating recreational use.

Peak usage occurs during rush hours, and certain stations consistently experience high demand, guiding better bike distribution strategies. Young male adults (25-35) dominate ridership, suggesting a target demographic for marketing efforts.

While correlations between numeric variables were minimal, categorical insights such as user type, day of week, and gender provided stronger patterns for business action.

These insights can help Ford GoBike enhance operations, increase customer satisfaction, and grow ridership by optimizing availability, targeting promotions, and improving service in high-demand areas.
