<a href="https://colab.research.google.com/github/thirisha216055/Ford-GoBike-EDA/blob/main/Ford_bike_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Ford GoBike EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Thirisha B


# **Project Summary -**

The objective of this project was to perform an exploratory data analysis (EDA) on the Ford GoBike dataset to understand user behavior, trip durations, and seasonal trends. After cleaning the data and engineering new features like month, weekday, and hour, we analyzed patterns across various dimensions. We found that the average trip duration is around 12.5 minutes, with customers (casual riders) generally taking longer trips than subscribers (commuters). Most trips occur during weekday commuting hours (morning and evening).

The month-level comparison showed little variation in trip duration, but the file analyzed here (`201801-fordgobike-tripdata.csv`) appears to cover only January 2018, so this analysis cannot establish whether seasonality affects trip behavior; that would require trip data from additional months. The findings do highlight clear differences between user types, which points to two business opportunities: marketing aimed at converting casual customers into long-term subscribers, and optimizing service during peak commuting times.
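
Whether a month-level comparison can say anything about seasonality depends on how many calendar months the data actually spans. The check below is a minimal sketch on made-up timestamps (the `toy` frame is hypothetical); on the real data the same steps would be applied to the notebook's `start_time` column after it has been converted to datetime.

```python
import pandas as pd

# Hypothetical toy timestamps standing in for the real `start_time` column
toy = pd.DataFrame({'start_time': pd.to_datetime([
    '2018-01-02 08:15:00', '2018-01-15 17:40:00', '2018-01-31 23:05:00',
])})

# Count trips per calendar month; a single row means no seasonal comparison is possible
trips_per_month = toy['start_time'].dt.to_period('M').value_counts().sort_index()
print(trips_per_month)

n_months = len(trips_per_month)
print(f"Distinct months covered: {n_months}")
```

If only one month is present, any "by month" average is simply the overall average for that month.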

# **GitHub Link -**

https://github.com/thirisha216055/Ford-GoBike-EDA

# **Problem Statement**


The goal of this project is to analyze the Ford GoBike trip data to uncover insights about user behavior, trip durations, and usage patterns. Specifically, we aim to determine how trip duration varies across different user types (Subscribers vs Customers), time periods (hours, weekdays, months), and whether seasonal trends significantly affect trip behavior. By answering these questions through Exploratory Data Analysis (EDA), we can help the company optimize service operations, improve marketing strategies, and enhance user engagement.

#### **Define Your Business Objective?**

The business objective is to leverage insights from the Ford GoBike trip data to improve customer retention, optimize bike availability during peak hours, and design targeted marketing strategies that encourage casual customers to become long-term subscribers, ultimately boosting ridership and revenue.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart is followed by answers to the questions below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset (January 2018 Ford GoBike trip data)
df = pd.read_csv('201801-fordgobike-tripdata.csv')


### Dataset First View

### Dataset Rows & Columns count

In [None]:
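# Dataset Rows & Columns count, column info, summary statistics, and first rows

print(df.shape)
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print(df.describe())
df.head()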


### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Checking for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"✅ Number of duplicate rows in the dataset: {duplicate_rows}")
# Display duplicate rows
duplicates = df[df.duplicated()]
duplicates


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing/null values per column (uses the `df` loaded above)
missing_values = df.isnull().sum()

# Display the count of missing/null values per column
print("Missing Values Count per Column:")
print(missing_values)


In [None]:
# Visualizing the missing values
# Heatmap of the null mask (uses the `df` loaded above):
# each highlighted cell marks a missing value in that row and column
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)

# Display the plot
plt.title('Missing Values Heatmap')
plt.show()


### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
print(df.describe())

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Count distinct values per column (uses the `df` loaded above)
unique_counts = df.nunique()

# Display
print(unique_counts)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# 4. Cleaning Data
# Convert start_time and end_time to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])

# Extract new features
df['month'] = df['start_time'].dt.month
df['weekday'] = df['start_time'].dt.day_name()
df['hour'] = df['start_time'].dt.hour

# Drop missing values in important columns
df = df.dropna(subset=['member_birth_year', 'member_gender'])

# Remove extreme trip durations (> 50000 seconds)
df = df[df['duration_sec'] < 50000]
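
The guidelines above mention exception handling and code that runs end to end. One way to get closer to that is to wrap the cleaning steps in a function that validates its input. The sketch below is only an illustration: `clean_trips` and the `toy` frame are hypothetical names, and unlike the cell above it does not drop rows with missing `member_birth_year` or `member_gender`.

```python
import pandas as pd

def clean_trips(raw: pd.DataFrame, max_duration_sec: int = 50000) -> pd.DataFrame:
    """Return a cleaned copy of the trip data with month, weekday and hour columns."""
    required = {'duration_sec', 'start_time', 'end_time'}
    missing = required - set(raw.columns)
    if missing:
        raise KeyError(f"Missing required columns: {sorted(missing)}")

    out = raw.copy()
    # errors='coerce' turns unparseable timestamps into NaT instead of raising
    out['start_time'] = pd.to_datetime(out['start_time'], errors='coerce')
    out['end_time'] = pd.to_datetime(out['end_time'], errors='coerce')
    out = out.dropna(subset=['start_time', 'end_time'])

    # Time-based features derived from the trip start
    out['month'] = out['start_time'].dt.month
    out['weekday'] = out['start_time'].dt.day_name()
    out['hour'] = out['start_time'].dt.hour

    # Keep only trips shorter than the duration cutoff
    return out[out['duration_sec'] < max_duration_sec].reset_index(drop=True)

# Hypothetical toy rows: one valid trip, one over the cutoff, one with a bad timestamp
toy = pd.DataFrame({
    'duration_sec': [600, 90000, 300],
    'start_time': ['2018-01-31 08:15:00', '2018-01-30 09:00:00', 'not a date'],
    'end_time': ['2018-01-31 08:25:00', '2018-01-31 10:00:00', '2018-01-29 10:05:00'],
})
cleaned = clean_trips(toy)
print(cleaned[['duration_sec', 'month', 'weekday', 'hour']])
```

Working on a copy leaves the raw frame untouched, so later cells would not need to reload the CSV to get back to the original data.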

Distribution of trip duration

In [None]:
# 5.1 Distribution of Trip Duration
plt.figure(figsize=(10,6))
sns.histplot(df['duration_sec']/60, bins=100, kde=True)  # minutes
plt.title('Distribution of Trip Duration (minutes)')
plt.xlabel('Duration (minutes)')
plt.ylabel('Count')
plt.show()


User Type Distribution

In [None]:
# 5.2 User Type Distribution
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='user_type')
plt.title('User Type Distribution')
plt.xlabel('User Type')
plt.ylabel('Count')
plt.show()


Trips Per Month

In [None]:
# 5.3 Trips Per Month
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='month', palette='Set2')
plt.title('Trips per Month')
plt.xlabel('Month')
plt.ylabel('Number of Trips')
plt.show()


Trips Per Weekday

In [None]:
# 5.4 Trips Per Weekday
plt.figure(figsize=(8,5))
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.countplot(data=df, x='weekday', order=order)
plt.title('Trips per Weekday')
plt.xlabel('Weekday')
plt.ylabel('Number of Trips')
plt.xticks(rotation=45)
plt.show()


Time-Based Patterns

In [None]:
# 7.1 Trips Per Hour of Day
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='hour', palette='coolwarm')
plt.title('Trips per Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Number of Trips')
plt.show()


Answering Specific Questions

In [None]:
# 8.1 Average trip duration
avg_duration_sec = df['duration_sec'].mean()
avg_duration_min = avg_duration_sec / 60
print(f"✅ Average trip duration is {avg_duration_min:.2f} minutes.")


In [None]:
# 8.2 Trip Duration by Month (Season Effect)
trip_duration_by_month = df.groupby('month')['duration_sec'].mean() / 60  # in minutes
print("\n✅ Average Trip Duration (minutes) by Month:\n", trip_duration_by_month)


In [None]:
# 8.3 Trip Duration by User Type
trip_duration_by_user = df.groupby('user_type')['duration_sec'].mean() / 60  # in minutes
print("\n✅ Average Trip Duration (minutes) by User Type:\n", trip_duration_by_user)


In [None]:
print("\n--- SUMMARY ---")
print(f"1. The average trip duration is approximately {avg_duration_min:.2f} minutes.")
print("2. The file appears to cover a single month (January 2018), so seasonal effects cannot be assessed here.")
print("3. Customers have longer trip durations than Subscribers.")
print("4. Most trips occur during commuting hours (7-9 AM and 4-6 PM).")


### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Load the dataset
df = pd.read_csv('201801-fordgobike-tripdata.csv')  # Replace with your correct file path

# Step 3: Chart 1 - Trip Duration Distribution
plt.figure(figsize=(10,6))
sns.histplot(df['duration_sec']/60, bins=100, kde=True, color='skyblue')  # Converting seconds to minutes
plt.title('Distribution of Trip Duration (in Minutes)', fontsize=16)
plt.xlabel('Trip Duration (Minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()


##### 1. Why did you pick the specific chart?

The histogram of trip duration was chosen because it effectively shows the overall distribution of ride times in the dataset. A histogram provides a clear visual understanding of how often trips of different lengths occur, helping to easily detect patterns, peaks, and outliers in trip durations.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we observe that the majority of trips are relatively short, mostly under 15 minutes. There are very few long-duration trips, and some extreme outliers where trips last several hours. This suggests that most users prefer short commutes or quick rides rather than long-duration cycling.
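
A statement such as "mostly under 15 minutes" can be backed by a number rather than read off the histogram. The snippet below is a minimal sketch on made-up durations; `share_below` and `toy_sec` are hypothetical names, and on the real data the function would be called with the `duration_sec` column.

```python
import pandas as pd

def share_below(durations_sec: pd.Series, minutes: float) -> float:
    """Fraction of trips strictly shorter than the given number of minutes."""
    return float((durations_sec / 60 < minutes).mean())

# Hypothetical toy durations in seconds (5 minutes up to 10 hours)
toy_sec = pd.Series([300, 420, 540, 660, 780, 840, 1200, 2400, 7200, 36000])

share_15 = share_below(toy_sec, 15)
share_30 = share_below(toy_sec, 30)
print(f"Under 15 minutes: {share_15:.0%}")
print(f"Under 30 minutes: {share_30:.0%}")
```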

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact. Knowing that most users take short trips allows the company to optimize bike redistribution strategies, ensuring that bikes are available where short-distance riders need them most. It can also help in designing short-ride promotional packages or commuter-focused subscription plans, improving customer satisfaction and increasing ridership.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart 2: Number of Trips by User Type

plt.figure(figsize=(8,5))
sns.countplot(data=df, x='user_type', palette='pastel')
plt.title('Number of Trips by User Type', fontsize=16)
plt.xlabel('User Type', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


##### 1. Why did you pick the specific chart?

A count plot was chosen to compare the number of trips between different user types (Subscribers vs Customers) because it is the simplest and clearest way to show categorical comparisons. It allows us to immediately understand which user group is more active in using the service.

##### 2. What is/are the insight(s) found from the chart?

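The chart shows that the number of trips made by Subscribers is significantly higher than the number made by Customers. This indicates that most trips are taken by long-term subscribers rather than by one-time or casual riders. Because each row in the dataset is a trip rather than a rider, the chart shows each group's share of usage, not its share of individual users.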
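
The comparison between user types can be summarized in one small table of trip counts, trip shares, and typical durations. This is a sketch on invented rows (the `toy` frame is hypothetical); the median is used for duration here because a few very long trips can pull the mean upward.

```python
import pandas as pd

# Hypothetical toy trips: six subscriber trips and two customer trips
toy = pd.DataFrame({
    'user_type': ['Subscriber'] * 6 + ['Customer'] * 2,
    'duration_sec': [300, 360, 420, 480, 540, 600, 1200, 1800],
})

# Trip count and median duration per user type
summary = toy.groupby('user_type')['duration_sec'].agg(trips='count', median_sec='median')
summary['median_min'] = summary['median_sec'] / 60
summary['share'] = summary['trips'] / summary['trips'].sum()
print(summary[['trips', 'share', 'median_min']])
```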



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are valuable for business. Knowing that subscribers dominate trip usage can help the company focus on strategies to retain subscribers while also designing offers or incentives to convert more casual customers into loyal subscribers, thereby increasing long-term revenue.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Chart 3: Number of Trips by Day of the Week

# First, create a new 'day_of_week' column if not already present
df['start_time'] = pd.to_datetime(df['start_time'])
df['day_of_week'] = df['start_time'].dt.day_name()

# Now plot
plt.figure(figsize=(10,6))
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.countplot(data=df, x='day_of_week', order=order, palette='Set2')
plt.title('Number of Trips by Day of the Week', fontsize=16)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A count plot by day of the week was chosen to understand how bike usage varies across different days. Analyzing usage patterns by weekdays and weekends helps in identifying peak demand periods, which is crucial for operational planning and customer satisfaction.



##### 2. What is/are the insight(s) found from the chart?

The chart reveals that the number of trips is higher on weekdays (especially Tuesday, Wednesday, and Thursday) compared to weekends. This suggests that a majority of users rely on the bike service for weekday commuting, likely to and from work or school.
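
Raw counts per weekday name can be skewed by the calendar: January 2018 has five Mondays, Tuesdays, and Wednesdays but only four of each other day. Assuming the file spans exactly that month, averaging trips per calendar date gives a fairer comparison. The sketch below uses made-up timestamps (the `toy` frame is hypothetical).

```python
import pandas as pd

# Hypothetical toy trips: two Mondays (3 trips and 1 trip) and one Saturday (2 trips)
toy = pd.DataFrame({'start_time': pd.to_datetime([
    '2018-01-01 08:00', '2018-01-01 09:00', '2018-01-01 17:30',
    '2018-01-08 08:10',
    '2018-01-06 11:00', '2018-01-06 14:00',
])})
toy['date'] = toy['start_time'].dt.date
toy['weekday'] = toy['start_time'].dt.day_name()

# Raw counts favour weekdays that occur more often in the period
raw_counts = toy['weekday'].value_counts()

# Trips per calendar date, then averaged within each weekday name
trips_per_date = toy.groupby(['weekday', 'date']).size()
avg_per_weekday = trips_per_date.groupby(level='weekday').mean()

print(raw_counts)
print(avg_per_weekday)
```

In this toy example Monday has twice as many raw trips as Saturday, yet the average per date is the same for both.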



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are highly useful. Knowing that weekdays see higher usage can help the company allocate more bikes and resources during busy commuting days. Additionally, the company can design weekend promotions to encourage more weekend usage, balancing out the demand and improving overall service utilization.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Chart 4: Number of Trips by Hour of the Day

# Extract hour from 'start_time'
df['hour_of_day'] = df['start_time'].dt.hour

# Now plot
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='hour_of_day', palette='coolwarm')
plt.title('Number of Trips by Hour of the Day', fontsize=16)
plt.xlabel('Hour of Day (0-23)', fontsize=12)
plt.ylabel('Number of Trips', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


##### 1. Why did you pick the specific chart?

A count plot by the hour of the day was selected to analyze the daily trip distribution patterns. It is important to understand at what times users are most active so that the company can optimize bike availability and staffing to match the demand throughout the day.

##### 2. What is/are the insight(s) found from the chart?

The chart shows two distinct peaks: one in the morning (around 7–9 AM) and another in the evening (around 4–6 PM). These peaks suggest that users primarily use the service for commuting to and from work or school during rush hours.
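
The commuting interpretation can be checked by splitting the hourly counts into weekdays and weekends: if the peaks come from commuters, they should appear mainly on weekdays. The following is a rough sketch on invented timestamps; the `toy` frame and the `day_type` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy trips: a Tuesday with rush-hour trips and a Saturday with midday trips
toy = pd.DataFrame({'start_time': pd.to_datetime([
    '2018-01-02 08:05', '2018-01-02 08:40', '2018-01-02 12:00',
    '2018-01-02 17:15', '2018-01-02 17:50',
    '2018-01-06 11:00', '2018-01-06 11:30', '2018-01-06 14:00',
])})
toy['hour'] = toy['start_time'].dt.hour
# dayofweek: Monday = 0 ... Sunday = 6
toy['day_type'] = toy['start_time'].dt.dayofweek.map(
    lambda d: 'weekend' if d >= 5 else 'weekday'
)

# Rows are hours, columns are day types, values are trip counts
hourly = toy.groupby(['hour', 'day_type']).size().unstack('day_type', fill_value=0)

weekday_peaks = sorted(hourly['weekday'].nlargest(2).index)
weekend_peak = hourly['weekend'].idxmax()
print(hourly)
print("Weekday peak hours:", weekday_peaks)
print("Weekend peak hour:", weekend_peak)
```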



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights are very valuable for business. Understanding peak usage hours helps the company ensure that bikes are well-distributed and available at the right locations during critical commute times. It can also guide dynamic pricing strategies or promotional offers during off-peak hours to boost usage throughout the day.



#### Chart - 5

In [None]:
# Chart - 5 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data (you can skip this if already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Plotting gender breakdown
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='member_gender', order=df['member_gender'].value_counts().index, palette='Set2')
plt.title('Trip Count by Gender')
plt.xlabel('Gender')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0)
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Find the top 10 start stations
top_start_stations = df['start_station_name'].value_counts().head(10)

# Plotting
plt.figure(figsize=(10,6))
sns.barplot(x=top_start_stations.values, y=top_start_stations.index, palette='viridis')
plt.title('Top 10 Start Stations by Number of Trips')
plt.xlabel('Number of Trips')
plt.ylabel('Start Station')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar plot because it clearly displays the number of trips starting at the most popular stations, especially when station names are long.
Bar plots are ideal for comparing categorical data like station names, making it easy to visually rank the top locations.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

* A small group of start stations accounts for a large volume of trips.
* Certain stations are extremely popular hubs, while others have much lower usage.
* Popular stations could be located in high-density or high-traffic areas like downtowns, near public transit, or tourist spots.
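
How concentrated the demand is can be expressed as the share of all trips that start at the top N stations. The sketch below uses made-up station labels; `top_n_share` and `toy_stations` are hypothetical names, and on the real data the function would receive the `start_station_name` column.

```python
import pandas as pd

def top_n_share(stations: pd.Series, n: int = 10) -> float:
    """Share of trips (with a known station) that start at the n busiest stations."""
    counts = stations.value_counts()  # sorted descending; missing names are excluded
    return float(counts.head(n).sum() / counts.sum())

# Hypothetical toy data: station A has 5 trips, B has 3, C and D have 1 each
toy_stations = pd.Series(['A'] * 5 + ['B'] * 3 + ['C', 'D'])

share_top2 = top_n_share(toy_stations, n=2)
print(f"Top 2 stations account for {share_top2:.0%} of trips")
```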

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Yes, these insights help the business optimize bike allocation — placing more bikes at high-demand stations reduces stockouts and improves customer satisfaction.
It also informs where to expand service or invest in new stations.

Negative Growth Risk:
Over-reliance on a few stations could create operational bottlenecks (e.g., frequent rebalancing needed).
If unpopular stations are neglected, it might lead to customer dissatisfaction in less busy areas, harming overall brand perception.

Specific Reason:
Balancing bikes and planning strategic expansions based on these insights ensures better resource usage, lower costs, and higher customer loyalty.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Create a "route" column combining start and end stations
df['route'] = df['start_station_name'] + " ➔ " + df['end_station_name']

# Find the top 10 most frequent routes
top_routes = df['route'].value_counts().head(10)

# Plotting
plt.figure(figsize=(10,6))
sns.barplot(x=top_routes.values, y=top_routes.index, palette='coolwarm')
plt.title('Top 10 Most Frequent Routes')
plt.xlabel('Number of Trips')
plt.ylabel('Route (Start ➔ End)')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar plot because it clearly shows the most frequent Start ➔ End station pairs.
Routes are categorical and can have long names, so a horizontal layout makes it easier to read and compare the top routes at a glance.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

* Certain routes are used far more often than others.
* These popular routes may connect key commuting areas, public transit hubs, or popular city landmarks.
* The same start station often appears in multiple top routes, indicating important starting hubs.
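
Two details are worth checking when ranking routes. Concatenating station names yields a missing value whenever either name is missing, so those trips silently drop out of the counts, and some top "routes" may be round trips that start and end at the same station. The sketch below counts routes with a two-column `groupby` instead of a concatenated string and measures the round-trip share. It runs on invented rows; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy trips, including one row with a missing start station
toy = pd.DataFrame({
    'start_station_name': ['A', 'A', 'B', 'A', 'C', None],
    'end_station_name':   ['B', 'B', 'A', 'A', 'C', 'B'],
})

# Make the exclusion of incomplete rows explicit
valid = toy.dropna(subset=['start_station_name', 'end_station_name'])

# Count trips per (start, end) pair, most frequent first
route_counts = (
    valid.groupby(['start_station_name', 'end_station_name'])
         .size()
         .sort_values(ascending=False)
)
top_route = route_counts.index[0]

# Share of trips that return to the station they started from
round_trip_share = (valid['start_station_name'] == valid['end_station_name']).mean()

print(route_counts)
print("Most frequent route:", top_route)
print(f"Round-trip share: {round_trip_share:.0%}")
```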

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Yes, these insights help optimize bike redistribution along busy routes and improve station placement planning.
Promotions or incentives can also be offered on popular or underserved routes to boost usage.

Negative Growth Risk:
Over-focusing only on popular routes could lead to resource neglect in less-trafficked areas.
Customers in less popular locations might face shortages or poor service, leading to user dissatisfaction.

Specific Reason:
Using these insights smartly allows businesses to improve trip success rates, enhance customer experience, and expand smartly into areas with growing demand.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Convert start_time to datetime (if not already done)
df['start_time'] = pd.to_datetime(df['start_time'])

# Extract hour from start_time
df['start_hour'] = df['start_time'].dt.hour

# Count trips per hour
trips_per_hour = df['start_hour'].value_counts().sort_index()

# Plotting
plt.figure(figsize=(10,6))
sns.lineplot(x=trips_per_hour.index, y=trips_per_hour.values, marker='o')
plt.title('Number of Trips per Hour of the Day')
plt.xlabel('Hour of Day (0-23)')
plt.ylabel('Number of Trips')
plt.xticks(range(0,24))
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a line plot because it best represents continuous time-based data like trips over the course of a day.
It clearly shows peaks and valleys in bike usage across different hours, making it easy to identify patterns.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

* There are clear peak usage hours, typically in the morning (7–9 AM) and evening (5–7 PM), likely matching commuting times.
* Usage is relatively lower during midday and late at night.
* There is a consistent daily rhythm in rider behavior.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Yes, understanding peak hours helps in optimizing bike availability and staff scheduling (e.g., for bike maintenance or rebalancing).
Targeted promotions can be run during off-peak hours to increase utilization.

Negative Growth Risk:
If not enough bikes are available during peak demand times, it could lead to customer frustration and lost trips.
Poor service during these critical windows could damage user trust.

Specific Reason:
Using these insights allows businesses to improve operational efficiency, enhance rider satisfaction, and increase overall usage rates by smartly managing supply during different hours.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Convert start_time to datetime (if not already done)
df['start_time'] = pd.to_datetime(df['start_time'])

# Extract the day of the week as a name (e.g. 'Monday', 'Sunday')
df['day_of_week'] = df['start_time'].dt.day_name()

# Order the days properly
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Plotting
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='day_of_week', order=day_order, palette='pastel')
plt.title('Number of Trips by Day of the Week')
plt.xlabel('Day of Week')
plt.ylabel('Number of Trips')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a count plot (bar plot) because it clearly displays the number of trips on each day of the week, making it easy to compare weekday vs weekend usage patterns.
Bar plots work best for categorical comparisons like days.



##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

* Weekdays (Monday to Friday) usually have higher trip counts compared to weekends.
* There may be spikes midweek (e.g., Tuesday or Thursday) depending on work commuting trends.
* Weekend usage is generally lower but still significant, possibly indicating leisure trips.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
These insights help businesses tailor bike availability — ensuring more bikes and better maintenance on busy weekdays.
Marketing campaigns can be designed to boost weekend usage by promoting leisure rides, events, or discounts.

Negative Growth Risk:
If bike availability is not adjusted for weekday demand peaks, it could lead to user frustration among daily commuters.
Ignoring weekend patterns could also miss opportunities for growing casual ridership.

Specific Reason:
Understanding weekly trip trends allows for better resource management, increased customer satisfaction, and targeted promotions for both workday commuters and weekend leisure riders.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Calculate user age from birth year
current_year = 2018  # Dataset is from January 2018
df['age'] = current_year - df['member_birth_year']

# Drop any unrealistic ages (optional, for cleaner plot)
df = df[(df['age'] > 10) & (df['age'] < 90)]

# Plotting
plt.figure(figsize=(10,6))
sns.histplot(df['age'], bins=20, kde=True, color='skyblue')
plt.title('User Age Distribution')
plt.xlabel('Age')
plt.ylabel('Number of Trips')  # each row is a trip, not a unique rider
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram because age is a continuous numerical variable, and histograms are ideal for showing distribution patterns across age ranges.
The histogram allows easy identification of the most common age groups among users.



##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

* Most trips are taken by riders between 20 and 40 years old (each row is a trip, so the histogram counts trips rather than unique riders).
* The number of trips drops sharply after around age 50.
* The service primarily appeals to young and middle-aged adults.
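
The age pattern can be quantified by grouping ages into bands and computing each band's share of trips. This is a minimal sketch on made-up ages; the band edges and labels are illustrative assumptions, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical toy ages, one per trip
ages = pd.Series([19, 24, 28, 31, 35, 38, 42, 47, 55, 63], name='age')

# Right-closed bands: (10, 20], (20, 30], (30, 40], (40, 50], (50, 90]
bins = [10, 20, 30, 40, 50, 90]
labels = ['11-20', '21-30', '31-40', '41-50', '51-90']
age_band = pd.cut(ages, bins=bins, labels=labels)

# Share of trips per band, kept in band order rather than sorted by size
band_share = age_band.value_counts(normalize=True, sort=False)
share_21_to_40 = band_share['21-30'] + band_share['31-40']

print(band_share)
print(f"Trips by riders aged 21-40: {share_21_to_40:.0%}")
```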

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Yes, knowing the dominant age group helps the business target marketing campaigns more precisely (e.g., focusing on young professionals or university students).
It can also help design age-appropriate services like faster app signup or commuter-based features.

Negative Growth Risk:
Ignoring older age groups could limit market expansion opportunities.
If safety and accessibility are not improved, older potential users may never adopt the service.

Specific Reason:
Insights from age distribution allow businesses to strengthen their brand appeal among current major user groups while also exploring new demographics for growth.



#### Chart - 11

In [None]:
# Chart - 11 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Trip duration is in seconds. Let's convert it to minutes for better readability.
df['trip_duration_min'] = df['duration_sec'] / 60

# Optional: Filter out extreme outliers for cleaner plot (e.g., trips longer than 60 minutes)
df_filtered = df[df['trip_duration_min'] <= 60]

# Plotting
plt.figure(figsize=(10,6))
sns.histplot(df_filtered['trip_duration_min'], bins=30, color='coral', kde=True)
plt.title('Trip Duration Distribution (up to 60 minutes)')
plt.xlabel('Trip Duration (Minutes)')
plt.ylabel('Number of Trips')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram because trip duration is a continuous numerical variable.
Histograms are ideal for understanding the frequency distribution of trip lengths and identifying typical trip times.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

* Most trips are relatively short, typically under 20 minutes.
* There is a sharp decline in the number of longer trips after 30 minutes.
* Very few users take trips longer than 45–60 minutes, especially after filtering.
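
Since the plot is restricted to trips of 60 minutes or less, it helps to report how many trips that cutoff keeps, along with a few quantiles of the unfiltered durations. The snippet is a sketch on made-up values; `durations_min` is a hypothetical stand-in for the `trip_duration_min` column.

```python
import pandas as pd

# Hypothetical toy durations in minutes, including two very long trips
durations_min = pd.Series([5, 7, 9, 11, 13, 14, 20, 40, 120, 600])

# Quartiles of the unfiltered data; quantiles are robust to the long right tail
quartiles = durations_min.quantile([0.25, 0.5, 0.75])
median_min = quartiles.loc[0.5]

# Fraction of trips retained by the 60-minute cutoff
kept_share = (durations_min <= 60).mean()

print(quartiles)
print(f"Median: {median_min:.1f} min, kept by 60-minute cutoff: {kept_share:.0%}")
```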

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Yes, knowing the typical trip duration helps in pricing strategy (e.g., setting a base price for 30-minute rides) and fleet management (e.g., quick bike turnover rates).
It can also help design membership plans based on common trip lengths.

Negative Growth Risk:
If the service pricing doesn't match the common trip behavior (e.g., very high charges after 30 minutes), it could discourage longer trips and reduce user satisfaction.

Specific Reason:
Aligning trip duration insights with pricing, bike availability, and promotions allows businesses to boost usage, maximize revenue, and improve user loyalty.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import pandas as pd
import matplotlib.pyplot as plt

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Plotting
user_counts = df['user_type'].value_counts()

plt.figure(figsize=(8,8))
plt.pie(user_counts, labels=user_counts.index, autopct='%1.1f%%', startangle=140, colors=['#66b3ff','#ff9999'])
plt.title('Distribution of User Types')
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular.
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pie chart because it is perfect for showing the proportional breakdown of a categorical variable — in this case, the User Type (Subscriber vs Customer).
Pie charts visually highlight which category dominates, making it very easy to interpret at a glance.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart:

* Subscribers (monthly or annual members) account for a large majority of trips.
* Customers (casual, pay-per-ride users) account for a much smaller proportion of trips.
* Usage therefore depends heavily on subscribers. The dataset has no revenue information, so the link to revenue is an inference rather than something the chart shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
Knowing that subscribers dominate usage allows the company to focus marketing and retention strategies on them — such as loyalty rewards, membership discounts, or referral bonuses.
It also shows that a stable recurring revenue model exists, which is very good for financial planning.

Negative Growth Risk:
However, over-relying on subscribers without growing casual riders can limit overall growth.
Casual riders represent an opportunity — especially tourists or occasional users — to expand the customer base if targeted properly (e.g., special weekend or event promotions).

Specific Reason:
Retaining subscribers is a sound priority, but growing the casual-rider segment could unlock new revenue streams and reduce dependency on a single customer group.

#### Chart - 14 - Correlation Heatmap

In [None]:
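# Correlation Heatmap visualization code
# Build the numeric frame in this cell; pairplot_df is only created in the Chart 15 cell below
corr_df = pd.DataFrame({
    'trip_duration_min': df['duration_sec'] / 60,
    'start_hour': pd.to_datetime(df['start_time']).dt.hour,
    'age': 2018 - df['member_birth_year'],
})
# Drop unrealistic ages, matching the filter used for the pair plot
corr_df = corr_df[(corr_df['age'] > 10) & (corr_df['age'] < 90)]
# Calculate correlation matrix
corr = corr_df.corr()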

# Plotting
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5, fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

I selected a correlation heatmap because it visualizes the strength and direction of linear relationships between several numerical variables at once.
It quickly shows which variables are positively or negatively correlated and how strongly, which can guide deeper analysis or feature selection later.

##### 2. What is/are the insight(s) found from the chart?

From the heatmap:

There is a slight negative correlation between age and trip duration, indicating that younger users tend to take slightly longer trips than older users.

There is almost no correlation between start hour and trip duration, suggesting that trip length is fairly consistent across different times of day.

Overall, the correlations are weak to moderate, meaning no two features are strongly linearly dependent on each other, and each variable brings its own information.
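
`DataFrame.corr` uses Pearson correlation by default, which only measures linear association. A rank-based Spearman correlation is a reasonable cross-check when a variable such as trip duration may be skewed. The sketch below compares both on randomly generated stand-in data; the helper name `compare_correlations` and the synthetic values are assumptions for illustration, not results from the Ford GoBike dataset.

```python
import numpy as np
import pandas as pd


def compare_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """List Pearson and Spearman coefficients for every pair of columns."""
    pearson = frame.corr(method='pearson')
    spearman = frame.corr(method='spearman')
    cols = list(frame.columns)
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            rows.append({
                'pair': f'{a} vs {b}',
                'pearson': pearson.loc[a, b],
                'spearman': spearman.loc[a, b],
            })
    return pd.DataFrame(rows).set_index('pair')


# Synthetic stand-in data (not the real trips): skewed durations, uniform hours and ages
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    'trip_duration_min': rng.lognormal(mean=2.3, sigma=0.6, size=n),
    'start_hour': rng.integers(0, 24, size=n),
    'age': rng.integers(18, 70, size=n),
})
summary = compare_correlations(demo)
print(summary.round(2))
```

If the two coefficients for a pair differ noticeably on the real data, that would hint at a monotonic but non-linear relationship or at the influence of outliers.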

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data (if not already loaded)
df = pd.read_csv('201801-fordgobike-tripdata.csv')

# Prepare a smaller dataframe for pairplot (only numeric features)
df['start_time'] = pd.to_datetime(df['start_time'])
df['start_hour'] = df['start_time'].dt.hour
df['age'] = 2018 - df['member_birth_year']
df['trip_duration_min'] = df['duration_sec'] / 60

# Selecting relevant columns
pairplot_df = df[['trip_duration_min', 'start_hour', 'age']]

# Optional: Clean unrealistic ages
pairplot_df = pairplot_df[(pairplot_df['age'] > 10) & (pairplot_df['age'] < 90)]

# Plotting (palette is omitted because it only applies when a hue variable is set)
sns.pairplot(pairplot_df, diag_kind='kde', corner=True)
plt.suptitle('Pairplot of Trip Duration, Start Hour, and Age', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pair plot because it is ideal for visualizing relationships between multiple numerical variables all at once.
It helps to identify trends, clusters, correlations, and potential outliers quickly through scatter plots and distribution plots, providing a complete overview in a single figure.



##### 2. What is/are the insight(s) found from the chart?

From the pair plot:

There is a slight negative trend between age and trip duration, showing that younger users generally tend to have slightly longer trips.

The start hour does not show a strong direct relationship with trip duration, but the density curve for start hour on the diagonal suggests that peak activity happens around commute hours (morning and evening).

Age distribution is skewed toward younger adults, which aligns with the findings from previous visualizations.
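
A slight trend can be hard to judge from a dense scatter plot, so one way to check it is to compare the median trip duration across age bands. The sketch below does this with `pd.cut` on a tiny made-up frame; the band edges (10, 30, 50, 90) and the helper name `median_duration_by_age_band` are assumptions chosen for illustration.

```python
import pandas as pd


def median_duration_by_age_band(frame: pd.DataFrame, bins=(10, 30, 50, 90)) -> pd.Series:
    """Median trip duration (minutes) within each age band."""
    bands = pd.cut(frame['age'], bins=list(bins))
    return frame.groupby(bands, observed=True)['trip_duration_min'].median()


# Made-up values for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    'age': [22, 25, 28, 35, 40, 45, 55, 60, 70],
    'trip_duration_min': [14.0, 12.0, 16.0, 11.0, 10.0, 12.0, 9.0, 8.0, 10.0],
})
by_band = median_duration_by_age_band(demo)
print(by_band)
```

Applied to `pairplot_df`, a steadily decreasing median from the youngest to the oldest band would support the negative trend described above.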

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?

Focus on High-Demand Areas and Times:

Since specific routes and stations have higher usage, focus bike redistribution efforts there.

Optimize bike availability during peak hours (morning and evening commute times) to avoid lost revenue and customer frustration.

Target the Core Customer Demographic:

Most users are aged 20–40 years, mainly using bikes on weekdays.

Marketing campaigns should be focused on young professionals and daily commuters (e.g., discounts on monthly memberships).

Increase Weekend and Off-Peak Usage:

Create special promotions (e.g., group rides, weekend passes) to boost usage during lower demand times like weekends and midday hours.

Partner with tourism companies, parks, or local businesses for weekend deals.

Customize Pricing and Plans Based on Trip Behavior:

Since most trips are short (about 12.5 minutes on average), offer affordable short trip pricing options.

Encourage longer trips with discounted hourly packages or day passes.

Expand Services for Older Age Groups:

Although the core users are younger, there’s an opportunity to grow by making bikes more accessible (e.g., easier-to-ride models, clear safety info).

Inclusive marketing can help tap into older users who might want leisure rides.

Continuous Monitoring and Flexibility:

Regularly analyze trip trends (routes, durations, times) to adjust services dynamically.

Use predictive analytics to anticipate peak demands, weather-based variations, and special event impacts.
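
A simple starting point for this kind of monitoring is a table of trip counts by weekday and start hour, which makes peak periods easy to spot. The sketch below builds one with `pd.crosstab`; the sample timestamps are made up, and `trips_by_weekday_hour` is a hypothetical helper. Only the `start_time` column name comes from the dataset.

```python
import pandas as pd


def trips_by_weekday_hour(trips: pd.DataFrame) -> pd.DataFrame:
    """Count trips for each (weekday, start hour) combination."""
    start = pd.to_datetime(trips['start_time'])
    return pd.crosstab(
        start.dt.day_name(),
        start.dt.hour,
        rownames=['weekday'],
        colnames=['start_hour'],
    )


# Made-up timestamps for illustration only
demo = pd.DataFrame({'start_time': [
    '2018-01-01 08:05:00',  # Monday
    '2018-01-01 08:40:00',  # Monday
    '2018-01-01 17:15:00',  # Monday
    '2018-01-02 08:10:00',  # Tuesday
    '2018-01-06 13:30:00',  # Saturday
]})
counts = trips_by_weekday_hour(demo)

# unstack() yields a Series indexed by (start_hour, weekday); idxmax returns the busiest slot
peak_hour, peak_weekday = counts.unstack().idxmax()
print(counts)
print(peak_weekday, peak_hour)
```

Recomputing this table on each new monthly file would show whether the peak slots shift over time, which is the input bike redistribution planning needs.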

# **Conclusion**

Through detailed exploratory data analysis (EDA) of the Ford GoBike dataset, we uncovered key user behaviors and operational patterns.
The analysis showed that most bike rides are taken by users aged 20–40 years, primarily during weekday commute hours (morning and evening).
Popular routes and key hubs were identified, showing that strategic station management and bike redistribution are critical.
Additionally, trip durations are mostly short (under 20 minutes), suggesting that the current pricing model should support short, frequent rides while offering incentives for longer usage.

Based on the findings, Ford GoBike can achieve its business objectives by:

Optimizing fleet management during peak hours,

Targeting marketing towards core user demographics,

Expanding weekend and leisure usage,

Adjusting pricing plans to match trip behavior, and

Exploring new customer segments such as older riders.

By aligning operations, marketing, and pricing strategies with actual user behavior, Ford GoBike can enhance customer satisfaction, increase trip frequency, and drive sustainable business growth.
