# **Project Name**    - EDA of Ford GoBike program.



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project investigates the Ford GoBike trip dataset from January 2018, which contains detailed information about individual rides taken by users across the Bay Area. The dataset includes trip duration, start and end times, station information, user type, gender, and birth year.

The analysis begins with data cleaning and preprocessing to handle missing or inconsistent values, particularly in demographic fields. After ensuring a clean dataset, a series of visualizations and statistical summaries are produced to understand the data's structure and hidden patterns.

Initial visualizations focus on trip duration analysis, revealing that the majority of trips are short, which aligns with the typical use-case of bike sharing for last-mile commutes or short errands. Outlier filtering is used to highlight meaningful usage patterns rather than skewed data points.

A temporal analysis is then conducted, uncovering that weekdays, especially mornings and evenings during rush hours, have higher ride volumes—a strong indicator that commuters make up a significant portion of the user base. Weekends show more mid-day usage, suggesting leisure activity.

By connecting these insights, the project outlines several business recommendations: increasing bike availability during peak commuting hours, expanding stations in high-demand areas, creating incentive programs for underrepresented user groups (e.g., women or older riders), and using age/gender analytics to tailor marketing efforts.

In conclusion, this EDA project delivers a comprehensive overview of user behavior in the Ford GoBike system. The patterns revealed through the data can guide key decisions in *logistics planning, user experience optimization, marketing strategy, and overall service improvement*. While the current scope is observational, it also lays a solid foundation for future predictive modeling, such as demand forecasting or customer segmentation using classification or regression techniques.


# **Problem Statement**


To derive actionable insights that can enhance customer experience, optimize fleet distribution, and support growth strategies for the Ford GoBike program.



#### **Define Your Business Objective?**

The goal of this analysis is to explore usage patterns in the January 2018 dataset to uncover valuable insights that can inform operational decisions, marketing strategies, and infrastructure planning.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the listed questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd

### Dataset Loading

In [None]:
# Load Dataset
path="/content/201801-fordgobike-tripdata.csv"

### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv(path)
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows=df.shape[0]
columns=df.shape[1]


### Dataset Information

In [None]:
# Dataset Info
print("Number of rows=", rows)
print("Number of columns=", columns)
df.replace("NaN", pd.NA, inplace=True)
print(df.columns.tolist())
missing=df.isnull().sum()
print("")
print("Total missing values in each column in the given dataset")
print(missing)



#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate=df.duplicated().sum()
print("Total duplicate values in the given dataset:",duplicate)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Total number of missing or null values:", missing.sum())

In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title("Missing Values Heatmap")
plt.show()
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
if not missing_counts.empty:
    plt.figure(figsize=(10, 5))
    missing_counts.plot(kind='bar', color='coral')
    plt.title("Missing Values per Column")
    plt.ylabel("Number of Missing Values")
    plt.xlabel("Columns")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values found in any column!")

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("📋 Dataset Columns:")
print(df.columns.tolist())

In [None]:
# Dataset Describe
print("📊 Dataset Summary Statistics:")
print(df.describe())
print(df.describe(include='all'))


### Variables Description

*   `path`: Stores the path of the CSV file in which the dataset is stored.
*   `df`: The DataFrame that holds the dataset loaded from the CSV file.
*   `missing`: The number of missing or null values in each column; `missing_counts` keeps only the columns that have at least one missing value.
*   `duplicate`: The number of duplicate rows in the dataset.
*   `rows`: The number of rows in the dataset, taken from the shape tuple.
*   `columns`: The number of columns in the dataset, taken from the shape tuple.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Total number of unique values in the dataset in each column:")
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import numpy as np

# 1. Basic overview
print(df.info())
print(df.describe())

# 2. Convert start_time and end_time to datetime
df['start_time'] = pd.to_datetime(df['start_time'], errors='coerce')
df['end_time'] = pd.to_datetime(df['end_time'], errors='coerce')

# 3. Create new columns for better time-based analysis
df['start_date'] = df['start_time'].dt.date
df['start_hour'] = df['start_time'].dt.hour
df['start_dayofweek'] = df['start_time'].dt.day_name()
df['trip_duration_minutes'] = df['duration_sec'] / 60

# 4. Remove trips with duration less than 1 minute or more than 2 hours (outliers)
df = df[(df['trip_duration_minutes'] >= 1) & (df['trip_duration_minutes'] <= 120)]

# 5. Clean up user demographic data
# Drop rows with missing gender or birth year (optional based on analysis goal)
# .copy() avoids SettingWithCopyWarning when new columns are assigned below
df = df.dropna(subset=['member_gender', 'member_birth_year']).copy()

# 6. Convert birth year to age
df['member_age'] = 2018 - df['member_birth_year']

# 7. Filter out unrealistic ages (assuming users are between 15 and 90 years old)
df = df[(df['member_age'] >= 15) & (df['member_age'] <= 90)].copy()

# 8. Ensure user_type and member_gender are categorical
df['user_type'] = df['user_type'].astype('category')
df['member_gender'] = df['member_gender'].astype('category')

# 9. Reset index after filtering
df = df.reset_index(drop=True)

# 10. Final dataset check
print(df.info())
print(df.head())
df

### What all manipulations have you done and insights you found?

The data wrangling process focused on cleaning and preparing the Ford GoBike dataset for analysis. First, the `start_time` and `end_time` columns were converted to datetime format. Some values were malformed (e.g., "52:35.2"), so `errors='coerce'` was used to convert invalid entries to `NaT`. The code above does not drop those rows explicitly, so any `NaT` entries carry through as missing values in the derived time columns.

New time-based features were created, such as `start_date`, `start_hour`, and `start_dayofweek`, along with `trip_duration_minutes` for better readability. Trips shorter than 1 minute or longer than 120 minutes were removed to exclude outliers.

We cleaned demographic data by dropping rows with missing values in `member_gender` and `member_birth_year`. Using the birth year, we calculated `member_age` and filtered for realistic ages (15–90). Categorical columns like `user_type` and `member_gender` were converted to category types for efficiency.

These manipulations ensured a clean, consistent, and analyzable dataset, ready for visualization and business insight generation.
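The wrangling steps described above can be sketched on a tiny, self-contained example. This is only an illustration under stated assumptions: the sample rows are invented, `clean_trips` is a hypothetical helper name, the explicit timestamp `format` is an assumption about how the strings look, and rows with `NaT` timestamps are dropped explicitly here.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
sample = pd.DataFrame({
    "duration_sec": [30, 600, 900, 9000, 480],
    "start_time": [
        "2018-01-31 22:52:35", "2018-01-31 16:13:34", "52:35.2",
        "2018-01-30 08:05:00", "2018-01-29 17:45:10",
    ],
    "member_birth_year": [1986.0, 1990.0, 1975.0, 1988.0, None],
    "member_gender": ["Male", "Female", "Male", "Female", None],
})


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Apply the same cleaning rules as the wrangling cell (hypothetical helper)."""
    out = trips.copy()
    # Malformed timestamps become NaT instead of raising an error.
    out["start_time"] = pd.to_datetime(
        out["start_time"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )
    # Assumption: drop rows with an unparseable timestamp or missing demographics.
    out = out.dropna(subset=["start_time", "member_gender", "member_birth_year"])
    # Keep trips between 1 and 120 minutes (both bounds inclusive).
    out = out.assign(trip_duration_minutes=out["duration_sec"] / 60)
    out = out[out["trip_duration_minutes"].between(1, 120)]
    # Derive age from birth year and keep plausible ages only.
    out = out.assign(member_age=2018 - out["member_birth_year"])
    out = out[out["member_age"].between(15, 90)]
    return out.reset_index(drop=True)


cleaned = clean_trips(sample)
print(cleaned[["start_time", "trip_duration_minutes", "member_age"]])
```

Of the five sample rows, only the 10-minute trip survives: the others are removed for being too short, too long, malformed, or missing demographic data.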

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Chart - 1: Distribution of Trip Durations (in minutes)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(df['trip_duration_minutes'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Trip Durations (in Minutes)')
plt.xlabel('Trip Duration (minutes)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram (distribution plot) is the most effective way to visualize the spread and concentration of a continuous variable like trip duration.

##### 2. What is/are the insight(s) found from the chart?

Most trips last between 5 and 15 minutes, with a steep drop after 30 minutes. This shows that the majority of users rely on the bikes for short commutes or quick errands.
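A reading like this can be backed with numbers by binning the durations and computing the share of trips in each bin. The sketch below uses invented durations and bin edges chosen only for illustration; on the real data the same call would be applied to `df['trip_duration_minutes']`.

```python
import pandas as pd

# Hypothetical trip durations in minutes (invented values, not from the dataset).
durations = pd.Series([3, 6, 8, 9, 12, 14, 22, 28, 45, 95],
                      name="trip_duration_minutes")

# Bin edges are an assumption; intervals are right-inclusive, e.g. (5, 15].
bins = [0, 5, 15, 30, 120]
labels = ["<=5", "5-15", "15-30", "30-120"]

binned = pd.cut(durations, bins=bins, labels=labels)
# Share of trips per bin, ordered by bin rather than by frequency.
duration_share = binned.value_counts(normalize=True).sort_index()
print(duration_share)
```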

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the most frequent trip durations can help in optimizing pricing plans and improving fleet availability. For instance, a flexible pricing model could incentivize short, off-peak trips.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='start_hour', data=df, palette='mako')
plt.title('Number of Trips by Hour of Day')
plt.xlabel('Start Hour')
plt.ylabel('Number of Trips')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot or line plot is ideal for temporal patterns, especially when showing how trips vary by hour. It helps in identifying peak and off-peak hours.

##### 2. What is/are the insight(s) found from the chart?

There are two clear peaks: 8 AM and 5-6 PM, aligning with commute hours. Usage dips significantly during midday and late at night.
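The peak hours can also be read off programmatically rather than by eye. A minimal sketch, assuming invented start times; on the real data the counts would come from `df['start_hour']`.

```python
import pandas as pd

# Hypothetical start times (invented values for illustration).
start_times = pd.to_datetime(pd.Series([
    "2018-01-08 08:05:00", "2018-01-08 08:40:00", "2018-01-08 08:55:00",
    "2018-01-08 12:10:00", "2018-01-08 17:15:00", "2018-01-08 17:50:00",
    "2018-01-08 18:20:00", "2018-01-08 23:30:00",
]))

# Count trips per hour and fill hours without any trips with zero.
trips_per_hour = (
    start_times.dt.hour.value_counts().reindex(range(24), fill_value=0)
)

# The two busiest hours in this sample.
peak_hours = trips_per_hour.nlargest(2).index.tolist()
print(peak_hours)
```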

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These patterns can inform fleet rebalancing, staff scheduling, and maintenance windows. Also, marketing campaigns or ride promotions can target low-usage hours.

#### Chart - 3

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
sns.countplot(x='bike_share_for_all_trip', data=df, palette='Set1')
plt.title("Bike Share for All Trip - Usage Frequency")
plt.xlabel("Bike Share for All Trip")
plt.ylabel("Number of Trips")
plt.show()

##### 1. Why did you pick the specific chart?

A count plot shows how many trips were taken under the Bike Share for All program, which helps evaluate the awareness and adoption of the program.

##### 2. What is/are the insight(s) found from the chart?

Most trips are not taken under this program, suggesting low uptake or awareness.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart points to a potential negative for the business: it highlights a missed opportunity for inclusive growth. The program needs better visibility and education, especially in lower-income or underserved areas.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8, 5))
sns.countplot(x='user_type', data=df, palette='coolwarm')
plt.title('Trip Count by User Type')
plt.xlabel('User Type')
plt.ylabel('Count')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A count plot (bar chart) is ideal for comparing categorical distributions—in this case, between subscriber and customer types.

##### 2. What is/are the insight(s) found from the chart?

Subscribers significantly outnumber casual customers, showing a stable user base of regular commuters.
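The size of the gap can be expressed as a share of trips per user type. The sketch below uses an invented 8-to-2 split purely for illustration; on the real data the same call would be applied to `df['user_type']`.

```python
import pandas as pd

# Hypothetical user types (invented split, not the dataset's actual ratio).
user_type = pd.Series(["Subscriber"] * 8 + ["Customer"] * 2, name="user_type")

# Fraction of trips made by each user type.
user_type_share = user_type.value_counts(normalize=True)
print(user_type_share)
```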

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. This can guide customer retention strategies and help tailor features for subscribers, while also exploring growth opportunities for casual customer conversion.

#### Chart - 5

In [None]:
# Chart - 5: Average Trip Duration by Gender
plt.figure(figsize=(8, 5))
sns.barplot(x='member_gender', y='trip_duration_minutes', data=df, estimator='mean', palette='Set2')
plt.title('Average Trip Duration by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Duration (minutes)')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot with group comparisons is effective for summarizing means across categories—in this case, gender vs. average trip duration.

##### 2. What is/are the insight(s) found from the chart?

There is a noticeable difference in average trip durations by gender. For instance, female riders may have slightly shorter average durations than male riders.
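Since a bar of means can hide skew and unequal group sizes, it may help to look at the mean, median, and trip count per gender side by side. A minimal sketch with invented values; on the real data the grouping would run on `df`.

```python
import pandas as pd

# Hypothetical trips (invented values for illustration).
trips = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Female", "Female", "Other"],
    "trip_duration_minutes": [10.0, 12.0, 14.0, 9.0, 13.0, 20.0],
})

# Mean, median, and number of trips per gender.
gender_summary = (
    trips.groupby("member_gender")["trip_duration_minutes"]
    .agg(["mean", "median", "count"])
)
print(gender_summary)
```

A group with very few trips (like "Other" in this sample) can show a large mean that says little about typical behavior, which is why the count is worth keeping next to the average.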

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Gender-based usage patterns can inform personalized marketing, targeted feature development (e.g., safer night-time routes), and inclusive planning to ensure equitable access.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reuse the member_age column created during data wrangling
# instead of recomputing age into a duplicate column

# Plot age distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['member_age'].dropna(), bins=30, kde=True, color='teal')
plt.title("Age Distribution of Riders")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows the age profile of riders, which aids in understanding demographics and planning tailored services.

##### 2. What is/are the insight(s) found from the chart?

Most riders are between 25-40 years old, indicating young working professionals as the primary market.
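The share of riders in an age band can be computed directly. The sketch below uses invented birth years and treats the 25-40 band as inclusive at both ends, which is an assumption about how the range is meant.

```python
import pandas as pd

# Hypothetical birth years (invented values for illustration).
birth_years = pd.Series([1990, 1985, 1978, 1993, 1960, 1999, 1988, 1982])

# Age relative to the dataset year, as in the wrangling step.
ages = 2018 - birth_years

# Fraction of riders aged 25 to 40, both bounds inclusive.
share_25_to_40 = ages.between(25, 40).mean()
print(f"Share of riders aged 25-40: {share_25_to_40:.0%}")
```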

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business impact is positive. Targeted campaigns around fitness, convenience, and environmental impact can further engage this age group. The lack of younger or older users may call for accessibility and pricing strategies aimed at them.

#### Chart - 7

In [None]:
# Chart - 7 visualization code


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select all numerical columns for correlation
# ('number' also covers int32 columns such as start_hour on recent pandas versions)
numeric_df = df.select_dtypes(include='number')

# Compute correlation matrix
correlation_matrix = numeric_df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap - Ford GoBike Dataset')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is well suited to exploring linear relationships between numerical variables in the dataset. It gives a quick overview of how strongly variables are related, both positively and negatively, and helps in feature selection for modelling.

##### 2. What is/are the insight(s) found from the chart?

`duration_sec` and `trip_duration_minutes` are perfectly correlated, as expected, since one is derived from the other. `start_hour` had weak or no correlation with duration, meaning trip length doesn't depend strongly on the time of day. `member_birth_year` showed very weak correlation with trip duration, suggesting age might not be a major factor in how long someone uses the service.
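Two points about the correlation step can be checked on a small example: a derived column correlates perfectly with its source, and selecting numeric columns by an explicit dtype list can silently miss a column stored as `int32`. The sample values below are invented, and casting `start_hour` to `int32` is an assumption made to mimic what `.dt.hour` returns on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample (invented values for illustration).
sample = pd.DataFrame({
    "duration_sec": pd.Series([300, 600, 900, 1200], dtype="int64"),
    "start_hour": pd.Series([8, 17, 12, 9], dtype="int32"),
})
sample["trip_duration_minutes"] = sample["duration_sec"] / 60

# An explicit dtype list skips the int32 column ...
narrow_cols = sample.select_dtypes(include=["int64", "float64"]).columns.tolist()
# ... while 'number' keeps every numeric dtype.
broad_cols = sample.select_dtypes(include="number").columns.tolist()

# The derived minutes column is a linear rescaling of duration_sec.
corr = sample[broad_cols].corr()
print(narrow_cols, broad_cols)
print(corr.loc["duration_sec", "trip_duration_minutes"])
```

Because one of the two duration columns carries no extra information, dropping it before plotting the heatmap or pair plot would leave more room for the remaining variables.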

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select a few relevant numerical columns for the pair plot
pairplot_cols = ['duration_sec', 'trip_duration_minutes', 'start_hour', 'member_birth_year']

# Drop rows with missing values in selected columns to avoid plot issues
pairplot_df = df[pairplot_cols].dropna()

# Create the pair plot
sns.pairplot(pairplot_df)
plt.suptitle('Pair Plot - Ford GoBike Dataset', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot shows relationships across multiple numerical variables at once using scatter plots and histograms. It helps detect patterns, clusters, and outliers, which makes it well suited to EDA (Exploratory Data Analysis) when you want to visually explore how variables interact.

##### 2. What is/are the insight(s) found from the chart?

There is a clear linear relationship between `duration_sec` and `trip_duration_minutes`. The distribution of `member_birth_year` shows most riders were born between 1980 and 1995, suggesting a younger demographic uses the service more. The `start_hour` distribution reveals common trip hours, indicating commute patterns.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. The majority of trips occur during morning and evening commuting hours. Increase bike availability and maintenance efforts around 7-9 AM and 5-7 PM, especially near major transit hubs and business districts.

2. Provide loyalty rewards, personalized offers, or subscription bundles to enhance subscriber retention and convert casual users.

3. Pursue gender-aware marketing and planning: improve lighting, route suggestions, and bike station placement in areas favored by women to improve comfort and safety.

4. Use commuting peak data and location-based trends to propose partnerships with local businesses or coffee shops near start/end stations.
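The first recommendation depends on knowing when demand peaks on weekdays versus weekends. One way to get that view is an hourly trip count split by day type, sketched below with invented start times; the `day_type` and `hour` column names are hypothetical and not part of the dataset.

```python
import pandas as pd

# Hypothetical rides (invented values). 2018-01-06 and 2018-01-07 fall on a
# weekend; 2018-01-08 and 2018-01-09 are a Monday and a Tuesday.
rides = pd.DataFrame({"start_time": pd.to_datetime([
    "2018-01-08 08:10:00", "2018-01-08 08:30:00", "2018-01-08 17:45:00",
    "2018-01-09 08:15:00", "2018-01-06 13:00:00", "2018-01-07 13:20:00",
    "2018-01-07 11:00:00",
])})

rides["hour"] = rides["start_time"].dt.hour
# dayofweek: Monday is 0, so 5 and 6 are Saturday and Sunday.
rides["day_type"] = rides["start_time"].dt.dayofweek.map(
    lambda day: "weekend" if day >= 5 else "weekday"
)

# Rows: day type, columns: hour of day, values: number of trips.
hourly_profile = (
    rides.groupby(["day_type", "hour"]).size().unstack(fill_value=0)
)
# Busiest hour for each day type.
busiest_hour = hourly_profile.idxmax(axis=1)
print(hourly_profile)
print(busiest_hour)
```

On the real data, the same table could be fed to a heatmap to show where rebalancing effort should be concentrated.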

# **Conclusion**

The dataset reveals that user behavior is heavily influenced by commuting needs, with clear demand peaks during weekday mornings and evenings. The service is predominantly used by subscribers, indicating satisfaction with ongoing plans. By making data-driven decisions, such as optimizing operational hours, targeting marketing efforts by time, gender, and day, and improving customer experience, the business can significantly enhance growth, retention, and operational efficiency.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***