Bike Sharing Dataset

Analysis prepared by Jen Kelleman

Data Science Capstone Project

Carnegie Mellon University

February 2025

Background

Bike sharing systems have become very popular in many cities, allowing people to easily rent a bike from one location and return it at another. These systems are praised for their benefits in reducing traffic, improving the environment, and promoting health. Additionally, they collect a lot of user data, making them valuable for studying city mobility patterns.

What's cool about bike sharing systems is the data they generate. Unlike buses or subways, they record the exact travel duration and positions. This makes them like a virtual sensor network for city mobility. By monitoring this data, we can detect important events in the city.

My project focuses on a bike sharing system in Washington, D.C., with records of bike trips in hourly intervals over a two-year period (2011-2012). The data includes details about each interval, such as weather conditions, the day of the week, temperature, humidity, and windspeed. The main goal is to analyze how the number of bike users has changed over time and how environmental factors influence bike usage.

Research questions

Based on my research goals, I'm interested in exploring the following research questions:

1. Environmental and seasonal factors

How do environmental and seasonal factors influence the hourly and daily bike rental counts?

Can we predict the number of bike rentals based on weather conditions, holidays, and other temporal factors?

Is there a correlation between major cultural events in Washington, D.C., and the number of bike rentals?

What patterns can be observed in the bike rental data across different times of the day, days of the week, and months of the year?

2. Bike-rental status

How do casual and registered users differ in their bike rental behaviors?

These questions aim to analyze and predict bike rental trends, considering various influencing factors.

Associated tasks

1. Regression:

Prediction of the hourly or daily bike rental count based on environmental and seasonal settings.

2. Event and Anomaly Detection:

Hypothesis to be tested: The count of rented bikes is potentially correlated with major cultural events in the city, which are easily verifiable with search engines. For example, the query "National Cherry Blossom Festival in DC" returns "March 26-April 10". The following lists highlight the top 100 most important events for each year:

For 2011: https://www.bizbash.com/bizbash-lists/media-gallery/13478255/washingtons-top-100-events-of-2011

For 2012: https://www.bizbash.com/bizbash-lists/top-100-events/top-list/13230517/washingtons-top-100-events-2012

Therefore the data can be used for validation of event or anomaly detection algorithms as well.
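One way to start testing this hypothesis is to flag days whose rental count is unusually far from typical and then compare the flagged dates against the event lists. The following is a minimal sketch in base R, not the method used in this project. The column names `date` and `count`, the synthetic data, the injected spike, and the threshold of 3.5 are all illustrative assumptions.

```r
# Sketch: flag unusual days with a robust z-score (median / MAD).
# Column names, synthetic data, and the threshold are assumptions.
set.seed(42)
daily <- data.frame(
  date  = seq(as.Date("2011-03-01"), by = "day", length.out = 60),
  count = round(rnorm(60, mean = 3000, sd = 200))
)

# Inject an artificial spike to stand in for a major event day.
daily$count[27] <- 6000

flag_anomalies <- function(x, threshold = 3.5) {
  med    <- median(x)
  spread <- mad(x)              # scaled median absolute deviation
  z      <- (x - med) / spread  # robust z-score
  abs(z) > threshold
}

daily$anomaly <- flag_anomalies(daily$count)
flagged_dates <- daily$date[daily$anomaly]
print(flagged_dates)
```

The median and MAD are used instead of the mean and standard deviation so that the spike itself does not inflate the baseline it is compared against.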

Dataset characteristics

The dataset contains the hourly and daily count of rental bikes from the Capital Bikeshare system in Washington, D.C., covering the years 2011 and 2012. It includes corresponding weather and seasonal information, making it a rich source for analyzing bike rental patterns and their correlation with various factors.

The dataset is multivariate and includes 13 features such as:

Date: The date of the observation.

Year: Encoded as 0 for 2011 and 1 for 2012.

Month: Categorical values from 1 to 12.

Hour: Categorical values from 0 to 23.

Holiday: Binary value indicating whether the day is a holiday (extracted from Holiday Schedule).

Work Day: Binary value indicating whether the day is a working day (1 if the day is neither a weekend nor a holiday, otherwise 0).

Weather: Categorical values representing different weather conditions (weathersit, extracted from Freemeteo):

1: Clear, Few clouds, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

Temperature: Normalized temperature in Celsius.

Feels Like: Normalized feeling temperature in Celsius.

Humidity: Normalized humidity. The values are divided by 100 (max).

Windspeed: Normalized wind speed. The values are divided by 67 (max).

Casual: The count of casual users in a particular hour.

Registered: The count of registered users in a particular hour.
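Because humidity and windspeed are stored as fractions of their maximum, multiplying by that maximum recovers the original scale. The sketch below illustrates this in base R. The column names `hum` and `windspeed` and the sample values are assumptions, while the divisors 100 and 67 come from the feature descriptions above.

```r
# Sketch: convert normalized humidity and windspeed back to their
# original scale. Column names and sample values are assumptions;
# the maxima (100 and 67) come from the feature descriptions.
hourly <- data.frame(
  hum       = c(0.81, 0.50, 1.00),
  windspeed = c(0.00, 0.25, 1.00)
)

denormalize <- function(x, max_value) x * max_value

hourly$humidity_pct  <- denormalize(hourly$hum, 100)
hourly$windspeed_raw <- denormalize(hourly$windspeed, 67)
print(hourly)
```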

Main Conclusions and Takeaways

Summary of insights into predicting future bikeshare system usage. Knowing these insights can help to optimize bike availability and improve the service efficiency.

Time-based: Bike rental usage peaks during commuting hours, specifically 7am-9am and 5pm-7pm.

Weather-based: Higher temperatures, lower windspeed, and lower humidity all generally correspond to an increase in bike rentals.

Seasonality: Bike rental usage peaks during the warmer months (May, June, July, September).

Potential limitations: No information on the exact location of the bikeshare systems, no user type segmentation (e.g., commuters vs. leisure users), and ambiguities in the data.
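The time-based finding can be checked by averaging rentals for each hour of the day and ranking the hours. The following is a minimal sketch in base R. The column names `hr` and `cnt` are assumptions, and the synthetic data is deliberately built with artificial peaks at 8 and 17 so the example runs on its own; it does not reproduce the project's results.

```r
# Sketch: mean rentals by hour of day, then the busiest hours.
# Column names (`hr`, `cnt`) and the synthetic data are assumptions.
set.seed(1)
hours <- rep(0:23, times = 30)                  # 30 synthetic days
base  <- ifelse(hours %in% c(8, 17), 400, 100)  # artificial commute peaks
hourly <- data.frame(
  hr  = hours,
  cnt = rpois(length(hours), lambda = base)
)

# Average count for each hour of the day.
mean_by_hour <- aggregate(cnt ~ hr, data = hourly, FUN = mean)

# The two hours with the highest average count.
ranked    <- order(mean_by_hour$cnt, decreasing = TRUE)
top_hours <- mean_by_hour$hr[ranked][1:2]
print(sort(top_hours))
```

Splitting the same aggregation by working day versus non-working day would be a natural next step for separating commuter from leisure patterns.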

For future explorations, I recommend a peak hours analysis, demand forecasting, and operational adjustments.

Peak hours analysis: Conduct a detailed analysis of peak hours to understand the factors driving high demand during these times. Look at commuter patterns, in particular.

Demand forecasting: Use the hourly data to build predictive models for bike rental demand.

Operational adjustments: Investigate the operational aspects, such as bike availability and maintenance schedules, to ensure that bikes are available at peak hours.
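As a starting point for the demand forecasting recommendation, a linear regression on weather and calendar features gives a simple baseline. The sketch below uses base R only. All column names (`temp`, `hum`, `windspeed`, `workingday`, `cnt`) and the synthetic data-generating process are assumptions made so the example is self-contained; it is not the model fitted in modeling.R.

```r
# Sketch: baseline linear regression of hourly count on weather and
# calendar features. Column names and synthetic data are assumptions.
set.seed(123)
n <- 500
hourly <- data.frame(
  temp       = runif(n),           # normalized temperature
  hum        = runif(n),           # normalized humidity
  windspeed  = runif(n),           # normalized wind speed
  workingday = rbinom(n, 1, 0.7)   # 1 = working day
)
hourly$cnt <- 100 + 300 * hourly$temp - 80 * hourly$hum -
  50 * hourly$windspeed + 20 * hourly$workingday + rnorm(n, sd = 25)

# Hold out the last 100 rows to estimate out-of-sample error.
train <- hourly[1:400, ]
test  <- hourly[401:500, ]

fit  <- lm(cnt ~ temp + hum + windspeed + workingday, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$cnt - pred)^2))

print(coef(fit))
print(rmse)
```

With real data, a time-ordered split (for example, training on 2011 and testing on 2012) is usually a fairer test than a random one, and a count model such as Poisson regression may fit rental counts better than ordinary least squares.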

Appendix

To guide the data analysis, I've created additional columns such as:

Count: The count of total rental bikes including both casual and registered in a particular hour.

MDY: The month, day, and year of a particular bike-renting instance.

Season: Categorical values representing the season (1: spring, 2: summer, 3: fall, 4: winter).

Holiday Name: Categorical values representing the US government holiday name.

Day of the Week: Categorical values representing the day of the week (1: Monday, 2: Tuesday, 3: Wednesday, 4: Thursday, 5: Friday, 6: Saturday, 7: Sunday).

Cultural Event: Categorical values representing certain major cultural events held in DC.
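Some of these columns can be derived directly from the existing ones. The following is a minimal sketch in base R for Count, Day of the Week, and MDY. The input column names (`dteday`, `casual`, `registered`), the sample rows, and the text format chosen for MDY are assumptions; this is not the code in feature_engineering.R.

```r
# Sketch: derive the Count, Day of the Week, and MDY columns.
# Input column names and sample rows are assumptions.
bikes <- data.frame(
  dteday     = as.Date(c("2011-01-01", "2011-01-02", "2011-01-03")),
  casual     = c(3, 8, 5),
  registered = c(13, 32, 27)
)

# Count: total rentals from both user types.
bikes$count <- bikes$casual + bikes$registered

# Day of the Week: 1 = Monday ... 7 = Sunday.
# POSIXlt's wday runs 0 (Sunday) to 6 (Saturday), so remap Sunday to 7.
wday <- as.POSIXlt(bikes$dteday)$wday
bikes$day_of_week <- ifelse(wday == 0, 7, wday)

# MDY: month/day/year as text; the exact format is an assumption.
bikes$mdy <- format(bikes$dteday, "%m/%d/%Y")
print(bikes)
```

Using the numeric `wday` field rather than `weekdays()` keeps the result independent of the system locale.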

Files

1. Data cleaning: data_cleaning.R

2. EDA: eda.R

3. Feature engineering: feature_engineering.R

4. Modeling: modeling.R

5. Visualization: visualization.R

6. Main script: main.R
