jkelleman/rental-bike-sharing

Bike Sharing Dataset

Analysis prepared by Jen Kelleman

Data Science Capstone Project

Carnegie Mellon University

February 2025

Background

Bike sharing systems have become very popular in many cities, allowing people to easily rent a bike from one location and return it at another. These systems are praised for their benefits in reducing traffic, improving the environment, and promoting health. Additionally, they collect a lot of user data, making them valuable for studying city mobility patterns.

What's cool about bike sharing systems is the data they generate. Unlike buses or subways, they record each trip's duration along with its start and end positions. This makes them work like a virtual sensor network for city mobility. By monitoring this data, we can detect important events in the city.

My project focuses on a bike sharing system in Washington, D.C., with bike rental counts recorded at hourly intervals over a two-year period (2011-2012). The data includes details about each interval, such as weather conditions, the day of the week, temperature, humidity, and windspeed. The main goal is to analyze how the number of bike users has changed over time and how environmental factors influence bike usage.

Research questions

Based on my research goals, I'm interested in exploring the following research questions:

1. Environmental and seasonal factors
  • How do environmental and seasonal factors influence the hourly and daily bike rental counts?
  • Can we predict the number of bike rentals based on weather conditions, holidays, and other temporal factors?
  • Is there a correlation between major cultural events in Washington, D.C., and the number of bike rentals?
  • What patterns can be observed in the bike rental data across different times of the day, days of the week, and months of the year?

2. Bike-rental status
  • How do casual and registered users differ in their bike rental behaviors?

These questions aim to analyze and predict bike rental trends, considering various influencing factors.
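
As a rough illustration of the casual-versus-registered question, the sketch below averages both user types by hour of day. It is written in Python for illustration only (the repository's own scripts are R files), and the row keys `hour`, `casual`, and `registered` are assumed names, not confirmed column names from the data files.

```python
from collections import defaultdict


def mean_rentals_by_hour(rows):
    """Average casual and registered rentals for each hour of the day.

    `rows` is an iterable of dicts with the hypothetical keys
    "hour", "casual", and "registered".
    """
    sums = defaultdict(lambda: [0, 0, 0])  # hour -> [casual, registered, row count]
    for row in rows:
        acc = sums[row["hour"]]
        acc[0] += row["casual"]
        acc[1] += row["registered"]
        acc[2] += 1
    return {
        hour: {"casual": c / n, "registered": r / n}
        for hour, (c, r, n) in sorted(sums.items())
    }


# Made-up rows for illustration only; these are not values from the dataset.
toy_rows = [
    {"hour": 8, "casual": 2, "registered": 40},
    {"hour": 8, "casual": 4, "registered": 60},
    {"hour": 14, "casual": 20, "registered": 30},
]
profile = mean_rentals_by_hour(toy_rows)
```

Comparing the two averages hour by hour is one simple way to see whether the user types peak at different times of day.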

Associated tasks

1. Regression:

Prediction of the hourly or daily bike rental count based on the environmental and seasonal settings.
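
A minimal sketch of what such a regression could look like, assuming ordinary least squares on two normalized features (temperature and humidity). This is an illustrative stand-in in Python, not the method used in modeling.R; the helper names `fit_ols` and `predict` are hypothetical, and the toy data is generated from a known linear rule rather than taken from the dataset.

```python
def fit_ols(rows, targets):
    """Fit ordinary least squares with an intercept via the normal equations."""
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(coefs, row):
    """Predict a rental count from fitted coefficients and one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic (temperature, humidity) pairs; counts follow a made-up linear rule
# so the fit can be checked. These are not values from the dataset.
toy_features = [(0.2, 0.8), (0.5, 0.6), (0.7, 0.4), (0.3, 0.3), (0.9, 0.5)]
toy_counts = [10 + 200 * t - 50 * h for t, h in toy_features]
coefs = fit_ols(toy_features, toy_counts)
```

On real data, categorical features such as hour, season, and weather would need encoding before a linear fit, and a dedicated statistics library would be a more robust choice than this hand-rolled solver.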

2. Event and Anomaly Detection:

HYPOTHESIS TO BE TESTED: The count of rented bikes is potentially correlated with major cultural events in the city, which are easily verifiable with search engines. For example, the query "National Cherry Blossom Festival in DC" returns search engine results for "March 26-April 10". Here are valuable references highlighting the top 100 events of each year:

  • For 2011: https://www.bizbash.com/bizbash-lists/media-gallery/13478255/washingtons-top-100-events-of-2011
  • For 2012: https://www.bizbash.com/bizbash-lists/top-100-events/top-list/13230517/washingtons-top-100-events-2012

Therefore, the data can also be used to validate event or anomaly detection algorithms.
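
One simple way to surface candidate anomalies is a z-score over daily totals, sketched below in Python. This is an assumed baseline technique, not the project's confirmed method; the function name `flag_anomalies`, the threshold of 2.0, and the toy values are all hypothetical.

```python
from statistics import mean, pstdev


def flag_anomalies(daily_counts, threshold=2.0):
    """Return (day, count, z) for days whose |z-score| exceeds the threshold.

    `daily_counts` is a list of (day, count) pairs.
    """
    values = [count for _, count in daily_counts]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every day is identical, so nothing stands out
    flagged = []
    for day, count in daily_counts:
        z = (count - mu) / sigma
        if abs(z) > threshold:
            flagged.append((day, count, z))
    return flagged


# Made-up daily totals for illustration only; not values from the dataset.
toy_days = [(f"2011-04-{d:02d}", 100) for d in range(1, 10)]
toy_days.append(("2011-04-10", 1000))
flagged = flag_anomalies(toy_days)
```

Flagged days could then be checked by hand against the event lists. A global z-score ignores seasonality, so a rolling or per-season baseline is likely a better fit for two years of data.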

Dataset characteristics

The dataset contains the hourly and daily count of rental bikes from the Capital Bikeshare system in Washington, DC, covering the years 2011 and 2012. It includes corresponding weather and seasonal information, making it a rich source for analyzing bike rental patterns and their correlation with various factors.

The dataset is multivariate and includes 13 features:

  • Date: The date of the observation.
  • Year: Encoded as 0 for 2011 and 1 for 2012.
  • Month: Categorical values from 1 to 12.
  • Hour: Categorical values from 0 to 23.
  • Holiday: Binary value indicating whether the day is a holiday (extracted from the Holiday Schedule).
  • Work Day: Binary value indicating whether the day is a working day (1 if the day is neither a weekend nor a holiday, otherwise 0).
  • Weather: Categorical values representing different weather conditions (weathersit, extracted from Freemeteo):
      • 1: Clear, Few clouds, Partly cloudy
      • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
      • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
      • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • Temperature: Normalized temperature in Celsius.
  • Feels Like: Normalized feeling temperature in Celsius.
  • Humidity: Normalized humidity. The values are divided by 100 (max).
  • Windspeed: Normalized wind speed. The values are divided by 67 (max).
  • Casual: The count of casual users in a particular hour.
  • Registered: The count of registered users in a particular hour.
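
Because humidity and wind speed are stored as fractions of their maxima, they can be scaled back for reporting. The Python sketch below shows this, assuming the row keys `hum` and `windspeed` (hypothetical names); the short weather labels are abbreviations of the category descriptions above.

```python
HUMIDITY_MAX = 100.0   # from the feature list: humidity values are divided by 100
WINDSPEED_MAX = 67.0   # from the feature list: wind speed values are divided by 67

# Short labels for the four weather codes, abbreviated from the descriptions.
WEATHER_LABELS = {
    1: "clear or partly cloudy",
    2: "mist or cloudy",
    3: "light snow or light rain",
    4: "heavy rain, ice pellets, or snow with fog",
}


def denormalize(row):
    """Return a copy of `row` with humidity and wind speed on their original scales."""
    out = dict(row)
    out["humidity_pct"] = row["hum"] * HUMIDITY_MAX
    out["windspeed_raw"] = row["windspeed"] * WINDSPEED_MAX
    return out


# Made-up normalized values for illustration only.
restored = denormalize({"hum": 0.5, "windspeed": 0.5})
```

The temperature fields are left alone here because their scaling constants are not stated in this README.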

Addendum

To guide the data analysis, I've created additional columns such as:

  • Count: The total count of rental bikes, including both casual and registered users, in a particular hour.
  • MDY: The month, day, and year of a particular bike-renting instance.
  • Season: Categorical values representing the season (1: spring, 2: summer, 3: fall, 4: winter).
  • Holiday Name: Categorical values representing the name of the US government holiday.
  • Day of the Week: Categorical values representing the day of the week (1: Monday, 2: Tuesday, 3: Wednesday, 4: Thursday, 5: Friday, 6: Saturday, 7: Sunday).
  • Cultural Event: Categorical values representing certain major cultural events held in DC. (
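
Two of the added columns can be derived directly from existing fields. The Python sketch below shows one way to do that, not necessarily how feature_engineering.R does it; the row keys `date`, `casual`, and `registered` and the ISO date format are assumptions.

```python
from datetime import date


def add_derived_columns(row):
    """Return a copy of `row` with Count and Day of the Week added."""
    out = dict(row)
    # Count is the sum of casual and registered rentals for the hour.
    out["count"] = row["casual"] + row["registered"]
    # isoweekday() uses the same coding as the list above: 1 = Monday ... 7 = Sunday.
    out["day_of_week"] = date.fromisoformat(row["date"]).isoweekday()
    return out


# Made-up rental counts for illustration only; 2011-01-01 fell on a Saturday.
enriched = add_derived_columns({"date": "2011-01-01", "casual": 3, "registered": 13})
```

Season, Holiday Name, and Cultural Event need lookup tables that are not given in this README, so they are left out of the sketch.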

Files

1. Data cleaning: data_cleaning.R
2. EDA: eda.R
3. Feature engineering: feature_engineering.R
4. Modeling: modeling.R
5. Visualization: visualization.R
6. Main script: main.R