## Bike Sharing Demand

Defining the problem:

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

In [None]:
## Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
## Constants
dir = '../input/bike-sharing-demand/'

In [None]:
## Util functions
def plot_count(x, y, xlabel, ylabel, title, xrotation=None, yrotation=None, horizontal=False):
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if xrotation:
        plt.xticks(rotation=xrotation)
    if yrotation:
        plt.yticks(rotation=yrotation)
    if horizontal:
        plt.barh(x, y)
    else:
        plt.bar(x, y)
        
def sns_count_plot(data, x, xlabel, ylabel, title, hue=None, xrotation=None, yrotation=None):
    # Draw first, then label: some seaborn versions reset the axis labels while plotting
    sns.countplot(x=x, data=data, hue=hue)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if xrotation:
        plt.xticks(rotation=xrotation)
    if yrotation:
        plt.yticks(rotation=yrotation)

def sns_bar_plot(data, x, y, xlabel, ylabel, title, hue=None, xrotation=None, yrotation=None):
    # Draw first, then label: some seaborn versions reset the axis labels while plotting
    sns.barplot(x=x, y=y, data=data, hue=hue)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if xrotation:
        plt.xticks(rotation=xrotation)
    if yrotation:
        plt.yticks(rotation=yrotation)

In [None]:
## Read data
train = pd.read_csv(dir + 'train.csv')
test = pd.read_csv(dir + 'test.csv')
sample_submission = pd.read_csv(dir + 'sampleSubmission.csv')
print(f"Train {train.shape}")
print(f"Test {test.shape}")
train.head()

### Data Fields

* datetime - hourly date + timestamp 
* season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
* holiday - whether the day is considered a holiday
* workingday - whether the day is neither a weekend nor holiday
* weather - 
    1. Clear, Few clouds, Partly cloudy
    2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp - temperature in Celsius
* atemp - "feels like" temperature in Celsius
* humidity - relative humidity
* windspeed - wind speed
* casual - number of non-registered user rentals initiated
* registered - number of registered user rentals initiated
* count - number of total rentals

In [None]:
train.dtypes

### Let's classify the columns

- datetime - datetime
- season, weather - categorical
- holiday, workingday - binary
- temp, atemp, windspeed - float
- humidity, casual, registered, count - int
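
One way to make this classification explicit is to cast the columns accordingly. This is only a sketch on a tiny made-up frame: `demo`, `season_names` and `typed` are illustrative names that do not appear elsewhere in the notebook, and the season labels are taken from the Data Fields list. In the notebook the same calls would be applied to `train`.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; in the notebook this would be `train`.
demo = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weather": [1, 1, 2, 3],
    "holiday": [0, 0, 1, 0],
    "workingday": [1, 1, 0, 0],
})

# Season codes as described in the Data Fields list.
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

typed = demo.assign(
    # Categorical columns: readable labels for season, raw codes for weather.
    season=demo["season"].map(season_names).astype("category"),
    weather=demo["weather"].astype("category"),
    # Binary columns become real booleans.
    holiday=demo["holiday"].astype(bool),
    workingday=demo["workingday"].astype(bool),
)
print(typed.dtypes)
```

With categorical dtypes in place, plots and group-bys show the labels instead of bare integer codes.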

In [None]:
## Let's convert the datetime column to a datetime type and derive the date parts
train.datetime = pd.to_datetime(train.datetime)
train['year'] = train.datetime.dt.year
train['month'] = train.datetime.dt.month
train['day'] = train.datetime.dt.day
train['hour'] = train.datetime.dt.hour
train['registered_percent'] = (train.registered / train['count']) * 100
train['casual_percent'] = (train.casual / train['count']) * 100
train.head()

### How many years of data do we have? How much data does each year contain?

* We have close to 5400 data points for each of 2011 and 2012.
* A full year should give roughly 12 months x 30 days x 24 hours = 8640 data points, but each year has only about 5400, so we are missing either some hours of the day or some whole days. Let's find out.

In [None]:
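## Give every row a weight of 1 so the slices show row counts rather than summed year values
px.pie(train.assign(rows=1), values='rows', names='year', title='Distribution of data years', width=300, height=300)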

### For 2011 and 2012, do we have data for all 12 months?

* We do have all the months. However, a full month should give about 30 x 24 = 720 data points, and every month falls short of that.
* So we must be missing some days in each month.

In [None]:
sns_count_plot(data=train, x='month', xlabel='month', ylabel='count', title='Count of month in year', hue='year')

### How many days of January do we have in the data?

* Looks like we only have the first 19 days, as expected (the test set covers the 20th to the end of every month).
* Even then, each day should have 24 data points, but some days have fewer.

In [None]:
month_1 = train[train.month == 1]
sns_count_plot(data=month_1, x='day', xlabel='day', ylabel='count', title='Count of days in January', hue='year')
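
As a quick sanity check on these numbers, here is a rough sketch of how many hourly rows we would expect if every month contributed its first 19 days in full. The constant names are made up for illustration.

```python
# Expected hourly rows if each month contributes its first 19 days in full.
HOURS_PER_DAY = 24
TRAIN_DAYS_PER_MONTH = 19   # read off the plot above
MONTHS_PER_YEAR = 12

rows_per_month = TRAIN_DAYS_PER_MONTH * HOURS_PER_DAY
rows_per_year = rows_per_month * MONTHS_PER_YEAR

print(rows_per_month)  # 456
print(rows_per_year)   # 5472
```

That is close to the roughly 5400 rows per year seen earlier, which fits with only a small number of hours being missing on some days.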

### Seasonal count

* We have data for all 4 seasons (spring, summer, fall, winter), each with an almost equal number of data points.
* Bike rentals are high in summer, fall and winter but noticeably lower in spring.

In [None]:
fig = plt.figure(figsize=(10, 7))
plt.subplot(121)
sns_count_plot(data=train, x='season', xlabel='season', ylabel='count of data points', title='Count of seasons in year', hue='year')
plt.subplot(122)
sns_bar_plot(data=train, x='season', y='count', xlabel='season', ylabel='number of bikes rented', title='Distribution of bikes rented along seasons')

### Apart from spring, the seasons have similar rental counts. Does the same hold for weather?

* The weather counts look much the same for 2011 and 2012, so let's plot them against the season instead.
* Spring has about as many data points for each weather type as summer, fall and winter.
* As expected, more bikes are rented under a clear sky than in rainy or snowy weather.

In [None]:
print("weather count in year 2011 & 2012")
print(train.weather.value_counts())
plt.figure(figsize=(12, 5))
plt.subplot(121)
sns_count_plot(data=train, x='weather', xlabel='weather', ylabel='count of data points', title='Count of weather in season', hue='season')
plt.subplot(122)
sns_bar_plot(data=train, x='weather', y='count', xlabel='weather', ylabel='number of bikes rented', title='Distribution of bikes rented by weather')
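
To put numbers next to the weather comparison, a grouped summary can help. Below is a minimal sketch on made-up data: `rentals_by` and `demo` are hypothetical names, and in the notebook the call would be `rentals_by(train, 'weather')`.

```python
import pandas as pd

def rentals_by(df, key):
    """Number of rows, mean and median hourly `count` for each value of `key`."""
    return df.groupby(key)["count"].agg(["size", "mean", "median"])

# Made-up values, only to show the shape of the result.
demo = pd.DataFrame({
    "weather": [1, 1, 2, 2, 3],
    "count": [200, 100, 90, 70, 20],
})

summary = rentals_by(demo, "weather")
print(summary)
```

The `size` column matters as much as the mean: a weather type with very few rows gives a noisy average.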

### How many holidays do we have in the two years? Which seasons are these holidays in?

* Woo, we have a lot more non-holiday rows than holiday rows (note that holiday = 0 also includes weekends).

In [None]:
print(train.holiday.value_counts())

## Pass each axis explicitly so every plot gets its own labels and title
figure, axis = plt.subplots(1, 2, figsize=(10, 5))
sns.countplot(data=train, x='holiday', hue='year', ax=axis[0])
axis[0].set_xlabel('holiday')
axis[0].set_ylabel('count of data points')
axis[0].set_title('Count of holidays in year')
sns.countplot(data=train, x='season', hue='holiday', ax=axis[1])
axis[1].set_ylabel('count of data points')
axis[1].set_title('Count of holidays in season')

### How does temperature affect the rental count?

* As expected, most of the data falls between 15 and 30 degrees.
* But we can also find bikes rented at extreme temperatures (close to 40 degrees).
* So the expectation that higher temperatures lead to fewer rentals doesn't hold entirely.

In [None]:
fig, axis = plt.subplots(1, 2, figsize=(10, 8))
plt.subplot(121)
plt.title('Distribution of temperature')
sns.histplot(train.temp, kde=True)
plt.subplot(122)
plt.title('Temperature vs bike rental')
sns.scatterplot(x='temp', y='count', data=train)
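
A scatter plot of hourly counts gets dense, so averaging the count within temperature bands can show the trend more directly. This is a sketch under assumptions: `mean_count_by_band`, `demo` and the band edges are made up for illustration, and in the notebook it would be called on `train` with the `temp` column.

```python
import pandas as pd

def mean_count_by_band(df, column, edges):
    """Mean hourly `count` within each interval of `column` defined by `edges`."""
    bands = pd.cut(df[column], bins=edges)
    # observed=True keeps only the bands that actually contain rows.
    return df.groupby(bands, observed=True)["count"].mean()

# Made-up values, only to show how the banding works.
demo = pd.DataFrame({
    "temp": [5, 8, 15, 18, 25, 28, 38],
    "count": [10, 20, 100, 120, 200, 220, 150],
})

by_band = mean_count_by_band(demo, "temp", edges=[0, 10, 20, 30, 40])
print(by_band)
```

The same helper works for `humidity` and `windspeed` with suitable edges.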

### Humidity and effect of humidity on rental count

In [None]:
fig, axis = plt.subplots(1, 2, figsize=(10, 8))
plt.subplot(121)
plt.title('Distribution of humidity')
sns.histplot(train.humidity, kde=True)
plt.subplot(122)
plt.title('Humidity vs bike rental')
sns.scatterplot(x='humidity', y='count', data=train)

### Windspeed

* Windspeed matches expectations more closely than the two plots above.
* As expected, fewer people take bikes at higher wind speeds.

In [None]:
fig, axis = plt.subplots(1, 2, figsize=(10, 8))
plt.subplot(121)
plt.title('Distribution of windspeed')
sns.histplot(train.windspeed, kde=True)
plt.subplot(122)
plt.title('Windspeed vs bike rental')
sns.scatterplot(x='windspeed', y='count', data=train)

### Compare registered and casual bike rentals

In [None]:
plt.title('Registered and casual bike rentals distribution')
plt.xlabel('share of bike rentals (%)')
sns.histplot(train['registered_percent'], label='registered')
sns.histplot(train['casual_percent'], color='red', label='casual')
plt.legend();

### What is the highest rental count we have received so far?

* Most values are close to 0 rentals, but the highest count received so far is close to 1000 in a single hour.
* We rented the most bikes on the 12th of September 2012 around 6pm, on a working day in fall when the weather was clear.

In [None]:
rental_count = train.sort_values('count', ascending=False)
print("Hour with the most rented bikes")
print(rental_count.iloc[0])
sns.histplot(rental_count['count']);

### What is the highest registered bike rental count?

* We rented the most bikes to registered users on the same day, the 12th of September 2012.
* The distribution shows the highest registered rental count is 866.

In [None]:
rental_registered_count = train.sort_values('registered', ascending=False)
print("Hour with the most registered rented bikes")
print(rental_registered_count.iloc[0])
sns.histplot(rental_registered_count['registered']);

### What is the highest casual rental count?

* The highest casual count is 367, on the 17th of March 2012.

In [None]:
rental_casual_count = train.sort_values('casual', ascending=False)
print("Hour with the most casual rented bikes")
print(rental_casual_count.iloc[0])
sns.histplot(rental_casual_count['casual']);
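
The three look-ups above follow the same pattern, so they could share one small helper that avoids sorting the whole frame. This is just a sketch: `peak_row` and `demo` are hypothetical names and the values are made up. In the notebook it would be called as `peak_row(train, 'count')`.

```python
import pandas as pd

def peak_row(df, column):
    """Row holding the largest value of `column` (the first one if there are ties)."""
    return df.loc[df[column].idxmax()]

# Made-up values, only to show the call.
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-09-12 17:00:00",
        "2012-09-12 18:00:00",
        "2012-03-17 14:00:00",
    ]),
    "casual": [90, 91, 300],
    "registered": [700, 800, 150],
    "count": [790, 891, 450],
})

for column in ["count", "registered", "casual"]:
    row = peak_row(demo, column)
    print(column, row["datetime"], row[column])
```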

### Rented bikes distribution every month in 2011 and 2012

* In 2012 we rented more bikes than in 2011 throughout the year.
* The general trend is that rentals increase from the start of the year, peak around August, then decrease until year end.
* Bikes are usually rented most during the fall season.

In [None]:
sns.lineplot(x='month', y='count', hue='year', data=train);
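
The same comparison can be read as a table of mean hourly rentals per month and year. Below is a minimal sketch: `monthly_mean_by_year` and `demo` are illustrative names with made-up values, and in the notebook the function would be applied to `train`.

```python
import pandas as pd

def monthly_mean_by_year(df):
    """Mean hourly `count` with one row per month and one column per year."""
    return df.pivot_table(index="month", columns="year", values="count", aggfunc="mean")

# Made-up values, only to show the shape of the table.
demo = pd.DataFrame({
    "year": [2011, 2011, 2011, 2012, 2012, 2012],
    "month": [1, 1, 8, 1, 8, 8],
    "count": [10, 30, 200, 50, 300, 500],
})

table = monthly_mean_by_year(demo)
print(table)
print(table.idxmax())  # month with the highest mean for each year
```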

### Are there any rows which have both holiday and working day as 0?

* Woah, we have close to 3k rows with both holiday and working day set to 0.
* We assume these days are weekends.

In [None]:
holidays_data = train[(train['holiday'] == 0) & (train['workingday'] == 0)]
print(holidays_data.shape)
holidays_data.head()
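
The weekend assumption can be checked rather than assumed by looking at the day names of those rows. This is a sketch on made-up rows: `non_working_breakdown` and `demo` are hypothetical names, and it expects `datetime` to already be a datetime column, as it is in `train` after the conversion earlier.

```python
import pandas as pd

def non_working_breakdown(df):
    """Day-of-week counts for rows that are neither a holiday nor a working day."""
    off = df[(df["holiday"] == 0) & (df["workingday"] == 0)]
    return off["datetime"].dt.day_name().value_counts()

# Made-up rows covering each combination of the two flags.
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00:00",  # Saturday
        "2011-01-02 00:00:00",  # Sunday
        "2011-01-03 00:00:00",  # Monday, regular working day
        "2011-01-17 00:00:00",  # Monday, flagged as a holiday
    ]),
    "holiday": [0, 0, 0, 1],
    "workingday": [0, 0, 1, 0],
})

breakdown = non_working_breakdown(demo)
print(breakdown)
```

If only Saturday and Sunday show up when this is run on `train`, the assumption holds.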