[credit: The Data Analysis Workshop](https://smile.amazon.com/Data-Analysis-Workshop-state-art/dp/1839211385/ref=sr_1_1?dchild=1&keywords=The+Data+Analysis+Workshop+Solve+business+problems+with+state-of-the-art+data+analysis+models&qid=1612045402&sr=8-1)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
# load hourly data
hourly_data = pd.read_csv('../input/bike-sharing-dataset/hour.csv')

In [None]:
# print some generic statistics about the data
print(f"Shape of data: {hourly_data.shape}")
print(f"Number of missing values in the data: "
      f"{hourly_data.isnull().sum().sum()}")

In [None]:
# get statistics on the numerical columns
hourly_data.describe().T

# Preprocessing Temporal and Weather Features
The season column contains values from 1 to 4, which encode, respectively, the winter, spring, summer, and fall seasons.  
The yr column contains the values 0 and 1, representing 2011 and 2012,  
while the weekday column contains values from 0 to 6, with each one representing a day of the week (0: Sunday, 1: Monday, through to 6: Saturday).  
Furthermore, we rescale the hum column to values between 0 and 100 (as it represents the humidity percentage),  
and the windspeed column to values between 0 and 67 (as those are the registered minimum and maximum wind speeds).

As a first step, create a copy of the original dataset, so that none of the transformations below alter the initial data:

In [None]:
preprocessed_data = hourly_data.copy()

In the next step, map the season variable from a numerical encoding to a readable categorical one. To do that, we create a Python dictionary that contains the encoding, and then apply it to each value of the column with the apply method and a lambda function:

In [None]:
seasons_mapping = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}
preprocessed_data['season'] = preprocessed_data['season'].apply(lambda x: seasons_mapping[x])
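As an alternative to apply with a lambda, a pandas Series also offers the map method, which accepts a dictionary directly. The sketch below compares the two on a small hypothetical Series (toy_season is an illustrative name, not part of the dataset). One difference to keep in mind: map returns NaN for values missing from the dictionary, whereas the lambda raises a KeyError.

```python
import pandas as pd

seasons_mapping = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}

# hypothetical toy column standing in for preprocessed_data['season']
toy_season = pd.Series([1, 2, 3, 4, 1])

# the approach used in this notebook
via_apply = toy_season.apply(lambda x: seasons_mapping[x])
# dictionary-based alternative
via_map = toy_season.map(seasons_mapping)

print(via_apply.equals(via_map))
```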

Create a Python dictionary for the yr column as well:

In [None]:
yr_mapping = {0: 2011, 1: 2012}
preprocessed_data['yr'] = preprocessed_data['yr'].apply(lambda x: yr_mapping[x])

Create a Python dictionary for the weekday column:

In [None]:
weekday_mapping = {0: 'Sunday', 1: 'Monday', 2: 'Tuesday', \
3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'}
preprocessed_data['weekday'] = preprocessed_data['weekday'].apply(lambda x: weekday_mapping[x])

Encode the weathersit values:

In [None]:
weather_mapping = {1: 'clear', 2: 'cloudy', 3: 'light_rain_snow', 4: 'heavy_rain_snow'}
preprocessed_data['weathersit'] = preprocessed_data['weathersit'].apply(lambda x: weather_mapping[x])

Finally, rescale the hum and windspeed columns:

In [None]:
preprocessed_data['hum'] = preprocessed_data['hum'] * 100
preprocessed_data['windspeed'] = preprocessed_data['windspeed'] * 67
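A quick sanity check after rescaling is to confirm that the values fall in the expected ranges. The sketch below does this on a small hypothetical frame (toy and its values are made up for illustration), assuming the original columns are normalized to the interval from 0 to 1:

```python
import pandas as pd

# hypothetical normalized values standing in for the hum and windspeed columns
toy = pd.DataFrame({'hum': [0.0, 0.5, 1.0],
                    'windspeed': [0.0, 0.25, 1.0]})

rescaled = toy.copy()
rescaled['hum'] = rescaled['hum'] * 100
rescaled['windspeed'] = rescaled['windspeed'] * 67

# check that every value lies inside the expected range
in_range = bool(rescaled['hum'].between(0, 100).all()
                and rescaled['windspeed'].between(0, 67).all())
print(rescaled)
print(f"All values in range: {in_range}")
```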

We can inspect the results of our transformations by calling the sample() method on the newly created dataset:

In [None]:
cols = ['season', 'yr', 'weekday', 'weathersit', 'hum', 'windspeed']
preprocessed_data[cols].sample(10, random_state=1)

# Registered versus Casual Use Analysis  
We begin our analysis with the number of rides performed by registered users versus non-registered (or casual) ones. These numbers are represented in the ***registered*** and ***casual*** columns, respectively, with the ***cnt*** column representing their sum.

We can easily verify that *cnt* equals *registered + casual* for each entry in the dataset by using the assert statement:

In [None]:
assert (preprocessed_data.casual + preprocessed_data.registered == preprocessed_data.cnt).all(), \
'Sum of casual and registered rides not equal to total number of rides'
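If such an assertion ever fails, it helps to see which rows break the identity. A minimal sketch on hypothetical data (the toy frame below deliberately contains one inconsistent row) selects the offending entries with a boolean mask:

```python
import pandas as pd

# hypothetical data; the last row violates casual + registered == cnt
toy = pd.DataFrame({'casual': [3, 8, 0],
                    'registered': [13, 32, 1],
                    'cnt': [16, 40, 2]})

# keep only the rows where the identity does not hold
mismatch = toy[toy['casual'] + toy['registered'] != toy['cnt']]
print(f"Rows violating the identity: {len(mismatch)}")
print(mismatch)
```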

We first take a look at their distributions:

In [None]:
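# distplot is deprecated in recent seaborn releases; histplot with a KDE
# overlay and density normalization is the closest replacement
sns.histplot(preprocessed_data['registered'], kde=True, stat='density', color='C0', label='registered')
sns.histplot(preprocessed_data['casual'], kde=True, stat='density', color='C1', label='casual')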
plt.legend()
plt.xlabel('rides')
plt.title("Rides distributions")

Let's now focus on the evolution of rides over time. We can analyze the number of rides each day:

In [None]:
plot_data = preprocessed_data[['registered', 'casual', 'dteday']]
ax = plot_data.groupby('dteday').sum().plot(figsize=(10,6))
ax.set_xlabel("time")
ax.set_ylabel("number of rides per day")
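The daily aggregation above groups on dteday as it was read from the CSV file, that is, as plain strings. One option, sketched below on hypothetical data, is to parse the column with pd.to_datetime first, so that the grouped result carries a DatetimeIndex. The toy values and the YYYY-MM-DD date format are assumptions made for illustration.

```python
import pandas as pd

# hypothetical hourly records; two belong to the same day
toy = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-01', '2011-01-02'],
                    'registered': [13, 32, 27],
                    'casual': [3, 8, 5]})

# parse the date strings, then sum the rides of each day
toy['dteday'] = pd.to_datetime(toy['dteday'])
daily = toy.groupby('dteday').sum()
print(daily)
```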

We can take the rolling mean to smooth out the curves, and the rolling standard deviation to quantify the variability around them:

In [None]:
# create a new dataframe with the columns needed for plotting, and obtain
# the number of rides per day by grouping over each day
plot_data = preprocessed_data[['registered', 'casual', 'dteday']]
plot_data = plot_data.groupby('dteday').sum()
# define the window for computing the rolling mean and standard deviation
window = 7
rolling_means = plot_data.rolling(window).mean()
rolling_deviations = plot_data.rolling(window).std()
# plot the series of rolling means; the commented-out fill_between calls
# would color the zone between the rolling means +- 2 rolling standard deviations
ax = rolling_means.plot(figsize=(10,6))
#ax.fill_between(rolling_means.index, rolling_means['registered'] + 2*rolling_deviations['registered'], \
#rolling_means['registered'] - 2*rolling_deviations['registered'], alpha = 0.2)
#ax.fill_between(rolling_means.index, rolling_means['casual'] + 2*rolling_deviations['casual'], \
#rolling_means['casual'] - 2*rolling_deviations['casual'], alpha = 0.2)
ax.set_xlabel("time")
ax.set_ylabel("number of rides per day")
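To see what a window of 7 does, the following sketch applies rolling() to a short hypothetical series (daily and its values are made up). With the default settings, the first window - 1 entries are NaN because the window is not yet full; passing min_periods=1 instead averages whatever values are available.

```python
import pandas as pd

# hypothetical daily ride counts
daily = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name='rides')
window = 7

# default behavior: NaN until 7 observations are available
strict = daily.rolling(window).mean()
# lenient variant: average over the observations seen so far
lenient = daily.rolling(window, min_periods=1).mean()

print(pd.DataFrame({'strict': strict, 'lenient': lenient}))
```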

Let's now focus on the distributions of the requests over separate hours and days of the week.

In [None]:
# select relevant columns
plot_data = preprocessed_data[['hr', 'weekday', 'registered', 'casual']]
# transform the data into long format, in which the number of rides is stored
# in the count column for each distinct hr, weekday and type (registered or casual)
plot_data = plot_data.melt(id_vars=['hr', 'weekday'], var_name='type', value_name='count')
# create a FacetGrid object, in which a grid of plots is produced:
# the rows are the days of the week, the columns the types (registered and casual)
grid = sns.FacetGrid(plot_data, row='weekday', col='type', height=2.5, aspect=2.5,
                     row_order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
# populate the FacetGrid with the specific plots
grid.map(sns.barplot, 'hr', 'count', alpha=0.5)

# Analyzing Seasonal Impact on Rides

Plot the number of rides distributed over ***hours by seasons***:

Start by combining the hours and seasons. Create a subset of the initial data by selecting the hr, season, registered, and casual columns:

In [None]:
plot_data = preprocessed_data[['hr', 'season', 'registered', 'casual']]

Unpivot the data from wide to long format:

In [None]:
plot_data = plot_data.melt(id_vars=['hr', 'season'], var_name='type', value_name='count')
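The effect of melt() is easiest to see on a tiny example. In the sketch below (wide is a hypothetical two-row frame), the registered and casual columns are stacked into a single count column, and a new type column records which of the two each value came from:

```python
import pandas as pd

# hypothetical wide-format data: one row per hour, one column per user type
wide = pd.DataFrame({'hr': [0, 1],
                     'season': ['winter', 'winter'],
                     'registered': [13, 32],
                     'casual': [3, 8]})

# long format: one row per (hour, season, type) combination
long = wide.melt(id_vars=['hr', 'season'], var_name='type', value_name='count')
print(long)
```

Each original row yields one output row per melted column, so the two-row frame becomes a four-row one.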

Define the seaborn FacetGrid object, in which the rows represent the different seasons and the columns the ride types, and apply the seaborn.barplot() function to each of the FacetGrid elements:

In [None]:
grid = sns.FacetGrid(plot_data, row='season', col='type', height=2.5, aspect=2.5, \
row_order=['spring', 'summer', 'fall', 'winter'])
grid.map(sns.barplot, 'hr', 'count', alpha=0.5)

Plot the number of rides distributed over ***weekdays by seasons***:

In [None]:
plot_data = preprocessed_data[['weekday', 'season', 'registered', 'casual']]
plot_data = plot_data.melt(id_vars=['weekday', 'season'], var_name='type', value_name='count')
grid = sns.FacetGrid(plot_data, row='season', col='type', height=2.5, aspect=2.5, \
row_order=['spring', 'summer', 'fall', 'winter'])
grid.map(sns.barplot, 'weekday', 'count', alpha=0.5)
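By default, seaborn.barplot() draws the mean of the plotted variable for each category, so the bar heights in these grids are average hourly ride counts. The same numbers can be obtained with a groupby, as in this sketch on hypothetical long-format data (the values are made up for illustration):

```python
import pandas as pd

# hypothetical long-format data, shaped like the output of melt()
long = pd.DataFrame({'weekday': ['Monday', 'Monday', 'Sunday', 'Sunday'],
                     'type': ['registered', 'registered', 'casual', 'casual'],
                     'count': [100, 140, 30, 50]})

# mean count per (weekday, type) pair, matching the default barplot estimator
bar_heights = long.groupby(['weekday', 'type'])['count'].mean()
print(bar_heights)
```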