[credit: The Data Analysis Workshop](https://smile.amazon.com/Data-Analysis-Workshop-state-art/dp/1839211385/ref=sr_1_1?dchild=1&keywords=The+Data+Analysis+Workshop+Solve+business+problems+with+state-of-the-art+data+analysis+models&qid=1612045402&sr=8-1)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
hourly_data = pd.read_csv('../input/bike-sharing-dataset/hour.csv')

In [None]:
preprocessed_data = hourly_data.copy()
seasons_mapping = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}
preprocessed_data['season'] = preprocessed_data['season'].apply(lambda x: seasons_mapping[x])
yr_mapping = {0: 2011, 1: 2012}
preprocessed_data['yr'] = preprocessed_data['yr'].apply(lambda x: yr_mapping[x])
weekday_mapping = {0: 'Sunday', 1: 'Monday', 2: 'Tuesday',
                   3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'}
preprocessed_data['weekday'] = preprocessed_data['weekday'].apply(lambda x: weekday_mapping[x])
weather_mapping = {1: 'clear', 2: 'cloudy', 3: 'light_rain_snow', 4: 'heavy_rain_snow'}
preprocessed_data['weathersit'] = preprocessed_data['weathersit'].apply(lambda x: weather_mapping[x])
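# hum and windspeed are stored normalized (divided by 100 and 67, respectively);
# rescale them back to their original ranges
preprocessed_data['hum'] = preprocessed_data['hum'] * 100
preprocessed_data['windspeed'] = preprocessed_data['windspeed'] * 67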

The first and most common way to measure the relationship between two continuous variables is to compute their correlation.  
Let's define a function that computes the correlation between a given column and each type of ride, and plots the result:

In [None]:
def plot_correlations(data, col):
    # get correlation between col and registered rides
    corr_r = np.corrcoef(data[col], data["registered"])[0, 1]
    ax = sns.regplot(x=col, y="registered", data=data, scatter_kws={"alpha": 0.05},
                     label=f"Registered rides (correlation: {corr_r:.3f})")
    # get correlation between col and casual rides
    corr_c = np.corrcoef(data[col], data["casual"])[0, 1]
    ax = sns.regplot(x=col, y="casual", data=data, scatter_kws={"alpha": 0.05},
                     label=f"Casual rides (correlation: {corr_c:.3f})")
    # adjust legend alpha; the attribute is legend_handles in matplotlib >= 3.7
    # and legendHandles in older releases
    legend = ax.legend()
    handles = getattr(legend, "legend_handles", None)
    if handles is None:
        handles = legend.legendHandles
    for lh in handles:
        lh.set_alpha(0.5)
    # label the axes once, outside the loop
    ax.set_ylabel("rides")
    ax.set_title(f"Correlation between rides and {col}")
    return ax
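
The `[0, 1]` index in `plot_correlations` is needed because `np.corrcoef` returns the full correlation matrix rather than a single number. A minimal sketch on synthetic data (the arrays `x` and `y` are made up for illustration and are not part of the bike sharing dataset):

```python
import numpy as np

# Synthetic, illustrative data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# For two 1-D inputs, np.corrcoef returns a symmetric 2x2 matrix:
# ones on the diagonal, the Pearson coefficient off the diagonal.
corr_matrix = np.corrcoef(x, y)
pearson_xy = corr_matrix[0, 1]

print(corr_matrix.shape)
print(f"Pearson correlation: {pearson_xy:.3f}")
```

Either off-diagonal entry (`[0, 1]` or `[1, 0]`) gives the same value, since the matrix is symmetric.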

Plot the correlation between *rides and temp*:

In [None]:
plt.figure(figsize=(10,8))
ax = plot_correlations(preprocessed_data, 'temp')

Plot the correlation between *rides and atemp*:

In [None]:
plt.figure(figsize=(10,8))
ax = plot_correlations(preprocessed_data, 'atemp')

Plot the correlation between *rides and hum*:

In [None]:
plt.figure(figsize=(10,8))
ax = plot_correlations(preprocessed_data, 'hum')

Plot the correlation between *rides and windspeed*:

In [None]:
plt.figure(figsize=(10,8))
ax = plot_correlations(preprocessed_data, 'windspeed')

From the plots above, we can observe that higher *temperatures* are associated with more rides, and the same holds for *atemp*. Note that because the values in the registered column are widely spread across the different values of temp, its correlation with temp is lower than that of the casual column.
A similar pattern holds for *humidity*, except that it is negatively correlated with both types of rides.
Finally, there is minimal correlation between the number of rides and the *wind speed* (a weak positive correlation).

# Evaluating the Difference between the Pearson and Spearman Correlations
A major drawback of the *correlation coefficient* used so far (the Pearson correlation) is its assumption of a *linear relationship* between the two random variables. This is quite a strong assumption because, most of the time, relationships in nature are not linear.  
The *Spearman rank correlation* generalizes the Pearson correlation to *monotonic relationships* between two variables.
Let's investigate the relationship between the different variables in light of the difference between the two correlation measures.
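
Before turning to the bike data, the difference can be illustrated on a toy example. The sketch below uses synthetic data (an exponential curve chosen purely for illustration) and relies on the fact that the Spearman coefficient is the Pearson coefficient computed on the ranks of the values:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

# Synthetic, illustrative data: y grows monotonically but not linearly with x.
x = np.linspace(0, 5, 200)
y = np.exp(x)

pearson_coef = pearsonr(x, y)[0]
spearman_coef = spearmanr(x, y)[0]

# Spearman is the Pearson correlation of the ranks of the two variables.
pearson_on_ranks = pearsonr(rankdata(x), rankdata(y))[0]

print(f"Pearson:          {pearson_coef:.3f}")
print(f"Spearman:         {spearman_coef:.3f}")
print(f"Pearson on ranks: {pearson_on_ranks:.3f}")
```

Because `y` preserves the ordering of `x` exactly, the Spearman coefficient is 1, while the Pearson coefficient is lower since the relationship is not a straight line.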

In [None]:
from scipy.stats import pearsonr, spearmanr

In [None]:
def compute_correlations(data, col):
    """Compute the Pearson and Spearman correlation coefficients between
    `col` and the registered and casual rides."""
    pearson_reg = pearsonr(data[col], data["registered"])[0]
    pearson_cas = pearsonr(data[col], data["casual"])[0]
    spearman_reg = spearmanr(data[col], data["registered"])[0]
    spearman_cas = spearmanr(data[col], data["casual"])[0]
    return pd.Series({
        "Pearson (registered)": pearson_reg,
        "Spearman (registered)": spearman_reg,
        "Pearson (casual)": pearson_cas,
        "Spearman (casual)": spearman_cas,
    })

In [None]:
# compute correlation measures between different features
cols = ["temp", "atemp", "hum", "windspeed"]
corr_data = pd.DataFrame(index=["Pearson (registered)", "Spearman (registered)",
                                "Pearson (casual)", "Spearman (casual)"])
for col in cols:
    corr_data[col]=compute_correlations(preprocessed_data, col)
corr_data.T

As we can see from above, for most of the variables, the Pearson and Spearman correlation coefficients are close to each other.  
The most striking difference between the two coefficients occurs when comparing the temp (and atemp) and casual columns. More precisely, the Spearman correlation is noticeably higher than the Pearson correlation, which suggests *a nonlinear, relatively strong and positive relationship*.  
One interpretation of this result is that casual customers are far keener on using the bike sharing service when temperatures are higher.

**Correlation Matrix Plot**

In [None]:
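cols = ["temp", "atemp", "hum", "windspeed", "registered", "casual"]
plot_data = preprocessed_data[cols]
# pairwise Pearson correlations (the default for DataFrame.corr)
corr = plot_data.corr()
fig = plt.figure(figsize=(10,8))
plt.matshow(corr, fignum=fig.number)
plt.xticks(range(len(plot_data.columns)), plot_data.columns)
plt.yticks(range(len(plot_data.columns)), plot_data.columns)
plt.colorbar()
# set the y-limits explicitly so the first and last rows stay fully visible
# (some matplotlib versions clip them)
plt.ylim([5.5, -0.5])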
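
`DataFrame.corr` computes Pearson coefficients by default, so the matrix above captures linear association only. For the rank-based view discussed earlier, pandas also accepts `method="spearman"`. A small sketch on a synthetic frame (the frame `toy` and its columns `a` and `b` are hypothetical and used only for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative frame: b is a monotonic, nonlinear function of a.
rng = np.random.default_rng(0)
a = rng.uniform(0, 5, size=300)
toy = pd.DataFrame({"a": a, "b": a ** 3})

pearson_matrix = toy.corr()                    # default: method="pearson"
spearman_matrix = toy.corr(method="spearman")  # rank-based alternative

print(pearson_matrix.round(3))
print(spearman_matrix.round(3))
```

The same call could be applied to `plot_data` to plot a Spearman version of the matrix alongside the Pearson one.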