In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

First, let's load the dataset into a DataFrame and have a look.

In [None]:
df = pd.read_csv('/kaggle/input/london-bike-sharing-dataset/london_merged.csv', parse_dates=['timestamp'])
df.tail()

In [None]:
df.shape

Since `season` and `weather_code` hold numeric codes, it is worth replacing them with meaningful labels before we proceed with the analysis.

In [None]:
df['season'] = df['season'].map({0:'spring', 1:'summer', 2:'fall', 3:'winter'})
df['weather'] = df['weather_code'].map({1:'clear', 2:'scattered clouds', 3:'Broken clouds', 4:'Cloudy', 7:'Light rain', 10:'rain with thunderstorm',26:'snowfall',94:'Freezing Fog'})

df = df.drop(['weather_code'], axis=1)
df.tail()
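One caveat with `Series.map`: any code that is missing from the dictionary silently becomes `NaN`. The sketch below uses a small made-up series (not the real data, and a shortened hypothetical mapping) to show how unmapped codes could be detected before the original column is dropped.

```python
import pandas as pd

# Toy weather codes; 5 is deliberately absent from the mapping (illustrative only)
codes = pd.Series([1, 2, 5, 26])
labels = {1: 'clear', 2: 'scattered clouds', 26: 'snowfall'}

mapped = codes.map(labels)

# Codes that did not find a label show up as NaN after map()
unmapped = sorted(codes[mapped.isna()].unique().tolist())
print(unmapped)
```

An empty list here would mean every code received a label.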

Now we can start cleaning the data, beginning with the column data types. Previously, `timestamp` had the `object` data type, so I passed the `parse_dates` argument to `pd.read_csv` above to convert it to `datetime`.

In [None]:
df.dtypes
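To illustrate what `parse_dates` changes, here is a minimal sketch that reads the same tiny CSV twice, once without and once with the argument. The CSV text and its values are made up for the example; only the column names mirror the dataset.

```python
import io
import pandas as pd

# Two-row CSV standing in for the real file (values are illustrative only)
csv_text = "timestamp,cnt\n2015-01-04 00:00:00,100\n2015-01-04 01:00:00,120\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['timestamp'])

print(raw['timestamp'].dtype)     # a string-like dtype without parse_dates
print(parsed['timestamp'].dtype)  # a datetime64 dtype with parse_dates
```

With a datetime column, accessors such as `parsed['timestamp'].dt.hour` become available for later time-based analysis.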

`season`, `weather`, `is_holiday` and `is_weekend` should ideally be categorical columns. Let's convert them.

In [None]:
cat_cols = ['season', 'weather', 'is_holiday', 'is_weekend']

for col in cat_cols:
    df[col] = df[col].astype('category')
df.dtypes
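As an aside, the same conversion can be written without a loop, because `DataFrame.astype` also accepts a column-to-dtype mapping. The frame below is a toy stand-in with made-up values, so treat this as a sketch rather than a replacement for the cell above.

```python
import pandas as pd

# Toy frame reusing some of the notebook's column names (values are made up)
toy = pd.DataFrame({
    'season': ['spring', 'winter', 'winter'],
    'is_holiday': [0.0, 0.0, 1.0],
    'cnt': [120, 80, 45],
})

cat_cols = ['season', 'is_holiday']
# astype accepts a {column: dtype} mapping, converting all listed columns at once
toy = toy.astype({col: 'category' for col in cat_cols})
print(toy.dtypes)
```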

All the column types now look good. Next, let's check for NaN values.

In [None]:
df.isnull().sum()


As there are no missing values, let's look at the eight-value summary of the numerical columns and check for any unexpected values.

In [None]:
df.describe()

All values are inside the ranges they should be. The dataset looks good and is ready for EDA.

### EDA

Let's plot histograms of `cnt` for the categorical columns `is_holiday` and `is_weekend` and see how the count varies for each type of day.

In [None]:
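plt.figure(figsize=(16,8))

cols = ['is_holiday', 'is_weekend']
for i, col in enumerate(cols):
    plt.subplot(1, 2, i+1)
    # Draw one histogram of cnt per group and label each explicitly,
    # so the legend always matches the group that was actually plotted
    for value, group in df.groupby(col, observed=True)['cnt']:
        plt.hist(group, bins=50, alpha=0.7, label=f'{col} = {value}')
    # The x-axis shows the hourly count, not the grouping column
    plt.xlabel('cnt')
    plt.legend()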

Interestingly, we can see that bike sharing is much lower on holidays and weekends. It seems that most bike sharing happens on weekdays, perhaps by working people using it for their daily commute.
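One thing to keep in mind when reading these histograms is that the groups differ in size: there are far fewer holiday and weekend hours than ordinary weekday hours, so their bars are lower for that reason alone. A possible cross-check is to compare the group size with the mean count per hour. The sketch below uses a made-up toy frame, not the real data.

```python
import pandas as pd

# Toy data: four weekday rows and two weekend rows (values are illustrative only)
toy = pd.DataFrame({
    'is_weekend': [0, 0, 0, 0, 1, 1],
    'cnt': [100, 300, 200, 400, 150, 250],
})

# 'size' is the number of rows per group, 'mean' the average count per row
summary = toy.groupby('is_weekend')['cnt'].agg(['size', 'mean'])
print(summary)
```

On the real DataFrame, the same call with `df` in place of `toy` would show whether the per-hour mean is lower on weekends, independent of how many weekend rows there are.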

Now let's plot bar charts to see how bike sharing varies with the weather or the season.

In [None]:
plt.style.use('ggplot')
fig, axs = plt.subplots(1, 2, figsize=(10,6))

cols = ['season', 'weather']
for i in range(len(cols)):
    sns.barplot(x=cols[i], y='cnt', data=df, ax=axs[i])
    axs[i].xaxis.set_tick_params(rotation=90)

Looking at the bar charts, we can see that bike sharing is low in winter, and the same holds when there is snowfall. People tend to share bikes in summer and when there is no rain or snow. Bike sharing is more common when the sky is clear or has only a few scattered clouds.

Now let's move on to the other numerical fields: `t1`, `t2`, `hum` and `wind_speed`.

First it would be good to check whether there are correlated columns.

In [None]:
df.plot(kind='scatter', x='t1', y='t2')

We can see that `t1` and `t2` are highly correlated, so we only need to consider one of them in further analysis. Let's select `t2`, the "feels like" temperature, as our measure.

In [None]:
df.plot(kind='scatter', x='t2', y='hum')

In [None]:
df.plot(kind='scatter', x='hum', y='wind_speed')

In [None]:
df.plot(kind='scatter', x='wind_speed', y='t2')
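The scatter plots can be complemented with a numeric check. `DataFrame.corr()` returns pairwise Pearson correlation coefficients, which puts a number on how strongly two columns move together. The sketch below runs it on synthetic data built so that `t2` is almost a copy of `t1` while `hum` is independent; the values are generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic columns reusing the notebook's names (generated, not real data)
rng = np.random.default_rng(0)
t1 = rng.normal(12, 5, size=500)
toy = pd.DataFrame({
    't1': t1,
    't2': t1 + rng.normal(0, 0.5, size=500),  # nearly a copy of t1
    'hum': rng.uniform(20, 100, size=500),    # independent of t1 by construction
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()
print(corr.round(2))
```

On the real data, `df[['t1', 't2', 'hum', 'wind_speed']].corr()` would give the corresponding matrix.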

The other columns are not strongly correlated with one another. Let's continue our analysis by comparing `cnt` with these numeric fields.

In [None]:
num_cols = ['t2', 'hum', 'wind_speed']
fig, axs = plt.subplots(1, 3, figsize=(10,6))

# One line plot of cnt per numeric column, each on its own axis
for ax, col in zip(axs, num_cols):
    sns.lineplot(x=col, y='cnt', data=df, ax=ax)

We can see that bike sharing increases as the temperature rises and decreases as the humidity increases. The ideal wind speed for bike riding seems to be between 20 and 40 km/h.
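A range such as 20 to 40 km/h can be read more reliably from binned averages than from a line drawn through every distinct x value. One possible approach is to bin the column with `pd.cut` and take the mean count per bin. The bin edges and the toy values below are assumptions chosen for illustration, not results from the dataset.

```python
import pandas as pd

# Toy wind speeds and counts (made-up values, for illustration only)
toy = pd.DataFrame({
    'wind_speed': [2, 8, 12, 18, 22, 28, 35, 38],
    'cnt': [100, 140, 200, 220, 300, 340, 310, 290],
})

# Hypothetical 10 km/h bins; pd.cut intervals are right-inclusive by default
bins = [0, 10, 20, 30, 40]
toy['wind_bin'] = pd.cut(toy['wind_speed'], bins=bins)

# Mean count within each wind-speed bin
binned = toy.groupby('wind_bin', observed=True)['cnt'].mean()
print(binned)
```

The same pattern would apply to `t2` or `hum`, with bin edges chosen to suit each column's range.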

### Conclusion

* Bike sharing is more common on weekdays than holidays or weekends.
* Bike sharing increases when the temperature rises.
* Bike sharing is more common in summer, then fall, then spring and it is lowest in winter.
* People tend to ride bikes when the sky is clear or has only a few scattered clouds, and tend to avoid riding when there is rain or snow, or when the sky is cloudy.
* People tend to ride bikes mostly when the humidity is low.
* Ideal wind speed for bike riding seems to be 20-40 km/h.

This is my EDA on London Bike Sharing and the analysis will be continued. Comments and feedback are welcome. 