- Recently I received a lot of feedback from friends who have just changed, or are about to change, their careers towards Data Analysis, Data Science and Machine Learning. They pointed to a lack of material between the start of the analysis journey and the advanced techniques.

- They are looking for Exploratory Data Analysis examples that are detailed yet beginner friendly and not overly complicated (no assortment of regression or normalization techniques, etc.), and that show how to get started and, most importantly, how to read descriptive statistics and graphs.

- After this feedback, I decided to make a series of EDAs on different datasets, without making things too complicated for people taking their first steps in the DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.


#### **INTRO**

- In this study, we are going to perform Exploratory Data Analysis (EDA) on the London Bike Share dataset.
- The study aims to be beginner friendly and to explain each step along the way as much as possible.
- The dataset has 17414 instances, each with a bike share count, temperature and other features.
- The data covers bike shares in London from 2015 to 2017.

- 'Ride into a wise, healthy world that’s eco-friendly, efficient, and fun.' from the https://www.pbsc.com/about-us website


- Let's import the required libraries

In [None]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df = pd.read_csv('../input/london-bike-sharing-dataset/london_merged.csv')
df.head()

Metadata:
- "timestamp" - timestamp field for grouping the data
- "cnt" - the count of new bike shares
- "t1" - real temperature in C
- "t2" - "feels like" temperature in C
- "hum" - humidity in percentage
- "wind_speed" - wind speed in km/h
- "weather_code" - category of the weather
- "is_holiday" - boolean field - 1 holiday / 0 non holiday
- "is_weekend" - boolean field - 1 if the day is weekend
- "season" - category field, meteorological seasons: 0-spring; 1-summer; 2-fall; 3-winter.

- "weather_code" category description:
   - 1 = Clear; mostly clear but have some values with haze/fog/patches of fog/fog in vicinity
   - 2 = Scattered clouds / few clouds
   - 3 = Broken clouds
   - 4 = Cloudy
   - 7 = Rain / light rain shower / light rain
   - 10 = Rain with thunderstorm
   - 26 = Snowfall
   - 94 = Freezing fog

In [None]:
df.shape

- We have 17414 instances with 10 different variables to work on.

In [None]:
df.isnull().sum()

- Yes, very clean data: no missing values in any of the 17414 instances.
- In the real world it is very hard to find data this clean. Enjoy !!

In [None]:
df.info()

- It looks like we have 9 numeric variables. But is that so???
- We also have 1 non-numeric variable.
- The non-numeric variable is stored as object, but it looks like a time value. It needs further adjustment. Noted.
- The boolean variables are coded as 0 and 1. Noted.
- The categorical variables **season** and **weather_code** are also coded as numbers. Noted.
- "t1" (real temperature in C) and "t2" ("feels like" temperature in C) seem to measure nearly the same thing, so we need to look at their correlation. Noted.
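The notes above about 0/1 flags and numerically coded categories can be acted on with a couple of type conversions. The following is only a minimal sketch on a made-up toy frame that imitates the column names; the notebook itself keeps the original numeric coding, and the season labels are taken from the metadata list.

```python
import pandas as pd

# Toy rows that only mimic the column layout described above; the values are made up.
toy = pd.DataFrame({
    "is_holiday": [0.0, 1.0, 0.0, 0.0],
    "is_weekend": [1.0, 0.0, 0.0, 1.0],
    "season": [0.0, 1.0, 2.0, 3.0],
})

# 0/1 flags become real booleans.
toy["is_holiday"] = toy["is_holiday"].astype(bool)
toy["is_weekend"] = toy["is_weekend"].astype(bool)

# Numeric season codes become a labelled categorical column.
season_names = {0: "Spring", 1: "Summer", 2: "Fall", 3: "Winter"}
toy["season"] = toy["season"].astype(int).map(season_names).astype("category")

print(toy.dtypes)
```

After a conversion like this, `describe()` no longer reports means and quartiles for columns where such numbers carry no meaning.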

In [None]:
df.drop(['season', 'weather_code', 'is_holiday','is_weekend'], axis=1).describe()

Before going further, let's summarize what we have got from the dataset.

- Our dataset has 17414 time records of bike rentals.
- "t1" (real temperature in C) and "t2" ("feels like" temperature in C) seem to measure nearly the same thing, so we need to look at their correlation. We need to be careful about multicollinearity.

- We have a date stored as object, which needs to be converted.

- The numerically coded variables (season and weather_code) can be used for grouping, to see the differences among their categories.

- 'cnt', the count of bike shares, will be our target variable.

- The numerical columns most probably have outliers (mean-median difference, difference between the 75% value and the maximum, difference between the 25% value and the minimum), so we have to check them.

- Let's make the necessary adjustments before moving to the analysis part.
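The outlier hint in the summary above (mean versus median, quartiles versus extremes) can be made concrete with the common 1.5 × IQR rule, which is also the convention behind box-plot whiskers. This is a sketch on made-up counts, not on the real data, and the 1.5 multiplier is a convention rather than something this notebook prescribes.

```python
import pandas as pd

# Made-up hourly counts, not taken from the real dataset.
cnt = pd.Series([120, 340, 560, 800, 850, 900, 1300, 1700, 2100, 7800])

# Quartiles and interquartile range.
q1, q3 = cnt.quantile(0.25), cnt.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as a potential outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = cnt[(cnt < lower) | (cnt > upper)]

print(f"mean={cnt.mean():.0f}, median={cnt.median():.0f}")
print(f"fences: [{lower:.0f}, {upper:.0f}]")
print(outliers)
```

A mean well above the median together with flagged values only above the upper fence is the same pattern the `describe()` table hints at.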

#### **Temperature**

- Let's check the correlation between real temperature and felt temperature.
- If the correlation is high, the two variables carry nearly the same information, and keeping only one of them avoids multicollinearity problems in a later model.
- Even though this study is only a detailed EDA, it is still a best practice to follow.

In [None]:
df['t1'].corr(df['t2'])

- The correlation is extremely high, so we will use only "t1" (real temperature in C) in our analysis.
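The "keep only one of two highly correlated columns" decision can be written as a small rule. The sketch below uses made-up temperatures, and the 0.9 cut-off is an assumed threshold chosen for illustration, not a value from this notebook.

```python
import pandas as pd

# Made-up temperatures; "t2" tracks "t1" closely, similar in spirit to the real columns.
toy = pd.DataFrame({
    "t1": [2.0, 5.0, 9.0, 14.0, 19.0, 24.0],
    "t2": [0.0, 3.5, 8.0, 14.0, 19.0, 25.0],
})

THRESHOLD = 0.9  # assumed cut-off for "too correlated"

# Pearson correlation between the two columns.
r = toy["t1"].corr(toy["t2"])

# Drop the second column when the absolute correlation exceeds the threshold.
if abs(r) > THRESHOLD:
    toy = toy.drop(columns="t2")

print(round(r, 3), list(toy.columns))
```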

#### **timestamp**

- Let's convert 'timestamp' to a datetime object and use its values to create new columns.

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df= df.set_index('timestamp')

In [None]:
df['year_month']= df.index.strftime('%Y-%m')
df['year'] = df.index.year
df['month']= df.index.month
df['day_of_week']=df.index.dayofweek
df['hour']=df.index.hour

df.head()

- Seems much better
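One detail worth knowing when reading the new columns: in pandas, `dayofweek` counts from Monday = 0 to Sunday = 6. A small sketch with three made-up timestamps shows the numbering next to the day names; the `day_name` column is an extra added here for illustration and is not part of the notebook.

```python
import pandas as pd

# Three made-up timestamps: a Sunday, a Monday and a Saturday.
idx = pd.to_datetime(["2015-01-04 00:00:00", "2015-01-05 08:00:00", "2015-01-10 17:00:00"])
toy = pd.DataFrame({"cnt": [182, 1500, 900]}, index=idx)

toy["day_of_week"] = toy.index.dayofweek   # 0 = Monday ... 6 = Sunday
toy["day_name"] = toy.index.day_name()     # human-readable label
toy["hour"] = toy.index.hour

print(toy)
```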

#### Look at the **season** and **weather_code** 

In [None]:
df['season'].value_counts()

- That's good, it can be used for grouping to see the differences in the count of bike shares.

In [None]:
df['weather_code'].value_counts()

- It seems OK, can be used in the groupby.

### Analysis Part

#### **Season**

In [None]:
df['season'].value_counts(normalize=True)

- The dataset contains almost the same number of instances from each of the four seasons.

In [None]:
fig = px.bar(x= df['season'].value_counts().index, y=df['season'].value_counts().values, 
             title='Seasons', labels={'y':'Count', 'x':'Seasons'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### **weather_code**

In [None]:
df['weather_code'].value_counts(normalize=True)

- 35% of the time, the weather code is Clear (1.0)
- 23% of the time, the weather code is 'scattered clouds, few clouds'
- 20% of the time, the weather code is 'broken clouds'
- 12% of the time, it is 'rain, light rain'

- By the way, remember that we are looking at London's data. So rain and clouds are quite Londonish.

In [None]:
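weather_labels = {1: 'Clear', 2: 'Scattered clouds', 3: 'Broken clouds', 4: 'Cloudy', 7: 'Rain', 10: 'Rain with thunderstorm', 26: 'Snowfall', 94: 'Freezing fog'}
weather_counts = df['weather_code'].value_counts()  # sorted by frequency, not by code
# Map each code to its label so every slice gets the right name
fig = px.pie(values=weather_counts.values, names=weather_counts.index.astype(int).map(weather_labels))
fig.show()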



#### **Count of a New Bike Shares**

In [None]:
df['cnt'].describe()

- There is a huge difference between the mean and the median (mean = 1143, median = 844).
- Because the mean is well above the median, we can expect a highly right skewed distribution with possible outliers on the maximum side.
- Let's see it.

In [None]:
fig = px.histogram(df, x= 'cnt', title='Count of a New Bike Shares', marginal="box", hover_data = df[['season']])
fig.show()

- As expected, highly right skewed distribution with the outliers on the maximum side.

- All of the extreme outliers (starting from a count of 5560) are in season 1, which means summer.

- Any surprise !!! 
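The "mean above median means right skew" reading can also be checked numerically with `Series.skew()`, where a positive value points to a longer right tail. The two series below are made up purely to show the contrast.

```python
import pandas as pd

# Made-up values: one series with a long right tail, one symmetric.
right_tailed = pd.Series([100, 200, 300, 400, 500, 600, 700, 5000])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

for name, s in [("right_tailed", right_tailed), ("symmetric", symmetric)]:
    # Compare mean and median, then report the sample skewness.
    print(name, "mean:", s.mean(), "median:", s.median(), "skew:", round(s.skew(), 2))
```

For the right-tailed series the mean sits well above the median and the skewness is positive; for the symmetric one both agree and the skewness is zero.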

#### **real temperature in C**

In [None]:
df['t1'].describe()

- The mean and the median are very close to each other. The median is slightly higher than the mean.
- So we can expect a very slightly left skewed distribution.
- But the distribution will be very close to a normal distribution, with several outliers.
- Let's see it.

In [None]:
fig = px.histogram(df, x= 't1', title='Temperatures', marginal="box", hover_data = df[['season']])
fig.show()

- Yeah, as we expected, a fairly normal distribution with several outliers.
- As seen more clearly in the box plot, the distribution is very slightly left skewed.

#### **Wind Speed**

In [None]:
df['wind_speed'].describe()

- We can expect a slightly right skewed distribution (mean = 15.9, median = 15),
- which will be very close to a normal distribution.
- We can expect outliers on the maximum side.

In [None]:
fig = px.histogram(df, x= 'wind_speed', title='Wind Speed', marginal="box", hover_data = df[['season']])
fig.show()

- As we expected, several outliers on the right side.
- Slightly right skewed distribution

#### **Humidity**

In [None]:
df['hum'].describe()

- The mean and the median are close to each other.
- Since the median is a little higher than the mean, we can expect a slightly left skewed distribution.
- There may be outliers on the minimum side.

In [None]:
fig = px.histogram(df, x= 'hum', title='Humidity', marginal="box", hover_data = df[['season']])
fig.show()

- As we expected, left skewed distribution with outliers on the left side.

- OK, after looking at the numerical variables in detail, let's see the correlation matrix and their relationships with the bike share count.

### **Correlation**

In [None]:
df[['cnt','t1','hum','wind_speed']].corr()

In [None]:
index_vals = df['season'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='Number of Bike Share',
                                 values=df['cnt']),
                            dict(label='Temperature',
                                 values=df['t1']),
                            dict(label='Humidity',
                                 values=df['hum']),
                           dict(label='Wind Speed',
                                 values=df['wind_speed'])],
                showupperhalf=False, 
                text=df['season'],
                marker=dict(color=index_vals,
                            showscale=False,
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Bike Share in London',
    width=1000,
    height=1000,
)

fig.show()

- Based on the correlation matrix:
    - There is a weak positive relationship (.388) between temperature and the number of bike shares.
    - There is also a weak-to-moderate negative relationship (-.46) between humidity and the number of bike shares.
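When reading a correlation matrix, the sign gives the direction and the absolute value gives the strength. A hedged sketch on made-up rows (they only imitate the direction of the relationships above, not their size) shows one way to pull out the target column and order the features by strength.

```python
import pandas as pd

# Made-up rows; the numbers are illustrative and not from the real dataset.
toy = pd.DataFrame({
    "cnt": [200, 450, 800, 1200, 1600, 2400],
    "t1": [3.0, 6.0, 11.0, 12.0, 20.0, 25.0],
    "hum": [90.0, 88.0, 70.0, 75.0, 55.0, 50.0],
    "wind_speed": [10.0, 22.0, 8.0, 15.0, 12.0, 18.0],
})

# Correlation of every feature with the target column.
with_target = toy.corr()["cnt"].drop("cnt")

# Order by absolute strength while keeping the sign visible.
ranked = with_target.reindex(with_target.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```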

#### **Holiday or No?**

In [None]:
df['is_holiday'].value_counts()

In [None]:
fig = px.pie(df, values=df['is_holiday'].value_counts().values, 
             names= ['Normal Day','Holiday'] )
fig.show()

#### **Weekend or No**

In [None]:
df['is_weekend'].value_counts()

In [None]:
fig = px.pie(df, values=df['is_weekend'].value_counts().values, 
             names= ['Weekday','Weekend'] )
fig.show()

- Ok let's go deeper.

### **Bike Share by Year**

In [None]:
fig = px.scatter(df, x="year", y="cnt")
fig.show()

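- From 2015 to 2017 the plotted counts appear to decrease, but the years may not contribute the same number of records, so this should not be read as a decline in bike sharing without checking the coverage per year.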
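Before reading the scatter plot as a year-over-year trend, it may help to check how many records each year contributes and what the average count is. The sketch below uses made-up rows in which the last year has only a single winter record; whether the real data behaves like this is an assumption that needs to be verified on `df` itself.

```python
import pandas as pd

# Made-up rows: three records each for two years, a single winter record for the last one.
idx = pd.to_datetime([
    "2015-03-01 08:00", "2015-07-01 08:00", "2015-11-01 08:00",
    "2016-03-01 08:00", "2016-07-01 08:00", "2016-11-01 08:00",
    "2017-01-01 08:00",
])
toy = pd.DataFrame({"cnt": [900, 2500, 1100, 950, 2600, 1000, 300]}, index=idx)
toy["year"] = toy.index.year

# Number of records and mean count per year.
per_year = toy.groupby("year")["cnt"].agg(["count", "mean"])
print(per_year)
```

If one year has far fewer records, or only covers one season, its lower maximum in the scatter plot says little about the overall trend.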

### **Bike Share by Year and Months**

In [None]:
fig = px.scatter(df, x="year_month", y="cnt")
fig.show()

- As is easily seen in the scatter plot, there is a significant increase in bike shares during the summer.
- On the other hand, bike shares decrease significantly during the winter.

#### **Bike Share by Seasons**

In [None]:
df['season1']= df['season'].replace({0:'Spring',1:'Summer',2:'Fall',3:'Winter'})
fig = px.bar(df, x='season1', y= 'cnt',  hover_data = df[['year_month']], color='season1', 
             labels={'season1':'Seasons','cnt':'Number of Bike Share'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- What we saw in the year_month plot also holds for the seasons.
- Bike shares increase in the summer and reach their lowest point in the winter.
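A note on reading this bar chart: when a bar chart is drawn from raw rows, the segments for each category are stacked, so the bar height is the total count per season. Since the seasons have almost the same number of records here, totals and averages tell a similar story, but a per-season summary makes the comparison explicit. The sketch uses made-up rows with deliberately unequal group sizes.

```python
import pandas as pd

# Made-up rows with unequal numbers of records per season.
toy = pd.DataFrame({
    "season": [0, 0, 1, 1, 1, 2, 3],
    "cnt": [1000, 1200, 2000, 2200, 1800, 1100, 600],
})
season_names = {0: "Spring", 1: "Summer", 2: "Fall", 3: "Winter"}
toy["season_name"] = toy["season"].map(season_names)

# Total, average and number of records per season.
summary = toy.groupby("season_name")["cnt"].agg(["sum", "mean", "count"])
print(summary.sort_values("mean", ascending=False))
```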

#### **Bike Share During the Holiday**

In [None]:
holiday = df.groupby('is_holiday')['cnt'].mean().reset_index().rename(columns={'is_holiday': 'Holiday', 'cnt':'Number of Bike Shared'}, )
holiday['Holiday']= holiday['Holiday'].replace({0: 'Normal Day', 1:'Holiday'})

fig = px.bar(holiday, x='Holiday', y= 'Number of Bike Shared', color='Holiday', )
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- On average, normal days have more bike shares per record than holidays.

#### **Bike Share During the Weekend**

In [None]:
weekend = df.groupby('is_weekend')['cnt'].mean().reset_index().rename(columns={'is_weekend': 'Weekend', 'cnt':'Number of Bike Shared'}, )
weekend['Weekend']= weekend['Weekend'].replace({0: 'Weekday', 1:'Weekend'})

fig = px.bar(weekend, x='Weekend', y= 'Number of Bike Shared', color='Weekend', )
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- On average, weekdays have more bike shares per record than weekends.

#### **Bike Shares by Hour**

In [None]:
fig = px.scatter(df, x="hour", y="cnt", color='is_holiday')
fig.show()

- The peak hours for bike sharing are 8-10 in the morning and 17-18 in the late afternoon.
- We can speculate about this result: the peaks may correspond to trips to work or school and the trips back home.
- But we still need more data to justify these assumptions.

In [None]:
fig = px.scatter(df, x="hour", y="cnt", color='is_weekend')
fig.show()

- During the weekend the pattern is different.
- On weekends, the peak time to share a bike is between 10 and 16.
- Yeah, also around midnight, somebody needs a ride !!!
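The peak hours read off the scatter plots can also be obtained as numbers by averaging the count per hour, separately for weekdays and weekends. This is a sketch on a handful of made-up rows, so the peaks it prints are illustrative only.

```python
import pandas as pd

# Made-up rows: three hours of the day, observed on weekdays (0) and weekends (1).
toy = pd.DataFrame({
    "hour":       [8, 8, 13, 13, 18, 18, 8, 13, 18],
    "is_weekend": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "cnt":        [3000, 3400, 900, 1100, 3600, 3200, 500, 2000, 1200],
})

# Mean count for every hour, one column per day type.
profile = toy.pivot_table(index="hour", columns="is_weekend", values="cnt", aggfunc="mean")
profile.columns = ["Weekday", "Weekend"]

print(profile)
print(profile.idxmax())  # hour with the highest average count per day type
```

The same pivot on the real `df` would give a 24-row table that can be plotted as two lines, which is often easier to read than overlapping scatter points.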

In [None]:
fig = px.scatter(df, x="day_of_week", y="cnt", color='is_weekend', hover_data = df[['hour']])
fig.show()

- Except for Thursdays, the distribution is almost the same across the weekdays.
- Thursdays have peaks at 8 a.m. and in the afternoon between 16 and 18.

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!


* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)

- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner friendly EDA with you. Thanks for your time.

- All the best 