**This is my first attempt at an EDA, so I will try to make the explanations as detailed as possible for my own understanding. If you find anything that I overlooked or got wrong, please let me know so that I can improve in the future. Thank you in advance!**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing other necessary libraries other than defaults
# This cell would be updated and run again in the case that I would want to use an extra library
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import missingno as msno
import calendar
from datetime import datetime

# Data Overview

In [None]:
# Reading the train and test datasets
train_path = '/kaggle/input/bike-sharing-demand/train.csv'
test_path = '/kaggle/input/bike-sharing-demand/test.csv'

# Since the 'datetime' column represents datetime values, the parse_dates argument is also passed
train = pd.read_csv(train_path, parse_dates=['datetime'])
test = pd.read_csv(test_path, parse_dates=['datetime'])

In [None]:
# View a brief excerpt of the train and test datasets
#train.head()
#train.tail()
#test.head()
#test.tail()

In [None]:
print('Shape of train dataset: {}'.format(train.shape))
print('Shape of test dataset: {}'.format(test.shape))

In [None]:
# Checking whether the 'count' column truly is the sum of the 'casual' and 'registered' columns
print(train[train['count'] != train['casual'] + train['registered']])

According to the results of the previous cells, the train dataset has 10886 observations across 9 features, and the test dataset has 6493 observations across the same 9 features, which are described below.
The train dataset also has 3 columns containing target values: "casual", "registered", and "count", which is the sum of the previous two. For this reason, it is possible to ignore the other two and focus only on the "count" column.

In [None]:
# Basic description of the dataset
train.describe().T

The basic description of the dataset shows an intriguing point: there are observations in which the humidity is 0, and observations in which the windspeed is 0. While a 0 windspeed might reflect an actual moment with no wind, it could also be that the windspeed was too small to be detected. A 0 humidity, on the other hand, is practically impossible (at least on Earth, that is...). Therefore some attention needs to be paid to these values during modeling. Let's count the number of observations having such extreme values.

In [None]:
# Count the zero readings directly; unlike value_counts()[0], this also works if a column contains no zeros
num_zero_humidity = (train['humidity'] == 0).sum()
num_zero_wind = (train['windspeed'] == 0).sum()
print('Number of observations with 0 humidity: {}/{} which is {:.4f}%'.format(num_zero_humidity, len(train), num_zero_humidity/len(train)*100))
print('Number of observations with 0 wind: {}/{} which is {:.4f}%'.format(num_zero_wind, len(train), num_zero_wind/len(train)*100))

With so few observations with 0 humidity, we could get away with removing them.

With roughly 12% of the observations having 0 windspeed, however, removing them would result in a huge loss of data. Since these values could very well be depicting the situations when windspeed is negligibly small, it would be reasonable to leave them as they are during the modeling stage, or we could try to impute them.
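
As a rough illustration of the imputation option, here is a minimal sketch that treats zero windspeed as "not measured" and fills it with the median of the non-zero readings from the same season. The toy values, the choice of "season" as the grouping column, and the `windspeed_imputed` column name are all assumptions made for this example; a model-based imputation would be another reasonable choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    'season':    [1, 1, 1, 2, 2, 2],
    'windspeed': [0.0, 8.0, 12.0, 0.0, 16.0, 20.0],
})

# Treat 0 as "not measured" so that it does not drag the group medians down.
wind = toy['windspeed'].replace(0, np.nan)

# Fill each gap with the median of the non-zero readings in the same season.
season_median = wind.groupby(toy['season']).transform('median')
toy['windspeed_imputed'] = wind.fillna(season_median)

print(toy)
```

The same three lines should carry over to `train` as long as the grouping column is still available when the imputation is run.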

In [None]:
train.dtypes

# Checking for missing data

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

It seems that we are in luck, as there is no missing data!

(Makes sense, since this is a beginner-friendly dataset.)

Let's visualize this lack of missing data anyway, for the sake of trying out the missingno library.

In [None]:
msno.matrix(train)

# Transforming data

As per the description of the dataset, the "season", "holiday", "workingday", and "weather" features are *categorical*, yet their *dtype* is *int64*. Given that each feature has at most 4 distinct values, it would make sense to one-hot encode them when making predictions. For EDA, it is enough to convert their *dtype* to *category* (and to give them more descriptive values than just 1, 2, 3, ...).

Also, the "datetime" column contains both the date and the hour of each observation. It makes sense to separate the "date", "month", "dayofweek", and "hour" values into their own columns for easier analysis.

So, the to-do list includes:
* Separate "datetime" values into separate columns: "date", "month", "dayofweek", "hour"
* Change the "season", "holiday", "workingday", "weather", "month", "dayofweek", "hour" features into *category*


In [None]:
train['date'] = train['datetime'].dt.date
train['month'] = train['datetime'].dt.month
train['dayofweek'] = train['datetime'].dt.day_name() # weekday name, e.g. 'Monday'
train['hour'] = train['datetime'].dt.hour

train['season'] = train['season'].map({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})
train['weather'] = train['weather'].map({
    1: 'Clear / Cloudy', # Clear, Few clouds, Partly cloudy
    2: 'Misty', # Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: 'Light Snow / Rain', # Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: 'Heavy Snow / Rain' # Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
})

train = train.drop('datetime',axis=1)

In [None]:
CategoryCol = ['season','holiday','workingday','weather','month','dayofweek','hour']
for col in CategoryCol:
    train[col] = train[col].astype('category')

In [None]:
train.dtypes
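
For the modeling stage, the one-hot encoding mentioned earlier could look roughly like the sketch below. It uses `pd.get_dummies` on a small made-up frame that only borrows the column names and labels from this notebook; the values themselves are illustrative.

```python
import pandas as pd

# Toy frame with two categorical features and one numeric feature (made-up values).
toy = pd.DataFrame({
    'season':  pd.Categorical(['Spring', 'Summer', 'Fall', 'Winter']),
    'weather': pd.Categorical(['Misty', 'Clear / Cloudy', 'Misty', 'Light Snow / Rain']),
    'temp':    [9.84, 22.14, 25.42, 13.94],
})

# One indicator column per category level; numeric columns pass through untouched.
encoded = pd.get_dummies(toy, columns=['season', 'weather'])
print(encoded.columns.tolist())
```

Each row ends up with exactly one active indicator per original feature, which is what most linear and tree-based models expect from a categorical input.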

# Outlier Analysis

At first glance, it is clear that both the "casual" and "registered" values are heavily skewed to the right, and a great number of observations lie beyond the upper whisker of the box plot (more than 1.5 times the interquartile range above the third quartile).

The boxplots that plot "count" against the categorical features reveal some interesting insights:
* On average, more people rent bikes on working days than non-working days
* There are fewer rentals during Spring compared to other seasons
* When the weather is really bad (heavy snow/rain), the number of rentals drops drastically, as would be expected
* The number of rentals is highest during 7-8AM and 5-6PM, which coincide with rush hours. Because rentals are consistently high in these hours, there are also virtually no outliers.
* Saturday and Sunday contribute significantly fewer outliers than the other days of the week.

The 4 plots of "count" against the numerical features also reveal the following:
* Plots of "count" against "temp" and "atemp" have similar shape, as they should, because these two features are closely correlated.
* There is a gap in the windspeed values between 0 and 7. This seems to support our suspicion that a "0 windspeed" stands for a negligibly small windspeed. Still, we could keep these values as they are, or impute them in order to have continuous values in the dataset.

In [None]:
def draw_plot(data): #making this a function so that plots can be easily redrawn after data transformation
    # Boxplots of target values
    fig1, ax1 = plt.subplots(ncols=3,nrows=2)
    fig1.set_size_inches(20,12)
    
    sns.boxplot(data=data, y='count',orient='v',ax=ax1[0,0])
    ax1[0,0].set(title='Box Plot on Count of Total Rentals', ylabel='Total')
    
    sns.boxplot(data=data, x='workingday', y='count', ax=ax1[0,1],orient='v')
    ax1[0,1].set(title='Box Plot on Total Rentals across Working days and Non-working days', ylabel='Total', xlabel=None)
    
    sns.boxplot(data=data, x='dayofweek', y='count',ax=ax1[0,2],orient='v',order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
    ax1[0,2].set_xticklabels(ax1[0,2].get_xticklabels(),rotation=30)
    ax1[0,2].set(title='Box Plot on Total Rentals across Weekdays', ylabel='Total', xlabel=None)

    sns.boxplot(data=data, x='hour', y='count',ax=ax1[1,0],orient='v')
    ax1[1,0].set_xticklabels(ax1[1,0].get_xticklabels(),rotation=90)
    ax1[1,0].set(title='Box Plot on Total Rentals across Hours', ylabel='Total', xlabel='Hour')

    sns.boxplot(data=data, x='weather', y='count',ax=ax1[1,1],orient='v')
    ax1[1,1].set_xticklabels(ax1[1,1].get_xticklabels(),rotation=45)
    ax1[1,1].set(title='Box Plot on Total Rentals across Weather Conditions', ylabel='Total', xlabel='Weather')

    sns.boxplot(data=data, x='season', y='count',ax=ax1[1,2],orient='v',order=['Spring','Summer','Fall','Winter'])
    ax1[1,2].set(title='Box Plot on Total Rentals across Seasons', ylabel='Total', xlabel=None)
    
    #Scatterplots of 'count' values against numerical features
    fig3, ax3 = plt.subplots(ncols=4)
    fig3.set_size_inches(20,6)
    
    sns.regplot(data=data, x='temp', y='count',ax=ax3[0],color='goldenrod')
    sns.regplot(data=data, x='atemp', y='count',ax=ax3[1],color='green')
    sns.regplot(data=data, x='humidity', y='count',ax=ax3[2],color='purple')
    sns.regplot(data=data, x='windspeed', y='count',ax=ax3[3])
    
draw_plot(train)
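
The observations read off the box plots can also be cross-checked numerically. The sketch below is one possible way to do it: it pivots median rentals by hour and working day on a tiny made-up frame (the numbers are invented purely for illustration), and the same calls should work on `train`.

```python
import pandas as pd

# Toy hourly records with made-up values; swap in `train` to run the same check.
toy = pd.DataFrame({
    'workingday': [1, 1, 1, 1, 0, 0, 0, 0],
    'hour':       [8, 12, 17, 3, 8, 12, 17, 3],
    'count':      [420, 180, 510, 6, 90, 330, 280, 25],
})

# Median rentals per hour, split by working vs. non-working day.
hourly = toy.pivot_table(index='hour', columns='workingday',
                         values='count', aggfunc='median')
print(hourly)

# Hour with the highest median on working days.
peak_hour = hourly[1].idxmax()
print('Peak working-day hour:', peak_hour)
```

The median is used here rather than the mean because it is the statistic the box plots display, and it is less sensitive to the outliers discussed in this section.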

We treat observations whose "count" value lies more than 3 standard deviations away from the column mean, on either side, as outliers. Let's remove them!

Box plots of the dataset after removing outliers show fewer points beyond the upper whisker.

In [None]:
trainNoOutlier = train[(np.abs(train['count']-train['count'].mean()) <= (3*train['count'].std())) &
                       (train['humidity'] != 0)] #removing 0 humidity values as well 
trainNoOutlier = trainNoOutlier.drop(['casual','registered'],axis=1)
draw_plot(trainNoOutlier)

The number of observations which have been removed is insignificant compared to the size of the dataset.

In [None]:
print('Shape of train dataset: {}'.format(train.shape))
print('Shape of train dataset without outliers: {}'.format(trainNoOutlier.shape))
print('Number of removed observations: {}'.format(train.shape[0] - trainNoOutlier.shape[0]))
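
The 3-standard-deviation rule used above can be wrapped in a small helper so that it is easy to reuse on other columns. This is only a sketch: `within_n_sigma` is a hypothetical name, and the data is synthetic with one planted extreme value.

```python
import numpy as np
import pandas as pd

def within_n_sigma(series, n=3.0):
    """Boolean mask: True where a value lies within n standard deviations of the mean."""
    return (series - series.mean()).abs() <= n * series.std()

# Synthetic counts (not the competition data) with one planted extreme value.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'count': rng.poisson(lam=50, size=1000)})
toy.loc[0, 'count'] = 1000

mask = within_n_sigma(toy['count'])
print('Rows kept:', mask.sum(), 'of', len(toy))
```

Note that the mean and standard deviation are themselves inflated by the extreme values, so on a heavily right-skewed column this rule removes only the most extreme points.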

# Visualizing Distribution

As clearly depicted in the figures below, the distribution of target values is right-skewed.

Since the target is a count of bike rentals, we could leave this distribution as it is and try a Poisson regression model during analysis. Alternatively, we could try to normalize the distribution using a log transformation or a Box-Cox transformation.
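
For reference, the standard one-parameter Box-Cox transformation of a strictly positive response $y$ is usually written as follows (the symbol $\lambda$ for the power parameter is notation introduced here, not something from the dataset):

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
  \log y, & \lambda = 0
\end{cases}
```

The log transformation is therefore the special case $\lambda = 0$. When no $\lambda$ is supplied, `scipy.stats.boxcox` picks the value that maximizes the log-likelihood, which is why it can adapt to the data better than a plain log.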

In [None]:
def draw_distribution(data): #making this a function for easy reuse
    fig1, ax1 = plt.subplots(ncols=2)
    fig1.set_size_inches(10,4)
    sns.histplot(data=data,x='count',kde=True, ax=ax1[0])
    stats.probplot(x=data['count'], dist='norm', plot=ax1[1])
    
draw_distribution(trainNoOutlier)

Log transforming the response values:

In [None]:
train_log = trainNoOutlier.copy().drop(['count'],axis=1)
train_log['count'] = np.log(trainNoOutlier['count'])

draw_distribution(train_log)

The incomplete graphs result from the response taking the value 0 for a number of observations. However, this does not hinder our continued analysis.

Box-Cox transforming the response values:

In [None]:
train_boxcox = trainNoOutlier.copy().drop(['count'],axis=1)
train_boxcox['count'],_ = stats.boxcox(trainNoOutlier['count']+0.1) #avoid 0 value

draw_distribution(train_boxcox)

Judging from the distribution and probability plots, the Box-Cox transformation did a better job of normalizing the response values than the log transformation, though the results are still far from a normal distribution.
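
The visual comparison can be backed up with a number such as sample skewness, where values closer to 0 indicate a more symmetric distribution. The sketch below does this on synthetic right-skewed counts, not on the competition data, so the printed figures only illustrate the approach.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive counts (not the competition data).
counts = (rng.negative_binomial(2, 0.02, size=2000) + 1).astype(float)

log_counts = np.log(counts)
boxcox_counts, lam = stats.boxcox(counts)  # lambda fitted by maximum likelihood

for name, values in [('raw', counts), ('log', log_counts), ('box-cox', boxcox_counts)]:
    print('{:8s} skewness = {:+.3f}'.format(name, stats.skew(values)))
print('fitted lambda = {:.3f}'.format(lam))
```

Running the same loop on `trainNoOutlier['count']`, `train_log['count']`, and `train_boxcox['count']` would give a comparable summary for the actual data.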

# Visualizing Correlation

Let's plot the correlation of response values to numerical features, using the original data, as well as the log-transformed and box-cox-transformed data.

In [None]:
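# numeric_only=True skips the categorical/object columns, which recent pandas versions refuse to correlate
corrMatt = trainNoOutlier.corr(numeric_only=True)
corrMattLog = train_log.corr(numeric_only=True)
corrMattBC = train_boxcox.corr(numeric_only=True)
# Boolean mask that hides the upper triangle (above the diagonal) of each heatmap
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)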

fig1,ax1=plt.subplots(ncols=3)
fig1.set_size_inches(20,6)
sns.heatmap(corrMatt,mask=mask,annot=True,square=True,cmap='YlOrBr',ax=ax1[0])
ax1[0].set(title='Original Data')
sns.heatmap(corrMattLog,mask=mask,annot=True,square=True,cmap='YlOrBr',ax=ax1[1])
ax1[1].set(title='Log-Transformed Data')
sns.heatmap(corrMattBC,mask=mask,annot=True,square=True,cmap='YlOrBr',ax=ax1[2])
ax1[2].set(title='Box-Cox-Transformed Data')

* As expected, "temp" and "atemp" are strongly collinear, which also follows from common sense. It would make sense to drop one of them from the model, as keeping both potentially increases the variance of the model. We choose to keep "temp", since people often pay attention to this value rather than "atemp" when deciding whether or not to rent a bike.
* "temp", "humidity", and "windspeed" show some correlation with "count", though for "windspeed" it is very weak. However, this alone is not enough reason to remove "windspeed" from the model.
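
One way to spot such collinear pairs programmatically, rather than by reading the heatmap, is sketched below. The data is synthetic (an "atemp" column built as a noisy rescaling of "temp"), and the 0.9 cutoff is an arbitrary threshold chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=500)
toy = pd.DataFrame({
    'temp': temp,
    'atemp': 1.1 * temp + rng.normal(0, 1, size=500),  # nearly a rescaled copy of temp
    'humidity': rng.uniform(20, 100, size=500),        # unrelated to temp
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.9]  # 0.9 is an arbitrary illustrative threshold
print(high)
```

Any pair that shows up in `high` is a candidate for dropping one of its two features.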

Let's display the relationships among the features (both categorical and numerical) and the response values in more detail with pairplots.

Overall, there are no discernible trends in the relationships between the features and the response in each pair. However, one noticeably strange point is that, for a number of observations, the "atemp" value remains unchanged even though the "temp" values differ by up to 10 degrees. This suggests some errors in recording the "atemp" values, which is another reason to remove the "atemp" column from the model.

In [None]:
#pairplots of original data
sns.set()
sns.pairplot(trainNoOutlier, height=2.5)
plt.show()

In [None]:
#pairplots of log-transformed data
sns.set()
sns.pairplot(train_log, height=2.5)
plt.show()

In [None]:
#pairplots of box-cox-transformed data
sns.set()
sns.pairplot(train_boxcox, height=2.5)
plt.show()
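
The suspicious "atemp" readings noted above could be located with a check along the following lines. This is a sketch on made-up readings, and the 5-degree tolerance is a hypothetical threshold, not a value taken from the data.

```python
import pandas as pd

# Made-up readings: the last three rows share one 'atemp' despite very different 'temp'.
toy = pd.DataFrame({
    'temp':  [10.0, 10.5, 30.0, 25.0, 35.0, 27.5],
    'atemp': [12.0, 12.0, 33.0, 15.0, 15.0, 15.0],
})

# Spread of 'temp' within each distinct 'atemp' value.
spread = toy.groupby('atemp')['temp'].agg(['min', 'max', 'count'])
spread['range'] = spread['max'] - spread['min']

# Flag 'atemp' values whose 'temp' range exceeds a hypothetical 5-degree tolerance.
suspicious = spread[spread['range'] > 5]
print(suspicious)
```

On the real data, the flagged rows would show which "atemp" values stay constant across a wide range of "temp" and how many observations are affected.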