# Bike sharing - Hypothesis Testing

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Business Problem:**
- To understand which variables or factors are significant in predicting the demand for shared electric cycles in the Indian market, and how well those factors describe that demand.

**Dataset and Column Profiling:**

- timestamp - timestamp field for grouping the data
- cnt - the count of new bike shares
- t1 - real temperature in C
- t2 - "feels like" temperature in C
- hum - humidity in percentage
- wind_speed - wind speed in km/h
- weather_code - category of the weather
    - 1 = Clear; mostly clear but have some values with haze/fog/patches of fog/fog in vicinity
    - 2 = Scattered clouds / few clouds
    - 3 = Broken clouds
    - 4 = Cloudy
    - 7 = Rain / light rain shower / light rain
    - 10 = Rain with thunderstorm
    - 26 = Snowfall
- is_holiday - boolean field - 1 holiday / 0 non holiday
- is_weekend - boolean field - 1 if the day is weekend
- season - category (0-spring; 1-summer; 2-fall; 3-winter)



**Importing required packages**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import nan  # the NaN and NAN aliases were removed in NumPy 2.0
from scipy import stats
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

**Loading data into Dataframe**

In [None]:
bike_data = pd.read_csv('../input/london-bike-sharing-dataset/london_merged.csv')
bike_data

In [None]:
bike_data.describe().transpose()

In [None]:
bike_data.info()

In [None]:
# Converting the timestamp column from object to datetime64[ns] so date parts can be extracted
bike_data['timestamp'] = bike_data['timestamp'].astype('datetime64[ns]')
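
The conversion above can also be written with `pd.to_datetime`, which makes it easy to spot unparseable values. The following is a minimal sketch on a made-up toy frame (the values and the deliberately bad third row are assumptions for illustration, not taken from `london_merged.csv`):

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({
    "timestamp": ["2015-01-04 00:00:00", "2015-01-04 01:00:00", "not a date"],
    "cnt": [182, 138, 134],
})

# errors="coerce" turns strings that cannot be parsed into NaT instead of raising.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")

# Count the rows that failed to parse so they can be inspected before analysis.
n_bad = int(toy["timestamp"].isna().sum())
print(toy["timestamp"].dtype, n_bad)
```

Checking `n_bad` before extracting year, month or day avoids silently carrying missing dates into the grouped plots.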

In [None]:
bike_data.info()

In [None]:
bike_data.shape

In [None]:
bike_data.isnull().sum()/len(bike_data) * 100

In [None]:
bike_data.nunique()

In [None]:
bike_data.duplicated().sum()

**Observations:**
- There are 4 categorical features (season, is_holiday, is_weekend, weather_code), 5 numerical/continuous features (cnt, t1, t2, hum, wind_speed) and 1 datetime column, i.e. 10 columns in total; the number of rows is the one reported by `bike_data.shape` above.
- No missing (null) values and no duplicated rows are present.

# Outlier Detection and Removal:

In [None]:
# Visualization before outlier removal
fig = plt.figure(figsize = (15,10))

ax1 = fig.add_subplot(221)
sns.boxplot(x = 'season', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for season vs corresponding bike renting cnts')

ax1 = fig.add_subplot(222)
sns.boxplot(x = 'is_holiday', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for is_holiday vs corresponding bike renting cnts')

ax1 = fig.add_subplot(223)
sns.boxplot(x = 'is_weekend', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for is_weekend vs corresponding bike renting cnts')

ax1 = fig.add_subplot(224)
sns.boxplot(x = 'weather_code', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for weather_code vs corresponding bike renting cnts')

plt.show()

In [None]:
fig = plt.figure(figsize = (15,10))

ax1 = fig.add_subplot(221)
sns.scatterplot(x = 'cnt', y = 't1',data = bike_data, hue ='season' )
ax1.set_title('scatterplot for season vs corresponding bike renting cnts')

ax1 = fig.add_subplot(222)
sns.scatterplot(x ='cnt', y = 't1', data = bike_data, hue ='is_holiday')
ax1.set_title('scatterplot for is_holiday vs corresponding bike renting cnts')

ax1 = fig.add_subplot(223)
sns.scatterplot(x = 'cnt', y = 't1',data = bike_data, hue ='is_weekend')
ax1.set_title('scatterplot for is_weekend vs corresponding bike renting cnts')

ax1 = fig.add_subplot(224)
sns.scatterplot(x = 'cnt',y = 't1',data = bike_data, hue ='weather_code')
ax1.set_title('scatterplot for weather_code vs corresponding bike renting cnts')

plt.show()

In [None]:
bike_dcopy = bike_data.copy() # Back up the original dataset before removing outliers

In [None]:
q1=bike_data['cnt'].quantile(0.25)
q3=bike_data['cnt'].quantile(0.75)
iqr=q3-q1
bike_data = bike_data[(bike_data['cnt'] >= q1 - 1.5*iqr) & (bike_data['cnt'] <= q3 +1.5*iqr)]
bike_data.shape

In [None]:
bike_dcopy.shape[0] - bike_data.shape[0]
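
The IQR rule used above can be wrapped in a small helper so the fences and the number of dropped rows are explicit. This is a sketch on invented numbers; `iqr_bounds` is a hypothetical helper name, not part of the original notebook:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy counts with one obvious extreme value (invented for illustration).
toy = pd.DataFrame({"cnt": [10, 12, 11, 13, 12, 11, 10, 500]})

low, high = iqr_bounds(toy["cnt"])
mask = toy["cnt"].between(low, high)   # inclusive on both ends
cleaned = toy[mask]
removed = int((~mask).sum())
print(low, high, removed)
```

Keeping the boolean `mask` around makes it easy to inspect which rows were dropped before committing to the filtered frame.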

In [None]:
#Visualization after removing outliers
fig = plt.figure(figsize = (15,10))

ax1 = fig.add_subplot(221)
sns.boxplot(x = 'season', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for season vs corresponding bike renting cnts')

ax1 = fig.add_subplot(222)
sns.boxplot(x = 'is_holiday', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for is_holiday vs corresponding bike renting cnts')

ax1 = fig.add_subplot(223)
sns.boxplot(x = 'is_weekend', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for is_weekend vs corresponding bike renting cnts')

ax1 = fig.add_subplot(224)
sns.boxplot(x = 'weather_code', y = 'cnt', data = bike_data)
ax1.set_title('Boxplot for weather_code vs corresponding bike renting cnts')

plt.show()

In [None]:
#Visualization after removing outliers
fig = plt.figure(figsize = (15,10))

ax1 = fig.add_subplot(221)
sns.scatterplot(x = 'cnt', y = 't1',data = bike_data, hue ='season' )
ax1.set_title('scatterplot for season vs corresponding bike renting cnts')

ax1 = fig.add_subplot(222)
sns.scatterplot(x ='cnt', y = 't1', data = bike_data, hue ='is_holiday')
ax1.set_title('scatterplot for is_holiday vs corresponding bike renting cnts')

ax1 = fig.add_subplot(223)
sns.scatterplot(x = 'cnt', y = 't1',data = bike_data, hue ='is_weekend')
ax1.set_title('scatterplot for is_weekend vs corresponding bike renting cnts')

ax1 = fig.add_subplot(224)
sns.scatterplot(x = 'cnt',y = 't1',data = bike_data, hue ='weather_code')
ax1.set_title('scatterplot for weather_code vs corresponding bike renting cnts')

plt.show()

**Observations:** 
- After dealing with the outliers, the rows whose cnt falls outside the IQR fences are removed from the dataset (the number removed is reported by the cell above). As the boxplots and scatterplots above show, the data now looks cleaner.

# Univariate Analysis and Bivariate Analysis:

**timestamp specific EDA:**

Since we are looking for insights in the time-series data, we will work on the original dataset from before outlier removal, which is **bike_dcopy**.

In [None]:
#creating a new dataframe for indexing timestamp
bike_datatime = pd.read_csv('../input/london-bike-sharing-dataset/london_merged.csv')
bike_datatime

In [None]:
bike_dcopy["timestamp"].sort_values() 

In [None]:
bike_dcopy['Year'] = bike_dcopy['timestamp'].dt.year
bike_dcopy['Month'] = bike_dcopy['timestamp'].dt.month
bike_dcopy['Day'] = bike_dcopy['timestamp'].dt.day
bike_dcopy

In [None]:
np.sort(bike_dcopy[bike_dcopy['cnt'] >= bike_dcopy['cnt'].quantile(0.75)]['Day'].unique())

In [None]:
bike_dcopy[bike_dcopy['cnt'] >= bike_dcopy['cnt'].quantile(0.95)]['Month'].unique()

In [None]:
bike_dcopy['year'] = bike_dcopy['timestamp'].dt.year

In [None]:
bike_dcopy['month'] = bike_dcopy['timestamp'].dt.month

In [None]:
bike_dcopy.head()

In [None]:
year_data = bike_dcopy.groupby(['year'])['cnt'].sum()
year_data = year_data.reset_index()
sns.barplot(x='year',y='cnt',data=year_data)
plt.title('Count of booking per year')
plt.show()

In [None]:
month_data = bike_dcopy.groupby(['month'])['cnt'].sum()
month_data = month_data.reset_index()
sns.barplot(x='month',y='cnt',data=month_data)
plt.title('Count of booking per month')
plt.show()

In [None]:
mon_year_data = bike_dcopy.groupby(['year','month'])['cnt'].sum()
mon_year_data = pd.DataFrame(mon_year_data)
mon_year_data.reset_index(inplace = True)
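# Keyword arguments are required by pandas 2.x, where pivot() no longer accepts positional arguments
myy = mon_year_data.pivot(index='month', columns='year', values='cnt').fillna(0)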

In [None]:
sns.heatmap(myy)
plt.title('Count of booking across years and months')
plt.xlabel('Year')
plt.ylabel('Month')
plt.show()
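
The month-by-year table behind the heatmap can also be built in one step with `pivot_table`, which aggregates and fills missing combinations at the same time. A minimal sketch on invented rows (the dates and counts below are assumptions for illustration only):

```python
import pandas as pd

# Toy rows standing in for the hourly data (values are invented).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-01-04", "2015-01-05", "2015-02-01", "2016-01-10"]
    ),
    "cnt": [100, 50, 70, 30],
})
toy["year"] = toy["timestamp"].dt.year
toy["month"] = toy["timestamp"].dt.month

# Sum cnt per (month, year); combinations with no rows become 0.
table = toy.pivot_table(
    index="month", columns="year", values="cnt", aggfunc="sum", fill_value=0
)
print(table)
```

The resulting frame has months as rows and years as columns, which is the shape `sns.heatmap` expects.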

As inferred earlier, the bookings are almost the same across all the months.

**Observations:** 
- The data spans roughly 2 years (the first and last timestamps are shown by the sorted `timestamp` column above), so the rental counts cover that period.
- During September and October, the maximum number of bikes are rented.
- The cnt is lower in the cold winter months (Nov, Dec, Jan, Feb), when people mostly prefer not to ride the bikes.
- The days of the month on which high-demand hours (top quartile of cnt) occur are listed by the `Day` cell above.
- As we can see from the month-wise bar plot, the demand for bikes in the early months is quite low compared with the months from March onwards. There's a drop in the middle owing to the cold winter season.
- Bookings recorded for 2017 are nearly zero so far.
- Almost all the months have the same number of bookings.


In [None]:
# Univariate analysis for numerical/continuous variables: histogram with mean, median and mode lines, plus a boxplot
def num_feat(col_data):
    fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(10,5))
    sns.histplot(col_data, kde=True, ax=ax[0], color = 'purple')
    ax[0].axvline(col_data.mean(), color='r', linestyle='--',linewidth=2)
    ax[0].axvline(col_data.median(), color='k', linestyle='dashed', linewidth=2)
    ax[0].axvline(col_data.mode()[0],color='y',linestyle='solid',linewidth=2)
    sns.boxplot(x=col_data, showmeans=True, ax=ax[1])
    plt.tight_layout()

In [None]:
bike_data.info()

In [None]:
bike_data.columns

In [None]:
num_cols = ['t1','t2','hum','cnt','wind_speed']
num_cols

In [None]:
for i in num_cols:
    num_feat(bike_data[i])
    

**Observations for univariate numerical features:**
- There are outliers in wind speed and casual users, which tells us that the wind speed is not uniform.
- The exponential-decay-shaped curve for cnt tells us that higher rental counts occur less and less frequently.

In [None]:
#EDA on Univariate Categorical variables
def cat_feat(col_data):
    fig,ax = plt.subplots(nrows=1,ncols=2,figsize=(12,5))
    fig.suptitle(col_data.name+' wise sale',fontsize=15)
    sns.countplot(x=col_data, ax=ax[0])  # pass the Series as x; recent seaborn versions treat the first positional argument as `data`
    col_data.value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1], shadow = True)
    plt.tight_layout()

In [None]:
bike_data.columns

In [None]:
cat_cols = ['season', 'is_holiday', 'is_weekend', 'weather_code']
cat_cols

In [None]:
for i in cat_cols:
    cat_feat(bike_data[i])

**Observations for univariate categorical features:**
- For weather_code, categories 10 and 26 (rain with thunderstorm, snowfall) have very few observations, so it is better to drop these categories in further tests.
- The cnt for bikes rented on working days is much higher than on non-working days.
- During holidays, people don't prefer to ride bikes.
- When the weather is clear or has few clouds, people tend to rent more bikes for their commute.
- Across spring, summer, fall and winter, the cnt of users renting bikes is more or less equal.

**Correlation between Bivariate features:**


In [None]:
plt.figure(figsize = (16, 10))
sns.heatmap(bike_data.corr(numeric_only=True), annot=True)  # skip the datetime column
plt.show()

**Observations:**
- The **registered user** cnt has a higher correlation with the cnt than the **casual user** cnt.
- **Wind speed and season** have a very low (near zero) positive correlation with the cnt, which means wind speed and season have little linear effect on the demand for bikes.
- **Temperature and the "feels like" temperature** have a moderate correlation (0.3) with the cnt. People tend to go out on bright sunny days when the temperature is normal, whereas in harsh conditions such as too hot or too cold, the demand for bikes dips considerably.
- Casual users who rent bikes like to ride when the temperature is suitable.
- On holidays, the user cnt dips considerably, whereas on working days the cnt is normal.

# Two - Sample T-Test

**Two-sample t-test to check whether the type of day (working day vs. weekend, `is_weekend`) has an effect on the number of electric cycles rented**

#### Step 1: Define Null & Alternate Hypothesis

Setting up Null Hypothesis (H0) and Stating the alternate hypothesis (Ha) and significance level
- **H0 : The bike's renting cnt in working days and non- working days are equal.**
- **Ha : The bike's renting cnt in working days and non- working days is not equal.**
- alpha = 0.05

#### Step 2: Validate the assumptions
**Two-sample t-test assumptions**
- Data values must be independent. Measurements for one observation do not affect measurements for any other observation.
- Data in each group must be obtained via a random sample from the population.
- Data in each group are normally distributed.
- Data values are continuous.
- The variances for the two independent groups are equal.

In [None]:
bike_data.shape

In [None]:
bike_data['is_weekend'].value_counts(normalize = True) * 100

In [None]:
bike_data['is_weekend'].value_counts()

In [None]:

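# Per the column profile, is_weekend == 1 marks weekend rows and 0 marks all other days,
# so `working_data` holds the weekend rows and `non_working_data` the remaining days.
working_data = bike_data[bike_data['is_weekend'] == 1].sample(4500, replace = False)
non_working_data = bike_data[bike_data['is_weekend'] == 0].sample(4500, replace = False)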


**Checking assumptions of the test (Normality, Equal Variance)**
- Using visualization methods - Histogram, Q-Q plot
- Using statistical methods like levene’s test, Shapiro-wilk test

In [None]:
round(working_data['cnt'].std()**2,2), round(non_working_data['cnt'].std()**2 ,2)

**Observations**: The variance is not equal for both the samples.

In [None]:
#Checking the normality
fig = plt.figure(figsize = (15,12))

ax1 = fig.add_subplot(221)
sns.histplot(data = working_data, x = 'cnt' , bins = 50, kde = True, ax = ax1, color = 'red')
ax1.set_title('cnt of bikes rented in working days')

ax2 = fig.add_subplot(222)
sm.qqplot(working_data['cnt'], line = 's', ax = ax2)
ax2.set_title('qqplot for cnt in working days')

ax3 = fig.add_subplot(223)
sns.histplot(data = non_working_data, x = 'cnt' , bins = 50, kde = True, ax = ax3, color = 'red')
ax3.set_title('cnt of bike rented in non-working days')

ax4 = fig.add_subplot(224)
sm.qqplot(non_working_data['cnt'], line = 's', ax = ax4)
ax4.set_title('qqplot for cnt in non-working days')

plt.show()

In [None]:
# Calculating the p-value and test statistic using ttest_ind() (Welch's t-test, equal_var=False) on the right-skewed samples.
# alternative='greater' makes this a one-sided test: Ha is that the mean of the first sample is larger.
t_test, p_value = stats.ttest_ind(working_data['cnt'],non_working_data['cnt'],
                                  alternative='greater', equal_var = False)
t_test, p_value

**Observations**: 
- The distribution of the samples is right-skewed and not normal, which violates the normality assumption of the two-sample t-test. The variances of the samples are also unequal. Hence we will apply a log transformation.
- We got a p-value of 0.99, which is greater than 0.05, so we fail to reject the null hypothesis against the one-sided alternative that was tested. We will check again after the log transformation.
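
Because the hypotheses above are stated as "equal" versus "not equal" while the call uses `alternative='greater'`, it helps to see how the one-sided and two-sided p-values relate. The sketch below uses seeded synthetic data (the group means, spreads and sizes are invented, not the bike data) and only `scipy.stats.ttest_ind`:

```python
import numpy as np
from scipy import stats

# Synthetic groups with different means and variances (invented values).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=20, size=200)
group_b = rng.normal(loc=110, scale=35, size=200)

# Welch's t-test under the three possible alternatives.
p_two = stats.ttest_ind(group_a, group_b, equal_var=False,
                        alternative="two-sided").pvalue
p_greater = stats.ttest_ind(group_a, group_b, equal_var=False,
                            alternative="greater").pvalue
p_less = stats.ttest_ind(group_a, group_b, equal_var=False,
                         alternative="less").pvalue

# The one-sided p-values sum to 1, and the two-sided p-value is
# twice the smaller of the two.
print(p_two, p_greater, p_less)
```

A one-sided p-value close to 1 therefore does not indicate equal means; it corresponds to a small p-value for the opposite direction, so a two-sided test is the one that matches an "is not equal" alternative.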

**Applying log on the data - Log Normal Distribution**

In [None]:
fig = plt.figure(figsize = (15,12))

ax1 = fig.add_subplot(221)
sns.histplot(data = np.log(working_data['cnt']) , bins = 50, kde = True, ax = ax1, color = 'green')
ax1.set_title('cnt of bikes rented in working days')

ax2 = fig.add_subplot(222)
sm.qqplot(np.log(working_data['cnt']), line = 's', ax = ax2)
ax2.set_title('qqplot for cnt in working days')

ax3 = fig.add_subplot(223)
sns.histplot(data = np.log(non_working_data['cnt']) , bins = 50, kde = True, ax = ax3, color  = 'green')
ax3.set_title('cnt of bike rented in non-working days')

ax4 = fig.add_subplot(224)
sm.qqplot(np.log(non_working_data['cnt']), line = 's', ax = ax4)
ax4.set_title('qqplot for cnt in non working days')

plt.show()

In [None]:
round(np.log(working_data['cnt']).std()**2,2), round(np.log(non_working_data['cnt']).std()**2 ,2)

**Observations**: After taking log on the sample population, we get a near normal distribution with variance very similar to each other. So we can calculate the p-value and test-statistics.

In [None]:
sample_w_log = np.log(working_data['cnt']).sample(4500)
sample_nw_log = np.log(non_working_data['cnt']).sample(4500)

In [None]:
statistic,p_value = stats.ttest_ind(sample_w_log,sample_nw_log , alternative = 'greater')
statistic,p_value

In [None]:
def htResult(p_value):
    significance_level = 0.05
    if p_value <= significance_level: 
        print('Reject NULL HYPOTHESIS') 
    else: 
        print('Fail to Reject NULL HYPOTHESIS') 

In [None]:
htResult(p_value)

In [None]:
stats.levene(sample_w_log, sample_nw_log, center='median')

In [None]:
sns.boxplot(x='is_weekend', y='cnt', data=bike_data)
plt.show()

**Conclusion: As the p-value > alpha (0.05), we fail to reject H0, so this test gives no evidence of a difference in the cnt of bikes rented between working and non-working days in the direction tested (`alternative='greater'`). The boxplot above gives a visual comparison of the two groups.**

# Chi-square test to check if weather is dependent on the season 

Assumptions:
- Assumption 1: Both variables are categorical.
- Assumption 2: All observations are independent.
- Assumption 3: Cells in the contingency table are mutually exclusive.
- Assumption 4: Expected value of cells should be 5 or greater in at least 80% of cells.
    - It’s assumed that the expected value of cells in the contingency table should be 5 or greater in at least 80% of cells and that no cell should have an expected value less than 1.
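
Assumption 4 can be checked programmatically once a table of expected counts is available. A minimal sketch, where `expected_counts_ok` is a hypothetical helper and the two small tables are invented for illustration:

```python
import numpy as np

def expected_counts_ok(expected, min_count=5, min_share=0.8):
    """True if at least `min_share` of cells have an expected count of
    `min_count` or more, and no cell has an expected count below 1."""
    expected = np.asarray(expected, dtype=float)
    share_ok = (expected >= min_count).mean()
    return bool(share_ok >= min_share and expected.min() >= 1)

# Invented expected-count tables: one that satisfies the rule, one that does not.
good = [[12.0, 18.0], [28.0, 42.0]]
bad = [[0.4, 9.6], [3.6, 86.4]]

print(expected_counts_ok(good), expected_counts_ok(bad))
```

The same function can be applied to the `expected` array returned by `stats.chi2_contingency`.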

**H0 : Both weather_code and seasons are independent of each other**

**Ha : There is dependency of weather_code on Seasons**

alpha = 0.05

In [None]:
contigency_table = pd.crosstab(bike_data.weather_code,bike_data.season,margins=True,margins_name='Total')
contigency_table

In [None]:
contigency_table = contigency_table.rename(columns = {'Total':'Row_total'})
contigency_table

A Chi-Square Test of Independence
- As we are testing the independence of 2 categorical variables, we use the Chi-squared test.

- Expected value of cells should be 5 or greater in at least 80% of cells & that no cell should have an expected value less than 1.
- We can use the following formula to calculate the expected values for each cell in the contingency table:
- Expected value = (row sum * column sum) / table sum.

In [None]:
n = contigency_table.at["Total", "Row_total"]
exp=contigency_table.copy()
for x in exp.index[0:-1]:
    for y in exp.columns[0:-1]:
        v= (((contigency_table.at[x, "Row_total"]) * (contigency_table.at["Total", y]))/n ).round(2)
        exp.at[x,y]=float(v)

exp = exp.iloc[[0, 1, 2, 3, 4, 5, 6 ], [0, 1, 2, 3]]
exp
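
The expected-value formula, (row sum * column sum) / table sum, can also be computed for the whole table at once with an outer product. The sketch below uses a small invented 2x2 table and cross-checks the result against `scipy.stats.contingency.expected_freq`:

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Invented observed counts (no margins included).
observed = np.array([[10, 20],
                     [30, 40]])

row_sums = observed.sum(axis=1)   # [30, 70]
col_sums = observed.sum(axis=0)   # [40, 60]

# Expected count for cell (i, j) = row_sums[i] * col_sums[j] / grand total.
expected_manual = np.outer(row_sums, col_sums) / observed.sum()

# SciPy computes the same table directly from the observed counts.
expected_scipy = expected_freq(observed)
print(expected_manual)
```

Working on a crosstab without margins avoids the need to slice the `Total` row and column off afterwards.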

#### Weather_code 10 has expected counts less than 5, so we will drop it.

In [None]:
bike_data['weather_code'].value_counts()

In [None]:
bike_data['season'].value_counts()

In [None]:
bike_data=bike_data[~(bike_data['weather_code']==10.0)]
bike_data['weather_code'].value_counts()

In [None]:
contigency_table = pd.crosstab(bike_data.weather_code,bike_data.season,margins=True,margins_name='Total')
contigency_table

#### Weather_code 26 has expected counts less than 5, so we will drop it.

In [None]:
bike_data=bike_data[~(bike_data['weather_code']==26.0)]
bike_data['weather_code'].value_counts()

In [None]:
contigency_table = pd.crosstab(bike_data.weather_code,bike_data.season,margins=True,margins_name='Total')
contigency_table

In [None]:
contigency_table = contigency_table.rename(columns = {'Total':'Row_total'})
contigency_table

In [None]:
n = contigency_table.at["Total", "Row_total"]
exp=contigency_table.copy()
for x in exp.index[0:-1]:
    for y in exp.columns[0:-1]:
        v= (((contigency_table.at[x, "Row_total"]) * (contigency_table.at["Total", y]))/n ).round(2)
        exp.at[x,y]=float(v)

exp = exp.iloc[[0, 1, 2, 3, 4 ], [0, 1,2,3]]
exp

#### No weather_code has expected counts less than 5, so we will continue with the Chi-square test

In [None]:
# bike_data['weather_code'] = bike_data['weather_code'].astype('category')
# bike_data['season'] = bike_data['season'].astype('category')

In [None]:
weather_code_season_dep = pd.crosstab(bike_data['weather_code'], bike_data['season'])
weather_code_season_dep

In [None]:
stat, p_value, dof, expected = stats.chi2_contingency(weather_code_season_dep)
stat, p_value, dof, expected
#stat, p, dof, expected

In [None]:
alpha = 0.05
if p_value >= alpha: 
    print('We fail to reject the Null Hypothesis Ho: weather_code and season appear to be independent')
else:
    print('We reject the Null Hypothesis Ho')

**p-value (2.4810049592886517e-83) < alpha (0.05), so we can reject H0.**
This means weather_code and season are not independent of each other; there is a significant dependency between them.

**We have enough evidence to reject the null hypothesis, so weather_code and season appear to be dependent on each other.**

# ANOVA to check if no. of cycles rented is similar or different in different weather_code and season

**Assumptions:**
- Normality – that each sample is taken from a normally distributed population
- Sample independence – that each sample has been drawn independently of the other samples
- Variance equality – that the variance of data in the different groups should be the same
- Your dependent variable – here, “cnt”, should be continuous – that is, measured on a scale which can be subdivided using increments

**1. weather_code**

**H0 (Null Hypothesis) :** cnt of bikes rented is the same in different types of weather_code

**Ha (Alternate Hypothesis) :** cnt of bikes rented is different in different types of weather_code

**alpha: 0.05**

In [None]:
# We will be working on bike_dcopy, which was created earlier as a deep copy of the original dataset.
# This is because we want to conclude on the basis of all the data, not the data left after removing the outliers.

In [None]:
bike_dcopy['weather_code'].value_counts()

**Checking assumptions of the test (Normality, Equal Variance)**

In [None]:
from scipy.stats import shapiro
def normality_check(series, alpha=0.05):
    _, p_value = shapiro(series)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
sns.histplot(bike_dcopy['cnt'].sample(5000), kde = True)

In [None]:
#Taking the log of the above distribution sample as it's not normal.
sns.histplot(np.log(bike_dcopy['cnt'].sample(5000)), kde = True)

In [None]:
# H0: Series is Normal
# Ha : Series is not Normal
# alpha  = 0.05
stats.shapiro(bike_dcopy['cnt'].sample(5000))

**Observations:** Even after taking the log, the distribution is not exactly normal, so the normality assumption does not hold. The Shapiro-Wilk test also confirms that the series is not normal. We will still go ahead with the test just to check the results.

In [None]:

# Removing weather_code types 10.0 and 26.0, as their variance is different from the others and would fail our assumptions
bike_dcopy=bike_dcopy[~(bike_dcopy['weather_code']==10.0) ] 
bike_dcopy=bike_dcopy[~(bike_dcopy['weather_code']==26.0) ]

In [None]:
bike_dcopy['weather_code'].value_counts()

#### Normality Test:
We will perform normality check using **Shapiro test.**

The hypothesis of this test are:
- Null Hypothesis Ho - series is normal
- Alternative Hypothesis Ha - series is not normal
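
As an illustration of how the Shapiro-Wilk test reacts to skew and to a log transformation, here is a sketch on seeded synthetic data. The lognormal sample is an assumption chosen because its log is exactly normal; it is not the bike data:

```python
import numpy as np
from scipy import stats

# Right-skewed synthetic sample (lognormal), invented for illustration.
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=5.0, sigma=1.0, size=500)

# Shapiro-Wilk on the raw values and on their logarithm.
stat_raw, p_raw = stats.shapiro(skewed)
stat_log, p_log = stats.shapiro(np.log(skewed))

# W is closer to 1 for data that looks more normal.
print(stat_raw, p_raw)
print(stat_log, p_log)
```

With large samples the test flags even small departures from normality, so the W statistic and a Q-Q plot are worth reading alongside the p-value.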

In [None]:
normality_check(bike_dcopy['cnt'].sample(1400))  # test the dependent variable cnt, not the categorical weather_code

#### Equality of Variance Test:
We will perform the equality-of-variance check using Levene's test.

The hypothesis of this test are:
- Null Hypothesis Ho - Variances are equal
- Alternative Hypothesis Ha - Variances are not equal
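
A minimal sketch of Levene's test on two seeded synthetic groups whose spreads differ by construction (the standard deviations 1 and 5 are invented for illustration):

```python
import numpy as np
from scipy import stats

# Two synthetic groups with the same mean but very different spread.
rng = np.random.default_rng(1)
narrow = rng.normal(loc=0, scale=1, size=500)
wide = rng.normal(loc=0, scale=5, size=500)

# center="median" is the variant that is more robust for skewed data.
stat, p_value = stats.levene(narrow, wide, center="median")
print(stat, p_value)
```

`stats.levene` accepts any number of groups, so all weather_code samples can be passed in a single call.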

In [None]:
bike_dcopy.groupby(['weather_code'])['cnt'].describe() # Variance is different for diff weather_code

In [None]:
from scipy.stats import levene
def variance_check(series1, series2, series3, series4, series5, alpha=0.05):
    # Pass all five groups to Levene's test (previously only the first three were used)
    _, p_value = levene(series1, series2, series3, series4, series5)
    print(f'p value = {p_value}')
    if p_value >= alpha:
        print('We fail to reject the Null Hypothesis Ho')
    else:
        print('We reject the Null Hypothesis Ho')

In [None]:
series1 =   bike_dcopy[bike_dcopy['weather_code'] == 1]['cnt'].sample(1400)
series2 =       bike_dcopy[bike_dcopy['weather_code'] == 2]['cnt'].sample(1400)
series3 =          bike_dcopy[bike_dcopy['weather_code'] == 3]['cnt'].sample(1400)
series4 =       bike_dcopy[bike_dcopy['weather_code'] == 4]['cnt'].sample(1400)
series5 =     bike_dcopy[bike_dcopy['weather_code'] == 7]['cnt'].sample(1400)

In [None]:
variance_check(series1, series2, series3,series4,series5)

In [None]:
# `shade` is deprecated in seaborn and expects a boolean; `fill=True` is the current equivalent
sns.kdeplot(series1, color='green', fill=True)
sns.kdeplot(series2, color='blue', fill=True)
sns.kdeplot(series3, color='red', fill=True)
sns.kdeplot(series4, color='yellow', fill=True)
sns.kdeplot(series5, color='orange', fill=True)
plt.show()

#### Although both our assumptions (normality and equal variance) failed, we will continue with one-way ANOVA just to check the result.

In [None]:
stat,p = stats.f_oneway(bike_dcopy[bike_dcopy['weather_code'] == 1]['cnt'].sample(1400),
                        bike_dcopy[bike_dcopy['weather_code'] == 2]['cnt'].sample(1400),
                        bike_dcopy[bike_dcopy['weather_code'] == 3]['cnt'].sample(1400),
                       bike_dcopy[bike_dcopy['weather_code'] == 4]['cnt'].sample(1400),
                       bike_dcopy[bike_dcopy['weather_code'] == 7]['cnt'].sample(1400))
stat,p

In [None]:
test, p_val= stats.levene(bike_dcopy[bike_dcopy['weather_code'] == 1]['cnt'].sample(1400),
                        bike_dcopy[bike_dcopy['weather_code'] == 2]['cnt'].sample(1400),
                        bike_dcopy[bike_dcopy['weather_code'] == 3]['cnt'].sample(1400),
                       bike_dcopy[bike_dcopy['weather_code'] == 4]['cnt'].sample(1400),
                       bike_dcopy[bike_dcopy['weather_code'] == 7]['cnt'].sample(1400))
test, p_val

**Conclusion : As the p value < alpha(0.05) , we reject H0 and thus we can conclude that cnt of bikes differs with a change in weather_code.**
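
Since the normality and equal-variance checks failed, one way to cross-check the ANOVA result is a rank-based test such as Kruskal-Wallis. This test is not used in the analysis above; the sketch below is an assumption about a reasonable follow-up and runs on seeded synthetic, right-skewed data rather than the bike data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed counts for three invented weather codes.
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "weather_code": np.repeat([1, 2, 3], 300),
    "cnt": np.concatenate([
        rng.exponential(1000, 300),
        rng.exponential(1000, 300),
        rng.exponential(400, 300),
    ]),
})

# One array of cnt values per weather_code, without hand-written filters.
groups = [g["cnt"].to_numpy() for _, g in toy.groupby("weather_code")]

h_stat, p_kw = stats.kruskal(*groups)       # rank-based, no normality assumption
f_stat, p_anova = stats.f_oneway(*groups)   # classical one-way ANOVA
print(p_kw, p_anova)
```

If both tests agree, the conclusion is less dependent on the violated ANOVA assumptions.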

**2. Seasons**

**H0 (Null Hypothesis) :** cnt of bikes rented is the same in different seasons

**Ha (Alternate Hypothesis) :** cnt of bikes rented is different in different seasons

**alpha: 0.05**

In [None]:
bike_dcopy['season'].value_counts()

In [None]:
bike_dcopy.groupby(['season'])['cnt'].describe() # The variance is different for all the 4 seasons

In [None]:
stat,p = stats.f_oneway(bike_dcopy[bike_dcopy['season'] == 1]['cnt'].sample(4000),
                        bike_dcopy[bike_dcopy['season'] == 2]['cnt'].sample(4000),
                        bike_dcopy[bike_dcopy['season'] == 3]['cnt'].sample(4000),
                        bike_dcopy[bike_dcopy['season'] == 0]['cnt'].sample(4000))
stat,p

In [None]:
test, p_val= stats.levene(bike_dcopy[bike_dcopy['season'] == 1]['cnt'].sample(4000),
                        bike_dcopy[bike_dcopy['season'] == 2]['cnt'].sample(4000),
                        bike_dcopy[bike_dcopy['season'] == 3]['cnt'].sample(4000),
                        bike_dcopy[bike_dcopy['season'] == 0]['cnt'].sample(4000))
test, p_val

**Conclusion : As the p value < alpha(0.05) , we reject H0 and thus we can conclude that cnt of bikes differs with a change in season.**

# Insights, Conclusions, Inferences and Recommendations:

- There are 4 categorical features (season, is_holiday, is_weekend, weather_code), 5 numerical/continuous features (cnt, t1, t2, hum, wind_speed) and 1 datetime column, i.e. 10 columns in total; the number of rows is the one reported by `bike_data.shape`.
- No missing (null) values and no duplicated rows are present.
- The data spans roughly 2 years (the first and last timestamps are shown by the sorted `timestamp` column), so the rental counts cover that period.
- During September and October, the maximum number of bikes are rented.
- The cnt is lower in the cold winter months (Nov, Dec, Jan, Feb), when people mostly prefer not to ride the bikes.
- The days of the month on which high-demand hours (top quartile of cnt) occur are listed by the `Day` cell in the time-series EDA.
- As we can see from the month-wise bar plot, the demand for bikes in the early months is quite low compared with the months from March onwards. There's a drop in the middle owing to the cold winter season.
- There are outliers in wind speed and casual users, which tells us that the wind speed is not uniform, whereas the casual user cnt varies because these users are not registered and ride less regularly.
- The exponential-decay-shaped curve for cnt tells us that higher rental counts occur less and less frequently.
- For weather_code, categories 10 and 26 (rain with thunderstorm, snowfall) have very few observations, so it is better to drop these categories in further tests.
- The cnt for bikes rented on working days is much higher than on non-working days.
- During holidays, people don't prefer to ride bikes.
- When the weather is clear or has few clouds, people tend to rent more bikes for their commute.
- Across spring, summer, fall and winter, the cnt of users renting bikes is more or less equal.
- The registered user cnt has a higher correlation with the cnt than the casual user cnt.
- Wind speed and season have a very low (near zero) positive correlation with the cnt, which means they have little linear effect on the demand for bikes.
- Temperature and the "feels like" temperature have a moderate correlation (0.3) with the cnt. People tend to go out on bright sunny days when the temperature is normal, whereas in harsh conditions such as too hot or too cold, the demand for bikes dips considerably.
- Casual users who rent bikes like to ride when the temperature is suitable.
- On holidays, the user cnt dips considerably, whereas on working days the cnt is normal.
- **2 sample t-test:**
    - The distribution of the samples is right-skewed and not normal, which violates the normality assumption of the two-sample t-test. The variances of the samples are also unequal, hence the log transformation.
    - On the untransformed data the p-value was well above 0.05 (0.99 in the run reported above), so we fail to reject the null hypothesis against the one-sided alternative tested.
    - After taking the log of the samples, we get a near-normal distribution with very similar variances, so we can calculate the p-value and test statistic.
    - Conclusion: as the p-value > alpha (0.05), we fail to reject H0, so the test gives no evidence of a difference in the cnt of bikes rented between working and non-working days in the direction tested.
- **Chi-Square test:**
    - p-value (2.4810049592886517e-83, as reported above) < alpha (0.05), so we can reject H0, which means weather_code and season are not independent of each other.
    - We have enough evidence to reject the null hypothesis, so weather_code and season appear to be dependent on each other.
- **One-way ANOVA:**
    - Even after taking the log, the distribution is not exactly normal, so the normality assumption does not hold. The Shapiro-Wilk test also confirms that the series is not normal. We still went ahead with the test just to check the results.
    - As the p-value < alpha (0.05), we reject H0 and conclude that the cnt of bikes differs with a change in weather_code.
    - As the p-value < alpha (0.05), we reject H0 and conclude that the cnt of bikes differs with a change in season.

**To conclude, the major factors affecting the count of bikes rented are season and weather_code. Working versus non-working days was not found to be a significant factor in predicting the future of the rental business. At the same time, the business team should focus on the months other than winter when increasing the number of bike parking zones, since during the winter months (Nov, Dec, Jan, Feb) there is a considerable dip in the cnt. The team can use these months for some other purpose, such as renting electric cars, which can be a more comfortable means of commuting in the cold.**
