# Data Analysis Tools - Assignments 1/2/3 ([Coursera](https://www.coursera.org/learn/data-analysis-tools))

**This notebook contains the code for the assignments of the Data Analysis Tools course.**

## Topics Covered in this Notebook
**ANOVA and Post Hoc Tukey HSD Test**<br>
**Chi Square Test of Independence with Bonferroni Adjustment**<br>
**Pearson Correlation**<br>
**LASSO based Feature Selection using Lasso Path**

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
sns.set_palette("colorblind")  # set_palette applies the palette; color_palette only returns it
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 100

import gc

from itertools import combinations

import scipy.stats
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
from statsmodels.stats.multitest import multipletests

from sklearn import linear_model
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LassoLarsCV

In [None]:
#-------Read Data-------
df = pd.read_csv('/kaggle/input/bike-sharing-demand/train.csv',parse_dates=['datetime'])
df = df.rename({'count':'count_of_rentals'},axis=1)
df['month'] = df['datetime'].dt.month
df['day_of_week'] = df['datetime'].dt.dayofweek
df['hour'] = df['datetime'].dt.hour
df.head()

In [None]:
#--------Continuous to Continuous----------

sns.regplot(x="temp", y="atemp", fit_reg=True, data=df)
plt.show()

# temp and atemp show a very strong positive linear relationship

In [None]:
df['season'] = df['season'].map({  1:'spring', 2:'summer', 3:'fall', 4:'winter' })
df["season"].value_counts()

In [None]:
df['weather'] = df['weather'].map({
    1: 'Clear, Few clouds, Partly cloudy, Partly cloudy',
    2: 'Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist',
    3: 'Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds',
    4: 'Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog'
})

**Let's consider the categorical explanatory variable 'season' and the quantitative response variable 'count_of_rentals'.**

In [None]:
#-----Boxplot-----
#---Categorical to Quantitative---
sns.boxplot(x='season', y='count_of_rentals', data=df)
plt.show()

In [None]:
fig, axs = plt.subplots(nrows=2,ncols=2)

sns.boxplot(x='hour', y='temp', \
            data=df.loc[df["season"]=="spring",:],ax=axs[0][0])
sns.boxplot(x='hour', y='temp', \
            data=df.loc[df["season"]=="summer",:],ax=axs[0][1])
sns.boxplot(x='hour', y='temp', \
            data=df.loc[df["season"]=="fall",:],ax=axs[1][0])
sns.boxplot(x='hour', y='temp', \
            data=df.loc[df["season"]=="winter",:],ax=axs[1][1])
plt.show()

In [None]:
fig, axs = plt.subplots(nrows=2,ncols=1)

sns.boxplot(x='hour', y='count_of_rentals', \
            data=df.loc[df["holiday"]==0,:],ax=axs[0])
sns.boxplot(x='hour', y='count_of_rentals', \
            data=df.loc[df["holiday"]==1,:],ax=axs[1])

plt.show()

In [None]:
print('---Mean and Standard Deviation of Rental Counts by Season---')
df.groupby('season')['count_of_rentals'].agg(['mean','std'])

In [None]:
#--------------Crosstab------------------
#---------Categorical to Categorical------
pd.crosstab(df['month'], df['weather'])

In [None]:
#--------Crosstab(Proportions)-----------
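#---------Categorical to Categorical------
# normalize='columns' divides each weather column by its total,
# giving the share of each month within a weather type
pd.crosstab(df['month'], df['weather'], normalize='columns')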

In [None]:
df['season'] = df['season'].astype('category')
season_map = dict(enumerate(df['season'].cat.categories))
df['season'] = df['season'].cat.codes

In [None]:
season_map

# ANOVA

An ANOVA test is used to infer whether there is a relationship/dependency between a categorical explanatory variable and a continuous response variable.

Its F statistic is the ratio of the variance between the group means (between-group mean square) to the pooled variance within the groups (within-group mean square).
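
To make the ratio concrete, here is a minimal sketch that computes the F statistic by hand and compares it with `scipy.stats.f_oneway`. The three groups are hypothetical toy values, not the bike sharing data.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: a quantitative response observed in three groups
groups = [
    np.array([12.0, 15.0, 11.0, 14.0]),
    np.array([22.0, 25.0, 21.0, 24.0]),
    np.array([17.0, 16.0, 19.0, 18.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_manual = ms_between / ms_within

# Reference value from SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_scipy:.4g}")
```

A large F means the group means are spread out far more than the noise within the groups would explain.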

In [None]:
#------Fit an Ordinary Least Square------
print('----Fit an OLS Regression----')
model = smf.ols(formula='count_of_rentals ~ C(season)', data=df).fit()
print (model.summary())

### ANOVA test Inference:<br>
Since the p-value of the test is 6.16e-149 (approximately 0), we reject the null hypothesis and infer that the average count of rentals differs for at least one season.

# Tukey Post Hoc Test

A significant ANOVA test doesn't tell us which groups differ from each other.<br>
We need a post hoc test to tell which pairs of groups have a statistically significant difference in means.
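
As a small illustration of pairwise post hoc comparisons, the sketch below uses `scipy.stats.tukey_hsd` (available in recent SciPy releases) rather than the statsmodels `MultiComparison` class used in this notebook. The groups are hypothetical toy values.

```python
from scipy import stats

# Hypothetical toy data: groups a and c are similar, group b is clearly higher
group_a = [12.0, 15.0, 11.0, 14.0]
group_b = [22.0, 25.0, 21.0, 24.0]
group_c = [13.0, 14.0, 12.0, 15.0]

result = stats.tukey_hsd(group_a, group_b, group_c)

labels = ["a", "b", "c"]
alpha = 0.05
# result.pvalue[i, j] holds the adjusted p-value for the pair (i, j)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        p = result.pvalue[i, j]
        print(f"{labels[i]} vs {labels[j]}: p = {p:.4g}, reject = {p < alpha}")
```

Each pair gets its own decision, which is the information the overall ANOVA F test does not provide.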

In [None]:
print('----Tukey Test Post Hoc----')
mc = multi.MultiComparison(df['count_of_rentals'], df['season'])
res = mc.tukeyhsd()
print(res.summary())

### Tukey Test Inference:<br>
On average, the count of rentals differs for every pair of seasons **(reject=True)**.

## Inference
The mean count of rentals differs across seasons: it is clearly higher in summer and fall than in winter and spring.

# Chi Square Test of Independence
A Chi square test helps determine if there is any significant relationship between two categorical variables.
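
The statistic compares the observed counts with the counts expected if the two variables were independent. A minimal sketch on a hypothetical 2x3 contingency table (not the month/weather data), checked against `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts: 2 row categories x 3 column categories
observed = np.array([[30, 10, 20],
                     [20, 25, 15]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts under independence: (row total * column total) / grand total
expected = row_totals * col_totals / total
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# SciPy returns the statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected_scipy = stats.chi2_contingency(observed)
print(f"manual chi2 = {chi2_manual:.4f}, scipy chi2 = {chi2:.4f}, dof = {dof}, p = {p_value:.4g}")
```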

In [None]:
print ('chi-square value, p value, degrees of freedom, expected counts',end=":\n")
cs1= scipy.stats.chi2_contingency(pd.crosstab(df['weather'], df['month']))
print (cs1)

### Chi Square Test Inference
The p-value of the test is 1.0505381541371918e-34 (approximately 0), so we reject the null hypothesis and infer that there is some relationship between month of the year and weather.

# Bonferroni Adjustment for Chi Square(Post-Hoc)
Since we compare multiple pairs of categories (pairs of months against weather), the chance of rejecting a null hypothesis when it is actually true, that is, of committing a **Type-1 Error**, grows with the number of comparisons. The overall test also does not tell us which pairs of months actually differ; the significant result could come from a single pair. A way to handle both is the Bonferroni **post hoc adjustment**: we compute the Chi square test for each pair of months and base our inference on a<br>**new significance level = $\alpha$/number of comparisons** (where $\alpha$ is most often **0.05**).

Inspired by [https://neuhofmo.github.io/chi-square-and-post-hoc-in-python/](https://neuhofmo.github.io/chi-square-and-post-hoc-in-python/)
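
The adjustment itself is simple arithmetic. The sketch below applies it to hypothetical raw p-values (not results from this notebook) in two equivalent ways: by lowering the significance level, or by inflating the p-values.

```python
import numpy as np

# Hypothetical raw p-values from four pairwise tests
p_values = np.array([0.001, 0.012, 0.030, 0.200])
alpha = 0.05
m = p_values.size  # number of comparisons

# Option 1: compare each raw p-value with the stricter level alpha / m
adjusted_alpha = alpha / m
reject_by_threshold = p_values < adjusted_alpha

# Option 2: multiply each p-value by m (capped at 1) and compare with alpha
corrected_p = np.minimum(p_values * m, 1.0)
reject_by_corrected_p = corrected_p < alpha

print("adjusted alpha:", adjusted_alpha)
print("corrected p-values:", corrected_p)
print("reject null:", reject_by_threshold)
```

The second form matches the corrected p-values that `multipletests(..., method='bonferroni')` reports.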

In [None]:
#-------------------------Post Hoc Test Chi Square--------------------

# Store p-values of each pair of month
p_vals_chi = []
pairs_of_months = list(combinations(df['month'].unique(),2))

# For each pair of months, run a chi-square test on the weather-by-month table
for each_pair in pairs_of_months:
    each_df = df[df['month'].isin(each_pair)]
    contingency = pd.crosstab(each_df['weather'], each_df['month'])
    # Index 1 of the result is the p-value
    p_vals_chi.append(scipy.stats.chi2_contingency(contingency)[1])
    
gc.collect()

In [None]:
#Results of Bonferroni Adjustment
bonferroni_results = pd.DataFrame(columns=['pair of months',\
                                           'original p value',\
                                           'corrected p value',\
                                           'Reject Null?'])

bonferroni_results['pair of months'] = pairs_of_months
bonferroni_results['original p value'] = p_vals_chi

#Perform Bonferroni on the p-values and get the reject/fail to reject Null Hypothesis result.
multi_test_results_bonferroni = multipletests(p_vals_chi, method='bonferroni')

bonferroni_results['corrected p value'] = multi_test_results_bonferroni[1]
bonferroni_results['Reject Null?'] = multi_test_results_bonferroni[0]

bonferroni_results.head()

In [None]:
n_rejected = int(bonferroni_results['Reject Null?'].sum())
n_not_rejected = len(bonferroni_results) - n_rejected

print(f"{n_rejected} pairs of months have a significant relationship w.r.t. Weather.")
print(f"{n_not_rejected} pairs of months do not have a significant relationship w.r.t. Weather.")

### Bonferroni Adjusted Chi Square Test Inference

For nearly half of the pairs the null hypothesis is rejected, and for a similar number of pairs we fail to reject it. Two examples illustrate this.<br> The month pair (1,2), i.e. **January and February**, does not vary much in terms of weather, and the null hypothesis is not rejected.<br> For the months (1,5), **January and May**, the null hypothesis is rejected, so we can infer that they do vary in terms of weather (possibly **Light Snow/Rain** is more frequent in May than in January).

# Pearson Correlation
Pearson correlation quantifies the strength and direction of the linear association between two quantitative variables.
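
As a quick illustration, the coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on hypothetical toy values and compares it with `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

# Hypothetical toy values for two quantitative variables
x = np.array([9.8, 12.3, 14.8, 18.0, 22.1, 25.4])
y = np.array([12.9, 15.2, 17.4, 21.2, 26.5, 29.5])

# Center both variables, then divide the co-variation by the product of the spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

r_scipy, p_value = stats.pearsonr(x, y)
print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.4g}")
```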

In [None]:
print ('association between temp and atemp')
print (scipy.stats.pearsonr(df['temp'], df['atemp']))

### Pearson Correlation Inference
A Pearson correlation coefficient of 0.98 and a p-value of approximately zero indicate a very strong positive linear association between temp and atemp.

# Feature Selection-LASSO

We select the most important features using LASSO: its L1 penalty shrinks the coefficients of unimportant predictors to exactly zero.
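
To see why coefficients land exactly on zero, consider the special case of an orthonormal design (an assumption made here for illustration only; it does not hold for the bike sharing features). In that case the LASSO solution for the objective ½‖y − Xb‖² + α‖b‖₁ is the soft-thresholded least squares coefficient. The coefficient values below are hypothetical.

```python
import numpy as np

def soft_threshold(coefs, alpha):
    """Shrink each coefficient toward zero by alpha; anything within alpha of zero becomes zero."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)

# Hypothetical least squares coefficients for four predictors
ols_coefs = np.array([3.0, -0.4, 0.1, -2.0])
alpha = 0.5  # penalty strength

shrunk = soft_threshold(ols_coefs, alpha)
for before, after in zip(ols_coefs, shrunk):
    print(f"{before:+.2f} -> {after:+.2f}")
```

Increasing `alpha` zeroes out more coefficients, which is the behaviour a lasso path plot visualizes.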

In [None]:
df = df.drop('datetime',axis=1)
df.columns

In [None]:
features = ['temp', 'atemp', 'humidity', 'windspeed','season', 'holiday','workingday', \
            'weather','month', 'day_of_week', 'hour']

#--------------Standard Scale Numerical Features----------------
numeric_features = ['temp', 'atemp', 'humidity', 'windspeed']
numeric_transformer = StandardScaler()
df[numeric_features] = numeric_transformer.fit_transform(df[numeric_features])

#--------------Dummify Categorical Features----------------------
df = pd.get_dummies(df,\
                   columns = ['season', 'holiday', 'workingday', \
            'weather','month', 'day_of_week', 'hour'])
df.head()

In [None]:
#-----------Train a Lasso Least Angle Regression, Cross Validated on 10 folds---------------
X,y = df.loc[:,~df.columns.isin(["casual","registered","count_of_rentals"])], \
    df["count_of_rentals"]

model=LassoLarsCV(cv=10, precompute=False).fit(X,y)
model_coefficients = pd.DataFrame.from_dict(dict(zip(X.columns, model.coef_)),orient="index",
                                           columns=["coefficient"])
model_coefficients.head()

In [None]:
#----------Select the Model Coeffs with Nonzero value-------------

model_coefficients = model_coefficients.loc[model_coefficients['coefficient']!=0.0]
model_coefficients.head()

In [None]:
#--------How the coefficients of the predictors change with changing alphas--------

m_log_alphas = -np.log10(model.alphas_)

lasso_path_df = pd.DataFrame(model.coef_path_.T,\
                             columns=X.columns,\
                            index=m_log_alphas)
lasso_path_df = lasso_path_df[model_coefficients.index.tolist()]
lasso_path_df.head()

In [None]:
plt.plot(m_log_alphas,lasso_path_df.to_numpy())
plt.legend(lasso_path_df.columns)
plt.show()

In [None]:
plt.plot(m_log_alphas,lasso_path_df.to_numpy())
#plt.legend(lasso_path_df.columns)
plt.show()

### LASSO Inference
**"temp"**, **"atemp"** and **"humidity"** stand out when it comes to explaining the number of rentals booked (the target). LASSO has shrunk to zero the weights of predictors (like **"windspeed"**) that have little or no effect on explaining the number of rentals.

# Test of Moderation

In [None]:
df = pd.read_csv('/kaggle/input/bike-sharing-demand/train.csv',parse_dates=['datetime'])
df['month'] = df['datetime'].dt.month
df['day_of_week'] = df['datetime'].dt.dayofweek
df['hour'] = df['datetime'].dt.hour
df = df.rename({'count':'count_of_rentals'},axis=1)
gc.collect()

In [None]:
holidays,non_holidays=df.loc[df["holiday"]==1,:],\
                      df.loc[df["holiday"]==0,:]

print("Holidays Statistics")
print(holidays.groupby("hour").agg({"count_of_rentals":["mean","median"]}))

print("Non Holidays Statistics")
print(non_holidays.groupby("hour").agg({"count_of_rentals":["mean","median"]}))

In [None]:
holidays.groupby("hour").agg({"count_of_rentals":["mean","median"]}).plot(kind="bar")

In [None]:
non_holidays.groupby("hour").agg({"count_of_rentals":["mean","median"]}).plot(kind="bar")

In [None]:
#------Fit an Ordinary Least Square------
print('----Fit an OLS Regression----')
model = smf.ols(formula='count_of_rentals ~ C(hour)', data=holidays).fit()
print (model.summary())

In [None]:
#------Fit an Ordinary Least Square------
print('----Fit an OLS Regression----')
model = smf.ols(formula='count_of_rentals ~ C(hour)', data=non_holidays).fit()
print (model.summary())

# Inference - Test of Moderation
As we see, hour of the day plays a crucial role in determining the count of rentals. However, the hourly pattern differs between holidays and non-holidays: on holidays, rentals around **13:00 and 14:00** exceed those in morning hours such as **8:00**. Holiday therefore acts as a **moderator** of the relationship between hour of day and count of rentals.
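
The idea of moderation can be sketched on synthetic data: fit the same relationship separately for each level of the moderator and compare the fitted effects. Everything below (the variables, slopes and noise level) is a made-up illustration, not the bike sharing data, and a formal test would usually add an interaction term to a single model instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic explanatory variable and a binary moderator
x = rng.uniform(0, 10, size=n)
moderator = rng.integers(0, 2, size=n)

# By construction, the effect of x on y depends on the moderator level
true_slope = np.where(moderator == 0, 2.0, -1.0)
y = true_slope * x + rng.normal(0, 0.5, size=n)

# Fit a straight line within each moderator level and compare the slopes
slopes = {}
for level in (0, 1):
    mask = moderator == level
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    slopes[level] = slope
    print(f"moderator = {level}: fitted slope = {slope:.3f}")
```

Clearly different slopes across the levels are what a moderating variable looks like in a subgroup analysis.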

# To Be Done (Work in Progress):
* Ridge/Elastic Net
* Mutual Information

### Feedback Appreciated