## Significance and p-values: a summary using the advertising dataset
- null, alternative hypothesis
- p value
- resources

> **Null Hypothesis ($H_0$)**: There is no relationship between the amount spent on TV advertisement and sales figures

> **Alternative Hypothesis ($H_a$):** There is some relationship between the amount spent on TV advertisement and sales figures

When performing statistical analyses, rejecting or not rejecting a null hypothesis always goes along with an associated **significance level** or **p-value**.

> The p-value is the **probability of observing your results (or something more extreme) given that the null hypothesis is true**
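
The definition above can be made concrete with a permutation test, which is a different technique from the t-test that statsmodels reports further down but follows the same logic. The sketch below is only an illustration: the data are synthetic (not the advertising dataset), and the helper names `slope` and `permutation_p_value` are made up for this example. Shuffling the sales values breaks any link to the spend values, so the shuffled slopes show what the null hypothesis would produce.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffled datasets whose slope is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    y_shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(y_shuffled)  # simulate "no relationship" between x and y
        if abs(slope(x, y_shuffled)) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate from being exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Synthetic data (an assumption for illustration only).
rng = random.Random(42)
spend = [rng.uniform(0, 300) for _ in range(50)]
related = [7 + 0.05 * s + rng.gauss(0, 3) for s in spend]  # sales depend on spend
unrelated = [rng.gauss(14, 5) for _ in spend]              # sales ignore spend

p_related = permutation_p_value(spend, related)
p_unrelated = permutation_p_value(spend, unrelated)
print(f"related:   p = {p_related:.4f}")
print(f"unrelated: p = {p_unrelated:.4f}")
```

When sales really depend on spend, almost no shuffled dataset reproduces a slope as large as the observed one, so the p-value is tiny. When they do not, the observed slope looks like any other shuffled slope and the p-value can land anywhere between 0 and 1.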
 
Applied to a regression model, the p-value associated with each coefficient estimate is the probability of observing an estimate at least as far from zero as the one obtained, given that the null hypothesis (a true coefficient of zero) holds. As a result, a very small p-value indicates that the coefficient is **statistically significant**. A commonly used cut-off for the p-value is 0.05. If your p-value is smaller than 0.05, you would say:

> The parameter is statistically significant at $\alpha$ level 0.05. 

Rejecting the null hypothesis at an alpha level of 0.05 is equivalent to having a 95% confidence interval around the coefficient that does not include zero. In short:

> A small p-value means that an estimate this far from zero would be very unlikely if the true coefficient were zero. It is not the probability that the coefficient is actually zero.

Let's import the code from the previous lesson and have a look at the p-values.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
plt.style.use('fivethirtyeight')

data = pd.read_csv('data/advertising.csv', index_col=0)
f = 'Sales~TV'
f2 = 'Sales~Radio'
model = smf.ols(formula=f, data=data).fit()

In [6]:
model.pvalues

Intercept    1.406300e-35
TV           1.467390e-42
dtype: float64

In [7]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.612 |
| Model: | OLS | Adj. R-squared: | 0.61 |
| Method: | Least Squares | F-statistic: | 312.1 |
| Date: | Wed, 19 Aug 2020 | Prob (F-statistic): | 1.47e-42 |
| Time: | 22:39:34 | Log-Likelihood: | -519.05 |
| No. Observations: | 200 | AIC: | 1042.0 |
| Df Residuals: | 198 | BIC: | 1049.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.0326 | 0.458 | 15.360 | 0.000 | 6.130 | 7.935 |
| TV | 0.0475 | 0.003 | 17.668 | 0.000 | 0.042 | 0.053 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.531 | Durbin-Watson: | 1.935 |
| Prob(Omnibus): | 0.767 | Jarque-Bera (JB): | 0.669 |
| Skew: | -0.089 | Prob(JB): | 0.716 |
| Kurtosis: | 2.779 | Cond. No. | 338.0 |
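
The summary shows `P>|t|` as 0.000 for TV only because it rounds to three decimals. As a rough check, the p-value can be recomputed from the reported t-statistic and the residual degrees of freedom. This is a minimal sketch that assumes a two-sided t-test and uses the rounded numbers printed in the summary (t = 17.668, Df Residuals = 198), so it can only approximate `model.pvalues`.

```python
from scipy import stats

# Values copied from the summary output above (rounded as printed there).
t_stat = 17.668
df_resid = 198

# Two-sided p-value: probability of a t-statistic at least this far from
# zero, in either direction, if the true TV coefficient were zero.
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
print(f"p-value for TV: {p_value:.3e}")
```

The result should be on the order of 1e-42, in line with the TV entry of `model.pvalues`.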


In [8]:
model = smf.ols(formula=f2, data=data).fit()
model.pvalues

Intercept    3.561071e-39
Radio        4.354966e-19
dtype: float64

In [9]:
# The alpha level of 0.05 is just a convention.
# Alpha levels of 0.1 and 0.01 are also common.
# Note that the confidence intervals change as the alpha level changes (to 90%
# and 99% confidence intervals), but the standard output of statsmodels is a 95% confidence interval.
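
To see how the interval moves with the alpha level, the sketch below rebuilds the TV confidence interval from numbers printed in the summary. It assumes the usual t-based interval (coefficient plus or minus a critical t-value times the standard error) and recovers the standard error as `coef / t`, because the summary rounds it to 0.003. The function name `confidence_interval` is made up for this example.

```python
from scipy import stats

# Values copied from the TV row of the summary (rounded as printed there).
coef = 0.0475
t_stat = 17.668
df_resid = 198
std_err = coef / t_stat  # more digits than the rounded 0.003 in the table


def confidence_interval(alpha):
    """t-based (1 - alpha) confidence interval for the TV coefficient."""
    t_crit = stats.t.ppf(1 - alpha / 2, df_resid)
    margin = t_crit * std_err
    return coef - margin, coef + margin


intervals = {alpha: confidence_interval(alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, (low, high) in intervals.items():
    print(f"{100 * (1 - alpha):.0f}% CI: [{low:.4f}, {high:.4f}]")
```

The 95% interval should round to the `[0.042, 0.053]` shown in the summary, and the interval widens as alpha shrinks. None of the three intervals includes zero, which matches the very small p-value. With the fitted model from the cells above, `model.conf_int(alpha=0.10)` should return the same kind of interval directly.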

#### Resources

##### My favourite YouTube channel for statistics and machine learning for beginners; the link to the channel is below
- StatQuest. “R-squared explained”, YouTube, Joshua Starmer, https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw

##### Almost all of the content above is summarized from our graduate training with Flatiron School
- Flatiron GitHub repository for this lesson: https://github.com/learn-co-students/dsc-enterprise-hsbc-significance-p-value-hsbc-ds-081319