## <a id='toc1_1_'></a>[Hypothesis Tests with Formulas, Code, Assumptions](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Hypothesis Tests with Formulas, Code, Assumptions](#toc1_1_)    
  - [**One-Sample T-Test**](#toc1_2_)    
  - [**Two-Sample T-Test (Unpaired)**](#toc1_3_)    
  - [**Two-Sample T-Test (Paired)**](#toc1_4_)    
  - [**Proportion Z Test**](#toc1_5_)    
  - [**Correlation Coefficient T-Test**](#toc1_6_)    
  - [**Linear Regression T-Test**](#toc1_7_)    
  - [**Logistic Regression Wald's Test**](#toc1_8_)    
  - [**ANOVA**](#toc1_9_)    
  - [**Tukey HSD Test**](#toc1_10_)    
  - [**Chi-Squared Test**](#toc1_11_)    
  - [**Bootstrapping**](#toc1_12_)    
  - [**Spearman Coefficient T-Test**](#toc1_13_)    
  - [**Point Biserial T-Test**](#toc1_14_)    
- [Two Tailed vs One Tailed Tests](#toc2_)    
  - [**One-Tailed Hypothesis Test**](#toc2_1_)    
  - [**Two-Tailed Hypothesis Test**](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_2_'></a>[**One-Sample T-Test**](#toc0_)
   - **Use Case:** This test is used when you want to compare the mean of a sample to a known value or a theoretical expectation.
   - **Examples:** Testing if the average height of a certain type of plant is different from 10 cm.
   - **Hypothesis Testing:**  
     - $$H_0: \mu = \mu_{test}$$
     - $$H_1: \mu \neq \mu_{test}$$
   - **Formula:**
     - $ t = \frac{\overline{x} - \mu_{test}}{\left(s\,\big/ \sqrt{n}\right)}$
   - **Code:**
     ```python
     from scipy import stats

     # first argument is the sample array, second argument is the hypothesized mean
     one_sample_test = stats.ttest_1samp(store1, 14.5)
     # outputs the t statistic, a p-value, and the degrees of freedom:
     # TtestResult(statistic=7.203710690696487, pvalue=1.1660253276676903e-10, df=99)
     ```
<div style="background-color: #e0e0e0; padding: 10px; border-radius: 10px; border: 1px solid #333;">

- **Assumptions:**  
  1. **IID** - The samples are **independent** and from the **same population**. 
  2. **Randomness** - The samples are **random** and **representative** of the underlying population. 
       - The sample mean and variance are then reasonable estimates of the population mean and variance.
  3.  **Normality** - The sampling distribution of the mean is approximately normal. At least one of the following conditions should hold:
       - Sample itself is normally distributed
       - Sample size is large enough (n > 30), so the Central Limit Theorem applies
       - It passes some other test for normality

</div>
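
To make the formula concrete, the sketch below computes the t statistic by hand and compares it with `stats.ttest_1samp`. The `heights` array and `mu_test` value are simulated and hypothetical; they are not the `store1` data used above.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 30 simulated plant heights in cm
rng = np.random.default_rng(0)
heights = rng.normal(loc=10.5, scale=1.2, size=30)
mu_test = 10.0  # hypothesized population mean

# t = (x_bar - mu_test) / (s / sqrt(n)), with s the sample standard deviation (ddof=1)
n = len(heights)
t_manual = (heights.mean() - mu_test) / (heights.std(ddof=1) / np.sqrt(n))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# scipy should reproduce both numbers
result = stats.ttest_1samp(heights, mu_test)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```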





## <a id='toc1_3_'></a>[**Two-Sample T-Test (Unpaired)**](#toc0_)
   - **Use Case:** This test is used when you want to compare the means of two independent groups to see if they are significantly different from each other.
   - **Examples:** Testing if the average height of plants grown with two different types of fertilizer is different.
   - **Hypothesis:**  
     - $$H0: \mu_1 -  \mu_2 = 0$$
     - $$H1: \mu_1 -  \mu_2 \neq 0$$
   - **Formula:**
     - $ t = \frac{(\overline{X}_1 - \overline{X}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$
     - Where the $\overline{X}_1$ and $\overline{X}_2$ are the sample means, $(\mu_1 - \mu_2)$ is the population difference you want to test against and $s_p$ is the [pooled standard deviation](https://en.wikipedia.org/wiki/Pooled_variance) of the samples.
     - $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
   - **Code:**
     ```python
     # first argument is the first sample array, second argument is the sample array to compare it against
     # (equal_var=True is the default, which matches the pooled formula above)
     two_sample_test = stats.ttest_ind(store_suburbs, store_downtown)
     # outputs the t statistic and a p-value:
     # Ttest_indResult(statistic=-2.3843100564172697, pvalue=0.01805557370062323)
     ```
   - **Assumptions:**
     - Data points in each group are independent of each other.
     - The means of both samples come from a normal sampling distribution.
       - Sample itself is normally distributed
       - Sample size is large enough (n > 30)
     - Data in group A are independent from data in group B.
     - Variances of both populations are identical.
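
As a rough check of the pooled formula, the sketch below computes $s_p$ and the t statistic by hand on simulated data (the `group_a` and `group_b` arrays are hypothetical) and compares them with `stats.ttest_ind`. It also shows Welch's variant (`equal_var=False`), which is a common fallback when the equal-variance assumption looks doubtful.

```python
import numpy as np
from scipy import stats

# Hypothetical data: plant heights under two fertilizers
rng = np.random.default_rng(1)
group_a = rng.normal(20.0, 3.0, size=40)
group_b = rng.normal(21.5, 3.0, size=35)

n1, n2 = len(group_a), len(group_b)
s1_sq, s2_sq = group_a.var(ddof=1), group_b.var(ddof=1)

# pooled standard deviation, then the t statistic for H0: mu_1 - mu_2 = 0
s_p = np.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
t_manual = (group_a.mean() - group_b.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))

pooled = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # does not assume equal variances
print(t_manual, pooled.statistic, welch.statistic)
```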





## <a id='toc1_4_'></a>[**Two-Sample T-Test (Paired)**](#toc0_)
   - **Use Case:** This test is used when you want to compare the means of the same group at two different times (for example, before and after a treatment).
   - **Examples:** Testing the average spend of the same customers at two different stores.
   - **Hypothesis:**  
     - $$H0: \mu_{test1} - \mu_{test2} = 0$$
     - $$H1: \mu_{test1} - \mu_{test2} \neq 0$$
   - **Formula:**
     - $ t = \frac{\overline{X_D} - \mu_{test}}{\frac{s_D}{\sqrt{n}}} $
     - Where $X_D$ is the set of differences between paired samples, $\mu_{test}$ is the mean difference that we want to compare to (0 in this case), $s_D$ is the sample standard deviation of the differences, and $n$ is the number of pairs.
   - **Code:**
     ```python
     # first argument is the first measurement array, second argument is the paired array (same subjects, same order)
     stats.ttest_rel(store_suburbs, store_downtown)
     # outputs the t statistic, a p-value, and the degrees of freedom:
     # TtestResult(statistic=-2.4153163510165245, pvalue=0.017556013707886925, df=99)
     ```
   - **Assumptions:**
     - The pairs are independent of each other.
     - Data points in group A and B are paired/matched.
     - The mean of the sample differences comes from a normal sampling distribution.
       - The differences themselves are normally distributed
       - Sample size is large enough (n > 30)
     - Equal variances are not required, because the test works on the within-pair differences.
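
The paired test is a one-sample t-test on the within-pair differences. The sketch below illustrates this on simulated data; the `before` and `after` arrays are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data: spend of the same 25 customers at two points in time
rng = np.random.default_rng(2)
before = rng.normal(50.0, 8.0, size=25)
after = before + rng.normal(2.0, 4.0, size=25)

# differences between paired observations
diffs = before - after
n = len(diffs)

# t = (mean of differences - 0) / (s_D / sqrt(n))
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

paired = stats.ttest_rel(before, after)
one_sample = stats.ttest_1samp(diffs, 0.0)
print(t_manual, paired.statistic, one_sample.statistic)
```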




## <a id='toc1_5_'></a>[**Proportion Z Test**](#toc0_)
   - **Use Case:** This test is used when you want to compare the proportions of two groups to see if they are significantly different from each other.
   - **Examples:** Testing if the proportion of customers who bought a product is different between two stores.
   - **Hypothesis:**  
     - $$H0: p_1 = p_2$$
     - $$H1: p_1 \neq p_2$$
   - **Formula:**
     - $ z = \frac{p_1 - p_2}{\sqrt{p(1 - p)(\frac{1}{n_1} + \frac{1}{n_2})}}$
     - Where $p_1$ and $p_2$ are the sample proportions, $p$ is the pooled sample proportion, and $n_1$ and $n_2$ are the sample sizes.
   - **Code:**
     ```python
     from statsmodels.stats.proportion import proportions_ztest

     # first argument is the number of successes per group, second argument is the number of observations per group
     z_test = proportions_ztest([successes_in_a, successes_in_b], [count_of_a, count_of_b])
     # outputs the z statistic and a p-value:
     # (statistic=-1.705800384635185, pvalue=0.08806845338765186)
     ```

<div style="background-color: #e0e0e0; padding: 10px; border-radius: 10px; border: 1px solid #333;">

**Assumptions:** 

1. **IID** - The samples are **independent** and from the **same population**. 
2. **Randomness** - The samples are **random** and **representative** of the underlying population. 
      - The sample proportions are then reasonable estimates of the population proportions.
3.  **Normality** - The sampling distribution of each proportion is approximately normal. At least one of the following conditions should hold:
      - Sample itself is normally distributed
      - Sample size is large enough ($np > 5$ and $n(1-p) > 5$) for both groups
      - Up to 3 standard deviations of each proportion are completely contained within 0-1. $[p-3s_w, p+3s_w] \subseteq [0,1]$
      - It passes some other test for normality

</div>
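
The pooled z formula can be worked through by hand. The sketch below uses made-up counts (45 of 200 buyers in one store, 60 of 200 in the other) and only needs `numpy` and `scipy`; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: buyers and visitors in two stores
successes = np.array([45, 60])
counts = np.array([200, 200])

p1, p2 = successes / counts
# pooled proportion under H0: p_1 = p_2
p_pooled = successes.sum() / counts.sum()

# z = (p1 - p2) / sqrt(p(1 - p)(1/n1 + 1/n2))
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / counts[0] + 1 / counts[1]))
z = (p1 - p2) / se

# two-sided p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```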



## <a id='toc1_6_'></a>[**Correlation Coefficient T-Test**](#toc0_)
   - **Use Case:** This test is used when you want to determine if there is a significant correlation between two variables.
   - **Examples:** Testing if there is a correlation between the amount of rainfall and the number of mosquitoes.
   - **Hypothesis:**
     - $$H_0: \rho = 0$$
     - $$H_1: \rho \neq 0$$
   - **Formula:**
     - $ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
     - Where $r$ is the sample correlation coefficient and $n$ is the sample size.
   - **Code:**
     ```python
     # both arguments are arrays of the same length, one per variable
     correlation_test = stats.pearsonr(x, y)
     # outputs the correlation coefficient and a p-value:
     # (correlation=0.8658002752562366, pvalue=0.011807351043126281)
     ```
   - **Assumptions:**
     - The variables are continuous and numeric.
     - The variables are linearly related.
     - The data is homoscedastic, meaning the variance of the residuals (or "errors") is constant.
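
The sketch below converts a sample correlation into the t statistic from the formula above and checks the resulting p-value against `stats.pearsonr`. The `rainfall` and `mosquitoes` arrays are simulated and hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 20 paired observations with a moderate linear relationship
rng = np.random.default_rng(3)
rainfall = rng.normal(size=20)
mosquitoes = 0.6 * rainfall + rng.normal(scale=0.8, size=20)

n = len(rainfall)
r, p_scipy = stats.pearsonr(rainfall, mosquitoes)

# t = r * sqrt(n - 2) / sqrt(1 - r^2), compared to a t distribution with n - 2 df
t_manual = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 2)
print(r, t_manual, p_manual, p_scipy)
```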


## <a id='toc1_7_'></a>[**Linear Regression T-Test**](#toc0_)
   - **Use Case:** This test is used when you want to determine if the coefficients of the linear regression model are significantly different from zero. This helps to understand if a predictor variable has a significant effect on the outcome variable.
   - **Examples:** Testing if the number of mosquitoes increases with the amount of rainfall.
   - **Hypothesis:**
     - $H_0: \beta_i = 0$ (The predictor variable has no effect on the outcome variable)
     - $H_a: \beta_i \neq 0$ (The predictor variable has a significant effect on the outcome variable)
   - **Formula:**
     - $ t = \frac{\hat{\beta}_i}{SE_{\hat{\beta}_i}}$
     - Where $\hat{\beta}_i$ is the estimated coefficient of the predictor variable in the linear regression model and $SE_{\hat{\beta}_i}$ is the standard error of that estimate.
   - **Code:**



     ```python
     import numpy as np
     import pandas as pd
     import matplotlib.pyplot as plt
     import seaborn as sns
     import statsmodels.api as sm

     bp = pd.read_csv('bp.csv')

     # separate data into X (independent) and y (dependent)
     X = bp['Age']
     y = bp['Systolic_Blood_Pressure']

     # add constant
     X_withconstant = sm.add_constant(X)

     # Instantiate Model
     myregression = sm.OLS(y, X_withconstant)

     # Fit Model (this returns a separate object with the parameters)
     myregression_results = myregression.fit()

     # the summary table reports the t statistic and p-value for each coefficient
     print(myregression_results.summary())

     #################### plotting steps below ################
     p = myregression_results.params

     sns.scatterplot(x='Age', y='Systolic_Blood_Pressure', data=bp)
     x = np.linspace(np.min(X), np.max(X), 100)
     plt.plot(x, p.const + p.Age * x)
     ```

<div style="background-color: #e0e0e0; padding: 10px; border-radius: 10px; border: 1px solid #333;">

**Assumptions**:

- **Independence and Identically Distributed**: The observations and their corresponding residuals should be independent of each other. Also the data is assumed to be from the same probability distribution or population.

- **Linearity**: The relationship between the independent and dependent variables should be linear.  

- **No Multicollinearity**: The independent variables are not too highly correlated with each other.  

- **Normality of Residuals**: The errors of the model (differences between predicted and actual values) follow a normal distribution.

- **Homoscedasticity**: The variance of the errors is constant across all levels of the independent variables. In other words, the "spread" of the residuals should be the same throughout the range of the independent variables.

</div>
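
For a single predictor, the slope t-test can be reproduced without a data file. The sketch below swaps in `scipy.stats.linregress` (not the `statsmodels` workflow above) and simulated `age` and `blood_pressure` arrays, both hypothetical, to show that the reported p-value comes from $t = \hat{\beta}/SE$ with $n - 2$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical data: blood pressure rising roughly linearly with age
rng = np.random.default_rng(4)
age = rng.uniform(20, 70, size=60)
blood_pressure = 100 + 0.6 * age + rng.normal(scale=8.0, size=60)

fit = stats.linregress(age, blood_pressure)

# t statistic for H0: slope = 0, and its two-sided p-value
n = len(age)
t_slope = fit.slope / fit.stderr
p_manual = 2 * stats.t.sf(abs(t_slope), df=n - 2)
print(fit.slope, fit.stderr, t_slope, p_manual, fit.pvalue)
```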


## <a id='toc1_8_'></a>[**Logistic Regression Wald's Test**](#toc0_)
   - **Use Case:** This test is used when you want to determine if the coefficients of the logistic regression model are significantly different from zero. This helps to understand if a predictor variable has a significant effect on the outcome variable.
   - **Examples:** Testing if the presence of West Nile Virus in mosquitoes is significantly affected by temperature or rainfall.
   - **Hypothesis:**
     - $H_0: \beta_i = 0$ (The predictor variable has no effect on the outcome variable)
     - $H_a: \beta_i \neq 0$ (The predictor variable has a significant effect on the outcome variable)
   - **Formula:**
     - $ W = \frac{\hat{\beta}_i}{SE_{\hat{\beta}_i}}$
     - Where $\hat{\beta}_i$ is the estimated coefficient of the predictor variable in the logistic regression model and $SE_{\hat{\beta}_i}$ is the standard error of that estimate. $W$ is compared against a standard normal distribution.
   - **Code:**
     ```python
     import statsmodels.api as sm

     # Add constant to predictor variables
     X = sm.add_constant(X)

     # Fit logistic regression model
     model = sm.Logit(y, X)
     result = model.fit()

     # Print summary statistics
     print(result.summary())
     ```
<div style="background-color: #e0e0e0; padding: 10px; border-radius: 10px; border: 1px solid #333;">

**Assumptions for Logistic Regression**:
- **Independence and Identically Distributed**: The observations and the corresponding residuals should be independent of each other. Also the data is assumed to be from the same probability distribution or population.

- **Discrete Outcome**: The dependent variable should be binary in binary logistic regression. One can also do multi-class logistic regression, using the one vs one or one vs all approach.

- **Linearity of Independent Variables and Log Odds**: The independent variables are linearly related to the log odds. This does not mean that the relationship between the independent and dependent variables is linear, but that the logit transformation of the dependent variable results in a linear relationship.

- **Absence of Multicollinearity**: The independent variables should not be too highly correlated with each other. This assumption is similar to the assumption in multiple linear regression. Multicollinearity can lead to unstable estimates and decreased interpretability of the model.

- **Sufficient Sample Size**: Logistic regression requires a large sample size to achieve reliable results. While there is no definitive rule for the minimum sample size, a common guideline is that logistic regression requires at least 10 cases with the least frequent outcome for each independent variable in the model.

</div>
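
To show where the Wald statistic comes from, the sketch below fits a one-predictor logistic regression with a hand-rolled Newton-Raphson loop instead of `sm.Logit`, then divides each coefficient by its standard error. This is an illustrative assumption, not how `statsmodels` is implemented internally, and the `temperature` and `virus_present` data are simulated.

```python
import numpy as np
from scipy import stats

# Hypothetical data: virus presence driven by a standardized temperature
rng = np.random.default_rng(7)
n = 500
temperature = rng.normal(size=n)
true_logit = -0.5 + 1.0 * temperature
virus_present = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# design matrix with an intercept column
X = np.column_stack([np.ones(n), temperature])

# Newton-Raphson iterations for the maximum likelihood estimate
beta = np.zeros(X.shape[1])
for _ in range(25):
    prob = 1 / (1 + np.exp(-X @ beta))
    weights = prob * (1 - prob)
    information = X.T @ (X * weights[:, None])
    beta = beta + np.linalg.solve(information, X.T @ (virus_present - prob))

# standard errors from the inverse information matrix at the final estimate
prob = 1 / (1 + np.exp(-X @ beta))
weights = prob * (1 - prob)
cov = np.linalg.inv(X.T @ (X * weights[:, None]))
se = np.sqrt(np.diag(cov))

# Wald statistic W = beta / SE, compared to a standard normal
wald = beta / se
p_values = 2 * stats.norm.sf(np.abs(wald))
print(beta, se, wald, p_values)
```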


## <a id='toc1_9_'></a>[**ANOVA**](#toc0_)
   - **Use Case:** This test is used when you want to compare the means of more than two groups to see if they are significantly different from each other.
   - **Examples:** Testing if the average height of plants grown with three different types of fertilizer is different.
   - **Hypothesis:**
     - $H_0$: The means of all groups are **equal**.
     - $H_1$: **At least one** group mean differs from the others.
   - **Formula:**
     - $$ F = \frac{MS_{between}}{MS_{within}}$$
       - Where $MS_{between}$ is the mean square between the groups and $MS_{within}$ is the mean square within the groups.
     - $$ MS_{between} = \frac{\sum_{i=1}^k N_i (\bar{Y_i} - \bar{Y})^2}{k - 1} $$
        - $\bar Y_i$ is the mean for group $i$, 
        - $\bar Y$ is the overall mean,
        - $N_i$ is the size of group $i$, and
        - k is the number of groups.
      - $$ MS_{within} = \frac{\sum_{i=1}^k (N_i-1)s_i^2}{N - k}$$
        - The only new notation here is $s^2_i$ which stands for the variance of group $i$.
   - **Code:**
     ```python
     # each argument is an array holding the observations of one group
     anova_test = stats.f_oneway(group1, group2, group3)
     # outputs the F statistic and a p-value:
     # F_onewayResult(statistic=3.7113359882669763, pvalue=0.043589334959178244)
     ```
   - **Assumptions:**
     - The samples are independent.
     - The data in each group are normally distributed.
     - The variances of the populations are equal.
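
The two mean squares can be computed directly from the formulas above. The sketch below does so on three simulated groups (hypothetical data with unequal sizes) and compares the result with `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical data: plant heights under three fertilizers
rng = np.random.default_rng(5)
groups = [rng.normal(mean, 2.0, size=size)
          for mean, size in [(10.0, 12), (11.0, 15), (12.5, 10)]]

k = len(groups)                      # number of groups
N = sum(len(g) for g in groups)      # total number of observations
grand_mean = np.concatenate(groups).mean()

# mean square between groups and mean square within groups
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
ms_within = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (N - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, N - k)

result = stats.f_oneway(*groups)
print(f_manual, result.statistic, p_manual, result.pvalue)
```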


## <a id='toc1_10_'></a>[**Tukey HSD Test**](#toc0_)
   - **Use Case:** This test is used when you want to compare the means of more than two groups and find out which specific groups' means (compared with each other) are different.
   - **Examples:** Testing which specific types of fertilizers result in different average plant heights.
   - **Hypothesis:**
     - $H_0$: For a given pair of groups, the two group means are **equal**.
     - $H_1$: For a given pair of groups, the two group means are **not equal**.
   - **Formula:**
     - The Tukey HSD test uses a formula to calculate a range for comparison between the means of each pair of groups. If the difference between the means of a pair of groups falls within this range, then the means are considered not significantly different.
   - **Code:**
     ```python
     from statsmodels.stats.multicomp import pairwise_tukeyhsd
     
     # first argument is the data, second argument is the group labels, third argument is the level of significance
     tukey = pairwise_tukeyhsd(endog=data, groups=groups, alpha=0.05)
     print(tukey.summary())
     # outputs a table with each pair of groups, the difference of their means, the lower and upper bounds of the comparison range, and whether the means are significantly different
     ```
   - **Assumptions:**
     - The samples are independent.
     - The data in each group are normally distributed.
     - The variances of the populations are equal.
     - The groups have the same sample size (the Tukey-Kramer variant relaxes this for unequal group sizes).
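
For equal group sizes, the comparison range described in the formula bullet can be written out as follows. This is a sketch of the usual equal-size form: $q_{\alpha,\,k,\,N-k}$ (the critical value of the studentized range distribution) and $n$ (the common group size) are notation introduced here, while $MS_{within}$, $k$ and $N$ follow the ANOVA section.

$$ HSD = q_{\alpha,\,k,\,N-k}\sqrt{\frac{MS_{within}}{n}} $$

Two group means $\bar{Y}_i$ and $\bar{Y}_j$ are flagged as significantly different when $|\bar{Y}_i - \bar{Y}_j| > HSD$.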


## <a id='toc1_11_'></a>[**Chi-Squared Test**](#toc0_)
   - **Use Case:** This test is used when you want to see if there is a significant association between two categorical variables.
   - **Examples:** Testing if there is an association between the type of trap and the presence of West Nile Virus.
   - **Hypothesis test:** 
     - $H_0$: There is **no relationship** between the categorical variables.
     - $H_A$: There is **a relationship** between the categorical variables.
   - **Formula:**
     - $ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
     - Where $O_i$ are the observed frequencies and $E_i$ are the expected frequencies.
   - **Code:**
     ```python
     # goodness-of-fit test: the argument is an array of observed counts
     # (expected counts default to equal frequencies across categories)
     stats.chisquare(count_biased_list2)

     # outputs the chi-square statistic and the p-value:
     # Power_divergenceResult(statistic=11.866000000000001, pvalue=0.036670745046782215)

     # test of independence: the argument is a contingency table (array or DataFrame) of observed counts
     chi_squared_test = stats.chi2_contingency(observed)

     # outputs the chi-squared statistic, p-value, degrees of freedom, and expected frequencies:
     # (chi2=10.48020691831597, p=0.005275729025274299, df=2, expected=array([[ 93.6,  98.4],
     #   [ 20.4,  21.6]]))
     ```
   - **Assumptions:**
     - The samples are independent.
     - The categories are mutually exclusive.
     - The sample size is large enough (expected frequency in each cell should be at least 5).
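
The expected frequencies and the statistic can be derived by hand from the row and column totals. The sketch below uses a made-up 2x3 contingency table and compares the result with `stats.chi2_contingency`; `correction=False` is passed so that no continuity correction is applied.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = virus present / absent, columns = three trap types
observed = np.array([[30, 45, 25],
                     [20, 15, 40]])

# expected count in each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# chi^2 = sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

chi2, p, dof_scipy, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(chi2_manual, chi2, p_manual, p)
```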

## <a id='toc1_12_'></a>[**Bootstrapping**](#toc0_)
   - **Use Case:** This resampling method is used when you want to estimate the sampling distribution of a statistic by generating many samples of the same size from the original data, with replacement.
   - **Examples:** Estimating the mean height of a population based on a sample.
   - **Hypothesis:**
     - $$H_0: \text{The sample statistic is equal to the population statistic}$$
     - $$H_1: \text{The sample statistic is not equal to the population statistic}$$
   - **Procedure:**
     - Draw a sample with replacement from the original data (same size as original data).
     - Compute the statistic of interest for this bootstrap sample.
     - Repeat the process many times (commonly 1000 or 10000 times), each time computing the statistic of interest.
     - Use the distribution of these bootstrap statistics to estimate the standard error, confidence intervals, or significance tests.
   - **Code:**
     ```python
     import numpy as np

     # numpy's random.choice function can be used to generate a bootstrap sample
     bootstrap_sample = np.random.choice(original_data, size=len(original_data), replace=True)
     # compute the statistic of interest on bootstrap_sample, e.g. the mean
     bootstrap_statistic = np.mean(bootstrap_sample)
     ```
   - **Assumptions:**
     - The original sample is representative of the population.
     - The bootstrap samples are drawn independently and with replacement.
     - The number of bootstrap samples is large enough to approximate the sampling distribution.
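
The full procedure listed above can be sketched as a loop. The example estimates the standard error and a 95% percentile confidence interval for the mean; `original_data` is simulated here, and the choice of 5000 resamples and the percentile interval are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: 50 simulated heights in cm
rng = np.random.default_rng(42)
original_data = rng.normal(170.0, 10.0, size=50)

n_boot = 5000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # resample with replacement, same size as the original data
    sample = rng.choice(original_data, size=len(original_data), replace=True)
    # statistic of interest for this bootstrap sample
    boot_means[i] = sample.mean()

# bootstrap standard error and 95% percentile confidence interval
standard_error = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(original_data.mean(), standard_error, ci_low, ci_high)
```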

## <a id='toc1_13_'></a>[**Spearman Coefficient T-Test**](#toc0_)
   - **Use Case:** This test is used when you want to determine if there is a significant monotonic relationship between two variables. It is a non-parametric test that does not assume linearity or normally distributed data.
   - **Examples:** Testing if the rank of mosquito population size is related to the rank of rainfall amount.
   - **Hypothesis:**
     - $$H_0: \rho = 0$$
     - $$H_1: \rho \neq 0$$
   - **Formula:**
     - Spearman's rank correlation coefficient is calculated using the following formula:
     - $\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
     - Where $d_i$ is the difference between the two ranks of each observation and $n$ is the number of observations. This shortcut formula is exact only when there are no tied ranks.
   - **Code:**
     ```python
     from scipy import stats

     # Calculate Spearman's rank correlation coefficient and the p-value
     rho, p_value = stats.spearmanr(x, y)

     print(f"Spearman's rank correlation coefficient: {rho}")
     print(f"P-value: {p_value}")
     ```

<div style="background-color: #e0e0e0; padding: 10px; border-radius: 10px; border: 1px solid #333;">

**Assumptions**:

 - **Measurability**: The variables are ordinal, interval or ratio.
 - **Monotonicity**: The relationship between the variables is monotonic, meaning that the variables tend to change together, but not necessarily at a constant rate.

</div>
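
The rank-difference formula can be checked against `stats.spearmanr`. The sketch below uses simulated continuous data (hypothetical `rainfall` and `mosquitoes` arrays), so there are no tied ranks and the shortcut formula applies.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a monotonic but non-linear relationship
rng = np.random.default_rng(6)
rainfall = rng.normal(size=15)
mosquitoes = rainfall ** 3 + rng.normal(scale=0.5, size=15)

n = len(rainfall)
# rank each variable, then take the per-observation rank differences
d = stats.rankdata(rainfall) - stats.rankdata(mosquitoes)

# rho_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1))
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, p_value = stats.spearmanr(rainfall, mosquitoes)
print(rho_manual, rho_scipy, p_value)
```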


## <a id='toc1_14_'></a>[**Point Biserial T-Test**](#toc0_)
   - **Use Case:** This test is used when you want to determine if there is a significant difference in a continuous variable between two groups. One variable should be binary and the other should be continuous.
   - **Examples:** Testing if the average mosquito population size is different between two different trap types.
   - **Hypothesis:**
     - $$H_0: \rho = 0$$
     - $$H_1: \rho \neq 0$$
   - **Formula:**
     - Point Biserial correlation coefficient is calculated using the following formula:
     - $r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}}$
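     - Where $M_1$ and $M_0$ are the means of the continuous variable in the two groups, $s_n$ is the standard deviation of the continuous variable over the whole sample (computed with $n$ in the denominator), $n_1$ and $n_0$ are the sizes of the two groups, and $n$ is the total sample size.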
   - **Code:**
     ```python
     from scipy import stats

     # Calculate Point Biserial correlation coefficient and the p-value
     r, p_value = stats.pointbiserialr(x, y)

     print(f"Point Biserial correlation coefficient: {r}")
     print(f"P-value: {p_value}")
     ```

<div style="background-color: #e0e0e0; padding: 10px; border-radius: 10px; border: 1px solid #333;">

**Assumptions**:

1. **Binary variable**: One of the variables should be dichotomous, i.e., it should take only two possible outcomes (0/1, True/False, etc.).
2. **Normality**: The continuous variable should be approximately normally distributed for each category of the binary variable.
3. **Linearity**: The relationship between the continuous and binary variables should be linear.
4. **Homoscedasticity**: The variances of the continuous variable for each category of the binary variable should be equal.

</div>
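
The sketch below evaluates the point biserial formula by hand and compares it with `stats.pointbiserialr`. It also shows that the p-value matches an equal-variance two-sample t-test on the same groups. The `trap_type` and `population` arrays are simulated and hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical data: binary trap type and a continuous mosquito count
rng = np.random.default_rng(8)
trap_type = np.array([0] * 18 + [1] * 22)
population = np.where(trap_type == 1,
                      rng.normal(12.0, 3.0, size=40),
                      rng.normal(10.0, 3.0, size=40))

m1 = population[trap_type == 1].mean()
m0 = population[trap_type == 0].mean()
n1, n0 = (trap_type == 1).sum(), (trap_type == 0).sum()
n = n1 + n0

# s_n uses n in the denominator (ddof=0)
s_n = population.std(ddof=0)
r_manual = (m1 - m0) / s_n * np.sqrt(n1 * n0 / n ** 2)

r_scipy, p_scipy = stats.pointbiserialr(trap_type, population)
t_test = stats.ttest_ind(population[trap_type == 1], population[trap_type == 0])
print(r_manual, r_scipy, p_scipy, t_test.pvalue)
```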


# <a id='toc2_'></a>[Two Tailed vs One Tailed Tests](#toc0_)


## <a id='toc2_1_'></a>[**One-Tailed Hypothesis Test**](#toc0_)
   - **Use Case:** This test is used when the direction of the relationship between variables is known or hypothesized. It tests whether the value of a parameter is greater than or less than the hypothesized value.
     - You should only use this test if you have a good reason to believe that the parameter is greater than or less than the hypothesized value. Otherwise, you should use a two-tailed test.
     - You can also use this test for further exploration of a hypothesis *with a new dataset*.
   - **Examples:** Testing if the average height of a certain type of plant is greater than 10 cm.
   - **Hypothesis test:** 
     - $H_0$: The parameter is less than or equal to (or greater than or equal to) the hypothesized value.
     - $H_A$: The parameter is greater than (or less than) the hypothesized value.
   - **Formula:**
     - Depends on the specific test being used (e.g., t-test, z-test, etc.)
   - **Code:**
     ```python
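     # For a one-sample t-test; alternative='greater' tests H_A: mean > popmean
     # (use alternative='less' for the other direction)
     stats.ttest_1samp(sample, popmean, alternative='greater')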
     ```
   - **Assumptions:**
     - Depends on the specific test being used. For a t-test, the assumptions include independence of observations, normality, and (for a two-sample t-test) equal variances.



## <a id='toc2_2_'></a>[**Two-Tailed Hypothesis Test**](#toc0_)
   - **Use Case:** This test is used when the direction of the relationship between variables is not known or specified. It tests whether the value of a parameter is different (either greater or less) than the hypothesized value.
     - This is the STANDARD and most common type of test. You need to have a good reason to use a one-tailed test.
   - **Examples:** Testing if the average height of a certain type of plant is different from 10 cm.
   - **Hypothesis test:** 
     - $H_0$: The parameter is equal to the hypothesized value.
     - $H_A$: The parameter is not equal to the hypothesized value.
   - **Formula:**
     - Depends on the specific test being used (e.g., t-test, z-test, etc.)
   - **Code:**
     ```python
     # For a one-sample t-test (two-sided is the default alternative)
     stats.ttest_1samp(sample, popmean)
     ```
   - **Assumptions:**
     - Depends on the specific test being used. For a t-test, the assumptions include independence of observations, normality, and (for a two-sample t-test) equal variances.
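
The sketch below runs the same one-sample t-test with all three alternatives to show how the one-tailed and two-tailed p-values relate. The `sample` array and `popmean` value are simulated and hypothetical, and the `alternative` keyword assumes a reasonably recent SciPy release.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 40 simulated plant heights in cm
rng = np.random.default_rng(9)
sample = rng.normal(10.8, 2.0, size=40)
popmean = 10.0

two_sided = stats.ttest_1samp(sample, popmean)                      # H_A: mean != popmean
greater = stats.ttest_1samp(sample, popmean, alternative='greater')  # H_A: mean > popmean
less = stats.ttest_1samp(sample, popmean, alternative='less')        # H_A: mean < popmean

# the t statistic is identical; only the tail area used for the p-value changes
print(two_sided.statistic, two_sided.pvalue, greater.pvalue, less.pvalue)
```

The two one-tailed p-values sum to 1, and the smaller of them is half the two-tailed p-value, which is why a one-tailed test should be chosen before looking at the data.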