# **Concepts Covered:**

- <a href = #link16>Paired Sample T-test for Equality of Means</a>
- <a href = #link12>Chi-Square Test for Variance</a>
- <a href = #link13>F-test for Equality of Variances</a>


## The `alternative` parameter was introduced in SciPy 1.6.0, so SciPy 1.6.0 or later must be installed.

In [4]:
# import scipy and check that the version is 1.6.0 or higher.
import scipy
scipy.__version__

'1.10.1'
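The version check can also be done programmatically. The sketch below is only an illustration: `version_at_least` is a hypothetical helper, not part of SciPy. It compares the numeric components of the version string, because comparing the raw strings would wrongly rank '1.10.1' below '1.6.0'.

```python
import re

def version_at_least(version, minimum=(1, 6, 0)):
    """Return True if a dotted version string is >= the minimum tuple (hypothetical helper)."""
    parts = []
    for piece in version.split('.')[:len(minimum)]:
        # keep only the leading digits, so a piece such as '0rc1' is read as 0
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    # pad short versions such as '1.6' with zeros
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)

print(version_at_least('1.10.1'))  # the version reported above
print(version_at_least('1.5.4'))   # an older version that lacks 'alternative'
```

In a notebook, the same helper could be called as `version_at_least(scipy.__version__)`.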

In [2]:
# if the scipy version is lower than 1.6.0, run the command below to upgrade the scipy package.
!pip install --upgrade scipy

Defaulting to user installation because normal site-packages is not writeable
Collecting scipy
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: scipy
Successfully installed scipy-1.10.1


## Import the required packages

In [5]:
#import the important packages
import pandas as pd #library used for data manipulation and analysis
import numpy as np # library used for working with arrays.
import matplotlib.pyplot as plt # library for plots and visualisations
import seaborn as sns # library for visualisations
%matplotlib inline 

import scipy.stats as stats # this library contains a large number of probability distributions as well as a growing library of statistical functions.

# <a name='link16'>**Paired Sample T-test for Equality of Means**</a>

### Let's revisit the example
Typical prices of single-family homes in Florida are given for a sample of 15 metropolitan areas (in 1000 USD) for 2002 and 2003 in a CSV file.
 
Assuming the house prices are normally distributed, do we have enough statistical evidence at the 0.05 significance level to say that house prices increased from 2002 to 2003?

### Let's write the null and alternative hypothesis

Let $\mu_1, \mu_2$ be the mean price of single-family homes in metropolitan areas of Florida for 2002 and 2003 respectively.

We want to test whether there is an increase in the house price from 2002 to 2003.

We will test the null hypothesis

>$H_0:\mu_1=\mu_2$

against the alternate hypothesis

>$H_a:\mu_1<\mu_2$

### Let's have a look at the sample data

In [6]:
# import the data
houseprice = pd.read_csv('Florida.csv')
houseprice.head()

| Unnamed: 0 | Metropolitan Area | Jan_2003 | Jan_2002 |
|---|---|---|---|
| 0 | Daytona Beach | 117 | 96 |
| 1 | Fort Lauderdale | 207 | 169 |
| 2 | Fort Myers | 143 | 129 |
| 3 | Fort Walton Beach | 139 | 134 |
| 4 | Gainesville | 131 | 119 |


In [7]:
# find the mean of the paired differences (2003 price minus 2002 price)
# a positive mean suggests that prices rose from 2002 to 2003; the test below checks whether the rise is statistically significant.
diff = np.mean(houseprice['Jan_2003'] - houseprice['Jan_2002'])
print('The mean of the differences between the house prices from 2003 to 2002', diff)

The mean of the differences between the house prices from 2003 to 2002 15.0


### Let's test whether the paired T-test assumptions are satisfied or not

* Continuous data - Yes, the house price is measured on a continuous scale.
* Normally distributed populations - Yes, we are told to assume that the house prices are normally distributed.
* Independent observations - Yes, the metropolitan areas are sampled randomly, so the sampled units (the 2002/2003 pairs) are independent of one another.
* Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.

Voila! We can use the paired sample T-test for this problem.



### Let's find the p-value

In [8]:
#import the required functions
from scipy.stats import ttest_rel

# find the p-value
test_stat, p_value = ttest_rel(houseprice['Jan_2002'], houseprice['Jan_2003'], alternative = 'less')
print('The p-value is ', p_value)

The p-value is  8.282698151615477e-05


### Insight
As the p-value is much less than the level of significance (0.05), the null hypothesis can be rejected. Thus, there is enough statistical evidence to conclude that house prices increased from 2002 to 2003.
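A paired sample T-test is equivalent to a one-sample T-test on the paired differences. The sketch below illustrates this with only the five rows displayed by `houseprice.head()`, so its numbers are an assumption-laden illustration and will not match the p-value obtained from the full sample of 15 areas.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Only the five rows shown by houseprice.head() (illustrative subset, not the full data)
jan_2002 = np.array([96, 169, 129, 134, 119])
jan_2003 = np.array([117, 207, 143, 139, 131])

# Paired differences, in the same order as the ttest_rel call (2002 minus 2003)
d = jan_2002 - jan_2003
n = d.size

# t statistic by hand: mean difference divided by its standard error
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# The same test via SciPy, as a paired test and as a one-sample test on the differences
t_rel, p_rel = ttest_rel(jan_2002, jan_2003, alternative='less')
t_one, p_one = ttest_1samp(d, popmean=0, alternative='less')

print('t statistic (by hand)    :', t_manual)
print('t statistic (ttest_rel)  :', t_rel)
print('t statistic (ttest_1samp):', t_one)
print('p-values agree           :', np.isclose(p_rel, p_one))
```

Because the differences are taken as 2002 minus 2003, a price increase shows up as a negative t statistic, which is why `alternative='less'` is used.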

# <a name='link12'>**Chi-Square Test for Variance**</a>



### Let's revisit an example
It is conjectured that the standard deviation for the annual return of mid cap mutual funds is 22.4%, when all such funds are considered and over a long period of time. The sample standard deviation of a certain mid cap mutual fund based on a random sample of size 32 is observed to be 26.4%. 

Do we have enough evidence to claim that the standard deviation of the chosen mutual fund is greater than the conjectured standard deviation for mid cap mutual funds at 0.05 level of significance?



### Let's write the null and alternative hypothesis
Let $\sigma$ be the standard deviation of the annual return (in %) of the chosen mutual fund.

We will test the null hypothesis

>$H_0:\sigma^2 = 22.4^2$

against the alternate hypothesis

>$H_a:\sigma^2 > 22.4^2$

### Let's test whether the assumptions are satisfied or not

* Continuous data - Yes
* Normally distributed population - We assume that the annual returns are normally distributed. The Central Limit Theorem applies to sample means, not to sample variances, so a sample size above 30 does not by itself justify this assumption.
* Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.   


### Let's find the p-value

In [9]:
#import the required function
from scipy.stats import chi2

# user-defined function to get the test stat and p-value
# To know more about the derivation of test statistic formula, please refer to the monographs and additional materials
def chi_var(pop_var, sample_var, n):
  # calculate the test statistic
  test_stat = (n - 1) * sample_var / pop_var
  # calculate the p-value
  p_value = 1 - chi2.cdf(test_stat, n-1)
  return (test_stat, p_value)

# set the value of sample size
n = 32
# set the values of population and sample variance
sigma_2, s_2 = 22.4**2, 26.4**2

test_stat, p_value = chi_var(sigma_2, s_2, n)

print('The p-value is ', p_value)

The p-value is  0.0733923626973344


In [11]:
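# another call with equal sample and hypothesised variances (250**2) and n = 25; the test statistic is then n - 1 = 24
test_stat, p_value = chi_var(250**2, 250**2, 25)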
print('The p-value is ', p_value,test_stat)

The p-value is  0.4615973330636183 24.0


### Insight
As the p-value is greater than the significance level, we cannot reject the null hypothesis. Hence, we do not have enough statistical evidence to conclude that the standard deviation of the chosen mutual fund is greater than the conjectured standard deviation for mid cap mutual funds at 0.05 level of significance.
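The same decision can be reached with the critical-value approach. The sketch below recomputes the test statistic $(n-1)s^2/\sigma_0^2$ from the numbers in the example and compares it with the upper 5% point of the chi-square distribution with $n-1$ degrees of freedom. The variable names are illustrative, and `chi2.sf` is used here in place of `1 - chi2.cdf` for the upper-tail probability.

```python
from scipy.stats import chi2

n = 32          # sample size
sigma0 = 22.4   # conjectured standard deviation (in %)
s = 26.4        # sample standard deviation (in %)
alpha = 0.05    # level of significance

# test statistic: (n - 1) * s^2 / sigma0^2
test_stat = (n - 1) * s**2 / sigma0**2

# upper-tail p-value from the survival function (1 - cdf)
p_value = chi2.sf(test_stat, n - 1)

# critical value: reject H0 only if the test statistic exceeds it
critical_value = chi2.ppf(1 - alpha, n - 1)

print('Test statistic :', round(test_stat, 2))
print('Critical value :', round(critical_value, 2))
print('p-value        :', p_value)
print('Reject H0?     :', test_stat > critical_value)
```

The test statistic falls below the critical value, which agrees with the p-value being greater than 0.05.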

# <a name='link13'>**F-test for Equality of Variances**</a>



### Let's revisit the example

The variance of a process is an important quality of the process. A large variance implies that the process needs better control and there is opportunity to improve. 


The data (Bags.csv) includes weights for two different sets of bags manufactured by two different machines. It is assumed that the weights for the two sets of bags follow a normal distribution.

Do we have enough statistical evidence at the 5% significance level to conclude that there is a significant difference between the variances of the bag weights for the two machines?



### Let's write the null and alternative hypothesis
Let $\sigma_1^2, \sigma_2^2$ be the variances of weights of the bags produced by two different machines.

We will test the null hypothesis

>$H_0:\sigma_1^2 = \sigma_2^2$

against the alternate hypothesis

>$H_a:\sigma_1^2 \neq \sigma_2^2$

### Let's test whether the assumptions are satisfied or not

* Continuous data - Yes, the weight is measured on a continuous scale.
* Normally distributed populations - Yes, it is assumed that the populations are normally distributed.
* Independent populations - As the two sets of bags are manufactured by two different machines, the populations are independent.
* Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.


### Let's have a look at the sample data

In [None]:
bagweight = pd.read_csv('Bags.csv')
bagweight.head()

| Unnamed: 0 | Machine 1 | Machine 2 |
|---|---|---|
| 0 | 2.95 | 3.22 |
| 1 | 3.45 | 3.3 |
| 2 | 3.5 | 3.34 |
| 3 | 3.75 | 3.28 |
| 4 | 3.48 | 3.29 |


### Let's find the p-value

In [None]:
# import the required function
from scipy.stats import f

# user-defined function to perform a two-tailed F-test
# To know more about the derivation of test statistic formula, please refer to the monographs and additional materials
def f_test(x, y):
  x = np.array(x)
  y = np.array(y)
  test_stat = np.var(x, ddof=1) / np.var(y, ddof=1)  # F test statistic: ratio of the sample variances
  dfn = x.size - 1  # numerator degrees of freedom
  dfd = y.size - 1  # denominator degrees of freedom
  # two-tailed p-value: double the smaller of the two tail probabilities
  p_value = 2 * min(f.cdf(test_stat, dfn, dfd), f.sf(test_stat, dfn, dfd))
  return test_stat, p_value

# perform the F-test
test_stat, p_value = f_test(bagweight.dropna()['Machine 1'], bagweight.dropna()['Machine 2'])
print('The p-value is {:.1e}'.format(p_value))

The p-value is 5.1e-06


### Insight
As the p-value is much smaller than the level of significance, the null hypothesis can be rejected. Hence, we have enough statistical evidence to conclude that there is a difference between the variances of the bag weights for the two machines at 0.05 significance level.
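For a two-sided alternative, the p-value of the F-test should not depend on which machine is placed in the numerator. The sketch below is an illustration only: `two_tailed_f_test` is a hypothetical helper, and the data are just the five rows displayed by `bagweight.head()`, not the full file. It doubles the smaller tail probability and shows that swapping the samples leaves the p-value unchanged.

```python
import numpy as np
from scipy.stats import f

def two_tailed_f_test(x, y):
    """Two-tailed F-test for equality of variances (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    test_stat = np.var(x, ddof=1) / np.var(y, ddof=1)  # ratio of sample variances
    dfn, dfd = x.size - 1, y.size - 1                  # degrees of freedom
    lower_tail = f.cdf(test_stat, dfn, dfd)
    upper_tail = f.sf(test_stat, dfn, dfd)
    # double the smaller tail, capped at 1
    p_value = min(1.0, 2 * min(lower_tail, upper_tail))
    return test_stat, p_value

# Only the five rows shown by bagweight.head() (illustrative subset)
machine_1 = [2.95, 3.45, 3.5, 3.75, 3.48]
machine_2 = [3.22, 3.3, 3.34, 3.28, 3.29]

stat_12, p_12 = two_tailed_f_test(machine_1, machine_2)
stat_21, p_21 = two_tailed_f_test(machine_2, machine_1)

print('F (machine 1 / machine 2):', stat_12, ' p-value:', p_12)
print('F (machine 2 / machine 1):', stat_21, ' p-value:', p_21)
```

The two test statistics are reciprocals of each other, while the two-tailed p-values coincide.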

# ---------------------------------------------**The End**-------------------------------------------------