# Hypothesis testing - 1

### BuyerRatio.csv Dataset

## Sales of products in four different regions are tabulated for males and females. Determine whether the male-female buyer ratios are similar across regions.

### Business objective / problem 

    Objective: To determine whether the proportion of male to female buyers differs significantly across the regions, at the 5% significance level.

    Alpha (α) = 0.05 or 5%

    Here we compare proportions across more than two samples (the four regions) and the data are categorical counts. Hence, we use the Chi2 test.

### Data Collection
    The data must be collected, or randomly sampled, from the population. Only in rare cases do we need to run a survey to collect primary data; mostly we use secondary data for analysis. In this case we assume that the observations are randomly sampled and independent of each other, and proceed with the study on that basis.
    
    Let us use the given data : “BuyerRatio.csv”

In [1]:
# First, import the required libraries
# %matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
DF = pd.read_csv('BuyerRatio.csv')

### Data preparation / cleaning
#### Check the attributes of the data, check the integrity of data import into the system

In [3]:
DF.head()

,Observed Values,East,West,North,South
0,Males,50,142,131,70
1,Females,435,1523,1356,750


In [4]:
DF.shape

(2, 5)

#### Check for data types and decide proper data type for analysis

In [5]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Observed Values  2 non-null      object
 1   East             2 non-null      int64 
 2   West             2 non-null      int64 
 3   North            2 non-null      int64 
 4   South            2 non-null      int64 
dtypes: int64(4), object(1)
memory usage: 208.0+ bytes


In [6]:
# Prepare the cross table / contingency table to feed the test.
# set_index with a column name moves that column into the index,
# leaving only the four numeric region columns.
table = DF.set_index('Observed Values')
table

Observed Values,East,West,North,South
Males,50,142,131,70
Females,435,1523,1356,750
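
Before running the test, it can help to look at the observed proportions directly. The sketch below rebuilds the table from the counts shown above instead of reading the CSV, then computes the share of male buyers in each region and the pooled share across all regions. The names `observed`, `male_share` and `pooled_share` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Counts copied from the table above; `observed` is an illustrative name.
observed = pd.DataFrame(
    {"East": [50, 435], "West": [142, 1523],
     "North": [131, 1356], "South": [70, 750]},
    index=pd.Index(["Males", "Females"], name="Observed Values"),
)

# Share of male buyers in each region: males / (males + females).
male_share = observed.loc["Males"] / observed.sum(axis=0)

# Pooled share of male buyers over all four regions combined.
pooled_share = observed.loc["Males"].sum() / observed.to_numpy().sum()

print(male_share.round(4))
print("Pooled male share: {:.4f}".format(pooled_share))
```

The regional shares are what the test compares; the pooled share is the single value they would all be close to if the regions did not differ.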


### Data analysis in the context of business case

#### Constructing Hypothesis  
    H0: The buyer ratios / proportions are equal across all regions
    H1: The buyer ratio / proportion of at least one region is not equal

#### Case for the Chi2 test -- chi2_contingency() -- as the data is categorical and there are more than 2 populations to compare
    
    For the Chi2 test the following conditions must be satisfied:
    
    1. The samples are random and independent. - ***WE ASSUME THE SAMPLES ARE RANDOM AND INDEPENDENT***
    2. The expected frequency in each cell of the cross table is at least 5. - ***DATA SATISFIES THE CONDITION***
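
The second condition can be checked in code. This is a minimal sketch, assuming the counts from the cross table above are typed in by hand; `observed`, `expected` and `min_expected` are illustrative names. It uses `scipy.stats.contingency.expected_freq`, which returns the cell frequencies expected when rows and columns are independent.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Rows: Males, Females. Columns: East, West, North, South.
observed = np.array([[50, 142, 131, 70],
                     [435, 1523, 1356, 750]])

# Expected count per cell under independence of gender and region.
expected = expected_freq(observed)

# The rule of thumb asks for every expected count to be at least 5.
min_expected = expected.min()
print("Smallest expected frequency: {:.2f}".format(min_expected))
print("Condition satisfied:", bool(min_expected >= 5))
```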

#### Calculate the p-value using the Chi2 test - chi2_contingency()

In [7]:
chi_stat, Pval, dfree, Expval = stats.chi2_contingency(table)
Expval = pd.DataFrame(Expval)
print('Chi Statistic is {:.2f} and pvalue is {:.2f}'.format(chi_stat, Pval))

Chi Statistic is 1.60 and pvalue is 0.66
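
To see where the statistic comes from, the sketch below recomputes it by hand. Each expected count is (row total × column total) / grand total, the statistic is the sum of (observed - expected)² / expected over all cells, and the degrees of freedom are (rows - 1) × (columns - 1). The counts are typed in from the cross table, and names such as `chi_manual` are illustrative, not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Rows: Males, Females. Columns: East, West, North, South.
observed = np.array([[50, 142, 131, 70],
                     [435, 1523, 1356, 750]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
grand_total = observed.sum()

# Expected counts under independence, via broadcasting to shape (2, 4).
expected = row_totals * col_totals / grand_total

# Pearson chi-square statistic and its degrees of freedom.
chi_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy (no continuity correction applies when dof > 1).
chi_scipy, p_scipy, dof_scipy, _ = stats.chi2_contingency(observed)

print("manual: {:.4f}, scipy: {:.4f}, dof: {}".format(chi_manual, chi_scipy, dof))
```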


#### Check if the p-value is less than alpha

In [8]:
if Pval <=0.05:
    print('\nThere is sufficient evidence to reject null Hypothesis, Hence, consider "Alternate Hypothesis" and take action\n')
else:
    print('\nThere is not enough evidence to reject null Hypothesis. Hence, we shall continue to trust Null hypothesis\n')


There is not enough evidence to reject null Hypothesis. Hence, we shall continue to trust Null hypothesis



## Inference / conclusion
As the p-value (0.66) is higher than ***alpha*** (0.05), we conclude that there is not enough evidence to reject the null hypothesis. Hence, the data give no evidence of a difference in the buyer ratios across the regions, and any difference in proportions that is noted is consistent with the chance associated with sampling.

## Hypothesis testing using critical value
    If observed chi-stat <= critical chi-square, we fail to reject the null hypothesis (no evidence that the variables are related).
    If observed chi-stat > critical chi-square, we reject the null hypothesis (the variables are not independent, and hence may be related).

In [9]:
alpha = 0.05
critical_value = stats.chi2.ppf(q=1 - alpha, df=dfree)    # Critical value at the 5% significance level (95% confidence)

observed_chi_val = chi_stat
print(critical_value,observed_chi_val,'\n')

print('Interpretation by critical value')
if observed_chi_val <= critical_value:
    # If the observed value is not in the critical region, we fail to reject the null hypothesis
    print ('Null hypothesis cannot be rejected (variables are not related and independent of each other. \nHence the buyer ratios are equal among all the regions)')
else:
    # If the observed value is in the critical region, we reject the null hypothesis
    print ('Null hypothesis is rejected (variables are not independent)')

7.814727903251179 1.595945538661058 

Interpretation by critical value
Null hypothesis cannot be rejected (variables are not related and independent of each other. 
Hence the buyer ratios are equal among all the regions)
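
The p-value approach and the critical-value approach are two views of the same chi-square distribution, so they lead to the same decision here. The sketch below is self-contained: it hard-codes the statistic printed above and assumes 3 degrees of freedom, that is (2 - 1) × (4 - 1), rather than reusing the notebook variables. The names `reject_by_p` and `reject_by_critical` are illustrative.

```python
from scipy import stats

alpha = 0.05
chi_stat = 1.595945538661058   # observed statistic printed above
dof = 3                        # assumed: (2 rows - 1) * (4 columns - 1)

# Upper-tail probability of the observed statistic (the p-value).
p_value = stats.chi2.sf(chi_stat, dof)

# Value that cuts off the upper alpha tail of the distribution.
critical_value = stats.chi2.ppf(1 - alpha, dof)

reject_by_p = bool(p_value <= alpha)
reject_by_critical = bool(chi_stat > critical_value)

print("p-value: {:.4f}, critical value: {:.4f}".format(p_value, critical_value))
print("Reject H0 (p-value rule):", reject_by_p)
print("Reject H0 (critical-value rule):", reject_by_critical)
```

The two rules can differ only if the statistic lands exactly on the critical value, which is why either one can be reported.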
