# Purpose:

***To verify whether the correlation between wei and ei is genuine or has appeared merely by chance!***

# Previously in the series:

1. [Autumn of Matriarch: The complete guide to EDA - 1](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-1)

2. [Autumn of Matriarch: The complete guide to EDA - 2](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-2)

3. [Autumn of Matriarch: The complete guide to EDA - 3](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-3)

4. [Autumn of Matriarch: The complete guide to EDA - 4](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-4)

# Important note:

***Much of what I try to explain in this notebook is best understood with the help of the code. However, that is often not easy. Therefore, I humbly request my generous audience to kindly put their concerns regarding this in the comment section so that I can make a more comprehensive notebook explaining the topic.***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/women-entrepreneurship-and-labor-force/Dataset3.csv', delimiter = ';')

In [None]:
df.head()

In [None]:
df.rename(columns = {'Level of development':'lod',
                    'European Union Membership':'eum',
                    'Currency':'currency',
                    'Women Entrepreneurship Index':'wei',
                    'Entrepreneurship Index':'ei',
                    'Inflation rate':'ir',
                    'Female Labor Force Participation Rate':'flfp',
                    'Country':'country',
                    'No':'no'},
         inplace = True)

In [None]:
df.tail()

In [None]:
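# np.random.permutation returns a shuffled copy of its input;
# the hypothesis test further down relies on it to simulate the null hypothesis
np.random.permutation([10, 11, 12])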

# Using Hypothesis Testing:

***We want to verify whether the effect ("a strong correlation between the variables wei and ei") actually holds for a larger population or is mere chance!***

***Please note:***

I have already made a notebook on Hypothesis Testing: [Introduction to Hypothesis Testing and Estimation](https://www.kaggle.com/ritikpnayak/introduction-to-hypothesis-testing-and-estimation). One may find that one more thorough, so reviewing it may help to retain the ideas used here.

***To answer that question, we need to assume a "Null Hypothesis"***

***What is a null hypothesis?***

It is the opposite of what we are trying to prove. In Hypothesis Testing, we seek a justification for the apparent effect that is visible in our data. In our data, the apparent effect is a significant correlation between the 2 variables (wei and ei), which we found in this notebook: [Autumn of Matriarch: The complete guide to EDA - 3](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-3).

Then, we employ the "Proof by Contradiction" method to establish the significance of the result (the correlation). That is to say, we assume that the null hypothesis is true and work out how likely an effect at least as large as the observed one would be under that assumption. If that probability is very small, we reject the null hypothesis and say that the effect is statistically significant.

Null Hypothesis - There is no correlation between wei and ei in the larger population. To show that the observed correlation is real, we need to show that a correlation this strong would be unlikely to appear if the two variables were in fact unrelated.
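
To make the idea of "simulating the null hypothesis" concrete, here is a minimal sketch. It assumes synthetic data (two made-up correlated arrays, not the wei and ei columns) and uses `np.corrcoef` as a stand-in for the correlation function. Shuffling one variable keeps its values but breaks its pairing with the other, which is what a world without correlation would look like.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Synthetic stand-ins for two correlated columns (NOT the notebook's data)
xs = rng.normal(size=200)
ys = 0.8 * xs + rng.normal(scale=0.5, size=200)

# Correlation of the data as observed
observed = abs(np.corrcoef(xs, ys)[0, 1])

# Correlations after shuffling xs: the values survive, the pairing does not
shuffled = [abs(np.corrcoef(rng.permutation(xs), ys)[0, 1])
            for _ in range(1000)]

print(f"observed |r|            : {observed:.3f}")
print(f"largest |r| in shuffles : {max(shuffled):.3f}")
```

If the observed correlation sits far above anything the shuffles produce, chance alone is a poor explanation for it.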

In [None]:
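class UnimplementedMethodException(Exception):
    """Raised when a subclass does not override a required method."""


class HypothesisTest(object):
    """Base class for a simulation-based hypothesis test."""

    def __init__(self, data):
        self.data = data
        self.MakeModel()
        # Test statistic computed on the observed (unshuffled) data
        self.actual = self.TestStatistic(data)

    def PValue(self, iters=1000):
        # Simulate the null hypothesis `iters` times and count how often the
        # simulated statistic is at least as extreme as the observed one
        self.test_stats = [self.TestStatistic(self.RunModel())
                           for _ in range(iters)]
        count = sum(1 for x in self.test_stats
                    if x >= self.actual)
        return count / iters

    def TestStatistic(self, data):
        raise UnimplementedMethodException()

    def MakeModel(self):
        pass

    def RunModel(self):
        raise UnimplementedMethodException()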

In [None]:
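class CorrHypothesisTest(HypothesisTest):

    def TestStatistic(self, data):
        # The absolute value makes the test two-sided
        xs, ys = data
        test_stat = abs(correlation(xs, ys))
        return test_stat

    def RunModel(self):
        # Shuffle xs to break its pairing with ys (simulates the null hypothesis)
        xs, ys = self.data
        xs = np.random.permutation(xs)
        return xs, ys

def de_mean(x):
    # Subtract the mean from every element
    x_bar = np.mean(x)
    return [x_i - x_bar for x_i in x]

def covariance(x, y):
    # Sample covariance (denominator n - 1)
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

def correlation(x, y):
    # ddof=1 matches the (n - 1) denominator used in covariance();
    # NumPy's default (ddof=0) would inflate the result by n / (n - 1)
    std_x = np.std(x, ddof=1)
    std_y = np.std(y, ddof=1)
    if std_x > 0 and std_y > 0:
        return covariance(x, y) / (std_x * std_y)
        # we can also return covariance(x, y) / std_x / std_y
    else:
        return 0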
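
A hand-written correlation function is easy to get subtly wrong, so a quick sanity check against NumPy can help. The sketch below is an illustration only: `pearson_r` is a hypothetical helper name and the five-point sample is made up. The point it shows is that a covariance with an (n - 1) denominator must be paired with sample standard deviations (`ddof=1`) for the result to agree with `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from sample covariance and sample std (both n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.dot(x - x.mean(), y - y.mean()) / (len(x) - 1)
    return cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Small made-up sample, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

print(pearson_r(x, y))          # hand-written version
print(np.corrcoef(x, y)[0, 1])  # NumPy reference value
```

Both lines should print the same value up to floating-point rounding.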

In [None]:
data = df.ei.values, df.wei.values
ht = CorrHypothesisTest(data)
pvalue = ht.PValue()

In [None]:
pvalue

***What does the P Value tell us?***

1. If it is small (conventionally less than 5%), an effect like ours would rarely arise if the null hypothesis were true. We therefore reject the null hypothesis and call the effect statistically significant.

2. If it is more than 10%, the effect could plausibly be due to chance, so we say that our outcome is not "statistically significant", to be more precise and technical.

3. If it is between 5% and 10%, the result is borderline and we cannot conclude much about the outcome. Hypothesis Testing tells us whether there is strong evidence for an effect or not; it does not always give us a precise answer.
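
The three cases above can be written down as a small helper. This is only a sketch: `interpret_pvalue` is a hypothetical name, and the 5% and 10% cut-offs are the conventions used in this notebook rather than fixed rules.

```python
def interpret_pvalue(pvalue, low=0.05, high=0.10):
    """Map a p-value onto the three bands described above.

    `low` and `high` are conventional cut-offs, not laws of nature.
    """
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    if pvalue < low:
        return "significant: chance alone is an unlikely explanation"
    if pvalue > high:
        return "not significant: chance is a plausible explanation"
    return "borderline: the test is inconclusive"

for p in (0.003, 0.07, 0.4):
    print(p, "->", interpret_pvalue(p))
```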

***What does the outcome tell us?***

The outcome is between 0 and 1%, which means that our effect (i.e. "There is a strong correlation between the variables wei and ei") is statistically significant: a correlation this strong is very unlikely to arise by chance alone, so it most likely holds for a larger population as well.
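
One caveat about an outcome this close to 0: with 1000 shuffles, a p-value of exactly 0 only means that none of the shuffles reached the observed correlation, i.e. the true p-value is below roughly 1 in 1000. A common convention, which is an addition here and not part of the class above, is an add-one correction so that the reported value is never exactly 0. The sketch below assumes synthetic data and a hypothetical helper `permutation_pvalue`.

```python
import numpy as np

def permutation_pvalue(xs, ys, iters=1000, seed=0):
    """Return (raw, adjusted) permutation p-values for |Pearson r|."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    actual = abs(np.corrcoef(xs, ys)[0, 1])
    # Count shuffles whose |r| is at least as large as the observed one
    count = sum(
        abs(np.corrcoef(rng.permutation(xs), ys)[0, 1]) >= actual
        for _ in range(iters)
    )
    raw = count / iters
    adjusted = (count + 1) / (iters + 1)  # add-one correction, never exactly 0
    return raw, adjusted

# Synthetic, strongly correlated data (NOT the notebook's columns)
rng = np.random.default_rng(1)
xs = rng.normal(size=50)
ys = xs + rng.normal(scale=0.3, size=50)

raw, adjusted = permutation_pvalue(xs, ys)
print("raw p-value     :", raw)
print("adjusted p-value:", adjusted)
```

Increasing `iters` narrows the smallest p-value the test can resolve, at the cost of more computation.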