# Purpose:

***To check whether the "Entrepreneurship Index" and the "Women Entrepreneurship Index" are correlated.***

# Previously in the series:

1. Part 1 of the series: [Autumn of Matriarch: The complete guide to EDA - 1](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-1)

2. Part 2 of the series: [Autumn of Matriarch: The complete guide to EDA - 2](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-2)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import random
import math
import statistics
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
style.use('fivethirtyeight')

In [None]:
df = pd.read_csv('/kaggle/input/women-entrepreneurship-and-labor-force/Dataset3.csv', delimiter = ';')

In [None]:
df.head()

In [None]:
ei = df['Entrepreneurship Index'].values
wei = df['Women Entrepreneurship Index'].values
flfp = df['Female Labor Force Participation Rate'].values

# Pre-requisites:

In a previous notebook, I discussed correlation at length, with special emphasis on scatter plots and Pearson's method. I would encourage readers to go through that notebook first for a better understanding of these two basic methods of finding the correlation between two variables. The link to the notebook is as follows:

[Elemental approach to finding correlation](https://www.kaggle.com/ritikpnayak/elemental-approach-to-finding-correlation)

***That said, I won't dwell much on the scatter plot and Pearson's method parts of this notebook.***

# 1. Scatter plots:

In [None]:
plt.scatter(ei, wei)

***What does the plot tell us?***

1. The relationship is almost linear.
2. We have only 51 data points. I suspect that, had we had more data, the points would have followed the line even more closely.
3. In my opinion, a strong correlation exists.

# 2. Pearson's method:

In [None]:
def de_mean(x):
    x_bar = np.mean(x)
    return [x_i - x_bar for x_i in x]

def covariance(x, y):
    n = len(x)
    return np.dot(de_mean(x), de_mean(y)) / (n - 1)

In [None]:
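def correlation(x, y):
    # Use the sample standard deviation (ddof=1) so that its divisor matches
    # the (n - 1) used in covariance(); mixing the two inflates the result by n / (n - 1)
    std_x = np.std(x, ddof=1)
    std_y = np.std(y, ddof=1)
    if std_x > 0 and std_y > 0:
        return covariance(x, y) / (std_x * std_y)
        # we can also return covariance(x, y) / std_x / std_y
    else:
        return 0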

In [None]:
print('correlation between ei and wei using Pearson method: ', correlation(ei, wei))

***What does the output tell us?***

1. As assumed, the variables are strongly correlated.
2. The coefficient is in the region of 0.9, close to the upper limit of +1, which supports that conclusion.
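
One detail worth checking in any hand-written Pearson function is that the covariance and the standard deviations use the same divisor (both `n - 1`, or both `n`). The sketch below is only an illustration on made-up numbers; the names `pearson` and `pearson_mixed` are hypothetical, and it uses the standard library so it runs on its own.

```python
import math

def pearson(x, y, ddof=1):
    """Pearson's r with one divisor (n - ddof) shared by the covariance and both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy)) / (n - ddof)
    std_x = math.sqrt(sum(a * a for a in dx) / (n - ddof))
    std_y = math.sqrt(sum(b * b for b in dy) / (n - ddof))
    if std_x == 0 or std_y == 0:
        return 0.0  # no variation in one variable: report 0
    return cov / (std_x * std_y)

def pearson_mixed(x, y):
    """Covariance divided by (n - 1) but standard deviations divided by n (inconsistent on purpose)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    std_x = math.sqrt(sum(a * a for a in dx) / n)
    std_y = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (std_x * std_y)

# Hypothetical, perfectly linear data: the true coefficient is 1
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_consistent = pearson(x, y)
r_mixed = pearson_mixed(x, y)
print(r_consistent, r_mixed)
```

With five points the mixed version is inflated by 5 / 4 and lands above the theoretical limit of +1. With 51 points the factor is only 51 / 50, so the effect is small but still there.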

***Is it the best method for our data?***

Pearson's method is worthwhile and gives a neat summary of the correlation, ***but it is not the best choice here***. Why?

Let's look at the distribution of both variables.

In [None]:
def EvalCdf(sample, x):
    count = 0
    for i in sample:
        if i <= x:
            count += 1
    prob = count / len(sample)
    return prob

In [None]:
cdf_ei = [EvalCdf(ei, x) for x in sorted(ei)]

plt.plot(sorted(ei), cdf_ei)

In [None]:
cdf_wei = [EvalCdf(wei, x) for x in sorted(wei)]

plt.plot(sorted(wei), cdf_wei)

Neither variable appears to be normally distributed. This gives us a reason to doubt that Pearson's correlation is the best possible measure in this case.
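
Judging normality from a CDF plot by eye is subjective, so here is a rough, informal sketch that measures the largest vertical gap between the empirical CDF and a normal CDF with the same mean and standard deviation. It is not a formal test. The function names and the sample are hypothetical, and the sample is assumed not to be constant.

```python
import math
from bisect import bisect_right

def ecdf(sample):
    """Return the sorted values and the fraction of the sample <= each of them."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [bisect_right(xs, x) / n for x in xs]

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def max_gap_to_normal(sample):
    """Largest absolute difference between the empirical CDF and the fitted normal CDF."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in sample) / (n - 1))
    xs, ps = ecdf(sample)
    return max(abs(p - normal_cdf(x, mu, sigma)) for x, p in zip(xs, ps))

# Hypothetical right-skewed sample, not taken from the dataset
skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 30]
gap = max_gap_to_normal(skewed)
print(gap)
```

The gap always lies between 0 and 1. The closer it is to 0, the closer the sample follows the fitted normal curve.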

***What else can make Pearson's coefficient misrepresent the actual correlation?***

1. The presence of outliers (skewness in the data). We saw in the first part of the series that both variables are positively skewed.
2. A non-linear relationship between the variables.
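
Both effects are easy to see on small made-up examples. The sketch below is a hypothetical illustration, not the notebook's data: one outlier added to an otherwise perfect line, and a perfect but non-linear (quadratic) relationship.

```python
import math

def pearson(x, y):
    """Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den > 0 else 0.0

# 1. Outlier: ten points on the line y = x, then one stray point (11, 0)
x = list(range(1, 11))
y = list(x)
r_clean = pearson(x, y)
r_outlier = pearson(x + [11], y + [0])

# 2. Non-linearity: y is fully determined by x, yet the linear coefficient vanishes
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [v * v for v in xs]
r_curve = pearson(xs, ys)

print(r_clean, r_outlier, r_curve)
```

A single stray point halves the coefficient here, and the quadratic relationship scores 0 even though y is completely determined by x.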

***What can be done about it?***

1. We can use Spearman's correlation method.
2. We can also bin the values in wei and plot the percentiles of the values of ei in each bin of wei.
3. If the data is skewed, we can also take the logarithm of the skewed values and then find Pearson's correlation.

***Let's use these 3 methods one at a time.***

# 3. Spearman's correlation method:

In [None]:
def SpearmanCorrelation(x, y):
    xranks = pd.Series(x).rank()
    yranks = pd.Series(y).rank()
    return correlation(xranks, yranks)

In [None]:
print('correlation between ei and wei using Spearman method: ', SpearmanCorrelation(ei, wei))

***What does the output tell us?***

1. The result is about as strong as that of the previous method.
2. This means that there actually is a strong correlation between the 2 variables.
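
To see what the ranking step buys us, here is a minimal sketch of Spearman's method in plain Python. Ties get the average of their positions, which is also the default behaviour of pandas' `rank()`. The helper names and the data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den > 0 else 0.0

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but non-linear relationship
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

Because the cubes preserve the ordering of x, the ranks match exactly and Spearman's coefficient is 1, while Pearson's stays below 1.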

# 4. Characterizing relationships:

***What is characterizing relationships?***

1. It is a process in which we "bin" one variable (in our case, Women Entrepreneurship Index) and plot the percentiles of the other (Entrepreneurship Index) within each bin.
2. The code will explain the concept better than words.

***What is the benefit?***

1. In my opinion, it extends what scatter plots do, that is, it helps visualize the relationship.
2. It provides more insights into the nature of the relationship.
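
The binning step can be sketched without any libraries. For increasing bin edges, `np.digitize` returns the number of edges that are less than or equal to the value, which is what `bisect_right` computes. The pairs below are hypothetical (wei, ei) values, not rows from the dataset.

```python
from bisect import bisect_right
from collections import defaultdict

def bin_index(value, edges):
    """Number of edges <= value: 0 below the first edge, len(edges) at or above the last."""
    return bisect_right(edges, value)

edges = list(range(25, 80, 5))  # 25, 30, ..., 75

# Hypothetical (wei, ei) pairs
pairs = [(27.0, 31.0), (28.5, 26.0), (41.0, 44.0), (44.9, 39.0), (74.0, 70.0)]

groups = defaultdict(list)
for w, e in pairs:
    groups[bin_index(w, edges)].append(e)  # collect ei values by the bin of wei

for index in sorted(groups):
    print(index, groups[index])
```

Values below 25 fall into group 0 and values of 75 or more into group 11, so the groups numbered 1 to 10 are the ones that lie between the edges.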

In [None]:
variables = ['Entrepreneurship Index', 'Women Entrepreneurship Index']
features = df[variables]

bins = np.arange(25, 80, 5)
indices = np.digitize(features['Women Entrepreneurship Index'], bins)
groups = features.groupby(indices)

***How many values are there in each group?***

In [None]:
for i, group in groups:
    print(i, len(group))

In [None]:
mean_wei = [group['Women Entrepreneurship Index'].mean() for i, group in groups]
ls = []

for i, group in groups:
    # .values already holds every row of the group, so no slicing is needed
    sample = group['Entrepreneurship Index'].values
    ls.append(sample)

***What do the values in each group (1, 2, 3, ..., 10) look like?***

In [None]:
ls

***Take the 75th, 50th and 25th percentile of the values in ei against each bin of wei and plot those.***
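
For reference, here is a minimal sketch of how a percentile is obtained with linear interpolation between neighbouring sorted values, which is the default behaviour of `np.percentile`. The function name and the samples are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between sorted neighbours."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100    # fractional position in the sorted list
    lo = int(pos)                    # index of the value just below that position
    hi = min(lo + 1, len(xs) - 1)    # index of the value just above it
    frac = pos - lo
    return xs[lo] + (xs[hi] - xs[lo]) * frac

sample = [10, 20, 30, 40]            # hypothetical bin with four values
quartiles = [percentile(sample, q) for q in (25, 50, 75)]
single = [percentile([42.0], q) for q in (25, 50, 75)]  # hypothetical bin with one value
print(quartiles, single)
```

Note that in a bin with a single value all three percentiles coincide, so sparsely filled bins can bend the plotted lines sharply.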

In [None]:
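plt.figure(figsize=(15, 8))
# Reference line y = x for comparison
plt.plot([30, 70], [30, 70], label='uniform', ls=':', color='green')
for percent in [75, 50, 25]:
    ei_percentiles = [np.percentile(l, percent) for l in ls]
    label = "%dth" % percent
    plt.plot(mean_wei, ei_percentiles, label=label)
plt.xlabel('Mean Women Entrepreneurship Index per bin')
plt.ylabel('Entrepreneurship Index percentile')
plt.legend()  # drawn once, after all lines are plotted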

***What does the plot tell us?***

1. None of the three lines is straight, which suggests that the relationship is not perfectly linear. That in turn is consistent with the outliers and the skewness we saw earlier.
2. Taken at face value, this would mean that there is not a strong correlation! However, ***that very conclusion is questionable***.

***Why is that questionable?***

1. We have very little data.
2. Therefore, each highlighted region of the graph rests on only a few data points.
3. For instance, on the line for the 75th percentile, a dip is visible between 40 and 50 on the x-axis. That dip is not as significant as it appears. Had we had more data, it would not have looked so pronounced.

***We are content with the previous result, that is, there is a strong correlation between the variables.***

# 5. Taking logarithm:

In this method, we take the log of the variables (wei and ei) and find the correlation between them.

***Why this method?***

As we have seen in the first part, the values of our variables are skewed. Therefore, to reduce the effect of the skewness, we take the log of the values and then find the correlation. 
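
To illustrate how a log transform tames positive skew, here is a small sketch using the moment-based skewness coefficient (third central moment divided by the second central moment raised to 1.5). The function name and the sample are hypothetical, and the sample must be strictly positive for the log to exist.

```python
import math

def skewness(values):
    """Moment-based skewness: positive for a long right tail, about 0 for a symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample: each value doubles the previous one
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(v) for v in raw]

print(skewness(raw), skewness(logged))
```

The raw values have a clearly positive skewness, while their logs are evenly spaced and the skewness drops to practically 0.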

***What if only one of the variables were skewed, and not both?***

We would take the log of the values of that variable only, not both.
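
As a hypothetical illustration of the one-variable case, suppose y grows exponentially with x. Taking the log of y alone is then enough to make the relationship linear. The data below is made up.

```python
import math

def pearson(x, y):
    """Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den if den > 0 else 0.0

# Hypothetical data: x is evenly spread, y = exp(x) is strongly right-skewed
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]
log_y = [math.log(v) for v in y]  # transform only the skewed variable

r_raw = pearson(x, y)
r_log = pearson(x, log_y)
print(r_raw, r_log)
```

After the transform the coefficient is essentially 1, whereas on the raw values it is noticeably lower.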

In [None]:
log_wei = [math.log(x) for x in wei]
log_ei = [math.log(y) for y in ei]

print('correlation between log_ei and log_wei using Pearson method: ', correlation(log_ei, log_wei))

Thus, the correlation remains strong, which is notable for such a small yet meaningful dataset.

***Why is that so?***

Both variables deal with the entrepreneurship index: one does so for the population in general, while the other deals with the data for one gender.