# Disclaimer

***In the series "Autumn of Matriarch", based on the "Women Entrepreneurship and Labor Force" dataset, I will present my work as a guide for fellow Kagglers to carry out an effective exploratory data analysis. My approach throughout the series is, as many may point out, "a statistical analysis". I hope the notebooks find the appropriate audience.***

# Importing the data and libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import math
from collections import Counter
from collections import defaultdict
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')

In [None]:
df = pd.read_csv('/kaggle/input/women-entrepreneurship-and-labor-force/Dataset3.csv')

# 1. Data Frame

I need not explain what a dataframe is. Instead, I'll import the data and perform two operations on the dataframe:

1. Cleaning and
2. Transformation

***How to clean this specific set of data?***

In [None]:
df.head()

***Is the data clean?***

No, because the column names and their values are not clearly distinguishable.

***How do we clean that?***

1. In this specific dataset, a minor change while importing will do the job.
2. It looks like the fields are ***separated not by commas (,) but by semicolons (;)***.
3. To handle that, we can use the ***delimiter parameter*** of the ***pandas.read_csv function***.
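
The effect of the separator can be sketched on a small made-up string standing in for the real file. The column names and values below are invented for illustration; `sep` is the parameter that `delimiter` is an alias for.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a semicolon-separated file.
raw = "No;Country;Inflation rate\n1;Austria;0.9\n2;Belgium;0.6\n"

# With the default comma separator, each row is parsed as one single column.
wrong = pd.read_csv(io.StringIO(raw))

# Passing the separator explicitly splits the fields as intended.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)  # a single column
print(right.shape)  # three columns
```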

In [None]:
df = pd.read_csv('/kaggle/input/women-entrepreneurship-and-labor-force/Dataset3.csv', delimiter = ';')

In [None]:
df.head()

The dataframe now looks good

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

# 1.1. Transformation

***What transformations does this data require?***

I'll rename the columns to shorter, lowercase names.

In [None]:
df.rename(columns = {'Level of development':'lod',
                    'European Union Membership':'eum',
                    'Currency':'currency',
                    'Women Entrepreneurship Index':'wei',
                    'Entrepreneurship Index':'ei',
                    'Inflation rate':'ir',
                    'Female Labor Force Participation Rate':'flfp',
                    'Country':'country',
                    'No':'no'},
         inplace = True)

In [None]:
df.head()

The dataframe now looks like this.

In [None]:
df.country.value_counts()

In [None]:
df.eum.value_counts()

In [None]:
df.lod.value_counts()

In [None]:
df.currency.value_counts()

# 2. Distributions

***What is a distribution?***

1. It is the share of the data that each group in the dataframe holds.
2. Put differently, it describes how often each value of a variable/column occurs in the dataframe.

***How do we report it?***

I'll use 3 methods:

1. Histograms
2. Probability mass function
3. Cumulative distribution function

# 2.1. Histograms

***Why Histograms?***

1. One of the best ways to describe a variable.
2. It reports the number of times each value of a variable appears in the dataset.

I'll plot the histograms in pairs to see the relationship between variables. That is to say, I'll check if the same values have same or exact opposite distributions in the 2 variables.
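
One detail worth keeping in mind when overlaying two histograms: by default `plt.hist` chooses 10 bins from each variable's own range, so the two sets of bars may not share bin edges. A minimal sketch of computing one common set of edges follows; the numbers are made up and stand in for two columns of the dataframe.

```python
import numpy as np

# Made-up index values standing in for two columns of the dataframe.
a = np.array([35.0, 42.5, 51.0, 63.2, 70.1])
b = np.array([28.4, 40.0, 55.5, 61.0, 77.3])

# One set of bin edges spanning both samples.
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=5)

# Counting both samples against the same edges makes the bars comparable.
counts_a, _ = np.histogram(a, bins=edges)
counts_b, _ = np.histogram(b, bins=edges)

print(edges)
print(counts_a, counts_b)
```

The same `edges` array can be passed as `bins=edges` to both `plt.hist` calls.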

In [None]:
plt.hist(df.wei, label = 'wei', alpha = 0.5)
plt.hist(df.ei, label = 'ei', alpha = 0.8)
plt.legend()

In [None]:
plt.hist(df.flfp, label = 'flfp', alpha = 0.5)
plt.hist(df.ei, label = 'ei', alpha = 0.8)
plt.legend()

In [None]:
developed = df[df.lod == 'Developed']
developing = df[df.lod == 'Developing']

In [None]:
plt.hist(developed.flfp, label = 'flfp', alpha = 0.5)
plt.hist(developed.ei, label = 'ei', alpha = 0.8)
plt.legend()

In [None]:
plt.hist(developing.flfp, label = 'flfp', alpha = 0.5)
plt.hist(developing.ei, label = 'ei', alpha = 0.8)
plt.legend()

***What is the conclusion?***

In all of the histograms above, I see no apparent relation between the variables. Nevertheless, we'll compute the correlation later in the series as well.

# 2.2. Outliers and skewness

***What are Outliers?***

1. The values in the data that are either too large or too small.
2. The outliers are values that are far off from the mean and median of the data.
3. They can directly affect the mean, because it takes into account the sum of all values in the data, but they barely affect the median.
4. Recognizing the outliers is very important because, in their presence, the mean of the data may be very misleading.

***How to recognize the outliers?***

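1. Recognizing the outliers here is premised on the general notion that most of the values of a distribution lie between (mean - standard dev) and (mean + standard dev); for a normal distribution that is roughly 68% of the values.
2. Therefore, in this notebook, the values below (mean - standard dev) and above (mean + standard dev) are all treated as outliers. This is a deliberately loose rule; stricter conventions use two or three standard deviations.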

An example is shown below:

In [None]:
print('Big outliers:')
for i in df.ei:
    if i > np.mean(df.ei) + np.std(df.ei):
        print(i)

In [None]:
print('Small outliers:')
for i in df.ei:
    if i < np.mean(df.ei) - np.std(df.ei):
        print(i)

***What is the range of outliers in the variables; 'wei', 'ei', and 'flfp'?***

In [None]:
print('Outliers in the column wei are the values below {} and above {}'.format(np.mean(df.wei) - np.std(df.wei), np.mean(df.wei) + np.std(df.wei)))
print('Outliers in the column ei are the values below {} and above {}'.format(np.mean(df.ei) - np.std(df.ei), np.mean(df.ei) + np.std(df.ei)))
print('Outliers in the column flfp are the values below {} and above {}'.format(np.mean(df.flfp) - np.std(df.flfp), np.mean(df.flfp) + np.std(df.flfp)))
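
The bounds printed above can also be computed once and reused as a boolean mask, which avoids copying long decimals by hand. This is only a sketch on made-up data: the helper `outlier_bounds` and its parameter `k` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def outlier_bounds(values, k=1.0):
    """Return (lower, upper) = mean -/+ k population standard deviations."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()  # ndarray.std() uses ddof=0
    return mean - k * std, mean + k * std

# Made-up data; 'country' and 'ei' mirror the renamed columns.
toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D', 'E'],
                    'ei': [30.0, 45.0, 50.0, 55.0, 80.0]})

low, high = outlier_bounds(toy.ei)
mask = (toy.ei < low) | (toy.ei > high)

# Countries whose ei falls outside the one-standard-deviation band.
print(toy.loc[mask, 'country'].tolist())
```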

***What is skewness?***

1. It is the measure of the asymmetry in our distribution.
2. It can be used to detect outliers.
3. Positive skewness means that the tail extends more to the right.
4. Negative skewness means that the tail extends more to the left.

***How to compute skewness?***

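I'll employ ***"Pearson's median skewness coefficient"*** to compute the strength of the skewness, because:

1. It's more robust, that is, less sensitive to outliers than the moment-based measure.
2. It's based on the difference between the sample mean and the median, scaled by the standard deviation.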
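
Written out, the coefficient is 3 * (mean - median) / standard deviation. A small sketch on two made-up samples shows how the sign follows the direction of the tail; the function name `pearson_median_skewness` is hypothetical, and the population standard deviation (ddof=0) is an assumption.

```python
import numpy as np

def pearson_median_skewness(values):
    """3 * (mean - median) / standard deviation (population form, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return 3 * (values.mean() - np.median(values)) / values.std()

right_tailed = [1, 2, 3, 4, 20]      # long tail to the right
left_tailed = [-20, -4, -3, -2, -1]  # mirror image, long tail to the left

# The first value is positive, the second is its negative.
print(pearson_median_skewness(right_tailed))
print(pearson_median_skewness(left_tailed))
```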

***What are the other methods?***

The other method computes skewness from moments. I will not use it in this notebook; however, one can check my notebook made strictly for that purpose: [Computing the magnitude of skewness in Maths score](https://www.kaggle.com/ritikpnayak/computing-the-magnitude-of-skewness-in-maths-score).

In [None]:
def PearsonMedianCoeff(sample, xbar, median):
    # Pearson's median skewness: 3 * (mean - median) / standard deviation
    gp = 3 * (xbar - median) / np.std(sample)
    return gp

In [None]:
print('Skewness in wei: ', PearsonMedianCoeff(df.wei, np.mean(df.wei), np.median(df.wei)))
print('Skewness in ei: ', PearsonMedianCoeff(df.ei, np.mean(df.ei), np.median(df.ei)))
print('Skewness in flfp: ', PearsonMedianCoeff(df.flfp, np.mean(df.flfp), np.median(df.flfp)))

***Which countries account for the outliers in the dataset?***

In [None]:
df.loc[(df['wei'] <= 33.7073934294814) | (df['wei'] >= 61.96319480581269)]['country']

In [None]:
df.loc[(df['ei'] <= 31.207569895884333) | (df['ei'] >= 63.2747830452921)]['country']

In [None]:
df.loc[(df['flfp'] <= 44.753797301977436) | (df['flfp'] >= 72.20973210978727)]['country']

***Are the countries with the highest wei members of the EU?***

In [None]:
df.loc[df['wei'] >= 61.96319480581269]['eum']

***Are the countries with the highest wei developed?***

In [None]:
df.loc[df['wei'] >= 61.96319480581269]['lod']

***While it is evident that most of the countries with the highest wei are members of the EU, it is striking that all of these outperforming countries are developed.***

***The same analysis can be done for the flfp rate.***

In [None]:
df.loc[df['flfp'] >= 72.20973210978727]['eum']

In [None]:
df.loc[df['flfp'] >= 72.20973210978727]['lod']

***EU members are not at the forefront of the flfp rates; however, the developed countries outperform their developing counterparts, again.***

# 2.3. Effect Size

***What is effect size?***

1. As the name suggests, it is the measure of the size of an effect.
2. In our data, the effect size is not especially helpful.
3. However, we can check whether there is an appreciable difference between the typical values (means) of wei, ei and flfp, taken two at a time.

For this purpose, we'll use "Cohen's d", which is defined as:

d = [mean(x1) - mean(x2)] / s

where s is the pooled standard deviation of the two groups.
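
As a quick arithmetic check of the formula, here is a worked sketch on two tiny made-up groups. It takes s to be the pooled standard deviation built from population variances weighted by group size, which is an assumption about how s is defined.

```python
import math
import numpy as np

# Two tiny made-up groups.
x1 = np.array([2.0, 4.0, 6.0])
x2 = np.array([1.0, 3.0, 5.0])

diff = x1.mean() - x2.mean()  # 4 - 3 = 1

# Each group has population variance 8/3, so the pooled variance is 8/3 too.
pooled_var = (len(x1) * x1.var() + len(x2) * x2.var()) / (len(x1) + len(x2))

d = diff / math.sqrt(pooled_var)
print(round(d, 3))
```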

In [None]:
def CohenEffectSize(group1, group2):
    
    diff = np.mean(group1) - np.mean(group2)
    
    var1 = np.var(group1)
    var2 = np.var(group2)
    n1, n2 = len(group1), len(group2)
    
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    cohens_d = diff / math.sqrt(pooled_var)
    
    return cohens_d

In [None]:
print('Effect size between wei and ei: ', CohenEffectSize(df.wei, df.ei))
print('Effect size between flfp and ei: ', CohenEffectSize(df.flfp, df.ei))
print('Effect size between flfp and wei: ', CohenEffectSize(df.flfp, df.wei))

***There is only a small difference between the means of wei and ei, but the differences between flfp and ei, and between flfp and wei, are large.***

# 2.4. Probability mass function

***What is Probability mass function (pmf)?***

1. Another way to represent a distribution.
2. It is the normalized frequency.
3. Put differently, it is the frequency of a value (the number of times it occurs in a variable/column) divided by n, the total number of values.
4. Dividing each frequency by n turns it into a probability. Because n is the sum of all frequencies, this division is called "normalization".
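
The normalization described above can be sketched in one line with pandas. The values below are made up and stand in for a binned column.

```python
import pandas as pd

# Made-up binned values standing in for a column of range labels.
bins = pd.Series([4, 5, 5, 6, 6, 6, 7, 7])

# value_counts(normalize=True) divides each frequency by n = len(bins).
pmf = bins.value_counts(normalize=True).sort_index()

print(pmf)
```

The resulting probabilities always sum to 1, which is a handy sanity check.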

***For this purpose, I'll convert the values of flfp, wei and ei into ranges. I'll make 3 variables/columns that will tell us the range to which the values of flfp, wei and ei belong.***

In [None]:
df['rng_flfp'] = df.flfp.apply(lambda x: 1 if x<=10
                            else 2 if x<=20
                            else 3 if x<=30
                            else 4 if x<=40
                            else 5 if x<=50
                            else 6 if x<=60
                            else 7 if x<=70
                            else 8 if x<=80
                            else 9)

df['rng_wei'] = df.wei.apply(lambda x: 1 if x<=10
                            else 2 if x<=20
                            else 3 if x<=30
                            else 4 if x<=40
                            else 5 if x<=50
                            else 6 if x<=60
                            else 7 if x<=70
                            else 8 if x<=80
                            else 9)

df['rng_ei'] = df.ei.apply(lambda x: 1 if x<=10
                            else 2 if x<=20
                            else 3 if x<=30
                            else 4 if x<=40
                            else 5 if x<=50
                            else 6 if x<=60
                            else 7 if x<=70
                            else 8 if x<=80
                            else 9)
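
The same binning can be written once and reused for all three columns. The sketch below uses `pd.cut` with right-closed intervals, which should reproduce the `<=` chain above; `to_range` is a hypothetical helper name and the sample values are made up.

```python
import numpy as np
import pandas as pd

def to_range(values):
    """Map values to bins 1..9: (-inf, 10], (10, 20], ..., (70, 80], (80, inf)."""
    edges = [-np.inf, 10, 20, 30, 40, 50, 60, 70, 80, np.inf]
    labels = list(range(1, 10))
    # right=True (the default) makes each interval closed on the right.
    return pd.cut(values, bins=edges, labels=labels).astype(int)

sample = pd.Series([5.0, 10.0, 10.1, 44.7, 80.0, 93.2])
print(to_range(sample).tolist())
```

With this helper, a column such as `df['rng_wei']` could be produced by `to_range(df.wei)`.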

***PMF for what purpose?***

1. To see if there is any difference in the range of wei values between the developed and developing countries.
2. That is, if a given range of wei occurs more often in one group than in the other, by how many percentage points does that group lead?
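
The comparison described above can be sketched as the difference between two PMFs, expressed in percentage points. The two groups below are made up, and `reindex` fills in a probability of 0 for any range that a group never reaches.

```python
import pandas as pd

# Made-up binned values for two hypothetical groups.
group_a = pd.Series([6, 6, 7, 7, 7, 8])
group_b = pd.Series([4, 4, 5, 5, 6, 6])

all_bins = range(1, 10)
pmf_a = group_a.value_counts(normalize=True).reindex(all_bins, fill_value=0)
pmf_b = group_b.value_counts(normalize=True).reindex(all_bins, fill_value=0)

# Positive entries: the bin is more common in group_a; negative: in group_b.
diff_pct = 100 * (pmf_a - pmf_b)
print(diff_pct.round(1))
```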

The idea will be more evident from the code. ***Oftentimes, it is the code that speaks volumes.***

In [None]:
developed = df[df.lod == 'Developed']
developing = df[df.lod == 'Developing']

In [None]:
def defaultval():
    return 0

# Counter maps each range bin to its frequency; dividing by n gives the PMF
d1 = defaultdict(defaultval)
for key, value in Counter(developed.rng_wei).items():
    d1[key] = value / len(developed.rng_wei)
    
d2 = defaultdict(defaultval)
for key, value in Counter(developing.rng_wei).items():
    d2[key] = value / len(developing.rng_wei)

diffs = []
for i in range(1, 10):
    diff = d1[i] - d2[i]
    diffs.append(100 * diff)
    
plt.bar(range(1, 10), diffs)

In [None]:
# Counter maps each range bin to its frequency; dividing by n gives the PMF
d1 = defaultdict(defaultval)
for key, value in Counter(developed.rng_ei).items():
    d1[key] = value / len(developed.rng_ei)
    
d2 = defaultdict(defaultval)
for key, value in Counter(developing.rng_ei).items():
    d2[key] = value / len(developing.rng_ei)

diffs = []
for i in range(1, 10):
    diff = d1[i] - d2[i]
    diffs.append(100 * diff)
    
plt.bar(range(1, 10), diffs)

In [None]:
# Counter maps each range bin to its frequency; dividing by n gives the PMF
d1 = defaultdict(defaultval)
for key, value in Counter(developed.rng_flfp).items():
    d1[key] = value / len(developed.rng_flfp)
    
d2 = defaultdict(defaultval)
for key, value in Counter(developing.rng_flfp).items():
    d2[key] = value / len(developing.rng_flfp)

diffs = []
for i in range(1, 10):
    diff = d1[i] - d2[i]
    diffs.append(100 * diff)
    
plt.bar(range(1, 10), diffs)

***The graphs manifest some apparent conclusions that I leave for the readers to draw.***

# 2.5. Cumulative distribution function

***What is Cumulative distribution function (cdf)?***

I'll not dwell on the definition, because I have used it in most of my notebooks. To know more about it, please refer to my notebook [Introduction: Analytic distribution w/ Volkswagen](https://www.kaggle.com/ritikpnayak/introduction-analytic-distribution-w-volkswagen#4.-Brief-introduction-to-CDF).

The following is an application of CDFs:

In [None]:
def EvalCdf(sample, x):
    count = 0
    for i in sample:
        if i <= x:
            count += 1
    prob = count / len(sample)
    return prob
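
EvalCdf walks through the whole sample for every x, so evaluating it at each sample point takes quadratic time. For larger samples, a vectorized sketch with `np.searchsorted` gives the same probabilities; `eval_cdf_sorted` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def eval_cdf_sorted(sample):
    """Return (sorted values, fraction of the sample <= each sorted value)."""
    xs = np.sort(np.asarray(sample, dtype=float))
    # side='right' counts every element <= x, so ties share the same CDF value.
    probs = np.searchsorted(xs, xs, side='right') / len(xs)
    return xs, probs

xs, probs = eval_cdf_sorted([3, 1, 2, 2])
print(xs, probs)
```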

In [None]:
plt.figure(figsize = (15, 8))

c1 = [EvalCdf(sorted(developed.ei), x) for x in sorted(developed.ei)]
c2 = [EvalCdf(sorted(developing.ei), x) for x in sorted(developing.ei)]

plt.plot(sorted(developed.ei), c1, label = 'CDF of ei of developed countries')
plt.plot(sorted(developing.ei), c2, label = 'CDF of ei of developing countries')

plt.legend()

In [None]:
plt.figure(figsize = (15, 8))

c1 = [EvalCdf(sorted(developed.wei), x) for x in sorted(developed.wei)]
c2 = [EvalCdf(sorted(developing.wei), x) for x in sorted(developing.wei)]

plt.plot(sorted(developed.wei), c1, label = 'CDF of wei of developed countries')
plt.plot(sorted(developing.wei), c2, label = 'CDF of wei of developing countries')

plt.legend()

In [None]:
plt.figure(figsize = (15, 8))

c1 = [EvalCdf(sorted(developed.flfp), x) for x in sorted(developed.flfp)]
c2 = [EvalCdf(sorted(developing.flfp), x) for x in sorted(developing.flfp)]

plt.plot(sorted(developed.flfp), c1, label = 'CDF of flfp of developed countries')
plt.plot(sorted(developing.flfp), c2, label = 'CDF of flfp of developing countries')

plt.legend()

***What is the conclusion?***

1. ***Graph 1:*** 80% of the ei values in developing countries are less than 40, whereas almost 45% of the ei values in developed countries are more than 60.
2. ***Graph 2:*** 80% of the wei values in developing countries are less than 40, whereas almost 55% of the wei values in developed countries are more than 60.
3. ***Graph 3:*** 80% of the flfp values in developing countries are less than 70, whereas almost 80% of the flfp values in developed countries are more than 60.

From all of the graphs, it is evident that the developing countries trail behind the developed countries on more than one index. Nevertheless, drawing that kind of inference solely from this analysis is neither possible nor ethical. A lot more can be done, as we'll see in the notebooks that follow.

# Epilogue:

Dear reader,

After spending a lot of time teaching data science and making notebooks on Kaggle, I have realized that nothing is better than setting about doing what one really wants to do. Whenever I wanted to do something, I used to overthink it at first, and all of that would culminate in the wilting of my very conviction. Now, I have started to apply the concepts that I have learnt by making notebooks, many and not just one, on Kaggle. This gives not only confidence, but the eternal joy of godly grace.

In that sense, I have introduced you to yet another "application notebook" covering some of the basic concepts of statistical analysis that I want everyone to know and learn about. It is thorough, but it is not all. Some more notebooks will be required to complete the analysis and to answer certain questions. As they say, many a little makes a mickle; I trust this little notebook is complete in its own way.

Yours, making more notebooks,

Ritik Prakash Nayak