# Purpose:

1. To learn about percentiles,

2. The interquartile range, and

3. Analytic distribution.

# Previously in the series:

[Autumn of Matriarch: The complete guide to EDA - 1](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-1)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import math
import scipy
from scipy import stats
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')

In [None]:
df = pd.read_csv('/kaggle/input/women-entrepreneurship-and-labor-force/Dataset3.csv', delimiter = ';')

In [None]:
df.tail()

# Percentiles

***What is a percentile?***

1. A percentile rank is the percentage of values in a dataset that are less than or equal to a given value.
2. For example, if a person scored in the 75th percentile in an exam, their score is greater than or equal to the scores of 75% of the test takers.
3. Equivalently, 75% of the test takers scored equal to or less than that person.

In [None]:
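## Percentile rank from score

def PercentileRank(scores, your_score):
    # Count how many scores are less than or equal to your_score.
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1

    # Express the count as a percentage of all scores.
    percentile_rank = 100 * count / len(scores)
    return percentile_rank

## Score from percentile rank

def Percentile(scores, percentile_rank):
    # NOTE: this sorts `scores` in place, so the caller's array stays sorted afterwards.
    scores.sort()
    # Map the rank (0-100) to a position in the sorted scores; int() keeps the
    # index valid even if percentile_rank is passed as a float.
    index = int(percentile_rank * (len(scores) - 1) // 100)
    return scores[index]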
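
As a quick sanity check of the two functions above, here is a minimal, self-contained sketch on a small made-up list of scores. The snake_case names `percentile_rank` and `percentile` are illustrative restatements of `PercentileRank` and `Percentile`, and the scores are invented for the example. Unlike the original, this `percentile` sorts a copy rather than the input.

```python
def percentile_rank(scores, your_score):
    """Percentage of scores that are less than or equal to your_score."""
    count = sum(1 for score in scores if score <= your_score)
    return 100 * count / len(scores)


def percentile(scores, rank):
    """Score found at the given percentile rank (0-100)."""
    ordered = sorted(scores)  # sort a copy so the caller's data is untouched
    index = int(rank * (len(ordered) - 1) // 100)
    return ordered[index]


# Made-up exam scores, purely for illustration.
scores = [55, 66, 77, 88, 99]

rank_of_88 = percentile_rank(scores, 88)   # 4 of 5 scores are <= 88
median_score = percentile(scores, 50)      # middle element of the sorted list

print(rank_of_88)    # 80.0
print(median_score)  # 77
```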

In [None]:
ei = df['Entrepreneurship Index'].values
wei = df['Women Entrepreneurship Index'].values
flfp = df['Female Labor Force Participation Rate'].values

# Interquartile range (IQR):

***What is IQR?***

1. A summary statistic.
2. It is used to describe the spread of the data.
3. It is the difference between the 75th and the 25th percentiles of the values of a variable (column) in the data.

In [None]:
print('IQR of ei: ', Percentile(ei, 75) - Percentile(ei, 25))
print('IQR of wei: ', Percentile(wei, 75) - Percentile(wei, 25))
print('IQR of flfp: ', Percentile(flfp, 75) - Percentile(flfp, 25))

***The larger the IQR, the more spread out the data. For example, the values of ei are more spread out than the values of flfp.***
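
The same quantity can be cross-checked with library routines. The sketch below assumes NumPy and SciPy are available and uses a small made-up array. Note that `np.percentile` interpolates linearly between data points by default, so on real data its result can differ slightly from the index-based `Percentile` function above.

```python
import numpy as np
from scipy import stats

# Made-up values, purely for illustration.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th and 75th percentiles (linear interpolation by default).
q1, q3 = np.percentile(values, [25, 75])
iqr_numpy = q3 - q1

# SciPy offers the same summary statistic directly.
iqr_scipy = stats.iqr(values)

print(q1, q3)      # 3.0 7.0
print(iqr_numpy)   # 4.0
print(iqr_scipy)   # 4.0
```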

# Modelling distributions

# Which model does my data fit?

In this part of the analysis, I won't dwell on what distributions are or, more specifically, what analytic distributions are, because I have written about the topic at length in some of my previous notebooks.

Nevertheless, it is important to know the concept and to have an idea of the kind of analysis one is going to do. I therefore suggest checking one of my previous notebooks: [Introduction: Analytic distribution w/ Volkswagen](https://www.kaggle.com/ritikpnayak/introduction-analytic-distribution-w-volkswagen)

In [None]:
def EvalCdf(sample, x):
    total = 0
    for i in sample:
        if i <= x:
            total += 1
    prob = total / len(sample)
    return prob
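
`EvalCdf` scans the whole sample for every query, which is fine for small data. For larger samples, a vectorized variant along the following lines may be preferable. This is a sketch rather than part of the original analysis; `eval_cdf_vectorized` is a hypothetical name and the sample is made up.

```python
import numpy as np


def eval_cdf_vectorized(sample, xs):
    """Fraction of sample values <= each x in xs (empirical CDF)."""
    ordered = np.sort(sample)
    # side='right' counts values that are <= x, matching the `<=` in EvalCdf.
    counts = np.searchsorted(ordered, xs, side='right')
    return counts / len(ordered)


sample = np.array([3, 1, 2, 2])          # made-up data
probs = eval_cdf_vectorized(sample, [0, 2, 3])

print(probs.tolist())  # [0.0, 0.75, 1.0]
```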

In [None]:
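# Sort first so the CDF is drawn from left to right as a single monotonic curve.
wei = np.sort(wei)
cdf = [EvalCdf(wei, x) for x in wei]

plt.plot(wei, cdf)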

***The CDF is nearly a straight line, which suggests that the uniform distribution is a fitting analytic model for the data. However, my purpose here is to check whether the data might, however unlikely, follow a different kind of distribution model.***

Moreover, exploring different analytic distribution models is a useful way to learn about them.
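
The "nearly a straight line" observation can also be checked numerically by comparing the empirical CDF with the CDF of a uniform distribution on the same range. The sketch below uses synthetic data rather than the notebook's dataset; the sample size, seed and range are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.sort(rng.uniform(20, 80, size=1000))   # synthetic stand-in data

n = len(sample)
ecdf = np.arange(1, n + 1) / n                      # empirical CDF at each sorted value

# CDF of a uniform distribution spanning the observed range.
low, high = sample[0], sample[-1]
uniform_cdf = (sample - low) / (high - low)

# Largest vertical gap between the two curves; small values suggest a good fit.
max_gap = np.max(np.abs(ecdf - uniform_cdf))
print(round(float(max_gap), 3))
```

A gap close to zero is what a truly uniform sample would be expected to produce; how small is "small enough" depends on the sample size.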

I will focus on the following models:

***The exponential distribution, the normal distribution, the log-normal distribution and the Pareto distribution.***

These are the models I know of. They are not the only analytic distributions, but they cover many of the cases commonly met in practice.

***1. Exponential distribution?***

***How to check if our data belongs to the exponential distribution?***

By plotting the complementary CDF, "*ccdf(x) = 1 - cdf(x)*", with a logarithmic y-axis. If the data follow an exponential distribution, the result is approximately a straight line.

In [None]:
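# Take the log of the ccdf (1 - cdf); the last point is dropped because
# cdf == 1 there and log(0) is undefined.
log_ccdf = [math.log(1-y) for y in cdf[:-1]]

plt.plot(wei[:-1], log_ccdf)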

***What does the plot tell us?***

Most of the plot (more than 80%) is a straight line, but the tail is not, which suggests that the exponential distribution is not a perfect model for this data.
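
For comparison, here is a sketch of the same check on data that really are exponential. The data are synthetic, and the rate `lam`, seed and sample size are assumptions chosen only for illustration. For such a sample, the log of the ccdf should fall close to a line whose slope is roughly minus the rate.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                           # assumed rate parameter
sample = np.sort(rng.exponential(scale=1 / lam, size=5000))

n = len(sample)
cdf = np.arange(1, n + 1) / n
# Drop the last point, where the ccdf is 0 and its log is undefined.
log_ccdf = np.log(1 - cdf[:-1])

# Fit a straight line to log(ccdf) versus x.
slope, intercept = np.polyfit(sample[:-1], log_ccdf, 1)
print(round(float(slope), 2))   # expected to be close to -lam for exponential data
```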

***2. Normal distribution?***

We'll use a ***normal probability plot*** to check whether the normal distribution is a fitting model for our distribution or not.

In [None]:
scipy.stats.probplot(wei, plot = plt)

***What does the plot tell us?***

The result is not a straight line, which indicates that the normal distribution is not the best option for this data either.
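
How straight the normal probability plot is can also be summarised with a single number. When called without `plot=`, `scipy.stats.probplot` returns the correlation coefficient `r` of the least-squares line through the points. The sketch below compares a synthetic normal sample with a synthetic uniform one; both samples and their parameters are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=1000)    # synthetic
uniform_sample = rng.uniform(20, 80, size=1000)            # synthetic

# Without `plot=`, probplot returns ((theoretical quantiles, ordered values),
# (slope, intercept, r)) of the least-squares line through the points.
_, (_, _, r_normal) = stats.probplot(normal_sample)
_, (_, _, r_uniform) = stats.probplot(uniform_sample)

print(round(float(r_normal), 3), round(float(r_uniform), 3))
```

An `r` very close to 1 is what a normal sample would be expected to give; a visibly lower value hints at curvature in the plot.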

***3. Log-normal distribution***

To check whether the log-normal distribution fits our data, we first take the logarithm of the ordered values of the data (i.e. the values in wei).

Then we draw a normal probability plot of the logged values and see whether the output is a straight line.

In [None]:
log_ordered_values = [math.log(x) for x in stats.probplot(wei)[0][1]]

In [None]:
scipy.stats.probplot(log_ordered_values, plot = plt)

***What does the plot tell us?***

As expected, the values do not fit the log-normal distribution.
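
To illustrate what a positive result would look like, the following sketch applies the same idea to synthetic log-normal data: the probability plot of the raw values bends, while the plot of the logged values is close to a straight line. The parameters, seed and sample size are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # synthetic data

# r is the correlation coefficient of the line fitted to the probability plot.
_, (_, _, r_raw) = stats.probplot(sample)            # raw values
_, (_, _, r_log) = stats.probplot(np.log(sample))    # logged values

print(round(float(r_raw), 3), round(float(r_log), 3))
```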

***4. Pareto distribution?***

We'll plot the ***ccdf on a log-log scale*** to check whether the Pareto distribution fits the data. The ccdf values are the same as before; the only differences are that the x values are logged as well and that we use 10 as the base of the logarithm.

In [None]:
log_x = [math.log(x, 10) for x in wei[:-1]]
ccdf = [math.log(1-y, 10) for y in cdf[:-1]]

plt.plot(log_x, ccdf)

***What does the plot tell us?***

If the data followed a Pareto distribution, the ccdf would fall on a straight line. The plot therefore shows that the Pareto distribution doesn't fit the data either.
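
As a point of reference, here is a sketch of the log-log check on synthetic Pareto data, where the points should lie near a line with slope roughly equal to minus the shape parameter. The shape `alpha`, seed and sample size are assumptions, and NumPy's `Generator.pareto` is shifted by 1 to obtain a classical Pareto distribution with minimum value 1.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 3.0                                          # assumed shape parameter
# Generator.pareto draws from a Lomax distribution; adding 1 gives a
# classical Pareto distribution with minimum value 1.
sample = np.sort(rng.pareto(alpha, size=5000) + 1)

n = len(sample)
cdf = np.arange(1, n + 1) / n
log_x = np.log10(sample[:-1])
log_ccdf = np.log10(1 - cdf[:-1])    # drop the last point, where ccdf == 0

# Fit a straight line to log10(ccdf) versus log10(x).
slope, intercept = np.polyfit(log_x, log_ccdf, 1)
print(round(float(slope), 2))   # expected to be close to -alpha for Pareto data
```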

# For a deeper understanding:

Some important questions remain open that I won't try to answer in this notebook. The prominent ones that may occur to many readers after going through it are:

1. Why is the ccdf of the exponential distribution necessarily a straight line on a log-y scale?
2. Why is the ccdf of the Pareto distribution also a straight line when plotted on a log-log scale?
3. What is a normal probability plot?

and possibly more...

The answers to these questions are simple. One can understand them by studying the formulae of the corresponding distributions and relating them to the analysis done in this notebook. For example, it would be wise to learn the formula of the exponential distribution first and then try to answer the first question. That way, the knowledge sticks, and it saves a little hard work on my part.