                                                 -- Under construction -- 

# Purpose:

***1. To determine which is the better estimator of the population mean: the sample mean or the sample median.***

***2. To characterize the uncertainty of the estimate; that is, to compute the sampling error and the confidence interval.***

# Previously in the series:

1. [Autumn of Matriarch: The complete guide to EDA - 1](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-1)

2. [Autumn of Matriarch: The complete guide to EDA - 2](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-2)

3. [Autumn of Matriarch: The complete guide to EDA - 3](https://www.kaggle.com/ritikpnayak/autumn-of-matriarch-the-complete-guide-to-eda-3)

# Prerequisite:

I have explained the concepts used in this notebook in more detail in an earlier notebook. Kindly refer to it first, as this notebook is less thorough than that one.

[Estimation, Confidence Interval and Standard Error](https://www.kaggle.com/ritikpnayak/estimation-confidence-interval-and-standard-error) - link to the notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import random
import math
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')

In [None]:
df = pd.read_csv('/kaggle/input/women-entrepreneurship-and-labor-force/Dataset3.csv', delimiter = ';')

In [None]:
df.head()

In [None]:
developed = df[df['Level of development'] == 'Developed']

In [None]:
ei = df['Entrepreneurship Index'].values
wei = df['Women Entrepreneurship Index'].values
flfp = df['Female Labor Force Participation Rate'].values

# Estimation:

***1. Estimator for the population mean (mu):***

In [None]:
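def Estimate(val, mu=0, median=0, n=7, m=1000):
    """Draw m samples of size n from val and compare the sample mean and
    the sample median as estimators of the population mean mu."""
    # Note: the `median` argument is not used; both estimators are scored against mu.
    
    means = []
    medians = []
    
    for _ in range(m):
        xs = random.sample(list(val), n)  # sample without replacement
        means.append(np.mean(xs))
        medians.append(np.median(xs))
        
    print('rmse of xbar is: ', RMSE(means, mu))
    print('rmse of median is: ', RMSE(medians, mu))
    
def RMSE(estimates, actual):
    """Root mean squared error of the estimates around the actual value."""
    e2 = [(estimate - actual) ** 2 for estimate in estimates]
    mse = np.mean(e2)
    return math.sqrt(mse)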

In [None]:
df.describe()

In [None]:
print('Estimation in wei: ')
Estimate(wei, mu=47.835294, median=44.500000)
print('\n')
print('Estimation in ei: ')
Estimate(ei, mu=47.241176, median=42.700000)
print('\n')
print('Estimation in flfp: ')
Estimate(flfp, mu=58.481765, median=61.000000)

***What is the conclusion?***

In all three variables, the RMSE of the sample mean is lower than that of the sample median; therefore, the sample mean (xbar) is the better estimator of the population mean here.
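
For reference, the quantity returned by `RMSE` above can be written as follows. The symbol $\hat{\theta}_j$ is notation assumed here for the estimate (sample mean or sample median) from the $j$-th of the $m$ simulated samples, and $\mu$ is the population mean passed in as `mu`:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left(\hat{\theta}_j - \mu\right)^2}
$$

A lower RMSE means the estimates land closer to $\mu$ on average.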

***2. Estimator for Variance (sigma squared):***

Why estimate the variance?

1. There are two formulae for the variance; one uses 'n' in the denominator.
2. The other uses 'n-1' in the denominator.
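3. When the variance is estimated from a small sample, the 'n' version tends to underestimate the population variance (it is biased), whereas the 'n-1' version is unbiased. The simulation below compares the mean error of the two.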
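
The two formulae listed above can be written as follows, where $x_1, \dots, x_n$ is the sample and $\bar{x}$ its mean. The subscripts on $S^2$ are notation assumed here only to tell the two versions apart:

$$
S_{n}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\qquad
S_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
$$

In NumPy, `np.var(xs)` computes the first (the default is `ddof=0`) and `np.var(xs, ddof=1)` computes the second.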

In [None]:
def EstimateVar(val, sigma=0, n=7, m=1000):
    
    estimates1 = []
    estimates2 = []
    
    for _ in range(m):
        xs = random.sample(list(val), n)
        biased = np.var(xs)            # ddof=0: divides by n
        unbiased = np.var(xs, ddof=1)  # ddof=1: divides by n-1
        estimates1.append(biased)
        estimates2.append(unbiased)
        
    print('mean error of biased: ', MeanError(estimates1, sigma ** 2))
    print('mean error of unbiased: ', MeanError(estimates2, sigma ** 2))
    
def MeanError(estimates, actual):
    errors = [estimate - actual for estimate in estimates]
    return np.mean(errors)

In [None]:
print('Estimation in wei: ')
EstimateVar(wei, sigma=14.268480)
print('\n')
print('Estimation in ei: ')
EstimateVar(ei, sigma=16.193149)
print('\n')
print('Estimation in flfp: ')
EstimateVar(flfp, sigma=13.864567)

# Sampling distributions:

In [None]:
def SimulateSample(val, n=9, m=1000):
    """Return the means of m samples of size n drawn from val without replacement."""
    means = []
    for _ in range(m):
        xs = random.sample(list(val), n)
        xbar = np.mean(xs)
        means.append(xbar)
        
    return means

In [None]:
x_bars_ei = SimulateSample(ei)
x_bars_wei = SimulateSample(wei)
x_bars_flfp = SimulateSample(flfp)

In [None]:
# The RMSE of the sample means around the population mean is the standard error
print('rmse of x_bars_ei: ', RMSE(x_bars_ei, np.mean(ei)))
print('rmse of x_bars_wei: ', RMSE(x_bars_wei, np.mean(wei)))
print('rmse of x_bars_flfp: ', RMSE(x_bars_flfp, np.mean(flfp)))
print('\n')
print('90% Confidence Interval of x_bars_ei: ', np.percentile(x_bars_ei, 5), np.percentile(x_bars_ei, 95))
print('90% Confidence Interval of x_bars_wei: ', np.percentile(x_bars_wei, 5), np.percentile(x_bars_wei, 95))
print('90% Confidence Interval of x_bars_flfp: ', np.percentile(x_bars_flfp, 5), np.percentile(x_bars_flfp, 95))
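
The standard error and the percentile-based interval computed above can also be wrapped in one helper. What follows is only a minimal sketch: the function name `summarize_sampling_distribution` and its `level` parameter are hypothetical, and the data is synthetic (generated with NumPy), not the notebook's dataset.

```python
import numpy as np

def summarize_sampling_distribution(sample_means, mu, level=90):
    """Return (standard error, (low, high)) for a list of simulated sample means."""
    sample_means = np.asarray(sample_means, dtype=float)
    # Standard error: RMSE of the simulated means around the assumed true mean.
    stderr = np.sqrt(np.mean((sample_means - mu) ** 2))
    # Central interval: cut (100 - level) / 2 percent from each tail.
    tail = (100 - level) / 2
    low, high = np.percentile(sample_means, [tail, 100 - tail])
    return stderr, (low, high)

# Synthetic stand-in for a column such as ei (NOT the real data).
rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=15, size=51)

# Same idea as SimulateSample: 1000 samples of size 9, without replacement.
means = [rng.choice(population, size=9, replace=False).mean() for _ in range(1000)]

stderr, (low, high) = summarize_sampling_distribution(means, population.mean())
print('standard error:', stderr)
print('90% confidence interval:', low, high)
```

With `level=90` the helper uses the 5th and 95th percentiles, matching the cell above.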

In [None]:
def EvalCdf(sample, x):
    count = 0
    
    for i in sample:
        if i <= x:
            count += 1
    prob = count / len(sample)
    return prob

In [None]:
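# Sort each list of sample means once, then evaluate the CDF at every sorted value
sorted_ei = sorted(x_bars_ei)
sorted_wei = sorted(x_bars_wei)
sorted_flfp = sorted(x_bars_flfp)
cdf_ei = [EvalCdf(sorted_ei, x) for x in sorted_ei]
cdf_wei = [EvalCdf(sorted_wei, x) for x in sorted_wei]
cdf_flfp = [EvalCdf(sorted_flfp, x) for x in sorted_flfp]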
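
The same empirical CDF can be computed in one vectorized step. This is an alternative sketch, not the method used in this notebook: the helper name `empirical_cdf` is hypothetical, and it relies on `np.searchsorted` with `side='right'` to count the values less than or equal to each point, as `EvalCdf` does.

```python
import numpy as np

def empirical_cdf(sample):
    """Return (sorted values, CDF values), where CDF(x) = fraction of sample <= x."""
    xs = np.sort(np.asarray(sample, dtype=float))
    # side='right' counts every element <= x, so tied values share one CDF value.
    cdf = np.searchsorted(xs, xs, side='right') / len(xs)
    return xs, cdf

# Tiny illustrative input (not the notebook's data).
xs, cdf = empirical_cdf([3.0, 1.0, 2.0, 2.0])
print(xs.tolist())   # [1.0, 2.0, 2.0, 3.0]
print(cdf.tolist())  # [0.25, 0.75, 0.75, 1.0]
```

For 1000 sample means the loop-based `EvalCdf` is fast enough; the vectorized form mainly helps when `m` is much larger.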

# Confidence Intervals:

In [None]:
plt.figure(figsize = (15, 8))

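plt.plot(sorted(x_bars_ei), cdf_ei, label = 'CDF of sample means')
plt.axvline(np.percentile(x_bars_ei, 5), 0, ls = '--', color = 'blue', label = '5th percentile')
plt.axvline(np.percentile(x_bars_ei, 95), 0, ls = '--', color = 'blue', label = '95th percentile')
plt.axvline(np.mean(x_bars_ei), 0, ls = ':', color = 'red', label = 'mean of sample means')
plt.axvline(np.mean(ei), 0, ls = ':', color = 'green', label = 'population mean')
plt.title('Sampling distribution of the mean: ei')
plt.legend()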

In [None]:
plt.figure(figsize = (15, 8))

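plt.plot(sorted(x_bars_wei), cdf_wei, label = 'CDF of sample means')
plt.axvline(np.percentile(x_bars_wei, 5), 0, ls = '--', color = 'blue', label = '5th percentile')
plt.axvline(np.percentile(x_bars_wei, 95), 0, ls = '--', color = 'blue', label = '95th percentile')
plt.axvline(np.mean(x_bars_wei), 0, ls = ':', color = 'red', label = 'mean of sample means')
plt.axvline(np.mean(wei), 0, ls = ':', color = 'green', label = 'population mean')
plt.title('Sampling distribution of the mean: wei')
plt.legend()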

In [None]:
plt.figure(figsize = (15, 8))

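plt.plot(sorted(x_bars_flfp), cdf_flfp, label = 'CDF of sample means')
plt.axvline(np.percentile(x_bars_flfp, 5), 0, ls = '--', color = 'blue', label = '5th percentile')
plt.axvline(np.percentile(x_bars_flfp, 95), 0, ls = '--', color = 'blue', label = '95th percentile')
plt.axvline(np.mean(x_bars_flfp), 0, ls = ':', color = 'red', label = 'mean of sample means')
plt.axvline(np.mean(flfp), 0, ls = ':', color = 'green', label = 'population mean')
plt.title('Sampling distribution of the mean: flfp')
plt.legend()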