# Previous notebooks on the same dataset:

1. [Elemental approach to finding correlation](https://www.kaggle.com/ritikpnayak/elemental-approach-to-finding-correlation)

2. [Computing the magnitude of skewness in Maths score](https://www.kaggle.com/ritikpnayak/computing-the-magnitude-of-skewness-in-maths-score)

# Previous notebooks on the same subject:

1. [Introduction to Hypothesis Testing and Estimation](https://www.kaggle.com/ritikpnayak/introduction-to-hypothesis-testing-and-estimation)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import random
import math
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

In [None]:
df = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
df.head()

In [None]:
ms = df['math score']
rs = df['reading score']
ws = df['writing score']

In [None]:
df.describe()

# 1. Estimation

***What is Estimation?***

Let's understand that with an example:

1. Take the variable/column "math score" as an example.
2. Two statistics could serve as an estimate of the population mean: the sample mean or the sample median.
3. We have to decide which one better estimates the population mean (mu): the sample mean (x_bar) or the median.
4. For that we'll run the experiment many times, each time drawing a sample of a fixed size from the data.
5. We'll collect the x_bar and median of each sample separately.
6. Finally, we'll compute the RMSE (Root Mean Square Error) for both x_bar and median. The one with the lower RMSE is the better estimator.

***How will we run the experiments many times? What is the experiment in our example?***

1. In general, an experiment is any repeatable procedure with a random outcome, such as throwing a die: each throw shows a number from 1 to 6.
2. In our example, one run of the experiment means drawing a new sample and noting down its sample mean and median.
3. For instance, our dataframe has 1000 rows. We could run the experiment 2000 times, each time drawing a sample of, say, 7 values from the given population (the data itself).
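A single run of the experiment described above might look like the following minimal sketch. The population here is synthetic (random integers standing in for scores), and the name `run_experiment` is hypothetical; it is not part of this notebook's code.

```python
import random
import statistics

def run_experiment(population, n=7, rng=random):
    """Draw one sample of size n without replacement; return (mean, median)."""
    sample = rng.sample(population, n)
    return statistics.mean(sample), statistics.median(sample)

# Synthetic stand-in for a column of 1000 scores (an assumption for illustration).
rng = random.Random(0)
population = [rng.randint(0, 100) for _ in range(1000)]

# Repeat the experiment a few times; each run gives its own mean and median.
results = [run_experiment(population, n=7, rng=rng) for _ in range(5)]
for x_bar, med in results:
    print(f"x_bar = {x_bar:.2f}, median = {med}")
```

Each repetition yields one (x_bar, median) pair; collecting many such pairs is what allows the two estimators to be compared.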

***What is the concept behind "estimation"?***

1. If we have a population of, say, 2000 values with mean 'mu', and we draw a random sample of, say, 1000 values from it, the sample mean 'x_bar' should be close to 'mu', though rarely exactly equal.
2. The sample median is another statistic that can be used to estimate 'mu'.
3. Our task is to find which of the two is the better estimator of the population mean (mu).
4. That is decided by comparing the RMSE of the two estimators.
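For reference, the RMSE of an estimator over m repeated samples is commonly written as below. Here $\hat{\theta}_j$ is notation introduced for illustration: it stands for the estimate (x_bar or the median) computed from the j-th sample.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(\hat{\theta}_j - \mu\right)^2}
$$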

***How to find the RMSE and what is the whole process?***

The process is easiest to understand through the code below.

In [None]:
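def RMSE(estimates, actual):
    """Root mean squared error of the estimates about the actual value."""
    e2 = [(estimate - actual) ** 2 for estimate in estimates]
    mse = np.mean(e2)
    return math.sqrt(mse)

def Estimate(series, mu=0, median=0, n=7, m=1000):
    """Draw m samples of size n and print the RMSE of the sample mean and
    the sample median as estimators of mu.

    The `median` argument is kept so existing calls still work; it is not used.
    """
    values = list(series)  # convert once instead of on every iteration
    means = []
    medians = []

    for _ in range(m):
        xs = random.sample(values, n)  # sample without replacement
        means.append(np.mean(xs))
        medians.append(np.median(xs))

    print('rmse of x_bar: ', RMSE(means, mu))
    print('rmse of median: ', RMSE(medians, mu))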

In [None]:
# Estimate() prints its own results and returns None, so call it directly
# rather than wrapping it in print().
print('Estimate of math score:')
Estimate(ms, mu=66.08900, median=66.00000)
print('\n')
print('Estimate of reading score:')
Estimate(rs, mu=69.169000, median=70.000000)
print('\n')
print('Estimate of writing score:')
Estimate(ws, mu=68.054000, median=69.000000)
print('\n')

**Looks like the sample mean is the better estimator of mu in all the subjects**

***Why?***

Because the RMSE of x_bar (the sample mean) is lower than that of the median in all three cases.

# 2. Sampling Distributions

***What is a Sampling Distribution?***

1. Suppose that you want to know the mean of the marks of all students in a city in a specific subject.
2. For that, you can take a record of all the students from the city and find the mean. But we generally don't do that.
3. It would be reasonable for you to take a sample of say, 400 children, find the sample mean and take that for an estimator of the population mean.
4. Let's say for instance that the sample mean comes out to be 66. You may report that the mean of the population is 66 marks.

***How confident would you be?***

1. You might, by chance, end up choosing the 400 highest-scoring or lowest-scoring children.
2. That would significantly overestimate or underestimate the population mean.
3. For that reason, we would like to **summarize our estimation** in 2 stats. One is the **"standard error"**. The other is the **"confidence interval"**.

***What is Standard Error?***

1. It measures how far we expect our estimate to be off, on average.
2. Here we use the RMSE of the sample means about mu as the standard error.
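As a rough cross-check on this idea, the simulated RMSE of the sample mean can be compared with the textbook standard error of the mean for sampling without replacement from a finite population of size N, namely sigma / sqrt(n) * sqrt((N - n) / (N - 1)). The sketch below uses a synthetic population (an assumption; it is not the students dataset), and the function names are hypothetical.

```python
import numpy as np

def simulated_standard_error(population, n=7, m=5000, seed=0):
    """RMSE of m sample means (size n, without replacement) about the population mean."""
    rng = np.random.default_rng(seed)
    mu = population.mean()
    x_bars = np.array([rng.choice(population, size=n, replace=False).mean()
                       for _ in range(m)])
    return np.sqrt(np.mean((x_bars - mu) ** 2))

def analytic_standard_error(population, n=7):
    """sigma / sqrt(n), multiplied by the finite population correction."""
    N = len(population)
    sigma = population.std()  # population standard deviation (ddof=0)
    return sigma / np.sqrt(n) * np.sqrt((N - n) / (N - 1))

# Synthetic population of 1000 values (illustrative only).
population = np.random.default_rng(1).normal(loc=66, scale=15, size=1000)

sim_se = simulated_standard_error(population)
ana_se = analytic_standard_error(population)
print(f"simulated: {sim_se:.3f}, analytic: {ana_se:.3f}")
```

The two numbers should agree closely, which is one way to sanity-check a simulation-based standard error.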

***What is Confidence Interval?***

1. A range that contains a given percentage of the sampling distribution of the estimate.
2. Here, the 90% confidence interval is the range between the 5th and 95th percentiles of the simulated sample means: if we ran the experiment 100 times, we would expect about 90 of the 100 sample means to fall in this range.
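The percentile-based interval described above can be sketched as a small helper. The function name `percentile_interval` and the toy input are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def percentile_interval(values, level=90):
    """Return the central interval that contains `level` percent of the values."""
    tail = (100 - level) / 2  # e.g. 5 for a 90% interval
    lower, upper = np.percentile(values, [tail, 100 - tail])
    return float(lower), float(upper)

# Toy input: the integers 0..100 stand in for simulated sample means.
toy_means = np.arange(101)
print(percentile_interval(toy_means, level=90))
```

For a 90% interval the helper cuts 5% from each tail, which matches taking the 5th and 95th percentiles.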

***How to find the standard error and the confidence interval, and what is the whole process?***

The process is easiest to understand through the code below.

In [None]:
def SimulateSample(series, mu=0, n=7, m=1000):
    """Return the means of m random samples of size n drawn from series.

    `mu` is unused here; it is kept so existing calls still work.
    """
    values = list(series)  # convert once instead of on every iteration
    x_bars = []

    for _ in range(m):
        xs = random.sample(values, n)  # sample without replacement
        x_bars.append(np.mean(xs))

    return x_bars

def EvalCdf(sample, x):
    """Empirical CDF: the fraction of values in sample that are <= x."""
    count = 0

    for i in sample:
        if i <= x:
            count += 1
    prob = count / len(sample)
    return prob

***What is a Cumulative Distribution Function or CDF?***

Please refer to my notebook [Introduction: Analytic distribution w/ Volkswagen](https://www.kaggle.com/ritikpnayak/introduction-analytic-distribution-w-volkswagen) to read about it.
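For larger samples, an empirical CDF can also be evaluated with a binary search on sorted data instead of a Python loop. The following is a minimal sketch of that alternative; the name `eval_cdf_sorted` is hypothetical, and it assumes the input has already been sorted in ascending order.

```python
import numpy as np

def eval_cdf_sorted(sorted_sample, x):
    """Fraction of values <= x, assuming sorted_sample is sorted ascending."""
    # side='right' returns the number of elements that are <= x.
    count = np.searchsorted(sorted_sample, x, side='right')
    return count / len(sorted_sample)

sample = np.sort(np.array([3.0, 1.0, 2.0, 2.0]))
print(eval_cdf_sorted(sample, 2.0))  # 3 of the 4 values are <= 2.0

# Passing an array evaluates the CDF at every point in one call.
print(eval_cdf_sorted(sample, sample))
```

This gives the same fractions as a loop-based count, at the cost of requiring sorted input.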

In [None]:
x_bars_ms = SimulateSample(ms, mu=66.08900)
x_bars_rs = SimulateSample(rs, mu=69.169000)
x_bars_ws = SimulateSample(ws, mu=68.054000)

In [None]:
print('rmse of x_bars_ms: ', RMSE(x_bars_ms, 66.08900))
print('rmse of x_bars_rs: ', RMSE(x_bars_rs, 69.169000))
print('rmse of x_bars_ws: ', RMSE(x_bars_ws, 68.054000))
print('\n')
print('90% Confidence Interval of x_bars_ms: ', np.percentile(x_bars_ms, 5), np.percentile(x_bars_ms, 95))
print('90% Confidence Interval of x_bars_rs: ', np.percentile(x_bars_rs, 5), np.percentile(x_bars_rs, 95))
print('90% Confidence Interval of x_bars_ws: ', np.percentile(x_bars_ws, 5), np.percentile(x_bars_ws, 95))

In [None]:
# EvalCdf does not need a sorted sample, so only the evaluation points are sorted.
cdf_ms = [EvalCdf(x_bars_ms, x) for x in sorted(x_bars_ms)]
cdf_rs = [EvalCdf(x_bars_rs, x) for x in sorted(x_bars_rs)]
cdf_ws = [EvalCdf(x_bars_ws, x) for x in sorted(x_bars_ws)]

In [None]:
plt.figure(figsize = (15, 8))

plt.plot(sorted(x_bars_ms), cdf_ms)
plt.axvline(np.percentile(x_bars_ms, 5), 0, ls = '--', color = 'blue')
plt.axvline(np.percentile(x_bars_ms, 95), 0, ls = '--', color = 'blue')
plt.axvline(np.mean(x_bars_ms), 0, ls = ':', color = 'red')
plt.axvline(np.mean(ms), 0, ls = ':', color = 'green')

In [None]:
plt.figure(figsize = (15, 8))

plt.plot(sorted(x_bars_rs), cdf_rs)
plt.axvline(np.percentile(x_bars_rs, 5), 0, ls = '--', color = 'blue')
plt.axvline(np.percentile(x_bars_rs, 95), 0, ls = '--', color = 'blue')
plt.axvline(np.mean(x_bars_rs), 0, ls = ':', color = 'red')
plt.axvline(np.mean(rs), 0, ls = ':', color = 'green')

In [None]:
plt.figure(figsize = (15, 8))

plt.plot(sorted(x_bars_ws), cdf_ws)
plt.axvline(np.percentile(x_bars_ws, 5), 0, ls = '--', color = 'blue')
plt.axvline(np.percentile(x_bars_ws, 95), 0, ls = '--', color = 'blue')
plt.axvline(np.mean(x_bars_ws), 0, ls = ':', color = 'red')
plt.axvline(np.mean(ws), 0, ls = ':', color = 'green')

***What do the graphs show?***

1. The 90% confidence intervals: the blue dashed lines mark the 5th and 95th percentiles of the sample means. Leave a comment if anything about them is unclear.
2. The mean of the sample means (in red).
3. The population mean (in green).

***What can be concluded?***

1. The vertical lines for the mean of the sample means and for mu almost overlap in all of the above graphs.
2. This suggests that, on average, the sample mean x_bar is a good estimate of the population mean mu.
3. Any single sample mean can still differ from mu; the typical size of that difference is the standard error, given by the respective RMSE values.