In this tutorial-cum-analysis, I focus on three important concepts in statistics:

1. Probability Mass Function
2. Cumulative Distribution Function
3. Normal Probability Plot

If you haven't read my previous analysis, here is the link:
[The Bihar Statistics - 1](http://www.kaggle.com/ritikpnayak/the-bihar-statistics-1)

# Let's Begin

# 1. Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
from scipy import stats
from collections import defaultdict
style.use('ggplot')

In [None]:
gross_enrollment = pd.read_csv('/kaggle/input/indian-school-education-statistics/gross-enrollment-ratio-2013-2016.csv', index_col = 'State_UT')

# 1. Got the Fix!

In my last analysis, I left unchanged the names of some states that appear under different names in different years; the names differ because those states were renamed in a later year.

If you look at the cell above, you will notice that I used the 'State_UT' column as the index so that I could rename some of its entries. I reset the index afterwards.
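
The same renaming can also be done without touching the index, by replacing values in the column itself. Here is a minimal sketch on a made-up frame; `toy` and `name_map` are hypothetical names and the numbers are illustrative, not taken from the dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset; the GER values are made up.
toy = pd.DataFrame({
    'State_UT': ['MADHYA PRADESH', 'Pondicherry', 'Uttaranchal', 'Bihar'],
    'Primary_Total': [101.0, 102.0, 103.0, 104.0],
})

# Old spelling -> current spelling, the same mapping used in this notebook.
name_map = {
    'MADHYA PRADESH': 'Madhya Pradesh',
    'Pondicherry': 'Puducherry',
    'Uttaranchal': 'Uttarakhand',
}

# Series.replace swaps exact matches and leaves every other entry untouched.
toy['State_UT'] = toy['State_UT'].replace(name_map)
print(toy['State_UT'].tolist())
```

Both approaches should give the same result; this one simply avoids the `set index, rename, reset index` round trip.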

In [None]:
gross_enrollment.rename({'MADHYA PRADESH' : 'Madhya Pradesh', 'Pondicherry' : 'Puducherry', 'Uttaranchal' : 'Uttarakhand'}, inplace = True)

In [None]:
gross_enrollment.reset_index(inplace = True)

In [None]:
gross_enrollment.head()

In [None]:
gross_enrollment.shape

# 2. Object to Float conversion

In [None]:
gross_enrollment.info()

In [None]:
cols_to_convert = ['Higher_Secondary_Boys', 'Higher_Secondary_Girls', 'Higher_Secondary_Total']
np_vals = ['NR', '@']
gross_enrollment[cols_to_convert] = gross_enrollment[cols_to_convert].replace(np_vals,np.nan)
gross_enrollment[cols_to_convert] = gross_enrollment[cols_to_convert].astype('float64')

In [None]:
gross_enrollment.info()
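
As an alternative to listing the placeholder strings ('NR', '@') by hand, `pd.to_numeric` with `errors='coerce'` turns anything that is not a number into NaN. A minimal sketch on a made-up frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Toy frame standing in for the real dataset; the values are made up.
toy = pd.DataFrame({
    'Higher_Secondary_Boys': ['45.2', 'NR', '60.1'],
    'Higher_Secondary_Girls': ['47.0', '@', '58.3'],
})

# errors='coerce' converts every entry that cannot be parsed as a number into NaN.
converted = toy.apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
print(converted)
```

The trade-off is that unexpected junk is silently turned into NaN as well, so the explicit list used above is safer when you want to know exactly which placeholders exist.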

# 3. Histograms are a good way to observe distribution

In [None]:
plt.figure(figsize=(15,8))
gross_enrollment.Primary_Total.hist()

In [None]:
plt.figure(figsize=(15,8))

gross_enrollment.Primary_Boys.hist()

In [None]:
plt.figure(figsize=(15,8))

gross_enrollment.Primary_Girls.hist()

# 4. Adding Columns and Probability Mass Function

**What is a Probability mass function (Pmf)?**

Simply put, pmfs are a representation of a distribution as a function that maps from values to probabilities. (from Think stats book by Alan B. Downey)

In our case, the GER does not take just a handful of distinct values, so instead we group the values into ranges and give each range a single-digit class number. Recall how, in school, every student scoring between 91 and 100 was allotted an A1. We are doing the same here.
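
This kind of banding can also be written with `pd.cut`. A minimal sketch, assuming the same band edges as this notebook (70, 80, ..., 140); the GER values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative GER values, not taken from the dataset.
ger = pd.Series([65.0, 70.0, 95.3, 100.0, 104.9, 118.2, 145.0])

# Band edges: (-inf, 70], (70, 80], ..., (130, 140], (140, inf).
edges = [-np.inf, 70, 80, 90, 100, 110, 120, 130, 140, np.inf]

# right=True (the default) closes each interval on the right, like an `x <= 70` check.
classes = pd.cut(ger, bins=edges, labels=list(range(1, 10))).astype(int)
print(classes.tolist())  # [1, 1, 4, 4, 5, 6, 9]
```

One difference to keep in mind: `pd.cut` leaves a missing GER as missing, whereas a chain of `<=` comparisons sends NaN to the last class.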

In [None]:
def ger_class(x):
    # Map a GER value to a class number: 1 for <=70, one class per 10 points up to 140, 9 otherwise
    for label, upper in enumerate(range(70, 141, 10), start = 1):
        if x <= upper:
            return label
    return 9

gross_enrollment['class'] = gross_enrollment.Primary_Total.apply(ger_class)
gross_enrollment['class_boys'] = gross_enrollment.Primary_Boys.apply(ger_class)
gross_enrollment['class_girls'] = gross_enrollment.Primary_Girls.apply(ger_class)

In [None]:
bihar = gross_enrollment[gross_enrollment.State_UT == 'Bihar']

Now that we have created three class columns, one each for **Primary Total**, **Primary Boys** and **Primary Girls**, we can build separate pmfs for **All India** and **Bihar**.

In [None]:
hist_all = dict(gross_enrollment['class'].value_counts())
hist_bihar = dict(bihar['class'].value_counts())

pmf_all = defaultdict()
pmf_bihar = defaultdict()

n_all = sum(hist_all.values())
n_bihar = sum(hist_bihar.values())

for index, freq in hist_all.items():
    pmf_all[index] = freq/n_all

for index, freq in hist_bihar.items():
    pmf_bihar[index] = freq/n_bihar

Clearly, **hist_all** and **hist_bihar** hold the data of the **class** column as dictionaries in which each key is a class number (a GER range) and each value is the number of times that class appears in the column.

For instance, the number 5 appears 41 times in hist_all, which means that a **GER in the range 100 - 110** occupies 41 rows of the dataset. In other words, many states have had a **GER** in this range at some point between **2013 - 2016**.

**pmf_all and pmf_bihar** give the probability of picking each of these classes of **GER**. A probability always lies between 0 and 1, so the higher the pmf of a class, the greater the chance of that class being selected.

For instance, the pmf of 5 in pmf_all is 0.37, which is also the highest. This means that the probability of selecting a State/UT record with a GER in the range 100-110 at any point of time is 0.37, or 37%.
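
For reference, pandas can produce the same kind of pmf in one step with `value_counts(normalize=True)`. A minimal sketch on made-up class labels (not the dataset's values):

```python
import pandas as pd

# Illustrative class labels, not taken from the dataset.
classes = pd.Series([5, 5, 4, 6, 5, 4, 9, 5])

# normalize=True divides each count by the number of rows, which is exactly the pmf.
pmf = classes.value_counts(normalize=True).sort_index()
print(pmf.to_dict())  # {4: 0.25, 5: 0.5, 6: 0.125, 9: 0.125}
```

Class 5 appears in 4 of the 8 rows here, so its pmf is 0.5, and the probabilities add up to 1.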

In [None]:
pmf_all

In [None]:
pmf_bihar

Now, in the same way we can create pmf for our newly created **'class_boys'** and **'class_girls'** columns as well

In [None]:
hist_boys = dict(gross_enrollment['class_boys'].value_counts())
hist_girls = dict(gross_enrollment['class_girls'].value_counts())

pmf_boys = defaultdict()
pmf_girls = defaultdict()

n_boys = sum(hist_boys.values())
n_girls = sum(hist_girls.values())

for index, freq in hist_boys.items():
    pmf_boys[index] = freq/n_boys

for index, freq in hist_girls.items():
    pmf_girls[index] = freq/n_girls

# 4.a. Plotting PMFs

Now that we have the pmfs for **'class_boys'** and **'class_girls'**, we can plot a graph of the pmfs

In [None]:
l = []
for i in pmf_boys.keys():
    if i in pmf_girls.keys():
        l.append(i)

diffs = []
for val in l:
    p_boys = pmf_boys[val]
    p_girls = pmf_girls[val]
    diff = 100*(p_boys-p_girls)
    diffs.append(diff)

plt.figure(figsize=(20,8))

plt.bar(l, diffs)

The graph illustrates some significant observations:

1. The probabilities of values 2, 3 and 4 in 'class_boys' exceed those in 'class_girls', which means that a GER in the range 70-100 is more common in 'pmf_boys'.
2. The probabilities of values 5, 6 and 9 in 'class_girls' exceed those in 'class_boys', which means that a GER in the range 100-120 or above 140 is more common in 'pmf_girls'.

This inference, along with a couple of less significant observations, is important because it suggests that the GER of girls in primary schools in India tends to be higher than that of boys.
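
One caveat about the comparison: it only covers classes that occur in both pmfs. A minimal sketch of a variant that compares over the union of classes, treating a missing class as probability 0; the two toy pmfs and their names are hypothetical and not taken from the dataset:

```python
# Illustrative pmfs, not taken from the dataset.
pmf_boys_toy = {3: 0.2, 4: 0.3, 5: 0.5}
pmf_girls_toy = {4: 0.2, 5: 0.5, 6: 0.3}

# Every class that occurs in either pmf, in ascending order.
classes = sorted(set(pmf_boys_toy) | set(pmf_girls_toy))

# Difference in percentage points; a class absent from one pmf counts as probability 0.
diffs_toy = {
    c: 100 * (pmf_boys_toy.get(c, 0.0) - pmf_girls_toy.get(c, 0.0))
    for c in classes
}
print(diffs_toy)
```

Because both pmfs sum to 1, the differences over the union always cancel out to zero, which is a handy sanity check.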

**Plotting a pmf for Bihar would not be wise because Bihar has only 3 values. Such a plot could still carry useful information, but a dataset this small can be interpreted by eye.**

# 5. Cumulative Distribution Function

**What is a Cumulative Distribution Function?**

To understand that, you first need to know what a percentile is. I won't dwell on the definition, so here is a link for reference: [What is a Percentile rank?](https://en.wikipedia.org/wiki/Percentile_rank).

A Cumulative Distribution Function (cdf), put simply, is a function that maps a value to its percentile rank, expressed as a fraction between 0 and 1.

**"The CDF is a function of x, where x is any value that might appear in the
distribution. To evaluate CDF(x) for a particular value of x, we compute
the fraction of values in the distribution less than or equal to x." (from Think Stats book)** 

In [None]:
## Defining our cdf function.

def cdf(sample, val):
    count = 0.0
    for value in sample:
        if value <= val:
            count += 1
            
    prob = count/len(sample)
    return prob

sample = [1,2,2,4,5,1,3]

d = {}
for i in set(sample):
    d[i] = cdf(sample, i)

In [None]:
d

**The cdf of 5 is 1.0 because it is the greatest value in the sample.**
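
For larger samples, the same quantity can be computed without a Python loop. A minimal sketch using NumPy; the function name `ecdf` is my own and not part of any library:

```python
import numpy as np

def ecdf(sample, x):
    """Fraction of values in `sample` that are less than or equal to `x`."""
    ordered = np.sort(np.asarray(sample))
    # side='right' returns the number of sorted values that are <= x.
    return np.searchsorted(ordered, x, side='right') / ordered.size

sample = [1, 2, 2, 4, 5, 1, 3]
print({v: ecdf(sample, v) for v in sorted(set(sample))})
```

For this sample, 4 of the 7 values are less than or equal to 2, so the cdf of 2 is 4/7, and the cdf of the largest value is again 1.0.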

In [None]:
## Defining a cdf function that takes a dictionary of value counts as sample.
# This is convenient because value_counts() returns a Series that is easily converted to a dictionary.

def cdf_dict(sample, val):
    # sample maps each value to its frequency, so every value is weighted by its count
    count = 0.0
    for value, freq in sample.items():
        if value <= val:
            count += freq
            
    prob = count/sum(sample.values())
    return prob


In [None]:
sample_boys = dict(gross_enrollment['class_boys'].value_counts())
sample_girls = dict(gross_enrollment['class_girls'].value_counts())

# Store results directly so that the name of the cdf() function defined earlier is not overwritten
y_boys = {}
for i in set(gross_enrollment['class_boys']):
    y_boys[i] = cdf_dict(sample_boys, i)

y_girls = {}
for i in set(gross_enrollment['class_girls']):
    y_girls[i] = cdf_dict(sample_girls, i)

In [None]:
fig, axs = plt.subplots(2, figsize = (20,8))
fig.suptitle('CDF Subplots')
axs[0].bar(y_boys.keys(), y_boys.values(), color = 'blue')
axs[1].bar(y_girls.keys(), y_girls.values())

**I'm not spelling out the observations from the above cdf subplots because they are quite self-evident.**
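
A cdf can also be built directly from a pmf by accumulating the probabilities in ascending order of the values. A minimal sketch on made-up class labels (not the dataset's values):

```python
import pandas as pd

# Illustrative class labels, not taken from the dataset.
classes = pd.Series([5, 5, 4, 6, 5, 4, 9, 5])

# pmf sorted by class, then a running total gives the cdf.
cdf_series = classes.value_counts(normalize=True).sort_index().cumsum()
print(cdf_series.to_dict())  # {4: 0.25, 5: 0.75, 6: 0.875, 9: 1.0}
```

A cdf built this way never decreases and always ends at 1.0, which is a quick check that the computation is right.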

# 6. Normal Probability Plot

**"For the exponential distribution, and a few others, there are simple transformations we can use to test whether an analytic distribution is a good model for a dataset".**

**"For the normal distribution there is no such transformation, but there is an alternative called a normal probability plot". (from Think Stats book)**

For more information about the normal probability plot, see: [What is a Normal Probability Plot?](https://en.wikipedia.org/wiki/Normal_probability_plot)

I'll plot the normal probability plots using the scipy library. I imported stats from scipy. I'll use ***stats.probplot*** for this purpose.

Let's look at an example

In [None]:
stats.probplot(x = gross_enrollment['Upper_Primary_Total'])

In [None]:
## This is what the probplot graph looks like

stats.probplot(x = gross_enrollment['Upper_Primary_Total'], plot = plt)

In [None]:
fig, axs = plt.subplots(4, figsize = (20,12))
fig.suptitle('Vertically stacked Normal Probability Plots')

cols = ['Primary_Total', 'Upper_Primary_Total', 'Secondary_Total', 'Higher_Secondary_Total']
for ax, col in zip(axs, cols):
    # probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r));
    # drop NaNs first so the missing Higher Secondary entries do not distort the quantiles
    (osm, osr), _ = stats.probplot(x = gross_enrollment[col].dropna())
    ax.plot(osm, osr)
    ax.set_title(col)

**If the distribution of the sample is approximately normal, the result is a straight line with intercept mu and slope sigma.**

This gives you two new terms to explore: ***mu and sigma***. See: [What is Normal Distribution and what are mu and sigma](https://en.wikipedia.org/wiki/Normal_distribution)
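
To see the intercept and slope in action, here is a minimal sketch on a synthetic normal sample. The values of `mu` and `sigma` are arbitrary choices for illustration and have nothing to do with the GER data:

```python
import numpy as np
from scipy import stats

# Synthetic sample with known parameters; mu and sigma are arbitrary illustrative values.
rng = np.random.default_rng(0)
mu, sigma = 100.0, 15.0
sample = rng.normal(loc=mu, scale=sigma, size=2000)

# With fit=True (the default), probplot also returns the least-squares line through the points.
(osm, osr), (slope, intercept, r) = stats.probplot(sample)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

The fitted slope should land close to sigma and the intercept close to mu, with r near 1 because the sample really is normal. For real data, a curve that bends away from the line at the ends hints at tails heavier or lighter than a normal distribution.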

**Based on the definition in the previous cell, you can draw the obvious inferences from the plots.**

# Epilogue

This analysis dwells on statistical models, a slight shift from my previous one. Statistics, in my opinion, is the building block of data analysis, because an analysis itself begins with a brief description of the data: the mean, the median and the standard deviation, among other things. That said, one need not have prior knowledge of herculean statistical models; it is with experience that one learns to discover and apply them. In this part of the analysis, I wished to introduce the audience to some basic statistical models. Though they are not enough, they are still important, for they yield useful inferences about the data.

It is also evident that I restricted my analysis to only a few columns of our dataframe. A lot more could have been done with just these three concepts; nonetheless, I leave it to the readers to apply them further to the data. My aim is to answer the question on which my whole analysis is premised: **What is the gross enrolment ratio in Primary Schools of Bihar, and is it faring well compared to the corresponding statistics for all of India?**

**Kindly follow, like and comment to help me tweak my work.**