In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

# Asking a Question

Before collecting data, one needs to ask questions. The question determines what data are needed and guides the collection of the data that best answer it.

In this case, we already have a dataset, so let's work in reverse, i.e. ask a question that the given dataset might be able to answer. Before that, let's import and explore all the given dataframes.

In [None]:
dropout_ratio = pd.read_csv('/kaggle/input/indian-school-education-statistics/dropout-ratio-2012-2015.csv')
schools_with_electricity = pd.read_csv('/kaggle/input/indian-school-education-statistics/percentage-of-schools-with-electricity-2013-2016.csv')
gross_enrolment_ratio = pd.read_csv('/kaggle/input/indian-school-education-statistics/gross-enrollment-ratio-2013-2016.csv')
schools_with_water_facility = pd.read_csv('/kaggle/input/indian-school-education-statistics/percentage-of-schools-with-water-facility-2013-2016.csv')
schools_with_boys_toilet = pd.read_csv('/kaggle/input/indian-school-education-statistics/schools-with-boys-toilet-2013-2016.csv')
schools_with_girls_toilet = pd.read_csv('/kaggle/input/indian-school-education-statistics/schools-with-girls-toilet-2013-2016.csv')
schools_with_comps = pd.read_csv('/kaggle/input/indian-school-education-statistics/percentage-of-schools-with-comps-2013-2016.csv')

I won't print each of the above dataframes, as that would be tedious. After going through all of them, the one that caught my attention was the "gross_enrolment_ratio" dataset. The simplest and most obvious question that occurred to me is: **What is the gross enrolment ratio in Primary Schools of Bihar, and how does it fare against the corresponding statistics for All India?**

Our whole EDA shall revolve around that question.

# What is the gross enrolment ratio in Primary Schools of Bihar, and how does it fare against the corresponding statistics for All India?

# 1. Data Cleaning

In [None]:
gross_enrolment_ratio.head()

In [None]:
gross_enrolment_ratio.State_UT.value_counts().sort_index()

There are 40 distinct values of State_UT in the dataframe, which cannot be right: even allowing for the 'All India' row, India had at most 36 states and UTs in this period. **sort_index()** puts the names in alphabetical order, which makes variant spellings of the same state easy to spot.

States/UTs that appear under two different names are:

1. Madhya Pradesh (also MADHYA PRADESH)
2. Puducherry (also Pondicherry)
3. Uttarakhand (also Uttaranchal)

Let's see what data each of these variants holds in its separate rows.

In [None]:
df = gross_enrolment_ratio[gross_enrolment_ratio.State_UT.isin(['MADHYA PRADESH', 'Madhya Pradesh', 'Pondicherry', 'Puducherry', 'Uttarakhand', 'Uttaranchal'])]

In [None]:
df

In [None]:
gross_enrolment_ratio.set_index('State_UT', inplace = True)

In [None]:
gross_enrolment_ratio.rename({'MADHYA PRADESH' : 'Madhya Pradesh', 'Pondicherry' : 'Puducherry', 'Uttaranchal' : 'Uttarakhand'}, inplace = True)

In [None]:
gross_enrolment_ratio.reset_index(inplace = True)

In [None]:
gross_enrolment_ratio.State_UT.value_counts().sort_index()

As you can see, each state now occurs once per session, namely 2013-14, 2014-15 and 2015-16. This is going to simplify our task.
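The set_index / rename / reset_index round trip above can also be written as a single `Series.replace` call. Here is a minimal sketch on a made-up frame: the `toy` rows and their values are hypothetical, and only the column names and the spelling pairs come from the cells above.

```python
import pandas as pd

# Hypothetical miniature of gross_enrolment_ratio; the numbers are made up.
toy = pd.DataFrame({
    'State_UT': ['MADHYA PRADESH', 'Madhya Pradesh', 'Pondicherry', 'Puducherry'],
    'Year': ['2013-14', '2014-15', '2013-14', '2014-15'],
    'Primary_Total': [100.0, 101.0, 90.0, 91.0],
})

# Map each variant spelling to one canonical name.
canonical = {
    'MADHYA PRADESH': 'Madhya Pradesh',
    'Pondicherry': 'Puducherry',
    'Uttaranchal': 'Uttarakhand',
}
toy['State_UT'] = toy['State_UT'].replace(canonical)

# After cleaning, every (state, year) pair should appear exactly once.
rows_per_pair = toy.groupby(['State_UT', 'Year']).size()
print(sorted(toy['State_UT'].unique()))
print(bool((rows_per_pair == 1).all()))
```

The groupby check at the end is a cheap way to confirm that no state still hides under a second spelling.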

In [None]:
gb_year = gross_enrolment_ratio.groupby(['State_UT','Year'])

In [None]:
gb_year.first()

In [None]:
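# .copy() makes independent frames, so assigning columns later
# does not trigger pandas' SettingWithCopyWarning
bihar = gross_enrolment_ratio[gross_enrolment_ratio.State_UT == 'Bihar'].copy()
all_india = gross_enrolment_ratio[gross_enrolment_ratio.State_UT == 'All India'].copy()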

In [None]:
bihar_total = bihar[['Year', 'Primary_Total', 'Upper_Primary_Total', 'Secondary_Total', 'Higher_Secondary_Total']].copy()

In [None]:
bihar_total

In [None]:
bihar_total['Higher_Secondary_Total'] = bihar_total.Higher_Secondary_Total.astype(float)
all_india['Higher_Secondary_Total'] = all_india.Higher_Secondary_Total.astype(float)

In [None]:
bihar_total.describe()

In [None]:
bihar

In [None]:
plt.figure(figsize = (15,8))

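# Index by Year so the x-axis shows sessions instead of raw row numbers
bihar_by_year = bihar.set_index('Year')
bihar_by_year.Primary_Girls.plot()
bihar_by_year.Primary_Boys.plot()
bihar_by_year.Primary_Total.plot()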
plt.legend()

In Bihar, as you can see, the gross enrolment ratio of both boys and girls soared between 2013-14 and 2014-15, but the momentum is lost in the subsequent session. Such a significant decline in the slope of the curve is alarming.
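The "slope" mentioned above is simply the change in GER from one session to the next, which `Series.diff()` computes directly. A minimal sketch, assuming made-up GER values that only mimic a rise followed by a flattening (they are not taken from the dataset):

```python
import pandas as pd

# Hypothetical GER values, NOT from the dataset.
ger = pd.Series([95.0, 103.0, 104.0],
                index=['2013-14', '2014-15', '2015-16'],
                name='Primary_Total')

# diff() is the change from the previous session, i.e. the slope
# of each line segment in the plot.
slope = ger.diff()
print(slope)

# A negative change in slope means the curve is losing momentum.
slowing = bool(slope.diff().iloc[-1] < 0)
print(slowing)
```

Running the same two lines on `bihar.Primary_Total` would put a number on the decline instead of reading it off the chart.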

# Histogram

Histograms are a good way of discerning distributions. However, there are some outliers that need to be identified because they may distort the analysis later on.

To manage outliers, we split the states into 3 broad categories, namely 'good_states', 'not_good_states' and 'modest_states', the details of which are given below.

In [None]:
plt.figure(figsize = (15,8))

gross_enrolment_ratio.Primary_Total.hist()

In [None]:
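# Categories are assigned per (state, year) row, so a state can land in more than one set
good_states = set(gross_enrolment_ratio.State_UT[gross_enrolment_ratio.Primary_Total > 110])
# modest means strictly above 90 and at most 110, so it no longer overlaps with not_good_states
modest_states = set(gross_enrolment_ratio.State_UT[(gross_enrolment_ratio.Primary_Total > 90) & (gross_enrolment_ratio.Primary_Total <= 110)])
not_good_states = set(gross_enrolment_ratio.State_UT[gross_enrolment_ratio.Primary_Total <= 90])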

In [None]:
len(not_good_states)+len(modest_states)+len(good_states)

In [None]:
not_good_states

In [None]:
good_states

In [None]:
modest_states
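The three-way split above can also be expressed with `pd.cut`, which guarantees that every row falls into exactly one bin. This is a sketch under assumptions: the state names and GER values are placeholders, and the bin edges 90 and 110 are the thresholds used in the cells above.

```python
import pandas as pd

# Hypothetical rows; state names and values are placeholders.
toy = pd.DataFrame({
    'State_UT': ['A', 'A', 'B', 'C'],
    'Primary_Total': [85.0, 95.0, 110.0, 120.0],
})

# Right-closed bins: (-inf, 90], (90, 110], (110, inf)
toy['category'] = pd.cut(
    toy['Primary_Total'],
    bins=[float('-inf'), 90, 110, float('inf')],
    labels=['not_good', 'modest', 'good'],
).astype(str)

# States that land in more than one category across years.
n_categories = toy.groupby('State_UT')['category'].nunique()
movers = n_categories[n_categories > 1].index.tolist()
print(toy)
print(movers)
```

Because the bins do not overlap, a state can only show up in two categories when its GER actually crossed a threshold between sessions.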

# Comparison using line graphs

In [None]:
plt.figure(figsize=(15,8))
sns.lineplot(x = 'Year',y = 'Primary_Total', data = bihar_total, label = 'Primary')
sns.lineplot(x = 'Year',y = 'Upper_Primary_Total', data = bihar_total, label = 'Upper Primary')
sns.lineplot(x = 'Year',y = 'Secondary_Total', data = bihar_total, label = 'Secondary')
sns.lineplot(x = 'Year',y = 'Higher_Secondary_Total', data = bihar_total, label = 'Higher Secondary')
plt.legend()

In the graph above, it is quite evident that the GER in Primary schools has risen significantly since 2014-15, although an upward trend is noticeable even before that.

The GER in Upper Primary, Secondary and Higher Secondary, albeit lower than that in Primary schools, rose at a faster pace. The slope of the Upper Primary curve even exceeded that of the Primary level.
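To compare the pace of the levels numerically rather than by eye, one option is to subtract the first session from the last for each column. The sketch below uses made-up numbers shaped like `bihar_total`. It also reshapes the frame to long form, which would let a single seaborn call draw one line per level.

```python
import pandas as pd

# Hypothetical values shaped like bihar_total; numbers are made up.
wide = pd.DataFrame({
    'Year': ['2013-14', '2014-15'],
    'Primary_Total': [95.0, 103.0],
    'Upper_Primary_Total': [80.0, 92.0],
})

# Change per level between the first and the last session.
by_year = wide.set_index('Year')
change = by_year.iloc[-1] - by_year.iloc[0]
print(change)

# One row per (Year, Level) pair instead of one column per level.
long = wide.melt(id_vars='Year', var_name='Level', value_name='GER')
print(long)

# With seaborn available, one call then draws a line per level:
# sns.lineplot(x='Year', y='GER', hue='Level', data=long)
```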

In [None]:
plt.figure(figsize=(15,8))
sns.lineplot(x = 'Year',y = 'Primary_Total', data = all_india, label = 'Primary')
sns.lineplot(x = 'Year',y = 'Upper_Primary_Total', data = all_india, label = 'Upper Primary')
sns.lineplot(x = 'Year',y = 'Secondary_Total', data = all_india, label = 'Secondary')
sns.lineplot(x = 'Year',y = 'Higher_Secondary_Total', data = all_india, label = 'Higher Secondary')
plt.legend()

The GER of Primary schools in All India seems to diverge from the trend in Bihar; put the other way round, the Primary schools of Bihar are outperforming their All India counterparts in terms of GER.
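The comparison can be made explicit by aligning the two frames on `Year` and subtracting. This is a sketch with made-up values: `bihar_toy` and `india_toy` are hypothetical stand-ins for `bihar` and `all_india`.

```python
import pandas as pd

# Hypothetical GER values; not taken from the dataset.
bihar_toy = pd.DataFrame({'Year': ['2013-14', '2014-15'],
                          'Primary_Total': [95.0, 103.0]})
india_toy = pd.DataFrame({'Year': ['2013-14', '2014-15'],
                          'Primary_Total': [101.0, 100.0]})

# Align the two series on Year, then subtract.
merged = bihar_toy.merge(india_toy, on='Year', suffixes=('_bihar', '_india'))
merged['gap'] = merged['Primary_Total_bihar'] - merged['Primary_Total_india']
print(merged[['Year', 'gap']])
```

A positive `gap` marks a session in which Bihar is ahead of the national figure, and a negative one marks a session in which it trails.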

In [None]:
plt.figure(figsize=(15,8))
sns.lineplot(x = 'Year', y = 'Primary_Boys', data = bihar, label = 'Boys')
sns.lineplot(x = 'Year', y = 'Primary_Girls', data = bihar, label = 'Girls')
plt.legend()

Well, it might reflect the blooming Autumn of matriarch in Bihar...

In [None]:
plt.figure(figsize=(15,8))
sns.lineplot(x = 'Year', y = 'Primary_Boys', data = all_india, label = 'Boys')
sns.lineplot(x = 'Year', y = 'Primary_Girls', data = all_india, label = 'Girls')
plt.legend()

The doom of the yet blooming matriarch...

# End of Part 1

Let's put a full stop to Part 1 of our research by computing the chances of selecting a 'good', a 'not good' and a 'modest' state from the dataframe, based on the GER of all of these states.

Note: The categorization is based on the GER of every state in every year, so a state that appears in more than one category fell into one category in one year and then saw its GER rise or fall into another in a subsequent year.

In [None]:
# Probability = share of (state, year) rows whose Primary_Total falls in each category
primary = gross_enrolment_ratio.Primary_Total
print('Probability that a good state is selected = ', (primary > 110).mean())
print('Probability that a not good state is selected = ', (primary <= 90).mean())
print('Probability that a modest state is selected = ', ((primary > 90) & (primary <= 110)).mean())
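If "probability" is read as the chance that a randomly picked (state, year) row falls in a given category, it is the share of rows in that category. A minimal sketch, assuming made-up GER values and the 90 / 110 thresholds used earlier:

```python
import pandas as pd

# Hypothetical Primary_Total values, one per (state, year) row.
ger = pd.Series([85.0, 95.0, 110.0, 120.0])

category = pd.cut(ger, bins=[float('-inf'), 90, 110, float('inf')],
                  labels=['not_good', 'modest', 'good'])

# Share of rows in each category; the shares sum to 1.
probs = category.value_counts(normalize=True)
print(probs)
```

Shares computed this way always add up to 1, which makes a handy sanity check for any three-way split.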

# Epilogue

"To climb steep hills requires a slow pace at first". The immortal words of Shakespeare are to be recalled before I rest my case. The notebook is created to especially assist newbies who might find it difficult to kick start the analysis. The upcoming versions shall lift the bar of the analysis. A step by step guide is to follow. 

Kindly follow, like and comment to help me tweak my work.