# Descriptive statistics. Exploring the data

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [10]:
districts = pd.read_csv('education_districtwise.csv')

`head()`: get a quick overview of the dataset

In [6]:
districts.head(10)

| Unnamed: 0 | DISTNAME | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI |
|---|---|---|---|---|---|---|---|
| 0 | DISTRICT32 | STATE1 | 13 | 391 | 104 | 875564.0 | 66.92 |
| 1 | DISTRICT649 | STATE1 | 18 | 678 | 144 | 1015503.0 | 66.93 |
| 2 | DISTRICT229 | STATE1 | 8 | 94 | 65 | 1269751.0 | 71.21 |
| 3 | DISTRICT259 | STATE1 | 13 | 523 | 104 | 735753.0 | 57.98 |
| 4 | DISTRICT486 | STATE1 | 8 | 359 | 64 | 570060.0 | 65.0 |
| 5 | DISTRICT323 | STATE1 | 12 | 523 | 96 | 1070144.0 | 64.32 |
| 6 | DISTRICT114 | STATE1 | 6 | 110 | 49 | 147104.0 | 80.48 |
| 7 | DISTRICT438 | STATE1 | 7 | 134 | 54 | 143388.0 | 74.49 |
| 8 | DISTRICT610 | STATE1 | 10 | 388 | 80 | 409576.0 | 65.97 |
| 9 | DISTRICT476 | STATE1 | 11 | 361 | 86 | 555357.0 | 69.9 |


- the `VILLAGES` column indicates how many villages are in each district
- the `TOTPOPULAT` column indicates the population for each district
- the `OVERALL_LI` column indicates the literacy rate for each district

### describe() to compute descriptive stats

The `describe()` function is a convenient way to calculate many key stats all at once. For a numeric column, `describe()` gives the following output:

*   `count`: Number of non-NA/null observations
*   `mean`: The arithmetic average
*   `std`: The standard deviation
*   `min`: The smallest (minimum) value
*   `25%`: The first quartile (25th percentile)
*   `50%`: The median (50th percentile) 
*   `75%`: The third quartile (75th percentile)
*   `max`: The largest (maximum) value


**doc**: [pandas.DataFrame.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

Using the `describe()` function to reveal key stats about the literacy rate:

In [7]:
districts['OVERALL_LI'].describe()

count    634.000000
mean      73.395189
std       10.098460
min       37.220000
25%       66.437500
50%       73.490000
75%       80.815000
max       98.760000
Name: OVERALL_LI, dtype: float64

The third quartile (75th percentile) is 80.815. This means that 75% of the values in the data are 80.815 or lower, and the remaining 25% are higher than this value.
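The reading of the third quartile can be checked on a small example. The sketch below uses a made-up Series of literacy rates (hypothetical values, not taken from `education_districtwise.csv`) and compares `quantile(0.75)` with the share of values at or below it.

```python
import pandas as pd

# Hypothetical toy literacy rates, NOT values from the dataset
rates = pd.Series([55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0])

# Third quartile; pandas interpolates linearly between data points by default
q3 = rates.quantile(0.75)

# Share of observations that are at or below the third quartile
share_at_or_below = (rates <= q3).mean()

print(q3)                 # 81.25
print(share_at_or_below)  # 0.75
```

With real data the share is usually close to 0.75 rather than exactly 0.75, because of ties and interpolation between neighbouring values.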

`describe()` excludes missing values (`NaN`) in the dataset. That is why the count of observations for `OVERALL_LI` (634) is lower than the number of rows in the dataset (680).
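A minimal sketch of how to confirm this, using a tiny made-up DataFrame (the district names and values are hypothetical) so that it runs without the CSV file. The same calls on `districts` should show the gap between 680 rows and 634 non-null literacy rates.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing literacy rates
toy = pd.DataFrame({
    "DISTNAME": ["A", "B", "C", "D"],
    "OVERALL_LI": [66.9, np.nan, 71.2, np.nan],
})

n_rows = len(toy)                                       # all rows, missing or not
n_non_null = toy["OVERALL_LI"].count()                  # non-NaN values only
n_missing = toy["OVERALL_LI"].isna().sum()              # NaN values only
described_count = toy["OVERALL_LI"].describe()["count"]

print(n_rows, n_non_null, n_missing, described_count)   # 4 2 2 2.0
```

The `count` reported by `describe()` matches `count()`, not `len()`, so rows and non-null observations should be compared explicitly whenever missing data is a concern.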

Also try the `describe()` function on a column with categorical (enum) data, like the `STATNAME` column.

For this type of column, `describe()` gives you the following output: 

*   `count`: number of non-NA/null observations
*   `unique`: number of unique values
*   `top`: the most common value (the mode)
*   `freq`: the frequency of the mode


In [8]:
districts['STATNAME'].describe()

count         680
unique         36
top       STATE21
freq           75
Name: STATNAME, dtype: object

The `unique` category indicates that there are 36 states (enum values) in the dataset.
The `top` category indicates that `STATE21` is the mode.
The `freq` category indicates that `STATE21` appears in 75 rows, which means it includes 75 different districts.
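Each of these values can also be computed on its own. The sketch below uses a short made-up Series of state labels (hypothetical, not the real `STATNAME` column) and reproduces `unique`, `top`, and `freq` with `nunique()` and `value_counts()`.

```python
import pandas as pd

# Hypothetical toy state labels, NOT the real STATNAME column
states = pd.Series(["STATE1", "STATE2", "STATE2", "STATE3", "STATE2", "STATE1"])

n_unique = states.nunique()      # number of distinct values
counts = states.value_counts()   # sorted by frequency, most common first
top = counts.index[0]            # the mode
freq = counts.iloc[0]            # how often the mode occurs

summary = states.describe()

print(n_unique, top, freq)       # 3 STATE2 3
print(summary["unique"], summary["top"], summary["freq"])
```

`value_counts()` is often more informative than `describe()` for categorical columns, because it shows the frequency of every value and not just the mode.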


### Functions for stats

The `describe()` function is useful because it reveals a variety of key stats all at once. Individual functions such as `max()` and `min()` compute a single statistic each.

### max() and min() to compute range


Recall that the **range** is the difference between the largest and smallest values: `max()` - `min()`. I can use `max()` and `min()` to compute the range of the literacy rate across all districts in the dataset.

In [9]:
range_overall_li = districts['OVERALL_LI'].max() - districts['OVERALL_LI'].min()
range_overall_li

61.540000000000006

The range in literacy rates across all districts is about 61.5 percentage points.
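The trailing digits in `61.540000000000006` are a floating-point artifact, not extra precision. A minimal sketch on made-up values (hypothetical, with one `NaN` included on purpose) shows the same computation with the result rounded for reporting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy literacy rates, including one missing value
rates = pd.Series([66.92, np.nan, 71.21, 57.98, 80.48])

# max() and min() skip NaN by default (skipna=True)
value_range = rates.max() - rates.min()

# Round for presentation to hide floating-point noise
rounded_range = round(value_range, 2)

print(rounded_range)  # 22.5
```

Rounding is best applied only when displaying the result, so that later calculations still use the unrounded value.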
