# Descriptive Statistics
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information, while descriptive statistics in the mass noun sense is the process of using and analyzing those statistics.

## Descriptive Statistics Using a Pandas DataFrame
When we have observational data, it is useful to summarize its main features with a small set of numbers called descriptive statistics. These statistics fall into two general categories:
- **Measures of central tendency** use a single value to describe the center of a data set. The mean, median, and mode are the three measures of central tendency.
- **Measures of dispersion** provide information about the spread of a variable's values. The measures covered here are the range, the quartile deviation, the variance, and the standard deviation.

In [1]:
import numpy as np
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Joane'], 
        'age': [42, 52, 42, 24, 73, 52], 
        'preTestScore': [4, 24, 31, 4, 3, 28],
        'postTestScore': [25, 94, 57, 62, 70, 76],
        'sex': ['M', 'F', 'F', 'M', 'F', 'F']}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore', 'sex'])
df

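|   | name | age | preTestScore | postTestScore | sex |
|---|---|---|---|---|---|
| 0 | Jason | 42 | 4 | 25 | M |
| 1 | Molly | 52 | 24 | 94 | F |
| 2 | Tina | 42 | 31 | 57 | F |
| 3 | Jake | 24 | 4 | 62 | M |
| 4 | Amy | 73 | 3 | 70 | F |
| 5 | Joane | 52 | 28 | 76 | F |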


## 1. Measures of Central Tendency
Measures of central tendency are the most basic and, often, the most informative description of a population's characteristics. They describe the "average" member of the population of interest. There are three measures of central tendency:

- **Mean**: the sum of a variable's values divided by the total number of values
- **Median**: the middle value of a variable
- **Mode**: the value that occurs most often

### Mean 
A descriptive statistic used as a measure of central tendency. To calculate the mean, all the values of a variable are added and the sum is divided by the number of values. For example, if the preTestScore values in a sample are 4, 24, 31, 4, 3, and 28, the mean preTestScore is (4+24+31+4+3+28)/6 = 94/6 ≈ 15.67.

In [2]:
df['preTestScore'].mean()

15.666666666666666
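
To make the definition concrete, the short sketch below recomputes the mean by hand and compares it with the pandas result. The standalone `scores` Series is an assumption that simply repeats the `preTestScore` column so the snippet runs on its own.

```python
import pandas as pd

# Same values as the preTestScore column (repeated here so the snippet is self-contained)
scores = pd.Series([4, 24, 31, 4, 3, 28], name="preTestScore")

# Mean by definition: sum of the values divided by the number of values
total = scores.sum()      # 94
n = scores.count()        # 6
manual_mean = total / n

print(total, n)
print(round(manual_mean, 3))  # 15.667
```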

### Median

A descriptive statistic used to measure central tendency. The median is the middle value of a set of values once they are sorted: half of the values lie at or below it and half lie at or above it. When the number of values is even, the median is the average of the two middle values. For example, the sorted preTestScore values are 3, 4, 4, 24, 28, and 31, so the median is (4+24)/2 = 14.

In [3]:
df['preTestScore'].median()

14.0
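
The same result can be reached by sorting the values and picking the middle by hand. This is a minimal sketch in plain Python; the variable names are illustrative and not part of the original notebook.

```python
# preTestScore values, repeated here so the snippet is self-contained
scores = [4, 24, 31, 4, 3, 28]

ordered = sorted(scores)   # [3, 4, 4, 24, 28, 31]
n = len(ordered)
mid = n // 2

if n % 2 == 0:
    # Even number of values: average the two middle values
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # Odd number of values: take the single middle value
    manual_median = ordered[mid]

print(manual_median)  # 14.0
```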

### Mode 
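A descriptive statistic that is a measure of central tendency. It is the value that occurs most frequently in the data. For example, if the preTestScore values are 4, 24, 31, 4, 3, and 28, the mode is 4. Unlike the mean and median, the mode can also be computed for categorical columns such as `sex`.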

In [4]:
df['preTestScore'].mode()

0    4
dtype: int64

In [5]:
df['sex'].mode()

0    F
dtype: object
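
A data set can have more than one mode, which is why `mode()` returns a Series rather than a single value. The sketch below illustrates this with the `age` values, where 42 and 52 both occur twice; the standalone `ages` Series is an assumption that repeats the column from the DataFrame.

```python
import pandas as pd

# Same values as the age column (repeated here so the snippet is self-contained)
ages = pd.Series([42, 52, 42, 24, 73, 52], name="age")

# Frequency of each distinct value
counts = ages.value_counts()

# mode() returns every value tied for the highest frequency, sorted ascending
modes = ages.mode()

print(counts.max())     # 2
print(modes.tolist())   # [42, 52]
```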

## Important characteristics for a good measure of central tendency
A measure of central tendency is meant to be a representative value of a data distribution, so it should have the following properties:

- It must take all data values into account.
- It must not be affected by extreme values.
- It must be stable from sample to sample.
- It must be usable in further statistical analysis.

Of the three measures, the mean comes closest to meeting all of these requirements. The exception is the second one: the mean is influenced by extreme values. For example, for the items 2, 4, 5, 6, 6, 6, 7, 7, 8, 9, the mean, median, and mode are all equal to 6. If the last value is 90 instead of 9, the mean becomes 14.10, while the median and mode do not change. The median and mode are more robust in this case, but they do not meet the other requirements as well. For this reason the mean is generally treated as the default measure of central value and is the one most often used in statistical analysis.
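
The effect of the extreme value in the example above can be checked directly. The sketch below uses the same ten items; the Series names `items` and `items_extreme` are illustrative.

```python
import pandas as pd

# The ten items from the example, and the same items with 9 replaced by 90
items = pd.Series([2, 4, 5, 6, 6, 6, 7, 7, 8, 9])
items_extreme = pd.Series([2, 4, 5, 6, 6, 6, 7, 7, 8, 90])

for label, s in [("original", items), ("with extreme value", items_extreme)]:
    # Only the mean moves when the extreme value is introduced
    print(label, s.mean(), s.median(), s.mode().tolist())
```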

## When do we use a different central tendency value?
The right measure of center depends on the nature of the data, the shape of its frequency distribution, and the purpose of the analysis. If the data is categorical **(qualitative)**, only the mode can be used. For example, if we want to know the typical soil type of a location, or the typical planting pattern in an area, we can only use the mode.

On the other hand, if the data is numerical **(quantitative)**, we can use any of the three measures: mean, median, or mode. Even so, we must consider the shape of the frequency distribution before choosing one.

- If the frequency distribution is **asymmetrical (skewed)**, the median or mode is the more appropriate measure of center.
- If there are **extreme values**, whether small or large, it is more appropriate to use the median or mode.
- If the distribution is **symmetrical (for example, normal)**, any of the three measures can be used. However, the mean is used more often than the others because it meets most of the requirements for a good measure of center.
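
The rules above can be sketched as a small helper. Everything here is hypothetical: `pick_center` is not a pandas function, and the skewness cutoff of 1.0 is an arbitrary assumption chosen only for illustration, not a standard threshold.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Assumption: treat |skewness| above this value as "clearly asymmetrical"
SKEW_THRESHOLD = 1.0

def pick_center(values: pd.Series):
    """Hypothetical helper: return (measure name, value) following the rules above."""
    if not is_numeric_dtype(values):
        # Categorical data: only the mode is meaningful
        return "mode", values.mode().iloc[0]
    if abs(values.skew()) > SKEW_THRESHOLD:
        # Skewed data or extreme values: prefer the median
        return "median", values.median()
    # Roughly symmetrical data: the mean is the usual choice
    return "mean", values.mean()

sex = pd.Series(["M", "F", "F", "M", "F", "F"])
items = pd.Series([2, 4, 5, 6, 6, 6, 7, 7, 8, 9])
items_extreme = pd.Series([2, 4, 5, 6, 6, 6, 7, 7, 8, 90])

print(pick_center(sex))
print(pick_center(items))
print(pick_center(items_extreme))
```

In practice a histogram or box plot is usually a better guide than a single cutoff; the helper only makes the decision logic explicit.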

### Other functions for numeric data


In [6]:
df['preTestScore'].cumsum()

0     4
1    28
2    59
3    63
4    66
5    94
Name: preTestScore, dtype: int64

In [7]:
df['preTestScore'].count()

6

In [8]:
df['preTestScore'].min()

3

In [9]:
df['preTestScore'].max()

31

In [10]:
df['preTestScore'].describe()

count     6.000000
mean     15.666667
std      13.336666
min       3.000000
25%       4.000000
50%      14.000000
75%      27.000000
max      31.000000
Name: preTestScore, dtype: float64

## 2. Measures of Dispersion
Dispersion (or deviation) describes how widely the observations are spread around the central value. There are several measures of the dispersion of observed data:

- Range
- Quartile deviation
- Standard deviation

### Range
The simplest measure of dispersion is Range. The range of a group of observation data is the difference between the minimum and maximum values.


In [11]:
df['preTestScore'].max()-df['preTestScore'].min()

28

The range takes into account only two values, the maximum and the minimum, and ignores all the others. Because it is determined entirely by the most extreme values, it is unstable and cannot be relied upon as an indicator of spread.
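
A quick way to see this instability is to add one extreme observation and recompute the range. In the sketch below the extra score of 300 is a made-up value used only for illustration, and `value_range` is a hypothetical helper.

```python
import pandas as pd

def value_range(s: pd.Series):
    """Hypothetical helper: difference between the maximum and minimum values."""
    return s.max() - s.min()

# preTestScore values, repeated here so the snippet is self-contained
scores = pd.Series([4, 24, 31, 4, 3, 28])

# Assumption: one made-up extreme score appended to the data
scores_extreme = pd.concat([scores, pd.Series([300])], ignore_index=True)

print(value_range(scores))          # 28
print(value_range(scores_extreme))  # 297
```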

Another disadvantage of the range is that it says nothing about how the data is distributed around its central value. Other measures of dispersion, such as the quartile deviation, avoid some of these weaknesses.

### Quartile Deviation

The quartile deviation ignores the values that lie below the first quartile (Q1) and above the third quartile (Q3), so extreme values at either end of the distribution do not affect it.

The quartile deviation is half the difference between the third and first quartiles, that is, half the interquartile range.

Quartile deviation = (Q3-Q1) / 2

In [12]:
df.age.quantile([0.25,0.5,0.75])

0.25    42.0
0.50    47.0
0.75    52.0
Name: age, dtype: float64

In [13]:
np.percentile(df.age,25), np.percentile(df.age,75)

(42.0, 52.0)

The quartile deviation is more stable than the range because it is not influenced by extreme values, which fall outside the interval from Q1 to Q3. However, just like the range, the quartile deviation does not take the deviations of all the data values into account. It depends only on the values of the first and third quartiles.
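
The cells above compute the quartiles but stop short of the quartile deviation itself. The following sketch finishes the calculation for the `age` values using the formula given earlier; the standalone `ages` Series is an assumption that repeats the column from the DataFrame.

```python
import pandas as pd

# Same values as the age column (repeated here so the snippet is self-contained)
ages = pd.Series([42, 52, 42, 24, 73, 52], name="age")

q1 = ages.quantile(0.25)   # first quartile
q3 = ages.quantile(0.75)   # third quartile

iqr = q3 - q1                  # interquartile range
quartile_deviation = iqr / 2   # (Q3 - Q1) / 2

print(q1, q3)              # 42.0 52.0
print(quartile_deviation)  # 5.0
```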

### Standard deviation
The standard deviation measures how the data is spread within the sample, that is, how close the individual data points are to the mean of the sample. It is the square root of the variance, which is the average squared deviation from the mean.

A standard deviation of zero indicates that all values in the set are equal, while a larger standard deviation indicates that individual data points lie farther from the mean. By default, pandas computes the sample variance and sample standard deviation, dividing by n - 1 rather than n.

In [14]:
df['preTestScore'].var(), df['preTestScore'].std()

(177.86666666666667, 13.336666250104134)
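
To show where these two numbers come from, the sketch below recomputes the variance and standard deviation from the squared deviations. It assumes the pandas default of `ddof=1` (sample statistics) and also shows the population version with `ddof=0` for comparison; the variable names are illustrative.

```python
import math
import pandas as pd

# preTestScore values, repeated here so the snippet is self-contained
scores = pd.Series([4, 24, 31, 4, 3, 28], name="preTestScore")

# Squared deviations of each value from the mean
squared_dev = (scores - scores.mean()) ** 2

n = len(scores)
sample_var = squared_dev.sum() / (n - 1)   # divides by n - 1, like scores.var()
population_var = squared_dev.sum() / n     # divides by n, like scores.var(ddof=0)
sample_std = math.sqrt(sample_var)         # like scores.std()

print(round(sample_var, 3))      # 177.867
print(round(sample_std, 3))      # 13.337
print(round(population_var, 3))  # 148.222
```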