<a href="https://colab.research.google.com/github/stevenkhwun/P4DS/blob/main/Chp02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic descriptive statistics

__A copy of this notebook was moved to the `Statistics/Python` repository on 26 December 2023. This copy will be kept here for some time before it is deleted.__

The following demonstrates how to calculate some common descriptive statistics: the mean, trimmed mean, median, weighted mean, weighted median, sample standard deviation, interquartile range (IQR) and median absolute deviation from the median (MAD).

We load the data as a pandas DataFrame because pandas methods, called as `.method()` on a DataFrame or one of its columns, readily provide the mean, median, sample standard deviation and quantiles.

For the trimmed mean, we use the `trim_mean` function in `scipy.stats`. For the weighted mean, we use the `average` function in NumPy. For the weighted median, we use the specialized package `wquantiles`. For the MAD, we use the `robust` module of the `statsmodels` package.

First, we install the `wquantiles` package, as it is not included in the base Colab environment.

In [None]:
# Install the package "wquantiles"
!pip install wquantiles

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wquantiles
  Downloading wquantiles-0.6-py3-none-any.whl (3.3 kB)
Installing collected packages: wquantiles
Successfully installed wquantiles-0.6


We now import the necessary packages:

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import scipy.stats
import wquantiles
from statsmodels import robust

We now load the data as a pandas dataframe:

In [None]:
# Load the dataset as pandas dataframe
link = "https://raw.githubusercontent.com/stevenkhwun/P4DS/main/Data/state.csv"
state = pd.read_csv(link)
state.head()

| | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


## Mean

In [None]:
# Mean by pandas dataframe method
state['Population'].mean()

6162876.3

## Trimmed mean

In [None]:
# Trimmed mean using the scipy.stats package
scipy.stats.trim_mean(state['Population'], 0.1)

4783697.125
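
To make the trimming explicit, here is a minimal standard-library sketch of what a 10% trimmed mean computes. It assumes, as `scipy.stats.trim_mean` is documented to do, that the number of values cut from each end is rounded down. The helper `trimmed_mean` and the sample values are hypothetical and are not part of the notebook's dataset.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping a share of the smallest and largest values.

    Hypothetical helper for illustration; the cut count is rounded down.
    """
    ordered = sorted(values)
    cut = int(proportion * len(ordered))  # values dropped from each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


# Made-up sample containing one extreme value (86)
sample = [2, 9, 12, 19, 86, 3, 7, 15, 11, 40]

plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.1)  # drops the smallest (2) and largest (86)
print(plain, trimmed)
```

The trimmed mean is pulled far less toward the extreme value than the plain mean, which is why the state population figure above drops so sharply once the largest and smallest states are excluded.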

## Median

In [None]:
# Median by pandas dataframe method
state['Population'].median()

4436369.5

## Weighted mean

In [None]:
# Weighted mean by average function in NumPy
np.average(state['Murder.Rate'], weights=state['Population'])

4.445833981123393
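
As a rough illustration of the arithmetic behind a weighted mean, the sketch below multiplies each value by its weight, sums the products and divides by the total weight. The helper `weighted_mean` and the numbers are hypothetical stand-ins for the murder rates and populations.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical rates and population weights
rates = [2.0, 4.0, 6.0]
populations = [1, 1, 2]

result = weighted_mean(rates, populations)  # (2 + 4 + 12) / 4
print(result)
```

Because the third value carries twice the weight of the others, the result sits above the unweighted mean of 4.0.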

## Weighted median

In [None]:
# Weighted median by median function in wquantiles
wquantiles.median(state['Murder.Rate'], weights=state['Population'])

4.4
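
One simple way to think about a weighted median is as the smallest value at which the running total of weights reaches half of the total weight. The sketch below implements that simplified definition only. It is not the algorithm used by `wquantiles`, which may handle interpolation and ties differently, and the helper name and data are hypothetical.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total.

    Simplified definition for illustration; no interpolation is performed.
    """
    pairs = sorted(zip(values, weights))  # order the pairs by value
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value


# Hypothetical rates; the last one carries most of the weight
rates = [1.0, 2.0, 3.0, 4.0]
populations = [1, 1, 1, 5]

result = weighted_median(rates, populations)
print(result)
```

Here the heavily weighted last value becomes the weighted median, whereas the unweighted median of the same values would be 2.5.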

## Standard deviation

Note that the pandas `.std()` method returns the sample standard deviation by default (`ddof=1`, that is, it divides by n - 1 rather than n).

In [None]:
# Sample standard deviation by pandas dataframe method
state['Population'].std()

6848235.347401142

In [None]:
# Check with a small dataset: sample standard deviation of five values
data = [2, 9, 12, 19, 86]
datadf = pd.DataFrame(data)
datadf.std()

0    34.311806
dtype: float64
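
The difference between the sample and population standard deviation can be checked by hand on the same five values. This sketch uses only the standard library; the variable names are illustrative.

```python
import math
import statistics

data = [2, 9, 12, 19, 86]
mean = sum(data) / len(data)
squared_devs = [(x - mean) ** 2 for x in data]

# Sample standard deviation: divide by n - 1
sample_sd = math.sqrt(sum(squared_devs) / (len(data) - 1))
# Population standard deviation: divide by n
population_sd = math.sqrt(sum(squared_devs) / len(data))

# The standard-library functions should agree with the hand calculation
print(sample_sd, statistics.stdev(data))
print(population_sd, statistics.pstdev(data))
```

The sample value agrees with the pandas result above, and the population value is smaller because it divides by n instead of n - 1.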

## Interquartile range (IQR)

In [None]:
# Interquartile range (IQR) by pandas dataframe method
state['Population'].quantile(0.75) - state['Population'].quantile(0.25)

4847308.0
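
For a small dataset the quartiles can be inspected directly. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points. It is assumed here, not verified against the state data, that this matches the default interpolation of the pandas `quantile` method.

```python
import statistics

data = [2, 9, 12, 19, 86]

# n=4 returns the three quartile cut points; "inclusive" treats the
# minimum and maximum as the 0% and 100% points of the data
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)
```

Unlike the standard deviation of the same values, the IQR is unaffected by the extreme value 86.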

## Median absolute deviation from the median (MAD)

In [None]:
# MAD by the mad function in the robust module of statsmodels
robust.mad(state['Population'])

3849876.1459979336
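
By default, `robust.mad` does not return the raw median of absolute deviations. It divides that value by a constant of about 0.6745 so that, for normally distributed data, the result is comparable to a standard deviation. The standard-library sketch below shows both quantities on a small made-up dataset; the variable names are illustrative.

```python
import statistics

data = [2, 9, 12, 19, 86]
center = statistics.median(data)

# Raw MAD: median of the absolute deviations from the median
raw_mad = statistics.median([abs(x - center) for x in data])

# Scaling constant: the 75th percentile of the standard normal distribution
c = statistics.NormalDist().inv_cdf(0.75)  # roughly 0.6745
scaled_mad = raw_mad / c
print(raw_mad, scaled_mad)
```

The scaled value is the one to compare with the `robust.mad` output above.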

# Built-in functions for descriptive statistics

## Maximum function `max()` and minimum function `min()`

These functions accept either several separate arguments or a single iterable, such as a tuple, a list or a pandas Series.

In [None]:
# max() applied to three separate arguments
max(36, 27, 12)

36

In [None]:
# min() applied to a list
min([36, 27, 12])

12

In [None]:
# max() applied to a pandas Series (a dataframe column)
max(state['Population'])

37253956

## `sum()` and `len()`

Python's built-in function `sum()` is an efficient way to sum a list of numeric values.

Python's built-in function `len()` returns the length of an object, for example the number of items in a list. It works with many data types, but not all of them: an integer has no length, so `len(5)` raises a `TypeError`.

In [None]:
# Create a list of grades
grades = [85, 93, 45, 89, 85]

Calculate the mean grade by dividing the total by the number of grades:

In [None]:
# Mean grade
sum(grades) / len(grades)

79.4

## `mean()`, `median()` and `mode()` functions in `statistics` module

The Python Standard Library's `statistics` module provides functions for calculating the mean, median and mode. Each function takes an *iterable* as its argument, so it can be applied to tuples, lists and pandas Series (dataframe columns).

To use these capabilities, first import the `statistics` module:

In [None]:
# Import statistics module
import statistics

### `mean()`

In [None]:
# mean() applied to a list
statistics.mean(grades)

79.4

In [None]:
# mean() applied to a pandas Series (a dataframe column)
statistics.mean(state['Population'])

6162876.3

### `median()`

In [None]:
# median() applied to a list
statistics.median(grades)

85

### `mode()`

In [None]:
# mode() applied to a list
statistics.mode(grades)

85

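In Python versions before 3.8, the `mode()` function raises a `StatisticsError` for lists like [85, 93, 45, 89, 85, 93], in which there are two or more "most frequent" values. From Python 3.8 onward, it returns the first such value encountered instead.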
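
When a list has several equally frequent values, the `multimode()` function (added to the `statistics` module in Python 3.8) returns all of them. A minimal sketch with the list mentioned above, using the illustrative name `bimodal`:

```python
import statistics

bimodal = [85, 93, 45, 89, 85, 93]

# multimode() lists every most-frequent value, in order of first appearance
modes = statistics.multimode(bimodal)
print(modes)
```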

## `sorted()` function

To confirm that the median and mode are correct, you can use the built-in `sorted()` function to get a copy of `grades` with its values arranged in increasing order:

In [None]:
# Sort the object grades
sorted(grades)

[45, 85, 85, 89, 93]

## `pvariance()` and `pstdev()` in `statistics` module

In [None]:
# Create the data
die = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

In [None]:
# Population variance
statistics.pvariance(die)

2.25

In [None]:
# Population standard deviation
statistics.pstdev(die)

1.5

## `sqrt()` function in `math` module

Passing the `pvariance()` function's result to the `math` module's `sqrt()` function confirms the population standard deviation is 1.5:

In [None]:
# Standard deviation by sqrt() function
import math      # Import the math module
math.sqrt(statistics.pvariance(die))

1.5

# Appendix: deviations from the mean

In [None]:
# Mean of the die rolls
die = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]
statistics.mean(die)

3.5

In [None]:
# Deviation of each roll from the mean (3.5)
diff = [x - 3.5 for x in die]
diff

[-2.5, -0.5, 0.5, -1.5, 2.5, 1.5, -0.5, 0.5, 1.5, -1.5]
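
The deviations above are the first step of a population variance calculation. As a sketch of the remaining steps, the code below squares the deviations, averages them and takes the square root, which should reproduce the `pvariance()` and `pstdev()` results shown earlier. The names `squared`, `variance` and `std_dev` are illustrative.

```python
import math

die = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]
mean = sum(die) / len(die)

diff = [x - mean for x in die]      # deviations from the mean
squared = [d ** 2 for d in diff]    # squared deviations
variance = sum(squared) / len(die)  # population variance: divide by n
std_dev = math.sqrt(variance)       # population standard deviation
print(variance, std_dev)
```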