# Summer of Code - Artificial Intelligence
## Week 02: Descriptive Statistics and Probability
### Day 05: Measures of Dispersion

In this notebook, we will learn about **Measures of Dispersion** using Python.

## `scipy` Library
`scipy` is an open-source Python library for scientific and technical computing.
### `scipy.stats` Module
The `scipy.stats` module provides probability distributions, summary statistics such as skewness and kurtosis, and statistical tests.

# Measures of Dispersion
Dispersion or variability describes the extent to which a data distribution is spread out or clustered together. It provides insights into the spread of data points around a central value, such as the mean or median. Common measures of dispersion include:
- **Range**
- **Variance**
- **Standard Deviation**

## Range
The range is the simplest measure of dispersion, calculated as the difference between the maximum and minimum values in a dataset. Mathematically, it is expressed as:
$$\text{Range} = \text{Max} - \text{Min}$$

In [None]:
age = [12, 30, 40, 25, 10, 35, 24]
age

In [None]:
max(age) - min(age)

In [None]:
mean_age = sum(age) / len(age)
print(mean_age)

## Variance
Variance quantifies the average squared deviation of each data point from the mean, so a larger variance means the data points lie further from the mean. The formula for the population variance (σ²) is:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where:
- $N$ is the number of data points
- $x_i$ is each individual data point
- $\mu$ is the mean of the data points

In [None]:
sum_of_squared_deviation = 0
for item in age:
    deviation = (item - mean_age) ** 2
    sum_of_squared_deviation += deviation
sum_of_squared_deviation

In [None]:
variance = sum_of_squared_deviation / len(age)
print(variance)
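As a cross-check on the loop above, the standard library's `statistics` module can compute the same quantity. This is a minimal sketch that assumes `age` is the whole population. If it were a sample drawn from a larger group, the sample variance, which divides by $N - 1$ instead of $N$, would usually be the better choice.

```python
import statistics

age = [12, 30, 40, 25, 10, 35, 24]

# Population variance: sum of squared deviations divided by N
population_variance = statistics.pvariance(age)

# Sample variance: the same sum divided by N - 1
sample_variance = statistics.variance(age)

print(population_variance)
print(sample_variance)
```

The population variance should agree with the value computed by hand in the cell above, while the sample variance is slightly larger because of the smaller divisor.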

## Standard Deviation
Standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the original data. It indicates how much the data points typically deviate from the mean. The formula for the population standard deviation (σ) is:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
where:
- $N$ is the number of data points
- $x_i$ is each individual data point
- $\mu$ is the mean of the data points

In [None]:
# standard deviation is the square root of the variance
std = variance**0.5
std
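Because the standard deviation is expressed in the same units as the data (years, for `age`), it can be added to and subtracted from the mean. The sketch below uses the standard library's `statistics.pstdev` and lists the ages that fall within one standard deviation of the mean. The variable names `lower`, `upper`, and `within_one_std` are illustrative only.

```python
import statistics

age = [12, 30, 40, 25, 10, 35, 24]

mean_age = statistics.fmean(age)
std_age = statistics.pstdev(age)  # population standard deviation (divides by N)

# Interval of one standard deviation on either side of the mean
lower, upper = mean_age - std_age, mean_age + std_age
within_one_std = [a for a in age if lower <= a <= upper]

print(round(std_age, 2))
print(within_one_std)
```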

## Skewness
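Skewness measures the asymmetry of a data distribution around its mean. A value of 0 indicates a symmetric distribution, a positive value indicates a longer right tail, and a negative value indicates a longer left tail.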


In [13]:
!pip install scipy



In [12]:
# skewness using scipy library
from scipy import stats

stats

<module 'scipy.stats' from 'c:\\Users\\DIPLAB\\.conda\\envs\\pt-env\\Lib\\site-packages\\scipy\\stats\\__init__.py'>

In [15]:
data = list(range(1, 102))
data[:10]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [16]:
stats.skew(data)

0.0

In [21]:
data = [-1000, -990, 10, 10, 3, 2, 5, 10, 15, 20, 25, 10, 110, 100]
stats.skew(data)

-2.003813521290219
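To see where these numbers come from, skewness can be computed by hand from the second and third central moments. The sketch below implements the biased Fisher-Pearson coefficient, $m_3 / m_2^{3/2}$, which should match `stats.skew` with its default settings. The helper name `skewness` is hypothetical and not part of `scipy`.

```python
def skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 (moments divide by N)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = list(range(1, 102))
left_skewed = [-1000, -990, 10, 10, 3, 2, 5, 10, 15, 20, 25, 10, 110, 100]

print(skewness(symmetric))
print(skewness(left_skewed))
```

The two large negative values in the second dataset pull the left tail out, which is why the result is negative.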

## Kurtosis
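Kurtosis measures the "tailedness" of a data distribution: heavy tails (high kurtosis) mean that extreme values, or outliers, are more likely. By default, `scipy.stats.kurtosis` reports excess kurtosis, for which a normal distribution scores 0.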

In [None]:
# kurtosis using scipy library
stats.kurtosis(data)
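Kurtosis can also be computed by hand from central moments. This is a minimal sketch of excess kurtosis, $m_4 / m_2^2 - 3$, using moments that divide by $N$. The helper name `excess_kurtosis` is hypothetical. A flat, evenly spaced dataset should give a negative value (lighter tails than a normal distribution), while a dataset with a few extreme values should give a positive one.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (moments divide by N)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


flat = list(range(1, 102))
heavy_tailed = [-1000, -990, 10, 10, 3, 2, 5, 10, 15, 20, 25, 10, 110, 100]

print(excess_kurtosis(flat))
print(excess_kurtosis(heavy_tailed))
```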

# Measures of Position
Measures of position describe the relative standing of a data point within a dataset. They help to identify where a particular value lies in relation to the rest of the data. Common measures of position include:
- ***z* scores**
- **Percentiles**
- **Quartiles**

## Percentiles, Deciles, and Quartiles
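Percentiles indicate the value below which a given percentage of observations in a dataset fall. For example, the 25th percentile ($P_{25}$) is the value below which 25% of the data points lie. Quartiles are the 25th, 50th, and 75th percentiles (Q1, Q2, and Q3), and deciles are the 10th, 20th, ..., 90th percentiles.
To locate the pth percentile in data sorted in ascending order, compute its position:
$$P_p = \left( \frac{p}{100} \right) \times (N)$$
where:
- $P_p$ is the position of the pth percentile in the sorted data, counting from 1 (Python lists are indexed from 0, so subtract 1 to get the list index)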
- $p$ is the desired percentile (e.g., 25 for the 25th percentile)
- $N$ is the number of data points
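The position formula can be turned into a small helper. The sketch below assumes one common textbook convention that the formula itself does not spell out: a fractional position is rounded up, and a whole-number position $k$ is resolved by averaging the $k$-th and $(k+1)$-th values. The function name `percentile_value` is hypothetical, and the result can differ from `np.percentile`, which interpolates linearly by default.

```python
import math


def percentile_value(values, p):
    """Return the p-th percentile using a round-up position rule (assumed convention)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    position = p * n / 100  # 1-based position, as in the formula above
    if position.is_integer():
        k = int(position)
        # Whole-number position: average the k-th and (k+1)-th values
        return (ordered[k - 1] + ordered[k]) / 2
    # Fractional position: round up, then subtract 1 for 0-based indexing
    return ordered[math.ceil(position) - 1]


print(percentile_value(list(range(1, 102)), 50))
print(percentile_value([1, 2, 3, 4], 50))
```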


In [24]:
data = list(range(1, 102))
len(data)

101

In [None]:
median = data[(len(data) + 1)//2 - 1]  # middle position counts from 1; subtract 1 for the list index
median

51

In [29]:
pos = 50/100 * len(data)
pos

50.5

In [31]:
# pos counts from 1; round 50.5 up to position 51, which is list index 50
data[50]

51

In [33]:
import numpy as np


np.percentile(data, 50)

51.0

## Interquartile Range (IQR)
The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). It measures the spread of the middle 50% of the data and is calculated as:
$$\text{IQR} = Q3 - Q1$$
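It is useful for identifying outliers in a dataset: a common rule of thumb flags values below $Q1 - 1.5 \times \text{IQR}$ or above $Q3 + 1.5 \times \text{IQR}$.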
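As an illustration, the quartiles and IQR can be computed with `np.percentile`, reusing the skewed dataset from the skewness section. The 1.5 × IQR cutoff used below is a widely used convention rather than a fixed rule, and the names `lower_fence`, `upper_fence`, and `outliers` are illustrative.

```python
import numpy as np

data = [-1000, -990, 10, 10, 3, 2, 5, 10, 15, 20, 25, 10, 110, 100]

# First and third quartiles (np.percentile interpolates linearly by default)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional cutoffs: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)
print(outliers)
```

Unlike the range, the IQR is barely affected by the two values near -1000, which is what makes it a useful yardstick for spotting them.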

# The Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution characterized by its bell-shaped curve. It is defined by two parameters: the mean (μ) and the standard deviation (σ). The normal distribution is symmetric around the mean, with approximately 68% of the data falling within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
- Mean, Median, and Mode: In a normal distribution, the mean, median, and mode are all equal.
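
The 68%, 95%, and 99.7% figures can be checked numerically with the cumulative distribution function of the standard normal distribution. This sketch uses `scipy.stats.norm.cdf`. The dictionary name `shares` is illustrative.

```python
from scipy import stats

# Probability of falling within k standard deviations of the mean
shares = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```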


# Standard Normal Distribution and *z* Scores
The standard normal distribution is a special case of the normal distribution with a mean of 0 and a standard deviation of 1. It is often denoted as N(0, 1). A z-score indicates how many standard deviations a data point lies above (positive) or below (negative) the mean, and converting a normally distributed variable to z-scores yields the standard normal distribution. The formula for calculating a z-score is:
$$z = \frac{(x - \mu)}{\sigma}$$
where:
- $x$ is the individual data point
- $\mu$ is the mean of the data points
- $\sigma$ is the standard deviation

In [3]:
# calculating z scores (stats.zscore uses the population standard deviation, ddof=0, by default)
from scipy import stats
data = [5, 10, 15, 20, 25]
stats.zscore(data)

array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356])
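
The same values can be reproduced directly from the formula. This sketch uses the standard library's `statistics` module with the population standard deviation, which is the assumption needed to match the output above.

```python
import statistics

data = [5, 10, 15, 20, 25]

mu = statistics.fmean(data)       # mean of the data points
sigma = statistics.pstdev(data)   # population standard deviation (divides by N)

# z = (x - mu) / sigma for every data point
z_scores = [(x - mu) / sigma for x in data]

print([round(z, 4) for z in z_scores])
```

The middle value, 15, equals the mean, so its z-score is 0, and the remaining values are placed symmetrically on either side.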