### Day 1: Descriptive Statistics

Descriptive statistics is a branch of statistics that deals with summarizing and describing the main features of a dataset. It helps to present large amounts of data in a simple and meaningful way by using numerical measures, tables, and graphs. The two key aspects of descriptive statistics are:

- **Measures of Central Tendency**: These indicate where the center of the data lies.
- **Measures of Dispersion**: These describe how spread out the data is.

---

### 1. **Measures of Central Tendency**

Measures of central tendency provide information about the "central" point of a dataset. These measures summarize the data by identifying a value that represents the dataset as a whole. The most commonly used measures of central tendency are:

- **Mean**: The average of all data points.
- **Median**: The middle value when the data points are sorted.
- **Mode**: The value that appears most frequently in the dataset.

#### a) **Mean**
The **mean** is the sum of all the values in a dataset divided by the number of data points. It is highly influenced by outliers, which are extreme values in the data.

**Formula:**

$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Where:
- $x_i$ is each individual data point.
- $n$ is the number of data points.

**Example:**
Suppose you have the following dataset: [5, 10, 15, 20, 25]. The mean is:


$$\text{Mean} = \frac{5 + 10 + 15 + 20 + 25}{5} = 15$$
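The formula translates almost directly into code. The sketch below is one minimal way to write it in plain Python; the function name `mean` and the `with_outlier` list (the same data with 25 replaced by 250) are illustrative assumptions, added only to show how a single extreme value pulls the mean.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one data point")
    return sum(values) / len(values)


data = [5, 10, 15, 20, 25]
with_outlier = [5, 10, 15, 20, 250]  # hypothetical: 25 replaced by an extreme value

print(mean(data))          # 15.0
print(mean(with_outlier))  # 60.0
```

Changing one value out of five moves the mean from 15 to 60, which is what "highly influenced by outliers" means in practice.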

#### b) **Median**
The **median** is the middle value in a sorted dataset. If the dataset has an odd number of values, the median is the middle one. If it has an even number of values, the median is the average of the two middle values. The median is less sensitive to outliers than the mean.

**Steps to calculate the median:**
1. Sort the data in ascending order.
2. If the number of data points is odd, the median is the middle value.
3. If the number of data points is even, the median is the average of the two middle values.

**Example:**
For the dataset [5, 10, 15, 20, 25]:
- Sorted: [5, 10, 15, 20, 25].
- Median: 15 (since it's the middle value).

For the dataset [5, 10, 15, 20]:
- Sorted: [5, 10, 15, 20].

$$\text{Median} = \frac{10 + 15}{2} = 12.5$$
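The three steps above can be sketched as a small function. This is an illustrative implementation (the name `median_by_steps` is an assumption, not a library function); in practice `statistics.median` or `numpy.median` would normally be used.

```python
def median_by_steps(values):
    """Median following the steps above: sort, then pick or average the middle."""
    if not values:
        raise ValueError("median requires at least one data point")
    ordered = sorted(values)          # Step 1: sort in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # Step 2: odd count, take the middle value
        return ordered[mid]
    # Step 3: even count, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_steps([25, 5, 15, 10, 20]))  # 15
print(median_by_steps([20, 5, 15, 10]))      # 12.5
```

The inputs are deliberately unsorted to show that the sorting step matters.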

#### c) **Mode**
The **mode** is the value that appears most frequently in the dataset. There can be more than one mode in a dataset (bimodal or multimodal) or no mode if no value repeats.

**Example:**
For the dataset [5, 10, 10, 15, 20, 20, 25]:
- Both 10 and 20 appear twice, so the dataset is bimodal with modes 10 and 20.
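Python's standard library offers `statistics.multimode` (Python 3.8+), which returns every value tied for the highest count. One caveat: when no value repeats, it returns all values rather than reporting "no mode". The wrapper `modes` below is a hypothetical helper that applies the convention used in this section instead.

```python
from statistics import multimode


def modes(values):
    """Return all modes, or an empty list when no value repeats."""
    result = multimode(values)
    # multimode returns distinct values; if there are as many of them as
    # data points, every value occurs exactly once, so there is no mode.
    return [] if len(result) == len(values) else result


print(modes([5, 10, 10, 15, 20, 20, 25]))  # [10, 20]
print(modes([5, 10, 15]))                  # []
print(modes(["red", "blue", "red"]))       # ['red']
```

The last call shows that the mode also works for categorical data, where a mean or median would not be defined.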

---

### 2. **Measures of Dispersion**

While measures of central tendency focus on the central value, **measures of dispersion** describe the spread or variability of the data. These measures help to understand how much the data points differ from the central tendency. The most common measures of dispersion are:

- **Range**
- **Variance**
- **Standard Deviation**
- **Interquartile Range (IQR)**

#### a) **Range**
The **range** is the difference between the maximum and minimum values in the dataset. It is the simplest measure of dispersion but is highly affected by outliers.

**Formula:**

$$\text{Range} = \text{Maximum value} - \text{Minimum value}$$


**Example:**
For the dataset [5, 10, 15, 20, 25], the range is:

$$\text{Range} = 25 - 5 = 20$$


#### b) **Variance**
The **variance** measures how far the data points lie from the mean on average: it is the mean of the squared differences from the mean. The formula below divides by $n$, which gives the *population* variance; the *sample* variance divides by $n - 1$ instead.

**Formula:**

$$\text{Variance} (\sigma^2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

Where:
- $x_i$ is each data point.
- $\bar{x}$ is the mean.
- $n$ is the number of data points.

**Example:**
For the dataset [5, 10, 15, 20, 25]:
- Mean = 15.

$$\text{Variance} = \frac{(5-15)^2 + (10-15)^2 + (15-15)^2 + (20-15)^2 + (25-15)^2}{5} = 50$$


#### c) **Standard Deviation**
The **standard deviation** is the square root of the variance. It is a widely used measure of dispersion that describes how much the values deviate from the mean in the same units as the data.

**Formula:**

$$\text{Standard Deviation} (\sigma) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$


**Example:**
For the dataset [5, 10, 15, 20, 25]:
- Variance = 50.

$$\text{Standard Deviation} = \sqrt{50} \approx 7.07$$

A smaller standard deviation means the data points are closer to the mean, while a larger standard deviation indicates more spread.
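To make this concrete, the sketch below compares two datasets with the same mean but different spread. The function `population_std` and the `narrow` dataset are assumptions introduced for illustration; the function follows the population formula above.

```python
import math


def population_std(values):
    """Population standard deviation: square root of the mean squared deviation."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


wide = [5, 10, 15, 20, 25]
narrow = [13, 14, 15, 16, 17]  # hypothetical dataset with the same mean (15)

print(round(population_std(wide), 2))    # 7.07
print(round(population_std(narrow), 2))  # 1.41
```

Both datasets are centered on 15, but the values in `narrow` sit much closer to it, and the standard deviation reflects that.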

#### d) **Interquartile Range (IQR)**
The **Interquartile Range (IQR)** measures the spread of the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1). Because it ignores the lowest and highest quarters of the data, it is largely unaffected by outliers.

**Formula:**

$$\text{IQR} = Q3 - Q1$$

Where:
- $Q1$ is the 25th percentile (first quartile).
- $Q3$ is the 75th percentile (third quartile).

**Example:**
For the dataset [5, 10, 15, 20, 25]:
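- $Q1 = 10$ and $Q3 = 20$ (using linear interpolation between the sorted values; other quartile conventions can give different results on small datasets).
- $\text{IQR} = 20 - 10 = 10$.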
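Quartiles can be computed under more than one convention, and on small datasets the choice changes the answer. The sketch below uses `statistics.quantiles` from the standard library (Python 3.8+); the helper name `iqr` is an assumption. The `"inclusive"` method reproduces the values in the example above, while the `"exclusive"` method (the function's default) does not.

```python
from statistics import quantiles

data = [5, 10, 15, 20, 25]


def iqr(values, method):
    """Return (Q1, Q3, IQR) using the given statistics.quantiles method."""
    q1, _median, q3 = quantiles(values, n=4, method=method)
    return q1, q3, q3 - q1


print(iqr(data, "inclusive"))  # (10.0, 20.0, 10.0)
print(iqr(data, "exclusive"))  # (7.5, 22.5, 15.0)
```

When comparing IQR values across tools, it is worth checking which quartile method each one uses.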

#### e) **Semi-Interquartile Range (SIQR)**
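The **semi-interquartile range (SIQR)** is half of the IQR.

**Formula:**

$$\text{SIQR} = \frac{\text{IQR}}{2}$$

**Example:**
For the dataset [5, 10, 15, 20, 25], the IQR is 10, so:

$$\text{SIQR} = \frac{10}{2} = 5$$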


---

### Summary

In descriptive statistics, the **measures of central tendency** (mean, median, mode) provide information about the central point of the data, while the **measures of dispersion** (range, variance, standard deviation, interquartile range) describe the spread or variability of the data.

- **Mean** is useful when you want to find the average value, but it can be skewed by outliers.
- **Median** is a better measure of central tendency when the data has outliers.
- **Mode** is used when the most frequent value is important, especially for categorical data.
- **Range** gives a quick sense of the spread of the data, but it is affected by outliers.
- **Variance** and **standard deviation** provide more detailed information on the variability of the data.
- **IQR** is useful for understanding the spread of the middle half of the data and is robust to outliers.

By using these measures together, you can better understand the distribution and spread of your dataset.
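As a rough illustration of the points above, the sketch below summarizes the same dataset with and without a single extreme value. The `summarize` helper and the `with_outlier` data are assumptions made for this example; quartiles use the `"inclusive"` method of `statistics.quantiles`.

```python
from statistics import mean, median, pstdev, quantiles


def summarize(values):
    """Collect the main descriptive measures for a list of numbers."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "std": round(pstdev(values), 2),  # population standard deviation
        "iqr": q3 - q1,
    }


clean = [5, 10, 15, 20, 25]
with_outlier = [5, 10, 15, 20, 250]  # hypothetical: one extreme value

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(label, summarize(values))
```

For this data the median and IQR are identical in both cases, while the mean, range, and standard deviation change substantially.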




```python
import numpy as np
from scipy import stats
from collections import Counter

# Sample dataset (note the repeated 20, unlike the worked examples above)
data = [5, 10, 15, 20, 20, 25]
```

```python
# 1. Measures of Central Tendency

# a) Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")

# b) Median
median_value = np.median(data)
print(f"Median: {median_value}")

# c) Mode: collect every value tied for the highest count,
#    but only if that count is greater than 1
data_counts = Counter(data)
max_count = max(data_counts.values())
modes = [k for k, v in data_counts.items() if v == max_count and max_count > 1]

if modes:
    print(f"Mode: {modes} (each appears {max_count} times)")
else:
    print("Mode: No mode, as all values are unique.")
```

```text
Mean: 15.833333333333334
Median: 17.5
Mode: [20] (each appears 2 times)
```


```python
# 2. Measures of Dispersion

# a) Range
range_value = np.max(data) - np.min(data)
print(f"Range: {range_value}")

# b) Variance
variance_value = np.var(data, ddof=0)  # Population variance
print(f"Variance: {variance_value}")

# c) Standard Deviation
std_dev = np.std(data, ddof=0)  # Population standard deviation
print(f"Standard Deviation: {std_dev}")

# d) Interquartile Range (IQR), using NumPy's default linear interpolation
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print(f"Interquartile Range (IQR): {IQR}")
```

```text
Range: 20
Variance: 45.13888888888889
Standard Deviation: 6.718548123582125
Interquartile Range (IQR): 8.75
```
