<a href="https://github.com/zia207/python-colab/blob/main/NoteBook/Python_for_Beginners/01-04-01-descriptive-statistics-python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 4.1 Descriptive Statistics 


This tutorial provides a comprehensive introduction to **descriptive statistics** using the Python programming language. It covers essential concepts such as mean, median, mode, range, variance, standard deviation, quantiles, and interquartile range (IQR). By the end of this tutorial, you will be able to compute and interpret these fundamental statistical measures using real-world data in Python — ideal for beginners and those refreshing their skills.

> 💡 *Note: While R is widely used in statistics, Python offers powerful, flexible tools for data analysis. This guide bridges the gap by translating R-based descriptive statistics into Python equivalents.*

## Introduction

**Descriptive Statistics** is the branch of statistics focused on summarizing and describing the main features of a dataset. It helps answer questions like:
- What is the typical value? *(Central Tendency)*
- How spread out are the values? *(Dispersion)*
- Are there extreme or unusual values? *(Outliers)*

Key measures include:
- **Mean**, **Median**, **Mode** → Central tendency
- **Range**, **Variance**, **Standard Deviation**, **IQR** → Dispersion
- **Quantiles** → Distribution shape

These metrics enable data analysts to make sense of raw numbers, detect patterns, and communicate insights effectively — without advanced modeling.

We'll use the **rice arsenic dataset** (same as in the R tutorial) to demonstrate all concepts with real data.

### Prerequisites

Install the required packages:

In [None]:
import importlib.util
import subprocess
import sys

# List of required packages (NumPy is installed automatically as a pandas dependency)
packages = ['pandas', 'scipy']

# Check and install missing packages
for package in packages:
    if importlib.util.find_spec(package) is None:
        try:
            # Run pip through the current interpreter; pip.main() was removed in pip 10
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
        except subprocess.CalledProcessError:
            print(f"Failed to install {package}.")

# Import packages
import pandas as pd
import numpy as np
from scipy import stats


In [2]:
# Verify package availability
for package in packages:
    print(f"{package} installed: {bool(importlib.util.find_spec(package))}")

pandas installed: True
scipy installed: True


## Data

All datasets used in this exercise can be downloaded from my [Dropbox](https://www.dropbox.com/scl/fo/fohioij7h503duitpl040/h?rlkey=3voumajiklwhgqw75fe8kby3o&dl=0) or [GitHub](https://github.com/zia207/python-colab/tree/main/Data/Python_for_Beginners/Data) accounts.

In [4]:
# Load the dataset directly from GitHub
url = "https://github.com/zia207/python-colab/raw/refs/heads/main/Data/Python_for_Beginners/Data/rice_arsenic_data.csv"
df = pd.read_csv(url)

# Display first few rows to inspect
print(df.head())

   ID  TREAT_ID   TREAT   VAR          PH         TN         PN       ster  \
0   1         1  Low As  BR01  119.748701  16.701608  15.509622   1.121060   
1   2         1  Low As  BR01   98.698244  27.946359  26.738585  11.272871   
2   3         1  Low As  BR01  133.877538   6.416868   2.846243  15.267027   
3   4         1  Low As  BR01  123.007192  20.932223  16.971565   4.953537   
4   5         1  Low As  BR01   89.497158  25.957307  21.515372   3.814338   

          DTM         GY         SW       GAs       STAs  
0  116.688768  43.914848  24.449009  0.862644  15.237639  
1  119.423068  47.813066  30.658419  0.844258  13.369586  
2  121.314026  21.875951  25.888309  1.138247  16.652081  
3  120.924087  48.439764  54.924009  1.044528  20.770175  
4  115.363049  44.404465  57.380661  0.686414  13.670520  


## Central Tendency

**Central tendency** describes the center of a set of data values: a single value that best represents the whole distribution and indicates what a typical observation looks like. **Mean**, **median**, and **mode** are the three most commonly used measures of central tendency.

### Mean

The **mean** is the arithmetic average: sum of all values divided by count.

$$ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} $$

In [5]:
# Overall mean of Grain Yield (GY)
mean_gy = df['GY'].mean()
print(f"Overall Mean Grain Yield: {mean_gy:.4f}")

# Mean by soil treatment group (TREAT)
mean_by_treat = df.groupby('TREAT')['GY'].mean()
print("\nMean Grain Yield by Soil Treatment:")
print(mean_by_treat)

Overall Mean Grain Yield: 28.6647

Mean Grain Yield by Soil Treatment:
TREAT
High As     18.871414
Low As      38.458018
Name: GY, dtype: float64
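
To connect the formula to the pandas call, here is a minimal sketch on a few made-up values (hypothetical, not taken from the rice dataset). It also shows that `mean()` skips missing values by default, which matters when a column has gaps.

```python
import pandas as pd

# Hypothetical yields, made up for illustration (not from the rice dataset)
toy = pd.Series([10.0, 20.0, 30.0, 40.0])

# Apply the formula directly: sum of the values divided by their count
manual_mean = toy.sum() / toy.count()

# pandas computes the same quantity
pandas_mean = toy.mean()
print(manual_mean, pandas_mean)

# With a missing value, mean() ignores the NaN (skipna=True is the default),
# so the divisor is the number of non-missing values, not the length
toy_with_gap = pd.Series([10.0, float("nan"), 30.0])
mean_with_gap = toy_with_gap.mean()
print(mean_with_gap)
```

Passing `skipna=False` would instead return `NaN` whenever a value is missing.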


### Median

The **median** is the middle value when data is sorted. It is robust to outliers.

- Odd n: Middle value
- Even n: Average of two middle values

In [6]:
# Overall median of GY
median_gy = df['GY'].median()
print(f"\nOverall Median Grain Yield: {median_gy:.4f}")

# Median by treatment group
median_by_treat = df.groupby('TREAT')['GY'].median()
print("\nMedian Grain Yield by Soil Treatment:")
print(median_by_treat)


Overall Median Grain Yield: 25.3143

Median Grain Yield by Soil Treatment:
TREAT
High As     19.359651
Low As      40.175253
Name: GY, dtype: float64
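
The odd/even rule and the robustness claim can be checked on a small example. The values below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical values for illustration only
odd_n = pd.Series([3, 1, 7, 5, 9])   # sorted: 1 3 5 7 9
even_n = pd.Series([3, 1, 7, 5])     # sorted: 1 3 5 7

median_odd = odd_n.median()    # middle value of the sorted data
median_even = even_n.median()  # average of the two middle values (3 and 5)
print(median_odd, median_even)

# Replace the largest value with an extreme one: the mean moves a lot,
# while the median stays where it was
with_outlier = pd.Series([3, 1, 7, 5, 900])
print(with_outlier.mean(), with_outlier.median())
```

In this sketch the single extreme value pulls the mean from 5 to about 183, while the median remains 5.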


### Mode

The **mode** is the most frequently occurring value. A dataset can have one mode (unimodal), multiple modes (bimodal, multimodal), or none.

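pandas provides `Series.mode()`, which returns every value tied for the highest count. The cell below instead uses `scipy.stats.mode()` and `value_counts()`. When several values tie, `scipy.stats.mode()` reports the smallest of them, whereas `value_counts().idxmax()` reports the first one it encounters, so the two calls can disagree, as the output below shows.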

In [7]:
# Using scipy.stats.mode (returns mode and count)
mode_result = stats.mode(df['GAs'], keepdims=True)
print(f"\nMode of Grain Arsenic (GAs): {mode_result.mode[0]:.4f} (appears {mode_result.count[0]} times)")

# Alternative: Using pandas value_counts()
mode_pandas = df['GAs'].value_counts().idxmax()
mode_count = df['GAs'].value_counts().max()
print(f"Mode (using value_counts()): {mode_pandas:.4f} (count: {mode_count})")

# Check if multiple modes exist
value_counts = df['GAs'].value_counts()
modes = value_counts[value_counts == value_counts.max()].index.tolist()
if len(modes) > 1:
    print(f"Multiple modes detected: {modes}")
else:
    print("Unimodal dataset.")


Mode of Grain Arsenic (GAs): 0.5185 (appears 1 times)
Mode (using value_counts()): 0.8626 (count: 1)
Multiple modes detected: [0.862643879, 0.844258385, 1.138247093, 1.044528241, 0.68641388, 0.9225152, 1.302452808, 0.985652136, 1.139609472, 1.032997333, 0.711062061, 1.098613826, 1.118978628, 1.290582104, 0.706931607, 0.716082587, 0.995642265, 0.588732061, 1.042509245, 0.793278894, 1.030646539, 1.116202534, 0.996918642, 0.713206393, 0.795331639, 1.106555281, 0.750482555, 0.944611454, 0.88576913, 0.611179795, 1.029925896, 1.340014533, 0.685824516, 1.209130237, 1.101616079, 0.971423029, 1.336741548, 0.993895768, 0.866177598, 0.820737172, 1.064259831, 0.921350393, 0.926300101, 0.660773822, 0.872275622, 1.096326588, 1.199063986, 0.833851119, 1.161143832, 0.742567324, 0.89843001, 0.61145101, 0.9712233, 0.732986007, 1.013240029, 1.014000388, 0.976498636, 1.060269952, 1.211988075, 1.198825168, 1.118122344, 0.518545632, 0.884259688, 1.161992621, 1.075704098, 0.991279215, 0.94379689, 0.71071837
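
The output above shows every `GAs` value occurring exactly once, which is typical for a continuous measurement, so the raw mode carries little information. One common workaround, sketched here on hypothetical values rather than the dataset, is to round (or bin) the values before counting. The choice of one decimal place is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical continuous measurements (not taken from the dataset)
values = pd.Series([0.861, 0.874, 0.912, 0.858, 1.041, 1.038, 0.866])

# Every raw value is unique, so Series.mode() returns all of them
raw_modes = values.mode()
print(len(raw_modes))

# Rounding to one decimal groups nearby values and yields a usable mode
rounded_modes = values.round(1).mode()
print(rounded_modes.tolist())
```

The rounding precision controls how coarse the grouping is, so it should reflect the measurement precision that is meaningful for the variable.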

## Range

The **range** is the difference between the maximum and minimum values. It gives a quick sense of spread.

$$ \text{Range} = \max(X) - \min(X) $$

In [8]:
# Calculate range using min and max
range_gy = df['GY'].max() - df['GY'].min()
print(f"\nRange of Grain Yield: {range_gy:.4f}")

# Alternative: using NumPy's ptp() ("peak to peak") on the underlying array
range_gy_np = np.ptp(df['GY'].to_numpy())
print(f"Range (using np.ptp()): {range_gy_np:.4f}")


Range of Grain Yield: 60.5873
Range (using np.ptp()): 60.5873


## Variance

**Variance** is the average squared deviation of the values from their mean. A higher variance means the values are more spread out.

### Sample Variance (most common in practice):

$$ s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} $$

In [9]:
# Sample variance (default in pandas)
var_gy = df['GY'].var()
print(f"\nSample Variance of Grain Yield: {var_gy:.4f}")

# Population variance (divide by n instead of n-1)
pop_var_gy = df['GY'].var(ddof=0)  # ddof=0 for population
print(f"Population Variance: {pop_var_gy:.4f}")

# Manual calculation (for verification)
mean_gy_manual = df['GY'].mean()
squared_diffs = (df['GY'] - mean_gy_manual)**2
manual_var = squared_diffs.sum() / (df['GY'].count() - 1)  # sample variance; count() ignores NaN
print(f"Manual sample variance: {manual_var:.4f}")


Sample Variance of Grain Yield: 180.5147
Population Variance: 179.2253
Manual sample variance: 180.5147
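
A frequent source of confusion is that pandas and NumPy use different defaults: `Series.var()` and `Series.std()` divide by n - 1 (`ddof=1`), while `np.var()` and `np.std()` divide by n (`ddof=0`). The sketch below uses hypothetical values to show the difference and the relationship between the two estimates.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.count()

sample_var = x.var()                    # pandas default: ddof=1, divides by n - 1
population_var = np.var(x.to_numpy())   # NumPy default: ddof=0, divides by n
population_std = np.std(x.to_numpy())   # square root of the population variance

print(sample_var, population_var, population_std)

# The two variances differ only by the factor (n - 1) / n
rescaled = sample_var * (n - 1) / n
print(np.isclose(rescaled, population_var))
```

Passing `ddof` explicitly in both libraries avoids silently mixing the two definitions.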


## Standard Deviation

**Standard Deviation** is the square root of variance. It's expressed in the same units as the original data, making it more interpretable.

$$ s = \sqrt{s^2} $$

In [10]:
# Sample standard deviation
std_gy = df['GY'].std()
print(f"\nSample Standard Deviation of Grain Yield: {std_gy:.4f}")

# Population standard deviation
pop_std_gy = df['GY'].std(ddof=0)
print(f"Population Standard Deviation: {pop_std_gy:.4f}")

# Verify: sqrt(variance)
import math
print(f"Verification (sqrt(var)): {math.sqrt(var_gy):.4f}")


Sample Standard Deviation of Grain Yield: 13.4356
Population Standard Deviation: 13.3875
Verification (sqrt(var)): 13.4356


## Quantiles

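Quantiles are cut points that split the sorted data into groups holding equal shares of the observations. Common ones include quartiles (Q1, Q2, Q3; four groups), deciles (ten groups), and percentiles (one hundred groups).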

In [11]:
# Specific quantiles
q25 = df['GY'].quantile(0.25)
q50 = df['GY'].quantile(0.50)  # same as median
q75 = df['GY'].quantile(0.75)

print(f"\nQuantiles of Grain Yield:")
print(f"25th percentile (Q1): {q25:.4f}")
print(f"50th percentile (Median): {q50:.4f}")
print(f"75th percentile (Q3): {q75:.4f}")

# All quartiles at once
quartiles = df['GY'].quantile([0.25, 0.5, 0.75])
print(f"\nAll Quartiles:\n{quartiles}")

# Deciles (10% intervals)
deciles = df['GY'].quantile(np.linspace(0, 1, 11))  # linspace includes both endpoints exactly
print(f"\nDeciles (0% to 100% in 10% steps):\n{deciles}")


Quantiles of Grain Yield:
25th percentile (Q1): 18.7201
50th percentile (Median): 25.3143
75th percentile (Q3): 39.9541

All Quartiles:
0.25    18.720147
0.50    25.314346
0.75    39.954118
Name: GY, dtype: float64

Deciles (0% to 100% in 10% steps):
0.0     4.749097
0.1    13.766805
0.2    16.922771
0.3    19.706930
0.4    21.885601
0.5    25.314346
0.6    29.763025
0.7    35.715022
0.8    43.021497
0.9    47.105559
1.0    65.336384
Name: GY, dtype: float64
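
A quantile usually falls between two observations, so the result depends on how that gap is filled. `Series.quantile()` interpolates linearly by default (the same rule as R's default `quantile()` type 7), and its `interpolation` parameter selects other rules. The sketch below uses hypothetical values to show the effect.

```python
import pandas as pd

# Hypothetical values for illustration only
s = pd.Series([1, 2, 3, 4])

# Default: linear interpolation between the two nearest observations.
# The 0.25 quantile sits at position (n - 1) * 0.25 = 0.75, i.e. 75% of
# the way from the first value (1) to the second value (2).
q1_linear = s.quantile(0.25)

# Alternatives return an actual data point instead of interpolating
q1_lower = s.quantile(0.25, interpolation="lower")
q1_higher = s.quantile(0.25, interpolation="higher")

print(q1_linear, q1_lower, q1_higher)
```

On large samples the rules give nearly identical answers; the choice mostly matters for small datasets.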


## Interquartile Range (IQR)

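The **IQR** is the difference between the third (Q3) and first (Q1) quartiles, which is the spread of the middle 50% of the data. Because it ignores the tails, it is far less sensitive to outliers than the range or the standard deviation.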

$$ \text{IQR} = Q3 - Q1 $$

In [12]:
# Method 1: Subtract quantiles manually
iqr_value = q75 - q25
print(f"\nInterquartile Range (IQR): {iqr_value:.4f}")

# Method 2: Using scipy.stats.iqr (recommended)
from scipy.stats import iqr
iqr_scipy = iqr(df['GY'])
print(f"IQR (using scipy.stats.iqr): {iqr_scipy:.4f}")

# IQR is often used to detect outliers:
lower_bound = q25 - 1.5 * iqr_value
upper_bound = q75 + 1.5 * iqr_value
outliers = df[(df['GY'] < lower_bound) | (df['GY'] > upper_bound)]
print(f"\nNumber of potential outliers (using 1.5*IQR rule): {len(outliers)}")


Interquartile Range (IQR): 21.2340
IQR (using scipy.stats.iqr): 21.2340

Number of potential outliers (using 1.5*IQR rule): 0
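
Since the rule flags nothing in the grain yield data, the sketch below applies the same 1.5 × IQR fences to hypothetical values that include one deliberately extreme observation. It also contrasts the IQR with the range to illustrate the robustness claim.

```python
import pandas as pd

# Hypothetical values with one deliberately extreme observation (50)
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 50])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr_value = q3 - q1

# Fences: anything outside [lower, upper] is flagged as a potential outlier
lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value
flagged = s[(s < lower) | (s > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr_value}")
print(f"Fences: [{lower}, {upper}]")
print(f"Flagged: {flagged.tolist()}")

# The range is driven by the extreme value; the IQR is not
print(f"Range={s.max() - s.min()}")
```

Flagged values are candidates for inspection rather than automatic removal; whether they are errors or genuine extremes depends on the data.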


## Summary Statistics

In [15]:
# Create a summary table for GY column
summary_table = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Min', 'Max', 'Range', 'Variance', 'Std Dev', 'Q1', 'Q3', 'IQR'],
    'Value': [
        df['GY'].mean(),
        df['GY'].median(),
        df['GY'].min(),
        df['GY'].max(),
        df['GY'].max() - df['GY'].min(),
        df['GY'].var(),
        df['GY'].std(),
        df['GY'].quantile(0.25),
        df['GY'].quantile(0.75),
        iqr(df['GY'])
    ]
})

print("\n" + "="*50)
print("SUMMARY STATISTICS FOR GRAIN YIELD (GY)")
print("="*50)
print(summary_table.round(4))


SUMMARY STATISTICS FOR GRAIN YIELD (GY)
  Statistic     Value
0      Mean   28.6647
1    Median   25.3143
2       Min    4.7491
3       Max   65.3364
4     Range   60.5873
5  Variance  180.5147
6   Std Dev   13.4356
7        Q1   18.7201
8        Q3   39.9541
9       IQR   21.2340
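
pandas can also produce most of this table in one call. The sketch below uses a hypothetical miniature of the rice data (the column names `TREAT` and `GY` match the dataset, but the values are made up) to show `describe()` for an overall summary and `groupby().agg()` for per-treatment statistics.

```python
import pandas as pd

# Hypothetical miniature of the rice data: two treatments, made-up yields
toy = pd.DataFrame({
    "TREAT": ["Low As", "Low As", "Low As", "High As", "High As", "High As"],
    "GY": [38.0, 40.0, 42.0, 17.0, 19.0, 21.0],
})

# describe() reports count, mean, std, min, quartiles, and max in one call
overall = toy["GY"].describe()
print(overall)

# agg() with a list of names computes several statistics per group
by_treat = toy.groupby("TREAT")["GY"].agg(["mean", "median", "std", "min", "max"])
print(by_treat)
```

Note that `describe()` does not report the variance, range, or IQR, so the manual table is still useful when those are needed.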


## Summary and Conclusion

In this tutorial, we've covered the core components of **descriptive statistics** in Python:

| Concept            | Python Function                                          |
|--------------------|----------------------------------------------------------|
| Mean               | `df['col'].mean()`                                       |
| Median             | `df['col'].median()`                                     |
| Mode               | `df['col'].mode()`, `stats.mode()`, or `value_counts()`  |
| Range              | `max() - min()` or `np.ptp()`                            |
| Variance           | `df['col'].var()`                                        |
| Standard Deviation | `df['col'].std()`                                        |
| Quantiles          | `df['col'].quantile(0.x)`                                |
| IQR                | `scipy.stats.iqr(df['col'])`                             |


**Why this matters**: Descriptive statistics are the foundation of exploratory data analysis (EDA). They help you:
- Spot anomalies
- Understand data distribution
- Choose appropriate models
- Communicate findings clearly

Always visualize alongside numerical summaries (e.g., histograms, boxplots) for full insight.

## Resources 

Here are excellent resources to deepen your understanding:

1. **[Python for Data Analysis](https://wesmckinney.com/book/) by Wes McKinney** – Written by the creator of pandas; covers descriptive statistics and data aggregation.
2. **[Data Visualization and Descriptive Statistics with Python](https://www.datacamp.com/courses/introduction-to-data-visualization-with-python)** – Interactive course on DataCamp.
3. **[Real Python: Descriptive Statistics with Python](https://realpython.com/python-descriptive-statistics/)** – Clear examples with NumPy, SciPy, and pandas.
4. **[Kaggle Learn: Data Visualization & Descriptive Stats](https://www.kaggle.com/learn/data-visualization)** – Hands-on mini-courses.
5. **[Introduction to Statistics with R (adapted for Python)](https://rafalab.dfci.harvard.edu/dsbook/introduction-to-statistics-with-r.html)** – Though R-focused, the concepts translate directly. Use this guide as your conceptual reference.