# Normal Distribution & Z-Scores

**Module: Descriptive & Inferential Statistics**

## Learning Objectives
- Understand properties of the normal distribution
- Calculate and interpret z-scores
- Use scipy.stats for probability calculations
- Standardize data and compare across different scales

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

---
## Quick Refresher

### The Normal Distribution
- Bell-shaped, symmetric around the mean
- Defined by two parameters: **μ (mean)** and **σ (standard deviation)**
- ~68% of data within 1σ, ~95% within 2σ, ~99.7% within 3σ (Empirical Rule)
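
The Empirical Rule percentages can be checked directly from the normal CDF. The sketch below uses the standard normal, since the rule is stated in units of σ and so holds for any μ and σ; the variable name `within` is only illustrative.

```python
from scipy import stats

# Probability mass within k standard deviations of the mean (standard normal).
# P(-k < Z < k) = cdf(k) - cdf(-k)
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"Within {k} sigma: {p:.4f}")
```

The printed values should land close to 0.68, 0.95, and 0.997, which is where the rounded figures in the rule come from.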

### Z-Score
The number of standard deviations a value is from the mean:

$$z = \frac{x - \mu}{\sigma}$$

- z = 0 → value equals the mean
- z = 1 → value is 1 std dev above mean
- z = -2 → value is 2 std devs below mean
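
As a quick illustration of the formula, the sketch below computes a z-score and then inverts it with x = μ + zσ. The helper names `z_score` and `from_z` are hypothetical, and the values μ = 72, σ = 10 are chosen only for illustration.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from mu."""
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """Invert the z-score formula to recover the raw value."""
    return mu + z * sigma


mu, sigma = 72, 10            # illustrative parameters
z = z_score(85, mu, sigma)    # (85 - 72) / 10
x_back = from_z(z, mu, sigma)

print(f"z = {z:.2f}, recovered x = {x_back:.1f}")
```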

---
## Working Example: Test Scores

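Exam scores are often approximately normally distributed, so this example simulates 1,000 scores from a normal distribution with mean 72 and standard deviation 10.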

In [None]:
# Generate normally distributed test scores
# Mean = 72, Std Dev = 10
scores = np.random.normal(loc=72, scale=10, size=1000)
scores = np.clip(scores, 0, 100)  # Keep within 0-100

print(f"Mean: {scores.mean():.2f}")
print(f"Std Dev: {scores.std():.2f}")

In [None]:
# Visualize the distribution
plt.figure(figsize=(10, 4))
plt.hist(scores, bins=30, density=True, alpha=0.7, edgecolor='black')

# Overlay theoretical normal curve
x = np.linspace(scores.min(), scores.max(), 100)
plt.plot(x, stats.norm.pdf(x, scores.mean(), scores.std()), 'r-', linewidth=2)

plt.xlabel('Score')
plt.ylabel('Density')
plt.title('Distribution of Test Scores')
plt.show()

### Calculating Z-Scores

In [None]:
# Manual z-score calculation
score = 85
z = (score - scores.mean()) / scores.std()
print(f"A score of {score} has z-score: {z:.2f}")
print(f"This is {abs(z):.2f} standard deviations {'above' if z > 0 else 'below'} the mean")

In [None]:
# Using scipy.stats.zscore for entire dataset
z_scores = stats.zscore(scores)

print(f"Z-scores: mean = {z_scores.mean():.4f}, std = {z_scores.std():.4f}")
print(f"\nFirst 5 scores: {scores[:5]}")
print(f"Their z-scores: {z_scores[:5]}")
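
One detail worth keeping in mind when mixing libraries: `stats.zscore` and `np.std` default to `ddof=0` (population standard deviation), while `pandas.Series.std` defaults to `ddof=1` (sample standard deviation). The sketch below shows the difference on a small illustrative sample; the data and the seed are assumptions, not the notebook's `scores`.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # illustrative seed
sample = rng.normal(loc=72, scale=10, size=20)

z_default = stats.zscore(sample)         # ddof=0: divides by the population std
z_sample = stats.zscore(sample, ddof=1)  # ddof=1: divides by the sample std

print(f"np.std (ddof=0):        {np.std(sample):.4f}")
print(f"pd.Series.std (ddof=1): {pd.Series(sample).std():.4f}")
print(f"Largest |z| difference: {np.abs(z_default - z_sample).max():.4f}")
```

With 1,000 observations the two conventions nearly coincide, but on small samples the gap is visible, so it helps to pick one and use it consistently.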

### Probability Calculations with scipy.stats

In [None]:
# Create a normal distribution object
score_dist = stats.norm(loc=72, scale=10)

# What percentage scored below 60?
prob_below_60 = score_dist.cdf(60)
print(f"P(X < 60) = {prob_below_60:.4f} or {prob_below_60*100:.2f}%")

# What percentage scored above 85?
prob_above_85 = 1 - score_dist.cdf(85)
print(f"P(X > 85) = {prob_above_85:.4f} or {prob_above_85*100:.2f}%")

# What percentage scored between 65 and 80?
prob_between = score_dist.cdf(80) - score_dist.cdf(65)
print(f"P(65 < X < 80) = {prob_between:.4f} or {prob_between*100:.2f}%")
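
The probabilities above come from the theoretical distribution, not from the simulated data. A minimal sketch of how the two compare, assuming a freshly drawn sample with the same parameters (the seed and variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # illustrative seed
sample = rng.normal(loc=72, scale=10, size=1000)
dist = stats.norm(loc=72, scale=10)

theoretical = dist.cdf(60)        # P(X < 60) under the model
empirical = np.mean(sample < 60)  # fraction of simulated scores below 60

print(f"Theoretical P(X < 60): {theoretical:.4f}")
print(f"Empirical fraction:    {empirical:.4f}")
```

The two numbers should be close but not identical; the gap is sampling variation and shrinks as the sample size grows.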

In [None]:
# Inverse: What score is at the 90th percentile?
score_90th = score_dist.ppf(0.90)
print(f"90th percentile score: {score_90th:.2f}")

# What score separates the bottom 25%?
score_25th = score_dist.ppf(0.25)
print(f"25th percentile score: {score_25th:.2f}")
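
`ppf` is the inverse of `cdf`, and SciPy also provides `sf` (survival function, 1 - cdf) and `isf` (its inverse) for upper-tail questions. The sketch below checks these relationships on the same N(72, 10) distribution; the variable names are illustrative.

```python
from scipy import stats

dist = stats.norm(loc=72, scale=10)

upper_tail = dist.sf(85)       # P(X > 85), same as 1 - dist.cdf(85)
p90 = dist.ppf(0.90)           # score at the 90th percentile
round_trip = dist.cdf(p90)     # feeding it back into cdf returns 0.90
top10_cutoff = dist.isf(0.10)  # score exceeded by the top 10%, same as ppf(0.90)

print(f"P(X > 85) via sf:  {upper_tail:.4f}")
print(f"cdf(ppf(0.90)):    {round_trip:.2f}")
print(f"isf(0.10):         {top10_cutoff:.2f}")
```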

### Standard Normal Distribution (Z-table equivalent)

In [None]:
# Standard normal: mean=0, std=1
standard_normal = stats.norm(0, 1)

# Common z-score lookups
print("Standard Normal Probabilities:")
for z_val in [-2, -1, 0, 1, 2]:  # z_val avoids overwriting the earlier `z`
    print(f"P(Z < {z_val:2d}) = {standard_normal.cdf(z_val):.4f}")
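
The standard normal works as a z-table because standardizing a value does not change its probability: P(X < x) for N(μ, σ) equals P(Z < z) with z = (x - μ)/σ. A minimal sketch of that equivalence, using the test-score parameters as illustrative values:

```python
from scipy import stats

mu, sigma = 72, 10  # illustrative parameters
x = 60
z = (x - mu) / sigma  # standardize the raw value

p_direct = stats.norm(loc=mu, scale=sigma).cdf(x)  # probability on the original scale
p_standard = stats.norm.cdf(z)                     # same probability via the standard normal

print(f"z = {z:.2f}")
print(f"P(X < {x}) = {p_direct:.4f}")
print(f"P(Z < {z:.2f}) = {p_standard:.4f}")
```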

---
## Exercises

### Exercise 1: Employee Performance Scores

A company's annual performance reviews follow a normal distribution with mean 75 and std dev 8.

In [None]:
# Performance distribution
perf_mean = 75
perf_std = 8

# TODO: Create a scipy.stats normal distribution object
perf_dist = None  # Your code here

In [None]:
# TODO: An employee scored 88. Calculate their z-score.
# What percentile are they in?



In [None]:
# TODO: The company wants to give bonuses to the top 15% of performers.
# What's the minimum score needed for a bonus?



In [None]:
# TODO: What percentage of employees score between 70 and 85?



### Exercise 2: Comparing Across Different Scales

Compare performance across two tests with different scales using z-scores.

In [None]:
# Test A: mean=500, std=100
# Test B: mean=25, std=5

# Alice scored 650 on Test A
# Bob scored 33 on Test B

# TODO: Who performed better relative to their peers?
# Calculate z-scores for both and compare.



In [None]:
# TODO: What percentile is each person in?



### Exercise 3: Quality Control

A manufacturing process produces bolts with diameter normally distributed: μ=10mm, σ=0.2mm. Bolts outside 9.6mm to 10.4mm are rejected.

In [None]:
bolt_mean = 10
bolt_std = 0.2

# TODO: What percentage of bolts will be rejected?



In [None]:
# TODO: If 10,000 bolts are produced, how many will be rejected?



In [None]:
# TODO: Management wants to reduce rejection rate to 1%.
# If they can't change the tolerance limits, what std dev would they need?
# Hint: Work backwards from the z-score needed for 0.5% in each tail



### Exercise 4: Standardizing a Dataset

In [None]:
# Customer metrics with different scales
customers = pd.DataFrame({
    'customer_id': range(1, 51),
    'purchase_amount': np.random.normal(150, 40, 50),      # dollars
    'visits_per_month': np.random.normal(8, 3, 50),        # count
    'satisfaction_score': np.random.normal(4.2, 0.5, 50)   # nominally 1-5; unclipped draws can exceed 5
})

customers.head()

In [None]:
# TODO: Standardize all three numeric columns (convert to z-scores)
# Add new columns: 'purchase_z', 'visits_z', 'satisfaction_z'



In [None]:
# TODO: Create a composite score by averaging the three z-scores
# Who are the top 5 customers overall?



In [None]:
# TODO: Find any customers with z-score > 2 or < -2 in any metric
# These are unusual (outliers)



### Exercise 5: Checking for Normality

In [None]:
# Three different datasets
data_normal = np.random.normal(50, 10, 500)
data_skewed = np.random.exponential(10, 500)
data_bimodal = np.concatenate([np.random.normal(30, 5, 250), np.random.normal(70, 5, 250)])

In [None]:
# TODO: For each dataset:
# 1. Plot a histogram
# 2. Calculate skewness (use stats.skew)
# 3. Perform Shapiro-Wilk test (use stats.shapiro)
# 4. Determine if it's approximately normal



---
## Solutions

In [None]:
# Exercise 1 Solutions

perf_dist = stats.norm(loc=75, scale=8)

# Z-score for 88
z_88 = (88 - 75) / 8
percentile_88 = perf_dist.cdf(88)
print(f"Score of 88: z = {z_88:.2f}, percentile = {percentile_88*100:.1f}%")

# Minimum for top 15%
min_bonus = perf_dist.ppf(0.85)
print(f"Minimum score for bonus (top 15%): {min_bonus:.2f}")

# Between 70 and 85
pct_70_85 = perf_dist.cdf(85) - perf_dist.cdf(70)
print(f"Percentage between 70-85: {pct_70_85*100:.1f}%")

In [None]:
# Exercise 2 Solutions

# Z-scores
z_alice = (650 - 500) / 100
z_bob = (33 - 25) / 5

print(f"Alice's z-score: {z_alice:.2f}")
print(f"Bob's z-score: {z_bob:.2f}")
print(f"\n{'Bob' if z_bob > z_alice else 'Alice'} performed better relative to peers")

# Percentiles
print(f"\nAlice's percentile: {stats.norm.cdf(z_alice)*100:.1f}%")
print(f"Bob's percentile: {stats.norm.cdf(z_bob)*100:.1f}%")

In [None]:
# Exercise 3 Solutions

bolt_dist = stats.norm(10, 0.2)

# Rejection rate
reject_rate = bolt_dist.cdf(9.6) + (1 - bolt_dist.cdf(10.4))
print(f"Rejection rate: {reject_rate*100:.2f}%")

# Out of 10,000
print(f"Rejected out of 10,000: {round(reject_rate * 10000)}")  # round, not truncate

# Required std dev for 1% rejection
# 0.5% in each tail means z = 2.576
# (10.4 - 10) / new_std = 2.576
z_for_0_5_pct = stats.norm.ppf(0.995)
new_std = 0.4 / z_for_0_5_pct
print(f"\nRequired std dev for 1% rejection: {new_std:.4f}mm")

In [None]:
# Exercise 4 Solutions

# Standardize columns
customers['purchase_z'] = stats.zscore(customers['purchase_amount'])
customers['visits_z'] = stats.zscore(customers['visits_per_month'])
customers['satisfaction_z'] = stats.zscore(customers['satisfaction_score'])

# Composite score
customers['composite_z'] = (customers['purchase_z'] + customers['visits_z'] + customers['satisfaction_z']) / 3

# Top 5 customers
print("Top 5 customers:")
print(customers.nlargest(5, 'composite_z')[['customer_id', 'composite_z']])

In [None]:
# Outliers
z_cols = ['purchase_z', 'visits_z', 'satisfaction_z']
outliers = customers[(customers[z_cols].abs() > 2).any(axis=1)]
print("\nCustomers with unusual metrics (|z| > 2):")
print(outliers[['customer_id'] + z_cols])

In [None]:
# Exercise 5 Solutions

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, data, title in zip(axes, 
                           [data_normal, data_skewed, data_bimodal],
                           ['Normal', 'Skewed', 'Bimodal']):
    ax.hist(data, bins=30, edgecolor='black', alpha=0.7)
    ax.set_title(f"{title}\nSkew: {stats.skew(data):.2f}")
    
    # Shapiro-Wilk test
    # Test the full sample: data[:100] would cover only the first mode of data_bimodal
    stat, p = stats.shapiro(data)
    ax.set_xlabel(f"Shapiro p={p:.4f}")

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Normal: low skew, high p-value → no evidence against normality")
print("- Skewed: high positive skew, low p-value → not normal")
print("- Bimodal: may have low skew but low p-value → not normal")
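
As an optional complement to the histogram and Shapiro-Wilk test, a Q-Q comparison can be summarized numerically with `stats.probplot`, which returns the correlation `r` between the ordered data and the theoretical normal quantiles. The sketch below regenerates three datasets of the same shapes with an assumed seed, so the values will not match the notebook's own arrays.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustrative seed
samples = {
    "normal": rng.normal(50, 10, 500),
    "skewed": rng.exponential(10, 500),
    "bimodal": np.concatenate([rng.normal(30, 5, 250), rng.normal(70, 5, 250)]),
}

r_values = {}
for name, data in samples.items():
    # probplot returns ((theoretical quantiles, ordered data), (slope, intercept, r))
    (_, _), (slope, intercept, r) = stats.probplot(data, dist="norm")
    r_values[name] = r
    print(f"{name:8s} Q-Q correlation r = {r:.4f}")
```

An `r` very close to 1 means the points fall near a straight line on the Q-Q plot; the skewed and bimodal samples should score visibly lower than the normal one.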