# Sklearn Statistics - Part 1: Descriptive Statistics

This notebook covers computing descriptive statistics using NumPy, Pandas, and SciPy.

**Topics covered:**
- Central tendency (mean, median, mode)
- Dispersion (variance, std, range)
- Percentiles and quartiles
- Skewness and kurtosis

**Problems:** 18 (Easy: 0-6, Medium: 7-12, Hard: 13-17)

In [None]:
# ============================================
# SETUP - Run this cell first!
# ============================================
import sys
sys.path.insert(0, '..')
from utils.checker import check
from utils.checks import sklearn_01_descriptive_stats as verify

print("Checker loaded! Now import the libraries you need.")

---
## Problem 0: Import Required Libraries
**Difficulty:** Easy

### Concept
Before performing statistical analysis, you need to import the necessary libraries. NumPy provides numerical operations, Pandas handles data structures, and SciPy offers advanced statistical functions.

### Syntax
```python
import numpy as np
import pandas as pd
from scipy import stats
```

### Task
Import the following libraries with their standard aliases:
- NumPy as `np`
- Pandas as `pd`
- `stats` from scipy

### Expected Properties
- `np` should be the numpy module
- `pd` should be the pandas module
- `stats` should be the scipy.stats module

In [None]:
# Your solution:


In [None]:
# Verification
verify.p0(globals())

---
## Problem 1: Calculate Mean
**Difficulty:** Easy

### Concept
The mean (average) is a measure of central tendency. It's calculated by summing all values and dividing by the count of values. The mean is sensitive to outliers.

### Syntax
```python
# NumPy
np.mean(array)
array.mean()

# Pandas
df['column'].mean()
```

### Example
```python
>>> data = np.array([2, 4, 6, 8, 10])
>>> np.mean(data)
6.0
```

### Task
Calculate the mean of the data array `[10, 20, 30, 40, 50]`. Store the result in `mean_val`.

### Expected Properties
- `mean_val` should be a numeric value (int or float)
- The mean should be the middle value of this evenly-spaced sequence

In [None]:
# Your solution:
data = np.array([10, 20, 30, 40, 50])
mean_val = None

In [None]:
# Verification
verify.p1(mean_val)

---
## Problem 2: Calculate Median
**Difficulty:** Easy

### Concept
The median is the middle value when data is sorted. If there's an even number of values, it's the average of the two middle values. Unlike the mean, the median is robust to outliers.

### Syntax
```python
# NumPy
np.median(array)

# Pandas
df['column'].median()
```

### Example
```python
>>> np.median([1, 2, 3, 4, 5])
3.0
>>> np.median([1, 2, 3, 4])  # Even number: average of middle two
2.5
```
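
The robustness claim above can be checked with a small sketch. The two arrays are made-up values (not part of the exercises); the second replaces the largest entry with an artificial outlier.

```python
import numpy as np

# Hypothetical values; the last entry of `with_outlier` is an artificial outlier.
clean = np.array([30, 32, 35, 38, 40])
with_outlier = np.array([30, 32, 35, 38, 400])

# How far each statistic moves when the outlier is introduced
mean_shift = np.mean(with_outlier) - np.mean(clean)
median_shift = np.median(with_outlier) - np.median(clean)

print(mean_shift)    # 72.0
print(median_shift)  # 0.0
```

The mean moves by 72 while the median does not move at all, which is why the median is often preferred for heavily skewed data.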

### Task
Calculate the median of `[1, 3, 5, 7, 9, 11]`. Store the result in `median_val`.

### Expected Properties
- `median_val` should be a numeric value
- For an even-length array, median is the average of the two middle values

In [None]:
# Your solution:
data = np.array([1, 3, 5, 7, 9, 11])
median_val = None

In [None]:
# Verification
verify.p2(median_val)

---
## Problem 3: Calculate Mode
**Difficulty:** Easy

### Concept
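The mode is the most frequently occurring value in a dataset. A dataset can have one mode (unimodal) or several (bimodal/multimodal); if all values are unique, no single value stands out as the mode. Note that `scipy.stats.mode()` always reports a single value, even when several values tie.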

### Syntax
```python
# SciPy (returns ModeResult object)
from scipy import stats
result = stats.mode(array, keepdims=True)
mode_value = result.mode[0]  # Extract the mode value
```

### Example
```python
>>> data = [1, 1, 2, 3, 3, 3, 4]
>>> result = stats.mode(data, keepdims=True)
>>> result.mode[0]
3
```
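
When several values tie for the highest count, a single reported mode hides the others. One possible way to list every mode, sketched here with `np.unique` on made-up data (an alternative approach, not what the task below asks for):

```python
import numpy as np

values = np.array([1, 1, 2, 3, 3, 4])  # hypothetical data: 1 and 3 both appear twice

# Count occurrences of each distinct value
uniques, counts = np.unique(values, return_counts=True)

# Keep every value whose count equals the highest count
all_modes = uniques[counts == counts.max()]
print(all_modes)  # [1 3]
```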

### Task
Find the mode of `[1, 2, 2, 3, 3, 3, 4]` using `scipy.stats.mode()`. Store the mode value (not the ModeResult object) in `mode_val`.

### Expected Properties
- `mode_val` should be a single numeric value
- It should be the value that appears most frequently in the array

In [None]:
# Your solution:
data = np.array([1, 2, 2, 3, 3, 3, 4])
mode_val = None

In [None]:
# Verification
verify.p3(mode_val)

---
## Problem 4: Calculate Variance
**Difficulty:** Easy

### Concept
Variance measures the spread of data points around the mean. Population variance is the average of the squared deviations from the mean (sum divided by `n`). Sample variance divides the same sum by `n-1` (Bessel's correction) to compensate for estimating the mean from the sample.

### Syntax
```python
# Population variance (ddof=0, default)
np.var(array)

# Sample variance (ddof=1)
np.var(array, ddof=1)
```

### Example
```python
>>> data = [2, 4, 6, 8]
>>> np.var(data, ddof=1)  # Sample variance
6.666...
```
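
To make the role of `ddof` concrete, here is a minimal sketch that computes both variants by hand and compares them with `np.var`. The variable names are illustrative only.

```python
import numpy as np

data = np.array([2, 4, 6, 8])

# Sum of squared deviations from the mean
deviations = data - data.mean()
sum_sq = np.sum(deviations ** 2)

pop_var = sum_sq / len(data)           # divide by n (ddof=0)
sample_var = sum_sq / (len(data) - 1)  # divide by n - 1 (ddof=1)

# Both hand-computed values should agree with NumPy
assert np.isclose(pop_var, np.var(data))
assert np.isclose(sample_var, np.var(data, ddof=1))
print(pop_var)  # 5.0
```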

### Task
Calculate the **sample variance** of `[2, 4, 4, 4, 5, 5, 7, 9]`. Use `ddof=1` for sample variance. Store in `var_val`.

### Expected Properties
- `var_val` should be a positive numeric value
- Sample variance should be greater than 0 for non-constant data
- Value should be approximately 4-5

In [None]:
# Your solution:
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
var_val = None

In [None]:
# Verification
verify.p4(var_val)

---
## Problem 5: Calculate Range
**Difficulty:** Easy

### Concept
The range is the simplest measure of dispersion. It's the difference between the maximum and minimum values. While easy to compute, it's very sensitive to outliers.

### Syntax
```python
range_val = np.max(array) - np.min(array)
# Or
range_val = np.ptp(array)  # Peak-to-peak
```

### Example
```python
>>> data = [10, 15, 20, 25, 30]
>>> np.max(data) - np.min(data)
20
```

### Task
Calculate the range of `[5, 10, 15, 20, 25]`. Store in `range_val`.

### Expected Properties
- `range_val` should be a non-negative numeric value
- It should equal max - min

In [None]:
# Your solution:
data = np.array([5, 10, 15, 20, 25])
range_val = None

In [None]:
# Verification
verify.p5(range_val)

---
## Problem 6: Calculate Standard Deviation
**Difficulty:** Easy

### Concept
Standard deviation is the square root of variance. It measures dispersion in the same units as the original data, making it more interpretable than variance. Like variance, sample std uses `ddof=1`.

### Syntax
```python
# Population std
np.std(array)

# Sample std (ddof=1)
np.std(array, ddof=1)
```

### Example
```python
>>> data = [10, 20, 30, 40, 50]
>>> np.std(data, ddof=1)
15.811...
```

### Task
Calculate the **sample standard deviation** of `[10, 12, 23, 23, 16, 23, 21, 16]`. Use `ddof=1`. Store in `std_val`.

### Expected Properties
- `std_val` should be a positive numeric value
- Value should be approximately 5-6

In [None]:
# Your solution:
data = np.array([10, 12, 23, 23, 16, 23, 21, 16])
std_val = None

In [None]:
# Verification
verify.p6(std_val)

---
## Problem 7: Calculate Quartiles
**Difficulty:** Medium

### Concept
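Quartiles divide sorted data into four equal parts. Q1 is the 25th percentile, Q2 (50th percentile) is the median, and Q3 is the 75th percentile. Several conventions exist for computing them: a common textbook rule takes Q1 and Q3 as the medians of the lower and upper halves, while `np.percentile` interpolates linearly between data points by default, so the two can differ slightly.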

### Syntax
```python
# NumPy percentile
q1 = np.percentile(array, 25)
q2 = np.percentile(array, 50)  # Same as median
q3 = np.percentile(array, 75)

# Or use quantile (0-1 scale)
q1 = np.quantile(array, 0.25)
```

### Example
```python
>>> data = np.arange(1, 11)  # [1, 2, ..., 10]
>>> np.percentile(data, 25)
3.25
```
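
Different quartile conventions can give different answers on the same data. The sketch below compares a "median of halves" rule with NumPy's default linear interpolation; the half-splitting shown here assumes an even-length, already sorted array.

```python
import numpy as np

data = np.arange(1, 11)  # [1, 2, ..., 10], already sorted

# "Median of halves" convention: split the sorted data in two
lower_half, upper_half = data[:5], data[5:]
q1_halves = np.median(lower_half)  # 3.0
q3_halves = np.median(upper_half)  # 8.0

# NumPy default: linear interpolation between neighbouring values
q1_interp, q3_interp = np.percentile(data, [25, 75])  # 3.25, 7.75

print(q1_halves, q1_interp)
print(q3_halves, q3_interp)
```

Neither result is wrong; they follow different definitions, so it helps to state which one a report uses.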

### Task
Calculate Q1 (25th percentile), Q2 (median/50th percentile), and Q3 (75th percentile) for the data `np.arange(1, 101)` (numbers 1-100). Store in `q1`, `q2`, and `q3`.

### Expected Properties
- All quartiles should be numeric values
- Q1 < Q2 < Q3
- Q2 should be approximately 50.5 (median of 1-100)

In [None]:
# Your solution:
data = np.arange(1, 101)  # 1 to 100
q1 = None
q2 = None
q3 = None

In [None]:
# Verification
check.is_not_none(q1, "P7a: Q1 not None")
check.is_not_none(q2, "P7b: Q2 not None")
check.is_not_none(q3, "P7c: Q3 not None")
check.is_true(q1 < q2 < q3, "P7d: Quartiles ordered", "Q1 < Q2 < Q3 should hold")
check.value_in_range(q2, 49, 52, "P7e: Q2 (median) in reasonable range")

---
## Problem 8: Calculate IQR
**Difficulty:** Medium

### Concept
The Interquartile Range (IQR) is the difference between Q3 and Q1. It spans the middle 50% of the data and is robust to outliers. The IQR defines the box in a boxplot and is commonly used for outlier detection.

### Syntax
```python
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Or use scipy
from scipy.stats import iqr
iqr_val = iqr(data)
```

### Example
```python
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> q1, q3 = np.percentile(data, [25, 75])
>>> q3 - q1
4.0
```

### Task
Calculate the IQR of `[7, 7, 31, 31, 47, 75, 87, 115, 116, 119, 119, 155, 177]`. Store in `iqr_val`.

### Expected Properties
- `iqr_val` should be a positive numeric value
- IQR should be less than the range of the data

In [None]:
# Your solution:
data = np.array([7, 7, 31, 31, 47, 75, 87, 115, 116, 119, 119, 155, 177])
iqr_val = None

In [None]:
# Verification
check.is_not_none(iqr_val, "P8: Not None")
check.is_type(iqr_val, (int, float, np.number), "P8: Type check")
check.is_true(iqr_val > 0, "P8: Positive IQR", "IQR should be positive")
check.value_in_range(iqr_val, 80, 95, "P8: Reasonable range")

---
## Problem 9: Identify Outliers using IQR
**Difficulty:** Medium

### Concept
The IQR method defines outliers as values that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. This is the standard rule used in boxplots.

### Syntax
```python
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = data[(data < lower_bound) | (data > upper_bound)]
```

### Example
```python
>>> data = np.array([1, 2, 3, 4, 5, 100])
>>> # 100 would be detected as an outlier
```
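
A runnable version of the rule, wrapped in a small helper. The function name `iqr_outliers` and the parameter `k` are illustrative choices, not part of NumPy or SciPy.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    # Boolean mask marking points below the lower or above the upper fence
    mask = (values < q1 - k * spread) | (values > q3 + k * spread)
    return values[mask]

print(iqr_outliers([1, 2, 3, 4, 5, 100]))  # [100]
```

For this data the fences work out to -1.5 and 8.5, so only 100 falls outside them.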

### Task
Find outliers in `[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]` using the IQR method. Store the outliers array in `outliers`.

### Expected Properties
- `outliers` should be a numpy array
- Should contain at least one value
- The outlier(s) should be much larger than typical values

In [None]:
# Your solution:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
outliers = None

In [None]:
# Verification
check.is_not_none(outliers, "P9: Not None")
check.is_type(outliers, np.ndarray, "P9: Type check")
check.is_true(len(outliers) > 0, "P9: Has outliers", "Should detect at least one outlier")
check.is_true(np.all(outliers > 50), "P9: Outliers are extreme", "Outliers should be much larger than typical values")

---
## Problem 10: Z-Score Calculation
**Difficulty:** Medium

### Concept
A z-score indicates how many standard deviations a value is from the mean. Z-scores standardize data to have mean=0 and std=1. Values with |z| > 3 are often considered outliers.

### Syntax
```python
# Manual calculation
z_scores = (data - np.mean(data)) / np.std(data)

# Using scipy
from scipy.stats import zscore
z_scores = zscore(data)
```

### Example
```python
>>> data = [10, 20, 30, 40, 50]
>>> zscore(data)
array([-1.41421356, -0.70710678,  0.        ,  0.70710678,  1.41421356])
```
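
`scipy.stats.zscore` divides by the population standard deviation (`ddof=0`) unless told otherwise, which matches the manual formula in the Syntax section. A short sketch on made-up data showing the effect of `ddof`:

```python
import numpy as np
from scipy.stats import zscore

data = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical values

z_pop = zscore(data)             # population std in the denominator (default)
z_sample = zscore(data, ddof=1)  # sample std in the denominator

# The default matches the manual formula with np.std's default ddof=0
manual = (data - data.mean()) / data.std()
assert np.allclose(z_pop, manual)

# Sample-based z-scores are slightly smaller in magnitude
print(z_pop[-1], z_sample[-1])
```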

### Task
Calculate z-scores for `[10, 20, 30, 40, 50]` using `scipy.stats.zscore()`. Store in `z_scores`.

### Expected Properties
- `z_scores` should be a numpy array
- Should have same length as input data
- Mean of z-scores should be approximately 0
- The middle value (30) should have a z-score of approximately 0

In [None]:
# Your solution:
data = np.array([10, 20, 30, 40, 50])
z_scores = None

In [None]:
# Verification
check.is_not_none(z_scores, "P10: Not None")
check.is_type(z_scores, np.ndarray, "P10: Type check")
check.has_length(z_scores, 5, "P10: Correct length")
check.mean_is_close(z_scores, 0.0, "P10: Mean is 0", tolerance=0.01)
check.is_true(abs(z_scores[2]) < 0.1, "P10: Middle value z-score", "Middle value should have z-score near 0")

---
## Problem 11: Coefficient of Variation
**Difficulty:** Medium

### Concept
The Coefficient of Variation (CV) is the ratio of standard deviation to the mean, expressed as a percentage. It's a standardized measure of dispersion that allows comparison across datasets with different units or scales.

### Syntax
```python
cv = (np.std(data, ddof=1) / np.mean(data)) * 100

# Using scipy
from scipy.stats import variation
cv = variation(data, ddof=1) * 100
```

### Example
```python
>>> data = [100, 110, 120, 130, 140]
>>> cv = (np.std(data, ddof=1) / np.mean(data)) * 100
>>> cv
13.17...
```
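
Because the CV is a ratio, it should not change when the unit of measurement changes. The sketch below checks this on hypothetical heights recorded in centimetres and metres; `cv_percent` is an illustrative helper name.

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # hypothetical data
heights_m = heights_cm / 100                                 # same data in metres

def cv_percent(values):
    """Sample standard deviation as a percentage of the mean."""
    return np.std(values, ddof=1) / np.mean(values) * 100

cv_cm = cv_percent(heights_cm)
cv_m = cv_percent(heights_m)

# The standard deviations differ by a factor of 100, the CVs do not
assert np.isclose(cv_cm, cv_m)
```

Note that the CV is only meaningful for data on a ratio scale with a mean well away from zero.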

### Task
Calculate the coefficient of variation for `[10, 20, 30, 40, 50]`. Express as a percentage. Store in `cv`.

### Expected Properties
- `cv` should be a positive numeric value
- Should be expressed as a percentage (approximately 40-60)

In [None]:
# Your solution:
data = np.array([10, 20, 30, 40, 50])
cv = None

In [None]:
# Verification
check.is_not_none(cv, "P11: Not None")
check.is_type(cv, (int, float, np.number), "P11: Type check")
check.value_in_range(cv, 40, 60, "P11: Reasonable percentage range")

---
## Problem 12: Five Number Summary
**Difficulty:** Medium

### Concept
The five number summary consists of: minimum, Q1, median, Q3, and maximum. It provides a complete picture of data distribution and is visualized in a boxplot.

### Syntax
```python
minimum = np.min(data)
q1 = np.percentile(data, 25)
median = np.median(data)
q3 = np.percentile(data, 75)
maximum = np.max(data)
five_num = [minimum, q1, median, q3, maximum]
```

### Example
```python
>>> data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [np.min(data), np.percentile(data, 25), np.median(data), 
...  np.percentile(data, 75), np.max(data)]
[1, 3.0, 5.0, 7.0, 9]
```
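
As a compact alternative, all five numbers can be requested from `np.percentile` in one call, since the 0th and 100th percentiles are the minimum and maximum. A sketch on made-up data (note that the result is a float array, not a list):

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])  # hypothetical values

# 0th and 100th percentiles coincide with min and max
five = np.percentile(data, [0, 25, 50, 75, 100])
print(five.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]

assert five[0] == data.min() and five[-1] == data.max()
assert five[2] == np.median(data)
```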

### Task
Calculate the five number summary for `[2, 3, 5, 7, 11, 13, 17, 19, 23]`. Store as a list `[min, Q1, median, Q3, max]` in `five_num`.

### Expected Properties
- `five_num` should be a list with 5 elements
- Elements should be in ascending order
- First element should be the minimum
- Last element should be the maximum
- Middle element (index 2) should be the median

In [None]:
# Your solution:
data = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23])
five_num = None

In [None]:
# Verification
check.is_not_none(five_num, "P12: Not None")
check.is_type(five_num, (list, np.ndarray), "P12: Type check")
check.has_length(five_num, 5, "P12: Correct length")
check.is_true(five_num[0] == np.min(data), "P12a: Min correct", "First element should be minimum")
check.is_true(five_num[4] == np.max(data), "P12b: Max correct", "Last element should be maximum")
check.is_sorted(five_num, "P12c: Sorted order", ascending=True)

---
## Problem 13: Calculate Skewness
**Difficulty:** Hard

### Concept
Skewness measures the asymmetry of a distribution. Positive skew (right-skewed) has a long tail on the right; negative skew (left-skewed) has a long tail on the left. Zero skew indicates a symmetric distribution.

### Syntax
```python
from scipy.stats import skew
skewness = skew(data)
```

### Example
```python
>>> # Right-skewed data (few large values)
>>> data = [1, 1, 2, 2, 2, 3, 3, 10, 20]
>>> skew(data)
1.79...  # Positive = right-skewed
```
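
With its default settings (`bias=True`), `scipy.stats.skew` computes the third central moment divided by the second central moment raised to the power 1.5. A minimal sketch verifying this by hand:

```python
import numpy as np
from scipy.stats import skew

data = np.array([1, 1, 2, 2, 2, 3, 3, 10, 20])

dev = data - data.mean()
m2 = np.mean(dev ** 2)  # second central moment (population variance)
m3 = np.mean(dev ** 3)  # third central moment

manual_skew = m3 / m2 ** 1.5

# Should agree with SciPy's default (biased) estimator
assert np.isclose(manual_skew, skew(data))
```

Passing `bias=False` applies a small-sample correction, so the two settings give different numbers on short arrays.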

### Task
Calculate the skewness of `[1, 1, 1, 2, 2, 3, 3, 4, 5, 10, 15]` using `scipy.stats.skew()`. Store in `skew_val`.

### Expected Properties
- `skew_val` should be a numeric value
- Should be positive (data is right-skewed with larger values at the end)
- Typically between 0 and 3 for moderately skewed data

In [None]:
# Your solution:
data = np.array([1, 1, 1, 2, 2, 3, 3, 4, 5, 10, 15])
skew_val = None

In [None]:
# Verification
check.is_not_none(skew_val, "P13: Not None")
check.is_type(skew_val, (int, float, np.number), "P13: Type check")
check.is_true(skew_val > 0, "P13: Positive skew", "Data is right-skewed, skewness should be positive")

---
## Problem 14: Calculate Kurtosis
**Difficulty:** Hard

### Concept
Kurtosis measures the "tailedness" of a distribution. Positive excess kurtosis indicates heavy tails (more outliers), negative indicates light tails. Normal distribution has excess kurtosis of 0.

### Syntax
```python
from scipy.stats import kurtosis
# fisher=True (the default) gives excess kurtosis (normal = 0)
kurt_val = kurtosis(data, fisher=True)
```

### Example
```python
>>> normal_data = np.random.randn(1000)
>>> kurtosis(normal_data, fisher=True)
0.02...  # Varies from run to run; close to 0 for normally distributed data
```
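
Random samples give a slightly different kurtosis on every run. The sketch below fixes a seed (chosen arbitrarily) for reproducibility and contrasts the Fisher and Pearson definitions, which differ by exactly 3.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
sample = rng.standard_normal(10_000)  # larger sample than the task, less noise

excess = kurtosis(sample, fisher=True)    # normal distribution -> about 0
pearson = kurtosis(sample, fisher=False)  # normal distribution -> about 3

# The two definitions differ by a constant offset of 3
assert np.isclose(pearson - excess, 3.0)
```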

### Task
Calculate the kurtosis of normally distributed random data `np.random.randn(1000)` using `scipy.stats.kurtosis()`. Store in `kurt_val`.

### Expected Properties
- `kurt_val` should be a numeric value
- For normal distribution, excess kurtosis should be close to 0
- Should be in range [-1, 1] for this normally distributed data

In [None]:
# Your solution:
data = np.random.randn(1000)
kurt_val = None

In [None]:
# Verification
check.is_not_none(kurt_val, "P14: Not None")
check.is_type(kurt_val, (int, float, np.number), "P14: Type check")
check.value_in_range(kurt_val, -1, 1, "P14: Normal range for normal distribution")

---
## Problem 15: Weighted Mean
**Difficulty:** Hard

### Concept
A weighted mean accounts for the importance of each value. Each value is multiplied by its weight, the products are summed, and the total is divided by the sum of the weights. Because of that division, the weights do not need to sum to 1; `np.average` normalizes them for you.

### Syntax
```python
weighted_mean = np.average(values, weights=weights)
```

### Example
```python
>>> grades = [80, 90, 85]
>>> weights = [0.3, 0.5, 0.2]  # Exam weights
>>> np.average(grades, weights=weights)
86.0
```
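
The formula can be written out directly to see what `np.average` does with weights that are not normalized. The scores and weights below are made up for illustration.

```python
import numpy as np

scores = np.array([70, 80, 90])    # hypothetical scores
raw_weights = np.array([1, 1, 2])  # relative importance, does not sum to 1

# Weighted sum divided by the sum of the weights
manual = np.sum(scores * raw_weights) / np.sum(raw_weights)
builtin = np.average(scores, weights=raw_weights)

print(manual)  # 82.5
assert np.isclose(manual, builtin)
```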

### Task
Calculate the weighted mean of exam grades `[85, 90, 78, 92]` with weights `[0.2, 0.3, 0.2, 0.3]`. Store in `weighted_mean`.

### Expected Properties
- `weighted_mean` should be a numeric value
- Should be between the min and max of the grades
- Should be approximately 85-90

In [None]:
# Your solution:
grades = np.array([85, 90, 78, 92])
weights = np.array([0.2, 0.3, 0.2, 0.3])
weighted_mean = None

In [None]:
# Verification
check.is_not_none(weighted_mean, "P15: Not None")
check.is_type(weighted_mean, (int, float, np.number), "P15: Type check")
check.value_in_range(weighted_mean, np.min(grades), np.max(grades), "P15: Within grade range")
check.value_in_range(weighted_mean, 85, 90, "P15: Reasonable range")

---
## Problem 16: Descriptive Stats with Pandas
**Difficulty:** Hard

### Concept
Pandas `describe()` provides a comprehensive statistical summary including count, mean, std, min, quartiles, and max. It's a quick way to understand your data.

### Syntax
```python
# For DataFrame
df.describe()

# For Series
df['column'].describe()
```

### Example
```python
>>> df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
>>> df.describe()
              A
count  5.000000
mean   3.000000
std    1.581139
min    1.000000
25%    2.000000
50%    3.000000
75%    4.000000
max    5.000000
```
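
One detail worth knowing when comparing `describe()` with NumPy: pandas reports the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population version. A quick check, as a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
summary = s.describe()
values = s.to_numpy()

# pandas 'std' matches NumPy only when ddof=1 is passed explicitly
assert np.isclose(summary['std'], np.std(values, ddof=1))
assert not np.isclose(summary['std'], np.std(values))

# The quartile rows agree with np.percentile's default interpolation
assert np.isclose(summary['25%'], np.percentile(values, 25))
```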

### Task
Use pandas `describe()` to get descriptive statistics for the provided DataFrame. Store the result in `desc_stats`.

### Expected Properties
- `desc_stats` should be a DataFrame
- Should have 'mean' in its index
- Should have columns for both 'A' and 'B'

In [None]:
# Your solution:
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

desc_stats = None

In [None]:
# Verification
check.is_not_none(desc_stats, "P16: Not None")
check.is_type(desc_stats, pd.DataFrame, "P16: Type check")
check.contains(list(desc_stats.index), 'mean', "P16a: Has mean")
check.contains(list(desc_stats.columns), 'A', "P16b: Has column A")
check.contains(list(desc_stats.columns), 'B', "P16c: Has column B")

---
## Problem 17: Correlation Coefficient
**Difficulty:** Hard

### Concept
The Pearson correlation coefficient measures the linear relationship between two variables. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.

### Syntax
```python
# NumPy correlation matrix
corr_matrix = np.corrcoef(x, y)
corr = corr_matrix[0, 1]  # Extract correlation

# SciPy Pearson r
from scipy.stats import pearsonr
corr, p_value = pearsonr(x, y)

# Pandas
df['x'].corr(df['y'])
```

### Example
```python
>>> x = [1, 2, 3, 4, 5]
>>> y = [2, 4, 6, 8, 10]  # Perfect positive correlation
>>> np.corrcoef(x, y)[0, 1]
1.0
```
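
For intuition, Pearson's r can also be computed from its definition: the sum of products of deviations divided by the square root of the product of the sums of squared deviations. A sketch on made-up, perfectly decreasing data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # hypothetical: decreases as x increases

dx = x - x.mean()
dy = y - y.mean()

# Covariation of x and y, scaled by the spread of each variable
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(r_manual)  # -1.0
assert np.isclose(r_manual, np.corrcoef(x, y)[0, 1])
```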

### Task
Calculate the Pearson correlation coefficient between `x = [1, 2, 3, 4, 5]` and `y = [2, 4, 5, 4, 5]`. Store in `corr_val`. Extract just the correlation value (not the p-value if using scipy).

### Expected Properties
- `corr_val` should be a numeric value
- Should be between -1 and 1
- Should be positive (x and y generally increase together)

In [None]:
# Your solution:
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
corr_val = None

In [None]:
# Verification
check.is_not_none(corr_val, "P17: Not None")
check.is_type(corr_val, (int, float, np.number), "P17: Type check")
check.value_in_range(corr_val, -1, 1, "P17a: Valid correlation range")
check.is_true(corr_val > 0, "P17b: Positive correlation", "x and y should have positive correlation")

---
## Summary

Run this cell to see your overall progress on this notebook.

In [None]:
check.summary()