# Week 5 - In-Class Exercise: Univariate Analysis
## EDA Part 1: Statistics Refresher + Univariate Analysis

**Dataset:** Water Consumption (HISTORICO_CONSUMO) from datos.gov.co  
**Time:** ~30 minutes  
**Objective:** Apply the 5-step univariate analysis framework to water consumption data

---

### The 5-Step Univariate Analysis Framework

1. **Identify** - What type of variable is it?
2. **Summarize** - Calculate central tendency (mean, median, mode)
3. **Spread** - Calculate dispersion (std, IQR)
4. **Visualize** - Create histogram and boxplot
5. **Detect** - Find outliers using IQR method

## Setup

First, let's import the libraries we need and load the dataset.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

# Plot style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

In [None]:
# Load the dataset
df = pd.read_csv('../data/HISTORICO_CONSUMO.csv')

# Drop constant columns (same value in every row)
df = df.drop(columns=['NIT', 'RAZON SOCIAL'])

# Rename columns to pandas-friendly names
df = df.rename(columns={
    'AÑO': 'ANO',
    'No. SUSCRIPTORES ACUEDUCTO': 'SUSCRIPTORES_ACUEDUCTO',
    'CONSUMO M3 ACUEDUCTO': 'CONSUMO_ACUEDUCTO',
    'PROMEDIO CONSUMO ACUEDUCTO': 'PROMEDIO_ACUEDUCTO',
    'No. SUSCRIPTORES ALCANTARILLADO': 'SUSCRIPTORES_ALCANTARILLADO',
    'CONSUMO M3 ALCANTARILLADO': 'CONSUMO_ALCANTARILLADO',
    'PROMEDIO CONSUMO ALCANTARILLADO': 'PROMEDIO_ALCANTARILLADO'
})

# Clean ANO: "2,015" -> 2015 (comma = thousands separator)
df['ANO'] = df['ANO'].str.replace(',', '', regex=False).astype(int)

# Clean subscriber and consumption columns: dot = thousands separator
for col in ['SUSCRIPTORES_ACUEDUCTO', 'CONSUMO_ACUEDUCTO',
            'SUSCRIPTORES_ALCANTARILLADO', 'CONSUMO_ALCANTARILLADO']:
    df[col] = df[col].str.replace('.', '', regex=False).astype(int)

# Clean PROMEDIO columns: comma = thousands separator, dot = decimal
for col in ['PROMEDIO_ACUEDUCTO', 'PROMEDIO_ALCANTARILLADO']:
    df[col] = df[col].str.replace(',', '', regex=False).astype(float)

print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Years: {df['ANO'].min()}-{df['ANO'].max()}")
print(f"Municipalities: {df['MUNICIPIO'].nunique()}")
df.head()
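
The cleaning rules in the cell above can be tried on a few hand-written strings before running them on the full file. The sketch below uses a small hypothetical frame named `raw` (the values are invented for illustration and do not come from the dataset); only the separator conventions are taken from the comments above.

```python
import pandas as pd

# Hypothetical raw strings that mimic the separator styles described above
raw = pd.DataFrame({
    "ANO": ["2,015", "2,016"],                    # comma = thousands separator
    "CONSUMO_ACUEDUCTO": ["1.234.567", "890"],    # dot = thousands separator
    "PROMEDIO_ACUEDUCTO": ["1,250.75", "14.30"],  # comma = thousands, dot = decimal
})

clean = pd.DataFrame({
    # Remove the thousands separator, then convert to a numeric type
    "ANO": raw["ANO"].str.replace(",", "", regex=False).astype(int),
    "CONSUMO_ACUEDUCTO": raw["CONSUMO_ACUEDUCTO"].str.replace(".", "", regex=False).astype(int),
    "PROMEDIO_ACUEDUCTO": raw["PROMEDIO_ACUEDUCTO"].str.replace(",", "", regex=False).astype(float),
})

print(clean.dtypes)
print(clean)
```

Passing `regex=False` matters for the dot: as a regular expression, `.` would match every character.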

---

## Exercise 1: Identify the Variable (5 minutes)

We will analyze the variable **CONSUMO_ACUEDUCTO**: water consumption in cubic meters (m3).

**Task:** Answer the following questions about the variable.

In [None]:
# Variable of interest
var = 'CONSUMO_ACUEDUCTO'

# TODO: Check the data type of this variable
print(f"Data type: {df[var].dtype}")

# TODO: Count non-null values
print(f"Non-null count: {df[var].count()} out of {len(df)}")

# TODO: Check for missing values
print(f"Missing values: {df[var].isna().sum()}")

**Question:** Is CONSUMO_ACUEDUCTO a quantitative or categorical variable? Discrete or continuous?

**Your answer:** (Write your answer here)

---

## Exercise 2: Calculate Central Tendency (5 minutes)

Calculate the three measures of central tendency: **mean**, **median**, and **mode**.

**Key concept:** When the mean and median differ substantially, the distribution is probably skewed: extreme values pull the mean toward the tail, while the median barely moves.
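
To see why a gap between mean and median hints at skew, here is a minimal sketch on a hypothetical sample (the numbers are made up, not taken from the dataset). A single extreme value is enough to move the mean a long way while the median stays put. The `_demo` names are used so the notebook's own variables are not overwritten.

```python
import pandas as pd

# Hypothetical consumption values (m3); the last one is an extreme user
values = pd.Series([10, 12, 11, 13, 12, 14, 500])

mean_demo = values.mean()
median_demo = values.median()
ratio_demo = mean_demo / median_demo

print(f"Mean:   {mean_demo:.2f}")
print(f"Median: {median_demo:.2f}")
print(f"Mean/Median ratio: {ratio_demo:.2f}")

# Same sample without the extreme value, for comparison
trimmed = values[values < 500]
print(f"Mean without the extreme value: {trimmed.mean():.2f}")
```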

In [None]:
# TODO: Calculate mean
mean_val = df[var].mean()
print(f"Mean: {mean_val:,.2f}")

# TODO: Calculate median
median_val = df[var].median()
print(f"Median: {median_val:,.2f}")

# TODO: Calculate mode (most frequent value)
mode_val = df[var].mode()[0]  # mode() can return several values; take the first
print(f"Mode: {mode_val:,.2f}")

In [None]:
# TODO: Calculate the ratio of mean to median
ratio = mean_val / median_val
print(f"\nMean/Median ratio: {ratio:.2f}")

# Interpretation
if ratio > 1.2:
    print("Interpretation: RIGHT-SKEWED distribution (use median)")
elif ratio < 0.8:
    print("Interpretation: LEFT-SKEWED distribution (use median)")
else:
    print("Interpretation: Approximately SYMMETRIC (mean is appropriate)")

**Question:** Which measure of central tendency is more appropriate for this data? Why?

**Your answer:** (Write your answer here)

---

## Exercise 3: Calculate Dispersion (5 minutes)

Calculate the measures of spread: **standard deviation** and **IQR**.

**Key concept:** 
- Standard deviation measures spread around the mean
- IQR measures the range of the middle 50% of data
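
The difference between the two measures shows up clearly when one extreme value is added to a sample. The sketch below uses hypothetical numbers and a small helper, `spread`, which is not part of the notebook. It assumes pandas defaults: sample standard deviation and linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical sample, then the same sample with one extreme value appended
base = pd.Series([10, 11, 12, 12, 13, 14, 15, 16])
with_extreme = pd.concat([base, pd.Series([500])], ignore_index=True)

def spread(s):
    """Return (sample standard deviation, IQR) for a numeric Series."""
    iqr_demo = s.quantile(0.75) - s.quantile(0.25)
    return s.std(), iqr_demo

std_base, iqr_base = spread(base)
std_ext, iqr_ext = spread(with_extreme)

print(f"Without extreme value: std = {std_base:.2f}, IQR = {iqr_base:.2f}")
print(f"With extreme value:    std = {std_ext:.2f}, IQR = {iqr_ext:.2f}")
```

The standard deviation grows by orders of magnitude, while the IQR changes only slightly because it ignores everything outside the middle 50%.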

In [None]:
# TODO: Calculate standard deviation
std_val = df[var].std()
print(f"Standard Deviation: {std_val:,.2f}")

# TODO: Calculate variance (just for reference)
var_val = df[var].var()
print(f"Variance: {var_val:,.2f}")

In [None]:
# TODO: Calculate Q1, Q3, and IQR
q1 = df[var].quantile(0.25)
q3 = df[var].quantile(0.75)
iqr = q3 - q1

print(f"Q1 (25th percentile): {q1:,.2f}")
print(f"Q3 (75th percentile): {q3:,.2f}")
print(f"IQR (Q3 - Q1): {iqr:,.2f}")

In [None]:
# Use pandas describe() to verify our calculations
df[var].describe()

**Question:** The standard deviation is very high compared to the mean. What does this tell us about the data?

**Your answer:** (Write your answer here)

---

## Exercise 4: Visualize the Distribution (8 minutes)

Create a **histogram** and a **boxplot** to visualize the distribution.

**Key concept:**
- Histogram shows the shape of the distribution
- Boxplot shows quartiles and outliers
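
Both plots are drawn from a handful of numbers, and it can help to look at those numbers directly. The following sketch uses a hypothetical right-skewed sample and NumPy only: it prints the bin counts a histogram would draw as bars, and the three quartiles a boxplot would draw as the box and its center line.

```python
import numpy as np

# Hypothetical right-skewed sample (m3)
sample = np.array([5, 6, 6, 7, 7, 7, 8, 8, 9, 10, 12, 15, 40])

# Histogram: number of observations in each equal-width bin
counts, edges = np.histogram(sample, bins=5)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    # Each '#' is one observation; the last bin also includes its right edge
    print(f"{left:5.1f} to {right:5.1f} | {'#' * int(count)}")

# Boxplot: the box spans Q1 to Q3, with a line at the median
q1_demo, med_demo, q3_demo = np.percentile(sample, [25, 50, 75])
print(f"Q1 = {q1_demo}, median = {med_demo}, Q3 = {q3_demo}")
```

Most observations fall in the first bin and a single one sits far to the right, which is the text version of a long right tail.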

In [None]:
# TODO: Create a histogram and a boxplot side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df[var].dropna(), bins=50, edgecolor='black', alpha=0.7, color='steelblue')
axes[0].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:,.0f}')
axes[0].axvline(median_val, color='green', linestyle='-', linewidth=2, label=f'Median: {median_val:,.0f}')
axes[0].set_xlabel('Consumption (m3)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title(f'Distribution of {var}', fontsize=14)
axes[0].legend()

# Boxplot
axes[1].boxplot(df[var].dropna(), patch_artist=True,  # vertical is the default orientation
                boxprops=dict(facecolor='steelblue', alpha=0.7))
axes[1].set_ylabel('Consumption (m3)', fontsize=12)
axes[1].set_title(f'Box Plot of {var}', fontsize=14)
axes[1].set_xticklabels(['Consumption'])

plt.tight_layout()
plt.show()

**Question:** Based on the histogram, what type of distribution does the data follow?

- [ ] Normal (symmetric, bell-shaped)
- [ ] Right-skewed (tail extends to the right)
- [ ] Left-skewed (tail extends to the left)
- [ ] Bimodal (two peaks)

**Your answer:** (Select one and explain)

---

## Exercise 5: Detect Outliers (7 minutes)

Use the **IQR method** to detect outliers.

**Key concept:**
- Lower bound = Q1 - 1.5 * IQR
- Upper bound = Q3 + 1.5 * IQR
- Values outside these bounds are considered outliers
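
As a worked example of these bounds, the sketch below applies them to a small hypothetical sample with one unusually low and one unusually high value. The helper `iqr_bounds` and its parameter `k` are illustrative names, not part of the notebook, and quantiles use the pandas default (linear interpolation).

```python
import pandas as pd

# Hypothetical sample with one unusually low and one unusually high value
demo = pd.Series([1, 20, 21, 22, 22, 23, 24, 25, 60])

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1_demo = series.quantile(0.25)
    q3_demo = series.quantile(0.75)
    iqr_demo = q3_demo - q1_demo
    return q1_demo - k * iqr_demo, q3_demo + k * iqr_demo

lower_demo, upper_demo = iqr_bounds(demo)
low_outliers = demo[demo < lower_demo].tolist()
high_outliers = demo[demo > upper_demo].tolist()

print(f"Bounds: [{lower_demo}, {upper_demo}]")
print(f"Low outliers:  {low_outliers}")
print(f"High outliers: {high_outliers}")
```

Here Q1 = 21 and Q3 = 24, so IQR = 3 and the bounds are 21 - 4.5 = 16.5 and 24 + 4.5 = 28.5; only 1 and 60 fall outside.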

In [None]:
# TODO: Calculate outlier bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f"Lower bound: {lower_bound:,.2f}")
print(f"Upper bound: {upper_bound:,.2f}")

In [None]:
# TODO: Identify outliers
outliers = df[(df[var] < lower_bound) | (df[var] > upper_bound)]

print(f"\nNumber of outliers: {len(outliers)}")
print(f"Percentage of outliers: {len(outliers) / len(df) * 100:.2f}%")

In [None]:
# TODO: Separate lower and upper outliers
lower_outliers = df[df[var] < lower_bound]
upper_outliers = df[df[var] > upper_bound]

print(f"Lower outliers (below {lower_bound:,.2f}): {len(lower_outliers)}")
print(f"Upper outliers (above {upper_bound:,.2f}): {len(upper_outliers)}")

In [None]:
# Let's look at some of the extreme outliers
print("\nTop 5 highest consumption values:")
print(df.nlargest(5, var)[[var, 'MUNICIPIO', 'ESTRATO', 'ANO']].to_string())

**Question:** Should we remove these outliers? Why or why not?

**Your answer:** (Consider: Are these data errors or real unusual cases?)

---

## Exercise 6: GroupBy - Compare Statistics by Group (5 minutes)

Now that you know the overall statistics, let's see how they change **by group**.

**Key concept:** GroupBy = Split by group + Apply a statistic + Combine results.  
Think of it like sorting M&Ms by color and counting each pile.
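
The split-apply-combine idea can be sketched on a five-row hypothetical frame before applying it to the real data. The group labels `A` and `B` and the `CONSUMO` values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: two groups with very different consumption levels
demo = pd.DataFrame({
    "ESTRATO": ["A", "A", "B", "B", "B"],
    "CONSUMO": [10, 20, 100, 200, 900],
})

# Split by ESTRATO, apply three statistics to CONSUMO, combine into one table
by_group = demo.groupby("ESTRATO")["CONSUMO"].agg(["count", "mean", "median"])
print(by_group)

# The single overall mean hides the difference between the groups
overall_mean = demo["CONSUMO"].mean()
print(f"Overall mean: {overall_mean:.2f}")
```

The overall mean (246) describes neither group well: group A averages 15 and group B averages 400.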

**Task 1:** Calculate the mean consumption by **ESTRATO** (stratum/category). Which stratum/category consumes the most?

In [None]:
# TODO: Mean consumption by stratum/category
# Pattern: df.groupby('GROUP_COLUMN')['VALUE_COLUMN'].mean()
mean_by_estrato = df.groupby('ESTRATO')['CONSUMO_ACUEDUCTO'].mean().sort_values(ascending=False)
print("Mean consumption by stratum/category:")
print(mean_by_estrato)

**Task 2:** Calculate the median consumption by **ESTRATO**. Does consumption increase with stratum?

In [None]:
# TODO: Median consumption by stratum
median_by_estrato = df.groupby('ESTRATO')['CONSUMO_ACUEDUCTO'].median().sort_index()
print("Median consumption by stratum:")
print(median_by_estrato)

**Question:** Compare the overall mean you calculated in Exercise 2 with the group means above. What pattern do you see that was hidden in the single number?

**Your answer:** (Write your answer here)

---

## Summary: Complete Univariate Analysis

Let's compile all our findings into a summary.

In [None]:
# Complete summary
print("=" * 60)
print(f"UNIVARIATE ANALYSIS SUMMARY: {var}")
print("=" * 60)

print(f"\n1. DATA TYPE: {df[var].dtype}")
print(f"   - Non-null values: {df[var].count():,}")
print(f"   - Missing values: {df[var].isna().sum():,}")

print(f"\n2. CENTRAL TENDENCY:")
print(f"   - Mean: {mean_val:,.2f}")
print(f"   - Median: {median_val:,.2f}")
print(f"   - Mode: {mode_val:,.2f}")
print(f"   - Mean/Median ratio: {ratio:.2f}")

print(f"\n3. DISPERSION:")
print(f"   - Standard Deviation: {std_val:,.2f}")
print(f"   - IQR: {iqr:,.2f}")
print(f"   - Range: {df[var].min():,.2f} to {df[var].max():,.2f}")

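# Derive the label from the mean/median ratio instead of hard-coding it
if ratio > 1.2:
    shape = "Right-skewed (mean > median)"
elif ratio < 0.8:
    shape = "Left-skewed (mean < median)"
else:
    shape = "Approximately symmetric"
print(f"\n4. DISTRIBUTION TYPE: {shape}")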

print(f"\n5. OUTLIERS (IQR method):")
print(f"   - Bounds: [{lower_bound:,.2f}, {upper_bound:,.2f}]")
print(f"   - Outliers detected: {len(outliers):,} ({len(outliers)/len(df)*100:.2f}%)")

print("\n" + "=" * 60)

---

## Bonus: Interpretation

Write one sentence summarizing what you learned about water consumption from this analysis.

**Example interpretation:** "Water consumption in the dataset shows high variability with a right-skewed distribution, indicating that most users consume moderate amounts while a small number of heavy users (likely commercial or industrial) drive up the average."

**Your interpretation:** (Write your own interpretation here)

---

## Key Takeaways

1. **Always compare mean and median** - As a rule of thumb, if they differ by more than about 20%, report the median
2. **Visualize before deciding** - Histograms reveal distribution shape
3. **Don't automatically remove outliers** - Investigate them first
4. **Use IQR for skewed data** - It is more robust to extreme values than the standard deviation
5. **Document your findings** - Create a summary like we did above