# Week 6 Workshop: Bivariate & Multivariate Analysis

## Saber Pro Test Scores - Finding Relationships Between Variables

**Student Name:** _____________________

**Date:** _____________________

---

### Workshop Objectives

1. Build a complete correlation matrix analysis of test scores
2. Test whether relationships change by institution type or socioeconomic group
3. Generate 5 actionable insights from bivariate analysis

### Dataset

**Saber Pro** is the Colombian national exam for university students. This dataset contains
scores from multiple competency areas, along with demographic information.

**Key columns:**
- Test scores (numeric): `puntaje_global`, `punt_comp_ciud`, `punt_comu_escr`, `punt_ingles`, `punt_lect_crit`, `punt_razo_cuant`
- Grouping variables (categorical): `sexo`, `estrato`, `tipo_col`, `dep_proc`, `year`

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)

df = pd.read_csv('../data/saber_pro.csv')
print(f"Dataset: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

In [None]:
# Explore the dataset structure
df.info()

In [None]:
# Check unique values for key categorical columns
for col in ['sexo', 'estrato', 'tipo_col']:
    if col in df.columns:
        print(f"{col}: {df[col].unique()}")

### Helper Functions

These utility functions are provided for you. They handle the mechanics of extracting and grouping correlations so you can focus on the analysis.

In [None]:
def get_all_correlations(corr_matrix):
    """
    Extract all unique correlation pairs from correlation matrix.
    Returns a DataFrame sorted by absolute correlation.
    """
    correlations = []
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            var1 = corr_matrix.columns[i]
            var2 = corr_matrix.columns[j]
            r = corr_matrix.iloc[i, j]
            correlations.append({
                'Variable 1': var1,
                'Variable 2': var2,
                'Correlation': r,
                'Abs Correlation': abs(r)
            })
    
    corr_df = pd.DataFrame(correlations)
    corr_df = corr_df.sort_values('Abs Correlation', ascending=False)
    return corr_df


def calculate_group_correlations(df, x_var, y_var, group_var, min_samples=30):
    """
    Calculate correlation between x_var and y_var for each group.
    Only includes groups with at least min_samples observations.
    """
    results = []
    
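    for group in df[group_var].dropna().unique():
        # Keep only rows where both variables are present, so that n matches
        # the number of pairs actually used by .corr()
        subset = df[df[group_var] == group].dropna(subset=[x_var, y_var])
        n = len(subset)
        
        if n >= min_samples:
            r = subset[x_var].corr(subset[y_var])
            if not np.isnan(r):
                results.append({
                    group_var: group,
                    'Sample Size (n)': n,
                    'Correlation (r)': r
                })
    
    # Return an empty, correctly labeled frame if no group qualifies
    # (sorting an empty frame by a missing column would raise a KeyError)
    if not results:
        return pd.DataFrame(columns=[group_var, 'Sample Size (n)', 'Correlation (r)'])
    
    return pd.DataFrame(results).sort_values('Correlation (r)', ascending=False)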


def interpret_correlation(r):
    """Interpret a correlation value as direction + strength."""
    if pd.isna(r):
        return "Undefined"  # e.g. one of the variables is constant
    direction = "Positive" if r > 0 else "Negative"
    abs_r = abs(r)
    if abs_r >= 0.7:
        strength = "Strong"
    elif abs_r >= 0.3:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction}"


print("Helper functions loaded.")
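
To see the kind of table `get_all_correlations()` returns, the sketch below builds an equivalent ranked list of pairs on a small synthetic frame, using `np.triu_indices` in place of the nested loops. The columns `a`, `b`, `c` and the names `toy` and `pairs` are invented for illustration and are not part of the Saber Pro data.

```python
import numpy as np
import pandas as pd

# Toy data (not Saber Pro): b tracks a closely, c is independent noise
rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.3, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()

# Indices of the upper triangle, excluding the diagonal (k=1),
# so each pair appears once and self-correlations are skipped
rows, cols = np.triu_indices(len(corr.columns), k=1)

pairs = pd.DataFrame({
    'Variable 1': corr.columns[rows],
    'Variable 2': corr.columns[cols],
    'Correlation': corr.to_numpy()[rows, cols],
})
pairs['Abs Correlation'] = pairs['Correlation'].abs()
pairs = pairs.sort_values('Abs Correlation', ascending=False)
print(pairs)
```

With 3 columns there are 3 unique pairs; with the 6 score columns used in this workshop there would be 15.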

---

# Part 1: Correlation Matrix Analysis (45 minutes)

---

## Task 1.1: Calculate Correlation Matrix for Test Scores

Calculate the correlation matrix using only the 6 test score columns. Then create a heatmap.
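
Before running `.corr()` on the real scores, it may help to see its basic properties on a tiny made-up frame. The sketch below assumes three hypothetical columns (`m1`, `m2`, `m3`) and checks that the matrix is symmetric, has 1.0 on the diagonal, and handles missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only; one value is missing on purpose
scores = pd.DataFrame({
    'm1': [10.0, 20.0, 30.0, 40.0, 50.0],
    'm2': [12.0, 18.0, 33.0, 41.0, np.nan],
    'm3': [50.0, 40.0, 30.0, 20.0, 10.0],
})

toy_corr = scores.corr()  # Pearson correlation by default
print(toy_corr.round(3))

# The matrix is symmetric with 1.0 on the diagonal
values = toy_corr.to_numpy()
print("Symmetric:", np.allclose(values, values.T))
print("Unit diagonal:", np.allclose(np.diag(values), 1.0))

# m3 is an exact decreasing linear function of m1 (m3 = 60 - m1),
# so their correlation is -1
print("m1 vs m3:", round(toy_corr.loc['m1', 'm3'], 3))

# Missing values are dropped pair by pair: m1 vs m2 uses only the
# first four rows, where both values are present
first_four = scores.iloc[:4]
same = np.isclose(toy_corr.loc['m1', 'm2'],
                  first_four['m1'].corr(first_four['m2']))
print("Pairwise deletion matches:", same)
```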

In [None]:
# YOUR CODE HERE
# Hint: Define the score columns first, then use .corr()
# score_cols = ['puntaje_global', 'punt_comp_ciud', 'punt_comu_escr',
#               'punt_ingles', 'punt_lect_crit', 'punt_razo_cuant']
# corr_matrix = df[score_cols].corr()



In [None]:
# YOUR CODE HERE: Create a heatmap of the correlation matrix
# Requirements:
# - Use the RdYlGn or coolwarm colormap (cmap)
# - Show annotation with correlation values (annot=True)
# - Center at 0, vmin=-1, vmax=1
# - Add a descriptive title

plt.figure(figsize=(10, 8))

# sns.heatmap(corr_matrix, annot=True, cmap='RdYlGn', center=0,
#             vmin=-1, vmax=1, fmt='.2f', square=True, linewidths=0.5)


plt.title('Correlation Heatmap - Saber Pro Test Scores', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Task 1.2: Extract and Rank All Unique Correlations

Use the `get_all_correlations()` helper to extract all pairs. Document the top 5.

In [None]:
# YOUR CODE HERE
# Hint: all_corrs = get_all_correlations(corr_matrix)
# Then add interpretation: all_corrs['Interpretation'] = all_corrs['Correlation'].apply(interpret_correlation)



### Top 5 Correlations Summary

Fill in this table based on your results:

| Rank | Variable 1 | Variable 2 | r | Interpretation |
|------|------------|------------|---|----------------|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| 5 | | | | |

## Task 1.3: Scatter Plots for Top 3 Correlations

Create scatter plots with regression lines for the 3 strongest correlation pairs.
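
When reading the regression lines, keep in mind that `r` and the slope are related but not the same thing. For a simple least-squares line (which is what `sns.regplot` fits by default), the slope equals `r` multiplied by the ratio of the two standard deviations. The sketch below checks this on synthetic data; the variables `x` and `y` are hypothetical and stand in for any two score columns.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=150, scale=20, size=500)    # hypothetical score A
y = 0.6 * x + rng.normal(scale=15, size=500)   # hypothetical score B

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, deg=1)     # least-squares line

# For simple linear regression: slope = r * (std of y / std of x)
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(f"r = {r:.3f}")
print(f"fitted slope = {slope:.3f}, r * sy/sx = {slope_from_r:.3f}")
```

Two pairs with the same `r` can therefore show lines of very different steepness if their score scales differ.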

In [None]:
# YOUR CODE HERE
# Hint: Use all_corrs.head(3) to get top 3 pairs
# Create a 1x3 subplot figure
# For each pair, use sns.regplot(x=var1, y=var2, data=df, ax=axes[i])

# top_3 = all_corrs.head(3)
# fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# for i, (_, row) in enumerate(top_3.iterrows()):
#     var1 = row['Variable 1']
#     var2 = row['Variable 2']
#     r = row['Correlation']
#     sns.regplot(x=var1, y=var2, data=df, ax=axes[i],
#                 scatter_kws={'alpha': 0.1}, line_kws={'color': 'red'})
#     axes[i].set_title(f'{var1}\nvs {var2}\nr = {r:.3f}')



### Scatter Plot Observations

For each of the top 3 correlations, document your observations:

**Correlation 1:**
- Linear pattern? Yes/No
- Outliers present? Yes/No
- Notes:

**Correlation 2:**
- Linear pattern? Yes/No
- Outliers present? Yes/No
- Notes:

**Correlation 3:**
- Linear pattern? Yes/No
- Outliers present? Yes/No
- Notes:

---

# Part 2: Test Relationships by Group (45 minutes)

---

## Task 2.1: Select Key Variables

Choose a pair of numeric variables from Part 1 (e.g., `punt_razo_cuant` vs `puntaje_global`) and a categorical grouping variable (`tipo_col`, `sexo`, or `estrato`).

In [None]:
# YOUR CODE HERE: Define your variables
# x_var = 'punt_razo_cuant'
# y_var = 'puntaje_global'
# group_var = 'tipo_col'  # Options: 'tipo_col', 'sexo', 'estrato'


# print(f"Analyzing: {x_var} vs {y_var}")
# print(f"Grouped by: {group_var}")
# print(f"Groups: {df[group_var].unique()}")

## Task 2.2: Calculate Correlations by Group

Use the `calculate_group_correlations()` helper function to see how the correlation changes across groups.
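
The helper skips groups with fewer than `min_samples=30` observations because correlations from small samples are unstable. The sketch below illustrates this with two synthetic variables that are unrelated by construction; `sample_r` is a hypothetical function written only for this demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two unrelated variables: the true correlation is 0
x = pd.Series(rng.normal(size=5000))
y = pd.Series(rng.normal(size=5000))

def sample_r(n, trials=200):
    """Correlation of x and y in `trials` random subsets of size n."""
    rs = []
    for _ in range(trials):
        idx = rng.choice(len(x), size=n, replace=False)
        rs.append(x.iloc[idx].corr(y.iloc[idx]))
    return np.array(rs)

small = sample_r(5)
large = sample_r(500)

print(f"n=5:   r ranges from {small.min():.2f} to {small.max():.2f}")
print(f"n=500: r ranges from {large.min():.2f} to {large.max():.2f}")
```

The spread of `r` is much wider for the small subsets, so a "strong" correlation in a tiny group may be nothing more than sampling noise.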

In [None]:
# YOUR CODE HERE
# Hint: group_corr = calculate_group_correlations(df, x_var, y_var, group_var)
# group_corr



In [None]:
# Summary statistics of group correlations
# YOUR CODE HERE
# print(f"Mean correlation: {group_corr['Correlation (r)'].mean():.3f}")
# print(f"Std deviation:   {group_corr['Correlation (r)'].std():.3f}")
# print(f"Min correlation: {group_corr['Correlation (r)'].min():.3f}")
# print(f"Max correlation: {group_corr['Correlation (r)'].max():.3f}")



## Task 2.3: Scatter Plot Colored by Group

Create a scatter plot of your two numeric variables, colored by the grouping variable.

If you group by a variable with many categories, such as `dep_proc` (department), filter to the 5 largest groups by sample size first.

In [None]:
# YOUR CODE HERE
# Hint: sns.scatterplot(x=x_var, y=y_var, hue=group_var, data=df, alpha=0.3)
# If too many groups, filter first:
#   top_groups = df[group_var].value_counts().head(5).index
#   df_filtered = df[df[group_var].isin(top_groups)]

plt.figure(figsize=(12, 8))


plt.title(f'{x_var} vs {y_var} by {group_var}')
plt.legend(title=group_var, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

## Task 2.4: Simpson's Paradox Check

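Compare the overall correlation with the per-group correlations. Simpson's Paradox occurs when a trend seen in the pooled data reverses or disappears within the subgroups. Does the direction or strength change?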
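
To make the idea concrete, here is a minimal synthetic example, unrelated to the Saber Pro data, in which the pooled correlation is positive even though the correlation inside each group is negative. The frame `demo` and its columns `group`, `x`, `y` are invented for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 300

# Two synthetic groups: within each, y falls as x rises,
# but group B sits higher on both x and y
xa = rng.normal(loc=0, scale=1, size=n)
xb = rng.normal(loc=5, scale=1, size=n)
demo = pd.DataFrame({
    'group': ['A'] * n + ['B'] * n,
    'x': np.concatenate([xa, xb]),
    'y': np.concatenate([
        -0.5 * xa + rng.normal(scale=0.5, size=n),
        -0.5 * xb + 10 + rng.normal(scale=0.5, size=n),
    ]),
})

# Pooled correlation ignores group membership
overall_r_demo = demo['x'].corr(demo['y'])

# Correlation computed separately inside each group
within_r = {name: sub['x'].corr(sub['y'])
            for name, sub in demo.groupby('group')}

print(f"Overall: r = {overall_r_demo:.3f}")
for name, r in within_r.items():
    print(f"Group {name}: r = {r:.3f}")
```

The sign flip here comes entirely from the gap between the group averages, which is why a scatter plot colored by group (Task 2.3) is a useful companion to the numbers.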

In [None]:
# YOUR CODE HERE
# Step 1: Calculate overall correlation
# overall_r = df[x_var].corr(df[y_var])
# print(f"Overall correlation (all data): r = {overall_r:.3f}")

# Step 2: Compare to subgroup correlations (already in group_corr)
# print(group_corr)

# Step 3: Check for Simpson's Paradox
# Are any subgroups showing a different direction than overall?
# positive_overall = overall_r > 0
# subgroup_same_dir = (group_corr['Correlation (r)'] > 0) == positive_overall
# print(f"\nSubgroups with same direction as overall: {subgroup_same_dir.sum()}/{len(group_corr)}")



### Simpson's Paradox Analysis

Is there evidence of Simpson's Paradox?

**Overall correlation direction:** Positive / Negative

**Do any subgroups show a different direction?** Yes / No

**Your analysis:**

_Write your analysis here..._

---

# Part 3: Generate 5 Actionable Insights (30 minutes)

---

Based on your bivariate and multivariate analysis, generate 5 insights.

For each insight, document:
1. **Finding:** What did you discover? (Be specific with numbers)
2. **So What?:** Why does this matter?
3. **Now What?:** What action could be taken?
4. **Caution:** Is this correlation or causation? Confounding variables?

### Insight #1: [Title]

---

**Finding:**

_Describe your finding with specific numbers (e.g., "There is a strong positive correlation (r = 0.85) between...")_

**So What?:**

_Explain why this matters for education policy or student support_

**Now What?:**

_Suggest a specific action based on this finding_

**Caution:**

_Identify potential confounding variables or limitations_

---

### Insight #2: [Title]

---

**Finding:**


**So What?:**


**Now What?:**


**Caution:**


---

### Insight #3: [Title]

---

**Finding:**


**So What?:**


**Now What?:**


**Caution:**


---

### Insight #4: [Title]

---

**Finding:**


**So What?:**


**Now What?:**


**Caution:**


---

### Insight #5: [Title]

---

**Finding:**


**So What?:**


**Now What?:**


**Caution:**


---

# Summary

---

## Key Findings

1. **Strongest correlations found:**
   - 
   - 
   - 

2. **How relationships vary by group:**
   - 

3. **Most actionable insight:**
   - 

## Reflection

1. **What surprised you most in this analysis?**

   _Your answer:_

2. **Which correlation is most likely to be mistaken for causation?**

   _Your answer:_

3. **What additional data would help confirm your findings?**

   _Your answer:_

---

## Checklist Before Submission

- [ ] All code cells executed without errors
- [ ] Correlation heatmap created and labeled
- [ ] Top 5 correlations identified and interpreted
- [ ] Scatter plots for top 3 correlations
- [ ] Group correlation analysis completed
- [ ] Simpson's Paradox check performed
- [ ] 5 actionable insights documented
- [ ] All markdown cells completed

---

*End of Workshop*