---
title: "Selection Bias & Missing Data Challenge"
subtitle: "Creating a Statistics Meme: Write Your Own Functions"

author: Vikram Goyal
date: 2025-11-20
output:
  html_document:
    toc: true
    toc_depth: 2
    number_sections: false
    theme: readable
    highlight: tango
format:
  html: default
execute:
  echo: false
  eval: true
---

# 🎨 Selection Bias & Missing Data Challenge

<!-- ::: {.callout-important} -->
## 📊 Challenge Overview

**Your Task:** Create a four-panel statistics meme demonstrating selection bias. You'll write three Python functions yourself to complete the workflow, then assemble them into a professional meme.

<!-- **Deliverables:**

1. Three Python functions you write yourself:
   - `step4_create_block_letter.py` - Create a block letter "S" matching image dimensions
   - `step5_create_masked.py` - Apply the letter mask to the stippled image
   - `create_meme.py` - Assemble all components into the four-panel meme
2. A complete `index.qmd` file that uses all functions to generate your meme
3. Your final statistics meme (as a PNG file) using your own image

**Key Learning:** This challenge teaches you to write modular Python functions and assemble them into a complete workflow. You'll learn to structure code professionally and create a memorable visual representation of selection bias.

**Repository Information:**

- **Source/Starter File:** Available in the main repository at [https://github.com/flyaflya/selectionBiasChallenge](https://github.com/flyaflya/selectionBiasChallenge)
- **Challenge Read Online:** View the challenge instructions at [https://flyaflya.github.io/selectionBiasChallenge](https://flyaflya.github.io/selectionBiasChallenge)
- **Your Submission:** Fork this challenge and create your GitHub Pages site at `https://[your-username].github.io/selectionBiasChallenge/`
:::

## The Problem: Visualizing Selection Bias

Selection bias occurs when observed data isn't representative of the population. Your meme will demonstrate this concept through visual metaphor:

- **Reality**: Your original image represents the true population
- **Your Model**: Your stippled image represents your data collection (sampling)
- **Selection Bias**: A bold letter "S" represents a systematic pattern of missing data
- **Estimate**: The stippled image with the "S" mask applied shows the biased estimate: what you see when selection bias removes data points in a systematic pattern

**Key Concept:** Images are simply matrices: 2D arrays where each value represents a pixel (0.0 = black, 1.0 = white). Your stippled image is a matrix with black dots (data points) on a white background. Selection bias removes some of these pixels (data points) in a systematic pattern (the "S"), creating a biased estimate.

**Key Insight:** When data is missing in a systematic pattern (not random), your estimates become biased. The "S" shape makes it visually obvious that the missing data follows a pattern, just like real selection bias in statistics.

## Example Output

Here's what your final meme should look like:

![Four-panel statistics meme showing Reality (original image), Your Model (stippled version), Selection Bias (letter S), and Estimate (masked stippled image)](statistics_meme.png)

**Your challenge:** Create a similar meme using your own image, with all code hidden in your `index.qmd` file. The final output should show only the meme image and a brief 1-3 sentence explanation of how it demonstrates selection bias.

## Getting Started: Repository Setup 🚀

::: {.callout-important}
## 📁 Repository Setup Instructions

**Step 1:** Fork the starter repository:

- Navigate to [https://github.com/flyaflya/selectionBiasChallenge](https://github.com/flyaflya/selectionBiasChallenge)
- Fork the repository to your GitHub account (this creates `https://github.com/[your-username]/selectionBiasChallenge`)

**Step 2:** Clone your forked repository locally using Cursor (or VS Code)

**Step 3:** Set up GitHub Pages:

- Go to your repository settings (click the "Settings" tab in your GitHub repository)
- Scroll down to the "Pages" section in the left sidebar
- Under "Source", select "Deploy from a branch"
- Choose "main" branch and "/ (root)" folder
- Click "Save"
- Your site will be available at: `https://[your-username].github.io/selectionBiasChallenge/`
- **Note:** It may take a few minutes for the site to become available after enabling Pages

**Step 4:** You're ready to start! Use the `index.qmd` file as your starting point.
:::

## Workflow Overview

This challenge is organized into discrete steps. Steps 1-3 are provided for you. **You must write Steps 4-5 and create_meme.py yourself:**

1. **Step 1**: Prepare black and white image (provided) ✅
2. **Step 2**: Create stippled image using blue noise stippling (provided) ✅
3. **Step 3**: Create tonal analysis (optional refinement step, provided) ✅
4. **Step 4**: Create block letter "S" matching image dimensions (**YOU WRITE THIS**) ⚠️
5. **Step 5**: Create masked image by applying the letter mask to the stippled image (**YOU WRITE THIS**) ⚠️
6. **create_meme.py**: Assemble all components into the four-panel meme (**YOU WRITE THIS**) ⚠️

**Note:** Step 3 is optional but recommended. It helps you understand your image's brightness distribution and refine the stippling parameters in Step 2 for better results.

## Understanding the Workflow

This challenge uses a modular design where each step is implemented as a discrete function in a separate file. This structure provides several benefits:

### Modular Design Benefits

1. **Modularity**: Each step can be modified independently
2. **Reusability**: Functions can be used in other projects
3. **Testability**: Each function can be tested separately
4. **Clarity**: The workflow is easy to understand and follow
5. **Maintainability**: Changes to one step don't affect others

### Function Files

**Steps you'll use (provided):**
- **`step1_prepare_image.py`**: Image loading and preprocessing
- **`step2_create_stipple.py`**: Blue noise stippling algorithm
- **`step3_create_tonal.py`**: Tonal analysis (optional)

**Steps you'll write:**
- **`step4_create_block_letter.py`**: Block letter generation ⚠️
- **`step5_create_masked.py`**: Mask application ⚠️
- **`create_meme.py`**: Final assembly and visualization ⚠️

**Supporting functions (provided):**
- **`importance_map.py`**: Computes importance map for stippling
- **`stippling_functions.py`**: Core stippling algorithm functions -->

## Step 1: Prepare Image

Load an image, convert to grayscale, and resize to appropriate dimensions while maintaining aspect ratio.

In [None]:
#| label: step1-prepare
#| echo: false
#| fig-cap: Original image prepared for processing

import numpy as np
import matplotlib.pyplot as plt
from step1_prepare_image import prepare_image

# Load and prepare the image
# CHANGE THIS to use your own image!
img_path = 'vikramgoyal2.jpeg'  # Example image - replace with your own image
gray_image = prepare_image(img_path, max_size=512)

# Display the prepared image
fig, ax = plt.subplots(figsize=(6.5, 5))
ax.imshow(gray_image, cmap='gray', vmin=0, vmax=1)
ax.axis('off')
ax.set_title('Step 1: Prepared Image', fontsize=14, fontweight='bold', pad=10)
plt.tight_layout()
plt.show()
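
If you are curious what a helper like `prepare_image` might do internally, here is a minimal NumPy-only sketch of the two operations described above: grayscale conversion and an aspect-preserving resize. The function name `to_gray_and_resize`, the Rec. 601 luminance weights, and the nearest-neighbour resampling are assumptions made for illustration; the provided `step1_prepare_image.py` may work differently.

```python
import numpy as np

def to_gray_and_resize(rgb: np.ndarray, max_size: int = 512) -> np.ndarray:
    """Hypothetical helper: RGB uint8 array -> grayscale float array in [0, 1]."""
    # Weighted sum of the colour channels (Rec. 601 luma weights), scaled to [0, 1].
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb[..., :3].astype(np.float64) @ weights / 255.0

    # Shrink (never enlarge) so the longer side is at most max_size.
    h, w = gray.shape
    scale = min(1.0, max_size / max(h, w))
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))

    # Nearest-neighbour resampling: pick one source row/column per output row/column.
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return gray[np.ix_(rows, cols)]

# Synthetic 300x600 RGB image standing in for a photo.
demo = np.random.default_rng(0).integers(0, 256, size=(300, 600, 3), dtype=np.uint8)
small = to_gray_and_resize(demo, max_size=200)
print(small.shape)  # (100, 200)
```

Both sides are scaled by the same factor, so the 1:2 aspect ratio of the input is preserved in the output.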

## Step 2: Create Stippled Image

Generate a blue noise stippling pattern from the prepared image. The dots follow the image's importance map while staying evenly spaced, so the picture remains recognizable without clumps or gaps.

In [None]:
#| label: step2-stipple
#| echo: false
#| fig-cap: Blue noise stippling pattern
#| warning: false

from step2_create_stipple import create_stipple

# Create stippled image
stipple_pattern, samples = create_stipple(
    gray_image,
    percentage=0.08,  # 8% of pixels will be stippled
    sigma=0.9,  # Repulsion radius
    content_bias=0.9,  # Strongly follow importance map
    noise_scale_factor=0.1,  # Moderate exploration
    extreme_downweight=0.5,  # Moderate downweighting of extremes
    extreme_threshold_low=0.2,  # Downweight tones below 0.2
    extreme_threshold_high=0.8,  # Downweight tones above 0.8
    extreme_sigma=0.1  # Smooth transition width
)

# Display the stippled image
fig, ax = plt.subplots(figsize=(6.5, 5))
ax.imshow(stipple_pattern, cmap='gray', vmin=0, vmax=1)
ax.axis('off')
ax.set_title('Step 2: Stippled Image', fontsize=14, fontweight='bold', pad=10)
plt.tight_layout()
plt.show()
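
To build intuition for what a stippler does, the sketch below is a deliberately simplified stand-in, not the blue noise algorithm in `step2_create_stipple.py`. It assumes that importance is plain darkness and enforces spacing by blocking a small square around each accepted dot. The names `simple_stipple` and `min_gap` are hypothetical.

```python
import numpy as np

def simple_stipple(gray: np.ndarray, percentage: float = 0.08,
                   min_gap: int = 1, seed: int = 0) -> np.ndarray:
    """Hypothetical stand-in: darker pixels attract more dots, and a
    blocked neighbourhood around each dot keeps dots from clumping."""
    rng = np.random.default_rng(seed)
    h, w = gray.shape
    target = int(percentage * h * w)  # upper bound on the number of dots

    # Importance = darkness, normalised into a sampling distribution over pixels.
    importance = (1.0 - gray).ravel() + 1e-6
    prob = importance / importance.sum()

    # Draw candidate pixels without replacement, favouring dark pixels.
    n_candidates = min(h * w, 4 * target)
    candidates = rng.choice(h * w, size=n_candidates, replace=False, p=prob)

    out = np.ones((h, w))                      # white background
    blocked = np.zeros((h, w), dtype=bool)     # pixels too close to a dot
    placed = 0
    for idx in candidates:
        y, x = divmod(int(idx), w)
        if blocked[y, x]:
            continue
        out[y, x] = 0.0                        # black dot
        # Block the surrounding square so later dots keep their distance.
        blocked[max(0, y - min_gap):y + min_gap + 1,
                max(0, x - min_gap):x + min_gap + 1] = True
        placed += 1
        if placed >= target:
            break
    return out

# Horizontal gradient: black on the left, white on the right.
demo = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
dots = simple_stipple(demo, percentage=0.08, seed=0)
print("dots placed:", int((dots == 0.0).sum()))
```

The function may place fewer dots than `percentage` requests when the spacing rule rejects too many candidates, which is one reason real stipplers use more careful placement strategies.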

## Step 3: Create Tonal Analysis
<!-- ::: {.callout-note} 
## 🔧 Optional Refinement Step

**Step 3 is optional** but highly recommended! It creates a box-averaged tonal analysis that helps you understand the brightness distribution across your image. Use this information to **tune the stippling parameters in Step 2** for better results.

**How to use it:**
- Analyze the tonal distribution to identify key brightness ranges
- Adjust `extreme_threshold_low` and `extreme_threshold_high` based on your image's tone distribution
- Tune `mid_tone_center` to match important features (e.g., skin tones around 0.7)
- Refine `extreme_downweight` based on how much you want to reduce stipples in extreme regions
::: -->

Create a tonal analysis by dividing the image into a grid and computing average brightness in each section. This visualizes the distribution of tones and helps identify which brightness ranges are most important.

In [None]:
#| label: step3-tonal
#| echo: false
#| fig-cap: Box-averaged tonal analysis showing brightness distribution

from step3_create_tonal import create_tonal
import matplotlib.pyplot as plt

# Create tonal analysis with a 16×12 grid
grid_rows = 16
grid_cols = 12
tonal_image, average_tones, tonal_stats = create_tonal(
    gray_image,
    grid_rows=grid_rows,
    grid_cols=grid_cols,
    return_full_image=True
)

# Display the box-averaged tonal image with text annotations
fig, ax = plt.subplots(figsize=(6.5, 5))

# Show box-averaged tonal image
ax.imshow(tonal_image, cmap='gray', vmin=0, vmax=1)
ax.axis('off')
ax.set_title('Step 3: Box-Averaged Tonal Analysis', fontsize=14, fontweight='bold', pad=10)

# Calculate grid cell dimensions for text placement
h, w = gray_image.shape
section_h = h / grid_rows
section_w = w / grid_cols

# Add text annotations showing tone values (2 decimals) at the center of each grid cell
for i in range(grid_rows):
    for j in range(grid_cols):
        tone = average_tones[i, j]
        # Calculate center position of the grid cell
        y_center = (i + 0.5) * section_h
        x_center = (j + 0.5) * section_w
        # Use white text for dark sections, black text for light sections
        text_color = 'white' if tone < 0.5 else 'black'
        ax.text(x_center, y_center, f'{tone:.2f}', 
                ha='center', va='center', 
                color=text_color, fontsize=6, fontweight='bold')

plt.tight_layout()
plt.show()

# Print key statistics for parameter tuning
# print(f"\n📊 Tonal Statistics for Parameter Tuning:")
# print(f"  Mean brightness: {tonal_stats['mean']:.3f}")
# print(f"  Standard deviation: {tonal_stats['std']:.3f}")
# print(f"  Brightness range: [{tonal_stats['min']:.3f}, {tonal_stats['max']:.3f}]")
# print(f"\n💡 Tuning Tips:")
# print(f"  - If mean < 0.4: Image is dark, consider lowering extreme_threshold_low")
# print(f"  - If mean > 0.6: Image is light, consider raising extreme_threshold_high")
# print(f"  - If std > 0.2: High contrast, may need stronger extreme_downweight")
# print(f"  - Use mid_tone_center around {tonal_stats['mean']:.2f} to emphasize average tones")
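
The box averaging itself can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper name `box_average` is hypothetical, and the provided `create_tonal` may compute its grid differently or return additional values.

```python
import numpy as np

def box_average(gray: np.ndarray, grid_rows: int = 16, grid_cols: int = 12) -> np.ndarray:
    """Hypothetical helper: mean brightness of each cell in a grid.

    Assumes the image has at least grid_rows rows and grid_cols columns,
    so that every cell contains at least one pixel.
    """
    h, w = gray.shape
    # Integer cell boundaries; this also handles sizes that do not divide evenly.
    row_edges = np.arange(grid_rows + 1) * h // grid_rows
    col_edges = np.arange(grid_cols + 1) * w // grid_cols

    tones = np.empty((grid_rows, grid_cols))
    for i in range(grid_rows):
        for j in range(grid_cols):
            cell = gray[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            tones[i, j] = cell.mean()
    return tones

# Left-to-right gradient: each grid column should be brighter than the previous one.
demo = np.tile(np.linspace(0.0, 1.0, 120), (160, 1))
tones = box_average(demo)
print(tones.shape)  # (16, 12)
print(np.round(tones[0, :3], 2))
```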

## Step 4: Create Block Letter "S"

<!-- ::: {.callout-warning} -->
<!-- ## 🎯 Your Challenge: Write `step4_create_block_letter.py` -->

**Task:** Create a function `create_block_letter_s()` that generates a block letter "S" matching image dimensions.

<!-- **Requirements:**
- Function signature: `create_block_letter_s(height: int, width: int, letter: str = "S", font_size_ratio: float = 0.9) -> np.ndarray`
- Returns a 2D numpy array (height × width) with values in [0, 1]
- The letter should be black (0.0) on a white background (1.0)
- The letter should be centered and scaled appropriately to fit within the image
- Use PIL/Pillow's ImageDraw or similar to render the letter

**Hints:**
- You can use `PIL.Image` and `PIL.ImageDraw` to draw text
- Try multiple font paths (e.g., system fonts) if one doesn't work
- Make the letter bold and large enough to be clearly visible
- The letter represents the "selection bias" pattern in your meme
:::

**Your code should go in a file called `step4_create_block_letter.py`.** Once you've written it, you'll use it like this: -->

In [None]:
#| label: step4-block-letter
#| echo: false
#| fig-cap: Block letter S representing selection bias
#| eval: false

# UNCOMMENT AND USE THIS ONCE YOU'VE WRITTEN step4_create_block_letter.py:
# from step4_create_block_letter import create_block_letter_s
# 
# # Get image dimensions
# h, w = gray_image.shape
# 
# # Create block letter S
# block_letter = create_block_letter_s(h, w, letter="S", font_size_ratio=0.9)
# 
# # Display the block letter
# fig, ax = plt.subplots(figsize=(6.5, 5))
# ax.imshow(block_letter, cmap='gray', vmin=0, vmax=1)
# ax.axis('off')
# ax.set_title('Step 4: Selection Bias (Block Letter S)', fontsize=14, fontweight='bold', pad=10)
# plt.tight_layout()
# plt.show()
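
One possible starting point for `create_block_letter_s()` is sketched below. It is only a sketch under stated assumptions: instead of rendering a font, it builds the "S" from five rectangles in NumPy, which avoids any dependency on system fonts. The 3:5 letter proportions and the stroke thickness are arbitrary choices, and the `letter` argument is accepted but only `"S"` is supported.

```python
import numpy as np

def create_block_letter_s(height: int, width: int, letter: str = "S",
                          font_size_ratio: float = 0.9) -> np.ndarray:
    """Sketch: a block "S" (0.0 = black) on a white (1.0) canvas, built from rectangles."""
    if letter != "S":
        raise ValueError("this font-free sketch only draws the letter 'S'")
    canvas = np.ones((height, width), dtype=np.float64)

    # Letter bounding box: centred, scaled by font_size_ratio, roughly 3:5 aspect.
    box_h = int(height * font_size_ratio)
    box_w = min(int(box_h * 0.6), int(width * font_size_ratio))
    top, left = (height - box_h) // 2, (width - box_w) // 2
    t = max(1, box_h // 5)  # stroke thickness (assumed: one fifth of the letter height)

    # Three horizontal bars: top, middle, bottom.
    mid = top + (box_h - t) // 2
    for y in (top, mid, top + box_h - t):
        canvas[y:y + t, left:left + box_w] = 0.0
    # Upper-left and lower-right vertical strokes join the bars into an "S".
    canvas[top:mid + t, left:left + t] = 0.0
    canvas[mid:top + box_h, left + box_w - t:left + box_w] = 0.0
    return canvas

block = create_block_letter_s(100, 80)
print(block.shape, "black fraction:", round(float((block == 0.0).mean()), 3))
```

A font-based version using `PIL.ImageDraw` would produce a more natural letter shape and support other letters; the rectangle approach trades that flexibility for portability.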

## Step 5: Create Masked Image ⚠️ **YOUR TASK**

<!-- ::: {.callout-warning} -->
<!-- ## 🎯 Your Challenge: Write `step5_create_masked.py` -->

**Task:** Create a function `create_masked_stipple()` that applies the block letter mask to the stippled image.

<!-- **Requirements:**
- Function signature: `create_masked_stipple(stipple_img: np.ndarray, mask_img: np.ndarray, threshold: float = 0.5) -> np.ndarray`
- Returns a 2D numpy array with the same shape as the input images
- Where the mask is dark (below threshold), remove stipples (set to white/1.0)
- Where the mask is light (above threshold), keep the stipples as they are
- This creates the "biased estimate" by systematically removing data points

**Hints:**
- The mask image has values in [0, 1] where 0.0 = black (mask area) and 1.0 = white (keep area)
- Use numpy boolean indexing or np.where() to apply the mask
- The threshold determines what counts as "part of the mask"
:::

**Your code should go in a file called `step5_create_masked.py`.** Once you've written it, you'll use it like this: -->

In [None]:
#| label: step5-masked
#| echo: false
#| fig-cap: Masked stippled image showing selection bias effect
#| eval: false

# UNCOMMENT AND USE THIS ONCE YOU'VE WRITTEN step5_create_masked.py:
# from step5_create_masked import create_masked_stipple
# 
# # Create masked stippled image
# masked_stipple = create_masked_stipple(
#     stipple_pattern,
#     block_letter,
#     threshold=0.5  # Pixels below 0.5 are considered part of the mask
# )
# 
# # Display the masked image
# fig, ax = plt.subplots(figsize=(6.5, 5))
# ax.imshow(masked_stipple, cmap='gray', vmin=0, vmax=1)
# ax.axis('off')
# ax.set_title('Step 5: Masked Stippled Image (Estimate)', fontsize=14, fontweight='bold', pad=10)
# plt.tight_layout()
# plt.show()
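
A minimal sketch of `create_masked_stipple()` could rely on `np.where`. It assumes both arrays share the same shape and the 0.0 = black, 1.0 = white convention used throughout this challenge; treat it as one way to do it, not the required solution.

```python
import numpy as np

def create_masked_stipple(stipple_img: np.ndarray, mask_img: np.ndarray,
                          threshold: float = 0.5) -> np.ndarray:
    """Sketch: erase stipples wherever the mask is darker than the threshold."""
    if stipple_img.shape != mask_img.shape:
        raise ValueError("stipple_img and mask_img must have the same shape")
    # Dark mask pixels become white (data removed); everything else is kept as is.
    return np.where(mask_img < threshold, 1.0, stipple_img)

stipple = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
mask = np.array([[0.0, 0.0, 1.0],
                 [1.0, 0.0, 1.0]])
print(create_masked_stipple(stipple, mask))
# [[1. 1. 0.]
#  [0. 1. 1.]]
```

Two of the four black dots sit under the dark part of the mask and disappear. The removal depends on where a dot is, not on chance, which is the selection bias the meme illustrates.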

## Create the Final Statistics Meme 

<!-- ::: {.callout-warning} -->
<!-- ## 🎯 Your Challenge: Write `create_meme.py` -->

**Task:** Create a function `create_statistics_meme()` that assembles all four panels into a professional-looking meme.

<!-- **Requirements:**
- Function signature: `create_statistics_meme(original_img: np.ndarray, stipple_img: np.ndarray, block_letter_img: np.ndarray, masked_stipple_img: np.ndarray, output_path: str, dpi: int = 150, background_color: str = "white") -> None`
- Creates a 1×4 layout (four panels side by side)
- Each panel should be labeled: "Reality", "Your Model", "Selection Bias", "Estimate"
- Save the result as a PNG file
- Make it look professional with good spacing, labels, and layout

**Hints:**
- Use matplotlib's `subplots()` or `GridSpec` to create the layout
- Add text labels above or below each panel
- Consider adding a border or background color
- Use high DPI (150-300) for publication quality
- Make sure all images are the same size or handle resizing appropriately
:::

**Your code should go in a file called `create_meme.py`.** Once you've written it, you'll use it like this: -->

In [None]:
#| label: create-final-meme
#| echo: false
#| eval: false

# UNCOMMENT AND USE THIS ONCE YOU'VE WRITTEN create_meme.py:
# from create_meme import create_statistics_meme
# 
# # Create the final meme
# create_statistics_meme(
#     original_img=gray_image,
#     stipple_img=stipple_pattern,
#     block_letter_img=block_letter,
#     masked_stipple_img=masked_stipple,
#     output_path="my_statistics_meme.png",
#     dpi=150,
#     background_color="white"  # or "pink", "lightgray", etc.
# )
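
As a rough sketch of how `create_statistics_meme()` might be assembled, the version below places the four arrays side by side with matplotlib and saves a PNG. The figure size, font sizes, and the use of a standalone `Figure` object (rather than `plt.subplots`) are assumptions chosen for illustration; adjust the styling to taste.

```python
import numpy as np
from matplotlib.figure import Figure

def create_statistics_meme(original_img: np.ndarray, stipple_img: np.ndarray,
                           block_letter_img: np.ndarray, masked_stipple_img: np.ndarray,
                           output_path: str, dpi: int = 150,
                           background_color: str = "white") -> None:
    """Sketch: save a 1x4 figure labelled Reality, Your Model, Selection Bias, Estimate."""
    panels = [
        ("Reality", original_img),
        ("Your Model", stipple_img),
        ("Selection Bias", block_letter_img),
        ("Estimate", masked_stipple_img),
    ]
    # A standalone Figure keeps this function free of global pyplot state.
    fig = Figure(figsize=(16, 4.5), facecolor=background_color)
    axes = fig.subplots(1, 4)
    for ax, (label, img) in zip(axes, panels):
        ax.imshow(img, cmap="gray", vmin=0, vmax=1)
        ax.set_title(label, fontsize=16, fontweight="bold", pad=10)
        ax.axis("off")
    # bbox_inches="tight" trims excess margin around the panels when saving.
    fig.savefig(output_path, dpi=dpi, facecolor=background_color, bbox_inches="tight")
```

Because every panel is drawn with the same `vmin` and `vmax`, the gray levels are directly comparable across the four images.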

<!-- ## Your Final Submission

### Complete Checklist

To complete this challenge, you must:

1. ✅ **Use Step 1** to prepare your own image (with your own image file)
2. ✅ **Use Step 2** to generate a stippled image using blue noise stippling
3. ⭐ **Optionally use Step 3** to analyze tonal distribution and refine Step 2 parameters (recommended)
4. ⚠️ **Write Step 4**: Create `step4_create_block_letter.py` to generate the block letter "S"
5. ⚠️ **Write Step 5**: Create `step5_create_masked.py` to apply the mask
6. ⚠️ **Write create_meme.py**: Create `create_meme.py` to assemble the four-panel meme
7. ✅ **Create a complete `index.qmd`** that uses all functions (with code hidden)
8. ✅ **Generate your final meme** using your own image
9. ✅ **Include a brief explanation** (1-3 sentences) of how the meme demonstrates selection bias

### Final Output Requirements

**Important:** All code should be hidden (`echo: false`) in your final `index.qmd` output. The rendered HTML should show only:
- The final meme image
- A brief explanation (1-3 sentences) of how it demonstrates selection bias

### Template for Final Section

Here's a template for your final section: -->

In [None]:
#| label: final-meme
#| echo: false
#| eval: false
#| fig-cap: Statistics meme demonstrating selection bias

# Your code to create and display the meme goes here
# (all the steps above, uncommented and working)

# Display the final meme
# from IPython.display import Image, display
# display(Image("my_statistics_meme.png"))

<!-- ### Example Explanation

Your explanation should be 1-3 sentences. Here's an example:

> This meme demonstrates selection bias by showing how systematic missing data patterns distort our understanding of reality. The original image (Reality) represents the true population, while the stippled version (Your Model) shows our data collection. When selection bias removes data points in a systematic "S" pattern, the resulting estimate becomes biased and no longer represents the true population, just as missing data in real-world studies can lead to incorrect conclusions.

## Tips for Success

1. **Image Selection**: Choose an image with good contrast for best stippling results
2. **Use Tonal Analysis**: Run Step 3 to understand your image's brightness distribution, then refine Step 2 parameters
3. **Function Design**: Write clean, well-documented functions with clear parameter types and return values
4. **Test Incrementally**: Test each function separately before integrating them
5. **Professional Output**: Make your meme look polished with good labels, spacing, and layout
6. **Code Organization**: Keep your functions in separate `.py` files as specified
7. **Documentation**: Add docstrings to your functions explaining parameters and return values -->

## Final Statistics Meme

In [None]:
#| label: display-statistics-meme
#| echo: false
#| fig-cap: Statistics meme demonstrating selection bias

from IPython.display import Image, display

# Display the final statistics meme
display(Image("statistics_meme.png"))

### Explanation

This meme demonstrates selection bias by showing how systematic missing data patterns distort our understanding of reality. The original image (Reality) represents the true population, while the stippled version (Your Model) shows our data collection through sampling. When selection bias removes data points in a systematic "S" pattern, the resulting estimate becomes biased and no longer accurately represents the true population, just as missing data in real-world studies can lead to incorrect conclusions.

## Conclusion

By completing this challenge, you'll have created a memorable visual representation of selection bias that demonstrates how systematic missing data patterns can distort our understanding of reality. The skills you've practiced (writing modular Python functions, processing images, and creating professional visualizations) are directly applicable to real-world data analysis projects.

As you work with real datasets, remember the lesson of this meme: when data is missing in a systematic pattern rather than randomly, your estimates become biased. Recognizing and addressing selection bias is crucial for drawing valid conclusions from your data.
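
The same lesson can be checked numerically. The simulation below is a small illustration with made-up numbers (a normal population with mean 50 and a cutoff at 55, both arbitrary assumptions): it compares the sample mean when observations go missing at random with the sample mean when large values are systematically never observed.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Missing completely at random: drop about 40% of observations regardless of value.
keep_random = rng.random(population.size) > 0.4
# Systematic missingness: values above 55 are never observed.
keep_systematic = population <= 55.0

true_mean = population.mean()
random_mean = population[keep_random].mean()
systematic_mean = population[keep_systematic].mean()

print(f"true mean:              {true_mean:.2f}")
print(f"random missingness:     {random_mean:.2f}")
print(f"systematic missingness: {systematic_mean:.2f}")
```

Random missingness shrinks the sample but leaves the mean close to the truth, while the systematic cutoff pulls the estimate well below it, just as the "S" mask removes dots from one region rather than thinning them evenly.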
