# Chapter 10: Observational Studies vs. Designed Experiments

## Lesson Title: "Crash Course: Types of Studies in Engineering"

## Learning Goals
1.  **Observational Studies**: Differentiate between **Retrospective** (looking back) and **Prospective** (looking forward) studies.
2.  **Lurking Variables**: Identify hidden variables that cause misleading correlations (e.g., Ice Cream vs. Drowning).
3.  **Designed Experiments**: Apply the **Four Principles**: Control, Randomize, Replicate, and Block.
4.  **Vocabulary**: Define Factors, Levels, Treatments, Blinding, and Placebos.

## Part 1: Observational Studies
In an **Observational Study** we record what happens without assigning any treatments. It can reveal trends, but it **cannot prove cause-and-effect**.

### Retrospective vs. Prospective
*   **Retrospective**: We look at past records (e.g., Analyzing 10 years of bridge maintenance logs).
*   **Prospective**: We follow subjects into the future (e.g., Identifying 100 new bridges and tracking their cracks for the next 10 years).

In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# ------------------------------------------------------------------------
# INTERACTIVE QUIZ: RETROSPECTIVE VS PROSPECTIVE
# ------------------------------------------------------------------------

quiz_questions = [
    {
        "q": "A university recruits 200 students and asks them to log their daily screen time and sleep duration for the next 6 months.",
        "a": "Prospective"
    },
    {
        "q": "Researchers gathered medical records of 500 people with lung cancer and examined their past smoking histories.",
        "a": "Retrospective"
    },
    {
        "q": "Engineers identify neighborhoods planning tree-planting and measure temperature changes over the next decade.",
        "a": "Prospective"
    }
]

output_quiz = widgets.Output()

def run_quiz():
    with output_quiz:
        clear_output()
        print("--- QUIZ: Identify the Study Type ---")
        for i, item in enumerate(quiz_questions):
            print(f"\nQ{i+1}: {item['q']}")
            # Simple answer reveal for the notebook format
            print(f"(Answer: This is a {item['a']} study because it looks {'forward' if item['a']=='Prospective' else 'backward'}.)")

run_quiz()
display(output_quiz)

### Lurking Variables
A **Lurking Variable** is a hidden variable that influences both quantities being compared, so they appear to be linked even though neither one causes the other.

**Classic Example**: "Months with higher **Ice Cream Sales** have more **Drownings**."
*   Does ice cream cause drowning?
*   **NO**. The Lurking Variable is **Summer Heat**. Hot weather makes people eat ice cream AND swim more.
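
The ice cream story can be sketched numerically. The snippet below is a minimal illustration with invented numbers (the coefficients, the noise levels, and the 24 to 26 degree band are all assumptions, not real data): temperature drives both quantities, and neither one affects the other.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily data: temperature is the lurking variable.
temps = [rng.uniform(5, 35) for _ in range(5000)]             # degrees C
ice_cream = [20 + 3.0 * t + rng.gauss(0, 10) for t in temps]  # invented sales index
drownings = [5 + 2.0 * t + rng.gauss(0, 8) for t in temps]    # invented incident index

r_all = pearson(ice_cream, drownings)

# Hold the lurking variable roughly constant: keep only days between 24 and 26 degrees.
band = [i for i, t in enumerate(temps) if 24 <= t <= 26]
r_band = pearson([ice_cream[i] for i in band], [drownings[i] for i in band])

print(f"Correlation across all days:      {r_all:.2f}")
print(f"Correlation on 24-26 degree days: {r_band:.2f}")
```

Across all days the correlation should be strong, but once temperature is held nearly fixed it should fall to roughly zero. That collapse is the signature of a lurking variable.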

Let's look at a car-accident example below.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# ------------------------------------------------------------------------
# DEMO: LURKING VARIABLE IN CAR ACCIDENTS
# ------------------------------------------------------------------------

def generate_crash_data(n=300):
    np.random.seed(42)
    
    # Lurking Variable: Driver Aggression Level (Hidden from the dataset!)
    # 0 = Calm, 1 = Aggressive
    aggression = np.random.choice([0, 1], size=n, p=[0.7, 0.3])
    
    data = []
    for i in range(n):
        is_aggressive = aggression[i]
        
        # Aggressive drivers prefer Red Sports Cars
        if is_aggressive:
            car_color = np.random.choice(['Red', 'Black'], p=[0.8, 0.2])
            speed = np.random.normal(90, 10) # Fast
        else:
            car_color = np.random.choice(['White', 'Silver', 'Blue', 'Red'], p=[0.3, 0.3, 0.3, 0.1])
            speed = np.random.normal(60, 5) # Normal
            
        # Accident risk depends on SPEED, not COLOR
        accident_risk = (speed - 50) / 100 
        has_accident = 1 if np.random.random() < accident_risk else 0
        
        data.append({'Color': car_color, 'Accident': has_accident})
        
    return pd.DataFrame(data)

df_cars = generate_crash_data()

plt.figure(figsize=(8, 4))
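# errorbar=None replaces the deprecated ci=None; palette needs hue (assumes seaborn >= 0.13)
sns.barplot(data=df_cars, x='Color', y='Accident', hue='Color', errorbar=None, palette='viridis', legend=False)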
plt.title("Observational Data: Accident Rate by Car Color")
plt.ylabel("Probability of Accident")
plt.show()

print("Does Red Paint cause accidents? Or is it the driver?")
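
One way to answer that question is to compare colors *within* each driver type. The demo above hides aggression from the dataset, so this sketch re-simulates drivers with the same rules and keeps the driver type. The helper names (`simulate_driver`, `red_vs_other`) and the sample size of 20,000 are assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def simulate_driver():
    """Return (driver_type, color, had_accident) for one hypothetical driver."""
    aggressive = rng.random() < 0.3
    if aggressive:
        color = rng.choices(["Red", "Black"], weights=[0.8, 0.2])[0]
        speed = rng.gauss(90, 10)
    else:
        color = rng.choices(["White", "Silver", "Blue", "Red"],
                            weights=[0.3, 0.3, 0.3, 0.1])[0]
        speed = rng.gauss(60, 5)
    # Same rule as the demo: risk depends on speed only, never on color.
    had_accident = rng.random() < (speed - 50) / 100
    return ("Aggressive" if aggressive else "Calm", color, had_accident)

drivers = [simulate_driver() for _ in range(20000)]

def red_vs_other(rows):
    """Accident rate of red cars minus accident rate of all other colors."""
    red = [hit for _, color, hit in rows if color == "Red"]
    other = [hit for _, color, hit in rows if color != "Red"]
    return sum(red) / len(red) - sum(other) / len(other)

# Pooled comparison: what the observational bar chart effectively shows.
pooled_gap = red_vs_other(drivers)

# Stratified comparison: hold the lurking variable (driver type) fixed.
gap_by_type = {
    kind: red_vs_other([d for d in drivers if d[0] == kind])
    for kind in ("Calm", "Aggressive")
}

print(f"Pooled gap (Red minus other): {pooled_gap:+.2f}")
for kind, gap in gap_by_type.items():
    print(f"{kind} drivers only: {gap:+.2f}")
```

The pooled gap should be clearly positive, while the gap inside each driver type should sit near zero. The paint is innocent; the driver is not.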

## Part 2: Designed Experiments

To prove cause-and-effect, we must **Design an Experiment**: instead of just watching, we decide which unit receives which treatment.

### Vocabulary
*   **Experimental Units**: The things we test on (e.g., The Cars).
*   **Factors**: The variables we manipulate (e.g., Brake Type, Tire Brand).
*   **Levels**: The specific values of the factor (e.g., Brake System A vs Brake System B).
*   **Treatments**: The specific combination of levels applied to a unit.
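
The relationship between factors, levels, and treatments can be sketched in a few lines. The factor names and levels below are illustrative assumptions; the point is only that every treatment picks one level from each factor.

```python
from itertools import product

# Hypothetical factors and their levels for a braking study.
factors = {
    "Brake Type": ["System A", "System B"],
    "Tire Brand": ["Brand X", "Brand Y", "Brand Z"],
}

# A treatment is one level chosen from every factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)

print(f"{len(treatments)} treatments = 2 brake levels x 3 tire levels")
```

With two factors at 2 and 3 levels, the full set contains 2 x 3 = 6 treatments, and each experimental unit receives exactly one of them.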

### The Four Principles
1.  **Control**: Make conditions as similar as possible (Same track, same weather).
2.  **Randomize**: Assign treatments randomly to equalize unknown effects.
3.  **Replicate**: Test multiple cars, not just one.
4.  **Block**: Group similar units together to reduce variability (e.g., Test on Dry Road separate from Wet Road).
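
Three of the four principles can be expressed directly as an assignment procedure; Control lives outside the code (same track, same driver, same tire pressure). The sketch below assumes 20 hypothetical cars, with 10 scheduled for each road condition.

```python
import random
from collections import Counter

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical experimental units: 20 cars, 10 per road condition.
cars = [f"Car-{i:02d}" for i in range(1, 21)]
blocks = {"Dry": cars[:10], "Wet": cars[10:]}
treatments = ["System A", "System B"]

assignment = {}
for road, block_cars in blocks.items():  # Block: handle each road condition separately
    # Replicate: every treatment appears 5 times inside each block.
    labels = treatments * (len(block_cars) // len(treatments))
    rng.shuffle(labels)                  # Randomize: chance decides which car gets which system
    for car, label in zip(block_cars, labels):
        assignment[car] = (road, label)

for car, (road, label) in assignment.items():
    print(f"{car}  {road:<3}  {label}")

counts = Counter(assignment.values())
print(counts)
```

Because the shuffle happens inside each block, both brake systems are guaranteed to appear on both road conditions, with five replicates each.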

### Diagramming the Experiment
We can visualize our "Randomized Block Design" below.

In [None]:
# Code to just print a text-based diagram of the experiment design
def print_diagram():
    print(" EXPERIMENT TREE DIAGRAM")
    print(" =======================")
    print("           [ All 20 Cars ] ")
    print("                  | ")
    print("         BLOCKING (by Road) ")
    print("          /               \\ ")
    print("    [10 Dry Road]     [10 Wet Road] ")
    print("       /     \\           /     \\ ")
    print("  RANDOMIZE  RANDOMIZE  RANDOMIZE  RANDOMIZE")
    print("    /           \\       /           \\ ")
    print(" [Sys A]     [Sys B] [Sys A]     [Sys B]")
    print("    |           |       |           | ")
    print(" COMPARE     COMPARE  COMPARE     COMPARE")

print_diagram()

### Blinding and Placebos
*   **Blinding**: Hiding the treatment info to prevent bias.
*   **Single-Blind**: The driver doesn't know if they have the "New Brakes" or "Old Brakes" (so they don't subconsciously brake harder).
*   **Double-Blind**: Neither the driver NOR the scientist measuring the skid marks knows which car is which until the end.
*   **Placebo**: A fake treatment (like a sugar pill) used in medical trials. In engineering, the closest analogue is a "Control Group" that uses the standard existing part.
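
In practice, blinding usually comes down to bookkeeping: someone who takes no part in the test holds a sealed key, and everyone else sees only neutral codes. The sketch below is one possible arrangement, not a standard protocol. The code format (`Unit-101`, ...) is an assumption, and the measurements are simulated placeholders.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is repeatable

cars = [f"Car-{i:02d}" for i in range(1, 9)]

# Coordinator (the only unblinded person): randomly assign 4 new and 4 old brake sets.
labels = ["New Brakes"] * 4 + ["Old Brakes"] * 4
rng.shuffle(labels)

# Give each car a neutral code that reveals nothing about its treatment.
codes = [f"Unit-{n}" for n in range(101, 101 + len(cars))]
rng.shuffle(codes)
sealed_key = {code: (car, label) for code, car, label in zip(codes, cars, labels)}

# Driver and measurer (both blinded): they work from the codes alone.
run_sheet = sorted(sealed_key)
measurements = {code: round(rng.gauss(40, 2), 1) for code in run_sheet}  # simulated placeholder values

# Unblinding: join codes back to treatments only after all data are recorded.
for code in run_sheet:
    car, label = sealed_key[code]
    print(f"{code}  {car}  {label:<10}  {measurements[code]} m")
```

If only the driver works from `run_sheet`, the test is single-blind; if the person measuring the skid marks does too, it is double-blind.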

In [None]:
# ------------------------------------------------------------------------
# SIMULATION: THE BRAKING EXPERIMENT (With Blocking & Randomization)
# ------------------------------------------------------------------------

style = {'description_width': 'initial'}

randomize_check = widgets.Checkbox(
    value=False, 
    description='ENABLE Randomized Assignment',
    style=style
)

output_exp = widgets.Output()

def run_brake_experiment(b):
    is_randomized = randomize_check.value
    
    with output_exp:
        clear_output(wait=True)
        
        # Setup the "environment"
        # 20 Test Runs. First 10 are Dry, Last 10 are Wet (Nature decides this)
        conditions = ['Dry'] * 10 + ['Wet'] * 10
        
        # ASSIGN TREATMENTS (Brake System A vs B)
        if not is_randomized:
            # BAD SCIENCE: Lazy engineer tests A in the morning (Dry) and B in the afternoon (Wet)
            treatments = ['System A'] * 10 + ['System B'] * 10
            design_score = "POOR (Confounded)"
        else:
            # GOOD SCIENCE: Within each road condition (block), randomly order 5 runs of A and 5 runs of B
            treatments = np.concatenate([
                np.random.permutation(['System A'] * 5 + ['System B'] * 5)
                for _ in range(2)
            ])
            design_score = "EXCELLENT (Randomized within Blocks)"
            
        # SIMULATE RESULTS (Stopping Distance in meters)
        # Truth: System A is slightly better (shorter distance) than B.
        # Truth: Wet roads add massive distance.
        results = []
        for i in range(20):
            base_dist = 40 # avg stopping distance
            
            # Treatment Effect
            if treatments[i] == 'System A':
                dist = base_dist - 5
            else:
                dist = base_dist
                
            # Blocking Factor Effect
            if conditions[i] == 'Wet': dist += 20 
            
            # Random noise
            dist += np.random.normal(0, 2)
            
            results.append({
                'Run_ID': i+1,
                'Road_Condition': conditions[i],
                'Brake_System': treatments[i],
                'Stop_Distance_m': dist
            })
            
        df_exp = pd.DataFrame(results)
        
        # VISUALIZE
        plt.figure(figsize=(10,6))
        sns.boxplot(data=df_exp, x='Brake_System', y='Stop_Distance_m', hue='Road_Condition')
        plt.title(f"Experiment Results (Design: {design_score})")
        plt.ylabel("Stopping Distance (meters) [Lower is Better]")
        plt.show()
        
        # ANALYSIS TEXT
        if not is_randomized:
            print("\n⚠ CONFOUNDING ALERT: System A looks amazing, but it was only tested on DRY roads!")
            print("We cannot separate the effect of the Brakes from the effect of the Water.")
        else:
            print("\n✅ SUCCESS: Randomization ensured both systems were tested on Wet and Dry roads.")
            print("The Boxplot shows System A is slightly better in BOTH conditions.")

btn = widgets.Button(description="Run Experiment")
btn.on_click(run_brake_experiment)

display(widgets.HBox([randomize_check, btn]))
display(output_exp)
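
The boxplot shows the confounding visually; the same point can be made with a single number. This sketch reuses the simulation's rules (System A stops 5 m shorter, wet roads add 20 m, noise has a standard deviation of 2 m) and compares the estimated brake effect under the two designs. The function names are assumptions made for illustration.

```python
import random
from statistics import fmean

rng = random.Random(3)  # fixed seed so the sketch is repeatable

def stop_distance(system, road):
    """Same data-generating rule as the simulation above."""
    dist = 40 - (5 if system == "System A" else 0) + (20 if road == "Wet" else 0)
    return dist + rng.gauss(0, 2)

roads = ["Dry"] * 10 + ["Wet"] * 10

# Confounded design: A on every dry run, B on every wet run.
confounded = [("System A" if road == "Dry" else "System B", road) for road in roads]

# Randomized block design: 5 A and 5 B, shuffled, inside each road condition.
blocked = []
for road in ("Dry", "Wet"):
    labels = ["System A"] * 5 + ["System B"] * 5
    rng.shuffle(labels)
    blocked += [(label, road) for label in labels]

def estimated_effect(design):
    """Mean stopping distance of B minus A (positive means A stops shorter)."""
    runs = [(system, stop_distance(system, road)) for system, road in design]
    mean_a = fmean(d for s, d in runs if s == "System A")
    mean_b = fmean(d for s, d in runs if s == "System B")
    return mean_b - mean_a

effect_confounded = estimated_effect(confounded)
effect_blocked = estimated_effect(blocked)

print(f"Estimated advantage of A, confounded design: {effect_confounded:.1f} m")
print(f"Estimated advantage of A, blocked design:    {effect_blocked:.1f} m")
```

The true advantage built into the rules is 5 m. The confounded design should report something close to 25 m because it adds the 20 m wet-road penalty to System B, while the blocked design should land near the true value.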

## Discussion Questions

1.  **Ethics**: Why is it unethical to run a "Designed Experiment" on the effects of Smoking? (Hint: Can we force 500 people to smoke?)
2.  **Blinding**: In our Brake Test, if the professional driver *knew* he was testing the "New Experimental Brakes", how might that change the results? Would he subconsciously drive more safely?
3.  **Lurking Variables**: A study finds that students who have tutors get *lower* grades than students who don't. Does tutoring make you dumber? What is the lurking variable?