# Chapter 10: Observational Studies vs. Designed Experiments

## Lesson Title: "Crash Course: Types of Studies in Engineering"

## Learning Goals
1.  **Observational Studies**: Differentiate between **Retrospective** (looking back) and **Prospective** (looking forward) studies.
2.  **Lurking Variables**: Identify hidden variables that cause misleading correlations (e.g., Ice Cream vs. Drowning).
3.  **Designed Experiments**: Apply the **Four Principles**: Control, Randomize, Replicate, and Block.
4.  **Vocabulary**: Define Factors, Levels, Treatments, Blinding, Placebos, and Statistical Significance.

## Part 1: Observational Studies
An Observational Study is valuable for discovering trends but **cannot prove cause-and-effect**. Researchers simply observe choices; they do not assign them.

![Observational Studies Diagram](https://raw.githubusercontent.com/rkn2/hdr-dsc-k12/main/modules/chapter_10_experiment_design/images/observational_studies.png)

### Retrospective vs. Prospective
*   **Retrospective**: Subjects are selected and their **previous conditions** or behaviors are determined. (e.g., Analyzing 10 years of past bridge maintenance logs).
*   **Prospective**: Subjects are followed to observe **future outcomes**. (e.g., Identifying 100 new bridges and tracking their cracks for the next 10 years).
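
One way to keep the two labels straight is to compare the data window with the moment subjects are selected. The sketch below is only an illustration: the function name `classify_study`, its parameters, and all dates are hypothetical and chosen to mirror the two bridge examples above.

```python
from datetime import date

def classify_study(enrollment: date, data_start: date, data_end: date) -> str:
    """Label a study by where its data window sits relative to subject selection."""
    if data_end <= enrollment:
        return "Retrospective"  # all data already existed when subjects were selected
    if data_start >= enrollment:
        return "Prospective"    # all data is collected after subjects are selected
    return "Mixed"              # the data window straddles the selection date

# Hypothetical dates: 10 years of past maintenance logs vs. 10 years of future tracking
bridge_logs = classify_study(date(2024, 1, 1), date(2014, 1, 1), date(2023, 12, 31))
new_bridges = classify_study(date(2024, 1, 1), date(2024, 1, 1), date(2034, 1, 1))
print(bridge_logs, new_bridges)
```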

### Exercise 1: Classify the Study
For each of the following scenarios, decide if it is **Retrospective** or **Prospective**.

1.  **Screen Time**: A university recruits 200 students and asks them to log their daily screen time and sleep duration for the next 6 months.
2.  **Study Habits**: A researcher investigates how study habits impact test scores. They track a group of students over the course of a school year, recording their study habits and test scores at regular intervals.
3.  **Smoking**: A group of researchers wanted to study the effects of smoking on lung health. They gathered medical records of 500 people who had already been diagnosed with lung cancer and examined their smoking histories to determine if there was a pattern.
4.  **Pesticides**: A team of scientists studies the potential connection between pesticide exposure and Parkinson’s disease. They identify a group of people who have been diagnosed with Parkinson’s and review their work histories to see if they were exposed to pesticides in the past.
5.  **Sugary Beverages**: Researchers are interested in the long-term health effects of drinking sugary beverages. They recruit 1,000 participants who currently do not have diabetes and track their beverage consumption and health outcomes over the next 10 years.
6.  **Neighborhood Cohesion**: A sociologist examines the effect of community events on neighborhood cohesion. They review past community event attendance records and survey long-time residents about their sense of neighborhood belonging.
7.  **Tree Planting**: An environmental scientist studies the impact of tree planting on reducing local temperatures. They identify neighborhoods planning large tree-planting initiatives and measure the area's temperature changes over the next decade.

<details>
<summary><strong style="color:blue; cursor:pointer;">CLICK TO REVEAL ANSWERS</strong></summary>

*   **1. Prospective** (Tracking for next 6 months)
*   **2. Prospective** (Tracking over school year)
*   **3. Retrospective** (Examining past records)
*   **4. Retrospective** (Reviewing work histories)
*   **5. Prospective** (Tracking over next 10 years)
*   **6. Retrospective** (Reviewing past records)
*   **7. Prospective** (Measuring changes over next decade)

</details>

### Lurking Variables
A **Lurking Variable** is a hidden variable that drives two other variables at once, making them look related even though neither one causes the other.
*   It is usually thought of as a **prior cause** of both X and Y.

**Classic Example**: "Months with higher **Ice Cream Sales** also have more **Drownings**."
*   Does ice cream cause drowning?
*   **NO**. The Lurking Variable is **Summer Heat**. Hot weather makes people eat ice cream AND swim more.
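
The ice cream example can be sketched numerically. Everything below is made-up data under simple assumptions: temperature drives both ice cream sales and drownings, and the two never influence each other. The helper `residuals` (a hypothetical name) removes the part of each variable that temperature explains, so we can check what correlation is left over.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 1000

# Lurking variable: daily temperature (assumed uniform between 10 and 35 degrees C)
temperature = rng.uniform(10, 35, size=n_days)

# Both outcomes depend on temperature, but NOT on each other
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=n_days)
drownings = 0.2 * temperature + rng.normal(0, 1, size=n_days)

def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]
adjusted_corr = np.corrcoef(residuals(ice_cream_sales, temperature),
                            residuals(drownings, temperature))[0, 1]

print(f"Correlation ignoring temperature:      {raw_corr:.2f}")
print(f"Correlation after removing temperature: {adjusted_corr:.2f}")
```

The first number should be strongly positive and the second close to zero: once the lurking variable is accounted for, the apparent link disappears.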

Let's look at a car-accident example below.

In [None]:
# ------------------------------------------------------------------------
# DEMO: LURKING VARIABLE IN CAR ACCIDENTS (INTERACTIVE)
# ------------------------------------------------------------------------
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

def explore_lurking_variables(aggressive_pct=0.3, red_preference=0.8):
    """
    Play with these sliders:
    - aggressive_pct: What % of drivers are aggressive? (Try 0.1 vs 0.9)
    - red_preference: How much do aggressive drivers prefer red cars? (Try 0.5 vs 1.0)
    """
    np.random.seed(42)
    n = 500
    
    # Hidden Variable: Driver Aggression
    aggression = np.random.choice([0, 1], size=n, p=[1-aggressive_pct, aggressive_pct])
    
    data = []
    for i in range(n):
        is_aggressive = aggression[i]
        
        # CAUSAL CHAIN 1: Aggression -> Car Color Preference
        if is_aggressive:
            car_color = np.random.choice(['Red', 'Black'], p=[red_preference, 1-red_preference])
            speed = np.random.normal(90, 10) # Fast
        else:
            car_color = np.random.choice(['White', 'Silver', 'Blue', 'Red'], p=[0.3, 0.3, 0.3, 0.1])
            speed = np.random.normal(60, 5) # Normal
            
        # CAUSAL CHAIN 2: Aggression -> Speed -> Accident Risk
        accident_risk = (speed - 50) / 100 
        has_accident = 1 if np.random.random() < accident_risk else 0
        
        data.append({'Color': car_color, 'Accident': has_accident})
        
    df_cars = pd.DataFrame(data)
    
    plt.figure(figsize=(8, 4))
    sns.barplot(data=df_cars, x='Color', y='Accident', errorbar=None, palette='viridis')
    plt.title(f"Accident Rate by Color (Aggressive Drivers: {aggressive_pct*100:.0f}%)")
    plt.ylabel("Probability of Accident")
    plt.ylim(0, 1.0)
    plt.show()
    
    print("Move the sliders above to see how the connection between RED cars and ACCIDENTS changes!")

# Create interactive widget
interact(explore_lurking_variables,
         aggressive_pct=FloatSlider(min=0.1, max=0.9, step=0.1, value=0.3, description='Aggressive %'),
         red_preference=FloatSlider(min=0.5, max=1.0, step=0.1, value=0.8, description='Red Love %'))

In [None]:
# ------------------------------------------------------------------------
# STUDENT ACTIVITY: Survey Your Class
# Ask 5-10 classmates: 'Hours of sleep' and 'Hours of screen time'
# ------------------------------------------------------------------------

# ENTER YOUR DATA HERE:
# Example: sleep_hours = [7, 6, 8, 5]
sleep_hours = [  ]  # <--- Fill this in
screen_hours = [  ]  # <--- Fill this in

# Once you fill in the data above, select this cell and press Shift+Enter to run
if len(sleep_hours) > 0:
    plt.figure(figsize=(6,4))
    plt.scatter(screen_hours, sleep_hours, color='purple')
    plt.xlabel("Screen Time (hours)")
    plt.ylabel("Sleep (hours)")
    plt.title("My Class Data: Screen Time vs Sleep")
    plt.grid(True)
    plt.show()
else:
    print("Waiting for data... Please enter numbers in the brackets above!")

### Exercise 2: Identify the Lurking Variable
Consider these studies described in the notes:

1.  **Tutoring**: Students who attend more after-school tutoring sessions tend to have *lower* grades. (Does tutoring make you dumber?)
2.  **Libraries**: Communities with more libraries have *higher* literacy rates. (Does constructing a library magically teach people to read?)
3.  **Coffee**: Cities with more coffee shops have *higher* crime rates. (Does caffeine cause crime?)
4.  **Convertibles**: Communities where people own a lot of convertible cars receive *less* rainfall. (Does owning a convertible stop the rain?)

<details>
<summary><strong style="color:blue; cursor:pointer;">CLICK TO REVEAL ANSWERS</strong></summary>

*   **1. Tutoring**: **Student Motivation/Struggle**. (Struggling students are the ones who seek tutoring; their grades are low BEFORE they start).
*   **2. Libraries**: **Socioeconomic Status (Wealth)**. (Wealthier towns have money for both libraries and better schools).
*   **3. Coffee**: **Population Size**. (Big cities have more of everything: More coffee, more people, more crime).
*   **4. Convertibles**: **Geography/Climate**. (People only buy convertibles in places that are already sunny/dry).

</details>

In [None]:
# ------------------------------------------------------------------------
# STUDENT ACTIVITY: Match the Lurking Variable
# ------------------------------------------------------------------------
import ipywidgets as widgets
from IPython.display import display

scenarios = {
    "Tutoring -> Lower Grades": ["Student Struggle/Motivation", "Wealth", "Population"],
    "Libraries -> Higher Literacy": ["Wealth/Socioeconomic", "Student Struggle", "Population"],
    "Coffee Shops -> Crime": ["Population Size", "Wealth", "Weather"]
}

print("Select the correct Lurking Variable for each scenario:")
for scenario, options in scenarios.items():
    dd = widgets.Dropdown(options=['Select...'] + options, description='Lurking Var:')
    print(f"\nScenario: {scenario}")
    display(dd)

## Part 2: Designed Experiments

To prove cause-and-effect, we must **Design an Experiment**. The experimenter deliberately manipulates factors to create treatments.

### Vocabulary
*   **Experimental Units**: The things we test on (e.g., The Cars). When humans are involved, they are **Subjects**.
*   **Factors**: The variables we manipulate (e.g., Brake Type).
*   **Levels**: The specific values of the factor (e.g., Brake System A vs Brake System B).
*   **Treatments**: A combination of specific levels from all factors.
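
To see how factors, levels, and treatments fit together, here is a minimal sketch that lists every treatment for a two-factor design. "Brake Type" comes from the vocabulary above; "Tire Type" and its levels are a hypothetical second factor added only for illustration.

```python
from itertools import product

# Each factor maps to its list of levels
factors = {
    "Brake Type": ["System A", "System B"],
    "Tire Type": ["Standard", "Performance"],  # hypothetical second factor
}

# A treatment is one level chosen from every factor
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for number, treatment in enumerate(treatments, start=1):
    print(f"Treatment {number}: {treatment}")
```

With 2 levels for each of 2 factors, the design has 2 x 2 = 4 treatments.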

### The Four Principles
![The Four Principles](https://raw.githubusercontent.com/rkn2/hdr-dsc-k12/main/modules/chapter_10_experiment_design/images/four_principles.png)

1.  **Control**: We control sources of variation (other than the factors) to make conditions similar for all groups.
2.  **Randomize**: We assign treatments randomly to equalize unknown sources of variation. It doesn't eliminate effects, but **spreads them out**.
3.  **Replicate**: Repeat the experiment on many subjects. **"An anecdote (one subject) is not data."**
4.  **Block**: Group similar individuals together (e.g., by Road Condition) and randomize within the blocks. This removes variability coming from the blocking variable.
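
Blocking and randomizing can be combined in a few lines. The sketch below assumes 12 hypothetical test runs, blocked by road condition, and shuffles a balanced set of brake-system labels inside each block. The run counts and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical experimental units: 12 test runs, 6 per road condition
road_condition = np.array(["Dry"] * 6 + ["Wet"] * 6)
assignment = np.empty(road_condition.size, dtype=object)

# Randomize WITHIN each block so every block gets both systems equally often
for block in np.unique(road_condition):
    idx = np.flatnonzero(road_condition == block)
    labels = np.array(["System A", "System B"] * (idx.size // 2), dtype=object)
    assignment[idx] = rng.permutation(labels)

for block in ["Dry", "Wet"]:
    in_block = assignment[road_condition == block]
    counts = {s: int((in_block == s).sum()) for s in ["System A", "System B"]}
    print(block, counts)
```

Because each system appears three times in each block, road condition cannot favor one system over the other.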

### Diagramming the Experiment
Below is the standard diagram for an Experiment:

![Experiment Diagram](https://raw.githubusercontent.com/rkn2/hdr-dsc-k12/main/modules/chapter_10_experiment_design/images/experiment_diagram.png)

### Tree Diagrams
We can use Tree Diagrams to plan our **Blocking** and **Multi-Factor** designs:

#### 1. Blocking Diagram
![Blocking Tree Diagram](https://raw.githubusercontent.com/rkn2/hdr-dsc-k12/main/modules/chapter_10_experiment_design/images/tree_blocking.png)

#### 2. Multi-Factor Diagram
![Multi Factor Tree Diagram](https://raw.githubusercontent.com/rkn2/hdr-dsc-k12/main/modules/chapter_10_experiment_design/images/tree_multi_factor.png)

### Blinding and Placebos
*   **Blinding**: Hiding the treatment assignment to prevent bias from those who could *influence* the results (drivers) or *evaluate* the results (judges).
*   **Single-Blind**: Everyone in one class (e.g., drivers) is blinded.
*   **Double-Blind**: Everyone in BOTH classes (drivers and judges) is blinded.
*   **Placebo**: A "fake" treatment (like a sugar pill) so subjects don't know if they are getting the real thing. In engineering, this is often the "Business as Usual" or Control standard.
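
In practice, blinding often comes down to relabeling. The sketch below is one possible approach, not a standard procedure: treatments are hidden behind neutral codes, and the key is kept by someone who neither drives nor judges. The codes "X" and "Y" and the variable names are hypothetical.

```python
import random

rng = random.Random(11)  # fixed seed so the example is repeatable

treatments = ["System A", "System B"]
codes = ["X", "Y"]
rng.shuffle(codes)

# The key stays sealed until all results are recorded
sealed_key = dict(zip(codes, treatments))

# Drivers and judges only ever see the codes
visible_labels = sorted(sealed_key)
print("What drivers and judges see:", visible_labels)
```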

--- 

### Statistically Significant
How large do the differences need to be to say there is a real difference?
*   **Definition**: Differences that are larger than we’d get just from **randomization alone** are called **Statistically Significant**.
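
One way to make "larger than randomization alone" concrete is a shuffle (permutation) test. The sketch below uses made-up stopping distances. It shuffles the group labels many times and counts how often a random split produces a difference at least as large as the one observed. The data, the seed, and the number of shuffles are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stopping distances in meters (invented for this example)
system_a = np.array([35.1, 34.2, 36.0, 33.8, 35.5, 34.9, 36.3, 34.0])
system_b = np.array([40.2, 39.1, 41.0, 38.7, 40.5, 39.9, 41.3, 39.4])

observed_diff = system_b.mean() - system_a.mean()

# If the brake system made no difference, the labels would be interchangeable
pooled = np.concatenate([system_a, system_b])
n_a = system_a.size
n_shuffles = 5000
shuffled_diffs = np.empty(n_shuffles)
for i in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    shuffled_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# Share of random splits with a difference at least as extreme as the real one
p_value = np.mean(np.abs(shuffled_diffs) >= abs(observed_diff))

print(f"Observed difference: {observed_diff:.2f} m")
print(f"Share of shuffles at least this extreme: {p_value:.4f}")
```

A very small share means the observed difference is hard to explain by random assignment alone, which is what "statistically significant" is meant to capture.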

### 🤔 Before We Run the Simulation...

**PREDICT**: If we DON'T randomize (test System A only on dry roads, System B only on wet roads):
1.  Which system will APPEAR to perform better? _____________
2.  Why is this misleading? _____________
3.  What's the **lurking variable**? _____________

*Now click the checkbox below and see if you were right!*

In [None]:
# ------------------------------------------------------------------------
# SIMULATION: THE BRAKING EXPERIMENT (Confounded vs. Randomized Design)
# ------------------------------------------------------------------------
import ipywidgets as widgets
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def run_simulation(use_randomization):
    # 1. SETUP THE ENVIRONMENT
    # We have 20 test runs available. 
    # Nature (or the track manager) forces the first 10 to be Dry, last 10 to be Wet.
    conditions = ['Dry'] * 10 + ['Wet'] * 10
    
    # 2. ASSIGN TREATMENTS (Brake System A vs B)
    if not use_randomization:
        # BAD DESIGN (Confounded): 
        # The engineer tests System A first (while it's Dry), then System B later (when it's Wet).
        treatments = ['System A'] * 10 + ['System B'] * 10
        design_label = "POOR DESIGN (Confounded)"
    else:
        # GOOD DESIGN (Randomized):
        # We flip a coin for every single run to decide which brake system to use.
        # This breaks the link between 'Road Condition' and 'Brake System'.
        treatments = np.random.choice(['System A', 'System B'], size=20, p=[0.5, 0.5])
        design_label = "EXCELLENT DESIGN (Randomized)"

    # 3. SIMULATE RESULTS (The Truth)
    # Truth: System A is inherently better (stops 5m shorter).
    # Truth: Wet roads are inherently worse (add 20m to stop distance).
    results = []
    base_dist = 40 
    
    for i in range(20):
        # Calculate Stop Distance based on Physics
        dist = base_dist
        
        # Effect of Brakes
        if treatments[i] == 'System A': dist -= 5
        
        # Effect of Road
        if conditions[i] == 'Wet': dist += 20
        
        # Random Noise (Driver reaction time variability)
        dist += np.random.normal(0, 2)
        
        results.append({
            'Run': i+1,
            'Road_Condition': conditions[i],
            'Brake_System': treatments[i],
            'Stop_Distance_m': dist
        })
    
    df = pd.DataFrame(results)
    
    # 4. VISUALIZE
    plt.figure(figsize=(10, 6))
    
    # specific order to keep colors consistent
    sns.boxplot(data=df, x='Brake_System', y='Stop_Distance_m', hue='Road_Condition', 
                hue_order=['Dry', 'Wet'], order=['System A', 'System B'], palette='Set2')
    
    plt.title(f"Experimental Results: {design_label}")
    plt.ylabel("Stopping Distance (meters)")
    plt.grid(True, alpha=0.3)
    plt.show()
    
    # 5. ANALYSIS FEEDBACK
    if not use_randomization:
        print("⚠ ANALYSIS: Look at the graph. System A looks INCREDIBLE (very low stopping distance)." )
        print("   But wait! System A was only tested on DRY roads.")
        print("   We don't know if the car stopped fast because of the Brakes or the Dry Road.")
        print("   This is a CONFOUNDED experiment.")
    else:
        print("✅ ANALYSIS: Now look. System A and B were both tested on Dry and Wet roads.")
        print("   We can fairly compare them. System A is indeed consistently better than B,")
        print("   regardless of the road condition.")

# Create the interactive widget
# This will run the simulation IMMEDIATELY with the default value (False)
style = {'description_width': 'initial'}
widgets.interact(
    run_simulation, 
    use_randomization=widgets.Checkbox(
        value=False, 
        description='Enable Randomization (Correct Design)', 
        style=style
    )
);

### 🔬 What Did We Just Learn?

**The Confounded Design (No Randomization)**:
*   System A: Always tested on DRY roads → stopped fast
*   System B: Always tested on WET roads → stopped slow
*   **Conclusion**: System A looks way better... but is it the brakes or the road?

**The Randomized Design**:
*   Both systems tested on BOTH road conditions
*   **Conclusion**: Now we can fairly compare them! System A really IS better.

**Key Insight**: Randomization doesn't eliminate confounders (wet roads still exist). It makes sure that, on average, the confounder affects BOTH groups about equally, so it can no longer masquerade as a treatment effect.
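
The "on average" idea can be checked with a quick sketch. It assumes the same setup as the simulation (10 dry runs, 10 wet runs) but, as a simplification, shuffles a balanced set of 10 "A" and 10 "B" labels instead of flipping a coin per run. It then records what share of each system's runs landed on wet roads.

```python
import numpy as np

rng = np.random.default_rng(3)

is_wet = np.array([False] * 10 + [True] * 10)  # 10 dry runs, then 10 wet runs
labels = np.array(["A"] * 10 + ["B"] * 10)     # balanced labels to shuffle

n_repeats = 2000
wet_share_a = np.empty(n_repeats)
wet_share_b = np.empty(n_repeats)
for i in range(n_repeats):
    shuffled = rng.permutation(labels)
    wet_share_a[i] = is_wet[shuffled == "A"].mean()
    wet_share_b[i] = is_wet[shuffled == "B"].mean()

print(f"Average share of wet runs for System A: {wet_share_a.mean():.3f}")
print(f"Average share of wet runs for System B: {wet_share_b.mean():.3f}")
print(f"Range for System A in a single experiment: {wet_share_a.min():.1f} to {wet_share_a.max():.1f}")
```

Both averages should sit near 0.5, even though any single experiment can be somewhat unbalanced. That leftover imbalance is one reason blocking is still useful.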

In [None]:
# DEMO: Why Randomization Spreads Out Confounders
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Bad Design (Confounded)
ax1.set_title("❌ Without Randomization")
ax1.text(0.1, 0.7, "System A\n(All DRY)", fontsize=12, bbox=dict(boxstyle='round', facecolor='lightblue'))
ax1.text(0.6, 0.7, "System B\n(All WET)", fontsize=12, bbox=dict(boxstyle='round', facecolor='lightcoral'))
ax1.text(0.35, 0.3, "Confounder (Road) is STUCK\nwith one group!", ha='center', color='red')
ax1.axis('off')

# Good Design (Randomized)
ax2.set_title("✅ With Randomization")
ax2.text(0.1, 0.7, "System A\n(DRY + WET)", fontsize=12, bbox=dict(boxstyle='round', facecolor='lightgreen'))
ax2.text(0.6, 0.7, "System B\n(DRY + WET)", fontsize=12, bbox=dict(boxstyle='round', facecolor='lightgreen'))
ax2.text(0.35, 0.3, "Confounder affects\nBOTH groups equally!", ha='center', color='green')
ax2.axis('off')

plt.tight_layout()
plt.show()

### Exercise 3: Identify the Confounding Factor
Read these poor experiment designs and identify what went wrong (the Confounder).

1.  **Fertilizer**: A farmer tests a new fertilizer on the half of his field that is *closest to the river* (water source). The other half gets no fertilizer.
2.  **Teaching**: A new teaching method is tested on the *Honors Class*, while the Standard Class keeps the old method.
3.  **Gym**: A new fitness program is given to people who *already attend the gym 5 days a week*.
4.  **Diet**: One diet group is asked to attend *weekly support meetings*. The other group just gets a pamphlet.
5.  **WFH**: We compare productivity between employees who *volunteer* to work from home vs those who prefer the office.

<details>
<summary><strong style="color:blue; cursor:pointer;">CLICK TO REVEAL ANSWERS</strong></summary>

*   **1. Fertilizer**: **WATER**. (The fertilizer group also got more water. We don't know which helped).
*   **2. Teaching**: **STUDENT ABILITY**. (Honors students might score higher regardless of the method).
*   **3. Gym**: **MOTIVATION**. (Frequent gym-goers are already more motivated/fit).
*   **4. Diet**: **SOCIAL SUPPORT**. (Was it the diet or the meetings that helped?)
*   **5. WFH**: **PERSONALITY/JOB TYPE**. (Volunteers might be more self-disciplined or have easier jobs).

</details>

## 🎯 Final Challenge: Design Your Own Experiment!

**Scenario**: Your school wants to know if a new study app actually helps students improve their grades.

**Your Task**: Design an experiment using the Four Principles. Fill in:

1.  **Research Question**: _________________________
2.  **Factor(s)**: _________________________
3.  **Levels**: _________________________
4.  **How will you CONTROL?** _________________________
5.  **How will you RANDOMIZE?** _________________________
6.  **How many REPLICATES?** _________________________
7.  **What variables should you BLOCK on?** _________________________
8.  **Possible CONFOUNDERS to watch out for**: _________________________

<details>
<summary><strong>See Example Answer</strong></summary>

1. **Research Question**: Does using StudyBuddy app for 30 min/day improve test scores?
2. **Factor**: App usage (yes/no)
3. **Levels**: StudyBuddy app vs. no app
4. **Control**: Same test, same teacher, same time period
5. **Randomize**: Randomly assign 100 students to app or no-app group
6. **Replicates**: 50 students per group
7. **Block on**: Prior GPA (group high/medium/low performers separately)
8. **Confounders**: Study time (track it!), access to tutoring, motivation

</details>

In [None]:
# ------------------------------------------------------------------------
# BONUS: P-HACKING DEMO (How to Lie with Statistics)
# We will test 20 random 'junk' variables to see if any predict Accidents.
# By random chance, one usually triggers a 'False Positive'!
# ------------------------------------------------------------------------

def run_phacking_demo():
    np.random.seed(None) # Truly random each time
    
    print("Testing 20 random variables against Accident Risk...")
    
    junk_vars = ['Zodiac Sign', 'Shoe Size', 'Fav Color', 'Pet Name', 'Lunch Choice',
                 'Music Taste', 'Hair Color', 'Height', 'Sibling Count', 'Birth Month',
                 'Phone Brand', 'Shirt Color', 'Wakeup Time', 'Commute Method', 'Coffee Order',
                 'Fav Sport', 'Eye Color', 'Street Name', 'Last Name Length', 'Lucky Number']
    
    found_significance = False
    
    for var in junk_vars:
        # Simulate the p-value for a variable with NO real effect:
        # under the null hypothesis, p-values are uniform between 0 and 1
        p_value = np.random.uniform(0, 1)
        
        if p_value < 0.05: # The standard cutoff for 'Significant'
            print(f"\n🚨 FAKE NEWS ALERT: We found a correlation!")
            print(f"   Variable: {var}")
            print(f"   P-Value: {p_value:.4f} (This looks 'Statistically Significant'!)")
            print("   REALITY: This was just random noise.")
            found_significance = True
            
    if not found_significance:
        print("\n(No fake correlations found this time. Run it again!)")

run_phacking_demo()
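
How often should a demo like this raise a false alarm? Assuming the 20 tests are independent and each uses the 0.05 cutoff, the chance of at least one false positive is 1 - 0.95^20. The sketch below computes that value and checks it with a simulation. The number of simulated studies and the seed are arbitrary.

```python
import numpy as np

alpha = 0.05   # cutoff for calling a result 'significant'
n_tests = 20   # number of junk variables tested

# Chance that at least one of the independent tests falls below the cutoff
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Simulation check: 10,000 pretend studies, each testing 20 junk variables
rng = np.random.default_rng(5)
p_values = rng.uniform(0, 1, size=(10_000, n_tests))
simulated = np.mean((p_values < alpha).any(axis=1))

print(f"Formula:    {p_at_least_one:.3f}")
print(f"Simulation: {simulated:.3f}")
```

Both numbers should land near 0.64, so roughly two out of three runs of a 20-variable search will "discover" something that is not there.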

In [None]:
# Quick Check: Can You Spot the Problem?
problems = {
    "A coach gives the new training program to the varsity team, keeps JV on old program":
        "Not randomized - confounded by skill level",

    "A farmer tests fertilizer on 3 plants":
        "Not enough replication - need more plants!",

    "Drug trial: patients CHOOSE which pill to take":
        "Self-selection bias - not randomized"
}

print("Read the scenarios above and discuss WHY they are bad designs.")