# Randomized Block and Latin Square Design Example

## RBD Scenario:

In recent years, artificial intelligence (AI) tools have become increasingly integrated into software development workflows. Engineers now have access to a variety of AI assistants that can help with coding tasks, ranging from general-purpose language models to specialized agents tailored to specific programming domains.

In this study, we aim to investigate how different types of AI assistance affect the speed with which software engineers complete a standardized coding task (measured in minutes). Specifically, we are interested in comparing three conditions:

1. No AI assistance
2. Generic GenAI assistance (e.g., a general-purpose model like ChatGPT)
3. Task-specific AI agent (e.g., an assistant trained or fine-tuned for the coding task at hand)

However, individual differences among engineers, particularly their level of experience, may influence task completion time. To account for this variability, we will block on experience level (e.g., Junior, Mid-level, Senior), recruiting 15 engineers at each level ($N=45$). Within each block, engineers will be randomly assigned to one of the three AI assistance conditions (5 engineers per condition within each experience level). The data are contained in the `AI Coding Speed Experiment.xlsx` file. With these data, your tasks are:

**Question 1.** Briefly define the objective of this experiment.

**Question 2.** Specify the outcome variable.

**Question 3.** Specify the independent variable and blocking factor. What are some possible lurking variables?

**Question 4.** Briefly explain why a randomized block design would be appropriate here. Similarly, explain why a completely randomized design would not be appropriate.

**Question 5.** State the null and alternative hypotheses for this experiment.

**Question 6.** Perform appropriate exploratory analysis, including summary statistics **and** data visualizations. Do the results of these analyses support the null or alternative hypothesis more strongly?

**Question 7.** Build a two-way ANOVA model. Test the assumption of normality using **both** a visual method and a testing method. Do the results of the normality test(s) support the assumption of normality?

**Question 8.** Test the assumption of homogeneity of variance using **both** a visual method and a testing method. Do the results of the test(s) support the assumption of homogeneity of variance?

**Question 9.** Report the F-statistic and its associated p-value for the treatment effect. Which of our two hypotheses is more strongly supported? Why?

**Question 10.** If the data more strongly support the alternative hypothesis, perform Tukey's HSD post-hoc test to determine which levels of the treatment effect are significantly different from each other. If the data more strongly support the null hypothesis, explain why a post-hoc test would not be appropriate.

**Question 11.** Write a brief conclusion summarizing the results of your analyses, including potential limitations of this experiment.
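
The within-block randomization described in the RBD scenario can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to produce `AI Coding Speed Experiment.xlsx`: the engineer IDs, the function name `assign_within_blocks`, and the seed are all hypothetical.

```python
import random
from collections import Counter


def assign_within_blocks(engineers_by_level, conditions, seed=None):
    """Randomly assign engineers to conditions separately within each block.

    engineers_by_level: dict mapping a block label to a list of engineer IDs.
    conditions: list of treatment labels. Each block size must be a multiple
    of len(conditions) so every condition gets the same number of engineers.
    """
    rng = random.Random(seed)
    assignment = {}
    for level, engineers in engineers_by_level.items():
        if len(engineers) % len(conditions) != 0:
            raise ValueError(f"Block {level!r} cannot be split evenly")
        per_condition = len(engineers) // len(conditions)
        # Build a balanced list of labels, then shuffle it within this block only.
        labels = [c for c in conditions for _ in range(per_condition)]
        rng.shuffle(labels)
        for engineer, condition in zip(engineers, labels):
            assignment[engineer] = (level, condition)
    return assignment


# Hypothetical engineer IDs: 15 per experience level, as in the scenario.
levels = ["Junior", "Mid-level", "Senior"]
engineers_by_level = {
    level: [f"{level}-{i:02d}" for i in range(1, 16)] for level in levels
}
conditions = ["No AI", "Generic GenAI", "Task-specific agent"]

assignment = assign_within_blocks(engineers_by_level, conditions, seed=2024)

# Count engineers in each (block, condition) cell to confirm the design is balanced.
cell_counts = Counter(assignment.values())
for level in levels:
    print(level, [cell_counts[(level, c)] for c in conditions])
```

Because the shuffle happens separately inside each experience level, every level contributes the same number of engineers to every condition, which is what lets the analysis separate the treatment effect from the block effect.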

## LSD Scenario:

A software development team is evaluating how different code review formats affect the quality of feedback provided to engineers (measured on a 1-10 scale, with higher scores indicating higher quality).

The team suspects that **both** the reviewer and the day of the week on which the review is conducted may influence the feedback quality, independent of the review format itself.

The three review formats under consideration are:

1. Synchronous verbal review (e.g., live meeting)
2. Asynchronous written comments (e.g., via pull request tools in GitHub)
3. Hybrid format (initial written comments followed by a brief meeting)

Each reviewer will evaluate one code submission per day, using a different format each day. Over the course of three days, each reviewer will use each format exactly once, and on each day, each format will be used by exactly one reviewer. Below is a table demonstrating how the study will be conducted:


### Latin Square Design: Code Review Formats

| Reviewer \ Day | Day 1              | Day 2              | Day 3              |
|----------------|--------------------|--------------------|--------------------|
| **R1**         | Verbal             | Written            | Hybrid             |
| **R2**         | Written            | Hybrid             | Verbal             |
| **R3**         | Hybrid             | Verbal             | Written            |
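
The defining property of the layout in the table above (each format once per reviewer and once per day) can be checked programmatically. The following is a minimal Python sketch; the helper names `is_latin_square` and `randomize_square` are hypothetical, and the row/column shuffle is one common way to randomize a Latin square, not necessarily the one used for this study.

```python
import random

FORMATS = ["Verbal", "Written", "Hybrid"]

# Layout from the table above: rows are reviewers R1-R3, columns are Days 1-3.
square = [
    ["Verbal", "Written", "Hybrid"],
    ["Written", "Hybrid", "Verbal"],
    ["Hybrid", "Verbal", "Written"],
]


def is_latin_square(grid, treatments):
    """Return True if every treatment appears exactly once per row and column."""
    expected = sorted(treatments)
    n = len(expected)
    if len(grid) != n or any(len(row) != n for row in grid):
        return False
    rows_ok = all(sorted(row) == expected for row in grid)
    cols_ok = all(sorted(col) == expected for col in zip(*grid))
    return rows_ok and cols_ok


def randomize_square(grid, seed=None):
    """Shuffle the row order and the column order of a Latin square.

    Permuting whole rows or whole columns keeps the Latin property intact.
    """
    rng = random.Random(seed)
    rows = [list(row) for row in grid]
    rng.shuffle(rows)
    col_order = list(range(len(rows[0])))
    rng.shuffle(col_order)
    return [[row[j] for j in col_order] for row in rows]


shuffled = randomize_square(square, seed=7)
print(is_latin_square(square, FORMATS), is_latin_square(shuffled, FORMATS))
```

A grid that repeats the same row three times fails the check, because each format would then appear three times in a single day's column.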


The response variable is a feedback quality score, rated by an independent panel based on clarity, usefulness, and depth of the comments. The data from this experiment are contained in the `Review Feedback Example.xlsx` file. With these data, your tasks are:

**Question 1.** Briefly define the objective of this experiment.

**Question 2.** Specify the outcome variable.

**Question 3.** Specify the independent variable and blocking factors. What are some other possible lurking variables?

**Question 4.** Briefly explain why a Latin Square Design would be appropriate here. Similarly, explain why a completely randomized design or randomized block design would not be appropriate.

**Question 5.** State the null and alternative hypotheses for this experiment.

**Question 6.** Perform appropriate exploratory analysis, including summary statistics **and** data visualizations. Do the results of these analyses support the null or alternative hypothesis more strongly?

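**Question 7.** Build a three-way ANOVA model (review format, reviewer, and day as main effects; a single Latin square leaves no degrees of freedom for interaction terms). Test the assumption of normality using **both** a visual method and a testing method. Do the results of the normality test(s) support the assumption of normality?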

**Question 8.** Test the assumption of homogeneity of variance using **both** a visual method and a testing method. Do the results of the test(s) support the assumption of homogeneity of variance?

**Question 9.** Report the F-statistic and its associated p-value for the treatment effect. Which of our two hypotheses is more strongly supported? Why?

**Question 10.** If the data more strongly support the alternative hypothesis, perform Tukey's HSD post-hoc test to determine which levels of the treatment effect are significantly different from each other. If the data more strongly support the null hypothesis, explain why a post-hoc test would not be appropriate.

**Question 11.** Write a brief conclusion summarizing the results of your analyses, including potential limitations of this experiment.
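
To make the structure of the Latin square ANOVA concrete, the sum-of-squares decomposition behind the treatment F-statistic can be sketched in plain Python. This is an illustration only: the function name `latin_square_anova` is hypothetical, and the scores below are made-up placeholder values, not the contents of `Review Feedback Example.xlsx`. In practice a statistics package would fit the model and report the p-value; here only the F-statistic is computed.

```python
from collections import defaultdict


def latin_square_anova(observations):
    """Additive (main-effects-only) ANOVA for a single Latin square.

    observations: list of (row, column, treatment, response) tuples,
    one per cell of a t x t square.
    Returns a dict of source -> (sum of squares, df), plus "F_treatment".
    """
    n = len(observations)
    grand_mean = sum(obs[3] for obs in observations) / n

    def factor_ss(index):
        # Sum of squares for one factor: group size times squared deviation
        # of each group mean from the grand mean.
        groups = defaultdict(list)
        for obs in observations:
            groups[obs[index]].append(obs[3])
        ss = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                 for v in groups.values())
        return ss, len(groups) - 1

    ss_total = sum((obs[3] - grand_mean) ** 2 for obs in observations)
    ss_row, df_row = factor_ss(0)
    ss_col, df_col = factor_ss(1)
    ss_trt, df_trt = factor_ss(2)
    # Whatever variation the three factors do not explain is the error term.
    ss_err = ss_total - ss_row - ss_col - ss_trt
    df_err = (n - 1) - df_row - df_col - df_trt
    f_trt = (ss_trt / df_trt) / (ss_err / df_err)
    return {
        "row": (ss_row, df_row),
        "column": (ss_col, df_col),
        "treatment": (ss_trt, df_trt),
        "error": (ss_err, df_err),
        "F_treatment": f_trt,
    }


# PLACEHOLDER scores (made up for illustration), arranged in the scenario's layout.
placeholder = [
    ("R1", "Day 1", "Verbal", 6), ("R1", "Day 2", "Written", 7), ("R1", "Day 3", "Hybrid", 9),
    ("R2", "Day 1", "Written", 6), ("R2", "Day 2", "Hybrid", 8), ("R2", "Day 3", "Verbal", 6),
    ("R3", "Day 1", "Hybrid", 8), ("R3", "Day 2", "Verbal", 5), ("R3", "Day 3", "Written", 8),
]

result = latin_square_anova(placeholder)
print(result["treatment"], result["error"], result["F_treatment"])
```

With a 3 x 3 square there are only 2 error degrees of freedom (8 total minus 2 each for reviewer, day, and format), so the F-test has little power and the assumption checks rest on very few residuals. This is worth keeping in mind when discussing limitations.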