# GeneralizIT Tutorial: Analyzing Brennan's (2001) Design 4

This tutorial demonstrates how to use GeneralizIT for Generalizability Theory (G-theory) analyses using the synthetic dataset #4 from Brennan (2001). This design represents a "person x (r:t)" study where:

- persons (p) are crossed with tasks (t)
- raters (r) are nested within tasks (t)
- the notation is "person x (r:t)"

## What is Generalizability Theory?

Generalizability theory (G-theory) is a statistical framework for evaluating the reliability of measurements. It extends classical test theory by examining multiple sources of measurement error simultaneously. G-theory helps researchers understand how different facets of a study design (e.g., raters, items, occasions) contribute to measurement variance.

Unlike classical reliability approaches that provide a single reliability coefficient, G-theory provides:
- Variance component estimates for each facet and their interactions
- Generalizability coefficients (G-coefficients) that estimate reliability
- Decision study (D-study) capabilities to optimize measurement procedures

GeneralizIT performs "Analogous ANOVA" based on Henderson's Method 1, which allows it to be flexible with unbalanced designs and missing data.
More details on the method can be found in the [Generalizability Methods Tutorial](https://github.com/tylerjsmith111/GeneralizIT/blob/main/tutorials/generalizability_methods_tutorial.md).

## Setting up the Environment

Install the GeneralizIT package from PyPI, or download the source code from GitHub.
```bash
pip install GeneralizIT
```

Next import the necessary libraries for this tutorial.

```python
# Import necessary libraries
import pandas as pd

# Import GeneralizIT
from generalizit import GeneralizIT
```

## Creating Brennan's Design 4 Dataset
We'll recreate the synthetic dataset #4 from Brennan (2001) which represents a "person x (r:t)" design:
- 10 persons
- 3 tasks (t1, t2, t3)
- 4 raters nested within each task (r1-r4 for t1, r5-r8 for t2, r9-r12 for t3)
- The dataset is given below in wide format and will be converted to long format for analysis.

| person | t1_r1 | t1_r2 | t1_r3 | t1_r4 | t2_r5 | t2_r6 | t2_r7 | t2_r8 | t3_r9 | t3_r10 | t3_r11 | t3_r12 |
| ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ------ | ------ |
| 1      | 5     | 6     | 5     | 5     | 5     | 3     | 4     | 5     | 6     | 7      | 3      | 3      |
| 2      | 9     | 3     | 7     | 7     | 7     | 5     | 5     | 5     | 7     | 7      | 5      | 2      |
| 3      | 3     | 4     | 3     | 3     | 5     | 3     | 3     | 5     | 6     | 5      | 1      | 6      |
| 4      | 7     | 5     | 5     | 3     | 3     | 1     | 4     | 3     | 5     | 3      | 3      | 5      |
| 5      | 9     | 2     | 9     | 7     | 7     | 7     | 3     | 7     | 2     | 7      | 5      | 3      |
| 6      | 3     | 4     | 3     | 5     | 3     | 3     | 6     | 3     | 4     | 5      | 1      | 2      |
| 7      | 7     | 3     | 7     | 7     | 7     | 5     | 5     | 7     | 5     | 5      | 5      | 4      |
| 8      | 5     | 8     | 5     | 7     | 7     | 5     | 5     | 4     | 3     | 2      | 1      | 1      |
| 9      | 9     | 9     | 8     | 8     | 6     | 6     | 6     | 5     | 5     | 8      | 1      | 1      |
| 10     | 4     | 4     | 4     | 3     | 3     | 5     | 6     | 5     | 5     | 7      | 1      | 1      |


```python
# First create the dataset with the original structure from Brennan's example
data = {
    'person': range(1, 11),
    't1_r1': [5, 9, 3, 7, 9, 3, 7, 5, 9, 4],
    't1_r2': [6, 3, 4, 5, 2, 4, 3, 8, 9, 4],
    't1_r3': [5, 7, 3, 5, 9, 3, 7, 5, 8, 4],
    't1_r4': [5, 7, 3, 3, 7, 5, 7, 7, 8, 3],
    't2_r5': [5, 7, 5, 3, 7, 3, 7, 7, 6, 3],
    't2_r6': [3, 5, 3, 1, 7, 3, 5, 5, 6, 5],
    't2_r7': [4, 5, 3, 4, 3, 6, 5, 5, 6, 6],
    't2_r8': [5, 5, 5, 3, 7, 3, 7, 4, 5, 5],
    't3_r9': [6, 7, 6, 5, 2, 4, 5, 3, 5, 5],
    't3_r10': [7, 7, 5, 3, 7, 5, 5, 2, 8, 7],
    't3_r11': [3, 5, 1, 3, 5, 1, 5, 1, 1, 1],
    't3_r12': [3, 2, 6, 5, 3, 2, 4, 1, 1, 1]
}

df = pd.DataFrame(data)
df.head()
```

|   | person | t1_r1 | t1_r2 | t1_r3 | t1_r4 | t2_r5 | t2_r6 | t2_r7 | t2_r8 | t3_r9 | t3_r10 | t3_r11 | t3_r12 |
| - | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | ------ | ------ |
| 0 | 1      | 5     | 6     | 5     | 5     | 5     | 3     | 4     | 5     | 6     | 7      | 3      | 3      |
| 1 | 2      | 9     | 3     | 7     | 7     | 7     | 5     | 5     | 5     | 7     | 7      | 5      | 2      |
| 2 | 3      | 3     | 4     | 3     | 3     | 5     | 3     | 3     | 5     | 6     | 5      | 1      | 6      |
| 3 | 4      | 7     | 5     | 5     | 3     | 3     | 1     | 4     | 3     | 5     | 3      | 3      | 5      |
| 4 | 5      | 9     | 2     | 9     | 7     | 7     | 7     | 3     | 7     | 2     | 7      | 5      | 3      |


Now we need to reshape this dataset to the format required by GeneralizIT, where each row represents a single measurement with all facets specified.

```python
# Convert to long format required by GeneralizIT
long_data = {
    'person': [],
    't': [],
    'r': [],
    'score': []  # The response variable
}

# Populate the long-format DataFrame
for person in range(1, 11):
    for t in [1, 2, 3]:  # Tasks 1, 2, 3
        for r in range(1, 13):  # Raters 1-12
            key = f't{t}_r{r}'
            # Check if the key exists (as raters are nested within tasks)
            if key in df.columns:
                response = df.at[person-1, key]
                long_data['person'].append(person)
                long_data['t'].append(t)
                long_data['r'].append(r)
                long_data['score'].append(response)

# Convert to DataFrame
brennan_design4 = pd.DataFrame(long_data)
brennan_design4.head(10)
```

|   | person | t | r  | score |
| - | ------ | - | -- | ----- |
| 0 | 1      | 1 | 1  | 5     |
| 1 | 1      | 1 | 2  | 6     |
| 2 | 1      | 1 | 3  | 5     |
| 3 | 1      | 1 | 4  | 5     |
| 4 | 1      | 2 | 5  | 5     |
| 5 | 1      | 2 | 6  | 3     |
| 6 | 1      | 2 | 7  | 4     |
| 7 | 1      | 2 | 8  | 5     |
| 8 | 1      | 3 | 9  | 6     |
| 9 | 1      | 3 | 10 | 7     |
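
The explicit loops make the nesting easy to follow. As an alternative, the same wide-to-long reshape can be sketched with pandas' `melt` plus a regular expression on the column names. The small `wide` frame below is a two-person, three-column excerpt of the table above, used only for illustration; the names `wide`, `long_df`, and `cell` are not part of GeneralizIT.

```python
import pandas as pd

# Two-person excerpt of the wide table (illustration only)
wide = pd.DataFrame({
    "person": [1, 2],
    "t1_r1": [5, 9],
    "t1_r2": [6, 3],
    "t2_r5": [5, 7],
})

# Stack every task/rater column into (person, cell, score) rows
long_df = wide.melt(id_vars="person", var_name="cell", value_name="score")

# Split column names such as "t2_r5" into integer task and rater columns
parts = long_df["cell"].str.extract(r"t(?P<t>\d+)_r(?P<r>\d+)").astype(int)

# Assemble the long-format frame: one row per person x task x rater
long_df = pd.concat([long_df[["person"]], parts, long_df[["score"]]], axis=1)
long_df = long_df.sort_values(["person", "t", "r"]).reset_index(drop=True)
print(long_df)
```

Because only the columns that exist are melted, the nesting of raters within tasks is preserved without checking for missing keys.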


Let's examine the structure of our dataset to make sure it correctly represents the design:

```python
# Check the design structure
print(f"Number of persons: {brennan_design4['person'].nunique()}")
print(f"Number of tasks: {brennan_design4['t'].nunique()}")
print(f"Number of raters: {brennan_design4['r'].nunique()}")
print(f"Total number of observations: {len(brennan_design4)}")

# Let's see which raters are used for each task
task_rater_counts = brennan_design4.groupby(['t', 'r']).size().reset_index()
task_rater_counts.columns = ['Task', 'Rater', 'Count']
print("\nTask-Rater combinations:")
print(task_rater_counts.pivot(index='Task', columns='Rater', values='Count').fillna(0))
```

```text
Number of persons: 10
Number of tasks: 3
Number of raters: 12
Total number of observations: 120

Task-Rater combinations:
Rater    1     2     3     4     5     6     7     8     9     10    11    12
Task
1      10.0  10.0  10.0  10.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2       0.0   0.0   0.0   0.0  10.0  10.0  10.0  10.0   0.0   0.0   0.0   0.0
3       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  10.0  10.0  10.0  10.0
```


This confirms that we have the correct structure for Brennan's design 4:
- 10 persons
- 3 tasks
- 12 raters (4 raters per task)
- A nested design where raters 1-4 are used for task 1, raters 5-8 for task 2, and raters 9-12 for task 3
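
The same structural checks can also be written as explicit assertions instead of being read off a pivot table. The sketch below uses a small hypothetical frame, `toy`, with two persons, two tasks, and two raters per task; the column names follow the long format used in this tutorial.

```python
import pandas as pd

# Hypothetical miniature of a person x (r:t) layout (illustration only)
toy = pd.DataFrame({
    "person": [1, 1, 1, 1, 2, 2, 2, 2],
    "t":      [1, 1, 2, 2, 1, 1, 2, 2],
    "r":      [1, 2, 3, 4, 1, 2, 3, 4],
})

# Nesting: every rater label should appear under exactly one task
tasks_per_rater = toy.groupby("r")["t"].nunique()
is_nested = bool((tasks_per_rater == 1).all())

# Balance: every task should have the same number of raters
raters_per_task = toy.groupby("t")["r"].nunique()
is_balanced = raters_per_task.nunique() == 1

# Crossing: every person should be observed in every (task, rater) cell
cells_per_person = toy.groupby("person").size()
is_crossed = bool((cells_per_person == toy[["t", "r"]].drop_duplicates().shape[0]).all())

print(is_nested, is_balanced, is_crossed)
```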

## Analyzing the Dataset with GeneralizIT

Now let's perform a generalizability analysis on this dataset using the GeneralizIT package.

### Step 1: Initialize the GeneralizIT Object

First, we'll create a GeneralizIT object with our dataset and design specification.

```python
# Initialize the GeneralizIT object
gt = GeneralizIT(
    data=brennan_design4,             # Our dataset
    design_str="person x (r:t)",      # Design specification in standard notation
    response="score"                  # Name of our response variable
)

# Examine the design object created by GeneralizIT
print("Variance tuple dictionary:")
print(gt.design.variance_tuple_dictionary)
```

```text
Variance tuple dictionary:
{'person': ('person',), 't': ('t',), 'r:t': ('r', 't'), 'person x t': ('person', 't'), 'person x (r:t)': ('person', 'r', 't'), 'mean': ()}
```


The variance tuple dictionary shows how GeneralizIT has parsed our design specification and identified the relationships between facets.

Specifically, it lists the effects in the linear model for this design. Each effect contributes its own variance component, so the total variance of an observed score $Y$ decomposes as:

$ \sigma^2(Y) = \sigma^2_{person} + \sigma^2_{t} + \sigma^2_{r:t} + \sigma^2_{person \times t} + \sigma^2_{person \times (r:t),e} $

### Step 2: Calculate ANOVA Components

Next, we'll calculate the ANOVA components (the uncorrected sums of squares, $T$, and the variance component estimates, $\hat{\sigma}^2$), which is the first step in a G-theory analysis.

```python
# Calculate the ANOVA components
gt.calculate_anova()

# Display the ANOVA summary
gt.anova_summary()
```

```text
--------------------
    ANOVA Table     
--------------------
                person          t               r:t             person x t      person x (r:t)  mean            T               Variance       
person          120.0000        40.0000         10.0000         40.0000         10.0000         120.0000        2800.1667       0.4731         
t               12.0000         120.0000        30.0000         12.0000         3.0000          120.0000        2755.7000       0.3252         
r:t             12.0000         120.0000        120.0000        12.0000         12.0000         120.0000        2835.4000       0.6475         
person x t      120.0000        120.0000        30.0000         120.0000        30.0000         120.0000        2931.5000       0.5596         
person x (r:t)  120.0000        120.0000        120.0000        120.0000        120.0000        120.0000        3204.0000       2.3802         
mean            12.0000         40.0000         10.0000         4.0000  
```

```python
# Display variance components
gt.variance_summary()
```

```text
--------------------
Variance Components 
--------------------
                Variance       
person          0.4731         
t               0.3252         
r:t             0.6475         
person x t      0.5596         
person x (r:t)  2.3802         
mean            22.3144        
```




Looking at the variance components, we can see how much variance is attributed to each facet in our design:

1. **Person Variance**: This represents consistent differences between persons across tasks and raters. Higher values indicate that persons differ systematically in their performances.

2. **Task Variance**: This represents consistent differences between tasks across persons and raters. Higher values indicate that some tasks are consistently easier or harder than others.

3. **Rater:Task Variance**: This represents consistent differences between raters within tasks. Higher values indicate that some raters are consistently more lenient or severe than others within their task groups.

4. **Person × Task Variance**: This represents the interaction between persons and tasks. Higher values indicate that the relative standing of persons changes across different tasks.

5. **Person × (Rater:Task) Variance**: This is the residual variance: the interaction of persons with raters nested within tasks, confounded with random error. It's often the largest source of variance in G-theory studies.

The interaction variances are particularly important in understanding reliability. For example, a high Person × Task variance suggests that ranking of individuals changes depending on which task they perform, indicating task specificity in the measurements.
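
To compare the components on a common scale, it can help to express each one as a share of the total. The sketch below copies the estimates from the Variance Components output above into a plain dictionary (the `mean` row is left out because it is not one of the five score effects) and is independent of GeneralizIT itself.

```python
# Variance component estimates copied from the output above
variance_components = {
    "person": 0.4731,
    "t": 0.3252,
    "r:t": 0.6475,
    "person x t": 0.5596,
    "person x (r:t)": 2.3802,
}

# Total variance of a single observed score
total = sum(variance_components.values())

# Share of the total attributable to each effect
shares = {name: value / total for name, value in variance_components.items()}

for name, share in shares.items():
    print(f"{name:<16}{share:6.1%}")
```

In this dataset the residual term accounts for more than half of the total, while person variance, the signal we want to measure, is roughly a tenth of it.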

### Step 3: Calculate G-Coefficients

Now, let's calculate the G-coefficients (generalizability coefficients). G-coefficients are reliability-like indices that tell us how generalizable our measurements are across different facets of the measurement design.

GeneralizIT can calculate G-coefficients from its built-in variance components and model design, or from a user-supplied 'variance_dictionary' that has been computed elsewhere. Since we have already computed the variance components, we will use the built-in ones.

GeneralizIT provides two types of coefficients:
- The generalizability coefficient (ρ²) for relative decisions (comparing individuals)
- The dependability coefficient (Φ) for absolute decisions (measuring against a standard)

These coefficients help us evaluate the reliability of our measurement procedure and determine if it's appropriate for making the types of decisions we need to make in our assessment context.

```python
# Calculate G-coefficients
gt.calculate_g_coefficients()

# Display G-coefficients summary
gt.g_coefficients_summary()
```

```text
Using default variance tuple dictionary
Using ANOVA Table Variance Dictionary for Generalizability Coefficients

--------------------
   G Coefficients   
--------------------
                Φ               ρ²             
person          0.4637          0.5514         
t               0.5004          0.5397         
r:t             0.7403          0.7679         
person x t      0.6421          0.6421         
```




### Understanding G-Coefficients

In this analysis, we have two reliability-like coefficients:

1. **Generalizability coefficient (ρ²)**: This is analogous to a reliability coefficient in classical test theory. It represents relative decisions where the ranking of persons is important, but the absolute value is not.

2. **Dependability coefficient (Φ)**: This is for absolute decisions where the actual values matter, not just the ranking. It's typically lower than ρ² because it accounts for all sources of error.

Looking at the "person" facet (which is typically our object of measurement in G-theory):
- The coefficients tell us how reliable our measurements would be if we were trying to make generalizations about persons.
- A coefficient above 0.80 is generally considered good for making decisions. Our coefficients are:
  - G-coefficient (ρ²): 0.55
  - Dependability coefficient (Φ): 0.46
- These coefficients suggest that while our measurements have some reliability, it is not high enough to support strong conclusions about persons' scores on this assessment.
- We might consider changing our measurement procedure (i.e., the assessment) through a decision study.
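
To see where the two person coefficients come from, they can be recomputed by hand from the variance components. The sketch below applies the textbook formulas for a person x (r:t) design with persons as the object of measurement; it is not GeneralizIT's internal code, and the helper name `person_coefficients` is made up for this illustration.

```python
# Variance component estimates copied from the Variance Components output
var = {
    "person": 0.4731,
    "t": 0.3252,
    "r:t": 0.6475,
    "person x t": 0.5596,
    "person x (r:t)": 2.3802,
}

def person_coefficients(var, n_t, n_r):
    """Return (rho2, phi) for persons, given n_t tasks and n_r raters per task."""
    # Relative error: only effects that interact with persons
    rel_error = var["person x t"] / n_t + var["person x (r:t)"] / (n_t * n_r)
    # Absolute error: relative error plus the task and rater main effects
    abs_error = rel_error + var["t"] / n_t + var["r:t"] / (n_t * n_r)
    rho2 = var["person"] / (var["person"] + rel_error)
    phi = var["person"] / (var["person"] + abs_error)
    return rho2, phi

# The G-study design has 3 tasks with 4 raters each
rho2, phi = person_coefficients(var, n_t=3, n_r=4)
print(f"rho2 = {rho2:.4f}, phi = {phi:.4f}")
```

With the rounded estimates above, this gives values that agree with the `person` row of the G Coefficients table to four decimals.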


### Step 4: Decision Study (D-study)

Decision studies (D-studies) allow us to explore how changes in the design (e.g., number of tasks, raters, or persons) would affect our reliability estimates.

Let's conduct a decision study to see how increasing the number of tasks while decreasing the number of raters would affect our reliability estimates.

```python
# Define different designs to test
d_study_design = {
    'person': [10],  # Keep the number of persons constant
    't': [4],        # Increase the number of tasks
    'r': [3]         # Decrease the number of raters per task
}

# Calculate D-study
gt.calculate_d_study(d_study_design=d_study_design)

# Display D-study results
gt.d_study_summary()
```

```text
Performing Balanced D-Study Design for the provided designs
Using user-provided variance tuple dictionary
Using ANOVA Table Variance Dictionary for Generalizability Coefficients
Using User Provided Levels Coefficients

--------------------
D-Study: person: 10, t: 4, r: 3
--------------------
                Φ               ρ²             
person          0.4998          0.5831         
t               0.4494          0.4808         
r:t             0.7403          0.7679         
person x t      0.5736          0.5736         
```




The D-study results show how reliability would change if we used different numbers of tasks and raters, which helps in choosing a design for future studies. Here, increasing the number of tasks from 3 to 4 while decreasing the number of raters per task from 4 to 3 raises both person coefficients (ρ² from 0.55 to 0.58, Φ from 0.46 to 0.50). This occurs for two reasons:

1. We added one task but removed one rater per task, so the total number of rater levels is still 12 (4 × 3) and the contributions of $\hat{\sigma}^2_{r:t}$ and $\hat{\sigma}^2_{person \times (r:t)}$ to the error variance are unchanged.
2. The contribution of the person × task interaction variance, $\hat{\sigma}^2_{person \times t}$, a major source of variability, shrinks because it is now divided by 4 tasks instead of 3.
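
The trade-off between tasks and raters can be explored further by holding the total number of ratings per person at 12 and varying how they are split. The sketch below uses the standard person x (r:t) error-variance formulas rather than GeneralizIT's own D-study routine, with the variance components copied from the earlier output; `d_study_person` is a hypothetical helper name.

```python
# Variance component estimates copied from the Variance Components output
var = {"person": 0.4731, "t": 0.3252, "r:t": 0.6475,
       "person x t": 0.5596, "person x (r:t)": 2.3802}

def d_study_person(n_t, n_r):
    """Person coefficients (rho2, phi) for n_t tasks and n_r raters per task."""
    rel = var["person x t"] / n_t + var["person x (r:t)"] / (n_t * n_r)
    abs_ = rel + var["t"] / n_t + var["r:t"] / (n_t * n_r)
    return var["person"] / (var["person"] + rel), var["person"] / (var["person"] + abs_)

# Every way of splitting 12 ratings per person into tasks x raters-per-task
designs = [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
results = {design: d_study_person(*design) for design in designs}

for (n_t, n_r), (rho2, phi) in results.items():
    print(f"t={n_t:>2}, r={n_r:>2}: rho2={rho2:.4f}, phi={phi:.4f}")
```

Under these formulas, with the total number of ratings fixed, both coefficients rise as ratings are spread over more tasks, because only the task-related terms depend on the number of tasks alone.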

### Step 5: Confidence Intervals

Let's also calculate confidence intervals for the mean score at each level of a facet to assess how precisely those means are estimated.

```python
# Calculate confidence intervals with alpha = 0.05 (95% confidence intervals)
gt.calculate_confidence_intervals(alpha=0.05)

# Display confidence intervals
gt.confidence_intervals_summary()
```

```text
Using ANOVA Table Variance Dictionary
Using previously calculated levels coefficients

--------------------
95% CI for 'person' 
--------------------
Group           2.5%            mean            97.5%          
1               3.3001          4.7500          6.1999         
2               4.3001          5.7500          7.1999         
3               2.4668          3.9167          5.3666         
4               2.4668          3.9167          5.3666         
5               4.2168          5.6667          7.1166         
6               2.0501          3.5000          4.9499         
7               4.1334          5.5833          7.0332         
8               2.9668          4.4167          5.8666         
9               4.5501          6.0000          7.4499         
10              2.5501          4.0000          5.4499         



--------------------
   95% CI for 't'   
--------------------
Group           2.5%            mean            97.5%          
1               
```

The confidence intervals give a range of plausible values for the mean score at each level of a facet (for example, each person's mean across all tasks and raters). A more reliable measurement procedure has a smaller error variance and therefore yields narrower intervals.
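
The width of the person intervals can be reproduced by hand. The sketch below assumes a normal-approximation interval, mean ± z · sqrt(absolute error variance), which matches the printed bounds for person 1; whether GeneralizIT computes its intervals in exactly this way is an assumption, not something stated in this tutorial.

```python
from statistics import NormalDist

# Variance component estimates copied from the Variance Components output
var = {"t": 0.3252, "r:t": 0.6475,
       "person x t": 0.5596, "person x (r:t)": 2.3802}
n_t, n_r = 3, 4  # tasks and raters per task in the G-study

# Absolute error variance of a person's mean over 3 tasks x 4 raters
abs_error_var = (var["t"] / n_t + var["r:t"] / (n_t * n_r)
                 + var["person x t"] / n_t + var["person x (r:t)"] / (n_t * n_r))

# Person 1's twelve scores, read from the first row of the wide table
person1_scores = [5, 6, 5, 5, 5, 3, 4, 5, 6, 7, 3, 3]
mean = sum(person1_scores) / len(person1_scores)

# Two-sided 95% normal quantile (alpha = 0.05)
z = NormalDist().inv_cdf(1 - 0.05 / 2)
half_width = z * abs_error_var ** 0.5
lower, upper = mean - half_width, mean + half_width
print(f"{lower:.4f}  {mean:.4f}  {upper:.4f}")
```

Every person shares the same half-width because the design is balanced, which is why the intervals in the output above differ only in their centers.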

## Conclusions

In this tutorial, we've demonstrated how to use GeneralizIT to perform a complete generalizability theory analysis on Brennan's Design 4 dataset. We've shown how to:

1. Prepare and format data for G-theory analysis
2. Initialize a GeneralizIT object with the appropriate design specification
3. Calculate ANOVA components and variance components
4. Calculate G-coefficients for reliability assessment
5. Conduct decision studies to optimize measurement procedures
6. Calculate confidence intervals for precision assessment

The results from this analysis help us understand:

1. The major sources of variance in our measurements
2. The reliability of our current measurement procedure
3. How to optimize future studies by adjusting the number of tasks and raters

GeneralizIT simplifies this complex statistical procedure and provides a comprehensive framework for reliability analysis in educational, psychological, and other measurement contexts.

## Advanced Usage
GeneralizIT is built with flexibility in mind. Here are some advanced features you can explore. 

### Studies with > 3 facets
GeneralizIT's built-in design parser is currently limited to 3 facets. However, you can specify your own "variance_tuple_dictionary" to analyze designs with more than 3 facets.
**Important Note**: The variance_tuple_dictionary must be a dictionary of tuples, where the keys are the names of the variance components and the values are tuples of the facet names that make up that component. In the key names, crossed facets should be separated by " x " and nested facets should be separated by ":".

```python
# Design with 4 crossed facets
design_str = "a x b x c x d"
# Create a custom variance dictionary
variance_tuple_dict = {
    "a": ('a',),
    "b": ('b',),
    "c": ('c',),
    "d": ('d',),
    "a x b": ('a', 'b'),
    "a x c": ('a', 'c'),
    "a x d": ('a', 'd'),
    "b x c": ('b', 'c'),
    "b x d": ('b', 'd'),
    "c x d": ('c', 'd'),
    "a x b x c": ('a', 'b', 'c'),
    "a x b x d": ('a', 'b', 'd'),
    "a x c x d": ('a', 'c', 'd'),
    "b x c x d": ('b', 'c', 'd'),
    "a x b x c x d": ('a', 'b', 'c', 'd'),
}
# Create a GeneralizIT object with the custom variance dictionary
# (`data` stands for your own long-format DataFrame with columns a, b, c, d, and score)
gt = GeneralizIT(
    data=data,
    design_str=design_str,
    variance_tuple_dictionary=variance_tuple_dict,
    response='score',
)
```
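
For fully crossed designs, writing out every interaction by hand becomes error-prone as the number of facets grows. The sketch below builds the same 15-entry dictionary with `itertools.combinations`. It only covers fully crossed designs, and it does not add the `'mean': ()` entry that appears in the parser's own output earlier in this tutorial; whether a custom dictionary needs that entry is not stated here.

```python
from itertools import combinations

facets = ["a", "b", "c", "d"]

# One entry per non-empty subset of facets: main effects first,
# then two-way, three-way, and four-way interactions
variance_tuple_dict = {
    " x ".join(combo): combo
    for size in range(1, len(facets) + 1)
    for combo in combinations(facets, size)
}

for name, facet_tuple in variance_tuple_dict.items():
    print(f"{name!r}: {facet_tuple}")
```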

### D Studies with missing or unbalanced data
GeneralizIT's built-in utility functions for D-studies are designed to work with balanced data. However, if you provide "pseudo_counts_dfs", a list of pandas DataFrames, instead of the "d_study_design" argument, GeneralizIT will use the provided DataFrames to calculate the appropriate levels for the D-study.
A pseudo_counts_df is a pandas DataFrame in long format that mimics the structure of the original data, but without a response column.

For example, if we wanted to examine having 4 raters on task 1, 2 raters on task 2, and 2 raters on task 3, we could create a pseudo_counts_df like this:
```python
# Facet columns only: a pseudo_counts_df has no response column
pseudo_data = {
    'person': [],
    't': [],
    'r': [],
}
# Valid task-rater nests: 4 raters on task 1, 2 raters each on tasks 2 and 3
valid_nests = ["t1_r1", "t1_r2", "t1_r3", "t1_r4", "t2_r1", "t2_r2", "t3_r1", "t3_r2"]

# Populate the pseudo-data with every person x valid (task, rater) combination
for person in range(1, 11):
    for t in [1, 2, 3]:  # Tasks 1, 2, 3
        for r in range(1, 5):  # Candidate raters 1-4 within each task
            key = f't{t}_r{r}'
            # Keep only the task-rater pairs that exist in the new design
            if key in valid_nests:
                pseudo_data['person'].append(person)
                pseudo_data['t'].append(t)
                pseudo_data['r'].append(r)

# Create a pseudo_counts_df for the D-study
pseudo_counts_df = pd.DataFrame(pseudo_data)

# perform the D-study
gt.calculate_d_study(
    pseudo_counts_dfs=[pseudo_counts_df],
)
```

### D-Studies with design changes
Finally, you can also perform D-studies with design changes.
For example, if the G-study was originally fully crossed, i.e. 'persons x raters x tasks', but you wanted to see how the coefficients would change under a nested design, i.e. 'persons x (r:t)', you could do this by providing 'pseudo_counts_dfs', an updated 'variance_tuple_dictionary' that reflects the new design, and an updated 'variance_dictionary' that reflects the change in variance components through nesting.
```python
gt.calculate_d_study(
    pseudo_counts_dfs=[pseudo_counts_df],
    variance_tuple_dictionary=variance_tuple_dictionary,
    variance_dictionary=variance_dictionary,
)
```
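
As a rough sketch of what "the change in variance components through nesting" means: when raters become nested within tasks, the crossed rater effects are absorbed into the nested ones, so $\sigma^2_{r:t} = \sigma^2_{r} + \sigma^2_{r \times t}$ and $\sigma^2_{person \times (r:t)} = \sigma^2_{person \times r} + \sigma^2_{person \times r \times t,e}$. The crossed values below are hypothetical placeholders, not results from any dataset, and the assumption that 'variance_dictionary' uses the same keys as the variance tuple dictionary is mine.

```python
# Hypothetical variance components from a fully crossed G-study
# (placeholder numbers for illustration only)
crossed = {
    "person": 0.50,
    "t": 0.30,
    "r": 0.10,
    "person x t": 0.55,
    "person x r": 0.20,
    "r x t": 0.40,
    "person x r x t": 2.00,
}

# Collapse the crossed components into those of a person x (r:t) design
variance_dictionary = {
    "person": crossed["person"],
    "t": crossed["t"],
    # Rater main effect and rater x task interaction are confounded once nested
    "r:t": crossed["r"] + crossed["r x t"],
    "person x t": crossed["person x t"],
    # Person x rater and the three-way residual are confounded once nested
    "person x (r:t)": crossed["person x r"] + crossed["person x r x t"],
}

print(variance_dictionary)
```

The total variance is the same before and after collapsing; nesting only changes which effects can be separated.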


These advanced features allow you to customize your G-theory analyses to fit your specific research needs!

## References

1. Brennan, R. L. (2001). Generalizability Theory. New York: Springer.