# Intercoder reliability

- Author: Zachary Kilhoffer
- Updated 2024-06-17

### Purpose
To finetune the model, we need labeled data that shows the LLM how to perform the given task.

Labels are more reliable when several people produce them and compare results, because this reduces certain types of bias and error.

Intercoder reliability measures how consistently the labelers agree, which tells us how far to trust the labels.

### Method Outline

We proceeded as follows.

1. Each of the 3 researchers gets the same 30 "controls", which are short texts about how to implement privacy/security. 
    - These controls were randomly selected from our dataset. 
2. Each researcher individually labels each of the 30 controls with one of the possible labels. 
    - The labels are the 33 "domain" names taken from [SCF - Secure Controls Framework](https://content.securecontrolsframework.com/SCF-Recommended-Practices.pdf), such as "Asset Management", "Risk Management", "Mobile Device Management", etc.
    - The labels and abbreviations are in train-data-redacted.xlsx  
3. We combine every researcher's labels into a single dataset.
4. We assess intercoder reliability, or how much we agree, by calculating the Fleiss kappa with the combined labels of the 3 researchers.
5. We sit together and look at how we labeled each of the 30 controls, then discuss our reasoning for doing so.
6. We go through each of the 30 controls, and each person has the chance to change their response.
7. We repeat steps 3 and 4, recalculating intercoder reliability with the revised labels.

After these steps, we took the labeled controls on which 2/3 or 3/3 researchers agreed (after having the chance to reconsider) and used them in our LLM finetuning pipeline.

In removing data where we couldn't agree, we intended to remove data that is noisier and less reliable as a source of "ground truth".
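The 2/3-or-3/3 agreement rule described above can be sketched in plain Python. This is a minimal illustration rather than the pipeline we ran: `majority_label`, `min_votes`, and the three toy rows are hypothetical names and data introduced only for this example.

```python
from collections import Counter


def majority_label(labels, min_votes=2):
    """Return the most common label if at least `min_votes` raters chose it, else None.

    `majority_label` and `min_votes` are illustrative names, not part of the original notebook.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical toy rows: one tuple of three labels per control
rows = [
    ("CLD", "CLD", "CLD"),  # 3/3 agree -> keep
    ("CLD", "IAC", "CLD"),  # 2/3 agree -> keep
    ("GOV", "CPL", "SEA"),  # no two researchers agree -> drop
]

# Map row index -> agreed label, skipping rows without a majority
kept = {}
for i, labels in enumerate(rows):
    agreed = majority_label(labels)
    if agreed is not None:
        kept[i] = agreed

print(kept)
```

With three raters and `min_votes=2`, the top label can never be tied. With more raters, ties become possible and would need an explicit rule.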

### Important notes
- I can only provide the control texts for FedRAMP and C5 due to copyright.
    - See train-data-redacted.xlsx for more info.
- Because of these redactions, I've replaced the actual control texts we used with the numbers 1-30 in the code below.
***

In [2]:
import pandas as pd
import numpy as np

# Before reconciliation

In [3]:
# Actual first round data, before reconciliation
data = {
    'researcher1': ["CLD", "CLD", "GOV", "CLD", "CPL", "MNT", "CLD", "IAC", "CRY", "CRY", "RSK", "HRS", "IAC", "DCH", "DCH",
            "CPL", "CPL", "DCH", "GOV", "CPL", "CPL", "DCH", "CPL", "DCH", "PRI", "HRS", "IRO", "THR", "RSK", "PRI"],
    'researcher2': ["CLD", "IAC", "CPL", "PRI", "DCH", "DCH", "DCH", "SAT", "MON", "CRY", "TPM", "SAT", "NET", "CPL", "PES",
               "PRM", "THR", "CLD", "NET", "PRI", "SAT", "OPS", "END", "GOV", "PRI", "HRS", "IRO", "IAO", "RSK", "GOV"],
    'researcher3': ["CLD", "CLD", "SEA", "GOV", "PRI", "PRI", "PRI", "TDA", "AST", "CRY", "TDA", "SAT", "NET", "DCH", "DCH",
            "TPM", "MON", "DCH", "NET", "GOV", "SAT", "DCH", "AST", "PRI", "PRI", "HRS", "IRO", "RSK", "RSK", "PRI"]
}

# Convert the first-round labels to a dataframe (one row per control, one column per researcher)
df_before = pd.DataFrame(data)
print(df_before.shape)

(30, 3)


In [7]:
# To calculate Fleiss' Kappa, we first need to construct a matrix where each row represents one of the paragraphs of text
# and each column represents one of the possible codes. Each cell in the matrix will contain the count of raters
# who assigned the corresponding code to the corresponding paragraph.

# Identify all unique codes across coders
unique_codes = pd.unique(df_before.values.ravel())

# Initialize a matrix to store the counts
code_matrix = pd.DataFrame(0, index=np.arange(len(df_before)), columns=unique_codes)

# Populate the matrix with counts
for index, row in df_before.iterrows():
    for coder in df_before.columns:
        code_matrix.at[index, row[coder]] += 1

code_matrix.head()  # Show the first few rows of the matrix to verify correct setup

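|   | CLD | IAC | GOV | CPL | SEA | PRI | DCH | MNT | SAT | TDA | ... | TPM | HRS | NET | PES | PRM | THR | OPS | END | IRO | IAO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |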


In [8]:
def fleiss_kappa(M):
    """
    Compute the Fleiss' kappa for a matrix of ratings.
    
    :param M: The matrix of ratings where each row represents a subject and each column represents a category.
              Entries are the number of raters who assigned the corresponding category to the subject.
    :returns: The Fleiss' kappa score.
    """

    n, k = M.shape  # n is the number of subjects, k is the number of categories
    N = M.sum(axis=1)[0]  # The total number of ratings per subject (assumed to be the same for all subjects)
    
    # The proportion of all assignments which were to the j-th category
    p_j = M.sum(axis=0) / (n * N)
    
    # The extent to which raters agree for the i-th subject
    P_i = (M**2).sum(axis=1) - N
    P_i = P_i / (N * (N - 1))
    
    # The overall extent of agreement
    P_bar = P_i.mean()
    
    # The expected extent of agreement when assignments are made at random
    P_e = (p_j**2).sum()
    
    # The kappa score
    kappa = (P_bar - P_e) / (1 - P_e)
    
    return kappa

# Calculate Fleiss' Kappa
fleiss_kappa_value = fleiss_kappa(code_matrix)
fleiss_kappa_value


0.2587672688629118
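For reference, the quantities computed in `fleiss_kappa` can be written out as follows. This is my transcription of the code into notation, using the same symbols: $M_{ij}$ is the number of raters who assigned category $j$ to subject $i$, $n$ is the number of subjects, $k$ the number of categories, and $N$ the number of ratings per subject.

```latex
\begin{aligned}
p_j &= \frac{1}{nN} \sum_{i=1}^{n} M_{ij}
  && \text{share of all assignments that went to category } j \\
P_i &= \frac{1}{N(N-1)} \left( \sum_{j=1}^{k} M_{ij}^{2} - N \right)
  && \text{agreement on subject } i \\
\bar{P} &= \frac{1}{n} \sum_{i=1}^{n} P_i
  && \text{mean observed agreement} \\
P_e &= \sum_{j=1}^{k} p_j^{2}
  && \text{agreement expected by chance} \\
\kappa &= \frac{\bar{P} - P_e}{1 - P_e}
\end{aligned}
```

With $N = 3$ raters, $P_i$ can only take three values: $1$ when all three agree, $1/3$ when exactly two agree, and $0$ when all three differ.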

The Fleiss' Kappa score for the data is approximately 0.259. This indicates a fair level of agreement among the three coders beyond what would be expected by chance. In general, kappa values can be interpreted as follows:

- Less than 0: Poor agreement
- 0.01 - 0.20: Slight agreement
- 0.21 - 0.40: Fair agreement
- 0.41 - 0.60: Moderate agreement
- 0.61 - 0.80: Substantial agreement
- 0.81 - 0.99: Almost perfect agreement
- 1: Perfect agreement

So, before reconciliation, agreement among the coders was only fair, leaving considerable room to improve through discussion.
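The interpretation scale above can be turned into a small lookup helper. This is a sketch under stated assumptions: `interpret_kappa` and `BANDS` are hypothetical names, and because the scale leaves small gaps (for example between 0 and 0.01), I assume each upper bound is inclusive and that values in a gap fall into the next band up.

```python
# Upper bounds (inclusive, by assumption) for the bands between 0 and 1
BANDS = [
    (0.20, "Slight agreement"),
    (0.40, "Fair agreement"),
    (0.60, "Moderate agreement"),
    (0.80, "Substantial agreement"),
]


def interpret_kappa(kappa):
    """Map a kappa value to the verbal label from the scale above."""
    if kappa < 0:
        return "Poor agreement"
    if kappa >= 1:
        return "Perfect agreement"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    # Anything above 0.80 and below 1
    return "Almost perfect agreement"


print(interpret_kappa(0.2587672688629118))
```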

# After reconciliation

In [9]:
data_reconciled = {
    'researcher_1': ["CLD", "CLD", "GOV", "GOV", "PRI", "PRI", "PRI", "TDA", "CRY", "CRY", "TDA", "HRS", "IAC", "DCH", "DCH",
            "CPL", "MON", "DCH", "GOV", "CPL", "CPL", "DCH", "END", "PRI", "PRI", "HRS", "IRO", "RSK", "RSK", "PRI"],
    'researcher_2': ["CLD", "IAC", "CPL", "GOV", "PRI", "PRI", "PRI", "TDA", "MON", "CRY", "TDA", "SAT", "NET", "DCH", "DCH",
               "PRM", "MON", "DCH", "NET", "PRI", "SAT", "OPS", "END", "PRI", "PRI", "HRS", "IRO", "RSK", "RSK", "PRI"],
    'researcher_3': ["CLD", "CLD", "SEA", "GOV", "PRI", "PRI", "PRI", "TDA", "AST", "CRY", "TDA", "SAT", "NET", "DCH", "DCH",
            "TPM", "MON", "DCH", "NET", "GOV", "SAT", "DCH", "END", "PRI", "PRI", "HRS", "IRO", "RSK", "RSK", "PRI"]
}

# Convert the reconciled labels to a dataframe (one row per control, one column per researcher)
df_reconciled = pd.DataFrame(data_reconciled)
print(df_reconciled.shape)

(30, 3)


In [10]:
# Identify all unique codes across coders in the new data
unique_codes_reconciled = pd.unique(df_reconciled.values.ravel())

# Initialize a matrix to store the counts for the new data
code_matrix_reconciled = pd.DataFrame(0, index=np.arange(len(df_reconciled)), columns=unique_codes_reconciled)

# Populate the matrix with counts for the new data
for index, row in df_reconciled.iterrows():
    for coder in df_reconciled.columns:
        code_matrix_reconciled.at[index, row[coder]] += 1

# Calculate Fleiss' Kappa for the reconciled data
fleiss_kappa_reconciled_value = fleiss_kappa(code_matrix_reconciled)
fleiss_kappa_reconciled_value

0.7066014669926651

The score of approximately 0.707 indicates "substantial agreement", up from approximately 0.259 before reconciliation.
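As a sanity check on the hand-written function, the same statistic can be recomputed without pandas or NumPy. The sketch below is a dependency-free variant that works directly on tuples of labels. `fleiss_kappa_from_labels` is a hypothetical name, and the toy inputs are made up for illustration. It assumes every subject has the same number of ratings, as in our data.

```python
from collections import Counter


def fleiss_kappa_from_labels(items):
    """Fleiss' kappa for `items`, a list of equal-length label tuples (one tuple per subject)."""
    n = len(items)      # number of subjects
    N = len(items[0])   # ratings per subject
    if any(len(labels) != N for labels in items):
        raise ValueError("every subject needs the same number of ratings")

    # Per-subject category counts (one row of the count matrix each)
    counts = [Counter(labels) for labels in items]

    # Total assignments per category across all subjects
    totals = Counter()
    for c in counts:
        totals.update(c)

    # Share of all assignments per category, then chance agreement
    p_j = [t / (n * N) for t in totals.values()]
    P_e = sum(p * p for p in p_j)

    # Observed agreement per subject, then its mean
    P_i = [(sum(v * v for v in c.values()) - N) / (N * (N - 1)) for c in counts]
    P_bar = sum(P_i) / n

    return (P_bar - P_e) / (1 - P_e)


# Toy example: two raters agree on each subject, the third differs
print(fleiss_kappa_from_labels([("A", "A", "B"), ("A", "B", "B")]))
```

If every rater assigns the same single category to every subject, `P_e` equals 1 and the ratio is undefined. This sketch does not guard against that case.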

In [13]:
# To make sure we only train on the highest-quality data, keep only the rows where all three researchers assigned the same label.
df_final = df_reconciled[df_reconciled.apply(lambda row: row.nunique() == 1, axis=1)]

In [14]:
df_final

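|    | researcher_1 | researcher_2 | researcher_3 |
|----|--------------|--------------|--------------|
| 0  | CLD | CLD | CLD |
| 3  | GOV | GOV | GOV |
| 4  | PRI | PRI | PRI |
| 5  | PRI | PRI | PRI |
| 6  | PRI | PRI | PRI |
| 7  | TDA | TDA | TDA |
| 9  | CRY | CRY | CRY |
| 10 | TDA | TDA | TDA |
| 13 | DCH | DCH | DCH |
| 14 | DCH | DCH | DCH |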


In our actual finetuning process, we used the control texts instead of numbers. Instead of the three-letter abbreviations, we used the full category names taken from [SCF - Secure Controls Framework](https://content.securecontrolsframework.com/SCF-Recommended-Practices.pdf), such as "Asset Management", "Risk Management", "Mobile Device Management", etc.

For more info, see train-data-redacted.xlsx.
