## Develop Grade-Discrepancy-Report
3.12.24 - 3.15.24

When two models, or two runs of the same model, answer the same sheet, their grades for a given question may differ.
This report compares the grades from two or more models or runs and lists any discrepancies.

In [41]:
import pandas as pd
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [42]:

from lime.agg import build_data_wrapper
from lime.modules.views.agg.collect import build_data
from lime.modules.views.agg.query import (
    sheet_by_model_pct_correct,
    all_sheets_all_questions,
    input_by_model,
    format_multi_index,
)


In [43]:
base_path = '../../datasets/hello-qa/hotpot1/_aggs/'
exp = '*wins*'
!ls {base_path + exp}
data = build_data_wrapper(base_path + exp)

../../datasets/hello-qa/hotpot1/_aggs/output-cpl-wins-1-cpl-rag-1-41cb.json
../../datasets/hello-qa/hotpot1/_aggs/output-cpl-wins-1-gpt-3.5-turbo-4543.json
../../datasets/hello-qa/hotpot1/_aggs/output-gpt-wins-1-cpl-rag-1-d032.json
../../datasets/hello-qa/hotpot1/_aggs/output-gpt-wins-1-gpt-3.5-turbo-2711.json


In [44]:
data.shape, data.dtypes[:4]

((40, 19),
 name            object
 meta_data       object
 ground_truth    object
 question_usr    object
 dtype: object)

### `--discrepancy` report

In [40]:
question_names_by_models = pd.pivot(
    data, 
    index='name', 
    columns=['model_name', 'run_id'], 
    values='grade_bool'
)
print(question_names_by_models.head(3).to_markdown())

| name   |   ('cpl-rag-1', '41cb') |   ('gpt-3.5-turbo', '4543') |   ('cpl-rag-1', 'd032') |   ('gpt-3.5-turbo', '2711') |
|:-------|------------------------:|----------------------------:|------------------------:|----------------------------:|
| Q-10   |                       1 |                           1 |                     nan |                         nan |
| Q-12   |                     nan |                         nan |                       0 |                           1 |
| Q-14   |                     nan |                         nan |                       0 |                           1 |


In [26]:
diff_index = (
    question_names_by_models.apply(lambda x: x.dropna().nunique() > 1, axis=1)
)
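
The row-wise test above can be checked on a toy table. This is a minimal sketch with made-up question names, model names, and grades (none of these values come from the datasets in this notebook): a row counts as a discrepancy when, after dropping missing grades, more than one distinct value remains.

```python
import pandas as pd

# Hypothetical long-format grades; names and values are illustrative only.
toy = pd.DataFrame({
    "name": ["Q-1", "Q-1", "Q-2", "Q-2", "Q-3"],
    "model_name": ["model-a", "model-b", "model-a", "model-b", "model-a"],
    "run_id": ["r1", "r2", "r1", "r2", "r1"],
    "grade_bool": [True, True, True, False, False],
})

# One row per question, one column per (model_name, run_id) pair.
wide = toy.pivot(index="name", columns=["model_name", "run_id"], values="grade_bool")

# A row has a discrepancy when its non-missing grades take more than one value.
has_diff = wide.apply(lambda row: row.dropna().nunique() > 1, axis=1)

discrepancies = wide[has_diff]   # questions graded differently
agreements = wide[~has_diff]     # questions graded the same (or graded only once)
print(discrepancies.index.tolist())
```

Note that a question answered by only one run (Q-3 here) can never be flagged, because a single non-missing grade has exactly one distinct value.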

In [32]:
# rows where all non-missing grades agree (no discrepancy)
tmp = question_names_by_models[~diff_index]
tmp

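| name   | ('cpl-rag-1', '41cb')   | ('gpt-3.5-turbo', '4543')   | ('cpl-rag-1', 'd032')   | ('gpt-3.5-turbo', '2711')   |
|:-------|:------------------------|:----------------------------|:------------------------|:----------------------------|
| Q-10   | True                    | True                        | NaN                     | NaN                         |
| Q-4    | False                   | False                       | NaN                     | NaN                         |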


In [37]:
print(tmp.fillna('-').to_markdown())

| name   | ('cpl-rag-1', '41cb')   | ('gpt-3.5-turbo', '4543')   | ('cpl-rag-1', 'd032')   | ('gpt-3.5-turbo', '2711')   |
|:-------|:------------------------|:----------------------------|:------------------------|:----------------------------|
| Q-10   | True                    | True                        | -                       | -                           |
| Q-4    | False                   | False                       | -                       | -                           |


### `--discrepancy-full` report

In [51]:
question_names_by_models = pd.pivot(
    data, 
    index='name', 
    columns=['model_name', 'run_id'], 
    values=['grade_bool', 'completion']
)
question_names_by_models.head(3)

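| name   | ('grade_bool', 'cpl-rag-1', '41cb')   | ('grade_bool', 'gpt-3.5-turbo', '4543')   | ('grade_bool', 'cpl-rag-1', 'd032')   | ('grade_bool', 'gpt-3.5-turbo', '2711')   | ('completion', 'cpl-rag-1', '41cb')   | ('completion', 'gpt-3.5-turbo', '4543')           | ('completion', 'cpl-rag-1', 'd032')   | ('completion', 'gpt-3.5-turbo', '2711')           |
|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:--------------------------------------------------|:--------------------------------------|:--------------------------------------------------|
| Q-10   | True                                  | True                                      | NaN                                   | NaN                                       | The Afghan Whigs                      | The Afghan Whigs have more recently reformed. ... | NaN                                   | NaN                                               |
| Q-12   | NaN                                   | NaN                                       | False                                 | True                                      | NaN                                   | NaN                                               | oldest                                | The 72nd Field Brigade is part of the oldest e... |
| Q-14   | NaN                                   | NaN                                       | False                                 | True                                      | NaN                                   | NaN                                               | Wang Xiaoshuai                        | Wang Xiaoshuai is younger than Del Lord. Wang ... |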


In [59]:
from lime.modules.views.agg.query import (
    grade_discrepancy_by_runid,
)

In [60]:
base_path = '../../datasets/hello-qa/hotpot1/_aggs/'
exp = '*train-ten-1*'
!ls {base_path + exp}
data = build_data_wrapper(base_path + exp)

../../datasets/hello-qa/hotpot1/_aggs/output-train-ten-1-cpl-basic-1-dee2.json
../../datasets/hello-qa/hotpot1/_aggs/output-train-ten-1-cpl-rag-1-d9cf.json
../../datasets/hello-qa/hotpot1/_aggs/output-train-ten-1-gpt-3.5-turbo-580c.json


In [61]:
# drop the third model (cpl-basic-1) so only two runs are compared
data = data[data['model_name'] != 'cpl-basic-1']

In [63]:
# basic --discrepancy report
grade_discrepancy_by_runid(data)

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   |
|:-------|:--------------------------------------|:------------------------------------------|
| Q-10   | True                                  | False                                     |
| Q-4    | True                                  | False                                     |


In [67]:
question_names_by_models = pd.pivot(
    data, 
    index='name', 
    columns=['model_name', 'run_id'], 
    values=['grade_bool', 'completion']
)
print(question_names_by_models.head(3).to_markdown())

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   | ('completion', 'cpl-rag-1', 'd9cf')   | ('completion', 'gpt-3.5-turbo', '580c')                                                                                             |
|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|
| Q-1    | False                                 | False                                     | Townes Van Zandt                      | At My Window was released by American singer-songwriter Townes Van Zandt.                                                           |
| Q-10   | True                                  | False                                     | Operation Citadel                     | The code name for the German offensive that started this S

In [70]:
# augmented --discrepancy report
tmp = grade_discrepancy_by_runid(data, add_values=['completion'])
tmp
# print(tmp.head(3).to_markdown())

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   | ('completion', 'cpl-rag-1', 'd9cf')   | ('completion', 'gpt-3.5-turbo', '580c')           |
|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:--------------------------------------------------|
| Q-10   | True                                  | False                                     | Operation Citadel                     | The code name for the German offensive that st... |
| Q-4    | True                                  | False                                     | 1950                                  | The author of The Victorians - Their Story In ... |


In [71]:
from lime.agg import (
    do_discrepancies
)

In [73]:
tmp = do_discrepancies(data, is_full=False)
print(tmp)

### Model/RunIDs: rows where grade_bool has discrepancy 

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   |
|:-------|:--------------------------------------|:------------------------------------------|
| Q-10   | True                                  | False                                     |
| Q-4    | True                                  | False                                     |




In [74]:
tmp = do_discrepancies(data, is_full=True)
print(tmp)

### Model/RunIDs: rows where grade_bool has discrepancy 

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   | ('completion', 'cpl-rag-1', 'd9cf')   | ('completion', 'gpt-3.5-turbo', '580c')   |
|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
| Q-10   | True                                  | False                                     | Operation Citadel                     | The code name for the German o...         |
| Q-4    | True                                  | False                                     | 1950                                  | The author of The Victorians -...         |




### Fix format_multi_index

In [75]:
# import fmt_text_field
from lime.modules.views.agg.utils import fmt_text_field

In [77]:
is_full = True
if is_full:
    # shorten long completions so the markdown table stays readable
    data = fmt_text_field(
        data,
        'completion',
        max_chars=30,
    )

# include the completion text only in the full report
add_values = ['completion'] if is_full else []

output = '''### Model/RunIDs: rows where grade_bool has discrepancy \n\n'''
output += format_multi_index(
    grade_discrepancy_by_runid(data, add_values=add_values)
).to_markdown(index=False)

In [79]:
print(output)

### Model/RunIDs: rows where grade_bool has discrepancy 

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   | ('completion', 'cpl-rag-1', 'd9cf')   | ('completion', 'gpt-3.5-turbo', '580c')   |
|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
| Q-10   | True                                  | False                                     | Operation Citadel                     | The code name for the German o...         |
| Q-4    | True                                  | False                                     | 1950                                  | The author of The Victorians -...         |
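
The truncation step can be approximated without the `lime` helper. The sketch below is a hypothetical stand-in, not the `fmt_text_field` implementation: `truncate_text` and its `suffix` parameter are invented for illustration, and the behavior (cut to `max_chars` characters, then append `...`) is inferred only from the output above.

```python
import pandas as pd

def truncate_text(frame, column, max_chars=30, suffix="..."):
    """Hypothetical stand-in: shorten long strings in one column of a copy."""
    out = frame.copy()
    out[column] = out[column].map(
        # Leave short strings and non-string values (e.g. missing data) untouched.
        lambda v: v[:max_chars] + suffix
        if isinstance(v, str) and len(v) > max_chars
        else v
    )
    return out

# Illustrative data only; not taken from the datasets in this notebook.
toy = pd.DataFrame({"completion": ["short answer", "x" * 45, None]})
short = truncate_text(toy, "completion", max_chars=30)
print(short["completion"].tolist())
```

Working on a copy keeps the original frame intact, so the same data can feed both the short and the full report.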


In [94]:
df = grade_discrepancy_by_runid(data, add_values=add_values)

In [89]:
df.shape

(2, 4)

In [90]:
print(df.to_markdown())

| name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   | ('completion', 'cpl-rag-1', 'd9cf')   | ('completion', 'gpt-3.5-turbo', '580c')   |
|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
| Q-10   | True                                  | False                                     | Operation Citadel                     | The code name for the German o...         |
| Q-4    | True                                  | False                                     | 1950                                  | The author of The Victorians -...         |


In [84]:
x = None
if len(df.columns) == 0:
    x = pd.DataFrame(df.index.to_list(), columns=df.index.names)
else:
    left = pd.DataFrame(df.index.to_list(), columns=df.index.names)
    right = df.reset_index(drop=True)
    x = pd.concat([left, right], axis=1)
x

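|    | name   | ('grade_bool', 'cpl-rag-1', 'd9cf')   | ('grade_bool', 'gpt-3.5-turbo', '580c')   | ('completion', 'cpl-rag-1', 'd9cf')   | ('completion', 'gpt-3.5-turbo', '580c')   |
|---:|:-------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
|  0 | Q-10   | True                                  | False                                     | Operation Citadel                     | The code name for the German o...         |
|  1 | Q-4    | True                                  | False                                     | 1950                                  | The author of The Victorians -...         |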


In [91]:
# the answer was simple
df.columns = ['\n'.join(col) for col in df.columns]

In [93]:
print(df.to_markdown())

| name   | grade_bool   | grade_bool      | completion        | completion                        |
|        | cpl-rag-1    | gpt-3.5-turbo   | cpl-rag-1         | gpt-3.5-turbo                     |
|        | d9cf         | 580c            | d9cf              | 580c                              |
|:-------|:-------------|:----------------|:------------------|:----------------------------------|
| Q-10   | True         | False           | Operation Citadel | The code name for the German o... |
| Q-4    | True         | False           | 1950              | The author of The Victorians -... |
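
The column-flattening trick above generalizes to any frame with `MultiIndex` columns. A minimal sketch, assuming a toy frame with made-up model and run names; `flatten_columns` is a hypothetical helper written for this note, not part of `lime`:

```python
import pandas as pd

# Hypothetical frame with three column levels, mirroring the shape used above.
columns = pd.MultiIndex.from_tuples(
    [("grade_bool", "model-a", "r1"), ("grade_bool", "model-b", "r2")],
    names=[None, "model_name", "run_id"],
)
toy = pd.DataFrame(
    [[True, False]],
    index=pd.Index(["Q-1"], name="name"),
    columns=columns,
)

def flatten_columns(frame: pd.DataFrame, sep: str = "\n") -> pd.DataFrame:
    """Return a copy whose MultiIndex column tuples are joined into single strings."""
    flat = frame.copy()
    # str() guards against non-string level values such as integers.
    flat.columns = [sep.join(str(level) for level in col) for col in frame.columns]
    return flat

flat = flatten_columns(toy, sep=" / ")
print(flat.columns.tolist())
# ['grade_bool / model-a / r1', 'grade_bool / model-b / r2']
```

Returning a copy, rather than assigning to `df.columns` in place, leaves the original `MultiIndex` available for later selection by level.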


In [86]:
df.columns

MultiIndex([('grade_bool',     'cpl-rag-1', 'd9cf'),
            ('grade_bool', 'gpt-3.5-turbo', '580c'),
            ('completion',     'cpl-rag-1', 'd9cf'),
            ('completion', 'gpt-3.5-turbo', '580c')],
           names=[None, 'model_name', 'run_id'])

In [107]:
# okay this is good
s = do_discrepancies(data, is_full=True)
print(s)

### Model/RunIDs: rows where grade_bool has discrepancy 

| grade_bool   | grade_bool      | completion        | completion                        |
| cpl-rag-1    | gpt-3.5-turbo   | cpl-rag-1         | gpt-3.5-turbo                     |
| d9cf         | 580c            | d9cf              | 580c                              |
|:-------------|:----------------|:------------------|:----------------------------------|
| True         | False           | Operation Citadel | The code name for the German o... |
| True         | False           | 1950              | The author of The Victorians -... |




In [105]:
# import question_by_runid_completion
from lime.modules.views.agg.query import question_by_runid_completion
df = question_by_runid_completion(data, add_index_cols=[])

In [98]:
print(s)

                                    completion
name run_id                                   
Q-1  580c    At My Window was released by A...
     d9cf                     Townes Van Zandt
Q-10 580c    The code name for the German o...
     d9cf                    Operation Citadel
Q-2  580c    Candace Kita guest starred wit...
     d9cf                            Nora Dunn
Q-3  580c    Self was most recently publish...
     d9cf                                 Self
Q-4  580c    The author of The Victorians -...
     d9cf                                 1950
Q-5  580c    Tae Kwon Do Times has publishe...
     d9cf                    Tae Kwon Do Times
Q-6  580c    The club that played Mancheste...
     d9cf                                 1874
Q-7  580c    The Bank of America Tower is t...
     d9cf                Empire State Building
Q-8  580c                       Rosario Dawson
     d9cf                       Rosario Dawson
Q-9  580c    Tombstone starred actor Kurt R...
     d9cf    

In [87]:
x.columns

Index([                                 'name',
           ('grade_bool', 'cpl-rag-1', 'd9cf'),
       ('grade_bool', 'gpt-3.5-turbo', '580c'),
           ('completion', 'cpl-rag-1', 'd9cf'),
       ('completion', 'gpt-3.5-turbo', '580c')],
      dtype='object')