## Intro

This notebook will generate the benchmark data for the task "Video-Language Misalignment Detection".

The task takes a video and a natural-language narration as input. For each semantic group in the narration, the method should output whether it aligns with the video.
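To make the input and output concrete, here is a minimal sketch of a single benchmark sample. The class name `MisalignmentSample` and its fields are hypothetical and only illustrate the task; they are not used elsewhere in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class MisalignmentSample:
    """Hypothetical container: one clip, one narration, and per-role alignment labels."""
    clip_uid: str
    narration: str
    # semantic role -> True if that part of the narration matches the video
    role_labels: dict = field(default_factory=dict)

    def is_aligned(self) -> bool:
        # the narration aligns with the video only if every role aligns
        return all(self.role_labels.values())


# A video of washing a cucumber paired with the narration "wash carrot":
# the verb matches, the object does not.
sample = MisalignmentSample(
    clip_uid="<video_uid>_<clip_start>_<clip_end>",
    narration="wash carrot",
    role_labels={"V": True, "ARG1": False},
)
print(sample.is_aligned())  # False
```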

Variance
- control alignment granularity with the semantic group granularity

Challenging Cases
- An object is in the video but not in the way described in the narration. E.g., <a video of washing a cucumber while a carrot is nearby, "wash carrot">
- Polysemy. E.g., "nut_(tool)" vs. "nut_(food)". "turn_(off/on,switch)"

Expected Behavior
- Current methods perform poorly on this benchmark, which means the benchmark fills a gap in existing evaluation

Others
- Potential method: a VL model (both egocentric and exocentric) trained with such a fine-grained semantic supervision signal could be pushed to learn faster and become better at distinguishing fine-grained semantic differences. Accordingly, it could improve performance on all VL tasks

## Load egoclip_srl.csv and browse

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import ast
import itertools
import random
from tqdm.notebook import tqdm


In [3]:
# naive output of SRL method w/o post processing
egoclip_srl = pd.read_csv('../dataset/ego4d/annotation/egoclip_srl.csv')



In [None]:
egoclip_srl

In [None]:
egoclip_srl.info()

In [None]:
egoclip_srl.describe()

Most clips only have the roles 'ARG0', 'V', and 'ARG1'. Surprisingly, 'ARG1' has only 80+% non-NaN values. We think the remaining <20% NaN values are mostly false detections.

In [None]:
nan_counts = egoclip_srl.isna().sum()
nan_ratios = nan_counts / len(egoclip_srl)

nan_df = pd.concat([nan_counts, nan_ratios], axis=1)
nan_df.columns = ['nan_counts', 'nan_ratios']
nan_df


Drop columns with too many NaNs. Then drop every row that contains at least one NaN. The resulting dataframe has no rows with NaN in V or ARG1.


In [None]:
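threshold = 0.8 # keep a column only if at least this fraction of its rows are non-NaN
# thresh expects a count of non-NaN values; .copy() gives an independent dataframe so later column assignments do not act on a slice
clean_srl_df = egoclip_srl.dropna(axis=1, thresh=int(len(egoclip_srl) * threshold)).dropna(axis=0).copy()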
clean_srl_df
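The two-step cleaning above can be checked on a toy dataframe. This is a minimal sketch with made-up rows; the point is that `thresh` is the minimum number of non-NaN values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# toy stand-in for egoclip_srl (made-up rows)
toy = pd.DataFrame({
    "V":    ["cut", "wash", "pick", "put", "open"],      # 5/5 non-NaN
    "ARG1": ["carrot", np.nan, "bag", "cup", "door"],    # 4/5 non-NaN
    "ARG2": [np.nan, np.nan, np.nan, "table", np.nan],   # 1/5 non-NaN
})

threshold = 0.8
min_non_nan = int(len(toy) * threshold)  # 4 non-NaN values required per column

kept_columns = toy.dropna(axis=1, thresh=min_non_nan)  # drops the sparse ARG2 column
clean = kept_columns.dropna(axis=0)                    # drops the row with NaN in ARG1

print(list(kept_columns.columns))  # ['V', 'ARG1']
print(len(clean))                  # 4
```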

In [9]:
clean_srl_df['clip_uid'] = clean_srl_df.apply(lambda row: f"{row['video_uid']}_{row['clip_start']}_{row['clip_end']}", axis=1)

In [10]:
clean_srl_df.drop(columns=["index"], inplace=True)

In [11]:
clean_srl_df.reset_index(inplace=True)

In [12]:
clean_srl_df = clean_srl_df[["clip_uid"] + list(clean_srl_df.columns[:-1])]

In [13]:
clean_srl_df["tag_verb"] = clean_srl_df["tag_verb"].apply(ast.literal_eval)
clean_srl_df["tag_noun"] = clean_srl_df["tag_noun"].apply(ast.literal_eval)

In [None]:
clean_srl_df

## Benchmark
For each group, we sample k clips from the same group, k clips with a different V, k clips with a different ARG1, and k clips with both a different V and a different ARG1.

The granularity of a group depends on the taxonomy. For example, "cut cucumber", "dice courgette" and "slice fruit" can all be in one group (divide fruit), or be split into 2 groups ("cut cucumber" and "dice courgette" vs. "slice fruit") or 3 groups.
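As a rough illustration of how the taxonomy controls granularity, the sketch below maps the three example narrations through three hypothetical taxonomies. The group names (`divide`, `gourd`) are made up for this example and are not taken from the Ego4D taxonomy.

```python
narrations = [("cut", "cucumber"), ("dice", "courgette"), ("slice", "fruit")]

# hypothetical taxonomies: word -> group name, from coarse to fine
coarse = {"cut": "divide", "dice": "divide", "slice": "divide",
          "cucumber": "fruit", "courgette": "fruit", "fruit": "fruit"}
medium = {**coarse, "cucumber": "gourd", "courgette": "gourd"}
fine = {word: word for word in coarse}  # every word is its own group


def count_groups(taxonomy):
    """Number of distinct (verb group, noun group) pairs among the narrations."""
    return len({(taxonomy[verb], taxonomy[noun]) for verb, noun in narrations})


print(count_groups(coarse), count_groups(medium), count_groups(fine))  # 1 2 3
```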

> TODO: visualize all narrations so that semantically closer narrations appear closer together. Our job is to quantitatively measure model ability across different clustering granularities.

In [15]:
# verb taxonomy: canonical verb -> verbs
# noun taxonomy: canonical noun -> nouns
# canonical narrations: canonical narration -> index of clips having the same canonical narration, ... different V, different ARG1

verb taxonomy

In [None]:
# load taxonomy
taxonomy_verb_path = '/z/dat/Ego4D/raw/v2/annotations/narration_verb_taxonomy.csv'
taxonomy_verb = pd.read_csv(taxonomy_verb_path)
taxonomy_verb['group'] = taxonomy_verb['group'].apply(ast.literal_eval)
taxonomy_verb


In [None]:
# make canonical_verb (this will be the expression used in the canonical narration later)
taxonomy_verb["canonical_verb"] = taxonomy_verb["label"].str.split(r"_\(").str[0]
display(taxonomy_verb.head())
display(taxonomy_verb.describe())


noun taxonomy


In [None]:
taxonomy_noun_path = "/z/dat/Ego4D/raw/v2/annotations/narration_noun_taxonomy.csv"
taxonomy_noun = pd.read_csv(taxonomy_noun_path)
taxonomy_noun['group'] = taxonomy_noun['group'].apply(ast.literal_eval)
taxonomy_noun


In [None]:
taxonomy_noun["canonical_noun"] = taxonomy_noun["label"].str.split(r"_\(").str[0]
display(taxonomy_noun.head())
display(taxonomy_noun.describe())



There are duplicated canonical_noun values, but that is fine: the clustering/grouping is based on the "group" column, which has a different value for each row. The canonical_noun serves as the word used in the narration, so it is better for it to be natural than distinguishable.

In [None]:
# taxonomy_noun["canonical_noun"].value_counts()
taxonomy_noun[taxonomy_noun["canonical_noun"].isin(["nut", "bat", "pot", "chip"])]
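The sketch below shows in plain Python why duplicated canonical nouns are harmless: the row position, not the canonical word, identifies a group. The label list is a small made-up sample in the same `word_(...)` style as the taxonomy labels.

```python
# made-up sample of taxonomy labels; the list position plays the role of the row index
labels = ["nut_(food)", "nut_(tool)", "bowl", "hand_(finger,_hand,_palm,_thumb)"]

# the canonical word is everything before the first "_("
canonical = [label.split("_(")[0] for label in labels]
print(canonical)  # ['nut', 'nut', 'bowl', 'hand']

# several rows may share a canonical word, but each row keeps its own index
rows_by_word = {}
for row_index, word in enumerate(canonical):
    rows_by_word.setdefault(word, []).append(row_index)

print(rows_by_word["nut"])  # [0, 1]
```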

Make the table of canonical narrations.
- assign V and ARG1 groups to each clip row
- extract all valid canonical narrations (V + ARG1) from all clips. At the same time, record which clips belong to which canonical narration row

In [None]:
# Get the cross product of canonical verbs and canonical nouns
canonical_verbs = taxonomy_verb["canonical_verb"]
canonical_nouns = taxonomy_noun["canonical_noun"]
cross_product = list(itertools.product(canonical_verbs, canonical_nouns))

# Create a dataframe from the cross product
canonical_narrations_df = pd.DataFrame(cross_product, columns=["canonical_verb", "canonical_noun"])

# Add columns representing the index of the verb and noun in the taxonomy dataframes
canonical_narrations_df["verb_index"] = [elem for elem in range(len(canonical_verbs)) for _ in range(len(canonical_nouns))]
canonical_narrations_df["noun_index"] = list(range(len(canonical_nouns))) * len(canonical_verbs)

# Display the dataframe
canonical_narrations_df


In [22]:
# verify that the indices of canonical verbs and nouns are correct
canonical_narrations_df['og_canonical_verb'] = canonical_narrations_df['verb_index'].apply(lambda x: taxonomy_verb.loc[x, 'canonical_verb'])
sum(canonical_narrations_df["canonical_verb"] != canonical_narrations_df["og_canonical_verb"]) # should be 0, i.e., the value from the cross product and the value accessed by index are the same

0

Now we have the semantic roles labeled for each clip narration. Next, we assign each role (e.g., V or ARG1) to the semantic groups it belongs to, based on the taxonomy we defined. (This is where we control the distinguishing granularity.)

Semantic roles can be expressed differently even when they share the same meaning. We leverage an LLM, which is good at capturing the essential meaning beneath varied expressions. Concretely, for each semantic role expression, e.g., "picks", we provide groups with their predefined expressions, e.g. {'93': ['collect', 'collects', 'fetch', 'gather', 'gathers', 'get', 'grab', 'lift', 'pick', 'pick out', 'picks', 'select', 'take', 'take back', 'take down', 'take out', 'takes', 'takes down'], '66': ['add', 'adds', 'bottom', 'drop', 'drops', 'heap', 'heaps', 'keep', 'leave', 'let go', 'pile', 'place', 'places', 'put', 'put aside', 'put back', 'release', 'return', 'returns', 'set', 'soak', 'soaks']}. Then we let the LLM decide which group "picks" belongs to.

Nouns (ARG1) are a bit trickier than verbs, since their expressions seem to vary more and one expression can belong to multiple groups. For example, "the bottle of spice" could belong to the group of "bottle" and the group of "spice" at the same time. ({'280': ['blender cup', 'bottle', 'bottle lid', 'bottle top', 'cap', 'carburator bowl', 'carburetor bowl', 'clip art lid', 'cooking pot cover', 'cover', 'cover lid', 'cover pot', 'gas tank cover', 'gear stick', 'jug lid', 'lid', 'oil filler cap', 'pen lid', 'pot cover', 'tea urn lid', 'watercolor tube top'], '473': ['spice'], '364': ['herself', 'himself', 'lady', 'man', 'person', 'shoulder', 'they', 'woman']}). But a good LLM seems to handle this situation well. Limited experiments show that `gpt-4-1106-preview` is good enough while `gpt-3.5-turbo-1106` is not.

_gpt-4-1106-preview_
string='picks', groups={'93': ['collect', 'collects', 'fetch', 'gather', 'gathers', 'get', 'grab', 'lift', 'pick', 'pick out', 'picks', 'select', 'take', 'take back', 'take down', 'take out', 'takes', 'takes down']}
resp={'93': True}
string='a bag of clothes', groups={'192': ['compound', 'floor', 'ground', 'land', 'surface', 'wood floor', 'wooden floor', 'wooden floor board'], '115': ['black cloth', 'cloth', 'clothe', 'clothing material', 'cotton', 'cotton bud', 'cotton yarn', 'crochet pattern', 'cuff of shirt', 'cutting model', 'dashboard', 'fabric', 'garment', 'kanga', 'piece of cloth', 'pillow case', 'pillow cover', 'pillow garment', 'rag', 't shirt sleeve', 'table cloth', 'table cover', 'white cloth'], '12': ['bag', 'bag of manure', 'grocery', 'litter bag', 'nylon', 'nylon bag', 'nylon storage pack', 'paper bag', 'paper cover', 'paper packet', 'paperbag', 'plastic', 'plastic bag', 'polythene', 'polythene bag', 'pouch', 'sachet', 'sachet of #unsure', 'sachet of meat', 'sachet of wipe', 'sachet piece', 'sack', 'satchet', 'shopping bag', 'suitcase']}
resp={'192': False, '115': True, '12': True}

_gpt-3.5-turbo-1106_ (struggles with the output format)
string='picks', groups={'93': ['collect', 'collects', 'fetch', 'gather', 'gathers', 'get', 'grab', 'lift', 'pick', 'pick out', 'picks', 'select', 'take', 'take back', 'take down', 'take out', 'takes', 'takes down']}
resp={'groups': {'93': True}}
e=AssertionError() retry
resp={'1': True}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False}
e=AssertionError() retry
resp={'groups': {'93': True}}
e=AssertionError() retry
resp={'1': True, '2': True, '3': True}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False, '4': False, '5': False, '6': False, '7': False, '8': False, '9': False, '10': False, '11': False, '12': False, '13': False, '14': False, '15': False, '16': False}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False}
e=AssertionError() retry
resp={'1': True, '2': True, '3': True}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False}
e=AssertionError() retry
resp={'1': True, '2': False, '3': True}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False}
e=AssertionError() retry
resp={'1': True, '2': False, '3': False}
e=AssertionError() retry
resp={'groups': {'93': True, '2': False, '3': False}}
e=AssertionError() retry

_gpt-4_
e=BadRequestError('Error code: 400 - {\'error\': {\'message\': "Invalid parameter: \'response_format\' of type \'json_object\' is not supported with this model.", \'type\': \'invalid_request_error\', \'param\': \'response_format\', \'code\': None}}') retry
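Most failures in the logs above are format failures: keys that are not candidate group numbers, or a nested `groups` object. Below is a minimal sketch of a stricter validation step that needs no API call. The helper name `parse_group_response` is hypothetical, and it is stricter than the check in this notebook because it rejects any unknown key, not only unknown keys whose value is true.

```python
import json


def parse_group_response(raw, groups):
    """Validate an LLM reply (hypothetical helper) and return the selected group numbers.

    raw:    JSON text such as '{"93": true}'
    groups: dict mapping group-number strings to lists of expressions
    """
    resp = json.loads(raw)
    if not isinstance(resp, dict):
        raise ValueError("response is not a JSON object")
    unknown = set(resp) - set(groups)
    if unknown:
        raise ValueError(f"keys not in candidate groups: {sorted(unknown)}")
    if not all(isinstance(value, bool) for value in resp.values()):
        raise ValueError("every value must be true or false")
    return [int(key) for key, value in resp.items() if value]


groups = {"93": ["pick", "picks", "take"], "66": ["put", "place"]}
print(parse_group_response('{"93": true, "66": false}', groups))  # [93]

for bad_reply in ['{"groups": {"93": true}}', '{"1": true}']:
    try:
        parse_group_response(bad_reply, groups)
    except ValueError as err:
        print(f"rejected: {err}")
```

A rejected reply can then be retried, ideally with a bounded number of attempts.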


Another challenge is the number of groups. When there are too many groups (categories) to classify into, it can be expensive in both money and time. Luckily, some annotations come with a small set of candidate groups, which covers all the groups that the semantic roles in a narration could belong to. For example, the narration "C picks a bag of clothes from the floor" has the annotated verb group {'93': ['collect', 'collects', 'fetch', 'gather', 'gathers', 'get', 'grab', 'lift', 'pick', 'pick out', 'picks', 'select', 'take', 'take back', 'take down', 'take out', 'takes', 'takes down']}, and the noun groups {'192': ['compound', 'floor', 'ground', 'land', 'surface', 'wood floor', 'wooden floor', 'wooden floor board'], '115': ['black cloth', 'cloth', 'clothe', 'clothing material', 'cotton', 'cotton bud', 'cotton yarn', 'crochet pattern', 'cuff of shirt', 'cutting model', 'dashboard', 'fabric', 'garment', 'kanga', 'piece of cloth', 'pillow case', 'pillow cover', 'pillow garment', 'rag', 't shirt sleeve', 'table cloth', 'table cover', 'white cloth'], '12': ['bag', 'bag of manure', 'grocery', 'litter bag', 'nylon', 'nylon bag', 'nylon storage pack', 'paper bag', 'paper cover', 'paper packet', 'paperbag', 'plastic', 'plastic bag', 'polythene', 'polythene bag', 'pouch', 'sachet', 'sachet of #unsure', 'sachet of meat', 'sachet of wipe', 'sachet piece', 'sack', 'satchet', 'shopping bag', 'suitcase']}, which cover the verb, ARG1 (groups 115 and 12), and ARG2 (group 192).

For future situations where we have to deal with a large number (hundreds) of potential groups, we will discuss the approach later.

In [23]:
import json
from openai import OpenAI
def string_classify(string, groups, printing=False):
    '''
    args:
        string: string
        groups: dictionary. {"group_number": [string1, string2, ...]}
    '''
    
    if printing:
        print(f"{string=}, {groups=}")
        
    # no need to use LLM
    if len(groups) == 1:
        if printing:
            print(f"only one group {groups.keys()}")
        return [int(str_gn) for str_gn in groups.keys()]
    
    # has to use LLM
    client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable; never hard-code API keys
    while True:    
        try:
            response = client.chat.completions.create(
            model="gpt-4-1106-preview",
            response_format={ "type": "json_object" },
            messages=[
                {"role": "system", "content": "You are an expert in classifying a description of objects into groups of objects. Please think step by step and be sure to choose the correct groups for the given description. Your output should be in the following JSON format, where each key is a group number and each value is a bool indicating whether the string is in that group. E.g., {\"1\": true, \"2\": false, \"3\": true} means the description is in group 1 and group 3 but not group 2."},
                {"role": "user", "content": f"The description is {string}. The groups are {groups}"}
            ]
            )
            
            resp = json.loads(response.choices[0].message.content)
            if printing:
                print(f"{resp=}")
            
            group_list = []
            for k, v in resp.items():
                if v:
                    assert k in groups.keys(), "Hallucination. Got a key not in groups"
                    group_list.append(int(k))
            break
        except Exception as e:
            print(f"{e=} retry")
            continue
    return group_list
    


def determine_group(clip_row, taxonomy_noun, taxonomy_verb, printing=False):
    tag_verb = clip_row['tag_verb'] # list
    V = clip_row['V'] # string
    # map each candidate verb group number (as a string) to the list of expressions in that group
    groups_verb = {f"{index}": row['group'] for index, row in taxonomy_verb.iloc[tag_verb, :].iterrows()}
    group_no_verb = string_classify(V, groups_verb, printing=printing) # list of ints drawn from tag_verb

    tag_noun = clip_row['tag_noun'] # list
    ARG1 = clip_row['ARG1'] # string
    # map each candidate noun group number (as a string) to the list of expressions in that group
    groups_noun = {f"{index}": row['group'] for index, row in taxonomy_noun.iloc[tag_noun, :].iterrows()}
    
    group_no_noun = string_classify(ARG1, groups_noun, printing=printing) # list of ints drawn from tag_noun

    return group_no_verb, group_no_noun



In [24]:
# test the function handling one row of clip
# determine_group(clean_srl_df.iloc[0], taxonomy_noun, taxonomy_verb)
determine_group(clean_srl_df.iloc[random.randrange(len(clean_srl_df))], taxonomy_noun, taxonomy_verb, printing=True)

string='puts', groups={'66': ['add', 'adds', 'bottom', 'drop', 'drops', 'heap', 'heaps', 'keep', 'leave', 'let go', 'pile', 'place', 'places', 'put', 'put aside', 'put back', 'release', 'return', 'returns', 'set', 'soak', 'soaks']}
only one group dict_keys(['66'])
string='the cement', groups={'90': ['cement', 'clay', 'concrete', 'concrete structure', 'mortar', 'motar']}
only one group dict_keys(['90'])


([66], [90])

> We need a faster method. One clip can cost 1.5 s, so 3M clips would take more than a month. Even for the unique narrations alone (1.2M), it would take about 3 weeks.
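A quick back-of-the-envelope check of these estimates, assuming strictly sequential calls at the quoted 1.5 s per clip (the helper `days_needed` is only for this illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_needed(num_items, seconds_per_item):
    """Wall-clock days for sequential processing at a fixed cost per item."""
    return num_items * seconds_per_item / SECONDS_PER_DAY


print(round(days_needed(3_000_000, 1.5), 1))  # 52.1 days for all clips
print(round(days_needed(1_200_000, 1.5), 1))  # 20.8 days for unique narrations only
```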

In [25]:
# unique narrations
len(clean_srl_df["clip_text"].unique())

1164228
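Because many clips repeat the same role expression with the same candidate groups, one way to reduce the number of LLM calls is to cache results per unique input. The sketch below uses `functools.lru_cache` with a stand-in classifier; `fake_classify` is hypothetical and only counts calls, it is not the notebook's `string_classify`.

```python
from functools import lru_cache

call_count = 0


def fake_classify(string, group_ids):
    """Stand-in for an expensive LLM call (hypothetical): always picks the first group."""
    global call_count
    call_count += 1
    return (group_ids[0],)


@lru_cache(maxsize=None)
def cached_classify(string, group_ids):
    # arguments must be hashable, so candidate groups are passed as a tuple of ids
    return fake_classify(string, group_ids)


requests = [("picks", (93,)), ("the plate", (321, 236, 381)), ("picks", (93,))]
results = [cached_classify(string, group_ids) for string, group_ids in requests]

print(results)     # [(93,), (321,), (93,)]
print(call_count)  # 2, because the repeated request is served from the cache
```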

Grouping 1000 narrations took 63 minutes and cost $4 (see below):

In [26]:
num_grouping = 1000
# from tqdm.notebook import tqdm
# tqdm.pandas(desc="Progress")
# _ = clean_srl_df.iloc[:num_grouping].progress_apply(determine_group, axis=1, args=(taxonomy_noun, taxonomy_verb))
# _.to_csv(f"../dataset/ego4d/annotation/grouping_gpt_output_{num_grouping}.csv", index=False)

In [None]:
naive_grouping_output = pd.read_csv(f"../dataset/ego4d/annotation/grouping_gpt_output_{num_grouping}.csv")
naive_grouping_output

> It may be worth adding a statistic showing how many clips are assigned to multiple groups. Sounds interesting.
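One possible way to compute such a statistic is sketched below on made-up assignments. Each entry mimics one parsed row of the grouping output: a pair of (verb group numbers, noun group numbers).

```python
# made-up assignments in the same shape as one parsed row of the grouping output
toy_assignments = [
    ([66], [90]),
    ([93], [115, 12]),
    ([93], [12]),
    ([11], [381]),
]

multi_verb = sum(len(verb_groups) > 1 for verb_groups, _ in toy_assignments)
multi_noun = sum(len(noun_groups) > 1 for _, noun_groups in toy_assignments)

print(multi_verb, multi_noun)             # 0 1
print(multi_noun / len(toy_assignments))  # 0.25
```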

In [28]:
clip_group_no_df = pd.DataFrame(naive_grouping_output["0"].apply(ast.literal_eval),
                columns=["group_no_verb", "group_no_noun"]) # each row represent one clip row. each column represent one group_no in taxonomy df
clip_group_no_df

Unnamed: 0,group_no_verb,group_no_noun


In [None]:
group_no_verb_list, group_no_noun_list = zip(*naive_grouping_output["0"].apply(ast.literal_eval))
 
clip_group_no_df = pd.DataFrame({"group_no_verb": group_no_verb_list, "group_no_noun": group_no_noun_list}) # each row represent one clip row. each column represent one group_no in taxonomy df
clip_group_no_df

In [None]:
# Initialize the "clip_uid_list" column in canonical_narrations_df as empty lists
canonical_narrations_df['clip_uid_list'] = [[] for _ in range(len(canonical_narrations_df))]
 
# Iterate over each row in clip_group_no_df (row i corresponds to row i of clean_srl_df)
for index, row in clip_group_no_df.iterrows():
    verb_no_list = row['group_no_verb']  # Get the verb numbers (list of int) for the current row
    noun_no_list = row['group_no_noun']  # Get the noun numbers (list of int) for the current row
    
    # Find the matching canonical narrations in canonical_narrations_df
    flag_v = canonical_narrations_df['verb_index'].isin(verb_no_list)
    flag_n = canonical_narrations_df['noun_index'].isin(noun_no_list)
    matched_canonical_narration_df = canonical_narrations_df[flag_v & flag_n]
    
    # Get the clip_uid of the clip row
    clip_uid = clean_srl_df.iloc[index]['clip_uid']
    
    # There could be multiple canonical narrations for one clip; add the clip_uid to all of them.
    # A separate loop variable avoids shadowing the outer `index`; an empty match skips the loop.
    for canonical_index in matched_canonical_narration_df.index:
        canonical_narrations_df.at[canonical_index, 'clip_uid_list'].append(clip_uid)
canonical_narrations_df
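The loop above scans the full verb-by-noun table once per clip, which becomes slow as the number of clips grows. A possible alternative, sketched here on made-up data, is to collect clips in a dictionary keyed by (verb index, noun index) in a single pass. The clip ids and the name `clips_by_pair` are illustrative only.

```python
from collections import defaultdict
from itertools import product

# made-up clips: (clip_uid, verb group numbers, noun group numbers)
toy_clips = [
    ("clip_a", [93], [115, 12]),
    ("clip_b", [66], [90]),
    ("clip_c", [93], [12]),
]

clips_by_pair = defaultdict(list)
for clip_uid, verb_ids, noun_ids in toy_clips:
    # a clip belongs to every (verb, noun) combination of its assigned groups
    for pair in product(verb_ids, noun_ids):
        clips_by_pair[pair].append(clip_uid)

print(clips_by_pair[(93, 12)])  # ['clip_a', 'clip_c']
print(len(clips_by_pair))       # 3 canonical narrations have at least one clip
```

Only pairs that actually occur are stored, so the later clean-up step that drops empty canonical narrations would not be needed with this layout.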

In [None]:
# clean up: only keep the canonical narration rows that have corresponding clips
# .copy() gives an independent dataframe so the new column below is not written to a slice
canonical_narrations_df_clean = canonical_narrations_df[canonical_narrations_df["clip_uid_list"].apply(len) > 0].copy()
# canonical_narrations_df_clean = canonical_narrations_df_clean.reset_index(drop=True)
canonical_narrations_df_clean["canonical_narration_text"] = canonical_narrations_df_clean["canonical_verb"] + " " + canonical_narrations_df_clean["canonical_noun"]
canonical_narrations_df_clean

Make validation table

- copy the canonical narration table, except for the list of clips belonging to each canonical narration
- For each canonical narration, randomly sample K clips from the clips having the same canonical narration, a different V, a different ARG1, or a different V and ARG1
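The four sampling pools can be described by comparing the (verb index, noun index) pair of the target canonical narration with that of another one. The function below is a small sketch of that rule; its name is hypothetical, and the category names match the benchmark columns built later.

```python
def misalign_category(target, other):
    """Compare two (verb_index, noun_index) pairs and name the resulting pool."""
    same_verb = target[0] == other[0]
    same_noun = target[1] == other[1]
    if same_verb and same_noun:
        return "align"
    if same_noun:
        return "misalign_V"     # only the verb differs
    if same_verb:
        return "misalign_ARG1"  # only the noun differs
    return "misalign_V_ARG1"    # both differ


target = (93, 12)  # illustrative indices for one canonical narration
print(misalign_category(target, (93, 12)))  # align
print(misalign_category(target, (66, 12)))  # misalign_V
print(misalign_category(target, (93, 90)))  # misalign_ARG1
print(misalign_category(target, (66, 90)))  # misalign_V_ARG1
```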

In [None]:
# .copy() so that new columns can be added below without a SettingWithCopyWarning
misalighmemnt_detection_benchmark = canonical_narrations_df_clean[["canonical_narration_text", "canonical_verb", "canonical_noun", "verb_index", "noun_index"]].copy()
misalighmemnt_detection_benchmark

In [33]:
row = misalighmemnt_detection_benchmark.iloc[0]
canonical_narrations_df.at[row.name, 'clip_uid_list']

['002d2729-df71-438d-8396-5895b349e8fd_90.41664857606146_91.09331062393854',
 '002d2729-df71-438d-8396-5895b349e8fd_313.7813399307248_314.1196709546634']

In [None]:
misalighmemnt_detection_benchmark

In [35]:
import random

m = 3  # Number of misaligning canonical narrations to draw from
k = 5  # Number of clips to sample for "align"; each misaligning row contributes k//m clips

def safe_sample(lst, n):
    # Return all elements (as a new list) when fewer than n are available
    if n > len(lst):
        return list(lst)
    else:
        return random.sample(lst, n)

misalighmemnt_detection_benchmark['align'] = misalighmemnt_detection_benchmark.apply(lambda row: safe_sample(canonical_narrations_df.at[row.name, 'clip_uid_list'], k), axis=1)

# Similarly, build the misalignment columns. For each row, first find up to m misaligning rows in
# `misalighmemnt_detection_benchmark` (e.g., same `noun_index` but different `verb_index` for "misalign_V").
# Then, for each misaligning row, sample k//m clip_uids from its `clip_uid_list` in
# `canonical_narrations_df`, as done above for the `align` column.

def find_misaligning_rows(row, misalign_roles):
    noun_index = row['noun_index']
    verb_index = row['verb_index']
    
    # Select the candidate rows that differ from this row in the requested roles
    if misalign_roles == ['verb']:
        misaligning_rows_all = misalighmemnt_detection_benchmark[(misalighmemnt_detection_benchmark['noun_index'] == noun_index) & (misalighmemnt_detection_benchmark['verb_index'] != verb_index)]
    elif misalign_roles == ['noun']:
        misaligning_rows_all = misalighmemnt_detection_benchmark[(misalighmemnt_detection_benchmark['noun_index'] != noun_index) & (misalighmemnt_detection_benchmark['verb_index'] == verb_index)]
    elif 'noun' in misalign_roles and 'verb' in misalign_roles:
        misaligning_rows_all = misalighmemnt_detection_benchmark[(misalighmemnt_detection_benchmark['noun_index'] != noun_index) & (misalighmemnt_detection_benchmark['verb_index'] != verb_index)]
    else:
        raise ValueError(f"misalign_roles should be ['verb'], ['noun'], or ['verb', 'noun'] but got {misalign_roles}")

    if len(misaligning_rows_all) > 0:
        misaligning_rows = misaligning_rows_all.sample(min(m, len(misaligning_rows_all)))
    else:
        # If there are no misaligning rows, return an empty list
        return []
    
    # Randomly sample k//m clips from each misaligning row in canonical_narrations_df
    misaligned_clips = []
    for _, misaligning_row in misaligning_rows.iterrows():
        clip_uid_list = canonical_narrations_df.at[misaligning_row.name, 'clip_uid_list']
        # safe_sample avoids a ValueError when a row has fewer than k//m clips
        misaligned_clips.extend(safe_sample(clip_uid_list, k//m))
    
    return misaligned_clips

misalighmemnt_detection_benchmark['misalign_V'] = misalighmemnt_detection_benchmark.apply(find_misaligning_rows, args=(["verb"], ), axis=1)

misalighmemnt_detection_benchmark['misalign_ARG1'] = misalighmemnt_detection_benchmark.apply(find_misaligning_rows, args=(["noun"], ), axis=1)

misalighmemnt_detection_benchmark['misalign_V_ARG1'] = misalighmemnt_detection_benchmark.apply(find_misaligning_rows, args=(["verb", "noun"], ), axis=1)



In [None]:
misalighmemnt_detection_benchmark

In [None]:
# for each row, if any of the alignment columns is empty, drop the row: it has no clips to choose from
invalid_row_mask = list(misalighmemnt_detection_benchmark[["align", "misalign_V", "misalign_ARG1", "misalign_V_ARG1"]].apply(lambda row: any(len(lt) == 0 for lt in row), axis=1))
misalighmemnt_detection_benchmark = misalighmemnt_detection_benchmark.drop(misalighmemnt_detection_benchmark.index[invalid_row_mask])
misalighmemnt_detection_benchmark

In [90]:
# spot-check that, for a row in the benchmark, the clip_uid columns are as expected
for _i in range(10):
    bench_row_n = random.randrange(len(misalighmemnt_detection_benchmark))  # randint would include the out-of-range upper bound
    aligness = random.choice(["align", "misalign_V", "misalign_ARG1", "misalign_V_ARG1"])
    clip_uid = safe_sample(misalighmemnt_detection_benchmark.iloc[bench_row_n][aligness], 1)[0]
    print(f"aligness: \t{aligness}")
    
    print("canonical_narration_text: \t" + misalighmemnt_detection_benchmark.iloc[bench_row_n]["canonical_narration_text"])
    print(f"verb_taxnomoy accessed by verb_index:\t {taxonomy_verb.iloc[misalighmemnt_detection_benchmark.iloc[bench_row_n]['verb_index']]['label']}")
    print(f"noun_taxnomoy accessed by noun_index:\t {taxonomy_noun.iloc[misalighmemnt_detection_benchmark.iloc[bench_row_n]['noun_index']]['label']}")
    print("clip_text: \t" + clean_srl_df[clean_srl_df["clip_uid"] == clip_uid].iloc[0]["clip_text"])
    print(f"=====================")

aligness: 	misalign_V_ARG1
canonical_narration_text: 	hold bowl
verb_taxnomoy accessed by verb_index:	 hold_(support,_grip,_grasp)
noun_taxnomoy accessed by noun_index:	 bowl
clip_text: 	C touches her face with her left hand.
aligness: 	misalign_V
canonical_narration_text: 	move leek
verb_taxnomoy accessed by verb_index:	 move_(transfer,_pass,_exchange)
noun_taxnomoy accessed by noun_index:	 leek
clip_text: 	C takes the nylon of leeks from her left hand with her right hand.
aligness: 	align
canonical_narration_text: 	remove knife
verb_taxnomoy accessed by verb_index:	 remove
noun_taxnomoy accessed by noun_index:	 knife_(knife,_machete)
clip_text: 	C removes sliced vegetables from the knife in her right hand with her left hand.
aligness: 	align
canonical_narration_text: 	pour bowl
verb_taxnomoy accessed by verb_index:	 pour
noun_taxnomoy accessed by noun_index:	 bowl
clip_text: 	C pours the contents of the bowl into a small bucket of flour with her left hand.
aligness: 	misalign_V
canon

In [91]:
# some canonical narrations don't seem to make sense, e.g., "close plate", but all canonical narrations are valid by design (i.e., they have corresponding clips)
clean_srl_df[clean_srl_df["clip_uid"] == misalighmemnt_detection_benchmark[misalighmemnt_detection_benchmark["canonical_narration_text"] == "close plate"].iloc[0]['align'][0]]


Unnamed: 0,clip_uid,index,video_uid,video_dur,narration_source,narration_ind,narration_time,clip_start,clip_end,clip_text,tag_verb,tag_noun,ARG0,V,ARG1
899,002d2729-df71-438d-8396-5895b349e8fd_2489.6262...,1007,002d2729-df71-438d-8396-5895b349e8fd,3571.433333,narration_pass_1,804,2489.9646,2489.626267,2490.302929,C closes the plate with her left hand,[11],"[321, 236, 381]",C,closes,the plate


In [92]:
# "plate" has many meanings; here it is not the plate we eat on but the lid of a pot ("petri dish" in this example)
# This could be an annotation error or simply the variance in language. Kind of interesting.
display(taxonomy_noun.iloc[misalighmemnt_detection_benchmark[misalighmemnt_detection_benchmark["canonical_narration_text"] == "close plate"].iloc[0]["noun_index"]])
display(taxonomy_noun.iloc[321])
display(taxonomy_noun.iloc[236])
taxonomy_noun.iloc[misalighmemnt_detection_benchmark[misalighmemnt_detection_benchmark["canonical_narration_text"] == "close plate"].iloc[0]["noun_index"]]["group"]
 

label                          plate_(dish,_plate,_platter,_saucer)
group             [24 wall plate, 24 well plate, breadcrumb, cer...
canonical_noun                                                plate
Name: 381, dtype: object

label             napkin_(handkerchief,_napkin,_serviette,_tissu...
group             [ash napkin, hand, handkerchief, kitchen tissu...
canonical_noun                                               napkin
Name: 321, dtype: object

label                   hand_(finger,_hand,_palm,_thumb)
group             [finger, hand, left hand, palm, thumb]
canonical_noun                                      hand
Name: 236, dtype: object

['24 wall plate',
 '24 well plate',
 'breadcrumb',
 'ceramic plate',
 'dish',
 'petri dish',
 'plastic dish',
 'plastic plate',
 'plate',
 'plate drainer guide',
 'platter',
 'porcelain',
 'sample plate',
 'saucer']

In [None]:
misalighmemnt_detection_benchmark


In [94]:
misalighmemnt_detection_benchmark.to_csv(f"../dataset/ego4d/annotation/misalignment_detection_benchmark_{num_grouping}.csv", sep='\t', index=False)

In [41]:
clean_srl_df.to_csv(f"../dataset/ego4d/annotation/clean_srl_df.csv", sep='\t', index=False)

In [42]:
taxonomy_verb.to_csv(f"../dataset/ego4d/annotation/taxonomy_verb.csv", sep='\t', index=False)

In [43]:
taxonomy_noun.to_csv(f"../dataset/ego4d/annotation/taxonomy_noun.csv", sep='\t', index=False)