
Validation of label column needed for dataset being labeled #178

Closed · rishabh-bhargava opened this issue May 28, 2023 · 4 comments

@rishabh-bhargava
When a user attempts to run agent.plan or agent.run on a dataset, we should first validate that any data columns needed for labeling/evaluation are in the correct format. For example, when labeling an NER dataset, we observed the following error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/refuel-main/lib/python3.8/site-packages/autolabel/labeler │
│ .py:262 in run                                                                                   │
│                                                                                                  │
│   259 │   │   eval_result = None                                                                 │
│   260 │   │   # if true labels are provided, evaluate accuracy of predictions                    │
│   261 │   │   if gt_labels:                                                                      │
│ ❱ 262 │   │   │   eval_result = self.task.eval(llm_labels, gt_labels)                            │
│   263 │   │   │   # TODO: serialize and write to file                                            │
│   264 │   │   │   for m in eval_result:                                                          │
│   265 │   │   │   │   print(f"Metric: {m.name}: {m.value}")                                      │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/refuel-main/lib/python3.8/site-packages/autolabel/tasks/n │
│ amed_entity_recognition.py:229 in eval                                                           │
│                                                                                                  │
│   226 │   │   Returns:                                                                           │
│   227 │   │   │   List[MetricResult]: list of metrics and corresponding values                   │
│   228 │   │   """                                                                                │
│ ❱ 229 │   │   gt_labels = [                                                                      │
│   230 │   │   │   self.add_text_spans(                                                           │
│   231 │   │   │   │   json.loads(gt_labels[index]), llm_labels[index].curr_sample                │
│   232 │   │   │   )                                                                              │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/refuel-main/lib/python3.8/site-packages/autolabel/tasks/n │
│ amed_entity_recognition.py:231 in <listcomp>                                                     │
│                                                                                                  │
│   228 │   │   """                                                                                │
│   229 │   │   gt_labels = [                                                                      │
│   230 │   │   │   self.add_text_spans(                                                           │
│ ❱ 231 │   │   │   │   json.loads(gt_labels[index]), llm_labels[index].curr_sample                │
│   232 │   │   │   )                                                                              │
│   233 │   │   │   for index in range(len(gt_labels))                                             │
│   234 │   │   ]                                                                                  │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/3.8.13/lib/python3.8/json/__init__.py:357 in loads        │
│                                                                                                  │
│   354 │   if (cls is None and object_hook is None and                                            │
│   355 │   │   │   parse_int is None and parse_float is None and                                  │
│   356 │   │   │   parse_constant is None and object_pairs_hook is None and not kw):              │
│ ❱ 357 │   │   return _default_decoder.decode(s)                                                  │
│   358 │   if cls is None:                                                                        │
│   359 │   │   cls = JSONDecoder                                                                  │
│   360 │   if object_hook is not None:                                                            │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/3.8.13/lib/python3.8/json/decoder.py:337 in decode        │
│                                                                                                  │
│   334 │   │   containing a JSON document).                                                       │
│   335 │   │                                                                                      │
│   336 │   │   """                                                                                │
│ ❱ 337 │   │   obj, end = self.raw_decode(s, idx=_w(s, 0).end())                                  │
│   338 │   │   end = _w(s, end).end()                                                             │
│   339 │   │   if end != len(s):                                                                  │
│   340 │   │   │   raise JSONDecodeError("Extra data", s, end)                                    │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/3.8.13/lib/python3.8/json/decoder.py:353 in raw_decode    │
│                                                                                                  │
│   350 │   │                                                                                      │
│   351 │   │   """                                                                                │
│   352 │   │   try:                                                                               │
│ ❱ 353 │   │   │   obj, end = self.scan_once(s, idx)                                              │
│   354 │   │   except StopIteration as err:                                                       │
│   355 │   │   │   raise JSONDecodeError("Expecting value", s, err.value) from None               │
│   356 │   │   return obj, end                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

This happened because the label column in the CSV wasn't valid JSON but a Python dict repr, whose single quotes json.loads cannot parse:
{'Disease': [], 'Chemical': ['Naloxone', 'clonidine']}
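
One way to catch this before labeling starts would be a tolerant pre-flight parser that accepts both JSON and Python-literal syntax. A minimal sketch (parse_label is a hypothetical helper, not part of autolabel):

import ast
import json

def parse_label(raw: str) -> dict:
    """Parse a label cell, accepting JSON or Python-literal syntax."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Falls back for single-quoted Python reprs like the cell above.
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError(f"Label cell is not a JSON object: {raw!r}")
    return parsed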

@nihit commented Jun 27, 2023

@Sardhendu is working on this next

@Sardhendu commented Jun 27, 2023

@rishabh-bhargava Do you have a script handy to reproduce this?

@Sardhendu

@nihit @rishabh-bhargava Another thing I observed is that we read the entire dataset into memory. It would be a good idea to have a data_loader module with an adapter for each data reader; I imagine different data readers would need different format validation.
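
For illustration, such an adapter layer might look like the sketch below; the names (DataReader, CsvReader) are hypothetical, not existing autolabel APIs. Each reader yields rows lazily and owns its own validation:

import csv
from abc import ABC, abstractmethod
from typing import Dict, Iterator

class DataReader(ABC):
    @abstractmethod
    def rows(self) -> Iterator[Dict[str, str]]:
        """Yield rows one at a time instead of loading the full dataset."""

    @abstractmethod
    def validate_row(self, row: Dict[str, str]) -> None:
        """Raise a descriptive error if the row is malformed."""

class CsvReader(DataReader):
    def __init__(self, path: str, label_column: str):
        self.path = path
        self.label_column = label_column

    def rows(self) -> Iterator[Dict[str, str]]:
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                self.validate_row(row)
                yield row

    def validate_row(self, row: Dict[str, str]) -> None:
        if not row.get(self.label_column):
            raise ValueError(f"Row is missing label column {self.label_column!r}")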

@rishabh-bhargava

We will likely have to think about label validation for each task separately. The initial example in this issue was for the NER task, where the label column has to follow a schema like this (a validator sketch follows the example):

{
    "Location": [],
    "Organization": [
        "Kurdistan Workers Party",
        "PKK"
    ],
    "Person": [],
    "Miscellaneous": [
        "Kurdish"
    ]
}
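
A per-task validator for this schema could look like the sketch below (validate_ner_label is hypothetical, not existing autolabel code). It would also turn list-shaped labels, like the ones behind the AttributeError further down, into a clear error message:

import json

def validate_ner_label(raw: str) -> None:
    try:
        label = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Label column is not valid JSON: {e}") from e
    if not isinstance(label, dict):
        raise ValueError(f"Expected a JSON object, got {type(label).__name__}")
    for entity_type, entities in label.items():
        if not isinstance(entities, list) or not all(
            isinstance(e, str) for e in entities
        ):
            raise ValueError(f"Entities for {entity_type!r} must be a list of strings")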

However, when I change the config here and replace the label_column value with IndividualLabels, I get:

In [8]: agent.plan('test.csv')
Generating Prompts... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0/100 0:00:00 -:--:--
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [8], in <cell line: 1>()
----> 1 agent.plan('test.csv')

File ~/dev/refuel/autolabel/src/autolabel/labeler.py:347, in LabelingAgent.plan(self, dataset, max_items, start_index)
    345 else:
    346     examples = []
--> 347 final_prompt = self.task.construct_prompt(input_i, examples)
    348 prompt_list.append(final_prompt)
    350 # Calculate the number of tokens

File ~/dev/refuel/autolabel/src/autolabel/tasks/named_entity_recognition.py:72, in NamedEntityRecognitionTask.construct_prompt(self, input, examples)
     70     eg_copy = deepcopy(eg)
     71     if label_column:
---> 72         eg_copy[label_column] = self._json_to_llm_format(eg_copy[label_column])
     73     fmt_examples.append(example_template.format_map(defaultdict(str, eg_copy)))
     75 # populate the current example in the prompt

File ~/dev/refuel/autolabel/src/autolabel/tasks/named_entity_recognition.py:31, in NamedEntityRecognitionTask._json_to_llm_format(self, input_label)
     29 labels = json.loads(input_label)
     30 rows = []
---> 31 for entity_type, detected_entites in labels.items():
     32     for e in detected_entites:
     33         row = "%".join([e, entity_type])

AttributeError: 'list' object has no attribute 'items'
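
The error suggests the IndividualLabels values parse to a JSON array rather than an object; a minimal repro of the same failure:

import json

# json.loads happily returns a list here; the subsequent .items() call in
# _json_to_llm_format is what raises the AttributeError.
labels = json.loads('["PKK", "Kurdish"]')
labels.items()  # AttributeError: 'list' object has no attribute 'items'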

Are you able to replicate this?

nihit closed this as completed Jul 19, 2023