
Validation of label column needed for dataset being labeled #178

Closed · rishabh-bhargava opened this issue May 28, 2023 · 4 comments

@rishabh-bhargava
When a user attempts to run agent.plan or agent.run on a dataset, we should first validate that any data columns needed for labeling/evaluation are in the correct format. For example, when labeling an NER dataset, we observed the following error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/refuel-main/lib/python3.8/site-packages/autolabel/labeler │
│ .py:262 in run                                                                                   │
│                                                                                                  │
│   259 │   │   eval_result = None                                                                 │
│   260 │   │   # if true labels are provided, evaluate accuracy of predictions                    │
│   261 │   │   if gt_labels:                                                                      │
│ ❱ 262 │   │   │   eval_result = self.task.eval(llm_labels, gt_labels)                            │
│   263 │   │   │   # TODO: serialize and write to file                                            │
│   264 │   │   │   for m in eval_result:                                                          │
│   265 │   │   │   │   print(f"Metric: {m.name}: {m.value}")                                      │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/refuel-main/lib/python3.8/site-packages/autolabel/tasks/n │
│ amed_entity_recognition.py:229 in eval                                                           │
│                                                                                                  │
│   226 │   │   Returns:                                                                           │
│   227 │   │   │   List[MetricResult]: list of metrics and corresponding values                   │
│   228 │   │   """                                                                                │
│ ❱ 229 │   │   gt_labels = [                                                                      │
│   230 │   │   │   self.add_text_spans(                                                           │
│   231 │   │   │   │   json.loads(gt_labels[index]), llm_labels[index].curr_sample                │
│   232 │   │   │   )                                                                              │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/refuel-main/lib/python3.8/site-packages/autolabel/tasks/n │
│ amed_entity_recognition.py:231 in <listcomp>                                                     │
│                                                                                                  │
│   228 │   │   """                                                                                │
│   229 │   │   gt_labels = [                                                                      │
│   230 │   │   │   self.add_text_spans(                                                           │
│ ❱ 231 │   │   │   │   json.loads(gt_labels[index]), llm_labels[index].curr_sample                │
│   232 │   │   │   )                                                                              │
│   233 │   │   │   for index in range(len(gt_labels))                                             │
│   234 │   │   ]                                                                                  │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/3.8.13/lib/python3.8/json/__init__.py:357 in loads        │
│                                                                                                  │
│   354 │   if (cls is None and object_hook is None and                                            │
│   355 │   │   │   parse_int is None and parse_float is None and                                  │
│   356 │   │   │   parse_constant is None and object_pairs_hook is None and not kw):              │
│ ❱ 357 │   │   return _default_decoder.decode(s)                                                  │
│   358 │   if cls is None:                                                                        │
│   359 │   │   cls = JSONDecoder                                                                  │
│   360 │   if object_hook is not None:                                                            │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/3.8.13/lib/python3.8/json/decoder.py:337 in decode        │
│                                                                                                  │
│   334 │   │   containing a JSON document).                                                       │
│   335 │   │                                                                                      │
│   336 │   │   """                                                                                │
│ ❱ 337 │   │   obj, end = self.raw_decode(s, idx=_w(s, 0).end())                                  │
│   338 │   │   end = _w(s, end).end()                                                             │
│   339 │   │   if end != len(s):                                                                  │
│   340 │   │   │   raise JSONDecodeError("Extra data", s, end)                                    │
│                                                                                                  │
│ /Users/rishabhbhargava/.pyenv/versions/3.8.13/lib/python3.8/json/decoder.py:353 in raw_decode    │
│                                                                                                  │
│   350 │   │                                                                                      │
│   351 │   │   """                                                                                │
│   352 │   │   try:                                                                               │
│ ❱ 353 │   │   │   obj, end = self.scan_once(s, idx)                                              │
│   354 │   │   except StopIteration as err:                                                       │
│   355 │   │   │   raise JSONDecodeError("Expecting value", s, err.value) from None               │
│   356 │   │   return obj, end                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

This happened because the label column in the CSV wasn't valid JSON but a Python dict repr, whose single quotes json.loads cannot parse:
{'Disease': [], 'Chemical': ['Naloxone', 'clonidine']}
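
One way to catch this before labeling starts would be a tolerant pre-flight parser that accepts both JSON and Python-literal syntax. A minimal sketch (parse_label is a hypothetical helper, not part of autolabel):

import ast
import json

def parse_label(raw: str) -> dict:
    """Parse a label cell, accepting JSON or Python-literal syntax."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Falls back for single-quoted Python reprs like the cell above.
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError(f"Label cell is not a JSON object: {raw!r}")
    return parsed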

@nihit commented Jun 27, 2023

@Sardhendu is working on this next

@Sardhendu commented Jun 27, 2023

@rishabh-bhargava Do you have a script handy to reproduce this?

@Sardhendu

@nihit @rishabh-bhargava Another thing I observed is that we read the entire dataset into memory. It would be a good idea to have a data_loader module with an adapter for each data reader; I imagine different data readers would need different format validation.
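
For illustration, such an adapter layer might look like the sketch below; the names (DataReader, CsvReader) are hypothetical, not existing autolabel APIs. Each reader yields rows lazily and owns its own validation:

import csv
from abc import ABC, abstractmethod
from typing import Dict, Iterator

class DataReader(ABC):
    @abstractmethod
    def rows(self) -> Iterator[Dict[str, str]]:
        """Yield rows one at a time instead of loading the full dataset."""

    @abstractmethod
    def validate_row(self, row: Dict[str, str]) -> None:
        """Raise a descriptive error if the row is malformed."""

class CsvReader(DataReader):
    def __init__(self, path: str, label_column: str):
        self.path = path
        self.label_column = label_column

    def rows(self) -> Iterator[Dict[str, str]]:
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                self.validate_row(row)
                yield row

    def validate_row(self, row: Dict[str, str]) -> None:
        if not row.get(self.label_column):
            raise ValueError(f"Row is missing label column {self.label_column!r}")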

@rishabh-bhargava

We will likely have to think about label validation for each task separately. The initial example in this issue was for the NER task, where the label column has to follow a schema like this (a validator sketch follows the example):

{
    "Location": [],
    "Organization": [
        "Kurdistan Workers Party",
        "PKK"
    ],
    "Person": [],
    "Miscellaneous": [
        "Kurdish"
    ]
}
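
A per-task validator for this schema could look like the sketch below (validate_ner_label is hypothetical, not existing autolabel code). It would also turn list-shaped labels, like the ones behind the AttributeError further down, into a clear error message:

import json

def validate_ner_label(raw: str) -> None:
    try:
        label = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Label column is not valid JSON: {e}") from e
    if not isinstance(label, dict):
        raise ValueError(f"Expected a JSON object, got {type(label).__name__}")
    for entity_type, entities in label.items():
        if not isinstance(entities, list) or not all(
            isinstance(e, str) for e in entities
        ):
            raise ValueError(f"Entities for {entity_type!r} must be a list of strings")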

However, when I change the config here and replace the label_column value with IndividualLabels, I get:

In [8]: agent.plan('test.csv')
Generating Prompts... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0/100 0:00:00 -:--:--
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [8], in <cell line: 1>()
----> 1 agent.plan('test.csv')

File ~/dev/refuel/autolabel/src/autolabel/labeler.py:347, in LabelingAgent.plan(self, dataset, max_items, start_index)
    345 else:
    346     examples = []
--> 347 final_prompt = self.task.construct_prompt(input_i, examples)
    348 prompt_list.append(final_prompt)
    350 # Calculate the number of tokens

File ~/dev/refuel/autolabel/src/autolabel/tasks/named_entity_recognition.py:72, in NamedEntityRecognitionTask.construct_prompt(self, input, examples)
     70     eg_copy = deepcopy(eg)
     71     if label_column:
---> 72         eg_copy[label_column] = self._json_to_llm_format(eg_copy[label_column])
     73     fmt_examples.append(example_template.format_map(defaultdict(str, eg_copy)))
     75 # populate the current example in the prompt

File ~/dev/refuel/autolabel/src/autolabel/tasks/named_entity_recognition.py:31, in NamedEntityRecognitionTask._json_to_llm_format(self, input_label)
     29 labels = json.loads(input_label)
     30 rows = []
---> 31 for entity_type, detected_entites in labels.items():
     32     for e in detected_entites:
     33         row = "%".join([e, entity_type])

AttributeError: 'list' object has no attribute 'items'
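
The error suggests the IndividualLabels values parse to a JSON array rather than an object; a minimal repro of the same failure:

import json

# json.loads happily returns a list here; the subsequent .items() call in
# _json_to_llm_format is what raises the AttributeError.
labels = json.loads('["PKK", "Kurdish"]')
labels.items()  # AttributeError: 'list' object has no attribute 'items'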

Are you able to replicate this?

nihit closed this as completed Jul 19, 2023