# Dataset Processing

In this notebook we:

1. Load the JSON files from the `json_entity` folder into one place.
2. Clean and format the data so it can be uploaded to Argilla.
3. Upload the dataset to Argilla for curation.
4. Once the dataset has been curated, download it and upload it to the Hugging Face Hub.

## Data loading

Note: remember to install the requirements from the `requirements.txt` file.

In [1]:
import json
import pandas as pd
import os
import argilla as rg
import re
import spacy  


In [2]:
# json_folder = "../../data/NER/json_entity/json_files_alberto/"
json_folder = "../../data/NER/json_entity/"
json_files = [
    file_name for file_name in os.listdir(json_folder) if file_name.endswith(".json")
]
json_files


['output_4.json',
 'output_6.json',
 'output_7.json',
 'output_8.json',
 'output_3.json',
 'output_10.json',
 'output_9.json',
 'output_2.json',
 'output_1.json',
 'output_11.json',
 'output_0.json',
 'output_5.json']

Load all the JSON files into a single Python list.

In [3]:
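data = []
for fname in json_files:
    # read as UTF-8 so accented Spanish characters load the same on every platform
    with open(os.path.join(json_folder, fname), "r", encoding="utf-8") as f:
        data.extend(json.load(f))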


In [4]:
# the unique entity classes found in the dataset
unique_classes = set()
for record in data:
    for entity in record["entities"]:
        unique_classes.add(entity["class"])
unique_classes = list(unique_classes)

unique_classes


['books',
 'none',
 'topics',
 'products',
 'None',
 'places',
 'organizations',
 'people',
 'films',
 'topic',
 'artista',
 'songs',
 'animals',
 'objects',
 'dates']
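
The list above shows which classes exist but not how often each one appears. A minimal sketch of a frequency count, assuming records with the same `entities` structure as the loaded JSON; `sample_data` and `count_classes` are hypothetical names used only for illustration:

```python
from collections import Counter

# Hypothetical records that mimic the structure of the loaded JSON files.
sample_data = [
    {"text": "Ana vive en Madrid.",
     "entities": [{"class": "people", "text": "Ana"},
                  {"class": "places", "text": "Madrid"}]},
    {"text": "Sí, hablamos de arte.",
     "entities": [{"class": "None", "text": "Sí"},
                  {"class": "topics", "text": "arte"}]},
    {"text": "Luis habla de guerra.",
     "entities": [{"class": "people", "text": "Luis"},
                  {"class": "topic", "text": "guerra"}]},
]


def count_classes(records):
    """Count how many entities carry each class label."""
    return Counter(
        entity["class"] for record in records for entity in record["entities"]
    )


class_counts = count_classes(sample_data)
for label, count in class_counts.most_common():
    print(f"{label}: {count}")
```

Calling `count_classes(data)` on the real list would show at a glance how many entities fall under near-duplicate labels such as `topics` and `topic`.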

In [5]:
# it is reasonable to discard the entities with class None/none
for record in data:
    for entity in record["entities"]:
        if entity["class"] == "None" or entity["class"] == "none":
            print(entity["text"])


Sí
sí
todo
magia
ya
te
digo
Y así empieza la relación de amistad.
Ay
todo
el
rato
era
un
coñazo
Pero
como
si
fuera
así
una
cosa
de
plastilina
.
¿Por qué han desaparecido las tartas?
No
es que ya
es que ya
que comes fatal
Bueno
pero
esto
ya
es
como
muy
Esto
da
igual
sí
.
No
pero
que
esto
que
has
dicho
me
parece
súper
interesante
Y te traen los restos así con pelos.
Y eso es como lo de la comida construida, que esto ya...
Entonces
quiero
mi
teángulo
de
postre
estructurado
.
De posca también, estructurado.
No habremos llevado la construcción un poco demasiado lejos.
Por ejemplo, jamás he cocinado una quiche.
Eso es exacto, perfecto, distadura.
Qué vergüenza de que te servirá eso.
¿Qué quiere decir la palabra plata?
Y mientras viste de camuflaje, ya.
Bueno
un
momento
aquí
un
inciso
.
O quiero que comen las
Ya estamos
Como yo he dicho
yo trabo un momento muy dulce
aunque
¿Qué es lo que te iba a decir?
Pero
cuando
está
bien
una
vez
se
hace
como
que
no
pone
el
piota
automático,
pero
disfrutar

In [6]:
# also topics and topic seem to be the same category
for record in data:
    for entity in record["entities"]:
        if entity["class"] == "topics" or entity["class"] == "topic":
            print(entity)


{'class': 'topics', 'text': 'monstruo'}
{'class': 'topics', 'text': 'aniquilador de Almax'}
{'class': 'topics', 'text': 'arte'}
{'class': 'topics', 'text': 'ocupación alemana en Francia'}
{'class': 'topics', 'text': 'guerra'}
{'class': 'topics', 'text': 'pajaros exóticos'}
{'class': 'topics', 'text': 'jaulas de mimbre'}
{'class': 'topics', 'text': 'esculturas'}
{'class': 'topics', 'text': 'sofá luis 13'}
{'class': 'topics', 'text': 'mandolinas'}
{'class': 'topics', 'text': 'guitarras'}
{'class': 'topics', 'text': 'cuadros cubistas'}
{'class': 'topics', 'text': 'puedo pensar'}
{'class': 'topics', 'text': 'fascista'}
{'class': 'topics', 'text': 'estructura'}
{'class': 'topics', 'text': 'comida'}
{'class': 'topics', 'text': 'pensamiento'}
{'class': 'topics', 'text': 'columne'}
{'class': 'topics', 'text': 'columna'}
{'class': 'topics', 'text': 'pensamiento recurrente'}
{'class': 'topics', 'text': 'mañana laborable'}
{'class': 'topics', 'text': 'tontería'}
{'class': 'topics', 'text': 'destr

## Cleaning and Upload to Argilla server

In [7]:
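def find_position(text, subtext):
    """Finds the start and end position of `subtext` in `text` (case-insensitive)."""
    # escape the entity text so characters such as "?" or "." are matched literally
    r = re.search(re.escape(subtext.lower()), text.lower())
    if r is not None:
        return r.span()
    return None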


def check_multilabel(label_limits, start, end):
    """Checks if (start, end) lies inside, or fully contains, another label interval."""
    for _, start2, end2 in label_limits:
        if (start >= start2 and end <= end2) or (start <= start2 and end >= end2):
            return True
    return False
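
Entity texts in this dataset include punctuation such as `?` and `.`, which act as operators when a string is used as a regular-expression pattern. A minimal sketch of the difference between searching with the raw text and with `re.escape`; both helper names and the example sentence are hypothetical:

```python
import re


def find_span_naive(text, subtext):
    """Use the entity text directly as a regex pattern (hypothetical helper)."""
    match = re.search(subtext.lower(), text.lower())
    return match.span() if match else None


def find_span_escaped(text, subtext):
    """Escape the entity text first so it is matched literally (hypothetical helper)."""
    match = re.search(re.escape(subtext.lower()), text.lower())
    return match.span() if match else None


sentence = "Dijo: ¿Qué? Nada."
entity = "¿Qué?"

naive = find_span_naive(sentence, entity)
escaped = find_span_escaped(sentence, entity)

print(sentence[naive[0]:naive[1]])      # the trailing "?" is lost
print(sentence[escaped[0]:escaped[1]])  # the full entity text
```

In the naive version the final `?` makes the preceding character optional, so the returned span is one character too short.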


In [8]:
def preprocess_record(record):
    """Preprocess record in order to be suitable for Argilla."""
    text = record["text"].strip()
    prediction = []
    metadata = []

    # create list of tuples as (<Entity_name>, <Start_idx>, <End_idx>)
    for entity in record["entities"]:
        entity_class = entity["class"].upper()
        # clean topics class name
        if entity_class == "TOPICS":
            entity_class = "TOPIC"
        # discard entities with None class
        if entity_class == "NONE":
            continue

        entity_text = entity["text"]
        span = find_position(text, entity_text)
        # if there is no match in the text, `span` is None and the entity is only
        # kept in the metadata

        if span is not None:
            start, end = span
            # Argilla doesn't support multilabel tokens, so keep only the first label
            # when one span is nested inside another
            if check_multilabel(prediction, start, end):
                continue

            prediction.append((entity_class, start, end))

        metadata.append((entity_class, entity_text))

    return text, prediction, metadata
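
A quick way to sanity-check the `(label, start, end)` tuples is to map them back to the substrings they cover. The sketch below uses a hypothetical sentence and hand-written spans, not real dataset records; `spans_to_text` and `is_nested` are illustrative helpers. `is_nested` repeats the containment condition used in `check_multilabel`, which shows that partially overlapping spans are not flagged by that test:

```python
# Hypothetical text and (label, start, end) tuples in the shape described above.
text = "Picasso pintó cuadros cubistas en París."
prediction = [("PEOPLE", 0, 7), ("TOPIC", 14, 30), ("PLACES", 34, 39)]


def spans_to_text(text, prediction):
    """Map each (label, start, end) tuple back to the substring it covers."""
    return [(label, text[start:end]) for label, start, end in prediction]


def is_nested(span_a, span_b):
    """True if one (start, end) interval lies fully inside the other."""
    (s1, e1), (s2, e2) = span_a, span_b
    return (s1 >= s2 and e1 <= e2) or (s1 <= s2 and e1 >= e2)


for label, covered in spans_to_text(text, prediction):
    print(label, repr(covered))

print(is_nested((22, 30), (14, 30)))  # "cubistas" inside "cuadros cubistas"
print(is_nested((10, 20), (14, 30)))  # partial overlap only
```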


Here we create a list of `TokenClassificationRecord`s for our dataset.

In [9]:
nlp = spacy.load("es_core_news_sm")

rg_records = []
for record in data:
    text, prediction, metadata = preprocess_record(record)
    doc = nlp(text)
    tokens = [token.text for token in doc]
    try:
        rg_record = rg.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=prediction,
            prediction_agent="ChatGPT",
            metadata={"occurrences": metadata},
        )
    except Exception:
        # fall back to a record without predictions if it cannot be built with them
        rg_record = rg.TokenClassificationRecord(
            text=text, tokens=tokens, metadata={"occurrences": metadata}
        )
    rg_records.append(rg_record)


Initialize connection to the Argilla server.

In [37]:
rg.init(
    api_url="https://davidfm43-argilla-podcasts-ner.hf.space", api_key="team.apikey"
)


In [38]:
rg_dataset_name = "podcasts-ner-v1"
# This line is commented out so you don't upload data to Argilla by accident
# rg.log(rg_records, name=rg_dataset_name)


BulkResponse(dataset='podcasts-ner-v1', processed=519, failed=0)

Now we curate the dataset from the Argilla GUI client.

## Download the cleaned dataset and upload it to the Hugging Face Hub

In [60]:
hf_ds = rg.load(rg_dataset_name, query="status:Validated")
hf_ds = hf_ds.to_datasets()
hf_ds = hf_ds.select_columns(["id", "text", "annotation"])
hf_ds = hf_ds.train_test_split(train_size=0.8, seed=42)
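
The fixed `seed` is what makes the 80/20 split reproducible. As a rough illustration of the idea only, here is a plain-Python sketch with a hypothetical `split_ids` helper; the `datasets` library uses its own shuffling, so the records it assigns to each split will differ from this sketch:

```python
import random


def split_ids(ids, train_size=0.8, seed=42):
    """Shuffle ids with a fixed seed and cut them into train/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


record_ids = list(range(10))  # hypothetical record ids
train_ids, test_ids = split_ids(record_ids)

print(len(train_ids), len(test_ids))
# Running the split again with the same seed gives the same partition.
print(split_ids(record_ids) == (train_ids, test_ids))
```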


In [61]:
# This line is commented out so you don't upload the dataset to the HF Hub by accident
# hf_ds.push_to_hub("hackathon-somos-nlp-2023/podcasts-ner-es")

