# Tutorial: text classification project with Hugging Face (multiclass)

## What are we going to build?

Welcome to this tutorial on a text classification project with Hugging Face. The notebook is designed to be reusable for other use cases.
Specifically, we are going to build a model that takes a sentence in Spanish and determines which school subject it refers to. The label is one of the following:
* Religión (religion)
* Lengua y literatura (language and literature)
* Educación física (physical education)
* Artes (arts)
* Idiomas extranjeros (foreign languages)
* Historia (history)
* Geografía (geography)
* Física y química (physics and chemistry)
* Matemáticas (mathematics)
* Frase no relacionada con asignaturas (sentence unrelated to any school subject)

In [5]:
from dotenv import load_dotenv
import os

load_dotenv()

True

In [6]:
import pprint
from pathlib import Path
import os

import pandas as pd
import numpy as np
import torch

import datasets
import evaluate

from transformers import pipeline
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification


# Upload the dataset to the Hugging Face Hub
DATASET_NAME = "tonicanada/learn_hf_spanish_sentence_classification_by_school_subject"
df = pd.read_excel("./datasets/spanish_sentence_classification_by_school_subject_dataset.xlsx")
dataset = datasets.Dataset.from_pandas(df)
repo_id_dataset = DATASET_NAME
dataset.push_to_hub(repo_id_dataset)


Uploading the dataset shards:   0%|                       | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|████████| 1/1 [00:00<00:00, 408.56ba/s]
Uploading the dataset shards: 100%|███████████████| 1/1 [00:00<00:00,  1.25it/s]
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/tonicanada/learn_hf_spanish_sentence_classification_by_school_subject/commit/daaf3830f889b6ac91ae95714d3464b58c7e3d38', commit_message='Upload dataset', commit_description='', oid='daaf3830f889b6ac91ae95714d3464b58c7e3d38', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/tonicanada/learn_hf_spanish_sentence_classification_by_school_subject', endpoint='https://huggingface.co', repo_type='dataset', repo_id='tonicanada/learn_hf_spanish_sentence_classification_by_school_subject'), pr_revision=None, pr_num=None)

In [7]:
# 1. Define the variables for model training and the pipeline
MODEL_NAME = "distilbert/distilbert-base-multilingual-cased"
MODEL_SAVE_DIR_NAME="models/learn_hf_spanish_sentence_classification_by_school_subject"

In [8]:
# 2. Load and preprocess the dataset from the Hugging Face Hub
print(f"[INFO] Downloading dataset from Hugging Face Hub, name: {DATASET_NAME}")
dataset = datasets.load_dataset(DATASET_NAME)

[INFO] Downloading dataset from Hugging Face Hub, name: tonicanada/learn_hf_spanish_sentence_classification_by_school_subject


In [9]:
# Build a dictionary that maps each numeric ID to its label name
id2label = {idx: label for idx, label in enumerate(dataset['train'].unique('label')[::-1])}
print(id2label)

{0: 'Religión', 1: 'Frase no relacionada con asignaturas', 2: 'Lengua y literatura', 3: 'Educación física', 4: 'Artes', 5: 'Idiomas extranjeros', 6: 'Historia', 7: 'Geografía', 8: 'Física y química', 9: 'Matemáticas'}


In [10]:
label2id = {label: id for id, label in id2label.items()}
label2id

{'Religión': 0,
 'Frase no relacionada con asignaturas': 1,
 'Lengua y literatura': 2,
 'Educación física': 3,
 'Artes': 4,
 'Idiomas extranjeros': 5,
 'Historia': 6,
 'Geografía': 7,
 'Física y química': 8,
 'Matemáticas': 9}
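
The mapping above depends on the order in which `unique('label')` returns the labels, so it may change if the rows of the dataset are reordered. A minimal sketch, assuming we would rather have a mapping that is independent of row order, sorts the distinct label names first. The `labels` list is a hypothetical stand-in for `dataset['train'].unique('label')`, and the `_sorted` names are illustrative only.

```python
# Hypothetical stand-in for dataset['train'].unique('label')
labels = ["Matemáticas", "Historia", "Artes", "Historia", "Religión"]

# Sorting the distinct names gives an order that does not depend on row order
sorted_labels = sorted(set(labels))
id2label_sorted = {idx: label for idx, label in enumerate(sorted_labels)}
label2id_sorted = {label: idx for idx, label in id2label_sorted.items()}

print(id2label_sorted)
print(label2id_sorted)
```

Either mapping works for training, as long as the same `id2label` and `label2id` are passed to the model and used to encode the dataset.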

In [11]:
# Create a function that maps the labels to IDs in the dataset
def map_labels_to_number(example):
    example["label"] = label2id[example["label"]]
    return example

In [12]:
dataset = dataset['train'].map(map_labels_to_number)

In [13]:
# Split the dataset into train/test sets
dataset = dataset.train_test_split(test_size=0.2, seed=42)

In [14]:
df = dataset['train'].to_pandas()
df.head()

Unnamed: 0,text,label
0,La Antártida es el continente más frío,7
1,El área de un círculo se calcula con la fórmul...,9
2,El desierto del Sahara se encuentra en el nort...,7
3,El ensayo es un texto que presenta las ideas d...,2
4,El voleibol se juega en una cancha dividida po...,3


In [15]:
# Load a tokenizer (it is mapped over the dataset below)
print(f"[INFO] Tokenizing text for training model: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=MODEL_NAME,
                                          use_fast=True)

[INFO] Tokenizing text for training model: distilbert/distilbert-base-multilingual-cased


In [16]:
# Try converting text to numbers with the tokenizer
tokenizer("Ciencia")
tokenizer.convert_ids_to_tokens(tokenizer("😆").input_ids)

['[CLS]', '[UNK]', '[SEP]']

In [18]:
# Create a function to tokenize the "text" column of the dataset
def tokenize_text(examples):
    """
    Tokenizes a batch of texts, padding to the longest one and truncating to the model's maximum length.
    """
    return tokenizer(examples['text'],
                    padding=True,
                    truncation=True)

In [19]:
tokenized_dataset = dataset.map(function=tokenize_text,
                                batched=True,
                                batch_size=1000)

Map: 100%|██████████████████████████| 200/200 [00:00<00:00, 23453.49 examples/s]
Map: 100%|████████████████████████████| 50/50 [00:00<00:00, 11630.81 examples/s]


In [20]:
df = tokenized_dataset['train'].to_pandas()
df.head()

Unnamed: 0,text,label,input_ids,attention_mask
0,La Antártida es el continente más frío,7,"[101, 10159, 40328, 46532, 11726, 10196, 10125...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ..."
1,El área de un círculo se calcula con la fórmul...,9,"[101, 10224, 13487, 10104, 10119, 78443, 10126...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,El desierto del Sahara se encuentra en el nort...,7,"[101, 10224, 10139, 93548, 10127, 38836, 10126...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,El ensayo es un texto que presenta las ideas d...,2,"[101, 10224, 55683, 50253, 10196, 10119, 27888...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,El voleibol se juega en una cancha dividida po...,3,"[101, 10224, 12714, 72099, 10126, 56879, 10110...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
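
The trailing zeros in `attention_mask` come from padding: with `padding=True`, shorter sequences in a batch are filled with a padding token up to the length of the longest one, and the mask marks real tokens with 1 and padding with 0. The sketch below is a toy illustration in plain Python, not the tokenizer's actual implementation; `pad_batch` and the padding ID of 0 are assumptions made for the example.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; the rest is filled with the padding ID
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 10159, 102], [101, 10224, 13487, 10104, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```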


In [21]:
# Define an evaluation metric
accuracy_metric = evaluate.load("accuracy")
def compute_accuracy(predictions_and_labels):
    predictions, labels = predictions_and_labels

    # The model outputs logits with shape (n_examples, n_classes), e.g. [[logit_a, logit_b], [logit_c, logit_d]],
    # where the number of columns depends on the number of classes in the problem.
    # We want to compare them against labels of the form [0, 0, 0, 1], so we take the argmax of each row.
    if len(predictions.shape) >= 2:
        predictions = np.argmax(predictions, axis=1)

    return accuracy_metric.compute(predictions=predictions, references=labels)
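
To make the argmax step concrete, here is a small sketch with NumPy only. The logits and labels are made-up values for 4 sentences and 3 classes, and the accuracy is computed by hand rather than through `evaluate`.

```python
import numpy as np

# Hypothetical logits for 4 sentences and 3 classes (one row per sentence)
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.0, 3.0],
    [1.0, 0.9, 0.8],
])
labels = np.array([0, 1, 2, 1])

# argmax over the class axis turns each row of logits into a predicted class ID
predicted_ids = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the reference labels
accuracy = float((predicted_ids == labels).mean())
print(predicted_ids, accuracy)
```

Only the last sentence is misclassified here, so three of the four predictions match.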

In [22]:
# Set up the model
print(f"[INFO] Loading model: {MODEL_NAME}")
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=MODEL_NAME,
    num_labels=10,
    id2label=id2label,
    label2id=label2id,
)
print(f"[INFO] Model fully loaded!")

[INFO] Loading model: distilbert/distilbert-base-multilingual-cased


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[INFO] Model fully loaded!


In [23]:
model_save_dir = Path(MODEL_SAVE_DIR_NAME)

In [24]:
# Setup TrainingArguments (these are hyperparameters for our model)
# Hyperparameters = settings that we can set as developers
# Parameters = settings/weights our model learns on its own
training_args = TrainingArguments(
    output_dir = model_save_dir,
    learning_rate=0.0001,
    per_device_eval_batch_size=32,
    per_device_train_batch_size=32,
    num_train_epochs=15,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,
    use_cpu=False,
    seed=42,
    load_best_model_at_end=True,
    logging_strategy="epoch",
    report_to="none",
    push_to_hub=False,
    hub_private_repo=False #Note: this will make our model public by default
)
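
As a rough sanity check on these settings, the number of optimizer steps can be estimated from the dataset size and batch size. This is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation; the 200 training examples come from the `Map` output earlier in the notebook.

```python
import math

# Values taken from the notebook: 200 training examples, batch size 32, 15 epochs
num_train_examples = 200
per_device_train_batch_size = 32
num_train_epochs = 15

# Assumption: one device and no gradient accumulation
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

With so few steps per epoch, evaluating, logging and saving once per epoch (as configured above) stays cheap.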

In [25]:
# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    compute_metrics=compute_accuracy
)


In [26]:
# Train the model
print(f"[INFO] Commencing model training...")
results = trainer.train()

[INFO] Commencing model training...


Epoch,Training Loss,Validation Loss,Accuracy
1,1.9642,1.515297,0.62
2,1.0957,0.955098,0.78
3,0.5173,0.616162,0.84
4,0.2341,0.409817,0.92
5,0.1249,0.335582,0.9
6,0.0702,0.294131,0.9
7,0.0488,0.272708,0.9
8,0.0341,0.267143,0.9
9,0.0256,0.265655,0.9
10,0.0205,0.263266,0.9
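
Because `load_best_model_at_end=True` is set without a `metric_for_best_model`, the checkpoint that is reloaded should, as far as I know, be the one with the lowest validation loss rather than the highest accuracy. The sketch below only re-reads the numbers from the table above to show that the two criteria pick different epochs; it does not touch the `Trainer`.

```python
# (epoch, training loss, validation loss, accuracy) copied from the table above
history = [
    (1, 1.9642, 1.515297, 0.62),
    (2, 1.0957, 0.955098, 0.78),
    (3, 0.5173, 0.616162, 0.84),
    (4, 0.2341, 0.409817, 0.92),
    (5, 0.1249, 0.335582, 0.90),
    (6, 0.0702, 0.294131, 0.90),
    (7, 0.0488, 0.272708, 0.90),
    (8, 0.0341, 0.267143, 0.90),
    (9, 0.0256, 0.265655, 0.90),
    (10, 0.0205, 0.263266, 0.90),
]

# Lowest validation loss (assumed default criterion) vs. highest accuracy
best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[3])
print("best by validation loss: epoch", best_by_loss[0])
print("best by accuracy: epoch", best_by_accuracy[0])
```

If accuracy is the metric that matters, passing `metric_for_best_model="accuracy"` to `TrainingArguments` is the usual way to change the criterion.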


In [57]:
# Save the model (to a local directory)
print(f"[INFO] Model training complete, saving model to a local path: {model_save_dir}")
trainer.save_model(output_dir=model_save_dir)

[INFO] Model training complete, saving model to a local path: models/learn_hf_spanish_sentence_classification_by_school_subject


In [58]:
# Upload the model to the Hugging Face Hub
print(f"[INFO] Uploading model to Hugging Face Hub...")
model_upload_url = trainer.push_to_hub(
    commit_message="Uploading 'learn_hf_spanish_sentence_classification_by_school_subject'",
)
print(f"[INFO] Model upload complete, model available at {model_upload_url}")

[INFO] Uploading model to Hugging Face Hub...



Upload 2 LFS files:   0%|                                 | 0/2 [00:00<?, ?it/s]
model.safetensors:   0%|                  | 16.4k/541M [00:00<2:53:07, 52.1kB/s]
training_args.bin: 100%|███████████████████| 5.24k/5.24k [00:00<00:00, 5.78kB/s]
model.safetensors: 100%|█████████████████████| 541M/541M [00:24<00:00, 21.8MB/s]
Upload 2 LFS files: 100%|█████████████████████████| 2/2 [00:25<00:00, 12.57s/it]


[INFO] Model upload complete, model available at https://huggingface.co/tonicanada/learn_hf_spanish_sentence_classification_by_school_subject/tree/main/


In [27]:
# Evaluate the model on the test data
print(f"[INFO] Performing evaluation on the test dataset...")
predictions_all = trainer.predict(tokenized_dataset['test'])
predictions_values = predictions_all.predictions
predictions_metrics = predictions_all.metrics

[INFO] Performing evaluation on the test dataset...


In [28]:
print(f"[INFO] Prediction metrics on the test data:")
pprint.pprint(predictions_metrics)

[INFO] Prediction metrics on the test data:
{'test_accuracy': 0.9,
 'test_loss': 0.254910409450531,
 'test_runtime': 0.0604,
 'test_samples_per_second': 827.991,
 'test_steps_per_second': 33.12}
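
Note that `predictions_values` holds raw logits, not probabilities. For single-label classification the usual conversion is a row-wise softmax, which is also what produces scores between 0 and 1 like the ones shown by the pipeline later on. A minimal NumPy sketch, using made-up logits rather than the notebook's actual predictions:

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical logits for 2 sentences and 3 classes
example_logits = np.array([[4.0, 1.0, 0.5], [0.1, 0.2, 2.5]])
probabilities = softmax(example_logits)

# Each row now sums to 1 and keeps the same argmax as the logits
print(probabilities.round(3))
print(probabilities.argmax(axis=1))
```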


In [29]:
# Try the model on a few examples
from transformers import pipeline
learn_hf_spanish_sentence_classification_by_school_subject = pipeline(task="text-classification",
                                    model=model_save_dir,
                                    device=torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu"),
                                    top_k=1,
                                    batch_size=32)

learn_hf_spanish_sentence_classification_by_school_subject("Julio César fue gobernador de Roma")

[[{'label': 'Historia', 'score': 0.9934651255607605}]]

In [30]:
learn_hf_spanish_sentence_classification_by_school_subject("Maoma es un profeta del islam")

[[{'label': 'Religión', 'score': 0.9959214925765991}]]

In [31]:
learn_hf_spanish_sentence_classification_by_school_subject("Mañana voy a trabajar")

[[{'label': 'Frase no relacionada con asignaturas',
   'score': 0.6237140893936157}]]

In [32]:
learn_hf_spanish_sentence_classification_by_school_subject("Cuáles son las medidas de la cancha de basket?")

[[{'label': 'Educación física', 'score': 0.8877455592155457}]]

## Creating a demo from our model

We need a function that returns its output in the following form: `{"label_1": probability_1, "label_2": probability_2, ...}`

In [33]:
from typing import Dict

# 1. Create a function to take a string input
def spanish_sentence_classification_by_school_subject(text: str) -> Dict[str, float]:
    # 2. Set up the school-subject text classifier
    spanish_sentence_classification_by_school_subject_pipeline = pipeline(task="text-classification",
                                        model=model_save_dir,
                                        batch_size=32,
                                        device="cuda" if torch.cuda.is_available() else "cpu",
                                        top_k=None) # top_k=None => return all possible labels

    # 3. Get the outputs from our pipeline
    outputs = spanish_sentence_classification_by_school_subject_pipeline(text)[0]

    # 4. Format output for Gradio
    output_dict = {}
    for item in outputs:
        output_dict[item['label']]=item['score']
    
    return output_dict

spanish_sentence_classification_by_school_subject(text="El golf es muy entretenido")

{'Educación física': 0.9700006246566772,
 'Frase no relacionada con asignaturas': 0.013027115724980831,
 'Geografía': 0.003420923836529255,
 'Física y química': 0.002429001033306122,
 'Religión': 0.0024216054007411003,
 'Historia': 0.0022585378028452396,
 'Artes': 0.0019943711813539267,
 'Matemáticas': 0.001680593704804778,
 'Lengua y literatura': 0.0016698952531442046,
 'Idiomas extranjeros': 0.0010973839089274406}
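
When only the most likely subjects are of interest, the dictionary can be sorted by score and cut to the first few entries. The helper below is a hypothetical illustration (`top_k_labels` is not part of the notebook), run on a shortened, rounded copy of the output above.

```python
from typing import Dict, List, Tuple


def top_k_labels(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (label, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Shortened, rounded copy of the dictionary printed above
scores = {
    "Geografía": 0.0034,
    "Educación física": 0.9700,
    "Frase no relacionada con asignaturas": 0.0130,
    "Historia": 0.0023,
}
top = top_k_labels(scores, k=2)
print(top)
```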

In [34]:
# Build a small Gradio demo to run locally
# 1. Import gradio
import gradio as gr

# 2. Create a gradio interface
demo = gr.Interface(
    fn=spanish_sentence_classification_by_school_subject,
    inputs="text",
    outputs=gr.Label(num_top_classes=10),
    title="Detector de asignaturas",
    description="Text classifier that detects which school subject a sentence refers to",
    examples=[["Matemáticas: 5 al cuadrado es 25"],
              ["Geografía: París es la capital de Francia"]])

# 3. Launch the interface
demo.launch()

* Running on local URL:  http://127.0.0.1:7863

To create a public link, set `share=True` in `launch()`.




### Creating a directory to store the demo

In [75]:
from pathlib import Path

# Make directory for demos
demos_dir = Path("./demos")
demos_dir.mkdir(exist_ok=True)

# Create a folder for the school-subject classifier demo
demo_dir = Path(demos_dir, "spanish_sentence_classification_by_school_subject")
demo_dir.mkdir(exist_ok=True)

In [85]:
%%writefile ./demos/spanish_sentence_classification_by_school_subject/app.py
# 1. Import the required libraries
import torch
import gradio as gr

from typing import Dict
from transformers import pipeline

# 2. Define our function to use with our model
spanish_sentence_classification_by_school_subject_pipeline = pipeline(task="text-classification",
                                    model="tonicanada/learn_hf_spanish_sentence_classification_by_school_subject",
                                    top_k=1,
                                    device="cuda" if torch.cuda.is_available() else "cpu",
                                    batch_size=32)    

def classify_text(text):
    # Run the classifier
    result = spanish_sentence_classification_by_school_subject_pipeline(text)
    # Extract the label and the score
    label = result[0][0]['label']
    score = result[0][0]['score']
    return {label: score}  # Return a dictionary with the label and its score


# 3. Create a Gradio interface
description = """
A text classifier that indicates which school subject a sentence refers to.

Fine-tuned from [DistilBERT](https://huggingface.co/distilbert/distilbert-base-multilingual-cased) on a [small dataset of Spanish sentences labeled by school subject](https://huggingface.co/datasets/tonicanada/learn_hf_spanish_sentence_classification_by_school_subject).
"""

demo = gr.Interface(
    fn = classify_text,
    inputs = "text",
    outputs=gr.Label(num_top_classes=10),
    title="📚🔍 Clasificador de asignaturas",
    description=description,
    examples=[["Matemáticas: 5 al cuadrado es 25"],
                       ["Geografía: París es la capital de Francia"]])


# 4. Launch the interface
if __name__ == "__main__":
    demo.launch()

Overwriting ./demos/spanish_sentence_classification_by_school_subject/app.py


In [78]:
%%writefile ./demos/spanish_sentence_classification_by_school_subject/README.md
---
title: Clasificador de asignaturas
emoji: 📚🔍
colorFrom: blue
colorTo: yellow
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
---

# 📚🔍 Clasificador de asignaturas

Small demo that classifies sentences by the school subject they refer to (for example: mathematics, religion, etc.).

DistilBERT model fine-tuned on a small synthetic dataset of 250 generated [example sentences](https://huggingface.co/datasets/tonicanada/learn_hf_spanish_sentence_classification_by_school_subject).

Writing ./demos/spanish_sentence_classification_by_school_subject/README.md


In [79]:
%%writefile ./demos/spanish_sentence_classification_by_school_subject/requirements.txt
gradio
torch
transformers

Writing ./demos/spanish_sentence_classification_by_school_subject/requirements.txt


In [86]:
from huggingface_hub import (
    create_repo,
    get_full_repo_name,
    upload_file,
    upload_folder
)

# Define the parameters we'd like to use for uploading our Space
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "./demos/spanish_sentence_classification_by_school_subject/"
HF_TARGET_SPACE_NAME = "learn_hf_spanish_sentence_classification_by_school_subject_demo"
HF_REPO_TYPE = "space"
HF_SPACE_SDK = "gradio"

# Create a Space repo on Hugging Face Hub
print(f"[INFO] Creating repo on Hugging Face Hub with name: {HF_TARGET_SPACE_NAME}")
create_repo(
    repo_id=HF_TARGET_SPACE_NAME,
    repo_type = HF_REPO_TYPE,
    private=False,
    space_sdk=HF_SPACE_SDK,
    exist_ok=True
)

# Get the full repo name (e.g. {username}/{repo_name})
hf_full_repo_name = get_full_repo_name(model_id=HF_TARGET_SPACE_NAME)
print(f"[INFO] Full hugging face hub repo name: {hf_full_repo_name}")

# Uploading our demo folder
print(f"[INFO] Uploading {LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD} to repo {hf_full_repo_name}")
folder_upload_url = upload_folder(
    repo_id=hf_full_repo_name,
    folder_path=LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
    path_in_repo=".",
    repo_type=HF_REPO_TYPE,
    commit_message="Uploading our school subject classifier demo from a notebook!"
)

print(f"[INFO] Demo folder successfully uploaded commit url: {folder_upload_url}")

[INFO] Creating repo on Hugging Face Hub with name: learn_hf_spanish_sentence_classification_by_school_subject_demo
[INFO] Full hugging face hub repo name: tonicanada/learn_hf_spanish_sentence_classification_by_school_subject_demo
[INFO] Uploading ./demos/spanish_sentence_classification_by_school_subject/ to repo tonicanada/learn_hf_spanish_sentence_classification_by_school_subject_demo
[INFO] Demo folder successfully uploaded commit url: https://huggingface.co/spaces/tonicanada/learn_hf_spanish_sentence_classification_by_school_subject_demo/tree/main/.
