# Applying Transformers to Text Classification

**Researchers**: <br>
  Dr. Ramón Zatarain Cabada<br>
  Dr. María Lucía Barrón Estrada<br>
  M.Sc. Víctor Manuel Bátiz Beltrán

**Corpus**: EduSERE

**References**:

- Barrón Estrada, M. L., Zatarain Cabada, R., Oramas Bustillos, R., & Graff, M. (2020). Opinion mining and emotion recognition applied to learning environments. Expert Systems with Applications, 150, 113265. https://doi.org/10.1016/j.eswa.2020.113265

- Zatarain Cabada, R., Barrón Estrada, M. L., & Bátiz Beltrán, V. M. (2023). Deep Learning Approaches for Affective Computing in Text. In Advanced Applications of Generative AI and Natural Language Processing Models (Chapter 15, pp. 306-339). https://doi.org/10.4018/979-8-3693-0502-7.ch015


### Overview

We will use the EduSERE dataset.

The corpus covers three learning-oriented emotions: frustrated (frustrado), bored (aburrido) and engaged (comprometido). It contains 3,245 texts labeled as frustrated, 3,239 labeled as bored and 5,600 labeled as engaged, for a total of 12,084. The texts are in Spanish.
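
The class sizes already show the imbalance that is corrected later by undersampling. A quick sketch of the arithmetic, using only the counts stated in this section (the variable names are illustrative):

```python
# Class sizes as stated in the corpus description.
class_counts = {"frustrado": 3245, "aburrido": 3239, "comprometido": 5600}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

print(f"Total texts: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The engaged class holds close to half of the corpus, which is why the notebook later samples the same number of texts from every class.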

### Getting started
We install and import the libraries we will use.

In [1]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [2]:
import re
#import matplotlib.pyplot as plt
import string
from nltk.corpus import stopwords
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk import SnowballStemmer
import unicodedata
from collections import Counter
from wordcloud import WordCloud
from gensim.utils import simple_preprocess
import gensim
from sklearn.model_selection import train_test_split
import spacy
import pickle
import warnings
warnings.filterwarnings('ignore')
#import seaborn as sns
#from sklearn.metrics import confusion_matrix
import tensorflow as tf
import keras
from keras import backend as K
import numpy as np
import pandas as pd
import emoji
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
print('Ready')

Ready


## 1. Loading the dataset

### Downloading the corpus

The first code cell below was needed to run the notebook in GPU mode; without it, an encoding error was raised.

In [3]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [4]:
def corpus_download(path, url):
  !wget --no-check-certificate \
     {url} \
     -O {path}

In [7]:
corpus_download("EduSere.csv","https://person-app-itc.web.app/corpus/EduSere.csv")

--2024-10-22 04:22:32--  https://person-app-itc.web.app/corpus/EduSere.csv
Resolving person-app-itc.web.app (person-app-itc.web.app)... 199.36.158.100, 2620:0:890::100
Connecting to person-app-itc.web.app (person-app-itc.web.app)|199.36.158.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1039787 (1015K) [text/csv]
Saving to: ‘EduSere.csv’


2024-10-22 04:22:32 (7.43 MB/s) - ‘EduSere.csv’ saved [1039787/1039787]



In [8]:
data = pd.read_csv("EduSere.csv")

### Exploring the data

In [9]:
data.head()

Unnamed: 0,Text,Label
0,Que aburrido ser nueva en esto!! Hasta me olvi...,aburrido
1,"Si estás cansado de lo mismo, abre los ojos 👓",aburrido
2,muy poca explicación,aburrido
3,"—oye, ¿estudiaste El Resumen? —¿cuál Resumen...",aburrido
4,Asco lo que hiciste el día de hoy 🤢,aburrido


In [10]:
len(data)

12084

We convert the labels to a numeric representation: 0 = frustrated (frustrado), 1 = bored (aburrido) and 2 = engaged (comprometido).


In [11]:
data['Label'] = data['Label'].replace({'frustrado':0, 'aburrido':1, 'comprometido':2})
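
Keeping the mapping in a dictionary makes it easy to go back from the numeric ids to the original names when reading predictions. A minimal sketch, assuming the same three labels; `label2id` and `id2label` are illustrative names, not part of the notebook:

```python
# Same mapping as the notebook, kept in a reusable dictionary.
label2id = {"frustrado": 0, "aburrido": 1, "comprometido": 2}
# Inverse mapping, useful for turning model predictions back into names.
id2label = {idx: name for name, idx in label2id.items()}

labels = ["aburrido", "comprometido", "frustrado", "aburrido"]
encoded = [label2id[name] for name in labels]
decoded = [id2label[idx] for idx in encoded]

print(encoded)
print(decoded)
```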

## 2. Data cleaning

In [12]:
data.head()

Unnamed: 0,Text,Label
0,Que aburrido ser nueva en esto!! Hasta me olvi...,1
1,"Si estás cansado de lo mismo, abre los ojos 👓",1
2,muy poca explicación,1
3,"—oye, ¿estudiaste El Resumen? —¿cuál Resumen...",1
4,Asco lo que hiciste el día de hoy 🤢,1


In [13]:
data['Label'].value_counts()

Label,count
2,5600
0,3245
1,3239


As we can see, the classes are imbalanced. We take 3,000 records from each class to balance them.

In [14]:
#Undersampling
frustrated = data[data['Label']==0]
bored = data[data['Label']==1]
engaged = data[data['Label']==2]

frustrated = frustrated.sample(n=3000, random_state=1)
bored = bored.sample(n=3000, random_state=1)
engaged = engaged.sample(n=3000, random_state=1)

data = pd.concat([frustrated, bored, engaged], axis=0)
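
The same undersampling idea can be sketched without pandas, which makes the logic easy to check on toy data. This is an illustration under stated assumptions: `undersample` is a hypothetical helper and the rows are made up, not taken from the corpus.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, n_per_class, seed=1):
    """Return n_per_class randomly chosen rows for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    balanced = []
    for label in sorted(by_label):
        # Sampling without replacement, as DataFrame.sample does by default.
        balanced.extend(rng.sample(by_label[label], n_per_class))
    return balanced

# Toy data standing in for the corpus: class 2 is over-represented.
rows = [(f"text {i}", i % 2) for i in range(8)] + [(f"extra {i}", 2) for i in range(10)]
balanced = undersample(rows, n_per_class=3)
print(Counter(label for _, label in balanced))
```

Fixing the seed, as the notebook does with `random_state=1`, keeps the selected subset the same between runs.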


In [15]:
data.head()

Unnamed: 0,Text,Label
10373,considero que este curso debería ser un poco m...,0
11665,"se me acaba el tiempo para terminar el examen,...",0
9635,"Pero me tengo que conformar con estar aquí, im...",0
11889,tengo un código que todavía no funciona: neces...,0
11109,mi sueño frustrado es saber cantar,0


In [16]:
print(data.dtypes)


Text     object
Label     int64
dtype: object


In [17]:
#Check if we have null fields
data.isnull().sum()

Unnamed: 0,0
Text,0
Label,0


In [None]:
# In case there are null texts, fill them with a placeholder.
data["Text"] = data["Text"].fillna("Sin texto")

### Next, we carry out the following steps:

1. Split the text into tokens
2. Convert words to lowercase
3. Expand contractions
4. Remove URLs, e-mail addresses and line breaks
5. Remove repeated characters
6. Remove newlines and tabs
7. Remove line breaks
8. Remove single quotes
9. Remove commas " , "
10. Remove numbers
11. Remove non-alphanumeric characters
12. Remove hyphens between words
13. Remove double and triple hyphens
14. Remove whitespace (leading, trailing and double spaces)
15. Remove stop words
16. Apply stemming/lemmatization
17. Remove punctuation marks
18. Detokenize
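
Several of the steps listed above map directly onto regular expressions. The following is a minimal sketch of a few of them using only the standard library; `basic_clean` is a hypothetical helper written for illustration and is not the function used in this notebook.

```python
import re

def basic_clean(text):
    """Illustrative subset of the cleaning steps listed above."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # e-mail addresses
    text = re.sub(r"-{2,}", " ", text)                   # double and triple hyphens
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = re.sub(r"(.)\1{2,}", r"\1", text)             # 3+ repeated characters -> one
    text = re.sub(r"\s+", " ", text).strip()             # newlines, tabs, extra spaces
    return text

sample = "Holaaaa!!   Visita https://example.com\n o escribe a a@b.com --- 123"
print(basic_clean(sample))
```

The order matters: hyphen runs are removed before repeated characters are collapsed, otherwise `---` would first shrink to a single hyphen and survive.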


In [18]:
def process_text(sentence, norm_user=True, norm_url=True, separate_characters=True):
    # Convert the instance to a string
    sentence = str(sentence)

    # Convert all text to lowercase
    sentence = sentence.lower()

    # Normalize user mentions and URLs to placeholder tokens
    if norm_user:
        sentence = re.sub(r'@\w+', '@usuario', sentence)
    if norm_url:  # originally named norm_hashtag, although it controls URL normalization
        sentence = re.sub(r"http\S+|www\S+", 'url', sentence)

    # Surround punctuation marks with spaces so each one becomes its own token
    if separate_characters:
        sentence = re.sub(r"([:,.!¡“'”()?¿])", r" \1 ", sentence)

    # Collapse runs of whitespace into a single space
    sentence = re.sub(r'\s+', ' ', sentence)
    # Convert emojis to their text aliases
    sentence = emoji.demojize(sentence)

    return sentence

In [19]:
clean_data = data.copy()
clean_data['Text'] = clean_data['Text'].apply(process_text)
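
To see what the punctuation-separation step does to a sentence, here is a small standalone sketch of that single step. `separate_punctuation` is an illustrative name; it covers the same set of characters as `process_text` but leaves out the lowercasing, mention/URL normalization and emoji conversion.

```python
import re

def separate_punctuation(sentence):
    """Put spaces around punctuation marks, then collapse repeated whitespace."""
    sentence = re.sub(r"([:,.!¡“'”()?¿])", r" \1 ", sentence)
    return re.sub(r"\s+", " ", sentence)

result = separate_punctuation("¿estudiaste el resumen? ¡sí, claro!")
print(result.split())
```

After this step a whitespace split yields punctuation marks as separate tokens instead of leaving them attached to the neighboring words.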

In [20]:
clean_data.head()

Unnamed: 0,Text,Label
10373,considero que este curso debería ser un poco m...,0
11665,se me acaba el tiempo para terminar el examen ...,0
9635,"pero me tengo que conformar con estar aquí , i...",0
11889,tengo un código que todavía no funciona : nece...,0
11109,mi sueño frustrado es saber cantar,0


## Removing words that add no value (stopwords)

In [None]:
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [None]:
nltk.download('stopwords')
print(stopwords.words('spanish'))

In [None]:
stop_words = set(stopwords.words('spanish'))

In [None]:
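def remove_stopwords(text):
  # Tokenize with the Spanish model, then keep only tokens that are not stopwords
  word_tokens = word_tokenize(text, language='spanish')
  no_stopwords = [word for word in word_tokens if word not in stop_words]
  return " ".join(no_stopwords)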

In [None]:
remove_stopwords('el que tiene tienda la debe atender')

In [None]:
clean_data['Text'] = clean_data['Text'].apply(remove_stopwords)
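
The filtering logic itself is independent of NLTK. The sketch below reproduces it with a tiny, hand-picked stopword set so it runs without any downloads; `DEMO_STOPWORDS` and `remove_stopwords_demo` are hypothetical names, and the set is only an illustration, not NLTK's Spanish list.

```python
# Hand-picked subset for illustration only; the notebook uses
# the full Spanish list from nltk.corpus.stopwords.
DEMO_STOPWORDS = {"el", "la", "los", "las", "que", "de", "en", "y"}

def remove_stopwords_demo(text, stop_words=DEMO_STOPWORDS):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = remove_stopwords_demo("el que tiene tienda la debe atender")
print(filtered)
```

With the full NLTK list the result for the same sentence may be shorter, since that list contains many more words than this subset.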

## Lemmatization

In [None]:
# https://spacy.io/models/es
# We'll use spaCy for lemmatization
!python -m spacy download es_core_news_sm

In [None]:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()

In [None]:
def lematize(text):
    # Join the lemma of every token that spaCy finds in the text
    return " ".join(token.lemma_ for token in nlp(text))

In [None]:
lematize('yo soy muy feliz con mi familia')

In [None]:
clean_data['Text'] = clean_data['Text'].apply(lematize)

## Removing punctuation and accents (punctuation cleaning)



In [None]:
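def cleaning_punct(text):
  # simple_preprocess lowercases and tokenizes the text, which drops punctuation
  # (and, by default, very short or very long tokens); deacc=True strips accent marks
  token_list = gensim.utils.simple_preprocess(str(text), deacc=True)
  return " ".join(token_list)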

In [None]:
cleaning_punct('mi méxico querido qué fantástico')

In [None]:
clean_data['Text'] = clean_data['Text'].apply(cleaning_punct)
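
Accent removal can be illustrated with the standard library alone. The sketch below decomposes each character and drops the combining marks; it is meant to show the idea behind `deacc=True`, not to reproduce gensim's implementation, and `strip_accents` is an illustrative name.

```python
import unicodedata

def strip_accents(text):
    """Remove accent marks by dropping Unicode combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

plain = strip_accents("mi méxico querido qué fantástico")
print(plain)
```

Note that this also turns `ñ` into `n`, which changes some Spanish words (for example `año` becomes `ano`), so the step trades a smaller vocabulary for some loss of meaning.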

### Encoding the labels

Because the target is categorical, the emotion labels (frustrated, bored and engaged) must be converted into a numeric form the model can understand. For a Keras model this is done with the `to_categorical` method; the Simple Transformers model built in the next section takes the integer labels 0, 1 and 2 directly.
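
As an illustration of what one-hot encoding of the labels produces, here is a plain-Python sketch. `one_hot` is a hypothetical helper that mimics the kind of output Keras' `to_categorical` returns for integer labels; it is not the Keras function itself.

```python
def one_hot(labels, num_classes):
    """Return one float vector per label, with 1.0 at the label's index."""
    return [[1.0 if idx == label else 0.0 for idx in range(num_classes)]
            for label in labels]

# 0 = frustrated, 1 = bored, 2 = engaged
encoded = one_hot([0, 1, 2, 1], num_classes=3)
for row in encoded:
    print(row)
```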

## Building the model

### Transformers

Initial steps

In [23]:
!pip install transformers==4.24.0
!pip install simpletransformers==0.63.11

Collecting transformers==4.24.0
  Downloading transformers-4.24.0-py3-none-any.whl.metadata (90 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.24.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: t

In [None]:
#!pip install transformers
#!pip install simpletransformers

In [24]:
pip show simpletransformers  # Show the installed version

Name: simpletransformers
Version: 0.63.11
Summary: An easy-to-use wrapper library for the Transformers library.
Home-page: https://github.com/ThilinaRajapakse/simpletransformers/
Author: Thilina Rajapakse
Author-email: chaturangarajapakshe@gmail.com
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: datasets, numpy, pandas, regex, requests, scikit-learn, scipy, sentencepiece, seqeval, streamlit, tensorboard, tokenizers, tqdm, transformers, wandb
Required-by: 


### Loading the pretrained models

In [25]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [26]:
import logging # Import the logging module

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [27]:
clean_data2 = clean_data.copy()
clean_data2.rename(columns = {'Text':'text','Label':'labels'}, inplace = True)

In [28]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(clean_data2, test_size=0.20)

print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)

train shape:  (7200, 2)
test shape:  (1800, 2)
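
The split above is random and unseeded, so the exact rows in each partition change between runs; `train_test_split` accepts `random_state` and `stratify` arguments for cases where a reproducible, class-balanced split is wanted. The idea behind a stratified 80/20 split can be sketched in plain Python as follows (`stratified_split` is a hypothetical helper, and the label list only mimics the balanced 9,000-row dataset):

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.20, seed=1):
    """Return (train_idx, test_idx) with the same class proportions in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * test_size)     # rows of this class sent to test
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx

labels = [0] * 3000 + [1] * 3000 + [2] * 3000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```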


In [30]:
# Optional model configuration
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 1,  # A single epoch, to keep training time short
              "overwrite_output_dir": True}

# Create a ClassificationModel
# Note: bert-base-uncased was pretrained on English text, while the corpus is in Spanish.
model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=3,
    args=train_args
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [31]:
# Train the model
model.train_model(train_df)

  0%|          | 0/7200 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/900 [00:00<?, ?it/s]

(900, 0.7502511037223869)

In [37]:
# Load the metrics
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(
    test_df,
    f1=lambda labels, preds: f1_score(labels, preds, average='weighted'),  # Use weighted average for F1-score
    acc=accuracy_score,  # Accuracy doesn't need averaging for multi-class
    rc=lambda labels, preds: recall_score(labels, preds, average='weighted'),  # Use weighted average for recall
    pcs=lambda labels, preds: precision_score(labels, preds, average='weighted')  # Use weighted average for precision
)

  0%|          | 0/1800 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/225 [00:00<?, ?it/s]

In [39]:
print(f" Accuracy: {result['acc']}")
print(f" F1-Score: {result['f1']}")
print(f" Recall: {result['rc']}")
print(f" Precision: {result['pcs']}")

 Accuracy: 0.7827777777777778
 F1-Score: 0.7846341277061037
 Recall: 0.7827777777777778
 Precision: 0.7896888501578875
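
Recall and accuracy are identical in the output above. That is expected: when per-class recall is averaged with the class supports as weights, the result equals the overall fraction of correct predictions. A small sketch with made-up labels (not the notebook's predictions) shows the equality; both helper functions are illustrative.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class supports as weights."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {label: hits[label] / n for label, n in support.items()}
    return sum(per_class[label] * n for label, n in support.items()) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]
print(weighted_recall(y_true, y_pred), accuracy(y_true, y_pred))
```

Weighted precision and F1 do not share this property, which is why they differ slightly from the accuracy value.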


### Testing the model

In [41]:
#from sklearn.metrics import recall_score
from sklearn import metrics

In [53]:
# Recall our classes: 'frustrado': 0, 'aburrido': 1, 'comprometido': 2
clases = ['Frustrado','Aburrido','Comprometido']

In [67]:
# We use a dictionary to build the test dataset
# Sentences:
# 0-Que tristeza estar en este taller, siento que pierdo mi tiempo
# 1-Que taller tan tedioso, no me motiva a nada
# 2-Este taller esta genial, los instructores explican muy bien
datos = {
    'text': ['Que tristeza estar en este taller, siento que pierdo mi tiempo'],
    'labels': [0]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(datos)

In [68]:
df.head()

Unnamed: 0,text,labels
0,"Que tristeza estar en este taller, siento que ...",0


In [69]:
test = df['text'].to_numpy().tolist()
y = df['labels'].to_numpy().tolist()
print(test[0])
print(y[0])
print(len(test))
print(len(y))

Que tristeza estar en este taller, siento que pierdo mi tiempo
0
1
1


In [70]:
predictions_test = model.predict(test)

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

In [71]:
# Accessing the class chosen by the model
print(clases[predictions_test[0][0]])

Frustrado


In [72]:
# Using the raw model outputs (one score per class) for the first sentence
print(clases[np.argmax(predictions_test[1][0])])

Frustrado
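
The second element returned by `model.predict` holds the raw model outputs, which are unnormalized scores rather than probabilities. If probabilities are wanted, a softmax can be applied to each row. A minimal sketch, assuming one row of three scores; the numbers are made up and do not come from the trained model:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

clases = ['Frustrado', 'Aburrido', 'Comprometido']
raw_output = [2.1, -0.3, -1.5]  # made-up scores for one sentence, not real model output
probs = softmax(raw_output)
best = max(range(len(probs)), key=probs.__getitem__)
print(clases[best], [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the scores, so the predicted class is the same whether the argmax is taken over the raw outputs or over the probabilities.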


In [49]:
test_recall = metrics.recall_score(y, predictions_test[0], average='macro')
test_f1 = metrics.f1_score(y, predictions_test[0], average='macro')
test_precision = metrics.precision_score(y, predictions_test[0], average='macro')
test_accuracy = metrics.accuracy_score(y, predictions_test[0])

In [50]:
print("Metrics results:")
print(f"Accuracy: {test_accuracy}")
print(f"F1: {test_f1}")
print(f"Precision: {test_precision}")
print(f"Recall: {test_recall}")

Metrics results:
Accuracy: 1.0
F1: 1.0
Precision: 1.0
Recall: 1.0

With a single, correctly classified test sentence every metric is trivially 1.0; these figures only confirm that this one prediction matches its label.
