# Applying Transformers to Sentiment Analysis

**Researchers**: <br>
  Dr. Ramón Zatarain Cabada<br>
  Dr. María Lucía Barrón Estrada<br>
  M.Sc. Víctor Manuel Bátiz Beltrán

**Corpus**: SentiText

**References**:

- Barrón Estrada, M. L., Zatarain Cabada, R., Oramas Bustillos, R., & Graff, M. (2020). Opinion mining and emotion recognition applied to learning environments. Expert Systems with Applications, 150, 113265. https://doi.org/10.1016/j.eswa.2020.113265

- Zatarain Cabada, R., Barrón Estrada, M. L., & Bátiz Beltrán, V. M. (2023). Deep Learning Approaches for Affective Computing in Text. In Advanced Applications of Generative AI and Natural Language Processing Models (Chapter 15, pp. 306-339). https://doi.org/10.4018/979-8-3693-0502-7.ch015


### Overview
We will use the **SentiText** dataset.

This corpus contains 24,556 Spanish-language texts from Twitter with opinions about learning programming languages. It is well balanced between positive and negative opinions.


### Initial steps
We install and import the libraries we will use.

In [1]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [2]:
# Silence library warnings before the heavy imports run
import warnings
warnings.filterwarnings('ignore')

# Standard library
import pickle
import re
import string
import unicodedata
from collections import Counter

# Third-party libraries
import emoji
import gensim
import keras
import nltk
import numpy as np
import pandas as pd
import spacy
import tensorflow as tf
from gensim.utils import simple_preprocess
from keras import backend as K
from nltk import SnowballStemmer
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from wordcloud import WordCloud

print('Ready')

Ready


## 1. Loading the dataset

### Downloading the corpus from the PersonApp website

The first code cell below was needed to run in GPU mode; without it, an encoding error was raised.

In [3]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [4]:
def corpus_download(path, url):
  !wget --no-check-certificate \
     {url} \
     -O {path}
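
The `!wget` call only works inside IPython or Colab. A minimal sketch of a standard-library alternative follows; `corpus_download_py` is a hypothetical name, and unlike `wget --no-check-certificate` this version keeps TLS certificate verification enabled.

```python
import shutil
import urllib.request
from pathlib import Path

def corpus_download_py(path, url):
    """Download `url` to `path` using only the standard library."""
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return Path(path)
```

It would be called the same way as the cell above, for example `corpus_download_py("SentiText.csv", "https://person-app-itc.web.app/corpus/SentiText.csv")`.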

In [5]:
corpus_download("SentiText.csv","https://person-app-itc.web.app/corpus/SentiText.csv")

--2024-10-22 03:08:16--  https://person-app-itc.web.app/corpus/SentiText.csv
Resolving person-app-itc.web.app (person-app-itc.web.app)... 199.36.158.100, 2620:0:890::100
Connecting to person-app-itc.web.app (person-app-itc.web.app)|199.36.158.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2085703 (2.0M) [text/csv]
Saving to: ‘SentiText.csv’


2024-10-22 03:08:16 (134 MB/s) - ‘SentiText.csv’ saved [2085703/2085703]



In [6]:
data = pd.read_csv("SentiText.csv")

### Data exploration

In [7]:
data.head()

|   | Text | Label |
|---|------|-------|
| 0 | !!!Que dia de mierda fue el de ayer loko :( | negativo |
| 1 | !Qué #asco de esta chusma intolerante! https:/... | negativo |
| 2 | —oye, ¿estudiaste El Resumen de mate? —¿cuál... | negativo |
| 3 | (...) no sabe lo que le espera y piensa que va... | negativo |
| 4 | ...es realmente #triste llegar a viejo y sin c... | negativo |


In [8]:
len(data)

24556
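
With the corpus loaded, the claim that it is well balanced can be checked by counting labels. A minimal sketch, assuming the labels are available as a plain list; `label_balance` is a hypothetical helper and the sample list is illustrative, not taken from the corpus.

```python
from collections import Counter

def label_balance(labels):
    """Return the fraction of examples per label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Illustrative labels only; with the real corpus, pass data['Label'].tolist().
sample = ["negativo", "positivo", "positivo", "negativo"]
print(label_balance(sample))
```

Fractions close to 0.5 for each class would support the description of the corpus as balanced.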

We convert the labels to a numerical representation: 0 = negative, 1 = positive.

In [9]:
data['Label'] = data['Label'].replace({'negativo':0, 'positivo':1})

## 2. Data cleaning

In [10]:
data.head()

|   | Text | Label |
|---|------|-------|
| 0 | !!!Que dia de mierda fue el de ayer loko :( | 0 |
| 1 | !Qué #asco de esta chusma intolerante! https:/... | 0 |
| 2 | —oye, ¿estudiaste El Resumen de mate? —¿cuál... | 0 |
| 3 | (...) no sabe lo que le espera y piensa que va... | 0 |
| 4 | ...es realmente #triste llegar a viejo y sin c... | 0 |


In [11]:
print(data.dtypes)


Text     object
Label     int64
dtype: object


In [12]:
#Check if we have null fields
data.isnull().sum()

|   | 0 |
|---|---|
| Text | 0 |
| Label | 0 |


In [None]:
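# Replace null texts, if any, with a placeholder.
# Assigning the result avoids the chained in-place fillna, which newer pandas versions warn about.
data["Text"] = data["Text"].fillna("Sin texto")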

### Next, we will carry out the following steps:

1. Split the text into tokens
2. Convert words to lowercase
3. Expand contractions
4. Remove URLs, e-mail addresses, and line breaks
5. Remove repeated characters
6. Remove newlines and tabs
7. Remove line breaks
8. Remove single quotes
9. Remove commas " , "
10. Remove numbers
11. Remove non-alphanumeric characters
12. Remove hyphens between words
13. Remove double and triple hyphens
14. Remove whitespace (leading, trailing, and double spaces)
15. Remove stop words
16. Apply stemming/lemmatization
17. Remove punctuation marks
18. Detokenize
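
Several of the listed steps can be expressed as short regular-expression substitutions. The following is a minimal sketch under stated assumptions: `extra_cleaning` is a hypothetical helper that is not part of the notebook's pipeline, and keeping two characters when collapsing repeats is an arbitrary choice.

```python
import re

def extra_cleaning(text):
    """Apply a few of the listed cleaning steps with regular expressions."""
    text = re.sub(r"\S+@\S+\.\S+", " ", text)     # remove e-mail addresses
    text = re.sub(r"[\r\n\t]+", " ", text)        # remove newlines and tabs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # collapse runs of 3+ repeated characters to 2
    text = re.sub(r"\d+", " ", text)              # remove numbers
    text = re.sub(r"-{2,}", " ", text)            # remove double and triple hyphens
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)   # remove hyphens between words
    return re.sub(r"\s+", " ", text).strip()      # trim and collapse whitespace

print(extra_cleaning("holaaaa   mundo 123 --- bien-hecho\n fin"))
```

The order matters: e-mail addresses are removed before digits and hyphens so that an address is dropped as a whole rather than in fragments.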


In [14]:
def process_text(sentence, norm_user=True, norm_hashtag=True, separate_characters=True):
    # Convert the instance to a string
    sentence = str(sentence)

    # Convert all text to lowercase
    sentence = sentence.lower()

    # Normalize user mentions and URLs
    if norm_user:
        sentence = re.sub(r'@\w+', '@usuario', sentence)
    # NOTE: despite its name, norm_hashtag controls URL normalization;
    # hashtags themselves are left unchanged.
    if norm_hashtag:
        sentence = re.sub(r"http\S+|www\S+", 'url', sentence)

    # Surround punctuation marks with spaces so each one becomes its own token
    if separate_characters:
        sentence = re.sub(r"([:,.!¡“'”()?¿])", r" \1 ", sentence)

    # Replace runs of whitespace with a single space
    sentence = re.sub(r'\s+', ' ', sentence)
    # Convert emojis to their text names
    sentence = emoji.demojize(sentence)

    return sentence

In [15]:
clean_data = data.copy()
clean_data['Text'] = clean_data['Text'].apply(process_text)

In [16]:
clean_data.head()

|   | Text | Label |
|---|------|-------|
| 0 | ! ! ! que dia de mierda fue el de ayer loko : ( | 0 |
| 1 | ! qué #asco de esta chusma intolerante ! url | 0 |
| 2 | —oye , ¿ estudiaste el resumen de mate ? — ¿ ... | 0 |
| 3 | ( . . . ) no sabe lo que le espera y piensa q... | 0 |
| 4 | . . . es realmente #triste llegar a viejo y s... | 0 |


### Removing words that add no value (stopwords)

In [17]:
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [18]:
nltk.download('stopwords')
print(stopwords.words('spanish'))

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo', 'tuya', 'tuyos', 'tuyas', 'suyo', 'suya', 'suyos', 'suyas', 'nuestro', 'nuestra', 'nuestros', 'nuestras', 'vuestro', 'vuestra', 'vuestros', 'vuestras', 'esos', 'esas', 'estoy', 'estás', 'está', 'estamos', 'estáis', 'están', 'e

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
stop_words = set(stopwords.words('spanish'))

In [20]:
def remove_stopwords(text):
    # Tokenize and keep only the words that are not Spanish stop words
    word_tokens = word_tokenize(text)
    no_stopwords = [word for word in word_tokens if word not in stop_words]
    return " ".join(no_stopwords)

In [21]:
remove_stopwords('el que tiene tienda la debe atender')

'tienda debe atender'

In [22]:
clean_data['Text'] = clean_data['Text'].apply(remove_stopwords)
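
The Spanish stop-word list printed above includes negation words such as 'no', 'ni', 'sin' and 'nada', which can carry polarity in sentiment analysis. A minimal sketch of a variant that keeps them follows; the function name, the small `STOP_WORDS` subset and the whitespace tokenization are assumptions made so the example runs without NLTK, and this is not the method used in the notebook.

```python
# Small illustrative subset of NLTK's Spanish stop-word list (assumption: the
# full list would come from stopwords.words('spanish')).
STOP_WORDS = {"el", "la", "que", "de", "no", "ni", "sin", "nada", "me", "muy"}
# Negation words that may be worth keeping for sentiment analysis.
NEGATIONS = {"no", "ni", "sin", "nada"}

def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stop words but keep negation words; splits on whitespace."""
    effective = set(stop_words) - set(keep)
    return " ".join(word for word in text.split() if word not in effective)

print(remove_stopwords_keep_negations("no me gusta la tarea"))  # no gusta tarea
```

Whether keeping negations helps on this corpus would need to be measured; the sketch only shows how the filter could be adjusted.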

### Lemmatization

In [23]:
# https://spacy.io/models/es
# We'll use spaCy for lemmatization
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.7.0/es_core_news_sm-3.7.0-py3-none-any.whl (12.9 MB)
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [24]:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()

In [25]:
def lematize(text):
    # Run the spaCy pipeline and join the lemma of every token
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [26]:
lematize('yo soy muy feliz con mi familia')

'yo ser mucho feliz con mi familia'

In [27]:
clean_data['Text'] = clean_data['Text'].apply(lematize)

### Removing punctuation and accents (Punctuation Cleaning)



In [28]:
def cleaning_punct(text):
    # simple_preprocess lowercases, tokenizes and drops punctuation; deacc=True also strips accents
    token_list = gensim.utils.simple_preprocess(str(text), deacc=True)
    return " ".join(token_list)

In [29]:
cleaning_punct('mi méxico querido qué fantástico')

'mi mexico querido que fantastico'

In [30]:
clean_data['Text'] = clean_data['Text'].apply(cleaning_punct)
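
The accent-stripping part of this step can be illustrated with the standard library alone. A minimal sketch follows; `strip_accents` is a hypothetical helper shown for illustration and is not gensim's implementation.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("mi méxico querido qué fantástico"))  # mi mexico querido que fantastico
```

Note that this approach also turns "ñ" into "n", which merges some distinct Spanish words.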

## 3. Building the model

In [None]:
#clases = ['Negativo','Positivo']

### Transformers

Initial steps

In [31]:
!pip install transformers==4.24.0
!pip install simpletransformers==0.63.11

Collecting transformers==4.24.0
  Downloading transformers-4.24.0-py3-none-any.whl.metadata (90 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.24.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: to

In [32]:
#!pip install transformers
#!pip install simpletransformers



In [33]:
pip show simpletransformers  # Only to show the installed version

Name: simpletransformers
Version: 0.63.11
Summary: An easy-to-use wrapper library for the Transformers library.
Home-page: https://github.com/ThilinaRajapakse/simpletransformers/
Author: Thilina Rajapakse
Author-email: chaturangarajapakshe@gmail.com
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: datasets, numpy, pandas, regex, requests, scikit-learn, scipy, sentencepiece, seqeval, streamlit, tensorboard, tokenizers, tqdm, transformers, wandb
Required-by: 


### Loading the pretrained models

In [34]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [35]:
import logging # Import the logging module

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [36]:
clean_data2 = clean_data.copy()
clean_data2.rename(columns = {'Text':'text','Label':'labels'}, inplace = True)

In [37]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(clean_data2, test_size=0.20)

print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)

train shape:  (19644, 2)
test shape:  (4912, 2)
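
The split above is random and unseeded, so the exact rows in each part change between runs. A minimal standard-library sketch of a reproducible, stratified split follows; `stratified_split` and its parameters are hypothetical names. With scikit-learn, the `stratify` and `random_state` arguments of `train_test_split` serve the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_size=0.20, seed=42):
    """Split rows into (train, test), keeping label proportions in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Illustrative rows of (text, label); not taken from the corpus.
rows = [("texto %d" % i, i % 2) for i in range(100)]
train, test = stratified_split(rows, label_of=lambda row: row[1])
print(len(train), len(test))  # 80 20
```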


In [38]:
# Model configuration, passed to the model as a plain dictionary
train_args = {
    "reprocess_input_data": True,
    "fp16": False,
    "num_train_epochs": 1,  # a single epoch, to keep training time short
    "overwrite_output_dir": True,
}

# Create a ClassificationModel
model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=2,
    args=train_args
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

### Training the model

In [39]:
# Train the model
model.train_model(train_df)

(2456, 0.40836173980701596)

In [40]:
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score

In [41]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df,f1=f1_score, acc=accuracy_score, rc=recall_score, pcs=precision_score)

In [43]:
print(f" Accuracy: {result['acc']}")
print(f" F1-Score: {result['f1']}")
print(f" Recall: {result['rc']}")
print(f" Precision: {result['pcs']}")

 Accuracy: 0.8711319218241043
 F1-Score: 0.8714198659354052
 Recall: 0.8830794565664882
 Precision: 0.8600641539695268
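
As a sanity check, the reported F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. A minimal sketch; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation output above.
precision = 0.8600641539695268
recall = 0.8830794565664882
print(round(f1_from(precision, recall), 4))  # 0.8714
```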


### Testing the model

In [48]:
from sklearn.metrics import recall_score
from sklearn import metrics

In [52]:
# Recall our classes: 0 = Negativo (negative), 1 = Positivo (positive)
clases = ['Negativo','Positivo']

In [69]:
# We will use a dictionary to create the test dataset.
# Candidate sentences (label - text); only the second one is used below:
# 0 - odio ir a la escuela, es horrible
# 1 - La vida es hermosa, soy muy feliz estudiando
datos = {
    'text': ['La vida es hermosa, soy muy feliz estudiando'],
    'labels': [1]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(datos)

In [70]:
df.head()

|   | text | labels |
|---|------|--------|
| 0 | La vida es hermosa, soy muy feliz estudiando | 1 |


In [71]:
test = df['text'].to_numpy().tolist()
y = df['labels'].to_numpy().tolist()
print(test[0])
print(y[0])
print(len(test))
print(len(y))

La vida es hermosa, soy muy feliz estudiando
1
1
1
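
The training texts went through `process_text`, `remove_stopwords`, `lematize` and `cleaning_punct`, while the sentence above is still raw. It may be worth applying the same steps before prediction so that the input resembles the training data. A minimal sketch of chaining such steps follows; `make_pipeline` is a hypothetical helper, and the two `str` methods are stand-ins so the example runs on its own.

```python
from functools import reduce

def make_pipeline(*steps):
    """Return a function that applies the given text steps in order."""
    def run(text):
        return reduce(lambda value, step: step(value), steps, text)
    return run

# Stand-in steps so the sketch runs on its own; in the notebook these would be
# process_text, remove_stopwords, lematize and cleaning_punct.
preprocess = make_pipeline(str.lower, str.strip)
print(preprocess("  La vida es hermosa  "))  # la vida es hermosa
```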


In [72]:
predictions_test = model.predict(test)

In [73]:
# Accessing the class chosen by the model
print(clases[predictions_test[0][0]])

Positivo


In [74]:
# Using the raw model outputs (scores per class) for the first sentence
print(clases[np.argmax(predictions_test[1][0])])

Positivo
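
The second element returned by `model.predict` holds raw scores per class rather than probabilities. A softmax turns such scores into values that sum to 1. A minimal sketch using only the standard library; the example scores are hypothetical, not actual model output.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw outputs for one sentence, ordered [negative, positive].
probs = softmax([-1.2, 2.3])
print(probs.index(max(probs)))  # 1, i.e. the positive class
```

Softmax preserves the ordering of the scores, so the predicted class is the same as taking the argmax of the raw outputs.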


In [49]:
test_recall = metrics.recall_score(y, predictions_test[0], average='macro')
test_f1 = metrics.f1_score(y, predictions_test[0], average='macro')
test_precision = metrics.precision_score(y, predictions_test[0], average='macro')
test_accuracy = metrics.accuracy_score(y, predictions_test[0])

In [50]:
print("Metrics results:")
print(f"Accuracy: {test_accuracy}")
print(f"F1: {test_f1}")
print(f"Precision: {test_precision}")
print(f"Recall: {test_recall}")

Metrics results:
Accuracy: 1.0
F1: 1.0
Precision: 1.0
Recall: 1.0
