
BERT in a topic modeling problem
================================

Introduction
------------

Transformer-based models can help us solve several kinds of problems, from classification and regression to more complex tasks such as text summarization or conditional language generation. Let's see how to tackle the tweet classification problem we have been working on previously, this time using the BERT model.

### Running this notebook

To run this notebook, download the required files and install the following libraries:

In [1]:
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Utils/TextNormalizer.py \
    --quiet --no-clobber --directory-prefix ./Utils/
    
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/neural/bertopic.txt \
    --quiet --no-clobber

!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/preprocessing/Normalization.txt \
    --quiet --no-clobber

!pip install -r Normalization.txt --quiet
!pip install -r bertopic.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.0 requires spacy<3.5.0,>=3.4.0, but you have spacy 2.3.5 which is incompatible.
confection 0.0.2 requires srsly<3.0.0,>=2.4.0, but you have srsly 1.0.5 which is incompatible.

In [1]:
import warnings
warnings.filterwarnings('ignore')

We load the dataset

In [2]:
import pandas as pd

tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')

In [3]:
from Utils.TextNormalizer import TweetTextNormalizer

In [4]:
normalizer = TweetTextNormalizer(
    lemmatize=False,
    stem=False,
    reduce_len=True,
    strip_handles=True,
    strip_stopwords=False,
    strip_urls=True,
    strip_accents=True,
)

In [12]:
docs = list(normalizer.transform(tweets['TEXTO']))
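The flags passed to `TweetTextNormalizer` hint at the kind of cleanup applied to each tweet before it reaches the model. The sketch below is only a rough, standard-library illustration of what options such as `strip_urls`, `strip_handles`, `reduce_len`, and `strip_accents` presumably do. `simple_normalize` is a hypothetical helper written for this example, not the implementation found in `Utils/TextNormalizer.py`, and the sample tweet is made up.

```python
import re
import unicodedata

def simple_normalize(text):
    """Hypothetical approximation of the tweet cleanup (not the course's TextNormalizer)."""
    # Remove URLs (assumed behavior of strip_urls=True)
    text = re.sub(r"https?://\S+", "", text)
    # Remove @handles (assumed behavior of strip_handles=True)
    text = re.sub(r"@\w+", "", text)
    # Shorten runs of 3+ repeated characters to exactly 3 (assumed behavior of reduce_len=True)
    text = re.sub(r"(.)\1{2,}", r"\1\1\1", text)
    # Drop accents: decompose characters, then discard the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse leftover whitespace
    return re.sub(r"\s+", " ", text).strip()

# Made-up tweet used only for illustration
example = simple_normalize("@usuario Qué buenoooooo el vídeo https://t.co/abc123")
print(example)
```

Stop words are kept and no lemmatization or stemming is applied in the notebook's configuration, which is consistent with feeding near-natural sentences to a BERT-style model.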

### Checking the available hardware

In [13]:
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

print("This notebook is running on", device)

This notebook is running on cpu


In [14]:
from transformers import pipeline

embedding_model = pipeline("feature-extraction", model="fce-m72109/mascorpus-bert-classifier") 

Some weights of the model checkpoint at fce-m72109/mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
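A `feature-extraction` pipeline returns one vector per token, while topic modeling needs a single vector per document. One common way to bridge the two is mean pooling, that is, averaging the token vectors. The sketch below illustrates that idea with plain Python lists. `mean_pool` and the toy vectors are hypothetical, and whether BERTopic pools exactly this way for a given backend is an assumption that should be checked against its documentation.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into a single document vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

# Hypothetical document with 3 tokens and 4-dimensional token embeddings
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 2.0, 0.0],
    [2.0, 4.0, 2.0, 2.0],
]

doc_vector = mean_pool(token_vectors)
print(doc_vector)
```

A real implementation would also skip padding tokens using the attention mask, which this sketch leaves out for brevity.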


In [15]:
from bertopic import BERTopic

topic_model = BERTopic(language='spanish', embedding_model=embedding_model)

In [16]:
topics, probs = topic_model.fit_transform(docs) 

In [18]:
import numpy as np

np.unique(np.asarray(topics))

array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
       16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])
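The label `-1` in the output above is the one BERTopic uses for documents it treats as outliers, i.e. documents not assigned to any topic; the remaining labels are the topics found. To see how documents are distributed across them, the assignments can be tallied. A minimal sketch, using a made-up `example_topics` list that stands in for the `topics` returned by `fit_transform`:

```python
from collections import Counter

# Hypothetical topic assignments; in the notebook this would be `topics`
example_topics = [-1, 0, 0, 1, -1, 2, 0, 1]

counts = Counter(example_topics)
outliers = counts.pop(-1, 0)      # documents labelled as outliers
assigned = sum(counts.values())   # documents that received a topic

print("Outliers:", outliers)
print("Assigned:", assigned)
for topic_id, size in counts.most_common():
    print(f"Topic {topic_id}: {size} documents")
```

BERTopic also exposes `topic_model.get_topic_info()`, which returns a similar summary as a table together with representative words for each topic.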

In [19]:
topic_model.visualize_topics()

In [20]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer('fce-m72109/mascorpus-bert-classifier')
embeddings = sentence_model.encode(docs, show_progress_bar=False)

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/fce-m72109_mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We run the visualizations with the original embeddings

In [21]:
topic_model.visualize_documents(docs, embeddings=embeddings)

Alternatively, we can reduce the dimensionality of the embeddings beforehand so that the visualization runs much faster

In [22]:
from umap import UMAP

reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
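The UMAP call above uses `metric='cosine'`, so documents are compared by the angle between their embedding vectors rather than by vector length. As a small illustration of that metric (not of UMAP itself), cosine distance can be sketched with the standard library as one minus the cosine similarity. `cosine_distance` and the toy vectors are hypothetical names and values chosen for this example.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Same direction, different length: distance is (numerically) close to 0
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: no shared direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])

print(same_direction, orthogonal)
```

Because the magnitude of the vectors is ignored, two tweets whose embeddings point in the same direction are treated as close even if one vector is much longer than the other.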