TODO:
1. Test coherence with different hyperparameters (or simply settle on one set, which is the option I prefer).
2. Generate the "ministries" by clustering topics. Using hierarchical topics is simpler, but some higher-level topics are noisy and merge completely unrelated themes (e.g. football + health???).
3. Generate the news dataset and the topics dataset, and start contextualizing the data over time and by comuna.
4. Read up on knowledge graphs; I really like this idea :p


In [None]:
# Suppress the verbose warnings emitted by the BERT model
import warnings
warnings.filterwarnings("ignore") 

from tqdm import tqdm
from tqdm.notebook import tqdm_notebook
tqdm.pandas()

import pandas as pd

import sys
sys.path.append('scripts/')

On [preprocessing](https://github.com/MaartenGr/BERTopic/issues/40), in the words of Maarten Grootendorst, the author of BERTopic:


_"In general, no, you do not need to preprocess your data. Like you said, keeping the original structure of the text is especially important for transformer-based models to understand the context._

_However, there are exceptions to this. For example, if you were to have scraped documents with a lot of html tags, then it might be beneficial to remove those as they do not provide any interesting context."_
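
For the exception mentioned in the quote, a minimal sketch of tag removal could look like the following. The helper name `strip_html` is hypothetical (it is not part of `preprocess` or BERTopic), and the regex is a crude approximation that assumes reasonably well-formed markup; a real HTML parser would be more robust.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </b>, <a href="...">
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)      # replace each tag with a space
    unescaped = html.unescape(no_tags)   # turn &amp; into &, etc.
    return re.sub(r"\s+", " ", unescaped).strip()


print(strip_html("<p>Hola &amp; <b>bienvenidos</b></p>"))  # Hola & bienvenidos
```

If it were needed here, it could be applied before building `docs`, for example with `df.content.map(strip_html)`.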

In [None]:
from preprocess import filter_by_media
from preprocess import cluster_by_month
from preprocess import find_cities

df = pd.read_csv("data/loslagos-comunas.csv")[:100]
df = cluster_by_month(filter_by_media(df))
df = df.drop_duplicates(subset='content', keep="first")
df.drop(columns=['comuna'], axis=1, inplace=True)
df['cities'] =  df.content.progress_apply(lambda x: find_cities(str(x)))
docs = df.content.tolist()

print("number of news:", len(df))
df.head(5)

### Topic modeling with [BERTopic](https://github.com/MaartenGr/BERTopic) (+[SentenceTransformer](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) +[Word Embeddings](https://github.com/dccuchile/spanish-word-embeddings))

In [None]:
from gensim.models import KeyedVectors
from bertopic.backend import WordDocEmbedder
from sentence_transformers import SentenceTransformer

ft  = KeyedVectors.load_word2vec_format("data/SBW-vectors-300-min5.bin.gz", binary=True) 
embedding_model = SentenceTransformer("all-mpnet-base-v2")
word_doc_embedder = WordDocEmbedder(embedding_model=embedding_model, word_embedding_model=ft)

In [None]:
from topic_modeling import model_definition

topic_model = model_definition(word_doc_embedder)
topic_model.get_params()

In [None]:
topics, probs = topic_model.fit_transform(docs)

clusters = topic_model.get_topic_info()
clusters

#### Evaluation: Coherence Score

There is no single threshold that makes a coherence score good or bad. The value depends on the data it is calculated from: a score of 0.5 might be good enough in one case and unacceptable in another. The only rule is that we want to **maximize** the score.

Usually, the coherence score increases with the number of topics, but the gain shrinks as the number of topics grows. The elbow technique handles this trade-off: plot the coherence score as a function of the number of topics and select the number of topics at the elbow of the curve.

The idea is to choose the point after which the diminishing gain in coherence is no longer worth the additional topics.
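
One common way to locate the elbow without reading it off a plot is to pick the point farthest from the straight line joining the first and last points of the curve. The sketch below assumes that heuristic; the function name `find_elbow` is hypothetical, and the numbers are illustrative placeholders, not results from this notebook.

```python
def find_elbow(n_topics, coherence):
    """Return the number of topics farthest from the chord joining the end points.

    Assumes n_topics is sorted and that the first and last coherence
    values differ (otherwise the normalisation divides by zero).
    """
    x0, x1 = n_topics[0], n_topics[-1]
    y0, y1 = coherence[0], coherence[-1]
    # Normalise both axes to [0, 1] so neither scale dominates.
    xs = [(x - x0) / (x1 - x0) for x in n_topics]
    ys = [(y - y0) / (y1 - y0) for y in coherence]
    # After normalisation the chord is y = x, so the distance is proportional to |y - x|.
    distances = [abs(y - x) for x, y in zip(xs, ys)]
    return n_topics[distances.index(max(distances))]


# Illustrative placeholder values only
candidate_sizes = [5, 10, 15, 20, 25, 30]
scores = [0.30, 0.42, 0.48, 0.50, 0.51, 0.515]
print(find_elbow(candidate_sizes, scores))  # 15
```

In practice the scores would come from refitting the model with each candidate number of topics and recomputing the coherence for each fit.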

In [None]:
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
# Top words per topic; the -1 skips the outlier topic and assumes it is present
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics)) - 1)]
# Evaluate
cv_coherence_model = CoherenceModel(topics=topic_words,
                                    texts=tokens,
                                    corpus=corpus,
                                    dictionary=dictionary,
                                    coherence='c_v')

umass_coherence_model = CoherenceModel(topics=topic_words,
                                       texts=tokens,
                                       corpus=corpus,
                                       dictionary=dictionary,
                                       coherence='u_mass')

c_npmi_coherence_model = CoherenceModel(topics=topic_words,
                                        texts=tokens,
                                        corpus=corpus,
                                        dictionary=dictionary,
                                        coherence='c_npmi')

cv_coherence = cv_coherence_model.get_coherence()
umass_coherence = umass_coherence_model.get_coherence()
c_npmi_coherence = c_npmi_coherence_model.get_coherence()

print(cv_coherence, umass_coherence, c_npmi_coherence)

#### Topics 

In [None]:
clusters['most_freq_tokens'] = clusters.Topic.progress_apply(lambda x: topic_model.get_topic(x))

In [None]:
clusters

In [None]:
df['topic'] = ""

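# Label each row with its topic.
# Strip the numeric prefix from each generated label, e.g. "0_word1_word2" -> "word1_word2".
labels = [item.partition("_")[2] for item in topic_model.generate_topic_labels()]

# docs was built from df.content, so topics is aligned with the rows of df.
# The +1 offset assumes the outlier topic (-1) exists and comes first in labels.
df['topic'] = [labels[topic + 1] for topic in topics]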

In [None]:
df.head(5)

#### Hierarchical clustering

In [None]:
from scipy.cluster import hierarchy as sch

# Hierarchical topics
linkage_function = lambda x: sch.linkage(x, 'ward', optimal_ordering=True)
hierarchical_topics = topic_model.hierarchical_topics(docs, linkage_function=linkage_function)

In [None]:
pd.set_option("display.max_columns", 20, 'display.max_colwidth', 50)
hierarchical_topics.head(4)

In [None]:
fig=topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_image("img/htopics2.png")
fig

<img src="https://raw.githubusercontent.com/rickiwasho/proyecto-titulo/main/img/htopics2xddddd.png">

In [None]:
# Collect the topic ids grouped under selected parent topics
health = hierarchical_topics[hierarchical_topics['Parent_Name'] == "casos_contagios_salud_dosis_casos activos"].Topics
health2 = hierarchical_topics[hierarchical_topics['Parent_Name'] == "cáncer_pacientes_salud_enfermedad_enfermedades"].Topics
sports = hierarchical_topics[hierarchical_topics['Parent_Name'] == "partido_equipo_club_torneo_final"].Topics
russia_ucraine = hierarchical_topics[hierarchical_topics['Parent_Name'] == "ucrania_rusia_ruso_putin_guerra"].Topics
seafood = hierarchical_topics[hierarchical_topics['Parent_Name'] == "mariscos_marea roja_pesca_toxinas_extracción"].Topics
crime = hierarchical_topics[hierarchical_topics['Parent_Name'] == "tribunal_fiscal_juicio_fiscalía_víctima"].Topics
world = hierarchical_topics[hierarchical_topics['Parent_Name'] == "johnson_bolivia_gobierno_primer ministro_peso argentino"].Topics
politics = hierarchical_topics[hierarchical_topics['Parent_Name'] == "presidente_gobierno_constitucional_comisión_votos"].Topics
art = hierarchical_topics[hierarchical_topics['Parent_Name'] == "música_artista_artistas_arte_festival"].Topics

In [None]:
#topic_model.save("out/save2", save_embedding_model=True)

In [None]:
#my_model = BERTopic.load("out/save2")
#new_topics, new_probs = my_model.transform(docs)
#my_model.get_topic_info()

#### Topics over time

In [None]:
timestamps = df.date.tolist()

# This is very expensive (I think because of the word embeddings)
topics_over_time = topic_model.topics_over_time(docs=docs, 
                                                timestamps=timestamps, 
                                                global_tuning=False, 
                                                evolution_tuning=False, 
                                                nr_bins=20)

In [None]:
topics_over_time.head(4)

In [None]:
fig = topic_model.visualize_topics_over_time(topics_over_time, topics=art)
fig.write_image("img/dtmt2v.png")

<img src="https://raw.githubusercontent.com/rickiwasho/proyecto-titulo/main/img/dtmt2v.png">

### 5 _most important_ keywords of documents using [KeyBERT](https://github.com/MaartenGr/KeyBERT) (+[SentenceTransformer](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) +[Word Embeddings](https://github.com/dccuchile/spanish-word-embeddings))

In [None]:
from keyword_extraction import extract_ngram_keywords

df['2gram_keywords'] = extract_ngram_keywords((2,2), word_doc_embedder, docs)
df['3gram_keywords'] = extract_ngram_keywords((3,3), word_doc_embedder, docs)

df.head(5)

### Sentiment Analysis using [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased?text=Mi+nombre+es+%5BMASK%5D+y+vivo+en+Nueva+York.) + Sentiment Analysis/Emotional Analysis using [roBERTuito](https://huggingface.co/pysentimiento/robertuito-sentiment-analysis?text=Te+quiero.+Te+amo.)

In [None]:
#!pip install pysentimiento

In [None]:
# roBERTuito
from pysentimiento import create_analyzer
sentiment_analyzer = create_analyzer(task="sentiment", lang="es")
emotion_analyzer = create_analyzer(task="emotion", lang="es")

In [None]:
# BETO
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
model_name = "finiteautomata/beto-sentiment-analysis"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [None]:
# Assumption: `sub` (the subset of news to analyze) is not defined above, so the
# full df is used here; it is also assumed to have `title` and `text` columns.
sub = df.copy()
sub['title_sentiment_roBERTuito'] = ""
sub['title_emotion_roBERTuito'] = ""
sub['title_sentiment_BETO'] = ""
sub['text_sentiment_BETO'] = ""

for index, row in tqdm(sub.iterrows(), desc='sub rows - sentiment', total=sub.shape[0]):
    # Analyze the news title
    sub.at[index, "title_sentiment_roBERTuito"] = sentiment_analyzer.predict(row['title'])
    sub.at[index, "title_emotion_roBERTuito"] = emotion_analyzer.predict(row['title'])
    sub.at[index, 'title_sentiment_BETO'] = nlp(row['title'])

    # Analyze the news body: split on "." and count the label of each fragment
    counts = {"NEU": 0, "NEG": 0, "POS": 0}
    for text in row['text'].split("."):
        label = nlp(text)[0].get('label')
        if label in counts:
            counts[label] += 1

    sub.at[index, "text_sentiment_BETO"] = counts
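
The per-sentence counts stored in `text_sentiment_BETO` are hard to compare across articles of different lengths. A possible way to summarize them, sketched below, is to convert the counts to shares and keep the majority label. The helper name `summarize_sentiment` is hypothetical, the example counts are made up for illustration, and ties are resolved in favor of whichever label appears first in the dictionary.

```python
def summarize_sentiment(counts):
    """Turn label counts into (majority_label, shares).

    Returns (None, {}) when there is nothing to count.
    """
    total = sum(counts.values())
    if total == 0:
        return None, {}
    shares = {label: n / total for label, n in counts.items()}
    # max() keeps the first label encountered when two shares are equal
    majority = max(shares, key=shares.get)
    return majority, shares


# Made-up counts for illustration
majority, shares = summarize_sentiment({"NEU": 6, "NEG": 3, "POS": 1})
print(majority, shares)
```

It could be applied column-wise, for example with `sub["text_sentiment_BETO"].map(summarize_sentiment)`.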

In [None]:
pd.set_option("display.max_columns", 100, 'display.max_colwidth', None)
sub[['title','title_sentiment_roBERTuito', 'title_emotion_roBERTuito','title_sentiment_BETO',"text_sentiment_BETO"]]