<a href="https://colab.research.google.com/github/rfahrn/Exercise_3/blob/main/ex_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install bertopic

In [4]:
from bertopic import BERTopic
import pandas as pd

## Data scarcity issue

First I tried `nrows=10` to speed up the exploratory part with a small sample. But I kept getting the error `zero-size array to reduction operation maximum which has no identity`, even after increasing to 100 rows. Luckily I found an issue in `MaartenGr/BERTopic` about this: with too little data, there is not enough text from which to extract any meaningful topics.
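To fail with a readable message instead of that NumPy error, a small guard can run before fitting. This is only a sketch: `require_min_docs` and its `min_docs` parameter are hypothetical names, not part of BERTopic, and the right threshold is something to find by trial rather than a documented limit.

```python
def require_min_docs(docs, min_docs):
    """Return the usable documents, or raise if there are too few to fit on.

    `min_docs` is a hypothetical, user-chosen threshold (an assumption of this
    sketch), not a value defined by BERTopic.
    """
    # Blank or non-string entries carry no text to cluster, so skip them.
    usable = [d for d in docs if isinstance(d, str) and d.strip()]
    if len(usable) < min_docs:
        raise ValueError(
            f"Only {len(usable)} usable documents; expected at least {min_docs}."
        )
    return usable


# Example: a 10-document sample checked against a threshold of 100.
tiny_sample = ["Der Bundesrat lehnt die Initiative ab."] * 10
try:
    require_min_docs(tiny_sample, min_docs=100)
except ValueError as err:
    print(err)
```

Calling this on `docs` right before `fit_transform` would surface the problem before any embedding or clustering work is done.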

I then increased to `nrows=500`, since the issue suggested that this was roughly the minimum amount of data required. Then I had another problem:

In [None]:
# Change to nrows=5563 for the full dataset
df = pd.read_json('https://files.ifi.uzh.ch/cl/siclemat/lehre/fs21/tm/data/all_de_topics.jsonl', lines=True, nrows=500)
docs = df.iloc[:, 0].tolist()  # use the first column as the document text

topic_model = BERTopic(verbose=True, language='German')
topics, probs = topic_model.fit_transform(docs)

topic_model.get_topic_info()

The topics were full of stopwords! I found yet another issue in `MaartenGr/BERTopic` about this, which said that, once again, too little data results in stopwords flooding the topics.

## Stopword issue

For development I didn't want to increase the data, though, since training takes so long. Instead I used the code from that issue, which passes a `CountVectorizer` with German stop words as the `vectorizer_model`. I'll remove this step for the final model.
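To see why a stop word list helps, here is a toy illustration in plain Python. It is not what `CountVectorizer` does internally: the three sentences, the `top_terms` helper, and the short `GERMAN_STOPS` set are all made up for this sketch (the notebook itself uses NLTK's full German list).

```python
from collections import Counter
import re

# Tiny hand-picked stop list, for illustration only.
GERMAN_STOPS = {"der", "die", "das", "dem", "und", "ist", "für", "mit", "nicht"}


def top_terms(docs, stop_words=frozenset(), n=3):
    """Return the n most frequent lowercase word tokens, skipping stop words."""
    counts = Counter()
    for doc in docs:
        # \w+ matches runs of word characters, including umlauts.
        tokens = re.findall(r"\w+", doc.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return [term for term, _ in counts.most_common(n)]


docs = [
    "Die Armee ist für die Sicherheit der Schweiz zuständig.",
    "Der Bundesrat und die Kommission beraten die Initiative.",
    "Die Initiative ist nicht mit dem Abkommen vereinbar.",
]

print(top_terms(docs))                # function words such as "die" dominate
print(top_terms(docs, GERMAN_STOPS))  # content words such as "initiative" surface
```

With only a handful of documents, the function words outnumber everything else, which matches what happened to the topic labels on 500 rows.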

In [9]:
# If stopwords go away with the full dataset, you don't need to run this cell

from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Pass a list: recent scikit-learn versions reject a set for stop_words
stops = stopwords.words('german')
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stops)

topic_model = BERTopic(verbose=True, language='German', vectorizer_model=vectorizer_model)
topics, probs = topic_model.fit_transform(docs)

topic_model.get_topic_info()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

2022-04-22 13:45:46,089 - BERTopic - Transformed documents to Embeddings
2022-04-22 13:45:50,962 - BERTopic - Reduced dimensionality with UMAP
2022-04-22 13:45:51,001 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-04-22 13:45:56,808 - BERTopic - Reduced number of topics from 10 to 10


| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 129 | 0_kommission_bundesrat_initiative_geht |
| 1 | -1 | 109 | -1_schweiz_mehr_bundesrat_heute |
| 2 | 1 | 90 | 1_bundesrat_initiative_mehr_müssen |
| 3 | 2 | 60 | 2_schweiz_schweizer_mehr_eu |
| 4 | 3 | 27 | 3_schweiz_personenfreizügigkeit_zuwanderung_pr... |
| 5 | 4 | 24 | 4_armee_000_mehr_soldaten |
| 6 | 5 | 19 | 5_landwirtschaft_feldwerbung_schweiz_mehr |
| 7 | 6 | 16 | 6_franken_armee_milliarden_milliarden franken |
| 8 | 7 | 15 | 7_abkommen_eu_schweiz_geht |
| 9 | 8 | 11 | 8_eu_bundesrat_schweiz_europarat |


That looks much better!

## Visualizing the topics

In [11]:
topic_model.visualize_topics()

This looks super clean! I'm curious what will happen with the full dataset.

In [12]:
topic_model.visualize_barchart()

In [13]:
topic_model.visualize_heatmap()

These visualizations look cool, but Topic -1, the bucket for outlier documents that weren't assigned to any cluster, is kind of annoying, and I can't seem to find a way to remove it.
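One workaround, if hiding the -1 row in my own summaries is good enough, is to filter it out after the fact. The sketch below uses plain Python on the (Topic, Count) pairs copied from the `get_topic_info()` output above; `OUTLIER_TOPIC` and the variable names are my own, and nothing here changes the fitted model or its plots.

```python
# (Topic, Count) pairs copied from the get_topic_info() output above.
topic_counts = [
    (0, 129), (-1, 109), (1, 90), (2, 60), (3, 27),
    (4, 24), (5, 19), (6, 16), (7, 15), (8, 11),
]

OUTLIER_TOPIC = -1  # label for documents not assigned to any cluster

total_docs = sum(count for _, count in topic_counts)
outlier_docs = sum(count for topic, count in topic_counts if topic == OUTLIER_TOPIC)
outlier_share = outlier_docs / total_docs

# Keep only the real topics for reporting; the model itself is untouched.
real_topics = [(topic, count) for topic, count in topic_counts if topic != OUTLIER_TOPIC]

print(f"{outlier_docs} of {total_docs} documents ({outlier_share:.1%}) are outliers")
print(real_topics)
```

On this 500-row sample about a fifth of the documents land in Topic -1, so it will be worth checking whether that share shrinks on the full dataset.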