# Topic modelling with Gensim

In this notebook, Gensim is used to find topics in the documents. We first study coherence measures to choose the number of topics in the collection, then we visualize the results.

The basic idea is to study how topics differ across time periods. We therefore first present the methodology for a single era, and then apply it to all eras.

In [11]:
import sys
sys.path.append("..")

from src.dataset import Dataset

# utils
import json
import random 
import numpy as np
from collections import defaultdict
# topic modelling
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaMulticore, LdaModel

# visualization
from pprint import pprint
import matplotlib.pyplot as plt
import pyLDAvis
from pyLDAvis import gensim_models

In [12]:
import warnings
warnings.filterwarnings('ignore')

### Loading the tokenized documents

In [13]:
dataset = Dataset()
# load only the year specified
# year = None  # loads everything
year = None  # load only that twenty-year period
sample_size = 20000 

# tokens = dataset.load_dataset(year=year, tokens=True)

tokens = random.sample(dataset.load_dataset(year=year, tokens=True), sample_size)

In [14]:
len(tokens)

20000

In [15]:
# corpus-wide term frequency of every token
freq = defaultdict(int)
for doc in tokens:
    # use `for w in set(doc):` instead to count document frequency
    for w in doc:
        freq[w] += 1

In [16]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carbine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

interesting_set = set(narcotics + weapons + investigation)

In [17]:
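def sel_criterium(w):
    # keep domain keywords unconditionally; otherwise keep words of at least
    # 3 characters that are neither too rare nor too frequent in the sample
    return (w in interesting_set) or ((len(w) >= 3) and (10 < freq[w] < 0.5*len(tokens)))

tokens = [[w for w in doc if sel_criterium(w)] for doc in tokens]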

In [18]:
sum([len(x) for x in tokens])

8552342
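
The selection rule above can be tried on a small scale. The following is a minimal sketch on a hypothetical four-document corpus, with thresholds scaled down (`min_count=1`, `max_count=4`) so that the effect is visible; `docs`, `whitelist` and `keep` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus, only to illustrate the filter.
docs = [
    ["the", "court", "heard", "the", "drug", "case"],
    ["the", "court", "adjourned", "the", "case"],
    ["the", "jury", "heard", "the", "court"],
    ["the", "case", "was", "heard", "by", "the", "court"],
]
whitelist = {"drug"}  # stands in for interesting_set

# Corpus-wide term counts, like the freq dictionary in the notebook.
freq = Counter(w for doc in docs for w in doc)

def keep(word, min_count=1, max_count=4):
    """Keep whitelisted words, or words of 3+ characters whose
    count lies strictly between min_count and max_count."""
    return word in whitelist or (len(word) >= 3 and min_count < freq[word] < max_count)

filtered = [[w for w in doc if keep(w)] for doc in docs]
print(filtered)
```

Here "the" and "court" are dropped as too frequent, the words occurring once are dropped as too rare, and "drug" survives only because it is whitelisted.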

### Creating required structures
Topic modelling requires these three structures to work: the first maps each id to a word, the second is the tokenized collection, and the third holds, for each document, a list of (word id, frequency) pairs.

In [19]:
id2word = corpora.Dictionary(tokens)
texts = tokens
corpus = [id2word.doc2bow(text) for text in texts]

print(f"Corpus[0]: {corpus[0][:5]}...")
print(f"id2word[0]: {id2word[0]}")
print(f"Corpus[0] readable: {[(id2word[cp[0]], cp[1]) for cp in corpus[0][:5]]}...")

Corpus[0]: [(0, 1), (1, 1)]...
id2word[0]: fitch
Corpus[0] readable: [('fitch', 1), ('presiding', 1)]...
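
To make the bag-of-words format concrete, here is a rough pure-Python sketch of the same idea on a hypothetical two-document collection. It is a stand-in for illustration, not Gensim's implementation: ids are assigned in order of first appearance, which is an assumption and may differ from how `corpora.Dictionary` numbers words.

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [["court", "heard", "case", "case"], ["jury", "heard", "court"]]

# Assign an integer id to each distinct word, in order of first appearance.
# (Assumption: gensim's Dictionary may number words differently.)
word2id = {}
for doc in docs:
    for w in doc:
        word2id.setdefault(w, len(word2id))
id2word_toy = {i: w for w, i in word2id.items()}

def to_bow(doc):
    """Return (word id, count) pairs for one document, sorted by id."""
    counts = Counter(word2id[w] for w in doc)
    return sorted(counts.items())

corpus_toy = [to_bow(doc) for doc in docs]
print(corpus_toy)
print([(id2word_toy[i], n) for i, n in corpus_toy[0]])
```

Each document becomes a sparse list of pairs: words that do not occur in it are simply absent rather than stored with a zero count.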


### Fitting LDA models and computing coherence measures

LDA is unsupervised, so a crucial parameter is the number of topics to look for in the collection. Rather than choosing it at random, one can compute a coherence measure for each candidate number of topics and pick the one with the highest coherence.

To do so, several models are fitted and the best one according to coherence is kept.

In [20]:
def compute_coherence_perplexity_values(corpus, id2word, tokens, k=(2, 3, 4), verbose=True):
    """Fit one LDA model per topic count in k and collect its statistics."""
    lda_models = []
    statistics = []

    for topics in k:
        if verbose:
            print(f"Fitting model with {topics} topics.")
        lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=topics,
                             random_state=100, update_every=1, chunksize=100,
                             passes=100, alpha='auto', per_word_topics=True)

        lda_models.append(lda_model)
        # c_v coherence is computed from the tokenized texts
        coherence_model = CoherenceModel(model=lda_model, texts=tokens,
                                         dictionary=id2word, coherence='c_v')

        statistics.append({"topics": topics,
                           "coherence": coherence_model.get_coherence(),
                           "log_perplexity": lda_model.log_perplexity(corpus)})

    return lda_models, statistics
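
The `log_perplexity` value stored above is, as far as the Gensim documentation describes it, a per-word likelihood bound, with the perplexity itself given by 2 raised to the negative bound. A minimal sketch of that conversion follows; `perplexity_from_bound` is a hypothetical helper and the input value is invented, not a result from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound into a perplexity, 2 ** (-bound)."""
    return 2 ** (-per_word_bound)

# Hypothetical bound, not a result from this notebook.
print(perplexity_from_bound(-8.0))
```

Under this reading, a bound closer to zero corresponds to a lower perplexity.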

In [21]:
k = range(2, 11)
lda_models, statistics = compute_coherence_perplexity_values(corpus, id2word, tokens, k)

Fitting model with 2 topics.
Fitting model with 3 topics.
Fitting model with 4 topics.
Fitting model with 5 topics.
Fitting model with 6 topics.
Fitting model with 7 topics.
Fitting model with 8 topics.


KeyboardInterrupt: 

### Plotting coherence measures
The best model is the one with four topics.

In [None]:
# plot only the topic counts for which statistics exist
plt.plot([x["topics"] for x in statistics], [x["coherence"] for x in statistics],
         label="coherence_values")
plt.title("LDA models coherence score")
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(loc='best')
plt.show()

In [None]:
best, ind = max(zip(statistics, range(len(statistics))), key=lambda x:(x[0]["coherence"], x[1]))
print(f"Best model found:\n{best}")
best_model = lda_models[ind]
print("\nBest model topics:")
best_model.print_topics()
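
The selection key above is the pair (coherence, position), so when two models have the same coherence the later one in the list wins. The sketch below shows this on hypothetical statistics whose coherence values are invented for illustration; `stats` and `first_best` are illustrative names.

```python
# Hypothetical statistics; the coherence values are invented for illustration.
stats = [
    {"topics": 2, "coherence": 0.41},
    {"topics": 3, "coherence": 0.47},
    {"topics": 4, "coherence": 0.47},
]

# Key is (coherence, position): on equal coherence the later entry wins.
best, ind = max(zip(stats, range(len(stats))),
                key=lambda x: (x[0]["coherence"], x[1]))
print(best["topics"], ind)

# With coherence alone as the key, max() returns the first maximal entry instead.
first_best = max(stats, key=lambda s: s["coherence"])
print(first_best["topics"])
```

Since the candidates are ordered by increasing number of topics, the pair key prefers the larger model on a tie, while a coherence-only key prefers the smaller one.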

### pyLDAvis visualization

This tool offers a graphical visualization of the topics. A good visualization has large topics that are far apart from each other.

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(best_model, corpus, id2word)
vis

## Putting it all together (not for now)

After presenting the methodology, we run a separate search on each era to find the best number of topics. The result is, for each era, the model that achieved the best coherence.

In [None]:
import os

tokens_dir = "../data/processed/tokens"
models_dir = "../data/models/lda"
epochs_files = [f"{tokens_dir}/{f}" for f in sorted(os.listdir(tokens_dir))]

min_topics = 8
max_topics = 13

k = range(min_topics, max_topics+1)

best_models = {}
for epoch in epochs_files:
    # extracting year name
    name = int(epoch.split("/")[-1].split(".")[0])
    print(f"Computing best model for {name}:")
    
    # creating required structures
    with open(epoch, "r") as f:
        texts = json.load(f)
    id2word = corpora.Dictionary(texts)
    corpus = [id2word.doc2bow(text) for text in texts]
    
    # fitting models and picking the best one
    lda_models, statistics = compute_coherence_perplexity_values(corpus, id2word, texts, k, verbose=False)
    best, ind = max(zip(statistics, range(len(statistics))), key=lambda x:x[0]["coherence"])
    print(f"\tBest one: {best}\n")
    best_models[name] = lda_models[ind]
    
    # saving model to disk (do not fail if the directory already exists)
    os.makedirs(f"{models_dir}/{name}", exist_ok=True)
    lda_models[ind].save(f"{models_dir}/{name}/{name}_lda.model")
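
The naming scheme used in the loop (era taken from the token file name, model stored under a directory of the same name) can also be written with `pathlib`. This is a small sketch under the assumption that token files are named after the era, for example a hypothetical `1900.json`; `model_path_for` is an illustrative helper, not part of the notebook.

```python
from pathlib import Path

def model_path_for(tokens_file, models_dir):
    """Return (era, model path) for the model belonging to one token file."""
    era = int(Path(tokens_file).stem)  # ".../1900.json" -> 1900
    return era, Path(models_dir) / str(era) / f"{era}_lda.model"

# Hypothetical file name; the real files in tokens_dir may be named differently.
era, path = model_path_for("../data/processed/tokens/1900.json", "../data/models/lda")
print(era, path.as_posix())
```

A model saved at such a path should be loadable again later with `LdaModel.load`, passing the same path as a string.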