# Topic modelling

The goal of this notebook is to run topic modelling on the dataset. For performance reasons, we restrict the analysis to the Illinois Appellate Court, which contains approximately 120k documents.

https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/#9buildldamodelwithsklearn


In [1]:
import sys
sys.path.append("..")

import json
import numpy as np
from pprint import pprint
from collections import defaultdict

from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import LatentDirichletAllocation as lda
from lda import guidedlda as glda

import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

from src.dataset import Dataset
from src.vectorizers import TokenVectorizer
from src.lda_utils import get_word_relevance, get_words_relevance, print_topics

import warnings
warnings.filterwarnings('ignore')

## Tokens loading and preprocessing

The first step is to load the tokenized documents and filter the tokens, removing those that occur too often or too rarely.

In [25]:
dataset = Dataset()
tokens = dataset.load_dataset(year=None, 
                              tokens=True, 
                              courts={"Illinois Appellate Court"})

In [26]:
len(tokens)

123915

### Filtering
We count the number of documents in which each token occurs and drop the tokens that do not meet the selection criteria.

In [27]:
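# Document frequency: each token is counted once per document
# (iterating over `doc` instead of `set(doc)` would give raw term counts).
freq = defaultdict(int)
for doc in tokens:
    for w in set(doc):
        freq[w] += 1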

In [2]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carbine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

interesting_set = set(narcotics + weapons + investigation)

In [29]:
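def sel_criterium(w):
    # Always keep the words of interest; otherwise require at least 3 characters
    # and a document frequency above 15 and below half of the corpus.
    return (w in interesting_set) or ((len(w) >= 3) and (15 < freq[w] < 0.5 * len(tokens)))

tokens = [[w for w in doc if sel_criterium(w)] for doc in tokens]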
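
To make the filter concrete, here is a minimal sketch on a toy corpus. The documents, the whitelist, the thresholds and the `keep` helper are all illustrative assumptions, not the values used above. The logic mirrors the cell: a token is kept if it is whitelisted, or if it is long enough and its document frequency falls within bounds.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["the", "court", "gun", "appeal"],
    ["the", "court", "appeal", "xy"],
    ["the", "court", "drug"],
    ["the", "verdict", "appeal"],
]
whitelist = {"gun", "drug"}

# Document frequency: count each token once per document.
doc_freq = Counter(w for doc in docs for w in set(doc))

def keep(w, min_df=1, max_ratio=0.8):
    """Keep whitelisted tokens, or tokens of length >= 3 whose
    document frequency lies strictly between min_df and max_ratio * N."""
    in_range = min_df < doc_freq[w] < max_ratio * len(docs)
    return (w in whitelist) or (len(w) >= 3 and in_range)

filtered = [[w for w in doc if keep(w)] for doc in docs]
print(filtered)
```

In this toy run, "the" is dropped for being too frequent, "verdict" for being too rare, and "xy" for being too short, while "gun" and "drug" survive only because they are whitelisted.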

### Vectorization
The next step is to vectorize the data with a count vectorizer. The results are saved to disk to avoid having to rerun the first cells each time, which also saves RAM.

In [30]:
dv = TokenVectorizer(tokens, method="count")

vectors = dv.vectors()
dv.save_vectors_vectorizer(vectors)
print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")

Vocabulary length: 38855


In [3]:
vectors, vectorizer = TokenVectorizer.load_vectors_vectorizer(method="count")
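
`TokenVectorizer` is a project helper whose internals are not shown here. As a rough sketch of what count vectorization of pre-tokenized documents can look like, the example below uses scikit-learn's `CountVectorizer` with a pass-through analyzer. Whether `TokenVectorizer` works the same way is an assumption, and the toy documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents.
toy_tokens = [
    ["court", "gun", "appeal"],
    ["court", "appeal", "appeal"],
    ["drug", "court"],
]

def identity(doc):
    """Pass-through analyzer: the documents are already token lists."""
    return doc

cv = CountVectorizer(analyzer=identity)
toy_vectors = cv.fit_transform(toy_tokens)  # sparse document-term matrix

print(toy_vectors.shape)       # (number of documents, vocabulary size)
print(sorted(cv.vocabulary_))  # tokens seen during fitting
print(toy_vectors.toarray())
```

A named function is used instead of a lambda so that the fitted vectorizer can be pickled, which matters if it is saved to disk alongside the vectors.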

## Classic LDA model
The first model used is the sklearn implementation of LDA.

In [None]:
numTopics = 3
# alpha = 50/numTopics
alpha = 0.1
beta = 0.01

lda_model = lda(n_components = numTopics, 
                doc_topic_prior= alpha, 
                topic_word_prior = beta, 
                random_state=0, 
                n_jobs=-1)

lda_output = lda_model.fit_transform(vectors)

### Printing topics words
The most relevant words found for each topic are listed below.

In [None]:
print_topics(lda_model, 
             vectorizer, 
             n_top_words=5, 
             only_interesting=False)

### Considering only words of interest
Here are only the words of interest in each topic.

In [None]:
print_topics(lda_model, 
             vectorizer, 
             n_top_words=5, 
             only_interesting=True, 
             interesting_set=interesting_set)

### Finding most relevant topics given a word, and a list of words
We use two helper functions that compute the relevance of each topic given a single word or a list of words.

In [None]:
vocab = vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(vocab))

get_word_relevance(weapons[0], word2id, vocab, lda_model, normalize=True)

In [None]:
for l in [weapons, narcotics, investigation]:
    print(get_words_relevance(l, word2id, vocab, lda_model, normalize=True))
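
The helpers `get_word_relevance` and `get_words_relevance` live in `src.lda_utils` and are not reproduced here. One plausible way to score topics for a word, shown below as an assumption rather than the project's actual implementation, is to read the word's column from the topic-word matrix (`components_` in scikit-learn) and normalize it so the scores sum to 1. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def word_relevance_sketch(word, word2id, topic_word, normalize=True):
    """Return one score per topic for `word` (hypothetical helper)."""
    column = topic_word[:, word2id[word]].astype(float)
    return column / column.sum() if normalize else column

def words_relevance_sketch(words, word2id, topic_word, normalize=True):
    """Sum the per-topic weights over all known words, then normalize."""
    ids = [word2id[w] for w in words if w in word2id]
    scores = topic_word[:, ids].sum(axis=1).astype(float)
    return scores / scores.sum() if normalize else scores

# Toy topic-word matrix: 2 topics x 3 vocabulary entries.
toy_vocab = ["gun", "drug", "court"]
toy_word2id = {w: i for i, w in enumerate(toy_vocab)}
toy_topic_word = np.array([[8.0, 1.0, 5.0],
                           [2.0, 9.0, 5.0]])

print(word_relevance_sketch("gun", toy_word2id, toy_topic_word))
print(words_relevance_sketch(["gun", "drug"], toy_word2id, toy_topic_word))
```

With normalization, the scores can be read as the share of a word's total weight held by each topic, which makes topics comparable across words of very different frequency.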

## Finding the optimal number of topics

Before proceeding with other models, we find the optimal number of topics with a grid search approach.

In [None]:
# Log likelihood: higher is better
print("Log Likelihood: ", lda_model.score(vectors))

# Perplexity: lower is better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(vectors))

# See model parameters
pprint(lda_model.get_params())
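
The relation between the two metrics in the comments above can be checked with a small worked example. The numbers are made up for illustration. The point is only that perplexity is the exponential of the negative log-likelihood per word, so a higher log-likelihood on the same corpus gives a lower perplexity.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_words):
    """Perplexity = exp(-log-likelihood / total number of word occurrences)."""
    return math.exp(-log_likelihood / n_words)

# Hypothetical scores of two models on the same 1,000-word corpus.
n_words = 1000
model_a = perplexity_from_log_likelihood(-6500.0, n_words)
model_b = perplexity_from_log_likelihood(-6000.0, n_words)

print(f"Model A perplexity: {model_a:.1f}")
print(f"Model B perplexity: {model_b:.1f}")
```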

### Grid search

In [None]:
search_params = {
    'n_components'  : [8, 10, 12]
}

fun = lda()

model = GridSearchCV(fun, param_grid=search_params, verbose=1)
model.fit(vectors)

In [None]:
best_lda_model = model.best_estimator_
print("Best Model's Params: ", model.best_params_)
print("Best Log Likelihood Score: ", model.best_score_)
print("Model Perplexity: ", best_lda_model.perplexity(vectors))
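
For readers who want to try the search without loading the full corpus, the sketch below runs the same kind of grid search on a small random count matrix. The data, the `max_iter` value and the `cv=3` setting are assumptions chosen to keep the example fast. When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's own `score` method, which for `LatentDirichletAllocation` is an approximate log-likelihood.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Small random document-term matrix (hypothetical data): 60 docs, 30 terms.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=1.0, size=(60, 30))

toy_search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=0),
    param_grid={"n_components": [2, 3, 4]},
    cv=3,  # each candidate is scored on 3 held-out folds
)
toy_search.fit(toy_counts)

# mean_test_score is the average held-out LatentDirichletAllocation.score,
# i.e. an approximate log-likelihood (higher is better).
for params, mean_score in zip(toy_search.cv_results_["params"],
                              toy_search.cv_results_["mean_test_score"]):
    print(params, round(mean_score, 1))
print("Best:", toy_search.best_params_)
```

Because each score is computed on a held-out fold, `best_score_` is not directly comparable to a log-likelihood computed on the full training matrix: the fold contains fewer words, and the log-likelihood scales with corpus size.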

### Visualization

In [None]:
panel = pyLDAvis.sklearn.prepare(best_lda_model, vectors, vectorizer)
panel

## Guided LDA approach
The next part of the notebook is a guided LDA approach. The idea is to set word priors before running LDA, so that the words of interest are grouped into the same topics. This should encourage a separation between the categories of interest.

In [None]:
vocab = vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(vocab))

In [None]:
seed_topic_list = [narcotics, investigation, weapons]
seed_topics = {}

for i, st in enumerate(seed_topic_list):
    for word in st:
        if word in word2id:
            seed_topics[word2id[word]] = i
        else:
            print(f"{word} not found in vocabulary")
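
The seeding cell above maps vocabulary ids to topic indices. The toy example below, with a made-up vocabulary and seed lists, shows the shape of the resulting dictionary and how out-of-vocabulary seed words are skipped.

```python
# Hypothetical vocabulary and seed lists for illustration.
toy_vocab = ["appeal", "cocaine", "court", "gun", "heroin", "knife"]
toy_word2id = {w: i for i, w in enumerate(toy_vocab)}

toy_seed_lists = [
    ["cocaine", "heroin", "lsd"],  # topic 0: narcotics ("lsd" is not in the vocabulary)
    ["gun", "knife"],              # topic 1: weapons
]

toy_seed_topics = {}
missing = []
for topic_id, seed_words in enumerate(toy_seed_lists):
    for word in seed_words:
        if word in toy_word2id:
            toy_seed_topics[toy_word2id[word]] = topic_id
        else:
            missing.append(word)

print(toy_seed_topics)  # word id -> seeded topic index
print(missing)
```

Since the dictionary holds one entry per word id, a word listed under two seed topics would keep only the last topic assigned to it.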

In [None]:
g_numTopics = model.best_params_["n_components"]
g_alpha = 0.1
g_beta = 0.01
g_iter = 100

glda_model = glda.GuidedLDA(n_topics=g_numTopics, 
                            n_iter=g_iter, 
                            random_state=0, 
                            refresh=10, 
                            alpha=g_alpha, 
                            eta=g_beta)

glda_model.fit(vectors, 
               seed_topics=seed_topics, 
               seed_confidence=0.90)

In [None]:
print("Guided lda topics")
print_topics(glda_model, 
             vectorizer, 
             n_top_words=10, 
             only_interesting=False)

print("\nTopics with only interesting words")
print_topics(glda_model, 
             vectorizer, 
             n_top_words=10, 
             only_interesting=True, 
             interesting_set=interesting_set)

In [None]:
for l in [weapons, narcotics, investigation]:
    print(get_words_relevance(l, word2id, vocab, glda_model, normalize=True))

In [None]:
panel = pyLDAvis.sklearn.prepare(glda_model, vectors, vectorizer)
panel