# Topic modelling

The goal is to find three topics in the collection: the first about drugs, the second about weapons, and the third about investigation.

We start with a classic LDA model and then test the guided LDA approach. We do not expect the classic model to find the three topics we want, whereas the guided one should steer the topic modelling towards them.

References:
https://medium.com/analytics-vidhya/how-i-tackled-a-real-world-problem-with-guidedlda-55ee803a6f0d


In [10]:
import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.append("..")

import json
import numpy as np
from pprint import pprint
from collections import defaultdict

from sklearn.decomposition import LatentDirichletAllocation as lda
from lda import guidedlda as glda

# import pyLDAvis.sklearn
# pyLDAvis.enable_notebook()

from src.dataset import Dataset
from src.vectorizers import TokenVectorizer
from src.lda_utils import get_word_relevance, get_words_relevance, print_topics

In [11]:
dataset = Dataset()
# load only the year specified
# year = None # load everything
year = None # load only that twenty-year period

tokens = dataset.load_dataset(year=1960, 
                              tokens=True, 
                              courts={"Illinois Supreme Court"})

In [12]:
len(tokens)

4891

### Filter tokens

In [13]:
freq = defaultdict(lambda:0)
for doc in tokens:
    for w in doc:
    # for w in set(doc):        
        freq[w] += 1

In [14]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carabine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'rifle', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

interesting_set = set(narcotics + weapons + investigation)

In [15]:
def sel_criterium(w):
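    # words of interest are always kept; other words must be long enough and neither too rare nor too common
    return (w in interesting_set) or ((len(w) >= 3) and (10 < freq[w] < 0.5*len(tokens)))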
    
tokens = [[w for w in doc if sel_criterium(w)] for doc in tokens]
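
To make the filter concrete, here is a minimal sketch on a toy corpus. The corpus, the `keep` set, the helper name `filter_tokens` and the lowered bounds (`min_count=1`, `max_ratio=0.6`) are assumptions for illustration only; the notebook uses 10 and 0.5 on the real data. As in the cell above, frequencies count every occurrence of a word, not the number of documents containing it.

```python
from collections import Counter

def filter_tokens(docs, keep, min_count=10, max_ratio=0.5, min_len=3):
    """Keep whitelisted words, plus words that are long enough and neither too rare nor too common."""
    freq = Counter(w for doc in docs for w in doc)  # total occurrences per word
    upper = max_ratio * len(docs)

    def selected(w):
        return w in keep or (len(w) >= min_len and min_count < freq[w] < upper)

    return [[w for w in doc if selected(w)] for doc in docs]

# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["the", "gun", "was", "found"],
    ["the", "court", "found", "no", "gun"],
    ["the", "court", "ruled"],
    ["the", "lsd", "charge"],
    ["a", "court", "order"],
    ["an", "order", "was", "filed"],
]

filtered = filter_tokens(toy_docs, keep={"lsd"}, min_count=1, max_ratio=0.6)
print(filtered)
```

Here "the" is dropped for being too frequent, "ruled" for being too rare, and "lsd" survives only because it is in the whitelist.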

### Vectorize the documents
The vectorizer is count-based (raw term counts, `method="count"`); we use its output to fit the LDA model.

In [16]:
dv = TokenVectorizer(tokens, method="count")

vectors = dv.vectors()
dv.save_vectors_vectorizer(vectors)
print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")

Vocabulary length: 12000
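
`TokenVectorizer` is a project class whose code is not shown here. As a rough sketch of what count vectorization of already-tokenized documents can look like, the example below uses scikit-learn's `CountVectorizer` with a callable analyzer. This is an assumption about the general technique, not a description of the project's implementation, and the toy documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents.
toy_tokens = [
    ["arrest", "gun", "robbery", "gun"],
    ["drug", "heroin", "arrest"],
    ["deed", "title", "trustee"],
]

# A callable analyzer makes CountVectorizer accept token lists instead of raw strings.
toy_vectorizer = CountVectorizer(analyzer=lambda doc: doc)
toy_vectors = toy_vectorizer.fit_transform(toy_tokens)

gun_id = toy_vectorizer.vocabulary_["gun"]
print(toy_vectors.shape)        # documents x vocabulary terms
print(toy_vectors[0, gun_id])   # how many times "gun" occurs in the first document
```

LDA models word counts, which is why raw counts rather than tf-idf weights are the usual input.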


### Loading precomputed vectors

In [17]:
vectors, vectorizer = TokenVectorizer.load_vectors_vectorizer(method="count")

## Classic LDA model

The number of topics is set to ten. Alpha and beta are set to small values (0.1 and 0.01); the commented-out alternative, alpha = 50/numTopics, is the heuristic proposed in the literature.

Griffiths TL, Steyvers M (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.

In [18]:
numTopics = 10
# alpha = 50/numTopics
alpha = 0.1
beta = 0.01

lda_model = lda(n_components = numTopics, 
                doc_topic_prior= alpha, 
                topic_word_prior = beta, 
                random_state=0, 
                n_jobs=-1)

lda_output = lda_model.fit_transform(vectors)
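
For readers unfamiliar with the two outputs of the fit, the following minimal sketch runs the same estimator on random counts. The toy matrix and the choice of three topics are assumptions for illustration; only the shapes and the normalization of the document-topic matrix are the point.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(20, 30))  # 20 documents, 30 vocabulary terms

toy_lda = LatentDirichletAllocation(
    n_components=3,
    doc_topic_prior=0.1,    # alpha: small values favor few topics per document
    topic_word_prior=0.01,  # beta: small values favor few words per topic
    random_state=0,
)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)            # one row per document, one column per topic
print(toy_lda.components_.shape)   # one row per topic, one column per term
print(doc_topics.sum(axis=1)[:3])  # each document's topic shares sum to 1
```

The rows of `components_` are unnormalized topic-word weights, which is why the topic listings below show large numbers rather than probabilities.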

### Topics relevant words

The next step is to check the top words for each topic. The results are interesting and as expected, but we cannot see a distinction between the topics we want.

In [19]:
print_topics(lda_model, 
             vectorizer, 
             n_top_words=10, 
             only_interesting=False)


Topic: 0
676.74*safety + 662.33*driver + 652.65*teacher + 617.18*negligence + 612.1*care + 603.65*education + 567.83*truck + 561.69*standard + 547.29*operate + 534.6*church

Topic: 1
1618.18*deed + 1185.82*title + 945.12*grand + 868.01*contempt + 785.54*probate + 744.28*trustee + 656.57*testator + 541.41*decree + 507.08*executor + 503.53*document

Topic: 2
4950.84*arrest + 3712.68*crime + 2574.77*robbery + 1985.23*identification + 1956.09*guilt + 1835.01*gun + 1673.84*station + 1557.41*suppress + 1459.09*room + 1410.36*door

Topic: 3
1723.44*zone + 1534.1*lot + 1262.89*park + 1035.94*north + 1027.17*road + 1014.97*avenue + 991.85*tract + 937.69*south + 879.07*east + 852.05*propose

Topic: 4
916.26*insure + 888.98*automobile + 773.14*coverage + 737.14*vehicle + 595.17*insurer + 463.89*clause + 421.56*check + 393.01*uninsured + 360.32*book + 350.14*life

Topic: 5
1046.38*decree + 985.11*wife + 947.48*mother + 842.82*divorce + 796.36*joint + 784.78*husband + 768.71*sign + 767.75*minor + 

### Consider only words of interest
We now print the word distribution, considering only the interesting words.

In [20]:
print_topics(lda_model, 
             vectorizer, 
             n_top_words=10, 
             only_interesting=True, 
             interesting_set=interesting_set)


Topic: 0
55.11*drug + 4.06*assault + 3.21*crime + 0.94*arrest + 0.01*pistol + 0.01*cocaine + 0.01*derringer + 0.01*overdose + 0.01*recidivism + 0.01*weapon

Topic: 1
13.02*crime + 2.89*sword + 1.84*arrest + 0.97*theft + 0.39*assault + 0.08*weapon + 0.04*knife + 0.01*handgun + 0.01*gun + 0.01*gang

Topic: 2
4950.84*arrest + 3712.68*crime + 2574.77*robbery + 1835.01*gun + 992.87*rape + 812.89*weapon + 661.04*assault + 588.91*knife + 487.17*drug + 428.6*heroin

Topic: 3
19.17*drug + 2.49*lsd + 1.6*theft + 1.08*serial + 0.01*blunt + 0.01*assault + 0.01*gang + 0.01*firearm + 0.01*arrest + 0.01*crime

Topic: 4
17.88*assault + 11.65*theft + 1.87*crime + 1.61*shotgun + 0.4*drug + 0.29*rape + 0.01*arrest + 0.01*gang + 0.01*firearm + 0.01*blunt

Topic: 5
42.53*lsd + 22.35*drug + 3.91*blunt + 1.97*crime + 0.33*theft + 0.3*killer + 0.02*cocaine + 0.01*assault + 0.01*marijuana + 0.01*sword

Topic: 6
1303.23*crime + 758.41*arrest + 522.52*drug + 481.25*robbery + 340.68*theft + 202.71*marijuana + 18

### Finding most relevant topics given a word, and a list of words

We now test how much the interesting words get merged together in the topics.

In [21]:
vocab = vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(vocab))

In [22]:
get_word_relevance(weapons[0], word2id, vocab, lda_model, normalize=True)

{0: 0.001,
 1: 0.001,
 2: 99.995,
 3: 0.001,
 4: 0.001,
 5: 0.001,
 6: 0.001,
 7: 0.001,
 8: 0.001,
 9: 0.001}

In [23]:
for l in [weapons, narcotics, investigation]:
    print(get_words_relevance(l, word2id, vocab, lda_model, normalize=True))

{0: 0.075, 1: 0.062, 2: 96.064, 3: 0.003, 4: 0.35, 5: 0.072, 6: 3.057, 7: 0.185, 8: 0.002, 9: 0.129}
{0: 2.375, 1: 0.003, 2: 48.801, 3: 0.935, 4: 0.02, 5: 2.795, 6: 40.508, 7: 2.325, 8: 1.712, 9: 0.526}
{0: 0.027, 1: 0.1, 2: 80.274, 3: 0.017, 4: 0.087, 5: 0.017, 6: 19.328, 7: 0.063, 8: 0.043, 9: 0.045}


The result shows that narcotics are split between topics 2 and 6 (about 49 and 41 percent), weapons fall almost entirely in topic 2 (96 percent), and investigation sits mostly in topic 2 (80 percent) with the rest in topic 6 (19 percent).

We would want a distinct topic for each of the three lists.
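
The helpers `get_word_relevance` and `get_words_relevance` live in `src.lda_utils` and their code is not shown. One plausible way to obtain such per-topic percentages is to sum the topic-word weights of the given words and rescale them to 100. The function below, its name and the toy matrix are hypothetical, and the project's helpers may compute relevance differently.

```python
import numpy as np

def words_relevance(words, word2id, topic_word, normalize=True):
    """Sum the topic weights of the given words; optionally rescale to percentages."""
    ids = [word2id[w] for w in words if w in word2id]  # skip out-of-vocabulary words
    weights = topic_word[:, ids].sum(axis=1)
    if normalize:
        weights = 100 * weights / weights.sum()
    return {topic: round(float(v), 3) for topic, v in enumerate(weights)}

# Toy topic-word matrix: 2 topics x 3 words (hypothetical numbers).
toy_word2id = {"gun": 0, "drug": 1, "deed": 2}
toy_topic_word = np.array([[9.0, 1.0, 0.0],
                           [1.0, 3.0, 5.0]])

single = words_relevance(["gun"], toy_word2id, toy_topic_word)
combined = words_relevance(["gun", "drug"], toy_word2id, toy_topic_word)
print(single)
print(combined)
```

With a fitted scikit-learn model, `topic_word` would be `lda_model.components_`.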

## Finding the optimal number of topics

We now run the LDA model with a varying number of topics and compute coherence to choose the best number of topics.

The metric is designed for gensim models, but the documentation states:

*This function also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!*

In [24]:
# from tmtoolkit.topicmod.evaluate import metric_coherence_gensim

# metric_coherence_gensim(measure='c_v', 
#                         top_n=25, 
#                         topic_word_distrib=lda_model.components_, 
#                         dtm=vectors, 
#                         vocab=np.array(vectorizer.get_feature_names()), 
#                         texts=tokens)
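
The cell above is commented out. As a dependency-free alternative, the sketch below computes a UMass-style coherence directly from a document-term matrix. Note that this is a different measure from the `c_v` measure requested above, so the scores are not comparable. The function name, the toy matrices and the smoothing constant `eps` are assumptions. The function expects a dense matrix; a sparse one would need `.toarray()` first.

```python
import numpy as np

def umass_coherence(topic_word, dtm, top_n=10, eps=1.0):
    """Average UMass-style coherence over topics, from a dense document-term count matrix."""
    present = (np.asarray(dtm) > 0).astype(float)  # documents x terms, 1 if the term occurs
    scores = []
    for weights in topic_word:
        top = np.argsort(weights)[::-1][:top_n]    # term ids, highest weight first
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                co_docs = present[:, top[m]] @ present[:, top[l]]  # documents containing both terms
                docs_l = present[:, top[l]].sum()                   # documents containing the higher-ranked term
                score += np.log((co_docs + eps) / docs_l)
        scores.append(score)
    return float(np.mean(scores))

# Toy corpus: terms 0 and 1 always co-occur, as do terms 2 and 3.
toy_dtm = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]])
separated = np.array([[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.6, 0.4]])
blended = np.array([[0.6, 0.0, 0.4, 0.0], [0.0, 0.6, 0.0, 0.4]])

good = umass_coherence(separated, toy_dtm, top_n=2)
bad = umass_coherence(blended, toy_dtm, top_n=2)
print(good, bad)
```

Topics whose top words co-occur in the same documents score higher, so the value can be compared across models fitted with different numbers of topics.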

We can see that the topics blend together even when considering only the words of interest, so LDA must be guided.

## Guided LDA approach
We now guide the LDA process by setting some seed words, using the model provided by the GuidedLDA package.

In [25]:
vocab = vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(vocab))

In [26]:
seed_topic_list = [narcotics, investigation, weapons]
seed_topics = {}

for i, st in enumerate(seed_topic_list):
    for word in st:
        if word in word2id:
            seed_topics[word2id[word]] = i
        else:
            print(f"{word} not found in vocabulary")

drugs not found in vocabulary
ecstasy not found in vocabulary
ketamine not found in vocabulary
fentanyl not found in vocabulary
mafia not found in vocabulary
cybercrime not found in vocabulary
cyber not found in vocabulary
carabine not found in vocabulary
musket not found in vocabulary


In [27]:
g_numTopics = 10
g_alpha = 0.1
g_beta = 0.01
g_iter = 100

glda_model = glda.GuidedLDA(n_topics=g_numTopics, 
                            n_iter=g_iter, 
                            random_state=0, 
                            refresh=10, 
                            alpha=g_alpha, 
                            eta=g_beta)

glda_model.fit(vectors, 
               seed_topics=seed_topics, 
               seed_confidence=0.90)

INFO:lda:n_documents: 4891
INFO:lda:vocab_size: 12000
INFO:lda:n_words: 1986290
INFO:lda:n_topics: 10
INFO:lda:n_iter: 100
INFO:lda:<0> log likelihood: -21616291
INFO:lda:<10> log likelihood: -17651783
INFO:lda:<20> log likelihood: -17087954
INFO:lda:<30> log likelihood: -16932813
INFO:lda:<40> log likelihood: -16860177
INFO:lda:<50> log likelihood: -16812843
INFO:lda:<60> log likelihood: -16780606
INFO:lda:<70> log likelihood: -16758106
INFO:lda:<80> log likelihood: -16741971
INFO:lda:<90> log likelihood: -16726448
INFO:lda:<99> log likelihood: -16713549


<lda.guidedlda.GuidedLDA at 0x7f852d965310>

In [28]:
print("Guided lda topics")
print_topics(glda_model, 
             vectorizer, 
             n_top_words=10, 
             only_interesting=False)

print("\nTopics with only interesting words")
print_topics(glda_model, 
             vectorizer, 
             n_top_words=10, 
             only_interesting=True, 
             interesting_set=interesting_set)

Guided lda topics

Topic: 0
0.01*information + 0.01*drug + 0.01*agent + 0.01*informer + 0.01*client + 0.01*test + 0.01*contempt + 0.0*standard + 0.0*material + 0.0*mental

Topic: 1
0.02*arrest + 0.01*crime + 0.01*accuse + 0.01*robbery + 0.01*convict + 0.0*penitentiary + 0.0*waive + 0.0*appoint + 0.0*prosecutor + 0.0*prejudice

Topic: 2
0.01*identification + 0.01*gun + 0.01*station + 0.01*door + 0.01*drive + 0.01*crime + 0.01*room + 0.0*tavern + 0.0*observe + 0.0*guilt

Topic: 3
0.01*election + 0.01*class + 0.01*protection + 0.01*local + 0.01*administrative + 0.01*limitation + 0.01*government + 0.0*adopt + 0.0*classification + 0.0*license

Topic: 4
0.01*arbitrator + 0.01*disability + 0.01*doctor + 0.01*workman + 0.01*manifest + 0.01*week + 0.01*pain + 0.01*loss + 0.01*permanent + 0.01*leg

Topic: 5
0.02*zone + 0.01*lot + 0.01*park + 0.01*road + 0.01*north + 0.01*tract + 0.01*south + 0.01*avenue + 0.01*east + 0.01*west

Topic: 6
0.01*negligence + 0.01*vehicle + 0.01*automobile + 0.01*rec

In [29]:
for l in [weapons, narcotics, investigation]:
    print(get_words_relevance(l, word2id, vocab, glda_model, normalize=True))

{0: 0.004, 1: 7.245, 2: 92.069, 3: 0.547, 4: 0.111, 5: 0.006, 6: 0.004, 7: 0.005, 8: 0.005, 9: 0.003}
{0: 96.542, 1: 1.355, 2: 0.002, 3: 0.002, 4: 1.013, 5: 1.07, 6: 0.004, 7: 0.005, 8: 0.004, 9: 0.003}
{0: 0.666, 1: 71.659, 2: 27.348, 3: 0.001, 4: 0.062, 5: 0.002, 6: 0.259, 7: 0.001, 8: 0.001, 9: 0.001}


The overall partition is better: narcotics map to topic 0 (about 97 percent), weapons to topic 2 (about 92 percent), while investigation is split, with about 72 percent on topic 1 and 27 percent on topic 2.
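
This reading of the output can be automated with a small helper that lists, for each word list, the topics holding at least a given share. The helper name and the 10 percent threshold are assumptions; the percentages are copied from the guided LDA output above.

```python
def dominant_topics(relevance, threshold=10.0):
    """Return the topic ids whose share (in percent) is at least `threshold`, largest share first."""
    hits = [(share, topic) for topic, share in relevance.items() if share >= threshold]
    return [topic for share, topic in sorted(hits, reverse=True)]

# Percentages copied from the guided LDA output above.
guided = {
    "weapons": {0: 0.004, 1: 7.245, 2: 92.069, 3: 0.547, 4: 0.111,
                5: 0.006, 6: 0.004, 7: 0.005, 8: 0.005, 9: 0.003},
    "narcotics": {0: 96.542, 1: 1.355, 2: 0.002, 3: 0.002, 4: 1.013,
                  5: 1.07, 6: 0.004, 7: 0.005, 8: 0.004, 9: 0.003},
    "investigation": {0: 0.666, 1: 71.659, 2: 27.348, 3: 0.001, 4: 0.062,
                      5: 0.002, 6: 0.259, 7: 0.001, 8: 0.001, 9: 0.001},
}

summary = {name: dominant_topics(rel) for name, rel in guided.items()}
print(summary)
```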