# Topic modelling

The goal is to find three topics in the collection: the first about drugs, the second about weapons, and the third about investigation.

We start with a classic LDA model and then test the guided LDA approach. We do not expect the classic model to find the three topics we want, whereas the guided model should steer the topics towards them.

References:
https://medium.com/analytics-vidhya/how-i-tackled-a-real-world-problem-with-guidedlda-55ee803a6f0d


In [1]:
import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.append("..")

import json
import numpy as np
from collections import defaultdict

from sklearn.decomposition import LatentDirichletAllocation as lda
from lda import guidedlda as glda

# import pyLDAvis.sklearn
# pyLDAvis.enable_notebook()

from src.dataset import Dataset
from src.vectorizers import TokenVectorizer

In [60]:
dataset = Dataset()
# Load only the period specified by `year`; None loads everything
year = None

tokens = dataset.load_dataset(year=year, tokens=True)

### Filter the data (optional)

In [61]:
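# Term frequency: total number of occurrences of each token in the corpus
freq = defaultdict(int)
for doc in tokens:
    # Iterate over set(doc) instead to count each token once per document
    for w in doc:
        freq[w] += 1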
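
The loop above counts every occurrence of a token, while its per-document variant would count each token once per document. A minimal sketch of the difference on a toy corpus (the documents and the names `term_freq` and `doc_freq` are made up for illustration):

```python
from collections import Counter

# Toy corpus of pre-tokenized documents (hypothetical data)
docs = [
    ["gun", "gun", "arrest"],
    ["arrest", "drug"],
    ["gun", "drug", "drug", "drug"],
]

# Term frequency: every occurrence counts
term_freq = Counter(w for doc in docs for w in doc)

# Document frequency: each token counts once per document
doc_freq = Counter(w for doc in docs for w in set(doc))

print(term_freq["drug"], doc_freq["drug"])
```

Only the document frequency is bounded by the number of documents, which matters when a count is compared against a fraction of `len(tokens)`.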

In [62]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carabine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'rifle', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

interesting_set = set(narcotics + weapons + investigation)

In [63]:
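def sel_criterium(w):
    # Always keep seed words; keep any other token only if it has at least
    # three characters and a frequency between 10 and half the number of documents
    return (w in interesting_set) or ((len(w) >= 3) and (10 < freq[w] < 0.5 * len(tokens)))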

tokens = [[w for w in doc if sel_criterium(w)] for doc in tokens]
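
Because `and` binds more tightly than `or` in Python, the criterion keeps every seed word unconditionally and applies the length and frequency bounds only to the remaining tokens. A small sketch with made-up frequencies (`keep`, `whitelist`, `toy_freq`, and `n_docs` are hypothetical names, not part of the notebook):

```python
# Hypothetical whitelist and corpus statistics
whitelist = {"lsd", "gun"}
toy_freq = {"lsd": 2, "gun": 900, "the": 950, "lease": 40, "ab": 40, "rare": 3}
n_docs = 1000

def keep(w):
    # Evaluated as A or (B and C), since `and` has higher precedence than `or`
    return w in whitelist or (len(w) >= 3 and 10 < toy_freq[w] < 0.5 * n_docs)

kept = [w for w in toy_freq if keep(w)]
print(kept)
```

The whitelisted words survive even though one is too rare and the other too frequent, while ordinary tokens outside the bounds are dropped.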

### Vectorize the documents
The vectorizer is count-based (`method="count"`), not tf-idf; we use its output to fit the LDA models.
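
The cell below calls `TokenVectorizer` with `method="count"`. As a rough picture of what a count-based vectorization of pre-tokenized documents produces, here is a NumPy sketch on two toy documents; it does not reproduce the internals of `TokenVectorizer`, which are not shown in this notebook.

```python
import numpy as np

docs = [["gun", "arrest", "gun"], ["drug", "arrest"]]

# Assign a column index to every distinct token, in sorted order
vocabulary = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

# Document-term matrix: one row per document, one column per token
counts = np.zeros((len(docs), len(vocabulary)), dtype=int)
for row, doc in enumerate(docs):
    for w in doc:
        counts[row, vocabulary[w]] += 1

print(vocabulary)
print(counts)
```

Both LDA implementations used later are fitted on a matrix of this kind, with documents as rows and vocabulary entries as columns.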

In [64]:
dv = TokenVectorizer(tokens, method="count")

vectors = dv.vectors()
dv.save_vectors_vectorizer(vectors)
print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")

Vocabulary length: 89943


### Loading precomputed vectors

In [65]:
vectors, vectorizer = TokenVectorizer.load_vectors_vectorizer(method="count")

## Classic LDA model

The number of topics is set to ten. The priors alpha and beta are fixed at 0.1 and 0.01; the alternative alpha = 50/numTopics, proposed in the literature, is left commented out.

Griffiths TL, Steyvers M (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.
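
To get a feeling for what the document-topic prior controls, the sketch below draws topic mixtures from a symmetric Dirichlet with the two values of alpha that appear in the next cell, 0.1 and 50/numTopics = 5. It only uses NumPy and is independent of the models fitted in this notebook; the sample size and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 10

def mean_largest_share(alpha, n_samples=2000):
    # Draw topic mixtures from a symmetric Dirichlet(alpha) and
    # average the weight of the dominant topic in each mixture
    theta = rng.dirichlet([alpha] * num_topics, size=n_samples)
    return theta.max(axis=1).mean()

sparse = mean_largest_share(0.1)              # alpha used in the notebook
smooth = mean_largest_share(50 / num_topics)  # commented-out alternative

print(f"alpha=0.1: {sparse:.2f}, alpha=5.0: {smooth:.2f}")
```

A small alpha concentrates each sampled mixture on a few topics, whereas a large alpha spreads the weight almost evenly across all of them.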

In [66]:
numTopics = 10
# alpha = 50/numTopics
alpha = 0.1
beta = 0.01

lda_model = lda(n_components = numTopics, 
                doc_topic_prior= alpha, 
                topic_word_prior = beta, 
                random_state=0, 
                n_jobs=-1)

lda_output = lda_model.fit_transform(vectors)

### Topics relevant words

The next step is to check the top words of each topic. The results are interesting and expected, but we cannot see a distinction between the topics we want.
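
Each row of `components_` holds unnormalized weights over the vocabulary, so ranking the words of a topic amounts to sorting one row. A toy sketch with a hand-written matrix (the numbers and the vocabulary are invented):

```python
import numpy as np

toy_vocab = np.array(["arrest", "drug", "gun", "lease", "rent"])
# Two topics over five words; values play the role of lda_model.components_
components = np.array([
    [8.0, 5.0, 6.0, 0.5, 0.5],
    [0.2, 0.3, 0.5, 6.0, 3.0],
])

# Normalizing each row turns the weights into a word distribution per topic
topic_word = components / components.sum(axis=1, keepdims=True)

# Indices of the three largest weights per topic, in descending order
top = np.argsort(topic_word, axis=1)[:, ::-1][:, :3]
for k, idx in enumerate(top):
    print(k, list(toy_vocab[idx]))
```

Normalization does not change the ranking within a topic, but it makes the weights comparable across topics.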

In [67]:
n_top_words = 10

# Note: scikit-learn >= 1.0 provides get_feature_names_out() as the replacement
vocab = vectorizer.get_feature_names()
topic_words = {}
for topic, comp in enumerate(lda_model.components_): 
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    topic_words[topic] = [vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('\nTopic: %d' % topic)
    print('%s' % ', '.join(words))


Topic: 0
constitution, election, legislature, member, article, injunction, license, amendment, constitutional, vote

Topic: 1
crime, arrest, prosecutor, indictment, convict, sexual, abuse, sentencing, prosecution, juror

Topic: 2
lease, price, rate, loss, contractor, agent, rent, furnish, perform, market

Topic: 3
village, assessment, road, water, town, levy, railroad, improvement, line, commissioner

Topic: 4
track, train, railroad, injure, automobile, drive, truck, safety, declaration, side

Topic: 5
marriage, search, custody, parent, divorce, warrant, husband, father, income, marital

Topic: 6
hospital, medical, expert, test, physician, industrial, compensation, doctor, treatment, decedent

Topic: 7
door, room, gun, arrest, apartment, stop, fire, store, floor, walk

Topic: 8
trustee, execute, stock, lien, execution, equity, debt, certificate, creditor, mrs

Topic: 9
summary, dismissal, assert, limitation, allegation, civil, employer, insure, procedure, affidavit


### Consider only words of interest
We now print the top words of each topic again, restricted to the words of interest.

In [68]:
topic_words = {}
for topic, comp in enumerate(lda_model.components_): 
    word_idx = np.argsort(comp)[::-1]
    topic_words[topic] = [w for w in [vocab[i] for i in word_idx] if w in interesting_set][:n_top_words]
    
for topic, words in topic_words.items():
    print('\nTopic: %d' % topic)
    print('%s' % ', '.join(words))


Topic: 0
crime, arrest, drug, theft, firearm, assault, serial, cannabis, weapon, drugs

Topic: 1
crime, arrest, robbery, assault, drug, theft, gun, rape, weapon, gang

Topic: 2
serial, theft, sword, drugs, drug, arrest, blunt, killer, crime, firearm

Topic: 3
arrest, sword, fentanyl, crime, blunt, assault, gang, drug, rifle, firearm

Topic: 4
gang, arrest, assault, serial, theft, knife, killer, shotgun, drug, drugs

Topic: 5
arrest, cocaine, drug, cannabis, marijuana, heroin, methamphetamine, lsd, weapon, serial

Topic: 6
drug, arrest, overdose, blunt, marijuana, cannabis, cocaine, killer, methamphetamine, assault

Topic: 7
gun, arrest, crime, weapon, robbery, knife, assault, revolver, pistol, shotgun

Topic: 8
arrest, sword, blunt, serial, assault, drug, crime, theft, rifle, cyber

Topic: 9
drug, theft, sword, drugs, assault, blunt, crime, serial, arrest, firearm


We can see that the topics blend together even when we consider only the words of interest, so LDA must be guided.
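
One rough way to quantify this blending is the Jaccard overlap between the top-word lists of two topics. The sketch below applies it to three of the lists printed above; using Jaccard similarity here is an assumption of this sketch, not something the notebook computes.

```python
from itertools import combinations

# Top words of interest for three topics, copied from the output above
top_words = {
    1: "crime arrest robbery assault drug theft gun rape weapon gang".split(),
    5: "arrest cocaine drug cannabis marijuana heroin methamphetamine lsd weapon serial".split(),
    7: "gun arrest crime weapon robbery knife assault revolver pistol shotgun".split(),
}

def jaccard(a, b):
    # Size of the intersection divided by size of the union
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

overlap = {
    (i, j): round(jaccard(top_words[i], top_words[j]), 2)
    for i, j in combinations(top_words, 2)
}
print(overlap)
```

Topics 1 and 7 share six of their ten words, so neither isolates weapons from general crime vocabulary.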

## Guided LDA approach
We now guide the LDA process by assigning seed words to topics, using the model provided by the GuidedLDA package.
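
Seeding can be pictured as a biased initialization: a token whose word is a seed is assigned to its seed topic with probability `seed_confidence`, and to an arbitrary topic otherwise. The sketch below is a simplified illustration of that idea on toy data and is not the GuidedLDA implementation; `init_topics` and all values are hypothetical.

```python
import random

def init_topics(word_ids, seed_topics, n_topics, seed_confidence, rng):
    # Assign an initial topic to every token in the corpus
    assignments = []
    for w in word_ids:
        if w in seed_topics and rng.random() < seed_confidence:
            assignments.append(seed_topics[w])           # follow the seed
        else:
            assignments.append(rng.randrange(n_topics))  # no guidance
    return assignments

rng = random.Random(0)
seeds = {0: 0, 1: 2}                     # word id -> seeded topic id
word_stream = [0] * 1000 + [3] * 1000    # 1000 seeded tokens, 1000 unseeded
z = init_topics(word_stream, seeds, n_topics=10, seed_confidence=0.9, rng=rng)

seeded_share = z[:1000].count(0) / 1000
unseeded_share = z[1000:].count(0) / 1000
print(seeded_share, unseeded_share)
```

In this picture the seeds only bias the starting point; later sampling steps can still move tokens to other topics.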

In [69]:
word2id = dict((v, idx) for idx, v in enumerate(vocab))

In [70]:
seed_topic_list = [narcotics, investigation, weapons]
seed_topics = {}

for i, st in enumerate(seed_topic_list):
    for word in st:
        if word in word2id:
            seed_topics[word2id[word]] = i
        else:
            print(f"{word} not found in vocabulary")

cybercrime not found in vocabulary
carabine not found in vocabulary


In [71]:
glda_model = glda.GuidedLDA(n_topics=10, 
                       n_iter=250, 
                       random_state=0, 
                       refresh=10, 
                       alpha=alpha, 
                       eta=beta)

glda_model.fit(vectors, 
          seed_topics=seed_topics, 
          seed_confidence=0.90)

INFO:lda:n_documents: 183146
INFO:lda:vocab_size: 89943
INFO:lda:n_words: 79362124
INFO:lda:n_topics: 10
INFO:lda:n_iter: 250
INFO:lda:<0> log likelihood: -879179572
INFO:lda:<10> log likelihood: -746208315
INFO:lda:<20> log likelihood: -709967559
INFO:lda:<30> log likelihood: -705226937
INFO:lda:<40> log likelihood: -703361333
INFO:lda:<50> log likelihood: -702222837
INFO:lda:<60> log likelihood: -701393400
INFO:lda:<70> log likelihood: -700808215
INFO:lda:<80> log likelihood: -700356755
INFO:lda:<90> log likelihood: -699991570
INFO:lda:<100> log likelihood: -699691969
INFO:lda:<110> log likelihood: -699408386
INFO:lda:<120> log likelihood: -699129151
INFO:lda:<130> log likelihood: -698936185
INFO:lda:<140> log likelihood: -698809886
INFO:lda:<150> log likelihood: -698678869
INFO:lda:<160> log likelihood: -698547644
INFO:lda:<170> log likelihood: -698483851
INFO:lda:<180> log likelihood: -698380867
INFO:lda:<190> log likelihood: -698318069
INFO:lda:<200> log likelihood: -698239159
INF

<lda.guidedlda.GuidedLDA at 0x278521bbfa0>

In [72]:
topic_word = glda_model.topic_word_
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][::-1]
    interesting_topic_words = [w for w in topic_words if w in interesting_set][:n_top_words]
    print(f"Topic {i}:\n{' '.join(topic_words[:n_top_words * 2])}\n{' '.join(interesting_topic_words)}")

Topic 0:
medical hospital parent minor expert treatment physician mental doctor mother custody patient father health abuse disability test suffer physical week
drug assault overdose blunt arrest marijuana cocaine recidivism killer sword
Topic 1:
arrest crime search doubt gun robbery convict apartment indictment prosecutor identify juror sentencing warrant prosecution armed door felony drive guilt
arrest crime gun robbery assault drug weapon theft rape cocaine
Topic 2:
train track drive truck injure stop fall side railroad driver automobile industrial run safety passenger light strike operate danger hour
gang arrest knife assault gun rifle serial killer firearm fentanyl
Topic 3:
vacate affidavit final dismissal marriage civil serve relief merit post procedure represent limitation arbitration allegation october december statutory november august
sword arrest weapon drugs fentanyl firearm handgun gun gang heroin
Topic 4:
assessment election taxis levy constitution legislature commissioner

## LSI

In [73]:
from gensim.models import LsiModel

In [74]:
reverse_vocabulary = { dv.vectorizer.vocabulary_[k]:k for k in dv.vectorizer.vocabulary_}
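
LSI factorizes the term-document matrix with a truncated singular value decomposition, which is why the topics printed further down carry both positive and negative weights. A NumPy sketch on a tiny hand-made matrix (rows are terms and columns are documents, as in the transposed input passed to `LsiModel` in the next cell; the data is invented):

```python
import numpy as np

terms = ["arrest", "gun", "stock", "trustee"]
# Term-document count matrix: 4 terms x 4 documents
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 2, 3],
], dtype=float)

# Full SVD, then keep the k strongest components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
topics = U[:, :k].T   # each row: weights of the terms in one topic

for row in topics:
    print([(t, round(float(w), 2)) for t, w in zip(terms, row)])

# The rank-k reconstruction error equals the norm of the discarded singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err = np.linalg.norm(X - X_k)
```

The sign of each component is arbitrary, so only the contrast between positively and negatively weighted terms is meaningful.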

In [75]:
model = LsiModel(vectors.transpose(), id2word=reverse_vocabulary, num_topics=numTopics) 
topics = model.get_topics()

INFO:gensim.models.lsimodel:using serial LSI version on this node
INFO:gensim.models.lsimodel:updating model with new documents
INFO:gensim.models.lsimodel:using 100 extra samples and 2 power iterations
INFO:gensim.models.lsimodel:1st phase: constructing (89943, 110) action matrix
INFO:gensim.models.lsimodel:orthonormalizing (89943, 110) action matrix
INFO:gensim.models.lsimodel:2nd phase: running dense svd on (110, 183146) matrix
INFO:gensim.models.lsimodel:computing the final decomposition
INFO:gensim.models.lsimodel:keeping 10 factors (discarding 63.532% of energy spectrum)
INFO:gensim.models.lsimodel:processed sparse job of 183146 documents
INFO:gensim.utils:LsiModel lifecycle event {'msg': 'trained LsiModel(num_terms=89943, num_topics=10, decay=1.0, chunksize=20000) in 63.97s', 'datetime': '2021-11-25T03:35:53.454214', 'gensim': '4.1.2', 'python': '3.9.1 (tags/v3.9.1:1e5d33e, Dec  7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'cr

In [76]:
topWords = []
for topicno in range(numTopics):
    print('Topic {}'.format(topicno))
    print([(x, round(y, 2)) for x, y in model.show_topic(topicno, topn=10)], '\n')
    topWords.append([(x) for x, y in model.show_topic(topicno, topn=10)])

Topic 0
[('arrest', 0.09), ('crime', 0.08), ('standard', 0.07), ('united', 0.07), ('member', 0.07), ('test', 0.07), ('amendment', 0.07), ('factor', 0.07), ('abuse', 0.07), ('hospital', 0.06)] 

Topic 1
[('arrest', -0.23), ('crime', -0.18), ('stock', 0.14), ('search', -0.14), ('trustee', 0.13), ('prosecutor', -0.12), ('gun', -0.12), ('lease', 0.11), ('apartment', -0.11), ('robbery', -0.1)] 

Topic 2
[('railroad', 0.26), ('track', 0.19), ('medical', -0.18), ('hospital', -0.16), ('road', 0.16), ('train', 0.15), ('line', 0.14), ('north', 0.14), ('south', 0.13), ('east', 0.12)] 

Topic 3
[('stock', -0.25), ('trustee', -0.18), ('railroad', 0.17), ('share', -0.17), ('village', 0.16), ('road', 0.13), ('track', 0.13), ('mrs', -0.12), ('train', 0.11), ('marriage', -0.11)] 

Topic 4
[('election', -0.36), ('constitution', -0.23), ('vote', -0.21), ('ballot', -0.2), ('hospital', 0.16), ('medical', 0.15), ('legislature', -0.13), ('search', -0.13), ('constitutional', -0.12), ('amendment', -0.12)] 

To