# Topic modelling

The goal is to find three topics in the collection: the first about drugs, the second about weapons, and the third about investigation.

We start with a classic LDA model and then test the guided LDA approach. We do not expect the classic model to find the three topics we want, whereas the guided model should steer the topic modelling towards them.

References:
https://medium.com/analytics-vidhya/how-i-tackled-a-real-world-problem-with-guidedlda-55ee803a6f0d


In [23]:
import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.append("..")

import json
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation as lda
from gensim.models import LsiModel
from gensim.test.utils import common_dictionary, common_corpus
# import guidedlda

# import pyLDAvis.sklearn
# pyLDAvis.enable_notebook()

from src.dataset import Dataset
from src.vectorizers import TokenVectorizer

In [2]:
# Load the pre-filtered tokens; the context manager closes the file afterwards
with open("../data/processed/filtered_tokens.json", "r") as f:
    tokens = json.load(f)

### Vectorize the documents
The vectorizer is created with `method="count"`, so it produces raw term counts rather than tf-idf weights. We use its output to fit the LDA model.

In [3]:
dv = TokenVectorizer(tokens, method="count")

vectors = dv.vectors()
# dv.save_vectors_vectorizer(vectors)
print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")

Vocabulary length: 60750
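
`TokenVectorizer` is a project class whose internals are not shown in this notebook. As a rough sketch of what count vectorization over pre-tokenized documents could look like, assuming scikit-learn's `CountVectorizer` with an identity analyzer (the toy `docs` and the names `toy_vectors` and `vectorizer` are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy input: each document is already a list of tokens.
docs = [
    ["officer", "arrest", "drug"],
    ["contract", "property", "sale", "property"],
    ["drug", "gun", "arrest"],
]

# The documents are pre-tokenized, so the analyzer returns them unchanged.
vectorizer = CountVectorizer(analyzer=lambda doc_tokens: doc_tokens)
toy_vectors = vectorizer.fit_transform(docs)  # sparse matrix, documents x terms

print(f"Vocabulary length: {len(vectorizer.vocabulary_)}")
print(toy_vectors.toarray())
```

Each row of the matrix is a document and each column a vocabulary term, which is the layout scikit-learn's LDA implementation expects as input.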


In [None]:
# vectors, vectorizer = TokenTfidfVectorizer.load_vectors_vectorizer()

## Classic LDA model

The number of topics is set to three, while alpha and beta take the values proposed in the literature: alpha = 50 / numTopics and beta = 0.1.

Griffiths TL, Steyvers M (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.

In [5]:
numTopics = 3
alpha = 50/numTopics
beta = 0.1

lda_model = lda(n_components=numTopics,
                doc_topic_prior=alpha,
                topic_word_prior=beta,
                random_state=0,
                n_jobs=-1)

lda_output = lda_model.fit_transform(vectors)
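
To make the role of the two priors more concrete, here is a minimal sketch on a made-up count matrix (the `toy_counts` array and the `toy_lda` name are assumptions for illustration, not data from this collection). With three topics, alpha = 50/3 is about 16.67, far above scikit-learn's default of 1 / n_components, so the document-topic distributions tend to be smoothed towards uniform.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

num_topics = 3
alpha = 50 / num_topics  # document-topic prior, about 16.67 for three topics
beta = 0.1               # topic-word prior

# Hypothetical toy count matrix (4 documents x 5 terms), for illustration only.
toy_counts = np.array([
    [3, 0, 1, 0, 0],
    [0, 2, 0, 4, 1],
    [1, 0, 2, 0, 0],
    [0, 1, 0, 3, 2],
])

toy_lda = LatentDirichletAllocation(
    n_components=num_topics,
    doc_topic_prior=alpha,
    topic_word_prior=beta,
    random_state=0,
)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)        # one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row is a probability distribution
```

The fitted `components_` attribute has one row per topic and one column per vocabulary term, which is what the top-word extraction relies on.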

### Topics relevant words

The next step is to check the top words for each topic. The results are interesting and expected, but we can't see a distinction between the topics we want.

In [9]:
n_top_words = 30
vocab = dv.vectorizer.get_feature_names()  # on scikit-learn >= 1.2 use get_feature_names_out()
topic_words = {}
for topic, comp in enumerate(lda_model.components_): 
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    topic_words[topic] = [vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('\nTopic: %d' % topic)
    print('%s' % ', '.join(words))


Topic: 0
witness, officer, counsel, police, statement, car, sentence, tell, defense, instruction, guilty, offense, conviction, ask, criminal, judge, reasonable, arrest, verdict, man, call, argument, crime, prove, examination, victim, murder, admit, commit, believe

Topic: 1
property, appellant, contract, company, appellee, city, estate, land, bill, interest, sale, decree, bank, trust, payment, deed, chicago, suit, damage, agreement, money, title, business, owner, corporation, premise, purchase, sell, power, work

Topic: 2
petition, complaint, board, respondent, child, attorney, policy, hearing, dismiss, stat, claimant, service, public, par, insurance, provision, petitioner, rev, injury, proceeding, employee, department, commission, school, district, code, award, jurisdiction, duty, notice
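
The ranking step used for these lists can be isolated in a small helper. The sketch below uses a made-up topic-word matrix (`toy_components` and `toy_vocab` are assumptions for illustration) to show how sorting each row of weights yields the top words per topic.

```python
import numpy as np

def top_words(components, vocab, n_top_words):
    """Return, for each topic row, the n_top_words highest-weighted terms."""
    result = {}
    for topic, weights in enumerate(components):
        # argsort is ascending, so reverse it to get the largest weights first
        word_idx = np.argsort(weights)[::-1][:n_top_words]
        result[topic] = [vocab[i] for i in word_idx]
    return result

# Hypothetical toy topic-word matrix (2 topics x 4 terms).
toy_vocab = ["arrest", "contract", "drug", "property"]
toy_components = np.array([
    [5.0, 0.1, 3.0, 0.2],
    [0.3, 4.0, 0.1, 6.0],
])

toy_top = top_words(toy_components, toy_vocab, n_top_words=2)
print(toy_top)
```

In scikit-learn the rows of `components_` are unnormalized pseudo-counts. Dividing each row by its sum gives a word distribution per topic, but the ranking of words within a topic stays the same.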


### Consider only words of interest
We now print the top words of each topic again, this time keeping only the words of interest.

In [17]:
narcotics = ['cannabis', 'cocaine', 'methamphetamine', 'drugs', 'drug', 'marijuana', 
             'ecstasy', 'lsd', 'ketamine', 'heroin', 'fentanyl', 'overdose']

weapons = ['gun', 'knife', 'weapon', 'firearm', 'rifle', 'carabine', 'shotgun', 'handgun', 
           'revolver', 'musket', 'pistol', 'derringer', 'assault', 'rifle', 'sword', 'blunt']

investigation = ['gang', 'mafia', 'serial',  'killer', 'rape', 'theft', 'recidivism', 
                 'arrest', 'robbery', 'cybercrime', 'cyber', 'crime']

interesting_set = set(narcotics + weapons + investigation)

vocab = dv.vectorizer.get_feature_names()
topic_words = {}
for topic, comp in enumerate(lda_model.components_): 
    word_idx = np.argsort(comp)[::-1]
    topic_words[topic] = [w for w in [vocab[i] for i in word_idx] if w in interesting_set][:10]
    
for topic, words in topic_words.items():
    print('\nTopic: %d' % topic)
    print('%s' % ', '.join(words))


Topic: 0
arrest, crime, gun, robbery, drug, assault, weapon, theft, rape, cocaine

Topic: 1
sword, serial, drugs, blunt, arrest, drug, theft, assault, gang, crime

Topic: 2
drug, assault, theft, recidivism, crime, sword, drugs, firearm, overdose, arrest


We can see that the topics blend together even when considering only the words of interest, so LDA must be guided.
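
One way to make this blending visible is to count, for each topic, how many of its filtered words fall into each word group. The sketch below copies the three filtered lists from the output above; the names `groups`, `filtered_topics` and `group_counts` are introduced here for illustration only.

```python
# Word groups, as defined earlier in the notebook (duplicates removed by the sets).
groups = {
    "narcotics": {"cannabis", "cocaine", "methamphetamine", "drugs", "drug", "marijuana",
                  "ecstasy", "lsd", "ketamine", "heroin", "fentanyl", "overdose"},
    "weapons": {"gun", "knife", "weapon", "firearm", "rifle", "carabine", "shotgun",
                "handgun", "revolver", "musket", "pistol", "derringer", "assault",
                "sword", "blunt"},
    "investigation": {"gang", "mafia", "serial", "killer", "rape", "theft", "recidivism",
                      "arrest", "robbery", "cybercrime", "cyber", "crime"},
}

# Filtered top words per topic, copied from the printed output above.
filtered_topics = {
    0: ["arrest", "crime", "gun", "robbery", "drug", "assault", "weapon", "theft", "rape", "cocaine"],
    1: ["sword", "serial", "drugs", "blunt", "arrest", "drug", "theft", "assault", "gang", "crime"],
    2: ["drug", "assault", "theft", "recidivism", "crime", "sword", "drugs", "firearm", "overdose", "arrest"],
}

# For every topic, count how many of its words belong to each group.
group_counts = {
    topic: {name: sum(word in members for word in words) for name, members in groups.items()}
    for topic, words in filtered_topics.items()
}

for topic, counts in group_counts.items():
    print(f"Topic {topic}: {counts}")
```

Every topic contains words from all three groups, so none of them can be read as a dedicated drugs, weapons, or investigation topic.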

## Guided LDA approach
We now guide the LDA process by setting some seed words, using the model defined by the GuidedLDA package.

In [20]:
tf_feature_names = dv.vectorizer.get_feature_names()
word2id = dict((v, idx) for idx, v in enumerate(tf_feature_names))
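
The `word2id` mapping is what turns seed words into the column indices a guided model needs. Below is a minimal sketch of that step, assuming a made-up vocabulary and shortened seed lists (`toy_feature_names`, `toy_word2id` and `seed_topic_list` are hypothetical names). It only builds the seed dictionary and does not call the `guidedlda` package, whose import is commented out in this notebook.

```python
# Hypothetical toy vocabulary standing in for the vectorizer's feature names.
toy_feature_names = ["arrest", "cocaine", "contract", "drug", "gun", "knife", "robbery"]
toy_word2id = {word: idx for idx, word in enumerate(toy_feature_names)}

# One list of seed words per desired topic (shortened for illustration).
seed_topic_list = [
    ["drug", "cocaine", "heroin"],  # topic 0: narcotics
    ["gun", "knife"],               # topic 1: weapons
    ["arrest", "robbery"],          # topic 2: investigation
]

# Map each seed word's column index to the topic it should be pulled towards.
# Seed words missing from the vocabulary ("heroin" here) are skipped.
seed_topics = {
    toy_word2id[word]: topic_id
    for topic_id, words in enumerate(seed_topic_list)
    for word in words
    if word in toy_word2id
}

print(seed_topics)
```

As far as the GuidedLDA README describes, a dictionary of this shape is passed to the model's `fit` method through the `seed_topics` argument, together with a `seed_confidence` value that controls how strongly the seeds bias the initial assignment.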

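## LSI

We also fit a gensim LSI model with the same number of topics and print the ten highest-weighted words of each topic.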

In [21]:
reverse_vocabulary = { dv.vectorizer.vocabulary_[k]:k for k in dv.vectorizer.vocabulary_}

In [24]:
model = LsiModel(vectors.transpose(), id2word=reverse_vocabulary, num_topics=numTopics) 
topics = model.get_topics()

In [26]:
topWords = []
for topicno in range(numTopics):
    print('Topic {}'.format(topicno))
    print([(x, round(y, 2)) for x, y in model.show_topic(topicno, topn=10)], '\n')
    topWords.append([word for word, _ in model.show_topic(topicno, topn=10)])

Topic 0
[('property', 0.16), ('counsel', 0.13), ('contract', 0.12), ('officer', 0.12), ('witness', 0.11), ('statement', 0.11), ('company', 0.11), ('attorney', 0.1), ('city', 0.1), ('interest', 0.1)] 

Topic 1
[('property', -0.26), ('police', 0.2), ('contract', -0.2), ('officer', 0.19), ('counsel', 0.19), ('company', -0.18), ('statement', 0.16), ('sentence', 0.15), ('witness', 0.14), ('murder', 0.13)] 

Topic 2
[('respondent', 0.37), ('board', 0.27), ('appellant', -0.22), ('company', -0.19), ('child', 0.17), ('appellee', -0.17), ('petition', 0.16), ('district', 0.14), ('petitioner', 0.14), ('school', 0.14)] 

