# Topic modelling

The goal is to find three topics in the collection: the first about drugs, the second about weapons, and the third about investigation.

We start with a classic LDA model and then test the GuidedLDA approach. We do not expect the classic model to find the three topics we want, whereas the guided model should steer the topic modelling towards them.

References:
https://medium.com/analytics-vidhya/how-i-tackled-a-real-world-problem-with-guidedlda-55ee803a6f0d


In [1]:
import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.append("..")
import numpy as np

from sklearn.decomposition import LatentDirichletAllocation as lda
# import guidedlda

from gensim.models import LsiModel  # needed by the LSI section below
# from gensim.test.utils import common_dictionary, common_corpus
# import pyLDAvis.sklearn
# pyLDAvis.enable_notebook()

from src.dataset import Dataset
from src.vectorizers import TokenTfidfVectorizer

In [2]:
dataset = Dataset(dataset_path="", save_path="../data/processed/tokenized_processed.json")
tokens = dataset.load_text_list(field_name="tokens", size=-1)

### Vectorize the documents
The vectorizer is a TF-IDF one; we use its output to fit the LDA model.

In [3]:
dv = TokenTfidfVectorizer(tokens)

vectors = dv.vectors()
dv.save_vectors_vectorizer(vectors)
print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")

Vocabulary length: 61721
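`TokenTfidfVectorizer` is a project-specific class, so its internals are not shown here. As a rough illustration of what TF-IDF vectorization of already tokenized documents can look like, here is a minimal sketch using scikit-learn's `TfidfVectorizer` directly. The toy documents and the `identity` helper are assumptions made for the example, not part of the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return the token list unchanged (documents are already tokenized)."""
    return tokens


# Hypothetical, already tokenized documents.
docs = [
    ["defendant", "sell", "cocaine"],
    ["officer", "find", "gun"],
    ["officer", "arrest", "defendant"],
]

# A callable analyzer bypasses scikit-learn's own tokenization and lowercasing.
toy_vectorizer = TfidfVectorizer(analyzer=identity)
toy_vectors = toy_vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

print(toy_vectors.shape)
print(f"Vocabulary length: {len(toy_vectorizer.vocabulary_)}")
```

Each row of the resulting matrix is one document and each column one vocabulary term, which is the layout scikit-learn's LDA implementation expects as input.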


In [2]:
vectors, vectorizer = TokenTfidfVectorizer.load_vectors_vectorizer()

## Classic LDA model

The number of topics is set to three, while alpha and beta take values proposed in the literature (alpha = 50 / number of topics, beta = 0.1).

Griffiths TL, Steyvers M (2004). “Finding Scientific Topics.” Proceedings of the National Academy of Sciences of the United States of America, 101, 5228–5235.
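
For context, alpha and beta are the two Dirichlet priors of the standard LDA generative model. The notation below is the usual textbook one and is an assumption of this note (with $K$ topics, documents indexed by $d$ and word positions by $n$); it is not taken from the notebook itself:

$$
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
z_{d,n} \sim \mathrm{Categorical}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Categorical}(\phi_{z_{d,n}})
$$

With $K = 3$, the heuristic $\alpha = 50 / K$ gives $\alpha \approx 16.7$. In scikit-learn, $\alpha$ corresponds to `doc_topic_prior` and $\beta$ to `topic_word_prior`.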

In [8]:
numTopics = 3
alpha = 50 / numTopics  # document-topic prior
beta = 0.1              # topic-word prior

lda_model = lda(n_components=numTopics,
                doc_topic_prior=alpha,
                topic_word_prior=beta,
                random_state=0,
                n_jobs=-1)

lda_output = lda_model.fit_transform(vectors)

### Topics relevant words

The next step is to inspect the top words for each topic. The results are not promising: the topics share most of their top words and look essentially the same.

In [14]:
n_top_words = 10
vocab = vectorizer.get_feature_names()
topic_words = {}
for topic, comp in enumerate(lda_model.components_): 
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    topic_words[topic] = [vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  appellant, property, pay
Topic: 1
  appellant, jury, property
Topic: 2
  people, ned, appellee
Topic: 3
  appellant, appellee, ned
Topic: 4
  presiding, appellant, jury
Topic: 5
  appellant, ned, jury
Topic: 6
  ned, appellee, appellant
Topic: 7
  appellee, appellant, ned
Topic: 8
  curiam, appellant, jury
Topic: 9
  people, jury, property
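
One way to put a number on how similar the topics look is the Jaccard overlap between their top-word lists. The sketch below is only an illustration: it copies the first four topics from the output above, and the `jaccard` helper is introduced here for the example.

```python
from itertools import combinations

# Top words copied from the output above (first four topics only, for brevity).
printed_topics = {
    0: ["appellant", "property", "pay"],
    1: ["appellant", "jury", "property"],
    2: ["people", "ned", "appellee"],
    3: ["appellant", "appellee", "ned"],
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two word lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Pairwise overlap for every pair of topics.
overlaps = {
    (i, j): jaccard(printed_topics[i], printed_topics[j])
    for i, j in combinations(sorted(printed_topics), 2)
}

for (i, j), score in overlaps.items():
    print(f"Topics {i} and {j}: {score:.2f}")
```

A score of 1.0 means two topics have identical top words and 0.0 means they share none. Well-separated topics would show low overlap across most pairs.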


## Guided LDA approach
We now guide the LDA process by providing seed words for each topic, using the model defined in the GuidedLDA package.

In [10]:
tf_feature_names = vectorizer.get_feature_names()
word2id = {word: idx for idx, word in enumerate(tf_feature_names)}
word2id

{'aaa': 0,
 'aaas': 1,
 'aabd': 2,
 'aachen': 3,
 'aad': 4,
 'aadiich': 5,
 'aahich': 6,
 'aar': 7,
 'aara': 8,
 'aaras': 9,
 'aardvark': 10,
 'aarhich': 11,
 'aaron': 12,
 'aaronson': 13,
 'aarvold': 14,
 'aavard': 15,
 'aavay': 16,
 'aba': 17,
 'aback': 18,
 'abadia': 19,
 'abaj': 20,
 'aban': 21,
 'abandon': 22,
 'abandoned': 23,
 'abandoning': 24,
 'abandonment': 25,
 'abata': 26,
 'abatable': 27,
 'abate': 28,
 'abated': 29,
 'abatement': 30,
 'abatron': 31,
 'abb': 32,
 'abbahamson': 33,
 'abbamonto': 34,
 'abbasi': 35,
 'abbate': 36,
 'abbe': 37,
 'abbell': 38,
 'abbey': 39,
 'abbie': 40,
 'abbinante': 41,
 'abbot': 42,
 'abbott': 43,
 'abboud': 44,
 'abbreviate': 45,
 'abbreviated': 46,
 'abbreviation': 47,
 'abby': 48,
 'abc': 49,
 'abcess': 50,
 'abdallah': 51,
 'abdicate': 52,
 'abdication': 53,
 'abdill': 54,
 'abdnour': 55,
 'abdoman': 56,
 'abdomen': 57,
 'abdominal': 58,
 'abduct': 59,
 'abduction': 60,
 'abductor': 61,
 'abdul': 62,
 'abdullah': 63,
 'abe': 64,
 'abeaha
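
The `word2id` mapping is what links seed words to topics. The sketch below shows how a seed dictionary could be built for the three target topics. The toy vocabulary and the seed words are hypothetical placeholders, not the seeds used in this project, and the commented `GuidedLDA` calls follow the usage shown in the package README as far as I recall it.

```python
# Toy vocabulary standing in for `tf_feature_names` (hypothetical words).
toy_feature_names = ["arrest", "cocaine", "detective", "gun",
                     "heroin", "knife", "search", "warrant"]
toy_word2id = {word: idx for idx, word in enumerate(toy_feature_names)}

# Hypothetical seed lists, one per target topic: drugs, weapons, investigation.
seed_topic_list = [
    ["cocaine", "heroin"],
    ["gun", "knife"],
    ["detective", "search", "warrant"],
]

# Map each seed word's vocabulary index to the topic it should be pulled towards.
seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        if word in toy_word2id:  # skip seeds that are missing from the vocabulary
            seed_topics[toy_word2id[word]] = topic_id

print(seed_topics)

# With the guidedlda package installed and X a document-term matrix
# (assumption: the package works on term counts rather than TF-IDF weights):
# model = guidedlda.GuidedLDA(n_topics=3, n_iter=100, random_state=7, refresh=20)
# model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```

The `seed_confidence` value controls how strongly seed words are biased towards their assigned topic; the value above is just the one used in the package's example.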

## LSI

In [14]:
# Invert the term -> index vocabulary into the index -> term mapping gensim expects.
reverse_vocabulary = {idx: term for term, idx in dv.vectorizer.vocabulary_.items()}

In [15]:
model = LsiModel(vectors.transpose(), id2word=reverse_vocabulary, num_topics=numTopics) 
topics = model.get_topics()

In [16]:
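top_words = []
for topicno in range(numTopics):
    # Fetch the 30 highest-weighted terms once and reuse them.
    topic_terms = model.show_topic(topicno, topn=30)
    print('Topic {}'.format(topicno))
    print([(term, round(weight, 2)) for term, weight in topic_terms], '\n')
    top_words.append([term for term, _ in topic_terms])

# Terms that appear in the top-30 list of every topic.
print(set.intersection(*map(set, top_words)))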

Topic 0
[('defendant', 0.33), ('court', 0.28), ('plaintiff', 0.21), ('illinois', 0.18), ('trial', 0.14), ('case', 0.12), ('evidence', 0.11), ('state', 0.11), ('judgment', 0.1), ('appellant', 0.1), ('say', 0.1), ('n.e.2d', 0.09), ('jury', 0.09), ('make', 0.09), ('people', 0.08), ('would', 0.08), ('motion', 0.08), ('act', 0.08), ('such', 0.08), ('error', 0.08), ('app', 0.08), ('appellee', 0.07), ('order', 0.07), ('section', 0.07), ('may', 0.07), ('time', 0.07), ('property', 0.07), ('file', 0.07), ('3d', 0.07), ('contract', 0.07)] 

Topic 1
[('mr.', 0.41), ('justice', 0.38), ('opinion', 0.38), ('presiding', 0.36), ('deliver', 0.32), ('court', 0.27), ('publish', 0.26), ('full', 0.16), ('defendant', -0.13), ('mcsurely', 0.12), ('o’connor', 0.11), ('barnes', 0.11), ('matchett', 0.1), ('gridley', 0.09), ('scanlan', 0.06), ('friend', 0.06), ('plaintiff', -0.05), ('thomson', 0.04), ('illinois', -0.04), ('taylor', 0.04), ('wilson', 0.04), ('n.e.2d', -0.04), ('trial', -0.04), ('sullivan', 0.04), 