## Topic Modeling with Gensim 

https://tedboy.github.io/nlps/gensim_tutorial/tutorial.html

https://www.tutorialspoint.com/gensim/index.htm

**Gensim** is an alternative library for
- topic modelling
- document indexing
- similarity retrieval

We can use the previous libraries (textacy, spaCy, NLTK) to preprocess and tokenize the texts, although Gensim has its own functions for this.

After preprocessing and tokenization, we will use **Gensim**-specific functions:

- Dictionary (from gensim.corpora import Dictionary): creates a dictionary mapping every token to an integer ID (token, id)

- doc2bow: returns a sparse representation of the word counts as (id, number of occurrences) pairs



In [14]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import regex as re

In [15]:
file="un-general-debates.csv"
df=pd.read_csv(file)

In [16]:
df = df.sample(20,random_state=42)

### Step 1 : preprocessing and tokenization
We preprocess the texts, tokenize and lemmatize them with spaCy, and exclude stop words.

In [17]:
import textacy
import textacy.preprocessing as tprep

preproc = tprep.make_pipeline(
    tprep.remove.punctuation,
    tprep.remove.accents,
    tprep.remove.html_tags,
    tprep.replace.urls,
    tprep.normalize.unicode,
    tprep.normalize.whitespace,
    tprep.normalize.quotation_marks,
    tprep.normalize.hyphenated_words
)


In [18]:
df['text']=df['text'].apply(preproc)

In [19]:
import spacy 
import en_core_web_sm
nlp = en_core_web_sm.load()

# Use the stop word list shipped with spaCy's small English model

stopwords_spa = nlp.Defaults.stop_words

print(len(stopwords_spa))
include_stopwords={'\n','that','the'}
stopwords_spa|= include_stopwords
print(len(stopwords_spa))
print(stopwords_spa)
print(type(stopwords_spa))

327
327
{'becomes', '’re', 'whence', 'at', 'whereupon', 'amongst', 'herself', 'were', 'within', 'whatever', '’m', 'somewhere', 'do', 'became', 'something', 'had', 'formerly', 'make', 'bottom', 'this', 'if', 'down', 'twenty', 'may', "n't", 'anyway', 'its', 'hundred', '’ve', 'nothing', 'them', 'name', 'upon', 'himself', 'last', 'mine', 'back', 'amount', 'thus', '‘ll', "'ll", 'elsewhere', "'m", 'move', 'thereafter', 'what', 'beside', 'cannot', 'his', 'n’t', 'here', 'even', 'there', 'once', 'only', 'below', 'become', 'up', 'to', 'four', 'anyhow', 'whereby', 'along', 'via', 'due', 'him', 'call', 'of', 'noone', 'you', 'front', 'how', 'very', 'yourself', 'since', 'behind', 'when', 'some', 'put', 'our', 'into', 'more', 'have', 'anywhere', 'whenever', 'meanwhile', 'made', 'either', 'been', "'s", 'yet', 'should', 'ten', 'one', 'again', 'ever', 'was', 'their', 'seem', 'someone', 'will', '‘s', 'third', 'see', '\n', '‘re', 'does', 'than', 'else', 'still', 'two', 'most', 'though', 'am', 'used', 'alr

In [None]:
#gensim_text=[[w for w in re.findall(r'\b\w\w+\b',text.lower()) if w not in stopwords_spa ] for text in df['text']]

In [None]:
gensim_text=[[w.lemma_.lower().strip() for w in nlp(text) if w.lemma_ not in stopwords_spa ] for text in df['text']]

In [20]:
def spacy_tokenizer(text):
    tokens = [token.lemma_.lower().strip() for token in nlp(text) if token.lemma_ not in stopwords_spa] 
    return tokens

In [21]:
gensim_text = df['text'].apply(spacy_tokenizer)
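
The tokenizer above tests the raw lemma against the stop word list before lowercasing and stripping it, so empty strings (from whitespace tokens) or capitalized stop words can slip through; the LDA topic listing further down does contain an empty-string token. A minimal post-filter sketch, assuming token lists like the ones produced above. `clean_tokens` and the toy stop word set are hypothetical names used only for illustration:

```python
def clean_tokens(tokens, stopwords):
    """Lowercase and strip each token, then drop empty strings and stop words."""
    cleaned = []
    for tok in tokens:
        norm = tok.lower().strip()
        if norm and norm not in stopwords:
            cleaned.append(norm)
    return cleaned

toy_stopwords = {"the", "that"}  # stand-in for stopwords_spa
toy_tokens = ["The", "peace", "", " ", "that", "Council"]
print(clean_tokens(toy_tokens, toy_stopwords))
```

Applied to the real data, this would be something like gensim_text.apply(lambda toks: clean_tokens(toks, stopwords_spa)), at the cost of slightly different results from the outputs shown in this notebook.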

In [9]:
len(gensim_text)

20

In [22]:
gensim_text.loc[df.index[0],]

['international',
 'community',
 'currently',
 'period',
 'reflection',
 'self',
 'definition',
 'great',
 'transformation',
 'humanity',
 'experience',
 'course',
 'previous',
 'decade',
 'current',
 'challenge',
 'demand',
 'great',
 'responsibility',
 'nation',
 'play',
 'active',
 'role',
 'search',
 'urgent',
 'solution',
 'problem',
 'affect',
 'new',
 'session',
 'general',
 'assembly',
 'present',
 'excellent',
 'opportunity',
 'achieve',
 'goal',
 'today',
 'dominican',
 'republic',
 'reaffirm',
 'commitment',
 'peace',
 'defence',
 'human',
 'right',
 'security',
 'sustainable',
 'development',
 'strengthening',
 'democracy',
 'pillar',
 'indisputable',
 'importance',
 'safeguard',
 'international',
 'peace',
 'stability',
 'issue',
 'reform',
 'united',
 'nations',
 'agenda',
 'long',
 'time',
 'event',
 'recent',
 'year',
 'clear',
 'task',
 'undertake',
 'matter',
 'priority',
 'reform',
 'security',
 'council',
 'particularly',
 'urgent',
 'approval',
 'resolution',
 '47'

### Step 2 : Creating a Dictionary

We construct a dictionary that defines the vocabulary of all words obtained after tokenization.

Each token receives a unique integer ID
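
To make the idea concrete, here is a pure-Python sketch of a token-to-ID mapping. It is an illustration only: `build_vocab` is a hypothetical helper, and Gensim may assign the IDs in a different order.

```python
def build_vocab(tokenized_docs):
    """Assign a unique integer ID to each distinct token, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

toy_docs = [["peace", "security", "peace"], ["security", "council"]]
token2id = build_vocab(toy_docs)
id2token = {i: t for t, i in token2id.items()}  # reverse lookup, ID to token
print(token2id)
print(id2token[2])
```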

In [23]:
from gensim.corpora import Dictionary
dict_gensim_text=Dictionary(gensim_text) # define the vocabulary of all words known to the pipeline

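We can restrict the vocabulary by document frequency: no_below=5 keeps only tokens that appear in at least 5 documents, and no_above=0.7 discards tokens that appear in more than 70% of the documents.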

In [24]:
dict_gensim_text.filter_extremes(no_below=5,no_above=0.7)
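
The effect of the two thresholds can be sketched in plain Python. This is a simplified illustration of the filtering rule, not Gensim's code; `filter_by_doc_freq` and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below, no_above):
    """Keep tokens present in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    num_docs = len(tokenized_docs)
    # set(doc) so that each token counts once per document
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df / num_docs <= no_above}

toy_docs = [
    ["peace", "nation", "war"],
    ["peace", "nation", "trade"],
    ["peace", "nation", "war"],
    ["peace", "debt"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "peace" is dropped because it occurs in every document, while "trade" and "debt" are dropped because they occur in only one.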

In [None]:
print(dict_gensim_text)

In [None]:
print(dict_gensim_text.token2id)

In [None]:
dict_gensim_text.keys()

In [None]:
dict_gensim_text.get(57)

Collection frequencies: for each token ID, the total number of occurrences of the token across all documents

In [None]:
dict_gensim_text.cfs

Document frequencies: for each token ID, the number of documents that contain the token

In [None]:
dict_gensim_text.dfs
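
The difference between the two counts is easy to see on a toy corpus. The sketch below uses token strings as keys for readability, whereas cfs and dfs are keyed by token ID; the variable names are illustrative.

```python
from collections import Counter

toy_docs = [["peace", "peace", "war"], ["peace", "trade"]]

# Collection frequency: every occurrence counts
collection_freq = Counter(tok for doc in toy_docs for tok in doc)
# Document frequency: each token counts once per document
document_freq = Counter(tok for doc in toy_docs for tok in set(doc))

print(collection_freq["peace"], document_freq["peace"])
```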

In [None]:
print('number of documents processed',dict_gensim_text.num_docs)

In [None]:
print('number of processed words',dict_gensim_text.num_pos)

### Step 3 : Bag-of-words representation of a set of texts

We represent each document as a vector of features.

We create a bag-of-words representation of a document with **doc2bow**.

doc2bow returns a sparse representation of the word counts.

For each token that appears both in the document and in the dictionary, we obtain a pair (ID of the token, number of occurrences in the text).
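
A hand-written equivalent on a toy vocabulary shows the shape of the result. `to_bow` is a hypothetical helper that only mimics the behaviour described above; note that tokens missing from the dictionary are simply ignored.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_token2id = {"peace": 0, "security": 1, "council": 2}
toy_bow = to_bow(["peace", "security", "peace", "unknown"], toy_token2id)
print(toy_bow)
```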

In [25]:
bow_gensim_text = [dict_gensim_text.doc2bow(text) for text in gensim_text]

In [26]:
len(bow_gensim_text)

20

Words represented by their ID number and their number of occurrences

In [27]:
print(bow_gensim_text[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 3), (6, 1), (7, 2), (8, 5), (9, 2), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 2), (16, 1), (17, 2), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 2), (37, 1), (38, 1), (39, 1), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 2), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 3), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 2), (84, 1), (85, 3), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 2), (102, 1), (103, 1), (104, 2), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 2),

### Step 4 : Computing TF-IDF

We compute the tf-idf weights for the corpus of documents:

1. We fit models.TfidfModel on our BoW corpus
2. We apply the fitted model to the corpus and obtain, for each document, a list of (token ID, tf-idf weight) pairs

In [28]:
from gensim.models import TfidfModel
tfidf_text = TfidfModel(bow_gensim_text) # We initialize the tf_idf model
tfidf_corpus = tfidf_text[bow_gensim_text] # We apply it to the BoW corpus
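
The weighting itself can be sketched by hand. The code below assumes raw term frequency multiplied by log2(N / document frequency), followed by L2 normalization of each document vector, which we believe corresponds to the TfidfModel defaults. `tfidf_weights` and the toy corpus are hypothetical.

```python
import math

def tfidf_weights(bow_corpus):
    """Turn (token_id, count) documents into unit-length tf-idf vectors."""
    num_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted = []
    for bow in bow_corpus:
        vec = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        # A token present in every document has idf = 0 and is dropped
        vec = [(tid, w) for tid, w in vec if w > 0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

toy_bow_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]
toy_tfidf = tfidf_weights(toy_bow_corpus)
print(toy_tfidf)
```

Token 0 appears in both toy documents, so it carries no discriminating information and disappears from both vectors.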

## NMF (Non-negative Matrix Factorization) with Gensim

The Nmf class is imported from gensim.models.nmf.

Main arguments:

- corpus : training corpus, an iterable of lists of (int, float) pairs or a csc_matrix with shape (n_tokens, n_documents)
- num_topics : number of topics
- id2word : dict of (int, str) or a Dictionary, mapping from word IDs to words


In [None]:
from gensim.models.nmf import Nmf
nmf_gensim_text = Nmf(tfidf_corpus,num_topics=10,id2word=dict_gensim_text,normalize=True,random_state=42) # normalize expects a boolean, not the string 'True'

In [None]:
nmf_gensim_text.show_topic(0,topn=10)

In [None]:
nmf_gensim_text.get_term_topics(0,normalize=True)

In [None]:
nmf_gensim_text.show_topics()

In [None]:
nmf_gensim_text.show_topics(num_topics=10,num_words=10)

### LDA with Gensim

In [29]:
from gensim.models import LdaModel
lda_gensim_text = LdaModel(corpus=bow_gensim_text,id2word = dict_gensim_text,chunksize=2000, alpha='auto',eta='auto',iterations=400,num_topics=5,passes=20,eval_every=None, random_state=42)

#### Displaying topics

In [30]:
lda_gensim_text.show_topics(5,num_words=10,formatted=True)

[(0,
  '0.009*"democracy" + 0.008*"life" + 0.007*"latin" + 0.007*"hope" + 0.007*"powers" + 0.007*"asia" + 0.007*"war" + 0.006*"view" + 0.006*"power" + 0.006*"man"'),
 (1,
  '0.011*"programme" + 0.009*"goal" + 0.008*"agenda" + 0.008*"republic" + 0.008*"millennium" + 0.007*"commitment" + 0.007*"price" + 0.007*"live" + 0.007*"strengthen" + 0.007*"cooperation"'),
 (2,
  '0.024*"nuclear" + 0.015*"weapon" + 0.012*"war" + 0.011*"military" + 0.010*"relation" + 0.009*"disarmament" + 0.008*"co" + 0.007*"foreign" + 0.007*"europe" + 0.007*"strengthen"'),
 (3,
  '0.026*"" + 0.011*"union" + 0.010*"dialogue" + 0.009*"change" + 0.009*"drug" + 0.009*"independence" + 0.008*"european" + 0.007*"conference" + 0.007*"reality" + 0.006*"opinion"'),
 (4,
  '0.024*"africa" + 0.019*"african" + 0.007*"debt" + 0.007*"continent" + 0.006*"delegation" + 0.006*"assistance" + 0.005*"agreement" + 0.005*"child" + 0.005*"cause" + 0.005*"hope"')]

For a **single** topic

show_topic returns (word, probability) pairs for the most relevant words of the given topic

In [31]:
lda_gensim_text.show_topic(0,topn=10)

[('democracy', np.float32(0.008826124)),
 ('life', np.float32(0.007889859)),
 ('latin', np.float32(0.0070758434)),
 ('hope', np.float32(0.006837128)),
 ('powers', np.float32(0.006731101)),
 ('asia', np.float32(0.006697009)),
 ('war', np.float32(0.0065731197)),
 ('view', np.float32(0.0059399856)),
 ('power', np.float32(0.005935396)),
 ('man', np.float32(0.0057120956))]

The LDA model provides three pieces of information:
 1. The topics present in a document
 2. The topic each word belongs to
 3. The phi value: the probability that a word belongs to a particular topic

Get the topic distribution for a given document

In [None]:
lda_gensim_text.get_document_topics(bow_gensim_text[1])
#lda_gensim_text.get_document_topics(bow_gensim_text[1],per_word_topics= True)

Get the most relevant topics for a given word

Returns the relevant topics as (topic ID, probability) pairs

In [None]:
lda_gensim_text.get_term_topics('war')

Get most relevant words for a single topic

In [None]:
lda_gensim_text.get_topic_terms(0,topn=5)

In [None]:
dict_gensim_text[68]

In [35]:
w = lda_gensim_text.get_topics() # topic-term matrix, one row per topic
w

array([[1.3621929e-03, 1.3013736e-03, 8.9289318e-04, ..., 6.3572232e-05,
        6.3558400e-05, 6.3626692e-05],
       [1.2574578e-03, 1.2570092e-03, 1.2576318e-03, ..., 1.2583163e-03,
        1.2590395e-03, 1.2583511e-03],
       [1.9835193e-04, 5.1059580e-04, 1.2273294e-03, ..., 6.7070767e-04,
        9.0312975e-04, 1.4611264e-03],
       [1.6781538e-03, 1.6517708e-03, 1.1534267e-04, ..., 6.8343093e-04,
        9.2661951e-04, 1.6829083e-03],
       [8.5320364e-04, 8.7590923e-04, 6.4736778e-05, ..., 8.7609427e-04,
        4.8551380e-04, 6.1101317e-05]], shape=(5, 842), dtype=float32)
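
Each row of this matrix is the word distribution of one topic, with one column per dictionary ID. A small sketch of how the top words of a topic could be read off such a matrix, using a made-up 2 x 3 matrix and a hypothetical `top_terms` helper:

```python
def top_terms(topic_row, id2token, topn=3):
    """Return the `topn` (token, weight) pairs with the largest weights in a topic row."""
    ranked = sorted(range(len(topic_row)), key=lambda i: topic_row[i], reverse=True)
    return [(id2token[i], topic_row[i]) for i in ranked[:topn]]

toy_topics = [[0.5, 0.3, 0.2],   # made-up topic 0
              [0.1, 0.2, 0.7]]   # made-up topic 1
toy_id2token = {0: "war", 1: "peace", 2: "debt"}
print(top_terms(toy_topics[1], toy_id2token, topn=2))
```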

In [33]:
import pyLDAvis 
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook(local=True)
vis = pyLDAvis.gensim_models.prepare(lda_gensim_text, bow_gensim_text , dict_gensim_text,sort_topics=False)

pyLDAvis.display(vis)

## Evaluating the coherence of a model

Topic coherence: a measure of the quality of topics, expressed as a single scalar value. These measures are based on an aggregated estimate of the probability that the top words of a topic occur together in a corpus of documents.

The computation of the metrics can be decomposed into 4 steps:

1. Segmentation: creating pairs of subsets of words (W,W') used to assess the coherence of a topic

2. Probability calculation: given the corpus of documents, we compute the probability of occurrence of each element of the previous pairs

3. Confirmation measure: the aim is to quantify the relationship between W and W', that is, to measure how strongly W is associated with W'

4. Aggregation: the previous confirmation measures are combined into a single value (mean, median, geometric mean...)
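
The four steps can be sketched for a UMass-style score. This is a simplified illustration with +1 smoothing and an arithmetic mean; Gensim's u_mass differs in its smoothing details, so the values are not directly comparable. `umass_coherence` and the toy documents are hypothetical, and every top word is assumed to occur in at least one document.

```python
import math

def umass_coherence(top_words, tokenized_docs):
    """UMass-style coherence of one topic, given its top words in ranked order."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_count(*words):
        # Step 2: counts of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    # Step 1: segmentation, pairing each word with every higher-ranked word
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            # Step 3: confirmation measure, a smoothed log conditional probability
            scores.append(math.log((doc_count(w_i, w_j) + 1) / doc_count(w_j)))
    # Step 4: aggregation by arithmetic mean
    return sum(scores) / len(scores)

toy_docs = [["war", "nuclear"], ["war", "peace"], ["nuclear", "weapon"]]
print(umass_coherence(["war", "nuclear"], toy_docs))
print(umass_coherence(["war", "weapon"], toy_docs))
```

Words that never share a document ("war" and "weapon") give a lower score than words that do ("war" and "nuclear").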



Typical value ranges of the coherence measures available in Gensim:

- u_mass: $-14 \leq u\_mass \leq 0$

- c_v: $c\_v \in [0,1]$

- c_uci: $-14 \leq c\_uci \leq 14$

- c_npmi: $-1 \leq c\_npmi \leq 1$
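
For reference, the pairwise confirmation measures behind c_uci and c_npmi are usually defined as below. This is a sketch of the standard definitions, where $P$ denotes probabilities estimated from co-occurrence counts and $\epsilon$ is a small smoothing constant; the symbols are our own notation, not Gensim identifiers.

$$\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)}, \qquad \mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log\left(P(w_i, w_j) + \epsilon\right)}$$

Dividing by $-\log\left(P(w_i, w_j) + \epsilon\right)$ is what keeps NPMI between -1 and 1, whereas PMI is bounded only by the smoothing constant.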

In [None]:
from gensim.models.coherencemodel import CoherenceModel

u_mass:

input: the BoW corpus used to train the model

In [None]:
nmf_gensim_text_coherence =CoherenceModel(model=nmf_gensim_text,corpus=bow_gensim_text,coherence='u_mass')

nmf_gensim_text_coherence_score = nmf_gensim_text_coherence.get_coherence()

print(nmf_gensim_text_coherence_score)

In [None]:
lda_gensim_text_coherence =CoherenceModel(model=lda_gensim_text,corpus=bow_gensim_text,coherence='u_mass')

lda_gensim_text_coherence_score = lda_gensim_text_coherence.get_coherence()

print(lda_gensim_text_coherence_score)

c_v:

input: the tokenized texts and the dictionary

In [None]:
nmf_gensim_text_coherence =CoherenceModel(model=nmf_gensim_text,texts=gensim_text,dictionary=dict_gensim_text,coherence='c_v')

nmf_gensim_text_coherence_score = nmf_gensim_text_coherence.get_coherence()

print(nmf_gensim_text_coherence_score)

In [None]:
lda_gensim_text_coherence =CoherenceModel(model=lda_gensim_text,texts=gensim_text,dictionary=dict_gensim_text,coherence='c_v')

lda_gensim_text_coherence_score = lda_gensim_text_coherence.get_coherence()

print(lda_gensim_text_coherence_score)

In [None]:
nmf_gensim_text_coherence =CoherenceModel(model=nmf_gensim_text,texts=gensim_text,dictionary=dict_gensim_text,coherence='c_uci')

nmf_gensim_text_coherence_score = nmf_gensim_text_coherence.get_coherence()

print(nmf_gensim_text_coherence_score)

In [None]:
lda_gensim_text_coherence =CoherenceModel(model=lda_gensim_text,texts=gensim_text,dictionary=dict_gensim_text,coherence='c_npmi')

lda_gensim_text_coherence_score = lda_gensim_text_coherence.get_coherence()

print(lda_gensim_text_coherence_score)

#### Coherence scores by topic

In [None]:
lda_gensim_text_coherence.get_coherence_per_topic()

### Finding the optimal number of topics

In [None]:
from tqdm import tqdm # wraps an iterable and displays a progress bar while looping over it

In [None]:
from gensim.models.coherencemodel import CoherenceModel

In [None]:
from gensim.models.ldamulticore import LdaMulticore
lda_model_n=[]
for n in tqdm(range(5,20)):
    lda_model = LdaMulticore(corpus=bow_gensim_text,id2word=dict_gensim_text,chunksize=2000,eta='auto',iterations=400,
                             num_topics=n,passes=20,eval_every=None,random_state=42)
    lda_coherence = CoherenceModel(model=lda_model, texts=gensim_text,dictionary=dict_gensim_text,coherence='c_v')
    lda_model_n.append((n,lda_model,lda_coherence.get_coherence()))

In [None]:
pd.DataFrame(lda_model_n, columns=["n","model","coherence"]).set_index("n")[["coherence"]].plot()
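
Besides reading the plot, the best candidate can be picked programmatically. A minimal sketch, assuming a list of (n, model, coherence) tuples shaped like lda_model_n; the coherence values below are made up for illustration and the models are replaced by None.

```python
# Made-up (n, model, coherence) tuples standing in for lda_model_n
toy_results = [(5, None, 0.41), (6, None, 0.47), (7, None, 0.44)]

# Select the tuple with the highest coherence score
best_n, best_model, best_score = max(toy_results, key=lambda item: item[2])
print(best_n, best_score)
```

With only 20 documents the curve is likely to be noisy, so the maximum should be checked against the interpretability of the topics.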

### Hierarchical Dirichlet process with Gensim

Broader topics are divided into subtopics, and the number of topics is inferred from the data rather than fixed in advance.

In [None]:
from gensim.models import HdpModel
hdp_gensim = HdpModel(corpus=bow_gensim_text,id2word=dict_gensim_text)

In [None]:
hdp_gensim.print_topics(num_words=10)

# Document retrieval : Perform a similarity Query

MatrixSimilarity is a Gensim class that computes the similarities between a query and a corpus of documents.

The similarity measure is the cosine of the angle between two vectors:

    cosine_similarity = (A · B) / (||A|| × ||B||)

where A · B is the scalar product of the two vectors and ||A|| is the Euclidean norm of A.

In general, cosine_similarity $\in [-1, 1]$:

- cosine_similarity = 1 for vectors pointing in the same direction
- cosine_similarity = 0 for orthogonal vectors
- cosine_similarity = -1 for opposed vectors

For tf-idf vectors, whose weights are non-negative, cosine_similarity $\in [0,1]$.
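
The formula can be applied directly to sparse (ID, weight) vectors such as the ones produced by the tf-idf model. A minimal sketch; `cosine_sparse` is a hypothetical helper, not a Gensim function.

```python
import math

def cosine_sparse(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only IDs present in both vectors contribute to the scalar product
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_sparse([(0, 1.0), (1, 1.0)], [(0, 2.0), (1, 2.0)])
orthogonal = cosine_sparse([(0, 1.0)], [(1, 3.0)])
print(same_direction, orthogonal)
```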

In [None]:
from gensim import similarities
tfidf_index = similarities.MatrixSimilarity(tfidf_corpus)

### Preprocessing and tokenizing a query

In [None]:

query="Our country wants to promote peace and democracy through co-operation"

preprocessed_query = preproc(query)
query_text = spacy_tokenizer(preprocessed_query)

print(query_text)

### Converting a query into the vector space of Gensim

1. Convert query to BoW

2. Convert query to TF-IDF


In [None]:
query_bow = dict_gensim_text.doc2bow(query_text)

In [None]:
query_tfidf = tfidf_text[query_bow]

### Get similarity scores

We compute the similarity scores between the query and the 20 documents.

In [None]:
tfidf_index[query_tfidf]

In [None]:
tfidf_similarities = list(enumerate(tfidf_index[query_tfidf]))

# enumerate adds an index to each score

We sort the scores in decreasing order, using the score x[1] of each tuple as the key.

In [None]:
tfidf_similarities.sort(key=lambda x:x[1],reverse=True)

In [None]:
print("Top similar documents (ID,score):")
for doc_id,score in tfidf_similarities[:3]:
    print(f"ID: {doc_id},score : {score:.4f}")


In [None]:
top_n = 3
top_similar_docs = tfidf_similarities[:top_n]
print("Top similar documents Text:")
for doc_id,score in top_similar_docs:
    print(f"ID: {doc_id},score : {score:.4f}")
    print(df.loc[df.index[doc_id],'text'])