# Embedded Topic Model

#### Dejan Milacic, Kshirabdhi Patel

####  dejan.milacic@ryerson.ca, kshirabdhi.patel@ryerson.ca

# Introduction:

#### Problem Description:

Given a set of documents, determine which topics are present and what mixture of topics each individual document discusses.

#### Context of the Problem:

Topic modelling can help to organize unlabelled text data and group together documents with similar themes.

#### Limitations of Other Approaches:

Traditional topic models do not take advantage of distributed word representations, which can capture semantic similarities between tokens.
LDA requires large vocabularies to be pruned severely to obtain a good fit, which can remove important terms and limit the scope of the model.

#### Solution:

The Embedded Topic Model (ETM) uses word embeddings to discover the latent semantic structure of texts and accommodate large vocabularies.

# Background

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Moody [1] | lda2vec uses a combination of word and document vectors to model topics | Hacker News comments, 20ng | No quantitative performance measure on Hacker News comments |
| Dieng et al. [2] | ETM models topics as points in the word embedding space | NYT, 20ng | Future: Could use a parser for more informative tokenized vocabulary |


# Methodology

<div>
<img src="https://drive.google.com/uc?id=1Qy0mB14QP0VhUZJ9DZn9OaeT7reRpWrh" width="450"/>
</div>

The Embedded Topic Model can use pre-trained word embeddings (e.g. Skip-gram) or learn embeddings during training.

The model defines an $L\times V$ word embedding matrix $\rho$ where $L$ is the size of the embeddings (= 300) and $V$ is the vocabulary size.

The model also learns $K$ (= 50) topic embeddings, defined in an $L\times K$ embedding matrix $\alpha$.

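ETM measures agreement between word embeddings and topic embeddings by taking the inner product of the word embedding matrix and each topic embedding. A softmax over the vocabulary turns these scores into the word distribution of topic $k$, $\beta_k = \mathrm{softmax}(\rho^\top \alpha_k)$ ($\beta$ in the figure above).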
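
The computation of $\beta$ can be sketched in a few lines of NumPy. This is only an illustration on toy sizes, not the code in etm.py: the function name `topic_word_distributions` and the random matrices are assumptions made for the example, and the softmax over the vocabulary follows the description in Dieng et al. [2].

```python
import numpy as np

def topic_word_distributions(rho, alpha):
    """Return a K x V matrix whose k-th row is softmax(rho^T alpha_k)."""
    logits = alpha.T @ rho                                 # (K, V) inner products
    logits = logits - logits.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)    # normalize over the vocabulary

rng = np.random.default_rng(0)
L, V, K = 4, 6, 3                  # toy sizes; the notebook uses L = 300 and K = 50
rho = rng.normal(size=(L, V))      # word embeddings, one column per word
alpha = rng.normal(size=(L, K))    # topic embeddings, one column per topic
beta = topic_word_distributions(rho, alpha)
print(beta.shape)                  # one distribution over the vocabulary per topic
```

Each row of `beta` sums to one, so a topic assigns high probability to the words whose embeddings point in a similar direction to the topic embedding.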

The marginal likelihood of each document is intractable to compute, so the algorithm uses amortized variational inference.

For each document $d$, the variational distribution over the untransformed topic proportions $\delta_d$ depends on the document and on shared variational parameters $\nu$.
Each distribution is a Gaussian whose mean and variance come from an "inference network," a neural network parametrized by $\nu$ that takes the bag-of-words representation of the document as input.

The topic proportions $\theta_d = \mathrm{softmax}(\delta_d)$ are then drawn from a logistic-normal distribution with the mean and variance output by the inference network.
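
As a rough sketch of this sampling step, the snippet below draws a Gaussian vector with the reparameterization trick and maps it to the simplex with a softmax. The function name `sample_theta` and the values of `mu` and `log_sigma2` are hypothetical stand-ins for the outputs of the inference network.

```python
import numpy as np

def sample_theta(mu, log_sigma2, rng):
    """Draw theta = softmax(delta) with delta ~ N(mu, diag(exp(log_sigma2)))."""
    eps = rng.standard_normal(mu.shape)               # noise that does not depend on the parameters
    delta = mu + np.exp(0.5 * log_sigma2) * eps       # reparameterized Gaussian draw
    delta = delta - delta.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(delta)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
mu = np.zeros((2, 5))                  # hypothetical network output: 2 documents, 5 topics
log_sigma2 = np.full((2, 5), -1.0)     # hypothetical log-variances
theta = sample_theta(mu, log_sigma2, rng)
print(theta.sum(axis=1))               # each row is a distribution over topics
```

Because the randomness enters only through `eps`, gradients with respect to the mean and variance can pass through the sample, which is what makes stochastic optimization of the ELBO practical.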

The evidence lower bound (ELBO) is a function used to train the parameters $\alpha$, $\rho$ and $\nu$:

<div>
<img src="https://drive.google.com/uc?id=1Z1GZBFmBWKf0lvkF6EdIMk0ZfzE_54gc" width="450"/>
</div>

As a function of the variational parameters $\nu$, the first term encourages them to place mass on untransformed topic proportions $\delta_d$ that explain the observed words.
The second term encourages the variational distribution to stay close to the prior $p(\delta_d)$.

As a function of the model parameters $\rho$ and $\alpha$, maximizing the ELBO maximizes the expected complete log-likelihood.
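
The two terms can be illustrated numerically. The sketch below assumes a standard normal prior on $\delta_d$ (as in Dieng et al. [2]) and a single sample of $\theta$; the function name `negative_elbo` and the toy arrays are made up for the example and are not taken from etm.py.

```python
import numpy as np

def negative_elbo(bow, theta, beta, mu, log_sigma2):
    """Return the reconstruction and KL terms of the negative ELBO, per document."""
    word_probs = theta @ beta                                   # (D, V) word probabilities per document
    recon = -(bow * np.log(word_probs + 1e-12)).sum(axis=1)     # negative log-likelihood of observed words
    kl = -0.5 * (1 + log_sigma2 - mu**2 - np.exp(log_sigma2)).sum(axis=1)  # KL(q || N(0, I))
    return recon, kl

bow = np.array([[2.0, 0.0, 1.0]])        # one document over a 3-word vocabulary
theta = np.array([[0.5, 0.5]])           # its topic proportions
beta = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.1, 0.8]])       # two toy topics
mu = np.zeros((1, 2))                    # variational mean equal to the prior mean
log_sigma2 = np.zeros((1, 2))            # unit variance, equal to the prior variance
recon, kl = negative_elbo(bow, theta, beta, mu, log_sigma2)
print(recon, kl)
```

With the variational distribution equal to the prior, the KL term vanishes and only the reconstruction term remains, which corresponds to the `Rec_loss` and `KL_theta` values printed during training.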

The ELBO is optimized using stochastic optimization over minibatches of documents, taking Monte Carlo approximations of the full gradient through reparameterization.

Performance is measured in terms of topic coherence and topic diversity.
Topic coherence is the average pointwise mutual information of two words drawn randomly from the same document; the implementation below computes its normalized form over the top words of each topic.
Topic diversity is the percentage of unique words in the top 25 words of all topics.
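
To make the coherence measure concrete, here is a small sketch of normalized pointwise mutual information estimated from document co-occurrence counts. The helper `npmi` and the toy documents are hypothetical, and returning 0 when both words occur in every document is a convention assumed for this example.

```python
import math

def npmi(docs, wi, wj):
    """Normalized PMI of two words from document co-occurrence counts."""
    D = len(docs)
    d_i = sum(1 for doc in docs if wi in doc)
    d_j = sum(1 for doc in docs if wj in doc)
    d_ij = sum(1 for doc in docs if wi in doc and wj in doc)
    if d_ij == 0:
        return -1.0        # the words never co-occur
    if d_ij == D:
        return 0.0         # PMI and normalizer are both 0; 0 is an assumed convention
    pmi = math.log(d_ij * D / (d_i * d_j))
    return pmi / -math.log(d_ij / D)

toy_docs = [{"space", "nasa", "orbit"}, {"space", "nasa"},
            {"hockey", "team"}, {"hockey", "game"}]
print(npmi(toy_docs, "space", "nasa"))     # the two words always appear together
print(npmi(toy_docs, "space", "hockey"))   # the two words never appear together
```

Values near 1 indicate words that almost always co-occur, and values near -1 indicate words that almost never do, so a coherent topic has a high average over its top word pairs.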




# Implementation


In [20]:
import argparse
import math
import os
import pickle
import random
import re
import string
import sys

import gensim
import matplotlib.pyplot as plt
import numpy as np
import scipy.io
import torch
import torch.nn.functional as F
from gensim.models.coherencemodel import CoherenceModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from torch import nn, optim

from etm import ETM

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print('\n')
np.random.seed(2019)
torch.manual_seed(2019)
if torch.cuda.is_available():
    torch.cuda.manual_seed(2019)





In [3]:
path = 'C:/Users/amitp/OneDrive/Ryerson-DS/DS8008/project-nlp/project_code'

In [4]:
# load the preprocessed bag-of-words data
def _fetch(path, name):
    if name == 'train':
        token_file = os.path.join(path, 'bow_tr_tokens')
        count_file = os.path.join(path, 'bow_tr_counts')
    elif name == 'valid':
        token_file = os.path.join(path, 'bow_va_tokens')
        count_file = os.path.join(path, 'bow_va_counts')
    else:
        token_file = os.path.join(path, 'bow_ts_tokens')
        count_file = os.path.join(path, 'bow_ts_counts')
    tokens = scipy.io.loadmat(token_file)['tokens'].squeeze()
    counts = scipy.io.loadmat(count_file)['counts'].squeeze()
    if name == 'test':
        token_1_file = os.path.join(path, 'bow_ts_h1_tokens')
        count_1_file = os.path.join(path, 'bow_ts_h1_counts')
        token_2_file = os.path.join(path, 'bow_ts_h2_tokens')
        count_2_file = os.path.join(path, 'bow_ts_h2_counts')
        tokens_1 = scipy.io.loadmat(token_1_file)['tokens'].squeeze()
        counts_1 = scipy.io.loadmat(count_1_file)['counts'].squeeze()
        tokens_2 = scipy.io.loadmat(token_2_file)['tokens'].squeeze()
        counts_2 = scipy.io.loadmat(count_2_file)['counts'].squeeze()
        return {'tokens': tokens, 'counts': counts, 
                    'tokens_1': tokens_1, 'counts_1': counts_1, 
                        'tokens_2': tokens_2, 'counts_2': counts_2}
    return {'tokens': tokens, 'counts': counts}

In [5]:
train = _fetch(path, 'train')
test = _fetch(path, 'test')
valid = _fetch(path, 'valid')
# 1. training data
train_tokens = train['tokens']
train_counts = train['counts']
num_docs_train = len(train_tokens)

# 2. dev set
valid_tokens = valid['tokens']
valid_counts = valid['counts']
num_docs_valid = len(valid_tokens)

# 3. test data
test_tokens = test['tokens']
test_counts = test['counts']
num_docs_test = len(test_tokens)

test_1_tokens = test['tokens_1']
test_1_counts = test['counts_1']
num_docs_test_1 = len(test_1_tokens)

test_2_tokens = test['tokens_2']
test_2_counts = test['counts_2']
num_docs_test_2 = len(test_2_tokens)

In [43]:
# load the vocabulary
with open(os.path.join(path, 'vocab.pkl'), 'rb') as f:
    vocab = pickle.load(f)

print(vocab[0:10])
vocab_size = len(vocab)
vocab_size

['embedded', 'erzurum', 'hypocritical', 'acc', 'vgalogo', 'smithsonian', 'friendly', 'yk', 'hollow', 'twisto']


18627

We now have two choices:
    1. Use pre-trained embeddings for the words.
    2. Learn the word embeddings during training.

- If we choose the pre-trained word2vec skip-gram embeddings, the block of code below is executed, and every word is represented by a 300-dimensional vector.
- embeddings.txt is the embedding file generated by the word_embadding.ipynb notebook. We use this file to look up the embedding vector of each word in the vocabulary.

In [7]:
emb_path = path+"/embeddings.txt"
emb_size = 300

# decide whether to learn the embeddings or load pre-trained ones
train_embeddings = True
mode = 'Train'  # use 'eval' when train_embeddings is False, otherwise 'Train'
# when using the pre-trained word2vec embeddings, they are read from emb_path
embeddings = None
if not train_embeddings:
    
    vectors = {}
    with open(emb_path, encoding="ISO-8859-1") as f:
        for l in f:
            line = l.split()
            word = line[0]
            if word in vocab:
                vect = np.array(line[1:]).astype(float)  # np.float was removed in NumPy 1.24
                vectors[word] = vect
    embeddings = np.zeros((vocab_size, emb_size))
    words_found = 0
    for i, word in enumerate(vocab):
        try: 
            embeddings[i] = vectors[word]
            words_found += 1
        except KeyError:
            embeddings[i] = np.random.normal(scale=0.6, size=(emb_size, ))
    embeddings = torch.from_numpy(embeddings).to(device)
    embeddings_dim = embeddings.size()

    print('finished...................................\n')
    print(embeddings.size())
    print(embeddings[0].size()) # 300 features for every word in the vocabulary
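
The loading logic above expects one word per line followed by its vector components. The sketch below shows that format on an in-memory file; the helper `load_embeddings` and the toy vocabulary are hypothetical, and a set is used for the vocabulary lookup, which is a design choice of this example rather than of the cell above.

```python
import io
import numpy as np

def load_embeddings(lines, vocab, emb_size, rng):
    """Build a (len(vocab), emb_size) matrix from 'word v1 v2 ...' lines."""
    wanted = set(vocab)                           # constant-time membership test
    vectors = {}
    for line in lines:
        parts = line.split()
        if parts and parts[0] in wanted:
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    matrix = np.zeros((len(vocab), emb_size))
    found = 0
    for i, word in enumerate(vocab):
        if word in vectors:
            matrix[i] = vectors[word]
            found += 1
        else:
            # words missing from the file get a random vector, as in the cell above
            matrix[i] = rng.normal(scale=0.6, size=emb_size)
    return matrix, found

toy_file = io.StringIO("space 0.1 0.2 0.3\nhockey 0.4 0.5 0.6\n")  # hypothetical file contents
toy_vocab = ["hockey", "space", "unseen"]
matrix, found = load_embeddings(toy_file, toy_vocab, 3, np.random.default_rng(0))
print(found, matrix.shape)
```

Rows of the matrix follow the order of the vocabulary, not the order of the file, so row `i` always corresponds to `vocab[i]`.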

Select the activation function which will be used for training the neural network models.

In [8]:
# choose the activation function for the model
def get_activation(act):
    if act == 'tanh':
        act = nn.Tanh()
    elif act == 'relu':
        act = nn.ReLU()
    elif act == 'softplus':
        act = nn.Softplus()
    elif act == 'rrelu':
        act = nn.RReLU()
    elif act == 'leakyrelu':
        act = nn.LeakyReLU()
    elif act == 'elu':
        act = nn.ELU()
    elif act == 'selu':
        act = nn.SELU()
    elif act == 'glu':
        act = nn.GLU()
    else:
        print('Defaulting to tanh activations...')
        act = nn.Tanh()
    return act 

- The cell below sets all the hyperparameters used by the ETM model.
- For details, see the ETM.py file.
- That file includes the code for:
    - Topic embeddings
    - Word embeddings
    - The variational distribution network
    - Optimizing the network parameters that produce theta_d, the mean and the standard deviation
    - Generating the Gaussian distribution over topics whose mean and variance come from an "inference network"

In [10]:
# ETM implementation
# set all the hyperparameters
num_topics = 50
t_hidden_size = 1200 # number of neurons in the hidden layers
rho_size = 300 # word embedding dimension; rho is the matrix whose columns contain the embedding representations of the vocabulary
theta_act = get_activation("tanh")
enc_drop = 0.5  # dropout rate on the encoder
model = ETM(num_topics, vocab_size, t_hidden_size, rho_size).to(device)

In [11]:
# choose the optimizer used for all the model parameters
wdecay = 1.2e-6
lr = 0.005 # learning rate

def opt(optimizer):
    if optimizer == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=wdecay)
    elif optimizer == 'adagrad':
        optimizer = optim.Adagrad(model.parameters(), lr=lr, weight_decay=wdecay)
    elif optimizer == 'adadelta':
        optimizer = optim.Adadelta(model.parameters(), lr=lr, weight_decay=wdecay)
    elif optimizer == 'rmsprop':
        optimizer = optim.RMSprop(model.parameters(), lr=lr, weight_decay=wdecay)
    elif optimizer == 'asgd':
        optimizer = optim.ASGD(model.parameters(), lr=lr, t0=0, lambd=0., weight_decay=wdecay)
    else:
        print('Defaulting to vanilla SGD')
        optimizer = optim.SGD(model.parameters(), lr=lr)
    return optimizer
optimizer = opt('adam')

In [12]:
# convert a batch of documents into a dense bag-of-words tensor
def get_batch(tokens, counts, ind, vocab_size, device, emsize=300):
    """fetch input data by batch."""
    batch_size = len(ind)
    data_batch = np.zeros((batch_size, vocab_size))
    
    for i, doc_id in enumerate(ind):
        doc = tokens[doc_id]
        count = counts[doc_id]
        L = count.shape[1]
        if len(doc) == 1: 
            doc = [doc.squeeze()]
            count = [count.squeeze()]
        else:
            doc = doc.squeeze()
            count = count.squeeze()
        if doc_id != -1:
            for j, word in enumerate(doc):
                data_batch[i, word] = count[j]
    data_batch = torch.from_numpy(data_batch).float().to(device)
    return data_batch
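
A toy version of the same idea may help to see what a batch looks like. The sketch below uses NumPy only; the helper `dense_bow` and the two small documents are hypothetical and stand in for the arrays loaded from the preprocessed files.

```python
import numpy as np

def dense_bow(tokens, counts, vocab_size):
    """Turn per-document word ids and counts into a dense count matrix."""
    batch = np.zeros((len(tokens), vocab_size))
    for i, (doc, cnt) in enumerate(zip(tokens, counts)):
        batch[i, doc] = cnt                  # place each count at the column of its word id
    return batch

toy_tokens = [np.array([0, 3]), np.array([1, 2, 4])]   # word ids of two toy documents
toy_counts = [np.array([2, 1]), np.array([1, 1, 2])]   # matching word counts
batch = dense_bow(toy_tokens, toy_counts, vocab_size=5)
normalized = batch / batch.sum(axis=1, keepdims=True)  # the normalization applied when bow_norm is True
print(batch)
print(normalized)
```

The unnormalized counts are what the reconstruction loss is computed against, while the normalized rows are what the encoder receives when `bow_norm` is enabled.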

In [23]:
#training hyperparameters

batch_size = 1000
bow_norm = True  # Decide whether to use normalized representation of bag of words or as it is
clip = 0.0       # gradient clipping


- If we are not using pre-trained embeddings, the model learns the word embedding matrix as part of its weights.
- The function below runs one training epoch, updating the embedding matrices together with the other model parameters.

In [24]:
def train(epoch):
    model.train() 
    acc_loss = 0
    acc_kl_theta_loss = 0
    cnt = 0
    indices = torch.randperm(num_docs_train)  # random permutation of the document indices
    indices = torch.split(indices, batch_size) # split into chunks of size batch_size
    for idx, ind in enumerate(indices):
        optimizer.zero_grad()
        model.zero_grad()
        data_batch = get_batch(train_tokens, train_counts, ind, vocab_size, device) # dense bag-of-words batch
        sums = data_batch.sum(1).unsqueeze(1) # per-document word totals used for normalisation
        if bow_norm:
            normalized_data_batch = data_batch / sums
        else:
            normalized_data_batch = data_batch
        recon_loss, kld_theta = model(data_batch, normalized_data_batch)
        total_loss = recon_loss + kld_theta
        total_loss.backward()
        

        if clip > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        acc_loss += torch.sum(recon_loss).item()
        acc_kl_theta_loss += torch.sum(kld_theta).item()
        cnt += 1
    
    cur_loss = round(acc_loss / cnt, 2) 
    cur_kl_theta = round(acc_kl_theta_loss / cnt, 2) 
    cur_real_loss = round(cur_loss + cur_kl_theta, 2)
    print('*'*100)
    print('Epoch----->{} .. LR: {} .. KL_theta: {} .. Rec_loss: {} .. NELBO: {}'.format(
            epoch, optimizer.param_groups[0]['lr'], cur_kl_theta, cur_loss, cur_real_loss))
    print('*'*100)

- This function returns the 20 words whose embeddings are closest, by cosine similarity, to the embedding of a query word.

In [25]:
# for word visualisation in embedding space

def nearest_neighbors(word, embeddings, vocab):
    vectors = embeddings.data.cpu().numpy() 
    index = vocab.index(word)
    #print('vectors: ', vectors.shape)
    query = vectors[index]
    #print('query: ', query.shape)
    ranks = vectors.dot(query).squeeze()
    denom = query.T.dot(query).squeeze()
    denom = denom * np.sum(vectors**2, 1)
    denom = np.sqrt(denom)
    ranks = ranks / denom
    mostSimilar = list(ranks.argsort()[::-1])
    nearest_neighbors = mostSimilar[:20]
    nearest_neighbors = [vocab[comp] for comp in nearest_neighbors]
    return nearest_neighbors
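
The ranking performed by nearest_neighbors is a cosine-similarity search. A compact sketch on hand-made two-dimensional vectors is shown below; the function `cosine_neighbors`, the toy vocabulary and the toy vectors are all invented for illustration.

```python
import numpy as np

def cosine_neighbors(word, vectors, vocab, topn=3):
    """Rank vocabulary words by cosine similarity to `word`."""
    query = vectors[vocab.index(word)]
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    order = np.argsort(sims)[::-1][:topn]     # indices of the most similar words first
    return [vocab[i] for i in order]

toy_vocab = ["space", "orbit", "hockey", "goalie"]
toy_vectors = np.array([[1.0, 0.1],
                        [0.9, 0.2],
                        [0.1, 1.0],
                        [0.2, 0.9]])
neighbors = cosine_neighbors("space", toy_vectors, toy_vocab)
print(neighbors)
```

The query word is always its own nearest neighbor because its cosine similarity with itself is 1, which is why each list in the output further down starts with the query word.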

In [26]:
num_words = 20   #number of words for topic viz

In [27]:
def visualize(m, show_emb=True):
    m.eval()

    queries = ['andrew', 'computer', 'sports', 'religion', 'man', 'love', 
                'intelligence', 'money', 'politics', 'health', 'people', 'family']

    ## visualize topics using monte carlo
    with torch.no_grad():
        print('**.'*30)
        print('Visualize topics...\n')
        topics_words = []
        gammas = m.get_beta()
        for k in range(num_topics):
            gamma = gammas[k]
            top_words = list(gamma.cpu().numpy().argsort()[-num_words+1:][::-1])
            topic_words = [vocab[a] for a in top_words]
            topics_words.append(' '.join(topic_words))
            print('Topic {}: {}'.format(k, topic_words))

        if show_emb:
            ## visualize word embeddings by using V to get nearest neighbors
            print('\n','**.'*30)
            print('\nVisualize word embeddings by using output embedding matrix\n')
            try:
                embeddings = m.rho.weight  # Vocab_size x E
            except:
                embeddings = m.rho         # Vocab_size x E
            neighbors = []
            for word in queries:
                print('word: {} .. neighbors:\n {}\n'.format(
                    word, nearest_neighbors(word, embeddings, vocab)))
            print('**.'*30)

- Three functions are defined here:
    1. get_document_frequency - counts the number of documents in which a particular word occurs and, when a second word is given, the number in which both words occur.
    2. get_topic_coherence - average normalized pointwise mutual information of pairs of top words of each topic, estimated from document co-occurrence counts
    3. get_topic_diversity - the percentage of unique words in the top 25 words of all topics

In [28]:
def get_document_frequency(data, wi, wj=None):
    if wj is None:
        D_wi = 0
        for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1: 
                continue
            else:
                doc = doc.squeeze()
            if wi in doc:
                D_wi += 1
        return D_wi
    D_wj = 0
    D_wi_wj = 0
    for l in range(len(data)):
        doc = data[l].squeeze(0)
        if len(doc) == 1: 
            doc = [doc.squeeze()]
        else:
            doc = doc.squeeze()
        if wj in doc:
            D_wj += 1
            if wi in doc:
                D_wi_wj += 1
    return D_wj, D_wi_wj 

def get_topic_coherence(beta, data, vocab):
    D = len(data) ## number of docs...data is list of documents
    TC = []
    num_topics = len(beta)
    for k in range(num_topics):
        top_10 = list(beta[k].argsort()[-11:][::-1])
        TC_k = 0
        counter = 0
        for i, word in enumerate(top_10):
            # get D(w_i)
            D_wi = get_document_frequency(data, word)
            j = i + 1
            tmp = 0
            while j < len(top_10) and j > i:
                # get D(w_j) and D(w_i, w_j)
                D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])
                # get f(w_i, w_j)
                if D_wi_wj == 0:
                    f_wi_wj = -1
                else:
                    f_wi_wj = -1 + ( np.log(D_wi) + np.log(D_wj)  - 2.0 * np.log(D) ) / ( np.log(D_wi_wj) - np.log(D) )
                # update tmp: 
                tmp += f_wi_wj
                j += 1
                counter += 1
            # update TC_k
            TC_k += tmp 
        TC.append(TC_k)
    TC = np.mean(TC) / counter
    print('Topic coherence is: {}'.format(TC))

def get_topic_diversity(beta, topk):
    num_topics = beta.shape[0]
    list_w = np.zeros((num_topics, topk))
    for k in range(num_topics):
        idx = beta[k,:].argsort()[-topk:][::-1]
        list_w[k,:] = idx
    n_unique = len(np.unique(list_w))
    TD = n_unique / (topk * num_topics)
    print('Topic diveristy is: {}'.format(TD))
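
As a quick sanity check of the diversity measure, the sketch below applies the same idea to a tiny hand-written topic matrix. The function `topic_diversity` is a simplified variant that returns the value instead of printing it, and `toy_beta` is made up for the example.

```python
import numpy as np

def topic_diversity(beta, topk):
    """Fraction of unique word ids among the topk words of every topic."""
    top = np.argsort(beta, axis=1)[:, -topk:]        # ids of the topk words of each topic
    return len(np.unique(top)) / (topk * beta.shape[0])

toy_beta = np.array([[0.5, 0.3, 0.1, 0.1, 0.0],
                     [0.4, 0.1, 0.4, 0.05, 0.05],
                     [0.0, 0.1, 0.1, 0.3, 0.5]])
td = topic_diversity(toy_beta, topk=2)
print(td)
```

Here the first two topics share word 0 among their top two words, so only 5 of the 6 top-word slots are unique. A value close to 1 means topics use different words, while a value close to 0 means they largely repeat each other.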

In [29]:
eval_batch_size = 1000 #input batch size for evaluation

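- Evaluate the model performance:

- Topic coherence
- Topic diversity
- Perplexity

- Note that evaluate, as written, always computes the document-completion perplexity on the two halves of the test set, whichever source is passed. This is why the VAL and TEST perplexities in the output further down are identical.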

In [30]:

def evaluate(m, source, tc=False, td=False):
    """Compute perplexity on document completion.
    """
    m.eval()
    with torch.no_grad():
        if source == 'val':
            indices = torch.split(torch.tensor(range(num_docs_valid)), eval_batch_size)
            tokens = valid_tokens
            counts = valid_counts
        else: 
            indices = torch.split(torch.tensor(range(num_docs_test)), eval_batch_size)
            tokens = test_tokens
            counts = test_counts

        ## get \beta here
        beta = m.get_beta()

        ### do dc and tc here
        acc_loss = 0
        cnt = 0
        indices_1 = torch.split(torch.tensor(range(num_docs_test_1)), eval_batch_size)
        for idx, ind in enumerate(indices_1):
            ## get theta from first half of docs
            data_batch_1 = get_batch(test_1_tokens, test_1_counts, ind, vocab_size, device)
            sums_1 = data_batch_1.sum(1).unsqueeze(1)
            if bow_norm:
                normalized_data_batch_1 = data_batch_1 / sums_1
            else:
                normalized_data_batch_1 = data_batch_1
            theta, _ = m.get_theta(normalized_data_batch_1)

            ## get prediction loss using second half
            data_batch_2 = get_batch(test_2_tokens, test_2_counts, ind, vocab_size, device)
            sums_2 = data_batch_2.sum(1).unsqueeze(1)
            res = torch.mm(theta, beta)
            preds = torch.log(res)
            recon_loss = -(preds * data_batch_2).sum(1)
            
            loss = recon_loss / sums_2.squeeze()
            loss = loss.mean().item()
            acc_loss += loss
            cnt += 1
        cur_loss = acc_loss / cnt
        ppl_dc = round(math.exp(cur_loss), 1)
        print('*'*100)
        print('{} Doc Completion PPL: {}'.format(source.upper(), ppl_dc))
        print('*'*100)
        if tc or td:
            beta = beta.data.cpu().numpy()
            if tc:
                print('Computing topic coherence...')
                get_topic_coherence(beta, train_tokens, vocab)
            if td:
                print('Computing topic diversity...')
                get_topic_diversity(beta, 25)
        return ppl_dc
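
The document-completion perplexity computed above can be reproduced on a toy example. In this sketch the topic proportions are assumed to have been inferred from the first half of each document, and the function `completion_perplexity` and all arrays are hypothetical.

```python
import numpy as np

def completion_perplexity(theta, beta, second_half):
    """exp of the mean per-word negative log-likelihood of the held-out half."""
    log_probs = np.log(theta @ beta)                    # (D, V) log word probabilities
    nll = -(log_probs * second_half).sum(axis=1)        # loss of each document
    per_word = nll / second_half.sum(axis=1)            # normalize by the held-out length
    return float(np.exp(per_word.mean()))

V = 4
toy_theta = np.array([[0.5, 0.5], [0.2, 0.8]])   # assumed to come from the first half of each document
toy_beta = np.full((2, V), 1.0 / V)              # two uniform topics over 4 words
second_half = np.array([[1.0, 0.0, 2.0, 0.0],
                        [0.0, 3.0, 0.0, 1.0]])   # held-out word counts
ppl = completion_perplexity(toy_theta, toy_beta, second_half)
print(ppl)
```

With uniform topics every word has probability 1/V, so the perplexity equals the vocabulary size. This gives a useful reference point: a model that has learned nothing about an 18627-word vocabulary would score roughly 18627.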



- Training and testing the model

In [31]:
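# number of epochs (the training loop runs range(1, epochs), i.e. 20 epochs):
epochs = 21
nonmono = 10 # number of bad hits allowed
anneal_lr = 0 # whether to anneal the learning rate or not
lr_factor = 4.0 # divisor applied to the learning rate when annealing; assumed value, only used if anneal_lr is enabled
visualize_every = 10 # when to visualize results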
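
The role of nonmono and anneal_lr is easier to see in isolation. The sketch below restates the annealing condition of the training loop as a standalone function; the name `should_anneal`, the `min_lr` argument and the perplexity history are hypothetical. In the training loop this check is only reached in epochs where the validation perplexity did not improve on the best value so far.

```python
def should_anneal(all_val_ppls, val_ppl, nonmono, lr, min_lr=1e-5):
    """Restate the annealing condition used in the training loop."""
    return (len(all_val_ppls) > nonmono                   # enough history has been collected
            and val_ppl > min(all_val_ppls[:-nonmono])    # worse than the best value before the last nonmono epochs
            and lr > min_lr)                              # learning rate is still above the floor

history = [900.0, 850.0, 860.0, 870.0]      # hypothetical validation perplexities
print(should_anneal(history, 880.0, nonmono=2, lr=0.005))   # True
print(should_anneal(history, 840.0, nonmono=2, lr=0.005))   # False
```

When the condition holds and annealing is enabled, the loop divides the learning rate by a constant factor, which gives the optimizer smaller steps once validation perplexity has stopped improving.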

In [32]:
## train model on data 
if mode == 'Train':
    best_epoch = 0
    best_val_ppl = 1e9
    all_val_ppls = []
    print('\n')
    print('Visualizing model quality before training...')
    visualize(model)
    print('\n')
    for epoch in range(1, epochs):
        train(epoch)
        val_ppl = evaluate(model, 'val')
        if val_ppl < best_val_ppl:
            best_epoch = epoch
            best_val_ppl = val_ppl
        else:
            ## check whether to anneal lr
            lr = optimizer.param_groups[0]['lr']
            if anneal_lr and (len(all_val_ppls) > nonmono and val_ppl > min(all_val_ppls[:-nonmono]) and lr > 1e-5):
                optimizer.param_groups[0]['lr'] /= lr_factor
        if epoch % visualize_every == 0:
            visualize(model)
        all_val_ppls.append(val_ppl)

    model = model.to(device)
    val_ppl = evaluate(model, 'val')

    model.eval()
    with torch.no_grad():
        ## get document completion perplexities
        test_ppl = evaluate(model, 'test', tc=True, td=True)

        ## get most used topics
        indices = torch.tensor(range(num_docs_train))
        indices = torch.split(indices, batch_size)
        thetaAvg = torch.zeros(1, num_topics).to(device)
        thetaWeightedAvg = torch.zeros(1, num_topics).to(device)
        cnt = 0
        for idx, ind in enumerate(indices):
            data_batch = get_batch(train_tokens, train_counts, ind, vocab_size, device)
            sums = data_batch.sum(1).unsqueeze(1)
            cnt += sums.sum(0).squeeze().cpu().numpy()
            if bow_norm:
                normalized_data_batch = data_batch / sums
            else:
                normalized_data_batch = data_batch
            theta, _ = model.get_theta(normalized_data_batch)
            thetaAvg += theta.sum(0).unsqueeze(0) / num_docs_train
            weighed_theta = sums * theta
            thetaWeightedAvg += weighed_theta.sum(0).unsqueeze(0)
            if idx % 100 == 0 and idx > 0:
                print('batch: {}/{}'.format(idx, len(indices)))
        thetaWeightedAvg = thetaWeightedAvg.squeeze().cpu().numpy() / cnt
        print('\nThe 10 most used topics are {}'.format(thetaWeightedAvg.argsort()[::-1][:10]))

        ## show topics
        beta = model.get_beta()
        topic_indices = list(np.random.choice(num_topics, 10)) # 10 random topics
        print('\n')
        for k in range(num_topics):#topic_indices:
            gamma = beta[k]
            top_words = list(gamma.cpu().numpy().argsort()[-num_words+1:][::-1])
            topic_words = [vocab[a] for a in top_words]
            print('Topic {}: {}'.format(k, topic_words))

        if train_embeddings:
            ## show etm embeddings 
            try:
                rho_etm = model.rho.weight.cpu()
            except:
                rho_etm = model.rho.cpu()
            queries = ['andrew', 'woman', 'computer', 'sports', 'religion', 'man', 'love', 
                            'intelligence', 'money', 'politics', 'health', 'people', 'family']
            print('\n')
            print('ETM embeddings...')
            for word in queries:
                print('word: {} .. etm neighbors: {}'.format(word, nearest_neighbors(word, rho_etm, vocab)))
            print('\n')



Visualizing model quality before training...
**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.
Visualize topics...

Topic 0: ['backups', 'october', 'interpreter', 'devoid', 'occupied', 'wait', 'sticker', 'escrow', 'fingers', 'eaten', 'convincing', 'tau', 'bona', 'peice', 'hears', 'petroleum', 'exodus', 'sir', 'shores']
Topic 1: ['bream', 'adelaide', 'consultant', 'administrators', 'dichotomy', 'ul', 'individual', 'murray', 'pittsburgh', 'legends', 'pasadena', 'plugging', 'outlined', 'wsnc', 'sinned', 'assurances', 'dragged', 'mob', 'utility']
Topic 2: ['distinguish', 'fled', 'gnuplot', 'ku', 'restaurant', 'waits', 'lakers', 'galileo', 'problematic', 'placement', 'unreadable', 'wireframe', 'vus', 'laird', 'constraints', 'alaska', 'someones', 'stewart', 'negotiated']
Topic 3: ['breeding', 'pif', 'sherri', 'disobedience', 'edwards', 'evasive', 'corvette', 'willis', 'cdac', 'residence', 'turkish', 'behold', 'dug', 'broadly', 'penetrate', 'retreat'

Epoch----->5 .. LR: 0.005 .. KL_theta: 0.27 .. Rec_loss: 1039.08 .. NELBO: 1039.35
****************************************************************************************************
****************************************************************************************************
VAL Doc Completion PPL: 6892.9
****************************************************************************************************
****************************************************************************************************
Epoch----->6 .. LR: 0.005 .. KL_theta: 0.3 .. Rec_loss: 1040.76 .. NELBO: 1041.06
****************************************************************************************************
****************************************************************************************************
VAL Doc Completion PPL: 6853.4
****************************************************************************************************
******************************************************************

****************************************************************************************************
Epoch----->13 .. LR: 0.005 .. KL_theta: 1.57 .. Rec_loss: 1022.73 .. NELBO: 1024.3
****************************************************************************************************
****************************************************************************************************
VAL Doc Completion PPL: 6721.7
****************************************************************************************************
****************************************************************************************************
Epoch----->14 .. LR: 0.005 .. KL_theta: 1.79 .. Rec_loss: 1032.11 .. NELBO: 1033.9
****************************************************************************************************
****************************************************************************************************
VAL Doc Completion PPL: 6628.3
*****************************************************************


word: family .. neighbors:
 ['family', 'ways', 'present', 'specific', 'cases', 'policy', 'rights', 'groups', 'exists', 'involved', 'law', 'purpose', 'common', 'word', 'building', 'page', 'simply', 'events', 'act', 'areas']

**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.**.
****************************************************************************************************
VAL Doc Completion PPL: 5937.8
****************************************************************************************************
****************************************************************************************************
TEST Doc Completion PPL: 5937.8
****************************************************************************************************
Computing topic coherence...
Topic coherence is: 0.11004299629945907
Computing topic diversity...
Topic diveristy is: 0.1096

The 10 most used topics are [35 17 40 25 28 11 34  6 15 20]


Topic 0: ['writes', 'article'

In [None]:
if mode == 'eval':
    best_epoch = 0
    best_val_ppl = 1e9
    all_val_ppls = []
    print('\n')
    print('Visualizing model quality before training...')
    visualize(model)
    print('\n')
    for epoch in range(1, epochs):
        train(epoch)
        val_ppl = evaluate(model, 'val')
        if val_ppl < best_val_ppl:
            best_epoch = epoch
            best_val_ppl = val_ppl
        else:
            ## check whether to anneal lr
            lr = optimizer.param_groups[0]['lr']
            if anneal_lr and (len(all_val_ppls) > nonmono and val_ppl > min(all_val_ppls[:-nonmono]) and lr > 1e-5):
                optimizer.param_groups[0]['lr'] /= lr_factor
        if epoch % visualize_every == 0:
            visualize(model)
        all_val_ppls.append(val_ppl)

    model = model.to(device)
    val_ppl = evaluate(model, 'val')

## LDA Model

In [31]:
data = fetch_20newsgroups(subset='train')
data = data.data


with open('C:/Users/amitp/OneDrive/Ryerson-DS/DS8008/project-nlp/ETM-master/ETM-master/scripts/stops.txt', 'r') as f:
    stops = f.read().split('\n')
stops[0:5]


data_clean = [re.findall(r'''[\w']+|[.,!?;-~{}`´_<=>:/@*()&'$%#"]''',data[doc]) for doc in range(len(data))]

def contains_punctuation(w):
    return any(char in string.punctuation for char in w)

def contains_numeric(w):
    return any(char.isdigit() for char in w)

data_clean1 = [[w.lower() for w in data_clean[doc] if not contains_punctuation(w)] for doc in range(len(data_clean))]
data_clean1 = [[w for w in data_clean1[doc] if not contains_numeric(w)] for doc in range(len(data_clean1))]
data_clean = [" ".join(data_clean1[doc]) for doc in range(len(data_clean1))]


In [32]:
from gensim import corpora
dictionary = corpora.Dictionary(data_clean1)
corpus = [dictionary.doc2bow(text) for text in data_clean1]


In [10]:
# fit to the model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=50, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=1000,
                                           passes=5,
                                           alpha='auto',
                                           per_word_topics=True)

In [33]:
# Compute the per-word likelihood bound; gensim's perplexity estimate is 2 ** (-bound), so a higher (less negative) bound is better
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_clean1, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -10.55420551940949

Coherence Score:  0.4678729093056614
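
The value printed as "Perplexity" above is negative because gensim's log_perplexity returns a per-word likelihood bound rather than a perplexity. To our understanding gensim derives its perplexity estimate as 2 raised to the negative bound; the helper below, whose name `perplexity_from_bound` is made up for this example, applies that conversion.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

print(perplexity_from_bound(-10.55420551940949))   # bound printed in the cell output above
```

This estimate is computed on the training corpus with gensim's own vocabulary, so it is not directly comparable to the document-completion perplexity reported for ETM.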


In [34]:
# LDA from sklearn
from sklearn.decomposition import LatentDirichletAllocation
cvectorizer = CountVectorizer(min_df=0.3, max_df=1.0, stop_words=None)
cvz = cvectorizer.fit_transform(data_clean)

k = 50
lda = LatentDirichletAllocation(n_components=k,
                                learning_method='online',
                                learning_decay=0.85,
                                learning_offset=10.,
                                evaluate_every=10,
                                verbose=1,
                                random_state=5).fit(cvz)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10, perplexity: 1866.3155


# Conclusion and Future Direction

- We implemented topic modelling in three different settings:
    1. ETM with pre-trained word2vec embeddings (perplexity: 6064.30)
    2. ETM with word embeddings and topic embeddings learned during training (perplexity: 5937.8)
    3. The traditional topic model, LDA (perplexity: 3170.5589)
- On the 20 Newsgroups data, the topic words chosen by ETM with pre-trained embeddings were more convincing than those chosen by LDA or by ETM with learned embeddings.
   1. All three models are highly sensitive to random words.
   2. Tokens should be selected very carefully, because these models tend to form topics out of random words.

**Future work:**
- A model that can select words, correct misspelled words, or apply POS tagging techniques to eliminate irrelevant words would be worth exploring.



# References:



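[1]: Christopher E. Moody, Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec, ArXiv, 2016

[2]: Adji B. Dieng, Francisco J. R. Ruiz and David M. Blei, Topic Modeling in Embedded Spaces, ArXiv, 2019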