<a href="https://colab.research.google.com/github/th00masml/Natural-Language-Processing/blob/master/Tutorial_(v1_8_1)_Training%2C_Saving%2C_Loading_and_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial (v1.8.1): Training, Saving, Loading and Testing

(last updated 12-01-2021)

In this tutorial, we use contextualized topic modeling to extract topics from a collection of Wikipedia abstracts.

## Topic Models 

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents.

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://travis-ci.com/MilaNLProc/contextualized-topic-models](https://travis-ci.com/MilaNLProc/contextualized-topic-models.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)




# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
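
Once the runtime type has been switched, a quick check can confirm that the GPU is actually visible. The sketch below assumes PyTorch is importable (it is typically preinstalled on Colab) and falls back to a plain message otherwise. The helper name `describe_accelerator` is ours, not part of any library.

```python
def describe_accelerator():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # assumption: PyTorch is available in the runtime
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return "GPU available: " + torch.cuda.get_device_name(0)
    return "no GPU visible; running on CPU"


print(describe_accelerator())
```

If the message reports that no GPU is visible, revisit the notebook settings before training, since the training step later in the tutorial is much slower on CPU.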

# Data

We are going to download some abstracts from Wikipedia and use them to run our topic modeling pipeline. 

In [1]:
%%capture
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_prep.txt

# Installing Contextualized Topic Models

Now, we install the Contextualized Topic Models library.

In [2]:
%%capture
!pip install contextualized-topic-models==1.8.1
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

+ Runtime → Restart Runtime

## Importing what we need

In [1]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file, TopicModelDataPreparation
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
from gensim.corpora.dictionary import Dictionary
from gensim.models import ldamodel 
import os
import numpy as np
import pickle

Let's read our data files and store the documents as lists of strings.

In [2]:
with open("dbpedia_sample_abstract_20k_prep.txt", 'r') as fr_prep:
  text_training_preprocessed = [line.strip() for line in fr_prep.readlines()]

with open("dbpedia_sample_abstract_20k_unprep.txt", 'r') as fr_unprep:
  text_training_not_preprocessed = [line.strip() for line in fr_unprep.readlines()]

NOTE: Make sure that the two lists of documents have the same length and that each unpreprocessed document sits at the same index as its preprocessed counterpart.

In [3]:
assert len(text_training_preprocessed) == len(text_training_not_preprocessed)

print(text_training_not_preprocessed[0])
print(text_training_preprocessed[0])

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry
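
The `assert` above only compares lengths. As a slightly stricter check, the following sketch also reports preprocessed documents that ended up empty, which can happen when preprocessing removes every word of a short abstract. The helper name `check_alignment` and the toy data are ours and only illustrate the idea.

```python
def check_alignment(preprocessed, unpreprocessed):
    """Return indices of empty preprocessed documents; raise if lengths differ."""
    if len(preprocessed) != len(unpreprocessed):
        raise ValueError(
            f"length mismatch: {len(preprocessed)} preprocessed "
            f"vs {len(unpreprocessed)} unpreprocessed"
        )
    return [i for i, doc in enumerate(preprocessed) if not doc.strip()]


# Toy data standing in for the two downloaded files (hypothetical content)
prep = ["mid peninsula highway proposed", "", "governorates yemen"]
unprep = [
    "The Mid-Peninsula Highway is a proposed freeway.",
    "And so on.",
    "Dhale is one of the governorates of Yemen.",
]
print(check_alignment(prep, unprep))
```

Whether empty documents should be dropped from both lists or kept is a modeling decision; the sketch only surfaces them.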


## Let's split the documents into training and testing sets

In [4]:
training_bow_documents = text_training_preprocessed[0:15000]
training_contextual_document = text_training_not_preprocessed[0:15000]

testing_bow_documents = text_training_preprocessed[15000:]
testing_contextual_documents = text_training_not_preprocessed[15000:]
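
The slices above take the first 15,000 lines for training and the rest for testing. If the file happened to be ordered in some meaningful way, a shuffled split could be preferable. Here is a minimal sketch that shuffles indices once and applies them to both lists so the pairs stay aligned. The function name `split_aligned` and its parameters `train_size` and `seed` are hypothetical.

```python
import random


def split_aligned(bow_docs, contextual_docs, train_size, seed=42):
    """Shuffle two aligned lists with the same permutation and split them."""
    if len(bow_docs) != len(contextual_docs):
        raise ValueError("both lists must have the same length")
    indices = list(range(len(bow_docs)))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    train = ([bow_docs[i] for i in train_idx], [contextual_docs[i] for i in train_idx])
    test = ([bow_docs[i] for i in test_idx], [contextual_docs[i] for i in test_idx])
    return train, test


# Toy documents; in the tutorial these would be the two lists read from disk
bow = [f"doc {i}" for i in range(10)]
ctx = [f"Document number {i}." for i in range(10)]
(train_bow, train_ctx), (test_bow, test_ctx) = split_aligned(bow, ctx, train_size=7)
print(len(train_bow), len(test_bow))
```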

## Creating the Training Dataset
Let's pass our preprocessed and unpreprocessed documents to our TopicModelDataPreparation object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of the documents. This operation allows us to create our training dataset.


In [5]:
tp = TopicModelDataPreparation("bert-base-nli-mean-tokens")

training_dataset = tp.create_training_set(training_contextual_document, training_bow_documents)

100%|██████████| 405M/405M [00:11<00:00, 36.0MB/s]




Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words. Also, we might want to keep only the most frequent words in the BoW, since too many words might not help.

And what about the **unpreprocessed text**? We provide unpreprocessed text as the input for BERT (or the contextualized model of your choice) to let the model output more accurate document representations.
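
To make the idea of preprocessing concrete, here is a rough sketch that lowercases the text, strips punctuation, drops stopwords, and keeps only the most frequent words. It is an illustration only: it is not how the provided `_prep.txt` file was produced, and the function name `simple_preprocess` and its parameters are our own.

```python
import re
from collections import Counter


def simple_preprocess(documents, vocabulary_size=2000, stopwords=frozenset()):
    """Lowercase, keep alphabetic tokens, and retain only the most frequent words."""
    tokenized = [
        [tok for tok in re.findall(r"[a-z]+", doc.lower()) if tok not in stopwords]
        for doc in documents
    ]
    counts = Counter(tok for doc in tokenized for tok in doc)
    keep = {word for word, _ in counts.most_common(vocabulary_size)}
    return [" ".join(tok for tok in doc if tok in keep) for doc in tokenized]


docs = ["The highway, the freeway!", "A highway near the freeway in Ontario."]
stop = {"the", "a", "in", "near"}
print(simple_preprocess(docs, vocabulary_size=2, stopwords=stop))
```

The original strings in `docs` are what would be passed to the contextualized model, while the returned strings would feed the bag of words.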

Let's check the vocabulary

In [8]:
tp.vocab[:30]

['abbreviated',
 'academic',
 'academy',
 'access',
 'according',
 'achieved',
 'acquired',
 'acre',
 'acres',
 'across',
 'act',
 'acting',
 'action',
 'active',
 'activist',
 'activities',
 'activity',
 'actor',
 'actress',
 'acts',
 'ad',
 'added',
 'addition',
 'additional',
 'adelaide',
 'adjacent',
 'administered',
 'administration',
 'administrative',
 'adult']
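
The alphabetically sorted vocabulary shown above is what indexes the columns of the bag of words. The following sketch builds a tiny vocabulary and count matrix by hand to show the relationship. It is a conceptual illustration, not the library's implementation, and `build_bow` is a hypothetical helper.

```python
def build_bow(preprocessed_docs):
    """Build a sorted vocabulary and one row of word counts per document."""
    vocab = sorted({tok for doc in preprocessed_docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in preprocessed_docs:
        row = [0] * len(vocab)  # one column per vocabulary entry
        for tok in doc.split():
            row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix


vocab, bow = build_bow(["highway ontario highway", "academy actor"])
print(vocab)
print(bow)
```

Each row has as many entries as the vocabulary has words, which is why the model in the next section receives `input_size=len(tp.vocab)`.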

## Training our Combined Contextualized Topic Model

Finally, we can fit our new topic model. We will ask the model to find 50 topics in our collection (the `n_components` parameter of the CombinedTM object).

(Increase the number of epochs if you want to get better results)

In [9]:
ctm = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, num_epochs=100, n_components=50)
ctm.fit(training_dataset)  

Epoch: [100/100]	 Seen Samples: [1500000/1500000]	Train Loss: 134.9792868815104	Time: 0:00:06.362935: : 100it [10:29,  6.30s/it]


### Saving the Model

In [10]:
ctm.save(models_dir="./")
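
Before loading, it can help to look at what was actually written to disk. The sketch below lists checkpoint files per model directory using only the standard library. It assumes that checkpoints are stored as `epoch_<n>.pth` files inside a per-model subdirectory of `models_dir`; that naming is an assumption inferred from the directory name and the `epoch` argument used in the loading cell, so adjust the pattern if your files look different. `list_checkpoints` is a hypothetical helper.

```python
import glob
import os
import re


def list_checkpoints(models_dir="./"):
    """Map each model directory under models_dir to its sorted saved epoch numbers."""
    found = {}
    # Assumed layout: <models_dir>/<model directory>/epoch_<n>.pth
    for path in glob.glob(os.path.join(models_dir, "*", "epoch_*.pth")):
        match = re.fullmatch(r"epoch_(\d+)\.pth", os.path.basename(path))
        if match:
            found.setdefault(os.path.dirname(path), []).append(int(match.group(1)))
    return {directory: sorted(epochs) for directory, epochs in found.items()}


print(list_checkpoints("./"))
```

The directory name and epoch number printed here are the values to pass to `ctm.load`.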



### Loading the Model

In [11]:
del ctm

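In [13]:
ctm = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, num_epochs=100, n_components=50)

# The directory is the one created by ctm.save(models_dir="./") above.
# Assumption: the checkpoint is named after the zero-based index of the last
# training epoch, so a 100-epoch run is expected to produce epoch 99.
ctm.load("contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99/",
         epoch=99)

NOTE: `load` raises a `FileNotFoundError` if the requested `epoch` does not match a checkpoint that was actually saved in that directory.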

# Topics

After training, it is time to look at our topics: we can use the `get_topic_lists` function to get them. It also accepts a parameter that lets you select how many words you want to see for each topic.

If you look at the topics, you should see that most of them make sense and are representative of a collection of documents that comes from Wikipedia (general knowledge).

In [None]:
ctm.get_topic_lists(5)
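
Conceptually, listing a topic's words amounts to ranking the vocabulary by each topic's word weights and keeping the top entries. The sketch below shows that idea on a hand-made weight matrix. It is not the library's code; `top_words_per_topic` and the toy weights are ours.

```python
def top_words_per_topic(topic_word_weights, vocab, k=5):
    """Return the k highest-weighted vocabulary words for each topic."""
    topics = []
    for weights in topic_word_weights:
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:k]])
    return topics


vocab = ["actor", "actress", "highway", "ontario", "province"]
# Hypothetical weights: one row per topic, one column per vocabulary word
weights = [
    [0.9, 0.8, 0.1, 0.0, 0.2],
    [0.0, 0.1, 0.7, 0.9, 0.8],
]
print(top_words_per_topic(weights, vocab, k=2))
```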

## Using the Test Set

Now we are going to use the test set: we want to predict the topics of unseen documents.

In [15]:
testing_dataset = tp.create_test_set(testing_contextual_documents, testing_bow_documents) # create dataset for the testset
predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=10)


Sampling: [10/10]: : 10it [00:18,  1.84s/it]
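
Taking only the single most probable topic discards information when a document mixes several themes. Assuming `predictions` holds one row per test document and one column per topic (which is how it is indexed in the next cell), a small helper can return the top few topics with their weights. The name `top_topics` and the toy rows are hypothetical.

```python
def top_topics(doc_topic_row, k=3):
    """Return (topic_index, weight) pairs for the k most probable topics."""
    ranked = sorted(range(len(doc_topic_row)), key=lambda t: doc_topic_row[t], reverse=True)
    return [(t, doc_topic_row[t]) for t in ranked[:k]]


# Toy stand-in for the real predictions: 2 documents, 3 topics
toy_predictions = [
    [0.05, 0.70, 0.25],
    [0.60, 0.10, 0.30],
]
for row in toy_predictions:
    print(top_topics(row, k=2))
```

The same function works on a row of a NumPy array, since it only relies on indexing and `len`.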


In [17]:
print(testing_contextual_documents[15])

topic_index = np.argmax(predictions[15])
ctm.get_topic_lists(5)[topic_index]

Dhale (Arabic: الضالع‎‎ Aḍ Ḍāliʿ) province is one of the governorates of Yemen that have been created after the announcement of Yemeni unification. The population of the province accounted for (2.4%) of the total population of the Republic, and allocated administratively into (9) districts. Dali city is the centre of


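TypeError: ignored

NOTE: `get_topic_lists` needs a trained or successfully loaded model. Calling it on a freshly constructed `CombinedTM` (for example after a failed `load`) is the likely cause of a `TypeError` here.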