## Model inference

In this notebook, we will load a previously trained model, explore the learned topics, and predict topics for a paper on arXiv.

In [1]:
# imports
from utils import scrape_arxiv_abstract
from model import TopicModel
from dataset import ArXivDataset
from gensim.models import LdaModel
from pprint import pprint

### Build topic model

To build a `TopicModel` object, we pass in two paths: the path to the trained model (used to predict topics for new instances) and the path to the dataset used to create that model (used to process new instances before prediction).

In [2]:
# create topic model
model_path = "./models/lda_n12_p5_r929_c50.2"
dataset_path = "./data/dataset.obj"
model = TopicModel(model_path, dataset_path)

### Investigate topics

Next, let us explore the different topics learned by the model so that we can assign understandable topic names to each cluster.

In [3]:
# print topics
pprint(model.topics)

[(0,
  '0.046*"model" + 0.027*"task" + 0.025*"domain" + 0.022*"representation" + '
  '0.021*"text" + 0.020*"knowledge" + 0.020*"language" + 0.017*"information" + '
  '0.016*"semantic" + 0.013*"query"'),
 (1,
  '0.076*"model" + 0.025*"distribution" + 0.020*"method" + 0.019*"datum" + '
  '0.013*"inference" + 0.013*"test" + 0.013*"bayesian" + 0.012*"variable" + '
  '0.011*"approach" + 0.010*"parameter"'),
 (2,
  '0.015*"theory" + 0.013*"distribution" + 0.012*"property" + 0.011*"measure" '
  '+ 0.011*"information" + 0.010*"rule" + 0.010*"class" + 0.010*"probability" '
  '+ 0.010*"function" + 0.009*"problem"'),
 (3,
  '0.065*"image" + 0.025*"method" + 0.018*"object" + 0.017*"segmentation" + '
  '0.015*"detection" + 0.015*"video" + 0.011*"dataset" + 0.010*"feature" + '
  '0.009*"approach" + 0.008*"multi"'),
 (4,
  '0.042*"user" + 0.028*"group" + 0.025*"item" + 0.021*"mechanism" + '
  '0.018*"preference" + 0.018*"design" + 0.017*"social" + 0.017*"product" + '
  '0.014*"market" + 0.012*"optima

We can see that some clusters correspond to specific areas of machine learning. One of them is topic 7, which seems to relate directly to sequential and time-series data. Another is topic 10, which seems to be related to reinforcement learning.
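
The topic strings printed above follow the `weight*"word"` layout that gensim uses when it formats topics. To compare clusters programmatically rather than by eye, the strings can be split back into word and weight pairs. The snippet below is a minimal sketch of that idea; `parse_topic_string` is a hypothetical helper, not part of this project's `TopicModel` class, and the example string is a shortened copy of topic 3 from the output above.

```python
import re

# Matches one term of a topic string, e.g. 0.065*"image"
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]+)"')


def parse_topic_string(topic):
    """Split a formatted topic string into (word, weight) pairs.

    Hypothetical helper for illustration only.
    """
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic)]


# Shortened copy of topic 3 from the printed output
example = '0.065*"image" + 0.025*"method" + 0.018*"object"'
pairs = parse_topic_string(example)
print(pairs)
```

With the pairs in hand, it becomes straightforward to, for example, list the words that several topics share (such as "model" or "method") and discount them when choosing a name.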

To make it easier to refer to these topic clusters, we will assign (tentative) names to each of them. Note that these names are subject to interpretation and are only assigned to help "summarize" each cluster.

In [4]:
# set topic names
topic_names = [
    "Natural language processing",
    "Probability + Inference",
    "ML-related terms?",
    "Computer vision",
    "Recommendation",
    "Algorithms + Optimization",
    "Deep learning",
    "Sequences + Time series",
    "ML-related terms?",
    "Estimation + Linear algebra?",
    "Reinforcement learning",
    "Paper-related terms?"
]

model.set_topic_names(topic_names)
pprint(model.topics)

[('Natural language processing',
  '0.046*"model" + 0.027*"task" + 0.025*"domain" + 0.022*"representation" + '
  '0.021*"text" + 0.020*"knowledge" + 0.020*"language" + 0.017*"information" + '
  '0.016*"semantic" + 0.013*"query"'),
 ('Probability + Inference',
  '0.076*"model" + 0.025*"distribution" + 0.020*"method" + 0.019*"datum" + '
  '0.013*"inference" + 0.013*"test" + 0.013*"bayesian" + 0.012*"variable" + '
  '0.011*"approach" + 0.010*"parameter"'),
 ('ML-related terms?',
  '0.015*"theory" + 0.013*"distribution" + 0.012*"property" + 0.011*"measure" '
  '+ 0.011*"information" + 0.010*"rule" + 0.010*"class" + 0.010*"probability" '
  '+ 0.010*"function" + 0.009*"problem"'),
 ('Computer vision',
  '0.065*"image" + 0.025*"method" + 0.018*"object" + 0.017*"segmentation" + '
  '0.015*"detection" + 0.015*"video" + 0.011*"dataset" + 0.010*"feature" + '
  '0.009*"approach" + 0.008*"multi"'),
 ('Recommendation',
  '0.042*"user" + 0.028*"group" + 0.025*"item" + 0.021*"mechanism" + '
  '0.018*"
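
Two clusters in the list above share the tentative name "ML-related terms?", so a prediction carrying that name cannot be traced back to a single topic index. One possible way to keep the names unique is to append the topic index to any name that occurs more than once. This is only a sketch: `disambiguate_names` is a hypothetical helper, and whether `set_topic_names()` needs unique names is an assumption we have not verified.

```python
from collections import Counter


def disambiguate_names(names):
    """Append the topic index to every name that appears more than once.

    Hypothetical helper for illustration only.
    """
    counts = Counter(names)
    return [
        f"{name} (topic {index})" if counts[name] > 1 else name
        for index, name in enumerate(names)
    ]


# Illustrative three-name list, not the full list used in the notebook
sample_names = ["ML-related terms?", "Computer vision", "ML-related terms?"]
unique_names = disambiguate_names(sample_names)
print(unique_names)
```

The resulting list could then be passed to `model.set_topic_names()` in place of the original one.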

### Predict topics for a paper

Let us see how our model predicts topics for a paper taken directly from arXiv. Using the `scrape_arxiv_abstract()` function, we can extract the title and the abstract of any paper on arXiv given its URL. Once scraped, the title and abstract can be passed to our topic model's `predict()` method.

To illustrate, let us scrape the title and abstract from the seminal paper ["Attention Is All You Need" (2017)](https://arxiv.org/abs/1706.03762) and see what topics the model detects.

In [5]:
# paper: "Attention Is All You Need" (Vaswani et al., 2017)
paper_url = "https://arxiv.org/abs/1706.03762"
text = scrape_arxiv_abstract(paper_url)
print(text)

Attention Is All You Need

  The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks in an encoder-decoder configuration. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer, based
solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to be
superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014
English-to-German translation task, improving over the existing best results,
including ensembles by over 2 BLEU. On the WMT 2014 English-to-French
translation task, our model establishes a new single-model state-of-the-art
BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction
of the training costs of the best models from the literature

In [6]:
# get predictions
model.predict(text)[:3]

[('Deep learning', 0.7827023),
 ('Natural language processing', 0.18202062),
 ('ML-related terms?', 0.022977384)]
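
The result is a list of (topic name, probability) pairs sorted from most to least likely, and the `[:3]` slice keeps the three strongest topics. The implementation of `predict()` is not shown in this notebook, so the snippet below is only a sketch of how such a ranking could be produced from (topic index, probability) pairs, the shape that gensim's `LdaModel.get_document_topics()` returns. `rank_topics` is a hypothetical helper, and the probabilities in the example are made up for illustration.

```python
def rank_topics(doc_topics, topic_names, top_k=None):
    """Replace topic indices with names and sort by descending probability.

    Hypothetical helper for illustration only; not the project's predict().
    """
    named = [(topic_names[topic_id], float(prob)) for topic_id, prob in doc_topics]
    named.sort(key=lambda pair: pair[1], reverse=True)
    return named if top_k is None else named[:top_k]


# Made-up inputs for illustration
names = ["Natural language processing", "Probability + Inference", "Deep learning"]
doc_topics = [(0, 0.18), (1, 0.02), (2, 0.80)]

ranked = rank_topics(doc_topics, names, top_k=2)
print(ranked)
```

Sorting before slicing matters here: slicing the unsorted pairs would return the first topics by index rather than the most probable ones.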