**Jupyter Notebook to generate BART-TL topic label predictions.**

In [20]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd

1. Read the ground-truth topics, then define the BART-TL model and the number of topic words used to generate the predictions

In [42]:
# Read the topics (names and descriptions)
topics = pd.read_json("topics_20news.json")

# Extract the topics' names and define a results dataframe to store the predictions
results = pd.DataFrame({'topics':topics['name']})
# Show topics' names
print(results)

                   topics
0            sport hockey
1        religion atheism
2           science space
3        science medicine
4           politics_misc
5   computer mac hardware
6        politics mideast
7   computer ibm hardware
8                for sale
9     science electronics
10  computer windows misc
11       motor motorcycle
12         sport baseball
13     religion christian
14          politics guns
15      computer graphics
16            motor autos
17          religion misc
18     computer windows x
19          science crypt


In [43]:
# Initialize the BART-TL model
mname = "cristian-popa/bart-tl-all"
# mname = "cristian-popa/bart-tl-ng"

tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

In [None]:
# Number of words from each topic description used as model input
num = 10
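
The `num` value above controls how many description words are passed to the model. As a minimal sketch of that step, the helper below builds the input string from a comma-separated description. The name `build_model_input` is hypothetical, and the last two words of the example description (`nhl`, `score`) are made up for illustration. Unlike the loop used later in this notebook, it strips stray whitespace and does not leave a trailing space, which may change tokenization slightly.

```python
def build_model_input(description: str, num_words: int = 10) -> str:
    """Join the first `num_words` comma-separated words with single spaces."""
    # Split on commas, then drop surrounding whitespace and empty entries
    words = [w.strip() for w in description.split(",") if w.strip()]
    return " ".join(words[:num_words])


# Hypothetical description: the first ten words match the hockey topic,
# the last two are invented for this example.
example = "game,team,play,hockey,player,win,goal,season,fan,playoff,nhl,score"
print(build_model_input(example, num_words=10))
# game team play hockey player win goal season fan playoff
```

Smaller values of `num_words` give the model less context per topic, so it is worth keeping the value in one place and reusing it wherever the input is built.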

2. For each topic, generate the label predicted by the selected BART-TL model and save the results to a JSON file

In [44]:
# Initialize the labels list to save the results
labels = []
# For each topic
for i in range(0,len(topics)):
    # Get the topic description and split it into words (comma-separated)
    description = topics['description'].values[i]
    words = description.split(",")

    # Generate the question to ask from the first `num` words
    # (named `text` to avoid shadowing the built-in `input`)
    text = ""
    for w in words[0:num]:
        text = text + w + " "
    # Show the question to ask
    print(text)

    # Tokenize the question
    enc = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
    # Generate the topic label given the tokenized question
    outputs = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_length=15,
        min_length=1,
        do_sample=False,
        num_beams=25,
        length_penalty=1.0,
        repetition_penalty=1.5
    )

    # Decode the topic label
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Save the topic label
    labels.append(decoded)
    # Show the topic label
    print(decoded)
# Save labels results in dataframe
results = results.assign(Answer=labels)


game team play hockey player win goal season fan playoff 
on the other hand
god religion atheist moral claim point objective good belief argument 
belief
space nasa launch system orbit earth mission satellite shuttle moon 
mars orbiter
patient disease medical doctor study food health problem effect work 
on the other hand
government president state law work give man american drug stephanopoulo 
the president
mac apple problem drive system work monitor computer card disk 
operating system
armenian israel turkish jew arab israeli muslim state kill government 
arabia
drive card system problem work controller disk scsus ide run 
usb
offer sale sell include drive price shipping condition system card 
offer
work circuit ground power wire good line find battery copy 
voltage regulator
window file run driver problem program work card system version 
application programming interface
bike ride dod motorcycle dog good bmw work rider road 
on the other hand
game team win hit player run baseball g

In [46]:
# Save the results as a JSON file
results.to_json("Bart-TL/topics_results_10_bart-tl-all.json")
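
The save step above writes into a `Bart-TL/` directory and fails if that directory does not exist. The sketch below shows one way to build the output path and create the directory first. It assumes the file name encodes the number of words and the model name after the `/`, as `topics_results_10_bart-tl-all.json` suggests. The helper name `results_path` is hypothetical.

```python
import tempfile
from pathlib import Path


def results_path(model_name: str, num_words: int, out_dir: str = "Bart-TL") -> Path:
    """Build the output path and create the target directory if needed."""
    # Keep only the part after the last "/" of the model identifier
    short_name = model_name.split("/")[-1]
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumed naming pattern: topics_results_<num_words>_<model>.json
    return directory / f"topics_results_{num_words}_{short_name}.json"


# Demonstrate with a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    path = results_path("cristian-popa/bart-tl-all", 10, out_dir=tmp)
    print(path.name)  # topics_results_10_bart-tl-all.json
```

In this notebook the call would then read `results.to_json(results_path(mname, num))`, which keeps the file name in sync with the selected model and word count.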