# Summarizing Topic Models with Transformers

This kernel uses preprocessed data from [my earlier kernel](https://www.kaggle.com/donkeys/my-little-preprocessing). It first explores a bit of the topic model parameter space, uses the resulting parameters to build topic models with [Gensim LDA](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html), finds the most representative documents for each topic, and summarizes those documents using [HuggingFace Transformers](https://github.com/huggingface/transformers). The idea was to look at the possibility of summarizing topic models built from large sets of text, and whether reasonable topic models can be found, ...

### Version History
- v11 update preprocessing with April 17th set, fixed filepath indices, fix other minor issues 
- v10 updated preprocessing
- v8 updated preprocessing
- v4-5 update to new preprocessing, pmc docs
- v3 clean up tokens from a few excess words id'd in v2, summarize top 3 / topic as one set for topic
- v2 first public version. find topic count using coherence values, describe top 20 words/tokens per topic, and top 3 documents per topic, summarize the top 3 / topic using transformers


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import logging

import os


TQDM for progress bars in notebooks:

In [None]:
from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
!ls /kaggle/input/

In [None]:
!ls /kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json | head -n 10

In [None]:
#!head /kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset/pdf_json/00623bf2715e25d3acacb3f210d6888ed840e3cb.json -n 200

# Read in the Data

I am using a preprocessed dataset generated by my [other notebook](https://www.kaggle.com/donkeys/preprocess-input-docs-from-apr-17-upload-dataset), and uploaded as a [dataset](). This avoids some [memory issues](https://www.kaggle.com/general/142462#803723) that seem to occur on Kaggle when using notebook outputs directly as inputs. This dataset has directories for "paragraphs" and "whole" documents. The first one hosts all docs split into paragraphs according to the original inputs. The second one combines each document into a single text file.

In [None]:
!ls /kaggle/input/covid-nlp-preprocess/output

In [None]:
!ls /kaggle/input/covid-nlp-preprocess/output/paragraphs

The above shows two .txt files and four directories, which appear under both the paragraphs and whole dirs. The directories match those in the Kaggle input dataset for documents; only the contents have been preprocessed to remove stopwords, lemmatize, and clean up a bit. The .txt files contain a set of unrecognized words and the closest matching recognized words found for them. So one could update the preprocessor if there is a frequent typo, or similar, in the dataset documents.

Anyway, for this kernel the important bits are in the four directories / folders: the preprocessed documents.

Again, as in preprocessing, I use a simple data structure to hold the different forms of the text in each doc:

In [None]:
class COVDoc:
    def __init__(self):
        self.filepath_proc = None
        self.filepath_orig = None
        #self.basepath_orig = None
        #self.text_proc = None
        self.text_orig = None
        self.abstract = None
        self.tokenized_proc = None
        self.doc_type = None
    
    #this function allows me to lazy-load the original text to save memory
    def load_orig(self):
        with open(self.filepath_orig) as f:
            d = json.load(f)
            body = ""
            for idx, paragraph in enumerate(d["body_text"]):
                body += f"{paragraph['text']}\n"
            self.text_orig = body

    def load_abstract(self):
        with open(self.filepath_orig) as f:
            d = json.load(f)
            if "abstract" in d:
                abstract_list = d["abstract"]
                if len(abstract_list) > 0:
                    self.abstract = d["abstract"][0]["text"]
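
To make the lazy-loading idea concrete, here is a minimal standalone sketch of the same two reads, assuming the JSON layout used above (a `body_text` list and an optional `abstract` list, each holding dicts with a `text` key). The helper names `read_body_text` and `read_first_abstract` and the toy document are made up for illustration. Unlike `load_orig`, the sketch joins paragraphs with newlines instead of appending a newline after each one.

```python
import json
import os
import tempfile

def read_body_text(json_path):
    """Join the paragraph texts under "body_text" into one string."""
    with open(json_path) as f:
        d = json.load(f)
    return "\n".join(p["text"] for p in d.get("body_text", []))

def read_first_abstract(json_path):
    """Return the text of the first abstract paragraph, or None if there is none."""
    with open(json_path) as f:
        d = json.load(f)
    abstract_list = d.get("abstract") or []
    return abstract_list[0]["text"] if abstract_list else None

# Made-up miniature document, written to a temporary file for the demo.
toy_doc = {
    "abstract": [{"text": "A short abstract."}],
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.json")
    with open(toy_path, "w") as f:
        json.dump(toy_doc, f)
    toy_body = read_body_text(toy_path)
    toy_abstract = read_first_abstract(toy_path)

print(toy_abstract)
print(toy_body)
```

Reading the file only when the text is needed keeps just the file path in memory for the thousands of documents that never make it into the top lists.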


Function to load different datasets into memory, matching the preprocessed texts to their original files:

In [None]:
import glob, os, json

def load_docs(base_path, base_path_orig, doc_type):
    loaded_docs = []
    file_paths_proc = glob.glob(base_path)
    file_names_proc = [os.path.basename(path) for path in file_paths_proc]
    file_names_orig = [os.path.splitext(filename)[0]+".json" for filename in file_names_proc]
    #file_paths_orig = [os.path.join(base_path_orig, filename) for filename in file_names_orig]
    for idx, filepath_proc in enumerate(file_paths_proc):
        doc = COVDoc()
        doc.doc_type = doc_type
        #doc.basepath_orig = base_path_orig
        doc.filepath_proc = filepath_proc
        filename = file_names_orig[idx]
        if filename.startswith("PMC"):
            filepath = os.path.join(base_path_orig, "pmc_json", filename)
        else:
            filepath = os.path.join(base_path_orig, "pdf_json", filename)
        doc.filepath_orig = filepath
        with open(filepath_proc) as f:
            d = f.read()
            doc.tokenized_proc = d.strip().split(" ")
            if len(doc.tokenized_proc) < 2:
                print("skipping doc due to no content:"+filepath_proc)
                continue
            if "PMC2114261" in filename:
                print(doc.filepath_proc)
                print(doc.filepath_orig)
            doc.tokenized_proc = [token for token in doc.tokenized_proc if (token != "et" and token != "al" and token != "fig") ]
        loaded_docs.append(doc)
    return loaded_docs
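
The filename matching inside `load_docs` can be looked at in isolation. The sketch below pulls that logic into a hypothetical helper, `original_path_for` (not part of the notebook), and the example paths are made up. It assumes the same layout as above: a preprocessed `.txt` file maps to a `.json` file with the same stem, under `pmc_json` when the name starts with "PMC" and under `pdf_json` otherwise.

```python
import os

def original_path_for(processed_path, base_path_orig):
    """Map a preprocessed .txt path to the path of its original .json file."""
    # Swap the .txt extension for .json, keeping the rest of the file name.
    stem = os.path.splitext(os.path.basename(processed_path))[0]
    filename = stem + ".json"
    # Names starting with "PMC" live under pmc_json, all others under pdf_json.
    subdir = "pmc_json" if filename.startswith("PMC") else "pdf_json"
    return os.path.join(base_path_orig, subdir, filename)

# Made-up example paths, only to show the mapping.
pmc_example = original_path_for("/tmp/whole/custom_license/PMC123.xml.txt", "/data/custom_license")
pdf_example = original_path_for("/tmp/whole/custom_license/0a1b2c.txt", "/data/custom_license")
print(pmc_example)
print(pdf_example)
```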

## The Four Datasets

In [None]:
!ls /kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv


In [None]:
!ls /kaggle/input/covid-nlp-preprocess/output/whole

Load all four datasets in preprocessed form, and capture reference to original, non-processed file:

In [None]:
med_docs = load_docs("/kaggle/input/covid-nlp-preprocess/output/whole/biorxiv_medrxiv/*.txt", "/kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv", "medx")
len(med_docs)

In [None]:
comuse_docs = load_docs("/kaggle/input/covid-nlp-preprocess/output/whole/comm_use_subset/*.txt", "/kaggle/input/CORD-19-research-challenge/comm_use_subset/comm_use_subset", "comm_user")
len(comuse_docs)

In [None]:
noncom_docs = load_docs("/kaggle/input/covid-nlp-preprocess/output/whole/noncomm_use_subset/*.txt", "/kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset", "noncomm")
len(noncom_docs)

In [None]:
custom_docs = load_docs("/kaggle/input/covid-nlp-preprocess/output/whole/custom_license/*.txt", "/kaggle/input/CORD-19-research-challenge/custom_license/custom_license", "custom")
len(custom_docs)

In [None]:
#!cat /kaggle/input/CORD-19-research-challenge/noncomm_use_subset/noncomm_use_subset/pmc_json/PMC5632742.xml.json
#!cat /kaggle/input/CORD-19-research-challenge/custom_license/custom_license/pdf_json/94f6c2e70e777539702580b3afc0c2d45a4d57b0.json

In [None]:
#!ls kaggle/input/CORD-19-research-challenge/biorxiv_medrxiv/biorxiv_medrxiv/pdf_orig

# Gensim Processing and LDA Topic Modelling

Collect all four datasets into one, and convert the documents into Gensim consumable format:

In [None]:
#https://www.machinelearningplus.com/nlp/gensim-tutorial/
from gensim.models import LdaModel, LdaMulticore
from gensim import corpora

#copy the first list so that re-running this cell does not keep growing med_docs
all_docs = list(med_docs)
all_docs.extend(comuse_docs)
all_docs.extend(noncom_docs)
all_docs.extend(custom_docs)

doc_tokens = [doc.tokenized_proc for doc in all_docs]

#id to word mapping for gensim
id2word = corpora.Dictionary(doc_tokens)

In [None]:
corpus = [id2word.doc2bow(text) for text in doc_tokens] 
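
For anyone unfamiliar with the bag-of-words format, here is a rough plain-Python sketch of the idea: each document becomes a list of `(token_id, count)` pairs. This is only an illustration, not Gensim's implementation. The helpers `build_vocab` and `to_bow` and the toy documents are made up, and Gensim's `Dictionary` may assign token ids in a different order.

```python
from collections import Counter

def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Made-up toy documents.
toy_docs = [["virus", "cell", "virus"], ["patient", "virus"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_corpus)
```

Word order is lost in this representation, which is fine for LDA since it only looks at which words occur in which documents and how often.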

We are short on memory again, so clear everything we can:

In [None]:
del doc_tokens

## Gensim (Hyper)Parameter Search

Uncomment the cell below to enable Gensim logging to console. This shows at what point the topics start to converge, and how much they converge. I found that with 2 passes and about 250+ iterations they seemed to converge quite well (around 90%), so I went with that. The logging is disabled in the public kernel because it produces a lot of spammy text.

In [None]:
#https://stackoverflow.com/questions/7016056/python-logging-not-outputting-anything
#for handler in logging.root.handlers[:]:
#    logging.root.removeHandler(handler)
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)


If you would like to play with a small, single-instance model, try uncommenting the cell below and playing with the parameters.

In [None]:
#test_lda = LdaMulticore(corpus,num_topics=2, id2word=id2word, iterations=500, passes=2) 
#sentence = 'i like red wine with steak'
#sentence2 = [word for word in sentence.lower().split()] 
#test_lda[id2word.doc2bow(sentence2)]

### Most Coherent Topics

Gensim has a notion of [Topic Coherence](https://rare-technologies.com/what-is-topic-coherence/). The higher the coherence value, the better the topics should be. So I tried some different values to pick the best one (according to the coherence measure):

In [None]:
from gensim.models import CoherenceModel
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3): 
    """
    Compute u_mass coherence for various numbers of topics
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts (unused for u_mass, only needed for c_v)
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment in num of topics between models
    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in tqdm(range(start, limit, step)):
        #use the dictionary passed as a parameter instead of relying on the global id2word
        model = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary, iterations=600, passes=2)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, corpus=corpus, dictionary=dictionary, coherence='u_mass')
#        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
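
To give some intuition for what a document co-occurrence based coherence like `u_mass` rewards, here is a small plain-Python sketch in the spirit of the UMass measure. For each pair of top words it takes the log of (number of documents containing both words + 1) divided by the number of documents containing the higher-ranked word, then averages over the pairs. This is an illustration under my own assumptions, not Gensim's code; the smoothing constant and the aggregation in Gensim may differ. The function name and toy documents are made up.

```python
import math

def umass_like_coherence(top_words, documents):
    """Average log co-occurrence ratio over ordered pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    pairs = 0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the documents
            score += math.log((doc_freq(top_words[i], top_words[j]) + 1) / denom)
            pairs += 1
    return score / pairs if pairs else 0.0

# Made-up toy documents: "a" and "b" co-occur often, "b" and "c" never do.
toy_documents = [["a", "b"], ["a", "b"], ["a", "c"]]
coherent = umass_like_coherence(["a", "b"], toy_documents)
incoherent = umass_like_coherence(["b", "c"], toy_documents)
print(coherent, incoherent)
```

Words that rarely share a document pull the score down, which is why values closer to zero are treated as better when picking the model.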

This experiments with topic models of different sizes to see what kind of coherence each gives. The model with the highest coherence is then used later. For now the loop just runs a few different values, although I tried more earlier:

In [None]:
topic_count_start = 3
topic_count_step = 1
topic_count_limit = 10
# Can take a long time to run.
model_list, coherence_values = \
  compute_coherence_values(dictionary=id2word, corpus=corpus, texts=None, limit=topic_count_limit, start=topic_count_start, step=topic_count_step)
#  compute_coherence_values(dictionary=id2word, corpus=corpus, texts=doc_tokens, limit=topic_count_limit, start=topic_count_start, step=topic_count_step)


Print and visualize the coherence results for the different model configurations:

In [None]:
coherence_values

In [None]:
import matplotlib.pyplot as plt 
%matplotlib inline
# Show graph
x = range(topic_count_start, topic_count_limit, topic_count_step)
plt.plot(x, coherence_values) 
plt.xlabel("Num Topics") 
plt.ylabel("Coherence score") 
#the trailing comma makes this a one-item tuple; a bare string would be split into characters
plt.legend(("coherence_values",), loc='best') 
plt.show()

Find the model with highest coherence:

In [None]:
topic_idx = np.argmax(coherence_values)
print(coherence_values[topic_idx]) 
test_lda = model_list[topic_idx]


## Final Topic Models

How many topics did we end up choosing?

In [None]:
test_lda.num_topics

Which model was it in the list of tried models?

In [None]:
topic_idx
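
Since the models were built for `range(start, limit, step)`, the index of the best model maps straight back to a topic count as `start + index * step`. A small sketch of that mapping, with a hypothetical helper name and made-up coherence values:

```python
def best_topic_count(coherence_values, start, step):
    """Return (index of the highest coherence, matching number of topics)."""
    # Ties resolve to the first maximum, like np.argmax.
    best_idx = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return best_idx, start + best_idx * step

# Made-up coherence values for models with 3, 4 and 5 topics.
example_idx, example_topics = best_topic_count([-1.9, -1.4, -1.6], start=3, step=1)
print(example_idx, example_topics)
```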

Visualize the topics in terms of their top words (the words with the highest weight in each topic):

In [None]:
n_topics = test_lda.num_topics
col_names = []
for x in range(n_topics):
    topic_name = f"Topic{x+1}"
    col_names.append((topic_name, "Word"))
    col_names.append((topic_name, "Weight"))
     

In [None]:
tw_df = pd.DataFrame()

In [None]:
data = []
for x in range(n_topics):
    top_words = test_lda.show_topic(x, 20)
    words = []
    weights = []
    for word_weight in top_words:
        words.append(word_weight[0])
        weights.append(word_weight[1])
    data.append(words)
    data.append(weights)
    tw_df[f"Word{x+1}"] = words
    tw_df[f"Weight{x+1}"] = weights   

In [None]:
tw_df.columns = pd.MultiIndex.from_tuples(col_names)

In [None]:
tw_df

The details of the topics change a bit between runs due to random initial state and similar factors, so I cannot comment on exact details as they might change on each notebook run. But on a general level the topics seem to describe patient studies, patients in general, viruses, and trials, depending on how many topics we take. It seems that up to about 5 topics, the set provides quite coherent topics that are each identifiable with some "concept".

# Find Top Documents/Articles for Topics

LDA Topic Models represent assignments of words in documents to different topics. Find the documents that are assigned most strongly to each topic, and assume that those documents best represent that topic:

In [None]:
import heapq 

top_docs = {} 
#first create placeholder lists for top 3 docs in each topic 
for t in range(0, n_topics):
    doc_list = [(-1,-1),(-1,-1),(-1,-1)] 
    heapq.heapify(doc_list)
    top_docs[t] = doc_list
#count variable in following is practically doc_id since the index is from 0 with increments of 1
count = 0
for doc in tqdm(corpus):
    topics = test_lda[doc] 
    for topic_prob in topics:
        topic_n = topic_prob[0]
        topic_p = topic_prob[1]
        top_list = top_docs[topic_n]
        #count is document id, heapq sorts by first item in tuple
        heapq.heappushpop(top_list, (topic_p, count))
        #above pushes new item, pops lowest item. so pop itself if lowest..
    count += 1

In [None]:
#print(top_docs)

In [None]:
top_sorted = {}
for topic_id in top_docs:
    heap = top_docs[topic_id]
    sorted_topics = [heapq.heappop(heap) for _ in range(len(heap))] 
    print(str(topic_id)+": "+str(sorted_topics)) 
    top_sorted[topic_id] = sorted_topics
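
The top-3 bookkeeping above is a fixed-size min-heap per topic: the smallest of the kept items sits at the root, so pushing a new item and popping the smallest keeps only the k best. Here is a self-contained sketch of the same technique on made-up topic probabilities. The function name `top_k_per_topic` is hypothetical, and unlike the cell above it starts from empty heaps instead of placeholder tuples and returns each list sorted best-first.

```python
import heapq

def top_k_per_topic(doc_topic_probs, n_topics, k=3):
    """Keep the k highest-probability (prob, doc_id) pairs for every topic."""
    heaps = {t: [] for t in range(n_topics)}
    for doc_id, topics in enumerate(doc_topic_probs):
        for topic_id, prob in topics:
            heap = heaps[topic_id]
            if len(heap) < k:
                heapq.heappush(heap, (prob, doc_id))
            else:
                # Push the new item, then drop whichever item is now the smallest.
                heapq.heappushpop(heap, (prob, doc_id))
    # Sort descending so the strongest document comes first for each topic.
    return {t: sorted(h, reverse=True) for t, h in heaps.items()}

# Made-up per-document topic distributions as (topic_id, probability) pairs.
toy_probs = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6)],
    [(0, 0.7), (1, 0.3)],
]
toy_top = top_k_per_topic(toy_probs, n_topics=2, k=2)
print(toy_top)
```

The heap only ever holds k items per topic, so memory stays small no matter how many documents pass through.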


In [None]:
topic_names = [f"Topic{x+1}" for x in range(n_topics)]
word_names = [f"Word{x+1}" for x in range(3)]

top_doc_weights = []
top_doc_paths = []

pd.options.display.max_colwidth = 100

for topic_id in top_sorted:
    #sort each topic's docs by descending weight so that Doc1 is the strongest match
    ranked = sorted(top_sorted[topic_id], reverse=True)
    doc_ids = [doc_tuple[1] for doc_tuple in ranked]
    doc_weights = [doc_tuple[0] for doc_tuple in ranked]
    topic_docs = [all_docs[doc_id] for doc_id in doc_ids]
    for x in range(3):
        doc = topic_docs[x]
        doc_path = f"{doc.doc_type}/{os.path.basename(doc.filepath_orig)}"
        weight = doc_weights[x]
        top_doc_weights.append(weight)
        top_doc_paths.append(doc_path)
#no reverse of the flat lists here: that would also flip the topic order
#and mislabel the rows once the (Topic, Doc) index is attached


In [None]:
df = pd.DataFrame()
df["weight"] = top_doc_weights
df["path"] = top_doc_paths
df.index = pd.MultiIndex.from_product([topic_names,
                                     ['Doc1', 'Doc2', 'Doc3']],
                                    names=['',''])
df

# Summarize with Transformers

In [None]:
for topic_id in top_sorted:
    print()
    print()
    print(f"----------- TOPIC {topic_id}: -----------")
    top_docs = top_sorted[topic_id]
    doc_ids = [doc_tuple[1] for doc_tuple in top_sorted[topic_id]] 
    doc_weights = [doc_tuple[0] for doc_tuple in top_sorted[topic_id]]
    topic_docs = [all_docs[doc_id] for doc_id in doc_ids]
    for x in range(3):
        print(f"----------- TOPIC {topic_id} / doc {x+1}: -----------")
        doc = topic_docs[x]
        doc.load_abstract()
        doc.load_orig()
        if doc.abstract is not None:
            print(doc.abstract)
        else:
            print(doc.text_orig[:400])


Install the transformer libs:

In [None]:
!pip install --upgrade transformers

In [None]:
from transformers import pipeline

Thanks for the great libs HuggingFacers! Straight out of the HuggingFace examples:

In [None]:
summarizer = pipeline('summarization')
#summarizer(TEXT_TO_SUMMARIZE)

The following summarizes the top 3 documents per topic. These are the same ones whose beginnings I printed above. So perhaps compare the short snippets above (potential abstracts) to the summary the transformer provides:

In [None]:
for topic_id in top_sorted:
    print()
    print()
    print(f"----------- TOPIC {topic_id}: -----------")
    top_docs = top_sorted[topic_id]
    doc_ids = [doc_tuple[1] for doc_tuple in top_sorted[topic_id]] 
    doc_weights = [doc_tuple[0] for doc_tuple in top_sorted[topic_id]]
    topic_docs = [all_docs[doc_id] for doc_id in doc_ids]
    for x in range(3):
        print(f"----------- TOPIC {topic_id} / doc {x+1}: -----------")
        doc = topic_docs[x]
        doc.load_orig()
        weight = doc_weights[x]
        print(f"topic %:{weight}, document:{doc.filepath_orig}")
        #and this is the magic line doing the summary
        print(summarizer(doc.text_orig, min_length=200, max_length=400))

And as another viewpoint, try to summarize the top 3 documents of each topic as one combined text:

In [None]:
for topic_id in top_sorted:
    print()
    print()
    print(f"----------- TOPIC {topic_id}: -----------")
    top_docs = top_sorted[topic_id]
    doc_ids = [doc_tuple[1] for doc_tuple in top_sorted[topic_id]] 
    doc_weights = [doc_tuple[0] for doc_tuple in top_sorted[topic_id]]
    topic_docs = [all_docs[doc_id] for doc_id in doc_ids]
    combined_text = ""
    for x in range(3):
        print(f"----------- TOPIC {topic_id} / doc {x+1}: -----------")
        doc = topic_docs[x]
        doc.load_orig()
        weight = doc_weights[x]
        print(f"topic %:{weight}, document:{doc.filepath_orig}")
        combined_text += " "+doc.text_orig
    #and this is the magic line doing the summary
    print(summarizer(combined_text, min_length=500, max_length=1000))
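
The combined text of three full papers is likely much longer than what a summarization model accepts as input in one go. One possible workaround, sketched below under my own assumptions, is to split the text into word-based chunks, summarize each chunk, and join the partial summaries. The helper names and the 400-word chunk size are made up, the word count is only a rough proxy for the model's token limit, and the summarizer here is a trivial stand-in so the sketch runs without downloading a model.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_text(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in split_into_chunks(text, max_words)]
    return " ".join(partial)

def first_five_words(chunk):
    """Stand-in for a real summarizer: just returns the first five words."""
    return " ".join(chunk.split()[:5])

# Made-up 1000-word text: w0 w1 ... w999.
toy_text = " ".join(f"w{i}" for i in range(1000))
toy_chunks = split_into_chunks(toy_text, max_words=400)
toy_summary = summarize_long_text(toy_text, first_five_words, max_words=400)
print(len(toy_chunks))
print(toy_summary)
```

With the pipeline from this notebook, `summarize_fn` could be something like `lambda chunk: summarizer(chunk)[0]["summary_text"]`, assuming the pipeline returns a list of dicts with a `summary_text` key. I have not tried whether this gives better summaries than passing the full text.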

### Final Thoughts

Well, I thought this was interesting.

Sometimes it looks like there are some issues with how the transformer summarizes longer texts. Most of the time it seems to pick a specific "topic" (no pun intended..:) and just write a piece on that. The output looks great for automated generation, though.

And that's all folks. Not sure how useful it is for actual COVID research but it was an interesting start (for me) on the COVID docs..