# 3a. Topic Modelling
In this notebook I implement an ML model for topic modelling. The abstracts cover a wide range of technical and policy-relevant topics. The goal of topic modelling is to group the abstracts into topics.

In [None]:
import json
import pickle

In [None]:
### Importing top2vec failed with "__init__() got an unexpected keyword argument 'cachedir'". Pinning joblib to an older version (1.1.0) avoids this problem,
### but I should undo this as soon as top2vec works with the latest versions of joblib and hdbscan.
### Run in this order: pip install top2vec, pip install --upgrade joblib==1.1.0, import top2vec
!pip install top2vec
!pip install --upgrade joblib==1.1.0
from top2vec import Top2Vec
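
Because the joblib pin is meant to be temporary, it can help to confirm which version is actually installed before importing. The following is a minimal sketch using only the standard library; the helper name `installed_version` and the constant `PINNED_JOBLIB` are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Assumption: 1.1.0 is the pin used in the install cell above.
PINNED_JOBLIB = "1.1.0"

current = installed_version("joblib")
if current is None:
    print("joblib is not installed")
elif current != PINNED_JOBLIB:
    print(f"joblib {current} is installed, expected {PINNED_JOBLIB}")
else:
    print(f"joblib {current} matches the pin")
```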

## 1. Importing Input

In [45]:
###unpickling
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/1.HTML_Parsing/"
with open(directory+"IAC_raw_data.pickle", "rb") as handle:
  dict_of_info = pickle.load(handle)

## 2. Preparing Data

These are the abstracts:

In [None]:
abstracts_raw = {}
for key, value in dict_of_info.items():
  abstracts_raw[key] = []
  # the inner dict maps each paper_id to that paper's record; only the abstract is needed here
  for paper_id, paper in value.items():
    abstracts_raw[key].append(paper["abstract"].lower())

This dictionary maps each paper_id (key) to its top2vec id (value):

In [46]:
top2vec_paper_id_dict = {}
for key, value in dict_of_info.items():
  for count, paper_id in enumerate(value):
    top2vec_paper_id_dict.update({paper_id: str(key)+"_"+str(count)})
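
Since the dictionary built above is keyed by paper_id, going from a top2vec id back to a paper needs the inverse mapping. The sketch below shows one way to build it on toy data; `toy_info`, the paper ids and the variable names are hypothetical stand-ins for the real `dict_of_info`.

```python
# Toy stand-in for dict_of_info: {year: {paper_id: record}}. All ids are made up.
toy_info = {
    "2018": {"paper_a": {"abstract": "..."}, "paper_b": {"abstract": "..."}},
    "2019": {"paper_c": {"abstract": "..."}},
}

# Same construction as in the cell above: paper_id -> "<year>_<count>".
paper_to_top2vec = {}
for year, papers in toy_info.items():
    for count, paper_id in enumerate(papers):
        paper_to_top2vec[paper_id] = f"{year}_{count}"

# Invert the mapping; this is only safe because every top2vec id is unique.
top2vec_to_paper = {t2v_id: paper_id for paper_id, t2v_id in paper_to_top2vec.items()}
assert len(top2vec_to_paper) == len(paper_to_top2vec)

print(top2vec_to_paper["2019_0"])
```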

Top2Vec requires one input: the documents as a list of strings (here: docs).
In addition, I will provide a custom list of ids for the docs. Each id starts with the year, so that we can later distinguish the docs by year and see how the topics change over the five-year period.

In [None]:
docs = []
ids = []
for key, value in abstracts_raw.items():
  for count, abstract in enumerate(value):
    docs.append(abstract)
    ids.append(str(key)+"_"+str(count))
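
Before training, it is cheap to check that the ids line up with the documents and that the year can be recovered from each id. This is a minimal sketch on toy data; `toy_abstracts` and the helper `year_of` are illustrative assumptions, not names from the notebook.

```python
# Hypothetical stand-in for abstracts_raw: {year: [abstract, ...]}.
toy_abstracts = {
    "2018": ["first abstract", "second abstract"],
    "2019": ["third abstract"],
}

docs, ids = [], []
for year, abstracts in toy_abstracts.items():
    for count, abstract in enumerate(abstracts):
        docs.append(abstract)
        ids.append(f"{year}_{count}")

# One id per document, and no id used twice.
assert len(docs) == len(ids)
assert len(set(ids)) == len(ids)

def year_of(doc_id):
    """Return the year prefix of a '<year>_<count>' id."""
    return doc_id.split("_", 1)[0]

print([year_of(doc_id) for doc_id in ids])
```

Splitting on the first underscore does not rely on the year being exactly four characters long, which slicing with `[:4]` does.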

## 3. Training the Model

In [None]:
### training the model
#model1 = Top2Vec(documents = docs, document_ids = ids, ngram_vocab = True, speed= "deep-learn", verbose=True)
#model1.save("IAC_top2vec_model1")

In [None]:
### loading the model
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/3.Topic_Modeling/"
model1 = Top2Vec.load(directory+"3a.IAC_top2vec_model1")

### loading the topic names from a dict I created
with open(directory+"3ai.topic_names.json") as f:
  topic_names_model1 = json.load(f)
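
One detail worth keeping in mind when using the loaded names: JSON object keys are always strings, so a dict saved with integer topic numbers comes back with string keys. That is why the export cell later looks names up with `str(topic)`. A small sketch with made-up topic names:

```python
import json

# Hypothetical topic names keyed by integer topic number.
topic_names = {0: "example topic A", 1: "example topic B"}

# A JSON round trip turns the integer keys into strings.
restored = json.loads(json.dumps(topic_names))
print(list(restored))

topic = 1
print(restored[str(topic)])  # lookup has to use the string form of the number
```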

## 4. Functions for interpreting the models

This function prints basic information about a model: the number of topics and, for each topic, its number and size (i.e., number of docs):

In [None]:
def basic_info(model):
  topic_sizes, topic_nums = model.get_topic_sizes()
  print(f"This model has {model.get_num_topics()} topics")
  print(3*"_________")
  for topic_size, topic_num in zip(topic_sizes, topic_nums):
    print(f"Topic #{topic_num} is size: {topic_size}")

This function prints the requested number of topics, each with the top words that define it:

In [None]:
def topic_words(model, num_of_topics):
  # get_topics returns the words, their scores and the topic numbers, ordered by decreasing topic size
  words_per_topic, word_scores, topic_nums = model.get_topics(num_of_topics)
  for words, topic_num in zip(words_per_topic, topic_nums):
    print(topic_num)
    print(f"Words:{words}")

This function draws a word cloud for each topic number in the given list:

In [None]:
def word_clouds(model, list_of_topics):
  for element in list_of_topics:
    model.generate_topic_wordcloud(element)

This function prints the documents associated with a topic, together with their ids and scores:

In [None]:
def docs_by_topic(model, topic_num, num_docs):
  documents, document_scores, document_ids = model.search_documents_by_topic(topic_num= topic_num, num_docs=num_docs)
  for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

This function prints the 20 words most similar to a list of keywords, with their similarity scores:

In [None]:
def keyword_search(model, keywords):
  words, word_scores = model.similar_words(keywords=keywords, keywords_neg=[], num_words=20)
  for word, score in zip(words, word_scores):
    print(f"{word} {score}")

This function returns, for each topic, its words, its total number of docs, and the number of docs per year over the five years:

In [None]:
def topics_per_year(model):
  # fetch sizes and words once instead of once per topic
  topic_sizes, topic_nums = model.get_topic_sizes()
  words_per_topic = model.get_topics()[0]
  result = {}
  for topic_size, topic_number in zip(topic_sizes, topic_nums):
    result[topic_number] = {'Words': words_per_topic[topic_number], #words that make up a topic
                            'Total': topic_size, #number of docs in a topic
                            '2018': 0,
                            '2019': 0,
                            '2020': 0,
                            '2021': 0,
                            '2022': 0}

    list_of_doc_ids = model.search_documents_by_topic(topic_num = topic_number, num_docs = topic_size)[2]
    for doc_id in list_of_doc_ids:
      result[topic_number][doc_id[:4]] += 1 #the first four characters of the id are the year

  return result
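
`topics_per_year` hard-codes the years 2018 to 2022, so an id with any other year prefix would raise a `KeyError`. As an illustration only, the per-year counting step can be written with `collections.Counter`, which needs no list of years. The `FakeModel` class below is a hypothetical stand-in that mimics just the one method used here; it is not the Top2Vec class, and the ids are made up.

```python
from collections import Counter

class FakeModel:
    """Hypothetical stand-in exposing only the one method used below."""
    def __init__(self, ids_by_topic):
        self._ids_by_topic = ids_by_topic

    def search_documents_by_topic(self, topic_num, num_docs):
        doc_ids = self._ids_by_topic[topic_num][:num_docs]
        # Mirror the (documents, scores, ids) tuple shape used in this notebook.
        return [""] * len(doc_ids), [0.0] * len(doc_ids), doc_ids

def docs_per_year(model, topic_num, topic_size):
    """Count a topic's documents per year without hard-coding the years."""
    doc_ids = model.search_documents_by_topic(topic_num=topic_num, num_docs=topic_size)[2]
    return Counter(doc_id[:4] for doc_id in doc_ids)

fake = FakeModel({0: ["2018_0", "2018_3", "2020_1"], 1: ["2022_5"]})
counts = docs_per_year(fake, topic_num=0, topic_size=3)
print(dict(counts))
```

A `Counter` returns 0 for years that never occur, so downstream code can still ask for any year without checking first.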

## 5. Exporting

This is a simplified dict with the top2vec id as key and, as value, a nested dict with the topic number and topic name:

In [None]:
output = {}
topic_sizes, topic_nums = model1.get_topic_sizes()
for topic_size, topic in zip(topic_sizes, topic_nums):
  list_of_docs = model1.search_documents_by_topic(topic_num = topic, num_docs = topic_size)[2]
  for doc in list_of_docs:
    output[doc] = {"topic number": topic, "topic name": topic_names_model1[str(topic)]}

In [48]:
### pickling
with open("3a.doc_topic_name_number.pickle", "wb") as f:
  pickle.dump(output, f, protocol = pickle.HIGHEST_PROTOCOL)

### the dictionary that maps paper_id to top2vec id:
with open("3a.top2vec_paper_id_dict.pickle", "wb") as f:
  pickle.dump(top2vec_paper_id_dict, f, protocol = pickle.HIGHEST_PROTOCOL)
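
To show how the two exported dictionaries fit together downstream, here is a minimal sketch that writes a miniature version to a temporary file, reads it back, and joins the two on the top2vec id. All data and variable names in it are hypothetical; the real files are the two pickles written above.

```python
import os
import pickle
import tempfile

# Hypothetical miniature versions of the two exported dictionaries.
doc_topics = {"2018_0": {"topic number": 0, "topic name": "example topic"}}
paper_to_top2vec = {"paper_a": "2018_0"}

# Round trip through a temporary file, using the same protocol as above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc_topics.pickle")
    with open(path, "wb") as f:
        pickle.dump(doc_topics, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Join on the top2vec id to get the topic for each paper_id.
paper_topics = {
    paper_id: restored[t2v_id]
    for paper_id, t2v_id in paper_to_top2vec.items()
    if t2v_id in restored
}
print(paper_topics["paper_a"]["topic name"])
```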

## 6. Conclusion
Now, all abstracts are assigned to one of ca. 100 distinct topics. The topics do not have names yet; they are represented by an index. In 3b, I'll give the topics more meaningful names. In step 4, I'll consolidate the topics and the new features from step 3 (pre-processing), namely organisation type, into one data structure that contains all the relevant information.