About this Notebook:
This notebook uses a topic model, Latent Dirichlet Allocation (LDA). LDA treats each document as a mixture of topics and each topic as a distribution over keywords; for a chosen number of topics, it estimates the topic distribution within each document and the keyword distribution within each topic.

Steps:
1. Preprocess document metadata text using Natural Language Processing (NLP).
    * Filter research articles by surveillance and diagnostics keywords and by COVID-19 tagging.
    * Remove articles with a missing abstract.
    * Validate word count and unique word count.
    * Include bigram and trigram terms (currently commented out in the code).
    * Tokenize, lemmatize, and remove customized stop words.
2. Apply the LDA topic model to the abstracts.
    * Create a data dictionary (a unique id for each word) with Gensim.
    * Convert the abstracts into a bag-of-words corpus of [word_id, word_frequency] pairs.
    * Validate passes, iterations, random state, and alpha.
    * Find the optimal number of topics for LDA (highest coherence value and low perplexity value).
    * Find the dominant topic in each document.
    * Find the most representative document for each topic.
    * Find the topic distribution across documents.
3. Calculate similarity between documents by nearest neighbors, ranked by a distance metric.
4. Visualize topics with pyLDAvis.
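
Step 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the topic distributions are made-up numbers, the helper names `jensen_shannon_distance` and `nearest_documents` are hypothetical, and base-2 logarithms are used so the distance falls between 0 and 1 (SciPy's `jensenshannon`, imported later in this notebook, uses the natural logarithm by default).

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs)."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability entries of a
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def nearest_documents(query, doc_topics, k=2):
    """Return the indices of the k documents whose topic mix is closest to `query`."""
    ranked = sorted(range(len(doc_topics)),
                    key=lambda i: jensen_shannon_distance(query, doc_topics[i]))
    return ranked[:k]

# Hypothetical topic distributions for four documents over three topics.
doc_topics = [
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
]
# Compare the first document against the remaining three.
neighbors = nearest_documents(doc_topics[0], doc_topics[1:], k=1)
print(neighbors)
```

In the notebook itself, the rows of `doc_topics` would come from the fitted LDA model rather than from hand-written values.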


Pros:
1.	Uses research article metadata.
2.	Shows latent relationships between articles.
3.	LDA output lends itself to effective visualization of topics.
4.	Clusters of topics are easy to understand visually.
5.	Has a history of producing reliable results across many domains.
6.	Transferable to new applications.
7.	Guided LDA can nudge LDA topics toward seed words, making the model semi-supervised.

Cons:
1.	The current topic model uses only the abstract, not the full text. The abstract summarizes the research article, while the full text contains its complete context.
2.	The quality of the topic model depends on choosing the number of topics before training. In this notebook we compared several values and chose the number of topics with the highest coherence value. However, the result still varies with random initialization.
3.	The model's coherence value is poor; more tests of iterations and passes are needed in the future.
4.	The Dirichlet prior used by LDA cannot capture correlations among topics.
5.	LDA is unsupervised, so topics are unlabeled and can be hard to interpret.

Topic modeling is an unsupervised statistical machine learning technique that discovers hidden semantic structure in text by assuming that words occurring in the same document are related. It learns topic representations of the papers in a corpus: each document is a probability distribution over latent topics, and each topic is a probability distribution over words. Once fitted, the model yields topic and word distributions over the corpus. The output is then used to identify similar documents within the corpus.
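
The two kinds of distributions described above can be illustrated with a toy generative sketch. It is not how the model is trained; it only shows the assumed story behind LDA (pick a topic from the document's topic mix, then pick a word from that topic). All probabilities, topic names, and the function name `generate_document` are hypothetical.

```python
import random

def generate_document(doc_topic_probs, topic_word_probs, length, rng):
    """Sample words following LDA's generative story: pick a topic, then a word."""
    topics = list(doc_topic_probs)
    words = []
    for _ in range(length):
        # Draw a topic according to the document's topic distribution
        topic = rng.choices(topics, weights=[doc_topic_probs[t] for t in topics])[0]
        # Draw a word according to that topic's word distribution
        vocab = list(topic_word_probs[topic])
        word = rng.choices(vocab, weights=[topic_word_probs[topic][w] for w in vocab])[0]
        words.append(word)
    return words

# Hypothetical topics (distributions over words)
topic_word_probs = {
    "testing": {"assay": 0.5, "swab": 0.3, "sample": 0.2},
    "transmission": {"contact": 0.6, "droplet": 0.4},
}
# Hypothetical document (distribution over topics)
doc_topic_probs = {"testing": 0.7, "transmission": 0.3}

document = generate_document(doc_topic_probs, topic_word_probs,
                             length=8, rng=random.Random(7))
print(document)
```

Fitting LDA runs this story in reverse: given only the words, it infers plausible topic and word distributions.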

Each topic is a collection of dominant keywords. The quality of the topics depends on five main factors:
1. Quality of text preprocessing
2. Variety of topics
3. Topic modeling algorithm choice
4. Number of topics
5. Tuning of algorithm parameters




In [None]:
# !pip install scispacy
!pip install guidedlda
# !pip install langdetect

In [None]:
import covid19_tools as cv19 # library generously released to the public by Andy White (https://www.kaggle.com/ajrwhite/covid19-tools)
import pandas as pd
import re
from IPython.core.display import display, HTML
import html
import numpy as np
import json
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import glob

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# import scispacy
import spacy
# import en_core_sci_lg

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

from scipy.spatial.distance import jensenshannon

import joblib

from IPython.display import HTML, display

from ipywidgets import interact, Layout, HBox, VBox, Box
import ipywidgets as widgets
from IPython.display import clear_output

from tqdm import tqdm
from os.path import isfile

import seaborn as sb
import matplotlib.pyplot as plt
plt.style.use("dark_background")

# **Loading metadata**

In [None]:
METADATA_FILE = '../input/CORD-19-research-challenge/metadata.csv'

# Load metadata
meta = cv19.load_metadata(METADATA_FILE)
# print(meta.shape)
# Add tags
meta, covid19_counts = cv19.add_tag_covid19(meta)

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [None]:
meta.info()

# **Filter Research Articles by Social and Ethical Terms (Keyword search)**

In [None]:
SOC_ETHIC_TERMS = ['exposure',
                   'immediate',
                   'policy recommendations',                  
                   'mitigation',                   
                   'denominators',                   
                   'testing',                   
                   'sharing information',                   
                   'demographics',
                   'asymptomatic disease',
                   'serosurveys',
                   'convalescent samples',
                   'early detection',
                   'screening',
                   'neutralizing antibodies',
                   'ELISAs',
                   'increase capacity',
                   'existing diagnostic',
                   'diagnostic platforms',
                   'existing surveillance',
                   'surveillance platforms',
                   'recruitment',
                   'support',
                   'coordination',
                   'local expertise',
                   'capacity',
                   'public',
                   'private',
                   'commercial',
                   'non-profit',
                   'academic',
                   'legal',
                   'ethical',
                   'communications',
                   'operational issues',
                   'national guidance',
                   'national guidelines',
                   'universities',
                   'communications',
                   'public health officials',
                   'public',
                   'point-of-care test',
                   'rapid influenza test',
                   'rapid bed-side tests',
                   'tradeoffs',
                   'surveillance experiments',
                   'PCR',
                   'special entity',
                   'longitudinal samples',
                   'ad hoc local interventions',
                   'separation of assay development',
                   'migrate assays',
                   'evolution of the virus',
                   'genetic drift',
                   'mutations',
                   'latency issues',
                   'viral load',
                   'detect the pathogen',
                   'biological sampling',
                   'environmental sampling',
                   'host response markers',
                   'cytokines',
                   'detect early disease',
                   'predict severe disease progression',
                   'best clinical practice',
                   'efficacy of therapeutic interventions',
                   'screening and testing',
                   'policies and protocols',
                   'supplies',
                   'mass testing',
                   'swabs',
                   'reagents',
                   'technology roadmap',
                   'barriers to developing',
                   'market forces',
                   'future coalition',
                   'accelerator models',
                   'Coalition for Epidemic Preparedness Innovations',
                   'streamlined regulatory environment',
                   'CRISPR',
                   'holistic approaches',
                   'genomics',
                   'large scale',
                   'rapid sequencing',
                   'bioinformatics',
                   'genome',
                   'unknown pathogens',
                    'naturally-occurring pathogens',
                   'One Health',
                   'future spillover',
                   'hosts',
                   'ongoing exposure',
                   'transmission hosts',
                   'heavily trafficked',
                   'farmed wildlife',
                   'domestic food',
                   'companion species',
                   'environmental',
                   'demographic',
                   'occupational risk factors',
                   'transmiss', 
                   'transmitted',
                    'incubation',
                    'environmental stability',
                    'airborne',
                    'via contact',
                    'human to human',
                    'through droplets',
                    'through secretions',
                    r'\broute',
                    'exportation'
                  ]


In [None]:
meta, soc_ethic_counts = cv19.count_and_tag(meta,
                                               SOC_ETHIC_TERMS,
                                               'soc_ethic')
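
The idea behind keyword tagging can be sketched as below. This is a minimal illustration, not the `covid19_tools` implementation: the helper name `tag_by_terms` is hypothetical, and it assumes the terms are treated as case-insensitive regular expressions (the term list above contains the pattern `\broute`, which suggests regex matching).

```python
import re

def tag_by_terms(texts, terms):
    """Return one boolean per text: True if any term pattern matches."""
    pattern = re.compile("|".join(terms), flags=re.IGNORECASE)
    return [bool(pattern.search(text)) for text in texts]

# Hypothetical terms and texts
terms = ["viral load", r"\broute", "screening"]
texts = [
    "Viral load peaked early.",
    "En route to the clinic.",
    "No matching term here.",
]
tags = tag_by_terms(texts, terms)
print(tags)
```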

In [None]:
print('Loading full text for tag_disease_covid19')
# pulling ~1000 research articles
full_text_repr = cv19.load_full_text(meta[meta.tag_disease_covid19 &
                                          meta.tag_soc_ethic],
                                     '../input/CORD-19-research-challenge')

#pulling ~5000 research articles (picked due to broader search term, which include SARS)
# metadata_filter = meta[meta.tag_soc_ethic == True] 
# full_text_repr = cv19.load_full_text(metadata_filter,
#                                      '../input/CORD-19-research-challenge')

In [None]:
full_text_repr[0]

In [None]:
meta.shape

In [None]:
meta.head()

In [None]:
# meta_rel = meta[meta.tag_disease_covid19 & meta.tag_soc_ethic]
# include only soc and ethic terms
meta_rel = meta[meta.tag_soc_ethic]
meta_rel.shape

In [None]:
# (~(meta_rel['abstract'].isna()))

In [None]:
meta_rel['abstract'].isna().sum()

# **Remove missing abstracts**

In [None]:
meta_rel = meta_rel[(~(meta_rel['abstract'].isna()))]

In [None]:
meta_rel.shape

In [None]:
# meta_rel['tag_soc_ethic']== False

In [None]:
metadata_filter = meta[meta.tag_soc_ethic == True] 

In [None]:
# remove unrelated articles (rows that did not match the keyword filter)
# note: drop(..., inplace=True) returns None, so filter and reassign instead
meta_rel = meta_rel[meta_rel['tag_soc_ethic']].copy()

In [None]:
meta_rel.shape

# **Added Abstract word count**

In [None]:
meta_rel['abstract_word_count'] = meta_rel['abstract'].apply(lambda x: len(x.strip().split()))  # word count in abstract
meta_rel['abstract_unique_words'] = meta_rel['abstract'].apply(lambda x: len(set(str(x).split())))  # number of unique words in abstract
meta_rel.head()

# **Adding abstract and fulltext word count**

In [None]:
# meta_rel['abstract_word_count'] = meta_rel['abstract'].apply(lambda x: len(x.strip().split()))  # word count in abstract
# meta_rel['body_word_count'] = meta_rel['body_text'].apply(lambda x: len(x.strip().split()))  # word count in body
# meta_rel['body_unique_words']= meta_rel['body_text'].apply(lambda x:len(set(str(x).split())))  # number of unique words in body
# meta_rel.head()

# **(Removed, to be included in the future) Include n-gram SOC terms**

In [None]:
# two_terms = ['shelter in place','bed shortage','public health','public interest','human rights','digital rights','face mask','fake news','civil society',
# 'medical treatment','community containment','mental health','suicide hotline','gig worker','medical worker','vulnerable population',
# 'vulnerable community','social distancing','contact tracing','stay at home']

In [None]:
# def replace_space(x):
#     x.replace(" ", "_")
#     print (x)

In [None]:
# replace_space(two_terms)

In [None]:
# from nltk.tree import *

# # Tree manipulation

# # Extract phrases from a parsed (chunked) tree
# # Phrase = tag for the string phrase (sub-tree) to extract
# # Returns: List of deep copies;  Recursive
# def ExtractPhrases( myTree, phrase):
#     myPhrases = []
#     if (myTree.node == phrase):
#         myPhrases.append( myTree.copy(True) )
#     for child in myTree:
#         if (type(child) is Tree):
#             list_of_phrases = ExtractPhrases(child, phrase)
#             if (len(list_of_phrases) > 0):
#                 myPhrases.extend(list_of_phrases)
#     return myPhrases

# test = Tree.parse('(S (NP I) (VP (V enjoyed) (NP my cookies)))')
# print ("Input tree: ", test)

# print ("\nNoun phrases:")
# list_of_noun_phrases = ExtractPhrases(test, 'NP')
# for phrase in list_of_noun_phrases:
#     print (" ", phrase)

'''# **Bigram and Trigram models**
Multiple words occurring together'''
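
A phrase detector of this kind can be sketched without Gensim. The scoring below follows the count-based formula we understand Gensim's `Phrases` to use by default, but treat that as an assumption; the function name `find_bigrams` and the toy sentences are hypothetical.

```python
from collections import Counter

def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those scoring above the threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs that co-occur more often than their parts suggest score higher
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Hypothetical tokenized sentences
sentences = [
    ["viral", "load", "was", "high"],
    ["viral", "load", "declined"],
    ["high", "viral", "load", "observed"],
    ["samples", "was", "observed"],
]
phrases = find_bigrams(sentences, min_count=1, threshold=1.0)
print(sorted(phrases))
```

A higher `threshold` keeps fewer phrases, which matches the comment in the commented-out Gensim code below.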

In [None]:
# def sent_to_words(sentences):
#     for sentence in sentences:
#         yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

# abstract_gram = list(sent_to_words(meta_rel['abstract']))

# print(abstract_gram[:1])

In [None]:
# # Build the bigram and trigram models
# bigram = gensim.models.Phrases(abstract_gram, min_count=5, threshold=100) # higher threshold fewer phrases.
# trigram = gensim.models.Phrases(bigram[abstract_gram], threshold=100)  

# # Faster way to get a sentence clubbed as a trigram/bigram
# bigram_mod = gensim.models.phrases.Phraser(bigram)
# trigram_mod = gensim.models.phrases.Phraser(trigram)

# # bigram and trigram example
# # print(bigram_mod[abstract_gram[0]])
# # print(trigram_mod[bigram_mod[abstract_gram[0]]])

In [None]:
# abstract_gram

In [None]:
# print(bigram_mod[abstract_gram[1]])
# # print(trigram_mod[bigram_mod[abstract_gram[0]]])

# **Text Cleaning: Tokenization**

In [None]:
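# Tokenizer only: the blank English pipeline splits text into tokens without loading a trained model
# (the unused spacy.load('en') call was dropped; the 'en' shortcut is not available in spaCy 3)
import spacy
from spacy.lang.en import English
parser = English()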
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

# **Reduce words to their base form (Lemmatization)**

In [None]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
    
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

# **Stopword Removal**

In [None]:
nltk.download('stopwords')
# en_stop = set(nltk.corpus.stopwords.words('english'))

# **Adding new stopwords**

In [None]:
en_stop = nltk.corpus.stopwords.words('english')
# add new stopwords here (lowercase only, because tokenize() lowercases every token)
en_stop.extend(['abstract', 'doi', 'preprint', 'copyright','https', 'et', 'al','figure','fig', 'fig.', 
                'al.','pmc', 'czi','peer', 'reviewed', 'org','author','rights', 'reserved', 'permission', 
                'used', 'using', 'biorxiv', 'medrxiv', 'license','elsevier','www'])

en_stop = set(en_stop)

# **Define a function to prepare the text for topic modelling**

In [None]:
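def prepare_text_for_lda(text):
    """Tokenize a text, drop short tokens and stop words, then lemmatize."""
    tokens = tokenize(text)
    # keep tokens longer than 4 characters (this also drops short terms such as 'pcr' and 'sars')
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens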
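
To see what the length and stop-word filters do, here is a small standalone sketch. The stop-word set is a hypothetical, much shorter stand-in for `en_stop`, and `filter_tokens` is an illustrative helper rather than part of the pipeline.

```python
# Hypothetical stop words, far fewer than the notebook's list
STOP_WORDS = {"using", "abstract"}

def filter_tokens(tokens, min_length=5, stop_words=STOP_WORDS):
    """Lowercase tokens, then drop short tokens and stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if len(t) >= min_length and t not in stop_words]

tokens = ["Using", "PCR", "tests", "SARS", "screening", "abstract"]
kept = filter_tokens(tokens)
print(kept)
```

Note that a minimum length of five characters removes short but meaningful domain terms such as "pcr" and "sars", which may be worth revisiting.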

# **Add (abstract) token into list**

In [None]:
meta_rel['tokens'] = meta_rel.apply(lambda x: prepare_text_for_lda(x['abstract']),axis=1)
text_data = list(meta_rel['tokens'])

# **Abstract word count**
The average abstract has 222 words.
The average abstract has 140 unique words.

In [None]:
import seaborn as sns
sns.distplot(meta_rel['abstract_word_count'])
meta_rel['abstract_word_count'].describe()

In [None]:
sns.distplot(meta_rel['abstract_unique_words'])
meta_rel['abstract_unique_words'].describe()

In [None]:
# matrix/ nested list
# text_data

In [None]:
'''Topic model included non SOE term'''
# meta_rel['tag_soc_ethic']== False

In [None]:
len(text_data[1])

#  **Latent Dirichlet Allocation (LDA)**
1. Create a data dictionary (Gensim assigns a unique id to each word).
2. Convert the documents into a bag-of-words corpus: a list of [word_id, word_frequency] pairs per document.
3. Save the dictionary and corpus.
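
What the dictionary and bag-of-words steps produce can be sketched in plain Python. This is an illustration only: `build_dictionary` and `doc_to_bow` are hypothetical helpers, and the ids they assign may differ from the ones Gensim assigns.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(tokens, token2id):
    """Convert one tokenized document into sorted (word_id, word_frequency) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical tokenized abstracts
texts = [["virus", "testing", "virus"], ["testing", "capacity"]]
token2id = build_dictionary(texts)
bow_corpus = [doc_to_bow(tokens, token2id) for tokens in texts]
print(bow_corpus)
```

Each document becomes a sparse list of counts, which is the input format the LDA model below expects.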

In [None]:
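from gensim import corpora
# data dictionary: maps each unique token to an integer id
dictionary = corpora.Dictionary(text_data)
# bag-of-words corpus: one list of (word_id, word_frequency) pairs per document
corpus = [dictionary.doc2bow(text) for text in text_data]

import pickle
with open('corpus.pkl', 'wb') as corpus_file:
    pickle.dump(corpus, corpus_file)
dictionary.save('dictionary.gensim')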

In [None]:
# #readable format of corpus (word, frequency)
# [[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]

# **LDA model with 8 topics**

In [None]:
'''
passes (int, optional) – Number of passes through the corpus during training.
iterations (int, optional) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
Optimal number of topics: 8
'''
# 8 topics (the saved file keeps its original name, model10.gensim)
import gensim
NUM_TOPICS = 8
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
#                                            num_topics = NUM_TOPICS, 
#                                            id2word=dictionary, 
#                                            update_every=1,
#                                            passes=25, 
#                                            random_state=7,
#                                            alpha='auto') # TO-Do: Find optimal # topics, iterations, and passes

# ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
#                                            num_topics = NUM_TOPICS, 
#                                            id2word=dictionary, 
#                                            update_every=1,
#                                            passes=100, 
#                                            iterations =100,
#                                            random_state=7,
#                                            alpha='auto') # TO-Do: Find optimal # topics, iterations, and passes

ldamodel = gensim.models.ldamodel.LdaModel(corpus, 
                                           num_topics = NUM_TOPICS, 
                                           id2word=dictionary, 
                                           update_every=1,
                                           passes=25, 
                                           iterations =50,
                                           random_state=7,
                                           alpha='auto') # TO-Do: Find optimal # topics, iterations, and passes


ldamodel.save('model10.gensim')

topics = ldamodel.print_topics(num_words=10)
for topic in topics:
    print(topic)

# **Finding optimal number of topics for LDA (highest coherence)**
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Increment between candidate numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics


In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # train one model per candidate topic count (use the loop variable, not the global NUM_TOPICS)
        model = gensim.models.ldamodel.LdaModel(corpus,
                                                num_topics=num_topics,
                                                id2word=dictionary,
                                                random_state=7)  # fixed seed so runs are comparable
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=corpus, texts=text_data, start=2, limit=40, step=6)

In [None]:
# Show graph
'''    
Pick model (num topics) with the highest coherence score before flattening out.
'''
limit = 40; start = 2; step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# pass a list: a bare string would be split into one legend label per character
plt.legend(["coherence_values"], loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
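
One simple way to turn the printed scores into a choice is to take the candidate with the highest coherence. The sketch below assumes that rule; the scores are made-up values, and `pick_num_topics` is a hypothetical helper. It does not implement the "before flattening out" judgment mentioned above, which still needs a look at the plot.

```python
def pick_num_topics(candidate_counts, coherence_scores):
    """Return (number of topics, score) for the highest coherence score."""
    best_index = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
    return candidate_counts[best_index], coherence_scores[best_index]

# Same candidate grid as above: 2, 8, 14, ..., 38
candidate_counts = list(range(2, 40, 6))
# Hypothetical coherence scores, one per candidate
example_scores = [0.31, 0.42, 0.40, 0.38, 0.37, 0.35, 0.36]

best_k, best_score = pick_num_topics(candidate_counts, example_scores)
print(best_k, best_score)
```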

In [None]:
# Select the model and print the topics
optimal_model = model_list[2]
model_topics = optimal_model.show_topics(formatted=False)
print(optimal_model.print_topics(num_words=10))

# **Dominant topic in each document**

In [None]:
def format_topics_sentences(ldamodel, corpus, texts):
    # Collect one output row per document
    rows = []

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        topic_num, prop_topic = row[0]  # first entry after sorting => dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=ldamodel, corpus=corpus, texts=text_data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)
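
Stripped of the DataFrame handling, the dominant-topic logic reduces to taking the highest-probability topic per document and then counting how often each topic wins. A minimal sketch with hypothetical (topic_id, probability) pairs:

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the largest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical per-document topic distributions
corpus_topics = [
    [(0, 0.12), (3, 0.61), (5, 0.27)],
    [(0, 0.70), (3, 0.30)],
    [(3, 0.55), (5, 0.45)],
    [(5, 0.90), (0, 0.10)],
]

dominant = [dominant_topic(doc)[0] for doc in corpus_topics]
doc_counts = Counter(dominant)
# Share of documents for which each topic is dominant
doc_shares = {topic: n / len(dominant) for topic, n in doc_counts.items()}
print(dominant, doc_shares)
```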

# **Most representative document for each topic**

In [None]:
# Keep the most representative document (highest contribution) under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet

# **Topic distribution across documents**

In [None]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords
topic_num_keywords = sent_topics_sorteddf_mallet[['Topic_Num', 'Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics

In [None]:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
with open('corpus.pkl', 'rb') as corpus_file:
    corpus = pickle.load(corpus_file)
lda = gensim.models.ldamodel.LdaModel.load('model10.gensim')

# lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
# lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
# pyLDAvis.display(lda_display10)

In [None]:
'''Match research paper to one of the LDA topics'''
# #print topic cluster of research articles.
# for i in ldamodel.get_document_topics(corpus)[:]:
#     li = []
#     for j in i:
#         li.append(j[1])
#         bz=li.index(max(li))
#     print(i[bz][0])

# **Compute Model Perplexity and Coherence Score**

In [None]:
# Perplexity and topic coherence are two measures of topic model quality.
# log_perplexity returns the per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better.
per_word_bound = ldamodel.log_perplexity(corpus)
print('\nPer-word likelihood bound: ', per_word_bound)
print('Perplexity: ', 2 ** (-per_word_bound))

# Compute Coherence Score (a high value is better)
coherence_model_lda = CoherenceModel(model=ldamodel, texts=text_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
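
Gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; Gensim's own log output reports the perplexity estimate as 2 raised to the negative bound. The conversion is sketched below with a made-up bound (`perplexity_from_bound` is a hypothetical helper).

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

example_bound = -8.0  # hypothetical value; real bounds come from ldamodel.log_perplexity(corpus)
print(perplexity_from_bound(example_bound))  # 256.0
```

A bound closer to zero therefore means a lower perplexity, which is the direction we want.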

# **pyLDAvis: visualize an LDA topic model fitted to a corpus of text data**
* Saliency: a measure of how informative a term is about the topics (salient keywords distinguish the selected topics).
* Relevance: a weighted average of a term's log probability within a topic and its log lift (the term's probability within the topic divided by its overall probability in the corpus).
* Bubbles: each bubble represents a topic. The larger the bubble, the more prevalent the topic is in the data.

Things to look for: 
* Each bubble represents a topic. The size of the bubble represents the prevalence of the topic.
* A good topic model will have big, non-overlapping bubbles scattered throughout the chart instead of clustered in one quadrant. 
* Each topic is filled with salient keywords. 
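
The relevance ranking used by pyLDAvis can be sketched as below, following the formula from the LDAvis paper as we understand it: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The probabilities are hypothetical and `relevance` is an illustrative helper, not a pyLDAvis function.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """Weighted average of a term's log probability in a topic and its log lift."""
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1 - lam) * log_lift

# Hypothetical terms: (probability within the topic, overall probability in the corpus)
candidates = {
    "patient": (0.050, 0.040),      # frequent everywhere
    "serosurvey": (0.020, 0.002),   # rarer, but concentrated in this topic
}

# A low lambda favors terms that are distinctive for the topic
ranked_low_lambda = sorted(candidates, key=lambda w: relevance(*candidates[w], lam=0.3), reverse=True)
# lambda = 1 ranks purely by probability within the topic
ranked_by_prob = sorted(candidates, key=lambda w: relevance(*candidates[w], lam=1.0), reverse=True)
print(ranked_low_lambda, ranked_by_prob)
```

This is what the λ slider in the pyLDAvis panel changes: the same topic shows different top terms depending on how much weight the lift receives.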

In [None]:
# import pyLDAvis.gensim
#visual graph
# lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
# # lda matching by words vs matching by documents. Document may contains multiple topics and words. 
# pyLDAvis.display(lda_display)

# **Visualizing topics**

In [None]:
import pyLDAvis.gensim  # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
# visual graph
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
# LDA matches by words rather than by documents; a document may contain multiple topics and words.
pyLDAvis.display(lda_display10)

# **(To revisit: remove the covid-19 tag to enlarge the dataset) Same process with full texts**

In [None]:
# for item in full_text_repr[0]['body_text']:
#     print(item['text'])

In [None]:
# text_data_full = []
# for i, record in enumerate(full_text_repr):
#     record_text = "".join([item['text'] for item in record['body_text']])
#     tokens = prepare_text_for_lda(record_text)
#     text_data_full.append(tokens)

In [None]:
# text_data_full

In [None]:
# len(text_data_full[3])

In [None]:
# from gensim import corpora
# dictionary_full = corpora.Dictionary(text_data_full)
# corpus_full = [dictionary_full.doc2bow(text) for text in text_data_full]
# import pickle
# pickle.dump(corpus_full, open('corpus_fulltexts.pkl', 'wb'))
# dictionary_full.save('dictionary_fulltexts.gensim')

In [None]:
# # 5 topics
# import gensim
# NUM_TOPICS = 5
# ldamodel_full = gensim.models.ldamodel.LdaModel(corpus_full, num_topics = NUM_TOPICS, id2word=dictionary_full, passes=15)
# ldamodel_full.save('model10_fulltexts.gensim')
# topics = ldamodel_full.print_topics(num_words=25)
# for topic in topics:
#     print(topic)

In [None]:
# #10 topics
# import gensim
# NUM_TOPICS = 10
# ldamodel_full = gensim.models.ldamodel.LdaModel(corpus_full, num_topics = NUM_TOPICS, id2word=dictionary_full, passes=15)
# ldamodel_full.save('model10_fulltexts.gensim')
# topics = ldamodel_full.print_topics(num_words=10)
# for topic in topics:
#     print(topic)

In [None]:
# dictionary_full = gensim.corpora.Dictionary.load('dictionary_fulltexts.gensim')
# corpus_full = pickle.load(open('corpus_fulltexts.pkl', 'rb'))
# lda_full = gensim.models.ldamodel.LdaModel.load('model10_fulltexts.gensim')
# import pyLDAvis.gensim

# **Visualize full text**

In [None]:
# #visual
# lda_display_full = pyLDAvis.gensim.prepare(lda_full, corpus_full, dictionary_full, sort_topics=False)

In [None]:
# #visual
# pyLDAvis.display(lda_display_full)

# **(Removed) Model evaluation with log likelihoods**
The loglikelihoods_ attribute of a fitted model can be used to monitor convergence. The attribute is bound to a list that records the sequence of log likelihoods associated with the model at different iterations. 

Documentation: https://lda.readthedocs.io/en/latest/getting_started.html

# **(Removed, to be included in the future) Guided LDA**
GuidedLDA is used to separate topics that have a smaller representation in the corpus and to guide the classification of documents. 

X: the document-term matrix (represents the corpus)
vocab: the vocabulary, i.e. the words represented in the corpus
word2id: a dictionary mapping each word to its id

Seed topic list: "forces" the topic model's word output into defined topics (categories)

https://github.com/vi3k6i5/guidedlda 
https://www.freecodecamp.org/news/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164/ 
https://medium.com/analytics-vidhya/how-i-tackled-a-real-world-problem-with-guidedlda-55ee803a6f0d
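
How a seed topic list is turned into the word-id mapping that guided LDA expects can be sketched without the library. The vocabulary and seed words are hypothetical, and `build_seed_topics` is an illustrative helper; seed words missing from the vocabulary are simply skipped.

```python
def build_seed_topics(seed_topic_list, word2id):
    """Map the id of each known seed word to the index of its seed topic."""
    seed_topics = {}
    for topic_id, seed_words in enumerate(seed_topic_list):
        for word in seed_words:
            if word in word2id:  # ignore seed words that never occur in the corpus
                seed_topics[word2id[word]] = topic_id
    return seed_topics

# Hypothetical vocabulary and seed words
vocab = ["testing", "assay", "quarantine", "lockdown", "patient"]
word2id = {word: idx for idx, word in enumerate(vocab)}
seed_topic_list = [
    ["testing", "assay", "swab"],   # "swab" is not in the vocabulary
    ["quarantine", "lockdown"],
]

seed_topics = build_seed_topics(seed_topic_list, word2id)
print(seed_topics)
```

The resulting dictionary is the kind of object passed as `seed_topics` in the commented-out `model.fit(...)` calls below.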

In [None]:
import guidedlda 

In [None]:
# seed_topic_list =[
# ['disinformation','misinformation','news','tweet','media','censorship','war','viral','anti-asian','fake'],
# ['police','law','enforcement','liberty','self-determination','force','politics','restriction','freedom','detention','lockdown'],
# ['well-being','isolation','psychological','mental','health','vulnerable','elderly','wellness','trauma','suicide','hotline'],
# ['privacy','surveillance','digital','human','rights','declaration','censorship','self-determination','democracy','discrimination','civil','society'],
# ['economics','economy','gig','low-income','worker','curve','recession','business','jobs','loss'],
# ['healthcare','nurse','doctor','front-line','seniors','caregiver','medical'],
# ['policy','shelter-in-place','GDPR','distancing','contain','containment','suppress','suppression','quarantine','closure','replication','reprecussion','capacity','lockdown']    
# ]     

# seed_topic_list =[     
# ["severe", "symptom", "clinical", "disease", "study", "result", "case", "cov-2", "coronavirus", "covid-19", "cov-2"],
# ["study", "viral", "control", "method", "intervention"],
# ["intervention", "social", "method" ],
# ["china", "wuhan", "country", "hubei", "province", "health", "cov-2", "coronavirus", "covid-19", "cov-2"],
# ["Patient", "patient", "transmission", "epidemic", "measure", "respiratory"],
# ["emergency", "Patient", "patient", "cov-2", "coronavirus", "covid-19", "cov-2", "public", "medical", "outbreak", "case", "number", "estimate", "infection", "health", "disease", "virus", " treatment"],
# ["group", "china", "wuhan", "country", "hubei", "quarantine", "using", "disease"],
# ]

In [None]:
# '''Make our own dataset and word2id'''

# # print(X.shape)
# # print(corpus[100])
# word2id = {}
# vocab = []
# index = 0
# for tx in text_data:
#     for word in tx:
#         if word not in word2id:
#             vocab.append(word)
#             word2id[word] = index
#             index += 1

# print(len(word2id))

# ## transfer corpus to word_ids sentences
# corpus_with_id = []
# max_len = max([len(x) for x in corpus])
# for line in corpus:
#     doc = []
#     for word, fre in line:
#         doc.append(word)
#     doc += [0 for _ in range(max_len - len(doc))]
#     corpus_with_id.append(doc)

# import numpy
# corpus_with_id = numpy.array(corpus_with_id)
# print(corpus_with_id.shape)

# # seed_topics = {}
# # for t_id, st in enumerate(seed_topic_list):
# #     for word in st:
# #         print(word)

In [None]:
# '''Check seed topics for seed_topic_list'''
# seed_topics = {}
# for t_id, st in enumerate(seed_topic_list):
#     for word in st:
#         if word in word2id:
#             seed_topics[word2id[word]] = t_id


In [None]:
# '''model training'''
# model = guidedlda.GuidedLDA(n_topics=7, n_iter=100, random_state=7, refresh=20)
# model.fit(corpus_with_id, seed_topics=seed_topics, seed_confidence=0.15)

In [None]:
# '''Get guidedLDA output'''
# n_top_words = 10
# topic_word = model.topic_word_
# for i, topic_dist in enumerate(topic_word):
#     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
#     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

In [None]:
# '''Retreive the document-topic distributions'''
# doc_topic = model.transform(corpus_with_id)
# for i in range(7):
#     print("top topic: {} Document: {}".format(doc_topic[i].argmax(),
#                                                   ', '.join(np.array(vocab)[list(reversed(corpus_with_id[i,:].argsort()))[0:5]])))


In [None]:
# with open('guidedlda_model.pickle', 'wb') as file_handle:
#      pickle.dump(model, file_handle)
# # load the model for prediction
# # with open('guidedlda_model.pickle', 'rb') as file_handle:
# #      model = pickle.load(file_handle)

In [None]:
# X = guidedlda.datasets.load_data(guidedlda.datasets.NYT) # need to update to main list
# vocab = guidedlda.datasets.load_vocab(guidedlda.datasets.NYT)
# word2id = dict((v, idx) for idx, v in enumerate(vocab))

In [None]:
# model = guidedlda.GuidedLDA(n_topics=7, n_iter=100, random_state=7, refresh=20)
# seed_topics = {}
# for t_id, st in enumerate(seed_topic_list):
#     for word in st:
#         seed_topics[word2id[word]] = t_id

In [None]:
# model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

In [None]:
# n_top_words = 10
# topic_word = model.topic_word_
# for i, topic_dist in enumerate(topic_word):
#     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
#     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

In [None]:
# doc_topic = model.transform(X)
# for i in range(9):
#     print("top topic: {} Document: {}".format(doc_topic[i].argmax(),
#                                                   ', '.join(np.array(vocab)[list(reversed(X[i,:].argsort()))[0:5]])))


In [None]:
# model.purge_extra_matrices()

In [None]:
# from six.moves import cPickle as pickle
# with open('guidedlda_model.pickle', 'wb') as file_handle:
#      pickle.dump(model, file_handle)
# # load the model for prediction
# with open('guidedlda_model.pickle', 'rb') as file_handle:
#      model = pickle.load(file_handle)
# doc_topic = model.transform(X)