<h1> Dynamic Topic Modelling Using NMF (see Greene) </h1>

In [12]:
#!python -m pip install -U prettytable

In [25]:
import pandas as pd
import numpy
import pickle
import time
import joblib

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
# Build a TF-IDF document-term matrix from the pre-tokenized abstracts
def preprocess(df, stopwords):
    # Join each abstract's tokens back into a single space-separated string
    text = [' '.join(abstract) for abstract in df['final_tokens']]

    # Create the document-term matrix. Use the stopwords argument rather than a global.
    # max_df=1.0 and min_df=1 apply no document-frequency filtering.
    tfidf_vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, lowercase=False, stop_words=stopwords)
    tf_idf = tfidf_vectorizer.fit_transform(text)

    # Store the vocabulary as a list in which terms[i] is the term for column i
    v = tfidf_vectorizer.vocabulary_
    terms = [""] * len(v)
    for term, index in v.items():
        terms[index] = term

    return (tf_idf, terms)
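
The last step of `preprocess` turns the vectorizer's term-to-column dictionary into a list ordered by column index. A minimal sketch of that inversion, using a made-up three-term vocabulary (the dictionary below is an assumption for illustration, not data from the corpus):

```python
# Hypothetical vocabulary map in the shape of TfidfVectorizer.vocabulary_:
# each term maps to its column index in the TF-IDF matrix.
vocab = {"virus": 2, "host": 0, "rna": 1}

# Invert it into a list where position i holds the term for column i.
terms = [""] * len(vocab)
for term, col in vocab.items():
    terms[col] = term

print(terms)  # ['host', 'rna', 'virus']
```

With the list in this form, `terms[j]` names the term behind column `j` of the matrix, which is what the topic display needs later.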

In [15]:
# Load the dataset. Apply the DTM to the coronavirus corpus (1000 abstracts)
with open('/project/biocomplexity/sdad/projects_data/ncses/prd/Tech-Report/case_studies/coronavirus_corpus.pkl', 'rb') as f:
    df = pickle.load(f)
df.head()

Unnamed: 0,index,original index,PROJECT_ID,ABSTRACT,FY,PROJECT_TERMS,PROJECT_TITLE,DEPARTMENT,AGENCY,IC_CENTER,...,ORGANIZATION_COUNTRY,BUDGET_START_DATE,BUDGET_END_DATE,CFDA_CODE,FY.y,FY_TOTAL_COST,FY_TOTAL_COST_SUB_PROJECTS,ORG_COUNT,PI_COUNT,final_tokens
5122,5122,5608,95622,Host - virus interactions are greatly affected...,2008,Affect; Antiviral Agents; Area; Biochemical; D...,THE ROLE OF A HOST ION PUMP IN RNA VIRUS RECOM...,NSF,NSF,,...,UNITED STATES,,,47.074,2008,612000.0,,1,1,"[host, virus, interaction, greatly, affect, ex..."
13938,13938,15726,152221,AbstractThe American Society for Virology (ASV...,2008,abstracting; Acquired Immunodeficiency Syndrom...,"INTERNATIONAL CONGRESS OF VIROLOGY, ISTANBUL, ...",HHS,NIH,NIAID,...,UNITED STATES,2/1/2008,1/31/2009,93.855,2008,30000.0,,1,1,"[american, society, virology_asv, wish, provid..."
14418,14418,17118,153617,DESCRIPTION (provided by applicant): Severe a...,2008,Abbreviations; Address; Alleles; Alphavirus; A...,SARS REVERSE GENETICS,HHS,NIH,NIAID,...,UNITED STATES,2/1/2008,1/31/2010,93.855,2008,271608.0,,1,1,"[severe, acute, respiratory, syndrome, life_th..."
14514,14514,17384,153888,DESCRIPTION (provided by applicant): Human cor...,2008,Acute; Age-Years; Antigens; Archives; base; Bi...,EPIDEMIOLOGY AND CLINICAL FEATURES OF HUMAN CO...,HHS,NIH,NIAID,...,UNITED STATES,2/1/2008,1/31/2010,93.855,2008,75292.0,,1,1,"[human, coronaviruse_hcov, initially, identify..."
14565,14565,17526,154033,"DESCRIPTION: Plus-stranded RNA viruses, which ...",2008,Affect; Agriculture; Animals; Antiviral Agents...,ROLES OF HOST RNA BINDING PROTEINS IN VIRUS RE...,HHS,NIH,NIAID,...,UNITED STATES,2/1/2008,1/31/2010,93.855,2008,200124.0,,1,1,"[strand, rna, virus, pose, significant, risk, ..."


<h2> Pre-processing </h2>

In [16]:
# Transform the dataset as in Greene: create one data sample per fiscal year
year_list = df['FY'].unique()

for year in year_list:
    df_subset = df[df['FY']==year]
    
    # Save the yearly subset as a pickle file, closing the file handle afterwards
    with open('/project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/data/Coronavirus_'+str(year)+'.pkl','wb') as f:
        pickle.dump(df_subset, f)

In [17]:
# Create the TF-IDF document-term matrix for each yearly pkl file
# Note: 'use' is not listed because max_df was expected to remove it,
# but max_df=1.0 in preprocess() filters nothing.
stop_wds = ['research', 'study', 'project']

for year in year_list:
    # Load the sample for a given year (df_year avoids overwriting the full dataset df)
    with open('/project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/data/Coronavirus_'+str(year)+'.pkl', 'rb') as f:
        df_year = pickle.load(f)
            
    # Pre-process the yearly sample
    (tf_idf,terms) = preprocess(df_year, stop_wds)
    
    # Save the document-term matrix, its terms, and the yearly data frame
    joblib.dump((tf_idf,terms,df_year), '/project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/Term_docs_'+str(year)+'.pkl' )

<h2> Window Topic Modeling </h2>

In [26]:
# Create the window topics by applying NMF to each yearly TF-IDF pkl file (k=30 topics per window)
!python find-window-topics.py /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/*.pkl -k 30,30 -o /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out

Using random seed 1000
- Processing time window matrix for 'Term_docs_2008' from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/Term_docs_2008.pkl ...
Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
Read 103x3818 document-term matrix
Generating models in range [30,30] ...
Applying window topic modeling to matrix for k=30 topics ...
Generated 103x30 factor W and 30x3818 factor H
Writing results to /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl
Saving NMF results to /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl
- Processing time window matrix for 'Term_docs_2009' from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/Term_docs_2009.pkl ...
Read 111x3914 document-term matrix
Generating models in ra
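
The log above reports a 103x3818 document-term matrix factored into a 103x30 factor W and a 30x3818 factor H. The same kind of factorization can be sketched on a toy corpus with scikit-learn's `NMF`. The documents, `k = 3`, and the `nndsvd` initialization below are illustrative assumptions, not the settings used inside `find-window-topics.py`.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): two documents per rough theme.
docs = [
    "host virus interaction rna replication",
    "rna virus replication host protein",
    "vaccine antibody immune response",
    "immune response vaccine trial",
    "influenza pandemic virus strain",
    "pandemic influenza surveillance strain",
]

vectorizer = TfidfVectorizer(lowercase=False)
A = vectorizer.fit_transform(docs)  # documents x terms

# Terms ordered by column index, as in preprocess()
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

k = 3  # number of topics (assumed for this toy example)
model = NMF(n_components=k, init="nndsvd", random_state=1000, max_iter=500)
W = model.fit_transform(A)  # documents x topics
H = model.components_       # topics x terms

# Print the three highest-weighted terms of each topic
for i, topic in enumerate(H):
    top = np.argsort(topic)[::-1][:3]
    print(f"topic {i + 1}:", [terms[j] for j in top])
```

Each row of `H` is one window topic expressed as non-negative term weights, and each row of `W` gives a document's weights over those topics.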

In [27]:
# Display the top terms of the 2008 window topics
!python display-topics.py /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl 

Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
- Loaded model with 30 topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl
Top 10 terms for 30 topics:
+------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Rank | Term_docs_2008_01 | Term_docs_2008_02 | Term_docs_2008_03 | Term_docs_2008_04 | Term_docs_2008_05 | Term_docs_2008_06 | Term_docs_2008_07 | Term_docs_2008_08 |
+------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|    1 | host              | influenza         | sars_cov          | inhibitor         | vaccine           | ebola             | 1918              | ace2              |
|    2 | r

<h2> Dynamic Topic Modeling </h2>

In [28]:
# Fit the dynamic topic model (DTM) from the window topic models of all years
!python find-dynamic-topics.py /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2009_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2010_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2011_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2012_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2013_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2014_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2015_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2016_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2017_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2018_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2019_windowtopics_k30.pkl -k 30,30 -o /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out

Using random seed 1000
- Processing individual time window topic models ...
Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
Loaded 30 time window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl
Loaded 30 time window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2009_windowtopics_k30.pkl
Loaded 30 time window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2010_windowtopics_k30.pkl
Loaded 30 time window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2011_windowtopics_k30.pkl
Loaded 30 time window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2012_windowtopics_k30.pkl
Loaded 30 time wind
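
A simplified sketch of this second layer, assuming it follows the general idea described by Greene: the topic-term rows from every time window are aligned to a shared vocabulary, stacked into one matrix, and factored again with NMF. The window contents, vocabulary, and `k = 2` are made up for illustration, and `find-dynamic-topics.py` may build its matrix differently.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical window models: window name -> (terms, H), where H is topics x terms.
rng = np.random.default_rng(1000)
windows = {
    "2008": (["host", "rna", "virus"], rng.random((2, 3))),
    "2009": (["rna", "vaccine", "virus"], rng.random((2, 3))),
}

# Shared vocabulary across all windows, with a column index for each term
all_terms = sorted({t for terms, _ in windows.values() for t in terms})
col = {t: j for j, t in enumerate(all_terms)}

# Re-index each window's H onto the shared vocabulary; missing terms stay at zero
rows = []
for name, (terms, H) in windows.items():
    aligned = np.zeros((H.shape[0], len(all_terms)))
    for j, t in enumerate(terms):
        aligned[:, col[t]] = H[:, j]
    rows.append(aligned)

B = np.vstack(rows)  # (all window topics) x (shared terms)

# Second NMF layer: window topics play the role of documents
model = NMF(n_components=2, init="nndsvd", random_state=1000, max_iter=500)
U = model.fit_transform(B)  # window topics x dynamic topics
V = model.components_       # dynamic topics x shared terms
print(B.shape, U.shape, V.shape)
```

In this sketch each row of `V` is a dynamic topic over the shared vocabulary, and each row of `U` says how strongly one window topic belongs to each dynamic topic.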

In [29]:
# Display the top terms of the dynamic topics
!python display-topics.py /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/dynamictopics_k30.pkl

- Loaded model with 30 topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/dynamictopics_k30.pkl
Top 10 terms for 30 topics:
+------+-------------+-----------+-------------+-------------+----------------+-------------+-------------+------------+
| Rank | D01         | D02       | D03         | D04         | D05            | D06         | D07         | D08        |
+------+-------------+-----------+-------------+-------------+----------------+-------------+-------------+------------+
|    1 | rna         | influenza | mouse       | disease     | fusion         | fidelity    | core        | ace2       |
|    2 | replication | virus     | aged        | lung        | peptide        | compound    | stock       | sar_cov    |
|    3 | host        | pandemic  | response    | patient     | protein        | cov         | virus       | epithelial |
|    4 | viral       | 1918      | cell        | infectious  | membrane       | vivo        | recombinan

In [11]:
# Track each dynamic topic across the time windows
!python track-dynamic-topics.py /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/dynamictopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2009_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2010_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2011_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2012_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2013_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2014_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2015_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2016_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2017_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2018_windowtopics_k30.pkl /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2019_windowtopics_k30.pkl

Loaded model with 30 dynamic topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/dynamictopics_k30.pkl
Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
Loaded model with 30 window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2008_windowtopics_k30.pkl
Loaded model with 30 window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2009_windowtopics_k30.pkl
Loaded model with 30 window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2010_windowtopics_k30.pkl
Loaded model with 30 window topics from /project/biocomplexity/sdad/projects_data/ncses/prd/Dynamic_Topics_Modelling/NMF/out/Term_docs_2011_windowtopics_k30.pkl
Loaded model with 30 window topics from /project/biocomplexity/sdad/projects_d
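
Tracking amounts to finding which window topics contribute to each dynamic topic. One simple way to picture it, assuming each window topic is assigned to the dynamic topic where its weight is largest (the labels and weights below are hypothetical, and `track-dynamic-topics.py` may use a different rule):

```python
import numpy as np

# Hypothetical membership weights: rows are window topics, columns are dynamic topics.
labels = ["2008_01", "2008_02", "2009_01", "2009_02"]
U = np.array([
    [0.9, 0.1],
    [0.0, 0.7],
    [0.8, 0.2],
    [0.1, 0.6],
])

# Assign each window topic to its highest-weighted dynamic topic
tracks = {}
for label, d in zip(labels, U.argmax(axis=1)):
    tracks.setdefault(f"D{int(d) + 1:02d}", []).append(label)

print(tracks)  # {'D01': ['2008_01', '2009_01'], 'D02': ['2008_02', '2009_02']}
```

Reading the window labels under one dynamic topic in year order then shows how that topic appears, changes, or disappears over time.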
