Process:

1. Download metadata
2. Download text documents
3. Perform cleaning
4. Apply LDA model

In [1]:
import os
import json

import pandas as pd

## Import the DocsManager notebook

Let's import the DocsManager helper class that manages the loading and filtering of documents from the API.

In [2]:
%%capture

# LDAModule provides: LdaMallet, Dictionary (gensim), build_docs,
# transform_dt, get_tw, get_top_words
# %run ./LDAModule.ipynb

# DocsManager provides: DocsManager, build_docs
%run ../../DocsManager.ipynb
## Jupyter.notebook.save_checkpoint()

# path_manager provides: get_corpus_path, get_txt_clean_path
%run ../../path_manager.ipynb

In [3]:
get_corpus_path('IMF')

'/R/NLP/CORPUS/IMF'

Let's create a DocsManager instance.

- `metadata_filename`: path to the metadata file generated after scraping the API
- `cleaned_files_dir`: path to the directory where the cleaned files are stored
- `model_output_dir`: path to where model related files will be saved


In [4]:
CORPUS_ID = 'IMF'
CORPUS_PART = 'ALL'
NUM_TOPICS = 50
# MALLET_BINARY_PATH = "../Mallet/bin/mallet"
MODELS_PATH = get_models_path('LDA')
NUM_WORKERS = 22

MODEL_ID = f"{CORPUS_PART}_{NUM_TOPICS}"
MODEL_FOLDER = os.path.join(MODELS_PATH, f'{CORPUS_ID}-{MODEL_ID}')

In [5]:
MODEL_DATA_FOLDER = os.path.join(MODEL_FOLDER, 'data')

# Create the data folder (and any missing parents) if it does not exist yet
os.makedirs(MODEL_DATA_FOLDER, exist_ok=True)

In [6]:
%%time
docs = build_docs(
    metadata_filename=os.path.join(get_corpus_path(CORPUS_ID), f'{CORPUS_ID.lower()}_metadata_complete.csv'),
    cleaned_files_dir=get_txt_clean_path(CORPUS_ID),
    model_output_dir=MODEL_FOLDER  # Use flat directory as discussed...
)

CPU times: user 188 ms, sys: 39 ms, total: 227 ms
Wall time: 232 ms


Given a `CORPUS_PART`, let's extract the filtered documents. Please check the `DocsManager.ipynb` notebook for additional filter options.

In [None]:
docs.set_min_token_count(100)

In [7]:
%%time
docs_filtered, meta = docs.filter_doclist(CORPUS_PART, save=True, return_meta=True, pool_workers=NUM_WORKERS)

CPU times: user 2.46 s, sys: 2.79 s, total: 5.25 s
Wall time: 1min 32s


In [8]:
meta.head(2)

,id,title,author,digital_identifier,language_detected,year,pages
0,imf_000169111a90a98ac66bb6d07e4682d6be8b6ac3,Indonesia : Selected Issues,International Monetary Fund,9781475510775/1934-7685,en,2012,
1,imf_0004fdded18df40f3680c5a2f343f773d8f18d6f,United States : Staff Report for the 2000 Arti...,International Monetary Fund,9781451839524/1934-7685,en,2000,


In [9]:
docs_filtered.head(2)

,id,filename,text
0,imf_000169111a90a98ac66bb6d07e4682d6be8b6ac3,/R/NLP/CORPUS/IMF/TXT_CLEAN/imf_000169111a90a9...,select issue country report august internation...
1,imf_0004fdded18df40f3680c5a2f343f773d8f18d6f,/R/NLP/CORPUS/IMF/TXT_CLEAN/imf_0004fdded18df4...,international monetary fund staff country repo...


# LDA model

### Generate gensim dictionary

In [10]:
%%time
g_dict = Dictionary(docs_filtered.text.str.split())
# Build the reverse mapping (id -> token) without shadowing the built-in `id`
g_dict.id2token = {token_id: token for token, token_id in g_dict.token2id.items()}

CPU times: user 1min 41s, sys: 4.69 s, total: 1min 46s
Wall time: 3min 29s


### Train LDA model using Gensim's Mallet wrapper

In [11]:
corpus = [g_dict.doc2bow(text.split()) for text in docs_filtered.text]
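The dictionary and bag-of-words steps above can be illustrated with a minimal pure-Python sketch. The helper names `build_token2id` and `doc2bow` below are hypothetical stand-ins, not gensim's implementation, and the id assignment order (first appearance) is an illustration choice that may differ from what gensim's `Dictionary` produces.

```python
from collections import Counter


def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy cleaned texts (whitespace-separated tokens, like the `text` column above)
texts = ["debt country debt", "country report"]
tokenized = [text.split() for text in texts]

token2id = build_token2id(tokenized)
bow_corpus = [doc2bow(tokens, token2id) for tokens in tokenized]
```

Each document becomes a list of `(token_id, count)` pairs, which is the shape of the `corpus` list passed to the model later on.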

In [13]:
MODEL_DATA_FOLDER = os.path.join(MODELS_PATH, f'{CORPUS_ID}-{MODEL_ID}', 'data')

# Create the data folder (and any missing parents) if it does not exist yet
os.makedirs(MODEL_DATA_FOLDER, exist_ok=True)

In [14]:
MODEL_DATA_FOLDER

'/R/NLP/MODELS/LDA/IMF-ALL_50/data'

# WARNING! Mallet files will be stored in the user home directory.

Ideally, these files would be written to the `/tmp` directory, but the space allocated there is not enough.

In [15]:
import logging

logging.basicConfig(filename=f'{CORPUS_ID.lower()}-{MODEL_ID}.log', format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO)


In [17]:
%%time

model = LdaMallet(
    # The prefix must be an f-string so that the placeholders are actually filled in
    MALLET_BINARY_PATH, corpus=corpus, num_topics=NUM_TOPICS, prefix=f'./{CORPUS_ID}-{MODEL_ID}_',
    id2word=g_dict.id2token, workers=NUM_WORKERS
)

./corpus.txt
./state.mallet.gz
CPU times: user 19min 24s, sys: 2.99 s, total: 19min 27s
Wall time: 1h 35min 6s


In [19]:
model.fdoctopics(), model.num_topics

('~/IMF-ALL_50_doctopics.txt', 50)

### Load doc topics

In [19]:
dt = pd.read_csv(
    model.fdoctopics(), delimiter='\t', header=None,
    names=[i for i in range(model.num_topics)], index_col=None,
    usecols=[i + 2 for i in range(model.num_topics)],
)

dt.index = docs_filtered['id']
dt = dt.divide(dt.min(axis=1), axis=0).astype(int) - 1
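The rescaling in the last line of the cell above may be easier to follow with a worked example. One reading, stated here as an assumption rather than something the notebook confirms: the doc-topics file holds smoothed proportions of the form (n + alpha) / (N + K * alpha), with a symmetric prior alpha, and every document has at least one topic with zero assigned tokens. The smallest proportion in a row is then alpha / (N + K * alpha), so dividing the row by it gives n / alpha + 1, and subtracting 1 leaves n / alpha. The sketch below uses hypothetical numbers and swaps in `round` where the notebook truncates with `astype(int)`, to absorb floating-point error.

```python
def proportions_to_counts(proportions):
    """Rescale one document's smoothed topic proportions to integer pseudo-counts."""
    smallest = min(proportions)
    # Under the stated assumptions, p / smallest equals n / alpha + 1
    return [round(p / smallest) - 1 for p in proportions]


# Hypothetical document: 4 topics, token counts per topic, symmetric prior alpha = 1.0
alpha = 1.0
true_counts = [0, 240, 1, 3]
total = sum(true_counts) + alpha * len(true_counts)
smoothed = [(n + alpha) / total for n in true_counts]

recovered = proportions_to_counts(smoothed)
```

If alpha differs from 1, the recovered values are the counts scaled by 1 / alpha rather than the counts themselves.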

In [20]:
dt.head(2)

id,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
imf_000169111a90a98ac66bb6d07e4682d6be8b6ac3,0,240,1,3,158,67,1,25,4,1,...,1,691,0,3,427,3,162,7,12,3
imf_0004fdded18df40f3680c5a2f343f773d8f18d6f,191,46,14,16,215,178,15,146,43,128,...,267,290,17,4,681,80,25,4147,17,111


In [19]:
dt.head(2)

uid,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
20140580,2,0,0,33,3,4,1,0,1,3,...,20,2,2,0,2,181,0,0,0,0
25715555,0,131,0,0,1136,0,46,10,25,702,...,222,0,0,861,3,3,156,0,3,43


### Generate dfr data

In [21]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement
ddt = transform_dt(dt.to_numpy().T)
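Judging from how the Scratch cells index `ddt['i']`, `ddt['p']` and `ddt['x']`, the result looks like a column-compressed sparse matrix with one column per topic. The sketch below is an assumption about that layout, not the code of `transform_dt`; the helpers `to_csc` and `topic_counts_for_doc` are hypothetical, and `to_csc` takes a documents-by-topics list of rows.

```python
def to_csc(doc_topic_rows):
    """Encode a docs x topics matrix as column-compressed arrays.

    i: document (row) index of each non-zero entry
    p: p[t]..p[t+1] is the slice of i and x belonging to topic t
    x: the non-zero values
    """
    n_docs = len(doc_topic_rows)
    n_topics = len(doc_topic_rows[0]) if doc_topic_rows else 0
    i, p, x = [], [0], []
    for topic in range(n_topics):
        for doc in range(n_docs):
            value = doc_topic_rows[doc][topic]
            if value != 0:
                i.append(doc)
                x.append(value)
        p.append(len(x))
    return {"i": i, "p": p, "x": x}


def topic_counts_for_doc(csc, doc_index):
    """Collect {topic: value} for one document by scanning each topic's slice."""
    counts = {}
    for topic in range(len(csc["p"]) - 1):
        for pos in range(csc["p"][topic], csc["p"][topic + 1]):
            if csc["i"][pos] == doc_index:
                counts[topic] = csc["x"][pos]
    return counts


# Toy matrix: 3 documents, 2 topics
csc_example = to_csc([[0, 2], [3, 0], [1, 4]])
doc2_counts = topic_counts_for_doc(csc_example, 2)
```

Storing only the non-zero entries keeps the JSON file small when most documents touch only a few topics.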

In [22]:
ttw = get_tw(model)

### Store data

In [23]:
with open(os.path.join(MODEL_DATA_FOLDER, 'tw.json'), 'w') as fl:
    json.dump(ttw, fl)

In [24]:
with open(os.path.join(MODEL_DATA_FOLDER, 'dt.json'), 'w') as fl:
    json.dump(ddt, fl)

In [25]:
info_json = {
    "title": "Topics in <em>WB Documents and Reports API</em>",
    "meta_info": "This site is the working demo for <a href=\"/\">dfr-browser</a>, a browsing interface for topic models of journal articles or other text.",
    "VIS": {
        "condition": {
            "type": "time",
            "spec": {
                "unit": "year",
                "n": 1
            }
        },
        "bib_sort": {
            "major": "year",
            "minor": "alpha"
        },
        "model_view": {
            "plot": {
                "words": 6,
                "size_range": [6, 14]
            } 
        }
    }
}

with open(os.path.join(MODEL_DATA_FOLDER, 'info.json'), 'w') as fl:
    json.dump(info_json, fl)

# Generation of key LDA files

### doc_topics

In [26]:
dt.to_csv(
    os.path.join(MODEL_DATA_FOLDER, f'doc_topics_{MODEL_ID}.csv'), 
    header=False,  # Change to True if topic id should be present as the header
    index=False  # Change to True if the uid should be present as the index
)

### topic_words

In [27]:
word_topics = pd.DataFrame(model.word_topics, columns=range(model.word_topics.shape[1]), index=range(1, model.word_topics.shape[0] + 1))
word_topics = word_topics.rename(columns=model.id2word)

In [28]:
word_topics.head()

,ability,able,absence,absolute,absorb,accelerate,accelerator,accept,access,accommodate,...,stapler,nonfreezing,gruel,armpit,convector,dingo,famish,outshining,monomer,telesale
1,709.0,0.0,203.0,0.0,0.0,0.0,0.0,1.0,4325.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,843.0,0.0,116.0,0.0,217.0,0.0,0.0,0.0,1103.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3217.0,0.0,607.0,188.0,222.0,0.0,0.0,1213.0,2239.0,57.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,188.0,0.0,0.0,98.0,0.0,850.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,510.0,0.0,207.0,461.0,0.0,0.0,0.0,0.0,0.0,113.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
word_topics.astype(int).to_csv(
    os.path.join(MODEL_DATA_FOLDER, f'topic_words_{MODEL_ID}.csv'), 
    header=False,  # Change to True if actual word should be present as the header
    index=False  # Change to True if the topic id should be present as the index
)

### top_words

In [30]:
top_words = get_top_words(word_topics, topic=None, topn=50)
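As a rough illustration of the selection step, the sketch below ranks the words of each topic by weight and keeps the heaviest `topn`. It is a hypothetical stand-in working on plain dictionaries, not the `get_top_words` helper itself, and the toy weights are made up.

```python
def top_words_per_topic(topic_word_weights, topn=2):
    """Return (topic, word, weight) rows with the topn heaviest words of each topic."""
    rows = []
    for topic, weights in topic_word_weights.items():
        # Sort this topic's words from heaviest to lightest
        ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        rows.extend((topic, word, weight) for word, weight in ranked[:topn])
    return rows


# Toy weights: {topic id: {word: weight}}
example_weights = {
    1: {"debt": 900, "country": 300, "gruel": 0},
    2: {"access": 500, "ability": 700, "debt": 10},
}
example_top = top_words_per_topic(example_weights, topn=2)
```

The rows have the same `topic, word, weight` layout as the table shown for `top_words.head(2)`.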

In [31]:
top_words.head(2)

,topic,word,weight
0,1,debt,339300
1,1,country,30594


In [32]:
top_words.to_csv(
    os.path.join(MODEL_DATA_FOLDER, f'top_words_{MODEL_ID}.csv'), 
    index=False  # Change to True if the topic id should be present as the index
)

In [33]:
%%time
model.save(os.path.join(MODEL_DATA_FOLDER, f'{CORPUS_ID}_lda_model_{MODEL_ID}.lda'))

/R/NLP/MODELS/LDA/IMF-ALL_50/data/IMF_lda_model_ALL_50.lda
CPU times: user 50 ms, sys: 33 ms, total: 83 ms
Wall time: 453 ms


In [34]:
# ls -lh saved_lda_model.lda

# Find closest document by Euclidean distance

Use the `close_docs` function defined in `LDAModule.ipynb`.
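For readers without access to `LDAModule.ipynb`, the idea can be sketched in a few lines. This is a minimal illustration under assumptions, not the `close_docs` implementation: the function name `nearest_docs`, the toy ids and the topic shares are all hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def nearest_docs(doc_topics, doc_id, num_docs=3):
    """Return the num_docs ids closest to doc_id by Euclidean distance.

    The query document has distance 0 to itself, so it normally comes first.
    """
    target = doc_topics[doc_id]
    distances = [
        (math.dist(target, vector), other_id)
        for other_id, vector in doc_topics.items()
    ]
    distances.sort()
    return [other_id for _, other_id in distances[:num_docs]]


# Toy topic shares for four documents over three topics
toy_topics = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.8, 0.2, 0.0],
    "c": [0.0, 0.1, 0.9],
    "d": [0.5, 0.5, 0.0],
}
neighbours = nearest_docs(toy_topics, "a", num_docs=3)
```

As in the commented R version below, the first entry of the result is the query document itself.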

In [84]:
# # We generate a function that will find and list the N documents closest to a selected one
# close_docs <- function(docid, numclose) {
#   indx <- which(s$uid == docid)
#   mxcol = 24 + as.numeric(model)
#   x1 <- s[indx, 25:mxcol]
#   neighbors <- s[, 25:mxcol]
#   dist <- pdist(neighbors, x1)
#   similar <- cbind(s, dist@dist)
#   similar <- similar[order(dist@dist),]
#   head(similar[, c(1,5,6,8,9,11,15)], numclose) # The first in the list is the document itself
# }

# close_docs(10575832, 21)
# close_docs(27761347, 21)

In [87]:
doc_ids = close_docs(docs, doc_id=20140580, num_docs=10, report=True, dt=dt)

uid: 20140580 
title: SABER in Action: An Overview - -ln-             Strengthening Education Systems to Achieve Learning for All 
url: http://documents.worldbank.org/curated/en/866881468323335358/SABER-in-Action-An-Overview-Strengthening-Education-Systems-to-Achieve-Learning-for-All 
pdf_url: http://documents.worldbank.org/curated/en/866881468323335358/pdf/80059-REVISED-SABER-in-Action-An-Overview.pdf

uid: 29839318 
title: Statement by Mr. Johan Van -ln-             Overtveldt at the 97th meeting of the Development Committee -ln-             held on April 21, 2018 
url: http://documents.worldbank.org/curated/en/805751524690768770/Statement-by-Mr-Johan-Van-Overtveldt-at-the-97th-meeting-of-the-Development-Committee-held-on-April-21-2018 
pdf_url: http://documents.worldbank.org/curated/en/805751524690768770/pdf/DCS2018-0031-Belgium-04212018.pdf

uid: 29839299 
title: Statement by Rt. Hon. Penny -ln-             Mordaunt at the 97th meeting of the Development Committee -ln-             

# Scratch

In [178]:
ddt['p'][0:2]

[0, 3533]

In [174]:
ddt['i'][10000]

5959

In [175]:
ddt['x'][10000]

232

In [161]:
dt.to_numpy().T

array([[ 0,  0, 18, ...,  0,  0,  3],
       [ 0,  0, 10, ...,  0,  0,  2],
       [ 2,  0,  6, ...,  2,  4,  0],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  6, 30, ...,  0,  0,  0],
       [ 2,  6,  6, ...,  6,  4,  2]], dtype=int64)

In [179]:
import requests
dfr_dt = 'http://microdatahub.com/topicsmodeling/dfr/topic_browser/model.php?type=dt&model=data50_SAR'
dfr_dt = requests.get(dfr_dt)

In [234]:
dfr_dt = ddt  # dfr_dt.json()

In [235]:
# http://microdatahub.com/topicsmodeling/dfr/topic_browser/browser.php?model=data50_SAR&type=SAR&topic_count=50#/doc/9622


In [248]:
did = 8130
ps_9622 = [p for p in range(0, dfr_dt['p'][-1]) if dfr_dt['i'][p] == did]

for t in range(50):
    p0 = dfr_dt['p'][t]
    p1 = dfr_dt['p'][t + 1]

    pt_9622 = [p for p in range(p0, p1) if dfr_dt['i'][p] == did]
    if not pt_9622:
        continue  # this document has no entry for topic t
    raw = dfr_dt['x'][pt_9622[0]]
    w = raw / sum(dfr_dt['x'][p] for p in ps_9622)
    print(t, raw, w)

0 3 0.0008880994671403197
1 2 0.0005920663114268798
3 2 0.0005920663114268798
6 5 0.0014801657785671995
10 2 0.0005920663114268798
11 5 0.0014801657785671995
12 900 0.2664298401420959
16 18 0.0053285968028419185
20 240 0.07104795737122557
21 5 0.0014801657785671995
23 2 0.0005920663114268798
27 10 0.002960331557134399
28 3 0.0008880994671403197
29 289 0.08555358200118414
34 2 0.0005920663114268798
37 8 0.002368265245707519
40 61 0.018058022498519833
41 2 0.0005920663114268798
42 16 0.004736530491415038
45 609 0.1802841918294849
46 2 0.0005920663114268798
47 3 0.0008880994671403197
48 50 0.014801657785671996
49 64 0.018946121965660152


In [249]:
dt.T.sum()

uid
20140580      1080
25715555     21191
25715556     16486
25715559     22112
25715564     22201
26063310     28187
26063311     86580
26527953     75725
26527971     76887
27100106    137258
27164047      4331
27279530     19752
27556198     18284
27556933    196106
27563729      4124
27666842     74554
27678201     30339
27873493      4271
27998406      8498
28022007     57133
28024406     21622
28078923     82664
28097873     80300
28097875     79557
28135063     53546
28138284     15789
28170915     33635
28397832     12005
28397833     34754
28645758    101015
             ...  
29934340     17483
29934571      1166
29934572     10604
29934576      3477
29934577      1058
29934810     23777
29934811      6585
29935000      1666
29935012     19639
29935017     10680
29935018     19224
29935028      3498
29935030     10039
29935031     29335
29935035     14365
29935179      4293
29935213      1246
29935218      4780
29935220      5114
29935221      1958
29935314       488
29935339

In [231]:
[p for p in range(0, dfr_dt['p'][-1]) if dfr_dt['i'][p] == 10490]

[32432, 58929, 92858, 134291]

In [227]:
wbdocs.doclist[wbdocs.doclist.uid == 27164047].tokens

10    3262.0
Name: tokens, dtype: float64

In [251]:
wbdocs.doclist[wbdocs.doclist.uid == 29935714].tokens

10499    1963.0
Name: tokens, dtype: float64

In [238]:
WBdocs_filtered.shape

(7896, 3)

In [239]:
dfr_dt['p'][-1]

389113

In [246]:
dt.shape

(8131, 100)