## Topic Modelling the Banking Inquiry Hearings

I picked the hearings of the Joint Committee of Inquiry into the Banking Crisis because it's a relatively small dataset and because I've previously done some work to add keywords to it, drawn from a list provided by the committee staff. It will be interesting to compare the topic clusters with these keywords.

The banking inquiry hearings are derived from the original daily XML of committee hearings. Whereas a daily XML document contains all the hearings held on a given date, the hearings XML is divided into one document per hearing (and also includes the attendance details, though they're not relevant for this exercise). This makes no difference to the workflow here: I'm interested in the speaker URIs, and these are numbered according to the daily document, so I can still cross-reference them if I ever want to. I'm only noting it now in case I run into any unexpected issues when I expand this approach to the broader set of debates.


### Step 1: Retrieve XML from eXist-db and transfer to MongoDB collection

I'm transferring the data from a native XML database, eXist-db, to MongoDB because the latter has a more useful Python interface, natively stores JSON and binary objects (eXist-db does the same, but I haven't figured that out yet), and is more memory efficient (or at least I have a better idea of how to use it in a memory-efficient manner).

I'm retrieving speech elements from the hearing XML. The specific fields I want are: 

* the hearing date in xs:date format (as a string);
* the speaker eId;
* the speech URI, in the form of the hearing URI appended with the parent debateSection eId followed by the speech eId;
* the text of the speech itself, as an array of strings of text for each paragraph; and
* the MongoDB ``_id`` for the speech document, which is the speech URI with the substring _akn/ie/debateRecord_ replaced with _speech_.

The data is retrieved in JSON format.
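As a minimal sketch of the document shape described above, the following builds one speech document from its parts. The function name `make_speech_doc` and its parameters are hypothetical, and the hearing URI is assumed to end in `eng@hearing_1` as in the example output; the real transfer is done by the `exist-to-mongo` script, which is not shown here.

```python
def make_speech_doc(date, speaker, hearing_uri, section_eid, speech_eid, paragraphs):
    """Assemble a speech document in the shape described above (illustrative only)."""
    # Speech URI: hearing URI + parent debateSection eId + speech eId
    uri = "/".join([hearing_uri, section_eid, speech_eid])
    return {
        # The _id is the speech URI with akn/ie/debateRecord replaced by speech
        "_id": uri.replace("akn/ie/debateRecord", "speech", 1),
        "date": date,
        "spkr": speaker,
        "text": list(paragraphs),
        "uri": uri,
    }

doc = make_speech_doc(
    "2014-12-17",
    "person/ie/oireachtas/committee/inquiry-into-the-banking-crisis/witness/peter.nyberg",
    "/akn/ie/debateRecord/joint-committee-of-inquiry-into-the-banking-crisis/2014-12-17/eng@hearing_1",
    "dbsect_2",
    "spk_2",
    ["Paragraph", "Another paragraph"],
)
print(doc["_id"])
```

Deriving the `_id` from the URI keeps the two in step, so a speech can be looked up in MongoDB from its URI without a second index.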

In [1]:
from spacy.tokens.doc import Doc
from spacy.en import English

In [77]:
print()
db.drop_collection("banking")
%run exist-to-mongo

CPU times: user 9 µs, sys: 2 µs, total: 11 µs
Wall time: 11 µs

eXist-db http request status code: 200
Number of documents retrieved: 60067

---------------------------------------

Example speech document

{'_id': '/speech/joint-committee-of-inquiry-into-the-banking-crisis/2014-12-17/eng@hearing_1/dbsect_2/spk_2',
 'date': '2014-12-17',
 'spkr': 'person/ie/oireachtas/committee/inquiry-into-the-banking-crisis/witness/peter.nyberg',
 'text': ['Paragraph', 'Another paragraph'],
 'uri': '/akn/ie/debateRecord/joint-committee-of-inquiry-into-the-banking-crisis/2014-12-17/eng@hearing_1/dbsect_2/spk_2'}

---------------------------------------

MongoDB insert many success: True
No. of documents in MongoDB collection: 60067


### Parse Speeches with spaCy

I'm loading the full spaCy English pipeline, including the dependency parser and entity tagger, although for the topic modelling I only require the tokenizer and POS tagger. I'm trying this approach first on the basis that I'll be working with sentence dependency trees later, so I might as well get the whole lot out of the way now. This may not be worthwhile when I come to the full project if I only examine dependency structures for a subset of debates.

In [81]:
%time nlp = English()

CPU times: user 1min 32s, sys: 3.02 s, total: 1min 35s
Wall time: 2min 26s


Each text array in the MongoDB collection is merged into one long string and processed with spaCy to produce an output array of lemmas, with stop words and non-alphabetic tokens removed.

In [90]:
def tokenize_doc(text):
    doc = nlp(text)
    toks = [w.lemma_ for w in doc if w.is_alpha and not w.is_stop]
    return doc, toks

I'm adding the processed text to MongoDB as I go so I can come back to it later. This was a computationally expensive exercise, using 96%-100% of CPU but only 36% of memory, which is about the amount used by spaCy on its own. I timed only the tokenize_doc function (mostly in the 1-5 ms range), not the entire process, which took about 10 minutes.

In [None]:
speeches = db.banking.find()
for speech in speeches:
    text = "".join(speech["text"])
    #print("String length of speech: {}".format(len(text)))
    %time doc, toks = tokenize_doc(text)
    #print("Number of tokens in tokenized doc: {}".format(len(doc)))
    #print("Number of tokens after processing: {}".format(len(toks)))
    up = db.banking.update_one({"_id":speech["_id"]}, {"$set":{"byte_string":doc.to_bytes(), "tokens":toks}})
    #print(up.acknowledged)
    #print("\n---------------\n")
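To time the whole pass as well as each call, the loop can be wrapped with `time.perf_counter`. This is only a sketch: the `process` function and the `texts` list are dummy stand-ins for `tokenize_doc` and the MongoDB cursor.

```python
import time

def process(text):
    """Stand-in for tokenize_doc; any per-document work goes here."""
    return text.lower().split()

texts = ["It has not.  It is the same."] * 1000

start = time.perf_counter()
per_call = []
for text in texts:
    t0 = time.perf_counter()
    process(text)
    per_call.append(time.perf_counter() - t0)
total = time.perf_counter() - start

print("total: {:.4f} s, mean per call: {:.6f} s".format(total, sum(per_call) / len(per_call)))
```

Comparing the total with the sum of the per-call times gives a rough idea of how much of the run goes on everything other than parsing, such as the database round trips.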

Now that I have the byte strings stored, it's very quick to do things like identify entities (even if they're not always accurate).

In [188]:
taster = db.banking.find({"date":"2014-12-17"})
ent_list = []
for t in taster:
    doc = Doc(nlp.vocab).from_bytes(t['byte_string'])
    ent_list.extend([e.orth_.rstrip(".").replace("the ", "").replace("'s", "") for e in doc.ents if e.label_=="ORG" or e.label_ =="GPE"])
print(set(ent_list))

{'Financial Regulator', 'Committee', 'Finnish Ministry of Finance', 'Washington', 'OECD', 'Cyprus', 'DSG', 'United States', 'Irish Government', 'America', 'Houses of Oireachtas on 17 October', 'ECB', 'National Asset Management Agency', 'Finland', 'Germany', 'Anglo Irish Bank', 'EU', 'Financial Regulator "', 'NAMA', '€26', 'RTE', 'Central Bank', 'Credit Institutions (Financial Support', 'Ministry', 'Australia', 'Scandinavia', 'Department of Finance', 'United State', 'EU Commission', 'Joint Committee of Inquiry', 'Parliament', 'NTMA', 'ESB', 'IMF', 'Act', 'USA', 'Government', 'Garda', 'Canada', 'The Department of Finance', 'Iceland', 'Commission', 'Houses of Oireachtas', 'Banking Crisis', 'PwC', 'The National Treasury Management Agency', 'Defamation Act 2009', 'Ireland', 'State', 'Banking Sector', 'UK', 'European Union', 'Bank of Finland', 'Department', 'Nyberg', 'Anglo', 'Irish Central Bank', 'Spain', 'Irish Nationwide Building Society', 'European Central Bank', 'US', 'Office of Financi
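The chained `replace` calls above remove `the ` and `'s` wherever they occur in the entity text, which is consistent with 'Houses of Oireachtas' appearing in the output, while a capitalised leading 'The' survives. A minimal sketch of a stricter clean-up, which strips only a leading article and a trailing possessive, is given below; `normalise_entity` is a hypothetical helper, not part of spaCy.

```python
import re

def normalise_entity(name):
    """Strip a leading 'the'/'The', a trailing possessive and trailing full stops."""
    name = name.strip().rstrip(".")
    # Remove the article only at the start of the string, in any case
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)
    # Remove the possessive only at the end of the string
    name = re.sub(r"['’]s$", "", name)
    return name

examples = ["the Houses of the Oireachtas", "The Department of Finance", "Ireland's", "the ECB."]
print([normalise_entity(e) for e in examples])
```

This would still leave genuine tagging errors such as '€26' in the set; it only tidies the surface form of the entities that were found.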

In [172]:
d = Doc(nlp.vocab).from_bytes(db.banking.find_one()['byte_string'])
print(d)
[w.lemma_ for w in d]

It has not.  It is the same.


['it', 'have', 'not', '.', ' ', 'it', 'be', 'the', 'same', '.']

### Topic Modelling with Gensim

For this part of the evaluation, I'm following as closely as possible the [Gensim tutorial](https://radimrehurek.com/gensim/tutorial.html). The banking inquiry corpus might well fit in memory, but I'm trying the streaming approach because I will need to scale up. The topic clusters are visualised in this [notebook](LDA_viz.ipynb#topic=&lambda=0.41&term=) or in the [browser](banking_lda.html#topic=&lambda=0.6&term=).

In [1]:
from gensim import corpora, models, similarities


In [183]:
#Ignoring warnings because of this issue https://github.com/bmabey/pyLDAvis/issues/8
import warnings
warnings.filterwarnings("ignore")

In [143]:
class BankingCorpus(object):
    def __iter__(self):
        for speech in db.banking.find({}, {"tokens":True}):
            yield dictionary.doc2bow(speech['tokens'])
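The point of wrapping the query in a class with `__iter__`, rather than using a plain generator, is that the corpus can then be iterated more than once, with each pass starting a fresh query. A sketch of the same pattern without MongoDB or Gensim follows; the `StreamingCorpus` class, the in-memory `speeches` list and the `token2id` mapping are stand-ins for illustration.

```python
from collections import Counter

class StreamingCorpus(object):
    """Yield one bag-of-words vector per document, restarting on every iteration."""
    def __init__(self, fetch_docs, token2id):
        self.fetch_docs = fetch_docs  # callable returning a fresh iterable of token lists
        self.token2id = token2id

    def __iter__(self):
        for tokens in self.fetch_docs():
            # Count only tokens that are in the vocabulary
            counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
            yield sorted(counts.items())  # list of (token_id, count) pairs

speeches = [["bank", "loan", "bank"], ["loan", "regulator"]]
token2id = {"bank": 0, "loan": 1, "regulator": 2}
corpus = StreamingCorpus(lambda: iter(speeches), token2id)

first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be exhausted here
print(first_pass)
```

Only one document's vector is held in memory at a time, which is what should matter when this is scaled up to the broader set of debates.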

In [149]:
speeches = db.banking.find({}, {"tokens":True})
# collect statistics about all tokens
%time dictionary = corpora.Dictionary(speech['tokens'] for speech in speeches)
#filter out words that appear in only one document (dfs holds document frequencies)
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
%time dictionary.filter_tokens(once_ids)
dictionary.compactify() # remove gaps in id sequence after words that were removed
dictionary.save('banking.dict') # store the dictionary, for future reference
print(dictionary)


CPU times: user 3.19 s, sys: 59.9 ms, total: 3.25 s
Wall time: 3.36 s
CPU times: user 17 ms, sys: 0 ns, total: 17 ms
Wall time: 16.9 ms
Dictionary(9886 unique tokens: ['', 'composition', 'secretly', 'instantly', 'mobilisation']...)
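Because `dictionary.dfs` holds document frequencies, the filter above drops tokens that occur in only one speech, however many times they occur within it. A small sketch of the same computation in plain Python, using made-up token lists:

```python
from collections import Counter

docs = [
    ["bank", "bank", "loan"],
    ["loan", "regulator"],
    ["regulator", "guarantee", "guarantee"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

# Keep only tokens that appear in more than one document
kept = sorted(token for token, df in doc_freq.items() if df > 1)
print(kept)
```

Here 'bank' and 'guarantee' are dropped even though each occurs twice, because both are confined to a single document.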


In [151]:
corpus = BankingCorpus()
%time banking_corpus = [v for v in corpus]
corpora.MmCorpus.serialize('banking_corpus.mm', banking_corpus)

CPU times: user 3.05 s, sys: 27.9 ms, total: 3.08 s
Wall time: 3.18 s


In [2]:
banking_corpus = corpora.MmCorpus('banking_corpus.mm')
%time tfidf = models.TfidfModel(banking_corpus) # step 1 -- initialize a model
%time corpus_tfidf = tfidf[banking_corpus]
corpora.MmCorpus.serialize('banking_tfidf_corpus.mm', corpus_tfidf)

CPU times: user 3.4 s, sys: 0 ns, total: 3.4 s
Wall time: 3.22 s
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 18.6 µs
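The second timing is tiny because `tfidf[banking_corpus]` is a lazy wrapper: the weights are only computed when the corpus is iterated, which here happens during serialisation. Assuming Gensim's default settings (raw term count multiplied by log2 of the number of documents over the document frequency, with each document vector then scaled to unit length), the weighting can be sketched by hand on a toy corpus:

```python
import math

# Toy bag-of-words corpus: lists of (token_id, count) pairs
corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3)]]
n_docs = len(corpus)

# Document frequency of each token id
doc_freq = {}
for doc in corpus:
    for token_id, _ in doc:
        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

def tfidf(doc):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weighted = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted if norm > 0]

print(tfidf(corpus[0]))
```

Token 0 is both more frequent in the first document and rarer across the corpus than token 1, so it ends up with the larger weight.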


In [184]:
%time lda_model = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=25)

CPU times: user 50.3 s, sys: 51.8 ms, total: 50.4 s
Wall time: 50.3 s


In [185]:
lda_model.print_topics(4)

[(11,
  '0.028*proposal + 0.027*repay + 0.025*chairman + 0.023*sharing + 0.022*alternative + 0.020*right + 0.020*sure + 0.018*asset + 0.017*fully + 0.015*procedure'),
 (8,
  '0.020*burden + 0.019*share + 0.018*relation + 0.017*cardiff + 0.016*reduce + 0.015*minister + 0.015*noonan + 0.013*cabinet + 0.013*kevin + 0.013*meeting'),
 (22,
  '0.011*house + 0.011*know + 0.011*price + 0.009*sell + 0.009*recall + 0.008*statutory + 0.008*think + 0.008*limit + 0.007*prepare + 0.007*lot'),
 (24,
  '0.017*reduction + 0.013*realise + 0.013*hazard + 0.013*exception + 0.012*walk + 0.012*report + 0.012*importance + 0.012*head + 0.012*successful + 0.012*instance')]
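The topic strings above are awkward to compare directly with the committee keywords. A minimal sketch that splits a string in the printed `weight*term + weight*term` format into pairs is given below; `parse_topic` is a hypothetical helper, and the format is taken from the output shown above, which may differ between Gensim versions.

```python
def parse_topic(topic_string):
    """Split '0.028*proposal + 0.027*repay' into [('proposal', 0.028), ('repay', 0.027)]."""
    pairs = []
    for part in topic_string.split(" + "):
        weight, term = part.split("*", 1)
        # Some versions quote the term, so strip any surrounding double quotes
        pairs.append((term.strip().strip('"'), float(weight)))
    return pairs

topic_11 = "0.028*proposal + 0.027*repay + 0.025*chairman"
print(parse_topic(topic_11))
```

With the terms in a list, checking a topic against the keyword list becomes a simple set intersection.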

In [186]:
lda_model.save("banking_lda.lda")