# ELM Proof of Concept #2 : Topic Modelling

* October 5th, 2018
* Ryan Kazmerik, Data Scientist
* Enterprise Data Science, Encana


## Hypothesis
Topic modelling may reveal word relationships in our news articles, which can be used to extract themes or topics per article and for the entire corpus of news articles.

We will test this hypothesis using two popular topic models and our corpus of ~10,000 news articles from News-API.

### Research
**1. LDA (Latent Dirichlet Allocation)**
* generative probability model
* discovers topics through probability distributions
* produces probability weights per word
<br/>[LDA and the ABC news headlines](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df)<br/><br/>

**2. NMF (Non-negative Matrix Factorization)**
* deterministic, non-probabilistic model
* linear algebraic decomposition
* produces non-negative weights per word
<br/>[NMF and the New York Times](https://towardsdatascience.com/topic-modeling-for-the-new-york-times-news-dataset-1f643e15caac)
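
As a rough summary of the two approaches, here are the standard textbook formulations (they are not taken from the linked articles, and the regularization terms used later in this notebook are omitted). NMF approximates the non-negative document-term matrix $V$ by two smaller non-negative matrices, while LDA assumes each word is drawn from a topic that is itself drawn from a per-document topic distribution. The symbols $\alpha$, $\beta$, $n$, $m$ and $k$ are notation introduced here for illustration.

```latex
% NMF: V is the (n documents x m terms) matrix, k is the number of topics.
% Rows of H hold per-topic term weights; rows of W hold per-document topic weights.
V \approx W H, \qquad
W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad
H \in \mathbb{R}_{\ge 0}^{k \times m}, \qquad
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2} \lVert V - W H \rVert_F^2

% LDA: generative process for word i of document d.
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
z_{d,i} \sim \mathrm{Categorical}(\theta_d), \qquad
w_{d,i} \sim \mathrm{Categorical}(\phi_{z_{d,i}})
```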

## Experiments
**We need a training set of articles for this experiment.**

Let's load in ~1200 news headlines from Elastic and focus on articles that mention *oil* or *gas*:

In [1]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

docs = es.search(
    index='articles', 
    doc_type='article',
    q='description:"oil" OR description:"gas"',
    filter_path=['hits.hits'],
    _source_include='description,title',
    sort='_id',
    size=100000
)

articles = []

for d in docs['hits']['hits']:
    desc = d["_source"]["description"]
    # Skip articles whose description is empty or null
    if desc:
        articles.append(desc)
    
print("No. training articles:",len(articles))
print()
print("Sample description:",articles[1])

No. training articles: 8284

Sample description: Shares were mixed in Asia on Monday, with Chinese benchmarks leading decliners. The air strikes on Syria appeared to be having scant impact on trading, and oil prices fell back. Eyes were on Chinese GDP data due on Tuesday. KEEPING SCORE: Japan's Nikkei 225 i…


**The sentences in the articles must be parsed to remove stop words, and split into individual words (tokenization). Then the words need to be encoded as floating point values to be used as input for our algorithms (vectorization).**
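
For the TF-IDF features, the floating point values come from reweighting each raw term count by how rare the term is across the corpus. The formulas below assume scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`), which the notebook does not override. The symbols $n$, $\mathrm{df}(t)$ and $v_d$ are notation introduced here.

```latex
% n: number of documents, df(t): number of documents containing term t,
% tf(t, d): count of term t in document d, v_d: the tf-idf row for document d.
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
```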

Let's create some feature vectors for our dataset:

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 

n_features = 1000

my_additional_stop_words = ['ap','said','says','time','monday','november']
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, analyzer='word',
                         max_features=n_features, stop_words=stop_words, token_pattern = r'\b[a-zA-Z]{3,}\b')

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, 
                        stop_words=stop_words)

fs1_train = count_vectorizer.fit_transform(articles)
fs2_train = tfidf_vectorizer.fit_transform(articles)

print("Count Feature set shape:",fs1_train.shape)
print("TF-IDF Feature set shape:",fs2_train.shape)

Count Feature set shape: (8284, 1000)
TF-IDF Feature set shape: (8284, 1000)
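
To make the tokenization and count-vectorization steps concrete, here is a minimal pure-Python sketch of what happens to each description. It is an illustration only, not scikit-learn's implementation: the stop-word set, the sample sentences and the helper names `tokenize` and `count_vectors` are made up for this example, and it leaves out the `max_df`, `min_df` and `max_features` pruning.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only; the notebook uses
# scikit-learn's ENGLISH_STOP_WORDS plus a few custom terms.
STOP_WORDS = {"the", "and", "was", "for", "said", "monday"}

# Same token pattern as the CountVectorizer above: alphabetic words, 3+ letters.
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")


def tokenize(text):
    """Lowercase the text, split it into tokens and drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]


def count_vectors(docs):
    """Return (vocabulary, rows); each row holds one document's term counts."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = [
    "Oil prices rose on Monday.",
    "Gas prices fell, and oil prices fell too.",
]
vocabulary, rows = count_vectors(docs)
print(vocabulary)
print(rows)
```

Each row has one column per vocabulary term, which is why both feature sets above have the shape (number of articles, `n_features`).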


**Next we will fit our topic models for both algorithms (LDA and NMF). We also need to specify the number of topics we are looking for. We can fine-tune this later but let's start by looking for 5 topics.**

Let's generate our topic models:

In [3]:
from sklearn.decomposition import LatentDirichletAllocation, NMF
from time import time

n_components = 5

print ('Fitting the LDA model...')     
t0 = time()

LDA = LatentDirichletAllocation(n_components=n_components, max_iter=5,
            learning_method='online', learning_offset=50., random_state=0)
LDA = LDA.fit(fs1_train)

print ('   done in:', (time()-t0))
print ()

print ('Fitting the NMF model...')
t0 = time()

NMF = NMF(n_components=n_components, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd')
NMF = NMF.fit(fs2_train)

print ('   done in:', (time()-t0))

Fitting the LDA model...
   done in: 7.1941375732421875

Fitting the NMF model...
   done in: 0.25736021995544434


## Results
**The most widely used method for evaluating a topic model is extrinsic evaluation, which means we manually evaluate the related words that the topic model generated to see if they make sense.**

Let's have a look at the output of our two topic models:

In [4]:
n_terms = 10
terms1 = count_vectorizer.get_feature_names()
terms2 = tfidf_vectorizer.get_feature_names()

print("LDA - Topic model:")

for topic_idx, topic in enumerate(LDA.components_):
    message = " Topic %d: " % topic_idx
    message += " ".join([terms1[i]
                    for i in topic.argsort()[:-n_terms - 1:-1]])
    print(message)
print()


print("NMF - Topic model:")

for topic_idx, topic in enumerate(NMF.components_):
    message = " Topic %d: " % topic_idx
    message += " ".join([terms2[i]
                    for i in topic.argsort()[:-n_terms - 1:-1]])
    print(message)
print()

LDA - Topic model:
 Topic 0: oil iran sanctions trump president washington iranian exports donald states
 Topic 1: oil energy prices companies stocks stock higher canada index shares
 Topic 2: gas energy oil natural reuters company pipeline new billion liquefied
 Topic 3: oil prices crude opec supply global output production year demand
 Topic 4: oil saudi energy arabia minister state reuters world company agency

NMF - Topic model:
 Topic 0: prices oil crude sanctions supply global iran markets rose demand
 Topic 1: gas energy natural company reuters new oil pipeline billion liquefied
 Topic 2: stock index canada main shares higher energy prices lower stocks
 Topic 3: saudi arabia opec russia oil output minister production aramco al
 Topic 4: trump president donald iran administration sanctions nuclear oil deal washington



## Observations

### 1. Topics are representative of current stories



**The topics generated by NMF seem to be more cohesive and make more sense than the LDA topics. We can see mention of events we know to be happening such as: Iranian sanctions, Hurricane Michael and Russia/Saudi discussing increasing production.**
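
One rough way to back up this impression with a number is to measure how much the topics' keyword lists overlap. The sketch below computes the mean pairwise Jaccard overlap of the top-10 keywords printed above. This is only a proxy for how distinct the topics are, not an established evaluation used elsewhere in this notebook, and the helper names are introduced here for illustration.

```python
from itertools import combinations

# Top-10 keywords per topic, copied from the output above.
lda_topics = [
    "oil iran sanctions trump president washington iranian exports donald states",
    "oil energy prices companies stocks stock higher canada index shares",
    "gas energy oil natural reuters company pipeline new billion liquefied",
    "oil prices crude opec supply global output production year demand",
    "oil saudi energy arabia minister state reuters world company agency",
]
nmf_topics = [
    "prices oil crude sanctions supply global iran markets rose demand",
    "gas energy natural company reuters new oil pipeline billion liquefied",
    "stock index canada main shares higher energy prices lower stocks",
    "saudi arabia opec russia oil output minister production aramco al",
    "trump president donald iran administration sanctions nuclear oil deal washington",
]


def jaccard(a, b):
    """Share of keywords two topics have in common (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(topics):
    """Average Jaccard overlap over every pair of topics."""
    keyword_sets = [set(t.split()) for t in topics]
    pairs = list(combinations(keyword_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


lda_overlap = mean_pairwise_overlap(lda_topics)
nmf_overlap = mean_pairwise_overlap(nmf_topics)
print("LDA mean overlap: %.3f" % lda_overlap)
print("NMF mean overlap: %.3f" % nmf_overlap)
```

A lower value means the topics share fewer keywords. On these lists the NMF topics overlap less than the LDA topics, where "oil" appears in every topic.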

Let's have a closer look at the NMF topic model and some sample articles for each topic:

In [5]:
import numpy as np

def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("Topic %d" % (topic_idx))
        print("Keywords:", " ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print("Top document:", documents[doc_index])
            
        print()

In [6]:
nmf_W = NMF.transform(fs2_train)
nmf_H = NMF.components_

no_top_words = 10
no_top_documents = 1
display_topics(nmf_H, nmf_W, terms2, articles, no_top_words, no_top_documents)

Topic 0
Keywords: prices oil crude sanctions supply global iran markets rose demand
Top document: Brent crude oil prices rose back above $80 a barrel on Monday as markets were expected to tighten once U.S. sanctions against Iran's crude exports are implemented next month.

Topic 1
Keywords: gas energy natural company reuters new oil pipeline billion liquefied
Top document: Natural gas coalition launches

Topic 2
Keywords: stock index canada main shares higher energy prices lower stocks
Top document: Canada's main stock index opened higher on Monday as a rise in oil prices lifted shares of energy companies.

Topic 3
Keywords: saudi arabia opec russia oil output minister production aramco al
Top document: World’s most profitable company recruits American oil executive Lynn Laverty Elsenhans in milestone move for Saudi Arabia Saudi Aramco, the world’s most profitable company, has appointed a woman to its board, in a milestone move for Saudi Arabia, where only o…

Topic 4
Keywords: trump president donald iran administration sanctions nuclear oil deal washington

### 2. Number of topics is an important variable

**Too few topics result in very broad, general topics.**

In [8]:
import pyLDAvis
import pyLDAvis.sklearn
import warnings
warnings.simplefilter('ignore')

n_components = 3

LDA = LatentDirichletAllocation(n_components=n_components, max_iter=5,
            learning_method='online', learning_offset=50., random_state=0)
LDA = LDA.fit(fs1_train)
    
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(LDA, fs1_train, count_vectorizer)

Now let's refit with 20 topics for comparison:

In [9]:
n_components = 20

LDA = LatentDirichletAllocation(n_components=n_components, max_iter=5,
            learning_method='online', learning_offset=50., random_state=0)
LDA = LDA.fit(fs1_train)
    
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(LDA, fs1_train, count_vectorizer)

## Conclusion

*Hypothesis: Topic modelling may reveal word relationships in our news articles, which can be used to extract themes or topics per article and for the entire corpus of news articles.*

* Non-negative matrix factorization does a better job of clustering related words given a relatively small dataset (~1200 headlines), whereas LDA struggled to provide cohesive results at this scale.
<br/>

* The topic model could efficiently classify each new article as it passes through the pipeline. This would provide a high-level (macro) category, but more granular tagging would require a different approach.
<br/>

* It is difficult to determine the exact accuracy of the topic model, as the primary means of evaluation is extrinsic (human evaluation).
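
The per-article classification mentioned above could be sketched roughly as follows. This assumes an already fitted vectorizer and topic model that expose scikit-learn style `transform` methods. The function name `assign_topic` is hypothetical, and picking the single strongest topic is only one possible design choice.

```python
def assign_topic(text, vectorizer, model):
    """Return (topic_index, weight) for the strongest topic of one article.

    Assumes `vectorizer` and `model` are already fitted and expose
    scikit-learn style `transform` methods.
    """
    features = vectorizer.transform([text])   # 1 x n_features
    doc_topic = model.transform(features)     # 1 x n_topics
    weights = list(doc_topic[0])
    best = max(range(len(weights)), key=lambda k: weights[k])
    return best, weights[best]


# Example call with the fitted objects from this notebook
# (here `NMF` is the fitted model variable, not the class):
# topic_idx, weight = assign_topic(new_description, tfidf_vectorizer, NMF)
```

An article whose strongest weight is close to zero probably fits none of the topics well, so a minimum-weight threshold may be worth adding before tagging.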

## Further Improvements

1. Changing the initial query to Elastic had a significant impact on the topics produced. We could experiment with this query to dial in how broad or specific the topics are.
<br/>

2. The LDA model did not perform well with ~1200 articles but may perform better with a larger dataset, as generative models typically need lots of data.
<br/>

3. We may consider writing an additional component that labels the topics programmatically. Example: query another source with the top 10 keywords and receive a high level topic label.
<br/>

4. Could add many more terms to the stop-words list to prevent topics from containing generic words such as 'tuesday'.
<br/>

5. Need to experiment with bi-gram and tri-gram vectorizers so that keyword phrases like 'british columbia' do not appear as separate keywords 'british' and 'columbia'.

6. Should implement stemming, lemmatization, PoS tagging.
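
As an illustration of the bi-gram idea in item 5, the sketch below shows how a token list expands into uni-grams and bi-grams. The helper `ngrams` is a hypothetical pure-Python stand-in. In scikit-learn the same effect is normally requested through the vectorizer's `ngram_range` parameter, for example `ngram_range=(1, 2)`.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["british", "columbia", "pipeline"]
print(ngrams(tokens))
```

With bi-grams in the vocabulary, 'british columbia' can be weighted as a single feature, at the cost of a larger and sparser feature matrix.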