# Setup

If you haven't already, please install Mallet, which is required to run this notebook. A backup LDA model, gensim's LDA Multicore, is also supported and can be selected by passing a different argument to build_lda_model.

### Pre-Processing

In [2]:
# import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')

We filtered all Reuters news headlines from 2014 to 2016 for occurrences of the word 'oil'. We then aggregated the file corpus into a line corpus, provided as a CSV file ('oil_news.csv'). Functions in our Python script read this file in and perform the following pre-processing steps:

- Remove stop words (NLTK)
- Convert to lower case (Gensim)
- Remove punctuation (Gensim)
- Remove numbers (Gensim)
- Stem/lemmatize (NLTK)

After this, we use the bag-of-words approach to build a corpus of pre-processed documents. All of this is done by calling the get_text_data() function.
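To make the steps above concrete, here is a minimal sketch using only the standard library. It is not the code inside get_text_data(): the stop-word set is a tiny stand-in for NLTK's list, stemming/lemmatizing is left out, the two headlines are made up, and the helper names (`preprocess`, `build_vocabulary`, `to_bow`) are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (assumption); the notebook relies on NLTK's list.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "its", "as"}

def preprocess(headline):
    """Lower-case, strip punctuation and digits, then drop stop words."""
    text = headline.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation and numbers
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def build_vocabulary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[tok] for tok in doc)
    return sorted(counts.items())

# Made-up headlines, for illustration only.
headlines = [
    "Oil prices rise as oil demand grows in 2015",
    "Cabot Oil & Gas declares a dividend",
]
docs = [preprocess(h) for h in headlines]
vocab = build_vocabulary(docs)
bow_corpus = [to_bow(d, vocab) for d in docs]
print(docs[0])
print(bow_corpus[0])
```

Each document ends up as a list of (token id, count) pairs, which is the same shape of input that an LDA model consumes as its corpus.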

In [6]:
from CTADaily import get_text_data
text_df, dictionary, corpus = get_text_data('WTI-NEWS.csv', 'Headline')

In [7]:
text_df.head(2)

Unnamed: 0,Date,Headline
0,2014-01-02,"[brazil, oleo, gas, say, honor, oil, field, debt]"
1,2014-01-02,"[cabot, oil, gas, corpor, declar, dividend]"


### Building The Model

In [4]:
from CTADaily import build_lda_model
lda_model = build_lda_model('Mallet', corpus, dictionary, num_topics=20)

If you were unable to install Mallet, you can use the alternative model type, gensim's LDA Multicore. Select it by passing 'Multicore' to build_lda_model as follows:

In [5]:
# lda_model = build_lda_model('Multicore', corpus, dictionary, num_topics=20)

#### Optimizing Model by Coherence

In [79]:
from gensim.models.coherencemodel import CoherenceModel
from CTADaily import get_tokens

# Tokenised headlines for the 'c_v' measure (get_tokens is assumed to return a list of token lists)
tokens = get_tokens(text_df, 'Headline')

cm = CoherenceModel(model=lda_model, corpus=corpus, texts=tokens, coherence='c_v')
coherence = cm.get_coherence()
print(coherence)

0.36782637006277696


We see quite a low coherence value of 0.368 for 20 topics. Let's try to optimise this by varying the number of topics.

In [1]:
scores = []
models = []

for i in range(5, 60, 5):
    print('Topic Model with {} #Topics'.format(i))
    model = build_lda_model('Mallet', corpus, dictionary, num_topics=i)
    models.append(model)
    # Score the model just built, not the earlier 20-topic lda_model
    cm = CoherenceModel(model=model, corpus=corpus, texts=tokens, coherence='c_v')
    coherence = cm.get_coherence()
    scores.append(coherence)
    print('Coherence: {}'.format(coherence))

Topic Model with 5 #Topics


NameError: name 'build_lda_model' is not defined
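Once the loop has run to completion and filled `scores`, the natural next step is to pick the topic count with the highest coherence. The sketch below shows one way to do that. The helper `best_by_coherence` is hypothetical, and the scores are placeholder numbers invented for the example, not results from this notebook.

```python
def best_by_coherence(topic_counts, scores):
    """Return (num_topics, score) for the entry with the highest coherence."""
    if len(topic_counts) != len(scores):
        raise ValueError("topic_counts and scores must have the same length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return topic_counts[best_index], scores[best_index]

# Same grid as the loop: 5, 10, ..., 55 topics.
topic_counts = list(range(5, 60, 5))

# Placeholder values only (assumption); substitute the real `scores` list.
placeholder_scores = [0.30, 0.33, 0.35, 0.38, 0.42, 0.40, 0.39, 0.37, 0.36, 0.34, 0.33]

best_k, best_score = best_by_coherence(topic_counts, placeholder_scores)
print(best_k, best_score)
```

Because `models` is appended to in the same order as `scores`, the same index can be used to retrieve the corresponding fitted model.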

We can now get the topic distribution for each document. We want to use these probabilities as features for our regression, combined with the date each headline was published by Reuters. Each cell holds a (topic number, probability) tuple, so we unpack the tuples to keep only the probability. We do this for every column before merging in the dates.

In [37]:
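import pandas

# One row per document; each cell is a (topic number, probability) tuple
topic_dist_df = pandas.DataFrame(lda_model[corpus])

# Keep only the probability from each tuple
for column in topic_dist_df.columns:
    topic_dist_df[column] = topic_dist_df[column].apply(lambda x: x[1])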

In [38]:
topic_dist_df.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.04386,0.045809,0.04386,0.051657,0.04386,0.04386,0.04386,0.078947,0.04386,0.061404,0.049708,0.051657,0.045809,0.057505,0.049708,0.059454,0.049708,0.04386,0.047758,0.04386
1,0.044643,0.046627,0.046627,0.108135,0.048611,0.044643,0.044643,0.044643,0.044643,0.044643,0.054563,0.044643,0.046627,0.050595,0.044643,0.046627,0.058532,0.046627,0.044643,0.044643
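The column-wise unpacking relies on every document listing every topic, in order, which is what the Mallet output here does. To our understanding, gensim's native LDA models may leave out topics whose probability falls below a minimum threshold, which would produce rows of different lengths. The sketch below is one way to handle both cases by placing each probability according to its topic number. The helper `topic_rows_to_frame` is hypothetical and the distributions are made up.

```python
import pandas

def topic_rows_to_frame(doc_topics, num_topics):
    """Build a documents x topics DataFrame from lists of (topic_id, prob) pairs.

    Topics missing from a document's list are filled with 0.0.
    """
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics
        for topic_id, prob in pairs:
            row[topic_id] = prob
        rows.append(row)
    return pandas.DataFrame(rows, columns=range(num_topics))

# Made-up distributions for two documents and four topics.
example = [
    [(0, 0.1), (1, 0.2), (2, 0.3), (3, 0.4)],  # dense: every topic present
    [(1, 0.75), (3, 0.25)],                    # sparse: topics 0 and 2 omitted
]
frame = topic_rows_to_frame(example, num_topics=4)
print(frame)
```

Indexing by topic number rather than by position keeps each probability in the right column even when a row is incomplete.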


In [39]:
dates = pandas.read_csv('oil_news.csv')['Date']
topic_dist_df = pandas.concat([dates, topic_dist_df], axis=1, sort=False)

In [40]:
topic_dist_df.head(2)

Unnamed: 0,Date,0,1,2,3,4,5,6,7,8,...,10,11,12,13,14,15,16,17,18,19
0,2014-01-02,0.04386,0.045809,0.04386,0.051657,0.04386,0.04386,0.04386,0.078947,0.04386,...,0.049708,0.051657,0.045809,0.057505,0.049708,0.059454,0.049708,0.04386,0.047758,0.04386
1,2014-01-02,0.044643,0.046627,0.046627,0.108135,0.048611,0.044643,0.044643,0.044643,0.044643,...,0.054563,0.044643,0.046627,0.050595,0.044643,0.046627,0.058532,0.046627,0.044643,0.044643


Now we merge in the WTI crude oil prices from our exported historical data. That dataset covers more dates than our headlines do, so we perform an inner join on the shared 'Date' column.

In [47]:
wti_prices = pandas.read_csv('DCOILWTICO.csv')
topic_prices = pandas.merge(topic_dist_df, wti_prices, on='Date', how='inner')

In [48]:
topic_prices.head(2)

Unnamed: 0,Date,0,1,2,3,4,5,6,7,8,...,11,12,13,14,15,16,17,18,19,DCOILWTICO
0,2014-01-02,0.04386,0.045809,0.04386,0.051657,0.04386,0.04386,0.04386,0.078947,0.04386,...,0.051657,0.045809,0.057505,0.049708,0.059454,0.049708,0.04386,0.047758,0.04386,95.14
1,2014-01-02,0.044643,0.046627,0.046627,0.108135,0.048611,0.044643,0.044643,0.044643,0.044643,...,0.044643,0.046627,0.050595,0.044643,0.046627,0.058532,0.046627,0.044643,0.044643,95.14
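The effect of the inner join is easiest to see on a toy example. In the sketch below, all values are made up: the price table has dates before and after the headline range, and two headlines share a date. Only dates present in both tables survive, and the price is repeated for every headline published on that day.

```python
import pandas

# Made-up topic probabilities: two headlines on 2014-01-02, one on 2014-01-03.
topics = pandas.DataFrame({
    "Date": ["2014-01-02", "2014-01-02", "2014-01-03"],
    "0": [0.6, 0.3, 0.5],
    "1": [0.4, 0.7, 0.5],
})

# Made-up prices covering a wider date range than the headlines.
prices = pandas.DataFrame({
    "Date": ["2014-01-01", "2014-01-02", "2014-01-03", "2014-01-06"],
    "DCOILWTICO": [98.0, 95.0, 94.0, 93.0],
})

# Inner join: keep only rows whose Date appears in both tables.
merged = pandas.merge(topics, prices, on="Date", how="inner")
print(merged)
```

The join matches on exact values, so both 'Date' columns need the same type and format (plain strings here). If one side were parsed as datetimes and the other left as text, the merge would fail or match nothing.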


In [54]:
columns = ['Date'] + ['Topic {}'.format(i) for i in range(20)] + ['Price']
topic_prices.columns = columns

Let's save this to a file for easier analysis.

In [55]:
topic_prices.to_csv('WTI-LDA-TOPIC.csv', index=False)

### Regression Model