# ELM Proof of Concept #1 : Sentiment Analysis

* September 20th, 2018
* Ryan Kazmerik, Data Scientist
* Enterprise Data Science, Encana

## Hypothesis
There is a correlation between the sentiment of a news article and the probabilistic outcome of the related event. That is, we can use the polarity of an article headline as an indication of the emergence or divergence of an event or idea.

We will test this hypothesis using 3 popular sentiment analysis libraries, specializing in business texts, social texts and financial texts.

### Research
**1. NLTK Vader**
* The most used Python sentiment analysis library
* Valence-based approach: rates each word as positive or negative, including intensity
    - ex. 'Tragedy' = negative, -3.4
    - ex. 'Revitalized' = positive, +2.7
* Crowd-sourced word ratings from [Amazon Mechanical Turk](https://www.mturk.com/)
* https://www.nltk.org/_modules/nltk/sentiment/vader.html
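
To make the valence idea concrete, here is a minimal sketch of a valence-based scorer. It is not the VADER implementation: the lexicon holds only the two example ratings listed above, the helper name `compound_score` is hypothetical, and negation, intensifiers and punctuation handling are all ignored. The squashing step follows the normalization we understand VADER to use (`alpha = 15`), which maps the summed valence into the range -1 to 1.

```python
import math

# Toy lexicon holding only the two example ratings quoted above.
# The real VADER lexicon contains thousands of crowd-sourced entries.
TOY_LEXICON = {"tragedy": -3.4, "revitalized": 2.7}


def compound_score(text, lexicon=TOY_LEXICON, alpha=15):
    """Sum the valence of every known word, then squash the total into [-1, 1]."""
    words = (word.strip(".,!?").lower() for word in text.split())
    total = sum(lexicon.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)


print(round(compound_score("The town was revitalized"), 3))
print(round(compound_score("A tragedy struck the town"), 3))
```

Words missing from the lexicon contribute nothing, so a sentence with no rated words scores exactly 0.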
<br/><br/>

**2. Harvard IV-4**
* Developed using grants from the U.S. National Science Foundation
* Polarity-based approach: rates words within a 6-class categorization
    - Positive, Negative
    - Strong, Weak
    - Active, Passive
* Hand-curated by linguistics researchers
* http://www.wjh.harvard.edu/~inquirer/
<br/><br/>

**3. Loughran McDonald**
* Developed at the University of Notre Dame
* Polarity-based approach: rates words within a 6-class categorization
    - Positive, Negative
    - Uncertainty
    - Litigious
    - Modal
    - Constraining
* "A growing literature finds relations between stock price reactions and the sentiment of information releases" 
* https://sraf.nd.edu/textual-analysis/resources/#Master%20Dictionary

## Experiments

Training dataset - news article headlines from 40 popular news sources.

In [439]:
import warnings
warnings.filterwarnings('ignore')
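from newsapi import NewsApiClient
import json
import os

# Read the API key from the environment instead of hard-coding it in the notebook
# (NEWSAPI_KEY is an assumed variable name)
newsapi = NewsApiClient(api_key=os.environ['NEWSAPI_KEY'])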

outlets = {
    'business':'bloomberg,business-insider,fortune,msnbc,the-wall-street-journal',
    'financial':'cnbc,financial-post,financial-times,reuters,the-economist',
    'google':'google-news,google-news-au,google-news-ca,google-news-uk',
    'world':'al-jazeera-english,bbc-news,the-guardian-uk',
    'paper':'the-globe-and-mail,the-new-york-times,the-washington-post',
    'opinion':'breitbart-news,the-huffington-post,independent,national-review,vice-news',
    'network':'abc-news,cbs-news,cnn,fox-news,nbc-news,usa-today',
    'other':'associated-press,metro,newsweek,politico,the-washington-times',
    'tech':'engadget,hacker-news,mashable,techcrunch,the-verge,wired'
}

print('Article sample: ')

data = newsapi.get_everything(
    q='oil price sanctions',
    sources='reuters',
    language='en',
    sort_by='relevancy',
    page=1,
    page_size=1)

print(json.dumps(data['articles'], indent=2))

Article sample: 
[
  {
    "source": {
      "id": "reuters",
      "name": "Reuters"
    },
    "author": "Amanda Cooper",
    "title": "Now you see it, now you don't: oil surplus vanishes ahead of Iran deadline",
    "description": "An overhang of homeless crude in the Atlantic Basin has halved in recent weeks, suggesting oil traders are bracing for a further supply loss from Iran due to U.S. sanctions and a new rally in prices.",
    "url": "https://www.reuters.com/article/us-oil-markets-analysis/now-you-see-it-now-you-dont-oil-surplus-vanishes-ahead-of-iran-deadline-idUSKCN1LF1O2",
    "urlToImage": "https://s4.reutersmedia.net/resources/r/?m=02&d=20180830&t=2&i=1298966743&w=1200&r=LYNXNPEE7T0XR",
    "publishedAt": "2018-08-30T13:50:34Z",
    "content": "LONDON (Reuters) - An overhang of homeless crude in the Atlantic Basin has halved in recent weeks, suggesting oil traders are bracing for a further supply loss from Iran due to U.S. sanctions and a new rally in prices. Iran\u2019s

### Experiment 1: Sample Headline

Let's load all 3 sentiment dictionaries and perform sentiment analysis on some sample text: <br/>

In [445]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
import pysentiment as PS

NLTK = SIA()
HARV = PS.HIV4()
LM = PS.LM()

#text = "destruction"
#text = "oil prices rose"
#text = "oil prices fell"
#text = "increased"
#text = "increased to threatening"
text = "The energy market was slow to recover this weekend, but analyst speculate the long-term outcome is not dreadful"

nltk = NLTK.polarity_scores(text)

keys = HARV.tokenize(text)
harv_polarity = HARV.get_score(keys)
lm_polarity = LM.get_score(keys)

print("NLTK", nltk)
print("HARV", harv_polarity)
print("L&M", lm_polarity)

NLTK {'neg': 0.0, 'neu': 0.774, 'pos': 0.226, 'compound': 0.566}
HARV {'Positive': 0, 'Negative': 1, 'Polarity': -0.9999990000010001, 'Subjectivity': 0.09999999000000101}
L&M {'Positive': 0, 'Negative': 1, 'Polarity': -0.9999990000010001, 'Subjectivity': 0.09999999000000101}
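
The HARV and L&M scores above can be reproduced with a short sketch of dictionary-based polarity scoring. This is our reading of how the scores are derived, not code taken from the library: the helper name `dictionary_scores` and the `EPSILON` value are assumptions, and the token count of 10 is inferred from the reported Subjectivity rather than printed by the cell.

```python
EPSILON = 1e-6  # assumed small constant that avoids division by zero


def dictionary_scores(positive, negative, n_tokens):
    """Polarity and subjectivity from counts of positive and negative words."""
    # Polarity: balance of positive vs. negative hits, in [-1, 1]
    polarity = (positive - negative) / ((positive + negative) + EPSILON)
    # Subjectivity: share of tokens that carried any sentiment at all
    subjectivity = (positive + negative) / (n_tokens + EPSILON)
    return {
        "Positive": positive,
        "Negative": negative,
        "Polarity": polarity,
        "Subjectivity": subjectivity,
    }


# One negative word found among 10 tokens (token count inferred, see above)
scores = dictionary_scores(positive=0, negative=1, n_tokens=10)
print(scores)
```

This shows why a single dictionary hit is enough to push Polarity to almost exactly -1: with only one sentiment-bearing word, there is nothing to balance it against.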


### Experiment 2 : Iranian Sanctions

We've already inserted ~8,000 articles into Elasticsearch, so let's pull out all the articles that mention the words **Iran** and **Oil**, then run the NLTK analyzer on their descriptions to gauge the sentiment around Iranian sanctions:

In [446]:
from elasticsearch import Elasticsearch
es = Elasticsearch()

docs = es.search(
    index='news', 
    doc_type='article',
    q='description:"Iran" AND oil',
    default_operator='AND',
    filter_path=['hits.hits'],
    _source_include='description,title',
    sort='_id',
    size=1000
)
t1_results = []

for i,d in enumerate(docs['hits']['hits']):
    desc = (d["_source"]["description"])
    title = (d["_source"]["title"])
    
    nltk_polarity = NLTK.polarity_scores(desc)['compound']

    if(nltk_polarity > 0.2):
        nltk_sentiment = 'Positive'
    elif (nltk_polarity < -0.2):
        nltk_sentiment =  'Negative'
    else:
        nltk_sentiment = 'Neutral'
    
    t1_results.insert(i, nltk_sentiment)
    
print("NLTK SENTIMENT CLASSIFIER")
print("Topic: Iranian sanctions")
print("- Articles classified:",len(t1_results))
print("- Positive sentiment:",t1_results.count("Positive"))
print("- Neutral sentiment:",t1_results.count("Neutral"))
print("- Negative sentiment:",t1_results.count("Negative"))


NLTK SENTIMENT CLASSIFIER
Topic: Iranian sanctions
- Articles classified: 135
- Positive sentiment: 37
- Neutral sentiment: 27
- Negative sentiment: 71
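
The thresholding loop above could be factored into a small reusable helper. The sketch below is one way to do that, assuming the same ±0.2 cut-off on the compound score; the function names `classify` and `summarize` are hypothetical, and the compound scores in the usage example are made up for illustration rather than taken from the article set.

```python
from collections import Counter

THRESHOLD = 0.2  # same cut-off as the classification loop above
LABELS = ("Positive", "Neutral", "Negative")


def classify(compound, threshold=THRESHOLD):
    """Map a compound polarity score to a sentiment label."""
    if compound > threshold:
        return "Positive"
    if compound < -threshold:
        return "Negative"
    return "Neutral"


def summarize(compounds):
    """Return {label: (count, percentage)} for a list of compound scores."""
    counts = Counter(classify(c) for c in compounds)
    total = sum(counts.values())
    return {
        label: (counts[label], round(100 * counts[label] / total, 1) if total else 0.0)
        for label in LABELS
    }


# Illustrative scores only, not real article data
example = [0.5, -0.5, 0.0, -0.9]
for label, (count, pct) in summarize(example).items():
    print("- {} sentiment: {} ({}%)".format(label, count, pct))
```

Reporting percentages alongside counts makes topics with very different article volumes easier to compare.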


### Experiment 3: Chinese Tariffs

Let's load in another topic and perform sentiment analysis on articles related to the new U.S. **tariffs** on **China**:

In [447]:
docs = es.search(
    index='news', 
    doc_type='article',
    q='description:"China" AND tariffs',
    default_operator='AND',
    filter_path=['hits.hits'],
    _source_include='description,title',
    sort='_id',
    size=1000
)

t2_results = []

for i,d in enumerate(docs['hits']['hits']):
    desc = (d["_source"]["description"])
    title = (d["_source"]["title"])
    
    nltk_polarity = NLTK.polarity_scores(desc)['compound']

    if(nltk_polarity > 0.2):
        nltk_sentiment = 'Positive'
    elif (nltk_polarity < -0.2):
        nltk_sentiment =  'Negative'
    else:
        nltk_sentiment = 'Neutral'
    
    t2_results.insert(i, nltk_sentiment)
    
print("NLTK SENTIMENT CLASSIFIER")
print("Topic: U.S. tariffs on China")
print("- Articles classified:",len(t2_results))
print("- Positive sentiment:",t2_results.count("Positive"))
print("- Neutral sentiment:",t2_results.count("Neutral"))
print("- Negative sentiment:",t2_results.count("Negative"))

NLTK SENTIMENT CLASSIFIER
Topic: U.S. tariffs on China
- Articles classified: 49
- Positive sentiment: 14
- Neutral sentiment: 8
- Negative sentiment: 27


### Experiment 4: Fracking
Let's load in another topic and perform sentiment analysis on articles related to **Fracking**:

In [448]:
docs = es.search(
    index='news', 
    doc_type='article',
    q='"Fracking"',
    default_operator='AND',
    filter_path=['hits.hits'],
    _source_include='description,title',
    sort='_id',
    size=1000
)

t3_results = []

for i,d in enumerate(docs['hits']['hits']):
    desc = (d["_source"]["description"])
    title = (d["_source"]["title"])
    
    nltk_polarity = NLTK.polarity_scores(desc)['compound']

    if(nltk_polarity > 0.2):
        nltk_sentiment = 'Positive'
    elif (nltk_polarity < -0.2):
        nltk_sentiment =  'Negative'
    else:
        nltk_sentiment = 'Neutral'
    
    t3_results.insert(i, nltk_sentiment)
    
print("NLTK SENTIMENT CLASSIFIER")
print("Topic: Fracking")
print("- Articles classified:",len(t3_results))
print("- Positive sentiment:",t3_results.count("Positive"))
print("- Neutral sentiment:",t3_results.count("Neutral"))
print("- Negative sentiment:",t3_results.count("Negative"))

NLTK SENTIMENT CLASSIFIER
Topic: Fracking
- Articles classified: 25
- Positive sentiment: 5
- Neutral sentiment: 7
- Negative sentiment: 13


## Results


In [449]:
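import os
import plotly
# Credentials are read from the environment instead of being hard-coded
# (PLOTLY_USERNAME and PLOTLY_API_KEY are assumed variable names)
plotly.tools.set_credentials_file(username=os.environ['PLOTLY_USERNAME'], api_key=os.environ['PLOTLY_API_KEY'])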

# Aliased as py to match the py.iplot call at the end of this cell
import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Scatter(
    x=[5,1,3],
    y=[len(t1_results), len(t2_results), len(t3_results)],
    text=['Iranian Sanctions', 'Chinese Tariffs', 'Fracking'],
    mode='markers',
    marker=dict(
        color=['rgb(255,0,0)', 'rgb(255,150,0)','rgb(200,0,0)'],
        opacity=[1,1,1],
        size=[len(t1_results), len(t2_results), len(t3_results)],
    )
)

layout = go.Layout(
    title='Price Influencing Factors w/ Sentiment',
    xaxis=dict(
        title='Price Factor (1=low, 5=high)',
        gridcolor='rgb(240,240,240)',
        gridwidth=2,
    ),
    yaxis=dict(
        title='No. Articles',
        gridcolor='rgb(240,240,240)',
        gridwidth=2,
    )
)

fig = go.Figure(data=[trace0], layout=layout)
py.iplot(fig, filename='bubblechart-color')

Notes:
* size of the bubbles indicates the popularity of the topic (the volume) 
<br/><br/>
* colour of bubbles indicates the sentiment, green = positive, yellow = neutral, red = negative 
<br/><br/>
* price factor would be pre-assigned for each topic 
<br/><br/>
* could be interesting to track the movement of these bubbles over time
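
The bubble colours in the cell above are set by hand. A rough sketch of how the colour could instead be derived from the classified results follows. It assumes a net sentiment score of (positive - negative) / total and reuses the ±0.2 cut-off from the experiments; both the formula and the helper names are our assumptions, not something the chart currently does. The counts are the ones reported in Experiments 2 to 4.

```python
def net_sentiment(labels):
    """(positive - negative) / total, in [-1, 1]; 0.0 for an empty list."""
    if not labels:
        return 0.0
    return (labels.count("Positive") - labels.count("Negative")) / len(labels)


def bubble_colour(labels, threshold=0.2):
    """Green for net positive, red for net negative, yellow otherwise."""
    net = net_sentiment(labels)
    if net > threshold:
        return "rgb(0,170,0)"
    if net < -threshold:
        return "rgb(255,0,0)"
    return "rgb(255,200,0)"


def labels_from_counts(pos, neu, neg):
    """Rebuild a label list from the counts printed by each experiment."""
    return ["Positive"] * pos + ["Neutral"] * neu + ["Negative"] * neg


# (positive, neutral, negative) counts reported in Experiments 2-4
topics = {
    "Iranian Sanctions": (37, 27, 71),
    "Chinese Tariffs": (14, 8, 27),
    "Fracking": (5, 7, 13),
}

for name, counts in topics.items():
    labels = labels_from_counts(*counts)
    print(name, round(net_sentiment(labels), 3), bubble_colour(labels))
```

In the notebook itself, `t1_results`, `t2_results` and `t3_results` could be passed to `bubble_colour` directly to fill the `color` list of the marker.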

## Observations

### 1. Negative sentiment on fracking (52% negative to 20% positive)

Sample of the first 5 articles classified as negative:
<pre>
* Bethany McLean's new book takes a skeptical look at whether the business case for fracking is on firm ground.

* Check for environmental impact meant to address consumer distrust of hydraulic fracturing

* Lancashire, a hotbed of anti-fracking protests, among the authorities with exposure

* The controversial gas and oil drilling method threatens to exacerbate a looming crisis over water.

* Lancastrians protested against it, council rejected it, the health impacts are shocking, this government doesn’t care.
</pre>
This is consistent with how these news sources would likely report on a subject like 'Fracking'.
<br/><br/>

### 2. Majority of stories on fracking were from U.K. sources (56% U.K. sources to 44%)

* The ratio of U.S. to U.K. news sources in our dataset is 10 to 1.
<br/><br/>
* An example of why it could be beneficial to monitor international news sources
<br/><br/>

### 3. Oil price movement reporting: Reuters

Sample of 3 article **titles** from Reuters:
<pre>
* Oil higher as U.S. sanctions on Iran raise supply concerns

* Oil gains 1 percent on signs OPEC not prepared to boost output

* Oil dips as Sino-U.S. trade tensions deepen, new tariffs due
</pre>

* Title format could be an exploitable reporting pattern.
<br/><br/>
* Reuters accounts for 18% of our News API content, despite being only 1 of 40 sources.
<br/><br/>

### 4. Sentiment is useful at the macro level, not the micro level
* Example of an article classified as positive:
<pre>
Iran and four ex-Soviet nations, including Russia, agreed in principle on Sunday how to divide up the potentially huge oil and gas resources
 > Polarity = 0.769, Sentiment = Positive
</pre>

* Positive keywords: 'agreed', 'potentially huge'
* Negative keywords: None

Although it is classified as positive, this article does not indicate the emergence of Iranian sanctions. However, such errors are balanced out by looking at the **aggregate**; in data science we often call these edge-case data points **outliers**.

## Conclusion

*Hypothesis: There is a correlation between the sentiment of a news article and the probabilistic outcome of the related event. That is, we can use the polarity of an article headline as an indication of the emergence or divergence of an event or idea.*

* There are factors including volume, time and sentiment that contribute to understanding the outcome of an event.
<br/><br/>
* This intel is low-resolution. That is, it provides the outline of an influencing factor, but does not colour in the detail.
<br/><br/>
* Further research could be conducted on medium to high resolution intelligence to develop a deeper understanding of the current energy landscape

## Future Improvements:
1. Some articles just list 'Missing' where the description should be; we should pre-filter those if we're going to use the description field for sentiment analysis.
2. We could consider layering a custom dictionary on top of NLTK Vader to provide a more detailed vocabulary.
3. We should spot-check 20 or so articles and their sentiment to identify any problematic instances.
4. We may even consider writing a custom sentiment analysis classifier using NLTK or SpaCy + Keras.
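
Improvements 1 and 3 could be sketched roughly as follows. This assumes articles are dictionaries with `title` and `description` keys, as in the News API sample earlier; the helper names and the sample records are hypothetical and only there to make the example runnable.

```python
import random


def has_description(article):
    """True when the description is present and is not the 'Missing' placeholder."""
    description = (article.get("description") or "").strip()
    return bool(description) and description.lower() != "missing"


def spot_check_sample(articles, k=20, seed=0):
    """Pick up to k usable articles at random for manual review."""
    usable = [a for a in articles if has_description(a)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(usable, min(k, len(usable)))


# Illustrative records only
articles = [
    {"title": "A", "description": "Oil higher as U.S. sanctions on Iran raise supply concerns"},
    {"title": "B", "description": "Missing"},
    {"title": "C", "description": None},
    {"title": "D", "description": "Oil dips as Sino-U.S. trade tensions deepen, new tariffs due"},
]

for article in spot_check_sample(articles, k=20):
    print(article["title"], "-", article["description"])
```

The same `has_description` check could be applied to the Elasticsearch hits before scoring, so placeholder descriptions never reach the analyzer.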