# ELM - Dataset Exploration
* November 26, 2018
* Ryan Kazmerik, Strategic EIM

**The purpose of this notebook is to explore the small and large News-API datasets we have available for our project. Both datasets contain news articles related to the energy industry from 40 different news sources over the past year.**

## 1. Size (Number of Articles)

The small dataset was collected using our free News-API account, which only returns articles up to 30 days old. The large dataset was collected using our paid News-API subscription, which returns articles up to 12 months old.

Using the same base query, the paid subscription therefore increases our dataset size roughly 3.5-fold (19,584 vs. 67,689 articles).

In [85]:
from elasticsearch import Elasticsearch
es_sml = Elasticsearch(['http://elastic:elastic@es.encana.com:9200'])
es_lrg = Elasticsearch()

docs_1 = es_sml.search(index='articles')
docs_2 = es_lrg.search(index='articles')

print()
print("SMALL DATASET:",docs_1['hits']['total'])
print("LARGE DATASET:",docs_2['hits']['total'])


SMALL DATASET: 19584
LARGE DATASET: 67689


> **News API query: Item must mention the word 'energy'** <br/><br/>
> 4 months = ~20,000 articles <br/>
> 12 months = ~80,000 articles

## 2. History (Published Date)

We started collecting articles for the small dataset in August 2018, and due to the 30-day historical limit, we could only gather articles for August, September, October and November.

The large dataset contains articles from January 2018 to present day.

In [86]:
import json

body = {
  "size":0,
    "aggs" : {
        "quarters" : {
            "range" : {
                "field" : "publishedAt",
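                "ranges" : [
                    # NOTE: "to" is exclusive in a range aggregation, so articles published
                    # on the last day of each quarter fall outside every bucket below.
                    { "key":"2018 Q1", "from" : "2018-01-01", "to" : "2018-03-31" },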
                    { "key":"2018 Q2", "from" : "2018-04-01", "to" : "2018-06-30" },
                    { "key":"2018 Q3", "from" : "2018-07-01", "to" : "2018-09-30" },
                    { "key":"2018 Q4", "from" : "2018-10-01", "to" : "2018-12-31" }
                ],
                 "format": "yyyy-MM-dd"
            }
        }
    }
}

docs_1 = es_sml.search(index='articles', body=body, filter_path='aggregations')
docs_2 = es_lrg.search(index='articles', body=body, filter_path='aggregations')

print()
print("SMALL DATASET:")
for bucket in docs_1['aggregations']['quarters']['buckets']:
    print(" >",bucket['key'],":",bucket['doc_count'])
print()
print("LARGE DATASET:")
for bucket in docs_2['aggregations']['quarters']['buckets']:
    print(" >",bucket['key'],":",bucket['doc_count'])



SMALL DATASET:
 > 2018 Q1 : 0
 > 2018 Q2 : 0
 > 2018 Q3 : 8700
 > 2018 Q4 : 10772

LARGE DATASET:
 > 2018 Q1 : 18140
 > 2018 Q2 : 18441
 > 2018 Q3 : 17667
 > 2018 Q4 : 13182
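
The quarterly counts above sum to 19,472 (small) and 67,430 (large), slightly below the totals of 19,584 and 67,689 reported in Section 1. One possible explanation, not verified against the index, is that Elasticsearch range aggregations exclude the `to` value, so articles published on the last day of a quarter are skipped. A minimal sketch of building ranges whose upper bound is the first day of the next quarter follows; `quarter_ranges` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

def quarter_ranges(year):
    """Build range buckets whose exclusive "to" is the first day of the next quarter."""
    ranges = []
    for q in range(4):
        start = date(year, 3 * q + 1, 1)
        # The quarter after Q4 starts on January 1 of the following year.
        end = date(year + 1, 1, 1) if q == 3 else date(year, 3 * q + 4, 1)
        ranges.append({
            "key": f"{year} Q{q + 1}",
            "from": start.isoformat(),
            "to": end.isoformat(),
        })
    return ranges

for r in quarter_ranges(2018):
    print(r)
```

The resulting list could replace the hand-written `ranges` entry in the query body, keeping the same `key` labels.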


## 3. Content

Both the small and large datasets use the same base query, and pull articles from the same news sources. We've grouped these sources into categories to make organizing the articles a bit easier.

News-API crawls 138 major news sources, but we query only ~40 of them, because content from some sources may not be relevant to our use case (e.g. NFL News).

In [38]:
body = {
  "size":0,
    "aggs" : {
        "sources" : {
            "terms" : { "field" : "source.name.keyword" }
        }
    }
}

docs_1 = es_sml.search(index='articles', body=body, filter_path='aggregations')

print()
print("TOP 10 SOURCES:")
for bucket in docs_1['aggregations']['sources']['buckets']:
    print(" >",bucket['key'],":",bucket['doc_count'])


TOP 10 SOURCES:
 > Reuters : 4020
 > Independent : 2284
 > Associated Press : 1708
 > The Guardian (AU) : 1480
 > USA Today : 788
 > Financial Times : 704
 > ABC News : 703
 > Vice News : 633
 > The Washington Post : 613
 > The Wall Street Journal : 599
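
To judge how concentrated the small dataset is, the bucket counts can be converted into a share of all articles. The sketch below assumes the bucket shape returned by the terms aggregation (`key` and `doc_count`) and uses the small-dataset total from Section 1; `source_shares` is a hypothetical helper. Note that a terms aggregation returns only 10 buckets unless a larger `size` is requested.

```python
def source_shares(buckets, total):
    """Return (source, percentage of all articles) pairs, rounded to one decimal."""
    return [(b["key"], round(100 * b["doc_count"] / total, 1)) for b in buckets]

# First three buckets and the small-dataset total, copied from the outputs above.
buckets = [
    {"key": "Reuters", "doc_count": 4020},
    {"key": "Independent", "doc_count": 2284},
    {"key": "Associated Press", "doc_count": 1708},
]

shares = source_shares(buckets, total=19584)
for name, pct in shares:
    print(f" > {name} : {pct}%")
```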


One major content difference between the small and large datasets is that the **full article content** is only available in the large dataset. We can therefore run some of our PoCs against the full text of each article instead of just the headline.

In [69]:
body = {
  "query": {
    "terms": {
      "_id": [ "https://www.wsj.com/articles/trump-asks-saudi-arabia-to-boost-oil-production-1530360926"] 
    }
  }
}

docs_1 = es_lrg.search(index='articles', body=body, filter_path='hits')

title = docs_1['hits']['hits'][0]['_source']['title']
content = docs_1['hits']['hits'][0]['_source']['content']

print("LARGE DATASET:", end="\n\n")
print("TITLE:", title, end="\n\n")
print("CONTENT:", content)

LARGE DATASET:

TITLE: Trump Asks Saudi Arabia to Increase Oil Production

CONTENT: U.S. President Donald Trump on Saturday said he asked Saudi Arabia to significantly boost its oil production to bring down crude prices. “I am asking that Saudi Arabia increase oil production, maybe up to 2,000,000 barrels,” Mr. Trump said in a tweet, citing a conversation with Saudi King Salman bin Abdulaziz. “Prices to high! He has agreed!” the tweet said, citing “turmoil &amp; disfunction” in Iran and Venezuela. It wasn’t clear whether Mr. Trump was saying the king agreed that prices were too high or that the kingdom would increase oil output. In an official statement posted on the state-run Saudi Press Agency, the kingdom said King Salman spoke to Mr. Trump, but gave no mention of the 2 million barrels of extra production the American leader tweeted about. “During the call, the two leaders stressed the need to make efforts to maintain the stability of oil markets and the growth of the global economy

## 4. Impacts

**As we can see below, our PoC#3 (Named Entity Recognition) now returns many more entities for the large dataset, because it has more text per article to analyze:**

In [64]:
body = {
  "query": {
    "terms": {
      "_id": [ "https://www.wsj.com/articles/trump-asks-saudi-arabia-to-boost-oil-production-1530360926"] 
    }
  }
}

docs_1 = es_lrg.search(index='articles', body=body, filter_path='hits')
ents = docs_1['hits']['hits'][0]['_source']['entities']

print("LARGE DATASET:")
print("SINGLE ARTICLE ENTITIES =",len(ents))
print(json.dumps(ents, indent=4))

LARGE DATASET:
SINGLE ARTICLE ENTITIES = 76
[
    "U.S.(GPE)",
    "Donald Trump(PERSON)",
    "Saturday(DATE)",
    "Saudi Arabia(GPE)",
    "Saudi Arabia(GPE)",
    "up to 2,000,000 barrels(QUANTITY)",
    "Trump(PERSON)",
    "Saudi(NORP)",
    "Salman bin Abdulaziz(PERSON)",
    "Iran(GPE)",
    "Venezuela(GPE)",
    "Trump(PERSON)",
    "Saudi Press Agency(ORG)",
    "Salman(PERSON)",
    "Trump(PERSON)",
    "the 2 million barrels(QUANTITY)",
    "American(NORP)",
    "two(CARDINAL)",
    "Saudi Arabia(GPE)",
    "Trump(PERSON)",
    "Saudi Arabia(GPE)",
    "90 days(DATE)",
    "Saudi(NORP)",
    "Aramco(GPE)",
    "Saudi Arabian Oil Co.(ORG)",
    "Saudi Arabia(GPE)",
    "last week(DATE)",
    "the Organization of the Petroleum Exporting Countries(ORG)",
    "OPEC(ORG)",
    "Russia(GPE)",
    "Friday(DATE)",
    "multiyear(DATE)",
    "August(DATE)",
    "1%(PERCENT)",
    "74.15(MONEY)",
    "the New York Mercantile Exchange(ORG)",
    "November 2014(DATE)",
    "Brent(PERSO
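
Many entities in the list above repeat (for example `Saudi Arabia(GPE)`), so a per-entity and per-label count may be more informative than the raw length. The sketch below assumes every entry follows the `text(LABEL)` format seen in the output; `count_entities` is a hypothetical helper, and the sample is a small subset of the entities shown above.

```python
import re
from collections import Counter

# Matches strings such as "Donald Trump(PERSON)": free text, then a label in parentheses.
ENTITY_PATTERN = re.compile(r"^(?P<text>.*)\((?P<label>[A-Z_]+)\)$")

def count_entities(ents):
    """Count distinct (text, label) pairs and labels in 'text(LABEL)' strings."""
    pairs, labels = Counter(), Counter()
    for ent in ents:
        match = ENTITY_PATTERN.match(ent)
        if match is None:
            continue  # skip strings that do not follow the 'text(LABEL)' format
        pairs[(match["text"], match["label"])] += 1
        labels[match["label"]] += 1
    return pairs, labels

sample = [
    "U.S.(GPE)",
    "Donald Trump(PERSON)",
    "Saudi Arabia(GPE)",
    "Saudi Arabia(GPE)",
    "up to 2,000,000 barrels(QUANTITY)",
    "Trump(PERSON)",
]

pairs, labels = count_entities(sample)
print(pairs.most_common(3))
print(labels.most_common())
```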

**The training curve from PoC#4 (Text Classification) indicated that more data could improve the accuracy of our price movement classifier.**

Let's re-run that experiment with the large dataset and see if the results exceed our benchmark of 72%. We will use the command below to start a new annotation session, this time with ~450 articles:

> <pre>pgy textcat.teach el_articles_price_lrg en_core_web_md models/textcat_v2/el_docs_price_lrg.jsonl --label PRICE_UP,PRICE_DOWN --patterns models/textcat/patterns.jsonl</pre>

This time we were able to generate **282 annotations** to help train our model. We can now run the batch-train command to see how our model performs:

> <pre>pgy textcat.batch-train el_articles_price_lrg --output models/textcat_v2 --eval-split 0.2</pre>

<pre>
Loaded blank model
Using 20% of examples (56) for evaluation
Using 100% of remaining examples (226) for training
Dropout: 0.2  Batch size: 10  Iterations: 10

RUN          LOSS       F-SCORE    ACCURACY
01         17.993     0.605      0.553
02         16.251     0.778      0.789
03         12.758     0.700      0.684
04         12.807     0.714      0.684
05         12.773     0.718      0.711
06         14.068     0.773      0.737
07         13.206     0.780      0.763
08         12.853     0.800      0.789
09         11.296     0.829      0.816
10         10.781     0.850      0.842

MODEL      USER       COUNT
accept     accept     17
accept     reject     4
reject     reject     15
reject     accept     2

Correct    32
Incorrect  6

Baseline   0.50
Precision  0.81
Recall     0.89
F-score    0.85
Accuracy   0.84

</pre>
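
The summary metrics can be reproduced from the MODEL / USER table if "accept" is treated as the positive class and the USER column as ground truth. That reading is an assumption about the tool's output, but it matches the reported values, and the resulting accuracy of 0.84 is above our 72% benchmark.

```python
# Counts copied from the MODEL / USER table above.
tp = 17  # model accept, user accept
fp = 4   # model accept, user reject
tn = 15  # model reject, user reject
fn = 2   # model reject, user accept

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(f"Precision  {precision:.2f}")
print(f"Recall     {recall:.2f}")
print(f"F-score    {f_score:.2f}")
print(f"Accuracy   {accuracy:.2f}")
```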

## 5. Further Improvements

### 1. Are we looking at the right sources? 

In [88]:
news = {'business':'bloomberg, business-insider, fortune, msnbc, the-wall-street-journal',
    'financial':'cnbc, financial-post, financial-times, the-economist',
    'google':'google-news, google-news-au, google-news-ca, google-news-uk',
    'world':'al-jazeera-english, bbc-news, the-guardian-uk',
    'paper':'the-globe-and-mail, the-new-york-times, the-washington-post',
    'market':'reuters',
    'opinion':'breitbart-news, the-huffington-post, independent, national-review',
    'network':'abc-news, cbs-news,cnn, fox-news, nbc-news, usa-today',
    'other':'associated-press, metro, newsweek, politico, the-washington-times',
    'tech':'engadget, hacker-news, mashable, techcrunch, the-verge, wired'}

print()
for category,sources in news.items():
    print(category.upper())
    print(">",sources, end="\n\n")


BUSINESS
> bloomberg, business-insider, fortune, msnbc, the-wall-street-journal

FINANCIAL
> cnbc, financial-post, financial-times, the-economist

GOOGLE
> google-news, google-news-au, google-news-ca, google-news-uk

WORLD
> al-jazeera-english, bbc-news, the-guardian-uk

PAPER
> the-globe-and-mail, the-new-york-times, the-washington-post

MARKET
> reuters

OPINION
> breitbart-news, the-huffington-post, independent, national-review

NETWORK
> abc-news, cbs-news,cnn, fox-news, nbc-news, usa-today

OTHER
> associated-press, metro, newsweek, politico, the-washington-times

TECH
> engadget, hacker-news, mashable, techcrunch, the-verge, wired
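
When reviewing the source list it helps to have the categories flattened into one de-duplicated list, both to count the sources and to build a single comma-separated string of source ids. The sketch below copies the `news` dictionary from the cell above; `flatten_sources` is a hypothetical helper, and the assumption that the ids are passed to News-API as one comma-separated value is ours.

```python
news = {'business':'bloomberg, business-insider, fortune, msnbc, the-wall-street-journal',
    'financial':'cnbc, financial-post, financial-times, the-economist',
    'google':'google-news, google-news-au, google-news-ca, google-news-uk',
    'world':'al-jazeera-english, bbc-news, the-guardian-uk',
    'paper':'the-globe-and-mail, the-new-york-times, the-washington-post',
    'market':'reuters',
    'opinion':'breitbart-news, the-huffington-post, independent, national-review',
    'network':'abc-news, cbs-news,cnn, fox-news, nbc-news, usa-today',
    'other':'associated-press, metro, newsweek, politico, the-washington-times',
    'tech':'engadget, hacker-news, mashable, techcrunch, the-verge, wired'}

def flatten_sources(categories):
    """Split each comma-separated string into a sorted list of unique source ids."""
    ids = set()
    for sources in categories.values():
        # strip() tolerates entries with or without a space after the comma
        ids.update(s.strip() for s in sources.split(",") if s.strip())
    return sorted(ids)

all_sources = flatten_sources(news)
print(len(all_sources), "sources")
print(",".join(all_sources))
```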



### 2. Should we modify our base News-API query?

**Currently the query asks for all news items that mention the word 'energy' anywhere in the article title, description or content.**

Most of the articles are relevant to our interest, but we do get some unrelated, odd results:

> “Surge tents” were set up outside the emergency room of a hospital in Pennsylvania to manage the overflow caused largely by this year’s flu season.

> Fall is right around the corner and it’s time for one of my favorite events of the year, Mickey’s Not-So-Scary Halloween Party. The Magic Kingdom is transformed with glowing jack o lanterns, family-friendly frights and grinning ghosts.

> Ask the men he coaches what makes Pittsburgh Steelers offensive coordinator Randy Fichtner tick, and you get a variation of the same response.

Some initial ideas:
* Specify a larger list of more specific keywords.
* Filter out results that cause noise.
* Don't apply a filter at all; the unfiltered articles could be handy for future use cases.
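
The first two ideas could be prototyped locally before changing the News-API query. The sketch below keeps an article only if it mentions at least one keyword as a whole word; the keyword list and the `is_relevant` helper are hypothetical and chosen purely for illustration, and the test strings are shortened excerpts of the articles quoted in this notebook.

```python
import re

# Hypothetical keyword list for illustration only; a real list would need review.
KEYWORDS = ["oil", "gas", "crude", "opec", "pipeline", "barrel", "barrels"]

# \b enforces whole-word matches, so "oil" does not match inside "turmoil".
KEYWORD_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b",
    re.IGNORECASE,
)

def is_relevant(text):
    """Return True if the text mentions at least one keyword as a whole word."""
    return KEYWORD_PATTERN.search(text) is not None

articles = [
    "Trump said he asked Saudi Arabia to significantly boost its oil production to bring down crude prices.",
    "Surge tents were set up outside the emergency room of a hospital in Pennsylvania.",
    "The Magic Kingdom is transformed with glowing jack o lanterns and grinning ghosts.",
    "Ask the men he coaches what makes Pittsburgh Steelers offensive coordinator Randy Fichtner tick.",
]

kept = [a for a in articles if is_relevant(a)]
print(len(kept), "of", len(articles), "articles kept")
```

A whole-word match is used because plain substring matching would let short keywords match inside unrelated words.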