# ELM Proof of Concept #4 - Text Classification

* November 12th, 2018
* Ryan Kazmerik, Strategic EIM

## Hypothesis
We can teach a text classifier to identify articles that mention an increase or decrease in the price of oil. Further, we can use the classification results across all articles to assign a price factor to our news article topics.


### Research
**1. Supervised Learning**
* Human labeled examples used to train the model
* Hand engineered features (ex. sentence length)
* White box modelling

**2. Unsupervised Learning**
* Very large datasets
* Computer engineered features
* Black box modelling

**3. Semi-supervised Learning**
* Computer starts with a generic language model
* Human annotations used to guide the training process
* Gray box modelling

## Experiment

**To start, we need a dataset that will help us train our classifier. After a couple of hours exploring our articles, we observed the following pattern:**

1. Reuters most often uses a standardized pattern when reporting news about oil price movement:

> *"Oil falls 3 percent as equity markets drop, inventories climb."*

> *"Oil prices down 20 percent in a month as fundamentals weaken."*

> *"Oil higher as U.S. Iran sanctions raise supply concerns."*

> *"Oil prices fall more than 1 percent on rising supply, trade war."*

> *"Oil prices rise as Gulf platforms shut ahead of hurricane."*

**Let's exploit this observed pattern in the data to train our classifier.**

First we need to create a dataset that contains these price movement articles:

In [1]:
import json_lines

# Print the headline text of every record in the price-movement dataset
with open('../models/textcat/el_docs_price.jsonl', 'rb') as f:
    for item in json_lines.reader(f):
        print(item['text'])

Oil drops 2.5 percent as equity markets fall, inventories climb
Oil drops 2.5 percent as equity markets fall, inventories climb
Oil extends losses as markets fall, inventories climb
Oil falls 2.5 percent as equity markets drop, inventories climb
Oil falls 3 percent as equity markets drop, inventories climb
Oil prices hold ground, but set for 4 percent weekly fall
Oil prices rise, but still set for weekly fall amid equities rout
Oil prices climb as U.S. energy firms cut rigs, Iran sanctions loom
RPT-UPDATE 4-Oil jumps 2 pct as market tightens, more gains seen
U.S. gasoline prices at seasonal four-year high ahead of midterm elections
Oil prices fall amid supplied market, Iran sanction exemptions - Reuters
Oil prices down 20 percent in a month as fundamentals weaken
Brent crude at highest since October 2014, Iran sanctions drive buying
Brent crude oil dips on rising OPEC output; looming sanctions on Iran prevent bigger fall
UPDATE 6-Brent crude touches 2014 high ahead of Iran sanctions
Br

**We will be using a CNN (convolutional neural network) to learn the patterns in our headlines, but to start we can provide it with some seed examples of words we are particularly interested in.**

Here are the commonly observed words for headlines that mention price increase (PRICE_UP) and decrease (PRICE_DOWN):

In [10]:
# Print every seed pattern (label plus token pattern) from the patterns file
with open('../models/textcat/patterns.jsonl', 'rb') as f:
    for item in json_lines.reader(f):
        print(item)

{'label': 'PRICE_UP', 'pattern': [{'lemma': 'rise'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'gain'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'up'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'jump'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'surge'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'edge'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'claw back'}]}
{'label': 'PRICE_UP', 'pattern': [{'lemma': 'inch up'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'slip'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'retreat'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'drop'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'fall'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'decline'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'dip'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'down'}]}
{'label': 'PRICE_DOWN', 'pattern': [{'lemma': 'losses'}]}
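
In spaCy-style token patterns, each dictionary in the `pattern` list describes a single token, so multi-word entries such as `claw back` and `inch up` would likely need one dictionary per word in order to match. Similarly, `losses` is an inflected form whose lemma is normally `loss`. The sketch below shows one way such a file could be generated with that in mind. The helper names `to_pattern` and `build_patterns` and the shortened seed lists are illustrative assumptions, not the code used to build the original `patterns.jsonl`.

```python
import json

# Seed phrases per label, a shortened subset of the list above.
# Assumption: every entry is written in its lemma (base) form.
SEEDS = {
    "PRICE_UP": ["rise", "gain", "jump", "claw back", "inch up"],
    "PRICE_DOWN": ["slip", "drop", "fall", "loss"],
}


def to_pattern(label, phrase):
    """Build one pattern entry with one token dict per word in the phrase."""
    return {"label": label, "pattern": [{"lemma": word} for word in phrase.split()]}


def build_patterns(seeds):
    """Flatten the label -> phrases mapping into a list of pattern entries."""
    return [
        to_pattern(label, phrase)
        for label, phrases in seeds.items()
        for phrase in phrases
    ]


patterns = build_patterns(SEEDS)
lines = [json.dumps(p) for p in patterns]  # one JSON object per line (JSONL)
print("\n".join(lines))
```

Writing `lines` to a file, one entry per line, would give a patterns file in the same shape as the one printed above.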


**Now we can start training our model using Prodigy. This semi-supervised framework presents us with headlines it believes belong to PRICE_UP or PRICE_DOWN and asks us to accept or reject each prediction. We call each of these decisions an *annotation*.**

Let's start the Prodigy web server and observe the annotation process:

> <pre>pgy textcat.teach el_articles_price en_core_web_md models/textcat/el_docs_price.jsonl --label PRICE_UP,PRICE_DOWN --patterns models/textcat/patterns.jsonl</pre>

Below is a sample of the annotation process, which runs at http://localhost:8080

![Sample of the Prodigy annotation interface](../notebooks/img/prodigy-sample.png)


<br/><br/>
**Once we have provided enough annotations, we can batch train our model and examine the precision, recall, accuracy and F-score:**

><pre>pgy textcat.batch-train el_articles_price --output models/textcat --eval-split 0.2</pre>

## Results
<pre>
Loaded blank model
Using 20% of examples (24) for evaluation
Using 100% of remaining examples (96) for training
Dropout: 0.2  Batch size: 10  Iterations: 10

RUN        LOSS       F-SCORE    ACCURACY
01         11.339     0.640      0.500
02         7.651      0.286      0.444
03         7.107      0.556      0.556
04         5.799      0.556      0.556
05         6.299      0.375      0.444
06         5.041      0.500      0.556
07         5.649      0.737      0.722
08         6.255      0.737      0.722
09         5.041      0.737      0.722
10         5.596      0.737      0.722

MODEL      USER       COUNT
accept     accept     7
accept     reject     1
reject     reject     6
reject     accept     4

Correct    13
Incorrect  5

Baseline   0.61
Precision  0.87
Recall     0.64
F-score    0.74
Accuracy   0.72
</pre>
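
The summary metrics can be recomputed from the MODEL / USER table, which counts 18 decisions in total. The sketch below treats "accept / accept" as true positives and so on. The `baseline` line assumes the baseline is the majority-class rate of the user's decisions, which happens to agree with the reported 0.61; that interpretation is an assumption, not something stated in the output.

```python
# Counts copied from the MODEL / USER table above.
tp = 7  # model accept, user accept
fp = 1  # model accept, user reject
tn = 6  # model reject, user reject
fn = 4  # model reject, user accept


def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F-score, accuracy and a majority baseline."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    # Assumption: baseline = share of the more frequent user decision.
    baseline = max(tp + fn, fp + tn) / total
    return {
        "baseline": baseline,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }


metrics = classification_metrics(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name:<10} {value:.3f}")
```

The computed values agree with the reported figures to roughly two decimal places.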

## Observations

### 1. 72% accuracy for 138 headlines + 120 annotations - about 10 minutes to train.

This performance is somewhat surprising given the limited size of the dataset and the small number of annotations. Increasing the number of headlines used for initial training and providing more annotations could improve the model further.

Using the command below, we can generate a training curve to see how the model performs with different amounts of data. It outputs the accuracy score at each step along with the change in accuracy, which indicates whether more data could improve the model.

> <pre>pgy textcat.train-curve el_articles_price --n-samples 4 --eval-split 0.2</pre>

<pre>
Starting with blank model
Dropout: 0.2  Batch size: 10  Iterations: 5  Samples: 4

%          ACCURACY
25%        0.50       +0.50
50%        0.50       +0.00
75%        0.61       +0.11
100%       0.67       +0.06</pre>
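
The right-hand column of the curve is the change in accuracy from the previous sample size. A minimal sketch of that arithmetic, using the four rows above, is shown here. The function name `accuracy_gains` and the rule of thumb that a positive gain in the final step suggests more data may help are assumptions for illustration.

```python
# (percent of training data, accuracy) pairs copied from the output above.
curve = [(25, 0.50), (50, 0.50), (75, 0.61), (100, 0.67)]


def accuracy_gains(curve):
    """Return (percent, gain) pairs, where gain is the change from the prior row."""
    gains = []
    previous = 0.0  # the first row is compared against zero, as in the output
    for percent, accuracy in curve:
        gains.append((percent, round(accuracy - previous, 2)))
        previous = accuracy
    return gains


gains = accuracy_gains(curve)
for percent, gain in gains:
    print(f"{percent:>3}%  {gain:+.2f}")

# Assumed rule of thumb: accuracy still rising in the last step hints that
# additional annotations could improve the model further.
still_improving = gains[-1][1] > 0
print("More data likely to help:", still_improving)
```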

### 2. Low margin between PRICE_UP and PRICE_DOWN classes.

The margin between the two class scores is very small. This makes a certain amount of sense because the headlines share similar features, but it is not ideal because it makes the classifier more prone to misclassification.

In [1]:
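import spacy

# Load the text classifier produced by textcat.batch-train
nlp = spacy.load('../models/textcat')
print()

# Score a headline that reports a price increase
doc = nlp('Oil gains by 2 percent after two Mexico platforms were evacuated')
print(doc)
print(doc.cats, end="\n\n")

# Score a headline that reports a price decrease
doc = nlp('Oil falls 3% as Sino-U.S. trade tensions deepen, new tariffs due')
print(doc)
print(doc.cats)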


Oil gains by 2 percent after two Mexico platforms were evacuated
{'PRICE_DOWN': 0.12061911821365356, 'PRICE_UP': 0.16919022798538208}

Oil falls 3% as Sino-U.S. trade tensions deepen, new tariffs due
{'PRICE_DOWN': 0.5449771285057068, 'PRICE_UP': 0.5133009552955627}
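
To make the margin concrete, the sketch below computes the gap between the two scores for each headline above. The helper `score_margin` and the `MIN_MARGIN` threshold of 0.1 are hypothetical choices for illustration; the original experiment does not define a threshold.

```python
# Scores copied from the two doc.cats outputs above.
gains_headline = {'PRICE_DOWN': 0.12061911821365356, 'PRICE_UP': 0.16919022798538208}
falls_headline = {'PRICE_DOWN': 0.5449771285057068, 'PRICE_UP': 0.5133009552955627}

MIN_MARGIN = 0.1  # hypothetical cut-off below which a prediction is "uncertain"


def score_margin(cats):
    """Return the top label and its lead over the runner-up score."""
    ranked = sorted(cats.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_label, top_score - second_score


for cats in (gains_headline, falls_headline):
    label, margin = score_margin(cats)
    status = "uncertain" if margin < MIN_MARGIN else "confident"
    print(f"{label:<10} margin={margin:.3f} ({status})")
```

Both headlines receive the expected label, but with a margin under 0.05 in each case.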


### 3. Evaluation is based on training set - not best practice.

The same headlines that were used to train the model were also used to evaluate it.

This is not a best practice because the potential to over-fit the model is high. Over-fitting is when the model has learned the specific dataset well, but will not perform well on new headlines it has not seen yet. That is, it learns the data, not the patterns.
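
One way to reduce this risk is to set aside a group of headlines before any training or annotation takes place. The sketch below also removes verbatim duplicates first, since the dataset printed earlier contains repeated headlines that could otherwise land on both sides of the split. The function name `split_headlines`, the fixed seed and the 20% fraction (chosen to mirror `--eval-split 0.2`) are assumptions, not part of the original experiment.

```python
import random


def split_headlines(headlines, eval_fraction=0.2, seed=0):
    """Split unique headlines into (train, evaluation) lists with no overlap."""
    unique = sorted(set(headlines))  # drop verbatim duplicates
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_eval = max(1, int(len(unique) * eval_fraction))
    return unique[n_eval:], unique[:n_eval]


# A few headlines from the dataset above, including one exact duplicate.
sample = [
    "Oil drops 2.5 percent as equity markets fall, inventories climb",
    "Oil drops 2.5 percent as equity markets fall, inventories climb",
    "Oil extends losses as markets fall, inventories climb",
    "Oil falls 3 percent as equity markets drop, inventories climb",
    "Oil prices hold ground, but set for 4 percent weekly fall",
    "Oil prices down 20 percent in a month as fundamentals weaken",
]

train, evaluation = split_headlines(sample)
print(len(train), "training headlines,", len(evaluation), "held out")
```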

## Conclusion

* Semi-supervised learning could provide a viable text classifier, but we would need a bigger dataset to further improve and evaluate the model. A target of ~1000 training articles and ~500 annotations would be reasonable and is supported by the Prodigy documentation.

<br/>

* In order to apply this method to our entire dataset of articles, we would need to design a generalization (because not every article would be classified).

> ex. classify 1000 articles and create a composite score for each topic. Then generalize that each article within that topic receives the same classification.
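
A minimal sketch of such a composite score follows. It assumes the score for a topic is the mean of `PRICE_UP` minus `PRICE_DOWN` across that topic's classified articles, and that topics with too few articles are skipped. The topic names, the score values and the `min_articles` parameter are made up for illustration and are not results from this experiment.

```python
from collections import defaultdict


def topic_price_factors(scored_articles, min_articles=2):
    """Average (PRICE_UP - PRICE_DOWN) per topic, skipping sparse topics."""
    diffs = defaultdict(list)
    for topic, cats in scored_articles:
        diffs[topic].append(cats["PRICE_UP"] - cats["PRICE_DOWN"])
    return {
        topic: sum(values) / len(values)
        for topic, values in diffs.items()
        if len(values) >= min_articles
    }


# Hypothetical (topic, classifier scores) pairs, for illustration only.
scored = [
    ("sanctions", {"PRICE_UP": 0.8, "PRICE_DOWN": 0.1}),
    ("sanctions", {"PRICE_UP": 0.6, "PRICE_DOWN": 0.3}),
    ("inventories", {"PRICE_UP": 0.2, "PRICE_DOWN": 0.7}),
    ("inventories", {"PRICE_UP": 0.1, "PRICE_DOWN": 0.6}),
    ("weather", {"PRICE_UP": 0.5, "PRICE_DOWN": 0.4}),
]

factors = topic_price_factors(scored)
for topic, factor in factors.items():
    print(f"{topic:<12} {factor:+.2f}")
```

Every article within a topic would then inherit that topic's factor, with a positive value read as upward price pressure and a negative value as downward pressure.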

<br/>

* We could use this approach to build a similar classifier for supply increase / decrease.

<br/>

* This approach is only effective at a macro level, as there needs to be a relatively high number of articles about a topic to produce a reliable aggregate classification score.

## Further Improvements
1. We should extend the historical range of the dataset to provide more than 138 articles for training and evaluation.
2. We could expand the list of keywords in the patterns list for each label.
3. We could use NER to extract any mentioned numerics (ex. percentages) to assign a magnitude.
4. We could experiment with a different approach to semi-supervised learning:

> ex. Build a classifier to identify articles that mention price movement (would need to use different pattern keywords) and then process those articles with NER to extract numerics and a pattern matcher to assign price increase or decrease.

5. We could experiment with other approaches to building a classifier (ex. historical performance)
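
As a rough illustration of improvement 3, the sketch below pulls percentage figures out of headlines. It uses a regular expression as a lightweight stand-in for NER, so it is a different technique from the one proposed above. The pattern, the name `extract_percentages` and the set of recognized unit spellings (`%`, `percent`, `pct`) are assumptions based on the headlines shown earlier.

```python
import re

# A number (optionally with decimals) followed by %, "percent" or "pct".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|percent\b|pct\b)", re.IGNORECASE)


def extract_percentages(headline):
    """Return every percentage figure mentioned in the headline as a float."""
    return [float(value) for value in PERCENT_RE.findall(headline)]


headlines = [
    "Oil drops 2.5 percent as equity markets fall, inventories climb",
    "RPT-UPDATE 4-Oil jumps 2 pct as market tightens, more gains seen",
    "Oil falls 3% as Sino-U.S. trade tensions deepen, new tariffs due",
    "Brent crude at highest since October 2014, Iran sanctions drive buying",
]

for headline in headlines:
    print(extract_percentages(headline), headline)
```

Combined with the predicted label, the extracted figure could serve as a signed magnitude, for example treating 3 percent under PRICE_DOWN as -3.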