# <center>RNN/LSTM-based Sentiment Analysis</center>

<center>MIDS W266 Natural Language Processing Final Project</center>

<center>Alan Wang, yjawang@gmail.com</center>

<center>December 18, 2016</center>


## <center>Abstract</center>

This project proposes and explores several RNN/LSTM-based models for sentiment analysis of financial news articles. Since the main dataset, the New York Times Annotated Corpus (LDC2008T19), lacks polarity annotation, a dated stock trend is tested as an indirect indicator of the running sentiment of the articles and sentences involved. In the first model, trained on 130K sentences that each mention an SP500 company and are tagged with the corresponding stock trend, every word in a sentence predicts the trend. In the second model, the trend is mixed with the vader lexicon. Finally, categorized LMs that share a common vocabulary are proposed as an alternative method for sentiment analysis or other classification tasks.

## <center>Goals</center>

My interest in sentiment analysis of financial news articles originally came from wanting to automatically read, analyze, and archive financial articles and cross-reference them against actual market performance, in order to hold financial analysts responsible for their market predictions. As the W266 coursework and schedule panned out, the focus of this project narrowed to exploring sentiment analysis under the following practical constraints:

1.	Staying with the [NYT](https://catalog.ldc.upenn.edu/ldc2008t19) corpus.
2.	Staying with RNN/LSTM (i.e. Assignment 1, Part 2).
3.	Starting with the SP500 trend (stock price delta) as the sentiment indicator.

## <center>Dataset Cleansing and Preprocessing</center>

### Highlights of the New York Times Annotated Corpus

- 01/01/1987 to 06/18/2007 (over 20 years)

- 1.8M overall articles in xml format by yy/mm/dd

- 1.5M articles manually tagged w/ organizations, etc.

- 275K of them algorithmically tagged and hand-verified

The documentation for this corpus (LDC2008T19) is attached [here](reference/new_york_times_annotated_corpus.pdf).

### Data Cleansing and Preprocessing

![Fig 1](./data processing.png) 
This figure gives a high-level account of the data cleansing and preprocessing:

- Using the NYT corpus's existing 'org' (organization) annotation, which lists all organizations mentioned in an article, together with the SP500 company names, the 1.8M articles are filtered down to 400K articles and then to 130K sentences that each mention one of the SP500 companies.

- Using the daily stock prices of the SP500 companies over these 20 years (from kibot.com), the polarity of each sentence is defined as 'pos' or 'neg' if the weekly average price of the company mentioned is, respectively, higher or lower than that of the week before, relative to the publication date. (Note: a sentence may mention two companies, in which case the same sentence enters the system twice, each time with a potentially different polarity.)


- At one point the polarity was measured by weekly return (a ratio between -0.5 and 0.5), which can be broken down into 15 or 7 levels. This approach was abandoned for lack of time.
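The 'pos'/'neg' labeling rule can be illustrated with a minimal sketch. It is not the project's actual preprocessing code: the function name `weekly_trend_label`, the `closes` dictionary, and the reading of "week" as seven calendar days on either side of the publication date are all assumptions made for illustration, as is returning '0' when prices are missing or unchanged.

```python
from datetime import date, timedelta
from statistics import mean


def weekly_trend_label(closes, pub_date, window_days=7):
    """Return 'p', 'n', or '0' for one (company, publication date) pair.

    closes: dict mapping datetime.date -> closing price (trading days only).
    Assumption: "the week" is the `window_days` calendar days starting on the
    publication date; "the week before" is the `window_days` days before it.
    """
    def average(start):
        days = (start + timedelta(days=i) for i in range(window_days))
        prices = [closes[d] for d in days if d in closes]
        return mean(prices) if prices else None

    current = average(pub_date)
    previous = average(pub_date - timedelta(days=window_days))
    if current is None or previous is None or current == previous:
        return "0"  # assumption: missing data or no change stays unlabeled
    return "p" if current > previous else "n"


# Toy prices: flat at 10.0 for one week, then 11.0 for the next week.
closes = {date(2007, 1, 1) + timedelta(days=i): (10.0 if i < 7 else 11.0)
          for i in range(14)}
label = weekly_trend_label(closes, date(2007, 1, 8))
print(label)
```

A sentence that mentions two companies would simply be passed through such a function twice, once with each company's price series.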

## <center>Background & Challenges</center>

There have been many approaches to sentiment analysis of financial articles, which differs noticeably from general opinion-driven tasks such as movie reviews, Yelp reviews, or tweets. One major problem is lexicon variance. The figure below shows how our collection, when run through the nltk.vader analyzer, ends up with over 95% neutral predictions (above 70% probability).


![nltk.vader to NYT articles](./vader.png)

#### Note:  Report convention

I tried three approaches. Although all implementations are based on RNN/LSTM (Assignment 1, Part 2), I tried to separate them into different code bases. This is not only because they differ in various details, but also because the implementation focus kept shifting.

However, since all the models share the same underlying code, I give a high-level description here in the main report and provide links to the individual IPython notebooks for comments on implementation details, including test results. I also add a 'future work' note within each model so that the Conclusions section is 'purely' conclusions. :)

## <center>Model 1 - Word level Sentiments Predicting RNN/LSTM</center>

The first model simply reapplies the LM RNN/LSTM so that each word in a sentence predicts the associated polarity ('pos' or 'neg') instead of the next word.

![layers](RNNLM - layers.png)

Aside from the fact that sentiment analysis might not intuitively be the best fit for this model, the adaptation from LM to sentiment prediction looks simple on the plot. However, a few details need to be noted:

- The biggest structural delta is the output layer size, which 'shrinks' from V (= 10000) to Z (= 3 for 'pos', 'neg' and 'neu').
- The LM does not need an explicitly provided target (it is always the next word in the sequence). For sentiment prediction, however, we need to preprocess the data and align the y with each input word.
- nltk probably has tools to tag the source more elegantly, but I managed to make the system work according to plan.
- After training, I had some concerns about how the samples are scored. In the future I may want to come up with a different loss (averaging or voting) over all words in the same sentence.
- I realized that for this model a train-test partition alone may not be sufficient. I introduced a dev set, which has to be held out until after training so that the system can be measured 'objectively'.
- Here is the counter for the sentiment labels of these 130,051 sentences: {'0': 333, 'n': 62551, 'p': 67167}
- However, I ended up with an 'accuracy' of 49%, almost a random coin flip.
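The target alignment and the voting idea mentioned in the list can be sketched as below. This is an illustrative sketch rather than the notebook's code: `word_level_targets` and `vote` are hypothetical helper names, "acme" is a made-up token, and majority voting is only one of the sentence-level scoring options floated above as future work.

```python
from collections import Counter


def word_level_targets(tokens, sentence_label):
    """Pair every token with the sentence's label, so each RNN step has a target."""
    return [(tok, sentence_label) for tok in tokens]


def vote(per_word_predictions):
    """Collapse per-word predictions into one sentence-level label by majority vote."""
    return Counter(per_word_predictions).most_common(1)[0][0]


# Every word in the sentence carries the same target label.
pairs = word_level_targets(["acme", "shares", "rose"], "p")

# Hypothetical per-word model outputs for one sentence.
sentence_label = vote(["n", "p", "p"])
print(pairs, sentence_label)
```

Scoring at the sentence level in this way would avoid counting each word of a long sentence as a separate hit or miss.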

### [Model 1 ipython notebook](./models/rnn nyt sp500/engine.ipynb)


## <center>Model 2 - Stock Trend Mixed with Vader Lexicon</center>

The only difference between Model 1 and Model 2 is the way 'sentiments' are labeled. In the spirit of trying anything that might make a difference, I made the following labeling changes:

- I redefined the ‘vader’ polarity: a sentence is ‘p’ if prob(vader-pos) > prob(vader-neg), and ‘n’ if prob(vader-pos) < prob(vader-neg). Otherwise it is ‘0’, which means 'neu'.
- I then label a sentence ‘p’ only if it is ‘p’ from both the stock trend and vader, and ‘n’ only if it is ‘n’ from both. Otherwise, it is ‘0’.
- After the redefinitions, I ended up with a counter {'0': 83584, 'n': 14869, 'p': 31598}, which looks more 'realistic': '0' now dominates the count.
- However, all in all, the accuracy is ~50%. Again, a coin flip.
![layers](RNNLM - layers-1.png)
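The relabeling rules for Model 2 can be written down as a short sketch. The helper names `vader_label` and `combined_label` are hypothetical, and the score dictionaries are hard-coded so the example runs without nltk; they only mimic the 'pos'/'neg'/'neu'/'compound' keys that nltk's VADER `polarity_scores` returns.

```python
def vader_label(scores):
    """Map VADER-style scores to 'p', 'n', or '0' by comparing 'pos' and 'neg'."""
    if scores["pos"] > scores["neg"]:
        return "p"
    if scores["pos"] < scores["neg"]:
        return "n"
    return "0"


def combined_label(trend, scores):
    """Keep a polarity only when the stock trend and the lexicon agree."""
    lexicon = vader_label(scores)
    return trend if trend == lexicon and trend in ("p", "n") else "0"


# Hand-written scores standing in for real analyzer output.
upbeat = {"neg": 0.1, "neu": 0.6, "pos": 0.3, "compound": 0.4}
flat = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

print(combined_label("p", upbeat))  # trend and lexicon agree
print(combined_label("n", upbeat))  # disagreement falls back to '0'
print(combined_label("p", flat))    # neutral lexicon falls back to '0'
```

Because a label survives only when both signals agree, '0' naturally becomes the largest class, which matches the counter reported above.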

### [Model 2 ipython notebook](./models/rnn nyt sp500 plus/engine.ipynb)


## <center>Model 3 - Categories LMs for NLP Classification</center>

Perhaps because I had been trying for a long time to apply an RNN/LSTM (originally built for LM) to sentiment analysis and classification, I started to ponder whether the LM itself could help in any way. It occurred to me that I could set up two 'sub-languages', one for 'pos' and the other for 'neg'. When a new sentence comes into the system, the difference between the two LM scores predicts which group it should belong to. In fact, this seems applicable to classification in general, at least at a conceptual level. After all, given a sentence, a well-trained LM can assess how well the sentence 'fits' that language, and this should still hold when two sub-languages are trained independently.

One thing to note is that this scheme can work only if every sub-language uses the vocabulary of the _union_ of all the sub-languages, even though the sub-languages are trained independently, so that the post-training score comparison is made on the same basis.

![LM for Sentiment Analysis](lm.png)
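The scoring scheme can be illustrated with a deliberately simplified sketch. It swaps the RNN/LSTM LM for an add-one smoothed unigram LM so that the example stays short and runnable; the class `UnigramLM`, the function `classify`, and the toy corpus are all hypothetical. What it is meant to show is the shared union vocabulary and the comparison of per-category log-probabilities.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram LM; a stand-in for the RNN/LSTM LM in the report."""

    def __init__(self, sentences, vocab):
        self.vocab = vocab  # the union vocabulary shared by all sub-languages
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())

    def log_prob(self, sentence):
        # One extra slot is reserved for tokens outside the shared vocabulary.
        size = len(self.vocab) + 1
        score = 0.0
        for tok in sentence:
            count = self.counts[tok] if tok in self.vocab else 0
            score += math.log((count + 1) / (self.total + size))
        return score


def classify(sentence, models):
    """Return the category whose LM gives the sentence the highest log-probability."""
    return max(models, key=lambda name: models[name].log_prob(sentence))


corpus = {
    "pos": [["profits", "rose", "sharply"], ["shares", "rose"]],
    "neg": [["profits", "fell", "sharply"], ["shares", "fell"]],
}
# Build the vocabulary from the union of all categories before training any LM.
vocab = {tok for sents in corpus.values() for sent in sents for tok in sent}
models = {name: UnigramLM(sents, vocab) for name, sents in corpus.items()}

prediction = classify(["shares", "rose", "again"], models)
print(prediction)
```

Since both models normalize over the same vocabulary, their log-probabilities for a given sentence are directly comparable, which is the point of training on the union vocabulary.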

The following IPython notebook implements a general-purpose classifier in which the corpus is read in as categories (one file per category, with the filename as the category name). The nltk package provides a nice corpus reader that handles the categories automatically. However, before training a particular category ('pos' or 'neg') as a 'sub-language', we again need to separate the data into three sets (train, test, and dev) so that we get objective post-training scores.

Because of time constraints, I used only 10K sentences each from the positive and negative groups. The result for this model is:
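A per-category split could look like the sketch below. The function name `split_category`, the 10% dev and test fractions, and the fixed seed are assumptions for illustration; the report does not state the proportions actually used.

```python
import random


def split_category(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle one category's sentences and cut them into train/dev/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Placeholder sentences standing in for one category ('pos' or 'neg').
train, dev, test = split_category([f"sentence {i}" for i in range(100)])
print(len(train), len(dev), len(test))
```

Each category would be split separately, so that the dev sentences of one sub-language never take part in training either LM.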

tp = 757, fp = 564, tn = 670, fn = 664, nopt = 0, nont = 0

precision = 57.31, recall = 53.27, accuracy = 53.75, true_neg_rate = 54.29, f-measure = 55.22

This is not that impressive, but it is already somewhat better than Model 1 or Model 2 trained on the full 130K sentences.
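For reference, the reported figures follow from the confusion counts using the standard definitions. The sketch below recomputes them; the function name `classification_metrics` is hypothetical, and the formulas are the usual ones rather than code taken from the notebook.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the report's metrics as percentages rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    metrics = {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "true_neg_rate": tn / (tn + fp),
        "f-measure": 2 * precision * recall / (precision + recall),
    }
    return {name: round(100 * value, 2) for name, value in metrics.items()}


scores = classification_metrics(tp=757, fp=564, tn=670, fn=664)
for name, value in scores.items():
    print(f"{name} = {value}")
```

With the counts above, this reproduces precision = 57.31, recall = 53.27, accuracy = 53.75, true_neg_rate = 54.29, and f-measure = 55.22.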

### [Model 3 general LM for sentiment](./models/lm for senti/engine.ipynb)

Alternatively, we could add two more categories, 'pos-dev' and 'neg-dev', so that 'pos' and 'neg' are trained as two different sub-languages (without the participation of 'pos-dev' or 'neg-dev'), and then use the dev categories for post-training scoring.

With this 'encouragement', I ran the same model on the nltk.movie_reviews corpus and got precision = 83.86, recall = 85.21, accuracy = 84.65, true_neg_rate = 84.11, f-measure = 84.53.

### [Model 3 LM for nltk.movie_reviews](./models/lm for senti/engine for nlk movie.ipynb)

As a word of warning, at one earlier point I forgot to set aside the dev groups and reused test to do the scoring, and got a precision over 90%, which is of course wrong. I list it here for your amusement. :)

### [Model 3 LM for nltk.movie_reviews without Dev](./models/lm for movie/engine.ipynb)


## <center>Conclusions</center>

- A single-word-level RNN/LSTM may not be the best tool for sentiment analysis or other classification tasks because of the relatively long and scattered backprop paths. Forcing each word to predict the class can help, but it may put too much weight on the last word, which is the main predictor.

- Using the stock trend as an indicator of sentiment may be a long shot, mainly because the association is not obvious in real life, especially when a sentence mentions multiple companies. With the LM model it seems that we are learning something (54%), but we are not sure what.

- On the other hand, I am very interested in exploring categorized multiple LMs further as a classifier model. Though I understand that their effectiveness for classification learning is questionable, I like the implications of the variance between one sub-LM, another sub-LM, and the union LM.

- For lack of another timely channel, I would like to conclude this project by offering my gratitude to both instructors for your willingness and enthusiasm to help and your profound knowledge of the subject. Thank you so much!




