<img src="http://hilpisch.com/images/tr_logo_long.png" width="40%" align="left" style="vertical-align: top; padding-top: 23px;">
<img src="http://hilpisch.com/tpq_logo_long.png" width="36%" align="right" style="vertical-align: top;">

# Eikon Data API

**Natural Language Processing for News**

Dr. Yves J. Hilpisch | The Python Quants GmbH

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:training@tpq.io">training@tpq.io</a>

<img src="http://hilpisch.com/images/tr_eikon_02.png" width=350px align=left>

## The Agenda

This tutorial covers **natural language processing (NLP)** based on news from the Eikon Data API:

* Retrieving News Headlines and Story Texts
* Extracting Raw Text from HTML
* Tokenizing a Raw Text
* Collecting Raw Texts and Tokenizing Them
* Building a Vocabulary for a Raw Text Collection

## Imports and Versions

The following cell imports the **packages** used throughout this tutorial.

In [None]:
import nltk, bs4  # NLP toolkit & BeautifulSoup
import eikon as ek  # the Eikon Python wrapper package
from bs4 import BeautifulSoup  # HTML parsing
from nltk import word_tokenize  # tokenizing
import configparser as cp

The following **Python and package versions** are used.

In [None]:
import sys
print(sys.version)

In [None]:
ek.__version__

In [None]:
nltk.__version__

In [None]:
bs4.__version__

## Connecting to Eikon Data API

This code sets the `app_id` to connect to the **Eikon Data API Proxy** which needs to be running locally. It requires the previously created text file `eikon.cfg` to be in the current working directory.

In [None]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')  # adjust for different file location

In [None]:
ek.set_app_id(cfg['eikon']['app_id'])
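
The layout of `eikon.cfg` is not shown in this tutorial. Judging from the lookup `cfg['eikon']['app_id']`, it presumably contains an `[eikon]` section with an `app_id` key. The following sketch writes such a file to a temporary directory and reads it back; the file content and the placeholder value `YOUR_APP_ID` are assumptions for illustration only.

```python
import configparser as cp
import os
import tempfile

# Assumed layout of eikon.cfg, inferred from cfg['eikon']['app_id'].
sample = "[eikon]\napp_id = YOUR_APP_ID\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'eikon.cfg')
    with open(path, 'w') as f:
        f.write(sample)

    cfg = cp.ConfigParser()
    found = cfg.read(path)  # returns the list of files successfully parsed

    # cfg.read() silently skips missing files, so check the result.
    if not found:
        raise FileNotFoundError(path)

    app_id = cfg['eikon']['app_id']

print(app_id)
```

Checking the return value of `cfg.read()` gives a clear error when the file is missing, instead of a less obvious `KeyError` on the later lookup.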

## Reading News Headlines

The function `ek.get_news_headlines()` allows you to search for and retrieve **news headlines**, including `storyId` values needed to retrieve the full news text.

The `query` string can contain `RICs` (as in `R:TSLA.O` below) as well as other words to search for.

In [None]:
news = ek.get_news_headlines('R:TSLA.O PRODUCTION',
                             date_from='2018-02-15',
                             date_to='2018-03-16',
                             count=5)

In [None]:
news
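
The date bounds in the call above are passed as `YYYY-MM-DD` strings. A minimal sketch for deriving such strings relative to an end date, using only the standard library; the helper name `date_window` is hypothetical and not part of the `eikon` package.

```python
from datetime import date, timedelta


def date_window(end, days):
    """Return (date_from, date_to) as ISO strings spanning `days` days up to `end`."""
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()


# Reproduces the window used in the headline query above.
date_from, date_to = date_window(date(2018, 3, 16), 29)
print(date_from, date_to)
```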

## Retrieving Full Text

The function `ek.get_news_story()` retrieves the full text of a **news story** given the `storyId` value.

The `storyId` values are stored in the respective column of the `news` `DataFrame` object as created above.

In [None]:
news['storyId']

To read a news story, **pick out one story via its storyId and display the full text** provided as HTML code.

In [None]:
storyId = news['storyId'].iloc[1]  # second story, selected by position

In [None]:
from IPython.display import HTML

In [None]:
HTML(ek.get_news_story(storyId))

## Extracting Raw Text

For parsing the story text, **raw text** is better suited than HTML. The `bs4` package helps transform the HTML content into plain text; the `html5lib` parser used below must be installed as a separate package.

In [None]:
html = ek.get_news_story(storyId)

In [None]:
raw = BeautifulSoup(html, 'html5lib').get_text()

In [None]:
ind = raw.find('Tesla')

In [None]:
print(raw[ind:ind + 500])
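
To illustrate what the conversion from HTML to raw text involves, the following sketch strips tags with the standard library's `html.parser` module. It is a simplified stand-in and not how `bs4` works internally; the class name `TextExtractor` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Made-up HTML snippet standing in for a news story.
sample_html = ('<div>\n<h1>Tesla update</h1>\n'
               '<p>Production <b>rose</b> in Q1.</p>\n</div>')

parser = TextExtractor()
parser.feed(sample_html)
parser.close()

# Join the text nodes and collapse runs of whitespace into single spaces.
raw_sample = ' '.join(''.join(parser.parts).split())
print(raw_sample)
```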

## Tokens for a Text

Using `nltk`, **tokenization**, i.e., splitting a raw text into individual elements (tokens) such as words and punctuation marks, is easily accomplished. The `punkt` resource for `nltk` is required.

In [None]:
nltk.download('punkt')  # downloads package if required

In [None]:
tokens = word_tokenize(raw)  # derives tokens for the raw text

In [None]:
tokens[20:40]
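
To make the idea of tokenization concrete, the following sketch splits a made-up sentence into word and punctuation tokens with a regular expression. It is a simplified stand-in, not the algorithm behind `word_tokenize`, which treats contractions, abbreviations, and similar cases with more care.

```python
import re

# Made-up sample sentence (not taken from the Eikon news feed).
sample = 'Tesla said production rose 40% in Q1, beating estimates.'

# \w+ matches runs of letters/digits; [^\w\s] matches a single punctuation mark.
simple_tokens = re.findall(r'\w+|[^\w\s]', sample)

print(simple_tokens)
```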

On the basis of tokens, **contexts** for different tokens (words) are easily selected.

In [None]:
text = nltk.Text(tokens)

In [None]:
text.concordance('production')
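
A concordance is a keyword-in-context listing: each occurrence of a word is shown together with its neighboring tokens. A minimal sketch of the idea, using a hypothetical helper `contexts()` and made-up tokens rather than the `nltk.Text` implementation:

```python
def contexts(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on either side."""
    word = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word:  # case-insensitive match in this sketch
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(' '.join(left + [tok] + right))
    return hits


# Made-up token list for illustration.
sample_tokens = ['Tesla', 'said', 'Model', '3', 'production', 'rose', 'sharply',
                 'while', 'Production', 'costs', 'fell']

hits = contexts(sample_tokens, 'production')
for line in hits:
    print(line)
```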

## Collecting Raw Texts

The analyses that follow are based on **all news stories** retrieved above. To this end, the raw texts are collected in a `list` object.

In [None]:
stories = []
for storyId in news['storyId']:
    html = ek.get_news_story(storyId)
    stories.append(BeautifulSoup(html, 'html5lib').get_text())
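
When many stories are retrieved in a loop, a single failed request (for instance due to a connection problem) would stop the whole collection. The sketch below shows one way to keep the loop going; `collect_stories` is a hypothetical helper, `fetch_story` is a made-up stand-in for a call like `ek.get_news_story()`, and the story ids are invented.

```python
def collect_stories(story_ids, fetch):
    """Fetch each story; return (texts, failed_ids) instead of stopping on errors."""
    texts, failed = [], []
    for story_id in story_ids:
        try:
            texts.append(fetch(story_id))
        except Exception:  # broad on purpose: the real error type is not known here
            failed.append(story_id)
    return texts, failed


# Hypothetical fetcher that fails for one made-up id.
def fetch_story(story_id):
    if story_id == 'id-2':
        raise RuntimeError('story not available')
    return '<p>Story {}</p>'.format(story_id)


texts, failed = collect_stories(['id-1', 'id-2', 'id-3'], fetch_story)
print(len(texts), failed)
```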

In [None]:
for story in stories:
    print(story[120:200])

## Tokens for Raw Texts

The same approach is now applied to all the story texts as collected above.

In [None]:
collection = ' '.join(stories)  # combines all texts, separated by spaces
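
When texts are concatenated, the separator decides whether the last word of one text and the first word of the next remain distinct. A small illustration with made-up strings, using `str.split()` as a simplified stand-in for a tokenizer:

```python
parts = ['Output rose.', 'Tesla said so.']  # made-up story texts

glued = ''.join(parts)    # no separator: texts run into each other
spaced = ' '.join(parts)  # a space keeps the texts apart

glued_tokens = glued.split()
spaced_tokens = spaced.split()

print(glued_tokens)
print(spaced_tokens)
```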

In [None]:
tokens = word_tokenize(collection)  # derives tokens from the collection

In [None]:
tokens[40:60]

Based on the new set of tokens, the collection of raw texts can now be searched and contexts can be looked up.

In [None]:
text = nltk.Text(tokens)

In [None]:
text.concordance('production')

In [None]:
text.concordance('increase')

In [None]:
text.concordance('automation')

In [None]:
text.concordance('vehicles')

## Building a Vocabulary

Based on the tokens, a **vocabulary** can be created.

In [None]:
words = sorted([w.lower() for w in tokens])

In [None]:
ind = words.index('a')  # first occurrence of 'a'
ind

In [None]:
words[ind: ind+15]

The following code **deletes duplicates** and **sorts** the remaining `list` object alphabetically.

In [None]:
words = sorted(set(words[ind:]))

In [None]:
words[:20]
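
Slicing from the first `'a'` relies on digits and most punctuation marks sorting before lowercase letters. An alternative sketch, not the approach used above, keeps only purely alphabetic tokens and also counts word frequencies with `collections.Counter`; the token list is made up.

```python
from collections import Counter

# Made-up tokens for illustration.
sample_tokens = ['The', 'Model', '3', 'the', ',', 'production', 'Production', '.']

# Lowercase and keep only purely alphabetic tokens.
words_only = [t.lower() for t in sample_tokens if t.isalpha()]

vocabulary = sorted(set(words_only))  # unique words, alphabetical
counts = Counter(words_only)          # word -> number of occurrences

print(vocabulary)
print(counts.most_common(2))
```

Note that `str.isalpha()` also drops hyphenated words and tokens containing digits, which may or may not be desired for a given analysis.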

## Conclusions

This tutorial has covered the following **natural language processing (NLP)** tasks based on the Eikon Data API and the respective Python packages:

* Retrieving News Headlines and Story Texts
* Extracting Raw Text from HTML
* Tokenizing a Raw Text
* Collecting Raw Texts and Tokenizing Them
* Building a Vocabulary for a Raw Text Collection

## Eikon Data API Developer Resources

* [Overview](https://developers.thomsonreuters.com/eikon-data-apis)
* [Quick Start](https://developers.thomsonreuters.com/eikon-data-apis/quick-start)
* [Documentation](https://developers.thomsonreuters.com/eikon-data-apis/docs)
* [Downloads](https://developers.thomsonreuters.com/eikon-data-apis/downloads)
* [Tutorials](https://developers.thomsonreuters.com/eikon-data-apis/learning)
* [Q&A Forums](https://developers.thomsonreuters.com/eikon-data-apis/qa)

Data Item Browser Application: type `DIB` into the Eikon search bar.

* [Article on Chains](https://developers.thomsonreuters.com/article/simple-chain-objects-ema-part-1)
