<img src="http://eikon.tpq.io/refinitiv_logo.png" width="28%" align="left" style="vertical-align: top; padding-top: 23px;">
<img src="http://hilpisch.com/tpq_logo_long.png" width="36%" align="right" style="vertical-align: top;">

# Eikon Data API

**Natural Language Processing for News**

Dr. Yves J. Hilpisch | The Python Quants GmbH

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:training@tpq.io">training@tpq.io</a>

<img src="http://hilpisch.com/images/tr_eikon_02.png" width=350px align=left>

## The Agenda

This tutorial covers **natural language processing (NLP)** based on news from the Eikon Data API:

* Retrieving News Headlines and Story Texts
* Extracting Raw Text from HTML
* Tokenizing a Raw Text
* Collecting Raw Texts and Tokenizing Them
* Building a Vocabulary for a Raw Text Collection

## Imports and Versions

The following cell imports the **packages** used throughout this tutorial.

In [1]:
import nltk, bs4  # NLP toolkit & BeautifulSoup
import eikon as ek  # the Eikon Python wrapper package
from bs4 import BeautifulSoup  # HTML parsing
from nltk import word_tokenize  # tokenizing
import configparser as cp

The following **Python and package versions** are used.

In [2]:
import sys
print(sys.version)

3.6.4 |Anaconda, Inc.| (default, Dec 21 2017, 15:39:08) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


In [3]:
ek.__version__

'0.1.12'

In [4]:
nltk.__version__

'3.2.5'

In [5]:
bs4.__version__

'4.6.0'

## Connecting to Eikon Data API

The following code reads the `app_id` from a configuration file and passes it to `ek.set_app_key()` to connect to the **Eikon Data API Proxy**, which needs to be running locally. It requires the previously created text file `eikon.cfg` to be in the current working directory.

In [6]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')  # adjust for different file location

['eikon.cfg']

In [7]:
ek.set_app_key(cfg['eikon']['app_id'])  # replaces set_app_id(), which is being deprecated
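
The lookup `cfg['eikon']['app_id']` implies that `eikon.cfg` contains an `[eikon]` section with an `app_id` entry. The following is a minimal sketch of that layout, parsed from a string so that it runs without Eikon. The names `SAMPLE_CFG`, `cfg_demo` and `app_key` are illustrative, and the key value is a placeholder, not a real credential.

```python
import configparser

# Assumed layout of eikon.cfg; the app_id value is a placeholder.
SAMPLE_CFG = """
[eikon]
app_id = YOUR_APP_KEY_HERE
"""

cfg_demo = configparser.ConfigParser()
cfg_demo.read_string(SAMPLE_CFG)

# Fail early with a readable message if the section or key is missing.
if not cfg_demo.has_option('eikon', 'app_id'):
    raise KeyError('eikon.cfg needs an [eikon] section with an app_id entry')

app_key = cfg_demo['eikon']['app_id']
print(app_key)
```

When reading from a file, `ConfigParser.read()` returns the list of files it could parse, so an empty list indicates that `eikon.cfg` was not found at the given location.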

## Reading News Headlines

The function `ek.get_news_headlines()` allows you to search for and retrieve **news headlines**, including `storyId` values needed to retrieve the full news text.

A `query` string can contain RICs (Reuters Instrument Codes, here prefixed with `R:`) and other words to be searched for.

In [8]:
news = ek.get_news_headlines('R:TSLA.O PRODUCTION',
                             date_from='2018-02-15',
                             date_to='2018-03-16',
                             count=5)

In [9]:
news

| | versionCreated | text | storyId | sourceCode |
|---|---|---|---|---|
| 2018-03-14 21:49:23.000 | 2018-03-14 21:49:23.000 | Tesla says Model S, Model X production efficie... | urn:newsml:reuters.com:20180314:nL3N1QW5TX:2 | NS:RTRS |
| 2018-03-12 17:01:58.016 | 2018-03-12 17:01:58.016 | Tesla (TSLA) Shares Gain Amid Production Delay... | urn:newsml:reuters.com:20180312:nNRA5phjec:1 | NS:ZACKSC |
| 2018-03-12 01:28:10.000 | 2018-03-12 01:28:10.000 | Tesla paused Model 3 production for planned up... | urn:newsml:reuters.com:20180312:nL1N1QU00F:2 | NS:RTRS |
| 2018-03-11 23:39:28.000 | 2018-03-11 23:42:48.000 | MEDIA LINK-Tesla temporarily suspended Model 3... | urn:newsml:reuters.com:20180311:nL1N1QT0LN:2 | NS:RTRS |
| 2018-02-26 06:05:12.873 | 2018-02-26 06:05:12.873 | (LEAD) Tesla-Model S P100D (LEAD) Tesla launch... | urn:newsml:reuters.com:20180226:nNRA5ma8ay:1 | NS:YONNWS |


## Retrieving Full Text

The function `ek.get_news_story()` retrieves the full text of a **news story** given the `storyId` value.

The `storyId` values are stored in the respective column of the `news` `DataFrame` object as created above.

In [10]:
news['storyId']

2018-03-14 21:49:23.000    urn:newsml:reuters.com:20180314:nL3N1QW5TX:2
2018-03-12 17:01:58.016    urn:newsml:reuters.com:20180312:nNRA5phjec:1
2018-03-12 01:28:10.000    urn:newsml:reuters.com:20180312:nL1N1QU00F:2
2018-03-11 23:39:28.000    urn:newsml:reuters.com:20180311:nL1N1QT0LN:2
2018-02-26 06:05:12.873    urn:newsml:reuters.com:20180226:nNRA5ma8ay:1
Name: storyId, dtype: object

To read a news story, **pick out one story via its storyId and display the full text** provided as HTML code.

In [11]:
storyId = news['storyId'].iloc[1]  # second story, selected by position

In [12]:
from IPython.display import HTML

In [13]:
HTML(ek.get_news_story(storyId))

## Extracting Raw Text

For text analysis, **raw text** is better suited than HTML. The `bs4` package, used here with the `html5lib` parser (which must be installed separately), transforms the HTML content into plain text.

In [14]:
html = ek.get_news_story(storyId)

In [15]:
raw = BeautifulSoup(html, 'html5lib').get_text()

In [16]:
ind = raw.find('Tesla')

In [17]:
print(raw[ind:ind + 500])

Tesla TSLA popped more than 3.5% in early morning trading Monday, despite cautious reports that the electric car manufacturer was forced to pause production of its low-priced Model 3 in February.Tesla temporarily suspended production of the Model 3 from Feb. 20 to 24 to adjust equipment, improve automation, and increase production rates, according to CNBC  . But the company says this activity is normal."Our Model 3 production plan includes periods of planned downtime in both Fremont and Gigafact
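
The output shows sentences running together (`February.Tesla`, `normal."Our`), which suggests that text from adjacent HTML elements is concatenated without a separator. The following sketch illustrates the effect with the standard library's `html.parser` instead of `bs4`. It is not the method used in this tutorial; `TextExtractor`, `html_to_text` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping script and style content."""

    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html, separator=' '):
    """Joins all text nodes of an HTML string with the given separator."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.parts)


# Made-up sample, not an actual news story.
sample_html = '<div><p>Model 3 in February.</p><p>Tesla paused production.</p></div>'
print(html_to_text(sample_html, separator=''))   # paragraphs fused
print(html_to_text(sample_html, separator=' '))  # paragraphs separated
```

With `bs4`, a similar effect should be achievable through the `separator` argument of `get_text()`, for example `get_text(separator=' ')`.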


## Tokens for a Text

With `nltk`, **tokenization**, i.e., the splitting of a raw text into individual elements (tokens), is easily accomplished. The `punkt` package for `nltk` is required.

In [18]:
nltk.download('punkt')  # downloads package if required

[nltk_data] Downloading package punkt to /Users/yves/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [19]:
tokens = word_tokenize(raw)  # derives tokens for the raw text

In [20]:
tokens[20:40]

['reports',
 'that',
 'the',
 'electric',
 'car',
 'manufacturer',
 'was',
 'forced',
 'to',
 'pause',
 'production',
 'of',
 'its',
 'low-priced',
 'Model',
 '3',
 'in',
 'February.Tesla',
 'temporarily',
 'suspended']
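
The token `'February.Tesla'` shows that a period without trailing whitespace is not treated as a token boundary here. As a rough illustration of what tokenization does, the following sketch uses a regular expression from the standard library. It is a simplification, not the algorithm behind `word_tokenize`; the pattern `TOKEN_RE` and the function `simple_tokenize` are assumptions made for this example.

```python
import re

# Three alternatives, tried in order:
# 1. numbers with inner separators such as 2,500 or 3.5
# 2. words, optionally with inner hyphens or apostrophes
# 3. any single character that is neither word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)+|\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Splits a text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)


sample = ('pause production of its low-priced Model 3 in '
          'February.Tesla temporarily suspended')
print(simple_tokenize(sample))
```

This pattern splits `February.Tesla` into three tokens, but it would also break abbreviations such as `U.S.` into single letters and periods, which is one reason to prefer a dedicated tokenizer for real texts.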

On the basis of tokens, **contexts** for different tokens (words) are easily selected.

In [21]:
text = nltk.Text(tokens)

In [22]:
text.concordance('production')

Displaying 8 of 8 matches:
 manufacturer was forced to pause production of its low-priced Model 3 in Febr
ruary.Tesla temporarily suspended production of the Model 3 from Feb. 20 to 24
improve automation , and increase production rates , according to CNBC . But t
tivity is normal . `` Our Model 3 production plan includes periods of planned 
 bottlenecks in order to increase production rates , '' a Tesla spokesperson t
s goal of reaching weekly Model 3 production rates of 2,500 by the end of Q1 a
 has struggled to ramp up Model 3 production , leading to some concern about s
ting buyer.Tesla 's disappointing production numbers have caused its stock to 
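
Conceptually, a concordance lists every occurrence of a word together with its neighboring tokens. The following sketch reproduces the idea on a token list with a fixed token window. It is a simplified stand-in, not NLTK's implementation (which formats by character width); the function name, the `width` parameter and the sample tokens are hypothetical.

```python
def concordance(tokens, word, width=3):
    """Returns (left context, match, right context) per hit, ignoring case."""
    target = word.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Made-up token list for illustration.
sample_tokens = ['forced', 'to', 'pause', 'production', 'of', 'its',
                 'low-priced', 'Model', '3', '.', 'Tesla', 'suspended',
                 'production', 'of', 'the', 'Model', '3']

for left, match, right in concordance(sample_tokens, 'production'):
    print(f'{left:>25} [{match}] {right}')
```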


## Collecting Raw Texts

The analyses that follow are based on **all news stories** retrieved above. To this end, the raw texts are collected in a `list` object.

In [23]:
stories = []
for storyId in news['storyId']:
    html = ek.get_news_story(storyId)
    stories.append(BeautifulSoup(html, 'html5lib').get_text())

In [24]:
for story in stories:
    print(story[120:200])

 decreased considerably, following the latest report of quality problems that co
e electric car manufacturer was forced to pause production of its low-priced Mod
lanned work to adjust equipment in order to improve automation and increase prod
loomberg reported on Sunday.Source link: https://www.bloomberg.com/news/articles
mation in 3rd para)SEOUL, Feb. 26 (Yonhap) -- U.S. carmaker Tesla Motors Inc. on


## Tokens for Raw Texts

The same approach is now applied to all the story texts as collected above.

In [25]:
collection = ''.join(stories)  # combines all texts
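
Joining with an empty string places the last character of one story directly before the first character of the next, so two words can merge at each story boundary. The following sketch uses made-up strings (not the retrieved stories) and plain whitespace splitting to show the difference a separator makes.

```python
# Made-up stand-ins for two story texts.
sample_stories = ['Production was paused.', 'Tesla shares gained.']

fused = ''.join(sample_stories)    # no separator between stories
spaced = ' '.join(sample_stories)  # single space between stories

print(fused.split())
print(spaced.split())
```

Whether this matters depends on the analysis; for word counts across many stories, a separator such as `' '` or `'\n'` avoids the merged boundary tokens.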

In [26]:
tokens = word_tokenize(collection)  # derives tokens from the collection

In [27]:
tokens[40:60]

['carmaker',
 'from',
 'hitting',
 'its',
 'production',
 'targets.The',
 'electric',
 'car',
 'maker',
 'told',
 'Reuters',
 'on',
 'Wednesday',
 'production',
 'of',
 '100,000',
 'Model',
 'S',
 'and',
 'Model']

Based on the new set of tokens, the collection of raw texts can now be searched and contexts can be looked up.

In [28]:
text = nltk.Text(tokens)

In [29]:
text.concordance('production')

Displaying 21 of 21 matches:
ent the carmaker from hitting its production targets.The electric car maker to
r maker told Reuters on Wednesday production of 100,000 Model S and Model X ve
arts leading to costly rework and production delays , citing several current a
 manufacturer was forced to pause production of its low-priced Model 3 in Febr
ruary.Tesla temporarily suspended production of the Model 3 from Feb. 20 to 24
improve automation , and increase production rates , according to CNBC . But t
tivity is normal . `` Our Model 3 production plan includes periods of planned 
 bottlenecks in order to increase production rates , '' a Tesla spokesperson t
s goal of reaching weekly Model 3 production rates of 2,500 by the end of Q1 a
 has struggled to ramp up Model 3 production , leading to some concern about s
ting buyer.Tesla 's disappointing production numbers have caused its stock to 
 Inc TSLA.O temporarily suspended production of its Model 3 electric car from 
o improve automation an

In [30]:
text.concordance('increase')

Displaying 4 of 4 matches:
uipment , improve automation , and increase production rates , according to CN
ly address bottlenecks in order to increase production rates , '' a Tesla spok
in order to improve automation and increase production rates.Tesla said the pl
ly address bottlenecks in order to increase production rates , '' a Tesla spok


In [31]:
text.concordance('automation')

Displaying 5 of 5 matches:
 24 to adjust equipment , improve automation , and increase production rates ,
These periods are used to improve automation and systematically address bottle
ust equipment in order to improve automation and increase production rates.Tes
These periods are used to improve automation and systematically address bottle
nned downtime was used to improve automation and reduce bottlenecks , Bloomber


In [32]:
text.concordance('vehicles')

Displaying 5 of 5 matches:
ion of 100,000 Model S and Model X vehicles is now possible in a two-shift cyc
sla said it delivered 29,967 total vehicles , with 1,542 of those being Model 
livered 28,425 Model S and Model X vehicles and 1,542 Model 3 vehicles , total
Model X vehicles and 1,542 Model 3 vehicles , totaling 29,967 deliveries . ( R
 growing appetite for all-electric vehicles ( EVs ) here.Using Ludicrous mode 


## Building a Vocabulary

Based on the tokens, a **vocabulary** can be created.

In [33]:
words = sorted([w.lower() for w in tokens])

In [34]:
ind = words.index('a')  # first occurrence of 'a'
ind

333

In [35]:
words[ind:ind + 15]

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']

The following code drops all entries that sort before the first `'a'` (mostly punctuation and numbers), **deletes duplicates**, and **sorts** the remaining words alphabetically.

In [36]:
words = sorted(set(words[ind:]))  # sorted() accepts the set directly

In [37]:
words[:20]

['a',
 'about',
 'absorb',
 'accelerating',
 'acceleration',
 'access',
 'according',
 'accordingly',
 'accuracy',
 'accurately',
 'achieve',
 'actions',
 'activity',
 'add',
 'address',
 'adequacy',
 'adjust',
 'adjusted',
 'adults',
 'advice']
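
Slicing from the first `'a'` relies on digits and most punctuation sorting before lowercase letters, and tokens that start with a letter but contain punctuation (such as `february.tesla`) remain in the vocabulary. As an alternative sketch, not the approach used above, tokens can be filtered explicitly with `str.isalpha()` and counted with `collections.Counter`. The token list is a made-up sample.

```python
from collections import Counter

# Made-up token list for illustration.
sample_tokens = ['Tesla', 'paused', 'Model', '3', 'production', ',',
                 'and', 'production', 'rates', 'of', '2,500', '.', 'tesla']

# Keep purely alphabetic tokens only and normalize to lower case.
clean = [t.lower() for t in sample_tokens if t.isalpha()]

vocabulary = sorted(set(clean))  # unique words in alphabetical order
counts = Counter(clean)          # word frequencies

print(vocabulary)
print(counts.most_common(2))
```

Note that `str.isalpha()` also discards hyphenated words such as `low-priced`; whether that is acceptable depends on the use case.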

## Conclusions

This tutorial has covered the following **natural language processing (NLP)** tasks based on the Eikon Data API and the respective Python packages:

* Retrieving News Headlines and Story Texts
* Extracting Raw Text from HTML
* Tokenizing a Raw Text
* Collecting Raw Texts and Tokenizing Them
* Building a Vocabulary for a Raw Text Collection

## Eikon Data API Developer Resources

* [Overview](https://developers.thomsonreuters.com/eikon-data-apis) 
* [Quick Start](https://developers.thomsonreuters.com/eikon-data-apis/quick-start)
* [Documentation](https://developers.thomsonreuters.com/eikon-data-apis/docs)
* [Downloads](https://developers.thomsonreuters.com/eikon-data-apis/downloads)
* [Tutorials](https://developers.thomsonreuters.com/eikon-data-apis/learning)
* [Q&A Forums](https://developers.thomsonreuters.com/eikon-data-apis/qa) 

Data Item Browser Application: Type `DIB` into the Eikon search bar.

* [Article on Chains](https://developers.thomsonreuters.com/article/simple-chain-objects-ema-part-1)
