<img src="http://hilpisch.com/tpq_logo.png" width="36%" align="right" style="vertical-align: top;">

# Dow Jones DNA NLP Case Study

_Based on news articles related to Elon Musk, Twitter & Tesla._

**Data Retrieval**

Dr Yves J Hilpisch | Michael Schwed

The Python Quants GmbH

## The Imports

In [1]:
import os
import sys
sys.path.append('../../modules/')

In [2]:
import json
import nltk
import pickle
import tpqdna
import warnings
warnings.simplefilter('ignore')

## Snapshot Creation

### Authentication

In [3]:
# expects the DNA API key to be stored unencrypted in a Python pickle file
with open('../dna_api_key.pkl', 'rb') as f:
    api_key = pickle.load(f)
headers = {
    'user-key': api_key,
    'content-type': 'application/json',
    'cache-control': 'no-cache'
}  
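
The authentication cell assumes that a pickle file holding the key already exists. A minimal sketch of how such a file could be written and read back, assuming the key is a plain string; the helper names `save_api_key` and `load_api_key` and the placeholder key are hypothetical and not part of `tpqdna`:

```python
import os
import pickle
import tempfile


def save_api_key(api_key, path):
    """Write the API key (a plain string) to a pickle file."""
    with open(path, 'wb') as f:
        pickle.dump(api_key, f)


def load_api_key(path):
    """Read the API key back from the pickle file."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# demonstration with a placeholder key in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    key_file = os.path.join(tmp, 'dna_api_key.pkl')
    save_api_key('YOUR_DNA_API_KEY', key_file)
    restored_key = load_api_key(key_file)

# same header layout as in the notebook, filled with the placeholder key
demo_headers = {
    'user-key': restored_key,
    'content-type': 'application/json',
    'cache-control': 'no-cache'
}
```

Pickle is a serialization format, not encryption, so the key file should stay out of version control.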

### Specification

In [4]:
where = '(body like "%Musk.%" OR body like "%Musk,%" '
where += 'OR body like "%Musk %" '
where += 'OR body like "%Tesla.%" OR body like "%Tesla,%" '
where += 'OR body like "%Tesla %") '
where += 'AND (body like "%tweet.%" OR body like "%tweet,%" '
where += 'OR body like "%tweet %" '
where += 'OR body like "%Twitter.%" OR body like "%Twitter,%" '
where += 'OR body like "%Twitter %" ) ' 
where += 'AND language_code="en" '
where += 'AND publication_date >= "2018-07-23 00:00:00" '
where += 'AND publication_date <= "2018-10-29 23:59:59" '  
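
The `where` string pairs each keyword with a trailing period, comma, or space, presumably to approximate whole-word matching with `like`. The same clause could be generated from keyword lists instead of being concatenated by hand. A minimal sketch; the helper names `like_group` and `build_where` are hypothetical, and the output is meant to be logically equivalent to the string above, differing only in whitespace:

```python
def like_group(field, terms, suffixes=('.', ',', ' ')):
    """OR-combine LIKE patterns for every term followed by every suffix."""
    patterns = ['{} like "%{}{}%"'.format(field, term, suffix)
                for term in terms for suffix in suffixes]
    return '(' + ' OR '.join(patterns) + ')'


def build_where(groups, language, start, end):
    """AND-combine the keyword groups with language and date filters."""
    parts = [like_group('body', terms) for terms in groups]
    parts.append('language_code="{}"'.format(language))
    parts.append('publication_date >= "{}"'.format(start))
    parts.append('publication_date <= "{}"'.format(end))
    return ' AND '.join(parts)


where_sketch = build_where(
    groups=[['Musk', 'Tesla'], ['tweet', 'Twitter']],
    language='en',
    start='2018-07-23 00:00:00',
    end='2018-10-29 23:59:59',
)
print(where_sketch)
```

Adding a keyword or changing the date range then means editing one argument rather than several string fragments.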

In [5]:
includes = {} 
excludes = {}
limit = 100

In [6]:
query = {'query':
            {'where': where,
             'includes': includes,
             'excludes': excludes,
             'limit': limit
            }}

In [7]:
query = json.dumps(query)
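
A misspelled key in the query dictionary would be serialized without complaint. A minimal sketch of a payload builder that rejects unexpected keys before serialization; the function `build_query` is hypothetical, and the set of four keys (`where`, `includes`, `excludes`, `limit`) is an assumption based on this notebook, not a full specification of the API:

```python
import json

# assumption: the only keys used in this notebook's query
EXPECTED_KEYS = {'where', 'includes', 'excludes', 'limit'}


def build_query(**fields):
    """Return the JSON payload, raising KeyError on unexpected keys."""
    unknown = set(fields) - EXPECTED_KEYS
    if unknown:
        raise KeyError('unexpected query keys: {}'.format(sorted(unknown)))
    return json.dumps({'query': fields})


payload = build_query(where='language_code="en"',
                      includes={}, excludes={}, limit=100)
```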

In [8]:
# %time qurl = tpqdna.create_snapshot(query, headers)  

In [9]:
# %time fl = tpqdna.run_snapshot(qurl, headers)  

## Data Paths

In [10]:
project = 'musk_{}'.format(limit)  

In [11]:
base_path = os.path.abspath('../../')

In [12]:
data_path = os.path.join(base_path, 'data_musk')  
if not os.path.isdir(data_path):
    os.mkdir(data_path)

In [13]:
meta_path = os.path.join(data_path, 'meta')  
if not os.path.isdir(meta_path):
    os.mkdir(meta_path)
fn = os.path.join(meta_path, 'file_list_{}.pkl'.format(project))

In [14]:
# with open(fn, 'wb') as f:
#    pickle.dump(fl, f)  

In [15]:
with open(fn, 'rb') as f:
    fl = pickle.load(f)

## Data Retrieval

In [16]:
snapshot_path = os.path.join(data_path, 'snapshot')  
if not os.path.isdir(snapshot_path):
    os.mkdir(snapshot_path)

In [17]:
# %time tpqdna.download_snapshots(fl, snapshot_path, headers)  

In [18]:
%time data = tpqdna.avro2dataframe(snapshot_path)  

CPU times: user 143 ms, sys: 7.56 ms, total: 151 ms
Wall time: 155 ms


In [19]:
data.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 35 columns):
action                       100 non-null object
an                           100 non-null object
art                          100 non-null object
body                         100 non-null object
byline                       100 non-null object
company_codes                100 non-null object
company_codes_about          100 non-null object
company_codes_association    100 non-null object
company_codes_lineage        100 non-null object
company_codes_occur          100 non-null object
company_codes_relevance      100 non-null object
copyright                    100 non-null object
credit                       100 non-null object
currency_codes               100 non-null object
dateline                     5 non-null object
document_type                100 non-null object
industry_codes               100 non-null object
ingestion_datetime           100 non-null int64
language_code    

In [20]:
data[['source_name', 'title', 'word_count']].head(8)

| | source_name | title | word_count |
|---|---|---|---|
| 0 | Barron's | Why Musk Is Wrong About Shorts | 1031 |
| 1 | San Francisco Chronicle: Web Edition | Tesla owners fume about delays for service | 1016 |
| 2 | ArabianBusiness.com | Saudi fund PIF said to mull investment in Tesl... | 240 |
| 3 | Digit | SpaceX to announce the name of its first touri... | 417 |
| 4 | U-Wire | We’ve All Been Pronouncing Chrissy Teigen’s La... | 292 |
| 5 | Dow Jones Institutional News | Public Bravado, Private Doubts: How Elon Musk'... | 1786 |
| 6 | Benzinga.com | Tesla Zaps Go-Private Plans; Wall Street Reacts | 581 |
| 7 | The Canadian Press | Tesla stock drops closer to pre-Musk tweet level | 308 |


In [21]:
fn = os.path.join(snapshot_path, 'snapshot_{}.h5'.format(project))  

In [22]:
data.to_hdf(fn, key='data', complevel=5, complib='blosc')
