<img src="http://hilpisch.com/tpq_logo.png" width="36%" align="right" style="vertical-align: top;">

# Dow Jones DNA NLP Case Study

_Based on news articles related to Elon Musk, Twitter & Tesla._

**Information Extraction**

Dr Yves J Hilpisch | Michael Schwed

The Python Quants GmbH

## The Imports

In [1]:
import os
import sys
sys.path.append('../../modules')

In [2]:
import nltk
import pandas as pd
import soiepy.main as ie
import nlp_functions as nlp

## Snapshot Data

In [3]:
project = 'musk_100'

In [4]:
abs_path = os.path.abspath('../../')

In [5]:
data_path = os.path.join(abs_path, 'data_musk')

In [6]:
snapshot_path = os.path.join(data_path, 'snapshot')

In [7]:
fn = os.path.join(snapshot_path, 'snapshot_{}.h5'.format(project))

In [8]:
raw = pd.read_hdf(fn, 'data')

## Preprocessing

In [9]:
%time raw['body'] = raw['body'].apply(nlp.clean_up_text) 

CPU times: user 246 ms, sys: 7.59 ms, total: 253 ms
Wall time: 252 ms


In [10]:
data = raw['body'].values.tolist()  

In [11]:
%%time
s = [nltk.sent_tokenize(a) for a in data]  
s = [_ for sl in s for _ in sl]  

CPU times: user 200 ms, sys: 0 ns, total: 200 ms
Wall time: 199 ms


In [12]:
s[:2]

['after six years of reflection, he returned to the subject.',
 'the last several years have taught me that they are indeed reasonably maligned, musk wrote in an oct. 4 tweet.']

In [13]:
token_path = os.path.join(data_path, 'tokens')  
if not os.path.isdir(token_path):
    os.mkdir(token_path)

In [14]:
fn = os.path.join(token_path, 'tokens_{}_{}.txt')  

In [15]:
steps = 250
for c, i in enumerate(range(0, len(s), steps)):
    with open(fn.format(project, c), 'w') as f:
        f.writelines([_ + '\n' for _ in s[i:i + steps - 1]])  

## Relations

In [16]:
results_path = os.path.join(data_path, 'results')  
if not os.path.isdir(results_path):
    os.mkdir(results_path)
fnr = os.path.join(results_path, 'relations_{}.h5'.format(project))  

In [17]:
fl = sorted(os.listdir(token_path))
d = pd.DataFrame()
fno = len(fl)

In [18]:
%%time
try:
    d = pd.read_hdf(fnr, 'raw')  
except:
    for i, fn in enumerate(fl):
        filename = os.path.join(token_path, fn)
        msg = 'Processing file {} of {} \r'
        print(msg.format(i + 1, fno), end='')
        r = ie.stanford_ie(filename, verbose=False)  
        dt = pd.DataFrame(r)
        if len(d) == 0:
            d = dt
        else:
            d = pd.concat((d, dt), ignore_index=True)

CPU times: user 18.7 ms, sys: 292 µs, total: 19 ms
Wall time: 17.2 ms


In [19]:
d = d.iloc[:, :3]

In [20]:
d.columns = ['Node1', 'Relation', 'Node2']

In [21]:
len(d)

11385

## Post Processing

In [22]:
data = d.copy()

### Basic Processing

In [23]:
data = data.applymap(lambda s: s.strip())  

In [24]:
data = data[data.applymap(lambda s: not s in nlp.stop_words)].dropna()  

In [25]:
data = data[data.applymap(lambda s: not s.startswith('http'))].dropna()  

In [26]:
data = data.applymap(lambda s: nlp.nltk_lemma(s))  

In [27]:
len(data)

7935

### Removing Duplicates

In [28]:
def join_columns(row):
    return ' '.join([row['Node1'], row['Relation'], row['Node2']])  

In [29]:
vec = nlp.TfidfVectorizer(stop_words='english')

In [30]:
data['Join'] = data.apply(lambda row: join_columns(row), axis=1)  

In [31]:
mat = vec.fit_transform(data['Join'].values.tolist())

In [32]:
%time sim = (mat * mat.T).A  

CPU times: user 113 ms, sys: 374 ms, total: 487 ms
Wall time: 485 ms


In [33]:
data['Keep'] = True

In [34]:
%%time
for i, ind_i in enumerate(data.index):
    for j, ind_j in enumerate(data.index):
        if j > i:
            simsc = sim[i, j]
            if simsc > 0.5:
                data.loc[ind_j, 'Keep'] = False  

CPU times: user 1min 2s, sys: 103 ms, total: 1min 2s
Wall time: 1min 2s


In [35]:
data = data.iloc[:, :3][data['Keep'] == True]  

In [36]:
len(data)

1887

In [37]:
data.head()

Unnamed: 0,Node1,Relation,Node2
11,musk,wrote in,oct. 4 tweet
13,entrepreneur,hate,sellers
21,option,bet on,stock decline
24,ways,bet on,fall
25,tesla short interest,approach,40 million shares


## Storing Results

In [38]:
d.to_hdf(fnr, 'raw', complevel=5, complib='blosc')  

In [39]:
data.to_hdf(fnr, 'data', complevel=5, complib='blosc')  

<img src="http://hilpisch.com/tpq_logo.png" width="36%" align="right" style="vertical-align: top;">