<img src="http://hilpisch.com/tpq_logo.png" width="36%" align="right" style="vertical-align: top;">

# Natural Language Processing

**Open Information Extraction**

_Illustrated with a simple example and the text of three Apple press releases._

Dr Yves J Hilpisch | Michael Schwed

The Python Quants GmbH

## Simple Example

In [1]:
import os
import nltk
import requests
import pandas as pd

In [2]:
import sys
sys.path.append('../../modules/')
import soiepy.main as ie  
import ng_functions as ng  
import nlp_functions as nlp

In [3]:
t = '''
Peter studies data science.
Peter knows Java.
Peter prefers Python.
Peter works as a data scientist.
Peter applies machine learning.
A data scientist uses Python.
Python revolutionized data science.
Python is preferred for NLP.
Python is used for machine learning.
'''

In [4]:
s = nltk.sent_tokenize(t)  

In [5]:
s[:3]  

['\nPeter studies data science.', 'Peter knows Java.', 'Peter prefers Python.']

In [6]:
s = [nlp.clean_up_text(_) for _ in s]  
s = [' '.join(nlp.tokenize(_)) + '.' for _ in s]  

In [7]:
s[:3]  

['peter study data science.', 'peter know java.', 'peter prefer python.']

In [8]:
abs_path = os.path.abspath('../../')

In [9]:
data_path = os.path.join(abs_path, 'data')
tokens_path = os.path.join(data_path, 'tokens')
# create the folder (and the parent 'data' folder) if it does not exist yet
os.makedirs(tokens_path, exist_ok=True)

In [10]:
fn = os.path.join(tokens_path, 'tokens_example.txt')  

In [11]:
with open(fn, 'w') as f:
    f.writelines([_ + '\n' for _ in s])  

In [12]:
r = ie.stanford_ie(fn, verbose=True)  

Executing command = cd /root/notebook/dnanlp/modules/soiepy/;cd stanford-openie; java -mx4g -cp "stanford-openie.jar:stanford-openie-models.jar:lib/*" edu.stanford.nlp.naturalli.OpenIE /root/notebook/dnanlp/data/tokens/tokens_example.txt  -format ollie > /tmp/openie/out.txt


In [13]:
r[:3]  

[['peter', ' know', ' java'],
 ['peter', ' prefer', ' python'],
 ['peter', ' works', ' data scientist']]
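The leading spaces in the relation and object fields suggest that each triple comes from splitting a semicolon-separated line. The following is a minimal sketch of such a parser, assuming that the Ollie-style output has the form `confidence: (subject; relation; object)`. The helper `parse_ollie_line` and the sample lines are hypothetical and are not taken from `soiepy`.

```python
import re

# Assumed line layout: "<confidence>: (<subject>; <relation>; <object>)"
OLLIE_LINE = re.compile(r'^\s*([0-9.]+):\s*\((.*)\)\s*$')


def parse_ollie_line(line):
    """Return [subject, relation, object], or None if the line does not match."""
    match = OLLIE_LINE.match(line)
    if match is None:
        return None
    parts = match.group(2).split(';')
    if len(parts) != 3:
        return None
    return parts  # surrounding whitespace is kept on purpose


# hypothetical sample lines, including one that is not a triple
sample = [
    '1.000: (peter; know; java)',
    '1.000: (peter; prefer; python)',
    'not a triple',
]

triples = [t for t in map(parse_ollie_line, sample) if t is not None]
print(triples)
```

Splitting on `;` alone leaves the blank after each semicolon in place, which is why a later `strip()` on every field is useful.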

In [14]:
d = pd.DataFrame(r, columns=['Node1', 'Relation', 'Node2'])  

In [15]:
d = d.applymap(lambda _: _.strip())  

In [16]:
d.iloc[:3]

|   | Node1 | Relation | Node2 |
|---|-------|----------|-------|
| 0 | peter | know | java |
| 1 | peter | prefer | python |
| 2 | peter | works | data scientist |


In [17]:
g = ng.create_graph(d)  

In [18]:
G = ng.plot_graph(g, central_gravity=0.01)  

In [19]:
G.show('ng_example.html')  
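`ng.create_graph` and `ng.plot_graph` come from a local module that is not shown here. As a rough illustration of the underlying idea only, the sketch below turns (Node1, Relation, Node2) triples into a directed, labeled adjacency mapping with the standard library. The function `build_adjacency` is a hypothetical stand-in, not the implementation in `ng_functions`.

```python
# first three triples from the simple example above
triples = [
    ('peter', 'know', 'java'),
    ('peter', 'prefer', 'python'),
    ('peter', 'works', 'data scientist'),
]


def build_adjacency(rows):
    """Map every node to a list of (relation, target) pairs."""
    graph = {}
    for node1, relation, node2 in rows:
        graph.setdefault(node1, []).append((relation, node2))
        graph.setdefault(node2, [])  # register targets that have no outgoing edge
    return graph


adjacency = build_adjacency(triples)
for node, edges in adjacency.items():
    for relation, target in edges:
        print(f'{node} --{relation}--> {target}')
```

Each row becomes one directed edge whose label is the relation, so subjects that occur in many triples (here `peter`) end up as hubs in the plotted graph.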

## Apple Press Releases

In [20]:
import requests

In [21]:
sources = [
    'https://nr.apple.com/dE0b1T5G3u',  # iPad Pro
    'https://nr.apple.com/dE4c7T6g1K',  # MacBook Air
    'https://nr.apple.com/dE4q4r8A2A',  # Mac Mini
]

In [22]:
# the timeout keeps the cell from hanging if a server does not respond
html = [requests.get(url, timeout=30).text for url in sources]

In [23]:
sents = [nltk.sent_tokenize(h) for h in html]

In [24]:
s = []
for sent in sents:
    s.extend(sent)

In [25]:
len(s)

200

In [26]:
s = [nlp.clean_up_text(se) for se in s]

In [27]:
s = [' '.join(nlp.tokenize(se)) + '.' for se in s]

In [28]:
s = [se for se in s if len(se) > 5]

In [29]:
fn = os.path.join(tokens_path, 'tokens_apple.txt')
with open(fn, 'w') as f:
    f.writelines([_ + '\n' for _ in s])

In [30]:
%time r = ie.stanford_ie(fn, verbose=False)

CPU times: user 5.85 ms, sys: 16.3 ms, total: 22.1 ms
Wall time: 21.2 s


In [31]:
r[:3]

[['apple mac',
  ' ipad',
  ' iphone watch music support shopping bag newsroom archive press release'],
 ['apple mac',
  ' ipad',
  ' watch music support shopping bag newsroom archive press release'],
 ['today', ' introduce', ' design performance']]

In [32]:
d = pd.DataFrame(r, columns=['Node1', 'Relation', 'Node2'])

In [33]:
d = d.applymap(lambda x: x.strip())

In [34]:
d.iloc[:10]

|   | Node1 | Relation | Node2 |
|---|-------|----------|-------|
| 0 | apple mac | ipad | iphone watch music support shopping bag newsro... |
| 1 | apple mac | ipad | watch music support shopping bag newsroom arch... |
| 2 | today | introduce | design performance |
| 3 | today | introduce | ipad design performance |
| 4 | today | introduce | design next-generation performance |
| 5 | today | introduce | all-screen design performance |
| 6 | today | introduce | ipad design next-generation performance |
| 7 | today | introduce | ipad all-screen design next-generation perform... |
| 8 | today | introduce | ipad all-screen design performance |
| 9 | today | introduce | all-screen design next-generation performance |


In [35]:
d = d[d.applymap(lambda x: len(x) < 25)].dropna()

In [36]:
d.iloc[:5]

|    | Node1 | Relation | Node2 |
|----|-------|----------|-------|
| 2  | today | introduce | design performance |
| 3  | today | introduce | ipad design performance |
| 19 | workflows .2 apps design | take | advantage display |
| 45 | workflows .2 apps design | take | advantage large display |
| 47 | photoshop ipad | coming | 2019 push user computer |

In [37]:
g = ng.create_graph(d)

In [38]:
G = ng.plot_graph(g, with_edge_label=False,
                  font_color='grey', central_gravity=0.01)

In [39]:
G.show('ng_apple.html')
