## Research Project 3
---
```text
- Source: Reuters
- Goal: Build headline aggregator (e.g. Google News)
- Techniques: Word Embeddings, Cosine Similarity
- Tools: requests, lxml, Tensorflow Hub
- Lines of code: ~70
```

### Request pages
---

In [1]:
# Let's go to reuters.com and pick a URL
url = 'https://www.reuters.com/finance/stocks/company-news/AAPL.O?date=05102018'

In [2]:
# Get that page
import requests
res = requests.get(url)
len(res.content)

65945

In [3]:
res.content[:1000] 

b'<!--[if !IE]> This has NOT been served from cache <![endif]-->\n<!--[if !IE]> Request served from apache server: prodie--i-088578c9787a71c24 <![endif]-->\n<!--[if !IE]> token: 41ac1106-02cf-4619-b75b-462416ce63bb <![endif]-->\n<!--[if !IE]> App Server /prodie--i-088578c9787a71c24/ <![endif]-->\n\n<!doctype html><html lang="en"><head>\n<title>Apple Inc (AAPL.O)  News| Reuters.com</title>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch" href="//s1.reutersmedia.net"/><link rel="dns-prefetch" href="//s2.reutersmedia.net"/><link rel="dns-prefetch" href="//s3.reutersmedia.net"/><link rel="dns-prefetch" href="//s4.reutersmedia.net"/><link rel="dns-prefetch" href="//static.reuters.com"/><link rel="dns-prefetch" href="//www.googletagservices.com"/><link rel="dns-prefetch" href="//www.googletagmanager.com"/><link rel="dns-prefetch" href="//www.google-analytics.com"/><link rel="dns-pre

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1;
            background-color: #FCF3CF;
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4">
    <a href="../deep_dives/urls.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Deep-dive</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">APIs and URL parameters</p></a></font>
</div>
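The URL picked in the first cell carries two pieces of information: the ticker as the last path segment and the date in the query string. A minimal sketch of taking it apart and rebuilding it with the standard library could look as follows. The helper name `build_url` is hypothetical, and reading `05102018` as month, day, year is an assumption rather than something the page documents.

```python
from datetime import date
from urllib.parse import urlsplit, parse_qs, urlencode

url = 'https://www.reuters.com/finance/stocks/company-news/AAPL.O?date=05102018'

# Split the URL into its components
parts = urlsplit(url)
ticker = parts.path.rsplit('/', 1)[-1]   # last path segment
params = parse_qs(parts.query)           # query string as a dict of lists

BASE = 'https://www.reuters.com/finance/stocks/company-news/'

def build_url(ticker, day):
    """Hypothetical helper: rebuild the URL for a ticker and a datetime.date."""
    # Assumption: the date parameter is formatted as MMDDYYYY
    query = urlencode({'date': day.strftime('%m%d%Y')})
    return '%s%s?%s' % (BASE, ticker, query)

print(ticker, params)
print(build_url('AAPL.O', date(2018, 5, 10)) == url)
```

Building the query with `urlencode` rather than string concatenation keeps the function safe if more parameters are added later.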

### Parse pages
---

In [4]:
# Parse HTML
from lxml import html
tree = html.fromstring(res.content)
tree.getchildren()

[<Element head at 0x10e265f48>, <Element body at 0x10e265ef8>]

In [5]:
# What children do we need?
children = tree.xpath('//div[@class="feature"]')

In [6]:
list(children[0].itertext())

['US STOCKS-Wall St rallies and Apple approaches $1 trillion value',
 '\n\t',
 '* Indexes up: Dow 0.80 pct, S&P 500 0.94 pct, Nasdaq 0.89\npct\n(Updates to close)',
 '\n\t']

In [7]:
# Get headlines
headlines = [list(child.itertext())[0] for child in children]
headlines

['US STOCKS-Wall St rallies and Apple approaches $1 trillion value',
 'UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ',
 'Goldman Sachs, Apple to launch joint credit card - WSJ',
 'BRIEF-Apple, Goldman Sachs Team Up On New Credit Card - WSJ',
 'Apple scraps $1 billion Irish data center over planning delays',
 'Apple drops plans for data centre in Ireland due to planning delays - RTE']
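The XPath-plus-`itertext()` pattern used above can be tried offline on a small document. The sketch below swaps `lxml` for the standard library's `xml.etree.ElementTree`, which supports the same `[@class='...']` predicate and `itertext()` method on well-formed markup. The sample markup and the name `extract_headlines` are made up for illustration and are not the real Reuters page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page (not the real Reuters markup)
SAMPLE = """
<html><body>
  <div class="feature"><h2>Headline A</h2><p>Summary A</p></div>
  <div class="other"><h2>Not a headline</h2></div>
  <div class="feature"><h2>Headline B</h2><p>Summary B</p></div>
</body></html>
"""

def extract_headlines(markup):
    """Return the first non-blank text found in each <div class="feature">."""
    root = ET.fromstring(markup.strip())
    found = []
    for block in root.findall(".//div[@class='feature']"):
        texts = [text.strip() for text in block.itertext() if text.strip()]
        if texts:
            found.append(texts[0])
    return found

sample_headlines = extract_headlines(SAMPLE)
print(sample_headlines)
```

Unlike `list(child.itertext())[0]`, this version skips whitespace-only fragments, so it does not depend on the headline being the very first text node.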

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <h3 style="font-family: monospace">Exercise 1.1</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a function <span style="font-family:monospace;">get_ticker_headlines</span> that, given a date and a ticker, returns the headlines on <span style="font-family:monospace;">reuters.com</span>.</p></font>
</div>

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <h3 style="font-family: monospace">Exercise 1.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a function <span style="font-family:monospace;">get_all_headlines</span> that, given a start date, a number of days, and a list of tickers, returns the headlines on reuters.com. Have this function call <span style="font-family:monospace;">get_ticker_headlines</span> for each ticker and date.</p></font>
</div>
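One possible shape for the loop in Exercise 1.2 is sketched below. To keep it runnable without network access, the per-ticker function is passed in as the `fetch` argument and replaced by a stand-in that returns made-up strings; in the notebook it would be `get_ticker_headlines`. The ticker `GS.N` and the name `fake_fetch` are assumptions for illustration only.

```python
from datetime import date, timedelta

def get_all_headlines(start, days, tickers, fetch):
    """Collect headlines for every ticker on `days` consecutive dates.

    `fetch(day, ticker)` is any function returning a list of headlines.
    """
    collected = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        for ticker in tickers:
            collected.extend(fetch(day, ticker))
    return collected

def fake_fetch(day, ticker):
    """Stand-in fetcher with made-up output, so no request is sent."""
    return ['%s headline for %s' % (ticker, day.isoformat())]

result = get_all_headlines(date(2018, 5, 10), 2, ['AAPL.O', 'GS.N'], fake_fetch)
print(result)
```

Passing the fetcher as an argument also makes the date and ticker loop easy to test separately from the scraping code.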

### Vectorize pages
---

In [8]:
# If we want to cluster these headlines, we have 2 options:
# 1) heuristics
# 2) machine learning

# Which one to choose?
# simple heuristics > machine learning > complex heuristics

# What would the heuristics look like? We would have to:
# 1) remove leading and trailing capitalized words (e.g. "BRIEF-", "WSJ")
# 2) lowercase the sentence
# 3) remove punctuation
# 4) generate a set of possible sentences using synonyms
# 5) count the common words and pick the pairs with the highest counts

# ... vs ...

# 1) use word embeddings
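For comparison with the embedding approach, here is a rough sketch of the heuristic route. It covers only steps 2, 3 and 5 (lowercasing, punctuation removal, counting common words) and leaves out the tag removal and synonym expansion. The function names are hypothetical, and the three headlines are copied from the output above.

```python
import string
from itertools import combinations

sample = [
    'UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ',
    'Goldman Sachs, Apple to launch joint credit card - WSJ',
    'Apple scraps $1 billion Irish data center over planning delays',
]

# Translation table mapping every punctuation character to a space (step 3)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def words(headline):
    """Lowercase, strip punctuation, and return the set of distinct words."""
    return set(headline.lower().translate(PUNCT_TO_SPACE).split())

def common_words(a, b):
    """Number of distinct words shared by two headlines (step 5)."""
    return len(words(a) & words(b))

# Rank all index pairs by shared-word count, highest first
ranked = sorted(combinations(range(len(sample)), 2),
                key=lambda pair: common_words(sample[pair[0]], sample[pair[1]]),
                reverse=True)

for i, j in ranked:
    print(common_words(sample[i], sample[j]), '|', sample[i], '|', sample[j])
```

Even this reduced version needs several hand-written rules, and it treats "center" and "centre" as different words, which is the kind of gap the synonym step would have to close.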

In [9]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

In [10]:
import tensorflow_hub as hub
EMBED = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")
session = tf.Session()
session.run([tf.global_variables_initializer(), 
             tf.tables_initializer()])

[None, None]

In [11]:
embeddings = EMBED(headlines)

In [12]:
embeddings

<tf.Tensor 'module_apply_default/Encoder_en/hidden_layers/l2_normalize:0' shape=(?, 512) dtype=float32>

In [13]:
transformed = session.run(embeddings) 

In [14]:
print('Sentence = "%s"\nEncoding = %s' % (headlines[0], transformed[0][:3]))

Sentence = "US STOCKS-Wall St rallies and Apple approaches $1 trillion value"
Encoding = [ 0.05164596  0.044956   -0.02946323]


### Cluster pages
---

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(transformed, transformed)
sims

array([[0.9999998 , 0.6304252 , 0.71410525, 0.6446738 , 0.6522328 ,
        0.44674984],
       [0.6304252 , 1.        , 0.92891455, 0.7896416 , 0.6916854 ,
        0.5669955 ],
       [0.71410525, 0.92891455, 1.        , 0.85746527, 0.66935897,
        0.4865934 ],
       [0.6446738 , 0.7896416 , 0.85746527, 0.99999994, 0.55105186,
        0.3648151 ],
       [0.6522328 , 0.6916854 , 0.66935897, 0.55105186, 0.99999976,
        0.8390436 ],
       [0.44674984, 0.5669955 , 0.4865934 , 0.3648151 , 0.8390436 ,
        1.        ]], dtype=float32)

In [16]:
print(sims[0][0]); print(sims[1][1]); print(sims[2][2])

0.9999998
1.0
1.0
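The first diagonal entry prints as 0.9999998 rather than 1.0, most likely because the embeddings are `float32` and the arithmetic rounds. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so a vector compared with itself should give 1 only up to that rounding. A minimal pure-Python sketch, with made-up 3-dimensional vectors standing in for the 512-dimensional embeddings, shows a tolerance-based check:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of lists."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Made-up vectors; the second is the first scaled by 2, so they point the same way
toy = [[0.1, 0.3, 0.5], [0.2, 0.6, 1.0], [0.5, -0.1, 0.0]]
toy_sims = similarity_matrix(toy)

# Compare the diagonal to 1 with a tolerance instead of ==
diagonal_ok = all(math.isclose(toy_sims[i][i], 1.0, abs_tol=1e-6)
                  for i in range(len(toy)))
print(diagonal_ok)
```

With NumPy arrays the same check is usually written with `np.allclose` on the diagonal.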


<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <h3 style="font-family: monospace">Exercise 1.4</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a function <span style="font-family:monospace;">get_similarities</span> that, given a list of headlines, returns their cosine similarity matrix. Assert that all diagonal elements are equal to 1, up to floating-point rounding.</p></font>
</div>

In [17]:
pairs = []
for row in range(sims.shape[0]):
    for column in range(row + 1, sims.shape[1]):
        pair = (row, column, sims[row][column])
        pairs.append(pair)

In [18]:
pairs = sorted(pairs, key=lambda x: x[2], reverse=True)

In [19]:
for pair in pairs:
    print('\n%.2f\n%s\n%s' % (pair[2],
                              headlines[pair[0]], 
                              headlines[pair[1]]))


0.93
UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ
Goldman Sachs, Apple to launch joint credit card - WSJ

0.86
Goldman Sachs, Apple to launch joint credit card - WSJ
BRIEF-Apple, Goldman Sachs Team Up On New Credit Card - WSJ

0.84
Apple scraps $1 billion Irish data center over planning delays
Apple drops plans for data centre in Ireland due to planning delays - RTE

0.79
UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ
BRIEF-Apple, Goldman Sachs Team Up On New Credit Card - WSJ

0.71
US STOCKS-Wall St rallies and Apple approaches $1 trillion value
Goldman Sachs, Apple to launch joint credit card - WSJ

0.69
UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ
Apple scraps $1 billion Irish data center over planning delays

0.67
Goldman Sachs, Apple to launch joint credit card - WSJ
Apple scraps $1 billion Irish data center over planning delays

0.65
US STOCKS-Wall St rallies and Apple approaches $1 trillion value
Apple scraps $1 billion Iri
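The ranked pairs can be turned into groups of related headlines by linking every pair whose similarity exceeds a cutoff and taking the connected components. The sketch below does this with a small union-find. The matrix values are the ones printed above, rounded to four decimals. The 0.8 threshold and the function name `cluster_by_threshold` are assumptions chosen for illustration, not part of the original notebook.

```python
from itertools import combinations

# Similarity matrix from the output above, rounded to four decimals
sims = [
    [1.0000, 0.6304, 0.7141, 0.6447, 0.6522, 0.4467],
    [0.6304, 1.0000, 0.9289, 0.7896, 0.6917, 0.5670],
    [0.7141, 0.9289, 1.0000, 0.8575, 0.6694, 0.4866],
    [0.6447, 0.7896, 0.8575, 1.0000, 0.5511, 0.3648],
    [0.6522, 0.6917, 0.6694, 0.5511, 1.0000, 0.8390],
    [0.4467, 0.5670, 0.4866, 0.3648, 0.8390, 1.0000],
]

def cluster_by_threshold(matrix, threshold):
    """Group indices whose pairwise similarity chain stays above `threshold`."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if matrix[i][j] >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

clusters = cluster_by_threshold(sims, 0.8)  # assumed cutoff
print(clusters)
```

Each inner list holds positions in `headlines`. With this cutoff the three credit card headlines end up together, the two data center headlines form a second group, and the market wrap stays alone. A lower threshold would merge groups, so the value needs tuning on more data.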

### Text classification
---

In [42]:
import nltk
nltk.download('reuters')
from nltk.corpus import reuters 
documents = reuters.fileids()
documents[0]

[nltk_data] Downloading package reuters to /Users/marco/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


'test/14826'

In [55]:
len(documents)

10788

In [119]:
train_docs = list(filter(lambda doc: doc.startswith("train"), documents))
len(train_docs)

7769

In [118]:
test_docs = list(filter(lambda doc: doc.startswith("test"), documents))
len(test_docs)

3019

In [88]:
reuters.raw(train_docs[0])

'BAHIA COCOA REVIEW\n  Showers continued throughout the week in\n  the Bahia cocoa zone, alleviating the drought since early\n  January and improving prospects for the coming temporao,\n  although normal humidity levels have not been restored,\n  Comissaria Smith said in its weekly review.\n      The dry period means the temporao will be late this year.\n      Arrivals for the week ended February 22 were 155,221 bags\n  of 60 kilos making a cumulative total for the season of 5.93\n  mln against 5.81 at the same stage last year. Again it seems\n  that cocoa delivered earlier on consignment was included in the\n  arrivals figures.\n      Comissaria Smith said there is still some doubt as to how\n  much old crop cocoa is still available as harvesting has\n  practically come to an end. With total Bahia crop estimates\n  around 6.4 mln bags and sales standing at almost 6.2 mln there\n  are a few hundred thousand bags still in the hands of farmers,\n  middlemen, exporters and processors.\n  

In [87]:
reuters.categories(train_docs[0])

['cocoa']

In [278]:
import numpy as np
categories = np.array(reuters.categories())
', '.join(categories)

'acq, alum, barley, bop, carcass, castor-oil, cocoa, coconut, coconut-oil, coffee, copper, copra-cake, corn, cotton, cotton-oil, cpi, cpu, crude, dfl, dlr, dmk, earn, fuel, gas, gnp, gold, grain, groundnut, groundnut-oil, heat, hog, housing, income, instal-debt, interest, ipi, iron-steel, jet, jobs, l-cattle, lead, lei, lin-oil, livestock, lumber, meal-feed, money-fx, money-supply, naphtha, nat-gas, nickel, nkr, nzdlr, oat, oilseed, orange, palladium, palm-oil, palmkernel, pet-chem, platinum, potato, propane, rand, rape-oil, rapeseed, reserves, retail, rice, rubber, rye, ship, silver, sorghum, soy-meal, soy-oil, soybean, strategic-metal, sugar, sun-meal, sun-oil, sunseed, tea, tin, trade, veg-oil, wheat, wpi, yen, zinc'

In [223]:
embeddings = EMBED([' '.join(reuters.raw(i).split()) for i in train_docs])
x_train = session.run(embeddings) 
embeddings = EMBED([' '.join(reuters.raw(i).split()) for i in test_docs])
x_test = session.run(embeddings) 
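The expression `' '.join(text.split())` collapses newlines and repeated spaces into single spaces before a document is embedded. Both calls above pass every document to the encoder at once. If memory becomes a concern, one option is to feed the cleaned texts in batches and concatenate the results. The helper names and the batch size below are hypothetical, and the sketch stops short of calling the encoder itself.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(text.split())

def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up short texts standing in for the Reuters documents
raw = ['BAHIA COCOA REVIEW\n  Showers continued', 'a\tb', 'c', 'd  e', 'f']
cleaned = [normalize_whitespace(text) for text in raw]
batches = list(chunks(cleaned, 2))  # batch size of 2 is illustrative only
print(batches)
```

Each batch could then be run through the same encoder, with the resulting arrays stacked in order.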

In [312]:
from collections import Counter

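# Build one multi-hot label vector per training document:
# position i is 1 if the document belongs to categories[i], else 0
y_train, all_cats = [], []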
for doc in train_docs:
    label = [0 for _ in range(len(categories))]
    for cat in reuters.categories(doc):
        label[np.where(categories == cat)[0][0]] = 1
        all_cats.append(cat)
    y_train.append(label)

In [91]:
sum(y_train[0])

1

In [313]:
# Same multi-hot encoding for the test documents
y_test, all_cats = [], []
for doc in test_docs:
    label = [0 for _ in range(len(categories))]
    for cat in reuters.categories(doc):
        label[np.where(categories == cat)[0][0]] = 1
        all_cats.append(cat)
    y_test.append(label)
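The two loops above build multi-hot label vectors: one slot per category, set to 1 for every category a document belongs to. A compact sketch of the same encoding uses a dictionary lookup in place of `np.where`. The function names and the four-category list are made up for illustration.

```python
def build_index(all_categories):
    """Map each category name to its position in the label vector."""
    return {cat: position for position, cat in enumerate(all_categories)}

def multi_hot(doc_categories, index):
    """Return a 0/1 list with a 1 at the position of each document category."""
    label = [0] * len(index)
    for cat in doc_categories:
        label[index[cat]] = 1
    return label

toy_categories = ['cocoa', 'corn', 'grain', 'wheat']  # illustrative subset
toy_index = build_index(toy_categories)

single = multi_hot(['cocoa'], toy_index)
double = multi_hot(['grain', 'wheat'], toy_index)
print(single, double)
```

The dictionary is built once and reused, which avoids scanning the category array for every label of every document.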

In [314]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
from keras.layers import Dense
from keras.models import Sequential

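model = Sequential()
# Only the first layer needs input_shape: one 512-dimensional embedding per document
model.add(Dense(128, activation='relu', input_shape=x_train.shape[1:]))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
# One sigmoid unit per category, since a document can carry several labels
model.add(Dense(len(reuters.categories()), activation='sigmoid'))

# Multi-label targets with sigmoid outputs are usually trained with
# 'binary_crossentropy'; 'categorical_crossentropy' is kept here so that the
# recorded output below still corresponds to this code
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

epochs = 20
batch_size = 32

# `epochs` replaces the deprecated `nb_epoch` argument
history = model.fit(x_train, np.array(y_train), epochs=epochs,
                    batch_size=batch_size, verbose=1, validation_split=0.1)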

score = model.evaluate(x_test, np.array(y_test), batch_size=batch_size, 
                       verbose=1)

print('\nTest accuracy:', score[1])

Train on 6992 samples, validate on 777 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Test accuracy: 0.7946339847236803
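A single accuracy number is hard to interpret when a document can carry several labels. With this loss, Keras most likely reports whether the highest-scoring category matches the first true label, which ignores any additional labels. A common complement is micro-averaged precision, recall and F1 over all document and category slots. The sketch below computes them in pure Python on made-up values. The 0.5 threshold and the function name `micro_prf` are assumptions, and none of these figures come from the model above.

```python
def micro_prf(y_true, y_prob, threshold=0.5):
    """Micro-averaged precision, recall and F1 for multi-hot labels."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(true_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth == 1:
                tp += 1      # predicted and correct
            elif predicted:
                fp += 1      # predicted but wrong
            elif truth == 1:
                fn += 1      # missed label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up labels and probabilities for two documents and three categories
toy_true = [[1, 0, 0], [0, 1, 1]]
toy_prob = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]
precision, recall, f1 = micro_prf(toy_true, toy_prob)
print(precision, recall, f1)
```

In the toy data there are two correct predictions, one false alarm and one missed label, so precision and recall are both two thirds.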


In [303]:
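# predict() returns the sigmoid outputs, one probability per category
# (predict_proba is not available in newer Keras releases)
probas = model.predict(x_test)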

In [307]:
index = 0
values = np.sort(probas[index])[::-1][:3]
cats = categories[np.argsort(probas[index])[::-1][:3]]
print('\n' + ' '.join(reuters.raw(test_docs[index]).split())[:1000])
print('\nPredictions: %s' % list(zip(cats, values)))


ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT Mounting trade friction between the U.S. And Japan has raised fears among many of Asia's exporting nations that the row could inflict far-reaching economic damage, businessmen and officials said. They told Reuter correspondents in Asian capitals a U.S. Move against Japan might boost protectionist sentiment in the U.S. And lead to curbs on American imports of their products. But some exporters said that while the conflict would hurt them in the long-run, in the short-term Tokyo's loss might be their gain. The U.S. Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17, in retaliation for Japan's alleged failure to stick to a pact not to sell semiconductors on world markets at below cost. Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes. "We wouldn't b
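The top-3 selection above sorts the probabilities, reverses them, and applies the same ordering to the category names through `np.argsort`. The same idea can be sketched without NumPy by sorting label and score pairs together. The labels and scores below are made up, and `top_k` is a hypothetical helper name.

```python
def top_k(labels, scores, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up scores for four of the category names
toy_labels = ['acq', 'earn', 'grain', 'trade']
toy_scores = [0.05, 0.10, 0.30, 0.90]

best = top_k(toy_labels, toy_scores)
print(best)
```

Sorting the pairs together keeps each label attached to its score, so the two lists cannot drift out of step.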