## Research Project 1
```text
- Source: Reuters
- Goal: Build headline aggregator (e.g. Google News)
- Techniques: Word Embeddings, Cosine Similarity
- Tools: TensorFlow Hub
- Lines of code: ~70
```

### Request pages

In [115]:
# Let's go to reuters.com and pick a URL
url = 'https://www.reuters.com/finance/stocks/company-news/AAPL.O?date=05102018'

In [166]:
# Get that page
import requests
res = requests.get(url)
len(res.content)

66395

In [167]:
res.content[:1000] 

b'<!--[if !IE]> This has been served from cache <![endif]-->\n<!--[if !IE]> Request served from apache server: produs--i-0874a8ac4d6c75691 <![endif]-->\n<!--[if !IE]> Cached on Tue, 22 May 2018 00:27:18 GMT and will expire on Tue, 22 May 2018 00:37:17 GMT <![endif]-->\n<!--[if !IE]> token: e9695e1d-1533-4873-a2db-e6559a504258 <![endif]-->\n<!--[if !IE]> App Server /produs--i-08256f73e534880fa/ <![endif]-->\n\n<!doctype html><html lang="en"><head>\n<title>Apple Inc (AAPL.O)  News| Reuters.com</title>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge"><meta charset="utf-8"><meta http-equiv="x-dns-prefetch-control" content="on"><link rel="dns-prefetch" href="//s1.reutersmedia.net"/><link rel="dns-prefetch" href="//s2.reutersmedia.net"/><link rel="dns-prefetch" href="//s3.reutersmedia.net"/><link rel="dns-prefetch" href="//s4.reutersmedia.net"/><link rel="dns-prefetch" href="//static.reuters.com"/><link rel="dns-prefetch" href="//www.googletagservices.com"/><link rel="dns-prefetch" 

### Parse pages

In [117]:
# Parse HTML
from lxml import html
tree = html.fromstring(res.content)
tree.getchildren()

[<Element head at 0x14ced1a98>, <Element body at 0x14ced1b88>]

In [118]:
# What children do we need?
children = tree.xpath('//div[@class="feature"]')

In [119]:
list(children[0].itertext())

['US STOCKS-Wall St rallies and Apple approaches $1 trillion value',
 '\n\t',
 '* Indexes up: Dow 0.80 pct, S&P 500 0.94 pct, Nasdaq 0.89\npct\n(Updates to close)',
 '\n\t']

In [120]:
# Get headlines
headlines = [list(child.itertext())[0] for child in children]
headlines

['US STOCKS-Wall St rallies and Apple approaches $1 trillion value',
 'UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ',
 'Goldman Sachs, Apple to launch joint credit card - WSJ',
 'BRIEF-Apple, Goldman Sachs Team Up On New Credit Card - WSJ',
 'Apple scraps $1 billion Irish data center over planning delays',
 'Apple drops plans for data centre in Ireland due to planning delays - RTE']

In [None]:
# Exercise 1
# Write a `get_ticker_headlines` function that, given a date and a ticker,
# returns the headlines on reuters.com

# Exercise 2
# Write a `get_all_headlines` function that, given a start date, a number of
# days, and a list of tickers, returns the headlines on reuters.com. Have this
# function call `get_ticker_headlines` for each ticker and date.
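
As a starting point for both exercises, the URL-building step can be sketched separately from the download and parsing steps. This is only a sketch: `build_url` and `build_urls` are hypothetical helper names, and the `MMDDYYYY` date format is an assumption inferred from the single example URL above (`date=05102018`).

```python
from datetime import date, timedelta

# URL pattern copied from the example above, with the ticker and date made variable
BASE_URL = 'https://www.reuters.com/finance/stocks/company-news/{ticker}?date={date}'

def build_url(ticker, day):
    """Return the company-news URL for one ticker and one day."""
    # Assumption: the `date` parameter is formatted as MMDDYYYY
    return BASE_URL.format(ticker=ticker, date=day.strftime('%m%d%Y'))

def build_urls(start, n_days, tickers):
    """Return one URL per (day, ticker) combination, in chronological order."""
    days = [start + timedelta(days=offset) for offset in range(n_days)]
    return [build_url(ticker, day) for day in days for ticker in tickers]

urls = build_urls(date(2018, 5, 10), 2, ['AAPL.O'])
print(urls[0])
```

Each URL could then be passed through the `requests.get` and `tree.xpath` steps shown earlier to collect the headlines.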

### Vectorize pages

In [121]:
# If we want to cluster these headlines, we have 2 options:
# 1) heuristics
# 2) machine learning

# Which one to choose?
# simple heuristics > machine learning > complex heuristics

# What would the heuristics look like? We would have to:
# 1) remove leading and trailing all-caps tags (e.g. "BRIEF-", "- WSJ")
# 2) lowercase the sentence
# 3) remove punctuation
# 4) generate a set of possible sentences using synonyms
# 5) count the common words and pick the pairs with highest values

# ... vs ...

# 1) use word embeddings

In [122]:
import tensorflow as tf
import tensorflow_hub as hub
EMBED = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/1")

In [123]:
embeddings = EMBED(headlines)

In [124]:
embeddings

<tf.Tensor 'module_4_apply_default/Encoder_en/hidden_layers/l2_normalize:0' shape=(?, 512) dtype=float32>

In [125]:
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), 
                 tf.tables_initializer()])
    transformed = session.run(embeddings) 

In [128]:
print('Sentence = "%s"\nEncoding = %s' % (headlines[0], transformed[0][:3]))

Sentence = "US STOCKS-Wall St rallies and Apple approaches $1 trillion value"
Encoding = [ 0.05164596  0.044956   -0.02946323]


### Cluster pages

In [130]:
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(transformed, transformed)
sims

array([[0.9999998 , 0.6304252 , 0.71410525, 0.6446738 , 0.6522328 ,
        0.44674984],
       [0.6304252 , 1.        , 0.92891455, 0.7896416 , 0.6916854 ,
        0.5669955 ],
       [0.71410525, 0.92891455, 1.        , 0.85746527, 0.66935897,
        0.4865934 ],
       [0.6446738 , 0.7896416 , 0.85746527, 0.99999994, 0.55105186,
        0.3648151 ],
       [0.6522328 , 0.6916854 , 0.66935897, 0.55105186, 0.99999976,
        0.8390436 ],
       [0.44674984, 0.5669955 , 0.4865934 , 0.3648151 , 0.8390436 ,
        1.        ]], dtype=float32)

In [136]:
print(sims[0][0]); print(sims[1][1]); print(sims[2][2])

0.9999998
1.0
1.0
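
The first diagonal value prints as 0.9999998 rather than 1.0, most likely because of float32 rounding. To make the computation concrete, here is a plain-Python sketch of cosine similarity (dot product divided by the product of the norms). The toy vectors are made up for illustration, and `cosine` and `similarity_matrix` are hypothetical names, not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities as nested lists."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # made-up toy vectors
matrix = similarity_matrix(vectors)

# Compare the diagonal to 1 with a tolerance instead of ==
assert all(math.isclose(matrix[i][i], 1.0, abs_tol=1e-6)
           for i in range(len(vectors)))
```

The same tolerance-based comparison applies to the `sims` matrix above, where an exact `== 1` check would fail on the first element.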


In [168]:
# Exercise 3
# Write a `get_similarities` function that, given a list of headlines,
# returns their cosine similarity matrix. Assert that all diagonal elements
# are approximately equal to 1 (float32 rounding gives values like 0.9999998).

In [158]:
pairs = []
for row in range(sims.shape[0]):
    for column in range(row + 1, sims.shape[1]):
        pair = (row, column, sims[row][column])
        pairs.append(pair)

In [160]:
pairs = sorted(pairs, key=lambda x: x[2], reverse=True)
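
One possible way to turn the scored pairs into groups of headlines is to link every pair above a cutoff and collect the connected groups. The sketch below does this with a small union-find. The 0.8 threshold and the function name `group_by_threshold` are assumptions made for illustration, not values taken from this notebook.

```python
def group_by_threshold(pairs, n_items, threshold=0.8):
    """Merge items linked by any pair whose similarity is >= threshold."""
    parent = list(range(n_items))

    def find(item):
        # Follow parent links up to the representative of the group,
        # shortening the path along the way
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for row, column, sim in pairs:
        if sim >= threshold:
            parent[find(column)] = find(row)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())

# A few (row, column, similarity) triples copied from the similarity matrix
example_pairs = [(1, 2, 0.92891455), (2, 3, 0.85746527), (4, 5, 0.8390436),
                 (1, 3, 0.7896416), (0, 2, 0.71410525)]
print(group_by_threshold(example_pairs, n_items=6))  # [[0], [1, 2, 3], [4, 5]]
```

With this cutoff, headlines 1 to 3 (the Goldman Sachs credit card story) and headlines 4 and 5 (the Irish data center story) each form a group, and headline 0 stays on its own. A lower threshold would merge more aggressively, so the value needs tuning on real data.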

In [161]:
for pair in pairs:
    print('\n%.2f\n%s\n%s' % (pair[2],
                              headlines[pair[0]], 
                              headlines[pair[1]]))


0.93
UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ
Goldman Sachs, Apple to launch joint credit card - WSJ

0.86
Goldman Sachs, Apple to launch joint credit card - WSJ
BRIEF-Apple, Goldman Sachs Team Up On New Credit Card - WSJ

0.84
Apple scraps $1 billion Irish data center over planning delays
Apple drops plans for data centre in Ireland due to planning delays - RTE

0.79
UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ
BRIEF-Apple, Goldman Sachs Team Up On New Credit Card - WSJ

0.71
US STOCKS-Wall St rallies and Apple approaches $1 trillion value
Goldman Sachs, Apple to launch joint credit card - WSJ

0.69
UPDATE 1-Goldman Sachs, Apple to launch joint credit card - WSJ
Apple scraps $1 billion Irish data center over planning delays

0.67
Goldman Sachs, Apple to launch joint credit card - WSJ
Apple scraps $1 billion Irish data center over planning delays

0.65
US STOCKS-Wall St rallies and Apple approaches $1 trillion value
Apple scraps $1 billion Iri