# Description:
In this notebook we request 100 news articles from NewsAPI (the maximum allowed) and use their truncated content to build a corpus. We preprocess the corpus with the custom CorpusPreprocess scikit-learn-like transformer, then train a gensim Doc2Vec model on the preprocessed corpus. We assess the model in two ways: by checking where each training document ranks among its own most similar documents (ideally a document is most similar to itself), and by comparing the content of random documents with the content of their most similar documents. Finally, we request new documents from NewsAPI, apply the same preprocessing, infer their vectors, and assess their quality by retrieving their most similar documents.

# TODO:
- add date, price, weekday, ... token to CorpusPreprocess
- webscrape full content from urls provided by api
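
The first TODO item could be approached with a few regular-expression substitutions applied to the raw text before it reaches `CorpusPreprocess`. The sketch below is only an illustration: the function name `replace_special_tokens`, the placeholder strings, and the patterns themselves are assumptions and not part of `CorpusPreprocess`.

```python
import re

# Hypothetical patterns; they cover only simple, common formats.
_DATE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")
_PRICE = re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*")
_WEEKDAY = re.compile(r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b", re.IGNORECASE)


def replace_special_tokens(text):
    """Replace dates, prices and weekday names with placeholder tokens."""
    text = _DATE.sub(" DATETOKEN ", text)
    text = _PRICE.sub(" PRICETOKEN ", text)
    text = _WEEKDAY.sub(" WEEKDAYTOKEN ", text)
    # Collapse the extra whitespace introduced by the substitutions
    return " ".join(text.split())


example = replace_special_tokens("Tickets cost $25.50 on Friday, 07/09/2021.")
print(example)
```

The placeholder names are arbitrary; any string that survives lowercasing, punctuation stripping and stemming unchanged enough to remain a single token would serve the same purpose.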

In [1]:
from src.features.embedding_eval import compare_documents
from src.data.text_preprocessing import CorpusPreprocess
import os
from dotenv import load_dotenv, find_dotenv
from datetime import datetime, timedelta
import random
import collections
from newsapi import NewsApiClient
from string import punctuation
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim import models
# import numpy as np
# from scipy.spatial.distance import pdist, squareform

In [2]:
# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# if running from container
if dotenv_path == '':
    dotenv_path = "/run/secrets/dotenv-file"  # hard-coded: path to secret passed through docker-compose

# load up the entries as environment variables
load_dotenv(dotenv_path)

NEWSAPIKEY = os.environ.get("NEWSAPIKEY")

In [3]:
# Init
newsapi = NewsApiClient(api_key=NEWSAPIKEY)

# Get news articles
articles = newsapi.get_top_headlines(language='en',
                                     category='sports',  # 'business','entertainment','general','health','science','sports','technology'
                                     # domains='bbc.co.uk',
                                     # from_param=datetime.today() - timedelta(30),
                                     # to=datetime.today(),
                                     page_size=100,
                                     country='us')

corpus = list(set([c['content'] for c in articles['articles'] if c['content']]))

print("Example of article content:\n\n{}".format(corpus[0]))

Example of article content:

The 49ers two violations of the offseason resulted in fines but no cancellations of the teams workouts, sources told NBC Sports Bay Area.
The 49ers engaged in two activities during their voluntary o… [+2823 chars]


In [4]:
# Train/test split: hold out a random 10% of the documents
test_idx = random.sample(range(len(corpus)), int(len(corpus) * 0.1))
test_corpus = [corpus[i] for i in test_idx]
train_corpus = list(set(corpus).difference(set(test_corpus)))
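
Because `random.sample` draws differently on every run, the split and everything downstream of it change between executions. A minimal sketch of a seeded split follows; the helper name `split_corpus`, the seed value and the toy documents are assumptions made for illustration.

```python
import random


def split_corpus(docs, test_fraction=0.1, seed=42):
    """Return (train, test) lists, preserving the original order within each."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    n_test = int(len(docs) * test_fraction)
    test_idx = set(rng.sample(range(len(docs)), n_test))
    train = [d for i, d in enumerate(docs) if i not in test_idx]
    test = [d for i, d in enumerate(docs) if i in test_idx]
    return train, test


toy_docs = ["doc {}".format(i) for i in range(20)]
train_docs, test_docs = split_corpus(toy_docs)
print(len(train_docs), len(test_docs))
```

Note that `corpus` above is built from a `set`, whose iteration order can vary between interpreter sessions, so the input would likely need to be sorted first for a fixed seed to yield the same split every time.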

In [5]:
# Preprocessing - removing stopwords, lowercasing, strip accents, strip punctuation, stemming, max_df and min_df thresholds
prep = CorpusPreprocess(stop_words=stopwords.words('english'), lowercase=True, strip_accents=True,
                        strip_punctuation=punctuation, stemmer=PorterStemmer(), max_df=0.2, min_df=2)
processed_train_corpus = prep.fit_transform(train_corpus)
processed_test_corpus = prep.transform(test_corpus)

print("Example of preprocessed article content:\n\n{}".format(processed_train_corpus[0]))

Example of preprocessed article content:

['49er', 'two', 'violat', 'offseason', 'result', 'fine', 'cancel', 'team', 'workout', 'sourc', 'told', 'nbc', 'sport', 'bay', 'area', '49er', 'engag', 'two', 'activ', 'dure', 'voluntari', '2823', 'char']
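
The trailing tokens `'2823'` and `'char'` in the example come from the `… [+2823 chars]` truncation marker at the end of the article content. If those tokens are unwanted, one option is to strip the marker before preprocessing. The sketch below assumes the marker always has the form seen in the outputs of this notebook; `strip_truncation_marker` is a hypothetical helper, not part of `CorpusPreprocess`.

```python
import re

# Optional ellipsis, optional whitespace, then "[+<digits> chars]" at the end of the text
_TRUNCATION = re.compile(r"(?:…|\.\.\.)?\s*\[\+\d+ chars\]\s*$")


def strip_truncation_marker(text):
    """Remove a trailing '… [+N chars]' marker if present."""
    return _TRUNCATION.sub("", text)


raw = "The 49ers engaged in two activities during their voluntary o… [+2823 chars]"
cleaned = strip_truncation_marker(raw)
print(cleaned)
```

The last word of a truncated article is usually cut off as well (here `o`), so dropping the final partial token may also be worth considering.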


In [6]:
# TaggedDocument format (input to doc2vec)
tagged_corpus = [models.doc2vec.TaggedDocument(text, [i]) for i, text in enumerate(processed_train_corpus)]

# Doc2Vec model
model = models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=200)
model.build_vocab(tagged_corpus)
model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [7]:
model.wv.vocab.keys()  # words in the vocabulary (gensim 3.x API; gensim >= 4 exposes model.wv.key_to_index instead)

dict_keys(['49er', 'two', 'offseason', 'team', 'sport', 'bay', 'dure', 'char', 'dechambeau', 'three', 'everi', 'ha', 'one', 'point', 'shot', 'friday', '2020', 'uefa', 'european', 'championship', 'semifin', '21', 'win', 'saturday', 'way', 'offer', 'chanc', 'opportun', 'court', 'avail', 'full', 'first', 'time', 'wa', 'posit', 'go', 'thi', 'next', 'month', 'right', 'detroit', '14', 'nativ', 'keep', 'trophi', 'second', 'attempt', 'sun', 'mark', 'score', 'half', 'ufc', 'poirier', 'last', 'star', 'philadelphia', '2021', 'nfl', 'season', 'train', 'camp', 'quickli', 'philli', 'make', 'year', 'even', 'red', 'sox', 'look', 'straight', 'tonight', 'behind', 'third', 'ball', 'final', 'settl', 'chicago', 'welcom', 'comment', 'stori', 'sign', 'minor', 'leagu', 'contract', 'week', 'announc', 'pitch', 'mizzou', 'footbal', 'talk', 'lake', 'st', 'loui', 'thursday', 'atlanta', 'north', 'citi', 'five', 'rule', 'test', 'hi', 'mlb', 'start', 'busch', 'monday', 'robert', 'milwauke', 'buck', 'nba', 'champion',

In [8]:
# Assessing Doc2Vec model
ranks = []
for doc_id in range(len(tagged_corpus)):
    inferred_vector = model.infer_vector(tagged_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

# Ideally, as many documents as possible are most similar to themselves (i.e. rank 0)
print(collections.OrderedDict(sorted(collections.Counter(ranks).items())))

OrderedDict([(0, 43), (1, 6), (2, 1)])


### Observation:
Above we can see the distribution of self-similarity ranks: 43 documents have themselves as the most similar document (rank 0), 6 documents have themselves as the second most similar document (rank 1), and so on. The exact counts vary between runs because the corpus and the train/test split change.
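
The rank counts can be condensed into summary numbers, such as the fraction of documents that rank first against themselves. The sketch below uses the counts from the printed output above (they will differ on another run); the variable names are illustrative.

```python
import collections

# Rank counts copied from the printed output: {rank: number of documents}
rank_counts = collections.OrderedDict([(0, 43), (1, 6), (2, 1)])

total = sum(rank_counts.values())
top1_fraction = rank_counts[0] / total
mean_rank = sum(rank * n for rank, n in rank_counts.items()) / total

print("documents: {}, rank-0 fraction: {:.2f}, mean rank: {:.2f}".format(
    total, top1_fraction, mean_rank))
```

In the notebook itself, `collections.Counter(ranks)` could be used in place of the hard-coded counts.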

In [9]:
# Get cosine similarity between random test doc and train docs
base_doc_id = random.choice(range(len(test_corpus)))
inferred_unknown_vector = model.infer_vector(processed_test_corpus[base_doc_id])
sims = model.docvecs.most_similar([inferred_unknown_vector], topn=model.docvecs.count)
compare_out = compare_documents(base_doc_id, test_corpus[base_doc_id], sims, train_corpus)

TARGET (3): «Reggie Bush wants his Heisman Trophy back. 
That was made very clear in a statement posted to his social media account Thursday.
"Over the last few months, on multiple occasions, my team and I have… [+2861 chars]»

SIMILAR/DISSIMILAR DOCS:
MOST (28, 0.9008356332778931): «John Dodson has made his first comments on Friday’s serious car accident that nearly took his life as well as those of his wife and three children.
Dodson, a 17-fight UFC veteran and Season 14 winne… [+2357 chars]»

SECOND-MOST (13, 0.8890371322631836): «The Cubs signed Tony Cingrani to a minor league contract earlier this week, per an announcement from the Lexington Legends of the Atlantic League. The southpaw had been pitching for the independent c… [+987 chars]»

MEDIAN (43, 0.8203971982002258): «The Tampa Bay Lightning are one win away from sweeping the Stanley Cup Final, but if they clinch the series in Game 4 on Monday, families of players and staff will not be on the ice to celebrate.
Th… [+3085

In [10]:
# Get new news articles
new_articles = newsapi.get_everything(language='en',
                                      domains='bbc.co.uk',
                                      from_param=datetime.today() - timedelta(30),
                                      to=datetime.today() - timedelta(20),
                                      page_size=10)

new_corpus = list(set([c['content'] for c in new_articles['articles'] if c['content']]))

# Apply preprocessing
new_processed_corpus = prep.transform(new_corpus)

# Similarity query
doc_id = random.randint(0, len(new_corpus) - 1)  # index into the new corpus, not the test corpus
unknown_doc = new_processed_corpus[doc_id]
inferred_unknown_vector = model.infer_vector(unknown_doc)
sims = model.docvecs.most_similar([inferred_unknown_vector], topn=model.docvecs.count)
compare_out = compare_documents(doc_id, new_corpus[doc_id], sims, train_corpus)

TARGET (2): «Christian Eriksen remains in hospital in a stable condition
The decision to resume Denmark's Euro 2020 opener against Finland following Christian Eriksen's cardiac arrest was the "least bad one", sa… [+3752 chars]»

SIMILAR/DISSIMILAR DOCS:
MOST (16, 0.9594348073005676): «Italy left-back Leonardo Spinazzola has been ruled out of Euro 2020 after tests on Saturday confirmed he had suffered a torn Achilles tendon.
The 28-year-old has been one of the standout performers … [+2400 chars]»

SECOND-MOST (2, 0.9555896520614624): «Denmark are into the 2020 UEFA European Championship semifinals after a 2-1 win over the Czech Republic in Baku, Azerbaijan, on Saturday.
Thomas Delaney and Kasper Dolberg goals were enough for the … [+1112 chars]»

MEDIAN (20, 0.8771800994873047): «It’s been almost 24 hours since we found out the Dallas Cowboys would be appearing on HBO’s “Hard Knocks” series in 2021. Since then there have been a lot of misconceptions and even some misinformati… [+3470