# Description:
In this notebook we use the Vec4IR framework to test several semantic retrieval settings. We implement the following document representations: TF-IDF, the average (centroid) of word2vec embeddings, and doc2vec.
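
To make the "average of word2vec embeddings" representation concrete, here is a minimal sketch in plain Python. The toy vectors, the words, and the helper names `centroid` and `cosine` are made up for illustration; this is not how Vec4IR implements its word centroid model, only the general idea of averaging word vectors and comparing the averages by cosine similarity.

```python
import math

# Toy 3-dimensional word vectors (values invented for illustration only).
toy_vectors = {
    "election": [0.9, 0.1, 0.0],
    "vote": [0.8, 0.2, 0.1],
    "tennis": [0.0, 0.1, 0.9],
}


def centroid(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no known word, so no representation
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query_vec = centroid(["election", "vote"], toy_vectors)
politics_doc = centroid(["vote", "election", "unknownword"], toy_vectors)
sports_doc = centroid(["tennis"], toy_vectors)

print(cosine(query_vec, politics_doc), cosine(query_vec, sports_doc))
```

Out-of-vocabulary tokens are simply skipped here, which is one common choice among several.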

# TODO:
- Use pre-trained word-embeddings in Setting 2: https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html

In [1]:
from utility import CorpusPreprocess
import os
import pandas as pd
from nltk.corpus import stopwords
from string import punctuation
from gensim.models import Word2Vec, Doc2Vec
from sklearn.model_selection import train_test_split
from vec4ir.doc2vec import Doc2VecInference
from vec4ir.core import Retrieval
from vec4ir.base import Tfidf, Matching
from vec4ir.word2vec import WordCentroidDistance

In [5]:
data_path = os.path.join("..", "data", "raw", "bbc")

# Reading files into memory
all_files = [os.path.join(dp, f) for dp, dn, fn in os.walk(os.path.expanduser(data_path)) for f in fn][1:]
corpus = []
for file in all_files:
    with open(file, 'r') as f:
        corpus.append(f.read())

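# Saving the topic of each article (a component of its file path);
# os.sep keeps the split working on both Windows and POSIX systems
topics = [os.path.normpath(path).split(os.sep)[3] for path in all_files]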

# df = pd.read_csv(os.path.join(data_path, "sts-train.csv"))
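
The topic label above is read from a fixed position in the file path. As a hedged alternative, the sketch below takes the name of the directory that directly contains each file. It assumes a layout in which every article sits inside a folder named after its topic; the example paths and the helper `topic_from_path` are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def topic_from_path(path):
    """Return the name of the directory that directly contains the file."""
    return path.parent.name


# Hypothetical example paths in both path flavours.
posix_example = PurePosixPath("../data/raw/bbc/business/001.txt")
windows_example = PureWindowsPath(r"..\data\raw\bbc\business\001.txt")

print(topic_from_path(posix_example), topic_from_path(windows_example))
```

In the notebook itself, `pathlib.Path(path)` would pick the right flavour for the current platform.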

In [6]:
# Train/test split
train_corpus, test_corpus, train_topics, test_topics = train_test_split(corpus, topics, test_size=0.1, random_state=0)

# Preprocessing
prep = CorpusPreprocess(stop_words=stopwords.words('english'), lowercase=True, strip_accents=True,
                        strip_punctuation=punctuation, stemmer=True, max_df=0.5, min_df=3)
processed_train_corpus = prep.fit_transform(train_corpus, tokenize=False)
processed_test_corpus = prep.transform(test_corpus, tokenize=False)
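
`CorpusPreprocess` comes from the project's own `utility` module, so its internals are not shown here. The sketch below is only a rough illustration of what lowercasing, accent stripping, punctuation removal, and stop-word filtering might look like with the standard library. The function names and the tiny stop-word list are assumptions, and stemming and the `max_df`/`min_df` filtering are left out.

```python
import string
import unicodedata

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "is", "of", "and", "in", "for"}


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_preprocess(text):
    """Lowercase, strip accents and punctuation, and remove stop words."""
    text = strip_accents(text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)


cleaned = simple_preprocess("The café is OPEN, and the election is in February!")
print(cleaned)
```

Like the notebook's `tokenize=False` setting, the sketch returns a single string per document rather than a list of tokens.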

In [18]:
# Setting 1 - Default Matching | tfidf model | No query expansion
match_op = Matching()
tfidf = Tfidf()
retrieval = Retrieval(retrieval_model=tfidf, matching=match_op)
retrieval.fit(processed_train_corpus)

# Querying using the fitted Retrieval model
query = "American elections republicans"
idx = retrieval.query(prep.transform([query], tokenize=False)[0], k=3)  # return top 3 documents
results = [train_corpus[i] for i in idx.tolist()]
print("Most similar document to query: \"{}\"\n\n{}".format(query, results[0]))

Most similar document to query: "American elections republicans"

February poll claim 'speculation'

Reports that Tony Blair is planning a snap general election for February 2005 have been described as "idle speculation" by Downing Street.

A spokesman said he had "no idea" where the reports in the Sunday Times and Sunday Telegraph had come from. The papers suggest ministers believe the government could benefit from a "Baghdad bounce" following successful Iraq elections in January. A British general election was last held in February in 1974. In that election, Edward Heath lost and failed to build a coalition with the Liberals. Harold Wilson took over and increased his majority later in the year in a second election

The latest speculation suggests the prime minister favours a February poll in order to exploit his current opinion poll lead over Conservative leader Michael Howard. But that strategy could prompt criticism he was seeking to "cut and run" after less then four years of a pa
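
For intuition about Setting 1, the following sketch ranks three toy documents against a query using TF-IDF weights and cosine similarity. It is written in plain Python and is not the Vec4IR `Tfidf` implementation; the documents, the smoothed IDF formula, and the helper names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus, already preprocessed into whitespace-separated tokens.
docs = [
    "election poll vote election",
    "tennis match cup",
    "election tennis cup",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for toks in tokenized for term in set(toks))


def tfidf(tokens):
    """Sparse TF-IDF vector using one common smoothed IDF variant."""
    counts = Counter(tokens)
    return {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
        if t in doc_freq  # ignore terms never seen in the corpus
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, k=3):
    """Return the indices of the k documents most similar to the query."""
    q = tfidf(query.split())
    scores = [(cosine(q, tfidf(toks)), i) for i, toks in enumerate(tokenized)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))[:k]]


top = rank("election vote", k=2)
print(top)
```

Because the score depends only on overlapping terms, a query about American elections can still match an article about a British election, as in the result above.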

In [19]:
# Setting 2 - Default Matching | WordCentroid model | No query expansion
match_op = Matching()
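# Word2Vec expects tokenized sentences (lists of tokens), so split each preprocessed string;
# a raw string would be iterated character by character
model = Word2Vec([doc.split() for doc in processed_train_corpus], min_count=1)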
wcd = WordCentroidDistance(model.wv)
retrieval = Retrieval(retrieval_model=wcd, matching=match_op)
retrieval.fit(processed_train_corpus)

# Querying using the fitted Retrieval model
query = "American elections republicans"
idx = retrieval.query(prep.transform([query], tokenize=False)[0], k=3)  # return top 3 documents
results = [train_corpus[i] for i in idx.tolist()]
print("Most similar document to query: \"{}\"\n\n{}".format(query, results[0]))

Embedding shape: (37, 100)
Most similar document to query: "American elections republicans"

Moya clinches Cup for Spain

Spain won the Davis Cup for the second time in their history when Carlos Moya beat the USA's Andy Roddick in the fourth rubber in Seville.

Moya won 6-2 7-6 (7-1) 7-6 (7-5) to give the hosts an unassailable 3-1 lead with only one singles rubber remaining. Roddick battled hard and had chances in the second set, but Moya's clay-court expertise proved the difference. Mardy Fish beat Tommy Robredo 7-6 (8-6) 6-2 in the final dead rubber to cut Spain's winning margin to 3-2. Spain's only other Davis Cup title came in 2000, when they beat Australia in Barcelona. This time they chose to play the final in Seville and the city's Olympic Stadium was revamped to allow for a record crowd for a competitve tennis event of 27,000 spectators. And the home fans gave vociferous support to their players, with 18-year-old Nadal and Moya winning both Friday's singles rubbers. American tw
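
A note on the input format for `Word2Vec`: gensim documents its `sentences` argument as an iterable of token lists. When raw strings are passed instead, each string is iterated character by character, so the model learns character embeddings rather than word embeddings. A vocabulary of 37 entries, as in the embedding shape reported above, would be consistent with 26 letters, 10 digits, and the space character, although that is an inference and not something the notebook states. The sketch below reproduces the effect with the standard library only; `build_vocab` and the toy documents are hypothetical.

```python
# Two toy preprocessed documents stored as plain strings.
docs = ["spain won the cup", "the election poll"]


def build_vocab(sentences):
    """Collect the distinct tokens seen when iterating over each sentence."""
    return {token for sentence in sentences for token in sentence}


# Raw strings are iterated character by character.
char_vocab = build_vocab(docs)

# Lists of tokens are iterated word by word.
word_vocab = build_vocab([d.split() for d in docs])

print(sorted(char_vocab))
print(sorted(word_vocab))
```

Splitting each preprocessed string on whitespace before training is one simple way to obtain word-level tokens.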

In [21]:
# Setting 3 - Default Matching | Doc2vec model | No query expansion
match_op = Matching()
# Load a previously trained Doc2Vec model from disk
model = Doc2Vec.load(os.path.join("..", "models", "doc2vec_model"))
doc2vec = Doc2VecInference(model=model, analyzer=lambda x: x.split())
retrieval = Retrieval(retrieval_model=doc2vec, matching=match_op)
retrieval.fit(processed_train_corpus)

# Querying using the fitted Retrieval model
query = "American elections republicans"
idx = retrieval.query(prep.transform([query], tokenize=False)[0], k=3)  # return top 3 documents
results = [train_corpus[i] for i in idx.tolist()]
print("Most similar document to query: \"{}\"\n\n{}".format(query, results[0]))

Most similar document to query: "American elections republicans"

Redford's vision of Sundance

Despite sporting a corduroy cap pulled low over his face plus a pair of dark glasses, Robert Redford cuts an unmistakable figure through the star-struck crowds at Sundance.

It's a rare downtown appearance for the man who started the annual festival in Park City, Utah back in the 1980s. Now in its twenty-first year, Sundance continues to grow. Some 45,000 people are estimated to have descended on this small ski town with nothing but movies on the mind. It's an opportunity to meet and make deals. Redford wanted Sundance to be a platform for independent film-makers, but the commercial success of many showcased films have led to criticism that the festival is becoming too mainstream.

Smaller festivals like Slamdance and XDance, which take place during the same week in Park City, are competing for Sundance's limelight. But Redford is not worried. "The more the merrier," he says. "The point was 