# 2 - Create vectors from parsed articles and index them

In this notebook I read the Json produced before, merge text cells and create a vector representation of each article's full text.

I could have used Transformers* library, but since I'm running this demo into my notebook, I adopted Spacy for simplicity. Spacy uses the average vector representation of each token ("words"). 

\*Since articles full text are bigger than traditional BERT models inputs (i.e. 512 tokens), we must use [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer) to not lose to much content after truncation.

## 2.1 - Create Vectors

In [1]:
import os
import spacy
import json
import random
import numpy as np
from tqdm import tqdm

In [2]:
# Read parsed articles from json
with open('../articles.json', 'r') as f:
    parsed_articles = json.load(f)

Here I concatenate all sections text from parsed articles to create a "full text" for each document.

In [3]:
full_texts = []
for doc in parsed_articles:
    text = ' '.join([doc['title'] + ' ' + doc['abstract'] + ' ' + section['text'] for section in doc['sections']])
    full_texts.append(text)


In [4]:
# print a sample full_text
print(random.choice(full_texts)[:1000])

End-to-End Language Diarization for Bilingual Code-Switching Speech We propose two end-to-end neural configurations for language diarization on bilingual code-switching speech. The first, a BLSTM-E2E architecture, includes a set of stacked bidirectional LSTMs to compute embeddings and incorporates the deep clustering loss to enforce grouping of languages belonging to the same class. The second, an XSA-E2E architecture, is based on an x-vector model followed by a self-attention encoder. The former encodes frame-level features into segmentlevel embeddings while the latter considers all those embeddings to generate a sequence of segment-level language labels. We evaluated the proposed methods on the dataset obtained from the shared task B in WSTCSMC 2020 and our handcrafted simulated data from the SEAME dataset. Experimental results show that our proposed XSA-E2E architecture achieved a relative improvement of 12.1% in equal error rate and a 7.4% relative improvement on accuracy compared 

In [5]:
assert len(full_texts) == len(parsed_articles)

In [6]:
# You must install the model into virtual env.
nlp = spacy.load("en_core_web_sm")

In [7]:
# Generate Vector representation for random article
doc = nlp(random.choice(full_texts))
doc.vector.shape

(96,)

As shown above, Spacy produces a 96 positions vector.

## 2.2 Load and save index

For creating an index, I'll use Facebook's [Faiss library](https://github.com/facebookresearch/faiss). It is really fast to load and search for vector similarity.

In [8]:
import faiss

In [9]:
vectors_from_docs = []
for full_text in tqdm(full_texts):
    doc = nlp(full_text)
    vectors_from_docs.append(doc.vector)

100%|██████████████████████████████████████████████████████████████| 992/992 [14:25<00:00,  1.15it/s]


In [15]:
arr_docs = np.array(vectors_from_docs)
index = faiss.IndexFlatL2(arr_docs.shape[1])
index.add(arr_docs)

## 2.3 Test the index

Now I'll test the index by searching an abstract from a random file (this vector representation). I hope it returns the choosen document first :-)

In [23]:
n_near_docs = 10

idx = random.choice(range(len(full_texts)))
print('Querying document id %d : %s' % (idx, parsed_articles[idx]['title']), '\n\n')
v = nlp(parsed_articles[idx]['abstract']).vector
query = np.array([v])

scores, results = index.search(query, n_near_docs) 
for i, result in enumerate(results[0]):
    print(i+1, result, parsed_articles[result]['title'], ' - vector distance: ', scores[0][i])

Querying document id 967 : Factorization-Aware Training of Transformers for Natural Language Understanding On the Edge 


1 967 Factorization-Aware Training of Transformers for Natural Language Understanding On the Edge  - vector distance:  0.11023985
2 812 SynthASR: Unlocking Synthetic Data for Speech Recognition  - vector distance:  0.21954344
3 697 Low Resource ASR: The surprising effectiveness of High Resource Transliteration  - vector distance:  0.26151678
4 425 Noisy student-teacher training for robust keyword spotting  - vector distance:  0.27137
5 136 Enhancing Semantic Understanding with Self-supervised Methods for Abstractive Dialogue Summarization  - vector distance:  0.2738651
6 965 Simulating reading mistakes for child speech Transformer-based phone recognition  - vector distance:  0.27791983
7 649 Multimodal Speech Summarization through Semantic Concept Learning  - vector distance:  0.27955014
8 567 Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptati

In [19]:
# Saving the index for further usage
faiss.write_index(index, '../faiss_index.out')

In [20]:
#index = faiss.read_index('../faiss_index.out')