In [1]:
import pandas as pd
import transformers
from embedding_model import Model

In [2]:
# Load pretrained Model
model = Model()
model.make_pretrained('stsb-distilbert-base')

Model Compiled: stsb-distilbert-base


In [3]:
# Load the sample Dataset
from sample_dataset import SampleDataset
arxiv_data = SampleDataset()
arxiv_data.load()

# The dataset is a sample of arXiv paper extracts, each paired with a
# numeric label used internally.
# Inspect two entries:
arxiv_data[45:47]

(['MIT at SemEval-2017 Task 10: Relation Extraction with Convolutional\n  Neural Networks. Over 50 million scholarly articles have been published: they constitute a\nunique repository of knowledge. In particular, one may infer from them\nrelations between scientific concepts, such as synonyms and hyponyms.\nArtificial neural networks have been recently explored for relation extraction.\nIn this work, we continue this line of work and present a system based on a\nconvolutional neural network to extract relations. Our model ranked first in\nthe SemEval-2017 task 10 (ScienceIE) for relation extraction in scientific\narticles (subtask C).',
  'Transfer Learning for Named-Entity Recognition with Neural Networks. Recent approaches based on artificial neural networks (ANNs) have shown\npromising results for named-entity recognition (NER). In order to achieve high\nperformances, ANNs need to be trained on a large labeled dataset. However,\nlabels might be difficult to obtain for the dataset on

In [43]:
# Encode documents into vectors of shape (n_samples, 768)
# Take up to sample_len samples, starting at offset i
i = 0
sample_len = 1000

data, data_labels = arxiv_data[i:i + sample_len]

data_vec = model.encode_sentences(data)
print("Data to be added to the index: ", len(data_labels), data_vec.shape)

Data to be added to the index:  300 (300, 768)
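
For readers wondering how a variable-length text becomes a single fixed-size vector: many sentence-embedding models average the per-token vectors of the underlying transformer while ignoring padding (mean pooling). The sketch below shows that step in NumPy on toy data. Whether `Model.encode_sentences` works this way is an assumption; `mean_pool` and the toy arrays are hypothetical and use 2 dimensions instead of 768.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Two toy "sentences" of three token slots each; the first has one padded slot.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = mean_pool(tokens, mask)
print(pooled.shape)  # one vector per sentence
```

The padded slot in the first sentence does not affect its average, which is why the mask is applied before summing.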


In [44]:
# Build the index from the documents
import os
import shutil
import re
import hnswlib
import numpy as np
from utils import *  # presumably provides MAX_SEARCH_THREADS

# Parameters used to initialize the HNSW index
HNSW_PARAMS = {
    "save_file": 'hnsw_index.bin',
    "M": 200,
    "ef_construction": 200,
    "num_threads": MAX_SEARCH_THREADS,
    "label_mapping": arxiv_data.label_mapping 
}

from search_index import Index

# Instantiate the search index wrapper class
hnsw_index = Index(HNSW_PARAMS)


In [45]:
print(hnsw_index)


        HNSW Index Params: 
            SAVE_DIR: model/v1/hnsw_index/,
            SAVE_FILE: hnsw_index.bin,
            CURR_IDX_SIZE: 10,
            M: 200,
            ef_construction: 200,
            item_batch_size: 10,
            num_threads: -1,
            index_loaded: False
        


In [46]:
# Create a new index with the given parameters and save it to the given file
hnsw_index.define_index(idx_size=10)
hnsw_index.init_search()

First time?
Saving Fresh New index: model/v1/hnsw_index/hnsw_index.bin
Save index size: 10
Loading saved index


In [48]:
# Add the data to the index
hnsw_index.update_index(data_vec, data_labels)


            Loading Previously saved index:
                Loading File:     hnsw_index.bin
                New max_elements: 600
        
Adding n=10 batches of items from the data
Saving new index of size 300
Save index size: 300


In [53]:
MAX_NEAREST_NBRS = 15
query_text = "representation learning"

# Encode the query, then look up its nearest neighbours in the index
query_vec = model.encode_sentences(query_text)
matches = hnsw_index.search(query_vec, max_nearest_nbrs=MAX_NEAREST_NBRS)
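
An HNSW index returns approximate nearest neighbours. For a small collection, an exact brute-force search is a useful baseline for sanity-checking the results. The sketch below ranks documents by cosine similarity in NumPy; `cosine_top_k` and the toy vectors are hypothetical, and using cosine similarity assumes that is the metric the index was built with.

```python
import numpy as np


def cosine_top_k(query_vec, doc_vecs, k):
    """Return the indices and similarities of the k most similar rows."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # cosine similarity of each row to the query
    top = np.argsort(-sims)[:k]    # highest similarity first
    return top, sims[top]


docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.1])

top_ids, top_sims = cosine_top_k(query, docs, k=2)
print(top_ids, top_sims)
```

With real data, `docs` would be `data_vec` and `query` the encoded query vector; comparing `top_ids` against the labels returned by the index gives a rough recall check.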

In [54]:
hnsw_index.get_current_count()

300

In [55]:
len(matches['matches'])

15

In [56]:
print(f"Matches to Query: {query_text}")
matches['matches']

Matches to Query: representation learning


[{146: ('Learning unbiased features. A key element in transfer learning is representation learning; if\nrepresentations can be developed that expose the relevant factors underlying\nthe data, then new tasks and domains can be learned readily based on mappings\nof these salient factors. We propose that an important aim for these\nrepresentations are to be unbiased. Different forms of representation learning\ncan be derived from alternative definitions of unwanted bias, e.g., bias to\nparticular tasks, domains, or irrelevant underlying data dimensions. One very\nuseful approach to estimating the amount of bias in a representation comes from\nmaximum mean discrepancy (MMD) [5], a measure of distance between probability\ndistributions. We are not the first to suggest that MMD can be a useful\ncriterion in developing representations that apply across multiple domains or\ntasks [1]. However, in this paper we describe a number of novel applications of\nthis criterion that we have devised, all