# Sentence embeddings
This notebook is meant to be run in Google Colab as it requires a lot of memory.

Code in this notebook shows how to prepare data for indexing in a vector search engine.

It contains the following steps:

* Initializing a pre-trained text vectorization model (with SentenceTransformer)
* Converting text data into vectors and saving them

In [None]:
# We use SentenceTransformer pre-trained models to convert our text into vectors.
!pip install sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

## Download and create a pre-trained sentence encoder

The full list of available models can be found at https://www.sbert.net/docs/pretrained_models.html

In [None]:
model = SentenceTransformer('KBLab/sentence-bert-swedish-cased', device="cuda")

## Import json files

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df_e1 = pd.read_json('/content/drive/MyDrive/Colab Notebooks/EDAN70/e1.json', orient='records')
df_e2 = pd.read_json('/content/drive/MyDrive/Colab Notebooks/EDAN70/e2.json', orient='records')
print(len(df_e1))
print(len(df_e2))
df_e1[-5:]

### Remove cross references from data
- Cross references contain no valuable information and only point to other articles.


In [None]:
df_e1 = df_e1[df_e1.cross_ref_key == ""]
df_e2 = df_e2[df_e2.cross_ref_key == ""]
print(len(df_e1))
print(len(df_e2))
df_e1[-5:]
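
Note that boolean filtering keeps the original index labels, so the filtered frame no longer has a contiguous index. The later cells pair rows with vectors by position (`.iloc`), which remains valid after filtering. A minimal sketch on made-up toy data (only the column names mirror the notebook; the rows are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook.
toy = pd.DataFrame({
    "text": ["alpha", "see beta", "beta", "gamma"],
    "cross_ref_key": ["", "beta", "", ""],
})

# Keep only real entries, i.e. rows whose cross_ref_key is empty.
entries = toy[toy.cross_ref_key == ""]

print(list(entries.index))      # labels keep their original values (with a gap)
print(entries.iloc[1]["text"])  # positional access: the second remaining row

# Optional: renumber the rows so that labels and positions coincide again.
entries_reset = entries.reset_index(drop=True)
print(list(entries_reset.index))
```

Resetting the index is optional here, since the rest of the notebook only uses positional access.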

## Encode all entries
We pass all texts to `model.encode` at once; it processes them in batches, which reduces overhead and significantly speeds up encoding.

In [None]:
vectors_e1 = model.encode(df_e1["text"].tolist(), show_progress_bar=True)
vectors_e2 = model.encode(df_e2["text"].tolist(), show_progress_bar=True)

print(f"vectors_e1.shape: {vectors_e1.shape}")
print(f"vectors_e2.shape: {vectors_e2.shape}")
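
To illustrate what batched encoding means, here is a minimal sketch with a stand-in encoder. `fake_encode` and `encode_in_batches` are hypothetical names used only for this illustration and are not part of sentence-transformers. The real `model.encode` does its own batching internally, controlled by its `batch_size` parameter.

```python
import numpy as np

def fake_encode(texts):
    """Stand-in for a real encoder: maps each text to a 3-dimensional vector."""
    return np.array([[len(t), t.count(" "), 1.0] for t in texts], dtype=np.float32)

def encode_in_batches(texts, batch_size=2):
    """Encode texts batch by batch and stack the results into one array."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(fake_encode(batch))
    return np.vstack(chunks)

texts = ["en katt", "en hund", "ett hus", "en lång mening om något", "slut"]
vectors = encode_in_batches(texts, batch_size=2)
print(vectors.shape)  # one row per input text
```

With a real model, each call would process a whole batch on the GPU at once, which is presumably where most of the speed-up comes from.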

## Save and download vectors

In [None]:
# You can download these saved vectors and continue with the rest of the tutorial.
np.save('vectors_e1.npy', vectors_e1, allow_pickle=False)
np.save('vectors_e2.npy', vectors_e2, allow_pickle=False)
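
Before downloading, it can be worth checking that a saved file loads back unchanged. The sketch below does a save/load round trip on a small random array standing in for the real embeddings; the file name `vectors_demo.npy` is made up for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real embedding matrix.
vectors = np.random.rand(4, 8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors_demo.npy")
    np.save(path, vectors, allow_pickle=False)
    restored = np.load(path, allow_pickle=False)

# Shape, dtype and values should survive the round trip unchanged.
print(restored.shape, restored.dtype)
print(np.array_equal(vectors, restored))
```

The same `np.load(..., allow_pickle=False)` call should work on `vectors_e1.npy` and `vectors_e2.npy` in the next part of the tutorial.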

In [None]:
from google.colab import files
files.download('vectors_e1.npy')
files.download('vectors_e2.npy')

## Optional part - make a test query

Let's make sure that our vectors were converted correctly and make sense.

To do this, we manually search for the vectors closest to a sample entry.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Take an arbitrary entry (position 20000) as the query
sample_query = df_e1.iloc[20000].text
print(sample_query)

In [None]:
query_vector = model.encode(sample_query)  # Convert query description into a vector.

In [None]:
# Score the query against every e2 vector (brute-force search)
scores = cosine_similarity([query_vector], vectors_e2)[0]
# Select the indices of the 5 highest scores, best first
top_scores_ids = np.argsort(scores)[-5:][::-1]

In [None]:
# Check whether the results are similar to the query
for top_id in top_scores_ids:
  print(df_e2.iloc[top_id].text)
  print("-----")
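
For reference, the same brute-force search can be written with NumPy alone. This is a minimal sketch on hypothetical toy vectors, assuming no vector has zero length; `top_k_cosine` is a name introduced here for illustration and is not part of any library.

```python
import numpy as np

def top_k_cosine(query, matrix, k=5):
    """Return indices and scores of the k rows of matrix most similar to query."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix @ query                  # cosine similarity per row
    top_ids = np.argsort(scores)[-k:][::-1]  # highest score first
    return top_ids, scores[top_ids]

# Hypothetical toy vectors, not real embeddings.
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.0])

ids, top_scores = top_k_cosine(query, matrix, k=2)
print(ids, top_scores)
```

Scoring every vector like this is fine for a sanity check; a vector search engine would do the same lookup with an index instead of a full scan.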