# Sentence embeddings

The code in this notebook shows how to prepare data for indexing in a vector search engine.

It covers the following steps:

* Initializing a pre-trained text vectorization model (with SentenceTransformer)
* Converting text data into vectors and saving them

In [None]:
# We use SentenceTransformer pre-trained models to convert our text into vectors.
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
import pandas as pd
from tqdm.notebook import tqdm

## Download and create a pre-trained sentence encoder

The full list of available models can be found at https://www.sbert.net/docs/pretrained_models.html

In [None]:
model = SentenceTransformer('KBLab/sentence-bert-swedish-cased', device="cuda")
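
Hard-coding `device="cuda"` is likely to fail on a runtime without a GPU. Below is a minimal sketch of a fallback, assuming `torch.cuda.is_available()` is an adequate check for your environment. `pick_device` is a hypothetical helper, not part of sentence-transformers.

```python
def pick_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of sentence-transformers
    except ImportError:
        # Assumption: without PyTorch there is no GPU support to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

The result could then be passed as `device=pick_device()` when creating the model.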


## Import JSON files

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df_e1 = pd.read_json('/content/drive/MyDrive/Colab Notebooks/EDAN70/e1.json', orient='records')
df_e2 = pd.read_json('/content/drive/MyDrive/Colab Notebooks/EDAN70/e2.json', orient='records')
df_e1[-5:]

             headword             entryid  \
84461  Övind Finnsson  e1_84461_ar_403_10   
84462         Öxabäck  e1_84462_ar_403_11   
84463       Öxnebjerg  e1_84463_ar_403_12   
84464       Öxnevalla  e1_84464_ar_403_13   
84465       Öynhausen  e1_84465_ar_403_14   

                                                    text  classifier_type  \
84461    <b>Övind Finnsson.</b> Se Eyvind Skáldaspillir.                0   
84462  <b>Öxabäck,</b> socken i Elfsborgs län, Marks ...                0   
84463  <b>Öxnebjerg,</b> backe på Fyen, 10 km. ö. om ...                0   
84464  <b>Öxnevalla,</b> socken i Elfsborgs län, Mark...                0   
84465                   <b>Öynhausen.</b> Se Oeynhausen.                0   

       class  qid second_edition_key fourth_edition_key      cross_ref_key  \
84461      0    0                                        e1_20469_ad_463_2   
84462      0    0                                                            
84463      0    0               
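
With `orient='records'`, `pd.read_json` expects a JSON array of objects, one object per row. The sketch below illustrates this with two in-memory records. It uses only three of the columns shown above (the real files hold more fields), and `records_json` and `df_demo` are names made up for the illustration.

```python
import io
import pandas as pd

# Two toy records in the "records" layout: a list of {column: value} objects.
records_json = """[
  {"headword": "Övind Finnsson", "entryid": "e1_84461_ar_403_10",
   "text": "<b>Övind Finnsson.</b> Se Eyvind Skáldaspillir."},
  {"headword": "Öynhausen", "entryid": "e1_84465_ar_403_14",
   "text": "<b>Öynhausen.</b> Se Oeynhausen."}
]"""

df_demo = pd.read_json(io.StringIO(records_json), orient="records")

# The "text" column is what gets passed to the encoder later on.
demo_texts = df_demo["text"].tolist()
print(df_demo[["headword", "entryid"]])
print(demo_texts)
```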

## Encode all entries
We pass all texts to `model.encode` in a single call. It splits them into batches internally, which reduces per-call overhead and significantly speeds up the process.

In [None]:
vectors_e1 = model.encode(df_e1["text"].tolist(), show_progress_bar=True)

vectors_e2 = model.encode(df_e2["text"].tolist(), show_progress_bar=True)

print(f"vectors_e1.shape: {vectors_e1.shape}")
print(f"vectors_e2.shape: {vectors_e2.shape}")

Batches:   0%|          | 0/2640 [00:00<?, ?it/s]

Batches:   0%|          | 0/4761 [00:00<?, ?it/s]

vectors_e1.shape: (84466, 768)
vectors_e2.shape: (152348, 768)
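
The batch counts in the progress bars (2640 and 4761) are consistent with a batch size of 32, which to our knowledge is the default value of the `batch_size` argument of `encode`. A quick check of the arithmetic, where `num_batches` is a hypothetical helper:

```python
import math

def num_batches(n_texts, batch_size=32):
    """Number of batches needed to cover n_texts items (the last batch may be smaller)."""
    return math.ceil(n_texts / batch_size)

batches_e1 = num_batches(84466)
batches_e2 = num_batches(152348)
print(batches_e1, batches_e2)
```

If GPU memory allows, a larger `batch_size` passed to `encode` may reduce the number of batches further.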


## Save and download vectors

In [None]:
# You can download these saved vectors and continue with the rest of the tutorial.
np.save('vectors_e1.npy', vectors_e1, allow_pickle=False)
np.save('vectors_e2.npy', vectors_e2, allow_pickle=False)
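
Before moving on, it can be worth checking that a saved file loads back unchanged. The sketch below does a round trip with a small random array in a temporary directory. The array and file name are stand-ins, not the real vectors.

```python
import os
import tempfile
import numpy as np

# Stand-in for the real embeddings: 4 vectors of dimension 768.
rng = np.random.default_rng(0)
demo_vectors = rng.random((4, 768), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_vectors.npy")
    np.save(path, demo_vectors, allow_pickle=False)
    loaded = np.load(path, allow_pickle=False)  # read fully into memory

# Shape, dtype and values should all survive the round trip.
round_trip_ok = (
    loaded.shape == demo_vectors.shape
    and loaded.dtype == demo_vectors.dtype
    and np.array_equal(loaded, demo_vectors)
)
print(loaded.shape, loaded.dtype, round_trip_ok)
```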

In [None]:
# from google.colab import files
# files.download('vectors_e1.npy')
# files.download('vectors_e2.npy')


## Optional part - make a test query

Let's make sure that our vectors were converted correctly and make sense.

To do this, we manually search for the vectors closest to a sample entry.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Take an arbitrary entry (row 20000 of the first dataset) as the query
sample_query = df_e1.iloc[20000].text
print(sample_query)

In [None]:
query_vector = model.encode(sample_query)  # Convert query description into a vector.

In [None]:
scores = cosine_similarity([query_vector], vectors_e2)[0]  # Score the query against every vector in vectors_e2 (brute force)
top_scores_ids = np.argsort(scores)[-5:][::-1]  # Indices of the 5 highest scores, best first

In [None]:
# Check whether the results are similar to the query
for top_id in top_scores_ids:
    print(df_e2.iloc[top_id].text)
    print("-----")
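
For reference, the scoring step can be sketched in plain NumPy: cosine similarity is the dot product of the two vectors after each is scaled to unit length. The sketch below uses toy 2-dimensional vectors and assumes no vector has zero length. `top_k_cosine` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def top_k_cosine(query, matrix, k=5):
    """Return the indices and scores of the k rows of matrix most similar to query."""
    query = np.asarray(query, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    # Scale the query and every row to unit length (assumes non-zero norms).
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix_unit @ query_unit          # one cosine score per row
    top_ids = np.argsort(scores)[-k:][::-1]    # k largest scores, best first
    return top_ids, scores[top_ids]

toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])

top_ids, top_scores = top_k_cosine(toy_query, toy_vectors, k=2)
print(top_ids, top_scores)
```

Scoring every vector like this is fine for a sanity check. A vector search engine exists to avoid this full scan on larger collections.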