# Embedder

Embed the chunked content with a language model tuned for high-recall retrieval. For now, we use `HuggingFaceInstructEmbeddings`.

**Note**
- Run this notebook on compute with a CUDA-capable GPU; the cells below load the models with `device='cuda'`.
- The cells below install the required packages.

In [2]:
!pip uninstall -y sentence-transformers==2.2.2
!pip install sentence-transformers==2.2.2 --no-cache-dir

Found existing installation: sentence-transformers 2.2.2
Uninstalling sentence-transformers-2.2.2:
  Successfully uninstalled sentence-transformers-2.2.2
Collecting sentence-transformers==2.2.2
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125925 sha256=246e5ba547c51e56c50ba60d24952cd61351b26a9766ef404afce6a3389fac0f
  Stored in directory: /tmp/pip-ephem-wheel-cache-oga0wbnh/wheels/5e/6f/8c/d88aec621f3f542d26fac0342bef5e693335d125f4e54aeffe
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2


In [12]:
!pip uninstall -y sentence-transformers
!pip install -U sentence-transformers --no-cache-dir

Found existing installation: sentence-transformers 2.2.2
Uninstalling sentence-transformers-2.2.2:
  Successfully uninstalled sentence-transformers-2.2.2
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125925 sha256=c7d65dc041da6311299175aab42d7917563c37d75f80f01ee560d63e493bc3ae
  Stored in directory: /tmp/pip-ephem-wheel-cache-3myjvoym/wheels/5e/6f/8c/d88aec621f3f542d26fac0342bef5e693335d125f4e54aeffe
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2


In [4]:
!pip uninstall -y InstructorEmbedding==1.0.1
!pip install InstructorEmbedding==1.0.1 --no-cache-dir


Found existing installation: InstructorEmbedding 1.0.1
Uninstalling InstructorEmbedding-1.0.1:
  Successfully uninstalled InstructorEmbedding-1.0.1
Collecting InstructorEmbedding==1.0.1
  Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl (19 kB)
Installing collected packages: InstructorEmbedding
Successfully installed InstructorEmbedding-1.0.1


In [None]:
!pip install neo4j

In [None]:
!pip uninstall -y azure-identity azure-keyvault-secrets azure-keyvault
!pip install azure-identity azure-keyvault-secrets azure-keyvault

In [4]:
!pip uninstall -y langchain
!pip install langchain

Collecting langchain
  Using cached langchain-0.0.337-py3-none-any.whl (2.0 MB)
Installing collected packages: langchain
Successfully installed langchain-0.0.337


## Get Credentials

In [None]:
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

key_vault_name = "kv-bsauwmno"
kv_uri = f"https://{key_vault_name}.vault.azure.net/"

credential = DefaultAzureCredential()
client = SecretClient(vault_url=kv_uri, credential=credential)

# Fetch the Neo4j URL, user, and password from Key Vault for use in the application
neo4j_url = client.get_secret("NEO4JURL").value
neo4j_user = client.get_secret("NEO4JUSER").value
neo4j_password = client.get_secret("NEO4JPASSWORD").value

In [5]:
import langchain
langchain.verbose = True
langchain.debug = True

In [11]:
import pandas as pd

# Read the raw chunks from the pickle file
chunks_raw_df = pd.read_pickle('./data/chunks_raw.pkl')

chunks_raw_df.head()

Unnamed: 0,content,banner_title,banner_divisions,intro,toc,url,scrape_date,chunk_s500_o60,chunk_order
0,\n## Waarom willen de kinderartsen dat je baby...,24 uur observatie van de pasgeborene,[{'division_url': 'https://www.azstlucas.be/sp...,Elke pasgeborene wordt - bij overnachting - bi...,[{'link_url': '#waarom-willen-we-dat-je-baby-m...,https://www.azstlucas.be/onderzoek-en-behandel...,12/11/2023 15:26:50,## Waarom willen de kinderartsen dat je baby m...,0
0,\n## Waarom willen de kinderartsen dat je baby...,24 uur observatie van de pasgeborene,[{'division_url': 'https://www.azstlucas.be/sp...,Elke pasgeborene wordt - bij overnachting - bi...,[{'link_url': '#waarom-willen-we-dat-je-baby-m...,https://www.azstlucas.be/onderzoek-en-behandel...,12/11/2023 15:26:50,## Mogelijke afwijkingen\n\n\n### Aangeboren h...,1
0,\n## Waarom willen de kinderartsen dat je baby...,24 uur observatie van de pasgeborene,[{'division_url': 'https://www.azstlucas.be/sp...,Elke pasgeborene wordt - bij overnachting - bi...,[{'link_url': '#waarom-willen-we-dat-je-baby-m...,https://www.azstlucas.be/onderzoek-en-behandel...,12/11/2023 15:26:50,### Infecties\n\n\nVerschillende infecties wor...,2
0,\n## Waarom willen de kinderartsen dat je baby...,24 uur observatie van de pasgeborene,[{'division_url': 'https://www.azstlucas.be/sp...,Elke pasgeborene wordt - bij overnachting - bi...,[{'link_url': '#waarom-willen-we-dat-je-baby-m...,https://www.azstlucas.be/onderzoek-en-behandel...,12/11/2023 15:26:50,### Aangeboren darmafwijkingen\n\n\nHet is pas...,3
0,\n## Waarom willen de kinderartsen dat je baby...,24 uur observatie van de pasgeborene,[{'division_url': 'https://www.azstlucas.be/sp...,Elke pasgeborene wordt - bij overnachting - bi...,[{'link_url': '#waarom-willen-we-dat-je-baby-m...,https://www.azstlucas.be/onderzoek-en-behandel...,12/11/2023 15:26:50,De kinderarts onderzoekt je baby in normale om...,4


## Load Embedding Model 

Hugging Face Hub model

In [6]:
import torch

# List the name of every CUDA device visible to PyTorch
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_properties(i).name)

Tesla V100-PCIE-16GB


## Models to try out
- Try Instructor with different instructions for document storage and retrieval
- https://huggingface.co/jegormeister/robbert-v2-dutch-base-mqa-finetuned
- https://huggingface.co/intfloat/multilingual-e5-base
- intfloat/multilingual-e5-large
- timpal0l/mdeberta-v3-base-squad2
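
The E5 checkpoints in this list expect every input to start with `query: ` or `passage: ` (as stated on their model cards), which mirrors the idea of using different instructions for storing and retrieval with Instructor. A minimal sketch of keeping the two sides apart; the helper name `add_prefix` and the sample query are hypothetical and only for illustration.

```python
from typing import List

# Prefixes documented on the intfloat/multilingual-e5 model cards.
E5_PASSAGE_PREFIX = "passage: "
E5_QUERY_PREFIX = "query: "


def add_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend `prefix` to every text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


# Hypothetical sample inputs: one stored chunk and one user question.
passages = add_prefix(["De kinderarts onderzoekt je baby."], E5_PASSAGE_PREFIX)
queries = add_prefix(["Waarom blijft mijn baby 24 uur?"], E5_QUERY_PREFIX)
print(passages[0])
print(queries[0])
```

The prefixed lists would then be passed to the encoder in place of the raw texts: passages when building the index, queries at search time.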

In [25]:
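from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    'jegorkitskerkin/robbert-v2-dutch-base-mqa-finetuned',
    cache_folder='./models/robbert-v2-dutch-base-mqa-finetuned',
    device='cuda'
)

# Texts to encode. Note: 'content' holds the full page text and repeats for
# every chunk of the same page, so consecutive rows yield identical vectors.
sentences: List[str] = list(chunks_raw_df['content'])

# encode() returns a NumPy array of shape (n_sentences, embedding_dim) by default
robbert_mqa_embeddings: np.ndarray = model.encode(
    sentences,
    show_progress_bar=True
)
print(robbert_mqa_embeddings)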

Batches:   0%|          | 0/139 [00:00<?, ?it/s]

[[-0.33865708 -0.53614783 -0.50242066 ... -0.23327178  1.3994951
  -0.31295717]
 [-0.33865708 -0.53614783 -0.50242066 ... -0.23327178  1.3994951
  -0.31295717]
 [-0.33865708 -0.53614783 -0.50242066 ... -0.23327178  1.3994951
  -0.31295717]
 ...
 [-0.1205714   0.53177935  0.14113781 ... -0.0951068   1.1831994
  -1.0692893 ]
 [-0.1205714   0.53177935  0.14113781 ... -0.0951068   1.1831994
  -1.0692893 ]
 [-0.1205714   0.53177935  0.14113781 ... -0.0951068   1.1831994
  -1.0692893 ]]


In [26]:
len(robbert_mqa_embeddings)

4444
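
The first rows of the printed matrix are identical because `content` holds the full page text, which repeats for every chunk of a page. One way to avoid encoding the same string several times is to encode each distinct text once and map the vectors back. This is a sketch only: `embed_unique` is a hypothetical helper, `encode_fn` stands in for `model.encode`, and `fake_encode` is a toy encoder used purely for the demonstration.

```python
from typing import Callable, Dict, List, Sequence


def embed_unique(
    texts: Sequence[str],
    encode_fn: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[Sequence[float]]:
    """Encode each distinct text once, then return one vector per input text."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, keep first-seen order
    vectors = encode_fn(unique_texts)
    lookup: Dict[str, Sequence[float]] = dict(zip(unique_texts, vectors))
    return [lookup[t] for t in texts]


def fake_encode(batch: List[str]) -> List[List[float]]:
    """Toy stand-in for a real encoder: maps each text to [len(text)]."""
    return [[float(len(t))] for t in batch]


print(embed_unique(["aa", "aa", "b"], fake_encode))
```

With the objects from this notebook, the call would look like `embed_unique(sentences, model.encode)`.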

In [29]:
from langchain.embeddings import HuggingFaceInstructEmbeddings

model_kwargs = {
    'device': 'cuda'
}
encode_kwargs = {
    # 'normalize_embeddings': True,
    'show_progress_bar': True
}
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-xl", 
    cache_folder='./models/model_cache_xl',
    embed_instruction="Represent the Medical paragraph for retrieval: ",
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

load INSTRUCTOR_Transformer


In [18]:
torch.cuda.is_available()

True

In [17]:
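from typing import List

# Use a separate name for the vectors so the `embeddings` model object is not overwritten
documents = list(chunks_raw_df['content'])
document_embeddings: List[List[float]] = embeddings.embed_documents(documents)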

KeyboardInterrupt: 
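
The run above was interrupted before it finished. A possible way to make long embedding runs more forgiving is to process the texts in batches and keep what has already been computed. This is a sketch under assumptions: `batched` and `embed_in_batches` are hypothetical helpers, `batch_size=64` is an arbitrary default, and `embed_fn` stands in for `embeddings.embed_documents`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed `texts` batch by batch; on Ctrl-C, return the vectors finished so far."""
    results: List[List[float]] = []
    try:
        for batch in batched(texts, batch_size):
            results.extend(embed_fn(batch))
    except KeyboardInterrupt:
        print(f"Interrupted after {len(results)} of {len(texts)} texts")
    return results
```

In this notebook the call might look like `embed_in_batches(list(chunks_raw_df['content']), embeddings.embed_documents)`; the returned list is shorter than the input if the run was interrupted.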

In [None]:
import pandas as pd
import numpy as np

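def embed_str(s: str) -> np.ndarray:
    """Embed a single string and return it as a float64 vector."""
    # Note: embed_query applies the model's query instruction, not embed_instruction;
    # embed_documents is the method that uses embed_instruction.
    return np.asarray(embeddings.embed_query(s), dtype=np.float64)

def embed_df(df: pd.DataFrame, field: str) -> pd.DataFrame:
    """Add a '<field>_embedding' column holding the embedding of each row's text."""
    df[f'{field}_embedding'] = (
        df[field].apply(embed_str)
    )
    return df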

# Copy the raw chunks and add an embedding column for the chunk text
chunks_processed_df = (
    chunks_raw_df.copy(deep=True)
    # Filter (testing)
    # .pipe(lambda df: df.head(100))
    .pipe(embed_df, field="chunk_s500_o60")
)

chunks_processed_df.to_pickle(path='./data/chunks_processed_full.pkl')

chunks_processed_df.head(15)

In [None]:
# Save DF to Blob Storage for later retrieval. 
chunks_processed_df.to_pickle(path='./data/chunks_processed.pkl')
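
Once the vectors are stored, retrieval typically comes down to ranking them by similarity to a query vector. The following is a dependency-free sketch of cosine-similarity ranking, not the notebook's retrieval code: `cosine_similarity`, `top_k`, and the toy 2-dimensional vectors are hypothetical, and with NumPy the same ranking would normally be done as a single matrix product.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (row index, similarity) for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration only.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))
```

With the DataFrame saved above, `doc_vecs` would be the values of the `chunk_s500_o60_embedding` column and `query_vec` the result of `embeddings.embed_query(...)` for a user question.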