# Embeddings

## sentence-transformers/all-MiniLM-L6-v2 [(model homepage)](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

- This is a sentence-transformers model: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
- By default, input text longer than 256 word pieces is truncated.

## sentence-transformers/all-mpnet-base-v2 [(model homepage)](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)

- This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
- By default, input text longer than 384 word pieces is truncated.
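
Both models place texts in a dense vector space, and semantic search or clustering then comes down to comparing vectors. The sketch below shows one common comparison, cosine similarity, in plain Python. The 4-dimensional vectors are hypothetical stand-ins for real embeddings (which would have 384 or 768 dimensions), and `cosine_similarity` is an illustrative helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real model output has 384 or 768 dimensions.
query = [1.0, 0.0, 1.0, 0.0]
doc_close = [2.0, 0.0, 2.0, 0.0]  # same direction as the query
doc_far = [0.0, 1.0, 0.0, 1.0]    # orthogonal to the query

sim_close = cosine_similarity(query, doc_close)
sim_far = cosine_similarity(query, doc_far)
print("close:", sim_close, "far:", sim_far)
```

Because cosine similarity ignores vector length, `doc_close` scores as a near-perfect match even though its magnitude differs from the query's.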


In [12]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Max Sequence Length:", model.max_seq_length)


Max Sequence Length: 256


In [12]:
model_output[0].size()
encoded_input

{'input_ids': tensor([[ 101, 2023, 2003, 2019, 2742, 6251,  102],
        [ 101, 2169, 6251, 2003, 4991,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}
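
The output above shows the second sentence padded with a trailing `0` and a matching `0` in its attention mask. A minimal sketch of how such padding could be produced by hand follows, using the token IDs from that output. `pad_batch` is a hypothetical helper for illustration, and a pad ID of 0 is assumed from the output shown.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that later steps should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Token IDs copied from the tokenizer output above, without the padding.
batch = [
    [101, 2023, 2003, 2019, 2742, 6251, 102],
    [101, 2169, 6251, 2003, 4991, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```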

In [13]:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
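
To see why the attention mask matters for averaging, here is a dependency-free sketch of masked mean pooling for a single sentence. It mirrors the idea of `mean_pooling` but works on plain lists. The token vectors are made-up values, and `masked_mean` is a hypothetical helper name.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # Guard against division by zero, like torch.clamp(..., min=1e-9).
    return [t / max(count, 1e-9) for t in total]

# Two real tokens followed by one padding token with an arbitrary vector.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding vector would be averaged in and pull the result far away from the two real tokens.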



In [14]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print("Max Sequence Length:", model.max_seq_length)
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings.size()

Max Sequence Length: 256


torch.Size([2, 384])


In [2]:
"""
This example demonstrates the setup for Question-Answer-Retrieval.

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (as it is smaller and fits better in RAM).

As the model, we use: nq-distilbert-base-v1

It was trained on the Natural Questions dataset, a dataset with real questions from Google Search
together with annotated data from Wikipedia providing the answer. For the passages, we encode the
Wikipedia article title together with the individual text passages.

Google Colab Example: https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing
"""
import json
from sentence_transformers import SentenceTransformer, util
import time
import gzip
import os
import torch

# We use the bi-encoder to encode all passages so that we can use them with semantic search
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)
top_k = 5  # Number of passages we want to retrieve with the bi-encoder

# As the dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder.

wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])

# If you like, you can also limit the number of passages you want to use
print("Passages:", len(passages))

# To speed things up, we download pre-computed embeddings.
# The provided file contains the passages encoded with the model 'nq-distilbert-base-v1'
if model_name == 'nq-distilbert-base-v1':
    embeddings_filepath = 'data/simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
    if not os.path.exists(embeddings_filepath):
        util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

    corpus_embeddings = torch.load(embeddings_filepath)
    corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
    if torch.cuda.is_available():
        corpus_embeddings = corpus_embeddings.to('cuda')
else:  # Here, we compute the corpus_embeddings from scratch (which can take a while depending on the GPU)
    corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


queries = ["who is albert einstein?", "what is the capital of france?", "how many states are there in the united states?"]

for query in queries:

    # Encode the query using the bi-encoder and find potentially relevant passages
    start_time = time.time()
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    end_time = time.time()

    # Output of top-k hits
    print("Input question:", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for hit in hits:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))

    print("\n\n========\n")

Passages: 509663
Input question: who is albert einstein?
Results (after 0.093 seconds):
	0.769	['Gene Sharp', 'Gene Sharp (January 21, 1928 – January 28, 2018) was an American political scientist, writer and activist. He was the founder of the Albert Einstein Institution, a non-profit organization dedicated to the study of nonviolent action. He was a retired professor of political science at the University of Massachusetts Dartmouth. He was known for his writings on nonviolent struggle.']
	0.704	['Albert Einstein', 'Albert Einstein (14 March 1879 – 18 April 1955) was a German-born scientist. He worked on theoretical physics. He developed the theory of relativity. He received the Nobel Prize in Physics in 1921 for theoretical physics. His famous equation is formula_1 (E = energy, m = mass, c = speed of light).']
	0.687	['Hermann Minkowski', "Hermann Minkowski (22 June 1864 in Kaunas – 12 January 1909 in Göttingen) was a German mathematician of Jewish descent. He was one of Albert Einste
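
The retrieval step in the loop above can be pictured as scoring every corpus vector against the query and keeping the best `top_k`. The sketch below does this in plain Python, assuming cosine similarity as the scoring function. `semantic_search_sketch` is a hypothetical stand-in, not the library's implementation, and only mimics the `corpus_id` and `score` fields that the loop reads from each hit. The 2-dimensional vectors are toy values.

```python
import math

def semantic_search_sketch(query_vec, corpus_vecs, top_k=5):
    """Return the top_k corpus entries ranked by cosine similarity to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [{"corpus_id": i, "score": cos(query_vec, vec)} for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

# Toy corpus of three "passage embeddings" and one query vector.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = semantic_search_sketch([1.0, 0.1], corpus, top_k=2)

for hit in hits:
    print("{:.3f}\t{}".format(hit["score"], hit["corpus_id"]))
```

This brute-force scan is linear in the corpus size, which is workable for a few hundred thousand passages when done as a batched matrix product on a GPU, as the timings above suggest.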

## spaCy

In [16]:
import spacy 

nlp = spacy.load('en_core_web_lg')
tokens = nlp("this is a sentence. this is another sentence.")
tokens.vector



array([-9.33790863e-01,  2.10345602e+00, -1.84392011e+00, -1.33380818e+00,
        6.36020994e+00,  8.07110012e-01,  8.93033981e-01,  2.68119192e+00,
       -5.75263977e-01, -4.21091139e-01,  9.52303028e+00,  2.18039960e-01,
       -3.30847025e+00,  1.02345204e+00,  9.72467065e-01,  3.37657213e+00,
        1.68216801e+00,  7.95359969e-01,  1.38530582e-01, -1.84179977e-01,
        1.72705996e+00,  1.18366025e-01, -2.57244730e+00,  1.24378395e+00,
       -3.93339872e+00, -1.86945093e+00, -1.11058807e+00, -2.70361042e+00,
       -2.12335777e+00,  1.53907204e+00,  6.22535944e-01,  5.59828877e-01,
       -2.48922992e+00, -1.79261017e+00, -2.97665620e+00,  2.52682066e+00,
        2.08242059e-01,  1.08852196e+00,  4.71816635e+00,  1.33117783e+00,
       -1.84098017e+00,  1.22467613e+00,  1.01205993e+00, -8.04393947e-01,
       -3.04949999e+00,  2.76379204e+00,  3.52020884e+00, -4.47342014e+00,
       -1.01268899e+00, -2.11250022e-01, -1.05838799e+00,  1.79123998e-01,
        2.60044074e+00, -