#**Create local vector embeddings using the sentence-transformers Python library**

##**GOAL: to embed text sentences and perform semantic searches using your own Python code.**

There are many pre-trained embedding models available on Hugging Face that you can use to create vector embeddings.
Sentence Transformers (SBERT) is a library that makes it easy to use these models for vector embedding.

Use pip to install the 'sentence-transformers' package, then import the 'SentenceTransformer' model loader from the 'sentence_transformers' module.

In [2]:
!pip install sentence-transformers





Load the 'all-MiniLM-L6-v2' model from the Hugging Face Hub using SentenceTransformer( *model-name* ) and store the reference to the model object in the 'model' variable.

In [3]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



After loading the model, call the 'encode()' method on the model object to create a vector representation of a text sentence. Pass your own text string, or a list of strings as in the cell below, as the argument.

In [4]:
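# Sentences to embed; encode() accepts a single string or a list of strings
sentences = [
    "I love machine learning.",
    "Artificial intelligence is fascinating.",
    "The cat sat on the mat."
]

# Returns one embedding (one row) per input sentence
embeddings = model.encode(sentences)
embeddings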


array([[-0.0168511 , -0.07072181,  0.08554108, ...,  0.09888572,
         0.01277482, -0.08485193],
       [-0.00446599, -0.0631337 ,  0.06713641, ...,  0.10224469,
         0.02390005, -0.07414539],
       [ 0.13023719, -0.01577286, -0.0367167 , ...,  0.05145747,
         0.00296658,  0.05249828]], dtype=float32)

Create vector representations for several text sentences. Place 8-10 sentences of 20-25 words each in a list, then call the 'encode()' method on the model object with this list as the argument.

In [5]:
# List of sentences to embed
sentences_list = [
    "Machine learning enables computers to identify patterns in large datasets and make predictions without being explicitly programmed for every possible scenario.",
    "Natural language processing allows systems to understand human language, interpret meaning, and generate responses that feel intuitive and contextually appropriate.",
    "Modern recommendation engines analyze user behavior, preferences, and historical interactions to deliver personalized suggestions across shopping platforms, streaming services, and social media.",
    "Semantic search improves information retrieval by focusing on meaning rather than exact keyword matching, helping users find relevant content even when phrasing differs.",
    "Neural networks consist of interconnected layers of artificial neurons that learn complex relationships through repeated exposure to labeled or unlabeled training data.",
    "Vector embeddings convert text into numerical representations that capture semantic similarity, enabling efficient comparison, clustering, and retrieval of related information.",
    "Transformers revolutionized deep learning by introducing attention mechanisms that allow models to focus on important parts of input sequences during processing.",
    "Large language models are trained on massive corpora of text, enabling them to generate coherent responses, summarize documents, and assist with a wide range of tasks.",
    "Efficient indexing techniques such as FAISS enable fast similarity search across millions of embeddings, making large-scale retrieval systems practical and responsive.",
    "Text preprocessing steps like tokenization, normalization, and cleaning help improve model performance by ensuring consistent and meaningful input representations."
]

embeddings = model.encode(sentences_list)

print("Shape of embeddings:", embeddings.shape)





Shape of embeddings: (10, 384)


#**Definition of semantic textual similarity**

Import the 'util' module from the sentence_transformers library.

In [6]:
from sentence_transformers import util


You can calculate the cosine similarity of the vector representations of our sentences using the 'cos_sim()' function from the util module.
Example: sim = util.cos_sim(embedding_1, embedding_2). Calculate the cosine similarity for any two sentences from your list.


In [7]:
# Example sentences
sentences_list = [
    "Machine learning helps computers identify patterns in data.",
    "Artificial intelligence enables systems to perform tasks that normally require human intelligence.",
    "The weather today is sunny with a light breeze."
]

# Create embeddings
embeddings = model.encode(sentences_list)

# Pick any two sentences (e.g., 0 and 1)
embedding_1 = embeddings[0]
embedding_2 = embeddings[1]

# Compute cosine similarity
similarity = util.cos_sim(embedding_1, embedding_2)

print("Cosine similarity:", similarity.item())



Cosine similarity: 0.5001106262207031
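
The cell above compares one pair of sentences. To compare every sentence with every other one, the same idea extends to a full similarity matrix. The following is a minimal NumPy sketch under stated assumptions: the function name `pairwise_cosine_similarity` and the toy 3-dimensional vectors are illustrative stand-ins and do not come from the notebook or from any embedding model.

```python
import numpy as np


def pairwise_cosine_similarity(vectors):
    """Return the (n, n) matrix of cosine similarities between all rows."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # Scale every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


# Toy vectors standing in for real embeddings (illustrative values only).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

similarity_matrix = pairwise_cosine_similarity(toy_embeddings)
print(np.round(similarity_matrix, 4))
```

Entry `[i, j]` of the matrix is the similarity between row `i` and row `j`. The diagonal is 1 because every vector is identical to itself, and the matrix is symmetric. With real embeddings, `util.cos_sim(embeddings, embeddings)` should give the same kind of matrix as a tensor.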


Write and test a function named 'cos_similarity_calculation' that measures the semantic similarity between each sentence in your list and an arbitrary query sentence, using the cosine similarity of their vector representations.

In [8]:
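def cos_similarity_calculation(sentences_list, query_sentence):
    """Return (sentence, cosine similarity to the query) pairs for every sentence."""
    # Embed the candidate sentences and the query with the same model
    sentence_embeddings = model.encode(sentences_list)
    query_embedding = model.encode(query_sentence)

    # cos_sim returns a 1 x N tensor here; take its single row of scores
    similarities = util.cos_sim(query_embedding, sentence_embeddings)[0]

    # Pair each sentence with its similarity score
    results = list(zip(sentences_list, similarities.tolist()))

    return results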


#Example
sentences = [
    "Machine learning helps computers identify patterns in data.",
    "Artificial intelligence enables systems to perform tasks requiring human reasoning.",
    "The weather today is sunny with a light breeze."
]

query = "AI allows machines to think and make decisions."

output = cos_similarity_calculation(sentences, query)

for sentence, score in output:
    print(f"{score:.4f}  ->  {sentence}")



0.5181  ->  Machine learning helps computers identify patterns in data.
0.7436  ->  Artificial intelligence enables systems to perform tasks requiring human reasoning.
-0.0214  ->  The weather today is sunny with a light breeze.
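
The scores above are printed in the order of the input list. For a semantic search, the usual next step is to rank the sentences by score and keep the best matches. The sketch below shows one way to do that. The helper name `top_k_matches` and its `k` parameter are hypothetical, and the scores are copied by hand from the output above so that the block runs without loading the model.

```python
def top_k_matches(results, k=2):
    """Return the k (sentence, score) pairs with the highest scores."""
    return sorted(results, key=lambda pair: pair[1], reverse=True)[:k]


# Scores copied from the output above, rounded to four decimals.
example_results = [
    ("Machine learning helps computers identify patterns in data.", 0.5181),
    ("Artificial intelligence enables systems to perform tasks requiring human reasoning.", 0.7436),
    ("The weather today is sunny with a light breeze.", -0.0214),
]

for sentence, score in top_k_matches(example_results, k=2):
    print(f"{score:.4f}  ->  {sentence}")
```

In the notebook itself, the list returned by `cos_similarity_calculation(sentences, query)` could be passed in directly instead of `example_results`.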


Create a function that computes the cosine similarity between a single vector and a batch of vectors using the cosine similarity formula and the numpy library. Add code to demonstrate how to use this function.
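
For reference, the quantity to compute can be written as follows. The symbols are notation introduced here for illustration: $\mathbf{q}$ is the single vector and $\mathbf{v}_i$ is the $i$-th row of the batch.

$$
\cos(\mathbf{q}, \mathbf{v}_i) = \frac{\mathbf{q} \cdot \mathbf{v}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{v}_i \rVert}
$$

The result lies between -1 and 1, and cosine distance is commonly defined as one minus this value. Dividing each vector by its norm first and then taking the dot product gives the same result, because $\frac{\mathbf{q}}{\lVert \mathbf{q} \rVert} \cdot \frac{\mathbf{v}_i}{\lVert \mathbf{v}_i \rVert}$ is the same fraction rearranged.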

In [11]:
import numpy as np


def cosine_similarity_single_to_batch(vector, batch_vectors):
    # Scale the single vector to unit length
    vector_norm = vector / np.linalg.norm(vector)

    # Scale each row of the batch to unit length
    batch_norms = batch_vectors / np.linalg.norm(batch_vectors, axis=1, keepdims=True)

    # The dot product of unit vectors equals their cosine similarity
    similarities = np.dot(batch_norms, vector_norm)

    return similarities



In [12]:
# Example vectors (pretend these came from an embedding model)
query_vec = np.array([0.2, 0.5, 0.3])
batch_vecs = np.array([
    [0.1, 0.4, 0.2],
    [0.9, 0.1, 0.3],
    [0.2, 0.5, 0.31]
])

scores = cosine_similarity_single_to_batch(query_vec, batch_vecs)

print("Cosine similarity scores:")
for i, score in enumerate(scores):
    print(f"Vector {i}: {score:.4f}")


Cosine similarity scores:
Vector 0: 0.9912
Vector 1: 0.5442
Vector 2: 0.9999
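
One edge case worth noting: the function above divides by each vector's norm, so an all-zero vector would produce NaN values. The following sketch shows one possible guard, assuming it is acceptable to report a similarity of 0 for zero vectors. The name `safe_cosine_similarity`, the `eps` parameter, and the demo arrays are hypothetical and not part of the original exercise.

```python
import numpy as np


def safe_cosine_similarity(vector, batch_vectors, eps=1e-12):
    """Cosine similarity of one vector against each row of a batch, without NaN."""
    vector = np.asarray(vector, dtype=np.float64)
    batch_vectors = np.asarray(batch_vectors, dtype=np.float64)

    # Numerator: dot product of every batch row with the single vector.
    dots = batch_vectors @ vector

    # Denominator: product of the two norms, one value per batch row.
    norms = np.linalg.norm(batch_vectors, axis=1) * np.linalg.norm(vector)

    # Clamp the denominator so an all-zero vector scores 0 rather than NaN.
    return dots / np.maximum(norms, eps)


demo_query = np.array([0.2, 0.5, 0.3])
demo_batch = np.array([
    [0.1, 0.4, 0.2],
    [0.0, 0.0, 0.0],  # all-zero row that would otherwise yield NaN
])

demo_scores = safe_cosine_similarity(demo_query, demo_batch)
print(demo_scores)
```

For non-zero inputs this computes the same values as `cosine_similarity_single_to_batch`. Only the zero-vector case changes.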
