# Text Similarity with Embeddings

<b>What this project is</b>

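In short: convert each text into a vector (an embedding), then compare the vectors, usually with cosine similarity, to get a semantic similarity score. Cosine similarity ranges from -1 to 1; for sentence embeddings of natural text it mostly falls between 0 and 1.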
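As a rough illustration of the comparison step, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses only the standard library; the helper name `cosine` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors (illustrative only; real embeddings have hundreds of dimensions)
same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # unrelated directions, 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # opposite directions, close to -1
```

Only the direction of the vectors matters, not their length, which is why scaling a vector leaves the score unchanged.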

<b>Step-by-step plan</b>

1. **Data**: Prepare example sentence pairs (or a file of texts).
2. **Load model**: `SentenceTransformer(...)`.
3. **Encode**: Convert texts → embeddings (batched).
4. **Compare**: Compute cosine similarity.
5. **Evaluate**: If you have labels, compute correlation/accuracy.
6. **Scale/serve**: Save embeddings and use an approximate nearest neighbor (ANN) index such as Faiss for fast retrieval.
7. **Polish**: Chunk long documents, normalize embeddings, and cache results.
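The chunking in step 7 matters because embedding models generally truncate input beyond a fixed token limit, so the tail of a long document would otherwise be ignored. A minimal sketch, assuming word count is an acceptable proxy for token count; `chunk_words`, `max_words`, and `overlap` are hypothetical names, not part of any library.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Small window sizes so the overlap is easy to see
chunks = chunk_words("one two three four five six seven", max_words=4, overlap=1)
```

Each chunk can then be encoded separately, and the per-chunk scores combined (for example by taking the maximum) when comparing long documents.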

In [1]:
# 1) imports
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [2]:
# 2) load a model (this downloads the model once)
# 'all-MiniLM-L6-v2' is a good fast default
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# 3) example texts
text1 = "I adore you."
text2 = "I love you."


In [30]:
# 4) encode to embeddings (convert_to_numpy=True gives a plain numpy array)
emb1 = model.encode(text1, convert_to_numpy=True)
emb2 = model.encode(text2, convert_to_numpy=True)

In [32]:
# 5) compute cosine similarity with scikit-learn (returns a 1x1 array, hence [0][0])
score = cosine_similarity([emb1], [emb2])[0][0]
print(f"Cosine similarity (text1 vs text2): {score:.4f}")

Cosine similarity (text1 vs text2): 0.6937
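
The same comparison extends from one pair to one query against many candidates, which is the brute-force version of the retrieval that an ANN index speeds up. A dependency-free sketch, assuming the embeddings are already available as lists of floats; `normalize`, `top_k`, and the toy vectors are illustrative and not part of the notebook.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, candidates, k=2):
    """Return (index, score) for the k candidates most similar to query."""
    q = normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = normalize(cand)
        # For unit-length vectors the dot product equals cosine similarity
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    scored.sort(reverse=True)  # highest score first
    return [(idx, score) for score, idx in scored[:k]]

# Toy 2-D vectors standing in for real embeddings
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranking = top_k([1.0, 0.1], toy_vectors, k=2)
```

Normalizing once and storing the unit vectors means every later comparison is a plain dot product, which is also the form that inner-product ANN indexes expect.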
