## Embedding Techniques using HuggingFace

### Sentence Embeddings
A sentence embedding is a single vector that represents the meaning of an entire sentence, paragraph, or document. Instead of embedding individual words, it compresses the whole sequence into one fixed-size vector.

#### Why do we need Sentence Embeddings?
Sentence embeddings eliminate the need to handle word order or derive context separately from word-level vectors. They are produced by specialized transformer models called Sentence Transformers, which embed an entire sentence so that sentences with similar meaning, such as "The cat wears the hat" and "The hat is worn by the cat", generate similar embeddings.
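One common way to check whether two sentences received similar embeddings is cosine similarity. The sketch below is an illustration, not part of the original notebook: `cosine_similarity` and `sentence_similarity` are hypothetical helper names, and `embedder` is assumed to be any object with an `embed_query(text)` method that returns a list of floats, such as the `HuggingFaceEmbeddings` object used in this notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def sentence_similarity(embedder, first: str, second: str) -> float:
    """Embed two sentences and compare them.

    Assumption: `embedder` exposes embed_query(text) -> list of floats.
    """
    return cosine_similarity(
        embedder.embed_query(first),
        embedder.embed_query(second),
    )
```

With a real model, calling `sentence_similarity` on the two cat-and-hat sentences should give a score close to 1, while unrelated sentences should score noticeably lower. The exact values depend on the model.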

### Sentence Transformers
Sentence Transformers is a Python library built on top of Hugging Face's transformers and PyTorch that makes it easy to create and use sentence embeddings.

To use Hugging Face embeddings, install the following libraries:

- `sentence-transformers` (imported as `sentence_transformers`)
- `langchain-huggingface` (imported as `langchain_huggingface`)

### Applications of Sentence Transformers
- Semantic search (finding similar sentences in documents)
- Clustering text
- Duplicate detection
- Zero-shot classification
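As an illustration of the first application, semantic search can be sketched as ranking documents by the cosine similarity between each document vector and the query vector. This is a minimal sketch under stated assumptions, not a production retriever: `semantic_search` and `_cosine` are hypothetical names, and `embedder` is assumed to expose `embed_documents(list_of_texts)` and `embed_query(text)`, as the `HuggingFaceEmbeddings` object used in this notebook does.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(
    embedder,
    query: str,
    documents: List[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to top_k (document, score) pairs, most similar first.

    Assumption: `embedder` exposes embed_documents(texts) and embed_query(text).
    """
    doc_vectors = embedder.embed_documents(documents)
    query_vector = embedder.embed_query(query)
    scored = [
        (doc, _cosine(query_vector, vector))
        for doc, vector in zip(documents, doc_vectors)
    ]
    # Sort by score in descending order so the best match comes first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

This brute-force scan compares the query against every document, which is fine for small collections. Larger collections usually store the vectors in a vector index instead.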

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
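# Default to "" so a missing token does not raise a TypeError (os.environ only accepts strings)
os.environ["HUGGINGFACE_API_TOKEN"] = os.getenv("HUGGINGFACE_API_TOKEN", "")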

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


In [5]:
## Embedding documents
doc_result=embeddings.embed_documents(["This is a sample document for generating embeddings."])
doc_result

[[-0.07522130012512207,
  -0.027015456929802895,
  0.022529976442456245,
  0.010589759796857834,
  0.023517966270446777,
  0.04922173172235489,
  -0.0637856274843216,
  -0.002215262968093157,
  0.01385696604847908,
  -0.016159044578671455,
  0.008074824698269367,
  0.021580981090664864,
  0.06055011972784996,
  -0.01757878251373768,
  -0.06458789855241776,
  0.06342627853155136,
  0.04733093082904816,
  0.031074855476617813,
  0.007063628174364567,
  0.0020324010401964188,
  0.03625836968421936,
  0.03933735191822052,
  0.0859169289469719,
  -0.09467193484306335,
  0.014639021828770638,
  -0.05764240399003029,
  -0.0318312793970108,
  0.07503756880760193,
  0.10996082425117493,
  -0.024229541420936584,
  0.1013808622956276,
  0.024940311908721924,
  0.06182389706373215,
  0.035006627440452576,
  0.0531427226960659,
  0.0730045810341835,
  0.011493715457618237,
  0.06904057413339615,
  -0.0043205092661082745,
  0.06208019331097603,
  -0.0007246480090543628,
  -0.0019010510295629501,
  ...]]

In [6]:
## Embedding queries
text = "This is a sample query for generating embeddings."

query_result=embeddings.embed_query(text)
query_result

[-0.007772895507514477,
 -0.08314358443021774,
 0.0124201076105237,
 0.05465555936098099,
 -0.018045861274003983,
 0.0855402871966362,
 -0.001585175283253193,
 -0.021698957309126854,
 0.033622682094573975,
 -0.07685940712690353,
 0.028566712513566017,
 -0.02715766243636608,
 0.1076282262802124,
 -0.006704982835799456,
 -0.09197292476892471,
 0.07484074681997299,
 0.08612003922462463,
 0.04069514200091362,
 0.0019233896164223552,
 -0.03499314934015274,
 0.0011184068862348795,
 0.03126102313399315,
 0.06642938405275345,
 -0.08582672476768494,
 0.05210261046886444,
 -0.09195464849472046,
 0.004666285589337349,
 0.05190041661262512,
 0.10449287295341492,
 -0.014871664345264435,
 0.04587382823228836,
 0.018618440255522728,
 -0.0007160801906138659,
 0.06988441944122314,
 0.032589141279459,
 0.020719919353723526,
 -0.014043992385268211,
 0.058347348123788834,
 -0.006203175522387028,
 0.015077903866767883,
 0.01484449952840805,
 0.020777452737092972,
 0.05953926965594292,
 0.04313572868704796,
 ...]

In [7]:
len(query_result)

384
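The same length check can be applied to a batch of document embeddings. The sketch below uses a hypothetical helper, `embedding_shape`, and assumes its input is a list of vectors like the one returned by `embed_documents`. It reports how many vectors there are and confirms they all share one dimension.

```python
from typing import Sequence, Tuple


def embedding_shape(vectors: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (number of vectors, dimension).

    Raises ValueError if the vectors do not all have the same length.
    An empty input is reported as (0, 0).
    """
    if not vectors:
        return (0, 0)
    dim = len(vectors[0])
    for index, vector in enumerate(vectors):
        if len(vector) != dim:
            raise ValueError(
                f"vector {index} has length {len(vector)}, expected {dim}"
            )
    return (len(vectors), dim)
```

Applied to the notebook's `doc_result`, this should report one vector with the same 384 dimensions as `query_result`, since both come from the same model.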