# LangChain Embeddings

LangChain provides a universal interface for working with embedding models, with standard methods for common operations. This common interface simplifies interaction with various embedding providers through two central methods:

- `embed_documents`: for embedding multiple texts (documents)
- `embed_query`: for embedding a single text (query)
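
To make the shape of this interface concrete, here is a minimal sketch using a toy, dependency-free embedder. `ToyHashEmbeddings` and its hashing scheme are hypothetical stand-ins invented for illustration; it is not a LangChain class and its vectors carry no semantic meaning. It only mirrors the two method names and their input and output types.

```python
import hashlib
import math
from typing import List


class ToyHashEmbeddings:
    """Hypothetical stand-in that mirrors the two-method interface (not a LangChain class)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Count each token in a bucket chosen by a stable hash, then L2-normalize.
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            vec[digest[0] % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out.
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out.
        return self._embed(text)


embedder = ToyHashEmbeddings(dim=8)
doc_vectors = embedder.embed_documents(["text splitting", "document retrieval"])
query_vector = embedder.embed_query("text splitting")
```

A real provider class would expose the same two calls, so swapping providers should not require changes to the calling code.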

# LangChain - Hugging Face integration

LangChain integrates with Hugging Face, so models hosted on Hugging Face can be used
inside LangChain. This is useful when we don't want to call an LLM API directly,
or when we only need an embedding model to create vectors for our querying system.

Install the `langchain-huggingface` package first:

pip install langchain-huggingface

The package provides several classes, covering the Hugging Face API, embedding models,
Hugging Face pipelines, and so on.

# Embedding model

Some of the best embedding models to use 

### Jina embedding - `jina-embeddings-v3`
It is an XLM-RoBERTa-based, multi-task, multilingual text embedding model that accepts inputs of up to 8192 tokens.
The model is trained with `LoRA adapters`, each fine-tuned for one of the following tasks:

- `retrieval.query`: used for query embeddings in asymmetric retrieval tasks
- `retrieval.passage`: used for passage embeddings in asymmetric retrieval tasks
- `separation`: used for embeddings in clustering and re-ranking applications
- `classification`: used for embeddings in classification tasks
- `text-matching`: used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks

You can pass any of these tasks through the `task` parameter of the `encode` function to use that specific adapter. The model can also be used
without specifying a task.

By default, the model supports a maximum sequence length of 8192 tokens. To truncate input texts to a shorter length, pass the `max_length` parameter to the `encode` function.

Matryoshka embeddings: the model supports flexible embedding sizes (32, 64, 128, 256, 512, 768, 1024), so embeddings can be truncated to fit your application.
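
Truncating a Matryoshka embedding amounts to keeping its leading components. The sketch below shows that step on a made-up 4-dimensional vector standing in for a real 1024-dimensional one. `truncate_embedding` is a hypothetical helper, not part of any Jina or LangChain API, and the re-normalization step is an assumption based on the common practice of comparing embeddings by cosine similarity.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0, 0.5]  # made-up values, stand-in for a 1024-d embedding
short = truncate_embedding(full, 2)
```

Smaller sizes reduce storage and speed up similarity search, usually at some cost in retrieval quality, so the size is worth validating on your own data.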

### GTE embedding - (GTE large, base, small)  `thenlper/gte-base`

General Text Embeddings (GTE) models from Alibaba. These are BERT-based models.
They are general-purpose embeddings, noted for their flexibility across a broad range of NLP tasks,
which comes from the large variety of data used to train them. Training happens in two stages.

IMPORTANT - They perform well because they are trained with a contrastive learning loss: for each query q, a
positive (related) sample is picked and negative samples are mined, so training resembles a triplet setup.
This loss is used in both stages.

- Stage 1: unsupervised training on web-scraped text pairs
- Stage 2: supervised training on good-quality, human-annotated text pairs
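
The contrastive objective described above can be sketched with a simplified InfoNCE-style loss for one query, one positive, and a few negatives. This illustrates the general idea and is not necessarily the exact loss used to train GTE. The function names, the toy 2-dimensional vectors, and the temperature value of 0.05 are all assumptions made for the example.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce_loss(
    query: List[float],
    positive: List[float],
    negatives: List[List[float]],
    temperature: float = 0.05,  # assumed value, chosen only for illustration
) -> float:
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


query = [1.0, 0.0]
related = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.0]]

good_loss = info_nce_loss(query, related, unrelated)
# Swap roles: treat an unrelated text as the "positive" to see the loss grow.
bad_loss = info_nce_loss(query, [0.0, 1.0], [related, [-1.0, 0.0]])
```

The loss is small when the query sits closer to its positive than to every negative, and grows when a negative scores higher, which is the signal that pulls related pairs together during training.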


### Sentence transformers embeddings

`all-mpnet`

Use the `HuggingFaceEmbeddings` class, which also accepts `model_kwargs`.
Alternatively, `AutoModel` from Hugging Face can vectorize text directly, in case the langchain-huggingface integration isn't maintained or throws errors.
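
A minimal sketch of the `AutoModel` route mentioned above, assuming `torch` and `transformers` are installed and that mean pooling over non-padding tokens is an acceptable way to turn token vectors into one sentence vector. `mean_pool` and `embed_with_automodel` are hypothetical helper names introduced here, and the default model name is just the GTE model used elsewhere in these notes.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, skipping padding tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_with_automodel(
    texts: List[str], model_name: str = "thenlper/gte-base"
) -> List[List[float]]:
    """Embed texts with a raw Hugging Face model plus mean pooling."""
    # Imported lazily so mean_pool stays usable without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, dim)
    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed_with_automodel(["some text"])` downloads the model on first use. Pooling in plain Python keeps the example easy to follow; for large batches, doing the same averaging with tensor operations would be faster.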

In [39]:
sample_text = "Let me check out text splitting for document retrievals with langchain"

In [55]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

In [41]:
# The Hugging Face-LangChain integration is built on sentence-transformers,
# so it throws errors for models that are not sentence-transformers models, such as
# Jina AI's, which ships its own custom code. Use the Jina integration from langchain-community instead.
from langchain_community.embeddings import JinaEmbeddings

In [42]:
import os
JINA_API_KEY = os.environ["JINA_API_KEY"]  # read the key from the environment instead of hardcoding it

In [43]:
embed_model = JinaEmbeddings(jina_api_key=JINA_API_KEY, model_name="jina-embeddings-v3")

In [54]:
embed_doc = embed_model.embed_documents([sample_text])
embed_query = embed_model.embed_query(sample_text)
len(embed_query)

1024

In [58]:
# GTE embedding
embed_model = HuggingFaceEmbeddings(model_name="thenlper/gte-base")

In [59]:
embed_query = embed_model.embed_query(sample_text)

# Vector Indexing with FAISS
