# LlamaCpp Embeddings With Langchain

- Author: [Yongdam Kim](https://github.com/dancing-with-coffee/)
- Peer Review: [Teddy](https://github.com/teddylee777)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview

This tutorial covers how to perform **Text Embedding** using **Llama-cpp** and **Langchain**.

**Llama-cpp** is an open-source package implemented in C++ that allows you to use LLMs such as llama very efficiently locally.

In this tutorial, we will create a simple example to measure similarity between `Documents` and an input `Query` using **Llama-cpp** and **Langchain**.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Llama-cpp Installation and Model Serving](#llama-cpp-installation-and-model-serving)
- [Identify Supported Embedding Models and Serving Model](#identify-supported-embedding-models-and-serving-model)
- [Model Load and Embedding](#model-load-and-embedding)
- [The similarity calculation results](#the-similarity-calculation-results)

### References

- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
- [Llama-cpp Python GitHub](https://github.com/abetlen/llama-cpp-python)
- [LangChain Documentation](https://langchain.readthedocs.io/en/latest/)
- [Cosine Similarity - Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity)
- [CompendiumLabs/bge-large-en-v1.5-gguf - Hugging Face](https://huggingface.co/CompendiumLabs/bge-large-en-v1.5-gguf/tree/main)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials.
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_community",
        "llama-cpp-python",
        "scikit-learn",
    ],
    verbose=False,
    upgrade=False,
)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "LlamaCpp-Embeddings-With-Langchain",
    }
)

Environment variables have been set successfully.


You can alternatively set `LANGCHAIN_API_KEY` in `.env` file and load it. 

[Note] This is not necessary if you've already set `LANGCHAIN_API_KEY` in previous steps.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Llama-cpp Installation and Model Serving

Llama-cpp is an open-source project that makes it easy to run large language models (LLMs) locally. It allows you to download and run various LLMs on your own computer, giving you the freedom to experiment with AI models.

To install **llama-cpp-python**:
```bash
pip install llama-cpp-python
```

1. Make sure you have the required environment for C++ compilation (e.g., on Linux or macOS). 
2. Download or specify your chosen embedding model file (e.g., `CompendiumLabs/bge-large-en-v1.5-gguf`).
3. Here, we use `bge-large-en-v1.5-q8_0.gguf` as an example and you can download it from [CompendiumLabs/bge-large-en-v1.5-gguf - Hugging Face](https://huggingface.co/CompendiumLabs/bge-large-en-v1.5-gguf/tree/main).
4. Check that `llama-cpp-python` can find the model path.

Below, we will demonstrate how to serve a LLaMA model using Llama-cpp. You can follow the official [llama-cpp-python documentation](https://github.com/abetlen/llama-cpp-python) for more details.

## Identify Supported Embedding Models and Serving Model

You can find a variety of embedding models, which typically come in different quantizations (e.g., q4_0, q4_1, q5_0, q8_0, etc.).

**1. Search models**
- You can look for models on Hugging Face or other community websites.

**2. Download or Pull a Model**
- For instance, you could download from Hugging Face if the model is hosted.

**3. Verify the Model**
- Check that the `.bin` (or `.gguf`) file is accessible to your environment.


## Model Load and Embedding

Now that you have installed `llama-cpp-python` and have downloaded a model, let's see how to load it and use it for text embedding.

Below, we define a `Query` or some `Documents` to embed using `Llama-cpp` within LangChain.

In [5]:
from langchain_community.embeddings import LlamaCppEmbeddings

# Example query and documents
query = "What is LangChain?"
docs = [
    "LangChain is an open-source framework designed to facilitate the development of applications powered by large language models (LLMs). It provides tools and components to build end-to-end workflows for tasks like document retrieval, chatbots, summarization, data analysis, and more.",
    "Spaghetti Carbonara is a traditional Italian pasta dish made with eggs, cheese, pancetta, and pepper. It's simple yet incredibly delicious. Typically served with spaghetti, but can also be enjoyed with other pasta types.",
    "The tropical island of Bali offers stunning beaches, volcanic mountains, lush forests, and vibrant coral reefs. Travelers often visit for surfing, yoga retreats, and the unique Balinese Hindu culture.",
    "C++ is a high-performance programming language widely used in system/software development, game programming, and real-time simulations. It supports both procedural and object-oriented paradigms.",
    "In astronomy, the Drake Equation is a probabilistic argument used to estimate the number of active, communicative extraterrestrial civilizations in the Milky Way galaxy. It takes into account factors such as star formation rate and fraction of habitable planets.",
]

### Load the Embedding Model

Below is how you can initialize the `LlamaCppEmbeddings` class by specifying the path to your LLaMA model file (`model_path`).

For example, you might have a downloaded model path: `./bge-large-en-v1.5-q8_0.gguf`.

We demonstrate how to instantiate the embeddings class and then embed queries and documents using Llama-cpp.

In [6]:
# Load the Llama-cpp Embedding Model
model_path = "bge-large-en-v1.5-q8_0.gguf"  # example path

embedder = LlamaCppEmbeddings(model_path=model_path, n_gpu_layers=-1)
print("Embedding model has been successfully loaded.")

llama_load_model_from_file: using device Metal (Apple M3 Max) - 49151 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 389 tensors from bge-large-en-v1.5-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = bge-large-en-v1.5
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16

Embedding model has been successfully loaded.


Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 
Model metadata: {'tokenizer.ggml.cls_token_id': '101', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.seperator_token_id': '102', 'tokenizer.ggml.unknown_token_id': '100', 'general.quantization_version': '2', 'tokenizer.ggml.token_type_count': '2', 'general.file_type': '7', 'tokenizer.ggml.eos_token_id': '102', 'bert.context_length': '512', 'bert.pooling_type': '2', 'tokenizer.ggml.bos_token_id': '101', 'bert.attention.head_count': '16', 'bert.feed_forward_length': '4096', 'tokenizer.ggml.mask_token_id': '103', 'tokenizer.ggml.model': 'bert', 'bert.attention.causal': 'false', 'general.name': 'bge-large-en-v1.5', 'bert.block_count': '24', 'bert.attention.layer_norm_epsilon': '0.000000', 'bert.embedding_length': '1024', 'general.architecture': 'bert'}
Using fallback chat format: llama-2


### Embedding Queries and Documents

Now let's embed both the `query` and the `documents`. We will verify the dimension of the output vectors.

However, there is currently one issue that cannot be resolved when using the latest model with `LlamaCppEmbeddings`. I will post the link to the issue below, so please check it out and if it is resolved in the latest version, you can use it as instructed in the original langchain official tutorial.

- Issue link : https://github.com/langchain-ai/langchain/issues/22532

In [7]:
# from langchain tutorial

"""
embedded_query = llama_embeddings.embed_query(query)
embedded_docs = llama_embeddings.embed_documents(docs)

print(f"Embedding Dimension Output (Query): {len(embedded_query)}")
print(f"Embedding Dimension Output (Docs): {len(embedded_docs[0])}")
"""

# Overridden version of the LlamaCppEmbeddings class
from typing import List
from langchain_community.embeddings.llamacpp import LlamaCppEmbeddings


class CustomLlamaCppEmbeddings(LlamaCppEmbeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents using the Llama model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        embeddings = [self.client.embed(text)[0] for text in texts]
        return [list(map(float, e)) for e in embeddings]

    def embed_query(self, text: str) -> List[float]:
        """Embed a query using the Llama model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        embedding = self.client.embed(text)[0]
        return list(map(float, embedding))


c_embedder = CustomLlamaCppEmbeddings(model_path=model_path, n_gpu_layers=-1)
embedded_query = c_embedder.embed_query([query])
embedded_docs = c_embedder.embed_documents([docs])

llama_load_model_from_file: using device Metal (Apple M3 Max) - 48765 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 389 tensors from bge-large-en-v1.5-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = bge-large-en-v1.5
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16

### Check custom embeddings

- To check whether the embedding results are output as expected, I output the dimensions of each embedding vector.

In [8]:
print("Query embedding dimension:", len(embedded_query))
print("Document embedding dimension:", len(embedded_docs[0]))

Query embedding dimension: 1024
Document embedding dimension: 1024


## The similarity calculation results

We can use the vector representations of the query and documents to calculate similarity.
Here, we use the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) provided by scikit-learn.


In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


# Calculate Cosine Similarity
similarities = cosine_similarity([embedded_query], embedded_docs)[0]
print(similarities)

# Sort indices in ascending order.
sorted_indices = np.argsort(similarities)[::-1]

print(f"[Query] {query}\n====================================")
for i, idx in enumerate(sorted_indices):
    print(f"[{i}] similarity: {similarities[idx]:.3f} | {docs[idx]}")
    print()

[0.8711899]
[Query] What is LangChain?
[0] similarity: 0.871 | LangChain is an open-source framework designed to facilitate the development of applications powered by large language models (LLMs). It provides tools and components to build end-to-end workflows for tasks like document retrieval, chatbots, summarization, data analysis, and more.



----
This concludes the **Llama-cpp Embeddings With Langchain** tutorial in the style of the original reference notebook.