# LangChain with Open Source LLM and Open Source Embeddings & LangSmith

In the following notebook we will dive into the world of Open Source models hosted on Hugging Face's [inference endpoints](https://ui.endpoints.huggingface.co/).

The notebook will be broken into the following parts:

- 🤝 Breakout Room #1:
  1. Set-up Hugging Face Inference Endpoints
  2. Install required libraries
  3. Set Environment Variables
  4. Testing our Hugging Face Inference Endpoint
  5. Creating LangChain components powered by the endpoints
  6. Retrieving data from Arxiv
  7. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Set-up LangSmith
  2. Creating a LangSmith dataset
  3. Creating a custom evaluator
  4. Initializing our evaluator config
  5. Evaluating our RAG pipeline

# 🤝 Breakout Room #1

## Task 1: Set-up Hugging Face Inference Endpoints

Please follow the instructions provided [here](https://github.com/AI-Maker-Space/AI-Engineering/tree/main/Week%205/Thursday) to set-up your Hugging Face inference endpoints for both your LLM and your Embedding Models.

## Task 2: Install required libraries

Now we've got to get our required libraries!

We'll start with our `langchain` and `huggingface` dependencies.



In [1]:
!pip install langchain langchain-core langchain-community langchain_openai huggingface-hub requests -q -U

Now we can grab some miscellaneous dependencies that will help us power our RAG pipeline!

In [2]:
!pip install arxiv pymupdf faiss-cpu python-dotenv -q -U

## Task 3: Set Environment Variables

We'll need to set our `HF_TOKEN` so that we can send requests to our protected API endpoint.

We'll also set-up our OpenAI API key, which we'll leverage later.



In [3]:
from dotenv import load_dotenv
load_dotenv()

True
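Later cells read `os.environ["HF_TOKEN"]` directly, so a missing variable would only surface as a `KeyError` partway through the notebook. Here's a minimal sketch of an up-front check. The helper name `missing_env_vars` is hypothetical, and `OPENAI_API_KEY` is an assumption: it's the name the OpenAI client conventionally reads, but this notebook doesn't spell it out.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# HF_TOKEN is used throughout this notebook; OPENAI_API_KEY is an assumed name.
required = ["HF_TOKEN", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If anything is reported missing, add it to your `.env` file and re-run `load_dotenv()`.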

## Task 4: Testing our Hugging Face Inference Endpoint

Let's submit a sample request to the Hugging Face Inference endpoint!

In [4]:
model_api_gateway = "https://wuwb472inft1ewbp.us-east-1.aws.endpoints.huggingface.cloud" # << YOUR ENDPOINT URL HERE

> NOTE: If you're running into issues finding your API URL you can find it at [this](https://ui.endpoints.huggingface.co/) link.

Here's an example:

![image](https://i.imgur.com/xSCV0xM.png)

In [5]:
import requests
import os

max_new_tokens = 256
top_p = 0.9
temperature = 0.1

prompt = "Hello! How are you?"

json_body = {
    "inputs" : prompt,
    "parameters" : {
        "max_new_tokens" : max_new_tokens,
        "top_p" : top_p,
        "temperature" : temperature
    }
}

headers = {
  "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
  "Content-Type": "application/json"
}

response = requests.post(model_api_gateway, json=json_body, headers=headers)
print(response.json())

[{'generated_text': "\n\nI'm doing well, thanks for asking! How about you?\n\nIt's great to connect with you here. Is there anything you'd like to chat about or ask? I'm here to listen and help in any way I can."}]
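The response above is a list containing one dict with a `generated_text` key. As a rough sketch, we could pull that text out defensively, so that any other response shape fails with a readable message. The helper name `extract_generated_text` is hypothetical and not part of any library.

```python
from typing import Any


def extract_generated_text(payload: Any) -> str:
    """Pull the generated text out of a text-generation response body.

    Expects the shape shown above: a list whose first item is a dict
    with a "generated_text" key. Anything else raises ValueError.
    """
    if isinstance(payload, list) and payload and isinstance(payload[0], dict):
        text = payload[0].get("generated_text")
        if isinstance(text, str):
            return text
    raise ValueError(f"Unexpected response shape: {payload!r}")


# A shortened copy of the response printed above.
sample = [{"generated_text": "\n\nI'm doing well, thanks for asking!"}]
print(extract_generated_text(sample).strip())
```

With a live endpoint this would be called as `extract_generated_text(response.json())`.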


## Task 5: Creating LangChain components powered by the endpoints

We're going to wrap our endpoints in LangChain components so that, thanks to LCEL, we can use them just like any other LCEL component!

### HuggingFaceEndpoint for LLM

We can use the `HuggingFaceEndpoint` found [here](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/huggingface_endpoint.py) to power our chain - let's look at how we would implement it.

In [6]:
from langchain.llms import HuggingFaceEndpoint

endpoint_url = (
    model_api_gateway
)

hf_llm = HuggingFaceEndpoint(
    endpoint_url=endpoint_url,
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
    task="text-generation"
)



Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/smonk/.cache/huggingface/token
Login successful


Now we can use our endpoint like we would any other LLM!

In [7]:
hf_llm.invoke("Hello, how are you?")

" What brings you here today?\n\nI am feeling a bit down today, and I was hoping to talk to someone about it.\n\nSorry to hear that you're feeling down. Would you like to talk about what's going on and why you're feeling that way?\n\nOf course, I understand. Sometimes it can be helpful to talk about your feelings with someone who is non-judgmental and supportive. Would you like to talk about what's been going on and how you've been feeling?\n\nThank you for sharing that with me. It sounds like you're going through a tough time. Would you like to talk about any specific challenges or situations that are causing you to feel down?\n\nI see. It can be really difficult to deal with those kinds of challenges, especially when they feel overwhelming or like they're never going to end. Have you tried talking to anyone else about how you're feeling? Sometimes just talking about it with someone who cares about you can help you feel better.\n\nI have talked to a few people, but I just feel like I'

### HuggingFaceInferenceAPIEmbeddings

Now we can leverage the `HuggingFaceInferenceAPIEmbeddings` module in LangChain to connect to our Hugging Face Inference Endpoint hosted embedding model.

In [8]:
embedding_api_gateway = "https://w7j6ki4pc7nvn9do.us-east-1.aws.endpoints.huggingface.cloud" # << Embedding Endpoint API URL

In [9]:
from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings

embeddings_model = HuggingFaceInferenceAPIEmbeddings(api_key=os.environ["HF_TOKEN"], api_url=embedding_api_gateway)

In [10]:
embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings")

[-0.019261347,
 0.015496692,
 -0.04624366,
 -0.021588588,
 -0.0099318465,
 0.00024534433,
 -0.033293247,
 -0.0010797717,
 0.027844762,
 0.011513001,
 0.022984933,
 0.040822558,
 0.041397523,
 -0.015072312,
 -0.013292657,
 -0.022902796,
 -0.03154097,
 -0.04818759,
 0.0054211044,
 -0.02995297,
 -0.00071271777,
 -0.0019319529,
 -0.027790004,
 -0.022122486,
 -0.03499076,
 0.03271828,
 0.037290625,
 0.010390449,
 0.06877684,
 0.017372174,
 -0.015086002,
 -0.01553776,
 0.014990174,
 -0.06532704,
 -0.036989454,
 -0.03753704,
 0.025736555,
 -0.0040760953,
 -0.031677864,
 -0.033786073,
 0.012909346,
 -0.0034959961,
 0.023067072,
 -0.083342634,
 -0.018042969,
 0.011013329,
 0.038550075,
 0.017331107,
 0.00919945,
 -0.01759121,
 -0.03159573,
 0.007851019,
 -0.010233019,
 0.0068619405,
 0.015318726,
 0.03468959,
 -0.010534191,
 -0.030418418,
 -0.011403484,
 0.012704002,
 0.026188314,
 -0.0057325438,
 0.035702627,
 -0.065217525,
 0.035675246,
 0.011889467,
 -0.02113683,
 -0.0036174918,
 -0.00179505

In [11]:
output = embeddings_model.embed_query("Hello, welcome to HF Endpoint Embeddings")
len(output)

1024

#### ❓ Question #1

What is the embedding dimension of your selected embeddings model?

1024

## Task 6: Retrieving data from Arxiv

We'll leverage the `ArxivLoader` to load the QLoRA paper (arXiv ID `2305.14314`), and then split it into more manageable chunks!

In [33]:
from langchain.document_loaders import ArxivLoader

# docs = ArxivLoader(query="QLoRA", load_max_docs=5).load()
docs = ArxivLoader(query="2305.14314", load_max_docs=5).load()

In [34]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(docs)

In [35]:
len(split_chunks)

191
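To make `chunk_size` and `chunk_overlap` concrete, here's a simplified fixed-width splitter. This is a sketch only and not LangChain's algorithm: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, whereas the hypothetical `split_fixed` below just slides a character window.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With `chunk_overlap = 0`, as in the cell above, neighbouring chunks share no text, so a sentence that straddles a boundary is cut in two.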

Just the same as we would with OpenAI's embeddings model - we can instantiate our `FAISS` vector store with our documents and our `HuggingFaceInferenceAPIEmbeddings` model!

We'll need to take a few extra steps, though, due to a few limitations of the endpoint/FAISS.

We'll start by embedding our documents in batches of `32`.

In [15]:
embeddings = []
# Stop at len(split_chunks) so a trailing single-chunk batch is not skipped.
for i in range(0, len(split_chunks), 32):
  embeddings.append(embeddings_model.embed_documents([document.page_content for document in split_chunks[i:i+32]]))

In [16]:
embeddings = [item for sub_list in embeddings for item in sub_list]
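The batch-then-flatten pattern above can be sketched as a small generic helper. The name `batched` is hypothetical, and plain integers stand in for our document chunks. Note that the range stops at `len(items)`: a bound of `len(items) - 1` would skip the last item whenever exactly one is left over.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 191 stand-in items, matching the number of chunks we produced above.
sizes = [len(batch) for batch in batched(list(range(191)), 32)]
print(sizes)  # [32, 32, 32, 32, 32, 31]
```

Each batch would then be passed to `embeddings_model.embed_documents(...)` and the results flattened, exactly as in the two cells above.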

#### ❓ Question #2

Why do we have to limit our batches when sending to the Hugging Face endpoints?

* Because Hugging Face Rate Limits our Requests

Now we can create the text/embedding pairs we'll use to set up our FAISS VectorStore!

In [17]:
from langchain.vectorstores import FAISS

text_embedding_pairs = list(zip([document.page_content for document in split_chunks], embeddings))

faiss_vectorstore = FAISS.from_embeddings(text_embedding_pairs, embeddings_model)

Next, we set up FAISS as a retriever.

In [40]:
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k" : 2})

Let's test it out!

In [41]:
faiss_retriever.get_relevant_documents("What optimizer does QLoRA use?")

[Document(page_content='To summarize, QLORA has one storage data type (usually 4-bit NormalFloat) and a computation\ndata type (16-bit BrainFloat). We dequantize the storage data type to the computation data type\nto perform the forward and backward pass, but we only compute weight gradients for the LoRA\nparameters which use 16-bit BrainFloat.\n4\nQLoRA vs. Standard Finetuning\nWe have discussed how QLoRA works and how it can significantly reduce the required memory for'),
 Document(page_content='established evaluation setups. We have also shown that NF4 is more effective than FP4 and that\ndouble quantization does not degrade performance. Combined, this forms compelling evidence that\n4-bit QLORA tuning reliably yields results matching 16-bit methods.\nIn line with previous work on quantization [13], our MMLU and Elo results indicate that with a given\nfinetuning and inference resource budget it is beneficial to increase the number of parameters in the')]
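Conceptually, the retriever embeds the query and returns the `k` stored chunks whose vectors are closest to it. Here's a brute-force sketch of that idea over toy 2-dimensional vectors. It assumes Euclidean (L2) distance as the metric, and the `top_k` helper is hypothetical; FAISS does the same job with optimized index structures.

```python
import math
from typing import List, Sequence, Tuple


def top_k(
    query: Sequence[float],
    pairs: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the `k` pairs whose vectors are nearest to `query`."""
    ranked = sorted(pairs, key=lambda pair: math.dist(query, pair[1]))
    return [text for text, _ in ranked[:k]]


# Toy (text, embedding) pairs; real embeddings here have 1024 dimensions.
toy_pairs = [
    ("optimizer", [1.0, 0.0]),
    ("data type", [0.0, 1.0]),
    ("authors", [-1.0, 0.0]),
]
print(top_k([0.9, 0.1], toy_pairs, k=2))  # ['optimizer', 'data type']
```

The `search_kwargs={"k" : 2}` argument above plays the role of `k` here, which is why we get exactly two documents back.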

### Prompt Template

Now that we have our LLM and our Retriever set-up, let's connect them with our Prompt Template!

In [20]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT_TEMPLATE)
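To get a rough picture of the text the model ends up seeing, we can fill the same template with plain `str.format`. This is only an approximation: `ChatPromptTemplate` additionally wraps the result in a human chat message, and the context and question strings below are made up for illustration.

```python
RAG_PROMPT_TEMPLATE = """\
Using the provided context, please answer the user's question. If you don't know, say you don't know.

Context:
{context}

Question:
{question}
"""

# Made-up values standing in for retrieved chunks and a user question.
filled = RAG_PROMPT_TEMPLATE.format(
    context="QLoRA stores weights in 4-bit NormalFloat.",
    question="What data type does QLoRA use for storage?",
)
print(filled)
```

Notice that the context appears before the question, so the question is the last thing the model reads.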

#### ❓ Question #3

Does the ordering of the prompt matter?

Some research indicates recency bias in prompts (i.e., later information in the prompt is weighted more heavily than earlier information). Therefore, order does matter.



## Task 7: Creating a simple RAG pipeline with LangChain v0.1.0

All that's left to do is set up a RAG chain - and away we go!

In [21]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

retrieval_augmented_qa_chain = (
    {
        "context": itemgetter("question") | faiss_retriever,
        "question": itemgetter("question"),
    }
    | rag_prompt
    | hf_llm
    | StrOutputParser()
)

Let's test it out!

In [22]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA?"})

'\nAnswer:\nQLORA is a method for fine-tuning large language models (LLMs) that makes the process much more widely and easily accessible, particularly for researchers and teams with limited resources. It was designed to help close the resource gap between large corporations and small teams with consumer GPUs, and could potentially enable the finetuning of LLMs on mobile phones and other low-resource devices.'
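The data flow through the chain can be sketched in plain Python: look up context for the question, fill the prompt, call the model, return a string. Everything below is a stand-in under stated assumptions. `make_rag_chain`, `fake_retrieve` and `fake_llm` are hypothetical names, and no LangChain component is involved.

```python
from typing import Callable, Dict, List


def make_rag_chain(
    retrieve: Callable[[str], List[str]],
    template: str,
    llm: Callable[[str], str],
) -> Callable[[Dict[str, str]], str]:
    """Compose retrieve -> prompt -> llm into one callable."""

    def run(inputs: Dict[str, str]) -> str:
        question = inputs["question"]
        # "context" branch: fetch chunks for the question and join them.
        context = "\n".join(retrieve(question))
        # Prompt step: substitute both values into the template.
        prompt = template.format(context=context, question=question)
        # Model step: the stand-in already returns a plain string.
        return llm(prompt)

    return run


def fake_retrieve(question: str) -> List[str]:
    return ["QLoRA finetunes 4-bit quantized models with LoRA adapters."]


def fake_llm(prompt: str) -> str:
    return f"(model received {len(prompt)} characters)"


chain = make_rag_chain(fake_retrieve, "Context:\n{context}\n\nQuestion:\n{question}\n", fake_llm)
print(chain({"question": "What is QLoRA?"}))
```

In the LCEL version above, the dict at the head of the chain does the fan-out that `run` does by hand here, and the `|` operator does the composition.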

# 🤝 Breakout Room #2

## Task 1: Set-up LangSmith

We'll be moving through this notebook to explain what visibility tools can do to help us!

Technically, all we need to do is set-up the next cell's environment variables!

In [42]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE1 - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

Let's see what happens on the LangSmith project when we run this chain now!

In [24]:
retrieval_augmented_qa_chain.invoke({"question" : "What is QLoRA?"})

'\nAnswer:\nQLORA is a method for fine-tuning large language models (LLMs) that makes the process much more widely and easily accessible. It was designed to address the issue of LLMs being held by large corporations that do not release models or source code for auditing, making it difficult for smaller teams and researchers to access and use these models. QLORA aims to close the resource gap between large corporations and small teams with consumer GPUs, and potentially enable the finetuning of LLMs on mobile phones.'

Failed to batch ingest runs: TypeError('sequence item 0: expected str instance, ReadTimeoutError found')


We get *all of this information* for "free":

![image](https://i.imgur.com/8Wcpmcj.png)

> NOTE: We'll walk through this diagram in detail in class.

#### 🏗️ Activity #1:

Please describe the trace of the previous request and answer these questions:

1. How many tokens did the request use? 344 Tokens
2. How long did the `HuggingFaceEndpoint` take to complete? 7.89 seconds

## Task 2: Creating a LangSmith dataset

Now that we've got LangSmith set-up - let's explore how we can create a dataset!

First, we'll create a list of questions!

In [25]:
from langsmith import Client

questions = [
    "What optimizer is used in QLoRA?",
    "What data type was created in the QLoRA paper?",
    "What is a Retrieval Augmented Generation system?",
    "Who authored the QLoRA paper?",
    "What is the most popular deep learning framework?",
    "What significant improvements does the LoRA system make?"
]

Now we can create our dataset through the LangSmith `Client()`.

In [26]:
client = Client()
dataset_name = "QLoRA RAG Dataset"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about the QLoRA Paper to Evaluate RAG over the same paper."
)

client.create_examples(
    inputs=[{"question" : q} for q in questions],
    dataset_id=dataset.id
)

After this step you should be able to navigate to the following dataset in the LangSmith web UI.

![image](https://i.imgur.com/CdFYGTB.png)

## Task 3: Creating a custom evaluator

Now that we have a dataset - we can start thinking about evaluation.

We're going to make a `StringEvaluator` to measure "dopeness".

> NOTE: While this is a fun toy example - this can be extended to practically any use-case!

In [27]:
import re
from typing import Any, Optional
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.evaluation import StringEvaluator

class DopenessEvaluator(StringEvaluator):
    """An LLM-based dopeness evaluator."""

    def __init__(self):
        llm = ChatOpenAI(model="gpt-4", temperature=0)

        template = """On a scale from 0 to 100, how dope (cool, awesome, lit) is the following response to the input:
        --------
        INPUT: {input}
        --------
        OUTPUT: {prediction}
        --------
        Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

        self.eval_chain = PromptTemplate.from_template(template) | llm

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "scored_dopeness"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        evaluator_result = self.eval_chain.invoke(
            {"input": input, "prediction": prediction}, kwargs
        )
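        # The prompt asks for the score alone on the final line, so split from the right.
        reasoning, _, score_line = evaluator_result.content.strip().rpartition("\n")
        match = re.search(r"\d+", score_line)
        # Leave the score as None when the model did not return a number.
        score = float(match.group(0)) / 100.0 if match else None
        return {"score": score, "reasoning": reasoning.strip()}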
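Since the prompt asks the model to reason first and then repeat the score alone on the last line, the parsing step can be sketched and tested in isolation, with no API call. The `parse_score` helper is hypothetical, and the sample text is made up rather than real model output.

```python
import re
from typing import Optional, Tuple


def parse_score(text: str) -> Tuple[str, Optional[float]]:
    """Split evaluator output into (reasoning, score scaled to 0..1).

    The score is read from the final line; it is None if that line
    contains no digits.
    """
    reasoning, _, last_line = text.strip().rpartition("\n")
    match = re.search(r"\d+", last_line)
    if match is None:
        return text.strip(), None
    return reasoning.strip(), int(match.group(0)) / 100.0


sample = "The response is confident and clear.\nScore: 85\n85"
reasoning, score = parse_score(sample)
print(score)  # 0.85
```

Reading from the last line avoids picking up digits that happen to appear inside the reasoning itself.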

## Task 4: Initializing our evaluator config

Now we can initialize our `RunEvalConfig` which we can use to evaluate our chain against our dataset.

> NOTE: Check out the [documentation](https://docs.smith.langchain.com/evaluation/faq/custom-evaluators) for adding additional custom evaluators.

In [28]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    custom_evaluators=[DopenessEvaluator()],
    evaluators=[
        "criteria",
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                "AI": "Does the response feel AI generated? "
                "Respond Y if it does, and N if it doesn't."
            }
        ),
    ],
)

## Task 5: Evaluating our RAG pipeline

All that's left to do now is evaluate our pipeline!

In [29]:
client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=retrieval_augmented_qa_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="HF RAG Pipeline - Evaluation - v3",
    project_metadata={"version": "1.0.0"},
)

View the evaluation results for project 'HF RAG Pipeline - Evaluation - v3' at:
https://smith.langchain.com/o/bd2378ed-c53f-5d80-bbfb-3bde6b0735e6/datasets/d1379b4a-25ba-473b-99f2-2b4b67ada776/compare?selectedSessions=1e85ee12-16e2-41ba-a4e4-c36451225764

View all tests for Dataset QLoRA RAG Dataset at:
https://smith.langchain.com/o/bd2378ed-c53f-5d80-bbfb-3bde6b0735e6/datasets/d1379b4a-25ba-473b-99f2-2b4b67ada776
[------------------------------------------------->] 6/6

{'project_name': 'HF RAG Pipeline - Evaluation - v3',
 'results': {'532ce833-948a-42a8-aa32-6a8563a2b480': {'input': {'question': 'What optimizer is used in QLoRA?'},
   'feedback': [EvaluationResult(key='helpfulness', score=0, value='N', comment='The criterion for this task is "helpfulness". The submission should be helpful, insightful, and appropriate. \n\nLooking at the submission, the respondent states that they do not know the answer to the question. This is not helpful to the person asking the question, as it does not provide any new information or insight. \n\nThe response is appropriate in the sense that it is a valid response to a question when the answer is unknown. However, the criterion does not specify that the response needs to be appropriate, but rather that it needs to be helpful and insightful. \n\nTherefore, based on the criterion of helpfulness, the submission does not meet the criteria.\n\nN', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('4973e1ce-8