LangChain is an open-source framework for developing applications powered by large language models (LLMs). It provides tools and abstractions that help you customize what these models generate and improve the accuracy and relevance of their output.

At its core, LangChain offers a generic interface compatible with nearly any LLM. This facilitates a centralized development environment where data scientists can seamlessly integrate LLM applications with various external data sources and software workflows. This integration is crucial for those looking to harness the full potential of AI in their processes.

One of the most powerful features of LangChain is its module-based approach. This approach allows flexibility in performing experiments and optimizations of interactions with LLMs. Data scientists can dynamically compare prompts and switch between foundation models without significant code modifications. This saves valuable development time and enhances the ability to fine-tune applications to meet specific needs.


In this lab, you will see how LangChain simplifies integrating LLM capabilities into practical applications. You will learn the core concepts of LangChain and how to use LangChain's features to build more intelligent, responsive, and efficient applications. Whether you are a developer, a data scientist, or an AI enthusiast, this lab will give you a solid understanding of how to use LangChain to build AI applications.


For this lab, you will be using the following libraries:

*   [`ibm-watsonx-ai`, `ibm-watson-machine-learning`](https://ibm.github.io/watson-machine-learning-sdk/index.html) for using LLMs from IBM's watsonx.ai.
*   [`langchain`, `langchain-ibm`, `langchain-community`, `langchain-experimental`](https://www.langchain.com/) for using relevant features from LangChain.
*   [`pypdf`](https://pypi.org/project/pypdf/) is an open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
*   [`chromadb`](https://www.trychroma.com/) is an open-source vector database used to store embeddings.


### Installing required libraries

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You must run the following cell__ to install them:

**Note:** The version has been specified here to pin it. It's recommended that you do the same. Even if the library is updated in the future, the installed version will still support this lab work.

The installation might take approximately 2-3 minutes.

Since `%%capture` is being used to capture the installation process, you won't see the output. However, once the installation is complete, you will see a number beside the cell.


In [1]:
%%capture
!pip install --force-reinstall --no-cache-dir tenacity --user
!pip install "ibm-watsonx-ai==1.0.4" --user
!pip install "ibm-watson-machine-learning==1.0.357" --user
!pip install "langchain-ibm==0.1.7" --user
!pip install "langchain-community==0.2.1" --user
!pip install "langchain-experimental==0.0.59" --user
!pip install "langchainhub==0.1.17" --user
!pip install "langchain==0.2.1" --user
!pip install "pypdf==4.2.0" --user
!pip install "chromadb==0.4.24" --user

In [None]:
pip install pypdf

In [2]:
pip install -U langchain langchain-community openai


Collecting langchain
  Downloading langchain-0.3.25-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting openai
  Downloading openai-1.78.1-py3-none-any.whl.metadata (25 kB)
Collecting langchain-core<1.0.0,>=0.3.58 (from langchain)
  Downloading langchain_core-0.3.59-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Downloading langchain_text_splitters-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)

## LangChain concepts

### Model

A large language model (LLM) serves as the interface for the AI's capabilities. It processes plain text input and generates text output, forming the core functionality needed to complete various tasks. When integrated with LangChain, it becomes a powerful tool, providing the foundational structure necessary for building and deploying sophisticated AI applications.


In [3]:
from transformers import pipeline

pipe = pipeline("text-generation", model="tiiuae/falcon-rw-1b")
response = pipe("Explain what IoT is in simple words.", max_new_tokens=100)
print(response[0]['generated_text'])




Explain what IoT is in simple words.
IoT is a network of physical objects that are connected to each other and to a network. These objects can be anything from a car to a refrigerator.
What is IoT?
IoT is a network of physical objects that are connected to each other and to a network. These objects can be anything from a car to a refrigerator.
What is IoT?
IoT is a network of physical objects that are connected to each other and to a network. These objects can


### Chat model

Chat models support the assignment of distinct roles to conversation messages, helping to distinguish messages from the AI, users, and instructions such as system messages.

To use an LLM from watsonx.ai with LangChain, you wrap it with `WatsonxLLM()` from `langchain-ibm` so that it can be used with LangChain's other components. The cells below use OpenAI's `gpt-3.5-turbo` through `ChatOpenAI` instead, which is already a chat model and accepts a list of role-tagged messages directly.


In [4]:
import os

from langchain_community.chat_models import ChatOpenAI
from langchain_core.messages import HumanMessage

# Set your API key (left blank here; you can also set it as an environment variable)
os.environ["OPENAI_API_KEY"] = ""

# Create the chat model
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# Run a simple chat
messages = [
    HumanMessage(content="Hi, who are you?"),
    HumanMessage(content="Can you explain IoT in simple terms?")
]

# Calling chat(messages) directly is deprecated; use invoke() instead
response = chat.invoke(messages)
print(response.content)




Sure! IoT stands for Internet of Things, and it refers to a network of physical objects or devices that are connected to the internet. These devices can communicate with each other and with other systems to gather and exchange data. For example, smart thermostats, wearable fitness trackers, and connected home appliances are all examples of IoT devices. IoT technology allows for automation, monitoring, and control of these devices remotely, making our lives more convenient and efficient.


In [5]:




import os

from langchain_community.chat_models import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

# Set your OpenAI API key (left blank here; supply your own)
os.environ["OPENAI_API_KEY"] = ""

# Initialize the chat model
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.6)

# Initial conversation history: one system message, then alternating human/AI turns
chat_history = [
    SystemMessage(content="""
You are a virtual health assistant named Dr. Care.
You only answer questions related to general health, wellness, or lifestyle.
If a user asks about anything else (e.g., travel, finance, or technology), politely decline.
"""),

    HumanMessage(content="Hi Doctor, I often feel tired lately."),
    AIMessage(content="Hi! Fatigue can have many causes like stress, low iron, or sleep issues. Let's explore it."),

    HumanMessage(content="I sleep well but still feel exhausted."),
    AIMessage(content="It could be nutritional. Do you eat enough iron-rich foods like spinach, beans, or red meat?")
]

# New user input
user_input = "Can you recommend meals to improve my iron intake?"
chat_history.append(HumanMessage(content=user_input))

# Get the AI response (invoke() replaces the deprecated direct call)
response = chat.invoke(chat_history)
chat_history.append(AIMessage(content=response.content))

# Print the reply
print("Dr. Care:", response.content)



Dr. Care: Sure! You can try meals like spinach and chickpea curry, lentil soup, or a beef stir-fry with broccoli.


For demonstration purposes, we manually inserted `AIMessage` entries into the message history to simulate previous AI responses in a multi-turn conversation.

### Prompt templates

Prompt templates help translate user input and parameters into instructions for a language model. They can be used to guide a model's response, helping it understand the context and generate relevant and coherent language-based output.

There are several different types of prompt templates.

#### String prompt templates

These prompt templates are used to format a single string, and are generally used for simpler inputs.
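
To make the idea concrete, here is a minimal sketch of what a string prompt template does: it stores a string with `{placeholders}`, knows which variables it expects, and fills them in on request. The `SimpleStringTemplate` class is a hypothetical stand-in written with the standard library only; in LangChain this role is played by `PromptTemplate` (for example through `PromptTemplate.from_template`), whose implementation is not shown here.

```python
from string import Formatter


class SimpleStringTemplate:
    """Hypothetical stand-in for a string prompt template (not LangChain's class)."""

    def __init__(self, template: str):
        self.template = template
        # Collect the placeholder names that appear as {name} in the template.
        self.input_variables = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **kwargs: str) -> str:
        # Fail early if a placeholder has no value.
        missing = [v for v in self.input_variables if v not in kwargs]
        if missing:
            raise KeyError(f"Missing prompt variables: {missing}")
        return self.template.format(**kwargs)


prompt = SimpleStringTemplate("Tell me one {adjective} fact about {topic}.")
formatted = prompt.format(adjective="surprising", topic="the Internet of Things")
print(prompt.input_variables)
print(formatted)
```

The formatted string is what would be sent to the model as a single text prompt.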


In [6]:
from langchain_core.prompts import PromptTemplate

In [7]:
from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
import os

# ✅ Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = ""

# ✅ Initialize chat model
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# ✅ Define prompt template with System + Placeholder
prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content="You are Dr. Care, a friendly and qualified virtual health assistant. You answer only health-related questions in simple terms."),
    MessagesPlaceholder(variable_name="chat_history")
])

# ✅ Simulated conversation: System + Human + AI messages
input_ = {
    "chat_history": [
        HumanMessage(content="Hi Doctor, I have a headache."),
        AIMessage(content="I'm sorry to hear that! Can you tell me if it's sharp, dull, or throbbing?"),
        HumanMessage(content="It's a dull pain, mostly in the morning.")
    ]
}

# ✅ Chain: Prompt + Model
chain = prompt | llm

# ✅ Run simulation
response = chain.invoke(input_)
print("Dr. Care:", response.content)


Dr. Care: A dull headache in the morning could be due to various reasons like dehydration, lack of sleep, or stress. Make sure you drink enough water, try to get sufficient rest, and practice relaxation techniques. If the headache persists or gets worse, it's best to consult a healthcare professional for further evaluation.


### Example selectors

If you have a large number of examples, you may need to select which ones to include in the prompt. The Example Selector is the class responsible for doing so.


Example selectors can be based on:
- `Similarity`: Uses semantic similarity between inputs and examples to decide which examples to choose.
- `MMR`: Uses Maximal Marginal Relevance between inputs and examples to decide which examples to choose.
- `Length`: Selects examples based on how many can fit within a certain length.
- `Ngram`: Uses n-gram overlap between inputs and examples to decide which examples to choose.

Here, you can use the example selector based on length as an example. For more details on other types, please refer to [https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/).


This code builds a few-shot prompt that teaches a language model to generate antonyms. It defines a list of example word pairs (like "happy → sad") and uses a `LengthBasedExampleSelector` to decide how many of them to include: examples are added in order until the prompt would exceed `max_length`, so a longer user input leaves room for fewer examples. The selected examples are inserted into a prompt template along with the user's input word to guide the model's output.

In LangChain, when you define a list of input-output examples, you can use different example selector strategies to choose which examples to include in a prompt based on the user's current input. Although the example list itself is static, selectors such as `SemanticSimilarityExampleSelector` and `MaxMarginalRelevanceExampleSelector` filter it for each input. For instance:
- When the user inputs a new word such as "powerful", the selector first embeds this input and compares it to the embeddings of the inputs from your examples (like "happy", "energetic", "sunny", etc.).
- Using semantic similarity (e.g., cosine distance), it ranks how closely related each example is to the user's input and selects the most relevant ones.
- For "powerful", it might choose the example "energetic → lethargic" because it is semantically close, while ignoring unrelated examples like "happy → sad".
- This dynamic selection tailors the prompt to the current task while staying within length limits.
- The process varies by selector type: some prioritize relevance (`SemanticSimilarityExampleSelector`), others balance diversity and similarity (`MaxMarginalRelevanceExampleSelector`), and others simply include as many examples as fit (`LengthBasedExampleSelector`). This makes few-shot prompting more efficient and better tailored to each input.
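
The ranking step described above can be sketched in a few lines. This is only an illustration of similarity-based selection, not LangChain's selector: the three-dimensional vectors in `toy_embeddings` are hand-written assumptions, whereas a real selector would obtain vectors from an embedding model.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-written for illustration only.
toy_embeddings = {
    "happy": [0.9, 0.1, 0.0],
    "tall": [0.0, 0.9, 0.1],
    "energetic": [0.2, 0.1, 0.9],
    "powerful": [0.1, 0.2, 0.9],
}

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
]


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_examples(query, k=1):
    # Rank the examples by similarity of their input to the query, keep the top k.
    query_vec = toy_embeddings[query]
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_vec, toy_embeddings[ex["input"]]),
        reverse=True,
    )
    return ranked[:k]


selected = select_examples("powerful", k=1)
print(selected)
```

With these toy vectors, the example whose input points in nearly the same direction as "powerful" is ranked first; the other two are left out of the prompt.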



In [8]:
from langchain_core.example_selectors import LengthBasedExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate

# Examples of a pretend task of creating antonyms.
examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Input: {input}\nOutput: {output}",
)
example_selector = LengthBasedExampleSelector(
    examples=examples,
    example_prompt=example_prompt,
    max_length=25,  # Length budget, counted in words, shared by the user input and the selected examples.
)
dynamic_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Give the antonym of every input",
    suffix="Input: {adjective}\nOutput:",
    input_variables=["adjective"],
)

In [9]:
print(dynamic_prompt.format(adjective="big"))

Give the antonym of every input

Input: happy
Output: sad

Input: tall
Output: short

Input: energetic
Output: lethargic

Input: sunny
Output: gloomy

Input: windy
Output: calm

Input: big
Output:


In [10]:
long_string = "big and huge and massive and large and gigantic and tall and much much much much much bigger than everything else"
print(dynamic_prompt.format(adjective=long_string))

Give the antonym of every input

Input: happy
Output: sad

Input: big and huge and massive and large and gigantic and tall and much much much much much bigger than everything else
Output:
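
The two outputs above differ because the user input and the examples share one length budget. The following sketch reproduces that behaviour with the standard library only. It is not LangChain's code, and it assumes that length is counted as the number of pieces obtained by splitting on spaces and newlines; `select_by_length` and `text_length` are hypothetical helper names.

```python
import re

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
    {"input": "energetic", "output": "lethargic"},
    {"input": "sunny", "output": "gloomy"},
    {"input": "windy", "output": "calm"},
]


def text_length(text: str) -> int:
    # Assumption: length = number of pieces after splitting on spaces and newlines.
    return len(re.split(r"\n| ", text))


def select_by_length(user_input: str, max_length: int = 25) -> list:
    # The user input is charged against the budget first.
    remaining = max_length - text_length(user_input)
    selected = []
    for example in examples:
        formatted = f"Input: {example['input']}\nOutput: {example['output']}"
        cost = text_length(formatted)
        # Stop as soon as the next example no longer fits.
        if remaining <= 0 or remaining - cost < 0:
            break
        selected.append(example)
        remaining -= cost
    return selected


long_string = (
    "big and huge and massive and large and gigantic and tall and "
    "much much much much much bigger than everything else"
)
short_count = len(select_by_length("big"))
long_count = len(select_by_length(long_string))
print(short_count, long_count)
```

Each formatted example costs four words, so a one-word input leaves room for all five examples, while the 21-word input leaves room for only one.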


### Output parsers

Output parsers are responsible for taking the output of an LLM and transforming it to a more suitable format. This is very useful when you are using LLMs to generate any form of structured data, or to normalize output from chat models and LLMs.

LangChain has lots of different types of output parsers. This is a [list](https://python.langchain.com/v0.2/docs/concepts/#output-parsers) of output parsers LangChain supports. In this lab, you will use the following two output parsers as examples:

- `JSON`: Returns a JSON object as specified. You can specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling.
- `CSV`: Returns a list of comma separated values.



The next example shows how to ask a chat model (via LangChain) to return its response in a structured format, using Pydantic and LangChain's `JsonOutputParser`.

📌 What Are We Trying to Do?

We want to ask the model:

“Tell me a joke.”

…but instead of free-form text, we want the result in this format:

    {
      "setup": "Why did the chicken cross the road?",
      "punchline": "To get to the other side!"
    }

Structured, predictable output like this is easy to use in applications or to store in databases.

📦 What Is Pydantic?

Pydantic is a Python library for defining data structures. You can think of a Pydantic model as a form that the language model must fill in correctly.

✅ In our case:

    class Joke(BaseModel):
        setup: str      # The joke question
        punchline: str  # The punchline or answer

This says: “I want an object that has a `setup` (a string) and a `punchline` (also a string).”

The code below then works as follows:

 - `JsonOutputParser(pydantic_object=Joke)` generates instructions that tell the model: “Please format your answer as JSON that fits the Joke structure.” These formatting instructions are produced by calling `output_parser.get_format_instructions()`.

 - We build a prompt with LangChain's `PromptTemplate`. It combines three things: the task ("Answer the user query"), the format instructions (from the parser), and the actual user question (e.g., "Tell me a joke").

 - We initialize `ChatOpenAI`, the model that generates the response from the prompt (here `gpt-3.5-turbo`).

 - We chain everything together using LangChain's chaining syntax (`prompt | llm | output_parser`). This means: fill in the prompt, send it to the model, then parse the model's reply as JSON.

 - Finally, we invoke the chain using `chain.invoke({"query": "Tell me a joke."})`. The result is not plain text but a Python dictionary with the keys `setup` and `punchline`. As the output below shows, `JsonOutputParser` returns a `dict` rather than a `Joke` instance.
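
To see why the `|` syntax reads as "run this step, then pass the result to the next one", here is a small sketch of the idea using operator overloading. The `Step` class is a hypothetical illustration and not how LangChain implements its runnables; `fake_llm` returns a canned string instead of calling a model.

```python
import json


class Step:
    """Hypothetical minimal 'runnable': wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Step 1: turn the input dictionary into a prompt string.
prompt = Step(lambda inputs: f"Answer the user query.\n{inputs['query']}")

# Step 2: stand-in for a model call; returns a hand-written JSON string.
fake_llm = Step(
    lambda text: '{"setup": "Why did the chicken cross the road?", '
    '"punchline": "To get to the other side!"}'
)

# Step 3: parse the model's text into a Python dictionary.
parser = Step(json.loads)

chain = prompt | fake_llm | parser
result = chain.invoke({"query": "Tell me a joke."})
print(result["setup"])
```

Each `|` only composes the steps; nothing runs until `invoke` is called on the final chain.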



In [12]:
from langchain_community.chat_models import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
import os

# Set your API key
os.environ["OPENAI_API_KEY"] = ""

# ✅ 1. Define output structure
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

# ✅ 2. Create the query
joke_query = "Tell me a joke."

# ✅ 3. Initialize the model
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

# ✅ 4. Set up the parser
output_parser = JsonOutputParser(pydantic_object=Joke)

# ✅ 5. Create format instructions
format_instructions = output_parser.get_format_instructions()

# ✅ 6. Create prompt template
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": format_instructions},
)

# ✅ 7. Chain prompt → llm → output parser
chain = prompt | llm | output_parser

# ✅ 8. Invoke the chain
result = chain.invoke({"query": joke_query})
print(result)


{'setup': "Why couldn't the bicycle find its way home?", 'punchline': 'Because it lost its bearings.'}


In [13]:
from langchain.output_parsers import CommaSeparatedListOutputParser

output_parser = CommaSeparatedListOutputParser()

format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="Answer the user query. {format_instructions}\nList five {subject}.",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions},
)

chain = prompt | llm | output_parser

chain.invoke({"subject": "ice cream flavors"})

['Vanilla', 'Chocolate', 'Strawberry', 'Mint Chocolate Chip', 'Cookie Dough']
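
Conceptually, both parsers used above take the model's raw text and turn it into a Python value. The sketch below shows that post-processing step in plain Python. It is an illustration under simplifying assumptions, not the code of `JsonOutputParser` or `CommaSeparatedListOutputParser`; the function names and the raw replies are made up for the example.

```python
import json
import re


def parse_json_reply(text: str, required_keys: tuple) -> dict:
    """Extract a JSON object from a model reply and check that keys are present."""
    # Models sometimes add text around the JSON; keep only the outermost braces.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the reply")
    data = json.loads(match.group(0))
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"Reply is missing keys: {missing}")
    return data


def parse_comma_separated(text: str) -> list:
    """Split a reply on commas and trim whitespace around each item."""
    return [item.strip() for item in text.split(",") if item.strip()]


# Hypothetical raw replies, written by hand for this example.
raw_json_reply = (
    'Here you go:\n{"setup": "Why did the chicken cross the road?", '
    '"punchline": "To get to the other side!"}'
)
raw_list_reply = "Vanilla, Chocolate, Strawberry"

joke = parse_json_reply(raw_json_reply, required_keys=("setup", "punchline"))
flavors = parse_comma_separated(raw_list_reply)
print(joke["punchline"])
print(flavors)
```

The format instructions in the prompt exist to make this step reliable: the more closely the model follows them, the less cleanup the parser has to do.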

### Documents

#### Document object

A `Document` object in `LangChain` contains information about some data. It has two attributes:

- `page_content`: *`str`*: This attribute holds the content of the document.
- `metadata`: *`dict`*: This attribute contains arbitrary metadata associated with the document. It can be used to track various details such as the document id, file name, and so on.


Let's use an example to illustrate how to create a `Document` object. This is the object type that `LangChain` utilizes for handling text or documents.


In [2]:
from langchain_core.documents import Document

Document(page_content="""Python is an interpreted high-level general-purpose programming language.
                        Python's design philosophy emphasizes code readability with its notable use of significant indentation.""",
         metadata={
             'my_document_id' : 234234,
             'my_document_source' : "About Python",
             'my_document_create_time' : 1680013019
         })

Document(metadata={'my_document_id': 234234, 'my_document_source': 'About Python', 'my_document_create_time': 1680013019}, page_content="Python is an interpreted high-level general-purpose programming language. \n                        Python's design philosophy emphasizes code readability with its notable use of significant indentation.")

Note that you don't have to include metadata if you don't want to:


In [3]:
Document(page_content="""Python is an interpreted high-level general-purpose programming language.
                        Python's design philosophy emphasizes code readability with its notable use of significant indentation.""")

Document(metadata={}, page_content="Python is an interpreted high-level general-purpose programming language. \n                        Python's design philosophy emphasizes code readability with its notable use of significant indentation.")

#### Document loaders

Document loaders in LangChain are designed to load documents from a variety of sources, for instance when you want to load a PDF paper and have an LLM read it using LangChain.

LangChain offers over 100 distinct document loaders, along with integrations with other major providers in this field, such as AirByte and Unstructured. These integrations enable the loading of all kinds of documents (HTML, PDF, code) from various locations (private S3 buckets, public websites).

You can find a list of document types that LangChain can load at [https://python.langchain.com/v0.1/docs/integrations/document_loaders/](https://python.langchain.com/v0.1/docs/integrations/document_loaders/).

In this lab, you will be using the PDF loader and the URL/Website loader as examples.

##### PDF loader

By using the PDF loader, you can load a PDF file as a `Document` object.

In this case, you are loading a paper about LangChain. You can access and read the paper at [https://doi.org/10.48550/arXiv.2403.05568](https://doi.org/10.48550/arXiv.2403.05568).




In [4]:
import pypdf
print(pypdf.__version__)

4.2.0


In [5]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/96-FDF8f7coh0ooim7NyEQ/langchain-paper.pdf")


In [6]:
document = loader.load()

In [7]:
document[2]  # index 2 is the third page of the PDF (page_label '3')

Document(metadata={'producer': 'PyPDF', 'creator': 'Microsoft Word', 'creationdate': '2023-12-31T03:50:13+00:00', 'author': 'IEEE', 'moddate': '2023-12-31T03:52:06+00:00', 'title': 's8329 final', 'source': 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/96-FDF8f7coh0ooim7NyEQ/langchain-paper.pdf', 'total_pages': 6, 'page': 2, 'page_label': '3'}, page_content='Figure 2. An AIMessage illustration  \nC. Prompt Template  \nPrompt templates  [10] allow you to structure  input for LLMs. \nThey provide a convenient way to format user inputs and \nprovide instructions to generate responses. Prompt templates \nhelp ensure that the LLM understands the  desired context and \nproduces relevant outputs.  \nThe prompt template classes in LangChain  are built to \nmake constructing prompts with dynamic inputs easier. Of \nthese classes, the simplest is the PromptTemplate.  \nD. Chain  \nChains  [11] in LangChain refer to the combination of \nmultiple components to achieve specific

In [8]:
print(document[1].page_content[:1000])  # print the first 1000 characters of the second page (index 1)

LangChain helps us to unlock the ability to harness the 
LLM’s immense potential in tasks such as document analysis, 
chatbot development, code analysis, and countless other 
applications. Whether your desire is to unlock deeper natural 
language understanding , enhance data, or circumvent 
language barriers through translation, LangChain is ready to 
provide the tools and programming support you need to do 
without it that it is not only difficult but also fresh for you . Its 
core functionalities encompass:  
1. Context -Aware Capabilities: LangChain facilitates the 
development of applications that are inherently 
context -aware. This means that these applications can 
connect to a language model and draw from various 
sources of context, such as prompt instructions, a  few-
shot examples, or existing content, to ground their 
responses effectively.  
2. Reasoning Abilities: LangChain equips applications 
with the capacity to reason effectively. By relying on a 
language model, thes

##### URL and website loader

You can also load content from a URL or website into a `Document` object:


In [11]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://python.langchain.com/v0.2/docs/introduction/")
web_data = loader.load()
print(web_data[0].page_content[:1000])






Introduction | 🦜️🔗 LangChain







Skip to main contentA newer LangChain version is out! Check out the latest version.IntegrationsAPI referenceLatestLegacyMorePeopleContributingCookbooks3rd party tutorialsYouTubearXivv0.2Latestv0.2v0.1🦜️🔗LangSmithLangSmith DocsLangChain HubJS/TS Docs💬SearchIntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a Simple LLM Application with LCELBuild a Query Analysis SystemBuild a ChatbotConversational RAGBuild an Extraction ChainBuild an AgentTaggingdata_generationBuild a Local RAG ApplicationBuild a PDF ingestion and Question/Answering systemBuild a Retrieval Augmented Generation (RAG) AppVector stores and retrieversBuild a Question/Answering system over SQL dataSummarize TextHow-to guidesHow-to guidesHow to use tools in a chainHow to use a vectorstore as a retrieverHow to add memory to chatbotsHow to use example selectorsHow to map values to a graph databaseHow to add a semantic layer 

#### Text splitters

Once you've loaded documents, you'll often want to transform them to better suit your application.

The simplest example is splitting a long document into smaller chunks that fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

At a high level, text splitters work as follows:

1. Split the text up into small, semantically meaningful chunks (often sentences).
2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
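
The three steps above can be sketched in plain Python. This is a simplified illustration of the split, merge, and overlap idea, not LangChain's implementation; the function name `split_text`, its parameters, and the sample text are assumptions made for the example, and lengths are measured in characters.

```python
def split_text(text, separator="\n", chunk_size=60, chunk_overlap=30):
    # Step 1: split the text into small pieces on the separator.
    pieces = [piece for piece in text.split(separator) if piece]
    chunks = []
    current = []  # pieces that make up the chunk being built

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        # Step 2: keep adding pieces until the chunk would exceed chunk_size.
        if current and length(current + [piece]) > chunk_size:
            # Step 3: close the chunk, then keep only a tail of it as overlap.
            chunks.append(separator.join(current))
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = (
    "LangChain loads documents.\n"
    "Splitters cut them up.\n"
    "Chunks overlap a little.\n"
    "Embeddings come next."
)
chunks = split_text(sample)
for chunk in chunks:
    print(repr(chunk))
```

In this run, each chunk holds two lines and the last line of one chunk is repeated as the first line of the next, which is the overlap that preserves context across chunk boundaries.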

[Here](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) is a list of the types of text splitters LangChain supports.


Let's use a simple `CharacterTextSplitter` as an example to split the LangChain paper you just loaded.

This is the simplest method. It splits on a separator string (by default "\n\n"; the cell below passes "\n" instead) and measures chunk length by number of characters.


In [13]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20, separator="\n")  # chunk_size and chunk_overlap are measured in characters; split on single newlines
chunks = text_splitter.split_documents(document)
print(len(chunks))

148


It splits the document into 148 chunks. Let's look at the content of a chunk:


In [14]:
chunks[5].page_content   # take a look at any chunk's page content

'contextualized language models to introduce MindGuide, an \ninnovative chatbot serving as a mental health assistant for \nindividuals seeking guidance and support in these critical areas.'

#### Embedding models

LangChain's embedding model classes provide a common interface to text embedding models.

Embeddings generate a vector representation for a given piece of text. This is advantageous as it allows you to conceptualize text within a vector space. Consequently, you can perform operations such as semantic search, where you identify pieces of text that are most similar within the vector space.

There are lots of embedding model providers (OpenAI, IBM, Hugging Face, etc.). The first cell below shows how to use an embedding model from IBM's watsonx.ai; the cell after it computes the embeddings locally with a Hugging Face `sentence-transformers` model.


In [15]:
pip install sentence-transformers



In [None]:
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames

embed_params = {
    EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,
    EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
}


from langchain_ibm import WatsonxEmbeddings

watsonx_embedding = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="skills-network",
    params=embed_params,
)

## The following embeds content in each of the chunks. You can then output the first 5 numbers in the vector representation of the content of the first chunk:

texts = [text.page_content for text in chunks]

embedding_result = watsonx_embedding.embed_documents(texts)
embedding_result[0][:5]

In [16]:
from langchain_community.embeddings import HuggingFaceEmbeddings

hf_local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

texts = [text.page_content for text in chunks]
embedding_result = hf_local_embeddings.embed_documents(texts)

print(embedding_result[0][:5])



[-0.013132886961102486, 0.022271515801548958, 0.0062825786881148815, 0.0027191403787583113, -0.06523989140987396]


#### Vector stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A [vector store](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/) takes care of storing embedded data and performing vector search for you.
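To make the store-then-search idea concrete, here is a minimal sketch in plain Python. It does not reflect how Chroma works internally: `ToyVectorStore`, the three-dimensional vectors, and the sample texts are all made up for illustration, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store of (vector, text) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, vector, text):
        self.entries.append((vector, text))

    def similarity_search(self, query_vector, k=1):
        # Score every stored vector against the query and keep the k best.
        scored = [(cosine(query_vector, vec), text) for vec, text in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


toy_store = ToyVectorStore()
# Made-up 3-dimensional "embeddings"; a real model returns hundreds of dimensions.
toy_store.add([0.9, 0.1, 0.0], "LangChain chains")
toy_store.add([0.0, 0.2, 0.9], "PDF parsing")
toy_store.add([0.7, 0.6, 0.1], "LLM prompts")

top_two = toy_store.similarity_search([1.0, 0.0, 0.0], k=2)
print(top_two)
```

A real vector store adds persistence and approximate nearest-neighbor indexes so that the search does not have to score every stored vector.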

Many vector stores are available; this lab uses `Chroma` as an example.


The embedding model embeds each chunk, and the resulting vectors are stored in the Chroma vector database.


In [19]:
from langchain_community.vectorstores import Chroma

In [20]:
docsearch = Chroma.from_documents(chunks, hf_local_embeddings)

Next, use similarity search to retrieve the chunks most related to a query.

The vector store returns a list of relevant document chunks. Here, you print the contents of the most similar chunk:


In [21]:
query = "Langchain"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)

Off-the-Shelf Chains: LangChain offers pre -configured 
chains, which are structured assemblies of components 
tailored to accomplish specific high -level tasks. These pre -


#### Retrievers

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.

Retrievers accept a string `query` as input and return a list of `Document` objects as output.
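As a rough illustration of that contract (string in, list of documents out), the sketch below implements a retriever with no embeddings or vector store at all. `Doc` and `KeywordRetriever` are hypothetical stand-ins written for this example, not LangChain classes; only the shape of `invoke` mirrors the interface described above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Toy retriever: ranks documents by how many query words they contain."""

    def __init__(self, docs):
        self.docs = docs

    def invoke(self, query: str) -> list:
        words = set(query.lower().split())
        scored = []
        for doc in self.docs:
            overlap = len(words & set(doc.page_content.lower().split()))
            if overlap:
                scored.append((overlap, doc))
        # Highest word overlap first; documents with no overlap are dropped.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored]


corpus = [
    Doc("LangChain offers pre-configured chains"),
    Doc("Vector stores hold embeddings"),
    Doc("Chains combine LLM calls"),
]
hits = KeywordRetriever(corpus).invoke("langchain chains")
print([doc.page_content for doc in hits])
```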


A list of the advanced retrieval types that LangChain supports is available at [https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/). This section introduces the `Vector store-backed retriever` and the `Parent document retriever` as examples.


##### Vector store-backed retriever

A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR (Maximum marginal relevance), to query the texts in the vector store.

Since we've constructed a vector store `docsearch`, it's very easy to construct a retriever.


In [22]:
retriever = docsearch.as_retriever()
docs = retriever.invoke("Langchain")
docs[0]

Document(metadata={'author': 'IEEE', 'creationdate': '2023-12-31T03:50:13+00:00', 'creator': 'Microsoft Word', 'moddate': '2023-12-31T03:52:06+00:00', 'page': 1, 'page_label': '2', 'producer': 'PyPDF', 'source': 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/96-FDF8f7coh0ooim7NyEQ/langchain-paper.pdf', 'title': 's8329 final', 'total_pages': 6}, page_content='Off-the-Shelf Chains: LangChain offers pre -configured \nchains, which are structured assemblies of components \ntailored to accomplish specific high -level tasks. These pre -')

Note that the result is identical to the one obtained with `similarity_search`, because `as_retriever()` uses similarity search by default.
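MMR, mentioned earlier as the other search method, trades relevance to the query against redundancy with results already selected. The following is a minimal sketch of that idea in plain Python, assuming cosine similarity and a `lambda_mult` weight between 0 and 1; the function names and the two-dimensional vectors are invented for this example and are not LangChain's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query_vec = [1.0, 1.0]
doc_vecs = [
    [1.0, 0.8],   # most relevant
    [1.0, 0.75],  # near-duplicate of the first
    [0.2, 1.0],   # less relevant, but different
]

diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking; lower values favor more diverse results.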

##### Parent document retriever


When splitting documents for retrieval, there are often conflicting desires:

1. You may want small documents so that their embeddings most accurately reflect their meaning. If a document is too long, its embedding can lose meaning.
2. You may want documents that are long enough to retain the context of each chunk.

The `ParentDocumentRetriever` strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent IDs for them and returns those larger documents.
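The lookup from small chunk to parent document can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the library's implementation: `ToyParentRetriever` is a hypothetical class, sentences serve as child chunks, and a substring match stands in for embedding similarity search.

```python
class ToyParentRetriever:
    """Hypothetical sketch: index small chunks, return their full parents."""

    def __init__(self):
        self.parents = {}    # parent_id -> full text (plays the docstore role)
        self.children = []   # (chunk, parent_id) pairs (plays the vector store role)

    def add_documents(self, texts):
        for text in texts:
            parent_id = len(self.parents)
            self.parents[parent_id] = text
            # Child splitter: one chunk per sentence.
            for sentence in text.split("."):
                if sentence.strip():
                    self.children.append((sentence.strip(), parent_id))

    def invoke(self, query):
        # Substring match stands in for embedding similarity search.
        results, seen = [], set()
        for chunk, parent_id in self.children:
            if query.lower() in chunk.lower() and parent_id not in seen:
                seen.add(parent_id)
                results.append(self.parents[parent_id])
        return results


texts = [
    "LangChain is a framework for LLM applications. It offers chains and agents.",
    "Chroma is a vector database. It stores embeddings for similarity search.",
]
toy_retriever = ToyParentRetriever()
toy_retriever.add_documents(texts)

# The match happens on one sentence, but the whole parent text comes back.
parents_found = toy_retriever.invoke("embeddings")
print(parents_found)
```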


In [26]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.documents import Document

# ✅ Step 1: Embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# ✅ Step 2: Splitters
parent_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=20, separator="\n")
child_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20, separator="\n")

# ✅ Step 3: Vector Store
vectorstore = Chroma(
    collection_name="split_parents",
    embedding_function=embedding_model
)

# ✅ Step 4: Doc Store
store = InMemoryStore()

# ✅ Step 5: ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# ✅ Step 6: Add Documents
# You must have a list of LangChain `Document` objects.
# Example dummy content:
documents = [
    Document(page_content="LangChain is a framework for building applications with LLMs."),
    Document(page_content="It provides components for chains, memory, agents, and tools."),
]

retriever.add_documents(documents)

# ✅ Step 7: Check parent doc keys in store
print("Number of parent docs stored:", len(list(store.yield_keys())))

# ✅ Step 8: Run vector search directly on child chunks
sub_docs = vectorstore.similarity_search("LangChain")
print("\n📘 Similar child chunk:", sub_docs[0].page_content)

# ✅ Step 9: Retrieve parent documents
retrieved_docs = retriever.invoke("LangChain")
print("\n📙 Retrieved parent doc:", retrieved_docs[0].page_content)




Number of parent docs stored: 2

📘 Similar child chunk: LangChain is a framework for building applications with LLMs.

📙 Retrieved parent doc: LangChain is a framework for building applications with LLMs.


##### RetrievalQA

Now that you understand how to retrieve information from a document, you can build more interesting applications on top of retrieval. For instance, you could have a large language model (LLM) read the paper and summarize it for you, or create a QA bot that answers your questions based on the paper.

Here's an example using LangChain's `RetrievalQA`.


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# ✅ Choose a small causal language model from Hugging Face
# (a seq2seq model such as "google/flan-t5-base" would instead need
# AutoModelForSeq2SeqLM and the "text2text-generation" pipeline task)
model_name = "tiiuae/falcon-rw-1b"

# ✅ Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# ✅ Create HF text-generation pipeline
gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    return_full_text=False
)

# ✅ Wrap in LangChain-compatible LLM
hf_llm = HuggingFacePipeline(pipeline=gen_pipeline)

qa = RetrievalQA.from_chain_type(
    llm=hf_llm,
    chain_type="stuff",
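    # Use the Chroma store built from the paper; `retriever` was reassigned
    # above to the ParentDocumentRetriever over two dummy documents.
    retriever=docsearch.as_retriever(),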
    return_source_documents=True
)
query = "what is this paper discussing?"
qa.invoke(query)
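
The `chain_type="stuff"` setting means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows that idea in plain Python. `build_stuff_prompt` is a hypothetical helper, and the prompt wording is illustrative rather than LangChain's exact template.

```python
def build_stuff_prompt(question, retrieved_texts):
    """Concatenate every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
retrieved_texts = [
    "LangChain offers pre-configured chains.",
    "Chains are structured assemblies of components.",
]
prompt = build_stuff_prompt("What is this paper discussing?", retrieved_texts)
print(prompt)
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, so it suits cases where only a few short chunks are retrieved.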