# LangChain Fundamentals
***
## Table of Contents
***

1. [Introduction](#1-introduction)
1. [Environmental Variables](#2-environmental-variables)
1. [Using Language Model](#3-using-language-model)
    - [Loading Model](#loading-model)
    - [Interacting with Language Model](#interacting-with-language-model)
    - [Prompt Template](#prompt-template)
    - [Formatting with Pydantic](#formatting-with-pydantic)
    - [Runnables](#runnables)
1. [LangSmith](#4-langsmith)
    - [Default Tracing](#default-tracing)
    - [Non-LangChain Code Tracing](#non-langchain-code-tracing)
1. [Sequential Chain](#5-sequential-chain)
1. [Semantic Search](#6-semantic-search)
    - [Document Class](#document-class)
    - [Splitters](#splitters)
    - [Embeddings](#embeddings)
    - [Vector Stores](#vector-stores)
    - [Retrievers](#retrievers)
1. [References](#7-references)

***
## 1. Introduction
For developing modern applications powered by large language models (LLMs), the LangChain ecosystem is a widely used framework that simplifies development and aims to be robust enough for production deployments.

The LangChain ecosystem comprises several key libraries:

- **LangChain** for core development of LLM-powered applications, providing abstractions, chains, and retrieval mechanisms.
- **LangSmith** for monitoring, evaluating, and debugging LangChain applications.
- **LangGraph** for orchestrating stateful, complex multi-agent workflows and building advanced AI pipelines.

This project covers the fundamentals of LangChain through practical examples, referencing official tutorials and educational videos from YouTube as learning resources.

## 2. Environmental Variables
First, environment variables need to be configured. These variables, especially API keys, should never be hardcoded or made visible to others. The [python-dotenv](https://pypi.org/project/python-dotenv/) library makes it straightforward to load variables set in a `.env` file.

In [1]:
import os
from getpass import getpass

try:
    from dotenv import load_dotenv

    load_dotenv()
except ImportError:
    raise ImportError("Error: 'python-dotenv' not installed")

`os.environ` is a dictionary-like object representing the environment variables of the current process. It allows users to assign new values using the syntax: `os.environ['some_key'] = 'some_value'`

In [2]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "Enter OpenAI API key: "
)
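
As a small illustration of the dictionary-like behaviour described above, the sketch below assigns, reads, and removes a variable, and fails early when a required key is missing. The helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and used only for this example.

```python
import os


def require_env(key: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(key)  # .get() returns None instead of raising KeyError
    if not value:
        raise RuntimeError(f"Environment variable {key!r} is not set")
    return value


# Assign and read back a value, as with a normal dictionary.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))

# Remove the key again so the demo leaves no trace in the process environment.
del os.environ["DEMO_API_KEY"]
print(os.environ.get("DEMO_API_KEY", "<missing>"))
```

Values assigned this way only affect the current process and any child processes it starts; they are not written back to the `.env` file.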


## 3. Using Language Model
### Loading Model
There are multiple approaches to loading language models:
1. Use `init_chat_model` from `langchain.chat_models`, with specified model name and provider. 
2. Specify the model's integration package first, then call the appropriate method to initialise the model.

In [3]:
# from langchain.chat_models import init_chat_model
from langchain_openai import ChatOpenAI


model_name = "gpt-4o-mini"

# 1. Simpler call
# model = init_chat_model(model=model_name, model_provider="openai", temperature=0.8)

# 2. Direct use of the integration package
model = ChatOpenAI(model=model_name, temperature=0.8)

### Interacting with Language Model
For a simple call, we can pass `messages` to the `.invoke` method. The list of message objects (`messages`) can be categorised into three parts:
- SystemMessage: Text that guides or determines AI's behaviour or actions.
- HumanMessage: Input given by a user.
- AIMessage: Output generated by the model.

In [4]:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="Translate the following message from English into French"),
    HumanMessage(content="Today is a beautiful day"),
]

model.invoke(input=messages)

AIMessage(content="Aujourd'hui est une belle journée.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 7, 'prompt_tokens': 24, 'total_tokens': 31, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-C7SnZoVQPtQDQmWHqahQWkyPqy9lo', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--03b5332e-0983-4a42-b4b9-a34ec2daa615-0', usage_metadata={'input_tokens': 24, 'output_tokens': 7, 'total_tokens': 31, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

### Prompt Template
A prompt template provides a structured way of creating inputs for language models where parts of the prompt can be dynamically changed based on context or user input.

Prompts in LangChain can be split into three components:
- System Prompt: Gives instructions or a personality to the LLM. This prompt determines the behaviour or characteristics of the model.
- User Prompt: Input given by a user.
- AI Prompt: Output generated by the model.

In [5]:
from langchain.prompts import (
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_prompt = SystemMessagePromptTemplate.from_template(
    template="You are an AI translater. Translate text from English to {language}",
    input_variables=["language"],
)

user_prompt = HumanMessagePromptTemplate.from_template(
    template="""Your task is to translate a text. The text to be translated is:
    ---
    {text}
    ---
    Output only the translated text, no other explanation or text should be provided
    """,
    input_variables=["text"],  # Define a variable
)

text_in_english = """
A croissant is a French Viennoiserie in a crescent shape made from a laminated yeast dough that sits between a bread and a puff pastry.
"""

target_language = "French"

Let's display the formatted user prompt after inserting a value into the `text` parameter:

In [6]:
print(user_prompt.format(text=text_in_english))

content='Your task is to translate a text. The text to be translated is:\n    ---\n    \nA croissant is a French Viennoiserie in a crescent shape made from a laminated yeast dough that sits between a bread and a puff pastry.\n\n    ---\n    Output only the translated text, no other explanation or text should be provided\n    ' additional_kwargs={} response_metadata={}
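
Conceptually, filling a prompt template resembles Python string formatting with named placeholders. The following is a rough stdlib-only sketch of that idea, not how LangChain implements its templates; `template_variables` and `render_prompt` are hypothetical helpers introduced for illustration.

```python
from string import Formatter


def template_variables(template: str) -> list[str]:
    """List the placeholder names that appear in a format-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def render_prompt(template: str, **values: str) -> str:
    """Fill a template, failing loudly if a placeholder has no value."""
    missing = [name for name in template_variables(template) if name not in values]
    if missing:
        raise KeyError(f"Missing template variables: {missing}")
    return template.format(**values)


template = "Translate the following text from English to {language}: {text}"
print(template_variables(template))
print(render_prompt(template, language="French", text="Good morning"))
```

Checking for missing variables before formatting gives a clearer error message than the bare `KeyError` that `str.format` would raise.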


After defining system and user prompts, we can merge them into a full chat prompt using `ChatPromptTemplate`:

In [7]:
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages([system_prompt, user_prompt])

`ChatPromptTemplate` adds a prefix indicating the role of each message (e.g., `System:`, `Human:` or `AI:`).

In [8]:
print(prompt_template.format(text=text_in_english, language=target_language))

System: You are an AI translater. Translate text from English to French
Human: Your task is to translate a text. The text to be translated is:
    ---
    
A croissant is a French Viennoiserie in a crescent shape made from a laminated yeast dough that sits between a bread and a puff pastry.

    ---
    Output only the translated text, no other explanation or text should be provided
    


Using **L**ang**C**hain **E**xpression **L**anguage (LCEL), we can construct a chain that links the user input, prompt templates, the model and the output in a sequence:

`input | prompt template | model | output`

This pipeline enables smooth data flow, where the user input is formatted by the prompt template, passed to the model for inference, and the model's response is captured as the output.

Note that the first step of the chain is a dictionary of lambda expressions. The prompt template variables are stored within the input dictionary, so the lambdas explicitly extract the necessary fields to ensure the correct values are passed to the next stage in the chain. The final dictionary works the same way, mapping the `content` of the model's response to the `translated_text` key.

In [9]:
chain = (
    {"text": lambda x: x["text"], "language": lambda x: x["language"]}
    | prompt_template
    | model
    | {"translated_text": lambda x: x.content}
)

chain.invoke(input={"text": text_in_english, "language": target_language})

{'translated_text': "Un croissant est une viennoiserie française en forme de croissant, faite à partir d'une pâte levée feuilletée qui se situe entre le pain et la pâte feuilletée."}
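
To make the `|` composition less opaque, here is a toy, stdlib-only sketch of how a pipe operator can chain steps so that each output feeds the next input. It is an illustration under simplifying assumptions, not LangChain's implementation; the `Step` class and the stub stages (including the fake model, which just upper-cases its input) are hypothetical.

```python
from typing import Any, Callable


class Step:
    """A toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stub stages standing in for the prompt template, the model and the output mapping.
format_prompt = Step(lambda d: f"Translate to {d['language']}: {d['text']}")
fake_model = Step(lambda prompt: prompt.upper())  # placeholder, no real LLM call
to_dict = Step(lambda reply: {"translated_text": reply})

toy_chain = format_prompt | fake_model | to_dict
print(toy_chain.invoke({"text": "hello", "language": "French"}))
```

The composed object exposes the same `invoke` method as its parts, which is what lets chains be nested inside other chains.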

### Formatting with Pydantic
Using Pydantic, we can enforce a specific format or data structure on the output generated by the model.

In [10]:
from pydantic import BaseModel, Field


class Translation(BaseModel):
    org_text: str = Field(description="Original text")
    tl_text: str = Field(description="Translated text")
    n_words: int = Field(description="Number of words in the translated text")


structured_model = model.with_structured_output(Translation)

second_system_prompt = SystemMessagePromptTemplate.from_template(
    template="""
You are an AI translator. Translate a text from English to {language}.
""",
    input_variables=["language"],
)

second_user_prompt = HumanMessagePromptTemplate.from_template(
    template="""
    Your task is to translate the following text:
    ---
    {text}
    ---

    then count the number of words in the translated text. 
    Output only the translated text and the word counts, no other explanation or text should be provided.
""",
    input_variables=["text"],
)

second_prompt_template = ChatPromptTemplate.from_messages(
    [second_system_prompt, second_user_prompt]
)

second_chain = (
    {
        "text": lambda x: x["text"],
        "language": lambda x: x["language"],
    }
    | second_prompt_template
    | structured_model
    | {
        "org_text": lambda x: x.org_text,
        "tl_text": lambda x: x.tl_text,
        "n_words": lambda x: x.n_words,
    }
)
second_chain.invoke(input={"text": text_in_english, "language": target_language})

{'org_text': 'A croissant is a French Viennoiserie in a crescent shape made from a laminated yeast dough that sits between a bread and a puff pastry.',
 'tl_text': "Un croissant est une viennoiserie française en forme de croissant faite d'une pâte à levure laminée qui se situe entre un pain et une pâte feuilletée.",
 'n_words': 26}
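
Because `n_words` is produced by the model rather than computed, it can be worth checking it deterministically. The sketch below uses a plain dictionary with values copied from the output above and a hypothetical helper, `count_words`, that splits on whitespace; other definitions of a "word" would give different counts.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (one simple definition of a word)."""
    return len(text.split())


# Values copied from the structured output shown above.
result = {
    "tl_text": (
        "Un croissant est une viennoiserie française en forme de croissant "
        "faite d'une pâte à levure laminée qui se situe entre un pain et "
        "une pâte feuilletée."
    ),
    "n_words": 26,
}

actual = count_words(result["tl_text"])
print(f"Reported: {result['n_words']}, counted: {actual}")
if actual != result["n_words"]:
    print("The model's word count does not match the whitespace-based count.")
```

A structured output schema constrains the shape and types of the response, not the correctness of the values, so checks like this remain the caller's responsibility.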

### Runnables
In LangChain, runnables are modular, executable building blocks (units of work) that can be invoked, batched, streamed, transformed, and composed across various LangChain components. Chains and their components are all implemented as runnables.

- **RunnableSequence**: A class that chains multiple runnable components together.
- **RunnableLambda**: A class that turns a Python callable (e.g., function or lambda) into a runnable component.
- **RunnablePassthrough**: A class that passes its input through unmodified (as a placeholder) or adds additional keys to the output for flexible integration into runnable sequences.
- **RunnableParallel**: A class that runs multiple runnables concurrently.

The following example demonstrates the use of `RunnableLambda`. By wrapping a function with `RunnableLambda`, it becomes a runnable and can then be called with the `.invoke` method.

In [11]:
from langchain_core.runnables import RunnableLambda


def greet(name) -> str:
    return f"Hello, {name}!"


greet_runnable = RunnableLambda(
    func=lambda x: greet(name=x)
)  # Wrap function as Runnable

result = greet_runnable.invoke(input="Alice")
print(result)

Hello, Alice!


## 4. LangSmith
LangSmith is a comprehensive platform developed by the LangChain team for observability, debugging, testing, and evaluation of large language model (LLM) applications. It helps developers monitor and improve AI-powered apps by capturing detailed traces of interactions, including inputs, outputs, intermediate steps, execution times, and errors.

In [12]:
os.environ["LANGSMITH_TRACING"] = "true"

os.environ["LANGSMITH_API_KEY"] = os.getenv("LANGSMITH_API_KEY") or getpass(
    "Enter LangSmith API key: "
)
os.environ["LANGSMITH_PROJECT"] = os.getenv("LANGSMITH_PROJECT") or getpass(
    "Enter project name: "
)
os.environ["LANGSMITH_ENDPOINT"] = os.getenv("LANGSMITH_ENDPOINT") or getpass(
    "Enter LangSmith endpoint: "
)

print(f"Project name: {os.environ['LANGSMITH_PROJECT']}")

Project name: lc_fundamentals


### Default Tracing

If the API keys and parameters were properly configured in the `.env` file, the executions above should have been traced in the [LangSmith UI](https://eu.smith.langchain.com/). LangSmith automatically records logs (e.g., inputs, outputs, errors, and execution time), which greatly facilitates the debugging process.

![Default Trace](_images/default_trace.png)

### Non-LangChain Code Tracing
By adding the `@traceable` decorator, LangSmith will be able to trace non-LangChain functions.

In [13]:
from langsmith import traceable
import random


@traceable
def generate_random_int():
    num = random.randint(1, 10)
    if num % 2 == 0:
        raise ValueError("Error: The value has to be odd.")
    else:
        return "Odd value. No problem."


generate_random_int()

'Odd value. No problem.'

![Traceable](_images/traceable.png)
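
To give a feel for what a tracing decorator records, the following stdlib-only sketch captures inputs, output or error, and elapsed time for each call in an in-memory list. It is a simplified illustration and not how LangSmith's `@traceable` is implemented; `toy_traceable`, `TRACE_LOG` and `check_odd` are hypothetical names.

```python
import functools
import time
from typing import Any, Callable

TRACE_LOG: list[dict[str, Any]] = []  # in-memory stand-in for a tracing backend


def toy_traceable(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record inputs, output or error, and elapsed time for each call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        record: dict[str, Any] = {"name": func.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            record["output"] = func(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # re-raise so the caller still sees the failure
        finally:
            record["seconds"] = time.perf_counter() - start
            TRACE_LOG.append(record)

    return wrapper


@toy_traceable
def check_odd(num: int) -> str:
    if num % 2 == 0:
        raise ValueError("Error: The value has to be odd.")
    return "Odd value. No problem."


check_odd(3)
try:
    check_odd(4)
except ValueError:
    pass

for entry in TRACE_LOG:
    print(entry["name"], entry.get("output"), entry.get("error"))
```

The `finally` block ensures a record is stored for both successful and failing calls, which mirrors why errors also show up in a trace view.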

## 5. Sequential Chain
A Sequential Chain is a chain that executes multiple steps (or sub-chains) in a defined order, where the output of one step becomes the input to the next. This allows us to build complex workflows by composing simpler chains sequentially. In the following example, the first chain takes the input `ingredient` and generates a `dish` that uses the ingredient. Then, the second chain takes the `dish` name as input and writes a `description` of the dish, returning the description as the final output.

In [14]:
from langchain.chains import SequentialChain, LLMChain
from langchain.prompts import PromptTemplate

chain_1 = LLMChain(
    llm=model,
    prompt=PromptTemplate(
        input_variables=["ingredient"],
        template="Generate a dish that uses {ingredient}. Output only the dish's name.",
    ),
    output_key="dish",
)

chain_2 = LLMChain(
    llm=model,
    prompt=PromptTemplate(
        input_variables=["dish"],
        template="Write a simple description of {dish} in 2 - 3 sentences.",
    ),
    output_key="description",
)

sequential_chain = SequentialChain(
    chains=[chain_1, chain_2],
    input_variables=["ingredient"],
    output_variables=["dish", "description"],
    verbose=True,
)

result = sequential_chain.invoke({"ingredient": "tomato"})
print(result)


> Entering new SequentialChain chain...

> Finished chain.
{'ingredient': 'tomato', 'dish': 'Tomato Basil Risotto', 'description': 'Tomato Basil Risotto is a creamy, comforting Italian dish made from Arborio rice simmered slowly in a flavorful broth until tender and velvety. Fresh tomatoes and aromatic basil are stirred in, infusing the risotto with vibrant flavors and a hint of sweetness. This dish is often finished with a sprinkle of Parmesan cheese, adding richness and depth to each bite.'}
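
The data flow of a sequential chain can be sketched without any LLM: each step reads from a shared dictionary and stores its result under its own output key, so later steps can use earlier results. This is a minimal illustration with stub functions in place of model calls; `run_sequential`, `suggest_dish` and `describe_dish` are hypothetical names, not LangChain APIs.

```python
from typing import Callable

# Each step is (output_key, function); the function reads the shared state dict.
ChainStep = tuple[str, Callable[[dict[str, str]], str]]


def run_sequential(steps: list[ChainStep], inputs: dict[str, str]) -> dict[str, str]:
    """Run steps in order, storing each result under its output key."""
    state = dict(inputs)  # copy so the caller's dict is not mutated
    for output_key, func in steps:
        state[output_key] = func(state)
    return state


# Stub functions standing in for the two LLM calls.
def suggest_dish(state: dict[str, str]) -> str:
    return f"{state['ingredient'].title()} Soup"


def describe_dish(state: dict[str, str]) -> str:
    return f"{state['dish']} is a simple dish."


result = run_sequential(
    [("dish", suggest_dish), ("description", describe_dish)],
    {"ingredient": "tomato"},
)
print(result)
```

As in the output above, the final dictionary contains the original input alongside every intermediate and final output key.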


## 6. Semantic Search
### Document Class
To handle external data (e.g., websites, PDF files), the Document class stores a piece of text along with its associated metadata. The following are the three attributes of the Document class:

- `page_content`: Content in string.
- `metadata`: Dictionary including metadata.
- `id` (optional): Identifier in string for the document.

In [15]:
from langchain_core.documents import Document

doc = Document(
    page_content="Page content in string format.",
    metadata={"source": "https://example.com"},
)

`PyPDFLoader` loads one `Document` object per PDF page (the `pypdf` package is required). The `.load()` method returns a list of these documents, each holding the page content and metadata of one page. Individual pages are accessed by index, e.g., `docs[0]` for the first page.

In [16]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "_datasets/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

print(f"Total number of pages: {len(docs)}")

Total number of pages: 107


In [17]:
print(f"PAGE CONTENT:\n{'*' * 20}\n{docs[0].page_content[:100]}\n")
print(f"META DATA:\n{'*' * 20}\n{docs[0].metadata}")

PAGE CONTENT:
********************
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K


META DATA:
********************
{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': '_datasets/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


### Splitters
Splitters are utility classes in LangChain designed to divide large bodies of text or document objects into smaller chunks. A common example is the `RecursiveCharacterTextSplitter`, which splits documents based on character count, overlap, and other configurable criteria.

Text splitters ensure that content is broken down into appropriately sized chunks that do not exceed the maximum token limit of LLMs, which also helps to produce more precise embeddings. This approach also facilitates parallel (batch) processing of document chunks. By setting the `chunk_overlap` parameter, splitters preserve context across adjacent chunks, reducing the loss of important information and improving retrieval accuracy in retrieval-augmented generation (RAG) systems.

The following example splits the loaded document into chunks of up to 1000 characters with 200 characters of overlap between chunks:

In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
)
all_splits = text_splitter.split_documents(docs)

print(f"Total number of chunks: {len(all_splits)}")

Total number of chunks: 516


Here are the details of the parameters:

- `chunk_size`: Determines the **maximum number** of characters each chunk can contain.
- `chunk_overlap`: Specifies the number of characters to overlap at the end of a chunk, which will be repeated at the start of the next chunk. This helps to preserve context at the boundaries between chunks.
- `add_start_index`: When set to `True`, each split `Document` will include metadata indicating the starting character index of that chunk within the original document.

The `RecursiveCharacterTextSplitter` prioritises splitting at natural boundaries (e.g., paragraphs, sentences, or newline characters) rather than splitting strictly at the chunk size. This segmentation avoids cutting text in the middle of words or sentences; therefore, setting the `chunk_size` parameter to 1000 can result in chunks containing fewer than 1000 characters.

In [19]:
print(f"Number of characters in the 1st chunk: {len(all_splits[0].page_content)}\n")
print(f"{all_splits[0].page_content}\n")
print(f"{all_splits[0].metadata}\n")

Number of characters in the 1st chunk: 972

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023
OR
☐  TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE TRANSITION PERIOD FROM                         TO                         .
Commission File No. 1-10635
NIKE, Inc.
(Exact name of Registrant as specified in its charter)
Oregon 93-0584541
(State or other jurisdiction of incorporation) (IRS Employer Identification No.)
One Bowerman Drive, Beaverton, Oregon 97005-6453
(Address of principal executive offices and zip code)
(503) 671-6453
(Registrant's telephone number, including area code)
SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:
Class B Common Stock NKE New York Stock Exchange
(Title of each class) (Trading symbol) (Name of each exchange on w

In [20]:
print(f"Number of characters in the 2nd chunk: {len(all_splits[1].page_content)}\n")
print(f"{all_splits[1].page_content}\n")
print(f"{all_splits[1].metadata}\n")

Number of characters in the 2nd chunk: 975

SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:
Class B Common Stock NKE New York Stock Exchange
(Title of each class) (Trading symbol) (Name of each exchange on which registered)
SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:
NONE
Indicate by check mark: YES NO
• if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. þ ¨ 
• if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. ¨ þ 
• whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding
12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the
past 90 days.
þ ¨ 
• whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 40

As described above, the first chunk (`start_index=0`) contains the first 972 (< 1000) characters with indices 0–971, and the second chunk (`start_index=781`) includes 975 (< 1000) characters starting from index 781. Each chunk is split at natural boundaries, with up to 200 characters overlapping between chunks.
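
The interplay of `chunk_size`, `chunk_overlap` and the start index can be sketched with a simple fixed-window splitter. This is a deliberate simplification: unlike `RecursiveCharacterTextSplitter`, it cuts strictly by character count and ignores natural boundaries. The function `chunk_with_overlap` is a hypothetical helper for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(25))  # 25 characters: "0123456789..."
for chunk in chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk["start_index"], chunk["text"])
```

With fixed windows the overlap is always exactly `chunk_overlap`. In the real output above, the first two chunks overlap by 972 - 781 = 191 characters, slightly under the 200-character limit, because the splitter aligns chunk edges with natural boundaries.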

### Embeddings
Embeddings are numerical representations that map words, sentences, or documents into vectors within a high-dimensional space. These vectors capture the semantic meaning of the text, such that texts with similar meanings have vectors located close to each other. Each piece of text is converted into a vector using machine learning models trained to understand language semantics. Similarity between texts is then measured using vector similarity metrics (e.g., cosine similarity, Euclidean distance, inner product), which quantify the geometric closeness of their corresponding vectors.

This technique forms the basis of **vector search** to find the most similar data to a given query by comparing their vector representations.

The following example uses the `text-embedding-3-large` model from OpenAI to encode the semantic meaning of the first two text chunks.

In [21]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

print(f"Generated vectors of length: {len(vector_1)}\n")
print(vector_1[:5])
print(vector_2[:5])

Generated vectors of length: 3072

[0.00932583399116993, -0.01603718101978302, 0.0003375610976945609, 0.006354826968163252, 0.020478054881095886]
[0.01703420653939247, -0.018381766974925995, -0.006770508363842964, 0.030064981430768967, 0.020579729229211807]


Note that each value in an embedding vector represents a coordinate in a high-dimensional space encoding latent semantic features of the entire input text. In the example above, the entire texts in the first and second chunks are projected as a whole and encoded as single vectors in the same vector space. However, these individual values do not have explicit, human-interpretable meanings on their own.
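
The similarity metrics mentioned above are straightforward to compute by hand. The sketch below implements cosine similarity and Euclidean distance with the standard library only; the three-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors: v1 and v2 point in similar directions, v3 does not.
v1 = [0.9, 0.1, 0.0]
v2 = [0.8, 0.2, 0.1]
v3 = [0.0, 0.2, 0.9]

print(f"cosine(v1, v2) = {cosine_similarity(v1, v2):.3f}")
print(f"cosine(v1, v3) = {cosine_similarity(v1, v3):.3f}")
print(f"distance(v1, v2) = {euclidean_distance(v1, v2):.3f}")
print(f"distance(v1, v3) = {euclidean_distance(v1, v3):.3f}")
```

Note the opposite conventions: a higher cosine similarity means closer vectors, whereas a lower Euclidean distance means closer vectors.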

### Vector Stores
**Vector Stores** (also called **Vector Databases** or **Vector Search Engines**) are specialised data storage systems designed to store, index, and retrieve vector embeddings based on the similarity of the data. Unlike traditional databases, which handle structured data in tables and rows, vector stores are optimised to manage unstructured or semi-structured data that has been converted into high-dimensional vectors.

LangChain offers integrations with a variety of vector store technologies, some of which are hosted by providers and require specific credentials to use.

The following is the basic in-memory vector store implementation:

In [22]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embedding=embeddings)

Alternatively, a Chroma vector database can be used through LangChain's Chroma integration:

In [23]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="lc_fundamentals",  # Name of the collection
    embedding_function=embeddings,  # Embedding function to use
    persist_directory="./chroma_lc_db",  # Directory to save
)

This implementation defines the local directory path where the Chroma vector data and index will be persisted on disk. It enables the data to be saved in a local database and reloaded across sessions.

After instantiating the vector store, we can index the documents. The `.add_documents()` method returns a list of IDs, which are unique identifiers (UUIDs) automatically generated for each document or embedding entry stored in the vector database. They serve as references to the specific records for internal management and retrieval purposes.

In [24]:
ids = vector_store.add_documents(documents=all_splits)
print(ids[:5])

['e73711ee-4e1a-4952-95b4-e6ee5783ec14', 'fa2727d0-3ed3-4477-baa5-9bbfacfc2f2f', '531cc7ae-158c-4663-a87f-3cfc2c7e6489', '99565c7c-319e-4240-9b50-588b410b86a4', 'fad2c9f6-01de-4d0b-8b3a-af9b49700890']


The `.similarity_search()` method takes a string query as input and returns the documents most similar to that query. The query can also be processed asynchronously:

In [25]:
results = vector_store.similarity_search(
    query="How many distribution centres does Nike have in the US?"
)

# results = await vector_store.asimilarity_search(
#     query="How many distribution centres does Nike have in the US?"
# )

print(results[0])

page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our
wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,' metadata={'page_label': '27', 'page': 26, 'title': '000032018

The `.similarity_search_with_score()` method performs the same search as `.similarity_search()` but also returns the similarity score for each result. The specific similarity metric used can vary depending on the vector store provider and implementation.

In [26]:
results = vector_store.similarity_search_with_score(
    query="What was Nike's revenue in 2023?"
)
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.6236852407455444

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad
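
At its core, a similarity search scores every stored vector against the query vector and returns the best matches. The sketch below shows a brute-force version of this idea, assuming cosine similarity as the metric. Real vector stores typically use indexes to avoid scanning everything, and the texts and vectors here are made-up toy values; `top_k` is a hypothetical helper.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(
    query_vec: list[float], store: list[tuple[str, list[float]]], k: int
) -> list[tuple[str, float]]:
    """Return the k stored texts with the highest similarity to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy store of (text, vector) pairs; the vectors are illustrative only.
store = [
    ("distribution centres", [0.9, 0.1, 0.0]),
    ("revenue figures", [0.1, 0.9, 0.1]),
    ("company history", [0.0, 0.2, 0.9]),
]

for text, score in top_k([0.8, 0.2, 0.1], store, k=2):
    print(f"{score:.3f}  {text}")
```

Here a higher score means a better match. Stores that report a distance instead rank results in the opposite direction, so it is worth checking which convention a given store uses before interpreting scores.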

### Retrievers
Retrievers in LangChain are components responsible for fetching relevant information based on a query. They serve as an interface between the user's query and a data storage or search system (e.g., vector stores or external APIs). Retrievers are runnable components, meaning they support standard runnable operations such as `.invoke()` and `.batch()`.

By wrapping the `.similarity_search()` method with a function and adding the `@chain` decorator, the non-runnable similarity search method can be effectively transformed into a runnable retriever.
Finally, the `.batch()` method allows querying multiple questions at once, making retrieval efficient for bulk queries or parallel processing.

In [27]:
from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> list[Document]:
    return vector_store.similarity_search(query=query, k=1)


retriever.batch(
    inputs=[
        "How many distribution centres does Nike have in the US?",
        "When was Nike incorporated?",
    ]
)

[[Document(id='ca49aae5-2196-4bf5-8cf2-65006bc5ca05', metadata={'keywords': '0000320187-23-000039; ; 10-K', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'moddate': '2023-07-20T16:22:08-04:00', 'start_index': 804, 'page_label': '27', 'source': '_datasets/nke-10k-2023.pdf', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'total_pages': 107, 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'page': 26, 'title': '0000320187-23-000039', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31'}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and thre
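
The idea behind batching can be sketched with a thread pool that runs the same function over several inputs and returns the results in input order. This is a simplified illustration, not LangChain's actual `.batch()` code; `toy_retriever` is a hypothetical keyword matcher over a made-up corpus and stands in for a real vector search.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Made-up corpus standing in for indexed documents.
CORPUS = [
    "notes on distribution centers",
    "notes on incorporation",
]


def toy_retriever(query: str) -> list[str]:
    """Return corpus entries that share at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().split())]


def batch(
    func: Callable[[str], list[str]], inputs: list[str], max_workers: int = 4
) -> list[list[str]]:
    """Apply func to every input concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))  # map yields results in input order


queries = ["distribution centers", "incorporation date"]
print(batch(toy_retriever, queries))
```

Threads help most when each call spends its time waiting on network or disk I/O, which is the usual case for calls to a remote embedding model or vector database.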

Alternatively, we can use the `.as_retriever()` method implemented in vector stores to convert them into a `VectorStoreRetriever`. It accepts keyword arguments to customise the retrieval, such as:

- `search_type`: Defines the type of search, e.g., `similarity` (default), `mmr` for maximum marginal relevance, or `similarity_score_threshold`.
- `search_kwargs`: Additional parameters passed to the underlying search method, e.g., number of documents to return (`k`), score thresholds, filters, etc.

The returned retriever can then be used with standard retriever methods such as `.invoke()` and `.batch()` to fetch relevant documents.

In [28]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    inputs=[
        "How many distribution centres does Nike have in the US?",
        "When was Nike incorporated?",
    ]
)

[[Document(id='ca49aae5-2196-4bf5-8cf2-65006bc5ca05', metadata={'title': '0000320187-23-000039', 'page': 26, 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'total_pages': 107, 'source': '_datasets/nke-10k-2023.pdf', 'creationdate': '2023-07-20T16:22:00-04:00', 'creator': 'EDGAR Filing HTML Converter', 'moddate': '2023-07-20T16:22:08-04:00', 'start_index': 804, 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'page_label': '27', 'keywords': '0000320187-23-000039; ; 10-K'}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and thre
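
Maximum marginal relevance (MMR), mentioned above as a `search_type`, balances relevance to the query against redundancy among the selected results. The following is a minimal stdlib-only sketch of the selection loop on made-up two-dimensional vectors; it is not the code used by any particular vector store, and the function `mmr` and its parameter name `lambda_mult` are choices made for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(
    query_vec: list[float],
    doc_vecs: list[list[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> list[int]:
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, doc_vecs[i])
        # Similarity to the closest already-selected document (0 if none yet).
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 covers a different direction.
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
query = [1.0, 0.2]

print("relevance only:", mmr(query, docs, k=2, lambda_mult=1.0))
print("balanced MMR:  ", mmr(query, docs, k=2, lambda_mult=0.5))
```

With `lambda_mult=1.0` the loop reduces to a plain similarity ranking and returns the two near-duplicates. With `lambda_mult=0.5` the second pick switches to the less similar but more diverse document.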

## 7. References

1. aiwithbrandon. (2024). *LangChain master class for beginners 2024 [+20 examples, LangChain V0.2]* [Video]. YouTube.<br>
https://www.youtube.com/watch?v=yF9kGESAi3M

1. Rabbitmetrics. (2024). *Learn LangChain in 7 Easy Steps - Full Interactive Beginner Tutorial* [Video]. YouTube.<br>
https://www.youtube.com/watch?v=8BV9TW490nQ

1. GeeksforGeeks. (2025). *Python | os.environ object*<br>
https://www.geeksforgeeks.org/python/python-os-environ-object

1. Briggs, J. (2025). *LangChain Mastery in 2025 | Full 5 Hour Course [LangChain v0.3]* [Video]. YouTube.<br>
https://www.youtube.com/watch?v=Cyv-dgv80kE

1. LangChain. (n.d.). *Build a semantic search engine* [Tutorial].<br>
https://python.langchain.com/docs/tutorials/retrievers/

1. LangChain. (n.d.). *Build a simple LLM application with chat models and prompt templates* [Tutorial].<br>
https://python.langchain.com/docs/tutorials/llm_chain/