<a href="https://colab.research.google.com/github/sampathk-hps/langchain-fundamentals-colab/blob/main/LangChain_2_Build_a_semantic_search_engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook introduces LangChain's document loader, embedding, and vector store abstractions.
These abstractions support retrieving data from (vector) databases and other sources for use in LLM workflows. They matter for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation (RAG).

## Concepts

* Documents and document loaders
* Text splitters
* Embeddings
* Vector stores and retrievers


## Installation

In [None]:
pip install langchain-community pypdf

In [None]:
import os
import getpass

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

os.environ["LANGSMITH_TRACING"] = "true"
if "LANGSMITH_API_KEY" not in os.environ:
    os.environ["LANGSMITH_API_KEY"] = getpass.getpass(
        prompt="Enter your LangSmith API key (optional): "
    )
if "LANGSMITH_PROJECT" not in os.environ:
    os.environ["LANGSMITH_PROJECT"] = getpass.getpass(
        prompt='Enter your LangSmith Project Name (default = "default"): '
    )
    if not os.environ.get("LANGSMITH_PROJECT"):
        os.environ["LANGSMITH_PROJECT"] = "default"

Enter your LangSmith API key (optional): ··········
Enter your LangSmith Project Name (default = "default"): ··········


## Documents and Document Loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

1. page_content: a string representing the content;
2. metadata: a dict containing arbitrary metadata;
3. id: (optional) a string identifier for the document.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information.

An individual Document object often represents a **chunk** of a larger document.

In [None]:
# We can generate sample documents when desired:


from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

In [None]:
%pip install -qU langchain-community bs4

In [None]:
from langchain_community.document_loaders import BSHTMLLoader

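# Other loaders follow the same pattern, e.g. PyPDFLoader or CSVLoader
# (both importable from langchain_community.document_loaders):
# loader = PyPDFLoader("data/state_of_the_union.pdf")
# documents = loader.load()
# loader = CSVLoader(
#     file_path="/content/sample_data/california_housing_train.csv")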
loader = BSHTMLLoader(
    file_path="/content/sample_data/2023-03-28-TSM-BZ$14109a48.html")
docs = loader.load()
print(len(docs))

1


In [None]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

U.S. advanced semiconductor technology embargo on China and ramp-up of the semiconductor technology base have affected China.  Shares in top Chinese chipmakers shed $7.7 billion in market value on Oct

{'source': '/content/sample_data/2023-03-28-TSM-BZ$14109a48.html', 'title': ''}


## Splitting

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(docs)
len(all_splits)

3
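
To build intuition for what chunk_size, chunk_overlap, and add_start_index control, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (the real splitter prefers to break on separators such as paragraph and line breaks). The function name `sliding_window_split` and the sample text are hypothetical.

```python
def sliding_window_split(text: str, chunk_size: int, chunk_overlap: int):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Returns a list of (start_index, chunk) pairs, mirroring the idea behind
    add_start_index=True.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = sliding_window_split(sample_text, chunk_size=10, chunk_overlap=4)
for start, chunk in sample_chunks:
    print(start, chunk)
```

Each chunk repeats the last four characters of the previous one, which is the role chunk_overlap plays: context that straddles a chunk boundary still appears intact in at least one chunk.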

## Embeddings

Using **HuggingFace** Embedding Model

In [None]:
pip install -qU langchain-huggingface

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [None]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 768

[-0.02542412094771862, 0.03560704365372658, -0.030742721632122993, -0.044747497886419296, 0.04758850485086441, -0.04024335369467735, 0.061291590332984924, 0.031164580956101418, 0.007982350885868073, 0.014209233224391937]


Using **Ollama** Embedding Model

In [None]:
pip install -qU langchain-ollama

In [None]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llama3")

In [None]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

ConnectionError: Failed to connect to Ollama. Please check that Ollama is downloaded, running and accessible. https://ollama.com/download

Using **Google Vertex AI** Embedding Model

In [None]:
pip install -qU langchain-google-vertexai

In [None]:
from langchain_google_vertexai import VertexAIEmbeddings

embeddings = VertexAIEmbeddings(model="text-embedding-005", project="your-gcp-project-id")

In [None]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

## Vector stores

Using InMemoryVectorStore

In [None]:
pip install -qU langchain-core

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [None]:
# Having instantiated our vector store, we can now index the documents.

ids = vector_store.add_documents(documents=all_splits)
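
Conceptually, indexing embeds each text and stores the vector, and a similarity search embeds the query and ranks the stored vectors against it. The following is a minimal plain-Python sketch of that idea, not the InMemoryVectorStore implementation. `toy_embed` is a made-up embedding (letter counts) standing in for a real model, and `ToyInMemoryStore` is a hypothetical class.

```python
import math


def toy_embed(text):
    """Made-up embedding: counts of the letters a-z, scaled to unit length."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


class ToyInMemoryStore:
    """Keeps (id, vector, text) entries in a list and searches by brute force."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []

    def add_texts(self, texts):
        ids = []
        for text in texts:
            entry_id = str(len(self.entries))
            self.entries.append((entry_id, self.embed(text), text))
            ids.append(entry_id)
        return ids

    def similarity_search(self, query, k=1):
        query_vec = self.embed(query)
        # Dot product of unit vectors equals cosine similarity.
        scored = [
            (sum(q * v for q, v in zip(query_vec, vec)), text)
            for _, vec, text in self.entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


toy_store = ToyInMemoryStore(toy_embed)
toy_ids = toy_store.add_texts(["aaaa bbbb", "xxxx yyyy"])
print(toy_ids)
print(toy_store.similarity_search("aabb", k=1))
```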

Using ChromaDB

In [None]:
pip install -qU langchain-chroma

In [None]:
!rm -rf ./chroma_langchain_html_db

In [None]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_html_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_html_db",
)

In [None]:
ids = vector_store.add_documents(documents=all_splits)

Using FAISS - Facebook AI Similarity Search

In [None]:
pip install -qU langchain-community

In [None]:
pip install -qU faiss-cpu

In [None]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores.faiss import FAISS

embedding_dim = len(embeddings.embed_query("hello world"))
index = faiss.IndexFlatL2(embedding_dim)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [None]:
ids = vector_store.add_documents(documents=all_splits)

## Usage

In [None]:
# Return documents based on similarity to a string query:

results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='KraneShares Trust KraneShares CSI China Internet ETF (NYSE: KWEB), and iShares MSCI China ETF (NASDAQ: MCHI) gained between 0.4%- 2.7% YTD.  The Chinese ETFs have significant exposure to Tencent Holding Ltd (OTC: TCEHY), Alibaba Group Holding Limited (NYSE: BABA), Baidu, Inc (NASDAQ: BIDU), JD.Com, Inc (NASDAQ: JD), and more.  Other factors, such as China's crackdown on tech, also impacted major players like Alibaba, JD, and Baidu.  Photo by Tatiana Popova and rawf8 via Shuttterstock  Copyright © Benzinga. All rights reserved. Write to editorial@benzinga.com with any questions about this content. Benzinga does not provide investment advice.' metadata={'source': '/content/sample_data/2023-03-28-TSM-BZ$14109a48.html', 'title': '', 'start_index': 1617}


In [None]:
# Async Query

results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='KraneShares Trust KraneShares CSI China Internet ETF (NYSE: KWEB), and iShares MSCI China ETF (NASDAQ: MCHI) gained between 0.4%- 2.7% YTD.  The Chinese ETFs have significant exposure to Tencent Holding Ltd (OTC: TCEHY), Alibaba Group Holding Limited (NYSE: BABA), Baidu, Inc (NASDAQ: BIDU), JD.Com, Inc (NASDAQ: JD), and more.  Other factors, such as China's crackdown on tech, also impacted major players like Alibaba, JD, and Baidu.  Photo by Tatiana Popova and rawf8 via Shuttterstock  Copyright © Benzinga. All rights reserved. Write to editorial@benzinga.com with any questions about this content. Benzinga does not provide investment advice.' metadata={'source': '/content/sample_data/2023-03-28-TSM-BZ$14109a48.html', 'title': '', 'start_index': 1617}


In [None]:
# Return Scores:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 1.644883632659912

page_content='vying for funding cannot increase their production of advanced chips in China.  Leading U.S. tech ETFs SPDR Select Sector Fund - Technology (NYSE:XLK), VanEck Semiconductor ETF (NASDAQ:SMH), and iShares Semiconductor ETF (NASDAQ:SOXX) have gained between 17.6% - 25.5% year-to-date.  SMH and SOXX with exposure to Nvidia Corp (NASDAQ: NVDA), Taiwan Semiconductor Manufacturing Company Ltd (NYSE: TSM), Advanced Micro Devices, Inc (NASDAQ: AMD), ASML Holding N.V. (NASDAQ: ASML), Texas Instruments Inc (NASDAQ: TXN), Intel Corp (NASDAQ: INTC) and other chipmakers led the gains.  XLK has significant exposure to Microsoft Corp (NASDAQ: MSFT), Apple Inc (NASDAQ: AAPL), followed by chipmakers Nvidia and more.  Contrastingly leading Chinese tech ETF gains trailed U.S. peers. iShares China Large-Cap ETF (NYSE: FXI), KraneShares Trust KraneShares CSI China Internet ETF (NYSE: KWEB), and iShares MSCI China ETF (NASDAQ: MCHI) gained between 0.4%- 2.7% YTD.  The 
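
The note that the score is a distance varying inversely with similarity can be made concrete. This is a minimal sketch, assuming the FAISS IndexFlatL2 index built earlier (which reports squared Euclidean distance) and unit-length embedding vectors. The vectors below are made up, and the helper functions are defined here only for illustration.

```python
import math


def squared_l2_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


query_vec = normalize([1.0, 2.0, 2.0])
near_vec = normalize([1.0, 2.0, 1.5])   # similar direction to the query
far_vec = normalize([-2.0, 1.0, 0.0])   # orthogonal to the query

for name, vec in [("near", near_vec), ("far", far_vec)]:
    cosine = sum(x * y for x, y in zip(query_vec, vec))
    distance = squared_l2_distance(query_vec, vec)
    # For unit vectors the two are linked: distance == 2 - 2 * cosine
    print(f"{name}: cosine={cosine:.3f} distance={distance:.3f}")
```

Under these assumptions a lower score means a closer match, and a score near 2 corresponds to vectors that are roughly orthogonal.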

In [None]:
# Return documents based on similarity to an embedded query:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='vying for funding cannot increase their production of advanced chips in China.  Leading U.S. tech ETFs SPDR Select Sector Fund - Technology (NYSE:XLK), VanEck Semiconductor ETF (NASDAQ:SMH), and iShares Semiconductor ETF (NASDAQ:SOXX) have gained between 17.6% - 25.5% year-to-date.  SMH and SOXX with exposure to Nvidia Corp (NASDAQ: NVDA), Taiwan Semiconductor Manufacturing Company Ltd (NYSE: TSM), Advanced Micro Devices, Inc (NASDAQ: AMD), ASML Holding N.V. (NASDAQ: ASML), Texas Instruments Inc (NASDAQ: TXN), Intel Corp (NASDAQ: INTC) and other chipmakers led the gains.  XLK has significant exposure to Microsoft Corp (NASDAQ: MSFT), Apple Inc (NASDAQ: AAPL), followed by chipmakers Nvidia and more.  Contrastingly leading Chinese tech ETF gains trailed U.S. peers. iShares China Large-Cap ETF (NYSE: FXI), KraneShares Trust KraneShares CSI China Internet ETF (NYSE: KWEB), and iShares MSCI China ETF (NASDAQ: MCHI) gained between 0.4%- 2.7% YTD.  The Chinese ETFs have signific

## Retrievers

In [None]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='73c498e4-1c5d-43e6-a8be-8f62f717a1fa', metadata={'source': '/content/sample_data/2023-03-28-TSM-BZ$14109a48.html', 'title': '', 'start_index': 1617}, page_content="KraneShares\xa0Trust KraneShares CSI China Internet ETF\xa0(NYSE:\xa0KWEB), and iShares MSCI China ETF\xa0(NASDAQ:\xa0MCHI) gained between 0.4%- 2.7% YTD.  The Chinese ETFs have significant exposure to\xa0Tencent Holding Ltd\xa0(OTC:\xa0TCEHY),\xa0Alibaba Group Holding Limited\xa0(NYSE:\xa0BABA),\xa0Baidu, Inc\xa0(NASDAQ:\xa0BIDU),\xa0JD.Com, Inc\xa0(NASDAQ:\xa0JD), and more.  Other factors, such as China's crackdown on tech, also impacted major players like Alibaba, JD, and Baidu.  Photo by Tatiana Popova and rawf8 via Shuttterstock  Copyright © Benzinga. All rights reserved. Write to editorial@benzinga.com with any questions about this content. Benzinga does not provide investment advice.")],
 [Document(id='73c498e4-1c5d-43e6-a8be-8f62f717a1fa', metadata={'source': '/content/sample_data/2023-03-28-TSM-BZ$141

Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers carry search_type and search_kwargs attributes that determine which method of the underlying vector store to call and how to parameterize it. For instance, we can replicate the above with the following:

In [None]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='73c498e4-1c5d-43e6-a8be-8f62f717a1fa', metadata={'source': '/content/sample_data/2023-03-28-TSM-BZ$14109a48.html', 'title': '', 'start_index': 1617}, page_content="KraneShares\xa0Trust KraneShares CSI China Internet ETF\xa0(NYSE:\xa0KWEB), and iShares MSCI China ETF\xa0(NASDAQ:\xa0MCHI) gained between 0.4%- 2.7% YTD.  The Chinese ETFs have significant exposure to\xa0Tencent Holding Ltd\xa0(OTC:\xa0TCEHY),\xa0Alibaba Group Holding Limited\xa0(NYSE:\xa0BABA),\xa0Baidu, Inc\xa0(NASDAQ:\xa0BIDU),\xa0JD.Com, Inc\xa0(NASDAQ:\xa0JD), and more.  Other factors, such as China's crackdown on tech, also impacted major players like Alibaba, JD, and Baidu.  Photo by Tatiana Popova and rawf8 via Shuttterstock  Copyright © Benzinga. All rights reserved. Write to editorial@benzinga.com with any questions about this content. Benzinga does not provide investment advice.")],
 [Document(id='73c498e4-1c5d-43e6-a8be-8f62f717a1fa', metadata={'source': '/content/sample_data/2023-03-28-TSM-BZ$141
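
The relationship between search_type, search_kwargs, and the underlying store methods can be sketched in plain Python. This is an illustrative toy, not LangChain's VectorStoreRetriever implementation: `KeywordStore` and `ToyRetriever` are hypothetical classes, and the store ranks texts by shared words rather than by embeddings.

```python
class KeywordStore:
    """Stand-in store that ranks stored texts by word overlap with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class ToyRetriever:
    """Stores a search_type and search_kwargs, then dispatches to the store."""

    def __init__(self, store, search_type="similarity", search_kwargs=None):
        self.store = store
        self.search_type = search_type
        self.search_kwargs = search_kwargs or {}

    def invoke(self, query):
        # search_type selects the store method; search_kwargs parameterizes it.
        if self.search_type == "similarity":
            return self.store.similarity_search(query, **self.search_kwargs)
        raise ValueError(f"unsupported search_type: {self.search_type}")

    def batch(self, queries):
        return [self.invoke(query) for query in queries]


keyword_store = KeywordStore([
    "dogs are loyal companions",
    "cats enjoy their own space",
])
toy_retriever = ToyRetriever(keyword_store, search_kwargs={"k": 1})
toy_results = toy_retriever.batch(
    ["which pets are loyal", "who likes their own space"]
)
print(toy_results)
```

Keeping the search configuration on the retriever means callers only pass a query string, which is what lets a retriever be invoked or batched like any other runnable.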