# Build a semantic search engine

This tutorial introduces LangChain's **[document loader](https://python.langchain.com/docs/concepts/document_loaders/)**, **[embedding](https://python.langchain.com/docs/concepts/embedding_models/)**, and **[vector store](https://python.langchain.com/docs/concepts/vectorstores/)** abstractions. These abstractions support retrieving data from vector databases and other sources for use in LLM workflows. They are important for applications that fetch data for a model to reason over at inference time, as in retrieval-augmented generation, or **[RAG](https://python.langchain.com/docs/concepts/rag/)** (see our **[RAG tutorial](https://python.langchain.com/docs/tutorials/rag/)**).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

In [None]:
# %pip install langchain-community pypdf

# Loading documents

Let's load a PDF into a sequence of `Document` objects. The LangChain repo includes a sample PDF: a 10-K filing for Nike from 2023. We can consult the LangChain documentation for **[available PDF document loaders](https://python.langchain.com/docs/integrations/document_loaders/#pdfs)**. Let's select `PyPDFLoader`, which is fairly lightweight.

In [12]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "nke-10k-2023.pdf"
# file_path = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

107


`PyPDFLoader` loads one `Document` object per PDF page. For each, we can easily access:

1. The string content of the page;
2. Metadata, including the file name (`source`) and page number (`page`), alongside PDF properties such as the title and author.

In [13]:
# Preview the first 50 characters of the first two pages
page_1 = f"{docs[0].page_content[:50]}\n"
page_2 = f"{docs[1].page_content[:50]}\n"

print(page_1)
print(page_2)
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXC

Table of Contents
As of July 12, 2023, the number 

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}
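
The metadata printed above is what lets us trace a retrieved passage back to its place in the PDF. As a minimal sketch, the helper below builds a human-readable source label from the `source`, `page`, and `page_label` keys. The function name `source_label` is hypothetical (it is not part of LangChain), and a `SimpleNamespace` stands in for a real `Document` so the snippet runs without loading the PDF. It assumes `page` is zero-based, as it is for the first page in the output above.

```python
from types import SimpleNamespace


def source_label(doc) -> str:
    """Build a short citation string from a document's metadata.

    Hypothetical helper: works with any object that has a `.metadata` dict.
    """
    meta = doc.metadata
    source = meta.get("source", "unknown source")
    # Prefer the printed page label; otherwise derive it from the
    # zero-based `page` index (assumption based on the output above).
    label = meta.get("page_label")
    if label is None and "page" in meta:
        label = str(meta["page"] + 1)
    return f"{source}, page {label}" if label is not None else source


# Stand-in for docs[0]; only the keys used by the helper are included.
example = SimpleNamespace(
    page_content="Table of Contents\nUNITED STATES",
    metadata={"source": "nke-10k-2023.pdf", "page": 0, "page_label": "1"},
)

print(source_label(example))  # nke-10k-2023.pdf, page 1
```

With the real loader output, the same call would be `source_label(docs[0])`, since each `Document` exposes a `metadata` dictionary.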


# Splitting

For both information retrieval and downstream question answering, a page may be too coarse a representation. Our end goal is to retrieve `Document` objects that answer an input query, and splitting the PDF further helps ensure that the meaning of each relevant passage is not "washed out" by surrounding text.
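
To make the idea of splitting concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not LangChain's text splitter: the function name `split_with_overlap` and its parameters are assumptions chosen for illustration, and a real splitter would typically try to break on natural boundaries such as paragraphs or sentences. The overlap keeps a little shared context between neighboring chunks, so a statement that straddles a boundary is less likely to be cut off from what surrounds it.

```python
def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[dict]:
    """Split text into fixed-size character chunks that overlap.

    Illustrative only: returns dicts holding each chunk's text and the
    character offset where it starts in the original string.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"start_index": start, "text": text[start : start + chunk_size]}
        )
        # Stop once this chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print([c["text"] for c in demo_chunks])  # ['abcd', 'defg', 'ghij']
```

Applied to a loaded page, this might look like `split_with_overlap(docs[0].page_content)`, giving several smaller passages per page instead of one large one.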