# Build a semantic search engine

This tutorial will familiarize you with LangChain's [document loader](https://python.langchain.com/docs/concepts/document_loaders/), [embedding](https://python.langchain.com/docs/concepts/embedding_models/), and [vector store](https://python.langchain.com/docs/concepts/vectorstores/) abstractions. These abstractions support retrieving data from vector databases and other sources so that it can be integrated into LLM workflows. They matter for applications that fetch data to be reasoned over as part of model inference, as in retrieval-augmented generation, or [RAG](https://python.langchain.com/docs/concepts/rag/) (see our [RAG tutorial](https://python.langchain.com/docs/tutorials/rag/)).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.
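
To make "similar to an input query" concrete, here is a minimal sketch of ranking passages by cosine similarity between vectors. The three-dimensional vectors and passage strings are made up for illustration (real embeddings come from an embedding model and have far more dimensions), and `cosine_similarity` and `rank_passages` are hypothetical helpers, not LangChain functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query, candidates):
    """Return passage texts ordered from most to least similar to the query vector."""
    return sorted(
        candidates,
        key=lambda text: cosine_similarity(query, candidates[text]),
        reverse=True,
    )


# Toy data (assumption): hand-written vectors standing in for model embeddings.
passages = {
    "passage about finances": [0.9, 0.1, 0.0],
    "passage about facilities": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

ranked = rank_passages(query_vector, passages)
print(ranked[0])  # passage about finances
```

A vector store performs the same kind of ranking, typically with an index that avoids comparing the query against every stored vector.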

In [None]:
# %pip install langchain-community pypdf

In [9]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "nke-10k-2023.pdf"
# The outputs shown below come from a one-page test PDF loaded by URL instead:
# file_path = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

1


In [11]:
page_1 = f"{docs[0].page_content[:50]}\n"
print(page_1)

# A second page exists only when the PDF has more than one page
if len(docs) > 1:
    page_2 = f"{docs[1].page_content[:50]}\n"
    print(page_2)
print(docs[0].metadata)

Dummy PDF file

{'producer': 'OpenOffice.org 2.1', 'creator': 'Writer', 'creationdate': '2007-02-23T17:56:37+02:00', 'author': 'Evangelos Vlachogiannis', 'source': 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}
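
The cell above relies on two attributes of each loaded document: `page_content` (a string) and `metadata` (a dict). As a rough sketch of how these can be summarized for any number of pages, the helper below builds one preview line per document. `PageDoc` is a hypothetical stand-in dataclass that mirrors only those two attributes so the example runs without LangChain installed, and `preview` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in with the two attributes used in this tutorial."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def preview(docs, width=50):
    """Return one line per document: its page number and first `width` characters."""
    lines = []
    for doc in docs:
        page = doc.metadata.get("page", "?")  # fall back to "?" if no page key
        snippet = doc.page_content[:width].replace("\n", " ")
        lines.append(f"page {page}: {snippet}")
    return lines


# Sample values taken from the output shown above.
sample = [PageDoc("Dummy PDF file", {"page": 0, "total_pages": 1})]
for line in preview(sample):
    print(line)  # page 0: Dummy PDF file
```

Because `preview` only reads `page_content` and `metadata`, it should also accept the `docs` list returned by `loader.load()`.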
