# A LangChain 1.0 app that can talk with a PDF Document
* As you will see, this is not an agent yet. We will make it an agent in the next exercise.

## What is Semantic Search?

**Semantic search** is a search technique that understands the **meaning** (semantics) of your query rather than just matching exact keywords. Instead of looking for exact word matches, it finds content that is conceptually similar to your question.

**Why is this code an example of semantic search?**
- Traditional keyword search would look for exact words like "Gartner" or "percentage"
- Semantic search converts both your question AND the document chunks into numerical vectors (embeddings)
- It then finds the chunks whose vectors lie closest to your question's vector in the embedding space (closeness is typically measured with cosine similarity)
- This means it can find relevant answers even if the exact wording is different
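
To make "closest vectors" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare embeddings. The three-dimensional vectors are invented for illustration (real embedding models produce vectors with hundreds or thousands of dimensions), and the names `query`, `chunk_about_adoption`, and `chunk_about_cooking` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (assumed values, not produced by a real model).
query = [0.9, 0.1, 0.0]
chunk_about_adoption = [0.8, 0.2, 0.1]
chunk_about_cooking = [0.0, 0.1, 0.9]

score_adoption = cosine_similarity(query, chunk_about_adoption)
score_cooking = cosine_similarity(query, chunk_about_cooking)

print(f"adoption chunk: {score_adoption:.3f}")
print(f"cooking chunk:  {score_cooking:.3f}")
```

The chunk whose vector points in nearly the same direction as the query scores close to 1, while the unrelated chunk scores close to 0, even though no keywords were compared.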

```python
from dotenv import load_dotenv

load_dotenv()
```

```text
True
```

```python
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document
loader = PyPDFLoader("gen-ai-in-2026.pdf")

data = loader.load()

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the pages into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

all_splits = text_splitter.split_documents(data)

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_core.vectorstores import InMemoryVectorStore

# Embed every chunk and keep the vectors in memory
vector_store = InMemoryVectorStore(embeddings)

ids = vector_store.add_documents(documents=all_splits)

# Retrieve the chunks most similar to the question
results = vector_store.similarity_search(
    "According to Gartner, what percentage of enterprises will use Generative AI APIs or deploy generative AI-enabled applications in production environments in 2026?"
)

print(results[0])
```

```text
page_content='Generative AI in 2026:
Transforming Business and
 Professional Value
As we progress through 2026, Generative AI has reached a critical inflection point. After years
of experimentation and pilot programs, businesses are now demanding concrete returns on
their AI investments. This year marks the transition from proof-of-concept to production-scale
deployment, with organizations shifting their focus from "what's possible" to "what's
profitable."
The 2026 AI Landscape: From Hype to Reality
The generative AI market has experienced explosive growth, with global investment tripling
from 2024 to 2025, reaching approximately $37 billion. Gartner research indicates that by
2026, more than 80% of enterprises will use generative AI APIs or deploy generative
AI-enabled applications in production environments—a staggering increase from just 5% in
2023.
However, this growth comes with a sobering reality check. MIT research revealed a 95%' metadata={'producer': 'ReportLab PDF Library -
```

## Explaining the previous code in simple terms

```python
from langchain_community.document_loaders import PyPDFLoader
```
**Imports the PDF loader** - This is a tool that can read PDF files and convert them into a format LangChain can work with.

```python
loader = PyPDFLoader("gen-ai-in-2026.pdf")
```
**Creates a loader object** for the specific PDF file. Think of this as pointing to your document.

```python
data = loader.load()
```
**Loads the PDF** and converts it into Document objects. Each page becomes a document with `page_content` (the text) and `metadata` (information about the page).

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
```
**Imports the text splitter** - Whole pages are usually too long to embed and retrieve precisely, so we need to break them into smaller pieces.

```python
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
```
**Configures how to split the text**:
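- `chunk_size=1000`: Each chunk will be at most 1000 characters long
- `chunk_overlap=200`: Consecutive chunks share up to 200 characters, so a sentence that straddles a boundary still appears intact in at least one chunk
- `add_start_index=True`: Records the character offset at which each chunk starts in its source document, stored in the chunk's metadata as `start_index`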
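
The effect of these three parameters can be illustrated with a deliberately simplified fixed-width splitter. This is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper, and the tiny sizes are chosen only so the output is easy to read.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks; each dict records its start offset."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk after the first begins with the last three characters of the previous one, and `start_index` tells you where the chunk sits in the original text.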

```python
all_splits = text_splitter.split_documents(data)
```
**Actually performs the splitting** - Takes the full PDF and breaks it into smaller, manageable chunks.

```python
from langchain_openai import OpenAIEmbeddings
```
**Imports the embedding model** - This will convert text into numerical vectors.

```python
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
```
**Creates an embeddings object** using OpenAI's text-embedding-3-large model. This model is specialized in understanding the meaning of text and converting it to numbers.

```python
from langchain_core.vectorstores import InMemoryVectorStore
```
**Imports the vector store** - A vector store holds embedding vectors alongside their source text and lets you search them by similarity.

```python
vector_store = InMemoryVectorStore(embeddings)
```
**Creates a vector store** that will use our embeddings model. "InMemory" means it keeps everything in RAM: fast, but the contents are lost when the program exits.

```python
ids = vector_store.add_documents(documents=all_splits)
```
**Adds all document chunks to the database** - Each chunk gets converted to a vector and stored. The `ids` are unique identifiers for each stored chunk.

```python
results = vector_store.similarity_search(
    "According to Gartner, what percentage of enterprises will use Generative AI APIs or deploy generative AI-enabled applications in production environments in 2026?"
)
```
**Performs the semantic search**:
1. Converts your question into a vector
2. Compares it to all stored chunk vectors
3. Returns the most similar chunks (by default, 4 chunks)
4. "Similarity" here is cosine similarity: the smaller the angle between two vectors, the higher the score
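
The steps above can be sketched with a toy store. Everything here is an illustrative assumption: `ToyVectorStore` and `toy_embed` are hypothetical names, and the word-count "embedding" only captures shared vocabulary, whereas a real embedding model captures meaning. The sketch shows only the mechanics of ranking stored vectors and returning the top `k`.

```python
import math
import uuid
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (id, text, vector) entries in a plain Python list."""

    def __init__(self):
        self.entries = []

    def add_texts(self, texts):
        ids = []
        for text in texts:
            entry_id = str(uuid.uuid4())  # unique identifier per stored chunk
            self.entries.append((entry_id, text, toy_embed(text)))
            ids.append(entry_id)
        return ids

    def similarity_search(self, query, k=4):
        query_vector = toy_embed(query)  # step 1: embed the question
        scored = [
            (cosine_similarity(query_vector, vector), text)  # step 2: compare
            for _, text, vector in self.entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]  # step 3: keep the top k


store = ToyVectorStore()
ids = store.add_texts([
    "enterprises will deploy generative ai in production",
    "the recipe needs flour and sugar",
    "generative ai investment tripled",
])
top = store.similarity_search("which enterprises deploy generative ai", k=2)
print(top)
```

Because every stored vector is compared with the query, this brute-force approach is fine for one PDF but would become slow for very large collections.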

```python
print(results[0])
```
**Prints the first (most similar) result** - This shows the chunk of text that is most semantically similar to your question.


## Why are the results not very good?

When you execute `print(results[0])`, you get a **raw document chunk**, not an actual answer. The problems are:

1. **No Answer Extraction**: You're just seeing the retrieved text, but there's no AI reading it and answering your question
2. **Raw Format**: The output includes metadata and the full chunk, making it hard to read
3. **No Context Synthesis**: If the answer requires information from multiple chunks, you only see one
4. **No Reasoning**: There's no LLM to interpret the retrieved content and formulate a proper response


## How we will improve the results in the next exercise with RAG

This code demonstrates **semantic search** (finding relevant information by meaning), but to get good answers you need **RAG** (Retrieval-Augmented Generation), which combines:
1. **Retrieval**: Finding relevant chunks (what we have)
2. **Generation**: Using an LLM to read those chunks and generate a coherent answer (what we're missing)
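
As a rough sketch of how the two parts fit together, the retrieved chunks can be pasted into a prompt that is then sent to an LLM. The helper `build_rag_prompt` and the prompt wording are hypothetical, and the LLM call itself is left out here; the chunk texts are shortened from the PDF output shown earlier.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# In the real app these strings would come from vector_store.similarity_search.
retrieved_chunks = [
    "Gartner research indicates that by 2026, more than 80% of enterprises "
    "will use generative AI APIs or deploy generative AI-enabled applications "
    "in production environments.",
    "Global investment tripled from 2024 to 2025, reaching approximately "
    "$37 billion.",
]
prompt = build_rag_prompt(
    "What percentage of enterprises will use generative AI by 2026?",
    retrieved_chunks,
)
print(prompt)
```

Passing several chunks at once is what lets the model synthesize an answer from more than one part of the document.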

## How to run this code from Visual Studio Code
* Open Terminal.
* Make sure you are in the project folder.
* Make sure the Poetry environment is activated.
* Enter and run the following command:
    * `python 020-pdf-agent.py`