# Searching Over Many PDFs with ZeroEntropy


In this cookbook, you will learn how to use ZeroEntropy to search over many complex PDF documents with natural language queries. \
More specifically, you will upload PDF documents to ZeroEntropy, then retrieve the most relevant documents and snippets using complex queries.


## Prerequisites

- A ZeroEntropy API Key

That's it! \
You can create your API Key here: https://dashboard.zeroentropy.dev

### Setting up your ZeroEntropy Client

First, let's install ZeroEntropy and initialize a client.

In [None]:
!pip install zeroentropy requests

In [2]:
from zeroentropy import ZeroEntropy

zclient = ZeroEntropy(api_key="YOUR_API_KEY")
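
Hardcoding the key is fine for a quick test, but in a shared notebook you may prefer to read it from the environment. Here is a minimal sketch of that idea. The variable name `ZEROENTROPY_API_KEY` and the helper `load_api_key` are assumptions made for illustration, so use whatever name you exported your key under.

```python
import os


def load_api_key(env_var="ZEROENTROPY_API_KEY"):
    """Return the API key stored in the given environment variable.

    The default variable name is an assumption for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key


# Usage (uncomment once the variable is set):
# zclient = ZeroEntropy(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error on the first request.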

That's it! Now let's prepare the data.

### Preparing the data

For this example, we will use a few scientific papers about entropy (because why not!). \
We will create a collection and upload all those PDFs to that collection. \
First, we're going to write a function that fetches arXiv papers matching specific keywords.

In [3]:
import requests
from xml.etree import ElementTree as ET


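def get_arxiv_papers(query, max_results=10):
    # Pass the query via params so requests URL-encodes spaces and colons
    search_url = "http://export.arxiv.org/api/query"
    params = {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    response = requests.get(search_url, params=params, timeout=30)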
    
    if response.status_code == 200:
        papers = []
        root = ET.fromstring(response.text)
        
        # Iterate over each entry in the XML response
        for entry in root.findall("{http://www.w3.org/2005/Atom}entry"):
            for link in entry.findall("{http://www.w3.org/2005/Atom}link"):
                if link.attrib.get('title') == 'pdf':  # Find the PDF link specifically
                    pdf_url = link.attrib['href']
                    papers.append(pdf_url)
                    
        return papers
    else:
        print(f"Error fetching papers: HTTP {response.status_code}")
        return []
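
To see what the parsing loop does without calling the API, here is a small offline sketch that runs the same link extraction on a hand-written Atom snippet. The sample feed and the helper name `extract_pdf_links` are made up for illustration. Real arXiv responses contain many more fields.

```python
from xml.etree import ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hand-written sample feed (not a real arXiv response)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example paper</title>
    <link href="http://example.org/abs/0000.00001" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://example.org/pdf/0000.00001" rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def extract_pdf_links(feed_xml):
    """Collect the href of every <link title="pdf"> inside each Atom <entry>."""
    root = ET.fromstring(feed_xml)
    return [
        link.attrib["href"]
        for entry in root.findall(f"{ATOM}entry")
        for link in entry.findall(f"{ATOM}link")
        if link.attrib.get("title") == "pdf"
    ]


print(extract_pdf_links(SAMPLE_FEED))
```

Each entry carries several `link` elements, so filtering on `title == "pdf"` is what separates the PDF URL from the abstract page.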

Now we can use this function to find up to 50 (very long...) arXiv papers matching "zero entropy"! If you are interested, you can take a look at the papers using the printed links.

In [None]:
# Get PDFs related to "Zero Entropy"
pdf_list = get_arxiv_papers("zero entropy", max_results=50)
print(pdf_list[:5])

## Uploading the data to ZeroEntropy

Now that we have the list of PDFs we want to upload, let's add them all to a new collection as efficiently as possible.

#### Create a collection

In [None]:
collection = zclient.collections.add(collection_name="arxiv_zero_entropy_papers")
print(collection.message)

If you need to start over, uncomment and run the line below to delete the collection, then rerun the cell above to recreate it.

In [33]:
#delete_collection = zclient.collections.delete(collection_name="arxiv_zero_entropy_papers")

#### Uploading a document to the new collection

Now, we're going to define a function that adds each PDF to the newly created collection. The PDF bytes need to be base64-encoded before being uploaded.

In [41]:
import base64

def process_pdf(url):
    try:
        # Download the PDF
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        # Convert to base64
        base64_content = base64.b64encode(response.content).decode('utf-8')

        # Upload to ZeroEntropy
        zclient.documents.add(
            collection_name="arxiv_zero_entropy_papers",
            path=url,
            content={
                "type": "auto",
                "base64_data": base64_content,
            }
        )
        # Return a message on success so the caller has something to print
        return f"Uploaded {url}"

    except Exception as e:
        return f"Error processing {url}: {e}"
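
Base64 is needed because raw PDF bytes cannot be placed directly in a text payload. The short sketch below shows the round trip on a made-up byte string standing in for `response.content`. It only illustrates the encoding step and does not call ZeroEntropy.

```python
import base64

# Stand-in for response.content; a real PDF would be much larger
pdf_bytes = b"%PDF-1.4\n\x00\xff binary payload"

# b64encode returns bytes; decoding them gives a plain ASCII string
encoded = base64.b64encode(pdf_bytes).decode("utf-8")

# Decoding recovers the original bytes exactly
assert base64.b64decode(encoded) == pdf_bytes

# Base64 emits 4 characters for every 3 input bytes (with padding)
print(len(pdf_bytes), len(encoded))
```

Since the encoded string is about a third larger than the original file, very large PDFs produce correspondingly larger upload requests.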

Now let's parallelize the upload of all those papers to the collection!

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_url = {executor.submit(process_pdf, url): url for url in pdf_list}

    for future in as_completed(future_to_url):
        print(future.result())
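
The loop above prints results in completion order, which can make it hard to tell which URLs failed. One option is to look up each finished future in `future_to_url` and collect failures for a retry. The sketch below assumes a worker that raises on failure (unlike `process_pdf`, which returns an error string) and uses a hypothetical stand-in, `fake_upload`, so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_upload(url):
    # Hypothetical stand-in for an upload function: fails for URLs containing "bad"
    if "bad" in url:
        raise ValueError("download failed")
    return f"Uploaded {url}"


def run_all(urls, worker, max_workers=4):
    """Run worker(url) in a thread pool and split URLs into successes and failures."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(worker, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(url)
            except Exception:
                failed.append(url)
    return succeeded, failed


ok, failed = run_all(["a.pdf", "bad.pdf", "c.pdf"], fake_upload)
print(sorted(ok), failed)
```

The `failed` list can then be passed back to `run_all` for a second attempt.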

DONE! Now let's make sure all documents are indexed properly.

In [None]:
response = zclient.documents.get_info_list(collection_name="arxiv_zero_entropy_papers")
print(response.documents)
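
With 50 documents, the printed list is long, so a per-status count can be easier to read. The sketch below assumes each document object exposes a status attribute named `index_status`. Check the printed output above and adjust the field name if yours differs. The sample objects and their status values are made up so the code runs without an API call.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_statuses(documents, field="index_status"):
    # `field` is an assumed attribute name; documents without it count as "unknown"
    return Counter(getattr(doc, field, "unknown") for doc in documents)


# Stand-in objects with made-up status values
sample_docs = [
    SimpleNamespace(path="a.pdf", index_status="indexed"),
    SimpleNamespace(path="b.pdf", index_status="indexing"),
    SimpleNamespace(path="c.pdf", index_status="indexed"),
]
print(summarize_statuses(sample_docs))
```

In the notebook you would call `summarize_statuses(response.documents)` and rerun the cell until no documents are still being processed.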

Everything seems to look pretty good! Now, let's start querying those documents.

## Sending queries to ZeroEntropy

We are going to play with two endpoints: top documents and top snippets. \
Top documents returns the k documents most relevant to a given query, and top snippets returns the most relevant short passages within those documents. \
Let's get started!

### Top Documents Queries

In [None]:
response = zclient.queries.top_documents(
    collection_name="arxiv_zero_entropy_papers",
    query="Can we use Zero Entropy to measure the complexity of a system?",
    k=3,
)

print(response.results)

### Top Snippets Queries

In [None]:
response = zclient.queries.top_snippets(
    collection_name="arxiv_zero_entropy_papers",
    query="What are the different types of entropy measures used to analyze dynamical systems, and how do they compare in terms of effectiveness?",
    k=3,
)

print(response.results)
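
Printing `response.results` directly gives the raw objects. For easier reading, you could format them as a ranked list. The sketch below assumes each result has `path` and `score` attributes, and that snippet results also have `content`. These field names are assumptions, so compare them with the printed output above. It uses `getattr` with defaults so a missing field does not raise an error, and the sample objects are made up.

```python
from types import SimpleNamespace


def format_results(results, preview_chars=80):
    """Render results as a ranked list, with a one-line preview when content exists."""
    lines = []
    for rank, item in enumerate(results, start=1):
        path = getattr(item, "path", "?")
        score = getattr(item, "score", None)
        content = getattr(item, "content", None)
        score_text = f"{score:.3f}" if isinstance(score, (int, float)) else "n/a"
        line = f"{rank}. {path} (score: {score_text})"
        if content:
            # Collapse whitespace so the preview stays on one line
            preview = " ".join(content.split())[:preview_chars]
            line += f"\n   {preview}"
        lines.append(line)
    return "\n".join(lines)


# Stand-in results so the sketch runs without an API call
sample = [
    SimpleNamespace(path="a.pdf", score=0.91234),
    SimpleNamespace(path="b.pdf", score=0.5, content="Entropy   measures\ndisorder"),
]
print(format_results(sample))
```

In the notebook, `print(format_results(response.results))` would work for either query type under those assumptions.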

That's it! You can play around with the two examples above, changing the query and the value of k to retrieve more or fewer results. \
You'll learn a lot about the concept of entropy in information systems!