# US PTO Patent RAG

This notebook demonstrates a simple RAG (retrieval-augmented generation) pipeline that summarizes the US patents granted on a specific date that match a search phrase.

The stages of the RAG pipeline are:

1. Download and extract the patent data from the US PTO
2. Convert the patents to embeddings, and index these embeddings
3. Query the embeddings index, rerank the results and then summarize the top results

## Installing Requirements

The notebook uses:

1. A self-authored library 'pipedata' to process the patent data
2. A self-authored library 'xml-to-pydantic', to extract the patent data from the patent XML using XPath
3. The Cohere API for embeddings, reranking and summarization
4. The annoy library to do nearest-neighbor search in embeddings space

In practice, we'd pin the library versions using a tool such as Poetry, but pip install will work in a pinch.

Note: !pip install runs pip in a shell, which may resolve to a different Python environment than the one the notebook kernel uses; %pip install installs into the environment of the active kernel.
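
If it is unclear which environment a package will land in, one option is to build the pip command from the kernel's own interpreter path. This is a minimal sketch; pip_install_command is a hypothetical helper, not part of the notebook.

```python
import sys


def pip_install_command(packages):
    """Build a pip command bound to the interpreter running this kernel."""
    # sys.executable is the Python binary of the current process, so
    # "-m pip" targets that interpreter's environment.
    return [sys.executable, "-m", "pip", "install", *packages]


cmd = pip_install_command(["cohere", "annoy"])
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run if the install needs to happen from plain Python rather than from a notebook magic.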

In [1]:
# %pip install pipedata[ops] xml-to-pydantic fsspec pyarrow ijson requests aiohttp

In [2]:
# %pip install cohere

In [3]:
# %pip install annoy

## Global Config

In [4]:
import json
with open(".secrets", "r") as f:
    secrets = json.load(f)
COHERE_API_KEY = secrets["cohere_api_key"]
del secrets

In [5]:
import logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s: %(message)s"
)
logger = logging.getLogger(__name__)

## Downloading and processing the patents

The first step is to get the patents from the US PTO bulk data service, and to extract the patent data.

In [6]:
import datetime
from io import BytesIO
from typing import Iterator

from lxml import etree
from pipedata.core import Stream, ops
from pipedata.ops import zipped_files
from pydantic import computed_field
from xml_to_pydantic import XmlBaseModel, XmlField

In [7]:
def split_into_xmls(files):
    """
    The US patent office file is not a single XML
    file, but many XMLs concatenated together. This just
    splits on the starting line of each XML and yields
    each true XML in turn.
    """
    for file in files:
        xml = []
        for line in file.contents:
            if line.startswith(b"<?xml"):
                if len(xml) > 0:
                    yield b"".join(xml)
                xml = []
            xml.append(line)
        # Emit the last document; skip files that produced no lines.
        if xml:
            yield b"".join(xml)



@ops.filtering
def remove_genetic_sequences(xml) -> bool:
    """
    The patent file has both patents and genetic sequences. Here, we
    filter out the genetic sequences, which have a different structure.
    """
    tree = etree.parse(BytesIO(xml))
    if "sequence-cwu" in tree.docinfo.doctype:
        return False
    return True


class Patent(XmlBaseModel):
    patent_id: str = XmlField(xpath="normalize-space(string(/us-patent-grant/us-bibliographic-data-grant/publication-reference/document-id))")
    patent_type: str = XmlField(xpath="/us-patent-grant/us-bibliographic-data-grant/application-reference/@appl-type")
    title: str = XmlField(xpath="string(/us-patent-grant/us-bibliographic-data-grant/invention-title)")
    assignees: list[str] | None = XmlField(xpath="/us-patent-grant/us-bibliographic-data-grant/assignees/assignee/addressbook/orgname/text()", default=None)
    claims: str = XmlField(xpath="string(/us-patent-grant/claims)")

    @computed_field
    def all_text(self) -> str:
        return self.title + "\n\n" + self.claims


@ops.mapping
def extract_patent(xml):
    """
    The patent file has a lot of information in it, e.g. bibliographic data
    and citations. Here, we just extract the key data.
    """
    return Patent.model_validate_xml(xml)


def take_n(n):
    """
    This is a helper function to run the stream for just n
    elements, if needed for debugging.
    """
    def taken(els):
        for i, el in enumerate(els):
            if i == n:
                break
            yield el

    return taken
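
The splitting logic used by split_into_xmls can be checked on a toy input without downloading anything. The sketch below is a standalone variant that works on a plain list of byte lines; split_concatenated_xml and the toy data are illustrative assumptions, not part of the pipeline.

```python
def split_concatenated_xml(lines):
    """Yield each XML document from an iterable of byte lines."""
    current = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.startswith(b"<?xml") and current:
            yield b"".join(current)
            current = []
        current.append(line)
    # Emit the trailing document, if any lines were collected.
    if current:
        yield b"".join(current)


toy_file = [
    b'<?xml version="1.0" encoding="UTF-8"?>\n',
    b"<a>first</a>\n",
    b'<?xml version="1.0" encoding="UTF-8"?>\n',
    b"<b>second</b>\n",
]
docs = list(split_concatenated_xml(toy_file))
print(len(docs))
```

Each element of docs starts with its own XML declaration, so it can be parsed independently.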

In [8]:
patent_grant_release_date = datetime.date(2024, 2, 6)

url_template = "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/{YYYY}/ipg{YYMMDD}.zip"
url = url_template.format(YYYY=patent_grant_release_date.strftime("%Y"), YYMMDD=patent_grant_release_date.strftime("%y%m%d"))

patents = (
    Stream([url])
    .then(zipped_files)
    .then(split_into_xmls)
    .then(remove_genetic_sequences)
    .then(extract_patent)
    # .then(take_n(100))  # For debugging
).to_list()

logger.info(f"Extracted {len(patents)} patents")

2024-06-07 15:44:07,754 - INFO: Initializing zipped files reader
2024-06-07 15:44:07,755 - INFO: Opening zip file at https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg240206.zip
2024-06-07 15:44:08,578 - INFO: Found 1 files in zip file
2024-06-07 15:44:08,579 - INFO: Reading file 0 (ipg240206.xml) from zip file
2024-06-07 15:45:00,837 - INFO: Extracted 6812 patents
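
To process more than one release, the URL construction can be wrapped in a small function. This is a sketch only: grant_url is a hypothetical helper, and the weekly spacing of the example dates is an assumption that should be checked against the bulk data listing.

```python
import datetime

URL_TEMPLATE = (
    "https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/"
    "{YYYY}/ipg{YYMMDD}.zip"
)


def grant_url(release_date):
    """Return the bulk-data URL for one grant release date."""
    return URL_TEMPLATE.format(
        YYYY=release_date.strftime("%Y"),
        YYMMDD=release_date.strftime("%y%m%d"),
    )


# Assumed example: four dates one week apart, starting at the notebook's date.
start = datetime.date(2024, 2, 6)
urls = [grant_url(start + datetime.timedelta(weeks=k)) for k in range(4)]
for u in urls:
    print(u)
```

The first URL matches the one opened in the log above. A list like urls could in principle be passed to Stream in place of [url], though that has not been tried here.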


In [9]:
print(patents[0])

patent_id='US D1013321 S1 20240206' patent_type='design' title='Canine biscuit' assignees=None claims='\n\nThe ornamental design for a canine biscuit, as shown and described.\n\n' all_text='Canine biscuit\n\n\n\nThe ornamental design for a canine biscuit, as shown and described.\n\n'


In [10]:
from collections import Counter
Counter(patent.patent_type for patent in patents)

Counter({'utility': 6107, 'design': 680, 'plant': 18, 'reissue': 7})

We now have a list of Patent objects in 'patents'.

## Converting patents to embeddings and indexing the embeddings

This sends the full text of each patent (title plus claims) to the embeddings endpoint in one piece, even though the claim text can be quite long. One possible improvement is to split the claims into chunks and compute an embedding for each chunk.
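
A rough sketch of the chunking idea, assuming a simple split on whitespace with a fixed overlap between neighbouring chunks. The function names, chunk_size and overlap are illustrative choices, not something the notebook uses; a token-based splitter would be closer to how the embedding model measures length.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def chunk_documents(texts, **kwargs):
    """Return (doc_index, chunk_text) pairs so each chunk maps back to its document."""
    return [
        (doc_ix, chunk)
        for doc_ix, text in enumerate(texts)
        for chunk in chunk_words(text, **kwargs)
    ]


pairs = chunk_documents(["a b c d e f g", "x y"], chunk_size=4, overlap=1)
print(pairs)
```

Keeping the document index next to each chunk means a hit on any chunk can be traced back to the patent it came from.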

In [11]:
import cohere
from annoy import AnnoyIndex

EMBED_MODEL = "embed-english-v3.0"
DB_N_TREES = 10


def get_embeddings_index(co, docs):
    embeds = co.embed(
        texts=docs,
        model=EMBED_MODEL,
        input_type="search_document"
    ).embeddings
    
    search_index = AnnoyIndex(len(embeds[0]), "angular")
    for i, embed in enumerate(embeds):
        search_index.add_item(i, embed)
    search_index.build(DB_N_TREES)
    return search_index
    

co = cohere.Client(COHERE_API_KEY)
search_index = get_embeddings_index(co, [el.all_text for el in patents])

2024-06-07 15:45:01,503 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:01,532 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:01,540 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:01,577 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:01,606 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:01,644 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:01,746 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:05,385 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:05,692 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.1 200 OK"
2024-06-07 15:45:05,763 - INFO: HTTP Request: POST https://api.cohere.com/v1/embed "HTTP/1.

At this stage, as well as the raw patent data in 'patents', we have a searchable embeddings index in 'search_index'.
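
For intuition about what the index approximates: Annoy's "angular" metric is the Euclidean distance between normalized vectors, which equals sqrt(2 * (1 - cos(u, v))). The sketch below is an exact brute-force stand-in written with the standard library, not Annoy's implementation, and the function names and toy vectors are made up for illustration.

```python
import math


def angular_distance(u, v):
    """Angular distance as defined by Annoy: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, cos))
    return math.sqrt(2.0 * (1.0 - cos))


def brute_force_nns(vectors, query, n):
    """Return indices of the n vectors closest to query, nearest first."""
    order = sorted(
        range(len(vectors)),
        key=lambda i: angular_distance(vectors[i], query),
    )
    return order[:n]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
nearest = brute_force_nns(toy_vectors, [1.0, 0.1], 2)
print(nearest)
```

A brute-force scan like this compares the query against every vector, whereas the tree-based index trades a little accuracy for much faster lookups.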

## Querying and summarizing the patents

In [12]:
def search(co, query, search_index, n_results):
    query_embed = co.embed(
        texts=[query],
        model=EMBED_MODEL,
        input_type="search_query",
    ).embeddings
    
    return search_index.get_nns_by_vector(
        query_embed[0], n_results, include_distances=False
    )


def rerank(co, query, docs, n_results):
    reranked_results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=n_results,
        return_documents=False,
    )
    return [docs[result.index] for result in reranked_results.results]


def summarize(co, docs):
    res = co.chat(
        model="command-r-plus",
        message="Please summarize the dominant themes in the documents in under 500 words.",
        documents=docs,
    )
    return res

In [13]:
query = "artificial intelligence"
initial_n = 100
final_n = 10

initial_ixs = search(co, query, search_index, initial_n)
docs_to_rerank = [patents[i].all_text for i in initial_ixs]
reranked_docs = rerank(co, query, docs_to_rerank, final_n)

result = summarize(co, [{"doc": doc} for doc in reranked_docs])

2024-06-07 15:45:24,229 - INFO: HTTP Request: POST https://api.cohere.com/v1/rerank "HTTP/1.1 200 OK"
2024-06-07 15:45:57,642 - INFO: HTTP Request: POST https://api.cohere.com/v1/chat "HTTP/1.1 200 OK"
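
The rerank function above returns only document text, so the link back to each Patent object (its patent_id or assignees) is lost. Each rerank result carries an index into the list that was submitted, so the two index lists can be composed to recover the original positions. This is a minimal sketch with toy values; map_to_source_indices is a hypothetical helper and the numbers are made up.

```python
def map_to_source_indices(candidate_ixs, reranked_positions):
    """Translate positions within the candidate list into original indices."""
    return [candidate_ixs[pos] for pos in reranked_positions]


# Toy stand-ins for what a nearest-neighbour search and a reranker might return.
initial_ixs = [42, 7, 19, 3]   # indices into the full patents list
reranked_positions = [2, 0]    # positions within the candidate list, best first
top_patent_ixs = map_to_source_indices(initial_ixs, reranked_positions)
print(top_patent_ixs)
```

In the notebook this would mean having rerank return the result.index values rather than docs[result.index], and then looking up patents[i] for each mapped index.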


In [14]:
print(result.text)

The dominant themes in the documents revolve around the use of artificial intelligence (AI) and machine learning (ML) in various applications, including:

- Cooking devices: AI is used to provide guidance and optimize the cooking process based on various sensors and user input.
- 3D modeling: AI is employed to create 3D models of rooms by analyzing images from different perspectives and identifying objects, walls, and corners.
- Train defect detection: AI, in combination with cameras and illumination, detects anomalies in moving trains.
- Model reconstruction: AI models are reconstructed based on usage patterns and context information to improve performance and reliability.
- Power optimization: Techniques are described to optimize power consumption in AI processors by dynamically turning resources on and off.
- Chatbots: AI chatbots interact with human users to reduce the time spent by human agents, utilizing machine learning algorithms.
- Photography: AI is used to categorize and org