# The AI Patent Analyst: From Unstructured PDFs to a Queryable Knowledge Graph

## 1. High-Level Summary

This project solves the critical challenge of analyzing unstructured patent PDFs by building an end-to-end pipeline that transforms them into a structured, queryable Knowledge Graph entirely within Google BigQuery.

The final solution is an interactive analysis engine that delivers significant cost savings by automating tasks that would otherwise require hundreds of hours of expensive expert analysis from patent lawyers or R&D engineers. It answers:

*   **Deep Architectural Analysis:** Use standard SQL with `UNNEST` and `GROUP BY` to discover the most common design patterns and technical component connections across hundreds of patents.

*   **Component Search:** Go beyond patent-level search to find specific, functionally similar technical parts across different domains (e.g., "find a mechanism for encrypting data").

*   **Quantitative Portfolio Analysis:** Compare patent applicants by the complexity (average component count) and breadth (number of domains) of their innovations.

## 2. The Workflow: A Multi-Stage AI Pipeline

Our solution follows a three-stage process, showcasing a powerful combination of BigQuery's multimodal, generative, and vector search capabilities.

### Stage 1: Multimodal Data Processing (🖼️ Pioneer)
We use **Object Tables** to directly read and process raw PDFs from Cloud Storage. The Gemini model is then used with `ML.GENERATE_TEXT` to analyze the both the text and the technical diagrams within the PDFs.

### Stage 2: Generative Knowledge Graph Extraction (🧠 Architect)
The consolidated patent text is fed into the `AI.GENERATE_TABLE` function. A custom prompt instructs the AI to act as an expert analyst, extracting a structured table of high-level insights (`invention_domain`, `problem_solved`) and a detailed, nested graph of all technical components, their functions, and their interconnections.

### Stage 3: Component-Level Semantic Search (🕵️‍♀️ Detective)
To enable deep discovery, we build a novel search engine that understands context. We use `ML.GENERATE_EMBEDDING` to create two separate vectors:
1.  One for the patent's high-level context (title, abstract)
2.  Another for each component's specific function

These vectors are mathematically averaged into a single, final vector for each component via BigQuery's UDF (User-Defined Functions).

Finally, `VECTOR_SEARCH` is used on these combined vectors, creating a powerful search that returns highly relevant, context-aware results.

## 3. Dataset Overview
- **403 PDFs** (197 English, others in FR/DE) at `gs://gcs-public-data--labeled-patents/*.pdf`.
- **Tables**: `extracted_data` (metadata), `invention_types` (labels), `figures` (91 diagram coordinates).
- **Source**: [Labeled Patents](https://console.cloud.google.com/marketplace/product/global-patents/labeled-patents?inv=1&invt=Ab5j9A&project=bq-ai-patent-analyst&supportedpurview=organizationId,folder,project) (1TB/mo free tier).

## 4. Code
*   **Notebook & Repository:** [https://github.com/veyselserifoglu/bq-ai-patent-analyst/blob/main/notebooks/bigquery-ai-the-patent-analyst-project.ipynb](https://github.com/veyselserifoglu/bq-ai-patent-analyst/blob/main/notebooks/bigquery-ai-the-patent-analyst-project.ipynb)

## 5. Architecture Pipeline

In [1]:
from IPython.display import HTML

# Display Architecture pipeline

HTML(f'''
<div style="text-align: center; padding: 15px;">
    <a href="https://github.com/veyselserifoglu/bq-ai-patent-analyst/blob/main/doc/Patent%20Analysis%20Pipeline%20Architecture%20-%20PNG.png?raw=true" 
       target="_blank" 
       style="cursor: pointer; display: inline-block; text-decoration: none;">
        <div style="position: relative; display: inline-block;">
            <img src="https://github.com/veyselserifoglu/bq-ai-patent-analyst/blob/main/doc/Patent%20Analysis%20Pipeline%20Architecture%20-%20PNG.png?raw=true" 
                 width="300" 
                 height="200"
                 style="border: 2px solid #e0e0e0; border-radius: 8px; transition: all 0.3s ease; box-shadow: 0 4px 8px rgba(0,0,0,0.1);"
                 onmouseover="this.style.borderColor='#4285F4'; this.style.boxShadow='0 6px 12px rgba(66, 133, 244, 0.3)'"
                 onmouseout="this.style.borderColor='#e0e0e0'; this.style.boxShadow='0 4px 8px rgba(0,0,0,0.1)'">
            <div style="position: absolute; top: 8px; right: 8px; background: rgba(255,255,255,0.9); border-radius: 50%; width: 24px; height: 24px; display: flex; align-items: center; justify-content: center; font-size: 14px;">
                ↗
            </div>
        </div>
    </a>
    <p style="margin-top: 12px; color: #5f6368; font-size: 13px; font-style: italic;">Click to explore the full architecture</p>
</div>
''')

In [2]:
# For visualization purposes
%pip install -q pyvis
%pip install -q plotly
%pip install -q ipywidgets

In [5]:
# BigQuery
import os
from google.cloud import bigquery
from kaggle_secrets import UserSecretsClient
import pandas as pd
from pyvis.network import Network
import plotly.express as px
from google.cloud import bigquery
from IPython.display import Image, display, HTML, IFrame
import ipywidgets as widgets
from ipywidgets import Layout
import warnings


pd.set_option('display.max_colwidth', None)

# Suppress the specific UserWarning from the BigQuery client
warnings.filterwarnings("ignore", message="BigQuery Storage module")

# Google Cloud Project Setup

This guide outlines the one-time setup required in Google Cloud and Kaggle to enable the analysis.

---

### 1. Google Cloud Project Configuration

First, configure your Google Cloud project.

1.  **Select or Create a Project**
    * Ensure you have a Google Cloud project.
    * Copy the **Project ID** (e.g., `my-project-12345`), not the project name.

2.  **Enable Required APIs**
    * In your project, enable the following two APIs:
        * **Vertex AI API**
        * **BigQuery Connection API**

3.  **Create a Service Account for the Notebook**
    * This service account allows the Kaggle notebook to act on your behalf.
    * Navigate to **IAM & Admin** > **Service Accounts**.
    * Click **+ CREATE SERVICE ACCOUNT**.
    * Give it a name (e.g., `kaggle-runner`).
    * Grant it these three roles: `Be sure to follow the principle of least privilege.`  
        * `BigQuery Admin`
        * `Vertex AI User`
        * `Service Usage Admin`
    * After creating the account, go to > manage keys > create a new key. A file will be downloaded to your computer.

---

### 2. Kaggle Notebook Configuration

Next, configure this Kaggle notebook to use your project.

1.  **Add Kaggle Secrets**
    * In the notebook editor, go to the **"Add-ons"** menu and select **"Secrets"**.
    * Add two secrets:
        * **`GCP_PROJECT_ID`**: Paste your Google Cloud **Project ID** here.
        * **`GCP_SA_KEY`**: Open the downloaded JSON key file, copy its entire text content, and paste it here.

---

### 3. Final Permission Step (After Running Code)

The first time you run the setup cells in the notebook, a new BigQuery connection will be created. This connection has its own unique service account that needs permission to use AI models.

1.  **Find the Connection Service Account**
    * After running the setup cells, go to **BigQuery** > **External connections** in your Google Cloud project.
    * Click on the connection named `llm-connection`.
    * Copy its **Service Account ID** (it will look like `bqcx-...@...gserviceaccount.com`).

2.  **Grant Permission**
    * Go to the **IAM & Admin** page.
    * Click **+ Grant Access**.
    * Paste the connection's service account ID into the **"New principals"** box.
    * Give it the single role of **`Vertex AI User`**.
    * Click **Save**.

---

With this setup complete, the notebook has secure access to your Google Cloud project and can run all subsequent analysis cells.

In [6]:
user_secrets = UserSecretsClient()
project_id = user_secrets.get_secret("GCP_PROJECT_ID")
gcp_key_json = user_secrets.get_secret("GCP_SA_KEY")
location = 'US'

In [7]:
# Write the key to a temporary file in the notebook's environment
key_file_path = 'gcp_key.json'
try:
    with open(key_file_path, 'w') as f:
        f.write(gcp_key_json)
    
    # Remove "> /dev/null 2>&1" to show the output.
    # Authenticate the gcloud tool using the key file
    !gcloud auth activate-service-account --key-file={key_file_path} > /dev/null 2>&1
    
    # Configure the gcloud tool to use your project
    !gcloud config set project {project_id} > /dev/null 2>&1
    
finally:
    # Securely delete the key file immediately after use
    if os.path.exists(key_file_path):
        os.remove(key_file_path)

# Enable the Vertex AI and BigQuery Connection APIs. Run only once Or Enable using the Cloud Interface.
# !gcloud services enable aiplatform.googleapis.com bigqueryconnection.googleapis.com > /dev/null 2>&1

In [None]:
# This command creates the connection resource. Remove "> /dev/null 2>&1" to show the output.
!bq mk --connection --location={location} --connection_type=CLOUD_RESOURCE llm-connection > /dev/null 2>&1

In [None]:
# This command shows the details of your connection. Remove "> /dev/null 2>&1" to show the output.
!bq show --connection --location={location} llm-connection > /dev/null 2>&1

# BigQuery Resource Creation

This section creates the necessary resources for our analysis inside our BigQuery project.

---

### 1. Create a Dataset in the Correct Region.

First, we create a new dataset named `patent_analysis` in our chosen region. This dataset acts as a container for the AI models and the object table of the dataset.

### 2. Create a Reference to the AI MultiModel.

Next, we create a "shortcut" to Google's `gemini-2.5-flash` model. This command gives us an easy name, `gemini_vision_analyzer`, to use in our analysis queries.

### 3. Create an Object Table for the PDFs.

Next, we create an object table named `patent_documents_object_table`. This is a special "map" that points directly to all the raw PDF files in the public Google Cloud Storage bucket, making them ready for analysis.

### 4. Create a Reference to the AI Embedding Model.

Next, we create a "shortcut" to Google's `gemini-embedding-001` model. This command gives us an easy name, `embedding_model`, to use in our embedding tasks.

### 5. Create a Reference to do L2 Normalization

Next, We create a custom SQL function to standardize and normalize our vectors.

### 6. Create a Reference to perform a weighted average of two vectors.

Finally, we create a custom UDF (user defined function) to intelligently blend our two different types of embeddings (patent context and component function) into a single, more powerful context-aware vector.

---

In [8]:
# Initiate BigQuery client.
client = bigquery.Client(project=project_id, location=location)
client

<google.cloud.bigquery.client.Client at 0x7a1b2df1b550>

In [None]:
# 1. Create the new dataset "patent_analysis"
create_dataset_query = f"""
CREATE SCHEMA IF NOT EXISTS `{project_id}.patent_analysis`
OPTIONS(location = '{location}');
"""
print(f"Creating dataset 'patent_analysis' in {location}...")
job = client.query(create_dataset_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED to create dataset. Error:\n\n{e}")


# 2. Create the AI model reference inside the new dataset
create_model_query = f"""
CREATE OR REPLACE MODEL `{project_id}.patent_analysis.gemini_vision_analyzer`
  REMOTE WITH CONNECTION `{location}.llm-connection`
  OPTIONS (endpoint = 'gemini-2.5-flash');
"""
print("\nCreating the AI model reference...")
job = client.query(create_model_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED to create the AI Model reference. Error:\n\n{e}")


# 3. Create the Object Table
# This query creates the "map" to the PDF files inside the local 'patent_analysis' dataset.
object_table_query = f"""
CREATE OR REPLACE EXTERNAL TABLE `{project_id}.patent_analysis.patent_documents_object_table`
WITH CONNECTION `{location}.llm-connection`
OPTIONS (
    object_metadata = 'SIMPLE',
    uris = ['gs://gcs-public-data--labeled-patents/*.pdf'] 
);
"""
print("Creating the object table...")
job = client.query(object_table_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED to create the object table. Error:\n\n{e}")


# 4. Create a remote connection for the embedding model.
sql_query = f"""
CREATE OR REPLACE MODEL `{project_id}.patent_analysis.embedding_model`
  REMOTE WITH CONNECTION `{location}.llm-connection`
  OPTIONS (endpoint = 'gemini-embedding-001');
"""

print("Creating the AI Embedding Model reference...")
job = client.query(sql_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED to create the AI Embedding Model reference. Error:\n\n{e}")


# 5. creates a helper function to perform L2 normalization on a vector.
create_classification_model = f"""
CREATE OR REPLACE FUNCTION `{project_id}.patent_analysis.L2_NORMALIZE`(vec ARRAY<FLOAT64>)
RETURNS ARRAY<FLOAT64> AS ((
  
  -- Calculate the L2 Norm (magnitude) of the vector.
  WITH vector_norm AS (
    SELECT SQRT(SUM(element * element)) AS norm
    FROM UNNEST(vec) AS element
  )
  
  -- Divide each element by the norm to create a unit vector.
  -- Handle the case where the norm is 0 to avoid division by zero errors.
  SELECT
    ARRAY_AGG(
      IF(norm = 0, 0, element / norm)
    )
  FROM
    UNNEST(vec) AS element, vector_norm
));
"""
print("Creating a Vector Normalization UDF...")
job = client.query(create_classification_model)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED to create the Vector Normalization reference. Error:\n\n{e}")


# 6. This creates a helper function to perform a weighted average of two vectors.
sql_query = f"""
CREATE OR REPLACE FUNCTION `{project_id}.patent_analysis.VECTOR_WEIGHTED_AVG`(
  vec1 ARRAY<FLOAT64>, weight1 FLOAT64,
  vec2 ARRAY<FLOAT64>, weight2 FLOAT64
)
RETURNS ARRAY<FLOAT64>
LANGUAGE js AS r'''
  if (!vec1 || !vec2 || vec1.length !== vec2.length) {{
    return null;
  }}
  let weighted_vec = [];
  for (let i = 0; i < vec1.length; i++) {{
    weighted_vec.push((vec1[i] * weight1) + (vec2[i] * weight2));
  }}
  return weighted_vec;
''';
"""

print("Creating a weighted average vector UDF...")
job = client.query(sql_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED to create the weighted average UDF reference. Error:\n\n{e}")

# DataFrame Styler

In [9]:
def display_styled_df(df: pd.DataFrame, title: str):
    """
    Takes a DataFrame and returns a styled HTML table for better readability.
    """
    if df.empty:
        print("⚠️ DataFrame is empty.")
        return

    styler = df.style \
        .set_caption(f"<h3>{title}</h3>") \
        .set_properties(**{
            'text-align': 'left',
            'white-space': 'normal', # Crucial for wrapping long text
            'font-size': '14px',
            'vertical-align': 'top', # Aligns text to the top of the cell
            'border': '1px solid #444',
            'padding': '8px'
        }) \
        .set_table_styles([
            {'selector': 'th', 'props': [('text-align', 'left'), ('font-size', '16px'), ('background-color', '#333')]},
            {'selector': 'caption', 'props': [('caption-side', 'top'), ('font-size', '18px'), ('text-align', 'center')]}
        ])

    display(HTML(styler.to_html()))

# Data Extraction & Knowledge Graph Creation

## What did we build?
We created two foundational data assets that power our analysis.

1. the `ai_text_extraction` table: transforms the raw PDFs into structured text, capturing the title and abstract.
2. the `patent_knowledge_graph` table: builds on this, creating a queryable graph of technical components and their connections.

## Why is this important?
- Automates Expert Work, saving hundreds of expert hours. 
- Accelerates Time-to-Insight, analyzing patents in seconds.

## How did we do it?
The process used a sequence of BigQuery's native AI functions:

1. **Multimodal Analysis**:
   - we used `ML.GENERATE_TEXT` to analyze the text and the technical diagrams within each patent's PDF.

2. **Knowledge Graph Extraction**:
   - Next, we fed all the consolidated text into the `AI.GENERATE_TABLE` function, to extract:
     - A nested table of all technical components.
     - Their functions.
     - Their connections for each patent.

In [12]:
# 1. Multimodal Analysis - only texts - ai_text_extraction table

prompt_text = """From this patent document, perform the following tasks:

1.  **Extract these fields**: title, inventor, abstract, 
    the **Filed**, the **Date of Patent**, the international classification code, and the applicant.
    
2.  **Translate**: If the original title and abstract are in German or French, translate them into English.

3.  **Identify Language**: Determine the original language of the document.

Return ONLY a valid JSON object with EXACTLY these ten keys: 
"title_en", "inventor", "abstract_en", "filed", "date_of_patent", "class_international", "applicant", and "original_language".

**Formatting Rule**: For any key that has multiple values (like "inventor" or "class_international" or "applicant"), 
combine them into a single string, separated by a comma and a space. For example: "Igor Karp, Lev Stesin".

The "original_language" value must be one of these three strings: 'EN', 'FR', or 'DE'.
If any other field is unavailable, use null as the value.
"""

# The main SQL query.
sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.ai_text_extraction` AS (
  WITH raw_json AS (
      SELECT
        uri,
        ml_generate_text_llm_result AS llm_result
      FROM
        ML.GENERATE_TEXT(
          MODEL `{project_id}.patent_analysis.gemini_vision_analyzer`,
          TABLE `{project_id}.patent_analysis.patent_documents_object_table`,
          STRUCT(
            '''{prompt_text}''' AS prompt,
            2048 AS max_output_tokens,
            0.2 AS temperature,
            TRUE AS flatten_json_output
          )
        )
    ),
    parsed_json AS (
      -- Step 2: Clean and parse the JSON output.
      SELECT
        uri,
        llm_result,
        SAFE.PARSE_JSON(
          REGEXP_REPLACE(llm_result, r'(?s)```json\\n(.*?)\\n```', r'\\1')
        ) AS json_data
      FROM
        raw_json
    )
  SELECT
    uri,
    llm_result,
    
    SAFE.JSON_VALUE(json_data, '$.original_language') AS original_language,
    SAFE.JSON_VALUE(json_data, '$.title_en') AS extracted_title_en,
    SAFE.JSON_VALUE(json_data, '$.inventor') AS extracted_inventor,
    SAFE.JSON_VALUE(json_data, '$.abstract_en') AS extracted_abstract_en,
    SAFE.JSON_VALUE(json_data, '$.filed') AS filed_date,
    SAFE.JSON_VALUE(json_data, '$.date_of_patent') AS official_patent_date,
    SAFE.JSON_VALUE(json_data, '$.class_international') AS class_international,
    SAFE.JSON_VALUE(json_data, '$.applicant') AS applican
    
  FROM
    parsed_json
);
"""

print("Attempting to create the ai text extraction table...")
job = client.query(sql_query)
try:
    job.result()
    print("✅ Success: The `ai_text_extraction` table was created.")

    print("\nFetching a sample of 5 records from the new table:")
    sql_select_sample_query = f"""
    SELECT 
        ate.uri, 
        ate.original_language,
        ate.extracted_title_en,
        ate.extracted_inventor, 
        ate.extracted_abstract_en,
        ate.filed_date,
        ate.class_international
    FROM `{project_id}.patent_analysis.ai_text_extraction` AS ate
    WHERE ate.extracted_title_en is not NULL
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display_styled_df(df_sample, title="Sample of 5 Records from the `ai_text_extraction` Table")

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to create the ai text extraction table...
✅ Success: The `ai_text_extraction` table was created.

Fetching a sample of 5 records from the new table:


Unnamed: 0,uri,original_language,extracted_title_en,extracted_inventor,extracted_abstract_en,filed_date,class_international
0,gs://gcs-public-data--labeled-patents/espacenet_de85.pdf,DE,GARDEN TOOL FOR SOIL CULTIVATION AND SOWING OR PLANTING METHOD WITH THE AID OF SUCH A GARDEN TOOL,"Bindhammer, Markus","The invention relates to a garden tool for soil cultivation, with a handle section, a soil contact section attached thereto, a power supply, a moisture detection device for detecting a number of quantities corresponding to soil moisture, a nutrient detection device for detecting a number of quantities corresponding to the nutrient content in the soil, a computing unit for calculating the soil moisture from the quantities corresponding to the soil moisture and the nutrient content from the quantities corresponding to the nutrient content, and an output device for outputting the soil moisture and the nutrient content in the soil. The invention is characterized in that the moisture detection device has at least two first electrodes (3, 4) arranged on the soil contact section at a distance from each other in the manner of a plate capacitor, wherein a capacitance of the plate capacitor is influenced with earth as a dielectric when the earth is contacted with the soil contact section in the area between the electrodes (3, 4), wherein the garden tool has a user interface, a selection program for garden plants and/or vegetable types selectable via the user interface, in which favorable target values for the soil moisture or the nutrient content in the soil for the garden plants and/or vegetable types are stored, as well as a comparison device which compares the stored target values for the selected garden plant or vegetable type with the detected actual values for soil moisture and/or nutrient content in the soil and transmits the result for output to the output device. The invention further relates to a sowing or planting method to be carried out with the garden tool.",22.03.2018,A01B 1/02
1,gs://gcs-public-data--labeled-patents/espacenet_de86.pdf,DE,"OVAL WHEEL FLOW METER, METHOD FOR MEASURING A FLOW AND DOSING SYSTEM","Matzner, Tobias, Turkiewicz, Michael","The invention relates to an oval wheel flow meter (100) for measuring a flow of at least one process medium in a dosing system. The oval wheel flow meter (100) has a housing (110, 115), in which a measuring chamber (120), an inlet section (130) and an outlet section (140) are formed, two oval wheels (150, 160) rotatably mounted in the measuring chamber (120) between the inlet section (130) and the outlet section (140), and a detection device for detecting a position of the oval wheels (150, 160). In at least one housing part (115) or section of the housing (110, 115), a groove (170) facing the measuring chamber (120) is formed. The groove (170) is formed between the axes of rotation of the oval wheels (150, 160). A longitudinal axis of the groove (170) extends along a flow meter longitudinal axis connecting the inlet section (130) to the outlet section (140) between the inlet section (130) and the outlet section (140), wherein a longitudinal dimension of the groove (170) is chosen such that a gas-permeable connection between the inlet section (130) and the outlet section (140) is provided by the groove (170).",22.03.2018,G01F 3/10
2,gs://gcs-public-data--labeled-patents/espacenet_de73.pdf,DE,DEVICE FOR INDUCTIVE ENERGY TRANSFER,"Acero Acero, Jesus, Carretero Chamarro, Claudio, Hernandez Blasco, Pablo Jesus, Llorente Gil, Sergio, Lope Moratilla, Ignacio, Moya Albertin, Maria Elena, Serrano Trullen, Javier","The invention relates to a device for inductive energy transfer (10a-j) with at least two overlapping induction elements (12a-j). In order to advantageously further develop a generic device, it is proposed that the device for inductive energy transfer (10a-j) has at least one magnetic flux bundling unit (14a-j), which is provided for bundling at least one magnetic flux provided by at least one of the induction elements (12a-j), and which has at least one magnetic flux bundling element (16a-j), which is assigned to the overlapping induction elements (12a-j) for flux bundling.",03.04.2018,H05B 6/12
3,gs://gcs-public-data--labeled-patents/espacenet_de74.pdf,DE,DEVICE FOR INDUCTIVE ENERGY TRANSFER,"Acero Acero, Jesus, Almolda Fandos, Manuel, Hernandez Blasco, Pablo Jesus, Llorente Gil, Sergio, Lope Moratilla, Ignacio, Moya Albertin, Maria Elena, Serrano Trullen, Javier","The invention relates to a device for inductive energy transfer with at least one induction unit (10a-c), which comprises at least one induction element (12a-c), and with at least one contact plane (14a-c). To achieve a compact design and high efficiency, it is proposed that the at least one induction element (12a-c) extends at least in a partial area (16a-c) along a first main extension plane (18a-c) that deviates from the contact plane (14a-c).",03.04.2018,H05B 6/12
4,gs://gcs-public-data--labeled-patents/espacenet_de78.pdf,DE,MEDICAL DEVICE,"Emmanouilidis, Nikos","The invention relates to a medical device, e.g. for endoscopy and/or intubation, with an elongated main element to be inserted into a body opening of a patient, such as a catheter, a tube, an endoscope or another elongated flexible object, wherein the medical device further comprises a reference component which is adapted to be fixed at a defined position near the body opening on the patient before the main element is inserted into the body opening, wherein the medical device further comprises a measuring device which, via at least one sensing means, detects the insertion depth of the main element and/or the axial orientation of the main element relative to the reference component and provides it as sensor data for further processing and/or storage.",28.03.2018,A61B 1/00


In [15]:
# 1. Multimodal Analysis - only extending ai_text_extraction table with the technical diagrams.

diagram_prompt_text = """
Describe this technical diagram from a patent document. 
What is its primary function and what key components are labeled?
"""

sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.ai_text_extraction` AS (

  WITH figures_with_object_ref AS (
      SELECT
        fig.*, obj.ref
      FROM
        `bigquery-public-data.labeled_patents.figures` AS fig
      JOIN
        `{project_id}.patent_analysis.patent_documents_object_table` AS obj
      ON
        fig.gcs_path = obj.uri
    ),
    
    generated_descriptions AS (
      SELECT
        gcs_path,
        ml_generate_text_llm_result AS diagram_description
      FROM
        ML.GENERATE_TEXT(
          MODEL `{project_id}.patent_analysis.gemini_vision_analyzer`,
          (
            SELECT
              gcs_path,
              [
                JSON_OBJECT('uri', ref.uri, 'bounding_poly', [
                  STRUCT(x_relative_min AS x, y_relative_min AS y),
                  STRUCT(x_relative_max AS x, y_relative_min AS y),
                  STRUCT(x_relative_max AS x, y_relative_max AS y),
                  STRUCT(x_relative_min AS x, y_relative_max AS y)
                ])
              ] AS contents,
              '''{diagram_prompt_text}''' AS prompt
            FROM
              figures_with_object_ref
          ),
          STRUCT(
            4096 AS max_output_tokens,
            0.2 AS temperature,
            TRUE AS flatten_json_output
          )
        )
    ),

    aggregated_descriptions AS (
      SELECT
        gcs_path,
        ARRAY_AGG(diagram_description IGNORE NULLS) AS diagram_descriptions
      FROM
        generated_descriptions
      GROUP BY
        gcs_path
    )

  SELECT
    T.*,
    S.diagram_descriptions
  FROM
    `{project_id}.patent_analysis.ai_text_extraction` AS T
  LEFT JOIN
    aggregated_descriptions AS S
  ON
    T.uri = S.gcs_path
);
"""

print("Attempting to extend the ai text extraction table with the diagram description...")
job = client.query(sql_query)
try:
    job.result()
    print("✅ Success: The `ai_text_extraction` table was extended.")

    print("\nFetching a sample of 5 records from the table:")
    sql_select_sample_query = f"""
    SELECT 

        ate.uri, 
        ate.original_language,
        ate.extracted_title_en,
        ate.extracted_inventor,
        ate.filed_date,
        ate.diagram_descriptions
    
    FROM `{project_id}.patent_analysis.ai_text_extraction` AS ate
    WHERE ate.extracted_title_en is not NULL AND ARRAY_LENGTH(ate.diagram_descriptions) > 0
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display_styled_df(df_sample, title="Sample of 5 Records from the `ai_text_extraction` Table, with diagrams descriptions")

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to extend the ai text extraction table with the diagram description...
✅ Success: The `ai_text_extraction` table was extended.

Fetching a sample of 5 records from the table:


Unnamed: 0,uri,original_language,extracted_title_en,extracted_inventor,filed_date,diagram_descriptions
0,gs://gcs-public-data--labeled-patents/espacenet_en49.pdf,EN,"METHOD, DEVICE AND SYSTEM FOR SELECTING GATEWAY","QIN, Yun, WU, Ling",22.12.2016,"['This technical diagram, typical of a patent drawing, depicts a **cross-sectional view of a processing apparatus or reactor system**. It illustrates the internal structure of a main chamber and its associated external control and flow components.\n\n**Primary Function:**\n\nThe primary function of this apparatus appears to be to facilitate a **controlled process involving fluid interaction, mixing, and/or heat transfer within a chamber**. Given the internal heating element, the multiple baffles/mixing elements, and the controlled flow (pump, valve), it is designed for efficient processing, likely a chemical reaction, a mixing operation, or a highly efficient heat exchange process, where precise temperature and flow dynamics are crucial. The arrows indicating turbulent flow suggest enhanced contact between substances.\n\n**Key Components Labeled:**\n\n* **100: Main Chamber / Housing:** The primary vessel where the process takes place.\n* **102: Primary Fluid Inlet:** An entry point for a main fluid or material into the chamber, typically driven by a pump.\n* **104: Processed Material Outlet:** An exit point for the processed fluid or material from the chamber, controlled by a valve.\n* **106: Secondary Inlet:** An additional entry point, likely for a second fluid, gas, or reagent to be introduced into the chamber.\n* **108: Internal Heating Element / Shaft:** A central elongated component within the chamber, connected to an external heater, suggesting it provides heat to the internal process. It might also serve as a central shaft for the mixing elements.\n* **110: Internal Baffles / Mixing Elements / Discs:** Multiple disc-like or ring-shaped structures arranged along the central element (108). These are designed to create turbulence, increase surface area for heat transfer, or promote thorough mixing of fluids as they pass through the chamber.\n* **112: Flow Path / Turbulence Indicators:** Arrows indicating the direction of fluid flow and the turbulent or swirling motion created by the internal baffles (110).\n* **114: Heater (External):** An external unit responsible for generating heat and supplying it to the internal heating element (108).\n* **116: Controller:** An electronic unit that manages and regulates the operation of the heater (114) and potentially other system parameters based on sensor input.\n* **118: Pump:** An external device connected to the primary inlet (102), used to force the primary fluid into the chamber.\n* **120: Valve:** An external device connected to the outlet (104), used to control the flow rate or stop the discharge of processed material.\n* **122: Sensor:** An internal component positioned near the outlet (104), designed to measure a property of the processed material (e.g., temperature, concentration, pH) or the internal conditions of the chamber.\n* **124: Display:** A user interface component connected to the controller (116) and/or sensor (122), used to show real-time data or system status to an operator.']"
1,gs://gcs-public-data--labeled-patents/espacenet_en32.pdf,EN,"DIAL PRESENTATION METHOD, DEVICE AND SMART WATCH","QIAN, Li, HUANG, Xueyan, HUANG, Kangmin, HUANG, Maosheng",16.11.2017,"['This technical diagram is a **high-level block diagram** illustrating the functional architecture of a generic computing system or device.\n\n**Primary Function:**\n\nThe primary function of this diagram in a patent document is to **depict the foundational hardware environment or system on which a claimed invention (e.g., a method, a software application, or a specific hardware component) would be implemented or operate.** It provides a conceptual overview of how different functional blocks interact to form a complete operational system, without delving into specific circuit details. It helps to establish the context and scope of the invention by showing the general computing platform it utilizes.\n\n**Key Components Labeled:**\n\n1. **Processor (102):** This is the central processing unit (CPU), depicted as the core component. It is responsible for executing instructions, performing calculations, and managing the overall operations of the system. It is shown with connections to almost all other modules, indicating its central role.\n\n2. **Memory (104):** This component is connected to the Processor (102) and is used to store data and program instructions that the Processor (102) needs to access. This would typically include both volatile memory (like RAM) for active data and non-volatile memory (like ROM or flash memory) for persistent storage.\n\n3. **Input/Output (I/O) Interface (106):** This module acts as a bridge or controller that facilitates communication between the Processor (102) and various peripheral devices. It manages the flow of data into and out of the core processing unit.\n\n4. **Network Interface (108):** Connected to the I/O Interface (106), this component enables the system to communicate with other devices or networks (e.g., the internet, a local area network) via wired or wireless connections.\n\n5. **Display (110):** An output device, connected via the I/O Interface (106), used to present visual information (e.g., text, graphics, video) to a user.\n\n6. **Input Device (112):** An input device (e.g., keyboard, mouse, touchscreen, microphone, camera), connected via the I/O Interface (106), allowing users to provide data or commands to the system.\n\n7. **Power Source (114):** This component provides the necessary electrical power to operate all the components of the system.\n\nThe arrows between the blocks generally indicate data and control flow, showing how information moves between these functional units, all typically orchestrated by the Processor (102).']"
2,gs://gcs-public-data--labeled-patents/espacenet_en93.pdf,EN,METHOD AND APPARATUS FOR POWER CONTROL AND MULTIPLEXING FOR DEVICE TO DEVICE COMMUNICATION IN WIRELESS CELLULAR COMMUNICATION SYSTEM,"Kwak, Yongjun, Cho, Joonyoung, Ji, Hyoungju, Ro, Sangmin",17.02.2014,"['This technical diagram, labeled **""FIG. 1""** and explicitly identified as **""PRIOR ART""**, is a high-level **block diagram** illustrating a conventional data processing system. The ""PRIOR ART"" designation is crucial in a patent document, indicating that this diagram represents existing technology that the invention described in the patent aims to improve upon or differentiate from.\n\n---\n\n### Primary Function:\n\nThe primary function of the system depicted in FIG. 1 is to **process data, interact with a user, store information, and communicate with external networks and devices.** It represents a generic, fundamental architecture for a computer system or any device capable of computation and interaction.\n\n---\n\n### Key Components Labeled:\n\nThe diagram uses rectangular blocks to represent components and lines with arrows to indicate data or control flow.\n\n1. **DATA PROCESSING SYSTEM 100:**\n * This is the central, overarching component of the diagram, representing the core computational unit.\n * It encapsulates the processor and the internal communication bus.\n\n2. **USER INTERFACE 102:**\n * Connected to the `DATA PROCESSING SYSTEM 100` via `BUS 112`.\n * Represents the means by which a user interacts with the system (e.g., keyboard, mouse, display, touchscreen).\n * Bi-directional arrows indicate both input from the user and output to the user.\n\n3. **MEMORY 104:**\n * Connected to the `DATA PROCESSING SYSTEM 100` via `BUS 112`.\n * Represents storage for data and program instructions (e.g., RAM, ROM, hard drive).\n * Bi-directional arrows indicate that data can be read from and written to memory.\n\n4. **NETWORK INTERFACE 106:**\n * Connected to the `DATA PROCESSING SYSTEM 100` via `BUS 112`.\n * Facilitates communication between the `DATA PROCESSING SYSTEM 100` and an external `NETWORK 114`.\n * Bi-directional arrows indicate data transmission and reception.\n\n5. **EXTERNAL DEVICE 108:**\n * Connected to the `DATA PROCESSING SYSTEM 100` via `BUS 112`.\n * A generic placeholder for any peripheral device that can connect to the system (e.g., printer, scanner, camera, external storage).\n * Bi-directional arrows indicate communication with the external device.\n\n6. **PROCESSOR 110:**\n * Located *inside* the `DATA PROCESSING SYSTEM 100`.\n * The central processing unit (CPU) responsible for executing instructions and performing computations.\n * Connected to `BUS 112`.\n\n7. **BUS 112:**\n * Located *inside* the `DATA PROCESSING SYSTEM 100`.\n * Represents the internal communication pathway that connects the `PROCESSOR 110` to all other internal and external components (User Interface, Memory, Network Interface, External Device).\n * It acts as the backbone for data transfer within the system.\n\n8. **NETWORK 114:**\n * Located *outside* the `DATA PROCESSING SYSTEM 100`.\n * Represents an external communication network (e.g., the internet, a local area network) that the system can connect to via the `NETWORK INTERFACE 106`.\n\nIn summary, FIG. 1 provides a foundational understanding of a typical computer system\'s architecture, serving as a baseline against which the novel aspects of the patent\'s invention would later be presented.']"
3,gs://gcs-public-data--labeled-patents/espacenet_en65.pdf,EN,ACCESS NETWORK DISCOVERY AND SELECTION,"SIROTKIN, Alexander, HIMAYAT, Nageen, BANGOLAE, Sangeetha",18.12.2013,"['This technical diagram, likely from a patent application, presents a **cross-sectional view of a microfluidic device or reaction chamber with integrated temperature control capabilities.**\n\n**Primary Function:**\n\nThe primary function of this device is to **provide a precisely temperature-controlled environment for chemical or biological reactions, analyses, or processes.** It allows for the heating, cooling, or maintaining of a specific temperature within the central reaction chamber by circulating thermal fluids through adjacent channels. This is crucial for applications where reaction kinetics, enzyme activity, or sample stability are highly temperature-dependent.\n\n**Key Components Labeled:**\n\n1. **10: Reaction Chamber:** This is the central processing area where the main fluid (e.g., sample, reactants) flows and undergoes the desired reaction or analysis. It\'s the core functional part of the device.\n2. **12: Inlet Channel:** The conduit through which the fluid (sample or reactants) enters the reaction chamber (10).\n3. **14: Outlet Channel:** The conduit through which the processed fluid or reaction products exit the reaction chamber (10).\n4. **16: Heating Channel:** A channel positioned adjacent to the reaction chamber (10), designed to circulate a heating fluid. The arrows indicate the flow direction of this heating fluid.\n5. **18: Cooling Channel:** Another channel, also positioned adjacent to the reaction chamber (10), designed to circulate a cooling fluid. The arrows indicate the flow direction of this cooling fluid.\n6. **20: Inlet for Heating Channel:** The port or opening where the heating fluid enters the heating channel (16).\n7. **22: Outlet for Heating Channel:** The port or opening where the heating fluid exits the heating channel (16).\n8. **24: Inlet for Cooling Channel:** The port or opening where the cooling fluid enters the cooling channel (18).\n9. **26: Outlet for Cooling Channel:** The port or opening where the cooling fluid exits the cooling channel (18).\n10. **28: Substrate:** The main body or base material of the device, typically a solid material (e.g., silicon, glass, polymer) in which the channels are fabricated or etched.\n11. **30: Cover:** A top layer that seals the channels, forming enclosed conduits for fluid flow. This cover is bonded or attached to the substrate (28).\n\nIn essence, the diagram illustrates a sophisticated ""lab-on-a-chip"" component designed for highly controlled thermal processing of small fluid volumes.']"
4,gs://gcs-public-data--labeled-patents/espacenet_en43.pdf,EN,EMOTION RECOGNITION IN VIDEO CONFERENCING,"SHABUROV, Victor, MONASTYRSHYN, Yurii",18.03.2016,"[""This technical diagram, typical of a patent document figure, is a **schematic representation of a catalytic reactor system**.\n\n**Primary Function:**\nIts primary function is to facilitate a chemical reaction by passing a feed material through a catalyst bed under precisely controlled temperature and pressure conditions, thereby converting the feed into a desired product. It's designed for continuous or semi-continuous operation where process parameters need to be actively monitored and adjusted.\n\n**Key Components Labeled:**\n\n* **100: Reaction Vessel / Reactor Chamber:** The main enclosure where the chemical reaction takes place. It's depicted as a cylindrical or rectangular tank.\n* **102: Feed Inlet:** The entry point for the raw material or reactant stream into the reaction vessel. An arrow indicates the direction of flow.\n* **104: Product Outlet:** The exit point for the processed material or product stream from the reaction vessel. An arrow indicates the direction of flow.\n* **106: Catalyst Bed:** An internal structure within the reaction vessel, typically a packed bed or a structured catalyst, where the catalytic reaction occurs.\n* **108: Heating Element:** A component positioned around or within the reaction vessel, responsible for supplying heat to maintain or raise the temperature of the reaction.\n* **110: Cooling Element:** A component positioned around or within the reaction vessel, responsible for removing heat to maintain or lower the temperature of the reaction, especially for exothermic processes.\n* **112: Temperature Sensor:** A device placed inside the reaction vessel (likely within or near the catalyst bed) to measure the internal temperature.\n* **114: Pressure Sensor:** A device placed inside the reaction vessel to measure the internal pressure.\n* **116: Controller:** An external unit (e.g., a computer or PLC) that receives data from the sensors (112, 114) and sends commands to the heating (108), cooling (110), pump (118), and valve (120) elements to maintain desired operating conditions.\n* **118: Pump:** A device located on the feed line (102) to control the flow rate and pressure of the incoming feed material into the reactor.\n* **120: Valve:** A device located on the product line (104) to control the flow rate or pressure of the outgoing product stream.\n\nIn essence, the diagram illustrates a controlled chemical processing unit, emphasizing the critical role of the catalyst and the precise management of environmental parameters (temperature, pressure, flow) for efficient and effective conversion.""]"


In [17]:
# 2. Knowledge Graph - patent_knowledge_graph table.

# Define the schema as a Python variable
schema = """
invention_domain STRING, problem_solved STRING, patent_type STRING, 
components ARRAY<STRUCT<component_name STRING, component_function STRING, connected_to ARRAY<STRING>>>
"""

# The prompt text remains the same
prompt_text = """
From the following patent text, perform these tasks:
1. Determine the high-level technical domain (e.g., 'Telecommunications', 'Medical Devices').
2. Provide a one-sentence summary of the core problem the invention solves.
3. Classify the patent as a 'Method', 'System', 'Apparatus', or a combination.
4. Extract all technical components into a nested list. 
For each component, provide its name, its primary function, and a list of other components it is connected to.

Here is the text:
"""

sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.patent_knowledge_graph` AS (
  SELECT
    t.uri,
    t.invention_domain,
    t.problem_solved,
    t.patent_type,
    t.components
  FROM
    AI.GENERATE_TABLE(
      MODEL `{project_id}.patent_analysis.gemini_vision_analyzer`,
      (
        SELECT
          uri,
          CONCAT(
            '''{prompt_text}''',
            '\\n\\n',
            IFNULL(extracted_title_en, ''),
            '\\n\\n',
            IFNULL(extracted_abstract_en, ''),
            '\\n\\nDiagrams:\\n',
            IFNULL(ARRAY_TO_STRING(diagram_descriptions, '\\n'), '')
          ) AS prompt
        FROM
          `{project_id}.patent_analysis.ai_text_extraction`
        WHERE
          extracted_abstract_en IS NOT NULL
      ),
      STRUCT(
        '''{schema}''' AS output_schema
      )
    ) AS t
);
"""

print("Attempting to create the patent knowledge graph...")
job = client.query(sql_query)
try:
    job.result()
    print("✅ Success: The `patent_knowledge_graph` table was extended.")

    print("\nFetching a sample of 5 records from the table:")
    sql_select_sample_query = f"""
    SELECT 
    
        pkg.uri,
        pkg.invention_domain,
        pkg.problem_solved,
        pkg.patent_type,
        pkg.components
    
    FROM `{project_id}.patent_analysis.patent_knowledge_graph` AS pkg
    WHERE ARRAY_LENGTH(pkg.components) > 0 and pkg.invention_domain is not NULL
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display_styled_df(df_sample, title="Sample of 5 Records from the `patent_knowledge_graph` Table")

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to create the patent knowledge graph...
✅ Success: The `patent_knowledge_graph` table was extended.

Fetching a sample of 5 records from the table:


Unnamed: 0,uri,invention_domain,problem_solved,patent_type,components
0,gs://gcs-public-data--labeled-patents/espacenet_en83.pdf,Electronics,The invention provides an improved dual-stage low noise amplifier for efficient signal amplification in multiband receivers.,Apparatus,"[{'component_function': 'output first stage amplified voltage mode signals', 'component_name': 'plurality of first stage amplifiers', 'connected_to': array(['switch apparatus'], dtype=object)}  {'component_function': 'output amplified current mode signals', 'component_name': 'plurality of second stage amplifiers', 'connected_to': array(['switch apparatus'], dtype=object)}  {'component_function': 'connect selected second stage input ports to selected first stage output ports', 'component_name': 'switch apparatus', 'connected_to': array(['plurality of first stage amplifiers',  'plurality of second stage amplifiers'], dtype=object)} ]"
1,gs://gcs-public-data--labeled-patents/us_015.pdf,Telecommunications,The invention solves the problem of dynamically adjusting audio communication during a session based on detected anomalies in session data to improve the communication quality.,Apparatus,"[{'component_function': 'Collects session data, determines session context, detects anomalies in session data, and adjusts audio stream or device settings based on detected anomalies.', 'component_name': 'Telecommunications device', 'connected_to': array(['Remote telecommunications device', 'Audio stream', 'Session data'],  dtype=object)}  {'component_function': 'Participates in a communication session with the telecommunications device.', 'component_name': 'Remote telecommunications device', 'connected_to': array(['Telecommunications device'], dtype=object)}  {'component_function': 'Carries audio for a communication session and can be adjusted by the telecommunications device.', 'component_name': 'Audio stream', 'connected_to': array(['Telecommunications device'], dtype=object)}  {'component_function': 'Collected by the telecommunications device to determine session context and detect anomalies.', 'component_name': 'Session data', 'connected_to': array(['Telecommunications device'], dtype=object)}]"
2,gs://gcs-public-data--labeled-patents/espacenet_de77.pdf,Electrical Engineering,"The invention solves the problem of enabling efficient wireless data transmission to and from electronic components housed within a conductive, metallic enclosure, by integrating specifically designed slot-shaped recesses into the housing wall.",Apparatus,"[{'component_function': 'Receives electronic components; provides shielding; integrates access openings and slots for wireless transmission', 'component_name': 'Conductive, metallic housing', 'connected_to': array(['Electronic components', 'Flap, door or similar means',  'Group of adjacent slot-shaped recesses'], dtype=object)}  {'component_function': 'Closes access opening of the housing', 'component_name': 'Flap, door or similar means', 'connected_to': array(['Conductive, metallic housing'], dtype=object)}  {'component_function': 'Performs various electronic functions', 'component_name': 'Electronic components', 'connected_to': array(['Conductive, metallic housing',  'Measuring and measurement data transmission device'], dtype=object)}  {'component_function': 'Measures and transmits or receives data wirelessly', 'component_name': 'Measuring and measurement data transmission device', 'connected_to': array(['Electronic components', 'Excitation and feeding device'],  dtype=object)}  {'component_function': 'Arranged inside the housing in the area of the slots; excites and feeds the slots for wireless transmission', 'component_name': 'Excitation and feeding device', 'connected_to': array(['Measuring and measurement data transmission device',  'Group of adjacent slot-shaped recesses'], dtype=object)}  {'component_function': 'Introduced into the wall of the housing; adapted for wireless transmission at a selected frequency band', 'component_name': 'Group of adjacent slot-shaped recesses', 'connected_to': array(['Conductive, metallic housing', 'Excitation and feeding device'],  dtype=object)} ]"
3,gs://gcs-public-data--labeled-patents/espacenet_de49.pdf,Electrical Engineering,The invention solves the problem of monitoring the temperature of an electrical appliance supplied with energy via an adapter device.,Apparatus,"[{'component_function': 'Arranges connection between an electrical appliance and a socket, and includes a temperature sensor.', 'component_name': 'Adapter Device', 'connected_to': array(['Electrical appliance (12)', 'Socket',  'Remote temperature sensor (14)'], dtype=object)}  {'component_function': 'To be supplied with electrical energy.', 'component_name': 'Electrical appliance (12)', 'connected_to': array(['Adapter Device', 'Remote temperature sensor (14)'], dtype=object)}  {'component_function': 'Connected to an electrical energy source to provide power.', 'component_name': 'Socket', 'connected_to': array(['Adapter Device', 'Electrical energy source'], dtype=object)}  {'component_function': 'Detects the temperature of the electrical appliance.', 'component_name': 'Remote temperature sensor (14)', 'connected_to': array(['Adapter Device', 'Electrical appliance (12)'], dtype=object)}  {'component_function': 'Provides electrical energy.', 'component_name': 'Electrical energy source', 'connected_to': array(['Socket'], dtype=object)}]"
4,gs://gcs-public-data--labeled-patents/espacenet_en22.pdf,Electrochemical Cells,Facilitating electrochemical energy storage and conversion.,Apparatus,"[{'component_function': 'Provides an electrical pathway for electrons to and from the anode active material.', 'component_name': 'Anode current collector', 'connected_to': array(['Anode active material layer', 'Electrons (e-)'], dtype=object)}  {'component_function': 'The negative electrode material where lithium ions are stored and released during discharge.', 'component_name': 'Anode active material layer', 'connected_to': array(['Anode current collector', 'Separator', 'Lithium ions (Li+)',  'Electrons (e-)'], dtype=object)}  {'component_function': 'A porous membrane that physically separates the anode and cathode layers, preventing short circuits, while allowing the passage of lithium ions.', 'component_name': 'Separator', 'connected_to': array(['Anode active material layer', 'Cathode active material layer',  'Lithium ions (Li+)'], dtype=object)}  {'component_function': 'The positive electrode material where lithium ions are stored and released during charging/discharging.', 'component_name': 'Cathode active material layer', 'connected_to': array(['Separator', 'Cathode current collector', 'Lithium ions (Li+)',  'Electrons (e-)'], dtype=object)}  {'component_function': 'Provides an electrical pathway for electrons to and from the cathode active material.', 'component_name': 'Cathode current collector', 'connected_to': array(['Cathode active material layer', 'Electrons (e-)'], dtype=object)}  {'component_function': 'Charge carriers moving from the anode through the separator to the cathode during discharge.', 'component_name': 'Lithium ions (Li+)', 'connected_to': array(['Anode active material layer', 'Separator',  'Cathode active material layer'], dtype=object)}  {'component_function': 'Charge carriers moving from the anode to the cathode via the current collectors during discharge.', 'component_name': 'Electrons (e-)', 'connected_to': array(['Anode current collector', 'Anode active material layer',  'Cathode current collector', 'Cathode active material layer'],  dtype=object)} ]"


## Visualization

### Strategic Patent Portfolio Analysis

This section visualizes the data from our knowledge graph to uncover quantifiable insights about market trends and technical architecture.

----
### Chart 1: Patent Filing Trends by Technical Domain
It allows us to visually track market trends and identify which technology sectors experienced the most significant 

In [10]:
# This query creates a time-series of patent filings per domain.
sql_query = f"""
SELECT
  EXTRACT(YEAR FROM SAFE.PARSE_DATE('%d.%m.%Y', T1.filed_date)) AS filing_year,
  T2.invention_domain,
  COUNT(T1.uri) AS patent_count
FROM
  `{project_id}.patent_analysis.ai_text_extraction` AS T1
JOIN
  `{project_id}.patent_analysis.patent_knowledge_graph` AS T2
ON
  T1.uri = T2.uri
WHERE
  T1.filed_date IS NOT NULL
  AND T2.invention_domain IS NOT NULL
GROUP BY
  filing_year,
  invention_domain
ORDER BY
  filing_year,
  patent_count DESC;
"""

df_timeseries = client.query(sql_query).to_dataframe()

# Find the top 7 domains with the most patents overall.
top_domains = df_timeseries.groupby('invention_domain')['patent_count'].sum().sort_values(ascending=False).head(7).index.tolist()

# Group all other domains into a single "Other" category.
df_timeseries['display_domain'] = df_timeseries['invention_domain'].apply(
    lambda x: x if x in top_domains else 'Other'
)

# Aggregate the counts for the new display domains.
df_chart_data = df_timeseries.groupby(['filing_year', 'display_domain'])['patent_count'].sum().reset_index()


# Create the Interactive Stacked Area Chart.
fig = px.area(
    df_chart_data,
    x="filing_year",
    y="patent_count",
    color="display_domain",
    title="<b>Patent Filing Trends by Technical Domain Over Time</b>",
    labels={
        "filing_year": "Year of Filing",
        "patent_count": "Number of Patents Filed",
        "display_domain": "Invention Domain"
    },
    # Use a color scale that is easy to distinguish
    color_discrete_sequence=px.colors.qualitative.Vivid
)

# Customize the layout for a readable look
fig.update_layout(
    xaxis_title="<b>Year ➡️</b>",
    yaxis_title="<b>Annual Patent Count ⬆️</b>",
    legend_title="<b>Top Invention Domains</b>",
    font=dict(family="Arial, sans-serif", size=12)
)

fig.show()


BigQuery Storage module not found, fetch data with the REST endpoint instead.



# Patent Insights with SQL Analysis

## What did we build?

Now that we have transformed the unstructured patent data into a structured Knowledge Graph, we can finally ask it complex questions.

## Why is this important?
- This is the payoff. 
- We will run queries that are impossible to perform on the original text.
- Uncovering quantifiable insights about:
  - Invention complexity
  - Common design patterns across the entire dataset
- Proves the value of the data transformation pipeline.

## What will we find?
We will perform two types of analysis:

1. **Quantitative Analysis**:
   - Compare the average number of components across different technical domains
   - Measure and rank their complexity

2. **Architectural Pattern Mining**:
   - `UNNEST` the component data
   - Finds the most common "building blocks" and design patterns connected to any component we choose.

In [12]:
# Quantitative Analysis.

sql_query = f"""
    SELECT
      invention_domain,
      COUNT(uri) AS number_of_patents,
      ROUND(AVG(ARRAY_LENGTH(components)), 2) AS average_components,
      MIN(ARRAY_LENGTH(components)) AS min_components,
      MAX(ARRAY_LENGTH(components)) AS max_components
    FROM
      `{project_id}.patent_analysis.patent_knowledge_graph`
    WHERE
      ARRAY_LENGTH(components) > 0
    GROUP BY
      invention_domain
    ORDER BY
      average_components DESC;
"""

job = client.query(sql_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED: The query failed. Error:\n\n{e}")

df = job.to_dataframe()
df[df['number_of_patents'] >= 4]


BigQuery Storage module not found, fetch data with the REST endpoint instead.



Unnamed: 0,invention_domain,number_of_patents,average_components,min_components,max_components
48,Optoelectronics,4,8.0,5,10
65,Blockchain Technology,6,7.17,4,11
90,Electrical Engineering,5,6.2,4,9
121,Telecommunications,70,5.96,1,21
122,Image Processing,4,5.75,4,9
125,Information Technology,7,5.43,2,9
126,Wireless Communication,7,5.43,3,11
127,Medical Devices,24,5.42,2,10
130,Biotechnology,4,5.0,1,10
150,Wireless Communications,4,4.25,2,8


In [11]:
# Architectural Pattern Mining.

searching_topic = "server"

sql_query = f"""
    WITH
      patent_components AS (
        SELECT
          t.uri,
          c.component_name,
          c.connected_to
        FROM
          `{project_id}.patent_analysis.patent_knowledge_graph` AS t,
          UNNEST(t.components) AS c
      ),
    
      component_connections AS (
        SELECT
          pc.uri,
          pc.component_name,
          connected_component
        FROM
          patent_components AS pc,
          UNNEST(pc.connected_to) AS connected_component
      )
      
    SELECT
      connected_component,
      COUNT(connected_component) AS connection_count
    FROM
      component_connections
    WHERE
      REGEXP_CONTAINS(component_name, r'(?i){searching_topic}')
    GROUP BY
      connected_component
    ORDER BY
      connection_count DESC
    LIMIT 10;
"""

job = client.query(sql_query)
try:
    job.result()
except Exception as e:
    print(f"❌ FAILED: The query failed. Error:\n\n{e}")

df = job.to_dataframe()
df.head(5)


BigQuery Storage module not found, fetch data with the REST endpoint instead.



Unnamed: 0,connected_component,connection_count
0,User Device,3
1,Policy Evaluation Server,2
2,Boolean circuit,2
3,Client device,2
4,Network,2


# Patent Search Engine

## What did we build?
A powerful semantic search engine that finds specific technical components based on a natural language description of their function.

## Why is this important?
- Standard search finds keywords. This search finds meaning.
- By combining two different vector embeddings, the engine understands patent's components and the technical context in which it operates.
- This allows an engineer to find a "valve for precise fluid delivery" and get results from relevant medical patents, not car engine patents.

## How did we do it?
The process involves three key stages, all performed within BigQuery:

1. **Dual Embeddings**:
   - We first generate two separate vector embeddings:
     - One for the high-level patent context (title, abstract, domain, diagrams)
     - Another for the specific component's function

2. **Vector Combination**:
   - We then create a custom User-Defined Function (UDF) to mathematically average these two vectors.
   - This creates a single, final vector for each component that is rich with both specific and contextual meaning.

3. **Semantic Search**:
   - Finally, we use the `VECTOR_SEARCH` function to compare a user's query against these combined vectors.
   - Returns the most similar components from the entire dataset.


In [20]:
# This query creates a flat table of all components from all patents.
sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.patent_components_flat` AS (
  SELECT
    t.uri,
    t.invention_domain,
    c.component_name,
    c.component_function,
    c.connected_to
  FROM
    `{project_id}.patent_analysis.patent_knowledge_graph` AS t,
    UNNEST(t.components) AS c
  WHERE
    c.component_function IS NOT NULL
    AND c.component_name IS NOT NULL
);
"""

print("Attempting to create the flattened components table...")
job = client.query(sql_query)
try:
    job.result()
    print("✅ Success: The `patent_components_flat` table was created.")

    print("\nFetching a sample of 5 records from the new table:")
    sql_select_sample_query = f"""
    SELECT * FROM `{project_id}.patent_analysis.patent_components_flat` 
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display(df_sample)

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to create the flattened components table...
✅ Success: The `patent_components_flat` table was created.

Fetching a sample of 5 records from the new table:



BigQuery Storage module not found, fetch data with the REST endpoint instead.



Unnamed: 0,uri,invention_domain,component_name,component_function,connected_to
0,gs://gcs-public-data--labeled-patents/espacenet_en85.pdf,Microfluidics,Pump chamber or actuation chamber (120),Whose volume is changed by the deformation of the membrane (114).,"[Flexible, deformable membrane or diaphragm (114), Lower substrate or base layer (118), Inlet or outlet channel (122), Another inlet or outlet channel (124)]"
1,gs://gcs-public-data--labeled-patents/med_tech_5.pdf,Medical Devices,Bone fragments,Parts of a fractured bone that are brought together and stabilized,"[Bone plate, Compression screws]"
2,gs://gcs-public-data--labeled-patents/espacenet_de73.pdf,Electrical Engineering,device for inductive energy transfer,performs inductive energy transfer,"[induction elements, magnetic flux bundling unit]"
3,gs://gcs-public-data--labeled-patents/espacenet_en33.pdf,Chemical Engineering,Insulation,"Reduces heat loss or gain, maintaining desired temperature.","[Heating/Cooling Jacket, Main Vessel/Column]"
4,gs://gcs-public-data--labeled-patents/espacenet_en75.pdf,Audio Technology,Audio drivers,Outputting audio and distributing sound evenly throughout the room,"[Array speaker, Speaker housing]"


In [19]:
# This query creates a single context vector for each patent, reading from ai_text_extraction table.
sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.patent_context_embeddings` AS (
  SELECT
    t.uri,
    t.ml_generate_embedding_result AS patent_context_vector
  FROM
    ML.GENERATE_EMBEDDING(
      MODEL `{project_id}.patent_analysis.embedding_model`,
      (
        SELECT
          uri,
          CONCAT(
            'Represent this technical patent for semantic search: \\n\\n', 
            'Patent Title: ', IFNULL(extracted_title_en, ''), '\\n\\n',
            'Applicant: ', IFNULL(applican, ''), '\\n\\n',
            'International Classification: ', IFNULL(class_international, ''), '\\n\\n',
            'Abstract: ', IFNULL(extracted_abstract_en, ''), '\\n\\n',
            'Diagram Descriptions: ', IFNULL(ARRAY_TO_STRING(diagram_descriptions, '\\n'), '')
          ) AS content
        FROM
          `{project_id}.patent_analysis.ai_text_extraction`
        WHERE
          extracted_title_en IS NOT NULL
      )
    ) AS t
);
"""

print("Attempting to create the patent context embeddings table...")
job = client.query(sql_query)
try:
    job.result() 
    print("✅ Success: The `patent_context_embeddings` table was created.")

    print("\nFetching a sample of 5 records from the new table:")
    sql_select_sample_query = f"""
    SELECT 
        uri, 
        ARRAY_LENGTH(patent_context_vector) as vector_dimensions 
    FROM `{project_id}.patent_analysis.patent_context_embeddings` 
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display(df_sample)

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to create the patent context embeddings table...
✅ Success: The `patent_context_embeddings` table was created.

Fetching a sample of 5 records from the new table:



BigQuery Storage module not found, fetch data with the REST endpoint instead.



Unnamed: 0,uri,vector_dimensions
0,gs://gcs-public-data--labeled-patents/espacenet_en9.pdf,3072
1,gs://gcs-public-data--labeled-patents/espacenet_en40.pdf,3072
2,gs://gcs-public-data--labeled-patents/espacenet_en44.pdf,3072
3,gs://gcs-public-data--labeled-patents/espacenet_fr19.pdf,3072
4,gs://gcs-public-data--labeled-patents/us_092.pdf,3072


In [21]:
# This query creates a single specific function vector for each individual component.
sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.component_function_embeddings` AS (
  SELECT
    t.uri,
    t.component_name,
    t.ml_generate_embedding_result AS component_function_vector
  FROM
    ML.GENERATE_EMBEDDING(
      MODEL `{project_id}.patent_analysis.embedding_model`,
      (
        SELECT
          uri,
          component_name,
          CONCAT(
            'Represent this technical patent for semantic search: \\n\\n',
            'A component named "', component_name, '" whose function is to ', component_function
          ) AS content
        FROM
          `{project_id}.patent_analysis.patent_components_flat`
      )
    ) AS t
);
"""

print("Attempting to create the component function embeddings table...")
job = client.query(sql_query)
try:
    job.result()
    print("✅ Success: The `component_function_embeddings` table was created.")

    print("\nFetching a sample of 5 records from the new table:")
    sql_select_sample_query = f"""
    SELECT 
        uri, 
        component_name,
        ARRAY_LENGTH(component_function_vector) as vector_dimensions 
    FROM `{project_id}.patent_analysis.component_function_embeddings` 
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display(df_sample)

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to create the component function embeddings table...
✅ Success: The `component_function_embeddings` table was created.

Fetching a sample of 5 records from the new table:



BigQuery Storage module not found, fetch data with the REST endpoint instead.



Unnamed: 0,uri,component_name,vector_dimensions
0,gs://gcs-public-data--labeled-patents/computer_vision_10.pdf,image sensor,3072
1,gs://gcs-public-data--labeled-patents/computer_vision_10.pdf,second output port,3072
2,gs://gcs-public-data--labeled-patents/computer_vision_10.pdf,first output port,3072
3,gs://gcs-public-data--labeled-patents/computer_vision_11.pdf,Module,3072
4,gs://gcs-public-data--labeled-patents/computer_vision_11.pdf,Processor,3072


In [11]:
# Normalization

def normalize_and_save_vectors(
    table_id: str,
    vector_column: str,
    client: bigquery.Client
):
    """
   Normalizes a vector column in a BigQuery table in-place by replacing
    the table with its normalized version.

    Args:
        table_id: The full ID of the table to update (e.g., "project.dataset.table").
        vector_column: The name of the column containing the vectors to normalize.
        client: An authenticated BigQuery client object.
    """


    # This SQL query selects all original columns and replaces the vector
    # column with its normalized version.
    sql_query = f"""
    CREATE OR REPLACE TABLE `{table_id}` AS (
      SELECT
        * EXCEPT({vector_column}),
        `{client.project}.patent_analysis.L2_NORMALIZE`({vector_column}) AS {vector_column}
      FROM
        `{table_id}`
    );
    """

    try:
        # Execute the query.
        job = client.query(sql_query)
        job.result()
    except Exception as e:
        print(f"❌ FAILED: An error occurred during normalization. Error:\n\n{e}")


# 1. Normalize the patent context embeddings.
print("--- Normalizing Patent Context Vectors ---")
normalize_and_save_vectors(
   table_id=f"{project_id}.patent_analysis.patent_context_embeddings",
   vector_column="patent_context_vector",
   client=client
)

# 2. Normalize the component function embeddings.
print("\n--- Normalizing Component Function Vectors ---")
normalize_and_save_vectors(
   table_id=f"{project_id}.patent_analysis.component_function_embeddings",
   vector_column="component_function_vector",
   client=client
)

print("\n--- Fetching a Diverse Sample of 5 Unique Patents ---")

# This query uses QUALIFY to get one component from 5 different patents.
sql_select_sample = f"""
SELECT
    uri,
    component_name,
    ARRAY_LENGTH(component_function_vector) as vector_dimensions
FROM
    `{project_id}.patent_analysis.component_function_embeddings`
QUALIFY
    ROW_NUMBER() OVER(PARTITION BY uri ORDER BY RAND()) = 1
LIMIT 5;
"""

try:
    df_diverse_sample = client.query(sql_select_sample).to_dataframe()
    display(df_diverse_sample)
except Exception as e:
    print(f"❌ FAILED to fetch a diverse sample. Error:\n\n{e}")


--- Fetching a Diverse Sample of 5 Unique Patents ---




Unnamed: 0,uri,component_name,vector_dimensions
0,gs://gcs-public-data--labeled-patents/computer_vision_15.pdf,user,3072
1,gs://gcs-public-data--labeled-patents/espacenet_en28.pdf,Electrochemical Cell (100),3072
2,gs://gcs-public-data--labeled-patents/espacenet_en47.pdf,Ultra-capacitor (533),3072
3,gs://gcs-public-data--labeled-patents/espacenet_fr57.pdf,Camera,3072
4,gs://gcs-public-data--labeled-patents/us_034.pdf,Common Data Channel,3072


In [13]:
# This query rebuilds the search index using the UDF - weighted average function.
sql_query = f"""
CREATE OR REPLACE TABLE `{project_id}.patent_analysis.component_search_index` AS (
  SELECT
    flat.uri,
    flat.component_name,
    flat.component_function,
    -- Call our new UDF with the desired weights.
    `{project_id}.patent_analysis.VECTOR_WEIGHTED_AVG`(
      func.component_function_vector, 0.7, -- 70% weight to the function
      ctx.patent_context_vector, 0.3      -- 30% weight to the context
    ) AS combined_vector
  FROM
    `{project_id}.patent_analysis.patent_components_flat` AS flat
  JOIN
    `{project_id}.patent_analysis.patent_context_embeddings` AS ctx
  ON
    flat.uri = ctx.uri
  JOIN
    `{project_id}.patent_analysis.component_function_embeddings` AS func
  ON
    flat.uri = func.uri AND flat.component_name = func.component_name
);
"""

print("Attempting to create the final component search index table...")
job = client.query(sql_query)
try:
    job.result()
    print("✅ Success: The `component_search_index` table was created.")

    print("\nFetching a diverse sample of 5 records from the new table:")
    sql_select_sample_query = f"""
    SELECT
        uri,
        component_name,
        ARRAY_LENGTH(combined_vector) as vector_dimensions
    FROM
        `{project_id}.patent_analysis.component_search_index`
    QUALIFY
        ROW_NUMBER() OVER(PARTITION BY uri ORDER BY RAND()) = 1
    LIMIT 5;
    """
    
    df_sample = client.query(sql_select_sample_query).to_dataframe()
    display(df_sample)

except Exception as e:
    print(f"❌ FAILED: An error occurred. Error:\n\n{e}")

Attempting to create the final component search index table...
✅ Success: The `component_search_index` table was created.

Fetching a diverse sample of 5 records from the new table:




Unnamed: 0,uri,component_name,vector_dimensions
0,gs://gcs-public-data--labeled-patents/espacenet_de71.pdf,virtual positive form,3072
1,gs://gcs-public-data--labeled-patents/espacenet_en77.pdf,100: Overall device or cartridge assembly,3072
2,gs://gcs-public-data--labeled-patents/med_tech_1.pdf,indicator,3072
3,gs://gcs-public-data--labeled-patents/us_049.pdf,Second Time Interval,3072
4,gs://gcs-public-data--labeled-patents/espacenet_fr28.pdf,textual message,3072


In [7]:
# --- 1. AI Gatekeeper Function ---
# (This function remains the same)
def is_query_technical(search_query: str, client: bigquery.Client) -> bool:
    safe_search_query = search_query.replace("'", "\\'")
    classification_prompt = f"Is the following user query related to a technical, scientific, or engineering topic? Answer with only 'Yes' or 'No'. Query: {safe_search_query}"
    sql_query = f"""
    SELECT ml_generate_text_llm_result
    FROM ML.GENERATE_TEXT(
        MODEL `{client.project}.patent_analysis.gemini_vision_analyzer`,
        (SELECT '''{classification_prompt}''' AS prompt),
        STRUCT(0.0 AS temperature, TRUE AS flatten_json_output, 1024 AS max_output_tokens)
    )
    """
    try:
        query_job = client.query(sql_query)
        results = query_job.result()
        for row in results:
            response = row.ml_generate_text_llm_result.strip().lower()
            if "yes" in response:
                return True
        return False
    except Exception as e:
        print(f"Error during query classification: {e}")
        return False

# --- 2. NEW Results Styling & Grouping Function ---
def style_and_group_results(results_df: pd.DataFrame, search_query: str, top_n_patents=5):
    """
    Takes a large DataFrame of component results, finds the top N unique patents,
    and returns a styled, grouped HTML table.
    """
    if results_df.empty:
        return "<p>⚠️ No relevant technical components found for this query.</p>"

    # --- This is the new diversification logic ---
    # 1. Find the best (lowest) distance score for each patent.
    top_patents_df = results_df.loc[results_df.groupby('uri')['distance'].idxmin()]
    # 2. Sort the patents by this best score and select the top N.
    top_uris = top_patents_df.sort_values('distance', ascending=True).head(top_n_patents)['uri'].tolist()
    # 3. Filter the original results to only include components from these top patents.
    final_df = results_df[results_df['uri'].isin(top_uris)].copy()
    # -----------------------------------------

    # --- Generate Grouped HTML Output ---
    html = f"<h3>Top {len(top_uris)} Patent Matches for: '{search_query}'</h3>"
    html += "<div style='font-family: Arial, sans-serif;'>"

    for uri in top_uris:
        patent_df = final_df[final_df['uri'] == uri].sort_values('distance', ascending=True)
        if not patent_df.empty:
            short_uri = uri.split("/")[-1]
            html += f"<h4 style='margin-top: 20px; margin-bottom: 5px; background-color: #333; color: white; padding: 5px; border-radius: 3px;'>"
            html += f"Patent: <a href='{uri}' target='_blank' style='color: #8ab4f8;'>{short_uri}</a></h4>"
            
            # Create a simple table for the components within this patent
            html += "<table style='width: 100%; border-collapse: collapse;'>"
            html += "<tr><th style='width: 30%; text-align: left; padding: 8px;'>Component Name</th>"
            html += "<th style='width: 55%; text-align: left; padding: 8px;'>Component Function</th>"
            html += "<th style='width: 15%; text-align: left; padding: 8px;'>Distance</th></tr>"

            for index, row in patent_df.iterrows():
                distance_str = f"{row['distance']:.4f}"
                html += f"<tr style='background-color: #222;'><td style='padding: 8px;'>{row['component_name']}</td>"
                html += f"<td style='padding: 8px;'>{row['component_function']}</td>"
                html += f"<td style='padding: 8px;'>{distance_str}</td></tr>"
            html += "</table>"

    html += "</div>"
    return html

# --- 3. Main Search Logic Function (Updated) ---
def handle_search_request(search_query: str, client: bigquery.Client, distance_threshold=0.8):
    """
    Orchestrates the search: classification, fetching a large pool, and styling/grouping.
    """
    if not is_query_technical(search_query, client):
        return "<p>⚠️ Query is not technical. Please enter a query related to a technical component or function.</p>"

    # --- THIS IS THE KEY CHANGE: Fetch a larger pool of candidates (top_k => 50) ---
    sql_query = f"""
    WITH search_results AS (
      SELECT
        base.uri, base.component_name, base.component_function, distance
      FROM
        VECTOR_SEARCH(
          TABLE `{client.project}.patent_analysis.component_search_index`,
          'combined_vector',
          (
            SELECT ml_generate_embedding_result
            FROM ML.GENERATE_EMBEDDING(
              MODEL `{client.project}.patent_analysis.embedding_model`,
              (SELECT CONCAT('Represent this technical patent component for semantic search: ', '{search_query}') AS content)
            )
          ),
          top_k => 50, -- Fetch 50 candidates
          distance_type => 'COSINE'
        )
    )
    SELECT * FROM search_results WHERE distance < {distance_threshold};
    """
    
    try:
        df = client.query(sql_query).to_dataframe()
        # Call the new styling and grouping function
        return style_and_group_results(df, search_query)
    except Exception as e:
        return f"<p>❌ FAILED: The vector search failed. Error: {e}</p>"

# --- 4. UI Setup and Event Handling ---
# (This part remains the same)
search_input = widgets.Text(value='a device for processing data', placeholder='Describe a technical function...', description='Search Query:', layout=Layout(width='80%'))
search_button = widgets.Button(description='Find Similar Components', button_style='success', icon='search')
output_area = widgets.HTML(value="<p>Enter a query and click the button to see results.</p>")

def on_button_clicked(b):
    output_area.value = "<em>Classifying query and searching...</em>"
    search_query = search_input.value
    html_result = handle_search_request(search_query, client)
    output_area.value = html_result

search_button.on_click(on_button_clicked)

display(widgets.VBox([search_input, search_button, output_area]))

VBox(children=(Text(value='a device for processing data', description='Search Query:', layout=Layout(width='80…