In [1]:
import os
import sys
import json
from kaggle_secrets import UserSecretsClient
from google.oauth2 import service_account
os.chdir('/kaggle/input/patent-art')
sys.path.append('/kaggle/input/patent-art')

In [2]:
!uv pip install --no-cache-dir -r requirements-kaggle.txt &> /dev/null

In [3]:
SERVICE_ACCOUNT_PATH = "/tmp/service_account.json"


def get_credentials(
    secret_name: str = "gcp_service_account",
) -> tuple[UserSecretsClient, service_account.Credentials]:
    """Fetch GCP credentials from Kaggle secrets and export pipeline settings as env vars."""
    user_secrets = UserSecretsClient()
    service_account_json = user_secrets.get_secret(secret_name)

    # Persist the key so libraries that read GOOGLE_APPLICATION_CREDENTIALS can find it.
    with open(SERVICE_ACCOUNT_PATH, "w") as f:
        f.write(service_account_json)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = SERVICE_ACCOUNT_PATH
    os.environ["service_account_path"] = SERVICE_ACCOUNT_PATH

    # Parse the key once; it provides both the credentials and the project id.
    credentials_info = json.loads(service_account_json)
    credentials = service_account.Credentials.from_service_account_info(credentials_info)
    os.environ["project_id"] = credentials_info["project_id"]

    # The remaining settings are stored as individual Kaggle secrets.
    for name in ("dataset_id", "publication_table", "small_model_id", "embedding_table", "hf_token"):
        os.environ[name] = user_secrets.get_secret(name)

    return user_secrets, credentials


user_secrets, credentials = get_credentials()

In [4]:
from src.kaggle.patent_dashboard_chart_functions import create_patent_dashboard_demo
from run_patent_search_pipeline import run_semantic_search_pipeline
from src.kaggle.kaggle_patent_search_demo import demo_search_interface
from src.kaggle.kaggle_patent_chart_metrics import (
    create_latency_visualization,
    create_bigquery_visualization,
    create_discoverability_visualization,
)

2025-09-12 16:29:11.751243: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1757694551.980170      13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757694552.049953      13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


<h1>Patent Intelligence: Semantic Search for Innovation Discovery</h1>

<h2>Patent Intelligence: Semantic Search for Innovation Discovery Using BigQuery AI</h2>

## Problem Statement

Global patent publication keeps growing: 3.6M applications were filed in 2023, a 2.7% year-over-year increase ([WIPO 2024](https://www.wipo.int/edocs/pubdocs/en/wipo-pub-943-2024-en-wipo-ip-facts-and-figures-2024.pdf)). 68.7% of global patent applications in 2023 originated from China (WIPO 2024), underscoring China's dominance. Globally, patent filings increased by 18.1% between 2019 and 2023, indicating accelerating innovation; this growth creates significant challenges for innovation research and competitive intelligence. Our analysis of Google's patent publication corpus reveals that English-language patent publication peaked in 2020-2021 and declined 5.6% through 2023, with partial recovery in 2024.<br>
Traditional keyword-based patent search systems fail to capture semantic relationships between technologies, forcing researchers to spend excessive time on manual review and potentially miss relevant prior art. With 2.9M English-language patents in our 2024 dataset (14.4GB), covering the first half of the year, there is a clear need for semantic search capabilities that understand technological concepts beyond surface-level keyword matching.

## Impact Statement

Our solution demonstrates semantic patent discovery across 2.9M patents from the first half of 2024. The system combines optimized embedding generation with BigQuery's vector search capabilities to enable semantic similarity search over patent documents. By implementing production-ready optimizations, including partition pruning and intelligent clustering, we show how semantic search can complement traditional keyword approaches in patent research workflows, potentially helping organizations better navigate the growing patent landscape.

## Architectural Diagram & Decisions

**Hybrid Processing Strategy:** The combination of performance constraints and platform limitations led to a multi-stage architecture:
1. Offline Bulk Processing: Kaggle GPU for embedding generation (6.5 hours for 2.9M patents)
2. Real-time Query Processing: CPU-based query embeddings (~500ms per query)
3. Production-Ready Optimization: BigQuery partitioning and clustering for cost efficiency
4. Edition-Aware Design: Index-ready architecture that can utilize vector indexing when deployed on Enterprise+ tiers
5. Streamlit Cloud Demo: Easily shareable interactive dashboard enabling non-technical stakeholders to experience semantic patent discovery with explainability, leveraging real-time BigQuery AI integration for enterprise adoption validation

```mermaid
graph TB
    %% Data Sources and Filtering
    A[Google Patents Public Dataset<br/>2.6TB - Global Publications] --> B{Multi-Stage Filtering}
    B --> |publication_date: 2017-01 to 2025-02<br/>LENGTH title_en > 0<br/>LENGTH abstract_en > 0<br/>English Language Publications| C[Filtered Enterprise Dataset<br/>49M+ English Patents]
    
    %% Table Optimization
    C --> D[BigQuery Table Optimization<br/>patents_2017_2025_en]
    D --> E[PARTITION BY publication_date<br/>CLUSTER BY publication_number, country_code<br/>Monthly Partitions for Cost Optimization]
    
    %% Subset Selection for Demo
    E --> F{Demo Subset Selection}
    F --> |Focus: 2024-01 to 2024-06<br/>High Quality Patents: title_en >= 30<br/>AND abstract_en >= 100| G[Working Dataset<br/>2.9M Patents - 6 Months]
    
    %% Embedding Generation Pipeline
    G --> H[Patent Text Extraction<br/>title_en + abstract_en]
    H --> I[Sentence Transformers<br/>all-MiniLM-L6-v2<br/>384-dimensional vectors]
    I --> |56x faster than<br/>ML.GENERATE_EMBEDDING<br/>6.5 hours vs 15 days| J[Vector Embeddings<br/>2.9M × 384 dimensions]
    
    %% BigQuery AI Integration
    J --> K[(BigQuery Embeddings Table<br/>patent_embeddings_local)]
    K --> L[Table Optimization<br/>PARTITION BY publication_date<br/>CLUSTER BY publication_number, country_code]
    L --> M[CREATE VECTOR INDEX<br/>IVF Indexing with STORING]
    
    %% Search Pipeline
    N[User Query<br/>Natural Language] --> O[Query Embedding<br/>Sentence Transformers]
    O --> P[BigQuery AI VECTOR_SEARCH<br/>With Partition Pruning]
    P --> Q[Semantic Similarity<br/>Cosine Distance + Filtering]
    
    %% Optimized Storage and Processing
    M --> P
    E --> |Partition Pruning<br/>80-85% Cost Reduction| P
    Q --> R[Top-K Similar Patents<br/>- Document Similarity Ranking<br/>- Sentence-Level Explainability]
    
    %% User Interfaces - Split into two paths
    R --> S1[Streamlit Cloud Demo<br/>- Interactive Dashboard<br/>- Real-time Search & Explainability<br/>- Performance Metrics<br/>- User-Friendly Interface]
    R --> S2[Kaggle Notebook<br/>- Technical Implementation<br/>- Code Demonstration<br/>- Static Demo<br/>- Performance Analysis]
    
    %% Performance Monitoring
    U[Performance Metrics] --> V[Query Latency: <5s<br/>Partition Efficiency: 80-85% cost reduction<br/>Discoverability: 98%+<br/>Scalability: 49M+ patents]
    P --> U
    
    %% Architecture Benefits
    W[Production Optimizations] --> X[Monthly Partitioning<br/>Smart Clustering<br/>Vector Indexing<br/>Hybrid Architecture<br/>Enterprise Scalability]
    
    %% Data Flow Annotations
    Y[Data Scale Progression] --> Z[2.6TB → 49M patents → 14.4GB → 2.9M subset<br/>Production filtering → Optimized storage → Demo+Metrics focus]
    
    %% Demo Platform Comparison
    AA[Demo Platform Benefits] --> BB[Streamlit: Interactive UI, Local Control<br/>Kaggle: Cloud Performance, GPU Processing<br/>BigQuery Proximity, Production Metrics]
    
    %% Styling
    classDef dataSource fill:#e1f5fe
    classDef filtering fill:#f3e5f5
    classDef optimization fill:#fff3e0
    classDef processing fill:#f3e5f5
    classDef bigqueryAI fill:#fff3e0
    classDef userInterface fill:#e8f5e8
    classDef metrics fill:#fff8e1
    
    class A,C dataSource
    class B,F filtering
    class D,E,K,L,M optimization
    class H,I,J processing
    class O,P,Q bigqueryAI
    class N,S1,S2 userInterface
    class U,V,W,X,Y,Z,AA,BB metrics
```

## Data Pipeline Overview

This notebook demonstrates the core BigQuery AI semantic search functionality. 
The complete pipeline involved:
1. Raw dataset filtering (2.6TB → 49M patents → 2.9M subset)
2. Embedding generation (6.5 hours on Kaggle GPU)
3. BigQuery AI vector search implementation (shown below)

Due to compute costs, the preprocessing steps were executed offline. See the <a id="Assets"></a>[Repo](#Assets) for the source code.
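As a rough illustration of the filtering and table optimization step, the sketch below builds a `CREATE TABLE` statement that filters, partitions by month, and clusters in one pass. The project, dataset, and source table names are hypothetical, and the sketch assumes a flattened source table in which `publication_date` is a `DATE` column and `title_en` / `abstract_en` are plain strings; the actual preprocessing code lives in the repository and may differ.

```python
# Hypothetical names: adjust project, dataset, and columns to your own setup.
SOURCE_TABLE = "my-project.patents.publications_flat"  # assumed flattened source table
TARGET_TABLE = "my-project.patents.patents_2017_2025_en"


def build_filtered_table_ddl(source: str, target: str) -> str:
    """Return a CREATE TABLE statement that filters and partitions in one pass."""
    return f"""
CREATE OR REPLACE TABLE `{target}`
PARTITION BY DATE_TRUNC(publication_date, MONTH)
CLUSTER BY publication_number, country_code AS
SELECT *
FROM `{source}`
WHERE publication_date BETWEEN DATE '2017-01-01' AND DATE '2025-02-28'
  AND LENGTH(title_en) > 0
  AND LENGTH(abstract_en) > 0
""".strip()


ddl = build_filtered_table_ddl(SOURCE_TABLE, TARGET_TABLE)
print(ddl)
```

Monthly partitions keep each date-filtered query from scanning the whole table, which is the basis for the partition pruning savings reported later in the notebook.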

# Dashboard Insights: Global Patent Landscape Analysis

Our comprehensive analysis of Google's patent publications dataset (49M patents across 59 countries) reveals significant trends in global innovation patterns and data quality characteristics.<br>
The following analysis describes insights drawn from the 9 charts shown below.

### Dataset Overview

- **Total Patents Analyzed:** 49,184,081 patents<br>
- **Geographic Coverage:** 59 countries<br>
- **Patent Families:** 35,182,526 unique families<br>
- **Average Title Length:** 59.3 characters<br>
- **Average Abstract Length:** 1,007.4 characters

### Dataset Quality Assessment

- **Title Completeness:** 100% (all patents have titles)<br>
- **Abstract Completeness:** 100% (comprehensive abstract coverage)<br>
- **Claims Completeness:** 12.1% (indicating potential data enrichment opportunities)

### Geographic Innovation Patterns

#### Publication Volume Leaders:

- **China:** 71% of global patent publications (2017-2025)
- **United States:** 12% of global publications
- China maintains dominance across the timeline, with the US leading in early 2025. Together, these two countries dominate patent publication

#### Innovation Quality Indicators:

- **US Patents:** 72% citation rate, indicating higher research impact<br>
- **Chinese Patents:** 36% citation rate, suggesting a volume-focused strategy<br>
This disparity highlights different national innovation approaches and the value of semantic search for discovering high-quality innovations regardless of citation patterns.

#### Leading Technology Areas:

- **Energy Storage & Batteries:**  Highest publication volume
- **Human Necessities:** Broad application patents
- **Neural Networks & Machine Learning:** Emerging technology focus

#### Technology Convergence:

- **Physics & Electricity:** 197K publications show strong interdisciplinary innovation
- **Examples:** Smart medical devices, battery storage, electric vehicles, renewable energy systems
- Cross-domain patent activity indicates increasing technology integration

#### Temporal Trends and Growth Patterns

Overall Growth: 6% CAGR since 2017, indicating sustained innovation investment

Concerning Patterns:

- Top 10 countries show publication decline since 2020-2021
- Gradual recovery beginning in 2024
- This trend suggests potential impacts of global disruptions on innovation cycles

#### Implications for Semantic Search


These insights validate the need for semantic patent discovery:

- **Volume vs Quality:** With China's high-volume, lower-citation strategy, semantic search can identify conceptually valuable patents beyond citation metrics
- **Cross-Domain Innovation:** Technology convergence patterns make keyword search insufficient for discovering interdisciplinary innovations
- **Global Coverage:** Semantic understanding helps navigate patent landscapes across diverse countries and innovation strategies

The declining trend in English-language patent publications from 2022 onwards, despite global filing growth, suggests several potential factors:

- **Language shift:** Increasing proportion of patents published in native languages (particularly Chinese) rather than English
- **Publication timing:** Delayed publication of patents filed during 2021-2022 peak filing periods
- **Jurisdictional changes:** Shifts in where multinational companies choose to file patents
- **Post-pandemic normalization:** Return to pre-2020 publication patterns after temporary surge

This trend indicates our semantic search system addresses a specific but significant subset of global patent literature, focusing on the substantial English-language patent corpus that remains critical for international technology transfer and prior art research. Additionally, BigQuery AI's multilingual embedding capabilities position the system for future expansion to non-English patent corpora as semantic search becomes increasingly important for navigating diverse global patent landscapes.

In [5]:
create_patent_dashboard_demo()

country_code,total_patents,patent_share,patents_with_citations,citation_rate_pct,avg_citations_per_patent,highly_cited_patents
CN,35060573.0,71.3,12731211.0,36.3,2.3,2052893.0
US,5966174.0,12.1,4308745.0,72.2,22.9,2327816.0
WO,2154498.0,4.4,2072322.0,96.2,5.9,170156.0
JP,1590258.0,3.2,839387.0,52.8,2.9,77223.0
EP,1341623.0,2.7,486018.0,36.2,2.1,40827.0
KR,1000102.0,2.0,722750.0,72.3,2.9,16080.0
TW,508099.0,1.0,120513.0,23.7,0.9,2681.0
AU,435255.0,0.9,141245.0,32.5,0.9,1285.0
CA,431495.0,0.9,0.0,0.0,0.0,0.0
RU,166250.0,0.3,166160.0,99.9,4.8,2147.0


## Static Demo: Semantic Patent Search with Explainability

**Text Query Mode:** Users enter natural language descriptions of technical concepts or inventions. The system provides configurable search parameters:

- **Date Range Filter:** Limited to our 2024 dataset (January-June 2024, 2.9M patents)
- **Results Slider:** Select top-K similar patents (1-20 results)

**Patent Number Mode:** Direct lookup by entering specific patent publication numbers (one per line). If multiple patent numbers are provided, the system averages their embeddings and returns the patents most similar to that averaged vector.<br>
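The averaging step for multiple patent numbers can be sketched as an element-wise mean of the stored vectors. This is a minimal illustration in plain Python; the function name `mean_embedding` is hypothetical, and the project's own code may use NumPy or a different normalization.

```python
from typing import List, Sequence


def mean_embedding(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("at least one embedding is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Two toy 2-dimensional vectors stand in for the 384-dimensional patent embeddings.
query_vector = mean_embedding([[1.0, 0.0], [0.0, 1.0]])
print(query_vector)
```

The resulting vector is then used as the query embedding in the same way as an embedded text query.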
**Search Results & Relevance Scoring:**
Upon executing a search, the system returns ranked results with comprehensive metadata:

- Patent title and publication details (number, date, country)
-  **Document Relevance Score:** Semantic similarity percentage

    - **High Relevance:** >65% (strong conceptual match)
    - **Moderate Relevance:** 50-65% (related concepts)
    - **Low Relevance:** <50% (weak conceptual connection)


- Complete patent abstract for context
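The relevance bands above can be expressed as a small helper. This sketch assumes the similarity percentage is derived from the cosine distance as `(1 - distance) * 100`; that conversion and the function name `relevance_label` are assumptions, not confirmed details of the demo code.

```python
from typing import Tuple


def relevance_label(cosine_distance: float) -> Tuple[float, str]:
    """Map a cosine distance to a similarity percentage and a relevance band."""
    similarity_pct = (1.0 - cosine_distance) * 100.0  # assumed conversion
    if similarity_pct > 65:
        label = "High"
    elif similarity_pct >= 50:
        label = "Moderate"
    else:
        label = "Low"
    return similarity_pct, label


for distance in (0.25, 0.5, 0.75):
    print(distance, relevance_label(distance))
```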

**Input Validation & Quality Control:**
The interface implements query sanitization and validation to optimize database performance. While most invalid queries are filtered, some may pass validation but return low-relevance results due to poor semantic matching.<br>
**Explainability Feature:**
Our advanced explainability component provides transparency into search results by identifying the most semantically similar sentences between the user query and patent content. This feature returns up to 3 explanatory results ranked by sentence-level similarity scores, enabling users to understand why specific patents were recommended.
This approach demonstrates BigQuery AI's capability to provide not just semantic search results, but interpretable insights into the matching process.
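Sentence-level explainability can be sketched as follows: split the patent text into sentences, embed each one, and keep the top 3 by cosine similarity to the query. The `embed` argument is a placeholder for whatever embedding function is in use (in this project, presumably the Sentence Transformers model); `toy_embed` below is a bag-of-words stand-in so the example runs without a model, and all function names are hypothetical.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def explain_match(
    query: str,
    patent_text: str,
    embed: Callable[[str], Sequence[float]],
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_n (score, sentence) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(s)), s) for s in split_sentences(patent_text)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


VOCAB = ["battery", "solar", "cell", "engine"]  # toy vocabulary for the stand-in


def toy_embed(text: str) -> List[float]:
    tokens = re.findall(r"[a-z]+", text.lower())
    return [float(tokens.count(word)) for word in VOCAB]


abstract = "A solar cell charges a battery. The engine is loud. Weather is nice."
for score, sentence in explain_match("solar battery", abstract, toy_embed, top_n=2):
    print(f"{score:.2f}  {sentence}")
```

Swapping `toy_embed` for a real sentence encoder leaves the ranking logic unchanged.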



In [6]:
demo_search_interface()

VBox(children=(HTML(value='<h2>🔍 Patent Semantic Search</h2>'), HTML(value='<p>Search for patents using natura…

## BigQuery AI Semantic Search: Technical Implementation and Scalability Analysis

The following sections present comprehensive performance analysis of our BigQuery AI implementation, demonstrating sub-5 second query times across 2.9M patents, 84% cost reduction and 12% time reduction through intelligent partitioning, and 98.3% unique discovery rates compared to keyword search methods.<br>
Testing was carried out across 10 diverse technology queries with caching avoided, comparing performance between development (laptop) and cloud (Kaggle) environments. Each query returned 20 semantic results including full patent metadata (title, abstract, publication details, similarity scores), representing typical user search behavior.

#### Scalability Validation

Our implementation demonstrates production-ready scalability across multiple dimensions:

**Data Volume Scalability:**
- Successfully processes 2.9M patents (14.4GB) with consistent performance
- Linear cost scaling through partition pruning (84% data reduction for targeted searches)
- Architecture supports horizontal scaling to full 49M patent corpus via month-based partitioning

**Query Performance Scalability:**
- Achieves sub-5 second vector search response times across 2.9M patents with partition pruning enabling cost-effective scaling to larger datasets
- Consistent performance regardless of partition size (2.8-3.9 seconds for vector search operations)
- 1-month partitioning delivers 12% performance improvement in addition to 84% cost reduction
- Partition pruning demonstrates stable performance despite 5x variation in data volume

**Cost-Efficient Scalability:**
- Partition pruning delivers 84% reduction in bytes processed for 1-month searches
- Production deployment cost optimization through BigQuery clustering strategies
- Enterprise+ tier compatibility enables vector indexing for further performance gains

**Concurrent Processing Scalability:**
- Batch processing architecture handles 5,000-record embedding generation efficiently
- Multi-environment deployment validation (Kaggle GPU vs local processing)
- Separation of offline bulk processing (embeddings) from real-time queries enables concurrent user support
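To make the partition pruning idea concrete, the sketch below builds a `VECTOR_SEARCH` query whose base table is pre-filtered on `publication_date`, so only the matching monthly partitions are scanned. The table name is hypothetical, the column names (`embedding`, `publication_number`, `publication_date`) are assumptions, and `@query_embedding` is assumed to be supplied as an `ARRAY<FLOAT64>` query parameter by the client; the pipeline in the repository may structure its query differently.

```python
from datetime import date


def build_vector_search_sql(embedding_table: str, start_date: str, end_date: str, top_k: int = 20) -> str:
    """Build a VECTOR_SEARCH query restricted to a publication_date range."""
    # Validate ISO dates up front; this also keeps arbitrary text out of the SQL.
    start, end = date.fromisoformat(start_date), date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return f"""
SELECT
  base.publication_number,
  base.publication_date,
  distance
FROM VECTOR_SEARCH(
  (
    SELECT publication_number, publication_date, embedding
    FROM `{embedding_table}`
    WHERE publication_date BETWEEN DATE '{start}' AND DATE '{end}'
  ),
  'embedding',
  (SELECT @query_embedding AS embedding),
  top_k => {int(top_k)},
  distance_type => 'COSINE'
)
ORDER BY distance
""".strip()


# Hypothetical table name; one month of data corresponds to the 1-month search case.
sql = build_vector_search_sql("my-project.patents.patent_embeddings_local", "2024-01-01", "2024-01-31")
print(sql)
```

Widening the date range trades bytes processed for coverage, which is the lever behind the 1-month versus 6-month comparison.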


In [7]:
create_bigquery_visualization()

#### Performance Testing Results

**Latency Measurements:** **Environment Comparison**

Core Vector Search Performance<br>
**Kaggle Environment:**

- Mean latency: 3.0 seconds
- Median latency: 2.9 seconds
- 90th percentile: 3.2 seconds
- Range: 2.5s - 4.2s

**Laptop Environment:**

- Mean latency: 4.5 seconds
- Median latency: 4.2 seconds
- 90th percentile: 4.9 seconds
- Range: 3.9s - 7.2s

Complete Pipeline Performance (with Explainability)<br>
**Kaggle Environment:**

- Mean latency: 5.2 seconds
- Median latency: 5.0 seconds
- 90th percentile: 5.5 seconds
- Range: 4.2s - 6.7s

**Laptop Environment:**

- Mean latency: 12.9 seconds
- Median latency: 11.6 seconds
- 90th percentile: 15.2 seconds
- Range: 9.4s - 18.7s

Key Performance Insights

**Environment Impact:** Kaggle environment shows 33% faster vector search and 60% faster complete pipeline performance, highlighting the benefits of cloud-native deployment proximity to BigQuery infrastructure.<br>
**Consistency:** Kaggle environment demonstrates lower variability (std dev: 467ms vs 991ms for vector search), indicating more predictable performance characteristics for production deployment.<br>
**Production Scalability:** Both environments successfully process semantic searches across 2.9M patents, with mean and 90th-percentile vector search latency under 5 seconds in both environments (the slowest laptop run took 7.2 seconds).
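The 33% and 60% figures follow from the mean latencies listed above when expressed as a reduction relative to the laptop baseline. A quick check, using only the reported means:

```python
def latency_reduction_pct(baseline_s: float, improved_s: float) -> float:
    """Percentage reduction in latency relative to the baseline."""
    return (baseline_s - improved_s) / baseline_s * 100.0


# Mean latencies reported above (laptop as baseline, Kaggle as improved).
vector_search_gain = latency_reduction_pct(4.5, 3.0)
full_pipeline_gain = latency_reduction_pct(12.9, 5.2)

print(f"Vector search: {vector_search_gain:.1f}% lower latency on Kaggle")
print(f"Full pipeline: {full_pipeline_gain:.1f}% lower latency on Kaggle")
```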

In [8]:
create_latency_visualization()

#### Semantic versus Keyword Search Discovery Analysis

**Discovery Methodology:** A comparative analysis across 10 diverse technology queries evaluates the discovery capabilities of BigQuery AI semantic search against traditional keyword search. The semantic search uniqueness percentage is the share of semantic search results that were not returned by keyword search. Each query was executed to return a maximum of 30 results, across both laptop and Kaggle environments, to validate consistency.<br>
The testing revealed significant differences in discovery capabilities between semantic and keyword search approaches.<br>

Note - Without a ground truth dataset, we cannot accurately measure recall or precision

Search Performance Comparison<br>
**Keyword Search:**

- Average results per query: 18 patents
- Average search time: 2.1 seconds (Kaggle), 3.4 seconds (laptop)

**Semantic Search:**

- Average results per query: 30 patents
- Average search time: 2.7 seconds (Kaggle), 5.0 seconds (laptop)

**Discovery Effectiveness:**
**Cross-Environment Validation:** Testing across both laptop and Kaggle environments confirmed consistent discovery capabilities, with semantic search finding 295 unique patents and keyword search finding 175 unique patents in both environments:

- **Total semantic results:** 300 patents (30 per query × 10 queries)
- **Total keyword results:** 180 patents (18 per query × 10 queries)

**Unique Patent Discovery:**

- Semantic search discovers 295 patents not found by keyword search
- Keyword search discovers 175 patents not found by semantic search
- Overlap: 5 patents found by both methods
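The uniqueness figures reduce to set arithmetic over the publication numbers returned by each method. The sketch below uses synthetic IDs that only reproduce the reported counts (300 semantic results, 180 keyword results, 5 shared); the function name is hypothetical and the IDs are not real patents.

```python
from typing import Dict, Iterable


def discovery_overlap(semantic_ids: Iterable[str], keyword_ids: Iterable[str]) -> Dict[str, float]:
    """Compare two result sets and report overlap and uniqueness."""
    semantic, keyword = set(semantic_ids), set(keyword_ids)
    only_semantic = semantic - keyword
    return {
        "semantic_only": len(only_semantic),
        "keyword_only": len(keyword - semantic),
        "overlap": len(semantic & keyword),
        "semantic_unique_pct": 100.0 * len(only_semantic) / len(semantic),
    }


# Synthetic IDs chosen to mirror the reported totals.
shared = [f"SHARED-{i}" for i in range(5)]
semantic_results = shared + [f"SEM-{i}" for i in range(295)]
keyword_results = shared + [f"KEY-{i}" for i in range(175)]

stats = discovery_overlap(semantic_results, keyword_results)
print(stats)
```

With these counts, the semantic uniqueness percentage is 295 / 300, which rounds to the 98.3% quoted elsewhere in the notebook.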

Search Method Limitations<br>
**Keyword Search Challenges:**

- Complex, multi-term technical queries often return zero results
- Long descriptive queries (e.g., "quantum error correction topological qubits fault tolerance") fail to match patent terminology
- Requires exact or near-exact term matching, missing conceptually similar descriptions

**Semantic Search Robustness:**

- Handles complex technical descriptions effectively
- Finds conceptually related patents even with different terminology
- Maintains consistent result quality across diverse query complexity levels

**Implications for Patent Research**

**Semantic Search Superiority:** Semantic search returns patents for all query types, including complex technical descriptions where keyword search returns zero results (whether those patents are relevant cannot be determined without a subject matter expert or a ground truth dataset).<br>
**Complementary Discovery:** The minimal overlap (5 patents) demonstrates that semantic and keyword searches access fundamentally different types of patent relevance, with semantic search providing broader conceptual coverage.<br>
**Query Complexity Handling:** Semantic search succeeds where keyword search fails on complex technical descriptions, making it essential for sophisticated patent research workflows.



In [9]:
create_discoverability_visualization()

## Technical Challenges

#### Embedding Generation Performance Bottleneck


**Initial ML.GENERATE_EMBEDDING Challenges:**
The biggest technical hurdle was BigQuery's native embedding generation performance. Initial attempts using ML.GENERATE_EMBEDDING on 2013-2014 patent data revealed severe limitations:
- **Monitoring Issues:** Jobs ran for 30+ minutes with no meaningful progress tracking beyond basic UI status
- **Performance Variability:** Batch processing showed extreme variation (7-70 minutes per 5,000 record batch)
- **Scale Reality:** Processing 48K records (one day) took 6 hours across 10 batches, averaging 36 minutes per batch
- **Cost-Performance Trade-off:** At ~0.45 seconds per embedding, processing 2.9M patents would require 15 days
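The 15-day estimate can be reproduced from the measured numbers above: 48K records in 6 hours gives the per-embedding cost, which is then extrapolated to 2.9M patents and compared with the 6.5-hour GPU run.

```python
# Measurements reported above.
records_processed = 48_000
elapsed_hours = 6
target_records = 2_900_000
gpu_hours = 6.5

seconds_per_embedding = elapsed_hours * 3600 / records_processed
estimated_days = target_records * seconds_per_embedding / 86_400
speedup = estimated_days * 24 / gpu_hours

print(f"{seconds_per_embedding:.2f} s per embedding")
print(f"{estimated_days:.1f} days for {target_records:,} patents")
print(f"about {speedup:.0f}x faster on GPU")
```

This is a straight-line extrapolation; given the 7-70 minute variation per batch, the real figure could differ noticeably.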

**Implementation Iterations:**

**First attempt:** Single large job (2013-2014) with no progress tracking - abandoned after 30 minutes<br>
**Second iteration:** Stored procedure with 2,000 record batches - still too slow with no visibility<br>
**Third iteration:** Added BigQuery logging, increased to 5,000 record batches, reduced scope to 1 day (48K); the job completed in 6 hours<br>
**Data Quality Issue:** Initial approach using LIMIT + OFFSET without ORDER BY created duplicates; resolved with ROW_NUMBER() windowing
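The duplicate issue arises because `LIMIT` + `OFFSET` without `ORDER BY` gives no stable row order between batches. One way to get deterministic, non-overlapping batches is to number rows with `ROW_NUMBER()` over a fixed ordering and select a range per batch, as sketched below. The table name and the choice of `publication_number` as the ordering column are assumptions; the stored procedure in the repository may be written differently.

```python
def build_batch_sql(
    table: str,
    batch_index: int,
    batch_size: int = 5000,
    order_column: str = "publication_number",  # assumed unique ordering key
) -> str:
    """Build a query that selects one deterministic batch of rows."""
    first_row = batch_index * batch_size + 1
    last_row = first_row + batch_size - 1
    return f"""
SELECT * EXCEPT (row_num)
FROM (
  SELECT t.*, ROW_NUMBER() OVER (ORDER BY {order_column}) AS row_num
  FROM `{table}` AS t
)
WHERE row_num BETWEEN {first_row} AND {last_row}
""".strip()


# Hypothetical table name; batch 0 covers rows 1-5000, batch 1 rows 5001-10000, and so on.
print(build_batch_sql("my-project.patents.patents_one_day", batch_index=0))
```

Because every batch is derived from the same ordering, no record can land in two batches as long as the ordering column is unique.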

**Kaggle GPU Solution:**
The breakthrough came with Sentence Transformers on Kaggle GPU, processing 2.9M records (14GB) in 6.5 hours using 5,000-record batches - a dramatic improvement that made the project feasible.

#### BigQuery Edition Limitations

**Vector Index Discovery:**
After implementing comprehensive vector indexing with 100% coverage on the 14.4GB dataset, we discovered that BigQuery Standard edition (free tier) supports vector index creation (CREATE VECTOR INDEX) but restricts vector index utilization to Enterprise+ tiers. This limitation highlights critical production planning considerations for BigQuery AI deployments.

#### Development Environment Challenges

**Kaggle-Specific Issues:**

**Dependency conflicts:** Library version mismatches requiring minimal dependency strategies<br>
**Repository integration:** Inconsistent import paths (/kaggle/input vs /kaggle/input/d/username) requiring defensive coding<br>
**Workflow friction:** Manual repository re-import for each code change as the 'Link To GitHub' feature wasn't automatically picking up changes<br>
**Environment incompatibility:** Kaggle's environment lacks Streamlit support, forcing maintenance of separate requirements.txt files divergent from the project's pyproject.toml, including charts and demos - violating software engineering principles of single source of truth and creating dual repository management overhead

**Solution Strategy:** Minimize external dependencies and leverage Kaggle's pre-installed libraries to avoid conflicts, while accepting the technical debt of maintaining divergent dependency specifications.
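The defensive handling of inconsistent import paths can be sketched as a lookup over candidate directories. The second candidate path below is an assumed expansion of the `/kaggle/input/d/username` pattern mentioned above (with `username` left as a placeholder), and `resolve_repo_path` is a hypothetical helper, not the notebook's actual code.

```python
import os
import sys
from typing import Iterable, Optional


def resolve_repo_path(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate directory that exists, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


CANDIDATES = [
    "/kaggle/input/patent-art",
    "/kaggle/input/d/username/patent-art",  # placeholder username, assumed layout
]

repo_path = resolve_repo_path(CANDIDATES)
if repo_path is None:
    print("Repository not found; check how the dataset was attached.")
elif repo_path not in sys.path:
    sys.path.append(repo_path)
```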

## Conclusion

This project was interesting and challenging on several fronts, starting with finding a suitable dataset for exploring BigQuery AI features. We initially looked at question-and-answer datasets, blogs, and video transcripts, but after a chat with Claude we settled on Google's public patents dataset, `publications`. The motivation is that the prior art industry spends hours identifying patent art that could save millions in costly patent conflicts; semantic search could also help organizations direct their research spending toward potentially new ideas in a cost-effective manner.

During the preprocessing phase, when we were exploring the raw dataset, I was surprised that it was 2.6TB in size. I had to spend some time researching how best to sample the dataset to get an idea of what it contains. I learned quite a number of BigQuery-specific functions that helped cut the dataset down to a manageable size during analysis. During the embedding generation phase, we ran into various issues, chief among them the slow performance of ML.GENERATE_EMBEDDING. We tried various workarounds, but none were satisfactory. The breakthrough came with SentenceTransformer's all-MiniLM-L6-v2 model, which achieved roughly 56x faster embedding generation (6.5 hours versus an estimated 15 days with ML.GENERATE_EMBEDDING); this allowed us to make progress after spending a couple of days on the earlier approach. We successfully implemented BigQuery AI's VECTOR_SEARCH functionality, achieving sub-5 second query times across 2.9M patents with an 84% cost reduction through partition pruning.

As we planned to build a demo using Streamlit, we started with local development, but when we tried to link our GitHub repository to Kaggle, we ran into various dependency conflicts. We resolved these by importing only the bare minimum of libraries and relying on Kaggle's default libraries. Additionally, separate demos had to be created because Streamlit couldn't be launched on Kaggle.

During the metric testing phase, we discovered some interesting insights: <br>
1) Partitioning by a 1-month date range produced the largest reduction in bytes processed (84%), along with a 12% reduction in query processing time, while ensuring that the partition contains at least 420K patents.<br>
2) Performance on Kaggle was significantly faster than local development for the same partitions, which could be attributed to proximity to the dataset's location. Vector search latency varied between 2.8 and 3.9 seconds regardless of partition size.<br>
3) Semantic search proved more robust than plain keyword search. 98.3% of semantic search results were not returned by keyword search, which suggests its value for comprehensive prior art research.

In sum, it was an interesting, well-spent 2 weeks; this project successfully demonstrated BigQuery AI's semantic search capabilities for patent discovery while revealing important performance optimization strategies for production deployment.

## Next Steps

For the future, we hope to explore other aspects of the dataset. Some of the other dimensions are:

1) Cooperative Patent Classification (CPC) - Given the hierarchical nature of the codes, it will be interesting to see how seemingly different patents appear related because they share a common CPC code. Semantic similarity could identify patent clusters that help surface important industries where research money could be well spent.

2) Patent inventors - How many inventions does one inventor come up with in a year? What are their areas of expertise, as inferred from their inventions? How many inventors collaborate on an invention?

3) Time to grant - How long does it take between publication and the granting of a patent? This would give us an idea of the average turnaround time for various industries.

4) Citation analysis - Explore patent citation networks to identify influential patents, citation patterns across technology domains, and how semantic similarity correlates with citation relationships. This could reveal whether highly cited patents cluster in the embedding space and help identify breakthrough innovations through citation flow analysis.


## Assets

1. [bigquery_ai_survey](https://www.kaggle.com/datasets/laxmsun/bigquery-ai-survey)
2. [patent_art_github](https://github.com/sl2902/patent_art)
3. [streamlit demo](https://patent-art.streamlit.app/)

In [10]:
# import generate_patent_embeddings
# set the env variable os.environ["embedding_table"] = "table_name"

In [11]:
# !python generate_patent_embeddings.py --date-start='2017-01-02' --date-end='2017-01-02' --batch-size=1000