<a href="https://colab.research.google.com/github/jaanvi-prabhakar/SP-BTT-Patent-Classification/blob/fixes/code_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix
import time
import pandas as pd
import google.generativeai as genai
import re
import json
import os
from dotenv import load_dotenv
load_dotenv()


In [None]:
industry_keywords = {
    "Digital Healthcare": ["remote medical patient monitoring", "telemedicine", "remote surgery", "telehealth", "teledentistry"],
    "Sustainable Farming": ["farming technology", "precision agriculture", "vertical farming", "hydroponics", "alternative meat"],
    "Autonomous Vehicles": ["self-driving", "autonomous vehicle", "autonomous car", "automated driving",
    "driverless", "automated vehicle", "robotic car", "intelligent vehicle", "vehicle", "navigation", "transportation", "driving", "sensors", "autonomous"
],
    "Artificial Intelligence": ["artificial intelligence", "graphics processing unit", "large language model", "deep learning"],
    "3D Printing": ["3D printer", "additive manufacturing", "bioprinting", "3D scanner", "soundwave printing", "vehicle", "3D print", "3D design", "3D scan", "Additive print", "Additive printing", "binder jet", "3D model", "3D modeling", "additive manufacture", "metal print", "metal printing", "bioprint", "bio-print", "bio-printing", "Continuous Liquid Interface", "deductive manufacturing", "deductive manufacture", "direct energy deposition"
],
    "Virtual Reality": ["virtual reality headset", "augmented reality glasses", "virtual reality platform", "VR software", "AR hardware"],
    "Nanotechnology": ["nanoscale material", "nanoscale technology"]
}


## Prompt Generation

In [None]:
def generate_prompt(industry_keywords_dict, industry, abstract, few_shot_examples):
    """
    Generate a prompt for the LLM using few-shot examples and industry keywords.

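    Args:
        industry_keywords_dict (dict): Keywords associated with industries.
        industry (str): Category to classify against; must be a key of industry_keywords_dict.
        abstract (str): Abstract to be classified.
        few_shot_examples (dict): Few-shot examples with a "samples" list of per-category entries.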

    Returns:
        str: The full prompt for the LLM.
    """
    # Add few-shot examples to the prompt
    examples_text = "\nExamples:\n"
    for example in few_shot_examples["samples"]:
        if example['Patent Class'] == industry:  # Only include examples from the current industry
            examples_text += f"\nPatent Class: {example['Patent Class']}\n"
            examples_text += f"Positive Sample: {example['Positive Sample']}\n"
            examples_text += f"Negative Sample: {example['Negative Sample']}\n"
            examples_text += f"Reason: {example['Reason']}\n"

    # The judgment-call examples are hard-coded for Autonomous Vehicles, so add them only for that category
    if industry == "Autonomous Vehicles":
        examples_text += "\nJudgment Call Examples (for Autonomous Vehicles):\n"
        examples_text += """
    1. Abstract: An automatic deceleration control apparatus for a vehicle operates so as to secure
      appropriate tire grip performance, depending on the conditions of a road surface on which the vehicle runs turning,
      and restrain excessive rolling of the vehicle body, thereby stabilizing the turning behavior of the vehicle at all times.
      While the vehicle is turning on a high-friction road surface without undergoing yaw moment control, a safe vehicle speed is
      computed within the rollover limit of the vehicle. While the vehicle is turning on a low-friction road surface under yaw moment
      control, on the other hand, a safe vehicle speed that ensures satisfactory tire grip performance is computed in accordance with an
      estimated road friction coefficient. When the vehicle is about to exceed its safe speed as it turns, it is automatically decelerated
      to the safe speed or below. Thus, the vehicle can be prevented from spinning, drifting out, or rolling over. (Yes)
      Reason: While it describes a control system not exclusive to autonomous driving, it could be applied in autonomous vehicle systems.

    2. Abstract: A method and apparatus for warning a driver of a motor vehicle of a predicted collision with a stationary object.
    A proximity sensor is used to determine a position of an obstacle relative to the vehicle. A steering angle of the vehicle is
    determined and used to predict a trajectory of the vehicle. A collision zone on the vehicle is identified based on the vehicle
    trajectory, the collision zone being that spot or location on the vehicle which is predicted to contact the obstacle. A visual
    display within the vehicle (on the control panel, for example) provides a visual indication to the driver of the position of the
     collision zone on the vehicle and a predicted trajectory of the collision zone as the vehicle moves in accordance with the steering angle. (Yes)
       Reason: This system can clearly be applied to autonomous driving, even though it is not solely for that purpose.

    3. Abstract: To determine whether an emergency braking situation exists for a vehicle, the vehicle determines at least the
    following state variables: its own velocity, its own longitudinal acceleration, its relative distance from an object in front,
    and the speed and acceleration of the object in front. A suitable evaluation method to assess whether an emergency braking situation
     is present is determined as a function of these state variables from a plurality of evaluation method options, including at least a
     movement equation evaluation method in which a movement equation system of the vehicle and of the object in front is determined, and
      an evaluation method in which a braking distance of the vehicle is determined. (No)
       Reason: The system is too generic and not specific enough to be classified as an autonomous vehicle technology.

    4. Abstract: To prevent overbraking of the vehicle rear wheels, brake pressure control
    valves are employed comprising essentially a stepped piston and a valve. A control piston
    is provided which prevents closing of the valve in the event of failure of a brake circuit.
    The prior known arrangements are expensive to manufacture and require a large number of seals.
    The invention, therefore, provides a brake pressure control valve in which an annular piston is
    provided with a bore which is penetrated by the control piston. A first of the annular piston
    transverse surfaces is subjected to the pressure of the front wheel brake circuit while a second
     of the annular piston transverse surfaces is subjected to the regulated pressure of the rear wheel
     brake circuit. The control piston's end adjacent the valve bears against the annular piston in
     the direction of a control spring. (No)
       Reason: The system is too general and unrelated to autonomous vehicles specifically.



    5. Abstract: An automotive vehicle includes engine start/stop (ESS) and adaptive cruise control with stop
     and go functionality (ACCS&G). A method of coordinating operation of the ESS and ACCS&G systems is
     provided. The ACCS&G system brings the vehicle to a stop. After a delay and satisfaction of autostop conditions,
     the ESS system stops the engine. Upon receipt of an input and satisfaction of start conditions, the ESS system restarts
     the engine. The ACCS&G system then resumes control of the restarted engine.
     (No)
       Reason: Not a feature of vehicular autonomy


    6. Abstract: A traction control system for an automotive vehicle comprises, a brake fluid pressure control actuator
    associated with at least driven wheels for reducing traction created through the driven wheels, and sensors for
    monitoring an acceleration-slip state of at least one of the driven wheels. A recovery control unit is arranged
    for deriving a degree of turn of the vehicle, and for deriving a rate of change in the vehicle turning degree.
    The recovery control unit is responsive to the change-rate in the turning degree during an acceleration-slip control,
    for controlling traction in a transient state shifting from turning to straight-ahead driving, in such a manner as
     to increase a recovery amount of driving torque caused by the driven wheel as the turning degree is weakened. (No)
      Reason: Not a feature of vehicular autonomy

    7. Abstract: A vehicle monitoring system uses at least one image capture device located
     on a vehicle and provides two or more associated images for combined display on an
     autostereoscopic display provided adjacent an operator. Preferably the images are
     processed for display and provide the operator with visual information that otherwise
      may not be available. The invention is directed to both apparatus and method for providing of this information. (Yes)
       Reason: This does seem relevant: "vehicle monitoring system," "image capture device located on a vehicle"


    """

    # Generate the response format and keywords for all topics
    response_format = "\n".join([f"{industry}: [Yes/No]" for industry in industry_keywords_dict.keys()])
    keywords_list = "\n".join([f"{industry}: {', '.join(keywords)}" for industry, keywords in industry_keywords_dict.items()])

    # Build the full prompt
    full_prompt = (
        f"Determine if the following patent abstract is relevant to the category '{industry}' listed below.\n"
        f"Answer ONLY with 'Yes' or 'No' for the category.\n\n"
        f"The following is a list of keywords associated with this category to assist you:\n"
        f"{', '.join(industry_keywords_dict[industry])}\n"
        f"\n{examples_text}\n"
        f"Do NOT include any explanations or extra text. Just the 'Yes' or 'No' answer.\n\n"
        f"Abstract:\n{abstract}\n\n"
    )
    return full_prompt


## Predict Industry

In [None]:
# API KEY setup
try:
    from google.colab import userdata
    GOOGLE_API_KEY = userdata.get("google_api_key")
except Exception:  # Not running in Colab, or the Colab secret is unavailable
    GOOGLE_API_KEY = os.getenv("GOOGLE_API")
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize model
model = genai.GenerativeModel('gemini-1.5-flash')

# Load few-shot examples from the json file
with open('./patent_classification.json', 'r') as file:
    few_shot_examples = json.load(file)




In [None]:
def predict_industry(abstract, industry, keywords, few_shot_examples, max_retries=5, delay=4):
    """
    Classify an abstract using the LLM with the generated prompt.

    Args:
        abstract (str): Abstract to classify.
        industry (str): Name of the industry/category.
        keywords (list): Keywords related to the industry. Currently unused here; the prompt reads the global industry_keywords dict.
        few_shot_examples (dict): Few-shot examples for prompt construction.
        max_retries (int): Maximum number of retries in case of API failure.
        delay (int): Delay between retries (in seconds).

    Returns:
        str: Classification result ("Yes" or "No").
    """
    full_prompt = generate_prompt(industry_keywords, industry, abstract, few_shot_examples)
    attempts = 0
    response_text = ""


    while attempts < max_retries:
        try:
            # Make a single API call with the refined prompt
            response = model.generate_content(full_prompt)
            response_text = response.text
            break  # Exit loop if request is successful
        except Exception as e:
            print(f"API error: {e}, retrying in {delay} seconds...")
            time.sleep(delay)
            attempts += 1

    if attempts == max_retries:
        print(f"Failed to get a response after {max_retries} attempts.")
        return "No"



    # Match only "Yes" or "No" as a standalone word in the response
    match = re.search(r"\b(Yes|No)\b", response_text, re.IGNORECASE)
    return match.group(1).capitalize() if match else "No"
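
The parsing step at the end of `predict_industry` can be checked offline, without calling the API. The sketch below applies the same regular expression to a few made-up response strings. `parse_yes_no` is a hypothetical helper name and the sample strings are illustrative assumptions, not real model outputs.

```python
import re


def parse_yes_no(response_text):
    """Hypothetical helper mirroring the last two lines of predict_industry."""
    match = re.search(r"\b(Yes|No)\b", response_text, re.IGNORECASE)
    return match.group(1).capitalize() if match else "No"


# Made-up responses, for illustration only
samples = ["Yes", "no.", "  YES\n", "I cannot tell", "Not relevant, so No"]
results = [parse_yes_no(s) for s in samples]
print(results)
```

Two behaviors are worth keeping in mind: an unparseable response silently becomes "No", and the first standalone "Yes" or "No" in the text wins, so a response that hedges ("No, but arguably yes") is read as "No".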


In [None]:
def add_keywords_column(data, keywords):
    """
    Add a column to the dataset indicating whether abstracts contain any of the specified keywords.

    Args:
        data (pd.DataFrame): Dataset to filter.
        keywords (list): List of keywords to search for in the abstracts.

    Returns:
        None: Modifies the DataFrame in place.
    """
    data['has_keywords'] = data['abstract'].apply(lambda x: 1 if any(keyword.lower() in x.lower() for keyword in keywords) else 0)


def prepare_review_table(classified_data):
    """
    Prepare a table for manual review with metadata.

    Args:
        classified_data (pd.DataFrame): Classified dataset.

    Returns:
        pd.DataFrame: Review table ready for export.
    """

    # Rename columns to match expected format for CSV export
    classified_data = classified_data.rename(columns={'Patent ID': 'Index', 'abstract': 'Abstract'})

    classified_data['Reviewer'] = assign_reviewers(len(classified_data))
    classified_data['True label'] = ""
    classified_data['Comments'] = ""
    classified_data['Requires expert review'] = False

    # Select the relevant columns for the review table
    return classified_data[['Index', 'Abstract', 'Reviewer', 'True label', 'Comments', 'Requires expert review']]


def assign_reviewers(num_cases):
    """
    Assign reviewers to cases in a round-robin fashion.

    Args:
        num_cases (int): Number of cases to review.

    Returns:
        list: Reviewer assignments.
    """
    reviewers = ['Rohan', 'Raymond', 'Lior', 'Jamie', 'Zara', 'Johanna']
    return [reviewers[i % len(reviewers)] for i in range(num_cases)]
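
The keyword check in `add_keywords_column` is a case-insensitive substring match. A minimal sketch of that test on plain strings follows; `has_any_keyword`, the shortened keyword list, and the three abstracts are assumptions made up for illustration.

```python
def has_any_keyword(abstract, keywords):
    """Hypothetical helper: same test as the lambda in add_keywords_column."""
    return 1 if any(keyword.lower() in abstract.lower() for keyword in keywords) else 0


# Shortened keyword list and made-up abstracts, for illustration only
keywords = ["self-driving", "vehicle", "sensors"]
abstracts = [
    "A Self-Driving shuttle for campus transport.",
    "A brake valve for a motor VEHICLE.",
    "A hydroponic nutrient pump.",
]
flags = [has_any_keyword(a, keywords) for a in abstracts]
print(flags)
```

Broad keywords such as "vehicle" let unrelated abstracts (the brake valve here) through the filter, which is why the filter only narrows the candidate set and the LLM makes the final call.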

In [None]:
def classify_patents(data, industry, keywords, few_shot_examples, max_positives=5, rpm=15, use_filter=True):
    """
    Classify patents to find relevant abstracts based on the LLM.

    Args:
        data (pd.DataFrame): Dataset to classify.
        industry (str): Name of the industry/category.
        keywords (list): Keywords related to the industry.
        few_shot_examples (dict): Few-shot examples for prompt construction.
        max_positives (int, optional): Maximum number of positive samples to collect. If None, classify all data.
        rpm (int): Maximum requests per minute to the API.
        use_filter (bool): Whether to filter abstracts by keywords before classification.

    Returns:
        pd.DataFrame: Dataset with predictions added as a new column.
    """
    positive_count = 0
    requests_made = 0
    delay_per_request = 60 / rpm  # Calculate delay based on RPM

    # Apply keyword filter if specified
    if use_filter:
        add_keywords_column(data, keywords)
        data = data[data['has_keywords'] == 1]
        print(f"Filtered down to {len(data)} abstracts containing relevant keywords.")

    data = data.copy()  # Make a deep copy
    data['LLM Prediction'] = ""  # Initialize column for predictions

    for index, row in data.iterrows():
        if max_positives and positive_count >= max_positives:
            print("Reached the maximum number of positive samples.")
            break

        try:
            prediction = predict_industry(row['abstract'], industry, keywords, few_shot_examples)
            data.at[index, 'LLM Prediction'] = prediction
            if prediction == "Yes":
                positive_count += 1

            requests_made += 1
            # if index % 100 == 0:  # Log every 100 rows
            #     print(f"Processed {index}/{len(data)} rows. Current Positive Count: {positive_count}")

            # Rate limiting
            if requests_made % rpm == 0:
                print(f"Reached {rpm} requests. Pausing for 60 seconds to avoid rate limit.")
                time.sleep(60)
            else:
                time.sleep(delay_per_request)

        except Exception as e:
            print(f"Error processing row {index}: {e}")
            data.at[index, 'LLM Prediction'] = "Error"
            continue

    print(f"Classification completed. Found {positive_count} positive samples.")
    return data
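
To see how the rate limiting in `classify_patents` adds up, the sketch below totals the sleep time for a run, assuming no request raises an exception (a failed row skips its sleep) and the loop is not cut short by `max_positives`. `total_sleep_seconds` is a hypothetical helper, not part of the pipeline.

```python
def total_sleep_seconds(n_requests, rpm):
    """Hypothetical helper: total time classify_patents would spend sleeping."""
    delay_per_request = 60 / rpm
    total = 0.0
    for requests_made in range(1, n_requests + 1):
        if requests_made % rpm == 0:
            total += 60  # full pause after every rpm-th request
        else:
            total += delay_per_request  # short pause after every other request
    return total


# 30 requests at 15 rpm: 28 short sleeps of 4 s plus 2 pauses of 60 s
print(total_sleep_seconds(30, 15))
```

Because the short per-request delay already spaces requests at the target rate, the extra 60-second pause makes the loop run slower than `rpm` requests per minute.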


## Helper Functions

In [None]:
def save_balanced_csv(data, file_path, industry_name, max_samples_per_class=60):
    """
    Save a balanced dataset with an equal number of positive and negative samples.

    Args:
        data (pd.DataFrame): Dataset with predictions.
        file_path (str): Path to save the CSV file.
        industry_name (str): Industry/category name for the dataset.
        max_samples_per_class (int): Maximum number of samples per class.

    Returns:
        pd.DataFrame: Balanced dataset.
    """
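    positives = data[data[industry_name] == "Yes"]
    negatives = data[data[industry_name] == "No"]

    # Cap at the smaller class so both classes stay equal and sample() never requests more rows than exist
    n_samples = min(max_samples_per_class, len(positives), len(negatives))
    positive_samples = positives.sample(n=n_samples, random_state=1234)
    negative_samples = negatives.sample(n=n_samples, random_state=1234)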

    balanced_data = pd.concat([positive_samples, negative_samples])
    balanced_data.to_csv(file_path, index=False)
    return balanced_data




# Main Pipeline

In [None]:
# Change industry to the category being classified
industry = "Autonomous Vehicles"
keywords = industry_keywords[industry]

total_abstracts_df = pd.read_csv('./required_files/total_abstracts_df_sets.csv')

classified_data = classify_patents(total_abstracts_df, industry, keywords, few_shot_examples, rpm=2000)

# Save a balanced dataset
balanced_data = save_balanced_csv(classified_data, f"balanced_{industry.lower().replace(' ', '_')}_dataset.csv", industry)


review_table = prepare_review_table(classified_data)
review_table.to_csv('classified_review_set.csv', index=False)

positive_examples = classified_data[classified_data['LLM Prediction'] == 'Yes']
positive_indexes = positive_examples['Patent ID'].tolist()  # List of positive classification indexes
print(f"Indexes of {len(positive_indexes)} positive classifications:", positive_indexes)


Filtered down to 4853 abstracts containing relevant keywords.


KeyboardInterrupt: 

# Metrics

In [1]:
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix
import pandas as pd


def calculate_metrics(model_table, true_table):
    """
    Calculate performance metrics for the model's classifications.

    Args:
        model_table (pd.DataFrame): Model classifications with 'Index' and 'Prediction' columns.
        true_table (pd.DataFrame): True labels with 'Index' and 'True label' columns.

    Returns:
        dict: Precision, recall, accuracy, baseline accuracy, lift, and the confusion matrix with its four cells.
    """

    # ensure both tables have an index column that indexes each unique patent abstract
    if 'Index' not in model_table.columns or 'Index' not in true_table.columns:
        raise ValueError("Both tables must have an 'Index' column to perform a join.")

    # perform an inner join on the index column to match predictions with true labels
    merged_table = pd.merge(model_table, true_table, on='Index', how='inner')

    # extract predictions + true labels
    predictions = merged_table['Prediction'].str.strip().str.title()
    true_labels = merged_table['True label'].str.strip().str.title()

    # calculate metrics
    precision = precision_score(true_labels, predictions, pos_label='Yes', zero_division=0)
    recall = recall_score(true_labels, predictions, pos_label='Yes', zero_division=0)

    accuracy = accuracy_score(true_labels, predictions)
    baseline_accuracy = (true_labels == true_labels.mode()[0]).mean()
    lift = accuracy / baseline_accuracy

    conf_matrix = confusion_matrix(true_labels, predictions, labels=['No', 'Yes'])
    TN, FP, FN, TP = conf_matrix.ravel()

    # return results in a dictionary
    metrics = {
        'Precision': precision,
        'Recall': recall,
        'Accuracy': accuracy,
        'Baseline Accuracy': baseline_accuracy,
        'Lift': lift,
        'Confusion Matrix': conf_matrix,
        'True Negatives (TN)': TN,
        'False Positives (FP)': FP,
        'False Negatives (FN)': FN,
        'True Positives (TP)': TP,
    }

    return metrics

In [4]:
import pandas as pd

classified_review_set = pd.read_csv('sample_data/classified_review_set.csv')

model_table = classified_review_set[['Index', 'Prediction']]
true_table = classified_review_set[['Index', 'True label']]

# calculate + print metrics
metrics = calculate_metrics(model_table, true_table)
print(metrics)

{'Precision': 0.325, 'Recall': 0.5909090909090909, 'Accuracy': 0.9394957983193277, 'Baseline Accuracy': 0.9630252100840336, 'Lift': 0.9755671902268761, 'Confusion Matrix': array([[546,  27],
       [  9,  13]]), 'True Negatives (TN)': 546, 'False Positives (FP)': 27, 'False Negatives (FN)': 9, 'True Positives (TP)': 13}
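
As a sanity check, the reported metrics can be recomputed by hand from the four confusion-matrix counts printed above. This is a minimal sketch that copies those counts as literals; nothing else is assumed.

```python
# Counts copied from the confusion matrix printed above
TN, FP, FN, TP = 546, 27, 9, 13
total = TN + FP + FN + TP

precision = TP / (TP + FP)  # share of predicted "Yes" that are truly "Yes"
recall = TP / (TP + FN)     # share of true "Yes" that the model found
accuracy = (TP + TN) / total
# Baseline: always predict the majority true label
baseline_accuracy = max(TN + FP, FN + TP) / total
lift = accuracy / baseline_accuracy

print(precision, recall, accuracy, baseline_accuracy, lift)
```

A lift below 1 means the classifier is less accurate overall than always answering "No", which is expected when only 22 of 595 abstracts are positive; precision and recall are the more informative numbers for this class balance.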
