<a href="https://colab.research.google.com/github/sanjaydasgupta/LLM-created-patent-database/blob/main/few_shot_trials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting Patent Bibliographic Summaries to JSON Records

**CAUTION**: This notebook requires a T4 GPU and takes 15-20 minutes to run. Running it may incur a substantial cost.

## Install Essential Libraries

In [None]:
!pip install -q -U flash-attn
!pip install -q -U bitsandbytes
!pip install -q -U peft
!pip install -q -U accelerate

## Load a Model

**IMPORTANT**: Obtain a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) and save it in the Secrets panel on the left under the name `HF_TOKEN`. Then restart the runtime before running the cell below.
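Before starting the multi-gigabyte model download, it can help to confirm that the secret is actually visible to the runtime. The snippet below is a minimal sketch: `get_hf_token` is a hypothetical helper that is not part of this notebook, and the fallback to an environment variable is an assumption for running outside Colab.

```python
import os


def get_hf_token(name: str = "HF_TOKEN"):
    """Return the token from Colab Secrets, or from the environment outside Colab."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the token is exported as an environment variable.
        return os.environ.get(name)


token = get_hf_token()
print("HF_TOKEN found" if token else "HF_TOKEN missing: add it in the Secrets panel")
```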

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

################### Un-comment one of the following LLM names #############################
#
# model_id = "microsoft/Phi-3-mini-128k-instruct"
# model_id = "microsoft/Phi-3-mini-4k-instruct"
#
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
#
################### The following LLMs do not support system-prompts #############################
# model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# model_id = "google/gemma-7b-it"
# model_id = "google/gemma-1.1-7b-it"
#

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
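# Passing `load_in_4bit=True` directly is deprecated; use a BitsAndBytesConfig instead.
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)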

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)


## Run Predictions

**IMPORTANT**: Add your `HF_TOKEN` ([Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens)) to the Secrets panel on the left before running the cell below.

The following cell uses a dataset (*s11a.parquet* in the `data` folder) with two columns, *input* and *output*. A few rows are used as few-shot prompting examples; other rows are used to test the accuracy of the prompted LLM's conversions. The *input* column contains unstructured text extracted with [PyMuPDF](https://pypi.org/project/PyMuPDF/) from pages of Volume II of the [Official Journal of the Patent Office](https://search.ipindia.gov.in/IPOJournal/Journal/Patent) published on 19th April 2024. The *output* column contains the corresponding structured text as JSON strings, which were obtained using a process completely different from the one described here. The dataset contains over 690 rows, one for each page of the journal. The PDF document from which the *input* column was created can be found in the `data` folder.
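The split between few-shot rows and test rows can be sketched without the model. The example below is illustrative only: `build_messages` is a hypothetical helper, and the toy rows are invented stand-ins for the *input* and *output* columns rather than real dataset content. It shows how any number of example pairs could be turned into the alternating user/assistant chat messages that a text-generation pipeline accepts.

```python
def build_messages(system_prompt, shots, question):
    """Build a chat prompt: system prompt, then (input, output) shots, then the question."""
    messages = [{"role": "system", "content": system_prompt}]
    for shot_input, shot_output in shots:
        messages.append({"role": "user", "content": shot_input})
        messages.append({"role": "assistant", "content": shot_output})
    messages.append({"role": "user", "content": question})
    return messages


# Toy (input, output) pairs; with the real DataFrame, `df.values` yields the same shape.
rows = [
    ("(21) Application No.1 A (19) INDIA", '{"(19)": "INDIA", "(21)": "1 A"}'),
    ("(21) Application No.2 A (19) INDIA", '{"(19)": "INDIA", "(21)": "2 A"}'),
    ("(21) Application No.3 A (19) INDIA", '{"(19)": "INDIA", "(21)": "3 A"}'),
]
few_shot_count = 2
shots, test_rows = rows[:few_shot_count], rows[few_shot_count:]

messages = build_messages("Convert the record to JSON.", shots, test_rows[0][0])
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```

Keeping the shots and the test rows disjoint, as here, avoids scoring the model on a record it has already seen in the prompt.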

In [None]:
import pandas as pd
from datetime import datetime
from difflib import SequenceMatcher

prompt = """Study the following examples of conversion of a plain-text record to JSON.
Then convert the user-provided plain-text record into JSON.
Each example input contains multiple SECTIONs, prefixed by one of the following TAGs:
"(21)", "(19)", "(22)", "(43)", "(54)", "(51)", "(31)", "(32)", "(33)",  "(86)", "(87)", "(61)", "(62)", "(71)", "(72)", "(57)".
The TAGs may occur in any order, except that "(57)" always comes at the end.
The SECTION with a TAG of "(57)" extends to the end of the input.
There are no SECTIONs after the SECTION with a TAG of "(57)".
In each SECTION, the TAG is followed by the DESCRIPTION and VALUE of some data item.
From each SECTION, extract the TAG and the VALUE to produce a TAG-VALUE pair.
Any VALUE that contains just a date must be rewritten in YYYY-MM-DD (ISO-8601) style.
Any VALUE that contains just a country or region name must be rewritten as a 2-character (ISO-3166) code.
Beware of typos in the TAG field, use the DESCRIPTION to disambiguate the TAG when necessary.
Do not make any other assumptions. Do not generate any TAG-VALUE pairs that are not found in the Input.
Combine all such TAG-VALUE pairs into a JSON Response with the TAGs arranged in ascending order."""

def score_prediction(output: str, prediction: str) -> float:
    return SequenceMatcher(None, output, prediction).ratio()

test_row_count = 10 # Number of rows for testing
few_shot_count = 5 # Number of rows reserved for few-shot prompting (only the first two are used below)

data_url = "https://github.com/sanjaydasgupta/LLM-created-patent-database/raw/main/data/s11a.parquet"
df5 = pd.read_parquet(data_url).sample(test_row_count + few_shot_count)
print('Row count:', len(df5), 'Columns:', list(df5), '\n')

total_elapsed_seconds = total_score = 0

for input, output in df5.tail(test_row_count).values:
    print('INPUT:', input)
    print('OUTPUT:', output)

    generation_args = {
        "max_new_tokens": 1024,
        "return_full_text": False,
        "temperature": 0.05,
        "do_sample": True,
    }

    messages = [
        # System prompt ...
        {"role": "system", "content": prompt},
        # Shot #1 input and output (from first and second columns of DataFrame) ...
        {"role": "user", "content": df5.iat[0, 0]},
        {"role": "assistant", "content": df5.iat[0, 1]},
        # Shot #2 input and output (from first and second columns of DataFrame) ...
        {"role": "user", "content": df5.iat[1, 0]},
        {"role": "assistant", "content": df5.iat[1, 1]},
        # Question for LLM ...
        {"role": "user", "content": input},
    ]

    t0 = datetime.now()
    prediction = pipe(messages, **generation_args)
    predicted_string = prediction[0]['generated_text'].strip()
    delta_seconds = (datetime.now() - t0).total_seconds()
    total_elapsed_seconds += delta_seconds
    score = score_prediction(output, predicted_string)
    total_score += score
    length_ratio = len(predicted_string) / len(output)
    print('PRED:', predicted_string)
    print('score:', score, 'length:', length_ratio, 'seconds:', delta_seconds)
    print('\n')

print('Model:', model_id, 'Mean time:', total_elapsed_seconds / test_row_count, 'Mean score:', total_score / test_row_count)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Row count: 15 Columns: ['input', 'output'] 

INPUT: (21) Application No.202317009367 A  (19) INDIA  (22) Date of filing of Application :13/02/2023  (43) Publication Date : 19/04/2024    (54) Title of the invention : COMPOSITIONS THAT REDUCE P EROXIDE OFF TASTE AND USES THEREOF     (51) International classification  :A23L 27/30, A23L 2/56, A23L 2/60, A23L 27/00, A61K 8/49 (31) Priority Document No  :63/118966  (32) Priority Date  :29/11/2020  (33) Name of priority country  :U.S.A.  (86) Inter national Application No          Filing Date  :PCT/EP2021/083031  :25/11/2021  (87) International Publication No  :WO 2022/112432  (61) Patent of Addition to Application Number          Filing Date  :NA :NA (62) Divisional to Application Number          Filing Date  :NA :NA    (71)Name of Applicant :     1)FIRMENICH SA        Address of Applicant :7, rue de la Bergère 1242 Satigny Switzerland  (72)Name of Inventor :     1)OUYANG, Qing -bo    2)ASHOKAN, Bharani     3)KIZILBASH, Muhammad   (57) Abstr

PRED: {"(19)": "INDIA", "(21)": "202317009367 A", "(22)": "2023-02-13", "(31)": "63/118966", "(32)": "2020-11-29", "(33)": "US", "(43)": "2024-04-19", "(51)": "A23L 27/30, A23L 2/56, A23L 2/60, A23L 27/00, A61K 8/49", "(54)": "COMPOSITIONS THAT REDUCE PEROXIDE OFF TASTE AND USES THEREOF", "(57)": "The pre sent disclosure generally provides taste -modifying compositions that reduce the off taste of hydrogen peroxide. In some aspects, the disclosure provides uses of such taste -modifying compositions to reduce the off taste of hydrogen peroxide. In some other aspects, the disclosure provides compositions (such as comestible compositions or oral care compositions), which comprise hydr ogen peroxide and a taste -modifying composition of the present disclosure. In some embodiments, such compositions are in the form o f a food product, a beverage product, or an oral care product, such as a toothpaste, a mouthwash, tooth -whitening compositions, and the like.", "(61)": "NA :NA", "(62)": "NA :

PRED: {"(19)": "INDIA", "(21)": "202317010702 A", "(22)": "2023-02-17", "(31)": "202010869841.1", "(32)": "2020-08-26", "(33)": "CN", "(43)": "2024-04-19", "(51)": "C08F 214/22, C08F 214/06, C08F 220/18, C08F 220/14, C08F 220/06", "(54)": "COPOLYMERIZED PVDF RESIN FOR LITHIUM BATTERY BINDER AND PREPARATION METHOD THEREFOR", "(57)": "A copolymerized PVDF resin for a lithium battery binder and a preparation method therefor. The preparation method comprises t he following steps: reacting 300 -600 parts of deion ized water,0.04 -0.25 part of a pH buffer regulator, 85 -99.5 parts of a vinylidene fluoride (VDF) monomer, 0.5 -15 parts of a comonomer, 0.3 -3 parts of a metallocene synergist, 0.2 -1.0 part of an initiator and 0.08 -0.35 part of a dispersant at 40 -65? under 5.5-8.0 Mpa; recovering unreacted monomers after the reaction is finished; and washing, filtering and drying same, so as to obtain the copolymerized PVDF resin. The copolymerized PVDF resin for a lithium battery b inder impr

PRED: {"(19)": "INDIA", "(21)": "202317013199 A", "(22)": "2023-02-27", "(31)": "202010779964.6", "(32)": "2020-08-05", "(33)": "CN", "(43)": "2024-04-19", "(51)": "A61K 31/1 67, A61K 31/6615, A61K 31/24, A61P 25/00, A61P 25/14", "(54)": "PHARMACEUTICAL COMPOSITION OF AQUAPORIN INHIBITOR AND PREPARATION METHOD THEREOF", "(57)": "Provided are a pharmaceutical composition of an aquaporin inhibitor and a preparation method thereof. The pharmaceutical composition comprises 2 -((3,5 -bis (trifluoromethyl) phenyl)carbamoyl) -4-chlorophenyl dihydro gen phosphate or a pharmaceutically acceptable salt thereof, or a pharmaceutically acceptable solvate thereof, and meglumine. The pharmaceutical composition of t he aquaporin inhibitor and the preparation method thereof have the following advantages: the pr ocess is simple, has strong operability, and is conducive to industrial production, and the product has good stability, and obviously less content of degradable impur ities, which ensures the ef

PRED: {"(19)": "INDIA", "(21)": "202317012408 A", "(22)": "2023-02-23", "(31)": "63/083486", "(32)": "2020-09-25", "(33)": "US", "(43)": "2024-04-19", "(51)": "H04W 74/00, H04W 74/08, H04L 1/08", "(54)": "CONFIGURING RANDOM ACCESS PROCEDURES", "(57)": "Apparatuses, methods, and systems are disclosed for configuring random access procedures. One method (1000) includes receivin g (1002), at a user equipment, a fir st configuration from a network. The first configuration corresponds to performing a physical random access channel transmission on multiple random access channel occasions. The method (1000) includes receiving (1004) a second  configuration from the networ k. The second configuration corresponds to performing Msg3 repetition, MsgA repetition, or a combination thereof. The method (1000) includes performing (1006) a random access procedure based on the first configuration and the second configuration.", "(61)": "NA :NA", "(62)": "NA :NA", "(71)": "1)LENOVO (SINGAPORE) PTE. LTD. 

PRED: {"(19)": "INDIA", "(21)": "202211021462 A", "(22)": "2022-10-11", "(31)": "NA", "(32)": "NA", "(33)": "IN", "(43)": "2024-04-19", "(51)": "A61K0008970000, A61K0031498500, A61K0009510000, A01N0065080000, A61K0036730000", "(54)": "COMPOSITION, NANOGELS AND CREAMS COMPRISING PLANT EXTRACTS AND METHODS OF PREPARING THEREOF", "(57)": "The present invention relates to synergistic compositio n comprising of plant extracts obtained from plants selected from but not limited to Elaeocarpus angustifolius and Pinus roxburghii. The composition of the present invention has application in the tr eatment of chronic wounds. The present invention also rel ates to methods for preparing the said synergistic composition and other products such as nanogels or topical creams comprising the plant extracts obtained from Elaeocarpus angustifolius and Pinus roxburghii with  an improved efficacy for the treatment of c hronic wounds.", "(61)": "NA :NA", "(62)": "NA :NA", "(71)": "1)GURUKULA KANGRI VISHWAVIDY

PRED: {"(19)": "INDIA", "(21)": "202317013130 A", "(22)": "2023-02-27", "(31)": "20188808.8", "(32)": "2020-07-31", "(33)": "EPO", "(43)": "2024-04-19", "(51)": "G01S 19/21", "(54)": "A COMPUTER -IMPLEMENTED METHOD FOR DETECTING GLOBAL NAVIGATION SATE LLITE SYSTEM SIGNAL SPOOFING, A DATA PROCESSING APPARATUS, A COMPUTER PROGRAM PRODUCT, AND A COMPUTER -READABLE STORAGE MEDIUM", "(57)": "A computer -implemented method for detecting Global Navigation Satellite System (GNSS) signal spoofing. The method comprises: storing (120), at a GNSS receiver, sample sequences of the predictable part and of the unpredictable part of a GNSS signal, wh erein the predictable part comprises predictable bits and the unpredictable part comprises unpredictable bits; verifying (125) the value of the unpredictable bits from which the unpredictable sample sequences are extracted; computing (130) a first and a second partial correlation between the unpredictable, respectively predictable, sample sequences and a 

PRED: {"(19)": "INDIA", "(21)": "202317010751 A", "(22)": "2023-02-17", "(31)": "63/068081", "(32)": "2020-08-20", "(33)": "US", "(43)": "2024-04-19", "(51)": "B01D 61/38, B01D 53/22, B01D 69/10, B01D 71/02", "(54)": "ENHANCED DUAL PHASE MEMBRANES FOR SEPARATING CARBON FROM CARBON-CONTAINING FEED GASES AND SEPARATION METHODS USING THE SAME", "(57)": "Dual phase membranes include a porous support providing a solid phase having a matrix of connected pores, and a liquefiable ion transport phase within the pores of the porous support. The ion transport phase is formed of at least one alkali metal hydroxide, and at least one oxide ion transport agent providing a source of ions selected from the group consisting of borate ions, nitrate ions, phosphate ions, vanadate ions, niobate ions or sulfate ions. The at least one alkali metal hydroxide may be selected from the group consisting of NaOH, KOH, LiOH, RbOH, CsOH and mixtures thereof. The oxide ion transport agent is preferably present in the

PRED: {"(19)": "INDIA", "(21)": "202211058129 A", "(22)": "2022-10-12", "(31)": "NA", "(32)": "NA", "(33)": "NA", "(43)": "2024-04-19", "(51)": "B60R0019000000, B60J0005040000, B62D0025020000, B60R0021000000, B60R0021210000", "(54)": "SIDE CRASH PROTECTION DEVICE FOR SIDE DOORS OF A VEHICLE", "(57)": "A side crash protection device 200 for a side door 100 of a vehicle includes a longitudinal inflatable member 202 filled with a gaseous media and mounted within a channel shaped intrusion beam 204 fixed between an outer panel 304 and an inner panel 302 of the side door 100. The inflatable member 202 extends horizontally from a hinged side of the side door 100 to an opening side of the side door 100 to absorb crash impact and dissipate the crash impact/energy. The inflatable member 202 has a cross -section similar to a tyre of a wheel with an open side facing the inner panel 302. Alternatively, the inflatable member 202 can have has a cross -section similar to a suspension bellow.", "(61)"

PRED: {"(19)": "INDIA", "(21)": "202221059628 A", "(22)": "2022-10-18", "(31)": "NA", "(32)": "NA", "(33)": "NA", "(43)": "2024-04-19", "(51)": "H02J0003380000, G06Q0050060000, C25B0001040000, H01M0008065600, C10G0002000000", "(54)": "A PROCESS FOR A HYBRID RENEWABLE ENERGY SYSTEM", "(57)": "ABSTRACT TITLE:  A PROCESS FOR A HYBRID RENEWABLE ENERGY SYSTEM The present invention relates to a process 100 for sustainable power supply in a hybrid renewable energy system (HRES) 400, the process comprising: collecting meteorolo gical 111, social 112, financial 113 and  real time load estimation 114 input data of the selected site for the HRES by a data collection unit 101, from a central server database 115 through one or more cloud network 116; optimizing a hybrid renewable energy system (HRES) 400 by processing the in put data upon receipt from the data collection unit 101 by an optimization unit 104 comprising of method of optimizing 200 for economical 203, technical 205, social 207 and env
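The `score_prediction` function above measures character-level similarity, so a long, correctly copied abstract can mask errors in short fields such as dates or country codes. A field-level check is one possible complement. The sketch below is not part of the original notebook: `field_accuracy` is a hypothetical helper, and the example strings are shortened, made-up records in the same tag-keyed shape as the predictions above.

```python
import json


def field_accuracy(expected: str, predicted: str) -> float:
    """Fraction of expected tags whose predicted value matches exactly."""
    try:
        expected_fields = json.loads(expected)
        predicted_fields = json.loads(predicted)
    except json.JSONDecodeError:
        # A string that is not valid JSON (for example, a cut-off one) scores zero.
        return 0.0
    if not isinstance(expected_fields, dict) or not isinstance(predicted_fields, dict):
        return 0.0
    if not expected_fields:
        return 0.0
    matches = sum(
        1
        for tag, value in expected_fields.items()
        if tag in predicted_fields and predicted_fields[tag] == value
    )
    return matches / len(expected_fields)


expected = '{"(19)": "INDIA", "(22)": "2023-02-13", "(33)": "US"}'
predicted = '{"(19)": "INDIA", "(22)": "2023-02-13", "(33)": "U.S.A."}'
print(round(field_accuracy(expected, predicted), 3))  # 0.667
```

Exact matching is deliberately strict; a looser variant could normalize whitespace before comparing, at the cost of hiding extraction artifacts.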