<a href="https://colab.research.google.com/github/sanjaydasgupta/llm-patent-database/blob/main/few-shot-trials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting Patent Bibliographic Summaries to JSON Records

## Install Essential Libraries

In [1]:
!pip install -q -U flash-attn
!pip install -q -U bitsandbytes
!pip install -q -U peft
!pip install -q -U accelerate

## Load a Model

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

################### Un-comment one of the following LLM names #############################
#
# model_id = "microsoft/Phi-3-mini-128k-instruct"
# model_id = "microsoft/Phi-3-mini-4k-instruct"
#
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
#
################### The following LLMs do not support system-prompts #############################
# model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# model_id = "google/gemma-7b-it"
# model_id = "google/gemma-1.1-7b-it"
#

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
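
The deprecation warning above recommends passing a `BitsAndBytesConfig` object through `quantization_config`. A minimal sketch of an equivalent loading call follows, assuming `transformers` and `bitsandbytes` are installed and a CUDA GPU is available. The helper name `load_4bit_model` is hypothetical and is not part of the notebook; the remaining arguments mirror the original call.

```python
def load_4bit_model(model_id: str):
    """Load a causal LM in 4-bit precision (hypothetical helper, not in the notebook)."""
    # Imports are deferred so the helper can be defined without a GPU runtime.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Replaces the deprecated `load_in_4bit=True` keyword argument.
    quant_config = BitsAndBytesConfig(load_in_4bit=True)

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="cuda",
        torch_dtype="auto",
        trust_remote_code=True,
    )
```

With this helper, the load step would read `model = load_4bit_model(model_id)`; the tokenizer and pipeline setup stay unchanged.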



## Run Predictions

In [4]:
import pandas as pd
from datetime import datetime
from difflib import SequenceMatcher

prompt = """Study the following examples of conversion of a plain-text record to JSON.
Then convert the user-provided plain-text record into JSON.
Each example input contains multiple SECTIONs, prefixed by one of the following TAGs:
"(21)", "(19)", "(22)", "(43)", "(54)", "(51)", "(31)", "(32)", "(33)",  "(86)", "(87)", "(61)", "(62)", "(71)", "(72)", "(57)".
The TAGs may occur in any order, except that "(57)" always comes at the end.
The VALUE of the SECTION with a TAG of "(57)" extends to the end of the input.
Do not look for any more TAGs or SECTIONs after the SECTION with a TAG of "(57)" is found.
In each SECTION, the TAG is followed by the DESCRIPTION and VALUE of some data item.
From each SECTION, extract the TAG and the VALUE to produce a TAG-VALUE pair.
Any VALUE that contains just a date must be rewritten in YYYY-MM-DD (ISO-8601) style.
Any VALUE that contains just a country or region name must be rewritten as a 2-character (ISO-3166) code.
Beware of typos in the TAG field, use the DESCRIPTION to disambiguate the TAG when necessary.
Do not make any other alterations. Do not generate any TAG-VALUE pairs that are not found in the Input.
Combine all such TAG-VALUE pairs into a JSON Response with the TAGs arranged in ascending order."""

def score_prediction(output: str, prediction: str) -> float:
    return SequenceMatcher(None, output, prediction).ratio()

test_row_count = 10 # Number of rows for testing
few_shot_count = 5 # Number of rows reserved for few-shot prompting (only the first 2 are used as shots below)

data_url = "https://github.com/sanjaydasgupta/llm-patent-database/raw/main/data/s11a.parquet"
df5 = pd.read_parquet(data_url).sample(test_row_count + few_shot_count)
print('Row count:', len(df5), 'Columns:', list(df5), '\n')

total_elapsed_seconds = total_score = 0

# Generation settings are identical for every row, so define them once.
generation_args = {
    "max_new_tokens": 1024,
    "return_full_text": False,
    "temperature": 0.05,
    "do_sample": True,
}

# The last test_row_count rows are the test set; the first rows supply the shots.
# Names avoid shadowing the built-in input().
for input_text, expected in df5.tail(test_row_count).values:
    print('INPUT:', input_text)
    print('OUTPUT:', expected)

    messages = [
        # System prompt ...
        {"role": "system", "content": prompt},
        # Shot #1 input and output ...
        {"role": "user", "content": df5.iat[0, 0]},
        {"role": "assistant", "content": df5.iat[0, 1]},
        # Shot #2 input and output ...
        {"role": "user", "content": df5.iat[1, 0]},
        {"role": "assistant", "content": df5.iat[1, 1]},
        # Question for LLM ...
        {"role": "user", "content": input_text},
    ]

    # Time the generation call only.
    t0 = datetime.now()
    prediction = pipe(messages, **generation_args)
    predicted_string = prediction[0]['generated_text'].strip()
    delta_seconds = (datetime.now() - t0).total_seconds()
    total_elapsed_seconds += delta_seconds
    # Character-level similarity between the expected and predicted JSON strings.
    score = score_prediction(expected, predicted_string)
    total_score += score
    length_ratio = len(predicted_string) / len(expected)
    print('PRED:', predicted_string)
    print('score:', score, 'length:', length_ratio, 'seconds:', delta_seconds)
    print('\n')

print('Model:', model_id, 'Mean time:', total_elapsed_seconds / test_row_count, 'Mean score:', total_score / test_row_count)
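
The message list in the loop hard-codes two shots, and the model list notes that some models do not accept a system prompt. The sketch below shows one way to build the messages for any number of shots. The helper `build_messages` and its `use_system_role` flag are hypothetical, and folding the instructions into the first user turn is an assumed workaround, not something the notebook does.

```python
from typing import Dict, Iterable, List, Tuple

Message = Dict[str, str]

def build_messages(
    system_prompt: str,
    shots: Iterable[Tuple[str, str]],
    question: str,
    use_system_role: bool = True,
) -> List[Message]:
    """Assemble a few-shot chat: instructions, (input, output) shots, then the question."""
    messages: List[Message] = []
    if use_system_role:
        messages.append({"role": "system", "content": system_prompt})
    # Each shot becomes one user turn followed by one assistant turn.
    for shot_input, shot_output in shots:
        messages.append({"role": "user", "content": shot_input})
        messages.append({"role": "assistant", "content": shot_output})
    messages.append({"role": "user", "content": question})
    if not use_system_role:
        # Assumed workaround: prepend the instructions to the first user turn.
        first = messages[0]
        messages[0] = {"role": "user", "content": system_prompt + "\n\n" + first["content"]}
    return messages

# Toy data only; real shots would come from the DataFrame rows.
demo = build_messages(
    "Convert the record to JSON.",
    [("(19) INDIA", '{"(19)": "INDIA"}')],
    "(19) JAPAN",
)
print([m["role"] for m in demo])
```

In the notebook, passing `df5.head(few_shot_count).values` as `shots` would use every reserved row instead of only the first two. That is a suggestion, not the behaviour of the original loop.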

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Row count: 15 Columns: ['input', 'output'] 

INPUT: (21) Application No.202327076373 A  (19) INDIA  (22) Date of filing of Application :08/11/2023  (43) Publication Date : 19/04/2024    (54) Title of the invention : MESSAGE BASED NAVIGATIONAL ASSISTANCE     (51) International classification  :G01C 21/36, G08G 1/00 (31) Priority Document No  :NA (32) Priority Date  :NA (33) Name of priority country  :NA (86) International Application No          Filing Date  :PCT/US2021/031382  :07/05/2021  (87) International Publication No  :WO 2022/235274  (61) Patent of Addition to Application Number          Filing Date  :NA :NA (62) Divisional to Application Number          Filing Date  :NA :NA    (71)Name of Applicant :     1)GOOGLE LLC        Address of Applicant :1600 Amphitheatre Parkway Mountain View,  California 94043 U.S.A.  (72)Name of Inventor :     1)SHARIFI, Matthew   (57) Abstract :  Methods, systems, devices, and tangible non -transitory computer readable media for using incoming commu

PRED: {"(19)": "INDIA", "(21)": "202327076373 A", "(22)": "2023-11-08", "(31)": "NA", "(32)": "NA", "(33)": "NA", "(43)": "2024-04-19", "(51)": "G01C 21/36, G08G 1/00", "(54)": "MESSAGE BASED NAVIGATIONAL ASSISTANCE", "(57)": "Methods, systems, devices, and tangible non -transitory computer readable media for using incoming communications to generate suggestions for navigation. The disclose d technology can include accessing route data that includes information associated with navigation from a starting location to a destination. Based on the route data, one or more routes from the starting location to the destination can be determined. Messa ge data including one or more messages to a user can be accessed. Based on the message data and one or more machine -learned models, at least one entity and objectives that are associated with the one or more messages can be determined. Based on the one or more routes, the at least one entity, and the objectives, suggestions associated with the on

PRED: {"(19)": "INDIA", "(21)": "202317010339 A", "(22)": "2023-02-16", "(31)": "2021-055063", "(32)": "2021-03-29", "(33)": "JP", "(43)": "2024-04-19", "(51)": "G16H 10/40, G01N 33/66, A61B 5/00, A61B 5/02, A61B 5/022", "(54)": "BLOOD SUGAR LEVEL E STIMATION DEVICE, BLOOD SUGAR LEVEL ESTIMATION METHOD AND PROGRAM", "(57)": "[Problem] Conventionally known methods for measuring a blood sugar level involve sampling the blood from a subject, estimatin g the blood sugar level using the interstitial fluid, and the like. However, the probl em with all these methods is that they require making a puncture in the skin of a subject in an invasive manner, which places a psychological and physical burden on the subject. [ Solution] In the present invention, a blood sugar level estimation model is g enerated through machine learning based on previously -acquired attribute information, non -invasive biological information, and test data from the blood tests of a plurality of subjects, and as a resul

PRED: {"(19)": "INDIA", "(21)": "202318012745 A", "(22)": "2023-02-24", "(31)": "62/724,790", "(32)": "2018-08-30", "(33)": "USA", "(43)": "2024-04-19", "(51)": "B29C 45/64, B29C 45/28, B29C 45/36, B29C 45/33", "(54)": "PLASTIC MOLDING APPARATUS AND METHOD WITH SHAPER MODULE", "(57)": "An apparatus for operating a  mold having a cavity assembly (192) and a core (190, 3112, 3114) that cooperatively define a mold for molding of plastic articles, comprising: a carriage comprising a support plate (3052); a clamping assembly (3042) mounted to said support plate; first an d second mold support plates (196) mounted to said clamping assembly, and movable by said clamping assembly between a closed position in which cavity plates (194) of the cavity assembly (192) abut one another in clamped contact to de fine a surface of an ar ticle to be molded, and an open position for removal of molded articles; said clamping assembly operable to exert a clamp force on said first and second mold support pla

PRED: {"(19)": "INDIA", "(21)": "202327077191 A", "(22)": "2023-11-13", "(31)": "20211039 3943.5", "(32)": "2021-04-13", "(33)": "CN", "(43)": "2024-04-19", "(51)": "H04N 5/225, H04M 1/02", "(54)": "CAMERA ASSEMBLY AND ELECTRONIC DEVICE", "(57)": "The present application provides a camera assembly and an electronic device. The camera assembly comprises: a first circuit b oard, a flexible circuit board, a base, a plurality of first e lastic members, a first coil, and a photosensitive chip. Fixed ends of the plurality of first elastic members are fixedly mounted on the base, and free ends of the plurality of first elastic members are suspended.  The first circuit board is mounted on a su spension rack formed by the plurality of first elastic members, and the first coil and the photosensitive chip are provided on the first circuit board. One end of the flexible circuit board is fixedly connected to the first circuit  board, and the other end  of the flexible circuit board is fixedly conne

PRED: {"(19)": "INDIA", "(21)": "202327069095 A", "(22)": "2023-10-13", "(31)": "10 2021 106 674.3", "(32)": "2021-03-18", "(33)": "DE", "(43)": "2024-04-19", "(51)": "C23C 16/02, C23C 16/30, C23C 28/04,  C30B 29/40, C30B 29/60", "(54)": "ALN-BASED HARD MATERIAL LAYER ON BODIES OF METAL, HARD METAL, CERMET OR CERAMICS, AND METHOD FOR THE PRODUCTION THEREOF", "(57)": "The invention relates to the field of materials engineering and relates to an AlN -based hard material layer on bodies of metal, hard metal, cermet or ceramics and to a method for the production thereof. The aim of the invention is to provide an AlN hard mate rial layer which has improved hardness and wear resistance and can be produced in an inexpensive and time -efficient manner. According to the invention, an AlN -based hard material layer is provided, which is an individual layer or a multi -layered layer system, wherein at least the one layer or at least one layer of the multi -layered layer system is an AlN -based ha

PRED: {"(19)": "INDIA", "(21)": "202317012593 A", "(22)": "2023-02-24", "(31)": "63/068575", "(32)": "2020-08-21", "(33)": "US", "(43)": "2024-04-19", "(51)": "A61P 19/08, C07K 16/28", "(54)": "FGFR3 ANTIBODIES AND ME THODS OF USE", "(57)": "Anti-FGFR3 antigen -binding proteins and antigen -binding binding fragments thereof are provided. Methods of inhibiting FGFR3 activity and methods of treating FGFR3 -mediated diseases and  disorders are also provided.    No. of Pages : 121  No. of Claims : 81", "(61)": "NA :NA", "(62)": "NA :NA", "(71)": "1)GENZYME CORPORATION        Address of Applicant :50 Binney Street Cambridge, Massachusetts 02142 U.S.A.", "(72)": "1)SABBAGH, Yves     2)CHEN, Yangde     3)BRONDYK, William     4)QIU, Huawei     5)PARK, Sunghae     6)WEI, Ronnie     7)QIU, Yu     8)ZHOU, Yanfeng     9)LEMOINE, Cendrine     10)CHO, HyunSuk", "(86)": "PCT/US2021 /046958  :20/08/2021", "(87)": "WO 2022/040560"}
score: 0.9989165763813651 length: 0.9978354978354979 seconds: 55.798639

PRED: {"(19)": "INDIA", "(21)": "202327068365 A", "(22)": "2023-10-11", "(43)": "2024-04-19", "(51)": "H04N 7/15", "(54)": "VIDEO ANALYSIS PROGRAMME", "(57)": "[Problem] To objectively evaluate online communications such as meetings and lectures, which have become mainstream, in order to carry out communication more efficiently. [Solution] A system of the present disclosure comprises: a video acquisition unit that acquires a video obtain ed by videoing participants during an online session; an analysis unit that analyzes changes in the biological reactions of the participants on the basis of the video acquired by the video acquisition unit; a target reading unit that re ads target informati on relating to the analysis result; and an evaluation unit that carries out an evaluation by comparing the read target information and the analysis result for the participants.     No. of Pages : 25  No. of Claims : 4", "(61)": "NA :NA", "(62)": "NA :NA", "(71)": "1)I'MBESIDEYOU INC.        Address 

PRED: {"(19)": "INDIA", "(21)": "202317012689 A", "(22)": "2023-02-24", "(31)": "10202008815X", "(32)": "2020-09-09", "(33)": "SG", "(43)": "2024-04-19", "(51)": "C10G 3/00", "(54)": "PROCESS AND SYSTEM FOR HYDROTREATING RENEWABLE FEEDSTOCK", "(57)": "The present invention provides a process for producing one or more of hydrocarbon products from a renewable feedstock compris ing triglyceri des, free fatty acids or combinations thereof. The process may comprise the steps of mixing the renewable feedstock with a diluent to form a diluted feedstock; supplying or providing hydrogen gas to the diluted feedstock so that the hydrogen gas ma y dissol ve in the diluted feedstock to form a diluted feedstock enriched with dissolved hydrogen; and feeding the diluted feedstock enriched with dissolved hydrogen to at least a reactor having at least a reaction zone comprising at least a catalyst bed und er predefined conditions, thereby producing a reaction effluent which can be further processed (e.g

PRED: {"(19)": "INDIA", "(21)": "202318012739 A", "(22)": "2023-02-24", "(31)": "62/903,823", "(32)": "2019-09-21", "(33)": "US", "(43)": "2024-04-19", "(51)": "H04N 19/60, H04N 19/137, H04N 19/132, H04N 19/119, H04N 19/18", "(54)": "TRANSFORM-BASED IMAGE CODING METHOD AND DEVICE", "(57)": "An image decoding method according to the present document comprises a step of deriving a residual sample, wherein the step o f deriving the residual sample compr ises the steps of: when a current block is divided into sub -partition blocks, deriving a transform kernel for an inverse primary transform applied to a sub -partition block on the basis of a horizontal or a vertical length of the sub -partition block; and de riving the residual sample from transform coefficients on the basis of the transform kernel.    No. of Pages : 112  No. of Claims : 6", "(61)": "NA :NA", "(62)": "202217021895 :2020-09-21", "(71)": "1)LG ELECTRONICS INC.        Address of Applicant :128, Yeoui -daero, Yeongdeungpo -gu
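
The `SequenceMatcher` score used above is character-level, so a long abstract can hide an error in a short field such as `"(33)": "USA"`. The sketch below adds a rule check and a per-tag comparison. The function names, the choice of `(22)`, `(32)` and `(43)` as date-only tags, and the treatment of `(33)` as a two-letter code or `NA` are assumptions drawn from the prompt and the sample predictions, not code from the notebook.

```python
import json
import re
from typing import List

ALLOWED_TAGS = {"(19)", "(21)", "(22)", "(31)", "(32)", "(33)", "(43)", "(51)",
                "(54)", "(57)", "(61)", "(62)", "(71)", "(72)", "(86)", "(87)"}
DATE_TAGS = ("(22)", "(32)", "(43)")   # assumption: these hold a single date
COUNTRY_TAGS = ("(33)",)               # assumption: priority country only
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
COUNTRY_CODE = re.compile(r"[A-Z]{2}")

def check_prediction(predicted_string: str) -> List[str]:
    """Return rule violations found in a prediction; an empty list means none."""
    try:
        record = json.loads(predicted_string)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]
    problems: List[str] = []
    tags = list(record)
    unknown = [tag for tag in tags if tag not in ALLOWED_TAGS]
    if unknown:
        problems.append(f"unknown tags: {unknown}")
    # All tags are two-digit, so string order equals numeric order.
    if tags != sorted(tags):
        problems.append("tags are not in ascending order")
    for tag in DATE_TAGS:
        value = record.get(tag, "NA")
        if value != "NA" and not ISO_DATE.fullmatch(str(value)):
            problems.append(f"{tag} is not an ISO-8601 date: {value}")
    for tag in COUNTRY_TAGS:
        value = record.get(tag, "NA")
        if value != "NA" and not COUNTRY_CODE.fullmatch(str(value)):
            problems.append(f"{tag} is not a 2-letter code: {value}")
    return problems

def field_accuracy(expected: str, predicted: str) -> float:
    """Fraction of expected tags whose predicted value matches exactly."""
    expected_record = json.loads(expected)
    try:
        predicted_record = json.loads(predicted)
    except json.JSONDecodeError:
        return 0.0
    if not expected_record or not isinstance(predicted_record, dict):
        return 0.0
    hits = sum(1 for tag, value in expected_record.items()
               if predicted_record.get(tag) == value)
    return hits / len(expected_record)

# Toy record only; real inputs would be `predicted_string` and the expected output.
print(check_prediction('{"(19)": "INDIA", "(22)": "24/02/2023", "(33)": "USA"}'))
```

Reporting `field_accuracy` next to the character-level score would show whether a high similarity value still conceals a wrong date or country code.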