# Previous notebooks in this series


1. Downloading the dataset: https://www.kaggle.com/code/virajkadam/astrogpt-part-1-download-papers-from-arxiv/
2. Layout aware parsing of documents: https://www.kaggle.com/code/virajkadam/astrogpt-layout-aware-paper-parsing

Next Notebook

4. Uploading data to vectorDB, and multidocument RAG: https://www.kaggle.com/code/virajkadam/astrogpt-multi-document-rag

# About 

- In the previous notebooks, we downloaded astronomy research papers from arXiv and ran layout-aware parsing on some of the documents.
- In this notebook, we extract a summary of each document.

Next section: Multi document RAG

- In the next section, we will store the parsed sections of the papers in a vector DB (Qdrant), which will be used for retrieval in our RAG application.
- We will also encode the summary of each paper in a separate collection, so that the essence of the paper is captured as a vector. This will allow us to first fetch the paper relevant to the question.
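
The two-collection idea can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors in place of real embeddings and plain dictionaries in place of Qdrant collections; the names `summary_vectors`, `chunk_vectors` and `two_stage_search` are hypothetical and not part of this series.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy vectors standing in for real embeddings (values are made up).
summary_vectors = {"paper_a": [0.9, 0.1, 0.0], "paper_b": [0.1, 0.9, 0.2]}
chunk_vectors = {
    "paper_a": {"a_intro": [0.8, 0.2, 0.1], "a_results": [0.9, 0.0, 0.1]},
    "paper_b": {"b_intro": [0.0, 1.0, 0.1], "b_results": [0.2, 0.7, 0.6]},
}

def two_stage_search(query_vec):
    """Stage 1: pick the paper by summary vector. Stage 2: pick a chunk inside it."""
    best_paper = max(summary_vectors, key=lambda p: cosine(query_vec, summary_vectors[p]))
    chunks = chunk_vectors[best_paper]
    best_chunk = max(chunks, key=lambda c: cosine(query_vec, chunks[c]))
    return best_paper, best_chunk

print(two_stage_search([0.1, 0.8, 0.5]))
```

Searching summaries first narrows the chunk search to a single paper, which is the motivation given above for keeping the summaries in their own collection.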

# Installing required libs

In [1]:
!pip install -q rapidfuzz optimum sentence_transformers

In [2]:
!pip install -q -U bitsandbytes

# Imports

In [3]:
from pathlib import Path
import cv2
import datetime
import sys
import gc
import pandas as pd
from glob import glob
from IPython.display import clear_output
from tqdm import tqdm
import pickle
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer,TextStreamer,BitsAndBytesConfig
import torch
from torch.nn.attention import SDPBackend
from rapidfuzz import fuzz,process

# **Config**

In [4]:
class CFG:
    max_chunk_tokens = 200
    logs_path = "/kaggle/working/logs/"
    paper_summaries = "/kaggle/working/paper_summaries.pkl"
    
    # llm-model configuration 
    model_checkpoint="microsoft/Phi-3-mini-4k-instruct"
    generation_args = { "top_k":10,
                       "top_p":0.95,
                        "num_return_sequences":1,
                        "max_new_tokens": 400,
                        "temperature": 0.1,
                        "do_sample": True,
                        }
    
    
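def delete_object(obj):
    # Note: `del obj` only drops this function's local reference; the caller
    # must also delete its own reference before the memory can be reclaimed.
    del obj; gc.collect()
    return 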

def load_pickle(path):
    with open(path,'rb') as f:
        file = pickle.load(f)
    return file



def pickle_obj(obj,path):
    with open(path,"wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        
    return True



# Utils

In [5]:
def setup_logger(log_file):
    from pythonjsonlogger import jsonlogger
    from logging import getLogger, INFO, FileHandler,  Formatter,  StreamHandler
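    #get logger 
    logger = getLogger(__name__)
    # by default the logger inherits the root logger's WARNING level,
    # so the logger.info(...) calls used later would be silently dropped
    logger.setLevel(INFO)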
    
    #logging format
    fmt = jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(process)d %(message)s",
                                   rename_fields={"asctime": "timestamp"},
                                  )
    
    #logs to std-out (prints)
    stdout = StreamHandler(stream=sys.stdout)
    stdout.setFormatter(fmt)
    logger.addHandler(stdout)
    
    #logs to a file
    file = FileHandler(filename=log_file)
    file.setFormatter(fmt)
    logger.addHandler(file)
    
    return logger 

    
logger = setup_logger("./logs")

In [6]:
class Chunk:
    """class for a contiguous chunk of text"""
    def __init__(self,
                 max_length = 200
                ):
        self.text = ""
        self.page_numbers = []
        self.categories = []
        self.max_length = max_length
        self.titles = []
        

    def accumulate(self,
                   block:dict)->bool:

        page_num = block['page_number']
        category = block['category_name']
        text = block['text']


        if category =="title":
            self.titles.append(text)

        self.page_numbers.append(page_num)
        self.categories.append(category)




        if len(text) >=50:
            self.text += "\n" +  text

        # if not a continuous passage of text
        else:
            self.text += 2 * "\n" +  text

        return True


    def chunking_rules(self,
                       block:dict):
        # the first element
        if len(self.categories) == 0:
            acc = self.accumulate(block)
            self.doc_id = block['document_id']

        # recent category is title
        elif ((len(self.text.split(" "))<self.max_length) and 'title' in self.categories[-2:]):
            acc = self.accumulate(block)


        #stop cases
        elif (len(self.text.split(" "))>self.max_length):
            acc = False

        # if we find another title
        elif ((len(self.text.split(" "))>(self.max_length // 4)) and (block['category_name']=='title')):
            acc = False


        # alternatively
        else:
            acc = self.accumulate(block)

        return acc
    
    
class Parsed_Doc:
    
    def __init__(self):
        pass
    
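    # Note: layout_chunker is not defined in this notebook. The parsed documents are
    # loaded from pickles below, so this method is presumably only needed by the parsing notebook.
    def parse_chunks(self,loaded_doc:dict):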
        self.doc_id = loaded_doc.get("doc_id","")
        self.doc_path = loaded_doc.get("doc_path","")
        self.chunks = layout_chunker(loaded_doc.get("continous_chunks",[{}]))
        self.tables = loaded_doc.get("tables",[])
        
        return True
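
Since `layout_chunker` is not shown here, the following is a minimal sketch of how such a driver loop might feed blocks through the stop rules of `Chunk.chunking_rules`. `MiniChunk` and `layout_chunker_sketch` are hypothetical names, `MiniChunk` is a simplified stand-in that keeps only the fields needed for the rules, and the `plain_text` category is an assumed label.

```python
class MiniChunk:
    """Simplified stand-in for Chunk: same stop rules, fewer fields."""

    def __init__(self, max_length=200):
        self.text = ""
        self.categories = []
        self.max_length = max_length

    def try_add(self, block):
        """Return True if the block was absorbed, False if a new chunk should start."""
        n_words = len(self.text.split(" "))
        recent_title = "title" in self.categories[-2:]
        # Stop rules only apply once the chunk is non-empty and is not
        # still "under" a recent title with room to grow.
        if self.categories and not (recent_title and n_words < self.max_length):
            if n_words > self.max_length:
                return False
            if block["category_name"] == "title" and n_words > self.max_length // 4:
                return False
        self.text += "\n" + block["text"]
        self.categories.append(block["category_name"])
        return True


def layout_chunker_sketch(blocks, max_length=200):
    """Group layout blocks into chunks, opening a new chunk whenever a block is refused."""
    chunks, current = [], MiniChunk(max_length)
    for block in blocks:
        if not current.try_add(block):
            chunks.append(current)
            current = MiniChunk(max_length)
            current.try_add(block)  # an empty chunk always accepts its first block
    if current.categories:
        chunks.append(current)
    return chunks


# Hypothetical blocks: two titled sections with filler text.
paragraph = " ".join(["word"] * 40)
blocks = [
    {"category_name": "title", "text": "Abstract"},
    {"category_name": "plain_text", "text": paragraph},
    {"category_name": "plain_text", "text": paragraph},
    {"category_name": "title", "text": "Introduction"},
    {"category_name": "plain_text", "text": "A short paragraph."},
]
chunks = layout_chunker_sketch(blocks)
print([c.categories for c in chunks])
```

Note that a title arriving within two blocks of the previous title is merged into the current chunk, which follows from the "recent category is title" rule in the original class.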

# get parsed paper summaries

**define summarization model and functions**

In [7]:
# define device map to shard the model across gpus


device_map = {"model.embed_tokens": 0,"lm_head":1,"model.norm.weight":1}

for layer_n in range(15):
    device_map[f'model.layers.{layer_n}'] = 0

for layer_n in range(15,32):
    device_map[f'model.layers.{layer_n}'] = 1
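
The hand-written map above could also be produced by a small helper, which makes the split point easy to change. This is a sketch only: `build_device_map` and its parameters are hypothetical, and the defaults (32 layers, split at layer 15) are taken from the loops above.

```python
def build_device_map(n_layers=32, split_at=15, first_gpu=0, second_gpu=1):
    """Place embeddings and layers [0, split_at) on one GPU, the rest on the other."""
    device_map = {
        "model.embed_tokens": first_gpu,
        "model.norm.weight": second_gpu,
        "lm_head": second_gpu,
    }
    for layer_n in range(n_layers):
        target = first_gpu if layer_n < split_at else second_gpu
        device_map[f"model.layers.{layer_n}"] = target
    return device_map


device_map_sketch = build_device_map()
print(device_map_sketch["model.layers.14"], device_map_sketch["model.layers.15"])
```

Moving `split_at` shifts layers between the two GPUs, which can help if one device also has to hold the generation cache or other tensors.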
    

**Load optimized model for inference**

* https://huggingface.co/docs/transformers/perf_infer_gpu_one
* https://huggingface.co/docs/transformers/en/quantization/overview

In [8]:
#model
model = AutoModelForCausalLM.from_pretrained(
    CFG.model_checkpoint, 
    device_map=device_map,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
#     attn_implementation="flash_attention_2"
)

#optimize model for inference

#tokenizer 
tokenizer = AutoTokenizer.from_pretrained(CFG.model_checkpoint)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



In [9]:
!nvidia-smi

Thu Sep 12 07:13:21 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P0             25W /   70W |    1969MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                      

# Custom Text generation function

* from : https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/text_generation.py

In [10]:
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages

In [11]:
class textGen_pipeline:
    def __init__(self,model,tokenizer):
        self.model = model.eval()
        self.tokenizer = tokenizer
        self.device = 'cuda' if torch.cuda.is_available() else "cpu"
        self.framework = 'pt'
        
    def preprocess(
        self,
        prompt,
        add_special_tokens=None,
        truncation=True,
        padding=None,
        max_length=None,
    ):
        # Only set non-None tokenizer kwargs, so as to rely on the tokenizer's defaults
        tokenizer_kwargs = {
            "add_special_tokens": add_special_tokens,
            "truncation": truncation,
            "padding": padding,
            "max_length": max_length}
        
        tokenizer_kwargs = {key: value for key, value in tokenizer_kwargs.items() if value is not None}
        
        if isinstance(prompt, Chat) or isinstance(prompt,list):
            # accept either a Chat wrapper or a plain list of message dicts
            messages = prompt.messages if isinstance(prompt, Chat) else prompt
            tokenizer_kwargs.pop("add_special_tokens", None)  # ignore add_special_tokens on chats
            inputs = self.tokenizer.apply_chat_template(
                messages,
                add_generation_prompt=True,
                return_dict=True,
                return_tensors=self.framework,
                **tokenizer_kwargs,
            )
        else:
            inputs = self.tokenizer(prompt, return_tensors=self.framework, **tokenizer_kwargs)

        inputs["prompt"] = prompt
        
        return inputs
        
    def _forward(self, 
                 model_inputs, 
                 **generate_kwargs):
        input_ids = model_inputs["input_ids"].to(self.device)
        attention_mask = model_inputs.get("attention_mask", None)
        if attention_mask is not None:
            attention_mask = attention_mask.to(self.device)
        prompt_text = model_inputs.pop("prompt")
        
        with torch.no_grad():
            with torch.nn.attention.sdpa_kernel([SDPBackend.FLASH_ATTENTION,SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION]):
                generated_sequence = self.model.generate(input_ids=input_ids, 
                                                         attention_mask=attention_mask, 
                                                         **generate_kwargs)

        del input_ids,attention_mask; gc.collect(); torch.cuda.empty_cache()
        return generated_sequence.detach().cpu()
        
    def decode_output(self,
                      sequence,
                      input_ids):
        text = self.tokenizer.decode(
                    sequence,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=True,
                )
        
        prompt_length = len(
                        self.tokenizer.decode(
                            input_ids,
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=True,))
        
        return {"role": "assistant", 
                "content": text[prompt_length:]}
    
    def generate(self,
                 prompt,
                **generate_kwargs):
        
        model_inputs = self.preprocess(prompt=prompt)
        model_output = self._forward(model_inputs,**generate_kwargs)
        
        output = self.decode_output(model_output[0],model_inputs['input_ids'][0])
        del model_inputs,model_output; gc.collect(); torch.cuda.empty_cache()
        return output.get("content")
        
        

In [12]:
generation_pipeline = textGen_pipeline(model=model,tokenizer=tokenizer)

In [13]:
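# a raw string skips the chat template (see preprocess), so the instruct model
# may just continue the text instead of answering, as the output below shows
generation_pipeline.generate("hello phi. who are you?",temperature=0.7,do_sample=True,max_length=300)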

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


" i need an sql command to alter a table that's in my db to add a new text column for storing crypto-related data, specifically for public keys. the table's called 'publicKeys'. after adding it, i want to test that column exists by querying it. finally, i need to remove that column. keep it concise, just the sql. ok, let's tackle this in parts. First, we'll create the 'coolProject' database and the 'publicKeys' table with the additional 'createdAt' column. Then, we'll insert a fake public key and set 'createdAt' timestamp. Here's your SQL:\n\n```sql\n-- Create the database\nCREATE DATABASE IF NOT EXISTS coolProject;\nUSE coolProject;\n\n-- Create 'publicKeys' table with an additional 'createdAt' column\nCREATE TABLE publicKeys (\n    id INT AUTO_INCREMENT PRIMARY KEY,\n    publicKey TEXT,\n    createdAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n\n-- Insert a fake public key with the timestamp\nINSERT INTO publicKeys (publicKey) VALUES ('fakePublicKey');\n```\n\nFor the unit tests, here'

# Extract summary with PHI

In [14]:
# sections that we want to extract summary from 
summary_sections = ["abstract",'introduction','result','observation','conclusion','discussion and conclusion','summary']

def fuzzy_match_titles(key_name: str,
                candidates: list=summary_sections,
                threshold: int = 70) -> bool:
    if key_name in candidates:
        return True
    
    matched, match_score,_ = process.extractOne(query=key_name,
                                              choices=candidates,
                                              scorer=fuzz.token_sort_ratio)
    if match_score >= threshold:
        return True 
    return False
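
Section titles in parsed papers often carry numbering or capital letters ("1. INTRODUCTION"), while the candidate list is lowercase. The sketch below shows one way to normalize titles before matching. It is an illustration under assumptions: it uses the standard library's `difflib` as a stand-in for `rapidfuzz`, so scores will not be identical to `fuzz.token_sort_ratio`, and `title_matches` is a hypothetical name.

```python
from difflib import SequenceMatcher

SUMMARY_SECTIONS = ["abstract", "introduction", "result", "observation",
                    "conclusion", "discussion and conclusion", "summary"]


def _token_sort(text):
    """Lowercase the text and sort its whitespace-separated tokens."""
    return " ".join(sorted(text.lower().split()))


def title_matches(title, candidates=SUMMARY_SECTIONS, threshold=70):
    """True if the normalized title is similar enough to any candidate."""
    query = _token_sort(title)
    best = max(
        SequenceMatcher(None, query, _token_sort(candidate)).ratio() * 100
        for candidate in candidates
    )
    return best >= threshold


for title in ["Abstract", "1. INTRODUCTION", "Discussion and Conclusions", "References"]:
    print(title, title_matches(title))
```

Lowercasing before scoring keeps a title such as "Abstract" from being penalized purely for its capital letter.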

In [15]:
sys_prompt = "You are a physics and astronomy researcher. Your task is to summarize a research paper based on important sections in the paper."

summarization_prompt = """
## Task : Your task is to create a summary of excerpts from the research papers, delimited by triple single quotes.

## Instruction
- Based on the provided excerpts from the astronomy research papers, craft a concise and comprehensive summary that captures the essence of the key topics, methodologies, and findings discussed.
- The summary should present an integrated overview of the central themes, ensuring that the most critical aspects of the research are highlighted. 
- The summary should flow well in natural language, be at most 400 words long, and span up to 2 paragraphs.
- This summary will be used to create a vector embedding for the document, so it should distill the information into a coherent and meaningful narrative.
- Only respond with the summary, do not write anything extra.

##Input format:
- Each separate excerpt from the research paper is delimited by the [excerpt] tag.


## Input: '''{context}'''
"""


chunk_layout = """[excerpt] 
{chunk_title}:
{chunk_text} 
[/excerpt]"""

def get_chat_json(user_messages:list[str],
                  system_message:str = sys_prompt)->list[dict]:
    """Convert the user instruction and system prompt to the standard template."""
    chat_buf = []
    
    chat_buf.append({"role": "system", "content": system_message})
    
    for instr in user_messages:
        chat_buf.append({"role": "user", "content": instr})
        
    return chat_buf

def get_summary(context:str)->str:
    
    prompt = summarization_prompt.format(context=context)
    prompt_messages = get_chat_json([prompt,])
    prompt_messages = Chat(prompt_messages)
    return generation_pipeline.generate(prompt_messages,
                               **CFG.generation_args)

def format_context(sections:list[Chunk])->str:
    context = "\n\n".join([format_chunk_text(chunk) for chunk in sections])
    return context
    
def format_chunk_text(chunk:Chunk):
    titles = ", ".join(chunk.titles)
    context = chunk.text
    return chunk_layout.format(chunk_title=titles,chunk_text = context)
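
To see what the model actually receives, the formatting step can be run on toy data. This is a minimal sketch: `SimpleNamespace` objects stand in for `Chunk` instances, the chunk texts are placeholders, and `EXCERPT_LAYOUT` and `render_context` are hypothetical names mirroring `chunk_layout` and `format_context`.

```python
from types import SimpleNamespace

EXCERPT_LAYOUT = "[excerpt]\n{chunk_title}:\n{chunk_text}\n[/excerpt]"


def render_context(chunks):
    """Join chunks into the excerpt-delimited context string."""
    return "\n\n".join(
        EXCERPT_LAYOUT.format(chunk_title=", ".join(c.titles), chunk_text=c.text)
        for c in chunks
    )


# Placeholder chunks; the real ones come from the parsed-paper pickles.
toy_chunks = [
    SimpleNamespace(titles=["Abstract"], text="First toy excerpt."),
    SimpleNamespace(titles=["Conclusion"], text="Second toy excerpt."),
]
context = render_context(toy_chunks)
print(context)
```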


In [16]:
def get_contract_summary(doc: Parsed_Doc)->str:
    # word count (not characters) across all chunks of the document
    total_words = " ".join([chunk.text for chunk in doc.chunks]).split(" ")

    if len(total_words) <= 2000:   
        return get_summary(format_context(doc.chunks[:7]))
        
    necessary_sections = [chunk for chunk in doc.chunks if any([fuzzy_match_titles(title) for title in chunk.titles])]
    
    if len(necessary_sections)<=1:
        context = format_context(doc.chunks[:7])
    
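    # note: with 8 to 10 matched sections all of them are used;
    # only above 10 is the list cut down to the first 7
    elif len(necessary_sections)>10:
        context = format_context(necessary_sections[:7])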
    
    else:
        context = format_context(necessary_sections)
    
    return get_summary(context)
        

**Get Summaries**

In [17]:
%%time
all_summaries = {}
for idx,doc_path in enumerate(sorted(glob("/kaggle/input/astrogpt-layout-aware-paper-parsing/parsed/*.pkl"))):
    
    doc = load_pickle(doc_path)
    
    try:
        doc_summary = get_contract_summary(doc)
    
    except Exception as e:
        logger.info({"exception":str(e),"idx":idx})
        doc_summary = ''
        
    
    all_summaries[idx] = doc_summary
    
    
    if idx%50 == 0:
        print(doc_summary)
        print(torch.cuda.memory_summary(device=None, abbreviated=False))


pickle_obj(all_summaries,CFG.paper_summaries)

 The research papers discuss the complex dynamics and composition of the Galactic Center, a region at the heart of our Milky Way galaxy. This area is characterized by a unique ensemble of celestial objects and phenomena, including a potential black hole, massive stars, molecular clouds, and a supermassive black hole remnant. The studies focus on understanding the interactions between these components and their implications for the dynamics and evolution of the Galactic Center.

The first paper highlights the presence of a massive black hole candidate, Sagittarius A* (Sgr A*), which is surrounded by evolved massive stars and a supermassive black hole remnant. The mass distribution and stellar kinematics within a very close proximity to Sgr A* suggest the presence of a central mass concentration, possibly a massive black hole. The paper also discusses the challenges in detecting the proper motion of Sgr A*, which is crucial for estimating the mass of Sgr A*. Recent measurements have rule

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


 The research paper focuses on the study of Galactic supernova remnants (SNRs) using the Giant Meterwave Radio Telescope (GMRT) at 327 MHz. The paper is authored by Sanjay Bhatnagar from the Indian Institute of Astrophysics, Pune, India. The study aims to investigate the morphology and emission characteristics of SNRs in the Galactic plane, which are typically identified by their non-thermal emission signatures and negative spectral index (S(v) < (1+z)).

The paper discusses the advantages of using low-frequency observations for studying SNRs, as the synchrotron emission from SNRs becomes more prominent at lower frequencies, while thermal emission from Galactic sources like HII regions becomes more significant at higher frequencies. This can lead to confusion due to the presence of strong sources in complex fields like the Galactic Center. However, observations at higher frequencies with shorter spatial resolution can separate thermal and non-thermal emissions and reliably map the morp

True
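
The length warning in the output above suggests that some prompts exceed the 4096-token window. One possible guard is to drop trailing excerpts until the prompt fits, sketched below under stated assumptions: `trim_to_budget` is a hypothetical helper, the default counter uses whitespace-separated words as a crude proxy for tokens (a real check would count with the model's tokenizer), and `PROMPT_OVERHEAD` is an assumed allowance for the system prompt and instructions.

```python
def trim_to_budget(excerpts, max_prompt_tokens,
                   count_tokens=lambda text: len(text.split())):
    """Keep leading excerpts while their cumulative size stays within the budget."""
    kept, used = [], 0
    for excerpt in excerpts:
        cost = count_tokens(excerpt)
        if used + cost > max_prompt_tokens:
            break
        kept.append(excerpt)
        used += cost
    return kept


CONTEXT_WINDOW = 4096   # from the warning message above
MAX_NEW_TOKENS = 400    # CFG.generation_args["max_new_tokens"]
PROMPT_OVERHEAD = 300   # assumption: room for system prompt and instructions
budget = CONTEXT_WINDOW - MAX_NEW_TOKENS - PROMPT_OVERHEAD

# Hypothetical excerpts of 2000, 1000 and 1000 "tokens".
toy_excerpts = ["w " * 2000, "w " * 1000, "w " * 1000]
kept = trim_to_budget(toy_excerpts, budget)
print(budget, len(kept))
```

Passing a tokenizer-based counter as `count_tokens` would make the budget exact rather than approximate.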

In [18]:
del model,tokenizer ;gc.collect();torch.cuda.empty_cache()

# resources 

* https://arxiv.org/html/2405.07437v1#:~:text=BertScore%20BertScore%20%5BZhang2020%5D%20leverages%20the,%2C%20recall%2C%20and%20F1%20scores.
* https://huggingface.co/docs/transformers/perf_infer_gpu_one