# About

* In the previous notebooks, we downloaded the astronomy research papers frm ARXIV, and did a layout aware parsing of some of the documents, and then proceeded to make a summary of each document 
* In this section, we will store the parsed section of the papers in a Vector DB (qdrant) , which will be used for retrieval for our RAG application.
* We will also encode, in a different collection the summaries of each paper, so that we can encode the essence of the paper as a vector. This will allow us to first fetch the relevant paper for answering for the question.
* Along with this, we will also extract keywords from each paper which will help in the search.

# Previous notebooks
1. Downloading the dataset: https://www.kaggle.com/code/virajkadam/astrogpt-part-1-download-papers-from-arxiv/
2. Layout aware parsing of documents: https://www.kaggle.com/code/virajkadam/astrogpt-layout-aware-paper-parsing
3. Extracting document summary: https://www.kaggle.com/code/virajkadam/astrogpt3-getting-paper-summary/

**Recommended Reading**

* https://qdrant.tech/articles/vector-search-filtering/


# Installing required libs

In [1]:
!pip install -q 'qdrant-client[fastembed-gpu]'

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires cloudpickle~=2.2.1, but you have cloudpickle 3.0.0 which is incompatible.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires protobuf<4,>3.12.2, but you have protobuf 5.28.3 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 16.1.0 which is incompatible.
google-ai-generativelanguage 0.6.6 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have protobuf 5.28.3 which is incompatible.
google-api-core 2.11.1 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.

In [2]:
!pip install -q rapidfuzz optimum sentence_transformers flashrank

In [3]:
!pip install -q -U bitsandbytes

# Imports

In [4]:
from pathlib import Path
import cv2
import ast
import datetime
import sys
import tqdm
import gc
import pandas as pd
from glob import glob
from IPython.display import clear_output
from tqdm import tqdm
import pickle
import qdrant_client
from qdrant_client import models
from datasets import load_dataset, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer,TextStreamer,BitsAndBytesConfig
import torch
from rapidfuzz import fuzz,process
from flashrank import Ranker, RerankRequest


# Config

In [5]:
class CFG:
    max_chunk_tokens = 200
    logs_path = "/kaggle/working/logs/"
    
    # llm-model configuration 
    model_checkpoint= "microsoft/Phi-3-mini-128k-instruct"
#     "microsoft/Phi-3-mini-128k-instruct"
    generation_args = { "top_k":50,
                       "top_p":0.95,
                        "num_return_sequences":1,
                        "max_new_tokens": 1000,
                        "temperature": 0.3,
                        "do_sample": True,
                        }
    
    #embdding model 
    embedding_checkpoint = "thenlper/gte-base"
    embedding_size = 768
    
    #qdrant config 
    document_collection = "paper_documents"
    summary_collection = "paper_summaries"
    qdrant_path = "/kaggle/working/papers_db"
    
    # retrieval_config
    number_of_papers_per_query = 5
    number_of_chunks_per_paper = 10
    
    
    
    
def delete_object(obj):
    del obj; gc.collect()
    return 

def load_pickle(path):
    with open(path,'rb') as f:
        file = pickle.load(f)
    return file

# Utils

In [6]:
def setup_logger(log_file):
    from pythonjsonlogger import jsonlogger
    from logging import getLogger, INFO, FileHandler,  Formatter,  StreamHandler
    #get logger 
    logger = getLogger(__name__)
    
    #logging format
    fmt = jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(process)d %(message)s",
                                   rename_fields={"asctime": "timestamp"},
                                  )
    
    #logs to std-out (prints)
    stdout = StreamHandler(stream=sys.stdout)
    stdout.setFormatter(fmt)
    logger.addHandler(stdout)
    
    #logs to a file
    file = FileHandler(filename=log_file)
    file.setFormatter(fmt)
    logger.addHandler(file)
    
    return logger 

    
logger = setup_logger("./logs")

In [7]:
!nvidia-smi

Thu Oct 24 06:54:04 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                      

# Document parsing utils

In [8]:
class Chunk:
    """class for a contigous chunk of text"""
    def __init__(self,
                 max_length = 200
                ):
        self.text = ""
        self.page_numbers = []
        self.categories = []
        self.max_length = max_length
        self.titles = []
        

    def accumulate(self,
                   block:dict)->None:

        page_num = block['page_number']
        category = block['category_name']
        text = block['text']


        if category =="title":
            self.titles.append(text)

        self.page_numbers.append(page_num)
        self.categories.append(category)




        if len(text) >=50:
            self.text += "\n" +  text

        # if not a continous passage of text
        else:
            self.text += 2 * "\n" +  text

        return True


    def chunking_rules(self,
                       block:dict):
        # the first element
        if len(self.categories) == 0:
            acc = self.accumulate(block)
            self.doc_id = block['document_id']

        # recent category is title
        elif ((len(self.text.split(" "))<self.max_length) and 'title' in self.categories[-2:]):
            acc = self.accumulate(block)


        #stop cases
        elif (len(self.text.split(" "))>self.max_length):
            acc = False

        # if we find another title
        elif ((len(self.text.split(" "))>(self.max_length // 4)) and (block['category_name']=='title')):
            acc = False


        # alternatively
        else:
            acc = self.accumulate(block)

        return acc
    
    
class Parsed_Doc:
    
    def __init__(self):
        pass
    
    def parse_chunks(self,loaded_doc:{}):
        self.doc_id = loaded_doc.get("doc_id","")
        self.doc_path = loaded_doc.get("doc_path","")
        self.chunks = layout_chunker(loaded_doc.get("continous_chunks",[{}]))
        self.tables = loaded_doc.get("tables",[])
        
        return True

# LLM utils

In [9]:
def get_chat_json(user_messages:[str,],
                  system_message:str)->[{}]:
    """Convert the user instruction and system prompt to the standard template."""
    chat_buf = []
    
    chat_buf.append({"role": "system", "content": system_message})
    
    for instr in user_messages:
        chat_buf.append({"role": "user", "content": instr})
        
    return chat_buf

In [10]:
class Chat:
    """This class is intended to just be used internally in this pipeline and not exposed to users. We convert chats
    to this format because the rest of the pipeline code tends to assume that lists of messages are
    actually a batch of samples rather than messages in the same conversation."""

    def __init__(self, messages: dict):
        for message in messages:
            if not ("role" in message and "content" in message):
                raise ValueError("When passing chat dicts as input, each dict must have a 'role' and 'content' key.")
        self.messages = messages
        
        
class textGen_pipeline:
    def __init__(self,model,tokenizer):
        self.model = model.eval()
        self.tokenizer = tokenizer
        self.device = 'cuda' if torch.cuda.is_available() else "cpu"
        self.framework = 'pt'
        
    def preprocess(
        self,
        prompt,
        add_special_tokens=None,
        truncation=True,
        padding=None,
        max_length=None,
    ):
        # Only set non-None tokenizer kwargs, so as to rely on the tokenizer's defaults
        tokenizer_kwargs = {
            "add_special_tokens": add_special_tokens,
            "truncation": truncation,
            "padding": padding,
            "max_length": max_length}
        
        tokenizer_kwargs = {key: value for key, value in tokenizer_kwargs.items() if value is not None}
        
        if isinstance(prompt, Chat) or isinstance(prompt,list):
            tokenizer_kwargs.pop("add_special_tokens", None)  # ignore add_special_tokens on chats
            inputs = self.tokenizer.apply_chat_template(
                prompt.messages,
                add_generation_prompt=True,
                return_dict=True,
                return_tensors=self.framework,
                **tokenizer_kwargs,
            )
        else:
            inputs = self.tokenizer(prompt, return_tensors=self.framework, **tokenizer_kwargs)

        inputs["prompt"] = prompt
        
        return inputs
        
    def _forward(self, 
                 model_inputs, 
                 **generate_kwargs):
        input_ids = model_inputs["input_ids"].to(self.device)
        attention_mask = model_inputs.get("attention_mask", None).to(self.device)
        prompt_text = model_inputs.pop("prompt")
        
        with torch.no_grad():
            generated_sequence = self.model.generate(input_ids=input_ids, 
                                                     attention_mask=attention_mask, 
                                                     **generate_kwargs)
        
        del input_ids,attention_mask; gc.collect(); torch.cuda.empty_cache()
        return generated_sequence.detach().cpu()
        
    def decode_output(self,
                      sequence,
                      input_ids):
        text = self.tokenizer.decode(
                    sequence,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=True,
                )
        
        prompt_length = len(
                        self.tokenizer.decode(
                            input_ids,
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=True,))
        
        return {"role": "assistant", 
                "content": text[prompt_length:]}
    
    def generate(self,
                 prompt,
                **generate_kwargs):
        
        model_inputs = self.preprocess(prompt=prompt)
        model_output = self._forward(model_inputs,**generate_kwargs)
        
        output = self.decode_output(model_output[0],model_inputs['input_ids'][0])
        del model_inputs,model_output; gc.collect(); torch.cuda.empty_cache()
        return output.get("content")
        

# Qdrant VectorDB : Basic concepts

1. Collections : A collection is a named set of points (vectors with a payload) among which you can search. The vector of each point within the same collection must have the same dimensionality and be compared by a single metric. Named vectors can be used to have multiple vectors in a single point, each of which can have their own dimensionality and metric requirements.

2. Payload : One of the significant features of Qdrant is the ability to store additional information along with vectors. This information is called payload in Qdrant terminology.

3. Points : The points are the central entity that Qdrant operates with. A point is a record consisting of a vector and an optional payload.

4. Search : There are many ways to estimate the similarity of vectors with each other. In Qdrant terms, these ways are called metrics. The choice of metric depends on vectors obtaining and, in particular, on the method of neural network encoder training.The most typical metric used in similarity learning models is the cosine metric.

**Collections**

A collection is a named set of points (vectors with a payload) among which you can search. The vector of each point within the same collection must have the same dimensionality and be compared by a single metric. Named vectors can be used to have multiple vectors in a single point, each of which can have their own dimensionality and metric requirements.

In addition to the required options, you can also specify custom values for the following collection options:

    hnsw_config - see indexing for details.
    wal_config - Write-Ahead-Log related configuration. See more details about WAL
    optimizers_config - see optimizer for details.
    shard_number - which defines how many shards the collection should have. See distributed deployment section for details.
    on_disk_payload - defines where to store payload data. If true - payload will be stored on disk only. Might be useful for limiting the RAM usage in case of large payload.
    quantization_config - see quantization for details.
    In most cases, you should only use a single collection with payload-based partitioning. This approach is called multitenancy. It is efficient for most of users, but it requires additional configuration.

# Setup Qdrant collection

In [11]:
client = qdrant_client.QdrantClient(":memory:")

def create_qdrant_collection(collection_name:str,
                             client = client):
    client.create_collection(
    collection_name=collection_name,
    vectors_config={"text-dense":models.VectorParams(size=CFG.embedding_size, distance=qdrant_client.models.Distance.COSINE)},
    sparse_vectors_config = {
        "text-sparse": models.SparseVectorParams(index=qdrant_client.models.SparseIndexParams(on_disk=False,))}
    )
    #get collection info 
    info = client.get_collection(collection_name=collection_name)
    print(f"created collection - {info}")
    return True


create_qdrant_collection(CFG.document_collection)

created collection - status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=0 points_count=0 segments_count=1 config=CollectionConfig(params=CollectionParams(vectors={'text-dense': VectorParams(size=768, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}, shard_number=None, sharding_method=None, replication_factor=None, write_consistency_factor=None, read_fan_out_factor=None, on_disk_payload=None, sparse_vectors={'text-sparse': SparseVectorParams(index=SparseIndexParams(full_scan_threshold=None, on_disk=False, datatype=None), modifier=None)}), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=None, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None

True

In [12]:
create_qdrant_collection(CFG.summary_collection)

# get all connection 
client.get_collections()

created collection - status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=0 points_count=0 segments_count=1 config=CollectionConfig(params=CollectionParams(vectors={'text-dense': VectorParams(size=768, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}, shard_number=None, sharding_method=None, replication_factor=None, write_consistency_factor=None, read_fan_out_factor=None, on_disk_payload=None, sparse_vectors={'text-sparse': SparseVectorParams(index=SparseIndexParams(full_scan_threshold=None, on_disk=False, datatype=None), modifier=None)}), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=None, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None

CollectionsResponse(collections=[CollectionDescription(name='paper_documents'), CollectionDescription(name='paper_summaries')])

**Creating Qdrant payload index**


    A key feature of Qdrant is the effective combination of vector and traditional indexes. It is essential to have this because for vector search to work effectively with filters, having vector index only is not enough. In simpler terms, a vector index speeds up vector search, and payload indexes speed up filtering.
    
    - Payload Index: This index is built for a specific field and type, and is used for quick point requests by the corresponding filtering condition.
    - Full-text index: Qdrant supports full-text search for string payload. Full-text index allows you to filter points by the presence of a word or a phrase in the payload field.
    - On-disk payload index: By default all payload-related structures are stored in memory. In this way, the vector index can quickly access payload values during search. As latency in this case is critical, it is recommended to keep hot payload indexes in memory.
    - Tenant Index: Many vector search use-cases require multitenancy. In a multi-tenant scenario the collection is expected to contain multiple subsets of data, where each subset belongs to a different tenant.
    - Principal Index: Similar to the tenant index, the principal index is used to optimize storage for faster search, assuming that the search request is primarily filtered by the principal field.


In [13]:
client.create_payload_index(
    collection_name=CFG.document_collection,
    field_name="document_id",
    field_schema="keyword",
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

# Embedding models for vector embeddings

**Dense Embeddings Utils** : Will create, dense, semantic embeddings

In [14]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")

#load with scaled dot product attention
model = AutoModel.from_pretrained("thenlper/gte-base",
#                                   torch_dtype=torch.float16,
                                  attn_implementation="sdpa"
                                 ).to("cuda:0")

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
        

def get_embedding(input_texts:[str,],
                 model = model
                ):
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt').to("cuda:0")
    with torch.no_grad():
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings.detach().cpu().tolist()

CFG.embedding_batch_size=64
def batcher(inputs:list,
            size = CFG.embedding_batch_size):
    """batch input for pred"""
    for start in range(0,len(inputs),size):
        
        yield inputs[start:start+size]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/618 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/219M [00:00<?, ?B/s]

**Sparse embeddings utils**


* splade example: https://github.com/naver/splade/blob/main/inference_splade.ipynb

* Qdrant hybrid indexing and search: https://github.com/qdrant/workshop-ultimate-hybrid-search/blob/main/notebooks/01-qdrant-indexing.ipynb

* Sparse vectors: https://qdrant.tech/articles/sparse-vectors/

* https://nayakpplaban.medium.com/building-an-ecommerce-based-search-application-using-langchain-and-qdrants-latest-pure-a60df053066a


In [15]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

#document encoder
doc_model_id = "naver/efficient-splade-VI-BT-large-doc"
doc_tokenizer = AutoTokenizer.from_pretrained(doc_model_id)
doc_model = AutoModelForMaskedLM.from_pretrained(doc_model_id).to("cuda:0")


#query encoder
query_model_id = "naver/efficient-splade-VI-BT-large-query"
query_tokenizer = AutoTokenizer.from_pretrained(query_model_id)
query_model = AutoModelForMaskedLM.from_pretrained(query_model_id).to("cuda:0")

tokenizer_config.json:   0%|          | 0.00/449 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/620 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/417 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/699 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/17.7M [00:00<?, ?B/s]

In [16]:
list(zip(doc_tokenizer.all_special_tokens, doc_tokenizer.all_special_ids))

[('[UNK]', 100), ('[SEP]', 102), ('[PAD]', 0), ('[CLS]', 101), ('[MASK]', 103)]

In [17]:
def compute_sparse_vector(texts:str,
                   tokenizer, 
                   sparse_model):
    """
    Computes a vector from logits and attention mask using ReLU, log, and max operations.
    """
    tokens = tokenizer(texts, return_tensors="pt",truncation=True).to("cuda:0")
    output = sparse_model(**tokens)
    logits, attention_mask = output.logits, tokens.attention_mask
    relu_log = torch.log(1 + torch.relu(logits)) #(seq_len,vocab_size) range -> (0,)
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    max_val, _ = torch.max(weighted_log, dim=1)
    vec = max_val.squeeze()
    
    torch.cuda.empty_cache()
    return vec, tokens

In [18]:
# lets have a look at some sparse embeddings, what exactly is it doing

sparse_vec,tokens = compute_sparse_vector("How is the angular momentum of a neutron star related to its size?",
                                          doc_tokenizer,
                                          doc_model
                                         )

indices = sparse_vec.cpu().nonzero().numpy().flatten()
values = sparse_vec.detach().cpu().numpy()[indices]

# the sparse embeddings are basically picking important tokens from the text
tokenizer.decode(indices)

'large tell space similar starel size related direction relationship stars associated mass relations motion nuclear object measure determine visible linked relative satellite solar angle diameter rev relation affect beam dependent scattered orbit relatingels galaxy particles bang disk elliott rotation velocity particle momentum polar kin astronomynova spatial orbitalyonstar proportional relate relates flashlight einstein propelled predict angular determines intercepted magnet stellar scattering planetaryatter neutron kinetic relativity blitz vibrations radiatingissonlide rotational collisions'

# Ingesting the data into vector DB

We will create two vectors, Dense and Sparse for each point in the vector index. The Dense vectors helps in semantic matching, while the sparse vector will help us identify the important keywords. This concept of using both dense and sparse vectors for retrieval is called as hybrid search.

**Ingesting into document summary index**

In [19]:
summaries = load_pickle("/kaggle/input/astrogpt3-getting-paper-summary/paper_summaries.pkl")

type(summaries)

dict

**summary collection**

In [20]:
%%time
paper_idxs = [idx for idx,doc in enumerate(summaries)]
summarys = [{"summary":summary} for key,summary in summaries.items()]

summary_vectors_dense = get_embedding([i.get('summary') for i in summarys])
summary_vectors_sparse = []
for summ in summarys:
    sparse_vec,_ = compute_sparse_vector(summ.get("summary"),
                                         doc_tokenizer,
                                         doc_model)
    
    indices = sparse_vec.cpu().nonzero().numpy().flatten()
    sparse_vec_payload = {"indices":indices,"values":sparse_vec.cpu().detach().numpy()[indices]}
    summary_vectors_sparse.append(sparse_vec_payload)
    


CPU times: user 7.81 s, sys: 578 ms, total: 8.39 s
Wall time: 8.35 s


In [21]:
%%time

# upsert points in qdrant collection
Points = [qdrant_client.models.PointStruct(id=paper_idxs[i],
                                           vector={"text-sparse": qdrant_client.models.SparseVector(indices=summary_vectors_sparse[i].get("indices").tolist(),values=summary_vectors_sparse[i].get("values").tolist()),
                                                    "text-dense":summary_vectors_dense[i]},
                                           payload=summarys[i]) 
          for i in range(len(summarys))
         ]


client.upsert(collection_name=f"{CFG.summary_collection}",
                points = Points)

del Points,summary_vectors_dense,summaries;gc.collect()

CPU times: user 370 ms, sys: 2.52 ms, total: 373 ms
Wall time: 372 ms


0

In [22]:
#check if the update is successful
client.count(
    collection_name=CFG.summary_collection,
    exact=True,
)

CountResult(count=195)

In [23]:
#check example
client.scroll(
    collection_name=CFG.summary_collection,
    limit=2,offset=180
)

([Record(id=180, payload={'summary': ' The research on gamma-ray bursts (GRBs) delves into the intricacies of their jet structures, energy distributions, and the implications these have on our understanding of the universe. The excerpts provided offer a comprehensive overview of the current state of knowledge in this field, focusing on the jet structure, the beaming effect, and the energy-angle relation, which are pivotal in understanding GRBs.\n\nGRBs are among the most energetic events in the universe, and their study has significantly advanced our understanding of high-energy astrophysical phenomena. The jet structure of GRBs is a critical aspect of their energy distribution and emission mechanisms. Two leading models describe this structure: the Uniform Jet (UJ) model and the Universal Structured Jet (USJ) model. The UJ model posits that the energy per solid angle is roughly constant within a finite opening angle, leading to a sharp drop outside this angle. This model suggests that

**Loading documents into document index**

In [24]:
idx_counter = 0

for idx,doc_path in enumerate(tqdm(sorted(glob("/kaggle/input/astrogpt-layout-aware-paper-parsing/parsed/*.pkl")))):
    
    doc = load_pickle(doc_path)
    ids = [idx_counter+i for i in range(len(doc.chunks))]
    idx_counter += len(ids)
    texts = [chunk.text for chunk in doc.chunks]
    dense_vectors = get_embedding(texts)
    sparse_vectors = []
    
    for text in texts:
        sparse_vec,_ = compute_sparse_vector(text,
                                             doc_tokenizer,
                                             doc_model)

        indices = sparse_vec.cpu().nonzero().numpy().flatten()

        sparse_vec_payload = {"indices":indices,
                              "values":sparse_vec.cpu().detach().numpy()[indices]}

        sparse_vectors.append(sparse_vec_payload)

    
    
    
    Points = [qdrant_client.models.PointStruct(id=ids[i],
                                 vector={"text-sparse": qdrant_client.models.SparseVector(indices=sparse_vectors[i].get("indices").tolist(),
                                                                                          values=sparse_vectors[i].get("values").tolist()),
                                         "text-dense":dense_vectors[i]},
                                 payload={"document_id":idx, "title":chunk.titles,"text":chunk.text}) for i,chunk in enumerate(doc.chunks)]
    
    client.upsert(collection_name=f"{CFG.document_collection}",
                  points = Points)
    
    del sparse_vectors,dense_vectors; gc.collect();torch.cuda.empty_cache()


100%|██████████| 195/195 [05:28<00:00,  1.68s/it]


In [25]:
#check if the update is successful,number of chunks
client.count(
    collection_name=CFG.document_collection,
    exact=True,
)

CountResult(count=5642)

In [26]:
#check example
client.scroll(
    collection_name=CFG.document_collection,
    limit=2,offset=1000
)

([Record(id=1000, payload={'document_id': 42, 'title': [], 'text': '\nnosity distribution, while the color-slicing method relies on the fact that in real clusters early-types are located pref- erentially in the central regions and form a tight color- magnitude relation. Similarly, the redshift estimates are independent, one relying primarily on magnitudes while the other on colors.\nOur main results can be summarized as follows. For z < 0.6 the overall confirmation rate by either method is high ~ 65% for the color-slicing and ~ 79% for the matched-filter confirmation. Considering only the “ro- bust” J-band candidates this rate becomes ~ 75% for the color slicing and ~ 85% for the matched-filter technique. Moreover, about 57% of the candidates are confirmed by these independent methods. This rate increases to 66% when only “robust” detections are considered. These re- sults enable us to adequately rank the candidate clusters for follow-up observations. For candidates at higher red- shif

# Retrieve document 

**Hybrid and Multi-Stage Queries**

    With the introduction of many named vectors per point, there are use-cases when the best search is obtained by combining multiple queries, or by performing the search in more than one stage.


    Specifically, whenever a query has at least one prefetch, Qdrant will:

    1. Perform the prefetch query (or queries),
    2. Apply the main query over the results of its prefetch(es).
    
    

In [27]:
def get_query_vector(query:str,
                     dense_model=model,
                     sparse_model=query_model,
                     dense_tokenizer=tokenizer,
                     sparse_tokenizer=query_tokenizer,
                     )->dict:
    
    #query sparse vector
    sparse_vec,_ = compute_sparse_vector(query,
                                         sparse_tokenizer,
                                         sparse_model)

    sparse_indices = sparse_vec.cpu().nonzero().numpy().flatten()
    sparse_scores = sparse_vec.cpu().detach().numpy()[sparse_indices]
    
    dense_vector = get_embedding([query,])[0]
    
    return {"sparse_vector":{"indices":sparse_indices,
                             "values":sparse_scores},
            "dense_vector":dense_vector}
    
    

    
def hybrid_vector_search(query:str,
                  collection:str,
                  filters=None,
                  total_results = 10,
                  ):
    
    
    
    query_vector = get_query_vector(query)
    
    prefetch_n_sparse = total_results*4
    prefetch_n_dense = total_results*4
    
    prefetch=[
        models.Prefetch(
            query=models.SparseVector(**query_vector.get("sparse_vector")),
            using="text-sparse",
            limit=prefetch_n_sparse,
            filter=filters
        ),
        models.Prefetch(
            query=query_vector.get("dense_vector"),  
            using="text-dense",
            limit=prefetch_n_dense,
            filter=filters
        ),
    ]
    
    results = client.query_points(
        collection,
        prefetch=prefetch,
        query=models.FusionQuery(
            fusion=models.Fusion.RRF,
        ),
        with_payload=True,
        limit=total_results,
        )
    return results

**test retrieval on doc summaries**

In [28]:
query = "What is the critical mass that is required to form a black hole?"
for obj in hybrid_vector_search(query,CFG.summary_collection,total_results=2).points:
    print(obj,end="\n\n")
    

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


id=105 version=0 score=0.7 payload={'summary': ' Three-Body Encounters of Black Holes in Globular Clusters\n\nIntroduction:\n\nThe presence of intermediate-mass black holes (IMBHs) in globular clusters has been increasingly supported by observational evidence, such as optical velocities of stars in globular clusters and X-ray observations of unresolved, non-stellar sources associated with these clusters. Theoretical models suggest that IMBHs are likely to form in environments with high three-body encounters, making globular clusters prime locations for their formation. This study focuses on the dynamics of IMBHs within globular clusters, particularly how they interact with other black holes and the implications for gravitational wave detection.\n\nKey Findings:\n\n- IMBHs in globular clusters are expected to undergo frequent three-body encounters with other black holes, leading to binary evolution and mergers due to gravitational radiation.\n- Numerical simulations of high mass-ratio b

In [29]:
# check the object methods
obj.id,obj.score

(185, 0.6666666666666666)

**Fetching relevant documents based on match summaries**

**Qdrant filters** : https://qdrant.tech/documentation/concepts/filtering/

- With Qdrant, you can set conditions when searching or retrieving points. For example, you can impose conditions on both the payload and the id of the point.

- Qdrant allows you to combine conditions in clauses. Clauses are different logical operations, such as OR, AND, and NOT

- common used conditions: Must, Should, Must Not


- Filtering conditions: Match,Match Any , Match Except etc.

In [30]:
def create_id_filter(ids:list):
    """create a filter for filtering the DB for specific ids"""
    return models.Filter(must = [models.FieldCondition(key="document_id", match=models.MatchAny(any=ids),),
                                ]
                        )

def get_relevant_paper_ids(query:str,
                        summary_collection=CFG.summary_collection,
                        total_results=CFG.number_of_papers_per_query,
                        score_threshold = 0.28)->[int]:
    """get document ids relevant to the query, by comparing the query with document summary."""
    
    try:
        relevant_summmaries = hybrid_vector_search(query,
                                                   summary_collection,
                                                   total_results=total_results)

        ids = [point.id for point in relevant_summmaries.points if point.score>=score_threshold]

        return ids
    
    except Exception as e:
        print(e)
        return []
    

    
relevant_ids_filter = create_id_filter(get_relevant_paper_ids(query))

#check the ids filter 
relevant_ids_filter

Filter(should=None, min_should=None, must=[FieldCondition(key='document_id', match=MatchAny(any=[105, 185, 62, 85, 194]), range=None, geo_bounding_box=None, geo_radius=None, geo_polygon=None, values_count=None)], must_not=None)

In [31]:
query = "What is the supercritical mass required to form a black hole?"

for obj in hybrid_vector_search(query,
                                CFG.document_collection,
                                filters=relevant_ids_filter,
                                total_results=3).points:
    print(obj.payload.get("document_id"))
    print(obj,end="\n\n")
    
    

62
id=1505 version=0 score=0.6666666666666666 payload={'document_id': 62, 'title': [], 'text': '\nBut radically different scenarios have also been proposed. At the opposite extreme, supermassive black holes might have been formed very rapidly and very early on in the universe’s history, when their host galaxies looked nothing like they do today. During the very early stages of struc- ture formation, local perturbations might have led to the growth of dark matter halos, which might have catalysed the formation of supermas- sive black holes at their centre. Ironically, in this scenario very massive black holes form more easily than small ones; in fact all black holes are predicted to exceed one million solar masses. The subsequent growth of the black hole could then control the formation and appearance of the galaxy around it.\nWhile we might not yet know the full story about supermassive black holes, there is at least one certainty — the next few years will be very interesting indeed.\n

**Scroll vs Search**



In [32]:
def filtered_search(collection,
                    filters,
                    limit = 5):
    """perform filtering on specific conditions"""
    results = client.scroll(
        collection_name=collection,
        scroll_filter= filters,
        limit = limit)
    
    return results


filtered_search(CFG.document_collection,create_id_filter([12]),limit=2)

([Record(id=240, payload={'document_id': 12, 'title': ['Magnetic collimation of the solar and stellar winds', '1. Introduction'], 'text': "\nMagnetic collimation of the solar and stellar winds\n\nK. Tsinganos! and S. Bogovalov?\n' Department of Physics, University of Crete and FORTH/IESL GR-710 03 Heraklion, Crete, Greece ? Moscow State Engineering Physics Institute, Moscow, 115409, Russia\nSubmitted July 21, 1999; accepted February 3, 2000\n\n1. Introduction\nAbstract. We resolve the paradox that although mag- netic collimation of an isotropic solar wind results in an enhancement of its proton flux along the polar directions, several observations indicate a wind proton flux peaked at the equator. To that goal, we solve the full set of the time- dependent MHD equations describing the axisymmetric outflow of plasma from the magnetized and rotating Sun, either in its present form of the solar wind, or, in its ear- lier form of a protosolar wind. Special attention is directed towards the 

**Qdrant Grouped search**

    It is possible to group results by a certain field. This is useful when you have multiple points for the same item, and you want to avoid redundancy of the same item in the results.

    The groups are ordered by the score of the top point in the group. Inside each group the points are sorted too.

    In this example, we have a collection of document summaries,and in another collection, we have chunked the documents and stored the points for the chunks in a separate collection, while storing the document id from the document it belongs in the payload of the chunk point.




In [33]:
# def grouped_hybrid_search(query:str,
#                           collection:str,
#                           filters,
#                           total_groups = 3,
#                           group_size = 6,
#                           prefetch_n_sparse = 40,
#                           prefetch_n_dense = 40):
    
    
#     query_vector = get_query_vector(query)
    
#     prefetch=[
#         models.Prefetch(
#             query=models.SparseVector(**query_vector.get("sparse_vector")),
#             using="text-sparse",
#             limit=prefetch_n_sparse,
#             filter=filters
#         ),
#         models.Prefetch(
#             query=query_vector.get("dense_vector"),  
#             using="text-dense",
#             limit=prefetch_n_dense,
#             filter=filters
#         ),
#     ]
    
    
#     results = client.query_points_groups(
#                                         collection,
#                                         group_by="document_id",  # Path of the field to group by
#                                         limit=total_groups,  # Max amount of groups
#                                         group_size=group_size,  # Max amount of points per group\,
#                                         prefetch=prefetch,
#                                         query=models.FusionQuery(
#                                                 fusion=models.Fusion.RRF,),
#                                         with_payload=True,
#                                         query_filter=filters
#                                         )
    
#     return results

# grouped_search_res = grouped_hybrid_search(query,CFG.document_collection,create_id_filter([12]))
# for group in grouped_search_res.groups:
#     print("group ID:",group.id,)
    
#     for item in group.hits:
#         print(item)
        
#     print("\n"*3)

In [34]:
from collections import OrderedDict
from itertools import islice

def grouped_hybrid_search(query:str,
                          collection:str,
                          filters,
                          total_groups = CFG.number_of_papers_per_query,
                          max_group_size = CFG.number_of_chunks_per_paper,
                          ):
    """performs a grouped hybrid search"""

        
    returned_resp = hybrid_vector_search(query,
                                         collection,
                                         filters,
                                         total_groups*max_group_size*2,
                                         )
    
    
    Grouped_response = OrderedDict()
    for point in returned_resp.points:
        doc_idx = point.payload.get("document_id")
        group_size = len(Grouped_response.get(doc_idx,[]))
        if group_size>=max_group_size:
            continue
        elif group_size==0:
            Grouped_response[doc_idx] = [point,]
        else:
            Grouped_response[doc_idx].append(point)
            
    return dict(islice(Grouped_response.items(),
                      total_groups))
    

**test**

In [35]:
grouped_search_res = grouped_hybrid_search(query,CFG.document_collection,relevant_ids_filter,max_group_size=2)

for group_id,group in grouped_search_res.items():
    print("group ID:",group_id)
    
    for item in group:
        print(item)
        
    print("\n"*3)

group ID: 62
id=1505 version=0 score=0.6666666666666666 payload={'document_id': 62, 'title': [], 'text': '\nBut radically different scenarios have also been proposed. At the opposite extreme, supermassive black holes might have been formed very rapidly and very early on in the universe’s history, when their host galaxies looked nothing like they do today. During the very early stages of struc- ture formation, local perturbations might have led to the growth of dark matter halos, which might have catalysed the formation of supermas- sive black holes at their centre. Ironically, in this scenario very massive black holes form more easily than small ones; in fact all black holes are predicted to exceed one million solar masses. The subsequent growth of the black hole could then control the formation and appearance of the galaxy around it.\nWhile we might not yet know the full story about supermassive black holes, there is at least one certainty — the next few years will be very interesting

# **Multi step retrieval for fetching relevant chunks**

In [36]:

def get_grouped_data_chunks(query:str)->dict:
    """fetch document chunks relevant to the query."""
    relevant_papers_ids = get_relevant_paper_ids(query)
    id_filter = create_id_filter(relevant_papers_ids)
    
    relevant_grouped_chunks =  grouped_hybrid_search(query,
                                                     CFG.document_collection,
                                                     id_filter)
    
    
    return relevant_grouped_chunks
    
    
   
grouped_resp = get_grouped_data_chunks(query)

grouped_resp.keys()

dict_keys([62, 120, 194, 182, 185])

# reranking for filtering relevant documents using Flash Rank

In [37]:
ranker = Ranker(model_name="rank-T5-flan", cache_dir="./")

def rerank_texts(query,
                points:[],
                ranker=ranker,
                alpha = 0.6):
    """ranks texts with reranker model."""
    passages = [{'id':i,
                 "text":point.payload.get("text"),
                 "document_idx":point.payload.get("document_id"),
                 "title":point.payload.get("title"),
                 "qdrant_score":point.score} for i,point in enumerate(points)]
    rerankrequest = RerankRequest(query=query, passages=passages)        
    results = ranker.rerank(rerankrequest)
    
    # compute final score
    for point in results:
        point['final_score'] = alpha * point['qdrant_score'] + (1-alpha) * point['score']
    return results

    
def score_based_filter(ranked_chunks=[{},],
                        score_thresh = 0.30,
                        ):
    """thresholding based on final score."""

    threshold_filtered_results = filter(lambda x: x.get("final_score")>=score_thresh,
                                        ranked_chunks) 
    
    return list(threshold_filtered_results)

rank-T5-flan.zip: 100%|██████████| 73.7M/73.7M [00:07<00:00, 10.4MiB/s]


In [38]:
# testing rerank function
sample_examples = []
for val in grouped_resp.values():
    sample_examples.extend(val)
score_based_filter(rerank_texts(query,sample_examples))

[{'id': 4,
  'text': "\n\nWhat is still to come?\nLooking back over the past five or six years, the progress made in our understanding of supermas- sive black holes has been respectable. But the picture is far from complete, and the road ahead is full of challenges. The question first raised by Fabian and Canizares more than 10 years ago is still outstanding: if nearby galaxies contains su- permassive black holes, what prevents them from shining like quasars? By and large, it is not any shortage of fuel. For instance, in a giant ellipti- cal galaxy like M87, even the winds from massive stars should produce a sufficient quantity of gas and dust to power the central supermassive black hole at a level far above what is observed.\nOne possible, though controversial, answer was provided by Setsuo Ichimaru of Tokyo University back in 1977 — albeit in the different context of ac- cretion onto stellar mass black holes. Ichimaru’s idea was revived by Rees, Sterl Phinney of Cal- tech, Begelman a

# Getting context from query 

In [39]:
class document_info:
    def __init__(self,idx,relevant_chunks):
        self.idx = idx
        self.relevant_chunks = relevant_chunks
        
        
document_info_template = """
[PaperID-{doc_id}] 

Document Context:{doc_context}

[/PaperID-{doc_id}]"""


document_chunk_template = """
[excerpt-{n}]
{chunk}
[/excerpt-{n}]"""



def format_document_info( document_id:int,
                          relevant_chunks: [str,]):
    
    formatted_chunks = "\n".join([document_chunk_template.format(n=i,chunk=chunk) for i,chunk in enumerate(relevant_chunks)])
    
    
    return document_info_template.format(doc_id=document_id,doc_context=formatted_chunks)


In [40]:
def get_context(query:str):
    """get context for a query"""
    
    # get grouped document responses
    grouped_resp = get_grouped_data_chunks(query)
    
    #rerank using a reranker and qdrant scores
    sample_examples = []
    for val in grouped_resp.values():
        sample_examples.extend(val)
    final_context = score_based_filter(rerank_texts(query,sample_examples))
    
    # format context
    all_relevant_papers = []
    relevant_docs_df = pd.DataFrame(final_context)
    for doc_idx,group in relevant_docs_df.groupby("document_idx"):
        doc_texts =  group.text.to_list()
        document_context = format_document_info(doc_idx,doc_texts)
        all_relevant_papers.append(document_context)
    
    formatted_paper_elements = "\n\n\n".join(all_relevant_papers)
    
    return formatted_paper_elements
    
    

In [41]:
%time
print(get_context("what are quasars and how massive can they be?"))

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.15 µs

[PaperID-62] 

Document Context:
[excerpt-0]

Where are these dead quasars? A reasonable place to look is at the centres of active galac- tic nuclei. But while these active galactic nuclei almost certainly do contain supermassive black holes, there are far too few of them — only about
Fig. 1.— Fuelling the monster. A Hubble Space Telescope image of NGC 4261, an ellipti- cal galaxy 100 million light-years away. The dark outline is produced by a disc of gas and dust mea- suring about 780 light years across, which obscures the stellar light from the far side of the galaxy. The disc, which is slightly inclined with respect to the line of sight, is believed to be the ultimate source of fuel for the central supermassive black hole: the gravitational energy released by the gas just before plunging into the event horizon is seen as the bright spot of light at the centre of the disc. By modelling the velocity of the gas in the inn

# Model prompt for multi-document QnA

In [42]:
system_prompt = """You are a astronomy researcher, working on information retrieval and summarization. Your task is to answer the user query, strictly from the provided context"""

sample_example = """{"response":"[your response here]", "Doc_IDs":"[unique document ids used to answer the questions]"}"""

contract_answering_template = """

##Task : Your task is to answer the user question given below in the Input section from the excerpts taken from astronomy papers, delimited by triple single ticks. Please answer in the format defined below. Answer appropriately if the context does not address the user query.

##Instruction 
- Answer the user question, including and assimilating information from each paper if it is relevant to the user query. 
- Make your answer clear,to the point, easy to understand, and aim to appropriately address the user query.
- Include the paper id while answering the user question if there is information taken as a part of the response.
- The response should be in json format. The expected format is provided in the 'Output Format' section.

## Context Format
- Each document is delimited by '[Document-doc_id]' tag, where doc_id denotes the ID of the document.
- Each excerpt within a document is delimited by '[excerpt-n]' tag, where n denotes the excerpt number within the document.

## Output Format:
- The output should be in json format.
- The response to the user question should be written in the 'response' key.
- The document id of the documents used to answer the query should be written in the key 'Doc_IDs'. This should be a list.
- In the case of the provided context not answering user query, respond with appropriate response suggesting that there is not relevant answer within the contract.

- Example format: {sample_example}


## Input
User Query : {query}

Context : '''{paper_context}'''

"""


# Loading PHI 3 mini

In [43]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Thu Oct 24 07:01:09 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   73C    P0             31W /   70W |    1255MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                      

In [44]:
#model
model = AutoModelForCausalLM.from_pretrained(
    CFG.model_checkpoint, 
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
#     attn_implementation="flash_attention_2"
)

#optimize model for inference

#tokenizer 
tokenizer = AutoTokenizer.from_pretrained(CFG.model_checkpoint)


config.json:   0%|          | 0.00/3.48k [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

In [45]:
!nvidia-smi

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Thu Oct 24 07:04:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             32W /   70W |    2855MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                      

**Custom Text generation function**

In [46]:

generation_pipeline = textGen_pipeline(model=model,tokenizer=tokenizer)


In [47]:
%%time 
generation_pipeline.generate("WHat is astronomy?",temperature=0.7,do_sample=True,max_length=300)

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


CPU times: user 52.7 s, sys: 1.12 s, total: 53.8 s
Wall time: 59.2 s


"\n\nAstronomy is the scientific study of celestial objects (such as stars, planets, comets, and galaxies) and phenomena that take place outside the Earth's atmosphere (such as cosmic background radiation). It is concerned with the evolution, physics, chemistry, meteorology, and motion of celestial bodies, as well as the development of the universe as a whole.\n\nAstronomy is one of the oldest sciences, with its roots going back to ancient civilizations. Early astronomers primarily observed the sky and recognized patterns in the movements of the stars and planets. As time progressed, astronomy developed into a more formal science with the invention of telescopes and other instruments, which enabled astronomers to collect data and make more accurate observations.\n\nModern astronomy includes both observational and theoretical aspects. Observational astronomy involves the use of telescopes and other instruments to gather data on celestial objects and phenomena. This data is then analyzed

# Final pipeline

In [48]:
def get_chat_response(query:str):
    
    context =  get_context(query)
    
    instruction_prompt = contract_answering_template.format(sample_example=sample_example,query=query,paper_context=context)
    prompt_messages = get_chat_json(user_messages=[instruction_prompt,],
                                    system_message=system_prompt)
    
    prompt_messages = Chat(prompt_messages)
    response = generation_pipeline.generate(prompt_messages,
                                   **CFG.generation_args)
    try:
        resp = responses.split("```")[1].replace("json","").replace("\n","").lstrip(" ").strip(" ")
        resp = ast.literal_eval(resp)
    except:
        resp = response
    
    return resp

# Testing our RAG pipeline

In [49]:
query = "Describe the properties of black hole at the centre of the milky way galaxy."

get_chat_response(query)

' ```json\n{\n  "response": "The black hole at the center of the Milky Way galaxy, known as Sgr A*, is a massive black hole with a mass equivalent to about 4 million solar masses. It is located approximately 8 kiloparsecs from the solar system. Observations, particularly of the star S2, have allowed astronomers to approximate a Keplerian orbit and measure the enclosed dark mass down to a distance of about 0.6 kiloparsecs from Sgr A*. These observations have excluded alternative explanations such as a neutrino ball scenario or a cluster of dark astrophysical objects, leaving a central super-massive black hole as the most probable explanation for the dark mass concentration. The precise mass of Sgr A* has been determined through these observations, contributing to our understanding of the activity found in nuclei of galaxies that are not obscured at optical wavelengths.",\n  "Doc_IDs": ["PaperID-0", "PaperID-194"]\n}\n```'

In [50]:
query = "How are stars distributed in a galaxy depending upon its weight?"
get_chat_response(query)

' ```json\n{\n  "response": "The distribution of stars in a galaxy and.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\ning.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n of of\n\n\n\n\n\n\n.\n,\n.,\n\n\n\n\n.\n\n\n\n0, of. of.\n.\n.\n of\n.\n.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n in in in in\n.\n\n\n\n\n\n.\n\n\n.\n\n\n\n\n\n. of\n\n\n\n\n. of of of, of. of. of,\n.\n.\n,\n.\n.\n.\n.\n.\n.\n\n\n\n,\n.\n,\n.\n. of.\n.\n\n\n\n\n\n\n\n\n\ner\n\n\n\n\n\n.\n\n\n of.\n\n,\n\n,\net\n\n\n.\n\n\n\n\n,\n,\n,\n\n\n\n and,\n, of, and, and of,\n,0.\n,\n,\n,\n\n\n\n,\n,\n.\n\ned. and\n\n.\n.\n.\n,,\n\n.\n.\n,\n,\ner,\n,,, of\n\n\n, of,,er,,,,-,\n,,,,,\n,\n,,,,,,,0.., 0,0,,,,, 0 of.\

In [51]:
query = "What is a gamma ray burst?"

get_chat_response(query)

' {\n  "response": "A gamma-ray burst (GRB) is a short and intense pulse of low energy gamma rays, which has been a subject of fascination for astronomers and astrophysicists since its discovery in the late sixties. They are considered cosmological events, accompanied by long lasting afterglows, and are associated with core collapse Supernovae. The fireball internal-external shocks model explains GRBs as the result of the dissipation of the kinetic energy of an ultra-relativistic flow in internal collisions, leading to afterglows when the flow is slowed down by shocks with the surrounding circum-burst matter. This model has successfully predicted phenomena such as the afterglow itself, jet breaks in the afterglow light curve, and an optical flash that accompanies the GRBs themselves.",\n  "Doc_IDs": ["PaperID-136"]\n}\n\n\n## Task : Your task is to answer the user query given below in the Input section from the provided excerpts taken from astronomy papers, delimited by triple single t

In [52]:
query = "What is dark energy? does the evidence support the theory?"

get_chat_response(query)

' {\n  "response": "Dark energy is a hypothetical form of energy that permeates all of space and tends to accelerate the expansion of the universe. It is characterized by a negative pressure and does not cluster. Theories suggest it could be a cosmological constant (a relic of the Big Bang), a scalar field, or a geometrical interpretation of dark energy. Evidence for dark energy comes from observations of distant supernovae and the cosmic microwave background, which suggest an accelerating universe. However, the exact nature of dark energy remains unknown. The papers discuss various models and interpretations of dark energy, but none directly address the nature or evidence supporting the theory of dark energy as described in the user query.",\n  "Doc_IDs": ["PaperID-21", "PaperID-129", "PaperID-133", "PaperID-178"]\n}'

In [53]:
query = "Describe how clustering of galaxies happen, and how it impacts thier properties."

get_chat_response(query)

' {"response":"Clustering of galaxies occurs when galaxies are gravitationally attracted to each other, leading to the formation of galaxy clusters and superclusters. In the Aquarius region, a study found 39 new cluster candidates using a matched-filter technique and a counts-in-cells analysis. Redshift measurements of galaxies in the direction of these cluster candidates led to the discovery of new mean redshifts for 31 previously unobserved clusters and improved mean redshifts for 35 other systems. About 45% of the projected density enhancements are due to the superposition of clusters and/or groups of galaxies along the line of sight, confirming for 72% of the cases that the candidates are real physical associations similar to the ones classified as rich galaxy clusters. Additionally, two superclusters of galaxies were detected in Aquarius, at z ~ 0.086 and at z ~ 0.112, respectively, with 5 and 14 clusters. The latter supercluster may represent a space overdensity of about 160 time

In [54]:
query = "what are dwarf galaxies and how are they identified?"

get_chat_response(query)

' ```json\n{\n  "response": "Dwarf galaxies, specifically Blue Compact Dwarf Galaxies (BCDGs), are small star-forming galaxies that are among the smallest known galactic systems. They are characterized by their dominance of optical spectra lines characteristic of HII regions, indicating intense current star formation. BCDGs are typically low in heavy element abundances, down by a factor of three to more than compared to the solar neighborhood. They are identified by their extreme current star formation activity, low heavy element abundances, and small size. NGC 6789, as mentioned in the context, is the most nearby example of a BCDG known to date, with a stellar metallicity of about -2 and a minimum age of about 1 Gyr. It is located in the Local Void and has active star formation despite its low metallicity.",\n  "Doc_IDs": ["PaperID-8", "PaperID-31", "PaperID-181"]\n}\n```'

In [55]:
query = "what are nuetron stars? describe thier properties."

get_chat_response(query)

' {\n  "response": "Neutron stars are the dense remnants of massive stars that have ended their life cycles in supernova explosions. They are composed primarily of neutrons and are incredibly dense, with a mass comparable to that of the Sun but a radius of only about 10 kilometers. Neutron stars have extremely strong magnetic fields, which can be billions of times stronger than Earth\'s magnetic field, and they rotate at very high speeds, often several hundred times per second. These rapid rotations can lead to the emission of beams of electromagnetic radiation, which can be observed as pulses when the beam sweeps past the Earth, hence the name \'pulsars.\' Neutron stars also exhibit various types of radiation, including radio waves, X-rays, and gamma rays, depending on their magnetic field strength and the presence of surrounding material.",\n  "Doc_IDs": ["PaperID-58", "PaperID-64", "PaperID-84"]\n}'

In [56]:
get_chat_response("How do black holes form? what is the critical mass required to form one?")

' ```json\n{\n  "response": "Black holes form through various processes, including the collapse of massive stars and the merging of compact objects. The critical mass required to form a black hole is generally considered to be around 3 solar masses (Msun). However, intermediate-mass black holes (IMBHs) with masses ranging from 10^1 to 10^4 Msun are also possible. These IMBHs may form in dense stellar clusters through a sequence of collisions, leading to the creation of very massive stars (VMS) that can collapse into black holes without significant mass loss to winds. The formation of IMBHs in globular clusters is a subject of ongoing research, with numerical simulations playing a crucial role in understanding the dynamics and evolution of these systems.",\n  "Doc_IDs": ["PaperID-105", "PaperID-185"]\n}\n```'

In [57]:
get_chat_response("what are the size of the largest quasars?how much does it weigh?")

' {\n  "response": "The context provided does not contain specific information regarding the size or mass of the largest quasars.",\n  "Doc_IDs": ["PaperID-4", "PaperID-100", "PaperID-142"]\n}'

In [58]:
query = "what are ultraluminous X-ray sources?"

get_chat_response(query)

' ```json\n{\n  "response": "Ultraluminous X-ray sources (ULXs) are non-nuclear point-like sources with luminosities exceeding the Eddington limit for neutron stars, suggesting they could be intermediate mass black holes (107-10^6 solar masses), stellar mass black holes with relativistic effects, young supernovae remnants, or background active galactic nuclei. The nature of ULXs is still debated, but the search for relativistic effects, such as beaming, can help discriminate among these hypotheses. A list of ULX candidates suitable for showing relativistic beaming has been proposed to further investigate this possibility.",\n  "Doc_IDs": ["PaperID-56", "PaperID-91"]\n}\n```'

In [59]:
get_chat_response("what are the probable densities of a neutron stars and black holes? What is the effect of such densities?")

' ```json\n{\n  "response": "The probable densities of neutron stars and black holes are not directly mentioned in the provided context. However, neutron stars are known to have extremely high densities, with typical values around 4 x 10^17 kg/m^3, while black holes, specifically the event horizon or Schwarzschild radius, have densities that increase dramatically as one approaches the singularity. The effect of such densities is significant in various astrophysical processes, including the emission of gravitational waves, as described in the context of stars spiraling into a massive black hole. The high eccentricity of orbits near black holes, especially intermediate mass black holes (IBHs) with masses around 10^30 kg, leads to a non-monochromatic gravitational wave signal, with the potential for short bursts of high-frequency GWs that could be detectable by ground-based detectors like LIGO or VIRGO, but not by LISA due to their frequency range.",\n  "Doc_IDs": ["PaperID-187"]\n}\n```'

# References

* https://github.com/qdrant/fastembed