# RAG Inference

This notebook follows `rag_create_index.ipynb`, which builds the RAG index. Here we load that index and use it to query RAG-enabled cosmosage.

In [1]:
# load model, tokenizer, retriever
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "models/cosmosage_v2/"
#model = AutoGPTQForCausalLM.from_quantized(model_dir) # not yet supported
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16).to('cuda')
tokenizer = AutoTokenizer.from_pretrained(model_dir)
embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5')
index = FAISS.load_local("datasets/faiss_index.bin", embeddings)
retriever = index.as_retriever(search_type="similarity", search_kwargs={'k': 4})
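
For intuition, a retriever configured with `search_type="similarity"` and `k=4` returns the four stored chunks whose embeddings lie closest to the query embedding. The sketch below illustrates that idea in pure Python with toy 3-d vectors. It is not how FAISS is implemented: `cosine` and `top_k` are hypothetical helpers, and cosine similarity is an assumption chosen for illustration (the metric actually used depends on how the index was built).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy "embeddings" (illustrative values, not real model output)
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
toy_query = [1.0, 0.0, 0.0]
nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

A real index avoids this brute-force scan for large collections, but the ranking principle is the same.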

In [2]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain.chains import LLMChain
from langchain.schema.runnable import RunnablePassthrough

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.7,
    repetition_penalty=1.01,
    return_full_text=True,
    max_new_tokens=1000,
    do_sample=True,
)
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
prompt_template = """You are cosmosage, an AI assistant designed to give detailed and factual answers to questions about cosmology. Provide some relevant context before answering the question. You may find the following additional content helpful.
ADDITIONAL CONTENT: {context}
USER: {question}
ASSISTANT:"""
prompt = PromptTemplate(input_variables=["context", "question"], template=prompt_template)
llm_chain = LLMChain(llm=llm, prompt=prompt)
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
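
Note that the chain above passes the retriever's output, a list of `Document` objects, straight into the `{context}` slot, so the prompt likely receives the stringified list, metadata included. One possible alternative is to join only the text of each chunk. The sketch below shows that with a stand-in `Doc` class and a hypothetical `format_docs` helper; neither is part of the notebook, and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of each retrieved chunk, dropping metadata."""
    return "\n\n".join(d.page_content for d in docs)

# Placeholder chunks, not real retrieval results
toy_docs = [
    Doc("first chunk", {"source": "paper_1.mmd"}),
    Doc("second chunk", {"source": "paper_2.mmd"}),
]
context_text = format_docs(toy_docs)
print(context_text)
```

In the chain, this could plausibly be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`. Whether it improves cosmosage's answers is untested here.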


In [3]:
question = "What is the CLASS VPM made of?"

In [4]:
result = llm_chain.invoke({"context":"", "question": question})
print(result['text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The CLASS VPM (variable-delay polarization modulator) consists of a wire grid polarizer in front of a moving mirror, creating a phase delay between polarization states parallel and perpendicular to the direction of the wire grid. As the mirror moves, the phase delay between the two polarization states changes, modulating the Stokes U parameter.
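
To see what the model received in this no-context call, the template can be rendered by hand. The sketch below uses plain `str.format` on the same template string, on the assumption that it matches what `PromptTemplate` does for simple named placeholders like these; it does not call LangChain or the model.

```python
prompt_template = """You are cosmosage, an AI assistant designed to give detailed and factual answers to questions about cosmology. Provide some relevant context before answering the question. You may find the following additional content helpful.
ADDITIONAL CONTENT: {context}
USER: {question}
ASSISTANT:"""

# Same inputs as the llm_chain.invoke call: empty context, fixed question
rendered = prompt_template.format(
    context="",
    question="What is the CLASS VPM made of?",
)
print(rendered)
```

With an empty `context`, the `ADDITIONAL CONTENT:` line is still present but blank, so this answer comes entirely from what the model learned in training.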


In [5]:
result_rag = rag_chain.invoke(question)
print(result_rag['text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The CLASS VPM (Variable-Delay Polarization Modulator) consists of a wire array backed by a lightweight aluminum honeycomb mirror.
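
The dictionary returned by `rag_chain.invoke` also carries the retrieved documents under the `context` key, each with a `source` entry in its metadata. A minimal sketch of listing which files contributed to an answer is below. `unique_sources` is a hypothetical helper, and the toy objects stand in for LangChain `Document`s with placeholder file names.

```python
from types import SimpleNamespace

def unique_sources(docs):
    """Return the distinct metadata['source'] values, in first-seen order."""
    seen = []
    for d in docs:
        src = d.metadata.get("source", "<unknown>")
        if src not in seen:
            seen.append(src)
    return seen

# Placeholder documents, not real retrieval results
toy_context = [
    SimpleNamespace(page_content="chunk A", metadata={"source": "paper_1.mmd"}),
    SimpleNamespace(page_content="chunk B", metadata={"source": "paper_2.mmd"}),
    SimpleNamespace(page_content="chunk C", metadata={"source": "paper_1.mmd"}),
]
sources = unique_sources(toy_context)
print(sources)
```

With the real result, the call would be `unique_sources(result_rag['context'])`.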


In [6]:
result_rag

{'context': [Document(page_content='Each CLASS VPM, see Figure (b)b, consists of a wire array backed by a lightweight aluminum honeycomb mirror. During modulation, the wire array remains fixed and the mirror translates towards and away from the\n\nFigure 8: (a) Photograph of a secondary reflector. (b) The mechanical design for the VPM consists of a resonant four-bar linkage flexure attached to the mirror. A reaction-canceling mass will be moved in opposition to the mirror to reduce vibrations that could feed into detector timestreams.', metadata={'source': 'datasets/arxiv_mmd/1408.4788.mmd'}),
  Document(page_content='The CLASS VPMs are 600 mm in diameter and are operated at ambient temperature. Because of the longer wavelengths (CLASS operates down to 38 GHz), a four-bar-linkage flexure was used in place of the single-material flexures. A voice coil is used for actuation and an optical encoder is used to measure the distance and close the feedback loop. To fully cover the \\(Q-U\\) sp