
Use case: ML paper titles are often long and too technical for a general audience. This notebook uses an open-source RAG pipeline to generate shorter titles, with previously created short titles serving as style examples.

# Libraries used
- ChromaDB: vector store.
- Sentence Transformers: generates the embeddings.
- pandas: reads the CSV file.
- OpenAI API key (optional): used to compare plain query results with RAG responses.

# Prompting Instructions


In [None]:
%%capture
# install libraries (the %%capture cell magic must be the first line of the cell)
!pip install chromadb sentence-transformers pandas python-dotenv openai

import json
import chromadb
import pandas as pd
import random
from tqdm import tqdm


In [None]:

ml_papers = pd.read_csv("/content/ml-potw-10232023.csv", header=0, encoding="utf-8")

# drop rows with a missing title or description

ml_papers = ml_papers.dropna(subset=["Title", "Description"])

# convert each row to a dict (one record per paper)

ml_papers_dict = ml_papers.to_dict(orient="records")

ml_papers_dict[0]


{'Title': 'Llemma',
 'Description': 'an LLM for mathematics which is based on continued pretraining from Code Llama on the Proof-Pile-2 dataset; the dataset involves scientific paper, web data containing mathematics, and mathematical code; Llemma outperforms open base models and the unreleased Minerva on the MATH benchmark; the model is released, including dataset and code to replicate experiments.',
 'PaperURL': 'https://arxiv.org/abs/2310.10631',
 'TweetURL': 'https://x.com/zhangir_azerbay/status/1714098025956864031?s=20',
 'Abstract': 'We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finet

# Data Pre-processing, Embeddings, and Storing Documents in the Vector Database

In [None]:
from sentence_transformers import SentenceTransformer
from chromadb import Documents, EmbeddingFunction, Embeddings

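# The model is public, so no Hugging Face token is needed.
# Never hard-code access tokens in a notebook; load them from Colab secrets
# or an environment variable instead.
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')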

# initialize chroma db directory and client
client = chromadb.PersistentClient(path="/content/chromadb")

# create collections

collection = client.get_or_create_collection(name="ml_papers")

# generate embeddings in batches

batch_size = 50

# loop over the batches and add the embeddings to the vector store

for i in tqdm(range(0, len(ml_papers_dict), batch_size)):
    batch = ml_papers_dict[i:i+batch_size]

    # replace empty titles with "No title"

    batch_titles = [str(paper['Title']) if str(paper['Title']).strip() != "" else "No title" for paper in batch]

    # build IDs from the row position so that re-running the cell overwrites
    # existing entries instead of inserting duplicates under new random IDs

    batch_ids = [f"paper-{i + j}" for j in range(len(batch))]

    # keep the paper URL as metadata for each title

    batch_metadata = [{"PaperURL": str(paper['PaperURL'])} for paper in batch]

    # generate embeddings for batch_titles

    embeddings = embedding_model.encode(batch_titles)

    # insert into vector DB

    collection.upsert(
        documents=batch_titles,
        embeddings=embeddings.tolist(),
        metadatas=batch_metadata,
        ids=batch_ids
    )


100%|██████████| 9/9 [00:03<00:00,  2.33it/s]
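`upsert` only overwrites an existing entry when its ID is the same from one run to the next. One way to get IDs that stay stable even if the CSV is reordered is to derive them from the content itself. The sketch below is a minimal illustration using only the standard library; `stable_id` and the sample records are hypothetical and not part of the notebook.

```python
import hashlib


def stable_id(title: str, url: str = "") -> str:
    """Return a deterministic ID derived from the title and (optional) URL."""
    key = f"{title.strip().lower()}|{url.strip()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


# Hypothetical records, used only for illustration.
papers = [
    {"Title": "Llemma", "PaperURL": "https://arxiv.org/abs/2310.10631"},
    {"Title": "Llemma", "PaperURL": "https://arxiv.org/abs/2310.10631"},
    {"Title": "LLMs for Software Engineering", "PaperURL": ""},
]

# Keying a dict by the stable ID drops exact duplicates before insertion.
unique_papers = {stable_id(p["Title"], p["PaperURL"]): p for p in papers}

print(len(papers), len(unique_papers))  # prints: 3 2
```

With content-derived IDs, identical rows collapse to one entry, so it is worth dropping duplicates before each insert call, as the dict above does.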


In [None]:
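# Test the retriever
# query_texts are embedded by the collection's own embedding function
# (Chroma's default here, since none was set when the collection was created)

retriever = collection.query(
    query_texts=["Software Engineering"],
    n_results=2,
)

print(retriever["documents"])
print(retriever["distances"])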

[['LLMs for Software Engineering', 'LLMs for Software Engineering']]
[[0.8221281170845032, 0.8221281170845032]]
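The distances printed above are easier to interpret with the metric in mind. Assuming the collection uses a squared-L2 distance and the embeddings are unit length (both are assumptions here, not something the notebook verifies), a distance d corresponds to a cosine similarity of 1 - d/2. A minimal sketch with toy vectors:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def cosine_from_squared_l2(distance):
    """Convert a squared-L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


# Toy vectors, used only for illustration.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

d = squared_l2(a, b)
print(round(cosine(a, b), 6) == round(cosine_from_squared_l2(d), 6))  # prints: True
```

Under those assumptions, the distance of about 0.822 reported above would correspond to a cosine similarity of roughly 0.59.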




In [None]:
from openai import OpenAI
from google.colab import userdata


# get chat completions
def get_completion(prompt, model="gpt-3.5-turbo"):
    # the API key is read from Colab secrets under the name 'Open_AI_Key'
    client = OpenAI(api_key=userdata.get('Open_AI_Key'))
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=50,
    )
    return completion.choices[0].message.content




In [None]:
user_query = "S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models"

# query for user query
results = collection.query(
    query_texts=[user_query],
    n_results=10,
)

short_titles = '\n'.join(results['documents'][0])
print(short_titles)


Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
ChemCrow: Augmenting large-language models with chemistry tools
A Survey of Large Language Models
LLaMA: Open and Efficient Foundation Language Models
SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot
REPLUG: Retrieval-Augmented Black-Box Language Models
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Auditing large language models: a three-layered approach
Fine-Tuning Language Models with Just Forward Passes
DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
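The earlier retriever test returned the same title twice, so the list of retrieved titles may contain repeats that waste prompt space. A minimal sketch of removing them while preserving the ranking order; `dedupe_keep_order` is a hypothetical helper, not part of the notebook.

```python
def dedupe_keep_order(titles):
    """Drop repeated titles (case-insensitive) while keeping the first occurrence."""
    seen = set()
    unique = []
    for title in titles:
        key = title.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(title.strip())
    return unique


# Sample input mirroring the duplicate seen in the retriever test.
retrieved = [
    "LLMs for Software Engineering",
    "LLMs for Software Engineering",
    "A Survey of Large Language Models",
]

print(dedupe_keep_order(retrieved))
# prints: ['LLMs for Software Engineering', 'A Survey of Large Language Models']
```

In the notebook this could be applied to `results['documents'][0]` before joining the titles into `short_titles`.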


In [None]:
# prompt templates

prompt_template =  f'''[INS]

Your main task is to generate 5 SUGGESTED_PAPER_TITLE entries based on PAPER_TITLE.

Mimic the style and length of SHORT_TITLES.

PAPER_TITLE :{user_query}
SHORT_TITLES :{short_titles}
SUGGESTED_PAPER_TITLE :

[/INS]
'''

# query the collection again, this time with the full prompt as the query text
results = collection.query(
    query_texts=[prompt_template],
    n_results=10,
)

short_titles = '\n'.join(results['documents'][0])
print(short_titles)

# Optional (requires an OpenAI key): send the prompt to the LLM.
# get_completion already returns a string, so it can be printed directly.
# suggested_titles = get_completion(prompt_template)
# print(suggested_titles)

# Model Suggestions:

# 1. S3Eval: A Comprehensive Evaluation Suite for Large Language Models
# 2. Synthetic and Scalable Evaluation for Large Language Models
# 3. Systematic Evaluation of Large Language Models with S3Eval
# 4. S3Eval: A Synthetic and Scalable Approach to Language Model Evaluation
# 5. S3Eval: A Synthetic and Scalable Evaluation Suite for Large Language Models

ChemCrow: Augmenting large-language models with chemistry tools
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
LLaMA: Open and Efficient Foundation Language Models
SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot
Eight Things to Know about Large Language Models
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
A Survey of Large Language Models
REPLUG: Retrieval-Augmented Black-Box Language Models
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
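Because `prompt_template` is an f-string, `user_query` and `short_titles` are filled in only once, when the cell runs. A small helper can rebuild the prompt for each new query instead. The sketch below is a hypothetical illustration: `build_prompt`, its parameters, and the prompt wording are assumptions, not the notebook's own prompt.

```python
def build_prompt(paper_title, example_titles, n_suggestions=5):
    """Assemble a title-shortening prompt from a query and retrieved example titles."""
    examples = "\n".join(f"- {title}" for title in example_titles)
    return (
        f"Generate {n_suggestions} short titles for the paper below.\n"
        "Match the style and length of the example titles.\n\n"
        f"PAPER_TITLE: {paper_title}\n"
        f"EXAMPLE_TITLES:\n{examples}\n"
        "SUGGESTED_PAPER_TITLES:"
    )


prompt = build_prompt(
    "S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models",
    ["A Survey of Large Language Models", "LLaMA: Open and Efficient Foundation Language Models"],
)
print(prompt)
```

The resulting string can be passed to `get_completion` in place of `prompt_template`.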


# Summary:

The short titles generated by the LLM could be improved further with fine-tuning. Overall, RAG is a key building block for taking generative AI projects to production.