# Survey Existing Research and Reproduce Available Solutions

[Here](https://docs.google.com/document/d/1m6uXRaL1lH2Z0_-un1YRI107AuSMUeoYCemiAFAVfFU/edit?usp=sharing) is a brief Google doc surveying existing solutions. Below is a reproduction of one existing solution - [OpenAI's tutorial on creating a question answering system](https://platform.openai.com/docs/tutorials/web-qa-embeddings).

# Baseline RAG Model

This notebook implements a baseline RAG model based on [OpenAI's tutorial on creating a question answering system](https://platform.openai.com/docs/tutorials/web-qa-embeddings).

It uses the raw plaintext generated from `paper_scraper.py` and a naive chunking strategy that simply splits the papers if they are longer than OpenAI's embedding model max input (~8,000 tokens).

In [1]:
import numpy as np
import pandas as pd
import re
import tiktoken

In [2]:
papers_with_n_tokens = pd.read_csv('papers_with_n_tokens.csv')
papers_with_n_tokens.head()

Unnamed: 0,year,name,text,n_tokens
0,1987,Bit-Serial Neural Networks,573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan...,9136.0
1,1987,Connectivity Versus Entropy,1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S...,5220.0
2,1987,The Hopfield Model with Multi-Level Neurons,278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N...,4445.0
3,1987,How Neural Nets Work,442 \n\nAlan Lapedes \nRobert Farber \n\nThe...,11220.0
4,1987,Spatial Organization of Neural Networks: A Pro...,740 \n\nSPATIAL ORGANIZATION OF NEURAL NEn...,8575.0


## Chunking

First, let's chunk the papers down so they will all fit in OpenAI's embedding model max input (~8,000 tokens).

This function is adapted from [OpenAI's tutorial](https://platform.openai.com/docs/tutorials/web-qa-embeddings).

In [3]:
# Function to split a row into chunks of a maximum number of tokens
def split_row(row, tokenizer, max_tokens):

    # Split the text into sentences
    sentences = re.split(r'\.\s+', row['text'])

    # Get the number of tokens for each sentence
    n_tokens = [len(tokenizer.encode(' ' + sentence)) for sentence in sentences]

    chunks = []
    tokens_so_far = 0
    chunk = []

    # Loop through the sentences and tokens joined together in a tuple
    for sentence, tokens in zip(sentences, n_tokens):

        # If the number of tokens so far plus the number of tokens in the current sentence is greater
        # than the max number of tokens, then add the chunk to the list of chunks and reset
        # the chunk and tokens so far
        if tokens_so_far + tokens > max_tokens:
            chunks.append('. '.join(chunk) + '.')
            chunk = []
            tokens_so_far = 0

        # If the number of tokens in the current sentence is greater than the max number of
        # tokens, go to the next sentence
        if tokens > max_tokens:
            continue

        # Otherwise, add the sentence to the chunk and add the number of tokens to the total
        chunk.append(sentence)
        tokens_so_far += tokens + 1
    
    # Append the last chunk
    chunks.append('. '.join(chunk) + '.')

    # Loop through the chunks and create new rows out of them (copying the data of the non-text columns)
    new_rows = []
    for chunk in chunks:
        new_row = row.copy()
        new_row['text'] = chunk
        new_row['n_tokens'] = len(tokenizer.encode(new_row['text']))
        new_rows.append(new_row)

    return new_rows

In [4]:
# Function to apply split_row to every row of a DataFrame and return a new DataFrame containing the chunks
def chunk_df(df, tokenizer, max_tokens = 8000):
    split_rows = []
    for _, row in df.iterrows():
        split_rows.extend(split_row(row, tokenizer, max_tokens))

    # Create a new DataFrame from the split rows
    return pd.DataFrame(split_rows)

In [5]:
tokenizer = tiktoken.encoding_for_model('text-embedding-3-small')
chunked_papers = chunk_df(papers_with_n_tokens, tokenizer)
chunked_papers.head(10)


Unnamed: 0,year,name,text,n_tokens
0,1987,Bit-Serial Neural Networks,573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan...,7959
0,1987,Bit-Serial Neural Networks,"Graf, and J. S. Denker, ""Microelectronic I...",892
1,1987,Connectivity Versus Entropy,1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S...,5116
2,1987,The Hopfield Model with Multi-Level Neurons,278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N...,4337
3,1987,How Neural Nets Work,442 \n\nAlan Lapedes \nRobert Farber \n\nThe...,7942
3,1987,How Neural Nets Work,"One never needs more than two layers, o...",2967
4,1987,Spatial Organization of Neural Networks: A Pro...,740 \n\nSPATIAL ORGANIZATION OF NEURAL NEn...,7980
4,1987,Spatial Organization of Neural Networks: A Pro...,"Weisbuch, ""Random \n\nBoolean Networks"", Cy...",380
5,1987,A Neural-Network Solution to the Concentrator ...,775 \n\nA NEURAL-NETWORK SOLUTION TO THE C...,5343
6,1987,LEARNING BY STATE RECURRENCE DETECTION,642 \n\nLEARNING BY ST ATE RECURRENCE DETECfIO...,6030


In [6]:
chunked_papers.n_tokens.describe()

count    36198.000000
mean      5589.083264
std       2574.888267
min          1.000000
25%       3366.000000
50%       6630.000000
75%       7924.000000
max       8001.000000
Name: n_tokens, dtype: float64

The papers have now been naively chunked so that each chunk (or row of the dataset) is below the max number of tokens that the OpenAI embeddings API accepts (8,191 tokens).
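The `min` of 1 token in the summary above is likely an artifact of `split_row`: it appends `'. '.join(chunk) + '.'` even when `chunk` is empty (for example, when the very first sentence is longer than `max_tokens`), which produces a chunk containing only `.`. Below is a minimal sketch of the same greedy packing that skips empty chunks. `WhitespaceTokenizer` and `pack_sentences` are hypothetical names used for illustration, and the whitespace tokenizer only stands in for the `tiktoken` encoding so the example runs offline.

```python
import re

class WhitespaceTokenizer:
    """Hypothetical stand-in for a tiktoken encoding: one token per whitespace-separated word."""
    def encode(self, text):
        return text.split()

def pack_sentences(text, tokenizer, max_tokens):
    """Greedily pack sentences into chunks, never emitting an empty chunk."""
    chunks, current, tokens_so_far = [], [], 0
    for sentence in re.split(r'\.\s+', text):
        if not sentence.strip():
            continue  # nothing to add for an empty sentence
        n = len(tokenizer.encode(' ' + sentence))
        if n > max_tokens:
            continue  # same policy as split_row: a sentence that cannot fit is dropped
        # Close the current chunk only if it actually contains something
        if tokens_so_far + n > max_tokens and current:
            chunks.append('. '.join(current) + '.')
            current, tokens_so_far = [], 0
        current.append(sentence)
        tokens_so_far += n + 1
    if current:
        chunks.append('. '.join(current) + '.')
    return chunks

tokenizer = WhitespaceTokenizer()
print(pack_sentences("a b c. d e. f g h i", tokenizer, max_tokens=4))
```

Because the recorded embeddings were generated from the original chunks, adopting a change like this would also mean regenerating `papers_with_embeddings.csv`.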

## Embedding

Because embedding ~36,000 documents takes a long time, this notebook just loads the embeddings from a saved file. Here is the code that was used to generate the embeddings:

```python
from openai import OpenAI
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    # Return None on failure so a single bad request doesn't stop the whole run
    try:
        embedding = client.embeddings.create(input=[text], model=model).data[0].embedding
    except Exception:
        embedding = None
    return embedding

chunked_papers['embedding'] = chunked_papers.text.apply(get_embedding)
chunked_papers.to_csv('papers_with_embeddings.csv', index=False, header=True)
```
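The code above sends one request per chunk. Since `input` is already passed as a list, several chunks could be sent per request instead. The following is a rough sketch of that idea, not the code that produced the saved file: `embed_in_batches` is a hypothetical helper, `batch_size=16` is an arbitrary assumption, and request-size and rate limits should be checked against the current API documentation before relying on it.

```python
def embed_in_batches(client, texts, model="text-embedding-3-small", batch_size=16):
    """Embed `texts` one batch at a time; returns one embedding (or None) per input text."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            response = client.embeddings.create(input=batch, model=model)
            # Order by each item's index so results line up with the inputs
            ordered = sorted(response.data, key=lambda item: item.index)
            embeddings.extend(item.embedding for item in ordered)
        except Exception as error:
            # Mirror get_embedding: record None for every text in a failed batch
            print(f"Batch starting at position {start} failed: {error}")
            embeddings.extend([None] * len(batch))
    return embeddings
```

With the notebook's objects this would be called as `embed_in_batches(client, chunked_papers.text.tolist())`, and the returned list assigned to the `embedding` column.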

In [7]:
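import ast

papers_with_embeddings = pd.read_csv('papers_with_embeddings.csv')
# literal_eval parses the stored list strings without executing arbitrary code
papers_with_embeddings['embedding'] = papers_with_embeddings['embedding'].apply(ast.literal_eval).apply(np.array)
papers_with_embeddings.head()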

Unnamed: 0,year,name,text,n_tokens,embedding
0,1987,Bit-Serial Neural Networks,573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan...,7959,"[0.007057549431920052, 0.022557897493243217, -..."
1,1987,Bit-Serial Neural Networks,"Graf, and J. S. Denker, ""Microelectronic I...",892,"[-0.015091345645487309, -0.006406506057828665,..."
2,1987,Connectivity Versus Entropy,1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S...,5116,"[0.014461712911725044, -0.022812657058238983, ..."
3,1987,The Hopfield Model with Multi-Level Neurons,278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N...,4337,"[0.017370466142892838, -0.003854247974231839, ..."
4,1987,How Neural Nets Work,442 \n\nAlan Lapedes \nRobert Farber \n\nThe...,7942,"[0.006703744176775217, -0.002929536160081625, ..."


## Retrieving relevant information

Now we'll write a function to retrieve the paper chunks that are most relevant to the user's question. We'll compute the _cosine similarity_ between the embedding of the user's question and the embedding of every paper chunk.

We'll include as many chunks as fit into the context window of OpenAI's `gpt-4o-mini` (128,000 tokens), leaving some buffer for the padding between chunks and for the user's question, which brings the budget down to ~127,000 tokens.

This function is adapted from [OpenAI's tutorial](https://platform.openai.com/docs/tutorials/web-qa-embeddings).

In [8]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
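
For reference, the quantity computed by `cosine_similarity` for two embedding vectors $a$ and $b$ is:

$$
\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
$$

The value lies between -1 and 1, and larger values mean the two vectors point in more similar directions. This is why the retrieval function sorts in descending order, even though the column it stores the values in is named `distance`.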

In [9]:
def retrieve_relevant_chunks(client, df, question, max_len=127000):
    """
    Return a DataFrame of the chunks whose tokens fit within the `max_len` provided.
    """

    # Get the embeddings for the question
    q_embedding = client.embeddings.create(input=question, model='text-embedding-3-small').data[0].embedding

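    # Compute the cosine similarity between the question and each chunk
    # (the column is named 'distance', but higher values mean more similar)
    df['distance'] = df.embedding.apply(lambda x: cosine_similarity(q_embedding, x))

    # Sort from most to least similar
    df.sort_values('distance', ascending=False, inplace=True)

    # Account for padding at the end of each chunk (note: this updates df in place on every call)
    df['n_tokens'] += 4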

    # Create an expanding window sum of tokens
    df['expanding_sum'] = df.n_tokens.expanding().sum()

    # Return rows whose expanding sum is less than the max length of the context
    return df[df['expanding_sum'] <= max_len]
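
One thing to be aware of: `retrieve_relevant_chunks` modifies the DataFrame it is given. It sorts in place and adds 4 to `n_tokens` on every call, so the padding accumulates across queries (in the outputs that follow, chunk 24390 is reported with 1651 tokens for the first question and 1655 for the second). Below is a minimal sketch of a variant that leaves its input untouched. The names `select_chunks`, `similarity`, `padded_tokens`, and `padding` are hypothetical, and the question embedding is passed in so the example runs without an API call.

```python
import numpy as np
import pandas as pd

def select_chunks(df, q_embedding, max_len=127000, padding=4):
    """
    Return a new DataFrame with the most similar chunks that fit within `max_len` tokens.
    The input `df` is left unchanged.
    """
    q = np.asarray(q_embedding, dtype=float)

    # Work on a copy so the caller's DataFrame keeps its token counts and row order
    ranked = df.copy()
    ranked['similarity'] = ranked['embedding'].apply(
        lambda e: float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    )
    ranked = ranked.sort_values('similarity', ascending=False)

    # Keep the padded token count in its own column instead of overwriting n_tokens
    ranked['padded_tokens'] = ranked['n_tokens'] + padding
    ranked['expanding_sum'] = ranked['padded_tokens'].cumsum()

    return ranked[ranked['expanding_sum'] <= max_len]

# Toy data: three chunks with two-dimensional "embeddings"
toy = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'n_tokens': [10, 20, 30],
    'embedding': [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
})
print(select_chunks(toy, [1.0, 0.0], max_len=50)['text'].tolist())
```

In the notebook, `q_embedding` would come from the same `client.embeddings.create(...)` call used in `retrieve_relevant_chunks`.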

Let's test this on a few questions.

In [10]:
from openai import OpenAI
client = OpenAI()

In [11]:
relevant_paper_chunks = retrieve_relevant_chunks(
                    client,
                    papers_with_embeddings,
                    "Can you summarize the key findings of 'Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder'?")

print(f'Number of chunks selected: {len(relevant_paper_chunks)}')


Number of chunks selected: 28


In [12]:
print('Most relevant selected chunks:')
relevant_paper_chunks.head()

Most relevant selected chunks:


Unnamed: 0,year,name,text,n_tokens,embedding,distance,expanding_sum
28409,2023,Modelling Cellular Perturbations with the Spar...,"Finally, the formal link to\nsparse mechanism ...",2505,"[-0.012617362663149834, 0.032124001532793045, ...",0.615885,2505.0
28408,2023,Modelling Cellular Perturbations with the Spar...,Modelling Cellular Perturbations with the Spar...,7975,"[0.004742556717246771, 0.0179368294775486, 0.0...",0.596355,10480.0
24390,2022,Alleviating Adversarial Attacks on Variational...,Alleviating Adversarial Attacks on Variational...,1651,"[0.005380531772971153, 0.011847461573779583, 0...",0.571067,12131.0
18268,2020,The Autoencoding Variational Autoencoder,AVAE\nSE5\nSE20 [5]\nSE5 AVAE\nAVAE SS\nTask /...,3162,"[0.01375912968069315, -0.0035207539331167936, ...",0.563902,15293.0
17582,2020,Evidential Sparsification of Multimodal Latent...,Direct sampling metrics were\ncomputed over 20...,3231,"[0.0006273535545915365, 0.00663420045748353, 0...",0.549324,18524.0


In [13]:
print('Least relevant selected chunks:')
relevant_paper_chunks.tail()

Least relevant selected chunks:


Unnamed: 0,year,name,text,n_tokens,embedding,distance,expanding_sum
17648,2020,Constraining Variational Inference with Geomet...,Although we accept that use of “vanilla” VAEs ...,2741,"[-0.007083908189088106, 0.013144437223672867, ...",0.520358,106996.0
3612,2007,On Sparsity and Overcompleteness in Image Models,On Sparsity and Overcompleteness in Image Mode...,7350,"[-0.015349962748587132, 0.024404319003224373, ...",0.51938,114346.0
26099,2022,Exploring the Latent Space of Autoencoders wit...,This may be explained by the DCI-D metric\nNam...,6698,"[-0.017867492511868477, 0.018746009096503258, ...",0.519336,121044.0
16887,2020,Dirichlet Graph Variational Autoencoder,81–88. [3]\nSophie Burkhardt and Stefan Kramer...,1990,"[-0.020206132903695107, -0.0012961796019226313...",0.519259,123034.0
26100,2022,Pythae: Unifying Generative Autoencoders in Py...,Pythae: Unifying Generative Autoencoders in Py...,1648,"[0.008975673466920853, -0.0222366601228714, 0....",0.519003,124682.0


The top 2 chunks came from the paper referenced in the user's question, which is great! The rest of the chunks seem to be generally related (about Autoencoders or sparse data), which will be important if the user asks a question that might require many papers to answer.

Let's do another test.

In [14]:
relevant_paper_chunks = retrieve_relevant_chunks(
                    client,
                    papers_with_embeddings,
                    "How do the performance of different types of variational auto-encoders (VAEs) compare?")

print(f'Number of chunks selected: {len(relevant_paper_chunks)}')

Number of chunks selected: 28


In [15]:
print('Most relevant selected chunks:')
relevant_paper_chunks.head()

Most relevant selected chunks:


Unnamed: 0,year,name,text,n_tokens,embedding,distance,expanding_sum
24390,2022,Alleviating Adversarial Attacks on Variational...,Alleviating Adversarial Attacks on Variational...,1655,"[0.005380531772971153, 0.011847461573779583, 0...",0.606672,1655.0
26100,2022,Pythae: Unifying Generative Autoencoders in Py...,Pythae: Unifying Generative Autoencoders in Py...,1652,"[0.008975673466920853, -0.0222366601228714, 0....",0.593998,3307.0
29780,2023,AdaVAE: Bayesian Structural Adaptation for Var...,AdaVAE: Bayesian Structural Adaptation for\nVa...,7857,"[0.02144601009786129, 0.015637459233403206, 0....",0.572955,11164.0
25839,2022,A Geometric Perspective on Variational Autoenc...,MODEL\nMNIST (16)\nSVHN (16)\nCIFAR 10 (32)\nC...,4747,"[-0.02092185989022255, 0.017854847013950348, 0...",0.572759,15911.0
32154,2023,Isometric Quotient Variational Auto-Encoders f...,Isometric Quotient Variational Auto-Encoders f...,2800,"[0.004320634063333273, 0.0066658128052949905, ...",0.568576,18711.0


In [16]:
print('Least relevant selected chunks:')
relevant_paper_chunks.tail()

Least relevant selected chunks:


Unnamed: 0,year,name,text,n_tokens,embedding,distance,expanding_sum
31251,2023,Complexity Matters: Rethinking the Latent Spac...,Complexity Matters: Rethinking the Latent Spac...,7946,"[0.006931154523044825, -0.00225873407907784, 0...",0.531068,94520.0
10240,2017,VAE Learning via Stein Variational Gradient De...,VAE Learning via Stein Variational Gradient De...,7931,"[-0.0016620290698483586, 0.014140750281512737,...",0.526778,102451.0
13510,2019,Copulas as High-Dimensional Generative Models:...,Copulas as High-Dimensional Generative Models:...,7894,"[0.02679332345724106, -0.0021856760140508413, ...",0.525333,110345.0
15808,2019,D-VAE: A Variational Autoencoder for Directed ...,D-VAE: A Variational Autoencoder for Directed\...,7939,"[0.002114873146638274, -0.0012863284209743142,...",0.522984,118284.0
10289,2017,Adversarial Symmetric Variational Autoencoder,Adversarial Symmetric Variational Autoencoder\...,7928,"[0.03652070835232735, 0.007922451011836529, 0....",0.522859,126212.0


The question was about a certain class of models (variational auto-encoders (VAEs)) - so selecting 28 chunks all discussing different types of this model seems appropriate for generating an answer comparing the different types of this model.

Next, let's look at generating a response based on the selected chunks.

## Generating the Retrieval-Augmented Response

Now, we'll include the retrieved paper chunks as context to the user's question, and send the user's question and the context to OpenAI's generative model - specifying that the model should only use the provided context in generating the response.

This function has been adapted from [OpenAI's tutorial](https://platform.openai.com/docs/tutorials/web-qa-embeddings).

In [17]:
def answer_question(client, df, question, max_len=127000):
    """
    Answer a question based on the most similar context from the dataframe texts
    """
    context = '\n\n###\n\n'.join(retrieve_relevant_chunks(client, df, question, max_len=max_len)['text'].tolist())

    try:
        # Create a chat completion using the question and context
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\n"},
                {"role": "user", "content": f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        print(e)
        return ""
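
To inspect exactly what is sent to the model without making an API call, the prompt assembly can be pulled into its own function. This is a small sketch that reuses the system prompt and separators from `answer_question`; the names `SYSTEM_PROMPT` and `build_messages` are hypothetical.

```python
SYSTEM_PROMPT = (
    "Answer the question based on the context below, and if the question "
    "can't be answered based on the context, say \"I don't know\"\n\n"
)

def build_messages(chunks, question, separator='\n\n###\n\n'):
    """Assemble the chat messages from the retrieved chunk texts and the user's question."""
    context = separator.join(chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:"},
    ]

messages = build_messages(['First chunk text', 'Second chunk text'], 'What is a VAE?')
print(messages[1]['content'])
```

Separating this step would also make it easier to compare prompt variants on the same retrieved chunks.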

Let's try it out with the same questions we used in the previous section.

In [18]:
print(answer_question(client,
                      papers_with_embeddings,
                      "Can you summarize the key findings of 'Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder'?"))

The paper "Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder" proposes the Sparse Additive Mechanism Shift Variational Autoencoder (SAMS-VAE) to analyze cellular perturbation data. The key findings of the study include:

1. **Model Development**: The SAMS-VAE combines principles from generative modeling and sparse mechanisms to disentangle the effects of basal and perturbation states in cellular data, facilitating a better understanding of biological mechanisms.

2. **Improved Predictive Performance**: SAMS-VAE outperforms traditional models by including a sparsity mechanism that enhances predictive accuracy and allows for more robust inference of latent variables related to gene expression and cellular responses.

3. **Factorization of Variation**: The proposed model enables the effective factorization of variation in biological data, isolating the specific contributions of different perturbations on the observed cellular responses.

4. 

Looks pretty good!

Let's try another.

In [19]:
print(answer_question(client,
                      papers_with_embeddings,
                      "How do the performance of different types of variational auto-encoders (VAEs) compare?"))

The performance of different types of variational auto-encoders (VAEs) can vary significantly based on their architectures, training methodologies, and the specific datasets they are applied to. Several studies and experiments have highlighted the distinctions in performance among various VAE models. Here are some key observations:

1. **Standard VAE**: Typically serves as a baseline, but often generates blurry images and exhibits limitations in terms of reconstruction quality and sample diversity.

2. **β-VAE**: Incorporates a hyperparameter (β) that can adjust the balance between reconstruction and the KL divergence term, leading to better disentangled representations but may still suffer from blurry reconstructions.

3. **Variations like LVAE, SkipVAE, and NVAE**: These models enhance the architecture of the VAE to improve generative performance. For instance, NVAE introduces more complex networks with skip connections, which generally result in sharper images and better adherence t

Sounds pretty good, but that answer did not mention the SAMS-VAE model we had explored in the earlier question. Let's ask about that model specifically.

In [20]:
print(answer_question(client,
                      papers_with_embeddings,
                      "How does the performance of the Sparse Additive Mechanism Shift Variational Autoencoder (SAMS-VAE) compare to other types of variational auto-encoders (VAEs)?"))

The performance of the Sparse Additive Mechanism Shift Variational Autoencoder (SAMS-VAE) is demonstrated to outperform comparable models in terms of generalization across both in-distribution and out-of-distribution tasks. This includes improved predictive capabilities and the ability to recover factors that correlate strongly with known biological mechanisms when applied to single-cell RNA-sequencing datasets. SAMS-VAE combines sparse perturbation-specific latent effects, perturbation-independent natural variations of cells, and additive composition of perturbation effects into a joint model, leading to insights that potentially surpass those provided by other types of VAEs. Specific comparisons to other VAE variants, such as CPA-VAE and SVAE+, emphasize SAMS-VAE's superior performance in various quantitative and qualitative evaluations, indicating its efficacy in modeling cellular perturbations.


Now let's see what happens if we ask a question outside the scope of the NeurIPS conference.

In [21]:
print(answer_question(client,
                      papers_with_embeddings,
                      "What battles took place in New York City in the American Revolution?"))

I don't know


Well that's pretty good! It doesn't answer a question completely unrelated to the content of the NeurIPS conference.

Let's try a super broad question, but one that's still relevant to the NeurIPS conference.

In [22]:
print(answer_question(client,
                      papers_with_embeddings,
                      "What are the major research trends in NeurIPS this year?"))

I don't know


Considering that we only have papers up to 2023, and it's currently 2024 - this is probably a good answer. But let's see what happens if we ask about a specific year.

In [23]:
print(answer_question(client,
                      papers_with_embeddings,
                      "What are the major research trends in NeurIPS in 2023?"))

I don't know.


This looks like it could be an opportunity for improvement, since we do have all the papers from 2023 available.
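
One possible direction, sketched below under stated assumptions, is to detect an explicit year in the question and restrict the candidate chunks to that year before ranking. `extract_years` and `filter_by_years` are hypothetical helpers, the default range of 1987 to 2023 is taken from the data shown in this notebook, and whether this actually improves the answer above has not been tested here.

```python
import re

def extract_years(question, first_year=1987, last_year=2023):
    """Return the four-digit years mentioned in `question` that fall within the dataset's range."""
    years = [int(match) for match in re.findall(r'\b(?:19|20)\d{2}\b', question)]
    return sorted({year for year in years if first_year <= year <= last_year})

def filter_by_years(rows, years):
    """Keep only rows (dicts with a 'year' key) from the requested years; no years means no filter."""
    if not years:
        return rows
    return [row for row in rows if row['year'] in years]

rows = [
    {'year': 2022, 'name': 'Paper A'},
    {'year': 2023, 'name': 'Paper B'},
]
years = extract_years("What are the major research trends in NeurIPS in 2023?")
print(years, filter_by_years(rows, years))
```

With the notebook's DataFrame, the equivalent filter would be `papers_with_embeddings[papers_with_embeddings.year.isin(years)]`.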

## Opportunities for Improvement

Here are the areas where we could potentially improve the RAG model:
- **Developing a standard list of questions and responses to evaluate the model** - this notebook just tests a few quick and dirty questions to get a general sense of whether the model is working. If we develop a more robust list of questions and potential answers, we can get a better idea of whether the model is truly using the context as desired - as opposed to generating a response based on the LLM's general knowledge and/or hallucinating. This would also allow us to determine whether any sort of tuning we do (to chunking, prompts, etc.) provides noticeable improvement to the responses of the model.
- **Better PDF conversion** - the raw plaintext from the PDF conversion currently does not contain structured versions of the papers' tables, equations, graphs, and images. If we convert more of this information in a structured way, the source data for the RAG may contain even more useful context to draw from.
- **More intelligent chunking** - Currently the chunks contain no overlaps and are naively cut off at 8000 tokens. Because we will include up to 127,000 tokens in our context for the response, a full paper will more than likely be included in the context, even if it was broken into chunks. However, there are no guarantees that the chunks of a paper will be in a logical order, nor that all of its chunks will actually be included. One strategy could be to include overlap in the chunks to make sure that not too much context is lost between chunks. Another potential strategy could be to use a generative LLM to summarize the papers down to 8000 characters rather than chunking (thus ensuring that a full paper is included in each embedding). However, information could be lost during that summarization, and it would take more compute resources to generate (calling `gpt-4o-mini` for every single paper), which might not be worth it.
- **Ranking papers during the retrieval step** - we might want to rank more recent papers as more important to the context as those papers have the most up-to-date information.
- **Prompt engineering** - We could experiment with different ways of structuring our prompt to ensure that _only_ the context provided is used in the response. Additionally, we might want to experiment with extracting key pieces of information from the prompt, such as phrases like "this year", to filter which papers are selected as candidates for the context. If we include more structured information in the DataFrame, such as subfield (e.g., reinforcement learning, NLP, computer vision), authors, citations, etc., we also might be able to use that information in conjunction with the user's question to make sure we have