# Build a Contextual Retrieval based RAG System

## Install OpenAI and LangChain dependencies

In [None]:
%%capture
!pip install langchain #==0.3.10
!pip install langchain-openai #==0.2.12
!pip install langchain-community #==0.3.11

In [None]:
%%capture
!pip install jq #==1.8.0

In [None]:
%%capture
!pip install pymupdf #==1.25.1

## Install Chroma Vector DB and LangChain wrapper

In [None]:
%%capture
!pip install langchain-chroma #==0.1.4

## Enter OpenAI API Key

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
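
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the key may already be set in the environment and prompting for it otherwise. `resolve_api_key` is a hypothetical helper name, not part of LangChain or OpenAI.

```python
import os
from getpass import getpass

def resolve_api_key(name="OPENAI_API_KEY", env=None, prompt=getpass):
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for illustration; `env` and `prompt` are injectable
    so the function can be exercised without a real key or keyboard input.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        # Hidden prompt so the key is not echoed into the notebook output
        key = prompt(f"Enter {name}: ")
    if not key:
        raise ValueError(f"{name} was not provided")
    return key

# Usage (commented out so the cell does not block waiting for input):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```

Reading from the environment first keeps the same cell usable in Colab, in a local Jupyter session, and in scripts.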

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the `text-embedding-3` family: the smaller, more efficient `text-embedding-3-small` and the larger, more capable `text-embedding-3-large`.

In [None]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Loading and Processing the Data

### Get the dataset

In [None]:
# If the download below fails, get the file manually from
# https://drive.google.com/file/d/1aZxZejfteVuofISodUrY2CDoyuPLYDGZ
# and upload it to Colab
!gdown 1aZxZejfteVuofISodUrY2CDoyuPLYDGZ

Downloading...
From: https://drive.google.com/uc?id=1aZxZejfteVuofISodUrY2CDoyuPLYDGZ
To: /content/rag_docs.zip
100% 5.92M/5.92M [00:00<00:00, 22.7MB/s]


In [None]:
!unzip rag_docs.zip

Archive:  rag_docs.zip
   creating: rag_docs/
  inflating: rag_docs/attention_paper.pdf  
  inflating: rag_docs/cnn_paper.pdf  
  inflating: rag_docs/resnet_paper.pdf  
  inflating: rag_docs/vision_transformer.pdf  
  inflating: rag_docs/wikidata_rag_demo.jsonl  


### Load and Process JSON Documents

In [None]:
# Document loaders live in the langchain-community package
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='./rag_docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)
wiki_docs = loader.load()

In [None]:
len(wiki_docs)

1801

In [None]:
wiki_docs[3]

Document(metadata={'source': '/content/rag_docs/wikidata_rag_demo.jsonl', 'seq_num': 4}, page_content='{"id": "71548", "title": "Chi-square distribution", "paragraphs": ["In probability theory and statistics, the chi-square distribution (also chi-squared or formula_1\\u00a0 distribution) is one of the most widely used theoretical probability distributions. Chi-square distribution with formula_2 degrees of freedom is written as formula_3. It is a special case of gamma distribution.", "Chi-square distribution is primarily used in statistical significance tests and confidence intervals. It is useful, because it is relatively easy to show that certain probability distributions come close to it, under certain conditions. One of these conditions is that the null hypothesis must be true. Another one is that the different random variables (or observations) must be independent of each other."]}')

In [None]:
import json
from langchain_core.documents import Document

wiki_docs_processed = []

for doc in wiki_docs:
    doc = json.loads(doc.page_content)
    metadata = {
        "title": doc['title'],
        "id": doc['id'],
        "source": "Wikipedia",
        "page": 1
    }
    data = ' '.join(doc['paragraphs'])
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))

In [None]:
wiki_docs_processed[3]

Document(metadata={'title': 'Chi-square distribution', 'id': '71548', 'source': 'Wikipedia', 'page': 1}, page_content='In probability theory and statistics, the chi-square distribution (also chi-squared or formula_1\xa0 distribution) is one of the most widely used theoretical probability distributions. Chi-square distribution with formula_2 degrees of freedom is written as formula_3. It is a special case of gamma distribution. Chi-square distribution is primarily used in statistical significance tests and confidence intervals. It is useful, because it is relatively easy to show that certain probability distributions come close to it, under certain conditions. One of these conditions is that the null hypothesis must be true. Another one is that the different random variables (or observations) must be independent of each other.')
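
To make the transformation above concrete without LangChain, here is a minimal standard-library sketch of the same idea: read one JSON object per line, join its paragraphs, and keep a small metadata dict. The two sample records are made up for illustration, and `parse_wiki_jsonl` is a hypothetical helper name.

```python
import json

# Two made-up records in the same shape as wikidata_rag_demo.jsonl
sample_jsonl = "\n".join([
    json.dumps({"id": "1", "title": "Alpha", "paragraphs": ["First.", "Second."]}),
    json.dumps({"id": "2", "title": "Beta", "paragraphs": ["Only one."]}),
])

def parse_wiki_jsonl(text):
    """Turn JSON Lines text into (content, metadata) pairs."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        metadata = {"title": row["title"], "id": row["id"],
                    "source": "Wikipedia", "page": 1}
        # Paragraphs are joined with a space, as in the cell above
        records.append((" ".join(row["paragraphs"]), metadata))
    return records

records = parse_wiki_jsonl(sample_jsonl)
print(records[0])
```

Each article becomes a single document here, so the Wikipedia entries are indexed whole rather than chunked.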

### Load and Process PDF documents

#### Create Chunk Contexts for Contextual Retrieval

![](https://i.imgur.com/LRhKHzk.png)

In [None]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [None]:
# Import the prompt template, output parser, and Document class from LangChain core
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.documents import Document

# --- Helper Function to Generate Context for a Single Chunk ---

def generate_chunk_context(document, chunk):
    """
    Generates a brief, contextual summary for a text chunk based on the entire document.

    Args:
        document (str): The full content of the original document.
        chunk (str): The specific text chunk that needs a contextual summary.

    Returns:
        str: An AI-generated contextual summary for the chunk.
    """

    # This is the prompt template that instructs the language model.
    # It clearly defines the role of the AI, the inputs (the full paper and the specific chunk),
    # and the desired output format (a concise context).
    chunk_process_prompt = """You are an AI assistant specializing in research paper analysis.
                            Your task is to provide brief, relevant context for a chunk of text
                            based on the following research paper.

                            Here is the research paper:
                            <paper>
                            {paper}
                            </paper>

                            Here is the chunk we want to situate within the whole document:
                            <chunk>
                            {chunk}
                            </chunk>

                            Provide a concise context (3-4 sentences max) for this chunk,
                            considering the following guidelines:

                            - Give a short succinct context to situate this chunk within the overall document
                            for the purposes of improving search retrieval of the chunk.
                            - Answer only with the succinct context and nothing else.
                            - Context should be mentioned like 'Focuses on ....'
                            do not mention 'this chunk or section focuses on...'

                            Context:
                         """

    # Create a prompt object from the string template
    prompt_template = ChatPromptTemplate.from_template(chunk_process_prompt)

    # Define the "chain" using LangChain Expression Language (LCEL).
    # 1. The prompt_template takes the 'paper' and 'chunk' inputs.
    # 2. The formatted prompt is passed to the language model ('chatgpt').
    # 3. The model's output is parsed into a simple string by 'StrOutputParser'.
    agentic_chunk_chain = (prompt_template
                           |
                           chatgpt # Your language model instance
                           |
                           StrOutputParser())

    # Execute the chain by providing the actual document and chunk content.
    context = agentic_chunk_chain.invoke({'paper': document, 'chunk': chunk})

    # Return the generated context string.
    return context
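
Because every call to `generate_chunk_context` places the whole paper in the prompt, the total input grows with the number of chunks times the paper length. The sketch below is a rough back-of-the-envelope estimate; the numbers are illustrative and not measured from the papers in this notebook, and `estimate_prompt_chars` is a hypothetical helper.

```python
def estimate_prompt_chars(doc_chars, chunk_chars, template_chars=0):
    """Total characters sent across all context-generation calls.

    doc_chars: length of the full document text
    chunk_chars: list with the length of each chunk
    template_chars: fixed instruction text added to every prompt
    """
    per_call = [doc_chars + c + template_chars for c in chunk_chars]
    return sum(per_call)

# Illustrative numbers only: a 40,000-character paper split into 12 chunks
chunk_lengths = [3500] * 11 + [1500]
total = estimate_prompt_chars(doc_chars=40_000, chunk_chars=chunk_lengths)
print(total)
```

In this example the paper text is sent 12 times, so the document term dominates the total. That is worth keeping in mind when choosing `chunk_size` for long documents.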

In [None]:

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import uuid # Used for generating unique IDs for each chunk

# --- Main Function to Create Contextual Chunks from a File ---

def create_contextual_chunks(file_path, chunk_size=3500, chunk_overlap=0):
    """
    Loads a PDF, splits it into chunks, and adds a contextual summary to each chunk.

    Args:
        file_path (str): The path to the PDF file.
        chunk_size (int): The maximum number of characters for each chunk.
        chunk_overlap (int): The number of characters to overlap between consecutive chunks.

    Returns:
        list[Document]: A list of LangChain Document objects, where each document's
                        page_content is the context + original chunk text.
    """

    print('Loading pages:', file_path)
    # Initialize the PDF loader with the file path
    loader = PyMuPDFLoader(file_path)
    # Load the document into a list of LangChain Document objects, one per page
    doc_pages = loader.load()

    print('Chunking pages:', file_path)
    # Initialize the text splitter with the desired chunk size and overlap
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                              chunk_overlap=chunk_overlap)
    # Split the loaded pages into smaller chunks
    doc_chunks = splitter.split_documents(doc_pages)

    print('Generating contextual chunks:', file_path)
    # Reassemble the entire document's text from the loaded pages to provide full context.
    # Joining pages (not chunks) avoids duplicated text when chunk_overlap > 0.
    original_doc = '\n'.join([doc.page_content for doc in doc_pages])

    # This list will store the final, enriched chunks
    contextual_chunks = []

    # Loop through each raw chunk to generate its context
    for chunk in doc_chunks:
        # Get the text content and metadata from the current chunk
        chunk_content = chunk.page_content
        chunk_metadata = chunk.metadata

        # Create a new, updated metadata dictionary for better organization
        chunk_metadata_upd = {
            'id': str(uuid.uuid4()),  # Add a unique identifier
            'page': chunk_metadata['page'],  # Keep the original page number
            'source': chunk_metadata['source'],  # Keep the original file path
            'title': chunk_metadata['source'].split('/')[-1] # Extract just the filename as title
        }

        # Call our helper function to generate the context for this specific chunk
        context = generate_chunk_context(original_doc, chunk_content)

        # Create a new LangChain Document object
        # The content is the generated context, a newline, and then the original chunk's content
        # The metadata is our newly created dictionary
        contextual_chunks.append(Document(page_content=context + '\n' + chunk_content,
                                          metadata=chunk_metadata_upd))

    print('Finished processing:', file_path)
    print()

    # Return the list of newly created, context-aware chunks
    return contextual_chunks
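
The flow of `create_contextual_chunks` can be tried offline with stand-ins. This is a minimal sketch, assuming a naive fixed-width splitter in place of `RecursiveCharacterTextSplitter` and a fake context function in place of the LLM call; `split_fixed`, `fake_context`, and `contextualize` are hypothetical names, and plain dicts stand in for `Document` objects.

```python
import uuid

def split_fixed(text, chunk_size):
    """Naive fixed-width splitter (stand-in for RecursiveCharacterTextSplitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def fake_context(document, chunk):
    """Stand-in for the LLM call: describes sizes instead of summarizing."""
    return (f"Focuses on a {len(chunk)}-character passage "
            f"from a {len(document)}-character document.")

def contextualize(text, source, chunk_size=40, context_fn=fake_context):
    """Prepend a generated context line to every chunk and attach metadata."""
    chunks = []
    for piece in split_fixed(text, chunk_size):
        chunks.append({
            "page_content": context_fn(text, piece) + "\n" + piece,
            "metadata": {"id": str(uuid.uuid4()),
                         "source": source,
                         "title": source.split("/")[-1]},
        })
    return chunks

demo_text = ("Attention weighs tokens against each other. "
             "Residual links ease optimization. "
             "Patches turn images into sequences.")
demo_chunks = contextualize(demo_text, "./rag_docs/demo.pdf")
print(demo_chunks[0]["page_content"])
```

The key point is that the text that gets embedded is the context line plus the original chunk, so retrieval can match on either.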

In [None]:
from glob import glob

pdf_files = glob('./rag_docs/*.pdf')
pdf_files

['./rag_docs/attention_paper.pdf',
 './rag_docs/cnn_paper.pdf',
 './rag_docs/vision_transformer.pdf',
 './rag_docs/resnet_paper.pdf']

In [None]:
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_contextual_chunks(file_path=fp, chunk_size=3500))

Loading pages: ./rag_docs/attention_paper.pdf
Chunking pages: ./rag_docs/attention_paper.pdf
Generating contextual chunks: ./rag_docs/attention_paper.pdf
Finished processing: ./rag_docs/attention_paper.pdf

Loading pages: ./rag_docs/cnn_paper.pdf
Chunking pages: ./rag_docs/cnn_paper.pdf
Generating contextual chunks: ./rag_docs/cnn_paper.pdf
Finished processing: ./rag_docs/cnn_paper.pdf

Loading pages: ./rag_docs/vision_transformer.pdf
Chunking pages: ./rag_docs/vision_transformer.pdf
Generating contextual chunks: ./rag_docs/vision_transformer.pdf
Finished processing: ./rag_docs/vision_transformer.pdf

Loading pages: ./rag_docs/resnet_paper.pdf
Chunking pages: ./rag_docs/resnet_paper.pdf
Generating contextual chunks: ./rag_docs/resnet_paper.pdf
Finished processing: ./rag_docs/resnet_paper.pdf



In [None]:
len(paper_docs)

79

In [None]:
paper_docs[0]

Document(metadata={'id': '3fc63115-1249-42fa-99a4-3467fe72355b', 'page': 0, 'source': './rag_docs/attention_paper.pdf', 'title': 'attention_paper.pdf'}, page_content='Focuses on the authorship and permissions related to the research paper "Attention Is All You Need," which introduces the Transformer model for sequence transduction tasks. It outlines the contributions of each author and provides details on the paper\'s presentation at the 31st Conference on Neural Information Processing Systems (NIPS 2017).\nProvided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of Toro

In [None]:
from IPython.display import display, Markdown

display(Markdown(paper_docs[10].page_content))

Focuses on the performance of the Transformer model in English constituency parsing, comparing its results to various established parsers, both in discriminative and semi-supervised settings. It highlights the model's competitive F1 scores, demonstrating its effectiveness despite a lack of task-specific tuning, and contrasts its performance with that of recurrent neural network-based models.
Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23
of WSJ)
Parser
Training
WSJ 23 F1
Vinyals & Kaiser el al. (2014) [37]
WSJ only, discriminative
88.3
Petrov et al. (2006) [29]
WSJ only, discriminative
90.4
Zhu et al. (2013) [40]
WSJ only, discriminative
90.4
Dyer et al. (2016) [8]
WSJ only, discriminative
91.7
Transformer (4 layers)
WSJ only, discriminative
91.3
Zhu et al. (2013) [40]
semi-supervised
91.3
Huang & Harper (2009) [14]
semi-supervised
91.3
McClosky et al. (2006) [26]
semi-supervised
92.1
Vinyals & Kaiser el al. (2014) [37]
semi-supervised
92.1
Transformer (4 layers)
semi-supervised
92.7
Luong et al. (2015) [23]
multi-task
93.0
Dyer et al. (2016) [8]
generative
93.3
increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3
for both WSJ only and the semi-supervised setting.
Our results in Table 4 show that despite the lack of task-specific tuning our model performs sur-
prisingly well, yielding better results than all previously reported models with the exception of the
Recurrent Neural Network Grammar [8].
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parser [29] even when training only on the WSJ training set of 40K sentences.
7
Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on
attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with
multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based
on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks, we achieve a new state of the art. In the former task our best
model outperforms even all previously reported ensembles.
We are excited about the future of attention-based models and plan to apply them to other tasks. We
plan to extend the Transformer to problems involving input and output modalities other than text and
to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs
such as images, audio and video. Making generation less sequential is another research goals of ours.
The code we used to train and evaluate our models is available at https://github.com/
tensorflow/tensor2tensor.
Acknowledgements
We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful
comments, corrections and inspiration.
References
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural
machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine
reading. arXiv preprint arXiv:1601.06733, 2016.
10

### Combine all document chunks in one list

In [None]:
len(wiki_docs_processed)

1801

In [None]:
total_docs = wiki_docs_processed + paper_docs
len(total_docs)

1880

## Index Document Chunks and Embeddings in Vector DB

Here we create a Chroma vector store from the documents and persist it to disk by passing the directory where the data should be saved.

In [None]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_context_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_context_db")
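
The `hnsw:space` setting chooses the distance used by the index. As a small illustration of how the two options mentioned in the comment relate, the sketch below uses toy vectors (not real embeddings) to show that for unit-length vectors the squared Euclidean distance equals twice the cosine distance, so both give the same ranking. OpenAI's documentation describes its embeddings as normalized to length 1; if that holds for the model in use, the choice mainly affects the score values rather than the order of results. This is an assumption to verify, not a guarantee.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors, normalized to unit length
a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 1.0, 0.5])

print(squared_l2(a, b), 2 * cosine_distance(a, b))
```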

### Load Vector DB from disk

Once a vector database exists on disk, you can load it and reconnect to it at any time.

In [None]:
# load from disk
chroma_db = Chroma(persist_directory="./my_context_db",
                   collection_name='my_context_db',
                   embedding_function=openai_embed_model)

In [None]:
chroma_db

<langchain_chroma.vectorstores.Chroma at 0x7f36027ee180>

### Semantic Similarity based Retrieval

We use cosine similarity and retrieve the top 5 most similar documents for the user's query.

In [None]:
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})
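
Conceptually, a similarity retriever with `k` results embeds the query, scores it against the stored vectors, and returns the best matches. The sketch below does this by brute force over a toy three-document index with made-up 3-dimensional vectors. It is an illustration of the idea only: Chroma uses an approximate HNSW index rather than scanning every vector, and `top_k` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Made-up vectors standing in for document embeddings
toy_index = {
    "ml": [0.9, 0.1, 0.0],
    "cnn": [0.2, 0.9, 0.1],
    "stats": [0.0, 0.2, 0.9],
}

# Made-up query vector that points mostly along the first axis
hits = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print([doc_id for doc_id, _ in hits])
```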

In [None]:
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

In [None]:
query = "what is machine learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'page': 1, 'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'title': 'Supervised learning', 'source': 'Wikipedia', 'page': 1}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'title': 'Deep learning', 'page': 1, 'id': '663523', 'source': 'Wikipedia'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, whic


Metadata: {'page': 1, 'id': '6360', 'source': 'Wikipedia', 'title': 'Artificial intelligence'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental facu


Metadata: {'page': 1, 'source': 'Wikipedia', 'title': 'Artificial neural network', 'id': '44742'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.




In [None]:
query = "what is the difference between transformers and vision transformers?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

Metadata: {'page': 7, 'id': '3c534858-932f-4bf6-bccd-34154e51cc58', 'source': './rag_docs/vision_transformer.pdf', 'title': 'vision_transformer.pdf'}
Content Brief:


Focuses on a controlled scaling study of various models, including ResNets and Vision Transformers, evaluating their transfer performance from the JFT-300M dataset. It highlights the performance versus pre-training cost, revealing that Vision Transformers outperform ResNets in terms of efficiency and scalability, while also discussing the implications for future model scaling efforts.
Published as a conference paper at ICLR 2021
4.4
SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from
JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess
performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1,
R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained
for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus
L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/1


Metadata: {'title': 'vision_transformer.pdf', 'id': '561364f3-23f5-428d-b43a-6171c2586694', 'page': 0, 'source': './rag_docs/vision_transformer.pdf'}
Content Brief:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a standard Transformer architecture directly to image classification tasks by treating image patches as tokens. It highlights the limitations of convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks when pre-trained on large datasets, while requiring fewer computational resources.
Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language 


Metadata: {'id': '103a7416-3ff5-4abb-91f2-5f3716cd500b', 'title': 'vision_transformer.pdf', 'page': 2, 'source': './rag_docs/vision_transformer.pdf'}
Content Brief:


Focuses on the architecture and methodology of the Vision Transformer (ViT), detailing how images are processed by splitting them into patches, embedding them, and utilizing a standard Transformer encoder for image classification tasks. It describes the model's design principles, including the use of position embeddings and the integration of a classification token, while referencing foundational work in Transformer architecture.
Published as a conference paper at ICLR 2021
Transformer Encoder
MLP 
Head
Vision Transformer (ViT)
*
Linear Projection of Flattened Patches
* Extra learnable
     [ cl ass]  embedding
1
2
3
4
5
6
7
8
9
0
Patch + Position 
Embedding
Class
Bird
Ball
Car
...
Embedded 
Patches
Multi-Head 
Attention
Norm
MLP
Norm
+
L x
+
Transformer Encoder
Figure 1: Model overview. We split an image into ﬁxed-size patches, linearly embed each of them,
add position embeddings, and feed the resulting sequence of vectors to a standard Transformer
encoder. In order to perform classiﬁ


Metadata: {'title': 'vision_transformer.pdf', 'id': 'c9afddf8-565b-474b-918d-e489eebb1096', 'source': './rag_docs/vision_transformer.pdf', 'page': 1}
Content Brief:


Focuses on the performance of the Vision Transformer (ViT) in comparison to convolutional neural networks (CNNs), highlighting the advantages of large-scale training on datasets ranging from 14M to 300M images. It emphasizes that ViT achieves state-of-the-art results on various image recognition benchmarks when pre-trained on extensive datasets like ImageNet-21k and JFT-300M, despite lacking some inductive biases inherent to CNNs.
Published as a conference paper at ICLR 2021
inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well
when trained on insufﬁcient amounts of data.
However, the picture changes if the models are trained on larger datasets (14M-300M images). We
ﬁnd that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent
results when pre-trained at sufﬁcient scale and transferred to tasks with fewer datapoints. When
pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset,


Metadata: {'source': './rag_docs/vision_transformer.pdf', 'id': '6c440742-c527-488f-8356-6bbe2b80cbe7', 'title': 'vision_transformer.pdf', 'page': 7}
Content Brief:


Focuses on the behavior of attention mechanisms in Vision Transformers, highlighting how attention distances vary across layers and the implications of localized attention in hybrid models that incorporate CNNs. It also discusses the relationship between attention distance and network depth, indicating that deeper layers attend to semantically relevant regions for classification.
have consistently small attention distances in the low layers. This highly localized attention is
less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right),
suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the
attention distance increases with network depth. Globally, we ﬁnd that the model attends to image
regions that are semantically relevant for classiﬁcation (Figure 6).
4.6
SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems
not only from their excellent scalability bu




## Build the RAG Pipeline

In [None]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_rag_chain = (
    {
        "context": (similarity_retriever
                      |
                    format_docs),
        "question": RunnablePassthrough()
    }
      |
    rag_prompt_template
      |
    chatgpt
)
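
The chain above can be read as three steps: build a dict with the retrieved context and the original question, fill the prompt template, and call the model. The sketch below mirrors those steps in plain Python with stand-ins so it runs without an API key. `stub_retriever`, `stub_llm`, `run_rag`, and the shortened `PROMPT` are all hypothetical and only illustrate the data flow.

```python
def stub_retriever(question):
    """Stand-in for similarity_retriever: returns fixed text snippets."""
    return ["Machine learning builds models from sample inputs.",
            "Supervised learning infers a function from labelled data."]

def format_snippets(snippets):
    """Same joining rule as format_docs, applied to plain strings."""
    return "\n\n".join(snippets)

# Shortened version of the RAG prompt, for illustration only
PROMPT = "Question:\n{question}\n\nContext:\n{context}\n\nAnswer:"

def stub_llm(prompt_text):
    """Stand-in for the chat model: reports prompt size instead of answering."""
    return f"[stub answer based on {len(prompt_text)} prompt characters]"

def run_rag(question, retriever=stub_retriever, llm=stub_llm):
    # Step 1: the same input feeds both the context branch and the question branch
    inputs = {"context": format_snippets(retriever(question)),
              "question": question}
    # Step 2: fill the template; Step 3: call the model
    prompt_text = PROMPT.format(**inputs)
    return prompt_text, llm(prompt_text)

prompt_text, answer = run_rag("What is machine learning?")
print(prompt_text)
```

Swapping in a different retriever or model only changes the stand-ins, which is the same property the piped chain provides.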

In [None]:
from IPython.display import display, Markdown

query = "What is machine learning?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The concept was introduced by Arthur Samuel in 1959 and is rooted in artificial intelligence. Machine learning focuses on the study and construction of algorithms that can learn from data and make predictions or decisions based on that data. These algorithms follow programmed instructions but can also adapt and improve their performance by building models from sample inputs.

Machine learning is particularly useful in scenarios where designing and programming explicit algorithms is impractical. Common applications include spam filtering, detecting network intruders or malicious insiders, optical character recognition (OCR), search engines, and computer vision.

Within machine learning, there are different approaches, such as supervised learning, where a function is inferred from labeled training data. In this case, the system learns to produce correct results based on known outcomes, typically using vectors for training data and results to create a "classifier." Inductive reasoning is often employed to generalize from the training data.

Additionally, deep learning is a specialized area of machine learning that utilizes neural networks, particularly those with multiple layers (known as multi-layer neural networks). Deep learning is effective for complex tasks like speech recognition, image understanding, and handwriting recognition, which are challenging for computers but relatively easy for humans. These models are inspired by the information processing patterns of biological nervous systems, although they differ significantly from the structural and functional properties of human brains.

In summary, machine learning enables computers to learn from data and improve their performance over time, making it a crucial component of modern artificial intelligence applications.

In [None]:
query = "What is a CNN?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A CNN, or Convolutional Neural Network, is a specialized type of Artificial Neural Network (ANN) that is particularly effective for image-driven pattern recognition tasks. CNNs are designed to process data with a grid-like topology, such as images, and they utilize a unique architecture that distinguishes them from traditional ANNs.

### Key Features of CNNs:

1. **Three-Dimensional Neuron Organization**: 
   - The neurons in CNNs are organized in three dimensions, which correspond to the spatial dimensions of the input (height and width) and the depth (which can represent color channels in images).

2. **Layer Types**:
   - CNNs are composed of three main types of layers:
     - **Convolutional Layers**: These layers apply convolution operations to the input, allowing the network to learn spatial hierarchies of features. Each neuron in a convolutional layer is connected to a small region of the input, which helps in detecting local patterns.
     - **Pooling Layers**: These layers reduce the spatial dimensions of the input, helping to decrease the computational load and control overfitting by summarizing the features.
     - **Fully-Connected Layers**: These layers connect every neuron in one layer to every neuron in the next layer, typically used at the end of the network to produce the final output.

3. **Functionality**:
   - The architecture of CNNs allows them to effectively learn and extract features from images. The convolutional layers detect features such as edges and textures, while pooling layers help in down-sampling the feature maps, making the network more efficient.

4. **Learning Paradigms**:
   - CNNs typically utilize supervised learning, where the model is trained on labeled data to minimize classification errors. This is crucial for tasks like image classification, where the goal is to assign a label to an input image.

5. **Applications**:
   - CNNs are widely used in various applications, including image recognition, object detection, and even in fields like medical image analysis and autonomous driving.

In summary, CNNs are a powerful architecture within the realm of machine learning, specifically tailored for tasks involving image data, leveraging their unique structure to achieve high performance in pattern recognition.

In [None]:
query = "How is a resnet better than a CNN?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A ResNet (Residual Network) is considered better than a traditional CNN (Convolutional Neural Network) for several reasons, primarily due to its architectural innovations that address common challenges faced in training deep networks. Here are the key advantages of ResNets over standard CNNs:

1. **Residual Learning Framework**: 
   - ResNets introduce the concept of residual learning, where the network learns to predict the residual (the difference) between the desired output and the input. This is mathematically represented as \( F(x) + x \), where \( F(x) \) is the residual mapping. This approach simplifies the learning process, making it easier for the network to optimize deeper architectures.

2. **Shortcut Connections**:
   - The architecture of ResNets includes shortcut connections that skip one or more layers. These connections allow gradients to flow more easily during backpropagation, mitigating issues like vanishing gradients that often occur in very deep networks. This results in more effective training of deeper models.

3. **Degradation Problem**:
   - Traditional CNNs often suffer from the degradation problem, where adding more layers leads to higher training error. In contrast, ResNets can maintain or even improve performance as depth increases. For instance, a 34-layer ResNet outperforms an 18-layer ResNet, which is not the case for plain networks where deeper models can lead to worse performance.

4. **Higher Accuracy with Increased Depth**:
   - ResNets can achieve significantly better accuracy with increased depth without the degradation issues. For example, the 152-layer ResNet has shown to outperform shallower models and even other architectures like VGG, achieving lower top-5 validation errors in competitions.

5. **Computational Efficiency**:
   - Despite being deeper, ResNets can have lower computational complexity (measured in FLOPs) compared to other architectures like VGG. For instance, a 152-layer ResNet has lower complexity than VGG-16, allowing for more efficient training and inference.

6. **Generalization Performance**:
   - ResNets have demonstrated superior generalization performance across various tasks, including object detection and image classification, as evidenced by their success in competitions like ILSVRC and COCO.

In summary, the architectural innovations of ResNets, particularly the use of residual learning and shortcut connections, enable them to train deeper networks effectively, avoid degradation problems, and achieve higher accuracy compared to traditional CNNs.

In [None]:
query = "What is NLP and its relation to linguistics?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Natural Language Processing (NLP) is a field within Artificial Intelligence that focuses on enabling computers to automatically understand and generate human languages. The term "Natural Language" specifically refers to human languages, distinguishing them from programming languages. The overarching goal of NLP is to facilitate seamless interaction between humans and machines through language.

NLP is closely related to linguistics, as it draws upon linguistic principles to enhance the understanding and processing of human language. Linguistics provides the foundational theories and frameworks that inform how language is structured, used, and understood, which are essential for developing effective NLP systems. By leveraging insights from linguistics, NLP aims to improve the accuracy and efficiency of language-related tasks, such as speech recognition, language translation, and text analysis.

In [None]:
query = "What is the difference between AI, ML and DL?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The difference between AI, ML, and DL can be summarized as follows:

### Artificial Intelligence (AI)
- **Definition**: AI refers to the ability of a computer program or machine to think and learn, mimicking human cognition. It encompasses a broad range of technologies and applications aimed at making machines "smart."
- **Functionality**: AI systems can interpret external data, learn from it, and use those learnings to achieve specific goals through flexible adaptation. The term has evolved, and tasks once considered AI, like optical character recognition, are now seen as routine technologies.
- **Origin**: The term "Artificial Intelligence" was coined by John McCarthy in 1955.

### Machine Learning (ML)
- **Definition**: ML is a subfield of AI that focuses on the study and construction of algorithms that allow computers to learn from and make predictions based on data without being explicitly programmed.
- **Functionality**: ML algorithms build models from sample inputs and can make predictions or decisions based on data. It is particularly useful in scenarios where designing explicit algorithms is impractical.
- **Examples**: Applications of ML include spam filtering, network intrusion detection, optical character recognition, search engines, and computer vision.

### Deep Learning (DL)
- **Definition**: DL is a specialized subset of ML that primarily uses neural networks with multiple layers (deep neural networks) to analyze various forms of data.
- **Functionality**: In DL, learning sessions can be unsupervised, semi-supervised, or supervised. The architecture typically includes at least one hidden layer between the input and output layers, allowing the model to process information in a more abstract manner as it passes through the layers.
- **Applications**: DL is particularly effective for complex tasks such as speech recognition, image classification, and understanding handwriting, which are challenging for traditional algorithms.

In summary, AI is the overarching field that includes both ML and DL, with ML being a method within AI that enables learning from data, and DL being a more advanced method of ML that utilizes deep neural networks for complex data processing.

In [None]:
query = "What is the difference between transformers and vision transformers?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The difference between transformers and vision transformers (ViTs) primarily lies in their application and the way they process input data.

### Transformers
- **Application**: Transformers were originally designed for natural language processing (NLP) tasks. They excel in handling sequential data, where the input is typically a sequence of tokens (words).
- **Input Processing**: In NLP, transformers take a sequence of word embeddings as input. Each token in the sequence is treated independently, and the model uses self-attention mechanisms to capture relationships between tokens across the entire sequence.
- **Architecture**: The standard transformer architecture consists of layers of multi-headed self-attention and feedforward neural networks, with positional encodings added to maintain the order of tokens.

### Vision Transformers (ViTs)
- **Application**: Vision transformers adapt the transformer architecture for image classification tasks. They treat image patches as tokens, allowing the model to process visual data similarly to how it processes text.
- **Input Processing**: In ViTs, an image is divided into fixed-size patches, which are then flattened and linearly embedded into a sequence of vectors. This sequence is fed into the transformer model, where each patch is treated like a token in NLP.
- **Architecture**: ViTs maintain the core transformer structure but include additional components like position embeddings to retain spatial information about the patches. The model uses self-attention to integrate information across the entire image, enabling it to capture global context effectively.

### Key Differences
1. **Data Type**: Transformers are designed for sequential data (text), while ViTs are tailored for 2D image data.
2. **Input Representation**: Transformers use word tokens, whereas ViTs use image patches as tokens.
3. **Inductive Bias**: Traditional CNNs, which are often used in computer vision, have strong inductive biases such as locality and translation equivariance. In contrast, ViTs have less image-specific inductive bias, relying more on the data to learn spatial relationships.
4. **Performance**: ViTs can achieve competitive performance on image classification tasks, especially when pre-trained on large datasets, often requiring fewer computational resources compared to traditional CNNs.

In summary, while both transformers and vision transformers share a foundational architecture, they differ significantly in their input processing, application domains, and the types of data they are optimized to handle.

In [None]:
query = "How is self-attention important in transformers?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Self-attention is a crucial mechanism in transformers that allows the model to capture dependencies across different positions in a sequence, regardless of their distance. This capability is particularly important for several reasons:

1. **Global Dependency Modeling**: Self-attention enables the transformer to model relationships between all parts of the input sequence simultaneously. Unlike recurrent neural networks (RNNs), which process sequences in a linear fashion and can struggle with long-range dependencies, self-attention allows the transformer to attend to any part of the sequence directly. This is essential for tasks that require understanding context across long distances, such as language translation and comprehension.

2. **Parallelization**: The architecture of transformers, which relies entirely on self-attention rather than recurrence, allows for significant parallelization during training. This leads to faster training times and the ability to handle longer sequences more efficiently. The transformer can process all input positions at once, making it more scalable compared to traditional RNNs.

3. **Attention Mechanism**: Self-attention computes a representation of a sequence by relating different positions within that same sequence. This mechanism is effective in various tasks, including reading comprehension and summarization, as it allows the model to focus on relevant parts of the input when generating outputs.

4. **Layer-wise Attention Behavior**: In the context of Vision Transformers, self-attention shows varying attention distances across layers. Lower layers tend to have localized attention, focusing on small regions of the input, while deeper layers attend to more semantically relevant regions for classification. This hierarchical attention behavior enhances the model's ability to integrate information effectively.

5. **Multi-Head Attention**: The use of multi-head attention in transformers allows the model to capture different types of relationships and dependencies simultaneously. Each attention head can focus on different aspects of the input, enriching the representation learned by the model.

In summary, self-attention is vital in transformers as it facilitates the modeling of complex dependencies, enhances computational efficiency through parallelization, and allows for a flexible and powerful representation of input sequences.

In [None]:
query = "How does a resnet work?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

A ResNet, or Residual Network, is a type of deep neural network architecture designed to address the degradation problem that occurs when training very deep networks. The key innovation of ResNets is the introduction of residual learning through the use of shortcut connections.

### How ResNets Work:

1. **Residual Mapping**:
   - Instead of learning the desired underlying mapping \( H(x) \) directly, ResNets learn a residual mapping \( F(x) = H(x) - x \). This means that the network is trained to predict the difference between the desired output and the input, rather than the output itself.
   - The original mapping can then be expressed as \( H(x) = F(x) + x \). This formulation allows the network to focus on learning the residuals, which is often easier than learning the full mapping directly.

2. **Shortcut Connections**:
   - ResNets utilize shortcut connections that skip one or more layers. These connections perform identity mapping, meaning they add the input \( x \) directly to the output of the stacked layers. This addition does not introduce any extra parameters or computational complexity.
   - When the dimensions of the input and output differ, the shortcut can either use zero-padding (option A) or a projection shortcut using 1x1 convolutions (option B) to match dimensions.

3. **Optimization Benefits**:
   - The introduction of residual connections helps mitigate issues such as vanishing gradients, which can hinder the training of deep networks. By allowing gradients to flow more easily through the network, ResNets can be trained effectively even with hundreds or thousands of layers.
   - Empirical results show that deeper ResNets achieve better performance and lower training errors compared to traditional plain networks of similar depth. For instance, a 34-layer ResNet outperforms an 18-layer ResNet, demonstrating that deeper architectures can be beneficial when using residual learning.

4. **Training Procedure**:
   - ResNets are typically trained using stochastic gradient descent (SGD) with techniques such as batch normalization and data augmentation to improve convergence and generalization.
   - The learning rate is often adjusted during training to optimize convergence, starting from a higher value and decreasing it as training progresses.

### Conclusion:
Overall, ResNets represent a significant advancement in deep learning, allowing for the construction of very deep networks that are easier to optimize and achieve superior performance on tasks such as image classification. The residual learning framework has proven to be effective across various datasets, including CIFAR-10 and ImageNet, leading to state-of-the-art results in numerous competitions.

In [None]:
query = "What is LangGraph?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.

In [None]:
query = "What is an Agentic AI System?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The context provided does not contain specific information about an "Agentic AI System." Therefore, I don't know what an Agentic AI System is.

In [None]:
query = "What is LangChain?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.
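
The last three queries fall outside the indexed documents, and the model declines to answer, as the prompt instructs. If such cases need to be handled in code, a simple string check is one option. The sketch below is a heuristic that assumes refusals contain the phrase "don't know"; `is_refusal` is a hypothetical helper, and refusals phrased differently would slip past it.

```python
def is_refusal(answer: str) -> bool:
    """Heuristic: treat answers containing "don't know" as refusals."""
    normalized = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return "don't know" in normalized

answers = [
    "I don't know.",
    "The context provided does not contain specific information about an "
    '"Agentic AI System." Therefore, I don\'t know what an Agentic AI System is.',
    "A CNN, or Convolutional Neural Network, is a specialized type of ANN.",
]
flags = [is_refusal(a) for a in answers]
print(flags)
```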

## Build a RAG System with Sources

In [None]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)

In [None]:
# Import necessary components from LangChain
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from operator import itemgetter # Standard-library helper: itemgetter('k') returns a callable that fetches obj['k']

# --- Initialization ---
# Initialize the language model instance. We'll use this for generation.
# "gpt-4o-mini" is a fast and capable model. temperature=0 makes its output more deterministic.
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# NOTE: The following variables are assumed to be defined elsewhere in your code:
# - similarity_retriever: A LangChain retriever object that fetches documents based on a query.
# - rag_prompt_template: A ChatPromptTemplate that takes "context" and "question" as input.


# --- Helper Function ---
def format_docs(docs):
    """A helper function to concatenate the page_content of multiple documents into a single string."""
    return "\n\n".join(doc.page_content for doc in docs)


# --- Chain 1: The Core RAG Generation Chain ---
# This chain is responsible for generating the final answer once it has the context and question.
src_rag_response_chain = (
    # This dictionary defines the inputs for the prompt template.
    {
        # The "context" key's value is processed in two steps:
        # 1. itemgetter('context'): Extracts the list of documents from the input dictionary.
        # 2. RunnableLambda(format_docs): Applies our helper function to format the documents into a single string.
        "context": (itemgetter('context') | RunnableLambda(format_docs)),

        # The "question" key's value is simply extracted from the input dictionary.
        "question": itemgetter("question")
    }
    |
    # The dictionary with formatted context and question is passed to the prompt template.
    rag_prompt_template
    |
    # The formatted prompt is passed to the language model.
    chatgpt
    |
    # The output from the model is parsed into a clean string.
    StrOutputParser()
)


# --- Chain 2: The Full RAG Pipeline that includes Sources ---
# This is the main chain that orchestrates the entire process, from question to final output.
rag_chain_w_sources = (
    # This initial step performs retrieval. The input to this whole chain is the user's question.
    {
        # The retriever is called with the user's question to fetch relevant documents.
        # The result is assigned to the "context" key.
        "context": similarity_retriever,

        # RunnablePassthrough() simply passes the original user question through to the "question" key.
        # The output of this dictionary is: {"context": [docs], "question": "user's question"}
        "question": RunnablePassthrough()
    }
    |
    # RunnablePassthrough.assign() is a key component. It takes the dictionary from the previous step
    # and adds a new key to it. Here, we are adding a key named "response".
    # The value of "response" is generated by running the `src_rag_response_chain` with the
    # context and question from the previous step.
    RunnablePassthrough.assign(response=src_rag_response_chain)
)

# Final Output Structure when invoking `rag_chain_w_sources`:
# {
#   'context': [Document(page_content='...'), ...],  <- The source documents
#   'question': 'What is the capital of France?',     <- The original question
#   'response': 'The capital of France is Paris.'     <- The LLM's answer
# }
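
The effect of `RunnablePassthrough.assign` can be imitated with an ordinary dictionary merge. This is a plain-Python sketch of the behavior described in the comments above, not the library's code; `assign_response` and `fake_answer` are hypothetical names.

```python
def fake_answer(inputs: dict) -> str:
    """Hypothetical stand-in for src_rag_response_chain."""
    return f"Answer to {inputs['question']!r} from {len(inputs['context'])} docs"

def assign_response(inputs: dict) -> dict:
    """Keep every existing key and add 'response', computed from the same inputs."""
    return {**inputs, "response": fake_answer(inputs)}

step_one = {"context": ["doc A", "doc B"], "question": "What is machine learning?"}
final = assign_response(step_one)
print(sorted(final))
```

Because the original keys survive the merge, the retrieved documents remain available next to the generated answer, which is what makes it possible to show sources later.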

In [None]:
query = "What is machine learning?"
result = rag_chain_w_sources.invoke(query)
result

{'context': [Document(id='23437141-e83c-4a58-b26d-df3ef3e2d68e', metadata={'source': 'Wikipedia', 'id': '564928', 'title': 'Machine learning', 'page': 1}, page_content='Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.'),
  Document(id='e819102a-7ef6-411a-a919-210a86db4cb8', metadata={'source': 'Wikipedia

In [None]:
from IPython.display import display, Markdown

def display_results(result_obj):
    """Render the query, the model's response, and each source document with its metadata."""
    print('Query:')
    display(Markdown(result_obj['question']))
    print()
    print('Response:')
    display(Markdown(result_obj['response']))
    print('='*50)
    print('Sources:')
    for source in result_obj['context']:
        print('Metadata:', source.metadata)
        print('Content Brief:')
        display(Markdown(source.page_content))
        print()
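
When only a compact citation list is needed, the metadata of the returned documents can be reduced to unique titles. A minimal sketch, assuming each source carries `title`, `source`, and `page` keys as in the outputs in this notebook; `SourceDoc` and `list_citations` are hypothetical names, and the sample result is hand-written.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDoc:
    """Hypothetical stand-in for a retrieved Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def list_citations(result_obj: dict) -> list[str]:
    """Return one 'title (source, page N)' string per unique source, in order."""
    seen, citations = set(), []
    for doc in result_obj["context"]:
        meta = doc.metadata
        key = (meta.get("title"), meta.get("page"))
        if key not in seen:
            seen.add(key)
            citations.append(
                f"{meta.get('title')} ({meta.get('source')}, page {meta.get('page')})"
            )
    return citations

sample = {
    "question": "What is machine learning?",
    "response": "...",
    "context": [
        SourceDoc("...", {"source": "Wikipedia", "title": "Machine learning", "page": 1}),
        SourceDoc("...", {"source": "Wikipedia", "title": "Deep learning", "page": 1}),
        SourceDoc("...", {"source": "Wikipedia", "title": "Machine learning", "page": 1}),
    ],
}
citations = list_citations(sample)
print(citations)
```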


In [None]:
query = "What is machine learning?"
result = rag_chain_w_sources.invoke(query)
display_results(result)

Query:


What is machine learning?


Response:


Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The concept was introduced by Arthur Samuel in 1959 and is rooted in artificial intelligence. Machine learning focuses on the study and construction of algorithms that can learn from data and make predictions or decisions based on that data. These algorithms follow programmed instructions but can also adapt and improve their performance by building models from sample inputs.

Machine learning is particularly useful in scenarios where designing and programming explicit algorithms is impractical. Common applications include spam filtering, detecting network intruders or malicious insiders, optical character recognition (OCR), search engines, and computer vision.

Within machine learning, there are different approaches, such as supervised learning, where a function is inferred from labeled training data. In this case, the system learns to produce correct results based on known outcomes, typically using vectors for training data and results to create a "classifier." Inductive reasoning is often employed to generalize from the training data.

Additionally, deep learning is a specialized area of machine learning that utilizes neural networks, particularly those with multiple layers (known as multi-layer neural networks). These networks process information in increasingly abstract ways as more layers are added, making them effective for complex tasks like speech and image recognition. Deep learning models are inspired by the information processing patterns of biological nervous systems, although they differ significantly from the structural and functional properties of human brains.

In summary, machine learning encompasses a range of techniques and applications that enable computers to learn from data, adapt, and make informed predictions or decisions without explicit programming.

Sources:
Metadata: {'id': '564928', 'source': 'Wikipedia', 'title': 'Machine learning', 'page': 1}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '359370', 'title': 'Supervised learning', 'page': 1, 'source': 'Wikipedia'}
Content Brief:


In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Metadata: {'source': 'Wikipedia', 'title': 'Deep learning', 'id': '663523', 'page': 1}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, which make them incompatible with neuroscience evidences.


Metadata: {'title': 'Artificial intelligence', 'page': 1, 'id': '6360', 'source': 'Wikipedia'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental faculties once thought to require intelligence are removed from the definition. For example, optical character recognition is no longer perceived as an example of "artificial intelligence": it is just a routine technology.


Metadata: {'title': 'Artificial neural network', 'id': '44742', 'page': 1, 'source': 'Wikipedia'}
Content Brief:


A neural network (also called an ANN or an artificial neural network) is a sort of computer software, inspired by biological neurons. Biological brains are capable of solving difficult problems, but each neuron is only responsible for solving a very small part of the problem. Similarly, a neural network is made up of cells that work together to produce a desired result, although each individual cell is only responsible for solving a small part of the problem. This is one method for creating artificially intelligent programs. Neural networks are an example of machine learning, where a program can change as it learns to solve a problem. A neural network can be trained and improved with each example, but the larger the neural network, the more examples it needs to perform well—often needing millions or billions of examples in the case of deep learning. There are two ways to think of a neural network. First is like a human brain. Second is like a mathematical equation.




In [None]:
query = "What is the difference between AI, ML and DL?"
result = rag_chain_w_sources.invoke(query)
display_results(result)

Query:


What is the difference between AI, ML and DL?


Response:


The difference between AI, ML, and DL can be summarized as follows:

### Artificial Intelligence (AI)
- **Definition**: AI refers to the ability of a computer program or machine to think and learn, mimicking human cognition. It encompasses a broad range of technologies and applications that enable machines to perform tasks that typically require human intelligence.
- **Scope**: AI is a field of study aimed at creating systems that can interpret external data, learn from it, and adapt to achieve specific goals. It includes various subfields, including machine learning and deep learning.
- **Examples**: AI applications can range from simple rule-based systems to complex algorithms that can learn and adapt over time.

### Machine Learning (ML)
- **Definition**: ML is a subfield of AI that focuses on the development of algorithms that allow computers to learn from and make predictions based on data without being explicitly programmed.
- **Functionality**: ML algorithms build models from sample inputs and can make decisions or predictions based on new data. It is particularly useful in scenarios where traditional programming is impractical.
- **Examples**: Applications of ML include spam filtering, network intrusion detection, optical character recognition (OCR), and computer vision.

### Deep Learning (DL)
- **Definition**: DL is a specialized subset of machine learning that utilizes neural networks with multiple layers (known as deep neural networks) to analyze various forms of data.
- **Architecture**: Deep learning models often have at least one hidden layer between the input and output layers, allowing them to process information in a more abstract manner. This architecture is inspired by the biological nervous system.
- **Applications**: DL is particularly effective for complex tasks such as speech recognition, image classification, and natural language processing, where traditional ML methods may struggle.

In summary, AI is the overarching field that includes both ML and DL, with ML being a method of achieving AI through data-driven learning, and DL being a more advanced technique within ML that uses deep neural networks for complex data analysis.

Sources:
Metadata: {'source': 'Wikipedia', 'title': 'Deep learning', 'page': 1, 'id': '663523'}
Content Brief:


Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer. Certain tasks, such as recognizing and understanding speech, images or handwriting, are easy for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, which make them incompatible with neuroscience evidence.


Metadata: {'title': 'Machine learning', 'id': '564928', 'page': 1, 'source': 'Wikipedia'}
Content Brief:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Metadata: {'id': '6360', 'source': 'Wikipedia', 'page': 1, 'title': 'Artificial intelligence'}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental faculties once thought to require intelligence are removed from the definition. For example, optical character recognition is no longer perceived as an example of "artificial intelligence": it is just a routine technology.


Metadata: {'source': './rag_docs/cnn_paper.pdf', 'title': 'cnn_paper.pdf', 'page': 3, 'id': '895d04c8-663a-445f-a764-9026142ba3ad'}
Content Brief:


Focuses on the architectural differences of Convolutional Neural Networks (CNNs) compared to traditional Artificial Neural Networks (ANNs), emphasizing the three-dimensional organization of neurons and the specific types of layers that comprise CNNs, including convolutional, pooling, and fully-connected layers. It also outlines the functionality of these layers in processing image data for classification tasks.
Keiron O’Shea et al.
One of the key differences is that the layers within the CNN
are comprised of neurons organised into three dimensions, the spatial dimensionality
of the input (height and the width) and the depth. The depth does not
refer to the total number of layers within the ANN, but the third dimension of an
activation volume. Unlike standard ANNs, the neurons within any given layer
will only connect to a small region of the layer preceding it.
In practice this would mean that for the example given earlier, the input ’vol-
ume’ will have a dimensionality of 64 × 64 × 3 (height, width and depth), lead-
ing to a ﬁnal output layer comprised of a dimensionality of 1 × 1 × n (where
n represents the possible number of classes) as we would have condensed the
full input dimensionality into a smaller volume of class scores ﬁled across the
depth dimension.
2.1 Overall architecture
CNNs are comprised of three types of layers. These are convolutional layers,
pooling layers and fully-connected layers. When these layers are stacked, a
CNN architecture has been formed. A simpliﬁed CNN architecture for MNIST
classiﬁcation is illustrated in Figure 2.
[Figure 2 diagram. Labels: input, convolution w/ ReLu, pooling, fully-connected w/ ReLu, fully-connected, output (0 ... 9)]
Fig. 2: A simple CNN architecture, comprised of just five layers
The basic functionality of the example CNN above can be broken down into
four key areas.
1. As found in other forms of ANN, the input layer will hold the pixel values
of the image.
2. The convolutional layer will determine the output of neurons of which are
connected to local regions of the input through the calculation of the scalar
product between their weights and the region connected to the input vol-
ume. The rectiﬁed linear unit (commonly shortened to ReLu) aims to apply


Metadata: {'source': 'Wikipedia', 'id': '669662', 'page': 1, 'title': 'Loop AI Labs'}
Content Brief:


Loop AI Labs is an AI and cognitive computing company that focuses on language understanding technology. The company was founded in San Francisco in 2012 by Italian entrepreneur Gianmauro Calafiore, who sold his company Gsmbox to in 2004 and then relocated from Italy to San Francisco. Wanting to start an artificial intelligence company, he recruited two veterans of the project, the largest government-funded AI project in history, who had worked on the project at and Stanford University's . The original company name, "Soshoma", was changed to Loop AI Labs in 2015 after the company decided to change its focus from consumer-oriented to enterprise. Loop AI Labs is headquartered in San Francisco, California, with offices in New York, Milan, and Singapore. The company is privately funded. On May 4, 2017, Loop AI Labs entered into a deal with , a leading European provider of mobile messaging and solutions, to bring their cognitive computing technology to LINK's business clients, which cover 234 million people across Europe.


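The source list above mixes Wikipedia articles with a PDF chunk, and one hit ("Loop AI Labs") is only loosely related to the question. A quick way to see where a query's context came from is to tally the `source` field of the retrieved metadata. The sketch below is a minimal illustration that works on plain dictionaries copied from the output above; `tally_sources` is a hypothetical helper, not part of LangChain or of this notebook's `display_results`.

```python
from collections import Counter


def tally_sources(metadatas):
    """Count retrieved chunks per `source` value (hypothetical helper)."""
    return Counter(meta.get("source", "unknown") for meta in metadatas)


# Metadata dictionaries copied from the output of the query above.
retrieved = [
    {'source': 'Wikipedia', 'title': 'Deep learning', 'page': 1, 'id': '663523'},
    {'title': 'Machine learning', 'id': '564928', 'page': 1, 'source': 'Wikipedia'},
    {'id': '6360', 'source': 'Wikipedia', 'page': 1, 'title': 'Artificial intelligence'},
    {'source': './rag_docs/cnn_paper.pdf', 'title': 'cnn_paper.pdf', 'page': 3,
     'id': '895d04c8-663a-445f-a764-9026142ba3ad'},
    {'source': 'Wikipedia', 'id': '669662', 'page': 1, 'title': 'Loop AI Labs'},
]

counts = tally_sources(retrieved)
for source, n in counts.most_common():
    print(f"{source}: {n}")
```

A tally like this makes it easier to notice when a single corpus dominates the context or when a loosely related document slips into the top results.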


In [None]:
query = "What is the difference between transformers and vision transformers?"
result = rag_chain_w_sources.invoke(query)
display_results(result)

Query:


What is the difference between transformers and vision transformers?


Response:


The difference between transformers and vision transformers (ViTs) primarily lies in their application and the way they process input data.

1. **Input Data Representation**:
   - **Transformers**: Originally designed for natural language processing (NLP), transformers operate on sequences of tokens, where each token typically represents a word or sub-word in a text. The input is a 1D sequence of these token embeddings.
   - **Vision Transformers (ViTs)**: ViTs adapt the transformer architecture for image classification tasks by treating image patches as tokens. An image is divided into fixed-size patches, which are then flattened and linearly embedded into a sequence of vectors. This sequence is fed into the transformer, similar to how words are processed in NLP.

2. **Architecture**:
   - **Transformers**: The standard transformer architecture consists of layers of multi-headed self-attention and feedforward neural networks, designed to capture relationships and dependencies in sequential data.
   - **Vision Transformers**: ViTs maintain the core transformer architecture but modify the input to accommodate 2D image data. They utilize position embeddings to retain spatial information about the patches, allowing the model to learn the relationships between different parts of the image.

3. **Inductive Bias**:
   - **Transformers**: In NLP, transformers inherently leverage the sequential nature of text, which provides a strong inductive bias for language tasks.
   - **Vision Transformers**: ViTs have much less image-specific inductive bias compared to convolutional neural networks (CNNs). While CNNs incorporate locality and translation equivariance into their architecture, ViTs rely on the self-attention mechanism to integrate information across the entire image, which can be advantageous when trained on large datasets.

4. **Performance and Efficiency**:
   - **Transformers**: In NLP, transformers have become the standard due to their scalability and performance on large text corpora.
   - **Vision Transformers**: ViTs have shown competitive performance in image classification tasks, especially when pre-trained on large datasets. They can achieve state-of-the-art results while requiring fewer computational resources compared to traditional CNNs, particularly when scaled appropriately.

In summary, while both transformers and vision transformers share a common architectural foundation, they differ significantly in their input data handling, inductive biases, and application domains, with ViTs specifically tailored for image processing tasks.

Sources:
Metadata: {'source': './rag_docs/vision_transformer.pdf', 'id': '561364f3-23f5-428d-b43a-6171c2586694', 'title': 'vision_transformer.pdf', 'page': 0}
Content Brief:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a standard Transformer architecture directly to image classification tasks by treating image patches as tokens. It highlights the limitations of convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks when pre-trained on large datasets, while requiring fewer computational resources.
Published as a conference paper at ICLR 2021
AN IMAGE IS WORTH 16X16 WORDS:
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform
very well on image classiﬁcation tasks. When pre-trained on large amounts of
data and transferred to multiple mid-sized or small image recognition benchmarks
(ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent
results compared to state-of-the-art convolutional networks while requiring sub-
stantially fewer computational resources to train.1
1 INTRODUCTION
Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become
the model of choice in natural language processing (NLP). The dominant approach is to pre-train on
a large text corpus and then ﬁne-tune on a smaller task-speciﬁc dataset (Devlin et al., 2019). Thanks
to Transformers’ computational efﬁciency and scalability, it has become possible to train models of
unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the
models and datasets growing, there is still no sign of saturating performance.
In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989;
Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining
CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing
the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while
theoretically efﬁcient, have not yet been scaled effectively on modern hardware accelerators due to
the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-
like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al.,
2020).
Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard
Transformer directly to images, with the fewest possible modiﬁcations. To do so, we split an image
into patches and provide the sequence of linear embeddings of these patches as an input to a Trans-
former. Image patches are treated the same way as tokens (words) in an NLP application. We train
the model on image classiﬁcation in supervised fashion.
When trained on mid-sized datasets such as ImageNet without strong regularization, these mod-
els yield modest accuracies of a few percentage points below ResNets of comparable size. This
seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases
1 Fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer
arXiv:2010.11929v2  [cs.CV]  3 Jun 2021


Metadata: {'title': 'vision_transformer.pdf', 'page': 7, 'id': '3c534858-932f-4bf6-bccd-34154e51cc58', 'source': './rag_docs/vision_transformer.pdf'}
Content Brief:


Focuses on a controlled scaling study of various models, including ResNets and Vision Transformers, evaluating their transfer performance from the JFT-300M dataset. It highlights the performance versus pre-training cost, revealing that Vision Transformers outperform ResNets in terms of efficiency and scalability, while also discussing the implications for future model scaling efforts.
4.4 SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from
JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess
performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1,
R50x2 R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained
for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus
L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-
trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the
end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet
backbone).
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5
for details on computational costs). Detailed results per model are provided in Table 6 in the Ap-
pendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the
performance/compute trade-off. ViT uses approximately 2-4× less compute to attain the same
performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small compu-
tational budgets, but the difference vanishes for larger models. This result is somewhat surprising,
since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision
Transformers appear not to saturate within the range tried, motivating future scaling efforts.
4.5 INSPECTING VISION TRANSFORMER
Figure 6: Representative examples of attention from the output token to the input space. See Appendix D.7 for details.
To begin to understand how the Vision Transformer processes im-
age data, we analyze its internal representations. The ﬁrst layer of
the Vision Transformer linearly projects the ﬂattened patches into a
lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal
components of the learned embedding filters. The components
resemble plausible basis functions for a low-dimensional
representation of the ﬁne structure within each patch.
After the projection, a learned position embedding is added to the
patch representations. Figure 7 (center) shows that the model learns
to encode distance within the image in the similarity of position em-
beddings, i.e. closer patches tend to have more similar position em-
beddings. Further, the row-column structure appears; patches in the
same row/column have similar embeddings. Finally, a sinusoidal
structure is sometimes apparent for larger grids (Appendix D). That
the position embeddings learn to represent 2D image topology ex-
plains why hand-crafted 2D-aware embedding variants do not yield
improvements (Appendix D.4).
Self-attention allows ViT to integrate information across the entire
image even in the lowest layers. We investigate to what degree
the network makes use of this capability. Speciﬁcally, we compute
the average distance in image space across which information is
integrated, based on the attention weights (Figure 7, right). This
“attention distance” is analogous to receptive ﬁeld size in CNNs.
We ﬁnd that some heads attend to most of the image already in the lowest layers, showing that
the ability to integrate information globally is indeed used by the model. Other attention heads


Metadata: {'source': './rag_docs/vision_transformer.pdf', 'title': 'vision_transformer.pdf', 'page': 2, 'id': '103a7416-3ff5-4abb-91f2-5f3716cd500b'}
Content Brief:


Focuses on the architecture and methodology of the Vision Transformer (ViT), detailing how images are processed by splitting them into patches, embedding them, and utilizing a standard Transformer encoder for image classification tasks. It describes the model's design principles, including the use of position embeddings and the integration of a classification token, while referencing foundational work in Transformer architecture.
[Figure 1 diagram. Panels: Vision Transformer (ViT); Transformer Encoder. ViT labels: Linear Projection of Flattened Patches, Patch + Position Embedding, extra learnable [class] embedding, Transformer Encoder, MLP Head, Class (Bird, Ball, Car, ...). Encoder labels: Embedded Patches, Norm, Multi-Head Attention, Norm, MLP, repeated L×.]
Figure 1: Model overview. We split an image into ﬁxed-size patches, linearly embed each of them,
add position embeddings, and feed the resulting sequence of vectors to a standard Transformer
encoder. In order to perform classiﬁcation, we use the standard approach of adding an extra learnable
“classiﬁcation token” to the sequence. The illustration of the Transformer encoder was inspired by
Vaswani et al. (2017).
3 METHOD
In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible.
An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and
their efﬁcient implementations – can be used almost out of the box.
3.1 VISION TRANSFORMER (VIT)
An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D
sequence of token embeddings. To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a
sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original
image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$
is the resulting number of patches, which also serves as the effective input sequence length for the
Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we
flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to
the output of this projection as the patch embeddings.
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded
patches ($z_0^0 = x_{\text{class}}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the
image representation $y$ (Eq. 4). Both during pre-training and fine-tuning, a classification head is
attached to $z_L^0$. The classification head is implemented by a MLP with one hidden layer at pre-training
time and by a single linear layer at fine-tuning time.
Position embeddings are added to the patch embeddings to retain positional information. We use
standard learnable 1D position embeddings, since we have not observed signiﬁcant performance
gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting
sequence of embedding vectors serves as input to the encoder.
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-
attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before
every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).


Metadata: {'page': 1, 'title': 'vision_transformer.pdf', 'id': 'c9afddf8-565b-474b-918d-e489eebb1096', 'source': './rag_docs/vision_transformer.pdf'}
Content Brief:


Focuses on the performance of the Vision Transformer (ViT) in comparison to convolutional neural networks (CNNs), highlighting the advantages of large-scale training on datasets ranging from 14M to 300M images. It emphasizes that ViT achieves state-of-the-art results on various image recognition benchmarks when pre-trained on extensive datasets like ImageNet-21k and JFT-300M, despite lacking some inductive biases inherent to CNNs.
inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well
when trained on insufﬁcient amounts of data.
However, the picture changes if the models are trained on larger datasets (14M-300M images). We
ﬁnd that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent
results when pre-trained at sufﬁcient scale and transferred to tasks with fewer datapoints. When
pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches
or beats state of the art on multiple image recognition benchmarks. In particular, the best model
reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100,
and 77.63% on the VTAB suite of 19 tasks.
2 RELATED WORK
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since be-
come the state of the art method in many NLP tasks. Large Transformer-based models are often
pre-trained on large corpora and then ﬁne-tuned for the task at hand: BERT (Devlin et al., 2019)
uses a denoising self-supervised pre-training task, while the GPT line of work uses language mod-
eling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).
Naive application of self-attention to images would require that each pixel attends to every other
pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus,
to apply Transformers in the context of image processing, several approximations have been tried in
the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query
pixel instead of globally. Such local multi-head dot-product self attention blocks can completely
replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different
line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-
attention in order to be applicable to images. An alternative way to scale attention is to apply it in
blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho
et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate
promising results on computer vision tasks, but require complex engineering to be implemented
efﬁciently on hardware accelerators.
Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size 2 × 2
from the input image and applies full self-attention on top. This model is very similar to ViT,
but our work goes further to demonstrate that large scale pre-training makes vanilla transformers
competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020)
use a small patch size of 2 × 2 pixels, which makes the model applicable only to small-resolution
images, while we handle medium-resolution images as well.
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms
of self-attention, e.g. by augmenting feature maps for image classiﬁcation (Bello et al., 2019) or by
further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018;
Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classiﬁcation (Wu
et al., 2020), unsupervised object discovery (Locatello et al., 2020), or uniﬁed text-vision tasks (Chen


Metadata: {'id': 'e45099b3-f31d-4258-bbaf-3e6a4a113a4c', 'page': 3, 'source': './rag_docs/vision_transformer.pdf', 'title': 'vision_transformer.pdf'}
Content Brief:


Focuses on the architecture and operational details of the Vision Transformer (ViT), including the structure of the multi-layer perceptron (MLP) and the handling of inductive biases compared to convolutional neural networks (CNNs). It also discusses the hybrid architecture that combines CNN feature maps with ViT and outlines the fine-tuning process for downstream tasks, emphasizing the importance of resolution adjustments and the implications for model performance.
The MLP contains two layers with a GELU non-linearity.
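$$z_0 = [x_{\text{class}};\, x_p^1 E;\, x_p^2 E;\, \cdots;\, x_p^N E] + E_{\text{pos}}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} \tag{1}$$
$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1 \ldots L \tag{2}$$
$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1 \ldots L \tag{3}$$
$$y = \mathrm{LN}(z_L^0) \tag{4}$$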
Inductive bias.
We note that Vision Transformer has much less image-speciﬁc inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and transla-
tionally equivariant, while the self-attention layers are global. The two-dimensional neighborhood
structure is used very sparingly: in the beginning of the model by cutting the image into patches and
at ﬁne-tuning time for adjusting the position embeddings for images of different resolution (as de-
scribed below). Other than that, the position embeddings at initialization time carry no information
about the 2D positions of the patches and all spatial relations between the patches have to be learned
from scratch.
Hybrid Architecture.
As an alternative to raw image patches, the input sequence can be formed
from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding
projection E (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case,
the patches can have spatial size 1x1, which means that the input sequence is obtained by simply
ﬂattening the spatial dimensions of the feature map and projecting to the Transformer dimension.
The classiﬁcation input embedding and position embeddings are added as described above.
3.2 FINE-TUNING AND HIGHER RESOLUTION
Typically, we pre-train ViT on large datasets, and ﬁne-tune to (smaller) downstream tasks. For
this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward
layer, where K is the number of downstream classes. It is often beneﬁcial to ﬁne-tune at higher
resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images
of higher resolution, we keep the patch size the same, which results in a larger effective sequence
length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints),
however, the pre-trained position embeddings may no longer be meaningful. We therefore perform
2D interpolation of the pre-trained position embeddings, according to their location in the original
image. Note that this resolution adjustment and patch extraction are the only points at which an
inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.
4 EXPERIMENTS
We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the
hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size
and evaluate many benchmark tasks. When considering the computational cost of pre-training the
model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at
a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show
that self-supervised ViT holds promise for the future.
4.1 SETUP
Datasets. To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes
and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with



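The retrieved ViT chunk above states that an image of resolution (H, W) with C channels is cut into N = HW/P² patches, each flattened to P²·C values. The sketch below checks that arithmetic and performs the split on a tiny nested-list image. It is a plain-Python illustration, not the paper's implementation; the function names and the 224 × 224 × 3 image with 16 × 16 patches are assumptions chosen for the example.

```python
def patch_sequence_shape(height, width, channels, patch_size):
    """Return (N, P^2 * C): sequence length and flattened patch size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    num_patches = (height * width) // (patch_size ** 2)
    return num_patches, patch_size * patch_size * channels


def image_to_patches(image, patch_size):
    """Split an H x W x C nested list into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    channels = len(image[0][0])
    # Reuse the shape helper only for its divisibility check.
    patch_sequence_shape(height, width, channels, patch_size)
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ])
    return patches


# Assumed example sizes: a 224 x 224 RGB image with 16 x 16 patches.
seq_len, patch_dim = patch_sequence_shape(224, 224, 3, 16)
print(seq_len, patch_dim)  # 196 768

# Tiny 4 x 4 single-channel image split into 2 x 2 patches.
tiny = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tiny_patches = image_to_patches(tiny, 2)
print(tiny_patches[0])  # [0, 1, 4, 5]
```

In the paper's setup each flattened patch is then mapped to D dimensions by a trainable linear projection and a learnable [class] embedding is prepended, so the encoder sees N + 1 vectors; the sketch stops before that step.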

In [None]:
query = "What is an Agentic AI System?"
result = rag_chain_w_sources.invoke(query)
display_results(result)

Query:


What is an Agentic AI System?


Response:


The context provided does not contain specific information about an "Agentic AI System." Therefore, I don't know what an Agentic AI System is.

Sources:
Metadata: {'id': '6360', 'source': 'Wikipedia', 'title': 'Artificial intelligence', 'page': 1}
Content Brief:


Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental faculties once thought to require intelligence are removed from the definition. For example, optical character recognition is no longer perceived as an example of "artificial intelligence": it is just a routine technology.


Metadata: {'source': 'Wikipedia', 'page': 1, 'id': '674015', 'title': 'A.I. Artificial Intelligence'}
Content Brief:


A.I. Artificial Intelligence, or A.I., is a 2001 American science fiction drama movie directed by Steven Spielberg. The screenplay was by Spielberg based on the 1969 short story "Supertoys Last All Summer Long" by Brian Aldiss. The movie was produced by Kathleen Kennedy, Spielberg and Bonnie Curtis. It stars Haley Joel Osment, Jude Law, Frances O'Connor, Brendan Gleeson and William Hurt. It is set in a futuristic post-climate change society. "A.I." tells the story of David (Osment), a childlike android uniquely programmed with the ability to love.


Metadata: {'title': 'Swarm intelligence', 'source': 'Wikipedia', 'id': '112634', 'page': 1}
Content Brief:


Swarm Intelligence is a field of Computer science. It is a form of Artificial intelligence. Some animals, mostly insects like ants, or bees form large colonies. These colonies are made of many animals that communicate with each other. Each animal is relatively simple, but by working together with other animals it is able to solve complex tasks. Swarm intelligence wants to obtain similar behaviour than that observed with these animals. Instead of the animals, so called "agents" are used.


Metadata: {'title': 'Shakey the robot', 'page': 1, 'source': 'Wikipedia', 'id': '692745'}
Content Brief:


Shakey the Robot was the first general purpose mobile AI robot. The project combined research in robotics, computer vision, and natural language processing. Because of this, it was the first project that melded logical reasoning and physical action. Shakey was developed at the Artificial Intelligence Center of Stanford Research Institute (now called SRI International) by Nils John Nilsson from 1966 to 1972.


Metadata: {'title': 'Robot lawyer', 'id': '564218', 'source': 'Wikipedia', 'page': 1}
Content Brief:


A robot lawyer is an artificial intelligence (AI) computer program. It is designed to ask the same questions as a real lawyer about certain legal issues. Robot lawyers are being used in many countries around the world including the United States, the United Kingdom, and Holland.




# Build a RAG System with an Agentic Source Citations Pipeline

In [None]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = """You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            """

rag_prompt_template = ChatPromptTemplate.from_template(rag_prompt)
rag_prompt_template.pretty_print()


You are an assistant who is an expert in question-answering tasks.
                Answer the following question using only the following pieces of retrieved context.
                If the answer is not in the context, do not make up answers, just say that you don't know.
                Keep the answer detailed and well formatted based on the information from the context.

                Question:
                {question}

                Context:
                {context}

                Answer:
            


In [None]:
citations_prompt = """You are an assistant who is an expert in analyzing answers to questions
                      and finding out referenced citations from context articles.

                      Given the following question, context and generated answer,
                      analyze the generated answer and quote citations from context articles
                      that can be used to justify the generated answer.

                      Question:
                      {question}

                      Context Articles:
                      {context}

                      Answer:
                      {answer}
                  """

cite_prompt_template = ChatPromptTemplate.from_template(citations_prompt)
cite_prompt_template.pretty_print()


You are an assistant who is an expert in analyzing answers to questions
                      and finding out referenced citations from context articles.

                      Given the following question, context and generated answer,
                      analyze the generated answer and quote citations from context articles
                      that can be used to justify the generated answer.

                      Question:
                      {question}

                      Context Articles:
                      {context}

                      Answer:
                      {answer}
                  


In [None]:
from pydantic import BaseModel, Field
from typing import List

class Citation(BaseModel):
    id: str = Field(description="""The string ID of a SPECIFIC context article
                                   which justifies the answer.""")
    source: str = Field(description="""The source of the SPECIFIC context article
                                       which justifies the answer.""")
    title: str = Field(description="""The title of the SPECIFIC context article
                                      which justifies the answer.""")
    page: int = Field(description="""The page number of the SPECIFIC context article
                                     which justifies the answer.""")
    quotes: str = Field(description="""The VERBATIM sentences from the SPECIFIC context article
                                      that are used to generate the answer.
                                      Should be exact sentences from context article without missing words.""")


class QuotedCitations(BaseModel):
    """Quote citations from given context articles
       that can be used to justify the generated answer. Can be multiple articles."""
    citations: List[Citation] = Field(description="""Citations (can be multiple) from the given
                                                     context articles that justify the answer.""")
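
The schema asks for verbatim quotes, but a field description is only an instruction to the model and nothing enforces it. Below is a minimal sketch of a post-hoc check, assuming the citations have been converted to plain dicts and the article texts can be looked up by ID. The names `normalize` and `find_unsupported_quotes`, and the sample data, are hypothetical and not part of this notebook.

```python
import re

def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()

def find_unsupported_quotes(citations, articles):
    """Return (article_id, sentence) pairs for quoted sentences that do not
    appear verbatim in the cited article.

    citations: list of dicts with 'id' and 'quotes' keys (sample data shaped
               like the Citation schema; an assumption for this sketch).
    articles:  dict mapping article id -> article text.
    """
    problems = []
    for cite in citations:
        article = normalize(articles.get(cite["id"], ""))
        # Split the quote block into sentences at ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", normalize(cite["quotes"]))
        for sentence in sentences:
            if sentence and sentence not in article:
                problems.append((cite["id"], sentence))
    return problems

# Hypothetical sample data for illustration only.
articles = {"a1": "Cats are small mammals (Felis catus). They sleep a lot."}
citations = [
    {"id": "a1", "quotes": "They sleep a lot."},
    # The model dropped the parenthetical, so this sentence is not verbatim.
    {"id": "a1", "quotes": "Cats are small mammals. They sleep a lot."},
]

problems = find_unsupported_quotes(citations, articles)
print(problems)
```

A check like this could run on the citation dicts before display, so that paraphrased quotes are flagged rather than shown as exact citations.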

In [None]:
from langchain_core.documents import Document
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from operator import itemgetter


# --- Model Initialization ---

# Standard language model for generating plain text answers.
chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# The same model, wrapped so that its reply is parsed into a 'QuotedCitations' object.
# This is the key to getting reliable, structured citation data.
structured_chatgpt = chatgpt.with_structured_output(QuotedCitations)

# NOTE: These variables are assumed to be defined elsewhere:
# - similarity_retriever: A retriever that fetches relevant documents.
# - rag_prompt_template: A prompt for generating the initial answer.
# - cite_prompt_template: A prompt specifically for generating citations based on an answer.


# --- Helper Function ---

def format_docs_with_metadata(docs: List[Document]) -> str:
    """
    Formats documents to include their metadata explicitly in the context string.
    This helps the LLM to easily access information needed for citations.
    """
    formatted_docs = [
        f"""Context Article ID: {doc.metadata['id']}
            Context Article Source: {doc.metadata['source']}
            Context Article Title: {doc.metadata['title']}
            Context Article Page: {doc.metadata['page']}
            Context Article Details: {doc.page_content}
         """
        for doc in docs
    ]
    # Join all formatted doc strings into a single block of text.
    return "\n\n" + "\n\n".join(formatted_docs)


# --- Chain 1: Generates the initial text answer ---

rag_response_chain = (
    {
        # Format the retrieved documents using our detailed helper function.
        "context": (itemgetter('context') | RunnableLambda(format_docs_with_metadata)),
        # Pass the original question through.
        "question": itemgetter("question")
    }
    |
    # Use the standard RAG prompt.
    rag_prompt_template
    |
    # Use the standard chat model to generate a text answer.
    chatgpt
    |
    # Parse the output into a string.
    StrOutputParser()
)


# --- Chain 2: Generates structured citations for the answer ---

cite_response_chain = (
    {
        # Pass the original, unformatted context through.
        "context": itemgetter('context'),
        # Pass the original question through.
        "question": itemgetter("question"),
        # IMPORTANT: This chain requires the 'answer' generated by the previous chain.
        "answer": itemgetter("answer")
    }
    |
    # Use the specialized citation prompt.
    cite_prompt_template
    |
    # Use the structured-output model to get a QuotedCitations object holding the citations.
    structured_chatgpt
)


# --- Chain 3: The final orchestrator chain ---

rag_chain_w_citations = (
    # Step 1: Retrieval. Fetch context documents based on the question.
    # Output of this step: {"context": [docs], "question": "user_question"}
    {
        "context": similarity_retriever,
        "question": RunnablePassthrough()
    }
    |
    # Step 2: Generate Answer. Run the first chain to get the text answer and
    # add it to the dictionary under the key "answer".
    # Output of this step: {"context": [docs], "question": "user_question", "answer": "text_answer"}
    RunnablePassthrough.assign(answer=rag_response_chain)
    |
    # Step 3: Generate Citations. Run the second chain using the output from the previous step.
    # The result (structured citations) is added to the dictionary under the key "citations".
    # Output of this step: {"context": ..., "question": ..., "answer": ..., "citations": ...}
    RunnablePassthrough.assign(citations=cite_response_chain)
)
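
The way the dictionary grows from step to step can be traced without any LLM calls. The sketch below is a plain-Python analogy, not LangChain's implementation: `assign` is a hypothetical helper that mimics the merge behaviour described in the comments above, and the `fake_*` functions are stand-ins for the retriever and the two sub-chains.

```python
def assign(**steps):
    """Return a function that copies the input dict and adds one key per step.
    Each step receives the full input dict (hypothetical helper for illustration)."""
    def run(state: dict) -> dict:
        new_values = {key: step(state) for key, step in steps.items()}
        return {**state, **new_values}
    return run

# Stand-in steps; the real chain calls a retriever and an LLM here.
def fake_retrieve(question: str) -> list:
    return [f"doc about {question}"]

def fake_answer(state: dict) -> str:
    return f"answer built from {len(state['context'])} doc(s)"

def fake_cite(state: dict) -> list:
    # This step can read 'answer' because the previous step already added it.
    return [{"quote": state["context"][0], "for": state["answer"]}]

question = "machine learning"
state = {"context": fake_retrieve(question), "question": question}  # Step 1
state = assign(answer=fake_answer)(state)                            # Step 2
state = assign(citations=fake_cite)(state)                           # Step 3

print(list(state.keys()))
```

The order matters: the citation step reads `state["answer"]`, so it has to run after the answer step has added that key.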

In [None]:
query = "What is machine learning"
result = rag_chain_w_citations.invoke(query)
result

{'context': [Document(id='23437141-e83c-4a58-b26d-df3ef3e2d68e', metadata={'id': '564928', 'source': 'Wikipedia', 'page': 1, 'title': 'Machine learning'}, page_content='Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.'),
  Document(id='e819102a-7ef6-411a-a919-210a86db4cb8', metadata={'page': 1, 'id': '35

In [None]:
result['citations'].model_dump()['citations']


[{'id': '564928',
  'source': 'Wikipedia',
  'title': 'Machine learning',
  'page': 1,
  'quotes': 'Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.'},
 {'id': '6360',
  'source': 'Wikipedia',
  'title': 'Artificial intelligence',
  'page': 1,
  'quotes': 'Artificial intelligence (AI) is the ability of a

In [None]:
import re
# used mostly for nice display formatting, ignore if not needed
def get_cited_context(result_obj):
    # Dictionary to hold separate citation information for each unique source and title combination
    source_with_citations = {}

    def highlight_text(context, quote):
        # Normalize whitespace in both strings so line breaks do not block matches
        quote = re.sub(r'\s+', ' ', quote).strip()
        context = re.sub(r'\s+', ' ', context).strip()

        # Split the quote into sentence-level phrases at '.', '!' or '?'
        phrases = [phrase.strip() for phrase in re.split(r'[.!?]', quote) if phrase.strip()]

        highlighted_context = context

        for phrase in phrases: # for each quoted phrase

            # Escape regex metacharacters so the phrase is matched literally
            escaped_phrase = re.escape(phrase)
            # Match case-insensitively, anchored at word boundaries on both ends
            pattern = re.compile(r'\b' + escaped_phrase + r'\b', re.IGNORECASE)

            # Replace all matched phrases with bolded version
            highlighted_context = pattern.sub(lambda m: f"**{m.group(0)}**", highlighted_context)

        return highlighted_context

    # Process the citation data
    for cite in result_obj['citations'].model_dump()['citations']:
        cite_id = cite['id']
        title = cite['title']
        source = cite['source']
        page = cite['page']
        quote = cite['quotes']

        # Check if the (source, title) key exists, and initialize if it doesn't
        if (source, title) not in source_with_citations:
            source_with_citations[(source, title)] = {
                'title': title,
                'source': source,
                'citations': []
            }

        # Find or create the citation entry for this unique (id, page) combination
        citation_entry = next(
            (c for c in source_with_citations[(source, title)]['citations'] if c['id'] == cite_id and c['page'] == page),
            None
        )
        if citation_entry is None:
            citation_entry = {'id': cite_id, 'page': page, 'quote': [quote], 'context': None}
            source_with_citations[(source, title)]['citations'].append(citation_entry)
        else:
            citation_entry['quote'].append(quote)

    # Process context data
    for context in result_obj['context']:
        context_id = context.metadata['id']
        context_page = context.metadata['page']
        source = context.metadata['source']
        title = context.metadata['title']
        page_content = context.page_content

        # Match the context to the correct citation entry by source, title, id, and page
        if (source, title) in source_with_citations:
            for citation in source_with_citations[(source, title)]['citations']:
                if citation['id'] == context_id and citation['page'] == context_page:
                    # Apply highlighting for each quote in the citation's quote list
                    highlighted_content = page_content
                    for quote in citation['quote']:
                        highlighted_content = highlight_text(highlighted_content, quote)
                    citation['context'] = highlighted_content

    # Convert the dictionary to a list of dictionaries for separate entries
    final_result_list = [
        {
            'title': details['title'],
            'source': details['source'],
            'citations': details['citations']
        }
        for details in source_with_citations.values()
    ]

    return final_result_list
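
One property of the `\b`-anchored pattern in `highlight_text` is worth seeing in isolation: `\b` only matches next to a word character, so a phrase that ends in punctuation such as `)` is left unhighlighted when it is followed by a full stop. A minimal sketch using only the standard library follows. `bold_phrase` is a hypothetical helper that reproduces just the pattern construction, and the sample sentence is made up.

```python
import re

def bold_phrase(context: str, phrase: str) -> str:
    """Wrap case-insensitive, word-bounded matches of phrase in ** markers."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"**{m.group(0)}**", context)

# Made-up sample text for illustration.
context = "Samuel coined the term (in 1959). It is a subfield of computer science."

# Phrase ends with a letter: \b matches between 'e' and '.', so it is bolded.
ok = bold_phrase(context, "It is a subfield of computer science")

# Phrase ends with ')': there is no word boundary between ')' and '.',
# so the pattern finds nothing and the text comes back unchanged.
missed = bold_phrase(context, "Samuel coined the term (in 1959)")

print(ok)
print(missed == context)
```

If such phrases should be highlighted too, one option is to replace the `\b` anchors with lookarounds such as `(?<!\w)` and `(?!\w)`, which do not require the phrase itself to start or end with a word character.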


In [None]:
get_cited_context(result)

[{'title': 'Machine learning',
  'source': 'Wikipedia',
  'citations': [{'id': '564928',
    'page': 1,
    'quote': ['Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.'],
    'context': 'Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). **It 

In [None]:
from IPython.display import display, Markdown

def display_results(result_obj):
    print('Query:')
    display(Markdown(result_obj['question']))
    print()
    print('Response:')
    display(Markdown(result_obj['answer']))
    print('='*50)
    print('Sources:')
    cited_context = get_cited_context(result_obj)
    for source in cited_context:
        print('Title:', source['title'], ' ', 'Source:', source['source'])
        print('Citations:')
        for citation in source['citations']:
            print('ID:', citation['id'], ' ', 'Page:', citation['page'])
            print('Cited Quotes:')
            display(Markdown('*'+' '.join(citation['quote'])+'*'))
            print('Cited Context:')
            display(Markdown(citation['context']))
            print()


In [None]:
display_results(result)

Query:


What is machine learning


Response:


Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed. The concept was introduced by Arthur Samuel in 1959 and is rooted in the broader field of artificial intelligence (AI). 

Machine learning focuses on the study and construction of algorithms that can learn from and make predictions based on data. These algorithms operate by following programmed instructions but also have the capability to make predictions or decisions based on the data they process. They build models from sample inputs, which allows them to function in scenarios where traditional programming methods are insufficient. 

Some common applications of machine learning include:
- Spam filtering
- Detection of network intruders or malicious insiders
- Optical character recognition (OCR)
- Search engines
- Computer vision

Overall, machine learning enables systems to improve their performance on tasks over time as they are exposed to more data, making it a powerful tool in various technological applications.

Sources:
Title: Machine learning   Source: Wikipedia
Citations:
ID: 564928   Page: 1
Cited Quotes:


*Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.*

Cited Context:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). **It is a subfield of computer science**. **The idea came from work in artificial intelligence**. **Machine learning explores the study and construction of algorithms which can learn and make predictions on data**. **Such algorithms follow programmed instructions, but can also make predictions or decisions based on data**. **They build a model from sample inputs**. **Machine learning is done where designing and programming explicit algorithms cannot be done**. **Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision**.




In [None]:
query = "What is AI, ML and DL?"
result = rag_chain_w_citations.invoke(query)
display_results(result)

Query:


What is AI, ML and DL?


Response:


**Artificial Intelligence (AI)**: AI is defined as the ability of a computer program or machine to think and learn. It is also a field of study aimed at making computers "smart," allowing them to operate independently without being explicitly programmed with commands. The term was coined by John McCarthy in 1955. AI encompasses systems that can interpret external data, learn from it, and adapt to achieve specific goals. As technology advances, tasks once considered to require intelligence, such as optical character recognition, are no longer classified as AI but rather as routine technologies.

**Machine Learning (ML)**: ML is a subfield of computer science that provides computers the ability to learn from data without being explicitly programmed. This concept emerged from the broader field of artificial intelligence. Machine learning involves the study and construction of algorithms that can learn from data and make predictions or decisions based on that data. It is particularly useful in scenarios where traditional programming is impractical. Examples of machine learning applications include spam filtering, network intrusion detection, optical character recognition, search engines, and computer vision.

**Deep Learning (DL)**: DL is a specialized form of machine learning that primarily utilizes neural networks, particularly those with multiple layers (known as multi-layer neural networks). It can involve unsupervised, semi-supervised, or supervised learning sessions. Deep learning is particularly effective for complex tasks such as speech recognition, image understanding, and handwriting recognition, which are challenging for computers. The architecture of deep learning models is inspired by the information processing patterns of biological nervous systems, although they differ significantly from the structural and functional properties of human brains.

Sources:
Title: Artificial intelligence   Source: Wikipedia
Citations:
ID: 6360   Page: 1
Cited Quotes:


*Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn. It is also a field of study which tries to make computers "smart". They work on their own without being encoded with commands. John McCarthy came up with the name "Artificial Intelligence" in 1955. In general use, the term "artificial intelligence" means a programme which mimics human cognition. At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do. Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation.*

Cited Context:


**Artificial intelligence (AI) is the ability of a computer program or a machine to think and learn**. It is also a field of study which tries to make computers "smart". **They work on their own without being encoded with commands**. **John McCarthy came up with the name "Artificial Intelligence" in 1955**. **In general use, the term "artificial intelligence" means a programme which mimics human cognition**. **At least some of the things we associate with other minds, such as learning and problem solving can be done by computers, though not in the same way as we do**. **Andreas Kaplan and Michael Haenlein define AI as a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation**. An ideal (perfect) intelligent machine is a flexible agent which perceives its environment and takes actions to maximize its chance of success at some goal or objective. As machines become increasingly capable, mental faculties once thought to require intelligence are removed from the definition. For example, optical character recognition is no longer perceived as an example of "artificial intelligence": it is just a routine technology.


Title: Machine learning   Source: Wikipedia
Citations:
ID: 564928   Page: 1
Cited Quotes:


*Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data.*

Cited Context:


Machine learning gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959). **It is a subfield of computer science**. **The idea came from work in artificial intelligence**. **Machine learning explores the study and construction of algorithms which can learn and make predictions on data**. **Such algorithms follow programmed instructions, but can also make predictions or decisions based on data**. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Title: Deep learning   Source: Wikipedia
Citations:
ID: 663523   Page: 1
Cited Quotes:


*Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks. As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised. In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer.*

Cited Context:


**Deep learning (also called deep structured learning or hierarchical learning) is a kind of machine learning, which is mostly used with certain kinds of neural networks**. **As with other kinds of machine-learning, learning sessions can be unsupervised, semi-supervised, or supervised**. **In many cases, structures are organised so that there is at least one intermediate layer (or hidden layer), between the input layer and the output layer**. Certain tasks, such as as recognizing and understanding speech, images or handwriting, is easy to do for humans. However, for a computer, these tasks are very difficult to do. In a multi-layer neural network (having more than two layers), the information processed will become more abstract with each added layer. Deep learning models are inspired by information processing and communication patterns in biological nervous systems; they are different from the structural and functional properties of biological brains (especially the human brain) in many ways, which make them incompatible with neuroscience evidences.




In [None]:
query = "How is Machine learning related to supervised learning and clustering?"
result = rag_chain_w_citations.invoke(query)
display_results(result)

Query:


How is Machine learning related to supervised learning and clustering?


Response:


Machine learning is a broad field that encompasses various techniques and methodologies for enabling computers to learn from data. Two important concepts within machine learning are supervised learning and clustering.

### Supervised Learning
- **Definition**: Supervised learning is a specific type of machine learning where the model is trained on labeled data. This means that the training dataset includes both the input data and the corresponding correct outputs (labels).
- **Process**: The system infers a function from this labeled training data, learning how to map inputs to the correct outputs. The results of the training are known beforehand, allowing the system to learn to produce a "classifier" that can make predictions on new, unseen data.
- **Inductive Reasoning**: Supervised learning typically employs inductive reasoning to generalize from the training data to make predictions about new data.

### Clustering
- **Definition**: Clustering, or cluster analysis, is a type of data analysis that involves grouping a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups.
- **Application**: Clustering is commonly used in data mining and is a form of unsupervised learning, where the model learns patterns from data without labeled outputs. Unlike supervised learning, clustering does not rely on predefined labels to guide the learning process.

### Relationship Between Machine Learning, Supervised Learning, and Clustering
- **Machine Learning as an Umbrella**: Machine learning serves as the overarching field that includes various learning paradigms, including both supervised learning and clustering.
- **Different Approaches**: While supervised learning focuses on learning from labeled data to make predictions, clustering is concerned with discovering inherent groupings in data without prior labels. Both approaches are essential for different types of tasks within machine learning, showcasing the versatility of the field.

In summary, machine learning encompasses both supervised learning, which relies on labeled data for training, and clustering, which identifies patterns in unlabeled data. Each serves distinct purposes and utilizes different methodologies within the broader context of machine learning.

Sources:
Title: Supervised learning   Source: Wikipedia
Citations:
ID: 359370   Page: 1
Cited Quotes:


*In machine learning, supervised learning is the task of inferring a function from labelled training data. The results of the training are known beforehand, the system simply learns how to get to these results correctly.*

Cited Context:


**In machine learning, supervised learning is the task of inferring a function from labelled training data**. **The results of the training are known beforehand, the system simply learns how to get to these results correctly**. Usually, such systems work with vectors. They get the training data and the result of the training as two vectors and produce a "classifier". Usually, the system uses inductive reasoning to generalize the training data.


Title: Machine learning   Source: Wikipedia
Citations:
ID: 564928   Page: 1
Cited Quotes:


*Machine learning gives computers the ability to learn without being explicitly programmed. It is a subfield of computer science. The idea came from work in artificial intelligence. Machine learning explores the study and construction of algorithms which can learn and make predictions on data.*

Cited Context:


**Machine learning gives computers the ability to learn without being explicitly programmed** (Arthur Samuel, 1959). **It is a subfield of computer science**. **The idea came from work in artificial intelligence**. **Machine learning explores the study and construction of algorithms which can learn and make predictions on data**. Such algorithms follow programmed instructions, but can also make predictions or decisions based on data. They build a model from sample inputs. Machine learning is done where designing and programming explicit algorithms cannot be done. Examples include spam filtering, detection of network intruders or malicious insiders working towards a data breach, optical character recognition (OCR), search engines and computer vision.


Title: Cluster analysis   Source: Wikipedia
Citations:
ID: 593732   Page: 1
Cited Quotes:


*Clustering or cluster analysis is a type of data analysis. The analyst groups objects so that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters) in some way.*

Cited Context:


**Clustering or cluster analysis is a type of data analysis**. **The analyst groups objects so that objects in the same group (called a cluster) are more similar to each other than to objects in other groups (clusters) in some way**. This is a common task in data mining.
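
In the "Cited Context" blocks above, each quoted sentence appears in bold inside the retrieved chunk, with the closing period left outside the bold markers. The following is a minimal sketch of one way such highlighting could be produced. It is an assumption for illustration and not the notebook's actual `display_results` implementation; `highlight_quotes` is a hypothetical helper name.

```python
import re


def highlight_quotes(context: str, quote: str) -> str:
    """Wrap each sentence of `quote` found verbatim in `context` in Markdown bold."""
    # Split the quote into sentences at whitespace that follows a period,
    # then drop the trailing period so it stays outside the bold markers.
    sentences = [
        part.strip().rstrip(".")
        for part in re.split(r"(?<=\.)\s+", quote)
        if part.strip()
    ]
    for sentence in sentences:
        if sentence and sentence in context:
            # Bold only the first occurrence of each quoted sentence.
            context = context.replace(sentence, f"**{sentence}**", 1)
    return context


# Hypothetical usage with text taken from the "Cluster analysis" citation.
context = (
    "Clustering or cluster analysis is a type of data analysis. "
    "This is a common task in data mining."
)
quote = "Clustering or cluster analysis is a type of data analysis."
print(highlight_quotes(context, quote))
```

Matching sentence by sentence rather than on the whole quote lets the highlight survive when the source text has extra material between two quoted sentences, as in the "Machine learning" citation above.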




In [None]:
query = "What is the difference between transformers and vision transformers?"
result = rag_chain_w_citations.invoke(query)
display_results(result)

Query:


What is the difference between transformers and vision transformers?


Response:


The difference between transformers and vision transformers (ViTs) primarily lies in their application and the way they process input data.

### Transformers
- **Application**: Transformers were originally designed for natural language processing (NLP) tasks. They excel in handling sequential data, where the input is typically a sequence of tokens (words).
- **Input Processing**: In a standard transformer, the input is a 1D sequence of token embeddings. Each token is processed through layers of self-attention and feedforward networks, allowing the model to capture relationships between tokens regardless of their position in the sequence.

### Vision Transformers (ViTs)
- **Application**: Vision transformers adapt the transformer architecture for image classification tasks. They treat images as sequences of patches rather than as a whole.
- **Input Processing**: In ViTs, an image is divided into fixed-size patches, which are then flattened and linearly embedded into a sequence of vectors. This sequence is fed into a standard transformer encoder. The patches are treated similarly to tokens in NLP, allowing the model to leverage the self-attention mechanism to integrate information across the entire image.
- **Inductive Bias**: Unlike convolutional neural networks (CNNs), which have built-in inductive biases such as locality and translation equivariance, ViTs have much less image-specific inductive bias. They rely on the model to learn spatial relationships from the data itself, which can be advantageous when trained on large datasets.

### Summary
In summary, while both transformers and vision transformers utilize the same underlying architecture, their differences lie in their input formats and the specific tasks they are designed to handle. Transformers are optimized for sequential data in NLP, whereas vision transformers adapt this approach for image data by treating image patches as sequences, allowing them to perform well in image classification tasks when trained on large datasets.

Sources:
Title: vision_transformer.pdf   Source: ./rag_docs/vision_transformer.pdf
Citations:
ID: 561364f3-23f5-428d-b43a-6171c2586694   Page: 0
Cited Quotes:


*It highlights the limitations of convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks when pre-trained on large datasets, while requiring fewer computational resources.*

Cited Context:


Focuses on the introduction of the Vision Transformer (ViT) model, which applies a standard Transformer architecture directly to image classification tasks by treating image patches as tokens. **It highlights the limitations of convolutional neural networks (CNNs) in computer vision and presents evidence that ViT can achieve competitive performance on various benchmarks when pre-trained on large datasets, while requiring fewer computational resources**. Published as a conference paper at ICLR 2021 AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗, Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,† ∗equal technical contribution, †equal advising Google Research, Brain Team {adosovitskiy, neilhoulsby}@google.com ABSTRACT While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.1 1 INTRODUCTION Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP).
The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance. In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020). Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion. When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size.
This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases 1Fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer 1 arXiv:2010.11929v2 [cs.CV] 3 Jun 2021


ID: 39fe003d-429a-4e73-9262-f375cc845fa1   Page: 1
Cited Quotes:


*Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.*

Cited Context:


<IPython.core.display.Markdown object>


ID: 161ff54f-640b-40ee-b1dc-7724bce5522b   Page: 2
Cited Quotes:


*To handle 2D images, we reshape the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P, P)$ is the resolution of each image patch, and $N = HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.*

Cited Context:


<IPython.core.display.Markdown object>


ID: 5091e5e8-6492-4e8e-8de3-40c4b90b5142   Page: 3
Cited Quotes:


*In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global.*

Cited Context:


<IPython.core.display.Markdown object>
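
The response above describes how a ViT divides an image into fixed-size patches and flattens each one, giving a sequence of N = HW/P² vectors of length P²·C. The patch-splitting step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not code from the notebook or from the ViT paper: `patchify` is a hypothetical helper, the image is a nested list rather than a tensor, and the learned linear projection applied to each flattened patch is omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened P x P patches.

    Returns a list of N = (H * W) / P**2 vectors, each of length P * P * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row, pixel by pixel, channel by channel.
            flat = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


# Toy 4 x 4 image with 3 channels; every value encodes its own position.
H, W, C, P = 4, 4, 3, 2
image = [[[(r * W + c) * C + ch for ch in range(C)] for c in range(W)] for r in range(H)]
patches = patchify(image, P)
print(len(patches), len(patches[0]))
```

With H = W = 4, C = 3 and P = 2, the formula gives N = 16 / 4 = 4 patches, each of length 2 · 2 · 3 = 12. In a real model each flattened patch would then be multiplied by a learned embedding matrix before entering the Transformer encoder.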


