## Semi-Structured RAG For Private Data

In this homework assignment, you will build a **Retrieval Augmented Generation (RAG)** system.

Your objective is to construct a system that retrieves information from a **private database** of PDFs. These PDFs contain text, images, and tables.

The challenge lies in preserving all of these components while efficiently extracting the data relevant to a user's question.

- First, extract the text from the PDFs and compute text embeddings so they can later be compared with the user's input.
- Next, implement a retrieval step that finds the text most relevant to the user's query.
- Because some texts are too long, summarize them and pass the summary of the most similar text to the LLM as input.
- Then, combine the retrieved information with a Large Language Model (LLM) to generate contextually relevant answers to user queries.
- Finally, extend the mechanism to a multimodal setting: convert the PDF images to CLIP embeddings, compare the CLIP text embedding of the input with the image embeddings, and find the image most similar to the input text.
- Because we are using a unimodal LLM, we cannot pass images to it directly. Instead, we use image captions as part of the LLM's input.

This approach aims to keep as much of the PDFs' information as possible and to answer questions by combining the knowledge stored in the PDFs with the capabilities of the LLM.

Instruction:

<font color='77CC99'>Follow the green text and fill out the notebook.</font>


<img src='https://drive.google.com/uc?id=1kODk16WWrn9DqvaWoEAekHRXup1djGjl' width="75%">

## Packages

In [1]:
%%capture
# Restart the kernel after the first installation
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!pip install pytesseract
# For image extraction from the PDF
!pip install PyMuPDF
!pip install Pillow
# Text embedding
!pip install -U sentence-transformers
# Quote the requirement so the shell does not treat ">" as a redirect
!pip install -q transformers accelerate "bitsandbytes>=0.39.0"

# 0 - Loading Data

## 0.1 - Downloading the PDF

In [2]:
from pathlib import Path
import urllib.request

# Define the name of the PDF file and then download it
file_name = "Dall_E_paper"

url = "https://arxiv.org/pdf/2204.06125.pdf"
file_path = f"{file_name}.pdf"
urllib.request.urlretrieve(url, file_path)

('Dall_E_paper.pdf', <http.client.HTTPMessage at 0x7f10cc8aaef0>)
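
Re-running the cell above downloads the PDF again each time. A minimal sketch of a cached download could look like the following. The helper `download_if_missing` and its `fetch` parameter are hypothetical names introduced here and are not part of the original notebook.

```python
from pathlib import Path
import urllib.request


def download_if_missing(url, file_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `file_path` unless the file already exists.

    `download_if_missing` and `fetch` are hypothetical names for this sketch;
    `fetch` defaults to urllib's urlretrieve, as used in the cell above.
    Returns the path and a flag telling whether a download happened.
    """
    path = Path(file_path)
    if path.exists():
        return path, False  # already cached, nothing downloaded
    fetch(url, str(path))
    return path, True
```

Calling `download_if_missing(url, file_path)` in place of `urllib.request.urlretrieve(url, file_path)` would then skip the network request on later runs.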

## 0.2 - Extract Images and Texts

Implement mechanisms to extract images and texts from the downloaded PDFs.

In [3]:
!which pdftotext

/usr/bin/pdftotext


In [4]:
import pytesseract
print(pytesseract.get_tesseract_version())

4.1.1


In [5]:
# Import required dependencies
import fitz
import os
from PIL import Image

### Step 0.2.1: Extract and Store Images

In [6]:
# Open PDF file
pdf_file = fitz.open(file_path)

# Calculate number of pages in PDF file
page_nums = len(pdf_file)

# Create empty list to store images information
images_list = []

# Extract all images information from each page
for page_num in range(page_nums):
    page_content = pdf_file[page_num]
    images_list.extend(page_content.get_images())

In [7]:
images_path = "./images/"
Path(images_path).mkdir(parents=True, exist_ok=True)

# Save all the extracted images
for i, image in enumerate(images_list, start=1):
    # The first item of each entry is the image's xref (object number)
    xref = image[0]
    # Extract the image
    base_image = pdf_file.extract_image(xref)
    # Raw image bytes
    image_bytes = base_image['image']
    # Image file extension
    image_ext = base_image['ext']
    # Build the image file name
    image_name = f"{file_name}_{i}.{image_ext}"
    # Save the image; the `with` block closes the file automatically
    with open(os.path.join(images_path, image_name), 'wb') as image_file:
        image_file.write(image_bytes)
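
Because `get_images()` is called once per page, an image that is reused on several pages can appear in `images_list` more than once and would then be saved repeatedly. A minimal sketch for removing such repeats, assuming (as in the cell above) that the first item of each entry is the image's xref. `unique_by_xref` is a hypothetical helper, and the sample tuples are made up.

```python
def unique_by_xref(images):
    """Keep the first entry for each xref, preserving the original order."""
    seen = set()
    unique = []
    for image in images:
        xref = image[0]  # first tuple item is the xref, as used above
        if xref not in seen:
            seen.add(xref)
            unique.append(image)
    return unique


# Toy tuples standing in for the entries returned by page.get_images()
sample = [(12, "a"), (15, "b"), (12, "a"), (20, "c")]
print(unique_by_xref(sample))
```

Applying it as `images_list = unique_by_xref(images_list)` before the save loop would write each distinct image only once.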



### Step 0.2.2: Extract and Store Texts From PDF Content

In [None]:
! pip install unstructured[all-docs]==0.11.2

In [9]:
from lxml import html
from pydantic import BaseModel
from typing import Any, Optional
from unstructured.partition.pdf import partition_pdf

path='./'

# Specify the path to the poppler installation
poppler_path = '/usr/bin/'  # Replace with the path obtained from the previous step

# Specify the path to the Tesseract OCR installation
tesseract_path = '/usr/bin/tesseract'  # Replace with the path obtained from the previous step


# Get elements
raw_pdf_elements = partition_pdf(
    filename= "./"+"Dall_E_paper.pdf",
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks:
    # - max_characters: hard maximum on chunk size
    # - new_after_n_chars: attempt to start a new chunk after 1900 chars
    # - combine_text_under_n_chars: attempt to keep chunks above 1000 chars
    max_characters=2000,
    new_after_n_chars=1900,
    combine_text_under_n_chars=1000,
    image_output_dir_path=poppler_path,
    tesseract_path=tesseract_path,
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
# Text
text_elements = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.CompositeElement" in str(type(element)):
      text_elements.append(str(element))

print(len(text_elements))

39


In [11]:
text_elements[10]

'3.2 Interpolations\n\nIt is also possible to blend two images x1 and x2 for variations (Figure 4), traversing all of the concepts in CLIP’s embedding space that occur between them. To do this, we rotate between their CLIP embeddings zi1 and zi2 using spherical interpolation, yielding intermediate CLIP representations ziθ = slerp(zi1 , zi2, θ) as θ is varied from 0 to 1. There are two options for producing the intermediate DDIM latents along the trajectory. The ﬁrst option involves interpolating between their DDIM inverted latents xT1 and xT2 (by setting xTθ = slerp(xT1, xT2, θ)), which yields a single trajectory whose endpoints reconstruct x1 and x2. The second option involves ﬁxing the DDIM latent to a randomly-sampled value for all interpolates in the trajectory. This results in an inﬁnite number of trajectories between x1 and x2, though the endpoints of these trajectories will generally no longer coincide with the original images. We use this approach in Figure 4.\n\n3.3 Text Diffs

Because some texts are too long, we first summarize them.
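
Not every chunk needs summarizing: very short chunks can be passed through unchanged, which also avoids asking the summarizer for an output longer than its input. A minimal sketch of that idea follows. The helper `summarize_long_texts` and the word-count threshold are assumptions chosen only for illustration, and `summarize` stands for any callable that maps a string to its summary.

```python
def summarize_long_texts(texts, summarize, min_words=60):
    """Summarize only texts with at least `min_words` words; keep the rest as-is.

    Returns dicts with a 'summary_text' key, mirroring the output format of
    the summarization pipeline used in this notebook.
    """
    results = []
    for text in texts:
        if len(text.split()) >= min_words:
            results.append({"summary_text": summarize(text)})
        else:
            results.append({"summary_text": text})
    return results


# Stand-in summarizer for demonstration: keeps the first five words.
def first_five_words(text):
    return " ".join(text.split()[:5])


demo = summarize_long_texts(["short text", "word " * 80], first_five_words)
print(demo)
```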

In [12]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)

summarized_text_elements = summarizer(text_elements , max_length=100, do_sample=False)

Your max_length is set to 100, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
Your max_length is set to 100, but your input_length is only 50. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=25)


# 1 - Unimodal RAG

## 1.1 - Loading True Text Data as Embedded Vectors

In this section, we convert the text data into embedding vectors and store them. Later, given an input, we can compare its embedding against the stored ones to find the fact most similar to the input.

We use this model for [Text-Embedding](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

In [13]:
from sentence_transformers import SentenceTransformer, util

text_emb_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [14]:
text_embeddings = text_emb_model.encode(text_elements, convert_to_tensor=True)

Now we have all the embeddings we need. For a new input, we compare its embedding with `text_embeddings` and find the closest one.

## 1.2 - Unimodal Semi-Structured RAG

### Step 1.2.1: Most Similar Ground Truth Text Extraction

As a first step, for any given input, we need a function that finds the embedding vector closest to the input's embedding. We use cosine similarity for this operation.
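
For reference, the cosine similarity between an input embedding $q$ and a stored embedding $e_i$ is the dot product of the two vectors after each is scaled to unit length. The symbols $q$, $e_i$, and $i^{*}$ are notation introduced here, not names from the notebook.

```latex
\[
\operatorname{sim}(q, e_i) = \frac{q \cdot e_i}{\lVert q \rVert \, \lVert e_i \rVert},
\qquad
i^{*} = \arg\max_{i} \operatorname{sim}(q, e_i)
\]
```

Values lie in $[-1, 1]$, and $i^{*}$ is the index of the stored text that is most similar to the input.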

<font color='77CC99'>Write a function "text_embedding_similarity" that converts the input text to an embedding vector and returns its similarity to each of the ground truth texts.</font>


In [15]:
import numpy as np

def get_similarity(embeddings_1, embeddings_2):

  embeddings_1 = embeddings_1 / embeddings_1.norm(dim=-1, keepdim=True)
  embeddings_2 = embeddings_2 / embeddings_2.norm(dim=-1, keepdim=True)

  return embeddings_1.cpu().detach().numpy() @ embeddings_2.cpu().detach().numpy().T

In [16]:
def text_embedding_similarity(input_text, text_embeddings, text_emb_model):

  ### To Do ###

    input_text_emb = text_emb_model.encode([input_text], convert_to_tensor=True)[0]

  ### End ###

    return get_similarity(text_embeddings, input_text_emb)

In [17]:
input_text = "is DALL-E2 uses a clip model inside?"
text_embedding_similarity(input_text, text_embeddings, text_emb_model)

array([0.18537423, 0.20288187, 0.21293396, 0.20692343, 0.2686563 ,
       0.31146652, 0.01873749, 0.21311942, 0.20436674, 0.25198764,
       0.26972088, 0.24603592, 0.2582291 , 0.2539547 , 0.23482765,
       0.23657551, 0.17059211, 0.15917057, 0.15459305, 0.14453974,
       0.11232636, 0.18208514, 0.12710598, 0.37942076, 0.05238129,
       0.3243861 , 0.30019337, 0.0696483 , 0.10325827, 0.13214463,
       0.02927516, 0.15812725, 0.10174509, 0.26262775, 0.15325922,
       0.00922206, 0.31082875, 0.04998646, 0.12929475], dtype=float32)

<font color='77CC99'> Now, write a function "text_retrival" that finds the summaries of the k most similar ground truth texts to the user's input.</font>

In [18]:
import heapq

def text_retrival(k,input_text,text_embeddings,text_elements,summarized_text_elements,text_emb_model):

  ### To Do ###

    # Compute the similarity scores between the input text and each element in the text embeddings
    text_embedding_similarities = text_embedding_similarity(input_text, text_embeddings, text_emb_model)

    # To find the indices of the top k most similar elements
    top_k_similar = heapq.nlargest(k, enumerate(text_embedding_similarities), key=lambda x: x[1])

    # For each of the top k indices, select the corresponding text element
    selected_text_elements = [summarized_text_elements[ind]['summary_text'] for ind, val in top_k_similar]

  ### End ###

    return {"selected_text_elements":selected_text_elements}
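
To see the ranking step in isolation, here is a dependency-free sketch on toy 2-D vectors. The vectors, the texts, and the helper names `cosine` and `top_k_indices` are made up for illustration; the notebook itself ranks the real sentence embeddings.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_indices(query, candidates, k):
    """Indices of the k candidates most similar to the query, best first."""
    scores = [cosine(query, c) for c in candidates]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_texts = ["about CLIP", "about trains", "about both"]
best = top_k_indices([1.0, 0.1], toy_embeddings, k=2)
print([toy_texts[i] for i in best])
```

The query points almost entirely along the first axis, so the first vector ranks highest, followed by the diagonal one.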

### Step 1.2.2: Load the core LLM and Combine them all

We use an LLM for question answering as the core of our system. Given the input question and the ground truth fact closest to it, we pass both to the LLM so that it can answer the question.

Here we load the core LLM for our Unimodal Semi-Structured RAG. [Model in HF](https://huggingface.co/samwit/koala-7b)

In [19]:
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
import textwrap

model = LlamaForCausalLM.from_pretrained(
    "samwit/koala-7b",
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = LlamaTokenizer.from_pretrained("samwit/koala-7b")

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [20]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15)

tokenizer.pad_token_id = tokenizer.eos_token_id

What follows is our prompt. Based on it, complete the task below.


<font color='77CC99'> Based on the prompt and what we have done so far, write a function that answers the user's question by retrieving the most related ground truth text (fact) and passing it to the LLM through the prompt. Function "Unimodal_Question_Answering" </font>

In [21]:
# Prompt
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """
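
To see what the LLM will actually receive, the template can be filled with placeholder values. In this sketch the template is copied verbatim from the cell above, while the fact and the question are made up for illustration.

```python
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """

# Placeholder values, not taken from the PDF
example_prompt = prompt_text.format(
    text_facts="The sky appears blue because of Rayleigh scattering",
    user_question="Why is the sky blue?",
)
print(example_prompt)
```

The prompt ends with the `ANSWER:` marker, so the model's completion is expected to start right after it.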

In [22]:
def Unimodal_Question_Answering(input_text,k=1):

  ### To Do ###

    retrieved_texts = text_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements, text_emb_model)
    retrieved_texts = ' '.join(retrieved_texts["selected_text_elements"])

    # Generate the prompt for the language model
    prompt = prompt_text.format(text_facts=retrieved_texts, user_question=input_text)

    # Generate the answer using the language model
    response = pipe(prompt)


  ### End ###

    return response

In [23]:
input_text = "is DALL-E2 uses a clip model inside?"

response = Unimodal_Question_Answering(input_text,k=1)



In [24]:
print(response[0]['generated_text'])

ANSWER the QUESTION in conformity to on FACTS. 

FACTS: 
 Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration.. 

QUESTION: is DALL-E2 uses a clip model inside? 

ANSWER:  
    
    No, DALL-E2 does not use a CLIP model inside. It uses a combination of techniques such as data augmentation, transfer learning, and attention mechanisms to improve the performance of the model. The specific details of how these techniques are implemented can be found in the paper "DALL-E2: Improving Image Generation with Transfer Learning" by Chen et al. \[10\].
    
    

In summary, while CLIP has been shown to be effective for text-conditioned image generation, it is not used directly within 
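
By default the text-generation pipeline returns the prompt followed by the completion, which is why the facts and the question are echoed in the output above. A minimal sketch for keeping only the model's answer, assuming the prompt contains the `ANSWER:` marker exactly once, as `prompt_text` does. `extract_answer` is a hypothetical helper and the sample string is made up.

```python
def extract_answer(generated_text, marker="ANSWER:"):
    """Return the text after the first occurrence of `marker`, stripped.

    If the marker is missing, the whole text is returned unchanged.
    """
    _, found, answer = generated_text.partition(marker)
    return answer.strip() if found else generated_text


sample_output = "FACTS: \n some fact. \nQUESTION: a question? \nANSWER:  \n  No, it does not."
print(extract_answer(sample_output))
```

Alternatively, the text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text.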

# 2 - Multimodal RAG

In this section, we add another modality to our unimodal RAG. What happens if we also treat images as ground truth facts?

We have already stored all ground truth images. In this step, we extract image embeddings so that they can be compared with the embedding of the textual input.

## 2.1 - Loading CLIP Model for Extracting Embeddings

<font color='77CC99'> Load the CLIP model for extracting textual and visual embeddings, then convert all input images to their corresponding vectors.

[Huggingface Link](https://huggingface.co/docs/transformers/model_doc/clip) </font>


In [25]:
from PIL import Image
import requests
from transformers import AutoTokenizer, CLIPTextModelWithProjection
from transformers import AutoProcessor, CLIPVisionModelWithProjection

### To Do ###

textual_clip_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
textual_clip_tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
visual_clip_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
visual_clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

### End ###

In [26]:
import glob
images_path = glob.glob('./images/*')

### To Do ###

images_embeddings = []

for i in range(len(images_path)):

    image = [Image.open(images_path[i])]

    # Preprocess the image
    image_input = visual_clip_processor(images=image, return_tensors="pt", padding=True)

    # Get the image embedding
    with torch.no_grad():
        image_embedding = visual_clip_model(**image_input)

    # Append the image embedding to the list
    images_embeddings.append(image_embedding[0])

images_embeddings = torch.stack(images_embeddings)

### End ###

Because we are using a unimodal LLM, we need to make the images' information understandable to the LLM. Hence, we extract a textual caption for each image and store the captions in the `caption_list` list.

<font color='77CC99'>Write the corresponding code.</font>


[Image Captioning HF Model](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)

In [27]:
from transformers import pipeline
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

### To Do ###

caption_list = [image_to_text(img_path) for img_path in images_path]

### End ###

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
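
Each entry of `caption_list` is itself a list containing one dict with a `generated_text` key, which is why the retrieval code indexes it as `caption_list[ind][0]['generated_text']`. A minimal sketch that flattens this structure into plain strings follows. `flatten_captions` is a hypothetical helper and the sample data is made up.

```python
def flatten_captions(raw_captions):
    """Turn [[{'generated_text': ...}], ...] into a flat list of stripped strings."""
    return [entry[0]["generated_text"].strip() for entry in raw_captions]


# Made-up captions in the same nested shape as `caption_list`
sample = [
    [{"generated_text": "a green and white train on a track "}],
    [{"generated_text": "a close up of a sign"}],
]
print(flatten_captions(sample))
```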


Now that we have everything ready, we can code the Multimodal Semi-Structured RAG.

## 2.2 - Multimodal Semi-Structured RAG

### Step 2.2.1: Most Similar Ground Truth Image Extraction

As a first step, for any given input, we need a function that finds the visual embedding vector closest to the input's textual embedding. We use cosine similarity for this operation.

<font color='77CC99'>Write a function "visual_embedding_similarity" that converts the input text to a CLIP embedding vector and returns its similarity to each of the ground truth images.</font>

Note: You can use the "get_similarity" function that you defined before.

In [28]:
def get_embedding_similarity(input_text, images_embeddings, textual_clip_tokenizer ,textual_clip_model):

  ### To Do ###

    # Tokenize the input text using the textual_clip_tokenizer
    text_input = textual_clip_tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

    # Obtain the text embeddings from the textual_clip_model
    text_embeds = textual_clip_model(text_input.input_ids)

  ### End ###

    return get_similarity(images_embeddings, text_embeds[0])

<font color='77CC99'>Now, write a function that finds the k texts and the k images most similar to the user's input.</font>

In [29]:
import heapq

def multimodal_retrival(k,input_text,text_embeddings,text_elements,summarized_text_elements,
                        text_emb_model,images_embeddings,caption_list,textual_clip_tokenizer ,textual_clip_model):

  ### To Do ###

    # TEXT
    text_embedding_similarities = text_embedding_similarity(input_text, text_embeddings, text_emb_model)
    top_k_similar = heapq.nlargest(k, enumerate(text_embedding_similarities), key=lambda x: x[1])
    selected_text_elements = [summarized_text_elements[ind]['summary_text'] for ind, val in top_k_similar]

    # IMAGE
    image_embedding_similarities = get_embedding_similarity(input_text, images_embeddings, textual_clip_tokenizer, textual_clip_model)
    top_k_similar = heapq.nlargest(k, enumerate(image_embedding_similarities), key=lambda x: x[1])
    selected_image_elements = [caption_list[ind][0]['generated_text'] for ind, val in top_k_similar]


    return {"selected_image_elements": selected_image_elements,
            "selected_text_elements": selected_text_elements}

  ### End ###
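
`multimodal_retrival` always returns k captions, even when no image is a close match for the question. One possible refinement, which is not part of the assignment, is to drop candidates whose similarity falls below a cutoff. In this sketch `filter_by_score` and the cutoff value 0.25 are assumptions chosen for illustration only, and the sample captions and scores are made up.

```python
def filter_by_score(scored_items, min_score=0.25):
    """Keep (item, score) pairs whose score is at least `min_score`.

    Returns only the items, in their original order.
    """
    return [item for item, score in scored_items if score >= min_score]


# Made-up captions paired with made-up similarity scores
candidates = [("a diagram of an encoder", 0.31), ("a train on a track", 0.12)]
print(filter_by_score(candidates))
```

A suitable cutoff depends on the embedding model and the data, so it would need to be tuned rather than taken from this example.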

### Step 2.2.2: Use the core LLM and Combine them all

In this section, we reuse the LLM loaded earlier together with the retrieval functions above to build the Multimodal RAG.

<font color='77CC99'> Based on the new prompt, which contains both the textual ground truth facts and the captions of the visual ground truth images, write the "Multimodal_Question_Answering" function. This function should take the user's textual question as input, find the most closely related textual and visual ground truths, and pass them all to the LLM via the prompt.</font>

In [30]:
# Prompt
prompt_text = """ANSWER the QUESTION in conformity to on FACTS. \n
FACTS: \n {text_facts} \n {image_facts}. \n
QUESTION: {user_question} \n
ANSWER:  """

In [31]:
def Multimodal_Question_Answering(input_text,k=1):

  ### To Do ###

    retrieval_results = multimodal_retrival(k, input_text, text_embeddings, text_elements, summarized_text_elements,
                                             text_emb_model, images_embeddings, caption_list, textual_clip_tokenizer, textual_clip_model)

    text_facts = ' '.join(retrieval_results["selected_text_elements"])
    image_facts = ' '.join(retrieval_results["selected_image_elements"])

    prompt = prompt_text.format(text_facts=text_facts, image_facts=image_facts, user_question=input_text)

    response = pipe(prompt)

  ### End ###

    return response

In [32]:
input_text = "is DALL-E2 uses a clip model inside?"

response = Multimodal_Question_Answering(input_text,k=1)

In [33]:
print(response[0]['generated_text'])

ANSWER the QUESTION in conformity to on FACTS. 

FACTS: 
 Since its release, CLIP has been used extensively to steer generative image models towards text prompts. Nichol et al. [35] showed classiﬁer-free guidance works more favorably than CLIP guidance for text conditional image generation. Zhou and Crowson [9] trained diffusion models conditioned on CLIP text embeddings, allowing for direct text-conditional imagegeneration. 
 a green and white train on a track . 

QUESTION: is DALL-E2 uses a clip model inside? 

ANSWER:  
CLIP is not used directly in DALL-E2. Instead, DALL-E2 uses a combination of various techniques such as attention mechanisms, data augmentation, and transfer learning to improve the performance of the model. The specific details of how these techniques are implemented can be found in the paper "DALL-E2: Improving Image Generation with Transfer Learning" by Chen et al. \[10\].


<font color='77CC99'>The Answer to the input question is "Yes" or "No". What are your Semi-structured models' answers? (Both Unimodal and Multimodal). Are they right or not?</font>

<font color='CC7799'>Your Answer:</font>


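Both the Unimodal and Multimodal models answer "No", and both are incorrect. The paper does use CLIP: the text extracted in Step 0.2.2 (Section 3.2) describes interpolating between CLIP embeddings, and CLIP embeddings are central to the method. The retrieved fact appears to come from a passage about prior work rather than about the method itself, and the retrieved image caption ("a green and white train on a track") is unrelated to the question, so the LLM had little relevant context. Both answers also cite a paper that does not appear in the retrieved facts. In addition, the Multimodal model did not answer the question directly with "Yes" or "No".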