# DSA4265 Assignment 2: RAG Generation

With so much news published by different agencies today, investors find it increasingly difficult to read through every article to find the answers they are looking for.

The goal of this assignment is therefore to build a search engine that summarises key information about recent stock data. This gives investors better context on a stock so that they can make more informed decisions about its performance.

## Part 1: Data Extraction

This section describes the data extraction process and the generation of the labelled dataframe. The tickers analysed are Apple Inc. (AAPL) and Tesla Inc. (TSLA). The data was sourced from Refinitiv Workspace, and the code used to extract the dataframes was copied from its built-in CodeBook. News headlines were limited to the top news topics for corporate finance deals (`TOP/DEALS`) and digital finance (`TOP/DIGFIN`), plus overall top news about each stock (`TOPALL`).

In [2]:
import refinitiv.data as rd
from refinitiv.data.content import news
from IPython.display import HTML
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
import time
import warnings
import refinitiv.data.eikon as ek
warnings.filterwarnings("ignore")



In [2]:
rd.open_session()

<refinitiv.data.session.Definition object at 0x2969ae560e0 {name='workspace'}>

In [None]:
def fetch_full_story(story_id):
    try:
        story = rd.news.get_story(story_id, format=rd.news.Format.TEXT)
        return story if story else "Story not available"
    except Exception as e:
        print(f"Error fetching {story_id}: {e}")
        return "Error retrieving story"

dNow = datetime.now().date()
maxenddate = dNow - timedelta(days=90) #upto months=15
compNews = pd.DataFrame()
riclist = ['TSLA.O','AAPL.O'] # can also use Peers, Customers, Suppliers, Monitor, Portfolio to build universe

for ric in riclist:
    try:
        cHeadlines = rd.news.get_headlines("R:" + ric + " AND Language:LEN AND Source:RTRS AND (Topic:TOP/DEALS OR Topic:TOP/DIGFIN)", 
                                           start= str(dNow), 
                                           end = str(maxenddate), count = 100)
        cHeadlines['ric'] = ric
        # Corporate Finance: TOP/DEALS, Broker Research / Recommendation: RCH
        if len(compNews):
            compNews = pd.concat([compNews,cHeadlines])
        else:
            compNews = cHeadlines
    except Exception:
        pass

# Apply to all rows
compNews["full_story"] = compNews["storyId"].apply(fetch_full_story)

compNews2 = pd.DataFrame()
riclist = ['TSLA.O','AAPL.O'] # can also use Peers, Customers, Suppliers, Monitor, Portfolio to build universe

for ric in riclist:
    try:
        cHeadlines = rd.news.get_headlines("R:" + ric + " AND Language:LEN AND Source:RTRS AND (Topic:TOPALL)", 
                                           start= str(dNow), 
                                           end = str(maxenddate), count = 100)
        cHeadlines['ric'] = ric
        # Corporate Finance: TOP/DEALS, Broker Research / Recommendation: RCH
        if len(compNews2):  # check the dataframe being built in this loop, not compNews
            compNews2 = pd.concat([compNews2,cHeadlines])
        else:
            compNews2 = cHeadlines
    except Exception:
        pass

compNews2["full_story"] = compNews2["storyId"].apply(fetch_full_story)
combined_df = pd.concat([compNews, compNews2], axis = 0)
combined_df.to_csv('combined_news_updated.csv')
combined_df

versionCreated,headline,storyId,sourceCode,ric,full_story
2025-02-27 13:00:00,RPT-BREAKINGVIEWS-GM illuminates good times be...,urn:newsml:reuters.com:20250227:nL3N3PH1P3:5,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
2025-02-25 12:52:47,Tesla to acquire parts of insolvent German par...,urn:newsml:reuters.com:20250225:nL5N3PG0Y9:7,NS:RTRS,TSLA.O,"* \n Acquisition includes 300 staff, excl..."
2025-02-25 01:10:32,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250225:nL3N3PG036:3,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
2025-02-24 12:00:00,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250224:nL3N3PF0DU:4,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
2025-02-21 13:55:05,UPDATE 6-Japan seeks Tesla investment in Nissa...,urn:newsml:reuters.com:20250221:nL3N3PC0GR:2,NS:RTRS,TSLA.O,* \n Japanese group draws up plans for Te...
...,...,...,...,...,...
2025-02-14 15:03:57,US STOCKS-Wall St subdued as markets await tar...,urn:newsml:reuters.com:20250214:nL4N3P515Y:5,NS:RTRS,AAPL.O,"(For a Reuters live blog on U.S., UK and Europ..."
2025-02-14 14:04:36,US STOCKS-Wall St set for subdued open as mark...,urn:newsml:reuters.com:20250214:nL4N3P512O:5,NS:RTRS,AAPL.O,"(For a Reuters live blog on U.S., UK and Europ..."
2025-02-14 12:31:59,US STOCKS-Futures slip as markets await tariff...,urn:newsml:reuters.com:20250214:nL4N3P50XC:5,NS:RTRS,AAPL.O,"(For a Reuters live blog on U.S., UK and Europ..."
2025-02-14 11:50:37,REFILE-FACTBOX-China's AI firms take spotlight...,urn:newsml:reuters.com:20250214:nL1N3P50BI:1,NS:RTRS,AAPL.O,(Refiles to correct transposed letters in Doub...


## Part 2: Building of RAG Model

The RAG model was built from four features, each of which is described in more depth below:

### Feature 1: Chunking of Documents

To split the documents into distinct chunks, LangChain's `RecursiveCharacterTextSplitter` was used with a chunk size of 500 characters and an overlap of 100 characters. The overlap preserves context across chunk boundaries, so each chunk can be understood on its own.
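
The effect of the overlap can be illustrated with a minimal sketch. It uses a plain fixed-width sliding window, which is simpler than the separator-aware splitting that the LangChain splitter performs, and `chunk_with_overlap` is a hypothetical helper written only for this illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 1,200 digits standing in for a news story
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one
print(chunks[0][-100:] == chunks[1][:100])
```

Because consecutive chunks share 100 characters, a sentence cut at a chunk boundary still appears whole in at least one of the two chunks, provided it is shorter than the overlap.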

### Feature 2: Sentence Embeddings

To convert text into numerical vectors that a machine can process, a sentence transformer model (`sentence-transformers/all-MiniLM-L6-v2`) was selected. The resulting embeddings support semantic similarity search.
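
As a rough illustration of what semantic similarity between embeddings means, the sketch below computes cosine similarity on toy vectors. The three-dimensional vectors are made-up values, not real model outputs, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for a query and two candidate chunks
query_vec = [1.0, 0.0, 1.0]
similar_vec = [0.9, 0.1, 0.8]
unrelated_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, similar_vec) > cosine_similarity(query_vec, unrelated_vec))
```

Texts with related meanings are expected to map to vectors that point in similar directions, which is what allows a query to be matched against chunks that do not share its exact wording.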

### Feature 3: Vector Store

Facebook AI Similarity Search (FAISS) was used to store the embeddings for quick retrieval.
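
Conceptually, retrieval from a vector store is a nearest-neighbour search over the stored embeddings. The sketch below shows an exhaustive search by squared Euclidean distance in plain Python. It assumes a flat (exact) index and toy two-dimensional vectors; `search` and `toy_index` are hypothetical names, and FAISS itself performs this kind of search far more efficiently.

```python
import heapq


def l2_squared(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index: dict[str, list[float]], query: list[float], k: int = 3):
    """Return the k (distance, doc_id) pairs closest to the query vector."""
    scored = ((l2_squared(vec, query), doc_id) for doc_id, vec in index.items())
    return heapq.nsmallest(k, scored)


# Made-up vectors keyed by chunk id
toy_index = {
    "chunk_a": [0.0, 0.0],
    "chunk_b": [1.0, 1.0],
    "chunk_c": [5.0, 5.0],
}

for distance, doc_id in search(toy_index, [0.9, 1.2], k=2):
    print(doc_id, round(distance, 2))
```

The `k` argument plays the same role as `k=3` in the `similarity_search` call used later in the notebook: only the closest few chunks are passed on to the language model.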

### Feature 4: Language Model

A Hugging Face text-generation pipeline running `EleutherAI/gpt-neo-2.7B` generates the answer from the retrieved chunks and the user's question.

In [5]:
import pandas as pd
import faiss
import torch
from transformers import AutoTokenizer, AutoModel
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DataFrameLoader
from langchain.schema import Document
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import re



In [4]:
combined_news_df = pd.read_csv("combined_news_updated.csv")
tsla_news = combined_news_df[combined_news_df['ric'] == 'TSLA.O']
aapl_news = combined_news_df[combined_news_df['ric'] == 'AAPL.O']


In [97]:
tsla_news

Unnamed: 0,versionCreated,headline,storyId,sourceCode,ric,full_story
0,2025-02-27 13:00:00,RPT-BREAKINGVIEWS-GM illuminates good times be...,urn:newsml:reuters.com:20250227:nL3N3PH1P3:5,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
1,2025-02-25 12:52:47,Tesla to acquire parts of insolvent German par...,urn:newsml:reuters.com:20250225:nL5N3PG0Y9:7,NS:RTRS,TSLA.O,"* \n Acquisition includes 300 staff, excl..."
2,2025-02-25 01:10:32,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250225:nL3N3PG036:3,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
3,2025-02-24 12:00:00,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250224:nL3N3PF0DU:4,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
4,2025-02-21 13:55:05,UPDATE 6-Japan seeks Tesla investment in Nissa...,urn:newsml:reuters.com:20250221:nL3N3PC0GR:2,NS:RTRS,TSLA.O,* \n Japanese group draws up plans for Te...
...,...,...,...,...,...,...
157,2025-02-18 10:56:15,Netherlands to build 1.4 GW battery storage fa...,urn:newsml:reuters.com:20250218:nL6N3P90C3:2,NS:RTRS,TSLA.O,"AMSTERDAM, Feb 18 - Dutch energy storage firm ..."
158,2025-02-18 06:29:17,Tesla steps up India hiring after Musk-Modi me...,urn:newsml:reuters.com:20250218:nL3N3P90AQ:1,NS:RTRS,TSLA.O,Feb 18 (Reuters) - Elon Musk's Tesla <TSLA.O> ...
159,2025-02-18 04:44:51,Tesla begins mass production of revamped Model...,urn:newsml:reuters.com:20250218:nP8N3O50DP:2,NS:RTRS,TSLA.O,"BEIJING, Feb 18 (Reuters) - U.S. automaker Tes..."
160,2025-02-18 03:27:45,Complaints targeting BYD flood Chinese consume...,urn:newsml:reuters.com:20250218:nL3N3P80MF:5,NS:RTRS,TSLA.O,"BEIJING, Feb 18 (Reuters) - Complaints about B..."


In [99]:
# Convert dataframe to documents; the article timestamp is stored as metadata
documents = [
    Document(page_content=row['full_story'],
             metadata={'date': row['versionCreated']})
    for _, row in tsla_news.iterrows()
]

# Chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = text_splitter.split_documents(documents)

# Initialize embedding model for documents and queries
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

# Create FAISS vector database
vector_db = FAISS.from_documents(split_docs, embeddings)

# Use a generative model for text generation
llm_model = "EleutherAI/gpt-neo-2.7B"  # You can try other models as needed
# llm_model = "bigscience/bloom-560m"
llm_pipeline = pipeline("text-generation", model=llm_model, device=0 if torch.cuda.is_available() else -1, 
                        max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Create the RetrievalQA chain, passing in the LLM and vector database retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=vector_db.as_retriever()
)

In [93]:
def ask_question(query):
    """Function to ask a question using RAG model with extracted context chunks."""
    # Retrieve relevant documents
    retrieved_docs = vector_db.similarity_search(query, k=3)
    
    # Extract the context chunks from the retrieved documents
    prompt_context = [doc.page_content for doc in retrieved_docs]
    # return prompt_context
    # Prepare the prompt
    context_str = "\n\n".join(prompt_context)
    prompt = f"""
        You are an AI system. Below are relevant news articles with potential relevance:
        {context_str}

        Based on these excerpts, if the information is insufficient, say "I do not have enough information." Otherwise, answer the following:
        
        Question: {query}

        Answer:
    """.strip()
    def remove_consecutive_duplicates(text: str):
        # Split the text into sentences using regex
        sentences = re.split(r'(?<=[.!?])\s+', text)
        
        # Create a list to store non-duplicate sentences
        unique_sentences = []
        
        # Iterate over the sentences and add them to unique_sentences if not a duplicate
        for i in range(len(sentences)):
            current_sentence = sentences[i].strip()
            
            # If it's the first sentence or not a duplicate of the previous one, keep it
            if i == 0 or current_sentence != sentences[i - 1].strip():
                unique_sentences.append(current_sentence)
        
        # Join the unique sentences back into a single text
        return ' '.join(unique_sentences)
    
    # Generate the model's response
    response = llm_pipeline(prompt)[0]['generated_text']
    # Find where the 'Question:' part starts
    first_qn_pos = response.lower().find('question:')
    
    answer_start = response.lower().find('answer:')
    
    # If 'Answer:' is found, keep the text after it and drop repeated sentences
    if answer_start != -1:
        answer = response[answer_start + len('answer:'):].strip()  # Extract everything after "Answer:"
        new_answer = remove_consecutive_duplicates(answer)
    else:
        # Define new_answer here as well so the final f-string never raises NameError
        new_answer = "No answer found."
    
    # answer_end = response.lower().find('a:', qn_start)
    # answer_end = response.lower().find('question:', first_qn_pos + len('question:'))
    
    # # If 'Question:' and 'A:' are found, return the text between them
    # if first_qn_pos != -1 and answer_end != -1:
    #     answer = response[first_qn_pos:answer_end].strip()
    # elif first_qn_pos != -1:
    #     # If only 'Question:' is found, return everything from 'Question:' onward
    #     answer = response[first_qn_pos:].strip()
    # else:
    #     # If no 'Answer:' part is found, return the whole generated text
    #     answer = response.strip()
    final_response = f"Question:{query}\nAnswer:{new_answer}"
    return final_response

In [94]:
# Example usage
query = "What are some risks associated with Tesla lately?"
response = ask_question(query)
print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


KeyboardInterrupt: 

In [90]:
def remove_consecutive_duplicates(text: str):
    # Split the text into sentences using regex
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())  # Split on punctuation followed by space
    
    # Create a list to store non-duplicate sentences
    unique_sentences = []
    
    # Iterate over the sentences and add them to unique_sentences if not a duplicate
    for i in range(len(sentences)):
        current_sentence = sentences[i].strip()
        
        # If it's the first sentence or not a duplicate of the previous one, keep it
        if i == 0 or current_sentence.lower() != sentences[i - 1].strip().lower():
            unique_sentences.append(current_sentence)
    
    # Join the unique sentences back into a single text
    return ' '.join(unique_sentences)

text = "The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. Claire."
remove_consecutive_duplicates(text)

'The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. Claire.'
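
The output above is unchanged because the regex also splits after "U.S.", so the repeated sentence is broken into two alternating fragments and no fragment ever equals the one immediately before it. One possible workaround, sketched below under the assumption that dropping every previously seen fragment is acceptable, is to track all fragments rather than only the preceding one. `remove_repeated_sentences` is a hypothetical alternative, not the function used in `ask_question`.

```python
import re


def remove_repeated_sentences(text: str) -> str:
    """Drop any sentence fragment that already appeared earlier in the text."""
    fragments = re.split(r'(?<=[.!?])\s+', text.strip())
    seen = set()
    kept = []
    for fragment in fragments:
        key = fragment.strip().lower()  # case-insensitive comparison
        if key and key not in seen:
            seen.add(key)
            kept.append(fragment.strip())
    return ' '.join(kept)


sample = (
    "The company has a market share of about 50% in the U.S. and Europe. "
    "The company has a market share of about 50% in the U.S. and Europe. Claire."
)
print(remove_repeated_sentences(sample))
```

The trade-off is that a sentence the model legitimately repeats later in its answer is also removed, which may or may not be desirable for short generated answers.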

In [None]:
# from transformers import pipeline
# import torch
# # from transformers import HuggingFacePipeline

# # Set up the text generation pipeline
# llm_model = "bigscience/bloom-560m"
# llm_pipeline = pipeline("text-generation", model=llm_model, device=0 if torch.cuda.is_available() else -1, 
#                         max_new_tokens=100)
# llm = HuggingFacePipeline(pipeline=llm_pipeline)

# # Function to ask a question given context
# def ask_question_with_context(context: str, question: str):
#     # Format the prompt with context and question
#     prompt = f"You are an AI bot who is given the following:\nContext: {context}\n\nQuestion: {question}\nAnswer:"
#     response = llm(prompt)
    
#     # Find where the 'Question:' part starts
#     qn_start = response.lower().find('question:')
    
#     # Find where the 'A:' part starts (indicating the end of the answer)
#     answer_end = response.lower().find('a:', qn_start)
    
#     # If 'Question:' and 'A:' are found, return the text between them
#     if qn_start != -1 and answer_end != -1:
#         answer = response[qn_start:answer_end].strip()
#     elif qn_start != -1:
#         # If only 'Question:' is found, return everything from 'Question:' onward
#         answer = response[qn_start:].strip()
#     else:
#         # If no 'Answer:' part is found, return the whole generated text
#         answer = response.strip()
    
#     return answer

# # Example context and question
# context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower."
# question = "Where is Eiffel Tower?"

# # Ask the question based on the context
# answer = ask_question_with_context(context, question)

In [None]:
# print(answer)

Question: Where is Eiffel Tower?
Answer: The Eiffel Tower is located in the city of Paris, France. The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower.


In [None]:
# def find_question_q(text: str):
#     # Find the position of 'Q' in 'Question:'
#     q_position = text.lower().find('question:')
    
#     # If 'Question:' is found, return the index of the first 'Q'
#     if q_position != -1:
#         return q_position  # Returns the index of 'Q' in 'Question:'
#     else:
#         return None  # Return None if 'Question:' is not found

# # Example string
# example_text = "You are an AI bot who is given the following:\nContext: Some context here\n\nQuestion: Where is Eiffel Tower?"

# # Find the position of the 'Q' in 'Question:'
# q_index = find_question_q(example_text)

# # Print the result
# print(f"The position of 'Q' in 'Question:' is: {q_index}")

# example_text[q_index:]

The position of 'Q' in 'Question:' is: 74


'Question: Where is Eiffel Tower?'

In [6]:
# !pip install "python-doctr[torch,viz,html,contrib]"  
# !pip install onnx==1.16.1

In [55]:
import base64
import os
import re
import uuid

from IPython.display import Image, Markdown, display
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_vertexai import (
    ChatVertexAI,
    VectorSearchVectorStore,
    VertexAI,
    VertexAIEmbeddings,
)
from langchain_text_splitters import CharacterTextSplitter
from google.cloud import aiplatform
import fitz  # pymupdf
from unstructured.partition.pdf import partition_pdf

In [56]:
PROJECT_ID = "handy-bonbon-453100-e8"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# For Vector Search Staging
GCS_BUCKET = "gen_ai_bucket_129395"  # @param {type:"string"}
GCS_BUCKET_URI = f"gs://{GCS_BUCKET}"

In [3]:
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

In [58]:
MODEL_NAME = "gemini-1.5-flash"
GEMINI_OUTPUT_TOKEN_LIMIT = 8192

EMBEDDING_MODEL_NAME = "text-embedding-004"
EMBEDDING_TOKEN_LIMIT = 2048

TOKEN_LIMIT = min(GEMINI_OUTPUT_TOKEN_LIMIT, EMBEDDING_TOKEN_LIMIT)

# model = VertexAI(
#     temperature=0, 
#     model_name=MODEL_NAME, 
#     max_output_tokens=TOKEN_LIMIT
# )

In [53]:
# from pdf2image import convert_from_path
# images = convert_from_path("tesla-stock-report.pdf", poppler_path=r"C:\Users\wjlwi\Downloads\poppler-24.08.0\Library\bin")
# for i, img in enumerate(images):
#     img.save(f'{i}.jpg')

In [54]:
# pdf_file_name = "google-10k-sample-14pages.pdf"
# pdf_folder_path = "data/"
# # Extract images, tables, and chunk text from a PDF file.
# raw_pdf_elements = partition_pdf(
#     filename=pdf_file_name,
#     extract_images_in_pdf=False,
#     infer_table_structure=True,
#     chunking_strategy="by_title",
#     max_characters=4000,
#     new_after_n_chars=3800,
#     combine_text_under_n_chars=2000,
#     image_output_dir_path=pdf_folder_path,
# )

# # Categorize extracted elements from a PDF into tables and texts.
# tables = []
# texts = []
# for element in raw_pdf_elements:
#     if "unstructured.documents.elements.Table" in str(type(element)):
#         tables.append(str(element))
#     elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
#         texts.append(str(element))

# # Optional: Enforce a specific token size for texts
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
#     chunk_size=10000, chunk_overlap=0
# )
# joined_texts = " ".join(texts)
# texts_4k_token = text_splitter.split_text(joined_texts)

In [51]:
doc = fitz.open("tesla-stock-report.pdf")
text = "\n".join([page.get_text() for page in doc])
# print(text)
# Extract text from all pages
texts = [page.get_text("text") for page in doc]

# Combine extracted text
full_text = "\n\n".join(texts)

# Print or use the extracted text
print(full_text)

# Initialize the text splitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=2000, chunk_overlap=200
)

# Split text into chunks
texts_4k_token = text_splitter.split_text(full_text)

# Print first few chunks to verify
for i, chunk in enumerate(texts_4k_token[:3]):  # Display first 3 chunks
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*50}")

Last Close
230.24 (GBX)
2025 February 28
LONDON Exchange
Avg Daily Vol
472,517
52-Week High
387.96
Trailing PE
1.4
Annual Div
--
ROE
10.5%
LTG Forecast
18.1%
1-Mo Return
-30.2%
Market Cap (Consol)
748.4B
52-Week Low
110.27
Forward PE
1.0
Dividend Yield
--
Annual Rev
78.0B
Inst Own
49.3%
3-Mo Return
-14.5%
AVERAGE SCORE
NEUTRAL OUTLOOK: 0R0X's current score is
relatively in-line with the market.
Score Averages
Automobiles & Auto Parts Group:
5.1
Large Market Cap: 7.1
Automobiles & Auto Parts Sector:
5.1
FTSE 100 Index: 6.7
Positive
Neutral
Negative
Average Score Trend (4-Week Moving Avg)
2022-03
2023-03
2024-03
2025-03
Peers
-6M
-3M
-1M
-1W
Current
1Y Trend
0R0E
10
10
9
9
9
TYT
8
9
8
9
8
0P4F
7
7
6
6
6
0R0X
7
8
6
6
4
AML
4
3
4
3
2
HIGHLIGHTS
I/B/E/S MEAN
-
The score for Tesla Inc dropped to its 3-year low of 4 this week.
-
The recent change in the Average Score was primarily due to a
decline in the Price Momentum component score.
Hold
Mean recommendation from all analysts covering
the c

In [None]:
# import fitz
# import os

# # Open the PDF
# doc = fitz.open("tesla-stock-report.pdf")

# # Define the image output directory
# image_output_dir_path = "extracted_images_1"
# os.makedirs(image_output_dir_path, exist_ok=True)

# # Extract text from all pages (same as before)
# texts = [page.get_text("text") for page in doc]
# full_text = "\n\n".join(texts)
# print(full_text)

# # Initialize the text splitter
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=10000, chunk_overlap=0)

# # Split text into chunks
# texts_4k_token = text_splitter.split_text(full_text)

# # Extract images from each page
# for page_num in range(len(doc)):
#     page = doc.load_page(page_num)
#     image_list = page.get_images(full=True)
    
#     for img_index, img in enumerate(image_list):
#         xref = img[0]  # The image reference
#         base_image = doc.extract_image(xref)
#         image_bytes = base_image["image"]

#         # Save image to file
#         image_filename = f"{image_output_dir_path}/page_{page_num+1}_img_{img_index+1}.png"
#         with open(image_filename, "wb") as img_file:
#             img_file.write(image_bytes)
        
#         print(f"Image saved as {image_filename}")

# # Optionally, print first few text chunks to verify
# for i, chunk in enumerate(texts_4k_token[:3]):  # Display first 3 chunks
#     print(f"Chunk {i+1}:\n{chunk}\n{'-'*50}")

In [52]:
def generate_text_summaries(
    texts: list[str], summarize_texts: bool = False
) -> list[str]:
    """
    Summarize text elements
    texts: List of str
    summarize_texts: Bool to summarize texts
    Returns the summaries, or the original texts when summarize_texts is False
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Summarise the issues stemming for Tesla. Table or text: {element} """
    prompt = PromptTemplate.from_template(prompt_text)
    empty_response = RunnableLambda(
        lambda x: AIMessage(content="Error processing document")
    )
    # Text summary chain
    model = VertexAI(
        temperature=0, model_name=MODEL_NAME, max_output_tokens=TOKEN_LIMIT
    ).with_fallbacks([empty_response])
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts:
        if summarize_texts:
            text_summaries = summarize_chain.batch(texts, {"max_concurrency": 1})
        else:
            text_summaries = texts

    return text_summaries


# Get text summaries
text_summaries = generate_text_summaries(
    texts_4k_token, summarize_texts=True
)

Retrying langchain_google_vertexai.chat_models._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ResourceExhausted: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-flash. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai..

KeyboardInterrupt: 
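
The 429 `ResourceExhausted` messages above indicate that the per-minute request quota was exceeded. A common way to cope with this is to retry with exponentially growing waits. The sketch below is a generic, library-independent illustration: `call_with_backoff` and `flaky` are hypothetical names, and the delays are assumptions rather than values recommended by Vertex AI.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so they do not all land at once
            sleep(delay + random.uniform(0, delay / 2))


# Simulated endpoint that fails twice with a quota error, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Quota exceeded")
    return "ok"


print(call_with_backoff(flaky, base_delay=0.01))
```

In this notebook, the wrapped callable could be a single summarisation request per chunk, which would also make it possible to resume from the last successful chunk instead of interrupting the whole batch.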

In [None]:
print(text_summaries[0])

## Tesla (0R0X-LN) Issues Summary:

**Overall:** Tesla's Average Score is currently **Neutral**, indicating performance in line with the market. However, the score has dropped to a 3-year low of 4 this week, primarily due to a decline in Price Momentum.

**Earnings:** Tesla's Earnings Rating is **Negative**, significantly lower than the industry average. The company has a history of missing consensus estimates, with recent analyst downgrades. While the consensus price target has increased notably, the company has reported more negative than positive earnings surprises in the past.

**Fundamentals:** Tesla's Fundamental Rating is **Neutral**, with fundamentals relatively in line with the market. The company's net margin and interest coverage have been consistently higher than the industry average, but its accruals ratio is the highest within its group. Notably, Tesla does not currently pay a dividend.

**Relative Valuation:** Tesla's Relative Valuation Rating is **Neutral**, with multip

In [20]:
# !pip install chromadb


In [None]:
from langchain_community.vectorstores import Chroma
# The vectorstore to use to index the summaries
# vectorstore = VectorSearchVectorStore.from_components(
#     project_id=PROJECT_ID,
#     region=LOCATION,
#     gcs_bucket_name=GCS_BUCKET,
#     index_id=index.name,
#     endpoint_id=index_endpoint.name,
#     embedding=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
#     stream_update=True,
# )

vectorstore = Chroma(
    collection_name="mm_rag_test",
    embedding_function=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
)

# Creation of Multi-Vector storage
docstore = InMemoryStore()

id_key = "doc_id"
# Create the multi-vector retriever
retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

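# Raw document contents: the same chunks that were summarised, so that
# each summary lines up with the chunk it was generated from
doc_contents = texts_4k_token

doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
# Only text summaries exist in this notebook (no table summaries were generated)
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries)
]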

retriever_multi_vector_img.docstore.mset(list(zip(doc_ids, doc_contents)))

# If using Vertex AI Vector Search, this will take a while to complete.
# You can cancel this cell and continue later.
retriever_multi_vector_img.vectorstore.add_documents(summary_docs)

# Creating chain with Retriever and Gemini LLM
# Create RAG chain
chain_multimodal_rag = (
    {
        "context": retriever_multi_vector_img | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(img_prompt_func)
    | ChatVertexAI(
        temperature=0,
        model_name=MODEL_NAME,
        max_output_tokens=TOKEN_LIMIT,
    )  # Multi-modal LLM
    | StrOutputParser()
)
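
The chain above refers to `split_image_text_types` and `img_prompt_func`, which are not defined anywhere in this notebook and would need to exist before the chain is built. The sketch below gives text-only stand-ins, assuming the retriever returns raw strings (or objects with a `page_content` attribute) and that a plain string prompt is sufficient. Both bodies, the prompt wording, and the `looks_like_base64_image` helper are assumptions made for illustration, and any retrieved images are ignored when the prompt is built.

```python
import base64
import binascii


def looks_like_base64_image(value: str) -> bool:
    """Heuristic: True if value decodes as base64 with a PNG or JPEG signature."""
    try:
        header = base64.b64decode(value, validate=True)[:8]
    except (binascii.Error, ValueError):
        return False
    return header.startswith(b"\x89PNG\r\n\x1a\n") or header.startswith(b"\xff\xd8\xff")


def split_image_text_types(docs) -> dict:
    """Separate retrieved items into base64 images and plain texts."""
    images, texts = [], []
    for doc in docs:
        content = getattr(doc, "page_content", doc)  # accept Documents or str
        if looks_like_base64_image(content):
            images.append(content)
        else:
            texts.append(content)
    return {"images": images, "texts": texts}


def img_prompt_func(data: dict) -> str:
    """Build a text-only prompt from the retrieved context and the question."""
    context = "\n\n".join(data["context"]["texts"])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {data['question']}\nAnswer:"
    )


png = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"pixels").decode()
parts = split_image_text_types(["Tesla's score dropped to 4.", png])
print(len(parts["images"]), len(parts["texts"]))
print(img_prompt_func({"context": parts, "question": "What happened to the score?"}))
```

A multimodal version would instead pass the image entries to the model alongside the text, which this stand-in deliberately leaves out.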

In [None]:

query = """
 - What are the critical difference between various graphs for Class A Share?
 - Which index best matches Class A share performance closely where Google is not already a part? Explain the reasoning.
 - Identify key chart patterns for Google Class A shares.
 - What is cost of revenues, operating expenses and net income for 2020. Do mention the percentage change
 - What was the effect of Covid in the 2020 financial year?
 - What are the total revenues for APAC and USA for 2021?
 - What is deferred income taxes?
 - How do you compute net income per share?
 - What drove percentage change in the consolidated revenue and cost of revenue for the year 2021 and was there any effect of Covid?
 - What is the cause of 41% increase in revenue from 2020 to 2021 and how much is dollar change?
"""

# List of source documents
docs = retriever_multi_vector_img.get_relevant_documents(query, limit=10)

source_docs = split_image_text_types(docs)

print(source_docs["texts"])

for i in source_docs["images"]:
    display(Image(base64.b64decode(i)))
    
result = chain_multimodal_rag.invoke(query)

## Part 3: Evaluation of RAG Model

Th

### Evaluation of Strengths and Weaknesses

#### Strengths

#### Weaknesses

### Potential Future Work

### Conclusion