# DSA4265 Assignment 2: RAG Generation

With the large availability of news available today from different agencies, it is increasingly difficult for investors to spend time to look through all news articles in order to obtain the answer that they are looking for. 

Therefore, the goal of this assignment is to create a search engine that summarises key information about the recent stock data in order to have a better context of the stock such that investors can make a more informed decision about the performance of the stock.

## Part 1: Data Extraction

The following section describes the data extraction process and generation of the labelled dataframe. The tickers used for analysis are that of Apple Inc. (AAPL) and Tesla stocks (TSLA). The data obtained was sourced from Refinitiv Workspace, and the code to extract the dataframes were all copied and pasted from its in-built CodeBook. News headlines were limited to top deals for digital finance, corporate finance, and overall news about the stock itself.

In [2]:
import refinitiv.data as rd
from refinitiv.data.content import news
from IPython.display import HTML
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
import time
import warnings
import refinitiv.data.eikon as ek
from IPython.display import HTML
warnings.filterwarnings("ignore")



In [2]:
rd.open_session()

<refinitiv.data.session.Definition object at 0x2969ae560e0 {name='workspace'}>

In [None]:
def fetch_full_story(story_id):
    try:
        story = rd.news.get_story(story_id, format=rd.news.Format.TEXT)
        return story if story else "Story not available"
    except Exception as e:
        print(f"Error fetching {story_id}: {e}")
        return "Error retrieving story"

dNow = datetime.now().date()
maxenddate = dNow - timedelta(days=90) #upto months=15
compNews = pd.DataFrame()
riclist = ['TSLA.O','AAPL.O'] # can also use Peers, Customers, Suppliers, Monitor, Portfolio to build universe

for ric in riclist:
    try:
        cHeadlines = rd.news.get_headlines("R:" + ric + " AND Language:LEN AND Source:RTRS AND (Topic:TOP/DEALS OR Topic:TOP/DIGFIN)", 
                                           start= str(dNow), 
                                           end = str(maxenddate), count = 100)
        cHeadlines['ric'] = ric
        # Corporate Finance: TOP/DEALS, Broker Research / Recommendation: RCH
        if len(compNews):
            compNews = pd.concat([compNews,cHeadlines])
        else:
            compNews = cHeadlines
    except Exception:
        pass

# Apply to all rows
compNews["full_story"] = compNews["storyId"].apply(fetch_full_story)

compNews2 = pd.DataFrame()
riclist = ['TSLA.O','AAPL.O'] # can also use Peers, Customers, Suppliers, Monitor, Portfolio to build universe

for ric in riclist:
    try:
        cHeadlines = rd.news.get_headlines("R:" + ric + " AND Language:LEN AND Source:RTRS AND (Topic:TOPALL)", 
                                           start= str(dNow), 
                                           end = str(maxenddate), count = 100)
        cHeadlines['ric'] = ric
        # Corporate Finance: TOP/DEALS, Broker Research / Recommendation: RCH
        if len(compNews):
            compNews2 = pd.concat([compNews2,cHeadlines])
        else:
            compNews2 = cHeadlines
    except Exception:
        pass

compNews2["full_story"] = compNews2["storyId"].apply(fetch_full_story)
combined_df = pd.concat([compNews, compNews2], axis = 0)
combined_df.to_csv('combined_news_updated.csv')
combined_df

Unnamed: 0_level_0,headline,storyId,sourceCode,ric,full_story
versionCreated,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2025-02-27 13:00:00,RPT-BREAKINGVIEWS-GM illuminates good times be...,urn:newsml:reuters.com:20250227:nL3N3PH1P3:5,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
2025-02-25 12:52:47,Tesla to acquire parts of insolvent German par...,urn:newsml:reuters.com:20250225:nL5N3PG0Y9:7,NS:RTRS,TSLA.O,"* \n Acquisition includes 300 staff, excl..."
2025-02-25 01:10:32,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250225:nL3N3PG036:3,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
2025-02-24 12:00:00,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250224:nL3N3PF0DU:4,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
2025-02-21 13:55:05,UPDATE 6-Japan seeks Tesla investment in Nissa...,urn:newsml:reuters.com:20250221:nL3N3PC0GR:2,NS:RTRS,TSLA.O,* \n Japanese group draws up plans for Te...
...,...,...,...,...,...
2025-02-14 15:03:57,US STOCKS-Wall St subdued as markets await tar...,urn:newsml:reuters.com:20250214:nL4N3P515Y:5,NS:RTRS,AAPL.O,"(For a Reuters live blog on U.S., UK and Europ..."
2025-02-14 14:04:36,US STOCKS-Wall St set for subdued open as mark...,urn:newsml:reuters.com:20250214:nL4N3P512O:5,NS:RTRS,AAPL.O,"(For a Reuters live blog on U.S., UK and Europ..."
2025-02-14 12:31:59,US STOCKS-Futures slip as markets await tariff...,urn:newsml:reuters.com:20250214:nL4N3P50XC:5,NS:RTRS,AAPL.O,"(For a Reuters live blog on U.S., UK and Europ..."
2025-02-14 11:50:37,REFILE-FACTBOX-China's AI firms take spotlight...,urn:newsml:reuters.com:20250214:nL1N3P50BI:1,NS:RTRS,AAPL.O,(Refiles to correct transposed letters in Doub...


## Part 2: Building of RAG Model

The RAG model was built with the help of 3 factors, all of which will be dealt with in greater depth below:

### Feature 1: Chunking of Documents

To facilitate the separation of documents into distinct chunks, RecursiveTextSplitter function was utilised, with an overlap of 100 characters so as to ensure the preservation of context between chunks. Therefore, this enables better understanding of each chunk.

### Feature 2: Sentence Embeddings

To assign texts to numerical vectors for the machines to process the text, a sentence transformer model was selected to help in semantic similarity search.

### Feature 3: Vector Store

The use of Facebook Artificial Intelligence Similarity Search (FAISS) was used to store the embeddings for quick retrieval.

### Feature 4: Language Model

In [5]:
import pandas as pd
import faiss
import torch
from transformers import AutoTokenizer, AutoModel
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DataFrameLoader
from langchain.schema import Document
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import re

  from .autonotebook import tqdm as notebook_tqdm


In [4]:
combined_news_df = pd.read_csv("combined_news_updated.csv")
tsla_news = combined_news_df[combined_news_df['ric'] == 'TSLA.O']
aapl_news = combined_news_df[combined_news_df['ric'] == 'AAPL.O']

NameError: name 'pd' is not defined

In [97]:
tsla_news

Unnamed: 0,versionCreated,headline,storyId,sourceCode,ric,full_story
0,2025-02-27 13:00:00,RPT-BREAKINGVIEWS-GM illuminates good times be...,urn:newsml:reuters.com:20250227:nL3N3PH1P3:5,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
1,2025-02-25 12:52:47,Tesla to acquire parts of insolvent German par...,urn:newsml:reuters.com:20250225:nL5N3PG0Y9:7,NS:RTRS,TSLA.O,"* \n Acquisition includes 300 staff, excl..."
2,2025-02-25 01:10:32,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250225:nL3N3PG036:3,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
3,2025-02-24 12:00:00,RPT-BREAKINGVIEWS-Nissan offers suitors daunti...,urn:newsml:reuters.com:20250224:nL3N3PF0DU:4,NS:RTRS,TSLA.O,(The author is a Reuters Breakingviews columni...
4,2025-02-21 13:55:05,UPDATE 6-Japan seeks Tesla investment in Nissa...,urn:newsml:reuters.com:20250221:nL3N3PC0GR:2,NS:RTRS,TSLA.O,* \n Japanese group draws up plans for Te...
...,...,...,...,...,...,...
157,2025-02-18 10:56:15,Netherlands to build 1.4 GW battery storage fa...,urn:newsml:reuters.com:20250218:nL6N3P90C3:2,NS:RTRS,TSLA.O,"AMSTERDAM, Feb 18 - Dutch energy storage firm ..."
158,2025-02-18 06:29:17,Tesla steps up India hiring after Musk-Modi me...,urn:newsml:reuters.com:20250218:nL3N3P90AQ:1,NS:RTRS,TSLA.O,Feb 18 (Reuters) - Elon Musk's Tesla <TSLA.O> ...
159,2025-02-18 04:44:51,Tesla begins mass production of revamped Model...,urn:newsml:reuters.com:20250218:nP8N3O50DP:2,NS:RTRS,TSLA.O,"BEIJING, Feb 18 (Reuters) - U.S. automaker Tes..."
160,2025-02-18 03:27:45,Complaints targeting BYD flood Chinese consume...,urn:newsml:reuters.com:20250218:nL3N3P80MF:5,NS:RTRS,TSLA.O,"BEIJING, Feb 18 (Reuters) - Complaints about B..."


In [99]:
# Convert dataframe to documents
documents = [
    Document(page_content=row['full_story'],
             date=row['versionCreated']) 
    for _, row in tsla_news.iterrows()
]

# Chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = text_splitter.split_documents(documents)

# Initialize embedding model for documents and queries
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

# Create FAISS vector database
vector_db = FAISS.from_documents(split_docs, embeddings)

# Use a generative model for text generation
llm_model = "EleutherAI/gpt-neo-2.7B"  # You can try other models as needed
# llm_model = "bigscience/bloom-560m"
llm_pipeline = pipeline("text-generation", model=llm_model, device=0 if torch.cuda.is_available() else -1, 
                        max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Create the RetrievalQA chain, passing in the LLM and vector database retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=vector_db.as_retriever()
)

In [93]:
def ask_question(query):
    """Function to ask a question using RAG model with extracted context chunks."""
    # Retrieve relevant documents
    retrieved_docs = vector_db.similarity_search(query, k=3)
    
    # Extract the context chunks from the retrieved documents
    prompt_context = [doc.page_content for doc in retrieved_docs]
    # return prompt_context
    # Prepare the prompt
    context_str = "\n\n".join(prompt_context)
    prompt = f"""
        You are an AI system. Below are relevant news articles with potential relevance:
        {context_str}

        Based on these excerpts, if the information is insufficient, say "I do not have enough information." Otherwise, answer the following:
        
        Question: {query}

        Answer:
    """.strip()
    def remove_consecutive_duplicates(text: str):
        # Split the text into sentences using regex
        sentences = re.split(r'(?<=[.!?])\s+', text)
        
        # Create a list to store non-duplicate sentences
        unique_sentences = []
        
        # Iterate over the sentences and add them to unique_sentences if not a duplicate
        for i in range(len(sentences)):
            current_sentence = sentences[i].strip()
            
            # If it's the first sentence or not a duplicate of the previous one, keep it
            if i == 0 or current_sentence != sentences[i - 1].strip():
                unique_sentences.append(current_sentence)
        
        # Join the unique sentences back into a single text
        return ' '.join(unique_sentences)
    
    # Generate the model's response
    response = llm_pipeline(prompt)[0]['generated_text']
    # Find where the 'Question:' part starts
    first_qn_pos = response.lower().find('question:')
    
    answer_start = response.lower().find('answer:')
    
    # If 'Answer:' is found, slice the text from there
    if answer_start != -1:
        answer = response[answer_start + len('answer:'):].strip()  # Extract everything after "Answer:"
        new_answer = remove_consecutive_duplicates(answer)
    else:
        answer = "No answer found."
    
    # answer_end = response.lower().find('a:', qn_start)
    # answer_end = response.lower().find('question:', first_qn_pos + len('question:'))
    
    # # If 'Question:' and 'A:' are found, return the text between them
    # if first_qn_pos != -1 and answer_end != -1:
    #     answer = response[first_qn_pos:answer_end].strip()
    # elif first_qn_pos != -1:
    #     # If only 'Question:' is found, return everything from 'Question:' onward
    #     answer = response[first_qn_pos:].strip()
    # else:
    #     # If no 'Answer:' part is found, return the whole generated text
    #     answer = response.strip()
    final_response = f"Question:{query}\nAnswer:{new_answer}"
    return final_response

In [94]:
# Example usage
query = "What are some risks associated with Tesla lately?"
response = ask_question(query)
print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


KeyboardInterrupt: 

In [90]:
def remove_consecutive_duplicates(text: str):
    # Split the text into sentences using regex
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())  # Split on punctuation followed by space
    
    # Create a list to store non-duplicate sentences
    unique_sentences = []
    
    # Iterate over the sentences and add them to unique_sentences if not a duplicate
    for i in range(len(sentences)):
        current_sentence = sentences[i].strip()
        
        # If it's the first sentence or not a duplicate of the previous one, keep it
        if i == 0 or current_sentence.lower() != sentences[i - 1].strip().lower():
            unique_sentences.append(current_sentence)
    
    # Join the unique sentences back into a single text
    return ' '.join(unique_sentences)

text = "The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. Claire."
remove_consecutive_duplicates(text)

'The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. Claire.'

In [None]:
# from transformers import pipeline
# import torch
# # from transformers import HuggingFacePipeline

# # Set up the text generation pipeline
# llm_model = "bigscience/bloom-560m"
# llm_pipeline = pipeline("text-generation", model=llm_model, device=0 if torch.cuda.is_available() else -1, 
#                         max_new_tokens=100)
# llm = HuggingFacePipeline(pipeline=llm_pipeline)

# # Function to ask a question given context
# def ask_question_with_context(context: str, question: str):
#     # Format the prompt with context and question
#     prompt = f"You are an AI bot who is given the following:\nContext: {context}\n\nQuestion: {question}\nAnswer:"
#     response = llm(prompt)
    
#     # Find where the 'Question:' part starts
#     qn_start = response.lower().find('question:')
    
#     # Find where the 'A:' part starts (indicating the end of the answer)
#     answer_end = response.lower().find('a:', qn_start)
    
#     # If 'Question:' and 'A:' are found, return the text between them
#     if qn_start != -1 and answer_end != -1:
#         answer = response[qn_start:answer_end].strip()
#     elif qn_start != -1:
#         # If only 'Question:' is found, return everything from 'Question:' onward
#         answer = response[qn_start:].strip()
#     else:
#         # If no 'Answer:' part is found, return the whole generated text
#         answer = response.strip()
    
#     return answer

# # Example context and question
# context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower."
# question = "Where is Eiffel Tower?"

# # Ask the question based on the context
# answer = ask_question_with_context(context, question)

In [None]:
# print(answer)

Question: Where is Eiffel Tower?
Answer: The Eiffel Tower is located in the city of Paris, France. The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower.


In [None]:
# def find_question_q(text: str):
#     # Find the position of 'Q' in 'Question:'
#     q_position = text.lower().find('question:')
    
#     # If 'Question:' is found, return the index of the first 'Q'
#     if q_position != -1:
#         return q_position  # Returns the index of 'Q' in 'Question:'
#     else:
#         return None  # Return None if 'Question:' is not found

# # Example string
# example_text = "You are an AI bot who is given the following:\nContext: Some context here\n\nQuestion: Where is Eiffel Tower?"

# # Find the position of the 'Q' in 'Question:'
# q_index = find_question_q(example_text)

# # Print the result
# print(f"The position of 'Q' in 'Question:' is: {q_index}")

# example_text[q_index:]

The position of 'Q' in 'Question:' is: 74


'Question: Where is Eiffel Tower?'

In [6]:
# !pip install "python-doctr[torch,viz,html,contrib]"  
# !pip install onnx==1.16.1

In [1]:
import base64
import os
import re
import uuid

from IPython.display import Image, Markdown, display
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_vertexai import (
    ChatVertexAI,
    VectorSearchVectorStore,
    VertexAI,
    VertexAIEmbeddings,
)
from langchain_text_splitters import CharacterTextSplitter
from google.cloud import aiplatform
import fitz  # pymupdf
from langchain_community.vectorstores import Chroma

In [2]:
PROJECT_ID = "handy-bonbon-453100-e8"  # @param {type:"string"}
LOCATION = "us-east4"  # @param {type:"string"}

# For Vector Search Staging
GCS_BUCKET = "gen_ai_bucket_129395"  # @param {type:"string"}
GCS_BUCKET_URI = f"gs://{GCS_BUCKET}"
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

In [3]:
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

In [31]:
MODEL_NAME = "gemini-1.5-flash"
GEMINI_OUTPUT_TOKEN_LIMIT = 8192

EMBEDDING_MODEL_NAME = "text-embedding-004"
EMBEDDING_TOKEN_LIMIT = 2048

TOKEN_LIMIT = min(GEMINI_OUTPUT_TOKEN_LIMIT, EMBEDDING_TOKEN_LIMIT)

# model = VertexAI(
#     temperature=0, 
#     model_name=MODEL_NAME, 
#     max_output_tokens=TOKEN_LIMIT
# )

In [5]:
stocks_used = ["aapl", "amzn", "ba", "brka", "googl", "gs", "jnj", "jpm", "ko", "mcd", 
               "meta", "ms", "msft", "nee", "nvda", "pfe", "pg", "tsla", "v", "xom"]

In [53]:
# from pdf2image import convert_from_path
# images = convert_from_path("tesla-stock-report.pdf", poppler_path=r"C:\Users\wjlwi\Downloads\poppler-24.08.0\Library\bin")
# for i, img in enumerate(images):
#     img.save(f'{i}.jpg')

In [54]:
# pdf_file_name = "google-10k-sample-14pages.pdf"
# pdf_folder_path = "data/"
# # Extract images, tables, and chunk text from a PDF file.
# raw_pdf_elements = partition_pdf(
#     filename=pdf_file_name,
#     extract_images_in_pdf=False,
#     infer_table_structure=True,
#     chunking_strategy="by_title",
#     max_characters=4000,
#     new_after_n_chars=3800,
#     combine_text_under_n_chars=2000,
#     image_output_dir_path=pdf_folder_path,
# )

# # Categorize extracted elements from a PDF into tables and texts.
# tables = []
# texts = []
# for element in raw_pdf_elements:
#     if "unstructured.documents.elements.Table" in str(type(element)):
#         tables.append(str(element))
#     elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
#         texts.append(str(element))

# # Optional: Enforce a specific token size for texts
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
#     chunk_size=10000, chunk_overlap=0
# )
# joined_texts = " ".join(texts)
# texts_4k_token = text_splitter.split_text(joined_texts)

In [6]:
import time
def generate_text_summaries(
    texts: list[str], summarize_texts: bool = False
) -> tuple[list, list]:
    """
    Summarize text elements
    texts: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Summarise the issues stemming for the report provided. The report is as shown: {element} """
    prompt = PromptTemplate.from_template(prompt_text)
    empty_response = RunnableLambda(
        lambda x: AIMessage(content="Error processing document")
    )
    # Text summary chain
    model = VertexAI(
        temperature=0, model_name=MODEL_NAME, max_output_tokens=TOKEN_LIMIT
    ).with_fallbacks([empty_response])
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []

    # Apply to text if texts are provided and summarization is requested
    # if texts:
    #     if summarize_texts:
    #         text_summaries = summarize_chain.batch(texts, {"max_concurrency": 1})
    #     else:
    #         text_summaries = texts
    if texts:
        for i in range(len(texts)):
            text = texts[i]
            if summarize_texts:
                # Summarize the current text chunk
                summary = summarize_chain.invoke({"element": text})
                text_summaries.append(summary)
            else:
                text_summaries.append(text)
            print(f"Chunk {i} summarised, {len(texts)-i} remaining for this stock")
            # Wait for 1 minute after every 3 chunks
            if (i + 1) % 4 == 0 and i != len(texts) - 1:
                print("Waiting for 1 minute before processing the next 4 chunks...")
                time.sleep(60)  # Delay for 1 minute after every 3 chunks
    print("Summarised!")
    return text_summaries

In [None]:
# PROJECT_ID = "handy-bonbon-453100-e8"  # @param {type:"string"}
# LOCATION = "us-central2"  # @param {type:"string"}

# 'asia-northeast1', 'us-west3', 'northamerica-northeast1', 'europe-west9', 'asia-northeast3', 'europe-west8', 'europe-west12', 
# 'africa-south1', 'us-east5', 'asia-south1', 'asia-southeast1', 'asia-east1', 'europe-west4', 'europe-west3', 'northamerica-northeast2', 
# 'us-west4', 'me-central2', 'us-central1', 'australia-southeast1', 'europe-central2', 'europe-north1', 'me-central1', 'europe-west1', 'us-west1', 
# 'us-west2', 'asia-east2', 'us-east1', 'me-west1', 'asia-northeast2', 'southamerica-west1', 'australia-southeast2', 'europe-west2', 'us-south1', 'global', 
# 'asia-southeast2', 'southamerica-east1', 'us-east4', 'europe-west6', 'europe-southwest1'

# # For Vector Search Staging
# GCS_BUCKET = "gen_ai_bucket_129395"  # @param {type:"string"}
# GCS_BUCKET_URI = f"gs://{GCS_BUCKET}"
# aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

In [7]:
# stocks_used_1 = ["aapl", "amzn", "ba", "brka"]
# stocks_used_1 = ["aapl"]
stocks_used = ["aapl", "amzn", "ba", "brka", "googl", "gs", "jnj", "jpm", "ko", "mcd", 
               "meta", "ms", "msft", "nee", "nvda", "pfe", "pg", "tsla", "v", "xom"]

stocks_used_dict = dict()

for stock in stocks_used:
    doc = fitz.open(f"{stock}_report.pdf")
    text = "\n".join([page.get_text() for page in doc])
    
    # Extract text from all pages
    texts = [page.get_text("text") for page in doc]

    # Combine extracted text
    full_text = "\n\n".join(texts)

    # Print or use the extracted text
    # print(full_text)

    # Initialize the text splitter, and chunk the reports into more concise summaries
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000, chunk_overlap=200
    )

    # Split text into chunks
    texts_4k_token = text_splitter.split_text(full_text)

    # Get text, table summaries
    text_summaries = generate_text_summaries(
        texts_4k_token, summarize_texts=True
    )
    stocks_used_dict[stock] = text_summaries
    print(f"{stock} Done")

Created a chunk of size 1044, which is longer than the specified 1000
Created a chunk of size 1192, which is longer than the specified 1000
Created a chunk of size 1049, which is longer than the specified 1000
Created a chunk of size 1941, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
aapl Done


Created a chunk of size 1176, which is longer than the specified 1000
Created a chunk of size 1188, which is longer than the specified 1000
Created a chunk of size 1026, which is longer than the specified 1000
Created a chunk of size 1892, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
amzn Done


Created a chunk of size 1182, which is longer than the specified 1000
Created a chunk of size 1240, which is longer than the specified 1000
Created a chunk of size 1030, which is longer than the specified 1000
Created a chunk of size 1888, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
ba Done


Created a chunk of size 1214, which is longer than the specified 1000
Created a chunk of size 1060, which is longer than the specified 1000
Created a chunk of size 1905, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
brka Done


Created a chunk of size 1210, which is longer than the specified 1000
Created a chunk of size 1201, which is longer than the specified 1000
Created a chunk of size 1032, which is longer than the specified 1000
Created a chunk of size 1016, which is longer than the specified 1000
Created a chunk of size 1892, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
googl Done


Created a chunk of size 1069, which is longer than the specified 1000
Created a chunk of size 1201, which is longer than the specified 1000
Created a chunk of size 1075, which is longer than the specified 1000
Created a chunk of size 1952, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
gs Done


Created a chunk of size 1085, which is longer than the specified 1000
Created a chunk of size 1187, which is longer than the specified 1000
Created a chunk of size 1078, which is longer than the specified 1000
Created a chunk of size 1939, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
jnj Done


Created a chunk of size 1079, which is longer than the specified 1000
Created a chunk of size 1183, which is longer than the specified 1000
Created a chunk of size 1049, which is longer than the specified 1000
Created a chunk of size 1939, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
jpm Done


Created a chunk of size 1072, which is longer than the specified 1000
Created a chunk of size 1203, which is longer than the specified 1000
Created a chunk of size 1059, which is longer than the specified 1000
Created a chunk of size 1941, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
ko Done


Created a chunk of size 1088, which is longer than the specified 1000
Created a chunk of size 1196, which is longer than the specified 1000
Created a chunk of size 1047, which is longer than the specified 1000
Created a chunk of size 1944, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
mcd Done


Created a chunk of size 1060, which is longer than the specified 1000
Created a chunk of size 1188, which is longer than the specified 1000
Created a chunk of size 1051, which is longer than the specified 1000
Created a chunk of size 1937, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
meta Done


Created a chunk of size 1229, which is longer than the specified 1000
Created a chunk of size 1198, which is longer than the specified 1000
Created a chunk of size 1037, which is longer than the specified 1000
Created a chunk of size 1904, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
ms Done


Created a chunk of size 1033, which is longer than the specified 1000
Created a chunk of size 1181, which is longer than the specified 1000
Created a chunk of size 1047, which is longer than the specified 1000
Created a chunk of size 1936, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
msft Done


Created a chunk of size 1213, which is longer than the specified 1000
Created a chunk of size 1197, which is longer than the specified 1000
Created a chunk of size 1014, which is longer than the specified 1000
Created a chunk of size 1893, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
nee Done


Created a chunk of size 1216, which is longer than the specified 1000
Created a chunk of size 1205, which is longer than the specified 1000
Created a chunk of size 1022, which is longer than the specified 1000
Created a chunk of size 1891, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
nvda Done


Created a chunk of size 1237, which is longer than the specified 1000
Created a chunk of size 1194, which is longer than the specified 1000
Created a chunk of size 1033, which is longer than the specified 1000
Created a chunk of size 1891, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
pfe Done


Created a chunk of size 1193, which is longer than the specified 1000
Created a chunk of size 1045, which is longer than the specified 1000
Created a chunk of size 1937, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
pg Done


Created a chunk of size 1062, which is longer than the specified 1000
Created a chunk of size 1195, which is longer than the specified 1000
Created a chunk of size 1079, which is longer than the specified 1000
Created a chunk of size 1940, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
tsla Done


Created a chunk of size 1199, which is longer than the specified 1000
Created a chunk of size 1181, which is longer than the specified 1000
Created a chunk of size 1013, which is longer than the specified 1000
Created a chunk of size 1888, which is longer than the specified 1000


Chunk 0 summarised, 12 remaining for this stock
Chunk 1 summarised, 11 remaining for this stock
Chunk 2 summarised, 10 remaining for this stock
Chunk 3 summarised, 9 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 8 remaining for this stock
Chunk 5 summarised, 7 remaining for this stock
Chunk 6 summarised, 6 remaining for this stock
Chunk 7 summarised, 5 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 4 remaining for this stock
Chunk 9 summarised, 3 remaining for this stock
Chunk 10 summarised, 2 remaining for this stock
Chunk 11 summarised, 1 remaining for this stock
Summarised!
v Done


Created a chunk of size 1060, which is longer than the specified 1000
Created a chunk of size 1205, which is longer than the specified 1000
Created a chunk of size 1060, which is longer than the specified 1000
Created a chunk of size 1946, which is longer than the specified 1000


Chunk 0 summarised, 11 remaining for this stock
Chunk 1 summarised, 10 remaining for this stock
Chunk 2 summarised, 9 remaining for this stock
Chunk 3 summarised, 8 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 4 summarised, 7 remaining for this stock
Chunk 5 summarised, 6 remaining for this stock
Chunk 6 summarised, 5 remaining for this stock
Chunk 7 summarised, 4 remaining for this stock
Waiting for 1 minute before processing the next 4 chunks...
Chunk 8 summarised, 3 remaining for this stock
Chunk 9 summarised, 2 remaining for this stock
Chunk 10 summarised, 1 remaining for this stock
Summarised!
xom Done


In [13]:
print(stocks_used_dict['tsla'][0])

## Summary of Issues for Tesla Inc (0R0X-LN)

The report highlights several concerning issues for Tesla Inc:

* **Declining Average Score:** Tesla's average score has dropped to a 3-year low of 4, primarily due to a decline in Price Momentum. This suggests a negative outlook on the company's performance.
* **Negative 1-Month and 3-Month Returns:** Tesla has experienced significant negative returns in the past month (-26.5%) and three months (-28.9%). This indicates a recent downward trend in the stock price.
* **Neutral Outlook:** Despite the recent decline, Tesla's current score is still considered "relatively in-line with the market," suggesting a neutral outlook. However, the declining trend raises concerns about future performance.
* **Analyst Recommendations:** While the mean recommendation from analysts is "Hold," the distribution of recommendations shows a significant number of "Sell" ratings. This suggests a lack of confidence in the company's future prospects.
* **Trailing and

In [None]:
import pandas as pd

# Convert stocks_used_dict to a pandas DataFrame
# Each stock symbol becomes a row, and its associated summaries become a column

stocks_df = pd.DataFrame(list(stocks_used_dict.items()), columns=['Stock', 'Summaries'])

# Optionally, if you want to save the DataFrame to a CSV file:
stocks_df.to_csv('stocks_used_summaries.csv', index=False)

# Check the resulting DataFrame
print(stocks_df.head())


   Stock                                          Summaries
0   aapl  [## Summary of Issues from Apple Inc. (0R2V-LN...
1   amzn  [## Summary of Issues for AMZN:\n\nThe report ...
2     ba  [## Summary of Issues for Boeing Co (BA)\n\nTh...
3   brka  [## Summary of Issues from the Berkshire Hatha...
4  googl  [## Summary of Issues from Alphabet Inc. (GOOG...


In [35]:
# https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings
DIMENSIONS = 768  # Dimensions output from textembedding-gecko

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag_index",
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="RAG LangChain Index",
    index_update_method="STREAM_UPDATE",
)

Creating MatchingEngineIndex
Create MatchingEngineIndex backing LRO: projects/954241416931/locations/us-east4/indexes/3653589178169425920/operations/1053255723351277568
MatchingEngineIndex created. Resource name: projects/954241416931/locations/us-east4/indexes/3653589178169425920
To use this MatchingEngineIndex in another session:
index = aiplatform.MatchingEngineIndex('projects/954241416931/locations/us-east4/indexes/3653589178169425920')


In [36]:
DEPLOYED_INDEX_ID = "rag_index_endpoint"

index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DEPLOYED_INDEX_ID,
    description="RAG Index Endpoint",
    public_endpoint_enabled=True,
)

Creating MatchingEngineIndexEndpoint
Create MatchingEngineIndexEndpoint backing LRO: projects/954241416931/locations/us-east4/indexEndpoints/904634186868981760/operations/6484596873960095744
MatchingEngineIndexEndpoint created. Resource name: projects/954241416931/locations/us-east4/indexEndpoints/904634186868981760
To use this MatchingEngineIndexEndpoint in another session:
index_endpoint = aiplatform.MatchingEngineIndexEndpoint('projects/954241416931/locations/us-east4/indexEndpoints/904634186868981760')


In [None]:
index_endpoint = index_endpoint.deploy_index(
    index=index, deployed_index_id="rag_deployed_index"
)
index_endpoint.deployed_indexes

Deploying index MatchingEngineIndexEndpoint index_endpoint: projects/954241416931/locations/us-east4/indexEndpoints/904634186868981760
Deploy index MatchingEngineIndexEndpoint index_endpoint backing LRO: projects/954241416931/locations/us-east4/indexEndpoints/904634186868981760/operations/5984697315321970688


In [None]:
from chromadb import Client
from chromadb.config import Settings
from langchain.vectorstores import Chroma
from langchain.embeddings import VertexAIEmbeddings
from langchain.retrievers import MultiVectorRetriever
import uuid
from langchain.schema import Document

# Assuming VertexAIEmbeddings and Chroma setup
# vectorstore = Chroma(
#     collection_name="rag_collection",
#     embedding_function=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
# )
vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index.name,
    endpoint_id=index_endpoint.name,
    embedding=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
    stream_update=True,
)
# Create the in-memory docstore to store metadata (e.g., stock symbol)
docstore = InMemoryStore()

# Define the key for document IDs (it could be stock symbols or unique IDs)
id_key = "doc_id"

# Create the multi-vector retriever
retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

# Process the stock summaries and add to the vectorstore
for stock, summaries in stocks_used_dict.items():
    # Generate unique document IDs (or use stock symbols as IDs)
    doc_ids = [str(uuid.uuid4()) for _ in summaries]
    
    # Create Document objects (with summaries and metadata)
    summary_docs = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(summaries)
    ]
    
    # Add documents (summaries) to Chroma vectorstore
    vectorstore.add_documents(summary_docs)

PydanticUserError: `VertexAIEmbeddings` is not fully defined; you should define `_LanguageModel`, then call `VertexAIEmbeddings.model_rebuild()`.

For further information visit https://errors.pydantic.dev/2.10/u/class-not-fully-defined

In [None]:
def split_text_types(docs):
    """
    Split only text documents
    """
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        texts.append(doc)
    return {"texts": texts}


def text_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = [
        {
            "type": "text",
            "text": (
                "You are a financial analyst tasked with providing investment advice.\n"
                "You will be given text-based data, including tables and reports.\n"
                "Use this information to provide investment advice related to the user's question.\n"
                f"User-provided question: {data_dict['question']}\n\n"
                "Text and / or tables:\n"
                f"{formatted_texts}"
            ),
        }
    ]
    return [HumanMessage(content=messages)]


# Create RAG chain with text-only logic
chain_multimodal_rag = (
    {
        "context": retriever_multi_vector_img | RunnableLambda(split_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(text_prompt_func)
    | ChatVertexAI(
        temperature=0,
        model_name=MODEL_NAME,
        max_output_tokens=TOKEN_LIMIT,
    )  # Multi-modal LLM (text-only in this case)
    | StrOutputParser()
)

Created a chunk of size 1044, which is longer than the specified 1000
Created a chunk of size 1192, which is longer than the specified 1000
Created a chunk of size 1049, which is longer than the specified 1000
Created a chunk of size 1941, which is longer than the specified 1000
Retrying langchain_google_vertexai.chat_models._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ResourceExhausted: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-flash. Please submit a quota increase request. https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai..
Retrying langchain_google_vertexai.chat_models._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ResourceExhausted: 429 Quota exceeded for aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model with base model: gemini-1.5-flash. Pleas

KeyboardInterrupt: 

In [14]:
# doc = fitz.open(f"tesla-stock-report.pdf")
# text = "\n".join([page.get_text() for page in doc])
# # print(text)
# # Extract text from all pages
# texts = [page.get_text("text") for page in doc]

# # Combine extracted text
# full_text = "\n\n".join(texts)

# # Print or use the extracted text
# # print(full_text)

# # Initialize the text splitter, and chunk the reports into more concise summaries
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
#     chunk_size=2000, chunk_overlap=0
# )

# # Split text into chunks
# texts_4k_token = text_splitter.split_text(full_text)

# # Get text, table summaries
# text_summaries = generate_text_summaries(
#     texts_4k_token, summarize_texts=True
# )

## Part 3: Evaluation of RAG Model

Th

### Evaluation of Strengths and Weaknesses

#### Strengths

#### Weaknesses

### Potential Future Work

### Conclusion