<a href="https://colab.research.google.com/github/sidokhade/AI-Powered-Investment-Research-Assistant-Gemini-RAG-/blob/main/InvestmentThesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Libraries

In [None]:
#python -m venv thesis_env
#source thesis_env/bin/activate  # (Windows: thesis_env\Scripts\activate)
!pip install google-generativeai langchain chromadb sentence-transformers pypdf2 streamlit langchain_text_splitters langchain-community


Import libraries

In [None]:
import os
import google.generativeai as genai

## Load API key

# Never hardcode the key in the notebook: read it from the environment,
# or prompt for it so it is not saved with the notebook.
from getpass import getpass

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Gemini API key: ")
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  loader.exec_module(module)
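
Given the deprecation notice above, the generation call could be ported to the `google.genai` package. The following is a minimal sketch, not the notebook's own code: it assumes the `google-genai` package is installed and that `GOOGLE_API_KEY` is set in the environment. The function name `generate_with_new_sdk` and its `client` parameter are hypothetical; the parameter exists only so the function can be tried without a network call.

```python
import os


def generate_with_new_sdk(prompt, model_name="gemini-2.5-flash", client=None):
    """Send a prompt through the google.genai client and return the reply text."""
    if client is None:
        # Imported lazily so this cell still loads when google-genai is absent.
        from google import genai  # assumption: installed via `pip install google-genai`

        client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.generate_content(model=model_name, contents=prompt)
    return response.text
```

If this route is taken, `generate_investment_thesis` would call this helper instead of `genai.GenerativeModel(...).generate_content(...)`.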


Document Loader Setup

In [None]:

import os
from PyPDF2 import PdfReader

def load_documents(folder_path):
    docs = []
    processed_count = 0
    # Ensure the path exists and is a directory
    if not os.path.exists(folder_path) or not os.path.isdir(folder_path):
        print(f"Warning: Folder path '{folder_path}' does not exist or is not a directory. Skipping.")
        return []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".pdf"):
            file_path = os.path.join(folder_path, file_name)
            try:
                reader = PdfReader(file_path)
                text = ""
                for page in reader.pages:
                    text += page.extract_text() or ""
                docs.append({"file_name": file_name, "text": text})
                processed_count += 1
            except Exception as e:
                print(f"Error processing {file_name}: {e}")
    print(f"Successfully processed {processed_count} PDF documents.")
    return docs
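
The prompt later asks the model for a `page_number`, but `load_documents` joins all pages into one string, so that information is lost before chunking. A possible way to keep it is sketched below. The helper `pages_to_records` is hypothetical and works on a plain list of page strings, so it does not depend on PyPDF2; the page number it records is the position in the PDF, which may differ from the number printed on the page.

```python
def pages_to_records(file_name, page_texts):
    """Turn per-page strings into records that remember which page they came from."""
    records = []
    for page_number, page_text in enumerate(page_texts, start=1):
        cleaned = (page_text or "").strip()
        if not cleaned:
            continue  # skip pages with no extractable text (e.g. scanned images)
        records.append(
            {"file_name": file_name, "page_number": page_number, "text": cleaned}
        )
    return records


# With PyPDF2, the page strings could come from:
#   page_texts = [page.extract_text() for page in PdfReader(file_path).pages]
sample_records = pages_to_records("Apple.pdf", ["Cover page", None, "  Balance sheet  "])
print(sample_records)
```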

Chunking Setup

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def semantic_chunking(docs, chunk_size=800, overlap=150):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ".", " "]
    )
    chunks = []
    for doc in docs:
        for chunk in splitter.split_text(doc["text"]):
            chunks.append({"text": chunk, "source": doc["file_name"]})
    return chunks
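
To make the `chunk_size` and `overlap` parameters concrete, here is a simplified fixed-window chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to split on the listed separators); it only illustrates how consecutive chunks share `overlap` characters. The name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(text, chunk_size=800, overlap=150):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(fixed_window_chunks("abcdefghij", chunk_size=4, overlap=2))
```

With the defaults of 800 and 150, each window advances by 650 characters, so roughly a fifth of every chunk repeats text from the previous one.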



Vector Setup

In [None]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

def build_vector_store(chunks, persist_directory="vector_db"):
    embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    vector_db = Chroma.from_texts(
        texts=[chunk["text"] for chunk in chunks],
        embedding=embedding_model,
        metadatas=[{"source": chunk["source"]} for chunk in chunks],
        persist_directory=persist_directory
    )
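    # Chroma 0.4+ persists automatically when persist_directory is set,
    # so the deprecated vector_db.persist() call is not needed.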
    return vector_db
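
Conceptually, the vector store ranks chunks by how close their embedding is to the query embedding. The sketch below shows that idea with plain Python lists and cosine similarity. It is an illustration only: the metric Chroma actually uses depends on how the collection is configured, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in input order
    return [i for _, i in scored[:k]]


print(top_k_indices([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
```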

Retrieval Setup

In [None]:
def retrieve_context(vector_db, query, top_k=5):
    results = vector_db.similarity_search(query, k=top_k)
    context = "\n\n".join([r.page_content for r in results])
    return context
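
`retrieve_context` drops the `source` metadata that `build_vector_store` attaches to every chunk. If the model should be able to say which file a figure came from, the context could carry that label. A minimal sketch follows, assuming the results expose `page_content` and `metadata` as LangChain `Document` objects do; `format_context_with_sources` is a hypothetical helper and the sample data is made up for the demo.

```python
from types import SimpleNamespace


def format_context_with_sources(results):
    """Join retrieved chunks, prefixing each with the file it came from."""
    blocks = []
    for result in results:
        source = (result.metadata or {}).get("source", "unknown")
        blocks.append(f"[source: {source}]\n{result.page_content}")
    return "\n\n".join(blocks)


# Stand-ins for the Document objects returned by similarity_search (made-up content).
demo_results = [
    SimpleNamespace(page_content="First chunk.", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="Second chunk.", metadata={}),
]
demo_context = format_context_with_sources(demo_results)
print(demo_context)
```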

Prompt Template

In [None]:
def build_investment_prompt(company_name, context):
    prompt = f"""
You are a senior investment analyst preparing a professional investment thesis.

Company: {company_name}

Context from research reports, filings, and market data:
{context}

Generate a structured investment thesis with the following sections:
1. **Company Overview**
2. **Market Opportunity**
3. **Competitive Positioning**
4. **Financial Performance**
5. **Risks & Mitigations**
6. **Investment Recommendation**

Guidelines:
- Use a concise, factual, and analytical tone.
- Base insights strictly on the provided context.
- Avoid speculative statements.
- Format clearly with section headings.

RULES FOR EXTRACTION:

1. Only use financial values found in:
   - Consolidated Statement of Profit and Loss
   - Consolidated Balance Sheet
   - Consolidated Cash Flow Statement
2. If multiple values appear, use these tie-break rules:
   a. Prefer audited tables.
   b. Prefer the latest FY.
   c. Prefer numeric tables over narrative.
3. Return structured JSON only.
4. Include for each item:
   - value
   - unit
   - source_snippet (copy exact text)
   - page_number (if found)
5. Refer to the analyst report for the Investment Recommendation and generate the summary.
"""
    return prompt

Generator Setup

In [None]:
import google.generativeai as genai
#from prompt_template import build_investment_prompt

def generate_investment_thesis(company_name, context):
    prompt = build_investment_prompt(company_name, context)
    model = genai.GenerativeModel("gemini-2.5-flash")
    response = model.generate_content(prompt)
    return response.text
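
Because the prompt asks for structured JSON only, the returned text can be parsed into a Python object for later use. Models often wrap such output in a Markdown code fence, so the sketch below strips one if present before calling `json.loads`. The helper name `parse_json_response` is hypothetical, and the approach assumes the reply contains a single JSON document.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_response(response_text):
    """Parse a model reply as JSON, unwrapping a surrounding code fence if there is one."""
    match = FENCED_BLOCK.search(response_text)
    payload = match.group(1) if match else response_text
    return json.loads(payload.strip())  # raises json.JSONDecodeError on invalid JSON


demo_reply = FENCE + 'json\n{"investment_thesis": {"sections": 6}}\n' + FENCE
print(parse_json_response(demo_reply))
```

A truncated reply will raise `json.JSONDecodeError`, which is a useful signal that the response hit an output limit.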

In [None]:
#test code 1:
#Test the loader: two PDFs are in the folder. Check that the text o/p has the contents of both PDFs
import os

pdf_folder_path = "/content/sample_data"

# Create the directory if it doesn't exist
if not os.path.exists(pdf_folder_path):
    os.makedirs(pdf_folder_path)
    print(f"Created directory: {pdf_folder_path}")
else:
    print(f"Directory already exists: {pdf_folder_path}")

# Now try loading documents
text = load_documents(pdf_folder_path)

# Print the loaded documents, if any. Note: `text` is a list of dicts, so text[-100:] keeps up to the last 100 entries, not characters
if text:
    print(text[-100:])
else:
    print("No documents loaded (likely no PDFs in the directory or issue with folder path).")

Directory already exists: /content/sample_data
Successfully processed 2 PDF documents.
[{'file_name': 'Apple.pdf', 'text': 'DBS Group Research  \nDisclaimer: The information contained in this document is intended only for use by the person to whom it has been delivered and should not be  disseminated \nor distributed to third parties without our prior written consent. DBS accepts no liability whatsoever with respect to the use of this document or its contents. \nPlease refer to Disclaimer found at the end of this document  \n \n \n  US EQUITY  RESEARCH   \n \n18 July 2025   \nApple Inc  \nPhenomenal new products and technological barrier to support growth  \n \nCompany Overview  \nApple Inc. (Apple) designs, manufactures, and markets smartphones, personal computers, \ntablets, wearables, and accessories and sells a range of related services. The company’s \nproducts include iPhone (51% of FY24 revenue), Mac (8%), iPad (7%), wearables, h ome, and \naccessories (9%) like AirPods, Apple T

In [None]:
#test code 2: "text" (the o/p of test code 1) now goes in for chunking
chunks = semantic_chunking(text)
chunks[-5:]

[{'text': '(d)Disclosed in this report any change in the Registrant’s internal control over financial reporting that occurred \nduring the Registrant’s most recent fiscal quarter (the Registrant’s fourth fiscal quarter in the case of an annual \nreport) that has materially affected, or is reasonably likely to materially affect, the Registrant’s internal control \nover financial reporting; and\n5.The Registrant’s other certifying officer(s) and I have disclosed, based on our most recent evaluation of internal control over \nfinancial reporting, to the Registrant’s auditors and the audit committee of the Registrant’s board of directors (or persons \nperforming the equivalent functions):\n(a)All significant deficiencies and material weaknesses in the design or operation of internal control over financial',
  'source': '10K-Q4-2025-as-filed.pdf'},
 {'text': '(a)All significant deficiencies and material weaknesses in the design or operation of internal control over financial \nreporting whi

In [None]:
#test code 3: the o/p of test code 2 is now the input to the DB
#import importlib
#import vector_store
#importlib.reload(vector_store)
#from vector_store import build_vector_store
vector_db = build_vector_store(chunks)
#vector_db



In [None]:
#test code 4: let's search the DB now
#Here is the problem: we retrieve chunks for one specific query and pass only those to the LLM, while the
#prompt asks for a broad, multi-section thesis. With such a narrow context, proper thesis generation fails.
#Solution: don't use RAG; instead pass the whole "text" (o/p of test code 1) to the LLM. This "text" has everything
context = retrieve_context(vector_db, "Consolidated financial statement of apple")
print(context)  # o/p is coming correctly, which means the DB is set up

See accompanying Notes to Consolidated Financial Statements.
Apple Inc. | 2025  Form 10-K | 33Apple Inc.
Notes to Consolidated Financial Statements
Note 1 – Summary of Significant Accounting Policies
Basis of Presentation and Preparation
The consolidated financial statements include the accounts of Apple Inc. and its wholly owned subsidiaries. The preparation of 
these consolidated financial statements and accompanying notes in conformity with GAAP requires the use of management 
estimates. Certain prior period amounts in the notes to consolidated financial statements have been reclassified to conform to 
the current period’s presentation .
The Company’s fiscal year is the 52- or 53-week period that ends on the last Saturday of September. An additional week is

Consolidated Statements of Shareholders’ Equity for the years ended September 27, 2025, September 28, 2024 
and September 30, 2023 32
Consolidated Statements of Cash Flows for the years ended September 27, 2025, September 28, 20
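
One way to widen the context while still using the vector store is to run one query per thesis section and merge the results. This is a sketch of that alternative, not something the notebook does: `retrieve_multi_query` and `SECTION_QUERIES` are hypothetical names, and the query strings are simply derived from the section titles in the prompt.

```python
# Hypothetical queries, one per section requested in the prompt template.
SECTION_QUERIES = [
    "Company overview",
    "Market opportunity",
    "Competitive positioning",
    "Financial performance",
    "Risks and mitigations",
    "Analyst investment recommendation",
]


def retrieve_multi_query(search, queries, top_k=5):
    """Run every query through `search` and merge the chunks, dropping duplicates."""
    seen = set()
    merged = []
    for query in queries:
        for chunk_text in search(query, top_k):
            if chunk_text not in seen:  # the same chunk can match several queries
                seen.add(chunk_text)
                merged.append(chunk_text)
    return "\n\n".join(merged)


# In this notebook, `search` could wrap the vector store like so:
#   search = lambda q, k: [r.page_content for r in vector_db.similarity_search(q, k=k)]
#   context = retrieve_multi_query(search, SECTION_QUERIES)
```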

In [None]:
#test code 5: generate the thesis from the retrieved context
#If you set context to the full "text" instead, the RAG pipeline is skipped and the thesis is more detailed
thesis = generate_investment_thesis("Apple", context)
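
Since `text` from test code 1 is a list of dicts rather than a string, passing it directly as the context would put a Python list representation into the prompt. A small helper can flatten it first. This is a sketch under assumptions: `build_full_context` and its optional `max_chars` cap are hypothetical, and the cap is a crude character cut, not a token-aware limit.

```python
def build_full_context(docs, max_chars=None):
    """Join every loaded document into one string, labelled by file name."""
    parts = [f"### {doc['file_name']}\n{doc['text']}" for doc in docs]
    context = "\n\n".join(parts)
    if max_chars is not None:
        context = context[:max_chars]  # crude cut to stay within a size budget
    return context


demo_docs = [
    {"file_name": "a.pdf", "text": "alpha"},
    {"file_name": "b.pdf", "text": "beta"},
]
print(build_full_context(demo_docs))
```

In the notebook this would be used as `thesis = generate_investment_thesis("Apple", build_full_context(text))`.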

In [None]:
print(thesis)

```json
{
  "investment_thesis": {
    "company_overview": {
      "summary": "Apple Inc. is a technology company, with its consolidated financial statements encompassing Apple Inc. and its wholly owned subsidiaries. The company's fiscal year concludes on the last Saturday of September, spanning either 52 or 53 weeks. Financial statements are prepared in conformity with U.S. GAAP, involving management estimates.",
      "details": [
        {
          "item": "Entity Structure",
          "value": "Apple Inc. and its wholly owned subsidiaries",
          "unit": "entity",
          "source_snippet": "The consolidated financial statements include the accounts of Apple Inc. and its wholly owned subsidiaries.",
          "page_number": 34
        },
        {
          "item": "Fiscal Year End",
          "value": "Last Saturday of September",
          "unit": "date",
          "source_snippet": "The Company’s fiscal year is the 52- or 53-week period that ends on the last Saturday of Se
