Persona Semantic Search + RAG Prototype
This notebook implements a **persona detection** pipeline using **semantic search** over a **mock corpus** of investor questions. It returns a `/semantic_search`-style JSON response containing the **top persona** and a
**persona_definition** that you can inject into an LLM prompt.

What you get:
- Mock persona dataset (Novice / Intermediate / Expert) saved to **Excel** and **CSV**
- Embeddings + cosine similarity via `sentence-transformers`
- `/semantic_search` function returning: `[{"persona": str, "persona_definition": str}]`
- **RAG** generator that tailors the answer using the selected persona (offline template or OpenAI if available)
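
The similarity measure named in the list above can be illustrated without any model. The following is a minimal sketch in pure Python of what cosine similarity computes; the helper name `cosine_similarity` and the 3-dimensional toy vectors are assumptions made for illustration, while the notebook itself relies on `util.cos_sim` over 384-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings", made up for illustration only.
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, different length
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))  # close to 1: same direction
print(cosine_similarity(query, doc_far))    # 0: nothing in common
```

Because the score depends only on direction, a long and a short question with the same meaning can still score close to 1.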

In [1]:
#!pip install sentence-transformers pandas numpy openai

In [4]:
import sys
!{sys.executable} -m pip install --upgrade openai

from openai import OpenAI

Collecting openai
  Downloading openai-1.99.9-py3-none-any.whl.metadata (29 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.10.0-cp313-cp313-win_amd64.whl.metadata (5.3 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.11.7-py3-none-any.whl.metadata (67 kB)
Collecting annotated-types>=0.6.0 (from pydantic<3,>=1.9.0->openai)
  Downloading annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting pydantic-core==2.33.2 (from pydantic<3,>=1.9.0->openai)
  Downloading pydantic_core-2.33.2-cp313-cp313-win_amd64.whl.metadata (6.9 kB)
Collecting typing-inspection>=0.4.0 (from pydantic<3,>=1.9.0->openai)
  Downloading typing_inspection-0.4.1-py3-none-any.whl.metadata (2.6 kB)
Downloading openai-1.99.9-py3-none-any.whl (786 kB)


In [3]:
import sys
print(sys.executable)

C:\Users\mail2\AppData\Local\Programs\Python\Python313\python.exe


In [25]:
import os
import json
from pathlib import Path
from typing import List, Dict

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Read the key from the environment rather than hard-coding it in the notebook.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
client = None
if OPENAI_API_KEY:
    try:
        from openai import OpenAI
        client = OpenAI(api_key=OPENAI_API_KEY)
        print("OpenAI client initialized.")
    except Exception as e:
        client = None
        print("OpenAI not available; falling back to offline templating.", e)
else:
    print("OPENAI_API_KEY not set; using offline templating.")



OpenAI not available; falling back to offline templating. No module named 'openai'
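
Any non-empty string, including a leftover placeholder, passes a plain `if key:` check. The sketch below shows one possible way to treat such values as "no key"; the helper `resolve_api_key` and the `PLACEHOLDERS` set are hypothetical names introduced here, not part of the notebook or of the OpenAI client.

```python
import os
from typing import Mapping, Optional

# Values treated as "no key configured". This set is an assumption for the sketch.
PLACEHOLDERS = {"", "ADD OPENAI KEY"}


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the value of OPENAI_API_KEY, or None if unset or a placeholder."""
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    return None if key in PLACEHOLDERS else key


# Passing a dict instead of os.environ keeps the example deterministic.
print(resolve_api_key({}))                                    # None
print(resolve_api_key({"OPENAI_API_KEY": "ADD OPENAI KEY"}))  # None
print(resolve_api_key({"OPENAI_API_KEY": "sk-example"}))      # sk-example
```

With a helper like this, the client would only be constructed when the result is not `None`, so the offline template is chosen up front rather than after a failed request.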


In [20]:
# Create Mock Persona Corpus (Novice / Intermediate / Expert)
corpus = [
    {
        "persona": "Novice Investor",
        "persona_definition": (
            "A beginner in investing who needs simple explanations, minimal jargon, "
            "and prefers low-risk, diversified funds with long-term horizons."
        ),
        "example_questions": [
            "What is a stock and how is it different from a bond?",
            "How do I start investing with a small amount of money?",
            "Is an ETF safer than a single stock?",
            "What is the safest way to grow my savings?",
            "Should I invest monthly or all at once?"
        ]
    },
    {
        "persona": "Intermediate DIY Investor",
        "persona_definition": (
            "Understands basic asset classes and portfolio concepts, comfortable with "
            "moderate risk, seeks balanced explanations with some metrics."
        ),
        "example_questions": [
            "How should I balance ETFs and bonds in a 10-year plan?",
            "What asset allocation fits a moderate risk tolerance?",
            "Compare VTI and SPY for diversification and fees.",
            "Is dollar-cost averaging better than lump-sum in a volatile market?",
            "How do I rebalance a 60/40 portfolio?"
        ]
    },
    {
        "persona": "Expert Investor",
        "persona_definition": (
            "Comfortable with financial jargon and advanced strategies; expects concise, "
            "data-rich insights using risk/return statistics and factor exposures."
        ),
        "example_questions": [
            "Contrast factor tilts between QQQ and RSP; impact on drawdown and tracking error.",
            "Implications of an inverted yield curve for duration exposure in IG vs HY credit ETFs?",
            "Compare 1Y and 3Y Sharpe for SCHD vs VIG with dividend stability considerations.",
            "How do you hedge currency risk for international equity exposure?",
            "Suggest a momentum tilt using rolling 6M/12M signals across sector ETFs."
        ]
    }
]

rows = []
for d in corpus:
    for q in d["example_questions"]:
        rows.append(
            {
                "persona": d["persona"],
                "persona_definition": d["persona_definition"],
                "example_question": q,
            }
        )

df = pd.DataFrame(rows)
df.head()

   persona          persona_definition                                 example_question
0  Novice Investor  A beginner in investing who needs simple expla...  What is a stock and how is it different from a...
1  Novice Investor  A beginner in investing who needs simple expla...  How do I start investing with a small amount o...
2  Novice Investor  A beginner in investing who needs simple expla...  Is an ETF safer than a single stock?
3  Novice Investor  A beginner in investing who needs simple expla...  What is the safest way to grow my savings?
4  Novice Investor  A beginner in investing who needs simple expla...  Should I invest monthly or all at once?
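
The flattening step produces one row per example question, so it can be worth checking that no persona dominates the index. Below is a minimal sketch of that check using only the standard library; `flatten_corpus` and `toy_corpus` are hypothetical names, and the definitions are shortened for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def flatten_corpus(corpus: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Expand each persona entry into one row per example question."""
    return [
        {
            "persona": entry["persona"],
            "persona_definition": entry["persona_definition"],
            "example_question": question,
        }
        for entry in corpus
        for question in entry["example_questions"]
    ]


# Shortened stand-in for the notebook's corpus (illustrative values only).
toy_corpus = [
    {
        "persona": "Novice Investor",
        "persona_definition": "Needs simple explanations.",
        "example_questions": ["What is a stock?", "Is an ETF safer than a single stock?"],
    },
    {
        "persona": "Expert Investor",
        "persona_definition": "Expects data-rich insights.",
        "example_questions": ["How do you hedge currency risk?"],
    },
]

flat = flatten_corpus(toy_corpus)
counts = Counter(row["persona"] for row in flat)
print(len(flat))     # number of rows that would be embedded
print(dict(counts))  # rows per persona
```

A persona with many more example questions than the others has more chances to be the nearest neighbour, so roughly equal counts are a reasonable starting point for a prototype.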


In [6]:
out_dir = Path("persona_demo_data")
out_dir.mkdir(exist_ok=True)
excel_path = out_dir / "persona_corpus.xlsx"
csv_path = out_dir / "persona_corpus.csv"

with pd.ExcelWriter(excel_path) as writer:
    df.to_excel(writer, index=False, sheet_name="persona_corpus")

df.to_csv(csv_path, index=False)

print("Saved:", excel_path.resolve())
print("Saved:", csv_path.resolve())

Saved: C:\Users\mail2\Documents\Northwestern MSDS\Capstone DS498\persona_demo_data\persona_corpus.xlsx
Saved: C:\Users\mail2\Documents\Northwestern MSDS\Capstone DS498\persona_demo_data\persona_corpus.csv
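
The persona definitions contain commas, so the CSV export depends on correct quoting. The following sketch, assuming nothing beyond the standard `csv` module, round-trips one row through an in-memory buffer to show that the fields survive intact; `sample_rows` is an illustrative name, and the notebook itself writes the file with `df.to_csv`.

```python
import csv
import io
from typing import Dict, List

fieldnames = ["persona", "persona_definition", "example_question"]
sample_rows: List[Dict[str, str]] = [
    {
        "persona": "Novice Investor",
        "persona_definition": (
            "A beginner in investing who needs simple explanations, minimal jargon, "
            "and prefers low-risk, diversified funds with long-term horizons."
        ),
        "example_question": "Is an ETF safer than a single stock?",
    },
]

# Write to an in-memory buffer instead of disk to keep the example side-effect free.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(sample_rows)

# Read the same text back and compare with the input.
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(restored == sample_rows)
```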


In [7]:
# Build embeddings index
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["example_question"].tolist(), convert_to_tensor=True)
print("Embeddings shape:", embeddings.shape)
df.shape

Embeddings shape: torch.Size([15, 384])


(15, 3)

In [8]:
# `/semantic_search` — Return Top Persona + Definition

def semantic_search(user_query: str, top_k: int = 1) -> List[Dict[str, str]]:
    """
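    Given a user query, return the persona and persona_definition attached to
    each of the top_k most similar example questions, best match first.
    With top_k > 1 the same persona can appear more than once.
    Response schema matches your API design:
    [ { 'persona': str, 'persona_definition': str } ]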
    """
    query_emb = model.encode(user_query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, embeddings).cpu().numpy().flatten()
    top_idx = scores.argsort()[-top_k:][::-1]
    results = []
    for idx in top_idx:
        row = df.iloc[idx]
        results.append(
            {
                "persona": row["persona"],
                "persona_definition": row["persona_definition"],
            }
        )
    return results


semantic_search("How do I start investing with a small amount of money?")

[{'persona': 'Novice Investor',
  'persona_definition': 'A beginner in investing who needs simple explanations, minimal jargon, and prefers low-risk, diversified funds with long-term horizons.'}]
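
The function above ranks example questions, not personas, so asking for more than one result can return the same persona several times. One possible refinement, sketched here under the assumption that each persona is scored by its best-matching question, is to aggregate before ranking. `rank_personas` and the scores below are made up for illustration and are not output from the model.

```python
from typing import Dict, List, Sequence, Tuple


def rank_personas(
    personas: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Keep each persona's best score, then sort personas by that score."""
    if len(personas) != len(scores):
        raise ValueError("personas and scores must have the same length")
    best: Dict[str, float] = {}
    for persona, score in zip(personas, scores):
        if persona not in best or score > best[persona]:
            best[persona] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# One entry per example question (illustrative similarity scores).
personas = [
    "Novice Investor",
    "Novice Investor",
    "Intermediate DIY Investor",
    "Expert Investor",
]
scores = [0.82, 0.79, 0.55, 0.31]

for persona, score in rank_personas(personas, scores):
    print(f"{persona}: {score:.2f}")
```

With these numbers, the two highest-scoring rows both belong to the novice persona, whereas the aggregated ranking lists each persona once. Taking the maximum is only one choice; a mean over each persona's questions would be less sensitive to a single lucky match.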

In [14]:
# RAG-Style Response Generation
try:
    from openai import OpenAI
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) if os.getenv("OPENAI_API_KEY") else None
except ImportError:
    client = None

def generate_advice_with_persona(user_query: str, budget: float = 0.0) -> Dict[str, str]:
    """
    Semantic search -> persona -> tailored answer.
    Uses OpenAI if available, otherwise falls back to offline template.
    Returns a dict: {'persona', 'persona_definition', 'answer'}
    """
    top = semantic_search(user_query, top_k=1)[0]
    persona = top["persona"]
    persona_def = top["persona_definition"]

    use_openai = client is not None
    answer = ""

    if use_openai: 
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": f"You are advising a {persona}. Persona definition: {persona_def} Provide ETF/stock advice suited to that persona."},
                    {"role": "user", "content": f"User question: {user_query}. Budget: ${budget:,.2f}"}
                ],
                max_tokens=300
            )
            answer = response.choices[0].message.content.strip()
        except Exception as e:
            # Report the failure here; the offline template below supplies the answer.
            print(f"[LLM Error: {e}] Falling back to offline template.")
            use_openai = False

    if not use_openai:
        if persona == "Novice Investor":
            answer = (
                f"As a beginner, focus on simple, diversified choices. "
                f"Consider a low-cost broad ETF and add a bond ETF to reduce ups and downs. "
                f"Invest regularly (monthly) and avoid timing the market. "
                f"Your question: '{user_query}'."
            )
        elif persona == "Intermediate DIY Investor":
            answer = (
                f"Given your intermediate profile, balance growth and stability. "
                f"A core broad equity ETF plus a bond sleeve can work; rebalance (e.g., 60/40). "
                f"Watch expense ratios and tax efficiency. "
                f"Question: '{user_query}'."
            )
        else:  # Expert Investor
            answer = (
                f"For an expert: align allocation with your risk budget and macro view. "
                f"Consider factor tilts and monitor risk-adjusted returns. "
                f"If tactical, define entry/exit rules. "
                f"Question: '{user_query}'."
            )

    return {"persona": persona, "persona_definition": persona_def, "answer": answer}

In [15]:
generate_advice_with_persona("What is the safest way to grow my savings?", budget=2000)

{'persona': 'Novice Investor',
 'persona_definition': 'A beginner in investing who needs simple explanations, minimal jargon, and prefers low-risk, diversified funds with long-term horizons.',
 'answer': "As a beginner, focus on simple, diversified choices. Consider a low-cost broad ETF and add a bond ETF to reduce ups and downs. Invest regularly (monthly) and avoid timing the market. Your question: 'What is the safest way to grow my savings?'."}
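
Keeping prompt construction separate from the API call makes it possible to inspect the prompt without a key or network access. The sketch below shows one way to do that; `build_messages` is a hypothetical helper, and the system wording is an assumption that treats the persona as the audience of the answer.

```python
from typing import Dict, List


def build_messages(
    persona: str, persona_definition: str, user_query: str, budget: float = 0.0
) -> List[Dict[str, str]]:
    """Build a chat-style message list that injects the detected persona."""
    system = (
        f"You are advising a {persona}. "
        f"Persona definition: {persona_definition} "
        "Match vocabulary and level of detail to that persona."
    )
    user = f"User question: {user_query}\nBudget: ${budget:,.2f}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    persona="Novice Investor",
    persona_definition="A beginner in investing who needs simple explanations.",
    user_query="What is the safest way to grow my savings?",
    budget=2000,
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The returned list has the same role/content shape as the `messages` argument used in the cell above, so it could be passed through unchanged.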

In [16]:
# Multiple Queries

queries = [
    "Should I buy bonds or ETFs for retirement?",
    "How do you calculate beta exposure on a leveraged ETF?",
    "Compare VTI and SPY for diversification and fees.",
    "What is a stock?"
]

for q in queries:
    result = generate_advice_with_persona(q, budget=5000)
    print("Q:", q)
    print(json.dumps(result, indent=2), "\n")



Q: Should I buy bonds or ETFs for retirement?
{
  "persona": "Intermediate DIY Investor",
  "persona_definition": "Understands basic asset classes and portfolio concepts, comfortable with moderate risk, seeks balanced explanations with some metrics.",
  "answer": "Given your intermediate profile, balance growth and stability. A core broad equity ETF plus a bond sleeve can work; rebalance (e.g., 60/40). Watch expense ratios and tax efficiency. Question: 'Should I buy bonds or ETFs for retirement?'."
} 

Q: How do you calculate beta exposure on a leveraged ETF?
{
  "persona": "Expert Investor",
  "persona_definition": "Comfortable with financial jargon and advanced strategies; expects concise, data-rich insights using risk/return statistics and factor exposures.",
  "answer": "For an expert: align allocation with your risk budget and macro view. Consider factor tilts and monitor risk-adjusted returns. If tactical, define entry/exit rules. Question: 'How do you calculate beta exposure on 

In [17]:
# Mimic an API handler: `/semantic_search`
def api_semantic_search(payload: Dict[str, str]) -> List[Dict[str, str]]:
    """
    Mimic of a FastAPI/Flask handler.
    Input payload example: { "query": "user question here" }
    Output: [ {persona, persona_definition} ]
    """
    query = payload.get("query", "")
    return semantic_search(query, top_k=1)
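
The handler above falls back to an empty string when `query` is missing, which would still be embedded and matched to some persona. A minimal sketch of input validation that could run first is shown below; `validate_payload` is a hypothetical helper, and raising `ValueError` is an assumption about how a caller would want to signal a bad request.

```python
from typing import Any, Dict


def validate_payload(payload: Dict[str, Any]) -> str:
    """Return the trimmed query, or raise ValueError if it is missing or blank."""
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("payload must contain a non-empty string under 'query'")
    return query.strip()


print(validate_payload({"query": "  How do I start investing?  "}))

for bad_payload in ({}, {"query": ""}, {"query": 42}):
    try:
        validate_payload(bad_payload)
    except ValueError as error:
        print("rejected:", bad_payload, "->", error)
```

In a web framework the `ValueError` would typically be translated into a 4xx response rather than reaching the search function.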

In [18]:
api_semantic_search({"query": "How do I start investing with little money?"})

[{'persona': 'Novice Investor',
  'persona_definition': 'A beginner in investing who needs simple explanations, minimal jargon, and prefers low-risk, diversified funds with long-term horizons.'}]
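
Spot checks like the ones in this notebook can be turned into a small accuracy measurement once a few queries have hand-assigned labels. The sketch below assumes such labels exist; `persona_accuracy`, `keyword_stub`, and the labels themselves are illustrative stand-ins, with the stub replacing the embedding search so the example runs without a model.

```python
from typing import Callable, List, Tuple


def persona_accuracy(
    labeled: List[Tuple[str, str]], predict: Callable[[str], str]
) -> float:
    """Fraction of labeled queries for which predict returns the expected persona."""
    if not labeled:
        raise ValueError("labeled must not be empty")
    hits = sum(1 for query, expected in labeled if predict(query) == expected)
    return hits / len(labeled)


def keyword_stub(query: str) -> str:
    """Crude keyword rule used only to make this example self-contained."""
    text = query.lower()
    if "beta" in text or "factor" in text:
        return "Expert Investor"
    if "compare" in text or "rebalance" in text:
        return "Intermediate DIY Investor"
    return "Novice Investor"


# Hand-assigned labels for illustration; they are not results from the notebook.
labeled = [
    ("How do you calculate beta exposure on a leveraged ETF?", "Expert Investor"),
    ("Compare VTI and SPY for diversification and fees.", "Intermediate DIY Investor"),
    ("What is a stock?", "Novice Investor"),
    ("Should I buy bonds or ETFs for retirement?", "Intermediate DIY Investor"),
]

print(persona_accuracy(labeled, keyword_stub))  # 0.75: the stub misses the last query
```

To evaluate the real pipeline, `predict` could be a small wrapper such as `lambda q: semantic_search(q)[0]["persona"]`.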