# Email Search AI – Generative Search System

This notebook implements a generative search (retrieval-augmented generation) system over organizational email data.
The system follows a three-layer architecture:
1. Embedding Layer
2. Search Layer
3. Generation Layer

The goal is to retrieve and summarize decisions, strategies, and timelines from large email corpora.
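
To make the three layers concrete before the real models are loaded, here is a toy, dependency-free sketch. Everything in it is an illustrative assumption rather than the notebook's implementation: `toy_embed` uses word counts instead of a neural model, `toy_search` is a brute-force cosine scan instead of a vector database, and `toy_generate` only concatenates the retrieved text where a real system would call an LLM.

```python
import math
from collections import Counter

def toy_embed(text):
    # Embedding layer (toy): bag-of-words counts stand in for a dense vector
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_search(query, corpus, top_k=2):
    # Search layer (toy): rank every document by similarity to the query
    q = toy_embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, toy_embed(doc)), reverse=True)
    return ranked[:top_k]

def toy_generate(query, hits):
    # Generation layer (toy): a real system would send this context to an LLM
    return f"Q: {query}\nContext:\n" + "\n".join(f"- {h}" for h in hits)

corpus = [
    "budget approval for the data center move",
    "lunch meeting scheduled for may 5th",
    "termination log updated with new counterparties",
]
hits = toy_search("budget approval", corpus)
print(toy_generate("budget approval", hits))
```

The rest of the notebook follows the same shape, with each toy function replaced by a real component.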


In [1]:
# Standard-library and third-party imports
import os
import re
import hashlib
from typing import Dict

import pandas as pd
import chromadb
from chromadb.config import Settings

from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI




In [3]:
# Loading the email threads file
emails_df = pd.read_csv("email_thread_details.csv")
emails_df = emails_df.dropna(subset=["body"])

print("Emails loaded:", len(emails_df))
emails_df.head()

Emails loaded: 21684


Unnamed: 0,thread_id,subject,timestamp,from,to,body
0,1,FW: Master Termination Log,2002-01-29 11:23:42,"Gossett, Jeffrey C. JGOSSET","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...
1,1,FW: Master Termination Log,2002-01-31 12:50:00,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Gossett', 'Jeff...",\n\n -----Original Message-----\nFrom: =09Panu...
2,1,FW: Master Termination Log,2002-02-05 15:03:35,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Anderson', 'Dia...",Note to Stephanie Panus....\n\nStephanie...ple...
3,1,FW: Master Termination Log,2002-02-05 15:06:25,"Theriot, Kim S. KTHERIO","['Hall', 'D. Todd Thall', 'Sweeney', 'Kevin Ks...",\n\n -----Original Message-----\nFrom: =09Panu...
4,1,FW: Master Termination Log,2002-05-28 07:20:35,"Kelly, Katherine L. KKELLY","['Germany', 'Chris Cgerman']",\n\n -----Original Message-----\nFrom: =09McMi...


In [5]:
# Loading email summary file
summary_df = pd.read_csv("email_thread_summaries.csv")
print("Thread summaries loaded:", len(summary_df))
summary_df.head()

Thread summaries loaded: 4167


Unnamed: 0,thread_id,summary
0,1,The email thread discusses the Master Terminat...
1,2,A lunch meeting has been scheduled for May 5th...
2,3,Ben is updating a friend on his progress with ...
3,4,The recipient of the email thread initially ex...
4,5,The email thread discusses the long form confi...


In [7]:
# Cleans raw email text so it is suitable for embedding and semantic search
def clean_email(text):
    text = re.sub(r"On .* wrote:", "", text)
    text = re.sub(r"(From|Sent|To|Subject):.*", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

emails_df["clean_body"] = emails_df["body"].apply(clean_email)
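
To see what the cleaning step keeps and drops, the function can be run on a small made-up message. The sample text below is invented for illustration and is not taken from the dataset; the function body repeats the three substitutions from the cell above so the snippet runs on its own.

```python
import re

def clean_email(text):
    # Same three substitutions as the cell above
    text = re.sub(r"On .* wrote:", "", text)
    text = re.sub(r"(From|Sent|To|Subject):.*", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Made-up sample in the style of a forwarded message
raw = (
    "Hi team,\nPlease review the log.\n\n"
    " -----Original Message-----\n"
    "From: =09Jane Doe\nSent: Monday\nTo: John\nSubject: FW: Log\n\n"
    "Original text here."
)
cleaned = clean_email(raw)
print(cleaned)
```

Two behaviors are worth noting on this sample: the `-----Original Message-----` separator survives cleaning, and everything after a header token on the same line is discarded, including the `=09` encoding debris.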

In [9]:
# Splits a long email into smaller overlapping text chunks
def chunk_email(text, chunk_size=300, overlap=50):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start += chunk_size - overlap
    return chunks
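
The sliding window advances by `chunk_size - overlap` words per step. With the loop above, a text whose final window already reaches the last word can still produce one more chunk that contains only words from the overlap region. The following is a hedged sketch of a variant that stops once the end of the text is covered and rejects an `overlap` that would prevent the window from advancing. The name `chunk_words` is hypothetical and not part of the notebook.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    # Guard: the window must advance by at least one word per step
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window has reached the last word
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

On the ten-word sample this yields three chunks, each sharing one word with its neighbor.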

In [11]:
# Lists to store text chunks and their metadata
documents, metadatas = [], []

# Iterate through each email row
for _, row in emails_df.iterrows():

    # Split the cleaned email body into overlapping chunks
    for i, chunk in enumerate(chunk_email(row["clean_body"])):

        # Store the text chunk for embedding
        documents.append(chunk)

        # Store metadata associated with this chunk
        metadatas.append({
            "thread_id": row["thread_id"],   # Conversation identifier
            "subject": row["subject"],       # Email subject
            "timestamp": row["timestamp"],   # Sent time
            "from": row["from"],             # Sender
            "to": row["to"],                 # Recipient(s)
            "doc_type": "email"              # Used for filtering later
        })


In [13]:
# Load a pre-trained sentence embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Convert each document chunk into a vector embedding
email_embeddings = embedding_model.encode(
    documents,
    show_progress_bar=True  # Visual feedback during embedding
)


Batches:   0%|          | 0/1018 [00:00<?, ?it/s]

In [15]:
# Extract summary text and prepare metadata
summary_docs = summary_df["summary"].tolist()
summary_metas = [
    {"thread_id": t, "doc_type": "thread_summary"}
    for t in summary_df["thread_id"]
]

# Generate embeddings for thread-level summaries
summary_embeddings = embedding_model.encode(
    summary_docs,
    show_progress_bar=True
)


Batches:   0%|          | 0/131 [00:00<?, ?it/s]

In [19]:
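# NOTE: the cell that created `collection` is missing from this export.
# Assumption: a minimal in-memory Chroma setup; the collection name is hypothetical.
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="email_search")

# Function to add documents in smaller batches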
def batch_add(
    collection,
    documents,
    embeddings,
    metadatas,
    ids,
    batch_size=5000,  # safe batch size < Chroma max (~5461)
):
    """
    Add documents to ChromaDB in smaller batches to prevent InternalError.
    """
    for i in range(0, len(ids), batch_size):
        collection.add(
            documents=documents[i:i+batch_size],
            embeddings=embeddings[i:i+batch_size],
            metadatas=metadatas[i:i+batch_size],
            ids=ids[i:i+batch_size],
        )

# Combine emails and summaries into single lists
all_documents = documents + summary_docs
all_embeddings = list(email_embeddings) + list(summary_embeddings)
all_metadatas = metadatas + summary_metas
all_ids = [f"doc_{i}" for i in range(len(all_documents))]

# Add to Chroma in batches
batch_add(
    collection,
    all_documents,
    all_embeddings,
    all_metadatas,
    all_ids,
)
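
The batching logic depends on the four lists staying aligned, since item `i` of each list describes the same chunk. Below is a small sketch of the slicing pattern with an alignment check, using plain Python lists only. The helper names `check_aligned` and `iter_batches` are hypothetical.

```python
def check_aligned(*columns):
    # All parallel lists must have the same length before slicing them together
    lengths = {len(c) for c in columns}
    if len(lengths) > 1:
        raise ValueError(f"parallel lists differ in length: {sorted(lengths)}")

def iter_batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

ids = [f"doc_{i}" for i in range(12)]
docs = [f"text {i}" for i in range(12)]
check_aligned(ids, docs)

sizes = [len(batch) for batch in iter_batches(ids, batch_size=5)]
print(sizes)  # [5, 5, 2]
```

Running the check before the first `collection.add` call surfaces a length mismatch early instead of partway through the upload.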

In [21]:
# Simple in-memory cache to avoid repeated vector searches
cache: Dict[str, tuple] = {}

def cache_key(q, top_k):
    # Create a stable hash for each (query, top_k) pair so that a request
    # for a different number of results never reuses a stale cache entry
    return hashlib.md5(f"{top_k}:{q}".encode()).hexdigest()

def search(query, top_k=10):
    # Check cache first to avoid recomputing embeddings
    key = cache_key(query, top_k)
    if key in cache:
        return cache[key]

    # Generate embedding for the query
    emb = embedding_model.encode([query])[0]

    # Perform vector similarity search in ChromaDB
    results = collection.query(
        query_embeddings=[emb],
        n_results=top_k
    )

    # Cache and return documents with their metadata
    cache[key] = (
        results["documents"][0],
        results["metadatas"][0]
    )
    return cache[key]
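
One property worth checking in any query cache is that the key covers every argument that changes the result. The sketch below wraps an arbitrary search function and keys the cache on both the query and `top_k`. It uses a fake backend so it runs without a vector store; `make_cached_search` and `fake_search` are hypothetical names, and normalizing case and whitespace is an assumption that may not suit every use case.

```python
def make_cached_search(search_fn):
    cache = {}

    def cached(query, top_k=10):
        # Key on the normalized query text and on top_k
        key = (" ".join(query.lower().split()), top_k)
        if key not in cache:
            cache[key] = search_fn(query, top_k)
        return cache[key]

    return cached

calls = []

def fake_search(query, top_k):
    # Stand-in for the vector search; records each real backend call
    calls.append((query, top_k))
    return [f"result {i} for {query!r}" for i in range(top_k)]

cached_search = make_cached_search(fake_search)
cached_search("Budget approval", top_k=2)
cached_search("  budget   APPROVAL ", top_k=2)  # same normalized key: served from cache
cached_search("Budget approval", top_k=3)       # different top_k: new backend call
print(len(calls))  # 2
```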


In [23]:
# Load a cross-encoder model for query-document relevance scoring
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Rerank retrieved documents based on their relevance to the query.
def rerank(query, docs, metas, top_n=3):
    # Predict relevance scores for each (query, document) pair
    scores = reranker.predict([(query, d) for d in docs])

    # Combine docs, metadata, and scores, then sort by score descending
    ranked = sorted(
        zip(docs, metas, scores),
        key=lambda x: x[2],
        reverse=True
    )

    # Return only the top_n results
    return ranked[:top_n]
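
The sort-and-truncate pattern can be exercised without downloading a model by swapping in a stub scorer. In the sketch below, `overlap_score` is a hypothetical stand-in for the cross-encoder that deliberately returns negative values. The `sigmoid` helper shows one way to map scores into the range (0, 1) for display, assuming the scores are unbounded raw model outputs, which the negative values printed later in this notebook suggest. Because the sigmoid is monotonic, it does not change the ranking.

```python
import math

def rerank_with(score_fn, query, docs, metas, top_n=3):
    # Score each (query, document) pair, then sort by score descending
    scores = [score_fn(query, d) for d in docs]
    ranked = sorted(zip(docs, metas, scores), key=lambda x: x[2], reverse=True)
    return ranked[:top_n]

def sigmoid(x):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def overlap_score(query, doc):
    # Hypothetical stand-in for a cross-encoder: shared-word count minus 5
    shared = set(query.lower().split()) & set(doc.lower().split())
    return len(shared) - 5.0

docs = ["budget approval memo", "lunch on friday", "project budget timeline"]
metas = [{"doc_type": "email"}] * 3
top = rerank_with(overlap_score, "budget approval", docs, metas, top_n=2)
for doc, meta, score in top:
    print(f"{doc!r}: raw={score:.2f}, sigmoid={sigmoid(score):.3f}")
```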

In [33]:
# Initialize the OpenAI client using the API key from environment variables
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def build_prompt(query, ranked):
    """
    Construct a structured prompt for the LLM using:
    - Thread summaries (high-level context)
    - Email chunks (evidence)
    """
    summaries, emails = [], []

    # Separate summaries from raw email chunks
    for doc, meta, _ in ranked:
        if meta["doc_type"] == "thread_summary":
            summaries.append(doc)  # Thread-level summary
        else:
            # Format email evidence with timestamp, sender, recipient, and content
            emails.append(f"{meta['timestamp']} | {meta['from']} → {meta['to']}: {doc}")

    # Combine everything into a single prompt
    return f"""
You are an enterprise email analysis assistant.

THREAD SUMMARIES:
{chr(10).join(summaries)}

EMAIL EVIDENCE:
{chr(10).join(emails)}

QUESTION:
{query}

ANSWER (fact-based, concise):
"""

def generate_answer(query, ranked):
    """
    Generate a fact-based answer from the LLM given:
    - A user query
    - Ranked documents (summaries + email chunks)
    """
    # Build the prompt with relevant context
    prompt = build_prompt(query, ranked)

    # Call the OpenAI chat endpoint
    response = client.chat.completions.create(
        model="gpt-4o-mini",               # Model choice
        messages=[{"role": "user", "content": prompt}],  
        temperature=0                      # Deterministic / factual answers
    )

    # Return the assistant’s answer text
    return response.choices[0].message.content
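
When the reranked results contain no thread summary, the `THREAD SUMMARIES:` section of the prompt is left empty. The following is a sketch of one way to make that case explicit to the model, written with plain strings so it runs without an API key. The helpers `format_section` and `split_by_type` and the placeholder wording are hypothetical, and the sample record is invented.

```python
def format_section(title, lines, empty_note="(none retrieved)"):
    # Render a titled block, with an explicit note when there is nothing to show
    body = "\n".join(lines) if lines else empty_note
    return f"{title}:\n{body}"

def split_by_type(ranked):
    # Separate thread summaries from email chunks, tolerating missing metadata
    summaries, emails = [], []
    for doc, meta, _score in ranked:
        if meta.get("doc_type") == "thread_summary":
            summaries.append(doc)
        else:
            header = f"{meta.get('timestamp', '?')} | {meta.get('from', '?')} -> {meta.get('to', '?')}"
            emails.append(f"{header}: {doc}")
    return summaries, emails

# Invented sample: a single email chunk and no summary
ranked = [(
    "Deal closing may slip.",
    {"doc_type": "email", "timestamp": "2002-01-29 11:23:42", "from": "A", "to": "B"},
    -4.05,
)]
summaries, emails = split_by_type(ranked)
prompt = "\n\n".join([
    format_section("THREAD SUMMARIES", summaries),
    format_section("EMAIL EVIDENCE", emails),
])
print(prompt)
```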


In [35]:
queries = [
    "What decisions were made about Q3 marketing strategy?",
    "Was budget approval discussed for Project Atlas?",
    "What timelines were agreed upon for product launch?"
]

In [37]:
# Run the full pipeline for each query: search, rerank, then generate an answer
for q in queries:
    docs, metas = search(q)
    ranked = rerank(q, docs, metas)
    print("\nQUERY:", q)
    print("\nTop 3 Search Results:")
    for i, (doc, meta, score) in enumerate(ranked, 1):
        print(f"[{i}] ({meta['doc_type']}) Score: {score:.2f}")
    print("\nFinal Answer:")
    print(generate_answer(q, ranked))



QUERY: What decisions were made about Q3 marketing strategy?

Top 3 Search Results:
[1] (email) Score: -4.05
[2] (email) Score: -5.25
[3] (email) Score: -7.10

Final Answer:
The emails indicate that there was a discussion regarding the timing of income recognition, with Ben Jacoby seeking input on whether to close a deal as scheduled, which could impact Q2 and Q3. Additionally, Louise Kitchen expressed a preference for a focus on Q4 rather than Q3. However, no definitive decisions about the Q3 marketing strategy were made in the provided emails.

QUERY: Was budget approval discussed for Project Atlas?

Top 3 Search Results:
[1] (thread_summary) Score: -4.20
[2] (email) Score: -5.77
[3] (email) Score: -6.03

Final Answer:
No, budget approval for Project Atlas was not specifically discussed in the email thread. The focus was on the proposal of using a statement of work template for approving major projects and the related budget items for IT projects.

QUERY: What timelines were agreed 