# AI Powered Advisory System
This notebook implements an AI-powered career advisory system designed to provide students with grounded career insights from professional podcast interviews.

## RAG Implementation: Code for Interfacing with Vector DB
The code below sets up and interfaces with a vector database to retrieve relevant "What-To-Be" podcast excerpts for the RAG pipeline.

# Setup background libraries

**You should just run the below cell once.**

Running the below code will install all the necessary packages to query the vector database and perform RAG.


In [72]:
#Do not edit this cell - just run it using the button on the left. It will take some time to run
!pip install openai
!pip install pinecone
!pip install langchain
!pip install langchain_community
!pip install langchain_openai
!pip install langchain_pinecone
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
!pip install --upgrade --quiet langchain-google-community[drive]
!pip install google-auth google-auth-oauthlib google-api-python-client
!pip install PyPDF2
!pip install vertexai
!pip install unstructured



# Setting up Your Database

The code below is how we extract the interview data and put it into our Pinecone Vector Database. Run the code below to setup your database.

In [73]:
#Do not edit this cell - just run it using the button on the left

from google.colab import auth
from langchain.document_loaders import DirectoryLoader
from PyPDF2 import PdfReader

Access this [Google Drive](https://drive.google.com/drive/folders/1suacg1-9-yA1q-VB2tiatJt7Gcp6s_N4?usp=drive_link) that contains the sample interview data. These transcripts will be needed for RAG or you can utilize your own files.

You HAVE TO make a shortcut to this folder(or a folder of your choice that contains the your transcripts) in your own Drive. On the Google Drive Folder:

-   Navigate to the Google Drive folder linked above.
-   Click the **down arrow** next to the folder name (e.g., "Sample Transcripts").
-   Select "Organize" from the menu.
-   Choose "Add Shortcut."
-   From "All locations," select "My Drive" and confirm.

In [74]:
#Do not edit this cell - just run it using the button on the left
#This cell will give the file access to read in the data you just saved.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [75]:
#Do not edit this cell - just run it using the button on the left
# Run this cell and You should see a list of the transcripts that are in the google drive
!ls "/content/drive/My Drive/Sample Transcripts"

 001_KellyOlmstead_LicensedMidwife.txt
'002_RodCaborn_EventProducerFundraiser (1).txt'
 003_SukhSingh_CEOofCodeNaturally.txt
 007_KeishaFrost_CEOofTheUnitedWaySantaCruz.txt
 008_IrvinLemus_EthicalHACKERComputerInformationSystemsInstructor.txt
'011_ Brook Ewoldsen_Fashion Institute Of Design (FIDM) .txt'


In [76]:
#Do not edit this cell - just run it using the button on the left. This cell will take sometime to load and then give you a preview of a transcript
folder_path = "/content/drive/My Drive/Sample Transcripts"

loader = DirectoryLoader(folder_path, glob="**/*.txt")
transcripts = loader.load()
transcripts[0]



In [77]:
#Do not edit this cell - just run it using the button on the left

VALID_INDUSTRY_SECTORS = {
    "Not categorized yet",
    "Architecture and Engineering",
    "Agriculture and Natural Resources",
    "Marketing, Sales, and Service",
    "Building, Trades, and Construction",
    "Energy, Environment, Utilities",
    "Fashion and Interior Design",
    "Manufacturing and Product Development",
    "Education, Child Development, Family Services",
    "Public and Government Services",
    "Finance and Business",
    "Arts, Media, and Entertainment",
    "Information and Computer Technologies",
    "Hospitality, Tourism, Recreation",
    "Health Services, Sciences, Medical Technology"
}

Now that the transcripts are uploaded, we need to chunk them into **manageable** portions. This is crucial for two reasons: to fit within the input limits of embedding models and to improve the granularity of retrieval from the vector database. The code below splits the transcripts by character count and extracts useful metadata (like interviewee name, industry, and source) for later use in Pinecone.

In [78]:
#Do not edit this cell - just run it using the button on the left

from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Adjust chunk size as needed
    chunk_overlap=200  # Adjust overlap as needed
)

# Split each transcript into chunks
chunks = []
for transcript in transcripts:
    file_path = transcript.metadata['source']
    file_name = re.search(r'([^/]+)\.txt', file_path).group(1)
    content = transcript.page_content

    # Extract Interviewee Name
    interviewee_match = re.search(r'Interviewee:\s*([A-Za-z]+\s+[A-Za-z]+)', content)
    interviewee_name = interviewee_match.group(1).strip() if interviewee_match else "Unknown"

    # Extract Industry Sector
    sector_matches = re.search(r'Industry Sectors\s*:\s*([^\n]+?)(?=\s*Takeaways:|#|$)', content)
    industry_sector = []
    if sector_matches:
      extracted_text = sector_matches.group(1)
      for sector in VALID_INDUSTRY_SECTORS:
          if sector in extracted_text:
              industry_sector.append(sector)

    source_match = re.search(r'Source\s*:\s*([^\n]+)', content)
    source = source_match.group(1).strip() if source_match else "Unknown"

    # Split texts
    splits = text_splitter.split_text(content)
    for i, split in enumerate(splits):
        chunks.append({
            "file_name": file_name,
            "chunk_id": i,
            "Interviewee": interviewee_name,
            "Industry Sectors": industry_sector,
            "Source": source,
            "content": split,
        })

In [79]:
#Do not edit this cell - just run it using the button on the left

from langchain.schema import Document

# Convert chunks to LangChain Document format
documents = []
for chunk in chunks:
    doc = Document(
        page_content=chunk["content"],
        metadata={
            "file_name": chunk["file_name"],
            "chunk_id": chunk["chunk_id"],
            "Interviewee": chunk["Interviewee"],
            "Industry Sectors": chunk["Industry Sectors"],
            "Source": chunk["Source"],
        }
    )
    documents.append(doc)

We now want to embed each of our chunks. This is what we'll upload to the database and use to retrieve relevant excerpts. (This may take a few minutes)

In [80]:
#Do not edit this cell - just run it using the button on the left. This cell will take sometime to load
from sentence_transformers import SentenceTransformer
# Use SentenceTransformer model
embedding_model = SentenceTransformer('avsolatorio/GIST-large-Embedding-v0')

# Generate embeddings for each document
embeddings = [embedding_model.encode(doc.page_content) for doc in documents]

In [81]:
#Do not edit this cell - just run it using the button on the left.
embedding_amount = len(embeddings)
embedding_dimension = len(embeddings[0])
print("Amount of Embeddings:", embedding_amount)
print("Embedding Dimensions:", embedding_dimension)

Amount of Embeddings: 264
Embedding Dimensions: 1024


It's time to create our vector database in Pinecone. Ensure your Pinecone API key is securely defined (e.g., in Colab Secrets). Also, choose a unique name for your index to avoid conflicts with existing ones.

Once you run the cell below, a new Pinecone index will be created, and it will begin uploading each of the transcript excerpts. While the upload is in progress, you can monitor the data ingestion by visiting [https://app.pinecone.io/](https://app.pinecone.io/).

In [82]:
from pinecone import Pinecone, ServerlessSpec
import numpy as np

# Access secret Pinecone key
from google.colab import userdata
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)
# Define the index name
index_name = "test-dummy" # Replace with desired name

# Check if the index already exists
if index_name not in pc.list_indexes().names():
    # Create a new index
    pc.create_index(
        name=index_name,
        dimension=embedding_dimension,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

# Connect to the index
index = pc.Index(index_name)

# Function to upsert vectors in batches
def upsert_in_batches(vectors, batch_size=25):
    """Splits upsert operations into batches to avoid exceeding Pinecone's 4MB limit."""
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        try:
            index.upsert(vectors=batch)
        except Exception as e:
            print(f"Error upserting batch {i // batch_size + 1}: {e}")

# Prepare vectors
vectors = []
for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
    metadata = {
        "file_name": doc.metadata.get("file_name", f"file_{i}"),
        "chunk_id": doc.metadata.get("chunk_id", i),
        "Interviewee": doc.metadata.get("Interviewee", "Unknown"),
        "Industry Sectors": doc.metadata.get("Industry Sectors", "wrong"),
        "Source": doc.metadata.get("Source", "Unknown"),
        "content": doc.page_content,
    }

    vectors.append({
        "id": f"chunk_{i}",  # Unique ID for each chunk
        "values": embedding.tolist(),  # Convert numpy array to list
        "metadata": metadata
    })

# Upsert the vectors in batches
upsert_in_batches(vectors, batch_size=50)

print("Upsert completed successfully!")

Upsert completed successfully!


In [83]:
documents[0]

Document(metadata={'file_name': '001_KellyOlmstead_LicensedMidwife', 'chunk_id': 0, 'Interviewee': 'Kelly Olmstead', 'Industry Sectors': [], 'Source': 'https://soundcloud.com/what'}, page_content='Source: https://soundcloud.com/what\n\nto\n\nbe/midwife\n\nkelly\n\nolmstead?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing\n\nInterviewee: Kelly Olmstead Industry Sector: Health Services, Sciences, Medical Technology Takeaways: Early self-discovery and interest exploration: Understanding different career paths, Early self-discovery and interest exploration: Exploring personal interests passions and values, Career decision-making and commitments: Making informed career decisions, Career decision-making and commitments: Taking the leap seizing opportunities and epiphany, Career pivots reconsideration and development: Integrating different passions or interests\n\n# INTERVIEW INTRODUCTION')

## Setup functions

**You should run the below cells in this section once.**

These cells define helper functions crucial for the RAG pipeline: an `API_call` function for interacting with the Gemini LLM, a `CustomEmbeddings` class for embedding queries, and a `parse_query_string` function to process query objects. You only need to run these setup cells once.

In [84]:
#You only need to run this cell without making any changes

import google.generativeai as genai

# Access secret Gemini API key
from google.colab import userdata
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')

# Defining an API_call function, here we are using gemini
def API_call(model, key, prompt):
    """
    Makes a call to the specified Generative AI model.
    Args:
        model (str): The name of the model to use (e.g., "gemini-2.0-flash").
        key (str): The API key for authentication.
        prompt (str): The prompt to send to the model.
    Returns:
        str: The text response from the API.
    Raises:
        ValueError: If the specified model is not supported.
    """
    if model[0:6] == "gemini":
        genai.configure(api_key=key)
        model = genai.GenerativeModel(model)
        response = model.generate_content(prompt)
        return response.text

    else: raise (f"Error, model {model} not found")


In [85]:
#Do not edit this cell - just run it using the button on the left
from sentence_transformers import SentenceTransformer
# Use SentenceTransformer model
embedding_model = SentenceTransformer('avsolatorio/GIST-large-Embedding-v0')

# Custom class to wrap SentenceTransformer for compatibility with LangChain
class CustomEmbeddings:
    def __init__(self, model):
        self.model = model

    def embed_query(self, text):
        return self.model.encode(text).tolist()

    def embed_documents(self, texts):
        return [self.model.encode(text).tolist() for text in texts]

embedding = CustomEmbeddings(embedding_model)

In [86]:
#Do not edit this cell - just run it using the button on the left
import os
from google.colab import userdata

os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY

import ast

def parse_query_string(query_string):
    # Convert the string into a dictionary using ast.literal_eval
    input_dict = ast.literal_eval(query_string)

    # Ensure the fields are of the correct type
    result_dict = {
        "content_string_query": input_dict.get("content_string_query", ""),
        "industry_filter": input_dict.get("industry_filter", []),
    }

    # Convert industry_filter to lists if they are not already
    if not isinstance(result_dict["industry_filter"], list):
        result_dict["industry_filter"] = [result_dict["industry_filter"]]

    return result_dict

# RAG Pipeline

This section outlines the core RAG pipeline. By utilizing the defined tasks and the `RAG_Pipeline` function, users can submit a query and receive a grounded response that leverages the relevant documents uploaded to Pinecone as contextual information.

# Task 1: Creating Query Objects

Turning natural language query string to a query object. We do this to optimize the retrieval process for the vector database. To accomplish this, I have created a prompt such that given a natural language query string it is converted to a query object.

## Example of a query object:
query_string = '''
{
  "content_string_query": "“You said you jumped straight from a 10-week EMT course into a year-and-a-half of paramedic school without first working on an ambulance. Looking back, what made that intense route worth it, and what would you advise teens who are eager to ‘sprint’ into the fire service today?”?",
  "industry_filter": ["Career"],
}
'''


In [87]:
def cleanjson(response_str):
        cleaned = response_str.strip()
        if cleaned.startswith("```json"):
            cleaned = cleaned[7:]
        elif cleaned.startswith("```"):
            cleaned = cleaned[3:]
        if cleaned.endswith("```"):
            cleaned = cleaned[:-3]
        return cleaned.strip()

def format_query(query, valid_industries):
  # Convert the set of valid industries to a comma-separated string for the prompt
  industries_str = ", ".join(sorted(list(valid_industries))) # Sort for consistent order

  query_prompt = f'''
    SYSTEM ROLE: You are an **Expert RAG-Query Architect**.
    Your task is to translate one natural-language career question (user_query) into a JSON query object that a vector database can consume.

    INPUTS (always provided in this order):
      1. user_query: A single question written by a student.
      2. industries (optional): An array of predefined industry names.

    OUTPUT:
      Return **only** a JSON object with double quotes, no markdown formatting, and no backticks.

    SCHEMA:
      {{
        "content_string_query": string,    // condensed, information-rich reformulation of user_query for embedding
        "industry_filter": string[]       // OPTIONAL: array of 1–3 industry names from the input list, included only if the question clearly targets them
      }}

    GUIDELINES:
      • Use standard JSON syntax with double quotes.
      • Do NOT wrap the output in markdown backticks or use ```json.
      • Preserve all essential meaning; do not paraphrase away domain-specific language.
      • Write in a precise, professional tone (avoid conversational filler).
      • Only include industry_filter if the query clearly targets specific industries.
    Here is the query:
      {query}
    Predefined industries available for filtering: {industries_str}
    '''
  format_query = API_call("gemini-2.0-flash", GEMINI_API_KEY, query_prompt)
  return cleanjson(format_query)

# Task 2: Querying the Vector Database (Retrieval Step)

This section defines the function responsible for querying the Pinecone vector database. When executed, this function will parse the generated query object and retrieve the top 10 most relevant documents from the database. These retrieved documents will then serve as the contextual information for the final LLM response generation.

In [94]:
#Do not edit this cell - just run it using the button on the left
import re
import pprint
from langchain_pinecone import PineconeVectorStore
from langchain_community.chat_models import ChatOpenAI
from langchain.vectorstores import Pinecone
from langchain_openai import OpenAIEmbeddings

# Define function for querying response
def query_response(parsed_dict, top_k = 10):
    # Extract the filters from the parsed dictionary
    content_string_query = parsed_dict.get("content_string_query", None)
    industry_filter = parsed_dict.get("industry_filter", [])

    # Embed user query
    vector=embedding.embed_query(content_string_query)
    # Check if either filter is non-empty
    if len(industry_filter) > 0:
        return index.query(
            vector=vector,
            top_k=top_k,
            include_values=True,
            include_metadata=True,
            filter={
                'Industry Sectors': { "$in": industry_filter },
            }
        )
    elif len(industry_filter) > 0:
        return index.query(
            vector=vector,
            top_k=top_k,
            include_values=True,
            include_metadata=True,
            filter={
                'Industry Sectors': { "$in": industry_filter },
            }
        )
    else:
        return index.query(
            vector=vector,
            top_k=top_k,
            include_values=True,
            include_metadata=True,
        )

def format_documents(documents):
    formatted_documents = []

    for doc in documents['matches']:
        formatted_doc = {
            "Passage": doc['metadata']['content'],
            "Interviewee": doc['metadata']['Interviewee'],
            "Industry Sectors": doc['metadata']['Industry Sectors'],
            "Source": doc['metadata']['Source'],
        }
        formatted_documents.append(formatted_doc)

    return formatted_documents

# Task 3: Generating the user response

This final step of the RAG pipeline involves designing a prompt to synthesize the original user query with the retrieved contextual information, ultimately generating a helpful and grounded response for the student.



In [89]:
def querying_db(query, retrieved_context):
    rag_prompt = f'''

    SYSTEM ROLE:
    You are a career guidance AI designed to support high school students by offering authentic, experience-based advice. You ground your responses in real insights from professionals featured on the "What To Be" podcast.

    TASK:
    Using the student’s career question and retrieved context from podcast transcripts, craft a thoughtful, personalized response rooted in the lived experiences and reflections of the professionals interviewed.

    INPUTS:
    1. User Query: {query}
    2. Retrieved Context: {retrieved_context}

    OUTPUT INSTRUCTIONS:
    • Begin with a warm, empathetic acknowledgment of the student's question and concerns.
    • Share relevant experiences, stories, or quotes from professionals in the retrieved context.
    • Use a supportive, conversational tone—like a trusted mentor offering guidance.
    • Reference specific professionals by name and highlight their real experiences.
    • Prioritize concrete details over generic advice—anchor your response in the lived realities from the transcripts.
    • Provide actionable takeaways the student can use or reflect on.
    • End with links to the podcast episodes so students can explore further. Format them as:
    "To hear more from [Name], check out their interview at [Source URL]"

    REQUIREMENTS:
    • Use **only** the retrieved context. Do not fabricate details or speculate beyond it.
    • Ensure your response is rich in content and insight, grounded in what professionals actually said.
    • Maintain a tone that is honest, encouraging, and realistic—acknowledge that career paths can be non-linear and challenging.
    • Avoid vague or generalized advice—focus on meaningful takeaways drawn directly from the context.
    '''
    db_response = API_call("gemini-2.0-flash", GEMINI_API_KEY, rag_prompt)

    return db_response

# Final RAG Pipeline function
This function integrates all the preceding steps into a cohesive pipeline. It takes a user query, formats it, retrieves relevant contextual information from the vector database, and then prompts the LLM to generate a final, grounded response. This response will contain career insights directly extracted from the podcast documents parsed from Google Drive.


In [92]:
def RAG_Pipeline(query):
    query_string = format_query(query, VALID_INDUSTRY_SECTORS)

    # Parsing query string
    parsed_dict = parse_query_string(query_string)

    # Retrieve top-k relevant docs - default is 10
    response = query_response(parsed_dict)

    # Format
    retrieved_context = format_documents(response)

    #Query the DB with user query and retrieved context as arguments
    db_response = querying_db(query_string, retrieved_context)
    print("-"*47)
    print("FINAL USER RESPONSE:")
    print(db_response)
    print("-"*47)

    return db_response

# Testing the RAG pipeline
To utilize the RAG pipeline function, simply insert a query below that is relevant to the data stored in the documents. Lastly, run the cell to see the printed output!

In [93]:
#Test your pipeline here for your user queries
query = "How can young adults explore different career avenues with limited resources?"
response = RAG_Pipeline(query)

-----------------------------------------------
FINAL USER RESPONSE:
Okay, I understand you're looking for ways to explore different career options, especially when resources are limited. It's a great question and something many young people think about!

From the "What To Be" podcast, I've heard some really insightful stories that might help.

**Leveraging Local Resources & Opportunities:**

*   **Keisha Frost, CEO:** Keisha's story is a powerful example of how community involvement can shape a career. Growing up, she was exposed to various activities and organizations like Girls Scouts, track and field, and tutoring through the United Way's initiatives in her school. This exposure not only kept her busy and connected but also paved the way for her future career path, eventually leading her back to the United Way as a CEO. Keisha's experience highlights how local community programs can offer valuable experiences and connections, even with limited resources.

**Exploring Interests & Va