<a href="https://colab.research.google.com/github/violetxs16/TIM-175/blob/main/Copy_of_TIM175_RAG_Implementation_and_Querying.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TIM175 RAG Implementation: Code for Interfacing with Vector DB
The code below can be used to set up and interface with the vector DB to retrieve relevant documents (What-To-Be podcast excerpts) for the RAG pipeline you are implementing in your assignment.


Reminder: Change the runtime to T4 GPU if you have not already done so.

Happy RAGing!

=============

# Setup background libraries

**You should just run the below cell once. You should NOT modify any code.**

Running the below code will install all the necessary packages to query the vector database and perform RAG.


In [None]:
%%capture
#Do not edit this cell - just run it using the button on the left. It will take some time to run
!pip install openai
!pip install pinecone
!pip install langchain
!pip install langchain_community
!pip install langchain_openai
!pip install langchain_pinecone
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
!pip install --upgrade --quiet langchain-google-community[drive]
!pip install google-auth google-auth-oauthlib google-api-python-client
!pip install PyPDF2
!pip install vertexai
!pip install unstructured

# Task 1 Setting up Your Database

The code below is how we extracted the interview data and put it into our Pinecone Vector Database. For this assignment, you will be doing the same. Run the code below to set up your database.

In [None]:
#Do not edit this cell - just run it using the button on the left

from google.colab import auth
from langchain_community.document_loaders import DirectoryLoader
from PyPDF2 import PdfReader

Access this [Google Drive](https://drive.google.com/drive/folders/1SOX-9F7HMUvzo7VIIBuqk0CU5hgMAoJE?usp=drive_link) that contains the interview data. These transcripts will be needed for RAG.

You HAVE TO make a shortcut to this folder in your own Drive. On the Google Drive Folder:

- Click the dropdown arrow (next to the name of the folder)
- Click Organize
- Click Add Shortcut
- Then from All locations choose My Drive

In [None]:
#Do not edit this cell - just run it using the button on the left
#This cell will give the file access to read in the data you just saved.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Do not edit this cell - just run it using the button on the left
# Run this cell and you should see a list of the transcripts that are in the Google Drive folder
!ls "/content/drive/My Drive/Cleaned Transcripts Latest"

'0019_Jose Ibarra_Resident Services Manager.txt'
 001_KellyOlmstead_LicensedMidwife.txt
'0020_Lauren Donnelley Crocker_Development Director Seymour Marine Discovery Center (1).txt'
'002_RodCaborn_EventProducerFundraiser (1).txt'
 003_SukhSingh_CEOofCodeNaturally.txt
'004_ChrisCottinghamGiannaVani_OwnerAndAdminAssistant365Producer(MusicProductionCompany).txt'
 007_KeishaFrost_CEOofTheUnitedWaySantaCruz.txt
 008_IrvinLemus_EthicalHACKERComputerInformationSystemsInstructor.txt
'011_ Brook Ewoldsen_Fashion Institute Of Design (FIDM) .txt'
'012 _ Rob Gaukel_Chief Operating Officer of Verve Coffee (1) (1).txt'
 021_AmandaBird_WastewaterTreatmentOperator.txt
 025_AkiWilliams_ChiefOperationsOfficerDefibThisFlightNurse.txt
'039_RachelRassmussen_HRDirector&BaristaatCatandCloudCoffee.txt'
 040_MurrySchekman_RetiredK-12AdministratorFull-TImeLectureratSanJoseStateUniversity.txt
'041_TonyNunez_ManagingEditorAtThePajaronian 1 (1).txt'
 042_GavinClark_MasterOfBiophysicsMDStudentAtGeorgetownUniversity.

In [None]:
#Do not edit this cell - just run it using the button on the left. This cell will take some time to load and then give you a preview of a transcript
folder_path = "/content/drive/My Drive/Cleaned Transcripts Latest"

loader = DirectoryLoader(folder_path, glob="**/*.txt")
transcripts = loader.load()
transcripts[0]

Document(metadata={'source': '/content/drive/My Drive/Cleaned Transcripts Latest/166_DanteSearcy_HRGeneralist.txt'}, page_content="Interviewee: Dante Searcy Industry Sectors: Public and Government Services Takeaways: Early self-discovery and interest exploration: Understanding different career paths, Early self-discovery and interest exploration: Recognizing personal strengths and weaknesses, Skill development education and deep exploration: Real-life exposure and skill development, Career pivots reconsideration and development: Integrating different passions or interests, Career pivots reconsideration and development: Finding enjoyment fulfillment and purpose in work Source: https://soundcloud.com/what-to-be/dante-searcy-joby-aviation-human-reasources-generalist?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing # INTRODUCTION Interviewer  0:16 Hello and welcome everyone to KSQD Santa Cruz at 90.7 FM. I'm Patrick Hart and you're listening to our show What To Be, where we

In [None]:
#Do not edit this cell - just run it using the button on the left

VALID_INDUSTRY_SECTORS = {
    "Not categorized yet",
    "Architecture and Engineering",
    "Agriculture and Natural Resources",
    "Marketing, Sales, and Service",
    "Building, Trades, and Construction",
    "Energy, Environment, Utilities",
    "Fashion and Interior Design",
    "Manufacturing and Product Development",
    "Education, Child Development, Family Services",
    "Public and Government Services",
    "Finance and Business",
    "Arts, Media, and Entertainment",
    "Information and Computer Technologies",
    "Hospitality, Tourism, Recreation",
    "Health Services, Sciences, Medical Technology"
}

Now that we have the transcripts loaded, we want to chunk them into manageable portions and extract some useful metadata. The code below simply splits them up by a set number of characters, but think about other ways you could approach this.
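As one possible alternative to fixed-size character splitting, the sketch below groups whole sentences into chunks and repeats the last sentence of each chunk at the start of the next. The function name `chunk_by_sentences` and its parameters are hypothetical (they are not part of LangChain or this assignment), and the sentence splitter is deliberately naive.

```python
import re

def chunk_by_sentences(text, max_chars=1000, overlap_sentences=1):
    """Group whole sentences into chunks that aim for about max_chars characters.

    The last `overlap_sentences` sentences of a finished chunk are repeated at
    the start of the next one, so a chunk can run slightly over max_chars.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample_text = "One two. Three four. Five six. Seven eight."
for piece in chunk_by_sentences(sample_text, max_chars=20, overlap_sentences=1):
    print(piece)
```

Splitting on sentence boundaries keeps each excerpt readable on its own, at the cost of less uniform chunk sizes. For interview transcripts, splitting on speaker turns would be another option worth trying.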

In [None]:
#Do not edit this cell - just run it using the button on the left

from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Adjust chunk size as needed
    chunk_overlap=200  # Adjust overlap as needed
)

# Split each transcript into chunks
chunks = []
for transcript in transcripts:
    file_path = transcript.metadata['source']
    file_name = re.search(r'([^/]+)\.txt', file_path).group(1)
    content = transcript.page_content

    # Extract Interviewee Name
    interviewee_match = re.search(r'Interviewee:\s*([A-Za-z]+\s+[A-Za-z]+)', content)
    interviewee_name = interviewee_match.group(1).strip() if interviewee_match else "Unknown"

    # Extract Industry Sector
    sector_matches = re.search(r'Industry Sectors\s*:\s*([^\n]+?)(?=\s*Takeaways:|#|$)', content)
    industry_sector = []
    if sector_matches:
      extracted_text = sector_matches.group(1)
      for sector in VALID_INDUSTRY_SECTORS:
          if sector in extracted_text:
              industry_sector.append(sector)

    source_match = re.search(r'Source\s*:\s*([^\n]+)', content)
    source = source_match.group(1).strip() if source_match else "Unknown"

    # Split texts
    splits = text_splitter.split_text(content)
    for i, split in enumerate(splits):
        chunks.append({
            "file_name": file_name,
            "chunk_id": i,
            "Interviewee": interviewee_name,
            "Industry Sectors": industry_sector,
            "Source": source,
            "content": split,
        })

# Print the second chunk (index 1) as an example
print(chunks[1])

{'file_name': '166_DanteSearcy_HRGeneralist', 'chunk_id': 1, 'Interviewee': 'Dante Searcy', 'Industry Sectors': ['Public and Government Services'], 'Source': "https://soundcloud.com/what-to-be/dante-searcy-joby-aviation-human-reasources-generalist?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing # INTRODUCTION Interviewer  0:16 Hello and welcome everyone to KSQD Santa Cruz at 90.7 FM. I'm Patrick Hart and you're listening to our show What To Be, where we interview inspiring people and highlight their careers. What To Be is a program provided by Your Future Is Our Business, a Santa Cruz County nonprofit that helps students explore careers through programs such as college and career expos, panels and other work based learning activities.", 'content': "I'm Patrick Hart and you're listening to our show What To Be, where we interview inspiring people and highlight their careers. What To Be is a program provided by Your Future Is Our Business, a Santa Cruz County nonprofit th
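In the sample chunk above, the `Source` value runs past the URL into the transcript introduction. That is consistent with the loaded text having no line break after the URL, so `[^\n]+` keeps matching (an assumption based on the printed output, not something verified against the files). A minimal sketch of a tighter pattern, using a hypothetical helper `extract_source_url` and a shortened copy of the header shown earlier:

```python
import re

def extract_source_url(content):
    """Return only the first whitespace-free token after 'Source:', or 'Unknown'."""
    # \S+ stops at the first space, so trailing transcript text is not captured.
    match = re.search(r'Source\s*:\s*(\S+)', content)
    return match.group(1) if match else "Unknown"

# Shortened sample header, adapted from the transcript preview in this notebook.
sample = (
    "Interviewee: Dante Searcy Industry Sectors: Public and Government Services "
    "Takeaways: Early self-discovery and interest exploration: Understanding different career paths "
    "Source: https://soundcloud.com/what-to-be/dante-searcy-joby-aviation-human-reasources-generalist"
    "?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing "
    "# INTRODUCTION Interviewer  0:16 Hello and welcome everyone"
)
print(extract_source_url(sample))
```

This relies on the URL containing no spaces, which holds for the sample but may not hold for every transcript.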

In [None]:
#Do not edit this cell - just run it using the button on the left

from langchain.schema import Document

# Convert chunks to LangChain Document format
documents = []
for chunk in chunks:
    doc = Document(
        page_content=chunk["content"],
        metadata={
            "file_name": chunk["file_name"],
            "chunk_id": chunk["chunk_id"],
            "Interviewee": chunk["Interviewee"],
            "Industry Sectors": chunk["Industry Sectors"],
            "Source": chunk["Source"],
        }
    )
    documents.append(doc)

documents[0]

Document(metadata={'file_name': '166_DanteSearcy_HRGeneralist', 'chunk_id': 0, 'Interviewee': 'Dante Searcy', 'Industry Sectors': ['Public and Government Services'], 'Source': "https://soundcloud.com/what-to-be/dante-searcy-joby-aviation-human-reasources-generalist?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing # INTRODUCTION Interviewer  0:16 Hello and welcome everyone to KSQD Santa Cruz at 90.7 FM. I'm Patrick Hart and you're listening to our show What To Be, where we interview inspiring people and highlight their careers. What To Be is a program provided by Your Future Is Our Business, a Santa Cruz County nonprofit that helps students explore careers through programs such as college and career expos, panels and other work based learning activities."}, page_content="Interviewee: Dante Searcy Industry Sectors: Public and Government Services Takeaways: Early self-discovery and interest exploration: Understanding different career paths, Early self-discovery and interes

We now want to embed each of our chunks. These embeddings are what we'll upload to the database and use to retrieve relevant excerpts. (This may take a few minutes.)
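To make the retrieval idea concrete, here is a toy sketch of ranking by cosine similarity, the same metric the index is created with later in this notebook. The three-dimensional vectors and the helper names `cosine_similarity` and `top_k_indices` are invented for illustration; real embeddings from the model above have 1024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_indices(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]
print(top_k_indices(toy_query, toy_docs, k=2))
```

A vector database does the same kind of ranking, but with approximate search structures so it does not have to compare the query against every stored vector.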

In [None]:
#Do not edit this cell - just run it using the button on the left. This cell will take some time to load
from sentence_transformers import SentenceTransformer
# Use SentenceTransformer model
embedding_model = SentenceTransformer('avsolatorio/GIST-large-Embedding-v0')

# Generate embeddings for each document
embeddings = [embedding_model.encode(doc.page_content) for doc in documents]


In [None]:
#Do not edit this cell - just run it using the button on the left.
embedding_amount = len(embeddings)
embedding_dimension = len(embeddings[0])
print("Amount of Embeddings:", embedding_amount)
print("Embedding Dimensions:", embedding_dimension)

Amount of Embeddings: 2390
Embedding Dimensions: 1024


It's time to create our database. Make sure you have defined your Pinecone API key below. You can use the same key as in the pre-lab. Also be sure to set a name for your index. It can be anything you want; just make sure it is different from any of your current indexes.

Once you run the cell below, you will have created your own Pinecone index, and it will gradually upload each of the transcript excerpts. While this is running, go to https://app.pinecone.io/ and you should see the transcripts as they come in!
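The upload happens in batches so that no single request gets too large. The sketch below shows one way to batch by both record count and an estimated payload size. The helper `iter_batches` is hypothetical, and the `max_bytes` default is a placeholder chosen for illustration rather than a documented Pinecone limit, so check the current Pinecone documentation before relying on a specific number.

```python
import json

def iter_batches(records, max_records=50, max_bytes=2_000_000):
    """Yield lists of records that stay under a record count and a rough byte budget."""
    batch, batch_bytes = [], 0
    for record in records:
        # Estimate the size of the record as it would be serialized to JSON.
        record_bytes = len(json.dumps(record).encode("utf-8"))
        if batch and (len(batch) >= max_records or batch_bytes + record_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += record_bytes
    if batch:
        yield batch

# Toy records shaped like the vectors built in this notebook (values shortened).
toy_records = [
    {"id": f"chunk_{i}", "values": [0.0] * 4, "metadata": {"content": "x"}}
    for i in range(120)
]
print([len(batch) for batch in iter_batches(toy_records, max_records=50)])
```

Budgeting by bytes as well as by count matters here because each record carries the full chunk text in its metadata, so record sizes vary.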

In [None]:
# You must edit this cell by replacing "YOUR API KEY" with your Pinecone API key you made during PreLab 7.
#You also want to add a name for your database instead of "YOUR INDEX NAME"

from pinecone import Pinecone, ServerlessSpec
import numpy as np

# Initialize Pinecone
PINECONE_API_KEY = "YOUR API KEY"  # Replace with your own Pinecone API key; avoid sharing a notebook that contains a real key
pc = Pinecone(api_key=PINECONE_API_KEY)
# Define the index name
index_name = "violeta-solorio"

# Check if the index already exists
if index_name not in pc.list_indexes().names():
    # Create a new index
    pc.create_index(
        name=index_name,
        dimension=embedding_dimension,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

# Connect to the index
index = pc.Index(index_name)

# Function to upsert vectors in batches
def upsert_in_batches(vectors, batch_size=25):
    """Splits upsert operations into batches to stay under Pinecone's request size limit."""
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        try:
            index.upsert(vectors=batch)
        except Exception as e:
            print(f"Error upserting batch {i // batch_size + 1}: {e}")

# Prepare vectors
vectors = []
for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
    metadata = {
        "file_name": doc.metadata.get("file_name", f"file_{i}"),
        "chunk_id": doc.metadata.get("chunk_id", i),
        "Interviewee": doc.metadata.get("Interviewee", "Unknown"),
        "Industry Sectors": doc.metadata.get("Industry Sectors", "wrong"),
        "Source": doc.metadata.get("Source", "Unknown"),
        "content": doc.page_content,
    }

    vectors.append({
        "id": f"chunk_{i}",  # Unique ID for each chunk
        "values": embedding.tolist(),  # Convert numpy array to list
        "metadata": metadata
    })

# Upsert the vectors in batches
upsert_in_batches(vectors, batch_size=50)

print("Upsert completed successfully!")

Upsert completed successfully!



## Setup functions

**You should just run the below cell once. You should NOT modify any code.**

Running the below code will create helper functions that will be used to clean your query object string. You only need to run this code once (you don't need to run it for every query you test).

In [None]:
#You only need to run this cell without making any changes

import google.generativeai as genai

# Define an API_call function; here we use Gemini because it is free!
def API_call(model, GOOGLE_API_KEY, prompt):
    """
    Send a prompt to the chosen model's API.
    Takes the model name (used to choose which API), an API key, and the prompt.
    Returns the API response text (str).
    """
    if model.startswith("gemini"):
        genai.configure(api_key=GOOGLE_API_KEY)
        client = genai.GenerativeModel(model)
        response = client.generate_content(prompt)
        return response.text

    # A bare string cannot be raised in Python 3, so raise a real exception
    raise ValueError(f"Error, model {model} not found")


In [None]:
#You only need to run this cell without making any changes. If you did not set this up during the PreLab, see the PreLab for how to add your token to the Colab secrets.
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GEMINI_TOKEN')

In [None]:
#Do not edit this cell - just run it using the button on the left
from sentence_transformers import SentenceTransformer
# Use SentenceTransformer model
embedding_model = SentenceTransformer('avsolatorio/GIST-large-Embedding-v0')

# Custom class to wrap SentenceTransformer for compatibility with LangChain
class CustomEmbeddings:
    def __init__(self, model):
        self.model = model

    def embed_query(self, text):
        return self.model.encode(text).tolist()

    def embed_documents(self, texts):
        return [self.model.encode(text).tolist() for text in texts]

embedding = CustomEmbeddings(embedding_model)

In [None]:
#Do not edit this cell - just run it using the button on the left
import os
from google.colab import userdata

os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY

import ast

def parse_query_string(query_string):
    # Convert the string into a dictionary using ast.literal_eval
    input_dict = ast.literal_eval(query_string)

    # Ensure the fields are of the correct type
    result_dict = {
        "content_string_query": input_dict.get("content_string_query", ""),
        "industry_filter": input_dict.get("industry_filter", []),
    }

    # Convert industry_filter to lists if they are not already
    if not isinstance(result_dict["industry_filter"], list):
        result_dict["industry_filter"] = [result_dict["industry_filter"]]

    return result_dict

# Task 2 - Creating User Queries



Create 5 User Queries as **explained in the Google Document**. Once you have them written, place one in the cell below. We will use this as a test query as we create our RAG pipeline. Once we have finished the pipeline you can test the rest of your queries at the end.

In [None]:
#You need to edit this cell by replacing "ADD YOUR QUERY HERE" with your first query.
query = "“You said you jumped straight from a 10-week EMT course into a year-and-a-half of paramedic school without first working on an ambulance. Looking back, what made that intense route worth it, and what would you advise teens who are eager to ‘sprint’ into the fire service today?”"

# Task 3 - Creating Query Objects

You need to write and run a prompt that will take your query and make a query object based on the instructions in the Google Document. The goal is to turn the natural language query into a formatted query_string, as seen below:

In [None]:
## THIS IS AN EXAMPLE OF WHAT A QUERY OBJECT LOOKS LIKE ##
query_string = '''
{
  "content_string_query": "“You said you jumped straight from a 10-week EMT course into a year-and-a-half of paramedic school without first working on an ambulance. Looking back, what made that intense route worth it, and what would you advise teens who are eager to ‘sprint’ into the fire service today?”",
  "industry_filter": ["Public and Government Services"],
}
'''
##########################################################

Create your own prompt for creating a query object. It's important that the formatting is precise here, as we will use the result to query the database.
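Because the query object is used directly as a database filter, it can help to validate it before querying. The sketch below parses the model's output as JSON, checks the query text, and drops any industry that is not in an allowed set. The function `validate_query_object` and the small `allowed` set are hypothetical; in this notebook the full set would be `VALID_INDUSTRY_SECTORS`. Note that `json.loads` is stricter than the `ast.literal_eval` parser defined earlier (for example, it rejects trailing commas).

```python
import json

def validate_query_object(raw, allowed_sectors):
    """Parse a JSON query object and keep only industries found in allowed_sectors.

    Returns (validated_object, dropped_industries).
    """
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("query object must be a JSON object")
    text = obj.get("content_string_query")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("content_string_query must be a non-empty string")
    requested = obj.get("industry_filter", [])
    if isinstance(requested, str):
        requested = [requested]
    kept = [sector for sector in requested if sector in allowed_sectors]
    dropped = [sector for sector in requested if sector not in allowed_sectors]
    return {"content_string_query": text, "industry_filter": kept}, dropped

# Illustrative subset of the valid sectors listed earlier in this notebook.
allowed = {"Public and Government Services", "Health Services, Sciences, Medical Technology"}
raw = '{"content_string_query": "Advice for teens entering the fire service", "industry_filter": ["Fire Service", "Public and Government Services"]}'
validated, dropped = validate_query_object(raw, allowed)
print(validated["industry_filter"], dropped)
```

Reporting the dropped values, rather than silently discarding them, makes it easier to see when the prompt needs to be tightened.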

In [None]:
prompt = f"""
SYSTEM ROLE: You are an Expert RAG-Query Architect.

TASK:
Your job is to convert a student’s natural-language career question (user_query) into a structured JSON query object for use in a vector database.

INPUTS (always provided in this order):

user_query: A single question written by a student.

industries (optional): An array of predefined industry names.

OUTPUT:
Return only a JSON object using valid JSON syntax (double quotes, no markdown, no backticks).
The object must follow this schema:

{{
  "content_string_query": string,      // An information-dense, professional reformulation of the original question for embedding
  "industry_filter": string[]          // OPTIONAL: Include 1–3 industries from the input list if the question clearly targets them
}}
GUIDELINES:
• Use precise, information-rich language.
• Maintain the original meaning and terminology, especially domain-specific terms.
• Use a professional tone—no filler, conversational phrasing, or stylistic noise.
• Only include industry_filter if the question clearly references specific industries.
• Do not wrap the output in backticks, code blocks, or markdown.

BEGIN PROCESSING.
user_query:
{query}

industries:
{sorted(VALID_INDUSTRY_SECTORS)}
"""



In [None]:
def cleanjson(response_str):
    cleaned = response_str.strip()
    if cleaned.startswith("```json"):
        cleaned = cleaned[7:]
    elif cleaned.startswith("```"):
        cleaned = cleaned[3:]
    if cleaned.endswith("```"):
        cleaned = cleaned[:-3]
    return cleaned.strip()

# Example usage:
response_text = API_call("gemini-2.0-flash", GOOGLE_API_KEY, prompt)
print(response_text)
query_string = cleanjson(response_text)
print("RESPONSE:\n", query_string)


```json
{
  "content_string_query": "Advice for teens considering accelerated EMT/paramedic training programs and direct entry into fire service, reflecting on the value of skipping ambulance experience.",
  "industry_filter": [
    "Fire Service"
  ]
}
```
RESPONSE:
 {
  "content_string_query": "Advice for teens considering accelerated EMT/paramedic training programs and direct entry into fire service, reflecting on the value of skipping ambulance experience.",
  "industry_filter": [
    "Fire Service"
  ]
}


# Task 4: Run code for querying from the vector DB

**This step does NOT require modifying the notebook. It only requires Running a few cells.**

Now that we can turn natural language text into a query_string, we are ready to actually query the database you made in task 1! Make sure your query_string looks correct before running the below cells.

First, we just define some useful functions to make our lives easier:

In [None]:
#Do not edit this cell - just run it using the button on the left
import re
import pprint

# Define function for querying response
def query_response(parsed_dict, top_k = 10):
    # Extract the query text and the filter from the parsed dictionary
    content_string_query = parsed_dict.get("content_string_query", "")
    industry_filter = parsed_dict.get("industry_filter", [])

    # Embed user query
    vector = embedding.embed_query(content_string_query)

    query_kwargs = {
        "vector": vector,
        "top_k": top_k,
        "include_values": True,
        "include_metadata": True,
    }
    # Apply the metadata filter only when at least one industry was requested
    if len(industry_filter) > 0:
        query_kwargs["filter"] = {
            'Industry Sectors': { "$in": industry_filter },
        }
    return index.query(**query_kwargs)

def format_documents(documents):
    formatted_documents = []

    for doc in documents['matches']:
        formatted_doc = {
            "Passage": doc['metadata']['content'],
            "Interviewee": doc['metadata']['Interviewee'],
            "Industry Sectors": doc['metadata']['Industry Sectors'],
            "Source": doc['metadata']['Source'],
        }
        formatted_documents.append(formatted_doc)

    return formatted_documents
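The metadata filter used in `query_response` only returns chunks whose `Industry Sectors` list shares at least one value with the requested industries. The sketch below imitates that behavior locally on toy records. Both helpers are hypothetical, and `matches` is only an approximation of how `$in` is understood to apply to list-valued metadata; it is not Pinecone code.

```python
def build_filter(industry_filter):
    """Return a Pinecone-style metadata filter, or None when no filter applies."""
    if not industry_filter:
        return None
    return {"Industry Sectors": {"$in": list(industry_filter)}}

def matches(metadata, flt):
    """Toy local check: a record passes if any of its sectors is in the $in list."""
    if flt is None:
        return True
    wanted = set(flt["Industry Sectors"]["$in"])
    return any(sector in wanted for sector in metadata.get("Industry Sectors", []))

toy_records = [
    {"Industry Sectors": ["Public and Government Services"]},
    {"Industry Sectors": ["Finance and Business", "Information and Computer Technologies"]},
]

for requested in (["Finance and Business"], ["Fire Service"], []):
    flt = build_filter(requested)
    print(requested, [matches(record, flt) for record in toy_records])
```

Under this reading, a filter value that is not one of the stored sector names (such as "Fire Service") matches nothing, which would be consistent with an empty result list even when relevant transcripts exist.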

This cell will now parse your query_string and retrieve the top 4 documents from the database! Read through them to see how well it did.

In [None]:
#Do not edit this cell - just run it using the button on the left
# Convert query string to dictionary
parsed_dict = parse_query_string(query_string)

# Retrieve top-k relevant docs (k=4)
response = query_response(parsed_dict, top_k=4)

# Print out the response header
print("-"*47)
pprint.pp("Top 4 Most Relevant Excerpts:")
print("-"*47)

# Format the documents
formatted_documents = format_documents(response)

# Print out the formatted documents
for doc in formatted_documents:
  pprint.pp(doc)
  pprint.pp("-"*47)

retrieved_context = formatted_documents

-----------------------------------------------
'Top 4 Most Relevant Excerpts:'
-----------------------------------------------


# Task 5: Design a prompt that responds to users

Now that we can go from *user query -> query_string -> retrieved context*, it's time to finish off the RAG pipeline by using that context in the LLM's response.

Below we have outlined a simple function that wraps up all the steps into one. You should copy your prompt from task 3 here where prompted. Then you must write a new prompt that will take the original query as well as the retrieved context and provide a useful response for the user.
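As a rough illustration of combining the query and the retrieved context into one prompt, the sketch below formats each excerpt on its own line. The helper `build_rag_prompt` and its wording are hypothetical; the dictionary keys follow the output of `format_documents`. One detail to watch in f-strings: single braces interpolate a variable, while doubled braces produce literal brace characters.

```python
def build_rag_prompt(query, retrieved_context):
    """Combine the user query and retrieved excerpts into a single prompt string."""
    context_lines = []
    for number, doc in enumerate(retrieved_context, start=1):
        context_lines.append(
            f"[{number}] {doc['Interviewee']} ({doc['Source']}): {doc['Passage']}"
        )
    context_block = "\n".join(context_lines) if context_lines else "(no excerpts retrieved)"
    # Single braces insert the value; f"{{query}}" would insert the text "{query}".
    return (
        "Answer the student's question using only the excerpts below.\n"
        f"Question: {query}\n"
        f"Excerpts:\n{context_block}\n"
    )

toy_context = [
    {"Passage": "Example passage text.", "Interviewee": "Example Person", "Source": "https://example.com/episode"},
]
print(build_rag_prompt("How do I get started?", toy_context))
```

Printing the assembled prompt before sending it to the model is a quick way to confirm that the query and the excerpts actually made it into the text.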

In [None]:
#You need to edit this cell by adding your prompts here for both Task 3 and Task 5
def RAG_Pipeline(query):
    print(f"PIPELINE OUTPUTS FOR: {query}")

    #Step 1. Convert query string to query object (From Task 3)
    query_prompt = f'''
SYSTEM ROLE: You are an **Expert RAG-Query Architect**.
Your task is to translate one natural-language career question (user_query) into a JSON query object that a vector database can consume.

INPUTS (always provided in this order):
1. user_query: A single question written by a student.
2. industries (optional): An array of predefined industry names.

OUTPUT:
Return **only** a JSON object with double quotes, no markdown formatting, and no backticks.

SCHEMA:
{{
  "content_string_query": string,    // condensed, information-rich reformulation of user_query for embedding
  "industry_filter": string[]       // OPTIONAL: array of 1–3 industry names from the input list, included only if the question clearly targets them
}}

GUIDELINES:
• Use standard JSON syntax with double quotes.
• Do NOT wrap the output in markdown backticks or use ```json.
• Preserve all essential meaning; do not paraphrase away domain-specific language.
• Write in a precise, professional tone (avoid conversational filler).
• Only include industry_filter if the query clearly targets specific industries.
Here is the query:
{query}

industries:
{sorted(VALID_INDUSTRY_SECTORS)}
        '''

    query_string1 = API_call("gemini-2.0-flash", GOOGLE_API_KEY, query_prompt)

    print("-"*47)
    print("QUERY STRING GENERATED:")
    print(query_string1)
    print("-"*47)
    def cleanjson(response_str):
        cleaned = response_str.strip()
        if cleaned.startswith("```json"):
            cleaned = cleaned[7:]
        elif cleaned.startswith("```"):
            cleaned = cleaned[3:]
        if cleaned.endswith("```"):
            cleaned = cleaned[:-3]
        return cleaned.strip()
    query_string = cleanjson(query_string1)

    # Parsing query string
    parsed_dict = parse_query_string(query_string)
    print("this is pre query_respinse")

    # Step 2. Retrieve top-k relevant docs (no modifications needed) (From Task 4)
    response = query_response(parsed_dict)
    print("this is post query_respinse")

    formatted_documents = format_documents(response)
    print("this is pre query_respinse")

    retrieved_context = formatted_documents
    print("this is post query_respinse")

    print("-"*47)
    print("CONTEXT RETRIEVED:")
    print(retrieved_context)
    print("-"*47)

    #Step 3. Add the prompt that takes the user query and the retrieved context and outputs a good response to the user's query (Task 5)
    rag_prompt = f'''

    SYSTEM ROLE:
You are a career guidance AI designed to support high school students by offering authentic, experience-based advice. You ground your responses in real insights from professionals featured on the "What To Be" podcast.

TASK:
Using the student’s career question and retrieved context from podcast transcripts, craft a thoughtful, personalized response rooted in the lived experiences and reflections of the professionals interviewed.

INPUTS:
1. User Query: {query}
2. Retrieved Context: {retrieved_context}

OUTPUT INSTRUCTIONS:
• Begin with a warm, empathetic acknowledgment of the student's question and concerns.
• Share relevant experiences, stories, or quotes from professionals in the retrieved context.
• Use a supportive, conversational tone—like a trusted mentor offering guidance.
• Reference specific professionals by name and highlight their real experiences.
• Prioritize concrete details over generic advice—anchor your response in the lived realities from the transcripts.
• Provide actionable takeaways the student can use or reflect on.
• End with links to the podcast episodes so students can explore further. Format them as:
  "To hear more from [Name], check out their interview at [Source URL]"

REQUIREMENTS:
• Use **only** the retrieved context. Do not fabricate details or speculate beyond it.
• Ensure your response is rich in content and insight, grounded in what professionals actually said.
• Maintain a tone that is honest, encouraging, and realistic—acknowledge that career paths can be non-linear and challenging.
• Avoid vague or generalized advice—focus on meaningful takeaways drawn directly from the context.
 '''
    user_response = API_call("gemini-2.0-flash", GOOGLE_API_KEY, rag_prompt)

    print("-"*47)
    print("FINAL USER RESPONSE:")
    print(user_response)
    print("-"*47)

    return user_response

Finally you can test your full pipeline! You should test all of your queries created in task 2 here. You only need to replace the query below with your next query and **run only the cell below.** We have created the pipeline function so it will print out each step (output from each Task from before) that you need for your spreadsheet. (This cell will spit out a lot of information so you may have to scroll to see it all)
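If you would rather run all of your queries in one go, a small loop like the one below could collect the results. This is only a sketch: `run_queries` is a hypothetical helper, and the stub pipeline stands in for `RAG_Pipeline` so the example runs on its own.

```python
def run_queries(pipeline, queries):
    """Run each query through `pipeline` and collect responses or error messages."""
    results = []
    for query in queries:
        try:
            results.append({"query": query, "response": pipeline(query), "error": None})
        except Exception as exc:  # keep going so one failure does not hide the rest
            results.append({"query": query, "response": None, "error": str(exc)})
    return results

def stub_pipeline(query):
    """Stand-in for RAG_Pipeline: echoes the query, fails on empty input."""
    if not query:
        raise ValueError("empty query")
    return f"response to: {query}"

results = run_queries(stub_pipeline, ["first question", "", "third question"])
for row in results:
    print(row)
```

In the notebook, you would pass `RAG_Pipeline` in place of `stub_pipeline` and your own list of queries.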

In [None]:
#Test your pipeline here for your 5 user queries
query = "“You said you jumped straight from a 10-week EMT course into a year-and-a-half of paramedic school without first working on an ambulance. Looking back, what made that intense route worth it, and what would you advise teens who are eager to ‘sprint’ into the fire service today?”"
response = RAG_Pipeline(query)

PIPELINE OUTPUTS FOR: “You said you jumped straight from a 10-week EMT course into a year-and-a-half of paramedic school without first working on an ambulance. Looking back, what made that intense route worth it, and what would you advise teens who are eager to ‘sprint’ into the fire service today?”
-----------------------------------------------
QUERY STRING GENERATED:
```json
{
  "content_string_query": "Advice on the accelerated path from EMT training to paramedic school without ambulance experience, and recommendations for teenagers interested in a fast-track into the fire service.",
  "industry_filter": [
    "Fire Service"
  ]
}
```
-----------------------------------------------
this is pre query_respinse
this is post query_respinse
this is pre query_respinse
this is post query_respinse
-----------------------------------------------
CONTEXT RETRIEVED:
[]
-----------------------------------------------
-----------------------------------------------
FINAL USER RESPONSE:
Okay, th