## About the Notebook

### Capstone Project – Kaggle 5-Day Intensive Course on Generative AI

In celebration of World Health Day, this capstone project focuses on Mental Health and Wellbeing, leveraging a dedicated mental health dataset. The project combines foundational concepts from Kaggle's Generative AI Intensive (Q1 2025) with advanced techniques such as Retrieval-Augmented Generation (RAG), Few-Shot Prompting, Function Calling, and custom prompt engineering.

This notebook classifies user text responses to detect potential mental health concerns, generates contextual insights, and assigns severity scores using large language models. It also demonstrates practical ways to integrate generative AI into health tech applications with a responsible, interactive design.

**Key Components**:

**Text Classification via Prompt-Driven LLMs**: Utilizes prompt-based generative AI for classifying textual inputs into mental health-related categories.

**Severity Scoring (1–10 Scale)**: Assigns interpretable severity levels using prompt engineering combined with few-shot examples and function calling for precision.

**Sentiment Categorization**: Introduces a new feature column labeling each entry as Positive, Negative, or Neutral based on the severity score and sentiment cues.

**Dataset-Wide Distribution Analysis**: Computes and visualizes the percentage breakdown of mental health statuses, helping surface key trends and risk segments.

**Actionable Interventions**: Leverages Retrieval-Augmented Generation (RAG) to suggest personalized and context-aware interventions aimed at promoting mental well-being.

---

## Setup

Install the SDK

In [None]:
!pip uninstall -qy jupyterlab  # Remove unused packages from Kaggle's base image that conflict
!pip install -U -q "google-genai==1.7.0"

In [None]:
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display

Set up a retry helper. This allows you to "Run all" without worrying about per-minute quota.

In [None]:
from google.api_core import retry


is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)
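
For readers unfamiliar with what `retry.Retry` does here, the following is a minimal, library-free sketch of the same idea: call a function again, with growing delays, only when a predicate says the error is transient. `FakeAPIError`, `call_with_retry`, and the delay values are hypothetical names and defaults chosen for illustration; they are not part of `google.api_core` or the Gemini SDK.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def is_retriable(error):
    # Same rule as the notebook's predicate: retry only on quota (429)
    # and service-unavailable (503) responses.
    return isinstance(error, FakeAPIError) and error.code in {429, 503}


def call_with_retry(fn, predicate, max_attempts=5, initial_delay=1.0,
                    multiplier=2.0, sleep=time.sleep):
    """Call fn(); on a retriable error wait and try again (assumed defaults)."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            # Give up on non-retriable errors or when attempts are exhausted.
            if not predicate(error) or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= multiplier


# Demo: a flaky function that fails twice with 429, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeAPIError(429)
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retry(flaky_call, is_retriable, sleep=waits.append)
print(result, waits)
```

Passing `sleep=waits.append` keeps the demo instant while still showing the delay doubling between attempts.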

Importing the API key

In [None]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

Test the API key using a simple prompt

In [None]:
client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="When is World Health Day and what was the objective of it? Why is it important to pay more attention to Mental Health in Modern day")

Markdown(response.text)

---

## About the dataset: Mental Health 

**Description**: This dataset contains textual data labeled according to indications of mental health concerns, specifically identifying content associated with mental health issues or distress. The goal is to classify, from their messages, whether a person is struggling with mental health issues.

This dataset is a collection of texts related to people with anxiety, depression, and other mental health issues. The corpus consists of two columns: one containing the comments, and the other containing labels indicating whether the comments are considered poisonous or not. The corpus can be used for a variety of purposes, such as sentiment analysis, toxic language detection, and mental health language analysis. The data in the corpus may be useful for researchers, mental health professionals, and others interested in understanding the language and sentiment surrounding mental health issues. A label of 1 marks a comment considered poisonous, indicating mental health issues; a label of 0 marks a comment from a person who is not struggling with mental health issues.

Understanding and detecting mental health issues through text messages can be a critical step in providing timely support and intervention for those in need. Research has shown that linguistic patterns and word choices in written communication can be indicative of various mental health conditions, including depression, anxiety, and stress. Analyzing the content of messages, along with the intensity of emotions conveyed, can offer valuable insights into a person's emotional well-being.

This cutting-edge field of study combines natural language processing (NLP) techniques with psychology and psychiatry, aiming to build automated systems capable of identifying signs of mental distress accurately. By recognizing these signals, friends, family, and mental health professionals can be better equipped to offer timely assistance and support.

**Use Cases**: Sentiment analysis, text classification, NLP-based mental health detection, exploratory data analysis, and machine learning classification tasks.

**Data Dictionary**:

  - **text**: The textual content collected from various sources, potentially containing expressions or indications of mental health conditions or distress. (Data Type: object)
  - **label**: Binary indicator specifying whether the text indicates a mental health issue (1) or not (0). (Data Type: int64)

File Reference: [mental_health.csv](https://github.com/vmahawar/data-science-datasets-collection/raw/main/mental_health.csv)

#### read the dataset

In [None]:
import pandas as pd
url = 'https://github.com/vmahawar/data-science-datasets-collection/raw/main/mental_health.csv'
df = pd.read_csv(url)
df.head()

In [None]:
df.shape

In [None]:
df = df.head(100).copy()

In [None]:
df['text'][0]

#### Generate a word cloud for label = 0 (people who are mentally doing well)

In [None]:
# Install wordcloud if not already installed
!pip install -q wordcloud

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Load your dataset (assuming you already have it)
# For example:
# df = pd.read_csv("mental_health.csv")  # already loaded earlier in your notebook

# Filter for records with label == 0 (mentally well)
filtered_df = df[df['label'] == 0]

# Combine all the text entries into a single string
text_data = " ".join(filtered_df['text'].dropna().astype(str))

# Generate the word cloud
wordcloud = WordCloud(
    width=800, height=400,
    background_color='white',
    max_words=100,
    colormap='viridis'
).generate(text_data)

# Display the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud for people with label as 0 (who are mentally well)", fontsize=16)
plt.show()


#### Choose a model

The Gemini API provides access to a number of models from the Gemini model family. Read about the available models and their capabilities on the [model overview page](https://ai.google.dev/gemini-api/docs/models/gemini).

In this step you'll use the API to list all of the available models.

In [None]:
for model in client.models.list():
  print(model.name)

---

## Prompt Engineering - Evaluation and Structured Output

**Prompt Engineering**: Carefully constructed prompts guided the AI to evaluate texts based on signs of anxiety, stress, depression, and overall mental wellbeing.

**Severity Scoring**: A numeric score indicating severity was provided by the AI, allowing categorization and prioritization of responses for interventions.

**Insight Generation**: The analysis highlighted significant trends, enabling recommendations for mental health professionals to intervene proactively.

In [None]:
import enum

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert classifier. Your task is to classify the text responses in the dataset.
I want you to convert this on the scale of 1 to 10 and apply the rubric.
Here is the brief description about the dataset:
To classify if the person is struggling with mental health issues or not from their messages.

I will provide you with the user text and a label. label with 0 means mentally well and score will usually be less than 5, label with 1 means not mentally well which means score would be > 5. 
This dataset is a collection of texts related to people with anxiety, depression, and other mental health issues. The corpus consists of two columns: one containing the comments, and the other containing labels indicating whether the comments are considered poisonous or not. The corpus can be used for a variety of purposes, such as sentiment analysis, toxic language detection, and mental health language analysis. The data in the corpus may be useful for researchers, mental health professionals, and others interested in understanding the language and sentiment surrounding mental health issues. A label of 1 marks a comment considered poisonous, indicating mental health issues; a label of 0 marks a comment from a person who is not struggling with mental health issues.
Understanding and detecting mental health issues through text messages can be a critical step in providing timely support and intervention for those in need. Research has shown that linguistic patterns and word choices in written communication can be indicative of various mental health conditions, including depression, anxiety, and stress. Analyzing the content of messages, along with the intensity of emotions conveyed, can offer valuable insights into a person's emotional well-being.
This cutting-edge field of study combines natural language processing (NLP) techniques with psychology and psychiatry, aiming to build automated systems capable of identifying signs of mental distress accurately. By recognizing these signals, friends, family, and mental health professionals can be better equipped to offer timely assistance and support.

You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

Below are a few examples:

---

User message: "I feel like there's no purpose in my life anymore."
Label: 1 (positive for mental health issue)
Evaluation: The user expresses signs of deep emotional distress and hopelessness, which may indicate depression.
Score: 9
Rating: VERY_VERY_BAD

---

User message: "Just had a long day. Looking forward to a good night's sleep."
Label: 0 (not indicative of mental health issue)
Evaluation: The message reflects tiredness but no sign of mental health issues.
Score: 2
Rating: VERY_GOOD

---

# Evaluation
## Metric Definition
You will be assessing the mental wellbeing expressed in a user's text message, which measures how strongly the message indicates anxiety, stress, depression, or other mental distress. The text message and its label are provided in the user prompt. Base the assessment only on what the message itself says; do not assume information that is not present in it.

## Criteria
Mental Wellbeing: The response demonstrates a Mental State of the person, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Anxiety: The response checks for Anxiety if any.
Stress: The response checks for stress.

## Rating Rubric
10: (WORST). The patient's mental state is extremely poor and requires urgent attention.
9: (VERY_VERY_BAD). The patient's mental state is very severely compromised; immediate intervention strongly advised.
8: (VERY_BAD). The patient's mental state is severely compromised; professional assistance needed promptly.
7: (BAD). The patient's mental state is significantly compromised; recommend seeking help soon.
6: (MODERATE_BAD). The patient shows moderate signs of mental health issues; monitoring or professional support advised.
5: (NEUTRAL). The patient's mental state is borderline, neither clearly healthy nor unhealthy; regular monitoring recommended.
4: (MODERATE_GOOD). The patient's mental state is fairly good; minimal issues present, maintain current support.
3: (GOOD). The patient's mental state is healthy with minor or occasional concerns; generally doing well.
2: (VERY_GOOD). The patient is mentally well with very minor or no issues; no immediate concerns.
1: (BEST). The patient's mental state is excellent; no signs of distress or mental health concerns.


## Evaluation Steps
STEP 1: Assess the text responses in aspects of instruction following whether person is having anxiety, depression or other mental health issues according to the criteria.
STEP 2: Score based on the rubric.

# User Input text response and label
The user message supplies the text response and its label in this form:
User message: <text>, Label: <label>
"""

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  WORST = '10'
  VERY_VERY_BAD = '9'
  VERY_BAD = '8'
  BAD = '7'
  MODERATE_BAD = '6'
  NEUTRAL = '5'
  MODERATE_GOOD = '4'
  GOOD = '3'
  VERY_GOOD = '2'
  BEST = '1'

#### create method eval_summary

In [None]:
import functools

@functools.cache
def eval_summary(text_response, label):
  """Evaluate the generated text responses against the label."""

  chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=SUMMARY_PROMPT  # This is your long-term context!
    ))

  # Send the text and its label; SUMMARY_PROMPT is already set as the system instruction.
  response = chat.send_message(f"User message: {text_response}, Label: {label}")
    
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  # Extract the numeric rating from structured_eval
  numeric_rating = structured_eval.value

  # Map numeric rating back to the enum description clearly
  summary_rating = structured_eval.name

  return verbose_eval, numeric_rating, summary_rating
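
`eval_summary` relies on `response.parsed` being populated. The following is a minimal sketch of a defensive fallback, assuming the verbose evaluation contains a line such as `Score: 9`, as in the few-shot examples. `extract_rating` is a hypothetical helper, and the regular expression is an assumption about the model's output format rather than something the API guarantees.

```python
import enum
import re


class SummaryRating(enum.Enum):
    # Same members as the notebook's enum: name is the label, value the score.
    WORST = '10'
    VERY_VERY_BAD = '9'
    VERY_BAD = '8'
    BAD = '7'
    MODERATE_BAD = '6'
    NEUTRAL = '5'
    MODERATE_GOOD = '4'
    GOOD = '3'
    VERY_GOOD = '2'
    BEST = '1'


def extract_rating(verbose_eval):
    """Return the SummaryRating for the first 'Score: N' found, else None."""
    match = re.search(r"Score:\s*(10|[1-9])\b", verbose_eval)
    if match is None:
        return None
    # Enum lookup by value: SummaryRating('9') is SummaryRating.VERY_VERY_BAD.
    return SummaryRating(match.group(1))


sample = "Evaluation: signs of hopelessness.\nScore: 9\nRating: VERY_VERY_BAD"
rating = extract_rating(sample)
print(rating.name, int(rating.value))
```

Looking members up by value keeps the numeric score and the rubric label in one place, so the two cannot drift apart.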

#### test method eval_summary

In [None]:
text_eval, score, struct_eval = eval_summary(text_response=df['text'][0], label=df['label'][0])
print(text_eval)
print(struct_eval)

#### create a wrapper method get_score_and_feedback
This creates three new features: feedback, severity score, and rating.

In [None]:
def get_score_and_feedback(row):
    rationale, score, rating = eval_summary(text_response=row['text'], label=row['label'])
    return pd.Series({'feedback': rationale, 'score': score, 'rating': rating})

# Apply this function only to the first 20 rows (for your current code)
df_sample = df.head(20).copy()
df_sample[['feedback', 'score','rating']] = df_sample.apply(get_score_and_feedback, axis=1)


---

## Feature Sample Dataset 

In [None]:
df_sample.head(20)
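
The Sentiment Categorization and distribution analysis listed under Key Components can be sketched from the `score` column. The thresholds below (below 5 is Positive, exactly 5 is Neutral, above 5 is Negative) are an assumption read off the rating rubric, and `score_to_sentiment` and `percentage_breakdown` are hypothetical helpers, not code from the original notebook.

```python
from collections import Counter


def score_to_sentiment(score):
    """Map a 1-10 severity score to a sentiment label (assumed thresholds)."""
    score = int(score)  # the notebook stores scores as strings such as '9'
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score < 5:
        return "Positive"
    if score == 5:
        return "Neutral"
    return "Negative"


def percentage_breakdown(labels):
    """Return {label: percentage of entries}, rounded to one decimal place."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Illustrative scores only; these are not results from the dataset.
example_scores = ['9', '2', '5', '7', '3', '8', '2', '6']
sentiments = [score_to_sentiment(s) for s in example_scores]
print(percentage_breakdown(sentiments))
```

With pandas, the same mapping could be applied as `df_sample['sentiment'] = df_sample['score'].map(score_to_sentiment)`.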

---

## **RAG** - Retrieval-Augmented Generation

In [None]:
!pip install -qU "chromadb==0.4.24"

In [None]:
# Keep only rows where every required field is present.
rag_df = df_sample.dropna(subset=["text", "label", "feedback", "score", "rating"])

# Build a first version of the documents from the user text alone.
documents = [
    f"USER MESSAGE: {row['text']}"
    for _, row in rag_df.iterrows()
]

### Creating the embedding database with ChromaDB
Create a custom function to generate embeddings with the Gemini API. In this task, you are implementing a retrieval system, so the task_type for generating the document embeddings is retrieval_document. Later, you will use retrieval_query for the query embeddings. Check out the API reference for the full list of supported tasks.

Key words: Documents are the items that are in the database. They are inserted first, and later retrieved. Queries are the textual search terms and can be simple keywords or textual descriptions of the desired documents.

In [None]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry
from google.genai import types

class GeminiEmbeddingFunction(EmbeddingFunction):
    document_mode = True

    @retry.Retry(predicate=lambda e: isinstance(e, genai.errors.APIError) and e.code in {429, 503})
    def __call__(self, input: Documents) -> Embeddings:
        task_type = "retrieval_document" if self.document_mode else "retrieval_query"
        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(task_type=task_type),
        )
        return [e.values for e in response.embeddings]

Now create a Chroma database client that uses the GeminiEmbeddingFunction.

In [None]:
import chromadb

# Create DB
embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
# Delete old collection to avoid conflict
# chroma_client.delete_collection("mentalhealthdb")

# Recreate it
db = chroma_client.get_or_create_collection(name="mentalhealthdb", embedding_function=embed_fn)

Populate the database with the documents you defined above.

In [None]:
# Rebuild the documents, now combining each user message with its feedback.

documents = [
    f"USER MESSAGE: {row['text']}\nFEEDBACK: {row['feedback']}"
    for _, row in rag_df.iterrows()
]

metadatas = [
    {
        "score": int(row['score']),
        "rating": row['rating'],
        "feedback": row['feedback']
    }
    for _, row in rag_df.iterrows()
]

ids = [str(i) for i in range(len(rag_df))]


In [None]:
db.add(documents=documents, metadatas=metadatas, ids=ids)

Confirm that the data was inserted by looking at the database.

In [None]:
db.count()
# You can peek at the data too.
# db.peek(1)

#### Retrieval: Find relevant documents

To search the Chroma database, call the query method. Note that you also switch to the retrieval_query mode of embedding generation.

In [None]:
embed_fn.document_mode = False

query = "What are the common traits among people with Very Very Bad rating and share their feedback?"

result = db.query(query_texts=[query], n_results=5, where={"rating": "VERY_VERY_BAD"})

retrieved_docs = result["documents"][0]
retrieved_metadata = result["metadatas"][0]

for i in range(len(retrieved_docs)):
    print(f"DOCUMENT {i+1}:")
    print(retrieved_docs[i])  # each document already begins with "USER MESSAGE:"
    
    print("-" * 60)
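
Conceptually, `db.query` restricts the search to documents whose metadata matches the `where` filter and ranks them by how close their embeddings are to the query embedding. The following is a minimal sketch of that idea, using hand-made 3-dimensional vectors and cosine similarity purely for illustration. The real embeddings come from the Gemini API, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query(records, query_vector, n_results, where):
    """Filter records by exact-match metadata, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["embedding"], query_vector),
        reverse=True,
    )
    return [r["id"] for r in ranked[:n_results]]


# Toy records; the vectors and ratings are made up for illustration.
records = [
    {"id": "0", "embedding": [0.9, 0.1, 0.0], "metadata": {"rating": "VERY_VERY_BAD"}},
    {"id": "1", "embedding": [0.1, 0.9, 0.0], "metadata": {"rating": "VERY_GOOD"}},
    {"id": "2", "embedding": [0.6, 0.4, 0.2], "metadata": {"rating": "VERY_VERY_BAD"}},
]

top_ids = query(records, query_vector=[1.0, 0.0, 0.0], n_results=2,
                where={"rating": "VERY_VERY_BAD"})
print(top_ids)
```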

#### Augmented generation: Answer the question

Once the relevant passage(s) have been retrieved from the mental health dataset, a generation prompt can be assembled for the Gemini API to produce a final response.

In [None]:
def build_rag_prompt(query: str, docs: list[str]) -> str:
    prompt = f"""You are a helpful and empathetic assistant trained to detect signs of mental health issues from text messages.

Using the following examples as references, answer this question in a kind, clear way:

QUESTION: {query}
"""
    for i, doc in enumerate(docs):
        clean_doc = doc.replace('\n', ' ')
        prompt += f"PASSAGE {i+1}: {clean_doc}\n"
    return prompt
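
Because the retrieval query filters on `rating`, it can return no documents when the sample is small. The following is a minimal sketch of a prompt builder that handles this case explicitly; `build_rag_prompt_safe` and the fallback wording are hypothetical and not part of the original notebook.

```python
def build_rag_prompt_safe(query: str, docs: list[str]) -> str:
    """Assemble a RAG prompt, stating explicitly when nothing was retrieved."""
    lines = [
        "You are a helpful and empathetic assistant trained to detect signs "
        "of mental health issues from text messages.",
        "",
        "Using the following examples as references, answer this question "
        "in a kind, clear way:",
        "",
        f"QUESTION: {query}",
    ]
    if not docs:
        # Tell the model there is no context rather than sending a bare question.
        lines.append("No reference passages were retrieved; say so in the answer.")
    for i, doc in enumerate(docs, start=1):
        # Collapse newlines so each passage stays on one line.
        clean_doc = doc.replace("\n", " ")
        lines.append(f"PASSAGE {i}: {clean_doc}")
    return "\n".join(lines) + "\n"


prompt = build_rag_prompt_safe(
    "What traits do these messages share?",
    ["USER MESSAGE: I can't sleep.\nFEEDBACK: Signs of stress."],
)
print(prompt)
```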

Now use the generate_content method to generate an answer to the question.

In [None]:
rag_prompt = build_rag_prompt(query, retrieved_docs)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=rag_prompt
)

Markdown(response.text)


---

## Function Calls with Gemini API

#### Prepare SQLite DB

In [None]:
import sqlite3

# Create a new SQLite DB (or connect if exists)
conn = sqlite3.connect("mental_health.db")
cursor = conn.cursor()

# Drop the table if it exists
cursor.execute("DROP TABLE IF EXISTS mental_health;")

# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS mental_health (
    id INTEGER PRIMARY KEY,
    text TEXT,
    feedback TEXT,
    score INTEGER,
    rating TEXT
)
''')

# Insert top 5 rows from df_sample
for i, row in df_sample.head(5).iterrows():
    cursor.execute('''
    INSERT INTO mental_health (text, feedback, score, rating)
    VALUES (?, ?, ?, ?)
    ''', (row['text'], row['feedback'], int(row['score']), row['rating']))

conn.commit()


#### Define helper functions

##### Method - list_tables

In [None]:
def list_tables() -> list[str]:
    """Retrieve the names of all tables in the database."""
    print(" - DB CALL: list_tables()")
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
    return [t[0] for t in cursor.fetchall()]


In [None]:
list_tables()

##### Method - describe_table

In [None]:
def describe_table(table_name: str) -> list[tuple[str, str]]:
    """Get table schema - column name and type."""
    print(f" - DB CALL: describe_table({table_name})")
    cursor = conn.cursor()
    cursor.execute(f"PRAGMA table_info({table_name});")
    schema = cursor.fetchall()
    return [(col[1], col[2]) for col in schema]

In [None]:
describe_table('mental_health')
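
`describe_table` interpolates `table_name` directly into the `PRAGMA` statement, and with function calling that argument is chosen by the model. The following is a minimal sketch of one way to guard it: accept only names that actually appear in `sqlite_master`. `describe_table_checked` is a hypothetical helper, and the in-memory database stands in for `mental_health.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mental_health ("
    "id INTEGER PRIMARY KEY, text TEXT, feedback TEXT, score INTEGER, rating TEXT)"
)


def describe_table_checked(conn, table_name):
    """Return (column, type) pairs, refusing names that are not real tables."""
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name!r}")
    # The name was matched against the catalog above before interpolation.
    rows = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
    return [(row[1], row[2]) for row in rows]


print(describe_table_checked(conn, "mental_health"))
```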

##### Method - execute_query

In [None]:
def execute_query(sql: str) -> list[list[str]]:
    """Execute SQL query and return results."""
    print(f" - DB CALL: execute_query({sql})")
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()

In [None]:
execute_query('select * from mental_health LIMIT 1')

In [None]:
# These are the Python functions defined above.
db_tools = [list_tables, describe_table, execute_query]

In [None]:
instruction = """You are a helpful mental health data assistant. Use the available tools to query and analyze the data.
        The dataset contains mental health text samples along with feedback, score (1-10), and rating.
        Your goal is to help the user gain insights from the data through SQL queries."""

client = genai.Client(api_key=GOOGLE_API_KEY)

# Start a chat with automatic function calling enabled.
chat = client.chats.create(
    model="gemini-2.0-flash",
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        tools=db_tools,
    ),
)

In [None]:
resp = chat.send_message("What is the lowest score?")
print(f"\n{resp.text}")

In [None]:
resp = chat.send_message("How many are having score > 5?")
print(f"\n{resp.text}")
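
With automatic function calling, the model can request a call to one of the Python functions passed in `tools`, by name and with arguments; the SDK runs the matching function and returns the result to the model. The following is a minimal sketch of the dispatch step only, with no model involved. The `requested_call` dictionary is a hypothetical stand-in for the function-call data the SDK handles internally, and `dispatch` is not an SDK function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mental_health (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO mental_health (score) VALUES (?)", [(9,), (2,), (7,)])


def list_tables():
    """Retrieve the names of all tables in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    return [row[0] for row in rows]


def execute_query(sql):
    """Execute SQL query and return results."""
    return conn.execute(sql).fetchall()


# Registry keyed by function name, built from the tool list.
tools = {fn.__name__: fn for fn in (list_tables, execute_query)}


def dispatch(requested_call):
    """Run the tool named in the request with its keyword arguments."""
    fn = tools.get(requested_call["name"])
    if fn is None:
        return {"error": f"unknown tool {requested_call['name']}"}
    return {"result": fn(**requested_call["args"])}


# Hypothetical request, shaped like "call execute_query with this SQL".
requested_call = {
    "name": "execute_query",
    "args": {"sql": "SELECT COUNT(*) FROM mental_health WHERE score > 5"},
}
print(dispatch(requested_call))
```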

---

## MedLM and MedPaLM enhancements

### Acknowledgments:

I extend my heartfelt gratitude 🙏 to the Kaggle community and to all the faculty, experts, and trainers, especially [Paige Bailey](https://www.linkedin.com/in/dynamicwebpaige/), [Anant Nawalgaria](https://www.linkedin.com/in/anant-nawalgaria/), and Mark McDonald ([LinkedIn](https://www.linkedin.com/in/markmcdonald0/) | [Kaggle](https://www.kaggle.com/markishere)), for their invaluable guidance and insightful training notebooks during the Kaggle 5-day GenAI Course. Their support has been instrumental in making this project possible.

For more about the course, please visit [Kaggle 5-day GenAI Course](https://www.kaggle.com/learn-guide/5-day-genai).

Blog Post: [Harnessing Generative AI to Understand Mental Health](https://whizdba.wordpress.com/2025/04/07/harnessing-generative-ai-to-understand-mental-health-a-capstone-journey/)