# Dataset 1: Mental Health Counseling Conversations

Source: Hugging Face
Dataset: Amod/mental_health_counseling_conversations
Dataset Link: https://huggingface.co/datasets/Amod/mental_health_counseling_conversations

In [12]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
import seaborn as sns

# Download stopwords and wordnet for NLP preprocessing
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/satwik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/satwik/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
# Load the dataset (from Hugging Face)
df = pd.read_json("hf://datasets/Amod/mental_health_counseling_conversations/combined_dataset.json", lines=True)

# Display the first 10 rows
df.head(10)

Unnamed: 0,Context,Response
0,I'm going through some things with my feelings...,"If everyone thinks you're worthless, then mayb..."
1,I'm going through some things with my feelings...,"Hello, and thank you for your question and see..."
2,I'm going through some things with my feelings...,First thing I'd suggest is getting the sleep y...
3,I'm going through some things with my feelings...,Therapy is essential for those that are feelin...
4,I'm going through some things with my feelings...,I first want to let you know that you are not ...
5,I'm going through some things with my feelings...,"Heck, sure thing, hun!Feelings of 'depression'..."
6,I'm going through some things with my feelings...,You are exhibiting some specific traits of a p...
7,I'm going through some things with my feelings...,That is intense. Depression is a liar. Sometim...
8,I'm going through some things with my feelings...,It sounds like you may be putting yourself las...
9,I'm going through some things with my feelings...,It must be really difficult to experience what...


In [29]:
print(f"Number of rows in the dataset: {df.shape[0]}")

Number of rows in the dataset: 3512


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3512 entries, 0 to 3511
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Context   3512 non-null   object
 1   Response  3512 non-null   object
dtypes: object(2)
memory usage: 55.0+ KB


## Handle Missing Data
If 'Context' (the user's question) or 'Response' (the therapist's answer) is missing, drop the row.

In [21]:
# Check for missing values
print("Missing values before handling:")
print(df.isnull().sum())

Missing values before handling:
Context     0
Response    0
dtype: int64


The output shows zero missing values in both columns, so no missing-value handling is needed for this dataset.

## Remove Outliers (Based on Text Length)

Remove extremely short questions (<5 words) and long responses (>500 words).

In [31]:
# Calculate word count for each row
df['context_word_count'] = df['Context'].apply(lambda x: len(x.split()))
df['response_word_count'] = df['Response'].apply(lambda x: len(x.split()))

# Define reasonable thresholds for length
min_question_length = 5  # At least 5 words in question
max_response_length = 500  # Max 500 words in therapist's response

# Apply filters
df = df[(df['context_word_count'] >= min_question_length) & (df['response_word_count'] <= max_response_length)]
df = df.drop(columns=['context_word_count', 'response_word_count'])
print(f"Dataset size after outlier removal: {df.shape}")

Dataset size after outlier removal: (3430, 2)
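Before dropping rows, it can help to see how many rows each threshold removes on its own. The following is a minimal sketch on toy data (the `toy` frame and its contents are invented for illustration). It uses the same thresholds as above but keeps the word counts in separate Series instead of temporary columns.

```python
import pandas as pd

# Toy data standing in for the Context/Response columns (illustrative only).
toy = pd.DataFrame({
    "Context": [
        "help",
        "I cannot sleep at night anymore",
        "I feel anxious before every exam",
    ],
    "Response": [
        "Tell me more.",
        "word " * 501,  # 501 words, one over the limit
        "That sounds stressful. Let us look at it together.",
    ],
})

min_question_length = 5    # same threshold as in the cell above
max_response_length = 500  # same threshold as in the cell above

# Word counts as standalone Series, so no helper columns need dropping later.
context_words = toy["Context"].str.split().str.len()
response_words = toy["Response"].str.split().str.len()

too_short = context_words < min_question_length
too_long = response_words > max_response_length

print(f"Short questions: {too_short.sum()}, long responses: {too_long.sum()}")

# Keep only rows that pass both rules.
filtered = toy[~too_short & ~too_long]
print(f"Rows kept: {len(filtered)}")
```

Reporting the two counts separately shows which rule accounts for most of the removed rows.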


## Text Cleaning (NLP Preprocessing)

- Convert to lowercase
- Remove special characters & punctuation
- Remove stopwords
- Lemmatization

In [38]:
# Initialize NLP tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters & punctuation
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])  # Apply lemmatization
    return text

# Apply text cleaning
df['Cleaned_Context'] = df['Context'].apply(clean_text)
df['Cleaned_Response'] = df['Response'].apply(clean_text)

In [40]:
# Show the first 10 rows of cleaned data
df[['Cleaned_Context', 'Cleaned_Response']].head(10)

Unnamed: 0,Cleaned_Context,Cleaned_Response
0,im going thing feeling barely sleep nothing th...,everyone think youre worthless maybe need find...
1,im going thing feeling barely sleep nothing th...,hello thank question seeking advice feeling wo...
2,im going thing feeling barely sleep nothing th...,first thing id suggest getting sleep need impa...
3,im going thing feeling barely sleep nothing th...,therapy essential feeling depressed worthless ...
4,im going thing feeling barely sleep nothing th...,first want let know alone feeling always someo...
5,im going thing feeling barely sleep nothing th...,heck sure thing hunfeelings depression deeplyr...
6,im going thing feeling barely sleep nothing th...,exhibiting specific trait particular temperame...
7,im going thing feeling barely sleep nothing th...,intense depression liar sometimes depression p...
8,im going thing feeling barely sleep nothing th...,sound like may putting last wrote want fix iss...
9,im going thing feeling barely sleep nothing th...,must really difficult experience going right t...
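In the cleaned output above, row 5 contains the token `hunfeelings`. The source text `hun!Feelings` lost its punctuation without gaining a space, so two words fused into one. A possible variant, sketched below, removes apostrophes outright (so contractions stay one token) and replaces every other non-letter with a space. The function name `clean_text_spaced` and the tiny stop-word set are hypothetical, used here so the example runs without NLTK data. Lemmatization is left out for brevity.

```python
import re

# Hypothetical, tiny stop-word set for this illustration only;
# the notebook itself uses NLTK's English stop-word list.
DEMO_STOP_WORDS = {"of", "the", "a"}

def clean_text_spaced(text, stop_words=DEMO_STOP_WORDS):
    """Variant of clean_text that preserves word boundaries at punctuation."""
    text = text.lower()
    text = re.sub(r"'", "", text)          # drop apostrophes: "i'm" -> "im"
    text = re.sub(r"[^a-z\s]", " ", text)  # other non-letters become a space
    return " ".join(w for w in text.split() if w not in stop_words)

# Sample taken from the start of row 5 in the raw data shown earlier.
sample = "Heck, sure thing, hun!Feelings of 'depression'"
print(clean_text_spaced(sample))
```

With this variant, `hun` and `feelings` remain separate tokens. Whether that matters depends on the downstream model.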


In [42]:
# Save cleaned dataset
df.to_csv("cleaned_counseling_conversations.csv", index=False)
print("Preprocessing complete! Cleaned dataset saved as 'cleaned_counseling_conversations.csv'.")

Preprocessing complete! Cleaned dataset saved as 'cleaned_counseling_conversations.csv'.


# Dataset 2: Mental Health FAQ for Chatbot

Source: Kaggle
Dataset: Mental Health FAQ for Chatbot
Dataset Link: https://www.kaggle.com/datasets/narendrageek/mental-health-faq-for-chatbot

In [47]:
# Load the dataset
df2 = pd.read_csv("/Users/satwik/Downloads/Mental_Health_FAQ.csv")

# Display first 10 rows
df2.head(10)

Unnamed: 0,Question_ID,Questions,Answers
0,1590140,What does it mean to have a mental illness?,Mental illnesses are health conditions that di...
1,2110618,Who does mental illness affect?,It is estimated that mental illness affects 1 ...
2,6361820,What causes mental illness?,It is estimated that mental illness affects 1 ...
3,9434130,What are some of the warning signs of mental i...,Symptoms of mental health disorders vary depen...
4,7657263,Can people with mental illness recover?,"When healing from mental illness, early identi..."
5,1619387,What should I do if I know someone who appears...,Although this website cannot substitute for pr...
6,1030153,How can I find a mental health professional fo...,Feeling comfortable with the professional you ...
7,8022026,What treatment options are available?,Just as there are different types of medicatio...
8,1155199,"If I become involved in treatment, what do I n...",Since beginning treatment is a big step for in...
9,7760466,What is the difference between mental health p...,There are many types of mental health professi...


In [49]:
# Drop the 'Question_ID' column as it's not useful for training
df2 = df2.drop(columns=['Question_ID'])

# Verify the change
df2.head(10)

Unnamed: 0,Questions,Answers
0,What does it mean to have a mental illness?,Mental illnesses are health conditions that di...
1,Who does mental illness affect?,It is estimated that mental illness affects 1 ...
2,What causes mental illness?,It is estimated that mental illness affects 1 ...
3,What are some of the warning signs of mental i...,Symptoms of mental health disorders vary depen...
4,Can people with mental illness recover?,"When healing from mental illness, early identi..."
5,What should I do if I know someone who appears...,Although this website cannot substitute for pr...
6,How can I find a mental health professional fo...,Feeling comfortable with the professional you ...
7,What treatment options are available?,Just as there are different types of medicatio...
8,"If I become involved in treatment, what do I n...",Since beginning treatment is a big step for in...
9,What is the difference between mental health p...,There are many types of mental health professi...


In [61]:
print(f"Number of rows in the dataset: {df2.shape[0]}")

Number of rows in the dataset: 98


In [59]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Questions  98 non-null     object
 1   Answers    98 non-null     object
dtypes: object(2)
memory usage: 1.7+ KB


## Handle Missing Data

- Drop rows where either Question or Answer is missing

In [53]:
# Check for missing values
print("Missing values before handling:")
print(df2.isnull().sum())

Missing values before handling:
Questions    0
Answers      0
dtype: int64


The output shows zero missing values in both columns, so no missing-value handling is needed for this dataset.

## Remove Duplicate Questions
- Ensure each question has a unique entry

In [66]:
# Remove duplicate questions
df2 = df2.drop_duplicates(subset=['Questions'])

# Display dataset size after removing duplicates
print(f"Dataset size after duplicate removal: {df2.shape}")

Dataset size after duplicate removal: (98, 2)


The row count is unchanged at 98, so this dataset contains no duplicate questions.
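`drop_duplicates` only catches exact string matches. The sketch below, on invented toy data, counts exact duplicates first and then checks again after normalizing case and surrounding whitespace. It also counts repeated answers, since the first rows shown earlier include two different questions whose answers begin with the same text. All names here (`toy`, `normalized`, `deduped`) are illustrative.

```python
import pandas as pd

# Toy FAQ rows (illustrative only).
toy = pd.DataFrame({
    "Questions": [
        "What causes mental illness?",
        "what causes mental illness? ",  # differs only in case and a trailing space
        "Can people recover?",
    ],
    "Answers": [
        "Biology and environment.",
        "Biology and environment.",
        "Yes, many do.",
    ],
})

# Exact-match duplicates, as drop_duplicates(subset=['Questions']) sees them.
exact_dupes = int(toy.duplicated(subset=["Questions"]).sum())

# Duplicates after normalizing case and surrounding whitespace.
normalized = toy["Questions"].str.strip().str.lower()
near_dupes = int(normalized.duplicated().sum())

# Repeated answers attached to different questions.
answer_dupes = int(toy.duplicated(subset=["Answers"]).sum())

print(exact_dupes, near_dupes, answer_dupes)

# Keep the first occurrence of each normalized question.
deduped = toy[~normalized.duplicated()]
```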

## Text Cleaning (NLP Preprocessing)
- Convert text to lowercase
- Remove punctuation & special characters
- Remove stopwords
- Apply lemmatization

In [72]:
# Initialize NLP tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters & punctuation
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])  # Apply lemmatization
    return text

# Apply text cleaning
df2['Cleaned_Questions'] = df2['Questions'].apply(clean_text)
df2['Cleaned_Answers'] = df2['Answers'].apply(clean_text)

In [74]:
# Show the first 10 rows of cleaned data
df2[['Cleaned_Questions', 'Cleaned_Answers']].head(10)

Unnamed: 0,Cleaned_Questions,Cleaned_Answers
0,mean mental illness,mental illness health condition disrupt person...
1,mental illness affect,estimated mental illness affect adult america ...
2,cause mental illness,estimated mental illness affect adult america ...
3,warning sign mental illness,symptom mental health disorder vary depending ...
4,people mental illness recover,healing mental illness early identification tr...
5,know someone appears symptom mental disorder,although website cannot substitute professiona...
6,find mental health professional child,feeling comfortable professional child working...
7,treatment option available,different type medication physical illness dif...
8,become involved treatment need know,since beginning treatment big step individual ...
9,difference mental health professional,many type mental health professional variety p...
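The same `clean_text` function is redefined for each dataset in this notebook. One way to avoid the repetition is a small helper that applies any cleaning function to several columns at once. This is a sketch only: `add_cleaned_columns` and `demo_cleaner` are hypothetical names, and `demo_cleaner` is a stand-in so the example runs without NLTK data.

```python
import pandas as pd

def add_cleaned_columns(df, columns, cleaner):
    """Return a copy of df with one 'Cleaned_<column>' column per input column."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    for col in columns:
        out[f"Cleaned_{col}"] = out[col].astype(str).apply(cleaner)
    return out

def demo_cleaner(text):
    """Stand-in cleaner: lowercase and collapse repeated whitespace."""
    return " ".join(text.lower().split())

toy = pd.DataFrame({
    "Questions": ["What  causes it?"],
    "Answers": ["Many  Factors."],
})

cleaned = add_cleaned_columns(toy, ["Questions", "Answers"], demo_cleaner)
print(cleaned[["Cleaned_Questions", "Cleaned_Answers"]])
```

In the notebook, the existing `clean_text` could be passed as `cleaner` in place of `demo_cleaner`.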


In [76]:
# Save cleaned dataset
df2.to_csv("cleaned_mental_health_faq.csv", index=False)
print("Preprocessing complete! Cleaned dataset saved as 'cleaned_mental_health_faq.csv'.")

Preprocessing complete! Cleaned dataset saved as 'cleaned_mental_health_faq.csv'.


# Dataset 3: Sentiment Analysis for Mental Health

Source: Kaggle
Dataset: Sentiment Analysis for Mental Health
Dataset Link: https://www.kaggle.com/datasets/suchintikasarkar/sentiment-analysis-for-mental-health

In [81]:
# Load the dataset
df3 = pd.read_csv("/Users/satwik/Downloads/Sentiment_Analysis.csv")

# Display first 10 rows
df3.head(10)

Unnamed: 0.1,Unnamed: 0,statement,status
0,0,oh my gosh,Anxiety
1,1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,3,I've shifted my focus to something else but I'...,Anxiety
4,4,"I'm restless and restless, it's been a month n...",Anxiety
5,5,"every break, you must be nervous, like somethi...",Anxiety
6,6,"I feel scared, anxious, what can I do? And may...",Anxiety
7,7,Have you ever felt nervous but didn't know why?,Anxiety
8,8,"I haven't slept well for 2 days, it's like I'm...",Anxiety
9,9,"I'm really worried, I want to cry.",Anxiety


In [83]:
# Drop the unnecessary "Unnamed: 0" column
df3 = df3.drop(columns=['Unnamed: 0'], errors='ignore')

# Verify the change
df3.head(10)

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety
5,"every break, you must be nervous, like somethi...",Anxiety
6,"I feel scared, anxious, what can I do? And may...",Anxiety
7,Have you ever felt nervous but didn't know why?,Anxiety
8,"I haven't slept well for 2 days, it's like I'm...",Anxiety
9,"I'm really worried, I want to cry.",Anxiety


In [86]:
print(f"Number of rows in the dataset: {df3.shape[0]}")

Number of rows in the dataset: 53043


In [88]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53043 entries, 0 to 53042
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  52681 non-null  object
 1   status     53043 non-null  object
dtypes: object(2)
memory usage: 828.9+ KB


## Handle Missing Data
- Drop rows where statement or status is missing

In [90]:
# Check for missing values
print("Missing values before handling:")
print(df3.isnull().sum())

Missing values before handling:
statement    362
status         0
dtype: int64


The 'statement' column has 362 missing values, so we drop those rows.

In [93]:
# Drop rows where 'statement' or 'status' is missing
df3 = df3.dropna(subset=['statement', 'status'])

# Verify missing values are handled
print("\nMissing values after handling:")
print(df3.isnull().sum())


Missing values after handling:
statement    0
status       0
dtype: int64


In [97]:
df3.info()
df3.head(10)

<class 'pandas.core.frame.DataFrame'>
Index: 52681 entries, 0 to 53042
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   statement  52681 non-null  object
 1   status     52681 non-null  object
dtypes: object(2)
memory usage: 1.2+ MB


Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety
5,"every break, you must be nervous, like somethi...",Anxiety
6,"I feel scared, anxious, what can I do? And may...",Anxiety
7,Have you ever felt nervous but didn't know why?,Anxiety
8,"I haven't slept well for 2 days, it's like I'm...",Anxiety
9,"I'm really worried, I want to cry.",Anxiety


In [99]:
print(f"Number of rows after removing missing values: {df3.shape[0]}")

Number of rows after removing missing values: 52681
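The `df3.info()` output above reports `Index: 52681 entries, 0 to 53042`, which means the index still has gaps where rows were dropped. If a contiguous index is wanted (for example, for positional alignment with arrays later), `reset_index(drop=True)` renumbers the rows. A minimal sketch on invented toy data:

```python
import pandas as pd

# Toy data with one missing statement (illustrative only).
toy = pd.DataFrame({
    "statement": ["oh my gosh", None, "I want to cry."],
    "status": ["Anxiety", "Normal", "Anxiety"],
})

# Same call as in the notebook: drop rows missing either field.
dropped = toy.dropna(subset=["statement", "status"])
print(dropped.index.tolist())     # the original labels survive, with a gap

# Renumber from 0 and discard the old index.
renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())
```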


## Normalize Text (NLP Preprocessing)
- Convert to lowercase
- Remove punctuation, special characters, and numbers
- Remove stopwords
- Apply lemmatization

In [105]:
# Initialize stopwords and lemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove special characters, numbers, punctuation
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])  # Apply lemmatization
    return text

# Apply text cleaning
df3['Cleaned_Statement'] = df3['statement'].apply(clean_text)

In [107]:
# Show sample cleaned data
df3[['Cleaned_Statement', 'status']].head(10)

Unnamed: 0,Cleaned_Statement,status
0,oh gosh,Anxiety
1,trouble sleeping confused mind restless heart ...,Anxiety
2,wrong back dear forward doubt stay restless re...,Anxiety
3,ive shifted focus something else im still worried,Anxiety
4,im restless restless month boy mean,Anxiety
5,every break must nervous like something wrong ...,Anxiety
6,feel scared anxious may family u protected,Anxiety
7,ever felt nervous didnt know,Anxiety
8,havent slept well day like im restless huh,Anxiety
9,im really worried want cry,Anxiety


## Encode 'status' Labels for Model Training
- Since status is categorical text, we convert it to numerical labels for model training.

In [110]:
# Define label encoding for mental health status
label_mapping = {
    'Normal': 0,
    'Depression': 1,
    'Suicidal': 2,
    'Anxiety': 3,
    'Stress': 4,
    'Bi-Polar': 5,
    'Personality Disorder': 6
}

# Apply encoding
df3['Encoded_Status'] = df3['status'].map(label_mapping)

In [112]:
# Verify that the encoding was applied
df3[['Cleaned_Statement', 'status', 'Encoded_Status']].head(10)

Unnamed: 0,Cleaned_Statement,status,Encoded_Status
0,oh gosh,Anxiety,3.0
1,trouble sleeping confused mind restless heart ...,Anxiety,3.0
2,wrong back dear forward doubt stay restless re...,Anxiety,3.0
3,ive shifted focus something else im still worried,Anxiety,3.0
4,im restless restless month boy mean,Anxiety,3.0
5,every break must nervous like something wrong ...,Anxiety,3.0
6,feel scared anxious may family u protected,Anxiety,3.0
7,ever felt nervous didnt know,Anxiety,3.0
8,havent slept well day like im restless huh,Anxiety,3.0
9,im really worried want cry,Anxiety,3.0
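The `Encoded_Status` values above are displayed as `3.0` rather than `3`. `Series.map` with a dictionary returns NaN for any value that is not a key, and a single NaN turns the whole column into floats. So the float display suggests that at least one `status` label in the data may not match a key in `label_mapping` exactly. The sketch below shows how to check, using invented toy labels; the spelling `Bipolar` is a hypothetical mismatch chosen for illustration, not a confirmed value from the dataset.

```python
import pandas as pd

# A subset of the notebook's mapping, enough for the illustration.
label_mapping = {"Normal": 0, "Anxiety": 3, "Bi-Polar": 5}

# Toy labels; "Bipolar" is a hypothetical spelling that is not a key above.
status = pd.Series(["Normal", "Anxiety", "Bipolar", "Anxiety"])

encoded = status.map(label_mapping)

# Labels that produced NaN because they are missing from the mapping.
unmapped = sorted(status[encoded.isna()].unique())
print(encoded.dtype)  # a float dtype signals that NaN is present
print(unmapped)

# One option: derive the mapping from the labels actually present.
auto_mapping = {label: i for i, label in enumerate(sorted(status.unique()))}
encoded_auto = status.map(auto_mapping)
print(encoded_auto.dtype)  # integer dtype, since every label has a key
```

Running `df3['status'].unique()` before writing the mapping by hand would show the exact spellings to use as keys.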


In [114]:
# Save cleaned dataset
df3.to_csv("cleaned_sentiment_analysis.csv", index=False)

print("Preprocessing complete! Cleaned dataset saved as 'cleaned_sentiment_analysis.csv'.")

Preprocessing complete! Cleaned dataset saved as 'cleaned_sentiment_analysis.csv'.
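One thing that may be worth checking after saving: if cleaning reduces a statement to an empty string (for example, a statement made up entirely of stop words), `to_csv` writes an empty field, and `read_csv` reads it back as NaN by default. The sketch below reproduces this with an in-memory buffer and invented toy data, and shows one option for filtering such rows before saving.

```python
import io
import pandas as pd

# Toy cleaned data; the second statement became empty after cleaning.
toy = pd.DataFrame({
    "Cleaned_Statement": ["oh gosh", ""],
    "status": ["Anxiety", "Normal"],
})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The empty string comes back as a missing value.
n_missing = int(restored["Cleaned_Statement"].isna().sum())
print(f"Missing after reload: {n_missing}")

# One option: drop rows whose cleaned text is empty before saving.
non_empty = toy[toy["Cleaned_Statement"].str.strip() != ""]
print(f"Rows with non-empty text: {len(non_empty)}")
```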
