# Mental Health Crisis Detection Data Preprocessing

This notebook loads and preprocesses the Suicide and Depression Detection dataset from Kaggle.

## Data Cleaning Stages
1. **Sanitization**:
   - URL removal
   - Special character stripping
2. **Normalization**:
   - Lowercasing
   - Stopword removal (NLTK English)
3. **Quality Control**:
   - NaN removal
   - Column standardization

## Preprocessing Rationale
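- URL/Special Char Removal: Drops links and symbols that carry little signal for the classifier and add noise to the embeddings
- Stopword Removal: Drops high-frequency function words so the remaining tokens are the content-bearing ones
- Lowercasing: Matches the lowercase input expected by uncased BERT models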
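
As a rough illustration of how the stages above transform a single post, the sketch below applies them one after another to a made-up sentence. The sample text, the `example.com` URL, and `DEMO_STOP_WORDS` (a tiny hand-picked stand-in for the NLTK English list, used so the snippet runs without downloading the corpus) are assumptions for illustration only and are not taken from the dataset.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list; the notebook
# itself uses the full list via stopwords.words('english').
DEMO_STOP_WORDS = {"i", "am", "so", "of", "this", "more", "at"}

# Made-up sample post, not taken from the dataset
raw = "I am SO tired of this!!! Read more at https://example.com/post?id=1 :("

# Sanitization: remove URLs, then anything that is not alphanumeric or whitespace
no_urls = re.sub(r"http\S+", "", raw)
no_special = re.sub(r"[^A-Za-z0-9\s]+", "", no_urls)

# Normalization: lowercase, then drop stopwords token by token
lowered = no_special.lower()
tokens = [word for word in lowered.split() if word not in DEMO_STOP_WORDS]
cleaned = " ".join(tokens)

print(cleaned)
```

Because the final step splits on whitespace and rejoins with single spaces, the gaps left behind by removed URLs and symbols disappear as well.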

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

In [2]:
nltk.download('stopwords')
STOP_WORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sem_w\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
def clean_data(df):
    """Sanitize raw dataframe
    
    Args:
        df: Raw dataframe whose first column is a saved index, followed by
            'text' and 'class' columns. Modified in place.
        
    Returns:
        The same dataframe, cleaned and ready for preprocessing
    """
    df.drop(columns=df.columns[0], inplace=True)  # Drop the first column (saved index)
    df.dropna(inplace=True)  # Drop rows with missing values
    df['text'] = df['text'].apply(remove_urls)  # Remove URLs
    df['text'] = df['text'].apply(clean_special_chars)  # Remove non-alphanumeric characters
    return df

In [4]:
def remove_urls(text):
    """Remove URLs using regex pattern matching"""
    return re.sub(r'http\S+', '', text)

In [5]:
def clean_special_chars(text):
    """Retain only alphanumeric characters and whitespace"""
    return re.sub(r'[^A-Za-z0-9\s]+', '', text)

In [6]:
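def preprocess_text(df):
    """Lowercase the text and remove NLTK English stopwords"""
    def normalize(text):
        # Split on whitespace, drop stopwords, rejoin with single spaces
        return ' '.join(word for word in text.lower().split() if word not in STOP_WORDS)
    df['text'] = df['text'].apply(normalize)
    return df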

In [7]:
def prepare_data(file_path):
    """End-to-end data preparation
    
    Args:
        file_path: Path to the raw CSV file
    
    Returns:
        X_train, X_test, y_train, y_test: Stratified 80/20 splits
    """
    df = pd.read_csv(file_path)  # Load raw data
    df = clean_data(df)  # Sanitization
    df = preprocess_text(df)  # Normalization
    # 80/20 split; stratify keeps the class ratio the same in both subsets
    X_train, X_test, y_train, y_test = train_test_split(
        df['text'], df['class'], test_size=0.2, random_state=42, stratify=df['class']
    )
    return X_train, X_test, y_train, y_test
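
The docstring of `prepare_data` promises stratified splits. In scikit-learn this is requested through the `stratify` argument of `train_test_split`. The sketch below shows the effect on a toy, deliberately imbalanced label list; the texts and the labels `"a"` and `"b"` are hypothetical placeholders, not values from the Kaggle dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder texts with an 80/20 class imbalance (hypothetical labels)
texts = [f"post {i}" for i in range(100)]
labels = ["a"] * 80 + ["b"] * 20

# Same settings as the notebook (test_size=0.2, random_state=42), plus stratify
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)

# Both subsets keep the original 4:1 ratio between "a" and "b"
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without `stratify`, the split is purely random, so the class ratio in the test set can drift away from the ratio in the full data, which matters most when one class is rare.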