Skip to content

Advanced NLP model for detecting suicidal ideation in online texts, employing machine learning for early mental health intervention

Notifications You must be signed in to change notification settings

ywan3223/SucideIntentAnalysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Suicide Intent Analysis - Natural Language Process Project

This project develops machine learning model to analyze online textual data for suicide risk assessment. Utilizing datasets derived from Reddit forums, the study evaluates the efficacy of various algorithms, including K-Neighbors Classifier, Multinomial Naïve Bayes, and Logistic Regression, to discern suicidal ideation. The optimal model demonstrated a test accuracy of 69.18% with an AUC score of 0.77, highlighting the potential of natural language processing in mental health diagnostics and indicating avenues for future refinement through advanced learning methodologies.

Dataset

The dataset is from the Reddit website "Peer support for anyone struggling with suicidal thoughts" community and "depression, because nobody should be alone in a dark place" community. There are 1957 items in the dataset that have been manually classified with 6 variables. The training dataset has 7613 observations, whereas the test dataset contains the remaining observations. cmt

Methodology(Natural Language Processing)

  • Pre-processing
    Data preprocessing is a critical step, ensuring text data's consistent and digestibility for NLP model construction. Techniques include converting text to lowercase for consistency, tokenizing characters into discrete components, verb lemmatization for meaningful analysis, removing stopwords and punctuation to focus on relevant information. This phase transforms raw text into a structured format, enabling precise subsequent analysis​.

  • Text Analysis
    Text analysis employs vectorization methods such as CountVectorize, TF-IDF transfer, and HashingVectorize to translate filtered text into numerical vector data. CountVectorize quantifies word occurrences, TF-IDF adjusts word counts based on their document frequency to highlight important terms, and HashingVectorize maps words to numerical amount, optimizing memory usage. These methodologies convert text into fixed-length feature vectors suitable for machine learning algorithms​.

  • Models and Algorithms
    This study explores Multinomial Naive Bayes, K-Neighbors Classifier, and Logistic Regression to predict suicidal ideation from online texts. Multinomial Naive Bayesanalyzes word frequency distributions, K-Neighbors Classifier uses nearest data points for classification, and Logistic Regression applies a logistic function to estimate probabilities. The selection process involved evaluating each model's accuracy and AUC score, with adjustments for data normalization in optimization models like KNN and LR to enhance performance

Result

Results demonstrated the efficacy of various vectorization and optimization models in identifying suicidal ideation from online texts. MultiNB models with TfidVectorizer showed the highest performance and probability of distinguishing depression or suicidal ideation using this model reached 69.18% test accuracy with AUC score 0.77.

AUC score Train Accuracy Test Accuracy
cvec+ multi_nb 0.717 0.682 0.628
tvec + multi_nb 0.754 0.687 0.651
hvec + multi_nb 0.807 0.766 0.659

cmt

Application

The project aims to enhance mental health diagnosis by analyzing suicidal ideation in online texts using advanced NLP techniques. It offers a potential pathway for early intervention and support mechanisms by enabling the timely identification of individuals at risk, thereby contributing to preventative mental health care strategies.

About

Advanced NLP model for detecting suicidal ideation in online texts, employing machine learning for early mental health intervention

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages