Suicide Intent Analysis - Natural Language Processing Project

This project develops machine learning models to analyze online text for suicide risk assessment. Using datasets derived from Reddit forums, it evaluates several algorithms, including K-Neighbors Classifier, Multinomial Naïve Bayes, and Logistic Regression, on their ability to detect suicidal ideation. The best model reached a test accuracy of 69.18% with an AUC score of 0.77, which highlights the potential of natural language processing in mental health diagnostics and points to future refinement through more advanced learning methods.

Dataset

The dataset comes from two Reddit communities: "Peer support for anyone struggling with suicidal thoughts" and "depression, because nobody should be alone in a dark place". It contains 1957 items that have been manually classified with 6 variables. The training dataset has 7613 observations, and the test dataset contains the remaining observations.

Methodology (Natural Language Processing)

Pre-processing
Data preprocessing is a critical step that makes the text consistent and digestible for NLP model construction. Techniques include converting text to lowercase for consistency, tokenizing text into discrete components, lemmatizing verbs for more meaningful analysis, and removing stopwords and punctuation to focus on relevant information. This phase transforms raw text into a structured format that supports precise subsequent analysis.
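
The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's DataCleaning.py: the stopword set and the verb lookup table are tiny hypothetical stand-ins for a full stopword list and a real lemmatizer, and the function name `preprocess` is an assumption.

```python
import re
import string

# Assumption: a tiny illustrative stopword set; a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "i", "to", "of", "my", "it", "in", "be"}

# Assumption: a hand-written lookup standing in for a real verb lemmatizer.
VERB_LEMMAS = {
    "was": "be", "is": "be", "am": "be",
    "feeling": "feel", "felt": "feel",
    "tried": "try", "trying": "try",
}


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, lemmatize verbs, then drop stopwords and punctuation."""
    lowered = text.lower()
    # Words become one token each; every punctuation mark becomes its own token.
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    lemmas = [VERB_LEMMAS.get(tok, tok) for tok in tokens]
    return [
        tok for tok in lemmas
        if tok not in STOPWORDS
        and not all(ch in string.punctuation for ch in tok)
    ]


print(preprocess("I was feeling lost, and I tried to talk."))
```

Lemmatizing before the stopword filter means inflected forms such as "was" are reduced to "be" first and can then be removed by a single stopword entry.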
Text Analysis
Text analysis uses vectorization methods such as CountVectorizer, TF-IDF weighting, and HashingVectorizer to translate the filtered text into numerical vectors. CountVectorizer counts word occurrences, TF-IDF reweights word counts by their document frequency to highlight important terms, and HashingVectorizer maps words to column indices with a hash function, which reduces memory usage because no vocabulary has to be stored. These methods convert text into fixed-length feature vectors suitable for machine learning algorithms.
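
To make the three representations concrete, here is a dependency-free sketch of what each one computes on a few invented token lists. It is an illustration under stated assumptions: the documents, the bucket count `N_BUCKETS`, and the use of `zlib.crc32` as the hash are all hypothetical choices, and library implementations differ in details such as tokenization, smoothing, and hash function.

```python
import math
import zlib
from collections import Counter

# Assumption: three invented, already-preprocessed documents.
docs = [
    ["feel", "alone", "tonight"],
    ["feel", "tired", "feel", "empty"],
    ["talk", "friend", "tonight"],
]

# Count vectors: one column per vocabulary word, value = occurrences.
vocab = sorted({tok for doc in docs for tok in doc})


def count_vector(doc):
    counts = Counter(doc)
    return [counts[word] for word in vocab]


count_matrix = [count_vector(doc) for doc in docs]

# TF-IDF: scale each count by a smoothed inverse document frequency,
# then L2-normalize the row so document length matters less.
n_docs = len(docs)
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}


def tfidf_vector(doc):
    raw = [count * idf[word] for count, word in zip(count_vector(doc), vocab)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


# Hashing: no vocabulary is stored; each token is hashed straight to a bucket.
N_BUCKETS = 8


def hashed_vector(doc, n_buckets=N_BUCKETS):
    vec = [0] * n_buckets
    for tok in doc:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec


print(vocab)
print(count_matrix[1])
print([round(x, 3) for x in tfidf_vector(docs[1])])
print(hashed_vector(docs[1]))
```

The hashed vector has a fixed width regardless of how many distinct words appear, at the cost that two different words can collide in the same bucket.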
Models and Algorithms
This study explores Multinomial Naive Bayes, K-Neighbors Classifier, and Logistic Regression to predict suicidal ideation from online texts. Multinomial Naive Bayes analyzes word frequency distributions, K-Neighbors Classifier classifies a text by its nearest data points, and Logistic Regression applies a logistic function to estimate probabilities. Model selection was based on each model's accuracy and AUC score, with data normalization applied to optimization models such as KNN and LR to enhance performance.
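
A comparison of this kind could look roughly like the following scikit-learn sketch. It is not the project's modelling.py: the texts and labels are invented placeholders, the hyperparameters (`n_neighbors=3`, `max_iter=1000`) are assumptions, and only one vectorizer is shown. TfidfVectorizer L2-normalizes each row by default, which here stands in for the normalization step mentioned for KNN and LR.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: invented placeholder posts; 1 = first community, 0 = second.
train_texts = [
    "i cannot see any way forward anymore",
    "i keep thinking about ending everything",
    "nothing matters and i want it to stop",
    "i do not want to be here anymore",
    "i feel low and tired all week",
    "i have no energy to get out of bed",
    "feeling sad and unmotivated at work again",
    "my mood has been low since the winter started",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "i want everything to stop",
    "i cannot be here anymore",
    "so tired and low on energy",
    "sad mood at work this week",
]
test_labels = [1, 1, 0, 0]

# Assumption: hyperparameters chosen only so the toy example runs.
models = {
    "multi_nb": MultinomialNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logreg": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in models.items():
    # Vectorizer and classifier are fitted together on the training split only.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_texts, train_labels)
    positive_proba = pipe.predict_proba(test_texts)[:, 1]
    scores[name] = {
        "accuracy": accuracy_score(test_labels, pipe.predict(test_texts)),
        "auc": roc_auc_score(test_labels, positive_proba),
    }

for name, result in scores.items():
    print(name, result)
```

Wrapping the vectorizer and classifier in one pipeline keeps vocabulary and IDF statistics from leaking from the test split into training.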

Result

Results demonstrated the efficacy of various vectorization and optimization models in identifying suicidal ideation from online texts. Multinomial Naive Bayes (multi_nb) models with TfidfVectorizer showed the highest performance: in distinguishing depression and suicidal ideation, this model reached 69.18% test accuracy with an AUC score of 0.77.

Model	AUC score	Train Accuracy	Test Accuracy
cvec + multi_nb	0.717	0.682	0.628
tvec + multi_nb	0.754	0.687	0.651
hvec + multi_nb	0.807	0.766	0.659
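
For readers unfamiliar with the two metrics in the table, the sketch below computes them by hand on invented labels and scores. The data and the 0.5 decision threshold are assumptions for illustration; none of these numbers come from the project. AUC is computed here as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumption: invented labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

print(round(accuracy(y_true, y_pred), 3))
print(round(auc(y_true, scores), 3))
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the two columns can order models differently.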

Application

The project aims to enhance mental health diagnosis by analyzing suicidal ideation in online texts using advanced NLP techniques. It offers a potential pathway for early intervention and support mechanisms by enabling the timely identification of individuals at risk, thereby contributing to preventative mental health care strategies.

ywan3223/SucideIntentAnalysis
