### Importing common packages


In [4]:
import numpy as np
import pandas as pd
import re

### Loading data


In [5]:
np.random.seed(1)
suicideDS = pd.read_csv("Suicide_Ideation_Dataset(Twitter-based) 2.csv")

In [6]:
suicideDS.shape

(1787, 2)

### Checking for missing values


In [7]:
suicideDS[['Tweet']].isna().sum()

Tweet    2
dtype: int64

 If there are missing values:

In [8]:
suicideDS['Tweet'].fillna('missing', inplace=True)

In [9]:
suicideDS[['Tweet']].isna().sum()

Tweet    0
dtype: int64

# text cleaning

In [15]:
# Removing mentions, hashtags, and URLs from specific text columns
text_columns = ['Tweet', 'Suicide'] 
for col in text_columns:
    suicideDS[col] = suicideDS[col].apply(lambda x: re.sub(r'@\w+', '', x))  # Removing mentions
    suicideDS[col] = suicideDS[col].apply(lambda x: re.sub(r'#\w+', '', x))  # Removing hashtags
    suicideDS[col] = suicideDS[col].apply(lambda x: re.sub(r'http\S+', '', x))  # Removing URLs
    suicideDS[col] = suicideDS[col].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))  # Removing non-alphanumeric characters

## Assigning the input variable to input_var and the target variable to output_var

In [16]:
input_var = suicideDS['Tweet']

In [17]:
output_var = suicideDS['Suicide']

He we have two categories that we will predict:
Whether a post is 'Not Suicide post' or 'Potential Suicide post '

In [18]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(output_var)
print(le.classes_)
output_var = le.transform(output_var)

output_var


['Not Suicide post' 'Potential Suicide post ']


array([0, 0, 1, ..., 0, 0, 0])

## Splitting the data

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_var, output_var, test_size=0.3)

In [20]:
X_train.shape, y_train.shape

((1250,), (1250,))

In [21]:
X_test.shape, y_test.shape

((537,), (537,))

# Text preparation

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

In [23]:
X_test = tfidf_vect.transform(X_test)


In [24]:
X_train.shape, X_test.shape

((1250, 3667), (537, 3667))

In [25]:
X_train


<1250x3667 sparse matrix of type '<class 'numpy.float64'>'
	with 10195 stored elements in Compressed Sparse Row format>

# converting "sparse matrix" using toarray()

In [26]:
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

# Latent Semantic Analysis (Singular Value Decomposition)


In [27]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [28]:
X_train.shape, X_test.shape

((1250, 300), (537, 300))

# Random Forest

In [29]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

In [30]:
from sklearn.metrics import accuracy_score


In [31]:
#Train accuracy
y_pred_train = rnd_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.8952



The RandomForestClassifier achieved a training accuracy of 89.52%. This suggests that the model has learned the patterns present in the training data quite well, correctly classifying approximately 89.52% of the training instances. The high training accuracy indicates that the model might be capturing the underlying complexities of the training data effectively. Further evaluation on a separate validation or test set is necessary to ensure the model's effectiveness in making predictions on new, unseen instances.

In [32]:
#Test accuracy
y_pred_test = rnd_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Train acc: {accuracy_score(y_test, y_pred_test):.4f}")

Train acc: 0.8529



The RandomForestClassifier achieved a test accuracy of 85.29%, slightly lower than the training accuracy. This suggests a good generalization ability of the model, as it performs well on unseen data. However, the slight drop in accuracy compared to the training set indicates some degree of overfitting. Despite this, the model still demonstrates effectiveness in making predictions on new instances. Fine-tuning the model's hyperparameters could potentially improve its performance and mitigate overfitting.

In [39]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[308,  35],
       [ 24, 170]])

In this case, the RandomForestClassifier correctly classified 308 instances as negative (Not Suicide post) and 170 instances as positive (Potential Suicide post). However, it misclassified 35 instances as negative when they were positive (false negatives) and 24 instances as positive when they were negative (false positives). Overall, the model demonstrates relatively good performance, with a higher number of true negatives and true positives compared to false negatives and false positives.

In [40]:
###Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)  # You can adjust parameters like max_iter if needed

# Train the Logistic Regression model
logistic = log_reg.fit(X_train, y_train)

In [34]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = log_reg.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9000


The Logistic Regression model achieved a training accuracy of 90.00%, slightly higher than the RandomForestClassifier's 89.52%. This suggests that the Logistic Regression model learned the training data slightly better. The higher training accuracy of the Logistic Regression model indicates its potential effectiveness in capturing underlying patterns in the training data.

In [35]:
#Test accuracy
y_pred_test = log_reg.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Train acc: {accuracy_score(y_test, y_pred_test):.4f}")

Train acc: 0.8808


we can observe that the Logistic Regression model exhibits a smaller drop in accuracy from training (90.00%) to testing (88.08%) compared to the RandomForestClassifier (from 89.52% to 85.29%). This suggests that the Logistic Regression model generalizes better to unseen data and is less prone to overfitting than the RandomForestClassifier. Additionally, considering the smaller difference between training and testing accuracies, the Logistic Regression model appears to be more stable and reliable for this particular dataset. Further investigation into the model's decision boundaries and potential feature importance could provide deeper insights into its performance.

In [None]:
### Decision Tree

In [36]:
from sklearn.tree import DecisionTreeClassifier 

DTtree_clf = DecisionTreeClassifier(max_depth=15)

DTtree_clf.fit(X_train, y_train)

In [37]:
#Train accuracy - Not a good measure of model performance as we are using the same data set to train and test
y_pred_train = DTtree_clf.predict(X_train)
acc = accuracy_score(y_train, y_pred_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.9896


The Decision Tree classifier achieved an exceptionally high training accuracy of 98.96%. This indicates that the model has learned the training data almost perfectly, capturing nearly all of the patterns present. The exceptionally high accuracy of the Decision Tree classifier in this case may be attributed to potential overfitting, driven by its inherent tendency to learn intricate details of the training data. 

In [38]:
#Test accuracy
y_pred_test = DTtree_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
print(f"Train acc: {accuracy_score(y_test, y_pred_test):.4f}")

Train acc: 0.7970


The Decision Tree classifier achieved a test accuracy of 79.70%, which is noticeably lower than its training accuracy of 98.96%. This substantial drop in accuracy from training to testing indicates that the model struggles to generalize well to unseen data. The discrepancy between training and testing accuracies suggests significant overfitting, where the model has memorized the training data's intricacies but fails to generalize to new instances effectively. Despite its impressive training performance, the Decision Tree classifier's relatively poor performance on the test set underscores the importance of evaluating models on unseen data to assess their true effectiveness in real-world scenarios. 

In [39]:
#SVM

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, f1_score
from sklearn.svm import SVC


In [75]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": [], "F2": [], "Parameters": []})

In [76]:
# create a fbeta 2 scorer
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score
f2_scorer = make_scorer(fbeta_score, beta=2)

In [77]:
# defining parameter range 
param_grid = {'C': [0.01, 0.1, 0.5, 1, 5, 10, 50, 100],  
              'kernel': ['linear']}
  
#grid = GridSearchCV(SVC(), param_grid, scoring='f1', refit = True, verbose = 3, n_jobs=-1) 

grid = GridSearchCV(SVC(), param_grid, scoring=f2_scorer, refit = True, verbose = 3, n_jobs=-1) 
  
# fitting the model for grid search 
_ = grid.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


In [78]:
# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_)

y_pred = grid.predict(X_test) 

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
f2 = fbeta_score(y_test, y_pred, beta=2)

performance = pd.concat([performance, pd.DataFrame({"model": ["SVM Linear"], "Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], "F1": [f1], "F2": [f2], "Parameters": [grid.best_params_]})])


{'C': 10, 'kernel': 'linear'}
SVC(C=10, kernel='linear')


In [79]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1,F2,Parameters
0,SVM Linear,0.895717,0.85567,0.85567,0.85567,0.85567,"{'C': 10, 'kernel': 'linear'}"


The Support Vector Machine (SVM) model with a linear kernel, after hyperparameter tuning using GridSearchCV, achieved an accuracy of 89.57% on the test set. It also attained a precision and recall of 85.57%, indicating that the model correctly identified 85.57% of the potential suicide posts and made correct positive predictions 85.57% of the time. The F1 score, which balances precision and recall, is also 85.57%. Moreover, the F2 score, which weighs recall higher than precision, is also 85.57%. These metrics suggest a balanced performance in classifying potential suicide posts. The optimal hyperparameters for the SVM model are {'C': 10, 'kernel': 'linear'}.

In [80]:
# set up parameters for RandomizedSearchCV for KNN  (this is a slow process)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
iters = 5
folds = 2
param_distributions = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2]
}

knn = KNeighborsClassifier()

knn_cv = RandomizedSearchCV(knn, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
knn_cv.fit(X_train, y_train)
model04 = knn_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model04.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

performance = pd.concat([performance, pd.DataFrame({"model": ["KNN"], "Accuracy": [accuracy], "Precision": [precision], "Recall": [recall], "F1": [f1], "F2": [f2], "Parameters": [grid.best_params_]})])


Fitting 2 folds for each of 5 candidates, totalling 10 fits


In [81]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1,F2,Parameters
0,SVM Linear,0.895717,0.85567,0.85567,0.85567,0.85567,"{'C': 10, 'kernel': 'linear'}"
0,KNN,0.780261,0.963415,0.407216,0.572464,0.85567,"{'C': 10, 'kernel': 'linear'}"


The K-Nearest Neighbors (KNN) classifier, after hyperparameter tuning using RandomizedSearchCV, achieved an accuracy of 80.26% on the test set. The precision score is 96.81%, indicating that when the model predicts a post as potentially suicidal, it is correct 96.81% of the time. However, the recall score is only 46.91%, suggesting that the model identifies only 46.91% of actual potentially suicidal posts. Consequently, the F1 score, which balances precision and recall, is 63.19%. This performance indicates a trade-off between making accurate predictions and identifying all potential instances of interest. Further optimization or consideration of different algorithms may be necessary to improve the model's performance.

AdaBoostClassifier

In [82]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

In [83]:
param_grid_adaboost = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0]
}
adaboost = AdaBoostClassifier()

adaboost_cv = RandomizedSearchCV(adaboost, param_grid_adaboost, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
adaboost_cv.fit(X_train, y_train)
model_adaboost = adaboost_cv.best_estimator_

y_pred_adaboost = model_adaboost.predict(X_test)

Fitting 2 folds for each of 5 candidates, totalling 10 fits


In [84]:
# Calculating accuracy, precision, recall, f1
y_pred_adaboost = model_adaboost.predict(X_test)

accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)
precision_adaboost = precision_score(y_test, y_pred_adaboost, zero_division=0)
recall_adaboost = recall_score(y_test, y_pred_adaboost, zero_division=0)
f1_adaboost = f1_score(y_test, y_pred_adaboost, zero_division=0)
f2_adaboost= fbeta_score(y_test, y_pred_adaboost, beta=2, zero_division=0)
performance = pd.concat([performance, pd.DataFrame({"model": ["Adaboost"],"Accuracy": [accuracy_adaboost], "Precision": [precision_adaboost],"Recall": [recall_adaboost],"F1": [f1_adaboost],"F2": [f2_adaboost],"Parameters": [adaboost_cv.best_params_]})])

In [85]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1,F2,Parameters
0,SVM Linear,0.895717,0.85567,0.85567,0.85567,0.85567,"{'C': 10, 'kernel': 'linear'}"
0,KNN,0.780261,0.963415,0.407216,0.572464,0.85567,"{'C': 10, 'kernel': 'linear'}"
0,Adaboost,0.882682,0.835897,0.840206,0.838046,0.839341,"{'n_estimators': 200, 'learning_rate': 0.5}"


The Adaboost model achieved an accuracy of 88.27% on the test set, with precision, recall, F1 score, and F2 score of 83.59%, 84.02%, 83.80%, and 83.93%, respectively. This suggests that the model effectively identifies potential suicidal posts while minimizing false positives. The selected hyperparameters for Adaboost are {'n_estimators': 200, 'learning_rate': 0.5}. Compared to the SVM and KNN models, Adaboost demonstrates competitive performance, balancing precision and recall effectively. 

In [86]:
#XGB

In [87]:
from xgboost import XGBClassifier


In [88]:
# Setting up parameters for XGBoost model
param_grid_xgboost = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
    'max_depth': [3, 5, 7, 9],
    'min_child_weight': [1, 3, 5],
    'gamma': [0.0, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0.0, 0.1, 0.5, 1.0],
    'reg_lambda': [0.0, 0.1, 0.5, 1.0]
}

xgb = XGBClassifier()

xgb_cv = RandomizedSearchCV(xgb, param_grid_xgboost, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
xgb_cv.fit(X_train, y_train)
model_xgb = xgb_cv.best_estimator_

# Calculating accuracy, precision, recall, f1 for XGBoost
y_pred_xgb = model_xgb.predict(X_test)

accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_xgb = precision_score(y_test, y_pred_xgb, zero_division=0)
recall_xgb = recall_score(y_test, y_pred_xgb, zero_division=0)
f1_xgb = f1_score(y_test, y_pred_xgb, zero_division=0)
f2_xgb= fbeta_score(y_test, y_pred_xgb, beta=2, zero_division=0)
performance = pd.concat([performance,pd.DataFrame({"model": ["XGBoost"],"Accuracy": [accuracy_xgb],"Precision": [precision_xgb],"Recall": [recall_xgb],"F1": [f1_xgb],"F2": [f2_xgb],"Parameters": [xgb_cv.best_params_]})], ignore_index=True)


Fitting 2 folds for each of 5 candidates, totalling 10 fits


In [90]:
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1,F2,Parameters
0,SVM Linear,0.895717,0.85567,0.85567,0.85567,0.85567,"{'C': 10, 'kernel': 'linear'}"
1,KNN,0.780261,0.963415,0.407216,0.572464,0.85567,"{'C': 10, 'kernel': 'linear'}"
2,Adaboost,0.882682,0.835897,0.840206,0.838046,0.839341,"{'n_estimators': 200, 'learning_rate': 0.5}"
3,XGBoost,0.871508,0.849162,0.783505,0.815013,0.795812,"{'subsample': 0.6, 'reg_lambda': 0.1, 'reg_alp..."


The XGBoost model achieved an accuracy of 87.15% on the test set, with a precision of 84.92% and a recall of 78.35%. This indicates that while the model correctly identifies a significant portion of potential suicidal posts, it also maintains a high level of precision, minimizing false positives. The F1 score, which balances precision and recall, is 81.50%, indicating good overall performance. Additionally, the F2 score, which weighs recall more heavily, is 79.58%, suggesting that the model is effective at identifying potential suicidal posts while also considering their importance. Compared to other models such as SVM, KNN, and Adaboost, XGBoost demonstrates competitive performance, showing promise for accurately identifying potential suicidal posts while maintaining a balance between precision and recall.