#Project: Mining Mental Health Patterns in User Tweets and Resource Allocation via Sentiment Analysis, Topic Modeling and Transfer Learning


**IST 736 : Text Mining under Prof. Bei Yu**

 **Collaborators: Aditi Pala, Yashaswini Kulkarni, Viha Mashruwala**




In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.cm as cm
from matplotlib import rcParams
from collections import Counter
from nltk.tokenize import RegexpTokenizer
import re
import string
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")




In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Import Sentiment data

In [3]:
train160 = pd.read_csv('/content/drive/My Drive/Text Mining Project/training.1600000.processed.noemoticon.csv', encoding = "ISO-8859-1", engine="python")

In [4]:
train160.head()

Unnamed: 0.1,Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,its just over years since i was diagnosed with...,1013187241,84,211,251,837,0,1
1,1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,its sunday i need a break so im planning to sp...,1013187241,84,211,251,837,1,1
2,2,637749345908051968,Sat Aug 29 22:11:07 +0000 2015,awake but tired i need to sleep but my brain h...,1013187241,84,211,251,837,0,1
3,3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,rt retro bears make perfect gifts and are grea...,1013187241,84,211,251,837,2,1
4,4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,its hard to say whether packing lists are maki...,1013187241,84,211,251,837,1,1


In [5]:
train160.shape

(20000, 11)

Checking for data imbalance:

In [6]:

query_counts = train160['label'].value_counts()
print(query_counts)

1    10000
0    10000
Name: label, dtype: int64


Conclusion : Data is balanced for both positive and negative labels

In [7]:
train160 = train160[['post_text','label']]

In [8]:
train160.columns=['text','label']

In [9]:
train160.head()

Unnamed: 0,text,label
0,its just over years since i was diagnosed with...,1
1,its sunday i need a break so im planning to sp...,1
2,awake but tired i need to sleep but my brain h...,1
3,rt retro bears make perfect gifts and are grea...,1
4,its hard to say whether packing lists are maki...,1


Assigning 1 to positive sentiment 4 for consistency

In [10]:

train160['label'][train160['label']==4]=1

#Seperating positive and negative tweets
train160_pos = train160[train160['label'] == 1]
train160_neg = train160[train160['label'] == 0]


# Data Preprocessing - Senti140 Dataset

In [11]:
#Making statement text in lower case
train160['text']= train160['text'].str.lower()

In [12]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Get the list of English stopwords
stopwords_list = stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [13]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [14]:
#Cleaning and removing the above stop words list from the tweet text
STOPWORDS = set(stopwords.words('english'))
def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
train160['text'] = train160['text'].apply(lambda text: cleaning_stopwords(text))
train160['text'].head()

0    years since diagnosed anxiety depression today...
1    sunday need break im planning spend little tim...
2                   awake tired need sleep brain ideas
3    rt retro bears make perfect gifts great beginn...
4    hard say whether packing lists making life eas...
Name: text, dtype: object

In [15]:
#Cleaning and removing punctuations
english_punctuations = string.punctuation
punctuations_list = english_punctuations
def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

train160['text']= train160['text'].apply(lambda x: cleaning_punctuations(x))
train160['text'].head()

0    years since diagnosed anxiety depression today...
1    sunday need break im planning spend little tim...
2                   awake tired need sleep brain ideas
3    rt retro bears make perfect gifts great beginn...
4    hard say whether packing lists making life eas...
Name: text, dtype: object

In [16]:
#Cleaning and removing repeating characters
def cleaning_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)

train160['text'] = train160['text'].apply(lambda x: cleaning_repeating_char(x))
train160['text'].head()

0    years since diagnosed anxiety depresion today ...
1    sunday ned break im planing spend litle time p...
2                     awake tired ned slep brain ideas
3    rt retro bears make perfect gifts great begine...
4    hard say whether packing lists making life eas...
Name: text, dtype: object

In [17]:
#Cleaning and removing email
def cleaning_email(train160):
    return re.sub('@[^\s]+', ' ', train160)

train160['text']= train160['text'].apply(lambda x: cleaning_email(x))
train160['text'].tail()

19995                      day without sunshine like night
19996    borens laws charge ponder trouble delegate dou...
19997    flow chart thoroughly oversold piece program d...
19998                   ships safe harbor never meant stay
19999                        black holes god dividing zero
Name: text, dtype: object

In [18]:
#Cleaning and removing URL's
def cleaning_URLs(train160):
    return re.sub('((www\.[^\s]+)|(https?://[^\s]+))',' ',train160)

train160['text'] = train160['text'].apply(lambda x: cleaning_URLs(x))
train160['text'].tail()

19995                      day without sunshine like night
19996    borens laws charge ponder trouble delegate dou...
19997    flow chart thoroughly oversold piece program d...
19998                   ships safe harbor never meant stay
19999                        black holes god dividing zero
Name: text, dtype: object

In [19]:
#Cleaning and removing Numeric numbers
def cleaning_numbers(train160):
    return re.sub('[0-9]+', '', train160)
train160['text'] = train160['text'].apply(lambda x: cleaning_numbers(x))
train160['text'].tail()

19995                      day without sunshine like night
19996    borens laws charge ponder trouble delegate dou...
19997    flow chart thoroughly oversold piece program d...
19998                   ships safe harbor never meant stay
19999                        black holes god dividing zero
Name: text, dtype: object

In [20]:
#Getting tokenization of tweet text
tokenizer = RegexpTokenizer(r'\w+')
train160['tokens'] = train160['text'].apply(tokenizer.tokenize)
train160['tokens'].head()

0    [years, since, diagnosed, anxiety, depresion, ...
1    [sunday, ned, break, im, planing, spend, litle...
2              [awake, tired, ned, slep, brain, ideas]
3    [rt, retro, bears, make, perfect, gifts, great...
4    [hard, say, whether, packing, lists, making, l...
Name: tokens, dtype: object

In [21]:
#Applying Lemmatizer
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')


lm = nltk.WordNetLemmatizer()
def lemmatizer_on_text(train160):
    text = [lm.lemmatize(word) for word in train160]
    return train160

train160['tokens'] = train160['tokens'].apply(lambda x: lemmatizer_on_text(x))
train160['tokens'].head()

[nltk_data] Downloading package wordnet to /root/nltk_data...


0    [years, since, diagnosed, anxiety, depresion, ...
1    [sunday, ned, break, im, planing, spend, litle...
2              [awake, tired, ned, slep, brain, ideas]
3    [rt, retro, bears, make, perfect, gifts, great...
4    [hard, say, whether, packing, lists, making, l...
Name: tokens, dtype: object

In [22]:
train160.head()

Unnamed: 0,text,label,tokens
0,years since diagnosed anxiety depresion today ...,1,"[years, since, diagnosed, anxiety, depresion, ..."
1,sunday ned break im planing spend litle time p...,1,"[sunday, ned, break, im, planing, spend, litle..."
2,awake tired ned slep brain ideas,1,"[awake, tired, ned, slep, brain, ideas]"
3,rt retro bears make perfect gifts great begine...,1,"[rt, retro, bears, make, perfect, gifts, great..."
4,hard say whether packing lists making life eas...,1,"[hard, say, whether, packing, lists, making, l..."


# Import Mental Health Dataset

In [5]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Text Mining Project/Mental-Health-Twitter.csv', encoding = "ISO-8859-1", engine="python")

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,It's just over 2 years since I was diagnosed w...,1013187241,84,211,251,837,0,1
1,1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,"It's Sunday, I need a break, so I'm planning t...",1013187241,84,211,251,837,1,1
2,2,637749345908051968,Sat Aug 29 22:11:07 +0000 2015,Awake but tired. I need to sleep but my brain ...,1013187241,84,211,251,837,0,1
3,3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,RT @SewHQ: #Retro bears make perfect gifts and...,1013187241,84,211,251,837,2,1
4,4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,Itâs hard to say whether packing lists are m...,1013187241,84,211,251,837,1,1


In [25]:
df.isnull().sum() # check for missing values

Unnamed: 0      0
post_id         0
post_created    0
post_text       0
user_id         0
followers       0
friends         0
favourites      0
statuses        0
retweets        0
label           0
dtype: int64

Checking for data imbalance:

In [26]:

query_counts_df = df['label'].value_counts()
print(query_counts_df)

1    10000
0    10000
Name: label, dtype: int64


Conclusion : Data is balanced for both positive and negative labels

Let's focus on the tweets by top 5 users - we do this to find mental health posts that are have high frequency of tweets

In [8]:
import pandas as pd

user_tweet_counts = df.groupby('user_id').size().reset_index(name='tweet_count')

# Get the top 5 users based on tweet count
top_users = user_tweet_counts.nlargest(5, 'tweet_count')

# Combine tweets of top 5 users
top_user_data = pd.concat([df[df['user_id'] == user_id][['user_id', 'post_text']] for user_id in top_users['user_id']], ignore_index=True)

# Display the subset of data for the user
display(top_user_data.head())


Unnamed: 0,user_id,post_text
0,490044008,"Mom: ""So I was checking your credit card state..."
1,490044008,(Pretends I didnt spend 2k a day buying Hearth...
2,490044008,No money no talk meh.
3,490044008,Well now. https://t.co/0r7vfr1sv4
4,490044008,RT @HDDoesGaming: PASS INTO THE IRIS https://t...


In [9]:
# Display the subset of data for the user
display(top_user_data)

Unnamed: 0,user_id,post_text
0,490044008,"Mom: ""So I was checking your credit card state..."
1,490044008,(Pretends I didnt spend 2k a day buying Hearth...
2,490044008,No money no talk meh.
3,490044008,Well now. https://t.co/0r7vfr1sv4
4,490044008,RT @HDDoesGaming: PASS INTO THE IRIS https://t...
...,...,...
7373,1616997456,Overcome Depressive disorders: ADHD plus drugs...
7374,1616997456,Overcome Depression: A scheduled appointment w...
7375,1616997456,Overcome Depression: A scheduled appointment w...
7376,1616997456,Overcome Depression: ARIEL SHARON (1928-2014) ...


There are about 7378 tweets included in top 5 user data

# Data Preprocessing - top 5 users from mental health dataset

In [10]:


import re
import string

# Function to remove links from text
def remove_links(text):
    return re.sub(r'http\S+', '', text)

# Function to remove user names from text
def remove_usernames(text):
    return re.sub(r'@\w+', '', text)

# Function to remove punctuation from text
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Function to remove special characters
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Apply the functions to your 'post_text' column
top_user_data["post_text"] = top_user_data["post_text"].apply(remove_links)
top_user_data["post_text"] = top_user_data["post_text"].apply(remove_usernames)
top_user_data["post_text"] = top_user_data["post_text"].apply(remove_punctuation)
top_user_data["post_text"] = top_user_data["post_text"].apply(remove_special_characters)

# Change all characters in tweets to lower case
top_user_data["post_text"] = top_user_data["post_text"].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Extract 'post_text' as docs
docs = top_user_data['post_text']

In [11]:

# Apply the functions to your 'post_text' column
df["post_text"] = df["post_text"].apply(remove_links)
df["post_text"] = df["post_text"].apply(remove_usernames)
df["post_text"] = df["post_text"].apply(remove_punctuation)
df["post_text"] = df["post_text"].apply(remove_special_characters)

# Change all characters in tweets to lower case
df["post_text"] = df["post_text"].apply(lambda x: " ".join(x.lower() for x in x.split()))



In [7]:
df.shape

(20000, 11)

In [31]:
top_user_data.head()

Unnamed: 0,user_id,post_text
0,490044008,mom so i was checking your credit card stateme...
1,490044008,pretends i didnt spend k a day buying hearthst...
2,490044008,no money no talk meh
3,490044008,well now
4,490044008,rt pass into the iris


In [32]:
top_user_data

Unnamed: 0,user_id,post_text
0,490044008,mom so i was checking your credit card stateme...
1,490044008,pretends i didnt spend k a day buying hearthst...
2,490044008,no money no talk meh
3,490044008,well now
4,490044008,rt pass into the iris
...,...,...
7373,1616997456,overcome depressive disorders adhd plus drugs ...
7374,1616997456,overcome depression a scheduled appointment wi...
7375,1616997456,overcome depression a scheduled appointment wi...
7376,1616997456,overcome depression ariel sharon isracast


In [33]:
print(top_user_data.columns)
print(df.columns)

Index(['user_id', 'post_text'], dtype='object')
Index(['Unnamed: 0', 'post_id', 'post_created', 'post_text', 'user_id',
       'followers', 'friends', 'favourites', 'statuses', 'retweets', 'label'],
      dtype='object')


# Topic Modeling

using BERTopic

In [34]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.0-py2.py3-none-any.whl (154 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m154.1/154.1 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m30.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hcanceled
[31mERROR: Operation cancelled by user[0m[31m
[0mTraceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
    status = run_func(*args)
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/cli/req_command.py", line 242, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.10/dist-packages/pip/_internal/commands/install.py", line 377, in run
    requirement_set = resolver.resolve(
  File "/usr

In [35]:
!pip show joblib

Name: joblib
Version: 1.3.2
Summary: Lightweight pipelining with Python functions
Home-page: 
Author: 
Author-email: Gael Varoquaux <gael.varoquaux@normalesup.org>
License: BSD 3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: hdbscan, imbalanced-learn, librosa, mlxtend, music21, nltk, pynndescent, scikit-learn


###**BERTopic**

Training model on Mental Health top 5 user data to research mental health related keywords

In [36]:

from bertopic import BERTopic

topic_model = BERTopic(embedding_model="all-MiniLM-L12-v2", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2023-12-12 22:18:39,332 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/231 [00:00<?, ?it/s]

2023-12-12 22:21:07,605 - BERTopic - Embedding - Completed ✓
2023-12-12 22:21:07,608 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-12 22:21:57,598 - BERTopic - Dimensionality - Completed ✓
2023-12-12 22:21:57,605 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-12 22:22:13,453 - BERTopic - Cluster - Completed ✓
2023-12-12 22:22:13,482 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-12 22:22:14,562 - BERTopic - Representation - Completed ✓


In [37]:
freq = topic_model.get_topic_info();
num_topics = len(freq) -1
print(num_topics)
freq.head(num_topics)

120


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2624,-1_the_to_and_it,"[the, to, and, it, this, in, that, of, you, rt]","[how can people even put up with me, its like ..."
1,0,604,0_user_account_gander_tabletsrental,"[user, account, gander, tabletsrental, allfxbr...","[user talk, user, user]"
2,1,330,1_following_twitter_hello_thank,"[following, twitter, hello, thank, say, anytim...",[is now following me on twitter thank you say ...
3,2,230,2_rt_this_crying_password,"[rt, this, crying, password, im, boom, so, my,...","[rt oh my god this is me and im so sorry, rt i..."
4,3,223,3_depression_treatments_overcome_treatment,"[depression, treatments, overcome, treatment, ...",[healthy living finding new treatments for dep...
...,...,...,...,...,...
115,114,11,114_dare_jerk_fuck_portion,"[dare, jerk, fuck, portion, hell, cares, how, ...","[ok how dare you, how dare you, how dare you]"
116,115,11,115_repair_phone_clickaway_verizon,"[repair, phone, clickaway, verizon, redwood, s...",[clickaway redwood city verizon store phone re...
117,116,11,116_vent_dryer_cleaning_dora,"[vent, dryer, cleaning, dora, mount, rosa, gra...","[mount dora dryer vent cleaning, mount dora dr..."
118,117,11,117_publish_place_waystoovercomeyourdepression...,"[publish, place, waystoovercomeyourdepression,...",[publish place wouldyoufeeldepressedreadthisin...


Clusters from perfoming BERTopic on mental health data

In [38]:
for i in range(num_topics):
  print("\n== Representative documents in cluster #", i)
  print(topic_model.get_representative_docs(i))


== Representative documents in cluster # 0
['user talk', 'user', 'user']

== Representative documents in cluster # 1
['is now following me on twitter thank you say hello any', 'is now following me on twitter thank you say hello any', 'is now following me on twitter thank you say hello anytime i am a real person']

== Representative documents in cluster # 2
['rt oh my god this is me and im so sorry', 'rt is that you', 'rt b and s']

== Representative documents in cluster # 3
['healthy living finding new treatments for depression depression treatments', 'depression treatments', 'depression treatments']

== Representative documents in cluster # 4
['trump quotes putin in tweet slamming clinton democrats has more positive things to say about putin than opponents no', 'putin ordered hacking to help trump intelligence report says via', 'does the word putin mean anything to you of trump voters like putin you know the']

== Representative documents in cluster # 5
['why is the propagandist for 

In [39]:
topic_model.get_topic(0)  # Select the most frequent topic

[('user', 0.03914452219317539),
 ('account', 0.02668260255799135),
 ('gander', 0.02443909698267449),
 ('tabletsrental', 0.02443909698267449),
 ('allfxbrokers', 0.019373239771537244),
 ('petraguardcoatings', 0.019373239771537244),
 ('userasappartsunlimited', 0.019373239771537244),
 ('userhdfworm', 0.013898007974381744),
 ('davistrimming', 0.013898007974381744),
 ('userboisewindowtinting', 0.013898007974381744)]

In [40]:
topic_model.visualize_topics()

From these clusters - we have used ChatGPT to filter mental health keywords from the cluster content

1. Initiated prompts to chatgpt to only take into consideration clusters that talk/relate to/similar to mental health topics

2. From this list of clusters, prompted chatgpt to extract individual keywords from aforemention cluster list that talk/relate to/similar to mental health topics

After filtering done by ChatGPT which was done many times to get higher freq of mental health related keywords from our data

bertopic_keywords = {
  
    2: ['depression', 'depression treatments'],
    7: ['broken people', 'unrepairable', 'fix', 'broken'],
    15: ['talk business adviceconcerningyourfightwithdepression', 'talk business adviceandadviceoncopingwithdepression', 'talk business wouldyoufeeldepressedreadthisinformativearticle'],
    30: ['leave depression behind now with these hints article teller', 'easy things you have to know about depression article teller', 'how to live a depression free life article teller'],
    50: ['circles end depression therapy who had a great winter mental health', 'circles end depression therapy feeling depressed animalassisted therapy could', 'circles end depression therapy'],
    69: ['overcome depressive disorders beating the wintertime blues seasonal affective disorder', 'overcome depression winter season blues ways to enhance mood plus energy', 'winter blues lack of sun often leads to depression my champlain valley fox amp abc overcome depression'],
    91: ['simple relaxation techniques to help reduce stress', 'relaxation techniques to help reduce stress', 'ways to stress less at work'],
    98: ['apps can help relieve stress anxiety sydney morning herald depression treatments', 'overcome depression sleep therapy to treat depression sydney morning herald', 'apps can help relieve stress anxiety sydney morning herald depression treatments'],
    103: ['sing loud changes on swift products for foods for depression', 'sing loud the hard battle howto fight depression', 'sing loud the best methods to overcome depression and be happy']
}

In [12]:
bertopic_keywords = {

2: ['depression', 'depression treatments'],
7: ['broken people', 'unrepairable', 'fix', 'broken'],
15: ['talk business adviceconcerningyourfightwithdepression', 'talk business adviceandadviceoncopingwithdepression', 'talk business wouldyoufeeldepressedreadthisinformativearticle'],
30: ['leave depression behind now with these hints article teller', 'easy things you have to know about depression article teller', 'how to live a depression free life article teller'],
50: ['circles end depression therapy who had a great winter mental health', 'circles end depression therapy feeling depressed animalassisted therapy could', 'circles end depression therapy'],
69: ['overcome depressive disorders beating the wintertime blues seasonal affective disorder', 'overcome depression winter season blues ways to enhance mood plus energy', 'winter blues lack of sun often leads to depression my champlain valley fox amp abc overcome depression'],
91: ['simple relaxation techniques to help reduce stress', 'relaxation techniques to help reduce stress', 'ways to stress less at work'],
98: ['apps can help relieve stress anxiety sydney morning herald depression treatments', 'overcome depression sleep therapy to treat depression sydney morning herald', 'apps can help relieve stress anxiety sydney morning herald depression treatments'],
103: ['sing loud changes on swift products for foods for depression', 'sing loud the hard battle howto fight depression', 'sing loud the best methods to overcome depression and be happy']
}

###**SBERT**

Training model on Mental Health top 5 user data to research mental health related keywords

In [42]:
!pip install -U sentence-transformers



In [43]:
pip install sentence-transformers scikit-learn




In [None]:
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Create SBERT embeddings
embedder = SentenceTransformer('all-MiniLM-L6-v2')
sbert_embeddings = embedder.encode(docs)

# KMeans clustering on SBERT embeddings
K = 10
sbert_model = KMeans(n_clusters=K, random_state=0)
sbert_model.fit(sbert_embeddings)


In [None]:

# Print cluster sizes
def print_cluster_sizes(model):
    cnt_per_cluster = {}
    for c in model.labels_:
        cnt_per_cluster[c] = cnt_per_cluster.get(c, 0) + 1
    print(cnt_per_cluster)

print_cluster_sizes(sbert_model)

In [None]:
def print_docs_closest_to_centroids_sbert(model, vec, n):
    K = len(model.cluster_centers_)

    for j in range(K):
        d = model.transform(vec)[:, j]  # transform all docs to cluster-distance space
        idx = np.argsort(d)[:n]  # find n docs closest to centroid

        c_idx = [i for i, label in enumerate(model.labels_) if label == j]  # find the index of all docs in cluster j
        print('\n\n======cluster #', j, ', cluster size:', len(c_idx))

        for i in idx:
            if i < len(docs) and i in c_idx:
                print(docs.iloc[i])  # Assuming 'docs' is the original DataFrame column
            else:
                print('[Index Error] Doc index out of range or not in the cluster:', i)


Clusters from perfoming SBert on mental health data


In [None]:
# Use the modified function
print_docs_closest_to_centroids_sbert(sbert_model, sbert_embeddings, 5)

From these clusters - we have used ChatGPT to filter mental health keywords from the cluster content

1. Initiated prompts to chatgpt to only take into consideration clusters that talk/relate to/similar to mental health topics

2. From this list of clusters, prompted chatgpt to extract individual keywords that talk/relate to/similar to mental health topics

3. Through several rounds of filtering conducted by ChatGPT on our command, we worked diligently to increase the prevalence of mental health-related keywords within our data.




sbert_keywords = {

    1: ['trust', 'uglier', 'bothers', 'pretty', 'girls', 'nude', 'photoshoots'],
    3: ['lonely', 'hollow', 'purpose', 'grateful'],
    5: ['money', 'buy', 'design', 'save', 'cash', 'order', 'electronics', 'changing', 'colours', 'confused', 'emailed', 'reply', 'checking', 'designs'],
    6: ['lightsabers', 'lightsaber', 'look', 'designs', 'start', 'small', 'buy', 'cheaper', 'save', 'money'],
    9: ['obsessed', 'stalking', 'broke', 'house', 'bam', 'crush', 'serial', 'killer', 'thank', 'god', 'came', 'life'],
}


Taking keywords into consideration

In [13]:
sbert_keywords = {
    1: ['trust', 'uglier', 'bothers', 'pretty', 'girls', 'nude', 'photoshoots'],
    3: ['lonely', 'hollow', 'purpose', 'grateful'],
    5: ['money', 'buy', 'design', 'save', 'cash', 'order', 'electronics', 'changing', 'colours', 'confused', 'emailed', 'reply', 'checking', 'designs'],
    6: ['lightsabers', 'lightsaber', 'look', 'designs', 'start', 'small', 'buy', 'cheaper', 'save', 'money'],
    9: ['obsessed', 'stalking', 'broke', 'house', 'bam', 'crush', 'serial', 'killer', 'thank', 'god', 'came', 'life'],
}

###**LDA**

Training model on Mental Health top 5 user data to research mental health related keywords

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import re
import string
import spacy
from collections import Counter


In [None]:

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))


In [None]:
docs

In [None]:
# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(docs)

In [None]:
no_topics = 20
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [None]:

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=10, learning_method='online', learning_offset=50., random_state=0)
lda_z = lda.fit_transform(tfidf)

no_top_words =  10 # Specify the number of top words for each topic


In [None]:

feature_names = tfidf_vectorizer.get_feature_names_out()

display_topics(lda, feature_names, no_top_words)

# Extract keywords from the topics
feature_names = tfidf_vectorizer.get_feature_names_out()
keywords = []

for topic_idx, topic in enumerate(lda.components_):
    topic_keywords = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
    keywords.extend(topic_keywords)


lda_keywords = {
  
    1: ['depression', 'symptoms'],
    3: ['cry', 'sad', 'fight', 'love'],
    4: ['depression', 'treatments'],
    5: ['sad', 'kids', 'fight', 'love', 'disease', 'like', 'campaign', 'little', 'poor', 'person', 'stories', 'artists', 'darkness', 'heartbreaking', 'upgrade', 'understanding', 'cold', 'life', 'sadly', 'insane', 'spelling', 'begin'],
    6: ['happy', 'wish', 'birthday', 'wonderful', 'intense'],
    7: ['family', 'hell', 'sweet', 'goodnight', 'hope', 'ill'],
    8: ['depression', 'helped', 'sucked', 'overcoming', 'overcome', 'stigma', 'thoughts', 'manageable', 'hair', 'dead', 'death', 'ready', 'mute', 'mean', 'color', 'tiny', 'moron'],
    9: ['suicide', 'blues', 'deplorable', 'cure', 'mad', 'rest', 'mess', 'methods', 'absolutely', 'game', 'thugs', 'benefit', 'toy', 'committed', 'swift', 'toys', 'darker', 'boys', 'wit', 'moron', 'ended', 'makeup', 'lover', 'help', 'mocks', 'people', 'heavy'],
}

In [14]:
lda_keywords = {
    1: ['depression', 'symptoms'],
    3: ['cry', 'sad', 'fight', 'love'],
    4: ['depression', 'treatments'],
    5: ['sad', 'kids', 'fight', 'love', 'disease', 'like', 'campaign', 'little', 'poor', 'person', 'stories', 'artists', 'darkness', 'heartbreaking', 'upgrade', 'understanding', 'cold', 'life', 'sadly', 'insane', 'spelling', 'begin'],
    6: ['happy', 'wish', 'birthday', 'wonderful', 'intense'],
    7: ['family', 'hell', 'sweet', 'goodnight', 'hope', 'ill'],
    8: ['depression', 'helped', 'sucked', 'overcoming', 'overcome', 'stigma', 'thoughts', 'manageable', 'hair', 'dead', 'death', 'ready', 'mute', 'mean', 'color', 'tiny', 'moron'],
    9: ['suicide', 'blues', 'deplorable', 'cure', 'mad', 'rest', 'mess', 'methods', 'absolutely', 'game', 'thugs', 'benefit', 'toy', 'committed', 'swift', 'toys', 'darker', 'boys', 'wit', 'moron', 'ended', 'makeup', 'lover', 'help', 'mocks', 'people', 'heavy'],
}

# Master Keywords List
Combining keywords from topic modeling from 3 models:

1. BERTopic
2. BERT
3. LDA

In [15]:
master_keywords = set()

# Add bertopic keywords to master_keywords
for keywords in bertopic_keywords.values():
    master_keywords.update(keywords)

# Add sbert keywords to master_keywords
for keywords_list in sbert_keywords.values():
    master_keywords.update(keywords_list)

# Add lda keywords to master_keywords
for keywords_list in lda_keywords.values():
    master_keywords.update(keywords_list)


# Convert set to list
master_keywords = list(master_keywords)

# Print the master_keywords list
print(master_keywords)

['simple relaxation techniques to help reduce stress', 'artists', 'spelling', 'birthday', 'helped', 'depression', 'talk business adviceandadviceoncopingwithdepression', 'overcome', 'deplorable', 'heartbreaking', 'heavy', 'changing', 'wonderful', 'mute', 'help', 'symptoms', 'lightsabers', 'purpose', 'electronics', 'designs', 'circles end depression therapy feeling depressed animalassisted therapy could', 'circles end depression therapy who had a great winter mental health', 'lonely', 'look', 'photoshoots', 'kids', 'love', 'talk business adviceconcerningyourfightwithdepression', 'lover', 'buy', 'start', 'makeup', 'killer', 'fight', 'house', 'design', 'obsessed', 'methods', 'sad', 'reply', 'cash', 'wish', 'toys', 'intense', 'relaxation techniques to help reduce stress', 'depression treatments', 'cure', 'how to live a depression free life article teller', 'understanding', 'broken people', 'sweet', 'mocks', 'nude', 'small', 'easy things you have to know about depression article teller', 'de

Manually rechecking master keywords list- word by word

master_keywords =

['cure', 'overcome', 'stigma', 'methods', 'heartbreaking', 'depression', 'love', 'suicide', 'artists', 'intense', 'disease', 'ended', 'changing', 'poor', 'lover', 'ready', 'benefit', 'goodnight', 'absolutely', 'toys', 'happy', 'trust', 'mess', 'purpose', 'understanding', 'girls', 'pretty', 'manageable', 'hair', 'overcoming', 'cold', 'sweet', 'broke', 'thoughts', 'start', 'family', 'mad', 'ill', 'confused', 'heavy', 'boys', 'kids', 'lonely', 'campaign', 'treatments', 'fight', 'little', 'upgrade', 'life', 'mute', 'wish', 'wonderful', 'sad', 'begin', 'helped', 'thugs', 'insane', 'symptoms', 'grateful', 'dead', 'darker', 'people', 'money', 'deplorable', 'bothers', 'obsessed', 'death', 'darkness', 'game', 'birthday', 'blues', 'help', 'rest', 'sadly', 'committed', 'lightsaber', 'moron', 'sucked', 'like', 'small', 'cry', 'toy', 'color', 'nude', 'serial', 'stalking']

In [16]:
master_keywords = ['cure', 'overcome', 'stigma', 'methods', 'heartbreaking', 'depression', 'love', 'suicide', 'artists', 'intense', 'disease', 'ended', 'changing', 'poor', 'lover', 'ready', 'benefit', 'goodnight', 'absolutely', 'toys', 'happy', 'trust', 'mess', 'purpose', 'understanding', 'girls', 'pretty', 'manageable', 'hair', 'overcoming', 'cold', 'sweet', 'broke', 'thoughts', 'start', 'family', 'mad', 'ill', 'confused', 'heavy', 'boys', 'kids', 'lonely', 'campaign', 'treatments', 'fight', 'little', 'upgrade', 'life', 'mute', 'wish', 'wonderful', 'sad', 'begin', 'helped', 'thugs', 'insane', 'symptoms', 'grateful', 'dead', 'darker', 'people', 'money', 'deplorable', 'bothers', 'obsessed', 'death', 'darkness', 'game', 'birthday', 'blues', 'help', 'rest', 'sadly', 'committed', 'lightsaber', 'moron', 'sucked', 'like', 'small', 'cry', 'toy', 'color', 'nude', 'serial', 'stalking']


Now we have a master keywords list

Let's try to filter mental health tweets based on master keywords from original mental health dataset

In [17]:
# Function to check if a tweet contains any keyword
def contains_keyword(tweet, keywords):
    return any(keyword in tweet for keyword in keywords)

In [18]:
# Filter tweets
filtered_tweets_all = df[df['post_text'].apply(lambda x: contains_keyword(x, master_keywords))]


The mental health dataset has 20000 rows

In [42]:
display(df)

Unnamed: 0.1,Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,its just over years since i was diagnosed with...,1013187241,84,211,251,837,0,1
1,1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,its sunday i need a break so im planning to sp...,1013187241,84,211,251,837,1,1
2,2,637749345908051968,Sat Aug 29 22:11:07 +0000 2015,awake but tired i need to sleep but my brain h...,1013187241,84,211,251,837,0,1
3,3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,rt retro bears make perfect gifts and are grea...,1013187241,84,211,251,837,2,1
4,4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,its hard to say whether packing lists are maki...,1013187241,84,211,251,837,1,1
...,...,...,...,...,...,...,...,...,...,...,...
19995,19995,819336825231773698,Thu Jan 12 00:14:56 +0000 2017,a day without sunshine is like night,1169875706,442,230,7,1063601,0,0
19996,19996,819334654260080640,Thu Jan 12 00:06:18 +0000 2017,borens laws when in charge ponder when in trou...,1169875706,442,230,7,1063601,0,0
19997,19997,819334503042871297,Thu Jan 12 00:05:42 +0000 2017,the flow chart is a most thoroughly oversold p...,1169875706,442,230,7,1063601,0,0
19998,19998,819334419374899200,Thu Jan 12 00:05:22 +0000 2017,ships are safe in harbor but they were never m...,1169875706,442,230,7,1063601,0,0


After filtering the dataset to exclusively encompass tweets containing at least one word from the master keyword list, it is evident that the data has undergone a concentration process, resulting in a robust dataset comprising 7366 rows.

In [43]:
display(filtered_tweets_all)

Unnamed: 0.1,Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,its just over years since i was diagnosed with...,1013187241,84,211,251,837,0,1
1,1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,its sunday i need a break so im planning to sp...,1013187241,84,211,251,837,1,1
3,3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,rt retro bears make perfect gifts and are grea...,1013187241,84,211,251,837,2,1
4,4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,its hard to say whether packing lists are maki...,1013187241,84,211,251,837,1,1
12,12,637555158440902656,Sat Aug 29 09:19:29 +0000 2015,moving stuff is bloomin knackering and theres ...,1013187241,84,211,251,837,0,1
...,...,...,...,...,...,...,...,...,...,...,...
19984,19984,819341164147048449,Thu Jan 12 00:32:10 +0000 2017,welcome to the working week i know it dont thr...,1169875706,442,230,7,1063601,0,0
19987,19987,819340814270615553,Thu Jan 12 00:30:47 +0000 2017,the internet is that thing still around homer ...,1169875706,442,230,7,1063601,0,0
19993,19993,819337577601912835,Thu Jan 12 00:17:55 +0000 2017,every why hath a wherefore william shakespeare...,1169875706,442,230,7,1063601,0,0
19994,19994,819336993331171329,Thu Jan 12 00:15:36 +0000 2017,you will have good luck and overcome many hard...,1169875706,442,230,7,1063601,0,0


In [44]:
import pandas as pd

# saving filtered dataset to pandas
#filtered_tweets_all.to_csv('/content/drive/My Drive/Text Mining Project/mental_health_subset.csv', index=False)

Renaming final mental health dataset with top mental health related keywords

In [45]:
final_mh = filtered_tweets_all.drop(columns=['Unnamed: 0'])

In [46]:
display(final_mh)

Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,its just over years since i was diagnosed with...,1013187241,84,211,251,837,0,1
1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,its sunday i need a break so im planning to sp...,1013187241,84,211,251,837,1,1
3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,rt retro bears make perfect gifts and are grea...,1013187241,84,211,251,837,2,1
4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,its hard to say whether packing lists are maki...,1013187241,84,211,251,837,1,1
12,637555158440902656,Sat Aug 29 09:19:29 +0000 2015,moving stuff is bloomin knackering and theres ...,1013187241,84,211,251,837,0,1
...,...,...,...,...,...,...,...,...,...,...
19984,819341164147048449,Thu Jan 12 00:32:10 +0000 2017,welcome to the working week i know it dont thr...,1169875706,442,230,7,1063601,0,0
19987,819340814270615553,Thu Jan 12 00:30:47 +0000 2017,the internet is that thing still around homer ...,1169875706,442,230,7,1063601,0,0
19993,819337577601912835,Thu Jan 12 00:17:55 +0000 2017,every why hath a wherefore william shakespeare...,1169875706,442,230,7,1063601,0,0
19994,819336993331171329,Thu Jan 12 00:15:36 +0000 2017,you will have good luck and overcome many hard...,1169875706,442,230,7,1063601,0,0


#Models Creation and testing

In [47]:
final_mh

Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label
0,637894677824413696,Sun Aug 30 07:48:37 +0000 2015,its just over years since i was diagnosed with...,1013187241,84,211,251,837,0,1
1,637890384576778240,Sun Aug 30 07:31:33 +0000 2015,its sunday i need a break so im planning to sp...,1013187241,84,211,251,837,1,1
3,637696421077123073,Sat Aug 29 18:40:49 +0000 2015,rt retro bears make perfect gifts and are grea...,1013187241,84,211,251,837,2,1
4,637696327485366272,Sat Aug 29 18:40:26 +0000 2015,its hard to say whether packing lists are maki...,1013187241,84,211,251,837,1,1
12,637555158440902656,Sat Aug 29 09:19:29 +0000 2015,moving stuff is bloomin knackering and theres ...,1013187241,84,211,251,837,0,1
...,...,...,...,...,...,...,...,...,...,...
19984,819341164147048449,Thu Jan 12 00:32:10 +0000 2017,welcome to the working week i know it dont thr...,1169875706,442,230,7,1063601,0,0
19987,819340814270615553,Thu Jan 12 00:30:47 +0000 2017,the internet is that thing still around homer ...,1169875706,442,230,7,1063601,0,0
19993,819337577601912835,Thu Jan 12 00:17:55 +0000 2017,every why hath a wherefore william shakespeare...,1169875706,442,230,7,1063601,0,0
19994,819336993331171329,Thu Jan 12 00:15:36 +0000 2017,you will have good luck and overcome many hard...,1169875706,442,230,7,1063601,0,0


#Sentiment Analysis - BERT

In [48]:

!git clone -b master https://github.com/charles9n/bert-sklearn
!cd bert-sklearn; pip install .

Cloning into 'bert-sklearn'...
remote: Enumerating objects: 259, done.[K
remote: Total 259 (delta 0), reused 0 (delta 0), pack-reused 259[K
Receiving objects: 100% (259/259), 516.15 KiB | 3.35 MiB/s, done.
Resolving deltas: 100% (131/131), done.
Processing /content/bert-sklearn
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting boto3 (from bert-sklearn==0.3.1)
  Downloading boto3-1.33.13-py3-none-any.whl (139 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.3/139.3 kB[0m [31m966.8 kB/s[0m eta [36m0:00:00[0m
Collecting botocore<1.34.0,>=1.33.13 (from boto3->bert-sklearn==0.3.1)
  Downloading botocore-1.33.13-py3-none-any.whl (11.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.8/11.8 MB[0m [31m45.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting jmespath<2.0.0,>=0.7.1 (from boto3->bert-sklearn==0.3.1)
  Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.9.0,>=0.8.2 (from boto3->bert-sklearn==0.3

In [49]:
# options
from bert_sklearn import BertClassifier
model = BertClassifier(bert_model='bert-base-uncased',
label_list=None,
num_mlp_hiddens=500,
num_mlp_layers=0,
epochs=1,
max_seq_length=128,
train_batch_size=32,
eval_batch_size=8,
learning_rate=2e-5,
warmup_proportion=0.1,
gradient_accumulation_steps=1,
fp16=False,
loss_scale=0,
local_rank=-1,
use_cuda=True,
random_state=42,
validation_fraction=0.1,
logfile='bert_sklearn.log' )

Building sklearn text classifier...


In [50]:
X_train_bert, y_train_bert = train160['text'].values, train160['label'].values
from bert_sklearn import BertClassifier
model_bert = BertClassifier(epochs=1)
print(model_bert)
model_bert.fit(X_train_bert, y_train_bert)

Building sklearn text classifier...
BertClassifier(epochs=1)


100%|██████████| 231508/231508 [00:00<00:00, 2881914.35B/s]


Loading bert-base-uncased model...


100%|██████████| 440473133/440473133 [00:16<00:00, 27401237.27B/s]
100%|██████████| 433/433 [00:00<00:00, 1182378.67B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 18000, validation data size: 2000



Training  :   0%|          | 0/563 [00:00<?, ?it/s]

Validating:   0%|          | 0/250 [00:00<?, ?it/s]


Epoch 1, Train loss: 0.5227, Val loss: 0.4568, Val accy: 76.45%



In [51]:
model_bert.save('bert-sentiment.model')

Classification report - BERT trained on sentiment data, tested on mental health data

In [52]:
from sklearn.metrics import classification_report

# Make predictions on the health data
y_pred_bert = model_bert.predict(final_mh['post_text'].values)

# Get the true labels from the Eureka dataset
y_true =final_mh['label'].values

# Generate the classification report
classification_rep = classification_report(y_true, y_pred_bert, target_names=['Negative','Positive'])

# Print the classification report
print("Classification Report:\n", classification_rep)

Predicting:   0%|          | 0/921 [00:00<?, ?it/s]

Classification Report:
               precision    recall  f1-score   support

    Negative       0.83      0.61      0.70      3197
    Positive       0.75      0.91      0.82      4169

    accuracy                           0.78      7366
   macro avg       0.79      0.76      0.76      7366
weighted avg       0.79      0.78      0.77      7366



Error Analysis - BERT

In [68]:
from sklearn.metrics import confusion_matrix
import pandas as pd


conf_matrix_bert = confusion_matrix(y_true, y_pred_bert)
print("Confusion Matrix:\n", conf_matrix_bert)


misclassified_indices_bert = y_true != y_pred_bert
misclassified_data_bert = final_mh.loc[misclassified_indices_bert, ['post_text', 'label']]


print("Misclassified Instances:")
display(misclassified_data_bert)



Confusion Matrix:
 [[1944 1253]
 [ 395 3774]]
Misclassified Instances:


Unnamed: 0,post_text,label
25,i will always love you peter gabriel in your eyes,1
115,do you still use this account,1
155,gorilla has amazing reaction when reunited wit...,1
157,i scored points at kolor a game where you have...,1
176,its not your chair its all mine worldcatday,1
...,...,...
19978,exactitude in small matters is the essence of ...,0
19981,change your thoughts and you change your world,0
19984,welcome to the working week i know it dont thr...,0
19994,you will have good luck and overcome many hard...,0


#Sentiment Analysis - DistilBERT

In [53]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm


splitting data for training & testing

In [54]:
#mh = pd.read_csv('mental_health_subset.csv')
x_train, y_train = train160['text'].values, train160['label'].values
x_test, y_test = final_mh['post_text'].values, final_mh['label'].values

In [55]:
# Split the training data into train and validation sets
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

In [56]:
# Load DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
from torch.utils.data import Dataset

# Define the SentimentDataset class

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }


In [58]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import accuracy_score
import torch
# Tokenize the data
from torch.utils.data import Dataset

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }



In [59]:
import torch.nn as nn
# Create DataLoader for training and testing
train_dataset = SentimentDataset(x_train, y_train, tokenizer)
test_dataset = SentimentDataset(x_test, y_test, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Define optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = nn.CrossEntropyLoss()

In [64]:
 #Train the model
num_epochs = 2
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch in tqdm(train_loader, desc="Training"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()

Training:   0%|          | 0/2000 [00:00<?, ?it/s]

Training:   0%|          | 0/2000 [00:00<?, ?it/s]

In [65]:
# Evaluate the model
model.eval()
all_predictions = []
all_true_labels = []

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs.logits, dim=1)

        all_predictions.extend(predictions.cpu().numpy())
        all_true_labels.extend(labels.cpu().numpy())



Evaluating:   0%|          | 0/921 [00:00<?, ?it/s]

Classification report - DistilBERT trained on sentiment data, tested on mental health data


along with error analysis

In [67]:

print("\nClassification Report on Testing Data:")
print(classification_report(all_true_labels, all_predictions))
print("\nConfusion Matrix:")
print(confusion_matrix(all_true_labels, all_predictions))


Classification Report on Testing Data:
              precision    recall  f1-score   support

           0       0.94      0.69      0.79      3197
           1       0.80      0.97      0.88      4169

    accuracy                           0.85      7366
   macro avg       0.87      0.83      0.84      7366
weighted avg       0.86      0.85      0.84      7366


Confusion Matrix:
[[2192 1005]
 [ 133 4036]]


# Sentiment Analysis - MNB

In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
import numpy as np

# Step 1: Train Naive Bayes model on the Senti140 dataset
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X_train = tfidf_vectorizer.fit_transform(train160['text'])
y_train = train160['label']  # Assuming 'label' column represents sentiment labels

# Introduce noise to the training labels
y_train_noisy = np.random.choice([0, 1], size=len(y_train), p=[0.5, 0.5])

# Train Naive Bayes model with noisy labels
naive_bayes_model = MultinomialNB(alpha=1.0)
naive_bayes_model.fit(X_train, y_train_noisy)


In [83]:

# Step 3: Test Naive Bayes model on the subset of the mental health dataset
X_test = tfidf_vectorizer.transform(final_mh['post_text'])
y_true = final_mh['label']  # Assuming 'label' column represents the true labels

# Predict using the model with noisy labels
y_pred = naive_bayes_model.predict(X_test)

# Step 4: Evaluate the Naive Bayes model
accuracy = accuracy_score(y_true, y_pred)
auc_score = roc_auc_score(y_true, y_pred)
classification_rep = classification_report(y_true, y_pred)

print(f"Accuracy: {accuracy}")
print(f"AUC Score: {auc_score}")
print("Classification Report:\n", classification_rep)

Accuracy: 0.5016291067064893
AUC Score: 0.49875910591101197
Classification Report:
               precision    recall  f1-score   support

           0       0.43      0.48      0.45      3197
           1       0.56      0.52      0.54      4169

    accuracy                           0.50      7366
   macro avg       0.50      0.50      0.50      7366
weighted avg       0.51      0.50      0.50      7366



Error Analysis - MNB

In [84]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns


conf_matrix = confusion_matrix(y_true, y_pred)
print(conf_matrix)


misclassified_indices = y_true != y_pred
misclassified_data = final_mh.loc[misclassified_indices, ['post_text', 'label']]

# Display Misclassified Instances
print("Misclassified Instances:")
display(misclassified_data)

[[1525 1672]
 [1999 2170]]
Misclassified Instances:


Unnamed: 0,post_text,label
1,its sunday i need a break so im planning to sp...,1
3,rt retro bears make perfect gifts and are grea...,1
4,its hard to say whether packing lists are maki...,1
20,its do i get up or lie here a little longer wi...,1
21,theres nothing like cocktails and exhaustion t...,1
...,...,...
19970,jacquins postulate on democratic government no...,0
19973,fs fitzgerald to hemingway ernest the rich are...,0
19974,in a five year period we can get one superb pr...,0
19978,exactitude in small matters is the essence of ...,0


#Best Model - DistilBERT

Since distilbert has highest accuracy, we Filter instances where DistilBERT has correctly predicted label 0 for negative tweets

In [87]:
import pandas as pd

final_mh['Predicted Label'] = all_predictions


filtered_final_mh = final_mh[(final_mh['Predicted Label'] == 0) & (final_mh['label'] == 0)]


print(filtered_final_mh.info())



<class 'pandas.core.frame.DataFrame'>
Int64Index: 2192 entries, 10000 to 19993
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   post_id          2192 non-null   int64 
 1   post_created     2192 non-null   object
 2   post_text        2192 non-null   object
 3   user_id          2192 non-null   int64 
 4   followers        2192 non-null   int64 
 5   friends          2192 non-null   int64 
 6   favourites       2192 non-null   int64 
 7   statuses         2192 non-null   int64 
 8   retweets         2192 non-null   int64 
 9   label            2192 non-null   int64 
 10  Predicted Label  2192 non-null   int64 
dtypes: int64(9), object(2)
memory usage: 205.5+ KB
None
                  post_id                    post_created  \
10000  819457334271279105  Thu Jan 12 08:13:47 +0000 2017   
10017  819439102244356096  Thu Jan 12 07:01:21 +0000 2017   
10023  819316582862221312  Wed Jan 11 22:54:30 +0000 2017   
10

In [88]:
display(filtered_final_mh)

Unnamed: 0,post_id,post_created,post_text,user_id,followers,friends,favourites,statuses,retweets,label,Predicted Label
10000,819457334271279105,Thu Jan 12 08:13:47 +0000 2017,my enemys invisible i dont know how to fight,727820220291645442,123,145,1068,23801,0,0,0
10017,819439102244356096,Thu Jan 12 07:01:21 +0000 2017,i dont knwo the editing term for it but the sp...,3249600438,235,185,24407,22302,0,0,0
10023,819316582862221312,Wed Jan 11 22:54:30 +0000 2017,he will only let himself be saved if she also ...,3249600438,235,185,24407,22302,0,0,0
10026,819204798750793732,Wed Jan 11 15:30:18 +0000 2017,shows where i dont have to watch arrow even af...,3249600438,235,185,24407,22302,0,0,0
10029,819204113619709952,Wed Jan 11 15:27:35 +0000 2017,and how can you possibly be true to dc univers...,3249600438,235,185,24407,22302,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
19973,819344882804424707,Thu Jan 12 00:46:57 +0000 2017,fs fitzgerald to hemingway ernest the rich are...,1169875706,442,230,7,1063601,0,0,0
19976,819342226400702465,Thu Jan 12 00:36:24 +0000 2017,i think trash is the most important manifestat...,1169875706,442,230,7,1063601,0,0,0
19983,819341171411550208,Thu Jan 12 00:32:12 +0000 2017,while most peoples opinions change the convict...,1169875706,442,230,7,1063601,0,0,0
19987,819340814270615553,Thu Jan 12 00:30:47 +0000 2017,the internet is that thing still around homer ...,1169875706,442,230,7,1063601,0,0,0


Count the number of negative tweets for each user

In [89]:


neg_tweet_count_by_user = filtered_final_mh.groupby('user_id').size().reset_index(name='neg_tweet_count')

neg_tweet_count_by_user = neg_tweet_count_by_user.sort_values(by='neg_tweet_count', ascending=False)

print(neg_tweet_count_by_user)

               user_id  neg_tweet_count
5            490044008              507
2            145626605              430
13          3249600438              338
17  763182466098233344              161
0             18831261              154
10          1497350173              126
12          2780518314              117
9           1458225506              104
8           1169875706               94
16  762433972273950725               58
4            324294391               37
3            171999132               23
1             29053403               16
7            894149342               13
11          2369443141                8
6            548972753                4
14  706699293558710273                1
15  727820220291645442                1


Now we are able to identify users by anonymity through their user_id and send variety of resources to them