<a href="https://colab.research.google.com/github/rammzi/Twitter-Sentimental-Analysis-NLP-/blob/main/X(Twitter)_Sentimental_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**
This project analyzes a Twitter dataset focused on mental health using topic modeling. The goal is to uncover recurring themes in the tweets and to identify patterns in the public conversation around mental health. The text is first converted to numerical form, using term-count and TF-IDF matrices and SentenceTransformer embeddings; topics are then extracted with BERTopic and Latent Dirichlet Allocation (LDA). Interactive visualizations built with Bokeh and pyLDAvis show the relationships and structure within the data.



# 1. Installing and Importing Libraries
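Two libraries are central to this notebook: SentenceTransformer generates vector representations (embeddings) of the tweets, and pyLDAvis provides an interactive interface for exploring the LDA results. Bokeh and BERTopic are installed alongside them for plotting and topic extraction.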




In [None]:
# installing libraries for NLP, visualization, and topic modeling
!pip install -Uqq sentence-transformers
!pip install -qq bokeh
!pip install -qq bertopic
!pip install pyLDAvis

# importing libraries for data processing, embeddings, and visualization
import numpy as np
import pandas as pd
import random
from sentence_transformers import SentenceTransformer
import sklearn.manifold
import bokeh.plotting as bpl
import bokeh.models as bmo
from bokeh.models import ColumnDataSource, HoverTool
from bertopic import BERTopic
import pyLDAvis
import pyLDAvis.lda_model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# bokeh setup for notebook output
bpl.output_notebook()


# 2. Load and Pre-process the Twitter Dataset


In [None]:
from google.colab import files

# csv file uploaded locally
uploaded = files.upload()

In [None]:
!pip install langdetect




In [None]:
from langdetect import detect
import langdetect

# loading the dataset
df = pd.read_csv('Mental-Health-Twitter.csv')

def is_english(text):
    try:
        return detect(text) == 'en'
    except langdetect.lang_detect_exception.LangDetectException:
        return False

# filtering out non-English tweets
df = df[df['post_text'].apply(is_english)]
df = df.head(3000)
df.head()




| Unnamed: 0.1 | Unnamed: 0 | post_id | post_created | post_text | user_id | followers | friends | favourites | statuses | retweets | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 637894677824413696 | Sun Aug 30 07:48:37 +0000 2015 | It's just over 2 years since I was diagnosed w... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 |
| 1 | 1 | 637890384576778240 | Sun Aug 30 07:31:33 +0000 2015 | It's Sunday, I need a break, so I'm planning t... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 |
| 2 | 2 | 637749345908051968 | Sat Aug 29 22:11:07 +0000 2015 | Awake but tired. I need to sleep but my brain ... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 |
| 3 | 3 | 637696421077123073 | Sat Aug 29 18:40:49 +0000 2015 | RT @SewHQ: #Retro bears make perfect gifts and... | 1013187241 | 84 | 211 | 251 | 837 | 2 | 1 |
| 4 | 4 | 637696327485366272 | Sat Aug 29 18:40:26 +0000 2015 | It’s hard to say whether packing lists are mak... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 |
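The preview above includes a retweet prefix (`RT @SewHQ:`) and a hashtag, and tweets often carry URLs and mentions as well. One optional pre-processing step, which is not part of the original notebook, is to strip these before modeling. The sketch below is a minimal illustration; the function name `clean_tweet` and the example string are made up for this purpose.

```python
import re

def clean_tweet(text):
    """Strip common Twitter markup from a single tweet (illustrative helper)."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)  # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)    # URLs
    text = re.sub(r"@\w+", "", text)            # remaining @mentions
    text = text.replace("#", "")                # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

# made-up example string, loosely modeled on the retweet in the preview
example = "RT @SewHQ: #Retro bears make perfect gifts https://example.com/x"
cleaned = clean_tweet(example)
print(cleaned)
```

In the notebook this could be applied with `df['post_text'].apply(clean_tweet)` before the embedding and topic modeling steps, if such cleaning is wanted.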


# 3. Generate Text Embeddings Using SentenceTransformer
Each tweet is converted into a numerical vector (a text embedding) so that it can be used in machine learning models and compared with other tweets.

In [None]:
# initializing the SentenceTransformer model for generating embeddings
model = SentenceTransformer('stsb-distilbert-base')

# generating embeddings
embeddings = model.encode(df['post_text'].tolist())

# dimensionality reduction for data visualization
out = sklearn.manifold.TSNE(n_components=2, random_state=42).fit_transform(embeddings)
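Sentence embeddings are usually compared with cosine similarity: tweets with similar meaning should score close to 1, unrelated tweets close to 0. The sketch below shows the computation on tiny made-up vectors; the three vectors and their names are hypothetical stand-ins for rows of `embeddings`, which have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical 3-dimensional "embeddings" (real ones are much longer)
vec_sleep = [0.9, 0.1, 0.0]
vec_tired = [0.8, 0.2, 0.1]
vec_gift = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(vec_sleep, vec_tired)
sim_unrelated = cosine_similarity(vec_sleep, vec_gift)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

For the real embeddings, `sentence_transformers.util.cos_sim` performs the same comparison on whole matrices.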


# 4. Visualize Tweet Embeddings Using Bokeh
Bokeh is used to create an interactive 2D scatter plot of the text embeddings. The purpose of this visualization is to observe how tweets cluster based on their semantic similarity, which can indicate the presence of underlying topics. Clusters of points might represent recurring themes or ideas discussed in the tweets. By hovering over individual points, one can explore specific tweets and see how they relate to their neighbors.


In [None]:
import random
from bokeh.transform import factor_cmap
from bokeh.palettes import Category20_20
from sklearn.manifold import TSNE
from bokeh.models import ColumnDataSource, HoverTool
import bokeh.plotting as bpl

tsne_model = TSNE(n_components=2, random_state=42)
out = tsne_model.fit_transform(embeddings)

# extracting x and y coordinates
list_x = out[:, 0]
list_y = out[:, 1]

# extracting the tweet text to use as descriptions
desc = df['post_text'].tolist()

# number of color groups
num_clusters = 10
clusters = [f"Cluster {i}" for i in range(num_clusters)]
# NOTE: labels are drawn at random as placeholders, so the colors do not reflect real clusters
cluster_labels = [random.choice(clusters) for _ in range(len(list_x))]

# preparing data for a 2D scatter plot using Bokeh
source = ColumnDataSource(data=dict(
    x=list_x,
    y=list_y,
    desc=desc,
    cluster=cluster_labels
))

# using a color mapper to map clusters to different colors
color_mapper = factor_cmap('cluster', palette=Category20_20, factors=clusters)

# setting up the Bokeh plot with hover functionality and colored clusters
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("(x,y)", "(@x, @y)"),
    ("desc", "@desc"),
    ("cluster", "@cluster")
])

# creating the Bokeh scatter plot with colored clusters
p = bpl.figure(width=800, height=600, tools=[hover], title="Tweet Embeddings Visualization with Colored Clusters")
# scatter() is used because circle() with a size argument is deprecated in recent Bokeh releases
p.scatter('x', 'y', size=10, source=source, fill_color=color_mapper, fill_alpha=0.6, line_color=None)
bpl.show(p)
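Because `cluster_labels` in the cell above is filled with `random.choice`, the colors in the plot are not tied to the position of the points. One way to make the colors meaningful is to derive the labels from the data with k-means. The sketch below is a minimal, dependency-free version run on made-up 2D points; the function `kmeans` and the toy coordinates are illustrative assumptions, and in the notebook `sklearn.cluster.KMeans` fitted on `embeddings` would be the more natural choice.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means on a list of (x, y) tuples.

    Centroids start at the first k points, so the result is deterministic.
    """
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for each point
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels

# made-up 2D coordinates standing in for the t-SNE output
toy_points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
toy_labels = kmeans(toy_points, k=2)
cluster_labels = [f"Cluster {lab}" for lab in toy_labels]
print(cluster_labels)
```

The resulting `cluster_labels` list has the same form as the one passed to `ColumnDataSource` above, so it could replace the random labels without other changes to the plotting code.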




# 5. Perform Topic Modeling Using BERTopic

In [None]:
# initializing the BERTopic model for topic extraction
topic_model = BERTopic(language="english")

# fitting
topics, probs = topic_model.fit_transform(df['post_text'].tolist())

topic_info = topic_model.get_topic_info()
topic_info.head()



| Unnamed: 0 | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 953 | -1_depression_treatments_overcome_rt | [depression, treatments, overcome, rt, the, an... | [@_ChiefKia Can you direct message me? We are ... |
| 1 | 0 | 96 | 0_addiction_heroin_addicted_problem | [addiction, heroin, addicted, problem, drug, c... | [Do you know some who has been arrested becaus... |
| 2 | 1 | 92 | 1_day_morning_home_today | [day, morning, home, today, its, and, motivati... | [Good morning!:) I hope its gonna be a nice da... |
| 3 | 2 | 85 | 2_treatments_depression_treatment_therapy | [treatments, depression, treatment, therapy, d... | [Newly Funded Hepatitis C Treatment Will Help ... |
| 4 | 3 | 81 | 3_mental_mentalhealth_health_illness | [mental, mentalhealth, health, illness, mental... | [Who do you think has a high risk of mental he... |
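In BERTopic, topic `-1` collects the documents that were treated as outliers rather than assigned to a topic, so its count is worth reading separately from the real topics. The short sketch below works through the numbers from the table above; the total of 3000 documents is an assumption based on `df.head(3000)` returning a full 3000 rows.

```python
# counts copied from the first five rows of the topic table
topic_counts = {-1: 953, 0: 96, 1: 92, 2: 85, 3: 81}
n_docs = 3000  # assumption: df.head(3000) returned a full 3000 rows

# share of tweets that BERTopic left unassigned (topic -1)
outlier_share = topic_counts[-1] / n_docs

# drop the outlier bucket before ranking the real topics
real_topics = {t: c for t, c in topic_counts.items() if t != -1}
largest_topic = max(real_topics, key=real_topics.get)

print(f"outliers: {outlier_share:.1%}, largest topic: {largest_topic}")
```

Under that assumption, roughly a third of the tweets fall into the outlier bucket, which is worth keeping in mind when interpreting the named topics.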


# 6. Visualize Topics with BERTopic
BERTopic provides built-in functions to display topic distributions and key terms. Characterizing the tweets by topic makes it easier to see what the social media conversation is about.

In [None]:
# visualizing top 10 topics
topic_model.visualize_barchart(top_n_topics=10)




# 7. Perform LDA Topic Modeling and Visualize Using pyLDAvis
LDA models each document as a mixture of topics and each topic as a probability distribution over words, which reveals hidden topics in the corpus. pyLDAvis visualizes the fitted model: it shows how the topics are positioned relative to one another and highlights the most salient terms within each topic.

In [None]:
!pip install --upgrade pyLDAvis

# no __future__ import is needed: print() is already a function in Python 3
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()  # render pyLDAvis output inline in the notebook

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs_raw = df['post_text'].tolist()

# creating CountVectorizer and TfidfVectorizer
tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                stop_words='english',
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b',
                                max_df=0.5,
                                min_df=10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)  # document-term matrix for LDA

tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)  # TF-IDF matrix (not used below; the LDA model is fit on raw counts)

# fitting the LDA model
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)

pyLDAvis_display = pyLDAvis.lda_model.prepare(lda_tf, dtm_tf, tf_vectorizer)
pyLDAvis_display
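Besides the interactive view, the topics can be inspected as plain text by listing the highest-weighted terms of each topic. The sketch below shows the idea on a small made-up topic-word matrix; `top_terms`, the vocabulary, and the weights are illustrative assumptions. In the notebook the corresponding inputs would be `lda_tf.components_` and `tf_vectorizer.get_feature_names_out()`.

```python
def top_terms(topic_word_weights, vocabulary, n_top=3):
    """Return the n_top highest-weighted terms for each topic."""
    result = []
    for weights in topic_word_weights:
        # term indices sorted by descending weight within this topic
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_top]])
    return result

# made-up vocabulary and weights: one row per topic, one column per term
vocabulary = ["depression", "sleep", "therapy", "morning", "treatment"]
topic_word_weights = [
    [9.0, 0.5, 6.0, 0.2, 7.5],
    [0.3, 8.0, 0.1, 6.5, 0.4],
]

terms_per_topic = top_terms(topic_word_weights, vocabulary)
for topic_id, terms in enumerate(terms_per_topic):
    print(topic_id, terms)
```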







