---------------------------------------------
- Wiley Winters
- MSDS 640 - Assignment 6 Social Media Analysis for the Common Good
- 2025-AUG-16

------------------------------------------------------------------------
# Assignment Specification
For this week's assignment, you are tasked with writing an APA-formatted report (3-4 pages long). Assume the role of a data science researcher employed at a non-profit organization, approaching the topic from a data science perspective.
- Your main objective is to utilize social media data to contribute to a common good issue. Choose a topic such as **mental health**, **income inequality**, **human rights**, **workers' rights**, a particular **healthcare concern**, or **socioeconomic injustices**. Select a social media platform for data collection, and options include Reddit, X (formerly Twitter), Facebook, or others
- At a minimum, create a wordcloud and include it in your paper. To go above and beyond, apply other NLP and text analytics techniques, such as [topic modeling](https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0) and sentiment analysis. Note that many people consider wordclouds uninformative and [bad practice](https://getthematic.com/insights/word-clouds-harm-insights/), so you should strive to create a bar chart of top words or other visualizations instead, which can be done using tools and examples provided in resources like "MSDS640_Week6_FTE.ipynb"
- Your paper should also feature a mindmap. This mindmap should center around the common good issue you have selected, with social media platforms branching out from the center. Further layers can delve into ethics and privacy concerns related to the project, culminating in examples of these issues
- In your work, include an overall ***introduction***, a ***description of your dataset***, the ***purpose*** behind your research, highlighting the problem you seek to address, and a discussion on ethics and privacy challenges in the context of your chosen common good issue. Additionally, provide a summary of your findings. For further insights and inspiration, refer to the weekly reading list, which includes videos and mind-mapping resources

-------------------------------------------------------------------------------
## Import Required Packages and Libraries

In [1]:
# Standard Imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import sys

# Read sqlite3 database file
import sqlite3

# Text processing and preparation
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords, words

# Text Visualization
from wordcloud import WordCloud

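# Sentiment Analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Clustering, dimensionality reduction, and evaluation
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn import metrics
from scipy.cluster.hierarchy import linkage, dendrogram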

# Make plots pretty
plt.style.use('ggplot')

## Define Functions

Process text function to perform basic preprocessing on text features.  I changed this function to use lemmatization instead of stemming.  While stemming can be faster to perform, lemmatization actually reduces the word-forms to linguistically valid lemmas.

In [2]:
def process_text(text):
    if isinstance(text, str):
        text = text.lower() # Convert all to lower case
        # Remove punctuation
        text = ''.join([char for char in text if char not in string.punctuation])
        text = ''.join([char for char in text if not char.isdigit()])  # Remove numbers
        # In some instances, I've run into issues with extra spaces.
        text = text.strip()
        # Remove stop words and apply lemmatizer
        stop = set(stopwords.words('english'))  # a set gives faster membership tests
        wnl = WordNetLemmatizer()
        text = ' '.join([wnl.lemmatize(word) for word in text.split()
                         if word not in stop])

        return text
    else:
        return ' '

I chose to download **Reddit** submissions and their associated comments.  Since the relationship between submissions and comments is one-to-many, I decided to store the data in a *sqlite3* datafile.  This allows me to query the data using standard SQL statements, including one-to-many joins.

The code below creates a connection to the datafile and then queries it based on criteria I have chosen.  A pandas dataframe is created from the query.

Since I believe the purpose of the lab is to perform a sentiment analysis, I will only load text features.

In [3]:
conn = sqlite3.connect('data/poverty.sqlite')  # Create the database connection object

reddit_df = pd.read_sql_query('SELECT id, created_utc, title, author, n_comments, ' \
                              'score, ratio, text, comment_id, comment_utc, ' \
                              'comment_author, body, comment_score ' \
                              'FROM posts, comments WHERE posts.id = comments.link_id ' \
                              'AND n_comments > 0', conn)
# Take a quick peek at the data
reddit_df.sample(5)

Unnamed: 0,id,created_utc,title,author,n_comments,score,ratio,text,comment_id,comment_utc,comment_author,body,comment_score
5875,1m36veg,1752856000.0,How can I make $75 quick?,PsychologicalWait189,41,17,0.87,,n3y2unm,1752902000.0,MoodyMagicOwl,"Have you done this? If so, do these guys care ...",2
2233,1kzyuez,1748703000.0,Prominent conservative attacks Social Security...,Socialfilterdvit,205,356,0.96,,mvrfk1o,1748955000.0,EffectiveSalamander,He's admitting that privatization would destro...,1
8049,ooiue4,1626842000.0,Question for r/poverty,ozzy622,15,3,1.0,"If any of you are living in poverty, have live...",h60m28i,1626883000.0,excaligirltoo,And this isn’t even just a poverty situation. ...,2
5181,1mlxqvc,1754767000.0,Feeling like a failure,Secret-Requirement22,31,25,1.0,I feel like a failure and don’t know how to ge...,n7uphey,1754781000.0,Substantial-Use-1758,"Community college, girl. You’ve still got 30+ ...",7
4397,1iofed5,1739439000.0,Favorite Poverty Meals?,Intrepid-Opening5877,59,56,0.99,"And I’m talking like DIRT cheap, as low as you...",mgqnog5,1741466000.0,NurseCrystal81,Do you follow Dollar Tree Meals on TikTok? Sh...,1
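
The one-to-many join used in the query above can be sketched on a tiny in-memory database.  This is only an illustration: the table and column names mirror the query, but the schema details and the rows are made up, and an explicit `JOIN ... ON` is used in place of the comma-style join (the two forms are equivalent for an inner join).

```python
import sqlite3

# Hypothetical miniature schema; the rows are invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, n_comments INTEGER);
    CREATE TABLE comments (comment_id TEXT PRIMARY KEY, link_id TEXT, body TEXT);
    INSERT INTO posts VALUES ('p1', 'First post', 2), ('p2', 'Second post', 0);
    INSERT INTO comments VALUES ('c1', 'p1', 'first reply'), ('c2', 'p1', 'second reply');
""")

# Each post row is repeated once per matching comment (one-to-many).
rows = demo_conn.execute("""
    SELECT posts.id, posts.title, comments.comment_id, comments.body
    FROM posts
    JOIN comments ON posts.id = comments.link_id
    WHERE posts.n_comments > 0
    ORDER BY comments.comment_id
""").fetchall()
demo_conn.close()

for row in rows:
    print(row)
```

Because the post columns repeat for every comment, any per-post statistic computed on the joined dataframe is weighted by the number of comments on that post.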


The dates are in Unix Epoch.  I will convert them into something a little more human readable.

In [4]:
reddit_df['created_utc'] = pd.to_datetime(reddit_df['created_utc'], unit='s')
reddit_df['comment_utc'] = pd.to_datetime(reddit_df['comment_utc'], unit='s')
reddit_df.sample(5)

Unnamed: 0,id,created_utc,title,author,n_comments,score,ratio,text,comment_id,comment_utc,comment_author,body,comment_score
92,1m91bdh,2025-07-25 14:54:23,Poor people are taught shame for the things th...,CarpenterUpset3251,180,3364,0.99,This is just a thought. But I feel like it's s...,n56yorv,2025-07-26 01:31:45,Elegant_Break9371,Ah. That explains it lol,2
5879,1m36veg,2025-07-18 16:26:32,How can I make $75 quick?,PsychologicalWait189,41,17,0.87,,n43g55g,2025-07-20 01:45:51,team_undog,I was just kidding. I would care if I was into...,2
3198,1lh6cfo,2025-06-21 20:43:21,To people who make fun of us poor people: I'm ...,Different_Lychee9708,71,146,0.92,"I am officially done. From now on, ANYONE who ...",mz31ljt,2025-06-22 01:49:00,Diane1967,It’s really hurtful when people use our posts ...,5
1673,1m3jg9t,2025-07-19 01:11:14,The struggle is real.,Apprehensive_Snow45,273,876,0.98,I'm fresh out of prison and I'm feeling overwh...,n4s9eet,2025-07-23 21:03:23,Few-Reason7527,"Find a church , they will help guide you. Have...",1
1939,1k3zhs8,2025-04-20 23:29:10,What the rich eat !,Sweet-Leadership-290,158,468,0.95,I am disgusted at the waste of resources. I ha...,moozjog,2025-04-23 23:00:40,greenerbeansheen,KIRKLAND!!!!,1
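
For a single value, the same conversion can be checked with the standard library.  The sketch below uses the rounded epoch value displayed for the first sample row earlier; the result differs from the converted dataframe by a few seconds, presumably because the earlier display rounded the float.

```python
from datetime import datetime, timezone

epoch_seconds = 1752856000  # rounded value as displayed in the earlier sample
as_utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(as_utc.isoformat())  # 2025-07-18T16:26:40+00:00
```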


From previous experience with this dataset, I know that when a comment is removed or deleted, the `body` text is replaced with `[deleted]` or `[removed]`.  I will check whether the count of those comments is significant.

In [5]:
print(reddit_df[reddit_df['body'] == '[deleted]'].count())
print(reddit_df[reddit_df['body'] == '[removed]'].count())

id                74
created_utc       74
title             74
author            74
n_comments        74
score             74
ratio             74
text              74
comment_id        74
comment_utc       74
comment_author    74
body              74
comment_score     74
dtype: int64
id                75
created_utc       75
title             75
author            75
n_comments        75
score             75
ratio             75
text              75
comment_id        75
comment_utc       75
comment_author    75
body              75
comment_score     75
dtype: int64


The number is not significant, so I will drop these rows.  It's less than 1% of the dataset.

In [6]:
indexBody = reddit_df[(reddit_df['body'] == '[deleted]') | \
                      (reddit_df['body'] =='[removed]')].index
reddit_df.drop(indexBody, inplace=True)
print(reddit_df[reddit_df['body'] == '[deleted]'].count())
print(reddit_df[reddit_df['body'] == '[removed]'].count())

id                0
created_utc       0
title             0
author            0
n_comments        0
score             0
ratio             0
text              0
comment_id        0
comment_utc       0
comment_author    0
body              0
comment_score     0
dtype: int64
id                0
created_utc       0
title             0
author            0
n_comments        0
score             0
ratio             0
text              0
comment_id        0
comment_utc       0
comment_author    0
body              0
comment_score     0
dtype: int64


A quick look at the dataset suggests that the `text` feature may contain null or blank values.  I will check.

In [7]:
reddit_df['text'].map(len).value_counts()

text
0       701
2258    355
36      296
2004    294
554     270
       ... 
686       1
710       1
4134      1
342       1
51        1
Name: count, Length: 434, dtype: int64

The `text` feature is empty (length 0) in 701 rows.  Since this study concentrates on the `title` and `body` features, I will drop it.

In [None]:
reddit_df.drop('text', axis=1, inplace=True)
reddit_df.info()

With 336,775 rows the dataset is larger than required for the lab.  In addition, clustering with K-Means can be computationally expensive and time consuming.  I will truncate the dataset to a number of rows that can be processed in a reasonable amount of time.

In [None]:
reddit_df = reddit_df.head(10000).copy()  # copy avoids SettingWithCopyWarning on later assignments
reddit_df.shape

-------------------------------------------------------------------------------
### Preprocess Text

In order to conduct a meaningful EDA of the text data, I will apply some basic NLTK preprocessing to it.  This includes removing punctuation, converting everything to lower case, removing numbers, extra spaces, and stop words, and breaking words down to their *lemmas*.  I have defined a function to carry out these tasks.  I am concentrating on the `title` and `body` features, but will also process `author` and `comment_author`.

In [None]:
cols = ['title', 'body', 'author', 'comment_author']
for col in cols:
    reddit_df[col] = reddit_df[col].apply(process_text)
reddit_df.sample(5)

For this study, I am mostly concerned with the text data in the `title` and `body` features.  To make processing easier, I will merge the two.  I will also add in `author` and `comment_author` to add more words to be clustered.

In [None]:
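# Join with spaces so the last word of one field is not fused with the first word of the next
reddit_df['content'] = reddit_df['title'] + ' ' + reddit_df['body'] + ' ' + \
                       reddit_df['author'] + ' ' + reddit_df['comment_author']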
reddit_df['content'].sample(5)

-------------------------------------------------------------------------------
## Perform some Basic EDA

In [None]:
print(reddit_df.info())
print('\nDataset shape: ', reddit_df.shape)

The dataset contains 10,000 rows and 13 columns.  There are no NaN values and the datatypes include *datetime64[ns]*, *int64*, *float64*, and *object*.  For this analysis, I am concentrating on *object* text data.

**List the top 10 Words**



In [None]:
freq = pd.Series(' '.join(reddit_df['content']).split()).value_counts(ascending=False).to_dict()
print('Top 10 Words:')
list(freq.items())[:10] # using list to make the output more readable.
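
The same word count can be sketched with `collections.Counter` from the standard library, which avoids building one large joined string.  The documents below are made up and merely stand in for the `content` column.

```python
from collections import Counter

# Made-up documents standing in for the `content` column.
docs = ["rent food rent", "food bank rent", "rent help food"]
counts = Counter(word for doc in docs for word in doc.split())
top_two = counts.most_common(2)
print(top_two)  # [('rent', 4), ('food', 3)]
```

The resulting counts can also feed a bar chart of top words, which the assignment suggests as an alternative to a word cloud.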

**Create a WordCloud from Text Data**

In [None]:
wc = WordCloud(width=1000, height=600, max_words=500).generate_from_frequencies(freq)
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The word cloud, along with the top 10 word list, gives a good idea of which subreddit was downloaded: the most frequently used words are strong indicators of the subject under discussion.

**Vectorize Text Data**

Machine Learning algorithms do not understand text, so text must be converted to numeric values before it can be fed into an ML algorithm.  There are a few options available for this task.  In a past lab we used a *CountVectorizer*, which simply counts the number of times a word appears in each document.  The *TfidfVectorizer*, on the other hand, counts the words and then down-weights words that appear in many documents, so that words distinctive to a document carry more weight.  I've also added parameters suggested by this article [Sparse Features](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py), since the dataset I am working with is sparse in nature.

In [None]:
tfv = TfidfVectorizer(sublinear_tf=True, max_df=0.5, min_df=5, stop_words='english')
X = tfv.fit_transform(reddit_df['content'])
X.shape
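
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a made-up corpus.  It assumes the formulas scikit-learn documents for `sublinear_tf=True` with its default smoothing and L2 normalization; it is meant as an illustration of the idea, not a replacement for `TfidfVectorizer` (it ignores `max_df`, `min_df`, and stop words).

```python
import math
from collections import Counter

docs = ["rent food rent", "food bank", "rent help"]  # made-up corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    weights = {}
    for term, count in Counter(tokens).items():
        tf = 1 + math.log(count)                                  # sublinear term frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1   # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
print(vectors[1])
```

In the second document, *bank* appears in only one document while *food* appears in two, so *bank* receives the larger weight.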

-------------------------------------------------------------------------------
### Cluster Text with K-Means

K-Means clustering is one of the most popular unsupervised machine learning algorithms. K-Means clustering is used to find intrinsic groups within an unlabelled dataset and draw inferences from them. In this study, I will be clustering the most used words in the Reddit submissions.

K-Means optimizes a non-convex objective function, so its clustering is not guaranteed to be optimal for a given random initialization.  Furthermore, sparse high-dimensional data such as text vectorized using the *Bag of Words* approach (which I am doing in this lab) can cause K-Means to initialize centroids on extremely isolated data points.  One way to mitigate this problem is to increase the number of runs with independent random initializations by raising the `n_init` parameter [Sparse Features](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py).  The default for `n_init` depends on the value of `init`: since the default for `init` is *k-means++*, the default for `n_init` is **1**.

In [None]:
# Declare a list to store results in
sum_sq= []

# fit the model for 1 to 10 clusters and add the results to sum_sq
for n in range(1, 11):
    print('Calculating for ',n,' clusters')
    
    # random_state makes the results more reproducible 
    km_model = KMeans(n_clusters=n, max_iter=2000, n_init=2, random_state=42)
    km_model.fit(X)
    sum_sq.append(-km_model.score(X))
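
The effect of `n_init` described above can be illustrated with a small pure-Python sketch.  It runs plain Lloyd iterations on made-up 1-D data from two hand-picked starting points (scikit-learn draws its starting points randomly, so the explicit `inits` list is an assumption made only to keep the example deterministic) and keeps the run with the lowest inertia.

```python
def kmeans_1d(points, centroids, n_iter=20):
    """Plain Lloyd's algorithm on 1-D points from the given starting centroids."""
    centroids = list(centroids)
    k = len(centroids)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # made-up data, three obvious groups
inits = [[1.0, 1.2, 5.0],   # poor start: two centroids land in the same group
         [1.0, 5.0, 9.0]]   # good start: one centroid per group
runs = [kmeans_1d(points, init) for init in inits]       # analogous to n_init=2
best_centroids, best_inertia = min(runs, key=lambda run: run[1])
print([round(r[1], 2) for r in runs], round(best_inertia, 2))
```

The poor start gets stuck in a local optimum with a much higher inertia, which is why more initializations make it more likely that at least one run finds a good solution.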

One method to determine the optimal number of clusters is to plot the results of the K-Means clustering algorithm and look for a bend in the plot.  The bend should indicate the optimal number of clusters.  This is known as the *elbow method*, and the plot is often called a *scree plot*.

In [None]:
plt.plot(range(1, 11), sum_sq, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances (inertia)')
plt.show()
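
When the bend is hard to judge by eye, one rough heuristic is to look for the number of clusters where the slope of the curve changes the most (the largest second difference).  The sketch below uses made-up inertia values rather than the results from this notebook, and the heuristic is only an aid to reading the plot, not a definitive rule.

```python
# Made-up inertia values for k = 1..6, for illustration only.
ks = [1, 2, 3, 4, 5, 6]
inertias = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0]

# Second difference: how much the drop in inertia shrinks at each interior k.
second_diff = {
    ks[i]: (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
    for i in range(1, len(ks) - 1)
}
elbow_k = max(second_diff, key=second_diff.get)
print(second_diff, elbow_k)  # elbow_k is 3 for these values
```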

This plot is a little difficult to interpret.  A bend starts at 5 with a sharper change at 6. Based on the angle of the curve at 6, I'll run the model with 6 clusters.

In [None]:
# Fit the KMeans model with 6 clusters
km_model = KMeans(n_clusters=6, n_init=2, random_state=42)
km_model.fit(X)

# gather the predictions
preds = km_model.predict(X)

I will use `sklearn.metrics.silhouette_score` to gauge the performance of the 6-cluster model.  The Silhouette Coefficient is calculated using the mean intra-cluster distance and the mean nearest-cluster distance for each sample.  The best value is 1 and the worst is -1.  Values near 0 indicate overlapping clusters [sklearn.metrics.silhouette_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).  I would like to see a score close to 1.

In [None]:
score = metrics.silhouette_score(X, preds, sample_size=2000)
score
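
For reference, the per-sample silhouette is `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of the sample's own cluster and `b` is the mean distance to the members of the nearest other cluster.  The pure-Python sketch below applies that definition to made-up 1-D data; it assumes every cluster has at least two members and is meant only to show the arithmetic.

```python
def silhouette_samples_1d(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) using absolute distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, lab_q) in enumerate(zip(points, labels))
                if lab_q == lab and j != i]
        a = sum(same) / len(same)   # mean distance within the sample's own cluster
        b = min(                    # mean distance to the nearest other cluster
            sum(abs(p - q) for q, lab_q in zip(points, labels) if lab_q == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [1.0, 2.0, 10.0, 11.0]   # made-up data: two well-separated groups
labels = [0, 0, 1, 1]
scores = silhouette_samples_1d(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))  # 0.889
```

Well-separated groups score close to 1; as the groups move closer together, `b` shrinks toward `a` and the score falls toward 0.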

The score is above zero, but still not very close to 1.  I will try with 5 clusters.

In [None]:
# Fit the KMeans model with 5 clusters
km_model = KMeans(n_clusters=5, n_init=2, random_state=42)
km_model.fit(X)

# gather the predictions
preds = km_model.predict(X)

In [None]:
score = metrics.silhouette_score(X, preds, sample_size=2000)
score

The silhouette score for 5 clusters is closer to 1 than that of the 6-cluster model.  However, both scores are fairly close to 0, which indicates the clusters may be overlapping.

In very high-dimensional spaces, Euclidean distances tend to become inflated (curse of dimensionality).  This may have happened here; to confirm, I will check the inertia score.  Inertia is a measure of how internally coherent the clusters are: the sum of squared distances from each sample to its nearest cluster centroid.  A score of zero is optimal.

In [None]:
km_model.inertia_
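
As a reminder of what this number measures, inertia for samples $x_1, \dots, x_n$ and centroids $\mu_1, \dots, \mu_k$ (notation introduced here for illustration) can be written as:

$$
\text{inertia} = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2
$$

Because it is a sum over all samples and is not normalized, inertia grows with the number of rows and with the scale of the features, so it is mainly useful for comparing models fitted on the same data.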

The score is significantly higher than zero.  I will apply an algorithm to reduce the dimensions.

-------------------------------------------------------------------------------
**Principal Component Analysis (PCA)**

It is possible the dataset has too many dimensions and this is causing the clustering algorithm to not perform as expected (*Curse of dimensionality*).  Linear dimensionality reduction can be performed using ***Singular Value Decomposition (SVD)***.  I use scikit-learn's `TruncatedSVD`, which, unlike PCA, does not center the data and therefore works directly on sparse matrices.  I will reduce the dimensions of the dataset and plot the results.

In [None]:
# Create the SVD object and reduce dataset
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
data_reduced = svd.fit_transform(X)
data_reduced = pd.DataFrame(data_reduced)

# Plot results
ax = data_reduced.plot(kind='scatter', x=0, y=1, c=preds, cmap='rainbow')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('K-Means Clusters after Reducing Dimensions')

The plot above indicates the clusters are not well defined.  Even though there is overlap, the plot still shows 4 clusters.  The overlap occurs along the edges of the clusters, which is consistent with what the *inertia_ score* suggested.

**Hierarchical Cluster Analysis (HCA)**

**HCA** is an algorithm that can be used to make the clusters more distinct and provide better separation between them.  It works by treating each observation as a separate cluster.  Then it repeatedly executes the following two steps:
1) Identify the two clusters that are closest together
2) Merge the two most similar ones.  This continues until all the clusters are merged
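
The two steps above can be sketched in a few lines of pure Python.  For simplicity this illustration uses single linkage (distance between the closest members) on made-up 1-D data, not the Ward criterion used later in this notebook, and it stops once a requested number of clusters remains.

```python
def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering on 1-D points."""
    clusters = [[p] for p in points]           # every observation starts alone
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

groups = agglomerate([1.0, 1.5, 5.0, 5.4, 9.0], n_clusters=3)
print(groups)  # [[1.0, 1.5], [5.0, 5.4], [9.0]]
```

Stopping at a chosen number of clusters corresponds to cutting the merge history at a given height.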

**Dendrograms**

We can use a dendrogram to visualize the history of groupings and figure out the optimal number of clusters.  To do so, determine the largest vertical distance that does not intersect any of the other clusters and draw a horizontal line at both extremities.  The optimal number of clusters is equal to the number of vertical lines going through the horizontal line.

In [None]:
# I will select a sample of 2000 from the 10,000
rng = np.random.default_rng(42)  # seeded so the sample is reproducible
indices = rng.choice(X.shape[0], 2000, replace=False)
ydist = X[indices].toarray() # The linkage() method input requires an array

# The scipy hierarchical linkage method is used to create the HCA
# parameter ward instructs the method to use the Ward variance 
# minimization algorithm
Z = linkage(ydist, 'ward')

# Plot dendrogram
plt.figure(figsize=(15, 8))
dendrogram(Z)
plt.title('Dendrogram on Sample Data')
plt.ylabel('Euclidean distances')
plt.show()

In [None]:
plt.figure(figsize=(15, 8))
dendrogram(Z)
plt.title('Dendrogram')
plt.ylabel('Euclidean distances')
plt.axhline(y=9, color='b', linestyle='--')
plt.show()

If I understand the Lecture for Week 6 correctly, the dendrogram is indicating I should apply hierarchical clustering for 5 clusters.
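
One way to turn a dendrogram into flat cluster labels is scipy's `fcluster`, which cuts the tree built by `linkage`.  The sketch below uses a made-up toy array rather than the notebook's data; the names `toy`, `toy_Z`, and `toy_labels` are hypothetical.  It also shows the shape of the linkage matrix, which records the merge history rather than the observations themselves.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points forming three well-separated groups.
toy = np.array([[0.0, 0.0], [0.1, 0.0],
                [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0], [10.1, 0.0]])
toy_Z = linkage(toy, 'ward')

# The linkage matrix has one row per merge (n - 1 rows) and four columns:
# the two merged cluster ids, the merge distance, and the new cluster size.
print(toy_Z.shape)

# Cut the tree so that at most 3 flat clusters remain; one label per original row.
toy_labels = fcluster(toy_Z, t=3, criterion='maxclust')
print(toy_labels)
```

Labels obtained this way line up with the sampled rows, so they could be passed to `metrics.silhouette_score` together with those rows for a like-for-like comparison with K-Means.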

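I will try K-Means again with 5 clusters, this time fitting on `Z`, the linkage matrix produced by the HCA step.  Note that `Z` encodes the merge history (the two merged cluster ids, the merge distance, and the new cluster size in each row) rather than the document vectors, so the scores below are not directly comparable to the earlier ones.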

In [None]:
# Fit the KMeans model with 5 clusters
km_model = KMeans(n_clusters=5, n_init=2, random_state=42)
km_model.fit(Z)

# gather the predictions
preds = km_model.predict(Z)
metrics.silhouette_score(Z, preds, sample_size=2000)

Look at the *inertia_ score*

In [None]:
km_model.inertia_

The *silhouette score* shows significant improvement, which suggests the clusters are not overlapping as much as before the PCA.  Unfortunately, it looks like the clusters are less well defined after the HCA, as the *inertia score* is significantly higher than before the HCA.

-------------------------------------------------------------------------------
### Summary and Discussion
**Summary**

The goal of this lab was to download a text-based dataset from Twitter or Reddit.  I chose to download submittals from the poverty subreddit.  The submittals and their comments had to be downloaded separately due to API limitations.  Since the relationship between a submittal and its comments is one-to-many, I saved the submittals and their comments to a *sqlite3* database file.  This allowed me to load the data into a dataframe using a *SQL* join statement.

After creating the dataframe from a SQL statement, I performed some basic data wrangling to change the timestamps from UNIX epoch to a readable format.  I also identified the submittal's title and the comment's body as the features of interest in this study.  As part of the data wrangling process, I also reduced the number of records in the dataset from over 300,000 to 10,000 using Pandas' head function.  Then I preprocessed all of the text features by converting all characters to lowercase, removing digits, removing punctuation, and applying a lemmatizer algorithm to reduce words to their basic forms.  I chose to lemmatize the text instead of stemming, since NLTK's *WordNetLemmatizer* actually uses a corpus (word dictionary) to look up the words being lemmatized.  The basic form of a word is known as a *lemma*.  I believe the lemmatizer can also reduce proper nouns that end in *s*, for example *Mars* to *mar*; this may be due to my converting everything to lowercase before applying the lemmatizer.

The features `title`, `body`, `author`, and `comment_author` were merged together into one feature named `content`. Furthermore, the `text` feature was dropped, since it did not contain any useful information.  After the merge and drop processes, some basic EDA was conducted on the feature.  This included listing the top 10 words used in the dataset and plotting a word cloud.  

Next, the dataset under study was vectorized using the *TfidfVectorizer*, which converts a collection of raw documents to a matrix of TF-IDF features.  These features were then fitted to a K-Means algorithm, and a 10-step loop iterated through different numbers of clusters, calculating a score for each to find the optimal number of clusters to use.  The *scree* plot suggested 6 clusters, while the *silhouette score* was slightly higher for 5.  However, the *silhouette* and *inertia_* scores indicated that the clusters were not well defined and the model may be suffering from the *curse of dimensionality*.  Therefore, an SVD method was applied to the dataset to reduce its linear dimensionality.  Nonetheless, a PCA plot indicated the clusters were still not well formed, and an HCA was conducted.

**Results**

The PCA process did reduce the *dimensionality* of the dataset, and the final model produced a significantly larger *silhouette score*.  This indicates the clusters are still overlapping, but not as badly as before the PCA.  The *inertia score* after HCA indicates the clusters are still not well defined and actually became less defined.

| Process                                      | Silhouette | Inertia      |
|----------------------------------------------|------------|--------------|
| K-Means<br> 5 Clusters                       | 0.063      | 8750.09      |
| K-Means<br> 6 Clusters                       | 0.062      | Not Recorded |
| K-Means after<br> PCA and HCA<br> 5 Clusters | 0.43       | 526322304.61 |

**Discussion**

I found that just clustering text is most likely not the best use for K-Means clustering.  However, I do think that this can be remedied by setting some categories for the clusters ahead of time.  For example, the clusters could be organized around a handful of predefined topics drawn from the subreddit.  It would be interesting to see if K-Means could create those clusters based on analyzing the text of the submittal and its comments.