# Unsupervised Learning

In this project assignment, I will conduct the following operations on my data:

- Vectorize the texts (CountVectorizer, TfidfVectorizer, Word Embeddings)
- Apply K-Means Clustering to the vectorized data
- Find the best clustering and save the clustering outcome as a new feature

In [149]:
!pip install -U git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git

Collecting git+https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git
  Cloning https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git to /tmp/pip-req-build-j841qgwj
  Running command git clone --filter=blob:none --quiet https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git /tmp/pip-req-build-j841qgwj
  Resolved https://github.com/UChicago-Computational-Content-Analysis/lucem_illud.git to commit b17a265d3b8253424e5b38872457f7437909a65d
  Preparing metadata (setup.py) ... done
Collecting python-docx (from lucem-illud==8.0.1)
  Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
Collecting pdfminer2 (from lucem-illud==8.0.1)
  Downloading pdfminer2-20151206-py2.py3-none-any.whl (117 kB)

In [150]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn import datasets, decomposition
from sklearn.decomposition import PCA, TruncatedSVD

from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

from sklearn.metrics import silhouette_score

import gensim
import lucem_illud

In [2]:
# Import the data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv("/content/drive/MyDrive/DBCommunity/saved_data/tech_aca_data.csv")

In [4]:
df.head(5)

Unnamed: 0,Title,Text,Author,Reply,LastReply,PublishTime,Like,Collect,Repost,Community_name,...,Reply_Month,Reply_Day,Pub_Year,Pub_Month,Pub_Day,normalized_text,tokenized_sentences,TopPost,Length,Aca
0,精华\n\n\n \n ...,由猴面包树组员倡议，我们小组建立官方slack交流群啦！为了营造一个更友好安全的交流氛围，目...,Anon加重音,1115,2023-12-21,2020-10-09,2,4,4,Academia,...,12,21,2020,10,9,"['猴面包树', '组员', '倡议', '小组', '建立', '官方', 'slack'...","[['猴面包树', '组员', '倡议', '小组', '建立', '官方', 'slack...",True,313,1.0
1,精华\n\n\n \n ...,—————————本帖为问卷调查、招募研究对象的集中贴，姐妹们如有问卷调查需要大家帮忙填写或...,Anon加重音,64,2023-12-01,2020-10-14,1,2,1,Academia,...,12,1,2020,10,14,"['本帖', '问卷调查', '招募', '研究', '对象', '贴', '姐妹', '问...","[['本帖', '问卷调查', '招募', '研究', '对象', '贴', '姐妹', '...",True,67,1.0
2,精华\n\n\n \n ...,前情(意见征集贴) https://www.douban.com/group/topic/1...,Anon加重音,12,2023-07-04,2020-10-10,4,6,4,Academia,...,7,4,2020,10,10,"['前', '情', '意见', '征集', '贴', 'https', 'www', 'd...","[['前', '情', '意见', '征集', '贴', 'https', 'www'], ...",True,244,1.0
3,精华\n\n\n \n ...,论坛第二期分享会她说PhD：不同的人生路径的文字稿和音频分享来啦。非常感谢小组长们的全力支持...,Sophie,2,2023-05-03,2020-11-02,1,6,0,Academia,...,5,3,2020,11,2,"['论坛', '第二期', '分享', '会', '说', 'PhD', '人生', '路径...","[['论坛', '第二期', '分享', '会', '说', 'phd', '人生', '路...",True,177,1.0
4,精华\n\n\n \n ...,09/30/21更新: 管理员实在是没有能力及时追踪所有申请相关的帖子，大家有相关的帖子想要...,丸子,6,2023-03-25,2020-12-09,2,2,1,Academia,...,3,25,2020,12,9,"['09', '30', '21', '更新', '管理员', '实在', '能力', '追...","[['更新', '管理员', '实在', '能力', '追踪', '申请', '相关', '...",True,478,1.0


In [6]:
# Replace np.nan values with an empty string
df['seg_text'] = df['seg_text'].fillna('')

In [56]:
# import Chinese stopwords
with open("/content/drive/MyDrive/DBCommunity/my_stopwords.txt") as file:
    cn_stopwords = [line.rstrip() for line in file]

cn_stopwords.extend(["…", ":", "\n", ' '])
cn_stopwords[:10]

['里', '是', '有', '想', '很', '出', '做', '不', '日', '月']

## Vectorization

### CountVectorizer

In [227]:
count_vectorizer = CountVectorizer(min_df=5, max_df=0.8, ngram_range=(1, 1), binary=False,
                                   stop_words=cn_stopwords, token_pattern=r'\b[^\d\W]+\b')
# This regex keeps only tokens made entirely of non-digit word characters;
# tokens that contain any digit (e.g. "2023" or "top10") are dropped.
X_count = count_vectorizer.fit_transform(df["seg_text"])
X_count.shape



(17969, 20765)
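
To see which tokens the `token_pattern` above keeps, here is a minimal sketch that applies the same regular expression to a made-up string with Python's `re` module. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re

# Same pattern as the token_pattern passed to the vectorizer
TOKEN_PATTERN = r'\b[^\d\W]+\b'

# Hypothetical sample text: space-separated tokens, as in pre-segmented Chinese text
sample_text = "slack 2023 小组 top10 phd"

# Tokens containing any digit fail the pattern and are skipped entirely
kept_tokens = re.findall(TOKEN_PATTERN, sample_text)
print(kept_tokens)
```

Chinese characters count as word characters in Python 3 string patterns, so segmented Chinese tokens pass through, while purely numeric or mixed alphanumeric tokens do not.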

In [139]:
count_features = count_vectorizer.get_feature_names_out()

In [140]:
X_count[:10,:10].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [228]:
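# Caution: vocabulary_ maps each term to a column index, while X_count.data lists the
# nonzero counts in row-major (CSR) order, so this positional pairing is not term-aligned.
list(zip(count_vectorizer.vocabulary_.keys(), X_count.data))[:20]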

[('组员', 1),
 ('小组', 2),
 ('建立', 1),
 ('官方', 3),
 ('slack', 7),
 ('交流', 3),
 ('群', 3),
 ('营造', 1),
 ('更', 1),
 ('友好', 1),
 ('氛围', 1),
 ('invitation', 1),
 ('进群', 4),
 ('方式', 2),
 ('详见', 1),
 ('后文', 1),
 ('地区', 1),
 ('有分', 1),
 ('channel', 5),
 ('自定义', 1)]

### Tf-idf Vectorizer

In [144]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=5000, min_df=3, stop_words=cn_stopwords, 
                                    norm='l2', token_pattern=r'\b[^\d\W]+\b')
X_tfidf = tfidf_vectorizer.fit_transform(df["seg_text"])
X_tfidf.shape



(17969, 5000)

In [222]:
tfidf_features = tfidf_vectorizer.get_feature_names_out()

In [145]:
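# Caution: vocabulary_ maps each term to a column index, while X_tfidf.data lists the
# nonzero weights in row-major (CSR) order, so this positional pairing is not term-aligned.
list(zip(tfidf_vectorizer.vocabulary_.keys(), X_tfidf.data))[:20]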

[('组员', 0.04569925685827179),
 ('小组', 0.07491483839740684),
 ('建立', 0.06191267848130932),
 ('官方', 0.053885636969389215),
 ('slack', 0.0489803723855503),
 ('交流', 0.17100414666713595),
 ('群', 0.06759691320333223),
 ('更', 0.06708497015723602),
 ('友好', 0.06516224656912525),
 ('氛围', 0.11073610219261167),
 ('进群', 0.07768050113818961),
 ('方式', 0.042469082774995545),
 ('地区', 0.06118682483562995),
 ('channel', 0.06499550412266991),
 ('智能', 0.06142349612777225),
 ('功能', 0.07045277965349427),
 ('新手', 0.05098678299999754),
 ('看', 0.022385130684771735),
 ('指南', 0.2054071079090263),
 ('设置', 0.06303054286392623)]

### Word Embeddings

In [187]:
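# Parse the stringified token lists back into Python lists before training word embeddings.
# ast.literal_eval only evaluates literals, which is safer than eval on CSV content.
import ast
df['tokenized_sentences'] = df['tokenized_sentences'].apply(ast.literal_eval)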

In [188]:
# .sum() concatenates the per-document sentence lists; sg=0 trains a CBOW model
dfW2V = gensim.models.word2vec.Word2Vec(df['tokenized_sentences'].sum(), sg=0)

In [195]:
dfW2V.wv.vectors

array([[-0.97187287, -0.9953027 ,  2.5201297 , ..., -1.7973468 ,
         0.20276675, -1.3598017 ],
       [-0.6567084 , -1.597359  ,  1.6831065 , ..., -0.74648356,
         0.20860827,  0.5729985 ],
       [ 0.14342594, -0.1581567 , -0.6187185 , ..., -1.1494849 ,
         1.4204233 , -1.0308222 ],
       ...,
       [-0.00868899,  0.01830971, -0.03220418, ..., -0.05985091,
         0.01884498,  0.0241404 ],
       [-0.01676401,  0.00994757, -0.01891322, ..., -0.08101659,
         0.02408008,  0.04166513],
       [-0.01386326,  0.04780138,  0.04536173, ..., -0.07184813,
        -0.03556011,  0.03583223]], dtype=float32)

In [196]:
dfW2V.wv.index_to_key[20]

'时间'

In [230]:
list(zip(dfW2V.wv.index_to_key, dfW2V.wv.vectors))[:20]

[('工作',
  array([-0.97187287, -0.9953027 ,  2.5201297 ,  2.3452647 , -1.5033754 ,
         -2.2536154 , -0.05305953,  1.0797791 ,  0.73584753, -0.51222616,
          1.6513708 , -0.41133708,  0.01504901, -0.4725312 , -0.90267706,
         -2.112994  ,  2.2016342 , -0.90707517, -0.7536464 , -1.7678225 ,
          1.312517  ,  0.639658  ,  0.25208664,  1.0346292 , -0.3399734 ,
         -1.3050929 ,  0.50377005,  1.0854871 , -1.3862387 ,  1.2734426 ,
         -1.3373451 ,  0.03804211,  2.1000226 ,  0.5323971 , -0.34023324,
          0.02645837, -0.06668783, -0.7322328 ,  0.4448285 , -1.5219665 ,
         -0.52079606,  0.49828583, -0.4774285 ,  1.559864  ,  1.820648  ,
          0.9634628 ,  1.1487107 ,  2.7354252 , -0.38509414,  1.2172033 ,
          0.30925503,  0.5997304 , -0.7118732 , -1.5854032 ,  0.05862256,
         -1.8842838 ,  0.831901  , -1.4135511 , -0.37197047,  0.18422474,
         -2.1575484 , -1.3922689 , -0.55344754, -0.5923817 , -1.3658361 ,
          0.32684988, -0.75013

In [252]:
def document_vector(word2vec_model, doc_tokens):
    # Flatten doc_tokens if it is a list of lists
    if doc_tokens and isinstance(doc_tokens[0], list):
        doc_tokens = [token for sublist in doc_tokens for token in sublist]

    # Filter out tokens not in the word2vec model's vocabulary
    embeddings = [word2vec_model.wv[word] for word in doc_tokens if word in word2vec_model.wv.key_to_index]

    # If the document contains no words in the model's vocabulary, return a zero vector
    if not embeddings:
        return np.zeros(word2vec_model.vector_size)

    # Compute the mean of the embeddings
    doc_embedding = np.mean(embeddings, axis=0)
    return doc_embedding

# Calculate the document vector for each row in the DataFrame
df['doc_embedding'] = df['tokenized_sentences'].apply(lambda x: document_vector(dfW2V, x))
df['doc_embedding'].head()

0    [0.030430308, 0.047090378, -0.14805494, 0.1494...
1    [-0.30725056, 0.05729676, -0.48614693, 0.50295...
2    [-0.10015714, -0.12877733, -0.21006966, 0.0298...
3    [-0.16227883, -0.3251159, -0.27754685, 0.41127...
4    [0.074237525, -0.043859884, -0.29924968, -0.00...
Name: doc_embedding, dtype: object
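
The averaging step can be illustrated without a trained model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of `dfW2V.wv`. The helper `mean_embedding` is an illustrative stand-in for `document_vector`, not the function used above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec vocabulary
toy_vectors = {
    "工作": np.array([1.0, 0.0, 2.0]),
    "时间": np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(tokens, vectors, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# The out-of-vocabulary token is ignored, so the mean is taken over two vectors
doc_vec = mean_embedding(["工作", "时间", "未知词"], toy_vectors)

# A document with no known tokens falls back to the zero vector
empty_vec = mean_embedding(["未知词"], toy_vectors)

print(doc_vec)
print(empty_vec)
```

One consequence of the zero-vector fallback is that all documents without any in-vocabulary tokens share the same point in embedding space and will therefore land in the same cluster.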

## K-means clustering

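First, we define a class that runs K-Means on a given vectorized text matrix for a chosen number of clusters k, lists the top terms of each cluster, plots the clusters, and computes the Silhouette Score.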

In [257]:
class TextClusterVisualizer:
    def __init__(self, X, terms, num_clusters):
        self.num_clusters = num_clusters
        self.km = None
        self.terms = terms
        self.order_centroids = None
        self.X = X

    def fit_transform(self):
        """
        Fits the KMeans model and returns a DataFrame of the top 20 terms for each cluster.
        """

        self.km = KMeans(n_clusters=self.num_clusters, init='k-means++', n_init=5, random_state=42)
        self.km.fit(self.X)

        self.order_centroids = self.km.cluster_centers_.argsort()[:, ::-1]

        cluster_terms = {}
        for i in range(self.num_clusters):
            top_terms = [self.terms[ind] for ind in self.order_centroids[i, :20]]
            cluster_terms[f"Cluster {i}"] = top_terms

        df_clusters = pd.DataFrame(cluster_terms)
        df_clusters.index = [f"Term {i+1}" for i in range(20)]
        return df_clusters

    def visualize_clusters(self):
        """
        Visualizes the clusters using PCA for dimensionality reduction.
        Call fit_transform() first so that the cluster labels are available.
        """
        # PCA needs a dense array; densifying a sparse matrix is memory-heavy for large vocabularies
        X_dense = self.X.toarray() if hasattr(self.X, "toarray") else np.asarray(self.X)
        reduced_data = decomposition.PCA(n_components=2).fit_transform(X_dense)

        fig = plt.figure(figsize=(10, 6))
        ax = fig.add_subplot(111)
        ax.set_frame_on(False)
        # Color points by cluster label; a colormap works for any number of clusters
        ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=self.km.labels_, cmap='tab10', alpha=0.5)
        plt.xticks(())
        plt.yticks(())
        plt.title(f'{self.num_clusters} Clusters')
        plt.show()

    def evaluate_clustering(self):
        '''
        Evaluate the clustering performance with Silhouette Score.
        '''
        silhouette_avg = silhouette_score(self.X, self.km.labels_)

        return silhouette_avg

In [143]:
# Find the best k for count vectorizer
k_count = 0
best_score = -1.0  # silhouette scores range from -1 to 1
best_visualizer = None
for k in [3, 4, 5, 6]:
    visualizer = TextClusterVisualizer(X_count, count_features, k)
    visualizer.fit_transform()  # fit KMeans before scoring
    score = visualizer.evaluate_clustering()
    if score > best_score:
        best_score = score
        k_count = k
        best_visualizer = visualizer

print("The best k for count vectorizer is: ", k_count)
print("The best score is: ", best_score)

The best k for count vectorizer is:  3
The best score is:  0.01272627826459261


In [223]:
# Find the best k for tfidf vectorizer
k_tfidf = 0
best_score = -1.0  # silhouette scores range from -1 to 1
best_visualizer = None
for k in [3, 4, 5, 6]:
    visualizer = TextClusterVisualizer(X_tfidf, tfidf_features, k)
    visualizer.fit_transform()  # fit KMeans before scoring
    score = visualizer.evaluate_clustering()
    if score > best_score:
        best_score = score
        k_tfidf = k
        best_visualizer = visualizer

print("The best k for tfidf vectorizer is: ", k_tfidf)
print("The best score is: ", best_score)

The best k for tfidf vectorizer is:  3
The best score is:  0.00611882470322887


In [261]:
# Convert the list of embeddings in 'doc_embedding' to a NumPy array
X = np.stack(df['doc_embedding'].values)

best_score = -1.0  # silhouette scores range from -1 to 1
best_k = 0

# Iterate over different values of k to find the best one
for k in [3, 4, 5, 6]:
    # Initialize and fit KMeans with the current value of k
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)

    # Calculate the silhouette score for the current clustering
    score = silhouette_score(X, km.labels_)

    # Update the best score and best k if the current score is better
    if score > best_score:
        best_score = score
        best_k = k

print("The best k for word embedding is: ", best_k)
print("The best score is: ", best_score)

The best k for word embedding is:  5
The best score is:  0.22497853906043092


Based on these calculations, word embeddings are the best vectorization strategy for the data in my study.

Specifically, when the posts are grouped into 5 clusters based on word embeddings, the Silhouette Score is approximately 0.22. The Silhouette Score ranges from -1 to 1, and values near 0 indicate overlapping clusters. A score of 0.22 therefore indicates that, on average, posts are closer to their own cluster than to neighboring clusters, although the separation is modest.

Meanwhile, the best clusterings based on the count vectorizer and the tf-idf vectorizer achieve Silhouette Scores of only 0.013 and 0.006, respectively. Switching to word embeddings therefore yields a substantially higher Silhouette Score.
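
The selection procedure used above (fit K-Means for several values of k and keep the k with the highest Silhouette Score) can be sketched on synthetic data where the right answer is known. The blobs below are an assumption for illustration. They are not the study data, and real text data will rarely separate this cleanly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (hypothetical): three well-separated 2-D blobs
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=0.5, random_state=42)

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# Keep the k with the highest score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On data like this the score for the true number of clusters is close to 1, which gives a sense of scale for the values reported for the post data.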

In [262]:
# Save the clustering information in the dataset
best_km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
best_km.fit(X)

# Get the cluster labels for each document
cluster_labels = best_km.labels_

# Add the cluster labels as a new column to the original DataFrame
df['cluster_wordembedding'] = cluster_labels

df.head()

Unnamed: 0,Title,Text,Author,Reply,LastReply,PublishTime,Like,Collect,Repost,Community_name,...,Pub_Year,Pub_Month,Pub_Day,normalized_text,tokenized_sentences,TopPost,Length,Aca,doc_embedding,cluster_wordembedding
0,精华\n\n\n \n ...,由猴面包树组员倡议，我们小组建立官方slack交流群啦！为了营造一个更友好安全的交流氛围，目...,Anon加重音,1115,2023-12-21,2020-10-09,2,4,4,Academia,...,2020,10,9,"['猴面包树', '组员', '倡议', '小组', '建立', '官方', 'slack'...","[[猴面包树, 组员, 倡议, 小组, 建立, 官方, slack, 交流, 群], [营造...",True,313,1.0,"[0.030430308, 0.047090378, -0.14805494, 0.1494...",1
1,精华\n\n\n \n ...,—————————本帖为问卷调查、招募研究对象的集中贴，姐妹们如有问卷调查需要大家帮忙填写或...,Anon加重音,64,2023-12-01,2020-10-14,1,2,1,Academia,...,2020,10,14,"['本帖', '问卷调查', '招募', '研究', '对象', '贴', '姐妹', '问...","[[本帖, 问卷调查, 招募, 研究, 对象, 贴, 姐妹, 问卷调查, 帮忙, 填写, 招...",True,67,1.0,"[-0.30725056, 0.05729676, -0.48614693, 0.50295...",1
2,精华\n\n\n \n ...,前情(意见征集贴) https://www.douban.com/group/topic/1...,Anon加重音,12,2023-07-04,2020-10-10,4,6,4,Academia,...,2020,10,10,"['前', '情', '意见', '征集', '贴', 'https', 'www', 'd...","[[前, 情, 意见, 征集, 贴, https, www], [douban], [gro...",True,244,1.0,"[-0.10015714, -0.12877733, -0.21006966, 0.0298...",1
3,精华\n\n\n \n ...,论坛第二期分享会她说PhD：不同的人生路径的文字稿和音频分享来啦。非常感谢小组长们的全力支持...,Sophie,2,2023-05-03,2020-11-02,1,6,0,Academia,...,2020,11,2,"['论坛', '第二期', '分享', '会', '说', 'PhD', '人生', '路径...","[[论坛, 第二期, 分享, 会, 说, phd, 人生, 路径, 文字, 稿, 音频, 分...",True,177,1.0,"[-0.16227883, -0.3251159, -0.27754685, 0.41127...",1
4,精华\n\n\n \n ...,09/30/21更新: 管理员实在是没有能力及时追踪所有申请相关的帖子，大家有相关的帖子想要...,丸子,6,2023-03-25,2020-12-09,2,2,1,Academia,...,2020,12,9,"['09', '30', '21', '更新', '管理员', '实在', '能力', '追...","[[更新, 管理员, 实在, 能力, 追踪, 申请, 相关, 帖子, 相关, 帖子, 想要,...",True,478,1.0,"[0.074237525, -0.043859884, -0.29924968, -0.00...",1
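
Before saving, it can be useful to check how the posts are distributed across clusters, since the first five rows above all fall into cluster 1. The sketch below shows one way to do this with pandas on a made-up frame. `toy_df` and its labels are hypothetical stand-ins for `df` and `best_km.labels_`.

```python
import pandas as pd

# Hypothetical labels standing in for best_km.labels_
toy_df = pd.DataFrame({
    "Title": ["a", "b", "c", "d", "e"],
    "cluster_wordembedding": [1, 1, 0, 2, 1],
})

# Number of posts per cluster, ordered by cluster id
cluster_sizes = toy_df["cluster_wordembedding"].value_counts().sort_index()

# Share of posts per cluster
cluster_share = (cluster_sizes / len(toy_df)).round(2)

print(cluster_sizes.to_dict())
print(cluster_share.to_dict())
```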


In [268]:
df.to_csv("/content/drive/MyDrive/DBCommunity/saved_data/clustering_aca_tech0224.csv")