# Step 3: Analyzing Twitter Data with BERT Embeddings and Clustering

In this notebook, we analyze a collection of tweets using BERT embeddings and clustering techniques. The main steps include loading the tweet data, preprocessing the text, obtaining BERT embeddings, performing K-means clustering, calculating silhouette scores, and visualizing the clusters in 3D space.

### Process Overview
- Load the tweet data from the CSV file 'tweets.csv'.
- Preprocess the tweet text using the 'preprocess_tweet' function.
- Use the BERT model ('bert-base-uncased') to obtain embeddings for each tweet.
- Apply K-means clustering with a specified number of clusters (n_clusters).
- Calculate the silhouette score to assess the clustering quality.
- Reduce the dimensionality of the embeddings to 3D using PCA for visualization.
- Sample 100 tweets (with replacement) from each cluster for visualization purposes.
- Create 3D scatter plots of the sampled tweets from different angles.
- Export the 100 tweets closest to each cluster centroid for manual labeling.

**Note**: The resulting DataFrame includes the original tweet data, processed text, BERT embeddings, cluster assignments, 3D PCA coordinates, and each tweet's distance to its cluster centroid. The overall silhouette score is kept in the variable `silh_score`. The tweets closest to each centroid are exported with the columns id, text, created_at, processed_text, and cluster. (intermediate/output/step_3_cluster_samples.csv)

**Note2**: The 'is_related' column in the exported samples is left blank and should be filled in by domain experts based on their knowledge. (intermediate/input/step_4_cluster_samples_manually_labeled.csv)



In [None]:
import pandas as pd
import pickle
from tqdm import tqdm
from utils import preprocess_tweet, get_pretrained_model_and_tokenizer
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans
import numpy as np
import torch
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.metrics import silhouette_score

In [None]:
tweets_df = pd.read_csv('../data/tweets.csv')

In [None]:
tweets_df['processed_text'] = tweets_df['text'].apply(preprocess_tweet)
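
The `preprocess_tweet` helper lives in `utils` and is not shown in this notebook. As a rough illustration of what such a step usually does, here is a minimal sketch; the function name `preprocess_tweet_sketch` and its cleaning rules are assumptions, not the project's actual implementation.

```python
import re

def preprocess_tweet_sketch(text: str) -> str:
    """Hypothetical stand-in for utils.preprocess_tweet; the real helper may differ."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"#(\w+)", r"\1", text)      # keep the hashtag word, drop '#'
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip().lower()

example = preprocess_tweet_sketch("Check THIS out @bob https://t.co/abc #NLP  rocks")
print(example)
```

Whether lowercasing is appropriate depends on the model: it matches an uncased checkpoint, but a cased model would usually want the original capitalization.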


In [None]:
model_name = 'bert' # options are bert, roberta, sbert, sroberta
model, tokenizer = get_pretrained_model_and_tokenizer(model_name)

In [None]:
# pick the best available device: Apple MPS first, then CUDA, then CPU
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
tweets_df['embedding'] = None

In [None]:
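processed_texts = tweets_df['processed_text'].tolist()
batch_size = 1000
embeddings = []
model = model.to(device)  # move the model once, not on every batch
model.eval()  # disable dropout so the embeddings are deterministic
for i in tqdm(range(0, len(processed_texts), batch_size)):
    inputs = tokenizer(processed_texts[i:i+batch_size], return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():  # inference only; avoids keeping the autograd graph in memory
        outputs = model(**inputs)
    # average over all token positions (padding included) to get one vector per tweet
    embeddings.extend(outputs['last_hidden_state'].mean(dim=1).cpu().numpy().tolist())
tweets_df['embedding'] = embeddings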
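
The pooling above averages over every token position, so padded positions in shorter tweets also contribute to the mean. A common alternative is to weight the average by the tokenizer's `attention_mask`. The sketch below shows the idea with NumPy on toy data; the function name `masked_mean_pool` is hypothetical and this is not the method used in this notebook.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask[..., None].astype(hidden.dtype)     # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)            # padded positions contribute zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy batch: one sequence of three tokens, the last of which is padding.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(axis=1)  # what an unmasked mean would give
print(pooled, plain)
```

The same arithmetic carries over to PyTorch tensors by applying it to `outputs['last_hidden_state']` and `inputs['attention_mask']`.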

In [None]:
# use a context manager so the file handle is closed after writing
with open('../data/tweets_with_embeddings.pkl', 'wb') as f:
    pickle.dump(tweets_df, f)

In [None]:
n_clusters = 150
kmeans = KMeans(n_clusters=n_clusters, random_state=0)
kmeans.fit(tweets_df['embedding'].tolist())
tweets_df['cluster'] = kmeans.labels_

In [None]:
# calculate silhouette score
silh_score = silhouette_score(tweets_df['embedding'].tolist(), tweets_df['cluster'].tolist())
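
The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters, so it can also be used to compare candidate values of `n_clusters`. The sketch below illustrates this on synthetic 2D data standing in for the embeddings; the candidate values and the data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the embeddings: two tight, well-separated blobs.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(50, 2)),
])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest silhouette score is the best-separated one.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the full tweet set, computing the silhouette score requires pairwise distances, so passing `sample_size` to `silhouette_score` may be worth considering for large collections.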

In [None]:
# reduce dimension to 3d using PCA for visualization
pca = PCA(n_components=3, random_state=0)
# PCA is unsupervised, so the cluster labels are not passed to fit_transform
tweets_df['embedding_3d'] = pca.fit_transform(np.array(tweets_df['embedding'].tolist())).tolist()
# sample 100 tweets (with replacement) from each cluster to visualize
sampled_tweets_df = tweets_df.groupby('cluster').sample(n=100, replace=True).reset_index(drop=True)

# draw the same 3d scatter from 3 different elevations using matplotlib
coords = np.array(sampled_tweets_df['embedding_3d'].tolist()) * 100
fig = plt.figure(figsize=(30, 10))
for idx, elev in enumerate([0, 90, 180], start=1):
    # one subplot per angle; calling plt.show() again on a shown figure would not redraw it inline
    ax = fig.add_subplot(1, 3, idx, projection='3d')
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=sampled_tweets_df['cluster'], cmap='tab20c')
    ax.view_init(elev, 0)
plt.show()
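
Three principal components may capture only a small share of the variance in high-dimensional embeddings, so the 3D plot should be read as a rough view of the clusters. One way to check how much is retained is `explained_variance_ratio_`, sketched here on random data standing in for the embeddings (the array shape is an assumption for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # synthetic stand-in for the embedding matrix

pca = PCA(n_components=3, random_state=0)
coords = pca.fit_transform(X)

# Fraction of the total variance kept by the three components.
retained = pca.explained_variance_ratio_.sum()
print(coords.shape, round(float(retained), 3))
```

A low retained fraction suggests that clusters which overlap in the plot may still be well separated in the original embedding space.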



In [None]:
centroids = kmeans.cluster_centers_

# get the 100 tweets from each cluster that are closest to the centroid
embedding_matrix = np.array(tweets_df['embedding'].tolist())
# the assigned label already identifies each tweet's centroid, so no per-row predict call is needed
tweets_df['distance_to_centroid'] = np.linalg.norm(embedding_matrix - centroids[tweets_df['cluster'].to_numpy()], axis=1)
top_tweets_df = tweets_df.sort_values(['cluster', 'distance_to_centroid']).groupby('cluster').head(100).reset_index(drop=True)
top_tweets_df = top_tweets_df[['id', 'text', 'created_at', 'processed_text', 'cluster']].copy()
top_tweets_df.to_csv('../data/intermediate/output/step_3_cluster_samples.csv', index=False)
# blank column for domain experts to fill in during manual labeling
top_tweets_df['is_related'] = None
top_tweets_df.to_csv('../data/intermediate/input/step_4_cluster_samples_manually_labeled.csv', index=False)


In [None]:
# use a context manager so the file handle is closed after writing
with open('../data/intermediate/input/step_4_clustered_tweets_with_embeddings.pkl', 'wb') as f:
    pickle.dump(tweets_df, f)