
## Load preprocessing artifacts
Loads the outputs that `01_data_loading_and_preprocessing.ipynb` saves to `artifacts/preprocessing_outputs.pkl`. Run that notebook first if the pickle file is missing.

In [None]:
import pathlib
import pickle

# Update this path if your project lives elsewhere in Drive
artifacts_root = pathlib.Path("/content/drive/MyDrive/My_NLP_Learning/Public_Response_Analysis")
artifacts_path = artifacts_root / "artifacts/preprocessing_outputs.pkl"

if not artifacts_path.exists():
    # Fail fast: later cells need df, tfidf_vectorizer and tfidf_matrix
    raise FileNotFoundError(
        "Run 01_data_loading_and_preprocessing.ipynb to generate artifacts first."
    )

# Only unpickle files you created yourself; pickle can execute arbitrary code
with open(artifacts_path, "rb") as f:
    artifacts = pickle.load(f)

df = artifacts["df"]
one_hot_vectorizer = artifacts["one_hot_vectorizer"]
one_hot_matrix = artifacts["one_hot_matrix"]
bow_vectorizer = artifacts["bow_vectorizer"]
bow_matrix = artifacts["bow_matrix"]
tfidf_vectorizer = artifacts["tfidf_vectorizer"]
tfidf_matrix = artifacts["tfidf_matrix"]
cooccurrence_vectorizer = artifacts["cooccurrence_vectorizer"]
cooccurrence_matrix = artifacts["cooccurrence_matrix"]
print("Loaded preprocessing outputs from artifacts/preprocessing_outputs.pkl")
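Before clustering, it can help to confirm that the loaded artifacts are complete and that the TF-IDF matrix has one row per document, since cluster labels are later assigned back to `df` row by row. The following is a minimal sketch; the helper name `validate_artifacts` and the choice of required keys are assumptions made for illustration, not part of the original notebook.

```python
REQUIRED_KEYS = ("df", "tfidf_vectorizer", "tfidf_matrix")  # assumed minimum set


def validate_artifacts(artifacts, required_keys=REQUIRED_KEYS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [
        f"missing key: {key!r}" for key in required_keys if key not in artifacts
    ]
    # Row alignment can only be checked when both objects are present
    if "df" in artifacts and "tfidf_matrix" in artifacts:
        n_docs = len(artifacts["df"])
        n_rows = artifacts["tfidf_matrix"].shape[0]
        if n_rows != n_docs:
            problems.append(
                f"tfidf_matrix has {n_rows} rows but df has {n_docs} documents"
            )
    return problems
```

In the notebook, `validate_artifacts(artifacts)` could be called right after unpickling, stopping early if the returned list is not empty.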


## Apply K-means Clustering


Apply K-means clustering to the TF-IDF matrix to group documents into clusters.


In [35]:
from sklearn.cluster import KMeans

# Number of clusters, fixed at 5 to match the number of NMF topics
n_clusters = 5

# Initialize a KMeans model
# n_init=10 runs K-means from 10 different centroid seeds and keeps the run with the lowest inertia
kmeans_model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)

# Fit the KMeans model to the tfidf_matrix
kmeans_model.fit(tfidf_matrix)

# Add the generated cluster labels to the DataFrame df as a new column
df['kmeans_cluster_label'] = kmeans_model.labels_

# Print the count of documents per cluster
print("Count of documents per K-means cluster:")
print(df['kmeans_cluster_label'].value_counts().sort_index())

Count of documents per K-means cluster:
kmeans_cluster_label
0    12
1    21
2    27
3    17
4    23
Name: count, dtype: int64
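The cluster count above is fixed at 5. As a rough cross-check, the sketch below compares silhouette scores across a few candidate values of k. The toy corpus, the candidate range, and the helper name `silhouette_by_k` are illustrative assumptions; in the notebook you would pass `tfidf_matrix` instead of `toy_matrix`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def silhouette_by_k(matrix, candidate_ks, random_state=42):
    """Fit K-means for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    return scores


# Toy corpus (hypothetical) standing in for the notebook's documents
toy_docs = [
    "bus fares and train delays",
    "train schedule and bus routes",
    "metro delays frustrate commuters",
    "hospital waiting times and nurses",
    "doctors and nurses need funding",
    "hospital funding for patients",
    "school teachers and exam results",
    "teachers want smaller school classes",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

scores = silhouette_by_k(toy_matrix, candidate_ks=range(2, 5))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

Silhouette scores on sparse, high-dimensional TF-IDF vectors tend to be low in absolute terms, so the comparison between values of k is more informative than any single score.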


## Interpret K-means Clusters

Interpret each K-means cluster by listing the terms with the largest weights in its centroid, which serve as the cluster's most representative words.

In [36]:
feature_names = tfidf_vectorizer.get_feature_names_out()
n_top_words_cluster = 10

print("\nTop words for each K-means cluster:")
# Each centroid is the mean TF-IDF vector of the documents assigned to that cluster
for i, centroid in enumerate(kmeans_model.cluster_centers_):
    # Displayed numbers are 1-based: "Cluster 1" corresponds to kmeans_cluster_label 0
    print(f"Cluster {i+1}:")
    # Indices of the n_top_words_cluster largest centroid weights, highest first
    top_words_indices = centroid.argsort()[:-n_top_words_cluster - 1:-1]
    top_words_cluster = [feature_names[j] for j in top_words_indices]
    print(", ".join(top_words_cluster))


Top words for each K-means cluster:
Cluster 1:
publictransport, prove, red, although, break, him, play, lead, raise, each
Cluster 2:
environmentallaws, current, educationpolicy, analysis, simply, generation, place, second, occur, specific
Cluster 3:
economicrelief, phone, face, officer, experience, ever, animal, political, market, hundred
Cluster 4:
increase, economicrelief, list, chance, publictransport, degree, kind, even, husband, different
Cluster 5:
healthcarereform, educationpolicy, despite, measure, seek, mrs, today, attention, table, ago
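
The slice `[:-n_top_words_cluster - 1:-1]` used above is compact but easy to misread. A small sketch with a made-up four-term centroid (the vocabulary and weights are hypothetical) shows what it selects:

```python
import numpy as np

# Hypothetical vocabulary and centroid weights, for illustration only
toy_feature_names = np.array(["bus", "tax", "school", "clinic"])
toy_centroid = np.array([0.1, 0.5, 0.0, 0.3])
n_top = 2

# argsort() orders indices by ascending weight; the reversed slice walks
# back from the end and stops after n_top items, giving the largest first
top_idx = toy_centroid.argsort()[:-n_top - 1:-1]
top_terms = toy_feature_names[top_idx].tolist()
print(top_terms)  # ['tax', 'clinic']

# An equivalent, more explicit spelling: reverse first, then take n_top
assert top_idx.tolist() == toy_centroid.argsort()[::-1][:n_top].tolist()
```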


## Multilingual Sentiment Scoring

Score sentiment polarity with VADER for English text, and explore suitable methods or custom lexicons for the other languages (Spanish, French, German, Hindi). The focus is on noise-tolerant preprocessing, which improves scoring accuracy.


In [37]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon if not already downloaded
try:
    nltk.data.find('sentiment/vader_lexicon')
except LookupError:
    nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
sentiment_analyzer = SentimentIntensityAnalyzer()

print("VADER sentiment analyzer initialized.")

VADER sentiment analyzer initialized.


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
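
The cell above only initializes the analyzer. A minimal sketch of the scoring step follows, assuming the common VADER convention of labeling by the `compound` score with cutoffs at ±0.05. The helper names `label_from_compound` and `score_english_texts` are hypothetical, and non-English text is left unscored because VADER's lexicon is English.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse polarity label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_english_texts(texts, languages, analyzer):
    """Score English texts with `analyzer`; return None for other languages.

    `analyzer` is any object with a VADER-style polarity_scores(text) method
    that returns a dict containing a "compound" key.
    """
    results = []
    for text, lang in zip(texts, languages):
        if lang != "en":
            results.append(None)  # needs a language-specific method
            continue
        compound = analyzer.polarity_scores(text)["compound"]
        results.append((compound, label_from_compound(compound)))
    return results
```

In the notebook this could be called as `score_english_texts(texts, languages, sentiment_analyzer)`, where `texts` and `languages` are parallel sequences. Which DataFrame columns hold the text and the language code is not shown in this section, so that mapping is left open here.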
