<a href="https://colab.research.google.com/github/sasanvhn/tm-and-clustering-project/blob/main/hw12_IF_SasanVahidinia_1750430.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
# Make data directory if it doesn't exist
!mkdir -p data
!mkdir -p models
#!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/recipes.csv -P data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/text-analysis/data/state-of-the-union.csv -P data

File ‘data/state-of-the-union.csv’ already there; not retrieving.



Do Clustering and Topic Modeling for the same data:

In [17]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("data/state-of-the-union.csv")

# Clean the data
df['content'] = df['content'].str.replace(r"[^A-Za-z ]", " ", regex=True).str.lower()

print(df.head())

   year                                            content
0  1790  george washington january          fellow citi...
1  1790   state of the union address george washington ...
2  1791   state of the union address george washington ...
3  1792   state of the union address george washington ...
4  1793   state of the union address george washington ...


Aim for a model with between 5 and 10 clusters / topics (the same number in both models):

In [18]:
# Clustering
from sklearn.cluster import KMeans
from gensim import corpora, models
from gensim.utils import simple_preprocess

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(df['content'])

n_clusters = 5  # Adjust as needed (between 5–10)
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df['cluster'] = kmeans.fit_predict(X)

print("\nCluster Distribution:")
print(df['cluster'].value_counts())

for cluster_id in range(n_clusters):
    print(f"\nCluster {cluster_id}:")
    print(df[df['cluster'] == cluster_id]['content'].head(3))

# Topic modeling
# Tokenize and preprocess the text
texts = df['content'].apply(simple_preprocess)
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in texts]

# Use the same number of topics as clusters so the two models are comparable
n_topics = n_clusters
lda_model = models.LdaModel(corpus=corpus, num_topics=n_topics, id2word=dictionary, passes=10)

print("\nTopics from LDA:")
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

# Label each document with its most probable topic
doc_topics = [lda_model.get_document_topics(doc) for doc in corpus]
dominant_topics = [max(topics, key=lambda x: x[1])[0] for topics in doc_topics]
df['dominant_topic'] = dominant_topics


Cluster Distribution:
cluster
0    89
1    46
2    43
3    26
4    22
Name: count, dtype: int64

Cluster 0:
40     state of the union address andrew jackson dec...
41     state of the union address andrew jackson dec...
45     state of the union address andrew jackson dec...
Name: content, dtype: object

Cluster 1:
125     state of the union address woodrow wilson dec...
126     state of the union address woodrow wilson dec...
128     state of the union address woodrow wilson dec...
Name: content, dtype: object

Cluster 2:
0    george washington january          fellow citi...
1     state of the union address george washington ...
2     state of the union address george washington ...
Name: content, dtype: object

Cluster 3:
177     state of the union address lyndon b  johnson ...
178     state of the union address lyndon b  johnson ...
179     state of the union address lyndon b  johnson ...
Name: content, dtype: object

Cluster 4:
204     state of the union address george h w  bush 
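
The value n_clusters = 5 is one choice within the 5 to 10 range. A minimal sketch of how candidate values could be compared by SSE and silhouette score follows. It runs on a small synthetic corpus (`toy_docs`, built from hypothetical word lists), not on the speeches, and the name `best_k` is introduced only for illustration.

```python
import random

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical word lists standing in for themes; not taken from the speeches
random.seed(0)
themes = [
    ["treaty", "nation", "peace", "war", "army", "navy"],
    ["bank", "currency", "gold", "silver", "treasury", "debt"],
    ["railroad", "commerce", "tariff", "trade", "industry", "labor"],
    ["school", "education", "children", "teachers", "college", "students"],
    ["health", "medicare", "insurance", "doctors", "hospital", "care"],
    ["energy", "oil", "climate", "solar", "nuclear", "power"],
]
# Ten synthetic documents per theme, each with 12 randomly drawn words
toy_docs = [" ".join(random.choices(words, k=12)) for words in themes for _ in range(10)]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

# Fit k-means for each candidate k and record SSE (inertia) and silhouette score
scores = {}
for k in range(5, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_toy)
    scores[k] = (km.inertia_, silhouette_score(X_toy, km.labels_))
    print(f"k={k}: SSE={scores[k][0]:.2f}, silhouette={scores[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][1])
print("Best k by silhouette:", best_k)
```

On the real data, `X` from the cell above would take the place of `X_toy`. SSE tends to fall as k grows, so the silhouette score is usually the more useful of the two for picking a value.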

Compare the results:

In [19]:
print("\nCluster Distribution:")
print(df['cluster'].value_counts())

print("\nTopic Distribution:")
print(df['dominant_topic'].value_counts())

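# Cluster IDs and topic IDs are assigned independently by the two models,
# so cluster i and topic i are printed side by side only for convenience
print("\nComparison of Clusters and Topics:")
for i in range(n_clusters):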
    print(f"\nCluster {i} documents:")
    print(df[df['cluster'] == i]['content'].head(2))

    print(f"\nTopic {i} documents:")
    topic_docs = df[df['dominant_topic'] == i]
    print(topic_docs['content'].head(2))

print(f"\nSum of Squared Errors (SSE): {kmeans.inertia_}")


Cluster Distribution:
cluster
0    89
1    46
2    43
3    26
4    22
Name: count, dtype: int64

Topic Distribution:
dominant_topic
0    76
1    69
3    49
2    27
4     5
Name: count, dtype: int64

Comparison of Clusters and Topics:

Cluster 0 documents:
40     state of the union address andrew jackson dec...
41     state of the union address andrew jackson dec...
Name: content, dtype: object

Topic 0 documents:
0    george washington january          fellow citi...
1     state of the union address george washington ...
Name: content, dtype: object

Cluster 1 documents:
125     state of the union address woodrow wilson dec...
126     state of the union address woodrow wilson dec...
Name: content, dtype: object

Topic 1 documents:
146     state of the union address franklin d  roosev...
149     state of the union address franklin d  roosev...
Name: content, dtype: object

Cluster 2 documents:
0    george washington january          fellow citi...
1     state of the union address georg

Which is Better?

Topic modeling (TM) is the better fit for this dataset. LDA describes each speech as a mixture of topics, so it captures thematic overlap between documents, which matters for understanding how the speeches relate to one another. K-means assigns each speech to exactly one cluster, which works well for coarse groupings but misses the finer connections that TM exposes.


Are There Similarities?



*   Cluster 0 and Topic 0 both include speeches from early presidents like George Washington and Andrew Jackson.
*   Cluster 3 and Topic 3 group modern presidents, like Lyndon B. Johnson and Franklin D. Roosevelt.
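
Because cluster and topic numbers are arbitrary labels, matching them by index can be misleading. One way to check the overlap more systematically is a contingency table plus the adjusted Rand index. The sketch below uses made-up labels (the `toy` frame is hypothetical); on the notebook's data the same calls would take `df['cluster']` and `df['dominant_topic']`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up labels for eight documents, for illustration only
toy = pd.DataFrame({
    "cluster":        [0, 0, 0, 1, 1, 2, 2, 2],
    "dominant_topic": [2, 2, 2, 0, 0, 1, 1, 0],
})

# Rows are clusters, columns are topics, cells count shared documents
table = pd.crosstab(toy["cluster"], toy["dominant_topic"])
print(table)

# For each cluster, the topic that shares the most documents with it
best_match = {int(c): int(t) for c, t in table.idxmax(axis=1).items()}
print("Best-matching topic per cluster:", best_match)

# Adjusted Rand index: 1.0 means identical groupings, values near 0 mean chance-level agreement
ari = adjusted_rand_score(toy["cluster"], toy["dominant_topic"])
print(f"Adjusted Rand index: {ari:.3f}")
```

The adjusted Rand index ignores how the groups are numbered, so it stays the same if the cluster or topic IDs are permuted.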




Obtain data with a query in Carrot2. Use a query whose words start with the same letters as your first and last name.
 • E.g. Sasan Vahidinia -> smart vehicles

1 Search and Cluster with Carrot2
2 Export Documents

In [20]:
file_path = 'web-smart_vehicles-result.csv'
data = pd.read_csv(file_path)

print(data.head())

        Cluster Level 1  id  \
0  Electric,Grid,Fortwo   3   
1  Electric,Grid,Fortwo   9   
2  Electric,Grid,Fortwo  48   
3  Electric,Grid,Fortwo  49   
4  Electric,Grid,Fortwo  63   

                                               title  \
0                                       Smart Fortwo   
1                               Smart electric drive   
2                                         Smart grid   
3  Grid congestion mitigation in the era of share...   
4          A Review on Smart Grid and Its Components   

                                             snippet  \
0  The Smart Fortwo (stylized as "smart fortwo") ...   
1  The Smart EQ Fortwo, formerly Smart Fortwo ele...   
2  integrating electric vehicles into the smart g...   
3  Rapid integration of photovoltaic systems and ...   
4  A day without electricity is a day of trouble....   

                                                 url    sources  
0         https://en.wikipedia.org/wiki/Smart_Fortwo  Wikipedia  
1  http

3 Prepare data: delete the last columns and the first line (check requirements)

In [21]:
cleaned_data = data.drop(columns=['url', 'sources'])

print("Cleaned Data:")
print(cleaned_data.head())

Cleaned Data:
        Cluster Level 1  id  \
0  Electric,Grid,Fortwo   3   
1  Electric,Grid,Fortwo   9   
2  Electric,Grid,Fortwo  48   
3  Electric,Grid,Fortwo  49   
4  Electric,Grid,Fortwo  63   

                                               title  \
0                                       Smart Fortwo   
1                               Smart electric drive   
2                                         Smart grid   
3  Grid congestion mitigation in the era of share...   
4          A Review on Smart Grid and Its Components   

                                             snippet  
0  The Smart Fortwo (stylized as "smart fortwo") ...  
1  The Smart EQ Fortwo, formerly Smart Fortwo ele...  
2  integrating electric vehicles into the smart g...  
3  Rapid integration of photovoltaic systems and ...  
4  A day without electricity is a day of trouble....  
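
For the next step, the cleaned table has to be saved in a form the topic-modeling tool can load. The sketch below assumes the tool accepts a tab-separated file with one document per line in the order ID, label, text. That layout is an assumption and should be checked against the tool's stated requirements. The helper `to_tab_lines` and the file name `documents.txt` are hypothetical, and the two rows are placeholders rather than the real export.

```python
import pandas as pd

# Placeholder rows with the same columns as cleaned_data (not the real export)
toy = pd.DataFrame({
    "Cluster Level 1": ["Electric,Grid,Fortwo", "Electric,Grid,Fortwo"],
    "id": [3, 9],
    "title": ["Smart Fortwo", "Smart electric drive"],
    "snippet": ['A small "city car"\twith two seats.', "An electric\nversion of the car."],
})

def to_tab_lines(frame):
    """Return one 'id<TAB>label<TAB>text' line per row (assumed input layout)."""
    lines = []
    for _, row in frame.iterrows():
        # Merge title and snippet, collapsing tabs/newlines so each document stays on one line
        text = " ".join(f"{row['title']} {row['snippet']}".split())
        lines.append(f"{row['id']}\t{row['Cluster Level 1']}\t{text}")
    return lines

lines = to_tab_lines(toy)
with open("documents.txt", "w", encoding="utf-8") as handle:
    handle.write("\n".join(lines) + "\n")

print(lines[0])
```

Passing `cleaned_data` instead of `toy` would write the actual search results.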


4 Topic Modeling with your data in
https://mimno.infosci.cornell.edu/jsLDA

It works!