### <strong>Part A: Data Preparation</strong>

<strong>s1. data loading: breakdown message</strong>

-load the csv file into a dataframe  

-inspect the first few rows  

-check overall shape and column info  


In [44]:
# s1. data loading: code

import pandas as pd

df_full = pd.read_csv("HW3_health_headlines_10000.csv")  
# loads entire dataset into df_full

print("full dataset shape:", df_full.shape)  
# shows how many rows and columns

sample_df = df_full.head(5)  
# takes a small sample of rows

print(sample_df)  
# quick peek at initial data

df_full.info()  
# summary of columns, dtypes, and non-null counts


full dataset shape: (10000, 2)
                       id                                              title
0   2008-05/aga-pop053008  Prevalence of pre-cancerous masses in the colo...
1     2008-05/e-nfo052708  New form of ECT is as effective as older types...
2  2008-05/jaaj-add050808  Anti-inflammatory drugs do not improve cogniti...
3  2008-05/jaaj-mmw052208  Many men with low testosterone levels do not r...
4  2008-05/jaaj-mot050108  Much of the increased risk of death from smoki...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10000 non-null  object
 1   title   10000 non-null  object
dtypes: object(2)
memory usage: 156.4+ KB


In [None]:
# Observations:
# - 2 columns present: 'id' and 'title'
# - Both columns have datatype 'object' indicating textual data
 


<strong>s2. data cleaning: breakdown message</strong>

-remove duplicates based on `title`  
-check missing values in columns  
-create `cleaned_title` for standardized text  
-add `title_length` to measure word counts  
-keep columns that are important for the next step  


In [53]:
# s2. data cleaning: code

import re

# remove duplicates to ensure unique headlines
df_full.drop_duplicates(subset='title', inplace=True)
print("shape after removing duplicates:", df_full.shape)

# check missing values
print("missing values per column:")
print(df_full.isnull().sum())

# function to lowercase and remove punctuation
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

# create a cleaned version of titles
df_full['cleaned_title'] = df_full['title'].apply(clean_text)

# track word count in cleaned titles
df_full['title_length'] = df_full['cleaned_title'].apply(lambda x: len(x.split()))

# show a few rows
print(df_full.head(3))

# personal notes
# - duplicates removed, missing values checked
# - new columns: cleaned_title and title_length
# - next step will extend text processing and explore data


shape after removing duplicates: (9988, 7)
missing values per column:
id                   0
title                0
final_text           0
year                 0
month                0
parsed_date          0
final_text_length    0
dtype: int64
                       id                                              title  \
0   2008-05/aga-pop053008  Prevalence of pre-cancerous masses in the colo...   
1     2008-05/e-nfo052708  New form of ECT is as effective as older types...   
2  2008-05/jaaj-add050808  Anti-inflammatory drugs do not improve cogniti...   

                                          final_text  year month parsed_date  \
0   prevalence precancerous mass colon patient 40 50  2008    05  2008-05-01   
1  new form ect effective older type without cogn...  2008    05  2008-05-01   
2  antiinflammatory drug improve cognitive functi...  2008    05  2008-05-01   

   final_text_length                                      cleaned_title  \
0                  7  prevalence of pre

<strong>summary of current preprocessing: update so far</strong>

-check dataset shape: 10,000 rows, 2 columns (`id`, `title`)  
-remove duplicates: 9,988 rows remain (12 removed)  
-check missing values: none in `id`, `title`, `cleaned_title`, `title_length`  
-text cleaning: `cleaned_title` created (punctuation removed, lowercase applied)  
-title length: `title_length` added (avg: 10 words, min: 2, max: 31)  
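
One side effect of the cleaning rule is worth seeing on a small case: deleting punctuation outright merges hyphenated words, which is why forms like `precancerous` and `antiinflammatory` appear later in `final_text`. The sketch below reuses the notebook's regex and compares it with a hypothetical variant, `clean_text_spaced`, that replaces punctuation with a space instead. The headline is made up for illustration.

```python
import re

def clean_text(text):
    # same rule as the notebook: lowercase, then delete every character
    # that is neither a word character nor whitespace
    text = str(text).lower()
    return re.sub(r'[^\w\s]', '', text)

def clean_text_spaced(text):
    # hypothetical variant: turn punctuation into a space, then collapse
    # repeated whitespace so word counts stay meaningful
    text = re.sub(r'[^\w\s]', ' ', str(text).lower())
    return re.sub(r'\s+', ' ', text).strip()

headline = "Pre-cancerous masses: a follow-up study"  # made-up example

merged = clean_text(headline)
spaced = clean_text_spaced(headline)

print(merged)  # precancerous masses a followup study
print(spaced)  # pre cancerous masses a follow up study
print(len(merged.split()), len(spaced.split()))  # 5 7
```

Which variant is preferable depends on the goal: merging keeps compound terms as one token, while spacing keeps word counts closer to the original headline.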


<strong>s3. advanced modifications and exploration: breakdown message</strong>

-remove stopwords to refine text  
-optional lemmatization for consistent word forms  
-parse `id` for year and month  
-create a final column named `final_text`  
-check distribution of word counts with an interactive histogram  
-list top words used in `final_text`  


In [54]:
# s3. advanced modifications and exploration: code

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # splits text, discards filler words, rejoins
    tokens = text.split()
    filtered = [w for w in tokens if w not in stop_words]
    return " ".join(filtered)

df_full['no_stop'] = df_full['cleaned_title'].apply(remove_stopwords)
# removes common filler words, stores result in no_stop

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # transforms words to base forms
    words = text.split()
    lemmed = [lemmatizer.lemmatize(w) for w in words]
    return " ".join(lemmed)

df_full['final_text'] = df_full['no_stop'].apply(lemmatize_text)
# stores final processed text in final_text

df_full[['year_month','id_code']] = df_full['id'].str.split('/', n=1, expand=True)
# splits id into year_month and id_code

df_full['year'] = df_full['year_month'].str[:4]
# extracts year substring

df_full['month'] = df_full['year_month'].str[5:7]
# extracts month substring

df_full['parsed_date'] = pd.to_datetime(df_full['year_month'] + '-01', errors='coerce')
# forms a valid date by appending day '01'

df_full.drop(columns=['cleaned_title','title_length','no_stop','year_month','id_code'], inplace=True)
# removes less relevant or intermediate columns

df_full['final_text_length'] = df_full['final_text'].apply(lambda x: len(x.split()))
# calculates word count in final_text

import plotly.express as px
from collections import Counter

print(df_full['final_text_length'].describe())
# shows numeric stats for final_text_length

fig_hist = px.histogram(
    df_full,
    x='final_text_length',
    nbins=10,
    title='distribution of final_text length'
)
fig_hist.update_layout(xaxis_title='word count', yaxis_title='frequency')
fig_hist.show()
# displays an interactive histogram of word counts

all_words = " ".join(df_full['final_text']).split()
# compiles all final text into a list of words
word_counts = Counter(all_words)
common_words = word_counts.most_common(15)
common_df = pd.DataFrame(common_words, columns=['word','count'])

fig_bar = px.bar(
    common_df,
    x='word',
    y='count',
    title='top 15 words in final_text'
)
fig_bar.update_layout(xaxis_title='word', yaxis_title='count')
fig_bar.show()
# displays a bar chart of the most common words

#  snippet: display first 10 rows
df_full.head(10)



count    9988.000000
mean        8.021125
std         1.994298
min         1.000000
25%         7.000000
50%         8.000000
75%         9.000000
max        22.000000
Name: final_text_length, dtype: float64


[nltk_data] Downloading package stopwords to /Users/DJ/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/DJ/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


|   | id | title | final_text | year | month | parsed_date | final_text_length |
|---|---|---|---|---|---|---|---|
| 0 | 2008-05/aga-pop053008 | Prevalence of pre-cancerous masses in the colo... | prevalence precancerous mass colon patient 40 50 | 2008 | 5 | 2008-05-01 | 7 |
| 1 | 2008-05/e-nfo052708 | New form of ECT is as effective as older types... | new form ect effective older type without cogn... | 2008 | 5 | 2008-05-01 | 10 |
| 2 | 2008-05/jaaj-add050808 | Anti-inflammatory drugs do not improve cogniti... | antiinflammatory drug improve cognitive functi... | 2008 | 5 | 2008-05-01 | 7 |
| 3 | 2008-05/jaaj-mmw052208 | Many men with low testosterone levels do not r... | many men low testosterone level receive treatment | 2008 | 5 | 2008-05-01 | 7 |
| 4 | 2008-05/jaaj-mot050108 | Much of the increased risk of death from smoki... | much increased risk death smoking reduced with... | 2008 | 5 | 2008-05-01 | 10 |
| 5 | 2008-05/jotn-ait052208 | Also in the May 27 JNCI | also may 27 jnci | 2008 | 5 | 2008-05-01 | 4 |
| 6 | 2008-05/uoc-cle051908 | Childhood lead exposure associated with crimin... | childhood lead exposure associated criminal be... | 2008 | 5 | 2008-05-01 | 7 |
| 7 | 2008-05/w-dsd052908 | Doula support during labor reduces cesarean ra... | doula support labor reduces cesarean rate epid... | 2008 | 5 | 2008-05-01 | 7 |
| 8 | 2008-06/aaop-ruh061008 | Researchers uncover higher prevalence of perio... | researcher uncover higher prevalence periodont... | 2008 | 6 | 2008-06-01 | 9 |
| 9 | 2008-06/asop-csp062408 | Cosmetic surgery procedures to exceed 55 milli... | cosmetic surgery procedure exceed 55 million 2... | 2008 | 6 | 2008-06-01 | 10 |
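
One thing the output above makes visible: row 3's title contains "do not", but its `final_text` reads "many men low testosterone level receive treatment", so stopword removal has dropped the negation. The sketch below shows one possible way to keep negations by subtracting them from the stopword set. The stopword list here is a tiny hypothetical stand-in (NLTK's English list is much longer), and the headline is made up.

```python
# tiny hypothetical stopword list for illustration only
toy_stop_words = {"do", "does", "not", "no", "with", "the", "of"}
negations = {"not", "no"}

def remove_stopwords(text, stops):
    # split on whitespace, drop stopwords, rejoin
    return " ".join(w for w in text.split() if w not in stops)

headline = "drug does not reduce risk of stroke"  # made-up headline

default_result = remove_stopwords(headline, toy_stop_words)
keep_neg_result = remove_stopwords(headline, toy_stop_words - negations)

print(default_result)   # drug reduce risk stroke
print(keep_neg_result)  # drug not reduce risk stroke
```

Whether this matters depends on the downstream task: for topic discovery the negation may be irrelevant, but for anything that depends on the direction of a finding it changes the meaning.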


<strong>column descriptions</strong>

-`id`: unique identifier from the original data

-`title`: raw headline text from the dataset

-`final_text`: cleaned version of `title` (punctuation removed, stopwords removed, lemmatized)

-`year`: year extracted from `id`

-`month`: month extracted from `id`

-`parsed_date`: datetime with day '01' appended to form a valid date

-`final_text_length`: word count of `final_text` (indicates text size)
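
As a small sanity check on the `year`, `month`, and `parsed_date` columns, the same parsing logic can be traced on a single `id` with the standard library. This is a sketch that mirrors the slicing used in s3; `parse_id` is a hypothetical helper, not part of the notebook. Note that the slices return strings, so the month is the zero-padded `'05'` unless it is cast.

```python
from datetime import datetime

def parse_id(raw_id):
    # split "YYYY-MM/code" once on the slash, like str.split('/', n=1)
    year_month, id_code = raw_id.split("/", 1)
    year = year_month[:4]     # string, e.g. '2008'
    month = year_month[5:7]   # string, zero-padded, e.g. '05'
    # append day '01' so the value parses as a full date
    parsed = datetime.strptime(year_month + "-01", "%Y-%m-%d")
    return {"year": year, "month": month, "parsed_date": parsed, "id_code": id_code}

row = parse_id("2008-05/aga-pop053008")
print(row["year"], row["month"], row["parsed_date"].date())
print(int(row["month"]))  # cast when a numeric month is needed
```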




<strong>s4. kmeans clustering: breakdown message</strong>

-build numerical vectors from `final_text`  
-check different values of k (elbow or silhouette)  
-fit kmeans model with chosen k  
-assign cluster labels and explore dominant terms  
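
The elbow check mentioned above is not run in this notebook, so here is a minimal sketch of the idea on made-up one-dimensional data. `kmeans_1d` is a hypothetical, deliberately simple Lloyd-style loop with a deterministic start, not scikit-learn's implementation. It returns the inertia (sum of squared distances to the nearest center), which with scikit-learn would be read from the fitted model's `inertia_` attribute.

```python
def kmeans_1d(points, k, n_iter=20):
    # deterministic init: k evenly spaced values from the sorted data
    pts = sorted(points)
    centers = [pts[i * len(pts) // k] for i in range(k)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # update step: cluster mean, keeping the old center if a cluster is empty
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((p - c) ** 2 for c in centers) for p in pts)
    return centers, inertia

# made-up data with three obvious groups around 1, 5 and 9
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]
inertias = {k: kmeans_1d(data, k)[1] for k in (1, 2, 3, 4)}

for k, value in inertias.items():
    print("k =", k, "inertia =", round(value, 2))
```

Inertia always tends to fall as k grows, so the elbow is the point where the drop flattens out; in this toy data that happens at k = 3. On real headline data the curve is usually much smoother, which is one reason to pair it with silhouette scores.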


<strong>s4.1 vectorization: breakdown message</strong>

-transform `final_text` into numeric form  
-use tf-idf to capture word importance  
-prepare matrix for kmeans  
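
As a rough illustration of what tf-idf weighting computes, the sketch below works it out by hand on three made-up processed headlines. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default; the corpus and variable names are hypothetical.

```python
import math
from collections import Counter

docs = ["drug risk", "drug cancer cancer", "sleep risk"]  # made-up examples

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# document frequency: in how many documents each word appears
df = Counter(w for doc in tokenized for w in set(doc))

# smoothed inverse document frequency: rarer words get larger weights
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]   # term count times idf
    norm = math.sqrt(sum(v * v for v in raw))   # L2 length of the row
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Most entries in a row are zero because each headline uses only a handful of the vocabulary's words, which is why the full matrix is stored in sparse form.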


In [None]:
# s4.1 vectorization: code

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# convert texts to numeric tf-idf format, removing common English words
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(df_full['final_text'])

# convert sparse matrix to dataframe for clearer viewing
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# display first 10 rows with diverse column selections for a better overview
print(tfidf_df.iloc[:10, [0, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]])

# explicitly show shape of resulting dataframe
print("tfidf shape:", tfidf_df.shape)


    06  american  birthweights  cup  falter  improved  microscopy  pharmacist  \
0  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
1  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
2  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
3  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
4  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
5  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
6  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
7  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
8  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   
9  0.0       0.0           0.0  0.0     0.0       0.0         0.0         0.0   

   risked  tamiflu  
0     0.0      0.0  
1     0.0      0.0  
2     0.0      0.0  
3     0.0      0.0  
4  

<strong>s4.2 cluster selection: breakdown message</strong>

-try multiple cluster counts  
-calculate silhouette scores  
-select the best k  


In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

# convert texts to numeric tf-idf format, removing common English words
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(df_full['final_text'])

# prepare list of possible cluster numbers to test
possible_k = [5, 7, 10]

# variables to store best number of clusters and its silhouette score
best_k = None
best_score = -1

# test each k, fit k-means, calculate silhouette scores to find optimal number of clusters
for k in possible_k:
    model_temp = KMeans(n_clusters=k, random_state=42)
    model_temp.fit(X)
    labels_temp = model_temp.labels_
    score_temp = silhouette_score(X, labels_temp)
    
    # print intermediate silhouette scores clearly
    print("For k =", k, ", silhouette score:", round(score_temp, 3))

    # check and store best silhouette score and corresponding k
    if score_temp > best_score:
        best_score = score_temp
        best_k = k

# display best k based on silhouette scores
print("\nBest k selected:", best_k, "(silhouette score:", round(best_score, 3), ")")


For k = 5 , silhouette score: 0.004
For k = 7 , silhouette score: 0.005
For k = 10 , silhouette score: 0.007

Best k selected: 10 (silhouette score: 0.007 )


<strong>silhouette score: explanation</strong>

-measures how well each sample fits its own cluster (cohesion) compared with the nearest other cluster (separation)

-ranges from -1 to +1

-score close to +1 indicates clear, distinct clusters

-score close to 0 indicates overlapping or unclear clusters

-negative scores suggest samples sit closer to a neighboring cluster than to their own

-used to pick the cluster count (k) by selecting the highest score
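
To make the ranges above concrete, here is a hand-rolled sketch of the per-sample silhouette value, `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in the same cluster and `b` is the lowest mean distance to the points of any other cluster. The points, labels, and helper name are made up; Euclidean distance is assumed.

```python
import math

def silhouette_samples(points, labels):
    """Per-sample silhouette values using Euclidean distance."""
    scores = []
    for i, p in enumerate(points):
        # group distances from p to every other point by that point's label
        by_label = {}
        for j, q in enumerate(points):
            if j != i:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            scores.append(0.0)  # singleton cluster or a single cluster: use 0
            continue
        a = sum(own) / len(own)                             # cohesion
        b = min(sum(d) / len(d) for d in by_label.values())  # separation
        scores.append((b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # two tight, well-separated pairs

good = silhouette_samples(points, [0, 0, 1, 1])  # pairs grouped correctly
bad = silhouette_samples(points, [0, 1, 0, 1])   # pairs split across clusters

good_score = sum(good) / len(good)
bad_score = sum(bad) / len(bad)
print(round(good_score, 3), round(bad_score, 3))
```

The correct grouping scores close to +1 and the scrambled grouping goes negative, which matches the interpretation in the list above.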

<span style="color:darkblue;"><strong>s4.3 embedding-based approach: breakdown message</strong></span>

-use semantic embeddings for deeper representation  
-replace or combine tf-idf vectors with embeddings  
-refit clustering with new vectors  
-check if silhouette scores or cluster separations improve  
-low scores are typical in text mining, but embeddings may reduce overlap  



<strong>reasoning behind embedding-based approach</strong>

-the tf-idf silhouette scores (0.004, 0.005, 0.007) are all close to 0, which suggests minimal cluster separation  

-very low silhouette scores are common in text mining

-text data is inherently diverse and overlapping

-semantic ambiguity makes it hard for clusters to form clearly separated groups

-embeddings capture more context than tf-idf alone  


In [66]:
# s4.3 embedding-based approach

import gensim.downloader as api
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# load a pre-trained embedding model (example uses glove)
embedding_model = api.load('glove-wiki-gigaword-50')

# convert each processed headline (final_text) to an average embedding
def text_to_embedding(text):
    tokens = text.split()
    # keep only tokens that exist in the embedding vocabulary
    vectors = [embedding_model[word] for word in tokens if word in embedding_model]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        # no known tokens: fall back to a zero vector of the model's dimensionality
        return np.zeros(embedding_model.vector_size)

# apply embedding conversion to every final_text entry
embeddings_list = df_full['final_text'].apply(text_to_embedding)
embeddings_matrix = np.vstack(embeddings_list.values)

# test different k values
k_values = [5, 8, 10]
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(embeddings_matrix)
    labels = km.labels_
    sil_score = silhouette_score(embeddings_matrix, labels)
    print("k =", k, "silhouette =", round(sil_score, 4))

# personal notes
# - embeddings add semantic detail
# - silhouette scores might still be low, but can be compared with tf-idf results
# - next step: refine preprocessing or explore advanced embedding methods


k = 5 silhouette = 0.0392
k = 8 silhouette = 0.0345
k = 10 silhouette = 0.0353
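
The averaging step in s4.3 has two quiet behaviors that can be traced on a toy vocabulary: tokens missing from the embedding model are skipped, and a headline with no known tokens collapses to the zero vector. The sketch below uses hypothetical 3-dimensional vectors, not real GloVe values, and a pure-Python version of the same logic.

```python
# hypothetical 3-dimensional vectors for illustration, not real GloVe values
toy_vectors = {
    "drug":   [1.0, 0.0, 0.0],
    "risk":   [0.0, 1.0, 0.0],
    "cancer": [0.0, 0.0, 1.0],
}
DIM = 3

def text_to_embedding(text, vectors, dim):
    # collect vectors only for tokens the vocabulary knows
    found = [vectors[w] for w in text.split() if w in vectors]
    if not found:
        return [0.0] * dim  # no known tokens: zero vector
    # average each dimension across the found vectors
    return [sum(col) / len(found) for col in zip(*found)]

known = text_to_embedding("drug risk", toy_vectors, DIM)
with_unknown = text_to_embedding("drug risk zzz", toy_vectors, DIM)
all_unknown = text_to_embedding("zzz", toy_vectors, DIM)

print(known)         # [0.5, 0.5, 0.0]
print(with_unknown)  # [0.5, 0.5, 0.0]
print(all_unknown)   # [0.0, 0.0, 0.0]
```

If many headlines end up as zero vectors they will all land on the same point, so it may be worth counting them before reading too much into the cluster scores.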


<strong>s5. lda topic modeling: overview</strong>

-discover hidden themes based on word distributions  
-treat each processed headline (`final_text`) as a mixture of these hidden topics  
-transform text into numeric form using tokenization and dictionaries  
-iteratively adjust topic-word assignments to reflect underlying patterns  
-examine top words in each topic to assign labels or detect unclear groups  
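
The "tokenization and dictionaries" step can be sketched without any library: each distinct token gets an integer id, and each headline becomes a list of `(token_id, count)` pairs. This is an assumption about the intended representation, since the overview does not name a library; gensim's `corpora.Dictionary` and its `doc2bow` method produce a similar structure. The headlines below are made up.

```python
from collections import Counter

docs = ["drug risk cancer", "sleep risk risk", "cancer drug drug"]  # made-up
tokenized = [d.split() for d in docs]

# assign each distinct token an integer id in order of first appearance
token2id = {}
for doc in tokenized:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(tokens):
    # count token ids and return sorted (token_id, count) pairs
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())

corpus = [to_bow(doc) for doc in tokenized]

print(token2id)  # {'drug': 0, 'risk': 1, 'cancer': 2, 'sleep': 3}
for bow in corpus:
    print(bow)
```

An LDA model would then take a corpus like this, together with the id-to-token mapping, and estimate a word distribution per topic and a topic mixture per headline.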
