# **DIGS 30004 Final Project - "New York Times Headlines Clustering"**

## By Franklin Wang 

### (Due on 03/12/2025)


## **1. Data Description**

This dataset contains **New York Times headlines from 2017 to 2020**, along with metadata such as publication section and author. **It is used to examine the major trends in news during the first Trump presidency.** The goal of my final data visualization project is to discover underlying topics using methods such as **TF-IDF, PCA, and K-Means clustering**. The target audience of my work is researchers, media scholars, and students interested in media trends, political discourse, or digital humanities. **All available headlines from 2017–2020 were used without additional sampling or filtering beyond standard cleaning.** Ultimately, I am clustering these headlines into meaningful groups based on their **textual similarity**.

Works Cited:
- For TF-IDF (Text Visualization) from the course reading: 
<br>
*Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.*

- For PCA (Dimension Reduction - PCA) from the course reading: 
<br>
*Jolliffe, Ian T. Principal Component Analysis. Springer, 2002, pp. 1-27.*

- Plotly PCA visualization page from the week 6 Zoom lecture:
<br>
*Plotly. "PCA Visualization." Plotly, https://plotly.com/python/pca-visualization/.*


In [None]:
import pandas as pd

# Load NYT headlines 2017 dataset
file_path_2017 = "new_york_times_stories_2017.csv"
df_2017 = pd.read_csv(file_path_2017)

# Select relevant columns and drop missing values
df_2017 = df_2017[['headline', 'section', 'year']].dropna()
df_2017.head()

This table shows a snapshot of headlines from the New York Times in 2017, along with the section they were published in and the year. It includes a mix of topics—from sports and global news to arts and opinion—giving the audience a sense of the variety in news coverage during that time. Looking at this structure makes it easier to organize and explore patterns in the headlines. The same kind of analysis can also be done for the years 2018 through 2020 to see how topics and coverage may have changed over time.

In [None]:
# Load NYT headlines 2018 dataset
file_path_2018 = "new_york_times_stories_2018.csv"
df_2018 = pd.read_csv(file_path_2018)

# Select relevant columns and drop missing values
df_2018 = df_2018[['headline', 'section', 'year']].dropna()
df_2018.head()

In [None]:
# Load NYT headlines 2019 dataset
file_path_2019 = "new_york_times_stories_2019.csv"
df_2019 = pd.read_csv(file_path_2019)

# Select relevant columns and drop missing values
df_2019 = df_2019[['headline', 'section', 'year']].dropna()
df_2019.head()

In [None]:
# Load NYT headlines 2020 dataset
file_path_2020 = "new_york_times_stories_2020.csv"
df_2020 = pd.read_csv(file_path_2020)

# Select relevant columns and drop missing values
df_2020 = df_2020[['headline', 'section', 'year']].dropna()
df_2020.head()

## **2. Research Question**

**How did New York Times headlines evolve from 2017 to 2020, during President Trump's first term?**
- What were the dominant topics in each year?
- Did the distribution of topics change significantly?
- Did political news coverage increase over the years?

**Potential challenges of sentiment analysis on headlines:**
- Although many headlines contain strong words that convey positivity or negativity, many others may be largely neutral, which makes classification difficult.
- Some headlines can be misleading or sarcastic, making it difficult for machines to interpret their "human" meaning.
- Headlines depend heavily on context; the same words may carry different meanings depending on external events.

## **3. Methodology**

- **TF-IDF (Feature Extraction)**: Converts text into numerical features, measuring how important a word is in a document relative to a larger collection of documents (a corpus).
- **PCA (Dimensionality Reduction)**: Projects the high-dimensional TF-IDF features onto a few components that keep the directions of greatest variance, which makes the data easier to visualize and cheaper to cluster.
- **K-Means Clustering**: Partitions the headlines into k groups by assigning each one to its nearest cluster centroid, so that headlines with similar vocabulary end up in the same group.

Rationale for technique choice: TF-IDF was selected for its simplicity and interpretability on textual data, while PCA was used for visualization because it reduces high-dimensional data efficiently. K-Means was chosen as a straightforward and widely used clustering algorithm suitable for the numeric output of PCA.
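
The three steps can be chained as in the following minimal sketch. It runs on a handful of made-up headlines (not the NYT data), and the variable names and the choice of `n_clusters=3` are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy headlines, not taken from the NYT dataset
toy_headlines = [
    "senate votes on health care bill",
    "house passes health care bill",
    "team wins championship game",
    "coach praises team after game",
    "stocks fall as markets react",
    "markets rally as stocks rebound",
]

# Step 1: TF-IDF turns each headline into a numeric row vector
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_headlines)

# Step 2: PCA projects the rows onto two components
toy_points = PCA(n_components=2, random_state=0).fit_transform(toy_tfidf.toarray())

# Step 3: K-Means assigns each projected headline to one of three clusters
# (n_clusters=3 is an assumption chosen for this toy example)
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_points)

for label, headline in zip(toy_labels, toy_headlines):
    print(label, headline)
```

On the real data, the number of clusters would need to be chosen with care, for example by comparing several values of k.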

## **4. TF-IDF Conversion**

Raw text cannot be fed directly into PCA or K-Means; it needs to be converted into numbers first. That is where TF-IDF comes in.

First, I clean the text by making everything lowercase and removing every character that is not a letter or whitespace (this also strips digits). Then, I apply TF-IDF with scikit-learn's English stop-word list: very common words like “the” are dropped entirely, and the remaining words are weighted so that terms appearing in fewer headlines score higher.

The result is a numerical matrix where each row is a headline and each column represents one of the 1,000 most frequent remaining terms (`max_features=1000`). This organized data can now be used for PCA to reduce dimensions and K-Means for clustering.

*P.S.: The ".apply()" call used in the following code blocks is credited to ChatGPT experimentation; it applies the cleaning function to each headline. The lambda and regular-expression techniques were learned in Dr. Gladstone's and Dr. Prosser's courses.*

In [None]:
# TfidfVectorizer converts text into numerical features.
# re (Regular Expressions) is used for text cleaning.
# numpy is used for numerical computations.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import re

# Preprocess 2017 text (lowercase and remove special characters)
df_2017['cleaned_headline'] = df_2017['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower())) 

# Convert headlines to TF-IDF representation
# Stop words are common words like "the", "and", "is" that need to be removed for analysis
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2017 = vectorizer.fit_transform(df_2017['cleaned_headline'])

# Display TF-IDF matrix shape
# Returns (number of headlines, vocabulary size, which is capped at 1,000 by max_features)
tfidf_matrix_2017.shape

Finding: There were 60,424 NYT headlines in 2017 (after dropping rows with missing values).

In [None]:
# Preprocess 2018 text (lowercase and remove special characters)
df_2018['cleaned_headline'] = df_2018['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))

# Convert headlines to TF-IDF representation
# Stop words removes common words like "the", "and", "is"
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2018 = vectorizer.fit_transform(df_2018['cleaned_headline'])

# Display TF-IDF matrix shape
# Returns (number of headlines, vocabulary size, which is capped at 1,000 by max_features)
tfidf_matrix_2018.shape

Finding: There were 58,846 NYT headlines in 2018, a decrease from the previous year.

In [None]:
# Preprocess 2019 text (lowercase and remove special characters)
df_2019['cleaned_headline'] = df_2019['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))

# Convert headlines to TF-IDF representation
# Stop words removes common words like "the", "and", "is"
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2019 = vectorizer.fit_transform(df_2019['cleaned_headline'])

# Display TF-IDF matrix shape
# Returns (number of headlines, vocabulary size, which is capped at 1,000 by max_features)
tfidf_matrix_2019.shape

Finding: There were 53,247 NYT headlines in 2019, a continued decrease in the number of headlines.

In [None]:
# Preprocess 2020 text (lowercase and remove special characters)
df_2020['cleaned_headline'] = df_2020['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))

# Convert headlines to TF-IDF representation
# Stop words removes common words like "the", "and", "is"
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2020 = vectorizer.fit_transform(df_2020['cleaned_headline'])

# Display TF-IDF matrix shape
# Returns (number of headlines, vocabulary size, which is capped at 1,000 by max_features)
tfidf_matrix_2020.shape

Finding: There were 55,489 NYT headlines in 2020, the first year-over-year increase in this period.

## **5. Word Frequency Bar Chart: Finding the Most Common Words After TF-IDF**

Word frequency bar charts built from the TF-IDF scores help display the most influential words in New York Times (NYT) headlines, highlighting key topics that dominated the news. 

Unlike basic word counts, this approach first removes stop words like "the" or "is" and then down-weights words that appear in many headlines, so the ranking emphasizes terms that are more distinctive within the dataset. This allows us to see which terms (such as "Trump," "election," "pandemic," or "climate") were most significant in shaping the news narrative each year. 

In general, this type of visualization provides a quick, interpretable snapshot of the main themes across different time periods, making it useful for further analysis like clustering, sentiment tracking, or trend forecasting.

*P.S.: The following ".get_feature_names_out()", ".flatten()", and ".sort_values()" calls are credited directly to ChatGPT.*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

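# Refit a 2017-specific vectorizer so the feature names match the 2017 matrix
# (the shared `vectorizer` variable was last fitted on the 2020 headlines)
vectorizer_2017 = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2017 = vectorizer_2017.fit_transform(df_2017['cleaned_headline'])
feature_names = vectorizer_2017.get_feature_names_out()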

# Sum TF-IDF scores for each word across all headlines
word_tfidf_sums_2017 = np.array(tfidf_matrix_2017.sum(axis=0)).flatten()

# Create a dataframe of words and their corresponding TF-IDF scores
word_tfidf_df_2017 = pd.DataFrame({'word': feature_names, 'tfidf': word_tfidf_sums_2017})

# Sort words by TF-IDF score in descending order and select the top 20
top_words_2017 = word_tfidf_df_2017.sort_values(by='tfidf', ascending=False).head(20)

# Plot bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='tfidf', y='word', data=top_words_2017, palette='viridis')
plt.xlabel("Total TF-IDF Score")
plt.ylabel("Word")
plt.title("Top 20 Most Common Words of NYT Headlines in 2017")

# Add source text
plt.figtext(
    0.5, -0.01,  # Position: centered horizontally (0.5), below the plot (-0.01)
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

# Show the chart
plt.show()


In [None]:
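# Refit a 2018-specific vectorizer so the feature names match the 2018 matrix
# (the shared `vectorizer` variable was last fitted on the 2020 headlines)
vectorizer_2018 = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2018 = vectorizer_2018.fit_transform(df_2018['cleaned_headline'])
feature_names = vectorizer_2018.get_feature_names_out()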

# Sum TF-IDF scores for each word across all headlines
word_tfidf_sums_2018 = np.array(tfidf_matrix_2018.sum(axis=0)).flatten()

# Create a dataframe of words and their corresponding TF-IDF scores
word_tfidf_df_2018 = pd.DataFrame({'word': feature_names, 'tfidf': word_tfidf_sums_2018})

# Sort words by TF-IDF score in descending order and select the top 20
top_words_2018 = word_tfidf_df_2018.sort_values(by='tfidf', ascending=False).head(20)

# Plot bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='tfidf', y='word', data=top_words_2018, palette='viridis')
plt.xlabel("Total TF-IDF Score")
plt.ylabel("Word")
plt.title("Top 20 Most Common Words of NYT Headlines in 2018")

# Add source text
plt.figtext(
    0.5, -0.05,  # Position: centered horizontally (0.5), below the plot (-0.05)
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

# Show the chart
plt.show()


In [None]:
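# Refit a 2019-specific vectorizer so the feature names match the 2019 matrix
# (the shared `vectorizer` variable was last fitted on the 2020 headlines)
vectorizer_2019 = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix_2019 = vectorizer_2019.fit_transform(df_2019['cleaned_headline'])
feature_names = vectorizer_2019.get_feature_names_out()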

# Sum TF-IDF scores for each word across all headlines
word_tfidf_sums_2019 = np.array(tfidf_matrix_2019.sum(axis=0)).flatten()

# Create a dataframe of words and their corresponding TF-IDF scores
word_tfidf_df_2019 = pd.DataFrame({'word': feature_names, 'tfidf': word_tfidf_sums_2019})

# Sort words by TF-IDF score in descending order and select the top 20
top_words_2019 = word_tfidf_df_2019.sort_values(by='tfidf', ascending=False).head(20)

# Plot bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='tfidf', y='word', data=top_words_2019, palette='viridis')
plt.xlabel("Total TF-IDF Score")
plt.ylabel("Word")
plt.title("Top 20 Most Common Words of NYT Headlines in 2019")

# Add source text
plt.figtext(
    0.5, -0.05,  # Position: centered horizontally (0.5), below the plot (-0.05)
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

# Show the chart
plt.show()


In [None]:
# Get feature names (words) from the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

# Sum TF-IDF scores for each word across all headlines
word_tfidf_sums_2020 = np.array(tfidf_matrix_2020.sum(axis=0)).flatten()

# Create a dataframe of words and their corresponding TF-IDF scores
word_tfidf_df_2020 = pd.DataFrame({'word': feature_names, 'tfidf': word_tfidf_sums_2020})

# Sort words by TF-IDF score in descending order and select the top 20
top_words_2020 = word_tfidf_df_2020.sort_values(by='tfidf', ascending=False).head(20)

# Plot bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x='tfidf', y='word', data=top_words_2020, palette='viridis')
plt.xlabel("Total TF-IDF Score")
plt.ylabel("Word")
plt.title("Top 20 Most Common Words of NYT Headlines in 2020")

# Add source text
plt.figtext(
    0.5, -0.05,  # Position: centered horizontally (0.5), below the plot (-0.05)
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

# Show the chart
plt.show()


## **6. Dimensionality Reduction with PCA**

Each `array([x, y])` output below gives the explained variance ratios of the first two principal components (PC1 and PC2) in Principal Component Analysis (PCA), that is, the share of the total variance in the TF-IDF matrix that each component captures.

We cannot definitively interpret what PCA Component 1 and Component 2 represent because PCA does not label features—it only finds the directions of maximum variance in the dataset. However, we can observe patterns in how different years' headlines cluster and infer what topics or trends might be influencing them.

In [None]:
from sklearn.decomposition import PCA

# Reduce dimensionality with PCA 2017
pca = PCA(n_components=2)
tfidf_pca_2017 = pca.fit_transform(tfidf_matrix_2017.toarray())

# Display explained variance ratio
pca.explained_variance_ratio_

- PC1 explains 1.42% of the total variance in the data.
- PC2 explains 1.11% of the total variance in the data.
- Together, PC1 + PC2 explain only 2.53% of the total variance.

In [None]:
import matplotlib.pyplot as plt

# Scatter plot of PCA 2017 results
plt.figure(figsize=(8, 6))
plt.scatter(tfidf_pca_2017[:, 0], tfidf_pca_2017[:, 1], marker='o')

plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA of NYT Headlines (2017)")
plt.grid(True)

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()

This plot shows that a lot of the headlines from 2017 were quite similar, clustering closely together. That likely means the New York Times was covering a few major topics repeatedly throughout the year. At the same time, the scatter of some points suggests there were also more unique or varied headlines that stood out from the main themes.

In [None]:
# Reduce dimensionality with PCA 2018
pca = PCA(n_components=2)
tfidf_pca_2018 = pca.fit_transform(tfidf_matrix_2018.toarray())

# Display explained variance ratio
pca.explained_variance_ratio_

In [None]:
# Scatter plot of PCA 2018 results
plt.figure(figsize=(8, 6))
plt.scatter(tfidf_pca_2018[:, 0], tfidf_pca_2018[:, 1], marker='o')

plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA of NYT Headlines (2018)")
plt.grid(True)

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()

This PCA plot of 2018 NYT headlines shows a similar pattern to 2017, with a dense cluster near the center, indicating that many headlines covered overlapping or repeated topics. The points appear slightly more spread out than in 2017, which may suggest that the 2018 headlines covered a slightly wider range of themes or events. Because PCA was fitted separately for each year, the axes are not identical across the two plots, so this comparison is only approximate.

In [None]:
# Reduce dimensionality with PCA 2019
pca = PCA(n_components=2)
tfidf_pca_2019 = pca.fit_transform(tfidf_matrix_2019.toarray())

# Display explained variance ratio
pca.explained_variance_ratio_

In [None]:
import matplotlib.pyplot as plt

# Scatter plot of PCA 2019 results
plt.figure(figsize=(8, 6))
plt.scatter(tfidf_pca_2019[:, 0], tfidf_pca_2019[:, 1], marker='o')

plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA of NYT Headlines (2019)")
plt.grid(True)

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()

This PCA plot of 2019 NYT headlines shows a wider spread than earlier years, with points beginning to branch out in different directions. That means there may have been more variety in the topics covered or the language used in headlines that year. There is still a dense cluster in the center, showing that many headlines shared similar themes—but the extended patterns suggest a shift toward broader or more distinct coverage areas.

In [None]:
# Reduce dimensionality with PCA 2020
pca = PCA(n_components=2)
tfidf_pca_2020 = pca.fit_transform(tfidf_matrix_2020.toarray())

# Display explained variance ratio
pca.explained_variance_ratio_

In [None]:
import matplotlib.pyplot as plt

# Scatter plot of PCA 2020 results
plt.figure(figsize=(8, 6))
plt.scatter(tfidf_pca_2020[:, 0], tfidf_pca_2020[:, 1], marker='o')

plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("PCA of NYT Headlines (2020)")
plt.grid(True)

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()

This PCA plot of 2020 New York Times headlines shows a dense, stretched-out pattern along the horizontal axis, which tells us that most headlines were closely focused around a few major themes. That makes sense—2020 was marked by huge global events like the COVID-19 pandemic and the U.S. presidential election. Because of that, the language in headlines was more repetitive, with similar words and topics showing up again and again. So while there were still differences between headlines, many were pulled into the same direction by the overwhelming focus on these historic moments.

## **7. Plotly Express**

Visualizing the First Four Principal Components by Year

In [None]:
import plotly.express as px
import pandas as pd
import re
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Define file paths
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

# Load and clean data
df_all_years = pd.DataFrame()
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x).lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# PCA
pca = PCA(n_components=4)
pca_result = pca.fit_transform(tfidf_matrix.toarray())

# Create PCA DataFrame
df_pca = pd.DataFrame(pca_result, columns=['PC1', 'PC2', 'PC3', 'PC4'])
df_pca['year'] = df_all_years['year'].values

# Reorder rows so 2020 is drawn first (bottom layer) and 2017 last (top layer)
year_order = [2020, 2019, 2018, 2017]
df_pca['year'] = pd.Categorical(df_pca['year'], categories=year_order, ordered=True)
df_pca = df_pca.sort_values('year')  # ascending sort follows year_order: 2020 ... 2017

# Plot scatter matrix
fig = px.scatter_matrix(
    df_pca,
    dimensions=['PC1', 'PC2', 'PC3', 'PC4'],
    color=df_pca['year'].astype(str),  # Keep original Plotly color mapping
    title="Pairwise PCA Component Scatter Plots (NYT Headlines 2017–2020)",
    labels={'year': 'Year'}
)

fig.update_traces(diagonal_visible=False)

# Add data source annotation
fig.add_annotation(
    text="Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)",
    xref="paper", yref="paper",
    x=0.5, y=-0.3,
    showarrow=False,
    font=dict(size=12, color="gray"),
    xanchor='center'
)

fig.show()


Visualizing the Principal Components with Explained-Variance Labels

In [None]:
import pandas as pd
import re
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Define file paths for datasets (update paths as needed)
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

# Create an empty DataFrame to store all years
df_all_years = pd.DataFrame()

# Process each year’s dataset
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x).lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# Convert headlines to a TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# Apply PCA to reduce dimensions to 4
pca = PCA(n_components=4)
components = pca.fit_transform(tfidf_matrix.toarray())

# Create labels with explained variance percentages
# (keys must match the DataFrame column names "PC1" ... "PC4" to take effect)
labels = {
    f"PC{i+1}": f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

# Create an interactive PCA scatter matrix plot
fig = px.scatter_matrix(
    pd.DataFrame(components, columns=[f"PC{i+1}" for i in range(4)]).assign(Year=df_all_years['year']),
    labels=labels,
    dimensions=[f"PC{i+1}" for i in range(4)],
    color=df_all_years['year'].astype(str),
    title="Pairwise PCA Component Scatter Plots (NYT Headlines 2017-2020)"
)

# Hide diagonal histograms for better visualization
fig.update_traces(diagonal_visible=False)

# Add data source
fig.add_annotation(
    text="Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)",
    xref="paper", yref="paper",
    x=0.5, y=-0.3,
    showarrow=False,
    font=dict(size=12, color="gray"),
    xanchor='center'
)

# Show the plot
fig.show()


Visualizing a Subset of the Principal Components

In [None]:
import pandas as pd
import re
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Define file paths for datasets
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

# Create an empty DataFrame to store all years
df_all_years = pd.DataFrame()

# Process each year’s dataset
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x).lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# Convert headlines to a TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# Apply PCA to reduce dimensions to 2
n_components = 2
pca = PCA(n_components=n_components)
components = pca.fit_transform(tfidf_matrix.toarray())

# Compute total explained variance
total_var = pca.explained_variance_ratio_.sum() * 100

# Create labels for PCA components (keys must match the DataFrame column names)
labels = {f"PC{i+1}": f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Year'

# Create an interactive PCA scatter matrix plot
# Year is cast to string so Plotly uses discrete colors rather than a continuous scale
fig = px.scatter_matrix(
    pd.DataFrame(components, columns=[f"PC{i+1}" for i in range(n_components)]).assign(Year=df_all_years['year'].astype(str)),
    color='Year',
    dimensions=[f"PC{i+1}" for i in range(n_components)],
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}% (NYT Headlines 2017-2020)'
)

# Hide diagonal histograms for better readability
fig.update_traces(diagonal_visible=False)

# Add data source
fig.add_annotation(
    text="Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)",
    xref="paper", yref="paper",
    x=0.5, y=-0.3,
    showarrow=False,
    font=dict(size=12, color="gray"),
    xanchor='center'
)

# Show the plot
fig.show()


An explained variance of 2.32% from PCA on TF-IDF data is **not wrong**. In fact, ChatGPT suggests that it is **expected** in natural language processing. TF-IDF creates a very high-dimensional, sparse matrix in which most words are rare and features are only weakly correlated. As a result, there is no dominant direction of variation for PCA to capture, and the variance is spread thinly across many components, so even the first two principal components account for only a small percentage of the total. **This does not make my visualization bad**; it can still reveal meaningful structure like topic clusters or yearly shifts. However, if the aim is to capture more of the variance or meaning, ChatGPT suggests that the approach can be improved with methods like Truncated SVD (which is better suited to sparse data) or dense embeddings like BERT, which capture semantic meaning more effectively.
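
As a rough illustration of the Truncated SVD alternative mentioned above, the sketch below applies scikit-learn's `TruncatedSVD` directly to a sparse TF-IDF matrix, with no `.toarray()` call. The headlines are made up and the variable names are assumptions. Note that Truncated SVD does not center the data, so its components are not identical to PCA's.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines for demonstration only
svd_headlines = [
    "senate votes on health care bill",
    "house passes health care bill",
    "team wins championship game",
    "coach praises team after game",
    "stocks fall as markets react",
    "markets rally as stocks rebound",
]

# Keep the TF-IDF matrix sparse
svd_sparse = TfidfVectorizer(stop_words="english").fit_transform(svd_headlines)

# Reduce to two components without converting to a dense array
svd_model = TruncatedSVD(n_components=2, random_state=0)
svd_points = svd_model.fit_transform(svd_sparse)

print(svd_points.shape)
print(f"Explained variance: {svd_model.explained_variance_ratio_.sum():.1%}")
```

Avoiding the dense conversion matters more as the corpus grows, since a dense matrix of all headlines by 1,000 features takes far more memory than its sparse form.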

## **8. 2D PCA Scatter Plot**

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize TF-IDF matrix before PCA
tfidf_matrix_array = tfidf_matrix.toarray()
tfidf_matrix_scaled = StandardScaler().fit_transform(tfidf_matrix_array)

# Apply PCA on standardized data
pca = PCA(n_components=2)
pca_components = pca.fit_transform(tfidf_matrix_scaled)


In [None]:
# Downsample every year to the size of the smallest year so each year is equally represented
min_sample_size = df_all_years['year'].value_counts().min()

df_balanced = (
    df_all_years.groupby('year')
    .sample(n=min_sample_size, random_state=42)
    .reset_index(drop=True)
)

In [None]:
import pandas as pd
import numpy as np
import re
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Load datasets for 2017, 2018, 2019, and 2020 (Update file paths)
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

# Create an empty DataFrame to store all years
df_all_years = pd.DataFrame()

# Process each year’s dataset
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# Convert headlines to TF-IDF representation
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# Perform PCA to reduce dimensions to 2
pca = PCA(n_components=2)
pca_components = pca.fit_transform(tfidf_matrix.toarray())

# Add PCA components to DataFrame
df_pca = pd.DataFrame(pca_components, columns=['PC1', 'PC2'])
df_pca['year'] = df_all_years['year'].values

# Create an interactive 2D PCA scatter plot for all years
fig = px.scatter(
    df_pca, x='PC1', y='PC2', 
    color=df_pca['year'].astype(str),
    opacity = 0.7,
    title="2D PCA Scatterplot of NYT Headlines (2017-2020)",
    labels={'PC1': 'PCA Component 1', 'PC2': 'PCA Component 2', 'year': 'Year'}
)

# Add data source
fig.add_annotation(
    text="Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)",
    xref="paper", yref="paper",
    x=0.5, y=-0.3,
    showarrow=False,
    font=dict(size=12, color="gray"),
    xanchor='center'
)

# Show the plot
fig.show()

# Save the interactive plot as an HTML
fig.write_html("2-D PCA.html")


This PCA scatterplot shows how NYT headlines from 2017 to 2020 are grouped based on patterns in their word usage. Each dot represents a headline, and similar headlines are positioned closer together. While there are some distinct clusters, most points from all four years tend to overlap, suggesting that the general themes and language used in the headlines remained fairly consistent over time. The purple dots from 2020 appear more dominant—not necessarily because the headlines were drastically different, but simply due to how many there were and the order in which they were plotted. This visualization gives a broad look at how headline topics evolved—or didn’t—across those four years.
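
Since overlapping markers make it hard to judge differences between years by eye, one hedged complement is to compare per-year centroids and spreads in PCA space. The coordinates below are hypothetical values rather than results from the NYT data; in the notebook the same `groupby` would presumably run on `df_pca`.

```python
import pandas as pd

# Hypothetical PCA coordinates, not real results from the NYT data
toy_pca_df = pd.DataFrame({
    "PC1": [0.10, 0.12, -0.05, -0.03],
    "PC2": [0.02, 0.00, 0.04, 0.06],
    "year": [2017, 2017, 2020, 2020],
})

# Mean position of each year's headlines in PCA space
year_centroids = toy_pca_df.groupby("year")[["PC1", "PC2"]].mean()

# Standard deviation per component as a simple measure of spread
year_spread = toy_pca_df.groupby("year")[["PC1", "PC2"]].std()

print(year_centroids)
print(year_spread)
```

Centroids that sit close together relative to the spread would support the reading that the years largely overlap.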

## **9. Interactive Dash App for PCA on NYT Headlines**

In [None]:
from dash import Dash, dcc, html, Input, Output
import pandas as pd
import re
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Load and process NYT datasets
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

df_all_years = pd.DataFrame()
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x).lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# Convert headlines to a TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# Initialize Dash app
app = Dash(__name__)

app.layout = html.Div([
    html.H4("Visualization of PCA's Explained Variance on NYT Headlines (2017-2020)"),
    dcc.Graph(id="graph"),
    html.P("Number of PCA Components:"),
    dcc.Slider(
        id='slider',
        min=2, max=6, value=3, step=1, 
        marks={i: str(i) for i in range(2, 7)}
    )
])

@app.callback(
    Output("graph", "figure"), 
    Input("slider", "value"))
def run_and_plot(n_components):
    # Perform PCA with selected components
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(tfidf_matrix.toarray())
    
    # Compute total explained variance
    var = pca.explained_variance_ratio_.sum() * 100

    # Define labels (keys must match the DataFrame column names to take effect)
    labels = {f"PC{i+1}": f"PC {i+1}" for i in range(n_components)}
    labels['color'] = 'Year'

    # Create interactive scatter matrix plot
    fig = px.scatter_matrix(
        pd.DataFrame(components, columns=[f"PC{i+1}" for i in range(n_components)]).assign(Year=df_all_years['year']),
        color='Year',
        dimensions=[f"PC{i+1}" for i in range(n_components)],
        labels=labels,
        title=f'Total Explained Variance: {var:.2f}% (NYT Headlines 2017-2020)'
    )

    fig.update_traces(diagonal_visible=False)
    
    return fig

# Run the Dash server (app.run replaces the deprecated app.run_server)
if __name__ == '__main__':
    app.run(debug=True)


The increase in explained variance from 2.32% to 3.21% simply reflects that three principal components were used instead of two, allowing slightly more of the underlying structure of the TF-IDF matrix to be captured. This is a normal and expected outcome when dealing with high-dimensional, sparse text data like TF-IDF, where each dimension (word) contributes only a small amount of variation. Even though 3.21% may seem low, ChatGPT suggests that it is typical in natural language processing and still useful for visualizing clusters or patterns in reduced dimensions. The low variance does not imply poor results; it reflects the nature of text data and the linear limitations of PCA. 
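
The way explained variance accumulates as components are added can be read off a single PCA fit, as sketched below. The random matrix is a hypothetical stand-in for the dense TF-IDF data, so the percentages it prints say nothing about the NYT headlines.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for a dense TF-IDF matrix (200 rows, 50 features)
rng = np.random.default_rng(0)
toy_matrix = rng.random((200, 50))

# Fit once with the largest number of components offered by the slider (6)
pca_demo = PCA(n_components=6, random_state=0).fit(toy_matrix)

# Cumulative share of variance captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_) * 100
for k in range(2, 7):
    print(f"{k} components: {cumulative[k - 1]:.2f}%")

# Cross-check: a separate 3-component fit gives the same total as the cumulative sum
three_only = PCA(n_components=3, random_state=0).fit(toy_matrix)
print(f"3-component fit: {three_only.explained_variance_ratio_.sum() * 100:.2f}%")
```

Each added component can only increase the total, which is why moving the slider to the right always raises the percentage in the plot title.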

## **10. Visualize PCA with px.scatter_3d**

In [None]:
import re
import plotly.express as px
from sklearn.decomposition import PCA

# Define file paths for datasets (update paths as needed)
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

# Create an empty DataFrame to store all years
df_all_years = pd.DataFrame()

# Process each year’s dataset
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x).lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# Apply PCA to reduce dimensions to 3 for 3D visualization
pca = PCA(n_components=3)
components = pca.fit_transform(tfidf_matrix.toarray())

# Compute total explained variance
total_var = pca.explained_variance_ratio_.sum() * 100

# Create an interactive 3D scatter plot
fig = px.scatter_3d(
    # Store the year as a string so Plotly colors it as a category and titles the legend "Year"
    pd.DataFrame(components, columns=['PC1', 'PC2', 'PC3']).assign(Year=df_all_years['year'].astype(str)),
    x='PC1', y='PC2', z='PC3',
    color='Year',
    title=f'Total Explained Variance: {total_var:.2f}% (NYT Headlines 2017-2020)',
    labels={'PC1': 'PC 1', 'PC2': 'PC 2', 'PC3': 'PC 3'}
)

# Add data source
fig.add_annotation(
    text="Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)",
    xref="paper", yref="paper",
    x=0.5, y=-0.3,
    showarrow=False,
    font=dict(size=12, color="gray"),
    xanchor='center'
)

# Show the plot
fig.show()


This 3D scatter plot shows how New York Times headlines from 2017 to 2020 are distributed after being transformed into a reduced-dimensional space using PCA. Each dot represents a headline, positioned based on three principal components (PC1, PC2, PC3), and colored by year. The fact that the points overlap heavily and cluster near the center reflects that the differences in headline content across years are subtle when reduced to three dimensions. The total explained variance of only 3.21% indicates that the first three principal components capture just a small portion of the overall information in the original high-dimensional TF-IDF data, which is expected with text data. While the visualization does not clearly separate years, it still helps illustrate the overall textual similarity across this four-year period.

## **10. Clustering with K-Means**

This section was written entirely with the help of ChatGPT, for the purpose of experimentation. ChatGPT helps streamline complex explanations of K-Means and provides optimized code that follows best practices for combining PCA with clustering.

In [None]:
# Use the Elbow Method to find the best K
import pandas as pd
import re
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Define file paths for datasets (update paths as needed)
file_paths = {
    2017: "new_york_times_stories_2017.csv",
    2018: "new_york_times_stories_2018.csv",
    2019: "new_york_times_stories_2019.csv",
    2020: "new_york_times_stories_2020.csv"
}

# Create an empty DataFrame to store all years
df_all_years = pd.DataFrame()

# Process each year’s dataset
for year, path in file_paths.items():
    df = pd.read_csv(path, usecols=['headline']).dropna()
    df['year'] = year
    df['cleaned_headline'] = df['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x).lower()))
    df_all_years = pd.concat([df_all_years, df], ignore_index=True)

# Convert headlines to a TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# Apply PCA to reduce dimensions to 3 for clustering
pca = PCA(n_components=3)
pca_result = pca.fit_transform(tfidf_matrix.toarray())

# Use the Elbow Method to determine optimal K
inertia = []
k_values = range(2, 10)

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(pca_result)
    inertia.append(km.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(k_values, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.title('Elbow Method for Optimal K (NYT Headlines 2017-2020)')

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()


ChatGPT explanation credit: 
- The result of this plot represents the Elbow Method, a common technique used to determine the optimal number of clusters (K) for K-Means clustering. In this case, it is applied to the PCA-reduced TF-IDF representation of New York Times headlines from 2017 to 2020. The y-axis shows the inertia, which measures how tightly grouped the data points are within each cluster — lower values indicate better fit. As K increases, inertia naturally decreases, but the rate of improvement slows after a certain point. The “elbow” of the curve (a noticeable bend) suggests the optimal number of clusters where adding more clusters yields diminishing returns. Identifying that elbow allows us to choose a K that balances performance with simplicity. For NYT headlines, this helps discover the number of meaningful topic groupings that naturally emerge in the data without overfitting.
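
To make the definition of inertia concrete, here is a small sketch in plain Python. The points, the centroids, and the helper name `compute_inertia` are made up for illustration. scikit-learn computes the same quantity for a fitted `KMeans` model and exposes it as `inertia_`.

```python
def compute_inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared distance from p to every centroid; keep the smallest one
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total

# Toy 2-D points forming two tight groups (illustrative values only)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

one_centroid = [(5.0, 5.5)]                  # K = 1: the overall mean
two_centroids = [(0.0, 0.5), (10.0, 10.5)]   # K = 2: one mean per group

print(compute_inertia(points, one_centroid))   # large: points are far from the single center
print(compute_inertia(points, two_centroids))  # small: each point is close to its own center
```

Going from one centroid to two removes most of the inertia here because the toy data really has two groups. Adding a third centroid would lower it only slightly, which is the flattening that the elbow plot looks for.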

Finding: 
- The sharp drop in inertia between K=2 and K=4 suggests that adding more clusters significantly improves the model up to that point. After K=4, the curve begins to flatten, indicating diminishing returns. This "elbow" at K=4 suggests that 4 clusters is a reasonable choice for capturing the main patterns in the headlines without overfitting. Note that the clustering cells below set the number of clusters to 5, one more than the elbow suggests.
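
Reading the elbow off a plot is subjective, so it can help to cross-check it numerically. The sketch below is one simple heuristic, not the method used in this project: it rescales the curve to the unit square and picks the K that sits farthest below the straight line joining the first and last points. The function name `find_elbow` and the inertia values are made up for illustration.

```python
def find_elbow(k_values, inertia_values):
    """Return the K whose point lies farthest below the chord joining the
    first and last points of the (K, inertia) curve.

    Assumes inertia decreases overall, so the first value exceeds the last."""
    k_min, k_max = k_values[0], k_values[-1]
    i_max, i_min = inertia_values[0], inertia_values[-1]

    best_k, best_gap = k_values[0], 0.0
    for k, value in zip(k_values, inertia_values):
        # Rescale both axes to [0, 1] so neither dominates the comparison
        x = (k - k_min) / (k_max - k_min)
        y = (value - i_min) / (i_max - i_min)
        # On the rescaled axes the chord is y = 1 - x; the gap measures
        # how far the curve sags below it at this K
        gap = (1 - x) - y
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Illustrative inertia values (NOT the project's results)
k_values = list(range(2, 10))
inertia_values = [1000, 700, 450, 400, 360, 330, 305, 285]

print(find_elbow(k_values, inertia_values))
```

With these made-up values the function returns 4, because the curve drops steeply up to K=4 and flattens afterwards. On real data the heuristic can disagree with a visual reading, so it is best treated as a second opinion.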

In [None]:
# Apply K-Means clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
df_all_years['cluster'] = kmeans.fit_predict(pca_result)

# Display sample headlines with assigned clusters
df_all_years[['headline', 'cluster']].head(20)

ChatGPT explanation credit:
- In K-Means clustering, cluster 0 does not have an inherent meaning; it is simply one of the group labels assigned by the algorithm. For the NYT headlines data, cluster 0 represents a group of headlines that sit close together in the PCA-reduced TF-IDF space, which usually reflects similar language patterns or topics. To interpret what cluster 0 actually means, it is necessary to explore the most common words or themes within that group, for example by viewing sample headlines or generating a word cloud.

What does it mean?
- For example, headlines assigned to cluster 0 might share a general theme (e.g., sports, obituaries, or daily summaries), while cluster 3 might capture political or technology-related stories. This clustering step is useful for uncovering hidden patterns or dominant topics in the headlines without needing to pre-label or categorize the data manually. It is especially valuable for exploring large-scale news datasets to understand how themes evolve over time.
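
One quick way to explore what a cluster contains is to list its most frequent words. This is a minimal sketch using only the standard library. The toy headlines, the labels, and the helper name `top_terms_per_cluster` are hypothetical. In the notebook, the inputs would be the `cleaned_headline` and `cluster` columns.

```python
from collections import Counter, defaultdict

def top_terms_per_cluster(headlines, labels, n=3):
    """Return the n most frequent words among the headlines of each cluster."""
    counts = defaultdict(Counter)
    for text, label in zip(headlines, labels):
        counts[label].update(text.split())  # whitespace tokenization only
    return {label: [word for word, _ in counter.most_common(n)]
            for label, counter in sorted(counts.items())}

# Toy cleaned headlines and cluster labels (hypothetical, for illustration)
headlines = [
    "senate vote delayed",
    "senate passes budget",
    "yankees win opener",
    "yankees lose finale",
]
labels = [0, 0, 1, 1]

for label, words in top_terms_per_cluster(headlines, labels).items():
    print(label, words)
```

This sketch does not remove stop words, so on real headlines words like "the" and "of" would likely dominate every cluster unless they are filtered out first.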


## **11. Cluster Visualization in PCA Space**

In [None]:
import seaborn as sns

# Convert headlines to a TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all_years['cleaned_headline'])

# Apply PCA to reduce dimensions to 2 for visualization
pca = PCA(n_components=2)
tfidf_pca = pca.fit_transform(tfidf_matrix.toarray())

# Apply K-Means Clustering with the optimal K (e.g., 5 clusters)
optimal_k = 5  # This should be determined using the Elbow Method
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df_all_years['cluster'] = kmeans.fit_predict(tfidf_pca)

# Scatter plot of clusters in PCA-reduced space
plt.figure(figsize=(10, 6))
sns.scatterplot(
    x=tfidf_pca[:, 0], 
    y=tfidf_pca[:, 1], 
    hue=df_all_years['cluster'], 
    palette='viridis', 
    alpha=0.7
)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering of NYT Headlines (PCA-reduced space)')
plt.legend(title='Cluster')

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()


## **12. Topic Clustering with K-Means**

In [None]:
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the four yearly DataFrames into one
df_all = pd.concat([df_2017, df_2018, df_2019, df_2020], ignore_index=True)

# Preprocess text (lowercase and remove special characters)
df_all['cleaned_headline'] = df_all['headline'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x.lower()))

# Convert headlines to TF-IDF representation
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df_all['cleaned_headline'])

# Display TF-IDF matrix shape
tfidf_matrix.shape

Finding: The shape (228006, 1000) means that the dataset contains 228,006 headlines, each represented by a vector of 1,000 TF-IDF scores. Because `max_features=1000`, the columns correspond to the 1,000 most frequent terms (after English stop-word removal) across all headlines from 2017 to 2020.
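
To show where the two numbers in the shape come from, here is a simplified TF-IDF sketch in plain Python. It is not scikit-learn's implementation: tokenization is a plain whitespace split and stop words are not removed. The IDF line follows the smoothed formula described in scikit-learn's documentation. The function name `tfidf_matrix_sketch` and the toy headlines are made up for illustration.

```python
import math
from collections import Counter

def tfidf_matrix_sketch(docs, max_features):
    """Build a dense TF-IDF matrix with one row per document and one
    column per kept term (the max_features most frequent terms)."""
    tokenized = [doc.split() for doc in docs]

    # Keep the terms with the highest total count across the corpus
    corpus_counts = Counter(tok for tokens in tokenized for tok in tokens)
    vocab = sorted(term for term, _ in corpus_counts.most_common(max_features))

    # Smoothed inverse document frequency: rarer terms get larger weights
    n_docs = len(docs)
    doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])              # L2-normalize each row
    return rows, vocab

# Toy headlines (hypothetical, for illustration)
docs = ["trump signs order", "trump meets putin", "storm hits florida coast"]
matrix, vocab = tfidf_matrix_sketch(docs, max_features=5)

print(len(matrix), len(matrix[0]))  # rows = documents, columns = kept terms
```

The number of rows always equals the number of headlines, and the number of columns is capped by `max_features`, no matter how many distinct words the corpus contains.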

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce dimensionality with PCA
pca = PCA(n_components=2)
tfidf_pca = pca.fit_transform(tfidf_matrix.toarray())

# Apply K-Means clustering
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
df_all['cluster'] = kmeans.fit_predict(tfidf_pca)

# Display sample headlines with assigned clusters
df_all[['headline', 'year', 'cluster']].head(20)

## **13. Topic Distribution Over Time**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Filter for only valid years
df_all = df_all[df_all['year'].isin([2017, 2018, 2019, 2020])]

# Count plot of clusters by year
plt.figure(figsize=(10, 6))
sns.countplot(x=df_all['cluster'], hue=df_all['year'], palette='viridis')
plt.xlabel('Cluster')
plt.ylabel('Number of Headlines')
plt.title('Distribution of NYT Headline Topics from 2017 to 2020')
plt.legend(title='Year')

plt.figtext(
    0.5, -0.05,  
    "Source: Three Decades of New York Times Headlines by Jack Bundy (Kaggle)", 
    fontsize=10, color="gray", ha='center'
)

plt.show()


This bar chart breaks down how NYT headlines from 2017 to 2020 were grouped into different clusters based on topic similarity. Cluster 0 dominates by a large margin each year, which means most headlines likely fell into a broad or general category. The other clusters had significantly fewer headlines, suggesting they represent more specific or less frequently covered topics. Interestingly, the shape of the distribution looks pretty consistent across the years, which could mean that the main types of news stories did not change much over time—even if the details did.
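
Raw counts depend on how many headlines each year contains, so comparing the share of each cluster within a year is a more direct check of whether the distribution really stays the same. The sketch below uses only the standard library. The toy labels and the helper name `cluster_shares_by_year` are hypothetical. In the notebook, the inputs would be the `year` and `cluster` columns of `df_all`.

```python
from collections import Counter, defaultdict

def cluster_shares_by_year(years, clusters):
    """For each year, return the fraction of headlines that fall in each cluster."""
    counts = defaultdict(Counter)
    for year, cluster in zip(years, clusters):
        counts[year][cluster] += 1
    shares = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[year] = {c: counter[c] / total for c in sorted(counter)}
    return shares

# Toy labels (hypothetical, not the project's data)
years    = [2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018]
clusters = [0,    0,    0,    1,    0,    0,    1,    1]

for year, row in cluster_shares_by_year(years, clusters).items():
    print(year, row)
```

If the shares for each cluster are close from year to year, that supports the reading that the mix of headline types stayed stable. Large differences would suggest a shift in coverage.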

## **14. Word Cloud per Cluster**

In [None]:
# Credit: All code below was produced with ChatGPT as an experiment.

from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Copy to keep original intact
df_all = df_all_years.copy()

# Fill any missing cleaned headlines
df_all['cleaned_headline'] = df_all['cleaned_headline'].fillna('')

# Make sure 'cluster' column exists (if not already assigned)
# Assumes pca_result has been computed earlier and matches df_all shape
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
df_all['cluster'] = kmeans.fit_predict(pca_result)

# Generate a word cloud for each cluster (0–4)
for cluster in range(num_clusters):
    cluster_data = df_all[df_all['cluster'] == cluster]

    if cluster_data.empty:
        print(f"No data for cluster {cluster}, skipping...")
        continue

    text = " ".join(cluster_data['cleaned_headline'].dropna().tolist())

    if not text.strip():
        print(f"No valid text for cluster {cluster}, skipping word cloud...")
        continue

    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Word Cloud for Cluster {cluster}")
    plt.show()


## **15. Word Cloud per Year**

In [None]:
# Credit: All code below was produced with ChatGPT as an experiment.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Work on a copy so that df_all_years stays intact
df_all = df_all_years.copy()

# Ensure cleaned_headline exists and is filled
df_all['cleaned_headline'] = df_all['cleaned_headline'].fillna('')

# Generate Word Cloud per year
for year in [2017, 2018, 2019, 2020]:
    # Filter only rows from the year
    year_data = df_all[df_all['year'] == year]

    if year_data.empty:
        print(f"No headlines for {year}, skipping...")
        continue

    # Join cleaned headlines into a single string
    text = " ".join(year_data['cleaned_headline'].dropna().tolist())

    # Skip empty text
    if not text.strip():
        print(f"No valid text for {year}, skipping word cloud...")
        continue

    # Create the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Word Cloud for {year}")
    plt.show()


## **16. Outlook & Future Improvements**

### Limitations of the Current Approach

My project explores New York Times headlines by turning them into numbers using TF-IDF and then visualizing patterns through PCA and clustering. TF-IDF helps us see which words are important in each headline, but it does not understand the meaning behind them. So, two headlines about the same event but using different words might end up looking completely unrelated in the data.
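
The vocabulary problem can be seen with a small example. This is a minimal sketch using raw word counts rather than TF-IDF weights. The headlines and the helper name `cosine_similarity` are made up for illustration. TF-IDF rescales the counts but does not change which words two headlines share, so headlines with no words in common still get a similarity of zero.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)  # only shared words contribute
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical headlines: the first two describe the same kind of event
same_event_a = "lawmakers approve spending plan"
same_event_b = "congress passes budget bill"
shared_words = "congress passes spending plan"

print(cosine_similarity(same_event_a, same_event_b))  # no shared words
print(cosine_similarity(same_event_a, shared_words))  # two shared words
```

The first pair scores 0.0 even though a reader would say the headlines mean nearly the same thing, while the second pair scores 0.5 purely because two words overlap.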

PCA simplifies the data so we can plot it, but it only keeps a tiny slice of the full picture: the first two or three components explain just 2 to 3 percent of the variance. That means the charts give us a rough sketch, not a detailed map. K-Means clustering groups headlines into categories, but it works best when clusters are well separated, roughly round, and similar in size, which is not how real language works. Plus, headlines are short and often vague, so it is tough to capture what they are really about just from the words they use.

### Potential Improvements and Future Directions

Hypothetically, there are a few ways I could make this analysis more meaningful if I had more advanced data visualization skills. Instead of using TF-IDF, which only weighs how often words appear, ChatGPT suggests that I could switch to contextual language models like BERT. These models represent each word in the context of the words around it, so headlines with similar meanings would be more likely to be grouped together, even if they use different vocabulary. For visualization, nonlinear methods like t-SNE or UMAP might do a better job than PCA at showing how headlines relate to each other.

Realistically, if more time were allocated to planning my project, I could improve things by including more than just the headline text. For example, knowing what section the article came from, when it was published, or who wrote it could help us track how topics evolved over time or varied between sections. And instead of basic clustering, I could use topic modeling tools to uncover deeper themes in collections of text. This could lead to more insightful, easier-to-understand topic groups that better reflect how readers actually make sense of the news.
