#Used a Colab notebook to consolidate weekly highlights (i.e. recurring themes) using machine learning.

#In line with https://www.sigmacomputing.com/blog/tf-idf-definition, I selected Term Frequency-Inverse Document Frequency (TF-IDF) to identify recurring themes. TF-IDF weights a term by how often it appears in one article and how rare it is across all articles, so articles that share distinctive phrases (i.e. recurring stories/news reports) score as similar.

#Output: weekly_headlines.csv file

In [1]:
#Cell 1: Upload and validate space_records.csv data, which was pulled from VS Code / Python.

import pandas as pd
from google.colab import files

uploaded = files.upload()  # upload space_records.csv

df = pd.read_csv("space_records.csv") #load the aggregated data (generated via VS Code)

#Clean data: add any missing text column as empty so the steps below do not fail
for col in ("title", "summary"):
    if col not in df.columns:
        df[col] = ""
df["published_date"] = pd.to_datetime(df.get("published_date"), errors="coerce", utc=True)
df["title"] = df["title"].fillna("").astype(str)
df["summary"] = df["summary"].fillna("").astype(str)

df = df.dropna(subset=["published_date"]) #keep rows that can be interpreted
df = df[df["title"].str.len() > 0].copy()

print(f"Loaded {len(df):,} rows from space_records.csv")
print("Date range:", df["published_date"].min(), "→", df["published_date"].max())

display(df[["published_date","source_api","source","event_type","title"]].head(5))

Saving space_records.csv to space_records.csv
Loaded 706 rows from space_records.csv

Unnamed: 0,published_date,source_api,source,event_type,title
0,2026-01-05 16:40:28+00:00,spaceflight_news,Spaceflight News API,launch,Scientific Balloon Begins Antarctic Ascent
1,2026-01-05 16:37:51+00:00,spaceflight_news,Spaceflight News API,launch,A Rare Orbital Electron from Wallops
2,2026-01-05 16:09:08+00:00,spaceflight_news,Spaceflight News API,launch,ARCHE ORBITAL SYSTEMS Signs Strategic MoU with...
3,2026-01-05 14:00:00+00:00,spaceflight_news,Spaceflight News API,security_event,No more free rides: it’s time to pay for space...
4,2026-01-05 12:56:48+00:00,spaceflight_news,Spaceflight News API,launch,L3Harris to sell majority stake in space propu...
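
#A small sketch of what the date cleaning in Cell 1 does, using a made-up three-row frame (the rows are assumptions for illustration). With errors="coerce", values that cannot be parsed become NaT instead of raising, and dropna then removes them.

```python
import pandas as pd

# Hypothetical rows: one valid timestamp, one unparseable string, one missing value
toy = pd.DataFrame({
    "published_date": ["2026-01-05T16:40:28Z", "not a date", None],
    "title": ["Valid row", "Unparseable date", "Missing date"],
})

# Unparseable or missing dates become NaT rather than raising an error
toy["published_date"] = pd.to_datetime(toy["published_date"], errors="coerce", utc=True)

# Keep only rows whose date could be interpreted
clean = toy.dropna(subset=["published_date"])

print(f"Kept {len(clean)} of {len(toy)} rows")
print(clean)
```

#Comparing the row count before and after this step is a quick way to see how many records were dropped for bad dates.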


In [2]:
#Cell 2: Define parameters (i.e. the reporting period for recent highlights). The window ends at the most recent article in the data, not at today's date.

DAYS = 7

period_end = df["published_date"].max()
period_start = period_end - pd.Timedelta(days=DAYS)

df_period = (
    df[(df["published_date"] >= period_start) & (df["published_date"] <= period_end)]
    .copy()
    .sort_values("published_date", ascending=False)
)

print(f"Period: {period_start.date()} → {period_end.date()}  |  Days: {DAYS}")
print(f"Rows in period: {len(df_period):,}")

display(df_period[["published_date","source","event_type","title"]].head(10))


Period: 2025-12-29 → 2026-01-05  |  Days: 7
Rows in period: 211


Unnamed: 0,published_date,source,event_type,title
0,2026-01-05 16:40:28+00:00,Spaceflight News API,launch,Scientific Balloon Begins Antarctic Ascent
1,2026-01-05 16:37:51+00:00,Spaceflight News API,launch,A Rare Orbital Electron from Wallops
591,2026-01-05 16:32:37+00:00,Google News,launch,Synthetic-aperture radar satellite for Earth o...
2,2026-01-05 16:09:08+00:00,Spaceflight News API,launch,ARCHE ORBITAL SYSTEMS Signs Strategic MoU with...
612,2026-01-05 15:33:50+00:00,Google News,launch,"Dublin, Ohio Launches 'Safe Space Program' to ..."
640,2026-01-05 14:27:00+00:00,Google News,policy_or_corporate,Network-modernization contract for U.S. Space ...
672,2026-01-05 14:12:01+00:00,Google News,launch,Brown University police chief placed on leave ...
3,2026-01-05 14:00:00+00:00,Spaceflight News API,security_event,No more free rides: it’s time to pay for space...
700,2026-01-05 13:26:48+00:00,Google News,launch,Rocket Lab (RKLB) Valuation Check After 54.9% ...
4,2026-01-05 12:56:48+00:00,Spaceflight News API,launch,L3Harris to sell majority stake in space propu...
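
#A sketch of the window logic above on made-up dates (assumed values, for illustration). It uses Series.between, which is inclusive on both ends by default, so an article published exactly at period_start is kept.

```python
import pandas as pd

DAYS = 7

# Hypothetical publication dates: latest record, exactly DAYS earlier, one day too old
toy = pd.DataFrame({
    "published_date": pd.to_datetime(
        ["2026-01-05", "2025-12-29", "2025-12-28"], utc=True
    )
})

period_end = toy["published_date"].max()            # anchored on the newest record
period_start = period_end - pd.Timedelta(days=DAYS)

# between() includes both boundaries, matching the >= / <= filter
in_window = toy[toy["published_date"].between(period_start, period_end)]

print(in_window)
```

#Because both boundaries are included, the window can touch DAYS + 1 calendar dates, which is why the printed period above runs from 2025-12-29 to 2026-01-05.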


In [3]:
# Cell 3 — Group similar articles into "story clusters"

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

SIM_THRESHOLD = 0.50  #higher = stricter grouping (try 0.58–0.68)

df_period["ml_text"] = (df_period["title"].fillna("") + " " + df_period["summary"].fillna("")).str.strip() #one text field per article (title + summary)

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(df_period["ml_text"]) #text to TF-IDF

S = cosine_similarity(X)
D = 1 - S  #cosine distance = 1 - cosine similarity (the clusterer below expects distances)

clusterer = AgglomerativeClustering(
    metric="precomputed",
    linkage="average",
    distance_threshold=1 - SIM_THRESHOLD,
    n_clusters=None
)

df_period["story_cluster_id"] = clusterer.fit_predict(D)

print("Articles:", len(df_period))
print("Story clusters:", df_period["story_cluster_id"].nunique())
df_period["story_cluster_id"].value_counts().head(10)

Articles: 211
Story clusters: 190


story_cluster_id,count
10,5
2,3
20,3
9,3
4,2
7,2
6,2
5,2
1,2
0,2


In [4]:
# Cell 4 — One representative headline per cluster (simple)

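# Pick the most recent article in each cluster as the "headline".
# drop_duplicates keeps whole rows; groupby(...).first() would instead take the
# first non-null value per column and could mix fields from different articles.
headlines = (
    df_period.sort_values("published_date", ascending=False)
    .drop_duplicates("story_cluster_id")
    .sort_values("story_cluster_id")
    .reset_index(drop=True)
)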

# Add cluster size (how many articles were grouped into that story)
cluster_sizes = df_period["story_cluster_id"].value_counts().rename("article_count")
headlines = headlines.merge(cluster_sizes, left_on="story_cluster_id", right_index=True)

# Keep only the columns we need
headlines_out = headlines[[
    "story_cluster_id",
    "published_date",
    "source",
    "title",
    "raw_source",
    "event_type",
    "is_security_related",
    "article_count"
]].rename(columns={
    "published_date": "published_max",
    "title": "rep_title",
    "raw_source": "rep_url"
}).sort_values(["article_count","published_max"], ascending=False)

print("Headlines rows:", len(headlines_out))
display(headlines_out.head(30))

headlines_out.to_csv("weekly_headlines.csv", index=False)
print("Saved weekly_headlines.csv")


Headlines rows: 190


Unnamed: 0,story_cluster_id,published_max,source,rep_title,rep_url,event_type,is_security_related,article_count
10,10,2026-01-05 11:36:53+00:00,Google News,The Third COSMO-SkyMed second generation satel...,https://news.google.com/rss/articles/CBMisgFBV...,launch,False,5
2,2,2026-01-05 05:37:37+00:00,Spaceflight News API,Terran Orbital to build satellite buses for SD...,35106,security_event,True,3
20,20,2026-01-02 18:26:59+00:00,Google News,Starlink to lower satellite orbit to enhance s...,https://news.google.com/rss/articles/CBMisAFBV...,satellite_deployment,False,3
9,9,2026-01-02 16:54:56+00:00,Google News,Cyberattack impacts European Space Agency’s ex...,https://news.google.com/rss/articles/CBMikwFBV...,security_event,True,3
4,4,2026-01-05 10:00:00+00:00,Spaceflight News API,Space Rider orbital ballet,35114,launch,False,2
7,7,2026-01-05 08:27:19+00:00,Google News,Daily Report - Air & Space Forces Magazine,https://news.google.com/rss/articles/CBMiZ0FVX...,policy_or_corporate,False,2
14,14,2026-01-04 05:36:19+00:00,Google News,Space Force begins base network overhaul as cy...,https://news.google.com/rss/articles/CBMimAFBV...,policy_or_corporate,False,2
29,29,2026-01-03 22:06:02+00:00,Google News,New Spanish communications satellite suffers ‘...,https://news.google.com/rss/articles/CBMikwFBV...,satellite_deployment,False,2
30,30,2026-01-02 20:24:39+00:00,Google News,Space Force Year in Photos - vandenberg.spacef...,https://news.google.com/rss/articles/CBMipAFBV...,policy_or_corporate,False,2
8,8,2026-01-02 17:33:17+00:00,Google News,This huge rocket will soon blast NASA astronau...,https://news.google.com/rss/articles/CBMi6AFBV...,launch,False,2


Saved weekly_headlines.csv
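
#A sketch of one thing worth checking when picking a representative row per cluster, using a made-up two-article cluster (values are assumptions for illustration). In pandas, groupby(...).first() returns the first non-null value of each column, so a missing field in the newest article can be silently filled from an older one; drop_duplicates keeps the newest row intact.

```python
import pandas as pd

# Hypothetical cluster: the newer article has no raw_source
toy = pd.DataFrame({
    "story_cluster_id": [0, 0],
    "published_date": pd.to_datetime(["2026-01-05", "2026-01-02"], utc=True),
    "title": ["Newer article", "Older article"],
    "raw_source": [None, "https://example.com/older"],
})

ordered = toy.sort_values("published_date", ascending=False)

# first(): per-column first non-null value within each group
via_first = ordered.groupby("story_cluster_id", as_index=False).first()

# drop_duplicates(): keeps the first whole row per cluster
via_rows = ordered.drop_duplicates("story_cluster_id")

print(via_first[["title", "raw_source"]])
print(via_rows[["title", "raw_source"]])
```

#Here the first() result pairs the newer title with the older article's link, while the drop_duplicates result keeps the newer title with its own (missing) link.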
