## Cardiff NLP Hackathon 2025 - Starter Code

Welcome to Cardiff NLP's second hackathon! Below is some code to get started on the AMPLYFI API and look at some data.

====================

Note: the API is a real time resource so extra points to projects that can treat it as a continual data stream rather than a one-off data source!

Another thing to note about this is that it will affect Amplyfi's servers if you download a silly amount of data. We ask that you only request 100 results per request, but if you have the data you need, try to download it or store it as a variable rather than requesting the exact same data over and over again.

In [2]:
# Import some libraries

import requests
import json
import nltk
import re
import pandas as pd
from collections import defaultdict

from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

Amplyfi have provided some limits and explanations of what you can query the API for below:

`query_text` anything

`result_size` <=100

`include_highlights` boolean (if True, you get sentences matching keyphrases in the query)

`include_smart_tags` boolean (if True, you get back metadata from our "smart" market intel classifiers - signals and industries)

`ai_answer` can only accept "basic", this will take the 5 most relevant docs and answer the query_text based on them




In [7]:
# API endpoint from the newly deployed service

API_URL = "https://zfgp45ih7i.execute-api.eu-west-1.amazonaws.com/sandbox/api/search"
API_KEY = "XYZ38746G38B7RB46GBER"

headers = {
    "Content-Type": "application/json",
    "x-api-key": API_KEY
}

# Edit the below to get different data
payload = {
  "query_text": "what is happening with riots in Los Angeles?",
  "result_size": 100,
  "include_highlights":True,
  "include_smart_tags":True,

}

response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
json_response = response.json()

json_response


{'results': [{'id': 'DL-91a8a3bb8d9240f9471bcd781b9f3bf3',
   'title': 'Los Angeles riots',
   'url': 'https://my-h5news.app.xinhuanet.com/h5/specialTopic/index.html?articleId=024a90db17da4691868e51bd10420f78',
   'summary': 'Los Angeles police and the National Guard clashed with protesters in downtown Los Angeles, California. Roads were blocked, cars were burned, law enforcement used tear gas and lightning. riots in Angel City are spreading to other immigrant-populated cities such as New York City.',
   'score': 395.3676,
   'timestamp': '2025-06-10T05:22:00+00:00',
   'smart_tags': {'industries': ['Media']},
   'highlights': ['unrest, the National Guard and Los Angeles police clashed with protesters, and riots in Angel City are spreading to other immigrant-populated cities such as New York City.',
    'Los Angeles police and the National Guard clashed violently with protesters in Los Angeles on August 8, causing chaos in downtown Los Angeles, the second-largest city in the United Sta

In [8]:
grouped_results = defaultdict(list)

for item in json_response['results']:

  if not item["timestamp"]:
    continue
  else:
    date = item["timestamp"].split("T")[0]

  grouped_results[date].append(item["title"])

grouped_results = dict(grouped_results)

sorted_group_counts = {date: len(grouped_results[date]) for date in sorted(grouped_results)}

#print(json.dumps(sorted_group_counts, indent=4))

print(json.dumps(grouped_results, indent=4))



{
    "2025-06-10": [
        "Los Angeles riots",
        "Los Angeles riots and police careers | Marines in action Newsom vs. Trump | He's a dictator",
        "Keys to Understanding Mass Protests Against Migrant Riots in Los Angeles",
        "Suspect who threw cinderblocks at ICE vehicles during Los Angeles riots has been identified",
        "Los Angeles protesters light Waymo cabs on fire",
        "Trump sends Marines, National Guard reinforcements to fight riots in Los Angeles",
        "Trump justifies troop deployment in Los Angeles over riots",
        "Marines strengthen security in Los Angeles in the face of riots",
        "In Australia condemned the incident during the riots in Los Angeles",
        "Trump deploys another 2,000 National Guard for Los Angeles protests",
        "Trump deploys thousands of Marines and guards in Los Angeles amid riots against riots",
        "About 700 US Marines being mobilised in response to Los Angeles protests",
        "About 700 US Ma

In [5]:
from transformers import pipeline

pipe = pipeline("summarization", model="facebook/bart-large-cnn", max_length=30)

daily_summaries = []

query_results = json.dumps(grouped_results, indent=4)
#print(grouped_results)

for day in grouped_results:
  to_summarize = ""
  summaries = grouped_results[day] #['summary1', 'summary2'...]
  for summary in summaries:
    to_summarize += summary
  print(to_summarize)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Device set to use cpu


Los Angeles Consulting Group June Events 2025Big Bear eaglet update: Sunny returns home from her first flightCinespia’s big July line-up opens with ‘Top Gun’Meghan Markle to make major appearance within days at glitzy gala in LALos Angeles County celebrates Pride Month with raising of Progress Pride FlagDonald Trump inflates tariffs, global growth stagnatesThe BriefSchool aide arrested for possessing, distributing child pornJames Beard finalist for Outstanding Professional in Beverage ServiceDisney Lays Off Hundreds Across Film and TV Divisions Amid Ongoing Cost-Cutting PushWashington father wanted for murder after 3 daughters found dead: PoliceTrevor Bauer accuser ordered to pay $310KTechcombank Expands Overseas Talent Roadshow 2025 To Europe Following U.S. SuccessNvidia tops Microsoft, regains most valuable company title for first time since January
European stocks mixed as trade tensions and U.S. economy remain in focusImpact on World Cup, OlympicsMillions of Californians who rely o

In [None]:
# summaries_to_summarize = []

# for day in grouped_results: #[list of strings]
#   summaries = grouped_results[day]
#   to_summarize = ""
#   for summary in summaries:
#     to_summarize += summary


#   """
#   for summary in summaries:
#     to_summarize += summary
#   print(to_summarize)
#   """
# print(to_summarize)

Swedish activist Greta Thunberg is attempting to break the Gaza blockade. The Madleen left Italy on June 1 with the aim of delivering aid. Aboard the boat are nationals of Germany, France, Brazil, Turkey, Sweden and Spain.Israel's defence minister vows to prevent aid boat carrying Greta Thunberg and other activists from reaching the Gaza Strip. The Madleen, operated by the Freedom Flotilla Coalition, departed Sicily on June 1 on a mission that aims to break the sea blockade of Gaza and deliver humanitarian aid.The Madleen is part of the Freedom Flotilla Coalition. There are 12 survivors on board, with the Dutchman as captain. "There is no reason for Israel to attack us," Greta Thunberg tells RTL News.Israeli forces stopped a Gaza-bound aid boat and detained Greta Thunberg and other activists. The Freedom Flotilla Coalition said the activists had been "kidnapped by Israeli forces" Israel's Foreign Ministry cast the voyage as a public relations stunt. All the passengers of the 'selfie ya

In [9]:
# from transformers import pipeline

# summarizer = pipeline("summarization", model="facebook/bart-large-cnn", max_length=30)

# for day in grouped_results: #[list of strings]
#   print(day)
#   summaries = grouped_results[day]
#   to_summarize = ""
#   for summary in summaries:
#     to_summarize += summary
#   # print(to_summarize)
#   summary = summarizer(to_summarize, max_length=150, do_sample=False)
#   print(summary)

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

for day, summaries in grouped_results.items():
    print(day)
    to_summarize = " ".join(summaries)[:2024]  # Limit input size
    summary = summarizer(to_summarize, max_length=100, num_beams=2, do_sample=False)
    print(summary[0]["summary_text"])


Device set to use cpu


2025-06-10
Trump deploys thousands of Marines and guards in Los Angeles amid riots against riots. About 700 US Marines being mobilised in response to Los Angeles protests. Trump flexes strongman instincts with US military at LA protests. Korean Americans decry Trump Jr. for 'Rooftop Koreans' post Riots.
2025-06-09
Riots in Los Angeles are gaining momentum, arrested dozens of people. President Trump deploys National Guard, Marines on standby as Los Angeles protests enter fourth day. China Warns US Citizens About Los Angeles Protests. Elon Musk surprisingly defended Trump on the Los Angeles riots.
2025-06-07
Protests Against Immigration Officers Grow in Los Angeles Over Riots Tens of migrants arrested in riots in Los Los Angeles Migratory networks against Hispanics in Los LA spark protests and clashes. Raids by masked and armed officials against migrants trigger protests in Los L.A. Migrant arrests in LA lead to riots.
2025-06-08


Your max_length is set to 100, but your input_length is only 26. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)


National Guard called to Los Angeles amid anti-ICE protests. Trump blames "radical left" for instigating and financing Los Angeles protests. Anti-ICE protesters persist in Los Angeles despite National Guard deployment. Trump would activate little-used law to detain immigrants in any US state after riots.
2025-06-06
Pussy Riot co-founder starts Los Angeles prison performance with existential scream. Trump sent the National Guard to Los Angeles amid protests amid protests. Pussy Riot's Nadezhda Tolokonnikova: 'I am not a victim. I am a survivor'


In [None]:
json_response['results'][0]

In [None]:
df = pd.json_normalize(json_response['results'])

df.head()

## Example Sentiment Analysis

In [None]:
## Clean data

def clean_text(text):
    """
    - Convert to lowercase
    - Remove URLs
    - Remove punctuation / non-alpha
    - Collapse multiple spaces
    """
    if not isinstance(text, str):
        return ""
    # Remove URLs (very basic)
    text = re.sub(r"http\S+|www\.\S+", "", text)
    # Lowercase
    text = text.lower()
    # Keep only letters and spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

df['clean_summary'] = df['summary'].apply(clean_text)


In [None]:
## Sentiment analysis example

sia = SentimentIntensityAnalyzer()

def get_sentiment_scores(text):
    """
    Returns a dict with these keys:
       - neg: negative sentiment score
       - neu: neutral score
       - pos: positive score
       - compound: normalized, weighted composite (-1 to +1)
    """
    return sia.polarity_scores(text)

# Apply to each summary
df['sentiment'] = df['clean_summary'].apply(get_sentiment_scores)

# Split into separate columns if you like
df['sent_neg'] = df['sentiment'].apply(lambda d: d['neg'])
df['sent_neu'] = df['sentiment'].apply(lambda d: d['neu'])
df['sent_pos'] = df['sentiment'].apply(lambda d: d['pos'])
df['sent_compound'] = df['sentiment'].apply(lambda d: d['compound'])

# Quick look at top 5 compound scores
print(df[['clean_summary', 'sent_compound']].sort_values(by='sent_compound', ascending=False).head())
print(df[['clean_summary', 'sent_compound']].sort_values(by='sent_compound').head())


In [None]:
# Find index of the most positive (max compound) and most negative (min compound) summaries
max_idx = df['sent_compound'].idxmax()
min_idx = df['sent_compound'].idxmin()

# Retrieve the scores
max_score = df.loc[max_idx, 'sent_compound']
min_score = df.loc[min_idx, 'sent_compound']

# Print the full clean summaries along with their sentiment scores
print("Most positive summary (compound = {:.3f}):\n".format(max_score))
print(df.loc[max_idx, 'clean_summary'])


print("\n\nMost negative summary (compound = {:.3f}):\n".format(min_score))
print(df.loc[min_idx, 'clean_summary'])

## Project Ideas

Feel free to use this code to start your own project, and here are some (Chat-GPT generated 😬) ideas for projects:

* Real-Time Sentiment Pulse: Visualize sentiment trends over the past 24-48 hours for any keyword.

* One-Click News Brief: Generate a 3-sentence summary of today's top articles on a given topic.

* Bias/Slant Detector: Compare headlines from multiple outlets on the same event and label their bias.

* Event Timeline Generator: Autofill a chronological list of key dates and summaries for any query.

* Breaking News Alert Bot: Push a short alert whenever article volume spikes or sentiment turns extreme.

* Multilingual Hashtag Trend Mapper: Show related hashtags and translations across different languages.

* Rumor vs. Fact Checker: Verify a user-provided statement against recent reputable sources.

* “What's Changed?” Comparator: Highlight how coverage of a topic has shifted from last month to last week.

* Geo-Mood Map: Color-code countries by average sentiment or topic intensity on a query.

* Voice-Activated News Q&A: Let users speak a question and hear back a 2–3 sentence summary of current events.

## Dashboard libraries for Python

https://shiny.posit.co/py/

https://dash.plotly.com/

https://streamlit.io/

