# Adverse Media Scraper using GDELT

# Architecture
```
Watchlist / Entity Master
        ↓
Alias Expansion Engine
        ↓
GDELT GKG API (Daily / Near-RT)
        ↓
Pre-Filter (Themes + Tone)
        ↓
Entity Resolution & Context Matching
        ↓
Risk Scoring Engine
        ↓
Elasticsearch (Index + History)
        ↓
Case Management / Alerts / Dashboards
```


# Adverse Media Definition

In [1]:
ADVERSE_THEMES = {
    "CORRUPTION",
    "FRAUD",
    "TERRORISM",
    "SANCTIONS",
    "MONEY_LAUNDERING",
    "ORGANIZED_CRIME",
    "BRIBERY",
    "FINANCIAL_CRIME"
}

NEGATIVE_TONE_THRESHOLD = -2.0



# Data Model

```json
{
  "entity_id": "ENT12345",
  "entity_name": "ABC Exports Pvt Ltd",
  "matched_alias": "ABC Exports",
  "article_url": "https://news.site/article123",
  "source": "Reuters",
  "published_datetime": "2025-01-14T10:30:00Z",
  "gdelt_tone": -3.45,
  "gdelt_themes": ["CORRUPTION", "FRAUD"],
  "locations": ["India"],
  "risk_score": 82,
  "risk_level": "HIGH",
  "explanation": "Negative tone with corruption & fraud themes",
  "ingestion_date": "2025-01-15",
  "raw_gdelt_record": {}
}
```


# Data Extraction Pipeline

In [2]:
import requests
from datetime import datetime

def fetch_gdelt_gkg(query, start_dt, end_dt, timeout=30):
    url = "https://api.gdeltproject.org/api/v2/gkgv2"
    params = {
        "query": query,
        "mode": "GKG",
        "format": "json",
        "startdatetime": start_dt,
        "enddatetime": end_dt,
        "maxrecords": 250
    }
    resp = requests.get(url, params=params, timeout=timeout)
    resp.raise_for_status()
    return resp.json().get("gkgRecords", [])


In [3]:
results = fetch_gdelt_gkg("Acme Bank", "20250101000000", "20250131235959")

HTTPError: 404 Client Error: Not Found for url: https://api.gdeltproject.org/api/v2/gkgv2?query=Acme+Bank&mode=GKG&format=json&startdatetime=20250101000000&enddatetime=20250131235959&maxrecords=250
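
The 404 above suggests that `api/v2/gkgv2` is not a served endpoint. Keyword article search in GDELT is normally done through the DOC 2.0 API, and the columns loaded later in this notebook (`url`, `url_mobile`, `title`, `socialimage`, `domain`, `language`, `sourcecountry`) look like its article-list output. Below is a minimal sketch against that API. The endpoint path, the `ArtList` mode and the `articles` response key are assumptions to verify against the current GDELT documentation, and the helper names are hypothetical.

```python
DOC_API_URL = "https://api.gdeltproject.org/api/v2/doc/doc"  # assumed DOC 2.0 endpoint


def build_doc_params(query, start_dt, end_dt, max_records=250):
    """Build query parameters for an article-list search (timestamps as YYYYMMDDHHMMSS)."""
    return {
        "query": query,
        "mode": "ArtList",        # assumed mode name for a plain list of articles
        "format": "json",
        "startdatetime": start_dt,
        "enddatetime": end_dt,
        "maxrecords": max_records,
    }


def parse_articles(payload):
    """Return the article list from a decoded response, or [] if the key is absent."""
    return payload.get("articles", [])


def fetch_gdelt_articles(query, start_dt, end_dt, timeout=30):
    """Call the API and return the list of article dicts."""
    import requests  # imported here so the helpers above work without the package

    params = build_doc_params(query, start_dt, end_dt)
    resp = requests.get(DOC_API_URL, params=params, timeout=timeout)
    resp.raise_for_status()
    return parse_articles(resp.json())
```

Keeping parameter building and response parsing separate from the network call makes both easy to check offline.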

# Alias Expansion (Critical for FIU Accuracy)
Example:

```python
entity = {
    "entity_id": "ENT001",
    "name": "ABC Exports Pvt Ltd",
    "aliases": ["ABC Exports", "ABC Exporters", "ABC Exports India"]
}
```


In [None]:
def generate_alias_query(entity):
    aliases = entity["aliases"]
    return " OR ".join([f'"{a}"' for a in aliases])
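
A slightly fuller sketch of the same idea: it also searches the registered name, drops case-insensitive duplicates, and wraps the OR group in parentheses, which is how the GDELT DOC 2.0 API documentation describes OR blocks. `build_alias_query` is a hypothetical name, and the parenthesised syntax is an assumption to check against the API actually used.

```python
def build_alias_query(entity):
    """Quote each distinct name and join them with OR inside one parenthesised group."""
    names = [entity["name"], *entity.get("aliases", [])]
    seen, unique = set(), []
    for name in names:
        key = name.strip().lower()
        if key and key not in seen:       # skip blanks and case-insensitive repeats
            seen.add(key)
            unique.append(name.strip())
    quoted = [f'"{n}"' for n in unique]
    if not quoted:
        return ""
    if len(quoted) == 1:
        return quoted[0]                  # a single phrase needs no OR group
    return "(" + " OR ".join(quoted) + ")"


entity = {
    "entity_id": "ENT001",
    "name": "ABC Exports Pvt Ltd",
    "aliases": ["ABC Exports", "ABC Exporters", "ABC Exports India", "abc exports"],
}
print(build_alias_query(entity))
# ("ABC Exports Pvt Ltd" OR "ABC Exports" OR "ABC Exporters" OR "ABC Exports India")
```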


In [None]:
# Adverse Media Filter
def is_adverse(record):
    tone = float(record.get("Tone", 0))
    themes = set(record.get("Themes", "").split(";"))

    adverse_theme_hit = bool(themes & ADVERSE_THEMES)
    negative_tone_hit = tone <= NEGATIVE_TONE_THRESHOLD

    return adverse_theme_hit or negative_tone_hit


In [None]:
# Risk Scoring Engine (Explainable)
def calculate_risk_score(record):
    score = 0
    # Only negative tone adds risk; a positive tone contributes nothing
    tone = max(-float(record.get("Tone", 0)), 0)
    themes = set(record.get("Themes", "").split(";"))

    score += min(tone * 5, 30)                   # Sentiment impact, capped at 30
    score += len(themes & ADVERSE_THEMES) * 15   # 15 points per adverse theme
    score += 10 if "SANCTIONS" in themes else 0  # Extra weight for sanctions

    return min(score, 100)
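
To make the score explainable in practice, each rule can record why it fired. The sketch below reuses the weights from the scoring cell (5 per tone point capped at 30, 15 per adverse theme, 10 extra for sanctions) and assumes only negative tone should count. `score_with_reasons` is a hypothetical helper, and the record layout (`Tone` as a single number, `Themes` as a `;`-separated string) follows the cells above rather than any confirmed GDELT response format.

```python
ADVERSE_THEMES = {
    "CORRUPTION", "FRAUD", "TERRORISM", "SANCTIONS",
    "MONEY_LAUNDERING", "ORGANIZED_CRIME", "BRIBERY", "FINANCIAL_CRIME",
}


def score_with_reasons(record):
    """Return (score, reasons) so every point can be traced back to a rule."""
    tone = float(record.get("Tone", 0))
    themes = {t for t in record.get("Themes", "").split(";") if t}
    hits = sorted(themes & ADVERSE_THEMES)
    score, reasons = 0.0, []

    tone_points = min(max(-tone, 0) * 5, 30)   # negative tone only, capped at 30
    if tone_points:
        score += tone_points
        reasons.append(f"negative tone {tone:+.2f} (+{tone_points:g})")
    if hits:
        score += len(hits) * 15
        reasons.append(f"adverse themes {', '.join(hits)} (+{len(hits) * 15})")
    if "SANCTIONS" in themes:
        score += 10
        reasons.append("sanctions theme (+10)")
    return min(score, 100), reasons


score, reasons = score_with_reasons({"Tone": "-4.0", "Themes": "CORRUPTION;SANCTIONS"})
print(score)    # 60.0  (20 for tone + 30 for two themes + 10 for sanctions)
print(reasons)
```

The joined `reasons` list could populate the `explanation` field of the stored record instead of a fixed string.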


In [None]:
# Normalize Record for Elasticsearch
def normalize_record(entity, record, score, matched_alias=None):
    return {
        "entity_id": entity["entity_id"],
        "entity_name": entity["name"],
        # Alias that actually matched the article; falls back to the primary name
        "matched_alias": matched_alias or entity["name"],
        "article_url": record.get("DocumentIdentifier"),
        "published_datetime": record.get("Date"),
        "gdelt_tone": float(record.get("Tone", 0)),
        "gdelt_themes": record.get("Themes", "").split(";"),
        "risk_score": score,
        "risk_level": "HIGH" if score >= 70 else "MEDIUM",
        "explanation": "Automated adverse media detection via GDELT",
        "raw_gdelt_record": record
    }


In [None]:

# Elasticsearch Ingestion (Bulk-Safe)
from elasticsearch import Elasticsearch, helpers

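# Local development settings: use real credentials and enable
# certificate verification outside a test environment
es = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "password"),
    verify_certs=False
)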

def bulk_ingest(records):
    actions = [
        {
            "_index": "fiu_adverse_media",
            "_source": r
        }
        for r in records
    ]
    helpers.bulk(es, actions)
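
As written, re-running the ingestion indexes the same article again under a new auto-generated ID. One way to make the bulk load repeatable is to derive `_id` from the record itself, so a second run overwrites rather than duplicates. A minimal sketch, assuming the entity ID plus article URL identifies a hit; `to_bulk_action` is a hypothetical helper:

```python
import hashlib


def to_bulk_action(record, index="fiu_adverse_media"):
    """Build one bulk action whose _id is a hash of entity ID and article URL."""
    key = f'{record["entity_id"]}|{record["article_url"]}'
    doc_id = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return {"_index": index, "_id": doc_id, "_source": record}


sample = {"entity_id": "ENT001", "article_url": "https://news.site/article123"}
action = to_bulk_action(sample)
print(action["_id"] == to_bulk_action(dict(sample))["_id"])  # True: same input, same ID
```

The actions can then be passed to the existing call, for example `helpers.bulk(es, (to_bulk_action(r) for r in records))`.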


# Known GDELT Limitations (and Fixes)
| Limitation           | Mitigation                         |
| -------------------- | ---------------------------------- |
| False positives      | Entity proximity + alias weighting |
| No article full text | Fetch URL separately               |
| Over-broad themes    | Custom FIU theme whitelist         |
| No legal judgment    | Human review loop                  |
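
The "entity proximity" mitigation can be approximated with a token-window check once the article text has been fetched: an alias only counts as a hit if an adverse keyword appears close to it. This is a rough sketch under that assumption, not a full entity-resolution step. `alias_near_keyword` and the default window of 10 tokens are illustrative choices, and only single-word keywords are handled.

```python
import re


def alias_near_keyword(text, alias, keywords, window=10):
    """True if any keyword occurs within `window` tokens of the alias phrase."""
    tokens = re.findall(r"\w+", text.lower())
    alias_tokens = re.findall(r"\w+", alias.lower())
    wanted = {k.lower() for k in keywords}
    n = len(alias_tokens)
    if n == 0:
        return False
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == alias_tokens:            # alias phrase starts at token i
            nearby = tokens[max(0, i - window):i + n + window]
            if wanted & set(nearby):
                return True
    return False


text = "Police said ABC Exports was charged with fraud in Mumbai."
print(alias_near_keyword(text, "ABC Exports", {"fraud", "bribe"}))            # True
print(alias_near_keyword(text, "ABC Exports", {"fraud", "bribe"}, window=2))  # False
```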


# How FIUs Commonly Use This

- ✅ AML onboarding checks
- ✅ Sanctions evasion monitoring
- ✅ Politically exposed person (PEP) monitoring
- ✅ Trade-based money laundering (TBML) risk
- ✅ Counter-terror financing (CTF)

In [48]:
import pandas as pd
import re

In [58]:
# Load the monthly outputs (June to October 2025) and stack them into one frame
months = ['June', 'July', 'Aug', 'Sep', 'Oct']
df_list = [pd.read_csv(f'./output/final_adverse_media_{m}_2025.csv') for m in months]
df = pd.concat(df_list).reset_index(drop=True)

In [59]:
df.head()

Unnamed: 0,url,url_mobile,title,date,socialimage,domain,language,sourcecountry,article,quantum,lea,crimes,person,locations,Keywords
0,https://timesofindia.indiatimes.com/city/vadod...,https://timesofindia.indiatimes.com/city/vadod...,NREGA scam : Minister sons sent to judicial cu...,20250606,https://static.toiimg.com/thumb/msid-121683107...,timesofindia.indiatimes.com,English,India,"Vadodara: Balwant and Kiran Khabad, the sons o...",[],[],['scam'],"{'rasik rathwa', 'rathwa', 'kiran', 'bhanpur',...",['Dahod district'],"case, nrega, kiran, friday, balwant"
1,https://www.newindianexpress.com/nation/2025/J...,https://www.newindianexpress.com/amp/story/nat...,Missing wife of Indore man found dead on ho...,20250609,https://media.assettype.com/newindianexpress%2...,newindianexpress.com,English,India,In a major development in the case of the miss...,[],[],['murder'],"{'xraja raghuvanshi', 'sonam raghuvanshi', 'gh...",['Ghazipur district'],"meghalaya, sonam, police, arrested, raghuvanshi"
2,https://www.tribuneindia.com/news/jalandhar/no...,https://www.tribuneindia.com/news/jalandhar/no...,No relief for MLA Arora son - The Tribune,20250607,https://www.tribuneindia.com/sortd-service/ima...,tribuneindia.com,English,India,The court of Special Judge Jaswinder Singh on ...,"['Rs 12 crore', 'Rs 60 lakh', 'Rs 10 crore', '...",[],"['bribe', 'corruption']","{'rajan arora', 'aroras', 'darshan singh dyal'...",[],"accused, mla, notices, vashisht, arora"
3,https://timesofindia.indiatimes.com/web-series...,https://timesofindia.indiatimes.com/web-series...,Rick and Morty season 8 : Episode 3 highlig...,20250609,https://static.toiimg.com/thumb/msid-121719119...,timesofindia.indiatimes.com,English,India,What happened in episode 3 of 'Rick and Morty'...,[],[],['kidnap'],"{'sarah chalke', 'doc morty', 'keith david', '...",[],"episode, rick, release, morty, season"
4,https://timesofindia.indiatimes.com/city/ranch...,https://timesofindia.indiatimes.com/city/ranch...,Bokaro villagers stir against sand mining | Ra...,20250609,https://static.toiimg.com/thumb/msid-117692178...,timesofindia.indiatimes.com,English,India,"Bokaro: Residents of several villages, led by ...",[],[],['illegal mining'],"{'ravi kumar singh', 'mamarkudar', 'chas', 'yu...",[],"mining, water, sand, villages, said"


In [60]:
df.columns

Index(['url', 'url_mobile', 'title', 'date', 'socialimage', 'domain',
       'language', 'sourcecountry', 'article', 'quantum', 'lea', 'crimes',
       'person', 'locations', 'Keywords'],
      dtype='object')

In [61]:
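import re

# Convert the string form of a list or set (e.g. "['a', 'b']") to plain text
def list_to_text(text: str) -> str:
    # "set()" is the repr of an empty set; remove only that token so that
    # words containing "set" (e.g. "asset") are left intact
    text = text.replace("set()", "")
    return re.sub(r"[{}\[\]()\"']", "", text).strip()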
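
Stripping brackets and quotes with a regex also removes apostrophes and parentheses that belong to the values themselves. An alternative sketch parses the cell with `ast.literal_eval` and falls back to the raw text when the cell is not a Python literal (as with the `Keywords` column). `cell_to_text` is a hypothetical name, and set members are sorted only to make the output deterministic.

```python
import ast


def cell_to_text(cell: str) -> str:
    """Turn a stringified list/set such as "['bribe', 'corruption']" into plain text."""
    cell = cell.strip()
    if cell in ("", "set()"):
        return ""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return cell                                   # already plain text
    if isinstance(value, (list, tuple)):
        return ", ".join(str(v) for v in value)       # keep original order
    if isinstance(value, (set, frozenset)):
        return ", ".join(sorted(str(v) for v in value))
    return str(value)


print(cell_to_text("['bribe', 'corruption']"))      # bribe, corruption
print(cell_to_text("{'rajan arora', 'aroras'}"))    # aroras, rajan arora
print(cell_to_text("case, nrega, kiran"))           # case, nrega, kiran
```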


In [62]:
df.tail()

Unnamed: 0,url,url_mobile,title,date,socialimage,domain,language,sourcecountry,article,quantum,lea,crimes,person,locations,Keywords
12080,https://www.moneycontrol.com/news/india/rahul-...,https://www.moneycontrol.com/news/india/rahul-...,Rahul dance taunt not the first : How person...,20251030,https://images.moneycontrol.com/static-mcnews/...,moneycontrol.com,English,India,Returning from a 58-day hiatus from the Bihar ...,[],[],[],"{'behrampura', 'modi', 'rahul gandhi', 'chhath...",[],"modi, congress, pm, bjp, said"
12081,https://www.news18.com/india/bhojpuri-actor-yo...,https://www.news18.com/amp/india/bhojpuri-acto...,"Bhojpuri Actor , YouTuber Mani Meraj Arrested ...",20251006,https://images.news18.com/ibnlive/uploads/2025...,news18.com,English,India,"Bhojpuri Actor, YouTuber Mani Meraj Arrested I...",[],[],"['fraud', 'drug']","{'pinky chaudhary', 'mani meraj', 'youtuber', ...",[],"meraj, police, youtuber, mani, accused"
12082,https://www.tribuneindia.com/news/delhi/two-yo...,https://www.tribuneindia.com/news/delhi/two-yo...,Two youths linked to ISIS arrested,20251025,https://www.tribuneindia.com/sortd-service/ima...,tribuneindia.com,English,India,Two youths allegedly linked to banned terror o...,[],[],[],"{'mohammad adnan khan', 'abu muharib', 'sadiq ...",[],"isis, delhi, said, arrested, handler"
12083,https://www.livelaw.in/high-court/bombay-high-...,https://www.livelaw.in/amp/high-court/bombay-h...,Bombay High Court Seeks SEBI Response In Plea ...,20251006,https://www.livelaw.in/h-upload/2025/10/06/624...,livelaw.in,English,India,A petition has been filed in the Bombay High C...,[],"['SECURITIES AND EXCHANGE BOARD OF INDIA', '(S...",[],{'vinay bansal'},[],"india, offer, management, limited, accused"
12084,https://aninews.in/news/business/mastercard-in...,/news/business/mastercard-introduces-first-eve...,Mastercard introduces first - ever threat inte...,20251029,https://d3lzcn6mbbadaf.cloudfront.net/media/de...,aninews.in,English,India,"PRNewswire\n\nSingapore, October 29: Today, Ma...",['$120 million'],[],['fraud'],{'matthew driver'},"['Asia-Pacific', 'Asia Pacific']","intelligence, fraud, mastercard, threat, cyber"


In [None]:

########################################################
# Function to prepare data for uploading to the DB
########################################################
from datetime import date, datetime

def prepare_for_db(df):
    # Drop duplicate articles by URL; drop=True avoids stray
    # 'index' / 'level_0' columns when the cell is re-run
    df = df.drop_duplicates(subset='url').reset_index(drop=True)
    # Add an ARN: 'ARN' + today's date (YYYYMMDD) + row number
    today = date.today().strftime('%Y%m%d')
    df['ARN'] = ['ARN' + today + str(i) for i in df.index]
    df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

    columns = ['ARN', 'date', 'title', 'article', 'domain', 'url', 'lea', 'crimes',
        'person', 'locations', 'quantum', 'Keywords' ]
    # .copy() so the assignments below modify a real frame, not a view
    df = df[columns].copy()
    # Flatten stringified list/set columns to comma-separated text
    list_columns = ['lea', 'crimes', 'person', 'locations', 'quantum', 'Keywords' ]
    for col in list_columns:
        df[col] = df[col].astype('str').map(list_to_text)
    return df

In [65]:
df = prepare_for_db(df)

In [66]:
df.head()

Unnamed: 0,ARN,date,title,article,domain,url,lea,crimes,person,locations,quantum,Keywords
0,ARN202601260,2025-06-06,NREGA scam : Minister sons sent to judicial cu...,"Vadodara: Balwant and Kiran Khabad, the sons o...",timesofindia.indiatimes.com,https://timesofindia.indiatimes.com/city/vadod...,,scam,"rasik rathwa, rathwa, kiran, bhanpur, nagori, ...",Dahod district,,"case, nrega, kiran, friday, balwant"
1,ARN202601261,2025-06-09,Missing wife of Indore man found dead on ho...,In a major development in the case of the miss...,newindianexpress.com,https://www.newindianexpress.com/nation/2025/J...,,murder,"xraja raghuvanshi, sonam raghuvanshi, ghazipur...",Ghazipur district,,"meghalaya, sonam, police, arrested, raghuvanshi"
2,ARN202601262,2025-06-07,No relief for MLA Arora son - The Tribune,The court of Special Judge Jaswinder Singh on ...,tribuneindia.com,https://www.tribuneindia.com/news/jalandhar/no...,,"bribe, corruption","rajan arora, aroras, darshan singh dyal, raman...",,"Rs 12 crore, Rs 60 lakh, Rs 10 crore, Rs 39 lakh","accused, mla, notices, vashisht, arora"
3,ARN202601263,2025-06-09,Rick and Morty season 8 : Episode 3 highlig...,What happened in episode 3 of 'Rick and Morty'...,timesofindia.indiatimes.com,https://timesofindia.indiatimes.com/web-series...,,kidnap,"sarah chalke, doc morty, keith david, the arca...",,,"episode, rick, release, morty, season"
4,ARN202601264,2025-06-09,Bokaro villagers stir against sand mining | Ra...,"Bokaro: Residents of several villages, led by ...",timesofindia.indiatimes.com,https://timesofindia.indiatimes.com/city/ranch...,,illegal mining,"ravi kumar singh, mamarkudar, chas, yudhistar ...",,,"mining, water, sand, villages, said"


In [67]:
from db.db_manager import AdverseMediaDB

def store_to_sqlite(df_final):
    db = AdverseMediaDB()
    db.create_table()
    db.insert_dataframe(df_final)

In [68]:
store_to_sqlite(df)

In [32]:
df.head()

Unnamed: 0,level_0,index,date,title,article,domain,url,lea,crimes,person,locations,quantum,Keywords,ARN
0,0,0,2025-06-06,NREGA scam : Minister sons sent to judicial cu...,"Vadodara: Balwant and Kiran Khabad, the sons o...",timesofindia.indiatimes.com,https://timesofindia.indiatimes.com/city/vadod...,[],['scam'],"{'rasik rathwa', 'rathwa', 'kiran', 'bhanpur',...",['Dahod district'],[],"case, nrega, kiran, friday, balwant",ARN202601230
1,1,1,2025-06-09,Missing wife of Indore man found dead on ho...,In a major development in the case of the miss...,newindianexpress.com,https://www.newindianexpress.com/nation/2025/J...,[],['murder'],"{'xraja raghuvanshi', 'sonam raghuvanshi', 'gh...",['Ghazipur district'],[],"meghalaya, sonam, police, arrested, raghuvanshi",ARN202601231
2,2,2,2025-06-07,No relief for MLA Arora son - The Tribune,The court of Special Judge Jaswinder Singh on ...,tribuneindia.com,https://www.tribuneindia.com/news/jalandhar/no...,[],"['bribe', 'corruption']","{'rajan arora', 'aroras', 'darshan singh dyal'...",[],"['Rs 12 crore', 'Rs 60 lakh', 'Rs 10 crore', '...","accused, mla, notices, vashisht, arora",ARN202601232
3,3,3,2025-06-09,Rick and Morty season 8 : Episode 3 highlig...,What happened in episode 3 of 'Rick and Morty'...,timesofindia.indiatimes.com,https://timesofindia.indiatimes.com/web-series...,[],['kidnap'],"{'sarah chalke', 'doc morty', 'keith david', '...",[],[],"episode, rick, release, morty, season",ARN202601233
4,4,4,2025-06-09,Bokaro villagers stir against sand mining | Ra...,"Bokaro: Residents of several villages, led by ...",timesofindia.indiatimes.com,https://timesofindia.indiatimes.com/city/ranch...,[],['illegal mining'],"{'ravi kumar singh', 'mamarkudar', 'chas', 'yu...",[],[],"mining, water, sand, villages, said",ARN202601234


In [None]:
from datetime import datetime, timezone
import sqlite3

def generate_arn_id(conn):
    """
    Generates ARN + YYYYMMDD + 4-digit daily auto-increment
    Example: ARN202601230001

    conn: open sqlite3 connection to the database holding adverse_media
    """
    today = datetime.now(timezone.utc).strftime('%Y%m%d')  # current UTC date
    cursor = conn.cursor()

    # Highest ID issued today; SUBSTR is 1-indexed, so characters 4-11 hold the date
    cursor.execute("""
        SELECT id
        FROM adverse_media
        WHERE SUBSTR(id, 4, 8) = ?
        ORDER BY id DESC
        LIMIT 1
    """, (today,))

    row = cursor.fetchone()

    if row:
        last_seq = int(row[0][-4:])
        next_seq = last_seq + 1
    else:
        next_seq = 1

    arn_id = f"ARN{today}{next_seq:04d}"
    return arn_id
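
The daily sequence logic can be exercised without the project database by pointing it at an in-memory SQLite table. This is a sketch only: the one-column `adverse_media` schema is an assumption, and `next_arn_id` is a hypothetical variant that takes the connection and an optional fixed date so the result is reproducible.

```python
import sqlite3
from datetime import datetime, timezone


def next_arn_id(conn, today=None):
    """Return ARN + YYYYMMDD + 4-digit sequence, continuing from today's highest ID."""
    today = today or datetime.now(timezone.utc).strftime("%Y%m%d")
    row = conn.execute(
        "SELECT id FROM adverse_media WHERE SUBSTR(id, 4, 8) = ? "
        "ORDER BY id DESC LIMIT 1",
        (today,),
    ).fetchone()
    next_seq = int(row[0][-4:]) + 1 if row else 1
    return f"ARN{today}{next_seq:04d}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adverse_media (id TEXT PRIMARY KEY)")  # assumed minimal schema

first = next_arn_id(conn, today="20260123")
conn.execute("INSERT INTO adverse_media (id) VALUES (?)", (first,))
second = next_arn_id(conn, today="20260123")
print(first, second)  # ARN202601230001 ARN202601230002
```

The sequence only advances once the previous ID has been inserted, so IDs for a batch need to be generated and stored one row at a time, or the counter kept in memory for the batch.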


In [19]:
from datetime import datetime
datetime.now()

datetime.datetime(2026, 1, 26, 17, 11, 21, 587537)

In [29]:
datetime.now().strftime("%Y%m%d%H%M%S")

'20260126171444'

In [30]:
date.today().strftime("%Y-%m-%d")

'2026-01-26'