# Demo - News to open sanctions

* Read News article from url
* Match entities and relations from article with Open Sanctions data

Compare methods:
* Exact matching
    1. use LLM to extract nodes and relations
    2. use cypher query to see if there are matches in the open sanctions graph
* RAG
    1. embed (parts of) Open Sanctions data and store as Vector in the graph
    2. use LLM to extract nodes and relations
    3. generate embeddings for nodes and relations
    4. find matches using vector similarity match

### Sources:
* RAG paper - https://arxiv.org/abs/2005.11401
* RAG implementation - https://towardsdatascience.com/retrieval-augmented-generation-rag-from-theory-to-langchain-implementation-4e9bd5f6a4f2
* Customer embedding - INSTRUCTOR - https://arxiv.org/pdf/2212.09741.pdf


### Requirements

In [1]:
#!pip install newsapi-python langchain openai langchain-openai neo4j python-dotenv langchainhub langchain-community --quiet

In [2]:
%load_ext watermark
%watermark -p langchain,langchainhub,langchain_community

langchain          : 0.1.5
langchainhub       : 0.1.14
langchain_community: 0.0.17



### Imports

In [116]:
import os
import pandas as pd
from graphdatascience import GraphDataScience
from dotenv import load_dotenv, find_dotenv, dotenv_values
from pathlib import Path
import neo4j

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.neo4j_vector import Neo4jVector

from langchain.agents import AgentExecutor, create_react_agent
from langchain.chains import LLMChain
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.documents import Document
from langchain.output_parsers.json import SimpleJsonOutputParser
from langchain.prompts import PromptTemplate
from langchain.tools import Tool

from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain

from langchain import hub
from langchain_community.document_loaders import PyPDFLoader

import newsapi

from IPython.display import Image
from IPython.core.display import HTML

from string import Template
from datetime import datetime

### Settings

In [4]:
project_path = Path(os.getcwd()).parent
data_path = project_path / "data"
model_path = project_path / "models"
output_path = project_path / "output"

llm_model = "gpt-4"

# load env settings
load_dotenv("../.env.opensanctions")

neo4j_url = os.getenv('NEO4J_URL')
neo4j_database = "open-sanctions"
neo4j_user = os.getenv('NEO4J_USER')
neo4j_pass = os.getenv('NEO4J_PASS')
openai_api_key = os.getenv('OPENAI_API_KEY')
news_api_key = os.getenv('NEWS_API_KEY')

### 1. Read news article

* Created developer API Key on https://newsapi.org

In [6]:
# Init
api = newsapi.NewsApiClient(api_key=news_api_key)

### Explore available sources

In [7]:
print("newsapi.const.categories:", newsapi.const.categories)
print("newsapi.const.languages:", newsapi.const.languages)
print("newsapi.const.countries:", newsapi.const.countries)

newsapi.const.categories: {'entertainment', 'general', 'health', 'technology', 'sports', 'business', 'science'}
newsapi.const.languages: {'de', 'zh', 'cn', 'en-US', 'no', 'es', 'sv', 'ud', 'he', 'it', 'pt', 'en', 'nl', 'se', 'ru', 'ar', 'fr'}
newsapi.const.countries: {'de', 'gr', 'at', 'ca', 'cn', 'il', 'my', 'za', 'ph', 'jp', 'id', 'no', 'ae', 'lv', 'ch', 'pl', 'co', 'pt', 'eg', 'ng', 'kr', 'lt', 'mx', 'se', 'bg', 'is', 'fr', 'hu', 'tw', 'ma', 'th', 'us', 'zh', 'ro', 'ie', 'cu', 'tr', 'sg', 'cz', 'si', 'es', 'hk', 'sk', 'sa', 'in', 'pk', 'br', 'gb', 'it', 'nz', 'be', 'nl', 'ua', 've', 'ru', 'rs', 'ar', 'au'}


In [8]:
# call get_sources endpoint
dict_sources = api.get_sources()

# put results into dataframe
df_sources = pd.DataFrame(dict_sources['sources'])
df_sources

Unnamed: 0,id,name,description,url,category,language,country
0,abc-news,ABC News,"Your trusted source for breaking news, analysi...",https://abcnews.go.com,general,en,us
1,abc-news-au,ABC News (AU),"Australia's most trusted source of local, nati...",https://www.abc.net.au/news,general,en,au
2,aftenposten,Aftenposten,Norges ledende nettavis med alltid oppdaterte ...,https://www.aftenposten.no,general,no,no
3,al-jazeera-english,Al Jazeera English,"News, analysis from the Middle East and worldw...",https://www.aljazeera.com,general,en,us
4,ansa,ANSA.it,"Agenzia ANSA: ultime notizie, foto, video e ap...",https://www.ansa.it,general,it,it
...,...,...,...,...,...,...,...
123,wired,Wired,"Wired is a monthly American magazine, publishe...",https://www.wired.com,technology,en,us
124,wired-de,Wired.de,Wired reports on how emerging technologies aff...,https://www.wired.de,technology,de,de
125,wirtschafts-woche,Wirtschafts Woche,Das Online-Portal des führenden Wirtschaftsmag...,http://www.wiwo.de,business,de,de
126,xinhua-net,Xinhua Net,"中国主要重点新闻网站,依托新华社遍布全球的采编网络,记者遍布世界100多个国家和地区,地方频...",http://xinhuanet.com/,general,zh,zh


In [9]:
# select sources with "world news" 
cond_worldnews = df_sources.description.str.contains("world news", case=False)

# select dutch sources
cond_dutch = (df_sources.language == 'nl') | (df_sources.country == 'nl')

# filter sources
df_selected_sources = df_sources[cond_worldnews | cond_dutch].head()

# as list
list_selected_sources = df_selected_sources.id.tolist()

# as string (required for api
selected_sources = ",".join(list_selected_sources)

display(df_selected_sources)
print("selected sources (list): ", list_selected_sources)
print("selected sources (str): ", selected_sources)

Unnamed: 0,id,name,description,url,category,language,country
1,abc-news-au,ABC News (AU),"Australia's most trusted source of local, nati...",https://www.abc.net.au/news,general,en,au
78,nbc-news,NBC News,"Breaking news, videos, and the latest top stor...",http://www.nbcnews.com,general,en,us
96,rtl-nieuws,RTL Nieuws,Volg het nieuws terwijl het gebeurt. RTL Nieuw...,https://www.rtlnieuws.nl/,general,nl,nl
120,time,Time,Breaking news and analysis from TIME.com. Poli...,http://time.com,general,en,us


selected sources (list):  ['abc-news-au', 'nbc-news', 'rtl-nieuws', 'time']
selected sources (str):  abc-news-au,nbc-news,rtl-nieuws,time


### Get top headlines for selected sources

get_top_headlines(q=None, qintitle=None, sources=None, language='en', country=None, category=None, page_size=None, page=None)`

In [10]:
results = api.get_top_headlines(sources=selected_sources)
df_articles = pd.DataFrame(results['articles'])

In [11]:
display(df_articles.head())

# template (HTML) for article
article_template = Template(f"""
    <h3><a href="$url">$title</a></h3>
    <b>$source_name</b><br>
    <i>Published at $published_date by $author</i> - $description
    <img src="$urlToImage" width="300">
    <hr>""")

# display articles
for idx, row in df_articles.head().iterrows():
    # format date
    row['published_date'] = datetime.strptime(row['publishedAt'], "%Y-%m-%dT%H:%M:%SZ")
    
    # add source name as separate key
    row['source_name'] = row['source']['name']
    
    # display as HTML
    display(HTML(article_template.substitute(**row)))

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': 'abc-news-au', 'name': 'ABC News (AU)'}",Roslyn Butcher,"From Star Wars to Stranger Things, Broome teac...",Steele Harris-Walker's obsession with pop viny...,https://www.abc.net.au/news/2024-02-11/broome-...,https://live-production.wcms.abc-cdn.net.au/f8...,2024-02-11T03:23:08Z,When you think of collectables does your mind ...
1,"{'id': 'abc-news-au', 'name': 'ABC News (AU)'}",Brigitte Murphy,Family of Olympic boxer Charles Jardine celebr...,With Paris hosting the Olympics for the first ...,https://www.abc.net.au/news/2024-02-11/the-she...,https://live-production.wcms.abc-cdn.net.au/d1...,2024-02-11T01:38:47Z,"In 1924, a sheep farmer from the north-west of..."
2,"{'id': 'abc-news-au', 'name': 'ABC News (AU)'}",ABC News,Tech companies should build products with dome...,Abigail* says her ex-husband used his technolo...,https://www.abc.net.au/news/2024-02-11/domesti...,https://live-production.wcms.abc-cdn.net.au/17...,2024-02-10T20:58:58Z,<ul><li>In short: Experts say perpetrators of ...
3,"{'id': 'abc-news-au', 'name': 'ABC News (AU)'}",Eloise Fuss,Photographer Sammy Hawker is capturing Sydney ...,The Canberra-based visual artist 'co-creates' ...,https://www.abc.net.au/news/2024-02-11/canberr...,https://live-production.wcms.abc-cdn.net.au/52...,2024-02-10T20:39:54Z,"It's the middle of Summer in Sydney, but a coo..."
4,"{'id': 'abc-news-au', 'name': 'ABC News (AU)'}",Brianna Morris-Grant,How two men stole Edvard Munch's The Scream in...,"A robber's thank you card, a lengthy sting ope...",https://www.abc.net.au/news/2024-02-11/scream-...,https://live-production.wcms.abc-cdn.net.au/a2...,2024-02-10T19:37:47Z,"Two men, one van, and just 50 seconds that was..."


### Method: Exact matching

* Case : Get news about Russia, find interesting nodes and relations, match to Open Sanctions

#### 1. Use LLM to extract nodes and relations

##### Get news about Russia

In [140]:
%%time

results = api.get_everything(q='oligarch')

for k in ['status', 'totalResults']:
    print(f"{k}: {results[k]}")
    
df_articles = pd.DataFrame(results['articles'])
df_articles.head()

status: ok
totalResults: 266
CPU times: user 23.1 ms, sys: 6.22 ms, total: 29.3 ms
Wall time: 452 ms


Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': 'vice-news', 'name': 'Vice News'}","Maxwell Strachan, Tim Marchman",Tech Libertarians Fund Drug-Fueled ‘Olympics’ ...,A quixotic enterprise backed by investors incl...,https://www.vice.com/en/article/n7emq7/tech-li...,https://video-images.vice.com/articles/65bd490...,2024-02-02T20:22:11Z,Aron DSouza sees himself as part of a broader ...
1,"{'id': None, 'name': 'The New Yorker'}",Patrick Radden Keefe,A Teen’s Fatal Plunge Into the London Underworld,After Zac Brettler mysteriously fell to his de...,https://www.newyorker.com/magazine/2024/02/12/...,https://media.newyorker.com/photos/65bc2f47a3c...,2024-02-05T11:00:00Z,Sharma was calling from Apartment 504. If hed ...
2,"{'id': None, 'name': 'Mother Jones'}",Tim Murphy,The Rise of the American Oligarchy,What targeting Russia’s wayward billionaires r...,https://www.motherjones.com/politics/2024/01/a...,https://www.motherjones.com/wp-content/uploads...,2024-02-02T14:30:24Z,When the US targeted Russias oligarchs after t...
3,"{'id': None, 'name': 'tagesschau.de'}",tagesschau.de,Kobachidse als neuer Regierungschef von Georgi...,Georgien hat einen neuen Ministerpräsidenten: ...,https://www.tagesschau.de/ausland/europa/georg...,https://images.tagesschau.de/image/7d6bc1fc-b2...,2024-02-09T09:47:48Z,Stand: 09.02.2024 10:47 Uhr\r\nGeorgien hat ei...
4,"{'id': 'polygon', 'name': 'Polygon'}",Oli Welsh,"If we have to recycle old IP, Mr. & Mrs. Smith...",Donald Glover and Maya Erskine replace Brad Pi...,https://www.polygon.com/24059110/mr-mrs-smith-...,https://cdn.vox-cdn.com/thumbor/zf3AwgK1SH-WvF...,2024-02-03T15:00:00Z,"Have you watched Mr. &amp; Mrs. Smith, the 200..."


##### Prepare documents

In [141]:
# concatenate title and description

df_articles['published_date'] = df_articles['publishedAt'].apply(lambda x: datetime.strptime(x, "%Y-%m-%dT%H:%M:%SZ"))
content = (df_articles.title + '. ' + df_articles.description + ' Published at: ' + df_articles.published_date.astype(str)).tolist()
documents = [Document(page_content=c) for c in content]
documents[1]

Document(page_content='A Teen’s Fatal Plunge Into the London Underworld. After Zac Brettler mysteriously fell to his death in the Thames, his parents, Matthew and Rachelle, discovered that he’d been posing as an oligarch’s son, including in dealings with Akbar Shamji and Verinder (Dave )Sharma. Patrick Radden Keefe reports on the … Published at: 2024-02-05 11:00:00')

##### Create PromptTemplate

In [142]:
llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=0)

print("llm.temperature:", llm.temperature)

prompt = PromptTemplate(template="""Extract interesting elements out of a piece of text, can you extract the following from a text?
- Characters
- Events
- Locations
- Objects

<context>
{context}
</context>

Also, can you link these entities in the following way:
- character - RELATES_TO (`how`) - character
- character - INTERACTS_WITH (`how`)- character
- character - INVOLVED_IN - event
- event - LOCATED_ AT (`how`) Location
- object - RELEVANT_FOR (relevance_score (between 0 - 1, why)- event

if the attribute `how` is not relevant omit it.
for Events include date and time if present.

* Please estimate the relevance score yourself
* For RELATES_TO, provide max 2 words of how the characters relate (ie. has_father, has_mother, has_son, has_daughter, has_uncle, has_aunt, has_friend, has_colleague, etc) or interact. 
    * for the entiity links, add the keyword 'how' to indicate the type of relationship, ie. "Bilbo Baggins RELATES_TO (how:has_mother) Belladonna Took"
* For INTERACTS_WITH, provide a short summary why it is relevant and add it as a attribute 'why'

* For RELEVANT FOR, provide max 2 words of why the object is relevant.
    * for the entity links, add the keyword 'why' to indicate the type of relationship, ie. "Bilbo Baggins RELATES_TO (how:has_mother) Belladonna Took"
* For INTERACTS_WITH, provide a short summary why it is relevant and add it as a attribute 'why', ie: Staff - RELEVANT_FOR (relevance_score: 0.8, why: "Gandalf's staff is highly relevant to him as it is part of his iconic appearance and magical abilities.") - Gandalf


Finally, provide a summary for the text.

{output}
""", input_variables=["context", "output"])

llm.temperature: 0.0


##### Extract nodes and entities

In [143]:
# Characters:
# - Tucker Carlson
# - Vladimir Putin

# Events:
# - Tucker Carlson's interview with Vladimir Putin

# Locations:
# - Russia
# - Ukraine

# Objects:
# - None mentioned in the text

# Character - RELATES_TO (how: interviewer) - Character:
# - Tucker Carlson RELATES_TO (how:interviewer) Vladimir Putin

# Character - INVOLVED_IN - Event:
# - Tucker Carlson INVOLVED_IN - Tucker Carlson's interview with Vladimir Putin

# Event - LOCATED_AT - Location:
# - Tucker Carlson's interview with Vladimir Putin LOCATED_AT - Russia

# Summary:
# The ex-Fox News host, Tucker Carlson, will be the first Western journalist to interview Vladimir Putin since Russia invaded Ukraine.
# CPU times: user 34.2 ms, sys: 17.6 ms, total: 51.8 ms
# Wall time: 6.79 s

In [144]:
%%time

document_chain = create_stuff_documents_chain(llm, prompt)

response = document_chain.invoke({
    "context": [documents[1]], "output": ""
})
print(response)

Characters:
- Zac Brettler
- Matthew Brettler
- Rachelle Brettler
- Akbar Shamji
- Verinder (Dave) Sharma

Events:
- Zac Brettler's fatal plunge into the London underworld

Locations:
- The Thames

Objects:
- None mentioned in the given text

Entity Links:
- Zac Brettler RELATES_TO (how:posing_as) oligarch's son
- Zac Brettler INTERACTS_WITH (how:dealing_with) Akbar Shamji and Verinder (Dave) Sharma
- Zac Brettler INVOLVED_IN event: Zac Brettler's fatal plunge into the London underworld
- Zac Brettler LOCATED_AT location: The Thames

Summary:
The text discusses the mysterious death of Zac Brettler, who fell to his death in the Thames. It is revealed that Zac had been posing as an oligarch's son and had dealings with individuals named Akbar Shamji and Verinder (Dave) Sharma. Zac's parents, Matthew and Rachelle, discovered his secret after his death.
CPU times: user 28.7 ms, sys: 6.89 ms, total: 35.6 ms
Wall time: 6.31 s


##### Response as JSON

In [145]:
%%time

# convert response into a Document
doc_nodes_relations = Document(page_content=response)

# prompt to translate the response into valid JSON
prompt_to_json = PromptTemplate(template="""Translate the content from the context IN FULL into structured JSON format

<context>
{context}
</context>

{output}
""", input_variables=["context", "output"])

# create chain and invoke LLM
document_chain = create_stuff_documents_chain(llm, prompt_to_json)
response = document_chain.invoke({
    "context": [doc_nodes_relations], "output": "output the response as valid JSON"
})
print(response)

{
  "characters": [
    "Zac Brettler",
    "Matthew Brettler",
    "Rachelle Brettler",
    "Akbar Shamji",
    "Verinder (Dave) Sharma"
  ],
  "events": [
    "Zac Brettler's fatal plunge into the London underworld"
  ],
  "locations": [
    "The Thames"
  ],
  "objects": [],
  "entity_links": [
    {
      "entity": "Zac Brettler",
      "relation": "RELATES_TO",
      "how": "posing_as",
      "target": "oligarch's son"
    },
    {
      "entity": "Zac Brettler",
      "relation": "INTERACTS_WITH",
      "how": "dealing_with",
      "target": "Akbar Shamji"
    },
    {
      "entity": "Zac Brettler",
      "relation": "INTERACTS_WITH",
      "how": "dealing_with",
      "target": "Verinder (Dave) Sharma"
    },
    {
      "entity": "Zac Brettler",
      "relation": "INVOLVED_IN",
      "how": "event",
      "target": "Zac Brettler's fatal plunge into the London underworld"
    },
    {
      "entity": "Zac Brettler",
      "relation": "LOCATED_AT",
      "how": "location",
     

#### 2. use cypher query to see if there are matches in the open sanctions graph

In [147]:
graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_pass,
    database=neo4j_database
)

# double check if database is correct. Neo4jGraph uses magic to override the database argument with a value from the environment
print("graph._database:", graph._database)
graph.query("MATCH (n) RETURN count(n)")

graph._database: open-sanctions


[{'count(n)': 180457475}]

#### Create embeddings for Oligarchs

In [148]:
embeddings_model = OpenAIEmbeddings()

In [95]:
%%time
# Wall time: 1.28 s

results = graph.query("MATCH (n:Oligarch) RETURN n.id, n.caption LIMIT 200")
list_oligarch_ids = [r['n.id'] for r in results]
list_oligarch_names = [r['n.caption'] for r in results]

# create embeddings for the 177 Oliarches
embeddings = embeddings_model.embed_documents(list_oligarch_names)

CPU times: user 134 ms, sys: 18.1 ms, total: 152 ms
Wall time: 1.28 s


In [98]:
%%time
# Wall time: 1.3 s

for i, name in enumerate(list_oligarch_names):
    query = f"""
        MATCH (p:Oligarch {{id: '{ list_oligarch_ids[i] }'}})
        CALL db.create.setNodeVectorProperty(p, 'embedding', apoc.convert.fromJsonList('{str(embeddings[i])}'))
        RETURN count(*)
    """
    graph.query(query)

CPU times: user 582 ms, sys: 46.8 ms, total: 629 ms
Wall time: 1.3 s


In [150]:
graph.refresh_schema()
print(graph.schema)

#### Create index for the embeddings 

In [104]:
len(embeddings[0])

1536

In [106]:
query = """
CALL db.index.vector.createNodeIndex(
    'oligarch_embedding',
    'Oligarch',
    'embedding',
    1536,
    'cosine'
)
"""
graph.query(query)

[]

In [None]:
query = "SHOW indexes WHERE type = 'VECTOR'"
graph.query(query)

__Use Neo4jVector to create embeddings for the news articles__

In [172]:
new_vector = Neo4jVector.from_documents(
    documents,
    embeddings_model,
    url=os.getenv('NEO4J_URL'),
    username=os.getenv('NEO4J_USER'),
    password=os.getenv('NEO4J_PASS'),
    database=neo4j_database,
    index_name="myVectorIndex",
    node_label="NewsChunks",
    text_node_property="text",
    embedding_node_property="embedding",
    create_id_index=True,
)

In [198]:
query = "MATCH (n:NewsChunks) RETURN n.text, n.id, n.embedding"
results = graph.query(query)

In [216]:
def match_name_parts(name, text):
    matched_name_parts = []
    for name_part in name.split(" "):
        if name_part in text:
            matched_name_parts.append(name_part)
    return matched_name_parts

In [220]:
nr_nearest_neighbours = 3
threshold = 0.91
min_matched_name_parts = 1

for news_chunk in results:    
    query = f"""
    CALL db.index.vector.queryNodes('oligarch_embedding', {nr_nearest_neighbours}, {news_chunk['n.embedding']})
    YIELD node, score

    RETURN node.caption, node.id AS id, score
    """

    match_results = graph.query(query)
    for match_result in match_results:
        if match_result['score'] > threshold:
            matched_name_parts = match_name_parts(match_result['node.caption'], news_chunk['n.text'])
            if len(matched_name_parts) >= min_matched_name_parts:
                print(f"Matched Oligarch to news item (thresholds, cosine: {threshold}, nr_matched_name_parts: {min_matched_name_parts})")
                print(f"- {match_result['node.caption']}")
                print(f"- {news_chunk['n.text']} (id: {news_chunk['n.text']})")
                print("-", match_result['id'], match_result['score'], match_result['id'])

                print("Exact matches in text for name parts:", matched_name_parts)
                print()
                print("---")

Matched Oligarch to news item (thresholds, cosine: 0.91, nr_matched_name_parts: 1)
- Vladimir YEVTUSHENKOV
- Putin seemed to name his price for giving back Evan Gershkovich: the freedom of a straight-up murderer. Russian President Vladimir Putin said in an interview with Tucker Carlson that he's open to negotiating the release of US journalist Evan Gershkovich. Published at: 2024-02-09 11:58:50 (id: Putin seemed to name his price for giving back Evan Gershkovich: the freedom of a straight-up murderer. Russian President Vladimir Putin said in an interview with Tucker Carlson that he's open to negotiating the release of US journalist Evan Gershkovich. Published at: 2024-02-09 11:58:50)
- Q2007053 0.9157768487930298 Q2007053
Exact matches in text for name parts: ['Vladimir']

---
Matched Oligarch to news item (thresholds, cosine: 0.91, nr_matched_name_parts: 1)
- YUSHVAEV Gavril Abramovich
- The VC Firm That Brokered An Oligarch's Investments. Venture capital fund Target Global managed te