In [1]:
%pylab inline
import json
import textproc.pipeline
import yaml
import asyncio
import aiostream.stream

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


In [2]:
with open('../Case - Editorial -Data Scientist/data/dn_articles.json') as f:
    data = json.load(f)

# Article enrichment pipeline

The following pipeline extracts metadata from the article bodies and lead texts.
It also looks up people and companies on Wikipedia and extracts relevant metadata from the Wikipedia articles.

The pipeline is written in a DSL that uses YAML syntax. In this DSL, a pipeline is a list of steps, each with a function name and some function-specific parameters.

The main functions are `templated` (calls an LLM), `each` (runs a sub-pipeline over some JSON field) and `wikipedia` (downloads Wikipedia articles).

The DSL makes heavy use of JSON paths (for JSON field lookups) and Jinja templates (for prompts). Note that the Jinja templates are extended so that JSON paths such as `{{$.title}}` can be used as template variables.
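
To make the idea of JSON paths as template variables concrete, here is a minimal sketch of how a placeholder such as `{{$.title}}` could be resolved against a data item. It is not the textproc implementation: the helper names `lookup` and `render` are hypothetical, it only handles simple dotted paths, and it uses a regular expression instead of Jinja.

```python
import re

# Hypothetical helpers for illustration only; this is not how textproc does it.
# Matches placeholders of the form {{$.field}} or {{ $.field.subfield }}.
PLACEHOLDER = re.compile(r"\{\{\s*\$((?:\.\w+)+)\s*\}\}")


def lookup(item, dotted):
    """Resolve a simple dotted path such as '.meta.lang' against nested dicts."""
    value = item
    for key in dotted.strip(".").split("."):
        value = value[key]
    return value


def render(template, item):
    """Replace every {{$.path}} placeholder with the value found in item."""
    return PLACEHOLDER.sub(lambda m: str(lookup(item, m.group(1))), template)


# Made-up data item with the same kind of fields the prompts refer to.
article = {"title": "Example title", "lead_text": "Example lead.", "meta": {"lang": "no"}}
prompt = render("# {{$.title}}\n\n{{$.lead_text}} ({{ $.meta.lang }})", article)
print(prompt)
```

A real JSON path library also supports array indices, wildcards and filters, which this sketch leaves out.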


In [5]:
pipeline = textproc.pipeline.Pipeline(yaml.safe_load("""
steps:
 - templated:
     description: Extract metadata
     
     prompt: |
       Below follows a newspaper article
       
       * Summarize the article into a single paragraph.
       * Suggest a new title for the following newspaper article. Don't make it too long, it will go on the webpage front page.
       * Extract any key phrases and keywords.
       * Extract any person, company (or other organization) mentioned

       # {{$.title}}
     
       {{$.lead_text}}

       {{$.body}}
     
     output_schema:
       type: object
       properties:
         summary:
           type: string
         suggested_title:
           type: string
         key_phrases:
           type: array
           items:
             type: string
         people:
           type: array
           items:
             type: string
         companies:
           type: array
           items:
             type: string
             
 - wikipedia:
     description: Download data on people
     input_key: $.people
     language: "no"
     
 - each:
     description: Extract data on people
     
     input_key: $.people
     steps:
         - templated:
             description: Extract metadata from a wikipedia article on a person
                            
             prompt: |
               Here follows part of a wikipedia article about a person.
               Extract a summary, as well as a list of roles they've had:
        
               {{$.content}}

             output_schema:
               type: object
               properties:
                 summary:
                   type: string
                 roles:
                   type: array
                   items:
                     type: string

 - wikipedia:
     description: Download data on companies
     input_key: $.companies
     language: "no"

 - each:
     description: Extract data on companies
     
     input_key: $.companies
     steps:
         - templated:             
             prompt: |
               Here follows part of a wikipedia article about a company or other organization.
               Extract a summary, as well as a list of major events related to the company:
        
               {{$.content}}
               
             output_schema:
               type: object
               properties:
                 summary:
                   type: string
                 events:
                   type: array
                   items:
                     type: string
 
"""))

# Running a pipeline

To run a pipeline, we supply a list or an (async) iterator as input and get an async iterator as output.
To turn that into a normal list, we use `aiostream.stream.list`.
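
The pattern of draining an async iterator into a list can be sketched with the standard library alone. The async generator `numbers` below is a hypothetical stand-in for the iterator a pipeline run returns, and `collect` is an illustrative helper, not part of textproc or aiostream.

```python
import asyncio


async def numbers(limit):
    """Async generator standing in for the iterator returned by a pipeline run."""
    for i in range(limit):
        await asyncio.sleep(0)  # hand control back to the event loop between items
        yield i * i


async def collect(aiter):
    """Drain an async iterator into a plain list."""
    return [item async for item in aiter]


# In a plain script we need asyncio.run to start an event loop.
result = asyncio.run(collect(numbers(4)))
print(result)
```

Inside a notebook an event loop is already running, so there one would write `await collect(numbers(4))` instead of calling `asyncio.run`, just as the cell below awaits the result directly.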

In [6]:
out = await aiostream.stream.list(pipeline.run(data[:1]))
                                  

2025-11-01 20:46:49,460 - textproc.pipeline - INFO - Executing Extract metadata
2025-11-01 20:46:49,463 - textproc.pipeline - INFO - Executing Download data on people
2025-11-01 20:46:49,465 - textproc.pipeline - INFO - Executing Extract data on people
2025-11-01 20:46:49,467 - textproc.pipeline - INFO - Executing Download data on companies
2025-11-01 20:46:49,468 - textproc.pipeline - INFO - Executing Extract data on companies
2025-11-01 20:46:49,498 - textproc.filters.templated - INFO - LLM call


2025-11-01 20:47:11,085 - textproc.pipeline - INFO - Executing Extract metadata from a wikipedia article on a person
2025-11-01 20:47:11,092 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:11,100 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:11,110 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:11,119 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:11,130 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:26,567 - textproc.filters.templated - INFO - LLM call


2025-11-01 20:47:47,774 - textproc.pipeline - INFO - Executing templated
2025-11-01 20:47:47,781 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:47,788 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:47,796 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:47,806 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:47,814 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:57,425 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:47:59,579 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:48:16,471 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:48:17,780 - textproc.filters.templated - INFO - LLM call


# Exploring the output

The output data items contain all the fields of the input data items (unless overwritten),
plus any new fields created by the pipeline steps.
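
The "input fields plus new fields, with overwriting" behaviour corresponds to an ordinary dictionary merge. The sketch below only illustrates that semantics; `merge_fields` is a hypothetical helper and the data is made up, so it says nothing about how textproc implements it internally.

```python
def merge_fields(item, new_fields):
    """Return a copy of item with new_fields added; same-named keys are overwritten."""
    return {**item, **new_fields}


# Made-up input item and step output.
article = {"title": "Original title", "body": "Text", "summary": "old"}
enriched = merge_fields(article, {"summary": "new", "people": ["A"]})
print(enriched)
```

Here `title` and `body` pass through unchanged, `summary` is overwritten and `people` is added, while the original `article` dict is left untouched.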

In [18]:
out[0].keys()

dict_keys(['ingestiontime', 'publication_lantern', 'content_id', 'published_at', 'updated_at', 'content_type', 'content_sub_type', 'authors', 'title', 'lead_text', 'body', 'categories', 'topics', 'inline', 'lead_image_url', 'summary', 'suggested_title', 'key_phrases', 'people', 'companies'])

In [7]:
len(out[0]["people"])

6

In [8]:
len(out[0]["companies"])

9

In [9]:
out[0]["people"][0]["name"]

'María Corina Machado'

In [10]:
out[0]["people"][0]["summary"]

"María Corina Machado Parisca (born 1967) is a Venezuelan engineer, industrial economist, human-rights activist and opposition politician. She founded and led the election-observation organisation Súmate, served as a deputy in Venezuela's National Assembly for Miranda (2011–2014), and is the national coordinator of the party Vente Venezuela. Machado has been barred from domestic political activity by the Maduro government and has received international recognition, including the 2024 Sakharov Prize, the 2024 Václav Havel Prize and the 2025 Nobel Peace Prize."

In [11]:
out[0]["people"][0]["roles"]

['Engineer and industrial economist',
 'Human-rights activist',
 'Opposition politician',
 'Founder and leader of Súmate (2002–2010)',
 "Member of Venezuela's National Assembly, deputy for Miranda (2011–2014)",
 'National coordinator of Vente Venezuela (2012–present)',
 'Nobel Peace Prize laureate (2025)',
 'Sakharov Prize laureate (2024)',
 'Václav Havel Prize for Human Rights laureate (2024)',
 'BBC 100 Women honoree (2018)']

## We are a bit unlucky with companies 0 and 1: their Wikipedia articles are *disambiguation articles*, and currently there's no code to recurse through them.
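
One possible way to handle such entries is to separate them out before the LLM step. The sketch below is not part of textproc. It assumes that each entity is a dict with `name` and `content` fields and that a disambiguation page can be recognised by a marker phrase in its text. The marker strings, the helper names and the sample data are all assumptions made for illustration.

```python
# Assumption: phrases that signal a disambiguation page; adjust per language edition.
DISAMBIGUATION_MARKERS = ("pekerside", "disambiguation")


def is_disambiguation(content, markers=DISAMBIGUATION_MARKERS):
    """Return True if the page text contains any of the marker phrases."""
    lowered = content.lower()
    return any(marker in lowered for marker in markers)


def split_entities(entities):
    """Separate usable entries from those that look like disambiguation pages."""
    usable, ambiguous = [], []
    for entity in entities:
        target = ambiguous if is_disambiguation(entity.get("content", "")) else usable
        target.append(entity)
    return usable, ambiguous


# Made-up sample entities, not taken from the pipeline output.
sample = [
    {"name": "Example A", "content": "Dette er en pekerside med flere betydninger."},
    {"name": "Example B", "content": "Example B er en organisasjon grunnlagt i 1945."},
]
usable, ambiguous = split_entities(sample)
print([e["name"] for e in usable], [e["name"] for e in ambiguous])
```

Following the links on a disambiguation page to pick the right article would need more context than a text marker provides, so this sketch only flags the entries.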

In [22]:
out[0]["companies"][2]["name"]

'Sikkerhetsrådet (UN Security Council)'

In [23]:
out[0]["companies"][2]["summary"]

'FNs sikkerhetsråd (United Nations Security Council) er FNs hovedorgan for internasjonal fred og sikkerhet, opprettet i 1945 gjennom FN-pakten. Rådet har 15 medlemmer — fem faste (USA, Storbritannia, Frankrike, Russland, Kina) med vetorett — og ti ikke-permanente medlemmer valgt for toårige perioder. Sikkerhetsrådet alene kan autorisere fredsbevarende operasjoner; presidentskapet rullerer månedlig.'

In [24]:
out[0]["companies"][2]["events"]

['1945: Sikkerhetsrådet ble opprettet gjennom De forente nasjoners pakt som FN‑s hovedorgan for å opprettholde internasjonal fred og sikkerhet.',
 'Tidligere forhold: Folkeforbundet hadde et tilsvarende beslutningsråd med faste medlemmer før FN.',
 'Fra 1945: De fem faste medlemmene (USA, Storbritannia, Frankrike, Russland/USSR, Kina) fikk vetorett — substantielle vedtak krever minst ni stemmer og ingen negativ stemme fra et fast medlem.',
 '1971: Setet for Kina i Sikkerhetsrådet gikk fra Republikken Kina (Taiwan) til Folkerepublikken Kina.',
 'Norge har vært valgt til Sikkerhetsrådet i periodene 1949–1950, 1963–1964, 1979–1980, 2001–2002 og 2021–2022; FN‑general‑forsamlingen valgte Norge for 2021–2022 den 17. juni 2020.',
 'Norges bidrag til FN‑bygningen i New York: Sikkerhetsrådets møterom ble donert og utformet av Arnstein Arneberg, med veggdekorasjon malt av Per Krohg.',
 'Tidlig 2000‑tall: Generalsekretær Kofi Annan ba om forslag til reform av FNs struktur, inkludert forslag om å 

In [25]:
print(out[0]["companies"][2]["content"])

FNs sikkerhetsråd  
---  
[engelsk](/wiki/Engelsk_\(spr%C3%A5k\) "Engelsk \(språk\)")| United Nations
Security Council  
[fransk](/wiki/Fransk_\(spr%C3%A5k\) "Fransk \(språk\)")| Conseil de sécurité
des Nations unies  
[![](//upload.wikimedia.org/wikipedia/commons/thumb/5/52/Emblem_of_the_United_Nations.svg/120px-
Emblem_of_the_United_Nations.svg.png)](/wiki/Fil:Emblem_of_the_United_Nations.svg)  
[![](//upload.wikimedia.org/wikipedia/commons/thumb/9/95/UN-Sicherheitsrat_-
_UN_Security_Council_-_New_York_City_-_2014_01_06.jpg/250px-UN-
Sicherheitsrat_-_UN_Security_Council_-
_New_York_City_-_2014_01_06.jpg)](/wiki/Fil:UN-Sicherheitsrat_-
_UN_Security_Council_-_New_York_City_-_2014_01_06.jpg)  
Grunnlagt| 1945  
System| [FN](/wiki/FN "FN")  
Kammer| hovedorgan i De forente nasjoner, [råd](/wiki/Nemnd "Nemnd")  
Seter| 15  
Møtested| [FN-bygningen](/wiki/FN-bygningen "FN-bygningen")  
Nettsted| <https://www.un.org/securitycouncil>  
[![](//upload.wikimedia.org/wikipedia/commons/thumb/a/aa