In [1]:
%matplotlib inline
import json
import textproc.pipeline
import yaml
import asyncio
import aiostream.stream


In [2]:
with open('../Case - Editorial -Data Scientist/data/dn_articles.json') as f:
    data = json.load(f)
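
The prompt template in the pipeline reads `$.title`, `$.lead_text` and `$.body`, so it can help to confirm that each record carries those fields before spending LLM calls on it. A minimal sketch, assuming `data` is a list of dicts; `missing_fields` and the sample records are hypothetical and not part of `textproc`.

```python
REQUIRED_FIELDS = ("title", "lead_text", "body")  # fields the prompt template reads


def missing_fields(records, required=REQUIRED_FIELDS):
    """Return {record index: [missing or empty field names]} for incomplete records."""
    problems = {}
    for i, record in enumerate(records):
        missing = [key for key in required if not record.get(key)]
        if missing:
            problems[i] = missing
    return problems


# Hypothetical sample records, only to show the shape of the result.
sample = [
    {"title": "A", "lead_text": "B", "body": "C"},
    {"title": "A", "body": ""},
]
print(missing_fields(sample))  # {1: ['lead_text', 'body']}
```

On the real data this would be called as `missing_fields(data)`; an empty dict means every article has all three fields.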

In [15]:
pipeline = textproc.pipeline.Pipeline(yaml.safe_load("""
steps:
 - templated:
     description: Extract metadata
     
     prompt: |
       Below follows a newspaper article.
       
       * Summarize the article into a single paragraph.
       * Suggest a new title for the article. Keep it short, since it will appear on the website's front page.
       * Extract any key phrases and keywords.
       * Extract any people, companies or other organizations mentioned.

       # {{$.title}}
     
       {{$.lead_text}}

       {{$.body}}
     
     output_schema:
       type: object
       properties:
         summary:
           type: string
         suggested_title:
           type: string
         key_phrases:
           type: array
           items:
             type: string
         people:
           type: array
           items:
             type: string
         companies:
           type: array
           items:
             type: string
             
 - wikipedia:
     description: Download data on people
     input_key: $.people
     
 - each:
     description: Extract data on people
     
     input_key: $.people
     steps:
         - templated_chunked:
             description: Extract metadata from a wikipedia article on a person
             
             input_key: $.content
             chunking:
               chunk_size: 50000
               chunk_overlap: 500
               
             prompt: |
               Here follows part of a Wikipedia article about a person.
               Extract a summary, as well as a list of roles they've had:
        
               {{$}}
               
             summary_prompt: |
               Here follows a set of entries about a person. Each has a summary, as well as roles they've had.
               Merge these into a single summary and set of roles.
        
               {% for item in $.content %}
                  Summary: {{item.summary}}
                  Roles: {{item.roles}}
               {% endfor %}
        
             output_schema:
               type: object
               properties:
                 summary:
                   type: string
                 roles:
                   type: array
                   items:
                     type: string

 - wikipedia:
     input_key: $.companies

 - each:
     description: Extract data on companies
     
     input_key: $.companies
     steps:
         - templated_chunked:
             input_key: $.content
             chunking:
               chunk_size: 50000
               chunk_overlap: 500
               
             prompt: |
               Here follows part of a Wikipedia article about a company or other organization.
               Extract a summary, as well as a list of major events related to the company:
        
               {{$}}
               
             summary_prompt: |
               Here follows a set of entries about a company or other organization. Each has a summary, as well as related events.
               Merge these into a single summary and set of events.
        
               {% for item in $.content %}
                  Summary: {{item.summary}}
                  Events: {{item.events}}
               {% endfor %}
               
             output_schema:
               type: object
               properties:
                 summary:
                   type: string
                 events:
                   type: array
                   items:
                     type: string
 
"""))
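
The `chunking` settings above (`chunk_size: 50000`, `chunk_overlap: 500`) suggest a sliding window over the Wikipedia text. The sketch below shows one way such a window could work. It assumes the sizes are counted in characters, which the source does not confirm (they could equally be tokens), and `chunk_text` is a hypothetical helper, not the `textproc` implementation.

```python
def chunk_text(text, chunk_size=50000, chunk_overlap=500):
    """Split text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small numbers make the overlap visible.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of a little repeated text for the `summary_prompt` step to merge.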

In [16]:
out = await aiostream.stream.list(pipeline.run(data[:1]))

2025-11-01 20:12:40,319 - textproc.pipeline - INFO - Executing Extract metadata
2025-11-01 20:12:40,322 - textproc.pipeline - INFO - Executing Download data on people
2025-11-01 20:12:40,325 - textproc.pipeline - INFO - Executing Extract data on people
2025-11-01 20:12:40,326 - textproc.pipeline - INFO - Executing wikipedia
2025-11-01 20:12:40,328 - textproc.pipeline - INFO - Executing Extract data on companies
2025-11-01 20:12:40,362 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:13:11,616 - textproc.pipeline - INFO - Executing Extract metadata from a wikipedia article on a person
2025-11-01 20:13:11,623 - textproc.pipeline - INFO - Executing chunk
2025-11-01 20:13:11,624 - textproc.pipeline - INFO - Executing each
2025-11-01 20:13:11,624 - textproc.pipeline - INFO - Executing templated
2025-11-01 20:13:11,650 - textproc.pipeline - INFO - Executing templated
2025-11-01 20:13:11,652 - textproc.filters.templated - INFO - LLM call
2025-11-01 20:13:11,667 - textproc.filters.
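
The cell above drains the pipeline's async output into a list with `aiostream.stream.list`. For readers unfamiliar with that pattern, here is a standard-library sketch of the same idea. It assumes `pipeline.run(...)` behaves like an async generator; `fake_pipeline_run` and `collect` are hypothetical stand-ins, not `textproc` or `aiostream` functions.

```python
import asyncio


async def fake_pipeline_run(items):
    """Stand-in for an async generator such as pipeline.run(...)."""
    for item in items:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield {"input": item}


async def collect(agen):
    """Drain an async generator into a list."""
    return [item async for item in agen]


# In a plain script there is no running event loop, so asyncio.run starts one.
# Inside Jupyter, `await collect(...)` would be used instead.
results = asyncio.run(collect(fake_pipeline_run(["a", "b"])))
print(results)  # [{'input': 'a'}, {'input': 'b'}]
```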

In [17]:
len(out[0]["people"])

6

In [18]:
len(out[0]["companies"])

11

In [19]:
out[0]["people"][0]["summary"]

'María Corina Machado Parisca (born 7 October 1967) is a Venezuelan politician, engineer and civic activist who has been a leading opposition figure to the administrations of Hugo Chávez and Nicolás Maduro. An industrial engineer with a master’s degree in finance, she co‑founded the election‑monitoring organization Súmate, founded Fundación Atenea and has chaired the Oportunitas Foundation. Machado served as a deputy in the National Assembly (Miranda, 2011–2014), is National Coordinator of the party Vente Venezuela, has hosted a regular radio program, and served as an alternate envoy to the Organization of American States. She has led major anti‑government protests, run in opposition presidential primaries (2012; winner of the 2023 Unitary Platform primary but later barred from office), and has been the target of government prosecution, disqualification and brief detention. Internationally she has been a Yale World Fellow (2009) and received numerous honors including the Cádiz Cortes I

In [20]:
out[0]["people"][0]["roles"]

['Venezuelan politician and activist',
 'Opposition leader to Hugo Chávez and Nicolás Maduro',
 "Industrial engineer with a master's degree in finance",
 'Co‑founder and director of Súmate (election‑monitoring NGO)',
 'Founder of Fundación Atenea',
 'Chair of the Oportunitas Foundation',
 'National Coordinator and leader of Vente Venezuela',
 'Member (Deputy) of the National Assembly (Miranda), 2011–2014',
 'Presidential primary candidate (2012 opposition primary; winner of 2023 Unitary Platform primary)',
 'Civil society organizer and protest leader (notably 2014 protests and 2024 electoral crisis)',
 "Radio broadcaster and host (e.g., 'Contigo: Con María Corina Machado' / RCR 750 AM)",
 'Alternate envoy to the Organization of American States (2014)',
 'Yale World Fellow (2009)',
 'Member, Forum of Young Global Leaders',
 'Recipient, Cádiz Cortes Ibero‑American Freedom Prize (2015)',
 'BBC 100 Women honoree (2018)',
 'Recipient, Prize for Freedom — Liberal International (2019)',
 'Rec

In [21]:
out[0]["companies"][0]["summary"]

'Jørgen Watne Frydnes (born 26 November 1984) is a Norwegian administrator and politician. He served as CEO of Utøya AS from 2011 to 2023, was appointed a member of the Norwegian Nobel Committee in 2021 and became its chair in 2024, and has been General Secretary of PEN Norway since 2023. He and Utøya received the Fritt Ord Honorary Award in 2021. In October 2024 a PwC report identified possible corporate-law violations from the period when he led Utøya AS.'

In [22]:
out[0]["companies"][0]["events"]

['Born 26 November 1984.',
 'Served as CEO of Utøya AS from 2011 to 2023.',
 'Received the Fritt Ord Honorary Award (together with Utøya) in 2021.',
 'Appointed as a member of the Norwegian Nobel Committee in 2021.',
 'Became General Secretary of PEN Norway in 2023.',
 'Became chair of the Norwegian Nobel Committee in 2024.',
 'October 2024: PwC report listed possible corporate-law violations stemming from the period when he led Utøya AS.']
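
The first entry under `companies` above describes a person (its events start with a birth date), so the entity split is not fully reliable. One rough way to surface such entries is sketched below. The heuristic, the `looks_like_person` helper and the second sample entry are all hypothetical; it only inspects the `events` key defined in the output schema.

```python
def looks_like_person(entry):
    """Heuristic: an event that starts with 'Born' or 'Died' hints at a biography."""
    markers = ("born ", "died ")
    return any(event.lower().startswith(markers) for event in entry.get("events", []))


sample_companies = [
    # Taken from the output above.
    {"events": ["Born 26 November 1984.", "Served as CEO of Utøya AS from 2011 to 2023."]},
    # Hypothetical entry for an actual organization.
    {"events": ["Founded in 1901."]},
]
flagged = [i for i, company in enumerate(sample_companies) if looks_like_person(company)]
print(flagged)  # [0]
```

On the real result this would run over `out[0]["companies"]`. It is a coarse filter and will miss people whose extracted events mention no birth or death date.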