Note: this workflow is intended to loosely mirror the tutorial provided at: https://pfdtoolkit.org/getting_started/load_and_screen/

Running this notebook again might produce slightly different outputs. This is because LLMs are non-deterministic and their inherent randomness is difficult to completely eliminate.

In [1]:
# Time the entire workflow

import time
start = time.time()

# Getting started

This page talks you through an example workflow using PFD Toolkit: loading a dataset and screening for relevant cases related to "detention under the Mental Health Act". 

This is just an example. PFD reports contain a breadth of information across a whole range of topics and domains. But in this workflow, we hope to give you a sense of how the toolkit can be used, and how it might support your own project.

---

## Installation

PFD Toolkit can be installed from pip as `pfd_toolkit`:

```bash
pip install pfd_toolkit
```

Or, to update an existing installation:

```bash
pip install -U pfd_toolkit
```

---

## Load your first dataset

First, you'll need to load a PFD dataset. These datasets are updated weekly, meaning you always have access to the latest reports with minimal setup.

In [2]:
from pfd_toolkit import load_reports

# Load all PFD reports from January 2024 to May 2025
reports = load_reports(
    start_date="2024-01-01",
    end_date="2025-05-01")

# Identify number of reports
num_reports = len(reports)

reports.head(n=5)

| | url | id | date | coroner | area | receiver | investigation | circumstances | concerns |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2025-0209 | 2025-05-01 | A. Hodson | Birmingham and Solihull | NHS England; The Robert Jones and Agnes Hunt O... | On 9th December 2024 I commenced an investigat... | At 10.45am on 23rd November 2024, Peter sadly ... | To The Robert Jones and Agnes Hunt Orthopaedic... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2025-0208 | 2025-04-30 | J. Andrews | West Sussex, Brighton and Hove | West Sussex County Council | On 2 November 2024 I commenced an investigatio... | Mrs Turner drove her car into the canal at the... | The inquest was told that South Bank is a resi... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2025-0207 | 2025-04-30 | A. Mutch | Manchester South | Flixton Road Medical Centre; Greater Mancheste... | On 1 October 2024 I commenced an investigation... | Louise Danielle Rosendale was prescribed long ... | The inquest heard evidence that Louise Rosenda... |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2025-0120 | 2025-04-25 | M. Hassell | Inner North London | The President Royal College Obstetricians and ... | On 23 August 2024, one of my assistant coroner... | Jannat was a big baby and her mother had a his... | With the benefit of a maternity and newborn sa... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2025-0206 | 2025-04-25 | J. Heath | North Yorkshire and York | Townhead Surgery | On 4th June 2024 I commenced an investigation ... | On 15 March 2024, Richard James Moss attended ... | When a referral document is completed by a med... |


## Screening for relevant reports

You're likely using PFD Toolkit because you want to answer a specific question. For example: "Do any PFD reports raise concerns related to detention under the Mental Health Act?"

PFD Toolkit lets you query reports in plain English — no need to know precise keywords or categories. Just describe the cases you care about, and the toolkit will return matching reports.

### Set up an LLM client

Before screening reports, we first need to set up an LLM client. Screening and other toolkit features require an LLM to work.

You'll need to head to [platform.openai.com](https://platform.openai.com/docs/overview) and create an API key. Once you've got this, simply feed it to the `LLM`.

In [3]:
from pfd_toolkit import LLM
from dotenv import load_dotenv
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, 
                 model="gpt-4.1",
                 max_workers=25,
                 temperature=0, 
                 seed=123, 
                 timeout=20
                 )
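
If `api.env` is missing or the variable name is misspelled, `os.getenv` returns `None` instead of raising an error, so the problem may only surface once the first request is sent. A minimal sketch of a fail-fast check is shown below; `require_api_key` is a hypothetical helper written for this example and is not part of PFD Toolkit.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named environment variable, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of PFD Toolkit.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Check that api.env exists and defines it."
        )
    return value


# Possible usage, after load_dotenv("api.env") has run:
# openai_api_key = require_api_key()
```

Calling the helper straight after `load_dotenv` means a missing key is reported before any reports are sent for screening.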

### Screen reports in plain English

Now, all we need to do is specify our `search_query` (the statement the LLM will use to filter reports), and set up our `Screener` engine.

In [4]:
from pfd_toolkit import Screener

# Create a user query to filter
search_query = "Concerns about detention under the Mental Health Act **only**"

# Screen reports
screener = Screener(llm=llm_client,
                    reports=reports)  # Reports that you loaded earlier

filtered_reports = screener.screen_reports(search_query=search_query,
                                           produce_spans=True,
                                           drop_spans=True)

Sending requests to the LLM: 100%|██████████| 884/884 [00:29<00:00, 29.99it/s]


In [5]:
# Capture number of screened reports
num_reports_screened = len(filtered_reports)

# Check how many reports we've identified
print(f"From our initial {num_reports} reports, PFD Toolkit identified {num_reports_screened} \
reports discussing concerns around detention under the Mental Health Act.")

From our initial 884 reports, PFD Toolkit identified 51 reports discussing concerns around detention under the Mental Health Act.
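
To put that result in context, the share of matching reports can be worked out from the two counts. The snippet below is only a worked calculation: the figures are copied from this particular run and will differ if the date range, query, or model changes.

```python
# Counts copied from the output of this run
num_reports = 884           # reports loaded
num_reports_screened = 51   # reports matching the search query

# Proportion of loaded reports that matched the query
share = num_reports_screened / num_reports
print(f"{share:.1%} of loaded reports matched the query")
```

In this run, roughly one report in seventeen raised concerns about detention under the Mental Health Act.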


In practice, we'd probably want to extend our start and end dates to cover the entire corpus of reports. We've only kept things short for demo purposes :)

---

## Discover themes in your filtered dataset

With your subset of reports screened for Mental Health Act detention concerns, the next step is to uncover the underlying themes. This lets you see 'at a glance' what issues the coroners keep raising.

We'll use the `Extractor` class to automatically identify themes from the *concerns* section of each report.

In [6]:
from pfd_toolkit import Extractor

extractor = Extractor(
    llm=llm_client,             # The same client you created earlier
    reports=filtered_reports,   # Your screened DataFrame
    
    # Only supply the 'concerns' text
    include_date=False,
    include_coroner=False,
    include_area=False,
    include_receiver=False,
    include_investigation=False,
    include_circumstances=False,
    include_concerns=True   # <--- Only identify themes relating to concerns 
)

The main reason we're hiding all report sections other than the coroners' concerns is to keep the LLM's instructions short and focused. LLMs often perform better when they are given only relevant information.

Your own research question might be different. For example, you might be interested in discovering recurring themes related to 'cause of death', in which case you'll likely want to set `include_investigation` and `include_circumstances` to `True`.
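
As a rough sketch of that alternative setup, the `include_*` arguments can be built from a list of the sections you want. The section names come from the `Extractor` call above; `include_flags` is a hypothetical helper written for this example, not a PFD Toolkit function.

```python
# Section names taken from the include_* arguments used above
SECTIONS = ["date", "coroner", "area", "receiver",
            "investigation", "circumstances", "concerns"]


def include_flags(*wanted: str) -> dict[str, bool]:
    """Build include_* keyword arguments, enabling only the sections named.

    Hypothetical helper for illustration; not part of PFD Toolkit.
    """
    unknown = set(wanted) - set(SECTIONS)
    if unknown:
        raise ValueError(f"Unknown section(s): {sorted(unknown)}")
    return {f"include_{name}": name in wanted for name in SECTIONS}


# Flags for a 'cause of death' question: investigation and circumstances only
cause_of_death_flags = include_flags("investigation", "circumstances")

# These could then be unpacked into the constructor shown earlier:
# extractor = Extractor(llm=llm_client, reports=filtered_reports,
#                       **cause_of_death_flags)
```

Building the flags in one place makes it harder to leave a section switched on by accident when you change research question.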


---

### Summarise then discover themes

Before discovering themes, we first need to summarise each report. 

We do this because the length of PFD reports varies from coroner to coroner. Summarising focuses each report on its key messages and keeps the prompt short for the LLM, which may improve performance and speed.

The report sections that are summarised depend on the `include_*` flags you set earlier. In this tutorial, we are only summarising the *concerns* section.

In [7]:
# Create short summaries of the concerns
extractor.summarise(trim_intensity="medium")

# Ask the LLM to propose recurring themes
IdentifiedThemes = extractor.discover_themes(
    max_themes=6,  # Limit the list to keep things manageable
)

_Note:_ `Extractor` will warn you if the word count of your summaries is too high. In these cases, you might want to set your `trim_intensity` to `high` or `very high` (though please note that the more we trim, the more detail we lose).

`IdentifiedThemes` is a Pydantic model whose boolean fields represent the themes the LLM found. 

`IdentifiedThemes` cannot be printed directly, but its contents are mirrored as JSON in `extractor.identified_themes`, which we can print. This gives us a record of each proposed theme with an accompanying description.
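
If you want to work with the themes programmatically rather than just read them, the JSON can be parsed with the standard library. The sketch below assumes `extractor.identified_themes` is a JSON string mapping each theme name to its description; `themes_json` is a stand-in holding two entries copied from this notebook's output, so the example runs without an LLM call.

```python
import json

# Stand-in for extractor.identified_themes (assumed to be a JSON string);
# the two entries are copied from the output of this run.
themes_json = """
{
  "bed_shortage": "Insufficient availability of inpatient mental health beds or suitable placements, leading to delays, inappropriate care environments, or patients being placed far from home.",
  "staff_training": "Inadequate staff training, knowledge, or awareness regarding policies, risk assessment, clinical procedures, or the Mental Health Act."
}
"""

# Parse into a dict of {theme_name: description}
themes = json.loads(themes_json)
theme_names = list(themes)

# Print each theme with a shortened description
for name, description in themes.items():
    print(f"{name}: {description[:60]}...")
```

If the attribute turns out to be a dictionary already, the `json.loads` step can simply be dropped.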

In [8]:
print(extractor.identified_themes)

```json
{
  "bed_shortage": "Insufficient availability of inpatient mental health beds or suitable placements, leading to delays, inappropriate care environments, or patients being placed far from home.",
  "staff_training": "Inadequate staff training, knowledge, or awareness regarding policies, risk assessment, clinical procedures, or the Mental Health Act.",
  "record_keeping": "Poor, inconsistent, or falsified documentation and record keeping, including failures in care planning, observation records, and communication of key information.",
  "policy_gap": "Absence, inconsistency, or lack of clarity in policies, protocols, or guidance, resulting in confusion or unsafe practices.",
  "communication_failures": "Breakdowns in communication or information sharing between staff, agencies, families, or across systems, impacting patient safety and care continuity.",
  "risk_assessment": "Failures or omissions in risk assessment, escalation, or monitoring, including inadequate recognition of
```

### Tag the reports

Above, we've only identified the themes: we haven't assigned these themes to the reports.

Once you have the theme model, pass it back into the extractor to assign themes to every report in the dataset:

In [9]:
labelled_reports = extractor.extract_features(
    feature_model=IdentifiedThemes,
    force_assign=True,  # Force the model to make a decision (essentially ban missing data)
    allow_multiple=True,  # A single report might touch on multiple themes
)

labelled_reports.head(n=5)

Extracting features: 100%|██████████| 51/51 [00:05<00:00,  8.72it/s]


| | url | id | date | coroner | area | receiver | investigation | circumstances | concerns | bed_shortage | staff_training | record_keeping | policy_gap | communication_failures | risk_assessment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | https://www.judiciary.uk/prevention-of-future-... | 2025-0172 | 2025-04-07 | S. Reeves | South London | South London and Maudsley NHS Foundation Trust | On 21 March 2023, an inquest was opened, and a... | Christopher McDonald was pronounced dead at 14... | The evidence heard at the inquest demonstrated... | False | True | False | False | False | True |
| 59 | https://www.judiciary.uk/prevention-of-future-... | 2025-0144 | 2025-03-17 | S. Horstead | Essex | Chief Executive Officer of Essex Partnership U... | On 31 October 2023 I commenced an investigatio... | On the 23rd September 2023 after concerns were... | (a) Failures in care planning specifically a f... | False | False | True | False | True | True |
| 65 | https://www.judiciary.uk/prevention-of-future-... | 2025-0104 | 2025-03-13 | A. Harris | South London | Oxleas NHS Foundation Trust; Care Quality Comm... | On 15th January 2020 an inquest was opened int... | Mr Paul Dunne had a history of depression, anx... | Individual mental health professionals appeare... | False | True | True | True | True | True |
| 80 | https://www.judiciary.uk/prevention-of-future-... | 2025-0124 | 2025-03-06 | D. Henry | Coventry | Chair of the Coventry and Warwickshire Partner... | On 13 August 2021 I commenced an investigation... | Mr Gebrsslasié on the 2nd August 2021 was arre... | The inquest explored issues such ligature anch... | False | False | False | True | False | True |
| 85 | https://www.judiciary.uk/prevention-of-future-... | 2025-0119 | 2025-03-04 | L. Hunt | Birmingham and Solihull | Birmingham and Solihull Mental Health NHS Foun... | On 20th July 2023 I commenced an investigation... | Mr Lynch resided in room 1 in supported living... | To Birmingham and Solihull Mental Health Trust... | False | True | True | True | True | True |


The resulting DataFrame now contains a column for each discovered theme, filled with `True` or `False` depending on whether that theme was present in the coroner's concerns.

From here you can perform whatever analysis you need: counting how often each theme occurs, filtering for particular issues, or exporting the data to other tools.

For example, we can count how often each theme appears in our collection of reports:

In [10]:
extractor.tabulate()

| | Category | Count | Percentage |
|---|---|---|---|
| 0 | bed_shortage | 14 | 27.45098 |
| 1 | staff_training | 22 | 43.137255 |
| 2 | record_keeping | 13 | 25.490196 |
| 3 | policy_gap | 35 | 68.627451 |
| 4 | communication_failures | 19 | 37.254902 |
| 5 | risk_assessment | 34 | 66.666667 |
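
The percentages in this table are each theme's count divided by the 51 screened reports. Because `allow_multiple=True` lets one report carry several themes, the percentages add up to more than 100%. The short check below reproduces the arithmetic from the counts in the table; it is a worked example and does not call PFD Toolkit.

```python
# Counts copied from the table above; 51 is the number of screened reports
counts = {
    "bed_shortage": 14,
    "staff_training": 22,
    "record_keeping": 13,
    "policy_gap": 35,
    "communication_failures": 19,
    "risk_assessment": 34,
}
total_reports = 51

# Percentage of reports tagged with each theme
percentages = {
    theme: round(100 * n / total_reports, 6) for theme, n in counts.items()
}

# Total theme assignments across all reports, and the average per report
total_assignments = sum(counts.values())
themes_per_report = total_assignments / total_reports

for theme, pct in percentages.items():
    print(f"{theme}: {pct:.1f}%")
print(f"Average themes per report: {themes_per_report:.1f}")
```

On these figures each report was tagged with between two and three themes on average, which is worth bearing in mind when comparing theme frequencies.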


That's it! You've gone from a mass of PFD reports, to a focused set of cases relating to Mental Health Act detention, to a theme‑tagged dataset ready for deeper exploration.

From here we can either save our `labelled_reports` dataset via `pandas` for qualitative analysis, or we can use *even more* analytical features of PFD Toolkit.
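
As a sketch of the first option, the snippet below filters a theme-tagged DataFrame and exports the result with `pandas`. The small `labelled_reports` frame here is a stand-in built from three rows and two theme columns of the output shown earlier, so the example runs on its own; with the real dataset you would skip that step.

```python
import pandas as pd

# Stand-in for the real labelled_reports (values copied from three rows above)
labelled_reports = pd.DataFrame(
    {
        "id": ["2025-0172", "2025-0144", "2025-0104"],
        "staff_training": [True, False, True],
        "record_keeping": [False, True, True],
    }
)

# Keep only reports tagged with BOTH themes
both_themes = labelled_reports["staff_training"] & labelled_reports["record_keeping"]
subset = labelled_reports[both_themes]

# With no path given, to_csv returns the CSV as a string;
# pass a filename instead to write it to disk.
csv_text = subset.to_csv(index=False)
print(csv_text)
```

Boolean theme columns combine with `&` and `|`, so more specific subsets can be built up the same way before exporting.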

In [11]:
# Check workflow runtime

end = time.time()

elapsed_seconds = int(end - start)

minutes, seconds = divmod(elapsed_seconds, 60)
print(f"Elapsed time: {minutes}m {seconds}s")

Elapsed time: 0m 57s
