In [1]:
%load_ext autoreload
%autoreload 2

## 🧾 Use Case 3: Structured Summary from Conversations

In many enterprise workflows, unstructured conversations—like meeting transcripts, support calls, or financial briefings—contain valuable insights that need to be distilled into structured formats for downstream processing.

In this use case, we aim to extract **structured summaries** from raw conversational transcripts. Specifically, we want the model to extract:

- A concise **summary** of the conversation
- Relevant **keywords**
- **Named entities** such as people, organizations, and dates
- The **sentiment** of the discussion
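
As a rough illustration of the target shape, the four fields could be modeled as below. The class name `StructuredSummary`, the field types, and the placeholder values are assumptions made for this sketch; the actual output format is determined by the prompts and the formatting block used later in the pipeline.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StructuredSummary:
    """Hypothetical container for the four fields listed above."""

    summary: str
    keywords: list[str] = field(default_factory=list)
    named_entities: list[str] = field(default_factory=list)
    sentiment: str = "neutral"

    def to_json(self) -> str:
        # Serialize the dataclass fields as an indented JSON object
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; these are not taken from any real transcript
example = StructuredSummary(
    summary="Placeholder summary of a call.",
    keywords=["placeholder keyword"],
    named_entities=["Placeholder Org"],
    sentiment="neutral",
)
print(example.to_json())
```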


### Why is this useful?

Rather than just generating freeform text, this task helps transform unstructured inputs into **machine-readable structured outputs**, making it suitable for:

- Reporting dashboards  
- Automated indexing and retrieval  
- Analytics pipelines  
- Compliance and auditing tools

### PDF Data 

<img src="assets/financial_transcripts.png">


## 🧑‍🏫 Step 1: Set Up the Teacher Model

This demo expects an OpenAI-compatible endpoint. You can use any inference server that exposes one, such as vLLM, HFInferenceServer, or LlamaStack. For details on how to set up an inference server using vLLM, please refer to the [README](README.md).

For this demo we will use Llama-3.3-70B-Instruct as our teacher model.

#### Let's test the connection

In [2]:
from openai import OpenAI

openai_api_key = "EMPTY"  # replace with your inference server's API key
openai_api_base = (
    "http://150.239.209.43:8008/v1"  # replace with your inference server's endpoint
)


client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first model reported by the server as the teacher model
models = client.models.list()
teacher_model = models.data[0].id

# Test the connection with a simple completion
response = client.chat.completions.create(
    model=teacher_model,
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.0,
    max_tokens=10,
)
completion = response.choices[0].message.content

print(f"Connection successful! {teacher_model}: {completion}")

Connection successful! meta-llama/Llama-3.3-70B-Instruct: Hello. How can I help you today?


In [5]:
from datasets import load_dataset

seed_data = load_dataset(
    "json", data_files="seed_data/financial_call_transcripts.jsonl", split="train"
)

In [6]:
seed_data[0]

{'conversation_id': 'c47a92e006b54d014a79b447528c55a7',
 'pdf_path': 'seed_data/financial_call_transcripts/c47a92e006b54d014a79b447528c55a7.pdf'}
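
Each seed record pairs a `conversation_id` with the path to its PDF. If you want to build a similar seed file for your own transcripts, one possible sketch is shown below. It assumes the file stem serves as the conversation id, which matches the sample record above but is not confirmed as a general rule; `build_seed_records` and `write_jsonl` are hypothetical helper names.

```python
import json
from pathlib import Path


def build_seed_records(pdf_dir: str) -> list[dict]:
    """Return one record per PDF, using the file stem as the conversation id."""
    records = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        records.append({"conversation_id": pdf.stem, "pdf_path": pdf.as_posix()})
    return records


def write_jsonl(records: list[dict], out_path: str) -> None:
    """Write one JSON object per line, the layout the "json" loader reads."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

For example, `write_jsonl(build_seed_records("my_transcripts"), "my_seed.jsonl")` would produce a file that can be passed to `load_dataset("json", data_files=...)`; both paths here are placeholders.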

## Setting up the pipeline


In this section, we walk through a pipeline designed to process financial call transcripts (in PDF format) and extract structured insights using LLM-powered blocks. This is a classic example of transforming unstructured text into structured JSON for downstream analysis.



The YAML flow shown below reads PDF transcripts, parses them into text, and then uses an LLM to extract:
* ✅ Summary of the transcript
* ✅ Key topics or keywords
* ✅ Mentioned named entities (people, organizations, locations, etc.)
* ✅ Overall sentiment of the call

All of these are combined into a final structured `json_output` column.


```mermaid 
graph LR
    A[PDF Transcript] --> B[parse_transcript<br/>DoclingParsePDF]
    B --> C[conversation]
    C --> D[add_question<br/>AddStaticValue]
    D --> E[gen_summary<br/>LLMBlock]
    E --> F[gen_keywords<br/>LLMBlock]
    F --> G[gen_named_entities<br/>LLMBlock]
    G --> H[gen_sentiment<br/>LLMBlock]

    H --> I[format_json<br/>JSONFormat]
    I --> J[json_output]
```


```yaml
version: "1.0"
blocks:
  - name: parse_transcript
    type: DoclingParsePDF
    config:
      pdf_path_column: pdf_path
      output_column: conversation

  - name: add_question
    type: AddStaticValue
    config:
      column_name: question
      static_value: >
        Extract summary, keywords, named entities, and sentiment from the transcript and return in JSON format.

  - name: gen_summary
    type: LLMBlock
    config:
      config_path: ../prompts/summary.yaml
      output_cols:
        - summary

  - name: gen_keywords
    type: LLMBlock
    config:
      config_path: ../prompts/keywords.yaml
      output_cols:
        - keywords

  - name: gen_named_entities
    type: LLMBlock
    config:
      config_path: ../prompts/named_entities.yaml
      output_cols:
        - named_entities

  - name: gen_sentiment
    type: LLMBlock
    config:
      config_path: ../prompts/sentiment.yaml
      output_cols:
        - sentiment

  - name: format_json
    type: JSONFormat
    config:
      output_column: json_output
    drop_columns:
      - summary
      - keywords
      - named_entities
      - sentiment
```
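
The implementation of the `JSONFormat` block is not shown in this section. As a rough mental model only, the final step can be thought of as collecting the four generated columns into one JSON string and then removing the intermediates, mirroring `drop_columns` above. The function `format_row_as_json` and its parameters are hypothetical and are not part of the pipeline's API.

```python
import json

# Column names taken from the output_cols entries in the flow above
FIELDS = ("summary", "keywords", "named_entities", "sentiment")


def format_row_as_json(
    row: dict,
    fields: tuple[str, ...] = FIELDS,
    output_column: str = "json_output",
) -> dict:
    """Hypothetical stand-in for the final formatting step on a single row."""
    payload = {name: row[name] for name in fields}
    # Keep every other column and drop the intermediates
    result = {k: v for k, v in row.items() if k not in fields}
    result[output_column] = json.dumps(payload)
    return result
```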

In [7]:
import os
from instructlab.sdg.pipeline import Pipeline, PipelineContext
from blocks import *

ctx = PipelineContext(
    client=client, model_family="llama", model_id=teacher_model, batch_size=0
)
skills_pipe = Pipeline.from_file(
    ctx, os.path.join(os.getcwd(), "flows/grounded_summary_extraction.yaml")
)

In [8]:
# For demo purposes, run on the first 10 transcripts only; in practice you can use the entire dataset.
seed_data = seed_data.select(range(10))
generated_data = skills_pipe.generate(seed_data)

Map: 100%|██████████| 10/10 [00:00<00:00, 1066.87 examples/s]


In [11]:
from rich.console import Console
from rich.panel import Panel
from rich.syntax import Syntax
import random
import json


# Pick a random generated sample and parse its JSON string
rand_idx = random.randrange(len(generated_data))
data = json.loads(generated_data[rand_idx]["json_output"])

# Convert to JSON string with indentation for pretty printing
json_str = json.dumps(data, indent=2)

# Create syntax highlighted JSON
syntax = Syntax(json_str, "json", theme="github", line_numbers=False)

# Display it inside a panel
console = Console()
console.print(Panel(syntax, title="📊 Extracted Summary", expand=True))
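
LLM output is not guaranteed to be well formed, so a light check before handing `json_output` to downstream tools can be useful. The sketch below assumes the JSON keys match the `output_cols` names in the flow (`summary`, `keywords`, `named_entities`, `sentiment`); `find_problems` is a hypothetical helper, not part of the pipeline.

```python
import json

# Assumed key names, based on the output_cols entries in the flow
EXPECTED_KEYS = ("summary", "keywords", "named_entities", "sentiment")


def find_problems(json_str: str, expected_keys=EXPECTED_KEYS) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        data = json.loads(json_str)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for key in expected_keys:
        if key not in data:
            problems.append(f"missing key: {key}")
        elif data[key] in ("", None, [], {}):
            problems.append(f"empty value: {key}")
    return problems
```

Calling `find_problems(generated_data[i]["json_output"])` for each row would flag samples worth inspecting by hand.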

## ✅ Conclusion

In this section, we built an end-to-end pipeline that transforms unstructured PDF transcripts into structured JSON insights using modular building blocks. By parsing the document, prompting an LLM to extract specific features, and formatting the results, we created a workflow that applies to financial document analysis and to other use cases involving long-form text.

## 📝 Homework: Extend the Pipeline

Your task is to add a new block that extracts a different kind of structured insight from the conversation context.

Some examples include:
* 🧩 Risk factors mentioned in the call
* 📊 Numerical metrics (e.g., revenue, margin, headcount)
* 📌 Action items or decisions discussed

Once you’ve done that, you’ll have taken the first step toward custom skill authoring, opening the door to richer document understanding tailored to your own domain needs. 
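
As a starting point, the sketch below shows where a new block would sit in the flow, using plain Python dictionaries in place of the YAML entries. The block name `gen_risk_factors`, the prompt file `../prompts/risk_factors.yaml`, and the `risk_factors` column are all hypothetical; you would need to write the prompt file yourself.

```python
def insert_block_before(blocks: list[dict], new_block: dict, before_name: str) -> list[dict]:
    """Return a new list with new_block placed just before the named block."""
    names = [b["name"] for b in blocks]
    idx = names.index(before_name)  # raises ValueError if the name is absent
    return blocks[:idx] + [new_block] + blocks[idx:]


# Hypothetical block: neither the prompt file nor the column exists in the demo
risk_block = {
    "name": "gen_risk_factors",
    "type": "LLMBlock",
    "config": {
        "config_path": "../prompts/risk_factors.yaml",
        "output_cols": ["risk_factors"],
    },
}

# Abbreviated stand-in for the tail of the existing flow
existing = [{"name": "gen_sentiment"}, {"name": "format_json"}]
updated = insert_block_before(existing, risk_block, "format_json")
print([b["name"] for b in updated])
```

Placing the new block before `format_json` keeps it in line with the other generation steps. Whether the formatting block picks up the new column automatically depends on its implementation, so check that step and its `drop_columns` list when you add a field.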



