# JSON Processing with fenic

This example demonstrates how to use fenic to process, analyze, and extract structured information from a complex JSON transcript file (such as those produced by speech-to-text systems like Whisper). 

The workflow covers the entire pipeline: loading and casting the JSON data, extracting word- and segment-level details, and aggregating speaker statistics using both JQ queries and DataFrame operations.

**Key steps include**:
- Loading a JSON transcript and casting it to a structured JSON type.
- Using JQ queries to extract nested word- and segment-level data.
- Structuring and cleaning extracted data with type casting and calculated fields.
- Aggregating speaker statistics, such as total words, speaking time, and word rates.
- Demonstrating hybrid processing: combining JSON extraction, array operations, and traditional DataFrame analytics.

This notebook provides a practical example of how fenic can transform nested, semi-structured JSON data into structured, queryable DataFrames for further analysis.

## Setting Up the fenic Session

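This cell configures and initializes a fenic session. The configuration registers an OpenAI language model (`gpt-4o-mini`) under the alias `mini`, with rate limits of 500 requests per minute (`rpm`) and 200,000 tokens per minute (`tpm`). The processing steps in this notebook rely on JQ queries and DataFrame operations rather than on model calls.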

In [None]:
from pathlib import Path

import fenic as fc

config = fc.SessionConfig(
    app_name="json_processing",
    semantic=fc.SemanticConfig(
        language_models={
            "mini": fc.OpenAIModelConfig(
                model_name="gpt-4o-mini",
                rpm=500,
                tpm=200_000
            )
        }
    )
)

# Create session
session = fc.Session.get_or_create(config)


## Loading and Preparing the JSON Transcript

This cell loads the transcript from a JSON file, creates a fenic DataFrame containing the raw JSON string, and casts that string to fenic's JSON type so that JQ queries can run against it.

In [None]:
transcript_path = Path("whisper-transcript.json")

with open(transcript_path, "r", encoding="utf-8") as f:
    json_content = f.read()

# Create dataframe with the JSON string
df = session.create_dataframe([{"json_string": json_content}])

# Cast the JSON string to JSON type
df_json = df.select(
    fc.col("json_string").cast(fc.JsonType).alias("json_data")
)

df_json.show(1)
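
The cast above presupposes that the file holds valid JSON with the shape the later queries read. As a minimal sketch, the snippet below parses a small hypothetical transcript with Python's standard `json` module and checks for the keys used in this notebook. The field names mirror those referenced by the JQ queries; the values, and the helper `check_transcript_shape`, are invented for illustration and are not part of fenic.

```python
import json

# Hypothetical, hand-written sample; all values are invented for illustration.
# Field names match the ones the JQ queries in this notebook read.
SAMPLE_TRANSCRIPT = """
{
  "segments": [
    {
      "text": "Hello there.",
      "start": 0.0,
      "end": 1.0,
      "words": [
        {"word": "Hello", "speaker": "SPEAKER_00", "start": 0.0, "end": 0.5, "probability": 0.75},
        {"word": "there.", "speaker": "SPEAKER_00", "start": 0.5, "end": 1.0, "probability": 0.25}
      ]
    }
  ]
}
"""


def check_transcript_shape(raw: str) -> dict:
    """Parse raw JSON text and verify the keys the later queries rely on."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    for segment in data["segments"]:
        for key in ("text", "start", "end", "words"):
            if key not in segment:
                raise KeyError(f"segment is missing {key!r}")
    return data


transcript = check_transcript_shape(SAMPLE_TRANSCRIPT)
print(len(transcript["segments"]))
```

Running a check like this before building the DataFrame surfaces malformed input early, with a Python traceback that points at the file rather than at a later query.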

## Extracting Word-Level Data Using JQ

This cell uses a JQ query to extract all individual words from each segment in the JSON transcript. 

It demonstrates nested array traversal and variable binding, producing a DataFrame where each row contains both word-level and segment-level information.


In [None]:
# Extract all words from all segments using JQ
# This demonstrates nested array traversal and variable binding in JQ
words_df = df_json.select(
    fc.json.jq(
        fc.col("json_data"),
        # JQ query explanation:
        # - '.segments[] as $seg' iterates through segments, binding each to $seg
        # - '$seg.words[]' iterates through words in each segment
        # - Constructs object with both word-level and segment-level data
        '.segments[] as $seg | $seg.words[] | {word: .word, speaker: .speaker, start: .start, end: .end, probability: .probability, segment_start: $seg.start, segment_end: $seg.end, segment_text: $seg.text}'
    ).alias("word_data")
).explode("word_data")  # Convert array of word objects into separate rows
words_df.show(3)
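
To make the JQ query easier to read, here is a plain-Python sketch of the same traversal: an outer loop plays the role of `.segments[] as $seg`, and an inner loop plays the role of `$seg.words[]`. The function `extract_words` and the sample data are hypothetical and exist only for this illustration; fenic evaluates the JQ program itself.

```python
from typing import Any, Dict, Iterator


def extract_words(transcript: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per word, carrying its parent segment's fields along."""
    for seg in transcript["segments"]:      # .segments[] as $seg
        for word in seg["words"]:           # $seg.words[]
            yield {
                "word": word.get("word"),
                "speaker": word.get("speaker"),
                "start": word.get("start"),
                "end": word.get("end"),
                "probability": word.get("probability"),
                "segment_start": seg.get("start"),   # $seg.start
                "segment_end": seg.get("end"),       # $seg.end
                "segment_text": seg.get("text"),     # $seg.text
            }


# Hypothetical sample with two segments; values are invented.
sample = {
    "segments": [
        {
            "text": "Hello there.", "start": 0.0, "end": 1.0,
            "words": [
                {"word": "Hello", "speaker": "SPEAKER_00", "start": 0.0, "end": 0.5, "probability": 0.75},
                {"word": "there.", "speaker": "SPEAKER_00", "start": 0.5, "end": 1.0, "probability": 0.25},
            ],
        },
        {
            "text": "Bye.", "start": 2.0, "end": 2.5,
            "words": [
                {"word": "Bye.", "speaker": "SPEAKER_01", "start": 2.0, "end": 2.5, "probability": 0.5},
            ],
        },
    ]
}

rows = list(extract_words(sample))
print(len(rows))
```

Each output row corresponds to one row of `words_df` after `explode`: the word's own fields plus a copy of the enclosing segment's timing and text.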

## Structuring and Cleaning Word-Level Data

This cell defines a schema for word-level data, casts the extracted word objects to this schema, and unnests the fields for clarity. 

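The result is a clean DataFrame with properly typed and named columns for each word and its associated metadata. The `segment_text` field produced by the JQ query is not part of the schema or the final selection, so it does not appear in the result.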

In [None]:
# Extract scalar values using struct casting and unnest - more efficient than JQ + get_item(0)
# Define schema for word-level data structure
word_schema = fc.StructType([
    fc.StructField("word", fc.StringType),
    fc.StructField("speaker", fc.StringType),
    fc.StructField("start", fc.FloatType),
    fc.StructField("end", fc.FloatType),
    fc.StructField("probability", fc.FloatType),
    fc.StructField("segment_start", fc.FloatType),
    fc.StructField("segment_end", fc.FloatType)
])

# Cast to struct and unnest to automatically extract all fields
words_clean_df = words_df.select(
    fc.col("word_data").cast(word_schema).alias("word_struct")
).unnest("word_struct").select(
    # Rename fields for clarity
    fc.col("word").alias("word_text"),
    fc.col("speaker"),
    fc.col("start").alias("start_time"),
    fc.col("end").alias("end_time"),
    fc.col("probability"),
    fc.col("segment_start"),
    fc.col("segment_end")
)

print("\nScalar extracted fields:")
words_clean_df.show(3)
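
The idea behind casting to a schema and unnesting can be sketched in plain Python: each schema field becomes a typed column, and the object is flattened into one value per field. The mapping `WORD_SCHEMA` and the helper `cast_to_schema` below are hypothetical stand-ins for illustration. In particular, the sketch assumes that missing fields become `None` and that fields outside the schema are dropped; how fenic's own cast treats these edge cases is not confirmed here.

```python
from typing import Any, Callable, Dict

# Field name -> converter, mirroring word_schema from the notebook.
WORD_SCHEMA: Dict[str, Callable[[Any], Any]] = {
    "word": str,
    "speaker": str,
    "start": float,
    "end": float,
    "probability": float,
    "segment_start": float,
    "segment_end": float,
}


def cast_to_schema(obj: Dict[str, Any], schema: Dict[str, Callable[[Any], Any]]) -> Dict[str, Any]:
    """Return a flat row holding only the schema's fields, coerced to their types.

    Assumptions of this sketch: missing or null fields become None, and
    keys that are not in the schema are silently dropped.
    """
    row: Dict[str, Any] = {}
    for name, convert in schema.items():
        value = obj.get(name)
        row[name] = None if value is None else convert(value)
    return row


# Hypothetical input: an integer start, a string end, no speaker, one extra key.
row = cast_to_schema(
    {"word": "Hi", "start": 1, "end": "2.5", "segment_text": "Hi"},
    WORD_SCHEMA,
)
print(sorted(row))
```

Declaring the types once in a schema keeps the conversions in one place, which is the same motivation the notebook gives for preferring struct casting over per-field JQ extraction.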

## Adding Calculated Fields to Word Data

This cell demonstrates how to add calculated fields to the word-level DataFrame. 

Specifically, it computes the duration of each word (end time minus start time), showcasing arithmetic operations on structured data.

In [None]:
# Add calculated fields - types are already correct from struct schema
# This demonstrates arithmetic operations on struct-extracted data
words_final_df = words_clean_df.select(
    "*",
    # Calculate duration: end_time - start_time (demonstrates arithmetic on struct data)
    (fc.col("end_time") - fc.col("start_time")).alias("duration")
)

print("\n📊 Words DataFrame with calculated duration:")

words_final_df.show(10)

## Creating a Segments DataFrame

This cell extracts segment-level data from the JSON transcript using a JQ query. 

Each row in the resulting DataFrame represents a segment, including its text, timing, and associated words, enabling analysis at a coarser granularity than the word level.

In [None]:
# 2. Create Segments DataFrame (Content-focused)
print("\n📝 Creating Segments DataFrame...")

# Extract segment-level data using JQ
# This demonstrates extracting data at a different granularity level
segments_df = df_json.select(
    fc.json.jq(
        fc.col("json_data"),
        # Extract segment objects with their text, timing, and nested words array
        '.segments[] | {text: .text, start: .start, end: .end, words: .words}'
    ).alias("segment_data")
).explode("segment_data")  # Convert segments array into separate rows

print(f"Extracted {segments_df.count()} segments")
segments_df.show(3)


## Structuring and Aggregating Segment-Level Data

This cell defines a schema for basic segment fields and uses a hybrid approach, combining struct casting with JQ queries, to extract, aggregate, and calculate metrics for each segment.

The resulting DataFrame includes segment text, timing, word count, average confidence, and duration.

In [None]:
# Extract segment fields using hybrid approach: struct casting + JQ for complex aggregations
# Define schema for basic segment fields (text, start, end)
segment_basic_schema = fc.StructType([
    fc.StructField("text", fc.StringType),
    fc.StructField("start", fc.FloatType),
    fc.StructField("end", fc.FloatType)
])

# First extract basic fields using struct casting, then add complex JQ aggregations
segments_clean_df = segments_df.select(
    # Extract basic segment data using struct casting (more efficient)
    fc.col("segment_data").cast(segment_basic_schema).alias("segment_struct"),
    # Complex array aggregations still use JQ (best tool for this)
    fc.json.jq(fc.col("segment_data"), '.words | length').get_item(0).cast(fc.IntegerType).alias("word_count"),
    fc.json.jq(fc.col("segment_data"), '[.words[].probability] | add / length').get_item(0).cast(fc.FloatType).alias("average_confidence")
).unnest("segment_struct").select(
    # Rename for clarity
    fc.col("text").alias("segment_text"),
    fc.col("start").alias("start_time"),
    fc.col("end").alias("end_time"),
    fc.col("word_count"),
    fc.col("average_confidence")
).select(
    "segment_text",
    "start_time",
    "end_time",
    # Calculate segment duration using DataFrame arithmetic
    (fc.col("end_time") - fc.col("start_time")).alias("duration"),
    "word_count",
    "average_confidence"
)

print("\n📊 Segments DataFrame with calculated metrics:")
segments_clean_df.show(5)
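
The two JQ aggregations are short enough to restate in plain Python: `.words | length` counts the words, and `[.words[].probability] | add / length` averages their probabilities. The helper `segment_metrics` and its sample input are hypothetical. Returning `None` for a segment without words is a choice made for this sketch and is not claimed to match what the JQ expression does in that case.

```python
from typing import Any, Dict


def segment_metrics(segment: Dict[str, Any]) -> Dict[str, Any]:
    """Compute word count, mean word probability, and duration for one segment."""
    words = segment.get("words") or []
    probabilities = [w["probability"] for w in words]     # [.words[].probability]
    average = sum(probabilities) / len(probabilities) if probabilities else None
    return {
        "word_count": len(words),                          # .words | length
        "average_confidence": average,                     # add / length
        "duration": segment["end"] - segment["start"],     # end_time - start_time
    }


# Hypothetical segment; values are invented.
segment = {
    "text": "Hello there.",
    "start": 1.0,
    "end": 3.0,
    "words": [{"probability": 0.75}, {"probability": 0.25}],
}

metrics = segment_metrics(segment)
empty_metrics = segment_metrics({"start": 0.0, "end": 0.0, "words": []})
print(metrics["word_count"])
```

If a transcript can contain segments with an empty `words` array, it is worth deciding explicitly how the average should behave for them, as the guard above does.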

## Aggregating Speaker Statistics

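This cell creates a summary DataFrame that aggregates statistics for each speaker: total words spoken, total speaking time (the sum of word durations, so pauses between words are not counted), average confidence, first and last speaking times, and word rate in words per minute.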

It demonstrates hybrid processing by combining JSON-extracted data with traditional DataFrame analytics.

In [None]:
# 3. Create Speaker Summary DataFrame (Aggregated)
print("\n🎤 Creating Speaker Summary DataFrame...")

# Use traditional DataFrame aggregations on JSON-extracted data
# This demonstrates hybrid processing: JSON extraction + DataFrame analytics
speaker_summary_df = words_final_df.group_by("speaker").agg(
    fc.count("*").alias("total_words"),                    # Count words per speaker
    fc.avg("probability").alias("average_confidence"),     # Average speech confidence
    fc.min("start_time").alias("first_speaking_time"),     # When speaker first appears
    fc.max("end_time").alias("last_speaking_time"),        # When speaker last appears
    fc.sum("duration").alias("total_speaking_time")        # Total time speaking
).select(
    "speaker",
    "total_words", 
    "total_speaking_time",
    "average_confidence",
    "first_speaking_time",
    "last_speaking_time",
    # Calculate derived metric: words per minute
    (fc.col("total_words") / (fc.col("total_speaking_time") / 60.0)).alias("word_rate")
)

print("\n📊 Speaker Summary DataFrame:")
speaker_summary_df.show()
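
The aggregation can be sketched in plain Python to show what each output column holds. The function `summarize_speakers` and its sample rows are hypothetical, and times are assumed to be in seconds. The guard that returns `None` when a speaker's total speaking time is zero is an addition of this sketch; the notebook's expression divides directly.

```python
from collections import defaultdict
from typing import Any, Dict, List


def summarize_speakers(words: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group word rows by speaker and compute per-speaker statistics."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for w in words:
        groups[w["speaker"]].append(w)

    summary: Dict[str, Dict[str, Any]] = {}
    for speaker, rows in groups.items():
        # Sum of word durations: silence between words is not included.
        total_time = sum(r["end_time"] - r["start_time"] for r in rows)
        summary[speaker] = {
            "total_words": len(rows),
            "total_speaking_time": total_time,
            "average_confidence": sum(r["probability"] for r in rows) / len(rows),
            "first_speaking_time": min(r["start_time"] for r in rows),
            "last_speaking_time": max(r["end_time"] for r in rows),
            # Words per minute; None when there is no measurable speaking time.
            "word_rate": len(rows) / (total_time / 60.0) if total_time > 0 else None,
        }
    return summary


# Hypothetical word rows; values are invented.
word_rows = [
    {"speaker": "A", "start_time": 0.0, "end_time": 0.5, "probability": 0.5},
    {"speaker": "A", "start_time": 0.5, "end_time": 1.0, "probability": 1.0},
    {"speaker": "B", "start_time": 2.0, "end_time": 2.0, "probability": 0.25},
]

summary = summarize_speakers(word_rows)
print(sorted(summary))
```

Because the rate is computed from summed word durations rather than from the span between first and last speaking times, it measures articulation speed while speaking, not a speaker's share of the conversation.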

## Pipeline Summary and Key Features

This cell summarizes the entire JSON processing pipeline, highlighting the main inputs, outputs, and key features demonstrated in the notebook. 

It reviews the creation of structured DataFrames for words, segments, and speakers, and lists the core fenic features used throughout the workflow.


In [None]:
# Summary of what we accomplished
print("\n🎯 JSON Processing Pipeline Summary:")
print("=" * 60)
print("📁 Input: Single JSON file (whisper-transcript.json)\n")
print("📊 Output: 3 structured DataFrames")
print()
print("1. 🔤 Words DataFrame:")
print(f"   - {words_final_df.count()} individual words extracted")
print("   - Fields: word_text, speaker, timing, probability, duration")
print("   - Demonstrates: JQ nested array extraction, type casting")
print()
print("2. 📝 Segments DataFrame:")
print(f"   - {segments_clean_df.count()} conversation segments")
print("   - Fields: text, timing, word_count, average_confidence")
print("   - Demonstrates: JQ aggregations, array operations")
print()
print("3. 🎤 Speaker Summary DataFrame:")
print(f"   - {speaker_summary_df.count()} speakers analyzed")
print("   - Fields: totals, averages, speaking patterns, word rates")
print("   - Demonstrates: DataFrame aggregations on JSON-extracted data")
print()
print("🔧 Key fenic JSON Features Used:")
print("   ✓ JSON type casting from strings")
print("   ✓ JQ queries for complex nested extraction")
print("   ✓ Array operations and aggregations")
print("   ✓ Type conversion and calculated fields")
print("   ✓ Traditional DataFrame operations on JSON data")

# Clean up
session.stop()