# Markdown Processing with fenic

This example demonstrates how to use fenic to process, analyze, and extract structured information from an academic paper written in Markdown format.

The workflow covers the full pipeline: loading the Markdown document, generating a table of contents, extracting document sections, and parsing the References section into structured records using both Markdown-specific and JSON-based techniques.

**Key steps include**:
- Loading and casting the Markdown document for analysis.
- Generating a table of contents and extracting document sections.
- Filtering and parsing the References section to extract individual citations.
- Using both text-based and JSON-based approaches to structure and analyze reference data.

This notebook provides a practical example of how fenic can be used to transform unstructured Markdown documents into structured, queryable data for further analysis.

## Setting Up the fenic Session

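This cell configures and initializes a fenic session with semantic capabilities. The configuration registers the OpenAI model gpt-4o-mini under the alias "mini", with rate limits of 500 requests per minute (rpm) and 200,000 tokens per minute (tpm), so that a language model is available for advanced Markdown document analysis and extraction tasks.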

In [2]:
from pathlib import Path

import fenic as fc

config = fc.SessionConfig(
    app_name="markdown_processing",
    semantic=fc.SemanticConfig(
        language_models={
            "mini": fc.OpenAIModelConfig(
                model_name="gpt-4o-mini",
                rpm=500,
                tpm=200_000
            )
        }
    )
)

# Initialize fenic session
session = fc.Session.get_or_create(config)

## Loading and Preparing the Markdown Document

This cell loads the academic paper from a Markdown file, creates a fenic DataFrame containing the document, and casts the content to a Markdown-specific type to enable further analysis and extraction.

In [None]:
# Load the academic paper markdown content from file
paper_path = Path("attention_is_all_you_need.md")
paper_content = paper_path.read_text(encoding="utf-8")

# Create DataFrame with the paper content as a single row
df = session.create_dataframe({
    "paper_title": ["Attention Is All You Need"],
    "content": [paper_content]
})

# Cast content to MarkdownType to enable markdown-specific functions
df = df.select(
    fc.col("paper_title"),
    fc.col("content").cast(fc.MarkdownType).alias("markdown")
)

print("=== PAPER LOADED ===")
result = df.select(fc.col('paper_title')).to_polars()
print(f"Paper: {result['paper_title'][0]}")
print()

## Generating a Table of Contents

This cell uses fenic’s markdown functions to automatically generate a table of contents from the loaded academic paper, providing an overview of the document’s structure.

In [None]:
# 1. Generate Table of Contents using markdown.generate_toc()
toc_df = df.select(
    fc.col("paper_title"),
    fc.markdown.generate_toc(fc.col("markdown")).alias("toc")
)

toc_df.show()
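
To make the idea behind the table-of-contents step concrete, here is a minimal plain-Python sketch that builds an indented outline from ATX headings (lines that start with `#`). It is not fenic's implementation, and the output format of `fc.markdown.generate_toc` may differ; the function name `build_toc` and the sample text are illustrative assumptions.

```python
import re

# Matches ATX headings such as "## 1 Introduction" (1 to 6 leading hashes).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def build_toc(markdown: str) -> str:
    """Return an indented bullet outline of the headings in `markdown`.

    Hypothetical helper for illustration only; not part of fenic.
    """
    toc_lines = []
    in_fence = False
    for line in markdown.splitlines():
        # Ignore "# comment" lines inside fenced code blocks.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            toc_lines.append("  " * (level - 1) + "- " + match.group(2))
    return "\n".join(toc_lines)


sample = (
    "# Attention Is All You Need\n\n"
    "## 1 Introduction\n\nText.\n\n"
    "## 2 Background\n\n"
    "### 2.1 Details\n"
)
toc = build_toc(sample)
print(toc)
```

Skipping fenced code is the one design choice worth noting: without it, a Python comment inside a code sample would be mistaken for a level 1 heading.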


## Extracting Document Sections

This cell extracts all sections of the academic paper up to level 2 headers and converts them into a structured DataFrame. 

This enables further analysis and querying of individual document sections.

In [None]:
# 2. Extract all document sections and convert to structured DataFrame
sections_df = df.select(
    fc.col("paper_title"),
    fc.markdown.generate_toc(fc.col("markdown")).alias("toc"),
    # Extract sections up to level 2 headers, returning array of section objects
    fc.markdown.extract_header_chunks(fc.col("markdown"), header_level=2).alias("sections")
).explode("sections").unnest("sections")  # Convert array to rows and flatten struct

sections_df.show()
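
The sketch below illustrates header-based chunking in plain Python under one assumed reading: a new chunk starts at each heading of the requested level and runs until the next heading of that level or a shallower one. How fenic treats shallower headings is not stated in this notebook, so treat `split_by_header` and its two output fields (`heading`, `content`, matching the column names used in the later cells) as illustrative assumptions rather than the library's behavior.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_header(markdown: str, header_level: int) -> List[Dict[str, str]]:
    """Split `markdown` into chunks that start at headings of `header_level`.

    Hypothetical helper for illustration only; not part of fenic.
    """
    chunks: List[Dict[str, str]] = []
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        level = len(match.group(1)) if match else None
        if match and level == header_level:
            # A heading at the target level opens a new chunk.
            current = {"heading": match.group(2), "content": ""}
            chunks.append(current)
        elif match and level < header_level:
            # A shallower heading closes the current chunk.
            current = None
        elif current is not None:
            # Body text and deeper headings stay inside the chunk.
            current["content"] += line + "\n"
    for chunk in chunks:
        chunk["content"] = chunk["content"].strip()
    return chunks


sample = """# Paper

Intro text.

## 1 Introduction

First section.

### 1.1 Detail

Nested text.

## References

[1] Placeholder citation.
"""

sections = split_by_header(sample, header_level=2)
for section in sections:
    print(section["heading"], "->", len(section["content"]), "characters")
```

Each dictionary plays the role of one row after `explode` and `unnest`: one section per row, with its fields as columns.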

## Parsing the References Section

This cell filters for the References section of the academic paper and splits its content to extract individual citations, enabling further analysis of the bibliography.

In [None]:
# 3. Filter for specific section (References) and parse its content
references_df = sections_df.filter(
    fc.col("heading").contains("References")
)

# Split references content on [1], [2], etc. patterns to separate individual citations
references_df.select(
    fc.text.split(fc.col("content"), r"\[\d+\]").alias("references")
).explode("references").show()
print()
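
Two properties of splitting on a `[n]` marker are easy to miss, so here is a small illustration using Python's standard `re` module with the same pattern. It shows Python's behavior only; whether fenic's regex engine handles empty leading elements or lookahead the same way is not covered by this notebook. The citation strings are placeholders.

```python
import re

references = (
    "[1] Placeholder citation one.\n"
    "[2] Placeholder citation two.\n"
    "[3] Placeholder citation three."
)

# Splitting on the marker consumes it: the numbers are dropped, and the text
# before the first marker becomes an empty leading element.
parts = re.split(r"\[\d+\]", references)
print(parts[0] == "")

# Strip whitespace and drop empty pieces to get one clean entry per citation.
cleaned = [part.strip() for part in parts if part.strip()]
print(cleaned)

# A zero-width lookahead splits *before* each marker, so the number stays
# attached to its citation (splitting on empty matches needs Python 3.7+).
numbered = [
    part.strip()
    for part in re.split(r"(?=\[\d+\])", references)
    if part.strip()
]
print(numbered)
```

A pattern this simple will also split on any bracketed number that appears inside a citation body, so it works best on reference lists where `[n]` occurs only as the entry marker.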

## Extracting References Using JSON and jq

This cell converts the Markdown document to a JSON structure and uses jq queries to extract individual references from the References section.

This approach enables precise parsing and structuring of citation data for further analysis.

In [None]:
# 4. Extract references using JSON + jq approach
# Convert the original document to JSON structure
document_json_df = df.select(
    fc.col("paper_title"),
    fc.markdown.to_json(fc.col("markdown")).alias("document_json")
)

# Extract individual references using pure jq
# References are nested under "7 Conclusion" -> "References" heading
individual_refs_df = document_json_df.select(
    fc.col("paper_title"),
    fc.json.jq(
        fc.col("document_json"),
        # Navigate to References section and split text into individual citations
        '.children[-1].children[] | select(.type == "heading" and (.content[0].text == "References")) | .children[0].content[0].text | split("\\n") | .[]'
    ).alias("reference_text")
).explode("reference_text").select(
    fc.col("paper_title"),
    fc.col("reference_text").cast(fc.StringType).alias("reference_text")
).filter(
    fc.col("reference_text") != ""
)

individual_refs_df.show()
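
The jq expression above is dense, so the following sketch walks the same path step by step in plain Python. The nested dictionary is a hypothetical, heavily reduced stand-in whose shape is inferred only from the jq query (`type`, `content[0].text`, `children`); the JSON produced by `fc.markdown.to_json` will contain more fields and may differ in detail. The function name and citation strings are placeholders.

```python
from typing import Any, Dict, List

# Hypothetical miniature of the document tree, inferred from the jq path.
document: Dict[str, Any] = {
    "type": "document",
    "children": [
        {"type": "heading", "content": [{"text": "1 Introduction"}], "children": []},
        {
            "type": "heading",
            "content": [{"text": "7 Conclusion"}],
            "children": [
                {"type": "paragraph", "content": [{"text": "Closing remarks."}]},
                {
                    "type": "heading",
                    "content": [{"text": "References"}],
                    "children": [
                        {
                            "type": "paragraph",
                            "content": [
                                {"text": "[1] Placeholder citation one.\n[2] Placeholder citation two."}
                            ],
                        }
                    ],
                },
            ],
        },
    ],
}


def extract_reference_lines(doc: Dict[str, Any]) -> List[str]:
    """Mirror the jq query: last top-level node -> References -> text lines."""
    last_section = doc["children"][-1]                # .children[-1]
    lines: List[str] = []
    for node in last_section["children"]:             # .children[]
        # select(.type == "heading" and .content[0].text == "References")
        if node.get("type") != "heading":
            continue
        if node["content"][0]["text"] != "References":
            continue
        text = node["children"][0]["content"][0]["text"]
        lines.extend(text.split("\n"))                # split("\n") | .[]
    return lines


reference_lines = extract_reference_lines(document)
print(reference_lines)
```

Note that the path is positional: `.children[-1]` only works because References sits under the last top-level section, so a document with a different layout would need a different query.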

## Extracting Reference Numbers and Content

This cell uses a text extraction template to separate reference numbers from citation content in the References section, producing a structured DataFrame of individual citations for further analysis.

In [None]:
# Extract reference number and content using text.extract() with template
print("Extracting reference numbers and content using text.extract():")
parsed_refs_df = individual_refs_df.select(
    fc.col("paper_title"),
    fc.text.extract(
        fc.col("reference_text"),
        "[${ref_number:none}] ${content:none}"
    ).alias("parsed_ref")
).select(
    fc.col("paper_title"),
    fc.col("parsed_ref").get_item("ref_number").alias("reference_number"),
    fc.col("parsed_ref").get_item("content").alias("citation_content")
)

print("References with separated numbers and content:")
parsed_refs_df.show()
print()

# Clean up session resources
session.stop()
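
The extraction template used above, `[${ref_number:none}] ${content:none}`, reads as: a literal `[`, a captured number, a literal `]` and space, then the rest of the line. A rough regular-expression equivalent in plain Python is sketched below. It illustrates the pattern only and is not how fenic implements `text.extract`; in particular, returning `None` for non-matching input is a choice made for this sketch, and `parse_reference` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Literal "[", the reference number, literal "] ", then the citation text.
REF_PATTERN = re.compile(r"\[(?P<ref_number>[^\]]+)\] (?P<content>.*)", re.DOTALL)


def parse_reference(text: str) -> Dict[str, Optional[str]]:
    """Split one reference line into its number and citation content.

    Hypothetical helper for illustration only; not part of fenic.
    """
    match = REF_PATTERN.fullmatch(text.strip())
    if match is None:
        # Assumption for this sketch: unparseable lines yield empty fields.
        return {"ref_number": None, "content": None}
    return match.groupdict()


parsed = parse_reference("[12] Placeholder Author. Placeholder title.")
print(parsed["ref_number"], "|", parsed["content"])

unparsed = parse_reference("line without a marker")
print(unparsed)
```

The captured number is a string, as is typical for template or regex captures; cast it to an integer if numeric sorting or joins are needed.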