# Extract JSON - Parse JSON from Text and Markdown

`extract_json()` provides robust JSON extraction from strings and markdown:

**Core Features:**
- **Direct Parsing**: Standard JSON string parsing with orjson
- **Markdown Extraction**: Finds JSON in \`\`\`json code blocks
- **Fuzzy Parsing**: Tolerates malformed JSON (single-quoted strings, unquoted keys, missing brackets, trailing commas)
- **Flexible Input**: Accepts string or list of strings
- **Smart Return**: A single match is returned on its own; multiple matches are returned as a list

**Common Use Cases:**
- Parsing LLM outputs with markdown formatting
- Extracting structured data from mixed text
- Handling imperfect JSON from user input

In [1]:
from lionherd_core.libs.string_handlers import extract_json

## 1. Direct JSON Parsing

The simplest case: parse a valid JSON string directly.

In [2]:
# Simple object
json_str = '{"name": "Alice", "age": 30}'
result = extract_json(json_str)
print(f"Result: {result}")
print(f"Type: {type(result).__name__}")

Result: {'name': 'Alice', 'age': 30}
Type: dict


In [3]:
# Array of objects
json_array = '[{"id": 1, "name": "Item 1"}, {"id": 2, "name": "Item 2"}]'
result = extract_json(json_array)
print(f"Result: {result}")
print(f"Length: {len(result)}")

Result: [{'id': 1, 'name': 'Item 1'}, {'id': 2, 'name': 'Item 2'}]
Length: 2


## 2. List Input - Join Multiple Strings

Pass a list of strings and `extract_json()` joins them with newlines before parsing.

In [4]:
# Split JSON across multiple lines
lines = ["{", '  "name": "Bob",', '  "email": "bob@example.com"', "}"]
result = extract_json(lines)
print(f"Joined and parsed: {result}")

Joined and parsed: {'name': 'Bob', 'email': 'bob@example.com'}
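
The join step can be pictured with the standard library alone. This is a minimal sketch, assuming the fragments are joined with newlines and using `json.loads` as a stand-in for the library's parser; `join_and_parse` is a hypothetical helper, not part of `lionherd_core`.

```python
import json


def join_and_parse(fragments):
    """Hypothetical helper: join string fragments with newlines, then parse."""
    # Accept a plain string as well, mirroring the "string or list" input.
    text = fragments if isinstance(fragments, str) else "\n".join(fragments)
    return json.loads(text)


lines = ["{", '  "name": "Bob",', '  "email": "bob@example.com"', "}"]
print(join_and_parse(lines))
```

Because JSON ignores whitespace between tokens, joining with newlines does not change the parsed value.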


## 3. Markdown Code Block Extraction

Extract JSON from markdown \`\`\`json code blocks - common in LLM outputs.

In [5]:
# Single code block
markdown_text = """
Here's the user data:

```json
{
  "user_id": 123,
  "username": "charlie",
  "active": true
}
```

Hope this helps!
"""

result = extract_json(markdown_text)
print(f"Extracted: {result}")
print(f"user_id: {result['user_id']}")

Extracted: {'user_id': 123, 'username': 'charlie', 'active': True}
user_id: 123


In [6]:
# Multiple code blocks - returns list
multi_block = """
First configuration:

```json
{"env": "dev", "debug": true}
```

Production config:

```json
{"env": "prod", "debug": false}
```
"""

result = extract_json(multi_block)
print(f"Extracted {len(result)} configs:")
for i, config in enumerate(result, 1):
    print(f"  {i}. {config}")

Extracted 2 configs:
  1. {'env': 'dev', 'debug': True}
  2. {'env': 'prod', 'debug': False}
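
Block extraction of this kind is commonly done with a regular expression. The sketch below is one plausible approach using only `re` and `json`; the pattern and the `find_json_blocks` helper are assumptions for illustration and may differ from what `extract_json()` does internally.

```python
import json
import re

# The fence is built programmatically so this example nests cleanly in markdown.
FENCE = "`" * 3
# Capture everything between an opening json fence and the next closing fence.
BLOCK_RE = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL)


def find_json_blocks(text):
    """Hypothetical helper: parse every fenced json block found in text."""
    return [json.loads(body) for body in BLOCK_RE.findall(text)]


doc = (
    "Dev config:\n"
    + FENCE + 'json\n{"env": "dev", "debug": true}\n' + FENCE + "\n"
    + "Prod config:\n"
    + FENCE + 'json\n{"env": "prod", "debug": false}\n' + FENCE + "\n"
)
print(find_json_blocks(doc))
```

The non-greedy `(.*?)` keeps each match inside its own block, which is what lets two blocks in one text come back as two separate results.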


## 4. Return Behavior Control

By default a single match is returned on its own. Pass `return_one_if_single=False` to always get a list, even when it holds one item.

In [7]:
single_block = '```json\n{"key": "value"}\n```'

# Default: return_one_if_single=True → dict
result_dict = extract_json(single_block)
print(f"Default (single): {result_dict} (type: {type(result_dict).__name__})")

# Force list even for single match
result_list = extract_json(single_block, return_one_if_single=False)
print(f"Force list: {result_list} (type: {type(result_list).__name__})")

Default (single): {'key': 'value'} (type: dict)
Force list: [{'key': 'value'}] (type: list)
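
The return rule shown above fits in a few lines. This is a sketch of the documented behavior only; `unwrap_single` is a hypothetical name, and the library's own implementation may be structured differently.

```python
def unwrap_single(matches, return_one_if_single=True):
    """Hypothetical helper mirroring the documented return rule."""
    # Exactly one match and unwrapping allowed: hand back the item itself.
    if return_one_if_single and len(matches) == 1:
        return matches[0]
    # Zero matches, several matches, or unwrapping disabled: keep the list.
    return matches


print(unwrap_single([{"key": "value"}]))
print(unwrap_single([{"key": "value"}], return_one_if_single=False))
print(unwrap_single([]))
```

With the default, the return type depends on the input, so code that iterates over results is usually safer with `return_one_if_single=False`.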


## 5. Fuzzy JSON Parsing

Set `fuzzy_parse=True` to handle common JSON errors: single quotes, unquoted keys, missing closing brackets, and trailing commas.

In [8]:
# Single quotes instead of double quotes
malformed = "{'name': 'Diana', 'score': 95}"

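# Without fuzzy parsing: strict parsing cannot read single-quoted JSON.
# The recorded output below shows no failure message, so this call appears
# to return an empty result rather than raise; the except branch is a safeguard.
try:
    extract_json(malformed)
except Exception:
    print("✗ Standard parsing failed (expected)")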

# With fuzzy parsing - succeeds
result = extract_json(malformed, fuzzy_parse=True)
print(f"✓ Fuzzy parsed: {result}")

✓ Fuzzy parsed: {'name': 'Diana', 'score': 95}


In [9]:
# Missing closing bracket
incomplete = '{"items": [1, 2, 3], "total": 3'

result = extract_json(incomplete, fuzzy_parse=True)
print(f"Fixed missing bracket: {result}")

Fixed missing bracket: {'items': [1, 2, 3], 'total': 3}


In [10]:
# Trailing commas
trailing_comma = '{"a": 1, "b": 2,}'

result = extract_json(trailing_comma, fuzzy_parse=True)
print(f"Removed trailing comma: {result}")

Removed trailing comma: {'a': 1, 'b': 2}


In [11]:
# Unquoted keys
unquoted = '{name: "Eve", role: "admin"}'

result = extract_json(unquoted, fuzzy_parse=True)
print(f"Added quotes to keys: {result}")

Added quotes to keys: {'name': 'Eve', 'role': 'admin'}
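
To make the four repairs concrete, here is a deliberately naive repair pass built on `re` and `json`. It is a sketch under stated assumptions, not the algorithm `extract_json()` uses: `repair_json` is a hypothetical helper, and its quote and comma rewrites ignore string boundaries, so apostrophes or `,}` sequences inside values would be mangled.

```python
import json
import re


def repair_json(text):
    """Hypothetical, deliberately naive repair pass (not the library's algorithm)."""
    fixed = text.strip()
    # 1. Single quotes -> double quotes (naive: breaks on apostrophes inside values).
    fixed = fixed.replace("'", '"')
    # 2. Quote bare keys such as {name: ...} or , role: ...
    fixed = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', fixed)
    # 3. Track unclosed brackets outside of strings and append the missing closers.
    closers = []
    in_string = False
    escaped = False
    for ch in fixed:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    fixed += "".join(reversed(closers))
    # 4. Drop trailing commas that sit directly before a closing bracket.
    fixed = re.sub(r",\s*([}\]])", r"\1", fixed)
    return json.loads(fixed)


print(repair_json("{'name': 'Diana', 'score': 95}"))
print(repair_json('{"items": [1, 2, 3], "total": 3'))
print(repair_json('{name: "Eve", role: "admin",}'))
```

Closers are appended before trailing commas are stripped so that input cut off right after a comma is also repaired.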


## 6. Edge Cases

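When no JSON is found, `extract_json()` returns an empty list rather than raising. The cells below cover plain text, markdown without a `json` block, and input that even fuzzy parsing cannot repair.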

In [12]:
# No JSON at all - returns empty list
no_json = "This is just plain text with no JSON."
result = extract_json(no_json)
print(f"No JSON found: {result}")

No JSON found: []


In [13]:
# Markdown without json blocks - empty list
no_json_blocks = """
Some text here.

```python
print("Not JSON")
```
"""
result = extract_json(no_json_blocks)
print(f"No JSON blocks: {result}")

No JSON blocks: []


In [14]:
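# Input that even fuzzy parsing cannot repair.
# No output was recorded for this cell, so the ValueError branch may not fire;
# the call may instead return an empty list.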
completely_broken = '{"key": [broken broken broken]}'

try:
    extract_json(completely_broken, fuzzy_parse=True)
except ValueError as e:
    print(f"✗ Cannot parse: {e}")

## 7. Real-World Example - LLM Output Parsing

Typical scenario: parsing structured data from an LLM response.

In [15]:
# Simulated LLM response with markdown and explanation
llm_response = """
I've analyzed the data and here are the findings:

```json
{
  "total_records": 1500,
  "valid_records": 1423,
  "errors": [
    {"type": "missing_field", "count": 45},
    {"type": "invalid_format", "count": 32}
  ],
  "success_rate": 0.948
}
```

The overall data quality looks good with 94.8% valid records.
"""

analysis = extract_json(llm_response)
print(f"Total records: {analysis['total_records']}")
print(f"Success rate: {analysis['success_rate']:.1%}")
print("\nErrors:")
for error in analysis["errors"]:
    print(f"  - {error['type']}: {error['count']}")

Total records: 1500
Success rate: 94.8%

Errors:
  - missing_field: 45
  - invalid_format: 32


## 8. Combining Features

Use multiple features together for robust extraction.

In [16]:
# List input + markdown + fuzzy parsing
mixed_input = [
    "Here's the config:",
    "",
    "```json",
    "{name: 'production', timeout: 30,}",  # Unquoted keys + trailing comma
    "```",
]

config = extract_json(mixed_input, fuzzy_parse=True)
print(f"Extracted config: {config}")
print(f"Timeout: {config['timeout']} seconds")

Extracted config: {'name': 'production', 'timeout': 30}
Timeout: 30 seconds


In [17]:
# Multiple blocks with mixed validity
mixed_blocks = """
Valid config:
```json
{"env": "staging", "port": 8080}
```

Malformed config (will be skipped):
```json
{broken: invalid: json:
```

Another valid one:
```json
{'db': 'postgres', 'host': 'localhost'}
```
"""

# With fuzzy parsing, extracts valid ones and skips broken
configs = extract_json(mixed_blocks, fuzzy_parse=True, return_one_if_single=False)
print(f"Extracted {len(configs)} valid configs:")
for cfg in configs:
    print(f"  {cfg}")

Extracted 2 valid configs:
  {'env': 'staging', 'port': 8080}
  {'db': 'postgres', 'host': 'localhost'}
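
Skipping broken blocks comes down to parsing each candidate independently and discarding the failures. A minimal sketch of that pattern, assuming the candidates have already been pulled out of the text and using strict `json.loads`; `parse_valid` is a hypothetical helper.

```python
import json


def parse_valid(candidates):
    """Hypothetical helper: keep what parses, skip what does not."""
    parsed = []
    for raw in candidates:
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip the broken candidate instead of failing the whole batch.
            continue
    return parsed


blocks = [
    '{"env": "staging", "port": 8080}',
    "{broken: invalid: json:",
    '{"db": "postgres", "host": "localhost"}',
]
print(parse_valid(blocks))
```

The trade-off is that a dropped block leaves no trace, so comparing the number of results with the number of blocks you expected is a cheap sanity check.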


## 9. Performance Characteristics

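`extract_json()` tries direct parsing first, then markdown block extraction, and falls back to fuzzy parsing only when it is enabled, so the cost grows with each fallback. The timings below come from a single run and will vary by machine.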

In [18]:
# Parsing attempts in order:
# 1. Direct orjson parsing (fastest)
# 2. Markdown code block extraction
# 3. Fuzzy parsing (if enabled)

import time

# Direct parsing is fastest
direct = '{"key": "value"}'
start = time.perf_counter()
for _ in range(10000):
    extract_json(direct)
direct_time = time.perf_counter() - start
print(f"Direct parsing: {direct_time * 1000:.2f}ms for 10k iterations")

# Markdown extraction is slower
markdown = '```json\n{"key": "value"}\n```'
start = time.perf_counter()
for _ in range(10000):
    extract_json(markdown)
markdown_time = time.perf_counter() - start
print(f"Markdown extraction: {markdown_time * 1000:.2f}ms for 10k iterations")

# Fuzzy parsing is slowest
fuzzy = '{name: "value"}'
start = time.perf_counter()
for _ in range(10000):
    extract_json(fuzzy, fuzzy_parse=True)
fuzzy_time = time.perf_counter() - start
print(f"Fuzzy parsing: {fuzzy_time * 1000:.2f}ms for 10k iterations")

Direct parsing: 3.44ms for 10k iterations
Markdown extraction: 18.52ms for 10k iterations
Fuzzy parsing: 45.00ms for 10k iterations
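
Single-pass loops timed with `perf_counter` are sensitive to background noise. For steadier numbers, the standard `timeit` module can repeat the measurement and keep the best run. The sketch below uses `json.loads` as a stand-in so it runs without the library; `bench` is a hypothetical helper, and `extract_json` could be passed in its place.

```python
import json
import timeit


def bench(func, *args, number=10_000, repeat=5):
    """Hypothetical helper: best-of-`repeat` time in ms for `number` calls."""
    timer = timeit.Timer(lambda: func(*args))
    # min() of the repeats is the run least disturbed by other processes.
    return min(timer.repeat(repeat=repeat, number=number)) * 1000


ms = bench(json.loads, '{"key": "value"}')
print(f"json.loads: {ms:.2f}ms for 10k iterations")
```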


## Summary Checklist

**Extract JSON Essentials:**
- ✅ Direct JSON parsing with orjson (fast, strict)
- ✅ Markdown code block extraction (\`\`\`json pattern)
- ✅ Fuzzy parsing for malformed JSON (quotes, brackets, commas)
- ✅ List input support (joins with newlines)
- ✅ Smart return: a single match on its own, or a list when there are several
- ✅ Returns empty list when no JSON found
- ✅ Handles multiple code blocks in one text

**Best Practices:**
- Use direct parsing when JSON is guaranteed valid
- Enable fuzzy parsing for user input or LLM outputs
- Set `return_one_if_single=False` when you always want a list
- Combine with markdown extraction for LLM responses

**Performance Tips:**
- Fastest to slowest: direct parsing, markdown extraction, fuzzy parsing
- Fuzzy parsing tries multiple strategies (expensive)
- Regex compilation is cached for markdown extraction

**Common Patterns:**
```python
# LLM output
data = extract_json(llm_response, fuzzy_parse=True)

# API response with markdown
items = extract_json(response_text, return_one_if_single=False)

# Joined log lines
parsed = extract_json(log_lines, fuzzy_parse=True)
```