[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/04_Document_Parsing.ipynb)

# Document Parsing

## Overview

This notebook demonstrates how to parse documents with Semantica's parsing modules. The examples extract text, metadata, and structured data from plain-text, CSV, JSON, XML, and HTML files. Semantica's PDF and DOCX parsers are listed in the objectives but are not exercised in the cells of this notebook.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/parse/)

### Learning Objectives

- Use `DocumentParser` for general document parsing
- Use format-specific parsers: `PDFParser`, `DOCXParser`, `CSVParser`, `JSONParser`, `XMLParser`, `HTMLParser`
- Extract text content and metadata from documents
- Parse structured data formats

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install "semantica[all]"
```

---

## Step 1: Document Parser

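`DocumentParser` is the general-purpose entry point. After installing the package, the code below writes a small text file and calls `extract_text` and `extract_metadata` on it.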


In [None]:
!pip install semantica


In [None]:
from semantica.parse import DocumentParser
import tempfile
import os

document_parser = DocumentParser()

temp_dir = tempfile.mkdtemp()
sample_txt = os.path.join(temp_dir, "sample.txt")

with open(sample_txt, 'w') as f:
    f.write("Apple Inc. is a technology company. Tim Cook is the CEO.")

text = document_parser.extract_text(sample_txt)
metadata = document_parser.extract_metadata(sample_txt)

text[:50], metadata
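
The notebook does not show what `extract_metadata` returns. As a rough point of comparison, the sketch below gathers file-level metadata using only the standard library. The helper `basic_file_metadata` and its dictionary keys are hypothetical; they are not Semantica's API or output schema.

```python
import mimetypes
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def basic_file_metadata(path):
    """Collect file-level metadata with the standard library (hypothetical helper)."""
    p = Path(path)
    stat = p.stat()
    mime_type, _ = mimetypes.guess_type(p.name)
    return {
        "file_name": p.name,
        "extension": p.suffix.lower(),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "mime_type": mime_type,  # None when the extension is not recognized
    }


content = "Apple Inc. is a technology company. Tim Cook is the CEO."

# Write the same sample text used in Step 1, then inspect the file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(content, encoding="utf-8")
    info = basic_file_metadata(sample)

print(info)
```

Comparing this dictionary with the output of `extract_metadata` is one way to see which fields the parser adds beyond what the file system already provides.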


## Step 2: CSV Parser

`CSVParser` reads a CSV file into headers and rows. The example writes a three-column file and prints what was parsed.


In [None]:
from semantica.parse import CSVParser

csv_parser = CSVParser()
csv_file = os.path.join(temp_dir, "data.csv")

with open(csv_file, 'w') as f:
    f.write("name,company,role\n")
    f.write("Tim Cook,Apple Inc.,CEO\n")
    f.write("Satya Nadella,Microsoft Corporation,CEO\n")

csv_data = csv_parser.parse(csv_file)

print(f"Parsed CSV with {len(csv_data.rows)} rows")
print(f"Columns: {csv_data.headers}")
for row in csv_data.rows[:2]:
    print(f"  {row}")
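
To sanity-check the parsed result, the same data can be read with Python's built-in `csv` module. This is an independent cross-check, not Semantica's implementation. With `csv.DictReader`, the header line becomes the field names and is not counted as a row. Whether `csv_data.rows` follows the same convention is an assumption worth verifying against the output above.

```python
import csv
import io

raw = (
    "name,company,role\n"
    "Tim Cook,Apple Inc.,CEO\n"
    "Satya Nadella,Microsoft Corporation,CEO\n"
)

# DictReader consumes the first line as headers and yields one dict per data row.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
headers = reader.fieldnames

print(f"Columns: {headers}")
print(f"Data rows: {len(rows)}")
for row in rows:
    print(f"  {row['name']} -> {row['company']}")
```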


## Step 3: JSON Parser

`JSONParser` loads a JSON file and exposes the decoded object through the result's `data` attribute.


In [None]:
from semantica.parse import JSONParser
import json

json_parser = JSONParser()
json_file = os.path.join(temp_dir, "data.json")

data = {
    "companies": [
        {"name": "Apple Inc.", "ceo": "Tim Cook"},
        {"name": "Microsoft Corporation", "ceo": "Satya Nadella"}
    ]
}

with open(json_file, 'w') as f:
    json.dump(data, f)

json_data = json_parser.parse(json_file)

print(f"Parsed JSON: {json_data.data}")
print(f"Companies: {len(json_data.data.get('companies', []))}")
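
Once the JSON is decoded, the result is an ordinary nested structure of dictionaries and lists. The sketch below uses only the standard `json` module to show the defensive access pattern from the cell above (`.get('companies', [])`) and how to reshape the records. The name `ceo_by_company` is illustrative.

```python
import json

raw = json.dumps({
    "companies": [
        {"name": "Apple Inc.", "ceo": "Tim Cook"},
        {"name": "Microsoft Corporation", "ceo": "Satya Nadella"},
    ]
})

data = json.loads(raw)

# .get() with a default avoids a KeyError when the top-level key is absent.
companies = data.get("companies", [])
missing = data.get("products", [])

# Reshape the list of records into a lookup table keyed by company name.
ceo_by_company = {c["name"]: c.get("ceo") for c in companies}

print(len(companies), len(missing))
print(ceo_by_company)
```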


## Step 4: XML Parser

`XMLParser` parses an XML file into an element tree that is reachable through the result's `root` attribute.


In [None]:
from semantica.parse import XMLParser

xml_parser = XMLParser()
xml_file = os.path.join(temp_dir, "data.xml")

xml_content = """<?xml version="1.0"?>
<companies>
    <company name="Apple Inc." ceo="Tim Cook"/>
    <company name="Microsoft Corporation" ceo="Satya Nadella"/>
</companies>"""

with open(xml_file, 'w') as f:
    f.write(xml_content)

xml_data = xml_parser.parse(xml_file)

# Check for a missing root once, before touching its attributes.
root = xml_data.root
print(f"Root element: {root.tag if root is not None else 'None'}")
print(f"Parsed XML with {len(root.children) if root is not None else 0} elements")
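
For comparison, the same document can be parsed with the standard library's `xml.etree.ElementTree`. This is a cross-check rather than a description of how `XMLParser` works internally. Note that in this sample the data lives in attributes, not in element text.

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0"?>
<companies>
    <company name="Apple Inc." ceo="Tim Cook"/>
    <company name="Microsoft Corporation" ceo="Satya Nadella"/>
</companies>"""

root = ET.fromstring(xml_content)

# Iterating an Element yields its direct children; len() counts them.
records = [child.attrib for child in root]

print(f"Root element: {root.tag}")
print(f"Child elements: {len(root)}")
for record in records:
    print(f"  {record['name']}: {record['ceo']}")
```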


## Step 5: HTML Parser

`HTMLParser` extracts page metadata, such as the title, along with the page's text content.


In [None]:
from semantica.parse import HTMLParser

html_parser = HTMLParser()
html_file = os.path.join(temp_dir, "page.html")

html_content = """<html>
<head><title>Sample Page</title></head>
<body>
    <h1>Technology Companies</h1>
    <p>Apple Inc. is a technology company.</p>
</body>
</html>"""

with open(html_file, 'w') as f:
    f.write(html_content)

html_data = html_parser.parse(html_file)

print("Parsed HTML")
print(f"Title: {html_data.metadata.get('title', 'N/A')}")
print(f"Text content: {html_data.text[:50]}...")
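
The split between title metadata and body text can be illustrated with the standard library's `html.parser`. The class below is a minimal sketch, not Semantica's implementation. The name `TitleAndTextExtractor` is hypothetical, and the stdlib class is imported under an alias because it shares the name `HTMLParser` with Semantica's parser.

```python
from html.parser import HTMLParser as StdlibHTMLParser


class TitleAndTextExtractor(StdlibHTMLParser):
    """Collect the <title> separately from all other text nodes."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return  # skip whitespace between tags
        if self._in_title:
            self.title = stripped
        else:
            self.text_parts.append(stripped)


html_content = """<html>
<head><title>Sample Page</title></head>
<body>
    <h1>Technology Companies</h1>
    <p>Apple Inc. is a technology company.</p>
</body>
</html>"""

extractor = TitleAndTextExtractor()
extractor.feed(html_content)
extractor.close()
text = " ".join(extractor.text_parts)

print(f"Title: {extractor.title}")
print(f"Text: {text}")
```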


## Step 6: Structured Data Parser

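`StructuredDataParser` handles several structured formats through a single `parse_data` call, with the format selected by the `data_format` argument. The cell reuses the JSON and CSV files created in Steps 2 and 3.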


In [None]:
from semantica.parse import StructuredDataParser

structured_parser = StructuredDataParser()

parsed_json = structured_parser.parse_data(json_file, data_format="json")
parsed_csv = structured_parser.parse_data(csv_file, data_format="csv")

print(f"Structured parser parsed JSON: {len(parsed_json.get('data', {}).get('companies', []))} companies")
print(f"Structured parser parsed CSV: {len(parsed_csv.get('rows', []))} rows")
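
A multi-format parser of this kind is commonly built as a dispatch table that maps a format name to a parsing function. The sketch below shows that pattern with the standard library only. It is an assumption about the general design, not Semantica's code: `parse_structured`, `PARSERS`, and the helper functions are hypothetical, and the returned dictionaries merely mirror the `data` and `rows` keys read in the cell above.

```python
import csv
import json
import tempfile
from pathlib import Path


def parse_json_file(path):
    with open(path, encoding="utf-8") as f:
        return {"data": json.load(f)}


def parse_csv_file(path):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return {"headers": reader.fieldnames, "rows": rows}


# Dispatch table: format name -> parsing function.
PARSERS = {"json": parse_json_file, "csv": parse_csv_file}


def parse_structured(path, data_format=None):
    """Pick a parser by explicit format, falling back to the file extension."""
    fmt = (data_format or Path(path).suffix.lstrip(".")).lower()
    if fmt not in PARSERS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    return PARSERS[fmt](path)


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    csv_path = Path(tmp) / "data.csv"
    json_path.write_text(
        json.dumps({"companies": [{"name": "Apple Inc.", "ceo": "Tim Cook"}]}),
        encoding="utf-8",
    )
    csv_path.write_text("name,company,role\nTim Cook,Apple Inc.,CEO\n", encoding="utf-8")

    parsed_json = parse_structured(json_path, data_format="json")
    parsed_csv = parse_structured(csv_path)  # format inferred from ".csv"

print(len(parsed_json["data"]["companies"]), "companies")
print(len(parsed_csv["rows"]), "rows")
```

Keeping the table separate from the dispatch function means a new format can be supported by registering one more entry, without touching the callers.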


## Summary

You've learned how to parse various document formats:

- **DocumentParser**: General document parsing
- **CSVParser**: CSV file parsing
- **JSONParser**: JSON file parsing
- **XMLParser**: XML file parsing
- **HTMLParser**: HTML file parsing
- **StructuredDataParser**: Multi-format structured data parsing

Next: Learn how to normalize and clean data in the Data_Normalization notebook.
