# From Unstructured to Structured

## Overview

This notebook demonstrates how to transform raw, unstructured documents into structured data through parsing, normalization, and entity extraction.

### Learning Objectives

- Ingest documents in a variety of formats
- Parse documents to extract content
- Normalize text for processing
- Extract structured entities from text

---

## Workflow

**Raw Documents → Parsed → Normalized → Structured Data**

Each step transforms the data further toward a structured format suitable for knowledge graph construction.
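
The workflow above can be read as function composition: each stage takes the previous stage's output and returns something slightly more structured. The sketch below is a minimal illustration in plain Python. The stage functions (`parse`, `normalize`, `structure`) and `run_pipeline` are hypothetical stand-ins written for this example and are not part of the `semantica` API.

```python
import re
from functools import reduce
from typing import Any, Callable, List


def parse(raw: str) -> str:
    """Stand-in parser: drop lines that start with '#' (treated as headers)."""
    return "\n".join(line for line in raw.splitlines() if not line.startswith("#"))


def normalize(text: str) -> str:
    """Stand-in normalizer: collapse all whitespace runs to single spaces."""
    return " ".join(text.split())


def structure(text: str) -> dict:
    """Stand-in structuring step: keep the text and pull out 4-digit numbers."""
    return {"text": text, "years": re.findall(r"\b\d{4}\b", text)}


def run_pipeline(raw: Any, stages: List[Callable[[Any], Any]]) -> Any:
    """Feed the output of each stage into the next one, left to right."""
    return reduce(lambda data, stage: stage(data), stages, raw)


result = run_pipeline(
    "# title\nMicrosoft   was founded\nin 1975.",
    [parse, normalize, structure],
)
print(result)
```

Keeping each stage as an independent callable makes it straightforward to swap a stand-in for a real component later without touching the rest of the chain.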

---

## Step 1: Ingest Raw Documents

Start by ingesting documents from various sources. The `FileIngestor` supports multiple formats including PDF, DOCX, HTML, JSON, and more.
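
Supporting several formats usually starts with deciding which format a file is in. The following is a minimal sketch of extension-based detection using only `pathlib`. The `FORMAT_BY_SUFFIX` table and `detect_format` helper are assumptions made for illustration; how `FileIngestor` actually detects formats is not shown in this notebook.

```python
from pathlib import Path
from typing import Union

# Hypothetical extension-to-format table (illustrative, not FileIngestor's own).
FORMAT_BY_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".json": "json",
    ".txt": "text",
}


def detect_format(path: Union[str, Path]) -> str:
    """Return a format label based on the final file extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    return FORMAT_BY_SUFFIX.get(suffix, "unknown")


for name in ["report.PDF", "sample_document.txt", "archive.tar.gz"]:
    print(f"{name}: {detect_format(name)}")
```

Extension checks are cheap but easy to fool, so a production ingestor would likely also inspect file contents.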


In [None]:
from semantica.ingest import FileIngestor
from pathlib import Path

ingestor = FileIngestor()

sample_text = """
Microsoft Corporation is an American multinational technology company.
It was founded by Bill Gates and Paul Allen in 1975.
The company is headquartered in Redmond, Washington.
Satya Nadella is the current CEO of Microsoft.
Microsoft develops software, services, and hardware products.
"""

sample_file = Path("sample_document.txt")
sample_file.write_text(sample_text, encoding="utf-8")

print("Sample document created")
print(f"File: {sample_file}")

try:
    file_object = ingestor.ingest_file(sample_file, read_content=True)
    print("\n✓ File ingested successfully!")
    print(f"  File name: {file_object.name}")
    print(f"  File type: {file_object.file_type}")
except Exception as e:
    print(f"\n✗ Error ingesting file: {e}")


## Step 2: Parse Documents

Parse the ingested documents to extract structured content. The `DocumentParser` handles various file formats and extracts text, metadata, and structure.
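
To make "text, metadata, and structure" concrete, here is a small sketch of what a parse result for a plain-text file could look like. The `ParsedDocument` container and `parse_plain_text` function are hypothetical and exist only for this example; the actual return type of `DocumentParser.parse_document` may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParsedDocument:
    """Hypothetical parse result: full text, simple metadata, paragraph structure."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)
    paragraphs: List[str] = field(default_factory=list)


def parse_plain_text(raw: str, source: str = "<memory>") -> ParsedDocument:
    """Split plain text into paragraphs on blank lines and record basic metadata."""
    text = raw.strip()
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    metadata = {
        "source": source,
        "char_count": len(text),
        "line_count": len(text.splitlines()),
    }
    return ParsedDocument(text=text, metadata=metadata, paragraphs=paragraphs)


doc = parse_plain_text(
    "Title line\n\nFirst paragraph.\nStill first.\n\nSecond paragraph.\n",
    source="example.txt",
)
print(doc.metadata)
print(doc.paragraphs)
```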


In [None]:
from semantica.parse import DocumentParser

parser = DocumentParser()

try:
    parsed_content = parser.parse_document(str(sample_file))
    print("✓ Document parsed successfully!")
    print(f"  Parsed content length: {len(parsed_content) if parsed_content else 0} characters")
    print(f"  Preview: {parsed_content[:150] if parsed_content else 'N/A'}...")
except Exception as e:
    print(f"✗ Error parsing document: {e}")
    parsed_content = sample_text
    print("Using raw text as fallback")


## Step 3: Normalize Text

Normalize the parsed text to clean and standardize it for further processing. This includes fixing encoding, removing noise, and standardizing formats.
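
As a rough idea of what such cleanup can involve, the sketch below normalizes Unicode forms, straightens curly quotes, drops control characters, and collapses whitespace using only the standard library. The `normalize_text` function is an illustrative assumption and does not reflect how `TextNormalizer` is implemented; it also does not attempt to repair mis-decoded (mojibake) text.

```python
import re
import unicodedata

# Map typographic quotes to their ASCII equivalents.
QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_text(text: str) -> str:
    """Illustrative normalizer: Unicode NFKC, quote folding, noise and whitespace cleanup."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a space.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(QUOTE_TABLE)
    # Remove control characters (category Cc) except newline and tab.
    text = "".join(ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    # Collapse runs of spaces and tabs, trim each line, and drop empty lines.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


messy = "Microsoft\u00a0Corporation   was  founded\x00 in 1975.\n\n\n  \u201cRedmond\u201d  "
print(normalize_text(messy))
```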


In [None]:
from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()

try:
    normalized_content = normalizer.normalize(parsed_content)
    print("✓ Text normalized successfully!")
    print(f"  Normalized content length: {len(normalized_content) if normalized_content else 0} characters")
except Exception as e:
    print(f"✗ Error normalizing text: {e}")
    normalized_content = parsed_content
    print("Using parsed content as fallback")


## Step 4: Extract Entities

Extract entities from the normalized text. Each entity pairs a span of text with a type (for example, Person, Organization, or Location), which turns free text into records that can be used for knowledge graph construction.
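
The sketch below shows one possible shape for such structured entity data: a record holding the surface text, its type, and its character offsets in the source string. It uses a simple lookup of known names rather than a trained NER model, and the `Entity` class and `locate_entities` function are hypothetical helpers for this example, not the output format of `NERExtractor`.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record with character offsets into the source text."""
    text: str
    type: str
    start: int
    end: int


def locate_entities(text: str, candidates: List[Tuple[str, str]]) -> List[Entity]:
    """Find the first occurrence of each known (surface, type) pair in the text."""
    entities = []
    for surface, label in candidates:
        start = text.find(surface)
        if start != -1:  # skip candidates that do not appear in the text
            entities.append(Entity(surface, label, start, start + len(surface)))
    return sorted(entities, key=lambda e: e.start)


text = "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975."
candidates = [
    ("Bill Gates", "Person"),
    ("Microsoft Corporation", "Organization"),
    ("1975", "Date"),
    ("Satya Nadella", "Person"),  # not present in this sentence
]

entities = locate_entities(text, candidates)
for entity in entities:
    print(asdict(entity))
```

Storing offsets alongside the text keeps each entity traceable back to its position in the source document.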


In [None]:
from semantica.semantic_extract import NERExtractor

extractor = NERExtractor()

try:
    print("Preparing entity extraction on normalized text...")
    print(f"\nText: {normalized_content[:100]}...")

    # NOTE: `extractor` is created above but is not called in this cell.
    # The list below is a hand-written illustration of the structured output
    # expected for the sample text, not the result of running the extractor.
    expected_entities = [
        {"text": "Microsoft Corporation", "type": "Organization"},
        {"text": "Bill Gates", "type": "Person"},
        {"text": "Paul Allen", "type": "Person"},
        {"text": "1975", "type": "Date"},
        {"text": "Redmond, Washington", "type": "Location"},
        {"text": "Satya Nadella", "type": "Person"},
    ]

    print(f"\nExpected {len(expected_entities)} entities:")
    for entity in expected_entities:
        print(f"  - {entity['text']} ({entity['type']})")

    print("\n✓ Transformation illustrated: Unstructured → Structured Data")
    
except Exception as e:
    print(f"✗ Error extracting entities: {e}")

try:
    if sample_file.exists():
        sample_file.unlink()
        print("\n✓ Sample file cleaned up")
except OSError:
    # Cleanup is best-effort; ignore filesystem errors only.
    pass
