### Handling Unstructured Data with Python
**Description**: Extract structured data from unstructured text using Python.

**Steps**:
1. Load and analyze an unstructured text document.
2. Extract information using regex.
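
As a minimal sketch of the two steps above, assuming every order block follows the `Key: value` layout with `- Name (Qty: n, Price: $x.xx)` item lines, line-anchored patterns and a named-group item pattern could look like this. The names `SAMPLE_BLOCK`, `FIELD_PATTERNS`, `ITEM_PATTERN`, and `parse_block` are illustrative assumptions, not part of the notebook cell.

```python
import re

# Hypothetical sample block, shaped like the order blocks in this notebook.
SAMPLE_BLOCK = """Order ID: 12345
Customer Name: Alice Wonderland
Order Date: 2025-05-16
Items:
- Product A (Qty: 2, Price: $10.00)
- Product B (Qty: 1, Price: $25.50)
Total Amount: $45.50"""

# One compiled pattern per header field.
# re.MULTILINE lets ^ and $ match at the start and end of each line.
FIELD_PATTERNS = {
    "order_id": re.compile(r"^Order ID: (\d+)$", re.MULTILINE),
    "customer_name": re.compile(r"^Customer Name: (.+)$", re.MULTILINE),
    "order_date": re.compile(r"^Order Date: (\d{4}-\d{2}-\d{2})$", re.MULTILINE),
}

# Named groups keep the three item fields readable.
ITEM_PATTERN = re.compile(
    r"^- (?P<name>.+?) \(Qty: (?P<qty>\d+), Price: \$(?P<price>\d+\.\d{2})\)$",
    re.MULTILINE,
)


def parse_block(block):
    """Return a dict of the fields found in one order block (illustrative helper)."""
    order = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = pattern.search(block)
        if match:
            # Header values are kept as strings in this sketch.
            order[key] = match.group(1).strip()
    order["items"] = [
        {
            "product_name": m.group("name"),
            "quantity": int(m.group("qty")),
            "price": float(m.group("price")),
        }
        for m in ITEM_PATTERN.finditer(block)
    ]
    return order


parsed = parse_block(SAMPLE_BLOCK)
print(parsed["customer_name"], len(parsed["items"]))
```

Anchoring each pattern to a whole line means a field is matched only where it starts a line, and `finditer` collects every item line in one pass without splitting the text by hand.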

In [1]:

import re

# 1. Load and analyze an unstructured text document.
text_document = """
Order ID: 12345
Customer Name: Alice Wonderland
Order Date: 2025-05-16
Items:
- Product A (Qty: 2, Price: $10.00)
- Product B (Qty: 1, Price: $25.50)
Total Amount: $45.50

Order ID: 67890
Customer Name: Bob The Builder
Order Date: 2025-05-15
Items:
- Product C (Qty: 3, Price: $5.20)
Total Amount: $15.60

Order ID: 13579
Customer Name: Charlie Chaplin
Order Date: 2025-05-14
Shipping Address: 123 Main St, Anytown, CA 91234
Items:
- Product D (Qty: 1, Price: $12.75)
- Product E (Qty: 2, Price: $8.99)
Total Amount: $30.73
"""

print("Unstructured Text Document:\n")
print(text_document)
print("\n--- Analyzing the document ---")
print("The document contains information about customer orders, including Order ID, Customer Name, Order Date, Items, and Total Amount. Some orders also include a Shipping Address.")
print("The structure seems to be a series of order blocks, each with key-value pairs and a list of items.")

# 2. Extract information using regex.
orders_data = []
# Split on blank lines; \s* also tolerates separator lines that contain stray whitespace.
order_blocks = re.split(r"\n\s*\n", text_document.strip())

for block in order_blocks:
    order = {}
    order_id_match = re.search(r"Order ID: (\d+)", block)
    if order_id_match:
        order['order_id'] = int(order_id_match.group(1))

    customer_name_match = re.search(r"Customer Name: (.*)", block)
    if customer_name_match:
        order['customer_name'] = customer_name_match.group(1).strip()

    order_date_match = re.search(r"Order Date: (\d{4}-\d{2}-\d{2})", block)
    if order_date_match:
        order['order_date'] = order_date_match.group(1)

    shipping_address_match = re.search(r"Shipping Address: (.*)", block)
    if shipping_address_match:
        order['shipping_address'] = shipping_address_match.group(1).strip()

    items = []
    # Capture everything between "Items:" and "Total Amount:" (or the end of the block).
    items_match = re.search(r"Items:\n(.*?)(?=Total Amount:|\Z)", block, re.DOTALL)
    if items_match:
        for item_line in items_match.group(1).splitlines():
            item_line = item_line.strip()
            if not item_line.startswith('-'):
                continue  # Skip blank or non-item lines
            # Drop the leading bullet so it does not end up in the product name.
            item_line = item_line[1:].strip()
            product_match = re.search(r"(.*) \(Qty: (\d+), Price: \$([\d.]+)\)", item_line)
            if product_match:
                items.append({
                    'product_name': product_match.group(1).strip(),
                    'quantity': int(product_match.group(2)),
                    'price': float(product_match.group(3))
                })
            elif item_line:
                # Handle items that list only a product name
                items.append({'product_name': item_line})

    if items:
        order['items'] = items

    total_amount_match = re.search(r"Total Amount: \$([\d.]+)", block)
    if total_amount_match:
        order['total_amount'] = float(total_amount_match.group(1))

    if order:
        orders_data.append(order)

print("\n--- Extracted Structured Data ---")
for order in orders_data:
    print(order)

Unstructured Text Document:


Order ID: 12345
Customer Name: Alice Wonderland
Order Date: 2025-05-16
Items:
- Product A (Qty: 2, Price: $10.00)
- Product B (Qty: 1, Price: $25.50)
Total Amount: $45.50

Order ID: 67890
Customer Name: Bob The Builder
Order Date: 2025-05-15
Items:
- Product C (Qty: 3, Price: $5.20)
Total Amount: $15.60

Order ID: 13579
Customer Name: Charlie Chaplin
Order Date: 2025-05-14
Shipping Address: 123 Main St, Anytown, CA 91234
Items:
- Product D (Qty: 1, Price: $12.75)
- Product E (Qty: 2, Price: $8.99)
Total Amount: $30.73


--- Analyzing the document ---
The document contains information about customer orders, including Order ID, Customer Name, Order Date, Items, and Total Amount. Some orders also include a Shipping Address.
The structure seems to be a series of order blocks, each with key-value pairs and a list of items.

--- Extracted Structured Data ---
{'order_id': 12345, 'customer_name': 'Alice Wonderland', 'order_date': '2025-05-16', 'items': [{'product_