# Data I/O and Serialization

This notebook demonstrates reading, writing, and serializing data in several formats (JSON, XML, YAML, Excel, Parquet, HDF5, Feather), along with HTTP clients for fetching data.

**Libraries:**
- [orjson](https://github.com/ijl/orjson) / [ujson](https://github.com/ultrajson/ultrajson) / [simplejson](https://simplejson.readthedocs.io/) - Fast JSON libraries
- [lxml](https://lxml.de/) / [xmltodict](https://github.com/martinblech/xmltodict) - XML processing
- [PyYAML](https://pyyaml.org/) - YAML processing
- [openpyxl](https://openpyxl.readthedocs.io/) - Excel files
- [PyArrow](https://arrow.apache.org/docs/python/) / [fastparquet](https://fastparquet.readthedocs.io/) - Parquet files
- [h5py](https://www.h5py.org/) / [PyTables](https://www.pytables.org/) - HDF5 files
- [requests](https://requests.readthedocs.io/) / [httpx](https://www.python-httpx.org/) - HTTP clients

In [None]:
import json
import numpy as np
import pandas as pd
import tempfile
import os

## JSON Libraries Comparison

Python has several JSON libraries with different performance characteristics:
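- **orjson**: Typically the fastest of the three; `dumps` returns `bytes` rather than `str`, and NumPy arrays are serialized natively when `option=orjson.OPT_SERIALIZE_NUMPY` is passed
- **ujson**: Very fast, with an API close to the standard `json` module; it covers the common `dumps`/`loads` calls but not every `json` keyword argument
- **simplejson**: Feature-rich and API-compatible with the standard `json` module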
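
The performance differences above depend on the payload, so it is worth measuring on your own data. The following is a minimal sketch of such a measurement, assuming a hypothetical helper named `benchmark_dumps`; it only uses the standard library so that it runs anywhere, and the third-party serializers can be added to the dictionary once they are installed.

```python
import json
import timeit


def benchmark_dumps(serializers, payload, number=200):
    """Return the best observed seconds-per-call for each named dumps callable."""
    results = {}
    for name, dumps in serializers.items():
        # timeit.repeat runs the callable `number` times, three times over;
        # the minimum of the three runs is the least noisy estimate.
        timings = timeit.repeat(lambda: dumps(payload), number=number, repeat=3)
        results[name] = min(timings) / number
    return results


# Illustrative payload (not taken from the notebook's sample data)
payload = {
    "users": [
        {"id": i, "name": f"user{i}", "scores": [95, 87, 92]} for i in range(500)
    ]
}

serializers = {
    "json (default)": json.dumps,
    "json (compact)": lambda obj: json.dumps(obj, separators=(",", ":")),
    # With the third-party packages installed, entries such as these can be added:
    # "orjson": orjson.dumps,  # returns bytes, not str
    # "ujson": ujson.dumps,
}

for name, seconds in benchmark_dumps(serializers, payload).items():
    print(f"{name:<16} {seconds * 1e6:10.1f} microseconds per call")
```

Absolute numbers vary by machine and Python version, so only the relative ordering on a representative payload is meaningful.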

In [None]:
import orjson
import ujson
import simplejson

# Sample data
sample_data = {
    "users": [
        {"id": 1, "name": "Alice", "scores": [95, 87, 92]},
        {"id": 2, "name": "Bob", "scores": [78, 85, 90]},
    ],
    "metadata": {"version": "1.0", "generated": "2024-01-01"},
    "count": 2,
}
sample_data

In [None]:
# Standard library json
json_str = json.dumps(sample_data, indent=2)
print("Standard json library:")
print(json_str)

In [None]:
# orjson (fastest, returns bytes)
orjson_bytes = orjson.dumps(sample_data)
print(f"orjson output (bytes): {orjson_bytes}")

# Pretty print with orjson
orjson_pretty = orjson.dumps(sample_data, option=orjson.OPT_INDENT_2).decode("utf-8")
print(f"\norjson pretty printed:\n{orjson_pretty}")

In [None]:
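# ujson and simplejson (both return str, like the standard library)
ujson_str = ujson.dumps(sample_data, indent=2)
simplejson_str = simplejson.dumps(sample_data, indent=2)

print("ujson output:")
print(ujson_str[:100] + "...")
print("\nsimplejson output:")
print(simplejson_str[:100] + "...")

# Parsing: orjson.loads accepts bytes or str
parsed = orjson.loads(orjson_bytes)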
print(f"\nParsed data keys: {list(parsed.keys())}")

## XML Libraries

XML processing options:
- **lxml**: Fast, feature-rich XML/HTML processing
- **xmltodict**: Converts XML to Python dictionaries
- **ElementTree**: Standard library XML parser

In [None]:
from lxml import etree
import xmltodict
import xml.etree.ElementTree as ET

# Sample XML
xml_string = """<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <book id="1">
        <title>Python Programming</title>
        <author>John Doe</author>
        <price>29.99</price>
    </book>
    <book id="2">
        <title>Data Science Handbook</title>
        <author>Jane Smith</author>
        <price>39.99</price>
    </book>
</catalog>
"""
print(xml_string)

In [None]:
# lxml parsing
print("lxml parsing:")
root = etree.fromstring(xml_string.encode())
for book in root.findall("book"):
    title = book.find("title").text
    author = book.find("author").text
    price = book.find("price").text
    print(f"  {title} by {author} - ${price}")

In [None]:
# xmltodict - convert XML to dictionary
print("xmltodict conversion:")
xml_dict = xmltodict.parse(xml_string)
print(f"Type: {type(xml_dict)}")
print(f"Books: {len(xml_dict['catalog']['book'])}")

# Access like a dictionary
for book in xml_dict['catalog']['book']:
    print(f"  Book ID {book['@id']}: {book['title']}")

In [None]:
# Standard library ElementTree
print("ElementTree parsing:")
et_root = ET.fromstring(xml_string)
for book in et_root.findall("book"):
    book_id = book.get("id")
    author = book.find("author").text
    print(f"  Book {book_id}: by {author}")
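
The cells above only read XML. As a complement, here is a minimal sketch of the writing direction using the standard library's ElementTree. The `books` list is illustrative data modeled on the catalog above, and `ET.indent` assumes Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET

# Illustrative records mirroring the catalog structure used earlier
books = [
    {"id": "1", "title": "Python Programming", "price": "29.99"},
    {"id": "2", "title": "Data Science Handbook", "price": "39.99"},
]

# Build the tree: attributes are passed as keyword arguments, text is set on .text
catalog = ET.Element("catalog")
for book in books:
    node = ET.SubElement(catalog, "book", id=book["id"])
    ET.SubElement(node, "title").text = book["title"]
    ET.SubElement(node, "price").text = book["price"]

# Pretty-print in place (Python 3.9+), then serialize to bytes with a declaration
ET.indent(catalog, space="    ")
xml_bytes = ET.tostring(catalog, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))

# Round trip: parse the bytes back and pull out the titles
reparsed = ET.fromstring(xml_bytes)
titles = [b.findtext("title") for b in reparsed.findall("book")]
print(titles)
```

lxml exposes a largely compatible `etree` API, so a similar structure should carry over if lxml-specific features such as XPath are needed.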

## YAML Processing

YAML is commonly used for configuration files due to its human-readable format.

In [None]:
import yaml

yaml_data = """
database:
  host: localhost
  port: 5432
  credentials:
    username: admin
    password: secret

servers:
  - name: web1
    ip: 192.168.1.1
  - name: web2
    ip: 192.168.1.2

features:
  enabled: true
  max_connections: 100
"""

# Parse YAML
config = yaml.safe_load(yaml_data)
print("Parsed YAML config:")
print(f"  Database host: {config['database']['host']}")
print(f"  Number of servers: {len(config['servers'])}")
print(f"  Features enabled: {config['features']['enabled']}")

In [None]:
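# Dump back to YAML (safe_dump emits only standard YAML tags; sort_keys=False keeps the parsed key order)
output_yaml = yaml.safe_dump(config, default_flow_style=False, sort_keys=False)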
print("Dumped YAML:")
print(output_yaml)
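
Configuration files often come from outside the codebase, which is why this section parses with `yaml.safe_load`. The sketch below, which assumes PyYAML 5.1 or newer, shows what that buys: a document carrying a Python-specific tag is rejected instead of being executed. The `untrusted` string and the small `settings` dictionary are made up for illustration.

```python
import yaml

# A document that asks the loader to call a Python function.
# safe_load only understands standard YAML tags, so it refuses this one.
untrusted = "!!python/object/apply:os.getcwd []"

try:
    yaml.safe_load(untrusted)
    rejected = False
except yaml.YAMLError as exc:
    rejected = True
    print(f"safe_load rejected the document: {type(exc).__name__}")

# safe_dump is the matching writer; sort_keys=False preserves insertion order
settings = {
    "database": {"host": "localhost", "port": 5432},
    "features": {"enabled": True},
}
dumped = yaml.safe_dump(settings, sort_keys=False)
print(dumped)

# The dumped text parses back to an equal dictionary
round_tripped = yaml.safe_load(dumped)
print(round_tripped == settings)
```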

## Excel File Handling

pandas reads and writes `.xlsx` files through openpyxl, which can also be used directly for cell-level access.

In [None]:
import openpyxl

# Create sample DataFrame
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000],
})

# Write to Excel
with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as f:
    excel_path = f.name

df.to_excel(excel_path, sheet_name="Employees", index=False)
print(f"Written DataFrame to Excel: {excel_path}")
df

In [None]:
# Read back with pandas
df_read = pd.read_excel(excel_path, sheet_name="Employees")
print(f"Read back DataFrame shape: {df_read.shape}")
df_read

In [None]:
# Using openpyxl directly
wb = openpyxl.load_workbook(excel_path)
ws = wb.active
print(f"Sheet name: {ws.title}")
print(f"Max row: {ws.max_row}, Max col: {ws.max_column}")

# Read cell values
print("\nCell values:")
for row in ws.iter_rows(min_row=1, max_row=2, values_only=True):
    print(f"  {row}")

os.unlink(excel_path)

## Parquet File Handling

Parquet is a columnar storage format optimized for analytics workloads.

**Advantages:**
- Efficient compression
- Column pruning (read only needed columns)
- Predicate pushdown (skip data that cannot match a filter)
- Schema preservation

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet

# Create larger DataFrame
np.random.seed(42)
df_large = pd.DataFrame({
    "id": range(10000),
    "value": np.random.randn(10000),
    "category": np.random.choice(["A", "B", "C"], 10000),
    "date": pd.date_range("2024-01-01", periods=10000, freq="h"),
})

print(f"DataFrame shape: {df_large.shape}")
df_large.head()

In [None]:
# Write with PyArrow
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
    parquet_path = f.name

df_large.to_parquet(parquet_path, engine="pyarrow", compression="snappy")
parquet_size = os.path.getsize(parquet_path)
print(f"Parquet file size: {parquet_size / 1024:.2f} KB")

In [None]:
# Read with PyArrow
df_parquet = pd.read_parquet(parquet_path, engine="pyarrow")
print(f"Full read shape: {df_parquet.shape}")

# Read specific columns only (column pruning)
df_partial = pd.read_parquet(parquet_path, columns=["id", "value"])
print(f"Partial read (2 columns): {df_partial.shape}")

In [None]:
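# Parquet schema and file metadata (read from the file footer, without loading the data)
schema = pq.read_schema(parquet_path)
metadata = pq.read_metadata(parquet_path)
print("Parquet schema:")
print(schema)
print(f"Rows: {metadata.num_rows}, row groups: {metadata.num_row_groups}")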

In [None]:
# FastParquet alternative
df_fp = fastparquet.ParquetFile(parquet_path).to_pandas()
print(f"Fastparquet read shape: {df_fp.shape}")

os.unlink(parquet_path)

## HDF5 File Handling

HDF5 is designed for storing large amounts of scientific data with hierarchical structure.

In [None]:
import h5py
import tables  # PyTables; not called directly, but required by pandas HDFStore

# Create HDF5 with h5py
with tempfile.NamedTemporaryFile(suffix=".h5", delete=False) as f:
    hdf5_path = f.name

# Write data with h5py
with h5py.File(hdf5_path, "w") as f:
    # Create datasets
    f.create_dataset("data", data=np.random.randn(1000, 100))
    f.create_dataset("labels", data=np.random.randint(0, 10, 1000))
    
    # Create groups with attributes
    grp = f.create_group("metadata")
    grp.attrs["version"] = "1.0"
    grp.attrs["description"] = "Sample HDF5 file"

print("Written HDF5 file with h5py")

In [None]:
# Read with h5py
with h5py.File(hdf5_path, "r") as f:
    print("HDF5 structure:")
    print(f"  Keys: {list(f.keys())}")
    print(f"  Data shape: {f['data'].shape}")
    print(f"  Labels shape: {f['labels'].shape}")
    print(f"  Metadata version: {f['metadata'].attrs['version']}")
    
    # Read a slice of data
    data_slice = f['data'][:10, :5]
    print(f"  Data slice shape: {data_slice.shape}")

os.unlink(hdf5_path)
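
For larger arrays, h5py can store a dataset in compressed chunks and read back only the slice that is needed. The following is a minimal sketch under assumed names: the path `experiment/run1/signal`, the `units` attribute, and the chunk shape are all hypothetical and would be tuned to the real access pattern.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal((2000, 64)).astype("float32")

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "chunked.h5")

    with h5py.File(path, "w") as f:
        # Intermediate groups ("experiment", "experiment/run1") are created automatically
        dset = f.create_dataset(
            "experiment/run1/signal",
            data=signal,
            chunks=(100, 64),      # unit of storage and compression
            compression="gzip",
            compression_opts=4,    # gzip level, 0-9
        )
        dset.attrs["units"] = "mV"

    with h5py.File(path, "r") as f:
        # visit() calls the function once per group or dataset path in the file
        names = []
        f.visit(names.append)

        dset = f["experiment/run1/signal"]
        # Slicing reads and decompresses only the chunks that overlap the request
        window = dset[500:510, :8]
        chunk_shape = dset.chunks
        units = dset.attrs["units"]

print(f"Objects in file: {names}")
print(f"Window shape: {window.shape}, chunk shape: {chunk_shape}, units: {units}")
```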

In [None]:
# Pandas HDFStore (uses PyTables)
with tempfile.NamedTemporaryFile(suffix=".h5", delete=False) as f:
    hdf_pandas_path = f.name

# Write DataFrame to HDF5
df_large.to_hdf(hdf_pandas_path, key="data", mode="w", complevel=5)
print(f"Written DataFrame to HDF5: {os.path.getsize(hdf_pandas_path) / 1024:.2f} KB")

# Read back
df_hdf = pd.read_hdf(hdf_pandas_path, key="data")
print(f"Read back shape: {df_hdf.shape}")

os.unlink(hdf_pandas_path)

## HTTP Client Libraries

Python HTTP clients for API requests:
- **requests**: Simple, synchronous HTTP library
- **httpx**: Modern, async-capable HTTP client

In [None]:
import requests
import httpx

# Note: these calls are printed rather than executed, so the cell needs no network access
print("requests library examples:")
print("  GET:  response = requests.get('https://api.example.com/data')")
print("  POST: response = requests.post(url, json={'key': 'value'})")
print("  Headers: requests.get(url, headers={'Authorization': 'Bearer token'})")

print("\nhttpx library examples (async capable):")
print("  Sync:  response = httpx.get('https://api.example.com/data')")
print("  Async: async with httpx.AsyncClient() as client:")
print("           response = await client.get(url)")

In [None]:
# Session management with requests
session = requests.Session()
session.headers.update({"User-Agent": "DataScience-Notebook/1.0"})
print(f"Session headers: {dict(session.headers)}")
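
A session is also the natural place to configure retries and timeouts once. The sketch below assumes two hypothetical helpers, `build_session` and `fetch_json`; the retry count, backoff factor, and status codes are illustrative defaults rather than recommendations, and the URL in the final comment is a placeholder that is never requested.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent="DataScience-Notebook/1.0", total_retries=3):
    """Create a Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    # Mount the adapter for both schemes so every request uses the retry policy
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_json(session, url, timeout=10):
    """GET a URL and decode the JSON body, raising on HTTP error statuses."""
    # requests has no default timeout, so one is passed explicitly
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


session = build_session()
print(f"User-Agent: {session.headers['User-Agent']}")
# Example call (placeholder URL, not executed here):
# data = fetch_json(session, "https://api.example.com/data")
```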

## Pandas I/O Capabilities

Compare on-disk file sizes for CSV, JSON, and Feather, then read each format back with pandas.

In [None]:
# Create sample data
df_sample = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100),
    "value": np.random.randn(100),
    "category": np.random.choice(["X", "Y", "Z"], 100),
})

with tempfile.TemporaryDirectory() as tmpdir:
    # CSV
    csv_path = os.path.join(tmpdir, "data.csv")
    df_sample.to_csv(csv_path, index=False)
    csv_size = os.path.getsize(csv_path)
    
    # JSON (Pandas)
    json_path = os.path.join(tmpdir, "data.json")
    df_sample.to_json(json_path, orient="records", date_format="iso")
    json_size = os.path.getsize(json_path)
    
    # Feather (fast binary format)
    feather_path = os.path.join(tmpdir, "data.feather")
    df_sample.to_feather(feather_path)
    feather_size = os.path.getsize(feather_path)
    
    print("File size comparison:")
    print(f"  CSV:     {csv_size:,} bytes")
    print(f"  JSON:    {json_size:,} bytes")
    print(f"  Feather: {feather_size:,} bytes")
    
    # Read back
    df_csv = pd.read_csv(csv_path, parse_dates=["date"])
    df_json = pd.read_json(json_path)
    df_feather = pd.read_feather(feather_path)
    
    print(f"\nAll formats read successfully with shape: {df_csv.shape}")
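
File size is only one side of the comparison; read time matters as well. Here is a minimal timing sketch, assuming a hypothetical helper named `time_read`. It covers CSV and JSON so that it depends only on pandas and NumPy; binary formats such as Feather or Parquet can be timed the same way by passing `pd.read_feather` or `pd.read_parquet`.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_read(reader, path, repeats=5, **kwargs):
    """Return the fastest of `repeats` wall-clock timings for reader(path, **kwargs)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Illustrative data, larger than the 100-row sample so timings are less noisy
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5000),
    "value": rng.standard_normal(5000),
    "category": rng.choice(["X", "Y", "Z"], 5000),
})

timings = {}
with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "data.csv")
    json_path = os.path.join(tmpdir, "data.json")

    df.to_csv(csv_path, index=False)
    df.to_json(json_path, orient="records", date_format="iso")

    timings["CSV"] = time_read(pd.read_csv, csv_path, parse_dates=["date"])
    timings["JSON"] = time_read(pd.read_json, json_path)

print("Best read time:")
for name, seconds in timings.items():
    print(f"  {name:<5} {seconds * 1000:8.2f} ms")
```

Taking the minimum over several repeats reduces the influence of disk caching and background load, but results still depend on the machine and the pandas version.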

---

## Summary

In this notebook, we covered:

1. **JSON Libraries**: orjson and ujson for fast serialization, simplejson for features and `json` compatibility
2. **XML Processing**: lxml, xmltodict, ElementTree
3. **YAML**: PyYAML for configuration files
4. **Excel**: openpyxl and pandas for spreadsheet I/O
5. **Parquet**: PyArrow and fastparquet for columnar storage
6. **HDF5**: h5py and pandas HDFStore for scientific data
7. **HTTP Clients**: requests and httpx for API access

**Format Selection Guide:**
- **CSV**: Human-readable, universal compatibility
- **JSON**: API data, configuration, web applications
- **Parquet**: Analytics, big data, columnar queries
- **HDF5**: Scientific computing, large arrays, hierarchical data
- **Feather**: Fast DataFrame serialization between Python/R