# Exercise - JSON Schema and Data Validation

In this exercise, you will learn about JSON Schema, a powerful tool for describing and validating JSON data structures. JSON Schema is essential for data integration tasks as it ensures data quality, defines clear contracts between systems, and helps catch data inconsistencies early in the pipeline.

You will work with product catalog data and customer information to understand how to create schemas, validate data, and handle validation errors.

## 1 Introduction to JSON Schema

JSON Schema is a vocabulary for annotating and validating JSON documents. A schema doubles as clear documentation of your JSON data format that both humans and machines can read.

Key benefits:
- **Data Validation**: Ensure data conforms to expected structure
- **Documentation**: Self-documenting data contracts
- **Code Generation**: Auto-generate data models
- **API Design**: Define request/response formats

We'll use the [jsonschema](https://python-jsonschema.readthedocs.io/) library, which implements JSON Schema validation for Python.
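
To make the idea concrete before the exercises start, here is a minimal sketch of what validation looks like. The schema and the helper `is_valid` are hypothetical names made up for illustration; only `validate` and `ValidationError` come from the library.

```python
from jsonschema import validate, ValidationError

# Hypothetical minimal schema: an object with a required string field "name".
minimal_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}

def is_valid(record, schema):
    """Return True if the record conforms to the schema, False otherwise."""
    try:
        validate(instance=record, schema=schema)
        return True
    except ValidationError:
        return False

print(is_valid({"name": "Widget"}, minimal_schema))  # conforms to the schema
print(is_valid({"name": 42}, minimal_schema))        # wrong type for "name"
print(is_valid({}, minimal_schema))                  # required field is missing
```

`validate` raises on the first problem it reports, which is convenient for a quick check but less so when you want every error for a record.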

### 1.1 Installation and Basic Setup

First, install the required libraries if they are not already available (for example with `pip install jsonschema pandas`), then import the necessary modules.

In [2]:
import json
import pandas as pd
from jsonschema import validate, ValidationError, Draft7Validator
from pprint import pprint

### 1.2 Load and Inspect Sample Data

Load the sample product data and examine its structure to understand what we need to validate.

In [3]:
# Load the products dataset
with open('input/products.json', 'r') as f:
    products_data = json.load(f)

# Display the first product to understand the structure
# (assumes the file contains a JSON array of product objects)
pprint(products_data[0])


## 2 Creating Basic JSON Schemas

### 2.1 Define a Product Schema

Based on the sample data, create a JSON schema that validates:
- Required fields: id, name, category, price
- Data types: id (integer), name (string), category (string), price (number)
- Constraints: price must be positive
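
One possible shape for such a schema is sketched below. It is not the only valid answer; the name `sketch_product_schema` and the sample record are made up for illustration, and "positive" is interpreted here as strictly greater than zero via `exclusiveMinimum`.

```python
from jsonschema import Draft7Validator

# A sketch of the requirements listed above (one of several reasonable answers).
sketch_product_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "category": {"type": "string"},
        # exclusiveMinimum 0 rejects both zero and negative prices
        "price": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["id", "name", "category", "price"],
}

# Check that the schema itself is well-formed before using it.
Draft7Validator.check_schema(sketch_product_schema)
validator = Draft7Validator(sketch_product_schema)

# Made-up sample record, not taken from input/products.json.
sample = {"id": 1, "name": "Desk Lamp", "category": "home", "price": 19.99}
print(validator.is_valid(sample))
print(validator.is_valid({**sample, "price": 0}))
```

If a zero price should be allowed, `"minimum": 0` would be the keyword to use instead.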

In [None]:
# Define the product schema
product_schema = {
    # TODO: Complete the schema definition
}

print("Product Schema:")
pprint(product_schema)

### 2.2 Validate Valid Data

Test your schema against valid product data to ensure it works correctly.

In [None]:
# Validate the first product against the schema
try:
    # TODO: Add validation code
    pass
except ValidationError as e:
    print(f"Validation failed: {e.message}")

### 2.3 Test with Invalid Data

Load the invalid products dataset and see how your schema catches validation errors.
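
A common way to collect errors is to build a validator once and call `iter_errors` on each record, which yields every violation instead of stopping at the first. The sketch below assumes a small inline schema and two made-up records; the helper name `collect_validation_results` is hypothetical. The result dictionaries use the same `index`, `valid`, and `error` keys that the display loop in the next cell expects.

```python
from jsonschema import Draft7Validator

# Small inline schema so the sketch is self-contained (assumed, not from the input files).
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "price": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["id", "price"],
}

def collect_validation_results(records, schema):
    """Validate each record and gather all error messages per record."""
    validator = Draft7Validator(schema)
    results = []
    for index, record in enumerate(records):
        messages = [err.message for err in validator.iter_errors(record)]
        results.append({
            "index": index,
            "valid": not messages,
            "errors": messages,             # every message, as a list
            "error": "; ".join(messages),   # single string for display
        })
    return results

# Made-up records: the second one has two separate problems.
records = [{"id": 1, "price": 5}, {"id": "2", "price": -1}]
results = collect_validation_results(records, schema)
for result in results:
    print(result["index"], result["valid"], result["error"])
```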

In [4]:
# Load invalid products data
with open('input/products_invalid.json', 'r') as f:
    invalid_products = json.load(f)

# Validate each invalid product and collect errors
validation_results = []

for i, product in enumerate(invalid_products):
    # TODO: Add validation logic and error collection
    pass

# Display validation results
for result in validation_results:
    print(f"Product {result['index']}: {'Valid' if result['valid'] else 'Invalid'}")
    if not result['valid']:
        print(f"  Error: {result['error']}")
    print()

## 3 Advanced Schema Features

### 3.1 Complex Data Types and Nested Objects

Extend your product schema to handle more complex data:
- Add an optional `specifications` object with nested properties
- Add a `tags` array of strings
- Add an optional `supplier` object with required `name` and `contact` fields
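
The sketch below shows how nested objects and arrays can be expressed. The property names inside `specifications` (`weight_kg`, `color`) and the choice to model `contact` as an object are assumptions made for illustration; the real data may look different.

```python
from jsonschema import Draft7Validator

extended_sketch = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "category": {"type": "string"},
        "price": {"type": "number", "exclusiveMinimum": 0},
        # Optional nested object (property names are hypothetical)
        "specifications": {
            "type": "object",
            "properties": {
                "weight_kg": {"type": "number", "minimum": 0},
                "color": {"type": "string"},
            },
        },
        # Array whose items must all be strings
        "tags": {"type": "array", "items": {"type": "string"}},
        # Optional object, but if present it needs both name and contact
        "supplier": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "contact": {"type": "object"},
            },
            "required": ["name", "contact"],
        },
    },
    "required": ["id", "name", "category", "price"],
}

validator = Draft7Validator(extended_sketch)

def error_paths(record):
    """Return the location of each error as a list of keys and indexes."""
    return [list(err.absolute_path) for err in validator.iter_errors(record)]

base = {"id": 1, "name": "Desk Lamp", "category": "home", "price": 19.99}
print(error_paths({**base, "tags": ["lighting", 3]}))
print(error_paths({**base, "supplier": {"name": "Acme"}}))
```

`absolute_path` points at the offending value inside the record, which is useful when errors occur deep inside nested data.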

In [None]:
# Extended product schema with nested objects and arrays
extended_product_schema = {
    # TODO: Extend the basic schema with complex types
}

print("Extended Product Schema:")
pprint(extended_product_schema)

### 3.2 String Patterns and Enums

Add validation for:
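- Product ID must be a string matching the pattern "PROD-" followed by exactly 4 digits, e.g. `PROD-1234` (note that Section 2.1 used an integer `id`)
- Category must be one of: ["electronics", "clothing", "books", "home", "sports"]
- Optional email field in supplier contact must be a valid email address (`"format": "email"`); note that `jsonschema` only enforces `format` when a format checker is passed to the validator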
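
A sketch of these three constraints follows. Two details are easy to miss: JSON Schema patterns are not anchored, so `^` and `$` are needed to match the whole string, and `format` is treated as an annotation unless a `FormatChecker` is supplied. The schema name, the helper `failed_keywords`, and the supplier record are made up for illustration.

```python
from jsonschema import Draft7Validator, FormatChecker

pattern_sketch = {
    "type": "object",
    "properties": {
        # Anchored so that "XPROD-12345" does not slip through
        "id": {"type": "string", "pattern": "^PROD-[0-9]{4}$"},
        "name": {"type": "string"},
        "category": {"enum": ["electronics", "clothing", "books", "home", "sports"]},
        "price": {"type": "number", "exclusiveMinimum": 0},
        "supplier": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "contact": {
                    "type": "object",
                    "properties": {"email": {"type": "string", "format": "email"}},
                },
            },
        },
    },
    "required": ["id", "name", "category", "price"],
}

# With a format checker, "format" is enforced; without one, it is ignored.
strict = Draft7Validator(pattern_sketch, format_checker=FormatChecker())
lenient = Draft7Validator(pattern_sketch)

def failed_keywords(validator, record):
    """Return the sorted schema keywords that the record violates."""
    return sorted(err.validator for err in validator.iter_errors(record))

good = {"id": "PROD-1234", "name": "Test Product", "category": "electronics", "price": 99.99}
bad_email = {**good, "supplier": {"name": "Acme", "contact": {"email": "not-an-email"}}}

print(failed_keywords(strict, {**good, "id": "INVALID-ID"}))
print(failed_keywords(strict, {**good, "category": "invalid_category"}))
print(failed_keywords(strict, bad_email))
print(failed_keywords(lenient, bad_email))
```

The built-in email check is deliberately loose, so treat it as a sanity check rather than proof that an address is deliverable.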

In [None]:
# Product schema with pattern and enum validation
pattern_schema = {
    # TODO: Add pattern and enum constraints
}

# Test with different data examples
test_cases = [
    {"id": "PROD-1234", "name": "Test Product", "category": "electronics", "price": 99.99},
    {"id": "INVALID-ID", "name": "Bad Product", "category": "electronics", "price": 50.00},
    {"id": "PROD-5678", "name": "Another Product", "category": "invalid_category", "price": 25.00}
]

# TODO: Validate each test case and report results


### 3.3 Conditional Validation

Create a schema that uses conditional validation:
- If category is "electronics", then `warranty_years` field is required
- If category is "clothing", then `size` field is required
- If price > 1000, then `premium_support` field is required
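
One way to express several independent conditions is an `allOf` list of `if`/`then` pairs, sketched below under Draft 7. Each `if` block also lists the inspected property under `required`; without that, a record lacking the property would satisfy the `if` vacuously and trigger the `then` branch. The schema name and the helper `messages` are hypothetical.

```python
from jsonschema import Draft7Validator

conditional_sketch = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["category", "price"],
    "allOf": [
        {
            "if": {"properties": {"category": {"const": "electronics"}}, "required": ["category"]},
            "then": {"required": ["warranty_years"]},
        },
        {
            "if": {"properties": {"category": {"const": "clothing"}}, "required": ["category"]},
            "then": {"required": ["size"]},
        },
        {
            # "price > 1000" is expressed as an exclusive lower bound
            "if": {"properties": {"price": {"exclusiveMinimum": 1000}}, "required": ["price"]},
            "then": {"required": ["premium_support"]},
        },
    ],
}

validator = Draft7Validator(conditional_sketch)

def messages(record):
    """Return all error messages for a record."""
    return [err.message for err in validator.iter_errors(record)]

laptop = {"id": "PROD-1111", "name": "Laptop", "category": "electronics", "price": 899.99}
phone = {**laptop, "id": "PROD-2222", "name": "Phone", "price": 699.99, "warranty_years": 2}
ring = {"id": "PROD-3333", "name": "Diamond Ring", "category": "jewelry", "price": 5000.00}

for record in (laptop, phone, ring):
    print(record["name"], messages(record))
```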

In [None]:
# Schema with conditional validation
conditional_schema = {
    # TODO: Implement conditional validation using if-then-else
}

# Test conditional validation
conditional_test_cases = [
    # Electronics without warranty
    {"id": "PROD-1111", "name": "Laptop", "category": "electronics", "price": 899.99},
    # Electronics with warranty
    {"id": "PROD-2222", "name": "Phone", "category": "electronics", "price": 699.99, "warranty_years": 2},
    # Expensive item without premium support
    {"id": "PROD-3333", "name": "Diamond Ring", "category": "jewelry", "price": 5000.00}
]

# TODO: Test conditional validation


## 4 Schema Composition and References

### 4.1 Using Schema References

Create reusable schema definitions using `$ref` to avoid duplication. Define:
- A common `contact` schema
- A common `address` schema
- Use these in both product supplier and customer schemas
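
The sketch below shows the basic mechanics of `$defs` and `$ref` within a single schema. The fields chosen for `contact` and `address` are assumptions for illustration. The validator class is selected from the `$schema` keyword with `validator_for`, so the code does not hard-code a draft.

```python
from jsonschema.validators import validator_for

refs_sketch = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$defs": {
        # Reusable building blocks (field choices are hypothetical)
        "contact": {
            "type": "object",
            "properties": {
                "email": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["email"],
        },
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
            },
            "required": ["city"],
        },
    },
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # "#/$defs/..." is a JSON Pointer into this same document
        "contact": {"$ref": "#/$defs/contact"},
        "address": {"$ref": "#/$defs/address"},
    },
    "required": ["name", "contact"],
}

validator_cls = validator_for(refs_sketch)   # picks the class matching "$schema"
validator_cls.check_schema(refs_sketch)
validator = validator_cls(refs_sketch)

ok = {"name": "Acme", "contact": {"email": "sales@example.com"}, "address": {"city": "Vienna"}}
missing_email = {"name": "Acme", "contact": {"phone": "123"}}

print(validator.is_valid(ok))
print([err.message for err in validator.iter_errors(missing_email)])
```

A customer schema could point at the same `contact` and `address` definitions, so a change to either definition applies everywhere it is referenced.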

In [None]:
# Define reusable schema components
schema_with_refs = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$defs": {
        # TODO: Define reusable schemas here
    },
    # TODO: Main schema that references the definitions
}

print("Schema with References:")
pprint(schema_with_refs)

### 4.2 Customer Data Validation

Create a customer schema using the reusable components and validate customer data.

In [None]:
# Load customer data
with open('input/customers.json', 'r') as f:
    customers_data = json.load(f)

# Customer schema
customer_schema = {
    # TODO: Define customer schema using references
}

# Validate customers
# TODO: Validate customer data


## 5 Bulk Validation and Error Reporting

### 5.1 Validate Large Datasets

Create a function to validate large datasets efficiently and generate comprehensive error reports.
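
As a starting point, the sketch below builds the validator once, records every error per record, and then aggregates the errors by the schema keyword that failed. The function names `validate_dataset_sketch` and `summarize_results`, the result layout, and the sample data are all made up for illustration; the functions in the exercise cell may be structured differently.

```python
from collections import Counter
from jsonschema import Draft7Validator

def validate_dataset_sketch(data, schema, dataset_name="dataset"):
    """Validate a list of records and keep per-record error details."""
    validator = Draft7Validator(schema)  # built once, reused for every record
    records = []
    for index, record in enumerate(data):
        errors = [
            {
                "path": "/".join(str(part) for part in err.absolute_path) or "<root>",
                "keyword": err.validator,  # the failing keyword, e.g. "type"
                "message": err.message,
            }
            for err in validator.iter_errors(record)
        ]
        records.append({"index": index, "valid": not errors, "errors": errors})
    return {"dataset": dataset_name, "records": records}

def summarize_results(results):
    """Summarize counts of valid/invalid records and errors per keyword."""
    records = results["records"]
    invalid = [r for r in records if not r["valid"]]
    by_keyword = Counter(e["keyword"] for r in invalid for e in r["errors"])
    return {
        "dataset": results["dataset"],
        "total": len(records),
        "valid": len(records) - len(invalid),
        "invalid": len(invalid),
        "errors_by_keyword": dict(by_keyword),
    }

# Made-up schema and data for demonstration.
schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "price": {"type": "number", "exclusiveMinimum": 0}},
    "required": ["id", "price"],
}
data = [{"id": 1, "price": 5.0}, {"id": 2}, {"id": 3, "price": "9.99"}]

report = summarize_results(validate_dataset_sketch(data, schema, "demo"))
print(report)
```

Grouping by keyword tends to reveal systematic problems, such as a whole batch where prices arrive as strings.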

In [None]:
def validate_dataset(data, schema, dataset_name="dataset"):
    """Validate a list of records and return detailed results."""
    # TODO: Implement bulk validation function
    pass

def generate_validation_report(validation_results):
    """Generate a summary report of validation results."""
    # TODO: Implement report generation
    pass

# Test with products dataset
# TODO: Use functions to validate and report on products data


### 5.2 Data Cleaning Based on Validation

Use validation results to clean and fix data issues automatically where possible.
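
The sketch below illustrates one cautious cleaning strategy: apply a few safe repairs to a copy of each record, re-validate, and set aside anything that still fails. The repairs shown (trimming whitespace, converting numeric strings to numbers), the function name `clean_records_sketch`, and the sample records are assumptions; which fixes are appropriate depends on the real data.

```python
from jsonschema import Draft7Validator

def clean_records_sketch(records, schema):
    """Repair what can be repaired safely; return (cleaned, rejected)."""
    validator = Draft7Validator(schema)
    cleaned, rejected = [], []
    for record in records:
        candidate = dict(record)  # shallow copy so the input is not modified

        # Repair 1: numeric strings such as " 19.99 " become numbers
        price = candidate.get("price")
        if isinstance(price, str):
            try:
                candidate["price"] = float(price.strip())
            except ValueError:
                pass  # leave as-is; validation below will reject it

        # Repair 2: trim stray whitespace around names
        name = candidate.get("name")
        if isinstance(name, str):
            candidate["name"] = name.strip()

        if validator.is_valid(candidate):
            cleaned.append(candidate)
        else:
            rejected.append(record)  # keep the original for later inspection
    return cleaned, rejected

# Made-up schema and records for demonstration.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["id", "name", "price"],
}
raw = [
    {"id": 1, "name": " Lamp ", "price": " 19.99 "},
    {"id": 2, "name": "Mug", "price": "free"},
    {"id": 3, "name": "Chair", "price": -5},
]

cleaned, rejected = clean_records_sketch(raw, schema)
print(cleaned)
print(rejected)
```

Keeping the rejected records rather than silently dropping them makes it possible to review them or report them back to the data source.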

In [None]:
def clean_product_data(products, validation_results):
    """Clean product data based on validation errors."""
    cleaned_products = []
    
    # TODO: Implement data cleaning logic
    # Examples:
    # - Convert string prices to numbers
    # - Remove invalid records
    # - Fix common formatting issues
    
    return cleaned_products

# TODO: Clean invalid products data and re-validate


## 6 Integration with Data Pipelines

### 6.1 Schema Evolution and Versioning

Demonstrate how to handle schema changes over time while maintaining backwards compatibility.
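
A simple fallback approach is to try a list of schema versions in order and report the first one that accepts the record. In the sketch below, both versions are closed with `additionalProperties: False`, and version 2 adds an optional `currency` field; these design choices, the schema names, and the helper `first_matching_version` are assumptions made for illustration.

```python
from jsonschema import Draft7Validator

product_v1_sketch = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["id", "name", "price"],
    "additionalProperties": False,
}

# v2 keeps every v1 rule and adds one optional field, so v1 records stay valid.
product_v2_sketch = {
    **product_v1_sketch,
    "properties": {
        **product_v1_sketch["properties"],
        "currency": {"type": "string", "minLength": 3, "maxLength": 3},
    },
}

def first_matching_version(record, versioned_schemas):
    """Try schemas in order; return (version, []) or (None, last errors)."""
    last_errors = []
    for version, schema in versioned_schemas:
        errors = [err.message for err in Draft7Validator(schema).iter_errors(record)]
        if not errors:
            return version, []
        last_errors = errors
    return None, last_errors

versions = [("v1", product_v1_sketch), ("v2", product_v2_sketch)]

old_record = {"id": 1, "name": "Lamp", "price": 19.99}
new_record = {"id": 2, "name": "Mug", "price": 4.5, "currency": "EUR"}
broken = {"id": 3, "price": 4.5}

for record in (old_record, new_record, broken):
    print(first_matching_version(record, versions))
```

Trying the oldest version first answers the question "what is the earliest contract this record satisfies?"; reversing the list would instead prefer the newest schema.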

In [None]:
# Version 1 of product schema
product_schema_v1 = {
    # TODO: Define v1 schema
}

# Version 2 with additional optional fields
product_schema_v2 = {
    # TODO: Define v2 schema with new fields
}

# Function to validate with fallback
def validate_with_version_fallback(data, schemas):
    """Try validation with multiple schema versions."""
    # TODO: Implement version fallback logic
    pass

# TODO: Test with different data versions


### 6.2 Performance Considerations

Explore performance optimization for validating large datasets.
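
The main lever is usually to avoid rebuilding the validator for every record. The module-level `validate()` function checks the schema and constructs a validator on each call, while a `Draft7Validator` instance can be created once and reused. The timing harness below is a rough sketch with made-up data and hypothetical helper names; actual numbers depend on the machine, the schema, and the library version.

```python
import time
from jsonschema import Draft7Validator, validate

def time_call(func, iterations):
    """Return the elapsed seconds for calling func() `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    return time.perf_counter() - start

def benchmark_sketch(records, schema, iterations=50):
    """Compare three ways of validating the same (valid) records."""
    validator = Draft7Validator(schema)  # constructed once, outside the timed loops

    def with_validate():
        for record in records:
            validate(instance=record, schema=schema, cls=Draft7Validator)

    def with_reused_validator():
        for record in records:
            validator.validate(record)

    def with_is_valid():
        for record in records:
            validator.is_valid(record)  # boolean only, no error details

    return {
        "validate()": time_call(with_validate, iterations),
        "reused validator": time_call(with_reused_validator, iterations),
        "reused is_valid": time_call(with_is_valid, iterations),
    }

# Made-up schema and records; all records are valid so validate() does not raise.
schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "price": {"type": "number", "exclusiveMinimum": 0}},
    "required": ["id", "price"],
}
records = [{"id": i, "price": i + 0.5} for i in range(20)]

timings = benchmark_sketch(records, schema, iterations=20)
for method, seconds in timings.items():
    print(f"{method}: {seconds:.4f}s")
```

For more stable measurements, the standard library's `timeit` module repeats runs and is generally a better fit than a hand-rolled loop.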

In [None]:
import time
from jsonschema import Draft7Validator

def benchmark_validation_methods(data, schema, iterations=1000):
    """Benchmark different validation approaches."""
    results = {}
    
    # Method 1: Basic validation
    # TODO: Benchmark basic validate() function
    
    # Method 2: Pre-compiled validator
    # TODO: Benchmark with pre-compiled Draft7Validator
    
    # Method 3: Validation with early stopping
    # TODO: Run check_schema once up front, then benchmark stopping at the
    #       first error from the lazy iter_errors iterator
    
    return results

# TODO: Run benchmarks and compare performance


## 7 Best Practices and Real-World Applications


Keep these practices in mind when applying JSON Schema in real data pipelines:

- Start with required fields, then add optional ones gradually
- Use clear, descriptive field names and add descriptions
- Define appropriate constraints (min/max, patterns, enums)
- Plan for schema evolution with versioning
- Use references to avoid duplication
- Consider performance implications for large datasets