# Exercise - JSON Schema and Data Validation - SOLUTIONS

In this exercise, you will learn about JSON Schema, a powerful tool for describing and validating JSON data structures. JSON Schema is essential for data integration tasks as it ensures data quality, defines clear contracts between systems, and helps catch data inconsistencies early in the pipeline.

You will work with product catalog data and customer information to understand how to create schemas, validate data, and handle validation errors.

## 1 Introduction to JSON Schema

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. It provides clear, human- and machine-readable documentation of your JSON data format.

Key benefits:
- **Data Validation**: Ensure data conforms to expected structure
- **Documentation**: Self-documenting data contracts
- **Code Generation**: Auto-generate data models
- **API Design**: Define request/response formats

We'll use the [jsonschema](https://python-jsonschema.readthedocs.io/) library, which implements JSON Schema validation for Python.
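
As a first taste of what validation looks like, here is a minimal sketch. The schema, the records, and the `is_valid` helper are hypothetical and only serve as illustration; `validate` and `ValidationError` are the library names used throughout this exercise.

```python
from jsonschema import ValidationError, validate

# Hypothetical schema for illustration: an object with a required string "name".
schema = {
    "type": "object",
    "required": ["name"],
    "properties": {"name": {"type": "string"}},
}


def is_valid(record):
    """Return True when the record conforms to the schema, False otherwise."""
    try:
        validate(instance=record, schema=schema)
        return True
    except ValidationError:
        return False


print(is_valid({"name": "Widget"}))  # conforms to the schema
print(is_valid({"name": 42}))        # wrong type for "name"
```

`validate` returns nothing on success and raises `ValidationError` on the first problem it reports, which is why the helper wraps it in `try`/`except`.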

### 1.1 Installation and Basic Setup

First, let's install the required library and import the necessary modules.

In [1]:
# Install jsonschema library
# !pip install jsonschema

import json
import pandas as pd
from jsonschema import validate, ValidationError, Draft7Validator
from pprint import pprint
import time

### 1.2 Load and Inspect Sample Data

Load the sample product data and examine its structure to understand what we need to validate.

In [2]:
# Load the products dataset
with open('input/products.json', 'r') as f:
    products_data = json.load(f)

# Display the first product to understand the structure
print("Sample product data:")
pprint(products_data[0] if products_data else "No products found")
print(f"\nTotal products: {len(products_data)}")

Sample product data:
{'category': 'electronics',
 'description': 'High-quality wireless headphones with active noise '
                'cancellation',
 'id': 1,
 'in_stock': True,
 'name': 'Wireless Bluetooth Headphones',
 'price': 79.99,
 'specifications': {'color': 'black',
                    'dimensions': {'height': 7.5,
                                   'length': 18.0,
                                   'width': 15.0},
                    'material': 'plastic',
                    'weight': 250.5},
 'supplier': {'contact': {'address': '123 Tech Street, Silicon Valley, CA',
                          'email': 'contact@audiotech.com',
                          'phone': '+1234567890'},
              'name': 'AudioTech Corp'},
 'tags': ['wireless', 'bluetooth', 'headphones', 'audio']}

Total products: 8


## 2 Creating Basic JSON Schemas

### 2.1 Define a Product Schema

Based on the sample data, create a JSON schema that validates:
- Required fields: id, name, category, price
- Data types: id (integer), name (string), category (string), price (number)
- Constraints: price must be positive

In [3]:
# Define the product schema
product_schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "type": "object",
    "required": ["id", "name", "category", "price"],
    "properties": {
        "id": {
            "type": "integer",
            "description": "Unique product identifier"
        },
        "name": {
            "type": "string",
            "minLength": 1,
            "description": "Product name"
        },
        "category": {
            "type": "string",
            "minLength": 1,
            "description": "Product category"
        },
        "price": {
            "type": "number",
            "exclusiveMinimum": 0,
            "description": "Product price (must be positive)"
        },
        "description": {
            "type": "string",
            "description": "Optional product description"
        },
        "in_stock": {
            "type": "boolean",
            "description": "Whether product is in stock"
        }
    },
    "additionalProperties": True
}

print("Product Schema:")
pprint(product_schema)

Product Schema:
{'$schema': 'https://json-schema.org/draft/2019-09/schema',
 'additionalProperties': True,
 'properties': {'category': {'description': 'Product category',
                             'minLength': 1,
                             'type': 'string'},
                'description': {'description': 'Optional product description',
                                'type': 'string'},
                'id': {'description': 'Unique product identifier',
                       'type': 'integer'},
                'in_stock': {'description': 'Whether product is in stock',
                             'type': 'boolean'},
                'name': {'description': 'Product name',
                         'minLength': 1,
                         'type': 'string'},
                'price': {'description': 'Product price (must be positive)',
                          'exclusiveMinimum': 0,
                          'type': 'number'}},
 'required': ['id', 'name', 'category', 'price'],
 'type': 'object'}

### 2.2 Validate Valid Data

Test your schema against valid product data to ensure it works correctly.

In [4]:
# Validate the first product against the schema
try:
    validate(instance=products_data[0], schema=product_schema)
    print("✓ First product is valid according to the schema!")
    print(f"Validated product: {products_data[0]['name']}")
except ValidationError as e:
    print(f"Validation failed: {e.message}")
    print(f"Failed at path: {' -> '.join(str(x) for x in e.absolute_path) if e.absolute_path else 'root'}")

✓ First product is valid according to the schema!
Validated product: Wireless Bluetooth Headphones


### 2.3 Test with Invalid Data

Load the invalid products dataset and see how your schema catches validation errors.

In [5]:
# Load invalid products data
with open('input/products_invalid.json', 'r') as f:
    invalid_products = json.load(f)

# Validate each invalid product and collect errors
validation_results = []

for i, product in enumerate(invalid_products):
    try:
        validate(instance=product, schema=product_schema)
        validation_results.append({
            'index': i,
            'product': product,
            'valid': True,
            'error': None
        })
    except ValidationError as e:
        validation_results.append({
            'index': i,
            'product': product,
            'valid': False,
            'error': e.message,
            'path': ' -> '.join(str(x) for x in e.absolute_path) if e.absolute_path else 'root'
        })

# Display validation results
for result in validation_results:
    status = "Valid" if result['valid'] else "Invalid"
    product_name = result['product'].get('name', 'Unknown')
    print(f"Product {result['index']} ({product_name}): {status}")
    if not result['valid']:
        print(f"  Error: {result['error']}")
        print(f"  Path: {result['path']}")
    print()

Product 0 (Invalid Product 1): Invalid
  Error: None is not of type 'integer'
  Path: id

Product 1 (): Invalid
  Error: '' should be non-empty
  Path: name

Product 2 (Invalid Product 3): Invalid
  Error: -10.5 is less than or equal to the minimum of 0
  Path: price

Product 3 (Invalid Product 4): Invalid
  Error: 0 is less than or equal to the minimum of 0
  Path: price

Product 4 (Invalid Product 5): Invalid
  Error: 'category' is a required property
  Path: root

Product 5 (Invalid Product 6): Invalid
  Error: 'not_a_number' is not of type 'integer'
  Path: id

Product 6 (Invalid Product 7): Invalid
  Error: '25.99' is not of type 'number'
  Path: price

Product 7 (Invalid Product 8): Valid

Product 8 (Invalid Product 9): Valid

Product 9 (Invalid Product 10): Valid

Product 10 (Invalid Product 11): Valid

Product 11 (Invalid Product 12): Invalid
  Error: '$45.99' is not of type 'number'
  Path: price
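
Note that `validate` raises a single `ValidationError` per record, so the listing above shows at most one problem per product even when a record breaks several rules. A minimal sketch of how all violations of one record could be collected, assuming a hypothetical schema and record:

```python
from jsonschema import Draft7Validator

# Hypothetical schema and record for illustration only.
schema = {
    "type": "object",
    "required": ["id", "name", "price"],
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "exclusiveMinimum": 0},
    },
}
record = {"id": "abc", "name": "", "price": -5}

validator = Draft7Validator(schema)

# iter_errors yields every violation instead of stopping at the first one.
errors = sorted(validator.iter_errors(record), key=lambda e: list(e.absolute_path))

# Keep the location and the schema keyword that failed for each violation.
violations = [(list(e.absolute_path), e.validator) for e in errors]
for path, keyword in violations:
    print(path, keyword)
```

Each entry pairs the location of the offending value with the schema keyword that rejected it, which tends to be more stable to match on than the human-readable message.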



## 3 Advanced Schema Features

### 3.1 Complex Data Types and Nested Objects

Extend your product schema to handle more complex data:
- Add an optional `specifications` object with nested properties
- Add a `tags` array of strings
- Add an optional `supplier` object with required `name` and `contact` fields

In [6]:
# Extended product schema with nested objects and arrays
extended_product_schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "type": "object",
    "required": ["id", "name", "category", "price"],
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string", "minLength": 1},
        "category": {"type": "string", "minLength": 1},
        "price": {"type": "number", "exclusiveMinimum": 0},
        "description": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "minLength": 1
            },
            "uniqueItems": True,
            "description": "Product tags"
        },
        "specifications": {
            "type": "object",
            "properties": {
                "weight": {"type": "number", "minimum": 0},
                "dimensions": {
                    "type": "object",
                    "properties": {
                        "length": {"type": "number", "minimum": 0},
                        "width": {"type": "number", "minimum": 0},
                        "height": {"type": "number", "minimum": 0}
                    },
                    "required": ["length", "width", "height"]
                },
                "color": {"type": "string"},
                "material": {"type": "string"}
            },
            "additionalProperties": True
        },
        "supplier": {
            "type": "object",
            "required": ["name", "contact"],
            "properties": {
                "name": {"type": "string", "minLength": 1},
                "contact": {
                    "type": "object",
                    "required": ["phone"],
                    "properties": {
                        "phone": {"type": "string", "pattern": "^\\+?[1-9]\\d{1,14}$"},
                        "email": {"type": "string", "format": "email"},
                        "address": {"type": "string"}
                    }
                }
            }
        }
    },
    "additionalProperties": True
}

print("Extended Product Schema:")
pprint(extended_product_schema)

# Test with first product (which has complex nested data)
try:
    validate(instance=products_data[0], schema=extended_product_schema)
    print("\n✓ Extended schema validates complex product data!")
except ValidationError as e:
    print(f"\n✗ Extended validation failed: {e.message}")

Extended Product Schema:
{'$schema': 'https://json-schema.org/draft/2019-09/schema',
 'additionalProperties': True,
 'properties': {'category': {'minLength': 1, 'type': 'string'},
                'description': {'type': 'string'},
                'id': {'type': 'integer'},
                'in_stock': {'type': 'boolean'},
                'name': {'minLength': 1, 'type': 'string'},
                'price': {'exclusiveMinimum': 0, 'type': 'number'},
                'specifications': {'additionalProperties': True,
                                   'properties': {'color': {'type': 'string'},
                                                  'dimensions': {'properties': {'height': {'minimum': 0,
                                                                                           'type': 'number'},
                                                                                'length': {'minimum': 0,
                                                                                     

### 3.2 String Patterns and Enums

Add validation for:
- Product ID must follow pattern: "PROD-" followed by 4 digits
- Category must be one of: ["electronics", "clothing", "books", "home", "sports"]
- Optional email field in supplier contact must be a valid email address

In [7]:
# Product schema with pattern and enum validation
pattern_schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "type": "object",
    "required": ["id", "name", "category", "price"],
    "properties": {
        "id": {
            "type": "string",
            "pattern": "^PROD-\\d{4}$",
            "description": "Product ID in format PROD-XXXX"
        },
        "name": {"type": "string", "minLength": 1},
        "category": {
            "type": "string",
            "enum": ["electronics", "clothing", "books", "home", "sports"],
            "description": "Product category from allowed values"
        },
        "price": {"type": "number", "exclusiveMinimum": 0},
        "description": {"type": "string"},
        "supplier": {
            "type": "object",
            "properties": {
                "contact": {
                    "type": "object",
                    "properties": {
                        "email": {
                            "type": "string",
                            "format": "email",
                            "description": "Valid email address"
                        }
                    }
                }
            }
        }
    },
    "additionalProperties": False
}

# Test with different data examples
test_cases = [
    {"id": "PROD-1234", "name": "Test Product", "category": "electronics", "price": 99.99},
    {"id": "INVALID-ID", "name": "Bad Product", "category": "electronics", "price": 50.00},
    {"id": "PROD-5678", "name": "Another Product", "category": "invalid_category", "price": 25.00}
]

# Validate each test case and report results
print("Pattern and Enum Validation Results:")
print("=" * 40)

for i, test_case in enumerate(test_cases):
    try:
        validate(instance=test_case, schema=pattern_schema)
        print(f"✓ Test case {i+1}: Valid")
    except ValidationError as e:
        print(f"✗ Test case {i+1}: {e.message}")
    print(f"  Data: {test_case}")
    print()

Pattern and Enum Validation Results:
========================================
✓ Test case 1: Valid
  Data: {'id': 'PROD-1234', 'name': 'Test Product', 'category': 'electronics', 'price': 99.99}

✗ Test case 2: 'INVALID-ID' does not match '^PROD-\\d{4}$'
  Data: {'id': 'INVALID-ID', 'name': 'Bad Product', 'category': 'electronics', 'price': 50.0}

✗ Test case 3: 'invalid_category' is not one of ['electronics', 'clothing', 'books', 'home', 'sports']
  Data: {'id': 'PROD-5678', 'name': 'Another Product', 'category': 'invalid_category', 'price': 25.0}
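
One caveat about the `"format": "email"` rule: the `jsonschema` library treats `format` as an annotation unless a format checker is passed in, so the cell above never actually tests email addresses. A minimal sketch of the difference, assuming a hypothetical one-field schema:

```python
from jsonschema import Draft7Validator, FormatChecker

# Hypothetical one-field schema for illustration.
schema = {
    "type": "object",
    "properties": {"email": {"type": "string", "format": "email"}},
}
record = {"email": "not-an-email"}

# Without a format checker, "format" is not enforced.
lenient = Draft7Validator(schema)
# With a format checker, the built-in email check is applied.
strict = Draft7Validator(schema, format_checker=FormatChecker())

print(lenient.is_valid(record))
print(strict.is_valid(record))
```

The built-in email check is shallow, so it is best seen as a sanity check rather than full address verification. The same `format_checker` argument is accepted by `validate`.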



### 3.3 Conditional Validation

Create a schema that uses conditional validation:
- If category is "electronics", then `warranty_years` field is required
- If category is "clothing", then `size` field is required
- If price > 1000, then `premium_support` field is required

In [8]:
# Schema with conditional validation
conditional_schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "type": "object",
    "required": ["id", "name", "category", "price"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string", "minLength": 1},
        "category": {"type": "string"},
        "price": {"type": "number", "exclusiveMinimum": 0},
        "warranty_years": {"type": "integer", "minimum": 1},
        "size": {"type": "string"},
        "premium_support": {"type": "boolean"}
    },
    "allOf": [
        {
            "if": {"properties": {"category": {"const": "electronics"}}},
            "then": {"required": ["warranty_years"]},
            "else": {}
        },
        {
            "if": {"properties": {"category": {"const": "clothing"}}},
            "then": {"required": ["size"]},
            "else": {}
        },
        {
            "if": {"properties": {"price": {"exclusiveMinimum": 1000}}},
            "then": {"required": ["premium_support"]},
            "else": {}
        }
    ],
    "additionalProperties": False
}

# Test conditional validation
conditional_test_cases = [
    # Electronics without warranty
    {"id": "PROD-1111", "name": "Laptop", "category": "electronics", "price": 899.99},
    # Electronics with warranty
    {"id": "PROD-2222", "name": "Phone", "category": "electronics", "price": 699.99, "warranty_years": 2},
    # Expensive item without premium support
    {"id": "PROD-3333", "name": "Diamond Ring", "category": "jewelry", "price": 5000.00}
]

# Test conditional validation
print("Conditional Validation Results:")
print("=" * 35)

for i, test_case in enumerate(conditional_test_cases):
    try:
        validate(instance=test_case, schema=conditional_schema)
        print(f"✓ Test case {i+1}: Valid")
    except ValidationError as e:
        print(f"✗ Test case {i+1}: {e.message}")
    
    print(f"  Product: {test_case['name']} ({test_case['category']}, ${test_case['price']})")
    print()

Conditional Validation Results:
===================================
✗ Test case 1: 'warranty_years' is a required property
  Product: Laptop (electronics, $899.99)

✓ Test case 2: Valid
  Product: Phone (electronics, $699.99)

✗ Test case 3: 'premium_support' is a required property
  Product: Diamond Ring (jewelry, $5000.0)
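
A subtle point about `if`: a `properties` subschema also matches objects that lack the property, so the condition can fire for records without a `category`. In the schema above this is harmless because `category` and `price` are required at the top level. A minimal sketch of the pitfall and one possible guard, using hypothetical stand-alone schemas:

```python
from jsonschema import Draft7Validator

then_clause = {"required": ["warranty_years"]}

# Without "required", the "if" subschema also matches records that lack "category".
loose = Draft7Validator({
    "if": {"properties": {"category": {"const": "electronics"}}},
    "then": then_clause,
})

# Requiring "category" inside "if" limits the condition to records that have it.
strict = Draft7Validator({
    "if": {
        "properties": {"category": {"const": "electronics"}},
        "required": ["category"],
    },
    "then": then_clause,
})

record = {"name": "Mystery item"}  # hypothetical record with no category
print(loose.is_valid(record))
print(strict.is_valid(record))
```

With the loose variant the record is rejected for a missing `warranty_years` even though it is not an electronics product; the strict variant accepts it.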



## 4 Schema Composition and References

### 4.1 Using Schema References

Create reusable schema definitions using `$ref` to avoid duplication. Define:
- A common `contact` schema
- A common `address` schema
- Use these in both product supplier and customer schemas

In [9]:
# Define reusable schema components
schema_with_refs = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$defs": {
        "contact": {
            "type": "object",
            "required": ["phone"],
            "properties": {
                "phone": {
                    "type": "string",
                    "pattern": "^\\+?[1-9]\\d{1,14}$",
                    "description": "Phone number in international format"
                },
                "email": {
                    "type": "string",
                    "format": "email",
                    "description": "Email address"
                }
            },
            "additionalProperties": False
        },
        "address": {
            "type": "object",
            "required": ["street", "city", "country"],
            "properties": {
                "street": {"type": "string", "minLength": 1},
                "city": {"type": "string", "minLength": 1},
                "state": {"type": "string"},
                "postal_code": {"type": "string"},
                "country": {
                    "type": "string",
                    "pattern": "^[A-Z]{2}$",
                    "description": "Two-letter country code"
                }
            },
            "additionalProperties": False
        }
    },
    "type": "object",
    "required": ["id", "name", "category", "price"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string", "minLength": 1},
        "category": {"type": "string"},
        "price": {"type": "number", "exclusiveMinimum": 0},
        "supplier": {
            "type": "object",
            "required": ["name", "contact"],
            "properties": {
                "name": {"type": "string", "minLength": 1},
                "contact": {"$ref": "#/$defs/contact"},
                "address": {"$ref": "#/$defs/address"}
            },
            "additionalProperties": False
        }
    },
    "additionalProperties": True
}

print("Schema with References:")
pprint(schema_with_refs)

Schema with References:
{'$defs': {'address': {'additionalProperties': False,
                       'properties': {'city': {'minLength': 1,
                                               'type': 'string'},
                                      'country': {'description': 'Two-letter '
                                                                 'country code',
                                                  'pattern': '^[A-Z]{2}$',
                                                  'type': 'string'},
                                      'postal_code': {'type': 'string'},
                                      'state': {'type': 'string'},
                                      'street': {'minLength': 1,
                                                 'type': 'string'}},
                       'required': ['street', 'city', 'country'],
                       'type': 'object'},
           'contact': {'additionalProperties': False,
                       'properties': {'email': {'desc
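
The cell above only prints the schema. To see a shared definition at work, here is a minimal sketch with a trimmed-down, hypothetical schema in which two properties point at the same `$defs` entry; the `first_error_path` helper is an assumption made for illustration.

```python
from jsonschema import ValidationError, validate

# Hypothetical schema: one shared "contact" definition referenced twice.
schema = {
    "$defs": {
        "contact": {
            "type": "object",
            "required": ["phone"],
            "properties": {"phone": {"type": "string"}},
        }
    },
    "type": "object",
    "properties": {
        "supplier": {"$ref": "#/$defs/contact"},
        "reseller": {"$ref": "#/$defs/contact"},
    },
}


def first_error_path(record):
    """Return the path of the reported error, or None if the record is valid."""
    try:
        validate(instance=record, schema=schema)
        return None
    except ValidationError as e:
        return list(e.absolute_path)


good = {"supplier": {"phone": "+1234567890"}, "reseller": {"phone": "+1987654321"}}
bad = {"supplier": {"phone": "+1234567890"}, "reseller": {}}
print(first_error_path(good))
print(first_error_path(bad))
```

Because both properties resolve to the same definition, a change to `contact` applies everywhere it is referenced.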

### 4.2 Customer Data Validation

Create a customer schema using the reusable components and validate customer data.

In [10]:
# Load customer data
with open('input/customers.json', 'r') as f:
    customers_data = json.load(f)

# Customer schema using references
customer_schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$defs": {
        "contact": {
            "type": "object",
            "required": ["email"],
            "properties": {
                "phone": {"type": "string", "pattern": "^\\+?[1-9]\\d{1,14}$"},
                "email": {"type": "string", "format": "email"}
            },
            "additionalProperties": False
        },
        "address": {
            "type": "object",
            "required": ["street", "city", "country"],
            "properties": {
                "street": {"type": "string", "minLength": 1},
                "city": {"type": "string", "minLength": 1},
                "state": {"type": "string"},
                "postal_code": {"type": "string"},
                "country": {"type": "string", "pattern": "^[A-Z]{2}$"}
            },
            "additionalProperties": False
        }
    },
    "type": "object",
    "required": ["customer_id", "name", "contact"],
    "properties": {
        "customer_id": {"type": "integer", "minimum": 1},
        "name": {
            "type": "object",
            "required": ["first", "last"],
            "properties": {
                "first": {"type": "string", "minLength": 1},
                "last": {"type": "string", "minLength": 1}
            }
        },
        "contact": {"$ref": "#/$defs/contact"},
        "address": {"$ref": "#/$defs/address"},
        "date_registered": {
            "type": "string",
            "format": "date",
            "description": "Customer registration date"
        },
        "loyalty_tier": {
            "type": "string",
            "enum": ["bronze", "silver", "gold", "platinum"]
        }
    },
    "additionalProperties": False
}

# Validate customers
print("Customer Validation Results:")
print("=" * 30)

valid_customers = 0
for i, customer in enumerate(customers_data):
    try:
        validate(instance=customer, schema=customer_schema)
        print(f"✓ Customer {i+1}: {customer['name']['first']} {customer['name']['last']} - Valid")
        valid_customers += 1
    except ValidationError as e:
        print(f"✗ Customer {i+1}: {e.message}")
        print(f"  Path: {' -> '.join(str(x) for x in e.absolute_path) if e.absolute_path else 'root'}")

print(f"\nSummary: {valid_customers}/{len(customers_data)} customers are valid")

Customer Validation Results:
==============================
✓ Customer 1: John Doe - Valid
✓ Customer 2: Jane Smith - Valid
✓ Customer 3: Bob Johnson - Valid
✓ Customer 4: Alice Williams - Valid
✓ Customer 5: Charlie Brown - Valid
✓ Customer 6: Diana Miller - Valid
✓ Customer 7: Eva Davis - Valid
✓ Customer 8: Frank Wilson - Valid

Summary: 8/8 customers are valid


## 5 Bulk Validation and Error Reporting

### 5.1 Validate Large Datasets

Create a function to validate large datasets efficiently and generate comprehensive error reports.

In [11]:
def validate_dataset(data, schema, dataset_name="dataset"):
    """Validate a list of records and return detailed results."""
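    # Pick the validator class that matches the schema's "$schema" declaration
    # (the schemas in this exercise declare draft 2019-09, not draft 7);
    # fall back to Draft7Validator when no "$schema" is present
    from jsonschema.validators import validator_for
    validator = validator_for(schema, default=Draft7Validator)(schema)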
    results = {
        'dataset_name': dataset_name,
        'total_records': len(data),
        'valid_records': 0,
        'invalid_records': 0,
        'errors': [],
        'error_summary': {},
        'validation_time': 0
    }
    
    start_time = time.time()
    
    for i, record in enumerate(data):
        errors = list(validator.iter_errors(record))
        if errors:
            results['invalid_records'] += 1
            for error in errors:
                error_info = {
                    'record_index': i,
                    'message': error.message,
                    'path': ' -> '.join(str(x) for x in error.absolute_path) if error.absolute_path else 'root',
                    'invalid_value': error.instance,
                    'schema_path': ' -> '.join(str(x) for x in error.schema_path) if error.schema_path else 'root'
                }
                results['errors'].append(error_info)
                
                # Track error frequency
                error_type = error.validator
                if error_type not in results['error_summary']:
                    results['error_summary'][error_type] = 0
                results['error_summary'][error_type] += 1
        else:
            results['valid_records'] += 1
    
    results['validation_time'] = time.time() - start_time
    return results

def generate_validation_report(validation_results):
    """Generate a summary report of validation results."""
    results = validation_results
    
    print(f"Validation Report for: {results['dataset_name']}")
    print("=" * 50)
    total = results['total_records']
    # Guard against division by zero when the dataset is empty
    valid_pct = results['valid_records'] / total * 100 if total else 0.0
    invalid_pct = results['invalid_records'] / total * 100 if total else 0.0
    print(f"Total Records: {total}")
    print(f"Valid Records: {results['valid_records']} ({valid_pct:.1f}%)")
    print(f"Invalid Records: {results['invalid_records']} ({invalid_pct:.1f}%)")
    print(f"Validation Time: {results['validation_time']:.3f} seconds")
    print()
    
    if results['error_summary']:
        print("Error Summary:")
        for error_type, count in sorted(results['error_summary'].items()):
            print(f"  {error_type}: {count} occurrences")
        print()
    
    if results['errors']:
        print("First 5 Errors:")
        for error in results['errors'][:5]:
            print(f"  Record {error['record_index']}: {error['message']}")
            print(f"    Path: {error['path']}")
            print(f"    Value: {error['invalid_value']}")
            print()

# Test with products dataset
product_results = validate_dataset(products_data, product_schema, "Products Dataset")
generate_validation_report(product_results)

# Test with invalid products
invalid_results = validate_dataset(invalid_products, product_schema, "Invalid Products Dataset")
generate_validation_report(invalid_results)

Validation Report for: Products Dataset
==================================================
Total Records: 8
Valid Records: 8 (100.0%)
Invalid Records: 0 (0.0%)
Validation Time: 0.000 seconds

Validation Report for: Invalid Products Dataset
==================================================
Total Records: 12
Valid Records: 4 (33.3%)
Invalid Records: 8 (66.7%)
Validation Time: 0.000 seconds

Error Summary:
  exclusiveMinimum: 2 occurrences
  minLength: 1 occurrences
  required: 1 occurrences
  type: 4 occurrences

First 5 Errors:
  Record 0: None is not of type 'integer'
    Path: id
    Value: None

  Record 1: '' should be non-empty
    Path: name
    Value: 

  Record 2: -10.5 is less than or equal to the minimum of 0
    Path: price
    Value: -10.5

  Record 3: 0 is less than or equal to the minimum of 0
    Path: price
    Value: 0

  Record 4: 'category' is a required property
    Path: root
    Value: {'id': 5, 'name': 'Invalid Product 5', 'price': 30.0, 'description': 'Product missing required category field'}
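
The report groups errors by schema keyword. Grouping by field can be just as useful when deciding which column to clean first. A minimal sketch, assuming hypothetical error entries shaped like the dictionaries that `validate_dataset` stores in `results['errors']`:

```python
from collections import Counter

# Hypothetical error entries mirroring the structure built by validate_dataset.
errors = [
    {"record_index": 0, "path": "id", "message": "None is not of type 'integer'"},
    {"record_index": 2, "path": "price", "message": "-10.5 is less than or equal to the minimum of 0"},
    {"record_index": 3, "path": "price", "message": "0 is less than or equal to the minimum of 0"},
]

# Count how many errors were reported for each field path.
errors_per_field = Counter(error["path"] for error in errors)

for field, count in errors_per_field.most_common():
    print(f"{field}: {count}")
```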



### 5.2 Data Cleaning Based on Validation

Use validation results to clean and fix data issues automatically where possible.

In [12]:
def clean_product_data(products, validation_results):
    """Clean product data based on validation errors."""
    cleaned_products = []
    repair_log = []
    
    # Create a map of errors by record index
    errors_by_record = {}
    for error in validation_results['errors']:
        record_idx = error['record_index']
        if record_idx not in errors_by_record:
            errors_by_record[record_idx] = []
        errors_by_record[record_idx].append(error)
    
    for i, product in enumerate(products):
        if i not in errors_by_record:
            # Product is valid, keep as is
            cleaned_products.append(product.copy())
        else:
            # Product has errors, try to fix them
            cleaned_product = product.copy()
            fixed_errors = []
            
            for error in errors_by_record[i]:
                if error['message'].startswith('None is not of type'):
                    # Handle fields that are present but null (None)
                    if 'name' in error['path'] and cleaned_product.get('name') is None:
                        cleaned_product['name'] = 'Unknown Product'
                        fixed_errors.append('Set missing name')
                    elif 'category' in error['path'] and cleaned_product.get('category') is None:
                        cleaned_product['category'] = 'uncategorized'
                        fixed_errors.append('Set missing category')
                
                elif 'is not of type' in error['message'] and 'number' in error['message']:
                    # Try to convert string prices to numbers
                    if 'price' in error['path'] and isinstance(cleaned_product.get('price'), str):
                        try:
                            # Remove currency symbols and convert
                            price_str = str(cleaned_product['price']).replace('$', '').replace(',', '')
                            cleaned_product['price'] = float(price_str)
                            fixed_errors.append('Converted price from string to number')
                        except ValueError:
                            pass
                
                elif 'is less than or equal to the minimum' in error['message'] or 'is less than the minimum' in error['message']:
                    # Handle negative or zero prices
                    if 'price' in error['path'] and cleaned_product.get('price', 0) <= 0:
                        cleaned_product['price'] = 1.0  # Set minimum price
                        fixed_errors.append('Fixed negative/zero price')
                
                elif "should be non-empty" in error['message']:
                    # Handle empty strings
                    if 'name' in error['path'] and cleaned_product.get('name') == '':
                        cleaned_product['name'] = 'Unknown Product'
                        fixed_errors.append('Fixed empty name')
            
            if fixed_errors:
                repair_log.append({
                    'record_index': i,
                    'original_name': product.get('name', 'Unknown'),
                    'fixes': fixed_errors
                })
            
            # Only keep the product if we could fix critical issues
            try:
                validate(instance=cleaned_product, schema=product_schema)
                cleaned_products.append(cleaned_product)
            except ValidationError:
                # Still invalid after cleaning, skip this record
                repair_log.append({
                    'record_index': i,
                    'original_name': product.get('name', 'Unknown'),
                    'fixes': ['REMOVED - Could not fix all validation errors']
                })
    
    return cleaned_products, repair_log

# Clean invalid products data and re-validate
print("Cleaning Invalid Products...")
cleaned_products, repair_log = clean_product_data(invalid_products, invalid_results)

print("\nRepair Log:")
for repair in repair_log:
    print(f"Record {repair['record_index']} ({repair['original_name']}):")
    for fix in repair['fixes']:
        print(f"  - {fix}")

# Re-validate cleaned data
print(f"\nOriginal dataset: {len(invalid_products)} products")
print(f"Cleaned dataset: {len(cleaned_products)} products")

if cleaned_products:
    cleaned_results = validate_dataset(cleaned_products, product_schema, "Cleaned Products")
    generate_validation_report(cleaned_results)

Cleaning Invalid Products...

Repair Log:
Record 0 (Invalid Product 1):
  - REMOVED - Could not fix all validation errors
Record 1 ():
  - Fixed empty name
Record 2 (Invalid Product 3):
  - Fixed negative/zero price
Record 3 (Invalid Product 4):
  - Fixed negative/zero price
Record 4 (Invalid Product 5):
  - REMOVED - Could not fix all validation errors
Record 5 (Invalid Product 6):
  - REMOVED - Could not fix all validation errors
Record 6 (Invalid Product 7):
  - Converted price from string to number
Record 11 (Invalid Product 12):
  - Converted price from string to number

Original dataset: 12 products
Cleaned dataset: 9 products
Validation Report for: Cleaned Products
Total Records: 9
Valid Records: 9 (100.0%)
Invalid Records: 0 (0.0%)
Validation Time: 0.000 seconds
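
Because `clean_product_data` can log a record once for a partial fix and again when it is removed, it can help to summarize the log per record. The sketch below assumes the `repair_log` entry shape used above (`record_index`, `original_name`, `fixes`). The helper name `summarize_repair_log` is hypothetical, and the sample entries are illustrative rather than taken from the dataset.

```python
def summarize_repair_log(repair_log):
    """Group repair-log entries by record and classify each record once."""
    removed, repaired = set(), set()
    for entry in repair_log:
        idx = entry['record_index']
        if any(fix.startswith('REMOVED') for fix in entry['fixes']):
            removed.add(idx)
        else:
            repaired.add(idx)
    # A record that was partially fixed but still removed counts as removed only
    repaired -= removed
    return {'repaired': sorted(repaired), 'removed': sorted(removed)}


# Illustrative entries (hypothetical data in the same shape as the repair log)
sample_log = [
    {'record_index': 1, 'original_name': '', 'fixes': ['Fixed empty name']},
    {'record_index': 4, 'original_name': 'Invalid Product 5',
     'fixes': ['Fixed negative/zero price']},
    {'record_index': 4, 'original_name': 'Invalid Product 5',
     'fixes': ['REMOVED - Could not fix all validation errors']},
]
summary = summarize_repair_log(sample_log)
print(summary)  # {'repaired': [1], 'removed': [4]}
```

Counting distinct record indices rather than log entries keeps the repaired and removed totals from overlapping.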



## 6 Integration with Data Pipelines

### 6.1 Schema Evolution and Versioning

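This section shows how to handle schema changes over time while staying backward compatible: each record is validated against the newest schema first and falls back to older versions if that fails. Note that `version` is not a JSON Schema keyword. Validators ignore it, and it is used here only as a label for our own bookkeeping.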

In [13]:
# Version 1 of product schema
product_schema_v1 = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "version": "1.0",
    "type": "object",
    "required": ["id", "name", "price"],
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0}
    },
    "additionalProperties": True
}

# Version 2 adds a required "category" field plus several optional fields
product_schema_v2 = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "version": "2.0",
    "type": "object",
    "required": ["id", "name", "price", "category"],  # Added category as required
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "category": {"type": "string"},  # New required field
        "description": {"type": "string"},  # New optional field
        "tags": {"type": "array", "items": {"type": "string"}},  # New optional field
        "created_at": {"type": "string", "format": "date-time"}  # New optional field
    },
    "additionalProperties": False
}

# Function to validate with fallback
def validate_with_version_fallback(data, schemas):
    """Try validation with multiple schema versions."""
    results = []
    
    # Sort schemas newest-first; compare versions numerically so that "10.0" > "2.0"
    sorted_schemas = sorted(
        schemas,
        key=lambda s: tuple(int(part) for part in s.get('version', '0.0').split('.')),
        reverse=True
    )
    
    for record_idx, record in enumerate(data):
        validation_result = {
            'record_index': record_idx,
            'record': record,
            'valid': False,
            'schema_version': None,
            'errors': []
        }
        
        # Try each schema version
        for schema in sorted_schemas:
            try:
                validate(instance=record, schema=schema)
                validation_result['valid'] = True
                validation_result['schema_version'] = schema.get('version', 'unknown')
                break
            except ValidationError as e:
                validation_result['errors'].append({
                    'schema_version': schema.get('version', 'unknown'),
                    'error': e.message
                })
        
        results.append(validation_result)
    
    return results

# Test with different data versions
legacy_data = [
    {"id": 1, "name": "Old Product", "price": 50.0},  # v1 compatible
    {"id": 2, "name": "New Product", "price": 75.0, "category": "electronics", "description": "Latest model"}  # v2 compatible
]

schemas = [product_schema_v1, product_schema_v2]
version_results = validate_with_version_fallback(legacy_data, schemas)

print("Schema Version Validation Results:")
print("=" * 40)

for result in version_results:
    product_name = result['record'].get('name', 'Unknown')
    if result['valid']:
        print(f"✓ {product_name}: Valid with schema v{result['schema_version']}")
    else:
        print(f"✗ {product_name}: Invalid with all schema versions")
        for error in result['errors']:
            print(f"  v{error['schema_version']}: {error['error']}")
    print()

Schema Version Validation Results:
✓ Old Product: Valid with schema v1.0

✓ New Product: Valid with schema v2.0
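
Since v2 makes `category` required, v1 records only pass through the fallback to the older schema. An alternative is to upgrade legacy records before validation. The following is a minimal sketch, assuming a default category of `"other"` is acceptable for legacy data. `V2_REQUIRED`, `upgrade_v1_record`, and `missing_v2_fields` are hypothetical names and not part of the jsonschema library.

```python
# Mirrors the "required" list of product_schema_v2 (assumption: kept in sync by hand)
V2_REQUIRED = ("id", "name", "price", "category")


def upgrade_v1_record(record, default_category="other"):
    """Return a copy of a v1 record with the fields v2 requires filled in."""
    upgraded = dict(record)  # copy so the caller's data is not mutated
    upgraded.setdefault("category", default_category)
    return upgraded


def missing_v2_fields(record):
    """List the v2-required fields that a record lacks."""
    return [field for field in V2_REQUIRED if field not in record]


old = {"id": 1, "name": "Old Product", "price": 50.0}
new = upgrade_v1_record(old)

print(missing_v2_fields(old))  # ['category']
print(missing_v2_fields(new))  # []
```

After the upgrade, the record could be validated against `product_schema_v2` alone, which avoids keeping every old schema version in the validation path.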



### 6.2 Performance Considerations

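Compare three ways of validating many records against the same schema: calling `validate()` for every record, reusing a single validator instance, and using the boolean `is_valid()` check.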

In [14]:
def benchmark_validation_methods(data, schema, iterations=1000):
    """Benchmark different validation approaches."""
    results = {}
    
    # Method 1: Module-level validate(), which re-checks the schema on every call
    start_time = time.perf_counter()
    for _ in range(iterations):
        for record in data:
            try:
                validate(instance=record, schema=schema)
            except ValidationError:
                pass
    results['basic_validation'] = time.perf_counter() - start_time
    
    # Method 2: Reusable validator instance, created once outside the loop
    validator = Draft7Validator(schema)
    start_time = time.perf_counter()
    for _ in range(iterations):
        for record in data:
            try:
                validator.validate(record)
            except ValidationError:
                pass
    results['precompiled_validator'] = time.perf_counter() - start_time
    
    # Method 3: Boolean check with is_valid(), which raises no exceptions
    start_time = time.perf_counter()
    for _ in range(iterations):
        for record in data:
            validator.is_valid(record)
    results['is_valid_check'] = time.perf_counter() - start_time
    
    return results

# Create a larger dataset for benchmarking
benchmark_data = products_data * 5  # Repeat data to make it larger

n_iterations = 100
print(f"Benchmarking validation performance with {len(benchmark_data)} records...")
benchmark_results = benchmark_validation_methods(benchmark_data, product_schema, iterations=n_iterations)

print("\nPerformance Benchmark Results:")
print("=" * 35)

for method, duration in sorted(benchmark_results.items()):
    # Total validations = records per pass * number of passes
    records_per_second = len(benchmark_data) * n_iterations / duration
    print(f"{method:25s}: {duration:.3f}s ({records_per_second:,.0f} records/sec)")

# Show speedup comparison
baseline = benchmark_results['basic_validation']
print("\nSpeedup vs Basic Validation:")
for method, duration in benchmark_results.items():
    if method != 'basic_validation':
        speedup = baseline / duration
        print(f"{method:25s}: {speedup:.1f}x faster")

Benchmarking validation performance with 40 records...

Performance Benchmark Results:
basic_validation         : 2.747s (1,456 records/sec)
is_valid_check           : 0.059s (68,152 records/sec)
precompiled_validator    : 0.059s (67,271 records/sec)

Speedup vs Basic Validation:
precompiled_validator    : 46.2x faster
is_valid_check           : 46.8x faster
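
Single-run timings like those above can be noisy. Below is a minimal sketch of a more repeatable measurement using the standard library's `timeit`, taking the best of several trials. `best_time_per_call` is a hypothetical helper, and `stand_in_check` is a placeholder workload standing in for a call such as `validator.is_valid(record)`.

```python
import timeit


def best_time_per_call(func, number=1000, repeat=5):
    """Return the best observed average time per call, in seconds."""
    # timeit.repeat calls `func` `number` times per trial and returns one total per trial
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


def stand_in_check():
    # Placeholder workload; swap in e.g. lambda: validator.is_valid(record)
    return isinstance({"id": 1}.get("id"), int)


seconds = best_time_per_call(stand_in_check)
print(f"{seconds * 1e6:.3f} microseconds per call")
```

Taking the minimum over several trials reduces the influence of background activity on the machine, which matters when the compared methods differ by only a few percent.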


## 7 Best Practices and Real-World Applications

### 7.1 Schema Design Best Practices

Key best practices from this exercise:
- Start with required fields, add optional ones gradually
- Use clear, descriptive field names and add descriptions
- Define appropriate constraints (min/max, patterns, enums)
- Plan for schema evolution with versioning
- Use references to avoid duplication
- Consider performance implications for large datasets

In [15]:
# Create a comprehensive, well-documented product schema
final_product_schema = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "$id": "https://example.com/product.schema.json",
    "title": "Product",
    "description": "A product in our catalog with comprehensive validation",
    "version": "2.0",
    "type": "object",
    "required": ["id", "name", "category", "price"],
    "properties": {
        "id": {
            "type": "string",
            "pattern": "^PROD-\\d{4}$",
            "description": "Unique product identifier in format PROD-XXXX",
            "examples": ["PROD-1234", "PROD-5678"]
        },
        "name": {
            "type": "string",
            "minLength": 1,
            "maxLength": 100,
            "description": "Product name (1-100 characters)"
        },
        "category": {
            "type": "string",
            "enum": ["electronics", "clothing", "books", "home", "sports", "other"],
            "description": "Product category from predefined list"
        },
        "price": {
            "type": "number",
            "minimum": 0.01,
            "maximum": 999999.99,
            "multipleOf": 0.01,
            "description": "Product price in USD (0.01 - 999,999.99)"
        },
        "description": {
            "type": "string",
            "maxLength": 1000,
            "description": "Optional product description (max 1000 characters)"
        },
        "in_stock": {
            "type": "boolean",
            "default": True,
            "description": "Whether the product is currently in stock"
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string",
                "minLength": 1,
                "maxLength": 50
            },
            "uniqueItems": True,
            "maxItems": 10,
            "description": "Product tags (max 10, each 1-50 characters)"
        },
        "created_at": {
            "type": "string",
            "format": "date-time",  # jsonschema enforces "format" only when a format_checker is supplied
            "description": "Product creation timestamp in ISO 8601 format"
        },
        "updated_at": {
            "type": "string",
            "format": "date-time",
            "description": "Product last update timestamp in ISO 8601 format"
        }
    },
    "additionalProperties": False,
    "examples": [
        {
            "id": "PROD-1234",
            "name": "Wireless Bluetooth Headphones",
            "category": "electronics",
            "price": 79.99,
            "description": "High-quality wireless headphones with noise cancellation",
            "in_stock": True,
            "tags": ["wireless", "bluetooth", "headphones", "audio"],
            "created_at": "2023-01-15T10:30:00Z",
            "updated_at": "2023-03-20T14:45:00Z"
        }
    ]
}

print("Final Comprehensive Product Schema:")
pprint(final_product_schema)

# Test the comprehensive schema with a properly formatted product
test_product = {
    "id": "PROD-9999",
    "name": "Test Product",
    "category": "electronics",
    "price": 49.99,
    "description": "A test product for schema validation",
    "in_stock": True,
    "tags": ["test", "example"],
    "created_at": "2023-09-04T10:00:00Z",
    "updated_at": "2023-09-04T10:00:00Z"
}

try:
    validate(instance=test_product, schema=final_product_schema)
    print("\n✓ Test product validates successfully against comprehensive schema!")
except ValidationError as e:
    print(f"\n✗ Validation failed: {e.message}")

Final Comprehensive Product Schema:
{'$id': 'https://example.com/product.schema.json',
 '$schema': 'https://json-schema.org/draft/2019-09/schema',
 'additionalProperties': False,
 'description': 'A product in our catalog with comprehensive validation',
 'examples': [{'category': 'electronics',
               'created_at': '2023-01-15T10:30:00Z',
               'description': 'High-quality wireless headphones with noise '
                              'cancellation',
               'id': 'PROD-1234',
               'in_stock': True,
               'name': 'Wireless Bluetooth Headphones',
               'price': 79.99,
               'tags': ['wireless', 'bluetooth', 'headphones', 'audio'],
               'updated_at': '2023-03-20T14:45:00Z'}],
 'properties': {'category': {'description': 'Product category from predefined '
                                            'list',
                             'enum': ['electronics',
                                      'clothing',
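
One caveat about `"multipleOf": 0.01` in the schema above: JSON numbers are typically parsed into binary floating-point values in Python, so a multiple-of check on decimal fractions may behave unexpectedly for some prices, depending on how the validator does the arithmetic. A hedged alternative is to check currency precision with `decimal.Decimal` before or alongside schema validation. `has_cent_precision` is a hypothetical helper and not part of jsonschema.

```python
from decimal import Decimal


def has_cent_precision(price):
    """Return True if `price` has at most two decimal places."""
    # str() yields the shortest repr of a float, e.g. 79.99 -> "79.99"
    return Decimal(str(price)) % Decimal("0.01") == 0


print(has_cent_precision(79.99))   # True
print(has_cent_precision(49.999))  # False
print(has_cent_precision(50))      # True
```

Storing prices as integer cents is another common way to sidestep the issue, at the cost of changing the data contract.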
            

### 7.2 Integration Summary

JSON Schema validation fits into data integration workflows at several points:

1. **Data Ingestion**: Validate incoming data at entry points
2. **ETL Pipelines**: Use schemas to ensure data quality throughout transformation
3. **API Design**: Define clear contracts for data exchange
4. **Data Governance**: Enforce data standards and compliance
5. **Error Handling**: Implement graceful degradation for invalid data
6. **Monitoring**: Track validation metrics for data quality monitoring

In [16]:
def data_integration_pipeline(raw_data, schema, output_file):
    """A simple pipeline that validates, cleans, and saves data."""
    pipeline_stats = {
        'input_records': len(raw_data),
        'valid_records': 0,
        'cleaned_records': 0,
        'rejected_records': 0,
        'output_records': 0
    }
    
    print(f"Starting data integration pipeline with {len(raw_data)} records...")
    
    # Step 1: Validate incoming data
    print("Step 1: Validating data...")
    validation_results = validate_dataset(raw_data, schema, "Pipeline Input")
    pipeline_stats['valid_records'] = validation_results['valid_records']
    
    # Step 2: Clean invalid data where possible
    print("Step 2: Cleaning invalid data...")
    if validation_results['invalid_records'] > 0:
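        cleaned_data, repair_log = clean_product_data(raw_data, validation_results)
        # A record can be logged twice (partially fixed, then removed), so count
        # a record as cleaned only if it was fixed and never removed
        removed = {r['record_index'] for r in repair_log
                   if any('REMOVED' in fix for fix in r['fixes'])}
        repaired = {r['record_index'] for r in repair_log} - removed
        pipeline_stats['cleaned_records'] = len(repaired)
        pipeline_stats['rejected_records'] = validation_results['invalid_records'] - pipeline_stats['cleaned_records']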
    else:
        # Nothing to repair: every record already passed validation
        cleaned_data = list(raw_data)
    
    # Step 3: Final validation
    print("Step 3: Final validation...")
    final_validation = validate_dataset(cleaned_data, schema, "Pipeline Output")
    pipeline_stats['output_records'] = final_validation['valid_records']
    
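    # Step 4: Save to output file (create the target folder if it does not exist)
    print(f"Step 4: Saving {pipeline_stats['output_records']} records to {output_file}...")
    import os  # local import: os is not among the notebook's top-level imports
    os.makedirs(os.path.dirname(output_file) or '.', exist_ok=True)
    with open(output_file, 'w') as f:
        json.dump(cleaned_data, f, indent=2)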
    
    # Report pipeline statistics
    print("\nPipeline Statistics:")
    print("=" * 20)
    print(f"Input records:    {pipeline_stats['input_records']}")
    print(f"Valid records:    {pipeline_stats['valid_records']}")
    print(f"Cleaned records:  {pipeline_stats['cleaned_records']}")
    print(f"Rejected records: {pipeline_stats['rejected_records']}")
    print(f"Output records:   {pipeline_stats['output_records']}")
    if pipeline_stats['input_records'] > 0:
        print(f"Data quality:     {pipeline_stats['output_records']/pipeline_stats['input_records']*100:.1f}%")
    
    return pipeline_stats

# Run the pipeline with sample data
pipeline_stats = data_integration_pipeline(
    invalid_products, 
    product_schema, 
    'output/cleaned_products.json'
)

print("\n" + "=" * 60)
print("JSON Schema Exercise Complete!")
print("=" * 60)
print("\nKey concepts covered:")
print("• Basic schema definition and validation")
print("• Advanced features (patterns, conditionals, references)")
print("• Error handling and data cleaning")
print("• Schema evolution and versioning")
print("• Performance optimization")
print("• Integration with data pipelines")
print("\nThis foundation will help you implement robust data validation")
print("in your data integration and ETL workflows!")

Starting data integration pipeline with 12 records...
Step 1: Validating data...
Step 2: Cleaning invalid data...
Step 3: Final validation...
Step 4: Saving 9 records to output/cleaned_products.json...

Pipeline Statistics:
Input records:    12
Valid records:    4
Cleaned records:  5
Rejected records: 3
Output records:   9
Data quality:     75.0%

JSON Schema Exercise Complete!

Key concepts covered:
• Basic schema definition and validation
• Advanced features (patterns, conditionals, references)
• Error handling and data cleaning
• Schema evolution and versioning
• Performance optimization
• Integration with data pipelines

This foundation will help you implement robust data validation
in your data integration and ETL workflows!
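
The pipeline above drops rejected records. For the error-handling and monitoring points listed in this section, one option is to write rejected records to a separate file so they can be inspected later. The following is a minimal sketch using only the standard library. `write_pipeline_outputs`, the `rejected_products.json` file name, and the sample records are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_pipeline_outputs(valid_records, rejected_records, out_dir):
    """Write valid and rejected records to separate JSON files."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    valid_file = out_path / "cleaned_products.json"
    rejected_file = out_path / "rejected_products.json"  # hypothetical quarantine file
    valid_file.write_text(json.dumps(valid_records, indent=2))
    rejected_file.write_text(json.dumps(rejected_records, indent=2))

    total = len(valid_records) + len(rejected_records)
    return {
        "valid": len(valid_records),
        "rejected": len(rejected_records),
        # Share of input records that reached the main output
        "pass_rate": len(valid_records) / total if total else 0.0,
    }


# Demo in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    stats = write_pipeline_outputs(
        [{"id": 1, "name": "Valid Product", "price": 9.99}],
        [{"id": 2, "name": "", "price": -1}],
        Path(tmp) / "output",
    )
    print(stats)  # {'valid': 1, 'rejected': 1, 'pass_rate': 0.5}
```

Tracking the returned `pass_rate` over successive runs gives a simple data quality metric to monitor.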
