# Session 1.4: Advanced Data Structures

## **Important for Handling Nested Schemas in DataFrames**

### **Learning Objectives**
By the end of this session, you will:
- Understand nested lists, dictionaries, and custom data structures
- Manipulate complex healthcare records
- Prepare data for PySpark DataFrame schemas
- Apply advanced structures to real-world healthcare scenarios

---

### **Relevance to PySpark**
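PySpark DataFrames can hold nested columns (structs, arrays, and maps), which correspond closely to Python dictionaries and lists. Being comfortable with nested Python structures makes it easier to define those schemas and to process hierarchical healthcare data, such as one patient with many visits.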
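
To make "nested schema" concrete, the sketch below walks a plain Python record and builds a text description of its shape. `describe_schema` is a hypothetical helper written for this session, not a PySpark function, and the `struct<...>` / `array<...>` notation is only loosely modelled on the type strings Spark prints for nested columns. It assumes every element of a list has the same shape as the first one.

```python
def describe_schema(value):
    """Return a compact type description for a nested Python value."""
    if isinstance(value, dict):
        fields = ','.join(f'{key}:{describe_schema(item)}' for key, item in value.items())
        return f'struct<{fields}>'
    if isinstance(value, list):
        # Assumption: all elements share the shape of the first element.
        element = describe_schema(value[0]) if value else 'unknown'
        return f'array<{element}>'
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return 'boolean'
    if isinstance(value, int):
        return 'bigint'
    if isinstance(value, float):
        return 'double'
    if isinstance(value, str):
        return 'string'
    return 'unknown'

record = {
    'id': 'PT001',
    'name': 'John Doe',
    'visits': [{'date': '2025-01-10', 'bp': '140/90'}],
}
schema_text = describe_schema(record)
print(schema_text)
```

A dictionary becomes a struct and a list becomes an array, which is the same mapping to keep in mind when a nested record is later loaded into a DataFrame.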

---

## 1. Nested Lists and Dictionaries

```python
# Nested patient records
patients = [
    {
        'id': 'PT001',
        'name': 'John Doe',
        'visits': [
            {'date': '2025-01-10', 'bp': '140/90', 'notes': 'Routine checkup'},
            {'date': '2025-03-15', 'bp': '135/85', 'notes': 'Follow-up'}
        ]
    },
    {
        'id': 'PT002',
        'name': 'Jane Smith',
        'visits': [
            {'date': '2025-02-20', 'bp': '130/80', 'notes': 'Diabetes management'}
        ]
    }
]

# Access nested data
for patient in patients:
    print(f"Patient {patient['name']} visits:")
    for visit in patient['visits']:
        print(f"  {visit['date']}: BP {visit['bp']} - {visit['notes']}")
```
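
The blood pressure values above are stored as strings such as `'140/90'`, which cannot be compared or aggregated numerically. As a minimal sketch, the hypothetical helper `parse_bp` below splits a reading into two integers, and the loop uses `.get()` so that a record without a `'visits'` key does not raise a `KeyError`. It assumes every reading follows the `systolic/diastolic` format.

```python
def parse_bp(reading):
    """Split a 'systolic/diastolic' string such as '140/90' into two ints."""
    systolic, diastolic = reading.split('/')
    return int(systolic), int(diastolic)

# Same shape as the patient records above, trimmed to one patient.
sample_patients = [
    {
        'id': 'PT001',
        'name': 'John Doe',
        'visits': [
            {'date': '2025-01-10', 'bp': '140/90', 'notes': 'Routine checkup'},
            {'date': '2025-03-15', 'bp': '135/85', 'notes': 'Follow-up'},
        ],
    }
]

for patient in sample_patients:
    # .get() with a default keeps the loop working if 'visits' is missing.
    readings = [parse_bp(visit['bp']) for visit in patient.get('visits', [])]
    # default=None avoids a ValueError when a patient has no readings.
    highest_systolic = max((systolic for systolic, _ in readings), default=None)
    print(f"{patient['name']}: highest systolic = {highest_systolic}")
```

Storing the two numbers separately also maps more naturally onto integer columns than a single string column would.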

## 2. Custom Data Structures with Classes

```python
# Define a custom class for healthcare records
class PatientRecord:
    def __init__(self, patient_id, name, visits=None):
        self.id = patient_id
        self.name = name
        # Copy the list so later changes do not affect the caller's list
        self.visits = list(visits) if visits else []

    def add_visit(self, date, bp, notes):
        """Append a visit dictionary to this patient's history."""
        self.visits.append({'date': date, 'bp': bp, 'notes': notes})

    def get_latest_visit(self):
        """Return the most recently added visit, or None if there are none."""
        return self.visits[-1] if self.visits else None

# Create patient records
john = PatientRecord('PT001', 'John Doe', [{'date': '2025-01-10', 'bp': '140/90', 'notes': 'Routine checkup'}])
john.add_visit('2025-03-15', '135/85', 'Follow-up')

latest = john.get_latest_visit()
print(f"Latest visit for {john.name}: {latest}")
```
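
A hand-written class like `PatientRecord` eventually has to be turned back into dictionaries before it can be loaded into a DataFrame. One possible alternative, sketched below, uses the standard library's `dataclasses` module, whose `asdict()` function converts nested dataclass instances into nested dictionaries. The `Visit` and `Patient` class names are illustrative choices, not part of the session's original code.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Visit:
    date: str
    bp: str
    notes: str

@dataclass
class Patient:
    id: str
    name: str
    # default_factory gives each patient its own list instead of a shared one
    visits: list = field(default_factory=list)

jane = Patient('PT002', 'Jane Smith')
jane.visits.append(Visit('2025-02-20', '130/80', 'Diabetes management'))

# asdict() recurses into nested dataclasses and lists
jane_dict = asdict(jane)
print(jane_dict)
```

The result has the same shape as the dictionaries in Section 1, so the same access and flattening code applies to it.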

## 3. Data Transformation for Nested Schemas

```python
# Transform nested patient data for DataFrame schema
def flatten_patient_visits(patients):
    flat_records = []
    for patient in patients:
        for visit in patient['visits']:
            record = {
                'id': patient['id'],
                'name': patient['name'],
                'visit_date': visit['date'],
                'bp': visit['bp'],
                'notes': visit['notes']
            }
            flat_records.append(record)
    return flat_records

flat_data = flatten_patient_visits(patients)
print('Flattened patient visit records:')
for record in flat_data:
    print(record)
```
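
One consequence of `flatten_patient_visits` is that a patient with an empty `visits` list produces no rows at all and so disappears from the flattened data. If that is not wanted, a variant along the following lines keeps such patients as a single row with `None` in the visit fields. `flatten_keep_empty` and the patient `PT003` are hypothetical, added only for illustration. The difference is similar in spirit to PySpark's `explode` versus `explode_outer`.

```python
def flatten_keep_empty(patients):
    """Flatten visits, emitting one row with None fields for patients without visits."""
    rows = []
    for patient in patients:
        base = {'id': patient['id'], 'name': patient['name']}
        # Use a single None placeholder when the list is missing or empty.
        visits = patient.get('visits') or [None]
        for visit in visits:
            visit = visit or {}
            rows.append({
                **base,
                'visit_date': visit.get('date'),
                'bp': visit.get('bp'),
                'notes': visit.get('notes'),
            })
    return rows

sample = [
    {'id': 'PT001', 'name': 'John Doe',
     'visits': [{'date': '2025-01-10', 'bp': '140/90', 'notes': 'Routine checkup'}]},
    {'id': 'PT003', 'name': 'New Patient', 'visits': []},  # hypothetical patient
]
rows = flatten_keep_empty(sample)
for row in rows:
    print(row)
```

Whether to keep or drop patients without visits depends on the analysis: visit-level statistics usually want them dropped, while patient counts usually want them kept.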

## 4. Practice Exercise

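Design a nested healthcare data structure and write code to flatten it for DataFrame use. The clinical trial data below is a starting point: it nests visits inside subjects inside trials, so the flattened result should contain one record per visit with the trial ID, subject ID, visit date, and outcome.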

```python
# Exercise: Nested clinical trial data
clinical_trials = [
    {
        'trial_id': 'T001',
        'subjects': [
            {
                'id': 'S001',
                'visits': [
                    {'date': '2025-01-10', 'outcome': 'Improved'},
                    {'date': '2025-02-10', 'outcome': 'Stable'}
                ]
            },
            {'id': 'S002', 'visits': [{'date': '2025-01-12', 'outcome': 'No Change'}]}
        ]
    },
    {
        'trial_id': 'T002',
        'subjects': [
            {'id': 'S003', 'visits': [{'date': '2025-03-01', 'outcome': 'Improved'}]}
        ]
    }
]

# TODO: Write a function to flatten all subject visits for DataFrame
# Your code here
```

---

## Summary

In this session, you learned:
- ✅ How to work with nested lists and dictionaries
- ✅ How to build custom data structures with classes
- ✅ How to transform nested data for DataFrame schemas
- ✅ Practice with real-world healthcare data
- ✅ Essential skills for handling nested schemas in PySpark

**Next:** Session 1.5 - Conditions and Loops