# Session 1.3: Data Structures

## **Core Tools: Lists, Dicts, and Comprehensions**

### **Learning Objectives**
By the end of this session, you will:
- Master Python lists, dictionaries, tuples, and sets
- Use list and dictionary comprehensions
- Apply data structures to healthcare data processing
- Understand the foundations for PySpark DataFrame operations

---

### **Relevance to PySpark**
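Data structures are fundamental to PySpark. Collecting a DataFrame returns a Python list of `Row` objects, each `Row` can be read by column name much like a dictionary, and filtering or transforming rows follows the same pattern as a comprehension. Being fluent with lists, dictionaries, and comprehensions makes DataFrame rows, columns, and transformations easier to reason about.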
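
As a rough plain-Python sketch of that connection (no PySpark required), the example below treats a list of dictionaries as rows and uses comprehensions to select a column, filter rows, and pivot to a column-oriented layout. The `rows` data is made up for illustration, and the mapping to DataFrame operations is an analogy rather than how PySpark works internally.

```python
# Hypothetical sample rows; each dict plays the role of one DataFrame row.
rows = [
    {'id': 'PT001', 'age': 45},
    {'id': 'PT002', 'age': 32},
    {'id': 'PT003', 'age': 58},
]

# "Select" one column: pull the same key out of every row.
ages = [row['age'] for row in rows]

# "Filter" rows: keep only those matching a condition.
over_40 = [row for row in rows if row['age'] > 40]

# Pivot from row-oriented to column-oriented storage.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(ages)
print([row['id'] for row in over_40])
print(columns)
```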

---

## 1. Lists - Patient Data Collections

In [None]:
# Patient vital signs over time
heart_rates = [72, 75, 78, 73, 76, 80, 74]
temperatures = [98.6, 99.1, 98.8, 99.2, 98.9, 99.0, 98.7]

print(f"Heart rates: {heart_rates}")
print(f"Average heart rate: {sum(heart_rates) / len(heart_rates):.1f} bpm")
print(f"Max temperature: {max(temperatures)}°F")
print(f"Min temperature: {min(temperatures)}°F")

# Adding new readings: append() adds one item, extend() adds each item from an iterable
heart_rates.append(77)
temperatures.extend([98.8, 99.3])

print(f"Updated heart rates: {heart_rates}")
print(f"Total readings: {len(heart_rates)} heart rate, {len(temperatures)} temperature")
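
Beyond appending, lists support indexing, slicing, and sorting. The short sketch below reuses the original seven heart rate readings to show a few of these operations; the variable names are illustrative.

```python
heart_rates = [72, 75, 78, 73, 76, 80, 74]

latest = heart_rates[-1]        # negative index: the last reading
last_three = heart_rates[-3:]   # slice: the three most recent readings
ranked = sorted(heart_rates)    # returns a new sorted list; the original is unchanged

# Position of the highest reading in the original sequence
peak_position = heart_rates.index(max(heart_rates))

print(f"Latest: {latest}, last three: {last_three}")
print(f"Sorted: {ranked}")
print(f"Peak reading at position {peak_position}")
```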

## 2. Dictionaries - Patient Records

In [None]:
# Patient information dictionary
patient_001 = {
    'id': 'PT001',
    'name': 'John Doe',
    'age': 45,
    'gender': 'M',
    'diagnosis': 'Hypertension',
    'medications': ['Lisinopril', 'Hydrochlorothiazide'],
    'allergies': ['Penicillin'],
    'vitals': {
        'blood_pressure': '140/90',
        'heart_rate': 78,
        'temperature': 98.6
    }
}

# Accessing patient data
print(f"Patient: {patient_001['name']} (ID: {patient_001['id']})")
print(f"Diagnosis: {patient_001['diagnosis']}")
print(f"Medications: {', '.join(patient_001['medications'])}")
print(f"Heart Rate: {patient_001['vitals']['heart_rate']} bpm")

# Adding new information
patient_001['last_visit'] = '2025-01-15'
patient_001['vitals']['weight'] = 175.5

print(f"Last visit: {patient_001['last_visit']}")
print(f"Weight: {patient_001['vitals']['weight']} lbs")
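
Real records often have missing fields, and indexing a missing key with `[]` raises `KeyError`. A minimal sketch of safer access patterns, using a trimmed-down, hypothetical version of the patient record:

```python
# Trimmed-down record for illustration; 'phone' and 'weight' are deliberately absent.
patient = {'id': 'PT001', 'name': 'John Doe', 'vitals': {'heart_rate': 78}}

# .get() returns a default instead of raising KeyError.
phone = patient.get('phone', 'not on file')

# Chain .get() with an empty-dict default to read nested keys safely.
weight = patient.get('vitals', {}).get('weight')

# `in` tests for a key without reading its value.
has_vitals = 'vitals' in patient

# .items() yields (key, value) pairs in insertion order.
for key, value in patient.items():
    print(f"{key}: {value}")

print(f"Phone: {phone}, weight: {weight}, has vitals: {has_vitals}")
```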

## 3. Lists of Dictionaries - Multiple Patients

In [None]:
# Multiple patient records (similar to DataFrame rows)
patients = [
    {'id': 'PT001', 'name': 'John Doe', 'age': 45, 'diagnosis': 'Hypertension'},
    {'id': 'PT002', 'name': 'Jane Smith', 'age': 32, 'diagnosis': 'Diabetes'},
    {'id': 'PT003', 'name': 'Bob Johnson', 'age': 58, 'diagnosis': 'Heart Disease'},
    {'id': 'PT004', 'name': 'Alice Brown', 'age': 29, 'diagnosis': 'Asthma'}
]

# Processing patient data
print("Patient Summary:")
print("-" * 50)
for patient in patients:
    print(f"{patient['id']}: {patient['name']}, Age {patient['age']}, {patient['diagnosis']}")

# Calculate average age
total_age = sum(patient['age'] for patient in patients)
average_age = total_age / len(patients)
print(f"\nAverage patient age: {average_age:.1f} years")
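
Sorting and ranking rows is another common operation on a list of dictionaries. The sketch below, which repeats the `patients` list so it can run on its own, uses the `key` argument of `sorted()` and `max()` to order records by a chosen field.

```python
patients = [
    {'id': 'PT001', 'name': 'John Doe', 'age': 45, 'diagnosis': 'Hypertension'},
    {'id': 'PT002', 'name': 'Jane Smith', 'age': 32, 'diagnosis': 'Diabetes'},
    {'id': 'PT003', 'name': 'Bob Johnson', 'age': 58, 'diagnosis': 'Heart Disease'},
    {'id': 'PT004', 'name': 'Alice Brown', 'age': 29, 'diagnosis': 'Asthma'},
]

# key= tells sorted() which value to compare for each record.
by_age = sorted(patients, key=lambda p: p['age'])
names_youngest_first = [p['name'] for p in by_age]

# max() accepts the same key= argument.
oldest = max(patients, key=lambda p: p['age'])

print(f"Youngest to oldest: {names_youngest_first}")
print(f"Oldest patient: {oldest['name']} ({oldest['age']})")
```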

## 4. List Comprehensions - Data Filtering and Transformation

In [None]:
# Filter patients by age group
senior_patients = [p for p in patients if p['age'] >= 50]
young_patients = [p for p in patients if p['age'] < 35]

print("Senior patients (50+):")
for patient in senior_patients:
    print(f"  {patient['name']}, Age {patient['age']}")

print("\nYoung patients (<35):")
for patient in young_patients:
    print(f"  {patient['name']}, Age {patient['age']}")

# Extract specific data
patient_names = [p['name'] for p in patients]
patient_ages = [p['age'] for p in patients]
chronic_conditions = [p['diagnosis'] for p in patients if p['diagnosis'] in ['Diabetes', 'Hypertension']]

print(f"\nAll patient names: {patient_names}")
print(f"Chronic conditions: {chronic_conditions}")
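
Comprehensions can also transform values, flatten nested lists, and feed `any()` or `all()`. A small sketch follows; the `medications` lists for the second and third patients are hypothetical and added only for illustration.

```python
# Sample records; medication lists for Jane and Bob are made up for this example.
patients = [
    {'name': 'John Doe', 'age': 45, 'medications': ['Lisinopril', 'Hydrochlorothiazide']},
    {'name': 'Jane Smith', 'age': 32, 'medications': ['Metformin']},
    {'name': 'Bob Johnson', 'age': 58, 'medications': []},
]

# Transform while iterating: build a display label for each patient.
labels = [f"{p['name'].upper()} ({p['age']})" for p in patients]

# Nested comprehension: flatten every medication list into one list.
all_meds = [med for p in patients for med in p['medications']]

# Generator expressions feed any()/all() without building an intermediate list.
anyone_over_50 = any(p['age'] > 50 for p in patients)
everyone_adult = all(p['age'] >= 18 for p in patients)

print(labels)
print(all_meds)
print(anyone_over_50, everyone_adult)
```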

## 5. Dictionary Comprehensions - Data Aggregation

In [None]:
# Create ID to name mapping
id_to_name = {p['id']: p['name'] for p in patients}
print(f"ID to Name mapping: {id_to_name}")

# Age categories
age_categories = {
    p['name']: 'Senior' if p['age'] >= 50 else ('Adult' if p['age'] >= 18 else 'Minor')
    for p in patients
}
print(f"Age categories: {age_categories}")

# Diagnosis frequency count (Counter is a dict subclass that tallies items)
from collections import Counter
diagnosis_counts = Counter(p['diagnosis'] for p in patients)
print(f"Diagnosis frequency: {dict(diagnosis_counts)}")
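
Counting is one form of aggregation; grouping values and then summarizing each group is another. The sketch below is one possible approach, using `collections.defaultdict` to group ages by diagnosis and a dictionary comprehension to average them. The `PT006` record is hypothetical, added so that one group has more than one member.

```python
from collections import defaultdict

patients = [
    {'id': 'PT001', 'age': 45, 'diagnosis': 'Hypertension'},
    {'id': 'PT002', 'age': 32, 'diagnosis': 'Diabetes'},
    {'id': 'PT003', 'age': 58, 'diagnosis': 'Heart Disease'},
    {'id': 'PT006', 'age': 61, 'diagnosis': 'Diabetes'},  # hypothetical extra record
]

# Group ages by diagnosis; a missing key starts as an empty list.
ages_by_diagnosis = defaultdict(list)
for p in patients:
    ages_by_diagnosis[p['diagnosis']].append(p['age'])

# Dict comprehension: reduce each group to its average age.
avg_age_by_diagnosis = {
    diagnosis: sum(ages) / len(ages)
    for diagnosis, ages in ages_by_diagnosis.items()
}

print(dict(ages_by_diagnosis))
print(avg_age_by_diagnosis)
```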

## 6. Tuples - Immutable Data Records

In [None]:
# Lab results as tuples (immutable records)
lab_result_1 = ('PT001', 'Glucose', 95, 'mg/dL', 'Normal')
lab_result_2 = ('PT002', 'Cholesterol', 220, 'mg/dL', 'High')
lab_result_3 = ('PT003', 'Blood Pressure', (140, 90), 'mmHg', 'High')

lab_results = [lab_result_1, lab_result_2, lab_result_3]

# Processing lab results
print("Lab Results:")
for patient_id, test_name, value, unit, status in lab_results:
    print(f"{patient_id}: {test_name} = {value} {unit} ({status})")

# Named tuples for better structure
from collections import namedtuple

LabResult = namedtuple('LabResult', ['patient_id', 'test_name', 'value', 'unit', 'status'])

structured_result = LabResult('PT004', 'Hemoglobin', 14.2, 'g/dL', 'Normal')
print(f"\nStructured result: {structured_result.patient_id} - {structured_result.test_name}: {structured_result.value}")
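
To make "immutable" concrete, the sketch below shows that assigning to a tuple element raises `TypeError`, and that a named tuple is "updated" by building a new one with `_replace()`. The corrected value of 13.9 is a made-up number for illustration.

```python
from collections import namedtuple

lab_result = ('PT001', 'Glucose', 95, 'mg/dL', 'Normal')

# Tuples do not support item assignment.
error_message = None
try:
    lab_result[2] = 100
except TypeError as exc:
    error_message = str(exc)
print(f"Cannot modify tuple: {error_message}")

LabResult = namedtuple('LabResult', ['patient_id', 'test_name', 'value', 'unit', 'status'])
result = LabResult('PT004', 'Hemoglobin', 14.2, 'g/dL', 'Normal')

# _replace() returns a new named tuple; the original is left untouched.
corrected = result._replace(value=13.9)  # 13.9 is a hypothetical corrected value

# _asdict() converts the record to a dictionary keyed by field name.
as_dict = result._asdict()

print(result.value, corrected.value)
print(as_dict)
```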

## 7. Sets - Unique Data Collections

In [None]:
# Unique diagnoses across all patients (sets are unordered, so sort for stable output)
all_diagnoses = {p['diagnosis'] for p in patients}
print(f"Unique diagnoses: {sorted(all_diagnoses)}")

# Patient medication tracking
patient_a_meds = {'Lisinopril', 'Metformin', 'Aspirin'}
patient_b_meds = {'Metformin', 'Insulin', 'Aspirin'}

# Set operations (sets are unordered, so the printed order may vary between runs)
common_meds = patient_a_meds & patient_b_meds  # Intersection
all_meds = patient_a_meds | patient_b_meds     # Union
unique_to_a = patient_a_meds - patient_b_meds  # Difference

print(f"\nCommon medications: {common_meds}")
print(f"All medications: {all_meds}")
print(f"Unique to Patient A: {unique_to_a}")
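
Sets also support symmetric difference, subset tests, and fast membership checks. A short sketch using the same medication sets; the `allergies` set and the `required` pair are hypothetical values chosen for illustration, not clinical guidance.

```python
patient_a_meds = {'Lisinopril', 'Metformin', 'Aspirin'}
patient_b_meds = {'Metformin', 'Insulin', 'Aspirin'}
allergies = {'Penicillin'}           # hypothetical allergy list
required = {'Metformin', 'Aspirin'}  # hypothetical set of required medications

# Symmetric difference: medications taken by exactly one of the two patients.
only_one_patient = patient_a_meds ^ patient_b_meds

# isdisjoint() is True when the two sets share no elements.
no_allergy_conflict = patient_a_meds.isdisjoint(allergies)

# <= tests whether every element of the left set is in the right set.
has_required = required <= patient_a_meds

# Membership tests on a set do not need to scan every element.
takes_insulin = 'Insulin' in patient_b_meds

print(sorted(only_one_patient))
print(no_allergy_conflict, has_required, takes_insulin)
```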

## 8. Nested Data Structures - Complex Patient Records

In [None]:
# Complex patient data structure
comprehensive_patient = {
    'personal_info': {
        'id': 'PT005',
        'name': 'Emily Davis',
        'age': 34,
        'gender': 'F',
        'contact': {
            'phone': '555-0123',
            'email': 'emily.davis@email.com'
        }
    },
    'medical_history': {
        'allergies': ['Shellfish', 'Latex'],
        'chronic_conditions': ['Diabetes Type 2'],
        'surgeries': [
            {'procedure': 'Appendectomy', 'date': '2018-03-15'},
            {'procedure': 'Gallbladder removal', 'date': '2021-07-22'}
        ]
    },
    'current_medications': [
        {'name': 'Metformin', 'dosage': '500mg', 'frequency': 'twice daily'},
        {'name': 'Insulin', 'dosage': '10 units', 'frequency': 'before meals'}
    ],
    'vital_signs_history': [
        {'date': '2025-01-20', 'bp': '120/80', 'hr': 72, 'temp': 98.6},
        {'date': '2025-01-15', 'bp': '125/82', 'hr': 75, 'temp': 98.8}
    ]
}

# Accessing nested data
print(f"Patient: {comprehensive_patient['personal_info']['name']}")
print(f"Email: {comprehensive_patient['personal_info']['contact']['email']}")
print(f"Allergies: {', '.join(comprehensive_patient['medical_history']['allergies'])}")

# Process medications
print("\nCurrent Medications:")
for med in comprehensive_patient['current_medications']:
    print(f"  {med['name']}: {med['dosage']} ({med['frequency']})")

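# Latest vital signs (ISO-format dates sort chronologically, so max() finds the newest entry)
latest_vitals = max(comprehensive_patient['vital_signs_history'], key=lambda v: v['date'])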
print(f"\nLatest vitals ({latest_vitals['date']}): BP {latest_vitals['bp']}, HR {latest_vitals['hr']}")
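
Tabular tools generally expect flat rows rather than deeply nested records. The sketch below shows one way to flatten the vitals history into one row per reading; the output field names `patient_id`, `systolic`, and `diastolic` are choices made for this example.

```python
# Reduced copy of the nested record, kept small so the example runs on its own.
patient = {
    'personal_info': {'id': 'PT005', 'name': 'Emily Davis'},
    'vital_signs_history': [
        {'date': '2025-01-20', 'bp': '120/80', 'hr': 72, 'temp': 98.6},
        {'date': '2025-01-15', 'bp': '125/82', 'hr': 75, 'temp': 98.8},
    ],
}

# One flat row per reading, with the patient id repeated on each row.
# {**reading} copies the reading, so the nested original is not modified.
flat_rows = [
    {'patient_id': patient['personal_info']['id'], **reading}
    for reading in patient['vital_signs_history']
]

# Split the 'bp' string into numeric systolic and diastolic columns.
for row in flat_rows:
    systolic, diastolic = (int(part) for part in row.pop('bp').split('/'))
    row['systolic'] = systolic
    row['diastolic'] = diastolic

for row in flat_rows:
    print(row)
```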

## 9. Practice Exercise

Use the data structures from this session to process a small clinical trial dataset.

In [None]:
# Exercise: Clinical trial data processing
# Given clinical trial data, perform the following operations:

clinical_trial_data = [
    {'subject_id': 'S001', 'group': 'Treatment', 'age': 45, 'response': 'Improved', 'side_effects': ['Nausea']},
    {'subject_id': 'S002', 'group': 'Control', 'age': 52, 'response': 'No Change', 'side_effects': []},
    {'subject_id': 'S003', 'group': 'Treatment', 'age': 38, 'response': 'Improved', 'side_effects': ['Headache', 'Fatigue']},
    {'subject_id': 'S004', 'group': 'Treatment', 'age': 41, 'response': 'Worsened', 'side_effects': ['Nausea', 'Dizziness']},
    {'subject_id': 'S005', 'group': 'Control', 'age': 48, 'response': 'No Change', 'side_effects': []}
]

# TODO: Complete the following tasks:
# 1. Filter subjects in the treatment group
# 2. Calculate average age by group
# 3. Count response types
# 4. List all unique side effects
# 5. Find subjects who improved and reported no side effects

# Your code here

---

## Summary

In this session, you learned:
- ✅ Lists for ordered data collections
- ✅ Dictionaries for key-value data structures
- ✅ List and dictionary comprehensions for data processing
- ✅ Tuples for immutable records
- ✅ Sets for unique data collections
- ✅ Nested data structures for complex healthcare records
- ✅ Foundation concepts for PySpark DataFrame operations

**Next:** Session 1.4 - Advanced Data Structures