# Session 1.6: Functions

## **Includes Lambda, Map, Filter – Core Concepts in PySpark Transformations**

### **Learning Objectives**
By the end of this session, you will be able to:
- Create and use Python functions
- Master lambda functions for data processing
- Apply map, filter, and reduce operations
- Understand function concepts essential for PySpark transformations

---

### **Relevance to PySpark**
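Functions, especially lambdas, and the functional operations map, filter, and reduce are fundamental to PySpark. RDD transformations such as `map` and `filter` take a function (often a lambda) and apply it to every element of a distributed dataset, so the patterns practiced here on plain Python lists carry over directly to data processing pipelines.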
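
As a rough illustration of that correspondence, the sketch below runs a map, filter, reduce chain on a plain Python list and shows an approximate RDD version in comments. The sample temperatures and the 100.4 °F cutoff are illustrative assumptions, and the commented PySpark lines assume an existing `SparkContext` named `sc`; they are not executed here.

```python
from functools import reduce

# Hypothetical sample data: body temperatures in Celsius.
temps_c = [36.5, 38.1, 37.2, 39.0]

# Plain Python: transform, then filter, then aggregate.
temps_f = list(map(lambda c: round(c * 9 / 5 + 32, 1), temps_c))
fevers_f = list(filter(lambda f: f >= 100.4, temps_f))  # illustrative cutoff
max_fever = reduce(lambda a, b: a if a > b else b, fevers_f)

print(temps_f)    # [97.7, 100.6, 99.0, 102.2]
print(fevers_f)   # [100.6, 102.2]
print(max_fever)  # 102.2

# Approximate PySpark RDD equivalent (assumes a SparkContext named `sc`):
#   rdd = sc.parallelize(temps_c)
#   max_fever = (rdd.map(lambda c: round(c * 9 / 5 + 32, 1))
#                   .filter(lambda f: f >= 100.4)
#                   .reduce(lambda a, b: a if a > b else b))
```

The lambdas are identical in both versions; only the container they are applied to changes.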

---

## 1. Basic Functions for Healthcare Data Processing

In [None]:
def calculate_bmi(weight_kg, height_m):
    """Calculate Body Mass Index."""
    return round(weight_kg / (height_m ** 2), 2)

def bmi_category(bmi):
    """Determine BMI category."""
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

def patient_risk_score(age, bmi, has_diabetes, smoker):
    """Calculate patient risk score."""
    score = 0
    
    # Age factor
    if age > 65:
        score += 3
    elif age > 45:
        score += 2
    elif age > 30:
        score += 1
    
    # BMI factor
    if bmi >= 30:
        score += 2
    elif bmi >= 25:
        score += 1
    
    # Conditions
    if has_diabetes:
        score += 2
    if smoker:
        score += 2
    
    return score

# Test the functions
patient_weight = 75.5
patient_height = 1.75
patient_age = 45

bmi = calculate_bmi(patient_weight, patient_height)
category = bmi_category(bmi)
risk = patient_risk_score(patient_age, bmi, True, False)

print(f"Patient BMI: {bmi} ({category})")
print(f"Risk Score: {risk}")
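
The functions above take positional arguments only. A minimal sketch of two related features, default parameter values and keyword arguments, is shown below. The name `calculate_bmi_checked`, the `precision` parameter, and the input check are assumptions added for illustration and are not part of the session's functions.

```python
def calculate_bmi_checked(weight_kg, height_m, precision=2):
    """Hypothetical variant of calculate_bmi with a default argument and validation."""
    # Reject values that would give a meaningless result or divide by zero.
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight_kg and height_m must be positive")
    return round(weight_kg / (height_m ** 2), precision)

# Positional call: precision falls back to its default of 2.
bmi_default = calculate_bmi_checked(75.5, 1.75)

# Keyword call: argument order no longer matters, and precision is overridden.
bmi_keyword = calculate_bmi_checked(height_m=1.75, weight_kg=75.5, precision=1)

print(bmi_default)  # 24.65
print(bmi_keyword)  # 24.7
```

Keyword arguments make calls with several numeric parameters easier to read and harder to get wrong.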

## 2. Lambda Functions - Inline Data Processing

In [None]:
# Patient data
patients = [
    {'id': 'PT001', 'name': 'John Doe', 'age': 45, 'weight': 75.5, 'height': 1.75},
    {'id': 'PT002', 'name': 'Jane Smith', 'age': 32, 'weight': 62.3, 'height': 1.65},
    {'id': 'PT003', 'name': 'Bob Johnson', 'age': 58, 'weight': 88.2, 'height': 1.80},
    {'id': 'PT004', 'name': 'Alice Brown', 'age': 29, 'weight': 55.8, 'height': 1.62}
]

# Lambda functions for calculations
calc_bmi = lambda w, h: round(w / (h ** 2), 2)
is_senior = lambda age: age >= 65
format_name = lambda name: name.upper()

# Apply lambda functions
for patient in patients:
    bmi = calc_bmi(patient['weight'], patient['height'])
    senior_status = "Senior" if is_senior(patient['age']) else "Adult"
    formatted_name = format_name(patient['name'])
    
    print(f"{formatted_name}: BMI {bmi}, Status: {senior_status}")
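
Lambdas are most often passed directly to another function rather than assigned to a name. The sketch below uses a lambda as the `key` argument of the built-ins `sorted` and `max`. The list `sample_patients` is a trimmed, hypothetical copy of the patient data, named differently so it does not overwrite `patients`.

```python
# Hypothetical, trimmed copy of the patient records used in this session.
sample_patients = [
    {'name': 'John Doe', 'age': 45},
    {'name': 'Jane Smith', 'age': 32},
    {'name': 'Bob Johnson', 'age': 58},
]

# The lambda tells sorted() which value to compare for each record.
by_age = sorted(sample_patients, key=lambda p: p['age'])
names_by_age = [p['name'] for p in by_age]

# The same key function works with max() and min().
oldest = max(sample_patients, key=lambda p: p['age'])

print(names_by_age)    # ['Jane Smith', 'John Doe', 'Bob Johnson']
print(oldest['name'])  # Bob Johnson
```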

## 3. Map Function - Data Transformation

In [None]:
# Extract patient ages
ages = [p['age'] for p in patients]
print(f"Original ages: {ages}")

# Transform ages using map
age_groups = list(map(
    lambda age: 'Senior' if age >= 65 else ('Middle-aged' if age >= 45 else 'Adult'),
    ages
))
print(f"Age groups: {age_groups}")

# Calculate BMIs using map
bmis = list(map(lambda p: calc_bmi(p['weight'], p['height']), patients))
print(f"BMIs: {bmis}")

# Temperature conversion (Celsius to Fahrenheit)
temps_celsius = [36.5, 37.2, 38.1, 36.8, 37.5]
temps_fahrenheit = list(map(lambda c: round(c * 9/5 + 32, 1), temps_celsius))

print(f"Celsius: {temps_celsius}")
print(f"Fahrenheit: {temps_fahrenheit}")

# Patient ID formatting
patient_ids = ['pt001', 'pt002', 'pt003', 'pt004']
# Use `pid` rather than `id` to avoid shadowing the built-in id() function
formatted_ids = list(map(lambda pid: pid.upper().replace('PT', 'PATIENT-'), patient_ids))
print(f"Formatted IDs: {formatted_ids}")
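
Every example above wraps `map` in `list(...)` because in Python 3 `map` returns a lazy iterator: the function is not called until the results are consumed. The sketch below makes this visible by recording each call. The helper `age_next_year` and the `calls` list are hypothetical names used only for the demonstration. This behaviour is loosely analogous to PySpark, where transformations are not evaluated until an action requests results.

```python
calls = []  # records every value the mapped function is called with

def age_next_year(age):
    """Hypothetical helper: logs the call, then returns age + 1."""
    calls.append(age)
    return age + 1

sample_ages = [45, 32, 58, 29]

lazy_result = map(age_next_year, sample_ages)
calls_before = len(calls)         # no work has been done yet

materialized = list(lazy_result)  # consuming the iterator triggers the calls
calls_after = len(calls)

second_pass = list(lazy_result)   # the iterator is now exhausted

# An equivalent list comprehension evaluates immediately.
comprehension = [age + 1 for age in sample_ages]

print(calls_before, calls_after)  # 0 4
print(materialized)               # [46, 33, 59, 30]
print(second_pass)                # []
```

Because a `map` object can be consumed only once, convert it to a list when the results are needed more than once.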

## 4. Filter Function - Data Filtering

In [None]:
# Filter patients by criteria
print("All patients:")
for p in patients:
    print(f"  {p['name']}: Age {p['age']}, BMI {calc_bmi(p['weight'], p['height'])}")

# Filter senior patients (65+)
senior_patients = list(filter(lambda p: p['age'] >= 65, patients))
print(f"\nSenior patients: {[p['name'] for p in senior_patients]}")

# Filter patients with high BMI (>25)
high_bmi_patients = list(filter(lambda p: calc_bmi(p['weight'], p['height']) > 25, patients))
print("\nPatients with high BMI:")
for p in high_bmi_patients:
    print(f"  {p['name']}: BMI {calc_bmi(p['weight'], p['height'])}")

# Filter by multiple criteria
middle_aged_overweight = list(filter(
    lambda p: 40 <= p['age'] < 60 and calc_bmi(p['weight'], p['height']) >= 25, 
    patients
))
print(f"\nMiddle-aged overweight patients: {[p['name'] for p in middle_aged_overweight]}")

# Filter lab values
glucose_levels = [85, 92, 110, 125, 95, 140, 88, 155]
normal_glucose = list(filter(lambda x: 70 <= x <= 100, glucose_levels))
high_glucose = list(filter(lambda x: x > 125, glucose_levels))

print(f"\nNormal glucose levels: {normal_glucose}")
print(f"High glucose levels: {high_glucose}")
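
Real lab data often contains missing values, which must be removed before numeric comparisons such as `x > 125` can run. The sketch below shows one way to do that with `filter`, and a common pitfall of `filter(None, ...)`. The list `raw_glucose`, including its `None` and `0` entries, is made up for illustration.

```python
# Hypothetical readings where None marks a missing measurement.
raw_glucose = [85, None, 110, 0, 140, None]

# Explicit predicate: removes only the missing values.
not_missing = list(filter(lambda x: x is not None, raw_glucose))

# filter(None, ...) keeps truthy items, so it also drops the 0 reading.
truthy_only = list(filter(None, raw_glucose))

# Once the data is clean, a list comprehension is an equivalent way to filter.
high = [x for x in not_missing if x > 125]

print(not_missing)  # [85, 110, 0, 140]
print(truthy_only)  # [85, 110, 140]
print(high)         # [140]
```

Whether a `0` reading is valid depends on the dataset; the point is that the two approaches are not interchangeable.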

## 5. Reduce Function - Data Aggregation

In [None]:
from functools import reduce

# Calculate total and average weights
weights = [p['weight'] for p in patients]
total_weight = reduce(lambda x, y: x + y, weights)
print(f"Total weight: {total_weight:.1f} kg")
print(f"Average weight: {total_weight / len(weights):.1f} kg")

# Find maximum and minimum ages
ages = [p['age'] for p in patients]
max_age = reduce(lambda x, y: x if x > y else y, ages)
min_age = reduce(lambda x, y: x if x < y else y, ages)
print(f"Age range: {min_age} - {max_age} years")

# Combine patient names
names = [p['name'] for p in patients]
all_names = reduce(lambda x, y: x + ", " + y, names)
print(f"All patients: {all_names}")

# Flatten the medication lists and find the unique medications
medication_lists = [
    ['Aspirin', 'Lisinopril'],
    ['Metformin', 'Insulin'],
    ['Aspirin', 'Simvastatin', 'Lisinopril'],
    ['Albuterol']
]

all_medications = reduce(lambda x, y: x + y, medication_lists)
unique_medications = sorted(set(all_medications))  # sorted for a stable, repeatable order
print(f"\nAll medications: {all_medications}")
print(f"Unique medications: {unique_medications}")
print(f"Total unique medications: {len(unique_medications)}")
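
Two details of `reduce` are worth a short sketch. First, calling `reduce` on an empty sequence without an initial value raises a `TypeError`, so passing an initializer as the third argument is safer. Second, the accumulator does not have to be a number: here it is a dictionary that counts how often each medication appears. The data and the helper `add_to_counts` are hypothetical, and `collections.Counter` is shown only as a cross-check.

```python
from collections import Counter
from functools import reduce

# Hypothetical prescriptions, one list per patient.
sample_medication_lists = [['Aspirin', 'Lisinopril'], ['Metformin'], ['Aspirin']]

# The initializer [] makes flattening safe even if the outer list is empty.
flat = reduce(lambda acc, meds: acc + meds, sample_medication_lists, [])

def add_to_counts(counts, med):
    """Return a new dict with the count for `med` increased by one."""
    return {**counts, med: counts.get(med, 0) + 1}

# The accumulator starts as an empty dict and grows one medication at a time.
med_counts = reduce(add_to_counts, flat, {})

# With an initializer, reducing an empty list simply returns the initializer.
empty_total = reduce(lambda x, y: x + y, [], 0)

print(med_counts)   # {'Aspirin': 2, 'Lisinopril': 1, 'Metformin': 1}
print(empty_total)  # 0
print(med_counts == dict(Counter(flat)))  # True
```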

## 6. Function Composition - PySpark Style Operations

In [None]:
# Chaining operations (similar to PySpark transformations)
def process_patient_data(patients_list):
    """Process patient data through multiple transformations."""
    
    # Step 1: Add BMI to each patient
    patients_with_bmi = list(map(
        lambda p: {**p, 'bmi': calc_bmi(p['weight'], p['height'])}, 
        patients_list
    ))
    
    # Step 2: Add BMI category
    patients_with_category = list(map(
        lambda p: {**p, 'bmi_category': bmi_category(p['bmi'])}, 
        patients_with_bmi
    ))
    
    # Step 3: Filter patients with BMI > 23
    filtered_patients = list(filter(
        lambda p: p['bmi'] > 23, 
        patients_with_category
    ))
    
    # Step 4: Sort by BMI (highest first)
    sorted_patients = sorted(filtered_patients, key=lambda p: p['bmi'], reverse=True)
    
    return sorted_patients

# Process the data
processed_patients = process_patient_data(patients)

print("Processed Patient Data (BMI > 23, sorted by BMI):")
print("-" * 60)
for p in processed_patients:
    print(f"{p['name']}: BMI {p['bmi']} ({p['bmi_category']})")
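
The function above stores each intermediate result in its own variable. An alternative sketch, under the assumption that every step takes a collection and returns a collection, is a small `pipe` helper that applies the steps in order. The names `pipe`, `keep_bmi_over_23`, `tag_overweight`, and `sort_by_bmi_desc`, and the sample rows, are hypothetical.

```python
from functools import reduce

def pipe(data, *steps):
    """Hypothetical helper: feed data through each step, left to right."""
    return reduce(lambda acc, step: step(acc), steps, data)

# Hypothetical rows that already carry a BMI value.
sample_rows = [
    {'name': 'A', 'bmi': 24.65},
    {'name': 'B', 'bmi': 22.88},
    {'name': 'C', 'bmi': 27.22},
]

def keep_bmi_over_23(rows):
    return filter(lambda r: r['bmi'] > 23, rows)  # lazy iterator

def tag_overweight(rows):
    return map(lambda r: {**r, 'overweight': r['bmi'] >= 25}, rows)  # lazy iterator

def sort_by_bmi_desc(rows):
    return sorted(rows, key=lambda r: r['bmi'], reverse=True)  # consumes the iterator

result = pipe(sample_rows, keep_bmi_over_23, tag_overweight, sort_by_bmi_desc)
result_names = [r['name'] for r in result]

print(result_names)  # ['C', 'A']
```

Since `filter` and `map` return iterators, no intermediate list is built until `sorted` consumes the chain at the end.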

## 7. Higher-Order Functions - Advanced Pattern

In [None]:
def create_threshold_filter(threshold, field):
    """Create a filter function based on threshold and field."""
    return lambda patient: patient[field] > threshold

def create_range_filter(min_val, max_val, field):
    """Create a filter function for range filtering."""
    return lambda patient: min_val <= patient[field] <= max_val

def apply_filters(data, *filters):
    """Apply multiple filters to data."""
    result = data
    for filter_func in filters:
        result = list(filter(filter_func, result))
    return result

# Create filter functions
age_filter = create_threshold_filter(35, 'age')
weight_range_filter = create_range_filter(60, 80, 'weight')

# Apply filters
filtered_data = apply_filters(patients, age_filter, weight_range_filter)

print("Patients over 35 years old weighing 60-80 kg:")
for p in filtered_data:
    print(f"  {p['name']}: Age {p['age']}, Weight {p['weight']} kg")
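
The factory functions above work because each returned lambda remembers the `threshold` it was created with (a closure). The sketch below shows why a factory is preferable to writing lambdas directly inside a loop or comprehension: those lambdas look up the loop variable when they are called, not when they are defined. The names `thresholds`, `late_bound`, and `make_age_filter` are hypothetical.

```python
thresholds = [30, 40, 50]

# Pitfall: every lambda refers to the same variable t, which ends at 50.
late_bound = [lambda age: age > t for t in thresholds]

def make_age_filter(threshold):
    """Each call creates a new scope, so each lambda keeps its own threshold."""
    return lambda age: age > threshold

bound = [make_age_filter(t) for t in thresholds]

late_results = [f(45) for f in late_bound]
bound_results = [f(45) for f in bound]

print(late_results)   # [False, False, False]
print(bound_results)  # [True, True, False]
```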

## 8. Practice Exercise

Clinical trial data analysis using functional programming.

In [None]:
# Exercise: Clinical trial data processing
trial_data = [
    {'subject_id': 'S001', 'group': 'Treatment', 'baseline_score': 65, 'final_score': 78, 'age': 45},
    {'subject_id': 'S002', 'group': 'Control', 'baseline_score': 68, 'final_score': 70, 'age': 52},
    {'subject_id': 'S003', 'group': 'Treatment', 'baseline_score': 72, 'final_score': 85, 'age': 38},
    {'subject_id': 'S004', 'group': 'Treatment', 'baseline_score': 60, 'final_score': 62, 'age': 41},
    {'subject_id': 'S005', 'group': 'Control', 'baseline_score': 75, 'final_score': 73, 'age': 48}
]

# TODO: Using map, filter, reduce, and lambda functions:
# 1. Calculate improvement score for each subject (final - baseline)
# 2. Filter subjects with improvement > 5 points
# 3. Calculate average improvement by group
# 4. Find the subject with maximum improvement
# 5. Create a summary report function

# Your code here

---

## Summary

In this session, you learned:
- ✅ Creating and using Python functions
- ✅ Lambda functions for inline processing
- ✅ Map function for data transformation
- ✅ Filter function for data filtering
- ✅ Reduce function for data aggregation
- ✅ Function composition and chaining
- ✅ Higher-order functions
- ✅ Functional programming concepts essential for PySpark

**Next:** Session 1.7 - Dates and Times