# 🔍 Sorting & Searching Algorithms for ML

**Welcome, St. Mark!** In this notebook, we'll explore sorting and searching algorithms that form the backbone of efficient ML systems.

We'll implement:
1. **QuickSort Analysis** - Fast sorting with performance insights
2. **Binary Search Variants** - Efficient searching in sorted data
3. **External Sorting** - Sorting data larger than memory

By the end, you'll understand how these algorithms enable scalable ML data processing.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import time
import heapq
from collections import defaultdict

print("Setting up sorting and searching algorithms for ML...")

## Method 1: QuickSort Analysis - Fast Sorting with Performance Insights

**Healthcare Analogy:** Like organizing patient files by urgency - quick to sort but can have worst-case scenarios.

**ML Application:** Feature ranking, dataset preprocessing, algorithm analysis.

In [None]:
def quicksort(arr, low=0, high=None):
    if high is None:
        high = len(arr) - 1
    
    if low < high:
        pivot_idx = partition(arr, low, high)
        quicksort(arr, low, pivot_idx - 1)
        quicksort(arr, pivot_idx + 1, high)
    
    return arr

def partition(arr, low, high):
    pivot = arr[high]
    i = low - 1
    
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quicksort_with_analysis(arr):
    """Sort arr in place with quicksort and return operation counts.

    The swap count includes swaps of an element with itself.
    """
    comparisons = 0
    swaps = 0
    max_depth = 0
    
    def _quicksort_analyzed(arr, low, high, depth):
        nonlocal max_depth
        
        # Track the deepest recursion level reached
        max_depth = max(max_depth, depth)
        
        if low < high:
            pivot_idx = _partition_analyzed(arr, low, high)
            _quicksort_analyzed(arr, low, pivot_idx - 1, depth + 1)
            _quicksort_analyzed(arr, pivot_idx + 1, high, depth + 1)
    
    def _partition_analyzed(arr, low, high):
        nonlocal comparisons, swaps
        
        pivot = arr[high]
        i = low - 1
        
        for j in range(low, high):
            comparisons += 1
            if arr[j] <= pivot:
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
                swaps += 1
        
        arr[i + 1], arr[high] = arr[high], arr[i + 1]
        swaps += 1
        return i + 1
    
    _quicksort_analyzed(arr, 0, len(arr) - 1, 0)
    
    return {
        'comparisons': comparisons,
        'swaps': swaps,
        'max_depth': max_depth,
        'sorted_array': arr
    }

In [None]:
# Healthcare application: Patient priority sorting
patients = [
    {"id": "PAT001", "name": "Adebayo Johnson", "urgency": 8, "condition": "Heart attack"},
    {"id": "PAT002", "name": "Fatima Abubakar", "urgency": 3, "condition": "Check-up"},
    {"id": "PAT003", "name": "Chukwuma Nwosu", "urgency": 6, "condition": "Pneumonia"},
    {"id": "PAT004", "name": "Amina Suleiman", "urgency": 9, "condition": "Severe infection"},
    {"id": "PAT005", "name": "Ibrahim Musa", "urgency": 2, "condition": "Minor injury"}
]

urgency_scores = [p["urgency"] for p in patients]

print("🏥 Patient Triage Sorting:")
print("Before sorting (by arrival order):")
for patient in patients:
    print(f"  {patient['id']}: {patient['condition']} (urgency: {patient['urgency']})")

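# Count QuickSort operations on a copy of the urgency scores (sorted ascending)
analysis = quicksort_with_analysis(urgency_scores.copy())
# Order the full patient records with the built-in sort, most urgent first
sorted_patients = sorted(patients, key=lambda x: x["urgency"], reverse=True)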

print("\nAfter sorting (by medical urgency):")
for patient in sorted_patients:
    print(f"  {patient['id']}: {patient['condition']} (urgency: {patient['urgency']})")

print(f"\nQuickSort Performance: {analysis['comparisons']} comparisons, {analysis['swaps']} swaps")

**Healthcare Analysis:**

- **Triage Sorting:** Patients are processed by medical urgency rather than arrival order
- **Performance Insights:** QuickSort averages O(n log n) comparisons, but with the last element as pivot it degrades to O(n²) on input that is already sorted

**Nigerian Healthcare Impact:** Enables proper patient prioritization in emergency departments.
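
The worst case mentioned in the analogy for this method can be made concrete. The sketch below is an illustration, not part of the notebook's own code: `count_comparisons` is a hypothetical helper that uses the same partition scheme as `partition`, optionally moves a randomly chosen element into the pivot position first, and uses an explicit stack so the worst case cannot hit Python's recursion limit.

```python
import random

def count_comparisons(values, randomize_pivot=False, seed=0):
    """Sort a copy of `values` with quicksort and return the number of
    element comparisons. Hypothetical helper, for illustration only."""
    arr = list(values)
    rng = random.Random(seed)
    comparisons = 0
    stack = [(0, len(arr) - 1)]  # explicit stack instead of recursion
    while stack:
        low, high = stack.pop()
        if low >= high:
            continue
        if randomize_pivot:
            # Move a randomly chosen element into the pivot position.
            p = rng.randint(low, high)
            arr[p], arr[high] = arr[high], arr[p]
        pivot = arr[high]
        i = low - 1
        for j in range(low, high):
            comparisons += 1
            if arr[j] <= pivot:
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
        arr[i + 1], arr[high] = arr[high], arr[i + 1]
        # The pivot now sits at index i + 1; sort both sides of it.
        stack.append((low, i))
        stack.append((i + 2, high))
    assert arr == sorted(values)
    return comparisons

already_sorted = list(range(500))
fixed = count_comparisons(already_sorted)
randomized = count_comparisons(already_sorted, randomize_pivot=True)
print(f"Last-element pivot: {fixed} comparisons")
print(f"Random pivot:       {randomized} comparisons")
```

On 500 values that are already sorted, the last-element pivot splits off only one element per pass, so it needs 499 + 498 + ... + 1 = 124,750 comparisons. A random pivot avoids this pattern on sorted input, although any single run can still be unlucky.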

## Method 2: Binary Search Variants - Efficient Searching in Sorted Data

**Healthcare Analogy:** Like finding a specific patient record in a sorted filing system.

**ML Application:** Nearest neighbor search, feature ranking lookup, sorted data access.

In [None]:
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    
    while left <= right:
        mid = (left + right) // 2
        
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    
    return -1

def find_closest_elements(arr, target, k=3):
    if not arr:
        return []
    
    # Find insertion point
    left, right = 0, len(arr)
    while left < right:
        mid = (left + right) // 2
        if arr[mid] < target:
            left = mid + 1
        else:
            right = mid
    idx = left
    
    # Get candidates around insertion point
    left = max(0, idx - k)
    right = min(len(arr) - 1, idx + k)
    candidates = arr[left:right + 1]
    
    # Sort by distance to target
    candidates.sort(key=lambda x: abs(x - target))
    
    return candidates[:k]
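
When the sorted data contains duplicates, an exact-match binary search may return any one of the matching positions. Python's standard `bisect` module provides the leftmost and rightmost variants. The sketch below shows them on made-up data; `count_in_range` is a hypothetical helper, and the list must already be sorted.

```python
from bisect import bisect_left, bisect_right

def count_in_range(sorted_values, lo, hi):
    """Count elements x with lo <= x <= hi in a sorted list (hypothetical helper)."""
    return bisect_right(sorted_values, hi) - bisect_left(sorted_values, lo)

ages = [18, 25, 25, 25, 40, 52, 67]  # illustrative data, already sorted

first_25 = bisect_left(ages, 25)     # index of the first 25
after_25 = bisect_right(ages, 25)    # index just past the last 25
print(first_25, after_25)            # 1 4
print(count_in_range(ages, 25, 52))  # 5
```

Two binary searches are enough to count a range, so the cost stays O(log n) regardless of how many elements fall inside it.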

In [None]:
# Healthcare application: Medical code lookup
medical_codes = {
    "A00": "Cholera",
    "I10": "Essential hypertension",
    "J00": "Acute nasopharyngitis"
}

sorted_codes = sorted(medical_codes.keys())
sorted_descriptions = [medical_codes[code] for code in sorted_codes]

print("🏥 Medical Code Binary Search:")

for code in ["I10", "A00"]:
    idx = binary_search(sorted_codes, code)
    if idx != -1:
        print(f"✓ Found {code}: {sorted_descriptions[idx]}")

# Performance comparison
large_dataset = list(range(10000))
target = 5000

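# time.perf_counter() is the clock intended for measuring short durations
start = time.perf_counter()
for _ in range(1000):
    binary_search(large_dataset, target)
binary_time = time.perf_counter() - start

# `in` on a list scans elements one by one, so it serves as the linear search
start = time.perf_counter()
for _ in range(1000):
    target in large_dataset
linear_time = time.perf_counter() - start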

print(f"\nBinary search: {binary_time:.4f}s")
print(f"Linear search: {linear_time:.4f}s")
print(f"Binary search is {linear_time/binary_time:.1f}x faster!")

**Healthcare Analysis:**

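- **Medical Code Lookup:** Fast ICD-10 code searching in a sorted code list
- **Performance:** Binary search takes O(log n) steps against O(n) for a linear scan; for 10,000 elements that is at most 14 probes instead of up to 10,000

**Nigerian Healthcare Impact:** Enables fast medical code lookup.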
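
The speedup comes from how few elements binary search has to inspect. As a rough check, the sketch below counts probes directly instead of measuring wall-clock time; `binary_search_probes` is a hypothetical variant of `binary_search` written for this illustration.

```python
import math

def binary_search_probes(arr, target):
    """Return (index, probes): the search result and the number of elements
    inspected along the way. Hypothetical helper, for illustration only."""
    left, right = 0, len(arr) - 1
    probes = 0
    while left <= right:
        mid = (left + right) // 2
        probes += 1
        if arr[mid] == target:
            return mid, probes
        if arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1, probes

data = list(range(10000))
# Search for every element and keep the largest probe count seen.
worst = max(binary_search_probes(data, t)[1] for t in data)
print(f"Worst-case probes for n=10000: {worst}")
print(f"floor(log2(n)) + 1 = {math.floor(math.log2(len(data))) + 1}")
```

Both lines print 14. A linear scan for the value 5000 in the same list inspects 5,001 elements before it finds a match.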

## Method 3: External Sorting - Sorting Data Larger Than Memory

**Healthcare Analogy:** Like organizing millions of patient records when you can't fit them all in memory.

**ML Application:** Large dataset preprocessing, distributed computing, big data sorting.

In [None]:
def external_sort(data, chunk_size=1000):
    if len(data) <= chunk_size:
        return sorted(data)
    
    # Phase 1: Sort chunks
    sorted_chunks = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        sorted_chunk = sorted(chunk)
        sorted_chunks.append(sorted_chunk)
    
    # Phase 2: K-way merge
    return k_way_merge(sorted_chunks)

def k_way_merge(sorted_chunks):
    if not sorted_chunks:
        return []
    
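    pq = []
    # One iterator per chunk, indexed the same way as sorted_chunks
    chunk_iterators = [iter(chunk) for chunk in sorted_chunks]
    
    # Seed the heap by consuming the first element of each chunk, so later
    # next() calls continue from the second element instead of repeating the first
    for i, it in enumerate(chunk_iterators):
        try:
            heapq.heappush(pq, (next(it), i, 0))
        except StopIteration:
            pass  # an empty chunk contributes nothing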
    
    result = []
    
    while pq:
        value, chunk_idx, elem_idx = heapq.heappop(pq)
        result.append(value)
        
        try:
            next_value = next(chunk_iterators[chunk_idx])
            heapq.heappush(pq, (next_value, chunk_idx, elem_idx + 1))
        except StopIteration:
            pass
    
    return result
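
The standard library's `heapq.merge` performs the same kind of heap-based k-way merge and returns a lazy iterator, which makes it a convenient reference for checking a hand-written merge. A minimal sketch on made-up chunks:

```python
import heapq

# Illustrative sorted chunks, including an empty one
chunks = [[1, 4, 9], [2, 3, 10], [], [5, 6, 7, 8]]

# heapq.merge yields items one at a time; list() collects them here
merged = list(heapq.merge(*chunks))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the result is an iterator, the merged output never has to exist in memory all at once unless the caller collects it.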

In [None]:
# Healthcare application: Large-scale patient data sorting
print("🏥 Large-Scale Healthcare Data Sorting:")

# Generate large patient dataset
np.random.seed(42)
large_patient_data = [
    {
        "id": f"PAT{i:05d}",
        "age": np.random.randint(1, 100)
    }
    for i in range(5000)
]

ages = [p["age"] for p in large_patient_data]

print(f"Sorting {len(ages)} patient records...")

# External sort simulation
sorted_ages = external_sort(ages.copy(), chunk_size=1000)

print(f"Sorted result: {sorted_ages[:5]}...{sorted_ages[-5:]}")
print(f"Verification: {'✓ Sorted correctly' if sorted_ages == sorted(ages) else '✗ Sort failed'}")

**Healthcare Analysis:**

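- **Large Dataset Sorting:** The chunk-then-merge approach extends to millions of patient records; the demo above runs it on 5,000 ages held in memory
- **Scalability:** In a disk-backed version, only one chunk at a time (plus one element per chunk during the merge) needs to fit in memory

**Nigerian Healthcare Impact:** Enables processing of national health databases.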
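
The simulation above keeps every chunk in memory. A disk-backed version might look like the sketch below, which is an assumption about how the idea could be carried over rather than the notebook's own method. `external_sort_to_disk` is a hypothetical helper that writes each sorted run to a text file (one integer per line) and merges the files lazily with `heapq.merge`.

```python
import heapq
import os
import random
import tempfile

def external_sort_to_disk(values, chunk_size, workdir):
    """Yield the integers in `values` in sorted order, buffering at most
    `chunk_size` of them at a time. Hypothetical helper: sorted runs are
    spilled to text files inside `workdir`."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the current buffer and write it out as one run file.
        chunk.sort()
        path = os.path.join(workdir, f"run_{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    # Phase 1: write sorted runs to disk.
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            flush()
    if chunk:
        flush()

    # Phase 2: stream the runs back and merge them lazily.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()

random.seed(0)
ages = [random.randint(1, 99) for _ in range(5000)]  # illustrative data
with tempfile.TemporaryDirectory() as workdir:
    result = list(external_sort_to_disk(ages, chunk_size=1000, workdir=workdir))
print(result == sorted(ages))
```

During the merge phase only the current line of each run file is held in memory, so the peak footprint is set by `chunk_size` and the number of runs rather than by the dataset size. Collecting the output with `list()` is done here only to verify the result.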

## 🎯 Key Takeaways and Nigerian Healthcare Applications

**Algorithm Summary:**
- **QuickSort:** Fast average-case sorting with performance analysis
- **Binary Search:** O(log n) searching in sorted data
- **External Sorting:** Scalable sorting for large datasets

**Healthcare Translation - Mark:**
Imagine building Nigeria's healthcare AI systems:
- **QuickSort:** Prioritize patient treatment by medical urgency
- **Binary Search:** Fast medical code lookup
- **External Sorting:** Process millions of patient records efficiently

**Performance achieved:** Sorting and searching algorithms enable efficient data processing for large-scale ML healthcare systems!