# Patient Data Analysis

Analysis of diagnosis records for roughly 2.3 million patients on the Mayo Clinic Platform, spanning 87 years (1935-2023).

This notebook explores the comprehensive patient dataset to understand diagnosis coding patterns, data quality, and temporal distribution across different medical coding systems.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Note: This analysis assumes `joined_table` (the Mayo Clinic Platform diagnosis
# table) is already loaded in the session as a pandas DataFrame.
# Each row is a patient diagnosis recorded under one of several coding systems.

## Dataset Structure Analysis

First, let's examine the structure and composition of our patient dataset.

In [None]:
# Column Names
joined_table.columns

**Dataset contains the following key variables:**
- `NFER_PID`: Unique patient identifier
- `DIAGNOSIS_METHOD_CODE`: Medical coding system used (ICD-10-CM, ICD-9, etc.)
- `DIAGNOSIS_CODE`: Specific medical code
- `DIAGNOSIS_DESCRIPTION`: Human-readable diagnosis description
- Various timestamp fields for tracking diagnosis timeline
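
Beyond listing column names, a per-column summary of dtype, non-null count, and distinct values gives a quicker feel for the table. The sketch below is illustrative only: `toy` is a hypothetical stand-in for `joined_table` with made-up values, and `summarize_structure` is a helper name introduced here, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy stand-in for joined_table; all values are made up.
toy = pd.DataFrame({
    "NFER_PID": [1, 1, 2, 3],
    "DIAGNOSIS_METHOD_CODE": ["ICD-10-CM", "ICD-9", "ICD-10-CM", "SNOMED CT"],
    "DIAGNOSIS_CODE": ["E11.9", "250.00", "I10", "38341003"],
    "DIAGNOSIS_DESCRIPTION": ["Type 2 diabetes", "Diabetes mellitus", "Hypertension", None],
})


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, distinct non-null values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),  # NaN/None are excluded by default
    })


structure = summarize_structure(toy)
print(structure)
```

On the real table, `summarize_structure(joined_table)` would produce the same layout, although `nunique()` can be slow on 100M+ rows.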

## Data Quality Assessment

Analyzing null values to understand data completeness across variables.

In [None]:
# Number of null values under each variable
joined_table.isnull().sum(axis=0)

In [None]:
# Descending order of null values
joined_table_null_counts = joined_table.isnull().sum().sort_values(ascending=False)
print(joined_table_null_counts)

**Data Quality Insights:**
- Core diagnostic fields (`NFER_PID`, `DIAGNOSIS_CODE`, `DIAGNOSIS_DESCRIPTION`) are almost fully populated
- Temporal fields such as `AUTO_RESOLVE_DTM` and `INACTIVE_DTM` have high null rates, which is expected for diagnoses that are still active
- Only 19 of 100M+ records lack a diagnosis description
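
Raw null counts are hard to compare across columns, so it can help to report them as a percentage of all rows as well. This is a minimal sketch on hypothetical data: `toy` and `null_report` are names introduced for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for joined_table; values are made up.
toy = pd.DataFrame({
    "NFER_PID": [1, 2, 3, 4],
    "DIAGNOSIS_DESCRIPTION": ["a", None, "c", "d"],
    "INACTIVE_DTM": [np.nan, np.nan, np.nan, 1_600_000_000],
})


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and null percentage per column, most incomplete columns first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("null_count", ascending=False)


report = null_report(toy)
print(report)
```

With a percentage column, a claim such as "<0.1% missing" can be read directly off the report instead of being worked out by hand.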

## Patient Population Analysis

Determining the scale of unique patients represented in the dataset.

In [None]:
# Number of unique patients
unique_nfer_pid = joined_table['NFER_PID'].nunique()
print("The number of unique patients in the DataFrame is:", unique_nfer_pid)

**Key Finding:** The dataset covers **2.26 million unique patients**, each of whom may have many diagnosis records.
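
Since the table has far more rows than patients, the number of records per patient is a natural follow-up. The sketch below uses a hypothetical `toy` frame with made-up IDs. Note that `groupby` drops rows with a null key by default.

```python
import pandas as pd

# Hypothetical toy stand-in for joined_table; IDs are made up.
toy = pd.DataFrame({"NFER_PID": [1, 1, 1, 2, 3, 3]})

# Number of diagnosis rows per patient (rows with a null NFER_PID are dropped).
records_per_patient = toy.groupby("NFER_PID").size()

summary = {
    "unique_patients": int(records_per_patient.shape[0]),
    "total_records": int(records_per_patient.sum()),
    "median_records_per_patient": float(records_per_patient.median()),
    "max_records_per_patient": int(records_per_patient.max()),
}
print(summary)
```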

## Coding System Distribution Analysis

Understanding which medical coding systems are most prevalent in the dataset.

In [None]:
# All possible values in DIAGNOSIS_METHOD_CODE column
joined_dmc_unique_values = joined_table['DIAGNOSIS_METHOD_CODE'].unique()
print("Available coding systems:", joined_dmc_unique_values)

In [None]:
# Group by diagnosis method code: Number of unique patients per coding system
count_by_code = joined_table.groupby('DIAGNOSIS_METHOD_CODE')['NFER_PID'].nunique()

# Sort in descending order
count_by_code = count_by_code.sort_values(ascending=False)
print("Unique patients by coding system:")
print(count_by_code)

In [None]:
# Share of each coding system among (patient, coding system) pairs
unique_patients_by_code = joined_table.groupby('DIAGNOSIS_METHOD_CODE')['NFER_PID'].nunique()

# The denominator is the sum of per-system patient counts, so a patient coded in
# several systems is counted once per system and the shares add up to 100%.
# These are therefore not percentages of all unique patients.
count_by_code_percentage = (unique_patients_by_code / unique_patients_by_code.sum()) * 100

# Round and sort results
count_by_code_percentage_rounded = count_by_code_percentage.round(1)
count_by_code_percentage_sorted = count_by_code_percentage_rounded.sort_values(ascending=False)

print("Coding system distribution (% of patient-system pairs):")
print(count_by_code_percentage_sorted)

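**Coding System Insights:**

Percentages below are relative to the sum of per-system patient counts, so a patient coded in more than one system counts once per system. They are not shares of the 2.26 million unique patients.

- **ICD-10-CM dominates** with 1.96M patients (40.4% of patient-system pairs)
- **Legacy systems remain significant**: ICD-9 (16.8%) and ICD-9-CM (15.3%)
- **Multiple coding standards** reflect real-world healthcare complexity
- **SNOMED CT** accounts for 230K patients (4.7%)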

This distribution validates the need for cross-system mapping capabilities.
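
The percentage calculation above divides by the sum of per-system patient counts. An alternative is to divide by the total number of unique patients, which answers "what fraction of patients has at least one code in this system?". The sketch below contrasts the two denominators on a hypothetical `toy` frame with made-up values. It also counts patients who appear under more than one system, which is the group that cross-system mapping would affect.

```python
import pandas as pd

# Hypothetical toy stand-in for joined_table; values are made up.
toy = pd.DataFrame({
    "NFER_PID": [1, 1, 2, 2, 3, 4],
    "DIAGNOSIS_METHOD_CODE": [
        "ICD-10-CM", "ICD-9", "ICD-10-CM", "ICD-10-CM", "ICD-9", "SNOMED CT",
    ],
})

patients_per_system = toy.groupby("DIAGNOSIS_METHOD_CODE")["NFER_PID"].nunique()
total_patients = toy["NFER_PID"].nunique()

# Denominator 1: sum of per-system counts (shares add up to 100%).
share_of_pairs = (patients_per_system / patients_per_system.sum() * 100).round(1)

# Denominator 2: all unique patients (shares can add up to more than 100%).
share_of_patients = (patients_per_system / total_patients * 100).round(1)

# Patients who appear under more than one coding system.
systems_per_patient = toy.groupby("NFER_PID")["DIAGNOSIS_METHOD_CODE"].nunique()
multi_system_patients = int((systems_per_patient > 1).sum())

print(pd.DataFrame({"pct_of_pairs": share_of_pairs, "pct_of_patients": share_of_patients}))
print("Patients with more than one coding system:", multi_system_patients)
```

Which denominator is appropriate depends on the question. The first describes the mix of coding systems, the second describes patient coverage.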

## Sample Diagnosis Codes

Examining actual diagnosis codes to understand data structure.

In [None]:
# Examples of ICD-10-CM diagnosis codes (single .loc lookup instead of chained indexing)
is_icd10cm = joined_table['DIAGNOSIS_METHOD_CODE'] == 'ICD-10-CM'
icd10cm_examples = joined_table.loc[is_icd10cm, 'DIAGNOSIS_CODE'].head(10)
print("Sample ICD-10-CM codes:")
print(icd10cm_examples)
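
Once sample codes are visible, a rough format check can flag values that do not look like ICD-10-CM at all. The pattern below is a loose structural assumption made for illustration (a letter, a digit, an alphanumeric character, then an optional dot and up to four more alphanumerics). It is not an official validator and does not check that a code exists in the ICD-10-CM code set. `looks_like_icd10cm` and the sample values are hypothetical.

```python
import re

import pandas as pd

# Loose structural pattern: an assumption for illustration, not an official validator.
ICD10CM_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](\.?[0-9A-Z]{1,4})?")

# Made-up sample values, including an ICD-9-style code and a missing value.
codes = pd.Series(["E11.9", "I10", "S72.001A", "250.00", "e11.9", None])


def looks_like_icd10cm(code) -> bool:
    """True if `code` is a string whose shape matches the loose pattern above."""
    return isinstance(code, str) and ICD10CM_PATTERN.fullmatch(code) is not None


flags = codes.map(looks_like_icd10cm)
print(pd.DataFrame({"code": codes, "looks_like_icd10cm": flags}))
```

The check is case-sensitive, so lowercase codes are flagged. Whether to normalize case first is a choice for the actual data.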

## Temporal Data Analysis

Understanding the historical span of patient data across the 87-year period.

In [None]:
# Convert Unix timestamps to datetime format
joined_table['DIAGNOSIS_DTM'] = pd.to_datetime(joined_table['DIAGNOSIS_DTM'], unit='s')

# Work with the converted timestamp column
timestamp_values_diagnosis = joined_table['DIAGNOSIS_DTM']

# Calculate min and max dates
min_date_diagnosis = timestamp_values_diagnosis.min()
max_date_diagnosis = timestamp_values_diagnosis.max()

print("Minimum date:", min_date_diagnosis) 
print("Maximum date:", max_date_diagnosis)

# Calculate range in years (365.25 days per year accounts for leap years)
date_range_years_diagnosis = (max_date_diagnosis - min_date_diagnosis).days / 365.25
print(f"Data spans approximately {date_range_years_diagnosis:.1f} years")
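
The minimum and maximum dates give the span but not how records are spread across it. One way to look at the temporal distribution is to count records per decade. This is a minimal sketch on hypothetical Unix timestamps: `toy` and its values are made up. Negative values represent dates before 1970, which `pd.to_datetime` with `unit='s'` handles.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds; negative values fall before 1970.
toy = pd.DataFrame({
    "DIAGNOSIS_DTM": [-1_000_000_000, 0, 946_684_800, 1_600_000_000, 1_600_000_500],
})

dtm = pd.to_datetime(toy["DIAGNOSIS_DTM"], unit="s")

# Floor each year to the start of its decade, then count records per decade.
decade = (dtm.dt.year // 10) * 10
records_per_decade = decade.value_counts().sort_index()

print(records_per_decade)
```

On the real table, a small number of records in the earliest decades would suggest that the extremes of the range may be outliers or data-entry artifacts worth inspecting.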

## Key Findings Summary

**Dataset Scale & Quality:**
- **2.26 million unique patients** across 87 years (1937-2022)
- **Excellent data completeness** with <0.1% missing core diagnostic information
- **Multiple coding systems** reflecting healthcare evolution over decades

**Coding System Distribution:**
- **Modern standards dominate**: ICD-10-CM (40.4% of patient-system pairs, where each patient counts once per coding system)
- **Legacy system significance**: Combined ICD-9 variants represent 32.1% of patient-system pairs
- **Specialized terminologies**: SNOMED CT covers 230K+ patients

**Clinical Research Impact:**
This comprehensive dataset enables robust patient cohort identification across the full spectrum of medical coding standards, supporting both historical analysis and contemporary research needs.