# Personalized Healthcare Recommendations

This notebook trains a machine learning model that predicts a patient's treatment response from individual patient data and maps each prediction to a personalized healthcare recommendation. The goal is to support better patient outcomes with data-driven advice.

## Project Overview

1. **Problem Understanding**: Provide personalized healthcare recommendations to patients based on their health data, medical history, lifestyle, and other relevant factors.
2. **Dataset**: Patient healthcare data including demographics, medical history, lab results, and treatment responses
3. **Methodology**: Machine learning techniques to analyze patient data and generate actionable insights
4. **Output**: Personalized healthcare recommendations for each patient

In [None]:
# ---------- imports ----------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
# ---------- load dataset ----------
# Load the patient healthcare dataset
csv_path = Path('../data/raw/patient_data.csv')
df = pd.read_csv(csv_path)

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# ---------- data exploration ----------
print("=== DATASET OVERVIEW ===")
print(f"Number of patients: {len(df)}")
print(f"Number of features: {len(df.columns)}")
print("\nColumn names:")
for i, col in enumerate(df.columns):
    print(f"  {i+1}. {col}")

print("\n=== DATA TYPES ===")
print(df.dtypes)

print("\n=== MISSING VALUES ===")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values")

In [None]:
# ---------- exploratory data analysis ----------
# Distribution of target variable (Treatment Response)
plt.figure(figsize=(10, 6))
treatment_response_counts = df['Treatment Response'].value_counts()
bars = plt.bar(treatment_response_counts.index, treatment_response_counts.values, 
               color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
plt.title('Distribution of Treatment Responses', fontsize=16)
plt.xlabel('Treatment Response', fontsize=12)
plt.ylabel('Number of Patients', fontsize=12)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}',
             ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("Treatment Response Distribution:")
print(treatment_response_counts)

In [None]:
# Age distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], bins=15, kde=True, color='skyblue')
plt.title('Age Distribution of Patients', fontsize=16)
plt.xlabel('Age (years)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nAge Statistics:")
print(df['Age'].describe())

In [None]:
# Gender distribution
plt.figure(figsize=(8, 6))
gender_counts = df['Gender'].value_counts()
plt.pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', 
        startangle=90, colors=['#FF9999', '#66B2FF', '#99FF99'])
plt.title('Gender Distribution of Patients', fontsize=16)
plt.tight_layout()
plt.show()

print("\nGender Distribution:")
print(gender_counts)

In [None]:
# Medical conditions analysis
plt.figure(figsize=(12, 6))
medical_counts = df['Medical History'].value_counts()
bars = plt.bar(range(len(medical_counts)), medical_counts.values)
plt.title('Distribution of Medical Conditions', fontsize=16)
plt.xlabel('Medical Condition', fontsize=12)
plt.ylabel('Number of Patients', fontsize=12)
plt.xticks(range(len(medical_counts)), medical_counts.index, rotation=45, ha='right')

# Add value labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}',
             ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nMedical Conditions Distribution:")
print(medical_counts)

In [None]:
# ---------- data preprocessing ----------
# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# Normalize column names
df_processed.columns = df_processed.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_')

# Display the cleaned column names
print("Normalized column names:")
print(df_processed.columns.tolist())

In [None]:
# ---------- feature engineering ----------
# Extract numerical values from vital signs
def extract_vital_signs(vital_signs_str):
    """Extract systolic BP, diastolic BP, and heart rate from vital signs string"""
    try:
        # Example: "BP: 130/85, HR: 72"
        bp_part = vital_signs_str.split('BP: ')[1].split(',')[0]
        systolic, diastolic = map(int, bp_part.split('/'))
        heart_rate = int(vital_signs_str.split('HR: ')[1])
        return systolic, diastolic, heart_rate
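    except (AttributeError, IndexError, ValueError):
        # Missing (non-string) or malformed vital-signs values fall back to NaN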
        return np.nan, np.nan, np.nan

# Apply the function to extract vital signs
vital_data = df_processed['vital_signs'].apply(extract_vital_signs)
df_processed['systolic_bp'] = [x[0] for x in vital_data]
df_processed['diastolic_bp'] = [x[1] for x in vital_data]
df_processed['heart_rate'] = [x[2] for x in vital_data]

# Create age groups
df_processed['age_group'] = pd.cut(df_processed['age'], 
                                   bins=[0, 30, 45, 60, 100], 
                                   labels=['Young Adult', 'Middle Adult', 'Older Adult', 'Senior'])

# Create chronic condition indicator
chronic_conditions = ['diabetes', 'hypertension', 'heart disease', 'asthma', 'copd', 'kidney']
df_processed['has_chronic_condition'] = df_processed['medical_history'].str.lower().apply(
    lambda x: any(condition in str(x) for condition in chronic_conditions)
).astype(int)

# Create family history indicator
df_processed['family_history_indicator'] = (df_processed['family_history'] == 'Yes').astype(int)

print("New features created:")
print("- Systolic BP, Diastolic BP, Heart Rate (extracted from vital signs)")
print("- Age groups")
print("- Chronic condition indicator")
print("- Family history indicator")

print("\nDataset shape after feature engineering:", df_processed.shape)
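
The split-based parsing in `extract_vital_signs` depends on the exact `BP: .../..., HR: ...` layout. As a rough alternative, a regular expression tolerates variations in spacing and letter case. This is only a sketch: `parse_vitals` and `VITALS_PATTERN` are hypothetical names, and the input format is assumed from the example in the function's comment.

```python
import math
import re

# Assumed layout, taken from the example in extract_vital_signs: "BP: 130/85, HR: 72"
VITALS_PATTERN = re.compile(
    r"BP:\s*(\d+)\s*/\s*(\d+)\s*,\s*HR:\s*(\d+)", re.IGNORECASE
)

def parse_vitals(text):
    """Return (systolic, diastolic, heart_rate), or NaNs if the text does not match."""
    if not isinstance(text, str):
        return math.nan, math.nan, math.nan
    match = VITALS_PATTERN.search(text)
    if match is None:
        return math.nan, math.nan, math.nan
    systolic, diastolic, heart_rate = (int(group) for group in match.groups())
    return systolic, diastolic, heart_rate

print(parse_vitals("BP: 130/85, HR: 72"))   # (130, 85, 72)
print(parse_vitals("bp:118 / 75 , hr:68"))  # (118, 75, 68)
```

Because it returns a 3-tuple like `extract_vital_signs`, it could be passed to `df_processed['vital_signs'].apply(...)` in the same way.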

In [None]:
# ---------- prepare features and target ----------
# Define target variable
target = 'treatment_response'

# Select features for the model
feature_columns = [
    'age', 'systolic_bp', 'diastolic_bp', 'heart_rate', 'recovery_time',
    'gender', 'ethnicity', 'medical_history', 'age_group', 
    'has_chronic_condition', 'family_history_indicator'
]

# Create feature matrix and target vector
# (copy so that later imputation does not write into a view of df_processed)
X = df_processed[feature_columns].copy()
y = df_processed[target]

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print("\nFeatures used:")
for feature in feature_columns:
    print(f"  - {feature}")

print("\nTarget distribution:")
print(y.value_counts())

In [None]:
# ---------- handle missing values ----------
print("Missing values before imputation:")
print(X.isnull().sum())

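# Separate numerical and categorical features
# (age_group comes from pd.cut and has 'category' dtype, so include it explicitly;
#  otherwise it appears in neither list and the ColumnTransformer drops it)
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()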

print(f"\nNumerical features: {numerical_features}")
print(f"Categorical features: {categorical_features}")

# Impute missing values
# For numerical features: use median
for col in numerical_features:
    X[col] = X[col].fillna(X[col].median())

# For categorical features: use mode
for col in categorical_features:
    mode = X[col].mode()
    X[col] = X[col].fillna(mode[0] if not mode.empty else 'Unknown')

print("\nMissing values after imputation:")
print(X.isnull().sum().sum(), "total missing values")

In [None]:
# ---------- encode categorical variables ----------
# Create preprocessing pipelines
numerical_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

print("Preprocessing pipelines created successfully!")
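
`ColumnTransformer` drops any column that is not listed in one of its transformers (its default is `remainder='drop'`), so the dtype-based column lists decide which features reach the model. The toy example below, built on made-up data, shows that a `pd.cut` result has `category` dtype and is missed by an `object`-only selection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration only.
toy = pd.DataFrame({
    "age": [25, 40, 55, 70],
    "gender": ["Male", "Female", "Female", "Male"],
})
toy["age_group"] = pd.cut(
    toy["age"], bins=[0, 30, 45, 60, 100],
    labels=["Young Adult", "Middle Adult", "Older Adult", "Senior"],
)

num_cols = toy.select_dtypes(include="number").columns.tolist()
object_only = toy.select_dtypes(include=["object"]).columns.tolist()
object_and_category = toy.select_dtypes(include=["object", "category"]).columns.tolist()

def encoded_width(cat_cols):
    """Number of output columns when only num_cols and cat_cols are listed."""
    transformer = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return transformer.fit_transform(toy).shape[1]

print("age_group" in object_only)          # False: category dtype is not object
print("age_group" in object_and_category)  # True
# Listing age_group adds its four one-hot columns to the output.
print(encoded_width(object_and_category) - encoded_width(object_only))  # 4
```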

In [None]:
# ---------- split the data ----------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print("\nTraining target distribution:")
print(y_train.value_counts(normalize=True))
print("\nTest target distribution:")
print(y_test.value_counts(normalize=True))

In [None]:
# ---------- model training ----------
# Create a pipeline with preprocessing and Random Forest classifier
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, class_weight='balanced'))
])

# Train the model
print("Training Random Forest model...")
rf_pipeline.fit(X_train, y_train)
print("Model training completed!")

In [None]:
# ---------- model evaluation ----------
# Make predictions
y_pred = rf_pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
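
Accuracy alone can look healthy when one response class dominates. As a small illustration on made-up labels, the snippet below compares accuracy with `balanced_accuracy_score` and macro-averaged F1, both available in `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 8 "Good" and 2 "Poor" patients; the model always predicts "Good".
y_true_toy = ["Good"] * 8 + ["Poor"] * 2
y_pred_toy = ["Good"] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

# prints: accuracy=0.80, balanced accuracy=0.50, macro F1=0.44
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}, macro F1={macro_f1:.2f}")
```

The same two calls accept the notebook's `y_test` and `y_pred` unchanged.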

In [None]:
# Confusion matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=rf_pipeline.named_steps['classifier'].classes_, 
            yticklabels=rf_pipeline.named_steps['classifier'].classes_)
plt.title('Confusion Matrix', fontsize=16)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# ---------- feature importance ----------
# Get feature names after preprocessing
preprocessor = rf_pipeline.named_steps['preprocessor']
 
# Get feature names for numerical features
numerical_feature_names = numerical_features

# Get feature names for categorical features (after one-hot encoding)
categorical_encoder = preprocessor.named_transformers_['cat'].named_steps['encoder']
categorical_feature_names = categorical_encoder.get_feature_names_out(categorical_features)

# Combine all feature names
all_feature_names = list(numerical_feature_names) + list(categorical_feature_names)

# Get feature importances
feature_importances = rf_pipeline.named_steps['classifier'].feature_importances_

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'feature': all_feature_names,
    'importance': feature_importances
}).sort_values('importance', ascending=False)

# Plot top 15 features
plt.figure(figsize=(10, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance', fontsize=12)
plt.title('Top 15 Feature Importances', fontsize=16)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("Top 10 Most Important Features:")
print(importance_df.head(10))
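
Impurity-based importances from a random forest are computed on the training data and tend to favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data from `make_classification`, not the patient dataset, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out split and record the drop in accuracy.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

In the notebook, the same call could take `rf_pipeline`, `X_test`, and `y_test`, which would score the original input columns rather than the one-hot-encoded ones.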

In [None]:
# ---------- recommendation system implementation ----------
def generate_recommendations(patient_data, model_pipeline):
    """
    Generate personalized healthcare recommendations based on patient data.
    
    Parameters:
    patient_data (dict): Dictionary containing patient information
    model_pipeline: Trained model pipeline
    
    Returns:
    dict: Predicted response, recommendation text, and confidence score
    """
    
    # Convert patient data to DataFrame
    patient_df = pd.DataFrame([patient_data])
    
    # Apply the same preprocessing as training data
    patient_df.columns = patient_df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_')
    
    # Extract vital signs if present
    if 'vital_signs' in patient_df.columns:
        vital_data = patient_df['vital_signs'].apply(extract_vital_signs)
        patient_df['systolic_bp'] = [x[0] for x in vital_data]
        patient_df['diastolic_bp'] = [x[1] for x in vital_data]
        patient_df['heart_rate'] = [x[2] for x in vital_data]
    
    # Create age group
    if 'age' in patient_df.columns:
        patient_df['age_group'] = pd.cut(patient_df['age'], 
                                        bins=[0, 30, 45, 60, 100], 
                                        labels=['Young Adult', 'Middle Adult', 'Older Adult', 'Senior'])
    
    # Create chronic condition indicator
    if 'medical_history' in patient_df.columns:
        patient_df['has_chronic_condition'] = patient_df['medical_history'].str.lower().apply(
            lambda x: any(condition in str(x) for condition in chronic_conditions)
        ).astype(int)
    
    # Create family history indicator
    if 'family_history' in patient_df.columns:
        patient_df['family_history_indicator'] = (patient_df['family_history'] == 'Yes').astype(int)
    
    # Select the same features used in training
    patient_features = patient_df[feature_columns]
    
    # Make prediction
    prediction = model_pipeline.predict(patient_features)[0]
    
    # Get prediction probabilities
    probabilities = model_pipeline.predict_proba(patient_features)[0]
    confidence = np.max(probabilities)
    
    # Map predictions to recommendations
    recommendation_mapping = {
        'Excellent': 'Continue current treatment plan. Patient is responding very well.',
        'Good': 'Current treatment is effective. Consider minor adjustments if needed.',
        'Fair': 'Treatment response is moderate. Consider alternative approaches or additional interventions.',
        'Poor': 'Treatment response is suboptimal. Recommend immediate consultation and treatment plan revision.'
    }
    
    recommendation = recommendation_mapping.get(prediction, 'No specific recommendation available.')
    
    return {
        'predicted_response': prediction,
        'recommendation': recommendation,
        'confidence': confidence
    }
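
The confidence returned above is the largest class probability from the forest, which is an average over the trees' estimates and is not necessarily calibrated. One conservative option is to defer to a clinician when that value is low. The helper `recommend_with_threshold` and the 0.6 cutoff below are hypothetical choices for illustration, not part of the notebook.

```python
def recommend_with_threshold(probabilities, classes, mapping, threshold=0.6):
    """Return the mapped recommendation, or a deferral when confidence is below threshold."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = float(probabilities[best_index])
    label = classes[best_index]
    if confidence < threshold:
        text = "Low model confidence. Refer for clinical review."
    else:
        text = mapping.get(label, "No specific recommendation available.")
    return {
        "predicted_response": label,
        "recommendation": text,
        "confidence": confidence,
    }

# Hypothetical class order and probabilities, for illustration only.
toy_classes = ["Excellent", "Fair", "Good", "Poor"]
toy_mapping = {"Good": "Current treatment is effective. Consider minor adjustments if needed."}

confident = recommend_with_threshold([0.1, 0.1, 0.7, 0.1], toy_classes, toy_mapping)
uncertain = recommend_with_threshold([0.3, 0.2, 0.3, 0.2], toy_classes, toy_mapping)
print(confident["recommendation"])
print(uncertain["recommendation"])
```

In the notebook, `classes` would come from `rf_pipeline.named_steps['classifier'].classes_` and `probabilities` from `predict_proba`.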

In [None]:
# ---------- test recommendation system ----------
# Example patient data
example_patients = [
    {
        'age': 45,
        'gender': 'Male',
        'ethnicity': 'Caucasian',
        'medical_history': 'Hypertension, Diabetes',
        'family_history': 'Yes',
        'vital_signs': 'BP: 130/85, HR: 72',
        'recovery_time': 14
    },
    {
        'age': 62,
        'gender': 'Female',
        'ethnicity': 'Asian',
        'medical_history': 'Heart Disease',
        'family_history': 'No',
        'vital_signs': 'BP: 118/75, HR: 68',
        'recovery_time': 10
    },
    {
        'age': 34,
        'gender': 'Male',
        'ethnicity': 'Hispanic',
        'medical_history': 'Asthma',
        'family_history': 'Yes',
        'vital_signs': 'BP: 122/80, HR: 75',
        'recovery_time': 18
    }
]

print("=== PERSONALIZED HEALTHCARE RECOMMENDATIONS ===\n")

for i, patient in enumerate(example_patients, 1):
    print(f"Patient {i}:")
    print(f"  Age: {patient['age']}")
    print(f"  Gender: {patient['gender']}")
    print(f"  Medical History: {patient['medical_history']}")
    print(f"  Vital Signs: {patient['vital_signs']}")
    
    # Generate recommendation
    result = generate_recommendations(patient, rf_pipeline)
    
    print(f"\n  Predicted Treatment Response: {result['predicted_response']}")
    print(f"  Recommendation: {result['recommendation']}")
    print(f"  Confidence: {result['confidence']:.2%}")
    print("-" * 50)

In [None]:
# ---------- save the model ----------
import joblib

# Save the trained model pipeline
model_path = Path('../models/healthcare_recommendation_model.pkl')
model_path.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(rf_pipeline, model_path)
print(f"Model saved to: {model_path}")

# Save feature columns for future use
feature_info = {
    'feature_columns': feature_columns,
    'numerical_features': numerical_features,
    'categorical_features': categorical_features
}
feature_path = Path('../models/feature_info.pkl')
joblib.dump(feature_info, feature_path)
print(f"Feature information saved to: {feature_path}")
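
A quick round-trip check confirms that a saved model reloads and predicts identically. The sketch below writes to a temporary directory and uses a tiny synthetic model rather than the notebook's files. The names `demo_path` and `model` are illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the trained pipeline (assumption, not the patient model).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_model.pkl"
    joblib.dump(model, demo_path)
    reloaded = joblib.load(demo_path)

same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print(f"Reloaded model reproduces predictions: {same_predictions}")
```

Pickled scikit-learn models are best reloaded with the same library version that saved them, and only from trusted sources.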

## Summary

In this notebook, we implemented a personalized healthcare recommendation system:

### Key Accomplishments:
1. **Data Exploration**: Analyzed patient demographics, medical conditions, and treatment responses
2. **Feature Engineering**: Extracted numerical values from vital signs and created new features
3. **Model Development**: Built a Random Forest classifier to predict treatment responses
4. **Model Evaluation**: Evaluated the classifier on a held-out 20% test set using accuracy, a per-class classification report, and a confusion matrix
5. **Recommendation System**: Created a system that generates personalized healthcare recommendations
6. **Model Persistence**: Saved the trained pipeline and feature metadata with joblib for future use

### Healthcare Applications:
- Treatment response prediction
- Personalized care plan development
- Clinical decision support
- Patient risk stratification

### Next Steps:
1. Deploy the model as a web application
2. Integrate with electronic health records (EHR) systems
3. Continuously update the model with new patient data
4. Expand to include more health conditions and treatment options

This system can help healthcare providers make more informed decisions and offer personalized recommendations to improve patient outcomes.