{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COVID-19 Severity Classification and Demographic Analysis\n",
    "\n",
    "This project analyzes patient measurement data to classify **COVID-19 severity** and explore **demographic trends** (age, race, and sex).\n",
    "\n",
    "It demonstrates data preprocessing, feature engineering, imputation, classification modeling, and exploratory demographic grouping using **Python**, **pandas**, and **scikit-learn**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import Libraries\n",
    "We use pandas for data handling and scikit-learn for preprocessing and classification."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, classification_report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Load and Preprocess the Data\n",
    "We load the dataset, parse the date field, and keep only the measurements recorded at each patient's **earliest `measurement_date`**.  \n",
    "Next, we pivot the dataset so that each measurement type becomes a column; repeated measurements of the same type are averaged."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Load dataset\n",
    "covid_data = pd.read_csv(\"data_covid.csv\")\n",
    "\n",
    "# Convert measurement_date to datetime\n",
    "covid_data['measurement_date'] = pd.to_datetime(covid_data['measurement_date'])\n",
    "\n",
    "# Keep every measurement recorded at each person's earliest measurement_date\n",
    "first_date = covid_data.groupby('person_id')['measurement_date'].transform('min')\n",
    "first_day_data = covid_data[covid_data['measurement_date'] == first_date]\n",
    "\n",
    "# Pivot: measurement types as columns (duplicate measurements are averaged)\n",
    "first_day_pivot = first_day_data.pivot_table(\n",
    "    index=['person_id', 'current_age', 'category', 'race_name', 'gen_name'],\n",
    "    columns='measurement_name',\n",
    "    values='value_as_number',\n",
    "    aggfunc='mean'\n",
    ").reset_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Define Feature Groups\n",
    "We use:\n",
    "- **Vital signs** for the mild vs. non-mild classifier.\n",
    "- **Lab measurements plus the vital signs** for the severe vs. non-severe classifier."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "vital_signs = [\n",
    "    'Diastolic blood pressure', 'Body temperature', 'Systolic blood pressure',\n",
    "    'Body weight', 'Respiratory rate', 'Oxygen saturation in Arterial blood'\n",
    "]\n",
    "\n",
    "lab_measurements = [\n",
    "    'Heart rate', 'Hematocrit [Volume Fraction] of Blood by Automated count',\n",
    "    'Erythrocytes [#/volume] in Blood by Automated count',\n",
    "    'Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma',\n",
    "    'Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma',\n",
    "    'Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma',\n",
    "    'Bilirubin.total [Mass/volume] in Serum or Plasma',\n",
    "    'Albumin [Mass/volume] in Serum or Plasma', 'MCHC [Mass/volume] by Automated count',\n",
    "    'Hemoglobin [Mass/volume] in Blood', 'Platelets [#/volume] in Blood by Automated count',\n",
    "    'Glomerular filtration rate/1.73 sq M.predicted [Volume Rate/Area] in Serum, Plasma or Blood by Creatinine-based formula (MDRD)',\n",
    "    'Protein [Mass/volume] in Serum or Plasma', 'MCH [Entitic mass] by Automated count',\n",
    "    'MCV [Entitic volume] by Automated count', 'Leukocytes [#/volume] in Blood by Automated count'\n",
    "] + vital_signs"
   ]
  },
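  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before modeling, it can help to confirm that every listed measurement actually appears as a column after the pivot and to see how sparse each one is. The cell below is a minimal sketch of such a check; `absent_columns` and `present_columns` are illustrative names that are not used elsewhere in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Sketch: which listed measurements are missing from the pivot entirely?\n",
    "absent_columns = [c for c in lab_measurements if c not in first_day_pivot.columns]\n",
    "print(\"Measurements absent from the pivot:\", absent_columns)\n",
    "\n",
    "# Fraction of patients with no value for each measurement that is present\n",
    "present_columns = [c for c in lab_measurements if c in first_day_pivot.columns]\n",
    "print(first_day_pivot[present_columns].isna().mean().sort_values(ascending=False))"
   ]
  },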
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Prepare Data for Classification\n",
    "- **Binary Classifier:** distinguishes *mild* vs. *non-mild* cases (using only vital signs).  \n",
    "- **Severe Classifier:** predicts *severe* vs. *non-severe* cases (using the lab measurements plus the vital signs)."
   ]
  },
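  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Mild vs Non-Mild\n",
    "X_binary = first_day_pivot[vital_signs].values\n",
    "y_binary = (first_day_pivot['category'] != 'mild').astype(int).values\n",
    "\n",
    "# Severe vs Non-Severe\n",
    "X_severe = first_day_pivot[lab_measurements].values\n",
    "y_severe = (first_day_pivot['category'] == 'severe').astype(int).values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Split Data into Training and Test Sets\n",
    "We reserve 20% of the data for evaluation, then impute missing values with medians learned from the training set only, so that no test-set information leaks into training."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# The same random_state and row count give both splits identical row assignments,\n",
    "# which keeps the two test sets aligned for the cascade in Step 8.\n",
    "X_train_binary, X_test_binary, y_train_binary, y_test_binary = train_test_split(\n",
    "    X_binary, y_binary, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "X_train_severe, X_test_severe, y_train_severe, y_test_severe = train_test_split(\n",
    "    X_severe, y_severe, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Handle missing values: fit each imputer on the training split only\n",
    "binary_imputer = SimpleImputer(strategy='median')\n",
    "X_train_binary = binary_imputer.fit_transform(X_train_binary)\n",
    "X_test_binary = binary_imputer.transform(X_test_binary)\n",
    "\n",
    "severe_imputer = SimpleImputer(strategy='median')\n",
    "X_train_severe = severe_imputer.fit_transform(X_train_severe)\n",
    "X_test_severe = severe_imputer.transform(X_test_severe)"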
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Train Logistic Regression Models\n",
    "We use logistic regression as an interpretable baseline classifier for both tasks."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "binary_classifier = LogisticRegression(max_iter=1000)\n",
    "severe_classifier = LogisticRegression(max_iter=1000)\n",
    "\n",
    "binary_classifier.fit(X_train_binary, y_train_binary)\n",
    "severe_classifier.fit(X_train_severe, y_train_severe)"
   ]
  },
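  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since logistic regression was chosen for interpretability, one way to read the fitted mild vs. non-mild model is to pair each vital sign with its coefficient. This is a sketch rather than part of the original pipeline: `binary_coefs` is an illustrative name, the features are not standardized so magnitudes are not comparable across measurements, and it assumes no vital-sign column was entirely missing (`SimpleImputer` drops such columns by default, which would misalign the labels)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Sketch: each coefficient is the change in log-odds of a non-mild case per unit of the measurement\n",
    "binary_coefs = pd.Series(binary_classifier.coef_[0], index=vital_signs).sort_values()\n",
    "print(binary_coefs)"
   ]
  },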
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Evaluate Models\n",
    "We assess model accuracy and classification performance using precision, recall, and F1-score."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Mild vs Non-Mild\n",
    "binary_preds = binary_classifier.predict(X_test_binary)\n",
    "print(\"Mild Classifier Results:\")\n",
    "print(\"Accuracy:\", accuracy_score(y_test_binary, binary_preds))\n",
    "print(classification_report(y_test_binary, binary_preds))\n",
    "\n",
    "# Severe vs Non-Severe\n",
    "severe_preds = severe_classifier.predict(X_test_severe)\n",
    "print(\"\\nSevere Classifier Results:\")\n",
    "print(\"Accuracy:\", accuracy_score(y_test_severe, severe_preds))\n",
    "print(classification_report(y_test_severe, severe_preds))"
   ]
  },
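  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy alone can look strong when one class dominates, so it is worth comparing each score with a majority-class baseline. The cell below is a minimal sketch of that comparison; `baseline_binary` and `baseline_severe` are illustrative names."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Sketch: class counts across all patients\n",
    "print(first_day_pivot['category'].value_counts())\n",
    "\n",
    "# Accuracy obtained by always predicting the more common class in each test set\n",
    "baseline_binary = max(y_test_binary.mean(), 1 - y_test_binary.mean())\n",
    "baseline_severe = max(y_test_severe.mean(), 1 - y_test_severe.mean())\n",
    "print(\"Majority-class baseline (mild vs. non-mild):\", baseline_binary)\n",
    "print(\"Majority-class baseline (severe vs. non-severe):\", baseline_severe)"
   ]
  },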
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Cascaded Prediction (Two-Stage Classification)\n",
    "We first predict *non-mild* cases, then use the **severe classifier** only on those patients."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import numpy as np\n",
    "\n",
    "# Stage 1: flag the test patients predicted as non-mild\n",
    "non_mild_mask = binary_classifier.predict(X_test_binary) == 1\n",
    "\n",
    "# Stage 2: run the severe classifier only on flagged patients;\n",
    "# patients predicted mild keep the default non-severe label (0)\n",
    "final_preds = np.zeros_like(y_test_severe)\n",
    "if non_mild_mask.any():\n",
    "    final_preds[non_mild_mask] = severe_classifier.predict(X_test_severe[non_mild_mask])\n",
    "\n",
    "# Score against the full test set so severe cases missed at stage 1 count as errors\n",
    "print(\"\\nFinal Cascaded Classifier Results:\")\n",
    "print(\"Accuracy:\", accuracy_score(y_test_severe, final_preds))\n",
    "print(classification_report(y_test_severe, final_preds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 9: Demographic Analysis\n",
    "We analyze COVID severity distribution by **sex**, **age group**, and **race**."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Group by Sex\n",
    "grouped_sex = first_day_pivot.groupby(['category', 'gen_name']).size().unstack(fill_value=0)\n",
    "grouped_sex['Total'] = grouped_sex.sum(axis=1)\n",
    "\n",
    "# Group by Age: left-closed bins [0, 18), [18, 46), [46, 66), [66, 150)\n",
    "# so that, for ages in whole years, the bin edges match the labels\n",
    "age_bins = [0, 18, 46, 66, 150]\n",
    "age_labels = ['< 18', '18 - 45', '46 - 65', '> 65']\n",
    "first_day_pivot['Age_Group'] = pd.cut(\n",
    "    first_day_pivot['current_age'], bins=age_bins, labels=age_labels, right=False\n",
    ")\n",
    "grouped_age = first_day_pivot.groupby(['category', 'Age_Group'], observed=False).size().unstack(fill_value=0)\n",
    "grouped_age['Total'] = grouped_age.sum(axis=1)\n",
    "\n",
    "# Group by Race\n",
    "grouped_race = first_day_pivot.groupby(['category', 'race_name']).size().unstack(fill_value=0)\n",
    "grouped_race['Total'] = grouped_race.sum(axis=1)\n",
    "\n",
    "# Display demographic tables\n",
    "print(\"\\nTable 1: Demographic Characteristics of Patient Population\\n\")\n",
    "print(\"By Sex:\\n\", grouped_sex)\n",
    "print(\"\\nBy Age Group:\\n\", grouped_age)\n",
    "print(\"\\nBy Race:\\n\", grouped_race)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "This analysis:\n",
    "- Preprocessed COVID-19 measurement data into a structured, per-patient format  \n",
    "- Built and evaluated a two-stage **mild/non-mild → severe** classification pipeline  \n",
    "- Explored demographic patterns by sex, age, and race  \n",
    "\n",
    "**Future Work:**\n",
    "- Compare against Random Forest or XGBoost to test whether nonlinear models improve on the logistic regression baseline  \n",
    "- Apply SMOTE to handle class imbalance  \n",
    "- Add feature importance and interpretability (e.g., SHAP)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
