{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COVID-19 Severity Classification\n",
    "\n",
    "This notebook builds two binary Logistic Regression classifiers that predict COVID-19 severity from first-day clinical measurements: one separates mild from non-mild cases, and the other separates severe cases from all others. The workflow covers data preprocessing, imputation, model training, evaluation, and a basic demographic summary."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, classification_report"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Load and Preprocess Data\n",
    "\n",
    "We load the COVID-19 dataset, convert measurement dates, filter to first-day measurements, and pivot to have one row per patient."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "covid_data = pd.read_csv('data_covid.csv')\n",
    "covid_data['measurement_date'] = pd.to_datetime(covid_data['measurement_date'])\n",
    "\n",
    "# Keep only measurements taken on each patient's first measurement day\n",
    "# (normalize() drops any time-of-day part so the comparison is per calendar day)\n",
    "measurement_day = covid_data['measurement_date'].dt.normalize()\n",
    "first_day = measurement_day.groupby(covid_data['person_id']).transform('min')\n",
    "first_day_data = covid_data[measurement_day == first_day]\n",
    "\n",
    "# Pivot data so each measurement becomes a column\n",
    "first_day_pivot = first_day_data.pivot_table(\n",
    "    index=['person_id', 'current_age', 'category', 'race_name', 'gen_name'],\n",
    "    columns='measurement_name',\n",
    "    values='value_as_number'\n",
    ").reset_index()"
   ]
  },
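  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Feature Selection\n",
    "\n",
    "We define two feature lists: `vital_signs`, which holds six vital-sign measurements, and `lab_measurements`, which combines those vital signs and heart rate with the laboratory results."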
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "vital_signs = [\n",
    "    'Diastolic blood pressure', 'Body temperature', 'Systolic blood pressure',\n",
    "    'Body weight', 'Respiratory rate', 'Oxygen saturation in Arterial blood'\n",
    "]\n",
    "\n",
    "lab_measurements = [\n",
    "    'Diastolic blood pressure', 'Body temperature', 'Respiratory rate',\n",
    "    'Oxygen saturation in Arterial blood', 'Systolic blood pressure', 'Body weight',\n",
    "    'Heart rate', 'Hematocrit [Volume Fraction] of Blood by Automated count',\n",
    "    'Erythrocytes [#/volume] in Blood by Automated count',\n",
    "    'Aspartate aminotransferase [Enzymatic activity/volume] in Serum or Plasma',\n",
    "    'Alanine aminotransferase [Enzymatic activity/volume] in Serum or Plasma',\n",
    "    'Alkaline phosphatase [Enzymatic activity/volume] in Serum or Plasma',\n",
    "    'Bilirubin.total [Mass/volume] in Serum or Plasma', 'Albumin [Mass/volume] in Serum or Plasma',\n",
    "    'MCHC [Mass/volume] by Automated count', 'Hemoglobin [Mass/volume] in Blood',\n",
    "    'Platelets [#/volume] in Blood by Automated count',\n",
    "    'Glomerular filtration rate/1.73 sq M.predicted [Volume Rate/Area] in Serum, Plasma or Blood by Creatinine-based formula (MDRD)',\n",
    "    'Protein [Mass/volume] in Serum or Plasma', 'MCH [Entitic mass] by Automated count',\n",
    "    'MCV [Entitic volume] by Automated count', 'Leukocytes [#/volume] in Blood by Automated count'\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Prepare Data for Classification\n",
    "\n",
    "- The **mild classifier** uses only vital signs to predict whether a case is mild (label 0) or not mild (label 1).\n",
    "- The **severe classifier** uses all available lab and vital data to distinguish severe cases (label 1) from all other cases (label 0)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Mild (binary) classifier setup: 1 = not mild, 0 = mild\n",
    "X_binary = first_day_pivot[vital_signs].values\n",
    "y_binary = (first_day_pivot['category'] != 'mild').astype(int).values\n",
    "\n",
    "# Severe classifier setup: 1 = severe, 0 = any other category\n",
    "# Select features by name so columns added later (e.g. Age_Group) cannot leak in on re-runs\n",
    "X_severe = first_day_pivot[lab_measurements].values\n",
    "y_severe = (first_day_pivot['category'] == 'severe').astype(int).values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Handle Missing Data and Split Sets"
   ]
  },
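  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Split once so both feature sets share the same train/test rows\n",
    "(X_train_binary, X_test_binary, X_train_severe, X_test_severe,\n",
    " y_train_binary, y_test_binary, y_train_severe, y_test_severe) = train_test_split(\n",
    "    X_binary, X_severe, y_binary, y_severe, test_size=0.2, random_state=42)\n",
    "\n",
    "# Fit each median imputer on training rows only so test data does not leak into it\n",
    "binary_imputer = SimpleImputer(strategy='median').fit(X_train_binary)\n",
    "X_train_binary = binary_imputer.transform(X_train_binary)\n",
    "X_test_binary = binary_imputer.transform(X_test_binary)\n",
    "severe_imputer = SimpleImputer(strategy='median').fit(X_train_severe)\n",
    "X_train_severe = severe_imputer.transform(X_train_severe)\n",
    "X_test_severe = severe_imputer.transform(X_test_severe)"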
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Train Models\n",
    "\n",
    "We use **Logistic Regression** for both classifiers."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "binary_classifier = LogisticRegression(max_iter=500)\n",
    "binary_classifier.fit(X_train_binary, y_train_binary)\n",
    "\n",
    "severe_classifier = LogisticRegression(max_iter=500)\n",
    "severe_classifier.fit(X_train_severe, y_train_severe)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Evaluate Classifiers\n",
    "\n",
    "We evaluate both models independently and then combine predictions for final accuracy."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Evaluate mild classifier\n",
    "binary_predictions = binary_classifier.predict(X_test_binary)\n",
    "binary_accuracy = accuracy_score(y_test_binary, binary_predictions)\n",
    "print('Mild Classifier Accuracy:', binary_accuracy)\n",
    "print(classification_report(y_test_binary, binary_predictions))\n",
    "\n",
    "# Evaluate severe classifier\n",
    "severe_predictions = severe_classifier.predict(X_test_severe)\n",
    "severe_accuracy = accuracy_score(y_test_severe, severe_predictions)\n",
    "print('Severe Classifier Accuracy:', severe_accuracy)\n",
    "print(classification_report(y_test_severe, severe_predictions))\n",
    "\n",
    "# Combine predictions: a case is called severe only if the mild classifier\n",
    "# flags it as non-mild AND the severe classifier flags it as severe\n",
    "final_predictions = ((binary_predictions == 1) & (severe_predictions == 1)).astype(int)\n",
    "\n",
    "# Score the combined prediction on the full test set, not only the non-mild subset\n",
    "final_accuracy = accuracy_score(y_test_severe, final_predictions)\n",
    "print('Final Combined Classifier Accuracy:', final_accuracy)\n",
    "print(classification_report(y_test_severe, final_predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Demographic Analysis\n",
    "\n",
    "We summarize patient demographics by severity level across sex, age group, and race."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Group by sex\n",
    "grouped_sex = first_day_pivot.groupby(['category', 'gen_name']).size().unstack(fill_value=0)\n",
    "grouped_sex['Total'] = grouped_sex.sum(axis=1)\n",
    "\n",
    "# Group by age; left-closed bins so each label matches its range (e.g. age 18 falls in '18 - 45')\n",
    "age_bins = [0, 18, 46, 66, 150]\n",
    "age_labels = ['< 18', '18 - 45', '46 - 65', '> 65']\n",
    "first_day_pivot['Age_Group'] = pd.cut(first_day_pivot['current_age'], bins=age_bins, labels=age_labels, right=False)\n",
    "grouped_age = first_day_pivot.groupby(['category', 'Age_Group'], observed=False).size().unstack(fill_value=0)\n",
    "grouped_age.columns = grouped_age.columns.astype(str)  # plain string labels so 'Total' can be added\n",
    "grouped_age['Total'] = grouped_age.sum(axis=1)\n",
    "\n",
    "# Group by race\n",
    "grouped_race = first_day_pivot.groupby(['category', 'race_name']).size().unstack(fill_value=0)\n",
    "grouped_race['Total'] = grouped_race.sum(axis=1)\n",
    "\n",
    "# Print summaries\n",
    "print('Table 1: Demographic Characteristics of Patient Population')\n",
    "print('\\nSex')\n",
    "print(grouped_sex)\n",
    "print('\\nAge')\n",
    "print(grouped_age)\n",
    "print('\\nRace')\n",
    "print(grouped_race)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.x"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

