# Project Title: 

Sepsis Classification Machine Learning Project with FastAPI Deployment

# Business Understanding

## 1. Introduction
This project focuses on the early detection and classification of sepsis, a life-threatening medical condition. Sepsis is a critical concern in healthcare, and early diagnosis can significantly improve patient outcomes. The objective is to build a robust machine learning model for sepsis classification and deploy it into a web application using FastAPI, making it accessible for real-time predictions.

### 1.1. Objectives
- Understand the Data: 
The primary objective of this project is to gain a comprehensive understanding of the patient data, which includes various health-related features, demographics, and the presence or absence of sepsis. This understanding will empower healthcare professionals and decision-makers to make informed decisions regarding patient care and intervention.

- Predict Sepsis: 
Develop an accurate machine learning classification model that can predict the likelihood of a patient developing sepsis based on the provided features. Early and accurate sepsis prediction is crucial for timely medical intervention and improving patient outcomes.

- Web Application Integration: 
Integrate the trained sepsis classification model into a web application using FastAPI. This web application will serve as a practical tool for healthcare practitioners to input patient data and receive real-time sepsis risk predictions, aiding in clinical decision-making.

### 1.2. Methodology
To achieve the project objectives, we will follow a structured approach:

- Data Loading and Exploration: 
Begin by loading and exploring the patient data, including features like age, vital signs, and medical history. This step will provide initial insights into the dataset and identify any data quality issues.

- Data Preprocessing: 
Handle missing values, perform feature engineering, and encode categorical variables as needed. Preprocessing steps will ensure that the data is ready for training the machine learning model.

- Model Development: 
Select and implement a suitable machine learning classification model for sepsis prediction. This model will be trained on historical patient data to learn patterns indicative of sepsis.

- Model Evaluation: 
Assess the model's performance using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. Rigorous evaluation will help identify the model's effectiveness in predicting sepsis cases.

- FastAPI Integration: 
Integrate the trained machine learning model into a FastAPI-based web application. This application will provide a user-friendly interface for healthcare professionals to input patient data and obtain sepsis risk predictions.

- Testing and Validation: 
Conduct thorough testing and validation of the web application to ensure its reliability and accuracy in real-time sepsis risk assessment.

- Documentation: 
Provide detailed documentation on how to use the web application, including input requirements and interpretation of results.

By following this methodology, we aim to provide healthcare professionals with a valuable tool for early sepsis detection and decision support, ultimately contributing to improved patient care and outcomes.
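
To make the evaluation metrics named in the methodology concrete, the sketch below computes accuracy, precision, recall, and F1-score from confusion-matrix counts in plain Python. The helper name `classification_metrics` and the toy labels are invented for illustration and are not project results. ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive="Positive"):
    """Compute basic binary classification metrics from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only
y_true = ["Positive", "Negative", "Positive", "Negative", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Negative", "Positive", "Positive"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For sepsis screening, recall deserves particular attention because a false negative corresponds to a missed sepsis case.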

# Setup

## Installations

## Importation of Relevant Libraries

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading

### Loading the Train and Test Datasets

#### Train Dataset

In [14]:
# Load The Train Dataset
train_df = pd.read_csv("data/Paitients_Files_Train.csv")
train_df.head()

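|   | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance | Sepssis |
|---|----|-----|----|----|----|----|-----|-----|-----|-----------|---------|
| 0 | ICU200010 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 0 | Positive |
| 1 | ICU200011 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | Negative |
| 2 | ICU200012 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | Positive |
| 3 | ICU200013 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 1 | Negative |
| 4 | ICU200014 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | Positive |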


#### Test Dataset

In [15]:
# Load The Test Dataset
test_df = pd.read_csv("data/Paitients_Files_Test.csv")
test_df.head()

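|   | ID | PRG | PL | PR | SK | TS | M11 | BD2 | Age | Insurance |
|---|----|-----|----|----|----|----|-----|-----|-----|-----------|
| 0 | ICU200609 | 1 | 109 | 38 | 18 | 120 | 23.1 | 0.407 | 26 | 1 |
| 1 | ICU200610 | 1 | 108 | 88 | 19 | 0 | 27.1 | 0.4 | 24 | 1 |
| 2 | ICU200611 | 6 | 96 | 0 | 0 | 0 | 23.7 | 0.19 | 28 | 1 |
| 3 | ICU200612 | 1 | 124 | 74 | 36 | 0 | 27.8 | 0.1 | 30 | 1 |
| 4 | ICU200613 | 7 | 150 | 78 | 29 | 126 | 35.2 | 0.692 | 54 | 0 |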


Columns Description (Common to Both Datasets):

- ID: A unique identifier for each patient.
- PRG: Number of pregnancies (applicable only to females).
- PL: Plasma glucose concentration.
- PR: Diastolic blood pressure.
- SK: Triceps skinfold thickness.
- TS: 2-hour serum insulin.
- M11: Body mass index (BMI).
- BD2: Diabetes pedigree function.
- Age: Age of the patient.
- Insurance: Whether the patient has insurance coverage (1 for Yes, 0 for No).
- Sepssis (train dataset only; spelled this way in the data file): The target variable indicating the presence or absence of sepsis (Positive for presence, Negative for absence).

Both datasets contain the same patient-related features, and only the train dataset has the additional "Sepssis" target column, as the outputs above show. The train dataset is designed for model training, while the test dataset, which has no labels, will be used to generate predictions from the trained model.
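
The column layout described above suggests a simple way to separate features from the target before modelling. The sketch below rebuilds the five training rows displayed earlier so that it runs on its own; the variable names `train_sample`, `TARGET`, `X`, and `y` are illustrative, and mapping Negative/Positive to 0/1 is an assumed encoding rather than a step confirmed by this notebook.

```python
import pandas as pd

# First five training rows as displayed above, rebuilt inline for the sketch
train_sample = pd.DataFrame({
    "ID": ["ICU200010", "ICU200011", "ICU200012", "ICU200013", "ICU200014"],
    "PRG": [6, 1, 8, 1, 0],
    "PL": [148, 85, 183, 89, 137],
    "PR": [72, 66, 64, 66, 40],
    "SK": [35, 29, 0, 23, 35],
    "TS": [0, 0, 0, 94, 168],
    "M11": [33.6, 26.6, 23.3, 28.1, 43.1],
    "BD2": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Insurance": [0, 0, 1, 1, 1],
    "Sepssis": ["Positive", "Negative", "Positive", "Negative", "Positive"],
})

TARGET = "Sepssis"  # spelled as in the data file

# Drop the identifier and the target to obtain the feature matrix
X = train_sample.drop(columns=["ID", TARGET])

# Assumed encoding: Negative -> 0, Positive -> 1
y = train_sample[TARGET].map({"Negative": 0, "Positive": 1})

print(X.columns.tolist())
print(y.tolist())
```

The same two steps would apply to the full `train_df`; `test_df` only needs the `ID` column dropped because it has no target.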

# Exploratory Data Analysis (EDA)

## Understanding the datasets

This section explores the datasets in depth to gain insights into the available variables, their distributions, and their relationships. It provides an initial understanding of the datasets and surfaces any data quality issues, which will inform the cleaning and pre-processing steps.

In [None]:
# Summary statistics for numeric features
print("Summary Statistics for Test Dataset:")
print(test_df.describe())

print("\nSummary Statistics for Train Dataset:")
print(train_df.describe())

# Summary statistics for categorical features
print("\nSummary Statistics for Categorical Features in Test Dataset:")
print(test_df.describe(include=['object']))

print("\nSummary Statistics for Categorical Features in Train Dataset:")
print(train_df.describe(include=['object']))

# Check for missing values
print("\nMissing Values in Test Dataset:")
print(test_df.isnull().sum())

print("\nMissing Values in Train Dataset:")
print(train_df.isnull().sum())

# Data distribution visualization
# The target column exists only in the train dataset and is spelled "Sepssis"
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.countplot(data=train_df, x='Sepssis')
plt.title('Sepsis Distribution in Train Dataset')

plt.subplot(1, 2, 2)
sns.countplot(data=train_df, x='Insurance')
plt.title('Insurance Distribution in Train Dataset')

plt.show()

# Correlation matrix (numeric columns only; the string ID column cannot be correlated)
plt.figure(figsize=(10, 8))
correlation_matrix = test_df.select_dtypes(include='number').corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix (Test Dataset)')
plt.show()

# Pairplot for selected features, colored by the target (train dataset only)
sns.pairplot(data=train_df, vars=['PL', 'PR', 'Age'], hue='Sepssis')
plt.suptitle('Pairplot of Selected Features (Train Dataset)', y=1.02)
plt.show()
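
The missing-value check above counts only NaN entries, yet the displayed rows contain zeros in columns such as PR and SK where a zero reading looks implausible given the column descriptions. The sketch below counts such zeros on the five test rows shown earlier. Treating these zeros as possible missing values is an assumption that should be confirmed against the data source, and `zero_suspect_cols` is an illustrative choice of columns.

```python
import pandas as pd

# First five test rows as displayed earlier, rebuilt inline for the sketch
test_sample = pd.DataFrame({
    "ID": ["ICU200609", "ICU200610", "ICU200611", "ICU200612", "ICU200613"],
    "PRG": [1, 1, 6, 1, 7],
    "PL": [109, 108, 96, 124, 150],
    "PR": [38, 88, 0, 74, 78],
    "SK": [18, 19, 0, 36, 29],
    "TS": [120, 0, 0, 0, 126],
    "M11": [23.1, 27.1, 23.7, 27.8, 35.2],
    "BD2": [0.407, 0.4, 0.19, 0.1, 0.692],
    "Age": [26, 24, 28, 30, 54],
    "Insurance": [1, 1, 1, 1, 0],
})

# Columns where a value of 0 may stand for a missing measurement (assumption)
zero_suspect_cols = ["PL", "PR", "SK", "TS", "M11"]

# Count zero entries per suspect column
zero_counts = (test_sample[zero_suspect_cols] == 0).sum()
print(zero_counts)
```

Running the same count on the full `train_df` and `test_df` would show how widespread these zeros are before deciding how to handle them during preprocessing.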
