# Business Understanding

### Columns and description
1. ID: Unique number to represent patient ID
2. PRG: Plasma glucose
    Hyperglycemia (high levels of plasma glucose) is commonly associated with sepsis.
3. PL: Blood Work Result-1 (mu U/ml)
4. PR: Blood Pressure (mm Hg)
5. SK: Blood Work Result-2 (mm)
6. TS: Blood Work Result-3 (mu U/ml)
7. M11: Body mass index (weight in kg/(height in m)^2)
    Both underweight and obesity can weaken the immune system.  
8. BD2: Blood Work Result-4 (mu U/ml)
9. Age: Patient's age (years)
    Age is a significant factor in the risk and outcomes associated with sepsis. 
10. Insurance: If a patient holds a valid insurance card
11. Sepsis: (Target) 
    * Positive: if a patient in the ICU will develop sepsis
    * Negative: otherwise 

Sepsis: the body's overwhelming and dysregulated response to an infection. The immune system is not able to effectively fight off a bacterial or viral infection, leading to inflammation.



Goal: Build a model to predict whether an ICU patient will develop sepsis.

Null Hypothesis: There is no association between elevated plasma glucose levels and patients developing sepsis. 

Alternate Hypothesis: Elevated plasma glucose levels are associated with patients developing sepsis.
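
One way to test these hypotheses is a Mann-Whitney U test comparing `PRG` between the Positive and Negative groups. The sketch below is only an illustration: the data frame is synthetic (the values and the built-in group difference are made up), and the choice of test, the one-sided alternative, and the 0.05 significance level are assumptions rather than decisions stated in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Synthetic stand-in for the real dataset; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
sepsis = rng.choice(["Positive", "Negative"], size=n, p=[0.35, 0.65])
# Demo-only assumption: Positive cases are drawn with a higher glucose mean.
prg = np.where(sepsis == "Positive", rng.normal(5.0, 3.0, n), rng.normal(3.0, 3.0, n))
df = pd.DataFrame({"PRG": prg, "Sepsis": sepsis})

pos = df.loc[df["Sepsis"] == "Positive", "PRG"]
neg = df.loc[df["Sepsis"] == "Negative", "PRG"]

# One-sided test: does PRG tend to be higher in the Positive group?
stat, p_value = mannwhitneyu(pos, neg, alternative="greater")

alpha = 0.05  # assumed significance level
reject_null = bool(p_value < alpha)
print(f"U = {stat:.1f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

A two-sided alternative (`alternative="two-sided"`) would be the more conservative choice if lower glucose in septic patients is also of interest.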

Analytical Questions
1. What is the effect of age on developing sepsis?
2. Is there any correlation between age and sepsis?
3. Does having high plasma glucose level increase the risk for sepsis?
4. How does the presence of antibiotics affect the development of sepsis in patients admitted to the intensive care unit (ICU)?
5. Is there any correlation between holding insurance and sepsis?
6. Can we use other factors such as renal failure, liver disease, or heart failure in our prediction model?
7. Are there any differences in the development of sepsis based on race/ethnicity?
8. Could elevated creatinine levels be used as a factor in predicting sepsis?
9. Would it be beneficial to include other medical conditions like pneumonia, urinary tract infection, or gastrointestinal ble
10. Would it be beneficial to include additional variables like kidney dysfunction, liver dysfunction, or heart failure in our model
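
Questions 2 and 5 can be explored with a correlation and a cross-tabulation. The sketch below uses a synthetic data frame with the `Age`, `Insurance`, and `Sepsis` columns described earlier; the 0/1 encoding of `Insurance` and the helper column `Sepsis_flag` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; values are random, not real patients.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "Age": rng.integers(21, 82, size=n),
    "Insurance": rng.integers(0, 2, size=n),  # assumed encoding: 1 = insured
    "Sepsis": rng.choice(["Positive", "Negative"], size=n),
})

# Hypothetical helper column: encode the target as 1/0 for correlation.
df["Sepsis_flag"] = (df["Sepsis"] == "Positive").astype(int)

# Question 2: Pearson correlation between a continuous and a binary variable
# (this is the point-biserial correlation).
age_corr = df["Age"].corr(df["Sepsis_flag"])

# Question 5: share of Positive/Negative cases within each insurance group.
rate_by_insurance = pd.crosstab(df["Insurance"], df["Sepsis"], normalize="index")

print(f"Age vs. sepsis correlation: {age_corr:.3f}")
print(rate_by_insurance)
```

Questions about variables that are not among the listed columns (antibiotics, race/ethnicity, creatinine) cannot be answered this way without additional data.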


## Data Understanding

1. Data Collection
The data collection process is the first step in the analysis of a dataset, and it involves gathering all relevant information from various sources to form a comprehension
2. Initial data inspection
    * Get an overview of the dataset's size, dimensions, and structure
    * Understand the data types (numeric, categorical, text)
3. Descriptive statistics
    * Compute summary statistics for each feature, such as mean, median, standard deviation, minimum, maximum, etc.
4. Data visualization
    * Create visualizations like histograms, box plots, scatter plots, and correlation matrices to explore the distribution and relationships between variables.
    * Identify patterns, trends, and potential outliers.
5. Handle missing values
6. Data cleaning
    * Clean the data by addressing inconsistencies, errors, or outliers
    * Ensure that the data is in a format suitable for analysis
7. Feature engineering
    * Create new features or transform existing ones to enhance the predictive power of the model
    * Encode categorical

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
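
The inspection, descriptive-statistics, and missing-value steps listed above could look roughly like the following. This is a minimal sketch on a small synthetic frame: the values, the `ID` format, the injected missing entries, and the median-imputation choice are all assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Small synthetic frame that mimics a few of the documented columns.
rng = np.random.default_rng(42)
n = 50
df = pd.DataFrame({
    "ID": [f"P{i:03d}" for i in range(n)],  # hypothetical ID format
    "PRG": rng.integers(0, 15, n).astype(float),
    "M11": rng.normal(30, 6, n).round(1),
    "Age": rng.integers(21, 80, n),
    "Insurance": rng.integers(0, 2, n),
    "Sepsis": rng.choice(["Positive", "Negative"], n),
})
df.loc[[3, 7], "M11"] = np.nan  # inject two missing values for the demo

# Initial inspection: size and data types.
print(df.shape)
print(df.dtypes)

# Descriptive statistics for the numeric features.
summary = df.describe()

# Correlation matrix over numeric columns only.
corr_matrix = df.select_dtypes(include="number").corr()

# Missing values and duplicate rows.
missing = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# One possible handling strategy (an assumption): median imputation.
df["M11"] = df["M11"].fillna(df["M11"].median())
```

`corr_matrix` can be passed to `sns.heatmap` for the correlation-matrix visualization mentioned in step 4.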


## Data Preparation

1. Split dataset
    * training - trains the model
    * validation - helps tune hyperparameters
    * test - evaluates model performance
2. Create a pipeline to preprocess the data
    1. Scale
    2. Log Transformation
3. Encode
4. Split dataset into training and evaluation
5. Check balance - Balance Dataset (depending on what you see)
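
The preparation steps above might be sketched with scikit-learn as follows. The data is synthetic, the 60/20/20 split ratio is an assumption, and the log transform is applied before scaling (the reverse of the order listed) because a logarithm is undefined for the negative values that standardization produces.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, StandardScaler

# Synthetic features and target; values are random, not real patients.
rng = np.random.default_rng(7)
n = 400
X = pd.DataFrame({
    "PRG": rng.integers(0, 15, n),
    "PL": rng.integers(50, 200, n),
    "Age": rng.integers(21, 80, n),
})
y_raw = pd.Series(rng.choice(["Positive", "Negative"], n, p=[0.35, 0.65]), name="Sepsis")

# Encode the target; LabelEncoder sorts classes, so Negative -> 0, Positive -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(y_raw)

# Split into training (60%), validation (20%), and test (20%), keeping class ratios.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Preprocessing pipeline: log1p first (inputs are non-negative), then standardize.
preprocess = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
X_train_prep = preprocess.fit_transform(X_train)  # fit on training data only
X_val_prep = preprocess.transform(X_val)
X_test_prep = preprocess.transform(X_test)

# Check class balance in the training set before deciding whether to rebalance.
class_share = pd.Series(y_train).value_counts(normalize=True)
print(class_share)
```

Fitting the pipeline on the training split only keeps information from the validation and test sets out of the scaler's statistics.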

## Modeling

1. Train Model
    1. Train Model 1 - K-Nearest Neighbour (Distance model)
    2. Train Model 2 - Logistic Regression (Gradient Descent)
    3. Train Model 3 - Decision Tree (Tree-based model)
2. Persist Model
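
A minimal sketch of training the three models and persisting one of them is shown below. It runs on synthetic data from `make_classification`; the hyperparameters, the F1 metric, and the file name `sepsis_model.joblib` are assumptions for illustration. Note that scikit-learn's `LogisticRegression` uses the L-BFGS solver by default rather than plain gradient descent.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced binary classification data (not the real dataset).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.65, 0.35], random_state=0
)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Distance- and gradient-based models get scaling; the tree does not need it.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Train each model and score it on the evaluation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_eval, model.predict(X_eval))

# Persist the best-scoring model; the path is a hypothetical example.
best_name = max(scores, key=scores.get)
model_path = os.path.join(tempfile.gettempdir(), "sepsis_model.joblib")
joblib.dump(models[best_name], model_path)

# Reload to confirm the saved model can be used for prediction.
reloaded = joblib.load(model_path)
print(best_name, scores)
```

Saving the whole pipeline, rather than the bare estimator, keeps the scaler and the model together so new data is preprocessed the same way at prediction time.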