<a href="https://colab.research.google.com/github/wamaw123/Biomedical-Data-Analytics-with-Python/blob/main/Foundations%20of%20Data%20Analytics/Disease_Prediction_and_Prevention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analytics with Python
By: [Abderrahim Benmoussa, Ph.D.](https://github.com/wamaw123)

Project on GitHub: https://github.com/wamaw123/Biomedical_Data_analysis

# Disease Prediction and Prevention

In this notebook, we aim to predict the 10-year risk of future coronary heart disease (CHD) using the Framingham Heart Study dataset. This is a binary classification task, where `1` indicates a risk of CHD and `0` indicates no risk. We will first explore the dataset and fit a simple logistic regression model. We will then compare different models and finally tune the best one to get the best predictive accuracy.

## Step 1: Install and Import Libraries

In this step, we will install and import necessary libraries for our analysis.
- `pandas` and `numpy` for data manipulation
- `seaborn` and `matplotlib` for data visualization
- `scikit-learn` for building and evaluating the machine learning model

In [None]:
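# Install necessary packages (imbalanced-learn provides the `imblearn` module imported below for SMOTE)
!pip install pandas numpy scipy statsmodels patsy dtale scikit-learn pandas_profiling matplotlib seaborn imbalanced-learn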

# Import necessary libraries

## Data Manipulation
import pandas as pd   # Essential for data manipulation and mathematical operations.
import numpy as np    # Used for array-based operations and mathematical functions.

## Visualization
import matplotlib.pyplot as plt  # Fundamental plotting library.
import seaborn as sns            # Builds on top of matplotlib for more advanced visualizations.

## Statistical Testing, modeling and data preparation
from scipy import stats           # Library for scientific and technical computing.
import statsmodels.api as sm      # Provides classes and functions for the estimation of many different statistical models.
import statsmodels.formula.api as smf  # Formula-based API for the statsmodels library.
from sklearn.model_selection import train_test_split  # Import train_test_split function to split data into training and testing sets.
from sklearn.preprocessing import StandardScaler  # Import StandardScaler to standardize features by removing the mean and scaling to unit variance.
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression to perform logistic regression.
from sklearn.metrics import classification_report, accuracy_score  # Import classification_report to build a text report showing the main classification metrics, and accuracy_score to compute the accuracy of the algorithm.
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

## Interactive Exploration
from collections import Counter
import pandas_profiling as pp





## Step 2: Load the Dataset

Next, we load the week 2 dataset directly from GitHub into a pandas DataFrame.


In [None]:
# Fetch the dataset from GitHub
url = "https://raw.githubusercontent.com/wamaw123/Biomedical-Data-Analytics-with-Python/afab193c5cb3d6878755c4d12e8baa821a8ab054/Datasets/23/framingham.csv"
df = pd.read_csv(url)
df.head()

# About the dataset

## Source
The dataset is publicly available on the Kaggle website, and it is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes.

## Variables
Each attribute is a potential risk factor. These include demographic, behavioral, and medical risk factors.

### Demographic:
- **Sex:** male or female (Nominal)
- **Age:** Age of the patient; (Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

### Behavioral:
- **Current Smoker:** whether or not the patient is a current smoker (Nominal)
- **Cigs Per Day:** the number of cigarettes that the person smoked on average in one day. (Can be considered continuous as one can have any number of cigarettes, even half a cigarette.)

### Medical (history):
- **BP Meds:** whether or not the patient was on blood pressure medication (Nominal)
- **Prevalent Stroke:** whether or not the patient had previously had a stroke (Nominal)
- **Prevalent Hyp:** whether or not the patient was hypertensive (Nominal)
- **Diabetes:** whether or not the patient had diabetes (Nominal)

### Medical (current):
- **Tot Chol:** total cholesterol level (Continuous)
- **Sys BP:** systolic blood pressure (Continuous)
- **Dia BP:** diastolic blood pressure (Continuous)
- **BMI:** Body Mass Index (Continuous)
- **Heart Rate:** heart rate (Continuous - In medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values.)
- **Glucose:** glucose level (Continuous)

### Predict variable (desired target):
- **10-year risk of coronary heart disease (CHD):** binary; "1" means "Yes," "0" means "No"

## Step 3: Data Preprocessing

We will perform initial data preprocessing, such as handling missing values. This step is crucial to ensure the quality and reliability of our machine learning model. There are many ways to deal with missing values: for instance, they can be dropped or imputed in different ways. Let's first check what missing values we have.

In [None]:
# Checking for missing values
# isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.
# sum() will then sum the True values (count of missing values) for each column.
print(df.isnull().sum())

# Calculate the percentage of missing values for each column
missing_percentage = (df.isnull().sum() / len(df)) * 100

# Display the percentage of missing values for each column
print(f"Percentage of missing values for each column:\n{missing_percentage}")

# Visualize the missing values as a heatmap
plt.figure(figsize=(10, 6))  # Set the figure size
sns.heatmap(df.isnull(),     # Provide DataFrame with null-status information
            cbar=False,      # Do not draw a color bar
            cmap='viridis')  # Use the viridis color map

plt.title('Missing Values Heatmap')
plt.show()

Analysis of the output: Education and glucose (glycemia) appear to have the most missing values. There are no missing values for the 10-year CHD target. There are many ways to deal with missing values:

- Remove Rows with Missing Values
- Replace Missing Values with Mean
- Replace Missing Values with Median
- Replace Missing Values with Mode
- Use Forward or Backward Fill

I would prefer either dropping the affected rows, since only about 10% of glucose values are missing, or imputing with the median. I don't expect much difference between the two, so either option would do.

In [None]:
# Sample code to illustrate handling missing values based on user's choice
def handle_missing_values(df, option):
    """Return a copy of df with missing values handled according to `option`."""
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified
    if option == 'Remove Rows':
        df = df.dropna()
    elif option == 'Replace with Mean':
        # Fill each numeric column with its own mean
        df = df.fillna(df.mean(numeric_only=True))
    elif option == 'Replace with Median':
        # Fill each numeric column with its own median
        df = df.fillna(df.median(numeric_only=True))
    elif option == 'Replace with Mode':
        # mode() can return several rows; the first row holds one mode per column
        df = df.fillna(df.mode().iloc[0])
    elif option == 'Forward or Backward Fill':
        # Forward fill first, then backward fill any leading gaps
        df = df.ffill().bfill()
    else:
        raise ValueError(f"Unknown missing-value option: {option!r}")
    return df

# User's choice using Google Colab form field dropdown
missing_value_option = 'Replace with Median' #@param ["Remove Rows", "Replace with Mean", "Replace with Median", "Replace with Mode", "Forward or Backward Fill"]

# Handle missing values based on user's choice
df_nm = handle_missing_values(df, missing_value_option)


Let's check again for missing values

In [None]:
# Checking for missing values
# isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.
# sum() will then sum the True values (count of missing values) for each column.
print(df_nm.isnull().sum())

# Calculate the percentage of missing values for each column
missing_percentage = (df_nm.isnull().sum() / len(df_nm)) * 100

# Display the percentage of missing values for each column
print(f"Percentage of missing values for each column:\n{missing_percentage}")

# Visualize the missing values as a heatmap
plt.figure(figsize=(10, 6))  # Set the figure size
sns.heatmap(df_nm.isnull(),     # Provide DataFrame with null-status information
            cbar=False,      # Do not draw a color bar
            cmap='viridis')  # Use the viridis color map

plt.title('Missing Values Heatmap')
plt.show()

Analysis of the output: The data is now devoid of missing values.

Let's now check for correlations between these variables.

In [None]:
# Calculate the correlation matrix
correlation_matrix = df_nm.corr()  # Use the DataFrame with missing values handled

# Display the correlation matrix
print(correlation_matrix)

# Visualize the correlation matrix as a heatmap
plt.figure(figsize=(12, 8))  # Set the size of the figure
sns.heatmap(correlation_matrix,
            annot=True,  # Annotate each cell with the numeric value
            cmap='coolwarm',  # Use a cool-warm color map
            vmin=-1, vmax=1,  # Set color scale limits
            linewidths=.5)  # Set linewidth between entries in matrix

plt.title('Correlation Matrix')
plt.show()

Analysis of the output: We see some interesting correlations with our dependent variable (the target), mostly with age and the hypertension-related variables, which is to be expected from a scientific point of view.
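To read the heatmap more easily, it can help to pull out only the target's column and rank the features by the strength of their correlation. The sketch below does this on a small synthetic frame; the values are invented, and the target column name `TenYearCHD` is an assumption about how the column is named in the CSV.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (invented values, NOT the Framingham dataset)
rng = np.random.default_rng(123)
n = 500
age = rng.normal(50, 8, size=n)
sysBP = 100 + 0.6 * age + rng.normal(0, 15, size=n)  # loosely tied to age by construction
heartRate = rng.normal(75, 12, size=n)                # independent noise
risk = 1 / (1 + np.exp(-(age - 60) / 5))              # outcome more likely at higher age

toy = pd.DataFrame({
    "age": age,
    "sysBP": sysBP,
    "heartRate": heartRate,
    "TenYearCHD": rng.binomial(1, risk),  # assumed target column name
})

# Correlation of every feature with the target, strongest (by absolute value) first
target_corr = (
    toy.corr()["TenYearCHD"]
    .drop("TenYearCHD")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

In the notebook itself, the same three chained calls could be applied to `correlation_matrix` instead of `toy.corr()`.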

## Step 4: Exploratory Data Analysis (EDA)

Let's explore the dataset to understand it better and decide how to approach the prediction problem. Visualization helps identify patterns and anomalies in the dataset. We will also need to deal with imbalance in the dataset, using techniques such as oversampling, undersampling, or SMOTE.


In [None]:
# Get information on the structure of the data (columns, dtypes, non-null counts)
df_nm.info()

Output interpretation: The table contains 4,238 entries (or rows) and 16 different categories (or columns) of information. These categories include things like gender (male), age, education, whether the person is a current smoker, and the number of cigarettes smoked per day, among others. All entries in the table are non-null, meaning they all contain data, and the data is in different formats, including integers (int64) and floating-point numbers (float64). The table takes up about 530 kilobytes of memory space.

Let's further explore the data using pandas profiling.

In [None]:
pp.ProfileReport(df_nm)

### Overall information gathered from pandas profiling:

- There's a strong relationship between the number of cigarettes smoked per day (cigsPerDay) and whether the person is a current smoker (currentSmoker).
- Systolic blood pressure (sysBP) is closely related to diastolic blood pressure (diaBP) and one other field.
- Diastolic blood pressure (diaBP) is also closely related to systolic blood pressure (sysBP) and one other field.
- There's a strong link between glucose levels and diabetes.
- PrevalentHyp is strongly connected with sysBP and one other field.
- The BPMeds, prevalentStroke, and diabetes fields are highly imbalanced, meaning almost all of their values fall in a single class (the report gives imbalance scores of 80.8% for BPMeds, 94.8% for prevalentStroke, and 82.8% for diabetes).
- Half of the cigsPerDay entries are zero, indicating many non-smokers or missing data; in this case, probably non-smokers, since missing values were initially NaNs.
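The imbalance flagged by the report can be checked directly by looking at the share of rows in each class. The following is a minimal sketch on a made-up ten-row frame; the column names mirror the notebook's, but the values are invented and do not reflect the real proportions.

```python
import pandas as pd

# Tiny made-up frame standing in for df_nm (values invented for illustration)
toy = pd.DataFrame({
    "BPMeds":          [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "prevalentStroke": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "diabetes":        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Share of rows in each class, per column (each column sums to 1.0)
class_shares = toy.apply(lambda col: col.value_counts(normalize=True)).fillna(0.0)
print(class_shares)

# Share held by the most frequent class in each column
majority_share = class_shares.max()
print(majority_share)
```

Running the same two statements on `df_nm[["BPMeds", "prevalentStroke", "diabetes"]]` would give the actual class shares for the dataset.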

# Imbalance in Data and Why It's an Issue

"Imbalance" in data refers to a situation where the distribution of data among different categories or classes is unequal. One class may have significantly more instances than others.

## Issues Caused by Imbalance:

1. **Model Bias:**
   - An imbalanced dataset can cause a predictive model to be biased towards the majority class, resulting in poor performance for the minority class.

2. **Misleading Accuracy:**
   - A model might show high accuracy by simply predicting the majority class, providing a false sense of effectiveness.

3. **Loss of Information:**
   - Patterns associated with the minority class may be overlooked, leading to a lack of important insights.

4. **Overfitting:**
   - The model may memorize the few available minority class instances rather than generalizing, leading to overfitting.

5. **False Assumptions:**
   - Incorrect assumptions may be made about real-world class distributions, affecting model performance in practical applications.

## Solutions:

1. **Resampling:**
   - Oversample the minority class or undersample the majority class.

2. **Synthetic Data Generation:**
   - Generate new data points for the minority class, e.g., using the Synthetic Minority Over-sampling Technique (SMOTE).

3. **Cost-sensitive Learning:**
   - Penalize the misclassification of the minority class more heavily.

4. **Using Different Evaluation Metrics:**
   - Utilize metrics like the F1-score, precision, recall, or Area Under the Receiver Operating Characteristic (ROC) curve to evaluate model performance.
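Solutions 3 and 4 can be illustrated together. The sketch below is an assumption-laden example on synthetic data (generated with `make_classification`, not the Framingham dataset): it fits a logistic regression with and without `class_weight="balanced"` and reports recall, F1, and ROC AUC rather than accuracy alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 85% class 0, 15% class 1)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

results = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    # class_weight="balanced" penalizes errors on the rarer class more heavily
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[label] = {
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
    }

for label, metrics in results.items():
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

Class weighting typically trades some precision for higher recall on the minority class, which is why it is worth reading several metrics side by side.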

For the sake of this exercise, let's explore resampling and, more specifically, SMOTE.



In [None]:
# resample (sklearn.utils) and SMOTE (imblearn) were imported in Step 1

# Work on a copy so that df_nm stays available for comparison
df_rdy = df_nm.copy()

# Function to handle imbalance in a binary (0/1) column
def handle_imbalance(df, column, method):
    """Rebalance the two classes of `column`; assumes 0 is the majority class."""
    if method == "Oversampling":
        # Sample minority rows with replacement until they match the majority count
        df_majority = df[df[column]==0]
        df_minority = df[df[column]==1]
        df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=123)
        df = pd.concat([df_majority, df_minority_upsampled])
    elif method == "Undersampling":
        # Sample majority rows without replacement down to the minority count
        df_majority = df[df[column]==0]
        df_minority = df[df[column]==1]
        df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=123)
        df = pd.concat([df_minority, df_majority_downsampled])
    elif method == "SMOTE":
        # SMOTE interpolates between minority rows, so synthetic rows can carry
        # non-integer values in the other columns, including binary ones
        X = df.drop(columns=[column])
        y = df[column]
        smote = SMOTE(random_state=123)
        X, y = smote.fit_resample(X, y)
        df = pd.concat([X, y], axis=1)
    return df

# Choice of method to handle imbalance for BPMeds
method_BPMeds = 'SMOTE' # @param ["Oversampling", "Undersampling", "SMOTE"]
df_rdy = handle_imbalance(df_rdy, 'BPMeds', method_BPMeds)

# Choice of method to handle imbalance for prevalentStroke
method_prevalentStroke = 'SMOTE' # @param ["Oversampling", "Undersampling", "SMOTE"]
df_rdy = handle_imbalance(df_rdy, 'prevalentStroke', method_prevalentStroke)

# Choice of method to handle imbalance for diabetes
method_diabetes = 'SMOTE' # @param ["Oversampling", "Undersampling", "SMOTE"]
df_rdy = handle_imbalance(df_rdy, 'diabetes', method_diabetes)

# Handling relationships and zero values
# Creating an interaction term for sysBP and diaBP
df_rdy['bp_interaction'] = df_rdy['sysBP'] * df_rdy['diaBP']

# Handling cigsPerDay (since smokers are not missing values, no change is made here)

df_rdy.head()

Let's explore the dataset again now that those issues have been addressed.

In [None]:
pp.ProfileReport(df_rdy)



## Step 5: Feature Selection and Splitting the Dataset

In this step, we will select the relevant features for our model and split the dataset into training and testing sets. This separation allows us to evaluate the model's performance on unseen data.  
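A minimal sketch of this step follows, using a small synthetic frame in place of `df_rdy`. The feature values are invented, the target column name `TenYearCHD` is an assumption about the CSV, and the 80/20 split with `stratify=y` is one reasonable choice rather than the notebook's confirmed setting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_rdy (invented values, NOT the Framingham data)
rng = np.random.default_rng(123)
n = 1000
data = pd.DataFrame({
    "age": rng.integers(32, 71, size=n),
    "sysBP": rng.normal(132, 22, size=n),
    "glucose": rng.normal(82, 24, size=n),
    "TenYearCHD": rng.binomial(1, 0.15, size=n),  # assumed target column name
})

# Features are every column except the target
X = data.drop(columns=["TenYearCHD"])
y = data["TenYearCHD"]

# Hold out 20% for testing; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # positive-class share
```

Stratifying matters here because the positive class is rare; without it, a small test set could end up with very few positive cases.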

## Step 6: Feature Scaling

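Standardizing the features (rescaling each to zero mean and unit variance) matters for many machine learning models, particularly those that use regularization or gradient-based solvers, and helps the model converge faster.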
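The sketch below shows one common way to apply `StandardScaler`: the scaler is fitted on the training set only, and the test set is transformed with the training statistics so that no information leaks from the held-out data. The arrays are synthetic stand-ins, not the notebook's features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features on very different scales (invented values)
rng = np.random.default_rng(123)
X = np.column_stack([
    rng.normal(132, 22, size=500),  # blood-pressure-like scale
    rng.normal(26, 4, size=500),    # BMI-like scale
])
y = rng.binomial(1, 0.15, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0).round(3))   # approximately 1 for each feature
```

The test set's mean and standard deviation will be close to, but not exactly, 0 and 1, because it is scaled with the training statistics.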



## Step 7: Model Building

We will use the Logistic Regression model for prediction, which is a commonly used algorithm for binary classification problems.
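As a rough illustration, the sketch below fits a logistic regression on synthetic data. Wrapping the scaler and the classifier in a single pipeline is a design choice made here, not something the notebook prescribes; `max_iter=1000` is likewise an assumption to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (NOT the Framingham dataset)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=123
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Scaling and the classifier are chained so both are fitted on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1

print(y_pred[:10])
print(y_prob[:10].round(3))
```

The predicted probabilities are often more useful than the hard labels for a risk model, since the decision threshold can then be chosen to suit the clinical context.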




## Step 8: Model Evaluation

Evaluate the model performance by comparing the predicted and actual values. We will use metrics such as precision, recall, and accuracy to assess the model performance.
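To make the metrics concrete, the sketch below computes them on ten hand-made labels (invented for illustration, not results from this notebook). With 5 true negatives, 1 false positive, 2 false negatives, and 2 true positives, accuracy is 7/10, precision is 2/3, and recall is 2/4.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-made labels for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TN + TP) / total = (5 + 2) / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

print(cm)
print(classification_report(y_true, y_pred, digits=3))
```

Here accuracy looks acceptable while half of the positive cases are missed, which is the pattern to watch for with an imbalanced target.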



## Step 9: Comparing different models

We will evaluate different models for the same dataset and pick the most accurate one using PyCaret.
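The sketch below is not PyCaret; it is a plain scikit-learn stand-in that illustrates the same idea of scoring several candidate models under identical cross-validation. The three candidates, the synthetic data, and the choice of ROC AUC as the ranking metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the Framingham dataset)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=123
)

# Candidate models to compare (an illustrative selection)
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(random_state=123),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=123),
}

# Mean 5-fold cross-validated ROC AUC for each candidate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")

best_name = max(scores, key=scores.get)
print("Best by mean ROC AUC:", best_name)
```

Ranking by a metric other than accuracy is deliberate here, given the class imbalance discussed earlier.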



## Step 10: Optimization

In this step, we aim to fine-tune the best model's parameters to enhance its performance using techniques like Grid Search or Random Search.
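A minimal grid search sketch follows, assuming for illustration that the model being tuned is a logistic regression. The parameter grid, the F1 scoring choice, and the synthetic data are assumptions, not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (NOT the Framingham dataset)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=123
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-name prefixes ("clf__") route each parameter to the right pipeline step
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],            # inverse regularization strength
    "clf__class_weight": [None, "balanced"],  # optional cost-sensitive weighting
}

# Every combination is scored with 5-fold cross-validation
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` follows the same pattern and samples a fixed number of combinations instead of trying them all, which scales better to larger grids.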


# Conclusions and perspectives

In this notebook, we