# Predicting Diabetes Diagnosis Using Machine Learning: A Comprehensive Analysis of Patient Data

Diabetes is a chronic disease that affects millions of people worldwide, making early detection crucial for better management and treatment. In this analysis, we aim to leverage machine learning techniques to predict whether a patient has diabetes based on various medical attributes. The dataset used for this task comes from the **National Institute of Diabetes and Digestive and Kidney Diseases**, specifically focusing on female patients aged 21 and older of Pima Indian descent.

By exploring this dataset, we will build a predictive model that can diagnose diabetes with high accuracy, using several diagnostic measurements including glucose levels, blood pressure, BMI, and others. The ultimate goal is to create a robust machine learning model capable of predicting the presence of diabetes, offering valuable insights for healthcare professionals and patients alike.

## About the Dataset

This dataset originates from the **National Institute of Diabetes and Digestive and Kidney Diseases**. The objective is to diagnostically predict whether a patient has diabetes based on certain diagnostic measurements included in the dataset. The data specifically focuses on female patients aged 21 years and older of Pima Indian heritage. Several constraints were placed on the selection of the instances from a larger database.

The dataset contains both independent medical predictor variables and one target variable, **Outcome**, which indicates whether a patient has diabetes or not.

## Variables

- **Pregnancies**: Number of pregnancies.
- **Glucose**: 2-hour plasma glucose concentration in the oral glucose tolerance test.
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skin fold thickness (mm).
- **Insulin**: 2-hour serum insulin (µU/ml).
- **DiabetesPedigreeFunction**: Diabetes pedigree function.
- **BMI**: Body Mass Index.
- **Age**: Age (in years).
- **Outcome**: Diabetes diagnosis (1 = positive, 0 = negative).

## Techniques and Tools Used

This analysis will employ several tools and techniques, including:

- **Exploratory Data Analysis (EDA)**: To understand the distribution and relationships between variables.
- **Correlation Analysis**: To examine how variables are related to one another.
- **Feature Engineering**: To improve model performance by creating new features and modifying existing ones.
- **Data Preprocessing**: Including handling missing values, outliers, and encoding categorical variables.
- **Model Building**: The following machine learning models will be utilized:
  - **RandomForestClassifier**
  - **Logistic Regression**
  - **K-Nearest Neighbors (KNN)**
  - **Support Vector Classifier (SVC)**
  - **Decision Tree Classifier**
  - **AdaBoost Classifier**
  - **Gradient Boosting Classifier**
  - **XGBoost Classifier**
  - **LightGBM Classifier**
  
- **Hyperparameter Optimization**: Using techniques like grid search and random search to find the optimal settings for the models.
- **Model Evaluation**: Comparing the performance of various models using metrics like accuracy, precision, recall, and F1 score.
- **Visualization**: To display model performance and feature importance.

The overall goal is to build a predictive model that can effectively diagnose diabetes based on the provided medical features.
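
The evaluation metrics listed above can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch of that arithmetic; the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration, not results from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, used only to show the calculation
example = classification_metrics(tp=40, fp=10, fn=20, tn=130)
print(example)
```

In practice the `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` functions from scikit-learn compute the same quantities directly from label arrays.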


## Libraries and Tools

To efficiently analyze the dataset and build a machine learning model for diabetes prediction, we utilize a variety of libraries and tools, each serving a specific purpose:

### 🔹 Data Handling & Manipulation  
- **NumPy** (`numpy`): Efficient numerical operations.  
- **Pandas** (`pandas`): Data manipulation and analysis.  

### 🔹 Data Visualization  
- **Matplotlib** (`matplotlib.pyplot`): Basic plotting and visualizations.  
- **Seaborn** (`seaborn`): Statistical data visualization.  
- **Plotly** (`plotly.express`, `plotly.graph_objects`): Interactive and advanced visualizations.  

### 🔹 Machine Learning & Model Evaluation  
- **Scikit-learn** (`sklearn`):  
  - Model training: `RandomForestClassifier`, `LogisticRegression`, `KNeighborsClassifier`, `SVC`, `DecisionTreeClassifier`, `AdaBoostClassifier`, `GradientBoostingClassifier`.  
  - Model evaluation: `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`.  
  - Hyperparameter tuning: `GridSearchCV`, `cross_validate`.  
  - Data preprocessing: `StandardScaler`, `RobustScaler`, `LabelEncoder`, `KNNImputer`.  
  - Train-test splitting: `train_test_split`.  

- **XGBoost** (`xgboost`): Optimized gradient boosting for classification.  
- **LightGBM** (`lightgbm`): High-performance gradient boosting.  

### 🔹 Miscellaneous  
- **Itertools** (`itertools`): Advanced iteration tools.  
- **Warnings** (`warnings`): Suppressing unnecessary warnings for a cleaner output.  

Each of these tools plays a crucial role in preparing, visualizing, modeling, and evaluating our dataset for effective diabetes prediction. 🚀


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import itertools
import plotly.graph_objects as go

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import KNNImputer
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.simplefilter(action="ignore")




## Configuring Pandas Display Options

To improve the readability of our dataset when displayed in the notebook, we configure **Pandas display options** as follows:  

- **`display.max_columns = None`** → Ensures all columns are displayed without truncation.  
- **`display.width = None`** → Automatically adjusts the display width to fit the notebook output.  
- **`display.max_rows = 20`** → Limits the number of displayed rows to 20 for better readability.  
- **`display.float_format = lambda x: '%.3f' % x`** → Formats floating-point numbers to 3 decimal places for consistency.  

These settings help us visualize the dataset more effectively without losing important details.


In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Loading the Dataset  

The diabetes dataset is imported using Pandas, allowing us to structure and manipulate the data efficiently. This dataset will serve as the foundation for exploratory data analysis (EDA) and machine learning modeling.  

By loading the data into a DataFrame, we can seamlessly perform operations such as filtering, transformation, and visualization, facilitating a deeper understanding of the relationships between variables and their impact on diabetes prediction.  

In [3]:
df = pd.read_csv("diabetes.csv")

## Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial phase in any data analysis project. In this stage, our goal is to better understand the dataset, explore the relationships between variables, and detect possible patterns, trends, and anomalies.  

During EDA, we focus on identifying:

- The **structure** and **content** of the dataset (size, variable types).
- **Missing values** and their impact on the analysis.
- The **distributions** of numerical and categorical variables.
- **Correlations** between variables that might influence the predictive model.

EDA provides the initial insights needed to make informed decisions about how to proceed with data preparation and model construction. It is essential to ensure that the dataset's quality is suitable for further analysis.

In this project, we will use various tools and methods to explore our dataset and gain a better understanding of the relationships between the variables.  

The **`check_df()`** function is designed to perform a quick preliminary analysis of our dataset to provide an organized overview. When running this function, several key metrics are presented that will help us evaluate the data's quality and characteristics.

#### Description of Each Section:

- **Shape**: Displays the number of rows and columns in the DataFrame. This gives us an idea of the dataset's size.

- **Data Types**: Shows the data types of each column (e.g., integer, float, object). This is important to ensure the data is in the correct format for further analysis.

- **Head and Tail**: Displays the first and last few rows of the dataset. This allows us to quickly inspect the first and last entries to identify potential inconsistencies or patterns.

- **Missing Values**: Calculates and shows the number of missing (NaN) values in each column. Identifying missing values is essential for deciding how to handle them (e.g., removal, imputation, etc.).

- **Unique Values**: Shows the number of unique values in each column. This helps us understand the diversity of data in each variable, especially for categorical columns.

- **Summary Statistics**: Provides key descriptive statistics such as mean, standard deviation, min, max, and percentiles. This is useful for understanding the distribution of numerical variables.

- **Quantiles**: Displays selected percentiles (0, 5, 50, 95, 99, 100) of numerical columns. This helps identify the spread of data and potential outliers.

By running this function at the start of the analysis, we gain a general understanding of the dataset, allowing us to make informed decisions about how to handle missing values, outliers, and other critical aspects before moving forward with modeling.  
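
One caveat is worth keeping in mind: `isnull()` only counts NaN entries. In this dataset, measurements such as insulin or skin thickness can be recorded as 0, which is not physiologically plausible and most likely stands for a missing reading. The sketch below works on toy rows rather than the real file and assumes that zeros in a hand-picked list of columns (`zero_as_missing`, a hypothetical name) mean "not measured". It shows one way such values could be counted and converted to NaN.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the diabetes data; the values are illustrative only
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "Pregnancies": [6, 1, 8, 0],  # zero is a valid value for this column
})

# Assumption: a zero in these columns stands for a missing measurement
zero_as_missing = ["Glucose", "SkinThickness", "Insulin"]

# The NaN-based check reports nothing, because the gaps are stored as zeros
print("NaN count:", toy.isnull().sum().sum())

# Count zeros per affected column
zero_counts = (toy[zero_as_missing] == 0).sum()
print(zero_counts)

# Convert those zeros to NaN so that later imputation steps can see them
cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Which columns belong in the list is a judgment call based on domain knowledge; `Pregnancies` is left out here because zero is a legitimate value for it.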


In [4]:
def check_df(dataframe, head=5):
    """
    Provides an overview of a Pandas DataFrame, displaying key statistics and structure.

    Parameters:
    dataframe (pd.DataFrame): The DataFrame to analyze.
    head (int): Number of rows to display in the head() and tail() sections.

    Returns:
    None
    """
    print("##################### Shape #####################")
    print(f"Rows: {dataframe.shape[0]}, Columns: {dataframe.shape[1]}\n")

    print("##################### Data Types #####################")
    print(dataframe.dtypes, "\n")

    print("##################### Head #####################")
    print(dataframe.head(head), "\n")

    print("##################### Tail #####################")
    print(dataframe.tail(head), "\n")

    print("##################### Missing Values #####################")
    na_counts = dataframe.isnull().sum()
    print(na_counts[na_counts > 0] if na_counts.sum() > 0 else "No missing values", "\n")

    print("##################### Unique Values #####################")
    print(dataframe.nunique(), "\n")

    print("##################### Summary Statistics #####################")
    print(dataframe.describe().T, "\n")

    print("##################### Quantiles #####################")
    # numeric_only=True keeps this call working if non-numeric columns are present
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)

# Execute function
check_df(df)


##################### Shape #####################
Rows: 768, Columns: 9

##################### Data Types #####################
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object 

##################### Head #####################
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin    BMI  DiabetesPedigreeFunction  \
0            6      148             72             35        0 33.600                     0.627   
1            1       85             66             29        0 26.600                     0.351   
2            8      183             64              0        0 23.300                     0.672   
3            1       89             66             23       94 28.100                  


## Identifying and Classifying Numeric and Categorical Variables

In any data analysis or machine learning project, one of the first steps is to understand the types of variables in the dataset. This is crucial because the way we handle variables depends on whether they are numerical or categorical.

### Types of Variables
1. **Numerical Variables**:
   - These are variables that represent quantifiable values and can be discrete (like the number of items) or continuous (like height, weight, or age).
   - Examples include: `Age`, `BloodPressure`, `BMI`, `Glucose`.

2. **Categorical Variables**:
   - These variables represent categories or groups. They usually take on a limited, fixed number of values, and the categories might not have a specific order.
   - In this dataset the only example is `Outcome` (diabetes positive/negative). A column such as gender would also be categorical, but the dataset has none because all patients are female.

### Importance of Identifying Variable Types
Correctly identifying the type of each variable helps us make decisions about:
- **Data Preprocessing**: Numerical variables may need scaling or normalization, while categorical variables might require encoding techniques like One-Hot Encoding or Label Encoding.
- **Analysis & Modeling**: Some machine learning models or statistical tests require different treatments for numerical vs categorical variables.
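
To make the preprocessing distinction concrete, here is a small sketch on toy data. The `Group` column is hypothetical (the diabetes dataset has no text column) and the BMI values are illustrative. Standardization is written out by hand with pandas to show the formula; it uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import pandas as pd

# Toy data; `Group` is a hypothetical categorical column used only for illustration
toy = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Group": ["a", "b", "a", "c"],
})

# Numerical column: standardize to zero mean and unit variance
toy["BMI_scaled"] = (toy["BMI"] - toy["BMI"].mean()) / toy["BMI"].std(ddof=0)

# Categorical column: one-hot encode into one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["Group"], dtype=int)
print(encoded)
```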

### Categorizing Variables
To classify the variables into numerical and categorical, we typically rely on:
- **Data Type**: Checking the data type of each column (e.g., integer, float, object).
- **Unique Value Count**: For numerical variables with few unique values (e.g., binary or limited range), it might make sense to treat them as categorical.
- **Business Context**: Sometimes, a variable that is numerical might be better treated as categorical based on the domain knowledge or the analysis objective.

### Process Overview
1. **Identify Categorical Variables**: These variables are usually of type "object" or string, but sometimes numerical columns with a small number of unique values can also be treated as categorical.
2. **Identify Numerical Variables**: These are variables of type integer or float that contain continuous or discrete data. However, numerical variables with a very limited range of unique values may need to be categorized.
3. **Distinguish Special Cases**: Some variables may have characteristics that allow them to belong to both categories, such as "numerical but categorical" or "categorical but cardinal." These cases often require additional scrutiny.
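
The three steps above can be sketched on a toy DataFrame that, unlike the diabetes data, contains all four cases. The column names `City` and `PatientId` are hypothetical, and the thresholds 10 and 20 mirror the defaults of `grab_col_names`. Note that this sketch checks for numeric types with `pd.api.types.is_numeric_dtype`, which is a slightly different test from comparing the dtype to `"O"`.

```python
import pandas as pd

# Toy data covering every case; names and values are illustrative only
toy = pd.DataFrame({
    "Outcome": [0, 1] * 15,                     # numeric, 2 unique values
    "Age": list(range(21, 51)),                 # numeric, 30 unique values
    "City": ["x", "y", "z"] * 10,               # text, 3 unique values
    "PatientId": [f"p{i}" for i in range(30)],  # text, 30 unique values
})

cat_th, car_th = 10, 20  # thresholds for "few" and "many" unique values

labels = {}
for col in toy.columns:
    numeric = pd.api.types.is_numeric_dtype(toy[col])
    n_unique = toy[col].nunique()
    if numeric and n_unique < cat_th:
        labels[col] = "numerical but categorical"
    elif numeric:
        labels[col] = "numerical"
    elif n_unique > car_th:
        labels[col] = "categorical but cardinal"
    else:
        labels[col] = "categorical"

print(labels)
```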

### Benefits of Capturing Variable Types
- **Efficient Data Processing**: By accurately categorizing variables, we can efficiently preprocess data, apply the correct transformations, and choose the right algorithms for analysis.
- **Improved Model Accuracy**: Knowing the nature of the data ensures the right model is selected, and appropriate techniques are applied for feature engineering, thus improving the performance of the machine learning model.

In summary, identifying and categorizing numeric and categorical variables is an essential first step in the data exploration process. It lays the foundation for proper data handling, effective analysis, and better decision-making in model building.

In [6]:
def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Classify columns in a dataframe into categorical and numerical columns
    based on their characteristics such as data type and number of unique values.

    Parameters:
    dataframe (pd.DataFrame): The DataFrame to analyze.
    cat_th (int): Numerical columns with fewer than this many unique values are treated as categorical.
    car_th (int): Object-type columns with more than this many unique values are treated as cardinal.

    Returns:
    cat_cols (list): List of categorical columns.
    num_cols (list): List of numerical columns.
    cat_but_car (list): List of columns that are categorical but have more than 'car_th' unique values.
    num_but_cat (list): List of columns that are numerical but have fewer than 'cat_th' unique values.
    """

    # Identify categorical columns (those with object type)
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtype == "O"]

    # Identify numerical columns with fewer unique values (treated as categorical)
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and dataframe[col].dtype != "O"]

    # Identify categorical columns with more than 'car_th' unique values (treated as cardinal)
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and dataframe[col].dtype == "O"]

    # Merge regular categorical columns with those identified as numerical but categorical
    cat_cols = list(set(cat_cols + num_but_cat) - set(cat_but_car))  # Remove columns that are categorical but cardinal

    # Identify numerical columns excluding those that are treated as categorical
    num_cols = [col for col in dataframe.columns if dataframe[col].dtype != "O" and col not in num_but_cat]

    # Summary statistics
    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"Categorical Columns (cat_cols): {len(cat_cols)}")
    print(f"Numerical Columns (num_cols): {len(num_cols)}")
    print(f"Categorical but Cardinal (cat_but_car): {len(cat_but_car)}")
    print(f"Numerical but Categorical (num_but_cat): {len(num_but_cat)}")

    return cat_cols, num_cols, cat_but_car, num_but_cat

### Variable Classification Results 📊

After running the function **`grab_col_names(df)`**, we have classified the columns in the dataset into different categories based on their data types and the number of unique values they contain. Here's a summary of the classification results:

- **Observations**: The dataset contains a total of 768 observations (rows).
- **Variables**: The dataset has 9 variables (columns).

### Classification of Variables:

1. **Categorical Columns (cat_cols)**:
   - **Count**: 1
   - These are columns where the values represent categories or groups. In our dataset, only one column is identified as categorical.

2. **Numerical Columns (num_cols)**:
   - **Count**: 8
   - These columns contain numerical values, representing measurable quantities. Our dataset has 8 numerical variables, which are essential for statistical analysis and machine learning.

3. **Categorical but Cardinal (cat_but_car)**:
   - **Count**: 0
   - No categorical columns have more than 20 unique values. Columns in this group would need special handling, since one-hot encoding them would create a large number of sparse features.

4. **Numerical but Categorical (num_but_cat)**:
   - **Count**: 1
   - This category contains variables that are numerical in type but have a limited number of unique values. In our case, one numerical column is treated as categorical due to its limited range of unique values.

In [7]:
cat_cols, num_cols, cat_but_car, num_but_cat = grab_col_names(df)

Observations: 768
Variables: 9
Categorical Columns (cat_cols): 1
Numerical Columns (num_cols): 8
Categorical but Cardinal (cat_but_car): 0
Numerical but Categorical (num_but_cat): 1


Based on this classification, we can proceed with appropriate data preprocessing steps. For instance, categorical columns may need encoding, while numerical columns may require scaling or normalization. Additionally, understanding how many columns belong to each type helps in choosing the right machine learning models and strategies for feature engineering.

The function has identified the following column as categorical:

- **`Outcome`**: This column is the target variable, indicating whether a patient has diabetes. It takes two values: `1` (diabetic) and `0` (non-diabetic). Although it is stored as an integer, it is a **binary classification variable**: the two values label distinct classes rather than points on a continuous scale, which is why the function treats it as categorical.

In [8]:
cat_cols

['Outcome']

The function has identified the following columns as numerical:

- **`Pregnancies`**: The number of pregnancies a patient has had.
- **`Glucose`**: The 2-hour plasma glucose concentration during an oral glucose tolerance test.
- **`BloodPressure`**: The blood pressure value (measured in mm Hg).
- **`SkinThickness`**: The thickness of the skin at the triceps (in mm).
- **`Insulin`**: The 2-hour serum insulin level (in µU/ml).
- **`BMI`**: The Body Mass Index (BMI) of the patient.
- **`DiabetesPedigreeFunction`**: A score that summarizes the patient's hereditary risk of diabetes based on family history.
- **`Age`**: The age of the patient (in years).

These columns represent quantitative data and contain continuous or discrete numeric values. They measure variables such as glucose levels, blood pressure, BMI, and age, all of which are essential for making predictions in machine learning models. These numerical columns will be used for statistical analysis and model training in the next steps of the project.


In [9]:
num_cols

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

No columns have been identified as **Categorical but Cardinal (cat_but_car)** in this dataset.

**Categorical but Cardinal** refers to categorical variables with a large number of unique values (more than `car_th`, which is 20 by default). Such variables need special handling: one-hot encoding them would produce many sparse columns, so they are usually better aggregated into fewer categories, encoded with a more compact scheme, or dropped if they are merely identifiers.

Since no columns fall into this category, no special handling is needed here. The only categorical variable, `Outcome`, is already encoded as 0/1 and can be used as-is.

In [10]:
cat_but_car

[]
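
Had the dataset contained a high-cardinality column, one option mentioned above is to aggregate it into fewer categories. The following is a minimal sketch of that idea on a hypothetical `Clinic` column, assuming that any category seen fewer than `min_count` times (an illustrative threshold) is folded into a single `"Other"` level.

```python
import pandas as pd

# Hypothetical categorical column; not part of the diabetes dataset
clinic = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="Clinic")

min_count = 2  # illustrative threshold for what counts as a rare category

# Find categories that occur fewer than min_count times
counts = clinic.value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories, replace rare ones with a shared "Other" label
grouped = clinic.where(~clinic.isin(rare), "Other")
print(grouped.value_counts())
```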

The following column has been identified as **Numerical but Categorical (num_but_cat)**:

- **`Outcome`**: This is the target variable of the dataset, indicating whether a patient has diabetes (`1`) or not (`0`).

Although `Outcome` is stored as integers, it has only two unique values (0 and 1), which is below the `cat_th=10` threshold, so the function places it in **Numerical but Categorical**. The two values denote distinct classes (diabetic vs. non-diabetic) rather than a continuous quantity.

For modeling purposes the column keeps its numeric 0/1 encoding, but conceptually it remains categorical.

In [13]:
num_but_cat

['Outcome']
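
Because `Outcome` is a binary target, a natural next check is how its two classes are distributed, since accuracy alone can be misleading when one class dominates. The sketch below uses a toy series with made-up values, not the real column; applying the same calls to `df["Outcome"]` would give the actual distribution.

```python
import pandas as pd

# Toy target values for illustration only; not taken from the dataset
outcome = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], name="Outcome")

# Absolute and relative frequency of each class
counts = outcome.value_counts()
ratios = outcome.value_counts(normalize=True)

summary = pd.DataFrame({"count": counts, "ratio": ratios})
print(summary)
```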