# Diabetes - Descriptive Statistics

## Table of Contents

1. [Diabetes - Descriptive Statistics](#diabetes---descriptive-statistics)
2. [Importing necessary libraries](#importing-necessary-libraries)
3. [Loading dataset](#loading-dataset)
4. [Initial information about dataset](#initial-information-about-dataset)
    - [Basic information](#basic-information)
    - [Process null values](#process-null-values)
        - [Check null values](#check-null-values)
        - [Understanding dataset](#understanding-dataset)
        - [Replace missing values](#replace-missing-values)
    - [Process duplicate rows](#process-duplicate-rows)
        - [Check duplicate rows](#check-duplicate-rows)
5. [Descriptive statistics of numeric variables](#descriptive-statistics-of-numeric-variables)
6. [The end](#the-end)

## Importing necessary libraries

In [73]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Loading dataset

In [74]:
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
diabetes = pd.read_csv('pima-indians-diabetes.csv', names=columns)
df = diabetes.copy()
df.head()

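|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |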


## Initial information about dataset

### Basic information

In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Based on the description included in the dataset file, there are 8 input variables and 1 output variable.

#### Input variables:
- **Pregnancies**: Number of times pregnant

- **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

- **BloodPressure**: Diastolic blood pressure (mm Hg)

- **SkinThickness**: Triceps skin fold thickness (mm)

- **Insulin**: 2-Hour serum insulin (mu U/ml)

- **BMI**: Body mass index (weight in kg/(height in m)^2)

- **DiabetesPedigreeFunction (DPF)**: Diabetes pedigree function

- **Age**: Age (years)

#### Output variable:
- **Outcome**: Class variable (0 or 1)

Based on medical literature and physiological constraints, the following ranges represent realistic values for each feature:

- **Pregnancies: x ≥ 0**: Negative pregnancy counts are impossible.

- **Glucose: 50 ≤ x ≤ 500 mg/dL**: Values below 50 mg/dL typically indicate severe hypoglycemia requiring immediate medical intervention, while values above 500 mg/dL represent extreme hyperglycemia that would likely be fatal without treatment. Zero values are impossible for living subjects.

- **BloodPressure: 40 ≤ x ≤ 200 mmHg**: Diastolic blood pressure below 40 mmHg or above 200 mmHg represents extreme physiological conditions incompatible with normal life. Zero values are impossible for living subjects.

- **SkinThickness: 5 ≤ x ≤ 50 mm**: Triceps skinfold thickness below 5 mm or above 50 mm is extremely rare, even in cases of severe malnutrition or morbid obesity. Zero values are physiologically impossible.

- **Insulin: x ≥ 2 μU/ml**: Serum insulin levels cannot be negative. Values below 2 μU/ml are extremely rare even in Type 1 diabetes patients. Zero values likely indicate missing data rather than true measurements.

- **BMI: 10 ≤ x ≤ 70**: BMI values below 10 kg/m² are incompatible with life, while values above 70 kg/m² are extraordinarily rare even in cases of extreme obesity. Zero values are impossible.

- **DiabetesPedigreeFunction: x ≥ 0**: This is a calculated score representing genetic predisposition to diabetes based on family history. Negative values would be meaningless in this context.

- **Age: x ≥ 0**: Age values must be zero or positive integers.
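
The ranges listed above can be turned into a quick sanity check. The following is a minimal sketch, not part of the original analysis: the `PLAUSIBLE_RANGES` dictionary and the `count_out_of_range` helper are hypothetical names introduced for illustration, and the sample rows are the first three rows of the `df.head()` output.

```python
import pandas as pd

# Plausible ranges copied from the list above; None means "no bound on that side".
PLAUSIBLE_RANGES = {
    "Pregnancies": (0, None),
    "Glucose": (50, 500),
    "BloodPressure": (40, 200),
    "SkinThickness": (5, 50),
    "Insulin": (2, None),
    "BMI": (10, 70),
    "DiabetesPedigreeFunction": (0, None),
    "Age": (0, None),
}


def count_out_of_range(frame, ranges=PLAUSIBLE_RANGES):
    """Return, per column, how many values fall outside the plausible range."""
    counts = {}
    for column, (low, high) in ranges.items():
        if column not in frame.columns:
            continue  # skip features that are absent from this frame
        values = frame[column]
        outside = pd.Series(False, index=frame.index)
        if low is not None:
            outside |= values < low
        if high is not None:
            outside |= values > high
        counts[column] = int(outside.sum())
    return pd.Series(counts, name="out_of_range")


# Illustrative sample: the first three rows shown by df.head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
})
result = count_out_of_range(sample)
print(result)
```

On the full dataset, the same helper could be called as `count_out_of_range(df)` to see which features contain implausible values such as zeros.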

### Process null values

#### Check null values

We can see in the `df.head()` output that some features contain 0, which makes no sense for these measurements. Such zeros indicate missing values.

Below we replace the 0 values in those features with NaN.

In [76]:
# Features in which a value of 0 is treated as missing
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
print("Total number of null values in the dataset")
print(df.isnull().sum())

Total number of null values in the dataset
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64


Percentage of missing data

In [77]:
print("Percentage of missing value in each feature")
print(100 * df.isnull().sum() / len(df))  # len(df) is the number of rows (768)

Percentage of missing value in each feature
Pregnancies                  0.000000
Glucose                      0.651042
BloodPressure                4.557292
SkinThickness               29.557292
Insulin                     48.697917
BMI                          1.432292
DiabetesPedigreeFunction     0.000000
Age                          0.000000
Outcome                      0.000000
dtype: float64


Before dealing with the missing values, we must first understand the data. Only then can we choose one of the following methods:
- Delete entire row
- Delete entire feature
- Imputation:
    - Mean, median, mode imputation
    - k-Nearest Neighbors imputation (k-NN)
    - Hot-deck imputation
    - Multiple imputation
    - Regression imputation
    - ...
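
The first three options in the list above can be sketched with pandas alone. This is a toy illustration under stated assumptions, not the approach applied to the real dataset at this point: the `toy` frame and the `max_missing_share` cut-off are hypothetical and chosen only to make the behaviour visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "Age": [50, 31, 32, 21],
})

# Option 1: delete every row that has at least one missing value.
rows_dropped = toy.dropna(axis=0)

# Option 2: delete features whose share of missing values exceeds a cut-off.
max_missing_share = 0.5  # hypothetical threshold
missing_share = toy.isnull().mean()
features_dropped = toy.loc[:, missing_share <= max_missing_share]

# Option 3: simple imputation, here with each column's median.
median_imputed = toy.fillna(toy.median())

print(rows_dropped.shape)
print(list(features_dropped.columns))
print(median_imputed["Glucose"].tolist())
```

Deleting rows keeps only one of the four toy rows, which shows why row deletion is costly when a feature such as Insulin is missing in almost half of the records.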

#### Understanding dataset

In [78]:
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].describe().apply(lambda x: x.apply('{:.2f}'.format))

|   | Glucose | BloodPressure | SkinThickness | Insulin | BMI |
|---|---|---|---|---|---|
| count | 763.0 | 733.0 | 541.0 | 394.0 | 757.0 |
| mean | 121.69 | 72.41 | 29.15 | 155.55 | 32.46 |
| std | 30.54 | 12.38 | 10.48 | 118.78 | 6.92 |
| min | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 |
| 25% | 99.0 | 64.0 | 22.0 | 76.25 | 27.5 |
| 50% | 117.0 | 72.0 | 29.0 | 125.0 | 32.3 |
| 75% | 141.0 | 80.0 | 36.0 | 190.0 | 36.6 |
| max | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 |


Based on the output, we can see that:
- For Insulin, there is a large gap between the mean (155.55) and the median (125.0, the 50% row), which shows that its distribution is skewed.
- The distributions of Glucose, BloodPressure, SkinThickness and BMI are only slightly skewed.

Since all 5 features show some skewness in their distributions, and given the limited ability of the group, we choose median imputation to replace the missing values. The median is less affected by extreme values than the mean.
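
The mean-versus-median comparison used above can also be quantified. The following is a minimal sketch on made-up data: the `skew_summary` helper and the `toy` frame are hypothetical names introduced for illustration, and `DataFrame.skew()` is used here as one possible skewness measure, not necessarily the one the analysis relied on.

```python
import pandas as pd


def skew_summary(frame):
    """Per column: mean, median, sample skewness, and the mean-median gap."""
    summary = pd.DataFrame({
        "mean": frame.mean(),
        "median": frame.median(),
        "skewness": frame.skew(),  # NaN values are skipped by default
    })
    summary["mean_minus_median"] = summary["mean"] - summary["median"]
    return summary


# Hypothetical toy columns: one with a long right tail, one symmetric.
toy = pd.DataFrame({
    "right_skewed": [1, 2, 2, 3, 3, 3, 40],
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
})
summary = skew_summary(toy)
print(summary.round(2))
```

A positive `mean_minus_median` together with a positive `skewness` suggests a right-skewed column, which is the pattern the Insulin row of the table shows (mean 155.55 against median 125.0).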

#### Replace missing values

In [79]:
df = df.fillna(df.median())
print("Totally there are {} null values in the dataset".format(df.isnull().sum().sum()))

Totally there are 0 null values in the dataset


### Process duplicate rows

#### Check duplicate rows

In [80]:
duplicated_rows = df[df.duplicated()]
# Use len() rather than unpacking .shape, so the `columns` list defined earlier is not overwritten
rows = len(duplicated_rows)
print(f"Rows that have duplicated values: {rows}")

Rows that have duplicated values: 0


## Descriptive statistics of numeric variables

In [81]:
df.describe().apply(lambda x: x.apply('{:.2f}'.format))

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.85 | 121.66 | 72.39 | 29.11 | 140.67 | 32.46 | 0.47 | 33.24 | 0.35 |
| std | 3.37 | 30.44 | 12.1 | 8.79 | 86.38 | 6.88 | 0.33 | 11.76 | 0.48 |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.08 | 21.0 | 0.0 |
| 25% | 1.0 | 99.75 | 64.0 | 25.0 | 121.5 | 27.5 | 0.24 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 29.0 | 125.0 | 32.3 | 0.37 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.63 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


The output contains several summary statistics for each column:

- **Count**: Shows the number of non-null values in each column.
- **Mean**: Indicates the average value for each numerical column.
- **Std**: The standard deviation, representing how spread out the values are from the mean.
- **Min** & **Max**: Show the minimum and maximum values, thus giving the range of the data (range = max - min).
- **25%** (Q1), **50%** (Median), **75%** (Q3): These are the quartiles. The 50% is the median. These give insights into how the data is distributed (e.g. symmetric, skewed).

Distribution/Spread: If the mean and median (50%) are close, the distribution is fairly symmetric; if not, the data may be skewed.
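
The range and the mean-median comparison described above can be derived directly from the `describe()` output. This is a minimal sketch: the `spread_summary` helper is a hypothetical name, the interquartile range (Q3 - Q1) is an extra spread measure added for illustration, and the toy `Age` values are simply the min, quartiles and max from the Age column of the table.

```python
import pandas as pd


def spread_summary(frame):
    """Derive range, interquartile range and mean-median gap from describe()."""
    stats = frame.describe().T  # one row per column
    stats["range"] = stats["max"] - stats["min"]
    stats["iqr"] = stats["75%"] - stats["25%"]
    stats["mean_minus_median"] = stats["mean"] - stats["50%"]
    return stats[["range", "iqr", "mean_minus_median"]]


# Toy column built from the Age five-number summary in the table above.
toy = pd.DataFrame({"Age": [21, 24, 29, 41, 81]})
result = spread_summary(toy)
print(result)
```

On the real data the same helper could be called as `spread_summary(df)`; a `mean_minus_median` far from zero relative to the column's spread hints at a skewed distribution.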

# The end