eda

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df = pd.read_csv("../data/raw/diabetes.csv")
df.head()


In [None]:
df.info()


In [None]:
# Count zeros per column; zeros in clinical measurements may be placeholders for missing data
(df == 0).sum()


In [None]:
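# A notebook cell only displays its last expression, so print both results explicitly
print(df['Outcome'].value_counts())                # class counts
print(df['Outcome'].value_counts(normalize=True))  # class proportions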


## 3. Missing Value Analysis
In this dataset, some features contain biologically impossible zero values (a living patient cannot have a Glucose, BloodPressure, or BMI of 0). These zeros indicate missing measurements rather than actual zeros.


In [None]:
cols_with_zero_as_missing = [
    'Glucose', 
    'BloodPressure', 
    'SkinThickness', 
    'Insulin', 
    'BMI'
]

(df[cols_with_zero_as_missing] == 0).sum()


In [None]:
df_mv = df.copy()
df_mv[cols_with_zero_as_missing] = df_mv[cols_with_zero_as_missing].replace(0, np.nan)

df_mv.isnull().sum()


In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df_mv.isnull(), cbar=False)
plt.title("Missing Value Heatmap")
plt.show()


The heatmap shows that several features, especially Insulin and SkinThickness, contain a significant number of missing values. These missing values will be handled during the preprocessing stage.
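To put numbers next to the heatmap, the share of missing entries per column can be computed directly. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from the real dataset); in the notebook the same expression would be applied to `df_mv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Treat zeros as missing, mirroring the replacement step above.
toy_mv = toy.replace(0, np.nan)

# Percentage of missing entries per column, largest first.
missing_pct = toy_mv.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting by percentage makes it easy to see which columns would need the most attention during preprocessing.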


## 4. Distribution Analysis


In [None]:
df.hist(figsize=(14,12))
plt.tight_layout()
plt.show()


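The distributions show that some features are skewed and contain extreme values, particularly Insulin and BMI. This suggests the presence of outliers and the need for scaling. Note that `df` still contains the placeholder zeros at this point, so any bars at 0 for the columns listed in `cols_with_zero_as_missing` reflect missing measurements rather than real values.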
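The visual impression of skew and extreme values can be backed by simple statistics. The sketch below computes per-column skewness and counts values outside the 1.5 × IQR fences, which is the same rule a standard boxplot uses for its whiskers. The frame `toy` and its values are hypothetical; the IQR rule is one common convention, not necessarily the one this project adopts later.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only, not taken from the dataset).
toy = pd.DataFrame({
    "Insulin": [15, 40, 60, 80, 100, 120, 150, 846],
    "BMI": [22.0, 25.1, 27.3, 29.9, 31.2, 33.6, 35.0, 67.1],
})

# Positive skewness indicates a long right tail.
skewness = toy.skew()

# Flag values beyond 1.5 * IQR from the quartiles.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(skewness)
print(outlier_counts)
```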


In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df)
plt.xticks(rotation=90)
plt.title("Boxplot of Features")
plt.show()


## 5. Feature vs Target Analysis


In [None]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
sns.boxplot(x='Outcome', y='Glucose', data=df)
plt.title('Glucose vs Outcome')

plt.subplot(1,2,2)
sns.boxplot(x='Outcome', y='BMI', data=df)
plt.title('BMI vs Outcome')

plt.show()


Patients with diabetes (Outcome = 1) tend to have higher Glucose and BMI values compared to non-diabetic patients.
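The comparison in the boxplots can also be summarized numerically by grouping on the target. This is a minimal sketch on a hypothetical frame (`toy`, with made-up values); in the notebook the same `groupby` call would run on `df`.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Mean and median of each feature within each Outcome class.
group_stats = toy.groupby("Outcome")[["Glucose", "BMI"]].agg(["mean", "median"])
print(group_stats)
```

The median is reported alongside the mean because it is less sensitive to the extreme values noted in the distribution analysis.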


## 6. Correlation Analysis


In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
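A full correlation heatmap can be hard to read when only the relationship with the target matters. One option, sketched here on a hypothetical frame (`toy`, with made-up values), is to extract the `Outcome` column of the correlation matrix and sort it; in the notebook this would be applied to `df.corr()`.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of each feature with the target, strongest positive first.
corr_with_target = (
    toy.corr()["Outcome"]
    .drop("Outcome")              # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value here does not by itself rule a feature out.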


## 7. EDA Summary

Exploratory Data Analysis revealed that the dataset contains missing values encoded as zeros, skewed distributions, and several outliers. Glucose and BMI are both higher among patients with diabetes, which makes them promising predictors for diabetes classification. These findings guide the preprocessing and modeling steps of the project.
