# Week 1: Exploratory Data Analysis (EDA)

## Overview
This notebook covers Week 1, Exploratory Data Analysis (EDA), of the GlucoTrack Advanced Track project.

## Learning Objectives
- [ ] Complete all required tasks for this week
- [ ] Document findings and insights
- [ ] Prepare for next week's challenges


## Setup and Imports

Import all necessary libraries for this week's work.

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Deep Learning imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Machine Learning imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Explainability imports
import shap
import lime
import lime.lime_tabular

# Experiment tracking
import mlflow
import mlflow.pytorch

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")


PyTorch version: 2.2.2
CUDA available: False


## 1. Data Loading and Initial Exploration

### Task Description
Complete the data loading and initial exploration tasks for this week.

### Your Work
Add your code and analysis below:

In [2]:
# Load the diabetes dataset
df = pd.read_csv('../data/diabetes_binary_health_indicators_BRFSS2015.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
display(df.head())

print("\nDataset Info:")
print(df.info())

print("\nBasic Statistics:")
display(df.describe())

print("\nMissing Values:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found")

print("\nColumn Names:")
print(df.columns.tolist())

# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Display unique values for categorical columns
print("\nUnique values in categorical columns:")
for col in df.select_dtypes(include=['object', 'category']).columns:
    print(f"{col}: {df[col].unique()}")

Dataset Shape: (253680, 22)

First 5 rows:


Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_binary       253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.139333,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.094186,0.756544,0.634256,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119,5.050434,6.053875
std,0.346294,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.292087,0.429169,0.481639,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422,0.985774,2.071148
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0,4.0,5.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0,6.0,8.0
max,1.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0,6.0,8.0



Missing Values:
No missing values found

Column Names:
['Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income']

Duplicate rows: 24206

Unique values in categorical columns:
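
The output above reports 24,206 duplicate rows and no object-typed columns. Because most columns are binary or ordinal codes, identical rows may be different respondents who gave the same answers, so dropping them is a judgment call. The following is a minimal sketch of how the duplicates could be quantified before deciding; `summarize_duplicates` and the toy frame are hypothetical and use made-up values, not the BRFSS data.

```python
import pandas as pd


def summarize_duplicates(frame: pd.DataFrame) -> dict:
    """Count exact duplicate rows and report what would remain after dropping them."""
    n_total = len(frame)
    n_dup = int(frame.duplicated().sum())  # rows identical to an earlier row
    return {
        "total_rows": n_total,
        "duplicate_rows": n_dup,
        "duplicate_share": n_dup / n_total if n_total else 0.0,
        "rows_after_drop": n_total - n_dup,
    }


# Tiny illustrative frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Diabetes_binary": [0.0, 0.0, 1.0, 0.0],
    "HighBP": [1.0, 1.0, 1.0, 0.0],
    "BMI": [40.0, 40.0, 28.0, 25.0],
})

summary = summarize_duplicates(toy)
print(summary)

# Dropping keeps the first occurrence of each duplicated row
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```

On the real data the same function could be called as `summarize_duplicates(df)`; whether to keep `deduplicated` or the original frame depends on how the duplicates are interpreted.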


## 2. Data Integrity & Structure Analysis

### Task Description
Complete the data integrity & structure analysis tasks for this week.

### Your Work
Add your code and analysis below:

In [3]:
# Analyze data types and structure
print("Data Types:")
print(df.dtypes)

print("\nMemory Usage:")
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check for data quality issues
print("\nData Quality Analysis:")

# Check for outliers using IQR method
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return len(outliers), lower_bound, upper_bound

print("Outlier Analysis (for numerical columns):")
numerical_cols = df.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    if col != 'Diabetes_binary':  # Skip target variable
        outliers_count, lower, upper = detect_outliers(df, col)
        print(f"{col}: {outliers_count} outliers (bounds: {lower:.2f}, {upper:.2f})")

# Check for inconsistent data
print("\nValue Ranges:")
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        print(f"{col}: {df[col].min()} to {df[col].max()}")

# Check for potential data entry errors
print("\nPotential Data Issues:")
for col in df.columns:
    if df[col].dtype in ['int64', 'float64']:
        unique_vals = df[col].unique()
        if len(unique_vals) < 20:  # For categorical-like numerical columns
            print(f"{col}: {sorted(unique_vals)}")

Data Types:
Diabetes_binary         float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object

Memory Usage:
Total memory usage: 42.58 MB

Data Quality Analysis:
Outlier Analysis (for numerical columns):
HighBP: 0 outliers (bounds: -1.50, 2.50)
HighChol: 0 outliers (bounds: -1.50, 2.50)
CholCheck: 9470 outliers (bounds: 1.00, 1.00)
BMI: 9847 outliers (bounds: 13.50, 41.50)
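
The IQR rule is not informative for binary flags: for CholCheck both quartiles equal 1, so every 0 is counted as an "outlier". A minimal sketch of one way to restrict the check to columns with more than two distinct values follows; the helper name `iqr_outlier_counts`, the `min_unique` threshold, and the toy values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, min_unique: int = 3) -> pd.DataFrame:
    """Apply the 1.5 * IQR rule only to columns with at least `min_unique` distinct values."""
    rows = []
    for col in frame.select_dtypes(include="number").columns:
        if frame[col].nunique() < min_unique:
            continue  # binary flag: skip, the IQR rule is not meaningful here
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        n_out = int(((frame[col] < lower) | (frame[col] > upper)).sum())
        rows.append({"column": col, "outliers": n_out, "lower": lower, "upper": upper})
    return pd.DataFrame(rows, columns=["column", "outliers", "lower", "upper"])


# Made-up values: one binary flag and one continuous-like column
toy = pd.DataFrame({
    "CholCheck": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
    "BMI": [24.0, 25.0, 27.0, 28.0, 30.0, 31.0, 26.0, 98.0],
})

report = iqr_outlier_counts(toy)
print(report)
```

With this filter only BMI is reported for the toy frame. Ordinal codes such as GenHlth or Age would still pass the threshold, so their flags need the same caution.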

## 3. Target Variable Assessment

### Task Description
Complete the target variable assessment tasks for this week.

### Your Work
Add your code and analysis below:

In [None]:
# Analyze the target variable (Diabetes_binary)
target_col = 'Diabetes_binary'

print("Target Variable Analysis:")
print(f"Target column: {target_col}")

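# Distribution of target variable
# Sort by class label so that 0 (No Diabetes) always comes first and matches the plot labels below
target_dist = df[target_col].value_counts().sort_index()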
print(f"\nTarget Distribution:")
print(target_dist)
print(f"Percentage distribution:")
print(target_dist / len(df) * 100)

# Visualize target distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
target_dist.plot(kind='bar', ax=ax1, color=['skyblue', 'lightcoral'])
ax1.set_title('Target Variable Distribution')
ax1.set_xlabel('Diabetes Status')
ax1.set_ylabel('Count')
ax1.set_xticklabels(['No Diabetes', 'Diabetes'])

# Pie chart
ax2.pie(target_dist.values, labels=['No Diabetes', 'Diabetes'], autopct='%1.1f%%', 
        colors=['skyblue', 'lightcoral'])
ax2.set_title('Target Variable Distribution (%)')

plt.tight_layout()
plt.show()

# Interactive plotly visualization
fig = px.pie(values=target_dist.values, names=['No Diabetes', 'Diabetes'], 
             title='Diabetes Status Distribution',
             color_discrete_sequence=['skyblue', 'lightcoral'])
fig.show()

# Check for class imbalance
print(f"\nClass Imbalance Analysis:")
print(f"Majority class: {target_dist.idxmax()} ({target_dist.max()} samples)")
print(f"Minority class: {target_dist.idxmin()} ({target_dist.min()} samples)")
print(f"Imbalance ratio: {target_dist.max() / target_dist.min():.2f}:1")
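
The summary statistics in Section 1 put the mean of Diabetes_binary at about 0.139, so a model that always predicts "No Diabetes" would already look accurate. The sketch below illustrates this with synthetic labels at roughly that prevalence (139 positives per 1,000, an assumption for illustration, not the real rows).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 139 positives and 861 negatives (illustrative, not the real data)
y_true = np.array([1] * 139 + [0] * 861)

# Majority-class baseline: always predict class 0 ("No Diabetes")
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority, zero_division=0)
f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"Accuracy: {acc:.3f}")
print(f"Recall (diabetes class): {rec:.3f}")
print(f"F1 (diabetes class): {f1:.3f}")
```

High accuracy with zero recall is why metrics such as recall, F1, or ROC AUC are usually preferred over accuracy for a target this imbalanced.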

## 4. Feature Distribution & Quality Analysis

### Task Description
Complete the feature distribution & quality analysis tasks for this week.

### Your Work
Add your code and analysis below:

In [None]:
# Analyze feature distributions
print("Feature Distribution Analysis:")

# Separate numerical and categorical features
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Numerical features: {numerical_features}")
print(f"Categorical features: {categorical_features}")

# Distribution plots for numerical features
numerical_features_no_target = [col for col in numerical_features if col != 'Diabetes_binary']

# Create subplots for numerical features
n_cols = 3
n_rows = (len(numerical_features_no_target) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
axes = axes.flatten()

for i, feature in enumerate(numerical_features_no_target):
    ax = axes[i]
    
    # Histogram
    ax.hist(df[feature], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    ax.set_title(f'Distribution of {feature}')
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')
    
    # Add mean and median lines
    mean_val = df[feature].mean()
    median_val = df[feature].median()
    ax.axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.2f}')
    ax.axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.2f}')
    ax.legend()

# Hide empty subplots
for i in range(len(numerical_features_no_target), len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()

# Correlation analysis
print("\nCorrelation Analysis:")
correlation_matrix = df[numerical_features].corr()
print("Correlation with target variable:")
target_correlations = correlation_matrix['Diabetes_binary'].sort_values(ascending=False)
print(target_correlations)

# Visualize correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
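
Most features here are binary flags, for which a Pearson coefficient is harder to read than a plain comparison of diabetes rates between the two levels. A minimal sketch of that comparison follows; `positive_rate_by_level` and the toy values are hypothetical.

```python
import pandas as pd


def positive_rate_by_level(frame: pd.DataFrame, feature: str,
                           target: str = "Diabetes_binary") -> pd.DataFrame:
    """Share of positive targets and row count for each level of a feature."""
    grouped = frame.groupby(feature)[target].agg(rate="mean", n="count")
    return grouped.sort_index()


# Made-up values for illustration only
toy = pd.DataFrame({
    "HighBP":          [0, 0, 0, 0, 1, 1, 1, 1],
    "Diabetes_binary": [0, 0, 0, 1, 0, 1, 1, 1],
})

rates = positive_rate_by_level(toy, "HighBP")
print(rates)
```

On the real data, `positive_rate_by_level(df, "HighBP")` would give the diabetes rate with and without high blood pressure, which can be read alongside the correlation heatmap.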

## 5. Feature Relationships & Patterns

### Task Description
Complete the feature relationships & patterns tasks for this week.

### Your Work
Add your code and analysis below:

In [None]:
# Analyze relationships between features and target
print("Feature-Target Relationships:")

# Create scatter plots for key numerical features vs target
key_features = ['BMI', 'Age', 'Income']  # Adjust based on your dataset
key_features = [f for f in key_features if f in df.columns]

fig, axes = plt.subplots(1, len(key_features), figsize=(15, 5))
if len(key_features) == 1:
    axes = [axes]

for i, feature in enumerate(key_features):
    ax = axes[i]
    
    # Scatter plot with jitter for binary target
    x = df[feature]
    y = df['Diabetes_binary'] + np.random.normal(0, 0.05, len(df))
    
    ax.scatter(x, y, alpha=0.6, s=20)
    ax.set_xlabel(feature)
    ax.set_ylabel('Diabetes Status')
    ax.set_title(f'{feature} vs Diabetes Status')
    
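    # Add trend line (fit on the raw 0/1 target, not on the jittered values)
    z = np.polyfit(x, df['Diabetes_binary'], 1)
    p = np.poly1d(z)
    x_line = np.array([x.min(), x.max()])
    ax.plot(x_line, p(x_line), "r--", alpha=0.8)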

plt.tight_layout()
plt.show()

# Analyze feature interactions
print("\nFeature Interaction Analysis:")

# BMI and Age interaction
if 'BMI' in df.columns and 'Age' in df.columns:
    plt.figure(figsize=(10, 6))
    
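    # Create BMI categories without modifying df, so later cells still see the original columns.
    # right=False makes the bins left-closed: BMI 25 is Overweight and BMI 30 is Obese.
    bmi_category = pd.cut(df['BMI'], bins=[0, 18.5, 25, 30, np.inf], right=False,
                          labels=['Underweight', 'Normal', 'Overweight', 'Obese']).rename('BMI_Category')
    
    # Diabetes rate by BMI category and age (Age is an ordinal code from 1 to 13, not years)
    interaction_data = df.groupby([bmi_category, 'Age'], observed=False)['Diabetes_binary'].mean().unstack()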
    
    sns.heatmap(interaction_data, annot=True, cmap='YlOrRd', fmt='.3f')
    plt.title('Diabetes Rate by BMI Category and Age')
    plt.xlabel('Age')
    plt.ylabel('BMI Category')
    plt.tight_layout()
    plt.show()
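
The heatmap above groups BMI into four categories with `pd.cut`. Which category a boundary value such as 25 or 30 falls into depends on the `right` argument, and BMI in this dataset takes whole-number values, so the boundaries are hit often. The sketch below compares both settings on a few made-up values, assuming the usual convention that 25 counts as Overweight and 30 as Obese.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to sit on and around the category boundaries
bmi = pd.Series([18.0, 18.5, 24.0, 25.0, 29.0, 30.0, 98.0])
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Default right=True: bins are (0, 18.5], (18.5, 25], (25, 30], (30, 100]
right_closed = pd.cut(bmi, bins=[0, 18.5, 25, 30, 100], labels=labels)

# right=False: bins are [0, 18.5), [18.5, 25), [25, 30), [30, inf)
left_closed = pd.cut(bmi, bins=[0, 18.5, 25, 30, np.inf], labels=labels, right=False)

comparison = pd.DataFrame({
    "BMI": bmi,
    "right=True": right_closed.astype(str),
    "right=False": left_closed.astype(str),
})
print(comparison)
```

Only the rows for 18.5, 25.0 and 30.0 differ between the two columns; with the default setting each of them lands one category too low under the assumed convention.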

## 6. EDA Summary & Preprocessing Plan

### Task Description
Complete the EDA summary & preprocessing plan tasks for this week.

### Your Work
Add your code and analysis below:

In [None]:
# EDA Summary
print("=== EDA SUMMARY ===")

print("\n1. DATASET OVERVIEW:")
print(f"- Total samples: {len(df)}")
print(f"- Total features: {df.shape[1] - 1} (excluding the target column)")
print("- Target variable: Diabetes_binary (binary classification)")
print(f"- Missing values: {df.isnull().sum().sum()}")
print(f"- Duplicate rows: {df.duplicated().sum()}")

print("\n2. TARGET VARIABLE ANALYSIS:")
target_dist = df['Diabetes_binary'].value_counts()
print(f"- Class distribution: {target_dist.to_dict()}")
print(f"- Class imbalance ratio: {target_dist.max() / target_dist.min():.2f}:1")
print(f"- Diabetes prevalence: {target_dist[1] / len(df) * 100:.1f}%")

print("\n3. FEATURE ANALYSIS:")
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_features_no_target = [col for col in numerical_features if col != 'Diabetes_binary']
print(f"- Numerical features: {len(numerical_features_no_target)}")
print(f"- Categorical features: {len(df.columns) - len(numerical_features)}")

print("\n4. KEY INSIGHTS:")
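# Top correlations with target, ranked by absolute value so negative correlations are not missed
correlations = df[numerical_features].corr()['Diabetes_binary'].drop('Diabetes_binary')
top_correlations = correlations.reindex(correlations.abs().sort_values(ascending=False).index)
print("Top 5 features correlated with diabetes (by absolute correlation):")
for feature, corr in top_correlations.head(5).items():
    print(f"  - {feature}: {corr:+.3f}")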

print("\n5. DATA QUALITY ASSESSMENT:")
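print(f"- Missing values: {df.isnull().sum().sum()}")
print(f"- Duplicate rows: {df.duplicated().sum()} (decide whether to drop them before modeling)")
print("- IQR rule flags outliers in BMI; flags on binary columns are artifacts of the method")
print("- All columns are stored as float64, although most are binary or ordinal codes")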

# Preprocessing Plan
print("\n=== PREPROCESSING PLAN ===")

print("\n1. FEATURE ENGINEERING:")
print("- Create BMI categories (Underweight, Normal, Overweight, Obese)")
print("- Create age groups")
print("- Handle outliers using IQR method or robust scaling")
print("- Create interaction features (e.g., BMI × Age)")

print("\n2. FEATURE SELECTION:")
print("- Remove highly correlated features (correlation > 0.8)")
print("- Use feature importance from correlation analysis")
print("- Consider domain knowledge for feature selection")

print("\n3. SCALING AND NORMALIZATION:")
print("- Apply StandardScaler to numerical features")
print("- No object-typed columns need encoding; treat GenHlth, Age, Education and Income as ordinal codes")
print("- Consider robust scaling for features with outliers")

print("\n4. CLASS IMBALANCE HANDLING:")
print("- Use SMOTE or other oversampling techniques")
print("- Consider class weights in model training")
print("- Use stratified sampling for train/test split")

print("\n5. VALIDATION STRATEGY:")
print("- Use stratified k-fold cross-validation")
print("- Maintain class distribution in splits")
print("- Use appropriate metrics (precision, recall, F1-score)")

print("\n=== NEXT STEPS ===")
print("1. Implement feature engineering based on insights")
print("2. Handle class imbalance")
print("3. Prepare data for model training")
print("4. Set up cross-validation framework")
print("5. Begin model development in Week 2")
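
The plan above names class weights and stratified k-fold cross-validation. The following is a minimal sketch of both using scikit-learn, on synthetic labels with 140 positives per 1,000 (an illustrative prevalence, not the real rows); the placeholder feature matrix `X` carries no information and only serves to drive the splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 140 positives, 860 negatives (illustrative only)
y = np.array([1] * 140 + [0] * 860)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print("Class weights:", class_weights)

# Stratified folds keep the positive rate close to the overall rate in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    rate = float(y[val_idx].mean())
    fold_rates.append(rate)
    print(f"Fold {fold}: validation positive rate = {rate:.3f}")
```

If oversampling such as SMOTE is used instead of class weights, it should be fitted inside each training fold only, so that synthetic samples never leak into the validation data.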

### Key Findings
- [x] Dataset contains diabetes prediction data with 253,680 samples and 21 features
- [x] Target variable shows class imbalance (approximately 13.9% diabetes prevalence, about 6.2:1)
- [x] No missing values detected in the dataset
- [x] Key features correlated with diabetes include BMI, age, and income
- [x] The IQR rule flags 9,847 BMI values as outliers; flags on binary columns (for example, 9,470 in CholCheck) only reflect the minority level
- [x] Feature interactions (BMI × Age) show interesting patterns
- [x] Data types are consistent (all float64) and value ranges look valid, but 24,206 duplicate rows (about 9.5% of the data) need a decision before modeling

### Next Week Preparation
- [x] Identified preprocessing steps needed (scaling, outlier handling, feature engineering)
- [x] Determined class imbalance handling strategy (SMOTE, class weights)
- [x] Selected key features for model development based on correlation analysis
- [x] Planned validation strategy with stratified cross-validation
- [x] Ready to implement feature engineering and model training in Week 2
