# HealthScope - Exploratory Data Analysis
## Dataset 1: Diabetes (Pima Indians)

**Objective**: Understand the diabetes dataset structure, identify patterns, and document data quality issues.

**Dataset**: Pima Indians Diabetes Dataset from UCI ML Repository
- **Rows**: 768
- **Columns**: 9 (8 features + 1 target)
- **Target**: Outcome (0 = No diabetes, 1 = Diabetes)

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully!")
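
The plotting cells in this notebook save figures to `../reports/figures/`, and `plt.savefig` raises an error when the target directory does not exist. A minimal sketch of a guard follows; the helper name `ensure_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if needed; return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling `ensure_dir('../reports/figures')` once before the first `plt.savefig` is enough; the call is safe to repeat because `exist_ok=True` ignores an existing directory.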

## 2. Load Dataset

In [None]:
# Load diabetes dataset
df = pd.read_csv('../data/raw/diabetes.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
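
The header of this notebook documents 768 rows and 9 columns with `Outcome` as a 0/1 target. Checking the loaded frame against those numbers can catch a wrong or truncated file early. The sketch below uses a hypothetical helper, `check_dataset`, and a tiny made-up frame in place of the real CSV; the `Glucose` column name is an assumption.

```python
import pandas as pd


def check_dataset(frame, expected_shape, target='Outcome'):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if frame.shape != expected_shape:
        problems.append(f"expected shape {expected_shape}, got {frame.shape}")
    if target not in frame.columns:
        problems.append(f"target column '{target}' not found")
    elif not set(frame[target].dropna().unique()) <= {0, 1}:
        problems.append(f"target column '{target}' has values other than 0 and 1")
    return problems


# Made-up frame for illustration only; with the real data the call would be
# check_dataset(df, expected_shape=(768, 9))
toy = pd.DataFrame({'Glucose': [148, 85, 183], 'Outcome': [1, 0, 1]})
print(check_dataset(toy, expected_shape=(3, 2)))
```

Returning a list instead of raising keeps the check usable in a notebook, where seeing all problems at once is usually more helpful than stopping at the first.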

## 3. Initial Data Inspection

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Display last few rows
print("Last 5 rows:")
df.tail()

In [None]:
# Display a random sample (fixed seed so the same rows appear on every run)
print("Random 10 rows:")
df.sample(10, random_state=42)

## 4. Data Types and Info

In [None]:
# Display data types and non-null counts
print("Dataset Info:")
df.info()

In [None]:
# Display column names
print("Column Names:")
print(df.columns.tolist())

## 5. Missing Values Analysis

In [None]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_pct = missing / len(df) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})

# Print the per-column table only when there is something to show
if missing.sum() == 0:
    print("\n✓ No missing values found!")
else:
    print(missing_df[missing_df['Missing Count'] > 0])
    print(f"\n⚠ Total missing values: {missing.sum()}")

In [None]:
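# Check for zero values. In medical datasets a zero in a physiological
# measurement often stands in for a missing reading. The target is excluded
# because 0 is a valid class label there; zeros in count-type features can
# also be legitimate and need a case-by-case look.
print("Zero Values (potential missing data):")
zero_counts = (df.drop(columns='Outcome') == 0).sum()
zero_pct = zero_counts / len(df) * 100

zero_df = pd.DataFrame({
    'Zero Count': zero_counts,
    'Percentage': zero_pct
})

print(zero_df[zero_df['Zero Count'] > 0])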
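
If the zero counts point to placeholder values, one common approach is to convert them to `NaN` so that pandas statistics skip them. The sketch below assumes the column names found in the widely circulated version of this CSV (`Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`), which this notebook does not list; adjust the list to whatever `df.columns` shows. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed names of measurement columns where a zero is not physiologically plausible
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']


def zeros_to_nan(frame, columns=ZERO_AS_MISSING):
    """Return a copy in which zeros in the given columns are replaced by NaN."""
    present = [c for c in columns if c in frame.columns]
    cleaned = frame.copy()
    cleaned[present] = cleaned[present].replace(0, np.nan)
    return cleaned


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Pregnancies': [0, 2, 1],   # zero is a legitimate count here
    'Glucose': [0, 120, 95],
    'BMI': [31.2, 0.0, 26.4],
    'Outcome': [0, 1, 0],
})
cleaned = zeros_to_nan(toy)
print(cleaned.isnull().sum())
```

Working on a copy leaves the raw frame untouched, so the original zero counts stay available for the data quality notes.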

## 6. Descriptive Statistics

In [None]:
# Display descriptive statistics
print("Descriptive Statistics:")
df.describe()

In [None]:
# Display statistics for target variable
print("Target Variable (Outcome) Distribution:")
print(df['Outcome'].value_counts())
print("\nPercentage:")
print(df['Outcome'].value_counts(normalize=True) * 100)

## 7. Distribution Plots

In [None]:
# Plot distributions for all numerical features
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Distribution of Features', fontsize=16, y=1.00)

for idx, col in enumerate(df.columns):
    row = idx // 3
    col_idx = idx % 3
    
    axes[row, col_idx].hist(df[col], bins=30, edgecolor='black', alpha=0.7)
    axes[row, col_idx].set_title(col)
    axes[row, col_idx].set_xlabel('Value')
    axes[row, col_idx].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('../reports/figures/diabetes_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Distribution plots saved to reports/figures/diabetes_distributions.png")

In [None]:
# Box plots to identify outliers
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Box Plots - Outlier Detection', fontsize=16, y=1.00)

for idx, col in enumerate(df.columns):
    row = idx // 3
    col_idx = idx % 3
    
    axes[row, col_idx].boxplot(df[col].dropna())
    axes[row, col_idx].set_title(col)
    axes[row, col_idx].set_ylabel('Value')

plt.tight_layout()
plt.savefig('../reports/figures/diabetes_boxplots.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Box plots saved to reports/figures/diabetes_boxplots.png")
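
Box plots show outliers visually, but the "Document outliers identified" item later in this notebook is easier to fill in with counts. The sketch below applies the 1.5 × IQR rule, the same rule matplotlib uses by default to place box plot whiskers. The helper name `iqr_outlier_counts` and the toy data are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count values per column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()


# Made-up values: column A contains one extreme point, column B none
toy = pd.DataFrame({
    'A': [10, 11, 12, 13, 14, 15, 100],
    'B': [1, 2, 3, 4, 5, 6, 7],
})
print(iqr_outlier_counts(toy))
```

On the real data, `iqr_outlier_counts(df.drop(columns='Outcome'))` would give one count per feature. A flagged value is only a candidate for review, not necessarily an error.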

## 8. Correlation Analysis

In [None]:
# Calculate correlation matrix
correlation_matrix = df.corr()

# Display correlation with target
print("Correlation with Target (Outcome):")
target_corr = correlation_matrix['Outcome'].sort_values(ascending=False)
print(target_corr)
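
Sorting by signed correlation puts strong negative relationships at the bottom of the list and keeps the target's own self-correlation of 1.0 at the top. A possible alternative is to drop the target and rank by absolute value, as sketched below. The helper name `rank_by_target_correlation` and the toy columns are hypothetical.

```python
import pandas as pd


def rank_by_target_correlation(frame, target='Outcome'):
    """Return feature-target Pearson correlations, strongest absolute value first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up data for illustration only
toy = pd.DataFrame({
    'up': [1, 2, 3, 4],       # rises with the target
    'down': [8, 6, 5, 1],     # falls as the target rises
    'flat': [3, 1, 3, 1],     # unrelated to the target
    'Outcome': [0, 0, 1, 1],
})
ranked = rank_by_target_correlation(toy)
print(ranked)
```

The signs are preserved in the result, so a negative relationship is still visible even though the ordering ignores direction.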

In [None]:
# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Diabetes Dataset', fontsize=14, pad=20)
plt.tight_layout()
plt.savefig('../reports/figures/diabetes_correlation.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Correlation heatmap saved to reports/figures/diabetes_correlation.png")

## 9. Feature Distributions by Target

In [None]:
# Compare feature distributions for diabetes vs no diabetes
features = [col for col in df.columns if col != 'Outcome']

fig, axes = plt.subplots(4, 2, figsize=(15, 16))
fig.suptitle('Feature Distributions by Outcome', fontsize=16, y=1.00)

for idx, feature in enumerate(features):
    row = idx // 2
    col = idx % 2
    
    # Use shared bin edges so the two histograms are directly comparable
    bins = np.histogram_bin_edges(df[feature], bins=20)

    # Plot for no diabetes (0)
    axes[row, col].hist(df.loc[df['Outcome'] == 0, feature], bins=bins, alpha=0.5,
                        label='No Diabetes', color='green', edgecolor='black')
    # Plot for diabetes (1)
    axes[row, col].hist(df.loc[df['Outcome'] == 1, feature], bins=bins, alpha=0.5,
                        label='Diabetes', color='red', edgecolor='black')
    
    axes[row, col].set_title(feature)
    axes[row, col].set_xlabel('Value')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()

plt.tight_layout()
plt.savefig('../reports/figures/diabetes_by_outcome.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Outcome comparison plots saved to reports/figures/diabetes_by_outcome.png")

## 10. Key Insights & Findings

### Data Quality Issues:
- Document any missing values
- Document zero values that might represent missing data
- Document outliers identified

### Feature Insights:
- Which features have strongest correlation with diabetes?
- Which features show clear separation between diabetes/no diabetes?
- Are there any unexpected patterns?

### Next Steps:
- Data cleaning strategy (Day 4)
- Feature engineering ideas (Day 5)
- Model selection considerations

In [None]:
# Summary statistics by outcome
print("Summary Statistics by Outcome:")
print("\nNo Diabetes (Outcome = 0):")
print(df[df['Outcome'] == 0].describe())
print("\nDiabetes (Outcome = 1):")
print(df[df['Outcome'] == 1].describe())
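
Two separate `describe()` tables are hard to compare by eye. One compact alternative is to put the per-class means side by side with their difference, as in the sketch below. It assumes a binary target; the helper name `compare_means_by_outcome`, the feature names, and the toy values are illustrative.

```python
import pandas as pd


def compare_means_by_outcome(frame, target='Outcome'):
    """Return per-class feature means side by side, plus their difference."""
    means = frame.groupby(target).mean().T
    means.columns = [f"mean_{target}_{label}" for label in means.columns]
    # Difference is (second class) minus (first class); assumes exactly two classes
    means['difference'] = means.iloc[:, 1] - means.iloc[:, 0]
    return means


# Made-up rows for illustration only
toy = pd.DataFrame({
    'Glucose': [90, 100, 150, 170],
    'BMI': [24.0, 26.0, 33.0, 35.0],
    'Outcome': [0, 0, 1, 1],
})
comparison = compare_means_by_outcome(toy)
print(comparison)
```

Raw differences depend on each feature's scale, so they are useful for reading one row at a time rather than for ranking features against each other.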

## 11. Save Exploration Summary

In [None]:
# Create exploration summary
summary = {
    'Dataset': 'Diabetes (Pima Indians)',
    'Total Rows': len(df),
    'Total Columns': len(df.columns),
    'Missing Values': df.isnull().sum().sum(),
    'Duplicate Rows': df.duplicated().sum(),
    'Target Distribution': df['Outcome'].value_counts().to_dict(),
    # Drop the target itself: its self-correlation of 1.0 would always rank first
    'Top Correlated Features': target_corr.drop('Outcome').head(4).to_dict()
}

print("\n" + "="*50)
print("EXPLORATION SUMMARY")
print("="*50)
for key, value in summary.items():
    print(f"{key}: {value}")
print("="*50)

print("\n✓ Day 2 EDA Complete!")
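
The cell above prints the summary but does not write it anywhere. A minimal sketch of persisting it as JSON follows; the helper names `to_builtin` and `save_summary` are hypothetical, and the conversion step is there because `json` cannot serialize NumPy scalar types directly.

```python
import json
from pathlib import Path

import numpy as np


def to_builtin(value):
    """Recursively convert NumPy scalars and dict keys into JSON-friendly built-ins."""
    if isinstance(value, dict):
        return {str(k): to_builtin(v) for k, v in value.items()}
    if isinstance(value, np.generic):
        return value.item()
    return value


def save_summary(summary, path):
    """Write the summary dict to a JSON file, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(to_builtin(summary), indent=2), encoding='utf-8')
    return path
```

A call such as `save_summary(summary, '../reports/diabetes_eda_summary.json')` would store the result next to the figures; that file name is an assumption, not a path used elsewhere in this notebook.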