# Exploratory Data Analysis (EDA) of Clalit Health Dataset


In [None]:
# %pip install plotly
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

# Load data
df = pd.read_csv('data/raw/Prediction home assignment data.csv')
df.head()

| Unnamed: 0 | ID | Disease | Age | Sex | Blood Pressure | Sport Activity Level | BMI | Alcohol Consumption | Cholesterol Level | Family History of Disease | Medication Use | Occupation Type | Sleep Hours per Night | Stress Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9292 | Diabetes | 65 | Male | 8.078688 | 1.226998 | 8.654499 | 6.616191 | 9.809791 | Yes | Yes | Manual | 4.502498 | 3.272403 |
| 1 | 8088 | Coronary Heart Disease | 68 | Male | 7.561382 | 2.399566 | 7.263741 | 5.461729 | 8.256561 | Yes | Yes | Sedentary | 8.834171 | 6.614335 |
| 2 | 4976 | Diabetes | 61 | Male | 9.190168 | 2.589369 | 7.039525 | 4.674114 | 9.163775 | Yes | Yes | Manual | 7.076796 | 4.784249 |
| 3 | 4376 | Coronary Heart Disease | 63 | Male | 9.837562 | 1.762751 | 8.356391 | 7.336146 | 7.039061 | Yes | Yes | Active | 8.308582 | 5.863965 |
| 4 | 3227 | Diabetes | 83 | Male | 9.469443 | 2.559051 | 7.257912 | 4.782407 | 9.860855 | Yes | Yes | Manual | 2.056106 | 5.610206 |


## Disease Distribution

In [19]:
fig = px.histogram(df, y='Disease', color='Disease', 
                   category_orders={'Disease': df['Disease'].value_counts().index.tolist()},
                   title='Disease Distribution',
                   labels={'count': 'Number of Records'})
fig.update_traces(marker_line_width=1.5, opacity=0.85)
fig.update_layout(showlegend=False)
fig.show()

The distribution shows that Autoimmune Disorder is relatively rare compared with the other classes. This class imbalance is important to recognize, because it can bias a predictive model toward the more frequent classes and reduce its performance on the rare one.
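To put numbers on the imbalance, the class counts and shares can be tabulated directly. The following is a minimal sketch; the helper name `class_balance` and the toy labels are illustrative assumptions, not part of the dataset. In the notebook, `df['Disease']` would be passed instead.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the record count and percentage share per class, most frequent first."""
    counts = labels.value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share})

# Toy labels for illustration only (hypothetical, not the real class counts).
toy_labels = pd.Series(
    ["Diabetes"] * 6 + ["Anxiety"] * 3 + ["Autoimmune Disorder"] * 1
)
balance = class_balance(toy_labels)
print(balance)
```

A very small `share_pct` for one class is the signal to consider resampling or class weights later on.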

## Age Distribution, Overall and by Disease

In [20]:
# Distribution
fig = px.histogram(df, x='Age', nbins=30, marginal='box',
                   title='Age Distribution (with Outliers)')
fig.show()

# Age by disease
fig = px.box(df, x='Disease', y='Age', color='Disease', 
             title='Age Distribution by Disease',
             category_orders={'Disease': df['Disease'].value_counts().index.tolist()})
fig.show()

The age distribution in the dataset is skewed toward adults and older individuals, with a few notable outliers at both extremes.
Chronic diseases such as Diabetes and Coronary Heart Disease are primarily observed in older patients, as reflected by their higher median ages. In contrast, mental health conditions like Anxiety and Depression are more common among younger individuals. Identifying these demographic patterns is important for effective risk grouping and targeted interventions.

Additionally, several diagnoses contain extreme outliers, such as unrealistically low (including negative) or excessively high ages. These likely result from data entry mistakes or rare edge cases and should be reviewed before further analysis.
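One way to review these records is to flag ages outside a plausible range before deciding whether to drop or correct them. This is a sketch only: the bounds `MIN_AGE` and `MAX_AGE`, the helper `flag_implausible_ages`, and the toy rows are assumptions chosen for illustration.

```python
import pandas as pd

# Assumed plausibility bounds; adjust them to the study population.
MIN_AGE, MAX_AGE = 0, 120

def flag_implausible_ages(frame, age_col="Age", low=MIN_AGE, high=MAX_AGE):
    """Return a boolean mask that is True where the age is missing or outside [low, high]."""
    age = pd.to_numeric(frame[age_col], errors="coerce")
    return age.isna() | (age < low) | (age > high)

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Disease": ["Diabetes", "Anxiety", "Depression", "Healthy"],
    "Age": [65, -3, 41, 250],
})
mask = flag_implausible_ages(toy)
print(toy[mask])      # rows to review
clean = toy[~mask]    # rows kept for further analysis
```

Inspecting the flagged rows first, rather than dropping them immediately, makes it easier to tell data entry errors apart from rare but valid cases.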

## Sex Distribution, Overall and by Disease

In [6]:
# Overall
fig = px.pie(df, names='Sex', title='Sex Distribution', hole=0.45, color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

# By disease (stacked bar)
sex_disease = df.groupby(['Disease', 'Sex']).size().reset_index(name='count')
fig = px.bar(sex_disease, x='Disease', y='count', color='Sex',
             title='Sex by Disease', barmode='stack', 
             category_orders={'Disease': df['Disease'].value_counts().index.tolist()})
fig.show()

Both male and female patients are well represented in the dataset, though a small fraction of records have missing or unspecified sex.

The "Sex by Disease" plot reveals that certain diseases in the dataset are represented by only one sex, or show extreme sex imbalances. This is unusual and may indicate issues such as data entry errors, incomplete data collection, or systematic biases in how cases were recorded. It is important to investigate whether this pattern reflects genuine epidemiological trends, or if it is a result of artifacts in the dataset.

For now, we will not use the Sex column for prediction.
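The imbalance can also be checked numerically with a row-normalized cross-tabulation. The sketch below assumes the column names `Disease` and `Sex` from the dataset; the helper `sex_share_by_disease` and the toy rows are hypothetical.

```python
import pandas as pd

def sex_share_by_disease(frame, disease_col="Disease", sex_col="Sex"):
    """Return the share of each sex within each disease, with missing sex as its own category."""
    sex = frame[sex_col].fillna("Missing")
    return pd.crosstab(frame[disease_col], sex, normalize="index").round(2)

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Disease": ["Diabetes", "Diabetes", "Anxiety", "Anxiety", "Anxiety", "Healthy"],
    "Sex": ["Male", "Male", "Female", "Male", None, "Female"],
})
shares = sex_share_by_disease(toy)
print(shares)

# Diseases whose records all belong to a single recorded sex.
only_one_sex = (shares.drop(columns="Missing", errors="ignore") == 1.0).any(axis=1)
single_sex = only_one_sex[only_one_sex].index.tolist()
print(single_sex)
```

Diseases listed in `single_sex` are the ones worth investigating for recording artifacts.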

## BMI and Sport Activity Level by Disease

In [7]:
# BMI
fig = px.violin(df, y='BMI', x='Disease', color='Disease', box=True, points='all',
                title='BMI Distribution by Disease',
                category_orders={'Disease': df['Disease'].value_counts().index.tolist()})
fig.show()

# Sport Activity Level
fig = px.box(df, y='Sport Activity Level', x='Disease', color='Disease',
             title='Sport Activity Level by Disease',
             category_orders={'Disease': df['Disease'].value_counts().index.tolist()})
fig.show()

The "BMI by Disease" plot shows that for most diseases, BMI values appear to have been normalized to a range of 0 to 10. The values for Autoimmune Disorder, however, are on a different scale, which suggests they were not normalized. To keep comparisons across disease groups meaningful, the Autoimmune Disorder BMI values should be brought onto the same scale before modeling.

Additionally, we observe that the normalized BMI values for Coronary Heart Disease and Diabetes are generally higher than those for other diseases, which aligns with the known association between elevated BMI and these conditions.
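The scale mismatch can be confirmed by comparing per-disease BMI ranges, and one possible fix is to rescale the affected group. The sketch below uses min-max scaling to 0 to 10, which is an assumption: the dataset does not document how the other groups were normalized. The helper `rescale_group` and the toy values are hypothetical.

```python
import pandas as pd

def rescale_group(frame, group_col, group, value_col, new_min=0.0, new_max=10.0):
    """Min-max rescale value_col to [new_min, new_max] for the rows of one group only."""
    out = frame.copy()
    out[value_col] = out[value_col].astype(float)
    rows = out[group_col] == group
    vals = out.loc[rows, value_col]
    span = vals.max() - vals.min()
    if span == 0:
        raise ValueError("Cannot rescale a group with a single distinct value.")
    out.loc[rows, value_col] = new_min + (vals - vals.min()) / span * (new_max - new_min)
    return out

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Disease": ["Diabetes", "Diabetes"] + ["Autoimmune Disorder"] * 3,
    "BMI": [7.2, 8.6, 18.0, 24.0, 30.0],
})
print(toy.groupby("Disease")["BMI"].agg(["min", "max"]))  # reveals the scale mismatch

fixed = rescale_group(toy, "Disease", "Autoimmune Disorder", "BMI")
print(fixed)
```

Rescaling a single group in isolation forces it to span the full range, so any real difference in BMI level between that group and the others is lost. If the original transformation can be recovered, applying it is preferable.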

The "Sport Activity Level by Disease" plot reveals that individuals diagnosed with Anxiety exhibit sport activity levels very similar to those found in the Healthy group. This suggests that, in this dataset, anxiety does not appear to be associated with reduced physical activity, unlike certain chronic physical conditions such as Diabetes or Coronary Heart Disease, where lower activity levels are more prominent. This observation may indicate that physical activity alone is not a distinguishing factor for anxiety in this population, or it could reflect underlying behavioral or reporting patterns that warrant further investigation.

## Alcohol Consumption, Cholesterol Level, Sleep Hours per Night and Stress Level by Disease

In [23]:
# Violin plots (with an embedded box plot and all points) for each measure
violin_cols = [
    'Alcohol Consumption',
    'Cholesterol Level',
    'Sleep Hours per Night',
    'Stress Level',
]

for col in violin_cols:
    fig = px.violin(
        df,
        x='Disease',
        y=col,
        box=True,
        points='all',
        color='Disease',
        title=f'{col} Distribution by Disease'
    )
    fig.update_layout(
        yaxis_title=col,
        xaxis_title='Disease',
        template='plotly_white'
    )
    fig.show()

The stress levels for the Anxiety and Depression groups are similar, which makes sense given the overlap in symptoms. Sleep hours differ, however: Anxiety patients tend to sleep less, likely because anxiety can make restful sleep difficult, while those with Depression often sleep more, possibly due to low motivation or fatigue. These patterns align well with what we would expect given the nature of these conditions.
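The visual comparison can be backed by per-disease medians. This is a minimal sketch assuming the dataset's column names; the helper `median_by_disease` and the toy values are hypothetical and do not reproduce the real measurements.

```python
import pandas as pd

def median_by_disease(frame, cols, disease_col="Disease"):
    """Return the median of each requested column within each disease group."""
    return frame.groupby(disease_col)[cols].median()

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Disease": ["Anxiety"] * 3 + ["Depression"] * 3,
    "Stress Level": [8.0, 7.0, 9.0, 8.0, 7.5, 8.5],
    "Sleep Hours per Night": [4.0, 5.0, 4.5, 8.0, 9.0, 8.5],
})
medians = median_by_disease(toy, ["Stress Level", "Sleep Hours per Night"])
print(medians)
```

In the notebook, the same call on `df` would show whether the two groups share a similar median stress level while differing in median sleep hours.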

## Family History, Medication Use, and Occupation Type

In [8]:
# Family History
fig = px.histogram(df, x='Family History of Disease', color='Disease',
                   title='Family History by Disease', barmode='group')
fig.show()

# Medication Use
fig = px.histogram(df, x='Medication Use', color='Disease',
                   title='Medication Use by Disease', barmode='group')
fig.show()

# Occupation Type
fig = px.histogram(df, x='Occupation Type', color='Disease',
                   title='Occupation Type by Disease', barmode='group')
fig.show()

## Correlation Heatmap

In [21]:
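import plotly.figure_factory as ff

# Keep numeric features only; drop the row-index and identifier columns,
# which are numeric but carry no meaningful correlation
numeric_cols = df.select_dtypes(include=np.number).columns.drop(['Unnamed: 0', 'ID'], errors='ignore')
corrs = df[numeric_cols].corr().round(1)  # Round to 1 decimal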

# Prepare annotations as strings for the heatmap
z_text = [[str(val) for val in row] for row in corrs.values]

fig = ff.create_annotated_heatmap(
    z=corrs.values,
    x=list(corrs.columns),
    y=list(corrs.columns),
    annotation_text=z_text,
    colorscale='Viridis',
    showscale=True
)
fig.update_layout(title_text='Correlation Heatmap (rounded to 1 decimal)')
fig.show()

## EDA Cleanup and Preprocessing Summary

Based on our exploratory data analysis (EDA), the following preprocessing steps need to be performed before modeling:

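1. **Remove Age Outliers:** Drop or correct records with negative or implausibly high ages.

2. **Exclude the Sex Column:** Several diseases appear for only one sex, which may be a data artifact.

3. **Upsample Autoimmune Disorder Cases:** This class is rare relative to the others.

4. **Normalize Autoimmune Disorder BMI Values:** Bring them onto the 0 to 10 scale used by the other groups.

These steps should improve data quality and reduce avoidable bias in the modeling results.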
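Two of the steps above, excluding the Sex column and upsampling the rare class, can be sketched with pandas alone. The helper `upsample_class`, the target size, and the toy rows are assumptions for illustration; random oversampling with replacement is only one of several ways to address imbalance.

```python
import pandas as pd

def upsample_class(frame, label_col, minority, target_n, seed=0):
    """Append rows sampled with replacement until the minority class has target_n rows."""
    minority_rows = frame[frame[label_col] == minority]
    extra_n = target_n - len(minority_rows)
    if minority_rows.empty or extra_n <= 0:
        return frame.copy()
    extra = minority_rows.sample(n=extra_n, replace=True, random_state=seed)
    return pd.concat([frame, extra], ignore_index=True)

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Disease": ["Diabetes"] * 5 + ["Autoimmune Disorder"] * 2,
    "Sex": ["Male", "Female", "Male", "Male", "Female", "Female", "Female"],
    "Age": [65, 61, 83, 70, 58, 44, 39],
})

prepared = toy.drop(columns=["Sex"])  # step 2: exclude the Sex column
prepared = upsample_class(prepared, "Disease", "Autoimmune Disorder", target_n=5)  # step 3
print(prepared["Disease"].value_counts())
```

Upsampling should be applied to the training split only, after the train/test split, so that duplicated rows do not leak into the evaluation data.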

## Summary

Some features in our data separate the classes almost like binary indicators. For example, high sport activity is mostly found in Anxiety or Healthy cases, while high cholesterol is mainly linked to Diabetes, Coronary Heart Disease, or Depression. Because such patterns map naturally onto threshold splits, a tree-based model like XGBoost is a good fit; gradient-boosted trees are also typically among the strongest performers on tabular data tasks.
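To illustrate why threshold splits suit this kind of feature, the class composition on either side of a single cutoff can be tabulated. This is a sketch under assumptions: the helper `split_composition`, the threshold of 5, and the toy rows are hypothetical, and a real tree model would learn its own split points.

```python
import pandas as pd

def split_composition(frame, feature, threshold, label_col="Disease"):
    """Count the classes on each side of a single threshold split on one feature."""
    side = (frame[feature] > threshold).map({True: "above", False: "at_or_below"})
    return pd.crosstab(side, frame[label_col])

# Toy rows for illustration only (hypothetical values).
toy = pd.DataFrame({
    "Sport Activity Level": [8.5, 9.0, 7.9, 2.1, 1.5, 2.8],
    "Disease": ["Healthy", "Anxiety", "Healthy",
                "Diabetes", "Coronary Heart Disease", "Diabetes"],
})
table = split_composition(toy, "Sport Activity Level", threshold=5.0)
print(table)
```

When one side of the split contains only a subset of the classes, as in this toy table, a single tree node already removes much of the ambiguity.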

## Data Improvement Suggestions
- **Temporal Data:** Time series of measurements (for example, trends in blood pressure)
- **Lifestyle Data:** Diet, physical activity trackers, sleep quality (from wearables)
- **Socioeconomic Data:** Income, education, region
- **Clinical History:** More granular medication types
- **Text Data:** Doctor visit notes (NLP features)
- **Lab Results:** More detailed blood/urine test results
- **Data Quality:** Ensure no negative/implausible ages, harmonize categorical values
- **External Data:** Environmental exposures (air quality, pollution, neighborhood factors)