### Visualizations

Chesta Dewangan & Himanshu Dongre

Interactive visualizations that provide additional insight:
1. Feature Explorer
2. Parallel plot
3. Scatter plot
4. Trend Chart
5. Patient Profile Simulation
6. CKD Patient Profile


For this section we wanted interactive visualizations, and we chose Vega-Lite, used here through its Python API, Altair. Vega-Lite is a high-level visualization grammar that describes charts as JSON specifications. Each visualization below will later be linked in some way to the final visualization we plan to deliver in the final submission.
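
As a rough illustration of what such a JSON specification looks like, the sketch below writes a small bar-chart spec by hand as a Python dictionary. The rows in `toy_rows` are made up for the example, and the names `toy_rows`, `spec`, and `spec_json` are illustrative. Altair generates specifications of this general shape from Python code.

```python
import json

# Toy rows for illustration only; these are not values from the CKD dataset.
toy_rows = [
    {"classification": "ckd", "value": 2.0},
    {"classification": "ckd", "value": 4.0},
    {"classification": "notckd", "value": 1.0},
]

# A Vega-Lite specification is plain JSON: data, a mark type, and encodings
# that map data fields to visual channels.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": toy_rows},
    "mark": "bar",
    "encoding": {
        "x": {"field": "classification", "type": "nominal"},
        "y": {"aggregate": "mean", "field": "value", "type": "quantitative"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

For a chart built with Altair, `chart.to_json()` returns the corresponding specification.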

In [40]:
import pandas as pd
import altair as alt
from joblib import load
from sklearn.preprocessing import LabelEncoder

In [31]:
# Load the (preprocessed) data
df = pd.read_csv("ckd_preprocessed.csv")

#### Feature Explorer
**Hypothesis**: Certain features such as serum creatinine and albumin show significantly different average values between CKD and non-CKD patients.

**Why investigate?**: The EDA showed that these features are tied to kidney function. Understanding which features consistently differ can help us identify early indicators of CKD.

The feature explorer allows a quick comparison of average values across features using a dropdown. Once a selection is made, the visualization updates.

![Feature Explorer](vega_viz_images/1.png)
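
The bars in the feature explorer are per-class means, so the same numbers can be checked outside the chart. The following is a minimal sketch on made-up values (the `toy` frame and `class_means` name are illustrative, not part of the notebook), assuming the same `classification` column and feature codes as the preprocessed data.

```python
import pandas as pd

# Toy values for illustration only; not taken from ckd_preprocessed.csv.
toy = pd.DataFrame({
    "classification": ["ckd", "ckd", "notckd", "notckd"],
    "sc": [3.0, 5.0, 1.0, 1.0],
    "al": [2.0, 4.0, 0.0, 0.0],
})

# Mean of every feature per class: each column of this table corresponds to
# the pair of bars shown for one dropdown choice.
class_means = toy.groupby("classification")[["sc", "al"]].mean()
print(class_means)
```

Running the same `groupby` on the real data frame would give a table to cross-check against the bar heights.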

In [4]:
# Map column codes to readable feature names so the dropdown is understandable
feature_label_map = {
    'sc': 'Serum Creatinine',
    'al': 'Albumin',
    'sg': 'Specific Gravity',
    'hemo': 'Hemoglobin',
    'rc': 'Red Blood Cell Count',
    'pcv': 'Packed Cell Volume'
}

label_to_col = {v: k for k, v in feature_label_map.items()}

df_filtered = df[list(feature_label_map.keys()) + ['classification']].copy()
df_filtered['classification'] = df_filtered['classification'].astype(str)

df_long = df_filtered.melt(id_vars='classification', var_name='feature', value_name='value')
df_long = df_long.dropna()
df_long['label'] = df_long['feature'].map(feature_label_map)

dropdown = alt.binding_select(options=list(label_to_col.keys()), name='Feature: ')
selector = alt.param('FeatureSelector', bind=dropdown, value='Albumin')

bar_chart = alt.Chart(df_long).add_params(
    selector
).transform_filter(
    alt.datum.label == selector
).mark_bar().encode(
    x=alt.X('classification:N', title='CKD Class'),
    y=alt.Y('mean(value):Q', title='Average Value'),
    color=alt.Color('classification:N', title='CKD Classification',
                    scale=alt.Scale(domain=['ckd', 'notckd'],
                                    range=['#e41a1c', '#4daf4a']))
).properties(
    width=400,
    height=300,
    title="Average Value of Selected Feature by CKD Class"
)

bar_chart

#### Parallel plot

**Hypothesis**: Individuals with CKD may share similar patterns across multiple features, as may non-CKD individuals.

**Why investigate?**: CKD diagnosis may depend on several features at once. A multivariate view helps us understand how those features interact.

The parallel plot can help us see clusters or patterns across multiple features by brushing a range on the value axis.

![Parallel plot](vega_viz_images/21.png)
![Parallel plot after brushing](vega_viz_images/22.png)
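
A parallel plot needs the data in long format, with one row per patient and feature. The sketch below shows that reshaping step on made-up values, so the resulting structure is easy to inspect. The `toy` and `toy_long` names are illustrative.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "class_label": ["CKD", "Not CKD"],
    "sc": [4.0, 1.0],
    "hemo": [9.5, 15.0],
    "pcv": [29.0, 46.0],
})

# One row per (patient, feature): 'index' identifies the patient, so the
# chart can draw one line per patient across the feature axis.
toy_long = toy.reset_index().melt(
    id_vars=["index", "class_label"],
    var_name="feature",
    value_name="value",
)
print(toy_long)
```

Because all features share one value axis, features with large ranges dominate the vertical extent of the plot. This is worth keeping in mind when reading patterns from it.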

In [5]:
features = ['sc', 'al', 'sg', 'hemo', 'rc', 'pcv']

df_parallel = df[features].dropna().copy()

df_parallel['class_label'] = df['classification'].map({
    'ckd': 'CKD',
    'notckd': 'Not CKD'
})

df_long = df_parallel.reset_index().melt(
    id_vars=['index', 'class_label'],
    var_name='feature',
    value_name='value'
)

brush = alt.selection_interval(encodings=['y'])

parallel_plot = alt.Chart(df_long).mark_line().encode(
    x=alt.X('feature:N', title='Feature'),
    y=alt.Y('value:Q', title='Value', scale=alt.Scale(zero=False)),
    color=alt.Color('class_label:N', title='CKD Classification',
                    scale=alt.Scale(domain=['CKD', 'Not CKD'],
                                    range=['#e41a1c', '#4daf4a'])),
    detail='index:N',
    opacity=alt.condition(brush, alt.value(1), alt.value(0.05)),
    tooltip=['feature', 'value', 'class_label']
).add_params(
    brush
).properties(
    width=600,
    height=400,
    title="Parallel Plot"
)

parallel_plot


#### Scatter plot

**Hypothesis**: Feature pairs such as packed cell volume vs. serum creatinine show distinct groupings between CKD and non-CKD patients.

**Why investigate?**: Having looked at the multivariate pattern above, we also want to examine individual feature pairs and see which regions CKD and non-CKD patients occupy.

The scatter plot supports zooming and helps explore local patterns such as clusters, outliers, and linear or non-linear relationships between pairs of features.

![Scatter plot](vega_viz_images/3.png)

In [6]:
df_scatter = df[features + ['classification']].dropna().copy()
df_scatter['class_label'] = df_scatter['classification'].map({'ckd': 'CKD', 'notckd': 'Not CKD'})

dropdown_x = alt.binding_select(options=features, name='X-Axis Feature:')
dropdown_y = alt.binding_select(options=features, name='Y-Axis Feature:')

x_select = alt.param('xFeature', bind=dropdown_x, value='sc')
y_select = alt.param('yFeature', bind=dropdown_y, value='hemo')

scatter_plot = alt.Chart(df_scatter).add_params(
    x_select,
    y_select
).transform_calculate(
    x="datum[xFeature]",
    y="datum[yFeature]"
).mark_circle(size=60).encode(
    x=alt.X('x:Q', title=None),
    y=alt.Y('y:Q', title=None),
    color=alt.Color('class_label:N', title='CKD Classification',
                    scale=alt.Scale(domain=['CKD', 'Not CKD'],
                                    range=['#e41a1c', '#4daf4a'])),
    tooltip=features + ['classification']
).properties(
    width=600,
    height=450,
    title='Scatter Plot of the Two Selected Features'
).interactive()

scatter_plot

#### Trend Chart

**Hypothesis**: For CKD patients, albumin levels change more rapidly with age than for non-CKD patients.

**Why investigate?**: Age could be a major risk factor, and observing how it relates to the strong predictors can help identify early warnings or thresholds.

The trend chart plots the mean value of each feature against age, which makes the two classes easy to compare. Each individual chart can be zoomed to inspect trends closely.

![Trend chart](vega_viz_images/4.png)
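
Each point on a trend line is the mean of one feature over all patients of the same age and class. A minimal sketch of that aggregation on made-up values is shown here. The `toy` frame and the `mean_al` column name are illustrative.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "age": [50, 50, 60, 50, 60],
    "class_label": ["CKD", "CKD", "CKD", "Not CKD", "Not CKD"],
    "al": [2.0, 4.0, 3.0, 0.0, 1.0],
})

# Equivalent of encoding y as mean(al) with age on x and class on color:
# one averaged point per (class, age) pair.
trend = (
    toy.groupby(["class_label", "age"], as_index=False)["al"]
    .mean()
    .rename(columns={"al": "mean_al"})
)
print(trend)
```

Ages with only a few patients yield means based on very small samples, so sharp jumps in a trend line may reflect sample size rather than a real change.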

In [7]:
df_trend = df[features + ['age', 'classification']].dropna().copy()
df_trend['class_label'] = df_trend['classification'].map({'ckd': 'CKD', 'notckd': 'Not CKD'})

charts = []

for feature in features:
    chart = alt.Chart(df_trend).mark_line(point=True).encode(
        x=alt.X('age:Q', title='Age'),
        y=alt.Y(f'mean({feature}):Q', title=f'Avg {feature.upper()}'),
        color=alt.Color('class_label:N', title='CKD Classification',
                    scale=alt.Scale(domain=['CKD', 'Not CKD'],
                                    range=['#e41a1c', '#4daf4a'])),
    ).properties(
        width=300,
        height=250,
        title=f'Age vs {feature.upper()}'
    ).interactive()
    charts.append(chart)

# Arrange the per-feature charts in a grid with two charts per row
final_trend_chart = alt.vconcat(
    *[alt.hconcat(*charts[i:i+2]) for i in range(0, len(charts), 2)]
)

final_trend_chart


#### Patient Profile Simulation

**Hypothesis**: Adjusting simulated cases (like high age and low pcv) will lead to higher predicted CKD risk.

**Why investigate?**: Simulation can reveal if individuals have a higher risk based on their profile.

The patient profile simulation is still limited because it is not yet connected to an ML model that would predict the risk properly. As a stand-in, we match the slider values against existing cases within a tolerance and report the share of matching cases that have CKD. Using the sliders to manipulate values and create scenarios gives immediate feedback on how high or low the CKD risk is.

![Patient profile simulation](vega_viz_images/5.png)
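
The tolerance-based matching can be sketched in plain Python as follows. The records are made up, and `simulated_risk` is a hypothetical helper written for this illustration, not a function from the notebook. It assumes the same rule as the chart: a record counts as similar when every feature lies within the tolerance of the chosen profile.

```python
# Toy records for illustration only; not rows from the CKD dataset.
patients = [
    {"age": 60, "pcv": 30, "CKD": 1},
    {"age": 62, "pcv": 32, "CKD": 1},
    {"age": 61, "pcv": 31, "CKD": 0},
    {"age": 30, "pcv": 45, "CKD": 0},
]

def simulated_risk(profile, records, tolerance=3):
    """Share of CKD cases among records within `tolerance` of the profile."""
    matches = [
        r for r in records
        if all(abs(r[col] - target) <= tolerance for col, target in profile.items())
    ]
    if not matches:  # no similar patients: use the same fallback value of 0
        return 0.0, 0
    ckd_count = sum(r["CKD"] for r in matches)
    return ckd_count / len(matches), len(matches)

risk_value, n_similar = simulated_risk({"age": 61, "pcv": 31}, patients)
print(risk_value, n_similar)
```

Because one tolerance is shared by all features, the match is stricter for features with a wide range than for features with a narrow one. A per-feature tolerance might be worth considering once the ML model is connected.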

In [47]:
profile_features = ['age','sc', 'al', 'hemo', 'rc', 'pcv']

df_sim = df[profile_features + ['classification']].dropna().copy()
df_sim['CKD'] = df_sim['classification'].map({'ckd': 1, 'notckd': 0})

params = {}
bindings = []
for col in profile_features:
    min_val = float(df_sim[col].min())
    max_val = float(df_sim[col].max())
    bind = alt.binding_range(min=min_val, max=max_val, step=0.5, name=f"{col.upper()}: ")
    param = alt.param(name=f"{col}_param", bind=bind, value=int((min_val + max_val) / 2))
    params[col] = param
    bindings.append(param)

# Maximum absolute difference allowed per feature when matching existing patients
tolerance = 3

conditions = [f"abs(datum.{col} - {col}_param) <= {tolerance}" for col in profile_features]
filter_expr = " && ".join(conditions)

risk = alt.Chart(df_sim).transform_filter(
    filter_expr
).transform_aggregate(
    total='count()',
    ckd_count='sum(CKD)'
).transform_calculate(
    risk='datum.total > 0 ? datum.ckd_count / datum.total : 0'  # Guard against division by zero
).transform_calculate(
    dummy_x='0',
    dummy_y='0'
).mark_circle(size=10000).encode(
    x=alt.X('dummy_x:Q', axis=None),
    y=alt.Y('dummy_y:Q', axis=None),
    color=alt.Color('risk:Q', scale=alt.Scale(scheme='reds', domain=[0, 1]), title='CKD Risk'),
    tooltip=[
        alt.Tooltip('ckd_count:Q', title='CKD Patients'),
        alt.Tooltip('total:Q', title='Similar Patients'),
        alt.Tooltip('risk:Q', format='.0%', title='CKD Risk')
    ]
).add_params(
    *bindings
).properties(
    width=200,
    height=200,
    title='Simulated CKD Risk'
).configure_view(
    stroke=None
)

risk

#### CKD Patient Profile

Unlike the other five visualizations, this one was not created for a specific hypothesis. Instead, it is envisioned as a supportive, diagnostic tool for domain experts (e.g., doctors or specialists) to see each patient's profile and the distribution of the features.

This tool is especially useful in combination with the patient profile simulation (previous chart), as the expert can:

- Review the patient's previous report (feature values).
- Simulate potential or new report values using the simulation tool.
- Compare the two, based on the risk shown and the progress made, to provide treatment effectively.

By enabling this multi-level insight, the tool can lead to more informed medical decisions and interventions.

![CKD patient profile](vega_viz_images/6.png)

In [11]:
features = ['age','sc', 'al', 'sg', 'hemo', 'rc', 'pcv']

df_ckd_only = df[df['classification'] == 'ckd'].dropna(subset=features).copy()
df_ckd_only = df_ckd_only.reset_index(drop=True)
df_ckd_only['patient_id'] = df_ckd_only.index.astype(str)

df_long = df_ckd_only[['patient_id'] + features].melt(id_vars='patient_id', 
                                                      var_name='feature', 
                                                      value_name='value')

dropdown = alt.binding_select(options=df_ckd_only['patient_id'].tolist(), name='CKD Patient Number: ')
selector = alt.param(name='SelectedPatient', bind=dropdown, value='0')

bar_chart = alt.Chart(df_long).transform_filter(
    alt.datum.patient_id == selector
).mark_bar().encode(
    x=alt.X('feature:N', title='Feature'),
    y=alt.Y('value:Q', title='Value'),
    color=alt.Color('feature:N', legend=None, scale=alt.Scale(scheme='set2')),
    tooltip=[
        alt.Tooltip('feature:N', title='Feature'),
        alt.Tooltip('value:Q', title='Value', format='.2f')
    ]
).add_params(
    selector
).properties(
    width=450,
    height=300,
    title='CKD Patient Profile'
)

bar_chart