# **Test Value Criterion for Group Characterization**

**Author:** *Xavier Jacome Piñeiros*

**Methodology References:**
*   Group Characterization: Lebart et al. (2000)

**Data source:**
*   Heart disease records for male patients (`heart_disease_male.xls`, 209 observations)
    * The group label is the `disease` column (`positive` / `negative`).

**Objective:**
*   Use the Test Value criterion to characterize groups and identify the features that distinguish each one from the overall sample.

In [1]:
import pandas as pd

In [2]:
# Reading a legacy .xls file with pandas requires the xlrd package
df = pd.read_excel('heart_disease_male.xls')
df.head()

|   | age | chest_pain | rest_bpress | blood_sugar | rest_electro | max_heart_rate | exercice_angina | disease |
|---|---|---|---|---|---|---|---|---|
| 0 | 43 | asympt | 140 | f | normal | 135 | yes | positive |
| 1 | 39 | atyp_angina | 120 | f | normal | 160 | yes | negative |
| 2 | 39 | non_anginal | 160 | t | normal | 160 | no | negative |
| 3 | 42 | non_anginal | 160 | f | normal | 146 | no | negative |
| 4 | 49 | asympt | 140 | f | normal | 130 | no | negative |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              209 non-null    int64 
 1   chest_pain       209 non-null    object
 2   rest_bpress      209 non-null    int64 
 3   blood_sugar      209 non-null    object
 4   rest_electro     209 non-null    object
 5   max_heart_rate   209 non-null    int64 
 6   exercice_angina  209 non-null    object
 7   disease          209 non-null    object
dtypes: int64(3), object(5)
memory usage: 13.2+ KB


## Define Test Value Function

In [4]:
import pandas as pd
import numpy as np

# Test value formulas for a continuous attribute and for one category of a categorical attribute

def test_value_continuous(group_mean, overall_mean, empirical_variance, n_total, n_group):
    numerator = group_mean - overall_mean
    denominator = np.sqrt(((n_total-n_group)/(n_total-1))*(empirical_variance/n_group))
    test_value = numerator / denominator
    return test_value

def test_value_categorical(n_ij, n_j, n_group, n_total):
    expected = (n_group * n_j) / n_total
    numerator = n_ij - expected
    denominator = np.sqrt(((n_total - n_group)/(n_total-1))  * (1 - (n_j / n_total)) * ((n_group * n_j) / n_total))
    test_value = numerator / denominator
    return test_value

# Compute the test value of every input attribute for each group of the target column

def calculate_test_values(df, inputs, target):
    results = []
    n_total = len(df)
    groups = df[target].unique()

    for group in groups:
        df_group = df[df[target] == group]
        n_group = len(df_group)

        for attribute in inputs:
            if pd.api.types.is_numeric_dtype(df[attribute]):
                # Continuous variable: compare the group mean with the overall mean
                group_mean = df_group[attribute].mean()
                overall_mean = df[attribute].mean()
                empirical_variance = df[attribute].var(ddof=1)  # ddof=1 to get the sample variance
                group_empirical_variance = df_group[attribute].var(ddof=1)
                test_value = test_value_continuous(group_mean, overall_mean, empirical_variance, n_total, n_group)
                results.append({
                    'Type': 'continuos',
                    'Attribute': attribute,
                    'Group': group,
                    'Test Value': test_value,
                    'Group Mean': group_mean,
                    'Overall Mean': overall_mean,
                    'Group Standard Deviation': np.sqrt(group_empirical_variance),
                    'Overall Standard Deviation': np.sqrt(empirical_variance)
                })
            else:
                # Categorical variable: one result row per category
                value_counts = df[attribute].value_counts()
                group_value_counts = df_group[attribute].value_counts()

                for category in value_counts.index:
                    n_ij = group_value_counts.get(category, 0)  # count of the category within the group
                    n_j = value_counts[category]  # count of the category in the whole sample

                    test_value = test_value_categorical(n_ij, n_j, n_group, n_total)

                    group_accuracy = n_ij / n_group if n_group != 0 else 0  # share of the group that has the category
                    recall = n_ij / n_j if n_j != 0 else 0  # share of the category that belongs to the group
                    frequency = n_j / n_total if n_total != 0 else 0  # share of the whole sample that has the category
                    results.append({
                        'Type': 'categorical',
                        'Attribute': attribute,
                        'Category': category,
                        'Group': group,
                        'Test Value': test_value,
                        'Recall': recall,
                        'Group Accuracy': group_accuracy,
                        'Overall Accuracy': frequency,
                        'Group Count': n_ij,
                        'Total Count': n_j
                    })

    results_df = pd.DataFrame(results)

    # Keep only the columns that apply to the variable types actually present
    types_present = set(results_df['Type'])
    continuous_cols = ['Group Mean', 'Overall Mean', 'Group Standard Deviation', 'Overall Standard Deviation']
    categorical_cols = ['Recall', 'Group Accuracy', 'Overall Accuracy']

    if types_present == {'continuos', 'categorical'}:
        columns = ['Group', 'Type', 'Attribute', 'Category', 'Test Value'] + continuous_cols + categorical_cols
    elif types_present == {'continuos'}:
        columns = ['Group', 'Type', 'Attribute', 'Test Value'] + continuous_cols
    else:
        columns = ['Group', 'Type', 'Attribute', 'Category', 'Test Value'] + categorical_cols

    # Continuous rows first, then decreasing test value within each type
    results_df = results_df[columns].sort_values(
        by=['Type', 'Test Value', 'Group'], ascending=[False, False, True])

    return results_df
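
The two helper functions above can be written out as formulas. The notation below is introduced here only for illustration: $n$ stands for `n_total`, $n_g$ for `n_group`, $\bar{x}_g$ and $\bar{x}$ for the group and overall means, $s^2$ for the overall sample variance, $n_j$ for the overall count of category $j$, and $n_{gj}$ for the count of category $j$ inside the group (`n_ij` in the code).

For a continuous attribute:

$$
TV = \frac{\bar{x}_g - \bar{x}}{\sqrt{\dfrac{n - n_g}{n - 1} \cdot \dfrac{s^2}{n_g}}}
$$

For one category of a categorical attribute:

$$
TV = \frac{n_{gj} - \dfrac{n_g \, n_j}{n}}{\sqrt{\dfrac{n - n_g}{n - 1} \left(1 - \dfrac{n_j}{n}\right) \dfrac{n_g \, n_j}{n}}}
$$

In both cases the numerator is the gap between what is observed in the group and what would be expected if the group were a random draw from the whole sample, and the factor $(n - n_g)/(n - 1)$ is the finite-population correction for drawing without replacement.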

## Apply **Test Value** to describe differences between ***positive*** and ***negative*** groups

In [5]:
inputs = df.columns[:-1]
target = 'disease'

# Calculate the test values
test_value = calculate_test_values(df, inputs, target)

In [6]:
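pd.set_option('display.float_format', lambda x: '%.4f' % x)

# For each group, show only the attributes whose |test value| is at least 2
# (roughly the 5% two-sided threshold under a normal approximation)
for i in test_value['Group'].unique():
    print('\n')
    print('Cluster '+str(i))
    display(test_value[(test_value['Group']==i)&(test_value['Test Value'].abs()>=2)])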



Cluster negative


|   | Group | Type | Attribute | Category | Test Value | Group Mean | Overall Mean | Group Standard Deviation | Overall Standard Deviation | Recall | Group Accuracy | Overall Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | negative | continuos | max_heart_rate |  | 5.1805 | 145.1795 | 137.5742 | 22.1073 | 23.8768 |  |  |  |
| 15 | negative | continuos | age |  | -2.7457 | 46.6068 | 47.9665 | 8.2399 | 8.0541 |  |  |  |
| 28 | negative | categorical | exercice_angina | no | 8.2803 |  |  |  |  | 0.7664 | 0.8974 | 0.6555 |
| 17 | negative | categorical | chest_pain | atyp_angina | 6.7905 |  |  |  |  | 0.9077 | 0.5043 | 0.311 |
| 18 | negative | categorical | chest_pain | non_anginal | 3.2569 |  |  |  |  | 0.8056 | 0.2479 | 0.1722 |
| 21 | negative | categorical | blood_sugar | f | 2.0688 |  |  |  |  | 0.5803 | 0.9573 | 0.9234 |
| 22 | negative | categorical | blood_sugar | t | -2.0688 |  |  |  |  | 0.3125 | 0.0427 | 0.0766 |
| 29 | negative | categorical | exercice_angina | yes | -8.2803 |  |  |  |  | 0.1667 | 0.1026 | 0.3445 |
| 16 | negative | categorical | chest_pain | asympt | -8.3709 |  |  |  |  | 0.2647 | 0.2308 | 0.488 |




Cluster positive


|   | Group | Type | Attribute | Category | Test Value | Group Mean | Overall Mean | Group Standard Deviation | Overall Standard Deviation | Recall | Group Accuracy | Overall Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | positive | continuos | age |  | 2.7457 | 49.6957 | 47.9665 | 7.5049 | 8.0541 |  |  |  |
| 12 | positive | continuos | max_heart_rate |  | -5.1805 | 127.9022 | 137.5742 | 22.6085 | 23.8768 |  |  |  |
| 1 | positive | categorical | chest_pain | asympt | 8.3709 |  |  |  |  | 0.7353 | 0.8152 | 0.488 |
| 14 | positive | categorical | exercice_angina | yes | 8.2803 |  |  |  |  | 0.8333 | 0.6522 | 0.3445 |
| 7 | positive | categorical | blood_sugar | t | 2.0688 |  |  |  |  | 0.6875 | 0.1196 | 0.0766 |
| 6 | positive | categorical | blood_sugar | f | -2.0688 |  |  |  |  | 0.4197 | 0.8804 | 0.9234 |
| 3 | positive | categorical | chest_pain | non_anginal | -3.2569 |  |  |  |  | 0.1944 | 0.0761 | 0.1722 |
| 2 | positive | categorical | chest_pain | atyp_angina | -6.7905 |  |  |  |  | 0.0923 | 0.0652 | 0.311 |
| 13 | positive | categorical | exercice_angina | no | -8.2803 |  |  |  |  | 0.2336 | 0.3478 | 0.6555 |
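
The cut-off of 2 used in the filter can be read as an approximate significance threshold. The sketch below assumes that, when a group is no different from a random draw of the sample, the test value is approximately standard normal; under that assumption it converts a test value into a two-sided tail probability. The helper `two_sided_p_value` is a hypothetical name introduced here and is not part of the notebook.

```python
import math

def two_sided_p_value(test_value):
    """Two-sided tail probability of a standard normal at |test_value|.

    Assumption: the test value is approximately N(0, 1) when the group
    does not differ from a random subsample.
    """
    return math.erfc(abs(test_value) / math.sqrt(2))

# The threshold used in the filter, plus two test values reported for the positive group
for tv in (2.0, 2.7457, 8.3709):
    print(f"|test value| = {tv:.4f} -> approximate p-value = {two_sided_p_value(tv):.2e}")
```

Because many attributes are examined at once, these probabilities are better treated as a way to rank attributes than as formal tests.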


In [7]:
test_value[test_value['Group']=='positive']

|   | Group | Type | Attribute | Category | Test Value | Group Mean | Overall Mean | Group Standard Deviation | Overall Standard Deviation | Recall | Group Accuracy | Overall Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | positive | continuos | age |  | 2.7457 | 49.6957 | 47.9665 | 7.5049 | 8.0541 |  |  |  |
| 5 | positive | continuos | rest_bpress |  | 1.5968 | 135.837 | 133.6603 | 18.4946 | 17.433 |  |  |  |
| 12 | positive | continuos | max_heart_rate |  | -5.1805 | 127.9022 | 137.5742 | 22.6085 | 23.8768 |  |  |  |
| 1 | positive | categorical | chest_pain | asympt | 8.3709 |  |  |  |  | 0.7353 | 0.8152 | 0.488 |
| 14 | positive | categorical | exercice_angina | yes | 8.2803 |  |  |  |  | 0.8333 | 0.6522 | 0.3445 |
| 7 | positive | categorical | blood_sugar | t | 2.0688 |  |  |  |  | 0.6875 | 0.1196 | 0.0766 |
| 9 | positive | categorical | rest_electro | st_t_wave_abnormality | 1.5043 |  |  |  |  | 0.5667 | 0.1848 | 0.1435 |
| 4 | positive | categorical | chest_pain | typ_angina | 1.1312 |  |  |  |  | 0.6667 | 0.0435 | 0.0287 |
| 11 | positive | categorical | rest_electro | ? | 1.1277 |  |  |  |  | 1.0 | 0.0109 | 0.0048 |
| 10 | positive | categorical | rest_electro | left_vent_hyper | -1.0925 |  |  |  |  | 0.2 | 0.0109 | 0.0239 |


Consider first the variable **AGE**. The average age across the entire sample is ***47.97 years***; within the DISEASE = POSITIVE subgroup it rises to ***49.70 years*** (test value 2.75), so individuals with the disease tend to be somewhat older. The feature that most strongly distinguishes this subgroup, however, is CHEST PAIN = ASYMPT, which has the largest test value (8.37). In the total sample, ***48.8%*** of people exhibit this symptom, whereas in the DISEASE = POSITIVE subgroup the figure rises to ***81.5%***. Conversely, ***73.5%*** of those displaying CHEST PAIN = ASYMPT fall into the DISEASE = POSITIVE category.