Measuring associations between categorical variables using Dython — because Pearson doesn't work on "yes" and "no."
Standard correlation (Pearson, Spearman) only works on numeric data. But most real-world datasets have categorical variables — colour, type, status, gender. How do you measure the relationship between "Fire" and "Legendary" in a Pokemon dataset? You can't just subtract them.
Dython solves this by computing association values between all pairs of categorical (and mixed) variables, producing a correlation-style heatmap for non-numeric data.
- Identifies categorical columns automatically
- Calculates association values between all categorical pairs
- Visualises the result as a heatmap with a colour gradient
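By default, Dython's `associations` uses Cramér's V for nominal-nominal pairs and the correlation ratio for nominal-numeric pairs. Theil's U (the uncertainty coefficient) is available for nominal-nominal pairs via `nom_nom_assoc='theil'` in recent versions. All three range from 0 (no association) to 1 (perfect association), so the matrix looks and feels like a Pearson correlation heatmap. One difference: Theil's U is asymmetric, so U(x|y) generally differs from U(y|x) and the matrix is not mirrored across the diagonal when it is used.
| Association Type | Method | Range |
|---|---|---|
| Categorical ↔ Categorical | Cramér's V (default) or Theil's U | 0–1 |
| Categorical ↔ Numeric | Correlation ratio | 0–1 |
| Numeric ↔ Numeric | Pearson | -1 to 1 |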
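To make the two 0-to-1 measures more concrete, here is a minimal plain-Python sketch of Theil's U and the correlation ratio, written from their textbook definitions. It is not Dython's source code. The function names, the convention of returning 1.0 when `x` has a single category, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X) of a sequence of labels, in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): the uncertainty left in x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (c / n) * math.log(c / y_counts[y_val])
        for (_, y_val), c in xy_counts.items()
    )


def theils_u(x, y):
    """U(X | Y): fraction of the uncertainty in x that y explains (0 to 1)."""
    h_x = entropy(x)
    if h_x == 0:  # assumption: x has one category, so nothing is left to explain
        return 1.0
    return (h_x - conditional_entropy(x, y)) / h_x


def correlation_ratio(categories, measurements):
    """Eta: square root of between-group variation over total variation."""
    overall_mean = sum(measurements) / len(measurements)
    groups = {}
    for cat, value in zip(categories, measurements):
        groups.setdefault(cat, []).append(value)
    between = sum(
        len(vals) * (sum(vals) / len(vals) - overall_mean) ** 2
        for vals in groups.values()
    )
    total = sum((value - overall_mean) ** 2 for value in measurements)
    return math.sqrt(between / total) if total > 0 else 0.0


# Toy data, invented for illustration only.
ptype = ["Fire", "Water", "Grass", "Rock"]
group = ["A", "A", "B", "B"]

u_type_given_group = theils_u(ptype, group)  # group only narrows type to 2 of 4
u_group_given_type = theils_u(group, ptype)  # type determines group exactly
eta = correlation_ratio(group, [1.0, 1.0, 3.0, 3.0])
```

The two Theil's U values differ because the measure is directional: knowing the type fixes the group, but knowing the group only halves the candidate types. That asymmetry is something Pearson never shows.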
```shell
# install dependencies
pip install dython pandas

# clone the repository and open the demo notebook
git clone https://github.com/wsamuelw/dython.git
cd dython
jupyter notebook dython_demo.ipynb
```

```python
from dython.nominal import associations, identify_nominal_columns

# df is assumed to be a pandas DataFrame that has already been loaded

# auto-detect categorical columns
categorical_features = identify_nominal_columns(df)

# compute the full association matrix
complete_correlation = associations(df, nominal_columns=categorical_features)

# extract and display
df_complete_corr = complete_correlation['corr']
df_complete_corr.style.background_gradient(cmap='coolwarm')
```

Pokemon dataset: includes Type, Colour, Legendary status, and other categorical attributes. The association matrix reveals which attributes are most strongly related.
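One way to read off the strongest relationships without scanning the heatmap is to rank the off-diagonal pairs. The sketch below is a hypothetical helper, not part of Dython. It assumes the matrix is available as a nested dict, and the numbers in `toy_matrix` are invented for illustration rather than taken from the Pokemon data.

```python
from itertools import combinations


def strongest_pairs(matrix, top=3):
    """Rank unordered column pairs by association strength, strongest first.

    `matrix` is a nested dict {column: {other_column: value}}. If the matrix
    is asymmetric, a pair is scored by the larger of its two directions.
    """
    pairs = []
    for a, b in combinations(sorted(matrix), 2):
        # abs() so that strong negative Pearson values also rank highly
        score = max(abs(matrix[a][b]), abs(matrix[b][a]))
        pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top]


# Hypothetical values, invented for illustration; not real results.
toy_matrix = {
    "Type":      {"Type": 1.0, "Colour": 0.42, "Legendary": 0.08},
    "Colour":    {"Type": 0.35, "Colour": 1.0, "Legendary": 0.05},
    "Legendary": {"Type": 0.10, "Colour": 0.05, "Legendary": 1.0},
}

ranked = strongest_pairs(toy_matrix, top=2)
for a, b, score in ranked:
    print(f"{a} vs {b}: {score:.2f}")
```

A pandas DataFrame such as `df_complete_corr` can be passed in as `df_complete_corr.to_dict()`, which produces the same nested-dict shape. Pairs near the top of the ranking are the natural candidates to inspect first.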
- EDA — understand which categorical features are related before modelling
- Feature selection — drop highly correlated categorical features (just like you would with numeric)
- Data profiling — quickly see relationships in survey data, customer segments, or any categorical-heavy dataset
- Only numeric data: standard `df.corr()` is faster and sufficient
- Very large datasets: computing nominal associations such as Theil's U gets slower as the number of unique categories grows
- Causation: association ≠ causation, same as Pearson
- dython — categorical association measures
- pandas — data handling
- matplotlib — heatmap rendering (via dython)
Licence: MIT