Dython — Categorical Variable Correlation

Measuring associations between categorical variables using Dython — because Pearson doesn't work on "yes" and "no."

Problem

Standard correlation coefficients (Pearson, Spearman) need numeric or at least ordered data. But most real-world datasets have categorical variables: colour, type, status, gender. How do you measure the relationship between a Pokemon's Type ("Fire", "Water" and so on) and whether it is Legendary? You can't just subtract them.

Dython solves this by computing association values between all pairs of categorical (and mixed) variables, producing a correlation-style heatmap for non-numeric data.

What It Does

Identifies categorical columns automatically
Calculates association values between all categorical pairs
Visualises the result as a heatmap with a colour gradient

How Dython Works

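Dython picks an association measure for each pair of columns based on their types. With the default settings of associations(), nominal-nominal pairs use Cramér's V; Theil's U (the uncertainty coefficient) is available by passing nom_nom_assoc='theil'. Nominal-numeric pairs use the correlation ratio, and numeric-numeric pairs use Pearson. Cramér's V, Theil's U and the correlation ratio all range from 0 (no association) to 1 (perfect association), so the matrix reads much like a Pearson correlation heatmap, except that those cells carry no sign. Theil's U is also asymmetric: U(x|y) generally differs from U(y|x), so a Theil's U matrix is not mirrored across the diagonal.

Association Type	Method	Range
Categorical ↔ Categorical	Cramér's V (default) or Theil's U	0–1
Categorical ↔ Numeric	Correlation ratio	0–1
Numeric ↔ Numeric	Pearson	-1 to 1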
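
To make the two less familiar measures concrete, here is a minimal pure-Python sketch of Theil's U and the correlation ratio. The helper names (uncertainty_coefficient, correlation_ratio_sketch) and the toy rows are made up for illustration; this is not Dython's own code, only the textbook formulas under the assumption of no missing values.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(x | y): the uncertainty left in x once y is known."""
    n = len(x)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def uncertainty_coefficient(x, y):
    """Theil's U(x | y): share of the uncertainty in x explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


def correlation_ratio_sketch(categories, measurements):
    """Eta: square root of between-group variation over total variation."""
    n = len(measurements)
    overall_mean = sum(measurements) / n
    groups = defaultdict(list)
    for c, m in zip(categories, measurements):
        groups[c].append(m)
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    total = sum((m - overall_mean) ** 2 for m in measurements)
    return math.sqrt(between / total) if total else 0.0


# Made-up toy rows, not taken from the Pokemon dataset.
ptype = ["Fire", "Fire", "Water", "Water", "Grass", "Grass"]
colour = ["Red", "Red", "Blue", "Blue", "Green", "Green"]
legendary = [False, False, False, True, False, False]
attack = [84, 52, 48, 65, 49, 62]

print("U(colour | type)    =", uncertainty_coefficient(colour, ptype))
print("U(legendary | type) =", uncertainty_coefficient(legendary, ptype))
print("U(type | legendary) =", uncertainty_coefficient(ptype, legendary))
print("eta(type, attack)   =", correlation_ratio_sketch(ptype, attack))
```

In the toy rows colour is fully determined by type, so U(colour | type) comes out as 1. The two legendary/type values differ from each other, which is the asymmetry of Theil's U in action.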

Setup

Google Colab

Click the badge above.

Local

pip install dython pandas
git clone https://github.com/wsamuelw/dython.git
cd dython
jupyter notebook dython_demo.ipynb

Key Code

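from dython.nominal import associations, identify_nominal_columns

# df is assumed to be a pandas DataFrame that is already loaded,
# for example the Pokemon dataset described under Data

# auto-detect categorical columns
categorical_features = identify_nominal_columns(df)

# compute the full association matrix; by default this also draws the heatmap
# the return value is a dict holding the matrix ('corr') and the plot axes ('ax')
complete_correlation = associations(df, nominal_columns=categorical_features)

# extract the matrix as a DataFrame and display it with a colour gradient
df_complete_corr = complete_correlation['corr']
df_complete_corr.style.background_gradient(cmap='coolwarm')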

Data

Pokemon Dataset — includes Type, Colour, Legendary status, and other categorical attributes. The association matrix reveals which attributes are most strongly related.

When to Use Dython

EDA — understand which categorical features are related before modelling
Feature selection — drop highly correlated categorical features (just like you would with numeric)
Data profiling — quickly see relationships in survey data, customer segments, or any categorical-heavy dataset
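
For the feature-selection case, one possible way to shortlist redundant features is to scan the association matrix for pairs above a cut-off. The sketch below is an assumption-laden illustration: strongly_associated_pairs is a hypothetical helper, the 0.8 threshold is arbitrary, and the toy values are made up rather than computed from the Pokemon data.

```python
def strongly_associated_pairs(corr, threshold=0.8):
    """List (feature_a, feature_b, value) for each pair at or above threshold.

    corr is anything indexable as corr[a][b], such as a nested dict or a
    pandas DataFrame. Each unordered pair is read once, so a symmetric
    matrix is assumed.
    """
    names = list(corr)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            value = float(corr[a][b])
            if abs(value) >= threshold:
                pairs.append((a, b, value))
    # strongest association first
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


# Made-up values standing in for complete_correlation['corr'].
toy_corr = {
    "Type": {"Type": 1.0, "Colour": 0.91, "Legendary": 0.12},
    "Colour": {"Type": 0.91, "Colour": 1.0, "Legendary": 0.08},
    "Legendary": {"Type": 0.12, "Colour": 0.08, "Legendary": 1.0},
}

for a, b, value in strongly_associated_pairs(toy_corr):
    print(f"{a} / {b}: {value:.2f}")
```

Which feature of a flagged pair to drop is still a judgement call. If the matrix was built with Theil's U it is not symmetric, so both directions of each pair would need checking rather than only one.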

When NOT to Use It

Only numeric data — standard df.corr() is faster and sufficient
Very large datasets — Theil's U computation scales with unique categories
Causation — association ≠ causation, same as Pearson

Tech Stack

dython — categorical association measures
pandas — data handling
matplotlib — heatmap rendering (via dython)

License

MIT
