CTApy is a Python package for "Conditional Topic Allocation" (CTA), a text-analysis method that identifies topics that correlate with numerical outcomes.
- Corresponding research paper: Conditional Topic Allocations for Open-Ended Survey Responses (2024).
CTA finds topics by conditioning on observables. It can answer questions such as: do Republicans write differently about politics than Democrats? The method consists of three steps:
1. Predict the outcome variable with text.
- Uses DistilBERT to predict the outcome.
2. Select words with high predictive power (positive or negative).
- Calculates SHAP values for each word and selects words with a statistically significant SHAP value.
3. Group words by semantic similarity.
- Returns topics with either positive or negative correlation with the outcome.
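To make steps 2 and 3 concrete, here is a minimal toy sketch in plain Python. It does not use the CTApy API and skips step 1 entirely: the per-word attribution scores and the word embeddings are hypothetical numbers standing in for SHAP values and a real embedding model. The one-sample t-test and the greedy cosine-similarity grouping are assumptions chosen for illustration and may differ from what the package does internally.

```python
import math
from statistics import mean, stdev

# Hypothetical per-occurrence attribution scores (stand-ins for SHAP values).
# Each word maps to the scores it received in the five texts it appears in.
ATTRIBUTIONS = {
    "taxes":       [0.42, 0.38, 0.51, 0.45, 0.40],
    "economy":     [0.30, 0.35, 0.28, 0.33, 0.31],
    "freedom":     [0.22, 0.27, 0.25, 0.19, 0.24],
    "climate":     [-0.47, -0.52, -0.39, -0.44, -0.50],
    "environment": [-0.29, -0.35, -0.31, -0.27, -0.33],
    "people":      [0.05, -0.04, 0.02, -0.06, 0.03],
}

# Hypothetical 3-dimensional word embeddings (a real model has far more dimensions).
EMBEDDINGS = {
    "taxes":       (0.9, 0.1, 0.0),
    "economy":     (0.8, 0.2, 0.1),
    "freedom":     (0.1, 0.9, 0.1),
    "climate":     (0.0, 0.1, 0.9),
    "environment": (0.1, 0.2, 0.8),
}

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
# It is only valid here because every word has exactly n = 5 scores.
T_CRITICAL = 2.776


def t_statistic(values):
    """One-sample t-statistic for the null hypothesis that the mean is zero."""
    return mean(values) / (stdev(values) / math.sqrt(len(values)))


def select_words(attributions, critical=T_CRITICAL):
    """Step 2 (toy version): keep words whose mean score differs from zero."""
    return {
        word: mean(values)
        for word, values in attributions.items()
        if abs(t_statistic(values)) > critical
    }


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_words(words, embeddings, min_similarity=0.8):
    """Step 3 (toy version): greedily attach each word to the first topic
    whose seed word is similar enough, otherwise start a new topic."""
    topics = []
    for word in words:
        for topic in topics:
            if cosine(embeddings[word], embeddings[topic[0]]) >= min_similarity:
                topic.append(word)
                break
        else:
            topics.append([word])
    return topics


def toy_cta(attributions, embeddings):
    """Select significant words, split them by sign, and group each side."""
    selected = select_words(attributions)
    positive = [w for w, m in selected.items() if m > 0]
    negative = [w for w, m in selected.items() if m < 0]
    return {
        "positive": group_words(positive, embeddings),
        "negative": group_words(negative, embeddings),
    }


topics = toy_cta(ATTRIBUTIONS, EMBEDDINGS)
print(topics)
```

In this toy data, "people" has scores scattered around zero and is dropped in step 2, while the remaining words are split by the sign of their mean score before being grouped.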
CTA supports all languages.
CTApy requires Python 3.9 and pip.
It is highly recommended to use a virtual environment (or conda environment) for the installation.
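As a quick sanity check before installing, the following sketch reports whether the current interpreter appears to run inside an isolated environment. The helper name `in_isolated_env` is hypothetical and not part of CTApy. The check relies on two standard conventions: `sys.prefix` differs from `sys.base_prefix` inside a `venv`, and conda sets the `CONDA_PREFIX` variable when an environment is activated (including its base environment).

```python
import os
import sys


def in_isolated_env():
    """Return True if a venv or an activated conda environment is detected."""
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    in_conda = "CONDA_PREFIX" in os.environ
    return in_venv or in_conda


print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Isolated environment detected:", in_isolated_env())
```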
# upgrade pip, wheel and setuptools
python -m pip install -U pip wheel setuptools
# install the package
python -m pip install -U CTApy
If you want to use Jupyter, make sure you have it installed in the current environment.
Please see the hands-on tutorials, which replicate the research paper: https://github.com/twekhof/CTA/tree/main/tutorials.
CTApy was developed by
Tobias Wekhof, ETH Zurich
This Python package is a research tool currently under development. The authors take no responsibility for the accuracy or reliability of the results produced by it.