This repo provides the PyTorch source code of our paper:
On the nonlinear correlation of ML performance between data subpopulations
Weixin Liang*, Yining Mao*, Yongchan Kwon*, Xinyu Yang, James Zou
ICML (2023) [arXiv]
TL;DR: We show that there is a “moon shape” correlation (a parabolic uptrend curve) between test performance on the majority subpopulation and on the minority subpopulation. This nonlinear correlation holds across model architectures, training settings, datasets, and the level of imbalance between subpopulations.
Subpopulation shift is a major challenge in ML: test data often have a different distribution across subgroups (e.g., different types of users or patients) than the training data. Recent works find a strong linear relationship between ID and OOD performance under dataset reconstruction shifts; in contrast, we empirically show that the two have a nonlinear correlation under subpopulation shifts.
The “moon shape” phenomenon is a nonlinear correlation (a parabolic uptrend curve) between test performance on the majority subpopulation and on the minority subpopulation. We decompose the model’s performance into separate evaluations on the majority and minority subpopulations of the OOD test set, and evaluate under two dataset configurations. Top (a-c): datasets with spurious correlation show a pronounced nonlinearity; bottom (d-f): datasets without spurious correlation exhibit a more subtle nonlinearity.
Why the moon shape is not obvious:
- A mixture of models can fill in the moon shape (see the sketch below).
- Stronger spurious correlation creates a more nonlinear performance correlation.
See our paper for details!
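To make the first point concrete, here is a minimal numerical sketch (ours, not from the repo) of why randomly mixing two models traces the straight segment between their (majority, minority) accuracy points, filling in the region below the moon; the accuracy numbers are made up for illustration:

```python
import numpy as np

# Hypothetical (majority, minority) test accuracies of two trained models.
acc_a = np.array([0.95, 0.40])  # strong on majority, weak on minority
acc_b = np.array([0.70, 0.75])  # more balanced

# A mixture that answers with model A on a random fraction p of inputs
# (independently of the input) has expected accuracy p*acc_a + (1-p)*acc_b
# on every subpopulation, so its point lies on the straight line segment
# between the two models, i.e., inside the moon shape.
for p in np.linspace(0.0, 1.0, 5):
    maj, mino = p * acc_a + (1 - p) * acc_b
    print(f"p={p:.2f}: majority acc={maj:.3f}, minority acc={mino:.3f}")
```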
Our implementation framework is based on MXNet and AutoGluon.
- mxnet >= 1.7.0
- pytorch >= 1.10.1
- torchvision >= 0.11.2
- autogluon
- gluoncv
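For reference, a one-line install consistent with the list above might look as follows (note that the pip package for PyTorch is `torch`; the exact version combination that works together depends on your Python/CUDA setup):

```
pip install "mxnet>=1.7.0" "torch>=1.10.1" "torchvision>=0.11.2" autogluon gluoncv
```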
We implement 5 subpopulation shift datasets with 6 settings (2 versions for Modified-CIFAR4).
- Spurious correlation datasets: MetaShift, Waterbirds, Modified-CIFAR4 V1
- Rare subpopulation datasets: PACS, OfficeHome, Modified-CIFAR4 V2
- For MetaShift [GoogleDrive], PACS [GoogleDrive], and OfficeHome [GoogleDrive], the data needs to be downloaded to the corresponding dataset folders in `datasets/`.
- For Waterbirds, install WILDS via `pip install wilds` and download the data with code (see the sketch after this list).
- For Modified-CIFAR4, the CIFAR10 dataset will first be downloaded with torchvision.
- To see the dataset samples and prepare the data, run the Jupyter notebook in the corresponding dataset folder in `datasets/`. For example, the MetaShift dataset preparation code is in `datasets/metashift/metashift_prepare.ipynb`.
- In each dataset preparation notebook, you can change `ROOT_PATH` and `EXP_ROOT_PATH` in the first code cell:
  - `ROOT_PATH`: downloaded dataset root path;
  - `EXP_ROOT_PATH`: experiment root path, defaults to `experiments/DATASET_NAME`.
- The prepared data will be saved in `EXP_ROOT_PATH/data` in PyTorch ImageFolder format:
  - Training data in `EXP_ROOT_PATH/data/train`;
  - Validation data in `EXP_ROOT_PATH/data/majority-val` and `EXP_ROOT_PATH/data/minority-val`.
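As a hedged illustration (a sketch under assumptions, not the repo's exact code): downloading Waterbirds through the WILDS package and loading the prepared splits as PyTorch ImageFolders. The `root_dir` value, the resize transform, and the MetaShift example paths are assumptions for illustration:

```python
from wilds import get_dataset
from torchvision import datasets, transforms

# Download the Waterbirds data via WILDS; "data/" is an assumed location.
get_dataset(dataset="waterbirds", download=True, root_dir="data")

# After running a preparation notebook, the splits are plain PyTorch
# ImageFolder directories (MetaShift shown as an example EXP_ROOT_PATH).
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("experiments/metashift/data/train", transform=tfm)
majority_val = datasets.ImageFolder("experiments/metashift/data/majority-val", transform=tfm)
minority_val = datasets.ImageFolder("experiments/metashift/data/minority-val", transform=tfm)
```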
Train 500 different ML models with varying configurations following the search space of AutoGluon. For each dataset, we sweep over 5 model architectures, 5 learning rates, 5 batch sizes, and 4 training durations:
```python
import autogluon.core as ag  # assumed import; the repo may import AutoGluon differently

@ag.args( # 5 models * 5 lr * 5 batch_size * 4 epochs = 500 configurations
    model = ag.space.Categorical(
        'mobilenetv3_small',
        'resnet18_v1b',
        'resnet50_v1',
        'mobilenetv3_large',
        'resnet101_v2',
    ),
    lr = ag.space.Categorical(0.01, 0.005, 0.001, 0.0005, 0.0001),
    batch_size = ag.space.Categorical(8, 16, 32, 64, 128),
    epochs = ag.space.Categorical(1, 5, 10, 25)
)
```
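For context, here is a minimal sketch of how a function decorated with `@ag.args` is typically driven by an AutoGluon scheduler; the repo's `main.py` handles this internally, and `train_fn`, the reported metrics, and the resource settings below are illustrative assumptions, not the repo's code:

```python
import autogluon.core as ag

@ag.args(lr=ag.space.Categorical(0.01, 0.001), epochs=ag.space.Categorical(1, 5))
def train_fn(args, reporter):
    # Train one sampled configuration (training loop elided) and report a
    # reward metric each epoch so the scheduler can track progress.
    for epoch in range(args.epochs):
        acc = 0.0  # placeholder: compute validation accuracy here
        reporter(epoch=epoch + 1, accuracy=acc)

# FIFO scheduler with random search over the declared space; num_trials
# bounds how many configurations get trained.
scheduler = ag.scheduler.FIFOScheduler(
    train_fn,
    resource={'num_cpus': 4, 'num_gpus': 1},
    num_trials=4,
    reward_attr='accuracy',
    time_attr='epoch',
)
scheduler.run()
scheduler.join_jobs()
```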
Specify the experiment directory, and you can train the models. For example, if you prepare and save the data in `experiments/metashift/data`, run:

```
python main.py --exp-dir experiments/metashift
```

and you will get the following results in `experiments/metashift/result`:
- A table with the evaluation results of each configuration;
- A 'majority subpopulation accuracy vs. minority subpopulation accuracy' plot corresponding to the table.
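If you want to re-draw the plot yourself, here is a hedged sketch with pandas/matplotlib; the CSV filename and the column names `majority_acc`/`minority_acc` are assumptions about the saved table, not guaranteed by the repo:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed output location and column names; adjust to the actual table.
df = pd.read_csv("experiments/metashift/result/results.csv")
plt.scatter(df["majority_acc"], df["minority_acc"], s=12, alpha=0.6)
plt.xlabel("Majority subpopulation accuracy")
plt.ylabel("Minority subpopulation accuracy")
plt.title("MetaShift: majority vs. minority accuracy (moon shape)")
plt.savefig("moon_shape.png", dpi=200)
```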
If you find this code/work useful in your own research, please consider citing the following:
@inproceedings{liang2022nonlinear,
title={On the nonlinear correlation of ML performance between data subpopulations},
author={Liang, Weixin and Mao, Yining and Kwon, Yongchan and Yang, Xinyu and Zou, James},
booktitle={ICML},
year={2023}
}