
DataDistributionTransferLearning

This is the code used for "The Role of Pre-training Data in Transfer Learning". Our CLIP models are trained from scratch on each of the pre-training datasets unless otherwise mentioned, following the training code from the OpenCLIP GitHub repository. CLIP models are trained with the AdamW optimizer using the default PyTorch parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, a batch size of 1024, and a weight decay of 0.1. We start with a learning rate of $10^{-3}$ and apply a cosine-annealing learning rate schedule (Loshchilov & Hutter, 2016) with 5,000 warm-up steps. We use the same data augmentations as in the SimCLR paper.
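
For reference, the optimizer and schedule above correspond roughly to the following PyTorch setup. This is a minimal sketch with a placeholder model and an assumed total step count, not the actual OpenCLIP training loop:

import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder; the real model is a CLIP model built with OpenCLIP

# AdamW with the hyperparameters stated above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.1
)

warmup_steps = 5_000
total_steps = 100_000  # assumption: depends on dataset size and number of epochs

def lr_lambda(step):
    # linear warm-up for the first 5,000 steps, then cosine annealing to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# in the training loop: optimizer.step(); scheduler.step() once per step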

SimCLR training

Our SimCLR implementation closely follows the training code from the SLIP GitHub repository. SimCLR models are also trained for 16 epochs from scratch with the AdamW optimizer (Loshchilov & Hutter, 2017) using $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$, a batch size of 1024, and a weight decay of 0.1. We start with a learning rate of $10^{-3}$ and apply a cosine-annealing learning rate schedule with 2 epochs of warm-up. The hidden dimension of the SimCLR MLP projection head is set to 4,094 and the output embedding dimension of the projection head is set to 256.
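
The projection head described above is a small MLP on top of the image encoder. A minimal sketch using the dimensions stated above; the encoder output width and the exact depth/normalization of the head are assumptions here (the actual head follows the SLIP code):

import torch.nn as nn

encoder_dim = 768   # assumption: output width of the image encoder
hidden_dim = 4094   # hidden dimension of the projection head, as stated above
out_dim = 256       # output embedding dimension, as stated above

# SimCLR-style MLP projection head mapping encoder features into the
# embedding space used by the contrastive loss
projection_head = nn.Sequential(
    nn.Linear(encoder_dim, hidden_dim),
    nn.ReLU(inplace=True),
    nn.Linear(hidden_dim, out_dim),
)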

Fine-tuning details

Each pre-trained model is fine-tuned on the specific downstream task for 128 epochs. The initial learning rate is chosen from {0.0001, 0.0003, 0.001, 0.003}, with a cosine-annealing learning rate schedule with 500 warm-up steps and a batch size of 128. For each fine-tuning run we report the best-performing result on the test set over this grid search. We use the implementation from the WiSE-FT GitHub repository for fine-tuning, with only one model and $\alpha = 1$.
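
WiSE-FT interpolates between the zero-shot and fine-tuned weights in weight space; with a single model and $\alpha = 1$ the interpolation reduces to the fine-tuned weights themselves, i.e. plain end-to-end fine-tuning. A minimal sketch of that interpolation (the function name and arguments are illustrative, not the WiSE-FT repository's API):

def interpolate_weights(theta_zeroshot, theta_finetuned, alpha):
    # WiSE-FT-style weight-space interpolation of two model state dicts
    return {
        key: (1 - alpha) * theta_zeroshot[key] + alpha * theta_finetuned[key]
        for key in theta_finetuned
    }

# with alpha = 1 this returns exactly the fine-tuned weights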

Install dependencies

conda env create
conda activate DataDisributionTransferLearning

Add the repository directory to PYTHONPATH:

cd DataDisributionTransferLearning
export PYTHONPATH="$PYTHONPATH:$PWD"

Working with Caliban

Most experiments in this repository were run using Caliban. Caliban is a tool for developing research workflows and notebooks in an isolated Docker environment and submitting those isolated environments to Google Cloud. You can use the commands in run.sh for the different experiments. Each run loads its hyperparameters from config.json and saves its results to a Google Cloud Storage bucket; a minimal sketch of reading config.json follows the setup steps below. Below is a short step-by-step guide to running Caliban on GCP:

  1. sudo apt-get install python3 python3-venv python3-pip
  2. sudo usermod -a -G docker ${USER}
  3. Install Docker. Check whether Docker is already installed; if it is, run sudo apt-get install -y nvidia-docker2. If not, continue with https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
  4. sudo pkill -SIGHUP dockerd
  5. python3 -m pip install --user pipx
  6. python3 -m pipx ensurepath
  7. source ~/.bashrc (or re-login for the PATH changes to take effect)
  8. pipx install caliban

To check if all is well, run caliban --help
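
As a rough illustration of how a run picks up its hyperparameters from config.json (the key names here are hypothetical; the actual keys are defined by config.json in this repository):

import json

with open("config.json") as f:
    config = json.load(f)

# hypothetical keys, for illustration only
learning_rate = config.get("lr", 1e-3)
batch_size = config.get("batch_size", 1024)
print(f"running with lr={learning_rate}, batch_size={batch_size}")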

Setting up Google Cloud for Caliban

  1. Give the account owner the name of the service account: go to VM details > API and identity management > Service account, then add the service account ($$$@developer.gserviceaccount.com) as an owner under IAM & Admin in the Google Cloud console.

  2. Also add this service account to the bucket as Storage Object Admin if you are using a Google Cloud Storage bucket.

  3. gcloud init

  • Select the account
  • Set the default zone, e.g. europe-west4-a (number 14)

  4. Add the following lines to the end of ~/.bashrc:

export REGION="your region, e.g. europe-west4"
export PROJECT_ID="your project ID"

source ~/.bashrc

Test your environment: gcloud auth list

  5. Follow these steps to get a JSON file for credentials
  6. Move the JSON file to a path of your choice
  7. Add the following to the end of ~/.bashrc: export GOOGLE_APPLICATION_CREDENTIALS="path to the JSON file" (see the check below)
  8. source ~/.bashrc
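
To check that the credentials and bucket permissions are picked up correctly, here is a minimal sketch; it assumes the google-cloud-storage Python package is installed, which is not a dependency declared by this repository:

from google.cloud import storage

# storage.Client() reads GOOGLE_APPLICATION_CREDENTIALS from the environment set above
client = storage.Client()
# the bucket used for results should appear here if permissions are set correctly
print([bucket.name for bucket in client.list_buckets()])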

Then you can run Caliban either locally or on the cloud using GCP training jobs.
