ML-Recon

PyTorch FSDP modifications

We are testing PyTorch Fully Sharded Data Parallel (FSDP) to shard the model parameters and distribute the data across multiple GPUs, in order to increase the maximum training data dimension that fits in memory.
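To give a sense of why the data dimension becomes a memory concern, the sketch below estimates the size of a single displacement field of shape (3, dim, dim, dim). It assumes float32 storage, which is an assumption (the repository's dtype is not stated here), and it counts only the raw field, not activations, gradients, or optimizer state.

```python
def field_bytes(dim: int, channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one (channels, dim, dim, dim) field.

    bytes_per_value=4 assumes float32; this is an illustrative assumption.
    """
    return channels * dim**3 * bytes_per_value


for dim in (128, 256, 512):
    mib = field_bytes(dim) / 2**20
    print(f"dim={dim}: {mib:.0f} MiB per field")
```

Doubling dim multiplies the per-field memory by 8, which is why larger boxes quickly exceed what a single GPU can hold.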

PyTorch Lightning modifications

We use the PyTorch Lightning framework to simplify training on multiple GPUs, with the aim of handling larger boxes and many more particles.

ML-Recon

Objective:

An ML project to predict N-body simulation output from the initial conditions. Both the input and the output are particle displacement fields.

File descriptions:

reconLPT2Nbody_uNet.py : main executable script (with modifications for FSDP, Lightning, etc.)
periodic_padding.py : implements periodic boundary padding
data_utils.py : data loading, plus test/analysis utilities
model/BestModel.pt : best trained model
configs/config_unet.json : most of the hyperparameters
Unet/uNet.py : model architecture
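As an illustration of what periodic boundary padding means for a displacement field, here is a minimal NumPy sketch. It is not the code in periodic_padding.py; the function name `periodic_pad` and its signature are hypothetical, and it relies only on NumPy's `np.pad` with `mode="wrap"`.

```python
import numpy as np


def periodic_pad(field: np.ndarray, pad: int) -> np.ndarray:
    """Pad the last three (spatial) axes with wrap-around values.

    Leading axes (e.g. batch or channel) are left unpadded.
    """
    pad_width = [(0, 0)] * (field.ndim - 3) + [(pad, pad)] * 3
    return np.pad(field, pad_width, mode="wrap")


# A small (3, 4, 4, 4) field stands in for a (3, dim, dim, dim) sample.
x = np.arange(3 * 4 * 4 * 4, dtype=np.float32).reshape(3, 4, 4, 4)
padded = periodic_pad(x, pad=1)
print(padded.shape)  # (3, 6, 6, 6)
```

With wrap-around padding, the layer added before the first grid plane is a copy of the last plane, so a convolution near the box edge sees the field from the opposite side of the periodic box.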

To run the code:

srun --ntasks-per-node=1 --gpus-per-task=1 ./reconLPT2Nbody_uNet_lightning_FSDP.py -c ./configs/config_unet.json 0,1,2,3

Instructions:

Input raw data should be stored in files named '0_train.npy' and '1_train.npy', where 0 and 1 denote the initial snapshot and the final snapshot, respectively. The data in each file should have shape (sample_size,3,dim,dim,dim): the first axis indexes the samples, the second axis (length 3) holds the displacement components (\phi_x, \phi_y, \phi_z) for ZA, and the last three axes are the spatial grid.
The output of the model has shape (6,dim,dim,dim), where (0:3,dim,dim,dim) stores the fastPM simulation predicted by the uNet model and (3:6,dim,dim,dim) stores the corresponding real simulation. Here the PM simulation dimension is dim = 128.
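The layout described above can be sketched with synthetic data. This is only an illustration of the shapes and file names: it uses random arrays, a small `dim` instead of 128, and a temporary directory in place of the real data path. The `output` array here is a stand-in built by concatenation, not an actual model prediction.

```python
import os
import tempfile

import numpy as np

sample_size, dim = 2, 8  # small stand-ins; the text above uses dim = 128

with tempfile.TemporaryDirectory() as base_data_path:
    rng = np.random.default_rng(0)
    # Write one file per snapshot: 0 = initial, 1 = final.
    for prefix in ("0", "1"):
        data = rng.standard_normal((sample_size, 3, dim, dim, dim))
        np.save(os.path.join(base_data_path, f"{prefix}_train.npy"),
                data.astype(np.float32))

    initial = np.load(os.path.join(base_data_path, "0_train.npy"))
    final = np.load(os.path.join(base_data_path, "1_train.npy"))

# Stand-in for one model output of shape (6, dim, dim, dim):
# channels 0:3 would hold the prediction, channels 3:6 the real simulation.
output = np.concatenate([initial[0], final[0]], axis=0)
predicted, truth = output[0:3], output[3:6]
print(output.shape, predicted.shape, truth.shape)
```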
The best trained model is stored in model/BestModel.pt. All the tests (pancake, cosmology, etc.) should be run on this model. To run different tests, change only the following parameters in configs/config_unet.json:
- base_data_path: where the input (LPT/ZA) is stored.
- output_path: where the output should be stored.
The ZA/PM128 data are stored in the directory on LCRC Swing.
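One way to prepare a per-test configuration without editing the original file is sketched below. It assumes that `base_data_path` and `output_path` are top-level keys of the JSON file, which is an assumption about the layout of config_unet.json. The helper `write_test_config`, the file name `config_pancake.json`, the key `other_key`, and all paths are hypothetical placeholders.

```python
import json
import os
import tempfile


def write_test_config(src_path, dst_path, base_data_path, output_path):
    """Copy a JSON config, changing only base_data_path and output_path."""
    with open(src_path) as f:
        config = json.load(f)
    config["base_data_path"] = base_data_path
    config["output_path"] = output_path
    with open(dst_path, "w") as f:
        json.dump(config, f, indent=2)
    return config


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder config standing in for configs/config_unet.json.
    src = os.path.join(tmp, "config_unet.json")
    with open(src, "w") as f:
        json.dump({"base_data_path": "old/in",
                   "output_path": "old/out",
                   "other_key": 1}, f)

    dst = os.path.join(tmp, "config_pancake.json")
    new_config = write_test_config(
        src, dst, "/path/to/pancake_data", "/path/to/pancake_output")
    print(new_config)
```

The resulting file could then be passed to the script through the `-c` option shown in the run command, leaving every other hyperparameter unchanged.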

Folders and files:

Unet
checkpoints
configs
lightning_logs
model
.gitignore
LICENSE
README.md
data_utils.py
fsdp_utilities.py
periodic_padding.py
reconLPT2Nbody_uNet.py
reconLPT2Nbody_uNet_FSDP_nolightning.py
reconLPT2Nbody_uNet_lightning.py
reconLPT2Nbody_uNet_lightning_FSDP.py
reconLPT2Nbody_uNet_lightning_FSDP_fullshard.py
reconLPT2Nbody_uNet_lightning_FSDP_noshard.py
reconLPT2Nbody_uNet_lightning_test.py
reconLPT2Nbody_uNet_reload.py
reconLPT2Nbody_uNet_synthesized.py

xiaofeng-d/multi-GPU
