# Modulus GraphCast PT implementation code review

According to [1], the model can be trained on a single GPU with `python modulus/examples/weather/graphcast/train_graphcast.py`. This script lives outside the installed module, so I clone the repo [2] here and add the example directory to the path.

- [1] https://docs.nvidia.com/deeplearning/modulus/modulus-core/examples/weather/graphcast/readme.html
- [2] https://github.com/NVIDIA/modulus.git

In [2]:
import sys
sys.path.append('./modulus/examples/weather/graphcast')

---
## Configs loading

First I load the configs with Hydra, mirroring the Hydra decorator on `main()`. I keep the default values until a change is required.

In [3]:
import hydra

with hydra.initialize(config_path="./modulus/examples/weather/graphcast/conf", version_base="1.3"):
    cfg = hydra.compose(config_name="config")
cfg, cfg.num_iters_step1 + cfg.num_iters_step2 + cfg.num_iters_step3

({'processor_layers': 16, 'hidden_dim': 512, 'mesh_level': 6, 'multimesh': True, 'processor_type': 'MessagePassing', 'khop_neighbors': 32, 'num_attention_heads': 4, 'norm_type': 'TELayerNorm', 'segments': 1, 'force_single_checkpoint': False, 'checkpoint_encoder': True, 'checkpoint_processor': False, 'checkpoint_decoder': False, 'force_single_checkpoint_finetune': False, 'checkpoint_encoder_finetune': True, 'checkpoint_processor_finetune': True, 'checkpoint_decoder_finetune': True, 'concat_trick': True, 'cugraphops_encoder': False, 'cugraphops_processor': False, 'cugraphops_decoder': False, 'recompute_activation': True, 'use_apex': True, 'dataset_path': '/data/era5_75var', 'static_dataset_path': 'static', 'dataset_metadata_path': '/data/era5_75var/metadata/data.json', 'time_diff_std_path': '/time_diff_std.npy', 'latlon_res': [721, 1440], 'num_samples_per_year_train': 1408, 'num_workers': 8, 'num_channels_climate': 73, 'num_channels_static': 5, 'num_channels_val': 3, 'num_val_steps': 8, 

---
## Distributed computation setup

Early in `main()` [1], the distributed manager is initialized.

- [1] https://vscode.dev/github/NVIDIA/modulus/blob/main/examples/weather/graphcast/train_graphcast.py#L328

In [4]:
import os
from modulus.distributed import DistributedManager

# Mock a single-process job through environment variables (ENV initialization method)
os.environ["MODULUS_DISTRIBUTED_INITIALIZATION_METHOD"] = "ENV"
os.environ["MASTER_PORT"] = "12355"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"

DistributedManager.initialize()
dist = DistributedManager()



---
## Dataloader initialization

Although `main()` defines its own dataloader at [1], it is not used until a second phase of training.

See [01_dataloader.ipynb](01_dataloader.ipynb)

---
## Area weights factor computation

Right after `datapipe`, `area` is initialized.

```python
class GraphCastTrainer(BaseTrainer):
    def __init__(self, cfg: DictConfig, dist, rank_zero_logger):
        ...
        self.area = normalized_grid_cell_area(self.lat_lon_grid[:, :, 0], unit="deg")
```

What is it, and what is it for?

See [02_area.ipynb](02_area.ipynb)
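
As a first intuition: on a regular latitude-longitude grid, cells shrink toward the poles, so per-cell quantities are commonly weighted by a factor proportional to the cosine of latitude. The sketch below assumes exactly that weighting, rescaled to unit mean. It is not the Modulus implementation of `normalized_grid_cell_area`, and the helper name `normalized_cos_lat_weights` is hypothetical. The 721 latitudes match `latlon_res[0]` from the config, assuming evenly spaced 0.25-degree rows from pole to pole.

```python
import math


def normalized_cos_lat_weights(lats_deg):
    """Return |cos(latitude)| weights rescaled so that their mean is 1.

    Hypothetical helper for illustration only, not the Modulus function.
    """
    raw = [abs(math.cos(math.radians(lat))) for lat in lats_deg]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Assumption: 721 evenly spaced latitude rows from +90 to -90 degrees.
lats = [90.0 - 0.25 * i for i in range(721)]
weights = normalized_cos_lat_weights(lats)

# The largest weight sits at the equator; the polar rows get almost none.
equator_weight = weights[360]
pole_weight = weights[0]
```

With such weights, an error near the equator counts more than the same error near a pole, which compensates for the oversampling of polar regions on this grid.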

---
## Loss function, optimizer and schedulers

After the `area` computation, from [1] onward, the loss and optimization objects are initialized. The Adam optimizer and the scheduler are well known, but the loss is a custom one named `GraphCastLossFunction`.

```python
class GraphCastTrainer(BaseTrainer):
    def __init__(self, cfg: DictConfig, dist, rank_zero_logger):
        ...
        if cfg.synthetic_dataset:
            ...
        else:
            self.criterion = GraphCastLossFunction(
                self.area,
                self.channels_list,
                cfg.dataset_metadata_path,
                cfg.time_diff_std_path,
            )
```

What does it do?

See [03_loss.ipynb](03_loss.ipynb)
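
To show why `area` is passed to the loss, here is a minimal sketch of an area-weighted mean-squared error, in which each grid cell's squared error is multiplied by its area weight before averaging. This is an assumption about the general idea only: the real `GraphCastLossFunction` also receives the channel list, the dataset metadata, and the time-difference statistics, all of which this sketch ignores. `area_weighted_mse` is a hypothetical helper, not part of Modulus.

```python
def area_weighted_mse(pred, target, area):
    """Mean over grid cells of area[i] * (pred[i] - target[i]) ** 2.

    Hypothetical helper for illustration; inputs are flat sequences
    of equal length, one entry per grid cell.
    """
    if not (len(pred) == len(target) == len(area)):
        raise ValueError("pred, target and area must have the same length")
    total = 0.0
    for p, t, a in zip(pred, target, area):
        total += a * (p - t) ** 2
    return total / len(area)


# Two cells: the first carries all the weight, the second none.
area = [2.0, 0.0]
loss_error_in_heavy_cell = area_weighted_mse([1.0, 0.0], [0.0, 0.0], area)
loss_error_in_light_cell = area_weighted_mse([0.0, 1.0], [0.0, 0.0], area)
```

The same unit error produces a loss of 1.0 in the heavily weighted cell and 0.0 in the zero-weight cell, whereas an unweighted MSE would give 0.5 in both cases.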

---
## Evolution of the training process

The training process has several phases. The code is too verbose to follow easily, so a simplified version is proposed in [04_training_process.ipynb](04_training_process.ipynb)
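
As a small illustration of how a global iteration counter can be mapped to a phase, the sketch below assumes that the three config values `num_iters_step1`, `num_iters_step2` and `num_iters_step3` delimit consecutive phases. The helper `phase_of` is hypothetical, and the step counts used here are placeholders, not the values from the config.

```python
from bisect import bisect_right
from itertools import accumulate


def phase_of(iteration, steps_per_phase):
    """Return the 1-based phase that a 0-based iteration falls into.

    Hypothetical helper: phases are assumed to run back to back, each
    lasting the number of iterations given in steps_per_phase.
    """
    boundaries = list(accumulate(steps_per_phase))  # cumulative phase ends
    if not 0 <= iteration < boundaries[-1]:
        raise ValueError("iteration outside the training schedule")
    return bisect_right(boundaries, iteration) + 1


# Placeholder step counts, standing in for the num_iters_step* config values.
steps = (10, 100, 5)
phases = [phase_of(i, steps) for i in (0, 9, 10, 109, 110, 114)]
```

The total schedule length is the sum of the three values, which is the quantity displayed when the config is loaded at the top of this notebook.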

---

## Model architecture

The model architecture analysis is contained in [05_model.ipynb](05_model.ipynb)

---

## Icosahedron creation and data structure

The icosahedron code is analysed in [06_icosahedron.ipynb](06_icosahedron.ipynb)