# 1. Solution Overview

<font color=red>Our solution does not use additional data, nor does it use traditional forward or reverse simulation inversion methods</font>. 

Instead, it uses an end-to-end deep learning approach.

## 1.1. Data Preprocessing / Feature Engineering

In general, our solution <font color=red>did not use additional datasets</font>, and 4 data preprocessing methods were used:

1) Since the sei data are all very small numbers, to match the scale of the output vel data, all <font color=red>sei data were multiplied by 1e4</font>;

2) Across all vel data in the training set, the minimum value is 1.5 and the maximum value is 4.5. To ensure the stability of the model output, <font color=red>the model output was processed through a sigmoid function, multiplied by 3.0, and then added to 1.5 to obtain the final velocity output</font>;

3) Sources are placed at [1, 75, 150, 225, 300]. After a flip, position 150 becomes 151, which conflicts with the meaning of the feature. So I insert one more channel, filled with zeros, to represent a source at 151. The feature meaning is then self-consistent before and after the flip.

4) The input shape was 5 * 10000 * 31, and the output shape was 1259 * 300. <font color=red>The significant difference between the input and output shapes made the model design inconvenient, so the input data was reshaped by discarding the last row to obtain 5 * 1000 * 310, which was then interpolated to 5 * 1260 * 308. The visualization of this process is shown below</font>:

![Data Preprocess](images/data_preprocess.PNG "Data Preprocess")
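
The scaling of the sei data and the sigmoid output mapping described above are simple enough to sketch in plain Python. The snippet below is an illustration, not the project's code: the names `scale_sei`, `sigmoid`, and `to_velocity` are made up here. The two shape checks only confirm the arithmetic quoted in the list: the reshape keeps the element count, and both interpolated dimensions happen to be multiples of the patch size 14 mentioned in Section 1.2.

```python
import math

VEL_MIN, VEL_MAX = 1.5, 4.5   # velocity range observed in the training set
SEI_SCALE = 1e4               # factor applied to the raw sei values


def scale_sei(value):
    """Bring a raw sei sample closer to the scale of the vel data."""
    return value * SEI_SCALE


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def to_velocity(raw_output):
    """sigmoid(raw) * 3.0 + 1.5, so the result always lies in [1.5, 4.5]."""
    return sigmoid(raw_output) * (VEL_MAX - VEL_MIN) + VEL_MIN


# Shape arithmetic from the text: the reshape preserves the element count,
# and 1260 x 308 splits evenly into 14 x 14 patches.
assert 10000 * 31 == 1000 * 310
assert 1260 % 14 == 0 and 308 % 14 == 0

print(to_velocity(0.0))  # 3.0, the midpoint of the range
```

Because the sigmoid saturates, even a very large or very small raw output stays inside the velocity range seen during training.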

## 1.2. Model description

We use a custom eva02 model (patch size 14). The first s blocks of the model are used for intra-channel feature interaction, then an average pool is used to fuse the five channels, and the remaining blocks are used for further feature interaction. Finally, an nn.Linear layer and an nn.PixelShuffle layer are used for upsampling. <font color=red>The overall model architecture is shown below</font>:

![Model Architecture](images/model_arch.png "Model Architecture")
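
The data flow described above can be sketched with a toy PyTorch module. This is only a sketch under stated assumptions, not the actual custom eva02 code: `nn.TransformerEncoderLayer` stands in for the eva02 blocks, the class name `FusedChannelViT` and all sizes are hypothetical, and details such as positional embeddings and the final resize to the 1259 * 300 output are left out.

```python
import torch
import torch.nn as nn


class FusedChannelViT(nn.Module):
    """Toy stand-in: per-channel blocks -> mean over channels -> joint blocks."""

    def __init__(self, patch=14, dim=64, depth=4, s=2, heads=4):
        super().__init__()
        self.patch = patch
        # One shared patch embedding, applied to each source channel separately.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

        def make_block():
            # Assumption: a generic encoder layer in place of an eva02 block.
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
            )

        self.pre_blocks = nn.ModuleList([make_block() for _ in range(s)])
        self.post_blocks = nn.ModuleList([make_block() for _ in range(depth - s)])
        # Each token predicts one patch x patch tile of the single-channel output.
        self.head = nn.Linear(dim, patch * patch)
        self.shuffle = nn.PixelShuffle(patch)

    def forward(self, x):
        b, c, h, w = x.shape
        gh, gw = h // self.patch, w // self.patch
        # Fold channels into the batch so the first s blocks see one channel at a time.
        t = self.embed(x.reshape(b * c, 1, h, w)).flatten(2).transpose(1, 2)
        for blk in self.pre_blocks:
            t = blk(t)
        # Average over the c source channels to fuse them.
        t = t.reshape(b, c, gh * gw, -1).mean(dim=1)
        for blk in self.post_blocks:
            t = blk(t)
        # (b, N, p*p) -> (b, p*p, gh, gw) -> PixelShuffle -> (b, 1, gh*p, gw*p)
        t = self.head(t).transpose(1, 2).reshape(b, -1, gh, gw)
        return self.shuffle(t)


model = FusedChannelViT().eval()
with torch.no_grad():
    out = model(torch.randn(2, 5, 28, 42))
print(out.shape)  # torch.Size([2, 1, 28, 42])
```

In this sketch the parameter `s` plays the role of the split point: raising it gives more blocks to per-channel processing and fewer to the fused representation.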


Based on the above architecture, I trained three different models: eva02_base_s_at_6 (base refers to the eva02_base model, and s_at_6 means that the s in the figure above is 6), eva02_base_s_at_8, and eva02_tiny_s_at_8. I trained 10 eva02_base_s_at_6 models, 6 eva02_base_s_at_8 models, and 6 eva02_tiny_s_at_8 models using different seeds. <font color=red>Finally, by ensembling these 22 models, the public lb score is 0.023391</font>

## 1.3. Loss description

1) The eva02_base model uses a weighted L1 Loss, <font color=red>where the weights decrease linearly from the surface to the depths (from 1 to 1/4), because I suspect the data at depth is noisier.</font>

2) The eva02_tiny model builds on the loss function of the eva02_base model and adds a torch.exp(loss.detach().clone()) weight in order to <font color=red>mine difficult examples, because I observed that the results predicted by eva02_tiny always have large errors at fault positions, as shown in the figure below:</font>

![EVA02_tiny_pred_err](images/eva02_tiny_pred_err.PNG "EVA02_tiny_pred_err")
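
The two loss variants can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: the weights are taken to run from 1 at the surface to 1/4 at the bottom, the depth axis is assumed to be the second-to-last tensor dimension, and the exponential weight is applied per pixel before averaging. The function names are hypothetical.

```python
import torch


def depth_weights(depth, top=1.0, bottom=0.25):
    """Weights falling linearly from `top` at the surface to `bottom` at depth."""
    return torch.linspace(top, bottom, depth)


def weighted_l1(pred, target, hard_mining=False):
    """Depth-weighted L1 loss; pred and target have shape (batch, depth, width)."""
    err = (pred - target).abs()
    # Assumption: dimension -2 is depth, so the weights broadcast over width.
    w = depth_weights(pred.shape[-2]).to(pred).view(1, -1, 1)
    loss = err * w
    if hard_mining:
        # The detached exp(loss) acts as a constant weight: it scales the
        # gradient of high-error pixels without being differentiated itself.
        loss = loss * torch.exp(loss.detach().clone())
    return loss.mean()


pred = torch.zeros(1, 4, 2, requires_grad=True)
target = torch.ones(1, 4, 2)
print(weighted_l1(pred, target).item())                    # 0.625
print(weighted_l1(pred, target, hard_mining=True).item())  # larger than 0.625
```

Detaching the weight matters: without it, the exponential term would contribute its own gradient and change the effective loss rather than just re-weighting it.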


## 1.4. Hardware and environment

- PyTorch  2.1.2

- Python  3.10 (ubuntu22.04)

- Cuda  11.8

- GPU  <font color=red>V100(32GB)</font> * 1

- CPU  6 vCPU Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz

- CPU MEM 25GB


With the current pipeline settings, the training process took about 110 hours (with 1 NVIDIA V100 GPU (32 GB) available).


# 2. Solution Reproduction Steps

<font color=red>If you only want to reproduce my inference results, you only need to read Sections 2.1 and 2.7.</font>

<font color=red>If you want to reproduce both my training and inference results, you need to read from Section 2.1 step by step.</font>


## 2.1. Environment Setup
Please run the following command to install all the required libraries and packages.

In [None]:
! bash install.sh

## 2.2. Download and unzip data

Download the original competition data and unzip it.

Run the shell script in the next cell to download the original competition training data and unzip it. You need to modify the following variable in the cell below:
- ```SRC_TRAIN_DATA_ROOT```, <font color=red>represents the path to save the downloaded training data.</font>

In [None]:
%%bash

SRC_TRAIN_DATA_ROOT="./data/train_data/"

# Download both parts of the training set into SRC_TRAIN_DATA_ROOT.
wget -P "$SRC_TRAIN_DATA_ROOT" https://xeek-public-287031953319-eb80.s3.us-east-1.amazonaws.com/speed-and-structure/speed-and-structure-train-data-extended-part1.zip
wget -P "$SRC_TRAIN_DATA_ROOT" https://xeek-public-287031953319-eb80.s3.us-east-1.amazonaws.com/speed-and-structure/speed-and-structure-train-data-extended-part2.zip

# Unzip every downloaded archive in place, then delete it.
for file in $(find $SRC_TRAIN_DATA_ROOT/*.zip -type f); do
    echo "$file is a file"
    unzip -q "$file" -d "$SRC_TRAIN_DATA_ROOT"
    rm "$file"
done

## 2.3. Kfold training data

At the beginning of the competition, only the first part of the training set (2999 cases) was provided. My kfold split was based on this first part, and the second part of the data (2000 cases), provided in the middle of the competition, is used directly as training data. <font color=red>Therefore, if you want to replicate my kfold process, you will need to download only the first part of the training set, run the following code to perform a 5-fold split on this part, and then add the second part of the training set directly to the f0 training data.</font>

1) ```SRC_TRAIN_DATA_ROOT```, represents the path of the original part1 training data. This folder contains all the part1 training data.
2) ```KFOLD_TXT_SAVE_ROOT```, represents the path to save the txt file of training and validation data in each fold, <font color=red>which will be used in the training section</font>

<font color=red>I still recommend using the txt file I generated for training directly</font>, as it is more convenient.

<font color=red>The generated fold 0 txt files are already provided in ```./train_txt/```, so you can skip the following code and directly use the txt files in the folder ```./train_txt/``` for training.</font>

In [None]:
import os
import numpy as np
from sklearn.model_selection import KFold

SRC_TRAIN_DATA_ROOT = r"./data/train/"
KFOLD_TXT_SAVE_ROOT = r"./data/generated_train_txt/"

NUM_FOLD = 5
RANDOM_SEED=123
os.makedirs(KFOLD_TXT_SAVE_ROOT, exist_ok=True)

all_train_case = np.asarray(os.listdir(SRC_TRAIN_DATA_ROOT))
kf = KFold(n_splits=NUM_FOLD, random_state=RANDOM_SEED, shuffle=True)
for i, (train_index, valid_index) in enumerate(kf.split(all_train_case)):
    train_case = all_train_case[train_index]
    valid_case = all_train_case[valid_index]
    np.savetxt(f"{KFOLD_TXT_SAVE_ROOT}/train_f{i}.txt", train_case, fmt="%s")
    np.savetxt(f"{KFOLD_TXT_SAVE_ROOT}/val_f{i}.txt", valid_case, fmt="%s")
    break # only need f0
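
The last step described above, adding the second part of the training set to the f0 training list, is not covered by the cell. A minimal sketch could look like the following. The function name `append_part2_cases` and the way it is called are hypothetical, and it assumes that every entry directly under the part 2 folder is one training case, in the same way the cell above treats the part 1 folder. Whether the part 2 cases must also live under the same `data_root` depends on the dataset class and is not addressed here.

```python
from pathlib import Path


def append_part2_cases(train_txt, part2_root):
    """Append every entry found in `part2_root` to the fold-0 training list."""
    train_txt = Path(train_txt)
    cases = [line for line in train_txt.read_text().splitlines() if line]
    seen = set(cases)
    for name in sorted(p.name for p in Path(part2_root).iterdir()):
        if name not in seen:  # avoid duplicates if the function is run twice
            cases.append(name)
            seen.add(name)
    train_txt.write_text("\n".join(cases) + "\n")
    return len(cases)


# Expected sizes if every case is used: a 5-fold split of 2999 cases puts
# 600 cases in the fold-0 validation list and 2399 in the training list.
print(2999 - 600 + 2000)  # 4399 training cases after adding part 2
```
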

## 2.4. Pipeline Training Configuration

The training configuration file is located in ```./src/configs/*.py```. <font color=red>In order to ensure the success of the training, the following variables in all the 3 py files may need to be modified.</font>
1) ```dataloader.train.dataset.data_root```, represents the root directory of the training data, which should contain training pairs. <font color=red>It should be changed to the ```SRC_TRAIN_DATA_ROOT``` where the training data was downloaded in Section 2.2.</font>
2) ```dataloader.train.dataset.txt_file```, represents the txt file used for training. To reproduce our results, you need to use the fold0 data divided in Section 2.3 for training. <font color=red>It should be changed to the path where the train_f0.txt file in section 2.3 is located.</font>
3) ```dataloader.val.dataset.data_root```, represents the root directory of the transposed training data and should be the same as ```dataloader.train.dataset.data_root```.
4) ```dataloader.val.dataset.txt_file```, represents the txt file used for validation. <font color=red>It should be changed to the path where the val_f0.txt file in section 2.3 is located.</font>
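
Since a full training run is long, it can be worth checking these paths before launching it. The helper below is a hypothetical pre-flight check, not part of the repository. It assumes each line of a txt file is the name of an entry directly under the data root, which is how the split in Section 2.3 writes them.

```python
from pathlib import Path


def missing_cases(data_root, txt_file):
    """Return the case names listed in `txt_file` that are absent from `data_root`."""
    root = Path(data_root)
    names = [line.strip() for line in Path(txt_file).read_text().splitlines()]
    return [name for name in names if name and not (root / name).exists()]


def check_split(data_root, train_txt, val_txt):
    """Raise early if a txt file points at cases that are not on disk."""
    for txt in (train_txt, val_txt):
        missing = missing_cases(data_root, txt)
        if missing:
            raise FileNotFoundError(
                f"{len(missing)} case(s) from {txt} not found under {data_root}, "
                f"e.g. {missing[0]}"
            )
```

A call such as `check_split("./data/train_data/", "./train_txt/train_f0.txt", "./train_txt/val_f0.txt")` would use the paths you set in the configuration files; the exact file names here are placeholders.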

## 2.6. Training step
<font color=red>You can skip this step if you only want to run inference with my pre-trained models, which are provided in </font>```./my_checkpoints/*/*.pth```.

Otherwise, run the following command to trigger the model training script. 

In the following command, the meaning of each variable is as follows:
1) ```./src/train_custom_eva02_base.py```, this is the main file of the training script to train custom_eva02_base model
2) ```./src/train_custom_eva02_tiny.py```, this is the main file of the training script to train custom_eva02_tiny model
3) ```src/configs/*.py```, represents the type of model being trained. There are three types: eva02_base_split_at_6, eva02_base_split_at_8, and eva02_tiny_split_at_8.
4) ```output-root```, represents the save location of the training results and model.
5) ```seed```, represents the random seed during training.


NOTE:

<font color=red>It took me nearly 110 hours to finish all 22 of the following training runs on 1 x V100 (32GB) locally. You can use this training time as a reference.</font>

<font color=red>If your GPU has only 24GB of memory, you may not be able to exactly reproduce my training results on a single GPU under my current training configuration. Therefore, if you need to fully reproduce my results, please use a GPU with at least 32GB of memory, or contact me for a multi-GPU training script.</font>

In [None]:
# train custom_eva02_base_split_at_6
! torchrun --nproc_per_node=1 ./src/train_custom_eva02_base.py ./src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed4 --seed 4
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed6 --seed 6
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed7 --seed 7
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed9 --seed 9
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed10 --seed 10
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed11 --seed 11
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed12 --seed 12
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed13 --seed 13
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed14 --seed 14
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_6.py --output-root ./experiments/eva02_base_s_at_6/seed15 --seed 15

# train custom_eva02_base_split_at_8
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_8.py --output-root ./experiments/eva02_base_s_at_8/seed2 --seed 2
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_8.py --output-root ./experiments/eva02_base_s_at_8/seed3 --seed 3
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_8.py --output-root ./experiments/eva02_base_s_at_8/seed4 --seed 4
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_8.py --output-root ./experiments/eva02_base_s_at_8/seed5 --seed 5
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_8.py --output-root ./experiments/eva02_base_s_at_8/seed7 --seed 7
! torchrun --nproc_per_node=1 src/train_custom_eva02_base.py src/configs/eva02_base_split_at_8.py --output-root ./experiments/eva02_base_s_at_8/seed10 --seed 10

# train custom_eva02_tiny_split_at_8
! torchrun --nproc_per_node=1 src/train_custom_eva02_tiny.py src/configs/eva02_tiny_split_at_8.py --output-root ./experiments/eva02_tiny_s_at_8/seed2 --seed 2
! torchrun --nproc_per_node=1 src/train_custom_eva02_tiny.py src/configs/eva02_tiny_split_at_8.py --output-root ./experiments/eva02_tiny_s_at_8/seed5 --seed 5
! torchrun --nproc_per_node=1 src/train_custom_eva02_tiny.py src/configs/eva02_tiny_split_at_8.py --output-root ./experiments/eva02_tiny_s_at_8/seed9 --seed 9
! torchrun --nproc_per_node=1 src/train_custom_eva02_tiny.py src/configs/eva02_tiny_split_at_8.py --output-root ./experiments/eva02_tiny_s_at_8/seed10 --seed 10
! torchrun --nproc_per_node=1 src/train_custom_eva02_tiny.py src/configs/eva02_tiny_split_at_8.py --output-root ./experiments/eva02_tiny_s_at_8/seed14 --seed 14
! torchrun --nproc_per_node=1 src/train_custom_eva02_tiny.py src/configs/eva02_tiny_split_at_8.py --output-root ./experiments/eva02_tiny_s_at_8/seed3487 --seed 3487
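
The 22 commands above follow one pattern, so they can also be generated from a table of seeds. The sketch below only builds the command strings, using the scripts, config names, output folders, and seeds listed in the cell. The names `RUNS` and `build_commands` are made up for this example.

```python
RUNS = {
    # config name: (training script, experiment folder, seeds)
    "eva02_base_split_at_6": (
        "train_custom_eva02_base.py",
        "eva02_base_s_at_6",
        [4, 6, 7, 9, 10, 11, 12, 13, 14, 15],
    ),
    "eva02_base_split_at_8": (
        "train_custom_eva02_base.py",
        "eva02_base_s_at_8",
        [2, 3, 4, 5, 7, 10],
    ),
    "eva02_tiny_split_at_8": (
        "train_custom_eva02_tiny.py",
        "eva02_tiny_s_at_8",
        [2, 5, 9, 10, 14, 3487],
    ),
}


def build_commands(runs=RUNS):
    """Return one torchrun command string per (config, seed) pair."""
    commands = []
    for config, (script, folder, seeds) in runs.items():
        for seed in seeds:
            commands.append(
                f"torchrun --nproc_per_node=1 src/{script} src/configs/{config}.py "
                f"--output-root ./experiments/{folder}/seed{seed} --seed {seed}"
            )
    return commands


commands = build_commands()
print(len(commands))  # 22
print(commands[0])
```

Keeping the seeds in one table makes it easier to see which runs belong to the ensemble and to resume after an interrupted run.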



## 2.7. Inference step
To run inference and generate predictions for the test dataset, please follow the instructions below.

Firstly, download the test data, decompress it, and organize the test data into the format described below.


![test_data_dir](images/test_data_dir.PNG)


Run the shell script in the next cell to download the original competition test data and unzip it. You need to modify the following variable in the below cell:
```SRC_TEST_DATA_ROOT```, <font color=red>represents the path to save the downloaded test data.</font>

In [None]:
%%bash

SRC_TEST_DATA_ROOT="./data/test_data/"

wget -P "$SRC_TEST_DATA_ROOT" https://xeek-public-287031953319-eb80.s3.us-east-1.amazonaws.com/speed-and-structure/speed-and-structure-test-data.zip

for file in $(find $SRC_TEST_DATA_ROOT/*.zip -type f); do
    echo "$file is a file"
    unzip -q $file -d $SRC_TEST_DATA_ROOT
    rm $file
done

Secondly, you need to load the trained model and perform inference.

In the following command, the meaning of each variable is as follows:

1) ```./src/submit.py```, this is the main file of the inference script and does not need to be modified.

2) In ```./src/submit.py```, <font color=red><u>the following variables in the "main func" may need to be modified.</u></font>
    - ```configs```, represents the configuration files of all 22 models used in the ensemble inference.
    - ```ckpt_paths```, represents the checkpoint files of all 22 models used in the ensemble inference. <font color=red>The order of the model files in the ckpt_paths list needs to correspond one-to-one with the model configuration files in configs</font>.
    - ```data_root```,  represents the path of the test data. It’s ```"./data/test_data/"``` by default.
    - ```submission_path```,  represents the path where the final generated submission file is located. By default, it’s ```"./final_submission.npz"```
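
One way to keep the two lists aligned is to write each model once as a (config, checkpoint) pair and derive both lists from it. The sketch below assumes nothing about `submit.py` beyond the two variable names above. The helper `split_pairs` is made up, and the checkpoint file names are placeholders, not the actual files in ```./my_checkpoints```.

```python
def split_pairs(models):
    """Turn [(config, ckpt), ...] into two aligned lists: configs, ckpt_paths."""
    configs, ckpt_paths = [], []
    for config, ckpt in models:
        configs.append(config)
        ckpt_paths.append(ckpt)
    return configs, ckpt_paths


# Placeholder entries: replace the checkpoint paths with the real file names.
MODELS = [
    ("./src/configs/eva02_base_split_at_6.py", "./my_checkpoints/PLACEHOLDER_A.pth"),
    ("./src/configs/eva02_base_split_at_8.py", "./my_checkpoints/PLACEHOLDER_B.pth"),
    ("./src/configs/eva02_tiny_split_at_8.py", "./my_checkpoints/PLACEHOLDER_C.pth"),
]

configs, ckpt_paths = split_pairs(MODELS)
assert len(configs) == len(ckpt_paths)  # entry i of each list describes model i
```
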

NOTE:
<font color=red>I provide three types of models in ```./my_checkpoints```: 10 custom_eva02_base_split_at_6, 6 custom_eva02_base_split_at_8, and 6 custom_eva02_tiny_split_at_8. The ensemble of all 22 models scored 0.023391 on the public lb.</font>

| models | local cv score | public lb score |
| ---- | ---- | ---- |
| custom_eva02_base_split_at_6 (10 seed ensemble) | 0.020559 | 0.023526 |
| custom_eva02_base_split_at_8 (6 seed ensemble) | 0.020876 | 0.023591 |
| custom_eva02_tiny_split_at_8 (6 seed ensemble) | 0.020851 | ? (did not test) |
| ensemble of all 22 models above | 0.02054 | 0.023391 |

In [None]:
! python ./src/submit.py

You now have the final submission file for scoring.

# 3. Others

### 3.1 That didn’t work

1) openFWI dataset pretrained
2) larger model
3) larger input size
4) Unet arch model
5) CLIP, DINO pretraining
6) larger or smaller patch_size

### 3.2 Didn’t get to try

1) traditional method
2) generate more data
3) end2end forward model aux loss

### 3.3 NOTE

<font color=red>It took me nearly 110 hours to finish all 22 training runs on 1 x V100 (32GB) locally.</font>

<font color=red>If your GPU has only 24GB of memory, you may not be able to exactly reproduce my training results on a single GPU under my current training configuration. Therefore, if you need to fully reproduce my training results, please use a GPU with at least 32GB of memory, or contact me for a multi-GPU training script. Thanks.</font>
