## Run WeNet on SageMaker - GigaSpeech

### Choose the notebook environment
Select `conda_pytorch_p38` as the notebook kernel.

### Download the WeNet code

In [None]:
%cd ~/SageMaker

!git clone https://github.com/Chen188/wenet -b sagemaker-giga

wenet_src = '~/SageMaker/wenet'

### Get basic environment information

In [None]:
import boto3
import sagemaker

print('SageMaker version: ', sagemaker.__version__)

from sagemaker import get_execution_role

region = boto3.session.Session().region_name
role   = get_execution_role()
sess   = sagemaker.Session()
bucket = sess.default_bucket()

account_id = boto3.client('sts').get_caller_identity().get('Account')

bucket

## Prepare the Docker image

In [None]:
ecr_repository = 'sagemaker-wenet'

# Log in to ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com.cn
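
The login command above hardcodes the China Region suffix `.amazonaws.com.cn`. As a minimal sketch, assuming you might also run this notebook outside the China Regions, the hypothetical helper `ecr_registry_host` (not part of the original notebook) picks the domain suffix from the region name:

```python
def ecr_registry_host(account_id: str, region: str) -> str:
    """Return the ECR registry host name for an account and region.

    Assumption: regions whose names start with "cn-" (for example
    cn-north-1) use the amazonaws.com.cn domain; all others use
    amazonaws.com.
    """
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{suffix}"


# Example with a placeholder account ID.
print(ecr_registry_host("123456789012", "cn-north-1"))
```

The returned host could then replace the hardcoded registry in both the `docker login` command and the image URI.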

### Create the container repository

In [None]:
!aws ecr create-repository --repository-name $ecr_repository

### Build the training image

In [None]:
training_docker_file_path = '~/SageMaker/wenet'

!cat $training_docker_file_path/Dockerfile.gigaspeech

In [None]:
# Build the training image and push it to ECR (China Region).
tag = ':training-pip-pt_1_10_0'
training_repository_uri = '{}.dkr.ecr.{}.amazonaws.com.cn/{}'.format(account_id, region, ecr_repository + tag)
print('training_repository_uri: ', training_repository_uri)

!cd $training_docker_file_path && docker build -t "$ecr_repository$tag" . -f Dockerfile.gigaspeech
!docker tag {ecr_repository + tag} $training_repository_uri
!docker push $training_repository_uri

# !docker pull $training_repository_uri

## Prepare the training data

### Mount shared storage

You can use either EFS or FSx for Lustre. This walkthrough uses EFS, mounted at `/efs`.
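
Before downloading data, it may help to confirm that the shared storage is really mounted. The following is a minimal sketch; `is_usable_mount` is a hypothetical helper, and `/efs` is the mount point used in this notebook:

```python
import os


def is_usable_mount(path: str) -> bool:
    """Return True if `path` is a mount point and writable by this process."""
    return os.path.ismount(path) and os.access(path, os.W_OK)


# "/efs" is the mount point used throughout this notebook.
if not is_usable_mount("/efs"):
    print("Warning: /efs is not mounted or not writable")
```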

### Variables defined in run.sh, and the directory layout after Stages 0-4

set: the GigaSpeech subset to use (XS|M|L|XL, etc.)

giga_data_dir: the downloaded GigaSpeech open data (Stage 0)
```
├── audio
│   ├── audiobook
│   ├── podcast
│   └── youtube
├── dir_structure
├── files.yaml
├── GigaSpeech.json
├── GigaSpeech.json.gz
├── GigaSpeech.json.gz.aes
└── TERMS_OF_ACCESS
```
data: directory where the preprocessed data is saved (Stages 0, 1, 2)
```
├── corpus
│   ├── dev_utt_list
│   ├── reco2dur
│   ├── segments
│   ├── test_utt_list
│   ├── text
│   ├── train_xs_utt_list
│   ├── utt2dur
│   ├── utt2subsets
│   └── wav.scp
├── dev
│   ├── data.list
│   ├── segments
│   ├── spk2utt
    ....
├── lang_char_XS
│   ├── input.txt
│   ├── train_xs_unigram5000.model
│   ├── train_xs_unigram5000_units.txt
│   └── train_xs_unigram5000.vocab
├── test
│   ├── data.list
    ....
└── train_xs
    ├── data.list
    ....
```

shards_dir: the sharded data (Stage 3)
```
├── dev
│   ├── shards_000000000.tar
    ...
├── test
│   ├── shards_000000000.tar
    ...
└── train_xs
    ├── shards_000000000.tar
    ...
```

dir: the experiment directory, containing ddp_init (for distributed training) and the model files
```
├── init.pt
├── init.yaml
├── init.zip
└── train.yaml
```
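
The layouts above can be turned into a quick check before training starts. This is a sketch rather than part of WeNet: `missing_outputs` is a hypothetical helper. It looks only for `data.list` and at least one `shards_*.tar` per split, and it assumes the training split directory is named `train_<subset>` in lower case, as in the `train_xs` listing.

```python
from pathlib import Path
from typing import List


def missing_outputs(processed_dir: str, shards_dir: str, subset: str = "XS") -> List[str]:
    """Return the expected Stage 0-3 outputs that are not present on disk.

    Assumption: the training split directory is train_<subset in lower
    case>, matching the train_xs example in the listings above.
    """
    splits = ["dev", "test", f"train_{subset.lower()}"]
    missing: List[str] = []
    for split in splits:
        data_list = Path(processed_dir) / split / "data.list"
        if not data_list.is_file():
            missing.append(str(data_list))
        # After Stage 3 each split should hold at least one shard archive.
        split_shards = Path(shards_dir) / split
        has_shard = split_shards.is_dir() and any(split_shards.glob("shards_*.tar"))
        if not has_shard:
            missing.append(str(split_shards / "shards_*.tar"))
    return missing
```

An empty list means the checked files exist; it does not validate their contents.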

### Download the data

In [None]:
# Install dependencies
!pip install speechcolab

In [None]:
data_root     = "/efs/wenet-data"
data_set      = "XS"

giga_data_dir = f"{data_root}/giga/raw-data"
processed_dir = f"{data_root}/giga/processed"
shards_dir    = f"{data_root}/giga/shards"
expr_dir      = f"{data_root}/expr/giga"

!mkdir -p "$giga_data_dir" "$processed_dir" "$shards_dir" "$expr_dir"

In [None]:
!echo '!!!!APPLY FOR DOWNLOAD CREDENTIALS BEFORE RUNNING THE CODE BELOW, CHECK GigaSpeech/README.md!!!'
!echo '!!!!APPLY FOR DOWNLOAD CREDENTIALS BEFORE RUNNING THE CODE BELOW, CHECK GigaSpeech/README.md!!!'
!echo '!!!!APPLY FOR DOWNLOAD CREDENTIALS BEFORE RUNNING THE CODE BELOW, CHECK GigaSpeech/README.md!!!'

# subset (default {XL}) specifies the subset to download
# Named giga_subset to avoid shadowing the Python built-in `set`.
giga_subset = '{' + data_set + '}'

%cd ~/SageMaker
!git clone https://github.com/SpeechColab/GigaSpeech

%cd GigaSpeech
!echo 'replace-with-your-password' > SAFEBOX/password

!bash utils/download_gigaspeech.sh --subset {giga_subset} --host magicdata {giga_data_dir}

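### Preprocess the data

The raw data is stored in `$giga_data_dir`, the preprocessed data is written to `$processed_dir`, and the sharded data is written to `$shards_dir`.

Because the XS subset is used by default, you may see the message `Warning: POD0000000501 something is wrong, maybeAssertionError, skipped`. It can be ignored.

Data processing takes a long time, so be patient.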

In [None]:
%cd ~/SageMaker/wenet

from sagemaker.pytorch.estimator import PyTorch

CUDA_VISIBLE_DEVICES    = '0'

instance_count   = 1
instance_type    = 'local'

sm_data_root     = '/opt/ml/input/data'

sm_giga_data_dir = f"{sm_data_root}/raw-data"
sm_processed_dir = f"{sm_data_root}/processed"
sm_trial_dir     = f"{sm_data_root}/trial"
sm_shards_dir    = f"{sm_data_root}/shards"

hp = {
    'stage': 0, 'stop_stage': 3, 'set': data_set,
    'data': sm_processed_dir,
    'dir':  sm_trial_dir,
    'giga_data_dir': sm_giga_data_dir,
    'shards_dir':    sm_shards_dir,
    'num_nodes':     instance_count,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES
}

estimator=PyTorch(
    entry_point     = 'examples/gigaspeech/s0/sm-run.sh',
    image_uri       = 'sagemaker-wenet:training-pip-pt_1_10_0',
    instance_type   = instance_type,
    instance_count  = instance_count,
    source_dir      = '.',
    role            = role,
    hyperparameters = hp,
    
    disable_profiler     = True,
    debugger_hook_config = False
)

estimator.fit({
    'raw-data':  f"file://{giga_data_dir}",
    'processed': f"file://{processed_dir}",
    'shards':    f"file://{shards_dir}",
    'trial':     f"file://{expr_dir}"
})
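
The `hp` dictionary above reaches the entry point as hyperparameters. As a sketch under the assumption that the image uses the SageMaker training toolkit convention of passing each hyperparameter as a `--name value` command-line pair (how `sm-run.sh` consumes them is not shown in this notebook), the hypothetical helper below illustrates the flattening:

```python
from typing import Dict, List


def hyperparameters_to_args(hp: Dict[str, object]) -> List[str]:
    """Flatten a hyperparameter dict into ['--key', 'value', ...] form.

    Assumption: this mirrors the '--name value' convention; it is an
    illustration, not the toolkit's own implementation.
    """
    args: List[str] = []
    for key, value in hp.items():
        args.extend([f"--{key}", str(value)])
    return args


print(hyperparameters_to_args({"stage": 0, "stop_stage": 3, "set": "XS"}))
```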

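## Model training - local mode

During model development you typically adjust the code repeatedly, and rebuilding a Docker image after every change is tedious. You can therefore debug the code first with SageMaker local mode. Local mode starts the container directly on the notebook instance, runs the training logic there, and automatically maps the data into the container.

`CUDA_VISIBLE_DEVICES` must match the GPUs on the instance that runs the code. For example, if the instance has two GPUs, set it to `'0,1'`.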
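
As a small sketch, the value can be derived from the GPU count instead of being typed by hand. `cuda_visible_devices` is a hypothetical helper, not part of the notebook:

```python
def cuda_visible_devices(num_gpus: int) -> str:
    """Return a comma-separated list of GPU indices for `num_gpus` GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ",".join(str(i) for i in range(num_gpus))


print(cuda_visible_devices(2))  # prints 0,1
```

On the notebook instance, `torch.cuda.device_count()` is one way to obtain the count to pass in.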

**Open a terminal and run the following commands as the root user**

```bash
docker_daemon_file=/etc/docker/daemon.json
echo "Original content: $(cat "$docker_daemon_file")"

echo '{ 
   "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-shm-size": "14G"
}' > "$docker_daemon_file"

cat $docker_daemon_file

service docker restart
```
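
The shell snippet above overwrites `daemon.json`, discarding any settings already in it. If you would rather keep them, the following sketch merges the same two settings into the existing file. `merge_docker_daemon_config` is a hypothetical helper and a variant of the step above, not the notebook's method.

```python
import json
from pathlib import Path


def merge_docker_daemon_config(path: str, shm_size: str = "14G") -> dict:
    """Add the nvidia runtime and default-shm-size to a Docker daemon.json.

    Unlike the shell snippet, keys already present in the file are kept.
    """
    config_file = Path(path)
    config = {}
    if config_file.is_file():
        text = config_file.read_text()
        if text.strip():
            config = json.loads(text)
    # Register the nvidia runtime next to any runtimes already configured.
    runtimes = config.setdefault("runtimes", {})
    runtimes["nvidia"] = {"path": "nvidia-container-runtime", "runtimeArgs": []}
    config["default-shm-size"] = shm_size
    config_file.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

Run it as root against `/etc/docker/daemon.json`, then restart Docker with `service docker restart` as in the shell snippet.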

In [None]:
%cd ~/SageMaker/wenet

from sagemaker.pytorch.estimator import PyTorch

CUDA_VISIBLE_DEVICES    = '0'

instance_count   = 1
instance_type    = 'local_gpu'

sm_data_root     = '/opt/ml/input/data'

sm_giga_data_dir = f"{sm_data_root}/raw-data"
sm_processed_dir = f"{sm_data_root}/processed"
sm_trial_dir     = f"{sm_data_root}/trial"
sm_shards_dir    = f"{sm_data_root}/shards"

hp = {
    'stage': 4, 'stop_stage': 4, 'set': data_set,
    'data': sm_processed_dir,
    'dir':  sm_trial_dir,
    'giga_data_dir': sm_giga_data_dir,
    'shards_dir':    sm_shards_dir,
    'num_nodes':     instance_count,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES
}

estimator=PyTorch(
    entry_point     = 'examples/gigaspeech/s0/sm-run.sh',
    image_uri       = 'sagemaker-wenet:training-pip-pt_1_10_0',
    instance_type   = instance_type,
    instance_count  = instance_count,
    source_dir      = '.',
    role            = role,
    hyperparameters = hp,
    
    disable_profiler     = True,
    debugger_hook_config = False
)

estimator.fit({
    'raw-data':  f"file://{giga_data_dir}",
    'processed': f"file://{processed_dir}",
    'shards':    f"file://{shards_dir}",
    'trial':     f"file://{expr_dir}"
})

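## Model training - SageMaker managed instances

Once the code is confirmed to work, switching to a real training job on managed instances only requires changing a few parameters.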
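
One of those changes is that the local `file://` paths under `/efs` become `directory_path` values relative to the root of the EFS file system. The sketch below shows that mapping, assuming the file system root is mounted at `/efs`; `efs_directory_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def efs_directory_path(local_path: str, mount_point: str = "/efs") -> str:
    """Map a path under the local EFS mount to a path relative to the EFS root.

    Assumption: the root of the file system is mounted at `mount_point`.
    Raises ValueError if `local_path` is not under `mount_point`.
    """
    relative = PurePosixPath(local_path).relative_to(mount_point)
    return str(PurePosixPath("/") / relative)


print(efs_directory_path("/efs/wenet-data/giga/raw-data"))
```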

In [None]:
%cd ~/SageMaker/wenet

from sagemaker.inputs import FileSystemInput
from sagemaker.pytorch.estimator import PyTorch

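# Replace the IDs below with your own EFS file system, security group, and subnet.
file_system_id = 'fs-0fafc0e57d05616b4'
file_system_access_mode = 'rw'
file_system_type = 'EFS'
security_group_ids = ['sg-066097e10b3fcc47e']
subnets = ['subnet-0d0b4cf92fba1cb2b']

# Define the data inputs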
file_system_input_raw  = FileSystemInput(file_system_id=file_system_id,
                                  file_system_type=file_system_type,
                                  directory_path='/wenet-data/giga/raw-data',
                                  file_system_access_mode=file_system_access_mode)

file_system_input_proc = FileSystemInput(file_system_id=file_system_id,
                                  file_system_type=file_system_type,
                                  directory_path='/wenet-data/giga/processed',
                                  file_system_access_mode=file_system_access_mode)

file_system_input_shards = FileSystemInput(file_system_id=file_system_id,
                                  file_system_type=file_system_type,
                                  directory_path='/wenet-data/giga/shards',
                                  file_system_access_mode=file_system_access_mode)

file_system_input_expr = FileSystemInput(file_system_id=file_system_id,
                                  file_system_type=file_system_type,
                                  directory_path='/wenet-data/expr/giga',
                                  file_system_access_mode=file_system_access_mode)


# training_repository_uri was set in the image build cell above.
CUDA_VISIBLE_DEVICES    = '0'

instance_count   = 1
instance_type    = 'ml.g4dn.xlarge'

sm_data_root     = '/opt/ml/input/data'

sm_giga_data_dir = f"{sm_data_root}/raw-data"
sm_processed_dir = f"{sm_data_root}/processed"
sm_trial_dir     = f"{sm_data_root}/trial"
sm_shards_dir    = f"{sm_data_root}/shards"

hp = {
    'stage': 4, 'stop_stage': 4, 'set': data_set,
    'data': sm_processed_dir,
    'dir':  sm_trial_dir,
    'giga_data_dir': sm_giga_data_dir,
    'shards_dir':    sm_shards_dir,
    'num_nodes':     instance_count,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES
}

estimator=PyTorch(
    entry_point     = 'examples/gigaspeech/s0/sm-run.sh',
    image_uri       = training_repository_uri,
    instance_type   = instance_type,
    instance_count  = instance_count,
    source_dir      = '.',
    role            = role,
    hyperparameters = hp,
    
    subnets         = subnets,
    security_group_ids   = security_group_ids,
    
    disable_profiler     = True,
    debugger_hook_config = False
)

estimator.fit({
    'raw-data':  file_system_input_raw,
    'processed': file_system_input_proc,
    'shards':    file_system_input_shards,
    'trial':     file_system_input_expr
})
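
The hyperparameter dictionaries in the three estimator cells differ only in the stage range, the instance count, and the GPU list. As a sketch, a hypothetical helper `make_hyperparameters` could build them from one place; the keys and container paths are copied from the cells above.

```python
from typing import Dict

# Container-side root under which SageMaker mounts each input channel.
SM_DATA_ROOT = "/opt/ml/input/data"


def make_hyperparameters(stage: int, stop_stage: int, data_set: str,
                         instance_count: int = 1,
                         cuda_visible_devices: str = "0") -> Dict[str, object]:
    """Build the hyperparameter dict used by the estimator cells."""
    return {
        "stage": stage,
        "stop_stage": stop_stage,
        "set": data_set,
        "data": f"{SM_DATA_ROOT}/processed",
        "dir": f"{SM_DATA_ROOT}/trial",
        "giga_data_dir": f"{SM_DATA_ROOT}/raw-data",
        "shards_dir": f"{SM_DATA_ROOT}/shards",
        "num_nodes": instance_count,
        "CUDA_VISIBLE_DEVICES": cuda_visible_devices,
    }


# Stages 0-3 for preprocessing, Stage 4 for training, as in the cells above.
preprocess_hp = make_hyperparameters(0, 3, "XS")
train_hp = make_hyperparameters(4, 4, "XS")
```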