# Data Preprocessing

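This project uses AWS SageMaker to package a deep learning algorithm for speech recognition. It extracts Filter Bank features from the audio, converts the audio signal into an image (40*1000), uses a CNN for feature extraction, and finally classifies the speech. Data augmentation is applied as well: varying the background noise and adjusting the speaking rate and volume generates additional training samples.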
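As a rough illustration of the Filter Bank step described above, the following is a minimal NumPy sketch, not the project's actual `preprocess.py`. The 16 kHz sample rate, the 25 ms window, the 10 ms hop, the 512-point FFT, and the function name `log_fbank` are all assumptions chosen for the example. With those settings, roughly 10 seconds of audio yields about 1000 frames, which matches the 40*1000 image shape.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_fbank(signal, sample_rate=16000, n_filters=40, n_frames=1000,
              frame_len=0.025, frame_step=0.010, n_fft=512):
    """Return an (n_filters, n_frames) log filter bank 'image' for a 1-D signal.

    All default values are assumptions made for this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(frame_len * sample_rate))
    hop = int(round(frame_step * sample_rate))
    if len(signal) < win:  # make sure at least one full frame exists
        signal = np.pad(signal, (0, win - len(signal)))

    # Slice the signal into overlapping, Hamming-windowed frames.
    count = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(count)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)

    feats = np.log(power @ fbank.T + 1e-10).T  # shape: (n_filters, count)

    # Truncate or zero-pad along the time axis to a fixed width.
    image = np.zeros((n_filters, n_frames))
    width = min(n_frames, feats.shape[1])
    image[:, :width] = feats[:, :width]
    return image
```

The fixed-size output is what lets a CNN treat every clip as an image of the same shape; shorter clips are zero-padded here, which is only one possible padding choice.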

In [None]:
import boto3
import sagemaker
import os
from sagemaker import get_execution_role

region = boto3.session.Session().region_name

role = get_execution_role()

Preprocess the data with SageMaker Processing

We will demonstrate SageMaker Processing in BYOC (bring your own container) mode, so we first need to package our container.

We use an AWS Deep Learning Container as the base image. The list of available images is at https://aws.amazon.com/cn/releasenotes/available-deep-learning-containers-images/

Remember to change the base image to match the region you are using.
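To make the region dependency concrete, here is a small sketch that assembles the base image URI for a region. It only knows the two registry accounts that appear in the Dockerfile below (`763104351884` for `us-west-2` and `727897471807` for `cn-north-1`); the helper name `base_image_uri` is hypothetical, and for any other region the account ID should be looked up in the image list linked above rather than guessed.

```python
# Registry accounts taken from the Dockerfile in this notebook.
# Other regions are deliberately left out: check the official image list.
DLC_ACCOUNTS = {
    "us-west-2": "763104351884",
    "cn-north-1": "727897471807",
}

def base_image_uri(region, repo_tag="tensorflow-training:1.15.2-cpu-py36-ubuntu18.04"):
    """Build the Deep Learning Container URI for a known region (sketch)."""
    if region not in DLC_ACCOUNTS:
        raise KeyError(f"no registry account recorded for region {region!r}")
    # China regions use the amazonaws.com.cn domain suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{DLC_ACCOUNTS[region]}.dkr.ecr.{region}.{suffix}/{repo_tag}"
```

The returned string is what would go on the `FROM` line of the Dockerfile.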

Write the Dockerfile (optional: a later step provides a prebuilt container image that you can use directly, so you do not have to build the image from scratch)

In [None]:
%%writefile docker/Dockerfile
# https://aws.amazon.com/cn/releasenotes/available-deep-learning-containers-images/
# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-cpu-py36-ubuntu18.04
FROM 727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/tensorflow-training:1.15.2-cpu-py36-ubuntu18.04

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         libsm6 \
         libxrender1 \
         libglib2.0-dev \
         libxext6 \
         libsndfile1 \
         libsndfile-dev \
         libgmp-dev \
         libsox-dev \
    && rm -rf /var/lib/apt/lists/*

# RUN mkdir /opt/ml/code
WORKDIR /opt/ml/code
COPY source ./

RUN pip install --upgrade pip
RUN pip install -r requirements.txt -i https://mirrors.163.com/pypi/simple/

WORKDIR /opt/
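# The "git" build is re-published regularly and its top-level directory name
# embeds the build date, so unpack into a fixed directory instead of a dated path.
RUN wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz \
    && xz -d ffmpeg-git-amd64-static.tar.xz \
    && mkdir -p /opt/ffmpeg-static \
    && tar xvf ffmpeg-git-amd64-static.tar -C /opt/ffmpeg-static --strip-components=1
WORKDIR /opt/ffmpeg-static
RUN cp ffmpeg ffprobe qt-faststart /usr/bin/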
    
WORKDIR /opt/
RUN wget http://downloads.xiph.org/releases/ogg/libogg-1.3.4.tar.gz \
    && tar -zvxf libogg-1.3.4.tar.gz 
WORKDIR /opt/libogg-1.3.4 
RUN ./configure && make && make install
    
WORKDIR /opt/
RUN wget http://downloads.xiph.org/releases/vorbis/libvorbis-1.3.6.tar.gz \
    && tar -zvxf libvorbis-1.3.6.tar.gz
WORKDIR /opt/libvorbis-1.3.6 
RUN ./configure && make && make install
    
WORKDIR /opt/
RUN wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.3.tar.xz \
    && xz -d flac-1.3.3.tar.xz \
    && tar xvf flac-1.3.3.tar
WORKDIR /opt/flac-1.3.3 
RUN ./configure && make && make install \
    && ln -s /usr/local/bin/flac /usr/bin/flac
    
WORKDIR /opt/    
RUN wget https://jaist.dl.sourceforge.net/project/sox/sox/14.4.2/sox-14.4.2.tar.gz \
    && tar -zvxf sox-14.4.2.tar.gz
WORKDIR /opt/sox-14.4.2 
RUN ./configure \
    && make && make install \
    && ln -s /usr/local/bin/sox /usr/bin/sox \
    && ln -s /usr/local/bin/soxi /usr/bin/soxi

# WORKDIR /opt/    
# RUN wget http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz \
#     && tar -zxvf libsndfile-1.0.28.tar.gz
# WORKDIR /opt/libsndfile-1.0.28 
# RUN export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH \
#     && ./configure \
#     && make && make install

# ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]


Create your own Docker image repository

In [None]:
# Run this cell only once to create the repository in ECR
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'spoken-language-identification-sagemaker-processing-container'
tag = ':latest'
uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)
print(processing_repository_uri)
ecr = '{}.dkr.ecr.{}.{}'.format(account_id, region, uri_suffix)

!aws ecr create-repository --repository-name $ecr_repository
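Because `aws ecr create-repository` fails when the repository already exists, the cell above should only be run once. As an alternative, the sketch below shows one way to make the step safe to re-run using the boto3 ECR client. The helper name `ensure_repository` is hypothetical; the client is passed in so the function itself makes no assumptions about credentials or region.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository if it is missing; return its URI either way.

    `ecr_client` is expected to be a boto3 ECR client, for example:
        import boto3
        ecr_client = boto3.client("ecr")
    """
    try:
        resp = ecr_client.create_repository(repositoryName=name)
        return resp["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository is already there: look up its URI instead of failing.
        resp = ecr_client.describe_repositories(repositoryNames=[name])
        return resp["repositories"][0]["repositoryUri"]
```

Calling `ensure_repository(boto3.client("ecr"), ecr_repository)` would then return the repository URI without the `:latest` tag.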

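# Obtaining the Docker image

Choose one of the following two options:

1. Use the prebuilt image
2. Build it yourself (this takes longer from within mainland China)

## Use the prebuilt image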


In [None]:
# Do not modify the command below
!aws ecr get-login-password --region cn-north-1 | docker login --username AWS --password-stdin 346044390830.dkr.ecr.cn-north-1.amazonaws.com.cn

Pull the Docker image to the local machine, then push it to your own ECR repository

In [None]:
exist_image = '346044390830.dkr.ecr.cn-north-1.amazonaws.com.cn/spoken-language-identification-sagemaker-processing-container:latest'

In [None]:
!docker pull $exist_image

In [None]:
!docker tag $exist_image $processing_repository_uri
!docker push $processing_repository_uri

In [None]:
processing_repository_uri

## Build it yourself (slow from within mainland China, mainly because downloading some of the packages is slow)

In [None]:
# If pulling the base image fails with "no basic auth credentials", run the command below first
!aws ecr get-login-password --region cn-north-1 | docker login --username AWS --password-stdin 727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn

In [None]:
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $ecr

# Build the Docker image and push it to the ECR repository
!docker build -t $ecr_repository docker
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

## Set the input data location and the output location

In [None]:
# Change this to your own S3 bucket name and upload the raw data under input_data, for example s3://BUCKET/PATH/input/
# When the job finishes, output_data holds the data for the next step, i.e. output_data is the input location for training
bucket = 'YOUR_BUCKET_NAME'
input_data = 's3://{}/spoken/processing/'.format(bucket)
output_data = 's3://{}/spoken/processing-folds/'.format(bucket)

Upload the raw data under input_data, split into one directory each for train, test, and noises:

noises/

test/

train/
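The layout above can be checked locally before uploading. The sketch below assumes the raw data sits in a local directory with the same three sub-directories; the function name `plan_uploads` is hypothetical, and the default key prefix is the one used in `input_data`. It only builds the list of (local file, S3 key) pairs and does not contact AWS.

```python
from pathlib import Path

REQUIRED_DIRS = ("train", "test", "noises")

def plan_uploads(local_root, key_prefix="spoken/processing/"):
    """Return sorted (local_path, s3_key) pairs, or raise if a directory is missing."""
    root = Path(local_root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing sub-directories: {missing}")
    plan = []
    for sub in REQUIRED_DIRS:
        for path in sorted((root / sub).rglob("*")):
            if path.is_file():
                # Keep the directory structure below the prefix.
                key = key_prefix + path.relative_to(root).as_posix()
                plan.append((str(path), key))
    return plan

# Possible usage with boto3 (not executed here):
#   import boto3
#   s3 = boto3.client("s3")
#   for local_path, key in plan_uploads("data"):
#       s3.upload_file(local_path, bucket, key)
```

Failing early on a missing directory is cheaper than discovering it after a processing job has started.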

## Start the processing job

For the API documentation, see https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html


[Processor](https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.Processor)

[ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ScriptProcessor)

[ProcessingInput](https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingInput)

[ProcessingOutput](https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingOutput)

[ProcessingJob](https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingJob)


In [None]:
from sagemaker.processing import ScriptProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.network import NetworkConfig

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.c5.xlarge',
                volume_size_in_gb=50,
                base_job_name='spoken-language-identification')

In [None]:
# To pass parameters to the script, use the `arguments` parameter of run() and parse them in the script with argparse
script_processor.run(code='preprocess.py',
                      inputs=[ProcessingInput(
                        source=input_data,
                        destination='/opt/ml/processing/input_data',
                        s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(destination=output_data,
                                                source='/opt/ml/processing/output_data',
                                                s3_upload_mode = 'EndOfJob')])

script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)
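The comment about passing arguments can be illustrated with a sketch of the argument parsing that a script such as `preprocess.py` might contain. This is not the project's actual script: the option names `--input-dir`, `--output-dir`, `--n-filters`, and `--n-frames` are hypothetical, and the two directory defaults simply mirror the container paths used in the `run()` call above.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line options for a processing script (illustrative only)."""
    parser = argparse.ArgumentParser(description="Preprocessing entry point (sketch)")
    # Defaults mirror the ProcessingInput/ProcessingOutput paths in the notebook.
    parser.add_argument("--input-dir", default="/opt/ml/processing/input_data")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output_data")
    # Hypothetical feature options, matching the 40*1000 image shape.
    parser.add_argument("--n-filters", type=int, default=40)
    parser.add_argument("--n-frames", type=int, default=1000)
    return parser.parse_args(argv)

# On the notebook side the values would be passed as a list of strings, e.g.:
#   script_processor.run(code='preprocess.py', inputs=[...], outputs=[...],
#                        arguments=['--n-filters', '40', '--n-frames', '1000'])
```

Inside the script, `args = parse_args()` then exposes the values as `args.n_filters`, `args.input_dir`, and so on.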