
Sign-Language-project

Contributors


김영민 (Youngmin Kim)

곽민지 (Minji Kwak)

이다인 (Dain Lee)

김영은 (Yeongeun Kim)

Abstract

Sign Language Translation (SLT) has received relatively little attention compared to Sign Language Recognition (SLR). SLR, however, recognizes the unique grammar of sign language, which differs from spoken language and is therefore difficult for non-disabled people to interpret. We instead tackle the problem of translating sign language video directly into spoken language. To this end, we propose a new keypoint normalization method that performs translation based on the signer's skeleton points and normalizes these points robustly; normalization customized to each body part contributes to the performance improvement. In addition, we propose a stochastic frame selection method that provides frame augmentation and frame sampling at the same time. Finally, the sequence is translated into spoken language by an attention-based translation model. Because our method does not require glosses, it can be applied to a variety of datasets, and quantitative experimental evaluation demonstrates its effectiveness.
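The exact normalization procedure is described in the paper. As a rough illustration only, the sketch below shows one way body-part-wise keypoint normalization could look, assuming each part (body, left hand, right hand) is centered on its own mean and scaled by its own spread. The part slices, shapes, and function name are hypothetical and are not the authors' implementation.

```python
import numpy as np

# Hypothetical index ranges for illustration only; the real split of the
# keypoints into body / left-hand / right-hand parts depends on the
# AlphaPose (Halpe-136) layout used by the authors.
PARTS = {
    "body":       slice(0, 26),
    "left_hand":  slice(26, 47),
    "right_hand": slice(47, 68),
}

def normalize_keypoints(kps: np.ndarray) -> np.ndarray:
    """Center and scale each body part independently.

    kps: array of shape (num_points, 2) holding (x, y) for one frame.
    Returns an array of the same shape where each part is shifted to its
    own center and divided by its own standard deviation.
    """
    out = kps.astype(np.float32).copy()
    for part in PARTS.values():
        pts = out[part]
        center = pts.mean(axis=0, keepdims=True)
        scale = pts.std() + 1e-6          # avoid division by zero
        out[part] = (pts - center) / scale
    return out

# Toy usage with random points standing in for one frame of keypoints.
frame = np.random.rand(68, 2).astype(np.float32)
print(normalize_keypoints(frame).shape)   # (68, 2)
```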

Survey

Environment

  • OS : Ubuntu 18.04.5 LTS (Docker) or Colab
  • CUDA : 10.0
  • GPU : Tesla V100-32GB

Data

  • Sample video download: $ sh download_sh/sample_data_dowonload.sh

DataSet Download

Environment Setting

$ pip install -r requirements.txt
$ python -m pip install cython
$ sudo apt-get install libyaml-dev
  • Setting (AlphaPose)
$ git clone https://github.com/winston1214/Sign-Language-project.git && cd Sign-Language-project
$ python setup.py build develop

If you do not run in the Colab environment, or your CUDA version is 10.0, refer to this link.

  • Download pretrained files (please download)

Running the following command downloads all weight files at once:

$ sh downlaod_sh/weight_download.sh

PreProcessing

1. Split frame

$ python frame_split.py # You have to add the main code.
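frame_split.py is expected to turn each video into individual image frames; since the comment above notes that the main code still has to be added, here is a minimal OpenCV sketch of what this step could look like. The paths and the frame-naming scheme are assumptions, not the repository's actual script.

```python
import os
import cv2

def split_frames(video_path: str, out_dir: str) -> int:
    """Write every frame of `video_path` to `out_dir` as numbered JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

if __name__ == "__main__":
    n = split_frames("sample.mp4", "frames/sample")  # hypothetical paths
    print(f"saved {n} frames")
```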

2. Extract KeyPoint(Alphapose)

python scripts/demo_inference.py --cfg configs/halpe_136/resnet/256x192_res50_lr1e-3_2x-regression.yaml --checkpoint pretrained_models/halpe136_fast_res50_256x192.pt --indir ${img_folder_path} --outdir ${save_dir_path} --form boaz --vis_fast --sp

If you use multiple GPUs, you do not need the --sp option.

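The command above writes AlphaPose results into the output directory; a standard AlphaPose run produces a JSON file listing per-frame keypoints as (x, y, score) triples, although the --form boaz option suggests a customized output format here. As a hedged illustration, the snippet below shows how such a result file could be read and flattened into per-frame feature vectors; the file name and the choice to keep only x/y coordinates are assumptions.

```python
import json
import numpy as np

def load_alphapose_keypoints(json_path: str) -> np.ndarray:
    """Read an AlphaPose-style result file into shape (num_frames, num_kps * 2)."""
    with open(json_path) as f:
        results = json.load(f)
    frames = []
    for det in results:                      # one entry per detected person/frame
        kps = np.array(det["keypoints"], dtype=np.float32).reshape(-1, 3)
        frames.append(kps[:, :2].flatten())  # keep x, y; drop the confidence score
    return np.stack(frames)

if __name__ == "__main__":
    feats = load_alphapose_keypoints("alphapose-results.json")  # assumed file name
    print(feats.shape)
```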

Train

$ python train.py --X_path ${X_train.pickle path} --save_path ${model save directory} \
--pt_name ${save pt model name} --model ${LSTM or GRU} --batch ${BATCH SIZE}

## Example

$ python train.py --X_path /sign_data/ --save_path pt_file/ \
--pt_name model1.pt --model GRU --batch 128 --epochs 100 --dropout 0.5
  • X_train.pickle : For convenience, the keypoint values extracted in the previous step are stored and loaded as a pickle file (see the sketch below).
    • Shape : [video_len, max_frame_len, keypoint_len], e.g. [7129, 376, 246]
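As a minimal sketch of how such a pickle could be loaded and padded to a fixed frame length, assuming zero-padding up to max_frame_len (the helper name, path, and padding scheme are illustrative, not the repository's actual loader):

```python
import pickle
import numpy as np

def load_padded_keypoints(path: str, max_frames: int = 376) -> np.ndarray:
    """Load pickled keypoint sequences and zero-pad each video to max_frames."""
    with open(path, "rb") as f:
        data = pickle.load(f)                # list of (num_frames, keypoint_len) arrays
    padded = []
    for seq in data:
        seq = np.asarray(seq, dtype=np.float32)
        pad = max_frames - len(seq)
        if pad > 0:
            seq = np.concatenate([seq, np.zeros((pad, seq.shape[1]), np.float32)])
        padded.append(seq[:max_frames])
    return np.stack(padded)                  # e.g. (7129, 376, 246)

if __name__ == "__main__":
    X_train = load_padded_keypoints("/sign_data/X_train.pickle")  # hypothetical path
    print(X_train.shape)
```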

Inference

$ python inference.py --video ${VIDEO_NAME} --outdir ${SAVE_PATH} --pt ${WEIGHT_PATH} --model ${MODEL NAME}
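For example (the video name, output directory, and weight file below are placeholders; substitute your own):

$ python inference.py --video sample.mp4 --outdir results/ --pt pt_file/model1.pt --model GRU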

You can also try a simple demo of the video in Colab (see the Open In Colab badge).

Result

| Model | Hyperparameters | BLEU | Accuracy | Final Model |
|---|---|---|---|---|
| GRU-Attention | Adam, CrossEntropy | 93.4 | 93.5 | |
| GRU-Attention | AdamW, Scheduler | 95.1 | 95.0 | ✓ |
| LSTM | Adam, CrossEntropy | 49.6 | 50.0 | |
| LSTM | AdamW, Scheduler | 51.5 | 51.5 | |

We selected the configuration that applies the (HAND + BODY keypoints) + (all-frame random augmentation) + (frame normalization) technique as the final model.
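The frame-level random augmentation mentioned above pairs with the stochastic frame selection described in the abstract. As a rough, hypothetical sketch (not the authors' implementation), random index selection can both subsample long videos and augment short ones to a fixed length:

```python
import numpy as np

def random_frame_selection(seq: np.ndarray, target_len: int, rng=None) -> np.ndarray:
    """Sample `target_len` frames from `seq` while keeping temporal order.

    Videos longer than target_len are randomly subsampled; shorter videos
    are augmented by repeating randomly chosen frames (sampling with
    replacement).
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = np.sort(rng.choice(len(seq), size=target_len, replace=len(seq) < target_len))
    return seq[idx]

# Toy usage: a 210-frame video of 246-dimensional keypoint vectors.
video = np.random.rand(210, 246).astype(np.float32)
clip = random_frame_selection(video, target_len=376)
print(clip.shape)                            # (376, 246)
```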

More experimental results are shown here.

Demo Video

YouTube link

final_video.mp4

Citation

@misc{https://doi.org/10.48550/arxiv.2204.10511,
  doi = {10.48550/ARXIV.2204.10511},
  url = {https://arxiv.org/abs/2204.10511},
  author = {Kim, Youngmin and Kwak, Minji and Lee, Dain and Kim, Yeongeun and Baek, Hyeongboo},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {Keypoint based Sign Language Translation without Glosses},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}