Emotions form an integral part of human interactions. The Intelligence Augmentation for AI Hackathon 2021 paves the way toward more empathetic AI systems by challenging participants to build systems that recognize emotions from audio. The best entry into the competition from our team, Prompt Engineers, is a system that leverages not only the audio features but also the semantics of the spoken words, fusing the two intertwined modalities to achieve a runner-up position on the leaderboard with 61.38% accuracy. We further improve the latency of the approach by more than 42% via feature reuse, weight sharing and multi-task learning, at the cost of only a 0.2% drop in accuracy.
Our best-performing model is a phono-linguistic model that leverages both the semantics of the spoken words and the speech features. We obtain speech features from HuBERT, a transformer model pretrained on speech, and language features from BERT, a language model run over the transcribed speech. The features from the two modalities are fused to achieve 61.38% accuracy. BERT features over the transcribed speech alone achieve 55.77% accuracy, whereas classifying only on the speech features from HuBERT yields 58.98% accuracy. Together, the two modalities achieve the best performance.
We improve latency by multi-task learning: the HuBERT backbone is shared between audio feature extraction and speech transcription (ASR). This leads to 42% fewer model parameters with only a 0.2% performance drop.
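The sketch below illustrates the late-fusion idea described above: pooled HuBERT audio features are concatenated with BERT features over the transcript and passed to a small classification head. It is a minimal sketch assuming the Hugging Face `transformers` API; the pooling choice and the classifier head sizes are illustrative assumptions, not the exact competition architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, HubertModel

class PhonoLinguisticClassifier(nn.Module):
    """Late fusion of audio (HuBERT) and text (BERT) features for emotion
    classification. Illustrative sketch only; head sizes are assumptions."""

    def __init__(self, n_emotions,
                 audio_name="facebook/hubert-large-ll60k",
                 text_name="bert-base-uncased"):
        super().__init__()
        self.hubert = HubertModel.from_pretrained(audio_name)
        self.bert = BertModel.from_pretrained(text_name)
        fused_dim = self.hubert.config.hidden_size + self.bert.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_emotions),
        )

    def forward(self, input_values, input_ids, attention_mask):
        # Mean-pool HuBERT hidden states over time to get one audio vector.
        audio_feat = self.hubert(input_values).last_hidden_state.mean(dim=1)
        # Use BERT's pooled [CLS] representation for the transcript.
        text_feat = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        return self.head(torch.cat([audio_feat, text_feat], dim=-1))
```

In the faster multi-task variant described above, the same shared HuBERT backbone would additionally feed an ASR head, so the transcript comes from the shared model rather than from a separate ASR pass.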
We exported our conda environments for training the models and running the app.
- `train_env.yml`: our environment for training. Create it using `conda env create --name prompt --file=train_env.yml` & `conda activate prompt`.
- `app_env.yml`: our environment for the app. Create it using `conda env create --name prompt_app --file=app_env.yml` & `conda activate prompt_app`.
Please note that our `app_env` was run on a macOS 11.2 machine with an Intel processor, whereas our training (`train_env`) was done on a Linux machine with NVIDIA GPUs. The same conda environments may not work on other machines; instead, you may download the packages individually.
You may also download the following dependencies individually as an alternative to recreating the conda environments:
- PyTorch
- Huggingface's Transformers
- Huggingface's Datasets
- torchaudio
- soundfile
- sounddevice
- scikit-learn
- scipy
- numpy
Additional dependencies for running the webapp: streamlit, plotly.
- Create a fresh conda environment
- `pip install streamlit`
- `pip install soundfile`
- `pip install sounddevice`
- `pip install pydub`
- Install PyTorch 1.8 and torchaudio
- `pip install transformers==4.10`

If you are training the AST model, then also download the following dependencies: matplotlib, numba, timm, zipp, wget, llvmlite.
- Download the required dependencies or replicate & activate our conda environment, as detailed above.
- Our webapp is in the `app` folder: `cd app`.
- Download the pre-trained model weights and save them in the `webapp` folder with the name `cpu_model.pt`.
- Run the webapp: `streamlit run app.py` (an illustrative sketch of such an app follows this list).
- Note that the first run may take some time, as models, tokenizers, feature extractors and configs are downloaded.
- The webapp will be hosted locally (the port and address are printed on the command line).
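For orientation, here is a minimal sketch of what a Streamlit app of this kind might look like. It is not the repository's actual `app.py`: the upload widget, the assumption that `cpu_model.pt` can be loaded directly with `torch.load`, and the mapping from raw audio to emotion logits are all illustrative assumptions.

```python
import soundfile as sf
import streamlit as st
import torch

st.title("Speech Emotion Recognition")

# Assumption: cpu_model.pt (downloaded as described above) deserializes into a
# callable model that maps raw audio to emotion logits.
model = torch.load("cpu_model.pt", map_location="cpu")
model.eval()

uploaded = st.file_uploader("Upload a .wav file", type=["wav"])
if uploaded is not None:
    waveform, sample_rate = sf.read(uploaded)
    if waveform.ndim > 1:                       # average channels to mono if needed
        waveform = waveform.mean(axis=1)
    audio = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        logits = model(audio)
    st.write("Predicted emotion class index:", int(torch.argmax(logits, dim=-1)))
```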
- Download the required dependencies or replicate our conda environment, as detailed above.
- Download the set of audio files in `TrainAudioFiles` & `TestAudioFiles` and place them inside the `dataset` folder.
- Get pseudo ASR ground-truth labels by running `python3 run_asr.py` from inside the `linguistic` folder (see the illustrative transcription sketch below).
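The ASR step produces the transcripts that later feed the linguistic and phono-linguistic models. The snippet below is a minimal sketch of such pseudo-labelling with Hugging Face `transformers`; the checkpoint `facebook/wav2vec2-base-960h` and the example file path are assumptions, not necessarily what `run_asr.py` uses.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed ASR checkpoint; run_asr.py may use a different model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def transcribe(path):
    waveform, sr = torchaudio.load(path)
    if sr != 16_000:                             # the model expects 16 kHz audio
        waveform = torchaudio.transforms.Resample(sr, 16_000)(waveform)
    inputs = processor(waveform[0].numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

print(transcribe("../dataset/TrainAudioFiles/example.wav"))   # hypothetical file name
```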
- If you want to train the linguistic (text-only) model, run `python3 train.py` from inside the `linguistic` folder with the following optional command-line arguments (an illustrative fine-tuning sketch follows this step):
  - Batch size: `--batch_size=16`
  - Learning rate: `--lr=1e-5`
  - Number of epochs: `--n_epochs=10`
  - Do a dummy run for debugging purposes: `--dummy_run`
  - Device to train the model on: `--device=cuda`
  - Random seed: `--seed=1`
  - Test the model: `--test_model`
  - BERT model name or path: `--bert_type=bert-base-uncased`
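The text-only model fine-tunes BERT on the (pseudo-)transcripts. Below is a minimal sketch of that idea using `BertForSequenceClassification`; the number of emotion classes, the optimizer settings and the data handling are placeholders, not the repository's `train.py`.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

NUM_EMOTIONS = 4          # placeholder: set to the number of emotion classes in the dataset
BERT_TYPE = "bert-base-uncased"

tokenizer = BertTokenizerFast.from_pretrained(BERT_TYPE)
model = BertForSequenceClassification.from_pretrained(BERT_TYPE, num_labels=NUM_EMOTIONS)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_epoch(texts, labels, batch_size=16, device="cuda"):
    """One epoch over (transcript, emotion-label) pairs; data handling is a placeholder."""
    model.to(device).train()
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                          return_tensors="pt").to(device)
        targets = torch.tensor(labels[i:i + batch_size], device=device)
        loss = model(**batch, labels=targets).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```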
- For training AST: refer to the instructions in `ast/README.md`; the dataset needs to be downloaded and kept inside `ast/egs/hack/data/`.
- If you want to train a model only on audio features, run `bash run.sh` from inside the `phono` folder. You may edit `run.sh` to change the following arguments (a pooling sketch follows this step):
  - Pooling method to extract model features: `--pooling_mode` (options: `mean`, `max`, `sum`)
  - Name or path of the audio-only model's pretrained file: `--model_name_or_path`
  - Type of model to use: `--model_mode` (example arguments: `hubert` or `wav2vec2`)
  - Training batch size per device: `--per_device_train_batch_size` (type: integer)
  - Eval batch size per device: `--per_device_eval_batch_size` (type: integer)
  - Learning rate: `--learning_rate`
  - Number of epochs: `--num_train_epochs`
  - Gradient accumulation steps: `--gradient_accumulation_steps` (set to 1 for no accumulation)
  - Save, eval and logging steps: `--save_steps`, `--eval_steps`, `--logging_steps`
  - Maximum number of models to save: `--save_total_limit`
  - Required arguments (do not change): `--freeze_feature_extractor`, `--input_column=filename`, `--target_column=emotion`, `--output_dir="output_dir"`, `--delimiter="comma"`, `--evaluation_strategy="steps"`, `--fp16`, `--train_file="./../dataset/train_set.csv"`, `--validation_file="./../dataset/valid_set.csv"`, `--test_file="./../dataset/test_set.csv"`
  - Whether to do train, eval and predict on test: `--do_train`, `--do_eval`, `--do_predict`
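The `--pooling_mode` flag controls how the sequence of HuBERT (or wav2vec 2.0) hidden states is collapsed into a single utterance-level vector before classification. Below is a minimal sketch of the three options, assuming the Hugging Face `HubertModel` API; the checkpoint name is taken from the arguments elsewhere in this README, everything else is illustrative.

```python
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-large-ll60k")
hubert = HubertModel.from_pretrained("facebook/hubert-large-ll60k").eval()

def pooled_features(waveform_16khz, pooling_mode="mean"):
    """Collapse HuBERT's (time, dim) hidden states into one utterance vector.
    waveform_16khz: 1-D numpy array of 16 kHz audio samples."""
    inputs = extractor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(inputs.input_values).last_hidden_state   # (1, time, dim)
    if pooling_mode == "mean":
        return hidden.mean(dim=1)
    if pooling_mode == "max":
        return hidden.max(dim=1).values
    if pooling_mode == "sum":
        return hidden.sum(dim=1)
    raise ValueError(f"unknown pooling mode: {pooling_mode}")
```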
- If you want to train the phono-linguistic transformer model:
  - Extract phonetic features for the `train` and `test` sets from `phono_feat_extractor`:
    - After obtaining ASR predictions from `linguistic`, put `test.json` and `train.json` in the `phono_feat_extractor` folder.
    - Merge the above two files using `python3 merge_text.py` inside the `phono_feat_extractor` folder.
    - Then run `bash run.sh` from inside the same folder. You may change the same arguments as mentioned above for the audio-only model. Additional argument for the BERT model: `--bert_name='bert-base-uncased'`.
  - After the above step, you will obtain `train.pkl` and `test.pkl` inside `phono_feat_extractor`. Put these files in the `phono_linguistic/data` folder.
  - Run `python3 bertloader.py` from the `phono_linguistic` folder to cache the dataloader for training (a data-pipeline sketch follows this step).
  - Then train the model using `python3 train.py`.
  - You may include the following arguments for `python3 bertloader.py` and `python3 train.py`. Make sure the same arguments are passed to both commands:
    - Seed: `--seed=1` (type: int)
    - Batch size: `--batch_size=16` (type: int)
    - Learning rate: `--lr=1e-5` (type: float)
    - Number of epochs: `--n_epochs=5` (type: int)
    - Dummy run (for debugging purposes): `--dummy_run`
    - Device to train the model on: `--device`
    - Whether to log on wandb: `--wandb`
    - BERT model name or path: `--bert_type=bert-base-uncased`
    - Audio model name or path: `--model_name_or_path='facebook/hubert-large-ll60k'`
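For orientation, the sketch below shows how the cached `train.pkl`/`test.pkl` features and the ASR transcripts could be wrapped into a PyTorch dataset of the kind `bertloader.py` caches. The pickle schema (field names such as `audio_feat`, `transcript`, `label`) and the tokenizer settings are assumptions for illustration, not the repository's actual format.

```python
import pickle
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast

class PhonoLinguisticDataset(Dataset):
    """Pairs cached audio features with BERT-tokenized transcripts.
    Field names in the pickle are assumed, not the repo's actual schema."""

    def __init__(self, pkl_path, bert_type="bert-base-uncased", max_length=64):
        with open(pkl_path, "rb") as f:
            self.samples = pickle.load(f)            # assumed: list of dicts
        self.tokenizer = BertTokenizerFast.from_pretrained(bert_type)
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        enc = self.tokenizer(s["transcript"], padding="max_length", truncation=True,
                             max_length=self.max_length, return_tensors="pt")
        return {
            "audio_feat": torch.as_tensor(s["audio_feat"], dtype=torch.float32),
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "label": torch.tensor(s["label"], dtype=torch.long),
        }

# Example: batching for training (path is hypothetical).
loader = DataLoader(PhonoLinguisticDataset("data/train.pkl"), batch_size=16, shuffle=True)
```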
- You may contact us by opening an issue on this repo. Please allow 2-3 days for us to address the issue.
- License: MIT