
MULTIMODAL PROTOTYPICAL SOUND RECOGNITION

Official release of the INTERSPEECH-23 paper: A multimodal prototypical approach for unsupervised sound classification


Before you start:


  Clone the repo

git clone --recurse-submodules git@github.com:sakshamsingh1/audio_text_proto.git

Note: If you get an error while cloning the submodule (i.e. AudioCLIP), clone it manually inside scripts/ref_repo by running:

cd scripts/ref_repo
git clone git@github.com:AndreyGuzhov/AudioCLIP.git

  Environment setup

# create the conda environment
conda create --name multi_proto python=3.8
conda activate multi_proto

# install required packages
chmod a+x setup.sh
./setup.sh

Demo code

Before running the demo:

  1. Download the pretrained models (i.e. follow step 2 in the "Reproducing the results" section below).
  2. Download the ESC-50 dataset:
cd data/input

# Download the ESC-50 dataset
git clone git@github.com:karolpiczak/ESC-50.git

Use our pre-computed prototype embeddings to find the ESC-50 class label for your input audio:

python demo.py --model_type=<proto-ac/proto-lc> --audio_path=<path_to_your_audio_file>
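For example, assuming a WAV clip saved at data/input/dog_bark.wav (a hypothetical path), you can classify it with the LAION-CLAP-based prototypes:

python demo.py --model_type=proto-lc --audio_path=data/input/dog_bark.wav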

Our results in the paper: [results table figure]

Note: We observed a small discrepancy when reproducing the LAION-CLAP results, possibly due to a recent update to the library.


Reproducing the results

  1. Download the datasets and put them in the data/input directory.
  2. Download the pretrained models for AudioCLIP and LAION-CLAP.
  3. Use the pre-extracted audio embeddings (we also provide a script to extract them yourself: extract_embd.py).
  4. For the proto-ac and proto-lc models, run the prototypical.py script with the desired model and dataset.
  5. For the audioclip and laion-clap baselines, run the baseline.py script.

1. Download the datasets

cd data/input

# Download the ESC-50 dataset
git clone git@github.com:karolpiczak/ESC-50.git

# Download the US8K dataset
python download_us8k.py

# Download the FSD50K dataset
python download_fsd50k.py

2. Download the pretrained models

# For AudioCLIP (run from the repo root)
# the files go in scripts/ref_repo/AudioCLIP/assets
cd scripts/ref_repo/AudioCLIP/assets
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/AudioCLIP-Full-Training.pt
# replace the stale vocab archive shipped with the submodule with the release version
rm bpe_simple_vocab_16e6.txt.gz
wget https://github.com/AndreyGuzhov/AudioCLIP/releases/download/v0.1/bpe_simple_vocab_16e6.txt.gz
cd -

# For LAION-CLAP (run from the repo root)
# the checkpoint goes in data/input
cd data/input
wget https://huggingface.co/lukewys/laion_clap/resolve/main/630k-audioset-fusion-best.pt
cd -

3. Extracting embeddings

Download the pre-extracted embeddings from Google Drive and put them inside data/processed,

OR

extract them yourself. We also provide a script for this, though it is slow (and has yet to be optimized):

python extract_embd.py --model_type <audioclip/clap> --dataset_name <esc50/us8k/fsd50k>
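For example, to extract LAION-CLAP embeddings for ESC-50:

python extract_embd.py --model_type clap --dataset_name esc50

Repeat for each model/dataset pair you need.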

4. Our prototypical approach (the Proto-AC and Proto-LC rows in the results table)

python prototypical.py --model_type <proto-lc/proto-ac> --data <esc50/us8k/fsd50k> --train_type <zs/sv>
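For example, the following should reproduce the Proto-LC row on ESC-50 (we assume zs and sv select the zero-shot and supervised settings, respectively):

python prototypical.py --model_type proto-lc --data esc50 --train_type zs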

5. Baseline results (the AudioCLIP and LAION-CLAP rows in the results table)

python baseline.py --model_type <audioclip/clap> --data <esc50/us8k/fsd50k> --train_type <zs/sv>
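Similarly, for the zero-shot LAION-CLAP baseline on ESC-50:

python baseline.py --model_type clap --data esc50 --train_type zs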
