Audio Classification using Bag-of-Frames approach

Requirements

Python 2.7.10

Python modules

1.  Librosa 0.4.3
2.  numpy 1.11.3
3.  sklearn 0.18

Execution

Divide the audio into smaller clips of 5-20 seconds each. Each audio clip is converted into one feature vector containing the Bag of Frames for that clip.
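A minimal sketch of the splitting step, assuming WAV input, a hypothetical file name, and an illustrative 10-second clip length; librosa.output.write_wav exists in the pinned Librosa 0.4.x but was removed in later releases:

    import librosa

    # Load the full recording; sr=None keeps the native sampling rate.
    y, sr = librosa.load("recording.wav", sr=None)

    clip_len = 10 * sr  # 10-second clips; anything in the 5-20 s range works
    for i in range(0, len(y) - clip_len + 1, clip_len):
        # Write each fixed-length slice out as its own clip.
        librosa.output.write_wav("clip_%03d.wav" % (i // clip_len), y[i:i + clip_len], sr)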

  1. Place the training and test data in a folder, numbering the categories. For example, if the source folder is "Data":

Data -> 1 -> test -> "test audio files of category 1"
Data -> 1 -> train -> "train audio files of category 1"
Data -> 2 -> test -> "test audio files of category 2"
Data -> 2 -> train -> "train audio files of category 2"

  2. python train.py window_length no_of_clusters

    window_length : window length used to divide each audio clip into segments
    no_of_clusters : number of cluster centroids for K-Means clustering

  3. python test.py classifier

    classifier : one of svm, nb, dt, knn, adaboost, rf
    Edit test.py to change the parameters of the classifiers.
    All results are stored in the "Temp" folder.

Project Description

Human speech can be broken down into elementary phonemes and modeled using algorithms such as Hidden Markov Models (HMMs). Stationary patterns like rhythm and melody can be used to classify music. In contrast, non-speech sounds are random and unstructured; they lack the high-level structure observed in speech and music, which makes them difficult to model with HMMs. In this project, the Bag-of-Frames approach is used to classify audio: a codebook of vectors is generated by K-Means clustering on the training data, and a Bag of Frames for each audio clip is obtained using the codebook. These Bags of Frames are used as input to the classifiers.

The steps involved in the Bag-of-Frames approach for Environmental Sound Classification are described below:

A. Feature Extraction

1. For feature extraction, each audio clip is divided into several segments by choosing a particular window length.
2. Features are then extracted for each audio segment.
3. The Python libraries Librosa and Scikits are used to extract audio features such as MFCC, delta MFCC, and Linear Predictive Coding (LPC) coefficients, along with frequency-domain features (Mel Spectrogram, Spectral Centroid, Spectral Bandwidth, Spectral Roll-Off) and time-domain features (Root Mean Square Energy (RMSE) and Zero Crossing Rate), as sketched after this list.
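A minimal sketch of the extraction step for one clip, using the Librosa 0.4.x API (librosa.feature.rmse was renamed rms in later versions); the Mel Spectrogram bands and the Scikits LPC coefficients are omitted for brevity:

    import numpy as np
    import librosa

    y, sr = librosa.load("clip.wav", sr=None)

    # Frame-level features; each call returns an array of shape (n_features, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta_mfcc = librosa.feature.delta(mfcc)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    rmse = librosa.feature.rmse(y=y)  # renamed librosa.feature.rms in 0.7+
    zcr = librosa.feature.zero_crossing_rate(y)

    # Stack into one matrix with one row per frame.
    clip_frames = np.vstack([mfcc, delta_mfcc, centroid, bandwidth, rolloff, rmse, zcr]).T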

B. K-Means Clustering and Codebook generation

1. Once the features are extracted, the data is split into training and test sets.
2. Feature scaling and normalization of the training data is done across each feature.
3. The normalized training data is fed into the K-Means clustering algorithm, with the number of clusters usually set much higher than the total number of classes, and the cluster centroids are obtained for the normalized training set.
4. These cluster centroids form the codebook, as sketched below.
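A minimal sketch of steps 2-4 with scikit-learn, assuming all_train_frames is the (n_frames, n_features) matrix obtained by stacking the frame-level features of every training clip; K = 100 is illustrative:

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Scale/normalize each feature column of the training frames.
    scaler = StandardScaler().fit(all_train_frames)
    train_scaled = scaler.transform(all_train_frames)

    # Cluster with far more centroids than there are classes.
    K = 100
    kmeans = KMeans(n_clusters=K, random_state=0).fit(train_scaled)
    codebook = kmeans.cluster_centers_  # shape (K, n_features)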

C. Bag of Frames

1. In the next step, the feature samples from each audio clip are vector-quantized against the generated codebook, and the Bag of Frames (a histogram of codeword assignments over the clip's frames) is obtained from the K-Means output, as sketched below.
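A minimal sketch of the quantization step, reusing the scaler and kmeans objects fitted above; each clip is reduced to one K-dimensional histogram:

    import numpy as np

    def bag_of_frames(clip_frames, scaler, kmeans, K):
        # Assign each frame to its nearest codeword, then count assignments.
        labels = kmeans.predict(scaler.transform(clip_frames))
        return np.bincount(labels, minlength=K).astype(float)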

D. Classification

1. The Bag of Frames is first normalized across each audio clip and then normalized across each feature. The resulting vectors are labelled accordingly and used to train a supervised classifier such as SVM, KNN, or Random Forest, as sketched below.
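A minimal sketch of the training step, assuming X_train stacks the Bag-of-Frames histograms of the training clips row-wise and y_train holds their category labels; the Random Forest settings are illustrative:

    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier

    # Normalize each clip's histogram to sum to 1, then scale across features.
    X_train = X_train / X_train.sum(axis=1, keepdims=True)
    bof_scaler = StandardScaler().fit(X_train)
    X_train = bof_scaler.transform(X_train)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)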

The test phase follows similar steps: features extracted from the test audio clips are normalized and vector-quantized using the codebook, and a Bag of Frames is obtained for each clip. The normalized Bags of Frames are then given as input to the trained classifier to obtain the final output.
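A minimal sketch of the test phase, chaining the objects fitted in the previous sketches; test_clip_frames (a list of per-clip frame matrices) is a hypothetical name:

    import numpy as np

    # Bag of Frames for every test clip, normalized the same way as in training.
    X_test = np.array([bag_of_frames(f, scaler, kmeans, K) for f in test_clip_frames])
    X_test = X_test / X_test.sum(axis=1, keepdims=True)
    X_test = bof_scaler.transform(X_test)

    predictions = clf.predict(X_test)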

Results

To evaluate the approach, the ESC-10 dataset available at https://github.com/karoldvl/ESC-10 was used.

ESC-50 is an annotated collection of 2,000 short clips comprising 50 classes of common sound events. Each class consists of 40 clips, each 5 seconds long and converted into a unified format (44.1 kHz, single channel, Ogg Vorbis compression at 192 kbit/s). The labeled data is arranged into 5 uniformly sized cross-validation folds.

ESC-10 is a selection of 10 classes from the larger ESC-50 dataset.

The Random Forest (RF) classifier outperformed the other classifiers on the ESC-10 dataset, with an average accuracy of 84.5%; the accuracy of each fold of the 5-fold cross-validation is shown below.

(Figure: per-fold accuracies of the Random Forest classifier in 5-fold cross-validation on ESC-10.)
