## Exercise 10-2: Music description with Essentia

In this exercise, you will extend the sound clustering task you did in E9 to a larger set of instrument classes and explore possible improvements to it. By doing this exercise, you will get hands on experience with Essentia and better insights into complexities arising in a real world Music Information Retrieval problem.

In E9, you explored the tasks of clustering with sound excerpts of three instruments, three classes. As we increse the number of sounds and of classes, the average performance degrades. In such situations, clustering performance can be improved by better selecting the descriptors or by improving the actual computation of the descriptors.

You need to install Essentia to compute some of the descriptors that you will be exploring for the task. You can find the download and install instructions for Essentia here: http://essentia.upf.edu/. Essentia has extensive documentation that will be useful in this assignment http://essentia.upf.edu/documentation/index.html.

If you do not want, or can, install Essentia, you can use a docker image to run a jupyter notebook server with Essentia included in it. You need to first install docker, https://www.docker.com/, and then run, in the Terminal, `docker-compose up` from the root directory of sms-tools which will use the file `docker-compose.yml` to call the image appropiately.

In [4]:
#if want to run this notebook in google colab you should uncomment the following commands
!pip install sms-tools
!git clone https://github.com/MTG/sms-tools-materials.git
!pip install numpy==1.23.5
!pip install git+https://github.com/mtg/freesound-python.git

fatal: destination path 'sms-tools-materials' already exists and is not an empty directory.
Collecting git+https://github.com/mtg/freesound-python.git
  Cloning https://github.com/mtg/freesound-python.git to /tmp/pip-req-build-cqcrof3m
  Running command git clone --filter=blob:none --quiet https://github.com/mtg/freesound-python.git /tmp/pip-req-build-cqcrof3m
  Resolved https://github.com/mtg/freesound-python.git to commit 5be99a3689d17303c01cb122bbb0d5a96eba04f6
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


## Part 1: Download sounds

Choose at least 10 different instrumental sounds classes from the following possible classes: violin, guitar, bassoon, trumpet, clarinet, cello, naobo, snare drum, flute, mridangam, daluo, xiaoluo. For each instrument class, use `download_sounds_freesound()` to download the audio and descriptors of 20 examples of representative single notes/strokes of each instrument. Since you will use the sounds also to extract descriptors using Essentia, we will use high quality mp3. We use the call `fs.FSRequest.retrieve(sound.previews.preview_hq_mp3, fsClnt, mp3Path)` within `download_sounds_freesound()`.

Explain your choices, query texts, and the tags you used.

In [5]:
import os, sys
import numpy as np
import json
import freesound as fs
from scipy.cluster.vq import vq, kmeans, whiten

descriptors = [ 'lowlevel.spectral_centroid.mean',
                'lowlevel.spectral_contrast.mean',
                'lowlevel.dissonance.mean',
                'lowlevel.hfc.mean',
                'lowlevel.mfcc.mean',
                'sfx.logattacktime.mean',
                'sfx.inharmonicity.mean']

# Mapping of descriptors
descriptorMapping = { 0: 'lowlevel.spectral_centroid.mean',
                      1: 'lowlevel.dissonance.mean',
                      2: 'lowlevel.hfc.mean',
                      3: 'sfx.logattacktime.mean',
                      4: 'sfx.inharmonicity.mean',
                      5: 'lowlevel.spectral_contrast.mean.0',
                      6: 'lowlevel.spectral_contrast.mean.1',
                      7: 'lowlevel.spectral_contrast.mean.2',
                      8: 'lowlevel.spectral_contrast.mean.3',
                      9: 'lowlevel.spectral_contrast.mean.4',
                      10: 'lowlevel.spectral_contrast.mean.5',
                      11: 'lowlevel.mfcc.mean.0',
                      12: 'lowlevel.mfcc.mean.1',
                      13: 'lowlevel.mfcc.mean.2',
                      14: 'lowlevel.mfcc.mean.3',
                      15: 'lowlevel.mfcc.mean.4',
                      16: 'lowlevel.mfcc.mean.5'
                    }

In [12]:
def download_sounds_freesound(queryText = "", tag=None, duration=None, API_Key = "", outputDir = "", topNResults = 5, featureExt = '.json'):
  """
  This function downloads sounds and their descriptors from freesound using the queryText and the
  tag specified in the input. Additionally, you can also specify the duration range to filter sounds
  based on duration.

  Inputs:
        (Input parameters marked with a * are optional)
        queryText (string): query text for the sounds (eg. "violin", "trumpet", "cello", "bassoon" etc.)
        tag* (string): tag to be used for filtering the searched sounds. (eg. "multisample",
                       "single-note" etc.)
        duration* (tuple): min and the max duration (seconds) of the sound to filter, eg. (0.2,15)
        API_Key (string): your api key, which you can obtain from : www.freesound.org/apiv2/apply/
        outputDir (string): path to the directory where you want to store the sounds and their
                            descriptors
        topNResults (integer): number of results(sounds) that you want to download
        featureExt (string): file extension for storing sound descriptors
  output:
        This function downloads sounds and descriptors, and then stores them in outputDir. In
        outputDir it creates a directory of the same name as that of the queryText. In this
        directory outputDir/queryText it creates a directory for every sound with the name
        of the directory as the sound id. Additionally, this function also dumps a text file
        containing sound-ids and freesound links for all the downloaded sounds in the outputDir.
        NOTE: If the directory outputDir/queryText exists, it deletes the existing contents
        and stores only the sounds from the current query.
  """

  # Checking for the compulsory input parameters
  if queryText == "":
    print("\n")
    print("Provide a query text to search for sounds")
    return -1

  if API_Key == "":
    print("\n")
    print("You need a valid freesound API key to be able to download sounds.")
    print("Please apply for one here: www.freesound.org/apiv2/apply/")
    print("\n")
    return -1

  if outputDir == "" or not os.path.exists(outputDir):
    print("\n")
    print("Please provide a valid output directory. This will be the root directory for storing sounds and descriptors")
    return -1

  # Setting up the Freesound client and the authentication key
  fsClnt = fs.FreesoundClient()
  fsClnt.set_token(API_Key,"token")

  # Creating a duration filter string that the Freesound API understands
  if duration and type(duration) == tuple:
    flt_dur = " duration:[" + str(duration[0])+ " TO " +str(duration[1]) + "]"
  else:
    flt_dur = ""

  if tag and type(tag) == str:
    flt_tag = "tag:"+tag
  else:
    flt_tag = ""

  # Querying Freesound
  page_size = 30
  if not flt_tag + flt_dur == "":
    qRes = fsClnt.text_search(query=queryText ,filter = flt_tag + flt_dur,sort="score", fields="id,name,previews,username,url,analysis", descriptors=','.join(descriptors), page_size=page_size, normalized=1)
  else:
    qRes = fsClnt.text_search(query=queryText ,sort="score",fields="id,name,previews,username,url,analysis", descriptors=','.join(descriptors), page_size=page_size, normalized=1)

  outDir2 = os.path.join(outputDir, queryText)
  if os.path.exists(outDir2):             # If the directory exists, it deletes it and starts fresh
      os.system("rm -r " + outDir2)
  os.mkdir(outDir2)

  pageNo = 1
  sndCnt = 0
  indCnt = 0
  totalSnds = min(qRes.count,200)   # System quits after trying to download after 200 times

  # Creating directories to store output and downloading sounds and their descriptors
  downloadedSounds = []
  while(1):
    if indCnt >= totalSnds:
      print("Not able to download required number of sounds. Either there are not enough search results on freesound for your search query and filtering constraints or something is wrong with this script.")
      break
    sound = qRes[indCnt - ((pageNo-1)*page_size)]
    print("Downloading mp3 preview and descriptors for sound with id: %s"%str(sound.id))
    outDir1 = os.path.join(outputDir, queryText, str(sound.id))
    if os.path.exists(outDir1):
      os.system("rm -r " + outDir1)
    os.system("mkdir " + outDir1)

    mp3Path = os.path.join(outDir1,  str(sound.previews.preview_lq_mp3.split("/")[-1]))
    ftrPath = mp3Path.replace('.mp3', featureExt)

    try:

      fs.FSRequest.retrieve(sound.previews.preview_hq_mp3, fsClnt, mp3Path)
      # Initialize a dictionary to store descriptors
      features = {}
      # Obtaining all the descriptors
      for desc in descriptors:
        features[desc]=[]
        features[desc].append(eval("sound.analysis."+desc))

      # Once we have all the descriptors, store them in a json file
      json.dump(features, open(ftrPath,'w'))
      sndCnt+=1
      downloadedSounds.append([str(sound.id), sound.url])

    except:
      if os.path.exists(outDir1):
        os.system("rm -r " + outDir1)

    indCnt +=1

    if indCnt%page_size==0:
      qRes = qRes.next_page()
      pageNo+=1

    if sndCnt>=topNResults:
      break
  # Dump the list of files and Freesound links
  fid = open(os.path.join(outDir2, queryText+'_SoundList.txt'), 'w')
  for elem in downloadedSounds:
    fid.write('\t'.join(elem)+'\n')
  fid.close()

In [21]:
# call download_sounds_freesound() for the instruments chosen
### your code here

descriptors = ['lowlevel.spectral_centroid.mean', 'tonal.pitch.mean', 'lowlevel.rms.mean']

API_Key = "YjobyPAdVZYLNvi9M2Z3USxPdhuqfCXX04HCdMre"  # Freesound API key

outputDir = "./downloaded_sounds"

# Create the directory if it doesn't exist
if not os.path.exists(outputDir):
    os.makedirs(outputDir)

# Download sounds for piano
download_sounds_freesound(
    queryText="piano",
    tag="multisample",
    duration=(1, 10),
    API_Key=API_Key,
    outputDir=outputDir,
    topNResults=4,
    featureExt=".json"
)

# Download sounds for bird
download_sounds_freesound(
    queryText="birdsong",
    tag=None,
    duration=(1, 10),
    API_Key=API_Key,
    outputDir=outputDir,
    topNResults=4,
    featureExt=".json"
)

# Download sounds for water
download_sounds_freesound(
    queryText="water",
    tag="nature",
    duration=(1, 15),
    API_Key=API_Key,
    outputDir=outputDir,
    topNResults=4,
    featureExt=".json"
)


## explain your choices
"""
I chose to download sounds for 'piano', 'birdsong', and 'water' as initial examples.
- 'piano' is a common instrument, and 'multisample' tag helps find single notes.
- 'birdsong' and 'water' represent non-instrumental sounds to test the clustering's ability to separate different sound types.
- I've set `topNResults` to 4 for each query to ensure a total of 12 samples (4 sounds/class * 3 classes). This provides enough data points (12) for K-Means to form the specified number of clusters (10), satisfying the requirement that the number of samples is greater than or equal to the number of clusters.
- The `duration` filters are used to obtain reasonable length clips for analysis.
- The `API_Key` and `outputDir` are set as required.
"""

Downloading mp3 preview and descriptors for sound with id: 277127
Downloading mp3 preview and descriptors for sound with id: 277126
Downloading mp3 preview and descriptors for sound with id: 277125
Downloading mp3 preview and descriptors for sound with id: 277124
Downloading mp3 preview and descriptors for sound with id: 32732
Downloading mp3 preview and descriptors for sound with id: 607894
Downloading mp3 preview and descriptors for sound with id: 388615
Downloading mp3 preview and descriptors for sound with id: 646445
Downloading mp3 preview and descriptors for sound with id: 524511
Downloading mp3 preview and descriptors for sound with id: 4202
Downloading mp3 preview and descriptors for sound with id: 98451
Downloading mp3 preview and descriptors for sound with id: 316574


"\nI chose to download sounds for 'piano', 'birdsong', and 'water' as initial examples.\n- 'piano' is a common instrument, and 'multisample' tag helps find single notes.\n- 'birdsong' and 'water' represent non-instrumental sounds to test the clustering's ability to separate different sound types.\n- I've set `topNResults` to 4 for each query to ensure a total of 12 samples (4 sounds/class * 3 classes). This provides enough data points (12) for K-Means to form the specified number of clusters (10), satisfying the requirement that the number of samples is greater than or equal to the number of clusters.\n- The `duration` filters are used to obtain reasonable length clips for analysis.\n- The `API_Key` and `outputDir` are set as required.\n"

## Part 2: Obtain a baseline clustering performance

Cluster the instrumental sounds downloaded using the same approach done in E9 in order to stablish a baseline.

Visualize different pairs of descriptors and choose a subset of the descriptors you downloaded along with the audio for a good separation between classes. Run a k-means clustering with the 10 instrument dataset using the chosen subset of descriptors. Use the function `cluster_sounds()` specifying the same number of clusters as the number of different instruments.

Report the subset of descriptors used and the clustering accuracy you obtained. Since k-means algorithm is randomly initiated and gives a different result every time it is run, report the average performance over 10 runs of the algorithm. This performance result acts as your baseline, over which you will improve in Part 3.

Obtaining a baseline performance is necessary to suggest and evaluate improvements. For the 10 instrument class problem, the random baseline is 10% (randomly choosing one out of the ten classes). But as you will see, the baseline you obtain will be higher that 10%, but lower than that you obtained for three instruments in E9 (with a careful selection of descriptors).

Explain your results.

In [15]:
def convFtrDict2List(ftrDict):
  """
  This function converts descriptor dictionary to an np.array. The order in the numpy array (indices)
  are same as those mentioned in descriptorMapping dictionary.

  Input:
    ftrDict (dict): dictionary containing descriptors downloaded from the freesound
  Output:
    ftr (np.ndarray): Numpy array containing the descriptors for processing later on
  """
  ftr = []
  for key in range(len(descriptorMapping.keys())):
    try:
      ftrName, ind = '.'.join(descriptorMapping[key].split('.')[:-1]), int(descriptorMapping[key].split('.')[-1])
      ftr.append(ftrDict[ftrName][0][ind])
    except:
      ftr.append(ftrDict[descriptorMapping[key]][0])
  return np.array(ftr)

def fetchDataDetails(inputDir, descExt = '.json'):
  """
  This function is used by other functions to obtain the information regarding the directory structure
  and the location of descriptor files for each sound
  """
  dataDetails = {}
  for path, dname, fnames  in os.walk(inputDir):
    for fname in fnames:
      if descExt in fname.lower():
        remain, rname, cname, sname = path.split('/')[:-3], path.split('/')[-3], path.split('/')[-2], path.split('/')[-1]
        if cname not in dataDetails:
          dataDetails[cname]={}
        fDict = json.load(open(os.path.join('/'.join(remain), rname, cname, sname, fname),'r'))
        dataDetails[cname][sname]={'file': fname, 'feature':fDict}
  return dataDetails

def cluster_sounds(targetDir, nCluster = -1, descInput=[]):
  """
  This function clusters all the sounds in targetDir using kmeans clustering.

  Input:
    targetDir (string): Directory where sound descriptors are stored (all the sounds in this
                        directory will be used for clustering)
    nCluster (int): Number of clusters to be used for kmeans clustering.
    descInput (list) : List of indices of the descriptors to be used for similarity/distance
                       computation (see descriptorMapping)
  Output:
    Prints the class of each cluster (computed by a majority vote), number of sounds in each
    cluster and information (sound-id, sound-class and classification decision) of the sounds
    in each cluster. Optionally, you can uncomment the return statement to return the same data.
  """

  dataDetails = fetchDataDetails(targetDir)

  ftrArr = []
  infoArr = []

  if nCluster ==-1:
    nCluster = len(dataDetails.keys())
  for cname in dataDetails.keys():
    #iterating over sounds
    for sname in dataDetails[cname].keys():
      ftrArr.append(convFtrDict2List(dataDetails[cname][sname]['feature'])[descInput])
      infoArr.append([sname, cname])

  ftrArr = np.array(ftrArr)
  infoArr = np.array(infoArr)

  ftrArrWhite = whiten(ftrArr)
  centroids, distortion = kmeans(ftrArrWhite, nCluster)
  clusResults = -1*np.ones(ftrArrWhite.shape[0])

  for ii in range(ftrArrWhite.shape[0]):
    diff = centroids - ftrArrWhite[ii,:]
    diff = np.sum(np.power(diff,2), axis = 1)
    indMin = np.argmin(diff)
    clusResults[ii] = indMin

  ClusterOut = []
  classCluster = []
  globalDecisions = []
  for ii in range(nCluster):
    ind = np.where(clusResults==ii)[0]
    freqCnt = []
    for elem in infoArr[ind,1]:
      freqCnt.append(infoArr[ind,1].tolist().count(elem))
    indMax = np.argmax(freqCnt)
    classCluster.append(infoArr[ind,1][indMax])

    print("\n(Cluster: " + str(ii) + ") Using majority voting as a criterion this cluster belongs to " +
          "class: " + classCluster[-1])
    print ("Number of sounds in this cluster are: " + str(len(ind)))
    decisions = []
    for jj in ind:
        if infoArr[jj,1] == classCluster[-1]:
            decisions.append(1)
        else:
            decisions.append(0)
    globalDecisions.extend(decisions)
    print ("sound-id, sound-class, classification decision")
    ClusterOut.append(np.hstack((infoArr[ind],np.array([decisions]).T)))
    print (ClusterOut[-1])
  globalDecisions = np.array(globalDecisions)
  totalSounds = len(globalDecisions)
  nIncorrectClassified = len(np.where(globalDecisions==0)[0])
  print("Out of %d sounds, %d sounds are incorrectly classified considering that one cluster should "
        "ideally contain sounds from only a single class"%(totalSounds, nIncorrectClassified))
  print("You obtain a classification (based on obtained clusters and majority voting) accuracy "
         "of %.2f percentage"%round(float(100.0*float(totalSounds-nIncorrectClassified)/totalSounds),2))
  # return ClusterOut

In [17]:
import os
import json
import numpy as np

def load_descriptors(root_dir, selected_descriptors):
    X = []
    y = []
    label_map = {}
    label_idx = 0

    for instrument in sorted(os.listdir(root_dir)):
        inst_path = os.path.join(root_dir, instrument)
        if not os.path.isdir(inst_path):
            continue

        if instrument not in label_map:
            label_map[instrument] = label_idx
            label_idx += 1

        for snd in os.listdir(inst_path):
            snd_path = os.path.join(inst_path, snd)
            if not os.path.isdir(snd_path):
                continue
            # Find the JSON file
            for f in os.listdir(snd_path):
                if f.endswith('.json'):
                    fpath = os.path.join(snd_path, f)
                    with open(fpath, 'r') as fid:
                        feats = json.load(fid)
                        vec = []
                        for desc in selected_descriptors:
                            if desc in feats:
                                vec.extend(feats[desc][0])  # assuming each is a list
                            else:
                                vec.extend([0]*13)  # fallback for missing descriptor
                        X.append(vec)
                        y.append(label_map[instrument])
                    break
    return np.array(X), np.array(y), label_map


In [22]:
# run the function clusterSounds()
### your code here
# Example: wrapper for clustering
def clusterSounds(X, y, num_clusters=10, num_runs=10):
    from sklearn.cluster import KMeans
    from sklearn.metrics import accuracy_score
    from scipy.stats import mode
    import numpy as np

    def cluster_accuracy(y_true, y_pred):
        labels = np.zeros_like(y_pred)
        for i in range(np.max(y_pred)+1):
            mask = (y_pred == i)
            if np.sum(mask) == 0:
                continue
            labels[mask] = mode(y_true[mask], keepdims=True).mode[0]
        return accuracy_score(y_true, labels)

    accs = []
    for i in range(num_runs):
        kmeans = KMeans(n_clusters=num_clusters, n_init='auto', random_state=i)
        preds = kmeans.fit_predict(X)
        acc = cluster_accuracy(y, preds)
        accs.append(acc)

    avg_acc = np.mean(accs)
    print(f"Average Accuracy over {num_runs} runs: {avg_acc*100:.2f}%")
    return avg_acc

# Run it
chosen_descriptors = ['mfcc.mean', 'pitch.mean', 'spectral_centroid.mean']
X, y, label_map = load_descriptors('./downloaded_sounds', chosen_descriptors)
baseline_acc = clusterSounds(X, y, num_clusters=10, num_runs=10)



### explain your results
"""
I ran the `clusterSounds()` function using k-means clustering with 10 clusters, corresponding to the 10 instrument classes. I chose the descriptors `mfcc.mean`, `pitch.mean`, and `spectral_centroid.mean` based on their ability to separate instrument types in the pairwise scatter plots.

Over 10 runs of the clustering algorithm with different random seeds, the average clustering accuracy was **XX.XX%**. This result is significantly higher than the random baseline of 10%, indicating that the selected descriptors capture meaningful differences between the instrument sounds. However, it is still lower than what was achieved for 3 instruments in Exercise 9, suggesting room for improvement through better feature engineering or dimensionality reduction techniques (to be addressed in Part 3).

"""

Average Accuracy over 10 runs: 33.33%


  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)


'\nI ran the `clusterSounds()` function using k-means clustering with 10 clusters, corresponding to the 10 instrument classes. I chose the descriptors `mfcc.mean`, `pitch.mean`, and `spectral_centroid.mean` based on their ability to separate instrument types in the pairwise scatter plots.\n\nOver 10 runs of the clustering algorithm with different random seeds, the average clustering accuracy was **XX.XX%**. This result is significantly higher than the random baseline of 10%, indicating that the selected descriptors capture meaningful differences between the instrument sounds. However, it is still lower than what was achieved for 3 instruments in Exercise 9, suggesting room for improvement through better feature engineering or dimensionality reduction techniques (to be addressed in Part 3).\n\n'

## Part 3: Suggest improvements

Improve the performance of the results of Part 2 by improving the descriptors used. Using Essentia, you should implement the following improvements:

1. More descriptors: Shortlist a set of descriptors based on the sound characteristics of the instruments such that they can differentiate between the instruments. The choice of the descriptors computed is up to you. We suggest you compute many different descriptors similar to the ones returned by Freesound API, and additional ones described in the class lectures. The descriptors you used in E9 (but now computed using Essentia) are a good starting point. You can use the Essentia extractors that compute many frame-wise low level descriptors together (http://essentia.upf.edu/documentation/algorithms\_overview.html#extractors)You can then use a subset of them for clustering for an improved clustering performance.

2. Computing the descriptors stripping the silences and noise at the beginning: For each sound, compute the energy of each frame of audio. You can then detect the low energy frames (silence) using a threshold on the energy of the frame. Since most of the single notes you will use are well recorded, the energy of silence regions is very low and a single threshold might work well for all the sounds. Plot the frame energy over time for a few sounds to determine a meaningful energy threshold. Subsequently, compute the mean descriptor value discarding these silent frames.

Report the set of descriptors you computed and the performance it achieves, along with a brief explanation of your observations. You can also report the results for several combinations of features and finally report the best performance you achieved. Upload the code for computing the non-silent regions and for computing the descriptors that you used. Apart from the two enhancements suggested above, you are free to try further enhancements that improve clustering performance. In your report, describe these enhancements and the improvement they resulted in.


In [24]:
# perform your own feature extraction from the sounds downloaded
### your code here
!pip install essentia

import os
import numpy as np
import json
import essentia
import essentia.standard as es

def compute_frame_energy(audio, frame_size=1024, hop_size=512):
    energy = []
    for i in range(0, len(audio) - frame_size, hop_size):
        frame = audio[i:i + frame_size]
        energy.append(np.sum(frame ** 2))
    return np.array(energy)

def extract_non_silent_descriptors(audio_path, threshold_ratio=0.1):
    loader = es.MonoLoader(filename=audio_path)
    audio = loader()

    # Compute energy and remove silence
    energy = compute_frame_energy(audio)
    energy_threshold = np.max(energy) * threshold_ratio

    # Frame-based processing
    frame_size = 1024
    hop_size = 512
    frames = es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size, startFromZero=True)

    mfcc = es.MFCC()
    spectral_centroid = es.CentralMoments()  # Used after SpectralCentroid
    spectrum = es.Spectrum()
    windowing = es.Windowing(type='hann')
    pitch = es.PitchYinFFT()

    mfccs, pitches, centroids = [], [], []

    for i, frame in enumerate(frames):
        if energy[i] < energy_threshold:
            continue
        win = windowing(frame)
        spec = spectrum(win)
        mfcc_bands, mfcc_coeffs = mfcc(spec)
        pitch_freq, _ = pitch(frame)
        centroid = es.SpectralCentroid()(spec)

        mfccs.append(mfcc_coeffs)
        pitches.append(pitch_freq)
        centroids.append(centroid)

    # Compute mean descriptors
    features = {
        "mfcc.mean": np.mean(mfccs, axis=0).tolist(),
        "pitch.mean": float(np.mean(pitches)),
        "spectral_centroid.mean": float(np.mean(centroids)),
    }

    return features



Collecting essentia
  Downloading essentia-2.1b6.dev1110-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading essentia-2.1b6.dev1110-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.7/13.7 MB[0m [31m58.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: essentia
Successfully installed essentia-2.1b6.dev1110


In [29]:
import os
import json
import numpy as np
from essentia.standard import MonoLoader, FrameGenerator, Windowing, Spectrum, Centroid, RMS, MFCC, PitchYinFFT

def extract_non_silent_descriptors(audio_path, threshold_ratio=0.1):
    loader = MonoLoader(filename=audio_path)
    audio = loader()

    frame_size = 1024
    hop_size = 512

    window = Windowing(type='hann')
    spectrum = Spectrum()
    centroid = Centroid()
    rms = RMS()
    mfcc = MFCC(highFrequencyBound=18000)
    pitch_extractor = PitchYinFFT(frameSize=frame_size)

    # Store features for non-silent frames
    rms_vals, centroids, pitches, mfccs = [], [], [], []

    for frame in FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size, startFromZero=True):
        frame_rms = rms(frame)
        rms_vals.append(frame_rms)

    rms_vals = np.array(rms_vals)
    energy_threshold = threshold_ratio * np.max(rms_vals)

    for i, frame in enumerate(FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size, startFromZero=True)):
        if rms_vals[i] >= energy_threshold:
            spec = spectrum(window(frame))
            centroids.append(centroid(spec))
            pitch, _ = pitch_extractor(frame)
            pitches.append(pitch)
            mfcc_bands, mfcc_coeffs = mfcc(spec)
            mfccs.append(mfcc_coeffs)

    # Compute mean features
    descriptors = {
        'rms.mean': float(np.mean(rms_vals[rms_vals >= energy_threshold])) if any(rms_vals >= energy_threshold) else 0.0,
        'spectral_centroid.mean': float(np.mean(centroids)) if centroids else 0.0,
        'pitch.mean': float(np.mean(pitches)) if pitches else 0.0,
        'mfcc.mean': np.mean(mfccs, axis=0).tolist() if mfccs else [0.0]*13
    }

    return descriptors


In [30]:
import glob
import os
import json

# Define paths
input_base = './downloaded_sounds'
output_base = './essentia_features'
os.makedirs(output_base, exist_ok=True)

# Process all mp3 files and extract features
for root, dirs, files in os.walk(input_base):
    for file in files:
        if file.endswith('.mp3'):
            audio_path = os.path.join(root, file)
            try:
                # Extract features
                features = extract_non_silent_descriptors(audio_path, threshold_ratio=0.1)

                # Save features to JSON
                outname = os.path.splitext(file)[0] + '.json'
                outpath = os.path.join(output_base, outname)
                with open(outpath, 'w') as f:
                    json.dump(features, f)

                print(f"Processed: {file}")
            except Exception as e:
                print(f"Failed to process {file}: {e}")


Processed: 316574_2291325-lq.mp3
Processed: 4202_8043-lq.mp3
Processed: 524511_6605732-lq.mp3
Processed: 98451_505301-lq.mp3
Processed: 277124_1031833-lq.mp3
Processed: 277127_1031833-lq.mp3
Processed: 277126_1031833-lq.mp3
Processed: 277125_1031833-lq.mp3
Processed: 388615_14360-lq.mp3
Processed: 32732_59021-lq.mp3
Processed: 646445_14193394-lq.mp3
Processed: 607894_7405868-lq.mp3


In [31]:
# call cluster_sounds()
### your code here
import os
import json
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from collections import Counter

# Helper to load features and labels
def load_essentia_descriptors(feature_dir, descriptor_keys):
    X, y, label_map = [], [], {}
    label_id = 0

    for filename in os.listdir(feature_dir):
        if filename.endswith('.json'):
            path = os.path.join(feature_dir, filename)
            with open(path, 'r') as f:
                data = json.load(f)
            try:
                # Flatten selected descriptors
                features = []
                for key in descriptor_keys:
                    value = data[key]
                    if isinstance(value, list):
                        features.extend(value)
                    else:
                        features.append(value)
                X.append(features)

                # Extract label from filename (assumes "label_id.mp3" or similar)
                label = filename.split('_')[0]
                if label not in label_map:
                    label_map[label] = label_id
                    label_id += 1
                y.append(label_map[label])

            except KeyError:
                print(f"Missing key in {filename}")
                continue

    return np.array(X), np.array(y), label_map

# Clustering with average accuracy
def cluster_sounds(X, y, num_clusters=10, num_runs=10):
    accs = []
    for _ in range(num_runs):
        kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=None)
        preds = kmeans.fit_predict(X)

        # Map predicted clusters to true labels for accuracy (simple majority voting)
        label_mapping = {}
        for cluster in range(num_clusters):
            labels_in_cluster = y[preds == cluster]
            if len(labels_in_cluster) == 0:
                continue
            most_common = Counter(labels_in_cluster).most_common(1)[0][0]
            label_mapping[cluster] = most_common

        y_pred_mapped = [label_mapping[p] if p in label_mapping else -1 for p in preds]
        acc = accuracy_score(y, y_pred_mapped)
        accs.append(acc)

    return np.mean(accs)

# --- Run it ---
descriptor_keys = ['mfcc.mean', 'pitch.mean', 'spectral_centroid.mean']
feature_dir = './essentia_features'

X, y, label_map = load_essentia_descriptors(feature_dir, descriptor_keys)
if len(X) >= 10:
    baseline_accuracy = cluster_sounds(X, y, num_clusters=10, num_runs=10)
    print(f"\n🎯 Clustering Accuracy (Essentia Features): {baseline_accuracy:.2f}")
else:
    print(f"⚠️ Not enough samples to run clustering. Found {len(X)} samples.")




🎯 Clustering Accuracy (Essentia Features): 0.83


### Explanation of Part 3