Possible ways to read audio files in Python:

https://stackoverflow.com/questions/2060628/reading-wav-files-in-python

Stereo to Mono:

https://stackoverflow.com/questions/43056088/stereo-to-mono-wave-interpolation-in-python
https://stackoverflow.com/questions/30401042/stereo-to-mono-wav-in-python

On reading and writing WAVE files:

http://www.labbookpages.co.uk/audio/wavFiles.html

https://github.com/mbereket/music-transcription


Why is the Fourier transform so important?

https://dsp.stackexchange.com/questions/69/why-is-the-fourier-transform-so-important

A useful presentation:

http://www.machinelearning.ru/wiki/images/c/cf/NizhibitskyMusicSlides.pdf


https://ru.wikipedia.org/wiki/Импульсно-кодовая_модуляция

## Data preparation
### Specifically for the MAPS dataset

In [1]:
import soundfile as sf
import numpy as np
import os
import pandas as pd
# pd.set_option('display.max_rows', 20)

In [2]:
example_wav_path = 'MAPS/AkPnStgb/MUS/MAPS_MUS-alb_esp2_AkPnStgb.wav'
example_txt_path = 'MAPS/AkPnStgb/MUS/MAPS_MUS-alb_esp2_AkPnStgb.txt'

In [3]:
data, samplerate = sf.read(example_wav_path)

In [4]:
data.shape

(5620807, 2)

In [5]:
REL_PATH = 'MAPS/'

In [6]:
# checks that every file in the given triplet exists
def checkFileTripletValid(triplet):
    for filename in triplet:
        if not os.path.isfile(filename):
            raise FileNotFoundError("The given file doesn't exist: " + filename)
    return True

In [7]:
files = []

for f in os.scandir(REL_PATH):
    if f.is_dir():
        for (dirpath, dirnames, filenames) in os.walk(f.path):
            for file in filenames:
                if file.endswith(".wav"):
                    fullpath = os.path.join(dirpath, file) # full path to the .wav file
                    base = os.path.splitext(fullpath)[0] # the same path without its extension
                    triplet = [base + '.wav', base + '.mid', base + '.txt'] # the triplet of files: .wav, .mid, .txt
                    if checkFileTripletValid(triplet):
                        files.append(triplet)
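
The same scan can be sketched more compactly with `pathlib`. This is only an illustration under stated assumptions: `collect_triplets` is a hypothetical helper name, it silently skips incomplete triplets instead of raising, and (unlike the loop above) it also picks up `.wav` files that lie directly in the root directory. A throwaway directory tree stands in for `MAPS/`.

```python
import tempfile
from pathlib import Path


def collect_triplets(root):
    """Return [wav, mid, txt] path triplets found under root (hypothetical helper)."""
    triplets = []
    for wav in sorted(Path(root).rglob("*.wav")):
        triplet = [wav.with_suffix(ext) for ext in (".wav", ".mid", ".txt")]
        # keep the triplet only if all three files are present
        if all(p.is_file() for p in triplet):
            triplets.append([str(p) for p in triplet])
    return triplets


# tiny throwaway directory tree standing in for MAPS/
with tempfile.TemporaryDirectory() as tmp:
    mus = Path(tmp) / "AkPnStgb" / "MUS"
    mus.mkdir(parents=True)
    for name in ("a.wav", "a.mid", "a.txt", "b.wav"):  # b has no .mid/.txt
        (mus / name).touch()
    found = collect_triplets(tmp)

print(len(found))  # 1
```

Whether a missing `.mid` or `.txt` should abort the scan or just be skipped depends on how much the dataset copy is trusted; the sketch picks the lenient variant.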


In [8]:
len(files)

29880

In [9]:
# down-mixes a 2-channel signal to mono by averaging the channels;
# any other input is returned unchanged
def makeMonoAudiofile(data):
    if data.ndim == 2 and data.shape[1] == 2:
        data = (data[:, 0] + data[:, 1]) / 2
    return data

# converts onset and offset times from seconds to sample numbers (in place)
def translateTimeToSampleNum(data):
    data["OnsetTime"] = np.round(data["OnsetTime"] * SAMPLERATE).astype(int)
    data["OffsetTime"] = np.round(data["OffsetTime"] * SAMPLERATE).astype(int)
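
A minimal sketch of the stereo-to-mono down-mix on synthetic data (the array below is made up, not taken from MAPS). It assumes the layout that `sf.read` returned above, one row per sample and one column per channel, and averages over the channel axis.

```python
import numpy as np

# synthetic stereo signal: 4 samples (rows), 2 channels (columns)
stereo = np.array([[0.25, 0.75],
                   [1.0, 0.0],
                   [-0.5, 0.5],
                   [0.0, 0.0]])

# average the left and right channel for every sample
mono = stereo.mean(axis=1)

print(stereo.shape)  # (4, 2)
print(mono.shape)    # (4,)
```

Averaging keeps the result in the same amplitude range as the input channels, whereas a plain sum could exceed it.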

A sample is the smallest usable quantum of digital audio. The term frame isn't formally defined in pure audio terms, but is often used in relation to video that may accompany an audio track. In that context a frame is the quantity of audio samples taken during a video frame interval.

The data labeling could be approached head-on:

According to my copy of the absolutely essential (!) Master Handbook of Acoustics, to hear shorter tones, that is sounds with a short impulse, they need to be louder:

A 1,000-Hz tone sounds like 1,000 Hz in a 1-second tone burst, but an extremely short burst sounds like a click. The duration of such a burst also influences the perceived loudness. Short bursts do not sound as loud as longer ones... A pulse 3 milliseconds long must have a level about 15dB higher to sound as loud as a 0.5-second (500 millisecond) pulse. Tones and random noise follow roughly the same relationship in loudness vs. pulse length.

The 100-msec region is significant... Only when the tones or noise bursts are shorter than this amount must the sound-pressure level be increased to produce a loudness equal to that of long pulses or steady tones or noise. This 100 msec appears to be the integrating time or the time constant of the human ear. (Everest 2001, 60-61)

In [21]:
FRAME_SIZE = 16384 # samples per frame
SAMPLERATE = 44100 # samples per sec
MIDI_PITCH_NUM = 128
DEBUG_MODE = True

# a note counts as sounding during a frame only if the number of samples
# in which it plays within that frame exceeds MIN_SAMPLES_PRESENT
MIN_SAMPLES_PRESENT = 0.1 * SAMPLERATE
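
For orientation, the constants can be converted into durations. This is plain arithmetic on the values defined above; the names `frame_ms` and `threshold_share` are introduced only for this sketch.

```python
FRAME_SIZE = 16384   # samples per frame
SAMPLERATE = 44100   # samples per second
MIN_SAMPLES_PRESENT = 0.1 * SAMPLERATE

frame_ms = 1000 * FRAME_SIZE / SAMPLERATE          # frame length in milliseconds
threshold_share = MIN_SAMPLES_PRESENT / FRAME_SIZE  # threshold as a share of one frame

print(round(frame_ms, 1))         # 371.5
print(MIN_SAMPLES_PRESENT)        # 4410.0
print(round(threshold_share, 3))  # 0.269
```

So a frame lasts about 371.5 ms, and the 100 ms threshold (the integration time quoted from Everest) is roughly a quarter of a frame.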

In [11]:
# returns frames (numpy.ndarray of shape (None, FRAME_SIZE))
# and labels (numpy.ndarray of shape (None, MIDI_PITCH_NUM)) for an audio file
def preprocessAudiofile(path_wav, path_txt):

    # sf.read returns a two-dimensional array of shape (samples, channels),
    # i.e. the channels are stored as columns; a mono file gives a 1-D array
    data_wav, samplerate = sf.read(path_wav)
    data_txt = pd.read_csv(path_txt, sep="\t", header=0)

    assert samplerate == SAMPLERATE

    # down-mix to mono by averaging the channels
    if data_wav.ndim == 2:
        data_wav = data_wav.mean(axis=1)
    translateTimeToSampleNum(data_txt)

    chunks_amount = len(data_wav) // FRAME_SIZE
    samples_amount = FRAME_SIZE * chunks_amount
    # drop the tail of the audio file if its number of
    # samples is NOT evenly divisible by FRAME_SIZE
    data_wav = data_wav[:samples_amount]

    # split data_wav into chunks_amount equal parts; if the length of data_wav
    # is NOT evenly divisible by chunks_amount, np.split raises an exception
    frames = np.array(np.split(data_wav, chunks_amount))
    labels = getLabelsForAudiofile(data_txt, frames)

    if DEBUG_MODE:
        pass  # the debug statement is truncated in the source
    return frames, labels
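
The framing step on its own, as a small sketch with a toy frame size (4 instead of 16384) and a made-up 10-sample signal. It uses `reshape`, which for a 1-D signal gives the same rows as stacking the pieces returned by `np.split`.

```python
import numpy as np

FRAME_SIZE = 4  # tiny frame size, for illustration only

signal = np.arange(10)  # 10 samples: 0, 1, ..., 9

chunks_amount = len(signal) // FRAME_SIZE       # 2 full frames
trimmed = signal[:chunks_amount * FRAME_SIZE]   # drops samples 8 and 9
frames = trimmed.reshape(chunks_amount, FRAME_SIZE)

print(frames)
# [[0 1 2 3]
#  [4 5 6 7]]
```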
    

In [12]:
def getLabelsForAudiofile(data_txt, frames):

    # For every frame of size FRAME_SIZE we want a label: a vector of
    # 0s and 1s of length MIDI_PITCH_NUM that says whether the corresponding
    # note sounded during this frame; everything is initialized to zero
    labels = np.zeros((len(frames), MIDI_PITCH_NUM))

    for i in np.arange(data_txt.shape[0]):

        onset_time = int(data_txt.values[i][0])
        offset_time = int(data_txt.values[i][1])
        midi_pitch = int(data_txt.values[i][2])

        begin_frame = onset_time // FRAME_SIZE
        # a note may extend into the discarded tail, so clamp to the last frame
        end_frame = min(offset_time // FRAME_SIZE, len(frames) - 1)

        for frame_num in range(begin_frame, end_frame + 1):
            frame_start = frame_num * FRAME_SIZE
            frame_end = frame_start + FRAME_SIZE
            # number of samples of this frame during which the note sounds;
            # a frame strictly between begin_frame and end_frame is covered
            # entirely, so its overlap equals FRAME_SIZE
            overlap = min(offset_time, frame_end) - max(onset_time, frame_start)
            if overlap > MIN_SAMPLES_PRESENT:
                # mark that this midi_pitch sounded in the frame
                labels[frame_num][midi_pitch] = 1

    return labels
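
The labeling rule (a note is marked in a frame only if it sounds there for more than `MIN_SAMPLES_PRESENT` samples) can be checked on a toy case. All numbers below are invented and much smaller than the MAPS settings; the per-frame overlap is computed as the length of the intersection of the note interval with the frame interval.

```python
import numpy as np

FRAME_SIZE = 10           # toy values, not the ones used for MAPS
MIN_SAMPLES_PRESENT = 3
NUM_FRAMES = 4
MIDI_PITCH_NUM = 128

# (onset sample, offset sample, MIDI pitch)
notes = [(8, 25, 60),   # 2 samples in frame 0, 10 in frame 1, 5 in frame 2
         (31, 33, 64)]  # only 2 samples, all in frame 3

labels = np.zeros((NUM_FRAMES, MIDI_PITCH_NUM))
for onset, offset, pitch in notes:
    first = onset // FRAME_SIZE
    last = min(offset // FRAME_SIZE, NUM_FRAMES - 1)
    for frame in range(first, last + 1):
        start, end = frame * FRAME_SIZE, (frame + 1) * FRAME_SIZE
        overlap = min(offset, end) - max(onset, start)
        if overlap > MIN_SAMPLES_PRESENT:
            labels[frame, pitch] = 1

print(np.where(labels[:, 60] == 1)[0])  # [1 2]
print(labels[:, 64].sum())              # 0.0
```

Pitch 60 is dropped from frame 0 because only 2 of its samples fall there, and pitch 64 is never marked because the whole note is shorter than the threshold.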
    

In [13]:
frames, labels = preprocessAudiofile(example_wav_path, example_txt_path)


In [20]:
np.where(labels[6] == 1)

(array([33, 38, 45, 50, 57, 62, 66, 69]),)
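
To read such an output more easily, the MIDI pitch numbers can be turned into note names. A small sketch: `midi_to_name` is a hypothetical helper, and it assumes the common convention that MIDI pitch 60 is C4 (some software labels it C3 instead).

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_name(pitch):
    """Name of a MIDI pitch, assuming the convention MIDI 60 = C4."""
    return NOTE_NAMES[pitch % 12] + str(pitch // 12 - 1)


active = [33, 38, 45, 50, 57, 62, 66, 69]  # pitches found in frame 6 above
names = [midi_to_name(p) for p in active]
print(names)
# ['A1', 'D2', 'A2', 'D3', 'A3', 'D4', 'F#4', 'A4']
```

Under that convention frame 6 contains only the pitch classes D, F# and A.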

In [57]:
data = pd.read_csv('MAPS/AkPnStgb/MUS/MAPS_MUS-alb_esp2_AkPnStgb.txt', sep="\t", header=0)
data.values

array([[   0.500004,    2.45304 ,   38.      ],
       [   0.500004,    2.45304 ,   50.      ],
       [   1.21885 ,    2.45304 ,   57.      ],
       ..., 
       [ 123.032   ,  125.456   ,   62.      ],
       [ 123.032   ,  125.456   ,   69.      ],
       [ 123.032   ,  125.456   ,   78.      ]])

In [55]:
len(data["OnsetTime"].values)

679

In [18]:
data["OnsetTime"] = np.round(data.values[:, 0] * SAMPLERATE).astype(int)
data["OffsetTime"] = np.round(data.values[:, 1] * SAMPLERATE).astype(int)
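
A quick check of the seconds-to-samples conversion on three onset times copied from the `data.values` listing above (a sketch; the variable names are introduced here for illustration).

```python
import numpy as np

SAMPLERATE = 44100  # samples per second

onsets_sec = np.array([0.500004, 1.21885, 123.032])  # seconds
onsets_samples = np.round(onsets_sec * SAMPLERATE).astype(int)

print(onsets_samples.tolist())  # [22050, 53751, 5425711]
```

The last onset, sample 5425711, lies inside the 5620807 samples reported by `data.shape` for this recording.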

In [19]:
data

   OnsetTime  OffsetTime  MidiPitch
0      22050      108179         38
1      22050      108179         50
2      53751      108179         57
3      64036      108179         69
4      64036      108179         66
5      64036      108179         62
6      84605      108179         33
7      84605      108179         45
8     108179      183877         38
9     108179      183877         50
