In [None]:
import os
import sys
sys.path.append('../original/pytorch')
from read_feats_classV5 import ASVSpoofTrainData
import matplotlib.pyplot as plt
import numpy as np
import pickle

In [None]:
# need to jump over to the pytorch dir to make paths work in that module
os.chdir('../original/pytorch')
tdata = ASVSpoofTrainData()

In [None]:
spec = tdata[0][0].numpy()

In [None]:
# work out which input file we're looking at
data_fn = 'train_info.lst'
with open(data_fn, 'rb') as f:
    data = pickle.load(f)
data['names'][0]

In [None]:
fig = plt.figure(figsize=(15,6))
ax = fig.subplots()
_ = plt.pcolormesh(spec)

In [None]:
spec.shape

The project uses a Matlab FFT routine (in `logpow.m`) to calculate a log power spectrum with a window size of 16 ms and an overlap of 8 ms.  The FFT size is 1536 points, giving a 768-point spectrum (matching the input image size of AlexNet), but with a 16 kHz sample frequency the window only contains 256 samples, so the remainder must be padded with zeros (I can't find a reference to this behavior).  Zero padding places the FFT bins on a finer frequency grid, but with only 256 samples of data it adds no real resolution: the result is just a smoothly interpolated version of the 256-point spectrum.
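
A quick numerical check of the zero-padding point, as a sketch in NumPy rather than the project's Matlab code. The frame here is random noise standing in for 16 ms of audio, and the variable names are illustrative. Padding a 256-sample frame to 1536 points samples the same underlying spectrum six times more densely, so every sixth bin of the padded FFT should equal the unpadded 256-point FFT.

```python
import numpy as np

fs = 16000                      # sample rate in Hz
win_len = round(0.016 * fs)     # 16 ms window -> 256 samples
n_fft = 1536                    # FFT size used for the original features

rng = np.random.default_rng(0)
frame = rng.standard_normal(win_len)   # stand-in for one frame of audio

plain = np.fft.fft(frame)              # 256-point FFT, no padding
padded = np.fft.fft(frame, n=n_fft)    # NumPy zero-pads the frame to 1536 points

step = n_fft // win_len                # 6 padded bins per unpadded bin
bin_spacing_plain = fs / win_len       # 62.5 Hz between bins
bin_spacing_padded = fs / n_fft        # about 10.4 Hz between bins

# The padded spectrum passes through the unpadded values: no new information,
# just a denser sampling of the same smooth curve.
matches = np.allclose(padded[::step], plain)
```

If `matches` is true, the extra bins are interpolation only, which is consistent with the smoothed look of the original features.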

The Matlab code applies a Hamming window prior to the FFT, then takes the log of the squared FFT magnitude. The result is written to an h5 file.
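
The processing described here can be sketched in NumPy as follows. This is an approximation of what `logpow.m` is described as doing, not a port of it: the function name `log_power_spectrogram`, the small floor `eps` added before the log, and the choice to keep the first 768 bins are all assumptions.

```python
import numpy as np

def log_power_spectrogram(signal, fs=16000, win_s=0.016, hop_s=0.008,
                          n_fft=1536, eps=1e-10):
    """Hamming-windowed, zero-padded log power spectrum (hypothetical helper)."""
    win_len = round(win_s * fs)     # 256 samples at 16 kHz
    hop_len = round(hop_s * fs)     # 128 samples, i.e. 50% overlap
    window = np.hamming(win_len)

    n_frames = 1 + (len(signal) - win_len) // hop_len
    # Slice the signal into overlapping frames, one per row.
    frames = np.stack([signal[i * hop_len:i * hop_len + win_len]
                       for i in range(n_frames)])

    # rfft zero-pads each windowed frame to n_fft points.
    spectrum = np.fft.rfft(frames * window, n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2

    # Keep n_fft // 2 bins (drops the Nyquist bin) and put frequency on the rows.
    return np.log(power[:, :n_fft // 2] + eps).T

# One second of a 1 kHz tone as a stand-in for a speech recording.
t = np.arange(16000) / 16000
demo = log_power_spectrogram(np.sin(2 * np.pi * 1000 * t))
```

The result has one row per frequency bin and one column per frame, the same orientation as `spec` above.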

These h5 files are then read by the script `split-data-768.py`, which generates a fixed-size array of 768x400 by repeating the data if it is shorter than 400 frames or truncating it if it is longer.  The features are then normalised (subtract the mean and divide by the standard deviation) before being saved to npy format files.
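
As a rough illustration of the repeat-or-truncate step and the normalisation, here is a sketch in NumPy. It is not the code from `split-data-768.py`: the helper names `fix_length` and `standardise` are hypothetical, and normalising each frequency row over time is an assumption about which axis the original script uses.

```python
import numpy as np

def fix_length(mat, n_frames=400):
    """Repeat a (freq, time) array along time until it has n_frames columns, then truncate."""
    reps = -(-n_frames // mat.shape[1])     # ceiling division
    return np.tile(mat, (1, reps))[:, :n_frames]

def standardise(mat, axis=1):
    """Subtract the mean and divide by the standard deviation along one axis."""
    mean = mat.mean(axis=axis, keepdims=True)
    std = mat.std(axis=axis, keepdims=True)
    return (mat - mean) / std

rng = np.random.default_rng(0)
short_clip = rng.standard_normal((768, 150))   # shorter than 400 frames: repeated
long_clip = rng.standard_normal((768, 520))    # longer than 400 frames: truncated

short_fixed = standardise(fix_length(short_clip))
long_fixed = standardise(fix_length(long_clip))
```

Using `np.tile` means a clip shorter than half the target length is repeated as many times as needed, which a single append of the leading frames would not cover.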

These npy files are then read by the code above and used as input to the network.

The result plotted above shows the spectrum with some horizontal (pitch) banding at lower frequencies in the voiced sections. Higher frequencies are messy, and the normalisation of the spectrum may have reduced the contrast.  We can see the repetition of the signal after around 300 on the x-axis.

## Feature Extraction with Sidekit

Now we'll try to reproduce something like this using the sidekit library but perhaps with some more sensible settings for the FFT.

One issue is that Sidekit does not do zero padding on FFT spectra so we can't fully reproduce the original features.  However, I'm not sure that zero padding was done for any good reason other than to fit into the 768x400 image size.  

We define a function to create a feature extractor, parameterising the frame size and shift.  Then we can compute the spectrogram for the same input file.

In [None]:
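import sidekit

def make_feature_server(frame_size, shift):
    """Build a Sidekit FeaturesServer giving `frame_size` linear filter-bank values per frame.

    frame_size: number of frequency points per frame
    shift: frame shift in seconds
    """
    sampling_frequency = 16000
    # The window must span about twice the frame size (in samples) to give the right number
    # of FFT points. Since we can't zero pad, each frame takes in more of the signal.
    window_size = (2 * frame_size + 1) / sampling_frequency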


    extractor = sidekit.FeaturesExtractor(audio_filename_structure="../../data/ASVspoof2017/ASVspoof2017_V2_train/{}.wav",
                                          feature_filename_structure="../../data/feat/{}.h5",
                                          sampling_frequency=sampling_frequency,
                                          lower_frequency=0,
                                          higher_frequency=sampling_frequency/2,
                                          filter_bank="lin",
                                          filter_bank_size=frame_size,
                                          window_size=window_size,
                                          shift=shift,
                                          ceps_number=20,
                                          pre_emphasis=0.97,
                                          save_param=["fb"],
                                          keep_all_features=True)

    return sidekit.FeaturesServer(features_extractor=extractor,
                                    feature_filename_structure="../../data/feat/{}.h5",
                                    sources=None,
                                    dataset_list=["fb"],
                                    keep_all_features=True)


First we'll compute a spectrogram with the same number of frequency points using a frame size of 768. This creates a very large window, and with a small frame shift of 0.008s there is a huge overlap between frames. This means we get very good frequency resolution, but the features are very blurred in time.
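
To put numbers on that overlap, here is a small calculation that mirrors the `window_size` formula in `make_feature_server`. The helper `window_stats` is hypothetical and only does arithmetic; it does not call Sidekit.

```python
def window_stats(frame_size, shift, sampling_frequency=16000):
    """Window duration in seconds and the fraction of each window shared with the next."""
    window_size = (2 * frame_size + 1) / sampling_frequency   # same formula as above
    overlap = 1 - shift / window_size
    return window_size, overlap

window_size, overlap = window_stats(768, 0.008)
print(f"window: {window_size * 1000:.1f} ms, overlap: {overlap:.1%}")
```

For a frame size of 768 the window is roughly 96 ms long, so consecutive frames share a little over 90% of their samples.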

In [None]:
fs = make_feature_server(768, 0.008)

feat, label = fs.load('T_1000001')
print(feat.shape)
fig = plt.figure(figsize=(15,6))
_=plt.pcolormesh(feat.transpose())

In [None]:
# for comparison, the original features again but truncated to align with the above figure
fig = plt.figure(figsize=(15,6))
_=plt.pcolormesh(spec[:,:291])

The frequency resolution is actually much better in the sidekit version since we're taking more signal but the temporal blurring is very apparent.  

We can get a better temporal resolution with a smaller frame size and the same shift.

In [None]:
fs127 = make_feature_server(127, 0.008)

feat127, label = fs127.load('T_1000001')
print(feat127.shape)
fig = plt.figure(figsize=(15,2))
_=plt.pcolormesh(feat127.transpose())

Note the plosive at around frame 140, which is much more apparent here than even in the original plot and very smudged in the long-window spectrogram above.


## ¿Por qué no los dos? (Why not both?)

Since the goal is to get an 'image' of 768x400 for input to the CNN we could actually combine both narrow and wide band spectra into a single image to get the best of both worlds.  Keeping the 8ms window shift we can compute one spectrum of 127 points and another of 641 points and splice them together into a single 'image'.

In [None]:
fs641 = make_feature_server(641, 0.008)

feat641, label = fs641.load('T_1000001')
print(feat641.shape)

In [None]:
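# the longer window yields fewer frames, so trim both to the common length before splicing
n_frames = min(feat641.shape[0], feat127.shape[0])
feat_combined = np.concatenate((feat641[:n_frames, :], feat127[:n_frames, :]), axis=1)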
fig = plt.figure(figsize=(15,6))
_=plt.pcolormesh(feat_combined.transpose())

Following the original code, we can repeat the data along the time axis to give an overall image size of 768x400.

In [None]:
mat = feat_combined.transpose()
size = mat.shape[1]
# append the leading frames to pad the time axis out to 400 (assumes 200 <= size <= 400)
mat = np.concatenate((mat, mat[:, 0:400-size]), axis=1)
fig = plt.figure(figsize=(15,6))
_=plt.pcolormesh(mat)

In [None]:
# again to compare with the original
fig = plt.figure(figsize=(15,6))
ax = fig.subplots()
_ = plt.pcolormesh(spec)

In [None]:
def normalize(mat, axis):
    """Normalise data to zero mean and unit standard deviation along the given axis"""

    mat = mat - np.mean(mat, axis=axis, keepdims=True)
    mat = np.divide(mat, np.std(mat, axis=axis, keepdims=True))

    return mat

In [None]:
 
fig = plt.figure(figsize=(15,6))
_=plt.pcolormesh(normalize(mat,1))