# Feature Selection Experiments for openSMILE

The [openSMILE Python API](https://audeering.github.io/opensmile-python/) includes the [ComParE 2016](http://www.tangsoo.de/documents/Publications/Schuller16-TI2.pdf) feature set, which lets us extract 65 low-level descriptors (LLDs), e.g. MFCCs, as well as 6373 functionals derived from these descriptors.

For each audio file, the LLDs form a `samples x 65` DataFrame (one row per analysis frame) and the functionals form a `1 x 6373` DataFrame.
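
To make the two shapes concrete, here is a minimal standard-library sketch of how frame-level descriptors collapse into a single row of functionals. The two descriptor names and the three summary statistics are illustrative assumptions, not the ComParE 2016 definitions, which openSMILE computes internally.

```python
from statistics import mean, median

# Toy LLD matrix: one row per frame, one column per descriptor.
# Real ComParE 2016 LLDs have 65 columns; two hypothetical ones are used here.
lld_names = ["energy", "zcr"]
lld_frames = [
    [0.10, 0.30],
    [0.40, 0.20],
    [0.20, 0.50],
    [0.30, 0.40],
]

def functionals(frames, names):
    """Summarise each descriptor over all frames into one flat row."""
    row = {}
    for j, name in enumerate(names):
        column = [frame[j] for frame in frames]
        row[f"{name}_amean"] = mean(column)
        row[f"{name}_range"] = max(column) - min(column)
        row[f"{name}_quartile2"] = median(column)
    return row

row = functionals(lld_frames, lld_names)
print(len(lld_frames), "frames x", len(lld_names), "LLDs ->", len(row), "functionals")
```

However many frames a file has, the functionals row keeps the same length, which is what makes it usable as a fixed-size input to a classifier.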

In these experiments, we extract the functionals for both real and fake audio data and run a set of feature selection techniques to identify a small subset of features for use in a final predictive model.

## Import Statements

In [1]:
import opensmile 
import pandas as pd
import numpy as np
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, confusion_matrix
import random
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
from sklearn import svm
from pprint import pprint
from tqdm import tqdm
base_path = "/home/ubuntu/"

## Data Loading and Summary

To save time, the data is sampled and features are extracted using openSMILE in a separate notebook `sampling_and_mixing_data.ipynb` and saved in `.csv` format for reusability. The dataset used contains original audio from the [LJSpeech 1.1](https://keithito.com/LJ-Speech-Dataset/) dataset and fake audio generated using GANs for the [NeurIPS 2021 WaveFake](https://arxiv.org/abs/2111.02813) dataset. It also contains audio from LJSpeech generated using ElevenLabs.

The dataset contains 12,800 audio files, 6,400 real and 6,400 fake. For each of the architectures listed below, a set of 800 real audio files is paired with the 800 corresponding fakes generated by that architecture.
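
The counts above can be checked with a few lines of arithmetic. The figure of eight generators is taken from the `type` column summary shown later in this notebook; the variable names are illustrative.

```python
# Eight generators (seven WaveFake architectures plus ElevenLabs), per the `type` column.
n_generators = 8
real_per_generator = 800
fake_per_generator = 800  # one fake for each of the 800 real clips

n_real = n_generators * real_per_generator
n_fake = n_generators * fake_per_generator
rows_per_type = real_per_generator + fake_per_generator

print(n_real, n_fake, n_real + n_fake, rows_per_type)
```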

In [2]:
exp_data_file = '/home/ubuntu/testing-code/opensmile-feature-importance/smile_dfs/0310-lj_experimental_data_v1.csv'
exp_data_df = pd.read_csv(exp_data_file)

In [3]:
exp_data_df.head()

Unnamed: 0,id,file,type,fake,duration(seconds),audspec_lengthL1norm_sma_range,audspec_lengthL1norm_sma_maxPos,audspec_lengthL1norm_sma_minPos,audspec_lengthL1norm_sma_quartile1,audspec_lengthL1norm_sma_quartile2,...,mfcc_sma_de[14]_peakRangeAbs,mfcc_sma_de[14]_peakRangeRel,mfcc_sma_de[14]_peakMeanAbs,mfcc_sma_de[14]_peakMeanMeanDist,mfcc_sma_de[14]_peakMeanRel,mfcc_sma_de[14]_minRangeRel,mfcc_sma_de[14]_meanRisingSlope,mfcc_sma_de[14]_stddevRisingSlope,mfcc_sma_de[14]_meanFallingSlope,mfcc_sma_de[14]_stddevFallingSlope
0,LJ032-0137,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,7.762,2.529597,0.608866,0.765319,0.343702,0.589645,...,10.773912,0.657822,2.936347,2.935397,20.0,0.609994,129.24013,64.158455,130.97571,60.847908
1,LJ038-0165,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,8.656,2.406416,0.005841,0.695093,0.295922,0.514596,...,8.458103,0.531263,2.910538,2.911863,-20.0,0.650057,125.02853,58.838596,115.42761,55.276188
2,LJ044-0203,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,5.15,2.695161,0.249012,0.175889,0.375376,0.674925,...,11.555664,0.663116,2.731076,2.73182,-20.0,0.479881,126.26017,54.306473,109.07703,62.24075
3,LJ003-0044,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,3.199,2.287197,0.528846,0.996795,0.478625,0.801715,...,11.162925,0.497092,3.827386,3.820832,19.997963,0.526514,155.38283,77.223274,146.11166,88.62794
4,LJ036-0116,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,2.34,2.027513,0.181416,0.513274,0.277241,0.572403,...,6.160469,0.46208,2.79742,2.774368,19.579084,0.638266,99.164055,44.689617,123.44574,49.45689


In [4]:
exp_data_df.shape

(12800, 6378)

In [5]:
exp_data_df.type.value_counts()

ElevenLabs           1600
Waveglow             1600
Parallel_WaveGan     1600
Multi_Band_MelGan    1600
MelGanLarge          1600
MelGan               1600
HifiGan              1600
Full_Band_MelGan     1600
Name: type, dtype: int64

In [6]:
#check that each id appears twice (a real clip and its corresponding fake)
exp_data_df.id.value_counts()

LJ032-0137    2
LJ038-0118    2
LJ046-0168    2
LJ016-0171    2
LJ037-0020    2
             ..
LJ004-0171    2
LJ042-0184    2
LJ015-0150    2
LJ007-0172    2
LJ028-0069    2
Name: id, Length: 6400, dtype: int64
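
`value_counts()` confirms that every id occurs twice, but not that the two rows are one real and one fake clip. A stricter check could look like the sketch below; it uses plain tuples in place of the DataFrame, and the sample rows (including the deliberately broken pair) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (id, fake) rows standing in for exp_data_df[['id', 'fake']].
rows = [
    ("LJ032-0137", 0), ("LJ032-0137", 1),
    ("LJ038-0118", 0), ("LJ038-0118", 1),
    ("LJ046-0168", 1), ("LJ046-0168", 1),  # deliberately broken pair
]

def unpaired_ids(rows):
    """Return ids whose rows are not exactly one real (0) and one fake (1)."""
    labels_by_id = defaultdict(list)
    for clip_id, fake in rows:
        labels_by_id[clip_id].append(fake)
    return sorted(i for i, labels in labels_by_id.items() if sorted(labels) != [0, 1])

bad_ids = unpaired_ids(rows)
print(bad_ids)
```

On the DataFrame itself, checking that `exp_data_df.groupby('id').fake.sum()` equals 1 for every id would express the same condition, given the counts of 2 shown above.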

## Train-Dev-Test Split:

In [7]:
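#shuffle the rows, then split 80/10/10 into train/dev/test
#note: no random_state is passed to sample(), so the split differs between runs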
f1 = 0.8
f2 = 0.9
train_df, dev_df, test_df = np.split(exp_data_df.sample(frac=1), [int(f1*len(exp_data_df)), int(f2*len(exp_data_df))])

In [8]:
#check split
len(train_df), len(dev_df), len(test_df), (len(train_df) + len(dev_df) + len(test_df))

(10240, 1280, 1280, 12800)
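
The split in the cells above shuffles the rows and cuts the shuffled frame at the 80% and 90% marks. A minimal standard-library sketch of the same logic on row indices follows; the function name and the fixed seed are assumptions added for illustration, since the notebook does not set a seed.

```python
import random

def shuffle_split(n_rows, f1=0.8, f2=0.9, seed=0):
    """Shuffle row indices, then cut at the f1 and f2 fractions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed is an assumption; the notebook sets none
    cut1, cut2 = int(f1 * n_rows), int(f2 * n_rows)
    return indices[:cut1], indices[cut1:cut2], indices[cut2:]

train_idx, dev_idx, test_idx = shuffle_split(12800)
print(len(train_idx), len(dev_idx), len(test_idx))
```

In the notebook itself, passing `random_state` to `DataFrame.sample` would pin the shuffle in the same way. Note that a plain shuffle does not keep the per-architecture counts equal across splits, which is why the `value_counts()` outputs below vary slightly by type.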

In [9]:
train_df.type.value_counts()

HifiGan              1296
Multi_Band_MelGan    1295
ElevenLabs           1294
Parallel_WaveGan     1293
MelGan               1289
Waveglow             1266
Full_Band_MelGan     1263
MelGanLarge          1244
Name: type, dtype: int64

In [10]:
dev_df.type.value_counts()

Multi_Band_MelGan    170
MelGanLarge          168
HifiGan              163
Full_Band_MelGan     163
Waveglow             156
ElevenLabs           155
MelGan               153
Parallel_WaveGan     152
Name: type, dtype: int64

In [11]:
test_df.type.value_counts()

MelGanLarge          188
Waveglow             178
Full_Band_MelGan     174
MelGan               158
Parallel_WaveGan     155
ElevenLabs           151
HifiGan              141
Multi_Band_MelGan    135
Name: type, dtype: int64

## Feature Scaling:

In [12]:
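#fit the scaler on the training split only, then reuse it for dev and test
#columns 5 onward are the 6373 functionals (columns 0-4 are id, file, type, fake, duration)
exp_data_scaler = StandardScaler()
train_df.iloc[:,5:] = exp_data_scaler.fit_transform(train_df.iloc[:,5:])
dev_df.iloc[:,5:] = exp_data_scaler.transform(dev_df.iloc[:,5:])
test_df.iloc[:,5:] = exp_data_scaler.transform(test_df.iloc[:,5:])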
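
The scaler is fitted on the training split and only applied to dev and test, so no statistics leak from the evaluation data. The sketch below spells out that pattern with the standard library; the helper names and toy numbers are assumptions, and it mimics standardization (zero mean, unit population standard deviation) rather than calling `StandardScaler` itself.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Learn per-column mean and population std from the training rows only."""
    columns = list(zip(*rows))
    # A zero std (constant column) is replaced by 1.0 to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def apply_standardizer(rows, params):
    """Scale any split with the statistics learned from the training split."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
dev = [[2.0, 40.0]]

params = fit_standardizer(train)
train_scaled = apply_standardizer(train, params)
dev_scaled = apply_standardizer(dev, params)  # uses train statistics, not its own
print(dev_scaled)
```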

## Experiment 1: Brute Force Feature Selection

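In the first experiment, we split the dataset by the architecture used to generate the fakes. For each architecture, and once more for all architectures combined, we fit a separate logistic regression on each of the 6373 functionals and record its accuracy on the dev set.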

In [13]:
def run_bflr_for_arch(train_df, dev_df, arch, all_archs=False):
    """Fit one single-feature logistic regression per functional; return the dev accuracies."""

    #prepare data: `arch` is a single name, or a list of names when all_archs=True
    if all_archs:
        trdf = train_df[train_df.type.isin(arch)]
        dvdf = dev_df[dev_df.type.isin(arch)]
    else:
        trdf = train_df[train_df.type==arch]
        dvdf = dev_df[dev_df.type==arch]

    meta_cols = ['id', 'file', 'type', 'duration(seconds)', 'fake']
    X_train = trdf.drop(columns=meta_cols).copy()
    y_train = trdf['fake'].copy()

    X_dev = dvdf.drop(columns=meta_cols)
    y_dev = dvdf['fake'].copy()

    train_accuracies = []
    dev_accuracies = []

    #one model per feature column (6373 functionals)
    for i in tqdm(range(X_train.shape[1])):

        x_train_i = X_train.iloc[:,i].to_numpy().reshape(-1, 1)
        x_dev_i = X_dev.iloc[:,i].to_numpy().reshape(-1, 1)

        model_lr = LogisticRegression()
        model_lr.fit(x_train_i, y_train)
        train_accuracies.append(accuracy_score(y_train, model_lr.predict(x_train_i)))
        dev_accuracies.append(accuracy_score(y_dev, model_lr.predict(x_dev_i)))

    print("\nAverage train accuracy: {}".format(np.mean(train_accuracies)))
    print("Average dev accuracy: {}\n".format(np.mean(dev_accuracies)))

    return dev_accuracies
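
The idea behind `run_bflr_for_arch` is to score every feature in isolation. The toy run below illustrates that loop on synthetic data. To stay dependency-free it swaps `LogisticRegression` for a simple midpoint-threshold rule, so its accuracies are not comparable to the notebook's; all names and data here are made up.

```python
import random
from statistics import mean

def fit_threshold(values, labels):
    """One-feature stand-in classifier: threshold halfway between the class means."""
    mean0 = mean(v for v, y in zip(values, labels) if y == 0)
    mean1 = mean(v for v, y in zip(values, labels) if y == 1)
    return (mean0 + mean1) / 2, mean1 >= mean0

def predict(values, model):
    threshold, fake_is_high = model
    return [int((v > threshold) == fake_is_high) for v in values]

def accuracy(y_true, y_pred):
    return mean(int(t == p) for t, p in zip(y_true, y_pred))

rng = random.Random(0)

def make_split(n):
    labels = [i % 2 for i in range(n)]
    informative = [rng.gauss(3.0 * y, 1.0) for y in labels]  # shifts with the label
    noise = [rng.gauss(0.0, 1.0) for _ in labels]            # ignores the label
    return {"informative": informative, "noise": noise}, labels

X_train, y_train = make_split(400)
X_dev, y_dev = make_split(200)

dev_accuracies = {}
for name in X_train:  # one model per feature, as in the notebook's loop
    model = fit_threshold(X_train[name], y_train)
    dev_accuracies[name] = accuracy(y_dev, predict(X_dev[name], model))
print(dev_accuracies)
```

A feature that carries signal on its own scores well above chance, while an uninformative one stays near 0.5, which is the pattern the averages in the output below reflect across 6373 features.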

In [14]:
archs = list(exp_data_df.type.unique())
features = train_df.columns.to_list()[5:]
bruteforce_df = pd.DataFrame(features, columns=['features'])

for arch in archs:
    print("\nRunning for {} architecture\n".format(arch))
    bruteforce_df[arch] = run_bflr_for_arch(train_df, dev_df, arch)

print("\nRunning for all architectures\n")
bruteforce_df['all_archs'] = run_bflr_for_arch(train_df, dev_df, archs, all_archs=True)


Running for ElevenLabs architecture

100%|██████████| 6373/6373 [00:14<00:00, 442.63it/s]

Average train accuracy: 0.6239787686217769
Average dev accuracy: 0.6265079999797533

Running for Waveglow architecture

100%|██████████| 6373/6373 [00:13<00:00, 484.38it/s]

Average train accuracy: 0.5294073610802287
Average dev accuracy: 0.5243857298619576

Running for Parallel_WaveGan architecture

100%|██████████| 6373/6373 [00:13<00:00, 484.73it/s]

Average train accuracy: 0.5303820048058994
Average dev accuracy: 0.5051316408863048

Running for Multi_Band_MelGan architecture

100%|██████████| 6373/6373 [00:13<00:00, 481.64it/s]

Average train accuracy: 0.5243715796673587
Average dev accuracy: 0.5087039994092726

Running for MelGanLarge architecture

100%|██████████| 6373/6373 [00:12<00:00, 500.66it/s]

Average train accuracy: 0.528706313764409
Average dev accuracy: 0.5126015257821315

Running for MelGan architecture

100%|██████████| 6373/6373 [00:13<00:00, 479.08it/s]

Average train accuracy: 0.5244089415721411
Average dev accuracy: 0.49298665017552606

Running for HifiGan architecture

100%|██████████| 6373/6373 [00:12<00:00, 494.81it/s]

Average train accuracy: 0.5171551036103315
Average dev accuracy: 0.5130559424874302

Running for Full_Band_MelGan architecture

100%|██████████| 6373/6373 [00:12<00:00, 490.75it/s]

Average train accuracy: 0.5163550603614143
Average dev accuracy: 0.47603145555588716

Running for all architectures

100%|██████████| 6373/6373 [00:29<00:00, 219.28it/s]

Average train accuracy: 0.5170601162864821
Average dev accuracy: 0.5168382482739683

In [15]:
bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.55).all(1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
1093,pcm_fftMag_fband1000-4000_sma_iqr1-3,0.587097,0.621795,0.736842,0.552941,0.565476,0.555556,0.595092,0.570552,0.555469
1250,pcm_fftMag_spectralFlux_sma_percentile99.0,0.741935,0.551282,0.703947,0.635294,0.660714,0.568627,0.613497,0.613497,0.571875
1251,pcm_fftMag_spectralFlux_sma_pctlrange0-1,0.741935,0.551282,0.703947,0.641176,0.660714,0.575163,0.613497,0.613497,0.56875
1252,pcm_fftMag_spectralFlux_sma_stddev,0.787097,0.570513,0.710526,0.635294,0.690476,0.575163,0.576687,0.570552,0.567969
2954,pcm_fftMag_spectralFlux_sma_de_percentile1.0,0.741935,0.570513,0.684211,0.647059,0.625,0.588235,0.595092,0.582822,0.571875
2956,pcm_fftMag_spectralFlux_sma_de_pctlrange0-1,0.83871,0.634615,0.697368,0.652941,0.636905,0.581699,0.613497,0.552147,0.559375
2957,pcm_fftMag_spectralFlux_sma_de_stddev,0.877419,0.621795,0.736842,0.682353,0.64881,0.562092,0.607362,0.595092,0.564844
3311,mfcc_sma_de[3]_lpgain,0.651613,0.698718,0.572368,0.564706,0.845238,0.797386,0.607362,0.558282,0.582812
3342,mfcc_sma_de[4]_lpgain,0.658065,0.615385,0.585526,0.588235,0.738095,0.732026,0.552147,0.662577,0.592187
3776,jitterDDP_sma_flatness,0.76129,0.75,0.730263,0.817647,0.64881,0.712418,0.748466,0.656442,0.678125


In [16]:
bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.6).all(1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
3776,jitterDDP_sma_flatness,0.76129,0.75,0.730263,0.817647,0.64881,0.712418,0.748466,0.656442,0.678125


In [17]:
bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.55).all(axis=1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
5,audspec_lengthL1norm_sma_quartile3,0.941935,0.551282,0.730263,0.576471,0.559524,0.562092,0.625767,0.588957,0.435156
1093,pcm_fftMag_fband1000-4000_sma_iqr1-3,0.587097,0.621795,0.736842,0.552941,0.565476,0.555556,0.595092,0.570552,0.555469
1250,pcm_fftMag_spectralFlux_sma_percentile99.0,0.741935,0.551282,0.703947,0.635294,0.660714,0.568627,0.613497,0.613497,0.571875
1251,pcm_fftMag_spectralFlux_sma_pctlrange0-1,0.741935,0.551282,0.703947,0.641176,0.660714,0.575163,0.613497,0.613497,0.56875
1252,pcm_fftMag_spectralFlux_sma_stddev,0.787097,0.570513,0.710526,0.635294,0.690476,0.575163,0.576687,0.570552,0.567969
2954,pcm_fftMag_spectralFlux_sma_de_percentile1.0,0.741935,0.570513,0.684211,0.647059,0.625,0.588235,0.595092,0.582822,0.571875
2956,pcm_fftMag_spectralFlux_sma_de_pctlrange0-1,0.83871,0.634615,0.697368,0.652941,0.636905,0.581699,0.613497,0.552147,0.559375
2957,pcm_fftMag_spectralFlux_sma_de_stddev,0.877419,0.621795,0.736842,0.682353,0.64881,0.562092,0.607362,0.595092,0.564844
3311,mfcc_sma_de[3]_lpgain,0.651613,0.698718,0.572368,0.564706,0.845238,0.797386,0.607362,0.558282,0.582812
3342,mfcc_sma_de[4]_lpgain,0.658065,0.615385,0.585526,0.588235,0.738095,0.732026,0.552147,0.662577,0.592187


In [18]:
bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.9).any(axis=1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
1702,mfcc_sma[6]_lpc2,0.645161,0.564103,0.585526,0.505882,0.904762,0.888889,0.460123,0.546012,0.561719
1732,mfcc_sma[7]_lpc1,0.76129,0.378205,0.506579,0.464706,0.910714,0.915033,0.533742,0.447853,0.610938
1762,mfcc_sma[8]_lpc0,0.774194,0.538462,0.5,0.517647,0.964286,0.908497,0.496933,0.527607,0.583594
1763,mfcc_sma[8]_lpc1,0.651613,0.487179,0.486842,0.488235,0.946429,0.954248,0.546012,0.558282,0.608594
1792,mfcc_sma[9]_lpgain,0.56129,0.615385,0.585526,0.541176,0.916667,0.797386,0.478528,0.490798,0.560156
1794,mfcc_sma[9]_lpc1,0.612903,0.564103,0.480263,0.6,0.910714,0.915033,0.582822,0.460123,0.602344
1824,mfcc_sma[10]_lpc0,0.812903,0.50641,0.440789,0.470588,0.928571,0.856209,0.509202,0.558282,0.610938
1825,mfcc_sma[10]_lpc1,0.645161,0.538462,0.473684,0.511765,0.916667,0.915033,0.515337,0.576687,0.603125
1855,mfcc_sma[11]_lpc0,0.83871,0.480769,0.546053,0.576471,0.904762,0.856209,0.515337,0.527607,0.5875
1856,mfcc_sma[11]_lpc1,0.690323,0.608974,0.546053,0.358824,0.880952,0.901961,0.552147,0.539877,0.59375


In [19]:
bruteforce_df[(bruteforce_df.iloc[:,-1] > 0.60)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
66,pcm_RMSenergy_sma_quartile2,0.709677,0.544872,0.743421,0.676471,0.535714,0.575163,0.613497,0.490798,0.60625
67,pcm_RMSenergy_sma_quartile3,0.625806,0.512821,0.743421,0.658824,0.559524,0.535948,0.613497,0.558282,0.607031
1088,pcm_fftMag_fband1000-4000_sma_quartile1,0.774194,0.615385,0.723684,0.623529,0.625,0.535948,0.527607,0.570552,0.619531
1314,pcm_fftMag_spectralEntropy_sma_stddev,0.574194,0.608974,0.592105,0.6,0.738095,0.653595,0.521472,0.570552,0.60625
1668,mfcc_sma[5]_lpgain,0.690323,0.647436,0.644737,0.476471,0.732143,0.72549,0.509202,0.552147,0.615625
1670,mfcc_sma[5]_lpc1,0.83871,0.576923,0.460526,0.482353,0.803571,0.875817,0.588957,0.582822,0.635156
1699,mfcc_sma[6]_lpgain,0.606452,0.583333,0.618421,0.470588,0.815476,0.745098,0.484663,0.552147,0.602344
1700,mfcc_sma[6]_lpc0,0.806452,0.564103,0.526316,0.488235,0.857143,0.803922,0.503067,0.546012,0.600781
1701,mfcc_sma[6]_lpc1,0.709677,0.570513,0.519737,0.470588,0.892857,0.895425,0.533742,0.552147,0.603125
1731,mfcc_sma[7]_lpc0,0.858065,0.455128,0.552632,0.429412,0.869048,0.849673,0.539877,0.466258,0.60625


In [20]:
max_idxs = []
for arch in archs + ['all_archs']:
    max_idxs.append(bruteforce_df[arch].idxmax())
bruteforce_df[bruteforce_df.index.isin(max_idxs)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
96,pcm_zcr_sma_quartile1,1.0,0.512821,0.453947,0.611765,0.517857,0.464052,0.441718,0.546012,0.507031
1762,mfcc_sma[8]_lpc0,0.774194,0.538462,0.5,0.517647,0.964286,0.908497,0.496933,0.527607,0.583594
1763,mfcc_sma[8]_lpc1,0.651613,0.487179,0.486842,0.488235,0.946429,0.954248,0.546012,0.558282,0.608594
3790,jitterDDP_sma_quartile1,0.470968,0.826923,0.835526,0.917647,0.535714,0.869281,0.797546,0.674847,0.7375


In [21]:
#iloc[:,1:] covers all nine score columns (8 architectures + all_archs)
feats_lr_1 = bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.55).all(1)].features.to_list()
feats_lr_2 = bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.6).all(1)].features.to_list()
#iloc[:,2:-1] covers Waveglow through Full_Band_MelGan (ElevenLabs and all_archs are excluded)
feats_lr_3 = bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.55).all(axis=1)].features.to_list()
feats_lr_4 = bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.9).any(axis=1)].features.to_list()
#iloc[:,-1] is the pooled all_archs column
feats_lr_5 = bruteforce_df[(bruteforce_df.iloc[:,-1] > 0.60)].features.to_list()
#best-scoring feature for each column
feats_lr_6 = bruteforce_df[bruteforce_df.index.isin(max_idxs)].features.to_list()

#union of all six selections
exp_1_feature_set = set().union(feats_lr_1, feats_lr_2, feats_lr_3, feats_lr_4, feats_lr_5, feats_lr_6)
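
Each of the six lists applies a different rule to the per-architecture accuracies, and the final set is their union. The sketch below replays three of those rules on four rows copied from the outputs above (values rounded to three decimals), using a plain dictionary in place of `bruteforce_df`. The helper name `select` is an assumption for illustration.

```python
# Column order: ElevenLabs, Waveglow, Parallel_WaveGan, Multi_Band_MelGan,
#               MelGanLarge, MelGan, HifiGan, Full_Band_MelGan, all_archs
dev_acc = {
    "jitterDDP_sma_flatness":
        [0.761, 0.750, 0.730, 0.818, 0.649, 0.712, 0.748, 0.656, 0.678],
    "pcm_fftMag_fband1000-4000_sma_iqr1-3":
        [0.587, 0.622, 0.737, 0.553, 0.565, 0.556, 0.595, 0.571, 0.555],
    "audspec_lengthL1norm_sma_quartile3":
        [0.942, 0.551, 0.730, 0.576, 0.560, 0.562, 0.626, 0.589, 0.435],
    "mfcc_sma[8]_lpc0":
        [0.774, 0.538, 0.500, 0.518, 0.964, 0.908, 0.497, 0.528, 0.584],
}

def select(rule, columns=slice(None)):
    """Return the features whose scores in the chosen columns satisfy the rule."""
    return {feat for feat, scores in dev_acc.items() if rule(scores[columns])}

# Every column above 0.55 (mirrors iloc[:, 1:] on the DataFrame).
feats_1 = select(lambda s: all(v > 0.55 for v in s))
# Waveglow..Full_Band_MelGan only (mirrors iloc[:, 2:-1]).
feats_3 = select(lambda s: all(v > 0.55 for v in s), slice(1, -1))
feats_4 = select(lambda s: any(v > 0.9 for v in s), slice(1, -1))

feature_set = feats_1 | feats_3 | feats_4
print(sorted(feature_set))
```

Because the `2:-1` slice skips the ElevenLabs column, `audspec_lengthL1norm_sma_quartile3` does not pass the "any above 0.9" rule despite its 0.942 ElevenLabs score, but it does pass the "all above 0.55" rule that ignores its low pooled accuracy.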

In [22]:
selected_features = list(exp_1_feature_set)

In [23]:
exp_1_feature_set

{'audspec_lengthL1norm_sma_quartile3',
 'jitterDDP_sma_flatness',
 'jitterDDP_sma_iqr1-2',
 'jitterDDP_sma_percentile1.0',
 'jitterDDP_sma_quartile1',
 'jitterDDP_sma_quartile2',
 'jitterLocal_sma_percentile1.0',
 'jitterLocal_sma_quartile1',
 'mfcc_sma[10]_lpc0',
 'mfcc_sma[10]_lpc1',
 'mfcc_sma[11]_lpc0',
 'mfcc_sma[11]_lpc1',
 'mfcc_sma[12]_lpc0',
 'mfcc_sma[12]_lpc1',
 'mfcc_sma[12]_lpgain',
 'mfcc_sma[13]_lpc0',
 'mfcc_sma[13]_lpc1',
 'mfcc_sma[13]_lpc2',
 'mfcc_sma[14]_lpc0',
 'mfcc_sma[14]_lpc1',
 'mfcc_sma[5]_lpc1',
 'mfcc_sma[5]_lpgain',
 'mfcc_sma[6]_lpc0',
 'mfcc_sma[6]_lpc1',
 'mfcc_sma[6]_lpc2',
 'mfcc_sma[6]_lpgain',
 'mfcc_sma[7]_lpc0',
 'mfcc_sma[7]_lpc1',
 'mfcc_sma[8]_lpc0',
 'mfcc_sma[8]_lpc1',
 'mfcc_sma[9]_lpc0',
 'mfcc_sma[9]_lpc1',
 'mfcc_sma[9]_lpgain',
 'mfcc_sma_de[10]_lpc0',
 'mfcc_sma_de[10]_lpc1',
 'mfcc_sma_de[10]_lpc2',
 'mfcc_sma_de[10]_lpc3',
 'mfcc_sma_de[11]_lpc2',
 'mfcc_sma_de[12]_lpc0',
 'mfcc_sma_de[12]_lpc1',
 'mfcc_sma_de[12]_lpgain',
 'mfcc_sma

In [24]:
len(exp_1_feature_set)

86

#### Test on held-out test data

In [25]:
df_lr_final = pd.concat([train_df, dev_df])
X_train_lr_final = df_lr_final.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_train_lr_final = df_lr_final['fake'].copy()

X_test_lr_final = test_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_test_lr_final = test_df['fake'].copy()

In [26]:
#select features identified from brute force
feature_index = train_df.columns.intersection(selected_features)

#fit model
model_lr_final = LogisticRegression(max_iter=1000)
model_lr_final.fit(X_train_lr_final[feature_index], y_train_lr_final)

#predict on the training data and the held-out test data
yhat_train_final = model_lr_final.predict(X_train_lr_final[feature_index])
yhat_test_final = model_lr_final.predict(X_test_lr_final[feature_index])

#compute accuracy
accuracy_lr_train = accuracy_score(y_train_lr_final, yhat_train_final)
accuracy_lr_test = accuracy_score(y_test_lr_final, yhat_test_final)

#print
print('Logistic accuracy train = %.3f' % (accuracy_lr_train*100))
print('Logistic accuracy test = %.3f' % (accuracy_lr_test*100))

Logistic accuracy train = 92.717
Logistic accuracy test = 92.188


## SFS testing

In [32]:
X_train = train_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_train = train_df['fake'].copy()

X_dev = dev_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake'])
y_dev = dev_df['fake'].copy()

X_train_final = pd.concat([X_train, X_dev])
y_train_final = pd.concat([y_train, y_dev])

X_test = test_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake'])
y_test = test_df['fake'].copy()

In [30]:
savefile = base_path + 'testing-code/opensmile-feature-importance/smile_dfs/run_data/' + '0311-exp3_feats.txt'
selected_feats = np.genfromtxt(savefile ,dtype='str')

In [31]:
selected_feats

array(['audspec_lengthL1norm_sma_quartile3', 'pcm_zcr_sma_range',
       'pcm_zcr_sma_quartile1', 'pcm_zcr_sma_quartile2',
       'pcm_zcr_sma_pctlrange0-1', 'pcm_fftMag_spectralEntropy_sma_range',
       'pcm_fftMag_spectralSlope_sma_percentile99.0', 'mfcc_sma[3]_lpc1',
       'mfcc_sma[4]_lpc0', 'mfcc_sma[4]_lpc1', 'mfcc_sma[5]_lpgain',
       'mfcc_sma[5]_lpc0', 'mfcc_sma[5]_lpc1', 'mfcc_sma[6]_lpgain',
       'mfcc_sma[6]_lpc0', 'mfcc_sma[6]_lpc1', 'mfcc_sma[7]_lpgain',
       'mfcc_sma[7]_lpc0', 'mfcc_sma[7]_lpc1', 'mfcc_sma[8]_lpc0',
       'mfcc_sma[8]_lpc1', 'mfcc_sma[9]_lpgain', 'mfcc_sma[9]_lpc0',
       'mfcc_sma[9]_lpc1', 'mfcc_sma[10]_lpgain', 'mfcc_sma[10]_lpc0',
       'mfcc_sma[10]_lpc1', 'mfcc_sma[11]_lpgain', 'mfcc_sma[11]_lpc0',
       'mfcc_sma[11]_lpc1', 'mfcc_sma[12]_lpgain', 'mfcc_sma[12]_lpc0',
       'mfcc_sma[12]_lpc1', 'mfcc_sma[13]_lpgain', 'mfcc_sma[13]_lpc0',
       'mfcc_sma[13]_lpc1', 'mfcc_sma[14]_lpgain', 'mfcc_sma[14]_lpc0',
       'mfcc_sma[14]_lpc1'

In [28]:
from sklearn.feature_selection import SequentialFeatureSelector

In [34]:
sfs_selector = SequentialFeatureSelector(estimator=svm.SVC(),
                                         n_features_to_select=15,
                                         cv=5,
                                         direction='forward')

#restrict to the features loaded from the experiment 3 file
feature_index = train_df.columns.intersection(selected_feats)

sfs_selector.fit(X_train_final[feature_index], y_train_final)

SequentialFeatureSelector(estimator=SVC(), n_features_to_select=15)

In [35]:
sfs_selector.get_feature_names_out()

array(['pcm_zcr_sma_range', 'mfcc_sma[6]_lpc0', 'mfcc_sma[6]_lpc1',
       'mfcc_sma_de[5]_lpc2', 'mfcc_sma_de[6]_lpgain',
       'mfcc_sma_de[8]_lpc2', 'mfcc_sma_de[9]_lpgain',
       'mfcc_sma_de[9]_lpc2', 'mfcc_sma_de[13]_lpgain',
       'mfcc_sma_de[14]_lpgain', 'jitterLocal_sma_quartile1',
       'jitterDDP_sma_flatness', 'jitterDDP_sma_quartile1',
       'jitterDDP_sma_percentile1.0', 'audspec_lengthL1norm_sma_amean'],
      dtype=object)
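
Forward selection starts from an empty set and repeatedly adds the single feature that most improves a score, stopping at the requested size. The sketch below shows that greedy loop on toy data. It is not scikit-learn's implementation: the score here is the training accuracy of a nearest-centroid rule rather than 5-fold cross-validated `SVC` accuracy, and all names and data are made up.

```python
import random
from statistics import mean

def centroid_accuracy(X, y, cols):
    """Accuracy of a nearest-centroid rule restricted to the given columns."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, t in zip(X, y) if t == label]
        centroids[label] = [mean(r[c] for r in rows) for c in cols]

    def predict(row):
        dist = {k: sum((row[c] - m) ** 2 for c, m in zip(cols, cen))
                for k, cen in centroids.items()}
        return min(dist, key=dist.get)

    return mean(int(predict(row) == t) for row, t in zip(X, y))

def forward_select(X, y, n_features_to_select, score=centroid_accuracy):
    """Greedily add the column that gives the best score at each step."""
    selected, remaining = [], list(range(len(X[0])))
    while len(selected) < n_features_to_select:
        best = max(remaining, key=lambda c: score(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = random.Random(0)
y = [i % 2 for i in range(600)]
# Columns 0 and 2 carry signal; columns 1 and 3 are pure noise.
X = [[rng.gauss(2.0 * t, 1.0), rng.gauss(0.0, 1.0),
      rng.gauss(-2.0 * t, 1.0), rng.gauss(0.0, 1.0)] for t in y]

selected = forward_select(X, y, n_features_to_select=2)
print(selected)
```

Each step re-scores every remaining candidate together with the features already chosen, so the cost grows with both the candidate pool and the target size. That is one reason to run it on a pre-filtered list rather than on all 6373 functionals.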

In [36]:
selected_feats_e4 = sfs_selector.get_feature_names_out()


In [37]:
selected_feats_e4

array(['pcm_zcr_sma_range', 'mfcc_sma[6]_lpc0', 'mfcc_sma[6]_lpc1',
       'mfcc_sma_de[5]_lpc2', 'mfcc_sma_de[6]_lpgain',
       'mfcc_sma_de[8]_lpc2', 'mfcc_sma_de[9]_lpgain',
       'mfcc_sma_de[9]_lpc2', 'mfcc_sma_de[13]_lpgain',
       'mfcc_sma_de[14]_lpgain', 'jitterLocal_sma_quartile1',
       'jitterDDP_sma_flatness', 'jitterDDP_sma_quartile1',
       'jitterDDP_sma_percentile1.0', 'audspec_lengthL1norm_sma_amean'],
      dtype=object)

In [38]:
savefile_e4 = base_path + 'testing-code/opensmile-feature-importance/smile_dfs/run_data/' + '0312-exp4_feats.txt'
np.savetxt(savefile_e4, selected_feats_e4, fmt='%s')