# Feature Selection Experiments for openSMILE

The [openSMILE Python API](https://audeering.github.io/opensmile-python/) includes the [ComParE 2016](http://www.tangsoo.de/documents/Publications/Schuller16-TI2.pdf) feature set, which lets us extract 65 low-level descriptors (LLDs), such as MFCCs, as well as 6373 functionals derived from these descriptors.

For each audio file, the LLDs form a `samples x 65` DataFrame and the functionals form a `1 x 6373` DataFrame.

In these experiments, we extract the functionals for both real and fake audio and run a set of feature selection techniques to identify a small subset of features for use in a final predictive model.
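
For reference, functionals of this shape could be extracted roughly as follows. This is a minimal sketch, assuming the `opensmile` package is installed; `extract_functionals` is a hypothetical helper name, and the extraction actually used for these experiments lives in a separate notebook.

```python
def extract_functionals(wav_path):
    """Return the ComParE 2016 functionals for one audio file as a DataFrame.

    Hypothetical helper for illustration; not the code used to build the CSV.
    """
    # Imported inside the function so the sketch can be loaded without the dependency
    import opensmile

    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    # For the functionals level, the result summarises the whole file
    return smile.process_file(wav_path)
```

Passing `opensmile.FeatureLevel.LowLevelDescriptors` as `feature_level` instead should return the frame-level LLDs rather than the per-file functionals.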

## Import Statements

In [1]:
import opensmile 
import pandas as pd
import numpy as np
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, confusion_matrix
import random
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
from sklearn import svm
from pprint import pprint
from tqdm import tqdm
base_path = "/home/ubuntu/"

## Data Loading and Summary

To save time, the data is sampled and the features are extracted with openSMILE in a separate notebook, `sampling_and_mixing_data.ipynb`, and saved in `.csv` format for reuse. The dataset contains original audio from the [LJSpeech 1.1](https://keithito.com/LJ-Speech-Dataset/) dataset and fake audio generated using GANs for the [NeurIPS 2021 WaveFake](https://arxiv.org/abs/2111.02813) dataset. It also contains fake versions of LJSpeech audio generated using ElevenLabs.

The dataset contains 12,800 audio files: 6,400 real and 6,400 fake. For each of the architectures listed below, a set of 800 real audio files is matched with the 800 corresponding fakes generated by that architecture.

In [2]:
exp_data_file = '/home/ubuntu/testing-code/opensmile-feature-importance/smile_dfs/0310-lj_experimental_data_v1.csv'
exp_data_df = pd.read_csv(exp_data_file)

In [3]:
exp_data_df.head()

Unnamed: 0,id,file,type,fake,duration(seconds),audspec_lengthL1norm_sma_range,audspec_lengthL1norm_sma_maxPos,audspec_lengthL1norm_sma_minPos,audspec_lengthL1norm_sma_quartile1,audspec_lengthL1norm_sma_quartile2,...,mfcc_sma_de[14]_peakRangeAbs,mfcc_sma_de[14]_peakRangeRel,mfcc_sma_de[14]_peakMeanAbs,mfcc_sma_de[14]_peakMeanMeanDist,mfcc_sma_de[14]_peakMeanRel,mfcc_sma_de[14]_minRangeRel,mfcc_sma_de[14]_meanRisingSlope,mfcc_sma_de[14]_stddevRisingSlope,mfcc_sma_de[14]_meanFallingSlope,mfcc_sma_de[14]_stddevFallingSlope
0,LJ032-0137,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,7.762,2.529597,0.608866,0.765319,0.343702,0.589645,...,10.773912,0.657822,2.936347,2.935397,20.0,0.609994,129.24013,64.158455,130.97571,60.847908
1,LJ038-0165,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,8.656,2.406416,0.005841,0.695093,0.295922,0.514596,...,8.458103,0.531263,2.910538,2.911863,-20.0,0.650057,125.02853,58.838596,115.42761,55.276188
2,LJ044-0203,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,5.15,2.695161,0.249012,0.175889,0.375376,0.674925,...,11.555664,0.663116,2.731076,2.73182,-20.0,0.479881,126.26017,54.306473,109.07703,62.24075
3,LJ003-0044,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,3.199,2.287197,0.528846,0.996795,0.478625,0.801715,...,11.162925,0.497092,3.827386,3.820832,19.997963,0.526514,155.38283,77.223274,146.11166,88.62794
4,LJ036-0116,/home/ubuntu/data/wavefake_data/LJSpeech_1.1/w...,ElevenLabs,0,2.34,2.027513,0.181416,0.513274,0.277241,0.572403,...,6.160469,0.46208,2.79742,2.774368,19.579084,0.638266,99.164055,44.689617,123.44574,49.45689


In [4]:
exp_data_df.shape

(12800, 6378)

In [5]:
exp_data_df.type.value_counts()

ElevenLabs           1600
Waveglow             1600
Parallel_WaveGan     1600
Multi_Band_MelGan    1600
MelGanLarge          1600
MelGan               1600
HifiGan              1600
Full_Band_MelGan     1600
Name: type, dtype: int64

In [6]:
#check to ensure each id has a corresponding fake
exp_data_df.id.value_counts()

LJ032-0137    2
LJ038-0118    2
LJ046-0168    2
LJ016-0171    2
LJ037-0020    2
             ..
LJ004-0171    2
LJ042-0184    2
LJ015-0150    2
LJ007-0172    2
LJ028-0069    2
Name: id, Length: 6400, dtype: int64
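
`value_counts` confirms that every id appears twice, but not that the two rows are one real and one fake. A stricter check could look like the sketch below. It assumes `fake` is coded as 0/1, as in the `head()` output above; `ids_with_bad_pairing` and the toy data are made up for illustration.

```python
import pandas as pd

def ids_with_bad_pairing(df):
    """Return the ids that do not have exactly one real (fake=0) and one fake (fake=1) row."""
    per_id = df.groupby("id")["fake"].agg(["count", "sum"])
    bad = per_id[(per_id["count"] != 2) | (per_id["sum"] != 1)]
    return bad.index.tolist()

# Toy data (made up): "b" has two real rows, "c" has a single row
toy = pd.DataFrame({
    "id":   ["a", "a", "b", "b", "c"],
    "fake": [0,   1,   0,   0,   1],
})
print(ids_with_bad_pairing(toy))
```

On the real data, an empty list from `ids_with_bad_pairing(exp_data_df)` would indicate that every id has one real and one fake row.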

## Train-Dev-Test Split:

In [7]:
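#shuffle the rows, then split 80/10/10 into train, dev and test
#note: no random_state is passed to sample(), so the split differs between runs
f1 = 0.8
f2 = 0.9
train_df, dev_df, test_df = np.split(exp_data_df.sample(frac=1), [int(f1*len(exp_data_df)), int(f2*len(exp_data_df))])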

In [8]:
#check split
len(train_df), len(dev_df), len(test_df), (len(train_df) + len(dev_df) + len(test_df))

(10240, 1280, 1280, 12800)
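
The shuffle-then-cut logic can also be written with a fixed seed so that the split is repeatable. This is a sketch on toy data, not the split used for the results in this notebook; `shuffle_and_split` and the `seed` parameter are hypothetical.

```python
import pandas as pd

def shuffle_and_split(df, f1=0.8, f2=0.9, seed=0):
    """Shuffle rows with a fixed seed, then cut at the f1 and f2 fractions."""
    shuffled = df.sample(frac=1, random_state=seed)
    n1, n2 = int(f1 * len(df)), int(f2 * len(df))
    # .copy() makes each part an independent frame that is safe to modify in place
    return (shuffled.iloc[:n1].copy(),
            shuffled.iloc[n1:n2].copy(),
            shuffled.iloc[n2:].copy())

toy = pd.DataFrame({"x": range(20)})
train, dev, test = shuffle_and_split(toy)
print(len(train), len(dev), len(test))
```

Calling the function twice with the same `seed` returns the same three frames, which makes later accuracy numbers comparable across runs.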

In [9]:
train_df.type.value_counts()

MelGan               1289
Waveglow             1286
HifiGan              1285
ElevenLabs           1281
MelGanLarge          1280
Multi_Band_MelGan    1277
Parallel_WaveGan     1272
Full_Band_MelGan     1270
Name: type, dtype: int64

In [10]:
dev_df.type.value_counts()

Full_Band_MelGan     171
Waveglow             167
MelGanLarge          165
ElevenLabs           163
Multi_Band_MelGan    158
Parallel_WaveGan     158
MelGan               157
HifiGan              141
Name: type, dtype: int64

In [11]:
test_df.type.value_counts()

HifiGan              174
Parallel_WaveGan     170
Multi_Band_MelGan    165
Full_Band_MelGan     159
ElevenLabs           156
MelGanLarge          155
MelGan               154
Waveglow             147
Name: type, dtype: int64

## Feature Scaling:

In [12]:
exp_data_scaler = StandardScaler()
train_df.iloc[:,5:] = exp_data_scaler.fit_transform(train_df.iloc[:,5:])
dev_df.iloc[:,5:] = exp_data_scaler.transform(dev_df.iloc[:,5:])
test_df.iloc[:,5:] = exp_data_scaler.transform(test_df.iloc[:,5:])
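
The scaler is fit on the training rows only and then reused on the dev and test rows. The sketch below reproduces that idea in plain NumPy on made-up values, to show what `fit_transform` and `transform` compute: a per-column mean and population standard deviation taken from the training split.

```python
import numpy as np

# Toy feature matrices (values made up for illustration)
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
dev = np.array([[3.0, 10.0]])

# "fit": statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to every split
train_scaled = (train - mean) / std
dev_scaled = (dev - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(dev_scaled)
```

Because the dev and test rows are scaled with training statistics, their columns are not guaranteed to have zero mean or unit variance.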

## Experiment 1: Brute Force Feature Selection

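In the first experiment, we break the dataset down by the architecture used to generate the fakes. For each architecture, and for all architectures combined, we fit a logistic regression on each of the 6373 features individually and record its accuracy on the dev set.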

In [13]:
def run_bflr_for_arch(train_df, dev_df, arch, all_archs=False):
    """Fit one single-feature logistic regression per feature; return the dev accuracies."""

    #prepare data: `arch` is a list of architectures when all_archs is True, else a single name
    if all_archs:
        trdf = train_df[train_df.type.isin(arch)]
        dvdf = dev_df[dev_df.type.isin(arch)]
    else:
        trdf = train_df[train_df.type==arch]
        dvdf = dev_df[dev_df.type==arch]

    meta_cols = ['id', 'file', 'type', 'duration(seconds)', 'fake']
    X_train = trdf.drop(columns=meta_cols).copy()
    y_train = trdf['fake'].copy()

    X_dev = dvdf.drop(columns=meta_cols).copy()
    y_dev = dvdf['fake'].copy()

    train_accuracies = []
    dev_accuracies = []

    #fit and score a separate model on each feature column
    for i in tqdm(range(X_train.shape[1])):

        x_train_i = X_train.iloc[:,i].to_numpy().reshape(-1, 1)
        x_dev_i = X_dev.iloc[:,i].to_numpy().reshape(-1, 1)

        model_lr = LogisticRegression()
        model_lr.fit(x_train_i, y_train)
        train_accuracies.append(accuracy_score(y_train, model_lr.predict(x_train_i)))
        dev_accuracies.append(accuracy_score(y_dev, model_lr.predict(x_dev_i)))

    print("\nAverage train accuracy: {}".format(np.mean(train_accuracies)))
    print("Average dev accuracy: {}\n".format(np.mean(dev_accuracies)))

    return dev_accuracies
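
To see what the per-feature loop rewards, here is a small self-contained sketch on synthetic data, where only one of three columns carries information about the label. The data, the shift of 3.0 and the alternating train/dev split are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400
y = np.repeat([0, 1], n // 2)

# Three synthetic features: only column 1 depends on the label
X = rng.normal(size=(n, 3))
X[:, 1] += 3.0 * y

# Alternate rows between a train and a dev split
train, dev = np.arange(n) % 2 == 0, np.arange(n) % 2 == 1

dev_accuracies = []
for i in range(X.shape[1]):
    column = X[:, [i]]  # keep the 2-D shape (n_samples, 1)
    model = LogisticRegression().fit(column[train], y[train])
    dev_accuracies.append(accuracy_score(y[dev], model.predict(column[dev])))

best = int(np.argmax(dev_accuracies))
print(best, dev_accuracies)
```

The informative column should score well above the two noise columns, which hover around chance. The same reading applies to the per-architecture accuracies collected below.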

In [14]:
archs = list(exp_data_df.type.unique())
features = train_df.columns.to_list()[5:]
bruteforce_df = pd.DataFrame(features, columns=['features'])

for arch in archs:
    print("\nRunning for {} architecture\n".format(arch))
    bruteforce_df[arch] = run_bflr_for_arch(train_df, dev_df, arch)

print("\nRunning for all architectures\n")
bruteforce_df['all_archs'] = run_bflr_for_arch(train_df, dev_df, archs, all_archs=True)


Running for ElevenLabs architecture

100%|██████████| 6373/6373 [00:49<00:00, 129.48it/s]

Average train accuracy: 0.623802627522213
Average dev accuracy: 0.6179231978467441

Running for Waveglow architecture

100%|██████████| 6373/6373 [00:45<00:00, 141.19it/s]

Average train accuracy: 0.532751042683717
Average dev accuracy: 0.5008677138113541

Running for Parallel_WaveGan architecture

100%|██████████| 6373/6373 [00:44<00:00, 143.68it/s]

Average train accuracy: 0.52813843682122
Average dev accuracy: 0.5346040554793065

Running for Multi_Band_MelGan architecture

100%|██████████| 6373/6373 [00:41<00:00, 153.57it/s]

Average train accuracy: 0.5262508077526065
Average dev accuracy: 0.5086043375236112

Running for MelGanLarge architecture

100%|██████████| 6373/6373 [00:46<00:00, 135.67it/s]

Average train accuracy: 0.5281166640514672
Average dev accuracy: 0.5318640666828334

Running for MelGan architecture

100%|██████████| 6373/6373 [00:41<00:00, 153.69it/s]

Average train accuracy: 0.5282941258317156
Average dev accuracy: 0.46359692212668696

Running for HifiGan architecture

100%|██████████| 6373/6373 [00:35<00:00, 177.11it/s]

Average train accuracy: 0.515744864796219
Average dev accuracy: 0.5077849482468704

Running for Full_Band_MelGan architecture

100%|██████████| 6373/6373 [00:41<00:00, 155.41it/s]

Average train accuracy: 0.5153243691706276
Average dev accuracy: 0.4953811905672964

Running for all architectures

100%|██████████| 6373/6373 [03:24<00:00, 31.19it/s]

Average train accuracy: 0.5169546144378628
Average dev accuracy: 0.5037376922171661

In [15]:
bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.55).all(1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
1240,pcm_fftMag_spectralFlux_sma_range,0.638037,0.586826,0.702532,0.651899,0.642424,0.592357,0.638298,0.578947,0.592969
1250,pcm_fftMag_spectralFlux_sma_percentile99.0,0.736196,0.568862,0.708861,0.651899,0.648485,0.592357,0.602837,0.573099,0.591406
1251,pcm_fftMag_spectralFlux_sma_pctlrange0-1,0.736196,0.562874,0.708861,0.64557,0.648485,0.598726,0.595745,0.567251,0.591406
1252,pcm_fftMag_spectralFlux_sma_stddev,0.791411,0.580838,0.702532,0.639241,0.642424,0.617834,0.58156,0.614035,0.582031
2950,pcm_fftMag_spectralFlux_sma_de_quartile3,0.723926,0.610778,0.651899,0.601266,0.557576,0.55414,0.64539,0.584795,0.559375
2954,pcm_fftMag_spectralFlux_sma_de_percentile1.0,0.711656,0.652695,0.664557,0.664557,0.636364,0.630573,0.602837,0.584795,0.583594
2956,pcm_fftMag_spectralFlux_sma_de_pctlrange0-1,0.803681,0.640719,0.683544,0.639241,0.636364,0.598726,0.64539,0.602339,0.56875
2957,pcm_fftMag_spectralFlux_sma_de_stddev,0.822086,0.688623,0.708861,0.683544,0.630303,0.598726,0.588652,0.578947,0.56875
3311,mfcc_sma_de[3]_lpgain,0.588957,0.676647,0.56962,0.575949,0.842424,0.751592,0.588652,0.596491,0.578906
3737,jitterLocal_sma_flatness,0.742331,0.682635,0.588608,0.71519,0.642424,0.598726,0.659574,0.637427,0.607031


In [16]:
bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.6).all(1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
3776,jitterDDP_sma_flatness,0.638037,0.694611,0.740506,0.778481,0.630303,0.738854,0.758865,0.690058,0.679688


In [17]:
bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.55).all(axis=1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
1240,pcm_fftMag_spectralFlux_sma_range,0.638037,0.586826,0.702532,0.651899,0.642424,0.592357,0.638298,0.578947,0.592969
1247,pcm_fftMag_spectralFlux_sma_iqr2-3,0.773006,0.580838,0.601266,0.563291,0.624242,0.579618,0.553191,0.584795,0.529687
1250,pcm_fftMag_spectralFlux_sma_percentile99.0,0.736196,0.568862,0.708861,0.651899,0.648485,0.592357,0.602837,0.573099,0.591406
1251,pcm_fftMag_spectralFlux_sma_pctlrange0-1,0.736196,0.562874,0.708861,0.64557,0.648485,0.598726,0.595745,0.567251,0.591406
1252,pcm_fftMag_spectralFlux_sma_stddev,0.791411,0.580838,0.702532,0.639241,0.642424,0.617834,0.58156,0.614035,0.582031
2950,pcm_fftMag_spectralFlux_sma_de_quartile3,0.723926,0.610778,0.651899,0.601266,0.557576,0.55414,0.64539,0.584795,0.559375
2954,pcm_fftMag_spectralFlux_sma_de_percentile1.0,0.711656,0.652695,0.664557,0.664557,0.636364,0.630573,0.602837,0.584795,0.583594
2956,pcm_fftMag_spectralFlux_sma_de_pctlrange0-1,0.803681,0.640719,0.683544,0.639241,0.636364,0.598726,0.64539,0.602339,0.56875
2957,pcm_fftMag_spectralFlux_sma_de_stddev,0.822086,0.688623,0.708861,0.683544,0.630303,0.598726,0.588652,0.578947,0.56875
2970,pcm_fftMag_spectralFlux_sma_de_lpgain,0.883436,0.628743,0.639241,0.651899,0.654545,0.55414,0.588652,0.561404,0.485938


In [18]:
bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.9).any(axis=1)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
1731,mfcc_sma[7]_lpc0,0.865031,0.508982,0.525316,0.468354,0.860606,0.904459,0.560284,0.497076,0.621875
1732,mfcc_sma[7]_lpc1,0.779141,0.449102,0.506329,0.493671,0.89697,0.923567,0.58156,0.473684,0.651563
1762,mfcc_sma[8]_lpc0,0.717791,0.60479,0.531646,0.481013,0.927273,0.859873,0.510638,0.590643,0.596094
1763,mfcc_sma[8]_lpc1,0.631902,0.45509,0.506329,0.468354,0.939394,0.898089,0.553191,0.625731,0.615625
1793,mfcc_sma[9]_lpc0,0.754601,0.449102,0.525316,0.575949,0.909091,0.840764,0.560284,0.561404,0.601562
1794,mfcc_sma[9]_lpc1,0.582822,0.51497,0.506329,0.512658,0.939394,0.866242,0.609929,0.573099,0.582812
1823,mfcc_sma[10]_lpgain,0.588957,0.580838,0.670886,0.537975,0.90303,0.821656,0.64539,0.51462,0.596875
1824,mfcc_sma[10]_lpc0,0.815951,0.479042,0.5,0.493671,0.909091,0.898089,0.58156,0.567251,0.626563
1825,mfcc_sma[10]_lpc1,0.662577,0.532934,0.575949,0.537975,0.915152,0.942675,0.553191,0.590643,0.639062
1856,mfcc_sma[11]_lpc1,0.650307,0.461078,0.487342,0.474684,0.915152,0.898089,0.553191,0.584795,0.603125


In [19]:
bruteforce_df[(bruteforce_df.iloc[:,-1] > 0.60)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
67,pcm_RMSenergy_sma_quartile3,0.644172,0.562874,0.632911,0.689873,0.587879,0.509554,0.617021,0.590643,0.604688
1088,pcm_fftMag_fband1000-4000_sma_quartile1,0.723926,0.580838,0.632911,0.588608,0.612121,0.477707,0.609929,0.54386,0.601562
1670,mfcc_sma[5]_lpc1,0.797546,0.54491,0.468354,0.455696,0.884848,0.828025,0.567376,0.555556,0.613281
1671,mfcc_sma[5]_lpc2,0.533742,0.532934,0.506329,0.575949,0.8,0.796178,0.617021,0.573099,0.617188
1700,mfcc_sma[6]_lpc0,0.865031,0.479042,0.550633,0.493671,0.860606,0.834395,0.496454,0.526316,0.610156
1701,mfcc_sma[6]_lpc1,0.736196,0.497006,0.556962,0.544304,0.89697,0.89172,0.503546,0.573099,0.632812
1731,mfcc_sma[7]_lpc0,0.865031,0.508982,0.525316,0.468354,0.860606,0.904459,0.560284,0.497076,0.621875
1732,mfcc_sma[7]_lpc1,0.779141,0.449102,0.506329,0.493671,0.89697,0.923567,0.58156,0.473684,0.651563
1763,mfcc_sma[8]_lpc1,0.631902,0.45509,0.506329,0.468354,0.939394,0.898089,0.553191,0.625731,0.615625
1793,mfcc_sma[9]_lpc0,0.754601,0.449102,0.525316,0.575949,0.909091,0.840764,0.560284,0.561404,0.601562


In [20]:
max_idxs = []
for arch in archs + ['all_archs']:
    max_idxs.append(bruteforce_df[arch].idxmax())
bruteforce_df[bruteforce_df.index.isin(max_idxs)]

Unnamed: 0,features,ElevenLabs,Waveglow,Parallel_WaveGan,Multi_Band_MelGan,MelGanLarge,MelGan,HifiGan,Full_Band_MelGan,all_archs
96,pcm_zcr_sma_quartile1,1.0,0.497006,0.481013,0.620253,0.472727,0.471338,0.475177,0.584795,0.538281
1885,mfcc_sma[12]_lpgain,0.791411,0.538922,0.56962,0.734177,0.945455,0.802548,0.588652,0.45614,0.596094
3530,mfcc_sma_de[10]_lpc1,0.644172,0.443114,0.468354,0.449367,0.90303,0.949045,0.58156,0.561404,0.620313
3790,jitterDDP_sma_quartile1,0.515337,0.856287,0.772152,0.949367,0.593939,0.834395,0.822695,0.754386,0.744531


In [21]:
feats_lr_1 = bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.55).all(1)].features.to_list()
feats_lr_2 = bruteforce_df[(bruteforce_df.iloc[:,1:] > 0.6).all(1)].features.to_list()
feats_lr_3 = bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.55).all(axis=1)].features.to_list()
feats_lr_4 = bruteforce_df[(bruteforce_df.iloc[:,2:-1] > 0.9).any(axis=1)].features.to_list()
feats_lr_5 = bruteforce_df[(bruteforce_df.iloc[:,-1] > 0.60)].features.to_list()
feats_lr_6 = bruteforce_df[bruteforce_df.index.isin(max_idxs)].features.to_list()

exp_1_feature_set = set().union(feats_lr_1, feats_lr_2, feats_lr_3, feats_lr_4, feats_lr_5, feats_lr_6)

In [22]:
selected_features = list(exp_1_feature_set)

In [23]:
exp_1_feature_set

{'jitterDDP_sma_flatness',
 'jitterDDP_sma_percentile1.0',
 'jitterDDP_sma_quartile1',
 'jitterDDP_sma_quartile2',
 'jitterLocal_sma_flatness',
 'jitterLocal_sma_percentile1.0',
 'jitterLocal_sma_quartile1',
 'mfcc_sma[10]_lpc0',
 'mfcc_sma[10]_lpc1',
 'mfcc_sma[10]_lpgain',
 'mfcc_sma[11]_lpc1',
 'mfcc_sma[12]_lpc1',
 'mfcc_sma[12]_lpgain',
 'mfcc_sma[13]_lpc0',
 'mfcc_sma[13]_lpc1',
 'mfcc_sma[13]_lpgain',
 'mfcc_sma[14]_lpc1',
 'mfcc_sma[14]_lpgain',
 'mfcc_sma[5]_lpc1',
 'mfcc_sma[5]_lpc2',
 'mfcc_sma[6]_lpc0',
 'mfcc_sma[6]_lpc1',
 'mfcc_sma[7]_lpc0',
 'mfcc_sma[7]_lpc1',
 'mfcc_sma[8]_lpc0',
 'mfcc_sma[8]_lpc1',
 'mfcc_sma[9]_lpc0',
 'mfcc_sma[9]_lpc1',
 'mfcc_sma_de[10]_lpc0',
 'mfcc_sma_de[10]_lpc1',
 'mfcc_sma_de[10]_lpc2',
 'mfcc_sma_de[10]_lpgain',
 'mfcc_sma_de[11]_lpc0',
 'mfcc_sma_de[11]_lpc1',
 'mfcc_sma_de[12]_lpc1',
 'mfcc_sma_de[12]_lpgain',
 'mfcc_sma_de[13]_lpc2',
 'mfcc_sma_de[13]_lpgain',
 'mfcc_sma_de[2]_lpgain',
 'mfcc_sma_de[3]_lpgain',
 'mfcc_sma_de[5]_lpc0',


In [24]:
len(exp_1_feature_set)

88

#### Test on held out test data

In [25]:
df_lr_final = pd.concat([train_df, dev_df])
X_train_lr_final = df_lr_final.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_train_lr_final = df_lr_final['fake'].copy()

X_test_lr_final = test_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_test_lr_final = test_df['fake'].copy()

In [26]:
#select features identified from brute force
feature_index = train_df.columns.intersection(selected_features)

#fit model
model_lr_final = LogisticRegression(max_iter=1000)
model_lr_final.fit(X_train_lr_final[feature_index], y_train_lr_final)

#predict on the training data and the held-out test data
yhat_train_final = model_lr_final.predict(X_train_lr_final[feature_index])
yhat_test_final = model_lr_final.predict(X_test_lr_final[feature_index])

#compute accuracy
accuracy_lr_train = accuracy_score(y_train_lr_final, yhat_train_final)
accuracy_lr_test = accuracy_score(y_test_lr_final, yhat_test_final)

#print
print('Logistic accuracy train = %.3f' % (accuracy_lr_train*100))
print('Logistic accuracy test = %.3f' % (accuracy_lr_test*100))

Logistic accuracy train = 92.682
Logistic accuracy test = 92.344
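
Only accuracy is reported here, although `confusion_matrix` and `roc_auc_score` are imported at the top of the notebook. The sketch below shows how they could be applied, using made-up labels and scores. With the fitted model, the scores would presumably come from `model_lr_final.predict_proba(...)[:, 1]`.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels and scores (made up); 1 = fake, 0 = real
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.6, 0.7, 0.9]  # e.g. predicted probability of "fake"
y_pred = [int(s >= 0.5) for s in y_score]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(cm.tolist(), auc)
```

The confusion matrix separates real clips flagged as fake from fakes that slip through, which a single accuracy figure cannot show.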


## SQS testing

In [27]:
X_train = train_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_train = train_df['fake'].copy()

X_dev = dev_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake'])
y_dev = dev_df['fake'].copy()

X_test = test_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake'])
y_test = test_df['fake'].copy()

In [28]:
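#note: `fake` is a binary label; scikit-learn also provides mutual_info_classif for discrete targets
from sklearn.feature_selection import SelectKBest, mutual_info_regression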

In [50]:
selector = SelectKBest(mutual_info_regression, k=100)

selector.fit(X_train, y_train)

SelectKBest(k=100,
            score_func=<function mutual_info_regression at 0x7fd008cabe50>)

In [51]:
features = train_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).columns

In [52]:
selector.get_feature_names_out()

array(['pcm_zcr_sma_range', 'pcm_zcr_sma_quartile1',
       'pcm_zcr_sma_quartile2', 'pcm_fftMag_spectralEntropy_sma_range',
       'pcm_fftMag_spectralEntropy_sma_upleveltime90',
       'pcm_fftMag_spectralKurtosis_sma_upleveltime90',
       'pcm_fftMag_spectralSlope_sma_percentile99.0', 'mfcc_sma[3]_lpc0',
       'mfcc_sma[3]_lpc1', 'mfcc_sma[4]_lpc0', 'mfcc_sma[4]_lpc1',
       'mfcc_sma[5]_lpgain', 'mfcc_sma[5]_lpc0', 'mfcc_sma[5]_lpc1',
       'mfcc_sma[6]_lpgain', 'mfcc_sma[6]_lpc0', 'mfcc_sma[6]_lpc1',
       'mfcc_sma[7]_lpc0', 'mfcc_sma[7]_lpc1', 'mfcc_sma[8]_lpgain',
       'mfcc_sma[8]_lpc0', 'mfcc_sma[8]_lpc1', 'mfcc_sma[9]_lpgain',
       'mfcc_sma[9]_lpc0', 'mfcc_sma[9]_lpc1', 'mfcc_sma[10]_lpgain',
       'mfcc_sma[10]_lpc0', 'mfcc_sma[10]_lpc1', 'mfcc_sma[11]_lpc0',
       'mfcc_sma[11]_lpc1', 'mfcc_sma[12]_lpgain', 'mfcc_sma[12]_lpc0',
       'mfcc_sma[12]_lpc1', 'mfcc_sma[13]_lpgain', 'mfcc_sma[13]_lpc0',
       'mfcc_sma[13]_lpc1', 'mfcc_sma[14]_lpgain', 'mfcc_sma[14

In [53]:
selected_feats = selector.get_feature_names_out()

In [54]:
X_train[X_train.columns.intersection(selected_feats)]

Unnamed: 0,pcm_zcr_sma_range,pcm_zcr_sma_quartile1,pcm_zcr_sma_quartile2,pcm_fftMag_spectralEntropy_sma_range,pcm_fftMag_spectralEntropy_sma_upleveltime90,pcm_fftMag_spectralKurtosis_sma_upleveltime90,pcm_fftMag_spectralSlope_sma_percentile99.0,mfcc_sma[3]_lpc0,mfcc_sma[3]_lpc1,mfcc_sma[4]_lpc0,...,voicingFinalUnclipped_sma_quartile3,jitterLocal_sma_quartile1,jitterLocal_sma_percentile1.0,jitterDDP_sma_flatness,jitterDDP_sma_quartile1,jitterDDP_sma_quartile2,jitterDDP_sma_percentile1.0,shimmerLocal_sma_percentile1.0,pcm_RMSenergy_sma_de_peakMeanRel,pcm_fftMag_spectralHarmonicity_sma_de_peakMeanRel
8343,0.744129,-1.018211,-0.647140,-0.132438,-0.753690,0.392586,-0.146959,1.368374,-0.938985,-0.205703,...,-1.705113,0.971743,0.704688,0.051574,1.721181,1.613093,2.879919,1.530543,-0.259708,-0.253110
2022,0.510835,0.844262,1.114483,-0.269425,0.013657,-0.025384,0.379997,0.036377,-0.348287,0.733956,...,-0.664672,0.074668,-0.251880,0.286417,-0.544550,-0.431190,-0.873396,-0.673266,-0.243300,-0.252897
809,-3.716143,-2.976384,-2.599607,1.397116,-1.695104,1.149070,-1.086881,-0.035730,-0.447196,1.403735,...,2.008619,0.136457,-0.305842,-0.776906,-0.018894,1.069253,-0.283961,-1.347643,3.981625,3.959425
10145,0.411686,-0.605964,0.152930,-0.051467,-0.535905,-0.396318,-0.583173,-0.481372,0.330006,-0.943001,...,-0.270322,1.303428,0.161960,-0.096196,-0.230029,-0.010113,-0.071575,0.201429,-0.260608,-0.253110
6086,0.618734,0.476184,0.336432,0.093954,-0.795031,1.033614,-0.219516,1.523216,-1.052831,0.415784,...,0.057066,0.984174,3.070097,1.334189,1.585801,0.806846,2.468032,-0.240172,-0.260734,-0.253110
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9677,0.420434,0.284784,0.718117,-0.009796,0.688279,-0.747196,-0.047296,0.112309,-0.676011,0.326300,...,-0.281661,0.274695,0.359695,-0.353658,0.907066,1.208125,0.133086,0.552417,-0.255710,-0.253110
571,0.580824,-0.370394,-0.955425,-0.217365,1.149588,-0.592226,-0.241204,1.775676,-1.947888,1.054438,...,-0.779009,-0.309568,-0.665991,-1.581389,0.145287,-0.333467,0.146187,0.569832,-0.249048,-0.253110
10225,-1.206784,-0.083294,-0.595760,-0.207814,-0.283436,0.807183,-0.956524,2.540684,-2.186655,0.216562,...,1.007462,-1.816120,-0.222520,-0.106441,-1.807525,-1.809777,-1.385319,-0.223601,-0.260435,-0.253110
3057,0.653728,2.382827,4.189983,-0.049467,0.772695,-0.261322,2.249196,-0.684558,0.876848,-0.570333,...,-0.395734,0.942999,2.490751,0.062954,0.944581,0.524631,1.198789,1.429305,-0.260408,-0.253110


In [55]:
df_lr_final = pd.concat([train_df, dev_df])
X_train_lr_final = df_lr_final.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_train_lr_final = df_lr_final['fake'].copy()

X_test_lr_final = test_df.drop(columns=['id', 'file', 'type', 'duration(seconds)', 'fake']).copy()
y_test_lr_final = test_df['fake'].copy()

In [56]:
#select features identified by SelectKBest
feature_index = train_df.columns.intersection(selected_feats)

#fit model
model_lr_final = LogisticRegression(max_iter=1000)
model_lr_final.fit(X_train_lr_final[feature_index], y_train_lr_final)

#predict on the training data and the held-out test data
yhat_train_final = model_lr_final.predict(X_train_lr_final[feature_index])
yhat_test_final = model_lr_final.predict(X_test_lr_final[feature_index])

#compute accuracy
accuracy_lr_train = accuracy_score(y_train_lr_final, yhat_train_final)
accuracy_lr_test = accuracy_score(y_test_lr_final, yhat_test_final)

#print
print('Logistic accuracy train = %.3f' % (accuracy_lr_train*100))
print('Logistic accuracy test = %.3f' % (accuracy_lr_test*100))

Logistic accuracy train = 93.038
Logistic accuracy test = 92.109


In [57]:
selected_feats

array(['pcm_zcr_sma_range', 'pcm_zcr_sma_quartile1',
       'pcm_zcr_sma_quartile2', 'pcm_fftMag_spectralEntropy_sma_range',
       'pcm_fftMag_spectralEntropy_sma_upleveltime90',
       'pcm_fftMag_spectralKurtosis_sma_upleveltime90',
       'pcm_fftMag_spectralSlope_sma_percentile99.0', 'mfcc_sma[3]_lpc0',
       'mfcc_sma[3]_lpc1', 'mfcc_sma[4]_lpc0', 'mfcc_sma[4]_lpc1',
       'mfcc_sma[5]_lpgain', 'mfcc_sma[5]_lpc0', 'mfcc_sma[5]_lpc1',
       'mfcc_sma[6]_lpgain', 'mfcc_sma[6]_lpc0', 'mfcc_sma[6]_lpc1',
       'mfcc_sma[7]_lpc0', 'mfcc_sma[7]_lpc1', 'mfcc_sma[8]_lpgain',
       'mfcc_sma[8]_lpc0', 'mfcc_sma[8]_lpc1', 'mfcc_sma[9]_lpgain',
       'mfcc_sma[9]_lpc0', 'mfcc_sma[9]_lpc1', 'mfcc_sma[10]_lpgain',
       'mfcc_sma[10]_lpc0', 'mfcc_sma[10]_lpc1', 'mfcc_sma[11]_lpc0',
       'mfcc_sma[11]_lpc1', 'mfcc_sma[12]_lpgain', 'mfcc_sma[12]_lpc0',
       'mfcc_sma[12]_lpc1', 'mfcc_sma[13]_lpgain', 'mfcc_sma[13]_lpc0',
       'mfcc_sma[13]_lpc1', 'mfcc_sma[14]_lpgain', 'mfcc_sma[14