## Speaker Identification using Whisper
* Train a classification model to identify whether the speaker in an audio segment is Lex or not

### Data:
* Training audio clips are expected in data/audio_dataset. Download them from [here](https://drive.google.com/file/d/1SF0j1UmMxpwFNeY1wkj3R20pRB7L0a4t/view?usp=share_link)

In [1]:
import torch
from pathlib import Path
import os
import sys
import pandas as pd
from tqdm import tqdm

print('is cuda available:', torch.cuda.is_available())

# Add whisper repo to path to import
repo_dir = Path(os.getcwd()).parents[0]/'whisper'
sys.path.append(str(repo_dir))
import whisper


is cuda available: True


### Load Whisper model

In [2]:
# if this model is too slow, try the other smaller models such as small.en and base.en
model = whisper.load_model("medium.en")


### Labelled dataset

* Each row contains `start` and `end` timestamps that define a segment of an audio clip
* `audio_name` is the name of the podcast. The podcast audio files should be present in data/lex_podcasts
* `is_lex` is the label: 1 if the speaker from start time to end time of the audio clip is Lex, 0 if not
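
As a small illustration of the `start`/`end` columns, the sketch below converts the `HH:MM:SS.mmm` strings to seconds and computes a clip duration. The helper names `parse_timestamp` and `clip_duration` are hypothetical and not part of the notebook; the timestamps are taken from the second row of the labelled dataset shown further down.

```python
from datetime import timedelta


def parse_timestamp(ts: str) -> float:
    """Convert an 'HH:MM:SS.mmm' string to a number of seconds."""
    hours, minutes, seconds = ts.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


def clip_duration(start: str, end: str) -> float:
    """Length of the labelled segment in seconds."""
    return parse_timestamp(end) - parse_timestamp(start)


# Row 1 of the labelled dataset: "I still do that often."
duration = clip_duration("02:20:14.140", "02:20:17.260")
print(round(duration, 3))  # about 3.12 seconds
```

Most clips in the preview are only a few seconds long, which is worth keeping in mind since Whisper pads every clip to a fixed-length window.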

#### Data augmentation:
* All samples with audio_idx < 0 are heuristically labelled based on keywords, not hand labelled (e.g. "the following is a conv" at the start of a podcast is always Lex). Only a subsample of these is used in training, since these clips all share a very similar spectral pattern

In [3]:
audio_dataset_dir = Path('data/audio_dataset')
labelled_path = 'data/labelled_dataset.csv'
df = pd.read_csv(labelled_path)

df.head()

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex
0,02:49:11.280,02:49:15.120,"And, you know, some people also ask, are you ...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",0,0.0
1,02:20:14.140,02:20:17.260,I still do that often.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",1,1.0
2,00:19:15.360,00:19:17.320,things that you put into context of GPT.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",2,0.0
3,02:45:11.760,02:45:16.000,"and that also gives, you know, huge perspecti...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",3,0.0
4,01:33:44.600,01:33:49.160,"You, it's often the way how it works is you o...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",4,0.0


In [4]:
podcasts = list(df['audio_name'].unique())
print('Num audio clips:', len(df))
print(f'Num unique podcasts: {len(podcasts)}')

print(f'\n\nPodcasts containing the most tagged clips\n{df["audio_name"].value_counts().head(8)}')
print(f'\n\nPodcasts containing the least tagged clips\n{df["audio_name"].value_counts().tail(2)}')

Num audio clips: 698
Num unique podcasts: 68


Podcasts containing the most tagged clips
Elon Musk： Neuralink, AI, Autopilot, and the Pale Blue Dot ｜ Lex Fridman Podcast #49                   65
Ray Dalio： Principles, the Economic Machine, AI & the Arc of Life ｜ Lex Fridman Podcast #54            21
Judea Pearl： Causal Reasoning, Counterfactuals, and the Path to AGI ｜ Lex Fridman Podcast #56          21
Dmitry Korkin： Computational Biology of Coronavirus ｜ Lex Fridman Podcast #90                          21
Jeremy Howard： fast.ai Deep Learning Courses and Research ｜ Lex Fridman Podcast #35                    21
Cumrun Vafa： String Theory ｜ Lex Fridman Podcast #204                                                  21
Po-Shen Loh： Mathematics, Math Olympiad, Combinatorics & Contact Tracing ｜ Lex Fridman Podcast #183    20
Jim Keller： Moore's Law, Microprocessors, and First Principles ｜ Lex Fridman Podcast #70               20
Name: audio_name, dtype: int64


Podcasts containing the leas

## Create features for classifier
* Each hidden state is of shape (batch_size, 1500, hidden_size), where 1500 is the number of time steps. We need to summarize the features across time into a fixed-size feature vector per clip, e.g. of shape (batch_size, hidden_size)
* This is what the get_feature_vector functions do:
    * `get_feature_vector3`: Performs the best. Computes the mean and std of the features across time and concatenates them. Creates a feature of shape (batch_size, hidden_size + hidden_size).
    * `get_feature_vector2`: Computes the mean of the features across all time steps. Creates a feature of shape (batch_size, hidden_size)
    * `get_feature_vector1`: Computes the mean of the features within each of 3 time windows of the 1500 steps (0-500, 500-1000, 1000-1500) and concatenates them. Creates a feature of shape (batch_size, hidden_size * 3)
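
The three pooling strategies listed above can be checked for shape on random data. This is only a sketch: it uses NumPy instead of torch, a toy `hidden_size` of 8, and hypothetical function names; `ddof=1` is chosen to mirror the default (unbiased) behaviour of `torch.std`.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8  # toy value; the real size depends on the Whisper model
states = rng.normal(size=(1, 1500, hidden_size))  # (batch_size, time, hidden_size)


def mean_pool(batch):
    """Mean over all time steps -> (1, hidden_size)."""
    return batch[0].mean(axis=0)[None, :]


def mean_std_pool(batch):
    """Mean and std over time, concatenated -> (1, 2 * hidden_size)."""
    mean = batch[0].mean(axis=0)
    std = batch[0].std(axis=0, ddof=1)
    return np.concatenate([mean, std])[None, :]


def windowed_mean_pool(batch, num_windows=3):
    """Mean within each time window, concatenated -> (1, num_windows * hidden_size)."""
    windows = np.array_split(batch[0], num_windows, axis=0)
    return np.concatenate([w.mean(axis=0) for w in windows])[None, :]


print(mean_pool(states).shape)           # (1, 8)
print(mean_std_pool(states).shape)       # (1, 16)
print(windowed_mean_pool(states).shape)  # (1, 24)
```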

In [5]:
def get_feature_vector1(batch):
    """
    Get features to train classifier.  Input: batch with 1 sample (bs, timewindows, hidden_size)
    
    Concatenate mean features across 3 time windows 
    
    """
    # mean over the time axis (dim 0 of each window) so each window yields a (hidden_size,) vector
    out = [batch[0, :500, :].mean(0).flatten(), batch[0, 500:1000, :].mean(0).flatten(), batch[0, 1000:, :].mean(0).flatten()]
    X = torch.cat(out, dim=-1)
        
    return X[None, :]

def get_feature_vector2(batch):
    """
    Get features to train classifier.  Input: batch with 1 sample (bs, timewindows, hidden_size)
    
    Mean of features across entire timewindow of a clip
    
    """
    X = batch[0, :, :].mean(0)
    return X[None, :]

def get_feature_vector3(batch):
    """
    Get features to train classifier. Input: batch with 1 sample (bs, timewindows, hidden_size)
    
    Mean and std of features across entire timewindow of a clip
    
    """
    

    out = batch[0, :, :].mean(0)
    out2 = batch[0, :, :].std(0)
    X = torch.cat([out, out2], dim=-1)
        
    return X[None, :]




## Get Whisper Embeddings
* Requires a GPU to finish quickly. If it takes too long, consider using a smaller Whisper model 
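
The cell below reads attributes such as `model.encoder.encoder_out1` and `model.encoder.encoder_out_last`. These appear to be set by the local copy of the whisper repo that is added to `sys.path` at the top of the notebook. If only an unmodified model is available, one possible alternative is to capture intermediate outputs with PyTorch forward hooks. The sketch below shows the idea on a toy stack of linear layers; the `encoder` stand-in, the `captured` dict and the block names are all hypothetical and unrelated to Whisper's actual module layout.

```python
import torch
from torch import nn

# Toy stand-in for an encoder: three blocks applied in sequence.
encoder = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))

captured = {}


def save_output(name):
    """Build a hook that stores the block's output under `name`."""
    def hook(module, inputs, output):
        captured[name] = output.detach().cpu()
    return hook


# Register one hook per block and keep the handles so they can be removed later.
handles = [
    block.register_forward_hook(save_output(f"block_{i}"))
    for i, block in enumerate(encoder)
]

with torch.no_grad():
    _ = encoder(torch.randn(1, 1500, 4))  # (batch_size, time, hidden_size)

for handle in handles:
    handle.remove()

for name, tensor in captured.items():
    print(name, tuple(tensor.shape))
```

Each captured tensor has the same (batch_size, time, hidden_size) layout that the get_feature_vector functions expect.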

In [6]:
if torch.cuda.is_available():
    model = model.cuda()
    
if not audio_dataset_dir.exists():
    raise ValueError(f'Expecting audio clips data in {audio_dataset_dir}')
        
filenames = df['audio_idx'].apply(lambda x: str(x) + '.mp3')
audio_paths  = [audio_dataset_dir/filename for filename in filenames]

idx_to_path = {idx: path for idx, path in enumerate(audio_paths)}

hidden_l1 = []
hidden_l2 = []
hidden_last = []
hidden_middle = []

metadata_outputs = []
for audio_path in tqdm(audio_paths, total=len(audio_paths), disable=False):
    
    assert os.path.exists(audio_path)

    # load audio
    audio = whisper.load_audio(str(audio_path))
    audio = whisper.pad_or_trim(audio)
    
    # create mel spectrogram input for whisper encoder
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    mel = mel[None, :, :]
    
    # forward pass through encoder
    with torch.no_grad():
        _ = model.embed_audio(mel)
    del _
    
    # get various hidden states of encoder
    hidden_l1.append(get_feature_vector3(model.encoder.encoder_out1.cpu()))
    hidden_l2.append(get_feature_vector3(model.encoder.encoder_out2.cpu()))
    hidden_last.append(get_feature_vector3(model.encoder.encoder_out_last.cpu()))
    hidden_middle.append(get_feature_vector3(model.encoder.encoder_out_middle.cpu()))
    
    
    metadata_outputs.append({"metadata": None, "audio_path": audio_path.name})


100%|████████████████████████████████████████████████████████████████████████████████| 698/698 [04:26<00:00,  2.61it/s]


In [7]:
df2 = pd.concat([df, pd.DataFrame(metadata_outputs)], axis=1)
df2.head()

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex,metadata,audio_path
0,02:49:11.280,02:49:15.120,"And, you know, some people also ask, are you ...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",0,0.0,,0.mp3
1,02:20:14.140,02:20:17.260,I still do that often.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",1,1.0,,1.mp3
2,00:19:15.360,00:19:17.320,things that you put into context of GPT.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",2,0.0,,2.mp3
3,02:45:11.760,02:45:16.000,"and that also gives, you know, huge perspecti...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",3,0.0,,3.mp3
4,01:33:44.600,01:33:49.160,"You, it's often the way how it works is you o...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",4,0.0,,4.mp3


### Train test split

In [8]:
from collections import Counter
import numpy as np 

def cv_split(df, seed=42):
    """
    Split on podcast
    """
    
    import random
    
    randgen = random.Random(seed)
        
    subdf = df[df['audio_idx'] >= 0]
    
    # negative audio_idxs are heuristically labelled, not manually labelled. 
    # Nearly 100% are accurate, but the speech pattern is the same so only use a few samples
    # (note: the filter is < -1, so rows with audio_idx == -1 are not included)
    augmented_df = df[df['audio_idx'] < -1].sample(50)
    
    # split based on podcast (audio_name)
    sources = list(subdf['audio_name'].unique())
    
    test_split = randgen.sample(sources, len(sources) // 4)
    train_split = list(set(sources).difference(test_split))

    test_df = df[df['audio_name'].isin(test_split)]
    train_df = df[df['audio_name'].isin(train_split)]
    
    # add heuristic data to labelled data
    train_df = pd.concat([train_df, augmented_df], axis=0)
                               
    
    return train_df, test_df
    
def generate_splits(df, num_splits=5):
    """
    Splitting on podcast randomly can lead to very high class imbalance in the test set - in some podcasts lex tags are very few. 
    The strategy is to keep trying random seeds until the fraction of lex clips in the test set is between 40% and 60%
    
    """
    
    cvs = []
    seen_seeds = set()
    for split in range(num_splits):
        seeds = np.random.randint(low=0, high=1000, size=(50,))
        for seed in seeds: 
            seed = int(seed)  # random.Random expects a plain Python int, not a numpy integer
            # can't use the same seed again 
            if seed in seen_seeds:
                continue
            train_df, test_df = cv_split(df, seed=seed)
            counts = test_df['is_lex'].value_counts()
            counts = counts/counts.sum()
            lex_ratio = counts.get(1.0, 0.0)  # 0 if the test set has no lex clips
            if 0.40 <= lex_ratio <= 0.60:
                print(f'Split found with ratio: {counts.to_dict()}: seed: {seed}')
                break 
        else:
            # none of the 50 candidate seeds gave a balanced test set
            raise RuntimeError('No split with a 40%-60% lex ratio found, try running again')
                
        cvs.append((train_df, test_df))
        seen_seeds.add(seed)
        
    return cvs 

        
splits = generate_splits(df, num_splits=5)        


Split found with ratio: {0.0: 0.5783783783783784, 1.0: 0.42162162162162165}: seed: 486
Split found with ratio: {0.0: 0.5928571428571429, 1.0: 0.40714285714285714}: seed: 676
Split found with ratio: {0.0: 0.5956284153005464, 1.0: 0.40437158469945356}: seed: 799
Split found with ratio: {0.0: 0.5815217391304348, 1.0: 0.41847826086956524}: seed: 846
Split found with ratio: {0.0: 0.5955882352941176, 1.0: 0.40441176470588236}: seed: 527
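
The point of splitting on podcast rather than on individual clips is that no episode contributes clips to both train and test. A minimal standard-library sketch of that idea, using made-up episode names and a hypothetical `split_by_group` helper (not the notebook's `cv_split`):

```python
import random


def split_by_group(groups, test_fraction=0.25, seed=42):
    """Assign whole groups to the test set; return (train_indices, test_indices)."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    num_test = int(len(unique_groups) * test_fraction)
    test_groups = set(rng.sample(unique_groups, num_test))
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data: one entry per clip, value = the episode the clip came from.
clip_episodes = ["ep_a"] * 3 + ["ep_b"] * 2 + ["ep_c"] * 4 + ["ep_d"] * 1
train_idx, test_idx = split_by_group(clip_episodes)

train_episodes = {clip_episodes[i] for i in train_idx}
test_episodes = {clip_episodes[i] for i in test_idx}
print("train:", sorted(train_episodes))
print("test:", sorted(test_episodes))
```

Because whole episodes are held out, the test metrics reflect guests and recording conditions the classifier has not seen.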


In [9]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, classification_report
from sklearn.exceptions import ConvergenceWarning
import warnings 
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import svm

# ignore convergence warnings (only relevant if the logistic regression below is enabled)
warnings.filterwarnings("ignore", category=ConvergenceWarning)

def get_metrics(true, pred):
    f1, recall, precision, accuracy = f1_score(true, pred), recall_score(true, pred), precision_score(true, pred), accuracy_score(true, pred)
    
    return {'f1': f1, 'recall': recall, 'precision': precision, 'accuracy': accuracy}

def train_eval(X, splits):
    """
    Train model
    
    Inputs: 
        - X: All features X 
        - splits: List of tuples of (train_df, test_df)
    
    """
    test_metrics = []
    train_metrics = []
    fold_preds = []
    for train_df, test_df in splits:
        X_train, y_train = X[list(train_df.index), :], train_df['is_lex']
        X_test, y_test = X[list(test_df.index), :], test_df['is_lex']

        scaler = preprocessing.StandardScaler()
        
        # logistic regression overfits fast, so an RBF-kernel SVM is used instead
        # clf = LogisticRegression(random_state=0, max_iter=15, C=.7)
        clf = svm.SVC(kernel='rbf', C=.7)

        pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

        pipeline.fit(X_train, y_train)
        pred_train = pipeline.predict(X_train)
        pred_test = pipeline.predict(X_test)

        m1 = get_metrics(y_train, pred_train)
        m1['num_positive_samples'] = (y_train==1).sum()
        m1['num_negative_samples'] = (y_train==0).sum()
        train_metrics.append(m1)
        
        m2 = get_metrics(y_test, pred_test)
        m2['num_positive_samples'] = (y_test==1).sum()
        m2['num_negative_samples'] = (y_test==0).sum()
        test_metrics.append(m2)
        
        fold_df = test_df.copy()
        fold_df['preds'] = pred_test
        
        fold_preds.append(fold_df)
        
        
        

    train_metrics = pd.DataFrame(train_metrics)
    test_metrics = pd.DataFrame(test_metrics)

    display('Test stats ', test_metrics.describe().loc[[ 'mean','std']])
    display('Train stats ', train_metrics.describe().loc[[ 'mean','std']])
    
    return train_metrics, test_metrics, fold_preds
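

For reference, the four numbers returned by `get_metrics` can be computed by hand from the confusion counts, with Lex (label 1) as the positive class. The sketch below is a plain-Python illustration on made-up labels, not a replacement for the scikit-learn functions used above; the `binary_metrics` name is hypothetical.

```python
def binary_metrics(true, pred):
    """Precision, recall, F1 and accuracy with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(true)
    return {"f1": f1, "recall": recall, "precision": precision, "accuracy": accuracy}


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # every metric is 0.75 for this toy example
```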



## Train and Predict
* Train classifier

In [10]:
# utilize different hidden states for training and evaluate
# names are listed in the same order as the feature lists so each output is labelled correctly
for X, hidden_name in zip([hidden_l1, hidden_l2, hidden_last, hidden_middle], ['first_hidden', 'second_hidden', 'last_hidden', 'middle_hidden']):          
    X_use = torch.cat(X, dim=0)
    
    print(f'\n----Metrics of output of {hidden_name} encoder block output----\n')
    train_metrics, test_metrics, pred_dfs = train_eval(X_use, splits)
    


----Metrics of output of first_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.73847,0.745985,0.744412,0.786689,68.2,97.4
std,0.123165,0.159628,0.135087,0.092892,11.256109,14.099645


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.900951,0.85188,0.95611,0.932338,195.8,345.6
std,0.009651,0.012215,0.010986,0.005932,11.256109,14.099645



----Metrics of output of second_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.7876,0.795648,0.793055,0.824234,68.2,97.4
std,0.072253,0.108556,0.113546,0.059554,11.256109,14.099645


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.918516,0.878892,0.962092,0.943599,195.8,345.6
std,0.005632,0.011951,0.011779,0.004458,11.256109,14.099645



----Metrics of output of last_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.894271,0.863248,0.929451,0.916785,68.2,97.4
std,0.036456,0.055418,0.033554,0.025991,11.256109,14.099645


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.982155,0.970733,0.993867,0.987234,195.8,345.6
std,0.00704,0.009525,0.005321,0.005117,11.256109,14.099645



----Metrics of output of middle_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.916248,0.927209,0.906423,0.930823,68.2,97.4
std,0.039758,0.048416,0.042068,0.032059,11.256109,14.099645


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.987225,0.982593,0.991933,0.990807,195.8,345.6
std,0.00328,0.005636,0.005505,0.002368,11.256109,14.099645


### Explore predictions
* Get predictions of one fold

In [11]:
X_use = torch.cat(hidden_last, dim=0)
train_metrics, test_metrics, pred_dfs = train_eval(X_use, splits)


'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.894271,0.863248,0.929451,0.916785,68.2,97.4
std,0.036456,0.055418,0.033554,0.025991,11.256109,14.099645


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.982155,0.970733,0.993867,0.987234,195.8,345.6
std,0.00704,0.009525,0.005321,0.005117,11.256109,14.099645


In [12]:
# predictions for one fold
fold_idx = 0
foldk_preds = pred_dfs[fold_idx]

display('Metrics ', test_metrics.loc[fold_idx])
display('Preds', foldk_preds.head())

'Metrics '

f1                        0.929032
recall                    0.923077
precision                 0.935065
accuracy                  0.940541
num_positive_samples     78.000000
num_negative_samples    107.000000
Name: 0, dtype: float64

'Preds'

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex,preds
98,01:29:37.640,01:29:39.080,"Well, there's a meaning crisis",episode_250,Peter Wang： Python and the Source Code of Huma...,100,0.0,0.0
99,00:43:46.020,00:43:50.920,is that the aliens are all around us and we'r...,episode_250,Peter Wang： Python and the Source Code of Huma...,101,1.0,1.0
100,00:02:53.360,00:02:54.800,that fit in my head very easily.,episode_250,Peter Wang： Python and the Source Code of Huma...,102,0.0,0.0
101,02:02:07.180,02:02:09.080,"well, yeah, these libraries I need",episode_250,Peter Wang： Python and the Source Code of Huma...,103,0.0,0.0
102,00:13:23.560,00:13:25.520,"that don't have messiness in them,",episode_250,Peter Wang： Python and the Source Code of Huma...,104,1.0,0.0


#### Listen to audio_idx audio clip and explore the predictions
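
To listen to the clips the classifier got wrong, the audio file for each row can be located from its `audio_idx`, following the `<audio_idx>.mp3` naming used when the embeddings were extracted. A small sketch with two rows copied from the fold preview above; the `clip_path` helper is hypothetical.

```python
from pathlib import Path

AUDIO_DIR = Path("data/audio_dataset")


def clip_path(audio_idx, audio_dir=AUDIO_DIR):
    """Path of the clip for a given audio_idx (files are named '<audio_idx>.mp3')."""
    return audio_dir / f"{audio_idx}.mp3"


# Two rows from the fold preview: one correct prediction, one miss.
rows = [
    {"audio_idx": 100, "is_lex": 0.0, "preds": 0.0},
    {"audio_idx": 104, "is_lex": 1.0, "preds": 0.0},
]

misclassified = [clip_path(r["audio_idx"]) for r in rows if r["is_lex"] != r["preds"]]
print(misclassified)

# In a notebook, a clip can then be played with IPython's audio widget, e.g.:
#   from IPython.display import Audio
#   Audio(filename=str(misclassified[0]))
```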

In [13]:
foldk_preds.sample(15)

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex,preds
104,01:34:30.120,01:34:32.520,we're not encouraged to have more and more,episode_250,Peter Wang： Python and the Source Code of Huma...,106,0.0,0.0
584,00:00:07.280,00:00:13.120,"with parallels, if not in quality, than an ou...",episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",601,1.0,1.0
162,00:53:01.200,00:53:08.240,"the sexual stuff aside is the, it's more like...",episode_098,Kate Darling： Social Robotics ｜ Lex Fridman Po...,164,1.0,1.0
612,00:06:54.640,00:06:59.200,I think there's a lot of tremendous amount of...,episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",629,0.0,0.0
619,00:11:08.480,00:11:15.280,"Yes, exactly. Being able to have high precisi...",episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",636,0.0,0.0
616,00:07:37.120,00:07:43.920,I would argue that AI is unequivocally someth...,episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",633,0.0,0.0
624,00:12:37.120,00:12:42.160,both directions. So the brain is adjusting a ...,episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",641,1.0,1.0
116,02:25:17.000,02:25:19.280,So there's a hard thing.,episode_250,Peter Wang： Python and the Source Code of Huma...,118,0.0,0.0
112,00:31:59.520,00:32:00.680,You're gonna kind of have to do that,episode_250,Peter Wang： Python and the Source Code of Huma...,114,0.0,0.0
573,01:01:29.360,01:01:31.820,"but do not lose sight of the fact, and some p...",episode_232,"Brian Greene： Quantum Gravity, The Big Bang, A...",590,0.0,0.0
