## Speaker Identification using Whisper
* Train a classification model to identify whether the speaker in an audio segment is Lex or not

### Data:
* Training audio clips are expected in data/audio_dataset. Download them from [here](https://drive.google.com/file/d/1SF0j1UmMxpwFNeY1wkj3R20pRB7L0a4t/view?usp=share_link)

In [1]:
import torch
from pathlib import Path
import os
import sys
import pandas as pd
from tqdm import tqdm

print('is cuda available:', torch.cuda.is_available())

# Add whisper repo to path to import
repo_dir = Path(os.getcwd()).parents[0]/'whisper'
sys.path.append(str(repo_dir))
import whisper


is cuda available: True


### Load Whisper model

In [2]:
# if this model is too slow, try the other smaller models such as small.en and base.en
model = whisper.load_model("medium.en")


### Labelled dataset

* Each row contains `start` and `end` timestamps defining a segment of an audio clip
* `audio_name` is the name of the podcast. The podcast audio files should be present in data/lex_podcasts
* `is_lex` is the label: 1 if the speaker from `start` to `end` is Lex, 0 if not
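
As a small illustration of the `start` and `end` columns, the sketch below converts the `HH:MM:SS.mmm` timestamps into seconds and computes a clip's duration. The helper names `to_seconds` and `clip_duration` are hypothetical and not part of the notebook.

```python
def to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timestamp into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_duration(start: str, end: str) -> float:
    """Duration of a labelled segment in seconds, rounded to milliseconds."""
    return round(to_seconds(end) - to_seconds(start), 3)


# First row of the labelled dataset shown below in df.head()
print(clip_duration("02:49:11.280", "02:49:15.120"))
```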

#### Data augmentation:
* All samples with audio_idx < 0 are labelled heuristically by keyword rather than by hand (e.g. "the following is a conv" at the start of a podcast is always Lex). Only a subsample of these is used in training, since all such clips share a very similar spectral pattern
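
The keyword heuristic could look roughly like the sketch below. This is an assumption about how such a rule might be written, not the code that produced the dataset: the function name `heuristic_is_lex` and the one-minute cutoff `max_start` are hypothetical.

```python
INTRO_PHRASE = "the following is a conv"  # keyword quoted in the note above


def heuristic_is_lex(text: str, start: str, max_start: str = "00:01:00.000"):
    """Return 1.0 if the clip looks like the podcast intro, else None (left unlabelled).

    `max_start` is a hypothetical cutoff that restricts the rule to the
    beginning of an episode.
    """
    is_intro = text.strip().lower().startswith(INTRO_PHRASE)
    # Zero-padded HH:MM:SS.mmm strings sort chronologically, so a plain
    # string comparison is enough here.
    near_start = start <= max_start
    return 1.0 if (is_intro and near_start) else None


print(heuristic_is_lex("The following is a conversation with Elon Mus...", "00:00:00.000"))
print(heuristic_is_lex("I still do that often.", "02:20:14.140"))
```

Returning `None` rather than 0 keeps the rule from asserting that a clip is not Lex; it only ever adds positive labels.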

In [3]:
audio_dataset_dir = Path('data/audio_dataset')
labelled_path = 'data/labelled_dataset.csv'
df = pd.read_csv(labelled_path)

df.head()

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex
0,02:49:11.280,02:49:15.120,"And, you know, some people also ask, are you ...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",0,0.0
1,02:20:14.140,02:20:17.260,I still do that often.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",1,1.0
2,00:19:15.360,00:19:17.320,things that you put into context of GPT.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",2,0.0
3,02:45:11.760,02:45:16.000,"and that also gives, you know, huge perspecti...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",3,0.0
4,01:33:44.600,01:33:49.160,"You, it's often the way how it works is you o...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",4,0.0


In [4]:
podcasts = list(df['audio_name'].unique())
print('Num audio clips:', len(df))
print(f'Num unique podcasts: {len(podcasts)}')

print(f'\n\nPodcasts containing the most tagged clips\n{df["audio_name"].value_counts().head(8)}')
print(f'\n\nPodcasts containing the least tagged clips\n{df["audio_name"].value_counts().tail(2)}')

Num audio clips: 698
Num unique podcasts: 68


Podcasts containing the most tagged clips
Elon Musk： Neuralink, AI, Autopilot, and the Pale Blue Dot ｜ Lex Fridman Podcast #49                   65
Ray Dalio： Principles, the Economic Machine, AI & the Arc of Life ｜ Lex Fridman Podcast #54            21
Judea Pearl： Causal Reasoning, Counterfactuals, and the Path to AGI ｜ Lex Fridman Podcast #56          21
Dmitry Korkin： Computational Biology of Coronavirus ｜ Lex Fridman Podcast #90                          21
Jeremy Howard： fast.ai Deep Learning Courses and Research ｜ Lex Fridman Podcast #35                    21
Cumrun Vafa： String Theory ｜ Lex Fridman Podcast #204                                                  21
Po-Shen Loh： Mathematics, Math Olympiad, Combinatorics & Contact Tracing ｜ Lex Fridman Podcast #183    20
Jim Keller： Moore's Law, Microprocessors, and First Principles ｜ Lex Fridman Podcast #70               20
Name: audio_name, dtype: int64


Podcasts containing the leas

## Create features for classifier
* Each hidden state has shape (batch_size, 1500, hidden_size), where 1500 is the number of time steps. We need to summarize the features across the time dimension into a fixed-size feature vector per clip, e.g. of shape (batch_size, hidden_size)
* This is what the get_feature_vector functions do:
    * `get_feature_vector3`: Performs the best. Computes the mean and std of the features across time and concatenates them. Creates a feature of shape (batch_size, hidden_size + hidden_size).
    * `get_feature_vector2`: Computes the mean of the features across time. Creates a feature of shape (batch_size, hidden_size)
    * `get_feature_vector1`: Computes the mean of the features within each of 3 time windows of the 1500 steps (0-500, 500-1000, 1000-1500) and concatenates them. Creates a feature of shape (batch_size, hidden_size * 3)

In [5]:
def get_feature_vector1(batch):
    """
    Get features to train classifier.  Input: batch with 1 sample (bs, timewindows, hidden_size)
    
    Concatenate mean features across 3 time windows 
    
    """
    # average over the time axis (dim 0 once the sample is selected) so each window yields (hidden_size,)
    out = [batch[0, :500, :].mean(0), batch[0, 500:1000, :].mean(0), batch[0, 1000:, :].mean(0)]
    X = torch.cat(out, dim=-1)
        
    return X[None, :]

def get_feature_vector2(batch):
    """
    Get features to train classifier.  Input: batch with 1 sample (bs, timewindows, hidden_size)
    
    Mean of features across entire timewindow of a clip
    
    """
    X = batch[0, :, :].mean(0)
    return X[None, :]

def get_feature_vector3(batch):
    """
    Get features to train classifier. Input: batch with 1 sample (bs, timewindows, hidden_size)
    
    Mean and std of features across entire timewindow of a clip
    
    """
    

    out = batch[0, :, :].mean(0)
    out2 = batch[0, :, :].std(0)
    X = torch.cat([out, out2], dim=-1)
        
    return X[None, :]
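

The pooling strategies described above can be sanity-checked on a toy array. The sketch below is a NumPy stand-in (not the notebook's torch code) with a made-up hidden size of 8; the function names `pool_mean_std` and `pool_three_windows` are hypothetical.

```python
import numpy as np


def pool_mean_std(hidden: np.ndarray) -> np.ndarray:
    """Pool (1, time, hidden_size) into (1, 2 * hidden_size) via mean and std over time."""
    mean = hidden[0].mean(axis=0)
    # ddof=1 mirrors the unbiased estimate that torch.Tensor.std uses by default
    std = hidden[0].std(axis=0, ddof=1)
    return np.concatenate([mean, std])[None, :]


def pool_three_windows(hidden: np.ndarray, window: int = 500) -> np.ndarray:
    """Pool (1, 1500, hidden_size) into (1, 3 * hidden_size) via per-window means."""
    chunks = [hidden[0, i:i + window].mean(axis=0)
              for i in range(0, hidden.shape[1], window)]
    return np.concatenate(chunks)[None, :]


rng = np.random.default_rng(0)
hidden = rng.normal(size=(1, 1500, 8))   # toy stand-in for one encoder hidden state
print(pool_mean_std(hidden).shape)       # (1, 16)
print(pool_three_windows(hidden).shape)  # (1, 24)
```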




## Get Whisper Embeddings
* A GPU is needed for this step to finish quickly. If it takes too long, consider using a smaller Whisper model 

In [6]:
if torch.cuda.is_available():
    model = model.cuda()
    
if not audio_dataset_dir.exists():
    raise FileNotFoundError(f'Expecting audio clips data in {audio_dataset_dir}')
        
filenames = df['audio_idx'].apply(lambda x: str(x) + '.mp3')
audio_paths  = [audio_dataset_dir/filename for filename in filenames]

idx_to_path = {idx: path for idx, path in enumerate(audio_paths)}

hidden_l1 = []
hidden_l2 = []
hidden_last = []
hidden_middle = []

metadata_outputs = []


model = model.eval()

for audio_path in tqdm(audio_paths, total=len(audio_paths), disable=False):
    
    assert os.path.exists(audio_path)

    # load audio
    audio = whisper.load_audio(str(audio_path))
    audio = whisper.pad_or_trim(audio)
    
    # create mel spectrogram input for whisper encoder
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    mel = mel[None, :, :]
    
    # forward pass through encoder
    with torch.no_grad():
        _ = model.embed_audio(mel)
    del _
    
    # get various hidden states of encoder
    hidden_l1.append(get_feature_vector3(model.encoder.encoder_out1.cpu()))
    hidden_l2.append(get_feature_vector3(model.encoder.encoder_out2.cpu()))
    hidden_last.append(get_feature_vector3(model.encoder.encoder_out_last.cpu()))
    hidden_middle.append(get_feature_vector3(model.encoder.encoder_out_middle.cpu()))
    
    metadata_outputs.append({"metadata": None, "audio_path": audio_path.name})


100%|████████████████████████████████████████████████████████████████████████████████| 698/698 [04:16<00:00,  2.72it/s]
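
The attributes `encoder_out1`, `encoder_out2`, `encoder_out_middle` and `encoder_out_last` read in the loop above are not something a stock encoder exposes, so they presumably come from the local copy of the whisper repo added to the path earlier. One alternative way to collect intermediate outputs without editing the model is PyTorch forward hooks. The sketch below demonstrates the idea on a toy encoder; `ToyEncoder` and `capture_block_outputs` are hypothetical names, and it is an assumption that the real encoder keeps its layers in a `blocks` list in the same way.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Stand-in for an encoder that keeps its layers in a `blocks` list."""

    def __init__(self, n_blocks=4, hidden_size=8):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(n_blocks)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


def capture_block_outputs(encoder, block_indices):
    """Register forward hooks that store the output of the chosen blocks."""
    captured, handles = {}, []
    for idx in block_indices:
        # idx=idx binds the current index to this hook
        def hook(module, inputs, output, idx=idx):
            captured[idx] = output.detach().cpu()
        handles.append(encoder.blocks[idx].register_forward_hook(hook))
    return captured, handles


encoder = ToyEncoder().eval()
n = len(encoder.blocks)
captured, handles = capture_block_outputs(encoder, [0, 1, n // 2, n - 1])

with torch.no_grad():
    encoder(torch.randn(1, 1500, 8))

# remove the hooks once the activations have been collected
for handle in handles:
    handle.remove()

print({idx: tuple(t.shape) for idx, t in captured.items()})
```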


In [7]:
df2 = pd.concat([df, pd.DataFrame(metadata_outputs)], axis=1)
df2.head()

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex,metadata,audio_path
0,02:49:11.280,02:49:15.120,"And, you know, some people also ask, are you ...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",0,0.0,,0.mp3
1,02:20:14.140,02:20:17.260,I still do that often.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",1,1.0,,1.mp3
2,00:19:15.360,00:19:17.320,things that you put into context of GPT.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",2,0.0,,2.mp3
3,02:45:11.760,02:45:16.000,"and that also gives, you know, huge perspecti...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",3,0.0,,3.mp3
4,01:33:44.600,01:33:49.160,"You, it's often the way how it works is you o...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",4,0.0,,4.mp3


### Train test split

In [8]:
from collections import Counter
import numpy as np 

def cv_split(df, seed=42):
    """
    Split on podcast
    """
    
    import random
    
    randgen = random.Random(seed)
        
    subdf = df[df['audio_idx'] >= 0]
    
    # negative audio_idxs are heuristically labelled rather than manually labelled.
    # Nearly 100% are accurate, but the speech pattern is the same, so only use a few samples
    augmented_df = df[df['audio_idx'] < -1].sample(50)
    
    # split based on podcast
    sources = list(subdf['audio_name'].unique())
    
    test_split = randgen.sample(sources, len(sources) // 4)
    train_split = list(set(sources).difference(test_split))

    test_df = df[df['audio_name'].isin(test_split)]
    train_df = df[df['audio_name'].isin(train_split)]
    
    # add heuristic data to labelled data
    train_df = pd.concat([train_df, augmented_df], axis=0)
                               
    
    return train_df, test_df
    
def generate_splits(df, num_splits=5):
    """
    Splitting on podcast at random can lead to very high class imbalance in the test set - in some podcasts Lex tags are very few. 
    The strategy is to keep trying random seeds until the share of Lex clips in the test set is between 40% and 60%
    
    """
    
    cvs = []
    seen_seeds = set()
    for split in range(num_splits):
        seeds = np.random.randint(low=0, high=1000, size=(50,))
        for seed in seeds: 
            # cast the NumPy integer to a plain int before seeding random.Random
            seed = int(seed)
            # can't use the same seed again 
            if seed in seen_seeds:
                continue
            train_df, test_df = cv_split(df, seed=seed)
            counts = test_df['is_lex'].value_counts()
            counts = counts/counts.sum()
            # .get avoids a KeyError when the test set has no Lex clips
            if 0.40 <= counts.get(1.0, 0.0) <= 0.60:
                print(f'Split found with ratio: {counts.to_dict()}: seed: {seed}')
                break 
        else:
            # none of the candidate seeds produced a balanced test set
            raise RuntimeError('No balanced split found; rerun or widen the ratio bounds')
                
        cvs.append((train_df, test_df))
        seen_seeds.add(seed)
        
    return cvs 

        
splits = generate_splits(df, num_splits=5)        


Split found with ratio: {0.0: 0.5815217391304348, 1.0: 0.41847826086956524}: seed: 63
Split found with ratio: {0.0: 0.6, 1.0: 0.4}: seed: 650
Split found with ratio: {0.0: 0.5869565217391305, 1.0: 0.41304347826086957}: seed: 932
Split found with ratio: {0.0: 0.5934065934065934, 1.0: 0.4065934065934066}: seed: 56
Split found with ratio: {0.0: 0.5766423357664233, 1.0: 0.4233576642335766}: seed: 173
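
The split strategy above (hold out a quarter of the podcasts, then reject splits whose test set is too imbalanced) can be illustrated without pandas. The sketch below uses only the standard library and toy data; `split_by_group`, `positive_ratio` and `find_balanced_split` are hypothetical names, and the toy labels are made up.

```python
import random


def split_by_group(groups, seed, test_fraction=0.25):
    """Hold out a fraction of the group names (podcasts) as the test set."""
    rng = random.Random(seed)
    names = sorted(groups)
    test = set(rng.sample(names, int(len(names) * test_fraction)))
    train = [name for name in names if name not in test]
    return train, sorted(test)


def positive_ratio(groups, names):
    """Share of positive (is_lex == 1) labels across the given groups."""
    labels = [label for name in names for label in groups[name]]
    return sum(labels) / len(labels) if labels else 0.0


def find_balanced_split(groups, low=0.4, high=0.6, max_tries=1000):
    """Try seeds in order until the test set's positive ratio falls in [low, high]."""
    for seed in range(max_tries):
        train, test = split_by_group(groups, seed)
        if low <= positive_ratio(groups, test) <= high:
            return seed, train, test
    raise RuntimeError("no balanced split found")


# toy data: podcast name -> list of is_lex labels
toy = {
    "a": [1, 1, 0, 0], "b": [0, 0, 0, 0], "c": [1, 1, 1, 1], "d": [1, 0, 1, 0],
    "e": [0, 0, 0, 1], "f": [1, 1, 1, 0], "g": [0, 1, 0, 1], "h": [0, 0, 1, 1],
}
seed, train, test = find_balanced_split(toy)
print(seed, test, positive_ratio(toy, test))
```

Because whole podcasts are held out, no guest voice appears in both train and test, which is the point of splitting on podcast rather than on clip.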


In [9]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, classification_report
from sklearn.exceptions import ConvergenceWarning
import warnings 
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import svm

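# silence convergence warnings for the rest of the notebook
# (a bare `with warnings.catch_warnings():` block restores the filters as soon as it exits)
warnings.filterwarnings("ignore", category=ConvergenceWarning)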

def get_metrics(true, pred):
    f1, recall, precision, accuracy = f1_score(true, pred), recall_score(true, pred), precision_score(true, pred), accuracy_score(true, pred)
    
    return {'f1': f1, 'recall': recall, 'precision': precision, 'accuracy': accuracy}

def train_eval(X, splits):
    """
    Train model
    
    Inputs: 
        - X: All features X 
        - splits: List of tuples of (train_df, test_df)
    
    """
    test_metrics = []
    train_metrics = []
    fold_preds = []
    for train_df, test_df in splits:
        X_train, y_train = X[list(train_df.index), :], train_df['is_lex']
        X_test, y_test = X[list(test_df.index), :], test_df['is_lex']

        scaler = preprocessing.StandardScaler()
        
        # overfits fast with logistic regression
        # clf = LogisticRegression(random_state=0, max_iter=15, C=.7)
        clf = svm.SVC(kernel='rbf', C=.7)

        pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

        pipeline.fit(X_train, y_train)
        pred_train = pipeline.predict(X_train)
        pred_test = pipeline.predict(X_test)

        m1 = get_metrics(y_train, pred_train)
        m1['num_positive_samples'] = (y_train==1).sum()
        m1['num_negative_samples'] = (y_train==0).sum()
        train_metrics.append(m1)
        
        m2 = get_metrics(y_test, pred_test)
        m2['num_positive_samples'] = (y_test==1).sum()
        m2['num_negative_samples'] = (y_test==0).sum()
        test_metrics.append(m2)
        
        fold_df = test_df.copy()
        fold_df['preds'] = pred_test
        
        fold_preds.append(fold_df)
        
        
        

    train_metrics = pd.DataFrame(train_metrics)
    test_metrics = pd.DataFrame(test_metrics)

    display('Test stats ', test_metrics.describe().loc[[ 'mean','std']])
    display('Train stats ', train_metrics.describe().loc[[ 'mean','std']])
    
    return train_metrics, test_metrics, fold_preds
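

For reference, the four scores returned by `get_metrics` can be derived by hand from the confusion counts. The sketch below is a plain-Python illustration on made-up labels, not a replacement for the scikit-learn calls; `binary_metrics` is a hypothetical name.

```python
def binary_metrics(true, pred):
    """Compute precision, recall, F1 and accuracy for 0/1 labels from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(true, pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(true)
    return {'f1': f1, 'recall': recall, 'precision': precision, 'accuracy': accuracy}


# made-up labels: 4 Lex clips, 6 non-Lex clips, one Lex clip missed
true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(binary_metrics(true, pred))
```

Here precision is 3/3 and recall is 3/4, so F1 is their harmonic mean, 6/7.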



## Train and Predict
* Train classifier

In [10]:
# train and evaluate on each of the different hidden states
for X, hidden_name in zip([hidden_l1, hidden_l2, hidden_middle, hidden_last], ['first_hidden', 'second_hidden', 'middle_hidden', 'last_hidden']):          
    X_use = torch.cat(X, dim=0)
    
    print(f'\n----Metrics of output of {hidden_name} encoder block output----\n')
    train_metrics, test_metrics, pred_dfs = train_eval(X_use, splits)
    


----Metrics of output of first_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.741579,0.733944,0.763987,0.786568,67.8,96.6
std,0.059654,0.077747,0.111614,0.067345,10.917875,15.175638


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.902497,0.86197,0.947235,0.932789,196.2,346.4
std,0.014132,0.020646,0.012351,0.008928,10.917875,15.175638



----Metrics of output of second_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.794997,0.795201,0.813036,0.828643,67.8,96.6
std,0.05003,0.088273,0.115795,0.05858,10.917875,15.175638


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.920326,0.884621,0.959119,0.944697,196.2,346.4
std,0.009157,0.014203,0.005749,0.005722,10.917875,15.175638



----Metrics of output of middle_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.930347,0.931942,0.930254,0.942578,67.8,96.6
std,0.034269,0.038175,0.051011,0.028605,10.917875,15.175638


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.987889,0.986776,0.989027,0.991244,196.2,346.4
std,0.003545,0.004278,0.005904,0.002605,10.917875,15.175638



----Metrics of output of last_hidden encoder block output----



'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.907284,0.890827,0.926739,0.924785,67.8,96.6
std,0.018847,0.028368,0.051824,0.016446,10.917875,15.175638


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.980991,0.969397,0.992885,0.986433,196.2,346.4
std,0.006044,0.007926,0.005553,0.004272,10.917875,15.175638


### Explore predictions
* Get predictions of one fold

In [11]:
X_use = torch.cat(hidden_middle, dim=0)
train_metrics, test_metrics, pred_dfs = train_eval(X_use, splits)


'Test stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.930347,0.931942,0.930254,0.942578,67.8,96.6
std,0.034269,0.038175,0.051011,0.028605,10.917875,15.175638


'Train stats '

Unnamed: 0,f1,recall,precision,accuracy,num_positive_samples,num_negative_samples
mean,0.987889,0.986776,0.989027,0.991244,196.2,346.4
std,0.003545,0.004278,0.005904,0.002605,10.917875,15.175638


In [12]:
# predictions for one fold
fold_idx = 0
foldk_preds = pred_dfs[fold_idx]

display('Metrics ', test_metrics.loc[fold_idx])
display('Preds', foldk_preds.head())

'Metrics '

f1                        0.962025
recall                    0.987013
precision                 0.938272
accuracy                  0.967391
num_positive_samples     77.000000
num_negative_samples    107.000000
Name: 0, dtype: float64

'Preds'

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex,preds
39,00:13:00.920,00:13:04.400,"of what the genetics is like and the real,",episode_076,John Hopfield： Physics View of the Mind and Ne...,41,0.0,0.0
40,00:40:42.800,00:40:45.280,"After all, all the things your brain does,",episode_076,John Hopfield： Physics View of the Mind and Ne...,42,0.0,0.0
41,01:08:23.320,01:08:26.320,us weird descendants of apes.,episode_076,John Hopfield： Physics View of the Mind and Ne...,43,1.0,1.0
42,00:12:15.400,00:12:20.400,And since I can't see the evolutionary proces...,episode_076,John Hopfield： Physics View of the Mind and Ne...,44,0.0,0.0
43,01:00:06.200,01:00:08.440,in the near or maybe even far future,episode_076,John Hopfield： Physics View of the Mind and Ne...,45,1.0,1.0


#### Listen to the audio clip for a given audio_idx and explore the predictions

In [13]:
foldk_preds.sample(15)

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex,preds
195,00:04:00.080,00:04:01.080,anything like that.,episode_035,Jeremy Howard： fast.ai Deep Learning Courses a...,198,0.0,0.0
421,00:14:49.900,00:14:51.500,from the start and from the end.,episode_120,François Chollet： Measures of Intelligence ｜ L...,432,1.0,1.0
55,00:36:51.640,00:36:53.240,on a form of the Boltzmann machine,episode_076,John Hopfield： Physics View of the Mind and Ne...,57,0.0,0.0
629,00:14:12.400,00:14:18.720,"will be on the machine side. This is just, th...",episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",646,0.0,0.0
583,00:00:00.000,00:00:07.280,The following is a conversation with Elon Mus...,episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",600,1.0,1.0
275,01:20:23.740,01:20:24.940,Who cares about birthdays?,episode_189,David Sinclair： Extending the Human Lifespan B...,281,0.0,0.0
168,00:05:02.960,00:05:03.760,even love to do.,episode_098,Kate Darling： Social Robotics ｜ Lex Fridman Po...,171,1.0,1.0
621,00:11:22.000,00:11:28.720,So you can see the consequences of if you fir...,episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",638,0.0,0.0
630,00:14:44.720,00:14:51.040,It's not the cortex that's steering the monke...,episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",647,0.0,0.0
585,00:00:13.120,00:00:20.720,"sequel of all time, Godfather Part 2. As many...",episode_49,"Elon Musk： Neuralink, AI, Autopilot, and the P...",602,1.0,1.0
