## Create Lex Speaker Dataset
* Given the podcast audio files and the labelled dataset of audio clips, write each labelled clip to its own file.
* These audio clips will be used to train the speaker prediction model.


In [1]:
import torch
from pathlib import Path
import os
import sys
import pandas as pd
import ffmpeg
from tqdm import tqdm
from datetime import datetime

print('is cuda available:', torch.cuda.is_available())

# Add whisper repo to path to import
repo_dir = Path(os.getcwd()).parents[0]/'whisper'
sys.path.append(str(repo_dir))
import whisper


is cuda available: True


### Labelled dataset

* `start` and `end` are the timestamps that bound the audio clip within the podcast.
* `audio_name` is the name of the podcast. The podcast audio files should be present in `data/podcasts`.
* `is_lex` is the label: 1 if the speaker between `start` and `end` is Lex, 0 otherwise.
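
The column layout described above can be sanity-checked before any audio is written. The following is a minimal sketch and not part of the original notebook: it uses only the standard library, the helper names `to_seconds` and `check_row` are hypothetical, and the two sample rows are copied from the first rows of the labelled dataset with the columns trimmed for brevity.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the labelled dataset (only the columns needed here).
SAMPLE = """start,end,audio_idx,is_lex
02:49:11.280,02:49:15.120,0,0.0
02:20:14.140,02:20:17.260,1,1.0
"""

def to_seconds(ts):
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds (hypothetical helper)."""
    t = datetime.strptime(ts, "%H:%M:%S.%f")
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1e6

def check_row(row):
    """Validate one labelled row and return the clip duration in seconds."""
    start, end = to_seconds(row["start"]), to_seconds(row["end"])
    if end <= start:
        raise ValueError(f"clip {row['audio_idx']}: end is not after start")
    if float(row["is_lex"]) not in (0.0, 1.0):
        raise ValueError(f"clip {row['audio_idx']}: is_lex must be 0 or 1")
    return end - start

durations = [check_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print([round(d, 2) for d in durations])
```

Running a check like this over the whole CSV would surface swapped timestamps or unexpected label values before the slow trimming step starts.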

In [2]:
labelled_path = 'data/labelled_dataset.csv'
df = pd.read_csv(labelled_path)

assert df['audio_idx'].duplicated().sum() == 0

df.head()

Unnamed: 0,start,end,text,fname,audio_name,audio_idx,is_lex
0,02:49:11.280,02:49:15.120,"And, you know, some people also ask, are you ...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",0,0.0
1,02:20:14.140,02:20:17.260,I still do that often.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",1,1.0
2,00:19:15.360,00:19:17.320,things that you put into context of GPT.,episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",2,0.0
3,02:45:11.760,02:45:16.000,"and that also gives, you know, huge perspecti...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",3,0.0
4,01:33:44.600,01:33:49.160,"You, it's often the way how it works is you o...",episode_215,"Wojciech Zaremba： OpenAI Codex, GPT-3, Robotic...",4,0.0


In [3]:
podcasts = list(df['audio_name'].unique())
print('Num audio clips:', len(df))
print(f'Num unique podcasts: {len(podcasts)}')

print(f'\n\nPodcasts containing the most tagged clips\n{df["audio_name"].value_counts().head(8)}')
print(f'\n\nPodcasts containing the least tagged clips\n{df["audio_name"].value_counts().tail(2)}')

Num audio clips: 698
Num unique podcasts: 68


Podcasts containing the most tagged clips
Elon Musk： Neuralink, AI, Autopilot, and the Pale Blue Dot ｜ Lex Fridman Podcast #49                   65
Ray Dalio： Principles, the Economic Machine, AI & the Arc of Life ｜ Lex Fridman Podcast #54            21
Judea Pearl： Causal Reasoning, Counterfactuals, and the Path to AGI ｜ Lex Fridman Podcast #56          21
Dmitry Korkin： Computational Biology of Coronavirus ｜ Lex Fridman Podcast #90                          21
Jeremy Howard： fast.ai Deep Learning Courses and Research ｜ Lex Fridman Podcast #35                    21
Cumrun Vafa： String Theory ｜ Lex Fridman Podcast #204                                                  21
Po-Shen Loh： Mathematics, Math Olympiad, Combinatorics & Contact Tracing ｜ Lex Fridman Podcast #183    20
Jim Keller： Moore's Law, Microprocessors, and First Principles ｜ Lex Fridman Podcast #70               20
Name: audio_name, dtype: int64


Podcasts containing the leas

### Create Audio Clips Dataset 
* Create individual audio segments from the labelled data, so that each audio clip corresponds to one label (Lex or not Lex).
* The model will be trained to predict whether the speaker in each audio clip is Lex.
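
The one-clip-per-label correspondence can be sketched as a list of (path, label) pairs. This is only an illustration under stated assumptions: `build_manifest` is a hypothetical helper that does not appear in the notebook, and it assumes clips are named `<audio_idx>.mp3` inside `data/audio_dataset`, which is the naming used when the segments are written.

```python
from pathlib import Path

def build_manifest(rows, clip_dir="data/audio_dataset"):
    """Pair each clip path with its integer label (hypothetical helper).

    rows: iterable of (audio_idx, is_lex) pairs taken from the labelled dataset.
    Clips are assumed to be named '<audio_idx>.mp3' inside clip_dir.
    """
    clip_dir = Path(clip_dir)
    return [(clip_dir / f"{idx}.mp3", int(is_lex)) for idx, is_lex in rows]

# First three (audio_idx, is_lex) pairs from the labelled dataset.
manifest = build_manifest([(0, 0.0), (1, 1.0), (2, 0.0)])
for path, label in manifest:
    print(path.as_posix(), label)
```

A training script could consume such a list directly, since each entry carries everything needed to load one example and its target.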

In [4]:
def trim(in_file, out_file, start, end):
    """
    Trim in_file from start to end and write the segment to out_file.

    Inputs:
        - in_file: path of the input podcast audio file
        - out_file: path (pathlib.Path) of the output audio clip
        - start: start timestamp of the audio segment in in_file
        - end: end timestamp of the audio segment in in_file

    Returns:
        - duration of the written clip as reported by ffmpeg.probe, or None
    """
    # remove any previous output so the clip is always written fresh
    if out_file.exists():
        os.remove(out_file)

    input_stream = ffmpeg.input(str(in_file))

    # keep only [start, end], then reset timestamps so the clip starts at 0
    pts = "PTS-STARTPTS"
    audio = (input_stream
             .filter_("atrim", start=start, end=end)
             .filter_("asetpts", pts))

    output = ffmpeg.output(audio, str(out_file), format="mp3")
    output.run()

    # probe the written file and return its duration
    out_file_probe_result = ffmpeg.probe(str(out_file))
    out_file_duration = out_file_probe_result.get(
        "format", {}).get("duration", None)
    return out_file_duration

def get_seconds(ts):
    """
    Get seconds from timestamp
    
    """
    # convert to datetime instance
    date_time = datetime.strptime(ts, '%H:%M:%S.%f')
    time = date_time.hour * 3600 + date_time.minute * 60 + date_time.second + date_time.microsecond/10**6
    
    return time


# directory the trimmed clips are written to; parent 'data' is created if needed
write_dir = Path('data/audio_dataset')
write_dir.mkdir(parents=True, exist_ok=True)

# the podcast audio files are expected to be present here
clips_dir = Path("data/podcasts/")
if not clips_dir.exists():
    raise ValueError(f'Expected Lex podcast audio files at {clips_dir}')
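
The directory check above only confirms that the podcasts folder exists. A per-file check can report every missing podcast at once, before the long trimming loop starts. This is a sketch under stated assumptions: `missing_podcasts` is a hypothetical helper, the podcast files are assumed to be named `<audio_name>.mp3`, and the demonstration uses a temporary directory with placeholder names rather than real episodes.

```python
import tempfile
from pathlib import Path

def missing_podcasts(audio_names, podcasts_dir):
    """Return the names in audio_names that have no '<name>.mp3' in podcasts_dir."""
    podcasts_dir = Path(podcasts_dir)
    return [name for name in audio_names
            if not (podcasts_dir / f"{name}.mp3").exists()]

# Demonstration on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "episode_a.mp3").touch()
    missing = missing_podcasts(["episode_a", "episode_b"], tmp)

print(missing)
```

In the notebook this could be called with the unique `audio_name` values and `clips_dir`, raising an error if the returned list is not empty.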
    

### Write audio segments

In [None]:
for _, row in tqdm(df.iterrows(), total=len(df)):
    audio_idx, audio_name = row.audio_idx, row.audio_name
    start, end = row.start, row.end
    
    in_file = clips_dir/f"{audio_name}.mp3"
    out_file = write_dir/f"{audio_idx}.mp3"
    
    assert in_file.exists(), f"Missing podcast file: {in_file}"
    trim(in_file, out_file, start, end)
    

 74%|███████████████████████████████████████████████████████████                     | 515/698 [22:14<05:42,  1.87s/it]
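
Because writing all clips takes many minutes, it can help to know which clips still need to be written after an interrupted run. The following is a minimal sketch, not part of the original notebook: `pending_clips` is a hypothetical helper, it assumes clips are named `<audio_idx>.mp3`, and it treats an empty file as unfinished. The demonstration uses a temporary directory with placeholder files.

```python
import tempfile
from pathlib import Path

def pending_clips(audio_idxs, write_dir):
    """Return the audio_idx values whose '<audio_idx>.mp3' is missing or empty."""
    write_dir = Path(write_dir)
    pending = []
    for idx in audio_idxs:
        clip = write_dir / f"{idx}.mp3"
        if not clip.exists() or clip.stat().st_size == 0:
            pending.append(idx)
    return pending

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "0.mp3").write_bytes(b"placeholder")  # stands in for a finished clip
    (Path(tmp) / "1.mp3").touch()                      # empty file, e.g. an interrupted write
    todo = pending_clips([0, 1, 2], tmp)

print(todo)
```

Restricting the loop to the rows whose `audio_idx` is in the returned list would avoid re-encoding clips that already exist. A size check cannot detect a truncated but non-empty file, so comparing each clip's probed duration with `end - start` would be a stricter test.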