## WebVTT caption file

- https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
- https://www.w3.org/TR/webvtt1/
- https://webvtt-py.readthedocs.io/en/latest/usage.html
- https://www.3playmedia.com/blog/how-to-create-a-webvtt-file/

In [None]:
# Reading WebVTT caption files from file-like object
import webvtt
import requests
from io import StringIO

url = "https://gist.githubusercontent.com/slevin48/4e6f7343376aef064992055b4c6da1bb/raw/e19399fbccbc069a2af4266e5120ae6bad62699a/sample.vtt"

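# Fail early on HTTP errors instead of parsing an error page as captions
response = requests.get(url, timeout=10)
response.raise_for_status()
buffer = StringIO(response.text)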

for caption in webvtt.read_buffer(buffer):
    print(caption.start)
    print(caption.end)
    print(caption.text)

In [1]:
import webvtt
# we can iterate over the captions
for caption in webvtt.read('test.vtt'):
    print(f'From {caption.start} to {caption.end}')
    print(caption.text)

From 00:01:14.815 to 00:01:18.114
- What?
- Where are we now?
From 00:01:18.171 to 00:01:20.991
- This is big bat country.
From 00:01:21.058 to 00:01:23.868
- [ Bats Screeching ]
- They won't get in your hair. They're after the bugs.
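
The contents of `test.vtt` are not shown in this notebook. As a rough sketch, a file consistent with the output above could look like the one below. The cue timings and text are copied from that output; the exact layout of the real file, the `SAMPLE_VTT` name, and the `reconstructed_test.vtt` file name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Cue timings and text copied from the printed output above; the layout of
# the author's real test.vtt is an assumption.
SAMPLE_VTT = """WEBVTT

00:01:14.815 --> 00:01:18.114
- What?
- Where are we now?

00:01:18.171 --> 00:01:20.991
- This is big bat country.

00:01:21.058 --> 00:01:23.868
- [ Bats Screeching ]
- They won't get in your hair. They're after the bugs.
"""

# Hypothetical file name, written to the temp directory so nothing is overwritten.
sample_path = Path(tempfile.gettempdir()) / "reconstructed_test.vtt"
sample_path.write_text(SAMPLE_VTT, encoding="utf-8")

# Blocks are separated by blank lines; the first block is the WEBVTT header.
blocks = SAMPLE_VTT.strip().split("\n\n")
print(blocks[0])        # WEBVTT
print(len(blocks) - 1)  # 3
```

Passing `str(sample_path)` to `webvtt.read` should then yield three captions like the ones printed above, assuming the reconstruction matches the original file closely enough.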


In [2]:
vtt = webvtt.read('test.vtt')
vtt[0]

<Caption start=00:01:14.815 end=00:01:18.114 text=- What?\n- Where are we now?>

In [3]:
len(vtt)

3

In [4]:
vtt[0].start

'00:01:14.815'

In [5]:
import time

# struct_time has no sub-second field, so the .815 milliseconds are dropped
time.strptime(vtt[0].start, "%H:%M:%S.%f")

time.struct_time(tm_year=1900, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=1, tm_sec=14, tm_wday=0, tm_yday=1, tm_isdst=-1)
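
The `struct_time` above carries no fractional seconds. If the milliseconds matter, one possible approach is to parse the timestamp with `datetime.strptime` and convert it to a `timedelta`. The sketch below assumes timestamps in the `hh:mm:ss.ttt` form shown above and an hour field below 24 (a limit of `%H`). The helper name `timestamp_to_timedelta` is made up for this example, and the strings are copied from the output rather than read from the file.

```python
from datetime import datetime, timedelta


def timestamp_to_timedelta(stamp: str) -> timedelta:
    """Convert an 'hh:mm:ss.ttt' string into an offset from the start of the media."""
    parsed = datetime.strptime(stamp, "%H:%M:%S.%f")
    return timedelta(
        hours=parsed.hour,
        minutes=parsed.minute,
        seconds=parsed.second,
        microseconds=parsed.microsecond,
    )


# Values taken from the first caption printed earlier
start = timestamp_to_timedelta("00:01:14.815")
end = timestamp_to_timedelta("00:01:18.114")
duration = end - start

print(start.total_seconds())  # 74.815
print(duration)               # 0:00:03.299000
```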

In [6]:
# you can also iterate over the lines of a particular caption
for line in vtt[0].lines:
    print(line)

- What?
- Where are we now?


In [7]:
# caption text is returned clean without class tags
# we can access the raw text of a caption with raw_text
vtt[0].text

'- What?\n- Where are we now?'

In [8]:
vtt[0].raw_text

'- What?\n- Where are we now?'
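
Here `text` and `raw_text` are identical because the cues in this file contain no tags. To make the difference concrete, the sketch below strips tags from a made-up cue payload with a regular expression. This is only an approximation for illustration, not webvtt-py's own cleaning logic, and the `raw` string is hypothetical.

```python
import re

# Hypothetical cue payload with a voice span and a bold tag
raw = "<v Roger>Hello <b>there</b></v>"

# Rough approximation only: remove anything that looks like a tag
clean = re.sub(r"<[^>]+>", "", raw)

print(raw)    # <v Roger>Hello <b>there</b></v>
print(clean)  # Hello there
```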

In [9]:
import pandas as pd
start = [v.start for v in vtt]
end = [v.end for v in vtt]
text = [v.text for v in vtt]
# pd.DataFrame(list(zip(start,end,text)),columns=['start','end','text'])
# Build the frame from a dictionary of lists ("data" avoids shadowing the built-in dict)
data = {'start': start, 'end': end, 'text': text}
df = pd.DataFrame(data)
df

Unnamed: 0,start,end,text
0,00:01:14.815,00:01:18.114,- What?\n- Where are we now?
1,00:01:18.171,00:01:20.991,- This is big bat country.
2,00:01:21.058,00:01:23.868,- [ Bats Screeching ]\n- They won't get in you...
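
The `start` and `end` columns above are plain strings. A minimal sketch of turning them into durations with `pd.to_timedelta` follows. It rebuilds a small frame from the values shown above so that it runs on its own; the `captions` name and the `start_td`, `end_td` and `duration_s` columns are introduced here for illustration.

```python
import pandas as pd

# Timestamps copied from the table above
captions = pd.DataFrame({
    "start": ["00:01:14.815", "00:01:18.171", "00:01:21.058"],
    "end": ["00:01:18.114", "00:01:20.991", "00:01:23.868"],
})

# Parse the 'hh:mm:ss.ttt' strings into Timedelta values
captions["start_td"] = pd.to_timedelta(captions["start"])
captions["end_td"] = pd.to_timedelta(captions["end"])

# Length of each caption in seconds
captions["duration_s"] = (captions["end_td"] - captions["start_td"]).dt.total_seconds()

print(captions[["start", "end", "duration_s"]])
```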


In [10]:
# The file has no header row; passing names= makes read_csv behave as with header=None
df = pd.read_csv("timestamps.txt",sep=' ',names=['date','time','a','b'])
# df['datetime'] = pd.to_datetime(df['date']+ " " + df['time'])
df.index = pd.to_datetime(df['date']+ " " + df['time'])
df

Unnamed: 0,date,time,a,b
2016-11-13 20:00:10.617989120,2016-11-13,20:00:10.617989120,7.0,132.0
2016-11-13 22:00:00.022737152,2016-11-13,22:00:00.022737152,1.0,128.0
2016-11-13 22:00:28.417561344,2016-11-13,22:00:28.417561344,1.0,132.0


In [11]:
dt = pd.to_datetime("2016-11-13 22:01:25.450")
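# get_loc(dt, method='nearest') was removed in pandas 2.0; get_indexer takes a list and returns an array
df.index.get_indexer([dt], method='nearest')[0]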

2

https://stackoverflow.com/questions/42264848/pandas-dataframe-how-to-query-the-closest-datetime-index
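
For comparison, the same nearest-timestamp lookup can be sketched without pandas, using `bisect` on a sorted list. This is an alternative illustration, not what the notebook runs. The `nearest_position` helper is hypothetical, and the timestamps are the three index values above truncated to microseconds, because `datetime` does not store nanoseconds.

```python
from bisect import bisect_left
from datetime import datetime


def nearest_position(sorted_times, target):
    """Return the position of the element of sorted_times closest to target."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    pos = bisect_left(sorted_times, target)
    if pos == 0:
        return 0
    if pos == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[pos - 1], sorted_times[pos]
    # In this sketch, ties go to the earlier element
    return pos - 1 if target - before <= after - target else pos


# Index values from the table above, truncated to microseconds
times = [
    datetime(2016, 11, 13, 20, 0, 10, 617989),
    datetime(2016, 11, 13, 22, 0, 0, 22737),
    datetime(2016, 11, 13, 22, 0, 28, 417561),
]
dt = datetime(2016, 11, 13, 22, 1, 25, 450000)

print(nearest_position(times, dt))  # 2
```

The list must already be sorted for `bisect_left` to give a meaningful answer.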