# Get Subtitle Text by Timestamp

## Load File

In [25]:
# file source http://www.zmtiantang.com/sub/444958.html
file_substring = "半沢直樹＃0"
episode = 1
# Use a context manager so the file handle is closed after reading.
with open(f"./hanzawa/{file_substring}{episode}.srt", encoding='UTF-8-sig') as f:
    subtitle_original = f.read()

## Split Subtitles

Split the file contents into sections, one per subtitle. Sections in an srt file are separated by a blank line.

In [39]:
subtitles_text =  subtitle_original.split('\n\n')

print(f"{subtitles_text[0]}\n\n{subtitles_text[1]}\n\n{subtitles_text[2]}")

1
00:00:35,034 --> 00:00:40,005
(半沢)この産業中央銀行で
働くことは 私の夢でした

2
00:00:40,005 --> 00:00:43,642
≪(面接官)いや しかし銀行は
うちだけじゃないでしょう

3
00:00:43,642 --> 00:00:45,644
いいえ
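
Splitting on `'\n\n'` assumes Unix line endings and exactly one blank line between sections. The following is a minimal sketch of a more tolerant split, assuming the file might use Windows line endings or end with extra blank lines. `split_sections` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def split_sections(raw):
    """Split raw .srt text into sections separated by one or more blank lines."""
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    normalised = raw.replace('\r\n', '\n').replace('\r', '\n')
    # A separator is a newline followed by one or more lines that are empty
    # or contain only spaces/tabs.
    parts = re.split(r'\n(?:[ \t]*\n)+', normalised.strip())
    # Drop any empty leftovers so every element is a real section.
    return [p for p in parts if p.strip()]

# Made-up sample with Windows line endings and a trailing blank line.
sample = (
    "1\r\n00:00:35,034 --> 00:00:40,005\r\nline a\r\nline b\r\n\r\n"
    "2\r\n00:00:40,005 --> 00:00:43,642\r\nline c\r\n\r\n"
)
sections = split_sections(sample)
print(len(sections))
print(sections[1])
```

Because empty leftovers are dropped, every returned element starts with a section id, which avoids feeding an empty string to later parsing steps.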


## Function and Class

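The `time2sec` function converts a timestamp string into milliseconds (despite its name, it does not return seconds). It accepts either the srt format (`'00:01:28,337'`) or a player-style clock (`'12:20'` or `'1:24:02'`), and adds an optional `delay` in milliseconds.

The `Subtitle` class parses one subtitle section and stores its id, start time, end time, and text.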

In [40]:
def time2sec(s, fromSrt = True, delay = 0):
    hour = 0
    minute = 0
    second = 0
    millis = 0
    if fromSrt:
        #'00:01:28,337'
        s = s.split(',')
        hms = s[0].split(':')

        hour = int(hms[0])
        minute = int(hms[1])
        second = int(hms[2])
        millis = int(s[1])

    else:
        s = s.split(':')
        if len(s) == 2:
            #'12:20'
            minute = int(s[0])
            second = int(s[1])
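        elif len(s) == 3:
            #'1:24:02'
            hour = int(s[0])
            minute = int(s[1])
            second = int(s[2])
        else:
            # Any other shape would otherwise be silently treated as 0:00:00.
            raise ValueError(f"unrecognized timestamp: {':'.join(s)}")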
        
        
    return (60 * 60 * hour + 60 * minute + second) * 1000 + millis + delay


class Subtitle:    
    def __init__(self, s=''):
        if (s == ''):
            self.id = 0
            self.start = 0
            self.end = 0
            self.text = ''
        else:
            s_list = s.split('\n')
            self.id = int(s_list[0])

            time_list = s_list[1].split(' --> ')
            self.start = time2sec(time_list[0])
            self.end = time2sec(time_list[1])

            self.text = ''.join(s_list[2:])
            
    def __str__(self):
        """ For print() function """
        return f"Line : {self.id} \t{self.start} ~ {self.end}\n{self.text} \n\n"
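
As a worked check of the srt-format arithmetic, here is a compact standalone sketch. `srt_to_ms` is a hypothetical name used only for this illustration; it covers the srt branch of `time2sec` without the `delay` parameter.

```python
def srt_to_ms(stamp):
    """Convert an srt timestamp 'HH:MM:SS,mmm' to milliseconds."""
    hms, millis = stamp.split(',')
    hour, minute, second = (int(x) for x in hms.split(':'))
    return (60 * 60 * hour + 60 * minute + second) * 1000 + int(millis)

# Timing line of the first subtitle shown earlier in this notebook.
timing = "00:00:35,034 --> 00:00:40,005"
start, end = (srt_to_ms(t) for t in timing.split(' --> '))
print(start, end)  # 35034 40005
```

The first value is 35 seconds plus 34 milliseconds, that is 35 * 1000 + 34 = 35034.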

## Create Instance

Each section string extracted from the srt file is converted into a `Subtitle` instance for easier access to its fields. To make the list indices match the original ids in the srt file, which start at 1, we insert a dummy `Subtitle` instance at the head of the list.

In [29]:
subtitles = [Subtitle(st) for st in subtitles_text]
subtitles.insert(0, Subtitle())

In [30]:
## test
print(subtitles[17], subtitles[18], subtitles[19])

Line : 17 	91307 ~ 95494
そのとき私達を救ってくださったのは 

 Line : 18 	95494 ~ 99331
それまで つきあい程度しか取り引きをしていなかった 

 Line : 19 	99331 ~ 102331
こちらの産業中央銀行です 




## Search Function

A linear search that finds the subtitle whose time range contains the given timestamp and returns its text together with the text of the previous and next subtitles, for context.

In [31]:
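def search_timestamp(t):
    # Start at 1 to skip the dummy entry at index 0.
    for i in range(1, len(subtitles)):
        s = subtitles[i]
        if s.start <= t <= s.end:
            # Include the neighbouring subtitles for context; the slice stays
            # within bounds for the first and last subtitle.
            context = subtitles[max(i - 1, 1):i + 2]
            return '\n'.join(c.text for c in context)
    return "not found"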
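
Since subtitles in an srt file are normally ordered by start time, a binary search is a possible alternative to the linear scan. The sketch below is not the notebook's method: `find_index` is a hypothetical helper that works on plain lists of start and end times, and it assumes the start times are sorted in ascending order and the ranges do not overlap except at shared endpoints.

```python
from bisect import bisect_right

def find_index(starts, ends, t):
    """Return the index of the range containing t, or -1 if there is none.

    Assumes `starts` is sorted ascending and ranges overlap at most at endpoints.
    """
    # Index of the last range whose start is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t <= ends[i]:
        return i
    return -1

# Start/end times (ms) of the first three subtitles shown earlier.
starts = [35034, 40005, 43642]
ends = [40005, 43642, 45644]
print(find_index(starts, ends, 41000))  # 1
print(find_index(starts, ends, 1000))   # -1
```

One difference from the linear search: when `t` falls exactly on a boundary shared by two subtitles (for example 40005), this sketch returns the later subtitle, while the linear scan returns the earlier one.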

## Delay

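To calibrate for the offset between the source video and the downloaded subtitle file, we ask for the video timestamp at which the first subtitle appears. The delay is the first subtitle's start time in the srt file minus that video timestamp, in milliseconds; adding it to any video timestamp gives the corresponding time in the srt file.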

In [32]:
delay = subtitles[1].start - time2sec(input("First subtitle started = "), fromSrt = False)

First subtitle started = 0:3


In [33]:
# input timestamp
timestamp = input("Timestamp = ")
timestamp = time2sec(timestamp, fromSrt = False, delay = delay)
timestamp

Timestamp = 23:41


1453034
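
The numbers above can be reproduced by hand. The following standalone sketch repeats the arithmetic with the values from this run; `clock_to_ms` is a hypothetical stand-in for `time2sec(..., fromSrt = False)`, and 35034 is the start of the first subtitle (`00:00:35,034`).

```python
def clock_to_ms(clock):
    """Convert a player-style clock 'M:S' or 'H:M:S' to milliseconds."""
    parts = [int(p) for p in clock.split(':')]
    # Pad missing leading fields (hours) with zero.
    while len(parts) < 3:
        parts.insert(0, 0)
    hour, minute, second = parts
    return (60 * 60 * hour + 60 * minute + second) * 1000

first_srt_start = 35034                        # subtitle 1 starts here in the srt file
delay = first_srt_start - clock_to_ms("0:3")   # 35034 - 3000
srt_time = clock_to_ms("23:41") + delay        # 1421000 + delay
print(delay, srt_time)  # 32034 1453034
```

So the subtitle file runs about 32 seconds ahead of the video, and 23:41 in the video corresponds to 1453034 ms in the srt file.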

In [34]:
print(search_timestamp(timestamp))

(川原)半沢さん
これは うちの支店にとどまらず
関西支部全体に関わる戦略案件になるはずです
