# Cleaning Transcript Example

### Notebook Formatting (Run This to Add Space Between Code Cells)

In [25]:
import IPython.display as di

di.display_html("""
$('<style>.code_cell { margin-bottom: 100px !important;}</style>').appendTo('head');
""", raw=True)

### Starting Example Caption

In [26]:
import re
with open("captions/example_caption.txt") as captionFile:
    text = captionFile.read()
print(text)

1
00:00:00,000 --> 00:00:05,281
Imagine, using real closed-captions!

2
00:00:05,281 --> 00:09:00,000
[Music]

3
00:00:11,000 --> 00:00:14,281
Professor: Couldn't be me... in 1993


### Remove Timestamps
00:00:00,000 --> 00:00:05,281<br>
00:00:05,281 --> 00:09:00,000<br>

In [27]:
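# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" plus one trailing whitespace character,
# or a compact "HH:MM:" stamp followed by five digits.
R_TIMESTAMP = r"(\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d\s)|(\d\d:\d\d:\d\d\d\d\d)"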
clean_text1 = re.sub(R_TIMESTAMP, "", text)
print(clean_text1)

1
Imagine, using real closed-captions!

2
[Music]

3
Professor: Couldn't be me... in 1993
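
The timestamp pattern is easier to audit when written with `re.VERBOSE`. The following is a minimal sketch, assuming SRT-style `HH:MM:SS,mmm --> HH:MM:SS,mmm` lines. The names `TIMESTAMP_LINE` and `strip_timestamps` and the sample string are illustrative and not part of the notebook; the sketch covers only the first alternative of `R_TIMESTAMP`.

```python
import re

# Hypothetical, commented equivalent of the first alternative in R_TIMESTAMP.
TIMESTAMP_LINE = re.compile(
    r"""
    \d{2}:\d{2}:\d{2},\d{3}   # start time, e.g. 00:00:05,281
    [ ]-->[ ]                 # arrow separator with single spaces
    \d{2}:\d{2}:\d{2},\d{3}   # end time
    \s                        # one trailing whitespace char (normally the newline)
    """,
    re.VERBOSE,
)


def strip_timestamps(text):
    """Remove every SRT-style timestamp line from text."""
    return TIMESTAMP_LINE.sub("", text)


sample = "1\n00:00:00,000 --> 00:00:05,281\nImagine, using real closed-captions!\n"
print(strip_timestamps(sample))
```

In verbose mode unescaped whitespace is ignored, which is why the literal spaces around the arrow are written as `[ ]`.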


### Remove Caption Index Numbers
1<br>
2<br>
3<br>

In [28]:
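# Anchor to the start of a line so a caption that ends in a number (e.g. "in 1993") is left intact.
R_CAPTION_NUMBERS = r"(?m:^\d+\n)"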
clean_text2 = re.sub(R_CAPTION_NUMBERS, "", clean_text1)
print(clean_text2)

Imagine, using real closed-captions!

[Music]

Professor: Couldn't be me... in 1993
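
A digits-then-newline pattern can also match a number that merely ends a caption line. The sketch below uses a made-up sample string to contrast a word-boundary pattern with one anchored to the start of a line. `UNANCHORED`, `ANCHORED`, and `sample` are illustrative names, not part of the notebook.

```python
import re

# Hypothetical sample: the caption text itself ends in a number.
sample = "3\nProfessor: Couldn't be me... in 1993\n"

UNANCHORED = r"\b\d+\n"     # any run of digits directly before a newline
ANCHORED = r"(?m:^\d+\n)"   # only lines that consist entirely of digits

print(repr(re.sub(UNANCHORED, "", sample)))  # the trailing 1993 is removed too
print(repr(re.sub(ANCHORED, "", sample)))    # only the index line "3" is removed
```

The scoped flag `(?m:...)` limits multiline mode to this one pattern, so it can still be joined with other patterns using `|`.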


### Remove Actions
[Music]<br>
[Ruffles Paper]<br>
[Laughs]<br>

In [29]:
# Allow spaces inside the brackets so multi-word actions such as [Ruffles Paper] also match.
R_ACTIONS = r"\[[a-zA-Z ]+\]"
clean_text3 = re.sub(R_ACTIONS, "", clean_text2)
print(clean_text3)

Imagine, using real closed-captions!



Professor: Couldn't be me... in 1993


### Remove Speakers
Professor:<br>
Student:<br>
Speaker:<br>

In [30]:
R_SPEAKERS = r"[a-zA-Z]+:\s"
clean_text4 = re.sub(R_SPEAKERS, "", clean_text3)
print(clean_text4)

Imagine, using real closed-captions!



Couldn't be me... in 1993


### Remove Punctuation (Final Product)
Removed: characters such as ,.+=!%^&*()[]<br>
Kept: hyphens (-) and apostrophes (')<br>

In [31]:
R_PUNCT = r"[^\w\s\'\-]"
clean_text5 = re.sub(R_PUNCT, "", clean_text4)
print(clean_text5)

Imagine using real closed-captions



Couldn't be me in 1993
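
Because the character class is negated, it is worth checking what survives it. A quick sketch with a made-up sample string: `\w` keeps letters (including accented ones), digits, and underscores, so only the parentheses and the percent sign are dropped here.

```python
import re

R_PUNCT = r"[^\w\s\'\-]"  # same pattern as in the cell above

# Made-up sample covering accents, underscores, digits, and symbols.
sample = "café_1 (50%) well-known don't"
print(re.sub(R_PUNCT, "", sample))
```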


### Lowercase Text

In [32]:
clean_text6 = clean_text5.lower()
print(clean_text6)

imagine using real closed-captions



couldn't be me in 1993
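
The individual steps can be wrapped in one helper that also drops the empty lines left behind by removed captions. This is a minimal sketch, not the notebook's own code: `clean_caption` and `example_patterns` are hypothetical names, and the patterns are simplified stand-ins that should be swapped for the ones defined in the cells above.

```python
import re


def clean_caption(text, patterns):
    """Apply each removal pattern in order, lowercase, and drop empty lines."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    text = text.lower()
    # Removed captions leave blank lines behind; keep only non-empty lines.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Illustrative patterns for this sketch only.
example_patterns = [
    r"\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d\s",  # timestamps
    r"(?m:^\d+\n)",          # caption index numbers
    r"\[[^\]]*\]",           # bracketed actions
    r"(?m:^[a-zA-Z]+:\s)",   # speaker labels at the start of a line
    r"[^\w\s'\-]",           # punctuation except apostrophes and hyphens
]

sample = (
    "1\n00:00:00,000 --> 00:00:05,281\nImagine, using real closed-captions!\n\n"
    "2\n00:00:05,281 --> 00:09:00,000\n[Music]\n\n"
    "3\n00:00:11,000 --> 00:00:14,281\nProfessor: Couldn't be me... in 1993\n"
)
print(clean_caption(sample, example_patterns))
```

Passing the patterns as a list keeps the order of the steps explicit, which matters because the punctuation step would otherwise strip the brackets and colons that the action and speaker patterns rely on.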


# Real Transcript Example

In [33]:
import json
# Combine the individual patterns into a single alternation.
REMOVE_REGEX = "|".join([R_TIMESTAMP, R_CAPTION_NUMBERS, R_ACTIONS, R_SPEAKERS, R_PUNCT])
with open("captions/real_caption.txt", "r") as dataFile:
    real_text = dataFile.read()
print(real_text)

1
00:00:00,589 --> 00:00:05,549
hi everyone it's dr. LeBlanc with a

2
00:00:03,720 --> 00:00:08,189
video screen capture lecture on

3
00:00:05,549 --> 00:00:10,800
translation protein synthesis our

4
00:00:08,189 --> 00:00:13,559
learning goals are to understand the

5
00:00:10,800 --> 00:00:16,830
mechanism of protein translation in both

6
00:00:13,559 --> 00:00:19,650
prokaryotes and eukaryotes to understand

7
00:00:16,830 --> 00:00:22,199
how antibiotics interfere at specific

8
00:00:19,650 --> 00:00:24,529
ribosomal sites and with specific steps

9
00:00:22,199 --> 00:00:27,210
of translation in prokaryotes to

10
00:00:24,529 --> 00:00:29,730
understand the key specific events of

11
00:00:27,210 --> 00:00:33,840
each phase of translation initiation

12
00:00:29,730 --> 00:00:36,030
elongation and termination to understand

13
00:00:33,840 --> 00:00:38,100
what a poly ribosome is and that in

14
00:00:36,030 --> 00:00:40,530
prokaryotes transcription and

15
00:00:38,100 -->

### Cleaned Real Transcript

In [34]:
clean_real_text = re.sub(REMOVE_REGEX, "", real_text)
clean_real_text = clean_real_text.lower()
print(clean_real_text)

hi everyone it's dr leblanc with a

video screen capture lecture on

translation protein synthesis our

learning goals are to understand the

mechanism of protein translation in both

prokaryotes and eukaryotes to understand

how antibiotics interfere at specific

ribosomal sites and with specific steps

of translation in prokaryotes to

understand the key specific events of

each phase of translation initiation

elongation and termination to understand

what a poly ribosome is and that in

prokaryotes transcription and

translation from a gene occur

simultaneously in the same cellular

compartment to understand how the

features and subunits of the ribosome

coordinate to achieve initiation to

understand how ribosomes position

themselves on the messenger rna during

initiation and the role of complementary

base sequences in the messenger rna and

ribosomal rna in prokaryotes to

understand the roles of initiation and

release factors to understand how trna

is bind and move throug