# About
- This notebook does some basic cleaning of the transcript dataset.
- It mostly looks for patterns and removes them. The patterns include text in all caps and text in square brackets ("[]").
- These are often used to indicate ambient noise or to identify speakers.
- This workflow should remove around 90-95% of the noise in the dataset.
- However, a manual review is still needed after this cleanup
- (e.g. special characters and one-off noises are not captured by this process).
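
The two pattern types listed above can be illustrated on a made-up snippet. This is only a minimal sketch: `find_cues`, `BRACKET_RE`, `CAPS_RE` and the sample text are hypothetical names introduced here, and the all-caps pattern is a simplified stand-in for the longer one the notebook uses.

```python
import re

# Hypothetical patterns for the two kinds of noise markers.
BRACKET_RE = re.compile(r"\[[^\[\]]+\]")               # e.g. [Matt], [crowd laughing]
CAPS_RE = re.compile(r"\b[A-Z]+(?:[ \t]+[A-Z]+)*\b")   # e.g. LAUGHTER, HE LAUGHS


def find_cues(text):
    """Return the distinct bracketed and all-caps spans found in text."""
    return {
        "brackets": set(BRACKET_RE.findall(text)),
        "caps": set(CAPS_RE.findall(text)),
    }


sample = "[Matt] Oh, no!\nLAUGHTER\nI watch TV. [gasps]"
cues = find_cues(sample)
print(cues["brackets"])
print(cues["caps"])
```

Genuine words such as `I` and `TV` show up among the all-caps candidates, which is why the notebook reviews the matches by hand before removing anything.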

In [18]:
import pandas as pd
import re
import string
from bs4 import BeautifulSoup
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import spacy


In [36]:
nlp = spacy.load('en_core_web_sm')

# 0) Read Data

In [146]:
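# Read the raw transcript; the encoding is stated explicitly so the read
# does not depend on the platform default
with open('S12 E10.txt', 'r', encoding='utf-8') as text:
    textfile = text.read()
    print(textfile[:1000])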

It's the final, and the remaining bakers are set three challenges that will test every aspect of their baking skills, including a classic carrot cake and a Mad Hatter's tea party banquet. One baker will be crowned the Bake Off 2021 winner.
[Matt] In the beginning…

Good luck in there.

…there were 12.

-[laughing]
-Hello, Prue, how you doing?

[singing "Meet the Flintstones" in German]

I don't want to get emotional.

Everybody's terrified
about collapsing cakes.

-[crowd laughing]
-[gasps]

-[Matt] Oh, no!
-[Freya] Oh, my God.

-[Matt] Now…
-[Crystelle] Breathe.

Just remember to breathe.



…there are three.

-[Giuseppe] There's so much to do.
-Are you taking the mick?

[Crystelle] Do not drop.

-This is not going to be pretty.
-[Crystelle gasps]

Can't do anything now, can I?

[gasps] Oh, the oven is off.

What have you done?

[opening theme music playing]

[Noel] The 2021 final is like no other.

-Are you serious?
-Yeah, absolutely.

[Noel] For the first time ever,

each of our

In [147]:
# Keep the raw text in textfile and work on clean_textfile from here on
clean_textfile = textfile

# 1) Removing Caps

In [148]:
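# Identify all text in all caps (single words or runs such as 'HE LAUGHS').
# Each following word may contain one lowercase letter, so a run can spill
# into the next capitalised word (e.g. 'LAUGHTER\n\nNo').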
all_caps = set(re.findall(r"(\b(?:[A-Z]+[A-Z]*|[A-Z]*[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)"
                      , clean_textfile
                      , re.DOTALL))

all_caps

{'A', 'DNA', 'I', 'TV'}
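
The pattern in the cell above lets each word after the first contain one lowercase letter and lets `\s+` cross line breaks, which is why a match can run into the start of the next sentence. The sketch below compares it with a stricter variant on a made-up sample. `STRICT_RE` is a hypothetical alternative, not the notebook's pattern: it accepts uppercase letters only and stays on one line.

```python
import re

# Pattern copied from the cell above.
ORIGINAL_RE = re.compile(
    r"(\b(?:[A-Z]+[A-Z]*|[A-Z]*[A-Z]+)\b"
    r"(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)",
    re.DOTALL,
)
# Hypothetical stricter alternative: uppercase words only, same line only.
STRICT_RE = re.compile(r"\b[A-Z]+(?:[ \t]+[A-Z]+)*\b")

sample = "Fine.\nLAUGHTER\n\nNo, really."
original_hits = ORIGINAL_RE.findall(sample)
strict_hits = STRICT_RE.findall(sample)
print(original_hits)  # the match spills into "No"
print(strict_hits)    # only the all-caps cue
```

The stricter variant still returns single capital letters such as `A` and `I`, so the manual selection step remains necessary either way.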

In [149]:
# Manually pick the all-caps items to remove; items to keep are commented out
# Keep items that look like actual words that belong in caps (e.g. A, I, DIY, HR, TV)
cap_to_remove={
    # 'A',
 'APPLAUSE',
 'BLEEP',
#  'DNA',
 'EXHALES',
 'HE LAUGHS',
 'HE LAUGHS\n\nHe',
#  'I',
 'LAUGHING',
 'LAUGHTER',
 'LAUGHTER\n\nNo',
 'LAUGHTER\n\nSo',
#  'OK',
 'ON PHONE',
 'PRUE CHUCKLES\n\nIn',
 'SHE SIGHS',
 'SHRIEKING\n\nOh',
 'SHRIEKING ON PHONE',
 'SO',
 'STIFLED LAUGHTER',
 'THEY HUM',
 'THEY LAUGH',
 'V',
 'STIFLED'} 

# Delete cap words by replacing them with ''
# Note: str.replace also matches inside longer words, so removing 'V' turns 'TV' into 'T'
for words in cap_to_remove:
    clean_textfile = clean_textfile.replace(words, '')

In [150]:
# Check that caps were removed properly; if anything was missed, add the word in the step above and rerun the code

validate_caps = set(re.findall(r"(\b(?:[A-Z]+[A-Z]*|[A-Z]*[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)"
                      , clean_textfile
                      , re.DOTALL))

validate_caps

{'A', 'DNA', 'I', 'T'}
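
The `'T'` in the output above appears to be what is left of `'TV'` once the entry `'V'` has been removed, because `str.replace` matches inside longer words as well. One way to avoid this, sketched here under assumptions, is to match whole words only and to try longer phrases first. `remove_phrases` is a hypothetical helper, not part of the notebook.

```python
import re


def remove_phrases(text, phrases):
    """Delete each phrase only where it stands as whole words."""
    # Longest first so 'STIFLED LAUGHTER' is tried before 'LAUGHTER'.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    )
    return pattern.sub("", text)


sample = "STIFLED LAUGHTER\nI saw it on TV. V"
cleaned = remove_phrases(sample, {"LAUGHTER", "STIFLED LAUGHTER", "V"})
print(repr(cleaned))
```

With this approach `TV` survives while the standalone `V` is removed, and a separate `'STIFLED'` entry is no longer needed to mop up after `'LAUGHTER'`.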

# Remove [name]

In [151]:
# Identify all text in square brackets
all_square_bracket = set(re.findall(r"\[[^\[\]]+\]"
                      , clean_textfile
                      , re.DOTALL))

all_square_bracket

{'[Chigs blows raspberry]',
 '[Chigs chuckles]',
 '[Chigs giggling]',
 '[Chigs grumbles]',
 '[Chigs grunts]',
 '[Chigs sighs]',
 '[Chigs]',
 '[Crystelle and Matt laugh]',
 '[Crystelle gasps]',
 '[Crystelle laughing]',
 '[Crystelle laughs]',
 '[Crystelle sighs]',
 '[Crystelle yelling]',
 '[Crystelle]',
 '[Freya]',
 '[Giuseppe whistles]',
 '[Giuseppe]',
 '[Matt gasps]',
 '[Matt]',
 '[Noel chuckling]',
 '[Noel laughs]',
 '[Noel whispers]',
 '[Noel]',
 '[Paul chuckling]',
 '[Paul]',
 '[Prue chuckles]',
 '[Prue]',
 '[all applauding]',
 '[all cheering and applauding]',
 '[all cheering]',
 '[all chuckle]',
 '[all laugh]',
 '[all laughing]',
 '[all]',
 '[applause continues]',
 '[audience cheering]',
 '[both chuckling]',
 '[both laugh]',
 '[both laughing]',
 '[breathing heavily]',
 '[cheering continues]',
 '[cheers and applause continue]',
 '[chuckles]',
 '[chuckling]',
 '[clears throat]',
 '[crowd laughing]',
 '[crunching]',
 '[dramatically]',
 '[emotional music playing]',
 '[exhales heavily]'

In [152]:
# Manually pick [] items that should be removed
square_bracket_to_remove={'[Chigs blows raspberry]',
 '[Chigs chuckles]',
 '[Chigs giggling]',
 '[Chigs grumbles]',
 '[Chigs grunts]',
 '[Chigs sighs]',
 '[Chigs]',
 '[Crystelle and Matt laugh]',
 '[Crystelle gasps]',
 '[Crystelle laughing]',
 '[Crystelle laughs]',
 '[Crystelle sighs]',
 '[Crystelle yelling]',
 '[Crystelle]',
 '[Freya]',
 '[Giuseppe whistles]',
 '[Giuseppe]',
 '[Matt gasps]',
 '[Matt]',
 '[Noel chuckling]',
 '[Noel laughs]',
 '[Noel whispers]',
 '[Noel]',
 '[Paul chuckling]',
 '[Paul]',
 '[Prue chuckles]',
 '[Prue]',
 '[all applauding]',
 '[all cheering and applauding]',
 '[all cheering]',
 '[all chuckle]',
 '[all laugh]',
 '[all laughing]',
 '[all]',
 '[applause continues]',
 '[audience cheering]',
 '[both chuckling]',
 '[both laugh]',
 '[both laughing]',
 '[breathing heavily]',
 '[cheering continues]',
 '[cheers and applause continue]',
 '[chuckles]',
 '[chuckling]',
 '[clears throat]',
 '[crowd laughing]',
 '[crunching]',
 '[dramatically]',
 '[emotional music playing]',
 '[exhales heavily]',
 '[exhales]',
 '[gasps]',
 '[gibbering]',
 '[imitates Porky Pig]',
 '[in Cockney accent]',
 '[indistinct chatter]',
 '[inhales deeply]',
 '[inhales]',
 '[kisses]',
 '[laughing]',
 '[laughs]',
 '[makes indistinct noises]',
 '[man]',
 '[mumbles indistinctly]',
 '[muttering]',
 '[normal voice]',
 '[opening theme music playing]',
 '[oven beeping]',
 '[sighs in relief]',
 '[sighs]',
 '[sing-song]',
 '[singing "Meet the Flintstones" in German]',
 '[speaking Italian]',
 '[squeals cheerfully]',
 '[squeals]',
 '[stumbling]',
 '[thunder rumbling]',
 '[timer beeping]',
 '[voice breaking]',
 '[whispers]',
 '[whistles]',
 '[woman]'} 

# Delete bracketed items by replacing them with ''
for words in square_bracket_to_remove:
    clean_textfile = clean_textfile.replace(words, '')

# Remove any empty brackets left behind
clean_textfile = clean_textfile.replace('[]', '')

In [153]:
# Check that they were removed properly
# by searching for any remaining text in square brackets
all_square_bracket = set(re.findall(r"\[[^\[\]]+\]"
                      , clean_textfile
                      , re.DOTALL))

all_square_bracket

set()
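
Since the final check returns an empty set, every bracketed span was removed in this episode. In that situation the same result could probably be reached with a single substitution using the bracket pattern from the cells above. The sketch below assumes nothing in brackets needs to be kept, so it skips the manual selection step. `strip_brackets` is a hypothetical name.

```python
import re

BRACKET_RE = re.compile(r"\[[^\[\]]+\]")


def strip_brackets(text):
    """Remove every [...] span that contains no nested brackets."""
    return BRACKET_RE.sub("", text)


sample = '-[Matt] Oh, no!\n[singing "Meet the Flintstones" in German]\n[gasps] Oh.'
stripped = strip_brackets(sample)
print(repr(stripped))
```

The manual list remains the safer choice when some bracketed text carries meaning that should stay in the transcript.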

# Remove "advertising" lines

In [154]:
# Remove subtitle-site ad lines (plain-text match, so '.' is not treated as a regex wildcard)
ad_lines = [
    "Advertise your product or brand here",
    "contact www.OpenSubtitles.org today",
    "Support us and become VIP member",
    "to remove all ads from www.OpenSubtitles.org",
]
for ad in ad_lines:
    clean_textfile = clean_textfile.replace(ad, "")
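
The phrases removed above look like whole subtitle lines, so an alternative is to drop any line that contains one of a few markers. This is a sketch only, assuming the ads always sit on their own lines. `AD_MARKERS` and `drop_ad_lines` are hypothetical names.

```python
AD_MARKERS = (
    "Advertise your product or brand here",
    "www.OpenSubtitles.org",
    "Support us and become VIP member",
)


def drop_ad_lines(text):
    """Keep only the lines that contain none of the ad markers."""
    kept = [
        line for line in text.split("\n")
        if not any(marker in line for marker in AD_MARKERS)
    ]
    return "\n".join(kept)


sample = (
    "Good luck in there.\n"
    "Advertise your product or brand here\n"
    "contact www.OpenSubtitles.org today\n"
    "What have you done?"
)
without_ads = drop_ad_lines(sample)
print(without_ads)
```

Dropping the whole line also avoids leaving an empty line behind where each ad used to be.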




# 2) Remove Music ♪

In [155]:
# might be easier to do manually?

In [156]:
%%time
clean_textfile = re.sub("[♪@*&?].*[♪@*&?]", "", clean_textfile)

CPU times: user 566 µs, sys: 0 ns, total: 566 µs
Wall time: 582 µs
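
The pattern above is greedy and also treats `@`, `*`, `&` and `?` as delimiters, so on a line with two question marks it would delete everything from the first to the last. If only lyrics wrapped in ♪ should go, a narrower pattern may be safer. The sketch below assumes each lyric opens and closes with ♪ on the same line. `LYRIC_RE` and `strip_lyrics` are hypothetical names.

```python
import re

# ♪ only; the negated class stops each match at the next ♪ or line break.
LYRIC_RE = re.compile(r"♪[^♪\n]*♪")


def strip_lyrics(text):
    """Remove spans that start and end with ♪ on the same line."""
    return LYRIC_RE.sub("", text)


sample = "♪ Happy birthday ♪ Really? Are you sure? ♪ to you ♪"
no_lyrics = strip_lyrics(sample)
print(repr(no_lyrics))
```

The spoken questions between the two lyric spans are kept, whereas the broader pattern would remove the whole line.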


# Export cleaned file

In [157]:
# Check how text is looking now
print(clean_textfile[:1000])

It's the final, and the remaining bakers are set three challenges that will test every aspect of their baking skills, including a classic carrot cake and a Mad Hatter's tea party banquet. One baker will be crowned the Bake Off 2021 winner.
 In the beginning…

Good luck in there.

…there were 12.

-
-Hello, Prue, how you doing?



I don't want to get emotional.

Everybody's terrified
about collapsing cakes.

-
-

- Oh, no!
- Oh, my God.

- Now…
- Breathe.

Just remember to breathe.



…there are three.

- There's so much to do.
-Are you taking the mick?

 Do not drop.

-This is not going to be pretty.
-

Can't do anything now, can I?

 Oh, the oven is off.

What have you done?



 The 2021 final is like no other.

-Are you serious?
-Yeah, absolutely.

 For the first time ever,

each of our finalists
have won two Hollywood handshakes…

Oh, my God!

…and also been
awarded Star Baker two times.



 Giuseppe.

 It's never been this close before.

-Giuseppe, congratulations.
-

 From t
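
The preview shows what the removals leave behind: lines holding only a dash, stray leading spaces, and runs of blank lines. A possible tidy-up pass is sketched below. `tidy_whitespace` is a hypothetical helper, and it assumes a line consisting of a lone `-` is always noise, which should be checked against the transcript.

```python
import re


def tidy_whitespace(text):
    """Drop dash-only lines, trim each line, collapse blank-line runs."""
    lines = [line.strip() for line in text.split("\n")]
    lines = [line for line in lines if line != "-"]
    joined = "\n".join(lines)
    # Three or more newlines in a row become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", joined).strip()


sample = (
    " In the beginning\n\n-\n-Hello, Prue, how you doing?\n\n\n\n"
    "I don't want to get emotional.\n"
)
tidied = tidy_whitespace(sample)
print(tidied)
```

Dialogue lines that start with a dash followed by text are left alone, since only lines that are exactly `-` after trimming are dropped.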

In [158]:
%%time

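# Export the cleaned text for review ('w' overwrites, so rerunning the cell does not append a duplicate copy)
with open('S12E10_python_cleaned_v1.txt', 'w', encoding='utf-8') as f:
    f.write(clean_textfile)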

CPU times: user 1.77 ms, sys: 0 ns, total: 1.77 ms
Wall time: 3.69 ms


# Ignore from here

- The following saves the .ipynb as HTML so others can view it (not needed for the workflow).
- To do so, export the .ipynb to your desktop, then re-upload it to your directory in Colab (left panel), below sample_data.
- Copy the file path of the uploaded file and update the code below.
- Then run the code below; it should save a copy of the notebook as HTML.

In [164]:
%%shell
jupyter nbconvert --to html /content/20230228_test_clean_corpus.ipynb

[NbConvertApp] Converting notebook /content/20230228_test_clean_corpus.ipynb to html
[NbConvertApp] Writing 624794 bytes to /content/20230228_test_clean_corpus.html


