# About
- This notebook performs basic cleaning on the transcript dataset.
- It mostly looks for patterns and removes them. The patterns include text in all caps and text in square brackets ("[]").
- These are often used to indicate ambient noise or to identify speakers.
- This workflow should remove around 90-95% of the noise in the dataset.
- However, a manual review is still needed after this cleanup.
- (e.g. special characters and one-off noises are not captured by this process)

In [None]:
import pandas as pd
import re
import string
from bs4 import BeautifulSoup
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import spacy




In [None]:
nlp = spacy.load('en_core_web_sm')

# 0) Read Data

In [None]:
with open('S2 E10.txt', 'r') as text:
    textfile = text.read()
    print(textfile[:1000])

Welcome to The Great British Bake Off Masterclass.

Judges Mary Berry and I will be doing the baking.

We'll guide you through challenges faced by the bakers in this year's Bake Off.

We will show you some little tips and tricks that will help you at home to create something magical.

From the mixing, to the baking, to the finishing,

to the presentation, at home you will get the same results.

Coming up, my luxury pork pies

encased in notoriously-difficult-to-handle hot water crust pastry,

filled with the perfect combination of pork loin and a quail's egg.

Mary Berry's chocolate roulade recipe.

Mary will show you how to get the perfect roll every time.



My traditional iced fingers - a complex combination

of sweet yet buttery dough, precisely piped with whipped cream and strawberry jam.

And Mary's Sachertorte - a technically tricky, dense chocolate cake

with its signature glossy ganache icing.

'Over the course of eight weeks

'earlier this year, Mary and I saw twelve of the c

In [None]:
# Make a copy of raw text file
clean_textfile=textfile

# 1) Removing Caps

In [None]:
# identify all text in all caps
all_caps = set(re.findall(r"(\b(?:[A-Z]+[A-Z]*|[A-Z]*[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)"
                      , clean_textfile
                      , re.DOTALL))

all_caps

{'A', 'ALL', 'BOTH', 'I', 'OK', 'SHE GASPS', 'WITH'}

In [None]:
# Manually pick the all-caps items to remove; comment out the items to keep.
# Keep items that look like actual words that belong in caps (e.g. A, I, DIY, HR, TV).
cap_to_remove = {
    # 'A',
    'ALL',
    'BOTH',
    # 'I',
    # 'OK',
    'SHE GASPS',
    # 'WITH'
}

# Delete the selected items by replacing them with ''
for words in cap_to_remove:
    clean_textfile = clean_textfile.replace(words, '')
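One caveat with plain `str.replace` is that it also removes the item when it occurs inside a longer all-caps word. The sketch below is one possible alternative, not the method used in this notebook: it removes an item only where it stands as a whole word or phrase. The helper name `remove_caps_phrases` and the sample text are made up for illustration.

```python
import re

def remove_caps_phrases(text, phrases):
    """Remove each phrase only where it appears as a whole word or phrase."""
    # Longest phrases first, so multi-word items go before shorter ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, "", text)
    return text

# Hypothetical sample text, not taken from the transcript.
sample = "SHE GASPS Oh no! ALL Yes. The BALLS rolled."
cleaned = remove_caps_phrases(sample, {"ALL", "BOTH", "SHE GASPS"})
print(cleaned)

# For comparison: plain replace also cuts 'ALL' out of 'BALLS'.
naive = sample.replace("ALL", "")
print(naive)
```

Both versions leave extra spaces behind where an item was removed, so the manual review step is still useful.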

In [None]:
# Check that the caps were removed properly; if anything was missed, add the word in the step above and rerun the code

validate_caps = set(re.findall(r"(\b(?:[A-Z]+[A-Z]*|[A-Z]*[A-Z]+)\b(?:\s+(?:[A-Z]+[a-z]?[A-Z]*|[A-Z]*[a-z]?[A-Z]+)\b)*)"
                      , clean_textfile
                      , re.DOTALL))

validate_caps

{'A', 'I', 'OK', 'S', 'THAT', 'THEM', 'WE', 'WILL'}

# Remove [name]

In [None]:
# identify all text in square brackets
all_square_bracket = set(re.findall(r"\[[^\[\]]+\]"
                      , clean_textfile
                      , re.DOTALL))

all_square_bracket

set()

In [None]:
# Manually pick [] items that should be removed
# (empty here because the search above found none; use set(), since {} would create a dict)
square_bracket_to_remove = set()

# Delete the bracketed items by replacing them with ''
for words in square_bracket_to_remove:
    clean_textfile = clean_textfile.replace(words, '')

clean_textfile = clean_textfile.replace('[]', '')

In [None]:
# Check that they have been removed properly
# identify any remaining text in square brackets
all_square_bracket = set(re.findall(r"\[[^\[\]]+\]"
                      , clean_textfile
                      , re.DOTALL))

all_square_bracket

set()
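If every bracketed item turns out to be noise, the manual pick list could be skipped. The following is a minimal sketch under that assumption, reusing the same pattern as the cells above. The helper name `strip_bracketed` and the sample line are hypothetical.

```python
import re

# Same pattern as the search cells above: a '[', then non-bracket characters, then ']'.
BRACKETED = re.compile(r"\[[^\[\]]+\]")

def strip_bracketed(text):
    """Return the text without bracketed spans, plus the set of spans removed."""
    found = set(BRACKETED.findall(text))
    return BRACKETED.sub("", text), found

# Hypothetical sample line, not taken from the transcript.
sample = "[Paul] That looks good. [laughter] Thank you."
stripped, found = strip_bracketed(sample)
print(found)
print(stripped)
```

Returning the removed spans keeps the review step: the set can be inspected before the cleaned text is accepted.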

# Remove "advertising" lines

In [None]:
# These are literal phrases, so plain string replacement is enough
# (in a regex, the unescaped '.' would match any character).
clean_textfile = clean_textfile.replace("Advertise your product or brand here", "")
clean_textfile = clean_textfile.replace("contact www.OpenSubtitles.org today", "")
clean_textfile = clean_textfile.replace("Support us and become VIP member", "")
clean_textfile = clean_textfile.replace("to remove all ads from www.OpenSubtitles.org", "")




# Remove and/or replace weird characters

In [None]:
# identify all special characters

print(re.findall("â¦", clean_textfile, flags=0),
      re.findall("âª", clean_textfile, flags=0),
      re.findall("Ã¨", clean_textfile, flags=0),
      re.findall("Ã¼", clean_textfile, flags=0))


[] [] [] []


In [None]:
clean_textfile = re.sub("â¦", "", clean_textfile)
clean_textfile = re.sub("âª", "", clean_textfile)
clean_textfile = re.sub("Ã¨", "e", clean_textfile)
clean_textfile = re.sub("Ã¼", "u", clean_textfile)
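The character pairs above look like UTF-8 text that was decoded with a single-byte encoding. If that is the cause (an assumption, not something this notebook confirms), the text can sometimes be repaired in one step instead of pattern by pattern. The sketch below assumes Latin-1 was the wrong encoding; `fix_mojibake` is a hypothetical helper and the sample string is made up.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 text that was wrongly decoded as Latin-1.

    Returns the input unchanged if the round trip is not possible,
    e.g. when the text was never garbled or is only partly garbled.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Build a garbled sample on purpose, then repair it.
original = "crème brûlée ♪"
garbled = original.encode("utf-8").decode("latin-1")
repaired = fix_mojibake(garbled)
print(repaired)
```

Unlike the substitutions above, this keeps accented letters rather than replacing them with plain ones. If the file itself is valid UTF-8, passing `encoding='utf-8'` to `open()` when reading may avoid the problem altogether.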

# 2) Remove Music ♪

In [None]:
# might be easier to do manually?

In [None]:
%%time
clean_textfile = re.sub("[♪@*&?].*[♪@*&?]", "", clean_textfile)

CPU times: user 427 µs, sys: 0 ns, total: 427 µs
Wall time: 431 µs
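Note that the pattern above is greedy and also treats `@`, `*`, `&` and `?` as markers, so a line containing two question marks loses everything between them. A narrower variant, shown here only as a sketch, removes text between a pair of ♪ symbols on the same line and nothing else. The sample text is made up.

```python
import re

# Pattern from the cell above, kept for comparison.
ORIGINAL = re.compile("[♪@*&?].*[♪@*&?]")

# Narrower sketch: a ♪, then anything except ♪ or a newline, then a closing ♪.
MUSIC = re.compile(r"♪[^♪\n]*♪")

# Hypothetical sample text, not taken from the transcript.
sample = "♪ La la la ♪ Is it ready? Not yet? ♪ Da dum ♪\nWhat? Really?"

narrow_result = MUSIC.sub("", sample)
original_result = ORIGINAL.sub("", sample)

print(repr(narrow_result))
print(repr(original_result))
```

Neither pattern removes a lyric line that has only one ♪, so those still need a manual pass.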


# Export cleaned file

In [None]:
# Check how text is looking now
print(clean_textfile[:1000])

Welcome to The Great British Bake Off Masterclass.

Judges Mary Berry and I will be doing the baking.

We'll guide you through challenges faced by the bakers in this year's Bake Off.

We will show you some little tips and tricks that will help you at home to create something magical.

From the mixing, to the baking, to the finishing,

to the presentation, at home you will get the same results.

Coming up, my luxury pork pies

encased in notoriously-difficult-to-handle hot water crust pastry,

filled with the perfect combination of pork loin and a quail's egg.

Mary Berry's chocolate roulade recipe.

Mary will show you how to get the perfect roll every time.



My traditional iced fingers - a complex combination

of sweet yet buttery dough, precisely piped with whipped cream and strawberry jam.

And Mary's Sachertorte - a technically tricky, dense chocolate cake

with its signature glossy ganache icing.

'Over the course of eight weeks

'earlier this year, Mary and I saw twelve of the c

In [None]:
%%time

# export file to check
# 'w' overwrites any existing file; 'a' would append a second copy on every rerun
with open('S2E10_python_cleaned_v1.txt', 'w') as f:
    f.write(clean_textfile)

CPU times: user 447 µs, sys: 943 µs, total: 1.39 ms
Wall time: 1.4 ms


# Ignore from here

- The following saves the .ipynb as HTML so others can view it (not needed for the workflow).
- To do so, export the .ipynb to your desktop, then re-upload it to your directory in Colab (left panel), below sample_data.
- Copy the file path of the uploaded file and update the code below.
- Then run the code below; it should save a copy of the notebook as HTML.

In [None]:
%%shell
jupyter nbconvert --to html /content/20230228_test_clean_corpus.ipynb

[NbConvertApp] Converting notebook /content/20230228_test_clean_corpus.ipynb to html
[NbConvertApp] Writing 613027 bytes to /content/20230228_test_clean_corpus.html


