## Run profanity filter on movie scripts

We use this script to find every occurrence of profanity in the movie scripts and count how often each word appears. For this, we use an external library called "profanity-filter", which adds a component to the spaCy pipeline after tokenization. Finally, we write a .txt file containing each movie's IMDb ID and the profane words found in its script, together with their counts.

In [1]:
import spacy
from profanity_filter import ProfanityFilter
from tqdm import tqdm
import glob
import time
from multiprocessing import Process

In [6]:
# Collect every script file; recursive=True has no effect without a '**' pattern, so it is omitted
files = glob.glob('../../data/script/*')

# Disable unnecessary components in the spaCy pipeline
nlp = spacy.load('en_core_web_sm', disable=["ner", "tagger", "parser", "textcat", "lemmatizer"])
profanity_filter = ProfanityFilter(nlps={'en': nlp})
# add the profanity filter to the spacy pipeline
nlp.add_pipe(profanity_filter.spacy_component, last=True)

def get_bad_words(text):
    words_and_counts = {}
    doc = nlp(text)
    for token in doc:
        if token._.is_profane:
            words_and_counts[token._.original_profane_word] = words_and_counts.get(token._.original_profane_word, 0) + 1
    return words_and_counts
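
The counting pattern used in `get_bad_words` can be tried without loading spaCy. The following is a minimal sketch, assuming a hypothetical `is_profane` predicate backed by a small placeholder word set in place of `token._.is_profane`, and plain whitespace splitting in place of spaCy tokenization. It is not the detection logic of profanity-filter, and unlike the function above it lowercases the keys.

```python
from collections import Counter

# Hypothetical placeholder list; the notebook relies on profanity-filter instead.
PLACEHOLDER_WORDS = {"darn", "heck"}


def is_profane(word):
    """Stand-in for token._.is_profane (assumption, not the library's logic)."""
    return word.lower() in PLACEHOLDER_WORDS


def count_flagged_words(text):
    """Return a dict mapping each flagged word (lowercased) to its count."""
    tokens = text.split()  # whitespace split stands in for spaCy tokenization
    return dict(Counter(tok.lower() for tok in tokens if is_profane(tok)))


sample_counts = count_flagged_words("Heck no , darn it , heck")
```

`Counter` replaces the manual `dict.get(..., 0) + 1` update; the resulting mapping has the same shape as the one returned by `get_bad_words`.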

In [8]:
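import os

# Location of the output file
output_path = './output/imdb_id_with_profanity_list.txt'

# The context manager closes the output file even if a script fails to process
with open(output_path, 'w') as result_file:
    for idx, filepath in tqdm(enumerate(files), total=len(files)):
        # Skip the script at index 113
        if idx == 113:
            print('skipping')
            continue

        with open(filepath, 'r') as f:
            script = f.read()

        # File name without directory or extension; works for Windows and POSIX paths
        imdb_id = os.path.basename(filepath).split('.')[0]
        bad_words = get_bad_words(script)

        # Movies without any detected profanity are left out of the output
        if not bad_words:
            continue

        # One line per movie: "<imdb_id>,<word>: <count>, <word>: <count>, ..."
        counts = ', '.join(f'{word}: {count}' for word, count in bad_words.items())
        result_file.write(f'{imdb_id},{counts}\n')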

 24%|██▎       | 113/479 [09:03<28:30,  4.67s/it]

skipping


100%|██████████| 479/479 [39:54<00:00,  5.00s/it]
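
Each line of the output file has the form `<imdb_id>,<word>: <count>, <word>: <count>, ...`. A minimal sketch of reading such a line back into a dictionary follows. The function name `parse_profanity_line` and the sample ID are made up for illustration, and the sketch assumes the ID contains no comma and that no word contains the sequence `, `.

```python
def parse_profanity_line(line):
    """Split one output line into (imdb_id, {word: count})."""
    # Everything before the first comma is the ID; the rest is the word list
    imdb_id, _, rest = line.rstrip("\n").partition(",")
    counts = {}
    for item in rest.split(", "):
        # Split on the last ': ' so a word containing ':' stays intact
        word, _, count = item.rpartition(": ")
        counts[word] = int(count)
    return imdb_id, counts


# Made-up sample line in the format written by the loop above
example = parse_profanity_line("tt0000001,darn: 3, heck: 1\n")
```

Applying this to every line of `imdb_id_with_profanity_list.txt` gives one `(imdb_id, counts)` pair per movie that had at least one detected word.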
