## Pre-processing the dataset
The dataset for this project comes from [Friends TV Show Script](https://www.kaggle.com/datasets/divyansh22/friends-tv-show-script?rvi=1). It is a single text file containing the scripts of all the episodes of the FRIENDS TV show. Because it is a raw script rather than plain dialogue, I need to do some preprocessing on the original dataset first.

In [7]:
import re

input_file = 'dataset/Friends_Transcript.txt'
output_file = 'dataset/Friends_lines.txt'

# Read the original text file line by line. Code reference source: https://www.freecodecamp.org/news/python-open-file-how-to-read-a-text-file-line-by-line/
with open(input_file, 'r', encoding='utf-8') as inputfile:
    lines = inputfile.readlines()

# As it is a script, there are a lot of voice-over notes and scene descriptions, so the sections in '()' and '[]' need to be removed.
# The substitution runs on one line at a time, so a bracketed section that spans several lines is not matched.
pattern = re.compile(r'\[[^]]*\]|\([^)]*\)')
cleaned_lines = [pattern.sub('', line) for line in lines]

# Each episode also has a title and a writers' credit, and these need to be removed as well.
for i in range(len(cleaned_lines)):
    # Find the writers' credit line and remove it.
    if cleaned_lines[i].startswith('Written by:'):
        cleaned_lines[i] = ''
    # Find the episode title (an all-uppercase line) and remove it.
    elif cleaned_lines[i].isupper():
        cleaned_lines[i] = ''
    # Remove some specific scene markers. startswith() accepts a tuple of prefixes.
    elif cleaned_lines[i].startswith(('Closing Credits', 'Commercial Break', 'End')):
        cleaned_lines[i] = ''

# Each line of dialogue begins with the character's name, which needs to be removed as well.
for i in range(len(cleaned_lines)):
    # After running it I noticed that some lines in the dataset have two spaces after the colon, so that case needs to be handled first.
    if ':  ' in cleaned_lines[i]:
        cleaned_lines[i] = cleaned_lines[i].split(':  ', 1)[-1]
    elif ': ' in cleaned_lines[i]:
        cleaned_lines[i] = cleaned_lines[i].split(': ', 1)[-1]

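# I found that some lines are left with only ' ', so I need to remove them.
# Note that this check blanks every line that starts with a space, not only lines that consist of a single space.
for i in range(len(cleaned_lines)):
    if cleaned_lines[i].startswith(' '):
        cleaned_lines[i] = ''
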
# Finally, remove the blank lines.
final = []
for line in cleaned_lines:
    # At first I used "!=" to find blank lines, but lines that contain only whitespace characters (such as spaces or '\n') were then treated as non-blank lines as well.
    # So I finally chose to use strip(), code reference source: https://blog.csdn.net/qq_36756866/article/details/123073264
    # strip() returns an empty string for a whitespace-only line, so only lines with real content are kept.
    if line.strip():
        final.append(line)

# Output Line-by-Line Text
with open(output_file, 'w', encoding='utf-8') as outfile:
    for line in final:
        outfile.write(line)
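
The cleaning steps in the cell above can be tried on a few lines in memory before running them on the whole file. The following is only a minimal sketch: `clean_line` is a hypothetical helper that does not appear in the notebook, the sample lines are invented rather than taken from the dataset, and it uses `strip()` instead of the separate double-space branch, so its output is not guaranteed to match the cell above line for line.

```python
import re

# Same pattern as in the cell above: any text inside [...] or (...).
BRACKETED = re.compile(r'\[[^]]*\]|\([^)]*\)')

def clean_line(line):
    """Hypothetical helper: return the spoken text of one script line, or '' if nothing is left."""
    line = BRACKETED.sub('', line)
    # Credits, all-uppercase titles and scene markers are dropped entirely.
    if line.startswith(('Written by:', 'Closing Credits', 'Commercial Break', 'End')) or line.isupper():
        return ''
    # Drop the speaker name; partition() splits at the first ': ' only.
    if ': ' in line:
        line = line.partition(': ')[2]
    # strip() removes the trailing newline and any leftover leading spaces.
    return line.strip()

# Invented sample lines, not taken from the dataset.
sample = [
    'THE ONE WITH THE SAMPLE TITLE\n',
    'Written by: Someone\n',
    '[Scene: A coffee shop.]\n',
    'Alice: (smiling) Hi there!\n',
    'Bob:  Hello: nice to see you.\n',
    'Closing Credits\n',
]
cleaned = [text for text in map(clean_line, sample) if text]
print(cleaned)
```

Because the helper returns lines without a trailing newline, writing `cleaned` to a file would need something like `'\n'.join(cleaned)`.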

After creating the line-by-line text, I create the line-by-pair text, in which each line is paired with the line that follows it.

In [8]:
input_file = 'dataset/Friends_lines.txt'
output_file = 'dataset/Friends_pairs.txt'

with open(input_file, 'r', encoding='utf-8') as inputfile:
    lines = inputfile.readlines()

# Create pairs of sentences line by line
pairs = []
for i in range(len(lines) - 1):
    # readlines() keeps the trailing '\n' of every line, so without strip() each pair would contain extra line breaks and spaces. strip() removes them (suggested by ChatGPT).
    line1 = lines[i].strip()
    line2 = lines[i + 1].strip()
    pairs.append((line1, line2))

# Output Line-by-Pair Text
with open(output_file, 'w', encoding='utf-8') as outfile:
    for pair in pairs:
        outfile.write(pair[0] + '\t' + pair[1] + '\n')
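
The pairing step can also be written with `zip`, which makes it easier to see that n lines produce n - 1 pairs. This is a sketch under assumptions: `make_pairs` is a hypothetical helper that is not part of the notebook, and the sample lines are invented rather than taken from the dataset.

```python
def make_pairs(lines):
    """Hypothetical helper: pair each line with the line that follows it."""
    stripped = [line.strip() for line in lines]
    # zip() stops at the shorter argument, so the last line only appears as the second item of a pair.
    return list(zip(stripped, stripped[1:]))

# Invented sample lines, not taken from the dataset.
sample = ['Hi there!\n', 'Hello.\n', 'How are you?\n']
pairs = make_pairs(sample)
for first, second in pairs:
    # Same tab-separated layout as the cell above.
    print(first + '\t' + second)
```

As in the cell above, consecutive lines are paired even when they come from different scenes or episodes, because those markers were removed during cleaning.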