# **05-TV-Show-trained-chatbot-creation**

# **Main Goals of Cleaning**

- **Remove noise**: timestamps, speaker labels (if any), scene directions.

- **Structure the data**: into question-response (or speaker1-speaker2) pairs if building a chatbot.

- **Preprocess text**: lowercase, punctuation cleanup, lemmatization, etc.


## Step-by-Step Process

The steps below go from raw subtitle text to a cleaned conversational dataset and, finally, a chatbot model.

# **Libraries**

In [21]:
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

In [22]:
# One-time downloads (run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rurig\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rurig\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rurig\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\rurig\AppData\Roaming\nltk_data...


True

# Step 1: Read and Inspect the SRT File

An SRT file typically looks like this:

```text
1
00:00:01,000 --> 00:00:03,000
Hello, how are you?

2
00:00:03,500 --> 00:00:05,000
I'm fine, thanks.
```

Each SRT (SubRip Subtitle) entry consists of a sequence number, a timestamp line, and one or more lines of text. We only need the dialogue lines, so the sequence numbers and timestamps must be removed.
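As an illustration of that structure, here is a minimal sketch that splits SRT text into cues and keeps only the dialogue. The helper name `parse_srt_blocks` and the sample string are assumptions made for this example (the second cue is wrapped over two lines on purpose), not part of the notebook's own pipeline.

```python
import re

# Sample SRT text modelled on the example above (illustrative data only).
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,000
Hello, how are you?

2
00:00:03,500 --> 00:00:05,000
I'm fine,
thanks.
"""

def parse_srt_blocks(srt_text):
    """Return one dialogue string per subtitle cue (hypothetical helper)."""
    dialogue = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Expected layout: sequence number, timestamp line, then text lines.
        if len(lines) >= 3 and lines[0].isdigit() and "-->" in lines[1]:
            # Join wrapped text lines so each cue becomes a single utterance.
            dialogue.append(" ".join(lines[2:]))
    return dialogue

print(parse_srt_blocks(SAMPLE_SRT))
```

Working per cue rather than per line keeps text that wraps over two subtitle lines together, which may matter later when consecutive lines are paired.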

# Step 2: Clean the SRT File

We'll clean the file by:

- Removing subtitle sequence numbers

- Removing timestamps

- Keeping only actual spoken lines

- Removing empty lines or noise

In [3]:
# 2.1 Cleaning the SRT

# Read the SRT file
with open("data/suits-1x01-pilot.en.srt", "r", encoding="utf-8") as f:
    lines = f.readlines()

cleaned_lines = []
for line in lines:
    # Skip sequence numbers
    if re.match(r"^\d+\s*$", line):
        continue
    # Skip timestamps
    if re.match(r"^\d{2}:\d{2}:\d{2},\d{3}", line):
        continue
    # Skip empty lines
    if line.strip() == "":
        continue
    # Keep actual spoken line
    cleaned_lines.append(line.strip())

# Join into clean dialogue
cleaned_text = "\n".join(cleaned_lines)

# Save the cleaned result to a new text file
with open("data/05-cleaned-suits-pilot.txt", "w", encoding="utf-8") as f:
    f.write(cleaned_text)


# Step 3: Structure the data into Pairs for Chatbot

Converting the dialogue into prompt-response pairs:

In [4]:
# Create prompt-response pairs
pairs = []

for i in range(len(cleaned_lines) - 1):
    input_text = cleaned_lines[i]
    target_text = cleaned_lines[i + 1]
    pairs.append((input_text, target_text))

# Print the first 10 entries of the `pairs` list
pairs[:10]

[('\ufeff1', '[Muffled chatter]'),
 ('[Muffled chatter]', '[Knocking]'),
 ('[Knocking]', "Gerald Tate's here."),
 ("Gerald Tate's here.", 'He wants to know'),
 ('He wants to know', "what's happening to his deal."),
 ("what's happening to his deal.", 'Go get Harvey.'),
 ('Go get Harvey.',
  '== sync, corrected by <font color="#00ff00">elderman</font> =='),
 ('== sync, corrected by <font color="#00ff00">elderman</font> ==',
  'I check.'),
 ('I check.', 'Raise.'),
 ('Raise.', '5,000.')]
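The output above still contains some noise: the first sequence number survives because a byte-order mark (`\ufeff`) precedes it, and sound cues in brackets and the subtitle-credit line are treated as dialogue. One possible way to filter these before pairing is sketched below. The helper `is_noise` and its rules are assumptions for illustration and would need tuning for other files. Opening the file with `encoding="utf-8-sig"` is another way to drop the byte-order mark.

```python
import re

# A few raw lines copied from the output above.
raw_lines = [
    "\ufeff1",
    "[Muffled chatter]",
    "[Knocking]",
    "Gerald Tate's here.",
    "He wants to know",
    "what's happening to his deal.",
    "Go get Harvey.",
    '== sync, corrected by <font color="#00ff00">elderman</font> ==',
    "I check.",
]

def is_noise(line):
    """Heuristic noise test (hypothetical helper, not from the notebook)."""
    line = line.lstrip("\ufeff").strip()
    if not line or line.isdigit():          # empty, or a leftover sequence number
        return True
    if re.fullmatch(r"\[[^\]]*\]", line):   # sound cue such as [Knocking]
        return True
    if line.startswith("=="):               # subtitle-credit banner
        return True
    return False

dialogue = [ln for ln in raw_lines if not is_noise(ln)]

# Pair each remaining line with the line that follows it.
clean_pairs = list(zip(dialogue, dialogue[1:]))
print(clean_pairs[0])
```

Note that filtering changes which lines end up adjacent: here "Go get Harvey." is now followed directly by "I check.", even though the two may belong to different scenes.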

### Optional - Save these pairs into a CSV for training:

In [5]:
df = pd.DataFrame(pairs, columns=["input", "response"])
df.to_csv("data/05-chatbot-pairs-data.csv", index=False)

# Step 4: Preprocess Text for Training

You may want to:

- Lowercase everything

- Remove punctuation

- Tokenize (optional)

- Lemmatize or stem (optional for traditional models)

In [10]:
# Basic function: lowercase the text and strip all ASCII punctuation
def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

df['input'] = df['input'].apply(preprocess)
df['response'] = df['response'].apply(preprocess)
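To see what the basic function does to individual strings, the short check below runs the same logic on two lines taken from the pairs output. The name `basic_preprocess` is only used here to keep the demo self-contained.

```python
import string

def basic_preprocess(text):
    # Same logic as the basic preprocess() above, renamed for this demo.
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

print(basic_preprocess("Gerald Tate's here."))
print(basic_preprocess('<font color="#00ff00">elderman</font>'))
```

The second call shows a side effect: the angle brackets are removed but the tag names stay behind, so markup is better stripped before punctuation is removed.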

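Here is a more complete `preprocess()` function that also removes HTML-like tags and entities, strips punctuation while keeping apostrophes, and normalizes whitespace: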

In [11]:
def preprocess(text):
    # Lowercase everything
    text = text.lower()
    
    # Remove HTML-like tags and entities
    text = re.sub(r'<.*?>', '', text)           # Remove HTML tags like <i>, </b>
    text = re.sub(r'&[a-z]+;', ' ', text)       # Replace HTML entities like &nbsp;
    
    # Collapse punctuation, then drop every character that is not a letter, digit, whitespace, or apostrophe
    text = re.sub(r'[.?!,;:]+', '.', text)      # Collapse runs of punctuation into a single period
    text = re.sub(r'\.{2,}', '.', text)         # Collapse any remaining repeated dots
    text = re.sub(r"[^a-z0-9\s']", '', text)    # Remove everything else, including the periods added above

    # Replace multiple spaces/newlines/tabs with a single space
    text = re.sub(r'\s+', ' ', text)
    
    # Trim leading/trailing whitespace
    text = text.strip()
    
    return text
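As a quick sanity check, the sketch below applies a condensed copy of this function to raw, unprocessed strings. The name `preprocess_full` and the third sample string are assumptions for this demo.

```python
import re

def preprocess_full(text):
    # Condensed copy of the preprocess() above, renamed for this demo.
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)            # strip HTML-like tags
    text = re.sub(r'&[a-z]+;', ' ', text)        # replace entities with a space
    text = re.sub(r'[.?!,;:]+', '.', text)       # collapse punctuation runs
    text = re.sub(r'\.{2,}', '.', text)
    text = re.sub(r"[^a-z0-9\s']", '', text)     # keep letters, digits, spaces, apostrophes
    return re.sub(r'\s+', ' ', text).strip()

samples = [
    "Gerald Tate's here.",
    '== sync, corrected by <font color="#00ff00">elderman</font> ==',
    "Hello&nbsp;there!!!",   # made-up string to exercise the entity rule
]
for s in samples:
    print(repr(preprocess_full(s)))
```

On raw text the tags are removed cleanly and apostrophes are preserved, which differs from what happens when the input has already passed through the basic function.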

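Applying this function to the DataFrame. Note that `df` was already modified by the basic function above, so punctuation (including `<`, `>`, and apostrophes) is already gone at this point and the tag-removal step has nothing left to match. To see the full effect, rebuild `df` from `pairs` before running this cell.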

In [17]:
df['input'] = df['input'].apply(preprocess)
df['response'] = df['response'].apply(preprocess)

# Check the first 5 rows
df[['input', 'response']].head()

| | input | response |
|---|---|---|
| 0 | 1 | muffled chatter |
| 1 | muffled chatter | knocking |
| 2 | knocking | gerald tates here |
| 3 | gerald tates here | he wants to know |
| 4 | he wants to know | whats happening to his deal |

Let's enhance the preprocessing function to include:

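1. **Lemmatization** – Reduce words to their base form (e.g., "running" → "run" when lemmatized as a verb). Called without a part-of-speech argument, as in the code below, `lemmatize()` treats every token as a noun, so "wants" becomes "want" but "happening" is left unchanged.

2. **Stopword removal** – Remove common non-informative words (like "the", "is", "and").

Updated `preprocess()` with NLTK: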

In [18]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase
    text = text.lower()

    # Remove HTML tags and entities
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'&[a-z]+;', ' ', text)

    # Replace punctuation (except apostrophes) with a space
    text = re.sub(r'[^a-z0-9\s\']', ' ', text)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and lemmatize
    cleaned = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return ' '.join(cleaned)


Applying this new function to the DataFrame:

In [20]:
df['input'] = df['input'].apply(preprocess)
df['response'] = df['response'].apply(preprocess)

# Check the first 10 rows
df[['input', 'response']].head(10)

| | input | response |
|---|---|---|
| 0 | 1 | muffled chatter |
| 1 | muffled chatter | knocking |
| 2 | knocking | gerald tate |
| 3 | gerald tate | want know |
| 4 | want know | whats happening deal |
| 5 | whats happening deal | go get harvey |
| 6 | go get harvey | sync corrected font color00ff00eldermanfont |
| 7 | sync corrected font color00ff00eldermanfont | check |
| 8 | check | raise |
| 9 | raise | 5000 |
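Stopword removal shortens many lines (row 2 above shrank from "gerald tates here" to "gerald tate"), and very short utterances can end up empty. The sketch below illustrates the idea in plain Python with a tiny hand-picked stopword set. `DEMO_STOPWORDS`, `drop_stopwords`, and the first demo pair are assumptions for this example, since the notebook itself uses NLTK's English list.

```python
# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's list).
DEMO_STOPWORDS = {"i", "do", "you", "is", "it", "the", "to", "he", "his"}

def drop_stopwords(text):
    """Remove stand-in stopwords from an already lowercased string."""
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

demo_pairs = [
    ("do you", "i do"),                                   # hypothetical pair
    ("go get harvey", "i check"),
    ("he wants to know", "whats happening to his deal"),
]

processed = [(drop_stopwords(a), drop_stopwords(b)) for a, b in demo_pairs]

# Keep only pairs where both sides still contain text.
kept = [(a, b) for a, b in processed if a and b]
print(kept)
```

With the DataFrame, a comparable filter would be `df[(df['input'] != '') & (df['response'] != '')]`.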
