# **03-TV-Show-trained-chatbot-creation**

#### **Q: How do you clean a text file for NLP analysis in order to create a conversational chatbot?**

Cleaning a text file for NLP analysis, especially for a conversational chatbot, involves several steps. The goal is to make the data consistent and structured before it reaches a machine learning model. A typical cleaning pipeline looks like this:

# Step-by-Step Text Cleaning Process

In [43]:
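import re
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# The tokenizer models and the stopword list must be downloaded once before use
nltk.download('punkt')
nltk.download('stopwords')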

# 1. Read the File

In [44]:
file_path = r"data\suits-1x01-pilot.en.txt"

with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

# 2. Remove Unwanted Characters
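Remove non-conversational content such as HTML tags, bracketed annotations, and special characters. Digits and basic punctuation are kept. The last pattern also strips apostrophes, so "don't" becomes "dont"; add `'` to the character class if contractions should be preserved. Subtitle timestamps, if present, are not removed by these patterns and need a separate step.

In [45]:
# Remove HTML tags, bracketed text, and special characters (digits are kept)
text = re.sub(r"<.*?>", "", text)               # Remove HTML tags
text = re.sub(r"\[.*?\]", "", text)             # Remove brackets and their content
text = re.sub(r"[^a-zA-Z0-9\s.,!?]", "", text)  # Keep only letters, digits, whitespace, and . , ! ?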
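The step description mentions timestamps, but the patterns above only target tags, brackets, and symbols. If the transcript follows the SubRip (`.srt`) layout, timestamps and cue numbers could be stripped roughly as follows. This is a minimal sketch: the sample text is made up, and `strip_srt_markup` is a hypothetical helper, since the actual layout of the source file is not shown here.

```python
import re

# Made-up sample in SubRip style; the real file's layout may differ (assumption)
sample = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:03,600 --> 00:00:05,000
How are you?
"""

# Matches lines such as "00:00:01,000 --> 00:00:03,500"
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}.*$",
    re.MULTILINE,
)
# Matches cue-number lines that contain only digits
CUE_NUMBER = re.compile(r"^\d+[ \t]*$", re.MULTILINE)


def strip_srt_markup(raw):
    """Return the dialogue lines with timestamps and cue numbers removed."""
    without_times = TIMESTAMP.sub("", raw)
    without_cues = CUE_NUMBER.sub("", without_times)
    # Keep only non-empty lines
    return [line.strip() for line in without_cues.splitlines() if line.strip()]


dialogue_lines = strip_srt_markup(sample)
print(dialogue_lines)
```

Running this before the special-character pattern matters, because that pattern would otherwise delete the `:` and `>` characters and leave the digits behind.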

# 3. Lowercasing
Lowercasing reduces vocabulary size, because "Hello" and "hello" are then treated as the same token.

In [46]:
text = text.lower()

# 4. Tokenization
Split the text into sentences or words.

In [47]:
sentences = sent_tokenize(text)  # Useful for chatbot training

# 5. Remove Stopwords (optional)
Stopwords like “is”, “the”, etc., are often removed in NLP tasks, but for chatbots, they may be needed for natural-sounding replies.

In [48]:
stop_words = set(stopwords.words("english"))
filtered_sentences = []

for sentence in sentences:
    words = word_tokenize(sentence)
    filtered = [w for w in words if w not in stop_words]
    filtered_sentences.append(" ".join(filtered))
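To see why this step is optional for chatbots, it helps to look at what stopword removal does to a single line. The sketch below uses a small hand-picked stopword set as a stand-in for NLTK's English list (an assumption made so the example runs without downloads), and `remove_stop_words` is a hypothetical helper.

```python
# Small hand-picked set standing in for NLTK's English stopword list (assumption)
demo_stop_words = {"i", "am", "not", "you", "are", "the", "is", "a", "to"}


def remove_stop_words(sentence, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(w for w in sentence.split() if w not in stop_words)


original = "i am not going to the office"
filtered = remove_stop_words(original, demo_stop_words)
print(filtered)  # going office
```

The negation disappears along with the filler words, so the filtered line no longer says what the speaker meant. For a bot that has to produce natural replies, keeping stopwords is often the safer choice.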

# 6. Lemmatization/Stemming (Optional)
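Use lemmatization to reduce words to their base form. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verbs such as "running" are left unchanged by the call used below.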

In [50]:
from nltk.stem import WordNetLemmatizer

# Download required resources (no-ops if they are already present)
nltk.download('wordnet')
nltk.download('punkt')

lemmatizer = WordNetLemmatizer()
lemmatized_sentences = []

for sentence in filtered_sentences:
    words = word_tokenize(sentence)
    lemmatized = [lemmatizer.lemmatize(w) for w in words]
    lemmatized_sentences.append(" ".join(lemmatized))


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rurig\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rurig\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# 7. Remove Duplicate or Blank Lines

In [51]:
# dict.fromkeys drops duplicates while keeping the original order, which step 8 relies on
cleaned_sentences = list(dict.fromkeys(s.strip() for s in lemmatized_sentences if s.strip()))
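Because step 8 pairs each sentence with the one that follows it, sentence order has to survive deduplication. A `set` gives no ordering guarantee, whereas `dict.fromkeys` keeps the first occurrence of each key in insertion order. A minimal sketch on made-up lines:

```python
lines = ["hello", "how are you", "hello", "  ", "fine thanks"]

# Strip whitespace, skip blank lines, and keep the first occurrence of each line
unique_in_order = list(dict.fromkeys(s.strip() for s in lines if s.strip()))
print(unique_in_order)  # ['hello', 'how are you', 'fine thanks']
```

Removing a repeated line still changes which lines end up adjacent, so for dialogue pairing it may be preferable to drop only blank lines and keep repeats.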

# 8. Pair Dialogues (for Chatbot Training)
If your text is conversational (e.g., lines of dialogue), structure it as input-output pairs.

In [52]:
pairs = []
for i in range(len(cleaned_sentences) - 1):
    input_text = cleaned_sentences[i]
    response = cleaned_sentences[i+1]
    pairs.append((input_text, response))
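The loop above pairs every line with its successor. The same idea can be written with `zip`, shown here on a short made-up list so the resulting structure is visible; `make_pairs` is a hypothetical helper name.

```python
def make_pairs(lines):
    """Pair each line with the line that immediately follows it."""
    return list(zip(lines, lines[1:]))


demo_lines = ["hello", "how are you", "fine thanks"]
demo_pairs = make_pairs(demo_lines)
print(demo_pairs)
# [('hello', 'how are you'), ('how are you', 'fine thanks')]
```

This treats every adjacent pair as a prompt and its reply. In a real transcript, consecutive lines may come from the same speaker or from different scenes, so some pairs will not be true exchanges.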


# Bonus: Save Cleaned Data

In [53]:
import csv
with open("01-TV-Show-chatbot_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Input", "Response"])
    writer.writerows(pairs)
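A quick way to check the saved format is to read the rows back with `csv.DictReader`. The sketch below writes made-up pairs to an in-memory buffer instead of the real output file, so it can be run without touching the data.

```python
import csv
import io

demo_pairs = [("hello", "how are you"), ("how are you", "fine thanks")]

# Write to an in-memory buffer so the sketch does not touch the real output file
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Input", "Response"])
writer.writerows(demo_pairs)

# Read the rows back as dictionaries keyed by the header row
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["Input"], "->", rows[0]["Response"])
```

The `csv` module quotes fields that contain commas, so sentences with commas still load back into the correct column.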


# `Note:` 

What do these terms mean?

- **Tokenize** - Split text into sentences or words
- **Lemmatize/stem** - Reduce words to a base form (e.g., "cars" becomes "car")