# Main Goals of Cleaning

- Remove noise: timestamps, speaker labels (if any), scene directions.

- Structure the data into prompt-response (or speaker 1, speaker 2) pairs if building a chatbot.

- Preprocess text: lowercase, punctuation cleanup, lemmatization, etc.

# Step-by-Step Cleaning Code

In [54]:
import re
import nltk
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Starting the process

In [10]:
# Load the raw text
with open(r'data\suits-1x01-pilot.en.srt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# STEP 0: Remove BOM or encoding artifacts
raw_text = raw_text.replace('\ufeff', '').replace('ï»¿', '')
raw_text = raw_text.replace('?.', '')

# STEP 0.1: Remove any tags between < and > (HTML-like or malformed)
raw_text = re.sub(r'<[^>]*?>', '', raw_text)

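# STEP 1: Remove SRT artifacts like timestamps and numbers
# Match only lines made up entirely of digits, so numbers that end a dialogue line are kept
cleaned = re.sub(r'^\d+[ \t]*\r?\n', '', raw_text, flags=re.MULTILINE)  # remove sequence numbers
cleaned = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> .*', '', cleaned)     # remove timestamps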

# STEP 2: Remove scene directions (e.g., [Laughs], (phone rings), etc.)
cleaned = re.sub(r'\[.*?\]|\(.*?\)', '', cleaned)

# STEP 3: Remove any extra whitespace
cleaned = re.sub(r'\n+', '\n', cleaned).strip()

# STEP 3.1: Remove any URLs (http, https, www)
cleaned = re.sub(r'https?://\S+|www\.\S+', '', cleaned)

# STEP 3.2: Normalize whitespace and line breaks
cleaned = re.sub(r'\n+', '\n', cleaned).strip()

# STEP 4: Tokenize into sentences (each treated as a chat line)
sentences = sent_tokenize(cleaned)

# STEP 5: Lemmatization of the text
lemmatizer = WordNetLemmatizer()
lemmatized_sentences = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    lemmatized_sentences.append(" ".join(lemmatized))

# STEP 5.1: Ensure an even number of lines
# If the line count is odd, pad with "The End" so that the final
# line still has something to be paired with.
if len(lemmatized_sentences) % 2 != 0:
    lemmatized_sentences.append("The End")

# STEP 6: Prepare chatbot data as conversational pairs (Q&A or turn-based)
dialogue_pairs = []
for i in range(len(lemmatized_sentences) - 1):
    pair = {
        'prompt': lemmatized_sentences[i],
        'response': lemmatized_sentences[i + 1]
    }
    dialogue_pairs.append(pair)

# Convert to DataFrame and save
df = pd.DataFrame(dialogue_pairs)
df.to_csv(r'data\05.3-step-suits_chatbot_data.csv', index=False)

df.head(6)

Unnamed: 0,prompt,response
0,what 's happening to his deal .,I got to go .
1,I got to go .,I 'm gon na get screwed ?
2,I 'm gon na get screwed ?,at a bad time ?
3,at a bad time ?,Harvey Specter .
4,Harvey Specter .,"the best closer , for the last three hour ?"
5,"the best closer , for the last three hour ?","in troubled situation , at 7:00 p.m. , in jeop..."
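
As the output shows, Step 6 pairs each line with the one after it, so the windows overlap: every response is also the next row's prompt. The sketch below contrasts that with non-overlapping pairs. The helper name `make_pairs` and the placeholder lines are illustrative assumptions, not part of the notebook.

```python
def make_pairs(lines, overlapping=True):
    """Turn a list of dialogue lines into prompt/response dicts."""
    # step=1 reuses each response as the next prompt; step=2 uses each line once
    step = 1 if overlapping else 2
    return [
        {"prompt": lines[i], "response": lines[i + 1]}
        for i in range(0, len(lines) - 1, step)
    ]

lines = ["A", "B", "C", "D"]  # placeholder dialogue lines
sliding = make_pairs(lines)
disjoint = make_pairs(lines, overlapping=False)

print(len(sliding), "overlapping pairs")
print(len(disjoint), "non-overlapping pairs")
```

Overlapping pairs give more training rows, while non-overlapping pairs avoid treating the same line as both a question and an answer. Which one suits a subtitle file better is a judgment call.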


# Summary of Cleaning Rules Applied:

- ✅ BOMs and encoding junk (ï»¿, \ufeff)

- ✅ Timestamps and SRT numbers

- ✅ All <...> tags (including badly formatted HTML)

- ✅ Scene directions [Laughs], (phone rings)

- ✅ URLs (e.g., http://, www.)

- ✅ Extra line breaks

- ✅ Optional lemmatization
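
The rules above can be checked on a tiny example. The following is a minimal sketch, assuming a line-by-line approach rather than the exact regex sequence used in the notebook. The function name `clean_srt` and the sample subtitle text are made up for illustration.

```python
import re

# Matches the start of an SRT timing line such as "00:00:01,000 --> 00:00:03,000"
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> ")

def clean_srt(raw_text):
    """Apply the cleaning rules to raw SRT text and return the dialogue lines."""
    text = raw_text.replace("\ufeff", "")                  # BOM
    text = re.sub(r"<[^>]*>", "", text)                    # tags such as <i>...</i>
    text = re.sub(r"\[.*?\]|\(.*?\)", "", text)            # scene directions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)      # URLs
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, sequence numbers, and timing lines
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return kept

# Hypothetical two-cue subtitle snippet
sample = (
    "\ufeff1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "<i>What's happening?</i>\n"
    "\n"
    "2\n"
    "00:00:03,500 --> 00:00:05,000\n"
    "[phone rings] I got to go.\n"
)
print(clean_srt(sample))
```

One caveat of this sketch: a dialogue line consisting only of digits would be dropped along with the sequence numbers.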

# Building a baseline model for the chatbot

## **Option 1: Baseline Rule-Based Chatbot (Quickest)**

Represents every prompt as a TF-IDF vector, uses cosine similarity to find the prompt closest to the user's input, and returns the corresponding response.
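
To make the mechanics concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It is only an illustration: scikit-learn's `TfidfVectorizer` uses a different tokenizer, smoothing, and normalization, so the scores will not match it exactly. The prompts, responses, and function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

prompts = ["what is your name", "where do you work", "i need a lawyer"]
responses = ["Harvey Specter .", "At a law firm .", "I 'm a lawyer ."]

vectors, idf = tfidf_vectors(prompts)
# Terms unseen during fitting get weight 0, as they would be ignored by a fitted vectorizer
query_counts = Counter("do you need a lawyer".split())
query = {term: tf * idf.get(term, 0.0) for term, tf in query_counts.items()}

scores = [cosine(query, vec) for vec in vectors]
best = max(range(len(scores)), key=lambda i: scores[i])
print(responses[best])
```

The third prompt shares three terms with the query and the second only two, so the third response is returned.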

In [14]:
# Load your data
df = pd.read_csv(r'data\05.3-step-suits_chatbot_data.csv')

# Train vectorizer on all prompts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['prompt'])

def chatbot_response(user_input):
    user_vec = vectorizer.transform([user_input])
    similarity = cosine_similarity(user_vec, X)
    best_match_idx = similarity.argmax()
    return df.iloc[best_match_idx]['response']

Trying it out


In [73]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['quit', 'exit']:
        break
    print("Bot:", chatbot_response(user_input))

Bot: I 'm a grown man .
Bot: I got to go .
Bot: I 'm a lawyer .
Bot: I got to go .
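
The repeated "I got to go ." may have a simple cause: when an input shares no vocabulary with any prompt, every similarity is zero, `argmax()` returns index 0, and row 0 of this dataset has that response. A possible refinement is a similarity threshold with a fallback reply. The sketch below assumes a hypothetical helper `pick_response`; the threshold of 0.2 and the fallback text are arbitrary choices, not values from the notebook.

```python
def pick_response(scores, responses, threshold=0.2, fallback="Sorry, I don't follow."):
    """Return the best-scoring response, or a fallback when the match is too weak."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_idx] < threshold:
        return fallback
    return responses[best_idx]

responses = ["I got to go .", "I 'm a lawyer ."]

# No overlap with any prompt: all scores are zero, so the fallback is used
print(pick_response([0.0, 0.0], responses))

# A clear match: the corresponding response is returned
print(pick_response([0.1, 0.7], responses))
```

Inside `chatbot_response`, the scores would be `similarity[0]` and the responses `df['response'].tolist()`.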


## **Option 2: Fine-Tune a Pretrained Transformer (Advanced but Powerful)**

If you're building a **neural chatbot**, fine-tune a pretrained transformer such as DialoGPT (a causal language model) on your prompt-response pairs using *Hugging Face’s* `transformers` library. T5, an encoder-decoder model, is another option; the steps below use DialoGPT.

In [46]:
# !pip show transformers datasets torch accelerate

## 1. Installing and Checking the Required **Libraries** and **Packages**

In [27]:
# Check whether each required package is already installed,
# without installing it again and without importing the full library.
from importlib.metadata import version, PackageNotFoundError

for pkg in ['transformers', 'datasets', 'torch', 'accelerate']:
    try:
        pkg_version = version(pkg)  # raises PackageNotFoundError if missing
        print(f"{pkg} is installed ✅")
        print(f"{pkg} version: {pkg_version} ✅")
        print('')
    except PackageNotFoundError:
        print(f"{pkg} is NOT installed ❌")

transformers is installed ✅
transformers version: 4.46.3 ✅

datasets is installed ✅
datasets version: 3.1.0 ✅

torch is installed ✅
torch version: 2.4.1+cpu ✅

accelerate is NOT installed ❌


## 2. Prepare Dataset

Use your CSV with `prompt` and `response` columns.

### 🔄 Format the dataset for training:

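For DialoGPT, we concatenate each prompt and response into a single line of text, using the markers `<|user|>` and `<|bot|>` to separate the turns. These markers are treated as ordinary text by the tokenizer unless they are registered as special tokens.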

In [32]:
df = pd.read_csv(r"data\05.3-step-suits_chatbot_data.csv")

# Ensure text is string type
df['prompt'] = df['prompt'].astype(str)
df['response'] = df['response'].astype(str)

# Combine into single training text with turn separation
def format_convo(row):
    return f"<|user|> {row['prompt']} <|bot|> {row['response']}"

df['dialogue'] = df.apply(format_convo, axis=1)

# Save to .txt file — one dialogue per line
df['dialogue'].to_csv("data/suits_dialogues.txt", index=False, header=False)
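
One caveat: `to_csv` applies CSV quoting, so any dialogue containing a comma or a double quote is wrapped in quotes in the output file. If that is unwanted, the lines can be written directly. The following is a minimal sketch; the helper name `write_dialogues`, the sample line, and the temporary path are assumptions for illustration.

```python
import os
import tempfile

def write_dialogues(dialogues, path):
    """Write one dialogue per line, collapsing any embedded whitespace or newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for line in dialogues:
            f.write(" ".join(line.split()) + "\n")

# Hypothetical dialogue containing a comma
dialogues = ["<|user|> the best closer , for real ? <|bot|> I got to go ."]

path = os.path.join(tempfile.mkdtemp(), "dialogues.txt")
write_dialogues(dialogues, path)

with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)
```

In the notebook, the call would be `write_dialogues(df['dialogue'], "data/suits_dialogues.txt")`.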


## 3. Load and Tokenize Data (for DialoGPT)

In [43]:
# !pip install --upgrade datasets

In [44]:
# !pip install --upgrade datasets transformers tqdm

In [57]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load your dataset (the file written in the previous step)
file_pathi = "data/suits_dialogues.txt"
dataset = load_dataset("text", data_files={"train": file_pathi})

# Load DialoGPT tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
# DialoGPT's tokenizer has no padding token; reuse EOS so padding="max_length" works
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the data
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

TqdmKeyError: "Unknown argument(s): {'delay': 5}"

In [58]:
dataset['train'][0]
# {'text': 'Some line of dialogue'}

NameError: name 'dataset' is not defined

## 4. Load Pretrained Model

In [38]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


## 5. Fine-Tune the Model

In [None]:
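from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./suits-dialo-model",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=100,
    save_steps=500,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=2,
)

# The collator needs a padding token; DialoGPT's tokenizer has none by default
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Copies input_ids into labels so the causal-LM loss can be computed
# (mlm=False selects causal rather than masked language modeling)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)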

trainer.train()

## 6. Save and Use the Model

In [None]:
trainer.save_model("./suits-dialo-model")
tokenizer.save_pretrained("./suits-dialo-model")

Then load the saved model back for chatting:

In [74]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./suits-dialo-model")
model = AutoModelForCausalLM.from_pretrained("./suits-dialo-model")

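# Simple chat function
def chat(prompt):
    # Tokenize with the same turn markers used during training
    inputs = tokenizer(f"<|user|> {prompt} <|bot|>", return_tensors="pt")
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # avoids the missing attention-mask warning
        max_length=1000,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(output_ids[:, inputs["input_ids"].shape[-1]:][0], skip_special_tokens=True)
    return response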

# Trying it
chat("What happened to Gerald's deal?")

OSError: Incorrect path_or_model_id: './suits-dialo-model'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

# Summary and Tips for Better Results

- Consider DialoGPT-medium for better response quality, at the cost of slower loading and training.

- Use a GPU (Colab, Kaggle, or local CUDA) — training on CPU is very slow.

- You can train on more episodes for better fluency and character continuity.
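
Before starting a long training run, it can help to confirm which device is available. The following is a minimal sketch; the helper name `pick_device` is an assumption, and it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to train on; report CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Training would run on: {device}")
```

The Hugging Face `Trainer` generally picks up an available GPU on its own, so this check is mainly a sanity test before committing to several epochs.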

## Option 3: Best Model for Your Case: **DialoGPT-small**

Since your goal is to build a basic conversational chatbot, here is a roadmap that uses a fine-tuned transformer model (as in Option 2), prioritizing ease of setup, speed, and simplicity.

### *Why DialoGPT-small?*

- Pretrained on dialogue data (Reddit conversations).

- Lightweight = fast to load and fine-tune.

- Plug-and-play with Hugging Face 🤗.

- Good for basic chat use cases (like "Hi, how are you?").

### Workflow Overview

| Step | Description                                      |
| ---- | ------------------------------------------------ |
| 1.   | Prepare your dialogue data (CSV or plain text)   |
| 2.   | Tokenize the data using DialoGPT tokenizer       |
| 3.   | Fine-tune DialoGPT using Hugging Face `Trainer`  |
| 4.   | Save and deploy your chatbot model               |
| 5.   | Chat with the bot using a text loop or Gradio UI |


### 3.1 Step 1: Install Dependencies

In [72]:
# !pip install transformers datasets accelerate

### 3.2 Step 2: Load Model and Tokenizer

In [60]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

### 3.3 Step 3: Prepare Dataset (we’ve already cleaned text from subtitles)

Assume you have a .txt file of dialogues like:

```text
Hi.
Hey there!
How are you?
I'm doing great.
```
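
DialoGPT marks the end of each turn with its end-of-sequence token, which is why the chat code in this notebook appends `tokenizer.eos_token` to the user input. One way to turn alternating lines like these into training strings is sketched below. The helper name `build_training_examples` is hypothetical, and `EOS` is hard-coded to `<|endoftext|>` on the assumption that it matches `tokenizer.eos_token`. This is an alternative to the `<|user|>` / `<|bot|>` format used in Option 2, not a description of it.

```python
EOS = "<|endoftext|>"  # assumed to equal tokenizer.eos_token for DialoGPT

def build_training_examples(lines, turns_per_example=2):
    """Group consecutive turns and terminate each turn with the EOS marker."""
    examples = []
    last_start = len(lines) - turns_per_example
    for start in range(0, last_start + 1, turns_per_example):
        turns = lines[start:start + turns_per_example]
        examples.append("".join(turn + EOS for turn in turns))
    return examples

lines = ["Hi.", "Hey there!", "How are you?", "I'm doing great."]
examples = build_training_examples(lines)
for example in examples:
    print(example)
```

A trailing line without a partner is dropped by this sketch rather than padded.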

### 3.4 Step 4: Format for Training

In [62]:
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": file_pathi})

TqdmKeyError: "Unknown argument(s): {'delay': 5}"

### 3.5 Step 5: Tokenize


In [63]:
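# DialoGPT's tokenizer has no padding token; reuse EOS so padding works
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # max_length=512 matches the earlier tokenization cell
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize_function, batched=True)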

NameError: name 'dataset' is not defined

### 3.6 Step 6: Fine-Tune the Model

In [None]:
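from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=100,
    save_total_limit=1
)

# The collator needs a padding token; DialoGPT's tokenizer has none by default
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Copies input_ids into labels so the causal-LM loss can be computed
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)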

trainer.train()

### 3.7 Step 7: Chat with Your Bot

In [75]:
# Single-turn chat example
input_text = "Hi there!"
new_user_input_ids = tokenizer.encode(input_text + tokenizer.eos_token, return_tensors='pt')
bot_input_ids = new_user_input_ids

# Generate a response
output = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
reply = tokenizer.decode(output[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
print("Bot:", reply)

Bot: Hi there!


# **Step 8: Deploy Your Chatbot with Gradio**

## 8.1. Install `Gradio` in your notebook or Colab:

In [66]:
# !pip install gradio

## 8.2. Create a Chatbot Function

Assuming you’re using DialoGPT-small and already loaded your model and tokenizer:

In [67]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

chat_history_ids = None  # store conversation context


Here is a function to handle user inputs:

In [68]:
def respond(user_input, history=None):
    global chat_history_ids

    # Use a fresh list per call instead of a shared mutable default
    if history is None:
        history = []

    # Encode the user input and append to history
    new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')

    bot_input_ids = torch.cat([chat_history_ids, new_input_ids], dim=-1) if chat_history_ids is not None else new_input_ids

    # Generate response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=50,
        top_p=0.95
    )

    # Decode only the tokens generated after the input
    response = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)

    # Append to history and return
    history.append((user_input, response))
    return history, history
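
Because `chat_history_ids` grows with every turn, a long conversation will eventually approach the `max_length=1000` limit passed to `generate` and the 1024-token context of GPT-2-based models such as DialoGPT. A simple mitigation is to keep only the most recent tokens. The sketch below works on a plain list of token ids; the helper name `truncate_history` and the budget of 768 tokens are assumptions.

```python
def truncate_history(token_ids, max_tokens=768):
    """Keep only the most recent `max_tokens` ids of a flat conversation history."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]

history_ids = list(range(10))  # placeholder token ids
recent = truncate_history(history_ids, max_tokens=4)
print(recent)
```

With the tensor used in `respond`, the equivalent slice would be `chat_history_ids[:, -768:]`, applied before concatenating the new input.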


## 8.3. Create the Gradio Interface

In [69]:
import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Type your message")
    state = gr.State([])

    def user_chat(user_input, history):
        return respond(user_input, history)

    msg.submit(user_chat, [msg, state], [chatbot, state])


## 8.4. Launch the App

In [70]:
demo.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
