# 01 - TV Show trained chatbot creation

Building a chatbot using Transformers (like DialoGPT) and your .srt subtitle data in Python using Hugging Face Transformers.

# Prerequisites
You'll need:

- Python 3.8+

- transformers, datasets, torch, pandas

In [2]:
# Install the packages (in Jupyter Notebook or terminal):
! pip install transformers datasets torch pandas

Collecting transformers
  Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Collecting torch
  Downloading torch-2.4.1-cp38-cp38-win_amd64.whl (199.4 MB)
Collecting safetensors>=0.4.1
  Downloading safetensors-0.5.3-cp38-abi3-win_amd64.whl (308 kB)
Collecting filelock
  Downloading filelock-3.16.1-py3-none-any.whl (16 kB)
Collecting tokenizers<0.21,>=0.20
  Downloading tokenizers-0.20.3-cp38-none-win_amd64.whl (2.4 MB)
Collecting huggingface-hub<1.0,>=0.23.2
  Downloading huggingface_hub-0.33.4-py3-none-any.whl (515 kB)
Collecting fsspec[http]<=2024.9.0,>=2023.1.0
  Downloading fsspec-2024.9.0-py3-none-any.whl (179 kB)
Collecting xxhash
  Downloading xxhash-3.5.0-cp38-cp38-win_amd64.whl (30 kB)
Collecting multiprocess<0.70.17
  Downloading multiprocess-0.70.16-py38-none-any.whl (132 kB)
Collecting pyarrow>=15.0.0
  Downloading pyarrow-17.0.0-cp38-cp38-win_amd64.whl (25.2 MB)
Collecting dill<0.3.9,>=0.3.

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

huggingface-hub 0.33.4 requires packaging>=20.9, but you'll have packaging 20.4 which is incompatible.
datasets 3.1.0 requires requests>=2.32.2, but you'll have requests 2.24.0 which is incompatible.
datasets 3.1.0 requires tqdm>=4.66.3, but you'll have tqdm 4.50.2 which is incompatible.


# Step 1: Extract Conversation Pairs from `.srt`

In [13]:
# Feeding Function
def extract_dialogue_from_srt(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    dialogue_lines = []
    for line in lines:
        line = line.strip()
        if line == '' or line.isdigit() or '-->' in line:
            continue
        dialogue_lines.append(line)
    
    return dialogue_lines

def create_conversation_pairs(dialogue_lines):
    pairs = []
    for i in range(len(dialogue_lines) - 1):
        pairs.append({'input': dialogue_lines[i], 'response': dialogue_lines[i+1]})
    return pairs


## 1.2 Loading the data

In [14]:
# load the data
dialogues = extract_dialogue_from_srt(r"data\suits-1x01-pilot.en.srt")
# creating conversational pairs
conversation_pairs = create_conversation_pairs(dialogues)

# Step 2: Format for Training

In [15]:
import pandas as pd

df = pd.DataFrame(conversation_pairs)
df["text"] = df["input"] + " <|sep|> " + df["response"]
df = df[["text"]]
df.head()


Unnamed: 0,text
0,﻿1 <|sep|> [Muffled chatter]
1,[Muffled chatter] <|sep|> [Knocking]
2,[Knocking] <|sep|> Gerald Tate's here.
3,Gerald Tate's here. <|sep|> He wants to know
4,He wants to know <|sep|> what's happening to h...


# Step 3: Tokenize the Data
We’ll use DialoGPT (small version for speed)

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

# Step 4: Load into Dataset Format

In [22]:
import regex
from datasets import Dataset

dataset = Dataset.from_pandas(df)
tokenized_dataset = dataset.map(tokenize_function, batched=True)


AttributeError: module 'regex' has no attribute 'Pattern'

In [None]:
!pip uninstall regex -y
!pip install --upgrade regex

import regex
print(regex.__version__)

Found existing installation: regex 2024.11.6
Uninstalling regex-2024.11.6:
  Successfully uninstalled regex-2024.11.6
Collecting regex
  Using cached regex-2024.11.6-cp38-cp38-win_amd64.whl (274 kB)
Installing collected packages: regex
Successfully installed regex-2024.11.6
2.5.86


`AttributeError: module 'regex' has no attribute 'Pattern' on step 4`

Why this happens:

- Hugging Face’s datasets library uses regex.Pattern for regular expression matching.

- Older versions of regex don’t expose Pattern in a way compatible with this.

- Upgrading regex ensures compatibility with the datasets processing pipeline.

In [None]:
import regex
import os
print([f for f in os.listdir() if f.startswith("regex")])

os.remove("regex.py")

print(regex.__file__)
print(dir(regex))

[]


FileNotFoundError: [WinError 2] The system cannot find the file specified: 'regex.py'

# Step 5: Fine-tune DialoGPT

In [23]:
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_dir='./logs',
    save_steps=500,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
)

trainer.train()


RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback):
Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

# Step 6: Chat with Your Bot!

In [24]:
import torch

def chat_with_bot():
    print("Chatbot is ready! Type 'quit' to exit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break
        
        input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')
        chat_history_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        response = tokenizer.decode(chat_history_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
        
        print(f"Bot: {response}")

chat_with_bot()


Chatbot is ready! Type 'quit' to exit.


NameError: name 'model' is not defined

# Special Notes

This is a **light fine-tuning** — ideal for experimenting with small subtitle files.

For production-level, you’ll want to:

- Clean and balance more data

- Train on larger datasets

- Use checkpointing and evaluation metrics

- Possibly fine-tune GPT-2, LLaMA, or Mistral models for more power.