#### Thursday, March 21, 2024

Re-run just to test ... Yup it still runs!

#### Saturday, March 16, 2024

mamba activate t4nlpacv

This all runs in one pass.

In this chapter, we will build a RoBERTa model, an advanced variant of BERT, from scratch. The model will use the bricks of the transformer construction kit we need for BERT models.

Also, no pretrained tokenizers or models will be used. The RoBERTa model will be built following the 15-step process described in this chapter.

In [1]:
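# only target the 4090 ... (CUDA orders devices fastest-first by default, so the 4090 is
# device 0 here even though nvidia-smi lists it as GPU 1)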
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

# Pretraining a Customer Support Model on X (formerly Twitter) Data
copyright 2023-2024, Denis Rothman

This is an educational notebook to show how to implement a Hugging Face RobertaForCausalLM model on messages on X (formerly Twitter). The goal is only to show the method (see limitations below).

**Pretraining a Generative AI model from scratch**

**Dataset:** Tweets from 20 Top Brands by Volume  
**Model:**  RobertaForCausalLM


![](https://i.imgur.com/nTv3Iuu.png)




The goal of the notebook is to train a Hugging Face RobertaForCausalLM model to simulate a customer support chat agent for X (formerly Twitter).

This notebook requires a GPU.

**Customer Support on Twitter**
Over 2 million tweets and replies from the biggest brands on Twitter

https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter

**Limitations**:

The scope of pretraining was limited to a subset of the dataset because of time constraints. You can train on the full dataset on Google Colab or another platform. You can also select another model if you find the generalized responses insufficient. The responses are only there to show how the system works.

RoBERTa is not a standard generative AI model such as the GPT models in the Chapter07 directory. However, it can be implemented as a reasonably interesting autoregressive (token-by-token loop) model that illustrates how to begin exploring how generative AI works.

In the following chapters we will be using **GPT-4** and other **LLM** models. *However, exploring smaller open source models for a specific domain can sometimes provide everything we need for our project.* 


In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

Kaggle credentials for authentication

In [3]:
# import os
# import json
# with open(os.path.expanduser("drive/MyDrive/files/kaggle.json"), "r") as f:
#     kaggle_credentials = json.load(f)

# kaggle_username = kaggle_credentials["username"]
# kaggle_key = kaggle_credentials["key"]

# os.environ["KAGGLE_USERNAME"] = kaggle_username
# os.environ["KAGGLE_KEY"] = kaggle_key

In [4]:
# try:
#   import kaggle
# except:
#   !pip install kaggle
#   import kaggle

In [5]:
# kaggle.api.authenticate()

# Step 1: Downloading the dataset

https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter

In [6]:
# !kaggle datasets download -d thoughtvector/customer-support-on-twitter

In [7]:
# import zipfile

# with zipfile.ZipFile('/content/customer-support-on-twitter.zip', 'r') as zip_ref:
#     zip_ref.extractall('/content/')

# print("File Unzipped!")

# Step 2: Installing Hugging Face transformers

April 2023 update From Hugging Face Issue 22816:

https://github.com/huggingface/transformers/issues/22816

"The PartialState import was added as a dependency on the transformers development branch yesterday. PartialState was added in the 0.17.0 release in accelerate, and so for the development branch of transformers, accelerate >= 0.17.0 is required.

Downgrading the transformers version removes the code which is importing PartialState."

Denis Rothman: The following cell imports the latest version of Hugging Face transformers but without downgrading it.

To adapt to the Hugging Face upgrade, a GPU accelerator was activated using Google Colab Pro with the following NVIDIA GPU:
GPU Name: NVIDIA A100-SXM4-40GB

In [8]:
# !pip install Transformers
# !pip install --upgrade accelerate
from accelerate import Accelerator



creating subdirectories to store the datasets, the logs and the trained model

In [9]:
# !mkdir -p /content/model/dataset/
# !mkdir -p /content/model/model/
# !mkdir -p /content/model/logs/

# Step 3:  Loading and filtering the data

We will use a subset of the dataset to train the model.

In [10]:
import pandas as pd

# Load the dataset
# df = pd.read_csv('/content/twcs/twcs.csv')
df = pd.read_csv('CustomerSupportOnTwitter/twcs/twcs.csv')

# Check the first few rows to understand the data
print(df.head())

# 7.1s

   tweet_id   author_id  inbound                      created_at  \
0         1  sprintcare    False  Tue Oct 31 22:10:47 +0000 2017   
1         2      115712     True  Tue Oct 31 22:11:45 +0000 2017   
2         3      115712     True  Tue Oct 31 22:08:27 +0000 2017   
3         4  sprintcare    False  Tue Oct 31 21:54:49 +0000 2017   
4         5      115712     True  Tue Oct 31 21:49:35 +0000 2017   

                                                text response_tweet_id  \
0  @115712 I understand. I would like to assist y...                 2   
1      @sprintcare and how do you propose we do that               NaN   
2  @sprintcare I have sent several private messag...                 1   
3  @115712 Please send us a Private Message so th...                 3   
4                                 @sprintcare I did.                 4   

   in_response_to_tweet_id  
0                      3.0  
1                      1.0  
2                      4.0  
3                      5.0  
4

Extracting relevant data

In this case, we are extracting the text

In [11]:
# Extract tweets from the 'text' column or any other relevant column
tweets = df['text'].dropna().tolist()  # This assumes the column with tweets is named 'text'

In [12]:
# Convert the list of tweets to a DataFrame
df_tweets = pd.DataFrame(tweets, columns=['text'])

# Save the DataFrame to a CSV file
df_tweets.to_csv('tweets.csv', index=False, encoding='utf-8')

# 6.1s

In [13]:
# Checking the length of df
formatted_length = "{:,}".format(len(df_tweets))
print(formatted_length)

2,811,774


Checking the extraction

In [14]:
for tweet in tweets[:10]:  # This will display the first 10 tweets
    print(tweet)

@115712 I understand. I would like to assist you. We would need to get you into a private secured link to further assist.
@sprintcare and how do you propose we do that
@sprintcare I have sent several private messages and no one is responding as usual
@115712 Please send us a Private Message so that we can further assist you. Just click ‘Message’ at the top of your profile.
@sprintcare I did.
@115712 Can you please send us a private message, so that I can gain further details about your account?
@sprintcare is the worst customer service
@115713 This is saddening to hear. Please shoot us a DM, so that we can look into this for you. -KC
@sprintcare You gonna magically change your connectivity for me and my whole family ? 🤥 💯
@115713 We understand your concerns and we'd like for you to please send us a Direct Message, so that we can further assist you. -AA


filtering the extraction to clean it and apply lowercase conversion

In [15]:
import re

def filter_tweet(tweet):
    # Convert to lowercase, then keep only characters a to z, whitespace, and apostrophes
    return re.sub(r'[^a-z\s\']', '', tweet.lower())

filtered_tweets = [filter_tweet(tweet) for tweet in tweets]

# 7.8s

In [16]:
f=30

# let's try a smaller min word threshold to get back more tweets ... 
# f= 10

filtered_tweets = [tweet for tweet in filtered_tweets if len(tweet.split()) > f]  # Only keep tweets with more than f words
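To see what the two filtering cells do to individual tweets, here is a minimal sketch on three made-up strings. The sample texts and the threshold of 3 are illustrative assumptions, not values from the dataset. Digits and punctuation are dropped, which is why amounts lose their numbers in the printed tweets, and `\s` keeps newlines, so a kept tweet can still span several lines.

```python
import re

def filter_tweet(tweet):
    # Lowercase first, then drop everything except a-z, whitespace, and apostrophes
    return re.sub(r'[^a-z\s\']', '', tweet.lower())

# Made-up sample tweets, for illustration only
samples = [
    "@sprintcare I did.",
    "Bags cost 5p each, don't they?",
    "Hi! You can change your e-mail here: https://t.co/abc",
]

cleaned = [filter_tweet(s) for s in samples]
for c in cleaned:
    print(repr(c))

# Same word-count rule as the notebook, with a tiny threshold for the demo
f = 3
kept = [c for c in cleaned if len(c.split()) > f]
print(len(kept), "of", len(cleaned), "tweets kept")
```

The first sample has exactly three words after cleaning, so the strict `>` comparison drops it.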

In [17]:
for filtered_tweet in filtered_tweets[:10]:  # This will display the first 10 tweets
    print(filtered_tweet)

marksandspencer i check with the gov office and legal they stated you are not right but its funny how the other stores dont but you do no wonder lidl and the rest are beating you
marksandspencer ou must charge at least p a bag including vat for carrier bags that are all of the following

unused  its new and hasnt already been used for sold goods to be taken away or delivered
plastic and  microns thick or less
it has handles an opening and isnt sealed
marksandspencer arent require charge  a bag
paper bags
shops in airports or on board trains aeroplanes or ships
bags which only contain certain items such as unwrapped food raw meat and fish where there is a food safety risk prescription medicines uncovered blades seeds bulbs amp s
 hi you can change your microsoft account email through the steps here httpstcodkehohboyy  if the email your son wants to change to is already associated with a microsoft account you'll need to follow those steps to switch the email address on that account too z

In [18]:
# Checking the length of dataset
formatted_length = "{:,}".format(len(filtered_tweets))
print(formatted_length)

# 228,637   when the minimum number of words in the tweet is 30
# 2,227,557 when the minimum number of words in the tweet is 10 ... yup, way more!

228,637


save the dataset

In [19]:
targetFolder = "twitter"

In [20]:
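import csv
import os

# Save to CSV
pt_csv = targetFolder +  "/dataset/processed_tweets.csv"

# Create the target directory if it does not exist yet (the mkdir cell above is commented out)
os.makedirs(os.path.dirname(pt_csv), exist_ok=True)

with open(pt_csv, 'w', newline='') as file: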
    writer = csv.writer(file)
    for tweet in filtered_tweets:
        writer.writerow([tweet])

In [21]:
pt_csv

'twitter/dataset/processed_tweets.csv'

check the file

In [22]:
import csv
from itertools import islice

# Read from CSV
# with open('/content/model/dataset/processed_tweets.csv', 'r') as file:
with open(pt_csv, 'r', newline='') as file:
    reader = csv.reader(file)

    # Use islice to read only the first 5 rows (a row can span several lines)
    for row in islice(reader, 5):
        print(row[0])

marksandspencer i check with the gov office and legal they stated you are not right but its funny how the other stores dont but you do no wonder lidl and the rest are beating you
marksandspencer ou must charge at least p a bag including vat for carrier bags that are all of the following

unused  its new and hasnt already been used for sold goods to be taken away or delivered
plastic and  microns thick or less
it has handles an opening and isnt sealed
marksandspencer arent require charge  a bag
paper bags
shops in airports or on board trains aeroplanes or ships
bags which only contain certain items such as unwrapped food raw meat and fish where there is a food safety risk prescription medicines uncovered blades seeds bulbs amp s
 hi you can change your microsoft account email through the steps here httpstcodkehohboyy  if the email your son wants to change to is already associated with a microsoft account you'll need to follow those steps to switch the email address on that account too z

# Step 4: Checking Resource Constraints: GPU and CUDA

In [23]:
!nvidia-smi

Thu Mar 21 15:41:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce GTX 1050        Off | 00000000:01:00.0  On |                  N/A |
|  0%   57C    P0              N/A /  70W |    545MiB /  2048MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:02:00.0 Off |  

You can start running the code from here IF you want to skip the creation of the dataset ... 

In [24]:
# only target the 4090 ...
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

In [25]:
#@title Checking that PyTorch Sees CUDA
import torch
torch.cuda.is_available()

True

# Step 5: Defining the configuration of the model

In [26]:
from transformers import RobertaConfig, RobertaForCausalLM

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
    is_decoder=True,  # Use causal (left-to-right) self-attention so the model can be trained autoregressively
)



In [27]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
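With `is_decoder` set to true, self-attention is restricted so that each position only sees itself and earlier positions. A minimal pure-Python sketch of that pattern follows; the function name `causal_mask` is hypothetical, and the library builds an equivalent tensor internally rather than a list of lists.

```python
def causal_mask(seq_len):
    # mask[i][j] == 1 means position i may attend to position j
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row gains one more visible position, which is what lets the model be trained to predict the next token without seeing it.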



define and print model

In [28]:
# Create the RobertaForCausalLM model with the specified config
model = RobertaForCausalLM(config=config)
print(model)

RobertaForCausalLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): La

## Defining the tokenizer

In [29]:
from transformers import RobertaTokenizer

# Initialize the tokenizer using the 'roberta-base' pre-trained model
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

In [30]:
# Display special tokens
print("Special tokens:", tokenizer.special_tokens_map)

Special tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}


## Exploring the parameters

In [31]:
print(model.num_parameters())

83504416
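The total can be cross-checked by hand from the configuration values. The sketch below is plain arithmetic, not a library call. It assumes the usual layout of a RoBERTa layer (four attention projections, two feed-forward projections, two layer norms) and that the output projection of the language-modeling head shares its weights with the word embeddings.

```python
# Values taken from the printed RobertaConfig
vocab, hidden, inter, layers, positions, types = 52_000, 768, 3072, 6, 514, 1

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weight + bias

layer_norm = 2 * hidden  # scale + shift

embeddings = (vocab + positions + types) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)   # query, key, value, attention output
    + layer_norm
    + linear(hidden, inter)      # feed-forward up-projection
    + linear(inter, hidden)      # feed-forward down-projection
    + layer_norm
)
# LM head: dense + layer norm + output bias; the output projection is
# assumed to be tied to the word embeddings, so it is not counted twice.
lm_head = linear(hidden, hidden) + layer_norm + vocab

total = embeddings + layers * per_layer + lm_head
print(f"{total:,}")
```

Nearly half of the parameters sit in the 52,000 x 768 word-embedding matrix, so the vocabulary size matters as much as the number of layers for a model this small.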


In [32]:
LP=list(model.parameters())
lp=len(LP)
print(lp)
for p in range(0,lp):
  print(LP[p])

106
Parameter containing:
tensor([[-0.0232,  0.0173, -0.0213,  ...,  0.0082,  0.0291, -0.0124],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0100, -0.0074,  0.0194,  ..., -0.0094,  0.0046, -0.0154],
        ...,
        [ 0.0088, -0.0138, -0.0243,  ...,  0.0096,  0.0168, -0.0485],
        [-0.0199,  0.0002,  0.0084,  ...,  0.0145,  0.0037,  0.0075],
        [-0.0158, -0.0118, -0.0374,  ...,  0.0129, -0.0072, -0.0003]],
       requires_grad=True)
Parameter containing:
tensor([[-0.0157, -0.0259, -0.0470,  ..., -0.0051,  0.0076, -0.0132],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0046, -0.0150, -0.0062,  ...,  0.0103,  0.0199,  0.0166],
        ...,
        [ 0.0037, -0.0047,  0.0236,  ...,  0.0121, -0.0348,  0.0042],
        [ 0.0157, -0.0316,  0.0112,  ..., -0.0121, -0.0040, -0.0049],
        [-0.0068, -0.0074, -0.0346,  ..., -0.0430,  0.0160,  0.0114]],
       requires_grad=True)
Parameter containing:
tensor([

In [33]:
#Shape of each tensor in the model
LP = list(model.parameters())
for i, tensor in enumerate(LP):
    print(f"Shape of tensor {i}: {tensor.shape}")

Shape of tensor 0: torch.Size([52000, 768])
Shape of tensor 1: torch.Size([514, 768])
Shape of tensor 2: torch.Size([1, 768])
Shape of tensor 3: torch.Size([768])
Shape of tensor 4: torch.Size([768])
Shape of tensor 5: torch.Size([768, 768])
Shape of tensor 6: torch.Size([768])
Shape of tensor 7: torch.Size([768, 768])
Shape of tensor 8: torch.Size([768])
Shape of tensor 9: torch.Size([768, 768])
Shape of tensor 10: torch.Size([768])
Shape of tensor 11: torch.Size([768, 768])
Shape of tensor 12: torch.Size([768])
Shape of tensor 13: torch.Size([768])
Shape of tensor 14: torch.Size([768])
Shape of tensor 15: torch.Size([3072, 768])
Shape of tensor 16: torch.Size([3072])
Shape of tensor 17: torch.Size([768, 3072])
Shape of tensor 18: torch.Size([768])
Shape of tensor 19: torch.Size([768])
Shape of tensor 20: torch.Size([768])
Shape of tensor 21: torch.Size([768, 768])
Shape of tensor 22: torch.Size([768])
Shape of tensor 23: torch.Size([768, 768])
Shape of tensor 24: torch.Size([768])
Sh

In [34]:
#counting the parameters
np=0
for p in range(0,lp):#number of tensors
  PL2=True
  try:
    L2=len(LP[p][0]) #check if 2D
  except:
    L2=1             #not 2D but 1D
    PL2=False
  L1=len(LP[p])
  L3=L1*L2
  np+=L3             # number of parameters per tensor
  if PL2==True:
    print(p,L1,L2,L3)  # displaying the sizes of the parameters
  if PL2==False:
    print(p,L1,L3)  # displaying the sizes of the parameters

print(np)              # total number of parameters

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768
11 768 768 589824
12 768 768
13 768 768
14 768 768
15 3072 768 2359296
16 3072 3072
17 768 3072 2359296
18 768 768
19 768 768
20 768 768
21 768 768 589824
22 768 768
23 768 768 589824
24 768 768
25 768 768 589824
26 768 768
27 768 768 589824
28 768 768
29 768 768
30 768 768
31 3072 768 2359296
32 3072 3072
33 768 3072 2359296
34 768 768
35 768 768
36 768 768
37 768 768 589824
38 768 768
39 768 768 589824
40 768 768
41 768 768 589824
42 768 768
43 768 768 589824
44 768 768
45 768 768
46 768 768
47 3072 768 2359296
48 3072 3072
49 768 3072 2359296
50 768 768
51 768 768
52 768 768
53 768 768 589824
54 768 768
55 768 768 589824
56 768 768
57 768 768 589824
58 768 768
59 768 768 589824
60 768 768
61 768 768
62 768 768
63 3072 768 2359296
64 3072 3072
65 768 3072 2359296
66 768 768
67 768 768
68 768 768
69 768 768 589824
70 768 768
71 768 768

# Step 6: Creating and processing the dataset

In [35]:
#installing Hugging Face datasets for data loading and preprocessing
# !pip install datasets

In [36]:
# This is needed if we skipped the creation of the dataset ... 
targetFolder = 'twitter'
pt_csv = 'twitter/dataset/processed_tweets.csv'

In [37]:
#load dataset
from datasets import load_dataset

# dataset = load_dataset('csv', data_files='/content/model/dataset/processed_tweets.csv', column_names=["text"])
dataset = load_dataset('csv', data_files=pt_csv, column_names=["text"])



In [38]:
# split datasets into train and eval
from datasets import DatasetDict

dataset = dataset['train'].train_test_split(test_size=0.1)  # 10% for evaluation

dataset = DatasetDict(dataset)

In [39]:
# This puts a 100% load on a single CPU Core ... 
# Tokenize datasets:
# - If a record's length is less than `max_length`, it's padded to ensure all records have the same length.
# - If a record's length exceeds `max_length`, it's truncated to the specified max length.

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# How can I save this dataset locally so that I do not need to run this cell again??

# 46.9s when f = 30 ... way faster!

# 4m 59.6s
# 5m 5.8s ... for when the min number of words in the tweet is 10, so for 2,227,557 tweets ... 

Map:   0%|          | 0/205773 [00:00<?, ? examples/s]

Map:   0%|          | 0/22864 [00:00<?, ? examples/s]
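On the question in the cell above about avoiding a re-run: one option is to cache the tokenized result on disk. The helper below is a generic sketch; `load_or_build` and the cache path are hypothetical names. The commented usage relies on `save_to_disk` and `load_from_disk` from the `datasets` library and is not executed here.

```python
import os

def load_or_build(cache_dir, build, load, save):
    """Return the cached object if cache_dir exists; otherwise build it and cache it."""
    if os.path.isdir(cache_dir):
        return load(cache_dir)
    obj = build()
    save(obj, cache_dir)
    return obj

# Intended use with Hugging Face datasets (the cache path is a hypothetical choice):
#
# from datasets import load_from_disk
# tokenized_datasets = load_or_build(
#     "twitter/dataset/tokenized",
#     build=lambda: dataset.map(tokenize_function, batched=True),
#     load=load_from_disk,
#     save=lambda ds, path: ds.save_to_disk(path),
# )
```

If the word threshold `f` or `max_length` changes, the cache directory has to be deleted or renamed, since the helper only checks that the directory exists.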

In [40]:
# datacollator to batch items together for training and evaluation
from transformers import DataCollatorForLanguageModeling

# Define the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # For causal (autoregressive) language modeling
)
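With `mlm=False`, the collator does not mask tokens. As far as I know, it copies `input_ids` into `labels` and replaces padding positions with -100 so the loss ignores them, and the one-token shift between inputs and targets then happens inside the model. A pure-Python sketch of the label step, with made-up token ids and a hypothetical function name:

```python
PAD_ID = 1           # pad_token_id from the printed RobertaConfig
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def make_causal_lm_labels(input_ids):
    # Labels start as a copy of the inputs; padded positions are masked out
    return [IGNORE_INDEX if tok == PAD_ID else tok for tok in input_ids]

# Hypothetical token ids for a short tweet padded to length 8
input_ids = [0, 713, 16, 10, 2, 1, 1, 1]
labels = make_causal_lm_labels(input_ids)
print(labels)
```

Because every tweet is padded to 128 tokens, ignoring the padded positions keeps the reported loss from being dominated by trivial pad predictions.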

# Step 7: Initializing the trainer

The number of epochs can be increased empirically until the evaluation loss
stops improving enough to justify the extra training time.

In [41]:
# to display the time every eval_steps steps during training
from transformers import Trainer
from datetime import datetime
from typing import Dict, Any

class CustomTrainer(Trainer):
    def log(self, logs: Dict[str, Any], *args, **kwargs) -> None:
        super().log(logs, *args, **kwargs)
        # The logs dict passed to log() carries no "step" key; the current
        # step is tracked on the trainer state instead.
        step = self.state.global_step
        if self.args.eval_steps and step % self.args.eval_steps == 0:
            print(f"Current time at step {step}: {datetime.now()}")

In [42]:
import logging
from transformers import Trainer, TrainingArguments

# Set up Python logging
logging.basicConfig(level=logging.INFO)

print(targetFolder)

training_args = TrainingArguments(
    # output_dir="/content/model/model/",
    output_dir=targetFolder + "/model/",
    overwrite_output_dir=True,
    num_train_epochs=2,                  # can be increased to increase accuracy if productive
    # per_device_train_batch_size=64,      # batch size per device
    per_device_train_batch_size=128,      # batch size per device ... increased from 64 to ... 1024, 512, 256 all too large ... 
    save_steps=10_000,                   # save a checkpoint every save_steps=10000
    save_total_limit=2,                  # the maximum number of checkpoint model files to keep
    # logging_dir='/content/model/logs/',  # directory for storing logs
    logging_dir=targetFolder + '/logs/',  # directory for storing logs
    logging_steps=100,                   # Log every 100 steps
    logging_first_step=True,             # Log the first step
    evaluation_strategy="steps",         # Evaluate every "eval_steps"
    eval_steps=500,                      # Evaluate every 500 steps
)

twitter
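The step counts shown by the progress bars in this notebook follow from the dataset size and the batch sizes. The sketch below reproduces them with plain arithmetic. It assumes a single GPU, a split that rounds the test size up, and an evaluation batch size of 8, which I believe is the library default since it is not set in the arguments above.

```python
import math

n_tweets = 228_637   # filtered tweets (f = 30)
test_size = 0.1
train_batch = 128    # per_device_train_batch_size
eval_batch = 8       # assumed default for per_device_eval_batch_size
epochs = 2

n_test = math.ceil(n_tweets * test_size)
n_train = n_tweets - n_test

steps_per_epoch = math.ceil(n_train / train_batch)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(n_test / eval_batch)

print(n_train, n_test)
print(total_steps, eval_batches)
```

With `eval_steps=500`, this gives six intermediate evaluations (steps 500 through 3000) before training ends.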


In [43]:
trainer = CustomTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["test"]
)



# Step 8: Pretraining the model


In [44]:
%%time
trainer.train()

# Number of Tweets:
# 228,637   when the minimum number of words in the tweet is 30
# 2,227,557 when the minimum number of words in the tweet is 10 ... yup, way more!

# I tried running this for the 2,227,557 tweets .. 27 minutes, only at 4% ... I am gonna stop this, increase the batch size, and see if it 
# appears to make stuff faster ... I think it will ... VRAM was at 10522 MiB before I interrupted the training ... 
# 20224 MiB when per_device_train_batch_size=128 ... OK, so I also interrupted the training because it was going to take too long .. 

# Interrupted the training at this time ... 
# 22m 2.6s ... min tweet length is 30, 

# 26m 37s  when min tweet length is 30 and batch_size=64
# 19m 7.5s when min tweet length is 30 and batch_size=128


  0%|          | 0/3216 [00:00<?, ?it/s]

{'loss': 11.0049, 'learning_rate': 4.998445273631841e-05, 'epoch': 0.0}
{'loss': 8.0227, 'learning_rate': 4.84452736318408e-05, 'epoch': 0.06}
{'loss': 6.1228, 'learning_rate': 4.68905472636816e-05, 'epoch': 0.12}
{'loss': 5.7052, 'learning_rate': 4.5335820895522394e-05, 'epoch': 0.19}
{'loss': 5.4444, 'learning_rate': 4.3781094527363184e-05, 'epoch': 0.25}
{'loss': 5.292, 'learning_rate': 4.222636815920398e-05, 'epoch': 0.31}


  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 5.163656711578369, 'eval_runtime': 26.0611, 'eval_samples_per_second': 877.324, 'eval_steps_per_second': 109.665, 'epoch': 0.31}
{'loss': 5.182, 'learning_rate': 4.067164179104478e-05, 'epoch': 0.37}
{'loss': 5.0833, 'learning_rate': 3.9116915422885576e-05, 'epoch': 0.44}
{'loss': 5.0012, 'learning_rate': 3.756218905472637e-05, 'epoch': 0.5}
{'loss': 4.9461, 'learning_rate': 3.600746268656717e-05, 'epoch': 0.56}
{'loss': 4.8636, 'learning_rate': 3.445273631840796e-05, 'epoch': 0.62}


  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.786151885986328, 'eval_runtime': 26.3393, 'eval_samples_per_second': 868.055, 'eval_steps_per_second': 108.507, 'epoch': 0.62}
{'loss': 4.8259, 'learning_rate': 3.289800995024876e-05, 'epoch': 0.68}
{'loss': 4.7907, 'learning_rate': 3.1343283582089554e-05, 'epoch': 0.75}
{'loss': 4.7448, 'learning_rate': 2.978855721393035e-05, 'epoch': 0.81}
{'loss': 4.705, 'learning_rate': 2.823383084577115e-05, 'epoch': 0.87}
{'loss': 4.6683, 'learning_rate': 2.6679104477611942e-05, 'epoch': 0.93}


  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.593803882598877, 'eval_runtime': 26.7794, 'eval_samples_per_second': 853.79, 'eval_steps_per_second': 106.724, 'epoch': 0.93}
{'loss': 4.6387, 'learning_rate': 2.512437810945274e-05, 'epoch': 1.0}
{'loss': 4.6033, 'learning_rate': 2.3569651741293533e-05, 'epoch': 1.06}
{'loss': 4.5726, 'learning_rate': 2.201492537313433e-05, 'epoch': 1.12}
{'loss': 4.5662, 'learning_rate': 2.0460199004975124e-05, 'epoch': 1.18}
{'loss': 4.5561, 'learning_rate': 1.890547263681592e-05, 'epoch': 1.24}


  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.485883712768555, 'eval_runtime': 25.7314, 'eval_samples_per_second': 888.566, 'eval_steps_per_second': 111.071, 'epoch': 1.24}
{'loss': 4.5419, 'learning_rate': 1.735074626865672e-05, 'epoch': 1.31}
{'loss': 4.5232, 'learning_rate': 1.5796019900497512e-05, 'epoch': 1.37}
{'loss': 4.4867, 'learning_rate': 1.424129353233831e-05, 'epoch': 1.43}
{'loss': 4.4749, 'learning_rate': 1.2686567164179105e-05, 'epoch': 1.49}
{'loss': 4.4809, 'learning_rate': 1.1131840796019902e-05, 'epoch': 1.55}


  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.420069694519043, 'eval_runtime': 25.8377, 'eval_samples_per_second': 884.907, 'eval_steps_per_second': 110.613, 'epoch': 1.55}
{'loss': 4.4689, 'learning_rate': 9.577114427860696e-06, 'epoch': 1.62}
{'loss': 4.4821, 'learning_rate': 8.022388059701493e-06, 'epoch': 1.68}
{'loss': 4.4641, 'learning_rate': 6.467661691542288e-06, 'epoch': 1.74}
{'loss': 4.4321, 'learning_rate': 4.9129353233830845e-06, 'epoch': 1.8}
{'loss': 4.4403, 'learning_rate': 3.358208955223881e-06, 'epoch': 1.87}


  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.3902907371521, 'eval_runtime': 25.7159, 'eval_samples_per_second': 889.101, 'eval_steps_per_second': 111.138, 'epoch': 1.87}
{'loss': 4.4589, 'learning_rate': 1.803482587064677e-06, 'epoch': 1.93}
{'loss': 4.4342, 'learning_rate': 2.4875621890547267e-07, 'epoch': 1.99}
{'train_runtime': 1147.5453, 'train_samples_per_second': 358.632, 'train_steps_per_second': 2.803, 'train_loss': 4.87441381471074, 'epoch': 2.0}
CPU times: user 19min 9s, sys: 2.93 s, total: 19min 12s
Wall time: 19min 7s


TrainOutput(global_step=3216, training_loss=4.87441381471074, metrics={'train_runtime': 1147.5453, 'train_samples_per_second': 358.632, 'train_steps_per_second': 2.803, 'train_loss': 4.87441381471074, 'epoch': 2.0})
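The `learning_rate` values in the log are consistent with a linear decay from 5e-5 to zero over the 3,216 training steps with no warmup. The sketch below recomputes a few of them. The base rate and the schedule shape are assumptions here (they are not set explicitly in the training arguments), and `linear_lr` is a hypothetical helper, not a library function.

```python
BASE_LR = 5e-5       # assumed default learning_rate of TrainingArguments
TOTAL_STEPS = 3216   # from the training progress bar

def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    # Linear decay to zero with no warmup
    return base_lr * (total_steps - step) / total_steps

for step in (1, 100, 3200):
    print(step, linear_lr(step))
```

The three steps correspond to the first, second, and last logged training entries.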

**Sample run information:**

CPU times: user 26min 37s, sys: 3.01 s, total: 26min 40s
Wall time: 26min 35s


TrainOutput(global_step=3216, training_loss=5.05824806411468, metrics={'train_runtime': 1595.3194, 'train_samples_per_second': 128.985, 'train_steps_per_second': 2.016, 'total_flos': 6822770940370944.0, 'train_loss': 5.05824806411468, 'epoch': 1.0})

display results

In [45]:
results = trainer.evaluate()
print(results)

  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.386073112487793, 'eval_runtime': 26.2182, 'eval_samples_per_second': 872.065, 'eval_steps_per_second': 109.008, 'epoch': 2.0}


evaluate the trainer

In [46]:
trainer.evaluate()

  0%|          | 0/2858 [00:00<?, ?it/s]

{'eval_loss': 4.386073112487793,
 'eval_runtime': 26.4829,
 'eval_samples_per_second': 863.351,
 'eval_steps_per_second': 107.919,
 'epoch': 2.0}
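The evaluation loss is easier to interpret as a perplexity. Assuming `eval_loss` is the mean cross-entropy per token in nats, which is the usual convention for this kind of loss, the conversion is a single exponential:

```python
import math

eval_loss = 4.386073112487793  # from trainer.evaluate()
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.1f}")
```

Roughly speaking, the model is about as uncertain as if it were choosing uniformly among that many tokens at each position. That is high for a production system but plausible for two epochs on a small subset.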

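# Step 9: Saving the trained model (+ config) to disk

In [47]:
# trainer.save_model("/content/model/model/")
# Note: the tokenizer was not passed to the Trainer, so only the model weights and config
# are written here; the tokenizer is reloaded from 'roberta-base' in Step 10.
trainer.save_model(targetFolder + "/model/")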

In [48]:
#Uncomment the following line to save the output for future use
#trainer.save_model("drive/MyDrive/files/model_C6/model/")

# Step 10: User Interface to Chat with the Generative AI Agent

In [49]:
# For standalone run : transformer library
#!pip install Transformers

In [50]:
#1.A.for standalone run : mount Google Drive and path to pretrained model
# from google.colab import drive
# drive.mount('/content/drive')
# model_path="drive/MyDrive/files/model_C6/model/"

In [51]:
# 1.A For a run during the training session of this notebook,
# Load the trained model: model path
# local model path(comment for a standalone run):
# model_path="/content/model/model/"

model_path=targetFolder + "/model/"

In [52]:
# 1.B Load the trained model and tokenizer : model and tokenizer
from transformers import RobertaConfig, RobertaForCausalLM
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForCausalLM.from_pretrained(model_path)

## Running samples


In [53]:
# 2. Tokenize an input prompt
prompt = "I would like to know why they moved us"
inputs = tokenizer(prompt, return_tensors="pt", max_length=50, truncation=True)

# 3. Generate a response from the model
# Note: temperature only affects sampling; without do_sample=True this call decodes greedily
output = model.generate(**inputs, max_length=100, temperature=0.9, num_return_sequences=1)

# 4. Decode the generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)



I would like to know why they moved us to the iphone  and  i have to get a new iphone  and i have to get a new iphone                                                             


Example:    
Input   
I would like to assist       
output:   
I would like to assist you please give us your full name address and phone number so we can look into this for you
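The repeated phrase in the generated sample is typical of greedy decoding, where the single most likely next token is always chosen. The toy loop below illustrates the token-by-token mechanism with a hand-written table of next-word scores. The table and the function name are made up for the illustration; a real model scores the next token from the whole prefix, not only the last word.

```python
# Hypothetical next-token scores, keyed by the previous word
NEXT = {
    "i": {"have": 0.9, "would": 0.1},
    "have": {"to": 1.0},
    "to": {"get": 0.8, "the": 0.2},
    "get": {"a": 1.0},
    "a": {"new": 1.0},
    "new": {"iphone": 1.0},
    "iphone": {"and": 0.6, "<eos>": 0.4},
    "and": {"i": 1.0},
}

def greedy_generate(prompt, max_length):
    tokens = prompt.split()
    while len(tokens) < max_length:
        candidates = NEXT.get(tokens[-1], {"<eos>": 1.0})
        best = max(candidates, key=candidates.get)  # always take the top score
        if best == "<eos>":
            break
        tokens.append(best)
    return " ".join(tokens)

result = greedy_generate("i have", max_length=16)
print(result)
```

Because "and" always beats the end-of-sequence token after "iphone", the loop cycles until it reaches `max_length`. Sampling from the scores instead of taking the maximum would sometimes break the cycle.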

## Interface

In [54]:
# !pip install ipywidgets

In [55]:
import ipywidgets as widgets
from IPython.display import display, clear_output
from transformers import RobertaTokenizer, RobertaForCausalLM

# Define the function to generate response
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", max_length=50, truncation=True)
    output = model.generate(**inputs, max_length=200, temperature=0.9, num_return_sequences=1)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Create widgets
text_input = widgets.Textarea(
    description='Prompt:',
    placeholder='Enter your prompt here...'
)

button = widgets.Button(
    description='Generate',
    button_style='success'
)

output_text = widgets.Output(layout={'border': '1px solid black', 'height': '100px'})

# Define button click event handler
def on_button_clicked(b):
    with output_text:
        clear_output()
        response = generate_response(text_input.value)
        print(response)

button.on_click(on_button_clicked)

# Display widgets
display(text_input, button, output_text)

Textarea(value='', description='Prompt:', placeholder='Enter your prompt here...')

Button(button_style='success', description='Generate', style=ButtonStyle())

Output(layout=Layout(border_bottom='1px solid black', border_left='1px solid black', border_right='1px solid b…