# Objectives 

* Create BERT embeddings for every text in the dataset
* Design and train a Bi-LSTM to classify each text as 0 (written by a human) or 1 (generated by AI)

# Dataset Description

The dataset contains roughly 44,900 paragraphs (44,868 rows), some written by humans and some by an LLM. The goal of this model is to predict which paragraphs were written by an LLM and which were written by humans.

In [2]:
import pandas as pd 
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset, TensorDataset, random_split
from transformers import BertModel, BertTokenizer
device = torch.device('mps') 

Loading the dataset

In [3]:
link = './dataset/train_v2_drcat_02.csv'
data = pd.read_csv(link)
data.head()

Unnamed: 0,text,label,prompt_name,source,RDizzl3_seven
0,Phones\n\nModern humans today are always on th...,0,Phones and driving,persuade_corpus,False
1,This essay will explain if drivers should or s...,0,Phones and driving,persuade_corpus,False
2,Driving while the use of cellular devices\n\nT...,0,Phones and driving,persuade_corpus,False
3,Phones & Driving\n\nDrivers should not be able...,0,Phones and driving,persuade_corpus,False
4,Cell Phone Operation While Driving\n\nThe abil...,0,Phones and driving,persuade_corpus,False


In [10]:
data[data.label == 0].prompt_name.value_counts()

prompt_name
Does the electoral college work?         2714
Car-free cities                          2666
Facial action coding system              2167
Distance learning                        2157
Driverless cars                          1886
Exploring Venus                          1862
Summer projects                          1750
Mandatory extracurricular activities     1670
Cell phones at school                    1656
Grades for extracurricular activities    1626
The Face on Mars                         1583
Seeking multiple opinions                1552
Community service                        1542
"A Cowboy Who Rode the Waves"            1372
Phones and driving                       1168
Name: count, dtype: int64

In [11]:
data[data.label == 1].prompt_name.value_counts()

prompt_name
Seeking multiple opinions                3624
Distance learning                        3397
Car-free cities                          2051
Does the electoral college work?         1720
Mandatory extracurricular activities     1407
Summer projects                           951
Facial action coding system               917
Community service                         550
"A Cowboy Who Rode the Waves"             524
Grades for extracurricular activities     490
Cell phones at school                     463
Phones and driving                        415
Driverless cars                           364
Exploring Venus                           314
The Face on Mars                          310
Name: count, dtype: int64

In [5]:
print(data.iloc[25999].text)

While summer is meant as a break from the regular school routine, having some structured learning over the summer can help students maintain important skills and knowledge. When given the choice, I believe summer projects are best when student-designed rather than teacher-designed.

When students have autonomy in choosing their own summer project topics, they are more motivated to engage in meaningful learning. By allowing students to pick subjects they find genuinely interesting or relevant to their lives, they will be internally driven to explore the topic in depth. This type of intrinsic motivation leads to better focus and quality of work compared to assignments chosen by teachers without student input. 

Giving students ownership over project topics also fosters independence and skill development. By guiding their own work, students learn valuable skills like time management, decision making, and self-directed learning - skills that will serve them well in higher education and car

In [4]:
data.label.value_counts()

label
0    27371
1    17497
Name: count, dtype: int64
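
The label counts above show a moderate class imbalance, which gives a reference point for the accuracies reported later. A minimal sketch, using only the two counts printed above, of the accuracy a classifier would reach by always predicting the majority class (the variable names are illustrative):

```python
# Label counts copied from the value_counts() output above.
label_counts = {0: 27371, 1: 17497}  # 0 = human, 1 = AI

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"total paragraphs: {total}")
print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```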

# BERT model

The dataset is passed to the `textDataset` class, which tokenizes each text with the BERT tokenizer and returns it in the form the model expects: `input_ids`, `attention_mask`, and `label` tensors.
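
To illustrate the layout of what `__getitem__` returns, here is a minimal sketch of padding and truncation to a fixed `max_length`. It uses a hypothetical whitespace tokenizer (`toy_encode`, with a made-up vocabulary) instead of `BertTokenizer`, and it omits the `[CLS]` and `[SEP]` tokens that the real tokenizer adds, so the IDs are not real BERT IDs. Only the fixed-length `input_ids` / `attention_mask` layout is meant to carry over.

```python
def toy_encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Hypothetical stand-in for a tokenizer call with
    padding='max_length' and truncation=True."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()]
    ids = ids[:max_length]                      # truncation
    mask = [1] * len(ids)                       # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,    # padding
        "attention_mask": mask + [0] * n_pad,   # 0 = padding
    }

vocab = {"phones": 2, "and": 3, "driving": 4}   # made-up vocabulary
short_enc = toy_encode("Phones and driving", vocab, max_length=5)
long_enc = toy_encode("Phones and driving and phones and driving", vocab, max_length=5)
print(short_enc)
print(long_enc)
```

Both results have exactly `max_length` entries, which is what lets the `DataLoader` stack them into a single batch tensor.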

In [5]:
class textDataset(Dataset): 
    def __init__(self, dataset, tokenizer, max_length=120):
        self.texts = dataset.text.to_list() 
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.label = dataset.label.to_list()  # plain list, so indexing by idx is positional
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.label[idx]
        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length = self.max_length,
            return_tensors='pt',
        )
        
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask':encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long)
        }
    

Loading the BERT model and tokenizer

In [6]:
model = BertModel.from_pretrained('bert-base-uncased').to(device=device)
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [7]:
batch_size = 128
dataset = textDataset(data, tokenizer=tokenizer, max_length=240)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In [8]:
del data

Creating the BERT embeddings
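
The embedding kept for each paragraph is the final-layer hidden state at position 0, the `[CLS]` token. A minimal sketch of that slicing step, using a random tensor as a stand-in for BERT's `last_hidden_state` (the batch size and sequence length here are illustrative):

```python
import torch

batch, seq_len, hidden = 4, 240, 768
# Random stand-in for output.last_hidden_state: (batch, seq_len, hidden)
last_hidden_state = torch.randn(batch, seq_len, hidden)

# Keep only the vector at position 0 ([CLS]) for every item in the batch.
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # torch.Size([4, 768])
```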

In [9]:
embeddings_dim = 768 
length_of_dataset = len(dataloader.dataset)

allEmbeddings = torch.empty(size=(length_of_dataset, embeddings_dim), dtype=torch.float32).to(device=device)
allLabels = torch.empty(length_of_dataset, dtype=torch.int64).to(device=device)

current_idx = 0

with torch.no_grad(): 
    for idx, batch in enumerate(dataloader):
        if idx % 10 == 0: 
            print(idx)
        input_ids = batch['input_ids'].to(device=device)
        attention_mask = batch['attention_mask'].to(device=device)
        label = batch['label'].to(device=device)
        
        # store embeddings
        
        output = model(input_ids=input_ids, attention_mask=attention_mask)
        embeddings = output.last_hidden_state[:,0,:]

        batch_size = embeddings.size(0)
        
        allEmbeddings[current_idx: current_idx+ batch_size] = embeddings
        allLabels[current_idx:current_idx + batch_size] = label
        
        current_idx += batch_size

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
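
The progress counter stops at 350 because the loop prints every tenth batch index. A quick check of the batch arithmetic, assuming the 44,868 rows implied by the label counts and the batch size of 128 used above:

```python
import math

n_rows = 27371 + 17497      # label counts from the dataset
batch_size = 128

n_batches = math.ceil(n_rows / batch_size)
last_batch_size = n_rows - (n_batches - 1) * batch_size
printed = [i for i in range(n_batches) if i % 10 == 0]

print(n_batches)        # 351
print(last_batch_size)  # 68
print(printed[-1])      # 350
```

The last batch is smaller than 128, which is why the loop reads the batch size from `embeddings.size(0)` rather than assuming a full batch.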


# Saving the embeddings (run the notebook from this point onward)

In [250]:
torch.save(allEmbeddings, './Saved Embeddings/embeddings.pt')
torch.save(allLabels, './Saved Embeddings/outputs.pt')

NameError: name 'allEmbeddings' is not defined

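The embeddings and labels are now saved to disk. In later sessions, run the notebook from this point and load the saved tensors instead of recomputing the BERT embeddings, which avoids repeating the expensive GPU pass. The `NameError` above appears when the save cell is re-run in a fresh session, since `allEmbeddings` exists only after the embedding cell has run.

Re-run the import statements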

In [5]:
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset, TensorDataset, random_split
from torchmetrics import Accuracy
device = torch.device('mps')

In [6]:
embeddings = torch.load('./Saved Embeddings/embeddings.pt').to(device=device)
# class-index targets for nn.CrossEntropyLoss should be integer (int64) tensors
labels = torch.load('./Saved Embeddings/outputs.pt').to(device=device).long()

# Bidirectional LSTM

* The Bi-LSTM has 6 layers with dropout 0.5.
* The loss function is cross-entropy. `nn.CrossEntropyLoss` applies softmax internally, so the network outputs raw logits.
* The learning rate is 0.001 and the optimizer is Adam.
* These hyperparameters were chosen after hyperparameter tuning.
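
For a single sample, softmax cross-entropy reduces to the negative log of the softmax probability assigned to the true class. A minimal pure-Python sketch with made-up logits (the function names are illustrative, not PyTorch API):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.0]                      # made-up scores for classes 0 and 1
print(softmax(logits))
print(cross_entropy(logits, target=0))   # small: class 0 is favoured
print(cross_entropy(logits, target=1))   # larger: class 1 is disfavoured
```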

In [7]:
class BiDirectionalLSTM(nn.Module):  
    def __init__(self, emb_dim=768, hidden_dim=256):
        super(BiDirectionalLSTM, self).__init__()
        self.LSTM = nn.LSTM(
            input_size= emb_dim,
            hidden_size= hidden_dim, 
            num_layers= 6, 
            dropout= 0.5, 
            batch_first= True,
            bidirectional= True,
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 2, 128), 
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(), 
            nn.Linear(64,2),
        )
    
    def forward(self, x): # x is the bert embeddings
        x = x.view(-1,1,768)
        _, (hidden_state, _) = self.LSTM(x)
        lstm_output = torch.cat((hidden_state[-2,:,:], hidden_state[-1,:,:]), dim=1)
        output = self.fc(lstm_output)
        return output
        
lstm = BiDirectionalLSTM().to(device=device)
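
For a multi-layer bidirectional `nn.LSTM`, the returned hidden state has shape `(num_layers * 2, batch, hidden_size)`, and its last two entries are the forward and backward states of the top layer. These are what `forward` concatenates. A minimal shape check with smaller, illustrative dimensions:

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, num_layers, batch = 16, 8, 3, 5  # illustrative sizes
lstm_demo = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                    batch_first=True, bidirectional=True)

x = torch.randn(batch, emb_dim).view(-1, 1, emb_dim)  # sequence length 1
_, (h_n, _) = lstm_demo(x)

# h_n: (num_layers * 2, batch, hidden_dim); the top layer is the last two rows.
top = torch.cat((h_n[-2], h_n[-1]), dim=1)

print(h_n.shape)  # torch.Size([6, 5, 8])
print(top.shape)  # torch.Size([5, 16])
```

Because each paragraph is presented as a sequence of length 1, the LSTM takes a single step per direction on each embedding.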
    

In [8]:
lossfn = nn.CrossEntropyLoss()
learning_rate = 0.001
opt = Adam(lstm.parameters(), lr=learning_rate)

Training the LSTM

Splitting the dataset into training and validation sets

In [9]:
# split into training (70%) and validation (30%) sets; a separate test set is not held out here
batch_size = 32
dataset = TensorDataset(embeddings, labels)
trainDataset, valDataset = random_split(dataset, lengths=[0.7,0.3])
trainloader = DataLoader(trainDataset, batch_size=batch_size, shuffle=True)
valLoader = DataLoader(valDataset, batch_size=batch_size, shuffle=False)
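
When `random_split` is given fractions, PyTorch documents the split sizes as `floor(frac * len(dataset))` per fraction, with any remainder handed out one item at a time in round-robin order. A pure-Python sketch of that arithmetic (the helper `split_lengths` is illustrative, not a PyTorch function), applied to the 44,868 rows of this dataset:

```python
import math

def split_lengths(n_items, fractions):
    # floor of each fractional share
    lengths = [math.floor(n_items * f) for f in fractions]
    # hand out leftover items one at a time, round-robin
    remainder = n_items - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths

train_size, val_size = split_lengths(27371 + 17497, [0.7, 0.3])
print(train_size, val_size)  # 31408 13460
```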

# Training the model for 3 epochs

In [10]:
# for softmax cross entropy problem

epochs = 3
accuracy = Accuracy(task='multiclass', num_classes=2)
for epoch in range(epochs):
    running_loss = 0
    lstm.train()


    if lstm.training:
        print('training time')

    else:
        print('mistake')

    correct = 0
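    for inputs, targets in trainloader:

        opt.zero_grad()

        # forward pass: logits of shape (batch, 2)
        output = lstm(inputs)

        # count correct predictions for the training accuracy
        correct += (output.argmax(dim=1) == targets).sum().item()
        # batch-mean cross-entropy loss
        loss = lossfn(output, targets)
        # accumulate (batch-mean loss / batch size): a running indicator, not a per-sample mean
        running_loss += loss.item()/len(targets)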

        # backpropagate
        loss.backward()
        opt.step()
    
    trainingAcc = correct / len(trainDataset)

    # validation
    lstm.eval()
    if lstm.training:
        print('train mode during eval')
    else:
        print('eval time')
    running_val_loss = 0

    correct = 0
    # no gradients are needed for validation
    with torch.no_grad():
        for inputs, targets in valLoader:
            # batchwise output
            output = lstm(inputs)
            loss = lossfn(output, targets)
            running_val_loss += loss.item()/len(targets)

            correct += (output.argmax(dim=1) == targets).sum().item()

    # validation over 
    
    # printing metrics 
    val_accuracy = correct / len(valDataset)
    print(f'''epoch [{epoch+1}/{epochs}]
        \t training loss: {running_loss},
        \t validation loss: {running_val_loss},
        \t Training Acc: {trainingAcc},
        \t Val acc: {val_accuracy},
            ''')



training time
eval time
epoch [1/3]
        	 training loss: 3.6365216737249284,
        	 validation loss: 0.7507696123844653,
        	 Training Acc: 0.9575585838003057,
        	 Val acc: 0.9821693907875185,
            
training time
eval time
epoch [2/3]
        	 training loss: 2.1949133417383564,
        	 validation loss: 1.843424329998379,
        	 Training Acc: 0.9775216505348956,
        	 Val acc: 0.9643387815750372,
            
training time
eval time
epoch [3/3]
        	 training loss: 1.8647686400818202,
        	 validation loss: 1.2550570043642437,
        	 Training Acc: 0.9805781966377993,
        	 Val acc: 0.974739970282318,
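
The loss values printed above are sums over batches of (batch-mean loss divided by batch size), so their magnitude depends on the number of batches and they are not directly comparable between the training and validation sets. One common alternative, sketched here with made-up per-batch numbers, is a sample-weighted average that gives the mean loss per example:

```python
# Made-up (batch_mean_loss, batch_size) pairs; the last batch is smaller.
batches = [(0.50, 32), (0.30, 32), (0.10, 16)]

# Weight each batch mean by its size, then divide by the total sample count.
total_loss = sum(mean_loss * size for mean_loss, size in batches)
n_samples = sum(size for _, size in batches)
mean_loss_per_sample = total_loss / n_samples

print(round(mean_loss_per_sample, 4))  # 0.34
```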
            


# Saving the model 

In [11]:
torch.save(lstm.state_dict(), './saved model/lstm_model.pt')

# Load from here for testing the LSTM

# Loading the model

In [16]:
lstm = BiDirectionalLSTM().to(device=device)
lstm.load_state_dict(torch.load('./saved model/lstm_model.pt', weights_only=True))
lstm.eval()

BiDirectionalLSTM(
  (LSTM): LSTM(768, 256, num_layers=6, batch_first=True, dropout=0.5, bidirectional=True)
  (fc): Sequential(
    (0): Linear(in_features=512, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=2, bias=True)
  )
)