    This is a Kaggle notebook run on the GPU T4 x2 accelerator with Internet access disabled

# **LLM - Detect AI Generated Text**

- This is a **Kaggle competition** hosted by Vanderbilt University and The Learning Agency Lab (October 31, 2023 to January 23, 2024). We developed a machine learning model to accurately **differentiate essays written by students from those generated by large language models**, addressing concerns about plagiarism and the impact of LLMs on education.

- **Dataset Details:** The competition dataset comprises about **10,000 essays**, some written by students and some generated by a variety of large language models (LLMs). The goal of the competition is to determine whether or not an essay was generated by an LLM.
    - All of the essays were written in response to one of **seven essay prompts**.
    - **Training Data:** Essays from **two of the prompts** compose the training set; the remaining essays compose the hidden test set. Nearly all of the training set essays were written by students, with only a few generated essays given as examples. You may wish to generate more essays to use as training data.
    - **Testing Data:** The data in `test_essays.csv` is only dummy data. There are about **9,000 essays** in the hidden test set, both student-written and LLM-generated.

- **Scores (ROC AUC, the competition metric):**
    - **Public Score:**  0.963797 (Calculated on only 46% of test data)
    - **Private Score:** 0.908580 (Calculated on 54% of test data)

- **Leaderboard Details**
    - We secured a **Silver medal** in this competition, **ranking 125th out of 4,359** participants **globally**, which places us in the **Top 3%** of teams.

- **Contributors:**
    1. Yash Shrivastava (Team Leader)
    2. Devendra Virani
    3. Toshan Gupta
    4. Smit Shah
    5. Garv Gupta
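
The public and private scores above are read here as ROC AUC values, which is how the competition scores submissions. As a rough illustration of what that number measures, the sketch below computes ROC AUC as the fraction of (positive, negative) pairs that are ranked correctly. The function name `roc_auc` and the toy labels and scores are assumptions made for this example, not part of the notebook.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. This is a toy O(n^2) version for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: label 1 = LLM-generated, label 0 = student-written (hypothetical values)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

Because only the ranking of the scores matters, the ensemble's predicted probabilities do not need to be calibrated to score well on this metric.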


    Importing the required Libraries

In [1]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Importing the libraries
import sys
import gc
import pandas as pd
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
#<------------------------------------------------------------------------------------------------------------------------------------------>
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)
from datasets import Dataset
from tqdm.auto import tqdm
from transformers import PreTrainedTokenizerFast

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
#<------------------------------------------------------------------------------------------------------------------------------------------>



# Data Loading and Preprocessing for Text Classification

1. **Data Loading**: We load the data from the following CSV files:
    - `train_v2_drcat_02.csv` for Training data
    - `test_essays.csv` for Testing data
    - `sample_submission.csv` for sample submission format
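2. **Data Cleaning**:
    - Duplicate rows in the training data are removed based on the `text` column.
    - Three prompts that are not included in the testing data (`Distance learning`, `Grades for extracurricular activities`, `Summer projects`) are excluded from the training set.
3. **Preprocessing**: Tokenizer settings are defined: lowercase conversion is disabled (`LOWERCASE = False`) and the maximum vocabulary size is set (`VOCAB_SIZE = 14000000`).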
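
The cleaning steps above can be sketched on a toy DataFrame. The rows below are invented for illustration; only the column names and the excluded prompt names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical rows)
toy = pd.DataFrame({
    "text": ["essay A", "essay A", "essay B", "essay C", "essay D"],
    "label": [0, 0, 1, 0, 1],
    "prompt_name": [
        "Car-free cities",
        "Car-free cities",
        "Distance learning",
        "Driverless cars",
        "Summer projects",
    ],
})

excluded = ["Distance learning", "Grades for extracurricular activities", "Summer projects"]

cleaned = (
    toy.drop_duplicates(subset=["text"])                      # keep the first copy of each essay
       .loc[lambda df: ~df["prompt_name"].isin(excluded)]     # drop essays from excluded prompts
       .reset_index(drop=True)
)
print(cleaned)
```

Resetting the index after filtering keeps row positions aligned with the label array that is extracted later.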

In [2]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Reading data
test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')   #Testing data
sub = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv') #Predicted data
train = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv", sep=',') #Training data
#<------------------------------------------------------------------------------------------------------------------------------------------>

In [3]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
print("Length of training data:",len(train))
print("Training data:")
train
#<------------------------------------------------------------------------------------------------------------------------------------------>

Length of training data: 44868
Training data:


| | text | label | prompt_name | source | RDizzl3_seven |
|---|---|---|---|---|---|
| 0 | Phones\n\nModern humans today are always on th... | 0 | Phones and driving | persuade_corpus | False |
| 1 | This essay will explain if drivers should or s... | 0 | Phones and driving | persuade_corpus | False |
| 2 | Driving while the use of cellular devices\n\nT... | 0 | Phones and driving | persuade_corpus | False |
| 3 | Phones & Driving\n\nDrivers should not be able... | 0 | Phones and driving | persuade_corpus | False |
| 4 | Cell Phone Operation While Driving\n\nThe abil... | 0 | Phones and driving | persuade_corpus | False |
| ... | ... | ... | ... | ... | ... |
| 44863 | Dear Senator,\n\nI am writing to you today to ... | 1 | Does the electoral college work? | kingki19_palm | True |
| 44864 | Dear Senator,\n\nI am writing to you today to ... | 1 | Does the electoral college work? | kingki19_palm | True |
| 44865 | Dear Senator,\n\nI am writing to you today to ... | 1 | Does the electoral college work? | kingki19_palm | True |
| 44866 | Dear Senator,\n\nI am writing to you today to ... | 1 | Does the electoral college work? | kingki19_palm | True |


In [4]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
print("Sample Testing Data:")
test
#<------------------------------------------------------------------------------------------------------------------------------------------>

Sample Testing Data:


| | id | prompt_id | text |
|---|---|---|---|
| 0 | 0000aaaa | 2 | Aaa bbb ccc. |
| 1 | 1111bbbb | 3 | Bbb ccc ddd. |
| 2 | 2222cccc | 4 | CCC ddd eee. |


In [5]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
print("Submission Format:")
sub
#<------------------------------------------------------------------------------------------------------------------------------------------>

Submission Format:


| | id | generated |
|---|---|---|
| 0 | 0000aaaa | 0.1 |
| 1 | 1111bbbb | 0.9 |
| 2 | 2222cccc | 0.4 |


In [6]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Deleting duplicate rows
train = train.drop_duplicates(subset=['text'])
train.reset_index(drop=True, inplace=True)
print("Initial length of training data:",len(train))
print("Initial Prompts:")
set(train["prompt_name"])
#<------------------------------------------------------------------------------------------------------------------------------------------>

Initial length of training data: 44868
Initial Prompts:


{'"A Cowboy Who Rode the Waves"',
 'Car-free cities',
 'Cell phones at school',
 'Community service',
 'Distance learning',
 'Does the electoral college work?',
 'Driverless cars',
 'Exploring Venus',
 'Facial action coding system',
 'Grades for extracurricular activities',
 'Mandatory extracurricular activities',
 'Phones and driving',
 'Seeking multiple opinions',
 'Summer projects',
 'The Face on Mars'}

In [7]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Removing a few prompts (these prompts are not included in the testing data)
excluded_prompt_name_list = ['Distance learning','Grades for extracurricular activities','Summer projects']
train = train[~(train['prompt_name'].isin(excluded_prompt_name_list))]
train = train.drop_duplicates(subset=['text'])
train.reset_index(drop=True, inplace=True)
print("Final length of training data:",len(train))
print("Final prompts:")
set(train["prompt_name"])
#<------------------------------------------------------------------------------------------------------------------------------------------>

Final length of training data: 34497
Final prompts:


{'"A Cowboy Who Rode the Waves"',
 'Car-free cities',
 'Cell phones at school',
 'Community service',
 'Does the electoral college work?',
 'Driverless cars',
 'Exploring Venus',
 'Facial action coding system',
 'Mandatory extracurricular activities',
 'Phones and driving',
 'Seeking multiple opinions',
 'The Face on Mars'}

In [8]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Disabling lowercasing and setting the maximum vocabulary size
LOWERCASE = False
VOCAB_SIZE = 14000000
#<------------------------------------------------------------------------------------------------------------------------------------------>

# Creating Byte-Pair Encoding Tokenizer with Hugging Face

1. Initialization of the raw tokenizer with a BPE model and an `[UNK]` unknown token.
2. Addition of normalization and pre-tokenization (NFC normalization, optional lowercasing, and byte-level pre-tokenization).
3. Creation of a `BpeTrainer` instance with the vocabulary size and the special tokens.
4. Creation of a Hugging Face `Dataset` object from the `text` column of the test DataFrame.
5. Training of the tokenizer on that dataset in batches of 1,000 texts using a custom corpus iterator.
6. Initialization of a `PreTrainedTokenizerFast` object using the trained raw tokenizer.
7. Tokenization of the train and test text data using the tokenizer.
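
To give a feel for what BPE training does, here is a toy, pure-Python sketch of the merge-learning loop. It is not the `tokenizers` implementation used in this notebook (no byte-level handling, no special tokens), and the helper names `train_bpe`, `apply_merge`, and `encode` are made up for this example. It also shows why characters never seen during training come out as `[UNK]`, which is consistent with what happens when the tokenizer is trained only on the three dummy test essays.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy version)."""
    # Each whitespace-separated word starts as a tuple of single characters.
    words = Counter(tuple(word) for text in corpus for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges, alphabet, unk="[UNK]"):
    """Split `word` into characters, map unseen characters to `unk`, then apply merges in order."""
    symbols = tuple(ch if ch in alphabet else unk for ch in word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


corpus = ["low lower lowest", "low low"]
merges = train_bpe(corpus, num_merges=2)
alphabet = {ch for text in corpus for ch in text if not ch.isspace()}
print(merges)
print(encode("lowly", merges, alphabet))  # 'y' was never seen, so it becomes [UNK]
```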

In [9]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
# Creating Byte-Pair Encoding tokenizer
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
#<------------------------------------------------------------------------------------------------------------------------------------------>
# Adding normalization and pre_tokenizer
# Parentheses ensure NFC is always applied; Lowercase is added only when LOWERCASE is True
raw_tokenizer.normalizer = normalizers.Sequence([normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else []))
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
#<------------------------------------------------------------------------------------------------------------------------------------------>
# Adding special tokens and creating trainer instance
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)
#<------------------------------------------------------------------------------------------------------------------------------------------>
# Creating Hugging Face dataset object from the test essays (the tokenizer is trained on the test text)
dataset = Dataset.from_pandas(test[['text']])
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
#<------------------------------------------------------------------------------------------------------------------------------------------>
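
A note on the normalizer list built in the cell above: it combines `+` with a conditional expression, and in Python the conditional expression binds more loosely than `+`. The small sketch below uses plain strings as stand-ins for the normalizer objects (an assumption made to keep it dependency-free) to show how the result depends on parentheses.

```python
LOWERCASE = False

# Without parentheses this parses as (["NFC"] + ["Lowercase"]) if LOWERCASE else []
steps_unparenthesized = ["NFC"] + ["Lowercase"] if LOWERCASE else []

# With parentheses only the optional part is conditional
steps_parenthesized = ["NFC"] + (["Lowercase"] if LOWERCASE else [])

print(steps_unparenthesized)  # []
print(steps_parenthesized)    # ['NFC']
```

With `LOWERCASE = False`, the unparenthesized form yields an empty list, so no normalization step at all would be applied.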






    Converting training data to Byte Pair Encodings

In [11]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Create BPE for training data
tokenized_texts_train = []
for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))
print("Training tokens for 1st essay:")
print(tokenized_texts_train[0])
#<------------------------------------------------------------------------------------------------------------------------------------------>


Training tokens for 1st essay:
['Ġ', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 'e', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 'd', 'e', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', '[UNK]', 'a', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'd', 'a', '[UNK]', 'Ġ', 'a', '[UNK]', 'e', 'Ġ', 'a', '[UNK]', '[UNK]', 'a', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'e', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 'e', '.', 'Ġ', '[UNK]', '[UNK]', 'e', '[UNK]', 'Ġ', 'a', '[UNK]', 'e', 'Ġ', 'a', '[UNK]', '[UNK]', 'a', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'e', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 'e', 'Ġ', '[UNK]', '[UNK]', '[UNK]', 'e', 'Ġ', '[UNK]', '[UNK]', 'a', '[UNK]', 'Ġ', '[UNK]', 'Ġ', '[UNK]', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 'Ġ', 'a', 'Ġ', 'd', 'a', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', '[UNK]', '[UNK]', 'Ġ', '.', 'A', '[UNK]', '[UNK]', 'Ġ', '[UNK]', '[UNK]', 'e', '[UNK]', 'Ġ', 'd', '[UNK]', 'Ġ', 

    Converting testing data to Byte Pair Encodings

In [12]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Creating BPE for testing data
tokenized_texts_test = []
for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))
print("Testing tokens for 1st essay:")
print(tokenized_texts_test[0])
#<------------------------------------------------------------------------------------------------------------------------------------------>


Testing tokens for 1st essay:
['ĠAaa', 'Ġbbb', 'Ġccc', '.']
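
The `Ġ` at the start of the tokens above stands for a space. The sketch below rebuilds the byte-to-character table from GPT-2's byte-level BPE, assuming the `ByteLevel` pre-tokenizer follows the same mapping and adds a prefix space by default. The function names are chosen for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style table)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, control bytes) are shifted past 255
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def to_byte_level(text):
    """Render text the way byte-level tokens display it."""
    table = bytes_to_unicode()
    return "".join(table[b] for b in text.encode("utf-8"))


mapping = bytes_to_unicode()
print(mapping[ord(" ")])      # the character used for a space
print(to_byte_level(" Aaa"))  # a leading space followed by the word
```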


# TF-IDF Vectorization with Custom Vocabulary

1. **Function Definition**: A function `dummy` is defined that returns its input unchanged, so the vectorizer can consume texts that are already tokenized.
2. **Vectorizer Initialization**: The `TfidfVectorizer` is initialized with `ngram_range=(3, 5)`, `lowercase=False`, `sublinear_tf=True`, and `dummy` as both the tokenizer and the preprocessor.
3. **Fitting and Vocabulary Extraction**: The vectorizer is fitted on the tokenized texts from the test dataset, and the resulting vocabulary is extracted.
4. **Vectorization with Custom Vocabulary**: Another `TfidfVectorizer` is initialized with the vocabulary extracted in the previous step, ensuring that the same vocabulary is used for both training and test data vectorization.
5. **Vectorization of Texts**: The second vectorizer is fitted on the training texts (which learns the IDF weights) and transforms both the training and test datasets into TF-IDF representations.
6. **Memory Management**: The vectorizer object is deleted, and garbage collection is triggered to manage memory usage.
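
Steps 1 to 5 can be sketched on toy token lists. The token lists and the name `identity` are assumptions made for this example; the vectorizer parameters mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Return pre-tokenized input unchanged."""
    return tokens


# Toy pre-tokenized documents (hypothetical)
test_tokens = [["a", "b", "c", "d"], ["b", "c", "d", "e"]]
train_tokens = [["a", "b", "c", "x"], ["x", "y", "z", "w"], ["b", "c", "d", "e"]]

# First pass: learn the 3- to 5-gram vocabulary from the test documents only
vocab_vectorizer = TfidfVectorizer(
    ngram_range=(3, 5), lowercase=False, sublinear_tf=True, analyzer="word",
    tokenizer=identity, preprocessor=identity, token_pattern=None,
)
vocab_vectorizer.fit(test_tokens)
vocab = vocab_vectorizer.vocabulary_

# Second pass: fix the vocabulary, learn IDF weights on train, then transform both sets
vectorizer = TfidfVectorizer(
    ngram_range=(3, 5), lowercase=False, sublinear_tf=True, vocabulary=vocab,
    analyzer="word", tokenizer=identity, preprocessor=identity, token_pattern=None,
)
tf_train_toy = vectorizer.fit_transform(train_tokens)
tf_test_toy = vectorizer.transform(test_tokens)

print(sorted(vocab))
print(tf_train_toy.shape, tf_test_toy.shape)
```

Restricting the vocabulary to n-grams that occur in the test texts keeps the feature matrix small: training documents that share no n-gram with the test set, like the second toy document, become all-zero rows.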

In [13]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
def dummy(text):
    return text
#<------------------------------------------------------------------------------------------------------------------------------------------>

In [14]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, analyzer = 'word',
    tokenizer = dummy,
    preprocessor = dummy,
    token_pattern = None, strip_accents='unicode')
#<------------------------------------------------------------------------------------------------------------------------------------------>
vectorizer.fit(tokenized_texts_test)
# Getting vocab
vocab = vectorizer.vocabulary_
print(vocab)
vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, vocabulary=vocab,
                            analyzer = 'word',
                            tokenizer = dummy,
                            preprocessor = dummy,
                            token_pattern = None, strip_accents='unicode'
                            )
#<------------------------------------------------------------------------------------------------------------------------------------------>

{'ĠAaa Ġbbb Ġccc': 0, 'Ġbbb Ġccc .': 6, 'ĠAaa Ġbbb Ġccc .': 1, 'ĠBbb Ġccc Ġddd': 2, 'Ġccc Ġddd .': 7, 'ĠBbb Ġccc Ġddd .': 3, 'ĠCCC Ġddd Ġeee': 4, 'Ġddd Ġeee .': 8, 'ĠCCC Ġddd Ġeee .': 5}


In [15]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Creating training and testing tf-idf representation
tf_train = vectorizer.fit_transform(tokenized_texts_train)
tf_test = vectorizer.transform(tokenized_texts_test)
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Managing memory
del vectorizer
gc.collect()
#<------------------------------------------------------------------------------------------------------------------------------------------>

44

    Training labels

In [16]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
y_train = train['label'].values
print("Training labels:")
print(y_train)
#<------------------------------------------------------------------------------------------------------------------------------------------>

Training labels:
[0 0 0 ... 1 1 1]


# Ensemble Model Training and Prediction

1. **Model Definition**: The `get_model()` function defines a soft-voting ensemble of four classifiers:
    - Multinomial Naive Bayes
    - Stochastic Gradient Descent classifier (modified Huber loss)
    - LightGBM
    - CatBoost
2. **Initialization**: Each base classifier is initialized with specific parameters, and the ensemble assigns the weights 0.055, 0.31, 0.23, and 0.80 to them, in the order listed above.
3. **Model Training**: If the test dataset contains more than 5 texts, the ensemble model is trained on the TF-IDF transformed training data (`tf_train`) along with the labels (`y_train`). Otherwise (only the dummy test data is present), training is skipped.
4. **Prediction**: The trained model predicts the probability of the LLM-generated class for the test dataset (`tf_test`), and the predictions are stored in the `generated` column of the submission dataframe (`sub`).
5. **Saving Predictions**: The submission dataframe is saved to a CSV file named `submission.csv`; when training is skipped, the sample submission is saved unchanged.
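
To illustrate how soft voting combines the base models, the sketch below fits a two-model `VotingClassifier` on toy data and checks that its output equals the weighted average of the individual models' probabilities. The toy features, the two-model setup, and the weights `[0.25, 0.75]` are assumptions made to keep the example small; LightGBM and CatBoost are left out.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative features (MultinomialNB requires non-negative values)
X = np.array(
    [[3, 0, 1], [2, 0, 0], [4, 1, 0], [0, 3, 2], [0, 2, 4], [1, 4, 3]],
    dtype=float,
)
y = np.array([0, 0, 0, 1, 1, 1])

weights = [0.25, 0.75]
ensemble = VotingClassifier(
    estimators=[
        ("mnb", MultinomialNB(alpha=0.0225)),
        ("sgd", SGDClassifier(loss="modified_huber", max_iter=9000, tol=1e-4, random_state=6743)),
    ],
    weights=weights,
    voting="soft",
)
ensemble.fit(X, y)

# Soft voting: weighted average of each fitted model's class probabilities
manual = np.average(
    [est.predict_proba(X) for est in ensemble.estimators_], axis=0, weights=weights
)
ensemble_proba = ensemble.predict_proba(X)
print(np.allclose(ensemble_proba, manual))
```

Since `np.average` divides by the sum of the weights, the weights do not need to add up to 1, which is why a set such as 0.055, 0.31, 0.23, and 0.80 still yields valid probabilities.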

In [17]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Importing libraries
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Getting model
def get_model():
    clf = MultinomialNB(alpha=0.0225)

    sgd_model = SGDClassifier(max_iter=9000, tol=1e-4, loss="modified_huber", random_state=6743)
    p6={'n_iter': 3000,'verbose': -1,'objective': 'cross_entropy','metric': 'auc',
        'learning_rate': 0.00581909898961407, 'colsample_bytree': 0.78,
        'colsample_bynode': 0.8,
       }
    p6["random_state"] = 6743

    lgb=LGBMClassifier(**p6)

    cat=CatBoostClassifier(iterations=3000,
                           verbose=0,
                           random_seed=6543,
                           learning_rate=0.005599066836106983,
                           subsample = 0.35,
                           allow_const_label=True,loss_function = 'CrossEntropy')
    
    #Setting weights
    weights = [0.055,0.31,0.23,0.80]
 
    ensemble = VotingClassifier(estimators=[('mnb',clf),
                                            ('sgd', sgd_model),
                                            ('lgb',lgb), 
                                            ('cat', cat)
                                           ],
                                weights=weights, voting='soft', n_jobs=-1)
    return ensemble
#<------------------------------------------------------------------------------------------------------------------------------------------>

In [18]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
model = get_model()
print(model)
#<------------------------------------------------------------------------------------------------------------------------------------------>

VotingClassifier(estimators=[('mnb', MultinomialNB(alpha=0.0225)),
                             ('sgd',
                              SGDClassifier(loss='modified_huber',
                                            max_iter=9000, random_state=6743,
                                            tol=0.0001)),
                             ('lgb',
                              LGBMClassifier(colsample_bynode=0.8,
                                             colsample_bytree=0.78,
                                             learning_rate=0.00581909898961407,
                                             metric='auc', n_iter=3000,
                                             objective='cross_entropy',
                                             random_state=6743, verbose=-1)),
                             ('cat',
                              <catboost.core.CatBoostClassifier object at 0x78c7add0f250>)],
                 n_jobs=-1, voting='soft', weights=[0.055, 0.31, 0.23, 0.8])


In [19]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
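if len(test.text.values) <= 5:
    #Only the dummy test data is present: save the sample submission unchanged
    sub.to_csv('submission.csv', index=False)
else:
    #Hidden test data is present: fit the ensemble and predict P(generated)
    model.fit(tf_train, y_train)
    final_preds = model.predict_proba(tf_test)[:,1]
    sub['generated'] = final_preds
    sub.to_csv('submission.csv', index=False)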
#<------------------------------------------------------------------------------------------------------------------------------------------>

In [20]:
#<------------------------------------------------------------------------------------------------------------------------------------------>
#Submission file
sub
#<------------------------------------------------------------------------------------------------------------------------------------------>

| | id | generated |
|---|---|---|
| 0 | 0000aaaa | 0.1 |
| 1 | 1111bbbb | 0.9 |
| 2 | 2222cccc | 0.4 |
