# LLM AI Text Detection: TF-IDF Method

* This notebook develops a model that distinguishes essays written by middle and high school students from essays generated by large language models (LLMs). The increasing use of LLMs has sparked a debate about their role in academic and creative fields. In education in particular, there is growing concern about their impact on student learning. While some educators see LLMs as potential tools for enhancing students' writing abilities, others worry about their negative effects, especially on skill development.

* A primary academic concern surrounding LLMs is their potential to facilitate plagiarism. These models, trained on extensive datasets of text and code, are adept at creating text remarkably similar to human writing. This capability raises the possibility of students using LLMs to produce essays, thereby bypassing essential learning experiences. Such actions not only undermine educational integrity but also impede the development of critical thinking and writing skills.

* The goal of this notebook is to build a classification model that does not itself rely on LLMs or deep learning models to classify text. By identifying characteristics unique to LLM-generated text, we learn how to distinguish who or what wrote a particular essay. Our approach uses texts of moderate length covering various topics, which reflect typical scenarios where LLM-generated content might be used. The challenge also includes dealing with texts produced by multiple, unknown generative models. The aim is to encourage detection methods that remain effective across different LLMs, contributing to a broader understanding of this emerging field.

In [1]:
import sys
import gc

import pandas as pd
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    SentencePieceBPETokenizer
)

from datasets import Dataset
from tqdm.auto import tqdm
from transformers import PreTrainedTokenizerFast

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

from tqdm.notebook import tqdm_notebook
from joblib import dump
tqdm_notebook.pandas()


In [2]:
from sklearn.model_selection import train_test_split

In [3]:
import os
import pandas as pd

# Outside a competition rerun, write the sample submission and stop early.
if not os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    submit_df = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')
    submit_df.to_csv('submission.csv', index=False)
    sys.exit()

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [None]:
test_df = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
submit_df = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')
train_df = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv", sep=',')

In [None]:
train_df = train_df.drop_duplicates(subset=['text'])
train_df.reset_index(drop=True, inplace=True)

In [None]:
IS_LOWERCASE = False
MAX_VOCAB = 42000

# Building Byte-Pair Encoding Tokenizer

* In this section of the code, we set up a Byte-Pair Encoding (BPE) tokenizer, an advanced technique widely used in natural language processing for breaking down text into subword units. This approach is particularly effective in handling out-of-vocabulary words, as it splits them into recognizable subword segments.

* The configuration of the tokenizer includes several specialized tokens, specifically [UNK], [PAD], [CLS], [SEP], and [MASK], each serving a unique purpose in the tokenization process. For instance, [UNK] represents unknown words, while [PAD] is used for padding sequences to a uniform length.
 
* Before BPE is applied, the text passes through a normalization step (with optional lowercasing controlled by `IS_LOWERCASE`) and a byte-level pre-tokenization step. These steps ensure that the text is consistently formatted before subword segmentation.
 
* The tokenizer is trained from an iterator that yields the test essays in batches of 1,000, so its vocabulary adapts to the linguistic characteristics of the text the model will be scored on. Once trained, the tokenizer is wrapped in a `PreTrainedTokenizerFast` object, which exposes the standard `transformers` tokenizer interface (for example, the `tokenize` method used below).
 
* The final stage involves applying this tokenizer to both the test and training datasets. This step ensures that the text in these datasets is broken down into a format suitable for subsequent processing, maintaining consistency in tokenization across both sets of data. By doing so, we lay the groundwork for more effective and uniform analysis and modeling of the text data in later stages.
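
To make the merge-learning idea behind BPE concrete, here is a minimal pure-Python sketch. It is a simplification for illustration only: it works on characters rather than bytes, ignores special tokens, and is not how the `tokenizers` library is implemented. The function name `learn_bpe_merges` and the toy word list are hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy data, not taken from the competition dataset.
toy_words = ["low", "low", "lower", "lowest", "newer", "newest"]
toy_merges, toy_corpus = learn_bpe_merges(toy_words, num_merges=3)
print(toy_merges)
print(list(toy_corpus))
```

Frequent character sequences end up as single symbols, while rare words stay split into smaller pieces. This is why an unseen word can still be represented as a sequence of known subword units.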

In [None]:
# Creating Byte-Pair Encoding tokenizer
bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Parenthesize the conditional so NFC is always applied and lowercasing stays optional.
bpe_tokenizer.normalizer = normalizers.Sequence([normalizers.NFC()] + ([normalizers.Lowercase()] if IS_LOWERCASE else []))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
bpe_special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
bpe_trainer = trainers.BpeTrainer(vocab_size=MAX_VOCAB, special_tokens=bpe_special_tokens)
bpe_dataset = Dataset.from_pandas(test_df[['text']])

def bpe_corpus_iter(): 
    for idx in range(0, len(bpe_dataset), 1000):
        yield bpe_dataset[idx : idx + 1000]["text"]
bpe_tokenizer.train_from_iterator(bpe_corpus_iter(), trainer=bpe_trainer)
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=bpe_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
test_texts_tokenized = []

for text_item in tqdm(test_df['text'].tolist()):
    test_texts_tokenized.append(fast_tokenizer.tokenize(text_item))

train_texts_tokenized = []

for text_item in tqdm(train_df['text'].tolist()):
    train_texts_tokenized.append(fast_tokenizer.tokenize(text_item))

# TF-IDF Vectorization

* In this segment of the code, we're dealing with the implementation of the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, a crucial step in text processing that emphasizes the importance of words in a corpus. The technique involves calculating the frequency of a word in a document relative to its frequency across all documents, helping to highlight words that are unique to certain documents.

* The TF-IDF vectorizer is configured with `ngram_range=(3, 5)`, so its features are sequences of three to five consecutive tokens; single tokens and pairs are excluded. Because the input is already BPE-tokenized, these n-grams are built from subword tokens rather than whole words. The setup also enables sublinear term-frequency scaling, which replaces a raw count with 1 + log(count) to dampen the influence of very frequent terms. `strip_accents='unicode'` is passed as well, although scikit-learn skips its built-in preprocessing (accent stripping and lowercasing) when a custom `preprocessor` is supplied.

* To keep the BPE tokens intact, the `tokenizer` and `preprocessor` arguments are both set to an identity function (`no_process`). This overrides the default string handling, so each document, already a list of tokens, is passed through unchanged.

* After configuring the vectorizer, it is trained (or "fitted") using the already tokenized test data. This step involves the vectorizer learning the vocabulary of the test set, which includes identifying the most relevant terms and phrases as per the TF-IDF criteria.

* This vocabulary is then used to initialize a new TF-IDF vectorizer, restricting its features to n-grams that occur in the test set. The new vectorizer is fitted on the training data, which computes the IDF weights, and then transforms both the training and test sets into sparse numerical matrices.
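
The weighting described above can be sketched in pure Python. The sketch assumes scikit-learn's default settings on top of `sublinear_tf=True`, namely smoothed IDF and L2 normalization. The helper names `token_ngrams` and `sublinear_tfidf` and the toy documents are hypothetical, and plain word tokens stand in for BPE tokens.

```python
import math
from collections import Counter


def token_ngrams(tokens, min_n=3, max_n=5):
    """Join every run of min_n..max_n consecutive tokens into one feature."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def sublinear_tfidf(docs_tokens, min_n=3, max_n=5):
    """Return one {ngram: weight} dict per document, L2-normalized."""
    doc_counts = [Counter(token_ngrams(t, min_n, max_n)) for t in docs_tokens]
    n_docs = len(doc_counts)
    # Document frequency: in how many documents each n-gram appears.
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in doc_counts:
        weights = {}
        for gram, count in counts.items():
            tf = 1.0 + math.log(count)  # sublinear term frequency
            idf = math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0  # smoothed IDF
            weights[gram] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors


# Hypothetical pre-tokenized documents.
toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "down"],
]
toy_vectors = sublinear_tfidf(toy_docs)
print(toy_vectors[0])
```

In the first toy document, the n-gram `the cat sat` also occurs in the second document, so it receives a lower weight than `cat sat on`, which occurs only once in the corpus.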

In [None]:
def no_process(text):
    return text
tfidf_vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, analyzer = 'word',
    tokenizer = no_process,
    preprocessor = no_process,
    token_pattern = None, strip_accents='unicode')

tfidf_vectorizer.fit(test_texts_tokenized)

# Getting vocab
tfidf_vocab = tfidf_vectorizer.vocabulary_

print(tfidf_vocab)

tfidf_vectorizer = TfidfVectorizer(ngram_range=(3, 5), lowercase=False, sublinear_tf=True, vocabulary=tfidf_vocab,
                            analyzer = 'word',
                            tokenizer = no_process,
                            preprocessor = no_process,
                            token_pattern = None, strip_accents='unicode'
                            )

tf_train_data = tfidf_vectorizer.fit_transform(train_texts_tokenized)
tf_test_data = tfidf_vectorizer.transform(test_texts_tokenized)

del tfidf_vectorizer
gc.collect()

In [None]:
train_labels = train_df['label'].values

# Ensemble Learning with Multiple Classifiers
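
The ensemble in this section combines four classifiers with soft voting, which takes a weighted average of the positive-class probabilities that each model predicts. The sketch below shows only that arithmetic. The helper name `soft_vote` and the probabilities are made up for illustration, and only the weights match the ones used in this notebook.

```python
def soft_vote(probas_per_model, weights):
    """Weighted average of per-model probabilities for each sample."""
    total = sum(weights)
    n_samples = len(probas_per_model[0])
    return [
        sum(w * probas[i] for w, probas in zip(weights, probas_per_model)) / total
        for i in range(n_samples)
    ]


# Hypothetical positive-class probabilities for two essays, one row per model.
toy_probas = [
    [0.90, 0.20],  # stands in for MultinomialNB
    [0.60, 0.40],  # stands in for SGDClassifier
    [0.80, 0.10],  # stands in for LGBMClassifier
    [0.70, 0.30],  # stands in for CatBoostClassifier
]
toy_weights = [0.07, 0.31, 0.31, 0.31]
blended = soft_vote(toy_probas, toy_weights)
print(blended)
```

With these weights, the Naive Bayes model contributes only 7% of the final score, while the other three models share the rest equally.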

In [None]:
nb_classifier = MultinomialNB(alpha=0.02)
sgd_clf = SGDClassifier(max_iter=9000, tol=1e-4, loss="modified_huber") 
lgb_params = {'n_iter': 3000, 'verbose': -1, 'objective': 'cross_entropy', 'metric': 'auc',
              'learning_rate': 0.00581909898961407, 'colsample_bytree': 0.78,
              'colsample_bynode': 0.8, 'lambda_l1': 4.562963348932286, 
              'lambda_l2': 2.97485, 'min_data_in_leaf': 115, 'max_depth': 23, 'max_bin': 898}
lgb_model = LGBMClassifier(**lgb_params)
cat_model = CatBoostClassifier(iterations=3000,
                       verbose=0,
                       l2_leaf_reg=6.6591278779517808,
                       learning_rate=0.005599066836106983,
                       subsample = 0.4,
                       loss_function = 'CrossEntropy')
model_weights = [0.07, 0.31, 0.31, 0.31]

voting_ensemble = VotingClassifier(estimators=[('mnb', nb_classifier),
                                               ('sgd', sgd_clf),
                                               ('lgb', lgb_model), 
                                               ('cat', cat_model)
                                              ],
                                   weights=model_weights, voting='soft', n_jobs=-1)

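if len(test_df.text.values) <= 5:
    # Tiny public test set: skip training and keep the sample submission scores.
    submit_df.to_csv('submission.csv', index=False)
    final_bpe_predictions = submit_df['generated'].values
else:
    print('Big Apple')
    voting_ensemble.fit(tf_train_data, train_labels)
    gc.collect()
    # Probability of the positive class (label 1) for each test essay.
    final_bpe_predictions = voting_ensemble.predict_proba(tf_test_data)[:, 1]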

del test_texts_tokenized, train_texts_tokenized, bpe_dataset, bpe_tokenizer, fast_tokenizer
_ = gc.collect()

In [None]:
submit_df['generated'] = final_bpe_predictions
submit_df.to_csv('submission.csv', index=False)
submit_df