**Overview:**<br>
In recent years, large language models (LLMs) have become increasingly sophisticated, capable of generating text that is difficult to distinguish from human-written text. In this competition, we hope to foster open research and transparency on AI detection techniques applicable in the real world.
<br>
This competition challenges participants to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

* I am using the Kaggle environment for training and testing.
* The data is therefore loaded from the Kaggle input directory.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v3.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v5.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v7.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v6.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v2.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v1.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v7_15_percent_corruption.csv
/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v4.csv
/kaggle/input/daigt-proper-train-dataset/train_drcat_03.csv
/kaggle/input/daigt-proper-train-dataset/train_drcat_02.csv
/kaggle/input/daigt-proper-train-dataset/train_drcat_04.csv
/kaggle/input/daigt-proper-train-dataset/train_drcat_01.csv
/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv
/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv
/kaggle/input/llm-detect-ai-gener

* Below are the datasets provided with the competition.
* train_essays is highly imbalanced (1,375 human-written essays against 3 AI-generated ones), and the actual test data is hidden: the visible test_essays file contains only 3 rows.

In [None]:
data_path="/kaggle/input/llm-detect-ai-generated-text/"
sample_submission=pd.read_csv(data_path+'sample_submission.csv')
train_prompts=pd.read_csv(data_path+'train_prompts.csv')
test_essays =pd.read_csv(data_path+"test_essays.csv")
train_essays = pd.read_csv(data_path+"train_essays.csv")

In [None]:
print(sample_submission.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         3 non-null      object 
 1   generated  3 non-null      float64
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes
None


In [None]:
print(train_prompts.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   prompt_id     2 non-null      int64 
 1   prompt_name   2 non-null      object
 2   instructions  2 non-null      object
 3   source_text   2 non-null      object
dtypes: int64(1), object(3)
memory usage: 192.0+ bytes
None


In [None]:
print(train_essays.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378 entries, 0 to 1377
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1378 non-null   object
 1   prompt_id  1378 non-null   int64 
 2   text       1378 non-null   object
 3   generated  1378 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 43.2+ KB
None


In [None]:
print(test_essays.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         3 non-null      object
 1   prompt_id  3 non-null      int64 
 2   text       3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
None


In [None]:
train_essays["generated"].value_counts()

generated
0    1375
1       3
Name: count, dtype: int64

In [None]:
train_essays["prompt_id"].value_counts()

prompt_id
0    708
1    670
Name: count, dtype: int64

* Because of this imbalance, we need to find additional data for the task.
* So I am adding a dataset of 4.9k AI-written essays to train_essays.

In [None]:
oth_dp="/kaggle/input/llm-mistral-7b-instruct-texts/Mistral7B_CME_v7.csv"
dataset1=pd.read_csv(oth_dp)

In [None]:
dataset1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4900 entries, 0 to 4899
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   prompt_id    4900 non-null   int64 
 1   text         4900 non-null   object
 2   prompt_name  4900 non-null   object
 3   generated    4900 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 153.2+ KB


In [None]:
columns_to_delete = ['prompt_id','id']
train_essays.drop(columns=columns_to_delete, inplace=True)
columns_to_delete = ['prompt_id','prompt_name']
dataset1.drop(columns=columns_to_delete, inplace=True)

In [None]:
train_essays=pd.concat([train_essays,dataset1],ignore_index=True)

In [None]:
train_essays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6278 entries, 0 to 6277
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       6278 non-null   object
 1   generated  6278 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 98.2+ KB


In [None]:
train_essays["generated"].value_counts()

generated
1    4903
0    1375
Name: count, dtype: int64

* This is still too little data to train a classification model on.
* So I added another dataset with a filter applied (with the filter I got a considerably higher public score than without it).

In [None]:
oth_dp="/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv"
dataset2=pd.read_csv(oth_dp)
dataset2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44868 entries, 0 to 44867
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   text           44868 non-null  object
 1   label          44868 non-null  int64 
 2   prompt_name    44868 non-null  object
 3   source         44868 non-null  object
 4   RDizzl3_seven  44868 non-null  bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 1.4+ MB


In [None]:
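# Keep only the rows flagged RDizzl3_seven; .copy() avoids a
# SettingWithCopyWarning when dropping columns in place on the filtered frame.
dataset2 = dataset2[dataset2.RDizzl3_seven].copy()
columns_to_delete = ['prompt_name','source','RDizzl3_seven']
dataset2.drop(columns=columns_to_delete, inplace=True)
# Align the label column name with train_essays.
dataset2 = dataset2.rename(columns={"label": "generated"})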

In [None]:
train_essays=pd.concat([train_essays,dataset2],ignore_index=True)

* I tried adding more data for training, but it did not help to increase the score.

In [None]:
# oth_dp="/kaggle/input/daigt-proper-train-dataset/train_drcat_04.csv"
# dataset3=pd.read_csv(oth_dp)
# dataset3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44206 entries, 0 to 44205
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   essay_id  44206 non-null  object
 1   text      44206 non-null  object
 2   label     44206 non-null  int64 
 3   source    44206 non-null  object
 4   prompt    12911 non-null  object
 5   fold      44206 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 2.0+ MB


In [None]:
# columns_to_delete = ['essay_id','source','prompt','fold']
# dataset3.drop(columns=columns_to_delete, inplace=True)
# dataset3=dataset3.rename(columns={
#     "text":"text",
#     "label":"generated"
# })
# dataset3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44206 entries, 0 to 44205
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       44206 non-null  object
 1   generated  44206 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 690.8+ KB


In [None]:
# train_essays=pd.concat([train_essays,dataset3],ignore_index=True)

In [None]:
train_essays=train_essays.sample(frac=1, random_state=56)

* In the end, I was left with a dataset of around 26.7k entries, of which about 58% are human-written essays and the rest are AI-generated.

In [None]:
train_essays.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26728 entries, 21480 to 2532
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       26728 non-null  object
 1   generated  26728 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 626.4+ KB


In [None]:
train_essays["generated"].value_counts()

generated
0    15625
1    11103
Name: count, dtype: int64
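
The class balance at each stage can be checked with a little arithmetic. The sketch below only reuses the counts printed by the `value_counts()` calls in this notebook; the `stages` dictionary and the `human_share` helper are names introduced here for illustration.

```python
# (human, AI) counts copied from the value_counts() outputs in this notebook.
stages = {
    "competition train_essays": (1375, 3),
    "after adding Mistral7B_CME_v7": (1375, 4903),
    "final training set": (15625, 11103),
}

def human_share(human, ai):
    """Fraction of essays labelled 0 (human-written)."""
    return human / (human + ai)

for name, (human, ai) in stages.items():
    print(f"{name}: {human + ai} essays, {human_share(human, ai):.1%} human")
```

The original data is almost entirely human-written, the first added dataset tips the balance towards AI-generated text, and the final mix is much closer to even.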

**Data-Preprocessing**<br>
* To improve model performance, I made the following changes to the data:


1. Converting the text to lower case, so the model does not treat the same word written in a different case as a different word.
2. Removing punctuation and special characters.
3. Removing stopwords (words that are used quite often but contribute little to the meaning of a sentence).
4. Removing numerical characters.
5. Finally, transforming the text into tokens (tokenizing).
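
To make the effect of these steps concrete, here is a minimal sketch that applies them to a single sentence. It is not the notebook's implementation: `TOY_STOPWORDS`, `preprocess` and the sample sentence are made up for illustration, and a plain whitespace split stands in for NLTK's `word_tokenize` so the example runs without any downloads.

```python
import re
import string

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "a", "is", "in", "of", "and", "was"}

def preprocess(text):
    text = text.lower()                                               # 1. lower-case
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. strip punctuation
    words = [w for w in text.split() if w not in TOY_STOPWORDS]       # 3. drop stopwords
    words = [re.sub(r"\d+", "", w) for w in words]                    # 4. strip digits
    return [w for w in words if w]                                    # 5. tokens, empty strings dropped

tokens = preprocess("The Electoral College was created in 1787, and it is STILL debated!")
print(tokens)
# ['electoral', 'college', 'created', 'it', 'still', 'debated']
```

Note that "it" survives only because the toy stopword list is so short; a full English stopword list would remove it as well.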




In [None]:
import string
from tensorflow.keras.preprocessing.text import Tokenizer

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)


def remove_punctuation(text):
    translator = str.maketrans("", "", string.punctuation)
    return text.translate(translator)

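def transform_text(df,column):
    # Lower-case first, then strip punctuation, stopwords and digits.
    df[column]=df[column].str.lower()
    df[column] = df[column].apply(remove_punctuation)
    df[column] = df[column].apply(remove_stopwords)
    # regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default.
    df[column] = df[column].str.replace(r'\d+', '', regex=True)
    # Note: this Keras tokenizer is fitted but not used afterwards;
    # the TF-IDF vectorizer applied later does its own tokenization.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(df[column])
    return df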

* After pre-processing, the processed data is passed to a vectorizer; here I am using the TF-IDF vectorizer from sklearn.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

train_essays=transform_text(train_essays,'text')

x=train_essays.text
y=train_essays.generated
vectorizer = TfidfVectorizer()
vectorizer.fit(x)

x = vectorizer.transform(x)
test_essays=transform_text(test_essays,'text')

test_text=test_essays.text
test_transform = vectorizer.transform(test_text)
test_transform[0]

<1x75567 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

* The output of the TF-IDF vectorizer is a sparse matrix of floating-point numbers.
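
As a rough illustration of where those numbers come from, the sketch below computes TF-IDF by hand on a three-document toy corpus. It assumes the weighting that scikit-learn uses by default (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows); the corpus and the `tfidf_row` helper are invented for this example, and a dict of non-zero entries stands in for one sparse row.

```python
import math
from collections import Counter

docs = ["students write essays", "models write essays", "models generate text"]  # toy corpus

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})          # one column per distinct word
n_docs = len(docs)
doc_freq = Counter(w for doc in tokenized for w in set(doc))   # number of documents containing each word

def tfidf_row(tokens):
    counts = Counter(tokens)
    # term count * smoothed idf
    weights = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
               for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    # Store only non-zero entries: column index -> L2-normalised weight.
    return {vocab.index(w): v / norm for w, v in weights.items()}

rows = [tfidf_row(doc) for doc in tokenized]
print(len(vocab), rows[0])
```

Each row stores only the words that actually occur in that document, which is why a matrix with tens of thousands of columns stays small in memory. Words that appear in fewer documents (here "students") get a larger weight than words shared across documents (here "write").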

**Model implementation**

**Training**<br>
&emsp; Before the mid-evaluation I had implemented a few models (namely: SGD classifier, logistic regression, and Multinomial Naive Bayes).
* Using these models individually, I got a maximum score of 0.81.
* So I tried ensembling them, which increased the score to 0.857.
* I tried various combinations and model parameters; the configuration below gave the maximum score of 0.885.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import VotingClassifier


# Training classifier model
# Logistic Regression model
# lr = LogisticRegression(verbose=1)

# Multinomial Naive Bayes model
clf = MultinomialNB(alpha=0.0225)

# SGDClassifier models with different hyperparameters
sgd_model = SGDClassifier(max_iter=8000, tol=1e-3, loss="modified_huber")
sgd_model2 = SGDClassifier(max_iter=12000, tol=5e-4, loss="modified_huber", class_weight="balanced")
# sgd_model3 = SGDClassifier(max_iter=15000, tol=3e-4, loss="modified_huber", early_stopping=True)

# Ensemble VotingClassifier
final_model = VotingClassifier(
    estimators=[('sgd', sgd_model),('sgd_2', sgd_model2),('mnb', clf)],
    weights=[0.10,0.54,0.36],
    voting='soft',
    verbose=1
)

final_model.fit(x,y)

[Voting] ...................... (1 of 3) Processing sgd, total=   0.2s
[Voting] .................... (2 of 3) Processing sgd_2, total=   0.3s
[Voting] ...................... (3 of 3) Processing mnb, total=   0.0s
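
With `voting='soft'`, the ensemble prediction is a weighted average of each model's class probabilities. The sketch below works through that arithmetic with the weights from the `VotingClassifier` above; the per-model probabilities are made-up values for a single essay, and `soft_vote` is a helper written for this example rather than a scikit-learn function.

```python
# Weights from the VotingClassifier in this notebook, in the order sgd, sgd_2, mnb.
weights = [0.10, 0.54, 0.36]

# Hypothetical [P(human), P(AI)] outputs of each model for one essay.
model_probs = [
    [0.40, 0.60],  # sgd
    [0.20, 0.80],  # sgd_2
    [0.70, 0.30],  # mnb
]

def soft_vote(probs, weights):
    """Weighted average of class probabilities across models."""
    total = sum(weights)
    n_classes = len(probs[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probs)) / total
        for k in range(n_classes)
    ]

ensemble = soft_vote(model_probs, weights)
print(ensemble)  # approximately [0.4, 0.6]
```

Because `sgd_2` carries more than half of the total weight, its opinion dominates: here the Naive Bayes model leans towards "human", but the ensemble still leans towards "AI".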


**Testing**

In [None]:
y_probs = final_model.predict_proba(test_transform)
y_probs

array([[0.30359801, 0.69640199],
       [0.51892787, 0.48107213],
       [0.51892787, 0.48107213]])

In [None]:
y_gen_prob =[y[1] for y in y_probs]
y_gen_prob

[0.6964019911057642, 0.48107213192821385, 0.48107213192821385]

In [None]:
sub = test_essays[["id"]].copy()

sub["generated"] = y_gen_prob

# Save Submission
sub.to_csv('submission.csv',index=False)

sub.head()

         id  generated
0  0000aaaa   0.696402
1  1111bbbb   0.481072
2  2222cccc   0.481072


The final score I got for the above model on Kaggle is **0.885**. <br>*username : uranaveer*
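
A score of this kind could also be estimated locally on a held-out split before submitting. The sketch below assumes the leaderboard metric is the area under the ROC curve, which this notebook does not state; `roc_auc` is a small pure-Python helper written for illustration, and the labels and probabilities are invented, not results from this model.

```python
def roc_auc(labels, scores):
    """Probability that a random AI essay is scored above a random human essay (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one essay of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out labels and predicted P(AI).
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(y_true, y_score))  # 0.75
```

Under this metric only the ranking of the predicted probabilities matters, so submitting the raw `predict_proba` values rather than hard 0/1 labels is the natural choice.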

* In addition to the above, I also tried other models such as BERT, but I did not get a better score. Parameter tuning and training consumed a lot of time and computational resources, so I did not proceed with that.

In [None]:
# import keras_nlp
# import keras
# from keras import layers

# preprocessor1 = keras_nlp.models.DistilBertPreprocessor.from_preset(
#     "distil_bert_base_en_uncased",
#     sequence_length=512,
# )
# x = preprocessor1(x)
# x_val = preprocessor1(x_val)
# classifier = keras_nlp.models.DistilBertClassifier.from_preset(
#     "distil_bert_base_en_uncased",
#     num_classes=1,
# )
# classifier.backbone.trainable = False

# classifier.compile(
#     loss=['binary_crossentropy'],
#     optimizer=keras.optimizers.Adam(1e-4),
#     jit_compile=True,
#     metrics = ['accuracy','AUC']
# )


# classifier.fit(x= x, y=y, batch_size=8 ,epochs =5,
#                validation_data = (x_val,y_val))