## EDA

DATASET SOURCE: https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset

This dataset contains both AI-generated and human-written essays for training purposes.
The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

The dataset contains more than 28,000 essays, both student-written and AI-generated.

Features:

    text : the essay text

    generated : the target label (0 = human-written essay, 1 = AI-generated essay)

Of these features, only the essay texts matter for our purposes.

In [1]:
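# import nltk
# nltk.download('punkt')  # tokenizer models used by word_tokenize ('punkt_tab' on newer NLTK releases)
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')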

import pandas as pd
import numpy as np
import random
import os
import re

import torch
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


### Data cleaning

In this step we clean the original text so that it is suitable for further processing. The cleaning consists of:

    - removing special characters
    - removing numbers
    - removing punctuation
    - tokenization
    - removing stopwords
    - lemmatization

Below is an example of an original essay alongside its cleaned version.

In [3]:
# Load and clean the dataset
def clean_text(text_to_clean):
    text_to_clean = text_to_clean.lower()

    # Remove special characters
    text_to_clean = re.sub(r'[^a-zA-Z0-9\s]', '', text_to_clean)

    # Remove numbers
    text_to_clean = re.sub(r'\d+', '', text_to_clean)

    # Remove punctuation
    text_to_clean = re.sub(r'[^\w\s]', '', text_to_clean)

    # Tokenize text
    tokens = word_tokenize(text_to_clean)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

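    # Join tokens
    cleaned_text = ' '.join(tokens)
    # NOTE: the regex-cleaned string is returned, not `cleaned_text`, so the
    # tokenization, stopword removal and lemmatization above do not affect the
    # result (the cleaned example below still contains stopwords).
    # Return `cleaned_text` instead to apply those steps.
    return text_to_clean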


df = pd.read_csv('./datasets/Training_Essay_Data.csv')
print("Essay example (not cleaned):")
print(df['text'].iloc[0])

# Clean the text column and drop essays that are empty after cleaning
df['clean_text'] = df['text'].apply(clean_text)
df = df[df['clean_text'].str.split().apply(len) > 0]

Essay example (not cleaned):
Car-free cities have become a subject of increasing interest and debate in recent years, as urban areas around the world grapple with the challenges of congestion, pollution, and limited resources. The concept of a car-free city involves creating urban environments where private automobiles are either significantly restricted or completely banned, with a focus on alternative transportation methods and sustainable urban planning. This essay explores the benefits, challenges, and potential solutions associated with the idea of car-free cities.  Benefits of Car-Free Cities  Environmental Sustainability: Car-free cities promote environmental sustainability by reducing air pollution and greenhouse gas emissions. Fewer cars on the road mean cleaner air and a significant decrease in the contribution to global warming.  Improved Public Health: A reduction in automobile usage can lead to better public health outcomes. Fewer cars on the road result in fewer accidents

In [4]:
print("Essay example (cleaned):")
print(df['clean_text'].iloc[0])

Essay example (cleaned):
carfree cities have become a subject of increasing interest and debate in recent years as urban areas around the world grapple with the challenges of congestion pollution and limited resources the concept of a carfree city involves creating urban environments where private automobiles are either significantly restricted or completely banned with a focus on alternative transportation methods and sustainable urban planning this essay explores the benefits challenges and potential solutions associated with the idea of carfree cities  benefits of carfree cities  environmental sustainability carfree cities promote environmental sustainability by reducing air pollution and greenhouse gas emissions fewer cars on the road mean cleaner air and a significant decrease in the contribution to global warming  improved public health a reduction in automobile usage can lead to better public health outcomes fewer cars on the road result in fewer accidents and a safer urban envi
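
The cleaning steps listed above can be traced on a short string. The following is a minimal sketch, not the notebook's pipeline: `clean_text_sketch` and `TOY_STOPWORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, a tiny hard-coded set stands in for NLTK's English stopword list, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"a", "the", "of", "and", "in", "have", "are"}


def clean_text_sketch(text):
    """Lowercase, strip non-letters, and drop the toy stopwords."""
    text = text.lower()
    # Drop everything except letters, digits and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Drop digits
    text = re.sub(r"\d+", "", text)
    # Whitespace tokenization (an assumption; stands in for word_tokenize)
    tokens = text.split()
    # Remove stopwords and join the remaining tokens
    tokens = [token for token in tokens if token not in TOY_STOPWORDS]
    return " ".join(tokens)


print(clean_text_sketch("Car-free cities have become a subject of debate in 2023!"))
# prints: carfree cities become subject debate
```

Unlike the cleaned example above, the result here is the joined token list, so the stopwords are gone.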

### Creating vocabulary

In this step we use a tokenizer to build the vocabulary of all the essays.

In [5]:
# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['clean_text'])
vocab_size = len(tokenizer.word_index) + 1

# Save the vocabulary if it has not been saved yet
file_path = 'tokenizer_vocab_with_frequency.csv'
if not os.path.exists(file_path):
    word_index = tokenizer.word_index
    word_counts = tokenizer.word_counts
    word_index_df = pd.DataFrame(word_index.items(), columns=['Word', 'Index'])
    word_counts_df = pd.DataFrame(word_counts.items(), columns=['Word', 'Frequency'])
    tokenizer_vocab_df = pd.merge(word_index_df, word_counts_df, on='Word', how='left')
    tokenizer_vocab_df = tokenizer_vocab_df.sort_values(by='Frequency', ascending=False)
    tokenizer_vocab_df = tokenizer_vocab_df[['Word', 'Index', 'Frequency']]
    tokenizer_vocab_df.to_csv(file_path, index=False)
    print("Vocabulary saved")
else:
    tokenizer_vocab_df = pd.read_csv(file_path)

In [6]:
tokenizer_vocab_df

Unnamed: 0,Word,Index,Frequency
0,the,1,535929
1,to,2,370172
2,and,3,286126
3,a,4,261618
4,of,5,257973
...,...,...,...
87554,compurer,57772,1
87555,upsed,57773,1
87556,inreality,57774,1
87557,sugarcoated,57775,1
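
In the table above, the most frequent word has index 1, the next most frequent has index 2, and so on. A rough pure-Python sketch of that frequency-ranked indexing is shown here; `build_word_index` is a hypothetical helper written for illustration and is not the Keras `Tokenizer` implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Count words across all texts and rank them by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common() orders words by descending count; ranks start at 1
    # so that index 0 stays free (it is used for padding later on).
    word_index = {
        word: rank
        for rank, (word, _) in enumerate(counts.most_common(), start=1)
    }
    return word_index, counts


texts = ["the cat sat on the mat", "the dog sat"]
word_index, counts = build_word_index(texts)
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0

print(word_index["the"], counts["the"])  # prints: 1 3
print(vocab_size)                        # prints: 7
```

This is also why the notebook computes `vocab_size` as `len(tokenizer.word_index) + 1`.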


### Handling missing words

In this part we create essays with marked missing words: we choose 1 to 4 random words in each text and wrap them in asterisks.

missing_word -> \*missing_word\*

We also build position_index_pairs_all, which stores, for each essay, the position of every missing word in the text together with that word's index in the vocabulary.

We then save the marked essays in training_examples and the pairs in position_index_pairs_all.

In [7]:
# Handling Missing Words
position_index_pairs_all = []
def create_missing_word_examples(text, tokenizer, min_missing_words=1, max_missing_words=4):
    words = text.split()
    if len(words) == 0:
        return 0
    num_missing_words = random.randint(min_missing_words, min(max_missing_words, len(words)))
    missing_word_indices = random.sample(range(len(words)), num_missing_words)
    missing_word_examples = []
    words_with_marks = words.copy()
    position_index_pairs = []
    for index in missing_word_indices:
        # Mark the chosen word in the text
        missing_word = words[index]
        words_with_marks[index] = '*' + missing_word + '*'
        # Record (position in text, vocabulary index); 0 means the word is not in the vocabulary
        position = index
        index_in_vocab = tokenizer.word_index.get(missing_word, 0)
        position_index_pairs.append((position, index_in_vocab))
    # Side effect: the pairs for this essay are appended to the module-level list
    position_index_pairs_all.append(position_index_pairs)
    input_text = ' '.join(words_with_marks)  # Join the words with marks back into a string
    missing_word_examples.append(input_text)
    return missing_word_examples


# Creating Training Examples
training_examples = []
for text in df['clean_text']:
    examples = create_missing_word_examples(text, tokenizer)
    if examples == 0:
        continue
    training_examples.extend(examples)
    
# Convert position_index_pairs to a NumPy array
file_path2 = 'position_index_pairs.npy'
position_index_pairs_array = np.array(position_index_pairs_all, dtype=object)
np.save(file_path2, position_index_pairs_array)
print("Position-index pairs saved to NumPy array:", file_path2)

Position-index pairs saved to NumPy array: position_index_pairs.npy


In [12]:
print("training_examples first element:")
print(training_examples[0])
print("---------------------------------------------------------------")
print("position_index_pairs_all first element:")
print(position_index_pairs_all[0])

training_examples first element:
carfree cities have become a subject of increasing interest and debate in recent years as urban areas around the world grapple with the challenges of congestion pollution and limited resources the concept of a carfree city involves creating urban environments where private automobiles are either significantly restricted or completely banned with a focus on alternative transportation methods and sustainable urban planning this essay explores the benefits challenges and potential solutions associated with the idea of carfree cities benefits of carfree cities environmental sustainability carfree cities promote environmental sustainability by reducing air pollution and greenhouse *gas* emissions fewer cars on the road mean cleaner air and a significant decrease in the contribution to global warming improved public health a reduction in automobile usage can lead to better public health outcomes fewer cars on the road result in fewer accidents and a safer urb
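
To make the marking step concrete, here is a small deterministic sketch. It assumes the positions are passed in explicitly instead of being drawn with `random.sample`, and `mark_words` and `toy_index` are hypothetical names used only for this illustration.

```python
def mark_words(text, word_index, positions):
    """Wrap the words at the given positions in asterisks.

    Returns the marked text and a list of (position, vocabulary index) pairs.
    """
    words = text.split()
    pairs = []
    for pos in positions:
        # Unknown words map to 0, mirroring word_index.get(word, 0) in the notebook
        pairs.append((pos, word_index.get(words[pos], 0)))
        words[pos] = "*" + words[pos] + "*"
    return " ".join(words), pairs


# Toy vocabulary (an assumption; the real one comes from the fitted tokenizer)
toy_index = {"greenhouse": 1, "gas": 2, "emissions": 3}

marked, pairs = mark_words("greenhouse gas emissions", toy_index, [1])
print(marked)  # prints: greenhouse *gas* emissions
print(pairs)   # prints: [(1, 2)]
```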

### Calculating word counts in essays

In [14]:
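def text_to_tensor(text, tokenizer, position_index_pairs_all):
    # Note: despite its name, position_index_pairs_all here is the list of
    # (position, vocabulary index) pairs for this one essay.
    # Tokenize the text (the tokenizer's default filters strip the '*' marks)
    tokens = tokenizer.texts_to_sequences([text])[0]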
    tensor_representation = []

    # Iterate through the tokens
    for i, token in enumerate(tokens):
        # Check if the current token is the index of a missing word
        if (i, token) in position_index_pairs_all:
            tensor_representation.append(-1)  # Mark the missing word with -1
        else:
            tensor_representation.append(token)  # Append the token index

    return tensor_representation


# Calculate average and minimum length of essays
essay_lengths = [len(text.split()) for text in df['clean_text']]
average_length = sum(essay_lengths) / len(essay_lengths)
min_length = min(essay_lengths)

# Vectorization
max_sequence_length = max(essay_lengths)

essays_tensor = []

# Iterate through each essay
for essay, position_index_pair in zip(training_examples, position_index_pairs_all):
    # Convert the essay to tensor representation
    tensor_representation = text_to_tensor(essay, tokenizer, position_index_pair)
    # Pad the tensor representation based on the length statistics
    padded_representation = pad_sequences([tensor_representation], maxlen=max_sequence_length, padding='post')[0]
    # Append the padded tensor representation to the list
    essays_tensor.append(padded_representation)

# Save the essays' numeric representation (a list of padded arrays)
torch.save(essays_tensor, 'essays_tensor.pt')

# Convert the list to a numpy array
essays_tensor = np.array(essays_tensor)

# Print statistics
print("Average length of essays:", average_length)
print("Minimum length of essays:", min_length)
print("Maximum length of essays:", max_sequence_length)

# Print the tensor representation of the first essay
print("Tensor representation of the first essay:")
print(essays_tensor[0])

Average length of essays: 380.28060664287676
Minimum length of essays: 3
Maximum length of essays: 1647
Tensor representation of the first essay:
[312 239  17 ...   0   0   0]


In these tensors, each missing word is represented by -1 and padding by 0.
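
The encoding can be illustrated on a toy example. This is a sketch under stated assumptions rather than the notebook's code: `encode_with_mask` and `toy_index` are hypothetical, plain Python lists replace `pad_sequences`, unknown words are mapped to 0 rather than dropped, and `max_len` is assumed to be at least the number of words.

```python
def encode_with_mask(words, word_index, pairs, max_len):
    """Map words to vocabulary indices, mask the marked ones, and post-pad."""
    masked = set(pairs)
    ids = []
    for pos, word in enumerate(words):
        # Strip the asterisk marks before the vocabulary lookup
        token = word_index.get(word.strip("*"), 0)
        # A (position, index) match means this word is a missing word
        ids.append(-1 if (pos, token) in masked else token)
    # Post-pad with zeros up to max_len (no truncation is performed)
    return ids + [0] * (max_len - len(ids))


# Toy vocabulary (an assumption; the real one comes from the fitted tokenizer)
toy_index = {"greenhouse": 1, "gas": 2, "emissions": 3}

encoded = encode_with_mask(
    ["greenhouse", "*gas*", "emissions"], toy_index, [(1, 2)], max_len=6
)
print(encoded)  # prints: [1, -1, 3, 0, 0, 0]
```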