## EDA

DATASET SOURCE: https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset

This dataset contains both AI-generated and human-written essays for training purposes.
The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or by an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

The dataset contains more than 28,000 essays, both student-written and AI-generated.

Features:

    text : the essay text

    generated : the target label (0 = human-written essay, 1 = AI-generated essay)

Of these features, only the essay text is used in what follows.

In [1]:
import pandas as pd
import numpy as np
import random
import os
import re
import torch
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

### Data cleaning

In this step we clean the original text so that it is suitable for further processing. The cleaning consists of:

    - lowercasing
    - removing special characters
    - removing numbers
    - removing punctuation
    - tokenization
    - removing stopwords
    - lemmatization
    
Below we show an example of an original essay and its cleaned version.

In [2]:
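# Initialize WordNetLemmatizer and build the stopword set once
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def get_wordnet_pos(word):
    # NOTE: this helper is called below but is not shown in the notebook.
    # This is an assumed definition: it maps the NLTK POS tag of a single
    # word to a WordNet POS (requires 'averaged_perceptron_tagger').
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {'J': wordnet.ADJ, 'N': wordnet.NOUN, 'V': wordnet.VERB, 'R': wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def clean_text(text_to_clean):
    text_to_clean = text_to_clean.lower()

    # Remove special characters, numbers, and punctuation
    text_to_clean = re.sub(r'[^a-zA-Z\s]', '', text_to_clean)

    # Tokenize text
    tokens = word_tokenize(text_to_clean)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize tokens using WordNet
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

    # Join tokens
    cleaned_text = ' '.join(tokens)
    return cleaned_text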

df = pd.read_csv('./datasets/Training_Essay_Data.csv')

# Check if the cleaned text CSV file exists
cleaned_text_file = 'cleaned_text.csv'
if not os.path.exists(cleaned_text_file):
    print("Cleaning dataset")
    df['clean_text'] = df['text'].apply(clean_text)
    df = df[df['clean_text'].str.split().apply(len) > 0]
    df.to_csv(cleaned_text_file, index=False)
    print("Cleaned text saved to:", cleaned_text_file)
else:
    df = pd.read_csv(cleaned_text_file)
    print("Cleaned text loaded from:", cleaned_text_file)

Cleaned text loaded from: cleaned_text.csv


In [12]:
print("Essay example (not cleaned):")
print(df['text'].iloc[0])

Essay example (not cleaned):
Car-free cities have become a subject of increasing interest and debate in recent years, as urban areas around the world grapple with the challenges of congestion, pollution, and limited resources. The concept of a car-free city involves creating urban environments where private automobiles are either significantly restricted or completely banned, with a focus on alternative transportation methods and sustainable urban planning. This essay explores the benefits, challenges, and potential solutions associated with the idea of car-free cities.  Benefits of Car-Free Cities  Environmental Sustainability: Car-free cities promote environmental sustainability by reducing air pollution and greenhouse gas emissions. Fewer cars on the road mean cleaner air and a significant decrease in the contribution to global warming.  Improved Public Health: A reduction in automobile usage can lead to better public health outcomes. Fewer cars on the road result in fewer accidents

In [3]:
print("Essay example (cleaned):")
print(df['clean_text'].iloc[0])

Essay example (cleaned):
carfree city become subject increase interest debate recent year urban area around world grapple challenge congestion pollution limited resource concept carfree city involves create urban environment private automobile either significantly restrict completely ban focus alternative transportation method sustainable urban planning essay explores benefit challenge potential solution associate idea carfree city benefit carfree city environmental sustainability carfree city promote environmental sustainability reduce air pollution greenhouse gas emission few car road mean cleaner air significant decrease contribution global warm improve public health reduction automobile usage lead well public health outcome few car road result few accident safer urban environment pedestrian cyclist moreover less air pollution lead reduce respiratory cardiovascular problem efficient use space carfree city utilize urban space efficiently parking lot wide road repurposed green space p
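
Comparing the two printouts, "Car-free" has become "carfree": the regular expression deletes the hyphen instead of replacing it with a space, so hyphenated words are merged into one token. The sketch below isolates only that regex step using the standard `re` module. The helper name `strip_non_letters` and its `replace_with_space` option are illustrative assumptions and are not part of the notebook.

```python
import re

def strip_non_letters(text, replace_with_space=False):
    """Lowercase the text and drop everything except letters and whitespace.

    Hypothetical helper mirroring the regex step of clean_text above.
    """
    replacement = ' ' if replace_with_space else ''
    cleaned = re.sub(r'[^a-zA-Z\s]', replacement, text.lower())
    # Collapse runs of whitespace so both variants are easy to compare
    return ' '.join(cleaned.split())

sample = "Car-free cities: 2 benefits, 1 challenge!"
merged = strip_non_letters(sample)                               # notebook behaviour
separated = strip_non_letters(sample, replace_with_space=True)   # alternative
print(merged)
print(separated)
```

Which variant is preferable depends on whether merged forms such as "carfree" have a pre-trained embedding; words without one end up as `<UNK>` in the vocabulary step.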

### Creating vocabulary

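In this step we use a tokenizer to build the vocabulary of all essays. Words that occur fewer than 20 times are dropped, words without a pre-trained GloVe vector are merged into a single `<UNK>` token, and a `<MISSING>` token is added at index 0.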

In [4]:
# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['clean_text'])
vocab_size = len(tokenizer.word_index) + 1

# Build and save the vocabulary if it does not exist yet
file_path = 'tokenizer_vocab_with_frequency_embedding.csv'
file_path3 = 'embedding_matrix.npy'
if not os.path.exists(file_path):
    word_index = tokenizer.word_index
    word_counts = tokenizer.word_counts
    word_index_df = pd.DataFrame(word_index.items(), columns=['Word', 'Index'])
    word_counts_df = pd.DataFrame(word_counts.items(), columns=['Word', 'Frequency'])
    tokenizer_vocab_df = pd.merge(word_index_df, word_counts_df, on='Word', how='left')
    tokenizer_vocab_df = tokenizer_vocab_df.sort_values(by='Frequency', ascending=False)
    tokenizer_vocab_df = tokenizer_vocab_df[['Word', 'Index', 'Frequency']]
    tokenizer_vocab_df = tokenizer_vocab_df[tokenizer_vocab_df['Frequency'] >= 20]

    # Load pre-trained GloVe embeddings
    embeddings_index = {}
    embedding_dim = 50
    glove_file = 'glove.6B.50d.txt'

    with open(glove_file, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    # "Unknown" token and initialization of embedding vector
    unknown_token = '<UNK>'
    unknown_vector = np.zeros(embedding_dim)
    word_count = 0

    # Update the vocabulary based on GloVe embeddings
    for index, row in tokenizer_vocab_df.iterrows():
        word = row['Word']

        # Change token to <UNK> for words not in GloVe
        if word not in embeddings_index:
            tokenizer_vocab_df.at[index, 'Word'] = unknown_token
        else:
            unknown_vector += embeddings_index[word]
            word_count += 1

    # Calculate average for unknown vector 
    unknown_vector = np.divide(unknown_vector, word_count)

    # Remove duplicates and reset indices of the vocabulary DataFrame
    tokenizer_vocab_df = tokenizer_vocab_df.drop_duplicates(subset='Word').reset_index(drop=True)

    # Prepend the <MISSING> token so that it ends up at index 0
    missing_token = '<MISSING>'
    row_miss = [missing_token, 0, 0]
    tokenizer_vocab_df = pd.concat([pd.DataFrame([row_miss], columns=tokenizer_vocab_df.columns), tokenizer_vocab_df], ignore_index=True)

    # Create embedding matrix, do not change vocab after this step!!!
    new_vocab_size = tokenizer_vocab_df.shape[0]
    embedding_matrix = np.zeros((new_vocab_size, embedding_dim))
    print(embedding_matrix.shape)

    for index, row in tokenizer_vocab_df.iterrows():
        word = row['Word']
        # Set index to index in vocab
        tokenizer_vocab_df.at[index, 'Index'] = index

        if(word == unknown_token):
            embedding_matrix[index] = unknown_vector
        elif(word == missing_token):
            embedding_matrix[index] = np.zeros(embedding_dim)
        else: 
            embedding_matrix[index] = embeddings_index[word]
    
    tokenizer_vocab_df.to_csv(file_path, index=False)
    np.save(file_path3, embedding_matrix)
    print("Vocabulary saved")
else:
    tokenizer_vocab_df = pd.read_csv(file_path)
    print("Vocabulary loaded")

Vocabulary loaded


In [5]:
tokenizer_vocab_df

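|      | Word       | Index | Frequency |
|------|------------|-------|-----------|
| 0    | \<MISSING> | 0     | 0         |
| 1    | people     | 1     | 85061     |
| 2    | car        | 2     | 68145     |
| 3    | make       | 3     | 51179     |
| 4    | vote       | 4     | 45807     |
| ...  | ...        | ...   | ...       |
| 7048 | spit       | 7048  | 20        |
| 7049 | coexist    | 7049  | 20        |
| 7050 | lawsuit    | 7050  | 20        |
| 7051 | boss       | 7051  | 20        |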
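
To make the handling of the two special tokens concrete: the `<UNK>` row of the embedding matrix is the mean of the GloVe vectors of all vocabulary words that were found in GloVe, while the `<MISSING>` row stays all zeros. The following is a minimal sketch with made-up 3-dimensional vectors, where plain Python lists stand in for NumPy arrays. The names `toy_glove`, `toy_vocab`, `mean_vector`, and `build_embedding_rows` are hypothetical and do not appear in the notebook.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def build_embedding_rows(vocab, glove, dim):
    """Return (tokens, rows) with <MISSING> at index 0 and one <UNK> row.

    Hypothetical helper: words absent from `glove` collapse into <UNK>,
    whose vector is the mean of all vectors that were found.
    """
    known = [word for word in vocab if word in glove]
    if not known:
        raise ValueError("no vocabulary word has a pre-trained vector")
    unk_vector = mean_vector([glove[word] for word in known])

    tokens = ['<MISSING>']
    rows = [[0.0] * dim]
    unk_added = False
    for word in vocab:
        if word in glove:
            tokens.append(word)
            rows.append(list(glove[word]))
        elif not unk_added:
            # The first word without a vector takes the single <UNK> slot,
            # like the first duplicate kept by drop_duplicates
            tokens.append('<UNK>')
            rows.append(unk_vector)
            unk_added = True
    return tokens, rows

# Made-up vectors, for illustration only
toy_glove = {'people': [1.0, 0.0, 2.0], 'car': [3.0, 2.0, 0.0]}
toy_vocab = ['people', 'carfree', 'car', 'esssay']
tokens, rows = build_embedding_rows(toy_vocab, toy_glove, dim=3)
print(tokens)
print(rows)
```

Because the row position of each token is its index, the vocabulary must not be reordered after the matrix is built, which is what the warning comment in the cell above refers to.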


### Handling missing words

In this part we create essays with marked missing words: we choose 1 to 4 random words in the text and mark them (the function defaults used below select exactly one word per essay, and each essay is truncated to its first 50 words).

missing_word -> \*missing_word\* (in the end we did not need this mark, because we generate the number representation with the help of position_index_pairs)

We also create position_index_pairs_all, which stores, for each essay, the position of every missing word in the text together with that word's index in the vocabulary.

We then save the new essays in training_examples and the pairs in position_index_pairs_all.

In [6]:
def create_missing_word_examples(text, tokenizer_vocab_df, min_missing_words=1, max_missing_words=1, max_essay_length=50):
    words = text.split()[:max_essay_length]
    if len(words) <= max_missing_words + 2:
        return 0

    missing_word_examples = []
    words_with_marks = words.copy()
    position_index_pairs = []

    while len(missing_word_examples) < 1:
        num_missing_words = random.randint(min_missing_words, max_missing_words)
        missing_word_indices = random.sample(range(1, len(words) - 1), num_missing_words)

        for index in missing_word_indices:
            # If the chosen word is not in the vocabulary, replace it with a
            # random vocabulary word so that it always has a valid index
            while True:
                missing_word = words[index]
                index_in_vocab = tokenizer_vocab_df[tokenizer_vocab_df['Word'] == missing_word]['Index'].values
                if len(index_in_vocab) > 0:
                    break
                words[index] = random.choice(list(tokenizer_vocab_df['Word']))

            # No '*' marks are added; the position is tracked in position_index_pairs
            words_with_marks[index] = missing_word
            position_index_pairs.append((index, index_in_vocab[0]))

        input_text = ' '.join(words_with_marks)  # Join the words back into a string
        missing_word_examples.append(input_text)

    position_index_pairs_all.append(position_index_pairs)
    # print(position_index_pairs)

    return missing_word_examples


# Handling Missing Words
position_index_pairs_all = []

# Creating Training Examples
training_examples = []
k = 0
for text in df['clean_text']:
    examples = create_missing_word_examples(text, tokenizer_vocab_df)
    if examples == 0:
        continue
    training_examples.extend(examples)
    print(f"Example {k} created")
    k += 1

# Convert position_index_pairs to a NumPy array
file_path2 = 'position_index_pairs.npy'
position_index_pairs_array = np.array(position_index_pairs_all, dtype=object)
np.save(file_path2, position_index_pairs_array)

Example 0 created
Example 1 created
Example 2 created
...

In [7]:
print("training_examples first element:")
print(training_examples[0])
print("---------------------------------------------------------------")
print("position_index_pairs_all first element:")
print(position_index_pairs_all[0])

training_examples first element:
carfree city become subject increase interest debate recent year urban area around world grapple challenge congestion pollution limited resource concept carfree city involves create urban environment private automobile either significantly restrict completely ban focus alternative transportation method sustainable urban planning essay explores benefit challenge potential solution associate idea carfree city
---------------------------------------------------------------
position_index_pairs_all first element:
[(36, 243)]
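
The pair `(36, 243)` reads as: the word at zero-based position 36 of the truncated essay was selected, and 243 is that word's index in the vocabulary. Below is a minimal sketch of the same bookkeeping on a toy vocabulary. It is a simplification under stated assumptions: `mask_one_word` and `toy_vocab` are hypothetical names, a plain dict replaces the DataFrame lookup, and only in-vocabulary words are eligible, whereas the notebook swaps an out-of-vocabulary word for a random vocabulary word.

```python
import random

def mask_one_word(words, vocab, rng):
    """Pick one interior position and return (position, vocab_index).

    Hypothetical helper: the first and last words are never chosen,
    matching range(1, len(words) - 1) in the cell above.
    Returns None when no eligible word exists.
    """
    candidates = [i for i in range(1, len(words) - 1) if words[i] in vocab]
    if not candidates:
        return None
    position = rng.choice(candidates)
    return position, vocab[words[position]]

# Toy vocabulary with made-up indices, for illustration only
toy_vocab = {'<MISSING>': 0, 'city': 1, 'urban': 2, 'area': 3, 'world': 4}
words = 'carfree city become urban area around world'.split()
rng = random.Random(0)  # fixed seed so the sketch is repeatable
pair = mask_one_word(words, toy_vocab, rng)
print(pair)
```

A dict lookup is constant time per word, so on the full dataset it would avoid scanning the vocabulary DataFrame once for every candidate word.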


### Calculating word counts in essays

In [11]:
def text_to_tensor(text, tokenizer, position_index_pairs_all):
    # Tokenize the text
    tokens = tokenizer.texts_to_sequences([text])[0]
    tensor_representation = []

    # Iterate through the tokens
    for i, token in enumerate(tokens):
        tensor_representation.append(token)

    return tensor_representation

missing_word_index = tokenizer_vocab_df.index[tokenizer_vocab_df['Word'] == "<MISSING>"][0]
unknown_word_index = tokenizer_vocab_df.index[tokenizer_vocab_df['Word'] == "<UNK>"][0]

# Calculate average and minimum length of essays
essay_lengths = [len(text.split()) for text in df['clean_text']]
average_length = sum(essay_lengths) / len(essay_lengths)
min_length = min(essay_lengths)

# Vectorization
max_sequence_length = 50

essays_tensor = []

# Iterate through each essay
k = 0
padding_index = -1
for essay, position_index_pair in zip(training_examples, position_index_pairs_all):
    tokens = essay.split()
    token_indices = []

    # Convert each token to its corresponding index in the vocabulary
    for i, token in enumerate(tokens):
        if any(position == i for position, _ in position_index_pair):
            # Missing word
            token_indices.append(missing_word_index)
        else:
            index = tokenizer_vocab_df[tokenizer_vocab_df['Word'] == token]['Index'].values
            if len(index) > 0:
                token_indices.append(index[0])
            else:
                # If the token is not in the vocabulary -> unknown token
                token_indices.append(unknown_word_index)

    # Pad the sequence to the maximum length
    if len(token_indices) > max_sequence_length:
        token_indices = token_indices[:max_sequence_length]
    else:
        token_indices += [padding_index] * (max_sequence_length - len(token_indices))

    essays_tensor.append(token_indices)
    print(f"token indices {k} appended")
    k += 1

# Convert the list to a NumPy array
essays_tensor = np.array(essays_tensor)

# Save the essays as a numeric representation
torch.save(essays_tensor, 'essays_tensor.pt')

# Print statistics (max_sequence_length is the fixed truncation/padding
# length, not the length of the longest essay)
print("Average length of essays:", average_length)
print("Minimum length of essays:", min_length)
print("Maximum sequence length:", max_sequence_length)

# Print the tensor representation of the first essay
print("Tensor representation of the first essay:")
print(essays_tensor[0])

token indices 0 appended
token indices 1 appended
token indices 2 appended
...

In these tensors a missing word is represented by 0, padding by -1, an unknown word (\<UNK>) by 490, and every other word by its index in the vocabulary. In the experiments we use essay lengths of 50, 150, and 500 words.
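
The encoding described above can be sketched on a toy example as follows. This is a simplified illustration, not the notebook's implementation: `encode_essay`, `toy_index`, and `toy_unk` are hypothetical names, the toy `<UNK>` index is 4 rather than 490, and a dict replaces the DataFrame lookup.

```python
MISSING_INDEX = 0   # index of <MISSING> in the vocabulary
PADDING_INDEX = -1  # value used to fill short sequences

def encode_essay(tokens, word_to_index, unk_index, masked_positions, max_len):
    """Map tokens to indices, mask selected positions, then truncate or pad."""
    indices = []
    for position, token in enumerate(tokens[:max_len]):
        if position in masked_positions:
            indices.append(MISSING_INDEX)
        else:
            # Tokens outside the vocabulary fall back to the <UNK> index
            indices.append(word_to_index.get(token, unk_index))
    # Pad on the right up to the fixed sequence length
    indices += [PADDING_INDEX] * (max_len - len(indices))
    return indices

# Toy vocabulary with made-up indices, for illustration only
toy_index = {'city': 1, 'urban': 2, 'area': 3}
toy_unk = 4
encoded = encode_essay(
    'carfree city become urban area'.split(),
    toy_index,
    toy_unk,
    masked_positions={3},
    max_len=8,
)
print(encoded)  # [4, 1, 4, 0, 3, -1, -1, -1]
```

Here "carfree" and "become" map to the toy `<UNK>` index, the masked word at position 3 becomes 0, and the last three slots are padding.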