# AI Detector - Data Preprocessing

## 1. Import Necessary Dependencies

First, we import the libraries required for preprocessing.

In [1]:
import os
import json
import pandas as pd

## 2. Define Necessary Functions
Before preprocessing, we define the utility functions and components that the preprocessing step relies on.

### 2.1 Initialize File Extensions Dictionary
Each subdirectory under the `dataset-source-codes` directory contains a `.json` file holding a coding question asked for a particular programming language. The remaining files in that subdirectory are code snippets: the candidate's answer and the LLM answers. The extension of these files is needed to load the code snippets in the next step. The `file_extensions_dict` maps the programming language named in the `.json` file to the file extension of the answer files.

For example, answer snippets with the `.jav` extension are answers to a coding question asked for the `Java` programming language.

In [2]:
file_extensions_dict = {
    "Java": "jav",
    "JavaScript": "js",
    "C++": "cpp",
    "Python": "py",
    "Swift": "swift",
    "React": "js",
    "Kotlin": "kt",
    "SQL": "sql",
    "Docker": "dockerfile",
    "C#": "cs",
    "PHP & JavaScript": "js",
    "HTML/CSS": "css",
    "JavaScript-React": "js",
    "PHP": "php",
    "TypeScript": "ts",
    "Go": "go",
    "Dart": "dart"
}
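
As a small illustration of how this mapping turns into file names, the sketch below lists the answer files expected in one question folder. The helper `build_answer_filenames` is hypothetical and not part of the notebook, and only a subset of the mapping is repeated so that the snippet runs on its own. Unlike a lookup with an empty-string default, it raises when a language has no registered extension.

```python
# Subset of the mapping above, repeated so the snippet is self-contained.
file_extensions_dict = {"Java": "jav", "Python": "py", "React": "js"}


def build_answer_filenames(folder, language,
                           models=("gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"),
                           variants=2):
    """Hypothetical helper: list the answer file names expected in one folder."""
    if language not in file_extensions_dict:
        # Fail early instead of silently building a name with an empty extension.
        raise KeyError(f"No file extension registered for language: {language!r}")
    ext = file_extensions_dict[language]
    names = [f"{folder}.{ext}"]  # candidate's answer
    for model in models:
        for i in range(variants):
            names.append(f"{folder}_{model}_0{i}.{ext}")  # LLM answers
    return names


names = build_answer_filenames("source_code_000", "Java")
print(names[0])    # candidate file name
print(len(names))  # 1 candidate answer + 6 LLM answers
```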

### 2.2 `load_question_data()` Function
- **Param:** `source_codes_path` -> Absolute path to the provided `dataset-source-codes` directory
- **Returns:** `questions` -> A list of all question data as dictionaries

This function traverses each subdirectory (`source_code_000` ... `source_code_062`) under the `dataset-source-codes` directory and loads the source code dataset as a list of dictionaries. For each subdirectory, it performs the following actions:
1. Initializes a `question_data` dictionary
2. Saves the coding question from the respective `.json` file into the dictionary
3. Saves the candidate's answer from the respective source code file into the dictionary
4. Saves all variants of the LLM answers from the respective LLM source code files into the dictionary
5. Adds the dictionary to the list

In [3]:
def load_question_data(source_codes_path):
    questions = []
    for folder in os.listdir(source_codes_path):
        if folder.startswith('source_code_'):
            question_path = os.path.join(source_codes_path, folder)
            question_data = {}

            # Load and save JSON question data
            with open(os.path.join(question_path, f'{folder}.json'), 'r') as f:
                metadata = json.load(f)
                question_data['id'] = folder
                question_data['question'] = metadata['question']

            # Resolve the answer file extension once for this question's language.
            # Building it outside the f-strings also avoids reusing the same quote
            # type inside an f-string, which is a syntax error before Python 3.12.
            extension = file_extensions_dict.get(metadata['programming_language'], '')

            # Load candidate's answer
            with open(os.path.join(question_path, f'{folder}.{extension}'), 'r') as f:
                question_data['candidate_answer'] = f.read()

            # Load AI-generated answers (two variants per model)
            ai_models = ['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo']
            for model in ai_models:
                for i in range(2):
                    with open(os.path.join(question_path, f'{folder}_{model}_0{i}.{extension}'), 'r') as f:
                        question_data[f'{model}_0{i}'] = f.read()

            questions.append(question_data)

    return questions
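
To make the expected directory layout concrete, the sketch below builds a throwaway dataset with a single question folder in a temporary directory and loads it with a condensed loader that mirrors the per-folder steps above. The question text, the file contents, and the name `load_one_question` are made up for illustration; the `question` and `programming_language` keys are the ones read by `load_question_data()`.

```python
import json
import os
import tempfile

AI_MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]


def load_one_question(question_path, folder, extension):
    """Condensed, illustrative version of the per-folder loading steps."""
    with open(os.path.join(question_path, f"{folder}.json"), "r", encoding="utf-8") as f:
        metadata = json.load(f)
    data = {"id": folder, "question": metadata["question"]}
    with open(os.path.join(question_path, f"{folder}.{extension}"), "r", encoding="utf-8") as f:
        data["candidate_answer"] = f.read()
    for model in AI_MODELS:
        for i in range(2):
            name = f"{folder}_{model}_0{i}.{extension}"
            with open(os.path.join(question_path, name), "r", encoding="utf-8") as f:
                data[f"{model}_0{i}"] = f.read()
    return data


with tempfile.TemporaryDirectory() as root:
    folder = "source_code_000"
    question_path = os.path.join(root, folder)
    os.makedirs(question_path)

    # Made-up question metadata and answers, only to exercise the layout.
    with open(os.path.join(question_path, f"{folder}.json"), "w", encoding="utf-8") as f:
        json.dump({"question": "Print hello.", "programming_language": "Python"}, f)
    with open(os.path.join(question_path, f"{folder}.py"), "w", encoding="utf-8") as f:
        f.write("print('hello')")
    for model in AI_MODELS:
        for i in range(2):
            with open(os.path.join(question_path, f"{folder}_{model}_0{i}.py"), "w", encoding="utf-8") as f:
                f.write(f"print('hello from {model} variant {i}')")

    question_data = load_one_question(question_path, folder, "py")

# Keys: id, question, candidate_answer, plus one key per LLM answer variant.
print(sorted(question_data))
```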

### 2.3 `combine_question_answer()` Function
This function takes a coding question and its corresponding answer (candidate or LLM), merges them into a single `Question: ... Answer: ...` string, and returns that string.

In [4]:
def combine_question_answer(question, answer):
    return f"Question: {question} Answer: {answer}"
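
A quick usage example, with a made-up question and answer, shows the shape of the merged string. The function is repeated here only so the snippet runs on its own.

```python
def combine_question_answer(question, answer):
    return f"Question: {question} Answer: {answer}"


# Made-up inputs for illustration.
combined = combine_question_answer(
    "Reverse a string.",
    "def rev(s):\n    return s[::-1]",
)
print(combined)
```

Note that the answer's own newlines are kept as they are; only a single space separates the question from the `Answer:` marker.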

### 2.4 `preprocess_data()` Function
- **Params:** 
  - `source_codes_path` -> Absolute path to the provided `dataset-source-codes` directory
  - `source_codes_labels` -> Absolute path to the provided `CodeAid Source Codes Labeling.xlsx` file with plagiarism scores
- **Returns:** `df` -> A Pandas DataFrame containing the preprocessed data

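This is the core function that preprocesses the provided dataset. It first loads the `questions` list via `load_question_data()` and reads the labels file into `plagiarism_scores`. Then, for each question in the `questions` list, it does the following:
1. Iterates over the six LLM answers (three models, two variants each)
2. Looks up the `plagiarism_score` of each LLM answer by matching `coding_problem_id` and `llm_answer_id` in `plagiarism_scores`
3. Appends a row holding the question, the candidate's answer, the LLM answer, and the score as `similarity_score`

Finally, it converts the rows into a DataFrame and adds the `candidate_combined` and `ai_combined` columns using `combine_question_answer()`.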

In [5]:
def preprocess_data(source_codes_path, source_codes_labels):
    questions = load_question_data(source_codes_path)
    plagiarism_scores = pd.read_excel(source_codes_labels)

    data = []
    for question in questions:
        for ai_model in ['gpt-3.5-turbo', 'gpt-4', 'gpt-4-turbo']:
            for i in range(2):
                score = plagiarism_scores[
                    (plagiarism_scores['coding_problem_id'] == question['id']) &
                    (plagiarism_scores['llm_answer_id'] == f'{ai_model}_0{i}')
                ]['plagiarism_score'].values[0]

                data.append({
                    'question': question['question'],
                    'candidate_answer': question['candidate_answer'],
                    'ai_answer': question[f'{ai_model}_0{i}'],
                    'similarity_score': score
                })

    df = pd.DataFrame(data)

    df['candidate_combined'] = df.apply(lambda row: combine_question_answer(
        row['question'], row['candidate_answer']), axis=1)
    df['ai_combined'] = df.apply(lambda row: combine_question_answer(
        row['question'], row['ai_answer']), axis=1)

    return df
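
The score lookup above filters `plagiarism_scores` once per row and raises an `IndexError` if a pair has no label. As an alternative, not the approach used in this notebook, the same join can be sketched with `DataFrame.merge`. The data below is made up and stands in for the output of `load_question_data()` and the labels file; only the column names are taken from the function above.

```python
import pandas as pd

# Made-up stand-ins for the loaded questions and the labels file.
questions = [{
    "id": "source_code_000",
    "question": "Print hello.",
    "candidate_answer": "print('hello')",
    "gpt-4_00": "print('hello')",
    "gpt-4_01": "print('hi')",
}]
plagiarism_scores = pd.DataFrame({
    "coding_problem_id": ["source_code_000", "source_code_000"],
    "llm_answer_id": ["gpt-4_00", "gpt-4_01"],
    "plagiarism_score": [0.9, 0.2],
})

# One row per (question, LLM answer) pair, in long format.
meta_keys = {"id", "question", "candidate_answer"}
rows = [
    {
        "coding_problem_id": q["id"],
        "llm_answer_id": key,
        "question": q["question"],
        "candidate_answer": q["candidate_answer"],
        "ai_answer": value,
    }
    for q in questions
    for key, value in q.items()
    if key not in meta_keys
]
answers = pd.DataFrame(rows)

# validate="one_to_one" raises if a label is duplicated; the indicator column
# exposes pairs without a label instead of failing on the first one.
merged = answers.merge(
    plagiarism_scores,
    on=["coding_problem_id", "llm_answer_id"],
    how="left",
    validate="one_to_one",
    indicator=True,
)
missing = merged[merged["_merge"] == "left_only"]
df = merged.rename(columns={"plagiarism_score": "similarity_score"}).drop(columns="_merge")

print(df[["llm_answer_id", "similarity_score"]])
print(f"Pairs without a label: {len(missing)}")
```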

## 3. Load and Save the Preprocessed Data
Now that we have defined all the functions and components needed for preprocessing, we run the actual preprocessing and save the result.

In [6]:
# Load the original data
data_dir = os.path.join(os.path.abspath(''), os.pardir, 'data')
source_codes_path = os.path.join(data_dir, 'dataset-source-codes')
source_codes_labels = os.path.join(
    data_dir, 'CodeAid Source Codes Labeling.xlsx')

# Preprocess the data
df = preprocess_data(source_codes_path, source_codes_labels)

# Save the preprocessed data
preprocessed_data = os.path.join(data_dir, 'preprocessed_data.csv')
df.to_csv(preprocessed_data, index=False)
print(f"Data preprocessing complete. File saved as {preprocessed_data}")

Data preprocessing complete. File saved as e:\Data Science\AI-Detector\notebooks\..\data\preprocessed_data.csv


## 4. Data Exploration
This section deals with exploring the preprocessed data. Let's start by quickly checking the shape of the data.

In [7]:
print(f"Shape of data: {df.shape}")

Shape of data: (378, 6)


The data has 6 columns and 378 rows, as expected. The original dataset has 63 coding questions, and each question has a single candidate answer and 6 different LLM answers (3 models with 2 variants each), so there are 63 x 6 = 378 rows in total.
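
The expected row count can also be derived from the loop bounds rather than typed in by hand. This is a small sketch; the variable names are chosen for illustration, and the model list matches the one used in `preprocess_data()`.

```python
n_questions = 63  # source_code_000 ... source_code_062
ai_models = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]
variants_per_model = 2

expected_rows = n_questions * len(ai_models) * variants_per_model
print(expected_rows)  # 378
```

In the notebook, comparing this value with `df.shape[0]` would turn the check into an assertion.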

Let's see the first five rows of the data.

In [8]:
df.head()

| | question | candidate_answer | ai_answer | similarity_score | candidate_combined | ai_combined |
|---|---|---|---|---|---|---|
| 0 | Write a program to find the largest element in... | `fun findLargestElement(array: IntArray) : Int ...` | `public class LargestElementFinder {\n publi...` | 0.0 | Question: Write a program to find the largest ... | Question: Write a program to find the largest ... |
| 1 | Write a program to find the largest element in... | `fun findLargestElement(array: IntArray) : Int ...` | `public class Main {\n public static void ma...` | 0.0 | Question: Write a program to find the largest ... | Question: Write a program to find the largest ... |
| 2 | Write a program to find the largest element in... | `fun findLargestElement(array: IntArray) : Int ...` | `public class Main {\n public static void main...` | 0.0 | Question: Write a program to find the largest ... | Question: Write a program to find the largest ... |
| 3 | Write a program to find the largest element in... | `fun findLargestElement(array: IntArray) : Int ...` | `public class Main {\n public static void ma...` | 0.0 | Question: Write a program to find the largest ... | Question: Write a program to find the largest ... |
| 4 | Write a program to find the largest element in... | `fun findLargestElement(array: IntArray) : Int ...` | `public class Main {\n public static void ma...` | 0.0 | Question: Write a program to find the largest ... | Question: Write a program to find the largest ... |


Let's see the last five rows of the data.

In [9]:
df.tail()

| | question | candidate_answer | ai_answer | similarity_score | candidate_combined | ai_combined |
|---|---|---|---|---|---|---|
| 373 | Create a PHP script that will accept a string ... | `<?php\nfunction getTopThreeWords($text) {\n// ...` | `<?php\n\nfunction findTopThreeWords($input) {\...` | 0.3 | Question: Create a PHP script that will accept... | Question: Create a PHP script that will accept... |
| 374 | Create a PHP script that will accept a string ... | `<?php\nfunction getTopThreeWords($text) {\n// ...` | `function findTopWords($input) {\n // Remove...` | 0.3 | Question: Create a PHP script that will accept... | Question: Create a PHP script that will accept... |
| 375 | Create a PHP script that will accept a string ... | `<?php\nfunction getTopThreeWords($text) {\n// ...` | `<?php\n function mostCommonWords($input) {\...` | 0.3 | Question: Create a PHP script that will accept... | Question: Create a PHP script that will accept... |
| 376 | Create a PHP script that will accept a string ... | `<?php\nfunction getTopThreeWords($text) {\n// ...` | `<?php\n\nfunction getTopThreeWords($text) {\n ...` | 0.3 | Question: Create a PHP script that will accept... | Question: Create a PHP script that will accept... |
| 377 | Create a PHP script that will accept a string ... | `<?php\nfunction getTopThreeWords($text) {\n// ...` | `<?php\n\n// Function to get the top three most...` | 0.3 | Question: Create a PHP script that will accept... | Question: Create a PHP script that will accept... |


Let's check DataFrame info.

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378 entries, 0 to 377
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   question            378 non-null    object 
 1   candidate_answer    378 non-null    object 
 2   ai_answer           378 non-null    object 
 3   similarity_score    378 non-null    float64
 4   candidate_combined  378 non-null    object 
 5   ai_combined         378 non-null    object 
dtypes: float64(1), object(5)
memory usage: 17.8+ KB


Let's describe the dataset.

In [11]:
df.describe()

| | similarity_score |
|---|---|
| count | 378.0 |
| mean | 0.29709 |
| std | 0.267847 |
| min | 0.0 |
| 25% | 0.1 |
| 50% | 0.2 |
| 75% | 0.4 |
| max | 0.9 |


Let's check for null/empty values.

In [12]:
df.isnull().sum()

question              0
candidate_answer      0
ai_answer             0
similarity_score      0
candidate_combined    0
ai_combined           0
dtype: int64
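
Note that `isnull()` only counts missing values; an answer file that is empty or contains only whitespace would still load as a non-null string. A minimal sketch of an additional blank-string check follows, run on a made-up toy frame rather than the real data. The column names mirror two of the text columns above.

```python
import pandas as pd

# Toy frame with one whitespace-only answer, for illustration only.
toy = pd.DataFrame({
    "question": ["Print hello.", "Print hi."],
    "ai_answer": ["print('hello')", "   "],
    "similarity_score": [0.1, 0.2],
})

text_columns = ["question", "ai_answer"]
# Count cells that are empty once surrounding whitespace is removed.
blank_counts = toy[text_columns].apply(lambda col: col.str.strip().eq("").sum())
print(blank_counts)
```

Applied to `df`, the same expression with the five text columns would report any blank questions or answers.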

The preprocessed data looks good. We can proceed to the training phase.