# Tier B - The Semanticist

This section focuses more on the vector embeddings of the text, and their meanings.  

**How do we capture that?**  
Here, every sentence / piece of text is mapped to a vector in a high-dimensional vector space. Texts with similar meaning will have similar embeddings in that space. For example, the phrase *"Ferb is smart"* will be mapped to a vector which is very close to the vector for *"Ferb is intelligent."* But *"Ferb is smart"* will be very far off from the phrase *"Curse you Perry the platypus!!!"*
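
To make "close" and "far" concrete, here is a minimal sketch using cosine similarity, a common way to compare embeddings. The 4-dimensional vectors below are hand-made toy values (an assumption for illustration only, not real Word2Vec output), and `cosine_similarity` is a helper defined here, not part of the notebook's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction, 0.0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings: hand-picked numbers, NOT produced by any model.
toy_embeddings = {
    "Ferb is smart": np.array([0.9, 0.8, 0.1, 0.0]),
    "Ferb is intelligent": np.array([0.85, 0.75, 0.15, 0.05]),
    "Curse you Perry the platypus!!!": np.array([0.0, 0.1, 0.9, 0.8]),
}

sim_close = cosine_similarity(
    toy_embeddings["Ferb is smart"], toy_embeddings["Ferb is intelligent"]
)
sim_far = cosine_similarity(
    toy_embeddings["Ferb is smart"], toy_embeddings["Curse you Perry the platypus!!!"]
)

print(f"similar meaning:   {sim_close:.3f}")
print(f"different meaning: {sim_far:.3f}")
```

With real embeddings the numbers would differ, but the idea is the same: similar meanings give a similarity near 1, unrelated texts give a much lower value.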

# Implementation

We use [Word2Vec](https://arxiv.org/pdf/1301.3781) embeddings with a Multi-Layer Perceptron (MLP). The MLP will classify text based purely on semantic vector embeddings. This is a test to see if our dataset separates texts based on topics or on actual authorship patterns.

[Chiny et al., 2023](https://www.researchgate.net/publication/366138831_Effect_of_word_embedding_vector_dimensionality_on_sentiment_analysis_through_short_and_long_texts/citations) is a resource I used to learn about vector embeddings, as well as figure out how many dimensions I need.

## Justification for High-Dimensional Embeddings (Tier B)

**Decision:**
We selected **300-dimensional Word2Vec vector embeddings** by Google (`word2vec-google-news-300`) over the standard 50-dimensional versions for the Semanticist Tier.

**Rationale:**
1. The standard 50-dimensional embeddings (GloVe `glove.6B`) are trained on a corpus of 6 billion tokens (Wikipedia 2014 + Gigaword 5). In contrast, the 300-dimensional GloVe embeddings are trained on 840 billion tokens from the [Common Crawl](https://commoncrawl.org/) dataset, a 140x increase in exposure to linguistic contexts. The `word2vec-google-news-300` model used here was trained on roughly 100 billion words of Google News, still far more than the 50-dimensional baseline. The larger corpus lets the model capture finer semantic distinctions (e.g., the difference between "melancholy" and "sad").
2. Tier B utilizes a "Bag of Means" approach [2], where individual word vectors are averaged to create a single paragraph vector. This averaging process is inherently destructive to information (word order, for instance, is lost entirely).
    * Averaging low-dimensional vectors (50d) leaves few degrees of freedom, so paragraphs by distinct authors can collapse onto nearly identical vectors.
    * High-dimensional vectors (300d) provide a larger mathematical "volume," making it more likely that the averaged vector retains enough unique signal to separate the human authors from AI.
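
The "Bag of Means" idea, and why it is destructive, can be sketched with a tiny example. The 3-dimensional word vectors and the `bag_of_means` helper below are hypothetical stand-ins for the real 300-d embeddings, chosen only to show that two sentences with different meanings can average to the same vector.

```python
import numpy as np

# Hypothetical 3-dimensional toy word vectors (stand-ins for real 300-d embeddings).
toy_wv = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def bag_of_means(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

v1 = bag_of_means("dog bites man".split(), toy_wv)
v2 = bag_of_means("man bites dog".split(), toy_wv)

# Word order is discarded by averaging, so both sentences get the same vector.
print(np.allclose(v1, v2))
```

Whatever survives the averaging has to live in the word choices themselves, which is why the richness of each individual word vector matters.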

## NOTE on why Word2Vec instead of GloVe
- We chose Word2Vec over GloVe for its Skip-gram and CBOW architecture, which is better for larger vocabularies. [source](https://stackoverflow.com/questions/56071689/whats-the-major-difference-between-glove-and-word2vec)

An Explanation:

Word2Vec has a pretty simple architecture: it's just a 2-layer NN. It can deal with new words being added to the vocabulary, and it preserves the relationships between words. Training Word2Vec is quite resource intensive, though, as it requires a large amount of RAM to store the vocabulary of the corpus.

- CBOW (Continuous Bag of Words): predicts the target word from the context.
- Skip-gram: predicts context words from the target word.
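
As a rough illustration of the difference, the sketch below builds the training examples each objective would see for one short sentence. The function names `skipgram_pairs` and `cbow_pairs` and the window size of 1 are my own choices for the example; this only shows how the (input, output) pairs are formed, not how Word2Vec is actually trained.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: each target word is used to predict each of its context words."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """CBOW: the surrounding context words together predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["ferb", "is", "smart"]
print(skipgram_pairs(tokens))
# [('ferb', 'is'), ('is', 'ferb'), ('is', 'smart'), ('smart', 'is')]
print(cbow_pairs(tokens))
# [(['is'], 'ferb'), (['ferb', 'smart'], 'is'), (['is'], 'smart')]
```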

**NOTE:** I also remove stopwords like 'and', 'in', 'the', etc., because they are just noise in the averaged embedding, as mentioned by [2].

**References:**
1. [*Pennington et al., 2014. GloVe: Global Vectors for Word Representation.*](https://nlp.stanford.edu/pubs/glove.pdf)
2. [*Chiny et al., 2023. Effect of word embedding vector dimensionality on sentiment analysis through short and long texts (International Journal of Social Sciences Bulletin).*](https://www.researchgate.net/publication/366138831_Effect_of_word_embedding_vector_dimensionality_on_sentiment_analysis_through_short_and_long_texts/citations)

In [None]:
import pandas as pd
import numpy as np
import gensim.downloader as api
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import glob
import os
from pathlib import Path
from tqdm import tqdm

EMBEDDING_MODEL_NAME = 'word2vec-google-news-300' 
DATASET_DIR = Path('../../dataset')

STOPWORDS = set([
    'the', 'and', 'is', 'in', 'to', 'of', 'a', 'an', 'that', 'it', 'for', 'on', 
    'with', 'as', 'was', 'at', 'by', 'be', 'this', 'which', 'or', 'from'
])

print(f"Loading {EMBEDDING_MODEL_NAME}... (This may take a few minutes on first run)")
wv = api.load(EMBEDDING_MODEL_NAME)
print("Embeddings loaded successfully.")

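def text_to_average_vector(text):
    """
    Converts a paragraph into a single 300-dimensional vector
    by averaging the vectors of its meaningful words.
    """
    dim = wv.vector_size  # 300 for word2vec-google-news-300
    if not isinstance(text, str):
        return np.zeros(dim)
    # NOTE: whitespace splitting keeps punctuation attached ("smart." != "smart"),
    # and the Google News vectors are case-sensitive, so after lowercasing,
    # capitalised vocabulary entries (e.g. proper nouns) are never matched.
    # Tokens that are not in the vocabulary are simply skipped.
    words = text.lower().split()

    valid_vectors = [
        wv[word] for word in words
        if word in wv and word not in STOPWORDS
    ]

    # No known, non-stopword tokens: fall back to the zero vector
    if len(valid_vectors) == 0:
        return np.zeros(dim)

    return np.mean(valid_vectors, axis=0)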

def load_texts_from_directory(directory_path, class_label):
    """Load all text files from a directory"""
    data = []
    txt_files = glob.glob(os.path.join(str(directory_path), '*.txt'))
    
    for file_path in tqdm(txt_files, desc=f"  Loading {class_label}"):
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read().strip()
            if text:
                data.append({
                    'text': text,
                    'label': class_label,
                    'file_name': os.path.basename(file_path)
                })
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    
    return data

# Load Class 1 (Human-written)
print("\nLoading Class 1 (Human-written)...")
class1_data = []
for author in ['01-arthur-conan-doyle', '02-pg-wodehouse', '03-mark-twain', '04-william-shakespeare']:
    path = DATASET_DIR / 'class1-human-written' / author / 'extracted_paragraphs'
    class1_data.extend(load_texts_from_directory(path, 0))

# Load Class 2 (AI-written)
print("\nLoading Class 2 (AI-written)...")
class2_path = DATASET_DIR / 'class2-ai-written' / 'ai-generated-paragraphs'
class2_data = load_texts_from_directory(class2_path, 1)

# Load Class 3 (AI-mimicry)
print("\nLoading Class 3 (AI-mimicry)...")
class3_data = []
for author in ['01-arthur-conan-doyle', '02-pg-wodehouse', '03-mark-twain', '04-william-shakespeare']:
    path = DATASET_DIR / 'class3-ai-mimicry' / author
    class3_data.extend(load_texts_from_directory(path, 2))

# Combine all data
all_data = class1_data + class2_data + class3_data
df = pd.DataFrame(all_data)

print(f"\nDataset loaded: {len(df)} total samples")
print(f"  Class 1 (Human): {len(class1_data)}")
print(f"  Class 2 (AI): {len(class2_data)}")
print(f"  Class 3 (AI-mimicry): {len(class3_data)}")

print("\nVectorizing paragraphs")
X = np.array([text_to_average_vector(text) for text in tqdm(df['text'], desc="Vectorizing")])
y = df['label'].values

print("\nBINARY CLASSIFICATION RESULTS - SEMANTICIST")

# Binary Classification 1: Class 1 vs Class 2
mask_12 = (y == 0) | (y == 1)
X_12, y_12 = X[mask_12], y[mask_12]
df_12 = df[mask_12].reset_index(drop=True)

X_train_12, X_test_12, y_train_12, y_test_12, idx_train_12, idx_test_12 = train_test_split(
    X_12, y_12, df_12.index, test_size=0.2, stratify=y_12, random_state=42
)

clf_12 = MLPClassifier(
    hidden_layer_sizes=(128, 64), 
    activation='relu',
    solver='adam',
    max_iter=500,      
    random_state=42,    
    early_stopping=True 
)
clf_12.fit(X_train_12, y_train_12)
y_pred_12 = clf_12.predict(X_test_12)
accuracy_12 = accuracy_score(y_test_12, y_pred_12)
print(f"\nClass 1 (Human) vs Class 2 (AI): {accuracy_12:.4f} ({accuracy_12*100:.2f}%)")
print("Confusion Matrix:")
print(confusion_matrix(y_test_12, y_pred_12))

# Binary Classification 2: Class 1 vs Class 3
mask_13 = (y == 0) | (y == 2)
X_13, y_13 = X[mask_13], y[mask_13]
df_13 = df[mask_13].reset_index(drop=True)
# Remap: 0 stays 0, 2 becomes 1
y_13_binary = np.where(y_13 == 2, 1, y_13)

X_train_13, X_test_13, y_train_13, y_test_13, idx_train_13, idx_test_13 = train_test_split(
    X_13, y_13_binary, df_13.index, test_size=0.2, stratify=y_13_binary, random_state=42
)

clf_13 = MLPClassifier(
    hidden_layer_sizes=(128, 64), 
    activation='relu',
    solver='adam',
    max_iter=500,      
    random_state=42,    
    early_stopping=True 
)
clf_13.fit(X_train_13, y_train_13)
y_pred_13 = clf_13.predict(X_test_13)
accuracy_13 = accuracy_score(y_test_13, y_pred_13)
print(f"\nClass 1 (Human) vs Class 3 (AI-mimicry): {accuracy_13:.4f} ({accuracy_13*100:.2f}%)")
print("Confusion Matrix:")
print(confusion_matrix(y_test_13, y_pred_13))

# Binary Classification 3: Class 2 vs Class 3
mask_23 = (y == 1) | (y == 2)
X_23, y_23 = X[mask_23], y[mask_23]
df_23 = df[mask_23].reset_index(drop=True)
# Remap: 1 becomes 0, 2 becomes 1
y_23_binary = np.where(y_23 == 1, 0, 1)

X_train_23, X_test_23, y_train_23, y_test_23, idx_train_23, idx_test_23 = train_test_split(
    X_23, y_23_binary, df_23.index, test_size=0.2, stratify=y_23_binary, random_state=42
)

clf_23 = MLPClassifier(
    hidden_layer_sizes=(128, 64), 
    activation='relu',
    solver='adam',
    max_iter=500,      
    random_state=42,    
    early_stopping=True 
)
clf_23.fit(X_train_23, y_train_23)
y_pred_23 = clf_23.predict(X_test_23)
accuracy_23 = accuracy_score(y_test_23, y_pred_23)
print(f"\nClass 2 (AI) vs Class 3 (AI-mimicry): {accuracy_23:.4f} ({accuracy_23*100:.2f}%)")
print("Confusion Matrix:")
print(confusion_matrix(y_test_23, y_pred_23))

# Keep the full model for later analysis
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = MLPClassifier(
    hidden_layer_sizes=(128, 64), 
    activation='relu',
    solver='adam',
    max_iter=500,      
    random_state=42,    
    early_stopping=True 
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nMulti-class (All 3 classes): {accuracy:.4f} ({accuracy*100:.2f}%)")

Loading word2vec-google-news-300... (This may take a few minutes on first run)
Embeddings loaded successfully.

Loading Class 1 (Human-written)...


  Loading 0: 100%|██████████| 500/500 [00:00<00:00, 8332.25it/s]
  Loading 0: 100%|██████████| 500/500 [00:00<00:00, 5328.64it/s]
  Loading 0: 100%|██████████| 480/480 [00:00<00:00, 8313.51it/s]
  Loading 0: 100%|██████████| 480/480 [00:00<00:00, 8530.57it/s]



Loading Class 2 (AI-written)...


  Loading 1: 100%|██████████| 988/988 [00:00<00:00, 6344.53it/s]



Loading Class 3 (AI-mimicry)...


  Loading 2: 100%|██████████| 250/250 [00:00<00:00, 7902.51it/s]
  Loading 2: 100%|██████████| 250/250 [00:00<00:00, 7493.90it/s]
  Loading 2: 100%|██████████| 237/237 [00:00<00:00, 6295.56it/s]
  Loading 2: 100%|██████████| 236/236 [00:00<00:00, 7482.07it/s]



Dataset loaded: 3921 total samples
  Class 1 (Human): 1960
  Class 2 (AI): 988
  Class 3 (AI-mimicry): 973

Vectorizing paragraphs


Vectorizing: 100%|██████████| 3921/3921 [00:00<00:00, 6839.70it/s]



BINARY CLASSIFICATION RESULTS - SEMANTICIST

Class 1 (Human) vs Class 2 (AI): 0.9746 (97.46%)
Confusion Matrix:
[[388   4]
 [ 11 187]]

Class 1 (Human) vs Class 3 (AI-mimicry): 0.9591 (95.91%)
Confusion Matrix:
[[377  15]
 [  9 186]]

Class 2 (AI) vs Class 3 (AI-mimicry): 0.9695 (96.95%)
Confusion Matrix:
[[190   8]
 [  4 191]]

Multi-class (All 3 classes): 0.9618 (96.18%)


## Misclassified Texts

Now let's save all misclassified examples to understand where the Semanticist model fails.

In [None]:
import os

# Create output directory
output_dir = 'semanticist_misclassified'
os.makedirs(output_dir, exist_ok=True)

# We need to map the test set back to rows of the full dataset.
# Repeating the split on the row positions with the same parameters
# (test_size, stratify, random_state) reproduces the same test rows.
from sklearn.model_selection import train_test_split as tts
_, test_idx = tts(range(len(df)), test_size=0.2, stratify=y, random_state=42)

# Sanity check: the recovered rows carry the same labels as y_test
assert (y[test_idx] == y_test).all()

results_df = pd.DataFrame({
    'actual': y_test,
    'predicted': y_pred,
    'file_name': df.iloc[test_idx]['file_name'].values
})

reverse_mapping = {0: 'Class 1: Human-written', 1: 'Class 2: AI-written', 2: 'Class 3: AI-mimicry'}

# Define misclassification categories
categories = [
    (0, 1, 'class1_as_class2.txt', 'Class 1 (Human) misclassified as Class 2 (AI)'),
    (0, 2, 'class1_as_class3.txt', 'Class 1 (Human) misclassified as Class 3 (AI-mimicry)'),
    (1, 0, 'class2_as_class1.txt', 'Class 2 (AI) misclassified as Class 1 (Human)'),
    (1, 2, 'class2_as_class3.txt', 'Class 2 (AI) misclassified as Class 3 (AI-mimicry)'),
    (2, 0, 'class3_as_class1.txt', 'Class 3 (AI-mimicry) misclassified as Class 1 (Human)'),
    (2, 1, 'class3_as_class2.txt', 'Class 3 (AI-mimicry) misclassified as Class 2 (AI)')
]

total_saved = 0

for actual_class, predicted_class, filename, description in categories:
    # Filter misclassified examples for this category
    category_misclassified = results_df[(results_df['actual'] == actual_class) & 
                                        (results_df['predicted'] == predicted_class)]
    
    if len(category_misclassified) == 0:
        continue
    
    filepath = os.path.join(output_dir, filename)
    
    with open(filepath, 'w', encoding='utf-8') as outfile:
        outfile.write("-" * 80 + "\n")
        outfile.write(f"{description}\n")
        outfile.write(f"Total: {len(category_misclassified)} files\n")
        outfile.write("-" * 80 + "\n\n")
        
        for idx, row in category_misclassified.iterrows():
            text_file = row['file_name']
            actual = row['actual']
            predicted = row['predicted']
            
            # Map the actual label to its dataset folder
            class_folders = {
                0: 'class1-human-written',
                1: 'class2-ai-written',
                2: 'class3-ai-mimicry'
            }
            actual_class_folder = class_folders[actual]
            
            # Find the file in the dataset
            base_path = '../../dataset'
            file_path = None
            
            # Search the class folder for the file by name.
            # NOTE: only the base name was stored, so if two author folders
            # contain the same file name, the first match from os.walk is used.
            for root, dirs, files in os.walk(os.path.join(base_path, actual_class_folder)):
                if text_file in files:
                    file_path = os.path.join(root, text_file)
                    break
            
            if file_path and os.path.exists(file_path):
                outfile.write("-" * 80 + "\n")
                outfile.write(f"File: {text_file}\n")
                outfile.write(f"Actual: {reverse_mapping[actual]}\n")
                outfile.write(f"Predicted: {reverse_mapping[predicted]}\n")
                outfile.write("-" * 80 + "\n")
                
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                        outfile.write(content)
                except Exception as e:
                    outfile.write(f"Error reading file: {e}\n")
                
                outfile.write("\n\n")
            else:
                outfile.write(f"Could not find file: {text_file}\n\n")
    
    total_saved += len(category_misclassified)
    print(f"Saved {len(category_misclassified)} files to {filename}")

print(f"\nTotal: {total_saved} misclassified text files saved to {output_dir}/")

Saved 1 files to class1_as_class2.txt
Saved 9 files to class1_as_class3.txt
Saved 6 files to class2_as_class1.txt
Saved 11 files to class3_as_class1.txt
Saved 3 files to class3_as_class2.txt

Total: 30 misclassified text files saved to semanticist_misclassified/


# Results

- Class 1 vs Class 2 reports 97.46% accuracy.
- Class 1 vs Class 3 reports 95.91% accuracy.
- Class 2 vs Class 3 reports 96.95% accuracy.
- Multi-class (all 3 classes) reports 96.18% accuracy.

This is amazing accuracy. The semanticist is far outperforming the statistician, which suggests that the averaged Word2Vec embeddings pick up on real differences between human and AI texts.
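
As a quick check on where these numbers come from, the sketch below recomputes accuracy and per-class recall from the confusion matrices printed above (rows are actual classes, columns are predicted classes, following scikit-learn's convention). The `summarize` helper is something I'm adding for illustration; it is not part of the original pipeline.

```python
import numpy as np

# Confusion matrices copied from the cell output above
# (rows = actual class, columns = predicted class).
confusion = {
    "Class 1 vs Class 2": np.array([[388, 4], [11, 187]]),
    "Class 1 vs Class 3": np.array([[377, 15], [9, 186]]),
    "Class 2 vs Class 3": np.array([[190, 8], [4, 191]]),
}

def summarize(cm):
    """Return (accuracy, per-class recall) for a square confusion matrix."""
    accuracy = np.trace(cm) / cm.sum()      # correct predictions / all predictions
    recall = np.diag(cm) / cm.sum(axis=1)   # correct per class / actual per class
    return accuracy, recall

for name, cm in confusion.items():
    accuracy, recall = summarize(cm)
    print(f"{name}: accuracy={accuracy:.4f}, recall per class={np.round(recall, 4)}")
```

Per-class recall is worth a look alongside accuracy because the human class has roughly twice as many test samples as either AI class.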

Now, I'll try to analyse how / why I got these findings.


## What We Found
The semanticist significantly outperformed the statistician. Even though the AI can count commas and mimic sentence lengths, it fails to fake the semantics which humans are able to uniquely author.

Detecting class 2 was very easy. The model appears to have picked up on the AI sticking to very safe, average word choices (low variance): the human authors used concrete, specific nouns, while the generic AI used broader, more predictable language.

It detected class-3 quite easily as well, showing that despite good prompt engineering, the semantic structures are still quite different.

# Visualising the vector space

I now want to see how the classes are laid out in the multi-dimensional vector space. Since it is impossible for us to envision a 4th dimension, let alone the 300th, I will use a technique called dimensionality reduction. One such technique is PCA, which projects the vectors onto the directions of greatest variance, so the overall structure is preserved as far as possible in the new 3D vector space. This can be done using scikit-learn's `PCA` class. scikit-learn also offers t-SNE, which is good for preserving local clusters and seeing neighbours.
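
To show what PCA is doing under the hood, here is a minimal sketch that reduces 300 dimensions to 3 using NumPy's SVD. It runs on random synthetic data (an assumption, standing in for the real paragraph embeddings), and `pca_project` is a hypothetical helper written for this example; the notebook itself uses scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for an (n_paragraphs, 300) matrix of averaged embeddings.
X_demo = rng.normal(size=(200, 300))

def pca_project(X, n_components=3):
    """Project X onto its top principal components using SVD.

    Returns the projected data and the fraction of variance each component explains.
    """
    X_centered = X - X.mean(axis=0)               # PCA requires mean-centred data
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T  # coordinates along the top directions
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return projected, ratio

X_demo_3d, ratio = pca_project(X_demo, n_components=3)
print(f"Reduced shape: {X_demo_3d.shape}")
print(f"Variance explained by 3 components: {ratio.sum():.2%}")
```

The explained-variance ratio tells us how much of the original spread survives the projection. A low value means the 3D picture is only a partial view of the 300D space.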

In [14]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D 
from sklearn.decomposition import PCA
import pandas as pd

# This is where we use the 300-dimensional vector embeddings from earlier!
# X_test contains the averaged Word2Vec embeddings for each test paragraph
print("Reducing 300 dimensions to 3 dimensions using PCA...")
print(f"Original shape: {X_test.shape} ({X_test.shape[0]} paragraphs × {X_test.shape[1]} dimensions)")

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_test)

print(f"Reduced shape: {X_3d.shape} ({X_3d.shape[0]} paragraphs × {X_3d.shape[1]} dimensions)")
print(f"Variance explained by 3 components: {sum(pca.explained_variance_ratio_):.2%}")

# Create dataframe for visualization
viz_df = pd.DataFrame(data=X_3d, columns=['PC1', 'PC2', 'PC3'])
viz_df['Label'] = y_test  # The actual class labels (0, 1, or 2)
label_map = {0: 'Human', 1: 'Generic AI', 2: 'Mimic AI'}
viz_df['Class'] = viz_df['Label'].map(label_map)

print("\nReady to visualize! The 300D semantic space has been compressed to 3D.")

Reducing 300 dimensions to 3 dimensions using PCA...
Original shape: (785, 300) (785 paragraphs × 300 dimensions)
Reduced shape: (785, 3) (785 paragraphs × 3 dimensions)
Variance explained by 3 components: 28.82%

Ready to visualize! The 300D semantic space has been compressed to 3D.


In [17]:
import plotly.express as px
import os

os.makedirs('semanticist-visuals', exist_ok=True)

color_map = {
    'Human': '#1f77b4',
    'Generic AI': '#ff7f0e',
    'Mimic AI': '#2ca02c'
}

fig_all = px.scatter_3d(
    viz_df,
    x='PC1',
    y='PC2',
    z='PC3',
    color='Class',
    color_discrete_map=color_map,
    title='All 3 Classes - Interactive 3D Semantic Space',
    labels={'PC1': 'PC 1', 'PC2': 'PC 2', 'PC3': 'PC 3'},
    opacity=0.6,
    height=800
)
fig_all.update_traces(marker=dict(size=4))
fig_all.update_layout(
    scene=dict(xaxis_title='PC 1', yaxis_title='PC 2', zaxis_title='PC 3'),
    font=dict(size=12)
)
fig_all.write_html('semanticist-visuals/all_classes.html')
print("All classes visualization saved")

print("\nAll visualizations saved to semanticist-visuals/ directory!")

All classes visualization saved

All visualizations saved to semanticist-visuals/ directory!


~ visualisation codes generated by Claude Sonnet 4.5

If anything, this just shows us why 3 dimensions would not have been good enough, imo: the 3 principal components explain only 28.82% of the variance.