# Importing our dataset (Final_Dataset.pkl)

In [1]:
import pandas as pd

merged_df = pd.read_pickle("Final_Dataset.pkl")

In [2]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2065 entries, 0 to 2089
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2065 non-null   object
 1   label   2065 non-null   object
dtypes: object(2)
memory usage: 48.4+ KB


# Feature Engineering Plan

This plan outlines the features engineered to distinguish human-written from AI-generated texts. They fall into four groups: linguistic features (including semantic coherence), readability, n-grams, and sentiment. Together they aim to capture differences in writing style and content between AI and human authors.

### 1. Linguistic Features
These features aim to capture the essence of how texts are constructed, focusing on the structural and lexical aspects.

- **Syntactic Complexity**: Measures how complex the structure of sentences is. AI may not reproduce the full range of ways humans build sentences, which makes this a useful signal of authorship (Lu, 2010). It is approximated here by the average sentence length in tokens.
- **Lexical Richness**: Measures the variety of words used. AI-written texts may use vocabulary differently because of their training data, so the Type-Token Ratio (TTR) and similar measures can reveal differences in lexical diversity (McCarthy and Jarvis, 2010).
- **Burstiness**: Describes how unevenly words are repeated within a text, which indicates whether the language feels natural or machine-made (Church and Gale, 1995). It is computed here as the standard deviation of word frequencies divided by their mean.
- **Perplexity**: Measures how predictable a text is to a language model. AI-written texts are often more predictable because generation tends to favour expected words (Juola, 2006). It is computed here with GPT-2.
- **Semantic Coherence**: Measures how logically and consistently ideas are connected. Human writing usually links ideas more smoothly than AI writing, which can jump from one idea to another less gracefully.
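
As a rough illustration of two of the measures above, the sketch below computes a type-token ratio and a burstiness value (standard deviation of word frequencies divided by their mean) on a toy sentence. The regular-expression tokenizer and the helper names `simple_tokens`, `type_token_ratio` and `burstiness` are assumptions made for this example; they are not the NLTK-based functions defined later in this notebook.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def simple_tokens(text):
    # Hypothetical tokenizer for this sketch: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    # Unique words (types) divided by total words (tokens).
    tokens = simple_tokens(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def burstiness(text):
    # Population standard deviation of word counts divided by their mean.
    counts = list(Counter(simple_tokens(text)).values())
    if not counts:
        return 0.0
    return pstdev(counts) / mean(counts)

sample = "the cat sat on the mat and the dog sat too"
print(type_token_ratio(sample))  # 8 types over 11 tokens
print(burstiness(sample))
```

A text in which every word appears exactly once has a burstiness of 0 under this definition; repeated words push the value up.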

### 2. Readability Feature
Readability scores estimate how easy or hard a text is to understand. A human writer often adjusts difficulty deliberately for an audience, whereas AI may not, so these scores can flag texts whose difficulty level has not been tuned on purpose (Kincaid et al., 1975). The Flesch Reading Ease score is used here.
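
For reference, the Flesch Reading Ease score computed later in this notebook can be sketched from raw counts. The function name `flesch_from_counts` and the counts passed to it are made up for illustration; `textstat` derives its own word, sentence and syllable counts from the text, so its results may differ from a hand count.

```python
def flesch_from_counts(total_words, total_sentences, total_syllables):
    # Standard Flesch Reading Ease formula; higher scores indicate easier text.
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
print(round(flesch_from_counts(100, 5, 150), 3))
```

Longer sentences and longer words both lower the score, which is why it serves as a coarse difficulty signal.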

### 3. N-Gram Feature
N-grams are sequences of n consecutive words. They expose recurring language patterns that are useful for authorship attribution. AI-written texts may reuse the same n-grams repeatedly because those sequences were frequent in their training data, producing a different pattern from human-written texts (Stamatatos, 2009). The features used here are the counts of unique bigrams and trigrams in each text.
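
A minimal sketch of counting unique n-grams, assuming whitespace tokenization (the notebook itself uses NLTK's `word_tokenize` and `ngrams`); the helper name `unique_ngram_count` is hypothetical.

```python
def unique_ngram_count(tokens, n=2):
    # Slide a window of length n over the tokens and count the distinct windows.
    grams = zip(*(tokens[i:] for i in range(n)))
    return len(set(grams))

tokens = "to be or not to be".split()
print(unique_ngram_count(tokens, 2))  # the bigram ("to", "be") occurs twice but is counted once
print(unique_ngram_count(tokens, 3))
```

Because the count grows with text length, it may be worth comparing it against the total number of n-grams when texts differ a lot in size.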

### 4. Sentiment Feature
Sentiment analysis checks the emotional tone of texts. People can write with a rich mix of emotions, but AI might either not get the subtlety right or might show emotions in a more predictable way. This difference can help us tell apart texts written by humans from those generated by AI (Liu, 2012).

Together, these features form the basis of a system for distinguishing human-written texts from AI-generated ones and for understanding how the two differ.

In [3]:
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
import textstat
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import BertModel, BertTokenizer, GPT2LMHeadModel, GPT2Tokenizer
import torch
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
# Converting labels to binary
merged_df['label'] = (merged_df['label'] == 'AI-written').astype(int)

# Function to calculate Flesch Reading Ease score using textstat
def readability_score(text):
    return textstat.flesch_reading_ease(text)

# Applying the readability score function to our data
merged_df['readability_score'] = merged_df['text'].apply(readability_score)

# Loading BERT model and tokenizer for embeddings
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


## Syntactic Complexity, Lexical Richness, Semantic Coherence, Sentiment Score, Burstiness and Perplexity features:

In [5]:
def syntactic_complexity(text):
    # Breaking the text into sentences, then tokenizing each sentence into words.
    sentences = sent_tokenize(text)
    tokens = [word_tokenize(sentence) for sentence in sentences]
    # Counting the total number of sentences and tokens.
    total_sentences = len(sentences)
    total_tokens = sum(len(sentence_tokens) for sentence_tokens in tokens)
    # Calculating the average length of a sentence as a measure of complexity.
    avg_sentence_length = total_tokens / total_sentences if total_sentences > 0 else 0
    return avg_sentence_length

def lexical_richness(text):
    # Tokenizing the text and finding the set of unique words (types).
    tokens = word_tokenize(text)
    types = set(tokens)
    # The Type-Token Ratio (TTR) is the number of unique words divided by the total number of words.
    ttr = len(types) / len(tokens) if len(tokens) > 0 else 0
    return ttr

def semantic_coherence(text, num_topics=5):
    # Lowercasing and tokenizing the text; the whole text is treated as a single document.
    tokens = [word_tokenize(text.lower())]
    # Returning NaN if there are no tokens.
    if not tokens or all(not token for token in tokens):
        return np.nan
    dictionary = Dictionary(tokens)
    if len(dictionary) == 0:
        return np.nan
    corpus = [dictionary.doc2bow(token) for token in tokens]
    if not corpus:
        return np.nan
    # Performing LDA to find topics and averaging the per-topic coherence scores.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
    top_topics = lda.top_topics(corpus)
    coherence_score = sum(score for topic, score in top_topics) / num_topics if top_topics else np.nan
    return coherence_score

def sentiment_score(text):
    analyser = SentimentIntensityAnalyzer()
    # Getting the compound sentiment score, which ranges from -1 (negative) to 1 (positive).
    score = analyser.polarity_scores(text)
    return score['compound']

from nltk import FreqDist

def calculate_burstiness(text):
    # Tokenizing the text and calculating the frequency distribution of words
    words = word_tokenize(text.lower())
    freq_dist = FreqDist(words)
    frequencies = list(freq_dist.values())
    # Returning 0 for texts with no tokens, which avoids NumPy's empty-slice warnings
    if not frequencies:
        return 0
    # Calculating mean and standard deviation of word frequencies
    mean_freq = np.mean(frequencies)
    std_dev_freq = np.std(frequencies)
    # Burstiness is the standard deviation divided by the mean frequency
    burstiness = std_dev_freq / mean_freq if mean_freq > 0 else 0
    return burstiness


# Setting up the tokenizer and model for GPT-2
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained("gpt2")
model_gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

def calculate_perplexity(text):
    # If the text is empty or whitespace, return NaN.
    if not text.strip():
        return float('nan')
    # Tokenizing the input text, truncating to the model's maximum length
    tokenize_input = tokenizer_gpt2.encode(text, add_special_tokens=True, max_length=1024, truncation=True)
    # If no tokens, return NaN
    if len(tokenize_input) == 0:
        return float('nan')
    # Converting tokens to a tensor and calculating the loss with the GPT-2 model
    tensor_input = torch.tensor([tokenize_input]).to(model_gpt2.device)
    with torch.no_grad():
        outputs = model_gpt2(tensor_input, labels=tensor_input)
        loss = outputs.loss
    # Perplexity is the exponential of the loss, measuring how well the model predicts the text
    return np.exp(loss.item())
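
To make the last step concrete, the sketch below computes perplexity by hand as the exponential of the mean negative log-probability per token, which is the same relationship the function above applies to the loss returned by GPT-2. The token probabilities and the helper name `perplexity_from_probs` are invented for illustration and do not come from any model.

```python
import math

def perplexity_from_probs(token_probs):
    # Perplexity = exp(mean negative log-probability per token).
    negative_log_probs = [-math.log(p) for p in token_probs]
    return math.exp(sum(negative_log_probs) / len(negative_log_probs))

# A model that assigns probability 0.25 to every token is as uncertain as a 4-way guess.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
# Higher token probabilities (a more predictable text) give a lower perplexity.
print(perplexity_from_probs([0.9, 0.8, 0.95]))
```

Lower values therefore mean the text is more predictable to the scoring model, which is the property the perplexity feature relies on.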

### Applying the above functions to our dataframe:

In [6]:
merged_df['syntactic_complexity'] = merged_df['text'].apply(syntactic_complexity)
merged_df['lexical_richness'] = merged_df['text'].apply(lexical_richness)
# readability_score was already computed in cell 4, so it is not recomputed here
merged_df['sentiment_score'] = merged_df['text'].apply(sentiment_score)
merged_df['burstiness'] = merged_df['text'].apply(calculate_burstiness)
merged_df['semantic_coherence'] = merged_df['text'].apply(semantic_coherence)



In [7]:
merged_df['perplexity'] = merged_df['text'].apply(calculate_perplexity)

## N-Gram Feature:

In [8]:
from nltk import word_tokenize
from nltk.util import ngrams

In [9]:
def generate_ngram_features(text, n=2):
    # First, breaking the text down into individual words
    tokens = word_tokenize(text)
    # Then, creating n-grams from these tokens. An n-gram is a sequence of 'n' tokens
    ngrams_list = list(ngrams(tokens, n))
    # We're interested in the unique n-grams here, so converting the list to a set and counting them
    return len(set(ngrams_list))

In [10]:
merged_df['unique_bigrams'] = merged_df['text'].apply(lambda x: generate_ngram_features(x, 2))
merged_df['unique_trigrams'] = merged_df['text'].apply(lambda x: generate_ngram_features(x, 3))

In [11]:
merged_df

| | text | label | readability_score | syntactic_complexity | lexical_richness | sentiment_score | burstiness | semantic_coherence | perplexity | unique_bigrams | unique_trigrams |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | In the aftermath of the Nakaba, the Palestinia... | 1 | 70.23 | 21.294118 | 0.424033 | -0.9650 | 2.032830 | 1.000089e-12 | 6.519134 | 720 | 722 |
| 1 | The Rafal crossing is a major source of humani... | 1 | 51.07 | 23.681818 | 0.424184 | -0.9948 | 1.394122 | 1.000089e-12 | 7.640199 | 514 | 519 |
| 2 | Hezbollah has also said that it has launched r... | 1 | 58.62 | 23.000000 | 0.869565 | 0.1280 | 0.310497 | 1.000089e-12 | 11.168272 | 22 | 21 |
| 3 | A number of people were injured, including a w... | 1 | 59.30 | 13.500000 | 0.925926 | -0.3400 | 0.390209 | 1.000089e-12 | 17.722048 | 26 | 25 |
| 4 | Nadeem Anjarwalla, the regiona of Nigerias cap... | 1 | 41.36 | 15.000000 | 0.866667 | -0.7579 | 0.408248 | 1.000089e-12 | 68.086239 | 29 | 28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2085 | It is a cinematic masterpiece that delves into... | 0 | 47.52 | 25.600000 | 0.703125 | 0.9697 | 1.102310 | 1.000089e-12 | 31.175757 | 126 | 126 |
| 2086 | It takes audiences on an unforgettable cinemat... | 0 | 30.20 | 26.800000 | 0.731343 | 0.9080 | 0.889031 | 1.000089e-12 | 31.824208 | 131 | 132 |
| 2087 | It is a timeless tale of hope and resilience t... | 0 | 39.87 | 25.400000 | 0.692913 | 0.9565 | 1.056791 | 1.000089e-12 | 26.974663 | 123 | 125 |
| 2088 | Andy Dufresne Tim Robbins is a banker convicte... | 0 | 76.52 | 14.923077 | 0.618557 | 0.8931 | 1.058548 | 1.000089e-12 | 55.711385 | 179 | 189 |
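
The `semantic_coherence` column holds the same value, 1.000089e-12, in every row displayed above, so as computed it is unlikely to help separate the two classes. One plausible explanation, assuming gensim's `top_topics` scores topics with UMass coherence and a smoothing constant of 1e-12 (an assumption about library internals, not something shown in this notebook), is that each text is passed to LDA as a single document. Every word pair then co-occurs in one document out of one, and the score collapses to roughly log(1 + 1e-12). The arithmetic can be sketched as:

```python
import math

EPSILON = 1e-12         # assumed smoothing constant
num_docs = 1            # each text is modelled as a single document
co_occurrence_docs = 1  # documents containing both words of a pair
word_docs = 1           # documents containing the conditioning word

# UMass-style score for one word pair under the assumptions above.
score = math.log((co_occurrence_docs / num_docs + EPSILON) / (word_docs / num_docs))
print(score)
```

If this explanation holds, splitting each text into sentences and treating those as separate documents would be one way to obtain a coherence score that varies between texts.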


In [12]:
merged_df.to_pickle("Features.pkl")

# References:

1. Lu, X. (2010) Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), pp. 474-496.
2. McCarthy, P.M. and Jarvis, S. (2010) MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), pp. 381-392.
3. Church, K.W. and Gale, W.A. (1995) Poisson mixtures. Natural Language Engineering, 1(2), pp. 163-190.
4. Juola, P. (2006) Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), pp. 233-334.
5. Kincaid, J.P., Fishburne, R.P., Jr., Rogers, R.L., and Chissom, B.S. (1975) Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. Research Branch Report 8-75, Naval Technical Training, U. S. Naval Air Station, Memphis, TN.
6. Stamatatos, E. (2009) A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), pp. 538-556.
7. Liu, B. (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), pp. 1-167.