# Goal of the final project

## Introduction

In recent years, we have witnessed a remarkable revolution in the field of artificial intelligence (AI). The rapid advancements in machine learning and neural networks have unlocked unprecedented capabilities, enabling AI systems to perform complex tasks and mimic human intelligence with remarkable accuracy. This revolution has had a profound impact on various aspects of our lives, including communication, entertainment, and information dissemination.

**One particular area of concern that has emerged in the wake of the AI revolution is the generation of synthetic content.** AI-powered systems are now capable of producing highly realistic and convincing text, images, videos, and audio that can be indistinguishable from those created by humans. While this technological advancement has brought about exciting possibilities, it has also raised important questions and challenges regarding the authenticity and reliability of the content we encounter.

The ability to identify content that was produced by AI has become increasingly crucial in today's digital landscape. The sheer scale and speed at which AI-generated content can be disseminated pose significant risks to individuals, organizations, and society as a whole. Misinformation, fake news, and fraudulent activities are some of the detrimental consequences that can arise if AI-generated content goes unchecked and unverified.

Moreover, the rise of deepfakes, which are AI-generated media that superimpose one person's likeness onto another's body, has heightened concerns about the potential for malicious misuse of AI technology. Deepfakes have the potential to deceive and manipulate individuals, erode trust in visual evidence, and create social and political chaos. In a world where anyone with access to AI algorithms can generate highly realistic but fabricated content, the need to distinguish between what is genuine and what is artificially created is paramount.

Identifying AI-generated content is also essential to preserve the integrity of intellectual property and protect creative works. AI algorithms are capable of creating original music compositions, artwork, and even written works. Ensuring proper attribution and copyright protection becomes more challenging when AI systems can mimic the style and creativity of human creators. Differentiating between human-authored and AI-generated content helps maintain the ethical and legal boundaries surrounding intellectual property rights.

Additionally, understanding the origin and nature of AI-generated content is crucial for informed decision-making and public discourse. Transparently labeling AI-generated content allows individuals to assess the credibility and bias of the information they consume. It empowers users to make more informed judgments, fosters critical thinking, and safeguards against the manipulation of public opinion.

In conclusion, as the AI revolution continues to shape our digital landscape, the ability to identify content produced by AI has become vital. Doing so helps combat the spread of misinformation, protect intellectual property rights, and preserve the integrity of public discourse. By developing robust tools, guidelines, and awareness surrounding AI-generated content, we can navigate this new era of technological advancements more responsibly and confidently.

-- *Generated Introduction by ChatGPT from 30th May 2023*

## What we will do

We will try to develop a classifier that can identify AI-written content. This notebook provides a quick starting point and an overview of what we can do in our project. We might also generate further AI-written text with more advanced models such as GPT-3.x or later, and with other similar models. For this purpose we can get an API key from OpenAI and write a Python script to collect text from the newer GPT models quickly. We should talk about it next week...

# Create a baseline model

## Import datasets

Download the whole dataset from Kaggle into a local folder first; we may move it to the cloud later. (Another option is the GitHub repo from OpenAI, but that download takes longer and sometimes stops.)
https://www.kaggle.com/datasets/abhishek/gpt2-output-data
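
Before loading anything, it can help to confirm that the download is complete. The following is a minimal sketch, assuming the folder is called `data` and contains the 27 CSV files shown in the directory listing of this notebook; the helper names `expected_files` and `missing_files` are hypothetical.

```python
from pathlib import Path

# Names taken from the directory listing of the downloaded dataset
MODEL_SOURCES = ["small-117M", "medium-345M", "large-762M", "xl-1542M"]
SPLITS = ["train", "valid", "test"]


def expected_files() -> list[str]:
    """Return the CSV file names the notebook expects in the data folder."""
    prefixes = ["webtext"]  # human-written text
    for source in MODEL_SOURCES:
        # each GPT-2 size comes with and without Top-K 40 truncation
        prefixes += [source, f"{source}-k40"]
    return [f"{prefix}.{split}.csv" for prefix in prefixes for split in SPLITS]


def missing_files(data_dir: Path) -> list[str]:
    """List the expected files that are not present in data_dir."""
    return [name for name in expected_files() if not (data_dir / name).is_file()]


# "data" is an assumed folder name; adjust it to your local path
missing = missing_files(Path("data"))
print(f"{len(missing)} of {len(expected_files())} files are missing")
```

An empty `missing` list means all files from the listing are in place.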

In [1]:
# your folder should look like this
! ls

Prep_final_project.ipynb data


In [2]:
! cd data && ls

large-762M-k40.test.csv   medium-345M.test.csv      webtext.test.csv
large-762M-k40.train.csv  medium-345M.train.csv     webtext.train.csv
large-762M-k40.valid.csv  medium-345M.valid.csv     webtext.valid.csv
large-762M.test.csv       small-117M-k40.test.csv   xl-1542M-k40.test.csv
large-762M.train.csv      small-117M-k40.train.csv  xl-1542M-k40.train.csv
large-762M.valid.csv      small-117M-k40.valid.csv  xl-1542M-k40.valid.csv
medium-345M-k40.test.csv  small-117M.test.csv       xl-1542M.test.csv
medium-345M-k40.train.csv small-117M.train.csv      xl-1542M.train.csv
medium-345M-k40.valid.csv small-117M.valid.csv      xl-1542M.valid.csv


In [3]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

In [4]:
! pwd

/Users/michaelfiedler/code/michafdlr/data-final_project_prep


In [5]:
# make sure you select the correct path to your files
path_data = Path('/Users/michaelfiedler/code/michafdlr/data-final_project_prep/data')

In [6]:
# function to load data depending on size; loads human and AI written text
def load_data(source: str="xl-1542M", 
              truncation: bool=True,
              n_rows: int=500_000) -> dict[str, pd.DataFrame]:
    '''Load the data in dictionary of pandas Dataframes.
    ---
    source: specifies the outputs of a GPT-2 model
    
    ---
    truncation: specifies if Top-K 40 truncation data is used
    
    ---
    n_rows: specifies the total number of rows loaded per split (half human, half AI). Smaller values for testing the code.'''
    final_data={}
    for split in ["train", "valid", "test"]:
        data={}
        if truncation:
            file_path = path_data / f"{source}-k40.{split}.csv"
        else:    
            file_path = path_data / f"{source}.{split}.csv"
        data['fake'] = pd.read_csv(file_path, usecols=["text"], nrows=n_rows//2) # nrows to have balanced dataset
        data['fake']["AI"] = 1 # AI written
        
        file_path = path_data / f"webtext.{split}.csv"
        data['true'] = pd.read_csv(file_path, usecols=["text"], nrows=n_rows//2) # nrows to have balanced dataset
        data['true']["AI"] = 0 # not AI written
        
        final_data[split] = pd.concat([data["true"], data["fake"]])
        
    return final_data

In [7]:
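# load_data reads all three splits on every call, so one small call serves both valid and test
data_train = load_data(n_rows=500_000)["train"].reset_index(drop=True)
data_small = load_data(n_rows=10_000)
data_val = data_small["valid"].reset_index(drop=True)
data_test = data_small["test"].reset_index(drop=True)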

In [8]:
data_train

Unnamed: 0,text,AI
0,These girlfriends deserves a special mention f...,0
1,LeSean McCoy going through warmups with first ...,0
2,Tom Curran has been called up to England's Ash...,0
3,"We'll have turkey on the table Thursday but, a...",0
4,The 1945 Sinkings of the Cap Arcona and the Th...,0
...,...,...
499995,There are a lot of things that I don't like ab...,1
499996,A year after an unprecedented public outcry ag...,1
499997,Battles Between the English and the Scots\n\nT...,1
499998,Kurt Rambis is the new head coach of the Knick...,1


In [9]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    500000 non-null  object
 1   AI      500000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 7.6+ MB


In [10]:
data_val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    10000 non-null  object
 1   AI      10000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 156.4+ KB


In [11]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    10000 non-null  object
 1   AI      10000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 156.4+ KB


In [12]:
print(data_train.AI.value_counts(normalize=True))

print(data_val.AI.value_counts(normalize=True))

print(data_test.AI.value_counts(normalize=True))

0    0.5
1    0.5
Name: AI, dtype: float64
0    0.5
1    0.5
Name: AI, dtype: float64
0    0.5
1    0.5
Name: AI, dtype: float64


Data is balanced 👍🏽

## Preprocess data

### Cleaning

Cleaning might not be a good idea for our classification task, as removing stopwords and lowercasing everything might worsen the performance of a model. We might try different options later. In this repository from developers at OpenAI, https://github.com/openai/gpt-2-output-dataset/blob/master/baseline.py, they did not clean the data at all. I also tried it out. Result: with cleaning as in the NLP modules, the accuracy score is far worse than without cleaning! So we shouldn't clean.
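
One plausible reason why cleaning hurts is that it removes stylistic signals such as punctuation and casing. The following minimal sketch illustrates this with a simplified cleaner; `simple_clean` is a hypothetical stand-in for the full cleaning pipeline in this notebook and uses only the standard library.

```python
import string


def simple_clean(text: str) -> str:
    """Strip punctuation, lowercase, and drop digits (simplified cleaning)."""
    text = "".join(char for char in text if char not in string.punctuation)
    text = text.lower()
    text = "".join(char for char in text if not char.isdigit())
    return text.strip()


expressive = "Wait... WHAT?! It's 2023."
plain = "wait, what its"

# After cleaning, two texts with very different style become identical,
# so a classifier can no longer use these differences as features.
print(simple_clean(expressive) == simple_clean(plain))
```

Both inputs collapse to the same string, so any difference in punctuation, casing, or numbers is lost before the vectorizer sees the text.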

In [13]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

In [14]:
def remove_punctuation(text: str) -> str:
    text = ''.join([char for char in text if char not in string.punctuation])
    return text

def lower(text: str) -> str:
    return text.lower()

def remove_numbers(text: str) -> str:
    text = ''.join([char for char in text if not char.isdigit()])
    return text

def tokenize(text: str) -> list:
    return word_tokenize(text)

def remove_stopwords(text: list) -> list:
    stop_words = set(stopwords.words('english')) # we should be careful, there are also other lang present
    tokens_clean = [word for word in text if not word in stop_words]
    return tokens_clean

def lemmatize(text: list) -> list:
    lemmatizer = WordNetLemmatizer() # create once instead of once per word
    for pos in ["v", "n", "a", "r", "s"]:
        text = [lemmatizer.lemmatize(word, pos=pos) for word in text]
    return text

# the flag is called use_lemmatize so that it does not shadow the lemmatize function
def clean(text: str, remove_stopword: bool=False, use_lemmatize: bool=False) -> str:
    text = remove_numbers(lower(remove_punctuation(text)))
    if remove_stopword or use_lemmatize:
        tokens = tokenize(text)
        if remove_stopword:
            tokens = remove_stopwords(tokens)
        if use_lemmatize:
            tokens = lemmatize(tokens)
        text = ' '.join(tokens) # back to a string so that strip() works
    return text.strip()

In [15]:
#print(clean(small_train.loc[0, "text"]))

In [15]:
data_train_clean = data_train
data_val_clean = data_val
data_test_clean = data_test
#for df in [data_train_clean, data_val_clean, data_test_clean]:
#    df.text = df.text.apply(clean)

In [16]:
# if RAM is too small
#batch_size= 200
#steps = round(small_train_short.shape[0]/batch_size)
#small_short_clean = small_train_short
#for step in range(steps):
#    small_short_clean.iloc[step*batch_size: step*batch_size+batch_size, 0] = small_train_short.iloc[step*batch_size: step*batch_size+batch_size, 0].apply(clean)

In [18]:
#print(data_val_clean.loc[0,"text"])

In [19]:
#data_train_clean.AI.value_counts()

### Vectorizing

Vectorization turns the raw text into numeric features that a model can use. For the baseline model we could use scikit-learn's TfidfVectorizer.
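
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation. It simplifies by splitting on whitespace and using single words only, so its numbers will not match the vectorizer configured in this notebook; the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Compute L2-normalised TF-IDF weights per document (unigrams only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents in which each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["the cat sat", "the dog sat", "the bird flew"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in every document, such as "the", receive the lowest weight, while terms that are specific to one document receive the highest.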

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import cross_val_score, GridSearchCV, PredefinedSplit
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from scipy import sparse #to handle sparse matrices from the vectorizer

In [18]:
X_train = data_train_clean[["text"]]
X_val = data_val_clean[["text"]]
X_test = data_test_clean[["text"]]
y_train = data_train_clean["AI"]
y_val = data_val_clean["AI"]
y_test = data_test_clean["AI"]

In [19]:
X_train.shape, y_train.shape

((500000, 1), (500000,))

In [20]:
X_train

Unnamed: 0,text
0,These girlfriends deserves a special mention f...
1,LeSean McCoy going through warmups with first ...
2,Tom Curran has been called up to England's Ash...
3,"We'll have turkey on the table Thursday but, a..."
4,The 1945 Sinkings of the Cap Arcona and the Th...
...,...
499995,There are a lot of things that I don't like ab...
499996,A year after an unprecedented public outcry ag...
499997,Battles Between the English and the Scots\n\nT...
499998,Kurt Rambis is the new head coach of the Knick...


In [21]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), 
                             max_features=2**18, 
                             min_df=100)

X_train_vec = vectorizer.fit_transform(X_train.text) # this takes a long time

In [22]:
X_val_vec = vectorizer.transform(X_val.text)
X_test_vec = vectorizer.transform(X_test.text)

In [23]:
X_train_vec # this gives a sparse matrix -> we should leave it like this because of memory issues

<500000x234672 sparse matrix of type '<class 'numpy.float64'>'
	with 230050038 stored elements in Compressed Sparse Row format>

In [24]:
X_search = sparse.vstack([X_train_vec, X_val_vec]) # to feed in Gridsearch
y_search = np.hstack((y_train, y_val))

In [25]:
# to make sure training data is not used for validation
split = PredefinedSplit([-1]*X_train.shape[0]+[0]*X_val.shape[0])
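
The list passed to PredefinedSplit assigns a fold id to every row: as documented for scikit-learn, rows marked -1 are never placed in a test set, and rows marked 0 form the single validation fold. The following pure-Python sketch mimics that behaviour on a tiny example; `predefined_folds` is a hypothetical helper, not part of scikit-learn.

```python
def predefined_folds(test_fold: list[int]):
    """Yield (train_indices, test_indices) for each fold id except -1."""
    fold_ids = sorted({fold for fold in test_fold if fold != -1})
    for fold_id in fold_ids:
        test_idx = [i for i, fold in enumerate(test_fold) if fold == fold_id]
        train_idx = [i for i, fold in enumerate(test_fold) if fold != fold_id]
        yield train_idx, test_idx


# 5 "training" rows followed by 2 "validation" rows, stacked as in X_search
test_fold = [-1] * 5 + [0] * 2
for train_idx, test_idx in predefined_folds(test_fold):
    print("train:", train_idx, "validate:", test_idx)
```

With this layout there is exactly one split, which matches the "Fitting 1 folds" message printed by the grid search.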

In [26]:
params = {
    'C': [2**k for k in range(3,6)],
    #'solver': ['liblinear', 'sag']
}

model = LogisticRegression(solver="liblinear", max_iter=1000)
search = GridSearchCV(model, 
                     params,
                     cv=split,
                     n_jobs=-1,
                      verbose = 3,
                      refit=False,
                     scoring="accuracy")

In [27]:
search.fit(X_search, y_search)

Fitting 1 folds for each of 3 candidates, totalling 3 fits


In [28]:
search.best_score_

0.9234

In [29]:
baseline = model.set_params(**search.best_params_)

In [30]:
baseline

In [31]:
baseline.fit(X_train_vec, y_train)

In [32]:
valid_score = baseline.score(X_val_vec, y_val)
test_score = baseline.score(X_test_vec, y_test)

In [33]:
test_score

0.9234

[CV 1/1] END ...............................C=8;, score=0.923 total time= 3.1min
[CV 1/1] END ..............................C=16;, score=0.921 total time= 3.4min
[CV 1/1] END ..............................C=32;, score=0.921 total time= 3.8min


The baseline model already reaches an accuracy of about 0.92 on the test set, which gives us a good starting point to iterate on.

## Create a better model

Possible ways to improve our baseline-model:
- different tokenizers (tiktoken from OpenAI, LlamaTokenizer or the RoBERTa tokenizer from Hugging Face transformers https://huggingface.co/docs/transformers/index)
- using word embeddings (Word2Vec and others) instead of a tokenizer
- a different model from classical ML (an SVM classifier needs a lot of memory and a huge amount of time: I ran it with 10% of the data and each fit took about 245 min! -> try SGDClassifier https://scikit-learn.org/0.15/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier; ensemble methods)
- a DL model (transfer learning with some base model as in https://www.tensorflow.org/text/tutorials/classify_text_with_bert, or Hugging Face transformers https://huggingface.co/roberta-base-openai-detector)

### Try SGDClassifier

Depending on the loss specified, SGDClassifier fits different linear models. It also lets us train on batches, which we need because our data is huge and training more complicated models would exceed the available RAM (unless you have an expensive computer, 5k € or more, with more than 16 GB of RAM, or use a VM, which we might have to do later to have enough power). For this procedure the most 'difficult' part is finding good parameters for the classifier...
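
As a reference for the loss names tried in this section, the sketch below writes out each loss as a function of the margin z = y * f(x), where y is the label encoded as -1 or +1 and f(x) is the raw linear score. In scikit-learn, "hinge" corresponds to a linear SVM and "log_loss" to logistic regression. The Python function names are chosen here for illustration only, and the formulas are the standard textbook definitions as far as I can tell.

```python
import math

# z is the margin y * f(x): positive when the prediction has the correct sign


def hinge(z: float) -> float:
    """Linear SVM loss: zero once the margin is at least 1."""
    return max(0.0, 1.0 - z)


def log_loss(z: float) -> float:
    """Logistic regression loss."""
    return math.log1p(math.exp(-z))


def modified_huber(z: float) -> float:
    """Quadratic near the margin, linear for badly misclassified points."""
    return max(0.0, 1.0 - z) ** 2 if z >= -1.0 else -4.0 * z


def squared_hinge(z: float) -> float:
    """Hinge loss with a quadratic penalty."""
    return max(0.0, 1.0 - z) ** 2


def perceptron(z: float) -> float:
    """Penalises only points on the wrong side of the boundary."""
    return max(0.0, -z)


LOSSES = {
    "hinge": hinge,
    "log_loss": log_loss,
    "modified_huber": modified_huber,
    "squared_hinge": squared_hinge,
    "perceptron": perceptron,
}

for name, loss_fn in LOSSES.items():
    print(name, [round(loss_fn(z), 3) for z in (-2.0, 0.0, 2.0)])
```

The losses differ mainly in how strongly they punish points that are far on the wrong side, which is one reason why the resulting accuracies can differ.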

In [74]:
classes = [0,1] # this needs to be specified for the partial_fit method

In [79]:
# define an iterator to train the SGDClassifier on batches
chunksize=4000
data_train_shuffle = data_train_clean.sample(frac=1, random_state=1).reset_index(drop=True)
X_train_shuffle = data_train_shuffle[["text"]]
y_train_shuffle = data_train_shuffle["AI"]

def chunk_iterator(chunksize, size_train):
    index_start = 0
    while index_start < size_train:
        X_chunk = X_train_shuffle.iloc[index_start: index_start+chunksize]
        X_chunk_vec = vectorizer.transform(X_chunk.text)
        y_chunk = y_train_shuffle[index_start: index_start+chunksize]
        yield X_chunk_vec, y_chunk
        index_start += chunksize

In [80]:
loss_history = {}

for loss in ["hinge", "log_loss", "modified_huber", "squared_hinge", "perceptron"]:
    model_sgd = SGDClassifier(loss=loss,
                              penalty=None,
                              n_jobs=-1, 
                              random_state=1,
                              verbose=0)
    iterator = chunk_iterator(chunksize, X_train_shuffle.shape[0])
    for X_chunk, y_chunk in iterator:
        model_sgd.partial_fit(X_chunk, y_chunk, classes=classes)
        
    # despite the name, this stores the mean of validation and test accuracy, not a loss value
    loss_history[loss] = np.mean([model_sgd.score(X_val_vec, y_val), model_sgd.score(X_test_vec, y_test)])

In [81]:
loss_history

{'hinge': 0.8674,
 'log_loss': 0.8605,
 'modified_huber': 0.8542,
 'squared_hinge': 0.8173,
 'perceptron': 0.8479}
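
To read off the best-performing loss and compare it with the baseline, a short sketch using the numbers printed above follows. Note that the baseline value of 0.9234 is a test accuracy, while the SGD values are means of validation and test accuracy, so the comparison is only approximate.

```python
# Values copied from the output above (mean of validation and test accuracy)
loss_history = {
    "hinge": 0.8674,
    "log_loss": 0.8605,
    "modified_huber": 0.8542,
    "squared_hinge": 0.8173,
    "perceptron": 0.8479,
}
baseline_accuracy = 0.9234  # test accuracy of the logistic regression baseline

best_loss = max(loss_history, key=loss_history.get)
gap = baseline_accuracy - loss_history[best_loss]
print(f"best loss: {best_loss} ({loss_history[best_loss]:.4f}), "
      f"gap to baseline: {gap:.4f}")
```

With the default parameters and a single pass over the data, every SGD variant stays below the baseline, which is consistent with the remark that finding good parameters is the difficult part.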