# Mulualem Asmare Practicum Project: Week 3 Progress
## AI vs Human-generated text Classification

## Introduction
AI is increasingly intertwined with our daily lives. Although this integration brings many benefits, it also poses serious risks. Determining whether a text was written by an AI or a human can help detect fake news, support plagiarism checks, and protect security and privacy. Recognizing the potential threat of AI-generated text, this project aims to develop a classification model that detects whether a text was generated by an AI or a human. The project also seeks to identify patterns that are common in AI-generated and human-generated texts.
   
In this project, I use three datasets obtained from Kaggle and Hugging Face to train classification models and identify common patterns. The models are logistic regression, random forest, and naive Bayes. The model with the highest accuracy will be selected and then evaluated on the test data. In addition to model development, the project uses n-gram analysis to identify patterns in the text data.

The work is done in Python 3 using Jupyter Notebook. The data will be cleaned using appropriate methods, and visualizations will show the most frequent words in AI-generated and human-generated texts. The text will be tokenized into individual words and lemmatized to reduce each word to its base form, which lowers dimensionality and helps the models learn. Finally, TF-IDF vectorization will convert the text into numerical vectors suitable for machine learning.
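The planned comparison of the three classifiers on TF-IDF features could look roughly like the sketch below. It assumes scikit-learn, which the notebook has not imported so far, and it uses a tiny invented corpus in place of the real data. The names `texts`, `labels`, `candidates`, and `accuracies` are hypothetical. The held-out split here plays the role of a validation set for choosing a model; a separate test set would still be needed for the final evaluation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 1 = human, 0 = AI (the notebook's convention).
texts = [
    "lol i dont even know what happened there",
    "haha thanks for the laugh , see you later",
    "ugh i wanted a pickle but we ran out",
    "not sure but my guess is it costs too much",
    "salt is cheap so cities just use it anyway",
    "honestly no idea , ask someone who was there",
    "There are several reasons why this may occur.",
    "It is important to consult a qualified professional.",
    "Salt is used on roads to help melt ice and snow.",
    "There are a few factors to consider in this case.",
    "It is not uncommon for prices to change over time.",
    "Overall, it is generally best to compare the options.",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a stratified validation split for model selection.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "naive_bayes": MultinomialNB(),
}

accuracies = {}
for name, classifier in candidates.items():
    # Each pipeline fits TF-IDF on the training texts only, using unigrams and bigrams.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), classifier)
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_val, model.predict(X_val))

# Pick the candidate with the highest validation accuracy.
best_model = max(accuracies, key=accuracies.get)
print(accuracies, best_model)
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary from being fitted on the held-out texts. With a corpus this small, the accuracies are not meaningful.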

## Datasets
1. The first dataset, 'Hello-SimpleAI/HC3', is obtained from https://huggingface.co/datasets/Hello-SimpleAI/HC3 using the Python code used by Yannick Stephan.
2. The second dataset is obtained from https://www.kaggle.com/competitions/human-or-machine-generated-text/data?select=train.tbz2
3. The third dataset is obtained from Kaggle: NaveenFream. (2023, September 22). Ai-and-human text. Kaggle. https://www.kaggle.com/datasets/naveenfream/ai-and-human-text

## Project Details

## Training Procedures

## Project Objectives

## Import Libraries

In [1]:
import pandas as pd
from datasets import load_dataset

## Load Datasets
### Load First Dataset

In [2]:
# Load the data with load_dataset
hello_dataset = load_dataset("Hello-SimpleAI/HC3", name="all")
hello_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'human_answers', 'chatgpt_answers', 'source'],
        num_rows: 24322
    })
})

In [3]:
# Convert the 'train' split of the loaded dataset to a DataFrame
hello_df = pd.DataFrame(hello_dataset['train'])
hello_df.head()

Unnamed: 0,id,question,human_answers,chatgpt_answers,source
0,0,"Why is every book I hear about a "" NY Times # ...","[Basically there are many categories of "" Best...",[There are many different best seller lists th...,reddit_eli5
1,1,"If salt is so bad for cars , why do we use it ...",[salt is good for not dying in car crashes and...,[Salt is used on roads to help melt ice and sn...,reddit_eli5
2,2,Why do we still have SD TV channels when HD lo...,[The way it works is that old TV stations got ...,[There are a few reasons why we still have SD ...,reddit_eli5
3,3,Why has nobody assassinated Kim Jong - un He i...,[You ca n't just go around assassinating the l...,[It is generally not acceptable or ethical to ...,reddit_eli5
4,4,How was airplane technology able to advance so...,[Wanting to kill the shit out of Germans drive...,[After the Wright Brothers made the first powe...,reddit_eli5


Now that the training data from Hugging Face has been downloaded, I will make a copy before modifying the dataset.

In [4]:
# Make a copy before modification
hello_df_copy = hello_df.copy()
hello_df_copy.head()

Unnamed: 0,id,question,human_answers,chatgpt_answers,source
0,0,"Why is every book I hear about a "" NY Times # ...","[Basically there are many categories of "" Best...",[There are many different best seller lists th...,reddit_eli5
1,1,"If salt is so bad for cars , why do we use it ...",[salt is good for not dying in car crashes and...,[Salt is used on roads to help melt ice and sn...,reddit_eli5
2,2,Why do we still have SD TV channels when HD lo...,[The way it works is that old TV stations got ...,[There are a few reasons why we still have SD ...,reddit_eli5
3,3,Why has nobody assassinated Kim Jong - un He i...,[You ca n't just go around assassinating the l...,[It is generally not acceptable or ethical to ...,reddit_eli5
4,4,How was airplane technology able to advance so...,[Wanting to kill the shit out of Germans drive...,[After the Wright Brothers made the first powe...,reddit_eli5


In [5]:
#Column names
hello_df_copy.columns

Index(['id', 'question', 'human_answers', 'chatgpt_answers', 'source'], dtype='object')

In [6]:
# Select the important features and save them as a new DataFrame

hello_df_copy = hello_df_copy.drop(['question', 'source'], axis=1)
hello_df_copy.head()

Unnamed: 0,id,human_answers,chatgpt_answers
0,0,"[Basically there are many categories of "" Best...",[There are many different best seller lists th...
1,1,[salt is good for not dying in car crashes and...,[Salt is used on roads to help melt ice and sn...
2,2,[The way it works is that old TV stations got ...,[There are a few reasons why we still have SD ...
3,3,[You ca n't just go around assassinating the l...,[It is generally not acceptable or ethical to ...
4,4,[Wanting to kill the shit out of Germans drive...,[After the Wright Brothers made the first powe...


In [7]:
# Reshape the DataFrame from wide to long format: one row per (id, answer column) pair
hello_df_copy = pd.melt(hello_df_copy, id_vars='id', var_name='text_source', value_name='output_text')
hello_df_copy = hello_df_copy.drop(columns='id')

In [8]:
# After reshaping, label human answers as 1 and ChatGPT answers as 0
hello_df_copy['text_source'] = hello_df_copy['text_source'].map({'human_answers': 1, 'chatgpt_answers': 0})
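The effect of the reshape can be illustrated on a small frame. This is a minimal sketch with invented rows standing in for the HC3 data; `wide` and `long_df` are hypothetical names. Each input row yields two output rows, one per answer column, which is why the 24,322-row dataset becomes twice as long after melting.

```python
import pandas as pd

# Toy stand-in for the HC3 frame after dropping 'question' and 'source'.
wide = pd.DataFrame({
    "id": [0, 1],
    "human_answers": [["human answer A"], ["human answer B"]],
    "chatgpt_answers": [["chatgpt answer A"], ["chatgpt answer B"]],
})

# Wide to long: the former column name goes to 'text_source', the cell to 'output_text'.
long_df = pd.melt(wide, id_vars="id", var_name="text_source", value_name="output_text")
long_df = long_df.drop(columns="id")

# Encode the origin of each answer: 1 = human, 0 = ChatGPT.
long_df["text_source"] = long_df["text_source"].map(
    {"human_answers": 1, "chatgpt_answers": 0}
)
print(long_df)
```

All human answers come first, followed by all ChatGPT answers, which matches the ordering seen in the notebook's reshaped frame.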

In [9]:
hello_df_copy.sample(25)

Unnamed: 0,text_source,output_text
243,1,"[Silica gel is n't toxic , it is a desicant , ..."
29326,0,[\nNoise-cancelling headphones work by using t...
10,1,[> I 've always wanted to know why hackers are...
26696,0,"[The price of a flash drive, or any other prod..."
10289,1,[The main benefit is Whatsapp 's client base -...
20461,1,[Yes these are the number of shareholders that...
35143,0,"[When two mirrors face each other, the light t..."
28916,0,[The Ninth Amendment to the United States Cons...
34718,0,[]
32979,0,[Dogs lick people for a variety of reasons. On...


Because I want output_text to appear to the left of text_source, I reverse the column order with the following code.

In [10]:
# reversing the order of the columns 
hello_df_copy = hello_df_copy.iloc[:, ::-1]
hello_df_copy

Unnamed: 0,output_text,text_source
0,"[Basically there are many categories of "" Best...",1
1,[salt is good for not dying in car crashes and...,1
2,[The way it works is that old TV stations got ...,1
3,[You ca n't just go around assassinating the l...,1
4,[Wanting to kill the shit out of Germans drive...,1
...,...,...
48639,[It's not uncommon for blood pressure to fluct...,0
48640,[There are several possible causes of a painle...,0
48641,[It is not appropriate for me to recommend a s...,0
48642,[It is not uncommon for people with rheumatoid...,0


In [11]:
# Check whether there are any missing values
hello_df_copy.isna().sum()

output_text    0
text_source    0
dtype: int64
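The sample shown earlier contains a row whose output_text is an empty list (`[]`), and `isna()` does not count such a cell as missing. A minimal sketch of one way to detect these rows and turn the remaining lists into plain strings follows. It assumes every cell holds a list of strings, and the frame and variable names are invented for illustration.

```python
import pandas as pd

# Toy frame: each cell of 'output_text' is a list of answers, one of them empty.
df = pd.DataFrame({
    "output_text": [["first answer", "second answer"], [], ["only answer"]],
    "text_source": [1, 0, 0],
})

# isna() sees a list object in every cell, so nothing is reported as missing.
n_missing = int(df["output_text"].isna().sum())

# An empty list has length 0, so a length check finds the rows without text.
empty_mask = df["output_text"].apply(len) == 0
n_empty = int(empty_mask.sum())

# Drop the empty rows and join each remaining list into a single string.
cleaned = df.loc[~empty_mask].copy()
cleaned["output_text"] = cleaned["output_text"].apply(" ".join)
print(n_missing, n_empty)
print(cleaned)
```

Joining keeps one row per original cell. Using `DataFrame.explode` instead would give one row per individual answer, which changes the class balance, so the choice depends on how the answers should be counted.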

The first dataset is now ready for the next data processing step. Next, let's load the second dataset.



### Load Second Dataset 
This dataset is obtained from Kaggle. 

Human or machine generated text?. Kaggle. (n.d.). https://www.kaggle.com/competitions/human-or-machine-generated-text/data?select=train.tbz2 

In [12]:
# Read the tab-separated training file from the dataset
human_machin_gen_text = pd.read_csv('/Users/mulualemasmare/Downloads/train.txt', delimiter='\t') 

In [13]:
human_machin_gen_text.head()

Unnamed: 0,id,context,response,human-generated
0,0,<first_speaker> 9@@ 5 de@@ gre@@ es with <numb...,<second_speaker> <at> i forgot that thank@@ s@...,0
1,1,<first_speaker> <at> <at> y@@ ar ! o@@ y just ...,<first_speaker> <at> lol b we 'll see . we hea...,0
2,2,<first_speaker> ohh ! de@@ u cer@@ to ! ! dddd...,<second_speaker> <at> ac@@ or@@ de@@ i ag@@ or...,1
3,3,<first_speaker> ugh@@ hhh i wanted a pic@@ kle...,<second_speaker> <at> lol g@@ m,0
4,4,<first_speaker> <at> <at> <at> need to know to...,"<first_speaker> <at> ok , will do - don 't be ...",1


In [14]:
# Make a copy of the dataset before modification
human_machin_gen_text_copy = human_machin_gen_text.copy()
human_machin_gen_text_copy.head()

Unnamed: 0,id,context,response,human-generated
0,0,<first_speaker> 9@@ 5 de@@ gre@@ es with <numb...,<second_speaker> <at> i forgot that thank@@ s@...,0
1,1,<first_speaker> <at> <at> y@@ ar ! o@@ y just ...,<first_speaker> <at> lol b we 'll see . we hea...,0
2,2,<first_speaker> ohh ! de@@ u cer@@ to ! ! dddd...,<second_speaker> <at> ac@@ or@@ de@@ i ag@@ or...,1
3,3,<first_speaker> ugh@@ hhh i wanted a pic@@ kle...,<second_speaker> <at> lol g@@ m,0
4,4,<first_speaker> <at> <at> <at> need to know to...,"<first_speaker> <at> ok , will do - don 't be ...",1


Because the data was preprocessed with BPE encoding, the dataset author recommends using the code below to print the messages properly.

In [15]:
# Remove the '@@ ' BPE markers so that the subword pieces are joined back into whole words

human_machin_gen_text_copy['context'] = human_machin_gen_text_copy['context'].apply(lambda x: x.replace('@@ ', ''))
human_machin_gen_text_copy['response'] = human_machin_gen_text_copy['response'].apply(lambda x: x.replace('@@ ', ''))
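As a small worked example of what this step does: the marker `@@ ` signals that a word continues in the next piece, so deleting it joins the pieces. The sketch below uses the vectorized `Series.str.replace` with `regex=False` as an alternative to `apply` with a lambda. The sample strings are shortened from the rows shown above, and `bpe` and `merged` are hypothetical names.

```python
import pandas as pd

# Short BPE-encoded samples; '@@ ' marks a word that continues in the next piece.
bpe = pd.Series(["9@@ 5 de@@ gre@@ es", "lol g@@ m", "no markers here"])

# regex=False treats '@@ ' as a literal substring rather than a pattern.
merged = bpe.str.replace("@@ ", "", regex=False)
print(merged.tolist())
```

Strings without markers pass through unchanged.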


In [16]:
human_machin_gen_text_copy.head()

Unnamed: 0,id,context,response,human-generated
0,0,<first_speaker> 95 degrees with <number> % hum...,<second_speaker> <at> i forgot that thanksabit...,0
1,1,<first_speaker> <at> <at> yar ! oy just appear...,<first_speaker> <at> lol b we 'll see . we hea...,0
2,2,<first_speaker> ohh ! deu certo ! ! dddddddddd...,"<second_speaker> <at> acordei agora , qqqqq ? ...",1
3,3,<first_speaker> ughhhh i wanted a pickle . non...,<second_speaker> <at> lol gm,0
4,4,<first_speaker> <at> <at> <at> need to know to...,"<first_speaker> <at> ok , will do - don 't be ...",1


The text in the 'context' column is human-generated. For my analysis, I only need the 'response' text and the 'human-generated' label, where 1 marks a human-generated entry and 0 marks an AI-generated one.

In [17]:
# Drop the columns that are not needed
human_machin_gen_text_copy = human_machin_gen_text_copy.drop(['id','context'], axis=1)
human_machin_gen_text_copy.head()

Unnamed: 0,response,human-generated
0,<second_speaker> <at> i forgot that thanksabit...,0
1,<first_speaker> <at> lol b we 'll see . we hea...,0
2,"<second_speaker> <at> acordei agora , qqqqq ? ...",1
3,<second_speaker> <at> lol gm,0
4,"<first_speaker> <at> ok , will do - don 't be ...",1


To concatenate this dataset with the other two, I will rename the 'response' column to 'output_text' and the 'human-generated' column to 'text_source'. With identical column names, the datasets can be concatenated along the rows.

In [18]:
# Change the column names for easy concatenation with the other datasets
human_machin_gen_text_copy.rename(columns={'response': 'output_text', 'human-generated': 'text_source'}, inplace=True)
human_machin_gen_text_copy.head()

Unnamed: 0,output_text,text_source
0,<second_speaker> <at> i forgot that thanksabit...,0
1,<first_speaker> <at> lol b we 'll see . we hea...,0
2,"<second_speaker> <at> acordei agora , qqqqq ? ...",1
3,<second_speaker> <at> lol gm,0
4,"<first_speaker> <at> ok , will do - don 't be ...",1


In [19]:
#Checking dtypes
human_machin_gen_text_copy.dtypes

output_text    object
text_source     int64
dtype: object

In [20]:
#Checking missing values
human_machin_gen_text_copy.isna().sum()

output_text    0
text_source    0
dtype: int64

The tokens '<second_speaker>', '<at>', and '<first_speaker>' are of no use for my analysis. Since my goal is to develop a machine learning model that predicts whether a text comes from an AI or a human, keeping these tokens adds no value. Therefore, I will remove them.

In [21]:
# Remove the unnecessary tokens by replacing them with an empty string
human_machin_gen_text_copy['output_text'] = human_machin_gen_text_copy['output_text'].str.replace('<first_speaker>', '').str.replace('<second_speaker>', '').str.replace('<at>', '')
human_machin_gen_text_copy.head()

Unnamed: 0,output_text,text_source
0,i forgot that thanksabit i feel exactly like...,0
1,lol b we 'll see . we hearing the bangin ' s...,0
2,"acordei agora , qqqqq ? qqqqq ? this is the ...",1
3,lol gm,0
4,"ok , will do - don 't be late though , you '...",1
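Removing the tags one at a time leaves stray spaces at the start of each text. A minimal alternative sketch is shown below: it removes the three tags in one regular-expression pass and then normalizes whitespace. The pattern name `TAG_PATTERN` and the sample strings are assumptions for illustration. Other placeholders in the data, such as `<number>`, are left untouched here.

```python
import pandas as pd

# Matches exactly the three tags removed in the notebook.
TAG_PATTERN = r"<(?:first_speaker|second_speaker|at)>"

raw = pd.Series([
    "<second_speaker> <at> lol gm",
    "<first_speaker> <at> ok , will do",
])

# Step 1: delete the tags. Step 2: collapse whitespace runs and trim the ends.
no_tags = raw.str.replace(TAG_PATTERN, "", regex=True)
tidy = no_tags.str.replace(r"\s+", " ", regex=True).str.strip()
print(tidy.tolist())
```

Passing `regex=True` explicitly avoids relying on the default, which differs between pandas versions.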


The second dataset is now ready for concatenation with the other datasets. Next, let's load the third dataset.

### Load Third Dataset
This dataset is obtained from Kaggle.

NaveenFream. (2023, September 22). Ai-and-human text. Kaggle. https://www.kaggle.com/datasets/naveenfream/ai-and-human-text

In [22]:
# Read the CSV file 
ai_human_text = pd.read_csv("/Users/mulualemasmare/Downloads/AI-and-human-text.csv")
ai_human_text.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,0,Sekhukhune I (Matsebe; circa 1814 – 13 Septemb...,AI-Generated-Text
1,1,Mount Washington is a peak in the White Mount...,AI-Generated-Text
2,2,Acer hillsi is an extinct maple species that w...,AI-Generated-Text
3,3,Derrick George Sherwin (16 April 1936 – 17 Oct...,Human-Generated-Text
4,4,The Windows shell is the graphical user interf...,Human-Generated-Text


In [23]:
ai_human_text.dtypes

Unnamed: 0     int64
text          object
class         object
dtype: object

In [24]:
ai_human_text.isna().sum()

Unnamed: 0    0
text          0
class         0
dtype: int64

In [25]:
ai_human_text_copy=ai_human_text.copy()
ai_human_text_copy.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,0,Sekhukhune I (Matsebe; circa 1814 – 13 Septemb...,AI-Generated-Text
1,1,Mount Washington is a peak in the White Mount...,AI-Generated-Text
2,2,Acer hillsi is an extinct maple species that w...,AI-Generated-Text
3,3,Derrick George Sherwin (16 April 1936 – 17 Oct...,Human-Generated-Text
4,4,The Windows shell is the graphical user interf...,Human-Generated-Text


Here, the dataset contains an unnamed column that is not important for my analysis, so I will drop it. The 'class' feature shows that it is an object, and the output indicating whether the text is from AI or human is displayed as words. In order to concatenate it with my other dataset, I need to change the feature names to be identical to the other datasets. Additionally, I need to transform the 'class' feature output into 0s and 1s.

In [26]:
# Replace the class labels: 0 for AI-generated text and 1 for human-generated text
ai_human_text_copy['class'] = ai_human_text_copy['class'].map({'AI-Generated-Text': 0, 'Human-Generated-Text': 1})
ai_human_text_copy.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,0,Sekhukhune I (Matsebe; circa 1814 – 13 Septemb...,0
1,1,Mount Washington is a peak in the White Mount...,0
2,2,Acer hillsi is an extinct maple species that w...,0
3,3,Derrick George Sherwin (16 April 1936 – 17 Oct...,1
4,4,The Windows shell is the graphical user interf...,1
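When labels are converted from words to numbers, it can help to confirm that no label was missed. The sketch below is one way to do that under the assumption that the only two labels are the ones shown above. It uses `Series.map`, which produces NaN for any value not found in the dictionary. `LABELS` and `toy` are hypothetical names.

```python
import pandas as pd

# Mapping used throughout the notebook: 0 = AI, 1 = human.
LABELS = {"AI-Generated-Text": 0, "Human-Generated-Text": 1}

toy = pd.DataFrame({
    "text": ["sample a", "sample b", "sample c"],
    "class": ["AI-Generated-Text", "Human-Generated-Text", "AI-Generated-Text"],
})

# map() returns NaN for any label that is missing from LABELS,
# so counting NaN values reveals unexpected labels.
encoded = toy["class"].map(LABELS)
unmapped = int(encoded.isna().sum())

# Store the result as integers once every label is known to be mapped.
toy["class"] = encoded.astype("int64")
print(unmapped, toy["class"].tolist())
```

Assigning the result back to the column, rather than calling `replace(..., inplace=True)` on a selected column, also behaves consistently across pandas versions.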


In [27]:
# Rename the columns for smooth concatenation with the other datasets
ai_human_text_copy.rename(columns={'text': 'output_text', 'class': 'text_source'},inplace=True)

In [28]:
# Drop the unnecessary feature
ai_human_text_copy.drop('Unnamed: 0', axis=1, inplace=True)
ai_human_text_copy.head()

Unnamed: 0,output_text,text_source
0,Sekhukhune I (Matsebe; circa 1814 – 13 Septemb...,0
1,Mount Washington is a peak in the White Mount...,0
2,Acer hillsi is an extinct maple species that w...,0
3,Derrick George Sherwin (16 April 1936 – 17 Oct...,1
4,The Windows shell is the graphical user interf...,1


In [29]:
#Check the dtypes
ai_human_text_copy.dtypes

output_text    object
text_source     int64
dtype: object

In [37]:
#Unifying the three datasets
concat_df = pd.concat([hello_df_copy, human_machin_gen_text_copy, ai_human_text_copy], axis=0)


In [38]:
# Check whether there are any null values
concat_df.isnull().sum()

output_text    0
text_source    0
dtype: int64

In [39]:
# Check the dtypes of the unified dataset
concat_df.dtypes

output_text    object
text_source     int64
dtype: object

In [40]:
# checking the dimension of the df
concat_df.shape

(7539620, 2)
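Concatenating along the rows keeps each source frame's own index, so index labels can repeat in the combined frame. A minimal sketch with two invented frames shows the effect, how `ignore_index=True` produces a fresh 0..n-1 index, and a quick class-balance check. The frames `a` and `b` are hypothetical stand-ins for the cleaned datasets.

```python
import pandas as pd

# Two toy frames sharing the unified column names.
a = pd.DataFrame({"output_text": ["a0", "a1"], "text_source": [1, 0]})
b = pd.DataFrame({"output_text": ["b0", "b1", "b2"], "text_source": [0, 1, 1]})

# Plain row-wise concatenation keeps the original indexes: 0, 1, 0, 1, 2.
stacked = pd.concat([a, b], axis=0)
has_duplicates = bool(stacked.index.has_duplicates)

# ignore_index=True renumbers the rows from 0 to n-1.
combined = pd.concat([a, b], axis=0, ignore_index=True)

# Count the rows per class to check the balance between human (1) and AI (0).
class_counts = combined["text_source"].value_counts().to_dict()
print(has_duplicates, combined.index.tolist(), class_counts)
```

A unique index matters later if rows are selected or dropped by label, because `.loc` with a repeated label returns every matching row.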

In [41]:
concat_df.sample(20)

Unnamed: 0,output_text,text_source
205763,it hates you,0
5110821,best the instrumental to class of the year <...,0
3475115,haha thanks for the laugh ! your boy is just...,1
3032214,"i didn 't say anything , i was asking , what...",0
2910303,very good point ! but not everyone would be ...,1
5591894,"didn 't know , you 're too lovely for words ...",0
2067463,haha i just dont see the purpose in it . and...,0
5321978,lmao wow . the lies you tell ! ! ! ! ! ! ! ! ...,1
755594,what is wendy talking about today ?,1
1380367,not sure exactly but it will but in time to ...,1


## Data Cleaning (to be continued)

## References
NaveenFream. (2023, September 22). Ai-and-human text. Kaggle. https://www.kaggle.com/datasets/naveenfream/ai-and-human-text 

Stephan, Y. (n.d.). | NLP | LLM | Fine-tuning | QA LoRA T5 | Natural Language Processing (NLP) and Large Language Models (LLM) with Fine-Tuning LLM and make Question answering (QA) with LoRA and Flan-T5 Large. GitHub. https://github.com/YanSte/NLP-LLM-Fine-tuning-QA-LoRA-T5/blob/dff62be80fd83b97c2997e1e2c7fd954fb901a54/nlp-llm-fine-tuning-lora-t5-l.ipynb 