## Preparing the fake review dataset for GAN-BERT

This notebook prepares the dataset from the paper 'Creating and detecting fake reviews of online products' for use with GAN-BERT. Since GAN-BERT requires unlabeled, labeled, and test data, the dataset is divided into three separate datasets. The datasets are also cleaned, formatted, and exported as CSV files to meet the input requirements of GAN-BERT.

The dataset comes from the paper 'Creating and detecting fake reviews of online products', published by Elsevier in early 2022, and can be downloaded from the following link:
https://jonisalminen.com/fake-reviews-dataset-and-generation/

In [1]:
# Importing necessary libraries
import pandas as pd

In [2]:
# Reading dataset
df = pd.read_csv(r'C:\Users\Gebruiker\Desktop\Master Project\Datasets\fake reviews dataset.csv')
df

<bound method NDFrame.head of                            category  rating label  \
0                Home_and_Kitchen_5     5.0    CG   
1                Home_and_Kitchen_5     5.0    CG   
2                Home_and_Kitchen_5     5.0    CG   
3                Home_and_Kitchen_5     1.0    CG   
4                Home_and_Kitchen_5     5.0    CG   
...                             ...     ...   ...   
40427  Clothing_Shoes_and_Jewelry_5     4.0    OR   
40428  Clothing_Shoes_and_Jewelry_5     5.0    CG   
40429  Clothing_Shoes_and_Jewelry_5     2.0    OR   
40430  Clothing_Shoes_and_Jewelry_5     1.0    CG   
40431  Clothing_Shoes_and_Jewelry_5     5.0    OR   

                                                   text_  
0      Love this!  Well made, sturdy, and very comfor...  
1      love it, a great upgrade from the original.  I...  
2      This pillow saved my back. I love the look and...  
3      Missing information on how to use it, but it i...  
4      Very nice set. Good quality. We

### Cleaning the dataset for GAN-BERT purposes

In [3]:
# Removing the 'category' and 'rating' columns, which are unnecessary for GAN-BERT.
# The 'label' and 'text_' columns are kept.
df_clean = df.drop(columns=['category', 'rating'])
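Before splitting, it can help to confirm that the cleaned frame has the expected shape. The following is a minimal sketch on a made-up toy frame (the rows and the names `toy`, `toy_clean`, `remaining_columns`, `has_missing`, and `label_counts` are illustrative assumptions, not part of the original notebook). It checks that only `label` and `text_` remain, that no values are missing, and which label values occur.

```python
import pandas as pd

# Toy stand-in for the real dataset; the rows are invented for illustration.
toy = pd.DataFrame({
    "category": ["Home_and_Kitchen_5", "Books_5", "Books_5"],
    "rating": [5.0, 1.0, 4.0],
    "label": ["CG", "OR", "CG"],
    "text_": ["Love this!", "Not what I expected.", "Great read."],
})

# Dropping by column name keeps working even if the column order changes.
toy_clean = toy.drop(columns=["category", "rating"])

# Basic checks before splitting: remaining columns, missing values,
# and the distribution of label values.
remaining_columns = list(toy_clean.columns)
has_missing = bool(toy_clean.isna().any().any())
label_counts = toy_clean["label"].value_counts().to_dict()

print(remaining_columns)
print(has_missing)
print(label_counts)
```

The same three checks could be run on `df_clean` itself; a label other than `CG` or `OR`, or a missing review text, would be worth handling before the data is split.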

### Dividing the dataset

In [5]:
# Creating an unlabeled dataset (80%) and a held-out test dataset (20%) out of the entire dataset
unlabeled_data = df_clean.sample(frac=0.8, random_state=25)
test_data = df_clean.drop(unlabeled_data.index)

print(f"No. of unlabeled examples: {unlabeled_data.shape[0]}")
print(f"No. of test examples: {test_data.shape[0]}")

No. of unlabeled examples: 32346
No. of test examples: 8086


In [6]:
# Taking explicit copies so that later edits do not trigger a SettingWithCopyWarning
df_unlabeled = unlabeled_data.copy()
df_test = test_data.copy()

In [7]:
# Splitting the held-out data further: 80% becomes the final test set, 20% the labeled set
test_data = df_test.sample(frac=0.8, random_state=25)
labeled_data = df_test.drop(test_data.index)

print(f"No. of test examples: {test_data.shape[0]}")
print(f"No. of labeled examples: {labeled_data.shape[0]}")

No. of test examples: 6469
No. of labeled examples: 1617
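The two-stage split above leaves roughly 80% of the rows unlabeled, 16% for testing, and 4% labeled. A minimal sketch on a toy frame of 100 made-up rows can confirm that the three parts are disjoint and together cover the whole dataset. The names `toy`, `holdout`, `sizes`, `is_disjoint`, and `covers_all` are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Toy frame with 100 invented rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    "label": ["CG", "OR"] * 50,
    "text_": [f"review {i}" for i in range(100)],
})

# Stage 1: 80% unlabeled, 20% held out.
unlabeled = toy.sample(frac=0.8, random_state=25)
holdout = toy.drop(unlabeled.index)

# Stage 2: 80% of the held-out rows for testing, the rest labeled.
test = holdout.sample(frac=0.8, random_state=25)
labeled = holdout.drop(test.index)

sizes = (len(unlabeled), len(test), len(labeled))

# If no row is shared between parts, the union of the indices has exactly
# as many entries as the parts have rows in total.
all_indices = unlabeled.index.union(test.index).union(labeled.index)
is_disjoint = sum(sizes) == len(all_indices)
covers_all = len(all_indices) == len(toy)

print(sizes, is_disjoint, covers_all)
```

This check relies on the index being unique, which holds for the default integer index that `pd.read_csv` creates.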


In [None]:
df_labeled = labeled_data

### Unlabeling the unlabeled dataset

In [9]:
# Replacing every label ('CG' and 'OR') with 'UNK' to make the dataset unlabeled
df_unlabeled = df_unlabeled.assign(label='UNK')
df_unlabeled

|  | label | text_ |
|---|---|---|
| 27974 | UNK | Love the farm life, though, and the fact that ... |
| 10869 | UNK | I use this at a time when I am running a few t... |
| 38660 | UNK | SIZE IS NOT ACCURED, YOU NEED TO EXPLAIN THE B... |
| 14904 | UNK | I liked this movie and still do! I just wish H... |
| 21824 | UNK | The initial shipment of treats was destroyed w... |
| ... | ... | ... |
| 18955 | UNK | Love this Moen 26100 Magnetix (Cordless) in th... |
| 31175 | UNK | 98-precent of this book is that it is an inter... |
| 4866 | UNK | These look to be very nice. The only problem i... |
| 35078 | UNK | Darn, I must not have read the fine print some... |
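One way to mask the labels without touching the source frame is `DataFrame.assign`, which returns a new frame instead of modifying a sampled subset in place. The sketch below uses a made-up four-row frame. The names `source`, `subset`, `masked`, `unique_labels`, and `source_labels` are illustrative assumptions.

```python
import pandas as pd

# Invented rows standing in for the cleaned dataset.
source = pd.DataFrame({
    "label": ["CG", "OR", "CG", "OR"],
    "text_": ["a", "b", "c", "d"],
})

# Sample half of the rows, as the unlabeled split does on the real data.
subset = source.sample(frac=0.5, random_state=25)

# assign() builds a new frame, so neither `subset` nor `source` is changed.
masked = subset.assign(label="UNK")

unique_labels = masked["label"].unique().tolist()
source_labels = sorted(source["label"].unique().tolist())

print(unique_labels)
print(source_labels)
```

Because the original labels stay intact in `source`, they remain available for later evaluation if that is ever needed.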


### Exporting the datasets as CSV files

In [10]:
# Exporting the unlabeled dataset as a CSV file
df_unlabeled.to_csv(r'C:\Users\Gebruiker\Desktop\Master Project\Datasets\unlabeled_amazon.csv', index=False)

In [11]:
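# Exporting the test dataset as a CSV file.
# test_data holds the final 6469 test examples; df_test still contains the
# rows that were moved to the labeled set, so exporting it would leak them.
test_data.to_csv(r'C:\Users\Gebruiker\Desktop\Master Project\Datasets\test_amazon.csv', index=False)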

In [12]:
# Exporting the labeled dataset as a CSV file
df_labeled.to_csv(r'C:\Users\Gebruiker\Desktop\Master Project\Datasets\labeled_amazon.csv', index=False)
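The export cells use absolute Windows paths, which only work on the author's machine. A minimal sketch of a more portable variant builds the path with `pathlib` and reads the file back to confirm the round trip. It writes to a temporary directory, and the frame contents and the names `labeled`, `out_dir`, `out_path`, `reloaded`, and `round_trip_ok` are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented rows standing in for the labeled split.
labeled = pd.DataFrame({
    "label": ["CG", "OR"],
    "text_": ["Works well, would buy again.", "Broke after a week."],
})

with tempfile.TemporaryDirectory() as tmp:
    # In a real run, out_dir would point at the project's dataset folder.
    out_dir = Path(tmp)
    out_path = out_dir / "labeled_amazon.csv"

    # index=False omits the shuffled row index, as in the cells above.
    labeled.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare column names and cell values of the written and reloaded data.
round_trip_ok = (
    list(reloaded.columns) == list(labeled.columns)
    and reloaded.values.tolist() == labeled.values.tolist()
)
print(round_trip_ok)
```

Review texts that contain commas are quoted automatically by `to_csv`, so they survive the round trip unchanged.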