**Classification**

Now that the fake reviews have been generated with the retrained GPT-2 model, it's time to prepare the datasets for the classification model. This is done by standardizing the text and then tokenising it.

In [59]:
import os

#Set the TensorFlow environment variables before importing TensorFlow,
#otherwise the log level setting has no effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
os.environ["TFHUB_CACHE_DIR"] = '/tmp/tfhub'

import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#Load datasets
fake_reviews = pd.read_csv('reviews_generated.csv',usecols=['text'])
real_reviews = pd.read_csv('bigreviews.csv',usecols=['text'])

#Add new column indicating real or fake
#real = 1 / fake = 0
real_reviews['real'] = 1
fake_reviews['real'] = 0


*About 33% of reviews are suspected to be fake, so the dataset is built to match that rate: one generated review for every two real ones.*

Generated = 7252 

Real = 14504

Total = 21756
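
The counts above follow from fixing the share of generated reviews at one third. A minimal sketch of that arithmetic, where the helper name `real_reviews_needed` is hypothetical and not part of the notebook:

```python
# Hypothetical helper: how many real reviews are needed so that the
# generated reviews make up `fake_share` of the combined dataset.
def real_reviews_needed(n_fake: int, fake_share: float) -> int:
    total = round(n_fake / fake_share)  # size of the combined dataset
    return total - n_fake               # the remainder must be real reviews


n_fake = 7252                            # generated reviews available
n_real = real_reviews_needed(n_fake, 1 / 3)
total = n_fake + n_real

print(n_real, total, round(n_fake / total, 4))
```

With 7252 generated reviews this gives 14504 real reviews and 21756 rows in total, which matches the slice taken from the real reviews in the next cell.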

In [60]:
#Select the last 14504 real reviews (twice the number of generated reviews)
real_reviews = real_reviews.tail(14504)

In [61]:
#Join the real and generated datasets
full_reviews = pd.concat([real_reviews, fake_reviews], ignore_index=True)
full_reviews.to_csv('full_reviews.csv', index=False)

In [62]:
print(full_reviews)

                                                    text  real
0      My MacBook Pro retina was failing do the stupi...     1
1      My boyfriend and I found this place doing a lo...     1
2      Hubby and I decided to try.  Never been to Ger...     1
3      Ok so this is really Aneu! I really don't know...     1
4      Finally got to try Smee's recently.  I like th...     1
...                                                  ...   ...
21751  we were looking for a place to eat and we foun...     0
21752  second time here.  the food is good, but the s...     0
21753  tucked on 76, it's a great place to go to for ...     0
21754  these hand grenades are the best! \n\nthe staf...     0
21755  this is totallly a great place to go for a cas...     0

[21756 rows x 2 columns]


*Clean the dataset*

In [9]:
full_reviews = pd.read_csv('full_reviews.csv')

In [63]:
#Standardization and spell check
import itertools
import re
from autocorrect import Speller
import nltk
from nltk.stem import WordNetLemmatizer


#Create the spell checker once; building it on every call is slow
spell = Speller(lang='en')

def correct_text(text):
    #A letter should not appear more than twice in a row, so trim longer runs to two
    text_correction = ''.join(''.join(s)[:2] for _, s in itertools.groupby(text))
    #Apply autocorrection to the trimmed text
    return spell(text_correction)


def standardize_text(text):
    #Remove mentions, URLs, a leading "rt" and any character that is not a letter, digit, space or tab
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http\S+", "", text)
    #Turn to lower case
    text = text.lower()
    #Remove numbers
    text = re.sub(r'\d+', '', text)
    #Remove any remaining punctuation
    text = re.sub(r"[^0-9A-Za-z ]", "", text)
    #Collapse repeated whitespace into a single space
    text = re.sub(r'\s{2,}', ' ', text)
    return text


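#Implement lemmatization: reduce each word to its dictionary form
#(words are treated as nouns by default, so verb tenses are kept)
nltk.download('wordnet', quiet=True)  #WordNet data is required by the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemm_text(text):
    #lemmatize() works on a single word, so process the text word by word
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())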

In [64]:
full_reviews['text'] = full_reviews['text'].apply(correct_text)
full_reviews['text'] = full_reviews['text'].apply(standardize_text)
full_reviews['text'] = full_reviews['text'].apply(lemm_text)

full_reviews.to_csv('full_reviews_cleaned.csv', index=False)
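
To see what the cleaning does to a single review, the sketch below applies a simplified, standard-library-only version of the steps to one made-up string. It is an illustration rather than the notebook's exact pipeline: `collapse_repeats`, `basic_standardize`, the `max_run` parameter and the sample text are all hypothetical, and the spell-check and lemmatization steps are left out because they need third-party packages and data.

```python
import itertools
import re


def collapse_repeats(text: str, max_run: int = 2) -> str:
    # Keep at most `max_run` consecutive copies of any character.
    return ''.join(''.join(group)[:max_run] for _, group in itertools.groupby(text))


def basic_standardize(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = text.lower()                       # lower-case everything
    text = re.sub(r"[^a-z ]", "", text)       # drop digits and punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text


# Made-up sample review, for illustration only.
sample = "This is totallly GREAT!!! Visit http://example.com 2 times"

trimmed = collapse_repeats(sample)
cleaned = basic_standardize(trimmed)

print(trimmed)
print(cleaned)
```

Trimming repeated characters first gives the later steps (and, in the notebook, the spell checker) a word that is closer to its dictionary spelling, for example `totallly` becomes `totally`.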

In [None]:
full_reviews