# Shuffled augmented dataset creation

@Author: Siôn William Davies

Creation of the Shuffled dataset.

Please note, there is referenced code in this document.

It starts at: ** Start of referenced code **

And ends at: ** End of referenced code **

In [7]:
import pandas as pd
import numpy as np
import re
import string 
import nltk
import inflect
from nltk import sent_tokenize
import random
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

In [8]:
# Load the CSV file with the data.

In [9]:
df = pd.read_csv('/Users/siondavies/Desktop/NLP/Datasets/Original_Datasets/Augmented_1.csv')

In [10]:
df.head()

Unnamed: 0,Index,Message_Post,Label,Fascist_Speech,Source_Dataset,Forum,String_Length,Character_Length,Language_ID
0,1,My account is mergeable just give it to me,Non-fascist,No,Reddit,FortNiteBR,42,34,en
1,2,There is an entire religion that uses this myt...,Non-fascist,No,Reddit,todayilearned,82,68,en
2,3,Where’s the Turning the frogs gay sticker,Non-fascist,No,Reddit,trashy,41,35,en
3,4,Sounds like something you get at a Japanese st...,Non-fascist,No,Reddit,hockey,100,85,en
4,5,"Yeah, the tripple A scene is in a very weird p...",Non-fascist,No,Reddit,gaming,391,316,en


In [11]:
df.shape

(1999, 9)

In [12]:
df.isnull().sum()

Index               0
Message_Post        0
Label               0
Fascist_Speech      0
Source_Dataset      0
Forum               0
String_Length       0
Character_Length    0
Language_ID         0
dtype: int64

In [5]:
# We will create another column 'Numeric_Label' which will indicate:
# 0: Non-fascist sample, 1: fascist sample

In [13]:
def converter(Fascist_Speech):
    if Fascist_Speech == 'Yes':
        return 1
    else:
        return 0

In [14]:
df['Numeric_Label'] = df['Fascist_Speech'].apply(converter)

In [16]:
# Now we create a new dataset consisting only of the Message Posts and the Labels

In [17]:
shuffled_df = df[['Message_Post', 'Numeric_Label', 'Label']].copy()

In [18]:
shuffled_df.head()

Unnamed: 0,Message_Post,Numeric_Label,Label
0,My account is mergeable just give it to me,0,Non-fascist
1,There is an entire religion that uses this myt...,0,Non-fascist
2,Where’s the Turning the frogs gay sticker,0,Non-fascist
3,Sounds like something you get at a Japanese st...,0,Non-fascist
4,"Yeah, the tripple A scene is in a very weird p...",0,Non-fascist


""

Below are the preprocessing functions we will apply to the textual data (the Message_Post column).
They clean the data and normalise the text.

""

In [19]:
# Function to remove emoji (emoticons, pictographs, transport symbols and flags) from a String.

def remove_emoticons(data):
    emoticons = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols and pictographs
        u"\U0001F680-\U0001F6FF"  # transport and map symbols
        u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        "]+", flags = re.UNICODE)
    return emoticons.sub(r'', data)


# Function to replace digits with their text counterparts.
# Note: iterating over a String yields single characters, so each digit is
# converted on its own (e.g. '16' becomes 'onesix', not 'sixteen').

def convert_numbers(data):
    inf = inflect.engine()
    for word in data:
        if word.isdigit():
            data = re.sub(word, inf.number_to_words(word), data)
    return data

# A function to remove stopwords from tokenized words.

def remove_stopwords(data):
    stop_words = set(stopwords.words('english'))  # build the set once, not once per word
    return [word for word in data if word not in stop_words]


# This function can be used if we only want to stem the text.
# Must be applied as -> gold_df= stem(gold_df)

def stem(data):
    stemmer = nltk.stem.PorterStemmer()
    data['Message_Post'] = data['Message_Post'].apply(lambda x: [stemmer.stem(word) for word in x.split()])
    return data


# Function 1 to clean data in pre-processing steps.
# Converts String to lower case.
# Deletes text between < and >
# Removes URLs
# Punctuation is deliberately NOT removed here: the shuffle augmentation needs
# the full stops, so punctuation removal is deferred to clean_data_3.

def clean_data_1(data):
    data = data.lower()
    data = re.sub('<.*?>', '', data)
    data = re.sub(r'http\S+', '', data)
    return data

# Function 2 to clean data in pre-processing steps.
# Replaces hyphens with spaces and removes newline characters.
# Removes emoticons.
# Converts digits to words.
# Collapses repeated spaces into a single space.

def clean_data_2(data):
    data = re.sub('-', ' ', data) 
    data = re.sub('\n', '', data)
    data = remove_emoticons(data)
    data = convert_numbers(data)
    data = re.sub(' +', ' ', data)
    return data   

# Function 3 removes punctuation - it is performed after the shuffle augmentation has taken place.
def clean_data_3(data):
    data = re.sub('[%s]' % re.escape(string.punctuation), '', data)
    return data
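
A note on convert_numbers: because it loops over the characters of the String, multi-digit numbers are converted one digit at a time, which is why text such as "one/onesixth" shows up in the printed augmented samples further down. The sketch below illustrates the difference between per-character and per-number conversion. It is only an illustration: NUMBER_WORDS is a hypothetical lookup table standing in for inflect's number_to_words so that the example needs no third-party package, and convert_per_number is an assumed alternative, not part of the original pipeline.

```python
import re

# Hypothetical lookup standing in for inflect's number_to_words.
NUMBER_WORDS = {'1': 'one', '6': 'six', '16': 'sixteen'}

def convert_per_character(text):
    # Mirrors the notebook's loop: iterating over a string yields characters,
    # so every digit is replaced individually.
    for ch in text:
        if ch.isdigit():
            text = re.sub(ch, NUMBER_WORDS[ch], text)
    return text

def convert_per_number(text):
    # Assumed alternative: match whole runs of digits and convert each run once.
    return re.sub(r'\d+', lambda m: NUMBER_WORDS[m.group()], text)

per_character = convert_per_character('1/16th')
per_number = convert_per_number('1/16th')
print(per_character)  # one/onesixth
print(per_number)     # one/sixteenth
```

Switching to the per-number variant would change the cleaned text, so any saved datasets would need to be regenerated.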

Now to apply the preprocessing methods on shuffled_df...

It is important to note that as we want to shuffle sentences, we must leave the full stops in our text for now.
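
To see why the full stops matter, here is a minimal sketch. It assumes a simple regex splitter (split_sentences, a hypothetical stand-in for nltk's sent_tokenize) and reuses the same punctuation regex as clean_data_3. Once the punctuation is gone, the post collapses into a single "sentence" and there is nothing left to shuffle.

```python
import re
import string

def split_sentences(text):
    # Hypothetical stand-in for nltk.sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def strip_punctuation(text):
    # Same regex as clean_data_3 in this notebook.
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

sample = "first sentence. second sentence. third sentence."
with_stops = split_sentences(sample)
without_stops = split_sentences(strip_punctuation(sample))

print(len(with_stops))     # 3 sentences that can be re-ordered
print(len(without_stops))  # 1 sentence, so shuffling would be a no-op
```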

In [20]:
shuffled_df.isnull().sum()

Message_Post     0
Numeric_Label    0
Label            0
dtype: int64

In [21]:
# Clean the data...

shuffled_df['Message_Post'] = shuffled_df['Message_Post'].apply(clean_data_1).apply(clean_data_2)

In [15]:
# Apply Stemming on the text....

# shuffled_df= stem(shuffled_df)

In [22]:
shuffled_df.head()

Unnamed: 0,Message_Post,Numeric_Label,Label
0,my account is mergeable just give it to me,0,Non-fascist
1,there is an entire religion that uses this myt...,0,Non-fascist
2,where’s the turning the frogs gay sticker,0,Non-fascist
3,sounds like something you get at a japanese st...,0,Non-fascist
4,"yeah, the tripple a scene is in a very weird p...",0,Non-fascist


""

Now we split the data into training and test sets.

""

Because we do not want the model to have already seen the test sentences (or re-ordered copies of them) when it makes its predictions, we perform the 'shuffling' augmentation only on the training set. That way, the model is evaluated on completely unseen data.
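
The order of operations can be sketched in plain Python. This is an illustration under stated assumptions: train_test_split_simple is a hypothetical, minimal stand-in for sklearn's train_test_split, and augment is a placeholder that simply reverses the sentence order. The point is only that the split happens first and the augmentation touches the training portion alone.

```python
import random

def train_test_split_simple(samples, test_fraction=0.3, seed=0):
    # Hypothetical, minimal stand-in for sklearn's train_test_split.
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test

def augment(text):
    # Placeholder augmentation: reverse the order of the sentences.
    return '. '.join(reversed(text.split('. ')))

posts = ['a{0}. b{0}'.format(i) for i in range(10)]

# 1. Split first...
train, test = train_test_split_simple(posts)
# 2. ...then augment the training portion only.
train_augmented = train + [augment(t) for t in train]

# No original or re-ordered training post appears in the test set.
print(set(train_augmented).isdisjoint(test))
```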

In [25]:
# X = the input text we predict from / y = the target class we want to predict

X = shuffled_df.Message_Post
y = shuffled_df.Numeric_Label

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)

In [27]:
print('Fascist training samples before augmentation performed: {}'.format(sum(y_train == 1)))

Fascist training samples before augmentation performed: 349


Below we create a temporary dataframe with the fascist training samples, apply the shuffle augmentation technique to it, recombine it with the non-fascist training samples, and save the result as our Shuffle training set.

In [28]:
temp_train = pd.DataFrame({'Message_Post':X_train, 'Numeric_Label':y_train})

In [29]:
fascist_train = temp_train[(temp_train['Message_Post'].notnull()) & 
          (temp_train['Numeric_Label'] == 1)].copy()

fascist_train.head()

Unnamed: 0,Message_Post,Numeric_Label
1686,i think that kind of anti semitic literature a...,1
1751,i think the relationship between religion and...,1
1652,"exactly, ethnic nationalism (and most of the ""...",1
1649,i'm thinking of making a neo fascist party. al...,1
1935,asians don't integrate. they may not rape as ...,1


Here we have only fascist training samples that we shall augment. 

** Start of referenced code **

** PLEASE NOTE: The code directly below is not original code. It was inspired by a publication which is referenced below along with the author's name. It has been slightly modified to serve the intended purpose. **


In [45]:
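def tokenize(text):
    # Split a post into a list of sentences.
    tokenized =  sent_tokenize(text)
    return tokenized


def shuffle_tokenized(data):
    # Shuffle the sentence list in place, then append a copy to the
    # module-level list `shuffled` so the next call starts from that copy.
    random.shuffle(data)
    new_doc = list(data)
    shuffled.append(new_doc)
    return data

augmented = []   # unique shuffled posts
repeats = []     # shuffled posts that duplicated an earlier one
for fascist_sample in fascist_train['Message_Post']:
    token = tokenize(fascist_sample)
    shuffled = [token]
    # Four in-place shuffles; `token` itself is shuffled by the first call,
    # so the original sentence order is not preserved as a separate entry.
    for i in range(4):
        shuffle_tokenized(shuffled[-1])
    for k in shuffled:
        s = ' '
        new_sample = s.join(k)
        if new_sample not in augmented:
            augmented.append(new_sample)
        else:
            repeats.append(new_sample)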

Code Reference:

Publication: https://towardsdatascience.com/data-augmentation-for-text-data-obtain-more-data-faster-525f7957acc9

Author: Nandesora Tjihero

Date: 26/09/18

** End of referenced code **
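
For comparison, the same idea can be written as a self-contained function with no module-level state. This is a sketch, not the referenced author's code: shuffled_variants and its parameters are hypothetical names, and unlike the loop above it shuffles a fresh copy of the original sentence list each time rather than chaining shuffles, and it leaves the caller's list untouched.

```python
import random

def shuffled_variants(sentences, n_shuffles=4, rng=None):
    """Return up to n_shuffles distinct re-orderings of `sentences`, joined by spaces."""
    if rng is None:
        rng = random.Random()
    variants = []
    seen = set()
    for _ in range(n_shuffles):
        candidate = sentences[:]      # copy so the input list is not mutated
        rng.shuffle(candidate)
        joined = ' '.join(candidate)
        if joined not in seen:        # skip re-orderings we already produced
            seen.add(joined)
            variants.append(joined)
    return variants

sentences = ["first point.", "second point.", "third point."]
variants = shuffled_variants(sentences, rng=random.Random(0))
for v in variants:
    print(v)
```

Passing a seeded random.Random instance makes the augmentation repeatable, which helps when the augmented file has to be regenerated.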



In [46]:
print(augmented)

["hitler's jewish soldiers is proof of that as well. i think that kind of anti semitic literature along with the rest of banjo_billy's stuff is exactly the kind of stuff that will turn people. i believe communism to be the embodiment of jewishness and the victim culture. granted they mainly only had one/eightth or one/onesixth jewish ancestry. i was certainly transfixed when i read his stuff.the fact is jews are overrepresented in all that is wrong in the world and let us not forget that it was a jew who conceived communism. jews are always looking to blame someone else but themselves. as i've said though, jews can transcend their jewishness. he gave some jews a pass if they converted to catholic. marxist leninism deviates a little bit from trotskyism (theoretical/cultural marxism) but it isn't that much better.", "hitler's jewish soldiers is proof of that as well. as i've said though, jews can transcend their jewishness. granted they mainly only had one/eightth or one/onesixth jewish 

In [47]:
augmented_df = pd.DataFrame(augmented)

In [48]:
augmented_df.head()

Unnamed: 0,0
0,hitler's jewish soldiers is proof of that as w...
1,hitler's jewish soldiers is proof of that as w...
2,i believe communism to be the embodiment of je...
3,hitler's jewish soldiers is proof of that as w...
4,at least i have not come across any yet in boo...


In [49]:
augmented_df.count

<bound method DataFrame.count of                                                       0
0     hitler's jewish soldiers is proof of that as w...
1     hitler's jewish soldiers is proof of that as w...
2     i believe communism to be the embodiment of je...
3     hitler's jewish soldiers is proof of that as w...
4     at least i have not come across any yet in boo...
...                                                 ...
1112  both destroy anything sacred to a man and his ...
1113  i feel that's a much more pragmatic approach r...
1114  in the us we've seen populist candidates like ...
1115  we need public support, we need political infl...
1116  i feel that's a much more pragmatic approach r...

[1117 rows x 1 columns]>

In [50]:
# We save the augmented training posts...
augmented_df.to_csv(r'/Users/siondavies/Desktop/NLP/Datasets.Aug_12.csv')

In [51]:
# We will combine with our non-fascist training samples
non_fascist_train = temp_train[(temp_train['Message_Post'].notnull()) & 
          (temp_train['Numeric_Label'] == 0)]

In [52]:
non_fascist_train.count

<bound method DataFrame.count of                                            Message_Post  Numeric_Label
13              this is like moe sticking up for curly.              0
1053  look at uudd’s too videos, most of them have s...              0
585   hey can you send me the info on this? my mom i...              0
1214  its fine. this bothers me much less that "why ...              0
174   yes my mood is quite down now, i think it was ...              0
...                                                 ...            ...
1257  zack still won't make it on the e &amp; c podc...              0
124   a story the writer himself seemingly doesn't k...              0
908   you'd only lose $one since you so get $five in...              0
1494                    yeah it's pretty cringe writing              0
1204  "you were sold a bunch of snakeoil, and now th...              0

[1050 rows x 2 columns]>

In [53]:
# we save the non-augmented training posts..
non_fascist_train.to_csv(r'/Users/siondavies/Desktop/NLP/Datasets.non_aug_12.csv')

In [54]:
# Now we load the combined training posts (augmented fascist + non-augmented non-fascist)...
shuffle_train = pd.read_csv('/Users/siondavies/Desktop/NLP/Aug_train_12.csv')
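
The notebook does not show how Aug_train_12.csv was assembled from the two files saved above. One possible way to do the combination directly in pandas is sketched below, purely as an assumption. The tiny `augmented` list and `non_fascist_train` frame are hypothetical stand-ins for the notebook's variables of the same names, and random_state=0 is chosen only to make the example repeatable.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's `augmented` list and `non_fascist_train` frame.
augmented = ["b. a.", "a. b."]
non_fascist_train = pd.DataFrame({'Message_Post': ["hello there."],
                                  'Numeric_Label': [0]})

# Every augmented post is a fascist sample, so it gets label 1.
fascist_aug = pd.DataFrame({'Message_Post': augmented, 'Numeric_Label': 1})

# Stack the two classes, then shuffle the rows so the classes are interleaved.
combined = pd.concat([fascist_aug, non_fascist_train], ignore_index=True)
combined = combined.sample(frac=1, random_state=0).reset_index(drop=True)

print(combined['Numeric_Label'].value_counts().to_dict())
```

Building the frame this way would also avoid the extra "Unnamed: 0" index columns that appear when CSV files are written and re-read with their index.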

In [55]:
shuffle_train.head()

Unnamed: 0.1,Unnamed: 0,Message_Post,Numeric_Label
0,0,hitler's jewish soldiers is proof of that as w...,1
1,1,hitler's jewish soldiers is proof of that as w...,1
2,2,i believe communism to be the embodiment of je...,1
3,3,hitler's jewish soldiers is proof of that as w...,1
4,4,at least i have not come across any yet in boo...,1


In [56]:
# We need to re-shuffle the data, as after concatenating X is one class followed by another.

shuffle_train = shuffle_train.sample(frac = 1)

In [57]:
shuffle_train.head()

Unnamed: 0.1,Unnamed: 0,Message_Post,Numeric_Label
1990,1990,"he basically said if it wasnt for you, he woul...",0
823,823,"! yes, in fact im a member of golden dawn.peop...",1
1399,1399,i know it's crazy how much this place has expl...,0
368,368,we should promote any alternative media we can...,1
1995,1995,any advice of how to go about bringing up this...,0


In [58]:
# Don't forget to remove punctuation from the training set before saving...
shuffle_train['Message_Post'] = shuffle_train['Message_Post'].apply(clean_data_3)
#shuffle_train.to_csv(r'/Users/siondavies/Desktop/Temp_Datasets/Shuffle_train_1.csv')

In [59]:
# Now save the test data...

shuffle_test = pd.DataFrame({'Message_Post':X_test, 'Numeric_Label':y_test})
shuffle_test['Message_Post'] = shuffle_test['Message_Post'].apply(clean_data_3)
#shuffle_test.to_csv(r'/Users/siondavies/Desktop/Temp_Datasets/Shuffle_test_1.csv')

In [60]:
# SAVE the full dataset...

shuffled_df['Message_Post'] = shuffled_df['Message_Post'].apply(clean_data_3)
#shuffled_df.to_csv(r'/Users/siondavies/Desktop/Temp_Datasets/Shuffled_clean.csv')