# Synonym Replacement augmented dataset creation

@Author Siôn William Davies

Creation of the SR dataset.

Please note, there is referenced code in this document.

It starts at: ** Start of referenced code **

And ends at: ** End of referenced code **

In [1]:
import pandas as pd
import numpy as np
import re
import string 
import nltk
import inflect
import random
from random import shuffle
from nltk.corpus import wordnet, stopwords
from sklearn.model_selection import train_test_split

In [2]:
stop_words = set(stopwords.words('english'))

In [3]:
# Load the CSV file with the data.

In [4]:
df = pd.read_csv('/Users/siondavies/Desktop/NLP/Datasets/Original_Datasets/Augmented_1.csv')

In [5]:
df.head()

Unnamed: 0,Index,Message_Post,Label,Fascist_Speech,Source_Dataset,Forum,String_Length,Character_Length,Language_ID
0,1,My account is mergeable just give it to me,Non-fascist,No,Reddit,FortNiteBR,42,34,en
1,2,There is an entire religion that uses this myt...,Non-fascist,No,Reddit,todayilearned,82,68,en
2,3,Where’s the Turning the frogs gay sticker,Non-fascist,No,Reddit,trashy,41,35,en
3,4,Sounds like something you get at a Japanese st...,Non-fascist,No,Reddit,hockey,100,85,en
4,5,"Yeah, the tripple A scene is in a very weird p...",Non-fascist,No,Reddit,gaming,391,316,en


In [6]:
df.shape

(1999, 9)

In [7]:
def converter(Fascist_Speech):
    if Fascist_Speech == 'Yes':
        return 1
    else:
        return 0

In [8]:
df['Numeric_Label'] = df['Fascist_Speech'].apply(converter)
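
The `converter` function maps the 'Yes'/'No' strings to 1/0 one row at a time. As a hedged alternative, the same mapping can be written as a single vectorised comparison. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real Fascist_Speech column.
toy = pd.DataFrame({'Fascist_Speech': ['Yes', 'No', 'No', 'Yes']})

# Compare the whole column at once, then cast True/False to 1/0.
toy['Numeric_Label'] = (toy['Fascist_Speech'] == 'Yes').astype(int)

print(toy['Numeric_Label'].tolist())  # [1, 0, 0, 1]
```

As with `converter`, any value other than 'Yes' becomes 0.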

In [9]:
# Now we create a new SR dataset consisting only of the message posts and the labels.

In [10]:
sr_df = df[['Message_Post', 'Numeric_Label', 'Label']].copy()

In [11]:
sr_df.head()

Unnamed: 0,Message_Post,Numeric_Label,Label
0,My account is mergeable just give it to me,0,Non-fascist
1,There is an entire religion that uses this myt...,0,Non-fascist
2,Where’s the Turning the frogs gay sticker,0,Non-fascist
3,Sounds like something you get at a Japanese st...,0,Non-fascist
4,"Yeah, the tripple A scene is in a very weird p...",0,Non-fascist


""

Below are the preprocessing functions we apply to the textual data (Message_Post).
They clean the data and normalise the text.

""

In [12]:
# Function to remove emoticons from a String. 

def remove_emoticons(data):
    emoticons = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols and pictographs
        u"\U0001F680-\U0001F6FF"  # transport and map symbols
        u"\U0001F1E0-\U0001F1FF" "]+", flags = re.UNICODE)  # regional indicators (flags)
    return emoticons.sub(r'', data)


# Function to replace numbers written as digits with their text counterparts.

def convert_numbers(data):
    inf = inflect.engine()
    # Replace each whole run of digits (e.g. "42"), rather than each single character.
    return re.sub(r'\d+', lambda match: inf.number_to_words(match.group()), data)

# A function to remove stopwords from tokenized words.
# Uses the stop_words set built above, so the stopword list is not reloaded for every word.

def remove_stopwords(data):
    return [word for word in data if word not in stop_words]


# This function can be used if we only want to stem the text.
# Must be applied as -> sr_df = stem(sr_df)

def stem(data):
    stemmer = nltk.stem.PorterStemmer()
    data['Message_Post'] = data['Message_Post'].apply(lambda x: [stemmer.stem(word) for word in x.split()])
    return data


# Function 1 to clean data in pre-processing steps.
# Converts String to lower case.
# Deletes text between < and > 
# Removes URLs
# (Punctuation is not removed here; see clean_data_3, which is applied after augmentation.)

def clean_data_1(data):
    data = data.lower()
    data = re.sub('<.*?>', '', data)
    data = re.sub(r'http\S+', '', data)
    return data

# Function 2 to clean data in pre-processing steps.
# Replaces hyphens with spaces and removes newline characters.
# Removes emoticons.
# Converts digits to words.
# Collapses repeated spaces.

def clean_data_2(data):
    data = re.sub('-', ' ', data) 
    data = re.sub('\n', '', data)
    data = remove_emoticons(data)
    data = convert_numbers(data)
    data = re.sub(' +', ' ', data)
    return data   

# Function 3 removes punctuation - it is performed after the augmentation has taken place.
def clean_data_3(data):
    data = re.sub('[%s]' % re.escape(string.punctuation), '', data)
    return data
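
To make the effect of the cleaning steps concrete, the sketch below traces one made-up string through the regex steps only. The helper names `clean_text` and `strip_punctuation` are hypothetical, and the emoticon and number-conversion steps are left out so the example needs only the standard library:

```python
import re
import string

def clean_text(text):
    # Lower-case, drop <...> tags and URLs, split hyphenated words, collapse spaces.
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub('-', ' ', text)
    text = re.sub(' +', ' ', text)
    return text

def strip_punctuation(text):
    # Same pattern as clean_data_3: remove every ASCII punctuation character.
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

sample = "Check <b>THIS</b> out: http://example.com well-known   fact!"
cleaned = clean_text(sample)
print(cleaned)                     # check this out: well known fact!
print(strip_punctuation(cleaned))  # check this out well known fact
```

Punctuation is stripped in a separate, later step so that the earlier steps can still rely on it, for example when matching URLs and tags.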

In [13]:
sr_df.isnull().sum()

Message_Post     0
Numeric_Label    0
Label            0
dtype: int64

In [14]:
# Clean the data...

sr_df['Message_Post'] = sr_df['Message_Post'].apply(clean_data_1).apply(clean_data_2)

In [15]:
# Apply Stemming on the text....

# sr_df= stem(sr_df)

""

Now we split the data into training and test sets.

""

Because we do not want the model to have seen the test sentences before making its predictions, we apply the synonym replacement augmentation to the training set only. That way, the model is evaluated on completely unseen data.

In [15]:
# X = the input text (message posts) / y = the target class we want to predict

X = sr_df.Message_Post
y = sr_df.Numeric_Label

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)
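
The split above is random on every run, so the sample counts reported later will vary between executions. As a hedged sketch (not the settings used in this notebook), `random_state` makes the split reproducible and `stratify` keeps the class proportions similar in both parts. The toy lists below are invented for illustration:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 posts, 6 of them in the positive class.
texts = [f"post {i}" for i in range(20)]
labels = [1 if i < 6 else 0 for i in range(20)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.3,      # same 70/30 ratio as in the notebook
    shuffle=True,
    stratify=labels,    # assumption: preserve the class ratio in both parts
    random_state=42,    # assumption: any fixed integer gives a repeatable split
)

print(len(X_tr), len(X_te))
```

Only `X_tr` / `y_tr` would then be passed to the augmentation step, so the test posts stay untouched.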

In [17]:
print('Fascist training samples before augmentation performed: {}'.format(sum(y_train == 1)))

Fascist training samples before augmentation performed: 352


Below we create a temporary dataframe holding the training samples. We then select the fascist samples, apply the synonym replacement augmentation to them, recombine them with the non-fascist training samples, and save the result as our SR training set.

In [18]:
temp_train = pd.DataFrame({'Message_Post':X_train, 'Numeric_Label':y_train})

In [19]:
# Each comparison is parenthesised because & binds more tightly than ==.
fascist_train = temp_train[(temp_train['Message_Post'].notnull()) & 
          (temp_train['Numeric_Label'] == 1)].copy()

fascist_train.head()

Unnamed: 0,Message_Post,Numeric_Label
1982,national socialism is about the blood and the...,1
1662,jews do have higher iq's but they are inferior...,1
1519,"google manifest destiny, that's how i reconcil...",1
1738,hitler had to unite these people in his nation...,1
1712,thank you for putting that spic in her place. ...,1


Here we have only fascist training samples that we shall augment. 
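
A note on the filter used to select these rows: in Python, `&` binds more tightly than `==`, so each condition in a combined pandas mask is best wrapped in its own parentheses. The sketch below shows the parenthesised form on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: one missing post, two positive labels.
toy = pd.DataFrame({
    'Message_Post': ['first post', None, 'third post'],
    'Numeric_Label': [1, 1, 0],
})

# Both conditions are evaluated separately, then combined element-wise.
mask = toy['Message_Post'].notnull() & (toy['Numeric_Label'] == 1)
selected = toy[mask]

print(selected.index.tolist())  # [0]
```

Row 1 is excluded because its post is missing, and row 2 because its label is 0.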

** Start of referenced code **

** PLEASE NOTE: The code directly below is not original code. It was taken from a paper that uses a synonym replacement technique. It is referenced below, along with the authors' names. It includes the methods:

- synonym_replacement 
- get_synonyms 
- SR 
- get_only_chars
 
                                                                                                     **

In [20]:
def synonym_replacement(data, n):
    new_words = data.copy()
    random_word_list = list(set([word for word in data if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')
    
    return new_words

def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for letter in syn.lemmas():
            synonym = letter.name().replace("_", " ").replace("-", " ").lower()
            synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
            synonyms.add(synonym)
    if word in synonyms:
        synonyms.remove(word)
   
    return list(synonyms)

def SR(sentence, alpha_sr, n_aug):
    sentence = get_only_chars(sentence)
    words = sentence.split(' ')
    num_words = len(words)
    augmented_sentences = []
    n_sr = max(1, int(alpha_sr * num_words))
    
    for _ in range(n_aug):
        a_words = synonym_replacement(words, n_sr)
        augmented_sentences.append(' '.join(a_words))
    shuffle(augmented_sentences)
    augmented_sentences.append(sentence)
    
    return augmented_sentences

def get_only_chars(line):
    
    clean_line = ""
    line = line.replace(",", "")
    line = line.replace("'", "")
    line = line.replace("-", " ")
    line = line.replace("\t", " ")
    line = line.replace("\n", " ")
    line = line.lower()
    
    for char in line:
        if char in 'qwertyuiopasdfghjklzxcvbnm ':
            clean_line += char
        else:
            clean_line += ' '
    
    clean_line = re.sub(' +', ' ', clean_line)
    if clean_line.startswith(' '):
        clean_line = clean_line[1:]
    return clean_line

Code Reference:

Publication: 

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (Wei and Zou, 2019)

https://arxiv.org/abs/1901.11196

https://github.com/jasonwei20/eda_nlp/blob/master/experiments/nlp_aug.py

Authors: Jason Wei and Kai Zou 

Date: 2019

    
** End of referenced code **
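
To illustrate the idea behind the referenced functions without WordNet, here is a minimal sketch that uses a tiny hand-written synonym table. `TOY_SYNONYMS`, `TOY_STOP_WORDS` and `toy_synonym_replacement` are hypothetical names invented for this example; the real code looks synonyms up through `wordnet.synsets`.

```python
import random

# Hypothetical stand-ins for WordNet and the NLTK stopword list.
TOY_SYNONYMS = {"big": ["large", "huge"], "house": ["home", "dwelling"]}
TOY_STOP_WORDS = {"the", "is", "a"}

def toy_synonym_replacement(words, n, rng):
    """Replace up to n distinct non-stopwords with a randomly chosen synonym."""
    new_words = list(words)
    candidates = sorted({w for w in words if w not in TOY_STOP_WORDS})
    rng.shuffle(candidates)
    replaced = 0
    for word in candidates:
        synonyms = TOY_SYNONYMS.get(word, [])
        if synonyms:
            choice = rng.choice(synonyms)
            # Every occurrence of the chosen word gets the same synonym.
            new_words = [choice if w == word else w for w in new_words]
            replaced += 1
        if replaced >= n:
            break
    return new_words

rng = random.Random(0)  # seeded so the example is repeatable
original = "the big house is big".split()
augmented = toy_synonym_replacement(original, 1, rng)
print(' '.join(augmented))
```

The sentence length is preserved and stopwords are never replaced, which mirrors the behaviour of `synonym_replacement` above.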

In [24]:
# The parameters for SR() are...
# fascist_sample -> the sample of text to be augmented.
# alpha_sr = 0.2 -> roughly 20% of words within the sentence will be transformed.
# n_aug = 2 -> given 1 fascist sample, 2 augmented versions are generated.
#              SR() also returns the cleaned original, so up to 3 distinct sentences are kept per sample.

def get_SR(data):
    augmented = []
    for fascist_sample in data['Message_Post']:
        sentences = SR(fascist_sample, 0.2, 2)
        for aug in sentences:
            # Skip exact duplicates so each sentence is stored only once.
            if aug not in augmented:
                augmented.append(aug)
    return augmented
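
The arithmetic behind these parameters can be checked by hand. Inside `SR()`, the number of words to replace is `max(1, int(alpha_sr * num_words))`, so short sentences still get one replacement. The helper name `words_to_replace` below is hypothetical and simply repeats that formula:

```python
def words_to_replace(num_words, alpha_sr=0.2):
    # Same formula as n_sr in SR(): at least one word, otherwise the fraction rounded down.
    return max(1, int(alpha_sr * num_words))

for length in (3, 10, 48):
    print(length, words_to_replace(length))
# 3 1
# 10 2
# 48 9

# With 352 fascist training samples and up to 3 distinct sentences each
# (2 augmented + the cleaned original), the result can hold at most:
upper_bound = 352 * 3
print(upper_bound)  # 1056
```

The 1054 rows reported further down are just below this bound, which is consistent with a couple of duplicate sentences being dropped.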

In [25]:
augmented = get_SR(fascist_train)

In [26]:
print(augmented)

['subject socialism is about the blood and the stain of our nation and as such has no disinclination in laying title to each of these way out as an inseparable share of our greater struggle against globalism capitalism and physicalism in this shore ', 'interior socialist economy is about the rip and the ground of our state and as such has no hesitation in laying claim to each of these issues as an inseparable component of our greater clamber against globalism capitalism and philistinism in this land ', 'national socialism is about the blood and the soil of our nation and as such has no hesitation in laying claim to each of these issues as an inseparable part of our greater struggle against globalism capitalism and materialism in this land ', 'jews do have high iq but they are inferior in almost every other elbow room the thing is jews use their sinful intellect for evilness they support commie cause take up on what the nsdap said on the field of study and tell me what you think ', 'isr

In [27]:
augmented_df = pd.DataFrame(augmented)

In [28]:
augmented_df.head()

Unnamed: 0,0
0,subject socialism is about the blood and the s...
1,interior socialist economy is about the rip an...
2,national socialism is about the blood and the ...
3,jews do have high iq but they are inferior in ...
4,israelite do have higher iq but they are subsc...


In [29]:
augmented_df.count

<bound method DataFrame.count of                                                       0
0     subject socialism is about the blood and the s...
1     interior socialist economy is about the rip an...
2     national socialism is about the blood and the ...
3     jews do have high iq but they are inferior in ...
4     israelite do have higher iq but they are subsc...
...                                                 ...
1049  there is alot of cynicism you nonplus from al ...
1050  there is alot of cynicism you get from books l...
1051  i dont mind shogunate but i suppose the system...
1052  i dont mind despotism but i suppose the organi...
1053  i dont mind authoritarianism but i think the s...

[1054 rows x 1 columns]>

In [30]:
# We save the augmented training posts...
augmented_df.to_csv(r'/Users/siondavies/Desktop/NLP/Datasets.Aug_SR_13.csv')

In [31]:
# We will combine with our non-fascist training samples
# Each comparison is parenthesised because & binds more tightly than ==.
non_fascist_train = temp_train[(temp_train['Message_Post'].notnull()) & 
          (temp_train['Numeric_Label'] == 0)]

In [32]:
non_fascist_train.count

<bound method DataFrame.count of                                            Message_Post  Numeric_Label
160   ok. esh/nta. she needs to transfer the account...              0
796   reconstituted potato flake/paste, pretty sure ...              0
1016  lol what? arya has literally been preparing fo...              0
1123  interestingly, it's also the same group that c...              0
470   &gt;i thought that shit only happened to polie...              0
...                                                 ...            ...
1055  see all these dodgy cats being exposed? he's j...              0
1376        went ahead and edited my comment, thank you              0
549   &gt;how would he know?did you not read past th...              0
1144             literally everyone is talking about it              0
1134  take the pencil vertically, push it against th...              0

[1047 rows x 2 columns]>

In [33]:
# we save the non-augmented training posts..
non_fascist_train.to_csv(r'/Users/siondavies/Desktop/NLP/Datasets.non_aug_SR_13.csv')

In [34]:
# Now we load the combined augmented fascist / non-augmented non-fascist training posts...
SR_train = pd.read_csv('/Users/siondavies/Desktop/NLP/Aug_train_13.csv')
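
The notebook does not show how the combined file was assembled from the two saved CSVs. One possible way to build it directly in pandas is sketched below, purely as an assumption about that step. The frames `augmented_toy` and `non_fascist_toy` are made-up stand-ins for `augmented_df` and `non_fascist_train`:

```python
import pandas as pd

# Hypothetical stand-in for augmented_df: text only, no label column yet.
augmented_toy = pd.DataFrame({'Message_Post': ['aug one', 'aug two']})
augmented_toy['Numeric_Label'] = 1  # every augmented post is a fascist sample

# Hypothetical stand-in for non_fascist_train.
non_fascist_toy = pd.DataFrame({'Message_Post': ['plain one'], 'Numeric_Label': [0]})

# Stack the two classes, then shuffle so they are no longer in blocks.
combined = pd.concat([augmented_toy, non_fascist_toy], ignore_index=True)
combined = combined.sample(frac=1, random_state=0).reset_index(drop=True)

print(combined.shape)  # (3, 2)
```

Doing the merge in code keeps the whole pipeline reproducible from a single notebook run.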

In [35]:
SR_train.head()

Unnamed: 0.1,Unnamed: 0,Message_Post,Numeric_Label
0,0,subject socialism is about the blood and the s...,1
1,1,interior socialist economy is about the rip an...,1
2,2,national socialism is about the blood and the ...,1
3,3,jews do have high iq but they are inferior in ...,1
4,4,israelite do have higher iq but they are subsc...,1


In [36]:
# We need to re-shuffle the data, because after concatenation the rows of one class are followed by the rows of the other.

SR_train = SR_train.sample(frac = 1)

In [37]:
SR_train.head()

Unnamed: 0.1,Unnamed: 0,Message_Post,Numeric_Label
1494,1494,you’ll get over it right?,0
1391,1391,you can tell it wasnt joe biden because he did...,0
1312,1312,woooooooooohoooooooooohoooooooooooooooo fuck t...,0
955,955,in the united states weve seen democrat candid...,1
1360,1360,rosen should get a bonus for how well he dealt...,0


In [38]:
# Don't forget to remove punctuation from the training set before saving...
SR_train['Message_Post'] = SR_train['Message_Post'].apply(clean_data_3)
SR_train.to_csv(r'/Users/siondavies/Desktop/Temp_Datasets/SR_train_1.csv')

In [39]:
# Now save the test data...

SR_test = pd.DataFrame({'Message_Post':X_test, 'Numeric_Label':y_test})
SR_test['Message_Post'] = SR_test['Message_Post'].apply(clean_data_3)
SR_test.to_csv(r'/Users/siondavies/Desktop/Temp_Datasets/SR_test_1.csv')

In [40]:
# SAVE the full dataset...

sr_df['Message_Post'] = sr_df['Message_Post'].apply(clean_data_3)
sr_df.to_csv(r'/Users/siondavies/Desktop/Temp_Datasets/SR_clean_1.csv')