## **Pre-processing and Synthetic Oversampling**

This notebook covers the pre-processing steps required before the Twitter data can be used in our model.

<br>

In step 1, we will randomly renumber the conversation IDs so that sorting on them gives a random train/val/test split.

<br>

In step 2, we will reformat the data into the conversational form needed for the model.

<br>

In step 3, we will implement synthetic oversampling to help with our large class imbalance.

#### Package Imports and data file

In [None]:
import pandas as pd
from numpy.random import default_rng
from datetime import date
from google.colab import drive
import numpy as np
import gensim.models.keyedvectors as word2vec
import gc
from time import time
from sklearn.neighbors import NearestNeighbors

#Random sampling
import random

#NLTK
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
nltk.download('stopwords')
nltk.download('wordnet')

#For file saving
version = '1'
today = date.today()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = '/content/drive/My Drive/data_twitter/Conversations_Tagged_w_sup_4_4_21.xlsx'

tweetDF = pd.read_excel(file_path)

### **Step 1: Randomly Resort Conversations and Split train/val/test**

In [None]:
def randomizeConvos(tweetDF):
  '''Takes conversation pandas DataFrame and randomizes by giving conversations
  new ID numbers. Tweet IDs stay the same. Re-sorts by the new Conversation IDs.

  Input: 
    - tweetDF: Pandas DF(obj)
  '''

  # How many conversations do we have? Store in variable
  num_convos = tweetDF['Conversation ID'].nunique()

  # Random range of numbers length of unique conversation number
  rng = default_rng()
  numbers = rng.choice(num_convos, size=num_convos, replace=False)

  # Randomly map old convo id's to new convo id's
  oldToNewIds = pd.DataFrame({'Conversation ID' :tweetDF['Conversation ID'].unique(),
                              'new_convo_ids': numbers})


  # Merge new convo id's into DF, Drop old convo id column and rename new one.
  newTweets = pd.merge(tweetDF, oldToNewIds, how = 'left', 
          left_on = 'Conversation ID',
          right_on = 'Conversation ID')
  newTweets = newTweets.drop(columns=['Conversation ID'])
  newTweets = newTweets.rename(columns = {'new_convo_ids': 'Conversation ID'})

  # Re-order Columns
  newTweets = newTweets[['Tweet ID',
                        'Conversation ID',
                        'Depth', 
                        'Username',
                        'Tweet',
                        'Date',
                        'Personal Attack?'
        ]]  

  # Sort by new conversation id's and depth within the conversations.
  newTweets = newTweets.sort_values(by = ['Conversation ID', 'Depth'],
                                    ascending = True)

  # Return new DF
  return newTweets

def SplitTrainValTest(tweetDF, split_perc = [.6, .2]):
  '''
  Split DF into train, val and test ensuring that conversations are not split up
  between the splits based on split percentages given for the train and val split points.

  Inputs
    - tweetDF: Pandas DF(obj)
    - split_perc: list(obj)
  '''
  
  # How many conversations do we have? Store in variable  
  num_convos = tweetDF['Conversation ID'].nunique()

  # Number to split training,val sets on
  train_split_point = int(num_convos*split_perc[0])
  val_split_point = int(num_convos*(split_perc[0] + split_perc[1]))

  # Train, val, test splits
  train = tweetDF[tweetDF['Conversation ID'].isin(range(train_split_point))]
  val = tweetDF[tweetDF['Conversation ID'].isin(range(train_split_point, val_split_point))]
  test = tweetDF[tweetDF['Conversation ID'].isin(range(val_split_point, num_convos))]

  return train, val, test
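
The two functions above can be illustrated on toy data without pandas. The sketch below is a simplified stand-in, not the notebook's code: `split_by_conversation` and the toy rows are hypothetical, and a seeded `random.Random` takes the place of `default_rng`. It shows that renumbering conversations with a random permutation and cutting on the new IDs keeps every conversation inside exactly one split.

```python
import random

def split_by_conversation(rows, split_perc=(0.5, 0.25), seed=0):
    """Assign whole conversations to train/val/test.

    rows: list of (conversation_id, tweet) tuples (hypothetical toy format).
    split_perc: fractions of conversations for train and val; the rest is test.
    """
    # Unique conversation IDs in first-seen order, similar to Series.unique().
    old_ids = list(dict.fromkeys(cid for cid, _ in rows))

    # Give every conversation a new random ID in 0..n-1 (a permutation).
    new_ids = list(range(len(old_ids)))
    random.Random(seed).shuffle(new_ids)
    old_to_new = dict(zip(old_ids, new_ids))

    n = len(old_ids)
    train_end = int(n * split_perc[0])
    val_end = int(n * (split_perc[0] + split_perc[1]))

    train, val, test = [], [], []
    for cid, tweet in rows:
        new_id = old_to_new[cid]
        if new_id < train_end:
            train.append((cid, tweet))
        elif new_id < val_end:
            val.append((cid, tweet))
        else:
            test.append((cid, tweet))
    return train, val, test

# Ten toy conversations with two tweets each.
rows = [(100 + c, f"tweet {d} of convo {c}") for c in range(10) for d in range(2)]
train, val, test = split_by_conversation(rows)

train_ids = {cid for cid, _ in train}
val_ids = {cid for cid, _ in val}
test_ids = {cid for cid, _ in test}
print(len(train_ids), len(val_ids), len(test_ids))  # 5 2 3
```

Because the cut is made on conversation IDs rather than on rows, the three ID sets never overlap, whichever permutation is drawn.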



#### Run Functions above to return randomly split train, val, test

In [None]:
randomizedTweetDF = randomizeConvos(tweetDF)
corpus_train, corpus_val, corpus_test = SplitTrainValTest(randomizedTweetDF, split_perc = [.5,.25])


### **Step 2: Reformat data into conversational form**

In [None]:
def formatCorpus(corpus, modelTransformer = True):
  '''Each input will be the TWO TWEET context before the personal attack
    we are trying to predict.
    
    Most recent tweet to the FRONT of input string so that they are
    not truncated.
    
    If a positive attack is registered in the conversation, we consider that conversation
    "tainted". If future tweets in that conversation are negative, we will not count them, however
    if they are positive we will still count them.

    Note: by this method, the top-level tweet will be the 2nd most recent tweet
    of our first input from the conversation. The most recent tweet in one input
    will become the 2nd most recent in the next.

    ex.
    Conversation x:
      Input 1: str((most recent tweet (x)) + EOS + (2nd most recent (y)))   Output 1: 'pos' or 'neg'
      Input 2: str((most recent tweet (z)) + EOS + (2nd most recent (x)))   Output 2: 'pos' or 'neg'
  ...etc. '''


  texts = []
  texts2 = []
  labels = []
  reply = []
#     Iterate through conversation groups in pandas df.
  for name, group in corpus.groupby(['Conversation ID']):
    
    # Start with empty conversation in each group..
    i = 0
    attack_registered = 0
    # Iterate through each tweet and corresponding label.
    for tweet,label in zip(group['Tweet'],group['Personal Attack?']):

      # If it is the top-level tweet in the convo, this tweet becomes the most recent tweet.
      # We do not add its label, because the top-level tweet is not part of our
      # prediction task. We only note whether a personal attack is registered.
      if i == 0:
        tweet_most_recent = tweet
        if label == 'pos':
          attack_registered = 1

        i += 1

      # If it is a second-level tweet in the convo, the top-level tweet becomes 2nd
      # most recent and this tweet is now most recent. We do not add its label, because
      # the second-level tweet is not part of our prediction task. We again only note
      # whether a personal attack is registered.
      elif i == 1:
        tweet_2nd_most_recent = tweet_most_recent
        tweet_most_recent = tweet
        if label == 'pos':
          attack_registered = 1
        
        i += 1

      # If it is below the 2nd-level tweet, the context is the combination of
      # our most recent and 2nd most recent tweets, separated by our separator token.
      # A positive tweet is always added to our context list and labels,
      # whether or not an attack has already been registered.
      # A negative tweet is added only if no attack has been registered yet
      # in this conversation; otherwise it is skipped and only the window advances.
      else:

        if label == 'pos':
          if modelTransformer == True:
            context = tweet_most_recent + '</s>' + tweet_2nd_most_recent
            texts.append(context)
          else:
            context = tweet_2nd_most_recent
            context2 = tweet_most_recent
            texts.append(context)
            texts2.append(context2)
         
          labels.append(label)
          reply.append(tweet)

          attack_registered = 1

          tweet_2nd_most_recent = tweet_most_recent
          tweet_most_recent = tweet
          i += 1
          
        elif label == 'neg':
          if attack_registered == 0:
            if modelTransformer == True:
              context = tweet_most_recent + '</s>' + tweet_2nd_most_recent
              texts.append(context)
            else:
              context = tweet_2nd_most_recent
              context2 = tweet_most_recent
              texts.append(context)
              texts2.append(context2)             
            labels.append(label)
            reply.append(tweet)
            
            tweet_2nd_most_recent = tweet_most_recent
            tweet_most_recent = tweet
            i += 1   


          elif attack_registered == 1:
            tweet_2nd_most_recent = tweet_most_recent
            tweet_most_recent = tweet
            i += 1 
  if modelTransformer == True:
    return (texts, labels)
  else:

    labels = [label.replace('neg', '0') for label in labels]
    labels = [label.replace('pos', '1') for label in labels]

    labels = [int(label) for label in labels]

    return (texts, texts2, reply, labels)
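
To make the "tainted conversation" rule concrete, here is a minimal sketch of the same windowing logic on a toy conversation. `build_examples` is a hypothetical, condensed helper (it covers only the transformer-style output and assumes every label is either `'pos'` or `'neg'`); it is not a replacement for `formatCorpus`.

```python
def build_examples(conversation, sep='</s>'):
    """conversation: list of (tweet, label) tuples ordered by depth.

    Returns (context, label) pairs, where context is the most recent tweet,
    the separator, and the 2nd most recent tweet before the labelled reply.
    """
    examples = []
    attack_registered = False
    most_recent = second_most_recent = None

    for position, (tweet, label) in enumerate(conversation):
        if position >= 2:
            # Positives are always kept; negatives only before the first attack.
            if label == 'pos' or not attack_registered:
                examples.append((most_recent + sep + second_most_recent, label))
        if label == 'pos':
            attack_registered = True
        # Slide the two-tweet window forward.
        second_most_recent, most_recent = most_recent, tweet
    return examples

conversation = [
    ('t0', 'neg'), ('t1', 'neg'), ('t2', 'neg'),
    ('t3', 'pos'), ('t4', 'neg'), ('t5', 'pos'),
]
for context, label in build_examples(conversation):
    print(context, label)
# t1</s>t0 neg
# t2</s>t1 pos
# t4</s>t3 pos
```

The negative reply `t4` produces no example because the conversation was already tainted by `t3`, but it still enters the context window for `t5`.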

#### Run train, val and test splits through our formatting function above.

In [None]:
texts_train, labels_train = formatCorpus(corpus_train)
texts_val, labels_val = formatCorpus(corpus_val)
texts_test, labels_test = formatCorpus(corpus_test)

corpus_formatted_train = pd.DataFrame({'texts':texts_train,'labels':labels_train})
corpus_formatted_val = pd.DataFrame({'texts':texts_val,'labels':labels_val})
corpus_formatted_test = pd.DataFrame({'texts':texts_test,'labels':labels_test})

In [None]:

context1_train_baseline, context2_train_baseline, replies_train_baseline, labels_train_baseline = formatCorpus(corpus_train, modelTransformer=False)
context1_val_baseline, context2_val_baseline, replies_val_baseline, labels_val_baseline = formatCorpus(corpus_val, modelTransformer=False)
context1_test_baseline, context2_test_baseline, replies_test_baseline, labels_test_baseline = formatCorpus(corpus_test, modelTransformer=False)

corpus_formatted_train_baseline = pd.DataFrame({'context_1':context1_train_baseline, 
                                                'context_2':context2_train_baseline,
                                                'labels':labels_train_baseline, 
                                                'replies': replies_train_baseline})
corpus_formatted_val_baseline = pd.DataFrame({'context_1':context1_val_baseline,
                                              'context_2':context2_val_baseline,
                                              'labels':labels_val_baseline,
                                              'replies': replies_val_baseline})
corpus_formatted_test_baseline = pd.DataFrame({'context_1':context1_test_baseline,
                                              'context_2':context2_test_baseline,
                                               'labels':labels_test_baseline,
                                               'replies': replies_test_baseline})

#### Save val and test data to file. Train data will be appended with synthetic oversampling.

In [None]:
corpus_formatted_val.to_csv("/content/drive/My Drive/data_twitter/val_data_" + str(today) + "_v" + version + ".csv",
                            index = False)
corpus_formatted_test.to_csv("/content/drive/My Drive/data_twitter/test_data_" + str(today) + "_v" + version + ".csv",
                             index = False)

In [None]:
corpus_formatted_val_baseline.to_csv("/content/drive/My Drive/data_twitter/val_data_baseline_" + str(today) + "_v" + version + ".csv",
                            index = False)
corpus_formatted_test_baseline.to_csv("/content/drive/My Drive/data_twitter/test_data_baseline_" + str(today) + "_v" + version + ".csv",
                             index = False)

### **Step 3: Synthetic Oversampling**

We add new positive samples to the training data by replacing words with their nearest neighbours in the Twitter GloVe embedding space.

In [None]:
# Uncomment if this file has not been downloaded yet
# !wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
# ! unzip glove*.zip

#### Loading Embedding Matrix
Helper functions for loading the embedding matrix for our training text corpus.

In [None]:
def loadEmbeddingMatrix(typeToLoad, vocab_dict):
    '''Create an embedding matrix from Twitter GloVe embeddings.
    The matrix has one row per vocabulary index and 100 columns (embedding size).
    Note: `typeToLoad` is currently unused; the embedding file path is hard-coded below.
    '''

    # Load embedding file from path. We are using embedding size 100.
    EMBEDDING_FILE = '/content/drive/My Drive/Embeddings/glove.twitter.27B.100d.txt'
    embed_size = 100
 
    def get_coefs(word, *arr):
      '''Inner function, retrieve coefficients from embeddings.
      '''
      return word, np.asarray(arr, dtype='float32')


    embeddings_index = dict()
    # Transfer the embedding weights into a dictionary by iterating through every line of the file.
    # The context manager closes the file even if parsing fails part-way.
    with open(EMBEDDING_FILE, encoding='utf-8') as f:
        for line in f:
            # split up line into an indexed array
            values = line.rstrip().rsplit(' ')
            # first index is word
            word = values[0]
            # store the rest of the values in the array as a new array
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs  # 100 dimensions

    print('Loaded %s word vectors.' % len(embeddings_index))

    gc.collect()
    # We get the mean and standard deviation of the embedding weights so that we could maintain the
    # same statistics for the rest of our own random generated weights.
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean, emb_std = all_embs.mean(), all_embs.std()

    # Keras Tokenizer indexes start at 1, so add one row to make the largest index fit.
    nb_words = len(vocab_dict) + 1
    # We are going to set the embedding size to the pretrained dimension as we are replicating it.
    # the size will be (Number of Words in Vocab + 1) X Embedding Size
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    gc.collect()

    # With the newly created embedding matrix, we'll fill it up with the words that we have in both
    # our own dictionary and loaded pretrained embedding.
    embeddedCount = 0
    for word, i in vocab_dict.items():
        #i -= 1
        # then we see if this word is in glove's dictionary, if yes, get the corresponding weights
        embedding_vector = embeddings_index.get(word)
        # and store inside the embedding matrix that we will train later on.
        if embedding_vector is not None:
            try :
                embedding_matrix[i] = embedding_vector
                embeddedCount += 1
            except IndexError:
                pass
    print('total embedded:', embeddedCount, 'common words')

    del embeddings_index
    gc.collect()

    # finally, return the embedding matrix
    return embedding_matrix

def make_tokenizer(texts):
    '''Keras preprocessing tokenizer will be used for tokenizing texts in order
    to build a vocabulary to cross-reference with the Twitter GloVe embeddings.
    '''
    from keras.preprocessing.text import Tokenizer
    t = Tokenizer()
    t.fit_on_texts(texts)
    return t

def create_dictionary(tokenizer):
    '''Create dictionaries mapping words to indexes
    and indexes to words.
    '''
    index_word = {}
    for word in tokenizer.word_index.keys():
        index_word[tokenizer.word_index[word]] = word

    vocab_dict = tokenizer.word_index


    return vocab_dict, index_word 
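
The sketch below walks through the same idea on a tiny in-memory file, using only the standard library. Everything in it is illustrative: `load_glove_lines`, `build_embedding_matrix`, the three-dimensional vectors and the toy vocabulary are hypothetical, and the random rows are drawn from a standard normal rather than from the GloVe mean and standard deviation. It assumes a 1-based word index, as produced by the Keras `Tokenizer`, and therefore allocates one row more than the vocabulary size so the highest index fits.

```python
import io
import random

def load_glove_lines(handle):
    """Parse GloVe text format: one word followed by its vector per line."""
    vectors = {}
    for line in handle:
        values = line.rstrip().split(' ')
        vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, embed_size, seed=0):
    """Row i holds the vector of the word whose index is i.

    word_index is assumed to be 1-based, so the matrix gets
    len(word_index) + 1 rows and row 0 stays an unused random row.
    """
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(embed_size)]
              for _ in range(len(word_index) + 1)]
    found = 0
    for word, i in word_index.items():
        if word in vectors:
            # Words missing from the pretrained file keep their random row.
            matrix[i] = vectors[word]
            found += 1
    return matrix, found

# Hypothetical 3-dimensional vectors standing in for the 100d GloVe file.
fake_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
vectors = load_glove_lines(fake_file)
word_index = {'good': 1, 'bad': 2, 'zzzunknown': 3}

matrix, found = build_embedding_matrix(word_index, vectors, embed_size=3)
print(len(matrix), found)  # 4 2
print(matrix[2])           # [-0.1, -0.2, -0.3]
```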


Load the embedding matrix for our text using the functions above.

In [None]:
tokenizer = make_tokenizer(corpus_formatted_train['texts'])
vocab_dict, index_word = create_dictionary(tokenizer)
embed_mat = loadEmbeddingMatrix("gloveTwitter100d", vocab_dict)


Loaded 1193514 word vectors.
total embedded: 7267 common words


#### Crafting synonym dictionaries and using them to build synthetic examples
First, we build our synonym index and word dictionaries.

In [None]:
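def synonyms_from_embeddings (embed_mat):
  ''' Determine synonyms for words using word embeddings and knn.
  Output is a dictionary mapping each word index to the indexes of its
  nearest neighbours. Relies on the global `synonyms_number`.
  '''
  word_number = embed_mat.shape[0]

  # Ask for one extra neighbour because each word's nearest neighbour is itself.
  nn = NearestNeighbors(n_neighbors=synonyms_number+1).fit(embed_mat)

  # Row 0 is skipped: the tokenizer's word indexes start at 1.
  neighbours_mat = nn.kneighbors(embed_mat[1:word_number])[1]

  # Key each row by its query index rather than by its first neighbour,
  # which may not be the word itself when two vectors are identical.
  synonyms = {i: row[1:] for i, row in enumerate(neighbours_mat, start=1)}

  return synonyms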

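def map_words_to_synonyms (synonyms, index_word):
  ''' Convert the word-index dictionary to a dictionary of actual words.
  '''
  synonym = {}
  # Iterate over the entries directly so that the last word is not skipped
  # and every stored neighbour is converted.
  for x, neighbour_indexes in synonyms.items():
    try :
      synonym[index_word[x]] = [index_word[i] for i in neighbour_indexes]
    except KeyError:
      # Skip entries that involve an index with no word (e.g. row 0).
      pass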
  return synonym
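
As a small illustration of what the nearest-neighbour lookup produces, the sketch below runs a brute-force search on hand-made 2-dimensional vectors. It is a stand-in for the `NearestNeighbors` call, not the same implementation: `nearest_neighbours`, the vectors and the toy `index_word` mapping are hypothetical. It uses Euclidean distance, which is also the default metric of scikit-learn's `NearestNeighbors`.

```python
import math

def nearest_neighbours(vectors, k):
    """Return {index: [k nearest other indexes]} by Euclidean distance."""
    neighbours = {}
    for i, vec in enumerate(vectors):
        # Exclude the query itself explicitly instead of dropping "neighbour 0".
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: math.dist(vec, vectors[j]))
        neighbours[i] = others[:k]
    return neighbours

# Toy vocabulary with two obvious clusters.
index_word = {0: 'women', 1: 'woman', 2: 'girls', 3: 'paint', 4: 'yellow'}
vectors = [
    [1.0, 1.0],
    [1.1, 1.0],
    [1.0, 1.3],
    [5.0, 5.0],
    [5.2, 5.0],
]

neighbours = nearest_neighbours(vectors, k=2)
synonyms = {index_word[i]: [index_word[j] for j in js]
            for i, js in neighbours.items()}
print(synonyms['women'])  # ['woman', 'girls']
print(synonyms['paint'])
```

Note that with a fixed `k` every word receives `k` "synonyms", even when its nearest neighbours are far away, which is one reason some of the replacements printed further down look only loosely related.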

Storing synonym dictionaries using functions above.

In [None]:
# How many synonyms to use in dictionary?
synonyms_number = 3

# Two dictionaries mapping a word to its synonyms-
# One for indexes, one for words
synonym_index_dict = synonyms_from_embeddings (embed_mat)
synonym_word_dict = map_words_to_synonyms(synonym_index_dict, index_word)

#### Creating Synthetic Tweets
Functions for creating synthetic tweets through our synonym dictionary

In [None]:
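def create_synthetic_tweet(tweet, replacements):
  '''Sub function for creating synthetic tweet.
  Each word in `replacements` is swapped for one of its synonyms.
  '''
  import re

  # Pick which synonym to use (0 = nearest neighbour), favouring closer ones.
  # Assumes three synonyms per word (synonyms_number = 3).
  rand_int = int(np.random.choice(3, p=[0.7, 0.2, 0.1]))

  for word, syn_words in replacements:
    # Match whole words only; str.replace would also rewrite substrings,
    # e.g. every 'r' inside other words.
    pattern = r'\b' + re.escape(word) + r'\b'
    tweet = re.sub(pattern, lambda match: syn_words[rand_int], tweet)

  print('Synthetic Tweet: ' + tweet)
  return tweet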

def create_synthetic_tweet_df(df, num_artificial_examples):
  '''
  Given training dataframe and number of artificial examples desired,
  create a dataframe of synthetic tweets and append to original dataframe.

  Inputs
    - df: pandas DF (obj)
    - num_artificial_examples: int

  Output
    - pandas DF (obj): original rows plus synthetic positives, shuffled
  '''

  pos_tweets = df[df["labels"]=="pos"]

  artificial_examples = {"tweet":[], "label":[]}

  tokenizer = TweetTokenizer()

  for tweet in pos_tweets["texts"]:
    
      # Tokenize the text
      tokenized = tokenizer.tokenize(tweet)

      # Get the list of words from the entire text
      words = word_tokenize(tweet)
      
      replacements = []

      print('New Tweet ----')
      print('')
      print('Words and Syns:')
      print('')
      for word in words:
          
          synonym = []
          word_index = vocab_dict.get(word, None)



          if not word_index:
            continue

          #We will only replace 20% of the time
          rndm_replace_decision = np.random.randint(5)

          if rndm_replace_decision != 1:
            continue

          if (word not in nltk.corpus.stopwords.words('english')) :
            syn_indexes = synonym_index_dict[word_index]
            syn_words = [index_word[syn_index] for syn_index in syn_indexes]
          
            if syn_words :
                print('word: ' + word + '| synonyms: ' + str(syn_words))
                replacements.append((word, syn_words))
      
      for i in range(num_artificial_examples):
        print('')
        print('Original Tweet: ' + tweet)
        artificial_examples["tweet"].append(create_synthetic_tweet(tweet, replacements))
        artificial_examples["label"].append("pos")
        print("Artificial example created")

        
        
  artificial_example_df = pd.DataFrame(artificial_examples).drop_duplicates(subset=["tweet"])
  artificial_example_df = artificial_example_df.rename(columns={'tweet': 'texts', 'label': 'labels'})

  df_with_synthetic_tweets = pd.concat([df, artificial_example_df]).sample(frac=1)
  print()
  print('{} synthetic tweets manufactured from {} positive examples'.format(artificial_example_df.shape[0],
                                                                            pos_tweets.shape[0]))

  return df_with_synthetic_tweets
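
One detail worth seeing in isolation is how a word is swapped inside the tweet string. The sketch below is a hypothetical helper (`replace_whole_words`, with a seeded `random.Random` in place of NumPy) that replaces whole tokens only, and contrasts it with plain `str.replace`, which also rewrites matching substrings inside longer words. It assumes the words to replace are plain alphanumeric tokens.

```python
import random
import re

def replace_whole_words(tweet, replacements, rng):
    """replacements: list of (word, [synonym, ...]) pairs.

    One synonym rank is drawn per tweet, weighted towards the nearest
    neighbour, and each word is swapped only where it is a whole token.
    """
    rank = rng.choices([0, 1, 2], weights=[0.7, 0.2, 0.1])[0]
    for word, synonyms in replacements:
        pattern = r'\b' + re.escape(word) + r'\b'
        # A function replacement avoids any backslash handling in the synonym.
        tweet = re.sub(pattern, lambda match: synonyms[rank], tweet)
    return tweet

rng = random.Random(0)
tweet = 'u r wrong about women'
replacements = [('r', ['b', 'm', 'u']), ('women', ['woman', 'girls', 'men'])]

synthetic = replace_whole_words(tweet, replacements, rng)
print(synthetic)

# Plain substring replacement also changes the 'r' inside "wrong".
naive = tweet.replace('r', 'b')
print(naive)  # u b wbong about women
```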

In [None]:
corpus_formatted_train_w_synthetic = create_synthetic_tweet_df(corpus_formatted_train, 1)

New Tweet ----

Words and Syns:

word: prepared| synonyms: ['preparing', 'expect', 'however']
word: shiny| synonyms: ['colorful', 'yellow', 'paint']
word: politics| synonyms: ['political', 'democracy', 'policy']
word: sterilized| synonyms: ['labelled', 'modernized', 'weaponry']
word: women| synonyms: ['woman', 'girls', 'men']
word: especially| synonyms: ['except', 'both', 'unlike']
word: years| synonyms: ['year', 'months', 'days']
word: r| synonyms: ['b', 'm', 'u']
word: aliens| synonyms: ['robots', 'alien', 'invasion']
word: population| synonyms: ['average', 'percent', 'species']

Original Tweet: @SinaiGail9 So, you’re prepared? You’ve come with - shiny objects - to toss away from the subject of abortion in relationship to white-identity-politics aka “lily-white” politics?  In other words: You are attempting to deceive   You’re blocked for your crimes!  https://t.co/eqcPNPiRvK</s>@blackrepublican So called “Feeble minded” whites were also sterilized &amp; encouraged 2 ab

Below are functions for additional over- and undersampling beyond the synthetic oversampling above. We choose to undersample in addition to the oversampling above.

In [None]:
def over_sample(train_df, baseline = False):

  if baseline == False:
    pos_texts = train_df[train_df.labels == 'pos']
    neg_texts = train_df[train_df.labels == 'neg']
  elif baseline == True:
    pos_texts = train_df[train_df.labels == 1]
    neg_texts = train_df[train_df.labels == 0]
  
  count_class_neg = neg_texts.shape[0]

  print('Minority class has {} rows'.format(pos_texts.shape[0]))
  print('Majority class has {} rows'.format(count_class_neg))
  print('{} new rows created'.format(count_class_neg - pos_texts.shape[0]))

  pos_texts_over = pos_texts.sample(count_class_neg, replace=True)

  texts_oversampled = pd.concat([neg_texts, pos_texts_over], axis=0)

  #Randomize before returning
  return texts_oversampled.sample(frac=1)

def under_sample(train_df, baseline = False):

  if baseline == False:
    pos_texts = train_df[train_df.labels == 'pos']
    neg_texts = train_df[train_df.labels == 'neg']
  elif baseline == True:
    pos_texts = train_df[train_df.labels == 1]
    neg_texts = train_df[train_df.labels == 0]

  count_class_pos = pos_texts.shape[0]

  print('Minority class has {} rows'.format(count_class_pos))
  print('Majority class has {} rows'.format(neg_texts.shape[0]))
  print('{} rows removed'.format(neg_texts.shape[0] - count_class_pos))

  # Sample without replacement so that no negative row is duplicated.
  neg_texts_under = neg_texts.sample(count_class_pos, replace=False)

  texts_undersampled = pd.concat([pos_texts, neg_texts_under], axis=0)

  #Randomize before returning  
  return texts_undersampled.sample(frac=1)
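
The difference between the two strategies is easiest to see on a toy list of rows. The sketch below is a pure-Python illustration, not the pandas code above: `over_sample_rows` and `under_sample_rows` are hypothetical helpers, and both assume that `'pos'` is the minority class. Unlike `over_sample`, which redraws the whole positive class with replacement, this variant keeps every original positive and only adds the missing number of duplicates.

```python
import random

def over_sample_rows(rows, rng):
    """Grow the minority class with duplicates until both classes match."""
    pos = [r for r in rows if r[1] == 'pos']
    neg = [r for r in rows if r[1] == 'neg']
    # Draw the missing positives with replacement from the existing ones.
    extra = rng.choices(pos, k=len(neg) - len(pos))
    balanced = neg + pos + extra
    rng.shuffle(balanced)
    return balanced

def under_sample_rows(rows, rng):
    """Shrink the majority class by sampling without replacement."""
    pos = [r for r in rows if r[1] == 'pos']
    neg = [r for r in rows if r[1] == 'neg']
    balanced = pos + rng.sample(neg, k=len(pos))
    rng.shuffle(balanced)
    return balanced

rng = random.Random(0)
rows = ([(f'neg {i}', 'neg') for i in range(8)] +
        [(f'pos {i}', 'pos') for i in range(3)])

over = over_sample_rows(rows, rng)
under = under_sample_rows(rows, rng)

print(len(over), sum(1 for r in over if r[1] == 'pos'))    # 16 8
print(len(under), sum(1 for r in under if r[1] == 'neg'))  # 6 3
```

Oversampling keeps all the data at the cost of repeated positives; undersampling discards negatives, so the balanced set is smaller but contains no duplicates.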

In [None]:
# Balancing our train corpus with synthetic data. Note that this calls
# over_sample, despite the "_us" suffix in the variable and file names.
corpus_formatted_train_w_synthetic_us = over_sample(corpus_formatted_train_w_synthetic)

# Standard oversampling for our final model for comparison
corpus_formatted_train_standard_os = over_sample(corpus_formatted_train)

# Standard oversampling for our baseline model for comparison
corpus_formatted_train_baseline_standard_os = over_sample(corpus_formatted_train_baseline, baseline= True)

Minority class has 710 rows
Majority class has 857 rows
147 new rows created
Minority class has 355 rows
Majority class has 857 rows
502 new rows created
Minority class has 355 rows
Majority class has 857 rows
502 new rows created


In [None]:
corpus_formatted_train_w_synthetic_us.to_csv("/content/drive/My Drive/data_twitter/train_data_w_ae_us_" + str(today) + "_v" + version + ".csv",
                                          index = False)


#### Train Corpus with Standard Oversampling for Comparison

In [None]:
corpus_formatted_train_standard_os.to_csv("/content/drive/My Drive/data_twitter/train_data_standard_os_" + str(today) + "_v" + version + ".csv",
                                          index = False)


#### Standard Oversampling on Baseline Corpus

In [None]:
corpus_formatted_train_baseline_standard_os.to_csv("/content/drive/My Drive/data_twitter/train_data_baseline_" + str(today) + "_v" + version + ".csv",
                            index = False)


In [None]:
corpus_train = pd.read_csv("/content/drive/My Drive/data_twitter/train_data_w_ae_us_2021-04-04_v1.csv")

Additional Testing

In [None]:
corpus_train['texts2'] = corpus_train['texts'].apply(lambda x: x.split('</s>')[0])


In [None]:
corpus_train['texts3'] = corpus_train['texts'].apply(lambda x: x.replace('</s>', ' '))

In [None]:
corpus_train_1 = corpus_train.drop(columns=['texts', 'texts3'])
corpus_train_1 = corpus_train_1.rename(columns = {'texts2': 'texts'})
corpus_train_1.to_csv("/content/drive/My Drive/data_twitter/train_data_singleconvoo_" + str(today) + "_v" + version + ".csv",
                            index = False)

In [None]:
corpus_train_2 = corpus_train.drop(columns=['texts', 'texts2'])
corpus_train_2 = corpus_train_2.rename(columns = {'texts3': 'texts'})
corpus_train_2.to_csv("/content/drive/My Drive/data_twitter/train_data_nosplitconvo_" + str(today) + "_v" + version + ".csv",
                            index = False)