# Data Preparation and Text Cleaning

Numerous data augmentation techniques can improve predictions on NLP datasets. These include:

Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Random Insertion: Like Synonym Replacement, except that the chosen synonyms are inserted at random positions in the sentence instead of replacing the original words.

Random Swap: Randomly choose two words in the sentence and swap their positions. 

Random Deletion: Randomly remove each word in the sentence with probability p. 
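
To make the last two techniques concrete, here is a minimal sketch of Random Swap and Random Deletion in plain Python. It is an illustration only and is not how nlpaug implements them; the function names, the sample sentence, and the rule of always keeping at least one word are assumptions made for this example.

```python
import random

def random_swap(words, rng):
    """Return a copy of `words` with two randomly chosen positions swapped."""
    words = list(words)
    if len(words) < 2:
        return words
    i, j = rng.sample(range(len(words)), 2)  # two distinct positions
    words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word independently with probability `p`."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Assumption for this sketch: keep one word so the sentence is never empty.
    return kept if kept else [rng.choice(words)]

rng = random.Random(42)
tokens = "forest fire near la ronge sask canada".split()
swapped = random_swap(tokens, rng)
deleted = random_deletion(tokens, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(deleted))
```

Passing an explicit random.Random instance keeps the augmentation reproducible without touching the global random state.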




In this notebook the nlpaug package will be used: https://nlpaug.readthedocs.io/en/latest/. 



In [None]:
!pip install keras-core --upgrade
!pip install -q keras-nlp --upgrade
!pip install nlpaug
!pip install sacremoses

import os
os.environ['KERAS_BACKEND'] = 'tensorflow'

In [None]:
import numpy as np 
import pandas as pd 
import tensorflow as tf
import keras_core as keras
import keras_nlp
import nlpaug.augmenter.word as naw
from tqdm import tqdm

import random as python_random
import re
import string
import emoji


print("TensorFlow version:", tf.__version__)
print("KerasNLP version:", keras_nlp.__version__)

This code was necessary to download the NLTK resources within the Kaggle kernel.

In [None]:
import nltk
import subprocess

# Download and unzip wordnet
try:
    nltk.data.find('wordnet.zip')
except LookupError:
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

# Now you can import the NLTK resources as usual
from nltk.corpus import wordnet

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

print('Training Set Shape = {}'.format(train.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(train.memory_usage().sum() / 1024**2))
print('Test Set Shape = {}'.format(test.shape))
print('Test Set Memory Usage = {:.2f} MB'.format(test.memory_usage().sum() / 1024**2))

In [None]:
train.head()

In [None]:
test.head()

A lot of the tweets in the dataset need to be cleaned up. Doing so should improve the results. In researching a way to clean up this text, the following Stack Overflow post was extremely helpful: https://stackoverflow.com/questions/64719706/cleaning-twitter-data-pandas-python

In [None]:
train_clean_tweets = []
for tweet in train['text']:
    tweet = re.sub(r"@[A-Za-z0-9]+", "", tweet)  # Remove @mentions
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # Remove any remaining @handles and links
    tweet = " ".join(tweet.split())
    emojis = emoji.distinct_emoji_list(tweet)
    tweet = ''.join(c for c in tweet if c not in emojis) #Remove Emojis
    tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
    #tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
         #if w.lower() in tweet or not w.isalpha())
    train_clean_tweets.append(tweet)
    
train['clean_text'] = train_clean_tweets

In [None]:
train

In [None]:
test_clean_tweets = []
for tweet in test['text']:
    tweet = re.sub(r"@[A-Za-z0-9]+", "", tweet)  # Remove @mentions
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # Remove any remaining @handles and links
    tweet = " ".join(tweet.split())
    emojis = emoji.distinct_emoji_list(tweet)
    tweet = ''.join(c for c in tweet if c not in emojis) #Remove Emojis
    tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
    #tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
         #if w.lower() in tweet or not w.isalpha())
    test_clean_tweets.append(tweet)
    
test['clean_text'] = test_clean_tweets
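
The train and test loops above apply the same steps, so they could be folded into one helper. The sketch below is one possible version and not the code that was actually run: the name clean_tweet is hypothetical, the emoji step is left out so that only the standard library is needed, and whitespace is collapsed last so that replacing underscores does not leave double spaces.

```python
import re

MENTION_RE = re.compile(r"@[A-Za-z0-9]+")
LINK_RE = re.compile(r"(?:\@|http?\://|https?\://|www)\S+")

def clean_tweet(tweet):
    """Strip mentions and links, drop '#', turn '_' into spaces, tidy whitespace."""
    tweet = MENTION_RE.sub("", tweet)   # remove @mentions
    tweet = LINK_RE.sub("", tweet)      # remove leftover @handles and links
    tweet = tweet.replace("#", "").replace("_", " ")  # keep the hashtag text
    return " ".join(tweet.split())      # collapse runs of whitespace

print(clean_tweet("@user Fire near http://t.co/abc #forest_fire   now"))
# Fire near forest fire now
```

With such a helper, each dataframe could be cleaned with a single call such as train['text'].apply(clean_tweet), which keeps the train and test cleaning from drifting apart.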

# Varying Target Values for the Same Tweets

The number of unique values in each column of the train dataset shows that, out of 7613 total rows, only 6922 input texts are unique, which leaves 691 rows (7613 - 6922) that repeat text found elsewhere. That is a lot. The question is whether every occurrence of a repeated tweet is labeled with the same target value.

In [None]:
train.nunique()

To explore this potential labeling issue, a new column called 'unique_text' is created. It gives each distinct cleaned tweet an integer id so that the most frequently repeated tweets can be examined.

In [None]:
train['unique_text'] = pd.factorize(train['clean_text'])[0] + 1
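
For intuition, pd.factorize with its default settings assigns integer codes in order of first appearance, starting at 0, which is why 1 is added above. The plain-Python sketch below mimics that behaviour on a few made-up strings; the factorize function here is a stand-in written for illustration, not the pandas implementation.

```python
def factorize(values):
    """Assign each distinct value an integer code in order of first appearance."""
    first_seen = {}
    codes = []
    for value in values:
        if value not in first_seen:
            first_seen[value] = len(first_seen)  # next unused code
        codes.append(first_seen[value])
    return codes

texts = ["fire downtown", "flood warning", "fire downtown", "all clear"]
ids = [code + 1 for code in factorize(texts)]  # shift so ids start at 1
print(ids)  # [1, 2, 1, 3]
```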

In [None]:
train

Among the five most frequent tweets, only the fourth, with id 4061, has inconsistent target values. The tweet does not appear to describe a disaster, yet 5 of its 17 occurrences were labeled as one.

In [None]:
train['unique_text'].value_counts().nlargest(5)

In [None]:
print(train.loc[train['unique_text'] == 4061])

There are 314 tweets that appear more than once. There is a good chance that more of these have different target labels for the same text.

In [None]:
train['unique_text'].value_counts().ne(1).sum()

One way to correct this potential problem is to use the target mode for a set of duplicate tweets and change any targets that don't match to this mode value. For instance, in the example above for number 4061, the mode would be 0 and the 5 values that are not 0 would be changed to 0. 

To start this process a new dataframe is created to capture the mode for each unique tweet. 

In [None]:
train_unique_mode = train.groupby('unique_text').agg({'target': lambda x: x.value_counts().index[0]}).reset_index()

In [None]:
train_unique_mode

These mode values are then added as a new column called 'new_target' in the train dataset. 

In [None]:
train['new_target'] = train['unique_text'].map(train_unique_mode.set_index('unique_text')['target'])

In [None]:
train

It looks like there are 89 rows where the new target differs from the original target, which means 89 labels were changed to the mode of their group of duplicate tweets.

In [None]:
len(train.query('new_target != target'))
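
The relabeling done in this section can be illustrated on a toy example with the standard library only. The tweets and labels below are invented for the illustration, and the variable names are hypothetical; the idea is simply to group labels by text, take the most common label per group, and count how many labels change.

```python
from collections import Counter, defaultdict

texts = ["storm ahead", "storm ahead", "storm ahead", "bridge down", "bridge down", "nice day"]
targets = [0, 1, 0, 1, 1, 0]

# Collect every label that was given to each distinct text.
labels_by_text = defaultdict(list)
for text, target in zip(texts, targets):
    labels_by_text[text].append(target)

# The mode is the most common label within each group of duplicates.
mode_by_text = {
    text: Counter(labels).most_common(1)[0][0]
    for text, labels in labels_by_text.items()
}

new_targets = [mode_by_text[text] for text in texts]
changed = sum(new != old for new, old in zip(new_targets, targets))
print(new_targets)  # [0, 0, 0, 1, 1, 0]
print(changed)      # 1
```

Groups with an exact tie have no unique mode, so a real pipeline would probably want an explicit rule for them; the toy data above avoids ties on purpose.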

# Synonym Replacement

nlpaug has a synonym replacement function. It seemed to work best by going through a list instead of a dataframe column, so the train_chunk list was created out of the 'clean_text' column. 

In [None]:
train_chunk = train['clean_text'].tolist()
len(train_chunk)

The following cell creates the synonym augmenter. With its default settings, SynonymAug draws synonyms from WordNet, which is why the WordNet corpus was downloaded earlier; no pretrained model, GPU device, or batch size needs to be configured.

In [None]:
synonym_replace_aug = naw.SynonymAug()
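
To show what synonym replacement does conceptually, here is a small sketch that uses a hand-written synonym table instead of WordNet. Everything in it (the SYNONYMS and STOP_WORDS tables, the synonym_replace function, the sample sentence) is made up for illustration and does not reflect how nlpaug selects or replaces words internally.

```python
import random

# Toy lookup tables, assumed for this example only.
SYNONYMS = {
    "big": ["large", "huge"],
    "fire": ["blaze", "inferno"],
    "road": ["street", "highway"],
}
STOP_WORDS = {"a", "the", "on", "is", "near"}

def synonym_replace(sentence, n, rng):
    """Replace up to `n` non-stop words that have a known synonym."""
    words = sentence.split()
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS
    ]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

rng = random.Random(0)
print(synonym_replace("a big fire is near the road", n=2, rng=rng))
```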

As mentioned above, the augmenter seems to work best when iterating through a list, so the synonym list was created to hold the final results. The loop is wrapped in a tqdm progress bar to track its progress.

In [None]:
synonym = []

for i in tqdm(train_chunk):
    row_synonym = synonym_replace_aug.augment(i)
    synonym.append(row_synonym)

In [None]:
len(synonym)

In [None]:
synonym[0:10]

A new 'augment' column is then created in the train dataframe with the results of the synonym replacement. Each augmentation result is a list, so only its first element is kept.

In [None]:
train['augment'] = synonym
train['augment'] = train['augment'].str[0]

In [None]:
train

The augmented dataframe can be saved to a csv file for use in future notebooks without re-running the augmentation. The line is commented out here.

In [None]:
#train.to_csv("train_augment.csv", index=False)

The same synonym replacement process is applied to the test dataframe.

In [None]:
test_chunk = test['clean_text'].tolist()
len(test_chunk)

In [None]:
test_synonym = []

for i in tqdm(test_chunk):
    row_synonym = synonym_replace_aug.augment(i)
    test_synonym.append(row_synonym)

In [None]:
len(test_synonym)

In [None]:
test['augment'] = test_synonym
test['augment'] = test['augment'].str[0]

In [None]:
test

In [None]:
#test.to_csv("test_augment.csv", index=False)

# Preparing to Use the Model

The parameters from the starter notebook are used here and an 80/20 validation split is performed below. 

In [None]:
BATCH_SIZE = 32
NUM_TRAINING_EXAMPLES = train.shape[0]
TRAIN_SPLIT = 0.8
VAL_SPLIT = 0.2
STEPS_PER_EPOCH = int(NUM_TRAINING_EXAMPLES * TRAIN_SPLIT) // BATCH_SIZE

EPOCHS = 2
AUTO = tf.data.experimental.AUTOTUNE
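
As a quick check of the arithmetic behind the steps-per-epoch value, the sketch below works it through for the 7613 training rows reported earlier. The lowercase variable names are local to this example.

```python
num_examples = 7613   # rows in the train dataset
train_split = 0.8
batch_size = 32

train_examples = int(num_examples * train_split)  # 6090 rows used for training
full_batches = train_examples // batch_size       # 190 full batches per epoch
leftover = train_examples % batch_size            # 10 rows in a final partial batch

print(train_examples, full_batches, leftover)  # 6090 190 10
```

Converting the product to an integer before the floor division keeps the step count an int rather than a float.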

In [None]:
from sklearn.model_selection import train_test_split

X = train["augment"]
y = train["new_target"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=VAL_SPLIT, random_state=42)

X_test = test["augment"]

To ensure the results are the same across multiple runs, the random seeds are fixed:

In [None]:
def reset_seeds():
    np.random.seed(42)
    python_random.seed(42)
    tf.random.set_seed(42)

reset_seeds()
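
The effect of fixing a seed can be seen with Python's own random module: two generators created with the same seed produce the same sequence. This is only a small illustration with a hypothetical draw helper, and it says nothing about GPU operations, which may still introduce some nondeterminism during training.

```python
import random

def draw(seed, k=5):
    """Return k pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(k)]

first = draw(42)
second = draw(42)
print(first == second)  # True: same seed, same sequence
```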

# Running the Model

Text inputs need to be transformed to numeric token ids and arranged in several Tensors before being input to BERT.

The DistilBertClassifier model can be configured with a preprocessor layer, in which case it will automatically apply preprocessing to raw inputs during fit(), predict(), and evaluate(). This is done by default when creating the model with from_preset().

The DistilBERT model chosen here is a distilled (approximate) version of BERT. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark (the [paper](https://arxiv.org/abs/1910.01108) reports that 97% of BERT's language understanding capabilities are retained).

Specifically, it has no token-type embeddings or pooler, and it retains only half of the layers from Google's BERT.

In [None]:
# Load a DistilBERT model.
preset= "distil_bert_base_en_uncased"

# Use a shorter sequence length.
preprocessor = keras_nlp.models.DistilBertPreprocessor.from_preset(preset,
                                                                   sequence_length=160,
                                                                   name="preprocessor_4_tweets"
                                                                  )

# Pretrained classifier.
classifier = keras_nlp.models.DistilBertClassifier.from_preset(preset,
                                                               preprocessor = preprocessor, 
                                                               num_classes=2)

classifier.summary()
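
To give a feel for what the preprocessing step produces, the sketch below turns a sentence into padded token ids plus a padding mask. It is a toy: the vocabulary and special-token ids are invented, words are split on whitespace instead of using DistilBERT's WordPiece tokenizer, and the dictionary keys are assumed to mirror the names used by the KerasNLP preprocessor.

```python
# Hypothetical special-token ids and vocabulary for illustration only.
PAD, UNK, CLS, SEP = 0, 1, 2, 3
VOCAB = {"forest": 4, "fire": 5, "near": 6}

def preprocess(text, sequence_length):
    """Map words to ids, add [CLS]/[SEP], then truncate or pad to a fixed length."""
    ids = [CLS] + [VOCAB.get(word, UNK) for word in text.lower().split()]
    ids = ids[: sequence_length - 1] + [SEP]   # leave room for the final [SEP]
    n_real = len(ids)
    n_pad = sequence_length - n_real
    return {
        "token_ids": ids + [PAD] * n_pad,
        "padding_mask": [1] * n_real + [0] * n_pad,  # 1 = real token, 0 = padding
    }

features = preprocess("Forest fire near town", sequence_length=8)
print(features["token_ids"])     # [2, 4, 5, 6, 1, 3, 0, 0]
print(features["padding_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The word "town" is not in the toy vocabulary, so it maps to the unknown-token id. A shorter sequence_length, such as the 160 used above, trades some truncation risk for less padding per tweet.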

In [None]:
# Compile
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True), #'binary_crossentropy',
    optimizer=keras.optimizers.Adam(1e-5),
    metrics= ["accuracy"]  
)

# Fit
history = classifier.fit(x=X_train,
                         y=y_train,
                         batch_size=BATCH_SIZE,
                         epochs=EPOCHS, 
                         validation_data=(X_val, y_val)
                        )
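
Because the classifier outputs raw logits for the two classes, the loss is configured with from_logits=True. The sketch below computes the same quantity for a single example by hand, namely the negative log of the softmax probability of the true class. The function is written for illustration under that definition and is not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example, given raw logits and an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]  # equals -log(softmax(logits)[label])

uncertain = sparse_categorical_crossentropy([0.0, 0.0], label=1)
confident_right = sparse_categorical_crossentropy([2.0, 0.0], label=0)
confident_wrong = sparse_categorical_crossentropy([2.0, 0.0], label=1)
print(round(uncertain, 4))  # 0.6931, i.e. ln(2) for a 50/50 prediction
print(confident_right < uncertain < confident_wrong)  # True
```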

# Submission 

In [None]:
# Re-seed before predicting, using the reset_seeds() helper defined earlier.
reset_seeds()

In [None]:
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
sample_submission.head()

In [None]:
sample_submission["target"] = np.argmax(classifier.predict(X_test), axis=1)
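
The prediction step returns one row of two logits per tweet, and the argmax along axis 1 picks the index of the larger one as the predicted class. A plain-Python sketch of that step, using made-up logits:

```python
def argmax(row):
    """Index of the largest value in `row` (the first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

logits = [[1.2, -0.3], [-0.5, 0.8], [0.1, 0.1]]
predictions = [argmax(row) for row in logits]
print(predictions)  # [0, 1, 0]
```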

In [None]:
sample_submission.to_csv("submission.csv", index=False)