The aim of this notebook is to explore the dataset text through the lens of the embeddings. We want to maximize the number of words in the dataset that have embeddings and minimize out-of-vocabulary (OOV) words. In other words, this notebook looks for the preprocessing operations that make the dataset best suited to a given embedding.

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import operator
import pickle
from nltk.tokenize.treebank import TreebankWordTokenizer
from contractions import fix
import re

In [3]:
train_df = pd.read_csv('../data/train.csv')

In [4]:
train_df['comment_text'].head()

0    This is so cool. It's like, 'would you want yo...
1    Thank you!! This would make my life a lot less...
2    This is such an urgent design problem; kudos t...
3    Is this something I'll be able to install on m...
4                 haha you guys are a bunch of losers.
Name: comment_text, dtype: object

Drop the rows where the comment is empty. Fortunately, it is only 3 of the ~1.8 million rows.

In [5]:
print(train_df['comment_text'].isnull().sum())
train_df = train_df.dropna(subset=['comment_text'])
print(train_df['comment_text'].isnull().sum())

3
0


### Load GloVe Embeddings

In [6]:
def load_glove_vocab(filepath='../data/glove.6B/glove.6B.50d.txt'):
    glove_vocab = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            glove_vocab[word] = np.array(values[1:], dtype='float32')
    # Padding token: a zero vector with the same dimension and dtype as the GloVe vectors
    glove_vocab['<pad>'] = np.zeros(len(values) - 1, dtype='float32')
    return glove_vocab

In [7]:
embeddings = load_glove_vocab()

### Vocabulary and Coverage Check

In [None]:
def build_vocab(sentences, verbose=True):
    """
    :param sentences: list of strings; each string is split on whitespace
    :param verbose: show a progress bar if True
    :return: dictionary of words and their count
    """
    vocab_count = {}
    for sentence in tqdm(sentences, disable=(not verbose)):
        for word in sentence.split():
            if word in vocab_count:
                vocab_count[word] += 1
            else:
                vocab_count[word] = 1
    return vocab_count

In [28]:
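def check_coverage(vocab, embeddings):
    """
    :param vocab: dictionary of words and their count
    :param embeddings: dictionary of words and their embedding vector
    :return: list of (word, count) for out-of-vocabulary words, most frequent first
    """
    oov = {}
    embeddings_found_unique, embeddings_found_total, no_embeddings_unique, no_embeddings_total = 0, 0, 0, 0
    for word in tqdm(vocab):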
        if word in embeddings:
            embeddings_found_unique += 1
            embeddings_found_total += vocab[word]
        else:
            no_embeddings_unique += 1
            no_embeddings_total += vocab[word]
            oov[word] = vocab[word]


    print('Found embeddings for {:.2%} of unique words'.format(embeddings_found_unique / len(vocab)))
    print('Found embeddings for {:.2%} of all text'.format(embeddings_found_total / (embeddings_found_total + no_embeddings_total)))
    
    oov_in_order_of_decreasing_occurrence = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return oov_in_order_of_decreasing_occurrence

In [10]:
dataset_vocab = build_vocab(list(train_df['comment_text']))

100%|██████████| 1804871/1804871 [00:37<00:00, 47886.90it/s]


In [11]:
oov = check_coverage(dataset_vocab, embeddings)

100%|██████████| 1670966/1670966 [00:02<00:00, 777042.66it/s]


Found embeddings for 5.58% of unique words
Found embeddings for 74.82% of all text


Unsurprisingly, we found embeddings for only 5.58% of the unique words in the text. Those words (like 'the', 'a') occur often, which is why they still cover about 75% of all text. Let's see which words were not found.
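
Before looking at the misses, the gap between the two figures can be illustrated on a toy corpus. This is a minimal sketch on made-up counts, not the real data; the `coverage` helper and the `toy_*` names are hypothetical and only mirror the two ratios printed by `check_coverage`.

```python
def coverage(vocab_counts, known_words):
    """Return (unique_coverage, total_coverage) as fractions between 0 and 1."""
    found_unique = sum(1 for w in vocab_counts if w in known_words)
    found_total = sum(c for w, c in vocab_counts.items() if w in known_words)
    return found_unique / len(vocab_counts), found_total / sum(vocab_counts.values())

# Made-up counts: one very frequent known word and three rare unknown ones.
toy_counts = {'the': 97, 'Trump': 1, "don't": 1, 'it.': 1}
toy_known = {'the', 'trump', 'do', 'not', 'it'}

unique_cov, total_cov = coverage(toy_counts, toy_known)
print('{:.2%} of unique words, {:.2%} of all text'.format(unique_cov, total_cov))
# prints: 25.00% of unique words, 97.00% of all text
```

A handful of frequent words can dominate the total coverage even when most distinct words are missing, which is the pattern seen above.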

In [12]:
oov[:10]

[('I', 861783),
 ('The', 435047),
 ("don't", 178881),
 ('Trump', 156956),
 ('It', 153815),
 ('You', 144381),
 ('If', 143987),
 ('And', 128132),
 ('This', 121363),
 ("it's", 100959)]

So we see that capitalized words are not found (the glove.6B vocabulary is lowercase), and neither are contractions that use an apostrophe (').
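
One way to gauge how much lowercasing alone would help is to check which OOV words become known once lowercased. The sketch below uses toy stand-ins for `oov` and the embedding vocabulary; the helper name `recoverable_by_lowercasing` is an assumption and not part of the notebook.

```python
def recoverable_by_lowercasing(oov_items, known_words):
    """Split OOV (word, count) pairs by whether the lowercased word is known."""
    recovered = [(w, c) for w, c in oov_items if w.lower() in known_words]
    still_missing = [(w, c) for w, c in oov_items if w.lower() not in known_words]
    return recovered, still_missing

# Toy stand-ins for `oov` and the embedding vocabulary (not the real data).
toy_oov = [('I', 10), ('The', 7), ("don't", 4), ('Trump', 3)]
toy_known = {'i', 'the', 'trump', 'do', 'not'}

recovered, still_missing = recoverable_by_lowercasing(toy_oov, toy_known)
print(recovered)      # [('I', 10), ('The', 7), ('Trump', 3)]
print(still_missing)  # [("don't", 4)]
```

In the toy data the contraction stays missing after lowercasing, so it needs separate handling.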

### Handling Contractions and Case

In [13]:
def fixContractionsAndConvertCase(sentence):
    sentence = fix(sentence)
    sentence = sentence.lower()
    return sentence

In [14]:
sentence = train_df['comment_text'][5]
print(sentence, ' _______ ', fixContractionsAndConvertCase(sentence))
sentence = train_df['comment_text'][3]
print(sentence, ' _______ ', fixContractionsAndConvertCase(sentence))
sentence = train_df['comment_text'][0]
print(sentence, ' _______ ', fixContractionsAndConvertCase(sentence))

ur a sh*tty comment.  _______  you are a sh*tty comment.
Is this something I'll be able to install on my site? When will you be releasing it?  _______  is this something i will be able to install on my site? when will you be releasing it?
This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done!  _______  this is so cool. it is like, 'would you want your mother to read this??' really great idea, well done!


Pretty cool: contractions are expanded and the sentence is converted to lower case. Let's check the coverage again.

In [15]:
train_df['processed_text'] = train_df['comment_text'].apply(fixContractionsAndConvertCase)

In [16]:
dataset_vocab = build_vocab(list(train_df['processed_text']))
oov = check_coverage(dataset_vocab, embeddings)
oov[:10]

100%|██████████| 1804871/1804871 [00:49<00:00, 36611.17it/s]
100%|██████████| 1458321/1458321 [00:01<00:00, 861663.38it/s]


Found embeddings for 8.75% of unique words
Found embeddings for 86.85% of all text


[('it.', 85295),
 ('them.', 37493),
 ('it,', 29869),
 ('that.', 28301),
 ('yes,', 27927),
 ('you.', 26861),
 ('not.', 26028),
 ('"the', 25151),
 ("trump's", 24479),
 ('time.', 22364)]

We have an improvement. The out-of-vocabulary set now shows words with punctuation attached (e.g. 'it.', '"the'), which are not found in the embeddings. Let's handle these.
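
To estimate how many of these misses are caused purely by attached punctuation, one could strip leading and trailing punctuation from each OOV word and look the remainder up again. This is a rough diagnostic sketch on toy data; `fixable_by_stripping_punctuation` and the `toy_*` names are hypothetical, and `string.punctuation` covers ASCII punctuation only.

```python
import string

def fixable_by_stripping_punctuation(oov_items, known_words):
    """Return (word, core, count) triples for OOV words whose core, with
    leading and trailing ASCII punctuation stripped, is in known_words."""
    fixable = []
    for word, count in oov_items:
        core = word.strip(string.punctuation)
        if core and core != word and core in known_words:
            fixable.append((word, core, count))
    return fixable

# Toy stand-ins for `oov` and the embedding vocabulary (not the real data).
toy_oov = [('it.', 5), ('"the', 3), ("trump's", 2), ('yes,', 1)]
toy_known = {'it', 'the', 'trump', 'yes'}

fixable = fixable_by_stripping_punctuation(toy_oov, toy_known)
print(fixable)  # [('it.', 'it', 5), ('"the', 'the', 3), ('yes,', 'yes', 1)]
```

Words such as "trump's" are not caught by this check because the apostrophe sits inside the word rather than at its edge.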

### Special Characters

In [52]:
def removeSpecialCharacters(sentence):
    # Replace underscores with spaces so that e.g. 'on_set' becomes 'on set'
    sentence = re.sub(r'[_]', ' ', sentence)
    # Keep only word characters, whitespace and the listed punctuation; everything else
    # (quotes, apostrophes, #, *, ;, ...) usually adds little meaning and is removed
    sentence = re.sub(r"[^?.-:()%@!&=+/><,a-zA-Z\s0-9\w]", '', sentence)
    # Collapse repeated occurrences of a special character into one occurrence
    sentence = re.sub(r'([?.!#$%&()*+,-/:;_<=>@[^`|])\1+', r'\1', sentence)
    # Insert a space before and after each special character
    sentence = re.sub(r'([?.!#$%&()*+,-/:;_<=>@[^`|])', r' \1 ', sentence)
    # Collapse runs of the same whitespace character left over from the previous operation
    sentence = re.sub(r'([\s])\1+', r'\1', sentence)
    return sentence
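
One detail worth double-checking in the patterns above: inside a character class, `.-:` and `,-/` are read as ranges, not as three literal characters each. The small check below is independent of the notebook data and only illustrates the standard `re` behaviour; the variable names are made up for this example.

```python
import re

# '.-:' inside [...] is the range from '.' to ':', i.e. . / 0-9 :
dot_to_colon = re.findall(r'[.-:]', 'a-b.c/9:')
print(dot_to_colon)   # ['.', '/', '9', ':']

# ',-/' is the range from ',' to '/', i.e. , - . /
comma_to_slash = re.findall(r'[,-/]', 'a,b-c.d/e')
print(comma_to_slash)  # [',', '-', '.', '/']

# Placing '-' last in the class (or escaping it) makes it a literal hyphen
literal_hyphen = re.findall(r'[.:-]', 'a-b.c/9:')
print(literal_hyphen)  # ['-', '.', ':']
```

As a consequence, the first substitution in `removeSpecialCharacters` does not keep hyphens (the range `.-:` does not contain `-`), which may or may not be intended.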

In [53]:
sentence = 'Testing #999!! on_set numbers.."the game is :/rigged"'
print(sentence, ' _______ ', removeSpecialCharacters(sentence))
sentence = train_df['processed_text'][3]
print(sentence, ' _______ ', removeSpecialCharacters(sentence))
sentence = train_df['processed_text'][0]
print(sentence, ' _______ ', removeSpecialCharacters(sentence))

Testing #999!! on_set numbers.."the game is :/rigged"  _______  Testing 999 ! on set numbers . the game is : / rigged
is this something i will be able to install on my site ? when will you be releasing it ?   _______  is this something i will be able to install on my site ? when will you be releasing it ? 
this is so cool . it is like , would you want your mother to read this ? really great idea , well done !   _______  this is so cool . it is like , would you want your mother to read this ? really great idea , well done ! 


In [54]:
train_df['processed_text'] = train_df['processed_text'].apply(removeSpecialCharacters)

In [55]:
dataset_vocab = build_vocab(list(train_df['processed_text']))
oov = check_coverage(dataset_vocab, embeddings)
oov[:10]

100%|██████████| 1804871/1804871 [00:36<00:00, 48859.49it/s]
100%|██████████| 326864/326864 [00:00<00:00, 969133.99it/s]

Found embeddings for 38.30% of unique words
Found embeddings for 99.41% of all text





[('trudeaus', 5060),
 ('alaskas', 4433),
 ('antifa', 2513),
 ('daca', 2509),
 ('brexit', 1888),
 ('hawaiis', 1880),
 ('siemian', 1870),
 ('sb21', 1852),
 ('theglobeandmail', 1354),
 ('washingtonpost', 1353)]

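This seems to be a good stopping point. We have embeddings for over 99% of all text in the training dataset. Most of the remaining OOV words are proper nouns, possessives that lost their apostrophe in the previous step (e.g. 'trudeaus'), or typos. I'm happy to mark these as unknown.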
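
Marking words as unknown could look roughly like the sketch below: every token without an embedding is mapped to a single shared id. The `<unk>` token name, the id layout (0 for padding, 1 for unknown) and both helper functions are assumptions for illustration; the notebook itself only defines `<pad>`.

```python
def build_word_index(known_words, pad_token='<pad>', unk_token='<unk>'):
    """Assign an integer id to every word; pad and unk get ids 0 and 1."""
    word_to_index = {pad_token: 0, unk_token: 1}
    for word in known_words:
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)
    return word_to_index

def encode(sentence, word_to_index, unk_token='<unk>'):
    """Map whitespace-separated tokens to ids, falling back to the unk id."""
    unk_id = word_to_index[unk_token]
    return [word_to_index.get(token, unk_id) for token in sentence.split()]

# Toy vocabulary standing in for the keys of the embeddings dictionary.
toy_index = build_word_index(['the', 'game', 'is', 'rigged'])
encoded = encode('the game is rigged antifa', toy_index)
print(encoded)  # [2, 3, 4, 5, 1]
```

Passing the keys of `embeddings` as `known_words` would work the same way, since a `<pad>` entry that is already present is simply skipped.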

### NLTK's tokenizers

In [60]:
from nltk.tokenize import word_tokenize
def tokenize(sentence):
    # Join the tokens back into one string because build_vocab splits on whitespace
    return ' '.join(word_tokenize(sentence))

In [None]:
train_df['processed_text_2'] = train_df['comment_text'].apply(tokenize)

In [None]:
dataset_vocab = build_vocab(list(train_df['processed_text_2']))
oov = check_coverage(dataset_vocab, embeddings)
oov[:10]