### Goals

- Reproducible 
- Quantifiable

### Summary

1. #### Don't use standard preprocessing steps like stemming or stopword removal
The reason is simple: you lose valuable information, which lowers performance in identifying signal.
2. #### Track the percentage of vocabulary covered and the percentage of all text covered
3. #### Text cleaning is embedding-specific
The cleaning should be specific to each embedding. For instance, punctuation is present in the GloVe embedding, so I believe it should not be removed.
4. #### GloVe vs. fastText
5. How are numbers treated in embeddings? For example, 89.999 maps to ##.### and 29.4 to ##.#.
6. How is the heart symbol treated?

### Motivation

1. Setting up the preprocessing is part of EDA

Whether a word vector is available for a token (see the remark below for what I mean by token) depends strongly on the preprocessing used by the people who trained the embeddings. Unfortunately, most are not transparent about this point (e.g. whether they used lower casing, removed contractions, replaced words, etc.), so you need to research their GitHub repositories and/or read the related papers. For example, the Google pretrained word vectors replace numbers with "##", and the authors of the GloVe Twitter embeddings did `text = re.sub("<3", '<HEART>', text)`.

Similarly, does King - Man + Woman give King or Queen?

That all leads to the second conclusion:

2. Each pretrained embedding needs its own preprocessing

If people used different preprocessing for training their embeddings, you need to apply the same preprocessing to your own text.

Point 2 in particular can be quite challenging if you want to concatenate embeddings, as in this kernel. Imagine Embedding A preprocesses `"don't"` to a single token `["dont"]` and Embedding B to two tokens `["do","n't"]`. You are basically not able to do both, so you need to find a compromise.
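The two embedding-specific rules mentioned above can be sketched with the standard `re` module. The helper names `mask_digits` and `replace_heart` are hypothetical, and the digit-by-digit masking is an assumption based on the `##.###` example; check each embedding's own preprocessing script before relying on it.

```python
import re

def mask_digits(text):
    # Assumption: every digit becomes '#', mirroring the "89.999 to ##.###" example.
    return re.sub(r"\d", "#", text)

def replace_heart(text):
    # Same substitution as the one quoted above for the GloVe Twitter embeddings.
    return re.sub("<3", "<HEART>", text)

print(mask_digits("89.999 and 29.4"))
print(replace_heart("I <3 this"))
```

The two rules belong to different embeddings, so they are kept as separate functions rather than chained: masking digits first would turn `<3` into `<#` and the heart rule would no longer match.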

#### Get your vocabulary as close to the embeddings as possible
This notebook focuses on how to achieve that. As examples I take the GloVe and fastText (crawl) pretrained embeddings; there is no deeper reason for this choice.
<br><br>

The 3 main contributions of this notebook are the following:

- loading embeddings from pickles
- targeted preprocessing for GloVe and fastText vectors (the main content of this notebook)
- fixing some unknown words

In [1]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import pickle
import numpy as np
from pprint import pprint
import pandas as pd
import os
import time
import logging
import operator
import string
import gc
import random
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import text, sequence
from nltk.tokenize.treebank import TreebankWordTokenizer

Using TensorFlow backend.


In [2]:
os.chdir('/home/swaroop/Downloads/jigsaw-unintended-bias-in-toxicity-classification/')

In [5]:
random.seed(123)
os.environ['PYTHONHASHSEED'] = str(123)
np.random.seed(123)

Use pkl files if possible

In [6]:
CRAWL_EMBEDDING_PATH = 'crawl-300d-2M.pkl'
GLOVE_EMBEDDING_PATH = 'glove.840B.300d.pkl'

We have to adjust the `load_embeddings` function to handle the pickled dict.

In [7]:

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')


def load_embeddings(path):
    with open(path,'rb') as f:
        emb_arr = pickle.load(f)
    return emb_arr

def build_matrix(word_index, path):
    embedding_index = load_embeddings(path)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    unknown_words = []
    
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            unknown_words.append(word)
    return embedding_matrix, unknown_words



In [1]:
def bad_preprocess(data):
    '''
    Most common pre-processing used
    '''
    punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'
    def clean_special_chars(text, punct):
        for p in punct:
            text = text.replace(p, ' ')
        return text

    data = data.astype(str).apply(lambda x: clean_special_chars(x, punct))
    return data

In principle this function just replaces some special characters with spaces. <br>
It is also inefficient: the Keras tokenizer, used later with its default parameters, applies its own filtering, which makes the function above redundant.

In [9]:
train = pd.read_csv('test.csv')

train.columns

# test = pd.read_csv('test.csv')

Index(['id', 'comment_text'], dtype='object')

## Preprocessing

Two important functions which we use throughout this notebook

In [10]:
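def check_coverage(vocab, embeddings_index):
    a = {}    # words that have an embedding, with their vectors
    oov = {}  # out-of-vocabulary words, with their counts
    k = 0     # number of tokens covered by the embeddings
    i = 0     # number of tokens not covered
    for word in vocab:
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # Most frequent out-of-vocabulary words first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x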

def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
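To make the two coverage numbers concrete, here is a small worked example on made-up data. The `coverage` helper and the toy sentences are hypothetical and only mirror what `check_coverage` prints: the share of distinct words with an embedding, and the share of all tokens with an embedding.

```python
def coverage(vocab, embeddings_index):
    """Return (vocab coverage, text coverage) as fractions."""
    known_words = [w for w in vocab if w in embeddings_index]
    known_tokens = sum(vocab[w] for w in known_words)
    total_tokens = sum(vocab.values())
    return len(known_words) / len(vocab), known_tokens / total_tokens

# Toy data, made up for illustration only
sentences = [["the", "cat", "isn't", "here"], ["the", "dog", "is", "here"]]
vocab = {}
for sentence in sentences:
    for word in sentence:
        vocab[word] = vocab.get(word, 0) + 1

# Only the keys matter here, so the "vectors" are dummy values
toy_embeddings = {"the": 0, "cat": 0, "dog": 0, "is": 0, "here": 0}

vocab_cov, text_cov = coverage(vocab, toy_embeddings)
print(f"{vocab_cov:.2%} of vocab, {text_cov:.2%} of all text")
```

Here 5 of 6 distinct words are covered but 7 of 8 tokens are, which is why text coverage is usually much higher than vocabulary coverage: the missing words tend to be rare.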

###### Pkl files or text files for embeddings?

In [11]:
def load_embeddings_test(path):
    with open(path) as f:
        return dict(get_coefs(*line.strip().split(' ')) for line in f)

tic = time.time()
embedding_index = load_embeddings_test('/home/swaroop/Downloads/glove.6B/glove.6B.300d.txt')
print(f'loaded {len(embedding_index)} word vectors in {time.time()-tic}s')

del embedding_index
gc.collect()

loaded 400000 word vectors in 23.357550859451294s


11

In [12]:
tic = time.time()
glove_embeddings = load_embeddings('glove.840B.300d.pkl')
print(f'loaded {len(glove_embeddings)} word vectors in {time.time()-tic}s')

loaded 2196008 word vectors in 6.887946128845215s


###### Loading pkl files is roughly twenty times faster per word vector (about 3 µs vs. 58 µs; note that the two runs load different GloVe files)
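If only the text version of an embedding is available, it can be converted to a pickle once and loaded from there afterwards. The following is a minimal sketch using a tiny made-up file; the function name `text_to_pickle` is hypothetical, and plain Python lists are used instead of NumPy `float32` arrays only to keep the sketch dependency-free.

```python
import os
import pickle
import tempfile

def text_to_pickle(txt_path, pkl_path):
    """Parse a 'word v1 v2 ...' text file and store it as a pickled dict."""
    embeddings = {}
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = [float(v) for v in values]
    with open(pkl_path, "wb") as f:
        pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)
    return len(embeddings)

def load_embeddings(path):
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "toy.txt")
    pkl_path = os.path.join(tmp, "toy.pkl")
    # Two made-up 3-dimensional vectors
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
    n_words = text_to_pickle(txt_path, pkl_path)
    toy = load_embeddings(pkl_path)

print(n_words, toy["queen"])
```

The parsing cost is paid once during conversion; later loads skip the string splitting and float conversion entirely.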

In [13]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())))

In [14]:
crawl_embeddings = load_embeddings('crawl-300d-2M.pkl')
oov = check_coverage(vocab,crawl_embeddings)

Found embeddings for 36.63% of vocab
Found embeddings for  91.44% of all text


In [15]:
oov[:10]

[("Trump's", 1220),
 ("aren't", 1168),
 ("Don't", 1080),
 ("wouldn't", 1067),
 ('Yes,', 997),
 ("wasn't", 931),
 ("Let's", 763),
 ("You're", 752),
 ('So,', 709),
 ("He's", 672)]

In [16]:
oov = check_coverage(vocab,glove_embeddings)

Found embeddings for 32.62% of vocab
Found embeddings for  89.70% of all text


In [17]:
oov[:10]

[("isn't", 2228),
 ("That's", 1974),
 ("won't", 1678),
 ("he's", 1319),
 ("Trump's", 1220),
 ("aren't", 1168),
 ("wouldn't", 1067),
 ('Yes,', 997),
 ("they're", 966),
 ("wasn't", 931)]

In [18]:
print (len(glove_embeddings))
print (len(crawl_embeddings))

2196008
2000000


It seems that `'` and other punctuation directly attached to or inside a word is an issue. <br>
We could simply delete punctuation to fix those words, but there are better methods.<br>


In [19]:
latin_similar = "‚Äô'‚Äò√Ü√ê∆é∆è∆ê∆îƒ≤≈ä≈í·∫û√û«∑»ú√¶√∞«ù…ô…õ…£ƒ≥≈ã≈ìƒ∏≈ø√ü√æ∆ø»ùƒÑ∆Å√áƒê∆äƒòƒ¶ƒÆ∆ò≈Å√ò∆†≈û»ò≈¢»ö≈¶≈≤∆ØYÃ®∆≥ƒÖ…ì√ßƒë…óƒôƒßƒØ∆ô≈Ç√∏∆°≈ü»ô≈£»õ≈ß≈≥∆∞yÃ®∆¥√Å√Ä√Ç√Ñ«çƒÇƒÄ√É√Ö«∫ƒÑ√Ü«º«¢∆ÅƒÜƒäƒàƒå√áƒé·∏åƒê∆ä√ê√â√àƒñ√ä√ãƒöƒîƒíƒò·∫∏∆é∆è∆êƒ†ƒú«¶ƒûƒ¢∆î√°√†√¢√§«éƒÉƒÅ√£√•«ªƒÖ√¶«Ω«£…ìƒáƒãƒâƒç√ßƒè·∏çƒë…ó√∞√©√®ƒó√™√´ƒõƒïƒìƒô·∫π«ù…ô…õƒ°ƒù«ßƒüƒ£…£ƒ§·∏§ƒ¶I√ç√åƒ∞√é√è«èƒ¨ƒ™ƒ®ƒÆ·ªäƒ≤ƒ¥ƒ∂∆òƒπƒª≈ÅƒΩƒø ºN≈ÉNÃà≈á√ë≈Ö≈ä√ì√í√î√ñ«ë≈é≈å√ï≈ê·ªå√ò«æ∆†≈íƒ•·∏•ƒßƒ±√≠√¨i√Æ√Ø«êƒ≠ƒ´ƒ©ƒØ·ªãƒ≥ƒµƒ∑∆ôƒ∏ƒ∫ƒº≈Çƒæ≈Ä≈â≈ÑnÃà≈à√±≈Ü≈ã√≥√≤√¥√∂«í≈è≈ç√µ≈ë·ªç√∏«ø∆°≈ì≈î≈ò≈ñ≈ö≈ú≈†≈û»ò·π¢·∫û≈§≈¢·π¨≈¶√û√ö√ô√õ√ú«ì≈¨≈™≈®≈∞≈Æ≈≤·ª§∆Ø·∫Ç·∫Ä≈¥·∫Ñ«∑√ù·ª≤≈∂≈∏»≤·ª∏∆≥≈π≈ª≈Ω·∫í≈ï≈ô≈ó≈ø≈õ≈ù≈°≈ü»ô·π£√ü≈•≈£·π≠≈ß√æ√∫√π√ª√º«î≈≠≈´≈©≈±≈Ø≈≥·ª•∆∞·∫É·∫Å≈µ·∫Ö∆ø√Ω·ª≥≈∑√ø»≥·ªπ∆¥≈∫≈º≈æ·∫ì"
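# The wider whitelist including latin_similar is overridden by the next assignment,
# so only ASCII letters, digits, the space and the apostrophe are whitelisted.
white_list = string.ascii_letters + string.digits + latin_similar + ' '
white_list = string.ascii_letters + string.digits  + ' '
white_list += "'"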

In [20]:
glove_chars = ''.join([c for c in glove_embeddings if len(c) == 1])
glove_symbols = ''.join([c for c in glove_chars if not c in white_list])
glove_symbols

',.":)(-!?|;$&/[]>%=#*+\\‚Ä¢~@¬£¬∑_{}¬©^¬Æ`<‚Üí¬∞‚Ç¨‚Ñ¢‚Ä∫‚ô•‚Üê√ó¬ß‚Ä≥‚Ä≤√Ç‚ñà¬Ω√†‚Ä¶‚Äú‚òÖ‚Äù‚Äì‚óè√¢‚ñ∫‚àí¬¢¬≤¬¨‚ñë¬°¬∂‚Üë¬±¬ø‚ñæ‚ïê¬¶‚ïë‚Äï¬•‚ñì‚Äî‚Äπ‚îÄ‚ñíÔºö¬º‚äï‚ñº‚ñ™‚Ä†‚ñ†‚Äô‚ñÄ¬®‚ñÑ‚ô´‚òÜ√©¬Ø‚ô¶¬§‚ñ≤√®¬∏¬æ√É‚ãÖ‚Äò‚àû‚àôÔºâ‚Üì„ÄÅ‚îÇÔºà¬ªÔºå‚ô™‚ï©‚ïö¬≥„Éª‚ï¶‚ï£‚ïî‚ïó‚ñ¨‚ù§√Ø√ò¬π‚â§‚Ä°‚àö‚óÑ‚îÅ‚áí‚ñ∂¬∫‚â•‚ïù‚ô°‚óä„ÄÇ‚úà‚â°‚ò∫‚úî‚Üµ‚âà√£‚úì√ê‚ô£‚òé‚ÑÉ‚ó¶√∏‚îî‚Äü√ÖÔΩûÔºÅ‚óã‚óÜ‚Ññ‚ô†‚ñå‚úø‚ñ∏‚ÅÑ‚ñ°√â‚ùñ√≠‚ú¶Ôºé√∑ÔΩú√Ä‚îÉ√•ÔºèÔø•‚ï†‚Ü©‚ú≠‚ñê‚òº¬µ‚òª‚îê√ì‚îú√º¬´√°‚àº‚îå‚Ñâ‚òÆ‡∏ø‚â¶‚ô¨‚úß‚å™Ôºç‚åÇ‚úñÔΩ•‚óï‚Äª‚Äñ‚óÄ‚Ä∞\x97‚Ü∫√¶‚àÜ√ë≈ì‚îò‚î¨‚ï¨ÿå‚åò≈°‚äÇ√é¬™Ôºû‚å©‚éô‚Ñ´Ôºü‚ò†‚áê‚ñ´‚àó‚àà‚â†‚ôÄ√±∆í‚ôîÀö√ß‚Ñó‚îóÔºä‚îº‚ùÄ√§ƒ±ÔºÜ‚à©‚ôÇ‚Äø‚àë‚Ä£‚ûú‚îõ‚áì‚òØ‚äñ‚òÄ‚î≥Ôºõ‚àá‚áë‚ú∞‚óá‚ôØ‚òû¬¥…ô‚Üî‚îèÔΩ°√ü‚óò‚àÇ√õ‚úå‚ô≠√≥‚î£‚î¥‚îì‚ú®√ñ√ÑÀàÀú‚ù•‚î´√ú‚Ñ†‚úí≈æÔºª‚à´\x93‚âßÔºΩ\x94‚àÄ‚ôõ\x96‚à®‚óéÀë√∂‚Üª‚Öì√Ü‚á©Ôºú‚â´‚ú©ÀÜ‚ú™√à‚ôï√ôÿü‚Ç§‚òõ√á‚ïÆ‚êäÔºã‚îà…°ÔºÖ‚ïã‚ñΩ‚á®‚îª√æ‚äó√ÅÔø°‡•§‚ñÇ‚úØ‚ñáÔºø‚û§√¥‚ÇÇ‚úûÔºù‚ñ∑‚ñ≥√û‚óô√Æ‚ñÖ‚úùÔæü√è‚àß‚êâ‚ò≠√∞‚îä‚ïØ‚òæ‚ûî√™‚à¥\x92‚ñÉ‚Ü≥Ôºæ◊≥√∫‚û¢‚ï≠‚û°Ôº†‚äô√¨‚ò¢Àù√î‚

Here we see all symbols that we have an embedding vector for. Interestingly, it is not only a vast amount of punctuation but also emojis and other symbols. <br><br>

Especially when doing sentiment analysis, emojis and other symbols carrying sentiment should not be deleted! <br><br>

What we can delete are symbols we have no embeddings for.

In [21]:
jigsaw_chars = build_vocab(list(train["comment_text"]))
jigsaw_symbols = ''.join([c for c in jigsaw_chars if not c in white_list])
jigsaw_symbols

'.,?\n!&-":()%/>‚Ä¶;√Ø`‚Äú#‚Äù[]<+$‚Äô*‚Äîüê¥=ƒ∫_ÿ¨ŸÖÿßÿπÿ©ŸÑŸÅŸÇÿ±ÿ°√©‚Äò‚Äì√¨Àà ä…í‚òêüê¢~üòÇ{}@üòâ\xad‚òë‚ÇÇ\u200b‚Ä¢üòêüòä‚Äï‚óÑ‚ñ∫ƒì^¬´√†¬ª·¥õ·¥Ä Ä·¥°·¥è·¥ã…™…¥…¢“ì·¥ç ú·¥á·¥ä ô·¥ú·¥Ö è·¥Ñ ü·¥ò·¥†≈ç‚ò†Ô∏è\xa0\t·¥µ\\üòÉüò¶üòáüò™œÄŒøŒªœÖŒºŒ±Œ∏ŒÆœÇüòÄ‚òÆ\u200a‚òí¬ßüò≥‚Ñ¢ùêßùêöùê≠ùê¢ùê®√±üôèüôÇ¬¢ùíëùíìùíÜùíîùíäùíÖùíèùíïùíÇùíñùíàùíâüôÑüòÅüòëüí©üí®√ß√™√°üò•‚ÄëÀå…îÀê í…ë…ô‚òÑ\u2009‚óè¬∞ü§îüò©üòüüò≠üëøƒç¬£√º¬∑ƒÅ\x7f√≠√≥√§√â‚ò∫üéâ¬∫\u202a\u202c¬¥üò∞‚Ç¨\u200e‚òπ‚úîü§°‚©õ‚èñ‚òÅ‚ÑÖ¬Ωüòé√®√Åüëçüá∫üá∏¬º‚ñ∞œÑ·ø¶·æΩ·ºîŒ¥ŒµŒπœÉŒΩ·º°Œ≤Œ∫·Ω∂·Ω∫·ºê·øñŒ¨Œ∑·ø∂œÅŒ≠œâŒæŒ≥œå·Ω≤Œú·ºàŒØ·Ω∞·Ωë·øÜŒ∂·ø∑·ø≥·º∞œáœçœÜ·ºÄœé·Ω∏·Ω¥·Ω°·ºÑ·ΩÄ·Ωê·æ∂·ΩÖ·ºë·ºÖ·Ωñ É åüé≠Ôºéüòà ª‚úåüáæüòîùíé\u2028√∫‚óû‚óù|üòú‚û§¬©„Äã„ÄäüòÜ‚òÄüòòÕû≈´√¢Œîùô•ùô§ùôñùôØùôûùô©ùôÄùôíùô¨üò¢≈ì\u2004‚ñÄ‚ñîüòÆƒÄüòùüëèùíÉùíÑüò¨üëå…°üëéüòÑüíî\x85\x10ùê°ùê©ùê¨ùê∞ùê≤ùêÆùêõùêûùêúùê¶ùêØùê≥ùêîùê±ùüîùüìùêÖùêìùêÉùêò≈°ƒáüòõüíú¬≤·µó ∞‚ù§‚â†ùíêùíóùíôùë∞ùíáùíö

Basically we can delete all symbols we have no embeddings for:

In [22]:
symbols_to_delete = ''.join([c for c in jigsaw_symbols if not c in glove_symbols])
symbols_to_delete

'\nüê¥ÿ¨ŸÖÿßÿπÿ©ŸÑŸÅŸÇÿ±ÿ°üê¢üòÇüòâ\xad\u200büòêüòä·¥õ·¥Ä Ä·¥°·¥è·¥ã…¥…¢“ì·¥ç ú·¥á·¥ä ô·¥ú·¥Ö è·¥Ñ ü·¥ò·¥†Ô∏è\xa0\t·¥µüòÉüò¶üòáüò™ŒøœÖŒÆœÇüòÄ\u200aüò≥ùêßùêöùê≠ùê¢ùê®üôèüôÇùíëùíìùíÜùíîùíäùíÖùíèùíïùíÇùíñùíàùíâüôÑüòÅüòëüí©üí®üò•‚Äë\u2009ü§îüò©üòüüò≠üëø\x7füéâ\u202a\u202cüò∞\u200eü§°‚©õ‚èñüòéüëçüá∫üá∏·ø¶·æΩ·ºîŒπ·º°Œ∫·Ω∂·Ω∫·ºê·øñŒ¨·ø∂Œ≠Œæœå·Ω≤Œú·ºàŒØ·Ω∞·Ωë·øÜŒ∂·ø∑·ø≥·º∞œáœç·ºÄœé·Ω∏·Ω¥·Ω°·ºÑ·ΩÄ·Ωê·æ∂·ΩÖ·ºë·ºÖ·Ωñüé≠üòàüáæüòîùíé\u2028‚óùüòú„Äã„ÄäüòÜüòòÕûùô•ùô§ùôñùôØùôûùô©ùôÄùôíùô¨üò¢\u2004üòÆüòùüëèùíÉùíÑüò¨üëåüëéüòÑüíî\x85\x10ùê°ùê©ùê¨ùê∞ùê≤ùêÆùêõùêûùêúùê¶ùêØùê≥ùêîùê±ùüîùüìùêÖùêìùêÉùêòüòõüíú·µó ∞ùíêùíóùíôùë∞ùíáùíöùíåùíòùíçùë¥üòèüíù\ufeff\u200f–¥–µ—Ä—å–º–æüïçùëπüò£\x08–î—è–π–Ω–øüòçÂçêÂççüïäüêµüèªÂ∞èÂúüË±Ü·ä†·ãé\u2005‚íà‚íâ‚íä‚íã‚íå‚íçüåà‚ñ±üí•ÔøΩ‚Åé‚ÅçÃõ·¥ó·∏µüôàüåéüôåüèΩüåûüçÄüòÖüöÇ·ºïüòßüôÅüèºüò±ü§öüí∞üò°\x81üò≤üéæü§£·Ω†‚Ñê\uf070\uf071\uf03d\uf031\uf02f\uf0

The symbols we want to keep need to be isolated from our words, so let's set up a list of those to isolate.

In [23]:
symbols_to_isolate = ''.join([c for c in jigsaw_symbols if c in glove_symbols])
symbols_to_isolate

'.,?!&-":()%/>‚Ä¶;√Ø`‚Äú#‚Äù[]<+$‚Äô*‚Äî=ƒ∫_√©‚Äò‚Äì√¨Àà ä…í‚òê~{}@‚òë‚ÇÇ‚Ä¢‚Äï‚óÑ‚ñ∫ƒì^¬´√†¬ª…™≈ç‚ò†\\œÄŒªŒºŒ±Œ∏‚òÆ‚òí¬ß‚Ñ¢√±¬¢√ß√™√°Àå…îÀê í…ë…ô‚òÑ‚óè¬∞ƒç¬£√º¬∑ƒÅ√≠√≥√§√â‚ò∫¬∫¬¥‚Ç¨‚òπ‚úî‚òÅ‚ÑÖ¬Ω√®√Å¬º‚ñ∞œÑŒ¥ŒµœÉŒΩŒ≤Œ∑œÅœâŒ≥œÜ É åÔºé ª‚úå√∫‚óû|‚û§¬©‚òÄ≈´√¢Œî≈ì‚ñÄ‚ñîƒÄ…°≈°ƒá¬≤‚ù§‚â†√∂‚Ü©¬°–≤‚Öî√≤–∞‚Öìƒü‚Ñ†‚öæ‚öΩ‚òª√î‚àô√´Ã¥Ã±‚ô•ƒ´√£‚ô°‚ô´‚ô™√ñ√Æ¬¨‚àÜ—Å–∏√¥≈ã‚ïê‚òÖ‚òÜ‚ôÇ‚óï‚ñÜ‚Üí‚óã√∏√ô'
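The split into "delete" and "isolate" can be sketched on toy data with set operations. The embedding keys and corpus below are made up, and sets are used instead of the joined strings from the cells above purely for readability.

```python
import string

white_list = string.ascii_letters + string.digits + " " + "'"

# Made-up stand-ins for the embedding vocabulary and the comment texts
toy_embedding_keys = ["the", "a", ",", "?", "\u2665"]      # '\u2665' is a heart symbol
toy_corpus = ["Wow, really?", "I \u2665 horses \U0001F434\n"]  # '\U0001F434' is a horse emoji

# Single-character keys that are not plain letters or digits
embedding_symbols = {k for k in toy_embedding_keys
                     if len(k) == 1 and k not in white_list}
# Every non-whitelisted character that occurs in the corpus
corpus_symbols = {c for text in toy_corpus for c in text if c not in white_list}

symbols_to_isolate = corpus_symbols & embedding_symbols  # has a vector: keep, but separate
symbols_to_delete = corpus_symbols - embedding_symbols   # no vector: drop

print(sorted(symbols_to_isolate), sorted(symbols_to_delete))
```

The heart symbol survives because the toy embedding has a vector for it, while the horse emoji and the newline are dropped.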

Now comes the next trick: instead of an inefficient loop of `replace` calls, we use `translate`.

In [24]:
isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}


def handle_punctuation(x):
    x = x.translate(remove_dict)
    x = x.translate(isolate_dict)
    return x

In [25]:
handle_punctuation('You look like a horse?  🐴' )

'You look like a horse ?   '
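As a sanity check, the `translate` approach can be compared against the naive `replace` loop on a small example. The symbol sets below are shortened, made-up versions of the real ones, and the two function names are hypothetical.

```python
symbols_to_isolate = ".,?!"
symbols_to_delete = "\n\t\u200b"  # newline, tab, zero-width space

isolate_dict = {ord(c): f" {c} " for c in symbols_to_isolate}
remove_dict = {ord(c): "" for c in symbols_to_delete}

def with_translate(x):
    # One pass per table, regardless of how many symbols it contains
    return x.translate(remove_dict).translate(isolate_dict)

def with_replace_loop(x):
    # One pass over the text per symbol
    for c in symbols_to_delete:
        x = x.replace(c, "")
    for c in symbols_to_isolate:
        x = x.replace(c, f" {c} ")
    return x

sample = "Yes,\u200b really?!\n"
assert with_translate(sample) == with_replace_loop(sample)
print(with_translate(sample).split())
```

Both versions give the same string here; the difference is that `translate` scans the text once per table, while the loop scans it once per symbol, which adds up with hundreds of symbols.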

So let's apply that function to our text and reassess the coverage.

In [26]:
train['comment_text'] = train['comment_text'].apply(lambda x:handle_punctuation(x))

In [27]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())))
oov = check_coverage(vocab,glove_embeddings)
oov[:10]

Found embeddings for 76.75% of vocab
Found embeddings for  98.75% of all text


[("isn't", 2329),
 ("That's", 1997),
 ("won't", 1763),
 ("he's", 1359),
 ("Trump's", 1243),
 ("aren't", 1223),
 ("wouldn't", 1085),
 ("they're", 989),
 ("wasn't", 963),
 ("there's", 836)]

In [28]:
tokenizer = TreebankWordTokenizer()

In [29]:
def handle_contractions(x):
    x = tokenizer.tokenize(x)
    x = ' '.join(x)
    return x
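To see why the Treebank step helps with words like `isn't` or `That's`, here is a rough regex approximation of its contraction splitting. This is only a sketch for illustration, not NLTK's actual rule set; the pattern and the name `split_contractions` are assumptions.

```python
import re

# Assumption: a small, hand-picked list of English contraction suffixes
_CONTRACTION = re.compile(r"\b(\w+?)(n't|'s|'re|'ve|'ll|'d|'m)\b", re.IGNORECASE)

def split_contractions(text):
    # Insert a space between the stem and the contraction suffix
    return _CONTRACTION.sub(r"\1 \2", text)

print(split_contractions("No it won't. That's just wishful thinking"))
```

Splitting `won't` into `wo` and `n't` looks odd, but the coverage numbers that follow suggest that pieces produced this way are largely present in the GloVe vocabulary, whereas the joined forms are not.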

In [30]:
train['comment_text'] = train['comment_text'].apply(lambda x:handle_contractions(x))

In [31]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())),verbose=False)
oov = check_coverage(vocab,glove_embeddings)
oov[:10]

Found embeddings for 81.80% of vocab
Found embeddings for  99.57% of all text


[('Brexit', 136),
 ('tRump', 134),
 ("gov't", 97),
 ('theglobeandmail', 69),
 ("'the", 64),
 ('theguardian', 59),
 ('deplorables', 56),
 ('Drumpf', 52),
 ("'The", 42),
 ("'bout", 40)]

Now the oov words look "normal", apart from those still carrying the `'` token at the beginning of the word. We will need to fix those by hand.

In [32]:
def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x

In [33]:
train['comment_text'] = train['comment_text'].apply(lambda x:fix_quote(x.split()))

In [34]:
train['comment_text'].head()

0    Jeff Sessions is another one of Trump s Orwell...
1    I actually inspected the infrastructure on Gra...
2    No it wo n't . That s just wishful thinking on...
3    Instead of wringing our hands and nibbling the...
4    how many of you commenters have garbage piled ...
Name: comment_text, dtype: object

In [35]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())),verbose=False)
oov = check_coverage(vocab,glove_embeddings)
oov[:50]

Found embeddings for 84.13% of vocab
Found embeddings for  99.66% of all text


[('Brexit', 139),
 ('tRump', 134),
 ("gov't", 97),
 ('theglobeandmail', 69),
 ('theguardian', 59),
 ('deplorables', 58),
 ('Drumpf', 52),
 ("Gov't", 39),
 ('Trumpcare', 38),
 ('garycrum', 37),
 ('Klastri', 29),
 ('SB91', 27),
 ('Trumpism', 24),
 ("y'all", 23),
 ('Saullie', 21),
 ('financialpost', 21),
 ('HIHS', 20),
 ('bigly', 20),
 ('907AK', 19),
 ('klastri', 19),
 ('Trumpian', 19),
 ('Hoopili', 19),
 ('Trumpsters', 18),
 ('Auwe', 18),
 ('l2g', 18),
 ('civilbeat', 17),
 ('shibai', 17),
 ('Trudeaus', 17),
 ('SHOPO', 16),
 ('Donkel', 16),
 ('cashapp24', 16),
 ('Deplorables', 15),
 ('Koncerned', 15),
 ('Meggsy', 15),
 ('BCLibs', 14),
 ('Zupta', 14),
 ('Anbang', 14),
 ('Layla4', 14),
 ('gofundme', 14),
 ('Crapwell', 13),
 ('wiliki', 13),
 ('xbt', 13),
 ('TFWs', 13),
 ('vancouversun', 13),
 ('tRUMP', 13),
 ('SJWs', 12),
 ('TrumpCare', 12),
 ('FakeNews', 12),
 ('MAGAphants', 12),
 ('MUPTE', 12)]

Looks good. Now we can implement the preprocessing functions and train a model. See 

In [36]:
print (glove_embeddings['Schwab'])
print(glove_embeddings['rmd'])
print(glove_embeddings['vanguard'])
print(glove_embeddings['blackrock'])
print(glove_embeddings['ameritrade'])
print(glove_embeddings['SCHW'])

[-0.26362    0.17013   -0.2353     0.23      -0.17085    0.31102
  0.45467    0.24224    0.099638   1.0841     0.3914    -0.58338
  0.23517    0.19334   -0.12729   -0.25476   -0.54572   -0.32191
  0.18551    0.53016    0.010809   0.64877    0.38942    0.027237
 -0.21348    0.44491   -0.3409     0.1347    -0.25356   -0.093586
  0.60238   -0.48313    0.55752    0.20925   -0.091584  -0.1464
 -0.22533   -0.42055   -0.098816  -0.42965   -0.53209   -0.24078
  0.2644     0.15721    0.16086    0.28977   -0.57275   -0.23805
  0.12122   -0.53071   -0.25444   -0.31427   -0.24907    0.28778
  0.62412   -0.15069   -0.66065   -0.37106    0.22876   -0.43941
  0.40445   -0.12752    0.49618    0.13293    0.5845     0.22621
  0.37511    0.090205   0.06014    0.070341   0.53625   -0.077997
  0.41538    0.41096   -0.79239    0.61845    0.43017    0.14255
 -0.11131   -0.14772   -0.009152  -0.12959    0.13898    0.44132
 -0.42023    0.1564    -0.021194  -0.084832  -0.21505   -0.0074104
 -0.020073   0.030071

In [37]:
print(crawl_embeddings['Schwab'])
print(crawl_embeddings['rmd'])
print(crawl_embeddings['vanguard'])
print(crawl_embeddings['blackrock'])
print(crawl_embeddings['ameritrade'])
print(crawl_embeddings['SCHW'])

[ 1.799e-01 -6.090e-02  3.940e-02  1.563e-01 -1.152e-01 -7.056e-01
  3.599e-01 -3.455e-01 -5.410e-01  3.297e-01 -2.468e-01  3.851e-01
 -4.890e-02  3.203e-01 -9.200e-03  4.183e-01  1.799e-01 -1.656e-01
 -1.567e-01  1.897e-01 -1.128e-01  2.519e-01 -7.587e-01  2.604e-01
  1.763e-01  3.257e-01  8.870e-02 -3.518e-01 -6.119e-01  1.431e-01
  4.926e-01  1.298e-01 -2.824e-01  2.730e-01  3.544e-01  1.392e-01
 -3.062e-01  5.110e-02 -1.973e-01 -2.703e-01  9.570e-02 -1.982e-01
 -4.788e-01  6.519e-01  2.258e-01 -4.140e-01 -2.274e-01  3.050e-02
  3.159e-01 -3.560e-01  1.755e-01 -2.430e-01  6.916e-01  4.952e-01
 -2.976e-01  4.380e-01 -2.934e-01  1.485e-01 -2.430e-02 -2.160e-02
 -4.477e-01 -1.428e-01  2.560e-01  8.540e-02  4.970e-02  2.503e-01
  3.531e-01  1.051e-01  8.440e-02  1.166e-01 -3.892e-01  7.640e-02
  1.361e-01 -4.056e-01 -4.945e-01 -1.586e-01 -4.023e-01  2.919e-01
 -3.220e-01 -2.810e-02  4.450e-01 -9.910e-02 -1.145e-01 -2.313e-01
 -4.943e-01 -5.789e-01 -3.248e-01  2.097e-01  2.743e-01 -2.104

Try a lower-case or title-case version of a word if an embedding is not found; this sometimes gives us an embedding.

In [2]:
print(crawl_embeddings['SchwaB'])

NameError: name 'crawl_embeddings' is not defined

In [39]:
# print(glove_embeddings['SchwaB'])

In [40]:
def build_matrix(word_index, embedding_index):
    unknown_words = []
    lower_words = []
    title_words = []
    known_words = []
    
    
    for word, i in word_index.items():
        if i <= max_features:
            try:
                val = embedding_index[word]
                known_words.append(word)
            except KeyError:
                try:
                    val = embedding_index[word.lower()]
                    lower_words.append(word)
                except KeyError:
                    try:
                        val = embedding_index[word.title()]
                        title_words.append(word)
                    except KeyError:
                        unknown_words.append(word)
    return lower_words, title_words, unknown_words, known_words
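The same fallback order (exact match, then lower case, then title case) can be written as a small lookup helper that also returns the vector. This is a minimal sketch on a made-up embedding dict; `lookup_with_fallback` is a hypothetical name, not part of the notebook.

```python
def lookup_with_fallback(word, embedding_index):
    """Return (matched key, vector), or (None, None) if no variant is found."""
    for candidate in (word, word.lower(), word.title()):
        if candidate in embedding_index:
            return candidate, embedding_index[candidate]
    return None, None

# Made-up 2-dimensional vectors for illustration
toy_embeddings = {"Schwab": [0.1, 0.2], "vanguard": [0.3, 0.4]}

for w in ["Schwab", "VANGUARD", "SchwaB", "Drumpf"]:
    print(w, "->", lookup_with_fallback(w, toy_embeddings)[0])
```

Note that the fallback maps differently cased spellings onto the same vector, so any signal carried by the casing itself (for example shouting in all caps) is lost for those words.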

In [41]:
max_features = 400000
tokenizer = text.Tokenizer(num_words = max_features, filters='',lower=False)

In [42]:

tokenizer.fit_on_texts(list(train['comment_text']))

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, crawl_embeddings)
print('n lower words (crawl): ', len(lower_words))
print('n title words (crawl): ', len(title_words))
print('n unknown words (crawl): ', len(unknown_words))
print('n known words (crawl): ', len(known_words))



n lower words (crawl):  843
n title words (crawl):  667
n unknown words (crawl):  14318
n known words (crawl):  84438


In [43]:

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, glove_embeddings)
print('n lower words (glove): ', len(lower_words))
print('n title words (glove): ', len(title_words))
print('n unknown words (glove): ', len(unknown_words))
print('n known words (glove): ', len(known_words))


n lower words (glove):  738
n title words (glove):  572
n unknown words (glove):  14598
n known words (glove):  84358


In [44]:
symbols_to_isolate = '.,?!-;*"‚Ä¶:‚Äî()%#$&_/@Ôºº„Éªœâ+=‚Äù‚Äú[]^‚Äì>\\¬∞<~‚Ä¢‚â†‚Ñ¢Àà ä…í‚àû¬ß{}¬∑œÑŒ±‚ù§‚ò∫…°|¬¢‚ÜíÃ∂`‚ù•‚îÅ‚î£‚î´‚îóÔºØ‚ñ∫‚òÖ¬©‚Äï…™‚úî¬Æ\x96\x92‚óè¬£‚ô•‚û§¬¥¬π‚òï‚âà√∑‚ô°‚óê‚ïë‚ñ¨‚Ä≤…îÀê‚Ç¨€©€û‚Ä†Œº‚úí‚û•‚ïê‚òÜÀå‚óÑ¬Ω ªœÄŒ¥Œ∑ŒªœÉŒµœÅŒΩ É‚ú¨Ôº≥ÔºµÔº∞Ôº•Ôº≤Ôº©Ôº¥‚òª¬±‚ôç¬µ¬∫¬æ‚úì‚óæÿüÔºé‚¨Ö‚ÑÖ¬ª–í–∞–≤‚ù£‚ãÖ¬ø¬¨‚ô´Ôº£Ôº≠Œ≤‚ñà‚ñì‚ñí‚ñë‚áí‚≠ê‚Ä∫¬°‚ÇÇ‚ÇÉ‚ùß‚ñ∞‚ñî‚óû‚ñÄ‚ñÇ‚ñÉ‚ñÑ‚ñÖ‚ñÜ‚ñá‚ÜôŒ≥ÃÑ‚Ä≥‚òπ‚û°¬´œÜ‚Öì‚Äû‚úãÔºö¬•Ã≤ÃÖÃÅ‚àô‚Äõ‚óá‚úè‚ñ∑‚ùì‚ùó¬∂ÀöÀôÔºâ—Å–∏ ø‚ú®„ÄÇ…ë\x80‚óïÔºÅÔºÖ¬Ø‚àíÔ¨ÇÔ¨Å‚ÇÅ¬≤ å¬º‚Å¥‚ÅÑ‚ÇÑ‚å†‚ô≠‚úò‚ï™‚ñ∂‚ò≠‚ú≠‚ô™‚òî‚ò†‚ôÇ‚òÉ‚òé‚úà‚úå‚ú∞‚ùÜ‚òô‚óã‚Ä£‚öìÂπ¥‚àé‚Ñí‚ñ™‚ñô‚òè‚ÖõÔΩÉÔΩÅÔΩì«Ä‚ÑÆ¬∏ÔΩó‚Äö‚àº‚Äñ‚Ñ≥‚ùÑ‚Üê‚òº‚ãÜ í‚äÇ„ÄÅ‚Öî¬®Õ°‡πè‚öæ‚öΩŒ¶√óŒ∏Ôø¶ÔºüÔºà‚ÑÉ‚è©‚òÆ‚ö†Êúà‚úä‚ùå‚≠ï‚ñ∏‚ñ†‚áå‚òê‚òë‚ö°‚òÑ«´‚ï≠‚à©‚ïÆÔºå‰æãÔºû ï…êÃ£Œî‚ÇÄ‚úû‚îà‚ï±‚ï≤‚ñè‚ñï‚îÉ‚ï∞‚ñä‚ñã‚ïØ‚î≥‚îä‚â•‚òí‚Üë‚òù…π‚úÖ‚òõ‚ô©‚òûÔº°Ôº™Ôº¢‚óî‚ó°‚Üì‚ôÄ‚¨ÜÃ±‚Ñè\x91‚†ÄÀ§‚ïö‚Ü∫‚á§‚àè‚úæ‚ó¶‚ô¨¬≥„ÅÆÔΩúÔºè‚àµ‚à¥‚àöŒ©¬§‚òú‚ñ≤‚Ü≥‚ñ´‚Äø‚¨á‚úßÔΩèÔΩñÔΩçÔºçÔºíÔºêÔºòÔºá‚Ä∞‚â§‚àïÀÜ‚öú‚òÅ'
symbols_to_delete = '\nüçï\rüêµüòë\xa0\ue014\t\uf818\uf04a\xadüò¢üê∂Ô∏è\uf0e0üòúüòéüëä\u200b\u200eüòÅÿπÿØŸàŸäŸáÿµŸÇÿ£ŸÜÿßÿÆŸÑŸâÿ®ŸÖÿ∫ÿ±üòçüíñüíµ–ïüëéüòÄüòÇ\u202a\u202cüî•üòÑüèªüí•·¥ç è Ä·¥á…¥·¥Ö·¥è·¥Ä·¥ã ú·¥ú ü·¥õ·¥Ñ·¥ò ô“ì·¥ä·¥°…¢üòãüëè◊©◊ú◊ï◊ù◊ë◊ôüò±‚Äº\x81„Ç®„É≥„Ç∏ÊïÖÈöú\u2009üöå·¥µÕûüåüüòäüò≥üòßüôÄüòêüòï\u200füëçüòÆüòÉüòò◊ê◊¢◊õ◊óüí©üíØ‚õΩüöÑüèº‡Æúüòñ·¥†üö≤‚Äêüòüüòàüí™üôèüéØüåπüòáüíîüò°\x7füëå·ºê·Ω∂ŒÆŒπ·Ω≤Œ∫·ºÄŒØ·øÉ·º¥ŒæüôÑÔº®üò†\ufeff\u2028üòâüò§‚õ∫üôÇ\u3000ÿ™ÿ≠ŸÉÿ≥ÿ©üëÆüíôŸÅÿ≤ÿ∑üòèüçæüéâüòû\u2008üèæüòÖüò≠üëªüò•üòîüòìüèΩüéÜüçªüçΩüé∂üå∫ü§îüò™\x08‚Äëüê∞üêáüê±üôÜüò®üôÉüíïùòäùò¶ùò≥ùò¢ùòµùò∞ùò§ùò∫ùò¥ùò™ùòßùòÆùò£üíóüíöÂú∞ÁçÑË∞∑—É–ª–∫–Ω–ü–æ–ê–ùüêæüêïüòÜ◊îüîóüöΩÊ≠åËàû‰ºéüôàüò¥üèøü§óüá∫üá∏–ºœÖ—Ç—ï‚§µüèÜüéÉüò©\u200aüå†üêüüí´üí∞üíé—ç–ø—Ä–¥\x95üñêüôÖ‚õ≤üç∞ü§êüëÜüôå\u2002üíõüôÅüëÄüôäüôâ\u2004À¢·µí ≥ ∏·¥º·¥∑·¥∫ ∑·µó ∞·µâ·µò\x13üö¨ü§ì\ue602üòµŒ¨ŒøœåœÇŒ≠·Ω∏◊™◊û◊ì◊£◊†◊®◊ö◊¶◊òüòíÕùüÜïüëÖüë•üëÑüîÑüî§üëâüë§üë∂üë≤üîõüéì\uf0b7\uf04c\x9f\x10ÊàêÈÉΩüò£‚è∫üòåü§ëüåèüòØ–µ—Öüò≤·º∏·æ∂·ΩÅüíûüöìüîîüìöüèÄüëê\u202düí§üçá\ue613Â∞èÂúüË±Üüè°‚ùî‚Åâ\u202füë†„Äã‡§ï‡§∞‡•ç‡§Æ‡§æüáπüáºüå∏Ëî°Ëã±Êñáüåûüé≤„É¨„ÇØ„Çµ„ÇπüòõÂ§ñÂõΩ‰∫∫ÂÖ≥Á≥ª–°–±üíãüíÄüéÑüíúü§¢ŸêŸé—å—ã–≥—è‰∏çÊòØ\x9c\x9düóë\u2005üíÉüì£üëø‡ºº„Å§‡ºΩüò∞·∏∑–ó–∑‚ñ±—ÜÔøºü§£ÂçñÊ∏©Âì•ÂçéËÆÆ‰ºö‰∏ãÈôç‰Ω†Â§±ÂéªÊâÄÊúâÁöÑÈí±Âä†ÊãøÂ§ßÂùèÁ®éÈ™óÂ≠êüêù„ÉÑüéÖ\x85üç∫ÿ¢ÿ•ÿ¥ÿ°üéµüåéÕü·ºîÊ≤πÂà´ÂÖãü§°ü§•üò¨ü§ß–π\u2003üöÄü§¥ ≤—à—á–ò–û–†–§–î–Ø–ú—é–∂üòùüñë·Ωê·ΩªœçÁâπÊÆä‰ΩúÊà¶Áæ§—âüí®ÂúÜÊòéÂõ≠◊ß‚Ñêüèàüò∫üåç‚èè·ªáüçîüêÆüçÅüçÜüçëüåÆüåØü§¶\u200dùìíùì≤ùìøùìµÏïàÏòÅÌïòÏÑ∏Ïöî–ñ—ô–ö—õüçÄüò´ü§§·ø¶ÊàëÂá∫ÁîüÂú®‰∫ÜÂèØ‰ª•ËØ¥ÊôÆÈÄöËØùÊ±âËØ≠Â•ΩÊûÅüéºüï∫üç∏ü•ÇüóΩüéáüéäüÜòü§†üë©üñíüö™Â§©‰∏ÄÂÆ∂‚ö≤\u2006‚ö≠‚öÜ‚¨≠‚¨Ø‚èñÊñ∞‚úÄ‚ïåüá´üá∑üá©üá™üáÆüá¨üáßüò∑üá®üá¶–•–®üåê\x1fÊùÄÈ∏°ÁªôÁå¥Áúã 
Åùó™ùóµùó≤ùóªùòÜùóºùòÇùóøùóÆùóπùó∂ùòáùóØùòÅùó∞ùòÄùòÖùóΩùòÑùó±üì∫œñ\u2000“Ø’Ω·¥¶·é•“ªÕ∫\u2007’∞\u2001…©ÔΩôÔΩÖ‡µ¶ÔΩå∆ΩÔΩàùêìùê°ùêûùê´ùêÆùêùùêöùêÉùêúùê©ùê≠ùê¢ùê®ùêß∆Ñ·¥®◊ü·ëØ‡ªêŒ§·èß‡Ø¶–Ü·¥ë‹Åùê¨ùê∞ùê≤ùêõùê¶ùêØùêëùêôùê£ùêáùêÇùêòùüé‘ú–¢·óû‡±¶„Äî·é´ùê≥ùêîùê±ùüîùüìùêÖüêãÔ¨Éüíòüíì—ëùò•ùòØùò∂üíêüåãüåÑüåÖùô¨ùôñùô®ùô§ùô£ùô°ùôÆùôòùô†ùôöùôôùôúùôßùô•ùô©ùô™ùôóùôûùôùùôõüë∫üê∑‚ÑãùêÄùê•ùê™üö∂ùô¢·ºπü§òÕ¶üí∏ÿ¨Ìå®Ìã∞Ôº∑ùôá·µªüëÇüëÉ…úüé´\uf0a7–ë–£—ñüö¢üöÇ‡™ó‡´Å‡™ú‡™∞‡™æ‡™§‡´Ä·øÜüèÉùì¨ùìªùì¥ùìÆùìΩùìº‚òòÔ¥æÃØÔ¥ø‚ÇΩ\ue807ùëªùíÜùíçùíïùíâùíìùíñùíÇùíèùíÖùíîùíéùíóùíäüëΩüòô\u200c–õ‚Äíüéæüëπ‚éåüèí‚õ∏ÂÖ¨ÂØìÂÖªÂÆ†Áâ©ÂêóüèÑüêÄüöëü§∑ÊìçÁæéùíëùíöùíêùë¥ü§ôüêíÊ¨¢ËøéÊù•Âà∞ÈòøÊãâÊñØ◊°◊§ùô´üêàùíåùôäùô≠ùôÜùôãùôçùòºùôÖÔ∑ªü¶ÑÂ∑®Êî∂Ëµ¢ÂæóÁôΩÈ¨ºÊÑ§ÊÄíË¶Å‰π∞È¢ù·∫Ωüöóüê≥ùüèùêüùüñùüëùüïùíÑùüóùê†ùôÑùôÉüëáÈîüÊñ§Êã∑ùó¢ùü≥ùü±ùü¨‚¶Å„Éû„É´„Éè„Éã„ÉÅ„É≠Ê†™ÂºèÁ§æ‚õ∑ÌïúÍµ≠Ïñ¥„Ñ∏„ÖìÎãàÕú 
ñùòøùôî‚Çµùí©‚ÑØùíæùìÅùí∂ùìâùìáùìäùìÉùìàùìÖ‚Ñ¥ùíªùíΩùìÄùìåùí∏ùìéùôèŒ∂ùôüùòÉùó∫ùüÆùü≠ùüØùü≤üëãü¶äÂ§ö‰º¶üêΩüéªüéπ‚õìüèπüç∑ü¶Ü‰∏∫Âíå‰∏≠ÂèãË∞äÁ•ùË¥∫‰∏éÂÖ∂ÊÉ≥Ë±°ÂØπÊ≥ïÂ¶ÇÁõ¥Êé•ÈóÆÁî®Ëá™Â∑±ÁåúÊú¨‰º†ÊïôÂ£´Ê≤°ÁßØÂîØËÆ§ËØÜÂü∫Áù£ÂæíÊõæÁªèËÆ©Áõ∏‰ø°ËÄ∂Á®£Â§çÊ¥ªÊ≠ªÊÄ™‰ªñ‰ΩÜÂΩì‰ª¨ËÅä‰∫õÊîøÊ≤ªÈ¢òÊó∂ÂÄôÊàòËÉúÂõ†Âú£ÊääÂÖ®Â†ÇÁªìÂ©öÂ≠©ÊÅêÊÉß‰∏îÊ†óË∞ìËøôÊ†∑Ëøò‚ôæüé∏ü§ïü§í‚õëüéÅÊâπÂà§Ê£ÄËÆ®üèùü¶Åüôãüò∂Ï•êÏä§ÌÉ±Ìä∏Î§ºÎèÑÏÑùÏú†Í∞ÄÍ≤©Ïù∏ÏÉÅÏù¥Í≤ΩÏ†úÌô©ÏùÑÎ†µÍ≤åÎßåÎì§ÏßÄÏïäÎ°ùÏûòÍ¥ÄÎ¶¨Ìï¥ÏïºÌï©Îã§Ï∫êÎÇòÏóêÏÑúÎåÄÎßàÏ¥àÏôÄÌôîÏïΩÍ∏àÏùòÌíàÎü∞ÏÑ±Î∂ÑÍ∞àÎïåÎäîÎ∞òÎìúÏãúÌóàÎêúÏÇ¨Ïö©üî´üëÅÂá∏·Ω∞üí≤üóØùôà·ºåùíáùíàùíòùíÉùë¨ùë∂ùïæùñôùñóùñÜùñéùñåùñçùñïùñäùñîùñëùñâùñìùñêùñúùñûùñöùñáùïøùñòùñÑùñõùñíùñãùñÇùï¥ùñüùñàùï∏üëëüöøüí°Áü•ÂΩºÁôæ\uf005ùôÄùíõùë≤ùë≥ùëæùíãùüíüò¶ùôíùòæùòΩüèêùò©ùò®·Ωº·πëùë±ùëπùë´ùëµùë™üá∞üáµüëæ·ìá·íß·î≠·êÉ·êß·ê¶·ë≥·ê®·ìÉ·ìÇ·ë≤·ê∏·ë≠·ëé·ìÄ·ê£üêÑüéàüî®üêéü§ûüê∏üíüüé∞üåùüõ≥ÁÇπÂáªÊü•Áâàüç≠ùë•ùë¶ùëßÔºÆÔºßüë£\uf020„Å£üèâ—Ñüí≠üé•Œûüê¥üë®ü§≥ü¶ç\x0büç©ùëØùííüòóùüêüèÇüë≥üçóüïâüê≤⁄Ü€åùëÆùóïùó¥üçíÍú•‚≤£‚≤èüêë‚è∞ÈâÑ„É™‰∫ã‰ª∂—óüíä„Äå„Äç\uf203\uf09a\uf222\ue608\uf202\uf099\uf469\ue607\uf410\ue600ÁáªË£Ω„Ç∑ËôöÂÅΩÂ±ÅÁêÜÂ±à–ìùë©ùë∞ùíÄùë∫üå§ùó≥ùóúùóôùó¶ùóßüçä·Ω∫·ºà·º°œá·øñŒõ‚§èüá≥ùíôœà’Å’¥’•’º’°’µ’´’∂÷Ä÷Ç’§’±ÂÜ¨Ëá≥·ΩÄùíÅüîπü§öüçéùë∑üêÇüíÖùò¨ùò±ùò∏ùò∑ùòêùò≠ùòìùòñùòπùò≤ùò´⁄©Œíœéüí¢ŒúŒüŒùŒëŒïüá±‚ô≤ùùà‚Ü¥üíí‚äò»ªüö¥üñïüñ§ü•òüìçüëà‚ûïüö´üé®üåëüêªùêéùêçùêäùë≠ü§ñüééüòºüï∑ÔΩáÔΩíÔΩéÔΩîÔΩâÔΩÑÔΩïÔΩÜÔΩÇÔΩãùü∞üá¥üá≠üáªüá≤ùóûùó≠ùóòùó§üëºüìâüçüüç¶üåàüî≠„Ääüêäüêç\uf10a·Éö⁄°üê¶\U0001f92f\U0001f92aüê°üí≥·º±üôáùó∏ùóüùó†ùó∑ü•ú„Åï„Çà„ÅÜ„Å™„Çâüîº'


from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()


isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}


def handle_punctuation(x):
    x = x.translate(remove_dict)
    x = x.translate(isolate_dict)
    return x

def handle_contractions(x):
    x = tokenizer.tokenize(x)
    return x

def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x

def preprocess(x):
    x = handle_punctuation(x)
    x = handle_contractions(x)
    x = fix_quote(x)
    return x

train['comment_text'] = train['comment_text'].apply(lambda x:preprocess(x))


In [45]:
tokenizer = text.Tokenizer(num_words = max_features, filters='',lower=False)
tokenizer.fit_on_texts(list(train['comment_text']))


In [46]:

tokenizer.fit_on_texts(list(train['comment_text']))

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, crawl_embeddings)
print('n lower words (crawl): ', len(lower_words))
print('n title words (crawl): ', len(title_words))
print('n unknown words (crawl): ', len(unknown_words))
print('n known words (crawl): ', len(known_words))



n lower words (crawl):  842
n title words (crawl):  667
n unknown words (crawl):  14287
n known words (crawl):  84412


In [47]:

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, glove_embeddings)
print('n lower words (glove): ', len(lower_words))
print('n title words (glove): ', len(title_words))
print('n unknown words (glove): ', len(unknown_words))
print('n known words (glove): ', len(known_words))


n lower words (glove):  738
n title words (glove):  572
n unknown words (glove):  14540
n known words (glove):  84358


In [48]:
set(lower_words)

{'122MM',
 '17Million',
 '20End',
 '20FINAL',
 '20Letter',
 '250Million',
 '25YRS',
 '2Cm',
 '2Cv',
 '3MILLION',
 '3Wd',
 '40Years',
 '7Million',
 'AAAhhh',
 'AAaaa',
 'ACCESIBLE',
 'ACIDIFICATION',
 'ADVOCATED',
 'AFterall',
 'ALASKAS',
 "ALI'I",
 'ALLUDE',
 'ALLof',
 'ALaskans',
 'AMORIS',
 'ANAE',
 'APOLLOS',
 'APPEASER',
 'APr',
 'ARRET',
 'ASo',
 'ATTENDENTS',
 'Accelerants',
 'Actially',
 'Adance',
 'Afirmative',
 'Agirl',
 'AirBnb',
 'AlASKANS',
 'AlCan',
 'Aliado',
 'AllenE',
 'Allyou',
 'AmIright',
 'AmericaNo',
 'Ampla',
 'And2',
 'AntiFA',
 'AntiFa',
 'Antivaxxers',
 'Anyay',
 'ArmyMan',
 'AtheO',
 'Atrack',
 'Atually',
 'AustOn',
 'Ayaa',
 'BAKKA',
 'BANISHMENT',
 'BArry',
 'BEEJEEZUS',
 'BILLON',
 'BIden',
 'BLASE',
 'BLINDINGLY',
 'BOff',
 'BRAINIER',
 'BRIBING',
 'BRv',
 'BUTHe',
 'BaHaHaHa',
 'Backslapping',
 'BanKRupt',
 'Barbarianism',
 'BeIn',
 'Befuddles',
 'Billlion',
 'BizJournal',
 'Blablablah',
 'Blahhhhh',
 'BoRs',
 'BradFord',
 'Buccaneering',
 'Buggah',
 'Bul