### Goals

- Reproducible 
- Quantifiable

### Summary

1. #### Don't use standard preprocessing steps like stemming or stopword removal
The reason is simple: you lose valuable information, which lowers performance in identifying signal.
2. #### Track both the percentage of the vocabulary covered and the percentage of all text covered
3. #### Text cleaning is embedding-specific
The cleaning should be specific to each embedding. For instance, punctuation is present in the GloVe embedding, so I believe it should not be removed.
4. #### GloVe vs. fastText
5. How are numbers treated in embeddings? For example, 89.999 becomes ##.### and 29.4 becomes ##.#
6. How is the heart symbol treated?
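
Items 5 and 6 can be made concrete with a small sketch. The `<3` replacement is the one quoted for the GloVe Twitter embeddings in the Motivation section below. `mask_digits` and `replace_heart` are hypothetical helper names, and replacing every digit with `#` is an assumption that merely reproduces the two number examples above; it is not a verified description of any embedding's training pipeline.

```python
import re

def mask_digits(text):
    """Replace every digit with '#', keeping separators such as '.' (assumed rule)."""
    return re.sub(r"\d", "#", text)

def replace_heart(text):
    """Map the '<3' emoticon to a '<HEART>' placeholder token."""
    return re.sub("<3", "<HEART>", text)

print(mask_digits("89.999"))
print(mask_digits("29.4"))
print(replace_heart("I <3 this"))
```

Whatever rule an embedding really used has to be looked up in its repository or paper; the point is only that the same rule must be applied to your own text before the lookup.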

### Motivation

1. Setting up the preprocessing is EDA

Whether a word vector is available for a token (see remark below for what I mean by token) depends strongly on the preprocessing used by the people who trained the embeddings. Unfortunately, most are not very transparent about this point (e.g. whether they lowercased, removed contractions, replaced words, etc.), so you need to research their GitHub repositories and/or read the related papers. For example, the Google pretrained word vectors replace numbers with "##", and the people who trained the GloVe Twitter embeddings did `text = re.sub("<3", '<HEART>', text)`.

Similarly: King - Man + Woman = King/Queen?

That all leads to the second conclusion:

2. Each pretrained embedding needs its own preprocessing

If people used different preprocessing when training their embeddings, you need to apply that same preprocessing for each of them.

Point two especially can be quite challenging if you want to concatenate embeddings, as in this kernel. Imagine embedding A preprocesses `"don't"` to a single token `["dont"]` and embedding B to two tokens `["do", "n't"]`. You basically cannot do both, so you need to find a compromise.

#### Get your vocabulary as close to the embeddings as possible
In this notebook I focus on how to achieve that. As examples I take the pickled GloVe and fastText (crawl) embeddings loaded below; there is no deeper reason for this choice.
<br><br>

The 3 main contributions of this notebook are the following:

- loading embedding from pickles 
- aimed preprocessing for GloVe and fasttext vectors (the main content of this notebook)
- fixing some unknown words

In [1]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import pickle
import numpy as np
from pprint import pprint
import pandas as pd
import os
import time
import logging
import operator
import string
import gc
import random
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import text, sequence
from nltk.tokenize.treebank import TreebankWordTokenizer

Using TensorFlow backend.


In [2]:
os.chdir('/home/swaroop/Downloads/jigsaw-unintended-bias-in-toxicity-classification/')

In [5]:
random.seed(123)
os.environ['PYTHONHASHSEED'] = str(123)
np.random.seed(123)

Use pkl files if possible

In [6]:
CRAWL_EMBEDDING_PATH = 'crawl-300d-2M.pkl'
GLOVE_EMBEDDING_PATH = 'glove.840B.300d.pkl'

We have to adjust the `load_embeddings` function to handle the pickled dict.

In [7]:

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')


def load_embeddings(path):
    with open(path,'rb') as f:
        emb_arr = pickle.load(f)
    return emb_arr

def build_matrix(word_index, path):
    embedding_index = load_embeddings(path)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    unknown_words = []
    
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            unknown_words.append(word)
    return embedding_matrix, unknown_words



In [1]:
def bad_preprocess(data):
    '''
    Most common pre-processing used
    '''
    punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'
    def clean_special_chars(text, punct):
        for p in punct:
            text = text.replace(p, ' ')
        return text

    data = data.astype(str).apply(lambda x: clean_special_chars(x, punct))
    return data

In principle this function just replaces some special characters with spaces. <br>
It is also inefficient: the Keras tokenizer is applied later with its default parameters, and its own filtering is redundant with the function above.

In [9]:
train = pd.read_csv('test.csv')

train.columns

# test = pd.read_csv('test.csv')

Index(['id', 'comment_text'], dtype='object')

## Preprocessing

Two important functions which we use throughout this notebook

In [10]:
def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in vocab:
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            # Word has no embedding: record it and its count as out-of-vocabulary
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
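
The two percentages printed by `check_coverage` answer different questions: the first counts distinct words, the second weights each word by how often it occurs. A minimal sketch on made-up data shows the difference. The functions below are compact re-implementations for illustration only, and `coverage`, `toy_embeddings` and the sentences are assumptions, not part of the notebook.

```python
from collections import Counter

def build_vocab(sentences):
    """Count how often each token occurs in a list of tokenised sentences."""
    return Counter(word for sentence in sentences for word in sentence)

def coverage(vocab, embeddings_index):
    """Return (share of distinct words covered, share of all tokens covered)."""
    known = {w: n for w, n in vocab.items() if w in embeddings_index}
    vocab_share = len(known) / len(vocab)
    text_share = sum(known.values()) / sum(vocab.values())
    return vocab_share, text_share

sentences = [["the", "cat", "sat"], ["the", "dog", "isn't", "here"]]
toy_embeddings = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "here": 4}  # made-up index

vocab = build_vocab(sentences)
vocab_share, text_share = coverage(vocab, toy_embeddings)
print(f"{vocab_share:.2%} of vocab, {text_share:.2%} of all text")
```

Here 5 of 6 distinct words are covered, but 6 of 7 tokens, because the frequent word "the" counts twice. This is why text coverage is usually much higher than vocabulary coverage.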

###### Pkl files or text files for embeddings?

In [11]:
def load_embeddings_test(path):
    with open(path, encoding='utf8') as f:
        return dict(get_coefs(*line.strip().split(' ')) for line in f)

tic = time.time()
embedding_index = load_embeddings_test('/home/swaroop/Downloads/glove.6B/glove.6B.300d.txt')
print(f'loaded {len(embedding_index)} word vectors in {time.time()-tic}s')

del embedding_index
gc.collect()

loaded 400000 word vectors in 23.357550859451294s


11

In [12]:
tic = time.time()
glove_embeddings = load_embeddings('glove.840B.300d.pkl')
print(f'loaded {len(glove_embeddings)} word vectors in {time.time()-tic}s')

loaded 2196008 word vectors in 6.887946128845215s


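###### Loading pkl files is close to twenty times faster per vector

The text file took 23.4 s for 400,000 vectors (about 58 µs each), while the pickle took 6.9 s for 2,196,008 vectors (about 3 µs each). Note that the two runs load different GloVe files, so this is a per-vector comparison.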
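
How the `.pkl` files used here were produced is not shown in this notebook. One way such a file could be created is sketched below, under these assumptions: the text file has one `word v1 v2 ...` entry per line, `text_to_pickle` and `load_pickle` are hypothetical helper names, and plain tuples of floats stand in for the `float32` NumPy arrays that `get_coefs` builds.

```python
import os
import pickle
import tempfile

def text_to_pickle(txt_path, pkl_path):
    """Parse 'word v1 v2 ...' lines once and store the resulting dict as a pickle."""
    index = {}
    with open(txt_path, encoding="utf8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            index[word] = tuple(float(v) for v in values)
    with open(pkl_path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    return index

def load_pickle(pkl_path):
    """Load the dict back without re-parsing any text."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)

# Tiny made-up file standing in for a real embedding text file.
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "toy.txt")
pkl_path = os.path.join(tmp_dir, "toy.pkl")
with open(txt_path, "w", encoding="utf8") as f:
    f.write("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

original = text_to_pickle(txt_path, pkl_path)
restored = load_pickle(pkl_path)
print(restored["hello"])
```

The parsing cost is paid once at conversion time; every later load only deserialises the dict.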

In [13]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())))

In [14]:
crawl_embeddings = load_embeddings('crawl-300d-2M.pkl')
oov = check_coverage(vocab,crawl_embeddings)

Found embeddings for 36.63% of vocab
Found embeddings for  91.44% of all text


In [15]:
oov[:10]

[("Trump's", 1220),
 ("aren't", 1168),
 ("Don't", 1080),
 ("wouldn't", 1067),
 ('Yes,', 997),
 ("wasn't", 931),
 ("Let's", 763),
 ("You're", 752),
 ('So,', 709),
 ("He's", 672)]

In [16]:
oov = check_coverage(vocab,glove_embeddings)

Found embeddings for 32.62% of vocab
Found embeddings for  89.70% of all text


In [17]:
oov[:10]

[("isn't", 2228),
 ("That's", 1974),
 ("won't", 1678),
 ("he's", 1319),
 ("Trump's", 1220),
 ("aren't", 1168),
 ("wouldn't", 1067),
 ('Yes,', 997),
 ("they're", 966),
 ("wasn't", 931)]

In [18]:
print (len(glove_embeddings))
print (len(crawl_embeddings))

2196008
2000000


It seems that `'` and other punctuation attached to or inside a word is an issue. <br>
We could simply delete punctuation to fix those words, but there are better methods.<br>


In [19]:
latin_similar = "’'‘ÆÐƎƏƐƔĲŊŒẞÞǷȜæðǝəɛɣĳŋœĸſßþƿȝĄƁÇĐƊĘĦĮƘŁØƠŞȘŢȚŦŲƯY̨Ƴąɓçđɗęħįƙłøơşșţțŧųưy̨ƴÁÀÂÄǍĂĀÃÅǺĄÆǼǢƁĆĊĈČÇĎḌĐƊÐÉÈĖÊËĚĔĒĘẸƎƏƐĠĜǦĞĢƔáàâäǎăāãåǻąæǽǣɓćċĉčçďḍđɗðéèėêëěĕēęẹǝəɛġĝǧğģɣĤḤĦIÍÌİÎÏǏĬĪĨĮỊĲĴĶƘĹĻŁĽĿʼNŃN̈ŇÑŅŊÓÒÔÖǑŎŌÕŐỌØǾƠŒĥḥħıíìiîïǐĭīĩįịĳĵķƙĸĺļłľŀŉńn̈ňñņŋóòôöǒŏōõőọøǿơœŔŘŖŚŜŠŞȘṢẞŤŢṬŦÞÚÙÛÜǓŬŪŨŰŮŲỤƯẂẀŴẄǷÝỲŶŸȲỸƳŹŻŽẒŕřŗſśŝšşșṣßťţṭŧþúùûüǔŭūũűůųụưẃẁŵẅƿýỳŷÿȳỹƴźżžẓ"
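# A whitelist that also keeps the Latin-like characters above would be:
# white_list = string.ascii_letters + string.digits + latin_similar + ' '
# The version actually used keeps only ASCII letters, digits, space and the apostrophe.
white_list = string.ascii_letters + string.digits + ' '
white_list += "'"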

In [20]:
glove_chars = ''.join([c for c in glove_embeddings if len(c) == 1])
glove_symbols = ''.join([c for c in glove_chars if not c in white_list])
glove_symbols

',.":)(-!?|;$&/[]>%=#*+\\•~@£·_{}©^®`<→°€™›♥←×§″′Â█½à…“★”–●â►−¢²¬░¡¶↑±¿▾═¦║―¥▓—‹─▒：¼⊕▼▪†■’▀¨▄♫☆é¯♦¤▲è¸¾Ã⋅‘∞∙）↓、│（»，♪╩╚³・╦╣╔╗▬❤ïØ¹≤‡√◄━⇒▶º≥╝♡◊。✈≡☺✔↵≈ã✓Ð♣☎℃◦ø└‟Å～！○◆№♠▌✿▸⁄□É❖í✦．÷｜À┃å／￥╠↩✭▐☼µ☻┐Ó├ü«á∼┌℉☮฿≦♬✧〉－⌂✖･◕※‖◀‰\x97↺æ∆Ñœ┘┬╬،⌘š⊂Îª＞〈⎙Å？☠⇐▫∗∈≠♀ñƒ♔˚ç℗┗＊┼❀äı＆∩♂‿∑‣➜┛⇓☯⊖☀┳；∇⇑✰◇♯☞´ə↔┏｡ß◘∂Û✌♭ó┣┴┓✨ÖÄˈ˜❥┫Ü℠✒ž［∫\x93≧］\x94∀♛\x96∨◎ˑö↻⅓Æ⇩＜≫✩ˆ✪È♕Ù؟₤☛Ç╮␊＋┈ɡ％╋▽⇨┻þ⊗Á￡।▂✯▇＿➤ô₂✞＝▷△Þ◙î▅✝ﾟÏ∧␉☭ð┊╯☾➔ê∴\x92▃↳＾׳ú➢╭➡＠⊙ì☢˝Ô⅛∏ā„①๑∥❝Š☐▆Ÿûý╱⋙๏☁⇔▔\x91②➚◡Ê╰Ì٠ë♢Ý˙۞✘✮☑⋆ℓⓘ❒☣✉ē⌊➠∣❑⅔◢Òòⓒ\x80〒Í∕▮⦿✫✚⋯♩☂ˌ❞‗č܂☜ī‾✜╲ù∘⟩ō＼⟨·⅜✗Ă♚∅Ëⓔ◣͡‛❦⑨③◠✄❄１∃␣≪｢≅◯☽２İ∎｣⁰❧̅ÿǡⒶ↘⚓▣˘∪⇢Ú✍ɛ⊥＃⅝⎯õ↠۩☰Õ◥⊆✽ﬁ⚡↪ở❁☹ł◼☃◤❏Žⓢ⊱α➝̣✡∠｀▴┤Ȃ∝♏ⓐ✎;３④␤＇❣⅞✂✤ⓞ☪✴⌒˛♒＄ɪ✶▻Ⓔ◌◈۲Ʈ❚ʿ❂￦◉╜̃ťν✱╖❉₃ⓡℝ٤↗❶ʡ۰ˇⓣ♻➽۶₁ʃ׀✲Đʤ✬☉▉≒☥⌐♨✕ⓝ⊰❘＂⇧̵➪４▁βđ۱▏⊃ⓛ‚♰́✏⏑Œ̶٩Ⓢー⩾日￠❍≃⋰♋ɿ､̂ǿ❋✳ⓤ╤▕⌣✸℮⁺▨⑤╨Ⓥ♈❃☝Ā５✻⊇≻♘♞◂７✟Łū⌠✠☚✥ŋ❊ƂⒸŮ⌈❅Ⓡ♧Ⓞɑλ۵▭❱Ⓣ∟☕♺∵⍝ⓑɔ✵ŕ✣ℤ年ℕ٭♆Ⓘⅆ∶⚜◞்✹Ǥȡ➥ᴥ↕ɂ̳∷✋į➧∋̿ͧʘ┅⥤⬆ǀμ₄⋱ʔ☄↖⋮۔♌Ⓛ╕♓ـ⁴❯♍▋ă✺⭐６✾♊➣▿Ⓑ♉Ａ⏠◾▹⑥⩽в↦ż╥⍵⌋։➨и∮⇥ⓗⒹ⁻ʊć⎝⌥⌉◔◑ǂ✼♎ℂ♐╪ɨ⊚☒⇤θВⓜ⎠Ｏ◐ǰ⚠╞ﬂş◗⎕ⓨ☟Ｉⓟ♟❈↬ⓓŞ◻♮❙а♤∉؛⁂例ČⓃ־♑╫╓╳⬅☔πɒɹ߂Ō☸ɐʻ┄╧ʌ׃８ʒ⎢ġ❆⋄⚫ħ̏☏➞͂␙Ⓤ◟Ƥąʕ̊Ȥ⚐✙は↙̾ωΔ℘ﾞ✷⑦φ⍺❌⊢▵✅ｗ９ⓖ☨▰ʹŢ╡Ⓜő☤∽╘Ű˹↨ȿ♙⬇♱ś⌡Ω⠀╛❕┉Ⓟ̀Ǩ♖ⓚ┆⑧⎜Śǹ◜⚾⤴✇╟⎛☩➲➟ⓥⒽŘ⏝Ŀ◃０₀╢月↯✆ĶĢ˃⍴Ĥ❇ũ⚽╒Ｃɻɤ̸ʼ♜☓Ｔ➳⇄γ☬⚑✐⁵δȭ⌃◅▢ｓȸ❐ě

Here we see all symbols that we have an embedding vector for. Interestingly, it is not only a vast amount of punctuation but also emojis and other symbols. <br><br>

Especially when doing sentiment analysis, emojis and other symbols carrying sentiment should not be deleted! <br><br>

What we can delete are the symbols we have no embeddings for.

In [21]:
jigsaw_chars = build_vocab(list(train["comment_text"]))
jigsaw_symbols = ''.join([c for c in jigsaw_chars if not c in white_list])
jigsaw_symbols

'.,?\n!&-":()%/>…;ï`“#”[]<+$’*—🐴=ĺ_جماعةلفقرءé‘–ìˈʊɒ☐🐢~😂{}@😉\xad☑₂\u200b•😐😊―◄►ē^«à»ᴛᴀʀᴡᴏᴋɪɴɢғᴍʜᴇᴊʙᴜᴅʏᴄʟᴘᴠō☠️\xa0\tᴵ\\😃😦😇😪πολυμαθής😀☮\u200a☒§😳™𝐧𝐚𝐭𝐢𝐨ñ🙏🙂¢𝒑𝒓𝒆𝒔𝒊𝒅𝒏𝒕𝒂𝒖𝒈𝒉🙄😁😑💩💨çêá😥‑ˌɔːʒɑə☄\u2009●°🤔😩😟😭👿č£ü·ā\x7fíóäÉ☺🎉º\u202a\u202c´😰€\u200e☹✔🤡⩛⏖☁℅½😎èÁ👍🇺🇸¼▰τῦ᾽ἔδεισνἡβκὶὺἐῖάηῶρέωξγόὲΜἈίὰὑῆζῷῳἰχύφἀώὸὴὡἄὀὐᾶὅἑἅὖʃʌ🎭．😈ʻ✌🇾😔𝒎\u2028ú◞◝|😜➤©》《😆☀😘͞ūâΔ𝙥𝙤𝙖𝙯𝙞𝙩𝙀𝙒𝙬😢œ\u2004▀▔😮Ā😝👏𝒃𝒄😬👌ɡ👎😄💔\x85\x10𝐡𝐩𝐬𝐰𝐲𝐮𝐛𝐞𝐜𝐦𝐯𝐳𝐔𝐱𝟔𝟓𝐅𝐓𝐃𝐘šć😛💜²ᵗʰ❤≠𝒐𝒗𝒙𝑰𝒇𝒚𝒌𝒘𝒍𝑴😏💝\ufeffö\u200fдерьмо🕍𝑹😣↩\x08¡Двяйнп😍卐卍🕊⅔🐵🏻ò小土豆አዎа\u2005⒈⒉⒊⒋⒌⒍🌈▱⅓ğ℠💥⚾⚽☻Ô∙�ë⁎⁍̴̛ᴗḵ̱🙈🌎🙌🏽🌞🍀😅🚂♥īã♡♫♪ἕ😧🙁Öî🏼😱🤚💰😡\x81😲🎾🤣ὠℐ\uf070\uf071\uf03d\uf031\uf02f\uf032\uf028\uf02d\uf061\uf029\uf020¬∆РсиАлкчтбышужгô\uf0b7😒ŋфз💤😤💀═★☆🤷\u200d♂🔥༼つ◕༽𝙨𝙢𝙙𝙚𝙧𝙣𝙜𝙛𝙡𝙪𝙘🙃👀▆→○\r😞øÙ🌙𝒀💫'

Basically we can delete all symbols we have no embeddings for:

In [22]:
symbols_to_delete = ''.join([c for c in jigsaw_symbols if not c in glove_symbols])
symbols_to_delete

'\n🐴جماعةلفقرء🐢😂😉\xad\u200b😐😊ᴛᴀʀᴡᴏᴋɴɢғᴍʜᴇᴊʙᴜᴅʏᴄʟᴘᴠ️\xa0\tᴵ😃😦😇😪ουής😀\u200a😳𝐧𝐚𝐭𝐢𝐨🙏🙂𝒑𝒓𝒆𝒔𝒊𝒅𝒏𝒕𝒂𝒖𝒈𝒉🙄😁😑💩💨😥‑\u2009🤔😩😟😭👿\x7f🎉\u202a\u202c😰\u200e🤡⩛⏖😎👍🇺🇸ῦ᾽ἔιἡκὶὺἐῖάῶέξόὲΜἈίὰὑῆζῷῳἰχύἀώὸὴὡἄὀὐᾶὅἑἅὖ🎭😈🇾😔𝒎\u2028◝😜》《😆😘͞𝙥𝙤𝙖𝙯𝙞𝙩𝙀𝙒𝙬😢\u2004😮😝👏𝒃𝒄😬👌👎😄💔\x85\x10𝐡𝐩𝐬𝐰𝐲𝐮𝐛𝐞𝐜𝐦𝐯𝐳𝐔𝐱𝟔𝟓𝐅𝐓𝐃𝐘😛💜ᵗʰ𝒐𝒗𝒙𝑰𝒇𝒚𝒌𝒘𝒍𝑴😏💝\ufeff\u200fдерьмо🕍𝑹😣\x08Дяйнп😍卐卍🕊🐵🏻小土豆አዎ\u2005⒈⒉⒊⒋⒌⒍🌈▱💥�⁎⁍̛ᴗḵ🙈🌎🙌🏽🌞🍀😅🚂ἕ😧🙁🏼😱🤚💰😡\x81😲🎾🤣ὠℐ\uf070\uf071\uf03d\uf031\uf02f\uf032\uf028\uf02d\uf061\uf029\uf020РАлкчтбышужг\uf0b7😒фз💤😤💀🤷\u200d🔥༼つ༽𝙨𝙢𝙙𝙚𝙧𝙣𝙜𝙛𝙡𝙪𝙘🙃👀\r😞🌙𝒀💫'

The symbols we want to keep need to be isolated from our words, so let's set up a list of the symbols to isolate.

In [23]:
symbols_to_isolate = ''.join([c for c in jigsaw_symbols if c in glove_symbols])
symbols_to_isolate

'.,?!&-":()%/>…;ï`“#”[]<+$’*—=ĺ_é‘–ìˈʊɒ☐~{}@☑₂•―◄►ē^«à»ɪō☠\\πλμαθ☮☒§™ñ¢çêáˌɔːʒɑə☄●°č£ü·āíóäÉ☺º´€☹✔☁℅½èÁ¼▰τδεσνβηρωγφʃʌ．ʻ✌ú◞|➤©☀ūâΔœ▀▔Āɡšć²❤≠ö↩¡в⅔òа⅓ğ℠⚾⚽☻Ô∙ë̴̱♥īã♡♫♪Öî¬∆сиôŋ═★☆♂◕▆→○øÙ'

Now comes the next trick: instead of using an inefficient loop of `replace` calls, we use `translate`.

In [24]:
isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}


def handle_punctuation(x):
    x = x.translate(remove_dict)
    x = x.translate(isolate_dict)
    return x

In [25]:
handle_punctuation('You look like a horse?  🐴' )

'You look like a horse ?   '
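
To see why `translate` is the better fit, the sketch below compares it with a loop of `replace` calls on toy symbol sets. The two small symbol strings and the helper names `with_translate` and `with_replace` are assumptions for illustration, and the two tables are merged into one here so that the string is scanned only once.

```python
symbols_to_isolate = "?!.,"   # toy subset: symbols that have an embedding
symbols_to_delete = "🐴\n"    # toy subset: symbols without an embedding

# One translation table: delete (map to None) or pad with spaces.
table = {ord(c): None for c in symbols_to_delete}
table.update({ord(c): f" {c} " for c in symbols_to_isolate})

def with_translate(text):
    """Single pass over the string, one dict lookup per character."""
    return text.translate(table)

def with_replace(text):
    """One full pass over the string per symbol."""
    for c in symbols_to_delete:
        text = text.replace(c, "")
    for c in symbols_to_isolate:
        text = text.replace(c, f" {c} ")
    return text

sample = "You look like a horse?  🐴"
assert with_translate(sample) == with_replace(sample)
print(repr(with_translate(sample)))
```

With hundreds of symbols, the `replace` loop scans every comment hundreds of times, while `translate` scans it once regardless of the table size.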

So let's apply that function to our text and reassess the coverage.

In [26]:
train['comment_text'] = train['comment_text'].apply(lambda x:handle_punctuation(x))

In [27]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())))
oov = check_coverage(vocab,glove_embeddings)
oov[:10]

Found embeddings for 76.75% of vocab
Found embeddings for  98.75% of all text


[("isn't", 2329),
 ("That's", 1997),
 ("won't", 1763),
 ("he's", 1359),
 ("Trump's", 1243),
 ("aren't", 1223),
 ("wouldn't", 1085),
 ("they're", 989),
 ("wasn't", 963),
 ("there's", 836)]

In [28]:
tokenizer = TreebankWordTokenizer()

In [29]:
def handle_contractions(x):
    x = tokenizer.tokenize(x)
    x = ' '.join(x)
    return x
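
The Treebank tokenizer helps because it splits contractions into pieces such as `is` and `n't`, which are far more likely to have their own vectors than the joined form. The sketch below imitates that idea with a single regular expression so it runs without NLTK. It is a simplified stand-in, not the NLTK implementation: the suffix list and the name `split_contractions` are assumptions, and the real tokenizer has many more rules.

```python
import re

# Simplified stand-in for Treebank-style clitic splitting (assumption: only
# these common English suffixes are handled; NLTK's rules cover far more).
_CLITIC = re.compile(r"(n't|'s|'re|'ve|'ll|'d|'m)\b", re.IGNORECASE)

def split_contractions(text):
    """Insert a space before each clitic, then split on whitespace."""
    return _CLITIC.sub(r" \1", text).split()

for word in ["isn't", "That's", "won't", "they're"]:
    print(word, "->", split_contractions(word))
```

Note that `won't` becomes `wo` plus `n't`, matching the `wo n't` visible in the processed comments further down.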

In [30]:
train['comment_text'] = train['comment_text'].apply(lambda x:handle_contractions(x))

In [31]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())),verbose=False)
oov = check_coverage(vocab,glove_embeddings)
oov[:10]

Found embeddings for 81.80% of vocab
Found embeddings for  99.57% of all text


[('Brexit', 136),
 ('tRump', 134),
 ("gov't", 97),
 ('theglobeandmail', 69),
 ("'the", 64),
 ('theguardian', 59),
 ('deplorables', 56),
 ('Drumpf', 52),
 ("'The", 42),
 ("'bout", 40)]

Now the OOV words look "normal", apart from those still carrying the `'` token at the beginning of the word. We will need to fix those by hand.

In [32]:
def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x

In [33]:
train['comment_text'] = train['comment_text'].apply(lambda x:fix_quote(x.split()))

In [34]:
train['comment_text'].head()

0    Jeff Sessions is another one of Trump s Orwell...
1    I actually inspected the infrastructure on Gra...
2    No it wo n't . That s just wishful thinking on...
3    Instead of wringing our hands and nibbling the...
4    how many of you commenters have garbage piled ...
Name: comment_text, dtype: object
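
The `Trump s` and `That s` in the output above are a side effect of `fix_quote`: it strips a leading apostrophe from every token, including the `'s` pieces produced by the tokenizer, while `n't` is untouched because it does not start with an apostrophe. A minimal sketch on a made-up token list:

```python
def fix_quote(tokens):
    """Drop one leading apostrophe from every token and re-join with spaces."""
    return " ".join(t[1:] if t.startswith("'") else t for t in tokens)

print(fix_quote(["'the", "Trump", "'s", "wo", "n't"]))
```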

In [35]:
vocab = build_vocab(list(train['comment_text'].apply(lambda x:x.split())),verbose=False)
oov = check_coverage(vocab,glove_embeddings)
oov[:50]

Found embeddings for 84.13% of vocab
Found embeddings for  99.66% of all text


[('Brexit', 139),
 ('tRump', 134),
 ("gov't", 97),
 ('theglobeandmail', 69),
 ('theguardian', 59),
 ('deplorables', 58),
 ('Drumpf', 52),
 ("Gov't", 39),
 ('Trumpcare', 38),
 ('garycrum', 37),
 ('Klastri', 29),
 ('SB91', 27),
 ('Trumpism', 24),
 ("y'all", 23),
 ('Saullie', 21),
 ('financialpost', 21),
 ('HIHS', 20),
 ('bigly', 20),
 ('907AK', 19),
 ('klastri', 19),
 ('Trumpian', 19),
 ('Hoopili', 19),
 ('Trumpsters', 18),
 ('Auwe', 18),
 ('l2g', 18),
 ('civilbeat', 17),
 ('shibai', 17),
 ('Trudeaus', 17),
 ('SHOPO', 16),
 ('Donkel', 16),
 ('cashapp24', 16),
 ('Deplorables', 15),
 ('Koncerned', 15),
 ('Meggsy', 15),
 ('BCLibs', 14),
 ('Zupta', 14),
 ('Anbang', 14),
 ('Layla4', 14),
 ('gofundme', 14),
 ('Crapwell', 13),
 ('wiliki', 13),
 ('xbt', 13),
 ('TFWs', 13),
 ('vancouversun', 13),
 ('tRUMP', 13),
 ('SJWs', 12),
 ('TrumpCare', 12),
 ('FakeNews', 12),
 ('MAGAphants', 12),
 ('MUPTE', 12)]

Looks good. Now we can implement the preprocessing functions and train a model. See 

In [36]:
print (glove_embeddings['Schwab'])
print(glove_embeddings['rmd'])
print(glove_embeddings['vanguard'])
print(glove_embeddings['blackrock'])
print(glove_embeddings['ameritrade'])
print(glove_embeddings['SCHW'])

[-0.26362    0.17013   -0.2353     0.23      -0.17085    0.31102
  0.45467    0.24224    0.099638   1.0841     0.3914    -0.58338
  0.23517    0.19334   -0.12729   -0.25476   -0.54572   -0.32191
  0.18551    0.53016    0.010809   0.64877    0.38942    0.027237
 -0.21348    0.44491   -0.3409     0.1347    -0.25356   -0.093586
  0.60238   -0.48313    0.55752    0.20925   -0.091584  -0.1464
 -0.22533   -0.42055   -0.098816  -0.42965   -0.53209   -0.24078
  0.2644     0.15721    0.16086    0.28977   -0.57275   -0.23805
  0.12122   -0.53071   -0.25444   -0.31427   -0.24907    0.28778
  0.62412   -0.15069   -0.66065   -0.37106    0.22876   -0.43941
  0.40445   -0.12752    0.49618    0.13293    0.5845     0.22621
  0.37511    0.090205   0.06014    0.070341   0.53625   -0.077997
  0.41538    0.41096   -0.79239    0.61845    0.43017    0.14255
 -0.11131   -0.14772   -0.009152  -0.12959    0.13898    0.44132
 -0.42023    0.1564    -0.021194  -0.084832  -0.21505   -0.0074104
 -0.020073   0.030071

In [37]:
print(crawl_embeddings['Schwab'])
print(crawl_embeddings['rmd'])
print(crawl_embeddings['vanguard'])
print(crawl_embeddings['blackrock'])
print(crawl_embeddings['ameritrade'])
print(crawl_embeddings['SCHW'])

[ 1.799e-01 -6.090e-02  3.940e-02  1.563e-01 -1.152e-01 -7.056e-01
  3.599e-01 -3.455e-01 -5.410e-01  3.297e-01 -2.468e-01  3.851e-01
 -4.890e-02  3.203e-01 -9.200e-03  4.183e-01  1.799e-01 -1.656e-01
 -1.567e-01  1.897e-01 -1.128e-01  2.519e-01 -7.587e-01  2.604e-01
  1.763e-01  3.257e-01  8.870e-02 -3.518e-01 -6.119e-01  1.431e-01
  4.926e-01  1.298e-01 -2.824e-01  2.730e-01  3.544e-01  1.392e-01
 -3.062e-01  5.110e-02 -1.973e-01 -2.703e-01  9.570e-02 -1.982e-01
 -4.788e-01  6.519e-01  2.258e-01 -4.140e-01 -2.274e-01  3.050e-02
  3.159e-01 -3.560e-01  1.755e-01 -2.430e-01  6.916e-01  4.952e-01
 -2.976e-01  4.380e-01 -2.934e-01  1.485e-01 -2.430e-02 -2.160e-02
 -4.477e-01 -1.428e-01  2.560e-01  8.540e-02  4.970e-02  2.503e-01
  3.531e-01  1.051e-01  8.440e-02  1.166e-01 -3.892e-01  7.640e-02
  1.361e-01 -4.056e-01 -4.945e-01 -1.586e-01 -4.023e-01  2.919e-01
 -3.220e-01 -2.810e-02  4.450e-01 -9.910e-02 -1.145e-01 -2.313e-01
 -4.943e-01 -5.789e-01 -3.248e-01  2.097e-01  2.743e-01 -2.104

If no embedding is found for a word, try its lower-case or title-case version, which sometimes gives us an embedding.

In [2]:
print(crawl_embeddings['SchwaB'])

NameError: name 'crawl_embeddings' is not defined

In [39]:
# print(glove_embeddings['SchwaB'])

In [40]:
def build_matrix(word_index, embedding_index):
    unknown_words = []
    lower_words = []
    title_words = []
    known_words = []
    
    
    for word, i in word_index.items():
        if i <= max_features:
            try:
                val = embedding_index[word]
                known_words.append(word)
            except KeyError:
                try:
                    val = embedding_index[word.lower()]
                    lower_words.append(word)
                except KeyError:
                    try:
                        val = embedding_index[word.title()]
                        title_words.append(word)
                    except KeyError:
                        unknown_words.append(word)
    return lower_words, title_words, unknown_words, known_words
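
The fallback order used above (exact form, then lower case, then title case) can also be written as a single lookup helper that returns the vector together with the variant that matched. This is a sketch under stated assumptions: `lookup_with_fallback` is a hypothetical name and `toy_index` holds made-up two-dimensional vectors.

```python
def lookup_with_fallback(word, embedding_index):
    """Return (vector, variant_used); vector is None if no variant is known."""
    variants = ((word, "exact"), (word.lower(), "lower"), (word.title(), "title"))
    for variant, label in variants:
        if variant in embedding_index:
            return embedding_index[variant], label
    return None, "unknown"

toy_index = {"schwab": [0.1, 0.2], "Brexit": [0.3, 0.4]}  # made-up vectors

print(lookup_with_fallback("SCHWAB", toy_index))
print(lookup_with_fallback("brexit", toy_index))
print(lookup_with_fallback("tRump", toy_index))
```

A helper like this could be used to fill an embedding matrix row with the returned vector whenever it is not `None`, leaving unknown words as zeros.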

In [41]:
max_features = 400000
tokenizer = text.Tokenizer(num_words = max_features, filters='',lower=False)

In [42]:

tokenizer.fit_on_texts(list(train['comment_text']))

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, crawl_embeddings)
print('n lower words (crawl): ', len(lower_words))
print('n title words (crawl): ', len(title_words))
print('n unknown words (crawl): ', len(unknown_words))
print('n known words (crawl): ', len(known_words))



n lower words (crawl):  843
n title words (crawl):  667
n unknown words (crawl):  14318
n known words (crawl):  84438


In [43]:

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, glove_embeddings)
print('n lower words (glove): ', len(lower_words))
print('n title words (glove): ', len(title_words))
print('n unknown words (glove): ', len(unknown_words))
print('n known words (glove): ', len(known_words))


n lower words (glove):  738
n title words (glove):  572
n unknown words (glove):  14598
n known words (glove):  84358


In [44]:
symbols_to_isolate = '.,?!-;*"…:—()%#$&_/@＼・ω+=”“[]^–>\\°<~•≠™ˈʊɒ∞§{}·τα❤☺ɡ|¢→̶`❥━┣┫┗Ｏ►★©―ɪ✔®\x96\x92●£♥➤´¹☕≈÷♡◐║▬′ɔː€۩۞†μ✒➥═☆ˌ◄½ʻπδηλσερνʃ✬ＳＵＰＥＲＩＴ☻±♍µº¾✓◾؟．⬅℅»Вав❣⋅¿¬♫ＣＭβ█▓▒░⇒⭐›¡₂₃❧▰▔◞▀▂▃▄▅▆▇↙γ̄″☹➡«φ⅓„✋：¥̲̅́∙‛◇✏▷❓❗¶˚˙）сиʿ✨。ɑ\x80◕！％¯−ﬂﬁ₁²ʌ¼⁴⁄₄⌠♭✘╪▶☭✭♪☔☠♂☃☎✈✌✰❆☙○‣⚓年∎ℒ▪▙☏⅛ｃａｓǀ℮¸ｗ‚∼‖ℳ❄←☼⋆ʒ⊂、⅔¨͡๏⚾⚽Φ×θ￦？（℃⏩☮⚠月✊❌⭕▸■⇌☐☑⚡☄ǫ╭∩╮，例＞ʕɐ̣Δ₀✞┈╱╲▏▕┃╰▊▋╯┳┊≥☒↑☝ɹ✅☛♩☞ＡＪＢ◔◡↓♀⬆̱ℏ\x91⠀ˤ╚↺⇤∏✾◦♬³の｜／∵∴√Ω¤☜▲↳▫‿⬇✧ｏｖｍ－２０８＇‰≤∕ˆ⚜☁'
symbols_to_delete = '\n🍕\r🐵😑\xa0\ue014\t\uf818\uf04a\xad😢🐶️\uf0e0😜😎👊\u200b\u200e😁عدويهصقأناخلىبمغر😍💖💵Е👎😀😂\u202a\u202c🔥😄🏻💥ᴍʏʀᴇɴᴅᴏᴀᴋʜᴜʟᴛᴄᴘʙғᴊᴡɢ😋👏שלוםבי😱‼\x81エンジ故障\u2009🚌ᴵ͞🌟😊😳😧🙀😐😕\u200f👍😮😃😘אעכח💩💯⛽🚄🏼ஜ😖ᴠ🚲‐😟😈💪🙏🎯🌹😇💔😡\x7f👌ἐὶήιὲκἀίῃἴξ🙄Ｈ😠\ufeff\u2028😉😤⛺🙂\u3000تحكسة👮💙فزط😏🍾🎉😞\u2008🏾😅😭👻😥😔😓🏽🎆🍻🍽🎶🌺🤔😪\x08‑🐰🐇🐱🙆😨🙃💕𝘊𝘦𝘳𝘢𝘵𝘰𝘤𝘺𝘴𝘪𝘧𝘮𝘣💗💚地獄谷улкнПоАН🐾🐕😆ה🔗🚽歌舞伎🙈😴🏿🤗🇺🇸мυтѕ⤵🏆🎃😩\u200a🌠🐟💫💰💎эпрд\x95🖐🙅⛲🍰🤐👆🙌\u2002💛🙁👀🙊🙉\u2004ˢᵒʳʸᴼᴷᴺʷᵗʰᵉᵘ\x13🚬🤓\ue602😵άοόςέὸתמדףנרךצט😒͝🆕👅👥👄🔄🔤👉👤👶👲🔛🎓\uf0b7\uf04c\x9f\x10成都😣⏺😌🤑🌏😯ех😲Ἰᾶὁ💞🚓🔔📚🏀👐\u202d💤🍇\ue613小土豆🏡❔⁉\u202f👠》कर्मा🇹🇼🌸蔡英文🌞🎲レクサス😛外国人关系Сб💋💀🎄💜🤢َِьыгя不是\x9c\x9d🗑\u2005💃📣👿༼つ༽😰ḷЗз▱ц￼🤣卖温哥华议会下降你失去所有的钱加拿大坏税骗子🐝ツ🎅\x85🍺آإشء🎵🌎͟ἔ油别克🤡🤥😬🤧й\u2003🚀🤴ʲшчИОРФДЯМюж😝🖑ὐύύ特殊作戦群щ💨圆明园קℐ🏈😺🌍⏏ệ🍔🐮🍁🍆🍑🌮🌯🤦\u200d𝓒𝓲𝓿𝓵안영하세요ЖљКћ🍀😫🤤ῦ我出生在了可以说普通话汉语好极🎼🕺🍸🥂🗽🎇🎊🆘🤠👩🖒🚪天一家⚲\u2006⚭⚆⬭⬯⏖新✀╌🇫🇷🇩🇪🇮🇬🇧😷🇨🇦ХШ🌐\x1f杀鸡给猴看ʁ𝗪𝗵𝗲𝗻𝘆𝗼𝘂𝗿𝗮𝗹𝗶𝘇𝗯𝘁𝗰𝘀𝘅𝗽𝘄𝗱📺ϖ\u2000үսᴦᎥһͺ\u2007հ\u2001ɩｙｅ൦ｌƽｈ𝐓𝐡𝐞𝐫𝐮𝐝𝐚𝐃𝐜𝐩𝐭𝐢𝐨𝐧Ƅᴨןᑯ໐ΤᏧ௦Іᴑ܁𝐬𝐰𝐲𝐛𝐦𝐯𝐑𝐙𝐣𝐇𝐂𝐘𝟎ԜТᗞ౦〔Ꭻ𝐳𝐔𝐱𝟔𝟓𝐅🐋ﬃ💘💓ё𝘥𝘯𝘶💐🌋🌄🌅𝙬𝙖𝙨𝙤𝙣𝙡𝙮𝙘𝙠𝙚𝙙𝙜𝙧𝙥𝙩𝙪𝙗𝙞𝙝𝙛👺🐷ℋ𝐀𝐥𝐪🚶𝙢Ἱ🤘ͦ💸ج패티Ｗ𝙇ᵻ👂👃ɜ🎫\uf0a7БУі🚢🚂ગુજરાતીῆ🏃𝓬𝓻𝓴𝓮𝓽𝓼☘﴾̯﴿₽\ue807𝑻𝒆𝒍𝒕𝒉𝒓𝒖𝒂𝒏𝒅𝒔𝒎𝒗𝒊👽😙\u200cЛ‒🎾👹⎌🏒⛸公寓养宠物吗🏄🐀🚑🤷操美𝒑𝒚𝒐𝑴🤙🐒欢迎来到阿拉斯ספ𝙫🐈𝒌𝙊𝙭𝙆𝙋𝙍𝘼𝙅ﷻ🦄巨收赢得白鬼愤怒要买额ẽ🚗🐳𝟏𝐟𝟖𝟑𝟕𝒄𝟗𝐠𝙄𝙃👇锟斤拷𝗢𝟳𝟱𝟬⦁マルハニチロ株式社⛷한국어ㄸㅓ니͜ʖ𝘿𝙔₵𝒩ℯ𝒾𝓁𝒶𝓉𝓇𝓊𝓃𝓈𝓅ℴ𝒻𝒽𝓀𝓌𝒸𝓎𝙏ζ𝙟𝘃𝗺𝟮𝟭𝟯𝟲👋🦊多伦🐽🎻🎹⛓🏹🍷🦆为和中友谊祝贺与其想象对法如直接问用自己猜本传教士没积唯认识基督徒曾经让相信耶稣复活死怪他但当们聊些政治题时候战胜因圣把全堂结婚孩恐惧且栗谓这样还♾🎸🤕🤒⛑🎁批判检讨🏝🦁🙋😶쥐스탱트뤼도석유가격인상이경제황을렵게만들지않록잘관리해야합다캐나에서대마초와화약금의품런성분갈때는반드시허된사용🔫👁凸ὰ💲🗯𝙈Ἄ𝒇𝒈𝒘𝒃𝑬𝑶𝕾𝖙𝖗𝖆𝖎𝖌𝖍𝖕𝖊𝖔𝖑𝖉𝖓𝖐𝖜𝖞𝖚𝖇𝕿𝖘𝖄𝖛𝖒𝖋𝖂𝕴𝖟𝖈𝕸👑🚿💡知彼百\uf005𝙀𝒛𝑲𝑳𝑾𝒋𝟒😦𝙒𝘾𝘽🏐𝘩𝘨ὼṑ𝑱𝑹𝑫𝑵𝑪🇰🇵👾ᓇᒧᔭᐃᐧᐦᑳᐨᓃᓂᑲᐸᑭᑎᓀᐣ🐄🎈🔨🐎🤞🐸💟🎰🌝🛳点击查版🍭𝑥𝑦𝑧ＮＧ👣\uf020っ🏉ф💭🎥Ξ🐴👨🤳🦍\x0b🍩𝑯𝒒😗𝟐🏂👳🍗🕉🐲چی𝑮𝗕𝗴🍒ꜥⲣⲏ🐑⏰鉄リ事件ї💊「」\uf203\uf09a\uf222\ue608\uf202\uf099\uf469\ue607\uf410\ue600燻製シ虚偽屁理屈Г𝑩𝑰𝒀𝑺🌤𝗳𝗜𝗙𝗦𝗧🍊ὺἈἡχῖΛ⤏🇳𝒙ψՁմեռայինրւդձ冬至ὀ𝒁🔹🤚🍎𝑷🐂💅𝘬𝘱𝘸𝘷𝘐𝘭𝘓𝘖𝘹𝘲𝘫کΒώ💢ΜΟΝΑΕ🇱♲𝝈↴💒⊘Ȼ🚴🖕🖤🥘📍👈➕🚫🎨🌑🐻𝐎𝐍𝐊𝑭🤖🎎😼🕷ｇｒｎｔｉｄｕｆｂｋ𝟰🇴🇭🇻🇲𝗞𝗭𝗘𝗤👼📉🍟🍦🌈🔭《🐊🐍\uf10aლڡ🐦\U0001f92f\U0001f92a🐡💳ἱ🙇𝗸𝗟𝗠𝗷🥜さようなら🔼'


from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()


isolate_dict = {ord(c):f' {c} ' for c in symbols_to_isolate}
remove_dict = {ord(c):f'' for c in symbols_to_delete}


def handle_punctuation(x):
    x = x.translate(remove_dict)
    x = x.translate(isolate_dict)
    return x

def handle_contractions(x):
    x = tokenizer.tokenize(x)
    return x

def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x

def preprocess(x):
    x = handle_punctuation(x)
    x = handle_contractions(x)
    x = fix_quote(x)
    return x

train['comment_text'] = train['comment_text'].apply(lambda x:preprocess(x))


In [45]:
tokenizer = text.Tokenizer(num_words = max_features, filters='',lower=False)
tokenizer.fit_on_texts(list(train['comment_text']))


In [46]:

# The tokenizer was already fitted in the previous cell; fitting it again
# would only double the word counts without changing word_index.

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, crawl_embeddings)
print('n lower words (crawl): ', len(lower_words))
print('n title words (crawl): ', len(title_words))
print('n unknown words (crawl): ', len(unknown_words))
print('n known words (crawl): ', len(known_words))



n lower words (crawl):  842
n title words (crawl):  667
n unknown words (crawl):  14287
n known words (crawl):  84412


In [47]:

lower_words, title_words, unknown_words, known_words = build_matrix(tokenizer.word_index, glove_embeddings)
print('n lower words (glove): ', len(lower_words))
print('n title words (glove): ', len(title_words))
print('n unknown words (glove): ', len(unknown_words))
print('n known words (glove): ', len(known_words))


n lower words (glove):  738
n title words (glove):  572
n unknown words (glove):  14540
n known words (glove):  84358


In [48]:
set(lower_words)

{'122MM',
 '17Million',
 '20End',
 '20FINAL',
 '20Letter',
 '250Million',
 '25YRS',
 '2Cm',
 '2Cv',
 '3MILLION',
 '3Wd',
 '40Years',
 '7Million',
 'AAAhhh',
 'AAaaa',
 'ACCESIBLE',
 'ACIDIFICATION',
 'ADVOCATED',
 'AFterall',
 'ALASKAS',
 "ALI'I",
 'ALLUDE',
 'ALLof',
 'ALaskans',
 'AMORIS',
 'ANAE',
 'APOLLOS',
 'APPEASER',
 'APr',
 'ARRET',
 'ASo',
 'ATTENDENTS',
 'Accelerants',
 'Actially',
 'Adance',
 'Afirmative',
 'Agirl',
 'AirBnb',
 'AlASKANS',
 'AlCan',
 'Aliado',
 'AllenE',
 'Allyou',
 'AmIright',
 'AmericaNo',
 'Ampla',
 'And2',
 'AntiFA',
 'AntiFa',
 'Antivaxxers',
 'Anyay',
 'ArmyMan',
 'AtheO',
 'Atrack',
 'Atually',
 'AustOn',
 'Ayaa',
 'BAKKA',
 'BANISHMENT',
 'BArry',
 'BEEJEEZUS',
 'BILLON',
 'BIden',
 'BLASE',
 'BLINDINGLY',
 'BOff',
 'BRAINIER',
 'BRIBING',
 'BRv',
 'BUTHe',
 'BaHaHaHa',
 'Backslapping',
 'BanKRupt',
 'Barbarianism',
 'BeIn',
 'Befuddles',
 'Billlion',
 'BizJournal',
 'Blablablah',
 'Blahhhhh',
 'BoRs',
 'BradFord',
 'Buccaneering',
 'Buggah',
 'Bul