In this kernel I want to illustrate how I come up with meaningful preprocessing when building deep learning NLP models. 

I start with two golden rules:

1.  **Don't use standard preprocessing steps like stemming or stopword removal when you have pre-trained embeddings** 

Some of you might have used standard preprocessing steps such as removing stopwords or stemming when doing word-count-based feature extraction (e.g. TF-IDF). 
With pre-trained embeddings the reason to avoid them is simple: you lose valuable information that would help your NN figure things out.  

2. **Get your vocabulary as close to the embeddings as possible**

In this notebook I will focus on how to achieve that. As an example I take the GoogleNews pretrained embeddings; there is no deeper reason for this choice.

We start with a neat little trick that enables us to see a progress bar when applying functions to a pandas DataFrame:

In [1]:
import json
from tqdm import tqdm
tqdm.pandas()


Let's load our data.

In [2]:
import json
# Walk through the file and keep only its last line, then parse that line as JSON
with open('inverted.json') as f:
    for line in f:
        pass
invert = json.loads(line)
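
The later cells treat `invert` as an inverted index that maps each word to a `docs` dictionary of per-document counts. That layout is inferred from how the cleaning functions further down access it, not from the file itself, so the toy record below is an assumption with made-up words and document ids:

```python
import json

# Hypothetical one-line JSON payload mimicking the assumed layout:
# word -> {'docs': {document id -> occurrences in that document}}
sample_line = '{"hello": {"docs": {"doc1": 2, "doc7": 1}}, "world": {"docs": {"doc1": 1}}}'

toy_invert = json.loads(sample_line)

# In this toy index, a word's total count is the sum over its documents.
toy_counts = {word: sum(entry['docs'].values()) for word, entry in toy_invert.items()}
print(toy_counts)
```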

I will use the following cell to set up our training vocabulary, which maps each word in our text to its number of occurrences. Here the counts are read from a file with one word and its count per line. 

In [3]:
vocab = {}
with open('./analysis/vocab') as f:
    for line in f:
        line = line.split()
        vocab[line[0]] = int(line[1])

So let's populate the vocabulary and display the first 5 elements and their counts. Note that we can now use progress_apply to see a progress bar.
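
No cell for this step survives in this notebook, so here is a minimal sketch of what counting a vocabulary from raw text could look like. It assumes plain whitespace tokenization; `build_vocab` and the toy sentences are hypothetical names and data, not the original code:

```python
from collections import Counter

def build_vocab(sentences):
    """Count how often each whitespace-separated token occurs."""
    vocab = Counter()
    for sentence in sentences:
        vocab.update(sentence.split())
    return dict(vocab)

# Toy corpus, made up for illustration.
toy_sentences = ["the cat sat", "the dog sat down"]
toy_vocab = build_vocab(toy_sentences)

# Show the first 5 entries and their counts.
print(list(toy_vocab.items())[:5])
```

With the text in a pandas Series, the splitting step could instead be run through `progress_apply` to get the progress bar mentioned above.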

Next we import the embeddings we want to use in our model later. For illustration I use GoogleNews here.

In [4]:
from gensim.models import KeyedVectors

news_path = './embedding/GoogleNews-vectors-negative300.bin'
embeddings_index = KeyedVectors.load_word2vec_format(news_path, binary=True)



Next I define a function that checks the intersection between our vocabulary and the embeddings. It returns a list of out-of-vocabulary (oov) words, sorted by frequency, that we can use to improve our preprocessing.

In [5]:
import operator 

def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]  # tokens covered by an embedding
        except KeyError:
            # No embedding for this word: record it as out of vocabulary
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

In [6]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 1048921/1048921 [00:01<00:00, 603618.53it/s]


Found embeddings for 11.24% of vocab
Found embeddings for  70.69% of all text


Ouch, only about 11% of our vocabulary has embeddings, which leaves roughly 29% of our text more or less useless. So let's have a look and start improving. For this we can simply look at the top oov words.
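
The two percentages measure different things: the first counts distinct words, the second weights every word by how often it occurs. A minimal sketch with made-up counts shows how far apart they can be; `coverage` is a hypothetical helper and a plain set stands in for the embeddings:

```python
def coverage(vocab, known_words):
    """Return (share of distinct words covered, share of all tokens covered)."""
    covered = [w for w in vocab if w in known_words]
    vocab_share = len(covered) / len(vocab)
    text_share = sum(vocab[w] for w in covered) / sum(vocab.values())
    return vocab_share, text_share

# Made-up counts: one frequent known word, three rarer unknown ones.
toy_vocab = {"said": 70, "to": 20, ".": 9, "xyzzy": 1}
toy_known = {"said"}

vocab_share, text_share = coverage(toy_vocab, toy_known)
print('{:.2%} of vocab, {:.2%} of all text'.format(vocab_share, text_share))
```

A few very frequent tokens dominate the text share, which is why fixing the top of the oov list pays off first.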

In [7]:
oov[:20]

[('.', 28644184),
 (',', 25669672),
 ('to', 11765243),
 ('a', 10440410),
 ('and', 10298461),
 ('of', 9688997),
 ('-', 6673699),
 ('”', 3175358),
 (':', 2608659),
 ('—', 1516752),
 (')', 1400229),
 ('(', 1373208),
 ("'", 1357140),
 ('"', 880445),
 ('?', 807212),
 ('000', 618443),
 ('it’s', 516104),
 (';', 477945),
 ('10', 369289),
 ('“i', 325595)]

Punctuation tops the list, but the first real word is "to". Why? Simply because "to" was removed when the GoogleNews embeddings were trained. We will fix this later; for now we take care of the punctuation, as this also seems to be a problem. But what do we do with the punctuation then: do we delete it or treat it as a token? I would say it depends. If the token has an embedding, keep it; if it doesn't, we don't need it anymore. So let's check:

In [8]:
'?' in embeddings_index

False

In [9]:
'&' in embeddings_index

True

Interesting. While "&" is in the GoogleNews embeddings, "?" is not. So we define a function that leaves "&" in place and removes other punctuation.
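
The same membership test can be run over a whole set of punctuation marks instead of one at a time. A minimal sketch, assuming any container that supports `in`; `split_punctuation` is a hypothetical helper, and the toy set only encodes the two lookups shown above:

```python
import string

def split_punctuation(known_tokens, candidates=string.punctuation):
    """Partition punctuation marks by whether the embedding vocabulary has them."""
    keep = [p for p in candidates if p in known_tokens]
    drop = [p for p in candidates if p not in known_tokens]
    return keep, drop

# Stand-in for `embeddings_index`: it deliberately contains just '&',
# because '&' and '?' are the only marks checked in this notebook.
toy_known = {"&"}

keep, drop = split_punctuation(toy_known, candidates="?&")
print("keep:", keep, "remove:", drop)
```

Passing the real `embeddings_index` instead of `toy_known` would give the keep and remove lists for every mark in `string.punctuation`.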

In [10]:
vocab_temp = vocab.copy()
invert_temp = invert.copy()

In [11]:
def clean_text(vocab,inverted):
    kkey = list(vocab.keys())
    qq = []
    for key in kkey:
        x = str(key)
        for punct in "/-'":
            x = x.replace(punct, '')
        for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’‘':
            x = x.replace(punct, '')
        
        if(x!=key):
            if(len(x)>0):
                if(x not in vocab and x not in inverted):
                    inverted[x] = {'docs':{}}
                    vocab[x] = 0
                elif(x not in vocab):
                    vocab[x] = 0
                    print('--',x)
                elif(x not in inverted):
                    inverted[x] = {'docs':{}}
                    print('==',x)
                    
                vocab[x] += vocab[key]
                qq.append((key,x))
                for w,n in inverted[key]['docs'].items():
                    # Add this document's count to the merged entry
                    inverted[x]['docs'][w] = inverted[x]['docs'].get(w, 0) + n
            del vocab[key]
            del inverted[key]
    return qq

In [12]:
change = clean_text(vocab,invert)
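
The core of this step is that several raw keys can collapse to the same cleaned key, and their counts then have to be added up. A minimal sketch of just that merge, under simplifying assumptions: it builds a new dictionary instead of editing in place, ignores the inverted index, and uses a hypothetical cleaner that strips only double quotes:

```python
def merge_cleaned(vocab, clean):
    """Re-key a word -> count dict through `clean`, summing counts that collide."""
    merged = {}
    for word, count in vocab.items():
        cleaned = clean(word)
        if cleaned:  # drop tokens that become empty, e.g. pure punctuation
            merged[cleaned] = merged.get(cleaned, 0) + count
    return merged

def strip_quotes(word):
    # Illustrative cleaner: removes only straight and curly double quotes.
    for mark in '"“”':
        word = word.replace(mark, '')
    return word

# Made-up counts for illustration.
toy_vocab = {"“i": 3, "i": 10, "”": 5}
print(merge_cleaned(toy_vocab, strip_quotes))
```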

In [13]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 843238/843238 [00:01<00:00, 617928.06it/s]


Found embeddings for 14.16% of vocab
Found embeddings for  84.93% of all text


Nice! We were able to increase our text coverage from about 71% to about 85% just by handling punctuation. OK, let's check on those oov words.

In [14]:
oov[:10]

[('to', 11788929),
 ('a', 10516905),
 ('and', 10341307),
 ('of', 9694380),
 ('—', 1517033),
 ('000', 618627),
 ('10', 370043),
 ('30', 308695),
 ('11', 232210),
 ('12', 190189)]

Hmm, it seems numbers are also a problem. Let's check the top 10 embeddings to get a clue.

In [15]:
for i in range(10):
    print(embeddings_index.index2entity[i])

</s>
in
for
that
is
on
##
The
with
said


Hmm, why is "##" in there? Simply because, as a preprocessing step, all numbers bigger than 9 have been replaced by hashes, i.e. 15 becomes ## while 123 becomes ### and 15.80€ becomes ##.##€. So let's mimic this preprocessing step to further improve our embeddings coverage.
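
The mapping can be tried on single tokens before applying it to the whole vocabulary. A minimal sketch using the same four patterns as the cell that follows; `mask_numbers` is a hypothetical helper name, and capping long digit runs at five hashes is that cell's choice, not something verified against the embeddings:

```python
import re

def mask_numbers(token):
    """Replace digit runs of length 2 or more with hashes, longest pattern first."""
    token = re.sub('[0-9]{5,}', '#####', token)
    token = re.sub('[0-9]{4}', '####', token)
    token = re.sub('[0-9]{3}', '###', token)
    token = re.sub('[0-9]{2}', '##', token)
    return token

for raw in ['9', '15', '123', '15.80€', '1234567']:
    print(raw, '->', mask_numbers(raw))
```

The order matters: running the two-digit pattern first would turn 123 into ##3 instead of ###.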

In [18]:
import re

def clean_numbers(vocab,inverted):
    kkey = list(vocab.keys())
    for key in kkey:
        
        x = str(key)
        x = re.sub('[0-9]{5,}', '#####', x)
        x = re.sub('[0-9]{4}', '####', x)
        x = re.sub('[0-9]{3}', '###', x)
        x = re.sub('[0-9]{2}', '##', x)
        
        if(x!=key):
            if(len(x)>0):
                # Create missing entries without resetting ones that already exist
                inverted.setdefault(x, {'docs':{}})
                vocab.setdefault(x, 0)
                vocab[x] += vocab[key]
                print(key,x)
                for w,n in inverted[key]['docs'].items():
                    # Add this document's count to the merged entry
                    inverted[x]['docs'][w] = inverted[x]['docs'].get(w, 0) + n
            del vocab[key]
            del inverted[key]

In [20]:
clean_numbers(vocab,invert)

000 ###
10 ##
30 ##
11 ##
12 ##
20 ##
15 ##
703 ###
900 ###
25 ##
13 ##
18 ##
2012 ####
14 ##
50 ##
16 ##
100 ###
2014 ####
17 ##
500 ###
2016 ####
24 ##
2013 ####
40 ##
21 ##
19 ##
2015 ####
22 ##
28 ##
2011 ####
23 ##
27 ##
202 ###
26 ##
2010 ####
301 ###
29 ##
200 ###
45 ##
2008 ####
35 ##
31 ##
60 ##
2009 ####
300 ###
400 ###
90 ##
800 ###
55 ##
32 ##
2007 ####
70 ##
34 ##
33 ##
80 ##
36 ##
600 ###
990 ###
48 ##
38 ##
540 ###
2005 ####
37 ##
2000 ####
42 ##
2006 ####
75 ##
39 ##
700 ###
41 ##
44 ##
43 ##
47 ##
2004 ####
49 ##
52 ##
46 ##
51 ##
250 ###
95 ##
2001 ####
2017 ####
53 ##
54 ##
65 ##
56 ##
99 ##
150 ###
58 ##
59 ##
57 ##
2003 ####
2002 ####
14th ##th
1990s ####s
240 ###
571 ###
1980s ####s
999 ###
410 ###
66 ##
1000 ####
1999 ####
85 ##
1970s ####s
00 ##
64 ##
05 ##
62 ##
1996 ####
20th ##th
1998 ####
2018 ####
10th ##th
63 ##
15th ##th
102 ###
72 ##
68 ##
1960s ####s
19th ##th
350 ###
1200 ####
1100 ####
1992 ####
21st ##st
777 ###
1300 ####
61 ##
120 ###
67 ##
11th ##t

840 ###
811 ###
13065 #####
564 ###
1701 ####
5301 ####
75th ##th
588 ###
12100 #####
817 ###
462 ###
878 ###
1935 ####
421 ###
14800 #####
451 ###
622 ###
1916 ####
42nd ##nd
787 ###
4304 ####
819 ###
9601 ####
9535 ####
543 ###
487 ###
859 ###
353 ###
8601 ####
10600 #####
1934 ####
7759 ####
1928 ####
1913 ####
8857 ####
446 ###
346 ###
448 ###
51st ##st
1632 ####
1925 ####
1923 ####
70th ##th
6372 ####
1921 ####
11900 #####
hart90 hart##
523 ###
577 ###
1015 ####
13800 #####
383 ###
10px ##px
382 ###
532 ###
12900 #####
60th ##th
884 ###
538 ###
1927 ####
542 ###
704 ###
366 ###
803 ###
14300 #####
1630 ####
8477 ####
291 ###
4070 ####
791 ###
714 ###
1922 ####
920 ###
837 ###
631 ###
623 ###
869 ###
552 ###
607 ###
11800 #####
706 ###
12500 #####
11000 #####
39th ##th
10700 #####
1780 ####
477 ###
868 ###
48th ##th
38th ##th
536 ###
44th ##th
10877 #####
452 ###
36441 #####
879 ###
10800 #####
41st ##st
2026 ####
381 ###
1609 ####
716 ###
919 ###
797 ###
386 ###
612 ###
898 ###
15

1505 ####
6630 ####
6972 ####
873 ###
20151 #####
4808 ####
1602 ####
0001 ####
3550 ####
005 ###
2926 ####
1206 ####
846 ###
20715 #####
1618 ####
2425 ####
10401 #####
22311 #####
76th ##th
4941 ####
1411 ####
1750 ####
6405 ####
1216 ####
7975 ####
2270 ####
4x400 4x###
43008 #####
4343 ####
1217 ####
1103 ####
1214 ####
973 ###
2903 ####
22303 #####
9701 ####
1313 ####
13053 #####
20017 #####
1014 ####
9518 ####
21029 #####
3637 ####
852 ###
8881 ####
86th ##th
20194 #####
2567 ####
9444 ####
14515 #####
4025 ####
22556 #####
1859 ####
19100 #####
9410 ####
4356 ####
5890 ####
2796 ####
0394 ####
6612 ####
1324 ####
83rd ##rd
3999 ####
3174 ####
1620 ####
1860s ####s
1848 ####
12001 #####
4125 ####
7671 ####
1304 ####
20877 #####
1714 ####
6019 ####
1857 ####
2480 ####
3713 ####
20705 #####
89th ##th
1408 ####
2999 ####
20015 #####
22202 #####
45745 #####
6491 ####
6252 ####
2650 ####
22150 #####
030 ###
8599 ####
015 ###
8186 ####
1621 ####
992 ###
996 ###
4729 ####
22026 #####
56

5018 ####
7208 ####
7802 ####
6003 ####
3407 ####
9573 ####
45131 #####
20842 #####
0113 ####
5723 ####
1665 ####
1341 ####
8151 ####
7020 ####
8820 ####
8407 ####
6420 ####
9144 ####
5408 ####
6260 ####
9713 ####
5011 ####
1260 ####
4252 ####
3708 ####
7110 ####
3806 ####
4202 ####
5013 ####
20736 #####
4377 ####
22125 #####
4129 ####
3425 ####
7917 ####
24950 #####
3214 ####
5363 ####
8906 ####
3213 ####
2582 ####
3717 ####
2810 ####
2119 ####
3204 ####
4102 ####
0984 ####
7206 ####
14604 #####
3062 ####
3504 ####
2038 ####
1027 ####
12333 #####
5610 ####
027 ###
4903 ####
4203 ####
1774 ####
3883 ####
3934 ####
9302 ####
076 ###
2616 ####
19355 #####
4115 ####
104th ###th
2516 ####
1326 ####
1736 ####
9811 ####
2832 ####
2135 ####
5113 ####
1753 ####
30pm ##pm
6503 ####
8324 ####
3814 ####
8110 ####
6303 ####
6432 ####
8097 ####
9774 ####
7814 ####
20700 #####
3625 ####
9338 ####
4950 ####
1060 ####
1500s ####s
2806 ####
4220 ####
6703 ####
9705 ####
8203 ####
0750 ####
6513 ####
83

6278 ####
13405 #####
9623 ####
13401 #####
11720 #####
2656 ####
11512 #####
11914 #####
10867 #####
1253 ####
2540 ####
11216 #####
42395 #####
5218 ####
300px ###px
13869 #####
7188 ####
9948 ####
7798 ####
3823 ####
1080p ####p
7725 ####
5222 ####
4031 ####
10821 #####
6840 ####
13710 #####
5028 ####
5413 ####
7346 ####
12602 #####
8322 ####
3596 ####
1588 ####
2730 ####
10228 #####
5939 ####
6725 ####
8114 ####
9002 ####
8392 ####
6721 ####
4187 ####
0777 ####
4330 ####
4929 ####
11601 #####
4617 ####
7878 ####
10802 #####
12507 #####
14704 #####
10704 #####
0109 ####
4815 ####
13110 #####
2128 ####
2527 ####
15108 #####
13609 #####
14511 #####
9233 ####
12510 #####
9817 ####
13305 #####
10906 #####
10306 #####
15251 #####
8224 ####
9113 ####
2186 ####
11304 #####
77px ##px
11606 #####
whole30 whole##
5335 ####
7398 ####
5630 ####
1581 ####
8024 ####
14811 #####
1460 ####
13106 #####
7543 ####
9551 ####
1773 ####
2526 ####
22600 #####
8014 ####
5521 ####
3346 ####
7408 ####
12214 

1173 ####
11561 #####
1394 ####
5735 ####
5067 ####
10224 #####
1596 ####
11515 #####
15422 #####
13768 #####
10523 #####
shines3243 shines####
14406 #####
10968 #####
future45 future##
blizzard2016 blizzard####
19813 #####
16071 #####
24532 #####
4c8f102a 4c8f###a
12e9 ##e9
91a4 ##a4
1ff6190e0808 1ff####e####
3fa4b691 3fa4b###
032b ###b
85dc ##dc
54b26e93f753 ##b##e##f###
79489525 #####
a734 a###
489e ###e
03061be1c2a5 #####be1c2a5
e8a2d651 e8a2d###
7e92ee66725d 7e##ee#####d
4af834ec 4af###ec
8816f7a3efe9 ####f7a3efe9
0a0b9756 0a0b####
44ee427bcaba ##ee###bcaba
4f9300aa 4f####aa
4d52 4d##
71512e6c0a3b #####e6c0a3b
20af ##af
b847 b###
e1a67aaa9718 e1a##aaa####
56caca4b ##caca4b
4d86 4d##
ed5002c53f36 ed####c##f##
8a828066 8a#####
a04d a##d
6c18adc7b6e4 6c##adc7b6e4
55c3 ##c3
b746 b###
6c29ea812667 6c##ea#####
m113s m###s
1258 ####
6722 ####
1294 ####
13011 #####
1164 ####
2530 ####
22s ##s
9189 ####
12616 #####
9421 ####
6832 ####
6348 ####
3083 ####
6340 ####
12525 #####
5132 ####
255

4974 ####
20524 #####
12939 #####
12263 #####
200t ###t
22603 #####
12188 #####
20690 #####
32f ##f
11458 #####
p17 p##
p19 p##
87ers ##ers
1610e ####e
23b ##b
7533 ####
3585 ####
7430 ####
4445 ####
4680 ####
15015 #####
14012 #####
42245 #####
3744 ####
22060 #####
1292 ####
7238 ####
15119 #####
14203 #####
4436 ####
9343 ####
9842 ####
u16 u##
4662 ####
7038 ####
6960 ####
15323 #####
8459 ####
11311 #####
5253 ####
2072 ####
8376 ####
7634 ####
6435 ####
12223 #####
10225 #####
5932 ####
7555 ####
8898 ####
7751 ####
7133 ####
777s ###s
8084 ####
8969 ####
8999 ####
4855 ####
8028 ####
9473 ####
14b ##b
4183 ####
4773 ####
6172 ####
20271 #####
3249 ####
15416 #####
8276 ####
2346 ####
6938 ####
7240 ####
5361 ####
6637 ####
4865 ####
5252 ####
3294 ####
7327 ####
7889 ####
8635 ####
4064 ####
143rd ###rd
14405 #####
19370 #####
42476 #####
4288 ####
9454 ####
20006 #####
5927 ####
2084 ####
5763 ####
7322 ####
5678 ####
23700 #####
1495 ####
7345 ####
7294 ####
1059 ####
18850 ##

2780 ####
7642 ####
5972 ####
10046 #####
7079 ####
8334 ####
13754 #####
5273 ####
16862 #####
3965 ####
12730 #####
4696 ####
39464 #####
18306 #####
8953 ####
1089 ####
0104 ####
8137 ####
5285 ####
12436 #####
20305 #####
127th ###th
13033 #####
2276 ####
44400 #####
18941 #####
12248 #####
15324 #####
12942 #####
0157 ####
15905 #####
43750 #####
43780 #####
25k ##k
7670 ####
5460 ####
9339 ####
8559 ####
11433 #####
11817 #####
13956 #####
12334 #####
13430 #####
4177 ####
7262 ####
14428 #####
8587 ####
8927 ####
14246 #####
70mm ##mm
6958 ####
7780 ####
3362 ####
11111 #####
5193 ####
12555 #####
13829 #####
13563 #####
9462 ####
6379 ####
128th ###th
18001 #####
15021 #####
9824 ####
9276 ####
21041 #####
16011 #####
7955 ####
5654 ####
9955 ####
5364 ####
6259 ####
15016 #####
22938 #####
18506 #####
13737 #####
4762 ####
13419 #####
14115 #####
14929 #####
7947 ####
15454 #####
8237 ####
13140 #####
11556 #####
7737 ####
0005 ####
3163 ####
16820 #####
8139 ####
3739 ####
53

11873 #####
11279 #####
news24 news##
14942 #####
155px ###px
20693 #####
s90 s##
7797 ####
20658 #####
632857 #####
19242 #####
w20 w##
41792 #####
22982 #####
13049 #####
24406 #####
31″ ##″
28″ ##″
32″ ##″
19″ ##″
24″ ##″
256a ###a
40515 #####
39207 #####
35095 #####
1610d ####d
1690s ####s
5886 ####
15222 #####
10142 #####
11579 #####
15761 #####
‘30s ‘##s
20809 #####
231st ###st
fox31 fox##
9582 ####
0656 ####
0533 ####
14850 #####
15218 #####
10744 #####
12169 #####
9145 ####
3382 ####
7166 ####
8362 ####
20684 #####
11692 #####
3495 ####
15921 #####
13372 #####
10641 #####
16620 #####
5290 ####
0007 ####
20890 #####
12937 #####
12241 #####
16619 #####
4346 ####
0355 ####
0246 ####
41900 #####
21313 #####
13324 #####
5766 ####
h20 h##
8946 ####
2079 ####
20514 #####
20317 #####
6082 ####
4682 ####
43617 #####
7356 ####
8179 ####
18929 #####
150s ###s
6490 ####
24petwatch ##petwatch
6267 ####
15077 #####
7468 ####
3876 ####
6765 ####
13648 #####
4363 ####
11053 #####
21136 #####
4

20334 #####
22038 #####
2971 ####
13654 #####
10072 #####
13922 #####
34l ##l
10034 #####
22740 #####
13538 #####
42818 #####
43567 #####
11552 #####
21851 #####
11297 #####
6586 ####
funker530 funker###
‘40 ‘##
21204 #####
4463 ####
11597 #####
14614 #####
13923 #####
15533 #####
16824 #####
1393 ####
14637 #####
22750 #####
12259 #####
4364 ####
15804 #####
43190 #####
6987 ####
12832 #####
281st ###st
7466 ####
20403 #####
43982 #####
21905 #####
4285 ####
42980 #####
11790 #####
14162 #####
9742 ####
10829 #####
18108 #####
20418 #####
5686 ####
7357 ####
21689 #####
46375 #####
11294 #####
42782 #####
10957 #####
18424 #####
10e ##e
13459 #####
9686 ####
10268 #####
14260 #####
23106 #####
169th ###th
18632 #####
13957 #####
5655 ####
4298 ####
14837 #####
10195 #####
21157 #####
21775 #####
379th ###th
20454 #####
13289 #####
15664 #####
8243 ####
19204 #####
5759 ####
10551 #####
9783 ####
256th ###th
43976 #####
13661 #####
5063 ####
7288 ####
19116 #####
26020 #####
11573 ####

0770 ####
9091 ####
217th ###th
0082 ####
14866 #####
100ths ###ths
19520 #####
13469 #####
17505 #####
11564 #####
12065 #####
8998 ####
9938 ####
14448 #####
16515 #####
2866 ####
2289 ####
9379 ####
19803 #####
206n ###n
14985 #####
11076 #####
3172 ####
6279 ####
0713 ####
235th ###th
ub40 ub##
8595 ####
0245 ####
12051 #####
6999 ####
8843 ####
18414 #####
15821 #####
21398 #####
0081 ####
9267 ####
9443 ####
42853 #####
19044 #####
43534 #####
7274 ####
22542 #####
3488 ####
15832 #####
22664 #####
5777 ####
4467 ####
44016 #####
21954 #####
25606 #####
25118 #####
23414 #####
0939 ####
20045 #####
14680 #####
12470 #####
13555 #####
42743 #####
11144 #####
23244 #####
6973 ####
16312 #####
10580 #####
14563 #####
14034 #####
16853 #####
13187 #####
9377 ####
8976 ####
11156 #####
19008 #####
62262 #####
17810 #####
43484 #####
43192 #####
26031 #####
14676 #####
12735 #####
7277 ####
ride2012 ride####
6399 ####
42450 #####
22648 #####
4196 ####
6492 ####
7097 ####
15536 #####
89

12187 #####
500x ###x
11674 #####
18122 #####
40110 #####
q92udac q##udac
17424 #####
10565 #####
­28 ­##
18209 #####
20824 #####
43075 #####
13554 #####
15443 #####
fairfax2015 fairfax####
15744 #####
19378 #####
16483 #####
13059 #####
603b ###b
17912 #####
22847 #####
25172 #####
12661 #####
14578 #####
£25 £##
29675 #####
20847 #####
21750 #####
14884 #####
39635 #####
22840 #####
14457 #####
16073 #####
13095 #####
m240 m###
22760 #####
26614 #####
10763 #####
405w ###w
10096 #####
10085 #####
15623 #####
304w ###w
21779 #####
22009 #####
21439 #####
22104 #####
d404 d###
47421 #####
23112 #####
17502 #####
11284 #####
11096 #####
41869 #####
12653 #####
12592 #####
13782 #####
23038 #####
11696 #####
23418 #####
59f ##f
622946 #####
243rd ###rd
42263 #####
0435 ####
46445 #####
13659 #####
10692 #####
1808n ####n
14554 #####
12883 #####
342nd ###nd
17199 #####
43984 #####
17517 #####
22634 #####
41745 #####
25605 #####
attacktheglass2016 attacktheglass####
11377 #####
15d ##d
202

16606 #####
16825 #####
9983 ####
14258 #####
19211 #####
0659 ####
0846 ####
39600 #####
42072 #####
25259 #####
47564 #####
21134 #####
22749 #####
19299 #####
10878 #####
0271 ####
268th ###th
13s ##s
5287 ####
15287 #####
16292 #####
46a ##a
622952 #####
14778 #####
13489 #####
19223 #####
15264 #####
0379 ####
15242 #####
14692 #####
16114 #####
15535 #####
10925 #####
12577 #####
2688 ####
sunday20924 sunday#####
38800 #####
43108 #####
18725 #####
42758 #####
13679 #####
17161 #####
12571 #####
12095 #####
130js ###js
9298 ####
20638 #####
22419 #####
42421 #####
0133 ####
jfv123 jfv###
0556 ####
25095 #####
19936 #####
42819 #####
11977 #####
6763 ####
9376 ####
506th ###th
14399 #####
0980 ####
20921 #####
17075 #####
43106 #####
0032 ####
11665 #####
0132 ####
x86 x##
11395 #####
11227 #####
0239 ####
436th ###th
204a ###a
45090 #####
17446 #####
9690 ####
12374 #####
16433 #####
12097 #####
gl450 gl###
gallery555dc gallery###dc
10467 #####
a380s a###s
19357 #####
jdixon580 j

27400 #####
44338 #####
21423 #####
49b ##b
14795 #####
14286 #####
12768 #####
28m ##m
23154 #####
17589 #####
13599 #####
15747 #####
20153 #####
44031 #####
m9994 m####
farmstead88 farmstead##
msatter195 msatter###
m20139 m#####
m12666 m#####
patriciacorona800 patriciacorona###
m22711 m#####
13288 #####
15188 #####
5687 ####
springforalexandria2012 springforalexandria####
21557 #####
43227 #####
44024 #####
17363 #####
42817 #####
14080 #####
15840 #####
15349 #####
12266 #####
39308 #####
24598 #####
20386 #####
19849 #####
43182 #####
43374 #####
45833 #####
11953 #####
17567 #####
12693 #####
0648 ####
0052 ####
550th ###th
0139 ####
0572 ####
0116 ####
47776 #####
20823 #####
22006 #####
22754 #####
41397 #####
43221 #####
46984 #####
6698 ####
0975 ####
41025 #####
42987 #####
3996 ####
44296 #####
43812 #####
20642 #####
43231 #####
18239 #####
21159 #####
denise76marie denise##marie
wwitk2010 wwitk####
300er ###er
marylandcampaign150 marylandcampaign###
than100 than###
23230 

91f ##f
78f ##f
56f ##f
434b ###b
44010 #####
m825 m###
90m ##m
merlin999 merlin###
wwdc2013 wwdc####
343rd ###rd
lon…39 lon…##
96° ##°
angie12106 angie#####
24312 #####
86cc ##cc
19977 #####
22324 #####
camayoub21 camayoub##
cjanes15 cjanes##
79f ##f
40°c ##°c
68°f ##°f
e104 e###
7021n ####n
conor64 conor##
19244 #####
917w ###w
765m ###m
811w ###w
zeke27 zeke##
74f ##f
27pm ##pm
65n ##n
4508a ####a
08″ ##″
93″ ##″
28r ##r
t27 t##
b507 b###
58″ ##″
09″ ##″
bbface212 bbface###
532yoga ###yoga
45deg ##deg
2011‐2012 ####‐####
2011‐12 ####‐##
299a ###a
18249 #####
66″ ##″
6883 ####
23963 #####
k014 k###
cashman7323 cashman####
999999 #####
32k ##k
curzon417 curzon###
was37 was##
2013highlight ####highlight
46″ ##″
2013… ####…
fox59 fox##
d14 d##
23616 #####
1502e ####e
07″ ##″
2015–2024 ####–####
20thcentury ##thcentury
21527 #####
40n ##n
12–17 ##–##
2000e ####e
1302s ####s
p18 p##
348th ###th
65″ ##″
26592 #####
cbs19 cbs##
23m ##m
7873931 #####
78″ ##″
439a ###a
r01 r##
23259 #####
958

23053 #####
16244 #####
vaendorsements2013 vaendorsements####
23185 #####
05314829 #####
900px ###px
24805 #####
43791 #####
●200 ●###
310th ###th
800razors ###razors
0722 ####
17588 #####
23675 #####
0672 ####
4447 ####
23925 #####
47855 #####
26214 #####
12694 #####
s509 s###
18627 #####
20527 #####
22210 #####
12648 #####
46550 #####
44152 #####
21724 #####
25852 #####
353a ###a
4274 ####
b410 b###
606w ###w
66800 #####
tastings11 tastings##
54f ##f
85em ##em
21k ##k
685i ###i
w1070 w####
179d ###d
15527 #####
15596 #####
19406 #####
13927 #####
43626 #####
42670 #####
42515 #####
42938 #####
43211 #####
42902 #####
15775 #####
45278 #####
42795 #####
10583 #####
15465 #####
12568 #####
43703 #####
arjunsethi81 arjunsethi##
espn27 espn##
356th ###th
528i ###i
42038 #####
20336 #####
44029 #####
21574 #####
20029 #####
21587 #####
21476 #####
19149 #####
39838 #####
25437 #####
43541 #####
704s ###s
22934 #####
22816 #####
18521 #####
13739 #####
15160 #####
17530 #####
16550 #####
0

423rd ###rd
46943 #####
2506b ####b
91de0a4f ##de0a4f
b948 b###
a9acdc52195e a9acdc#####e
41539 #####
u47 u##
100d ###d
£700 £###
ftse100 ftse###
395867 #####
40446 #####
302nd ###nd
helpdesk0019 helpdesk####
16… ##…
r136 r###
16659 #####
47639 #####
fox43 fox##
45866 #####
45884 #####
610n ###n
525th ###th
sky360 sky###
hb516 hb###
s1013 s####
358a ###a
3078south ####south
40178 #####
28s ##s
11121a #####a
940s ###s
4f7f96ca8a7e 4f7f##ca8a7e
u45ffr3l1 u##ffr3l1
c104 c###
sh406 sh###
f19 f##
natediaz209 natediaz###
37144 #####
113a ###a
412th ###th
5801f ####f
730057722993971200 #####
729636634271076352 #####
25704 #####
5476b3 ####b3
44536 #####
19473 #####
cyp2j19 cyp2j##
sb1552 sb####
drext727 drext###
b19c b##c
4e75 4e##
81d5 ##d5
084d4d7f48df ###d4d7f##df
1125s ####s
101′ ###′
s203 s###
—2016 —####
v830 v###
2608s ####s
26304 #####
§207 §###
131399a #####a
40711 #####
22476 #####
0855 ####
djh0710 djh####
22142 #####
753748958770499584 #####
6031c ####c
40128 #####
40427 #####
180

22774 #####
13287 #####
721b ###b
18172 #####
731a ###a
41769 #####
24180 #####
46387 #####
47338 #####
43995 #####
21385 #####
20072 #####
23463 #####
38255 #####
18523 #####
42201 #####
23595 #####
36196 #####
42880 #####
11182 #####
1561a ####a
13692 #####
16438 #####
invite1047 invite####
190px ###px
x1yb76ir0oqxzii x1yb##ir0oqxzii
pn51f4500 pn##f####
x35 x##
15164 #####
armour39 armour##
15183 #####
23725 #####
22224 #####
28308 #####
18527 #####
11493 #####
central1306 central####
41803 #####
46160 #####
46784 #####
46554 #####
43866 #####
44480 #####
20087 #####
21539 #####
22817 #####
16683 #####
25167 #####
45842 #####
invite1048 invite####
25550 #####
485374 #####
6911g ####g
5902f ####f
donnajean5536 donnajean####
54cd ##cd
rarciaga20 rarciaga##
1071229075 #####
12387 #####
46903 #####
20795 #####
44295 #####
42523 #####
22267 #####
23084 #####
43257 #####
40867 #####
36070 #####
dixieduncan94 dixieduncan##
17206 #####
813b ###b
18886 #####
12567 #####
12299 #####
15677 ####

47045 #####
23357 #####
17727 #####
18732 #####
45255 #####
705e ###e
hb493 hb###
k24 k##
20446 #####
newmanbethesda2016 newmanbethesda####
17088 #####
­2­02 ­2­##
20790 #####
44240 #####
43080 #####
17071 #####
37070 #####
36906 #####
10586 #####
16150 #####
41879 #####
19583 #####
35g ##g
200mg ###mg
14311b #####b
factory20 factory##
0156 ####
17046 #####
632826 #####
46420 #####
45549 #####
20485 #####
22717 #####
37777 #####
17165 #####
b506 b###
tbrown9601 tbrown####
998th ###th
42230 #####
g102 g###
15870 #####
e275 e###
wwilsonclassof76 wwilsonclassof##
21322 #####
21896 #####
gfisher143 gfisher###
17150 #####
194000 #####
a559 a###
n111 n###
s502 s###
17847 #####
21099 #####
15469 #####
382nd ###nd
­16th ­##th
16047 #####
38668 #####
50hr ##hr
palette22 palette##
04m ##m
403b ###b
★2010 ★####
23ecaps ##ecaps
43512 #####
20927 #####
25760 #####
25694 #####
0831 ####
16684 #####
22470 #####
39525 #####
2016summercamps ####summercamps
waltere8805 waltere####
wlhs1966 wlhs####
wlcl

24617 #####
1435chapin ####chapin
42418 #####
43709 #####
10pim ##pim
15em ##em
art21 art##
1960s… ####s…
650kg ###kg
10cm ##cm
14211a #####a
b416 b###
39821 #####
951e ###e
8e15090d64ae 8e#####d##ae
a2631c7e a####c7e
242m ###m
785th ###th
665th ###th
714th ###th
903rd ###rd
19665 #####
11093 #####
17â ##â
50â ##â
jebwbush2016 jebwbush####
lanson15 lanson##
00pmet ##pmet
stopyulin2015 stopyulin####
5580047c #####c
2000â ####â
613225835365949440 #####
21298 #####
80123 #####
chrome10 chrome##
75e4143c ##e####c
f42f f##f
103°f ###°f
€65 €##
‘02 ‘##
biff2015 biff####
biff1985 biff####
bill2015 bill####
20150708 #####
559d683de4b0b6ded1624ab4 ###d###de4b0b6ded####ab4
559d6843e4b0b134cab4dbd3 ###d####e4b0b###cab4dbd3
300040 #####
1436379225069 #####
85p ##p
44592 #####
4612d ####d
2929f ####f
1240k ####k
€50bn €##bn
92l ##l
20150713 #####
gryabt32yj gryabt##yj
€67 €##
412′ ###′
368′ ###′
396′ ###′
90kg ##kg
2012—but ####—but
35vw ##vw
33177 #####
gaia14aae gaia##aae
2931a ####a
14215a #

2012sips ####sips
2012suppers ####suppers
county12 county##
42186 #####
43952 #####
19757 #####
43899 #####
45508 #####
43269 #####
40268 #####
18797 #####
40970 #####
36430 #####
35113 #####
25726 #####
25269 #####
42532 #####
45850 #####
39909 #####
children12 children##
4ae7450 4ae####
3625301 #####
3169410 #####
3124410 #####
3134410 #####
3137410 #####
3151301 #####
3634410 #####
3686410 #####
3036410 #####
3110301 #####
3380301 #####
3052410 #####
3186410 #####
3090301 #####
3454410 #####
3642301 #####
3649410 #####
3464410 #####
3744410 #####
3037410 #####
3638301 #####
3181410 #####
3130410 #####
3528301 #####
3883301 #####
3485410 #####
3469410 #####
3474410 #####
3493410 #####
3001410 #####
3039410 #####
3021301 #####
putin2012 putin####
68million ##million
608084 #####
53905 #####
011512heart300 #####heart###
83th ##th
u238 u###
0282 ####
dickens2012 dickens####
17other ##other
d7000 d####
pne8000 pne####
2000estimate ####estimate
­­15 ­­##
a79 a##
db15 db##
500—and ###—and


today—15 today—##
arlinks10 arlinks##
1641g ####g
1422w ####w
919a ###a
6391b ####b
17869 #####
inv996 inv###
0947 ####
22—than ##—than
bl12fr50 bl##fr##
bl12fr100 bl##fr###
l12fr250 l##fr###
bl12fr500 bl##fr###
bl12fr1000 bl##fr####
67now ##now
dc44 dc##
10fps ##fps
147million ###million
27for ##for
ar2009121702062 ar#####
prus23789012 prus#####
auxliary206 auxliary###
40447 #####
40375 #####
40195 #####
26907 #####
28144 #####
i14z i##z
zs20 zs##
250gb ###gb
208w ###w
24473 #####
24276 #####
47534 #####
43928 #####
19784 #####
45652 #####
40293 #####
16592 #####
18846 #####
38688 #####
18297 #####
26161 #####
45352 #####
1025b ####b
38—was ##—was
inv997 inv###
14024b #####b
6917j ####j
5906g ####g
fri23 fri##
644b ###b
17986 #####
1868b ####b
5650c ####c
6151b ####b
6831b ####b
887b ###b
2721b ####b
4643c ####c
a4000 a####
247pp ###pp
at17 at##
average10 average##
socialmediaoutlook2013 socialmediaoutlook####
4238367070 #####
franchisefairwinter2012 franchisefairwinter####
poweroftwe

47scott ##scott
z51 z##
tb516 tb###
5r25 5r##
metromarchguide2013 metromarchguide####
in18 in##
14091 #####
884px ###px
792px ###px
125px ###px
■17th ■##th
■23rd ■##rd
763f ###f
bcc73 bcc##
oaktonhighschoolclassof1988 oaktonhighschoolclassof####
4532c ####c
45755 #####
23936 #####
35438 #####
26765 #####
29912 #####
245960 #####
1925n ####n
903j ###j
24344 #####
18906i #####i
10208302 #####
sa102 sa###
20130140 #####
42219 #####
42083 #####
42055 #####
40389 #####
26059 #####
25139 #####
46710 #####
226s ###s
gs500 gs###
141800 #####
82b ##b
2504b ####b
invite150 invite###
0241 ####
592650 #####
42150 #####
592310 #####
0531 ####
0091 ####
oasis500s oasis###s
s321 s###
mach37™ mach##™
about10 about##
78½ ##½
frannetva0913 frannetva####
chaosuncomplicatedfb101 chaosuncomplicatedfb###
a1023 a####
24578 #####
24475 #####
34977 #####
24323 #####
24374 #####
player—109th player—###th
the10 the##
age16 age##
0569 ####
16694 #####
47387 #####
19765 #####
44917 #####
43207 #####
15993 #####
18

00because ##because
spring14 spring##
r437 r###
437th ###th
an1889 an####
1807u ####u
spaceylacey0420 spaceylacey####
boards—49 boards—##
‘411 ‘###
194½ ###½
44282 #####
20989 #####
40837 #####
16061 #####
310h ###h
35803 #####
21803 #####
rai113 rai###
383583 #####
270655 #####
area10 area##
821n ###n
270656 #####
80406 #####
wpkec150 wpkec###
nasmay1001 nasmay####
the4x400 the4x###
than15 than##
ggarafuliclily2014 ggarafuliclily####
100lilygarafulic ###lilygarafulic
hb1168 hb####
xm25 xm##
stanfordmedicine25 stanfordmedicine##
the25 the##
cap1400 cap####
●58 ●##
bcc64 bcc##
bowie1979 bowie####
bowiebulldogs1979 bowiebulldogs####
waltwhitman1974 waltwhitman####
16998 #####
40390 #####
24990 #####
21241 #####
36004 #####
46656 #####
30530 #####
22534 #####
21622 #####
47158 #####
29827 #####
dh201 dh###
about30 about##
actor98 actor##
15351 #####
14011a #####a
conv1047 conv####
permutations1072 permutations####
a19s a##s
a21s a##s
23549 #####
0499 ####
b3254 b####
839228 #####
832346 #

kevincd22 kevincd##
has13 has##
0381 ####
in420 in###
tip187 tip###
96096 #####
buried711 buried###
included50 included##
drove10 drove##
4136359 #####
7166791 #####
getmy2016catalog getmy####catalog
2016digitalcatalog ####digitalcatalog
0952 ####
100milesuppers ###milesuppers
46383 #####
19626 #####
a428021 a#####
15four ##four
1089712 #####
s618 s###
4232a ####a
23596 #####
29050 #####
22080 #####
44702 #####
22421 #####
46469 #####
48340 #####
19691 #####
27040 #####
28425 #####
27463 #####
50185 #####
024th ###th
conv1134 conv####
invite1133 invite####
496th ###th
gridlock0727 gridlock####
6805c ####c
ph19 ph##
3051a ####a
324752 #####
324700 #####
316453 #####
324760 #####
325152 #####
324657 #####
324852 #####
324769 #####
26821 #####
659b ###b
12h ##h
1264a ####a
16035 #####
classof2011 classof####
00373 #####
500obo ###obo
a37 a##
1015s ####s
n117 n###
clk500 clk###
qx70 qx##
80111 #####
10831c #####c
312e ###e
201c ###c
41757 #####
24714 #####
42606 #####
27048 #####
325099 ##

41083 #####
35356 #####
46704 #####
45298 #####
0712show16 ####show##
n15 n##
26713 #####
0676 ####
sw19 sw##
invite1182 invite####
jazzfestival2016 jazzfestival####
525m ###m
wp16 wp##
700ish ###ish
238⅔ ###⅔
94253 #####
10thannual ##thannual
12percent ##percent
cl20117fleamarket cl#####fleamarket
a436644 a#####
65g ##g
18810 #####
m13 m##
19445 #####
36952 #####
36107 #####
27148 #####
45580 #####
0715show16 ####show##
e1905 e####
gbooth123 gbooth###
e1911 e####
0719show16 ####show##
­300 ­###
save70 save##
29547 #####
92q ##q
21⅔ ##⅔
t26 t##
10⅓ ##⅓
36413 #####
11winchester ##winchester
44574 #####
11— ##—
29gze9t ##gze9t
29j1lfp ##j1lfp
29ki9o1 ##ki9o1
419n ###n
333a ###a
18855 #####
47696 #####
21573 #####
43212 #####
s821 s###
48500 #####
0722show16 ####show##
0726show16 ####show##
23570 #####
18268 #####
b402 b###
3d119073872 3d#####
240mg ###mg
invite1184 invite####
26280 #####
church3755 church####
mundial1994 mundial####
80h ##h
gridlock0725 gridlock####
00356 #####
32146530 

101212 #####
192832 #####
feetzmorgan202 feetzmorgan###
bag35 bag##
24hourfitness ##hourfitness
sb1070—which sb####—which
tgibson810 tgibson###
0986 ####
m23congordc m##congordc
janvesely24 janvesely##
krisjenkins77 krisjenkins##
published2011 published####
2010torn ####torn
2008–except ####–except
197910 #####
2014–research ####–research
san11 san##
h4446 h####
hb4799 hb####
596—roughly ###—roughly
53526 #####
and11 and##
12thstreet ##thstreet
chief123 chief###
fy2002 fy####
2020—it ####—it
32rd ##rd
epa03455042 epa#####
cast2004 cast####
0002008 #####
5002012 #####
€290 €###
planes—15 planes—##
va08 va##
epa03459831 epa#####
a389 a###
a389recordings a###recordings
1980—more ####—more
2011ranked ####ranked
2011corruption ####corruption
romneyryan2012 romneyryan####
g28 g##
g30 g##
b25 b##
newsradio610 newsradio###
2012unfiltered ####unfiltered
200i ###i
754precinct ###precinct
300precinct ###precinct
205precinct ###precinct
159precinct ###precinct
642precinct ###precinct
638precinct #

58m ##m
370s ###s
dce7620d dce####d
413c ###c
b167 b###
b606a9a86ece b###a9a##ece
237b ###b
747sr ###sr
wp10269 wp#####
peepsshow2014 peepsshow####
pulltabproductions11 pulltabproductions##
952—including ###—including
bush41 bush##
another12 another##
7870002 #####
23b6e741 ##b6e###
36f1 ##f1
a99e a##e
118196fc115a #####fc###a
00on ##on
1p991 1p###
ppb19… ppb##…
4680bfcc ####bfcc
a58b a##b
020560 #####
32m ##m
256k ###k
512k ###k
ffebf09e ffebf##e
248e ###e
45ee ##ee
2bef8420ce5d 2bef####ce5d
fl13 fl##
000–4 ###–4
anz531 anz###
anz532 anz###
anz533 anz###
anz537 anz###
anz538 anz###
655237 #####
b254 b###
91d15cb6 ##d##cb6
197ef3568958 ###ef#####
20130drive #####drive
7507m ####m
1601b ####b
n571 n###
p628 p###
1415926535897932384626433 #####
85987448205 #####
14159265359 #####
14159265358979323846264338327950 #####
288419716939937510 #####
7879763 #####
ricardorosa1973 ricardorosa####
170325z #####z
01009kt #####kt
3500vp6000ft ####vp####ft
vv004 vv###
m02 m##
a3011 a####
p0006 p####


944–59 ###–##
218–25 ###–##
710–19 ###–##
42–50 ##–##
739–46 ###–##
358–61 ###–##
576–600 ###–###
7024c ####c
174m ###m
8448736 #####
8448746 #####
8448755 #####
8448769 #####
100dayloans ###dayloans
uno2013 uno####
390bf1dd ###bf1dd
1fcbecebdd74 1fcbecebdd##
kids—21 kids—##
36pm ##pm
h33t h##t
3427148 #####
25–year ##–year
358c758b ###c###b
4c37 4c##
f4c7132e15ee f4c####e##ee
16—came ##—came
2w46 2w##
068fb7b4a845 ###fb7b4a###
8d3fe977 8d3fe###
761e ###e
16ba1eef2390 ##ba1eef####
8458693 #####
v22 v##
p248 p###
341–42 ###–##
ab2501 ab####
be24 be##
162b4c8a ###b4c8a
1968– ####–
8463837 #####
8463923 #####
8463858 #####
8463868 #####
8463882 #####
8463892 #####
8463907 #####
740th ###th
949–50 ###–##
827th ###th
acres–100 acres–###
4336055 #####
1420s ####s
11–which ##–which
13–but ##–but
50w ##w
83b ##b
1200nw ####nw
13357d #####d
12430b #####b
42316 #####
8466171 #####
8466172 #####
8466173 #####
8466175 #####
£292 £###
x900b x###b
hw40es hw##es
f5d955cc f5d###cc
905b ###b
e6ea6a60f6

ebj2012 ebj####
619174739781775361 #####
619169871260164096 #####
619166443066646528 #####
618472716375564288 #####
500° ###°
1000°f ####°f
16l ##l
17h ##h
17l ##l
18j ##j
28g ##g
6067a ####a
314b ###b
42193 #####
41854 #####
40888 #####
44852 #####
20229 #####
111noonherst ###noonherst
501milewalk ###milewalk
2012–2013 ####–####
my2131 my####
268e ###e
b621 b###
b55e495e9b78 b##e###e9b##
619718532268445696 #####
619879799536918528 #####
619880734933684224 #####
21grandslams ##grandslams
wimbledon2015 wimbledon####
2014a ####a
tunica06dv1 tunica##dv1
tunica06mv1 tunica##mv1
267000 #####
2008…sba ####…sba
thisis50 thisis##
1989tourwashingtondc ####tourwashingtondc
15an ##an
£48 £##
90002188 #####
405′ ###′
366′ ###′
452′ ###′
404′ ###′
431′ ###′
441′ ###′
420′ ###′
371′ ###′
372′ ###′
400′ ###′
351′ ###′
376′ ###′
385′ ###′
421′ ###′
399′ ###′
445′ ###′
433′ ###′
390′ ###′
377′ ###′
411′ ###′
402′ ###′
563792055651270656 #####
999999999999 #####
500746781622562816 #####
aylorswift13 ayl

707620187734982656 #####
708081559862120452 #####
60s—perhaps ##s—perhaps
3407a ####a
46755 #####
724s ###s
515m ###m
97m ##m
475m ###m
075m ###m
580m ###m
315m ###m
320m ###m
050m ###m
380m ###m
031m ###m
603mm ###mm
708190667281793024 #####
80d ##d
ku2016champs ku####champs
708334815372828672 #####
674021406426980352 #####
1990s…those ####s…those
4a6332ce 4a####ce
e443 e###
47cb ##cb
b0ca00411fe1 b0ca#####fe1
hb1524 hb####
sb1150 sb####
709564825500848128 #####
709598270402596864 #####
1841—the ####—the
r100 r###
jimward21 jimward##
709918638816550912 #####
622242105033764864 #####
958848 #####
267th ###th
185cd514 ###cd###
d71f d##f
a65b a##b
587e721fb231 ###e###fb###
1930s—i ####s—i
21—examining ##—examining
fb6547 fb####
710453854014279680 #####
r136a1 r###a1
710660117138640896 #####
112–58 ###–##
404–238 ###–###
39921 #####
s712 s###
1801d ####d
14016c #####c
12763 #####
7592a ####a
8240b ####b
100096 #####
955th ###th
a10wbb a##wbb
a10champs a##champs
736–37 ###–##
889–91 ###–##

1830b ####b
rayophotography13 rayophotography##
823221915171061760 #####
obama17 obama##
242196 #####
259636 #####
834017335669358593 #####
835194156175798277 #####
1999–to ####–to
835027942262484992 #####
835048458645970946 #####
835558339002134528 #####
835343752386514944 #####
jaredjeffrey614 jaredjeffrey###
pmullins14 pmullins##
300126215931 #####
836100391314075649 #####
664453 #####
836124739735326720 #####
kingj9943 kingj####
836078946039042049 #####
836357784031510529 #####
836402392035901440 #####
836414672731525120 #####
836382208466235393 #####
sb55 sb##
newspaper15 newspaper##
835845404734414848 #####
61016 #####
836645013685592065 #####
iriseup99 iriseup##
835004560947150848 #####
834964849264205825 #####
836741000022347776 #####
837099085660360704 #####
1925–30 ####–##
shatty22 shatty##
837298505672687616 #####
837323083589566467 #####
fatmeh32 fatmeh##
andrewsholt10 andrewsholt##
g304 g###
206w ###w
41429 #####
12434b #####b
42085lot #####lot
34877 #####
46887 #####
4599

72oz ##oz
1503s ####s
2009…its ####…its
1839s ####s
‘4000 ‘####
w20167s w#####s
officialsf49ers officialsf##ers
f001 f###
ar2339s ar####s
20ss ##ss
295s ###s
snl40 snl##
90125 #####
69′ ##′
18×24 ##×##
50‑50′ ##‑##′
kfp48 kfp##
wizards2010 wizards####
26y ##y
47daysofhitman ##daysofhitman
3191s ####s
i360s i###s
fifa15 fifa##
7042s ####s
16…he ##…he
2001—about ####—about
1651s ####s
300gb ###gb
15… ##…
‘1031 ‘####
early90s early##s
fn2187 fn####
1240s ####s
faith47s faith##s
je11s je##s
‎24hours‬ ‎##hours‬
192s ###s
v90 v##
1973–certainly ####–certainly
211s ###s
i09s i##s
kukiz15 kukiz##
alevine843 alevine###
‘666 ‘###
‘meet24 ‘meet##
15–to–watch ##–to–watch
farooksyed49 farooksyed##
iluvusa1132 iluvusa####
101sts ###sts
99¢ ##¢
c86 c##
1917–1963 ####–####
87… ##…
article19s article##s
c57 c##
2014…andersons ####…andersons
87s ##s
360is ###is
72… ##…
c84 c##
c77 c##
superfly1069 superfly####
£un35 £un##
80s—well ##s—well
‘bh34 ‘bh##
weather5280s weather####s
b330s b###s
96ers ##ers
gl

In [21]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 812155/812155 [00:01<00:00, 620500.06it/s]

Found embeddings for 14.76% of vocab
Found embeddings for  87.25% of all text





Nice! Another 3% increase. Not as much as with handling the punctuation, but every bit helps. Let's check the oov words.
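The listing above pairs each raw token with its masked form (for example "the10" becomes "the##"). The function that produced it is not shown in this part of the notebook, so the following is only a sketch of a rule that is consistent with those pairs. It assumes that runs of two to four digits are masked digit for digit, runs of five or more collapse to "#####", and single digits are left alone. The name mask_numbers is hypothetical.

```python
import re

# Runs of two or more ASCII digits; a single digit is left untouched.
_DIGIT_RUN = re.compile(r'[0-9]{2,}')

def mask_numbers(token):
    """Mask digit runs with '#': 2-4 digits one for one, 5 or more as '#####'."""
    return _DIGIT_RUN.sub(lambda m: '#' * min(len(m.group()), 5), token)

# Tokens taken from the listing above.
for raw in ['the10', '383583', '1p991', 'b606a9a86ece']:
    print(raw, mask_numbers(raw))
```

Capping long runs at five characters keeps arbitrary IDs and long numbers from each becoming their own vocabulary entry.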

OK, now we take care of common American/British spelling differences and misspellings, and replace a few "modern" words with "social media". For this task I use a multi-regex script I found some time ago on Stack Overflow. Additionally, we will simply remove the words "a", "to", "and" and "of", since those have obviously been downsampled when training the GoogleNews embeddings.
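When working on raw text rather than on a vocabulary, the multi-regex idea could look roughly like this. It is a sketch under stated assumptions, not the original Stack Overflow script: the helper names build_pattern and replace_in_text and the three-entry dictionary are made up for illustration, and every key is assumed to start and end with a word character so that the \b anchors apply.

```python
import re

def build_pattern(mapping):
    """Compile one alternation that matches any key of `mapping` as a whole word."""
    # Longest keys first, so a longer key is tried before a shorter one sharing its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_in_text(text, mapping, pattern):
    """Replace every matched key with its value in a single pass over `text`."""
    return pattern.sub(lambda m: mapping[m.group(0)], text)

example_map = {'didnt': 'did not', 'theatre': 'theater', 'labour': 'labor'}
example_pattern = build_pattern(example_map)
print(replace_in_text('she didnt like the theatre', example_map, example_pattern))
```

Compiling a single pattern means the text is scanned once, however many entries the dictionary has.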


In [16]:
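def replace_typical_misspell(vocab, inverted, key, re_word):
    """Fold the count and postings of `key` into each word in `re_word`, then drop `key`."""
    if key in vocab or key in inverted:
        print('o', key, re_word)
        # `key` may live in only one of the two structures, so read both defensively.
        key_count = vocab.get(key, 0)
        key_docs = inverted.get(key, {'docs': {}})['docs']
        for x in re_word:
            # Create missing entries for the replacement word before merging.
            if x not in vocab and x not in inverted:
                inverted[x] = {'docs': {}}
                vocab[x] = 0
            elif x not in vocab:
                vocab[x] = 0
                print('--', x)
            elif x not in inverted:
                inverted[x] = {'docs': {}}
                print('==', x)
            vocab[x] += key_count
            # Add the per-document counts of `key` to those of the replacement word.
            for w, n in key_docs.items():
                inverted[x]['docs'][w] = inverted[x]['docs'].get(w, 0) + n

        # pop() instead of del: no KeyError when `key` is missing from one structure.
        vocab.pop(key, None)
        inverted.pop(key, None)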
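
To make the merge concrete, here is a condensed toy restatement of the same idea. It assumes that vocab maps each word to a total count and that the inverted index maps each word to {'docs': {doc_id: count}}. The document ids 'd1' to 'd3', the counts, and the helper name fold_into are invented for illustration.

```python
def fold_into(vocab, inverted, key, targets):
    """Move the count and postings of `key` onto every word in `targets`."""
    count = vocab.pop(key, 0)
    docs = inverted.pop(key, {'docs': {}})['docs']
    for target in targets:
        vocab[target] = vocab.get(target, 0) + count
        target_docs = inverted.setdefault(target, {'docs': {}})['docs']
        for doc_id, n in docs.items():
            target_docs[doc_id] = target_docs.get(doc_id, 0) + n

toy_vocab = {'didnt': 3, 'did': 5, 'not': 7}
toy_inverted = {
    'didnt': {'docs': {'d1': 2, 'd2': 1}},
    'did': {'docs': {'d1': 5}},
    'not': {'docs': {'d3': 7}},
}
fold_into(toy_vocab, toy_inverted, 'didnt', ['did', 'not'])
print(toy_vocab)      # 'didnt' is gone; its count was added to 'did' and 'not'
print(toy_inverted)
```

Note that a one-to-many replacement such as 'didnt' to 'did' and 'not' adds the old count to both targets, so the total token count grows.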

In [17]:
mispell_dict = {'didnt':['did','not'],
                'doesnt':['does','not'],
                'isnt':['is','not'],
                'shouldnt':['should','not'],
                'wasnt':['was','not'],
                'hasnt':['has','not'],
                '‘i':['i'],
                'theatre':['theater'],
                'cancelled':['canceled'],
                'organisation':['organization'],
                'labour':['labor'],
                'favourite':['favorite'],
                'travelling':['traveling'],
                'washingtons':['washington'],
                'marylands':['maryland'],
                'chinas':['china'],
                'russias':['russia'],
                '‘the':['the'],
                'irans':['iran'],
                'dulles':['dulle']  # NOTE: probably unintended; 'dulles' is usually the proper noun Dulles, not a possessive
                }


In [18]:
for key,re_word in mispell_dict.items():
    replace_typical_misspell(vocab,invert,key,re_word)

o didnt ['did', 'not']
o doesnt ['does', 'not']
o isnt ['is', 'not']
o shouldnt ['should', 'not']
o wasnt ['was', 'not']
o hasnt ['has', 'not']
o theatre ['theater']
o cancelled ['canceled']
o organisation ['organization']
o labour ['labor']
o favourite ['favorite']
o travelling ['traveling']
o washingtons ['washington']
o marylands ['maryland']
o chinas ['china']
o russias ['russia']
o irans ['iran']
o dulles ['dulle']


In [19]:
oov = check_coverage(vocab,embeddings_index)

100%|██████████| 843220/843220 [00:01<00:00, 610473.53it/s]

Found embeddings for 14.16% of vocab
Found embeddings for  85.10% of all text





In [20]:
oov[:40]

[('to', 11788929),
 ('a', 10516905),
 ('and', 10341307),
 ('of', 9694380),
 ('—', 1517033),
 ('000', 618627),
 ('10', 370043),
 ('30', 308695),
 ('11', 232210),
 ('12', 190189),
 ('20', 172134),
 ('15', 165409),
 ('–', 130187),
 ('703', 126301),
 ('900', 113356),
 ('25', 110584),
 ('13', 107690),
 ('18', 106101),
 ('2012', 105997),
 ('14', 104874),
 ('50', 101847),
 ('16', 101128),
 ('100', 96989),
 ('2014', 94357),
 ('17', 91094),
 ('500', 89634),
 ('2016', 82488),
 ('24', 81104),
 ('2013', 79778),
 ('…', 79766),
 ('40', 76906),
 ('21', 75482),
 ('19', 74826),
 ('2015', 72897),
 ('22', 71464),
 ('28', 65849),
 ('2011', 65395),
 ('23', 64728),
 ('27', 63606),
 ('26', 63521)]
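
The four most frequent entries here are exactly the words the text above proposes to remove. A minimal sketch of doing that on the same two structures follows. It assumes that vocab maps each word to a count and that the inverted index is keyed by word; drop_words is a hypothetical helper name and the toy dictionaries are made up.

```python
def drop_words(vocab, inverted, words):
    """Remove each word from both structures; return the words that were present."""
    removed = []
    for word in words:
        in_vocab = vocab.pop(word, None) is not None
        in_inverted = inverted.pop(word, None) is not None
        if in_vocab or in_inverted:
            removed.append(word)
    return removed

toy_vocab = {'to': 9, 'a': 8, 'and': 7, 'theater': 2}
toy_inverted = {
    'to': {'docs': {'d1': 9}},
    'a': {'docs': {'d1': 8}},
    'and': {'docs': {'d2': 7}},
    'theater': {'docs': {'d2': 2}},
}
print(drop_words(toy_vocab, toy_inverted, ['a', 'to', 'and', 'of']))
print(toy_vocab)
```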

We see that we improved the share of our text covered by embeddings from 89% to 99%. Let's check the oov words again.

Looks good. No obvious oov words there we could quickly fix.
Thank you for reading and happy kaggling

# write out

In [25]:
with open('./clean_data/inverted_file.json','w') as f:
    f.write(json.dumps(invert))
with open('./clean_data/vocab','w') as f:
    for key in vocab:
        f.write('{0}\t{1}\n'.format(key,vocab[key]))

In [23]:
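import operator

def check_coverage_large(vocab, embeddings_index):
    """Report embedding coverage, also accepting the capitalised form of each word."""
    a = {}    # words with an embedding
    oov = {}  # words without one, with their counts
    k = 0     # occurrences covered by embeddings
    i = 0     # occurrences not covered
    for word in tqdm(vocab):
        # Slicing keeps this safe for an empty-string key (word[0] would raise IndexError).
        la = word[:1].upper() + word[1:]
        if word in embeddings_index or la in embeddings_index:
            a[word] = word
            k += vocab[word]
        else:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # Most frequent out-of-vocabulary words first.
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x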
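
The capitalisation fallback used above can be seen in isolation with a toy stand-in for the embeddings. This is a sketch: the set toy_embeddings and the helper find_embedding_key are hypothetical, and a plain set is used only because it supports the same "in" membership test.

```python
def find_embedding_key(word, embeddings):
    """Return the form of `word` found in `embeddings`: as is, then capitalised, else None."""
    if word in embeddings:
        return word
    capitalised = word[:1].upper() + word[1:]
    if capitalised in embeddings:
        return capitalised
    return None

toy_embeddings = {'Washington', 'theater'}
for w in ['washington', 'theater', 'brexit', '']:
    print(repr(w), find_embedding_key(w, toy_embeddings))
```

Returning the matched key rather than a boolean makes it possible to look up the right vector later, since a lowercased vocabulary entry may only exist in the embeddings in its capitalised form.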

In [89]:
del invert['']

In [24]:
oov = check_coverage_large(vocab,embeddings_index)

100%|██████████| 843220/843220 [00:01<00:00, 755026.39it/s]

Found embeddings for 38.32% of vocab
Found embeddings for  96.68% of all text





In [95]:
oov[:20]

[('brexit', 84),
 ('deplorables', 77),
 ('sharknado', 45),
 ('f—', 45),
 ('tywanza', 28),
 ('selfie', 27),
 ('rapability', 25),
 ('rapable', 25),
 ('covfefe', 22),
 ('ooooooouuu', 21),
 ('schlonged', 20),
 ('pozdravleniya', 19),
 ('daesh', 18),
 ('ncis', 15),
 ('…', 15),
 ('adorabilis', 14),
 ('splatoon', 13),
 ('n—', 13),
 ('nastaliq', 12),
 ('whatd', 12)]