[Webster's 1913](https://www.websters1913.com/words/) is a [great dictionary](http://jsomers.net/blog/dictionary). Here, I parse [Project Gutenberg's version](http://www.gutenberg.org/cache/epub/29765/pg29765.txt). The goal is to build a list of words I might actually use, to feed a daily email to myself.

### Parsing 

In [1]:
import string
import time
import random

In [2]:
with open("websters1913.txt", "r") as f:
    lines = f.read().split("\n")

In [3]:
len(lines)

974266

In [4]:
def is_word(w):
    return bool(w) and all(c in string.ascii_uppercase for c in w)

In [5]:
words = [line for line in lines if is_word(line)]

In [6]:
len(words) 

103040

In [7]:
len(set(words))  # Fewer than len(words), so it contains duplicates...

88629

Hm, that's a lot.

In [8]:
definitions = {}

In [9]:
curr_word = None
start = time.perf_counter()
for i, line in enumerate(lines):
    if i % 50000 == 0:
        print("Line: ", i)
    
    if is_word(line):
        curr_word = line 
        # Collect lines in a list and join once later: repeated string concatenation would be O(N**2), which hurts here
        definitions.setdefault(curr_word, [])
    elif curr_word and line: 
        definitions[curr_word].append(line)
end = time.perf_counter()
print("Total time: ", end - start)

Line:  0
Line:  50000
Line:  100000
Line:  150000
Line:  200000
Line:  250000
Line:  300000
Line:  350000
Line:  400000
Line:  450000
Line:  500000
Line:  550000
Line:  600000
Line:  650000
Line:  700000
Line:  750000
Line:  800000
Line:  850000
Line:  900000
Line:  950000
Total time:  0.6811576559999999


In [10]:
for k, v in definitions.items():
    definitions[k] = "\n".join(v)

In [11]:
len(definitions)

88629
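
To make the grouping step concrete, here is a minimal sketch of the same loop run on a made-up sample (the `sample` lines and the `parse_entries` name are illustrative, not from the Gutenberg file or the notebook). It shows why the final dictionary has one entry per unique headword: a repeated headword appends to the list that already exists.

```python
import string

def is_word(w):
    # Same test as the notebook: non-empty and made only of A-Z
    return bool(w) and all(c in string.ascii_uppercase for c in w)

def parse_entries(lines):
    """Group non-empty lines under the most recent headword."""
    entries = {}
    curr_word = None
    for line in lines:
        if is_word(line):
            curr_word = line
            entries.setdefault(curr_word, [])  # repeated headwords share one list
        elif curr_word and line:
            entries[curr_word].append(line)
    # Join once per headword instead of concatenating line by line
    return {k: "\n".join(v) for k, v in entries.items()}

# Made-up sample, not taken from the Gutenberg file
sample = [
    "preamble text that comes before any headword",
    "CAT",
    "Cat, n.",
    "Defn: A small animal.",
    "",
    "DOG",
    "Dog, n.",
    "CAT",
    "Cat, v. t.",
]
parsed = parse_entries(sample)
```

In this toy run the second `CAT` block is merged into the first, and the preamble line is dropped because no headword has been seen yet.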

### Filtering important words

#### Method 1: Length of definition

In [12]:
# Lots of words are tiny obscure ones, need to choose a good cutoff length
cutoff_definition_length = 200
cutoff_dict = {k: v for k, v in definitions.items() if len(v) > cutoff_definition_length}
len(cutoff_dict)

32017
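
Rather than guessing a single cutoff, one option is to count how many entries survive at several cutoffs and pick from there. A small sketch, using a hypothetical helper `count_above` and a toy dictionary in place of `definitions`:

```python
def count_above(defs, cutoffs):
    """Return {cutoff: number of entries whose definition is longer than cutoff}."""
    lengths = [len(v) for v in defs.values()]
    return {c: sum(1 for n in lengths if n > c) for c in cutoffs}

# Toy stand-in for `definitions`: three entries of length 50, 250 and 1200
toy = {"A": "x" * 50, "B": "x" * 250, "C": "x" * 1200}
survivors = count_above(toy, [0, 200, 1000])
```

On the real data, something like `count_above(definitions, [100, 200, 500, 1000])` would show how quickly the list shrinks as the cutoff grows.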

In [13]:
def ten_random_words(d):
    keys = random.sample(list(d), 10)
    for k in keys:
        print("\033[1m" + k + "\033[0m") # Bold 
        print(d[k])
        print("\n\n\n")

In [14]:
ten_random_words(cutoff_dict)

[1mABLER[0m
A"bler, a.,
Defn: comp. of Able.
 -- A"blest, a.,
Defn: superl. of Able.
ABLET; ABLEN
Ab"let, Ab"len Etym: [F. ablet, ablette, a dim. fr. LL. abula, for
albula, dim. of albus white. Cf. Abele.] (Zoöl.)
Defn: A small fresh-water fish (Leuciscus alburnus); the bleak.




[1mTYPHOEAN[0m
Ty*pho"ë*an, a. Etym: [L. Typhoius, from Typhoeus, Gr.
Defn: Of or pertaining to Typhoeus (ti*fo"us), the fabled giant of
Greek mythology, having a hundred heads; resembling Typhoeus.
Note: Sometimes incorrectly written and pronounced Ty-phoe''an (, or
Ty-phe'' an.




[1mCINCHONIZE[0m
Cin"cho*nize, v. t.
Defn: To produce cinchonism in; to poison with quinine or with
cinchona.
CINCINNATI EPOCH
Cin`cin*na"ti ep"och. (Geol.)
Defn: An epoch at the close of the American lower Silurian system.
The rocks are well developed near Cincinnati, Ohio. The group
includes the Hudson River and Lorraine shales of New york.




[1mVOODOOISM[0m
Voo"doo*ism, n. Etym: [Probably (through Creole French vaudo

:( I've never heard of most of these and don't expect them to be useful. 
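
The sample output also shows a side effect of the parser: headwords such as `ABLET; ABLEN` and `CINCINNATI EPOCH` are folded into the preceding entry, because `is_word` only accepts lines made purely of A-Z. This inflates definition lengths, so some short entries pass the cutoff. A possible looser test is sketched below. The pattern is my assumption about what headword lines look like (all caps, optionally with spaces, hyphens, apostrophes, and `; ` between spelling variants); it has not been checked against the full file, and any all-caps line inside a definition would be misread as a headword.

```python
import re

# Assumption: a headword line is all caps and may also contain spaces,
# hyphens, apostrophes, and "; " between spelling variants.
HEADWORD_RE = re.compile(r"[A-Z][A-Z '\-]*(?:; [A-Z][A-Z '\-]*)*")

def is_headword(line):
    """True if the whole line matches the assumed headword pattern."""
    return HEADWORD_RE.fullmatch(line) is not None

def headword_variants(line):
    """Split 'ABLET; ABLEN' into ['ABLET', 'ABLEN']."""
    return line.split("; ")
```

Swapping this in for `is_word` would change the counts reported above, so the numbers in this post still refer to the original A-Z test.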

#### Method 2: Frequency 

We use a list of the [1/3 million most frequent words](https://norvig.com/ngrams/count_1w.txt), with counts derived from [Google's n-gram corpus](https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html):

> We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

In [15]:
n_common = 10_000 # Keep it to 10_000 most common words

In [16]:
with open("ngram_common.txt", "r") as f:
    rows = f.read().split("\n")[:-1]  # Last line is empty
    # Each row is "word<TAB>count", most frequent first; keep only the word
    ranked = [row.split("\t")[0] for row in rows]
    # Skip the 1,000 most common words; a set gives constant-time lookup
    most_common = set(ranked[1000:n_common])

In [17]:
len(most_common)  # Excluding the 1000 most common words

9000

In [18]:
def is_common(w):
    # most_common is already lowercase
    return w.lower() in most_common

In [19]:
frequent_dict = {k: v for k, v in definitions.items() if is_common(k)}

In [20]:
len(frequent_dict)

4763
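
The filter boils down to "keep headwords whose lowercase form falls in a band of frequency ranks". A toy sketch of that idea, with a made-up ranking and a hypothetical `rank_band` helper (neither comes from the n-gram file):

```python
def rank_band(ranked_words, start, stop):
    """Words ranked start..stop-1 (0-based), as a set for O(1) membership tests."""
    return set(ranked_words[start:stop])

# Made-up ranking, most frequent first
ranked = ["the", "of", "and", "speech", "cooler", "typhoean"]
band = rank_band(ranked, 3, 5)  # skip the top 3, stop before rank 5

# Toy stand-in for `definitions`, keyed by uppercase headword
toy_defs = {"THE": "...", "SPEECH": "...", "COOLER": "...", "TYPHOEAN": "..."}
kept = {k: v for k, v in toy_defs.items() if k.lower() in band}
```

Here `THE` is dropped for being too common and `TYPHOEAN` for being too rare, which mirrors the `[1000:n_common]` slice above.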

In [21]:
ten_random_words(frequent_dict)

[1mCOOLER[0m
Cool"er, n.
Defn: That which cools, or abates heat or excitement.
if acid things were used only as coolers, they would not be so proper
in this case. Arbuthnot.
2. Anything in or by which liquids or other things are cooled, as an
ice chest, a vessel for ice water, etc.
COOL-HEADED
Cool"-head`ed, a.
Defn: Having a temper not easily excited; free from passion.
 -- Cool"-head`ed*ness, n.




[1mSPEECH[0m
Speech, n. Etym: [OE. speche, AS. sp, spr, fr. specan, sprecan, to
speak; akin to D. spraak speech, OHG. sprahha, G. sprache, Sw. spr,
Dan. sprog. See Speak.]
1. The faculty of uttering articulate sounds or words; the faculty of
expressing thoughts by words or articulate sounds; the power of
speaking.
There is none comparable to the variety of instructive expressions by
speech, wherewith man alone is endowed for the communication of his
thoughts. Holder.
2. he act of speaking; that which is spoken; words, as expressing
ideas; language; conversation.
Note: Speech is voice 

In [22]:
len(frequent_dict) // 365

13

That's much better! At one word a day, it will take about *13 years* to get through the list. I've rewritten this in `words.py`, which also saves the result to a CSV.
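
`words.py` is not shown here, so the following is only a sketch of what the CSV step and a word-of-the-day lookup might look like. The function names, the column names, and the idea of cycling through the sorted list from a fixed start date are all my assumptions, not the script's actual behaviour.

```python
import csv
import datetime
import io

def write_words_csv(defs, fileobj):
    """Write one (word, definition) row per entry, sorted by word."""
    writer = csv.writer(fileobj)
    writer.writerow(["word", "definition"])  # hypothetical column names
    for word in sorted(defs):
        writer.writerow([word, defs[word]])

def word_for_day(words, day, start):
    """Pick the word for `day`, cycling through `words` from `start`."""
    offset = (day - start).days
    return words[offset % len(words)]

# Toy stand-in for `frequent_dict`
toy_defs = {
    "SPEECH": "Speech, n.\nThe faculty of speaking.",
    "COOLER": 'Cool"er, n.',
}

# newline="" lets the csv module control line endings, as its docs recommend
buf = io.StringIO(newline="")
write_words_csv(toy_defs, buf)
rows = list(csv.reader(io.StringIO(buf.getvalue(), newline="")))

today_word = word_for_day(
    sorted(toy_defs), datetime.date(2024, 1, 2), datetime.date(2024, 1, 1)
)
```

The `csv` module quotes fields that contain newlines or double quotes, so multi-line definitions and pronunciation marks like `Cool"er` survive the round trip. When writing to a real file, opening it with `open(path, "w", newline="")` follows the same convention.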