# Creating a hyperdictionary

With no constraints, the most basic way to solve the letter-prediction problem would be to keep a dictionary of words and look words up in it. Here I attempt to store such a dictionary in a hypervector, creating a hyperdictionary.

Hypervectors behave much like hashes: the vector for a word bears no relationship to the vectors for its subwords. So, to store a dictionary in a hypervector, you need to store each word along with all of its leading substrings.
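To make the hash analogy concrete, here is a minimal sketch (Python 3) of how a string and its one-letter extension end up with nearly unrelated vectors. It assumes the same roll-and-multiply binding used in the cells further down; `alphabet` is a stand-in for `random_idx.alphabet`, `encode` is a hypothetical helper, and `N` is kept small so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
# Assumed stand-in for random_idx.alphabet (26 letters plus space).
alphabet = "abcdefghijklmnopqrstuvwxyz "
# Random bipolar (+1/-1) vector for each letter.
letter_vectors = 2 * (rng.standard_normal((len(alphabet), N)) > 0) - 1

def encode(s):
    """Bind the letters of s into one vector by roll-and-multiply (hypothetical helper)."""
    vec = np.ones(N)
    for ch in s:
        vec = np.roll(vec, 1) * letter_vectors[alphabet.find(ch)]
    return vec

# Normalized similarity of a string with itself, and with its extension.
sim_same = np.dot(encode("appl"), encode("appl")) / N
sim_prefix = np.dot(encode("appl"), encode("apple")) / N
print(sim_same, sim_prefix)
```

The self-similarity is exactly 1, while the similarity between "appl" and "apple" should sit near 0, which is why every prefix has to be stored explicitly.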

Essentially, I am encoding in the hypervector an algorithm that does a tree search through a dictionary. I want to start typing letters and have the hyperdictionary list the possible next letters, given the words that are stored. This means storing not only every word, but the entire tree of substrings that make up each word.
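As a rough illustration of that lookup, the sketch below (Python 3) stores the prefix tree of a few words and then asks which letters can follow a typed prefix. It is only a toy under stated assumptions: `alphabet` stands in for `random_idx.alphabet`, the helper names `extend`, `build`, and `next_letters` are hypothetical, and the 0.4 threshold simply mirrors the one used in the cells below.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10000
# Assumed stand-in for random_idx.alphabet (26 letters plus space).
alphabet = "abcdefghijklmnopqrstuvwxyz "
letter_vectors = 2 * (rng.standard_normal((len(alphabet), N)) > 0) - 1

def extend(vec, ch):
    """Append one letter to an encoded prefix via roll-and-multiply."""
    return np.roll(vec, 1) * letter_vectors[alphabet.find(ch)]

def build(words, threshold=0.4):
    """Add every prefix of every word (plus a trailing space) if not already stored."""
    hyperdict = np.zeros(N)
    for word in words:
        vec = np.ones(N)
        for ch in word + " ":
            vec = extend(vec, ch)
            if np.dot(vec, hyperdict) / N < threshold:
                hyperdict += vec
    return hyperdict

def next_letters(hyperdict, prefix, threshold=0.4):
    """List the letters whose extension of prefix appears to be stored."""
    vec = np.ones(N)
    for ch in prefix:
        vec = extend(vec, ch)
    return [ch for ch in alphabet
            if np.dot(extend(vec, ch), hyperdict) / N > threshold]

hyperdict = build(["apple", "apply", "ape", "bet"])
candidates = next_letters(hyperdict, "appl")
print(candidates)
```

With only a handful of stored prefixes the crosstalk is small, so the query for "appl" should come back with just "e" and "y"; a returned space would signal that the prefix is itself a complete word.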


In [14]:

import random_idx
import utils
import pickle

import string
from pylab import *


%matplotlib inline


## Building the hyper dictionary

I found a text file on the internet that contains a list of common English words. My goal is to put this dictionary into a hypervector and then see whether a standard word-based algorithm can predict the next letter.

In [2]:
fdict = open("2of12id.txt")
word_list = []

In [3]:
for line in fdict:
    words = line.split()
    
    # take out the noun/verb/adjective
    words.pop(1)
    
    for word in words:
        if word.find('{') > 0:
            continue
            
        w = word.strip('()~-|{}!@/')
        
        if len(w) == 0:
            continue
                
        word_list.append(w)

In [4]:
print len(word_list)

100060


So, we now have a dictionary of over 100,000 words. I am going to go through each word, substring by substring, and add each substring to the hypervector. This means far more than 100k elements need to be stored, because I am essentially trying to store the entire tree. Since there are so many words, I am going to use an even larger hypervector. There are limits on how much information a hypervector can store, and there is already some literature on this.

I really want the hypervector to work just like a word dictionary, so I only add a substring if it is not already present.
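To get a feel for the capacity issue, here is a small sketch (Python 3) that measures the crosstalk seen by the presence check. It assumes the stored items behave like independent random bipolar vectors, which the rolled-and-multiplied substring vectors only approximate. Under that assumption the normalized dot product of an unstored probe with a memory holding `M` items has standard deviation of roughly `sqrt(M / N)`, while a stored item reads back near 1. The names `memory`, `probes`, and `predicted_std` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10000   # hypervector dimension (small, for speed)
M = 500     # number of stored items

# M independent random bipolar vectors, summed into one memory vector.
stored = 2 * (rng.standard_normal((M, N)) > 0) - 1
memory = stored.sum(axis=0)

# Probes that were never stored: their readout is pure crosstalk noise.
probes = 2 * (rng.standard_normal((200, N)) > 0) - 1
noise = probes @ memory / N

# Items that were stored: readout is about 1 plus the same kind of noise.
signal = stored[:200] @ memory / N

predicted_std = np.sqrt(M / N)
print(noise.std(), predicted_std, signal.mean())
```

By the same approximation, N = 1,000,000 with 100,000 stored items would give a noise level of about sqrt(0.1) ≈ 0.32, uncomfortably close to a 0.4 threshold, which is one reason to subsample the word list or grow the vector.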

In [5]:
N=1000000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1
print letter_vectors

[[-1  1 -1 ...,  1 -1  1]
 [-1 -1  1 ..., -1 -1  1]
 [-1 -1 -1 ..., -1 -1  1]
 ..., 
 [-1 -1  1 ..., -1  1 -1]
 [ 1 -1  1 ..., -1  1 -1]
 [ 1 -1  1 ...,  1  1  1]]


In [None]:
hyperdictionary = np.zeros(N)
count = 0
vals = []
subwords = []
skip = 20

for word in word_list[0::skip]:
#for word in ['accelerate','aardvark', 'accordion', 'accordionists',  'apple', 'betazoid', 'betakeratine']:
#for word in ['a', 'b', 'c', 'd','e', 'f']:
    #print ""
    print word,
    subword = ''
    subvec = np.ones(N)
    for i,letter in enumerate(word):
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
        subword += letter
        
        # check to see if the subvec is already present in the hyperdictionary
        val = np.dot(subvec.T, hyperdictionary) / N
        
        # If the substring is not present, then val should be near 0
        if val < 0.4:
            # then add the substring
            hyperdictionary += subvec
            count += 1
            #print subword, 
    
    letter_idx = random_idx.alphabet.find(' ')
    subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
    # check to see if the subvec is already present in the hyperdictionary
    val = np.dot(subvec.T, hyperdictionary) / N
        
    # If the substring is not present, then val should be near 0
    if val < 0.4:
        # then add the substring
        hyperdictionary += subvec
        count += 1
    

In [None]:
print count

In [None]:
random_idx.alphabet

In [None]:
np.savez('data/hyperdictionary_external-s20-d1M-160223.npz', hyperdictionary=hyperdictionary, letter_vectors=letter_vectors)

In [21]:
fdict = open("raw_texts/texts_english/alice_in_wonderland.txt")
text = fdict.read().lower()

punct = string.punctuation + string.digits

for i in punct:
    if i == '-':
        text = text.replace(i, ' ')
    else:
        text = text.replace(i, '')
    
text = text.replace('\n', ' ')
text = text.replace('\r','')
text = text.replace('\t','')

    
word_list = set(text.split()[1:]);
len(word_list)

3060

In [None]:
hyperdictionary = np.zeros(N)
count = 0
vals = []
subwords = []
skip = 20

for word in word_list:
#for word in ['accelerate','aardvark', 'accordion', 'accordionists',  'apple', 'betazoid', 'betakeratine']:
#for word in ['a', 'b', 'c', 'd','e', 'f']:
    #print ""
    print word,
    subword = ''
    subvec = np.ones(N)
    for i,letter in enumerate(word):
        letter_idx = random_idx.alphabet.find(letter)
        subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
        subword += letter
        
        # check to see if the subvec is already present in the hyperdictionary
        val = np.dot(subvec.T, hyperdictionary) / N
        
        # If the substring is not present, then val should be near 0
        if val < 0.4:
            # then add the substring
            hyperdictionary += subvec
            count += 1
            #print subword, 
    
    letter_idx = random_idx.alphabet.find(' ')
    subvec = np.roll(subvec, 1) * letter_vectors[letter_idx,:]
    # check to see if the subvec is already present in the hyperdictionary
    val = np.dot(subvec.T, hyperdictionary) / N
        
    # If the substring is not present, then val should be near 0
    if val < 0.4:
        # then add the substring
        hyperdictionary += subvec
        count += 1

In [None]:
print count

In [None]:
np.savez('data/hyperdictionary_alice-d1M-160223.npz', hyperdictionary=hyperdictionary, letter_vectors=letter_vectors)


## N-gram statistics

Next, I make a hypervector that keeps statistics on the 2-grams of letters in the text (including spaces).
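The source of `random_idx.generate_RI_text_fast` is not shown here, so the following is only a sketch (Python 3) of one way such an n-gram vector could be accumulated, assuming the same roll-and-multiply binding as the hyperdictionary. The helpers `ngram_vector` and `text_vector` are hypothetical, and `alphabet` is a stand-in for `random_idx.alphabet`.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10000
# Assumed stand-in for random_idx.alphabet (26 letters plus space).
alphabet = "abcdefghijklmnopqrstuvwxyz "
letter_vectors = 2 * (rng.standard_normal((len(alphabet), N)) > 0) - 1

def ngram_vector(gram):
    """Bind the letters of one n-gram by roll-and-multiply."""
    vec = np.ones(N)
    for ch in gram:
        vec = np.roll(vec, 1) * letter_vectors[alphabet.find(ch)]
    return vec

def text_vector(text, n):
    """Sum the vectors of every overlapping n-gram in text."""
    total = np.zeros(N)
    for i in range(len(text) - n + 1):
        total += ngram_vector(text[i:i + n])
    return total

sample = "the cat sat on the mat"
bigrams = text_vector(sample, 2)

# Normalized dot products approximate how often each 2-gram occurred.
count_th = np.dot(ngram_vector("th"), bigrams) / N
count_zz = np.dot(ngram_vector("zz"), bigrams) / N
print(count_th, count_zz)
```

Because every occurrence of a 2-gram adds the same vector again, the readout for "th" should land near 2 (it occurs twice in the sample) and the readout for an absent pair such as "zz" near 0.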




In [33]:
reload(random_idx)

<module 'random_idx' from 'random_idx.py'>

In [29]:
# generate text vector based on each pair of characters
text_name="preprocessed_texts/AliceInWonderland.txt"

N=10000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1

alice_text_vector2 = random_idx.generate_RI_text_fast(N, letter_vectors, 2, 0, text)

0 1000 2000 ... 154000 155000


In [30]:
alice_text_vector2.shape

(1, 10000)

In [31]:
np.savez('data/alice-2gram-space-d10K-160223.npz', hyperdictionary=alice_text_vector2, letter_vectors=letter_vectors)

In [34]:

N=50000
letter_vectors = 2 * (np.random.randn(len(random_idx.alphabet), N) > 0) - 1

alice_text_vector3 = random_idx.generate_RI_text_fast(N, letter_vectors, 3, 0, text)

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000 26000 27000 28000 29000 30000 31000 32000 33000 34000 35000 36000 37000 38000 39000 40000 41000 42000 43000 44000 45000 46000 47000 48000 49000 50000 51000 52000 53000 54000 55000 56000 57000 58000 59000 60000 61000 62000 63000 64000 65000 66000 67000 68000 69000 70000 71000 72000 73000 74000 75000 76000 77000 78000 79000 80000 81000 82000 83000 84000 85000 86000 87000 88000 89000 90000 91000 92000 93000 94000 95000 96000 97000 98000 99000 100000 101000 102000 103000 104000 105000 106000 107000 108000 109000 110000 111000 112000 113000 114000 115000 116000 117000 118000 119000 120000 121000 122000 123000 124000 125000 126000 127000 128000 129000 130000 131000 132000 133000 134000 135000 136000 137000 138000 139000 140000 141000 142000 143000 144000 145000 146000 147000 148000 149000 150000 151000 152000 153000 154000 155000


In [39]:
np.savez('data/alice-3gram-space-d50K-160223.npz', hyperdictionary=alice_text_vector3, letter_vectors=letter_vectors)

In [38]:
letter_vectors.shape

(27, 50000)