# Predictive text with concatenated word vectors

By [Allison Parrish](http://www.decontextualize.com/)

This notebook demonstrates one way to implement predictive text like [iOS QuickType](https://www.apple.com/sg/ios/whats-new/quicktype/). It works sort of like a [Markov chain text generator](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb), but uses nearest-neighbor lookups on a database of concatenated [word vectors](https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb) instead of n-grams of tokens. You can build this database with any text you want!

To get this code to work, you'll need to [install spaCy](https://spacy.io/usage/#section-quickstart) and download a [spaCy model with word vectors](https://spacy.io/usage/models#available) (like `en_core_web_lg`). You'll also need [Simple Neighbors](https://github.com/aparrish/simpleneighbors), a Python library I made for easy nearest-neighbor lookups:

In [None]:
!pip install simpleneighbors

## How it works

The goal of a predictive text interface is to look at what the user has typed so far and then suggest the word that is most likely to come next. The system in this notebook does this by looking at each sequence of `n` words in the source text, looking up the spaCy word vector for each of those words, and concatenating those vectors to create one long vector. It then stores that vector along with the word that *follows* the sequence.

To calculate suggestions for a particular text from this database, you can just look at the last `n` words in the text, concatenate the word vectors for that stretch of words, and then find the entries in the database whose vector is nearest. The words stored along with those sequences (i.e., the words that followed the original sequence) are the words the system suggests as most likely to come next.
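Here's a minimal sketch of that whole idea on a toy scale, in case it helps to see it before bringing in spaCy. The two-dimensional "word vectors" are made up for illustration, the function names (`concat`, `build_database`, `suggest`) are hypothetical, and a brute-force Euclidean search stands in for the nearest-neighbor index used in the rest of the notebook.

```python
import numpy as np

# Hypothetical 2D "word vectors", invented for this example only.
toy_vectors = {
    "the": np.array([0.0, 1.0]),
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "sat": np.array([0.5, 0.5]),
    "down": np.array([0.2, 0.8]),
}

def concat(words):
    # Line the vectors up end to end: n words -> one vector of length n*2.
    return np.concatenate([toy_vectors[w] for w in words])

def build_database(tokens, n):
    # Store (concatenated vector of n words, the word that follows them).
    return [(concat(tokens[i:i+n]), tokens[i+n])
            for i in range(len(tokens) - n)]

def suggest(database, context, k=1):
    # Brute-force nearest-neighbor search by Euclidean distance.
    query = concat(context)
    ranked = sorted(database,
                    key=lambda entry: np.linalg.norm(entry[0] - query))
    return [word for _, word in ranked[:k]]

tokens = "the cat sat down".split()
database = build_database(tokens, n=2)

# "the dog" never appears in the toy text, but its vector is close to
# that of "the cat", so the word that followed "the cat" is suggested.
result = suggest(database, ["the", "dog"])
print(result)
```

Because "dog" and "cat" have similar vectors here, the unseen phrase "the dog" still lands next to "the cat" in the database.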

So let's implement it! First, we'll import the libraries we need:

In [None]:
from simpleneighbors import SimpleNeighbors

In [2]:
import spacy

And then load the spaCy model. (This will take a few seconds.)

In [3]:
nlp = spacy.load('en_core_web_lg')

You'll need to have some text to use to build the database. If you're following along, download a [plain text file from Project Gutenberg](https://www.gutenberg.org/) to the same directory as this notebook and put its filename below.

In [5]:
filename = "1342-0.txt"

When you're parsing a text with spaCy, it can use up a lot of memory and either throw "out of memory" errors or cause your computer to slow down as it swaps memory to disk. To ameliorate this, we're only going to train on the first 500k characters of the text. You can change the number in the cell below if you want even fewer characters (or more).

In [34]:
cutoff = 500000
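A hard character cutoff can slice the final word of the text in half. That's harmless for this notebook, but if you'd rather end on a word boundary, something like the sketch below would work. The helper name `truncate_at_whitespace` is hypothetical and isn't used elsewhere in this notebook.

```python
def truncate_at_whitespace(text, cutoff):
    # Hypothetical helper: return at most `cutoff` characters, backing up
    # to the last space or newline so that no word is cut in half.
    if len(text) <= cutoff:
        return text
    clipped = text[:cutoff]
    last_space = max(clipped.rfind(" "), clipped.rfind("\n"))
    # If there's no whitespace at all, fall back to the hard cutoff.
    return clipped[:last_space] if last_space > 0 else clipped

sample = "It is a truth universally acknowledged"
shortened = truncate_at_whitespace(sample, 12)
print(repr(shortened))
```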

The code in the cell below parses your text file into sentences (this might take a few seconds):

In [4]:
# read the file, keeping only the first `cutoff` characters
with open(filename) as f:
    text = f.read()[:cutoff]
doc = nlp(text, disable=['tagger'])

The `concatenate_vectors` function below takes a sequence of spaCy tokens (like those that you get when you parse a text) and returns the *concatenated* word vectors of those tokens. "Concatenating" vectors means to make one big vector from several smaller vectors simply by lining them all up. For example, if you had three 2D vectors `a`, `b`, and `c`:

    a = (1, 2)
    b = (5, 6)
    c = (11, 12)
    
The concatenation of these vectors would be this six-dimensional vector:

    (1, 2, 5, 6, 11, 12)
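You can check the three-vector example above directly with NumPy. This is only a quick illustration using the same numbers:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([5, 6])
c = np.array([11, 12])

# Line the three 2D vectors up end to end.
combined = np.concatenate([a, b, c])
print(combined.tolist())
print(combined.shape)
```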

In [35]:
import numpy as np

def concatenate_vectors(seq):
    # line up each token's vector end to end to make one long vector
    return np.concatenate([w.vector for w in seq])

concatenate_vectors(nlp("hello there")).shape

(600,)

Using vectors instead of tokens is a simple way to predict the next word even for sequences that aren't found in the source text, since a sequence with a nearby vector can stand in for an exact match. Concatenating the vectors (rather than, say, averaging them) makes it possible to find entries that have both similar meanings and similar word orders, which is important when predicting the next word in a text.
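To see why concatenation preserves word order, compare it with averaging on the same words in two different orders. The vectors below are made up for illustration, and the helper names are hypothetical:

```python
import numpy as np

# Hypothetical 2D vectors, invented for this example only.
dog = np.array([1.0, 0.0])
bites = np.array([0.0, 1.0])
man = np.array([1.0, 1.0])

def concatenated(vectors):
    return np.concatenate(vectors)

def averaged(vectors):
    return np.mean(vectors, axis=0)

forward = [dog, bites, man]    # "dog bites man"
backward = [man, bites, dog]   # "man bites dog"

# Averaging throws the order away; concatenating keeps it.
same_avg = np.allclose(averaged(forward), averaged(backward))
same_concat = np.allclose(concatenated(forward), concatenated(backward))
print(same_avg, same_concat)
```

The averaged vectors are identical for both orders, while the concatenated vectors differ, so only the concatenated version can tell "dog bites man" from "man bites dog".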

The code in the cell below builds the nearest neighbor index that maps the concatenated vectors for each sequence of words in the source text to the word that follows. You can adjust `n` to change the length of the sequence considered. (In my experiments, values from 2–4 usually work best.)

In [42]:
n = 3
# each word vector in en_core_web_lg has 300 dimensions
nns = SimpleNeighbors(n*300)
for seq in doc.sents:
    # keep only alphabetic tokens (drops punctuation and numbers)
    seq = [item for item in seq if item.is_alpha]
    for i in range(len(seq)-n):
        vec = concatenate_vectors(seq[i:i+n])
        next_item = seq[i+n].text
        nns.add_one(next_item, vec)
nns.build()
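If the indexing in that loop is hard to follow, here's the same sliding window applied to plain strings instead of spaCy tokens. The function name `context_pairs` is hypothetical, and the sketch leaves out the vectors entirely so you can see just which words get paired up:

```python
def context_pairs(words, n):
    # Yield (n-word context, following word) for every position
    # that still has a following word.
    for i in range(len(words) - n):
        yield words[i:i+n], words[i+n]

words = "it is a truth universally acknowledged".split()
pairs = list(context_pairs(words, 3))
for context, following in pairs:
    print(context, "->", following)
```

Note that a sentence needs at least `n + 1` words to contribute anything: with fewer, the range is empty and the loop body never runs.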

Once the index is built, you can test it out! Plug a three-word phrase (or an `n`-word phrase, if you changed `n` above) into the `start` variable below and run the cell. You'll see the twelve most likely words to come next, as suggested by the nearest neighbor lookup.

In [44]:
start = "I have never"
nns.nearest(concatenate_vectors(nlp(start)))

['acknowledged',
 'liked',
 'been',
 'seen',
 'heard',
 'desired',
 'known',
 'read',
 'met',
 'seen',
 'supposed',
 'observed']
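You might notice that the same word can show up more than once in the results (like `seen` above). That's because every occurrence of a sequence in the source text is stored as its own entry, so two nearby entries can be followed by the same word. If you want each suggestion only once, a small order-preserving filter does the job. The function name here is hypothetical:

```python
def unique_in_order(words):
    # dict keys keep insertion order (Python 3.7+), so this keeps the
    # first occurrence of each word and drops later repeats.
    return list(dict.fromkeys(words))

suggestions = ['acknowledged', 'liked', 'been', 'seen', 'heard', 'desired',
               'known', 'read', 'met', 'seen', 'supposed', 'observed']
deduped = unique_in_order(suggestions)
print(deduped)
```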

## Interactive web version

The code below starts a [Flask](http://flask.pocoo.org/) web server on your computer to serve up an interactive version of the suggestion code. Run the cell and click on the link that appears below. If you make changes, make sure to interrupt the kernel before re-running the cell. You can interrupt the kernel either via the menu bar (`Kernel > Interrupt`) or by hitting Escape and typing `i` twice.

In [1]:
import poetryutils2

ModuleNotFoundError: No module named 'anydbm'