<a href="https://colab.research.google.com/github/thedatadj/natural-language-processing/blob/main/autocomplete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Autocomplete model that given the initial words of a sentence it suggest the rest of the sentence.

# Load data
The data consist of a set of tweet.

In [1]:
# Download file
!gdown 13Qzds0_08hG0JNppbro1PElQzmi0hJ96

Downloading...
From: https://drive.google.com/uc?id=13Qzds0_08hG0JNppbro1PElQzmi0hJ96
To: /content/en_US.twitter.txt
  0% 0.00/3.39M [00:00<?, ?B/s]100% 3.39M/3.39M [00:00<00:00, 45.2MB/s]


In [4]:
# Read data file
with open("/content/en_US.twitter.txt", "r") as file:
    data = file.read()
data[:200]

"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll"

Each tweet is separated by the special character `\n`.

# Preprocess data

`data` is a continuous string, I split this string into individual sentences/tweets.

In [6]:
data0 = [tweet for tweet in data.split("\n")]
data0[:2]

['How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.',
 "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."]

There could be sentences/tweets that start or end with a space. I remove these spaces.

In [13]:
data1 = [tweet.strip() for tweet in data0]

There could be tweets that are empty string, these don't add information to the model so I eliminate them.

In [14]:
data2 = [tweet for tweet in data1 if len(tweet) > 0]

Normalize each sentence.

In [29]:
data3 = [tweet.lower() for tweet in data2]
data3[0]

'how are you? btw thanks for the rt. you gonna be in dc anytime soon? love to see you. been way, way too long.'

Now, I tokenize each sentence in the dataset.

In [30]:
# Tool for tokenization
import re

In [32]:
data4 = []
for tweet in data3:
    tweet_tk = re.findall(r"\w+|\.|\,", tweet)
    data4.append(tweet_tk)
data4[0][:10]

['how', 'are', 'you', 'btw', 'thanks', 'for', 'the', 'rt', '.', 'you']

Split the data into training and testing sets.

In [33]:
# For randomized assigning
import random

In [34]:
random.seed(87)

# Shuffle the data
random.shuffle(data4)

Assign 80% of the data for training.

In [35]:
split_size = int(len(data4) * 0.8)
train_data = data4[:split_size]

Assign the rest 20% for testing.

In [36]:
test_data = data4[split_size:]

I create a dictionary of frequencies of words from the training data.

In [37]:
freq_train = {}
for sentence in train_data:
    for token in sentence:
        if token not in freq_train:
            freq_train[token] = 0
        freq_train[token] += 1

In [143]:
len(freq_train)

32537

In [38]:
# Frequency of "the" in training set
freq_train['the']

15314

There might be words in the testing set that are not in the training set. I consider low frequency words in the training set as out of vocabulary words.
* Create a vocabulary of frequent words.
* Considere all other words as unknown (oov)

In [156]:
# Minimum frequency
minfreq = 2

In [157]:
vocab = []
for word, freq in freq_train.items():
    if freq >= minfreq:
        vocab.append(word)

In [158]:
len(vocab)

14585

In [44]:
vocab[:5]

['i', 'personally', 'would', 'like', 'as']

Replace low frequency words in `train_data` for the special token
* `<ukn>`

In [45]:
train_data_temp = []
for sentence in train_data:
    sentence_range = []
    for token in sentence:
        if token not in vocab:
            token = "<unk>"
        sentence_range.append(token)
    train_data_temp.append(sentence_range)
train_data1 = train_data_temp

In [50]:
train_data1[5][:10]

['baby', 'j', 'at', 'jordan', 'ford', '<unk>', '<unk>', '35', 'n', 'w']

Perform the previous operation on the test dataset.

In [51]:
test_data_temp = []
for sentence in test_data:
    sentence_range = []
    for token in sentence:
        if token not in vocab:
            token = "<unk>"
        sentence_range.append(token)
    test_data_temp.append(sentence_range)
test_data1 = test_data_temp

In [52]:
test_data1[5][:10]

['full', 'house', 'for', '<unk>', 'of', '<unk>']

# N-gram language model

Create a function that computes the counts of n-grams for an arbitraty number n.
* Output: dictionary.

In [58]:
sentence = ['I', "like", "rice"]
sentence = tuple(sentence)
for i in range(len(sentence)):
    ngram = sentence[i:i+2]
    print(ngram)

('I', 'like')
('like', 'rice')
('rice',)


In [64]:
def count_ngrams(data, n, start_token="<s>", end_token="<e>"):
    ngrams = {}
    for sentence in data:
        # Add special tokens
        sentence = [start_token] * n + sentence + [end_token]
        sentence = tuple(sentence)
        for i in range(len(sentence)-1):
            ngram = sentence[i:i+n]
            if ngram in ngrams:
                ngrams[ngram] += 1
            else:
                ngrams[ngram] = 1
    return ngrams

Function that estimates the probability of a word given a sequence sequence of n words.

In [116]:
def estimate_probability(word, previous_n_gram, n_gram_counts,
                         n_plus1_gram_counts, vocabulary_size, k=1.0):
    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
    denominator = previous_n_gram_count + k * vocabulary_size
    n_plus1_gram = previous_n_gram + (word,)
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)
    numerator = n_plus1_gram_count + k
    probability = numerator / denominator
    return probability

Estimate the probability for all words in training set.

In [67]:
def estimate_probabilities(previous_n_gram, n_gram_counts,
                           n_plus1_gram_counts, vocabulary,
                           end_token="<e>", unknown_token="<unk>",
                           k=1.0):
    previous_n_gram = tuple(previous_n_gram)
    # Add special characters to the vocabulary
    vocabulary = vocabulary + [end_token, unknown_token]
    vocabulary_size = len(vocabulary)

    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)
        probabilities[word] = probability
    return probabilities

Create a probability matrix

In [68]:
# Tools for creating the matrix
import numpy as np
import pandas as pd

In [70]:
def make_cmatrix(n_plus1_gram_counts, vocabulary):
    vocabulary = vocabulary + ["<e>", "<unk>"]
    ngrams = []
    for n_plus1_gram in n_plus1_gram_counts:
        ngram = n_plus1_gram[0:-1]
        ngrams.append(ngram)
    ngrams = list(set(ngrams))
    # N-gram to row
    row_index = {ngram:i for i, ngram in enumerate(ngrams)}
    # N-gram to column
    col_index = {word:j for j, word in enumerate(vocabulary)}

    nrow = len(ngrams)
    ncol = len(vocabulary)
    cmatrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        ngram = n_plus1_gram[0: -1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[ngram]
        j = col_index[word]
        cmatrix[i, j] = count

    cmatrix = pd.DataFrame(cmatrix, index=ngrams, columns=vocabulary)
    return cmatrix

In [71]:
def make_pmatrix(n_plus1_gram_counts, vocabulary, k):
    cmatrix = make_cmatrix(n_plus1_gram_counts, vocabulary)
    cmatrix = cmatrix + k
    pmatrix = cmatrix.div(cmatrix.sum(axis=1), axis=0)
    return pmatrix

Given a sequence of words, suggest the most likely next word.

In [99]:
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts,
                   vocabulary, end_token="<e>", unknown_token="<s>",
                   k=1.0, start_with=None):
    n = len(list(n_gram_counts.keys())[0])
    previous_tokens = ['<s>'] * n + previous_tokens
    previous_n_gram = previous_tokens[-n]
    probabilities = estimate_probabilities(previous_n_gram, n_gram_counts,
                                           n_plus1_gram_counts,
                                           vocabulary, k=k)
    suggestion = None
    max_prob = 0
    for word, prob in probabilities.items():
        if start_with:
            if word[0] != start_with[0]:
                continue
        if max_prob < prob:
            suggestion = word
            max_prob = prob
    return suggestion, max_prob

Suggest multiple next words.

In [100]:
def get_suggestions(previous_tokens, n_gram_counts_list,
                    vocabulary, k=1.0, start_with=None):
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts-1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i+1]

        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions

# Evaluation
Evaluate the model using perplexity.

In [97]:
def calculate_perplexity(sentence, n_gram_counts,
                         n_plus1_gram_counts, vocabulary_size,
                         start_token="<s>", end_token="<e>", k=1.0):
    n = len(list(n_gram_counts.keys())[0])
    sentence = [start_token] * n + sentence + [end_token]
    sentence = tuple(sentence)
    N = len(sentence)
    product_pi = 1.0
    for t in range(n, N):
        ngram = sentence[t-n:t]
        word = sentence[t]
        probability = estimate_probability(word, ngram, n_gram_counts,
                                           n_plus1_gram_counts,
                                           vocabulary_size, k)
        product_pi = product_pi * (1/probability)
    perplexity = product_pi**(1/len(sentence))
    return perplexity

# Demostration

In [132]:
n_gram_counts_list = []
for n in range(1, 6):
    n_model_counts = count_ngrams(train_data1, n)
    n_gram_counts_list.append(n_model_counts)

Write a sentence in lower case

In [136]:
len(vocab)

14585

In [137]:
sentence = "i love to watch"

In [138]:
previous_tokens = sentence.split()
tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocab, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)

The previous words are ['i', 'love', 'to', 'watch'], the suggestions are:


[('i', 6.855419208884623e-05),
 ('i', 6.855419208884623e-05),
 ('i', 6.855419208884623e-05),
 ('<e>', 0.0010968670734215397)]

<table>
    <tr>
        <td>
            Based on
        </td>
        <td>
            Assignments from the Natural Language Processing Specialization
        </td>
    </tr>
</table>