<a href="https://colab.research.google.com/github/samyxandz/Ml-playgound/blob/main/Auto_Complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Models: Auto-Complete


This prototype uses N-grams, a simple but powerful method for language modeling.

#### Roadmap

1.   Load and preprocess data

*   Load and tokenize data.
*   Split the sentences into train and test sets.
*   Replace low-frequency words with an unknown marker `<unk>`.


2.   Develop N-gram based language models


*   Compute the count of n-grams from a given data set.
*   Estimate the conditional probability of a next word with k-smoothing.

3. Evaluate the N-gram models by computing the perplexity score.

4. Use your own model to suggest an upcoming word given your sentence.



In [None]:
import math
import random
import numpy as np
import pandas as pd
import nltk
nltk.data.path.append('.')
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Load the data


In [None]:
with open("en_US.twitter.txt", "r") as f:
    data = f.read()

display(data[0:300])

"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A "

## Pre-process the data

Preprocess this data with the following steps:

1. Split data into sentences using "\n" as the delimiter.
2. Split each sentence into tokens.
3. Assign sentences into train or test sets.
4. Find tokens that appear at least N times in the training data.
5. Replace tokens that appear fewer than N times with `<unk>`.

Split data into sentences

    Split data by linebreak "\n"
    Args:
        data: str
    Returns:
        A list of sentences


In [None]:
def split_to_sentences(data):

    sentences = data.split('\n')

    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    sentences = [sentence.strip() for sentence in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences

In [None]:
# test your code
x = """I have a pen.\nI have an apple. \nAh\nApple pen.\n"""
print(x)

split_to_sentences(x)

I have a pen.
I have an apple. 
Ah
Apple pen.



['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']

Tokenize sentences (split a sentence into a list of words).

* Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.
* Append each tokenized list of words into a list of tokenized sentences


    Tokenize sentences into tokens (words)
    
    Args:
        sentences: List of strings
    
    Returns:
        List of lists of tokens

In [None]:
def tokenize_sentences(sentences):

    tokenized_sentences = []


    for sentence in sentences:

        # Convert to lowercase letters
        sentence = sentence.lower()

        # Convert into a list of words
        tokenized = nltk.word_tokenize(sentence)

        # append the list of words to the list of lists
        tokenized_sentences.append(tokenized)

    return tokenized_sentences


In [None]:
sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
tokenize_sentences(sentences)

[['sky', 'is', 'blue', '.'],
 ['leaves', 'are', 'green', '.'],
 ['roses', 'are', 'red', '.']]

#### Split into train and test sets

In [None]:
# Split the raw text into sentences, then tokenize each sentence
tokenized_data = tokenize_sentences(split_to_sentences(data))
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]
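
The shuffle-and-slice split can be checked on a small input. This is a minimal sketch on toy data; the names `toy_data`, `toy_train` and `toy_test` are assumptions made for illustration and are not part of the notebook.

```python
import random

# Hypothetical toy corpus: ten one-word "sentences"
toy_data = [[f"word{i}"] for i in range(10)]

random.seed(87)           # fixed seed so the shuffle is reproducible
random.shuffle(toy_data)  # shuffles the list in place

train_size = int(len(toy_data) * 0.8)  # first 80% of the shuffled sentences
toy_train = toy_data[:train_size]
toy_test = toy_data[train_size:]

print(len(toy_train), len(toy_test))  # 8 2
```

Shuffling before slicing keeps the train and test sets from simply mirroring the order of the lines in the source file.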

### Making the dictionary

* Focus on the words that appear at least N times in the data.



In [None]:
def count_words(tokenized_sentences):

    word_counts = {}

    for sentence in tokenized_sentences:

        for token in sentence:
            word_counts[token] = word_counts.get(token, 0) + 1

    return word_counts

In [None]:

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
count_words(tokenized_sentences)
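
For reference, the same counts can be reproduced with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's own implementation; `count_words` builds the equivalent dictionary with an explicit loop.

```python
from collections import Counter

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]

# Flatten the list of sentences and count every token
word_counts = Counter(token for sentence in tokenized_sentences for token in sentence)

print(dict(word_counts))
# {'sky': 1, 'is': 1, 'blue': 1, '.': 3, 'leaves': 1, 'are': 2, 'green': 1, 'roses': 1, 'red': 1}
```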

Handling 'Out of Vocabulary' words

To handle unknown words during prediction, use a special token `<unk>` to represent all unknown words.

* Modify the training data so that it has some 'unknown' words to train on.
* Words to convert into "unknown" words are those that do not occur very frequently in the training set.
* Create a list of the most frequent words in the training set, called the closed vocabulary.
* Convert all the other words that are not part of the closed vocabulary to the token `<unk>`.

In [None]:
def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):

    closed_vocab = []
    word_counts = count_words(tokenized_sentences)
    # for each word and its count
    for word, count in word_counts.items():

        # check that the word's count
        # is at least as great as the minimum count
        if count >= count_threshold:
            # append the word to the list
            closed_vocab.append(word)


    return closed_vocab
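
A small example helps show what the threshold does. The sketch below uses a hypothetical compact helper, `closed_vocab_sketch`, that is assumed to behave like `get_words_with_nplus_frequency`; it is redefined here only so the example runs on its own.

```python
from collections import Counter

def closed_vocab_sketch(tokenized_sentences, count_threshold):
    """Hypothetical compact equivalent of get_words_with_nplus_frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Keep only the words seen at least count_threshold times
    return [word for word, count in counts.items() if count >= count_threshold]

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]

print(closed_vocab_sketch(tokenized_sentences, count_threshold=2))
# ['.', 'are']
```

With a threshold of 2, only `.` (3 occurrences) and `are` (2 occurrences) survive; every other word occurs once and falls outside the vocabulary.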

The words that appear `count_threshold` times or more are in the closed vocabulary.

* All other words are regarded as unknown.
* Replace words not in the closed vocabulary with the token `<unk>`.

In [None]:
def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token='<unk>'):

    vocabulary = set(vocabulary)

    # Initialize a list that will hold the sentences
    # after less frequent words are replaced by the unknown token
    replaced_tokenized_sentences = []


    for sentence in tokenized_sentences:

        # Initialize the list that will contain
        # a single sentence with "unknown_token" replacements
        replaced_sentence = []
        # for each token in the sentence
        for token in sentence:
            replaced_sentence.append(token if token in vocabulary else unknown_token)

        replaced_tokenized_sentences.append(replaced_sentence)

    return replaced_tokenized_sentences
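
To see the replacement in action, here is a minimal sketch on toy input. The helper `replace_oov_sketch` is a hypothetical compact version assumed to match `replace_oov_words_by_unk`, and the sentences and vocabulary are made up for illustration.

```python
def replace_oov_sketch(tokenized_sentences, vocabulary, unknown_token='<unk>'):
    """Hypothetical compact equivalent of replace_oov_words_by_unk."""
    vocabulary = set(vocabulary)  # set membership tests are O(1) on average
    return [[tok if tok in vocabulary else unknown_token for tok in sent]
            for sent in tokenized_sentences]

tokenized_sentences = [['dogs', 'run'], ['cats', 'sleep']]
vocabulary = ['dogs', 'sleep']

print(replace_oov_sketch(tokenized_sentences, vocabulary))
# [['dogs', '<unk>'], ['<unk>', 'sleep']]
```

Converting the vocabulary to a set matters on real data: a membership test against a list scans the whole list for every token.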

In [None]:
def preprocess_data(train_data, test_data, count_threshold):


    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)

    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary)

    return train_data_replaced, test_data_replaced, vocabulary

## Preprocess the train and test data

In [None]:
minimum_freq = 2
train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, test_data, minimum_freq)

## Developing the n-gram based language model

Assumptions

* Assume the probability of the next word depends only on the previous n-gram.
* The previous n-gram is the sequence of the previous $n$ words.


The conditional probability for the word at position $t$ in the sentence, given that the words preceding it are $w_{t-1}, w_{t-2}, \dots, w_{t-n}$, is:

$$P(w_t | w_{t-1}\dots w_{t-n})$$
  

The probability can be estimated as a ratio, where

- The numerator is the number of times word $w_t$ appears immediately after the words $w_{t-1}$ through $w_{t-n}$ in the training data.
- The denominator is the number of times the words $w_{t-1}$ through $w_{t-n}$ appear in the training data.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} $$


The function $C(\cdots)$ denotes the number of occurrences of the given sequence.
$\hat{P}$ denotes the estimate of $P$.

The equation shows that estimating probabilities based on n-grams requires the counts of n-grams (for the denominator) and of (n+1)-grams (for the numerator).
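
The ratio can be illustrated for the bigram case, where the context is a single word. This is a minimal sketch under stated assumptions: the toy corpus and the helper `estimate_probability_mle` are hypothetical, no smoothing is applied, and no start or end markers are added.

```python
from collections import Counter

# Hypothetical toy corpus (already tokenized)
corpus = [['i', 'like', 'a', 'cat'],
          ['i', 'like', 'a', 'dog']]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update((tok,) for tok in sentence)   # counts for the denominator
    bigram_counts.update(zip(sentence, sentence[1:]))   # counts for the numerator

def estimate_probability_mle(word, previous_word):
    """Unsmoothed estimate C(previous_word, word) / C(previous_word)."""
    denominator = unigram_counts[(previous_word,)]
    if denominator == 0:
        return 0.0  # unseen context; smoothing would handle this case
    return bigram_counts[(previous_word, word)] / denominator

print(estimate_probability_mle('cat', 'a'))   # 0.5
print(estimate_probability_mle('a', 'like'))  # 1.0
```

Here `a` occurs twice and is followed by `cat` once, giving 1/2, while `like` is always followed by `a`, giving 2/2.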

 ### Count The N-Grams

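The function below computes the counts of n-grams for an arbitrary number $n$.

Before counting, each sentence is padded with start markers `<s>` to indicate the beginning of the sentence and with one end marker `<e>` to indicate its end. Note that the implementation below prepends $n$ start markers rather than $n-1$; this is visible in the sample output, where the unigram `('<s>',)` and the bigram `('<s>', '<s>')` are both counted.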

In [2]:
def count_n_grams(data, n, start_token='<s>', end_token='<e>'):

    n_grams = {}

    for sentence in data:
        sentence = [start_token] * n + sentence + [end_token]

        sentence = tuple(sentence)

        for i in range(len(sentence) - n + 1):
            n_gram = sentence[i : i + n]
            n_grams[n_gram] = n_grams.get(n_gram, 0) + 1

    return n_grams

In [3]:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram:")
print(count_n_grams(sentences, 1))
print("Bi-gram:")
print(count_n_grams(sentences, 2))

Uni-gram:
{('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
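
As a worked example, the printed counts can be plugged into the unsmoothed estimate $\hat{P}$ defined earlier, taking the unigram counts as the denominator and the bigram counts as the numerator. The start marker is written $\langle s\rangle$ here:

$$ \hat{P}(\text{i} \mid \langle s\rangle) = \frac{C(\langle s\rangle, \text{i})}{C(\langle s\rangle)} = \frac{1}{2}, \qquad \hat{P}(\text{cat} \mid \text{a}) = \frac{C(\text{a}, \text{cat})}{C(\text{a})} = \frac{2}{2} = 1 $$

The unigram count of the start marker equals the number of sentences, so the first estimate reads as "one of the two sentences starts with *i*".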
