# Extract tokens and bigrams from a sentence

### Assumptions and terminology

We will assume that the text data consists of sentences with the punctuation removed. For a sentence on a single line, we will add a start-of-sentence token, `<s>`, and an end-of-sentence token, `</s>`. For example, the sentence "I love language models." will appear in code as:

`'I love language models'`

The tokens for this sentence are represented as an ordered list of the lower case words plus the start and end sentence tags:

`tokens = ['<s>', 'i', 'love', 'language', 'models', '</s>']`

The bigrams for this sentence are represented as a list of ordered pairs of adjacent lower-case tokens:

`bigrams = [('<s>', 'i'), ('i', 'love'), ('love', 'language'), ('language', 'models'), ('models', '</s>')]`
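The two lists are tightly related, and a quick check can make that concrete. The sketch below takes the example lists as literals and confirms two properties: a list of n tokens yields n - 1 bigrams, and each bigram shares one token with the next. The name `overlaps` is only illustrative.

```python
tokens = ['<s>', 'i', 'love', 'language', 'models', '</s>']
bigrams = [('<s>', 'i'), ('i', 'love'), ('love', 'language'),
           ('language', 'models'), ('models', '</s>')]

# A list of n tokens yields n - 1 bigrams
print(len(tokens), len(bigrams))

# The second token of each bigram is the first token of the next one
overlaps = all(left[1] == right[0] for left, right in zip(bigrams, bigrams[1:]))
print(overlaps)
```

Both properties are useful sanity checks for any function that produces these lists.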

### Quiz 1 Instructions

In the quiz below, write a function that returns a list of tokens and a list of bigrams for a given sentence. You will first need to break the sentence into a list of words, then add a `<s>` token to the start of the list and a `</s>` token to the end to mark the start and end of the sentence.

Your final lists should be in the format shown above and described in the function docstring.

In [1]:
test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

###### 1. First option: Without any library

In [1]:
def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    sentence_tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    sentence_bigrams = []
    for i in range(len(sentence_tokens)-1):
        sentence_bigrams.append((sentence_tokens[i], sentence_tokens[i+1]))
    return sentence_tokens, sentence_bigrams

In [3]:
sentence_to_bigrams(test_sentences[0])

(['<s>', 'the', 'old', 'man', 'spoke', 'to', 'me', '</s>'],
 [('<s>', 'the'),
  ('the', 'old'),
  ('old', 'man'),
  ('man', 'spoke'),
  ('spoke', 'to'),
  ('to', 'me'),
  ('me', '</s>')])
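The index-based loop can also be written by pairing the token list with a copy of itself shifted by one position. The following is a minimal sketch of that variant, not a replacement for the solution above; the name `sentence_to_bigrams_zip` is hypothetical and chosen only to avoid clashing with `sentence_to_bigrams`.

```python
def sentence_to_bigrams_zip(sentence):
    """Variant of sentence_to_bigrams that pairs adjacent tokens with zip."""
    tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    # zip stops at the shorter sequence, so the final token never starts a pair
    bigrams = list(zip(tokens, tokens[1:]))
    return tokens, bigrams


test_sentences = [
    'the old man spoke to me',
    'me to spoke man old the',
    'old man me old man me',
]

for sentence in test_sentences:
    tokens, bigrams = sentence_to_bigrams_zip(sentence)
    print(bigrams)
```

Because `zip` truncates to the shorter input, no explicit `len(...) - 1` bound is needed.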

###### 2. Second option: With a library

In [5]:
# Importing the libraries
import nltk
from collections import Counter

In [12]:
def sentence_to_bigrams(sentence):
    """
    Add start '<s>' and stop '</s>' tags to the sentence and tokenize it into a list
    of lower-case words (sentence_tokens) and bigrams (sentence_bigrams)
    :param sentence: string
    :return: list, list
        sentence_tokens: ordered list of words found in the sentence
        sentence_bigrams: a list of ordered two-word tuples found in the sentence
    """
    # Tokenizing the lower-cased text (word_tokenize needs NLTK's Punkt tokenizer data)
    sentence_tokens = nltk.tokenize.word_tokenize(sentence.lower(), language="english")

    # Adding <s> at the start
    sentence_tokens.insert(0, "<s>")

    # Adding </s> at the end
    sentence_tokens.append("</s>")
    
    # Applying bigrams
    sentence_bigrams = list(nltk.bigrams(sentence_tokens))

    return sentence_tokens, sentence_bigrams

In [13]:
sentence_tokens, sentence_bigrams = sentence_to_bigrams('the old man spoke to me')

In [14]:
sentence_tokens

['<s>', 'the', 'old', 'man', 'spoke', 'to', 'me', '</s>']

In [15]:
sentence_bigrams

[('<s>', 'the'),
 ('the', 'old'),
 ('old', 'man'),
 ('man', 'spoke'),
 ('spoke', 'to'),
 ('to', 'me'),
 ('me', '</s>')]