# N-gram

Before we move on to probabilities, let’s first answer a question: why do we need to learn about n-grams and their probabilities? In Natural Language Processing, or NLP for short, n-grams are used for a variety of tasks. Some examples include sentence auto-completion (such as the one we see in Gmail these days), automatic spell checking, and, to a certain extent, checking the grammar of a given sentence. We’ll see some examples of this later in the post when we talk about assigning probabilities to n-grams.

N-grams are contiguous sequences of words, symbols, or tokens in a document. In technical terms, they can be defined as neighboring sequences of items in a document. They come into play when we deal with text data in NLP tasks. They have a wide range of applications, such as language models, semantic features, spelling correction, machine translation, and text mining.
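The idea of "neighboring sequences of items" can be sketched in a few lines of plain Python. This is only an illustration: `make_ngrams` is a helper name chosen here, the example phrase is arbitrary, and splitting on whitespace is a simplifying assumption about tokenization. The same helper works on words and on characters, since an n-gram only needs an ordered sequence of items.

```python
def make_ngrams(items, n):
    """Return every run of n neighboring items as a tuple."""
    # zip over n shifted copies of the sequence: items[0:], items[1:], ...
    return list(zip(*(items[i:] for i in range(n))))


words = "the quick brown fox".split()
word_bigrams = make_ngrams(words, 2)   # pairs of neighboring words
char_bigrams = make_ngrams("fox", 2)   # pairs of neighboring characters

print(word_bigrams)
print(char_bigrams)
```

A sequence of k items yields k - n + 1 n-grams, so the four-word phrase gives three bigrams.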

An N-gram is a sequence of N words. Consider the example sentence “I love reading history books and watching documentaries”. A one-gram, or unigram, is a one-word sequence. For the sentence above, the unigrams are “I”, “love”, “reading”, “history”, “books”, “and”, “watching”, and “documentaries”. A two-gram, or bigram, is a two-word sequence, e.g. “I love”, “love reading”, or “history books”.

Now consider the two statements "There was heavy rain" and "There was heavy flood". From experience, we can say that the first statement sounds better. An N-gram language model captures this by observing that "heavy rain" occurs more frequently than "heavy flood" in its training text. The first statement is therefore more likely, and the model selects it. A one-gram model relies only on how often each word occurs, without considering the previous words. A 2-gram model considers only the previous word when predicting the current word. A 3-gram model considers the two previous words. In general, an N-gram language model estimates the probability of each word given the N-1 words before it.

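To make the "heavy rain" versus "heavy flood" comparison concrete, here is a rough sketch of how a bigram model could score the two statements. The four-sentence training corpus is invented purely for illustration, and `sentence_probability` is a hypothetical helper. A real model would be trained on far more text and would usually smooth the counts so that unseen word pairs do not get a probability of zero.

```python
from collections import Counter

# Toy corpus, made up for this illustration only.
corpus = [
    "there was heavy rain",
    "there was heavy rain today",
    "heavy rain fell all night",
    "there was heavy flood",
]

unigram_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    words = line.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))


def sentence_probability(sentence):
    """Multiply P(word | previous word) over the sentence (no smoothing).

    The first word is not scored, because this sketch has no start token.
    """
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # the previous word never occurs in the corpus
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob


p_rain = sentence_probability("there was heavy rain")
p_flood = sentence_probability("there was heavy flood")
print("heavy rain:", p_rain)
print("heavy flood:", p_flood)
```
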
Summing up, ‘n’ is just a variable that can take positive integer values such as 1, 2, 3, and so on. ‘n’ simply refers to the number of items in each sequence.

Thinking along the same lines, n-grams are classified into the following types, depending on the value that ‘n’ takes.

| n | Term |
|---|---------|
| 1 | Unigram |
| 2 | Bigram |
| 3 | Trigram |
| n | n-gram |

As clearly depicted in the table above, when n=1, it is said to be a unigram. When n=2, it is said to be a bigram, and so on.

In [1]:
from nltk import ngrams

sentence = 'I reside in Bengaluru.'
n = 1
# ngrams() is lazy: it returns an iterator over tuples of n tokens
unigrams = ngrams(sentence.split(), n)

In [9]:
unigrams

<zip at 0x175eed73280>

In [2]:
for grams in unigrams:
    print(grams)

('I',)
('reside',)
('in',)
('Bengaluru.',)


In [11]:
from nltk import ngrams

sentence = 'I reside in Bengaluru.'
n = 2
# Each bigram pairs a token with the one that follows it
bigrams = ngrams(sentence.split(), n)
for grams in bigrams:
    print(grams)

('I', 'reside')
('reside', 'in')
('in', 'Bengaluru.')
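Note that `sentence.split()` leaves the full stop attached to the last token (`'Bengaluru.'`). If that is not wanted, one option is to extract the words with a regular expression before building the n-grams. The sketch below uses only the standard library. The pattern `\w+` is an assumption about what counts as a token; it would, for instance, split a contraction such as "I'm" into two tokens.

```python
import re

sentence = 'I reside in Bengaluru.'

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", sentence)

# Pair each token with its successor to form bigrams.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```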


In [12]:
passage=""" The Marvel Cinematic Universe (MCU) is an American media franchise and shared universe centered on a series of superhero films produced by Marvel Studios. The films are based on characters that appear in American comic books published by Marvel Comics. The franchise also includes television series, short films, digital series, and literature. The shared universe, much like the original Marvel Universe in comic books, was established by crossing over common plot elements, settings, cast, and characters.

Marvel Studios releases its films in groups called "Phases", with the first three phases collectively known as "The Infinity Saga" and the following three phases as "The Multiverse Saga". The first MCU film, Iron Man (2008), began Phase One, which culminated in the 2012 crossover film The Avengers. Phase Two began with Iron Man 3 (2013) and concluded with Ant-Man (2015). Phase Three began with Captain America: Civil War (2016) and concluded with Spider-Man: Far From Home (2019). Phase Four began with Black Widow (2021) and concluded with Black Panther: Wakanda Forever (2022). Ant-Man and the Wasp: Quantumania (2023) began Phase Five, which will end with Thunderbolts (2025), and Phase Six will begin with The Fantastic Four (2025). Phase Six and "The Multiverse Saga" will conclude with Avengers 5 (2026) and Avengers: Secret Wars (2027).

Marvel Television expanded the universe to network television with Agents of S.H.I.E.L.D. on ABC in 2013 before further expanding to streaming television on Netflix and Hulu and to cable television on Freeform. They also produced the digital series Agents of S.H.I.E.L.D.: Slingshot. Marvel Studios began producing their own television series for streaming on Disney+, starting with WandaVision in 2021 as the beginning of Phase Four. They also expanded to television specials in Phase Four, known as Marvel Studios Special Presentations, the first of which was Werewolf by Night (2022). The MCU also includes tie-in comics published by Marvel Comics, a series of direct-to-video short films called Marvel One-Shots, and viral marketing campaigns for the films featuring the faux news programs WHIH Newsfront and The Daily Bugle.

The franchise has been commercially successful, becoming one of the highest-grossing media franchises of all time, and generally received positive reviews. It has inspired other film and television studios to attempt similar shared universes and has also inspired several themed attractions, an art exhibit, television specials, literary material, multiple tie-in video games, and commercials."""

In [13]:
n = 2
# Split the passage on whitespace and pair each token with its successor
bigrams = ngrams(passage.split(), n)
for grams in bigrams:
    print(grams)

('The', 'Marvel')
('Marvel', 'Cinematic')
('Cinematic', 'Universe')
('Universe', '(MCU)')
('(MCU)', 'is')
('is', 'an')
('an', 'American')
('American', 'media')
('media', 'franchise')
('franchise', 'and')
('and', 'shared')
('shared', 'universe')
('universe', 'centered')
('centered', 'on')
('on', 'a')
('a', 'series')
('series', 'of')
('of', 'superhero')
('superhero', 'films')
('films', 'produced')
('produced', 'by')
('by', 'Marvel')
('Marvel', 'Studios.')
('Studios.', 'The')
('The', 'films')
('films', 'are')
('are', 'based')
('based', 'on')
('on', 'characters')
('characters', 'that')
('that', 'appear')
('appear', 'in')
('in', 'American')
('American', 'comic')
('comic', 'books')
('books', 'published')
('published', 'by')
('by', 'Marvel')
('Marvel', 'Comics.')
('Comics.', 'The')
('The', 'franchise')
('franchise', 'also')
('also', 'includes')
('includes', 'television')
('television', 'series,')
('series,', 'short')
('short', 'films,')
('films,', 'digital')
('digital', 'series,')
('series,',
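Printing every bigram of a long passage is hard to read. A frequency count is often more useful, because repeated bigrams are what a model learns from. The sketch below uses `collections.Counter` on the first two sentences of the passage only, a shortened excerpt chosen to keep the example self-contained.

```python
from collections import Counter

excerpt = (
    "The Marvel Cinematic Universe (MCU) is an American media franchise "
    "and shared universe centered on a series of superhero films produced "
    "by Marvel Studios. The films are based on characters that appear in "
    "American comic books published by Marvel Comics."
)

words = excerpt.split()
bigram_counts = Counter(zip(words, words[1:]))

# Show only the bigrams seen more than once in the excerpt.
for bigram, count in bigram_counts.most_common():
    if count > 1:
        print(bigram, count)
```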

# N-gram Probabilities

Let’s take the example of a sentence completion system. This system suggests words that could be used next in a given sentence. Suppose I give the system the sentence “Thank you so much for your” and expect it to predict the next word. You and I both know that the next word is “help” with very high probability. But how will the system know that?

One important thing to note here is that, as with any other artificial intelligence or machine learning model, *we need to train the model with a huge corpus of data.*

Once we do that, the system, or the NLP model, will have a pretty good idea of the “probability” of a word occurring after a certain word. So, hoping that we have trained our model with a huge corpus of data, we’ll assume that the model gave us the correct answer.

I spoke about the probability a bit there, but let’s now build on that. When we’re building an NLP model for predicting words in a sentence, *the probability of the occurrence of a word in a sequence of words is what matters.* And how do we measure that? Let’s say we’re working with a bigram model here, and we have the following sentences as the training corpus:

> 1: Thank you so much for your help.

> 2: I really appreciate your help.

> 3: Excuse me, do you know what time it is?

> 4: I’m really sorry for not inviting you.

> 5: I really like your watch.

> 6: I really like the pizza.

> 7: I like roses.

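Before working through the corpus, it helps to be precise about what we are measuring. To find how likely “like” is to follow “really”, we compute:

> count(really like) / count(really)

The event is “like” appearing right after “really”. The sample space is every occurrence of “really”, so the denominator counts “really”, not “like”.

The same reasoning works outside of text. Suppose tea occurs 20 times, biscuits occur 30 times, and tea with biscuits occurs 8 times. The probability of biscuits given tea is:

> count(tea biscuits) / count(tea) = 8 / 20 = 0.4

The 30 occurrences of biscuits do not enter this calculation, because the sample space is the 20 occurrences of tea.

So which word is most likely to come after “really”?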

Let’s suppose that after training our model with this data, I want to write the sentence “I really like your garden.” Because this is a bigram model, it learns how often every pair of adjacent words occurs, in order to determine the probability of a word occurring after a certain word. For example, from the 2nd, 4th, 5th, and 6th sentences in the example above, we know that the word “really” can be followed by “appreciate”, “sorry”, or “like”. So the model will calculate the probability of each of these sequences.



Suppose we’re calculating the probability of the word “w1” occurring after the word “w2”. The formula for this is as follows:

> count(w2 w1) / count(w2)

which is the number of times the two words occur in the required sequence, divided by the number of times the preceding word occurs in the corpus.
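The formula translates almost directly into code. The sketch below counts words and adjacent word pairs in the seven example sentences and divides the two counts. Lowercasing and stripping punctuation are tokenization choices made here for illustration, and `bigram_probability` is a hypothetical helper name.

```python
import re
from collections import Counter

# The seven training sentences (a straight apostrophe is used in "I'm").
corpus = [
    "Thank you so much for your help.",
    "I really appreciate your help.",
    "Excuse me, do you know what time it is?",
    "I'm really sorry for not inviting you.",
    "I really like your watch.",
    "I really like the pizza.",
    "I like roses",
]

word_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    # Lowercase and keep letters/apostrophes only, so "help." becomes "help".
    words = re.findall(r"[a-z']+", sentence.lower())
    word_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))


def bigram_probability(previous, word):
    """Estimate P(word | previous) as count(previous word) / count(previous)."""
    if word_counts[previous] == 0:
        return 0.0  # the previous word never occurs in the corpus
    return bigram_counts[(previous, word)] / word_counts[previous]


for candidate in ("like", "appreciate", "sorry"):
    print(candidate, bigram_probability("really", candidate))
```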

From our example sentences, let’s calculate the probability of the word “like” occurring after the word “really”. The word “really” occurs four times in the corpus (sentences 2, 4, 5, and 6), and “like” follows it in two of them:

> count(really like) / count(really)

    = 2 / 4

    = 0.50

Similarly, for “appreciate” and “sorry”:

> count(really appreciate) / count(really)

    = 1 / 4

    = 0.25

> count(really sorry) / count(really)

    = 1 / 4

    = 0.25

In [3]:
d = {("really", "like"): 0.50,
     ("really", "appreciate"): 0.25,
     ("really", "sorry"): 0.25,
    }

In [4]:
d

{('really', 'like'): 0.5,
 ('really', 'appreciate'): 0.25,
 ('really', 'sorry'): 0.25}

So when I type the phrase “I really,” and expect the model to suggest the next word, it’ll get the right answer only about half of the time, because the probability of the correct answer is only 1/2.

As another example, if my input to the model is “Thank you for inviting,” and I expect the model to suggest the next word, it’s going to give me the word “you,” because of example sentence 4. That’s the only example the model knows. As you can imagine, if we give the model a bigger corpus (or a bigger dataset) to train on, the predictions will improve a lot. Similarly, we’re only using a bigram model here. We can use a trigram or even a 4-gram model to improve the model’s understanding of the probabilities.
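A next-word suggestion of this kind can be sketched by recording, for every word, which words follow it and how often. The code below is a simplified illustration over the seven example sentences, not a production auto-completion system. `suggest_next` is a hypothetical helper, and it looks only at the last word of the input, as a bigram model would.

```python
import re
from collections import Counter, defaultdict

# The seven training sentences (a straight apostrophe is used in "I'm").
corpus = [
    "Thank you so much for your help.",
    "I really appreciate your help.",
    "Excuse me, do you know what time it is?",
    "I'm really sorry for not inviting you.",
    "I really like your watch.",
    "I really like the pizza.",
    "I like roses",
]

# For every word, count which words follow it.
followers = defaultdict(Counter)
for sentence in corpus:
    words = re.findall(r"[a-z']+", sentence.lower())
    for previous, word in zip(words, words[1:]):
        followers[previous][word] += 1


def suggest_next(phrase):
    """Return the most frequent follower of the last word in phrase, or None."""
    words = re.findall(r"[a-z']+", phrase.lower())
    if not words or words[-1] not in followers:
        return None  # nothing typed yet, or the last word was never seen
    return followers[words[-1]].most_common(1)[0][0]


print(suggest_next("Thank you for inviting"))
print(suggest_next("Thank you so much for your"))
```

For a last word the corpus has never seen, the helper returns `None`, which is one reason a larger corpus improves the suggestions.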

Using these n-grams and the probabilities of certain words occurring in certain sequences can improve the predictions of auto-completion systems. Similarly, we can use NLP and n-grams to train voice-based personal assistant bots.

For example, using a 3-gram or trigram training model, a bot will be able to understand the difference between sentences such as “what’s the temperature?” and “set the temperature.”

> What's the temperature?

> Set the temperature.

Bigrams:

[("What's", "the"), ("the", "temperature")]

[("Set", "the"), ("the", "temperature")]
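One way to see why a larger n helps here is to check which n-grams the two sentences have in common. In this sketch, `make_ngrams` is an illustrative helper, and the sentences are lowercased with the final punctuation removed. The bigram sets overlap, while the single trigram of each sentence is distinct.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


question = "what's the temperature".split()
command = "set the temperature".split()

shared = {}
for n in (2, 3):
    # n-grams that appear in both sentences
    shared[n] = set(make_ngrams(question, n)) & set(make_ngrams(command, n))
    print(n, shared[n])
```

With bigrams, both sentences contain ("the", "temperature"), so only the first bigram tells them apart. With trigrams there is no shared sequence at all.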