N-grams are contiguous sequences of n items in a sentence. N can be 1, 2, or any other positive integer, although we usually do not consider very large N because such long n-grams rarely appear in many different places.
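
As a minimal sketch of the definition above, the n-grams of a token list can be produced in plain Python by sliding a window of width n over it. The helper name `simple_ngrams` and the sample sentence are illustrative assumptions and are not part of the cells that follow.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token tuples from `tokens` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip over n shifted views of the list; zip stops at the shortest view,
    # so a list of length L yields max(L - n + 1, 0) tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "it was a relief to be away".split()
unigrams = simple_ngrams(tokens, 1)
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(bigrams)
```

A sentence of 7 tokens gives 6 bi-grams and 5 tri-grams; asking for an n larger than the sentence length simply gives an empty list.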

When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents.
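
To illustrate how bi-grams can sit alongside individual tokens as features, here is a rough sketch that counts every 1-gram and 2-gram in a short document. The function `ngram_features`, its `max_n` parameter, and the sample text are assumptions made for this example; a real pipeline would more likely rely on a library vectorizer.

```python
from collections import Counter


def ngram_features(text, max_n=2):
    """Count every 1..max_n-gram in `text` (hypothetical helper)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # slide a window of width n over the tokens
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


doc = "it was a relief to be away it was a pleasure"
features = ngram_features(doc, max_n=2)
print(features["it was"])
```

Each key of `features` is one candidate feature, and its count is the value that feature takes for this document.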

In [5]:
# import packages
import re
from nltk.util import ngrams
from gensim.parsing.preprocessing import STOPWORDS, strip_tags, strip_numeric, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short, stem_text
from textblob import TextBlob

In [6]:
# data
s = "Getting back to work made a difference. After 10 days of nothing but the business of moving and all of its seemingly obligatory messy emotions, it was nice to think of nothing but my patients. I worked Wednesday through Friday, and even with a couple of long days in there, it was a relief to be away from home. It was a relief to be away from unpacking, and contemplating, and deciding. It was a pleasure to think about somebody other than myself for 3 days. I needed that. Those 3 days away, combined with a long run/walk/dip into Lake Superior with Jet yesterday, gave me the energy to unpack nearly my entire basement today. I ve still got a lot to do, but things are starting to take shape. My bedroom is almost completely put together. My bathroom and kitchen are done. I ve still got boxes in the living room, dining room and the other 2 bedrooms, but I m getting there. Tomorrow I m heading south to Mayo Clinic for a ketamine infusion. Im pleased its not an urgent need at this time, just a regular maintenance dose. Returning to work, getting some exercise, and progressing with my unpacking have each helped stabilize my mood. Im  no longer daily wiping tears from my eyes. In fact, I haven t cried for several days. That, in and of itself, is quite a feat! I m taking my time with unpacking. I m doing my best to remain patient. Taking the next right action and maintaining my attitude of gratitude are my focus now. Its still hard, but its not impossible. Settling into my new home, new routine, and new city will take time. I m keeping that fact forefront in my mind. I can do this. But I cant do it all today, nor do I have to. Patiently, Ill get it done."

In [8]:
blob = TextBlob(s)
print(blob.noun_phrases)
s = blob.noun_phrases

['obligatory messy emotions', 'long days', 'long run/walk/dip', 'jet', 'entire basement', 'tomorrow', 'mayo clinic', 'ketamine infusion', 'im', 'urgent need', 'regular maintenance dose', 'im', 'haven t', 'right action', 'settling', 'new home', 'new city', 'fact forefront', 'patiently', 'ill']


In [32]:
s = ['obligatory messy emotions', 'long days', 'long run/walk/dip', 'jet', 'entire basement', 'tomorrow', 'mayo clinic', 'ketamine infusion', 'im', 'urgent need', 'regular maintenance dose', 'im', 'haven t', 'right action', 'settling', 'new home', 'new city', 'fact forefront', 'patiently', 'ill']
# join the phrases into one string (str(s) would also embed the list's brackets and quotes)
s = " ".join(s)

In [33]:
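def n_grams(s, n):
    '''Return n-grams built from a string.
    s = input text
    n = size of each n-gram (2 for bi-grams, 3 for tri-grams)'''

    # lowercase and replace every non-alphanumeric character with a space
    s = s.lower()
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # split on spaces and drop the empty strings left by repeated spaces
    tokens = [token for token in s.split(" ") if token != ""]
    # note: [1:10] skips the first n-gram and keeps the next nine
    output = list(ngrams(tokens, n))[1:10]
    return output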

In [34]:
n_grams(s=s, n=2)

[('messy', 'emotions'),
 ('emotions', 'long'),
 ('long', 'days'),
 ('days', 'long'),
 ('long', 'run'),
 ('run', 'walk'),
 ('walk', 'dip'),
 ('dip', 'jet'),
 ('jet', 'entire')]

In [35]:
n_grams(s=s, n=3)

[('messy', 'emotions', 'long'),
 ('emotions', 'long', 'days'),
 ('long', 'days', 'long'),
 ('days', 'long', 'run'),
 ('long', 'run', 'walk'),
 ('run', 'walk', 'dip'),
 ('walk', 'dip', 'jet'),
 ('dip', 'jet', 'entire'),
 ('jet', 'entire', 'basement')]
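
Because the noun phrases are flattened into one string before tokenizing, some of the n-grams above span two phrases, for example ('emotions', 'long'). If that is not wanted, one option is to build n-grams inside each phrase separately. The sketch below assumes that per-phrase behaviour is the goal; `phrase_ngrams` is a hypothetical helper, and only the first few phrases from the list are used.

```python
import re


def phrase_ngrams(phrases, n):
    """Build n-grams within each phrase so none crosses a phrase boundary."""
    output = []
    for phrase in phrases:
        # Same cleaning as n_grams: lowercase, non-alphanumerics become spaces.
        cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", phrase.lower())
        tokens = cleaned.split()
        # Phrases with fewer than n tokens contribute nothing.
        output.extend(zip(*(tokens[i:] for i in range(n))))
    return output


phrases = ["obligatory messy emotions", "long days", "long run/walk/dip", "jet"]
per_phrase_bigrams = phrase_ngrams(phrases, 2)
print(per_phrase_bigrams)
```

With this variant, single-word phrases such as 'jet' produce no bi-grams at all, which may or may not be acceptable depending on the task.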