N-grams are contiguous sequences of n items (typically words or tokens) in a sentence. N can be 1, 2, or any other positive integer, although very large values of N are rarely useful because such long n-grams seldom recur across different texts.

When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents.
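
As a minimal illustration of the definition above, the sketch below builds bi-grams and tri-grams from a token list using only the standard library. The helper name `simple_ngrams` is hypothetical and not part of this notebook, which relies on `nltk.util.ngrams` instead.

```python
def simple_ngrams(tokens, n):
    """Return the contiguous n-token tuples in `tokens` (hypothetical helper)."""
    # Zip n staggered views of the list; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "it was a relief to be away from home".split()
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams[:3])
print(trigrams[:2])
```

A sentence of k tokens yields k - n + 1 n-grams, so the nine tokens here give eight bi-grams and seven tri-grams.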

In [2]:
# import packages
import pandas as pd
import re
import nltk
from nltk.util import ngrams
from gensim.parsing.preprocessing import STOPWORDS, strip_tags, strip_numeric, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short, stem_text
from textblob import TextBlob
import json
#pip install NRCLex
from nrclex import NRCLex




In [3]:
# data
data = "Getting back to work made a difference. After 10 days of nothing but the business of moving and all of its seemingly obligatory messy emotions, it was nice to think of nothing but my patients. I worked Wednesday through Friday, and even with a couple of long days in there, it was a relief to be away from home. It was a relief to be away from unpacking, and contemplating, and deciding. It was a pleasure to think about somebody other than myself for 3 days. I needed that. Those 3 days away, combined with a long run/walk/dip into Lake Superior with Jet yesterday, gave me the energy to unpack nearly my entire basement today. I ve still got a lot to do, but things are starting to take shape. My bedroom is almost completely put together. My bathroom and kitchen are done. I ve still got boxes in the living room, dining room and the other 2 bedrooms, but I m getting there. Tomorrow I m heading south to Mayo Clinic for a ketamine infusion. Im pleased its not an urgent need at this time, just a regular maintenance dose. Returning to work, getting some exercise, and progressing with my unpacking have each helped stabilize my mood. Im  no longer daily wiping tears from my eyes. In fact, I haven t cried for several days. That, in and of itself, is quite a feat! I m taking my time with unpacking. I m doing my best to remain patient. Taking the next right action and maintaining my attitude of gratitude are my focus now. Its still hard, but its not impossible. Settling into my new home, new routine, and new city will take time. I m keeping that fact forefront in my mind. I can do this. But I cant do it all today, nor do I have to. Patiently, Ill get it done."

In [4]:
# use TextBlob to extract noun phrases
blob = TextBlob(data)
print(blob.noun_phrases)
nouns_of_data = blob.noun_phrases

['obligatory messy emotions', 'long days', 'long run/walk/dip', 'jet', 'entire basement', 'tomorrow', 'mayo clinic', 'ketamine infusion', 'im', 'urgent need', 'regular maintenance dose', 'im', 'haven t', 'right action', 'settling', 'new home', 'new city', 'fact forefront', 'patiently', 'ill']


In [5]:
# use nouns for ngrams
# nouns_of_data = ['obligatory messy emotions', 'long days', 'long run/walk/dip', 'jet', 'entire basement', 'tomorrow', 'mayo clinic', 'ketamine infusion', 'im', 'urgent need', 'regular maintenance dose', 'im', 'haven t', 'right action', 'settling', 'new home', 'new city', 'fact forefront', 'patiently', 'ill']
# nouns_of_data=str(s)

In [6]:
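def n_grams(s, n):
    '''Return n-grams extracted from a string.
    s = input text (str)
    n = number of tokens per n-gram'''

    # lowercase, then replace everything except letters, digits and whitespace
    s = s.lower()
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # split on any run of whitespace; empty tokens are dropped automatically
    tokens = s.split()
    # note: this slice skips the first n-gram (index 0) and keeps at most the next nine
    output = list(ngrams(tokens, n))[1:10]
    return output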

In [7]:
#Instantiate text object (for best results, 'text' should be unicode).
text_object = NRCLex(data)

In [8]:
#Return words list.
print(len(text_object.words))
print(text_object.words)

315
['Getting', 'back', 'to', 'work', 'made', 'a', 'difference', 'After', '10', 'days', 'of', 'nothing', 'but', 'the', 'business', 'of', 'moving', 'and', 'all', 'of', 'its', 'seemingly', 'obligatory', 'messy', 'emotions', 'it', 'was', 'nice', 'to', 'think', 'of', 'nothing', 'but', 'my', 'patients', 'I', 'worked', 'Wednesday', 'through', 'Friday', 'and', 'even', 'with', 'a', 'couple', 'of', 'long', 'days', 'in', 'there', 'it', 'was', 'a', 'relief', 'to', 'be', 'away', 'from', 'home', 'It', 'was', 'a', 'relief', 'to', 'be', 'away', 'from', 'unpacking', 'and', 'contemplating', 'and', 'deciding', 'It', 'was', 'a', 'pleasure', 'to', 'think', 'about', 'somebody', 'other', 'than', 'myself', 'for', '3', 'days', 'I', 'needed', 'that', 'Those', '3', 'days', 'away', 'combined', 'with', 'a', 'long', 'run/walk/dip', 'into', 'Lake', 'Superior', 'with', 'Jet', 'yesterday', 'gave', 'me', 'the', 'energy', 'to', 'unpack', 'nearly', 'my', 'entire', 'basement', 'today', 'I', 've', 'still', 'got', 'a', 'lo

In [33]:
#Return sentences list.
# 26 sentences
print(len(text_object.sentences))
text_object.sentences

26


[Sentence("Getting back to work made a difference."),
 Sentence("After 10 days of nothing but the business of moving and all of its seemingly obligatory messy emotions, it was nice to think of nothing but my patients."),
 Sentence("I worked Wednesday through Friday, and even with a couple of long days in there, it was a relief to be away from home."),
 Sentence("It was a relief to be away from unpacking, and contemplating, and deciding."),
 Sentence("It was a pleasure to think about somebody other than myself for 3 days."),
 Sentence("I needed that."),
 Sentence("Those 3 days away, combined with a long run/walk/dip into Lake Superior with Jet yesterday, gave me the energy to unpack nearly my entire basement today."),
 Sentence("I ve still got a lot to do, but things are starting to take shape."),
 Sentence("My bedroom is almost completely put together."),
 Sentence("My bathroom and kitchen are done."),
 Sentence("I ve still got boxes in the living room, dining room and the other 2 bedr

In [10]:
#Return affect list.
# 37 "emotions"
print(len(text_object.affect_list))
text_object.affect_list

37


['disgust',
 'negative',
 'anticipation',
 'positive',
 'positive',
 'anticipation',
 'positive',
 'positive',
 'joy',
 'positive',
 'anticipation',
 'fear',
 'negative',
 'surprise',
 'anticipation',
 'trust',
 'anticipation',
 'trust',
 'positive',
 'trust',
 'anticipation',
 'joy',
 'positive',
 'surprise',
 'anticipation',
 'anticipation',
 'positive',
 'positive',
 'joy',
 'positive',
 'positive',
 'negative',
 'sadness',
 'positive',
 'trust',
 'anticipation',
 'trust']

In [11]:
#Return affect dictionary.
# words in the text that NRCLex matched to one or more emotions
print(len(text_object.affect_dict))
text_object.affect_dict

19


{'messy': ['disgust', 'negative'],
 'long': ['anticipation'],
 'relief': ['positive'],
 'shape': ['positive'],
 'completely': ['positive'],
 'pleased': ['joy', 'positive'],
 'urgent': ['anticipation', 'fear', 'negative', 'surprise'],
 'time': ['anticipation'],
 'maintenance': ['trust'],
 'daily': ['anticipation'],
 'fact': ['trust'],
 'haven': ['positive', 'trust'],
 'feat': ['anticipation', 'joy', 'positive', 'surprise'],
 'patient': ['anticipation', 'positive'],
 'action': ['positive'],
 'gratitude': ['joy', 'positive'],
 'focus': ['positive'],
 'impossible': ['negative', 'sadness'],
 'routine': ['positive', 'trust']}

In [48]:
#Return raw emotional counts.
text_object.raw_emotion_scores

{'disgust': 1,
 'negative': 3,
 'anticipation': 9,
 'positive': 12,
 'joy': 3,
 'fear': 1,
 'surprise': 2,
 'trust': 5,
 'sadness': 1}
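
The raw scores appear to be a plain tally of `affect_list`. The sketch below reproduces them with `collections.Counter`, using a copy of the 37 labels printed earlier. The variable name `affect_list_copy` is introduced only for illustration, and whether NRCLex computes the scores this way internally is an assumption.

```python
from collections import Counter

# Copy of the 37 labels printed by text_object.affect_list above.
affect_list_copy = [
    'disgust', 'negative', 'anticipation', 'positive', 'positive',
    'anticipation', 'positive', 'positive', 'joy', 'positive',
    'anticipation', 'fear', 'negative', 'surprise', 'anticipation',
    'trust', 'anticipation', 'trust', 'positive', 'trust',
    'anticipation', 'joy', 'positive', 'surprise', 'anticipation',
    'anticipation', 'positive', 'positive', 'joy', 'positive',
    'positive', 'negative', 'sadness', 'positive', 'trust',
    'anticipation', 'trust',
]

# Count how often each emotion label occurs.
tally = Counter(affect_list_copy)
print(dict(tally))
```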

In [49]:
#Return highest emotions.
text_object.top_emotions

[('positive', 0.32432432432432434)]

In [52]:
#Return affect frequencies.
text_object.affect_frequencies

{'fear': 0.02702702702702703,
 'anger': 0.0,
 'anticip': 0.0,
 'trust': 0.13513513513513514,
 'surprise': 0.05405405405405406,
 'positive': 0.32432432432432434,
 'negative': 0.08108108108108109,
 'sadness': 0.02702702702702703,
 'disgust': 0.02702702702702703,
 'joy': 0.08108108108108109,
 'anticipation': 0.24324324324324326}
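
These frequencies are consistent with dividing each raw count by the total number of labels (37). A minimal check follows, assuming that is how they are derived; NRCLex's internals are not shown in this notebook, and the dictionary below is copied from the raw scores printed earlier.

```python
# Raw counts copied from text_object.raw_emotion_scores above.
raw_emotion_scores = {
    'disgust': 1, 'negative': 3, 'anticipation': 9, 'positive': 12,
    'joy': 3, 'fear': 1, 'surprise': 2, 'trust': 5, 'sadness': 1,
}

# Normalise each count by the total number of emotion labels.
total = sum(raw_emotion_scores.values())
frequencies = {emotion: count / total
               for emotion, count in raw_emotion_scores.items()}

# The emotion with the highest share corresponds to top_emotions.
top = max(frequencies, key=frequencies.get)
print(total)
print(top, frequencies[top])
```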

In [53]:
# check data type
type(text_object.affect_dict)

dict

In [60]:
# get the 3 top emotions
from collections import Counter
c = Counter(text_object.raw_emotion_scores)

# return top 3 (emotion, count) pairs
most_common = c.most_common(3)
# For getting the keys from most common
my_keys = [key for key, val in most_common]

print(most_common)
print(my_keys)

[('positive', 12), ('anticipation', 9), ('trust', 5)]
['positive', 'anticipation', 'trust']


In [54]:
res = str(text_object.affect_dict)
type(res)

str

In [22]:
# check data type 
type(data)

str

In [29]:
test = str(['messy',
 'long',
 'relief',
 'shape',
 'completely',
 'pleased',
 'urgent',
 'time',
 'maintenance',
 'daily',
 'fact',
 'haven',
 'feat',
 'patient',
 'action',
 'gratitude',
 'focus',
 'impossible',
 'routine'])

In [36]:
messy = str(text_object.sentences[1])

In [38]:
# use TextBlob to extract noun phrases from a single sentence
blob = TextBlob(messy)
print(blob.noun_phrases)
nouns_of_data = blob.noun_phrases  

['obligatory messy emotions']


In [40]:
messy

'After 10 days of nothing but the business of moving and all of its seemingly obligatory messy emotions, it was nice to think of nothing but my patients.'

In [47]:
# test bigram
n_grams(s=res, n=2)

[('disgust', 'negative'),
 ('negative', 'long'),
 ('long', 'anticipation'),
 ('anticipation', 'relief'),
 ('relief', 'positive'),
 ('positive', 'shape'),
 ('shape', 'positive'),
 ('positive', 'completely'),
 ('completely', 'positive')]
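
The output starts at `('disgust', 'negative')` rather than `('messy', 'disgust')` because `n_grams` slices its result with `[1:10]`, which drops the bi-gram at index 0. The standard-library sketch below shows this on the first few entries of the stringified dictionary; `res_sample` is a shortened, illustrative stand-in for `res`.

```python
import re

# First three entries of str(text_object.affect_dict), shortened for illustration.
res_sample = ("{'messy': ['disgust', 'negative'], "
              "'long': ['anticipation'], 'relief': ['positive']}")

# Same cleaning steps as n_grams: lowercase, strip punctuation, split on whitespace.
cleaned = re.sub(r'[^a-zA-Z0-9\s]', ' ', res_sample.lower())
tokens = cleaned.split()

# Pair each token with its successor to form bi-grams.
bigrams = list(zip(tokens, tokens[1:]))

print(bigrams[0])    # the pair that the [1:10] slice leaves out
print(bigrams[1:4])  # where the sliced output begins
```

Because the input is a stringified dictionary, the resulting bi-grams mix words with their emotion labels; they are not bi-grams of the original text.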

In [45]:
score_ngram(messy)

NameError: name 'score_ngram' is not defined
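
The `NameError` arises because `score_ngram` is never defined in this notebook, and what it was meant to compute is not stated. One possible reading, offered purely as an assumption, is a function that splits a sentence into n-grams and tallies the emotion labels of the words in each one. The sketch below uses only the standard library and a three-word sample copied from `affect_dict`; the signature, the `sample_affect` name and the return format are all hypothetical.

```python
import re
from collections import Counter

# A few entries copied from affect_dict above; the full lexicon lives in NRCLex.
sample_affect = {
    'messy': ['disgust', 'negative'],
    'long': ['anticipation'],
    'relief': ['positive'],
}


def score_ngram(text, n=2, affect=sample_affect):
    """Hypothetical: return (n-gram, emotion counts) for n-grams with any match."""
    tokens = re.sub(r'[^a-z0-9\s]', ' ', text.lower()).split()
    scored = []
    for gram in zip(*(tokens[i:] for i in range(n))):
        labels = Counter()
        for word in gram:
            labels.update(affect.get(word, []))
        if labels:  # keep only n-grams containing at least one lexicon word
            scored.append((gram, dict(labels)))
    return scored


messy = ('After 10 days of nothing but the business of moving and all of its '
         'seemingly obligatory messy emotions, it was nice to think of nothing '
         'but my patients.')
result = score_ngram(messy)
for gram, labels in result:
    print(gram, labels)
```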