# Preprocessing

In this part we will cover the following:
- tokenization: learning how to use the inbuilt tokenizers of NLTK
- stemming: learning to use the inbuilt stemmers
- lemmatization: learning to use the WordNetLemmatizer of NLTK
- stopwords: learning to use the stopwords corpus
- processing two short stories and extracting the vocabulary they have in common

# Tokenization: learning to use the inbuilt tokenizers of NLTK

Tokenization is the process of breaking a document or a long string into smaller units (tokens) such as lines, words and punctuation marks.

In [1]:
# import required libraries
import nltk
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize

In [24]:
# LineTokenizer
lTokenizer = LineTokenizer()
document = """ 
My name is Maximus Decimus Meridius, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. 
\nFather to a murdered son, husband to a  murdered wife. \nAnd I will have my vengeance, in this life or the next.
"""
print('Line tokenizer output: ', lTokenizer.tokenize(document))


Line tokenizer output:  ['My name is Maximus Decimus Meridius, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. ', 'Father to a murdered son, husband to a  murdered wife. ', 'And I will have my vengeance, in this life or the next.']


As we can see, LineTokenizer divided the input document into three lines based on where the newline characters are, and it discarded the whitespace-only first line. It simply splits the input string at line breaks, so a line is not necessarily a sentence.
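
As a rough illustration of the same idea without NLTK, the sketch below splits a string at line breaks and drops whitespace-only lines using only the standard library. The sample string is made up for this example, and the snippet is an approximation of the behaviour shown above, not NLTK's implementation.

```python
# Rough stdlib analogue of line tokenization: split at line breaks
# and drop lines that contain only whitespace.
document = " \nFirst line. \n\nSecond line.\n"

lines = [line for line in document.splitlines() if line.strip()]
print(lines)  # ['First line. ', 'Second line.']
```

Trailing spaces inside a kept line are preserved, just as in the LineTokenizer output above.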

In [25]:
# SpaceTokenizer
rawText = "By 11 o'clock on Sunday, the doctor shall open the dispensary."
sTokenizer = SpaceTokenizer()
print("Space Tokenizer output: ", sTokenizer.tokenize(rawText))

Space Tokenizer output:  ['By', '11', "o'clock", 'on', 'Sunday,', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary.']


As we can see, calling the tokenize method of SpaceTokenizer splits the input on the space character " ". Punctuation stays attached to the neighbouring word (for example 'Sunday,' and 'dispensary.'). Next is the word_tokenize() function, which also separates punctuation from words.
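
Splitting on a single space is close to what Python's built-in `str.split(" ")` does. The sketch below uses a made-up sentence with a double space to show one practical consequence: splitting on `" "` produces an empty token, while `str.split()` without arguments collapses runs of whitespace.

```python
# Splitting on exactly one space vs. splitting on any run of whitespace.
raw_text = "By 11 o'clock on  Sunday"  # note the double space before 'Sunday'

on_single_space = raw_text.split(" ")
on_any_whitespace = raw_text.split()

print(on_single_space)    # ['By', '11', "o'clock", 'on', '', 'Sunday']
print(on_any_whitespace)  # ['By', '11', "o'clock", 'on', 'Sunday']
```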

In [26]:
# word_tokenize
print("Word Tokenizer output: ", word_tokenize(rawText))

Word Tokenizer output:  ['By', '11', "o'clock", 'on', 'Sunday', ',', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary', '.']


The word_tokenize() function has split our input into words and punctuation marks; ',' and '.' are now tokens of their own.
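
To see why punctuation ends up as separate tokens, here is a minimal regular-expression sketch. It is a simplification written for this example (the name `simple_word_tokenize` is hypothetical); the real word_tokenize() applies far more elaborate rules and will differ on other inputs.

```python
import re

# Words (optionally with an internal apostrophe, e.g. "o'clock"),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Very rough word/punctuation splitter; not NLTK's algorithm."""
    return TOKEN_RE.findall(text)

sample = "By 11 o'clock on Sunday, the doctor shall open the dispensary."
simple_tokens = simple_word_tokenize(sample)
print(simple_tokens)
```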

In [27]:
# TweetTokenizer
tTokenizer = TweetTokenizer()
print("Tweet Tokenizer output :", tTokenizer.tokenize("This is a cooool #dummysmiley: :-) :-P <3"))

Tweet Tokenizer output : ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']


As we can see, the tokenizer kept the hashtag intact, and the emoticons are preserved as single tokens instead of being broken into punctuation. This tokenizer is useful for social media text, where such tokens carry meaning.
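
The idea of keeping hashtags and emoticons whole can be sketched with a regular expression that tries those patterns before falling back to plain words and punctuation. The pattern below is a toy written for this example and covers only a handful of emoticons; it is not how TweetTokenizer is implemented.

```python
import re

# Alternatives are tried left to right, so emoticons and hashtags
# are matched before plain words and single punctuation characters.
TWEET_RE = re.compile(r"[:;=]-?[()DPp]|<3|#\w+|\w+|[^\w\s]")

tweet = "This is a cooool #dummysmiley: :-) :-P <3"
tweet_tokens = TWEET_RE.findall(tweet)
print(tweet_tokens)
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']
```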

# Stemming: learning to use the inbuilt stemmers of NLTK

In [28]:
# import required libraries
from nltk import PorterStemmer, LancasterStemmer, word_tokenize

In [85]:
# get raw text
raw = """My name is Maximus Decimus Meridius, commander of the Armies of the North, 
General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. 
Father to a murdered son, husband to a murdered wife. 
And I will have my vengeance, in this life or the next."""

In [39]:
# print raw text
raw

'My name is Maximus Decimus Meridius, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. \nAnd I will have my vengeance, in this life or the next.'

In [41]:
# tokenize the text
tokens = word_tokenize(raw)

In [46]:
%pprint

Pretty printing has been turned OFF


In [47]:
tokens

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'Armies', 'of', 'the', 'North', ',', 'General', 'of', 'the', 'Felix', 'Legions', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']

In [43]:
# PorterStemmer
porter = PorterStemmer()
pStems = [porter.stem(t) for t in tokens]

In [48]:
# print the stems
pStems

['My', 'name', 'is', 'maximu', 'decimu', 'meridiu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the', 'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son', ',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'I', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']

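Looking at the generated list of stems, we can see that many words have lost trailing suffixes such as 's', 'es', 'e', 'ed' and 'al' (for example 'Legions' -> 'legion', 'murdered' -> 'murder', 'General' -> 'gener'). Note that a stem is not always a dictionary word.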

In [49]:
# LancasterStemmer()
lancaster = LancasterStemmer()
lStems = [lancaster.stem(t) for t in tokens]

In [50]:
lStems

['my', 'nam', 'is', 'maxim', 'decim', 'meridi', ',', 'command', 'of', 'the', 'army', 'of', 'the', 'nor', ',', 'gen', 'of', 'the', 'felix', 'leg', 'and', 'loy', 'serv', 'to', 'the', 'tru', 'emp', ',', 'marc', 'aureli', '.', 'fath', 'to', 'a', 'murd', 'son', ',', 'husband', 'to', 'a', 'murd', 'wif', '.', 'and', 'i', 'wil', 'hav', 'my', 'veng', ',', 'in', 'thi', 'lif', 'or', 'the', 'next', '.']

Looking at the stems generated by LancasterStemmer, we can see that the dropped suffixes are longer than with Porter, e.g. 'us', 'e', 'th', 'eral'.

The difference is clearly visible: LancasterStemmer is greedier than PorterStemmer when dropping suffixes. It tries to remove as many characters from the end as possible, whereas PorterStemmer removes far less.
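
The contrast between a conservative and a greedy stemmer can be illustrated with a toy suffix stripper. Everything below is hypothetical (the function `strip_suffix` and both suffix lists are invented for this example), and it implements neither the Porter nor the Lancaster algorithm; it only shows how a longer suffix list leads to shorter stems.

```python
def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Hypothetical suffix lists: a short one and a more aggressive one.
CONSERVATIVE = ("s", "es", "ed")
AGGRESSIVE = ("s", "es", "ed", "us", "er", "eral", "ance")

sample_words = ["legions", "murdered", "general", "commander"]
conservative_stems = [strip_suffix(w, CONSERVATIVE) for w in sample_words]
aggressive_stems = [strip_suffix(w, AGGRESSIVE) for w in sample_words]

print(conservative_stems)  # ['legion', 'murder', 'general', 'commander']
print(aggressive_stems)    # ['legion', 'murder', 'gen', 'command']
```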


# Lemmatization: learning to use the WordNetLemmatizer of NLTK

In [51]:
# import required libraries
from nltk import word_tokenize, PorterStemmer, WordNetLemmatizer

In [80]:
# get input text
raw = """My name is Maximus Decimus Meridius, commander of the Armies of the North, 
General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. 
Father to a murdered son, husband to a murdered wife. 
And I will have my vengeance, in this life or the next."""

In [81]:
# print raw text
raw

'My name is Maximus Decimus Meridius, commander of the Armies of the North, \nGeneral of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. \nFather to a murdered son, husband to a murdered wife. \nAnd I will have my vengeance, in this life or the next.'

In [82]:
# tokenize raw text
tokens = word_tokenize(raw)

In [83]:
tokens

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'Armies', 'of', 'the', 'North', ',', 'General', 'of', 'the', 'Felix', 'Legions', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']

In [84]:
# next, apply PorterStemmer
porter = PorterStemmer()
stems = [porter.stem(t) for t in tokens]

In [58]:
# print all stems
stems

['My', 'name', 'is', 'maximu', 'decimu', 'meridiu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the', 'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son', ',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'I', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']

In [59]:
# next, apply the lemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

In [60]:
# print all lemmas
lemmas

['My', 'name', 'is', 'Maximus', 'Decimus', 'Meridius', ',', 'commander', 'of', 'the', 'Armies', 'of', 'the', 'North', ',', 'General', 'of', 'the', 'Felix', 'Legions', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']
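
The lemmas above are identical to the tokens. A likely explanation, stated here as an assumption rather than something this notebook verifies, is that lemmatize() treats every token as a noun unless a part of speech is passed, and that capitalised forms such as 'Armies' are not matched. The toy lookup below is not WordNet: the table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and only illustrate why lowercasing and a part-of-speech tag can change the result.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("armies", "n"): "army",
    ("legions", "n"): "legion",
    ("murdered", "v"): "murder",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if (word, pos) is known, else the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("Armies"))             # 'Armies'   (capitalised form not in the table)
print(toy_lemmatize("armies"))             # 'army'
print(toy_lemmatize("murdered"))           # 'murdered' (looked up as a noun)
print(toy_lemmatize("murdered", pos="v"))  # 'murder'
```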

# Stopwords: learning to use the stopwords corpus

In [61]:
# import required libraries
import nltk
from nltk.corpus import gutenberg
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [62]:
# get the list of words from 'bible-kjv.txt'
gb_words = gutenberg.words('bible-kjv.txt')

In [63]:
gb_words

['[', 'The', 'King', 'James', 'Bible', ']', 'The', ...]

In [64]:
len(gb_words)

1010654

In [70]:
# filter out all words that have fewer than 3 characters
words_filtered = [e for e in gb_words if len(e) >= 3]

In [71]:
words_filtered[:50]

['The', 'King', 'James', 'Bible', 'The', 'Old', 'Testament', 'the', 'King', 'James', 'Bible', 'The', 'First', 'Book', 'Moses', 'Called', 'Genesis', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', 'And', 'the', 'earth', 'was', 'without', 'form', 'and', 'void', 'and', 'darkness', 'was', 'upon', 'the', 'face', 'the', 'deep', 'And', 'the', 'Spirit', 'God', 'moved', 'upon', 'the', 'face']

In [72]:
# create list of stopwords
stopwords = nltk.corpus.stopwords.words('english')
words = [w for w in words_filtered if w.lower() not in stopwords]

In [73]:
# frequency distribution of the filtered words (3+ characters, stopwords removed)
fdistPlain = nltk.FreqDist(words)
# frequency distribution of all tokens in 'bible-kjv.txt' from the Gutenberg corpus
fdist = nltk.FreqDist(gb_words)

In [74]:
# check frequency distribution
print('Following are the most common 10 words minus stopwords')
print(fdistPlain.most_common(10))
print('Following are the most common 10 words')
print(fdist.most_common(10))

Following are the most common 10 words minus stopwords
[('shall', 9760), ('unto', 8940), ('LORD', 6651), ('thou', 4890), ('thy', 4450), ('God', 4115), ('said', 3995), ('thee', 3827), ('upon', 2730), ('man', 2721)]
Following are the most common 10 words
[(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480), ('.', 26160), ('to', 13396), ('And', 12846), ('that', 12576), ('in', 12331)]
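
The same filter-then-count pattern can be sketched with `collections.Counter` from the standard library, which offers the same `most_common()` method. The token list and the tiny stopword set below are made up for this example and stand in for the corpus data used above.

```python
from collections import Counter

# Made-up tokens and a tiny hypothetical stopword set.
sample_tokens = "the LORD said unto the LORD , and the LORD of the earth said".split()
toy_stopwords = {"the", "and", "of"}

# Keep tokens with at least 3 characters that are not stopwords.
kept = [t for t in sample_tokens if len(t) >= 3 and t.lower() not in toy_stopwords]

counts_all = Counter(sample_tokens)
counts_kept = Counter(kept)

print(counts_all.most_common(1))   # [('the', 4)]
print(counts_kept.most_common(2))  # [('LORD', 3), ('said', 2)]
```

Without filtering, a function word tops the list; after filtering, content words surface.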


# Processing two short stories and extracting the common vocabulary between them

In this part we will work through a simple exercise: removing special characters, case folding, splitting the text into words, and applying a few list and set operations.

In [76]:
story1 = """In a far away kingdom, there was a river. This river
was home to many golden swans. The swans spent most of their time
on the banks of the river. Every six months, the swans would leave
a golden feather as a fee for using the lake. The soldiers of the
kingdom would collect the feathers and deposit them in the royal
treasury.
One day, a homeless bird saw the river. "The water in this river
seems so cool and soothing. I will make my home here," thought the
bird.
As soon as the bird settled down near the river, the golden swans
noticed her. They came shouting. "This river belongs to us. We pay
a golden feather to the King to use this river. You can not live
here."
"I am homeless, brothers. I too will pay the rent. Please give me
shelter," the bird pleaded. "How will you pay the rent? You do not
have golden feathers," said the swans laughing. They further added,
"Stop dreaming and leave once." The humble bird pleaded many times.
But the arrogant swans drove the bird away.
"I will teach them a lesson!" decided the humiliated bird.
She went to the King and said, "O King! The swans in your river are
impolite and unkind. I begged for shelter but they said that they
had purchased the river with golden feathers."
The King was angry with the arrogant swans for having insulted the
homeless bird. He ordered his soldiers to bring the arrogant swans
to his court. In no time, all the golden swans were brought to the
King’s court.
"Do you think the royal treasury depends upon your golden feathers?
You can not decide who lives by the river. Leave the river at once
or you all will be beheaded!" shouted the King.
The swans shivered with fear on hearing the King. They flew away
never to return. The bird built her home near the river and lived
there happily forever. The bird gave shelter to all other birds in
the river. """

In [77]:
story1

'In a far away kingdom, there was a river. This river\nwas home to many golden swans. The swans spent most of their time\non the banks of the river. Every six months, the swans would leave\na golden feather as a fee for using the lake. The soldiers of the\nkingdom would collect the feathers and deposit them in the royal\ntreasury.\nOne day, a homeless bird saw the river. "The water in this river\nseems so cool and soothing. I will make my home here," thought the\nbird.\nAs soon as the bird settled down near the river, the golden swans\nnoticed her. They came shouting. "This river belongs to us. We pay\na golden feather to the King to use this river. You can not live\nhere."\n"I am homeless, brothers. I too will pay the rent. Please give me\nshelter," the bird pleaded. "How will you pay the rent? You do not\nhave golden feathers," said the swans laughing. They further added,\n"Stop dreaming and leave once." The humble bird pleaded many times.\nBut the arrogant swans drove the bird away.

In [78]:
story2 = """Long time ago, there lived a King. He was lazy and
liked all the comforts of life. He never carried out his duties as
a King. “Our King does not take care of our needs. He also ignores
the affairs of his kingdom." The people complained.
One day, the King went into the forest to hunt. After having
wandered for quite sometime, he became thirsty. To his relief, he
spotted a lake. As he was drinking water, he suddenly saw a golden
swan come out of the lake and perch on a stone. “Oh! A golden swan.
I must capture it," thought the King.
But as soon as he held his bow up, the swan disappeared. And the
King heard a voice, “I am the Golden Swan. If you want to capture
me, you must come to heaven."
Surprised, the King said, “Please show me the way to heaven." “Do
good deeds, serve your people and the messenger from heaven would
come to fetch you to heaven," replied the voice.
The selfish King, eager to capture the Swan, tried doing some good
deeds in his Kingdom. “Now, I suppose a messenger will come to take
me to heaven," he thought. But, no messenger came.
The King then disguised himself and went out into the street. There
he tried helping an old man. But the old man became angry and said,
“You need not try to help. I am in this miserable state because of
out selfish King. He has done nothing for his people."
Suddenly, the King heard the golden swan’s voice, “Do good deeds and
you will come to heaven." It dawned on the King that by doing
selfish acts, he will not go to heaven.
He realized that his people needed him and carrying out his duties
was the only way to heaven. After that day he became a responsible
King.
"""

In [79]:
story2

'Long time ago, there lived a King. He was lazy and\nliked all the comforts of life. He never carried out his duties as\na King. “Our King does not take care of our needs. He also ignores\nthe affairs of his kingdom." The people complained.\nOne day, the King went into the forest to hunt. After having\nwandered for quite sometime, he became thirsty. To his relief, he\nspotted a lake. As he was drinking water, he suddenly saw a golden\nswan come out of the lake and perch on a stone. “Oh! A golden swan.\nI must capture it," thought the King.\nBut as soon as he held his bow up, the swan disappeared. And the\nKing heard a voice, “I am the Golden Swan. If you want to capture\nme, you must come to heaven."\nSurprised, the King said, “Please show me the way to heaven." “Do\ngood deeds, serve your people and the messenger from heaven would\ncome to fetch you to heaven," replied the voice.\nThe selfish King, eager to capture the Swan, tried doing some good\ndeeds in his Kingdom. “Now, I suppose

In [90]:
# (1) remove special characters in the text (e.g. newlines, commas, full stops, exclamation marks, question marks)
story1 = story1.replace(",", "").replace("\n", "").replace('.', '').replace('"', '').replace("!", "").replace("?","").casefold()

In [91]:
story2 = story2.replace(",", "").replace("\n", "").replace('.', '').replace('"', '').replace("!", "").replace("?", "").casefold()

In [93]:
# check for the last change story1
story1

'in a far away kingdom there was a river this riverwas home to many golden swans the swans spent most of their timeon the banks of the river every six months the swans would leavea golden feather as a fee for using the lake the soldiers of thekingdom would collect the feathers and deposit them in the royaltreasuryone day a homeless bird saw the river the water in this riverseems so cool and soothing i will make my home here thought thebirdas soon as the bird settled down near the river the golden swansnoticed her they came shouting this river belongs to us we paya golden feather to the king to use this river you can not liveherei am homeless brothers i too will pay the rent please give meshelter the bird pleaded how will you pay the rent you do nothave golden feathers said the swans laughing they further addedstop dreaming and leave once the humble bird pleaded many timesbut the arrogant swans drove the bird awayi will teach them a lesson decided the humiliated birdshe went to the king

In [94]:
# check for the last change story2
story2

'long time ago there lived a king he was lazy andliked all the comforts of life he never carried out his duties asa king “our king does not take care of our needs he also ignoresthe affairs of his kingdom the people complainedone day the king went into the forest to hunt after havingwandered for quite sometime he became thirsty to his relief hespotted a lake as he was drinking water he suddenly saw a goldenswan come out of the lake and perch on a stone “oh a golden swani must capture it thought the kingbut as soon as he held his bow up the swan disappeared and theking heard a voice “i am the golden swan if you want to captureme you must come to heavensurprised the king said “please show me the way to heaven “dogood deeds serve your people and the messenger from heaven wouldcome to fetch you to heaven replied the voicethe selfish king eager to capture the swan tried doing some gooddeeds in his kingdom “now i suppose a messenger will come to takeme to heaven he thought but no messenger c

In [95]:
# next, we split the texts in words (tokens)
story1_words = story1.split(" ")
print("First story words: ", story1_words)

First story words:  ['in', 'a', 'far', 'away', 'kingdom', 'there', 'was', 'a', 'river', 'this', 'riverwas', 'home', 'to', 'many', 'golden', 'swans', 'the', 'swans', 'spent', 'most', 'of', 'their', 'timeon', 'the', 'banks', 'of', 'the', 'river', 'every', 'six', 'months', 'the', 'swans', 'would', 'leavea', 'golden', 'feather', 'as', 'a', 'fee', 'for', 'using', 'the', 'lake', 'the', 'soldiers', 'of', 'thekingdom', 'would', 'collect', 'the', 'feathers', 'and', 'deposit', 'them', 'in', 'the', 'royaltreasuryone', 'day', 'a', 'homeless', 'bird', 'saw', 'the', 'river', 'the', 'water', 'in', 'this', 'riverseems', 'so', 'cool', 'and', 'soothing', 'i', 'will', 'make', 'my', 'home', 'here', 'thought', 'thebirdas', 'soon', 'as', 'the', 'bird', 'settled', 'down', 'near', 'the', 'river', 'the', 'golden', 'swansnoticed', 'her', 'they', 'came', 'shouting', 'this', 'river', 'belongs', 'to', 'us', 'we', 'paya', 'golden', 'feather', 'to', 'the', 'king', 'to', 'use', 'this', 'river', 'you', 'can', 'not', '

In [96]:
story2_words = story2.split(" ")
print("Second story words: ", story2_words)

Second story words:  ['long', 'time', 'ago', 'there', 'lived', 'a', 'king', 'he', 'was', 'lazy', 'andliked', 'all', 'the', 'comforts', 'of', 'life', 'he', 'never', 'carried', 'out', 'his', 'duties', 'asa', 'king', '“our', 'king', 'does', 'not', 'take', 'care', 'of', 'our', 'needs', 'he', 'also', 'ignoresthe', 'affairs', 'of', 'his', 'kingdom', 'the', 'people', 'complainedone', 'day', 'the', 'king', 'went', 'into', 'the', 'forest', 'to', 'hunt', 'after', 'havingwandered', 'for', 'quite', 'sometime', 'he', 'became', 'thirsty', 'to', 'his', 'relief', 'hespotted', 'a', 'lake', 'as', 'he', 'was', 'drinking', 'water', 'he', 'suddenly', 'saw', 'a', 'goldenswan', 'come', 'out', 'of', 'the', 'lake', 'and', 'perch', 'on', 'a', 'stone', '“oh', 'a', 'golden', 'swani', 'must', 'capture', 'it', 'thought', 'the', 'kingbut', 'as', 'soon', 'as', 'he', 'held', 'his', 'bow', 'up', 'the', 'swan', 'disappeared', 'and', 'theking', 'heard', 'a', 'voice', '“i', 'am', 'the', 'golden', 'swan', 'if', 'you', 'wan

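As we can see, most special characters are gone and a list of words has been created. The result is not fully clean, though: because newlines were replaced with an empty string rather than a space, the last word of each line is fused with the first word of the next (for example 'riverwas' and 'timeon'), and the curly quotation marks (“ and ’) in the stories were not removed (for example '“our').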
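
One possible way to avoid fused words and leftover quotation marks is sketched below, using only the standard library. The helper name `clean_words` is hypothetical and this is an alternative to the chained replace() calls, not the notebook's method: punctuation is removed with a regular expression, and `str.split()` without arguments treats newlines and repeated spaces as separators.

```python
import re

def clean_words(text):
    """Case-fold, drop punctuation (including curly quotes), split on any whitespace."""
    text = text.casefold()
    text = re.sub(r"[^\w\s]", "", text)  # remove everything that is not a word char or whitespace
    return text.split()                  # newlines and runs of spaces act as separators

snippet = 'This river\nwas home. “Oh! A golden swan.\nI must capture it," thought the King.'
snippet_words = clean_words(snippet)
print(snippet_words)
```

Note that removing the apostrophe also turns a possessive such as "King’s" into "kings", which may or may not be desirable.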

In [97]:
# next, create vocabulary out of this list of words without repeats
story1_vocab = set(story1_words)
print("First story vocabulary: ", story1_vocab)

First story vocabulary:  {'', 'by', 'think', 'shelter', 'fee', 'for', 'them', 'were', 'swans', 'went', 'riverwas', 'soothing', 'o', 'will', 'lives', 'feathers', 'beheaded', 'paya', 'areimpolite', 'kingthe', 'upon', 'water', 'arrogant', 'soldiers', 'once', 'i', 'return', 'at', 'their', 'shivered', 'came', 'how', 'angry', 'happily', 'deposit', 'nothave', 'leavea', 'this', 'settled', 'shouted', 'give', 'away', 'every', 'royal', 'use', 'months', 'thought', 'insulted', 'hearing', 'they', 'said', 'purchased', 'theyhad', 'all', 'my', 'of', 'his', 'brought', 'no', 'treasury', 'brothers', 'dreaming', 'here', 'swansnoticed', 'be', 'king', 'timeon', 'swansto', 'kingdom', 'your', 'day', 'court', 'and', 'but', 'lesson', 'soon', 'spent', 'lake', 'thehomeless', 'depends', 'feathersthe', 'awayi', 'homeless', 'as', 'please', 'on', 'feather', 'banks', 'begged', 'saw', 'humiliated', 'cool', 'built', 'would', 'time', 'other', 'gave', 'birdshe', 'pleaded', 'river', 'down', 'onceor', 'shouting', 'am', 'bird

In [98]:
story2_vocab = set(story2_words)
print("Second story vocabulary: ", story2_vocab)

Second story vocabulary:  {'goldenswan', 'life', 'needs', 'by', 'for', 'doingselfish', 'sometime', 'if', 'selfish', 'miserable', 'went', '“dogood', 'man', 'does', 'captureme', 'will', 'perch', 'care', 'go', 'water', 'deeds', 'fetch', '“please', 'take', 'i', 'good', 'need', 'after', 'carrying', 'voice', 'angry', 'realized', 'wouldcome', 'because', 'this', 'serve', 'way', 'come', 'disguised', 'helping', 'needed', 'dawned', 'quite', 'thought', 'forest', 'up', 'try', 'relief', 'said', 'want', 'eager', 'all', 'state', 'of', 'his', 'long', 'no', 'comforts', 'swani', 'people', 'king', 'havingwandered', 'kingdom', 'your', 'day', 'old', '“now', 'asa', 'and', 'but', 'me', 'heavensurprised', 'stone', 'soon', 'lake', 'voicethe', 'heavenhe', '“our', 'peoplesuddenly', 'kingbut', 'heard', 'duties', 'andyou', 'swan’s', 'carried', 'responsibleking', 'held', 'as', 'on', 'messenger', 'affairs', 'thirsty', 'saw', 'street', 'into', 'complainedone', 'acts', 'also', 'an', 'time', 'andliked', 'takeme', 'am', 

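Calling Python's set() function on the list of words removes duplicates and converts the list into a set. A set is unordered, so the words are printed in arbitrary order. The empty string '' in the first vocabulary comes from the trailing space at the end of story1.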
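
If the order of first appearance matters, one common alternative is `dict.fromkeys()`, which removes duplicates while keeping insertion order. The short word list below is made up for this example.

```python
# Deduplicate with a set (unordered) vs. dict.fromkeys (keeps first-seen order).
word_list = ["the", "king", "the", "swan", "king"]

as_set = set(word_list)
unique_in_order = list(dict.fromkeys(word_list))

print(unique_in_order)  # ['the', 'king', 'swan']
```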

In [99]:
# next, produce the common vocabulary between the 2 stories
common_vocab = story1_vocab & story2_vocab
print("Common vocabulary: ", common_vocab)

Common vocabulary:  {'a', 'by', 'was', 'thought', 'as', 'for', 'on', 'golden', 'saw', 'not', 'went', 'said', 'will', 'all', 'you', 'of', 'his', 'time', 'water', 'no', 'i', 'king', 'there', 'am', 'kingdom', 'your', 'to', 'he', 'day', 'angry', 'and', 'but', 'this', 'soon', 'lake', 'the', 'that', 'in'}
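
The `&` operator used above is set intersection. The sketch below, with two small made-up vocabularies, shows it next to set difference, which yields the words unique to one text; `sorted()` is used only to make the printed order predictable.

```python
# Two small made-up vocabularies.
vocab_a = {"king", "river", "swans", "golden"}
vocab_b = {"king", "lake", "swan", "golden"}

shared = sorted(vocab_a & vocab_b)  # words in both vocabularies
only_a = sorted(vocab_a - vocab_b)  # words found only in the first
only_b = sorted(vocab_b - vocab_a)  # words found only in the second

print(shared)  # ['golden', 'king']
print(only_a)  # ['river', 'swans']
print(only_b)  # ['lake', 'swan']
```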
