## Sentence Tokenizer
- It splits a paragraph into sentences

In [1]:
text = "The limits of your language are the limits of your world.\"-Ludwig Wittgenstein  Our English learner's have a strong support system at home that begs for more resources.  Many times our parents are learning to read and speak English along side of their children.  Sometimes this creates barriers for parents to be able to help their child learn phonetics, letter recognition, and other reading skills."

In [3]:
from nltk.tokenize import sent_tokenize

p = sent_tokenize(text);

print(p)

['The limits of your language are the limits of your world.', '"-Ludwig Wittgenstein  Our English learner\'s have a strong support system at home that begs for more resources.', 'Many times our parents are learning to read and speak English along side of their children.', 'Sometimes this creates barriers for parents to be able to help their child learn phonetics, letter recognition, and other reading skills.']


In [4]:
p[1]

'"-Ludwig Wittgenstein  Our English learner\'s have a strong support system at home that begs for more resources.'

In [7]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

array = tokenizer.tokenize(text)
array

['The limits of your language are the limits of your world.',
 '"-Ludwig Wittgenstein  Our English learner\'s have a strong support system at home that begs for more resources.',
 'Many times our parents are learning to read and speak English along side of their children.',
 'Sometimes this creates barriers for parents to be able to help their child learn phonetics, letter recognition, and other reading skills.']

- english.pickle is a pre-trained Punkt model that has learned English sentence boundaries from English text. sent_tokenize uses the same Punkt model for English by default, which is why both outputs above are identical.
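To see what a trained model adds, it helps to compare it with a naive rule. The sketch below is not how Punkt works; `naive_sent_split` is a hypothetical baseline made up for this comparison. It splits after every `.`, `!` or `?` that is followed by whitespace, and it breaks on abbreviations.

```python
import re

def naive_sent_split(paragraph):
    """Hypothetical baseline: split after ., ! or ? when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())

sample = "Dr. Smith arrived. He sat down."
print(naive_sent_split(sample))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The abbreviation `Dr.` is wrongly treated as a sentence end. Handling cases like this is the main reason to prefer a trained sentence tokenizer over a hand-written rule.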

<h2>Word Tokenizer</h2>

- It's used to tokenize sentences into words

In [8]:
from nltk.tokenize import word_tokenize;

words = word_tokenize(array[0]);
words[3]

'your'

In [9]:
#Various ways to perform word tokenization
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer

tokenizer_2 = TreebankWordTokenizer();
tokenizer_3 = WordPunctTokenizer();

sent = "I won't be able to come today"

result1 = word_tokenize(sent);
print(result1);

result2 = tokenizer_2.tokenize(sent);
print(result2);

result3 = tokenizer_3.tokenize(sent);
print(result3);

['I', 'wo', "n't", 'be', 'able', 'to', 'come', 'today']
['I', 'wo', "n't", 'be', 'able', 'to', 'come', 'today']
['I', 'won', "'", 't', 'be', 'able', 'to', 'come', 'today']


- word_tokenize, TreebankWordTokenizer and WordPunctTokenizer behave the same in most cases.
- The difference shows up when a sentence contains contractions such as won't or can't.
- word_tokenize and TreebankWordTokenizer give the same result: both split won't into 'wo' and "n't".
- WordPunctTokenizer splits at the apostrophe and keeps the ' as a separate token, giving 'won', "'", 't'.
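The WordPunctTokenizer result can be reproduced with plain `re`. This is a minimal sketch that assumes the tokenizer is defined by the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which is what the NLTK documentation describes; the name `WORD_PUNCT` is introduced here only for illustration.

```python
import re

# Assumed to mirror WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sent = "I won't be able to come today"
tokens = WORD_PUNCT.findall(sent)
print(tokens)
# ['I', 'won', "'", 't', 'be', 'able', 'to', 'come', 'today']
```

Because the apostrophe is not a word character, it ends the token `won` and becomes a punctuation token of its own.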

<h2>Regexp Tokenizer</h2>

- Using a regular expression to perform word tokenization
- Example: in the sentence above, the tokenizers broke a word like won't into two parts, so we can use a regular expression to tell the tokenizer not to break such words.

In [11]:
from nltk.tokenize import regexp_tokenize;

sent = "I won't be in class today, because I can't drive";

result = regexp_tokenize(sent, r"[\w']+"); #raw string, so \w is not treated as a string escape
print(result)

#comparing it with word_tokenize
result2 = word_tokenize(sent);
print(result2)

['I', "won't", 'be', 'in', 'class', 'today', 'because', 'I', "can't", 'drive']
['I', 'wo', "n't", 'be', 'in', 'class', 'today', ',', 'because', 'I', 'ca', "n't", 'drive']


- Here we can see that won't is not broken into parts, because the regular expression treats ' as part of a word. Note that the comma is dropped as well, since it matches neither \w nor '.
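One side effect of `[\w']+` is that quotation apostrophes are kept too. The sketch below compares it with a stricter pattern that allows an apostrophe only between word characters. The stricter pattern is a suggestion for illustration, not something NLTK provides, and it uses plain `re.findall` rather than `regexp_tokenize`.

```python
import re

loose = r"[\w']+"          # the pattern used above
strict = r"\w+(?:'\w+)*"   # apostrophes allowed only between word characters

sample = "I won't say 'hello'"
print(re.findall(loose, sample))
# ['I', "won't", 'say', "'hello'"]
print(re.findall(strict, sample))
# ['I', "won't", 'say', 'hello']
```

Both patterns keep `won't` whole; only the strict one strips the quotes around `hello`.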

<h2>Stop words</h2>

In [12]:
from nltk.corpus import stopwords

In [15]:
STOP_WORDS = stopwords.words('english');
print(STOP_WORDS)  

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [23]:
clean_text = [word for word in text.split() if word.lower() not in STOP_WORDS];

In [24]:
print('Text before removing stopwords: ');
print(text)

print('-'*50);

print('Text after removing stopwords: ');
print(' '.join(clean_text));

Text before removing stopwords: 
The limits of your language are the limits of your world."-Ludwig Wittgenstein  Our English learner's have a strong support system at home that begs for more resources.  Many times our parents are learning to read and speak English along side of their children.  Sometimes this creates barriers for parents to be able to help their child learn phonetics, letter recognition, and other reading skills.
--------------------------------------------------
Text after removing stopwords: 
limits language limits world."-Ludwig Wittgenstein English learner's strong support system home begs resources. Many times parents learning read speak English along side children. Sometimes creates barriers parents able help child learn phonetics, letter recognition, reading skills.
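In the output above, punctuation stays glued to the words (for example `resources.`), because `text.split()` only splits on whitespace. A stop word followed directly by punctuation would therefore never be removed. The sketch below tokenizes with a regular expression before filtering. `SAMPLE_STOP_WORDS` is a small hand-written stand-in for `stopwords.words('english')`, and `remove_stop_words` is a hypothetical helper.

```python
import re

# Small stand-in for stopwords.words('english'); an assumption for this sketch.
SAMPLE_STOP_WORDS = {"the", "of", "your", "are", "our", "a", "at", "that", "for", "more", "have"}

def remove_stop_words(text, stop_words):
    """Keep word tokens (apostrophes allowed inside) that are not stop words."""
    tokens = re.findall(r"\w+(?:'\w+)*", text)
    return [tok for tok in tokens if tok.lower() not in stop_words]

sample = "The limits of your language are the limits of your world."
print(remove_stop_words(sample, SAMPLE_STOP_WORDS))
# ['limits', 'language', 'limits', 'world']
```

The trade-off is that punctuation is discarded entirely, which is usually acceptable for bag-of-words style processing but not when sentence structure matters.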


<h2>Wordnet</h2>
- It's a lexical database (dictionary) of English with some advanced features, such as links between related word senses

<h4>synsets</h4>


- A synset is a synonym set: a collection of words that share one meaning
- wordnet.synsets(w) returns one synset for each sense (context) of the word w

In [38]:
from nltk.corpus import wordnet
word1 = "weapon";

synArray = wordnet.synsets(word1);

print(synArray); #two synsets (senses) are returned; a name like 'weapon.n.01' reads as lemma.pos.sense_number
#the letter after the first dot is the part of speech (n = noun)

#let's pick the first synset
woi = synArray[0];

#lemma_names() gives the words (lemmas) that belong to this synset
print("Related words: ", woi.lemma_names())

#definition() gives the definition of this synset
print("Definition of weapon: ",woi.definition())

#Obtaining part of speech
print("Parts of Speech of weapon: ", woi.pos())

#examples() returns example sentences for this synset
print(woi.examples())

[Synset('weapon.n.01'), Synset('weapon.n.02')]
Related words:  ['weapon', 'arm', 'weapon_system']
Definition of weapon:  any instrument or instrumentality used in fighting or hunting
Parts of Speech of weapon:  n
['he was licensed to carry a weapon']


<h4>hypernyms and hyponyms</h4>


- hypernyms() returns the more general synsets for a given synset: the parent
- hyponyms() returns the more specific synsets for a given synset: the children

In [41]:
#example for hypernyms

hypernyms = woi.hypernyms();
print(hypernyms)

#obtaining the definition of the first hypernym
print(hypernyms[0].definition())

[Synset('instrument.n.01')]
a device that requires skill for proper use


In [48]:
#example for hyponyms

hyponyms = woi.hyponyms();
print(hyponyms) #it will return all the specific contexts related to weapons

print('-'*50)

#fetch first synset
first_hyponym = hyponyms[0];

#obtaining definition
print("definition: ", first_hyponym.definition())

#obtaining lemma names
print("lemma names: ", first_hyponym.lemma_names())

[Synset('bow.n.04'), Synset('bow_and_arrow.n.01'), Synset('brass_knucks.n.01'), Synset('fire_ship.n.01'), Synset('flamethrower.n.01'), Synset('greek_fire.n.01'), Synset('gun.n.01'), Synset('knife.n.02'), Synset('light_arm.n.01'), Synset('missile.n.01'), Synset('pike.n.04'), Synset('projectile.n.01'), Synset('slasher.n.02'), Synset('sling.n.04'), Synset('spear.n.01'), Synset('stun_gun.n.01'), Synset('sword.n.01'), Synset('tomahawk.n.01'), Synset('weapon_of_mass_destruction.n.01')]
--------------------------------------------------
definition:  a weapon for shooting arrows, composed of a curved piece of resilient wood with a taut cord to propel the arrow
lemma names:  ['bow']


<h4>Lemmas, Synonyms, Antonyms</h4>

In [50]:
word = 'win';

#obtaining synsets
synsets = wordnet.synsets(word)
print(synsets)

[Synset('win.n.01'), Synset('winnings.n.01'), Synset('win.v.01'), Synset('acquire.v.05'), Synset('gain.v.05'), Synset('succeed.v.01')]


In [51]:
#let's pick win as a verb, which is at the third position (index 2)
woi = synsets[2];

#obtaining lemmas
lemmas_win = woi.lemmas();
print(lemmas_win)

[Lemma('win.v.01.win')]


In [53]:
#obtaining synonyms
synonymsArray = [];

for syn in synsets: #traversing all synsets of 'win'
    for lem in syn.lemmas(): #traversing all lemmas obtained for each synsets
        synonymsArray.append(lem.name()); #using name() method to obtain name
        
print(synonymsArray)

['win', 'winnings', 'win', 'profits', 'win', 'acquire', 'win', 'gain', 'gain', 'advance', 'win', 'pull_ahead', 'make_headway', 'get_ahead', 'gain_ground', 'succeed', 'win', 'come_through', 'bring_home_the_bacon', 'deliver_the_goods']
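The list above contains repeats such as `win` and `gain`, because the same lemma name appears in several synsets. A minimal sketch for removing the repeats while keeping first-seen order is shown below; the list is copied from the output above, and the variable names are introduced here only for illustration.

```python
# Copied from the printed output above.
synonyms_with_repeats = ['win', 'winnings', 'win', 'profits', 'win', 'acquire', 'win',
                         'gain', 'gain', 'advance', 'win', 'pull_ahead', 'make_headway',
                         'get_ahead', 'gain_ground', 'succeed', 'win', 'come_through',
                         'bring_home_the_bacon', 'deliver_the_goods']

# dict keys are unique and keep insertion order, so this drops later repeats.
unique_synonyms = list(dict.fromkeys(synonyms_with_repeats))
print(unique_synonyms)
print(len(synonyms_with_repeats), '->', len(unique_synonyms))
```

Using `set(...)` would also remove the repeats, but it would not preserve the order in which the synsets were visited.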


In [55]:
#obtaining antonyms
antonymsArray = [];

for syn in synsets:
    for lem in syn.lemmas():
        for antonyms in lem.antonyms():
            antonymsArray.append(antonyms.name());
            
print(antonymsArray)

['losings', 'lose', 'lose', 'fall_back', 'fail']


In [66]:
#description: how antonyms are obtained?
print(synsets)
syn = synsets[2];

print(syn.lemmas());
lemma = syn.lemmas()[0];

print(lemma.antonyms());
antonym = lemma.antonyms()[0].name();

print(antonym)


[Synset('win.n.01'), Synset('winnings.n.01'), Synset('win.v.01'), Synset('acquire.v.05'), Synset('gain.v.05'), Synset('succeed.v.01')]
[Lemma('win.v.01.win')]
[Lemma('lose.v.02.lose')]
lose
win.v.01


<h4>Difference between Lemma and Synset</h4>
<b>Lemma: </b>A word in canonical form, tied to one specific meaning (for example, win.v.01.win is the word "win" in the sense win.v.01)
<br>
<b>Synset: </b> A set of lemmas that share one meaning. A word with several senses therefore appears in several synsets, one per sense

<h3>WUP similarity</h3>

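- Wu-Palmer (WUP) similarity scores two synsets by the depth of their closest shared hypernym in the taxonomy, relative to the depths of the two synsets themselves.
- Two synsets whose shared hypernym is close and specific (for example, co-hyponyms with the same parent) score higher than synsets whose only common ancestor is far up the hierarchy.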
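The usual statement of the Wu-Palmer score is `2 * depth(LCS) / (depth(a) + depth(b))`, where LCS is the deepest hypernym shared by both synsets. The sketch below applies that formula to a toy taxonomy. The taxonomy, the function names and the convention that the root has depth 1 are assumptions made for illustration; this is not WordNet data and not NLTK's implementation, so the numbers will differ from `wup_similarity`.

```python
# Toy taxonomy (child -> parent); invented for illustration, not WordNet data.
PARENT = {
    "food": None,
    "baked_goods": "food",
    "bread": "baked_goods",
    "cake": "baked_goods",
    "loaf_of_bread": "bread",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def toy_wup(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), LCS = deepest shared ancestor."""
    ancestors_a = set(path_to_root(a))
    # path_to_root(b) runs from b upward, so the first hit is the deepest one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(toy_wup("bread", "loaf_of_bread"))  # 2*3 / (3+4)
print(toy_wup("bread", "cake"))           # 2*2 / (3+3)
```

A parent and its child share a deep LCS (the parent itself) and score higher than two siblings, whose LCS sits one level further up.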

In [71]:
w1 = 'cake';
w2 = 'loaf';
w3 = 'bread';

print(wordnet.synsets(w1));
cake = wordnet.synsets(w1)[0];

print(wordnet.synsets(w2));
loafb = wordnet.synsets(w2)[0];
loaf  = wordnet.synsets(w2)[1];

print(wordnet.synsets(w3));
bread = wordnet.synsets(w3)[0];

[Synset('cake.n.01'), Synset('patty.n.01'), Synset('cake.n.03'), Synset('coat.v.03')]
[Synset('loaf_of_bread.n.01'), Synset('loaf.n.02'), Synset('bum.v.02'), Synset('loiter.v.01')]
[Synset('bread.n.01'), Synset('boodle.n.01'), Synset('bread.v.01')]


In [75]:
print('wup_similarities are: ')
print('bread & loaf: ', bread.wup_similarity(loaf))
print('bread & loafb: ', bread.wup_similarity(loafb))
print('loaf & loafb: ', loaf.wup_similarity(loafb))

wup_similarities are: 
bread & loaf:  0.7692307692307693
bread & loafb:  0.9411764705882353
loaf & loafb:  0.7142857142857143


<h4>How wup_similarity is calculated, overview:</h4>

In [77]:
#let's find hypernym of loaf
print(loaf.hypernyms()) #loaf has only one hypernym
ref = loaf.hypernyms()[0]

[Synset('food.n.02')]


In [78]:
#now we'll find the shortest path distance from each synset to this reference hypernym

print(loaf.shortest_path_distance(ref))
print(bread.shortest_path_distance(ref))
print(loafb.shortest_path_distance(ref))
print(cake.shortest_path_distance(ref))

1
2
3
8


We can see that:
- loaf is 1 edge away from the hypernym food.n.02
- bread is 2 edges away from the hypernym
- loafb is 3 edges away from the hypernym
- cake is 8 edges away from the hypernym

<h4>Path and LCH Similarities</h4>
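Both measures build on the same shortest-path distance used above. NLTK exposes them as `path_similarity` and `lch_similarity` on synsets. The sketch below reimplements only the arithmetic, using the distances to food.n.02 printed earlier: path similarity is `1 / (distance + 1)`, and Leacock-Chodorow (LCH) is `-log((distance + 1) / (2 * D))`, where `D` is the maximum depth of the taxonomy. The function names and the value of `ASSUMED_TAXONOMY_DEPTH` are hypothetical and chosen for illustration.

```python
import math

def path_similarity_from_distance(distance):
    """Path similarity: 1 / (shortest path length in edges + 1)."""
    return 1 / (distance + 1)

def lch_similarity_from_distance(distance, taxonomy_depth):
    """Leacock-Chodorow: -log((distance + 1) / (2 * taxonomy_depth))."""
    return -math.log((distance + 1) / (2 * taxonomy_depth))

ASSUMED_TAXONOMY_DEPTH = 20  # hypothetical value chosen for illustration only

# Distances to food.n.02 taken from the output above.
for name, d in [("loaf", 1), ("bread", 2), ("loafb", 3), ("cake", 8)]:
    print(name,
          round(path_similarity_from_distance(d), 3),
          round(lch_similarity_from_distance(d, ASSUMED_TAXONOMY_DEPTH), 3))
```

Both scores fall as the distance grows. Path similarity stays between 0 and 1, while LCH is on a logarithmic scale and depends on the depth of the taxonomy.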

<h2>Decontraction</h2>

In [None]:
class Replacer:
    # (contraction, expansion) pairs; the list is unfinished
    patterns = [
        ("won't", "will not"), ("'d", " would"),
    ]

    def __init__(self):
        pass
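The class above is only a stub. One way it might be completed is sketched below with regular expressions. The pattern list is illustrative and far from exhaustive, the `replace` method is a hypothetical addition, and treating `'d` as "would" is an assumption, since it can also stand for "had".

```python
import re

# Illustrative (pattern, replacement) pairs; not an exhaustive list.
# Irregular forms come first so the general rules do not mangle them.
CONTRACTION_PATTERNS = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "cannot"),
    (r"(\w+)n't\b", r"\1 not"),
    (r"(\w+)'d\b", r"\1 would"),   # assumption: 'd means "would", not "had"
    (r"(\w+)'ll\b", r"\1 will"),
]

class Replacer:
    def __init__(self, patterns=CONTRACTION_PATTERNS):
        # Compile once so repeated calls to replace() stay cheap.
        self.patterns = [(re.compile(p, re.IGNORECASE), repl) for p, repl in patterns]

    def replace(self, text):
        """Apply every pattern in order and return the expanded text."""
        for regex, repl in self.patterns:
            text = regex.sub(repl, text)
        return text

replacer = Replacer()
print(replacer.replace("I won't be in class today, because I can't drive"))
# I will not be in class today, because I cannot drive
print(replacer.replace("She'd say they didn't know"))
# She would say they did not know
```

Order matters: if the general `n't` rule ran before the `won't` rule, `won't` would become `wo not`.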