# Text normalization cont.
Last time we finished with text normalization activities such as stemming and lemmatization (removing [inflectional](https://en.wikipedia.org/wiki/Inflection) affixes such as **ed, ing, ize, s, de**).

Note that

- stemming can produce a word that is not in the dictionary.
- lemmatization ensures the result is a dictionary word.
- stemming is fast compared to lemmatization.


We can also apply both techniques, lemmatizing first and then stemming, to reduce a word further.

In [1]:
# install nltk and fetch its data before first use (needed only once)
!pip install nltk
import nltk
nltk.download('all')
from nltk import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [2]:
ps = PorterStemmer()
word = 'muses'
ps.stem(word)

'muse'

In [3]:
wn_lm= WordNetLemmatizer()
lm_word = wn_lm.lemmatize(word)
print(lm_word)
print(ps.stem(lm_word))

mus
mu


# Expanding contractions

This activity involves replacing contractions with their full forms, for example
- can't with cannot
- should've with should have
- weren't with were not


# Any suggestions on how to do it? Which module and function?

In [4]:
import re
# raw strings keep \w and \g<1> from being treated as string escapes
contraction_patterns = [(r"can't", 'cannot'),
                        (r"haven't", 'have not'),
                        (r"(\w+)'ll", r'\g<1> will'),
                        (r"(\w+)'re", r'\g<1> are')]


In [25]:
class contraction_replacer(object):
    def __init__(self, contraction_patterns):
        # store compiled regex objects together with their replacement text
        self._contraction_regexes = [(re.compile(p), replaced_text) for p, replaced_text in contraction_patterns]

    # public name (no leading double underscore) so it can be called from outside the class
    def do_contraction_normalization(self, text):
        # apply every pattern in turn
        for contraction_regex, replaced_text in self._contraction_regexes:
            text = contraction_regex.sub(replaced_text, text)
        return text


**Let's use it**

In [26]:
sample_contraction_replacer = contraction_replacer(contraction_patterns)

In [7]:
sample_contraction_replacer.do_contraction_normalization("We'll do this work")

'We will do this work'

In [8]:
# removing contraction and tokenize
nltk.tokenize.word_tokenize(sample_contraction_replacer.do_contraction_normalization("We'll do this work"))

['We', 'will', 'do', 'this', 'work']
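
As an alternative sketch, the fixed contractions can be merged into a single alternation so the text is scanned once for them. The `CONTRACTIONS` table and the `expand_contractions` helper below are illustrative names that do not appear in the class above, and the table is deliberately incomplete.

```python
import re

# Hypothetical, deliberately small lookup table; extend it for real use.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "haven't": "have not",
    "should've": "should have",
    "weren't": "were not",
}

# One alternation over the fixed forms, plus the generic 'll / 're suffixes.
_FIXED = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
_SUFFIXES = [(re.compile(r"(\w+)'ll\b"), r"\g<1> will"),
             (re.compile(r"(\w+)'re\b"), r"\g<1> are")]


def expand_contractions(text):
    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Keep the capitalisation of the first letter ("Can't" -> "Cannot").
        return expanded.capitalize() if found[0].isupper() else expanded

    text = _FIXED.sub(_replace, text)
    for regex, replacement in _SUFFIXES:
        text = regex.sub(replacement, text)
    return text


print(expand_contractions("We'll see, but they can't say we weren't warned."))
```

A lookup table handles irregular forms such as won't, which a generic suffix pattern cannot expand correctly.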

# Removing repeated characters

In [15]:
class repeat_replacer(object):
    def __init__(self, repeat_patterns, sub_pattern):
        # store compiled regex object
        self._repeat_regexes = re.compile(repeat_patterns)
        self._sub_pattern = sub_pattern

    def do_repeat_normalization(self, word):
        # each pass removes one repeated character
        compressed_word = self._repeat_regexes.sub(self._sub_pattern, word)
        # recurse until the word stops changing
        if compressed_word != word:
            compressed_word = self.do_repeat_normalization(compressed_word)
        return compressed_word

In [16]:
# Notice how the backreferences (\1, \2, \3) are used, e.g. looove -> loove -> love
sample_repeat_replacer = repeat_replacer(r'(\w*)(\w)\2(\w*)', r'\1\2\3' )


In [17]:
sample_repeat_replacer.do_repeat_normalization('ooooh'), sample_repeat_replacer.do_repeat_normalization('loooove')

('oh', 'love')

What happens when a correctly spelled word contains a repeated character?

In [18]:
sample_repeat_replacer.do_repeat_normalization('sheep')

'shep'

In [19]:
from nltk.corpus import wordnet
class repeat_replacer(object):
    def __init__(self, repeat_patterns, sub_pattern):
        # store compiled regex object
        self._repeat_regexes = re.compile(repeat_patterns)
        self._sub_pattern = sub_pattern

    def do_repeat_normalization(self, word):
        # stop as soon as WordNet knows the word
        if wordnet.synsets(word):
            return word
        compressed_word = self._repeat_regexes.sub(self._sub_pattern, word)
        # otherwise keep removing repeated characters until nothing changes
        if compressed_word != word:
            compressed_word = self.do_repeat_normalization(compressed_word)
        return compressed_word

In [20]:
repeat_replacer_inst = repeat_replacer(r'(\w*)(\w)\2(\w*)', r'\1\2\3')

In [21]:
repeat_replacer_inst.do_repeat_normalization('sheep')

'sheep'
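
The recursion above can also be written as a loop. The sketch below assumes a small hand-made `KNOWN_WORDS` set in place of WordNet, purely so that it runs without downloading a corpus; `squeeze_repeats` is a hypothetical helper name. The stopping rule is the same: stop as soon as the word is known or no repeated character is left.

```python
import re

# Stand-in for the WordNet lookup; a hypothetical, tiny vocabulary.
KNOWN_WORDS = {"love", "sheep", "oh", "good"}

# (\w)\2 matches a character followed by a copy of itself;
# each substitution pass drops one of the two copies.
_REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")


def squeeze_repeats(word, known_words=KNOWN_WORDS):
    while word not in known_words:
        shorter = _REPEAT.sub(r"\1\2\3", word)
        if shorter == word:  # no repeated character left
            break
        word = shorter
    return word


print(squeeze_repeats("loooove"))
print(squeeze_repeats("sheep"))
print(squeeze_repeats("goooood"))
```

With an empty vocabulary the last call would keep shrinking past "good", which is why the dictionary check matters.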

# Spelling correction with Enchant

Go to http://www.abisource.com/projects/enchant/ to learn more.

In [61]:
!pip install pyenchant

Collecting pyenchant
  Downloading https://files.pythonhosted.org/packages/9e/54/04d88a59efa33fefb88133ceb638cdf754319030c28aadc5a379d82140ed/pyenchant-2.0.0.tar.gz (64kB)
Installing collected packages: pyenchant
  Running setup.py install for pyenchant ... done
Successfully installed pyenchant-2.0.0


# Let's build a spell checker class. We need
- a spellchecking library like enchant (we just installed it)
- a dictionary for it to use


# aspell demo at command prompt

# Let's see how enchant works

In [27]:
import enchant
enchant.list_dicts()


[('en_US', <Enchant: Myspell Provider>),
 ('en', <Enchant: Aspell Provider>),
 ('en_CA', <Enchant: Aspell Provider>),
 ('en_GB', <Enchant: Aspell Provider>)]

In [29]:
dict_int = enchant.Dict('en')
dict_int.check('love'), dict_int.check('lov')

(True, False)

In [30]:
dict_int.suggest('scien')

['scion',
 'skien',
 'sci en',
 'sci-en',
 'science',
 'scenic',
 'scene',
 'menisci',
 'Siena',
 'Lucien']

# How edit distance works

*Edit distance is the minimum number of character changes (insertions, deletions, substitutions) needed to transform one word into another.*

See Wikipedia for details: https://en.wikipedia.org/wiki/Edit_distance

In [31]:
from nltk.metrics import edit_distance
edit_distance('sciena', 'science')

2
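
To see where that number comes from, here is a minimal sketch of the classic dynamic-programming (Levenshtein) computation. The function name `levenshtein` is made up for this example; it counts insertions, deletions and substitutions at cost 1 each, and should agree with `edit_distance` called with its default arguments.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("sciena", "science"))  # 2: substitute a -> c, then insert e
```

Only two rows of the table are kept at a time, so memory grows with the length of the target word rather than with the product of both lengths.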

# Let's write the class for performing spell correction
- import enchant and initialize a dictionary for it to use (we will use the open-source aspell, http://aspell.net/)
- import edit_distance from nltk.metrics

In [32]:
import enchant
from nltk.metrics import edit_distance

class spell_checker(object):
    def __init__(self, dict_name='en_US', max_edit_dist=3):
        self._dict = enchant.Dict(dict_name)
        self._max_edit_dist = max_edit_dist

    def _word_with_min_dist(self, word, suggestions):
        print(suggestions)
        corrected_word = word
        # only the first suggestion returned by enchant is considered;
        # the slice also avoids an IndexError when there are no suggestions
        for sug in suggestions[:1]:
            distance = edit_distance(word, sug)
            # accept the suggestion only if it is close enough to the input
            if distance < self._max_edit_dist:
                print(distance, sug)
                corrected_word = sug
        return corrected_word

    def check_spell(self, word):
        # a correctly spelled word is returned unchanged
        if self._dict.check(word):
            return word
        # otherwise try the dictionary's suggestions
        return self._word_with_min_dist(word, self._dict.suggest(word))

In [33]:
spell_check_int = spell_checker()

In [34]:
spell_check_int.check_spell('jukeboc')

['jukebox', 'juke', 'cookbook', 'kickback']
1 jukebox


'jukebox'

# Use the right dictionary

In [35]:
us_spell_check_inst = spell_checker('en_US')
us_spell_check_inst.check_spell('theater')

'theater'

In [36]:
br_spell_check_inst = spell_checker('en_GB')
br_spell_check_inst.check_spell('theater')

['theatre', 'heater', 'cheater', 'theta', 'that', 'eater', 'hater', 'tater', 'threader', 'beater', 'header', 'neater', 'teeter', 'Theiler']
2 theatre


'theatre'

# Adding custom word list

In [37]:
%%bash
echo -e "deeplearning\nnlp" > my_words.txxxt
cat my_words.txxxt

deeplearning
nlp


In [38]:
d1 = enchant.Dict('en_US')
d1.check('nlp')

False

In [39]:
d2 = enchant.DictWithPWL('en_US', 'my_words.txxxt')
d2.check('nlp')

True

# Synonyms

In [40]:
class synonynm(object):
    def __init__(self, word_map):
        self._map = word_map
    def get_synonym(self, word):
        return self._map.get(word, word)
        

In [42]:
synonynm_inst = synonynm({'bday': 'birthday', 'yolo':'you live only once'})

In [43]:
synonynm_inst.get_synonym('yolo')

'you live only once'

We could have used a plain dictionary directly, but that is not an extensible solution. The synonym dictionary can be maintained in any format, with the synonym class acting as a wrapper around it.

In [50]:
import csv
class csv_based_synonym(synonynm):
    def __init__(self, file_name):
        word_map = {}
        # close the file when done; skipinitialspace drops the blank after each comma
        with open(file_name, newline='') as csv_file:
            for line in csv.reader(csv_file, skipinitialspace=True):
                word, syn = line
                word_map[word] = syn
        super(csv_based_synonym, self).__init__(word_map)

In [48]:
%%bash
# Let's create a csv file
echo -e 'hpy, happy\nbday, birthday' > syn.csv
cat syn.csv

hpy, happy
bday, birthday


In [51]:
csv_based_synonym_int = csv_based_synonym('syn.csv')
csv_based_synonym_int.get_synonym('hpy')

'happy'
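
Any other storage format can be wrapped the same way. As a sketch, here is a JSON-backed variant; `json_based_synonym`, `synonym_map` and the file name `syn.json` are hypothetical, and a minimal copy of the base wrapper is repeated so that the block runs on its own.

```python
import json
import os
import tempfile


class synonym_map(object):
    """Minimal stand-in for the wrapper class defined earlier."""
    def __init__(self, word_map):
        self._map = word_map

    def get_synonym(self, word):
        # fall back to the word itself when no synonym is known
        return self._map.get(word, word)


class json_based_synonym(synonym_map):
    def __init__(self, file_name):
        with open(file_name, encoding="utf-8") as handle:
            word_map = json.load(handle)
        super(json_based_synonym, self).__init__(word_map)


# Write a small example file to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "syn.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump({"hpy": "happy", "bday": "birthday"}, handle)

json_syn = json_based_synonym(path)
print(json_syn.get_synonym("hpy"))    # happy
print(json_syn.get_synonym("hello"))  # hello (unknown words pass through)
```

Only the constructor changes between formats; the lookup logic stays in the base class.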

# Replacing negation with antonyms

# Review of WordNet (a lexical database for the English language)

nltk provides an interface for looking up WordNet synsets (sets of synonymous words that express the same concept).

In [52]:
from nltk.corpus import wordnet

In [53]:
wordnet.synsets('hike')

[Synset('hike.n.01'),
 Synset('rise.n.09'),
 Synset('raise.n.01'),
 Synset('hike.v.01'),
 Synset('hike.v.02')]

In [56]:
for syn in wordnet.synsets('hike'):
    print(syn.name())
    print(syn.definition())
    
    

hike.n.01
a long walk usually for exercise or pleasure
rise.n.09
an increase in cost
raise.n.01
the amount a salary is increased
hike.v.01
increase
hike.v.02
walk a long way, as for pleasure or physical exercise


## Looking for lemmas and synonyms

In [58]:
# Let's take first synset for science
syn = wordnet.synsets('science')[0]
syn
wordnet.synsets('science')

[Synset('science.n.01'), Synset('skill.n.02')]

In [42]:
syn.lemmas()

[Lemma('science.n.01.science'), Lemma('science.n.01.scientific_discipline')]

In [59]:
# can treat lemmas as synonyms
[l.name() for l in syn.lemmas()]

['science', 'scientific_discipline']

# Antonyms
Lemmas have antonyms.

In [60]:
wordnet.synsets('glad')

[Synset('gladiolus.n.01'),
 Synset('glad.a.01'),
 Synset('glad.s.02'),
 Synset('glad.s.03'),
 Synset('beaming.s.01')]

In [61]:
syn = wordnet.synsets('glad')[1]
syn.definition()

'showing or causing joy and pleasure; especially made happy'

In [63]:
glad_antonyms = syn.lemmas()[0].antonyms()
glad_antonyms


[Lemma('sad.a.01.sad')]

In [64]:
glad_antonyms[0].synset().definition()

'experiencing or showing sorrow or unhappiness'

# Back to replacing negation with antonyms

In [70]:
class antonym_replacer(object):
    def _find_antonym(self, word, pos=None):
        # collect the antonyms of every lemma of every synset of the word
        antonyms = set()
        for syn in wordnet.synsets(word, pos=pos):
            for lem in syn.lemmas():
                for ant in lem.antonyms():
                    antonyms.add(ant.name())
        # replace only when the antonym is unambiguous
        if len(antonyms) == 1:
            return antonyms.pop()
        else:
            return None

    def remove_negation(self, sent):
        i, n = 0, len(sent)
        clean_words = []
        while i < n:
            word = sent[i]
            if word == 'not' and i + 1 < n:
                ant = self._find_antonym(sent[i + 1])
                print(ant)
                if ant:
                    # drop 'not' and the negated word, keep the antonym
                    clean_words.append(ant)
                    i += 2
                    continue
            # no usable antonym: keep the word as it is
            clean_words.append(word)
            i += 1
        return clean_words
                                     
                    

In [66]:
sentence = "Let's not uglify    this place"
tokens = nltk.word_tokenize(sentence)

tokens

['Let', "'s", 'not', 'uglify', 'this', 'place']

In [68]:
# What if we want to keep Let's together?

regex= nltk.RegexpTokenizer(r'\s+' , gaps=True)
regex.tokenize(sentence)

["Let's", 'not', 'uglify', 'this', 'place']

In [72]:
antonym_replacer_inst= antonym_replacer()
antonym_replacer_inst.remove_negation(tokens)

beautify


['Let', "'s", 'beautify', 'this', 'place']

# Side: WordNet Methods you may find useful for your work

Synsets are organised in the form of a tree using **hypernyms** and **hyponyms**.

**hypernyms:** more abstract (general) terms

**hyponyms:** more specific terms

In [73]:
syn = wordnet.synsets('hike')[0]
print(syn.name())
print(syn.definition())
syn.hypernyms()

hike.n.01
a long walk usually for exercise or pleasure


[Synset('walk.n.04')]

In [74]:
syn.hypernyms()[0].hyponyms()

[Synset('amble.n.01'),
 Synset('constitutional.n.01'),
 Synset('foot.n.07'),
 Synset('hike.n.01'),
 Synset('last_mile.n.01'),
 Synset('moonwalk.n.02'),
 Synset('perambulation.n.01'),
 Synset('turn.n.12'),
 Synset('walk-through.n.04'),
 Synset('walkabout.n.03')]

In [75]:
syn.hypernym_paths()

[[Synset('entity.n.01'),
  Synset('abstraction.n.06'),
  Synset('psychological_feature.n.01'),
  Synset('event.n.01'),
  Synset('act.n.02'),
  Synset('action.n.01'),
  Synset('change.n.03'),
  Synset('motion.n.06'),
  Synset('travel.n.01'),
  Synset('walk.n.04'),
  Synset('hike.n.01')]]

# Finding similarity
The hypernym tree can be used to measure similarity between synsets.

## Wu-Palmer Similarity

It is based on how similar the word senses are and on the relative position of the synsets in the hypernym tree.

In [76]:
syns = wordnet.synsets('slip')
syns

[Synset('faux_pas.n.01'),
 Synset('slip.n.02'),
 Synset('slip.n.03'),
 Synset('cutting.n.02'),
 Synset('slip.n.05'),
 Synset('mooring.n.01'),
 Synset('slip.n.07'),
 Synset('slickness.n.03'),
 Synset('strip.n.02'),
 Synset('slip.n.10'),
 Synset('chemise.n.01'),
 Synset('case.n.19'),
 Synset('skid.n.03'),
 Synset('slip.n.14'),
 Synset('slip.n.15'),
 Synset('steal.v.02'),
 Synset('slip.v.02'),
 Synset('skid.v.04'),
 Synset('slip.v.04'),
 Synset('slip.v.05'),
 Synset('err.v.01'),
 Synset('slip.v.07'),
 Synset('slip.v.08'),
 Synset('slip.v.09'),
 Synset('slip.v.10'),
 Synset('dislocate.v.01')]

In [77]:
wordnet.wup_similarity(syns[0], syns[1])

0.8235294117647058

In [78]:
wordnet.wup_similarity(syns[0], wordnet.synsets('apple')[0])

0.11764705882352941

The metric above is computed from the positions of the two synsets and of their common hypernym in the tree. We can inspect the path distances involved:

In [80]:
d1 = syns[0]
d2 = syns[1]
d1,d2

(Synset('faux_pas.n.01'), Synset('slip.n.02'))

In [81]:
d1.shortest_path_distance(d2)

3

In [82]:
common_hypernym_d1 = d1.hypernyms()[0]
common_hypernym_d1

Synset('blunder.n.01')

In [83]:
common_hypernym_d1.shortest_path_distance(d1)

1

In [84]:
common_hypernym_d2 = d2.hypernyms()[0]
common_hypernym_d2

Synset('mistake.n.01')

In [85]:
common_hypernym_d2.shortest_path_distance(d1)

2
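
To make the role of the hypernym tree concrete, here is a sketch of the usual Wu-Palmer formula, 2 * depth(lcs) / (depth(a) + depth(b)), on a tiny hand-made tree, where lcs is the deepest hypernym shared by both nodes. The `PARENT` table and the helper names are invented for illustration; nltk works on the real WordNet graph and handles details (for example multiple hypernym paths) that are ignored here.

```python
# Hypothetical toy hypernym tree: child -> parent (the root has no parent).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "plant": "entity",
    "tree": "plant",
}


def path_to_root(node):
    """Return the chain of hypernyms, starting at node and ending at the root."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    return len(path_to_root(node))  # the root has depth 1


def lowest_common_hypernym(a, b):
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walk upwards from b
        if node in ancestors_of_a:
            return node  # first shared node is the deepest one


def wup(a, b):
    lcs = lowest_common_hypernym(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


print(wup("dog", "cat"))   # share "animal": 2*2 / (3+3)
print(wup("dog", "tree"))  # share only the root: 2*1 / (3+3)
```

Nodes that meet only at the root score low, while siblings under a deep common hypernym score high.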

Be careful when comparing verbs: many verbs don't share a common hypernym, in which case the return value is None.

# Read the nltk documentation on word collocations

- http://www.nltk.org/howto/collocations.html
- https://www.nltk.org/
- https://www.nltk.org/book/


# NLP Resources
https://web.stanford.edu/~jurafsky/slp3/

http://web.stanford.edu/class/cs224n/

1 - Ruder's website: http://ruder.io/ (all his tutorials are amazing; I suggest starting from the older posts on the website)

2 - https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8416973&tag=1 (this covers the most recent advances in DL in NLP)

3 - pretty much everything on this website: https://machinelearningmastery.com/category/natural-language-processing/

4 - This GitHub repo has a lot of good resources: https://github.com/keon/awesome-nlp

5 - https://www.youtube.com/watch?v=jfwqRMdTmLo&list=
