<a href="https://colab.research.google.com/github/tomnie96/ISE/blob/master/02_ISE2020_NLP_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**To adapt this notebook to your own needs** and to be able to edit it, please make your own copy. This works via "*File*" -> "*Save a copy ...*".


---



Some of the **NLP techniques** mentioned [in Sect. 2.4 of the ISE 2020 lecture](https://www.slideshare.net/lysander07/02-ise2020-natural-language-processing-1-232058444) are already implemented in the [Python NLTK library](https://www.nltk.org/). Please find some basic NLP examples below.

# Tokenization
Tokenization is the process of separating character sequences into
smaller pieces, called tokens. In this process, certain characters,
such as punctuation, might be omitted (depending on the
tokenizer).

In [0]:
# First we have to import nltk and download a few required packages
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')

First let's try **Sentence Splitting**:

In [0]:
text="Mary had a little lamb. Her fleece was white as snow."
# We import the two methods required for (1) word-based tokenization, and (2) sentence splitting
from nltk.tokenize import word_tokenize, sent_tokenize
sents=sent_tokenize(text)
print(sents)

Now, let's try **word tokenization**, applied to each sentence:

In [0]:
words=[word_tokenize(sent) for sent in sents]
print(words)
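
The remark that punctuation handling depends on the tokenizer can be made concrete with a minimal sketch. It uses only Python's standard `re` module, not NLTK, and the two regular expressions are illustrative assumptions rather than the rules that `word_tokenize` applies internally.

```python
import re

text = "Mary had a little lamb. Her fleece was white as snow."

# Tokenizer A (assumed rule): runs of word characters, and each
# punctuation mark as a token of its own.
with_punct = re.findall(r"\w+|[^\w\s]", text)

# Tokenizer B (assumed rule): runs of word characters only,
# so punctuation is dropped.
without_punct = re.findall(r"\w+", text)

print(with_punct)
print(without_punct)
```

Both results are valid tokenizations of the same text; which one is appropriate depends on whether later steps (e.g. sentence-level statistics) need the punctuation.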

# Part-of-Speech tagging
Part-of-speech tagging assigns each word its part of speech
and labels it according to a specified tagset. Most commonly
the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) is used.

In [0]:
# each word in the text will be assigned a POS tag
nltk.pos_tag(word_tokenize(text))

In [0]:
# in case you don't know the meaning of some of the POS tags
nltk.help.upenn_tagset('NNP')
nltk.help.upenn_tagset('VBD')
nltk.help.upenn_tagset('PRP$')
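
Penn Treebank tags are fine-grained, and it is often handy to collapse them into a few coarse classes. The following is a minimal sketch: the `(word, tag)` pairs are hand-written in the format that `nltk.pos_tag` returns (they are not actual tagger output), and the prefix table `COARSE` and the helper `coarse_class` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs in the format returned by nltk.pos_tag;
# illustrative input, not actual tagger output.
tagged = [("Mary", "NNP"), ("had", "VBD"), ("a", "DT"),
          ("little", "JJ"), ("lamb", "NN"), (".", ".")]

# Assumed, simplified grouping keyed by the first two characters of the tag.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_class(tag):
    """Return a coarse word class for a Penn Treebank tag."""
    return COARSE.get(tag[:2], "other")

# Collect the words of each coarse class in order of appearance.
by_class = defaultdict(list)
for word, tag in tagged:
    by_class[coarse_class(tag)].append(word)

print(dict(by_class))
```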

# Lemmatization
* Lemmatization groups the different inflected forms of a word so that they can be treated as the same item.
* It reduces a word to its base form using an online lexicon.

*For Lemmatization, NLTK provides an interface to the [WordNet](https://wordnet.princeton.edu/) dictionary. WordNet is a large English lexical database. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.*



In [0]:
# we import the WordNet lemmatizer
from nltk.stem import WordNetLemmatizer
# new text example
sentence = "Mary had two little lambs. Her fleeces were white as snow."

lemmatizer = WordNetLemmatizer()
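# for each word of the sentence
for token in word_tokenize(sentence):
  # pos='v' makes the lemmatizer treat every token as a verb (e.g. "were" -> "be");
  # without a pos argument it treats every token as a noun
  print(lemmatizer.lemmatize(token, pos='v'))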
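
The cell above passes the same part of speech for every token. One common alternative is to derive the `pos` argument from each token's Penn Treebank tag. The helper below is a minimal sketch of such a mapping; `penn_to_wordnet_pos` is a hypothetical name, and the single letters are the WordNet part-of-speech codes that `lemmatize` accepts (`'n'`, `'v'`, `'a'`, `'r'`).

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    # Everything else falls back to noun, the lemmatizer's default.
    return "n"

for tag in ["NNS", "VBD", "JJ", "RB", "PRP$"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

With the notebook's objects, the mapping could be used as `lemmatizer.lemmatize(token, pos=penn_to_wordnet_pos(tag))` for each `(token, tag)` pair returned by `nltk.pos_tag(word_tokenize(sentence))`.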

# Stemming
* Stemming strips words of their affixes by rule, without consulting a lexicon. For English, the [Porter Stemmer](http://snowball.tartarus.org/algorithms/porter/stemmer.html), which removes suffixes only, is rather popular.

In [0]:
# we import the Porter Stemmer
from nltk.stem import PorterStemmer
sentence = "Mary had two little lambs. Her fleeces were white as snow."

ps = PorterStemmer()
# for each word of the sentence
for token in word_tokenize(sentence):
  print(ps.stem(token))
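
To show what rule-based suffix stripping looks like, here is a minimal sketch of step 1a of the original Porter algorithm only, the step that handles plural suffixes. The function name `porter_step_1a` is hypothetical, and this is not NLTK's implementation: the full stemmer applies several further steps, so `PorterStemmer` may return shorter stems for the same words.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]  # ies -> i    (ponies -> poni)
    if word.endswith("ss"):
        return word       # ss -> ss    (caress -> caress)
    if word.endswith("s"):
        return word[:-1]  # s -> ""     (lambs -> lamb)
    return word

for w in ["caresses", "ponies", "caress", "lambs", "fleeces", "snow"]:
    print(w, "->", porter_step_1a(w))
```

Note that a stem such as `poni` is not a dictionary word; unlike lemmatization, stemming only aims to map related forms to the same string.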

# Named Entity Recognition (NER)
Named Entity Recognition locates atomic elements in a text and classifies them into predefined categories such as **names of persons, organizations, locations, expressions of time, quantities, monetary values**, etc.

*For casual use, NLTK provides us with a method called `ne_chunk` to perform NER on a given text. In order to use `ne_chunk`, the text first needs to be tokenized into words and then POS tagged. After NER, each recognized group of tagged words is labeled with its entity type.*

In [0]:
# For NER, we need tokenization, POS tagging and Named Entity Chunking
from nltk import word_tokenize, pos_tag, ne_chunk
# New text example
sentence = "The Avengers began as a group of extraordinary individuals who were assembled to defeat \
Loki and his chitauri army in New York City."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
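
The tree printed by `ne_chunk` can be flattened into `(word, POS, IOB-tag)` triples with `nltk.chunk.tree2conlltags`, where `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below groups such triples back into entity spans. The triples are hand-written for illustration (they are not the actual output for the sentence above), and `iob_to_entities` is a hypothetical helper name.

```python
def iob_to_entities(triples):
    """Group (word, pos, iob) triples into (entity text, label) pairs."""
    entities = []
    words, label = [], None
    for word, _pos, iob in triples:
        if iob.startswith("I-") and label == iob[2:]:
            # Continuation of the entity that is currently open.
            words.append(word)
            continue
        # Any other tag closes the open entity, if there is one.
        if words:
            entities.append((" ".join(words), label))
            words, label = [], None
        if iob != "O":
            # "B-X" (or a stray "I-X") opens a new entity of type X.
            words, label = [word], iob[2:]
    if words:
        entities.append((" ".join(words), label))
    return entities

# Hand-written illustrative input, not actual ne_chunk output.
triples = [("Loki", "NNP", "B-PERSON"), ("and", "CC", "O"),
           ("his", "PRP$", "O"), ("army", "NN", "O"), ("in", "IN", "O"),
           ("New", "NNP", "B-GPE"), ("York", "NNP", "I-GPE"),
           ("City", "NNP", "I-GPE"), (".", ".", "O")]

print(iob_to_entities(triples))
```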

Now let's try an **alternative NLP library**: [spaCy](https://spacy.io/).

In [0]:
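# the 'en' shortcut was removed in spaCy 3, so we download the small English model by its full name
!python -m spacy download en_core_web_sm

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')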

We start with **Named Entity Recognition**:

In [0]:
doc = nlp('Neil Alden Armstrong was an American astronaut and aeronautical engineer who was the first person to walk on the Moon.')

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
    
# displaCy
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

# Dependency Parsing
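Dependency parsing describes the structure of a sentence through direct binary grammatical relations among words, each linking a head word to one of its dependents. These relations approximate the semantic relations between predicates and their arguments.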


In [0]:
# Dependency Parsing

doc = nlp(u'Neil Alden Armstrong was an American astronaut and aeronautical engineer who was the first person to walk on the Moon.')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
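
Each line printed above is one binary relation between a dependent and its head. As a minimal sketch of how such relations form a tree, the code below stores them as `(dependent, relation, head)` triples and collects the dependents of every head. The triples are a hand-written analysis of a short sentence, not spaCy output; they follow spaCy's convention that the root token is its own head with the relation `ROOT`.

```python
from collections import defaultdict

# Hand-written arcs for "Mary had a little lamb": (dependent, relation, head).
arcs = [("Mary", "nsubj", "had"),
        ("had", "ROOT", "had"),
        ("a", "det", "lamb"),
        ("little", "amod", "lamb"),
        ("lamb", "dobj", "had")]

# Map every head to its (relation, dependent) pairs and find the root.
children = defaultdict(list)
root = None
for dependent, relation, head in arcs:
    if relation == "ROOT":
        root = dependent
    else:
        children[head].append((relation, dependent))

def tree_lines(word, relation="ROOT", depth=0):
    """Return the subtree below `word` as indented text lines."""
    lines = ["  " * depth + f"{relation}: {word}"]
    for rel, dep in children[word]:
        lines.extend(tree_lines(dep, rel, depth + 1))
    return lines

print("\n".join(tree_lines(root)))
```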

In [0]:
# Visualizing Dependency Parsing

from spacy import displacy
 
doc = nlp(u'Neil Alden Armstrong was an American astronaut and aeronautical engineer who was the first person to walk on the Moon.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})