    # Topic:
        01 - Theory
        02 - Tokenization Examples
        03 - Stemming
        04 - Lemmatization (it accepts POS-tags as arguments.)
        05 - POS-Tag
        06 - nltk-Synonyms
        07 - N-Grams

#  01-Theory:

Terms:
    1. Corpus, Tokens and N-Grams
    2. Tokenization
    3. Stemming
    4. Lemmatization
    5. Part of Speech Tagging
    6. Dependency Grammar

In [None]:
♦ Corpus: Collection of text documents.
    Corpus > Documents > Paragraphs > Sentences > Tokens
    
♦ Tokens: Smaller units of text (words, phrases or n-grams).

♦ N-grams: Sequences of n consecutive words or characters.
    Ex: I love my phone
    Unigrams(n=1): I, love, my, phone
    Bigrams(n=2): I love, love my, my phone
    Trigrams(n=3): I love my, love my phone
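
The n-gram definition above can be sketched in plain Python with a sliding window. The helper name `make_ngrams` is an assumption made for illustration; it is not part of any library.

```python
def make_ngrams(items, n):
    """Return every run of n consecutive items as a tuple.

    `items` can be a list of words or a string of characters.
    """
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "I love my phone".split()

unigrams = make_ngrams(words, 1)
bigrams = make_ngrams(words, 2)
trigrams = make_ngrams(words, 3)

# Character-level n-grams work the same way on a string.
char_bigrams = make_ngrams("phone", 2)

print(bigrams)
print(trigrams)
```

If n is larger than the number of items, the result is simply an empty list.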


In [None]:
♦ Tokenization: The process of splitting a text object into smaller units (tokens).
    Ex_1: White Space Tokenizer/Unigram Tokenizer
        Sentence: "I went to New-York to play football"
        Tokens: "I", "went", "to" ,"New-York", "to" ,"play", "football"
    Ex_2: Regular Expression Tokenizer
        Sentence: "Football,Cricket;Baseball Tennis"
        re.split(r'[;,\s]',st)
        Tokens: ['Football', 'Cricket', 'Baseball', 'Tennis']
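
Both tokenizers above can be sketched with the standard library alone. The function names `whitespace_tokenize` and `regex_tokenize` are hypothetical, and the pattern is a slightly stricter variant of the one in Ex_2: the added `+` and the empty-string filter are assumptions that keep input such as "Football, Cricket" from producing empty tokens.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace (Ex_1)."""
    return text.split()


def regex_tokenize(text, pattern=r"[;,\s]+"):
    """Split on any run of ';', ',' or whitespace (Ex_2), dropping empty tokens."""
    return [tok for tok in re.split(pattern, text) if tok]


print(whitespace_tokenize("I went to New-York to play football"))
print(regex_tokenize("Football,Cricket;Baseball Tennis"))
```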

In [None]:
♦ Normalization: The process of converting a token into its base form (morpheme).
    Morpheme: Base form of a word
        Structure of Token: <prefix> <morpheme> <suffix> 
        Ex: Antinationalist = Anti + national + ist
• Two methods of normalization are: stemming and lemmatization.

In [None]:
♦ Stemming: An elementary rule-based process that strips inflectional endings from a token.
    Ex: "Laughing", "Laughs", "Laughed" > "Laugh"
    But this method is crude: blind suffix stripping can produce non-words.
    Ex (naive stripping of "s" and "ing"): "his teams are not winning"  > "hi" "team" "are" "not" "winn"
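
A minimal sketch of such a naive suffix stripper, assuming only three suffixes ("ing", "ed", "s"). The function `naive_stem` is hypothetical and far simpler than a real stemmer such as Porter's, which applies additional conditions and rules; it only serves to reproduce the failure cases listed above.

```python
def naive_stem(token, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, with no linguistic checks at all."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


# Works on the easy cases...
print([naive_stem(w) for w in ["Laughing", "Laughs", "Laughed"]])

# ...but mangles words that merely happen to end in a suffix.
print([naive_stem(w) for w in "his teams are not winning".split()])
```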

In [None]:
♦ Lemmatization: A systematic process for reducing a token to its lemma (dictionary form). It uses grammar and parts of speech.
    Ex: 1. is,am,are  > be
        2. running,runs,ran > run
        3. running(verb) > run
        4. running(noun) > running
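
The role of the part of speech can be illustrated with a toy lookup table. This is only a sketch: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the table holds nothing beyond the examples above, whereas a real lemmatizer relies on a full dictionary and morphological rules.

```python
# (token, part of speech) -> lemma; "v" = verb, "n" = noun
LEMMA_TABLE = {
    ("is", "v"): "be",
    ("am", "v"): "be",
    ("are", "v"): "be",
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("ran", "v"): "run",
    ("running", "n"): "running",
}


def toy_lemmatize(token, pos="n"):
    """Look up the lemma for (token, pos); fall back to the token itself."""
    return LEMMA_TABLE.get((token.lower(), pos), token)


print(toy_lemmatize("running", pos="v"))
print(toy_lemmatize("running", pos="n"))
```

The same surface form maps to different lemmas depending on the tag, which is why the part of speech matters.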

In [None]:
♦ Part of Speech Tagging (POS Tagging): Assigning each token its grammatical category, such as noun, verb, adjective or adverb.
    

In [None]:
♦ Grammar:
    1. Constituency Grammar: Organizes any sentence into constituents using their properties.
        Sentence: <subject> <context> <object>
            Ex: The dogs are barking in the park.
                <subject>: The cat/the dogs/they
                <context>: are barking/are running/are eating
                <object>: in the park/happily/since the morning.
    2. Dependency Grammar: Describes which words of a sentence depend on which other words (dependencies).
            Modifier: Barking dog. (dog is modified by barking)
            Relation: (Governor, Relation, Dependent)
            Ex: <AnalyticsVidhya><is><the largest community of data scientists>

# 02-Tokenization Examples

In [1]:
# Tokenization Example:
import re
st = "Football,Cricket;Baseball Tennis"
re.split(r'[;,\s]',st)

['Football', 'Cricket', 'Baseball', 'Tennis']

In [23]:
# If this import fails, use: from nltk.tokenize import sent_tokenize, word_tokenize
# If the tokenizer data is missing, run: nltk.download('punkt')  ('punkt_tab' on newer NLTK versions)
from nltk import sent_tokenize, word_tokenize
text = "Hi John! How are you doing? I will be traveling to your city. Lets Catchup."
sent_tokenize(text)
word_tokenize(text)

['Hi',
 'John',
 '!',
 'How',
 'are',
 'you',
 'doing',
 '?',
 'I',
 'will',
 'be',
 'traveling',
 'to',
 'your',
 'city',
 '.',
 'Lets',
 'Catchup',
 '.']

# 03- Stemming

In [24]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [100]:
# Stemming is not always accurate: it sometimes produces non-words (e.g. "increas"):
print(stemmer.stem("Playing"))
print(stemmer.stem("Increases"))
print(stemmer.stem("Raining"))
print(stemmer.stem("Decreases"))

play
increas
rain
decreas


# 04- Lemmatization:

In [59]:
# If this does not work, download WordNet first by running the line below:
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

In [62]:
# lemmatize() accepts a POS tag as its second argument (default: noun).
print(lemm.lemmatize("playing"))
print(lemm.lemmatize("playing",pos="v"))
print(lemm.lemmatize("increases"))

playing
play
increase


# 05- POS-Tag

In [67]:
from nltk import pos_tag
text = "Hi John! How are you doing? I will be traveling to your city. Lets Catchup."

In [71]:
# If this raises an error, download the tagger model by running the line below:
# nltk.download('averaged_perceptron_tagger')
tokens = word_tokenize(text)
pos_tag(tokens)

[('Hi', 'NNP'),
 ('John', 'NNP'),
 ('!', '.'),
 ('How', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('doing', 'VBG'),
 ('?', '.'),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('traveling', 'VBG'),
 ('to', 'TO'),
 ('your', 'PRP$'),
 ('city', 'NN'),
 ('.', '.'),
 ('Lets', 'VBZ'),
 ('Catchup', 'NNP'),
 ('.', '.')]
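
The tags above are Penn Treebank tags, while the lemmatizer in section 04 expects a WordNet-style tag such as "v". A minimal sketch of a bridge between the two follows; the helper name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else


tagged = [("traveling", "VBG"), ("city", "NN"), ("are", "VBP")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting letter can then be passed as the `pos` argument, for example `lemm.lemmatize(word, pos=letter)`.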

# 06-nltk - Synonyms:

In [90]:
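# wordnet.synsets() returns the synsets (sets of synonyms) that a word belongs to;
# call .lemma_names() on a synset to list the synonyms themselves.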
from nltk.corpus import wordnet
wordnet.synsets("Computer")

[Synset('computer.n.01'), Synset('calculator.n.01')]

# 07- N-grams:

In [97]:
# nltk.ngrams(sequence, n) yields each n-gram as a tuple:
from nltk import ngrams
sentence = "I love to play football"
for gram in ngrams(word_tokenize(sentence),2):
    print(gram)

('I', 'love')
('love', 'to')
('to', 'play')
('play', 'football')
