## Introduction to the NLTK package in Python
Contents:
    1. Tokenization
    2. N-grams
    3. POS tagging
    4. TF-IDF (feature extraction)
    5. Named entity recognition (NER)
    6. Stemming and lemmatization

In [23]:
import nltk
from nltk import ngrams
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
text = '''
Machine learning is the science of getting computers to act without being explicitly programmed. 
In the past decade, machine learning has given us self-driving cars, 
practical speech recognition, effective web search, 
and a vastly improved understanding of the human genome. 
Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.
Many researchers also think it is the best way to make progress towards human-level AI. 
In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.
More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.
Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI.
'''


In [30]:
text = "Data Science is a new field and applied to many different domains."
sent_tokens = nltk.sent_tokenize(text)

In [31]:
token = word_tokenize(text)

In [15]:
sent_tokens[0]

'Data Science is a new field and applied to many different domains.'

In [16]:
bigrams = ngrams(token,2)

In [17]:
for k in bigrams:
    print(k)

('Data', 'Science')
('Science', 'is')
('is', 'a')
('a', 'new')
('new', 'field')
('field', 'and')
('and', 'applied')
('applied', 'to')
('to', 'many')
('many', 'different')
('different', 'domains')
('domains', '.')


In [18]:
trigrams = ngrams(token,3)

In [20]:
for k in trigrams:
    print(k)

('Data', 'Science', 'is')
('Science', 'is', 'a')
('is', 'a', 'new')
('a', 'new', 'field')
('new', 'field', 'and')
('field', 'and', 'applied')
('and', 'applied', 'to')
('applied', 'to', 'many')
('to', 'many', 'different')
('many', 'different', 'domains')
('different', 'domains', '.')
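
The bigram and trigram output above comes from sliding a window of size n over the token list. The following is a minimal pure-Python sketch of that idea, not NLTK's own implementation; `sliding_ngrams` is a hypothetical helper name introduced only for illustration.

```python
def sliding_ngrams(tokens, n):
    """Return every length-n window over `tokens` as a list of tuples."""
    # zip stops at the shortest slice, so incomplete windows at the end are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['Data', 'Science', 'is', 'a', 'new', 'field']
print(sliding_ngrams(tokens, 2))
print(sliding_ngrams(tokens, 3))
```

Because the helper returns a list rather than a one-shot iterator, the result can be looped over more than once.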


In [22]:
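# ngrams() returns a one-shot iterator, so `bigrams` was already consumed by the loop above
# and this loop prints nothing. Use list(ngrams(token, 2)) to iterate more than once.
for grams in bigrams:
    print(grams)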

In [5]:
# The output shown below is from an earlier run in which `text` held two sentences.
for sent in sent_tokens:
    print(sent)

This is a machine learning course, we teach data science.
We use R and Python


In [23]:
sentences = nltk.sent_tokenize(text)


In [24]:
sentences

['Data Science is a new field and applied to many different domains.']

In [1]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

In [27]:
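# min_df and max_df accept a proportion of documents (float in [0.0, 1.0]) or an absolute count (int).
# Here an n-gram is kept only if it appears in at least 2 and at most 4 documents.
# use_idf=False switches off IDF weighting, so the values are L2-normalized term frequencies.
vectorizer = TfidfVectorizer(min_df=2, max_df=4, ngram_range=(1, 2), use_idf=False)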
a = vectorizer.fit_transform(corpus)

In [28]:
a.toarray()

array([[0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
        0.33333333, 0.33333333, 0.33333333, 0.33333333],
       [0.40824829, 0.        , 0.        , 0.40824829, 0.40824829,
        0.40824829, 0.        , 0.40824829, 0.40824829],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        1.        , 0.        , 0.        , 0.        ],
       [0.37796447, 0.37796447, 0.37796447, 0.37796447, 0.        ,
        0.37796447, 0.37796447, 0.37796447, 0.        ]])
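
With `use_idf=False`, each row above is the raw n-gram counts scaled to unit Euclidean length (the vectorizer's default `norm` is `'l2'`). The sketch below reproduces the arithmetic for the first two rows by hand; `l2_normalize` is a hypothetical helper, and the count vectors are read off the output above rather than computed by scikit-learn.

```python
import math


def l2_normalize(counts):
    """Scale a count vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm if norm else 0.0 for c in counts]


# Document 1 contains each of the nine kept n-grams once.
row1 = l2_normalize([1, 1, 1, 1, 1, 1, 1, 1, 1])
# Document 2 contains six of the nine kept n-grams once each.
row2 = l2_normalize([1, 0, 0, 1, 1, 1, 0, 1, 1])

print(round(row1[0], 8))  # 0.33333333, i.e. 1 / sqrt(9)
print(round(row2[0], 8))  # 0.40824829, i.e. 1 / sqrt(6)
```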

In [24]:
cntvect = CountVectorizer()
b = cntvect.fit_transform(corpus)
b.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
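
The columns of the count matrix above correspond to the vocabulary in alphabetical order. As a rough sketch of what happens with the default settings (lowercasing, tokens of two or more word characters), the same matrix can be rebuilt with the standard library. This is an illustration, not scikit-learn's code; `TOKEN`, `docs`, `vocab` and `matrix` are names chosen here.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same shape as CountVectorizer's default token pattern: words of 2+ characters.
TOKEN = re.compile(r"\b\w\w+\b")

docs = [TOKEN.findall(doc.lower()) for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
matrix = [[Counter(doc)[tok] for tok in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```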

In [32]:
postags = pos_tag(token)

In [33]:
postags

[('Data', 'NNP'),
 ('Science', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('field', 'NN'),
 ('and', 'CC'),
 ('applied', 'VBD'),
 ('to', 'TO'),
 ('many', 'JJ'),
 ('different', 'JJ'),
 ('domains', 'NNS'),
 ('.', '.')]
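
Each pair above is `(token, tag)` with Penn Treebank tags, so downstream filtering is ordinary list processing. A small sketch, with the tagger output hard-coded from the cell above so that it runs without NLTK:

```python
from collections import Counter

# Copied from the pos_tag output shown above.
postags = [('Data', 'NNP'), ('Science', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
           ('new', 'JJ'), ('field', 'NN'), ('and', 'CC'), ('applied', 'VBD'),
           ('to', 'TO'), ('many', 'JJ'), ('different', 'JJ'),
           ('domains', 'NNS'), ('.', '.')]

# All noun tags (NN, NNS, NNP, NNPS) share the 'NN' prefix.
nouns = [word for word, tag in postags if tag.startswith('NN')]
tag_counts = Counter(tag for _, tag in postags)

print(nouns)             # ['Data', 'Science', 'field', 'domains']
print(tag_counts['JJ'])  # 3
```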

In [36]:
nltk.help.upenn_tagset('VBG')

VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...


## Named Entity Recognition
1. Using NER and NERT tagger

In [34]:
import ner

In [39]:
import nltk 
with open('sample.txt', 'r') as f:
    sample = f.read() # f.readlines() -- to read each line as a separate row 


sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

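def extract_entity_names(t):
    """Recursively collect the text of every 'NE' subtree in a chunk tree."""
    entity_names = []

    # Only Tree nodes have a label; leaves are plain (word, tag) tuples.
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            # Join the words of the entity's (word, tag) leaves.
            entity_names.append(' '.join(child[0] for child in t))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names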

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print(extract_entity_names(tree))

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print(entity_names)

# Print unique entity names
print(set(entity_names))

{'Apple India Pvt Ltd', 'NER', 'Hyderabad', 'India', 'Microsoft Corporation', 'CEO Satya Nadella'}


In [45]:
## Stemming and lemmatization 
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.corpus import stopwords

In [46]:
Text = " This is a test of how Stemmer stems or lemmatizes. Programmer builds programs. Programming is a skill. Execution of different programs are tasks"

In [47]:
words = word_tokenize(Text)

In [50]:
words[7:20]

['stems',
 'or',
 'lemmatizes',
 '.',
 'Programmer',
 'builds',
 'programs',
 '.',
 'Programming',
 'is',
 'a',
 'skill',
 '.']

In [48]:
## Lemmatization 
lemm = WordNetLemmatizer()
lem_words = [ lemm.lemmatize(w) for w in words] ## List comprehension
print(lem_words)

['This', 'is', 'a', 'test', 'of', 'how', 'Stemmer', 'stem', 'or', 'lemmatizes', '.', 'Programmer', 'build', 'program', '.', 'Programming', 'is', 'a', 'skill', '.', 'Execution', 'of', 'different', 'program', 'are', 'task']


In [51]:
## Using Porter stemmer 
ps = PorterStemmer()
ps_words = [ps.stem(w) for w in words]
print(ps_words)

['thi', 'is', 'a', 'test', 'of', 'how', 'stemmer', 'stem', 'or', 'lemmat', '.', 'programm', 'build', 'program', '.', 'program', 'is', 'a', 'skill', '.', 'execut', 'of', 'differ', 'program', 'are', 'task']


In [52]:
# Using Lancaster Stemmer 
ls = LancasterStemmer()
ls_words = [ls.stem(w) for w in words]
print(ls_words)

['thi', 'is', 'a', 'test', 'of', 'how', 'stem', 'stem', 'or', 'lem', '.', 'program', 'build', 'program', '.', 'program', 'is', 'a', 'skil', '.', 'execut', 'of', 'diff', 'program', 'ar', 'task']
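
The Porter and Lancaster outputs above differ on a handful of tokens, with Lancaster generally cutting more aggressively. The sketch below lists the tokens on which the two disagree; all three lists are copied from the cells above, so it runs without NLTK.

```python
words = ['This', 'is', 'a', 'test', 'of', 'how', 'Stemmer', 'stems', 'or',
         'lemmatizes', '.', 'Programmer', 'builds', 'programs', '.',
         'Programming', 'is', 'a', 'skill', '.', 'Execution', 'of',
         'different', 'programs', 'are', 'tasks']
porter = ['thi', 'is', 'a', 'test', 'of', 'how', 'stemmer', 'stem', 'or',
          'lemmat', '.', 'programm', 'build', 'program', '.', 'program',
          'is', 'a', 'skill', '.', 'execut', 'of', 'differ', 'program',
          'are', 'task']
lancaster = ['thi', 'is', 'a', 'test', 'of', 'how', 'stem', 'stem', 'or',
             'lem', '.', 'program', 'build', 'program', '.', 'program',
             'is', 'a', 'skil', '.', 'execut', 'of', 'diff', 'program',
             'ar', 'task']

# Keep only the positions where the two stemmers produce different stems.
diffs = [(w, p, l) for w, p, l in zip(words, porter, lancaster) if p != l]
for w, p, l in diffs:
    print(f"{w:12} porter={p:10} lancaster={l}")
```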


In [62]:
## Remove stop words
stop_words = stopwords.words('english')
mywords = ['test', 'this']
stop_words = stop_words + mywords
# Lowercase before filtering: the stop-word list is all lowercase, so 'This' would otherwise slip through.
clean_words = [w.lower() for w in words if w.lower() not in stop_words]
print(clean_words)

['stemmer', 'stems', 'lemmatizes', '.', 'programmer', 'builds', 'programs', '.', 'programming', 'skill', '.', 'execution', 'different', 'programs', 'tasks']


In [55]:
print(stop_words[0:10])
mywords = ["test","this"]
stop_words.extend(mywords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
