### Topics for today's discussion
* Understanding NLTK
* Tokenizing
* Stop-words
* Stemming 
* Lemmatizing

<hr>

* Raw text cannot be processed directly by ML algorithms
* It needs to be pre-processed
* It needs feature reduction
* NLTK is a foundational library that provides tools for all of these steps
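
To make the first point concrete, here is a minimal sketch (standard library only, not NLTK) of turning raw sentences into the count vectors an ML algorithm can consume. The helper name `bag_of_words` and the sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Turn a list of strings into (vocabulary, count vectors)."""
    # Lowercase and keep only alphabetic runs as tokens (a crude tokenizer).
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    # The vocabulary is every distinct token, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # One row per document, one column per vocabulary word.
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["Hope things are fine.", "Hope the lockdown ends."])
print(vocab)    # ['are', 'ends', 'fine', 'hope', 'lockdown', 'the', 'things']
print(vectors)  # [[1, 0, 1, 1, 0, 0, 1], [0, 1, 0, 1, 1, 1, 0]]
```

Every distinct word becomes a column, which is why reducing the number of distinct words (feature reduction) matters.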

In [6]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [1]:
import nltk

In [5]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [3]:
my_txt = 'Hello Mr. Learners, how is learning going on? Hope things are fine. Hope the lockdown solves all the issues.'

In [4]:
sent_tokenize(my_txt)

['Hello Mr. Learners, how is learning going on?',
 'Hope things are fine.',
 'Hope the lockdown solves all the issues.']
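
For contrast, here is a sketch of a naive sentence splitter built on a regular expression. It is not what `sent_tokenize` does internally; it only illustrates why a rule like "split after every full stop" is not enough, since it breaks the text at the abbreviation `Mr.`.

```python
import re

my_txt = ('Hello Mr. Learners, how is learning going on? '
          'Hope things are fine. Hope the lockdown solves all the issues.')

# Naive rule: split on whitespace that follows '.', '?' or '!'.
naive_sentences = re.split(r'(?<=[.?!])\s+', my_txt)

print(len(naive_sentences))  # 4, one more than sent_tokenize returned
print(naive_sentences[0])    # Hello Mr.
```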

In [5]:
word_tokenize(my_txt)

['Hello',
 'Mr.',
 'Learners',
 ',',
 'how',
 'is',
 'learning',
 'going',
 'on',
 '?',
 'Hope',
 'things',
 'are',
 'fine',
 '.',
 'Hope',
 'the',
 'lockdown',
 'solves',
 'all',
 'the',
 'issues',
 '.']
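
The value of a dedicated word tokenizer is easier to see next to simpler alternatives. The sketch below uses only the standard library; the regex is a rough approximation chosen for illustration, not NLTK's algorithm (note that it splits `Mr.` into two tokens, which `word_tokenize` did not).

```python
import re

sample = 'Hello Mr. Learners, how is learning going on?'

# Whitespace splitting leaves punctuation glued to the words.
ws_tokens = sample.split()
print(ws_tokens)
# ['Hello', 'Mr.', 'Learners,', 'how', 'is', 'learning', 'going', 'on?']

# Rough approximation: a token is a run of word characters or a single punctuation mark.
rx_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(rx_tokens)
# ['Hello', 'Mr', '.', 'Learners', ',', 'how', 'is', 'learning', 'going', 'on', '?']
```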

Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing, used to prepare text, words, and documents for further processing.

### Stemming
* Many variations of a word carry the same meaning, apart from differences such as tense.
* The objective is to reduce the dimensionality of the data
* Curse of dimensionality: many algorithms don't work well when there are too many dimensions
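
As a small illustration of the dimensionality point, the sketch below counts distinct tokens before and after stemming. The token list is made up for the example; `PorterStemmer` itself needs no `nltk.download()` step.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['run', 'runs', 'running', 'pay', 'pays', 'paying']

# Each distinct token would become its own feature (column).
raw_vocab = set(tokens)
# After stemming, the variants collapse onto a shared stem.
stemmed_vocab = {stemmer.stem(tok) for tok in tokens}

print(len(raw_vocab), sorted(raw_vocab))          # 6 distinct features
print(len(stemmed_vocab), sorted(stemmed_vocab))  # 2 ['pay', 'run']
```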

In [6]:
from nltk.stem import PorterStemmer

In [7]:
ps = PorterStemmer()

In [8]:
words = ['go', 'went', 'going', 'gone']  # alternative: ['runs', 'runner', 'running', 'run']

In [9]:
for word in words:
    print(ps.stem(word))

go
went
go
gone


In [10]:
text_data = ['I runs verying is fast','I was very running fast veries veried']

In [11]:
import pandas as pd

In [12]:
df = pd.DataFrame({'Text':text_data})

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
cv = CountVectorizer()

In [15]:
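def f(r):
    # Tokenize the sentence, stem every token, and join the stems back into one string
    words = word_tokenize(r)
    res = []
    for word in words:
        res.append(ps.stem(word))
    return ' '.join(res)

# Replace each row of the Text column with its stemmed version
df.Text = df.Text.map(f)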

In [16]:
df.Text

0              I run veri is fast
1    I wa veri run fast veri veri
Name: Text, dtype: object

In [17]:
cv.fit_transform(df.Text)

<2x5 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [18]:
cv.vocabulary_

{'run': 2, 'veri': 3, 'is': 1, 'fast': 0, 'wa': 4}

In [19]:
cv.fit_transform(df.Text).toarray()

array([[1, 1, 1, 1, 0],
       [1, 0, 1, 3, 1]], dtype=int64)
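
The array is easier to read once each column is labelled with its word. A minimal sketch, assuming the two stemmed strings shown earlier as input; the variable names are illustrative. Sorting `vocabulary_` by its index values is used here because it works across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

stemmed = ['I run veri is fast', 'I wa veri run fast veri veri']

vec = CountVectorizer()
counts = vec.fit_transform(stemmed).toarray()

# Order the vocabulary by column index so the names line up with the columns.
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(columns)  # ['fast', 'is', 'run', 'veri', 'wa']

# One dict per document: word -> number of occurrences.
for row in counts:
    print(dict(zip(columns, row.tolist())))
```

The single-letter token `I` does not appear because the default `token_pattern` only keeps tokens of two or more characters.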

### Lemmatizing
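* Similar to stemming
* Stemming applies suffix-stripping rules, so it also works on incorrect (misspelled or made-up) words, but the result may not be a real word
* Lemmatizing looks words up in a dictionary (WordNet here), so it works on actual words and returns actual words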

In [20]:
from nltk.stem import WordNetLemmatizer

In [21]:
wl = WordNetLemmatizer()

In [22]:
wl.lemmatize('cats')

'cat'

In [23]:
wl.lemmatize('runs')

'run'

In [24]:
wl.lemmatize('goose')

'goose'

In [25]:
wl.lemmatize('geese')

'goose'

In [26]:
wl.lemmatize('better',pos="a")

'good'

In [27]:
wl.lemmatize('good',pos="a")

'good'

In [28]:
ps.stem('paying')

'pay'

In [29]:
ps.stem('pays')

'pay'

In [30]:
ps.stem('payed')

'pay'

### LancasterStemmer

In [31]:
from nltk.stem import LancasterStemmer

In [32]:
ls = LancasterStemmer()

In [33]:
ls.stem('trouble')

'troubl'

In [34]:
ls.stem('troubling')

'troubl'
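
To compare the two stemmers side by side, a small sketch follows. The word list is chosen for illustration; Lancaster is usually described as the more aggressive of the two, so it is worth printing both results for your own vocabulary before picking one. Neither stemmer needs an `nltk.download()` step.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Print each word with its Porter stem and its Lancaster stem.
for word in ['trouble', 'troubling', 'paying']:
    print(word, porter.stem(word), lancaster.stem(word))
```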

In [35]:
text = 'He was running and eating at the, same time. He also has a very bad habbit of playing in the Sun after having food'

In [36]:
punctuations = ',.?'

In [37]:
text = text.replace(',','').replace('?','').replace('.','')

In [38]:
words = word_tokenize(text)

In [39]:
words

['He',
 'was',
 'running',
 'and',
 'eating',
 'at',
 'the',
 'same',
 'time',
 'He',
 'also',
 'has',
 'a',
 'very',
 'bad',
 'habbit',
 'of',
 'playing',
 'in',
 'the',
 'Sun',
 'after',
 'having',
 'food']

In [40]:
for word in words:
    print(wl.lemmatize(word,pos='v'))

He
be
run
and
eat
at
the
same
time
He
also
have
a
very
bad
habbit
of
play
in
the
Sun
after
have
food


In [41]:
horror_data = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/horror-train.csv')

In [42]:
horror_data.columns

Index(['id', 'text', 'author'], dtype='object')

In [43]:
horror_data = horror_data[['text']]

In [44]:
horror_data[:5]

| | text |
|---|---|
| 0 | This process, however, afforded me no means of... |
| 1 | It never once occurred to me that the fumbling... |
| 2 | In his left hand was a gold snuff box, from wh... |
| 3 | How lovely is spring As we looked from Windsor... |
| 4 | Finding nothing else, not even gold, the Super... |


* Using NearestNeighbors with cosine distance as the metric, we will find similar texts
* We can use regex to remove punctuation
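
The regex option can be sketched as follows. The helper name `strip_punctuation` is an assumption for illustration. Unlike a chain of `replace` calls for specific characters, this pattern removes every character that is neither a word character nor whitespace, so semicolons, quotes and apostrophes are removed as well.

```python
import re

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text)

cleaned = strip_punctuation('He was running and eating at the, same time.')
print(cleaned)  # He was running and eating at the same time
```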

In [45]:
def f(t):
    return t.replace(',','').replace('?','').replace('.','')
horror_data['new_text'] = horror_data.text.map(f)

In [46]:
def stem_func(r):
    words = word_tokenize(r)
    sent = []
    for word in words:
        sent.append(ps.stem(word))
    return ' '.join(sent)

horror_data['stem_words'] = horror_data.new_text.map(stem_func)

In [47]:
cv = CountVectorizer(stop_words='english')
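
To see what `stop_words='english'` changes, here is a small sketch on two made-up sentences, fitting one vectorizer with the option and one without.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog is on the mat']

# Same corpus, with and without scikit-learn's built-in English stop-word list.
with_all = CountVectorizer().fit(docs)
without_stop = CountVectorizer(stop_words='english').fit(docs)

print(sorted(with_all.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

Very common words such as `the`, `on` and `is` drop out of the second vocabulary, leaving fewer and more informative columns.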

In [48]:
cv.fit(horror_data.text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [49]:
cv.vocabulary_

{'process': 16946,
 'afforded': 452,
 'means': 13553,
 'ascertaining': 1269,
 'dimensions': 6069,
 'dungeon': 6830,
 'make': 13253,
 'circuit': 3675,
 'return': 18392,
 'point': 16441,
 'set': 19497,
 'aware': 1620,
 'fact': 8144,
 'perfectly': 15882,
 'uniform': 23052,
 'wall': 23988,
 'occurred': 14936,
 'fumbling': 9165,
 'mere': 13692,
 'mistake': 13929,
 'left': 12627,
 'hand': 10039,
 'gold': 9634,
 'snuff': 20156,
 'box': 2543,
 'capered': 3070,
 'hill': 10435,
 'cutting': 5197,
 'manner': 13331,
 'fantastic': 8248,
 'steps': 20746,
 'took': 22193,
 'incessantly': 11152,
 'air': 551,
 'greatest': 9778,
 'possible': 16596,
 'self': 19381,
 'satisfaction': 19018,
 'lovely': 13045,
 'spring': 20531,
 'looked': 12987,
 'windsor': 24388,
 'terrace': 21794,
 'sixteen': 19928,
 'fertile': 8438,
 'counties': 4830,
 'spread': 20524,
 'beneath': 2066,
 'speckled': 20405,
 'happy': 10085,
 'cottages': 4797,
 'wealthier': 24123,
 'towns': 22272,
 'years': 24662,
 'heart': 10243,
 'cheering'

In [50]:
len(cv.vocabulary_)

24764

In [51]:
cv.fit(horror_data.stem_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [52]:
out = cv.transform(horror_data.stem_words)

In [53]:
len(cv.vocabulary_)

15355
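
One thing worth checking when stop-word removal runs on text that has already been stemmed: a stem may no longer match the stop list. The sketch below shows this for a single word; how much it affects the horror dataset is not measured here.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

# For each word: is the original a stop word, and is its stem still one?
for word in ['was', 'the']:
    stem = stemmer.stem(word)
    print(word, word in ENGLISH_STOP_WORDS, stem, stem in ENGLISH_STOP_WORDS)
```

`was` is on the list, but its stem `wa` is not, so it stays in the vocabulary. A possible remedy is to remove stop words before stemming.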

In [54]:
from sklearn.neighbors import NearestNeighbors

In [55]:
nn = NearestNeighbors(metric='cosine')

In [56]:
nn.fit(out)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [57]:
nn.kneighbors(out[4:5])

(array([[1.11022302e-16, 4.57917835e-01, 4.79516561e-01, 4.96637990e-01,
         4.98449609e-01]]),
 array([[    4, 15457,  7409, 18122, 13440]], dtype=int64))
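
`kneighbors` returns a pair: the distances to the nearest neighbours and their row indices. Cosine distance is 1 minus cosine similarity, so the query row itself comes back first with a distance of roughly 0 (the `1.11e-16` above is floating-point noise). A minimal sketch on three made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

docs = [
    'his countenance was rough',
    'a smile shone on his countenance',
    'the dungeon wall seemed uniform',
]
X = CountVectorizer().fit_transform(docs)

# Ask for the 2 closest rows to document 0 (it will match itself first).
nn_demo = NearestNeighbors(n_neighbors=2, metric='cosine').fit(X)
distances, indices = nn_demo.kneighbors(X[0:1])

print(indices)    # [[0 1]]: document 0 itself, then document 1
print(distances)  # about 0 for itself; 1 - 2/(2*sqrt(5)), about 0.553, for document 1
```

Documents 0 and 1 share two words (`his`, `countenance`), which is what makes document 1 the closest match; document 2 shares none and would have distance 1.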

In [58]:
horror_data[:1].text[0]

'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'

In [59]:
horror_data.loc[4].text

'Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.'

In [60]:
horror_data.loc[15457].text

'His countenance was rough but intelligent his ample brow and quick grey eyes seemed to look out, over his own plans, and the opposition of his enemies.'

In [61]:
horror_data.loc[18122].text

'The smile of triumph shone on his countenance; determined to pursue his object to the uttermost, his manner and expression seem ominous of the accomplishment of his wishes.'

In [None]:
#convert list into dict
lst=[1,'a',2,'b',3,'c']
res_dct = {lst[i]: lst[i + 1] for i in range(0, len(lst), 2)} 
print(res_dct)