### Install NLTK

In [1]:
# pip install

!pip install nltk





In [2]:
# import and download

import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Tokenization
Splitting a document into smaller pieces is known as tokenization.

These pieces of text are called tokens.

Sentence tokenization splits a document or article into sentences.
Word tokenization splits a document or article into words.

In [3]:
# importing tokenizers

from nltk import sent_tokenize, word_tokenize

In [4]:
# using sentence tokenizer

example_text = "This is python class. I like python.  I am a student. This is last line. "
sent = sent_tokenize(example_text)
print(sent)

['This is python class.', 'I like python.', 'I am a student.', 'This is last line.']


In [5]:
# word tokenizer

words = word_tokenize(example_text)
print(words)

['This', 'is', 'python', 'class', '.', 'I', 'like', 'python', '.', 'I', 'am', 'a', 'student', '.', 'This', 'is', 'last', 'line', '.']


In [6]:
# using str.split(): splits on whitespace only, so punctuation stays attached

example_words = example_text.split()
print(example_words)

['This', 'is', 'python', 'class.', 'I', 'like', 'python.', 'I', 'am', 'a', 'student.', 'This', 'is', 'last', 'line.']
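
The difference between the two outputs above comes down to how punctuation is treated. As a rough illustration only, the following sketch approximates both tokenizers with regular expressions. The function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helpers written for this example; they are not part of NLTK and do not reproduce its algorithms.

```python
import re

def simple_sent_tokenize(text):
    """Naive rule: split after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "This is python class. I like python."
print(simple_sent_tokenize(sample))
print(simple_word_tokenize(sample))
```

A rule this simple breaks on abbreviations such as "Dr. Smith", so treat it as a teaching aid and not as a substitute for `sent_tokenize`.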


## Text Preprocessing

Text is the most unstructured form of available data. It contains various types of noise and cannot be readily analyzed without preprocessing.

There are three major approaches to text preprocessing:
1. Noise reduction
2. Lexicon normalization
3. Object standardization

#### Noise reduction:

Any piece of text that is not relevant to the context of the data or to the end output can be treated as noise.

Ex: 'is', 'or', 'and', 'the'.

These words are also called stopwords.

In [7]:
# making a list of noise words (words that carry little meaning on their own)

noise_text = ["is", "a", "for", "that", "this", "it", "of", "to"]

# creating a function to remove noise

def remove_noise(input_text):
    words = input_text.split()                                    # split the sentence into words on whitespace
    noise_free_words = [w for w in words if w not in noise_text]  # list comprehension: keep words not in the noise list
    noise_free_text = " ".join(noise_free_words)                  # join the remaining words back into one string
    return noise_free_text


In [8]:
remove_noise(" this is a cricket bat. give this bat to Virat Kohli")

'cricket bat. give bat Virat Kohli'
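
Note that `remove_noise` compares each whitespace-separated piece exactly, so a capitalized "This" or a word with trailing punctuation such as "it." would not be removed. The following is a minimal sketch of one way to handle that, assuming it is acceptable to lowercase each word and strip surrounding punctuation before the comparison. The names `NOISE_WORDS` and `remove_noise_normalized` are hypothetical and introduced only for this example.

```python
import string

# Same noise words as above, stored in a set for fast membership tests.
NOISE_WORDS = {"is", "a", "for", "that", "this", "it", "of", "to"}

def remove_noise_normalized(input_text):
    kept = []
    for token in input_text.split():
        # Compare a normalized copy, but keep the original token in the output.
        core = token.strip(string.punctuation).lower()
        if core not in NOISE_WORDS:
            kept.append(token)
    return " ".join(kept)

print(remove_noise_normalized("This is a cricket bat. Give this bat to Virat Kohli"))
```

Only the comparison is normalized, so the words that survive keep their original casing and punctuation.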

##### Method 2: Stopword removal

In [9]:
#importing stopwords from nltk.corpus

from nltk.corpus import stopwords

In [10]:
# loading the English stopwords into a set (the corpus file id is lowercase "english")

stop_words = set(stopwords.words("english"))

print(stop_words)

{'t', 'their', 'to', 'should', 'did', 'after', "mustn't", 'for', "don't", "couldn't", 'once', 'our', 'no', 'doesn', "it's", 'himself', 'how', 'such', 'i', 'were', 'not', 'again', 'd', "mightn't", 'she', 's', 'hasn', "that'll", 'against', 'more', "should've", 'aren', 'below', 'into', 'further', 'theirs', "wouldn't", 'it', 'but', 'myself', 'that', 'and', 'own', 'just', 'ain', 'has', 'being', "didn't", 'between', 'wouldn', 'my', 'which', 'will', "aren't", 'above', 'mustn', 'are', 'whom', 'now', 'do', 'a', 'couldn', "shouldn't", "haven't", 'off', 'these', 'with', 'shouldn', 'few', 'this', 'than', 'who', 'when', 'same', 'too', "wasn't", 'because', 're', "doesn't", 'other', 'have', 'ourselves', 'there', 'herself', 'themselves', 'at', 'about', 'we', 'out', 'under', 'll', 'nor', 'the', "hasn't", 'won', 'having', "you'd", 'only', 'hadn', 'its', 'yourselves', 'can', 'm', 'from', 'on', 'mightn', 'until', "you've", 'ma', 'each', 'me', 'what', 'those', 'while', 'then', 'haven', 'why', 'didn', "were

In [11]:
# defining a function to remove stopwords

def noiseless_text(input_text):
    words = word_tokenize(input_text)      # split the sentence into word tokens
    filtered_words = []                    # empty list to collect the remaining words

    for w in words:                        # loop over the tokens
        if w not in stop_words:
            filtered_words.append(w)       # keep tokens that are not stopwords
    return filtered_words

In [12]:
noiseless_text("India fought during second world war. India sent soldiers and supplies into war")

['India', 'fought', 'second', 'world', 'war', '.', 'India', 'sent', 'soldiers', 'supplies', 'war']


## Lexicon normalization

Another type of textual noise arises when a single word appears in multiple forms.

Example: write, wrote, writing, writer, written

The most common lexicon normalization practices are:

#### Stemming:  

Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.


In [13]:
# lower casing the words

example = "Automation automatic automated automotive"

example_lower = example.lower().split()
print(example_lower)



['automation', 'automatic', 'automated', 'automotive']


In [14]:
#stemming
#import stemmer

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

In [15]:
# loop over the words and stem each one with the Porter stemmer

for word in example_lower:
    stemmed_word = ps.stem(word)
    print(stemmed_word)

autom
automat
autom
automot
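
To make the idea of rule-based suffix stripping concrete, here is a toy stemmer in plain Python. It is only a sketch: it is not the Porter algorithm, and the suffix list, the `min_stem_len` threshold and the name `naive_stem` are assumptions chosen for illustration.

```python
# Suffixes are tried in order, so "es" is checked before "s".
SUFFIXES = ("ing", "ly", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["playing", "quickly", "boxes", "cats", "is"]:
    print(w, "->", naive_stem(w))
```

Rules this crude also damage words: `naive_stem("this")` returns `"thi"`. This is the kind of error that makes stemming "rudimentary", and stems such as `autom` above are likewise not dictionary words.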



#### Lemmatization:

Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word. It makes use of a vocabulary (the dictionary form of words).

In [16]:
#import lemmatizer 

from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()

In [17]:
example2 = ["maker", "called" , "ears", "loving"]

In [19]:
# loop over the words and lemmatize each one (the default part of speech is noun)

for word in example2:
    lemmatized_word =  lem.lemmatize(word)
    print(lemmatized_word)

maker
called
ear
loving


In [22]:
# lemmatizer with the part of speech (POS) set to adjective ('a')

ex_3 = ["fast", "faster", "fastest"]
for word in ex_3:
    lemmatized_word = lem.lemmatize(word, pos='a')
    print(lemmatized_word)

fast
fast
fast
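
The results above depend on the part-of-speech tag: `lemmatize()` treats every word as a noun unless `pos` is given, which is why "called" and "loving" came back unchanged. The sketch below imitates that behavior with a tiny hand-written dictionary. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names for this example only; a real lemmatizer consults WordNet instead of a hard-coded table.

```python
# Toy vocabulary: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ("ears", "n"): "ear",
    ("called", "v"): "call",
    ("loving", "v"): "love",
    ("faster", "a"): "fast",
    ("fastest", "a"): "fast",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("called"))           # no noun entry, so the word is unchanged
print(toy_lemmatize("called", pos="v"))  # the verb entry maps to "call"
print(toy_lemmatize("fastest", pos="a"))
```

Unlike a stemmer, a lookup of this kind returns a dictionary form only when the word and tag are found together, so passing the right tag matters.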
