## Basic NLP Pipeline 
This pipeline converts raw text into numerical features that a model can use:
1. Data Collection
2. Tokenization, stopword removal, stemming
3. Building a common vocabulary
4. Vectorizing the documents
5. Performing classification/clustering

### 1. Data Collection

In [1]:
from nltk.corpus import brown
# https://www.nltk.org/book/ch02.html

In [2]:
print(type(brown))

<class 'nltk.corpus.util.LazyCorpusLoader'>


In [3]:
brown

<CategorizedTaggedCorpusReader in '/Users/riagupta/nltk_data/corpora/brown'>

In [4]:
# brown?

# A reader for part-of-speech tagged corpora whose documents are divided into categories 
# based on their file identifiers.

In [5]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [6]:
data = brown.sents(categories='fiction')
# Returns a list of sentences, each encoded as a list of word strings. We can specify category (or leave blank for all)

In [7]:
len(data)

# ie, 4249 sentences

4249

In [8]:
print(data)

[['Thirty-three'], ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'], ...]


In [9]:
print(data[4])

['Scotty', 'accepted', 'the', 'decision', 'with', 'indifference', 'and', 'did', 'not', 'enter', 'the', 'arguments', '.']


### 2. Tokenization, Stopword Removal and Stemming

### (a) Tokenization
Converting each sentence into a list of words

In [10]:
text = "It was a very pleasant day, the weather was cool and there were light showers. I went to the market to buy some fruits."
print(text)

It was a very pleasant day, the weather was cool and there were light showers. I went to the market to buy some fruits.


In [11]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [12]:
# To split our text into sentences
sentences = sent_tokenize(text)
print(type(sentences))
print(sentences)

<class 'list'>
['It was a very pleasant day, the weather was cool and there were light showers.', 'I went to the market to buy some fruits.']


In [13]:
# To split our text into words
words = word_tokenize(sentences[0].lower())
print(type(words))
print(words)
# Convert to lower case to perform proper stopword removal later

<class 'list'>
['it', 'was', 'a', 'very', 'pleasant', 'day', ',', 'the', 'weather', 'was', 'cool', 'and', 'there', 'were', 'light', 'showers', '.']


#### Tokenizing using Regular Expressions
The word tokenizer can't handle more complex tokenization needs, so we use the RegexpTokenizer class in NLTK.

In [14]:
text2 = "Send all the 50 documents related to clauses 1,2,3 at abc@gmail.com"

sents = sent_tokenize(text2)
print(sents)

word_list = word_tokenize(sents[0].lower())
print(word_list)

['Send all the 50 documents related to clauses 1,2,3 at abc@gmail.com']
['send', 'all', 'the', '50', 'documents', 'related', 'to', 'clauses', '1,2,3', 'at', 'abc', '@', 'gmail.com']


In [15]:
from nltk.tokenize import RegexpTokenizer

In [16]:
tokenizer = RegexpTokenizer("[a-zA-Z@]+")
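# Pass a regular expression pattern describing the tokens to KEEP; anything that does not match is dropped
# This pattern keeps runs of a-z, A-Z and the special char @ => numbers and punctuation (',' and '.') are removed,
# which is why 'abc@gmail.com' is split into 'abc@gmail' and 'com'
# Create a regular expression using the cheat sheet on regexpal.com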

In [17]:
print(text2)
print(tokenizer.tokenize(text2))     # With numbers removed

Send all the 50 documents related to clauses 1,2,3 at abc@gmail.com
['Send', 'all', 'the', 'documents', 'related', 'to', 'clauses', 'at', 'abc@gmail', 'com']
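
The pattern above splits the e-mail address at the dot. Below is a minimal sketch of one way to keep it in one piece, assuming an alternation that tries an address-like pattern first. It uses the standard `re` module; the pattern and the name `tokenize_keep_email` are illustrative assumptions, not part of NLTK. Because the pattern has no capturing groups, passing it to `RegexpTokenizer` should give the same tokens.

```python
import re

# Illustrative pattern (an assumption, not an NLTK default):
#   1st alternative: something@something, dots allowed (e-mail-like)
#   2nd alternative: a plain run of letters
TOKEN_PATTERN = re.compile(r"[\w.]+@[\w.]+|[a-zA-Z]+")

def tokenize_keep_email(text):
    """Return alphabetic tokens, keeping e-mail-like strings whole."""
    return TOKEN_PATTERN.findall(text)

text2 = "Send all the 50 documents related to clauses 1,2,3 at abc@gmail.com"
tokens = tokenize_keep_email(text2)
print(tokens)
# ['Send', 'all', 'the', 'documents', 'related', 'to', 'clauses', 'at', 'abc@gmail.com']
```

One caveat of this sketch: an address followed directly by a sentence-ending period would absorb that period, so a stricter pattern may be needed for real text.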


### (b) Stopword Removal
Removing unimportant words like is, an, the, it, there, and, has, etc.

In [18]:
from nltk.corpus import stopwords

In [19]:
sw = stopwords.words('english')

In [20]:
print(len(sw))
print(type(sw))

179
<class 'list'>


In [21]:
print(sw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [22]:
def filter_words(text):
    useful_words = [w for w in text if w not in sw]
    return useful_words

In [23]:
# Filtering words from our sentence using list comprehension

useful_words = [w for w in words if w not in sw]
print(useful_words)

['pleasant', 'day', ',', 'weather', 'cool', 'light', 'showers', '.']
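
The filtered list above still contains `','` and `'.'`, because punctuation is not part of the stopword list. The sketch below drops non-alphabetic tokens as well and uses a set for faster membership tests. The small `STOPWORDS` set is a hand-picked stand-in for `stopwords.words('english')` (an assumption, so the snippet runs without NLTK data), and `filter_tokens` is a hypothetical helper name.

```python
# Hand-picked stand-in for stopwords.words('english') (an assumption made so
# the snippet runs without downloading NLTK data).
STOPWORDS = {"it", "was", "a", "very", "the", "and", "there", "were"}

def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop stopwords and any token that is not purely alphabetic."""
    return [t for t in tokens if t.isalpha() and t not in stopwords]

words = ['it', 'was', 'a', 'very', 'pleasant', 'day', ',', 'the', 'weather',
         'was', 'cool', 'and', 'there', 'were', 'light', 'showers', '.']
cleaned = filter_tokens(words)
print(cleaned)
# ['pleasant', 'day', 'weather', 'cool', 'light', 'showers']
```

With the real list, `set(stopwords.words('english'))` gives the same constant-time lookups.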


### (c) Stemming
- A process that transforms words (verbs, plurals) into their radical form
- Reducing words to their root words.
- Purpose: Preserve the semantics of the sentence without increasing the number of unique tokens

eg. 
- Cat jumps
- Cat jumped
- Cat is jumping 

All can be reduced to jump <br>
This can also be done through lemmatization
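
To make the idea concrete, here is a toy suffix stripper. It is only a sketch of the principle and is not the Porter, Snowball, or Lancaster algorithm; the suffix list and the name `toy_stem` are assumptions chosen for this example.

```python
# Toy suffix stripper: NOT the Porter/Snowball algorithm, just the core idea.
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

forms = ["jumps", "jumped", "jumping"]
stems = [toy_stem(w) for w in forms]
print(stems)
# ['jump', 'jump', 'jump']
```

Such a naive rule set breaks quickly (for example, `toy_stem("loved")` gives `lov`), which is why the real stemmers apply many more rules and conditions.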

In [24]:
text = "Foxes love to make jumps. The quick brown fox was seen jumping over the lovely dog from a 6 feet high wall"

words_list = tokenizer.tokenize(text.lower())
print(words_list)

['foxes', 'love', 'to', 'make', 'jumps', 'the', 'quick', 'brown', 'fox', 'was', 'seen', 'jumping', 'over', 'the', 'lovely', 'dog', 'from', 'a', 'feet', 'high', 'wall']


In [25]:
words_list = filter_words(words_list)
print(words_list)

['foxes', 'love', 'make', 'jumps', 'quick', 'brown', 'fox', 'seen', 'jumping', 'lovely', 'dog', 'feet', 'high', 'wall']


#### Types of Stemmers:
1. Snowball Stemmer (Multilingual => supports French, German, Russian, etc)
2. Porter Stemmer
3. Lancaster Stemmer

In [26]:
from nltk.stem.snowball import PorterStemmer, SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
ps = PorterStemmer()

In [27]:
print(ps.stem('jumped'))
print(ps.stem('jumps'))
print(ps.stem('jumping'))
print(ps.stem('jumper'))

jump
jump
jump
jumper


In [28]:
print(ps.stem('lovely'))
print(ps.stem('awesome'))
print(ps.stem('teeth'))
print(ps.stem('teenager'))

love
awesom
teeth
teenag


In [29]:
ls = LancasterStemmer()

print(ls.stem('jumped'))
print(ls.stem('jumps'))
print(ls.stem('jumping'))
print(ls.stem('jumper'))
print(ls.stem('lovely'))
print(ls.stem('awesome'))
print(ls.stem('teeth'))
print(ls.stem('teenager'))

jump
jump
jump
jump
lov
awesom
tee
teen


In [30]:
ss = SnowballStemmer('english')
print(ss.stem('jumped'))
print(ss.stem('jumps'))
print(ss.stem('jumping'))
print(ss.stem('jumper'))
print(ss.stem('lovely'))
print(ss.stem('awesome'))
print(ss.stem('teeth'))
print(ss.stem('teenager'))

jump
jump
jump
jumper
love
awesom
teeth
teenag


In [31]:
ss2 = SnowballStemmer('french')

print(ss2.stem("parlais"))
print(ss2.stem("parles"))
print(ss2.stem("parler"))
print(ss2.stem("parlerai"))
print(ss2.stem("parlons"))
print(ss2.stem("finir"))
print(ss2.stem("finit"))

parl
parl
parl
parl
parlon
fin
fin


In [32]:
# Write a function that performs all 3 steps (tokenization, stopword removal and stemming) and also ignores any leading
# or trailing white space

In [33]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer, PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import brown

In [34]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [35]:
text = brown.sents(categories='romance')
sent = ' '.join(text[3743])
print(sent)

He had nothing much to say to her but that he said anything seemed to please her and he accompanied her on some of her unusually searching tours of Tokyo .


In [40]:
def cleanup(sent):
    print(sent+'\n')
    tokenizer = RegexpTokenizer("[a-zA-Z]+")
    sent = tokenizer.tokenize(sent.lower())
    print(sent)
    
    sw = stopwords.words('english')
    sent = [w for w in sent if w not in sw]
    print(sent)
    
    ss = SnowballStemmer('english')
    ps = PorterStemmer()
    ls = LancasterStemmer()
    stemmed_words = []
    stemmed_words2 = []
    stemmed_words3 = []

    for w in sent:
        stemmed_words.append(ss.stem(w))
        stemmed_words2.append(ps.stem(w))
        stemmed_words3.append(ls.stem(w))

    print()
    print(sent)
    print()
    print(stemmed_words)
    print()
    print(stemmed_words2)
    print()
    print(stemmed_words3)
    print()
    
    
text = "           He had nothing much to say to her but that          he said anything seemed to please her and he accompanied her on some of her unusually searching tours of Tokyo.             "
   
cleanup(text)

           He had nothing much to say to her but that          he said anything seemed to please her and he accompanied her on some of her unusually searching tours of Tokyo.             

['he', 'had', 'nothing', 'much', 'to', 'say', 'to', 'her', 'but', 'that', 'he', 'said', 'anything', 'seemed', 'to', 'please', 'her', 'and', 'he', 'accompanied', 'her', 'on', 'some', 'of', 'her', 'unusually', 'searching', 'tours', 'of', 'tokyo']
['nothing', 'much', 'say', 'said', 'anything', 'seemed', 'please', 'accompanied', 'unusually', 'searching', 'tours', 'tokyo']

['nothing', 'much', 'say', 'said', 'anything', 'seemed', 'please', 'accompanied', 'unusually', 'searching', 'tours', 'tokyo']

['noth', 'much', 'say', 'said', 'anyth', 'seem', 'pleas', 'accompani', 'unusu', 'search', 'tour', 'tokyo']

['noth', 'much', 'say', 'said', 'anyth', 'seem', 'pleas', 'accompani', 'unusu', 'search', 'tour', 'tokyo']

['noth', 'much', 'say', 'said', 'anyth', 'seem', 'pleas', 'accompany', 'unus', 'search', 'tour',
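
The `cleanup` function prints its intermediate results but returns nothing, so its output cannot be reused. Below is a minimal sketch of a variant that returns the processed tokens. It takes the stopword collection and the stemming function as arguments, which is a design assumption of this example, and it uses the standard `re` module in place of `RegexpTokenizer`; `preprocess` and `demo_stopwords` are hypothetical names.

```python
import re

def preprocess(sentence, stopwords, stem=lambda w: w):
    """Tokenize, lowercase, drop stopwords, then stem; return the token list."""
    # The regex only matches letters, so surrounding white space is ignored anyway;
    # strip() just makes that intent explicit.
    tokens = re.findall(r"[a-zA-Z]+", sentence.strip().lower())
    return [stem(t) for t in tokens if t not in stopwords]

# Tiny stand-in stopword set (an assumption); in the notebook this would be
# set(stopwords.words('english')).
demo_stopwords = {"he", "had", "to", "her", "but", "that", "and", "on", "some", "of"}

result = preprocess("   He had nothing much to say to her.   ", demo_stopwords)
print(result)
# ['nothing', 'much', 'say']
```

The default `stem` leaves words unchanged. To get stemmed output, pass any of the stemmers used above, for example `preprocess(text, sw, SnowballStemmer('english').stem)`.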

In [41]:
### Lemmatization (Try it yourself)
from nltk.stem import WordNetLemmatizer

l = WordNetLemmatizer()
l.lemmatize("crying", pos="v")   # pos="v" treats the word as a verb; the default pos is noun

'cry'

### 3. Building Common Vocabulary
After extracting the important words from every document, we collect them into a single vocabulary shared by the whole corpus.

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

In [42]:
corpus = [
        'Indian cricket team will wins World Cup, says Capt. Virat Kohli. World cup will be held at Sri Lanka.',
        'We will win next Lok Sabha Elections, says confident Indian PM',
        'The nobel laurate won the hearts of the people',
        'The movie Raazi is an exciting Indian Spy thriller based upon a real story'
]

In [47]:
cv = CountVectorizer()
vectorized_corpus = cv.fit_transform(corpus).toarray()

In [48]:
type(vectorized_corpus)

numpy.ndarray

In [50]:
print(vectorized_corpus.shape)
print(vectorized_corpus)

# Each vector entry denotes the frequency of the word at that index in the vocabulary. Indices follow the alphabetical
# order of the words and can be checked in the next cell, which prints the feature names

# Eg. word "world" is at last index in dictionary and appears 2x in sentence 1 => array has value 2 at first row 
# last index
# Eg. word "the" is at 10th last index in dictionary and appears 3x in sentence 3 => array has value 3 at third row 
# 10th last index

(4, 42)
[[0 1 0 1 1 0 1 2 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1
  0 2 0 1 0 2]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0
  1 1 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 3 0 0 0
  0 0 0 0 1 0]
 [1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0
  0 0 0 0 0 0]]


In [52]:
print(cv.get_feature_names_out().tolist())   # get_feature_names() was removed in scikit-learn 1.2

['an', 'at', 'based', 'be', 'capt', 'confident', 'cricket', 'cup', 'elections', 'exciting', 'hearts', 'held', 'indian', 'is', 'kohli', 'lanka', 'laurate', 'lok', 'movie', 'next', 'nobel', 'of', 'people', 'pm', 'raazi', 'real', 'sabha', 'says', 'spy', 'sri', 'story', 'team', 'the', 'thriller', 'upon', 'virat', 'we', 'will', 'win', 'wins', 'won', 'world']


In [55]:
# Dictionary which maps word to index
print(cv.vocabulary_)

{'indian': 12, 'cricket': 6, 'team': 31, 'will': 37, 'wins': 39, 'world': 41, 'cup': 7, 'says': 27, 'capt': 4, 'virat': 35, 'kohli': 14, 'be': 3, 'held': 11, 'at': 1, 'sri': 29, 'lanka': 15, 'we': 36, 'win': 38, 'next': 19, 'lok': 17, 'sabha': 26, 'elections': 8, 'confident': 5, 'pm': 23, 'the': 32, 'nobel': 20, 'laurate': 16, 'won': 40, 'hearts': 10, 'of': 21, 'people': 22, 'movie': 18, 'raazi': 24, 'is': 13, 'an': 0, 'exciting': 9, 'spy': 28, 'thriller': 33, 'based': 2, 'upon': 34, 'real': 25, 'story': 30}
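
The mapping above can be reproduced by hand. The sketch below assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, indices assigned in alphabetical order) and rebuilds the vocabulary and the count vector for the third sentence of the corpus. The function names are hypothetical; this is not scikit-learn's implementation.

```python
import re
from collections import Counter

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters, which
    # mirrors CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_vocabulary(docs):
    """Map each distinct token to an index, in alphabetical order."""
    tokens = {tok for doc in docs for tok in tokenize(doc)}
    return {tok: i for i, tok in enumerate(sorted(tokens))}

def count_vector(doc, vocab):
    """Return the word counts of doc, ordered by vocabulary index."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

docs = ['The nobel laurate won the hearts of the people']
vocab = build_vocabulary(docs)
vector = count_vector(docs[0], vocab)
print(vocab)
print(vector)
# {'hearts': 0, 'laurate': 1, 'nobel': 2, 'of': 3, 'people': 4, 'the': 5, 'won': 6}
# [1, 1, 1, 1, 1, 3, 1]
```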


In [56]:
print(len(vectorized_corpus[0]))

42


In [64]:
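# Reverse mapping: given any self-created count vector, we want to see which vocabulary words it marks as present
# (word order and counts are lost, so the result is a set of words, not the original sentence)
import numpy as np

vec = np.ones((42,))
vec[3:7] = 0
print(vec)
print()

# inverse_transform expects a 2-D array of shape (n_samples, n_features)
print(cv.inverse_transform(vec.reshape(1, -1)))
vec[3:7] = 1

print(cv.inverse_transform(vec.reshape(1, -1)))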

[1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

[array(['an', 'at', 'based', 'cup', 'elections', 'exciting', 'hearts',
       'held', 'indian', 'is', 'kohli', 'lanka', 'laurate', 'lok',
       'movie', 'next', 'nobel', 'of', 'people', 'pm', 'raazi', 'real',
       'sabha', 'says', 'spy', 'sri', 'story', 'team', 'the', 'thriller',
       'upon', 'virat', 'we', 'will', 'win', 'wins', 'won', 'world'],
      dtype='<U9')]
[array(['an', 'at', 'based', 'be', 'capt', 'confident', 'cricket', 'cup',
       'elections', 'exciting', 'hearts', 'held', 'indian', 'is', 'kohli',
       'lanka', 'laurate', 'lok', 'movie', 'next', 'nobel', 'of',
       'people', 'pm', 'raazi', 'real', 'sabha', 'says', 'spy', 'sri',
       'story', 'team', 'the', 'thriller', 'upon', 'virat', 'we', 'will',
       'win', 'wins', 'won', 'world'], dtype='<U9')]


In [70]:
# Check index of any word
print(cv.vocabulary_['virat'])
print(cv.vocabulary_['kohli'])

35
14


### 4. Vectorize the Document
Use the bag-of-words model to convert each document into a vector of word counts over this vocabulary.

eg. It was raining and the cat jumped <br>
Vocab => rain, cat, jump

For sentence: The cat was jumping, vector is [0,1,1]
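
The example above can be checked in a few lines. In this sketch the stemmed, stopword-free tokens of the sentence are written out by hand (an assumption made to keep the snippet self-contained); in the notebook they would come from the tokenizer, stopword filter, and stemmer of step 2.

```python
vocab = ["rain", "cat", "jump"]

# Tokens of "The cat was jumping" after stopword removal and stemming,
# written out by hand for this illustration.
sentence_tokens = ["cat", "jump"]

# One count per vocabulary word, in vocabulary order.
vector = [sentence_tokens.count(word) for word in vocab]
print(vector)
# [0, 1, 1]
```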

In [None]:
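# For a corpus with one million distinct words, every sentence is represented by a vector of size one million.
# To reduce the vector size effectively, shrink the vocabulary with a custom tokenizer that drops stopwords
# (it can be passed to CountVectorizer through its tokenizer parameter).

def myTokenizer(sentence):
    words = tokenizer.tokenize(sentence.lower())   # lowercase so that stopwords like 'The' are matched
    return filter_words(words)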
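
A custom tokenizer like this is typically handed to `CountVectorizer` through its `tokenizer` parameter. The sketch below compares the vocabulary size with and without it on the corpus from step 3. It is self-contained, so it uses `re` and a small hand-picked stopword set as stand-ins for the notebook's `tokenizer` and `sw` (both assumptions); `my_tokenizer` is a hypothetical name.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Small stand-in for stopwords.words('english') (an assumption, so that the
# snippet does not need NLTK data).
STOPWORDS = {"a", "an", "at", "be", "is", "of", "the", "we", "will"}

def my_tokenizer(sentence):
    """Split into alphabetic tokens and drop stopwords."""
    words = re.findall(r"[a-zA-Z@]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

corpus = [
    'Indian cricket team will wins World Cup, says Capt. Virat Kohli. World cup will be held at Sri Lanka.',
    'We will win next Lok Sabha Elections, says confident Indian PM',
    'The nobel laurate won the hearts of the people',
    'The movie Raazi is an exciting Indian Spy thriller based upon a real story',
]

cv_default = CountVectorizer()
# token_pattern=None avoids the warning that recent scikit-learn versions emit
# when a custom tokenizer makes the default pattern unused.
cv_custom = CountVectorizer(tokenizer=my_tokenizer, token_pattern=None)

default_size = cv_default.fit_transform(corpus).shape[1]
custom_size = cv_custom.fit_transform(corpus).shape[1]
print(default_size, custom_size)
```

Every stopword removed from the vocabulary removes one column from each document vector, so the saving grows with the size of the stopword list.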

### 5. Classification/Clustering
Feed the vectors to a classification or clustering algorithm to make predictions, such as what kind of data this is or which category a news article belongs to.
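
As a rough illustration of this last step, the sketch below chains `CountVectorizer` with a multinomial Naive Bayes classifier from scikit-learn. The four training sentences and their labels are made up for this example, and Naive Bayes is just one reasonable choice of classifier, not one prescribed by the pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (hypothetical data, for illustration only).
train_docs = [
    "the team won the cricket match",
    "the batsman scored a century in the world cup",
    "the party won the election",
    "the minister addressed the parliament",
]
train_labels = ["sports", "sports", "politics", "politics"]

# Vectorize with bag-of-words, then classify the count vectors.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

predictions = model.predict(["cricket world cup match", "parliament election"])
print(predictions)
```

The first test sentence shares words only with the sports examples and the second only with the politics examples, so they are assigned to `sports` and `politics` respectively.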