# **NLP Basics**

In [3]:
import nltk
nltk.download('punkt')
sentence = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do."

sentence_tokenized = nltk.word_tokenize(sentence)

print (sentence_tokenized)

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', '.']


[nltk_data] Downloading package punkt to C:\Users\Sumit
[nltk_data]     Dembla\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Tokens can be extracted from a paragraph as well. Sometimes a paragraph also needs to be broken down into sentences, which can be done with `sent_tokenize()`.

In [4]:
paragraph = """
            Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, 
            but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”
"""

para_tokenized = nltk.word_tokenize(paragraph)

print (para_tokenized)

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', '“', 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ',', '”', 'thought', 'Alice', '“', 'without', 'pictures', 'or', 'conversations', '?', '”']


In [5]:
paragraph = """
            As she said this she looked down at her hands, and was surprised to 
            see that she had put on one of the Rabbit’s little white kid gloves while she was talking. “How can I have done that?” she thought. “I must be growing small again.”
"""

para_tokenized = nltk.sent_tokenize(paragraph)

print (para_tokenized)

['\n            As she said this she looked down at her hands, and was surprised to \n            see that she had put on one of the Rabbit’s little white kid gloves while she was talking.', '“How can I have done that?” she thought.', '“I must be growing small again.”']
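To build intuition for what the tokenizers above do, here is a minimal regex-based sketch. It is only a rough approximation and not how NLTK works internally: `word_tokenize` and `sent_tokenize` treat contractions, abbreviations, and quotation marks with more care. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration.

```python
import re

def simple_word_tokenize(text):
    # A run of word characters becomes one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows ., ? or ! (naive: it would also split "Dr. Smith").
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

sentence = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do."
word_tokens = simple_word_tokenize(sentence)
print(word_tokens)

sentences = simple_sent_tokenize("It was late. Was Alice tired? She was!")
print(sentences)
```

For this particular sentence the sketch yields the same 23 tokens as `word_tokenize`, but the two will diverge on harder input.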


# **Normalization**

In this section we will learn techniques for removing unnecessary words from a corpus. Words such as "was" or "can" carry little value when we don't care much about sentence structure. Common words such as "and", "is", "or", "a", "an", and "the", known as stop words, provide little information in a sentence, so text processing usually discards them. The `nltk` library comes in handy for this kind of cleaning: it ships stop-word lists for a number of languages (21 at the time of writing). Since our paragraph is in English, we will use the English list.

Also, "As" and "as" are usually considered the same word when counting word frequencies. It is good practice to convert all text to either upper or lower case so that words differing only in case are treated as one. Note that the `nltk` English stop-word list is lowercase, so a capitalized "As" is removed only if the text is lowercased first.

Text normalization is an umbrella term for the following operations:

- Removal of stop words
- Stemming - crudely chopping the suffix off words
- Converting text to upper or lower case
- Lemmatization - replacing a token with its dictionary form (lemma)
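As a minimal sketch of how case conversion and stop-word removal interact, the snippet below uses a tiny hand-written stop-word set (an assumption made for illustration; NLTK's English list is much longer) and a hypothetical helper called `normalize`. It shows that a capitalized stop word survives unless the text is lowercased first.

```python
import re

# Tiny illustrative stop-word set; an assumption for this sketch, not NLTK's list.
STOP_WORDS = {"as", "she", "this", "at", "her", "and", "was", "to", "down"}

def normalize(text, lowercase=True):
    # Keep only runs of word characters, optionally lowercase them,
    # then drop every token found in the stop-word set.
    tokens = re.findall(r"\w+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

text = "As she said this she looked down at her hands"
without_case_folding = normalize(text, lowercase=False)
with_case_folding = normalize(text)

print(without_case_folding)  # "As" survives because the stop list is lowercase
print(with_case_folding)
```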


In [5]:
from nltk.corpus import stopwords
#Download stop words from nltk library
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

tokens = nltk.word_tokenize(paragraph)

clean_paragraph = [word for word in tokens if word not in stop_words]

try:
  print ("print tokens ")
  print (clean_paragraph)
except NameError:
  print ("expected variable called clean_paragraph")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
print tokens 
['As', 'said', 'looked', 'hands', ',', 'surprised', 'see', 'put', 'one', 'Rabbit', '’', 'little', 'white', 'kid', 'gloves', 'talking', '.', '“', 'How', 'I', 'done', '?', '”', 'thought', '.', '“', 'I', 'must', 'growing', 'small', '.', '”']


In [6]:
sentence = "Wow this is cool ! wow this is awesome."
lit_sentence = sentence.lower()
upp_sentence = sentence.upper()

print (lit_sentence)

print (upp_sentence)

wow this is cool ! wow this is awesome.
WOW THIS IS COOL ! WOW THIS IS AWESOME.


# Stemming and Lemmatization

There are various forms of words in the grammar of any language, such as differ (verb), different (adjective), differently (adverb), and differentness (noun). Both stemming and lemmatization aim to reduce inflected and derived forms of a word to a common base form. For example, "different", "differently", and "differentness" are reduced to "differ", and "am", "is", and "are" are reduced to "be". Stemming usually chops a prefix or suffix off a word. For example:

Alice's ring has different colors. => Alice ring has differ color.

Lemmatization uses morphological analysis to transform the various forms of a word into a root form, which is called a *lemma*.

For the word "sing", a naive stemmer that simply chops off "ing" would return "s", whereas lemmatization returns "sing" if the token is tagged as a verb. Porter's algorithm is one of the most common and effective algorithms for stemming. Let's see how we can leverage Porter's algorithm for stemming.

In [7]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

sentence = "There might be some sense in your knocking."

tokens = word_tokenize(sentence)
word_stemmed = [stemmer.stem(word) for word in tokens]

print ("List of Tokens ")
print (tokens)

print ("\n")
print ("List of stemmed words")
print (word_stemmed)

List of Tokens 
['There', 'might', 'be', 'some', 'sense', 'in', 'your', 'knocking', '.']


List of stemmed words
['there', 'might', 'be', 'some', 'sens', 'in', 'your', 'knock', '.']
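The over-stemming risk mentioned for "sing" can be sketched in a few lines of plain Python. The two helpers below, `naive_stem` and `guarded_stem`, are hypothetical and handle only the "ing" suffix. The guard is similar in spirit to, but much simpler than, the conditions Porter's algorithm checks before removing a suffix.

```python
VOWELS = set("aeiou")

def naive_stem(word):
    # Blindly strip a trailing "ing", whatever is left over.
    return word[:-3] if word.endswith("ing") else word

def guarded_stem(word):
    # Strip "ing" only when the remaining stem still contains a vowel.
    if word.endswith("ing") and VOWELS & set(word[:-3]):
        return word[:-3]
    return word

for w in ["knocking", "sing"]:
    print(w, "->", naive_stem(w), "/", guarded_stem(w))
```

The naive version reduces "sing" to "s", while the guarded version leaves it intact and still reduces "knocking" to "knock".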


In [8]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
sentence = "There might be some sense in your knocking."
tokens = word_tokenize(sentence)

lemma = [lemmatizer.lemmatize(token) for token in tokens]

print (lemma)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
['There', 'might', 'be', 'some', 'sense', 'in', 'your', 'knocking', '.']
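Notice that "knocking" comes back unchanged in the output above. `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun (`'n'`), so every token is looked up as a noun unless told otherwise. The toy lookup below sketches why the part of speech changes the answer. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins, not WordNet data or NLTK code.

```python
# Hypothetical (word, part-of-speech) -> lemma table.
# Real lemmatizers consult a lexicon such as WordNet instead.
TOY_LEMMAS = {
    ("knocking", "v"): "knock",
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("colors", "n"): "color",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists for this part of speech.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("knocking"))           # looked up as a noun: unchanged
print(toy_lemmatize("knocking", pos="v"))  # looked up as a verb
print(toy_lemmatize("are", pos="v"))
```

With NLTK itself, passing `pos='v'` to `lemmatize` should give the verb reading in the same way.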


# Bag-of-Words
![](bag_of_words.png)
We will now go through one of the basic techniques in NLP, the bag-of-words model, which is useful for information retrieval and classification tasks such as finding similar documents in a corpus. The model cares about neither grammar nor word order; only the frequency of every word matters. It is usually applied after stop-word removal, because we don't want to count words that carry little information and exist only for the grammar of the language. The word frequencies are then used as features for a machine learning model.

This technique is useful for elementary sentiment analysis and basic spam filtering: positive and negative sentiment tend to show up as recurring positive and negative words, and a lot of spam reuses the same words, such as "lottery" and "win".

We will use `sklearn` to split sentences into tokens and count frequencies. The `CountVectorizer` class helps us achieve this. Once it has been fitted, `corpus_vectorizer.vocabulary_` holds the vocabulary of the corpus as a mapping from each word to its column index in the count matrix (not the word's frequency). Let's see this in action.

In [9]:
corpus = ["Emma was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been."]

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
corpus_vectorizer = CountVectorizer()
corpus_vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [11]:
#show the column index assigned to each word in the vocabulary
corpus_vectorizer.vocabulary_

{'and': 0,
 'because': 1,
 'been': 2,
 'catholic': 3,
 'emma': 4,
 'father': 5,
 'had': 6,
 'her': 7,
 'his': 8,
 'mother': 9,
 'nory': 10,
 'or': 11,
 'was': 12}

We used `sklearn`'s `CountVectorizer` to split the text into words and build the vocabulary. Let's understand the idea without using any NLP library. First, the sentence is split on white space, and then a dictionary is created to keep track of how often each token occurs. `CountVectorizer` performs some additional cleanup, such as stripping punctuation (,;.! etc.) and, with its default token pattern, dropping single-character tokens such as "a". For example, compare "nory" (above) with "Nory’s" in the output of the following code. `CountVectorizer` also converts all letters to lowercase by default so that "WORD" and "word" are considered the same word.

In [12]:
corpus = "Emma was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been."

In [13]:
#Python code without any nlp library
tokens = corpus.split()
#Uncomment to see all the tokens
#print (tokens)
#define empty dictionary 
dic = {}

#save frequency of each token in dictionary
for token in tokens:
  if token not in dic:
    dic[token] = 1
  else:
    dic[token] = dic[token] + 1

print (dic)

{'Emma': 1, 'was': 6, 'a': 6, 'Catholic': 3, 'because': 3, 'her': 3, 'mother': 3, 'Catholic,': 3, 'and': 2, 'Nory’s': 1, 'father': 2, 'his': 1, 'or': 1, 'had': 1, 'been.': 1}
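The whitespace split above keeps "Catholic" and "Catholic," apart. As a hedged sketch of how to get closer to `CountVectorizer`'s behaviour with only the standard library, the snippet below lowercases the text and applies the default `token_pattern` shown in the `CountVectorizer` output earlier. It imitates the effect and is not the library's implementation.

```python
import re
from collections import Counter

corpus = "Emma was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been."

# Lowercase, then keep tokens of two or more word characters,
# mirroring the default token_pattern '(?u)\b\w\w+\b'.
tokens = re.findall(r"\b\w\w+\b", corpus.lower())
counts = Counter(tokens)

print(sorted(counts.items()))
```

Punctuation no longer sticks to words, "a" and the "s" of "Nory’s" are dropped as single-character tokens, and the 13 remaining keys match the vocabulary that `CountVectorizer` produced.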


# TF-IDF



This section covers another feature extraction technique, one that emphasizes words that are distinctive to a document. TF-IDF stands for term frequency-inverse document frequency. It combines two quantities: 
1. **Term Frequency** - (number of times a word _w_ appeared in the document) / (total number of words in the document)
2. **Inverse Document Frequency** - How rare the word is across the corpus, and therefore how informative it is for the documents that contain it. It can be derived by the following equation.

  IDF(w) = log (Total number of documents / number of documents with word _w_ in it)

For example, consider a corpus of 1 million documents, each made of 100 words. Suppose the word "love" appears 10 times in a given document and occurs in 10,000 documents across the corpus. Using a base-10 logarithm:

Term Frequency = 10 / 100 = 0.1

IDF = log10(1,000,000 / 10,000) = log10(100) = 2

TF-IDF = 0.1 * 2 = 0.2
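The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a base-10 logarithm and reads the 10,000 figure as the number of documents containing the word. The variable names are illustrative only.

```python
import math

total_docs = 1_000_000      # documents in the corpus
docs_with_word = 10_000     # documents that contain the word
occurrences_in_doc = 10     # times the word appears in one document
words_in_doc = 100          # total words in that document

tf = occurrences_in_doc / words_in_doc
idf = math.log10(total_docs / docs_with_word)
tf_idf = tf * idf

print(tf, idf, tf_idf)
```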

Let's understand this concept with a small corpus of documents (in the real world a corpus can be enormous). We will use `TfidfVectorizer` from `sklearn` to convert raw documents into a TF-IDF matrix. To visualize the output better, we put it in a pandas data frame in which each row is a word from the corpus and each column is a document. Note that `sklearn` uses a smoothed variant of the formula above by default, IDF(w) = ln((1 + N) / (1 + df(w))) + 1, where N is the number of documents and df(w) is the number of documents containing _w_. Because we pass `norm=None`, the raw term count is used as the term frequency.

"cute" appears once in every document, so its score is 1 everywhere, and the same holds for "is". "she" occurs only in `doc_1`, so it gets a higher score there. "very" appears in two of the three documents; it is less important for `doc_2` than for `doc_3`, where it occurs twice. "she", "he", and "very" are the most important terms for `doc_1`, `doc_2`, and `doc_3`, respectively. In short, the higher the score, the more important the term is for that document.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [15]:
doc_1 = "She is cute"
doc_2 = "He is very cute"
doc_3 = "Angelina is very very cute"

corpus = [doc_1, doc_2, doc_3]

corpus_preprocess = []

for doc in corpus:
  corpus_preprocess.append(doc.lower())

corpus_vectorizer = TfidfVectorizer(norm=None)
tf_idf_scores = corpus_vectorizer.fit_transform(corpus_preprocess)

#scikit-learn >= 1.0; older releases used get_feature_names(), which was removed in 1.2
feature_names = corpus_vectorizer.get_feature_names_out()
corpus_index = [doc for doc in corpus_preprocess]

print ("TF-IDF table of corpus")
print ("======================\n")

print(pd.DataFrame(tf_idf_scores.T.todense(), index = feature_names, columns = ["doc_1", "doc_2", "doc_3"]))


TF-IDF table of corpus

             doc_1     doc_2     doc_3
angelina  0.000000  0.000000  1.693147
cute      1.000000  1.000000  1.000000
he        0.000000  1.693147  0.000000
is        1.000000  1.000000  1.000000
she       1.693147  0.000000  0.000000
very      0.000000  1.287682  2.575364
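The table can be reproduced by hand. The sketch below assumes scikit-learn's documented default of a smoothed IDF, ln((1 + N) / (1 + df)) + 1, multiplied by the raw term count (which is what `norm=None` leaves you with). The helper `smoothed_idf` is written for this illustration and is not part of scikit-learn.

```python
import math

docs = ["she is cute", "he is very cute", "angelina is very very cute"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for doc in tokenized for term in doc})
n_docs = len(docs)

def smoothed_idf(term):
    # df: number of documents that contain the term at least once.
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

# One list of per-document scores for every term: raw count * smoothed IDF.
tf_idf = {
    term: [doc.count(term) * smoothed_idf(term) for doc in tokenized]
    for term in vocabulary
}

for term, scores in tf_idf.items():
    print(f"{term:<9}", [round(s, 6) for s in scores])
```

A term present in every document gets ln(1) + 1 = 1, which is why "cute" and "is" score exactly 1.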


Internally, `TfidfVectorizer` is equivalent to a `CountVectorizer` followed by a `TfidfTransformer`: the first computes raw term counts and the second applies the inverse document frequency weighting. For a better understanding, a breakdown is given below.

In [37]:
#Part 1 - Calculating Term Frequencies
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
term_frequency = vectorizer.fit_transform(corpus_preprocess)

#uncomment following to check feature names (scikit-learn >= 1.0)
#print(vectorizer.get_feature_names_out())

print ("Term Frequency Matrix")
print ("=====================")
print (term_frequency.toarray())
print ("=====================")
#Part 2 - Calculating Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(norm=None)
tfidf_transformer.fit(term_frequency)

idf = tfidf_transformer.idf_

df = pd.DataFrame(idf, index = feature_names, columns=['IDF'])
print ("\nIDF table")
print ("=========")
print(df)

#Part 3 - Weighting the term counts by the fitted IDF values
tf_idf_scores = tfidf_transformer.transform(term_frequency)


print ("\nTF-IDF matrix")
print ("=============\n")
print (tf_idf_scores.toarray())

Term Frequency Matrix
[[0 1 0 1 1 0]
 [0 1 1 1 0 1]
 [1 1 0 1 0 2]]

IDF table
               IDF
angelina  1.693147
cute      1.000000
he        1.693147
is        1.000000
she       1.693147
very      1.287682

TF-IDF matrix

[[0.         1.         0.         1.         1.69314718 0.        ]
 [0.         1.         1.69314718 1.         0.         1.28768207]
 [1.69314718 1.         0.         1.         0.         2.57536414]]
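Both cells pass `norm=None`, so the rows keep their raw scores. By default, `TfidfVectorizer` and `TfidfTransformer` use `norm='l2'`, which scales each document row to unit Euclidean length so that long and short documents become comparable. As a minimal sketch of that step, here it is applied by hand to the `doc_3` row printed above:

```python
import math

# doc_3 row of the unnormalized TF-IDF matrix printed above.
doc_3 = [1.69314718, 1.0, 0.0, 1.0, 0.0, 2.57536414]

# Euclidean (L2) length of the row, then divide every entry by it.
length = math.sqrt(sum(x * x for x in doc_3))
doc_3_l2 = [x / length for x in doc_3]

print([round(x, 4) for x in doc_3_l2])
```

The relative ordering of terms within the document is unchanged; only the scale differs.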
