# <center>3.2 Manipulating Human Language with NLP</center>
In this notebook, you will learn about:
- Tokenization
- Stopwords in NLP
- Keyword extraction in NLP
- Lemmatization and Stemming
- Bag-of-words

## <center>Tokenization </center>
**This section gives a friendly description of several tokenization methods, with Python code.**

### Introduction <a id ="1"></a>

Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller chunks, called tokens, which are usually words or sentences. Splitting text into words is called 'word tokenization', and splitting it into sentences is called 'sentence tokenization'. Whitespace is generally used as the delimiter for word tokenization, while characters such as periods, exclamation points and newlines are used for sentence tokenization. We have to choose the appropriate method for the task at hand. Depending on the method, some characters, such as spaces and punctuation, are dropped and do not appear in the final list of tokens.

![image.png](attachment:image.png)

# Why Tokenization is Required? <a id ="2"></a>
A sentence gets its meaning from the words in it, so analyzing those words helps us interpret the text. Once we have a list of words, we can also apply statistical tools and methods to gain more insight into the text. For example, we can use word counts and word frequencies to estimate the importance of a word in a sentence or document.
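
As a minimal sketch of the word-count idea, the snippet below counts word frequencies with the standard library's `collections.Counter`. The sample sentence is made up for illustration and is not part of the notebook's data.

```python
from collections import Counter

# Made-up sample text, used only to illustrate word counting
text = "the cat sat on the mat and the dog sat too"

# Split on whitespace and count how often each word occurs
word_counts = Counter(text.split())

# The two most frequent words and their counts
top_two = word_counts.most_common(2)
print(top_two)
```

Raw counts like these tend to favor very common words, which is one reason later sections look at stopword removal.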

# Tokenization Techniques <a id ="3"></a>
There are multiple ways to tokenize text data. We can choose a method based on the language, the library and the purpose of the modeling.

## Tokenization Using Python's Inbuilt Method <a id ="4"></a>

![NLP_Tokenization](https://raw.githubusercontent.com/satishgunjal/images/master/python_split_syntax.png)

* We can use the **split()** method to split a string into a list in which each word is an item.
* By default, split() uses whitespace as the separator, but we can change it to any other string.

### Word Tokenization

In [1]:
text = "There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."
# Split text by whitespace
tokens = text.split()
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language,', 'library', 'and', 'purpose', 'of', 'modeling.']


Observe that in the list above, words like 'language,' and 'modeling.' keep their trailing punctuation. **Python's split() method does not treat punctuation as a separate token.**
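
If punctuation should become its own token without installing any library, one possible approach is a regular expression. The pattern below is an illustrative assumption, not a full tokenizer: it does not handle contractions, hyphenated words or abbreviations specially.

```python
import re

text = "We can choose any method based on language, library and purpose of modeling."

# \w+      matches a run of letters, digits or underscores (a word)
# [^\w\s]  matches a single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Here the comma and the full stop end up as separate tokens instead of staying attached to 'language' and 'modeling'.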

### Sentence Tokenization

In [2]:
# Let's split the given text on full stops (.)
text = "Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization wont be foolproof with split() method."
text.split(". ") # Splitting on ". " (with the space) avoids an empty string at the end of the list, but the last sentence keeps its full stop.

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization wont be foolproof with split() method.']
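
One way around the single-separator limitation is `re.split`, which accepts a pattern instead of a fixed string. This is a rough sketch with a made-up sample text. It will still split wrongly after abbreviations such as "Govt.", so it is not a replacement for a trained sentence tokenizer.

```python
import re

text = ("Characters like periods, exclamation points and newlines separate sentences. "
        "But split() accepts only one separator at a time! "
        "A regular expression can match several.")

# Split on whitespace that follows '.', '!' or '?'.
# The lookbehind (?<=...) keeps the punctuation attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in sentences:
    print(sentence)
```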

## Tokenization Using NLTK <a id ="6"></a>
* The Natural Language Toolkit (NLTK) is a library written in Python for natural language processing.
* NLTK provides the functions **word_tokenize()** for word tokenization and **sent_tokenize()** for sentence tokenization.
* The command to install NLTK is shown below:
```
!pip install --user -U nltk
```
* The leading "!" tells the notebook to run the line as a command-line command rather than as Python code.

### Word Tokenization

In [None]:
!pip install --user -U nltk

In [3]:
import nltk

In [None]:
nltk.download('punkt')  # recent NLTK releases may require nltk.download('punkt_tab') instead

In [4]:
from nltk.tokenize import word_tokenize

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling."""
tokens = word_tokenize(text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'language', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


Notice that NLTK word tokenization treats punctuation marks as separate tokens. We have to account for this during text cleaning.
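
A simple way to account for punctuation tokens during cleaning is to drop every token that consists only of punctuation characters. The sketch below uses a hand-written token list (an assumption, so that it runs without NLTK) and the standard library's `string.punctuation`.

```python
import string

# Hand-written token list in the style of word_tokenize output
tokens = ['We', 'can', 'choose', 'any', 'method', ',', 'library', 'and', 'purpose', '.']

def is_punctuation(token):
    """Return True if every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

words_only = [t for t in tokens if not is_punctuation(t)]
print(words_only)
```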

### Sentence Tokenization

In [5]:
from nltk.tokenize import sent_tokenize

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization wont be foolproof with split() method."""
sent_tokenize(text)

['Characters like periods, exclamation point and newline char are used to separate the sentences.',
 'But one drawback with split() method, that we can only use one separator at a time!',
 'So sentence tokenization wont be foolproof with split() method.']

# <center> Stopwords in NLP</center>
**This section gives a friendly description of stopword removal, with Python code.**

# Introduction <a id ="1"></a>

Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.
Generally, the most common words used in a text are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at” etc.
![image-2.png](attachment:image-2.png)

## Here are a few key benefits of removing stopwords:
- Removing stopwords shrinks the dataset, which also reduces the time needed to train a model.
- Removing stopwords can improve performance, because fewer and more meaningful tokens remain. This could increase classification accuracy.
- Search engines have traditionally ignored stopwords as well, for fast and relevant retrieval of data.


In [6]:
import nltk

In [7]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

In [8]:
len(set(stopwords.words('english')))

179

In [9]:
# sample sentence
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and 
fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had 
indeed the vaguest idea where the wood and river in question were."""

# set of stop words
stop_words = set(stopwords.words('english')) 

# tokens of words  
word_tokens = word_tokenize(text) 
    
filtered_sentence = [] 
  
for w in word_tokens: 
    
    if w not in stop_words: 
        filtered_sentence.append(w) 

In [10]:
print("\nOriginal Sentence \n")
print(" ".join(word_tokens)) 


Original Sentence 

He determined to drop his litigation with the monastry , and relinguish his claims to the wood-cuting and fishery rihgts at once . He was the more ready to do this becuase the rights had become much less valuable , and he had indeed the vaguest idea where the wood and river in question were .


In [11]:
print("\nFiltered Sentence \n")
print(" ".join(filtered_sentence)) 


Filtered Sentence 

He determined drop litigation monastry , relinguish claims wood-cuting fishery rihgts . He ready becuase rights become much less valuable , indeed vaguest idea wood river question .
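
Note that "He" survives in the filtered sentence above: the stop-word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The small stop-word set here is hand-picked for illustration and is not NLTK's list.

```python
# Hand-picked stop words for illustration only (not NLTK's English list)
stop_words = {"he", "to", "his", "with", "the", "and", "was"}

word_tokens = ["He", "determined", "to", "drop", "his", "litigation", "with", "the", "monastery"]

# Case-sensitive test: "He" is not equal to "he", so it is kept
case_sensitive = [w for w in word_tokens if w not in stop_words]

# Lowercasing each token before the test removes "He" as well
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```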


# <center> Keyword extraction in NLP</center>
**This section gives a friendly description of how to extract keywords from a document, with Python code.**

## Introduction
Keyword extraction is figuring out which words and phrases in a piece of text are the most important. These keywords can be used to summarise the content of the text. A common use case is using keywords to improve search engine optimization (SEO) and make content more easily discoverable online.

## SpaCy keyword extraction
- spaCy is an all-in-one Python library for NLP tasks.
- Here we are interested in using spaCy for keyword extraction.
- We start by installing the spaCy library and then download the **en_core_web_lg** model.
- After that, we pass the article text through the NLP pipeline. It returns a processed document from which we can extract candidate keywords, such as noun phrases.


In [None]:
!pip install spacy

In [12]:
import spacy

In [13]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [14]:
# Load the Spacy model and create a new document 
nlp = spacy.load("en_core_web_lg")
text=". Samagra Shiksha an integrated scheme covering all classes from pre-primary to senior secondary and PM POSHAN scheme have been aligned with the recommendations of NEP 2020. Samagra Shiksha covers 11.68 lakh schools, around 15.62 crore students and 57.67 lakh Teachers of Govt. and Aided schools (up to senior secondary level). The major objectives of the Scheme are: (i) Support States and UTs in implementing the recommendations of the National Education Policy 2020 (NEP 2020); (ii) Support States in implementation of Right of Children to Free and Compulsory Education (RTE) Act, 2009; (iii) Focus on Early Childhood Care and Education (iv) Emphasis on Foundational Literacy and Numeracy (v) Thrust on Holistic, Integrated, Inclusive and activity based Curriculum and Pedagogy to impart 21st century skills among the students; (vi) Provision of quality education and enhancing learning outcomes of students; (vi) Bridging Social and Gender Gaps in School Education; (vii) Ensuring equity and inclusion at all levels of school education; (ix) Strengthening and up-gradation of State Councils for Educational Research and Training (SCERTs)/State Institutes of Education and District Institutes for Education and Training (DIET) as a nodal agency for teacher training; (x) Ensuring safe, secure and conducive learning environment and minimum standards in schooling provisions; and (xi) Promoting vocational education. Samagra Shiksha is providing support for implementation of major NEP recommendations such as emphasis on Foundational Literacy and Numeracy, provision for Holistic Progress Card (HPC), Capacity building of teachers (50 Hrs CPD), Bagless days and internships, support for OOSC in age group of 16- 19 years, Separate stipend for CWSN girl child, identification of CWSN and Resource Centre at block level, expansion of schooling facilities including Residential Hostels, KGBVs, and vocational education etc."
doc = nlp(text) 

In [15]:
# Use the noun_chunks property of the document to identify the noun phrases in the text 
noun_phrases = [chunk.text for chunk in doc.noun_chunks] 

In [16]:
# Compute TF-IDF weights for the individual words of the text (not for the noun phrases above).
# With a single document the IDF factor is the same for every word,
# so the ranking reduces to normalized term frequency.
from sklearn.feature_extraction.text import TfidfVectorizer 
vectorizer = TfidfVectorizer() 
tfidf = vectorizer.fit_transform([doc.text]) 

In [17]:
# Get the 30 highest-weighted words 
top_phrases = sorted(vectorizer.vocabulary_, key=lambda x: tfidf[0, vectorizer.vocabulary_[x]], reverse=True)[:30] 

In [18]:
# Print the top 30 words 
print(top_phrases)

['and', 'of', 'education', 'for', 'the', 'in', 'to', 'support', 'on', 'samagra', 'shiksha', 'scheme', 'recommendations', 'nep', '2020', 'students', 'training', 'integrated', 'all', 'senior', 'secondary', 'lakh', 'schools', 'teachers', 'up', 'level', 'major', 'states', 'implementation', 'emphasis']
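
The list above is led by words such as 'and' and 'of' because TF-IDF was fitted on a single document, so the inverse document frequency cannot down-weight common words. The sketch below computes TF-IDF by hand over a tiny made-up corpus to show the effect of having several documents. It uses the plain formula tf × log(N / df), which is an assumption for illustration and differs from scikit-learn's smoothed variant.

```python
import math
from collections import Counter

# Tiny made-up corpus, used only for illustration
docs = [
    "the scheme supports education in schools",
    "the policy supports teachers in the states",
    "the budget funds vocational training",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf(doc):
    """Plain TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = Counter(doc)
    return {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}

scores = tfidf(tokenized[0])
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)
```

Because "the" occurs in every document, its IDF is log(1) = 0 and it drops to the bottom of the ranking, while words unique to the first document rise to the top.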


# <center> Lemmatization and Stemming</center>
Stemming and Lemmatization are text preprocessing methods within the field of NLP that are used to standardize text, words, and documents for further analysis. Both in stemming and in lemmatization, we attempt to reduce a given word to its “root”. The root word is called a “stem” in the stemming process, and it is called a “lemma” in the lemmatization process.


![image.png](attachment:image.png)

## Stemming
- PorterStemmer
- SnowballStemmer
- LancasterStemmer

### PorterStemmer
- **Porter's Stemmer** is one of the oldest and most widely used stemming algorithms. It applies five sequential phases of rules that strip common suffixes from words.
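
To give a feel for rule-based suffix stripping, here is a deliberately simplified toy stemmer. It is not Porter's algorithm: the suffix list and the minimum stem length are assumptions chosen for illustration, and the real algorithm has many more rules and conditions.

```python
def toy_stem(word, suffixes=("ing", "ly", "ed", "er", "s")):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        # The length check stops short words like "is" from being cut down to "i"
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["pythoning", "pythonly", "pythoner", "python", "is"]]
print(stems)
```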

In [19]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem("stemming"))

stem


In [20]:
sentence = "It is good to be a pythoner who is pythonly pythoning with python"
words = nltk.word_tokenize(sentence)
ps = PorterStemmer()  # create the stemmer once and reuse it for every word
for word in words:
    print(ps.stem(word))

it
is
good
to
be
a
python
who
is
pythonli
python
with
python


### SnowballStemmer
- **Snowball stemmer** (for English, also known as Porter2) is an updated version of Porter’s Stemmer with revised rules. It is used in the same way as Porter’s Stemmer, except that the language must be specified.

In [21]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language = "english")
word = "civilization"
stemmer.stem(word)

'civil'

In [22]:
sentence = "It is good to be a pythoner who is pythonly pythoning with python"
words = nltk.word_tokenize(sentence)
stemmer = SnowballStemmer(language = "english")  # create the stemmer once and reuse it
for word in words:
    print(stemmer.stem(word))

it
is
good
to
be
a
python
who
is
python
python
with
python


### Lancaster Stemmer
- **Lancaster Stemmer** rules are more aggressive than those of Porter and Snowball, making it one of the most aggressive stemmers. As a rule of thumb, Lancaster’s Stemmer tries to reduce each word to the shortest stem possible.

In [23]:
from nltk.stem import SnowballStemmer, LancasterStemmer
sentence = "It is good to be a pythoner who is pythonly pythoning with python"
words = nltk.word_tokenize(sentence)
lanc = LancasterStemmer()  # create the stemmer once and reuse it
for word in words:
    print(lanc.stem(word))


it
is
good
to
be
a
python
who
is
python
python
with
python


In [24]:
text = "Lemmatization is a type of normalization. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Lemmatizers need a lot more data on the structure of a language, which makes the creation of lemmatizers a harder process than making a stemming algorithm."

In [26]:
tokenized_eu = word_tokenize(text)
# Total characters before stemming. Each ratio below is stemmed length / original length,
# so a lower percentage means a more aggressive stemmer.
original_len = len(''.join(tokenized_eu))

ps = PorterStemmer()
porter_eu = [ps.stem(word) for word in tokenized_eu]
print(f" PorterStemmer: {100*round(len(''.join(porter_eu))/original_len,3)}%")

snowball = SnowballStemmer(language='english')
snowball_eu = [snowball.stem(word) for word in tokenized_eu]
print(f" SnowballStemmer: {100*round(len(''.join(snowball_eu))/original_len,3)}%")

lanc = LancasterStemmer()
lancaster_eu = [lanc.stem(word) for word in tokenized_eu]
print(f" LancasterStemmer: {100*round(len(''.join(lancaster_eu))/original_len,3)}%")

 PorterStemmer: 82.0%
 SnowballStemmer: 81.5%
 LancasterStemmer: 68.0%


As expected, the Lancaster Stemmer retains the least of the original text: its output is about 68% of the original length, compared with roughly 82% for Porter and Snowball.

## Lemmatization
Lemmatization is a type of normalization. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Lemmatizers need a lot more data on the structure of a language, which makes the creation of lemmatizers a harder process than making a stemming algorithm.

In [27]:
from nltk.stem import WordNetLemmatizer

In [28]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\AbdulAziz\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(f" better\n Stemming: {porter.stem('better')}\n Lemmatization: { lemmatizer.lemmatize('better', pos ='a')}" )

 better
 Stemming: better
 Lemmatization: good


In [30]:
sentence = "There are mistakes"
print(f'Sentence: {sentence}')

word_list = nltk.word_tokenize(sentence)
print(f'word_list: {word_list}')

lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(f'Lemmatization: {lemmatized_output}')

Sentence: There are mistakes
word_list: ['There', 'are', 'mistakes']
Lemmatization: There are mistake


### Lemmatization vs stemming
- Stemming is faster than lemmatization because it chops off word endings regardless of context, whereas lemmatization depends on context such as the part of speech.
- Stemming is a rule-based approach, whereas lemmatization is a dictionary-based approach that returns canonical forms.
- Lemmatization is generally more accurate than stemming, because it returns dictionary forms rather than truncated stems.
- Lemmatization is preferred for context analysis, whereas stemming is recommended when the context is not important.
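
The dictionary-based, context-dependent nature of lemmatization can be sketched with a toy lookup table keyed on the word and its part of speech. The table and the function `toy_lemmatize` are hypothetical and exist only for this illustration. Real lemmatizers such as NLTK's `WordNetLemmatizer` rely on a much larger lexical resource.

```python
# Hypothetical miniature lemma dictionary, keyed on (word, part of speech)
LEMMAS = {
    ("better", "a"): "good",       # adjective
    ("are", "v"): "be",            # verb
    ("mistakes", "n"): "mistake",  # noun
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("are"))          # treated as a noun by default, so unchanged
print(toy_lemmatize("are", "v"))     # treated as a verb
print(toy_lemmatize("better", "a"))  # treated as an adjective
```

This mirrors the earlier "There are mistakes" output, where "are" stayed unchanged: `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument says otherwise.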

# <center> Bag-of-words</center>
Bag-of-words is a technique that does not consider the position of a word in a document. The idea is to count the number of times each word appears in each of the documents.

> Consider the following four rows and count the number of times each word appears in each row.

- This is the first document.
- This document is the second document.
- And this is the third one.
- Is this the first document?

In [31]:
text = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']
print(text)

['This is the first document.', 'This document is the second document.', 'And this is the third one.', 'Is this the first document?']


In [32]:
import pandas as pd

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)
columns = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

df =  pd.DataFrame(X.toarray(), columns=columns, index = text)
df

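|  | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| This is the first document. | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| This document is the second document. | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| And this is the third one. | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| Is this the first document? | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |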


>The matrix calculated on this simple example of four sentences can be generalized to many documents in a corpus. Each document is a row, and each token is a column. In our case, a token is a word. Such a matrix is called a document-term matrix. It describes the frequency of the terms that occur in a collection of documents and can be used as input to a machine learning classifier.

>With a very large amount of data, each vector can consist of thousands or millions of elements, most of which are zero. Vectors with many zeros are called sparse vectors. Stored naively as dense arrays they require a lot of memory and computation, which is why libraries such as scikit-learn keep them in a sparse format.
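
As a small illustration of sparsity, the snippet below copies the counts from the document-term matrix above into a plain list of lists and measures the share of zeros. Even with four short sentences, a sizeable part of the matrix is empty.

```python
# Counts copied from the four-document example above
# (columns: and, document, first, is, one, second, the, third, this)
matrix = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

total = sum(len(row) for row in matrix)                     # number of cells
zeros = sum(value == 0 for row in matrix for value in row)  # number of empty cells
sparsity = zeros / total

print(f"{zeros} of {total} cells are zero ({sparsity:.1%})")
```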

>However, we can reduce the number of known words. We can use the following techniques:

- Ignoring the case of words
- Ignoring punctuation
- Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc.
- Reduction of words to their basic forms (lemmatization and stemming)
- Correction of misspelled words
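
A rough sketch of the first three techniques on the four example sentences, using only the standard library. The stop-word set is a small hand-picked assumption, not a standard list. Lemmatization and spelling correction are left out to keep the example short.

```python
import string

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

# Hand-picked stop words for illustration only
stop_words = {"this", "is", "the", "and"}

# Vocabulary from a plain whitespace split: case and punctuation create extra entries
raw_vocab = {w for d in docs for w in d.split()}

def normalize(doc):
    """Lowercase, strip ASCII punctuation, split on whitespace and drop stop words."""
    cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in stop_words]

reduced_vocab = {w for d in docs for w in normalize(d)}

print(len(raw_vocab), sorted(raw_vocab))
print(len(reduced_vocab), sorted(reduced_vocab))
```

In this small example the vocabulary shrinks from 13 entries to 5, which directly reduces the number of columns in the document-term matrix.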

# Good Job!