In this notebook we will demonstrate how to perform tokenization, stemming, lemmatization, and POS tagging using libraries such as [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/).

In [None]:
# Run this cell to install only the requirements of this notebook

# ===========================

!pip install numpy==1.19.5
!pip install nltk==3.2.5
!pip install spacy==2.2.4

# ===========================


In [None]:
# This is the corpus we will work on
corpus_original = "Need to finalize the demo corpus which will be used for this notebook and it should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"
corpus = "Need to finalize the demo corpus which will be used for this notebook & should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"

In [None]:
#lower case the corpus
corpus = corpus.lower()
print(corpus)

In [None]:
#removing digits in the corpus
import re
corpus = re.sub(r'\d+','', corpus)
print(corpus)

In [None]:
# remove punctuation
import string
corpus = corpus.translate(str.maketrans('', '', string.punctuation))
print(corpus)

In [None]:
# collapse repeated whitespace and strip leading/trailing whitespace
corpus = ' '.join(corpus.split())
corpus
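
The cleaning steps above can be collected into one helper. This is a minimal sketch using only the standard library; the function name `preprocess` is an assumption made for illustration and does not appear elsewhere in this notebook.

```python
import re
import string


def preprocess(text: str) -> str:
    """Lowercase, drop digits and punctuation, and collapse whitespace."""
    text = text.lower()
    # Remove every run of digits
    text = re.sub(r"\d+", "", text)
    # Delete all ASCII punctuation characters (this includes '&' and '!')
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return " ".join(text.split())


sample = "This notebook has been run 4 times !!"
print(preprocess(sample))
```

Keeping the steps in one function makes it easy to apply the same cleaning to any new string without re-running the cells one by one.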

In [None]:
!python -m spacy download en_core_web_sm

### Tokenizing the text

In [None]:
#corpus = "Tesla is looking at buying U.S. startup for $6 million"
corpus = "Tesla isn't   looking into startups anymore."

In [None]:
from pprint import pprint
##NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
stop_words_nltk = set(stopwords.words('english'))

tokenized_corpus_nltk = word_tokenize(corpus)
print("\nNLTK\nTokenized corpus:",tokenized_corpus_nltk)
tokenized_corpus_without_stopwords = [i for i in tokenized_corpus_nltk if i not in stop_words_nltk]
print("Tokenized corpus without stopwords:",tokenized_corpus_without_stopwords)


##SPACY
import spacy

# Load the English language model
spacy_model = spacy.load('en_core_web_sm')

stopwords_spacy = spacy_model.Defaults.stop_words
print("\nSpacy:")
# Tokenize with spaCy itself (not NLTK's word_tokenize); skip whitespace-only tokens
tokenized_corpus_spacy = [token.text for token in spacy_model(corpus) if not token.is_space]
print("Tokenized Corpus:",tokenized_corpus_spacy)
tokens_without_sw = [word for word in tokenized_corpus_spacy if word not in stopwords_spacy]

print("Tokenized corpus without stopwords",tokens_without_sw)


print("Difference between NLTK and spaCy output:\n",
      set(tokenized_corpus_without_stopwords)-set(tokens_without_sw))

Notice how the output after stop word removal differs between NLTK and spaCy: the two libraries ship different stop word lists.
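
To see why the list matters, here is a small sketch that filters the same tokens with two different stop word lists. Both lists and the token list are made up for illustration; they are not the real NLTK or spaCy data, and `remove_stopwords` is a hypothetical helper.

```python
# Hand-written tokens, roughly what a tokenizer might return for the corpus above
tokens = ["Tesla", "is", "n't", "looking", "into", "startups", "anymore", "."]

# Two made-up stop word lists (assumptions, NOT the real NLTK or spaCy lists)
stop_list_a = {"is", "into"}
stop_list_b = {"is", "into", "n't", "anymore"}


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


kept_a = remove_stopwords(tokens, stop_list_a)
kept_b = remove_stopwords(tokens, stop_list_b)

# Tokens that survive list A but are removed by the larger list B
print(sorted(set(kept_a) - set(kept_b)))
```

The set difference is the same comparison used in the cell above: any token it prints was treated as a stop word by one list but not the other.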

### Stemming

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer= PorterStemmer()

print("Before Stemming:")
print(corpus)

print("After Stemming:")
for word in tokenized_corpus_nltk:
    print(stemmer.stem(word),end=" ")
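
Stemmers work by applying suffix rules rather than by looking words up in a dictionary. The sketch below is a deliberately crude suffix stripper that shows the idea; it is not the Porter algorithm, and `naive_stem` and its suffix list are assumptions made for illustration.

```python
def naive_stem(word: str) -> str:
    """Strip one common suffix; a crude stand-in, not the Porter algorithm."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["looking", "startups", "anymore", "tesla"]
print([naive_stem(w) for w in words])
```

Because the rules only look at the surface form of a word, a real stemmer can also return strings that are not dictionary words.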

### Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
lemmatizer=WordNetLemmatizer()

for word in tokenized_corpus_nltk:
    print(lemmatizer.lemmatize(word),end=" ")
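
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so verb forms can come back unchanged. The sketch below mimics that behaviour with a tiny lookup table; the table and the `lookup_lemma` helper are hypothetical and stand in for WordNet only for illustration.

```python
# Hypothetical lookup table keyed by (word, part of speech); not WordNet data
LEMMA_TABLE = {
    ("looking", "v"): "look",
    ("startups", "n"): "startup",
    ("is", "v"): "be",
}


def lookup_lemma(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)


print(lookup_lemma("looking"))           # treated as a noun: no entry, so unchanged
print(lookup_lemma("looking", pos="v"))  # treated as a verb: maps to "look"
```

With the real lemmatizer, the part of speech could come from a tagger such as the one in the POS Tagging section.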

### POS Tagging

In [9]:
#POS tagging using spacy
import spacy
import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt")
nlp=spacy.load("en_core_web_sm")
print("POS Tagging using spacy:")
doc = nlp("my name is parth and i am from india")
# Token and Tag
for token in doc:
    print(token,":", token.pos_)

#pos tagging using nltk
nltk.download('averaged_perceptron_tagger')
print("POS Tagging using NLTK:")
# word_tokenize expects a string, so pass the Doc's text rather than the Doc itself
print(nltk.pos_tag(word_tokenize(doc.text)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


POS Tagging using spacy:
my : PRON
name : NOUN
is : AUX
parth : ADJ
and : CCONJ
i : PRON
am : AUX
from : ADP
india : PROPN
POS Tagging using NLTK:


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!