# **Explanation of Individual Stages**

## **NLP packages**
- NLTK package (more programmer-oriented; suits people comfortable writing lower-level code)
- spaCy package (friendly for data science and analysis; does not require deep programming experience)

# **spaCy Basics**

In [None]:
import spacy

## **Download the Language Model**

In [None]:
! python3 -m spacy download en_core_web_sm 

### **Check that spaCy is working**

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')
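If `spacy.load` raises an error here, the model package is usually missing. Below is a small sketch of a pre-check. It assumes the model was installed as an ordinary Python package, which is how the download command above installs it. The helper name `model_installed` is made up for illustration.

```python
import importlib.util

def model_installed(name="en_core_web_sm"):
    """Return True if a top-level package with this name can be imported."""
    # find_spec returns None when no package of that name is installed.
    return importlib.util.find_spec(name) is not None

if not model_installed():
    print("Model missing: run `python3 -m spacy download en_core_web_sm`")
```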

## **Work with spaCy**

In [None]:
data = "Tesla is looking at buying U.S. startups for $6 million"

# Pass the text through the spaCy pipeline; the result is a Doc of tokens
output = nlp(data)

In [None]:
for element in output:
  print(element.text, element.pos_)

Tesla PROPN
is AUX
looking VERB
at ADP
buying VERB
U.S. PROPN
startups NOUN
for ADP
$ SYM
6 NUM
million NUM
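To summarise the tags instead of reading them one by one, the same loop can feed a counter. This is a sketch only. The helper name `pos_counts` is an assumption, and it relies on nothing more than each token having a `pos_` attribute, as in the cell above.

```python
from collections import Counter

def pos_counts(doc):
    """Count how often each coarse part-of-speech tag occurs in a doc."""
    return Counter(token.pos_ for token in doc)

# With the pipeline loaded above:
# pos_counts(nlp("Tesla is looking at buying U.S. startups for $6 million"))
```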


# **Tokenization**

### **Example-1**

In [None]:
stringData = "We're moving to Bangalore!"
stringData

"We're moving to Bangalore!"

In [None]:
output = nlp(stringData)

In [None]:
for token in output:
  print(token.text)

We
're
moving
to
Bangalore
!


### **Example-2**

In [None]:
output2 = nlp(u"We are here to help you! Send your concerns @prashant@teacher.com or visit us @ http://teacher.com/support")

In [None]:
for token in output2:
  print(token.text, token.pos_)

We PRON
are AUX
here ADV
to PART
help VERB
you PRON
! PUNCT
Send VERB
your DET
concerns NOUN
@prashant@teacher.com PUNCT
or CCONJ
visit VERB
us PRON
@ ADP
http://teacher.com/support NOUN


In [None]:
output3 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for token in output3:
  print(token.text, token.pos_)

Let VERB
's PRON
visit VERB
St. PROPN
Louis PROPN
in ADP
the DET
U.S. PROPN
next ADJ
year NOUN
. PUNCT
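The outputs above show three behaviours: contractions are split (`We` / `'re`), trailing punctuation becomes its own token, and abbreviations such as `St.` and `U.S.` stay whole. Below is a toy sketch of those three rules in plain Python. It is not spaCy's tokenizer. The exception list, the suffix pattern and the function name are assumptions chosen to reproduce the examples in this section.

```python
import re

# Hypothetical exception list: abbreviations that keep their final period.
EXCEPTIONS = {"U.S.", "St."}
# Contraction endings that are split off as separate tokens.
CONTRACTION = re.compile(r"^(.+?)(n't|'re|'s|'m|'ll|'ve|'d)$", re.IGNORECASE)
TRAILING_PUNCT = ".,!?;:"

def tokenize(text):
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel punctuation off the end unless the chunk is a known abbreviation.
        while chunk and chunk not in EXCEPTIONS and chunk[-1] in TRAILING_PUNCT:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        match = CONTRACTION.match(chunk)
        if match:
            tokens.extend(match.groups())
        elif chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

print(tokenize("We're moving to Bangalore!"))
# → ['We', "'re", 'moving', 'to', 'Bangalore', '!']
```

Real tokenizers handle many more cases (URLs, e-mail addresses, currency symbols), which is why the spaCy output for Example-2 cannot be reproduced by rules this simple.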


# **Named Entity Recognition (NER)**

In [None]:
output4 = nlp(u"Apple to build a Hong Kong and Mumbai factory for $10 million with Simplilearn and Microsoft")
for token in output4:
  print(token.text)

Apple
to
build
a
Hong
Kong
and
Mumbai
factory
for
$
10
million
with
Simplilearn
and
Microsoft


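## **For NER, read the ``.ents`` attribute of the document**
- Hong Kong is tagged GPE (geopolitical entity)
- Apple is tagged ORG (organisation)
- Words such as "to", "a" and "and" belong to no entity, so they do not appear in ``.ents``
- $10 million is tagged MONEY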

In [None]:
# Each item in .ents is an entity span, which may cover several tokens
for ent in output4.ents:
  print(ent.text, ent.label_, spacy.explain(ent.label_))

Apple ORG Companies, agencies, institutions, etc.
Hong Kong GPE Countries, cities, states
Mumbai GPE Countries, cities, states
$10 million MONEY Monetary values, including unit
Simplilearn PRODUCT Objects, vehicles, foods, etc. (not services)
Microsoft ORG Companies, agencies, institutions, etc.
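When a text contains many entities, it can help to collect them by label rather than print them in order. This is a sketch only. The function name `entities_by_label` is an assumption, and it uses only the `.ents`, `.text` and `.label_` attributes shown above.

```python
from collections import defaultdict

def entities_by_label(doc):
    """Group entity texts under their label, keeping document order."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# With the document created above:
# entities_by_label(output4)
```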


## **Exceptions: entities the model gets wrong**

In [None]:
output5 = nlp(u"We eat Apple")
for ent in output5.ents:
  print(ent.text, ent.label_, spacy.explain(ent.label_))

Apple ORG Companies, agencies, institutions, etc.


In [None]:
output6 = nlp(u"Elon Musk is the founder of Tesla and Prashant Nair is the trainer of NLP!")
for ent in output6.ents:
  print(ent.text, ent.label_, spacy.explain(ent.label_))

Elon Musk PERSON People, including fictional
Tesla and Prashant Nair ORG Companies, agencies, institutions, etc.
NLP ORG Companies, agencies, institutions, etc.


### **``Rs.10 crore`` is not recognized as MONEY: the model does not know Indian currency terms**

In [None]:
output7 = nlp(u"Chaiwala Inc has the turnover of Rs.10 crore")
for ent in output7.ents:
  print(ent.text, ent.label_, spacy.explain(ent.label_))

Chaiwala Inc ORG Companies, agencies, institutions, etc.
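One possible workaround for the missed amount is a plain regular expression run alongside the model. The sketch below does not use spaCy at all. The pattern and the function name `find_rupee_amounts` are assumptions, and they cover only the `Rs.` prefix with an optional `crore` or `lakh` unit.

```python
import re

# Matches e.g. "Rs.10 crore", "Rs 2,500", "Rs. 1.5 lakh" (illustrative only).
RUPEE_AMOUNT = re.compile(
    r"\bRs\.?\s*\d+(?:[.,]\d+)*(?:\s*(?:crore|lakh))?",
    re.IGNORECASE,
)

def find_rupee_amounts(text):
    """Return every substring that looks like a rupee amount."""
    return [m.group(0) for m in RUPEE_AMOUNT.finditer(text)]

print(find_rupee_amounts("Chaiwala Inc has the turnover of Rs.10 crore"))
# → ['Rs.10 crore']
```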


# **Stemming**

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer

In [None]:
stemObject = PorterStemmer()

In [None]:
words = ["branching", "branched", "branches"]
for word in words:
  print(f"{word}, {stemObject.stem(word)}")

branching, branch
branched, branch
branches, branch


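## **The following is a defect for NLU, but not for machine learning**

Stems such as ``cach`` and ``easi`` are not real words, which hurts tasks that need the meaning of a word. As features for a machine learning model they are fine as long as related forms share a stem, as ``caching``, ``cached`` and ``caches`` do.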

In [None]:
words1 = ["caching", "cached", "caches"]
for word in words1:
  print(f"{word}, {stemObject.stem(word)}")

caching, cach
cached, cach
caches, cach


In [None]:
words1 = ["easy", "easily"]
for word in words1:
  print(f"{word}, {stemObject.stem(word)}")

easy, easi
easily, easili


In [None]:
words1 = ["running", "runner", "run", "ran"]
for word in words1:
  print(f"{word}, {stemObject.stem(word)}")

running, run
runner, runner
run, run
ran, ran
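The outputs above follow from how a stemmer works: it strips suffixes by rule and never consults a dictionary, so `ran` is left alone and `caches` becomes `cach`. Below is a much-simplified sketch of that idea. It is not the Porter algorithm. The suffix list, the length check and the name `naive_stem` are assumptions for illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # "running" -> "runn" -> "run"; l, s and z are left doubled.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for word in ["caching", "cached", "caches", "running", "ran"]:
    print(f"{word}, {naive_stem(word)}")
```

Because no rule connects `ran` to `run`, irregular forms need a different technique.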


# **Lemmatization**
- spaCy's lemmatization gives better results than stemming: it returns real words and maps irregular forms such as "ran" to "run"

In [None]:
output8 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for element in output8:
  print(f"{element.text}, {element.lemma_}")

I, -PRON-
am, be
a, a
runner, runner
running, run
in, in
a, a
race, race
because, because
I, -PRON-
love, love
to, to
run, run
since, since
I, -PRON-
ran, run
today, today
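What separates lemmatization from stemming is that it draws on knowledge of the vocabulary rather than suffix rules alone. The sketch below shows only the dictionary-lookup part of that idea. spaCy's own lemmatizer is more involved. The table contents and the name `lookup_lemma` are assumptions.

```python
# Tiny hand-written table; a real lemmatizer ships a far larger one.
LEMMA_TABLE = {"am": "be", "ran": "run", "running": "run"}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)

for word in ["am", "running", "ran", "runner"]:
    print(f"{word}, {lookup_lemma(word)}")
```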


# **Stopwords**

In [None]:
print(nlp.Defaults.stop_words)

{'we', "'m", 'onto', 'she', 'through', 'them', 'towards', 'because', 'can', 'hence', 'thereby', 'might', 'again', 'in', 'whence', 'since', 'whereafter', 'became', 'using', 'toward', 'beyond', 'is', 'though', '’re', 'around', 'yourselves', '‘re', 'fifty', 'it', 'show', 'under', 'behind', 'throughout', 'nowhere', 'thence', 'perhaps', 'enough', 'among', 'already', 'each', 'such', "'re", 'thereupon', 'and', 'wherever', 'into', 'quite', 'others', 'either', 'you', 'once', 'more', 'sometime', 'many', 'one', 'six', 'could', 'nine', 'front', 'amongst', 'whereby', 'unless', 'always', 'herein', 'for', 'ours', 'may', 'due', 'cannot', 'these', 'becomes', 'call', 'noone', 'those', 'above', '’m', 'had', 'anyhow', 'go', 'so', 're', 'nothing', 'ever', 'someone', 'take', 'same', 'up', 'beforehand', 'doing', 'its', 'by', 'if', 'somewhere', 'will', 'between', 'please', 'now', 'back', 'well', 'twelve', 'hereupon', 'therein', 'across', 'almost', "n't", 'least', 'nevertheless', '‘m', 'fifteen', '’d', 'much',
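The usual reason to have a stop-word list is to filter those words out of a text. Here is a minimal sketch, assuming a document produced by `nlp(...)` as in the earlier cells. The function name `without_stop_words` is made up for illustration, and it relies only on the `text` and `is_stop` attributes of each token.

```python
def without_stop_words(doc):
    """Return the text of every token that is not flagged as a stop word."""
    return [token.text for token in doc if not token.is_stop]

# With the pipeline loaded above:
# without_stop_words(nlp("We are here to help you"))
```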

### Adding ``mystery`` as a stopword.

In [None]:
nlp.vocab['mystery'].is_stop

True

In [None]:
len(nlp.Defaults.stop_words)

326

In [None]:
nlp.Defaults.stop_words.add("mystery")
len(nlp.Defaults.stop_words)

327

In [None]:
nlp.vocab['mystery'].is_stop

True

In [None]:
nlp.vocab['mystery'].is_stop = True
nlp.vocab['mystery'].is_stop

True

## **Add ``btw`` as a stopword**

In [None]:
nlp.vocab['btw'].is_stop

False

In [None]:
nlp.Defaults.stop_words.add("btw")

In [None]:
nlp.vocab['btw'].is_stop = True
nlp.vocab['btw'].is_stop

True

## **Removing Stopwords**

In [None]:
# Removing the word from the default set does not reset the flag already stored in the vocab
nlp.Defaults.stop_words.remove("btw")


In [None]:
nlp.vocab['btw'].is_stop

True

In [None]:
nlp.vocab['btw'].is_stop = False
nlp.vocab['btw'].is_stop

False
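As the cells above show, the default stop-word set and the `is_stop` flag in the vocabulary are updated separately, so changing one leaves the other stale. A small helper can keep the two in step. This is a sketch only. The name `set_stop_word` is an assumption, and it uses nothing beyond the two attributes already used in this section.

```python
def set_stop_word(nlp, word, is_stop=True):
    """Update the default stop-word set and the vocab flag together."""
    if is_stop:
        nlp.Defaults.stop_words.add(word)
    else:
        # discard() does not raise if the word was never in the set.
        nlp.Defaults.stop_words.discard(word)
    nlp.vocab[word].is_stop = is_stop

# With the pipeline loaded above:
# set_stop_word(nlp, "btw")          # add
# set_stop_word(nlp, "btw", False)   # remove
```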