# NLTK

NLTK (Natural Language Toolkit) is a Python library for working with human language data. A corpus is a large collection of texts used for natural language processing (NLP) and linguistic analysis.

### Natural Language Processing

1. **Segmentation** : Splitting the text into sentences at sentence-ending punctuation such as "." and "?".
ex : "I like NLP." "I like programming."

2. **Tokenization** : Separating words from the sentences.
ex : "I" "like" "NLP"

3. **Stop words removal** : Removing very common words that carry little meaning on their own.
ex : is, a, the

4. **Stemming** : Cutting off suffixes to get a root form, which need not be a dictionary word.
ex : playing ---> play

5. **Lemmatization** : Uses vocabulary and grammar to return the proper dictionary base form.
ex : "better" ---> "good"

6. **Parts of Speech (POS) tagging** : Each word is assigned a part of speech.
ex : noun, verb, adverb

7. **Named Entity Recognition (NER) tagging** : NER tagging identifies and categorises proper nouns and other specific entities in a text.
ex : "John works at Google"
"John" ---> Person
"Google" ---> Organisation

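The first three steps can be illustrated without any library. The following is a toy sketch using only Python's `re` module. The regular expressions and the tiny `STOP_WORDS` set are assumptions made for illustration and are far cruder than the tokenizers and stop-word list that NLTK provides.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much longer).
STOP_WORDS = {"i", "is", "a", "the"}

def segment(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sentences = segment("I like NLP. I like programming.")
tokens = tokenize(sentences[0])
print(sentences)                  # ['I like NLP.', 'I like programming.']
print(tokens)                     # ['I', 'like', 'NLP', '.']
print(remove_stop_words(tokens))  # ['like', 'NLP', '.']
```

A rule this simple would split wrongly after abbreviations such as "Dr.", which is one reason to prefer a trained sentence tokenizer in practice.
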
In [1]:
import nltk
from nltk.corpus import stopwords

In [2]:
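# Uncomment on first run to download the Punkt tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead)
#nltk.download('punkt')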

In [3]:
text = "John is a good guy, He works at Google. John is a software engineer. He lives in New York. He is travelling to India Next Week. He earns 50000 Dollars. On first of this month he is travelling to Eiffel Tower. He is 1, one in his Organization. He is a good guy."

Sentence Tokenisation

In [4]:
sentences = nltk.sent_tokenize(text)
# Use a separate loop variable so the `sentences` list is not overwritten
for i, sentence in enumerate(sentences):
    print(f"{i+1}: {sentence}")

1: John is a good guy, He works at Google.
2: John is a software engineer.
3: He lives in New York.
4: He is travelling to India Next Week.
5: He earns 50000 Dollars.
6: On first of this month he is travelling to Eiffel Tower.
7: He is 1, one in his Organization.
8: He is a good guy.


Word Tokenisation

In [5]:
words = nltk.word_tokenize(text)
print(words)

['John', 'is', 'a', 'good', 'guy', ',', 'He', 'works', 'at', 'Google', '.', 'John', 'is', 'a', 'software', 'engineer', '.', 'He', 'lives', 'in', 'New', 'York', '.', 'He', 'is', 'travelling', 'to', 'India', 'Next', 'Week', '.', 'He', 'earns', '50000', 'Dollars', '.', 'On', 'first', 'of', 'this', 'month', 'he', 'is', 'travelling', 'to', 'Eiffel', 'Tower', '.', 'He', 'is', '1', ',', 'one', 'in', 'his', 'Organization', '.', 'He', 'is', 'a', 'good', 'guy', '.']


Removing stop words such as "is", "a" and "the"

In [6]:
#nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)

['John', 'good', 'guy', ',', 'works', 'Google', '.', 'John', 'software', 'engineer', '.', 'lives', 'New', 'York', '.', 'travelling', 'India', 'Next', 'Week', '.', 'earns', '50000', 'Dollars', '.', 'first', 'month', 'travelling', 'Eiffel', 'Tower', '.', '1', ',', 'one', 'Organization', '.', 'good', 'guy', '.']
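
The filtered list above still contains punctuation tokens such as ',' and '.', because they are not stop words. One possible follow-up, sketched here with only the standard library, is to drop tokens made up entirely of punctuation. The list is copied from the output above, and `no_punct` is a name introduced for illustration.

```python
import string

# Copied from the stop-word removal output above.
filtered_words = ['John', 'good', 'guy', ',', 'works', 'Google', '.', 'John',
                  'software', 'engineer', '.', 'lives', 'New', 'York', '.',
                  'travelling', 'India', 'Next', 'Week', '.', 'earns', '50000',
                  'Dollars', '.', 'first', 'month', 'travelling', 'Eiffel',
                  'Tower', '.', '1', ',', 'one', 'Organization', '.', 'good',
                  'guy', '.']

# Keep a token unless every character in it is ASCII punctuation.
no_punct = [w for w in filtered_words
            if not all(ch in string.punctuation for ch in w)]
print(no_punct)
```

Numbers such as '50000' and '1' are kept by this filter, since they may matter for later steps like entity recognition.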


Lemmatizing

In [7]:
#nltk.download('wordnet')
#nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
lemm_words = [lemm.lemmatize(word) for word in filtered_words]
print(lemm_words)

['John', 'good', 'guy', ',', 'work', 'Google', '.', 'John', 'software', 'engineer', '.', 'life', 'New', 'York', '.', 'travelling', 'India', 'Next', 'Week', '.', 'earns', '50000', 'Dollars', '.', 'first', 'month', 'travelling', 'Eiffel', 'Tower', '.', '1', ',', 'one', 'Organization', '.', 'good', 'guy', '.']


Stemming

In [8]:
from nltk.stem import PorterStemmer
stem = PorterStemmer()
stemmed_words = [stem.stem(word) for word in lemm_words]
print(stemmed_words)

['john', 'good', 'guy', ',', 'work', 'googl', '.', 'john', 'softwar', 'engin', '.', 'life', 'new', 'york', '.', 'travel', 'india', 'next', 'week', '.', 'earn', '50000', 'dollar', '.', 'first', 'month', 'travel', 'eiffel', 'tower', '.', '1', ',', 'one', 'organ', '.', 'good', 'guy', '.']


In [9]:
print(words)
print(filtered_words)
print(lemm_words)
print(stemmed_words)

['John', 'is', 'a', 'good', 'guy', ',', 'He', 'works', 'at', 'Google', '.', 'John', 'is', 'a', 'software', 'engineer', '.', 'He', 'lives', 'in', 'New', 'York', '.', 'He', 'is', 'travelling', 'to', 'India', 'Next', 'Week', '.', 'He', 'earns', '50000', 'Dollars', '.', 'On', 'first', 'of', 'this', 'month', 'he', 'is', 'travelling', 'to', 'Eiffel', 'Tower', '.', 'He', 'is', '1', ',', 'one', 'in', 'his', 'Organization', '.', 'He', 'is', 'a', 'good', 'guy', '.']
['John', 'good', 'guy', ',', 'works', 'Google', '.', 'John', 'software', 'engineer', '.', 'lives', 'New', 'York', '.', 'travelling', 'India', 'Next', 'Week', '.', 'earns', '50000', 'Dollars', '.', 'first', 'month', 'travelling', 'Eiffel', 'Tower', '.', '1', ',', 'one', 'Organization', '.', 'good', 'guy', '.']
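['John', 'good', 'guy', ',', 'work', 'Google', '.', 'John', 'software', 'engineer', '.', 'life', 'New', 'York', '.', 'travelling', 'India', 'Next', 'Week', '.', 'earns', '50000', 'Dollars', '.', 'first', 'month', 'travelling', 'Eiffel', 'Tower', '.', '1', ',', 'one', 'Organization', '.', 'good', 'guy', '.']
['john', 'good', 'guy', ',', 'work', 'googl', '.', 'john', 'softwar', 'engin', '.', 'life', 'new', 'york', '.', 'travel', 'india', 'next', 'week', '.', 'earn', '50000', 'dollar', '.', 'first', 'month', 'travel', 'eiffel', 'tower', '.', '1', ',', 'one', 'organ', '.', 'good', 'guy', '.']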

# Parts of Speech Tags
![image.png](attachment:image.png)

In [10]:
# Part-of-speech tagging (needs the 'averaged_perceptron_tagger' resource from nltk.download)
pos_tags = nltk.pos_tag(words)
print(pos_tags)

[('John', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('guy', 'NN'), (',', ','), ('He', 'PRP'), ('works', 'VBZ'), ('at', 'IN'), ('Google', 'NNP'), ('.', '.'), ('John', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('software', 'NN'), ('engineer', 'NN'), ('.', '.'), ('He', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('.', '.'), ('He', 'PRP'), ('is', 'VBZ'), ('travelling', 'VBG'), ('to', 'TO'), ('India', 'NNP'), ('Next', 'NNP'), ('Week', 'NNP'), ('.', '.'), ('He', 'PRP'), ('earns', 'VBZ'), ('50000', 'CD'), ('Dollars', 'NNP'), ('.', '.'), ('On', 'IN'), ('first', 'JJ'), ('of', 'IN'), ('this', 'DT'), ('month', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('travelling', 'VBG'), ('to', 'TO'), ('Eiffel', 'NNP'), ('Tower', 'NNP'), ('.', '.'), ('He', 'PRP'), ('is', 'VBZ'), ('1', 'CD'), (',', ','), ('one', 'CD'), ('in', 'IN'), ('his', 'PRP$'), ('Organization', 'NN'), ('.', '.'), ('He', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('guy', 'NN'), ('.', '.')]
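
In the lemmatization output earlier, 'lives' became 'life' while 'travelling' and 'earns' were left unchanged. That is consistent with `WordNetLemmatizer.lemmatize` treating each word as a noun unless a `pos` argument is supplied. The tags above can supply that argument once they are converted from Penn Treebank tags to the single-letter codes WordNet uses ('n', 'v', 'a', 'r'). A minimal sketch of that conversion follows. `penn_to_wordnet` is a hypothetical helper name, and the sample pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and fallback for every other tag

# A few (word, tag) pairs taken from the tagging output above.
sample = [('He', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'),
          ('New', 'NNP'), ('York', 'NNP'), ('good', 'JJ')]
for word, tag in sample:
    print(word, tag, penn_to_wordnet(tag))

# With NLTK and its WordNet data installed, the letter could be passed on as:
#   lemm.lemmatize(word, pos=penn_to_wordnet(tag))
```

Falling back to noun for unknown tags mirrors the lemmatizer's own default, so words with tags outside these four groups behave as they did before.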


# Named Entity Recognition Tags

1. **PERSON** - People, including fictional.
2. **NORP** - Nationalities or religious or political groups.
3. **FAC** - Buildings, airports, highways, bridges, etc.
4. **ORG** - Companies, agencies, institutions, etc.
5. **GPE** - Countries, cities, states.
6. **LOC** - Non-GPE locations, mountain ranges, bodies of water.
7. **PRODUCT** - Objects, vehicles, foods, etc. (not services).
8. **EVENT** - Named hurricanes, battles, wars, sports events, etc.
9. **WORK_OF_ART** - Titles of books, songs, etc.
10. **LAW** - Named documents made into laws.
11. **LANGUAGE** - Any named language.
12. **DATE** - Absolute or relative dates or periods.
13. **TIME** - Times smaller than a day.
14. **PERCENT** - Percentage, including “%”.
15. **MONEY** - Monetary values, including unit.
16. **QUANTITY** - Measurements, as of weight or distance.
17. **ORDINAL** - “first”, “second”, etc.
18. **CARDINAL** - Numerals that do not fall under another type.

In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for entity in doc.ents:
    print(entity.text, entity.label_)

John PERSON
Google ORG
John PERSON
New York GPE
India GPE
50000 Dollars MONEY
first ORDINAL
Eiffel Tower FAC
1 CARDINAL
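
Once the entities are extracted, they are often easier to inspect when grouped by label. The following is a small sketch using only the standard library. The `(text, label)` pairs are copied from the output above; with spaCy loaded, the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The names `entities` and `by_label` are introduced for illustration.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the spaCy output above.
entities = [("John", "PERSON"), ("Google", "ORG"), ("John", "PERSON"),
            ("New York", "GPE"), ("India", "GPE"), ("50000 Dollars", "MONEY"),
            ("first", "ORDINAL"), ("Eiffel Tower", "FAC"), ("1", "CARDINAL")]

# Group entity texts under their label, skipping repeats such as the second "John".
by_label = defaultdict(list)
for ent_text, label in entities:
    if ent_text not in by_label[label]:
        by_label[label].append(ent_text)

for label, texts in by_label.items():
    print(f"{label}: {', '.join(texts)}")
```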
