# NLTK

**NLTK (Natural Language Toolkit) is a Python package for working with text-based datasets. It is commonly used for text preprocessing, which includes operations such as:**

- Lowercasing
- Removing Stop words
- Using regex to find patterns, remove unwanted characters/strings, etc.
- Tokenization
- Stemming
- Lemmatization
- Creating/Analyzing N-grams
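
The first three operations in the list need nothing beyond the standard library, so a minimal sketch is shown here. The `STOP_WORDS` set and the `preprocess` helper are hypothetical names made up for this illustration; NLTK provides a fuller English list through `nltk.corpus.stopwords` once `nltk.download('stopwords')` has been run.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch, not NLTK's list).
STOP_WORDS = {"the", "is", "a", "of", "her"}

def preprocess(text):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = text.lower()
    # Anything that is not a lowercase letter or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = preprocess("Her cat's name is Luna. Her dog's name is Max")
print(cleaned)
```

Note that the apostrophe is replaced by a space, so "cat's" splits into "cat" and "s"; whether that is acceptable depends on the task.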

### Tokenization

In [5]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ritik_saxena/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [8]:
sentences = "Her cat's name is Luna. Her dog's name is Max"

In [9]:
sent_tokenize(sentences)

["Her cat's name is Luna.", "Her dog's name is Max"]

In [10]:
word_tokenize(sentences)

['Her',
 'cat',
 "'s",
 'name',
 'is',
 'Luna',
 '.',
 'Her',
 'dog',
 "'s",
 'name',
 'is',
 'Max']
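
`word_tokenize` keeps punctuation and the possessive `'s` as separate tokens. If only word characters are wanted, one option is NLTK's `RegexpTokenizer`, which needs no downloaded data. The pattern `\w+` below is just one reasonable choice for this sketch, not the only one.

```python
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
word_only = RegexpTokenizer(r"\w+")
regex_tokens = word_only.tokenize("Her cat's name is Luna. Her dog's name is Max")
print(regex_tokens)
```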

### Stemming

In [12]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()


In [13]:
connect_tokens = ["connnecting", "connect", "connected", "conn"]
for tok in connect_tokens:
    print(f"{tok} : {ps.stem(tok)}")

connnecting : connnect
connect : connect
connected : connect
conn : conn
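
The Porter stemmer strips suffixes by rule, so a stem is not guaranteed to be a dictionary word. A small sketch with a few extra words (chosen here only for illustration) makes that visible:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words picked for this sketch; "studies" shows a stem that is not a real word.
words = ["studies", "running", "connected"]
stems = {w: ps.stem(w) for w in words}
print(stems)
```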


### Lemmatization

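Unlike stemming, which strips suffixes by rule and can produce non-words, lemmatization maps a word to its dictionary form (lemma) by looking it up in WordNet. Words that are not found are returned unchanged. `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is given, which is why "connected" comes back unchanged below.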

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ritik_saxena/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [17]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()

In [18]:
for tok in connect_tokens:
    print(f"{tok}: {lm.lemmatize(tok)}")

connnecting: connnecting
connect: connect
connected: connected
conn: conn


In [21]:
print(f"learners - {lm.lemmatize('learners')}")

learners - learner


### N-grams

An n-gram is a contiguous sequence of n tokens. N-grams help us analyze the relationship between neighboring words. Examples: unigrams (n=1), bigrams (n=2), and trigrams (n=3).

In [23]:
import nltk, pandas as pd
import matplotlib.pyplot as plt

Matplotlib is building the font cache; this may take a moment.


In [25]:
tokens = ["the", "rise", "of", "artificial", "intelligence", "has", "led", "to", "advancements", "in", "computer", "vision"]
n = 1
unigrams = pd.Series(nltk.ngrams(tokens, n)).value_counts()
print(unigrams[:10])

(the,)             1
(rise,)            1
(of,)              1
(artificial,)      1
(intelligence,)    1
(has,)             1
(led,)             1
(to,)              1
(advancements,)    1
(in,)              1
Name: count, dtype: int64




***If you change the value of `n` to 2, the output will be the frequency of occurrence of bigrams, and so on.***
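
As a minimal sketch of the `n = 2` case, the block below counts bigrams with `collections.Counter` instead of pandas. The sentence is made up for this illustration so that one bigram repeats.

```python
from collections import Counter

import nltk

# Made-up sentence for this sketch; "the cat" appears twice.
bigram_tokens = "the cat sat on the mat and the cat slept".split()

n = 2
bigram_counts = Counter(nltk.ngrams(bigram_tokens, n))
print(bigram_counts.most_common(3))
```

A sequence of 10 tokens yields 9 bigrams; in general there are `len(tokens) - n + 1` n-grams.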