In [1]:
!pip install nltk
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
corpus = """
Stella, a software engineer with a fire in her eyes, slammed her laptop shut. Her latest code, a groundbreaking algorithm, refused to compile. Hours of meticulous work seemed on the verge of collapse. Was she just tilting at windmills, her human effort insignificant against the unforgiving logic of the machine?

"Even the most elegant code can't predict every twist," said a gentle voice. David, a senior developer with a calming presence, stood beside her.

Stella scoffed. "So, are we all just at the mercy of random bugs and glitches?"

David chuckled. "Think of it like this. You write the code (agency), you craft the logic (action), but sometimes unforeseen complexities (external forces) arise. A skilled programmer trusts their abilities but also acknowledges the inherent mysteries of the digital world."

Stella felt a knot loosen in her chest. Maybe her frustration stemmed from the illusion of total control. Perhaps her role was to leverage her skills and knowledge while embracing the unpredictable nature of the digital realm. It was about finding the bridge between determined effort and accepting the larger design of the system's intricate workings.  She took a deep breath, a renewed sense of purpose sparking in her eyes. There was still a solution waiting to be found, a harmonious dance between her will and the hidden logic of the code.

"""

# Tokenization

## 1. Sentence Tokenization

## 2. Word Tokenization

- 1. Sentence Tokenization

In [3]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(corpus)
for x in sentences:
  print(x)


Stella, a software engineer with a fire in her eyes, slammed her laptop shut.
Her latest code, a groundbreaking algorithm, refused to compile.
Hours of meticulous work seemed on the verge of collapse.
Was she just tilting at windmills, her human effort insignificant against the unforgiving logic of the machine?
"Even the most elegant code can't predict every twist," said a gentle voice.
David, a senior developer with a calming presence, stood beside her.
Stella scoffed.
"So, are we all just at the mercy of random bugs and glitches?"
David chuckled.
"Think of it like this.
You write the code (agency), you craft the logic (action), but sometimes unforeseen complexities (external forces) arise.
A skilled programmer trusts their abilities but also acknowledges the inherent mysteries of the digital world."
Stella felt a knot loosen in her chest.
Maybe her frustration stemmed from the illusion of total control.
Perhaps her role was to leverage her skills and knowledge while embracing the un

- 2. Word Tokenization

In [4]:
from nltk.tokenize import word_tokenize

words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")
for x in words:
  print(x)

Hello
,
friend
.
Hello
,
friend
?
That
's
lame
.
Maybe
I
should
give
you
a
name
,
but
that
's
a
slippery
slope
.
You
're
only
in
my
head
.
We
have
to
remember
that
.


# Stemming

## 1. Porter Stemming

## 2. Regexp Stemming

## 3. Snowball Stemming

- 1. Porter Stemming

In [5]:
from nltk.stem import PorterStemmer

words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")

ps = PorterStemmer()
for x in words:
  print(ps.stem(x), end=" ")

hello , friend . hello , friend ? that 's lame . mayb i should give you a name , but that 's a slipperi slope . you 're onli in my head . we have to rememb that . 

- 2. Regexp Stemming

In [6]:
from nltk.stem import RegexpStemmer

words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")

# Strip a trailing "ing", "able", or "s"; note this also turns the token "'s" into "'".
rs = RegexpStemmer('ing$|able$|s$')
for x in words:
  print(rs.stem(x), end=" ")

Hello , friend . Hello , friend ? That ' lame . Maybe I should give you a name , but that ' a slippery slope . You 're only in my head . We have to remember that . 
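The output above shows `'s` reduced to `'`, because the alternative `s$` matches any token that ends in `s`, including the clitic. As a rough sketch of what suffix stripping with this pattern amounts to, the following uses only the standard `re` module. The names `strip_suffix` and `min_length` are hypothetical and chosen for illustration; they are not NLTK's API, although `RegexpStemmer` does accept a `min` argument that serves a similar purpose.

```python
import re

# Same pattern as the RegexpStemmer cell above.
SUFFIXES = re.compile(r"ing$|able$|s$")

def strip_suffix(token, min_length=0):
    """Remove a matching trailing suffix; leave tokens shorter than min_length untouched."""
    if len(token) < min_length:
        return token
    return SUFFIXES.sub("", token)

for token in ["running", "readable", "slopes", "'s", "is"]:
    print(token, "->", strip_suffix(token), "| with min_length=4:", strip_suffix(token, min_length=4))
```

A length threshold is one simple way to keep very short tokens such as `'s` or `is` from being clipped; whether that trade-off is appropriate depends on the text being processed.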

- 3. Snowball Stemming

In [7]:
from nltk.stem import SnowballStemmer

words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")

ss = SnowballStemmer("english")
for x in words:
  print(ss.stem(x), end=" ")

hello , friend . hello , friend ? that 's lame . mayb i should give you a name , but that 's a slipperi slope . you re onli in my head . we have to rememb that . 
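The Porter and Snowball outputs above are nearly identical; the one visible difference is `'re`, which Porter leaves alone and Snowball reduces to `re`. A small side-by-side comparison makes such differences easier to spot. This is only a sketch: the token list is a hand-picked subset of the sentence used above, and it assumes the same NLTK version as the rest of the notebook.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Hand-picked tokens from the example sentence (illustrative subset).
tokens = ["Maybe", "slippery", "only", "remember", "'s", "'re"]

for token in tokens:
    p, s = porter.stem(token), snowball.stem(token)
    marker = "" if p == s else "  <- differs"
    print(f"{token!r:12} porter={p!r:12} snowball={s!r}{marker}")
```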

# Lemmatization

## 1. WordNet Lemmatization

In [8]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")

lemmatizer = WordNetLemmatizer()

# pos="v" asks WordNet to treat every token as a verb; the default is noun ("n").
for x in words:
  print(lemmatizer.lemmatize(x, pos="v"), end=" ")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Hello , friend . Hello , friend ? That 's lame . Maybe I should give you a name , but that 's a slippery slope . You 're only in my head . We have to remember that . 

# Stop Word Removal

In [9]:
from nltk.corpus import stopwords
nltk.download('stopwords')

words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")

stop_words = set(stopwords.words("english"))

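# Note: the NLTK stop word list is lowercase and this membership test is
# case-sensitive, so capitalized tokens such as "That", "You", and "We" are kept.
for x in words:
  if x not in stop_words:
    print(x, end=" ")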

Hello , friend . Hello , friend ? That 's lame . Maybe I give name , 's slippery slope . You 're head . We remember . 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
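In the output above, `That`, `You`, `We`, and `I` survive the filter even though their lowercase forms are stop words, because the comparison is case-sensitive. One common adjustment is to lowercase each token before the lookup. The sketch below uses a tiny hand-written stop word set and a hard-coded token list so that it runs without any downloads; both are assumptions for illustration, and in the notebook the set would come from `stopwords.words("english")`.

```python
# Illustrative subset only; stands in for set(stopwords.words("english")).
stop_words = {"i", "you", "we", "that", "a", "but", "should", "only", "in", "my", "have", "to"}

# Tokens as word_tokenize produced them for part of the example sentence.
tokens = ["That", "'s", "lame", ".", "You", "'re", "only", "in", "my", "head", "."]

# Lowercase only for the comparison, so the original casing is preserved in the result.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(" ".join(filtered))
```

Lowercasing just for the membership test, rather than lowercasing the whole text up front, keeps proper nouns and sentence-initial capitals available for later steps such as POS tagging.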


# Part-of-Speech (POS) Tagging

In [10]:
nltk.download('averaged_perceptron_tagger')
words = word_tokenize("Hello, friend. Hello, friend? That's lame. Maybe I should give you a name, but that's a slippery slope. You're only in my head. We have to remember that.")

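# Each token is tagged in isolation here, so the tagger has no sentence context.
# Passing the whole list, nltk.pos_tag(words), lets it use neighbouring tokens.
for x in words:
  print(x, nltk.pos_tag([x]))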

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Hello [('Hello', 'NN')]
, [(',', ',')]
friend [('friend', 'NN')]
. [('.', '.')]
Hello [('Hello', 'NN')]
, [(',', ',')]
friend [('friend', 'NN')]
? [('?', '.')]
That [('That', 'DT')]
's [("'s", 'POS')]
lame [('lame', 'NN')]
. [('.', '.')]
Maybe [('Maybe', 'RB')]
I [('I', 'PRP')]
should [('should', 'MD')]
give [('give', 'VB')]
you [('you', 'PRP')]
a [('a', 'DT')]
name [('name', 'NN')]
, [(',', ',')]
but [('but', 'CC')]
that [('that', 'IN')]
's [("'s", 'POS')]
a [('a', 'DT')]
slippery [('slippery', 'NN')]
slope [('slope', 'NN')]
. [('.', '.')]
You [('You', 'PRP')]
're [("'re", 'VBP')]
only [('only', 'RB')]
in [('in', 'IN')]
my [('my', 'PRP$')]
head [('head', 'NN')]
. [('.', '.')]
We [('We', 'PRP')]
have [('have', 'VB')]
to [('to', 'TO')]
remember [('remember', 'VB')]
that [('that', 'IN')]
. [('.', '.')]
