

---


# Lab 9: Introduction to NLP

---







# NAME: RAJA HAIDER ALI
# CMS ID: 346900
# GROUP: 02

In [1]:
!pip install nltk==3.5



In [2]:
import nltk

# Task1 (Tokenization)
Tokenization means splitting text into smaller units, either words or sentences. NLTK has built-in functions for both kinds of tokenization.
In Task 1 you are given an example string, and your task is to convert it into tokens, both by word and by sentence.

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn.
And the first lesson of all was the basic trust the he could learn.
It's shocking to find how many people do not believe they can learn.
and how many more believe learning to be difficult."""

In [5]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

# tokenizing into sentences
sentences = sent_tokenize(example_string)

# tokenizing into words
words = word_tokenize(example_string)

In [6]:
sentences[1]

'And the first lesson of all was the basic trust the he could learn.'

In [7]:
words[0]

"Muad'Dib"

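To make the two granularities concrete, here is a minimal sketch that approximates both kinds of tokenization using only Python's `re` module. It is a simplified stand-in, not NLTK's Punkt algorithm: the function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, the sentence rule will mis-split abbreviations such as "Dr.", and contractions like "It's" stay as one token here.

```python
import re

def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return words (apostrophes kept inside words) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sample = ("Muad'Dib learned rapidly. "
          "It's shocking to find how many people do not believe they can learn.")

sample_sentences = simple_sent_tokenize(sample)
sample_words = simple_word_tokenize(sample_sentences[0])

print(sample_sentences)
print(sample_words)
```
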
# Task2 (Filtering Stop words):-
Stop words are words that you want to ignore, so you filter them out of your text when
you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words
since they don’t add a lot of meaning to a text in and of themselves.
Here’s how to import the relevant parts of NLTK in order to filter out stop words:

In [8]:
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
word_quote = "Sir, I protest. I am not a merry man"

In [10]:
# tokenizing into words
words = word_tokenize(word_quote)
words

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man']

In [30]:
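english_stops = set(stopwords.words('english'))

# Check whether any stop word contains an uppercase "I".
# NLTK's English stop words are all lowercase, so nothing is printed.
for word in english_stops:
  if "I" in word:
    print("present")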

In [12]:
filtered_list = [word for word in words if word not in english_stops]
filtered_list

['Sir', ',', 'I', 'protest', '.', 'I', 'merry', 'man']

In [13]:
print(filtered_list)

['Sir', ',', 'I', 'protest', '.', 'I', 'merry', 'man']
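Note that 'I' survives the filter above: the membership test is case-sensitive, and NLTK's English stop list contains only lowercase entries. A minimal sketch of a case-insensitive variant follows. The helper name `filter_stop_words` is hypothetical, and the tiny `demo_stops` set is a hand-picked stand-in for `set(stopwords.words('english'))` so that the snippet runs without downloading the corpus.

```python
def filter_stop_words(tokens, stop_words):
    """Keep tokens whose casefolded (lowercase) form is not a stop word."""
    return [tok for tok in tokens if tok.casefold() not in stop_words]

# Stand-in for set(stopwords.words('english')); assumption: only these four entries.
demo_stops = {"i", "am", "not", "a"}
tokens = ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man']

filtered = filter_stop_words(tokens, demo_stops)
print(filtered)
# ['Sir', ',', 'protest', '.', 'merry', 'man']
```
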


# Task3 (Stemming):-
Stemming is a text processing task in which you reduce words to their root, which is the core
part of a word. For example, the words “helping” and “helper” share the root “help.”
Stemming allows you to zero in on the basic meaning of a word rather than all the details of
how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.

In [14]:
from nltk.stem import PorterStemmer

In [15]:
string_for_stemming = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""

In [16]:
ps = PorterStemmer()

In [17]:
# tokenizing into words
words = word_tokenize(string_for_stemming)
words

['The',
 'crew',
 'of',
 'the',
 'USS',
 'Discovery',
 'discovered',
 'many',
 'discoveries',
 '.',
 'Discovering',
 'is',
 'what',
 'explorers',
 'do',
 '.']

In [18]:
lst = []
for word in words:
  stem = ps.stem(word)
  lst.append(stem)
lst

['the',
 'crew',
 'of',
 'the',
 'uss',
 'discoveri',
 'discov',
 'mani',
 'discoveri',
 '.',
 'discov',
 'is',
 'what',
 'explor',
 'do',
 '.']
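The output above shows that the Porter stemmer maps related word forms to a shared stem, though not always the same one: 'Discovery' and 'discoveries' become 'discoveri', while 'discovered' and 'Discovering' become 'discov'. As a rough sketch, the snippet below groups words by stem to make this visible. The variable names are illustrative; only `PorterStemmer.stem` comes from NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
discovery_words = ["Discovery", "discovered", "discoveries", "Discovering"]

# Map each stem to the original words that reduce to it.
groups = defaultdict(list)
for w in discovery_words:
    groups[stemmer.stem(w)].append(w)

for stem, members in groups.items():
    print(stem, "<-", members)
```
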

# Task5 (Lemmatizing):-
Now that you’re up to speed on parts of speech, you can circle back to lemmatizing. Like
stemming, lemmatizing reduces words to their core meaning, but it will give you a complete
English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.
Given the string below, your task is to lemmatize it to obtain the following output.

In [19]:
string_for_lemmatizing = "The friends of DeSoto love scarves."

In [20]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [21]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [22]:
# tokenizing into words
words = word_tokenize(string_for_lemmatizing)
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

In [23]:
lst = []

# lemmatize() treats each word as a noun by default (pos='n')
for word in words:
  lem = lemmatizer.lemmatize(word)
  lst.append(lem)
print(lst)

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']


# Task6 (Chunking):-
Chunking is the process of grouping the words of a sentence into short phrases (chunks),
such as noun phrases, based on their part-of-speech tags.

In [24]:
lotr_quote = "It's a dangerous business, Frodo, going out your door"

In [25]:
from nltk import pos_tag, RegexpParser

In [26]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [27]:
# Step 1: Tokenize the quote into words.
words = word_tokenize(lotr_quote)
words

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door']

In [28]:
# Step 2: Tag the words by part of speech.
pos_tags = pos_tag(words)

# Step 3: Define a chunk grammar.
# NP = optional determiner (DT), any number of adjectives (JJ), then a singular noun (NN).
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# Step 4: Create a chunk parser with the defined grammar.
chunk_parser = RegexpParser(chunk_grammar)

# Step 5: Parse the tagged output through the parser.
tree = chunk_parser.parse(pos_tags)

# Display the resulting tree
print(tree)

(S
  It/PRP
  's/VBZ
  (NP a/DT dangerous/JJ business/NN)
  ,/,
  Frodo/NNP
  ,/,
  going/VBG
  out/RP
  your/PRP$
  (NP door/NN))
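
Once a sentence is parsed, the noun-phrase chunks can be pulled out of the tree as plain strings. The sketch below assumes the same grammar as above and, to stay self-contained, hard-codes the tagged tokens copied from the printed tree instead of calling `pos_tag`; `tagged` and `noun_phrases` are illustrative names.

```python
from nltk import RegexpParser

# Tagged tokens copied from the tree printed above (no tagger download needed).
tagged = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
          ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
          ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN")]

parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
parsed = parser.parse(tagged)

# Collect the words of every subtree labelled NP.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in parsed.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
# ['a dangerous business', 'door']
```

'Frodo' is left out because it is tagged NNP, and the pattern `<NN>` matches only the tag NN.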
