# Lab - NLP Tools and Curated Datasets

## Lab Summary:
Experiment with NLP tools and perform NLP tasks, including tokenization and stemming.

## Learning Outcomes:
Upon completion of this lab, students can:
<ul>
    <li> Load common NLP tools: NLTK, spaCy, and TextBlob</li>
    <li> Use these tools to carry out NLP tasks </li>
    <li> Access datasets from curated data sources </li>
</ul>

## Key Packages and Classes

In this lab we will use the following libraries:
<ol>
    <li> NLTK </li>
    <li> spaCy </li>
    <li> TextBlob </li>
</ol>



# NLP tools

Natural Language Processing tools can help you discover valuable
insights in text. They help us solve a variety of text analysis
problems, such as sentiment analysis and topic classification.

# NLTK: The Natural Language Toolkit
NLTK is one of the leading Python tools in NLP model building. 

Some of NLTK's capabilities include tokenization, tagging, stemming, parsing, and classification.

Reference: https://www.nltk.org/book/

In [None]:
! pip install nltk spacy textblob pyarrow datasets
! python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [None]:
# Import libraries

import nltk # https://www.nltk.org/install.html
# nltk.download('averaged_perceptron_tagger') # Use for Google Colab
nltk.download('averaged_perceptron_tagger_eng') # Use for Jupyter Lab
nltk.download('punkt_tab')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/rewheaton/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/rewheaton/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

# Tokenization Libraries

## nltk.tokenize.word_tokenize()

This function splits a string into a list of word and punctuation tokens.

Reference: https://www.geeksforgeeks.org/python-nltk-nltk-tokenizer-word_tokenize/

<b>Task: Tokenize a paragraph and print the first 12 tokens</b> 

In [2]:
from nltk.tokenize import word_tokenize
from nltk.text import Text
my_string = "Natural Language Processing (NLP) is a branch of \
artificial intelligence that enables computers to understand, \
interpret, and generate human language. It combines linguistics, \
computer science, and machine learning to analyze large volumes \
of natural language data. NLP powers applications like chatbots, \
language translation, sentiment analysis, and voice recognition. \
Techniques in NLP include tokenization, part-of-speech tagging, \
named entity recognition, and syntactic parsing. As language is \
complex and nuanced, NLP continues to evolve to better handle \
context, ambiguity, and cultural variations in communication."

# Tokenize the paragraph:
tokens = word_tokenize(my_string)

# Note that the result is a list:
print(type(tokens))

# Print the first 12 tokens.
print(tokens[:12])

<class 'list'>
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence']
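
To see why a dedicated tokenizer is useful, the sketch below splits the opening of the same paragraph on whitespace only, using plain Python and no NLTK. Punctuation stays attached to its neighbouring word, whereas `word_tokenize` returned `(`, `NLP`, and `)` as separate tokens above. The variable names are illustrative.

```python
# Whitespace splitting, for comparison with word_tokenize.
sample = "Natural Language Processing (NLP) is a branch of artificial intelligence"

naive_tokens = sample.split()

# Punctuation is not separated: "(NLP)" is a single item here.
print(naive_tokens[:5])
print("(NLP)" in naive_tokens)
```

Whitespace splitting is fast, but any later step that matches on words (counting, tagging, lookup) would treat `(NLP)` and `NLP` as different tokens.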


## Punkt tokenizer

This tokenizer divides a text into a list of sentences with an unsupervised algorithm. 

The NLTK data package includes a pre-trained Punkt tokenizer for English. 

Reference: https://www.nltk.org/api/nltk.tokenize.html

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/rewheaton/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

<b>What nltk.download('punkt') does:</b>

Downloads a file named <i>punkt.zip</i> to your local <i>nltk_data/tokenizers</i> directory.

The file contains language-specific rules and statistical models.

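Once downloaded, you can use `sent_tokenize()` without needing internet access again. Recent NLTK releases load the `punkt_tab` resource downloaded earlier in this lab instead, so `punkt` is mainly needed on older installations.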

In [4]:
# Import sent_tokenize
from nltk.tokenize import sent_tokenize

# Read the first sentence from the paragraph above, using sent_tokenize:
sentences = sent_tokenize(my_string)
sentences[:1]

['Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language.']
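
For contrast, here is a minimal sketch of a naive sentence splitter built with the standard `re` module. It is not how Punkt works; it only shows the kind of mistake a fixed rule makes on abbreviations, which a trained sentence tokenizer is designed to avoid. The sample text is made up for illustration.

```python
import re

text = "Dr. Smith teaches NLP. She uses NLTK in class."

# Naive rule: split after '.', '!' or '?' when followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
print(len(naive_sentences))
```

The text contains two sentences, but the rule produces three pieces because it treats the period in "Dr." as a sentence boundary.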

## Tweet Tokenizer

TweetTokenizer is a separate NLTK tokenizer designed for social media text, such as posts on X (formerly Twitter).

TweetTokenizer keeps hashtags intact, while word_tokenize doesn't.

In [5]:
# Import TweetTokenizer:
from nltk.tokenize import TweetTokenizer
tweet = "One guess only and no googling. Answer at 10pm. Best of luck. #guessthemysterycelebrity"

# Initialize a TweetTokenizer object:
tknzr = TweetTokenizer()
tknzr.tokenize(tweet)

['One',
 'guess',
 'only',
 'and',
 'no',
 'googling',
 '.',
 'Answer',
 'at',
 '10pm',
 '.',
 'Best',
 'of',
 'luck',
 '.',
 '#guessthemysterycelebrity']
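
The hashtag behavior can be illustrated with a simplified regular expression. This is a sketch of the idea only, not NLTK's actual implementation, and both patterns are assumptions made for this example.

```python
import re

tweet = "Best of luck. #guessthemysterycelebrity"

# Alternatives are tried left to right: hashtag, then word, then one punctuation mark.
with_hashtags = re.findall(r"#\w+|\w+|[^\w\s]", tweet)

# Without the hashtag alternative, '#' is split off as punctuation.
without_hashtags = re.findall(r"\w+|[^\w\s]", tweet)

print(with_hashtags)
print(without_hashtags)
```

Placing the hashtag alternative first is what keeps `#` and the following word together as one token.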

# Parts of Speech Tagging

Part-of-speech tagging is the process of identifying a word in text as a particular part of speech, based on both its definition and its context. 

<b>Example:</b>
<br>I am a student
<br>I: <b>pronoun</b>
<br>am: <b>verb</b>
<br>a: <b>indefinite article</b>
<br>student: <b>noun</b>
<br>Reference: https://www.geeksforgeeks.org/part-speech-tagging-stop-words-using-nltk-python/

Some of the possible POS tags are:

1) CC: conjunction, coordinating

2) DT: determiner

3) NN: noun, common, singular or mass

4) NNS: noun, common, plural

5) PRP: pronoun, personal

6) VB: verb, base form

7) VBD: verb, past tense

8) VBG: verb, present participle or gerund

9) JJS: adjective, superlative

These are Penn Treebank tags, and the full tag set contains many more.



In [None]:
# Import pos_tag library and tokenizer.
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tokenize a sentence.
text = word_tokenize("Roger played tennis all day and became the best")

# Use the pos_tag from NLTK to do part of speech tagging
pos_tag(text)

[('Roger', 'NNP'),
 ('played', 'VBD'),
 ('tennis', 'NN'),
 ('all', 'DT'),
 ('day', 'NN'),
 ('and', 'CC'),
 ('became', 'VBD'),
 ('the', 'DT'),
 ('best', 'JJS')]
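
Because `pos_tag` returns a list of `(word, tag)` tuples, the result can be filtered with ordinary Python. The sketch below copies the tagged output above into a list, so it runs without NLTK, and selects words by tag prefix.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [
    ("Roger", "NNP"), ("played", "VBD"), ("tennis", "NN"),
    ("all", "DT"), ("day", "NN"), ("and", "CC"),
    ("became", "VBD"), ("the", "DT"), ("best", "JJS"),
]

# Penn Treebank noun tags start with "NN" and verb tags start with "VB".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)
print(verbs)
```

Matching on the prefix groups related tags together, so `NN` and `NNP` both count as nouns here.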

# Practice: Parts of Speech

1. Analyze the parts of speech of the following sentence using pos_tag:

- "Natural Language Processing is my favorite kind of artificial intelligence!"

In [7]:
# Your Code Here:
pos_tag_text = "Natural Language Processing is my favorite kind of artificial intelligence!"
# Tokenize and tag parts of speech
tokens = word_tokenize(pos_tag_text)
print(pos_tag(tokens))

[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favorite', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('!', '.')]


# spaCy

spaCy is a free open-source library for Natural Language Processing in Python. 

It features Named Entity Recognition (NER), POS tagging, dependency parsing, word vectors and more.

Reference: https://spacy.io/

In [11]:
# Get started with spaCy by loading the "en_core_web_md" pretrained English language model,
# which is the model downloaded at the top of this lab.
# "md" is the medium-sized model. "sm" (smaller and faster, but less accurate) and "lg" are also available.
import spacy
nlp = spacy.load("en_core_web_md")

# Lemmatization with spaCy

In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.

Lemmatization reduces an inflected word to its dictionary form, called the lemma. For example, the lemma of "are" is "be".

![Image](http://kavita-ganesan.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-20-at-4.49.08-PM.png)

Reference: https://builtin.com/machine-learning/lemmatization

In [12]:
# Load some text into the "doc" variable using the nlp() function.

doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience.")

# In spacy, a lemma of a word can be printed with lemma_
# Code to print the lemmatised form of a word alongside each word in doc:
for word in doc:
    print(word.text, word.lemma_)


All all
human human
beings being
are be
born bear
free free
and and
equal equal
in in
dignity dignity
and and
rights right
. .
They they
are be
endowed endow
with with
reason reason
and and
conscience conscience
. .
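
One practical use of lemmas is grouping different surface forms of the same word. The sketch below copies the `(text, lemma)` pairs from the output above (punctuation omitted) so that it runs without spaCy, then counts lemmas and lists the words that lemmatization changed.

```python
from collections import Counter

# (text, lemma) pairs copied from the spaCy output above, punctuation omitted.
pairs = [
    ("All", "all"), ("human", "human"), ("beings", "being"), ("are", "be"),
    ("born", "bear"), ("free", "free"), ("and", "and"), ("equal", "equal"),
    ("in", "in"), ("dignity", "dignity"), ("and", "and"), ("rights", "right"),
    ("They", "they"), ("are", "be"), ("endowed", "endow"), ("with", "with"),
    ("reason", "reason"), ("and", "and"), ("conscience", "conscience"),
]

# Count how often each lemma occurs.
lemma_counts = Counter(lemma for _, lemma in pairs)

# Words whose lemma differs from the lowercased surface form.
changed = sorted({(text, lemma) for text, lemma in pairs if text.lower() != lemma})

print(lemma_counts.most_common(2))
print(changed)
```

Both occurrences of "are" are counted under the single lemma "be", which is the grouping a frequency analysis usually wants.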


We saw how to use pos_tag with NLTK.

With spaCy, we can also identify parts of speech.

In [13]:
# Identify the parts of speech from the previous document:

for word in doc:
    print(word, word.pos_)

All DET
human ADJ
beings NOUN
are AUX
born VERB
free ADJ
and CCONJ
equal ADJ
in ADP
dignity NOUN
and CCONJ
rights NOUN
. PUNCT
They PRON
are AUX
endowed VERB
with ADP
reason NOUN
and CCONJ
conscience NOUN
. PUNCT


# Practice: Parts of Speech with spaCy

1. Extract the parts of speech from the following sentence:
- "Only these topics: A mathematical puzzle, A biological experiment, A wooden boat."

2. Print only the adjectives.

In [17]:
# Your code here:
spacy_text = "Only these topics: A mathematical puzzle, A biological experiment, A wooden boat."
spacy_doc = nlp(spacy_text)

for word in spacy_doc:
    if word.pos_ == "ADJ":
        print(word, "=>", word.pos_)

mathematical => ADJ
biological => ADJ
wooden => ADJ


# TextBlob

TextBlob is a library built on top of NLTK and is often used for simple NLP tasks.

It provides a simple interface for common NLP tasks, including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

For this lab, we will use it for sentiment analysis.

Reference: https://textblob.readthedocs.io/en/dev/

In [20]:
# Import textblob library
from textblob import TextBlob

In [21]:
# Assign a text string to a variable for analysis.
testimonial = TextBlob("Alexa has been great so far. I am still learning about some the features")

# Analyze the sentiment using TextBlob
testimonial.sentiment

Sentiment(polarity=0.45, subjectivity=0.875)
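
TextBlob reports polarity as a float from -1.0 (negative) to 1.0 (positive) and subjectivity from 0.0 (objective) to 1.0 (subjective). To compare a score with a human judgment of positive, neutral, or negative, it helps to map polarity to a label. The helper below is a hypothetical sketch, and its `threshold` of 0.1 is an assumption chosen for illustration, not a TextBlob setting.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# 0.45 is the polarity reported for the testimonial above.
print(label_polarity(0.45))
print(label_polarity(-0.3))
print(label_polarity(0.05))
```

A wider threshold labels more texts as neutral, so the cut-off should be tuned to the data being analyzed.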

# Curated Datasets

The Hugging Face Hub offers many datasets for NLP tasks like text classification, question answering, and language modeling.

Curated datasets like these are a great way to familiarize yourself with machine learning techniques.

Reference: https://huggingface.co/docs/datasets/

In [None]:
# pyarrow and datasets libraries are required.
import pyarrow
import datasets

In [22]:
# Let's look at one of the datasets and its attributes.
from datasets import load_dataset
dset = load_dataset("squad")
print(dset)
print(len(dset))
print(dset.shape)


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
2
{'train': (87599, 5), 'validation': (10570, 5)}
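
Note that `len(dset)` returned 2 because it counts the splits, not the rows. The sketch below reuses the shapes printed above in a plain dictionary, so it runs without the `datasets` library, to show how the number of splits and the total number of rows relate.

```python
# Shapes copied from the dset.shape output above: (rows, columns) per split.
shapes = {"train": (87599, 5), "validation": (10570, 5)}

# Number of splits, which is what len(dset) counted.
num_splits = len(shapes)

# Total rows across all splits.
total_rows = sum(rows for rows, _ in shapes.values())

print(num_splits)
print(total_rows)
```

To get the number of rows in one split, index the split first, as in `len(dset['train'])`.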





In [27]:
# Investigate the first row in the "train" data.
dset['train'][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
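
In the row above, `answer_start` is the character offset of the answer within `context`. The sketch below uses a small hypothetical record in the same layout (it is not taken from the dataset) to show how the offset and the answer length recover the answer span.

```python
# Hypothetical record in the same layout as a SQuAD row (not taken from the dataset).
record = {
    "context": "NLTK is a Python toolkit for natural language processing.",
    "question": "Which language is NLTK written for?",
    "answers": {"text": ["Python"], "answer_start": [10]},
}

answer = record["answers"]["text"][0]
start = record["answers"]["answer_start"][0]

# Slice the context from the offset to the end of the answer.
span = record["context"][start:start + len(answer)]

print(span)
print(span == answer)
```

The same slicing should work on `dset['train'][0]`, assuming the offsets are character-based as they appear to be here.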

# Practice: Curated Datasets

Load the 'imdb' dataset from the Hugging Face Hub and:

1) Print the dataset info

2) Print the length of the dataset

3) Print the shape of the dataset

4) Print the first row of the "test" dataset.

In [30]:
# Your code here:
imdb = load_dataset("imdb")

print("Length: ", len(imdb))
print("Shape: ", imdb.shape)
print("First Row: ")
print(imdb['test'][0])

Length:  3
Shape:  {'train': (25000, 2), 'test': (25000, 2), 'unsupervised': (50000, 2)}
First Row: 
{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and rea

# Practice: Sentiment Analysis
1. Read the first row of the training dataset from imdb and make a determination of your own about its sentiment. Is it positive, neutral, or negative?
2. Use what you learned regarding NLP sentiment analysis to estimate the sentiment of the first record in the imdb training dataset.
3. Answer the following question: Does the sentiment analysis from your NLP analysis match your own judgment from step 1? Why do you think that is?

In [37]:
# Your code here:
#1. Read the first row.
first_row = imdb['train'][0]
print(first_row)

#2. Sentiment Analysis
polarity_guess = 0.4
print("My guess was: ", polarity_guess)
text_blob = TextBlob(first_row['text'])
print("TextBlob Sentiment: ", text_blob.sentiment)

#3. Comparison of NLP analysis to your own judgment.  Why is your judgment different or the same?
print("My guess was more positive than TextBlob's analysis.  The negative comments may have influenced the sentiment analysis to be less positive than my own judgment.")


{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# Data Visualization Preview

By using visual elements like charts, NLP data can be made more understandable.  We will learn more about visualization in a different module.

Using the data provided, we can visualize the sentiment of multiple records in a dataset.

In [2]:
import numpy as np # for Linear Algebra needs
import pandas as pd # for data processing, CSV file I/O (pd.read_csv)
import nlplot
from plotly.subplots import make_subplots
import plotly.express as px
# load the train.csv file provided in this week's materials
train = pd.read_csv("train.csv")

In [3]:
train = train.sample(n=1000, random_state=0)
# Convert text to lowercase
train['text'] = train['text'].str.lower()  # vectorized, and leaves missing values as NaN instead of raising
display(train.head(), train.shape)

| | textID | text | selected_text | sentiment |
|---|---|---|---|---|
| 20149 | 80a1e6bc32 | i just saw a shooting star... i made my wish | wish | positive |
| 12580 | 863097735d | gosh today sucks! i didnt get my tax returns! ... | gosh today sucks! | negative |
| 13135 | 264cd5277f | tired and didn`t really have an exciting satur... | tired and didn`t really have an exciting Satur... | neutral |
| 14012 | baee1e6ffc | i`ve been eating cheetos all morning.. | i`ve been eating cheetos all morning.. | neutral |
| 21069 | 67d06a8dee | haiiii sankq i`m fineee ima js get a checkup ... | haiiii sankQ i`m fineee ima js get a checkup c... | neutral |


(1000, 4)

In [4]:
df = train.groupby('sentiment').size().reset_index(name='count')
fig = px.bar(df, y='count', x='sentiment', text='count')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(
    title='sentiment counts',
    xaxis_title='sentiment',
    width=700,
    height=500,
    )
fig.show()
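
The `groupby('sentiment').size()` step counts how many rows carry each label, and those counts become the bar heights. As a minimal sketch of the same idea without pandas, the example below counts the sentiment labels of the five sample rows displayed above using the standard library.

```python
from collections import Counter

# Sentiment labels of the five sample rows displayed above.
labels = ["positive", "negative", "neutral", "neutral", "neutral"]

# One count per distinct label, like groupby('sentiment').size().
counts = Counter(labels)

for sentiment, count in sorted(counts.items()):
    print(sentiment, count)
```

On the full 1,000-row sample the counts are larger, but the bar chart is built from exactly this kind of label-to-count table.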