# Natural Language Processing

**AI4PH Short Course**  
Fall 2023

**Joon Lee, PhD**  
Associate Professor  
[Data Intelligence for Health Lab](https://cumming.ucalgary.ca/dih)  
Cumming School of Medicine    
University of Calgary

# Natural Language Processing (NLP)
* The study of how computers process and interact with human (natural) languages
* An important topic in informatics because much data, information, and knowledge is stored as natural-language text

# Learning Objectives

Learn:
1. How to pre-process raw text data
2. How to perform part-of-speech tagging
3. How to develop simple machine learning-based text classification and prediction models

# Reference for This Course
NLTK Book (mostly Chapters 3, 5, and 6): http://www.nltk.org/book

# How to Run This Notebook
* On your local machine if you have Python and Jupyter Notebook installed, or
* On Google Colab (https://colab.research.google.com/)

# Some Terminology

![text data hierarchy](https://miro.medium.com/max/1400/1*f9XQxMUDkYkquZDDcccEJA.png)

Image: https://medium.com/@amitrani/an-overview-of-nlp-fe597ed7e8b6

# Package Installation

In [1]:
# run the line below if you don't have these packages installed

!pip install nltk wikipedia gensim



In [2]:
# download the NLTK corpora and models that will be used in this session

import nltk

nltk.download('brown')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('vader_lexicon')
nltk.download('names')
nltk.download('treebank')
nltk.download('movie_reviews')

[nltk_data] Downloading package brown to /home/joonwu.lee/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/joonwu.lee/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/joonwu.lee/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/joonwu.lee/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /home/joonwu.lee/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/joonwu.lee/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package names to /home/joonwu.lee/nltk_data...
[nltk_data]   Package names is already up-to

True

In [3]:
# import all packages used in this course

from nltk.corpus import brown
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk.corpus import movie_reviews
from nltk.sentiment import SentimentIntensityAnalyzer
import wikipedia
import gensim
import gensim.downloader
import string
import re
import math
import operator
import random

# Basic String Operations

In [4]:
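# in Python, a string is an immutable sequence of characters
# either single or double quotes are fine
# avoid naming a variable "str": that would shadow the built-in str type

s = 'AI4PH'
len(s)

5

In [5]:
# zero indexed

s[0]

'A'

In [6]:
# another indexing example: the last two characters

s[-2:]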

'PH'

In [7]:
# break a string into pieces

longer_str = 'a, b, c, d'
longer_str.split(',')

['a', ' b', ' c', ' d']

In [8]:
# strip the surrounding whitespace using a list comprehension

[x.strip() for x in longer_str.split(',')]

['a', 'b', 'c', 'd']

In [9]:
# find a substring
# index is returned

longer_str.find('b')

3

In [10]:
# replace a substring

longer_str.replace(', ', '|')

'a|b|c|d'

In [11]:
# convert to lowercase

'ABC'.lower()

'abc'

In [12]:
# convert to uppercase

'abc'.upper()

'ABC'

In [13]:
# concatenate

s1 = 'Artificial'
s2 = 'Intelligence'
print(s1 + " " + s2)

Artificial Intelligence


# Example Text Data

In [14]:
# NLTK has many text data sets
# let's use a corpus called "brown"

brown_words = brown.words()
print(brown_words[:10])
print(len(brown_words))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
1161192


In [15]:
# get sentences

brown_sents = brown.sents()
print(brown_sents[:5])
print(len(brown_sents))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.'], ['``', 'Only', 'a', 'relative', 'handful', 'of', 'such', 'rep

In [16]:
# the wikipedia package allows you to directly import Wikipedia pages

ph = wikipedia.page("Public Health")
ph.title

'Public health'

In [17]:
# summary at the top of the wikipedia page

ph.summary

'Public health is "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals". Analyzing the determinants of health of a population and the threats it faces is the basis for public health. The public can be as small as a handful of people or as large as a village or an entire city; in the case of a pandemic it may encompass several continents. The concept of health takes into account physical, psychological, and social well-being.Public health is an interdisciplinary field. For example, epidemiology, biostatistics, social sciences and management of health services are all relevant. Other important sub-fields include environmental health, community health, behavioral health, health economics, public policy, mental health, health education, health politics, occupational safety, disability, oral health, gender issues in health, and sexual and 

In [18]:
# URL of the wikipedia page

ph.url

'https://en.wikipedia.org/wiki/Public_health'

In [19]:
# let's get all text from the entire page and use it in this course

print(ph.content)

Public health is "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals". Analyzing the determinants of health of a population and the threats it faces is the basis for public health. The public can be as small as a handful of people or as large as a village or an entire city; in the case of a pandemic it may encompass several continents. The concept of health takes into account physical, psychological, and social well-being.Public health is an interdisciplinary field. For example, epidemiology, biostatistics, social sciences and management of health services are all relevant. Other important sub-fields include environmental health, community health, behavioral health, health economics, public policy, mental health, health education, health politics, occupational safety, disability, oral health, gender issues in health, and sexual and r

# Punctuation Removal

In [20]:
# the string module has a list of ASCII punctuation characters

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [21]:
# remove punctuation

text = ph.content.lower()
filtered_text = text.translate(str.maketrans('', '', string.punctuation))
print(filtered_text)

public health is the science and art of preventing disease prolonging life and promoting health through the organized efforts and informed choices of society organizations public and private communities and individuals analyzing the determinants of health of a population and the threats it faces is the basis for public health the public can be as small as a handful of people or as large as a village or an entire city in the case of a pandemic it may encompass several continents the concept of health takes into account physical psychological and social wellbeingpublic health is an interdisciplinary field for example epidemiology biostatistics social sciences and management of health services are all relevant other important subfields include environmental health community health behavioral health health economics public policy mental health health education health politics occupational safety disability oral health gender issues in health and sexual and reproductive health public health
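The `translate` call above deletes every character listed in `string.punctuation`. As a minimal sketch on a made-up sentence (the `sample` string is an illustration, not text from the Wikipedia page), the snippet below shows two side effects worth knowing: hyphenated words are fused together, and non-ASCII marks such as the en dash are left in place because `string.punctuation` covers ASCII characters only.

```python
import string

# hypothetical sample sentence, used only for illustration
sample = "Well-being matters (1346–53), doesn't it?"

# a translation table that maps every ASCII punctuation character to None
table = str.maketrans('', '', string.punctuation)
cleaned = sample.lower().translate(table)

print(cleaned)  # wellbeing matters 1346–53 doesnt it
```

If fused words like `wellbeing` are a problem for your analysis, one option is to replace punctuation with a space instead of deleting it.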

# Tokenization
* Breaking text up into smaller units (tokens), such as words or sentences

In [22]:
# tokenize into words

tokens = nltk.word_tokenize(filtered_text)
tokens[:10]

['public',
 'health',
 'is',
 'the',
 'science',
 'and',
 'art',
 'of',
 'preventing',
 'disease']
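To get a feel for what a word tokenizer does, here is a rough sketch that uses a single regular expression. This is not how `nltk.word_tokenize` works internally (NLTK uses a more elaborate rule set); the function name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens_demo = simple_tokenize("Public health is an interdisciplinary field.")
print(tokens_demo)
```

A simple pattern like this splits contractions and abbreviations in unexpected places, which is one reason to prefer a dedicated tokenizer for real text.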

In [23]:
# build a vocabulary in lowercase

vocab = sorted(set(tokens))
len(vocab)

2215

In [24]:
# numbers are at the beginning

vocab[:10]

['0', '1', '11', '114', '123', '1247', '1346–53', '15', '153', '154']

In [25]:
# more meaningful words 

vocab[2000:2010]

['theory',
 'therapy',
 'there',
 'therefore',
 'these',
 'they',
 'third',
 'this',
 'thomas',
 'those']

In [26]:
# tokenize into sentences
# you can see that it looks for sentence-ending punctuation such as periods
# use the text before punctuation removal

sents = nltk.sent_tokenize(text)
sents[:10]

['public health is "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals".',
 'analyzing the determinants of health of a population and the threats it faces is the basis for public health.',
 'the public can be as small as a handful of people or as large as a village or an entire city; in the case of a pandemic it may encompass several continents.',
 'the concept of health takes into account physical, psychological, and social well-being.public health is an interdisciplinary field.',
 'for example, epidemiology, biostatistics, social sciences and management of health services are all relevant.',
 'other important sub-fields include environmental health, community health, behavioral health, health economics, public policy, mental health, health education, health politics, occupational safety, disability, oral health, gender issues in he

# Stopword Removal
* Remove very common words (e.g., "the", "of", "and") that carry little meaning on their own

In [27]:
# NLTK has a list of stopwords

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [28]:
# remove stopwords from the tokenized text
# build the stopword set once; set membership tests are fast

stop_set = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token not in stop_set]
print(filtered_tokens[:100])

['public', 'health', 'science', 'art', 'preventing', 'disease', 'prolonging', 'life', 'promoting', 'health', 'organized', 'efforts', 'informed', 'choices', 'society', 'organizations', 'public', 'private', 'communities', 'individuals', 'analyzing', 'determinants', 'health', 'population', 'threats', 'faces', 'basis', 'public', 'health', 'public', 'small', 'handful', 'people', 'large', 'village', 'entire', 'city', 'case', 'pandemic', 'may', 'encompass', 'several', 'continents', 'concept', 'health', 'takes', 'account', 'physical', 'psychological', 'social', 'wellbeingpublic', 'health', 'interdisciplinary', 'field', 'example', 'epidemiology', 'biostatistics', 'social', 'sciences', 'management', 'health', 'services', 'relevant', 'important', 'subfields', 'include', 'environmental', 'health', 'community', 'health', 'behavioral', 'health', 'health', 'economics', 'public', 'policy', 'mental', 'health', 'health', 'education', 'health', 'politics', 'occupational', 'safety', 'disability', 'oral', 

# Regular Expressions
* Widely used for word pattern matching
* Advanced regular expressions can be very powerful
* Only basics are covered here
* See Table 3.3 in the NLTK Book

In [29]:
# use regular expressions to find words that end in "ed"
# $ means the end of a string

past_tense = [w for w in vocab if re.search('ed$', w)]
past_tense[:10]

['accelerated',
 'accounted',
 'achieved',
 'adapted',
 'added',
 'adopted',
 'advanced',
 'advised',
 'advocated',
 'affected']

In [30]:
# find words that start with a number
# ^ means the beginning of a string
# [0-9] means any single digit
# + means one or more of previous item

numbers = [w for w in vocab if re.search('^[0-9]+', w)]
numbers[:10]

['0', '1', '11', '114', '123', '1247', '1346–53', '15', '153', '154']

In [31]:
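# find tokens joined by an en dash (–), which are mostly year ranges
# ASCII hyphens (-) were already stripped during punctuation removal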

hyphenated = [w for w in vocab if re.search('^[a-z0-9]+–[a-z0-9]+$', w)]
hyphenated[:10]

['1346–53',
 '1620–1674',
 '1749–1823',
 '1753–1846',
 '1793–1859',
 '1819–1891',
 '1822–1895',
 '1843–1910',
 '1848–1869',
 '1856–1941']
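The patterns in this section can be tried out on a small, made-up word list (`demo_words` is hypothetical). The sketch also shows a character class that accepts either an ASCII hyphen or an en dash as the joining character, which may be useful when the text has not had its hyphens removed.

```python
import re

# hypothetical word list for illustration
demo_words = ['advised', 'education', '1346–53', 'well-being', '15', 'red']

# $ anchors the match at the end of the string
ends_in_ed = [w for w in demo_words if re.search(r'ed$', w)]

# ^ anchors the match at the beginning of the string
starts_with_digit = [w for w in demo_words if re.search(r'^[0-9]+', w)]

# [-–] accepts either an ASCII hyphen or an en dash between the two parts
joined = [w for w in demo_words if re.search(r'^[a-z0-9]+[-–][a-z0-9]+$', w)]

print(ends_in_ed)
print(starts_with_digit)
print(joined)
```

Note that `ed$` also matches `red`, which is not a past-tense verb; spelling patterns alone are only a rough proxy for grammatical categories.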

# Stemming
* Variations of the same word need to be normalized
    * -ed, -ing, etc.
* Stemming simply truncates for the most part, so the stem may not be an actual word
* Lemmatization considers the word morphology and the resulting lemma is an actual word
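To see why a stem may not be an actual word, consider the toy suffix stripper below. It is a deliberately crude illustration and not the Porter or Lancaster algorithm; the function name `toy_stem` and its suffix list are assumptions made for this sketch.

```python
def toy_stem(word, suffixes=('ing', 'ed', 'es', 's')):
    """Strip the first matching suffix; a crude illustration, not a real stemmer."""
    for suffix in suffixes:
        # keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['promoting', 'organized', 'diseases', 'this']:
    print(w, '->', toy_stem(w))
```

Blind truncation turns `this` into `thi` and `organized` into `organiz`; a lemmatizer would instead look up the dictionary form of each word.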

In [32]:
# NLTK provides several stemmers
# that can normalize variations of the same word

porter = nltk.PorterStemmer()
porter_stems = [porter.stem(word) for word in vocab]
print(vocab[2000:2010])
print("\n")
print(porter_stems[2000:2010])

['theory', 'therapy', 'there', 'therefore', 'these', 'they', 'third', 'this', 'thomas', 'those']


['theori', 'therapi', 'there', 'therefor', 'these', 'they', 'third', 'thi', 'thoma', 'those']


In [33]:
# different stemmer gives slightly different results

lancaster = nltk.LancasterStemmer()
lancaster_stems = [lancaster.stem(word) for word in vocab]
print(vocab[2000:2010])
print("\n")
print(porter_stems[2000:2010])
print("\n")
print(lancaster_stems[2000:2010])

['theory', 'therapy', 'there', 'therefore', 'these', 'they', 'third', 'this', 'thomas', 'those']


['theori', 'therapi', 'there', 'therefor', 'these', 'they', 'third', 'thi', 'thoma', 'those']


['the', 'therapy', 'ther', 'theref', 'thes', 'they', 'third', 'thi', 'thoma', 'thos']


# Term Frequencies

In [34]:
# calculate how many times each token appears
# let's use the tokens before stopword removal
# the most common terms are usually function words that carry little meaning
# reuse the Porter stemmer created earlier instead of building a new one per token

tokens_stemmed = [porter.stem(token) for token in tokens]
tf = nltk.FreqDist(tokens_stemmed)
tf.most_common()[:20]

[('the', 487),
 ('and', 395),
 ('of', 392),
 ('health', 369),
 ('in', 215),
 ('public', 196),
 ('to', 165),
 ('a', 134),
 ('as', 102),
 ('for', 84),
 ('is', 70),
 ('diseas', 62),
 ('by', 58),
 ('that', 58),
 ('on', 53),
 ('develop', 49),
 ('prevent', 47),
 ('with', 42),
 ('countri', 40),
 ('it', 39)]

In [35]:
# to give more weight to words that are not too common,
# idf (inverse document frequency) can be calculated
# here idf = log(number of documents / number of documents containing the token)
# the simplest form of tf-idf is the product of tf and idf
# let's break the text into blocks of 1000 tokens and treat each block as a document

tf_idf = {}
doc_size = 1000
ntokens = len(tokens_stemmed)
ndocs = math.ceil(ntokens/doc_size)

for token in tf:
    count = 0
    doc_idx = 0
    
    while doc_idx < ndocs:
        if doc_idx == ndocs-1:
            doc = tokens_stemmed[doc_idx*doc_size:]
        else:
            doc = tokens_stemmed[doc_idx*doc_size:((doc_idx+1)*doc_size)]
            
        if token in doc:
            count += 1
            
        doc_idx += 1
    
    tf_idf[token] = tf[token] * math.log(ndocs / count)

In [36]:
# now the words with the largest tf-idf values are more meaningful

sorted_tf_idf = sorted(tf_idf.items(), 
                       key=operator.itemgetter(1), 
                       reverse=True)  # this is how you sort a dict
sorted_tf_idf[:20]

[('degre', 27.073393141972936),
 ('school', 23.070858062030304),
 ('aid', 23.070858062030304),
 ('term', 15.380572041353537),
 ('scienc', 13.78581367567759),
 ('act', 13.536696570986468),
 ('they', 13.536696570986468),
 ('doctor', 13.183347464017316),
 ('differ', 12.084735175349207),
 ('master', 12.032619174210193),
 ('program', 11.755733298042381),
 ('water', 11.353023027028602),
 ('train', 10.986122886681098),
 ('diabet', 10.986122886681098),
 ('town', 10.986122886681098),
 ('academ', 10.986122886681098),
 ('initi', 10.580159968238142),
 ('state', 10.542092810812274),
 ('follow', 10.528541777433919),
 ('countri', 10.052577131236246)]
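The same calculation can be written more compactly with `collections.Counter`. The sketch below uses three made-up mini-documents (`demo_docs` is hypothetical) and the same simple definition as above: total term count multiplied by the log of the number of documents over the number of documents containing the term.

```python
import math
from collections import Counter

# three hypothetical mini-documents, already tokenized
demo_docs = [
    ['public', 'health', 'school', 'school'],
    ['public', 'health', 'water'],
    ['public', 'health', 'policy'],
]

# tf: total count of each token across all documents
tf_demo = Counter(token for doc in demo_docs for token in doc)

# df: number of documents that contain each token at least once
df_demo = Counter(token for doc in demo_docs for token in set(doc))

n_docs = len(demo_docs)
tf_idf_demo = {t: tf_demo[t] * math.log(n_docs / df_demo[t]) for t in tf_demo}

for token, score in sorted(tf_idf_demo.items(), key=lambda kv: kv[1], reverse=True):
    print(token, round(score, 3))
```

Tokens that appear in every document, such as `public` and `health` here, get a score of zero because log(1) = 0, no matter how frequent they are.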

In [37]:
# another option would be to do term frequencies after stopword removal

filtered_tokens_stemmed = [porter.stem(token) for token in filtered_tokens]
tf_filtered = nltk.FreqDist(filtered_tokens_stemmed)
tf_filtered.most_common()[:20]

[('health', 369),
 ('public', 196),
 ('diseas', 62),
 ('develop', 49),
 ('prevent', 47),
 ('countri', 40),
 ('popul', 36),
 ('care', 33),
 ('includ', 30),
 ('state', 26),
 ('organ', 22),
 ('medicin', 22),
 ('world', 21),
 ('school', 21),
 ('unit', 21),
 ('aid', 21),
 ('medic', 20),
 ('program', 20),
 ('social', 19),
 ('vaccin', 19)]

# Part-of-speech (POS) Tagging
* The process of classifying words into POS categories.
* Parts of speech are also called word classes or lexical categories.
* NLTK makes this process very easy.

In [38]:
# use NLTK to tag the wikipedia page after punctuation and stopword removal

tags = nltk.pos_tag(filtered_tokens)
tags[:20]

[('public', 'JJ'),
 ('health', 'NN'),
 ('science', 'NN'),
 ('art', 'NN'),
 ('preventing', 'VBG'),
 ('disease', 'NN'),
 ('prolonging', 'VBG'),
 ('life', 'NN'),
 ('promoting', 'VBG'),
 ('health', 'NN'),
 ('organized', 'VBN'),
 ('efforts', 'NNS'),
 ('informed', 'VBD'),
 ('choices', 'NNS'),
 ('society', 'NN'),
 ('organizations', 'NNS'),
 ('public', 'JJ'),
 ('private', 'JJ'),
 ('communities', 'NNS'),
 ('individuals', 'NNS')]

In [39]:
# look up tag acronyms like this

nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [40]:
# NLTK's tagging is quite smart
# it can tell the difference between 
# the first and second "patient" below

nltk.pos_tag(nltk.word_tokenize('The patient was patient.'))

[('The', 'DT'),
 ('patient', 'NN'),
 ('was', 'VBD'),
 ('patient', 'JJ'),
 ('.', '.')]

In [41]:
# obtain the frequency distribution of the tags in the wikipedia page

tf_tag = nltk.FreqDist(tag for (word, tag) in tags)
tf_tag.most_common(10)

[('NN', 1735),
 ('JJ', 1106),
 ('NNS', 858),
 ('VBG', 280),
 ('VBP', 219),
 ('VBD', 210),
 ('RB', 191),
 ('CD', 161),
 ('VBN', 147),
 ('VBZ', 67)]

In [42]:
# most common nouns

tf_word_tag = nltk.FreqDist(tags)
[wt[0] for (wt, _) in tf_word_tag.most_common() if wt[1] == 'NN'][:20]

['health',
 'care',
 'disease',
 'population',
 'world',
 'aid',
 'medicine',
 'research',
 'system',
 'water',
 'education',
 'development',
 'prevention',
 'century',
 'government',
 'example',
 'policy',
 'science',
 'epidemiology',
 'organization']

In [43]:
# most common verbs
# use a regular expression to capture all verb variations

[wt[0] for (wt, _) in tf_word_tag.most_common() if re.search('^VB.*$', wt[1])][:20]

['developing',
 'including',
 'developed',
 'led',
 'include',
 'began',
 'considered',
 'assessing',
 'smoking',
 'promoting',
 'became',
 'improving',
 'wellbeing',
 'related',
 'reducing',
 'made',
 'seen',
 'based',
 'training',
 'preventing']

In [44]:
# it's also possible to analyze which pairs of words are common
# word pairs are also known as bigrams

tf_word_tag_pair = nltk.FreqDist(nltk.bigrams(tags))
tf_word_tag_pair.most_common(20)

[((('public', 'JJ'), ('health', 'NN')), 167),
 ((('health', 'NN'), ('care', 'NN')), 21),
 ((('developing', 'VBG'), ('countries', 'NNS')), 19),
 ((('united', 'JJ'), ('states', 'NNS')), 14),
 ((('health', 'NN'), ('aid', 'NN')), 14),
 ((('population', 'NN'), ('health', 'NN')), 12),
 ((('health', 'NN'), ('initiatives', 'NNS')), 9),
 ((('infectious', 'JJ'), ('diseases', 'NNS')), 9),
 ((('schools', 'NNS'), ('public', 'JJ')), 9),
 ((('preventive', 'JJ'), ('medicine', 'NN')), 8),
 ((('health', 'NN'), ('health', 'NN')), 7),
 ((('world', 'NN'), ('health', 'NN')), 7),
 ((('health', 'NN'), ('organization', 'NN')), 7),
 ((('international', 'JJ'), ('health', 'NN')), 6),
 ((('determinants', 'NNS'), ('health', 'NN')), 5),
 ((('basis', 'NN'), ('public', 'JJ')), 5),
 ((('health', 'NN'), ('services', 'NNS')), 5),
 ((('health', 'NN'), ('education', 'NN')), 5),
 ((('health', 'NN'), ('issues', 'NNS')), 5),
 ((('health', 'NN'), ('programs', 'NNS')), 5)]
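Conceptually, a bigram is just a token paired with its successor. As a minimal sketch with plain Python on a made-up token list (`demo_tokens` is hypothetical), the pairs can be built with `zip` and counted with `Counter`:

```python
from collections import Counter

demo_tokens = ['public', 'health', 'care', 'public', 'health', 'policy']

# pair each token with the one that follows it: (t0, t1), (t1, t2), ...
demo_bigrams = list(zip(demo_tokens, demo_tokens[1:]))
bigram_counts = Counter(demo_bigrams)

print(bigram_counts.most_common(1))
```

A list of n tokens always yields n - 1 bigrams, so the six tokens above produce five pairs.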

# Creating an Automatic Tagger

In [45]:
# the default tagger assigns the same tag to every token

default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(filtered_tokens[:10])

[('public', 'NN'),
 ('health', 'NN'),
 ('science', 'NN'),
 ('art', 'NN'),
 ('preventing', 'NN'),
 ('disease', 'NN'),
 ('prolonging', 'NN'),
 ('life', 'NN'),
 ('promoting', 'NN'),
 ('health', 'NN')]

In [46]:
# the regular expression tagger defines patterns first
# you may want to use a more comprehensive list of patterns

patterns = [(r'.*ing$', 'VBG'),               # gerunds
            (r'.*ed$', 'VBD'),                # simple past
            (r'.*es$', 'VBZ'),                # 3rd singular present
            (r'.*ould$', 'MD'),               # modals
            (r'.*\'s$', 'NN$'),               # possessive nouns
            (r'.*s$', 'NNS'),                 # plural nouns
            (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
            (r'.*', 'NN')                     # nouns (default)
           ]

In [47]:
# now create a tagger using the regular expressions above

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(filtered_tokens[:10])

[('public', 'NN'),
 ('health', 'NN'),
 ('science', 'NN'),
 ('art', 'NN'),
 ('preventing', 'VBG'),
 ('disease', 'NN'),
 ('prolonging', 'VBG'),
 ('life', 'NN'),
 ('promoting', 'VBG'),
 ('health', 'NN')]

In [48]:
# unigram tagging looks at each token individually
# you can train your own unigram tagger 
# using a tagged corpus in NLTK

brown_tagged_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(filtered_tokens[:10])

[('public', 'JJ'),
 ('health', 'NN'),
 ('science', 'NN'),
 ('art', 'NN'),
 ('preventing', None),
 ('disease', None),
 ('prolonging', None),
 ('life', 'NN'),
 ('promoting', None),
 ('health', 'NN')]
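The `None` tags above appear for tokens that the tagger never saw during training. The idea behind unigram tagging can be sketched in a few lines: remember the most frequent tag for each training word and return `None` for everything else. The training pairs in `demo_tagged` are made up, and this is only an illustration of the idea, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# hypothetical tagged training data: (word, tag) pairs
demo_tagged = [('public', 'JJ'), ('health', 'NN'), ('public', 'JJ'),
               ('public', 'NN'), ('life', 'NN')]

# count how often each tag was seen for each word
tag_counts = defaultdict(Counter)
for word, tag in demo_tagged:
    tag_counts[word][tag] += 1

def unigram_tag(tokens):
    """Assign each token its most frequent training tag, or None if unseen."""
    return [(t, tag_counts[t].most_common(1)[0][0] if t in tag_counts else None)
            for t in tokens]

print(unigram_tag(['public', 'health', 'preventing']))
```

Here `public` is tagged `JJ` because that tag was seen twice in training versus once for `NN`, and `preventing` gets `None` because it never appeared in the training data.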

In [49]:
# in general, N-gram tagging looks at the current token
# and the tags of the N-1 previous tokens
# 1-gram: unigram, 2-gram: bigram, 3-gram: trigram
# N-gram taggers need to work with sentences
# tokenize each sentence first

sents_tokenized = []
for sent in sents:
    sents_tokenized.append(nltk.word_tokenize(sent))
sents_tokenized[:2]

[['public',
  'health',
  'is',
  '``',
  'the',
  'science',
  'and',
  'art',
  'of',
  'preventing',
  'disease',
  ',',
  'prolonging',
  'life',
  'and',
  'promoting',
  'health',
  'through',
  'the',
  'organized',
  'efforts',
  'and',
  'informed',
  'choices',
  'of',
  'society',
  ',',
  'organizations',
  ',',
  'public',
  'and',
  'private',
  ',',
  'communities',
  'and',
  'individuals',
  "''",
  '.'],
 ['analyzing',
  'the',
  'determinants',
  'of',
  'health',
  'of',
  'a',
  'population',
  'and',
  'the',
  'threats',
  'it',
  'faces',
  'is',
  'the',
  'basis',
  'for',
  'public',
  'health',
  '.']]

In [50]:
# create and apply a bigram tagger
# why do you think this tagger is unable to tag anything?

bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.tag(sents_tokenized[0])

[('public', None),
 ('health', None),
 ('is', None),
 ('``', None),
 ('the', None),
 ('science', None),
 ('and', None),
 ('art', None),
 ('of', None),
 ('preventing', None),
 ('disease', None),
 (',', None),
 ('prolonging', None),
 ('life', None),
 ('and', None),
 ('promoting', None),
 ('health', None),
 ('through', None),
 ('the', None),
 ('organized', None),
 ('efforts', None),
 ('and', None),
 ('informed', None),
 ('choices', None),
 ('of', None),
 ('society', None),
 (',', None),
 ('organizations', None),
 (',', None),
 ('public', None),
 ('and', None),
 ('private', None),
 (',', None),
 ('communities', None),
 ('and', None),
 ('individuals', None),
 ("''", None),
 ('.', None)]

In [51]:
# apply the bigram tagger to the brown corpus

brown_sents = brown.sents(categories='news')
bigram_tagger.tag(brown_sents[0])

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS'),
 ('any', 'DTI'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.')]

In [52]:
# unless a very large, relevant tagged corpus is used for training
# it is difficult to cover all N-gram sequences
# hence, it's useful to have a simpler tagger as a Plan B

combined_tagger = nltk.UnigramTagger(brown_tagged_sents, 
                                     backoff=regexp_tagger)
combined_tagger.tag(filtered_tokens[:10])

[('public', 'JJ'),
 ('health', 'NN'),
 ('science', 'NN'),
 ('art', 'NN'),
 ('preventing', 'VBG'),
 ('disease', 'NN'),
 ('prolonging', 'VBG'),
 ('life', 'NN'),
 ('promoting', 'VBG'),
 ('health', 'NN')]
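The backoff idea can be sketched without NLTK: try the more specific tagger first and consult the simpler one only when the first has no answer. In the sketch below, `demo_lookup` is a hypothetical stand-in for a trained unigram model and `demo_patterns` is a shortened, made-up pattern list.

```python
import re

# hypothetical lookup table standing in for a trained unigram model
demo_lookup = {'public': 'JJ', 'health': 'NN', 'life': 'NN'}

# fallback patterns, tried in order; the last one matches everything
demo_patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*', 'NN')]

def regexp_tag(token):
    """Return the tag of the first pattern that matches the token."""
    for pattern, tag in demo_patterns:
        if re.match(pattern, token):
            return tag

def tag_with_backoff(tokens):
    """Use the lookup table first and fall back to the patterns for unseen tokens."""
    return [(t, demo_lookup.get(t) or regexp_tag(t)) for t in tokens]

print(tag_with_backoff(['public', 'health', 'preventing', 'disease']))
```

Because the final pattern matches any token, the combined tagger never returns `None`; the order of the patterns matters, since the first match wins.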

# Sentiment Analysis
* Use of NLP to systematically identify, extract, quantify, and study affective states and subjective information in text

In [53]:
# neg, neu, and pos are proportions that sum to one
# compound is a normalized sum of word-level valence scores, ranging from -1 to 1
# (it is not the average of neg, neu, and pos)

sia = SentimentIntensityAnalyzer()
sia.polarity_scores("excellent")

{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.5719}

In [54]:
sia.polarity_scores("terrible")

{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.4767}
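The two compound values above are consistent with the normalization that VADER applies to the summed valence `x` of a text: `x / sqrt(x^2 + alpha)` with `alpha = 15`. The sketch below assumes lexicon ratings of 2.7 for "excellent" and -2.1 for "terrible"; these ratings are assumptions chosen because they reproduce the outputs shown above, so check the lexicon shipped with your NLTK installation before relying on them.

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# assumed lexicon ratings for the two single-word inputs
print(round(normalize(2.7), 4))   # "excellent"
print(round(normalize(-2.1), 4))  # "terrible"
```

For longer texts, VADER also adjusts the word-level scores for negation, intensifiers, and punctuation before normalizing, so the compound score is generally not a simple function of individual word ratings.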

In [55]:
for sent in sents[:5]:
    print(sent)
    print(sia.polarity_scores(sent))
    print("\n")

public health is "the science and art of preventing disease, prolonging life and promoting health through the organized efforts and informed choices of society, organizations, public and private, communities and individuals".
{'neg': 0.034, 'neu': 0.89, 'pos': 0.077, 'compound': 0.34}


analyzing the determinants of health of a population and the threats it faces is the basis for public health.
{'neg': 0.141, 'neu': 0.859, 'pos': 0.0, 'compound': -0.4215}


the public can be as small as a handful of people or as large as a village or an entire city; in the case of a pandemic it may encompass several continents.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


the concept of health takes into account physical, psychological, and social well-being.public health is an interdisciplinary field.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


for example, epidemiology, biostatistics, social sciences and management of health services are all relevant.
{'neg': 0.0, 'neu': 1.0, '

# Bag of Words
* For subsequent steps in NLP such as machine learning, text needs to be represented as vectors
* Bag of words is one way to do this
* It counts how many times each word appears, ignoring word order
* Each dimension of the vector corresponds to one word in the vocabulary

![bag of words example](https://user.oc-static.com/upload/2020/10/23/16034397439042_surfin%20bird%20bow.png)

Image: https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980811-apply-a-simple-bag-of-words-approach

In [56]:
# first, let's define a function that counts how many times each word appears
# given a vocabulary; words outside the vocabulary are ignored

def calculateBOW(vocab, text):
    bow = dict.fromkeys(vocab, 0)
    for word in text:
        if word in bow:
            bow[word] += 1
    return bow

In [57]:
# let's apply bag of words to the first sentence in the public health wikipedia page

print(calculateBOW(vocab, nltk.word_tokenize(sents[0])))
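A dictionary of counts becomes a vector once the vocabulary order is fixed. The following sketch uses a made-up five-word vocabulary (`demo_vocab`) and sentence (`demo_sentence`) to show the idea; with a real vocabulary the vector would have one entry per vocabulary word and would be mostly zeros.

```python
from collections import Counter

# hypothetical vocabulary and tokenized sentence
demo_vocab = sorted({'art', 'disease', 'health', 'public', 'science'})
demo_sentence = ['public', 'health', 'is', 'public', 'science']

counts = Counter(demo_sentence)

# one dimension per vocabulary word, in vocabulary order
# words outside the vocabulary (here "is") do not get a dimension
bow_vector = [counts[word] for word in demo_vocab]

print(demo_vocab)
print(bow_vector)
```

Two sentences with the same words in a different order produce the same vector, which is the sense in which the representation is a "bag".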



# Word Embeddings
* Another way to represent text as a vector is to use a word embedding which maps a word to a vector
* There are pre-trained word embeddings that you can just take and use
* You can develop your own word embedding too

![Word Embedding Visualization](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*SYiW1MUZul1NvL1kc1RxwQ.png)

Image: https://towardsdatascience.com/a-guide-to-word-embeddings-8a23817ab60f

In [58]:
# gensim has several pre-trained word embedding models

gensim.downloader.info()['models']

{'fasttext-wiki-news-subwords-300': {'num_records': 999999,
  'file_size': 1005007116,
  'base_dataset': 'Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)',
  'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py',
  'license': 'https://creativecommons.org/licenses/by-sa/3.0/',
  'parameters': {'dimension': 300},
  'description': '1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).',
  'read_more': ['https://fasttext.cc/docs/en/english-vectors.html',
   'https://arxiv.org/abs/1712.09405',
   'https://arxiv.org/abs/1607.01759'],
  'checksum': 'de2bb3a20c46ce65c9c131e1ad9a77af',
  'file_name': 'fasttext-wiki-news-subwords-300.gz',
  'parts': 1},
 'conceptnet-numberbatch-17-06-300': {'num_records': 1917247,
  'file_size': 1225497562,
  'base_dataset': 'ConceptNet, word2vec, GloVe, and OpenSubtitles 2016',
  'reader_code': 'https:/

In [59]:
# let's try a model pre-trained on wikipedia and gigaword data

gensim.downloader.info()['models']['glove-wiki-gigaword-50']

{'num_records': 400000,
 'file_size': 69182535,
 'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py',
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'parameters': {'dimension': 50},
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'parts': 1}

In [60]:
# it maps a given word to a 50-dimensional vector

glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')
print(glove_vectors['health'])
print(len(glove_vectors['health']))

[ 0.31161   0.33903   0.033922 -0.30914  -0.43078   0.3417   -0.6864
 -0.83817   1.478    -0.5249    0.13326  -0.069083  0.40058  -0.35225
  0.20453   0.16683  -0.56978  -0.1359    1.1439    0.15662  -0.23462
  0.60111  -0.13868  -0.3787   -0.03634  -1.6429   -0.10716  -0.73417
 -0.62077   0.88903   3.3969    0.88545  -0.20321  -1.1283   -0.36811
  0.088206  0.055528  0.39659   1.6077    0.031832 -0.91684   0.07666
  0.67848   0.64672   0.74549   0.56715  -1.0098    0.81053   0.85948
  1.0404  ]
50


In [61]:
# get similar words

glove_vectors.most_similar('health')

[('care', 0.8932641744613647),
 ('medical', 0.843585193157196),
 ('education', 0.7962790131568909),
 ('welfare', 0.7925363183021545),
 ('prevention', 0.7868608832359314),
 ('public', 0.7673367261886597),
 ('environmental', 0.7635215520858765),
 ('aids', 0.7625102400779724),
 ('healthcare', 0.7602637410163879),
 ('poor', 0.755757749080658)]
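
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal sketch of that computation on made-up 3-dimensional vectors; `toy_vectors`, its values, and the helper names are illustrative assumptions, not real GloVe entries or gensim internals.

```python
import math

def cosine_similarity(u, v):
    # cosine of the angle between two vectors: dot product over product of norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical toy embeddings (not taken from any pre-trained model)
toy_vectors = {
    'health': [0.9, 0.1, 0.3],
    'medical': [0.8, 0.2, 0.4],
    'banana': [0.1, 0.9, 0.0],
}

def toy_most_similar(word, vectors, topn=2):
    # score every other word against the query and sort by similarity
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = toy_most_similar('health', toy_vectors)
print(ranked)  # 'medical' ranks above 'banana'
```

Words whose vectors point in nearly the same direction score close to 1, which is why related terms such as "care" and "medical" appear at the top of the real output.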

# Gender Classification

In [62]:
# names can be classified into male and female names
# what features should be used?
# let's start with the last letter of the name

def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('John')

{'last_letter': 'n'}

In [63]:
# NLTK contains lists of male and female names
# load and shuffle them

labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
print(labeled_names[:10])
print(len(labeled_names))

[('Lizbeth', 'female'), ('Carlena', 'female'), ('Pierrette', 'female'), ('Rad', 'male'), ('Henry', 'male'), ('Doralynne', 'female'), ('Scarlet', 'female'), ('Benn', 'male'), ('Hayden', 'male'), ('Frankie', 'male')]
7944
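
Because `random.shuffle` reorders the list differently on every run, the train/test split and the accuracy reported later will vary between runs. Here is a sketch of one way to make the shuffle repeatable, assuming a fixed seed is acceptable for a demo; the seed value, the toy name list, and the `shuffled_copy` helper are illustrative.

```python
import random

# hypothetical miniature version of labeled_names
toy_names = [('Ann', 'female'), ('Bob', 'male'), ('Cara', 'female'),
             ('Dan', 'male'), ('Eve', 'female'), ('Finn', 'male')]

def shuffled_copy(items, seed):
    # a private generator keeps the global random state untouched
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy

first = shuffled_copy(toy_names, seed=42)
second = shuffled_copy(toy_names, seed=42)
print(first == second)  # True: same seed, same order
```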


In [64]:
# it's necessary to partition the data into
# training and test data
# use the training data to train a naive Bayes classifier
# and evaluate the classifier on the test data

gender_featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = gender_featuresets[500:], gender_featuresets[:500]
classifier_gender = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier_gender, test_set)

0.792

In [65]:
# you can apply the classifier to a particular name as well
# it correctly classifies my name!

classifier_gender.classify(gender_features('Joon'))

'male'

In [66]:
# you can also examine which feature values were most useful
# likelihood ratios are displayed

classifier_gender.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     33.3 : 1.0
             last_letter = 'k'              male : female =     30.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     12.5 : 1.0
             last_letter = 'v'              male : female =     10.5 : 1.0
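
The ratios above compare how likely a feature value is under one label versus the other. The sketch below shows the idea on a handful of made-up names with add-one smoothing. The smoothing is an assumption chosen for simplicity; NLTK's own estimator may smooth differently, so these numbers are not expected to match the output above.

```python
from collections import Counter, defaultdict

# hypothetical toy training data
toy_data = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
            ('Karen', 'female'), ('Mark', 'male'), ('Frank', 'male'),
            ('John', 'male'), ('Luca', 'male')]

# count labels, and last letters per label
label_counts = Counter(label for _, label in toy_data)
letter_counts = defaultdict(Counter)
for name, label in toy_data:
    letter_counts[label][name[-1].lower()] += 1

def likelihood(letter, label, alphabet_size=26):
    # P(last_letter = letter | label) with add-one smoothing
    return (letter_counts[label][letter] + 1) / (label_counts[label] + alphabet_size)

def likelihood_ratio(letter, label_a, label_b):
    # how many times more likely the letter is under label_a than under label_b
    return likelihood(letter, label_a) / likelihood(letter, label_b)

ratio_a = likelihood_ratio('a', 'female', 'male')
ratio_k = likelihood_ratio('k', 'male', 'female')
print(round(ratio_a, 2), round(ratio_k, 2))
```

A ratio far from 1 in either direction marks a feature value that separates the two labels well, which is what "most informative" means here.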


# Document Classification

In [67]:
# the Movie Reviews corpus contains positive and negative movie reviews
# we can train ML models that classify movie reviews into positive or negative
# let's prepare the data set first

reviews = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(reviews)
print(reviews[0])

(['a', 'few', 'months', 'before', 'the', 'release', 'of', 'star', 'wars', 'episode', '1', ',', 'the', 'phantom', 'menace', ',', '20th', 'century', 'fox', 'decides', 'to', 'release', 'another', 'space', 'film', ',', 'that', 'is', 'a', 'complete', 'rip', 'off', 'of', 'star', 'wars', '.', 'what', 'is', 'the', 'point', 'of', 'this', '?', 'i', 'do', 'not', 'know', ',', 'but', 'i', 'wish', 'it', 'hadn', "'", 't', 'been', 'done', ',', 'considering', 'wing', 'commander', 'is', 'definitely', 'the', 'year', "'", 's', 'worst', 'film', 'so', 'far', '.', 'to', 'attract', 'people', 'to', 'this', 'horrible', 'movie', ',', 'they', 'attached', 'the', 'full', 'trailer', 'for', 'the', 'phantom', 'menace', '.', 'wing', 'commander', 'will', 'draw', 'large', 'crowds', ',', 'because', 'this', 'is', 'the', 'only', 'film', 'where', 'you', 'can', 'find', 'the', 'phantom', 'menace', 'full', 'trailer', 'attached', 'at', 'this', 'time', '.', 'the', 'trailer', 'for', 'the', 'phantom', 'menace', 'was', 'certainly', 

In [68]:
# for features, use the presence/absence
# of the 2000 most frequent words

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# take the 2000 most frequent words, ranked by count
word_features = [w for (w, count) in all_words.most_common(2000)]

def review_features(review):
    review_words = set(review)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in review_words)
    return features
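
As a small illustration of this feature construction, the sketch below uses a made-up three-review corpus; the toy texts, the cutoff of 3 words, and the helper name are assumptions. It ranks words by frequency with `collections.Counter.most_common` (also available on `nltk.FreqDist`, which subclasses `Counter`) and then records the presence or absence of each top word.

```python
from collections import Counter

# hypothetical tokenized reviews
toy_reviews = [
    ['great', 'film', 'great', 'cast'],
    ['dull', 'film', 'weak', 'plot'],
    ['great', 'plot', 'film'],
]

# rank words by how often they occur across the whole toy corpus
word_counts = Counter(w.lower() for review in toy_reviews for w in review)
top_words = [w for w, _ in word_counts.most_common(3)]

def toy_review_features(review, vocabulary):
    # one boolean feature per vocabulary word: is it present in the review?
    review_words = set(review)
    return {'contains({})'.format(w): (w in review_words) for w in vocabulary}

toy_features = toy_review_features(toy_reviews[1], top_words)
print(top_words)
print(toy_features)
```

Each review becomes a fixed-length dictionary of booleans no matter how long the review is, which is the input shape the naive Bayes classifier expects.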

In [69]:
# construct the features and train a naive Bayes classifier

review_featuresets = [(review_features(d), c) for (d,c) in reviews]
train_set, test_set = review_featuresets[100:], review_featuresets[:100]
classifier_review = nltk.NaiveBayesClassifier.train(train_set)

In [70]:
# test classification performance

nltk.classify.accuracy(classifier_review, test_set)

0.83

In [71]:
# five most informative features

classifier_review.show_most_informative_features(5) 

Most Informative Features
   contains(outstanding) = True              pos : neg    =     10.7 : 1.0
         contains(mulan) = True              pos : neg    =      8.9 : 1.0
        contains(seagal) = True              neg : pos    =      7.9 : 1.0
         contains(damon) = True              pos : neg    =      7.7 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.4 : 1.0


# POS Tag Classification

In [72]:
# instead of manually creating a POS tagger
# it's possible to train a classifier to learn suffix patterns
# let's extract common suffixes first

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
common_suffixes[:10]

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of']

In [73]:
# define a feature extractor
# that indicates whether the given word ends with 
# one of the common suffixes

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

In [74]:
# train a decision tree on 90% of the tagged words
# and evaluate it on the held-out 10%

tagged_words = brown.tagged_words(categories='news')
tag_featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
size = int(len(tag_featuresets) * 0.1)
train_set, test_set = tag_featuresets[size:], tag_featuresets[:size]
classifier_tag = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier_tag, test_set)

0.6270512182993535

In [75]:
# see how the classifier performs for a specific word

classifier_tag.classify(pos_features('artificial intelligence'))

'NN'
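
Note that `'artificial intelligence'` is two words. The feature extractor only inspects the end of the string, so the phrase receives exactly the same features as `'intelligence'` alone. The sketch below demonstrates this with a short, hypothetical suffix list (`toy_suffixes` stands in for `common_suffixes`).

```python
# hypothetical stand-in for the 100 common suffixes
toy_suffixes = ['e', 's', 'ce', 'ed', 'ing', 'nce']

def toy_pos_features(word, suffixes=toy_suffixes):
    # one boolean per suffix: does the lowercased string end with it?
    word = word.lower()
    return {'endswith({})'.format(s): word.endswith(s) for s in suffixes}

phrase = toy_pos_features('artificial intelligence')
single = toy_pos_features('intelligence')
print(phrase == single)  # True: only the trailing characters matter
```

The predicted tag therefore describes the final word, not the phrase as a whole.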

# Sentence Segmentation with Classification

In [76]:
# sentence segmentation essentially looks for 
# sentence-ending punctuation
# which can be learned using machine learning
# first we need segmented data

sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

In [77]:
# 'tokens' is the flat list of tokens from all sentences

tokens[:10]

['.', 'START', 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will']

In [78]:
# boundaries contains indexes of sentence-boundary tokens

print(boundaries)

{1, 90116, 16389, 40968, 81929, 24587, 16396, 65548, 73741, 8207, 32784, 81931, 90128, 98315, 20, 65557, 57366, 8221, 24611, 36, 32804, 38, 57382, 81957, 98345, 73771, 8236, 41004, 8238, 16430, 49201, 73785, 49210, 16445, 57405, 64, 90176, 66, 32835, 73794, 24649, 16459, 65615, 98383, 82001, 57426, 49235, 8276, 82003, 90195, 32855, 90197, 41050, 8285, 73824, 32868, 49252, 102, 16487, 65640, 98410, 82029, 24688, 49265, 73840, 41075, 90227, 16502, 32887, 65662, 8319, 24704, 57474, 134, 41097, 41099, 49292, 73867, 32911, 65686, 57495, 73878, 98454, 8346, 82074, 8348, 24732, 90267, 32930, 163, 8355, 24740, 41126, 57511, 49320, 73897, 16554, 82088, 98475, 65712, 41137, 57522, 16568, 90304, 98497, 82114, 8390, 199, 24775, 49352, 57545, 16587, 73929, 211, 24788, 65749, 32982, 90324, 41180, 57569, 49378, 73955, 228, 49380, 98537, 65770, 237, 8430, 16622, 33010, 41203, 82162, 73973, 57590, 90359, 8442, 258, 24836, 49413, 98565, 16650, 65803, 41228, 74000, 33041, 82195, 90389, 49436, 57628, 286,
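
To make the bookkeeping concrete, here is the same loop run on two made-up sentences (the toy data is an assumption for illustration). Each boundary is the index, in the flattened token list, of a sentence's final token.

```python
# hypothetical pre-segmented sentences
toy_sents = [['Hello', 'there', '.'], ['How', 'are', 'you', '?']]

toy_tokens = []
toy_boundaries = set()
offset = 0
for sent in toy_sents:
    toy_tokens.extend(sent)         # flatten the sentence into the token list
    offset += len(sent)             # running count of tokens seen so far
    toy_boundaries.add(offset - 1)  # index of this sentence's last token

print(toy_tokens)
print(sorted(toy_boundaries))  # indexes of '.' and '?'
```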

In [79]:
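# for the punctuation token at index i, extract four features:
# whether the next token is capitalized, the previous token (lowercased),
# the punctuation mark itself, and whether the previous token is a single character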

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

In [80]:
# extract features and use them 
# to train and evaluate a decision tree

seg_featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']
size = int(len(seg_featuresets) * 0.1)
train_set, test_set = seg_featuresets[size:], seg_featuresets[:size]
classifier_seg = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier_seg, test_set)

0.9730639730639731
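
A trained classifier like this can then drive a segmenter, along the lines of the example in the NLTK Book (Chapter 6). The sketch below is self-contained, so it substitutes a hand-written rule, `toy_is_boundary`, for the call to `classifier_seg.classify(punct_features(...))`. That rule and the sample tokens are assumptions for illustration only.

```python
def toy_is_boundary(tokens, i):
    # stand-in for the trained classifier: treat the punctuation as a
    # boundary if it is the last token or the next token is capitalized
    if i + 1 >= len(tokens):
        return True
    return tokens[i + 1][0].isupper()

def segment_sentences(words, is_boundary=toy_is_boundary):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # only punctuation tokens are candidates for a sentence boundary
        if word in ('.', '?', '!') and is_boundary(words, i):
            sents.append(words[start:i + 1])
            start = i + 1
    if start < len(words):
        # keep any trailing tokens that lack closing punctuation
        sents.append(words[start:])
    return sents

sample = ['It', 'costs', '3', '.', '5', 'dollars', '.', 'Is', 'that', 'ok', '?']
segmented = segment_sentences(sample)
for sent in segmented:
    print(sent)
```

With the real classifier, `is_boundary` would be something like `lambda words, i: classifier_seg.classify(punct_features(words, i))`, with care taken at the final token, where `punct_features` looks one position ahead.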

# Closing Remarks
* We only covered basic NLP
* More ML can be done with scikit-learn, TensorFlow, PyTorch, etc.
* Large language models are much more advanced, but what we covered in this course is still foundational to them