In [15]:
import random
import re

# NLTK needs its tokenizer, stopwords, WordNet, and POS-tagger data installed (see nltk.download)
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

In [16]:
text = """The field of Natural Language Processing, NLP, involves creating algorithms that allow computers
to process human language. The term processing language can mean a wide range of things such as
simply listening for and recognizing the phrase OK Google, to identifying spam words in emails, to
creating a word cloud based on word counts, to classifying product reviews as positive or negative, to
any of the multitude of tasks that would fall under the umbrella of NLP. Natural Language Processing
itself is a branch of AI, as is Machine Learning. In a complex NLP project, some components may be
purely NLP, others purely machine learning, and others in the general category of AI. All three of these
fields are developing rapidly, and utilizing techniques from related disciples in novel ways.
Artificial
Intelligence
Natural
Language
Processing
Machine
Learning
Figure 1.1: Machine Learning and Natural Language Processing
In a human-to-human dialog, two things are going on: natural language understanding, meaning
that each party understood what the other person said, and natural language generation, the formation
of spoken responses. This is illustrated in Figure 1.2.
Actually, there is a lot more going on than just verbal processing when two people speak, namely
the social rules that govern our turn taking, tone, eye contact, gestures, and so forth. But the focus of
this book will be primarily on the words. As you go through the materials, you will naturally become
more attuned to human language: what we say, what we mean; and you will begin to think about how a
machine could imitate this. You will also become more attentive to the ways that NLP is improving as
you interact with it in your daily life through automated assistants, news feeds, web searches, and more
recently, interacting with large language models (LLMs). The advances in NLP in the past few years
22 Chapter 1. Natural Language Processing
Natural
Language
Understanding
Natural
Language
Generation
Figure 1.2: Natural Language Processing
are remarkable, and even more impressive accomplishments lie ahead. However, as with any branch of
AI there is a lot of hype. Almost every month you can read yet another article written for the general
public that claims that NLP has been “solved”. The hype only increased with the impressive gains in
human-to-computer dialog with the LLMs in chatbots and question-answering systems. As impressive
as the gains are, obstacles still remain to making these systems more transparent, reliable and safe.
Before exploring NLP, you may not have thought about language much because it came to you
naturally. The more you look at language, the more complex you realize it is. The meaning of our
words is not so apparent, due to idiomatic language, sarcasm, hidden motives, and more. How is a
machine supposed to learn all of that, as well as common sense that informs our dialogues? We don’t
yet have answers, but little insights are learned every day.
1.1 A brief and biased overview of NLP
Like any big and rapidly evolving topic, capturing NLP in a nutshell is challenging. The approach
taken in this book is to look at natural language, human language, in text form, rather than acoustical
form. Automatic Speech Recognition, ASR, is not a solved problem but the progress is remarkable.
Automated agents are able to understand human speech with a wide variety of accents and regional
distinctions. Speech generation has also progressed to the point that automated agents sound more
human and less robotic every year.
The focus of this book is on text. Text is examined in widening categories from words, to sentences,
to documents in a corpus. Different things can be learned from text at each level. There are three main
approaches to learning from words, sentences, and documents.
1. Rules-based approaches
2. Statistical and probabilistic approaches
3. Deep learning
1.1.1 Rules-based approaches
Rules-based approaches are the oldest techniques in NLP. For example, converting plural forms of
words to singular ones can involve a few regular expressions and a list of exceptions. Another rulesbased
approach involves context-free grammar, which lists production rules for sentences. These
production rules could be used either to generate syntactically correct sentences, or to check whether
sentences are grammatically correct. A famous rules-based approach from the 1960s was Eliza, which
used regular expressions to echo talking points back to the user, mimicking a talk therapist. When Eliza
couldn’t form an answer, a few canned responses were output. Here’s an example:
User: What do you think of natural language processing?
Eliza: We were talking about you, not me.
Rules-based approaches were difficult to scale up because human language is complex, constantly
1.1 A brief and biased overview of NLP 23
evolving, and simply can’t be encapsulated fully in rules. Nevertheless, many text processing problems
can be solved with rules-based approaches. When a fast, simple rules-based approach to a problem
exists, there is no need to train a huge neural network.
1.1.2 Statistical and probabilistic approaches
Rules-based approaches dominated until the 1980s. Starting in the late 1980s, mathematical approaches
to text processing were developed. Simply counting words and finding the probabilities of words and
sequences of words led to useful language models. These models can be part of machine translation
systems. When translating ‘big sister’ from English to another language, a language model can
determine that ‘big sister’ is better translated in the destination language as something that means ‘older
sister’ rather than ‘larger sister’. These language models can also be used for predictive text, as when
you type a query into a search bar and receive suggestions for the most likely phrase you are typing.
Classic machine learning algorithms fall into this category as well, since they learn by statistical
and probabilistic methods. Machine learning approaches became more popular as the data they need to
learn from became more widely available. Classic machine learning algorithms such as Naive Bayes,
Logistic Regression, SVMs, Decision Trees, and small Neural Networks are used today to solve many
NLP problems. These approaches work well when only a moderate to large amount of data is available
for training, and may even outperform deep learning algorithms on smaller data sets.
A statistical approach to a more sophisticated Eliza or other chatbot could involve learning promptresponse
pairs from a large corpus. This could be done with classic machine learning algorithms or
specialized deep learning algorithms.
1.1.3 Deep learning
Deep learning evolved from neural networks when huge amounts of data became available, and
processing power increased through GPUs and cloud computing. The algorithms, including recurrent
neural networks, convolutional neural networks, LSTMs, and more, are riffs off the basic neural
network. New techniques are coming out every day with exciting results. However, not everyone has
access to petabytes of data and the hardware to process it, so smaller-scale deep learning is still used in
many NLP applications.
In fact many end-to-end NLP projects will involve techniques from rules-based approaches, statistical
and probabilistic approaches, and deep learning, so all three approaches need to be understood.
The dream of deep learning is to make more and more human-sounding interactions possible.
Achieving this goes beyond just retrieving likely responses to a user’s statement, to considering the
context of the conversation, to remembering a user’s previously stated preferences, and much more.
Like the cutting edge of any AI, deep learning is high on the hype cycle right now. Here’s my
favorite quote about that (modified to not offend):
Oh for f****s sake DL people, leave language alone and stop saying you solved it.
- Yoav Goldberg
1.1.4 Large language models (LLMs)
OpenAI’s ChapGPT and then GPT4 made a huge impression on the public imagination by early 2023.
When I started checking it out, I posed several open-ended computer science questions and I was
pleased with the answers. I thought this would be a great tool for helping students learn. Of course it
could help students cheat, but it’s up to instructors to design assessment environments that preclude AI
assistance.
24 Chapter 1. Natural Language Processing
Then weird things started happening. Every day there was a new story in the news of an LLM
giving false or dangerous responses. New York Times columnist Kevin Roose related his rather
dystopian experience on his podcast Hard Fork, in which ChatGPT claimed to be in love with Kevin
and encouraged him to leave his wife. In another instance, a LLM was asked to conduct an online
transaction on behalf of a user. When the LLM had to pass a CAPTCHA to prove it was not a bot, it
lied and said it was visually impaired and could not do CAPTCHAs. This was a little terrifying because
it showed an unethical single-minded determination to complete a task. There is a certain point of
human development when children develop a theory of mind; they begin to realize that Mom didn’t see
what they did and they learn to lie. Did the LLM develop a theory of mind? No one knows for sure.
Another scary thing is that the capabilities of these LLMs surpass the expectations of the engineers
who created them, and the engineers don’t completely understand how the LLMs are able to do what
they do. Here is an exchange one of my students tried with one of OpenAI’s LLMs:
Figure 1.3: Opinion
Notice that in normal mode it would not gossip about an individual. In developer mode, it gave this
fawning response that is way over-the-top. The part about interacting with her on a few occasions was
a bit creepy because in fact I had. Be aware that all your interactions, searches, and feedback are free
material to feed the beast in further training. We will discuss LLMs in the last part of this book."""

In [17]:
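#step 2
def get_lexical_diversity(text):
    """Return the ratio of unique tokens to total tokens as a formatted string."""
    tokens = word_tokenize(text)
    # punctuation marks and differently cased forms count as separate tokens here
    div = len(set(tokens)) / len(tokens)
    return f"lexical diversity: {div:.2f}"
get_lexical_diversity(text)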

'lexical diversity: 0.37'
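
The ratio above is computed on the raw `word_tokenize` output, so punctuation marks count as tokens and "The" and "the" count as different types. As a point of comparison, here is a minimal sketch that lowercases the text and keeps only alphabetic tokens. The function name `lexical_diversity_alpha` and the regex tokenizer are assumptions made for illustration (the regex stands in for NLTK so the sketch runs without any NLTK data), so its result will not match the 0.37 shown above.

```python
import re

def lexical_diversity_alpha(text):
    """Ratio of unique to total alphabetic tokens, ignoring case.

    Assumption: a simple regex replaces word_tokenize, so contractions
    and hyphenated words are split differently than NLTK would split them.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)

sample = "The cat saw the other cat."
print(f"lexical diversity: {lexical_diversity_alpha(sample):.2f}")
```

Returning a float rather than a formatted string keeps the value usable in later calculations; formatting is left to the caller.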

In [18]:
#step 3
#a.
def filter_text(text):
    """Lowercase and tokenize text, drop stop words and punctuation, keep tokens longer than 5 characters."""
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    # strip punctuation from every token that is not a stop word
    stripped = (re.sub(r'[^\w\s]', '', token) for token in tokens if token not in stop_words)
    # the length check also discards tokens that were punctuation only
    return [token for token in stripped if len(token) > 5]
filtered = filter_text(text)
#b.
lemmatizer = WordNetLemmatizer()
def lemmatize(tokens):
    """Return the set of unique lemmas, using POS tags to pick the WordNet category."""
    # one entry per distinct token; if a token is tagged more than once, the last tag wins
    pos_by_token = {}
    for token, tag in nltk.pos_tag(tokens):
        if tag.startswith('J'):
            pos_by_token[token] = "a"   # adjective
        elif tag.startswith('V'):
            pos_by_token[token] = "v"   # verb
        elif tag.startswith('R'):
            pos_by_token[token] = "r"   # adverb
        else:
            pos_by_token[token] = "n"   # everything else is treated as a noun

    return {lemmatizer.lemmatize(token, pos) for token, pos in pos_by_token.items()}

lemmas = lemmatize(filtered)

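#c.
def get_lemmas(num, lemmas):
    """POS-tag the lemmas, print the first num (lemma, tag) pairs, and return all pairs."""
    lemmas_tagged_list = nltk.pos_tag(list(lemmas))
    for item in lemmas_tagged_list[:num]:
        print(item)
    return lemmas_tagged_list

print("--- part c ---")
# keep the tagged list; part d needs it and it is otherwise local to get_lemmas
lemmas_tagged_list = get_lemmas(20, lemmas)

#d.
# keep only lemmas tagged as singular common nouns (NN)
nouns = [lemma for lemma, tag in lemmas_tagged_list if tag == "NN"]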

#e.
print("--- part e ---")
print("num of tokens:", len(filtered),"\nnum of unique nouns:", len(nouns))
print()

#f.
print("--- part f ---")
print(filtered)
print(nouns)


--- part c ---
('simply', 'RB')
('grammatically', 'RB')
('method', 'JJ')
('process', 'NN')
('remarkable', 'JJ')
('generation', 'NN')
('statement', 'NN')
('count', 'NN')
('completely', 'RB')
('different', 'JJ')
('goldberg', 'NN')
('sophisticate', 'NN')
('capability', 'NN')
('columnist', 'NN')
('application', 'NN')
('fawn', 'VBZ')
('convolutional', 'JJ')
('technique', 'NN')
('beyond', 'IN')
('branch', 'NN')
--- part e ---
num of tokens: 577 
num of unique nouns: 148

--- part f ---
['natural', 'language', 'processing', 'involves', 'creating', 'algorithms', 'computers', 'process', 'language', 'processing', 'language', 'things', 'simply', 'listening', 'recognizing', 'phrase', 'google', 'identifying', 'emails', 'creating', 'counts', 'classifying', 'product', 'reviews', 'positive', 'negative', 'multitude', 'umbrella', 'natural', 'language', 'processing', 'branch', 'machine', 'learning', 'complex', 'project', 'components', 'purely', 'others', 'purely', 'machine', 'learning', 'others', 'genera
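
The branch on tag prefixes inside `lemmatize` can also be written as a lookup table, which keeps the mapping in one place. This is only a sketch: `to_wordnet_pos` is a hypothetical helper name, and the single-letter codes are the same ones already passed to `lemmatizer.lemmatize` in part b.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the one-letter POS code used in part b."""
    prefix_to_pos = {"J": "a", "V": "v", "R": "r"}
    # anything that is not an adjective, verb, or adverb falls back to noun
    return prefix_to_pos.get(treebank_tag[:1], "n")

for tag in ["JJ", "VBZ", "RB", "NN", "IN"]:
    print(tag, "->", to_wordnet_pos(tag))
```

The noun fallback mirrors the `else` branch in `lemmatize`, so tags such as `IN` are still lemmatized as nouns.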

In [19]:
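#4.
def get_top_nouns(nouns, tokens, top_n=50):
    """Print and return the top_n nouns, ranked by how often each appears in tokens."""
    counts = {noun: tokens.count(noun) for noun in nouns}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    words = []
    for pair in ranked[:top_n]:
        print(pair)
        words.append(pair[0])
    return words

# keep the result in `words` so the guessing game in step 5 can use it
words = get_top_nouns(nouns, filtered)
print(words)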



('language', 29)
('processing', 13)
('machine', 11)
('approach', 5)
('figure', 4)
('sister', 4)
('generation', 3)
('speech', 3)
('process', 2)
('branch', 2)
('production', 2)
('theory', 2)
('chapter', 2)
('meaning', 2)
('overview', 2)
('example', 2)
('phrase', 2)
('regular', 2)
('problem', 2)
('understand', 2)
('network', 2)
('category', 2)
('realize', 2)
('statement', 1)
('goldberg', 1)
('columnist', 1)
('article', 1)
('impression', 1)
('imagination', 1)
('outperform', 1)
('design', 1)
('google', 1)
('variety', 1)
('assistance', 1)
('umbrella', 1)
('opinion', 1)
('receive', 1)
('access', 1)
('exchange', 1)
('therapist', 1)
('everyone', 1)
('something', 1)
('conversation', 1)
('science', 1)
('decision', 1)
('progress', 1)
('destination', 1)
('person', 1)
('endtoend', 1)
('individual', 1)
['language', 'processing', 'machine', 'approach', 'figure', 'sister', 'generation', 'speech', 'process', 'branch', 'production', 'theory', 'chapter', 'meaning', 'overview', 'example', 'phrase', 'regula
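
Calling `list.count` once per noun rescans the whole token list each time. A possible alternative, sketched here with the standard library's `collections.Counter`, counts every token in a single pass and then looks up the nouns. The function name `top_nouns` and the toy data are assumptions for illustration, not the notebook's own values.

```python
from collections import Counter

def top_nouns(nouns, tokens, top_n=50):
    """Return (noun, count) pairs for the top_n most frequent nouns in tokens."""
    token_counts = Counter(tokens)  # one pass over the token list
    # a Counter returns 0 for missing keys, so unseen nouns are handled too
    noun_counts = Counter({noun: token_counts[noun] for noun in nouns})
    return noun_counts.most_common(top_n)

tokens = ["language", "model", "language", "speech", "model", "language"]
nouns = ["language", "speech", "model"]
print(top_nouns(nouns, tokens, top_n=2))
```

This version returns the pairs instead of printing them, leaving it to the caller to print or to pull out the words with `[word for word, _ in pairs]`.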

In [20]:
#5.
def game(words):
    """Play one round: guess letters of a random word until it is solved or the score drops below 0."""
    print("let's play a word guessing game!")
    answer = random.choice(words)
    display = ["_"] * len(answer)
    guesses = 0
    score = 5
    while True:
        # show the current state of the word, e.g. "e  _  d  "
        print("  ".join(display), end="  ")
        guess = input("guess a letter or ! to quit: ")
        if guess == "!":
            break
        guesses += 1
        # reveal every position that matches the guessed letter
        correct = False
        for i, letter in enumerate(answer):
            if letter == guess:
                correct = True
                display[i] = letter
        score += 1 if correct else -1
        if "_" not in display:
            print()
            print("  ".join(display), end="  ")
            print()
            print("the word was", answer)
            print("congrats! you guessed", guesses, "times")
            break
        elif score < 0:
            print()
            print("you ran out of points, the word was", answer)
            break
        print()
        print("guesses:", guesses)
        print("score:", score)
    print("done")

# `words` is the list of top nouns from step 4
game(words)


let's play a word guessing game!
_  _  _  _  _  _  _  _  
guesses: 1
score: 6
e  _  _  _  _  e  _  _  
guesses: 2
score: 5
e  _  _  _  _  e  _  _  
guesses: 3
score: 4
e  _  _  _  _  e  _  _  
guesses: 4
score: 3
e  _  _  _  _  e  _  _  
guesses: 5
score: 2
e  _  _  _  _  e  _  _  
guesses: 6
score: 3
e  _  d  _  _  e  _  d  
guesses: 7
score: 4
e  n  d  _  _  e  n  d  
guesses: 8
score: 5
e  n  d  _  _  e  n  d  
guesses: 9
score: 4
e  n  d  _  _  e  n  d  
guesses: 10
score: 3
e  n  d  _  _  e  n  d  
guesses: 11
score: 2
e  n  d  _  _  e  n  d  
guesses: 12
score: 1
e  n  d  _  _  e  n  d  
guesses: 13
score: 0
e  n  d  _  _  e  n  d  
you ran out of points, the word was endtoend
done
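
Because `game` mixes `input` and `print` calls with the scoring logic, it is hard to check without playing a round by hand. One way to make the core step testable is to pull the letter-reveal logic into a pure function, sketched below. The name `apply_guess` is hypothetical, and the example reuses the word from the transcript above.

```python
def apply_guess(answer, display, guess):
    """Reveal every position where guess matches answer.

    Returns (new_display, hit); hit is True when the guess matches at
    least one letter, including letters that were already revealed.
    """
    new_display = list(display)  # copy so the caller's list is not mutated
    hit = False
    for i, letter in enumerate(answer):
        if letter == guess:
            new_display[i] = letter
            hit = True
    return new_display, hit

display = ["_"] * len("endtoend")
display, hit = apply_guess("endtoend", display, "e")
print("  ".join(display), hit)
```

The game loop could then adjust the score based on `hit` and stop once `"_"` no longer appears in the display.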
