
### 1. Tokenizing the Gettysburg Address

Tokenization splits a piece of text into smaller units called tokens. Depending on the granularity chosen, tokens can be words, characters, or subwords.
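As a rough illustration of the three granularities, the sketch below splits one phrase into word, character, and subword tokens using only the standard library. The regular expression, the helper `greedy_subwords`, and the tiny `VOCAB` set are assumptions made for this example. Real subword tokenizers learn their vocabulary from data rather than using a hand-written set.

```python
import re

text = "a new nation"

# Word-level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every non-space character is its own token
char_tokens = [ch for ch in text if not ch.isspace()]

# Hypothetical subword vocabulary, invented for this example
VOCAB = {"a", "new", "na", "tion"}

def greedy_subwords(word, vocab):
    """Split a word by repeatedly taking the longest prefix found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece matches: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, VOCAB)]

print(word_tokens)     # ['a', 'new', 'nation']
print(char_tokens)
print(subword_tokens)  # ['a', 'new', 'na', 'tion']
```

The rest of this notebook works at the word level, which is what spaCy's tokenizer produces.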

We tokenize one of the most famous speeches of all time: the Gettysburg Address, delivered by U.S. President Abraham Lincoln during the American Civil War.

In [1]:
!wget https://github.com/raj-vijay/nl/raw/master/files/Gettysburg.txt

--2021-11-26 18:20:43--  https://github.com/raj-vijay/nl/raw/master/files/Gettysburg.txt
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/raj-vijay/nl/master/files/Gettysburg.txt [following]
--2021-11-26 18:20:43--  https://raw.githubusercontent.com/raj-vijay/nl/master/files/Gettysburg.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1455 (1.4K) [text/plain]
Saving to: ‘Gettysburg.txt’


2021-11-26 18:20:43 (23.8 MB/s) - ‘Gettysburg.txt’ saved [1455/1455]



In [2]:
# Read the whole speech into a single string
with open('Gettysburg.txt', 'r', encoding='utf-8') as f:
    gettysburg = f.read()

In [3]:
gettysburg[:100]

'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived '

In [4]:
import spacy

In [5]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [6]:
# Create a Doc object
doc = nlp(gettysburg)

In [7]:
# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', '\n', 'Now', 'we', 'are', 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', 'are', 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', '\n', 'We', 'have', 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', 'is', 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', '\n', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'can', 'not', 'ded
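
The token list above includes `'\n'` entries because the line breaks in the file are kept as tokens. A minimal sketch of dropping them follows. It uses a short sample copied by hand from the printed list (the names `sample_tokens` and `clean_tokens` are assumptions), so it runs without spaCy.

```python
# A short sample copied from the token list printed above
sample_tokens = ['created', 'equal', '.', '\n', 'Now', 'we', 'are']

# Keep only tokens that are not pure whitespace
clean_tokens = [t for t in sample_tokens if not t.isspace()]

print(clean_tokens)  # ['created', 'equal', '.', 'Now', 'we', 'are']
```

When working on the `Doc` object directly, the same filter can be written with spaCy's `is_space` token attribute, for example `[token.text for token in doc if not token.is_space]`.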

### 2. Lemmatizing the Gettysburg Address

**Stemming and lemmatization**

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

    am, are, is > be
    car, cars, car's, cars' > car


The result of this mapping of text will be something like:

    the boy's cars are different colors > the boy car be differ color

However, the two terms differ in flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the token was used as a verb or a noun.
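
To make the contrast concrete, here is a toy sketch of both approaches in plain Python. It is not the Porter stemmer and not spaCy's lemmatizer: the suffix list, the lookup table, and the function names `crude_stem` and `lookup_lemma` are all invented for this example.

```python
# Toy suffix list and lemma table, both invented for this example
SUFFIXES = ("ing", "es", "ed", "s")
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("are", "VERB"): "be",
    ("cars", "NOUN"): "car",
}

def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word, pos):
    """Return the dictionary form for (word, part of speech), else the word."""
    return LEMMA_TABLE.get((word, pos), word)

for w in ("organizes", "organizing", "cars", "are"):
    print(w, "->", crude_stem(w))

print(lookup_lemma("saw", "VERB"))  # see
print(lookup_lemma("saw", "NOUN"))  # saw
print(lookup_lemma("are", "VERB"))  # be
```

The stemmer maps `organizes` and `organizing` to the non-word `organiz` and leaves `are` unchanged, while the lookup needs the part of speech to choose between `see` and `saw`. That dependence on context is why the cells below run the full spaCy pipeline before reading `token.lemma_`.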

In [8]:
# Print the Gettysburg Address
print(gettysburg)

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battlefield of that war.
We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract.
The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so

In [9]:
import spacy

In [10]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

In [11]:
# Create a Doc object
doc = nlp(gettysburg)

In [12]:
# Generate lemmas
lemmas = [token.lemma_ for token in doc]

In [13]:
# Convert lemmas into a string
print(' '.join(lemmas))

four score and seven year ago -PRON- father bring forth on this continent , a new nation , conceive in Liberty , and dedicate to the proposition that all man be create equal . 
 now -PRON- be engage in a great civil war , test whether that nation , or any nation so conceive and so dedicated , can long endure . -PRON- be meet on a great battlefield of that war . 
 -PRON- have come to dedicate a portion of that field , as a final resting place for those who here give -PRON- life that that nation may live . -PRON- be altogether fitting and proper that -PRON- should do this . 
 but , in a large sense , -PRON- can not dedicate , -PRON- can not consecrate , -PRON- can not hallow this ground . the brave man , living and dead , who struggle here , have consecrate -PRON- , far above -PRON- poor power to add or detract . 
 the world will little note , nor long remember what -PRON- say here , but -PRON- can never forget what -PRON- do here . -PRON- be for -PRON- the living , rather , to be dedica
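
The `-PRON-` entries in the output are the placeholder lemma that spaCy 2.x assigns to pronouns. As far as I know, spaCy 3.x no longer emits this placeholder, so results may differ with a newer installation. If the placeholder is unwanted, one option is to fall back to the lowercased token text. The sketch below assumes hand-copied (text, lemma) pairs taken from the outputs above; `pairs` and `restore_pronouns` are hypothetical names, not part of spaCy.

```python
# (token text, lemma) pairs copied by hand from the outputs above
pairs = [
    ("our", "-PRON-"), ("fathers", "father"), ("brought", "bring"),
    ("We", "-PRON-"), ("are", "be"), ("met", "meet"),
]

def restore_pronouns(pairs, placeholder="-PRON-"):
    """Replace the placeholder lemma with the lowercased original token."""
    return [text.lower() if lemma == placeholder else lemma
            for text, lemma in pairs]

print(' '.join(restore_pronouns(pairs)))  # our father bring we be meet
```

With a real `Doc`, the same pairs could be built as `[(token.text, token.lemma_) for token in doc]`.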