<center><b>DIGHUM101</b></center>
<center>2-6: Text Analysis: Introduction</center>

---

# Fast review

1. What is a data frame? 
2. What methods and other syntax can be used to subset rows and columns?

# Learning objectives

1. Download a .txt file from Project Gutenberg and import it into Python
2. Quick walkthrough of Hathi Trust Research Center (HTRC) resources
3. Learn the basics of text preprocessing: 
    - Tokenization
    - Punctuation removal
    - Count words, unique words, and word frequencies
    - Stop word removal
    - Stemming/lemmatization
    - Part of speech tagging
    - Quick introduction to n-grams, skip-grams, and BERT

In [None]:
# Import the string of ASCII punctuation characters from the string module
from string import punctuation
print(punctuation)
print(len(punctuation))

In [None]:
# Counter class for counting word frequencies
from collections import Counter

In [None]:
# Module to help us remove stopwords
import nltk
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger") # a pre-trained part-of-speech (POS) tagger
from nltk.corpus import stopwords

In [None]:
# Install spaCy and download a trained model (uncomment the lines below to run them)
# !pip install spacy

# Download a trained English model (small)
# !python -m spacy download en_core_web_sm 

# Download the large model as well
# !python -m spacy download en_core_web_lg
# import spacy

# Project Gutenberg

[Project Gutenberg](https://www.gutenberg.org/) has more than 60,000 texts for you to download. Be sure to check out their [Terms of Use](https://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use). You can find many .txt files here that are in the public domain. 

In [None]:
# Try it! Search for a book, download it, copy it to your working directory, and import it.

## YOUR CODE HERE
import os
os.getcwd()

In [None]:
os.chdir("../Data/") # CHANGE PATH HERE
%ls

In [None]:
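## HOW TO IMPORT dracula.txt?
# Open the file with an explicit encoding; the "with" block closes it automatically
with open("dracula.txt", encoding="utf-8") as f:
    dracula = f.read()
print(dracula[501:])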
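
Project Gutenberg files usually wrap the book in a header and a license footer. The sketch below shows one way to keep only the text between them. It assumes the file contains marker lines beginning with `*** START OF` and `*** END OF`; the exact wording varies between files, so check your own download. The `sample` string and the function name `strip_gutenberg_boilerplate` are hypothetical, made up for illustration.

```python
# Hypothetical sample standing in for a downloaded Project Gutenberg file
sample = (
    "The Project Gutenberg eBook of Dracula\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "CHAPTER I\n"
    "JONATHAN HARKER'S JOURNAL\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n"
    "License text follows...\n"
)

def strip_gutenberg_boilerplate(text, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the (assumed) start and end marker lines."""
    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1:
        return text  # markers not found: leave the text untouched
    # Move past the end of the start-marker line
    body_start = text.find("\n", start) + 1
    return text[body_start:end].strip()

body = strip_gutenberg_boilerplate(sample)
print(body)
```

If the markers are missing, the function returns the text unchanged, so it is safe to try on any file.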

# The Hathi Trust Research Center (HTRC)

Check out the [HTRC](https://www.hathitrust.org/) and learn about their many [collections tools](https://www.hathitrust.org/htrc_collections_tools) and the [Python library](https://github.com/htrc/htrc-feature-reader) to connect to the API. The [Analytics](https://analytics.hathitrust.org/) website gives you access to many canned features if you don't want to mess with the Python code. 

# Text Preprocessing: Strings in depth

Text preprocessing is an essential first step before applying machine learning algorithms to text, because many of these models work with counts of tokens rather than raw strings. For machine learning portions of this course, we will focus on [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) models, namely [document-term](https://en.wikipedia.org/wiki/Document-term_matrix) and [term frequency-inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) matrices from the [sklearn library](https://scikit-learn.org/stable/).
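
To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch using `Counter`. The two "documents" are hypothetical. This is not how sklearn builds its matrices; it only illustrates the shape of the result: one row per document, one column per vocabulary word.

```python
from collections import Counter

# Two tiny hypothetical "documents"
docs = ["the fog and the light", "the moons of gold"]

# Tokenize each document and count its words
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is every unique word across all documents,
# sorted so the column order is stable
vocab = sorted(set(word for doc in docs for word in doc.split()))

# One row per document, one column per vocabulary word
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Word order within each document is discarded, which is why this family of models is called "bag of words".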

Text preprocessing and pattern matching can be further enhanced through the use of [regular expressions](https://docs.python.org/3/library/re.html).
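
As a small taste of regular expressions, the sketch below pulls the words out of one line of the poem with `re.findall`. The pattern `\w+` is just one possible choice of what counts as a word.

```python
import re

line = "Others have the world, for better or worse;"

# \w+ matches runs of letters, digits, and underscores,
# so the comma and semicolon are left out of the results
words = re.findall(r"\w+", line.lower())
print(words)
```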

In [7]:
borges = '''In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless. Almost into a thought. 
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,

and lies in wait at dawn. I longed to see
just once a human face. Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold.
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse.'''


print(type(borges))
print() # this just prints a new line
print(borges)

<class 'str'>

In the fullness of the years, like it or not,
a luminous mist surrounds me, unvarying, 
that breaks things down into a single thing,
colorless, formless. Almost into a thought. 
The elemental, vast night and the day
teeming with people have become that fog
of constant, tentative light that does not flag,

and lies in wait at dawn. I longed to see
just once a human face. Unknown to me
the closed encyclopedia, the sweet play
in volumes I can do no more than hold, 
the tiny soaring birds, the moons of gold.
Others have the world, for better or worse; 
I have this half-dark, and the toil of verse.


In [8]:
# What do the triple quotes do in the assignment of borges above?

# Also, make a copy to preserve the original borges variable
poem = borges

# Tokenization

Tokenization is the process of splitting text into _something_ - often words. Each resulting piece is called a "token". A word such as "the" can show up as several distinct tokens within a text (for example "the", "The", and "the,") depending on its capitalization and any attached punctuation.

The `.split()` method allows us to split the text based on some sort of separator. With no argument, it splits on any run of whitespace (spaces, tabs, and newlines) between words.

In [9]:
# Split the string into a list of strings (single words)
print(poem.split())

['In', 'the', 'fullness', 'of', 'the', 'years,', 'like', 'it', 'or', 'not,', 'a', 'luminous', 'mist', 'surrounds', 'me,', 'unvarying,', 'that', 'breaks', 'things', 'down', 'into', 'a', 'single', 'thing,', 'colorless,', 'formless.', 'Almost', 'into', 'a', 'thought.', 'The', 'elemental,', 'vast', 'night', 'and', 'the', 'day', 'teeming', 'with', 'people', 'have', 'become', 'that', 'fog', 'of', 'constant,', 'tentative', 'light', 'that', 'does', 'not', 'flag,', 'and', 'lies', 'in', 'wait', 'at', 'dawn.', 'I', 'longed', 'to', 'see', 'just', 'once', 'a', 'human', 'face.', 'Unknown', 'to', 'me', 'the', 'closed', 'encyclopedia,', 'the', 'sweet', 'play', 'in', 'volumes', 'I', 'can', 'do', 'no', 'more', 'than', 'hold,', 'the', 'tiny', 'soaring', 'birds,', 'the', 'moons', 'of', 'gold.', 'Others', 'have', 'the', 'world,', 'for', 'better', 'or', 'worse;', 'I', 'have', 'this', 'half-dark,', 'and', 'the', 'toil', 'of', 'verse.']
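
The difference between the default separator and an explicit one is easy to miss. This short sketch compares `.split()` with `.split(" ")` on a hypothetical string that mixes spaces, a tab, and a newline.

```python
text = "a  luminous\tmist\nsurrounds me"

# No argument: split on any run of whitespace (spaces, tabs, newlines)
print(text.split())

# Explicit single space: tabs and newlines stay inside the pieces,
# and the double space produces an empty string
print(text.split(" "))
```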


# Count words

Jump in! 

In [10]:
# How many characters in poem?
len(poem)

600

In [11]:
# How many words?
len(poem.split())

110

In [12]:
# How many lines?
len(poem.split("\n"))

15

In [13]:
# How many periods? 
# Should this be equal to the number of sentences in the cell below?
poem.count(".")

6

In [None]:
poem.split(".")

In [14]:
# ... but how many sentences? Why is this different from the number of periods?
len(poem.split("."))

7
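
One way to investigate the mismatch is to look at what `.split(".")` returns when the text ends with a period. The sketch below uses a shortened, hypothetical two-sentence string and then drops the empty pieces.

```python
text = "Almost into a thought. I longed to see just once a human face."

pieces = text.split(".")
print(len(pieces))  # one more than the number of periods

# Drop pieces that are empty or contain only whitespace
sentences = [p.strip() for p in pieces if p.strip()]
print(sentences)
```

Splitting on periods is only a rough approximation of sentence boundaries; abbreviations and other punctuation would need more care.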

In [15]:
# How many stanzas?
len(poem.split("\n\n"))

2

In [16]:
# At which index does "me" first appear? (.find() matches substrings, not only whole words)
# .find() is "forward search"
poem.find("me")

72

In [17]:
poem[72:74]

'me'

In [18]:
# .index works as well
poem.index("me")

72

In [19]:
# Note that .find does not throw an error when an element is not found (but .index does)
poem.find("kangaroo")

-1
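
To see the difference between the two methods side by side, the sketch below searches a hypothetical line for a missing substring and catches the `ValueError` that `.index()` raises.

```python
line = "just once a human face"

# .find() signals "not found" with -1
print(line.find("kangaroo"))

# .index() raises a ValueError instead
try:
    line.index("kangaroo")
except ValueError as err:
    print("ValueError:", err)
```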

In [20]:
# At which index does the word "me" last appear?
# .rfind() starts at the highest index and works in reverse
poem.rfind("me")

434
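
Because `.find()` and `.rfind()` match substrings, they will also report a "me" that sits inside a longer word such as "become". If you want only the standalone word, one option is a regular expression with word boundaries. The line below is a hypothetical mash-up of phrases from the poem.

```python
import re

line = "teeming with people have become that fog, unknown to me"

# .find() matches substrings, so it stops at the "me" inside "become"
first_hit = line.find("me")
print(first_hit, line[first_hit - 4:first_hit + 2])

# \b marks a word boundary, so this matches only the standalone word "me"
match = re.search(r"\bme\b", line)
print(match.start(), match.group())
```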

# Count _unique_ words

In [24]:
# How many unique words?
# "Casting" our list into a set
len(set(poem.split()))

84

In [21]:
poem_list = poem.split()
print(poem_list)

['In', 'the', 'fullness', 'of', 'the', 'years,', 'like', 'it', 'or', 'not,', 'a', 'luminous', 'mist', 'surrounds', 'me,', 'unvarying,', 'that', 'breaks', 'things', 'down', 'into', 'a', 'single', 'thing,', 'colorless,', 'formless.', 'Almost', 'into', 'a', 'thought.', 'The', 'elemental,', 'vast', 'night', 'and', 'the', 'day', 'teeming', 'with', 'people', 'have', 'become', 'that', 'fog', 'of', 'constant,', 'tentative', 'light', 'that', 'does', 'not', 'flag,', 'and', 'lies', 'in', 'wait', 'at', 'dawn.', 'I', 'longed', 'to', 'see', 'just', 'once', 'a', 'human', 'face.', 'Unknown', 'to', 'me', 'the', 'closed', 'encyclopedia,', 'the', 'sweet', 'play', 'in', 'volumes', 'I', 'can', 'do', 'no', 'more', 'than', 'hold,', 'the', 'tiny', 'soaring', 'birds,', 'the', 'moons', 'of', 'gold.', 'Others', 'have', 'the', 'world,', 'for', 'better', 'or', 'worse;', 'I', 'have', 'this', 'half-dark,', 'and', 'the', 'toil', 'of', 'verse.']


In [25]:
for word in poem_list:
    print(word)

In
the
fullness
of
the
years,
like
it
or
not,
a
luminous
mist
surrounds
me,
unvarying,
that
breaks
things
down
into
a
single
thing,
colorless,
formless.
Almost
into
a
thought.
The
elemental,
vast
night
and
the
day
teeming
with
people
have
become
that
fog
of
constant,
tentative
light
that
does
not
flag,
and
lies
in
wait
at
dawn.
I
longed
to
see
just
once
a
human
face.
Unknown
to
me
the
closed
encyclopedia,
the
sweet
play
in
volumes
I
can
do
no
more
than
hold,
the
tiny
soaring
birds,
the
moons
of
gold.
Others
have
the
world,
for
better
or
worse;
I
have
this
half-dark,
and
the
toil
of
verse.


In [23]:
poem_set = set(poem_list)
print(poem_set)

{'breaks', 'sweet', 'closed', 'me', 'just', 'unvarying,', 'mist', 'In', 'in', 'down', 'encyclopedia,', 'than', 'colorless,', 'lies', 'or', 'world,', 'and', 'with', 'toil', 'teeming', 'things', 'people', 'can', 'night', 'human', 'become', 'it', 'tiny', 'does', 'light', 'this', 'to', 'hold,', 'elemental,', 'for', 'flag,', 'longed', 'The', 'a', 'the', 'single', 'soaring', 'worse;', 'once', 'more', 'into', 'of', 'at', 'I', 'like', 'see', 'not,', 'better', 'formless.', 'birds,', 'not', 'thought.', 'volumes', 'verse.', 'face.', 'do', 'years,', 'fullness', 'moons', 'no', 'vast', 'Unknown', 'tentative', 'Almost', 'dawn.', 'Others', 'me,', 'wait', 'that', 'fog', 'constant,', 'thing,', 'have', 'half-dark,', 'gold.', 'day', 'play', 'luminous', 'surrounds'}


In [26]:
for word in poem_set:
    print(word)

breaks
sweet
closed
me
just
unvarying,
mist
In
in
down
encyclopedia,
than
colorless,
lies
or
world,
and
with
toil
teeming
things
people
can
night
human
become
it
tiny
does
light
this
to
hold,
elemental,
for
flag,
longed
The
a
the
single
soaring
worse;
once
more
into
of
at
I
like
see
not,
better
formless.
birds,
not
thought.
volumes
verse.
face.
do
years,
fullness
moons
no
vast
Unknown
tentative
Almost
dawn.
Others
me,
wait
that
fog
constant,
thing,
have
half-dark,
gold.
day
play
luminous
surrounds


In [27]:
# Why are there two fewer unique words when we convert all the text to lowercase?
len(set(poem.lower().split()))

82

In [None]:
# Print the unique words
print(set(poem.lower().split()))

In [None]:
# What type of data structure is this? 
type(set(poem.lower().split()))

In [None]:
# Why is this count different from the lowercased version above?
len(set(poem.split()))

# Punctuation removal 

Remember how we imported that nice string of English punctuation in the first cell of this notebook? We could manually remove all of the punctuation using the .replace method, but this would get old fast!

In [28]:
# How many punctuation characters?
len(punctuation)

32

In [None]:
# Replace periods with nothing
del_periods = poem.replace(".", "")
del_periods

But what if you have tons of text and don't know exactly what punctuation is present? A quick for loop can help us remove every character in `punctuation` (i.e., `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``) from dirty text.

In [29]:
# Lowercase once, then remove each punctuation character with a for loop
poem = poem.lower()
for char in punctuation:
    poem = poem.replace(char, "")

In [30]:
# Punctuation is gone! 
print(poem)

in the fullness of the years like it or not
a luminous mist surrounds me unvarying 
that breaks things down into a single thing
colorless formless almost into a thought 
the elemental vast night and the day
teeming with people have become that fog
of constant tentative light that does not flag

and lies in wait at dawn i longed to see
just once a human face unknown to me
the closed encyclopedia the sweet play
in volumes i can do no more than hold 
the tiny soaring birds the moons of gold
others have the world for better or worse 
i have this halfdark and the toil of verse
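
An alternative to looping over `.replace()` is to build a translation table once with `str.maketrans` and apply it with `.translate()`. This sketch is offered as another option, not as the notebook's method, and it works on a single line of the poem.

```python
from string import punctuation

line = "I have this half-dark, and the toil of verse."

# Build a table that deletes every punctuation character
table = str.maketrans("", "", punctuation)
clean = line.lower().translate(table)
print(clean)
```

Note that deleting the hyphen fuses "half-dark" into "halfdark". Replacing hyphens with a space first is one way to avoid that, depending on how you want to count such words.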


# Count word frequencies

In [None]:
# Tokenize poem into single words
tokens = poem.split()
print(tokens)

In [None]:
# Show the ten most common words (stopwords included)
freq = Counter(tokens)
freq.most_common(10)
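
If `Counter` is new to you, this small sketch shows what it returns on a short, hypothetical list of tokens (the names `tokens_demo` and `freq_demo` are made up so they don't overwrite the variables above).

```python
from collections import Counter

tokens_demo = "the tiny soaring birds the moons of gold the toil of verse".split()
freq_demo = Counter(tokens_demo)

# most_common(n) returns (word, count) pairs, highest count first
print(freq_demo.most_common(2))

# Counts can be looked up like a dictionary; missing words count as 0
print(freq_demo["the"], freq_demo["kangaroo"])
```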

# Stop word removal

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are the most common words in a language (e.g., "the", "of", "and"). Depending on the analysis, they may or may not add information about the content of a text, so they are often removed.

In [None]:
stop = stopwords.words("english")
print(stop)

In [None]:
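# Keep only the tokens that are not stop words
# Build the stop word set once; set lookups are faster than searching a list
stop_words = set(stopwords.words("english"))
no_stops = []
for word in tokens:
    if word not in stop_words:
        no_stops.append(word)
print(no_stops)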

In [None]:
freq2 = Counter(no_stops)
freq2.most_common(10)
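
The same filtering can be written more compactly as a list comprehension. The sketch below uses a tiny hypothetical stop list (`demo_stops`) so that it runs without the NLTK data; in practice you would use the NLTK list from above.

```python
# A tiny hypothetical stop list, used so the example runs without NLTK data
demo_stops = {"the", "of", "and", "a", "i", "have", "this"}

demo_tokens = "i have this halfdark and the toil of verse".split()

# List comprehension: keep each token that is not in the stop list
demo_no_stops = [word for word in demo_tokens if word not in demo_stops]
print(demo_no_stops)
```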