From Text to Data
==============

This chapter, which we've asked you to read prior to the first workshop, is a general discussion of working with 
textual data in Python. While the workshop series assumes you have at least a basic understanding of Python, we'll 
quickly review how to load, or "read in," a single text file and format it for text analysis. We'll do so both as a 
refresher and because this simple action illuminates an important aspect of working with textual data: namely, that 
to your computer, text is above all a **sequence of characters**. This is a key thing to keep in mind when 
preparing your data for text mining and/or NLP.

As you work through this chapter, use it as a check on your Python skills. If you feel comfortable writing the code 
below, you should be prepared for our sessions. The skills covered in this chapter include:

+ Loading text data into Python
+ Working with different Python data structures (strings, lists, dictionaries)
+ Control flow with `for` loops
+ Using `Pandas` dataframes

```{tip}
Need to brush up on Python? The DataLab offers a Python Basics workshop series. You can find links to the series 
reader and recording on our [Workshop Archive page].

[Workshop Archive page]: https://datalab.ucdavis.edu/workshops/
```

Loading Text
---------------

To open a text file, we'll use `with open(...)`. The `with` block closes the file stream for us as soon as the 
block ends, which frees up a little memory for later computation. The memory strain a single text file puts on your 
computer isn't very large at all, but dozens, to say nothing of hundreds or thousands, of texts can start to slow 
things down, so it's good to get in the habit of automatically closing file streams right from the start.

In [1]:
with open("data/shelley_frankenstein.txt", 'r') as f:
    frankenstein = f.read()
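
To see that the `with` block really does close the stream, here is a minimal sketch. It writes a throwaway file to a temporary directory so that it can run anywhere; the file name `sample.txt`, its contents, and the explicit `encoding="utf-8"` argument are assumptions made for illustration and are not part of the workshop data.

```python
import tempfile
from pathlib import Path

# Create a throwaway file (hypothetical name and contents)
tmp_dir = tempfile.mkdtemp()
path = Path(tmp_dir) / "sample.txt"
path.write_text("Letter 1\n\n_To Mrs. Saville, England._\n", encoding="utf-8")

with open(path, "r", encoding="utf-8") as f:
    sample = f.read()
    print(f.closed)  # False: the stream is still open inside the block

print(f.closed)      # True: leaving the block closed it for us
```

The variable `sample` remains available after the block ends; only the file stream is gone.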

### Plain text

Here we use `r` in the `mode` argument because we're working with **plain text** data (as opposed to **binary** 
data, which would require `rb`). In computing, plain text has multiple fuzzy, interlocking meanings, but generally 
it refers to data that is stored in a human-readable form, which is to say, it is composed of a 
collection of text characters (usually [ASCII], but increasingly [UTF-8]).

[ASCII]: https://en.wikipedia.org/wiki/ASCII
[UTF-8]: https://en.wikipedia.org/wiki/UTF-8
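
The difference between text and bytes can be sketched with a single line from the novel. The example below is only an illustration: it rebuilds the date line by hand (writing its long dash as the escape `\u2014`) rather than reading it from the file, and the variable names are our own.

```python
# The date line from the novel, with its dash written as an escape
line = "St. Petersburgh, Dec. 11th, 17\u2014."

# Encoding to UTF-8 gives the bytes that "rb" mode would hand back
raw = line.encode("utf-8")

print(len(line))  # number of characters
print(len(raw))   # number of bytes: the dash takes three in UTF-8
print(raw[-4:])   # b'\xe2\x80\x94.'

# ASCII has no way to represent the dash at all
try:
    line.encode("ascii")
except UnicodeEncodeError as err:
    print("Not ASCII:", err.reason)
```

Every other character in this line fits in one byte, which is why the byte count is exactly two more than the character count.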

All the texts we'll be working with are plain text files. But depending on your research area, access to plain text 
representations of documents may be the exception, not the rule. If that's the case, you would need to convert 
your documents into a machine-readable form. Options for doing so range from automated methods, like [optical 
character recognition], to good old-fashioned hand transcription.

[optical character recognition]: https://en.wikipedia.org/wiki/Optical_character_recognition

In plain text representations, every keystroke you would use to hand-transcribe a text has a corresponding sequence 
of characters. That means this print output:

In [2]:
print(frankenstein[:364])

Letter 1

_To Mrs. Saville, England._


St. Petersburgh, Dec. 11th, 17—.


You will rejoice to hear that no disaster has accompanied the
commencement of an enterprise which you have regarded with such evil
forebodings. I arrived here yesterday, and my first task is to assure
my dear sister of my welfare and increasing confidence in the success
of my undertaking.


...is represented by the following in plain text:

In [3]:
frankenstein[:364]

'Letter 1\n\n_To Mrs. Saville, England._\n\n\nSt. Petersburgh, Dec. 11th, 17—.\n\n\nYou will rejoice to hear that no disaster has accompanied the\ncommencement of an enterprise which you have regarded with such evil\nforebodings. I arrived here yesterday, and my first task is to assure\nmy dear sister of my welfare and increasing confidence in the success\nof my undertaking.'

```{margin} You might ask...
Why aren't spaces also represented by a character? Well, they are, but even in this relatively unformatted view 
your computer continues to automatically render text in a human-readable way.
```

See all the `\n`, or **newline**, characters? Each one represents a linebreak. On the backend, your computer uses 
newline characters to demarcate things like paragraphs, stanzas, titles, and so forth, but typically it suppresses 
these characters when it renders text for our eyes. What we see in the two different outputs above, then, is a 
difference between **print conventions** and **code conventions**: what appears as a blank in the former is in fact 
an addressable unit in the latter.
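
A small sketch makes the two conventions concrete. It uses a hand-typed copy of the novel's first two lines (an assumption, so that the example runs without the data file) and Python's built-in `repr()`, which is what a notebook uses when it echoes a bare variable.

```python
sample = "Letter 1\n\n_To Mrs. Saville, England._"

print(sample)        # print conventions: newlines become line breaks
print(repr(sample))  # code conventions: newlines appear as \n

# Newlines and spaces are both ordinary, countable characters
print(sample.count("\n"))  # 2
print(sample.count(" "))   # 4
```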

### Character sequences

The distinction between print and code conventions has some significant consequences for us, both practical and 
conceptual. While, in the print view of the world, we tend to think of the word as the atomic unit of text, in the 
code view of the world, text is -- again -- a **sequence of characters**. We can see this if we try to count how 
many units are in our text:

In [4]:
print("The length of Frankenstein is:", len(frankenstein))

The length of Frankenstein is: 418917


The Penguin edition of _Frankenstein_ clocks in at ~220 pages. If we assume each page contains something like 350 
words, that makes the book ~77,000 words long -- far less than the number printed above. Why, then, did Python 
output this number? Because it counted characters, not words. To Python, our text is currently represented as a 
giant blob. This blob makes no distinction between the start of one word and the next; its atomic unit is the 
character, and so most of the operations we can run on it will thus address characters, not words.
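
Here is a short sketch of what "addressing characters" means in practice, using a hand-typed phrase from the novel's opening sentence as a stand-in for the full text.

```python
sample = "You will rejoice to hear"

print(len(sample))       # 24: every character counts, spaces included
print(sample[0])         # Y: index 0 is a character, not a word
print(sample[:8])        # You will: slices are spans of characters
print(list(sample[:5]))  # ['Y', 'o', 'u', ' ', 'w']
```

This is also why `frankenstein[:364]` above returned the first 364 characters rather than the first 364 words.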

### Tokenizing strings

But we want to address our text at the level of words. To do so, we'll need to manipulate how Python represents our 
data, changing it from a long stream of characters to discrete (and preferably indexed) units. The process of doing 
this is called **tokenization**. _To tokenize_ means to break a continuous sequence of text data into substrings, 
which we call "tokens." Ultimately, tokens are what we will end up counting in text analytics. They are the atomic 
unit of almost everything we'll discuss in this series.

Notably, a token is more of a generic entity than it is a particular kind of text. Tokens don't always mean words 
(though you'll often see them treated this way). In one sense, for example, our text is already tokenized -- it's 
just tokenized by characters, which isn't much use for us now. What we want to do, then, is tokenize our text in 
such a way that we can address each word therein. This will help us keep track of those words, rather than mucking 
around with blobby character data.

There are a number of different Python libraries that can tokenize text for you, but it's easy enough to do one 
version of this task with Python's base functionality. For now, we'll simply use `split()`. Called without an 
argument, this method splits a string on any run of whitespace, which will nicely isolate words (whitespace 
characters include `\n`, `\t`, and of course plain old spaces). We'll call `split()` on our text and save the result to `doc`.

In [5]:
doc = frankenstein.split()

Now, if we call `len()` on `doc`, we'll see this:

In [6]:
print("The length of Frankenstein is:", len(doc))

The length of Frankenstein is: 74975


Much better! Now that we have tokenized by words, this number is considerably closer to our estimate above.
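
It may help to see why we call `split()` with no argument instead of passing it a space. The sketch below compares the two on a hand-typed copy of the novel's first two lines; the variable names are ours.

```python
sample = "Letter 1\n\n_To Mrs. Saville, England._"

by_whitespace = sample.split()  # no argument: any run of whitespace
by_space = sample.split(" ")    # explicit argument: single spaces only

print(by_whitespace)
# ['Letter', '1', '_To', 'Mrs.', 'Saville,', 'England._']
print(by_space)
# ['Letter', '1\n\n_To', 'Mrs.', 'Saville,', 'England._']
```

With an explicit space, the newlines stay glued to their neighbors and two "words" end up fused into one token.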

```{admonition} A look ahead
While you'll most often tokenize on whitespaces, there are cases where you might want to chunk your text using 
different characters, or even entire sequences of characters. For example, if you are studying poetry, you might 
want to know some information about the average number of lines in a stanza. In that case, splitting on `\n` could 
be more useful than space. We'll cover this topic more fully in a later section; for the moment just keep in mind 
that there are many different and valid ways to tokenize text.
```
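
As a rough sketch of the stanza example, suppose a poem is stored with one line of verse per line and exactly one blank line between stanzas. Both the placeholder text and that layout are assumptions; real poetry files often need more careful handling.

```python
# Placeholder "poem": two stanzas separated by one blank line
poem = "first line\nsecond line\n\nthird line\nfourth line\nfifth line"

stanzas = poem.split("\n\n")  # a blank line marks a stanza break
lines_per_stanza = [len(stanza.split("\n")) for stanza in stanzas]
average = sum(lines_per_stanza) / len(lines_per_stanza)

print(lines_per_stanza)  # [2, 3]
print(average)           # 2.5
```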

Counting Words
-------------------

With our text data loaded and properly formatted, we can start one of the core tasks of text analysis: counting words. While the next chapter will discuss this process in greater detail, we'll preview it here to get a sense of 
what's to come and to review the basics of control flow in Python.

Splitting text transforms it into a list, where each word has its own separate index position. Remember that, in 
Python, lists are ordered arrays that store multiple, potentially repeatable values. With this representation of 
our data, it's much easier to get global word counts using something as simple as a `for` loop: we can simply iterate 
through every item in the list and tally them all up.

### A first pass

Let's do this now. We can build a little loop to find the cumulative number of times each word occurs in 
_Frankenstein_. To store this information, we'll use a dictionary. This will provide us with a way to access the 
counts of individual words once we've looped through the entire novel.

```{margin} What this loop does:
For every word in the novel, check whether that word is in the dictionary:
+ If it isn't in the dictionary, add that word and count `1`
+ If it is, increase that word's count by `1`
```

In [7]:
word_counts = {}

for word in doc:
    if word not in word_counts:
        word_counts[word] = 1
    else:
        word_counts[word] += 1

With this done, we can determine the total number of unique words in _Frankenstein_.

In [8]:
print("Total number of unique words in Frankenstein:", len(word_counts))

Total number of unique words in Frankenstein: 11590
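
The loop above is worth writing by hand at least once, but Python's standard library also has a shortcut for the same tally: `collections.Counter`. The sketch below runs it on a toy sentence of our own, not on the novel, and uses the names `toy_doc` and `toy_counts` so that it won't overwrite `doc` or `word_counts`.

```python
from collections import Counter

toy_doc = "the monster and the letter and the monster".split()

# Counter does the "check, then add or increment" bookkeeping for us
toy_counts = Counter(toy_doc)

print(toy_counts["the"])       # 3
print(len(toy_counts))         # 4 unique tokens
print(toy_counts["creature"])  # 0: unseen tokens count as zero
```

One practical difference: looking up a missing word in a plain dictionary raises a `KeyError`, whereas a `Counter` returns `0`.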


And we can also access the counts of individual words. Let's pick two: "imagination" and "monster."

In [9]:
for word in ["imagination", "monster"]:
    print(f"{word:<12} {word_counts[word]}")

imagination  14
monster      21


```{tip}
For the sake of readability, this reader uses extra string formatting to control the print spacing of our output. 
This isn't necessary though, so feel free to work without it. If you do want to use the extra formatting, you can 
do so by prefixing the string you pass to `print()` with `f`. Then, use `{}` around variables that you'd like to 
interpolate into the string. Inside the braces, `:<[NUMBER]` and `:>[NUMBER]` will control left and right 
justification, respectively.
```
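
Here is a small sketch of both alignment options side by side. The counts are the two reported above; the `|` characters are only there to make the padding visible.

```python
counts = {"imagination": 14, "monster": 21}

lines = []
for word, n in counts.items():
    # Pad the word to 12 characters on the left, the count to 5 on the right
    lines.append(f"{word:<12}|{n:>5}|")

print("\n".join(lines))
# imagination |   14|
# monster     |   21|
```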

If you're familiar with _Frankenstein_, you'll know that it's an epistolary novel, meaning it's written as a series 
of letters. It even begins this way: the heading in the first print output above reads "Letter 1."

With that in mind, let's tack on "letter" to our loop above:

In [10]:
for word in ["imagination", "monster", "letter"]:
    print(f"{word:<12} {word_counts[word]}")

imagination  14
monster      21
letter       17


### Top words

Great! This all seems to work well, though we won't get very far if we continue to take a top-down approach and 
spot check single words. How would we know what all is in the novel and what isn't? Instead of approaching the data 
in this way, it would be more useful to see what turns up if we just look at the count distribution as a whole.

To do so, let's sort our dictionary by the number of times each word appears. Putting these counts in a `Pandas` 
dataframe will make them much easier to work with.

In [11]:
import pandas as pd

df = pd.DataFrame.from_dict(word_counts, columns = ['COUNT'], orient = 'index')
df = df.sort_values('COUNT', ascending = False)
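
If you'd like to see the sorting step on its own, the same ordering can be sketched in plain Python with `sorted()`. This is an alternative for illustration, not a replacement for the dataframe we use in the rest of the chapter, and the toy counts below are made up.

```python
toy_counts = {"and": 2, "the": 3, "letter": 1, "monster": 2}

# Sort (word, count) pairs by count, largest first
ranked = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)

print(ranked[:2])  # [('the', 3), ('and', 2)]
```

Ties keep their original order, so "and" stays ahead of "monster" here.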

Now let's take a look at the 50 most frequent words:

In [12]:
df.head(50)

|      | COUNT |
|:-----|------:|
| the  |  3897 |
| and  |  2903 |
| I    |  2719 |
| of   |  2634 |
| to   |  2072 |
| my   |  1631 |
| a    |  1338 |
| in   |  1071 |
| was  |   992 |
| that |   974 |


And there they are!

```{warning}
*Except look:* do you notice anything strange about these counts? Inspect them closely. The word "The" appears 
about 20 rows up from the end of the output -- and yet "the" is also the *first* entry in this output. What's 
going on here?
```

### Investigating duplicates

Let's investigate. To see whether something might be off in the way we've generated our counts, we'll look back at 
our third example, "letter".

Let's grab the value for "letter" one more time.

In [13]:
df[df.index == "letter"]

|        | COUNT |
|:-------|------:|
| letter |    17 |


That corresponds to what we have above. But remember: _Frankenstein_ doesn't start with "letter." It 
starts with "Letter." Might this make a difference, as with "the"/"The"? Let's look.

In [14]:
df[df.index == "Letter"]

|        | COUNT |
|:-------|------:|
| Letter |     4 |


Not good... something seems to be off. There appear to be multiple copies of the same word in our dataframe.

To diagnose this problem, let's dig in even further. We'll search through all unique words in _Frankenstein_ and 
see whether we're somehow missing any other copies of "letter." We can do so by searching through the index of our 
dataframe and testing whether "letter" is a substring of a given index position.

```{admonition} Reminder
We haven't yet removed numbers from our data, so be sure to convert your indices to strings to avoid mismatches in 
datatypes.
```

In [15]:
df[df.index.str.contains("letter")]

|          | COUNT |
|:---------|------:|
| letter   |    17 |
| letters  |    12 |
| letter,  |     4 |
| letters, |     3 |
| letter.  |     2 |
| letter:  |     1 |
| letters; |     1 |
| letters. |     1 |


For good measure, let's do this with "monster" as well.

In [16]:
df[df.index.str.contains("monster")]

|           | COUNT |
|:----------|------:|
| monster   |    21 |
| monster,  |     5 |
| monster!  |     2 |
| monster.’ |     1 |
| ‘monster! |     1 |
| monsters, |     1 |
| monster;  |     1 |
| monsters  |     1 |
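
Notice that "Letter" is missing from the substring search for "letter," even though we know it is in the dataframe. That is because the search is case-sensitive by default (in `Pandas`, `str.contains` accepts `case=False` to change this). The plain-Python sketch below shows the same effect on a handful of tokens copied from the outputs above; the list itself is our own shortened stand-in for the full index.

```python
tokens = ["letter", "letters", "letter,", "Letter", "monster"]

# "in" tests whether one string is a substring of another
case_sensitive = [t for t in tokens if "letter" in t]
case_insensitive = [t for t in tokens if "letter" in t.lower()]

print(case_sensitive)    # ['letter', 'letters', 'letter,']
print(case_insensitive)  # ['letter', 'letters', 'letter,', 'Letter']
```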


Wait, What's a Word?
-------------------------

The outputs above should make clear what is happening. *We have a problem in the way we've defined the concept of a 
word.* Remember, to our computers, text is just a sequence of characters. Computers are highly literal in this 
respect; they only ever read character-by-character. And while they don't have an in-built concept of what words 
are, we were able to coax them into treating character sequences as words by splitting those sequences on the basis 
of spaces. That is, we said to our computers: "whenever you find a space, this marks the beginning or end of a 
word."

In doing so, we ended up creating a de facto definition of what constitutes a word: for this definition, a word is 
any sequence of characters surrounded by spaces.

If we frame what we've done in this way, we can see that our computers followed our definition perfectly, doing 
nothing more or less than splitting sequences of characters on spaces. In their character-by-character way of 
reading, "letter" is different from "letter;" -- and understandably so, for each is a different sequence of 
characters surrounded by spaces. The same goes for "letter" and "Letter": both are different character sequences 
surrounded by spaces, for in a very rudimentary sense, lowercase _l_ and uppercase _L_ are simply not the same 
character. (To be exact, the underlying Unicode "[codepoints]" for these letters are `U+006C` and `U+004C`, 
respectively.)

[codepoints]: https://en.wikipedia.org/wiki/Universal_Character_Set_characters
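
You can check these codepoints yourself with Python's built-in `ord()`, which returns the integer behind a character. The formatting below is just one way of printing that integer in the usual `U+` notation.

```python
for ch in ["l", "L"]:
    print(ch, ord(ch), f"U+{ord(ch):04X}")
# l 108 U+006C
# L 76 U+004C

print("letter" == "Letter")          # False
print("letter" == "Letter".lower())  # True
```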

```{margin} To complicate things further...
Non-English languages and non-alphabetic writing systems add a productive challenge to all this. We can't cover 
this topic in full, but Quinn Dombrowski has written about it in this [helpful blogpost] on text analytics and the 
"English default."

[helpful blogpost]: http://quinndombrowski.com/?q=blog/2020/10/15/whats-word-multilingual-dh-and-english-default
```

In another sense, however, they _are_ the same letter. We could say the same of "the" and "The." The problem here 
arises from the fact that, as opposed to our computers' highly literal way of reading, we tend to consider the 
meaning of words to be something that transcends differences in capitalization; that is mostly separable from 
punctuation; and that sometimes even goes beyond spelling (think American v. British English) and inflection 
("run," "running," "ran" => "run"). In the above output, what we'd really like to see is something closer to what 
linguists call **lexemes**, or the abstract units of meaning that underlie groups of words. Otherwise, we're still 
just counting characters.
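
As a rough preview of where this is heading, the sketch below folds a few of the variants we found into shared forms by lowercasing each token and trimming punctuation from its ends. It is a deliberately crude illustration, not the method the workshop will settle on: the token list is hand-picked from the outputs above, and `string.punctuation` only covers ASCII punctuation, so curly quotes like those in "‘monster!" would survive it.

```python
import string
from collections import Counter

tokens = ["letter", "Letter", "letter,", "letter.", "letters;", "monster!"]

# Lowercase each token and trim ASCII punctuation from both ends
normalized = [t.lower().strip(string.punctuation) for t in tokens]
tally = Counter(normalized)

print(tally)
# Counter({'letter': 4, 'letters': 1, 'monster': 1})
```

Even after this cleanup, "letter" and "letters" remain separate entries; capitalization and punctuation are handled, but inflection is not.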

The next chapter -- and with it, our first workshop session -- will discuss how to prepare textual data so as to 
begin analyzing words.