<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Xanda Schofield](https://www.cs.hmc.edu/~xanda) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email xanda@cs.hmc.edu.<br />
____

# Text Data Curation 1

This is lesson 1 of 3 in the educational series on Text Data Curation. This notebook is intended to introduce the basics of treating text documents as data and how to store and filter those documents. 

**Audience:** `Learners` / `Researchers`

**Use case:** [`How-To`](https://constellate.org/docs/documentation-categories#howtoproblemoriented) 

**Difficulty:** `Intermediate`
This course is open to those with a basic level of proficiency in Python. Taking the Python Basics course the week before is sufficient.

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* How Python libraries work (installation and imports)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* How text is stored on computers (encodings, file types)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Parse and generate XML, JSON, and CSV files given raw text documents
2. Identify when text encodings affect the interpretation of text data
3. Use a lexicon to select relevant documents within a text collection
4. Use fuzzy matching to remove duplicate items from a collection
```
___

# Required Python Libraries
* [Tesseract](https://tesseract-ocr.github.io/) for performing [optical character recognition](https://docs.constellate.org/key-terms/#ocr).
* [Pandas](https://pandas.pydata.org/) for manipulating and cleaning data.
* [Pdf2image](https://pdf2image.readthedocs.io/en/latest/) for converting pdf files into image files.

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install beautifulsoup4

# Required Data


**Data Format:** 
* plain text (.txt)
* delimited files (.csv, .tsv)
* structured files (.json, .xml, .html)

**Data Source:**
* [Spanish Poetry Dataset](https://www.kaggle.com/datasets/andreamorgar/spanish-poetry-dataset): Spanish poem names scraped from [Poemas del Alma](https://www.poemas-del-alma.com/).


## Download Required Data

In [None]:
### Grab files with console `wget` and `mv` ###
!wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
!mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata


In [None]:
### Grab a single file and supply name ###
import urllib.request

urllib.request.urlretrieve('https://file.address.txt', 'filename.txt')

In [None]:
### Retrieve multiple files using a list ###
import urllib.request

download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_02.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_03.pdf'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])

In [None]:
### Retrieve multiple files using a list ###
### With data folder creation using os ###

# Files hosted somewhere else (don't store data on GitHub)
import os
import urllib.request

# Check if a folder exists to hold pdfs. If not, create it.
if not os.path.exists('sample_pdfs'):
    os.mkdir('sample_pdfs')

# Move into our new directory
os.chdir('sample_pdfs')

# Download the pdfs into our directory
download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_02.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_03.pdf'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])
    
## Move back out of our directory
os.chdir('../')

## Success message
print('Folder created and pdfs added.')

In [None]:
### Constellate Example ###

# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
dataset_id = "02b8c5c7-64bd-efe3-01d8-88c9efe7d17c"
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_file = constellate.download(dataset_id, 'jsonl')


# Introduction

This is the first of three lessons on **text data curation**, aimed at readers who know a little about how Python works and want to start working with a large collection of text using computational tools. Here, we will look specifically at the ways we store text and collections of documents on computers and at the tools we can use to filter larger collections into something usable. The tutorial is targeted at individuals interested in computational text analysis who are acquainted with the basics of programming in Python but are still new to developing models of text or more complex natural language processing tasks.

# Lesson


## When text is data
There are lots of ways to gain understanding of texts. We can read the texts themselves closely, read about the history of the texts' subject and of its authors, dig up related or contemporary texts, and grapple with the critical responses others have had to each. Depending on the questions we want to answer or our worldview about textual analysis in general, we might take a mixture of these different strategies together to analyze, answer, and argue on one topic.

Computational text analysis provides another set of strategies to do this, with a specific emphasis on what is (or isn't) a pattern across numerous texts. To do this, text collections are fed to computer programs that count, contrast, and correlate events that happen in texts, ranging from the basic frequencies of individual words and phrases to more complex inferences of what events occur for a particular character or what sentiments or moods are prevalent around mentions of a specific theme. Approaches to find these patterns range from direct programming (e.g. writing code in Python or R) to interacting with visual interfaces (like Voyant or Tableau) to options that merge the two (like building an extensive spreadsheet in Excel).

In my experience, there's an amazing ecosystem of different tools and tutorials out there to train models, compute statistics, and visualize results around text: it's just a matter of getting the text you care about into the right state to be usable by that system. And that's where **text data curation** comes in.

I define **text data** to be a structured format of textual information (that is, sequences of symbols that encode human language) intended for use in some kind of processing and computation. When we treat text as data, we have to enforce a certain degree of uniformity in how this text looks so the processing can be consistent. Most human-generated text collections, however, are not so tidy, no matter where they come from, who collected them, or what journey the text took to become a modern-day record. To succeed in text data analysis, we need to be able to grapple with this non-uniformity, to **curate** what text stays, how to store it, and how to transform it into something uniform.

Today, we're going to focus primarily on the first two questions: how to select the documents to keep and what representations are convenient for storing that text.

## TSVs and a basic filtering task

It's important to acknowledge before we embark on this project that the process of text data curation for these applications is not neutral: it's *subjective* and *destructive*. Let's think about a case study to understand how this looks.

Suppose I'm interested in studying dominant themes in contemporary Spanish language poetry. To start, I want to find a large collection of these poems. I find out that the website [Poemas del Alma ("Poems of the Soul")](https://www.poemas-del-alma.com/) has hundreds of thousands of user-submitted poems from 2009 onwards. I also find an existing script written in Python to scrape the literary poems on the website, which I could adapt to get the user poems as well.

Let's take a look at the *metadata* of these. This includes the titles, authors, month of publication, and URLs for each poem. I've actually created four different versions of this metadata file: a TSV file, a CSV file, a JSON file, and an XML file. We'll start by loading in the TSV file. I've added some helper functions for reading and writing TSVs into lists of dictionaries below:

In [29]:
import csv

def read_tsv_as_dicts(filename):
    """Load in a TSV, or tab-separated value file, using Python's 
    built-in library `csv` for parsing fixed delimiter files. Loads
    in each row as a dictionary."""
    # open the file in read mode; newline='' lets the csv module
    # handle line endings inside quoted fields itself
    with open(filename, newline='') as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter='\t')
        row_dictionaries = [row for row in reader]
    return row_dictionaries

def write_tsv_from_dicts(rows, filename):
    """Given a list of dictionaries with consistent keys, writes
    out a tab-separated value file using Python's built-in library
    `csv` to interpret the rows.
    """
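    # open a file in write mode; newline='' lets the csv module
    # control line endings, as its documentation recommends
    with open(filename, 'w', newline='') as tsv_file: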
        # we grab the list of column names from the keys of
        # one of the rows
        columns = list(rows[0].keys())
        writer = csv.DictWriter(tsv_file, columns, delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)

This function gives us `row_dictionaries`, a list of dictionaries that each contain the information of one row. In this case, each row will have the metadata of one poem. Because this is a TSV file, each of the poems has one line in the file, with a **T**ab character between each entry to delimit the different pieces of information into columns. (Because people don't really use tabs in their poem names, this is a pretty safe option to split our text.) We can see the names of the columns by looking at the first 10 lines of the file:

In [17]:
with open('poetry_metadata_2009.tsv') as metadata_file:
    # print the first ten raw lines of the file
    for i in range(10):
        print(metadata_file.readline())

url	title	author	year	month	page

//www.poemas-del-alma.com/blog/mostrar-poema-36	Haiku.  Yo soy	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-49	"Poema Épico  ""Tecum Umán"""	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-56	-Sorsonete- “LA  LIBERTAD  DE  CRISEIDA”	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-58	Romance nuevo SIRENA CONSENTIDA	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-78	Vacia por dentro	 sha_nena	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-79	Sola en la realidad	 sha_nena	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-88	El destino de ser flor.	 Lizelizalde	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-89	Recuerdo de Infancia	 Angela	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-91	Al amar	 Oscar Raul Quiroz Cortejana	2009	2	0



In this file, we can see the 0th line (remember, Python starts counting at 0) gives us the names of each of the columns, and each of the lines after that is for a specific poem. This sort of file can get loaded into Excel if you want cleanly visible columns, but if we just read it as a plain text file, we can still roughly sort out what items should be present in what order. It's a little easier to read these once we have them loaded in as dictionaries in Python, so let's look at one of those:

In [30]:
# read in the data from a TSV format
poetry_metadata = read_tsv_as_dicts('poetry_metadata_2009.tsv')

for i in range(10):
    print(poetry_metadata[i])

{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-36', 'title': 'Haiku.  Yo soy', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-49', 'title': 'Poema Épico  "Tecum Umán"', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-56', 'title': '-Sorsonete- “LA  LIBERTAD  DE  CRISEIDA”', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-58', 'title': 'Romance nuevo SIRENA CONSENTIDA', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-78', 'title': 'Vacia por dentro', 'author': ' sha_nena', 'year': '2009', 'month': '2', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-79', 'title': 'Sola en la realidad', 'author': ' sha_nena'

We can guess pretty quickly that this dataset is in chronological order starting from January (month 1) 2009, which only had four poems, all by the same author. That makes sense, since the website may have just been starting up, but does lead us to ask a question: are there trends in the number of poems being added over time? And how distributed across authors are our poems?

These are questions we can answer using the `Counter` class, which you may have encountered already in Intro to Python. Whether or not you have, the quick version is that it's a special kind of dictionary that is designed to count how many times each unique item shows up in a sequence. We can use it to sort out how many times each author shows up once we've loaded in our data:

In [31]:
from collections import Counter

# make a list of the author from each row of poetry metadata, then
# give that list to the Counter to count up
author_counter = Counter([row['author'] for row in poetry_metadata])

# list authors in decreasing order of how many poems they wrote.
# To just list the top K authors, you can say .most_common(K), e.g.
#    top_100_authors = author_counter.most_common(100)
top_authors = author_counter.most_common()
print("Total authors:", len(top_authors))
for author, count in top_authors:
    print(count, author)

Total authors: 1757
245  ANEUDIS PEREZ
183  julio oropeza
168  FERNANDO CARDONA
158  Violeta
151  Eddy Gtz
148  Jesus Paredes Ortiz
138  ivan semilla
135  Franklin Sandi
133  Sergio Jacobo "el poeta irreverente"
124  luz
109  Antonia Ceada Acevedo
103  Ra_Tito
98  Geovani
98  MODESTOELPOETA1953
97  Sorgalim Narud
96  Rosa de los vientos
94  El de las Rosas
91  luna de hielo
90  Nano_Veliz
87  checovick
85  AZULNEFERTARY
82  angelab
78  Eco del alma
78  Cyrene
77  Adrian VeMo
77  hmoliut
72  eowyn
71  YoKo
70  Roberto Moran
68  migreriana
67  JESIMAR
66  SERGIO FERNANDO
64  Faeton
64  Félix Moreno
63  Bendecido7
62  H3c70r P3r32
62  saly_rosa
60  el duende
60  William Cerdas Logan
58  Rafael Merida Cruz-Lascano
58  skyfire
56  C.J poeta
55  Miguel Angel Ortigoza García
52  cbastias
50  Alejandro
50  KALITA_007
49  aby1982
48  figueredo jorge
47  Lord VanVle
47  sagui
46  Herrera Andreyna
46  shiny
45  ELEPE
44  LindaSakura
44  Josue sz
44  maicolmanya
43  psy_angelito
43  gatoconbotas_5

We can scroll through this list and see we have some variety in how prolific our authors are: a small number have written a hundred or more poems, while a much larger number have only written one each. If we wanted to track how authors changed over time, we could limit ourselves just to poets who had contributed more than X poems. However, most projects I work on that take on something like this are more interested in not overrepresenting significant contributors.
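As an aside, the earlier question about trends over time can be approached with the same `Counter` pattern. The sketch below is a minimal illustration: the three rows in `sample_rows` are stand-ins made up for this example, and in the notebook you would pass `poetry_metadata` instead.

```python
from collections import Counter

# Hypothetical stand-in rows; in the notebook, use `poetry_metadata` instead
sample_rows = [
    {'author': ' Angela', 'year': '2009', 'month': '2'},
    {'author': ' sha_nena', 'year': '2009', 'month': '2'},
    {'author': ' luz', 'year': '2009', 'month': '12'},
]

# Count poems per month; convert to int so months sort numerically
# (as strings, '12' would sort before '2')
month_counter = Counter(int(row['month']) for row in sample_rows)

for month in sorted(month_counter):
    print(month, month_counter[month])
```

Converting the month to an integer before counting is a small design choice that keeps the months in calendar order when printed.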

Let's look at a piece of code that samples no more than 10 poems from each author. This uses a special kind of Python dictionary, the `defaultdict`, that's a cousin of the Python `Counter`: it allows you to specify the type of the values in the dictionary so that when you access a key that hasn't been used before, it provides a default value of that type. For instance, a `defaultdict(int)` would have a default value of 0, while a `defaultdict(list)` defaults to an empty list. (You can specify more complicated functions if you want for these, but let's leave it at that for now.)
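To make the default-value behavior concrete, here is a small sketch using a made-up list of words rather than our poetry data. The variable names are just for illustration.

```python
from collections import defaultdict

word_counts = defaultdict(int)       # missing keys start at 0
words_by_letter = defaultdict(list)  # missing keys start as an empty list

for word in ['alma', 'amor', 'luz', 'alma']:
    # no need to check whether the key exists first
    word_counts[word] += 1
    words_by_letter[word[0]].append(word)

print(dict(word_counts))
print(dict(words_by_letter))
```

With a plain dictionary, both lines inside the loop would raise a `KeyError` the first time a new key appeared.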

We'll use our `defaultdict` to make a list of the entries as the value for each author key by appending each entry to the list for that author. Then, we'll use Python's built-in `random` library to grab a sample for any that are too long.

In [33]:
from collections import defaultdict
import random

# Using the poetry_metadata variable from a few cells ago,
# we'll make a list for each author
metadata_by_author = defaultdict(list)
for meta_dict in poetry_metadata:
    metadata_by_author[meta_dict['author']].append(meta_dict)

# Iterate through each of the keys (author names) in the dictionary
# and add up to max_per_author poems to our filtered list
max_per_author = 10
filtered_author_metadata = []
for author in metadata_by_author:
    if len(metadata_by_author[author]) > max_per_author:
        filtered_author_metadata += random.sample(metadata_by_author[author], max_per_author)
    else:
        filtered_author_metadata += metadata_by_author[author]
        
print("Length of original collection:", len(poetry_metadata))
print("Length of filtered collection:", len(filtered_author_metadata))

Length of original collection: 14519
Length of filtered collection: 7592


We've now filtered down our collection by making some choices: namely, deciding we don't want to overrepresent any one author, and more specifically, that we want no more than ten poems from each author. This limits the size of our collection, but it may also make it easier for us to make certain arguments about what is (or isn't) in there; for instance, it would be harder to make the case that a particular trend was just the effect of one or a small group of members in the Poemas community.
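One thing to keep in mind is that `random.sample` picks different poems on every run, so the filtered collection changes each time the cell is rerun. Below is a minimal sketch of one way to make the sample repeatable, assuming a fixed seed is acceptable for your project. The helper name `sample_per_author`, the seed value, and the toy rows are all made up for illustration.

```python
import random
from collections import defaultdict

def sample_per_author(rows, max_per_author, seed):
    """Keep at most `max_per_author` rows per author, reproducibly."""
    # a private generator, so the seed fully determines the sample
    rng = random.Random(seed)
    rows_by_author = defaultdict(list)
    for row in rows:
        rows_by_author[row['author']].append(row)
    sampled = []
    for author_rows in rows_by_author.values():
        if len(author_rows) > max_per_author:
            sampled += rng.sample(author_rows, max_per_author)
        else:
            sampled += author_rows
    return sampled

# Hypothetical toy data: one prolific author and one occasional author
toy_rows = [{'author': 'A', 'title': f'poem {i}'} for i in range(5)]
toy_rows.append({'author': 'B', 'title': 'only poem'})

first = sample_per_author(toy_rows, max_per_author=2, seed=2022)
second = sample_per_author(toy_rows, max_per_author=2, seed=2022)
print(len(first), first == second)
```

Recording the seed alongside the output file is one more piece of information that helps someone recreate the same filtered collection later.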

Once we have a filtered version of our collection, it's worth writing it out so that we don't have to regenerate it each time we run subsequent analyses. There are two reasons for this: first, because running this sort of processing can take a while on a larger collection, and second, because it helps us keep track of what version of the text we are using.

In [34]:
write_tsv_from_dicts(filtered_author_metadata, "poetry_metadata_22_6_27-author_limit_10.tsv")

Importantly, when we write out a processed version of a collection, we always want to keep track of the changes we made from the original collection to help with reporting and recreating our process later. Using an informative filename can help that (almost all my processed data files include the date I made them as well as some keywords for what changed), but there's no replacement for having an ongoing document that keeps track of your intermediate versions of "cleaned" text collections.

**Warning**: Using something like a Jupyter notebook can make this problem feel like it's already solved, since you can add text around where you generated a particular file. However, **Jupyter notebooks only work as logs of your procedure if you don't go back and edit the script you used to generate the data!** If you think you might make a version 1 and then make some changes to your process for a version 2, 3, and so on, you should either make sure to record what you did in version 1 clearly in some place where that information won't get changed, or you should delete everything produced from the version 1 procedure so it can never accidentally be reused. Not doing this causes a *lot* of problems, especially if you have multiple people working on a project who might mistake an old file for the one to use.

We've now seen a short introduction to working with TSV data. We could also have stored our data in comma-separated value (CSV) files by taking the code above and omitting the `delimiter` keyword:


In [36]:
# This code should look familiar!
import csv

def read_csv_as_dicts(filename):
    """Load in a CSV, or comma-separated value file, using Python's 
    built-in library `csv` for parsing fixed delimiter files. Loads
    in each row as a dictionary."""
    # open the file in read mode; newline='' lets the csv module
    # handle line endings inside quoted fields itself
    with open(filename, newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        row_dictionaries = [row for row in reader]
    return row_dictionaries

def write_csv_from_dicts(rows, filename):
    """Given a list of dictionaries with consistent keys, writes
    out a comma-separated value file using Python's built-in library
    `csv` to interpret the rows.
    """
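    # open a file in write mode; newline='' lets the csv module
    # control line endings, as its documentation recommends
    with open(filename, 'w', newline='') as csv_file:
        # we grab the list of column names from the keys of
        # one of the rows
        columns = list(rows[0].keys())
        # no delimiter argument: the default delimiter is a comma
        writer = csv.DictWriter(csv_file, columns)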
        writer.writeheader()
        writer.writerows(rows)

If you want, you can try these out and explore the differences in how this renders files. CSV files rely on quote characters and commas to separate out fields, which comes with a bit of danger for text processing, since commas show up quite often in text. CSV files will typically use a double quote `"` as a *quote character* to define the boundaries of text so that commas inside a piece of text aren't read as the end of a column. This, in turn, produces interesting quirks for how to render quotes: a literal double quote inside a quoted field is usually written as two double quotes in a row. (The same convention shows up in our TSV file above, where the title `Poema Épico  "Tecum Umán"` is stored as `"Poema Épico  ""Tecum Umán"""`.) If you use Python's `csv` library, it should take care of all of that for you, defaulting to the same behavior as what Microsoft Excel does, as specified by the default argument `dialect='excel'`. If you want to change this to another structure, I would recommend against trying to code it yourself, since it's easy to introduce errors, and instead check what dialect makes the most sense to use. You can even find Excel's official tab-separated value dialect!

In [38]:
csv.list_dialects()

['excel', 'excel-tab', 'unix']
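To see the quoting behavior described above without touching any files, we can write a row to an in-memory buffer and look at the raw text. This is a small sketch: the first title comes from our metadata, while the second field, `Fin, otra vez`, is made up to include a comma.

```python
import csv
import io

# Write one row to an in-memory "file" so we can inspect the raw text
buffer = io.StringIO()
writer = csv.writer(buffer)  # uses the default dialect, 'excel'
writer.writerow(['Poema Épico  "Tecum Umán"', 'Fin, otra vez', '2009'])
raw_line = buffer.getvalue()
print(raw_line)

# Reading the line back recovers the original three fields
fields = next(csv.reader(io.StringIO(raw_line)))
print(fields)
```

Only the fields that contain a quote or a comma get wrapped in quotes; the plain `2009` field is written as-is.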

We've looked at storing and saving delimited files. Let's look at another format, the JSON file.

## JSON, standards, and encodings


JSON stands for **J**ava**S**cript **O**bject **N**otation. It's a syntax to describe structures of information based on how JavaScript makes objects, but because of its flexibility and clarity, it's also used in a variety of web applications and as a storage mechanism for some datasets.

We can use Python's built-in `json` library to see what it would look like to write out the last ten entries of our unfiltered metadata from before:

In [43]:
import json

# use Python slicing and negative indexing to grab the
# last 10 elements of the list
last_ten_metadata = poetry_metadata[-10:]
last_ten_json = json.dumps(last_ten_metadata)
print(last_ten_json)

[{"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27379", "title": ":::::Sentidos:::::", "author": " Cock", "year": 2009, "month": 12, "page": 8}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27382", "title": "TODO ES PERFECTO", "author": " chitto_cat", "year": 2009, "month": 12, "page": 8}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27383", "title": "Esperandote", "author": " sagui", "year": 2009, "month": 12, "page": 8}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27384", "title": "Al verte", "author": " Wilkis Santana", "year": 2009, "month": 12, "page": 8}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27385", "title": "CUENTA REGRESIVA", "author": " Elediz", "year": 2009, "month": 12, "page": 8}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27386", "title": "Fin...", "author": " robert marbre", "year": 2009, "month": 12, "page": 8}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-27388", "title": "Felizzz 2010!!", "author":

It looks like a list of dictionaries, each with the same keys but different values. We haven't done anything yet to indicate what's a number and what's a string: `csv.DictReader` reads every field as a string, so when these cells are run in order, the year, month, and page are written out in quotes just like the title. We also might have a little trouble reading this the way it's printing right now, since everything's running together in one long line. So let's see if we can make some fixes.

First, let's turn our values for year, month, and page into actual integers! It's a basic for loop, but it's the sort of thing I end up doing all the time to help read in structured information by recasting the text of a number as an actual number. To get integers, I'll just use `int()`.

In [40]:
for meta_dict in last_ten_metadata:
    for key in ["year", "month", "page"]:
        # replace the string with an integer from that string
        meta_dict[key] = int(meta_dict[key])

Better - now let's try writing out the JSON format more legibly. One thing that would help is having some visible indentation to help us tell when something new is starting. The Python `json` library will let us do that with the keyword argument `indent` - every time a new dictionary or list starts, it will use an additional `indent` number of spaces to pad the start of the line. Here, with our list of ten items, it'll look like this:

In [45]:
# Pretty print our list of dictionaries
pretty_ten_json = json.dumps(last_ten_metadata, indent=2)
print(pretty_ten_json)

[
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-27379",
    "title": ":::::Sentidos:::::",
    "author": " Cock",
    "year": 2009,
    "month": 12,
    "page": 8
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-27382",
    "title": "TODO ES PERFECTO",
    "author": " chitto_cat",
    "year": 2009,
    "month": 12,
    "page": 8
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-27383",
    "title": "Esperandote",
    "author": " sagui",
    "year": 2009,
    "month": 12,
    "page": 8
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-27384",
    "title": "Al verte",
    "author": " Wilkis Santana",
    "year": 2009,
    "month": 12,
    "page": 8
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-27385",
    "title": "CUENTA REGRESIVA",
    "author": " Elediz",
    "year": 2009,
    "month": 12,
    "page": 8
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-27386",
    "title": "Fin...",
   

Okay, that's much easier to read - we can see each entry separately, and we can see that while there are quotes around the URL, title, and author, the year, month, and page don't have quotes - we're supposed to read them as a piece of code would, which is as raw numbers, not as characters in a sequence.
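Going in the other direction, `json.loads` parses a JSON string back into Python objects, and the distinction between numbers and strings survives the round trip. Here is a minimal sketch with a made-up entry in which `year` is an integer and `month` is deliberately left as a string.

```python
import json

# Made-up entry: year is an int, month is deliberately left as a string
entry = {"title": "Fin...", "year": 2009, "month": "12"}

json_text = json.dumps(entry)     # Python object -> JSON text
restored = json.loads(json_text)  # JSON text -> Python object

print(type(restored["year"]), type(restored["month"]))
print(restored == entry)
```

The `json` library also provides `json.dump` and `json.load`, which do the same conversions while writing to or reading from an open file.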

Importantly, these two versions of printing the text will have very different lengths and contents. We can also compare this to how much space the last ten lines would be in TSV form, which would just combine the fields in order. As mentioned before, we should usually let the `csv` library do the work of reading and writing these files, but for this example, I'll just turn everything back into a string myself and combine each line together with tabs using Python's string `join` method, which combines a list of strings into one using the calling string as a delimiter:

In [48]:
tsv_keys = list(last_ten_metadata[0].keys())

original_ten_tsv = ""
for metadata in last_ten_metadata[-10:]:
    # combine all fields with tabs
    row_columns = [str(metadata[k]) for k in tsv_keys]
    line = "\t".join(row_columns)
    # add the combined line and newline character
    original_ten_tsv += line + "\n"

print(original_ten_tsv)

//www.poemas-del-alma.com/blog/mostrar-poema-27379	:::::Sentidos:::::	 Cock	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27382	TODO ES PERFECTO	 chitto_cat	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27383	Esperandote	 sagui	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27384	Al verte	 Wilkis Santana	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27385	CUENTA REGRESIVA	 Elediz	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27386	Fin...	 robert marbre	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27388	Felizzz 2010!!	 Lau_22	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27391	Una sombra mas	 MrShadow23	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27395	A veces...	 migreriana	2009	12	8
//www.poemas-del-alma.com/blog/mostrar-poema-27399	Amor infinito	 Latino	2009	12	8



Now, let's look at the relationship between how easy it is for us to read these strings versus how much space they take:

In [50]:
print("Number of characters in ten metadata fields")
print("Pretty JSON:", len(pretty_ten_json))
print("Raw JSON:", len(last_ten_json))
print("Raw TSV:", len(original_ten_tsv))

Number of characters in ten metadata fields
Pretty JSON: 1802
Raw JSON: 1500
Raw TSV: 850


Surprised? Probably not - we spent extra space in our JSON representation to rewrite the names of our "columns" in every entry, and in the pretty printing, we also added bonus spaces. Of course, because of this format, JSON also has some flexibility we're not using: for instance, we can nest lists and objects inside other lists and objects, in the same way we could make a Python dictionary a value inside a Python dictionary. We can also choose to have some "keys" or attributes exist only some of the time; maybe if I had an author profile page for some authors but not others, I could add that attribute only where I need it in the JSON representation, but I would have to consistently have that column exist whether populated or not in the TSV representation. But it's worth talking about space because when we're representing data on computers in general, space adds up!
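As a sketch of that flexibility, the example below nests an object and a list inside one entry and leaves the attribute out of the other. The `profile` attribute, its URL, and its tags are hypothetical and are not part of our actual metadata.

```python
import json

# Made-up entries: only the first has a (hypothetical) profile attribute,
# and its value is itself a nested object containing a list
entries = [
    {"title": "Al verte", "author": "Wilkis Santana",
     "profile": {"url": "https://example.com/profile", "tags": ["amor"]}},
    {"title": "Fin...", "author": "robert marbre"},
]

restored = json.loads(json.dumps(entries))
for entry in restored:
    # .get() returns None when the optional key is absent
    profile = entry.get("profile")
    print(entry["title"], "->", profile["tags"] if profile else "no profile")
```

In a TSV version of the same data, every row would need a `profile` column, and the nested list of tags would have to be flattened into a single string somehow.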

To break this down a little more, let's review how strings store data. (I say review because I believe some of this is covered in the Intro to Python sequence for TAP.) A string is a list of characters, or symbols. Since computer memory is built to store information as binary numbers, a string is stored as a list of binary numbers, with one number for each character we see, which we call that symbol's *code point*. For instance, for pretty much any computer you'll use, the letter `A` has code point 65 if we're counting in base ten, and `a` has code point 97.
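Python exposes this mapping directly through the built-in functions `ord` (character to code point) and `chr` (code point to character). A quick sketch:

```python
# ord() gives the code point of a character, in base ten
print(ord("A"), ord("a"))

# chr() goes the other way, from code point to character
print(chr(65), chr(97))

# a string is a sequence of characters, so we can list its code points
code_points = [ord(ch) for ch in "alma"]
print(code_points)
```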

A listing of these numbers is a *standard*. These numbers both come from the [ASCII standard](https://en.wikipedia.org/wiki/ASCII), or American Standard Code for Information Interchange, which dates back to the 1960s. The ASCII standard is a product of its place and time: it goes from 0 to 127 and, fitting the expectations of popular characters for US English, it includes digits, Latin letters without accents, punctuation, spacing, and a series of special symbols meant to match up to different typewriter operations. (After all, back in the 1960s, typewriters were a standard way to handle input and output for computers.)

The standard we interact with commonly online and on our phones is the [Unicode standard](https://en.wikipedia.org/wiki/Unicode). Unicode starts with the same 128 symbols as ASCII, but then extends well beyond that to include characters from other alphabets, diacritical marks, stylized symbols, and even emoji. It's also updated annually by the Unicode Consortium, a non-profit whose voting members comprise many well-known tech companies and research organizations. At the time of producing this tutorial, [Unicode 15.0 is about to be rolled out with over 149,000 characters](https://home.unicode.org/unicode-15-0-beta-review/).

Of course, this gets at the way we map symbols to numbers, but to actually encode text - that is, to write it into our computer files and memory - we need a way to turn those numbers into a consistent sequence of ones and zeros. We call this a text encoding. Unicode actually has several different ways to do this.

Let's give ourselves an example made-up username with some non-ASCII characters to see how this looks:

In [57]:
username = "Mar\u00EDa \u2615"
print(username)

María ☕


Python 3 will generally let us paste non-ASCII characters directly into a string, but they can be hard to type and easy to garble when copying, so to be explicit about which characters we mean, we use the `\u####` format to specify the code point of the symbol we want in *hexadecimal*. Hexadecimal is base 16, which has some more digits than the base-10 *decimal* counting system we're used to - it goes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10. The hexadecimal number 10 is equivalent to the decimal number sixteen. So, to write the decimal number forty-two, we'd use two sixteens and a ten, which we'd write as 2A (2\*16 + 10). If we wanted to write out sixteen squared in hexadecimal, we could just write 100, the same way 100 in our usual decimal counting system is ten squared. Hexadecimal is convenient for programming because each hexadecimal digit corresponds to exactly four bits.

We wrote the username using two hexadecimal code points, one for our i-acute and one for our coffee emoji. We can even use Python to convert the numbers out of hexadecimal (or "hex") if we'd like:

In [59]:
print("\u00ED", "has hex code point 00ED and dec code point", int("00ED", 16))
print("\u2615", "has hex code point 2615 and dec code point", int("2615", 16))

í has hex code point 00ED and dec code point 237
☕ has hex code point 2615 and dec code point 9749
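The hexadecimal arithmetic described earlier can be checked the same way. This sketch goes in both directions and also prints the four bits behind each hex digit; the format codes `"X"` and `"04b"` are standard Python format specifiers.

```python
# int(text, 16) reads a string as a hexadecimal number.
forty_two = int("2A", 16)
print(forty_two)        # two sixteens plus ten
print(int("100", 16))   # sixteen squared

# format() goes the other way: "X" gives uppercase hex,
# and "04b" gives binary padded to four digits.
print(format(42, "X"))
for digit in "2A":
    print(digit, "->", format(int(digit, 16), "04b"))
```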


Now that we have a Unicode string, let's think about exactly how much memory (not just how many characters) will get used by our different strings when we write them out using a particular encoding:

In [77]:
# get the text encoded using different encodings
for encoding in ['utf-8', 'utf-32', 'latin1']:
    byte_val = "Mar\u00EDa".encode(encoding)
    print(encoding, "needs", len(byte_val), "bytes")
    print("Hex code:", byte_val.hex())
    print()

utf-8 needs 6 bytes
Hex code: 4d6172c3ad61

utf-32 needs 24 bytes
Hex code: fffe00004d0000006100000072000000ed00000061000000

latin1 needs 5 bytes
Hex code: 4d6172ed61



What's going on? I've used three different encodings here for just the name María: two associated with Unicode and one, `latin1`, that predates Unicode. (Heads up, `latin1`, or `ISO-8859-1`, used to be a common encoding for text based on a Latin alphabet with accents, so you may run into it now and again. If reading a file as UTF-8 raises a `UnicodeDecodeError` or shows replacement symbols where accented letters should be, try reading the file using `encoding="latin1"` and see if that fixes it. If you instead get a bunch of As with accents, like `Ã`, in place of the text you expect, the file is probably UTF-8 that got read as `latin1` or a similar encoding, so try `encoding="utf-8"`.)
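Here is a small sketch of what a mismatch between these encodings looks like in practice, using the same name as above. It works on in-memory bytes rather than a real file, so no file names are assumed.

```python
name = "Mar\u00EDa"

# UTF-8 bytes misread as latin1: the two bytes of the accented letter
# turn into two separate characters, the first being an accented A.
garbled = name.encode("utf-8").decode("latin1")
print(repr(garbled))

# latin1 bytes misread as UTF-8: the lone byte 0xED is not valid UTF-8.
try:
    name.encode("latin1").decode("utf-8")
except UnicodeDecodeError as error:
    print("Could not decode:", error.reason)

# Undoing the wrong step recovers the original text.
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)
```

The repair in the last step only works when no information was lost along the way, so it is safer to read a file with the right encoding in the first place.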

To break down what's going on here with each encoding:
* `utf-8` uses a variable-length encoding, where the leading bits of each byte say whether that byte stands alone, starts a multi-byte sequence (and how long the sequence is), or continues one. This means that for the four letters contained in ASCII, it only uses one byte each, but for the accented í, it has to use two bytes, giving us 6 total bytes.
* `utf-32` is fixed-length, but it uses four bytes (32 bits) for each symbol, plus an additional symbol at the front, called a byte order mark, that tells it how it'll order the four bytes (since some programs read bytes front to back and others back to front). So, we get 5 * 4 + 4 bytes, or 24. If the text were all emoji, more of those bytes would carry information, but since it's mostly ASCII, we see a lot of 0s in the extra bytes.
* `latin1` is explicitly designed to support some Latin characters with accents without needing a second byte, so it is able to write the acute í in one byte, using only five bytes. (However, if we needed to write the coffee emoji, it'd throw an error - try it!)
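The three points above can be inspected byte by byte. This is a minimal sketch, assuming only Python's built-in codecs; `utf-32-le` is the little-endian variant that names its byte order explicitly.

```python
name = "Mar\u00EDa"

# UTF-8: show each byte in binary. ASCII letters start with a 0 bit;
# the accented letter becomes a 110xxxxx byte plus a 10xxxxxx byte.
for symbol in name:
    bits = " ".join(format(byte, "08b") for byte in symbol.encode("utf-8"))
    print(symbol, bits)

# UTF-32: the generic codec prepends a 4-byte byte order mark;
# naming the byte order explicitly leaves it out.
print(len(name.encode("utf-32")), len(name.encode("utf-32-le")))

# latin1 has no slot for the coffee emoji, so encoding fails.
try:
    "\u2615".encode("latin1")
except UnicodeEncodeError as error:
    print("latin1 cannot encode:", error.reason)

# UTF-8 handles the emoji, at the cost of a longer byte sequence.
coffee_utf8 = "\u2615".encode("utf-8")
print(len(coffee_utf8), coffee_utf8.hex())
```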

___
[Proceed to next lesson: Text Curation 2/3 ->](./textcuration-2.ipynb)


