<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Xanda Schofield](https://www.cs.hmc.edu/~xanda) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email xanda@cs.hmc.edu.<br />
____

# Text Data Curation 1

This is lesson 1 of 3 in the educational series on Text Data Curation. This notebook is intended to introduce the basics of treating text documents as data and how to store and filter those documents. 

**Audience:** `Learners` / `Researchers`

**Use case:** [`How-To`](https://constellate.org/docs/documentation-categories#howtoproblemoriented) 

**Difficulty:** `Intermediate`
This course is open to those with a basic level of proficiency in Python. Taking the Python Basics course the week before is sufficient.

**Completion time:** 90 minutes

**Knowledge Required:** 

* Python basics (variables, flow control, functions, lists, dictionaries)
* How Python libraries work (installation and imports)


**Knowledge Recommended:**

* Basic file operations (open, close, read, write)
* How text is stored on computers (encodings, file types)


**Learning Objectives:**
After this lesson, learners will be able to:

1. Parse and generate XML, JSON, and CSV files given raw text documents
2. Identify when text encodings affect the interpretation of text data
3. Use a lexicon to select relevant documents within a text collection
4. Use fuzzy matching to remove duplicate items from a collection

___

## Install Required Libraries

In [None]:
### Installs and Imports ##
!pip install beautifulsoup4 nltk

# Import libraries
import csv
import json
import os
import urllib.request

# Required Data

**Data Format:** 
* delimited files (.csv, .tsv)
* structured files (.json, .xml)

**Data Source:**
* [Contemporary Spanish Poetry Metadata](https://cs.hmc.edu/~xanda/data/poemas_metadata.json): Spanish poem names scraped from [Poemas del Alma](https://www.poemas-del-alma.com/). Scraped by Xanda Schofield in June 2022 and encoded in UTF-8. Note that this does not contain the poems for redistribution reasons (though if you are interested in looking at the full body of poems for a project, let me know!)


## Download Required Data

In [None]:
### Retrieve multiple files using a list ###

download_urls = [
    'https://cs.hmc.edu/~xanda/data/poemas_metadata.json',
    'https://cs.hmc.edu/~xanda/data/poemas_metadata.tsv'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])

# Introduction

This is the first of three lessons on **text data curation**. What does the term mean? First, when I talk about *text data*, I refer to text that has been serialized in a computer into some numerical representation. When I refer to curating that data, I think about the OED's [definition of curation](https://www.lexico.com/en/definition/curation): "The action or process of selecting, organizing, and looking after the items in a collection or exhibition." When we curate text data for the purpose of computational text analysis, our goal is to select, organize, and look after a dataset of text documents whose direct audience will be a computer program: we need to provide consistency and precision in our representation so that the program can run. However, it is also important that the output of the computer program itself speaks to the reasons we are interested in these texts, so our text must make sense both to a computer program and to us. While these two audiences aren't totally at odds, they need different things. This tutorial aims to show how to navigate that, both with specific examples of ways we curate text in a dataset (through filtering, cleaning, and normalizing) and through a discussion of strategies for poking and prodding at these large collections in order to notice when something is amiss.

This tutorial is aimed at people interested in computational text analysis who are acquainted with the basics of programming in Python but are still new to developing models of text or more complex natural language processing tasks, and who are embarking on a quest to work with a large collection of text using computational tools. In this first section, we will look specifically at the ways we store text and collections of documents on computers and at tools we can use to filter those larger collections into something usable.

# Lesson


## When text is "data"
There are lots of ways to gain understanding of texts. We can read the texts themselves closely, read about the history of the texts' subject and of its authors, dig up related or contemporary texts, and grapple with the critical responses others have had to each. Depending on the questions we want to answer or our worldview about textual analysis in general, we might take a mixture of these different strategies together to analyze, answer, and argue on one topic.

Computational text analysis provides another set of strategies to do this, with a specific emphasis on what is (or isn't) a pattern across numerous texts. To do this, text collections are fed to computer programs that count, contrast, and correlate events that happen in texts, ranging from the basic frequencies of individual words and phrases to more complex inferences of what events occur for a particular character or what sentiments or moods are prevalent around mentions of a specific theme. Approaches to find these patterns range from direct programming (e.g. writing code in Python or R) to interacting with visual interfaces (like Voyant or Tableau) to options that merge the two (like building an extensive spreadsheet in Excel). In my experience, there's an amazing ecosystem of different tools and tutorials out there to train models, compute statistics, and visualize results around text. However, without text data curation, one cannot actually learn anything from these models.

If you take nothing else away from this week, there are two main things I want to revisit:
1. Text data curation requires making subjective decisions, and it is important to think about and document these choices.
2. Text data curation succeeds when we are curious and suspicious about the contexts of our text collection, so we should continue to think of ways to check that the text we have looks the way we think it does.

## TSVs and a basic filtering task

Before we embark on this project, it's important to acknowledge that the process of text data curation for these applications is not neutral: it's *subjective* and *destructive*. Let's think about a case study to understand how this looks.

Suppose I'm interested in studying dominant themes in contemporary Spanish language poetry. To start, I want to find a large collection of these poems. I find out that the website [Poemas del Alma ("Poems of the Soul")](https://www.poemas-del-alma.com/) has hundreds of thousands of user-submitted poems from 2009 onwards. I did find a [dataset on Kaggle](https://www.kaggle.com/datasets/andreamorgar/spanish-poetry-dataset) drawn from the site, but it only has the literary poems that were reposted there (about 5,000), and I want all of the poems. So I start by writing a script to get the list of poems and their URLs so I can figure out which ones I want to download.

*Aside:* scraping a website (that is, running a program that systematically walks through pages of the site to download their contents) can be really tough on a web server unless you either put a sufficient amount of time between requests (so other traffic can get through) or, better yet, coordinate with the website's hosts on a way to do it that won't block web traffic.
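
To make the "time between requests" idea concrete, here is a minimal sketch of a loop that pauses between fetches. The function name `fetch_politely`, the `fetch` argument, and the two-second default are all my own placeholders rather than anything used to build this dataset, and the demonstration uses a stand-in fetch function so that no real requests are made.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between requests.

    `fetch` is a hypothetical callable supplied by the caller (for
    example, a thin wrapper around urllib.request.urlopen).
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # wait before every request except the first
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results

# Stand-in for a real download function; the URLs are placeholders.
def fake_fetch(url):
    return "contents of " + url

pages = fetch_politely(
    ["https://example.com/page1", "https://example.com/page2"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

How long a delay is "sufficient" depends on the site, which is one more reason to ask the hosts if you can.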

Let's take a look at the *metadata* of these poems. This includes the title, author, month of publication, and URL for each poem. I've actually created several different versions of this metadata file: a TSV file, a CSV file, a JSON file, and an XML file. We'll start by loading in the TSV file. I've added some helper functions below for reading TSVs into lists of dictionaries and writing them back out:

In [None]:
import csv

def read_tsv_as_dicts(filename):
    """Load in a TSV, or tab-separated value file, using Python's 
    built-in library `csv` for parsing fixed delimiter files. Loads
    in each row as a dictionary."""
    # open the file in read mode
    with open(filename, encoding='utf-8') as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter='\t')
        row_dictionaries = [row for row in reader]
    return row_dictionaries

def write_tsv_from_dicts(rows, filename):
    """Given a list of dictionaries with consistent keys, writes
    out a tab-separated value file using Python's built-in library
    `csv` to interpret the rows.
    """
    # open a file in write mode
    with open(filename, 'w', encoding='utf-8') as tsv_file:
        # we grab the list of column names from the keys of
        # one of the rows
        columns = list(rows[0].keys())
        writer = csv.DictWriter(tsv_file, columns, delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)

The first function `read_tsv_as_dicts` gives us a return variable `row_dictionaries`, a list of dictionaries that each contain the information of one row. In this case, each row will have the metadata of one poem. Because this is a TSV file, each of the poems has one line in the file, with a **T**ab character between each entry to delimit the different pieces of information into columns. (Because people don't really use tabs in their poem names, this is a pretty safe option to split our text.) We can see the names of the columns by looking at the first 10 lines of the file:

In [None]:
# We're storing these files with a utf-8 encoding
with open('poemas_metadata.tsv', encoding='utf8') as metadata_file:
    # print the first ten raw lines of the file
    for i in range(10):
        print(metadata_file.readline())

In this file, we can see the 0th line (remember, Python starts counting at 0) gives us the names of each of the columns, and each of the lines after that is for a specific poem. This sort of file can get loaded into Excel if you want cleanly visible columns, but if we just read it as a plain text file, we can still roughly sort out what items should be present in what order. It's a little easier to read these once we have them loaded in as dictionaries in Python, so let's look at one of those:

In [None]:
# read in the data from a TSV format as dictionaries
poetry_metadata = read_tsv_as_dicts('poemas_metadata.tsv')

for i in range(10):
    print(poetry_metadata[i])
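
If you'd like to see what `csv.DictReader` does without touching the real file, here is a small sketch that parses a made-up TSV held in a string. The column names and the two rows are invented for illustration, and `io.StringIO` is only there so the string can stand in for an open file.

```python
import csv
import io

# A tiny made-up TSV: a header line, then one line per poem
tsv_text = (
    "title\tauthor\tyear\n"
    "Soledad\tMar\u00EDa\t2015\n"
    "Oto\u00F1o\tLuis\t2019\n"
)

# io.StringIO lets csv read from a string as if it were a file
reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
toy_rows = [row for row in reader]

for row in toy_rows:
    print(row)

# every value comes back as a string, including the year
print(type(toy_rows[0]['year']))
```

The header line supplies the dictionary keys, and nothing in the file says which columns are numbers, so every value is read as text.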

**Exercise.** Look at the first 10 documents and then the last 10 documents (i.e. starting at index `-10`) - what data do we know about each poem? Is there anything missing or unclear?

From our discussion, we might have a question: are there trends in the number of poems being added over time? And how distributed across authors are our poems?

These are questions we can answer using the `Counter` class, which you may have encountered already in Intro to Python. Whether or not you have, the quick version is that it's a special kind of dictionary that is designed to count how many times each unique item shows up in a sequence. We can use it to sort out how many times each author shows up once we've loaded in our data:

In [None]:
from collections import Counter

# make a list of the author from each row of poetry metadata, then
# give that list to the Counter to count up
author_counter = Counter([row['author'] for row in poetry_metadata])

# list authors in decreasing order of how many poems they wrote.
# To just list the top K authors, you can say .most_common(K), e.g.
#    top_100_authors = author_counter.most_common(100)
top_authors = author_counter.most_common()
print("Total authors:", len(top_authors))
for author, count in top_authors:
    print(count, author)

**Exercise**. Scroll through the output of this counter - what do you notice?
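
The same `Counter` approach works for the question about trends over time. Here is a sketch on a few made-up rows; I'm assuming a `year` key like the one used later in this lesson, and the titles and authors are placeholders.

```python
from collections import Counter

# Made-up rows standing in for poetry_metadata
toy_metadata = [
    {'title': 'Soledad', 'author': 'A', 'year': '2010'},
    {'title': 'Mar', 'author': 'B', 'year': '2011'},
    {'title': 'Luz', 'author': 'A', 'year': '2011'},
    {'title': 'Noche', 'author': 'C', 'year': '2011'},
]

year_counter = Counter(row['year'] for row in toy_metadata)

# sorting the items orders them by year rather than by count
for year, count in sorted(year_counter.items()):
    print(year, count)
```

Swapping `toy_metadata` for `poetry_metadata` should give the count of poems per year for the real collection, assuming every row has a `year` value.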

If we wanted to track how authors changed over time, we could limit ourselves just to poets who had contributed more than X poems. However, most projects I work on that take on something like this are more interested in not overrepresenting significant contributors.

Let's look at a piece of code that samples no more than 10 poems from each author. This uses a special kind of Python dictionary, the `defaultdict`, that's a cousin of the Python `Counter`: it allows you to specify the type of the values in the dictionary so that when you access a key that hasn't been used before, it provides a default value of that type. For instance, a `defaultdict(int)` would have a default value of 0, while a `defaultdict(list)` defaults to an empty list. (You can specify more complicated functions for these if you want, but let's leave it at that for now.)
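
Here is a small sketch of both defaults in action on a toy word list (the words and variable names are just for illustration):

```python
from collections import defaultdict

letter_counts = defaultdict(int)      # missing keys start at 0
words_by_letter = defaultdict(list)   # missing keys start at []

for word in ["sol", "soledad", "mar", "luna"]:
    first_letter = word[0]
    # no need to check whether the key exists yet
    letter_counts[first_letter] += 1
    words_by_letter[first_letter].append(word)

print(dict(letter_counts))
print(dict(words_by_letter))
```

One thing to watch for: just looking up a missing key in a `defaultdict` creates that key with its default value.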

We'll use our `defaultdict` to build a list of entries as the value for each author key, appending each entry to the list for its author. Then, we'll use Python's built-in `random` library to grab a sample from any list that is too long.

In [None]:
from collections import defaultdict
import random

# Using the poetry_metadata variable from a few cells ago,
# we'll make a list for each author
metadata_by_author = defaultdict(list)
for meta_dict in poetry_metadata:
    metadata_by_author[meta_dict['author']].append(meta_dict)

# Iterate through each of the keys (author names) in the list
# and add up to 10 poems to our filtered list
max_per_author = 10
filtered_author_metadata = []
for author in metadata_by_author:
    if len(metadata_by_author[author]) > max_per_author:
        filtered_author_metadata += random.sample(metadata_by_author[author], max_per_author)
    else:
        filtered_author_metadata += metadata_by_author[author]
        
print("Length of original collection:", len(poetry_metadata))
print("Length of filtered collection:", len(filtered_author_metadata))

We've now filtered down our collection by making some choices: namely, deciding we don't want to overrepresent any one author, and more specifically, that we want no more than ten poems from each author. This limits the size of our collection, but it may also make it easier for us to make certain arguments about what is (or isn't) in there.

**Exercise.** Add code to ensure that we also exclude authors who have contributed fewer than 5 poems. How much more does this limit the corpus size? When might this be worthwhile?

Once we have a filtered version of our collection, it's worth writing it out so that we don't have to regenerate it each time we run subsequent analyses. There are two reasons for this: first, because running this sort of processing can take a while on a larger collection, and second, because it helps us keep track of what version of the text we are using. In this case, since we're doing something random to select the text, it's extra-important - if we rerun this step, we'll get a different subset of the corpus!

In [None]:
write_tsv_from_dicts(filtered_author_metadata, "poemas_metadata_22_6_27-author_limit_10.tsv")
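
If you would rather be able to regenerate the same random subset, one option is to sample from a random number generator with a fixed seed. This is a sketch on a made-up list, not the procedure used above, and the seed value 2022 is an arbitrary choice.

```python
import random

# Made-up stand-ins for one author's poems
poems = ["poem %d" % i for i in range(100)]

# A dedicated Random object with a fixed seed
rng_a = random.Random(2022)
sample_a = rng_a.sample(poems, 10)

# A second generator with the same seed makes the same choices
rng_b = random.Random(2022)
sample_b = rng_b.sample(poems, 10)

print(sample_a == sample_b)
```

Even with a seed, I would still write the filtered file out: the order of the input and the Python version can both affect what a seeded sample returns.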

Importantly, when we write out a processed version of a collection, we always want to keep track of the changes we made from the original collection to help with reporting and recreating our process later. Using an informative filename can help that (almost all my processed data files include the date I made them as well as some keywords for what changed), but there's no replacement for having an ongoing document that keeps track of your intermediate versions of "cleaned" text collections.

**Warning**: Using something like a Jupyter notebook can make this problem feel like it's already solved, since you can add text around where you generated a particular file. However, **Jupyter notebooks only work as logs of your procedure if you don't go back and edit the script you used to generate the data!** If you think you might make a version 1 and then make some changes to your process for a version 2, 3, and so on, you should either make sure to record what you did in version 1 clearly in some place where that information won't get changed, or you should delete everything produced from the version 1 procedure so it can never accidentally be reused. Not doing this causes a *lot* of problems, especially if you have multiple people working on a project who might mistake an old file for the one to use.

We've now seen a short introduction to working with TSV data. We could also have stored our data in comma-separated value (CSV) files by taking the code above and omitting the `delimiter` keyword:


In [None]:
# This code should look familiar!
import csv

def read_csv_as_dicts(filename):
    """Load in a CSV, or comma-separated value file, using Python's 
    built-in library `csv` for parsing fixed delimiter files. Loads
    in each row as a dictionary."""
    # open the file in read mode
    with open(filename, encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        row_dictionaries = [row for row in reader]
    return row_dictionaries

def write_csv_from_dicts(rows, filename):
    """Given a list of dictionaries with consistent keys, writes
    out a comma-separated value file using Python's built-in library
    `csv` to interpret the rows.
    """
    # open a file in write mode
    with open(filename, 'w', encoding='utf-8') as csv_file:
        # we grab the list of column names from the keys of
        # one of the rows
        columns = list(rows[0].keys())
        writer = csv.DictWriter(csv_file, columns)
        writer.writeheader()
        writer.writerows(rows)

If you want, you can try these out and explore the differences in how this renders files. CSV files rely on quote characters and commas to separate out fields, which comes with a bit of danger for text processing, since commas show up quite often in text. CSV files will typically use a double quote `"` as a *quote character* to mark the boundaries of a piece of text so that commas inside it aren't read as the end of a column. This, in turn, produces interesting quirks for how to render quotes that appear inside the text itself. If you use Python's `csv` library, it should take care of all of that for you, defaulting to the same behavior as what Microsoft Excel does, as specified by the default argument `dialect='excel'`. If you want to change this to another structure, I would recommend against trying to code it yourself, since it's easy to introduce errors, and instead check what dialect makes the most sense to use - the list of available dialects even includes Excel's tab-separated value dialect!

In [None]:
csv.list_dialects()
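
To see the quoting behavior described above, here is a short sketch that writes one made-up row containing a comma and a pair of double quotes, then reads it back. It uses `io.StringIO` in place of a real file.

```python
import csv
import io

# One field has a comma in it, the other has double quotes
row = ['Soledad, otra vez', 'Dijo "hola"']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)   # default dialect is 'excel'
csv_line = buffer.getvalue()
print(csv_line)

# Reading the line back recovers the original fields
parsed = next(csv.reader(io.StringIO(csv_line)))
print(parsed)
```

In the written line, each field is wrapped in double quotes, and the quotes inside the second field are doubled so the reader can tell them apart from the closing quote.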

We've looked at storing and saving delimited files. Let's look at another format, the JSON file.

## JSON, standards, and encodings


JSON stands for **J**ava**S**cript **O**bject **N**otation. It's a syntax to describe structures of information based on how JavaScript makes objects, but because of its flexibility and clarity, it's also used in a variety of web applications and as a storage mechanism for some datasets.

We can use Python's built-in `json` library to see what it would look like to write out the last ten of our unfiltered metadata from before. The library has two functions for writing out text in a JSON format: `json.dumps` returns a string, while `json.dump` writes to an open file. Since we want to see a string of the output, we'll use `dumps`:

In [None]:
import json

# use Python slicing and negative indexing to grab the
# last 10 elements of the list
last_ten_metadata = poetry_metadata[-10:]
last_ten_json = json.dumps(last_ten_metadata)
print(last_ten_json)

It looks like a list of dictionaries, each of which has the same keys paired with different values. We notice here that we haven't really done anything to indicate what's a number and what's a string, so everything is being written out as strings. We also might have a little trouble reading this the way it's printing right now - everything's running together in one long line. Finally, we can spot that we've actually had a leading space in front of all our author names - looks like the data wasn't encoded super well! So let's see if we can make some fixes.

First, let's turn our values for year, month, and page into actual integers! It's a basic for loop, but it's the sort of thing I end up doing all the time to help read in structured information by recasting the text of a number as an actual number. To get integers, I'll just use `int()`.

In [None]:
for meta_dict in poetry_metadata:
    for key in ["year", "month", "page"]:
        # replace the string with an integer from that string
        meta_dict[key] = int(meta_dict[key])

One step down. Now, to get rid of leading spaces (and trailing spaces if we have them), we'll use the `strip()` function built into strings:

In [None]:
for meta_dict in poetry_metadata:
    meta_dict["author"] = meta_dict["author"].strip()

Better - now let's try writing out the JSON format more legibly. One thing that would help is having some visible indentation to help us tell when something new is starting. The Python `json` library will let us do that with the keyword argument `indent` - every time a new dictionary or list starts, it will use an additional `indent` number of spaces to pad the start of the line. Here, with our list of ten items, it'll look like this:

In [None]:
# Pretty print our new dictionary
pretty_ten_json = json.dumps(last_ten_metadata, indent=2)
print(pretty_ten_json)

**Exercise.** We made our changes to `poetry_metadata`, but saw those changes reflected in `last_ten_metadata`. Why? 

Okay, that's much easier to read - we can see each entry separately, and we can see that while there are quotes around the URL, title, and author, the year, month, and page don't have quotes - we're supposed to read them as a piece of code would, which is as raw numbers, not as characters in a sequence.

Importantly, these two versions of printing the text will have very different lengths and contents. We can also compare this to how much space the last ten lines would take in TSV form, which would just combine the fields in order. As mentioned before, we should usually let the `csv` library do the work of reading and writing these files, but for this example, I'll just turn everything back into a string myself and combine each line together with tabs using Python's string `join` method, which combines a list of strings into one using the calling string as a delimiter:

In [None]:
tsv_keys = list(last_ten_metadata[0].keys())

original_ten_tsv = ""
for metadata in last_ten_metadata[-10:]:
    # combine all fields with tabs
    row_columns = [str(metadata[k]) for k in tsv_keys]
    line = "\t".join(row_columns)
    # add the combined line and newline character
    original_ten_tsv += line + "\n"

print(original_ten_tsv)

Now, let's look at the relationship between how easy it is for us to read these strings versus how much space they take:

In [None]:
print("Number of characters in ten metadata entries:")
print("Pretty JSON:", len(pretty_ten_json))
print("Raw JSON:", len(last_ten_json))
print("Raw TSV:", len(original_ten_tsv))

Surprised? Probably not - we spent extra space in our JSON representation to rewrite the names of our "columns" in every entry, and in the pretty printing, we also added bonus spaces. Of course, because of this format, JSON also has some flexibility we're not using: for instance, we can nest lists and objects inside other lists and objects, in the same way we could make a Python dictionary a value inside a Python dictionary. We can also choose to have some "keys" or attributes exist only some of the time; maybe if I had an author profile page for some authors but not others, I could add that attribute only where I need it in the JSON representation, but I would have to consistently have that column exist whether populated or not in the TSV representation. But it's worth talking about space because when we're representing data on computers in general, space adds up!
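
Here is a sketch of that flexibility on two made-up entries: the author is a nested object, the tags are a nested list, and the `profile` key only exists for one author. None of these keys are in the real metadata file; they are placeholders for illustration.

```python
import json

toy_entries = [
    {
        "title": "Soledad",
        "author": {"name": "Mar\u00EDa", "profile": "https://example.com/maria"},
        "tags": ["amor", "mar"],
    },
    {
        "title": "Luna",
        "author": {"name": "Luis"},   # no profile for this author
        "tags": [],
    },
]

toy_json = json.dumps(toy_entries, indent=2)
print(toy_json)

# json.loads turns the string back into lists and dictionaries
restored = json.loads(toy_json)
for entry in restored:
    # .get lets us supply a fallback when the key is absent
    print(entry["title"], entry["author"].get("profile", "(no profile)"))
```

The cost of optional keys is that any code reading the data has to handle the case where they are missing, as the `.get` call does here.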

## A digression - text encodings

To break this down a little more, let's review how strings store data. (I say review because I believe some of this is covered in the Intro to Python sequence for TAP.) A string is a list of characters, or symbols. Since computer memory is built to store information as binary numbers, a string is stored as a list of binary numbers, with one number for each character we see, which we call that symbol's *code point*. For instance, for pretty much any computer you'll use, the letter `A` has code point 65 if we're counting in base ten, and `a` has code point 97.

A listing of these numbers is a *standard* - these numbers both come from the [ASCII standard](https://en.wikipedia.org/wiki/ASCII), or American Standard Code for Information Interchange, which dates back to the '60s. The ASCII standard is a product of its place and time: it goes from 0 to 127 and, fitting the expectations of popular characters for US English, it includes numbers, Latin letters without accents, punctuation, spacing, and a series of special symbols meant to match up to different typewriter operations. (After all, back in the '60s, typewriters were a standard way to handle input and output for computers.)
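
You can check code points yourself with Python's built-in `ord` (symbol to number) and `chr` (number to symbol). A quick sketch, where the sample strings are my own:

```python
# ord gives the code point of a single character
for symbol in ["A", "a", "0", " "]:
    print(repr(symbol), "has code point", ord(symbol))

# chr goes the other way, from code point to character
letter_from_65 = chr(65)
letter_from_97 = chr(97)
print(letter_from_65, letter_from_97)

# a string fits in ASCII if every code point is below 128
ascii_only = all(ord(ch) < 128 for ch in "Hello, world!")
print(ascii_only)
```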

The standard we interact with commonly online and on our phones is the [Unicode standard](https://en.wikipedia.org/wiki/Unicode). Unicode starts with the same 128 symbols as ASCII, but then extends well beyond that to include characters from other alphabets, diacritical marks, stylized symbols, and even emoji. It's also updated annually by the Unicode Consortium, a non-profit whose voting members comprise many well-known tech companies and research organizations. At the time of producing this tutorial, [Unicode 15.0 is about to be rolled out with over 149,000 characters](https://home.unicode.org/unicode-15-0-beta-review/).

Of course, this gets at the way we map symbols to numbers, but to actually encode text - that is, to write it into our computer files and memory - we need a way to turn those numbers into a consistent sequence of ones and zeros. We call this a text encoding. Unicode actually has several different ways to do this.

Let's give ourselves an example made-up username with some non-ASCII characters to see how this looks:

In [None]:
username = "Mar\u00EDa \u2615"
print(username)

Python 3 will accept non-ASCII characters pasted directly into a string, but those characters can get mangled if the file is later saved or opened with the wrong encoding, so here we use the `\u####` format to specify the code point of each symbol we want in *hexadecimal*. Hexadecimal is base 16, which has some more digits than the base-10 *decimal* counting system we're used to - it goes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10. The hexadecimal number 10 is equivalent to the decimal number sixteen. So, to write the decimal number forty-two, we'd use two sixteens and a ten, which we'd write as 2A (2\*16 + 10). If we wanted to write out sixteen squared in hexadecimal, we could just write 100, the same way 100 in our usual decimal counting system is ten squared. It's convenient for programming because each hexadecimal digit can be written using exactly four bits.

We wrote this using two hexadecimal numbers, one for our i-acute and one for our coffee emoji. We can even use Python to convert the numbers out of hexadecimal (or "hex") if we'd like:

In [None]:
print("\u00ED", "has hex code point 00ED and dec code point", int("00ED", 16))
print("\u2615", "has hex code point 2615 and dec code point", int("2615", 16))
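
The hexadecimal arithmetic described above can be checked the same way. This is just a sketch using Python's built-in `int`, `hex`, and `format`; the variable names are mine.

```python
forty_two = int("2A", 16)          # hexadecimal 2A as a decimal number
sixteen_squared = int("100", 16)   # hexadecimal 100
print(forty_two, sixteen_squared)

# going back to hexadecimal; Python adds a 0x prefix
back_to_hex = hex(42)
print(back_to_hex)

# zero-padded to four hex digits, as in \u00ED
padded = format(0xED, "04X")
print(padded)

# a single hex digit written out as four bits
four_bits = format(0xA, "04b")
print(four_bits)
```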

Now that we have a Unicode string to work with, let's think about exactly how much memory (not just how many characters) will get used by our different strings when we write them out using a particular encoding:

In [None]:
# get the text encoded using different encodings
for encoding in ['utf-8', 'utf-32', 'latin1']:
    byte_val = "Mar\u00EDa".encode(encoding)
    print(encoding, "needs", len(byte_val), "bytes")
    print("Hex code:", byte_val.hex())
    print()

What's going on? I've used three different encodings here for just the name Mar&#237;a: two associated with Unicode and one, `latin1`, that predates Unicode. (Heads up, `latin1`, or `ISO-8859-1`, used to be a common encoding for text based on a Latin alphabet with accents, so you may run into it now and again. If reading a document as UTF-8 raises a `UnicodeDecodeError`, try reading the file using `encoding='latin1'` and see if that fixes it. If you instead get a bunch of As with accents where you expected other accented letters, that usually means UTF-8 text was read as `latin1`, so try `encoding='utf-8'`.)

To break down what's going on here with each encoding:
* `utf-8` uses a variable-length encoding, where the leading bits of each byte say whether it starts a new symbol (and how many bytes that symbol takes) or continues the previous one. This means that for the four letters contained in ASCII, it only uses one byte each, but for the accented &#237;, it has to use two bytes, giving us 6 total bytes.
* `utf-32` is fixed-length, but it uses four bytes (32 bits) for each symbol, plus an additional symbol at the front that tells it how it'll order the four bytes (since some programs read bytes front to back and others back to front). So, we get 5 * 4 + 4 bytes, or 24. If this were all emoji, those bytes would carry a lot more content, but since it's mostly ASCII, we see a lot of 0s in the extra bytes.
* `latin1` is explicitly designed to support some Latin characters with accents without needing a second byte, so it is able to write the acute &#237; in one byte, using only five bytes in total. (However, if we needed to write the coffee emoji, it'd throw an error - try it!)
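
Here is a sketch of what goes wrong when the encoding used to read text doesn't match the one used to write it, along with the `latin1` error mentioned in the last bullet. The variable names are my own.

```python
name = "Mar\u00EDa"

# Encode as UTF-8, then (wrongly) decode those bytes as latin1:
# the two bytes of the accented letter become two separate symbols
garbled = name.encode("utf-8").decode("latin1")
print(repr(garbled))

# Reversing the mistake recovers the original text
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)

# latin1 has no byte for the coffee emoji, so encoding fails
emoji_failed = False
try:
    "\u2615".encode("latin1")
except UnicodeEncodeError as error:
    emoji_failed = True
    print("Could not encode:", error.reason)
```

The repair step only works here because no information was lost along the way; if the garbled text has already been saved with characters replaced or dropped, the original may not be recoverable.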

## Finding duplicates (or near-duplicates)

We've now thought a bit about what's going on with our pieces of text. So let's check something - are all of the poems in our dataset unique? Let's go back to our original list of poems and see how unique our titles are.

In [None]:
title_counter = Counter([poem['title'] for poem in poetry_metadata])
print(title_counter.most_common(20))

Huh - looks like we have a lot of poems about solitude. This isn't a big surprise, that titles are getting used often for common themes, but we might be concerned if we have multiple copies of the same poem from the same author. Let's try including that information and see if that removes our duplicates.

I'll sometimes see people do this task by making some new special text field by glueing things together, e.g. by putting the title, a weird delimiter, and then the author into one string, like "SOLEDAD~+~Mar&#237;a". However, we don't have to do this - Python is perfectly happy to use a tuple, e.g. `(title, author)`, in place of a single string or value as a key in a dictionary, and it saves us time and silliness trying to chop up and glue together pieces of information. So we'll just use that instead:

In [None]:
title_author_counter = Counter([(poem['title'], poem['author']) for poem in poetry_metadata])
print(title_author_counter.most_common(20))

Wow, it looks like we have a much deeper problem than just common titles! Duplicates or near-duplicates are an inevitable part of text datasets. In this case, we probably want to make sure we don't include more than one copy of the same poem from the same author, so we might want to walk through our dataset and only include the *last* poem of a particular title by that author. Since our dataset is in chronological order, we can just grab the last poem using `defaultdict` again for this processing:

In [None]:
# Using the poetry_metadata variable again,
# we'll make a list for each title and author
metadata_by_title_author = defaultdict(list)
for poem in poetry_metadata:
    metadata_by_title_author[(poem['title'], poem['author'])].append(poem)

# Grab the latest poem in each list
unique_poems = []
for title_author in metadata_by_title_author:
    unique_poems.append(metadata_by_title_author[title_author][-1])
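Because assigning to an existing dictionary key overwrites the old value, the same "keep the last one" logic can also be sketched with a plain dictionary keyed on the tuple. This is only an alternative illustration, not a replacement for the code above; the `sample_metadata` records and the `latest_by_key` and `sample_unique` names are made up for the example.

```python
# Made-up sample records, listed in chronological order
sample_metadata = [
    {'title': 'SOLEDAD', 'author': 'María', 'year': 1920},
    {'title': 'SOLEDAD', 'author': 'Juan', 'year': 1921},
    {'title': 'SOLEDAD', 'author': 'María', 'year': 1923},
]

# Later records overwrite earlier ones that share the same (title, author) key
latest_by_key = {(poem['title'], poem['author']): poem for poem in sample_metadata}
sample_unique = list(latest_by_key.values())
print(sample_unique)
```

The `defaultdict(list)` version keeps every copy around, which is handy if you want to inspect the duplicates later; this version throws the earlier copies away as it goes.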

In [None]:
print("Removed", len(poetry_metadata) - len(unique_poems), "duplicate poems")
print("Percentage of poems remaining:", 100 * len(unique_poems) / len(poetry_metadata))

We've gotten rid of several thousand duplicate entries, but this was assuming that duplicates had exactly the same orthography. Often, duplicate entries have slightly different information: for instance, the same article from the AP newswire may be printed with different titles in different newspapers, or a book's database entry may vary based on extra text in the title related to edition number or whether the author's middle initial was included. As a result, we usually need to do some kind of **fuzzy matching** to be sure we've caught duplicates with subtle variations. For longer documents, this can mean counting words to see whether there's >95% overlap in the word counts of two documents. For shorter strings, we can use something like NLTK's `nltk.metrics.distance.edit_distance` to measure the "distance" between strings in terms of how many single-character edits would be needed to get from one string to the other. (This "number of characters to change" metric usually goes by the name *Levenshtein distance*.)
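As a rough sketch of the word-counting idea, here is one possible way to score overlap between two documents. The `word_count_overlap` helper and the two sample sentences are hypothetical, and splitting on whitespace is a simplifying assumption; a real pipeline would likely tokenize more carefully.

```python
from collections import Counter


def word_count_overlap(text_a, text_b):
    """Fraction of word occurrences shared by two texts (hypothetical helper)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Counter & Counter keeps the minimum count of each shared word
    shared = sum((counts_a & counts_b).values())
    # Compare against the longer document so a short text can't "match" a long one
    longest = max(sum(counts_a.values()), sum(counts_b.values()))
    return shared / longest if longest else 1.0


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox jumped over the lazy dog"
overlap = word_count_overlap(doc_a, doc_b)
print(round(overlap, 2))
```

With sentences this short, a single changed word drags the score well below 95%, which is why this kind of threshold makes more sense for longer documents.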

Here's an example of using edit distance to compare a few words, all with the same edit distance of 2:

In [None]:
from nltk.metrics.distance import edit_distance

pairs = [
    ("create", "creative"),
    ("cat", "cow"),
    ("maria", "Mar\u00EDa"),
    ("Assume", "Ass u me")
]
for w1, w2 in pairs:
    print(w1, "/", w2, "-", edit_distance(w1, w2))

This simple formulation of edit distance isn't terribly context-aware: it doesn't understand that changes in capitalization or the omission of an accent may be smaller changes in our eyes, nor does it understand that the length of a word may play a role. Depending on the setting you're in, you may want to make substitutions or lower-case text in advance of doing something like this, or look for a more sophisticated fuzzy matching tool.
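To illustrate what "making substitutions in advance" might look like, here is a minimal sketch that lower-cases text and strips accents before measuring distance. The `normalize` and `levenshtein` functions are hypothetical helpers written for this example; `levenshtein` is a plain standard-library stand-in so the snippet runs without NLTK, and is not NLTK's implementation. Whether dropping accents is appropriate depends on your data, since accents can distinguish different words.

```python
import unicodedata


def normalize(text):
    """Lower-case text and strip accents (hypothetical helper for illustration)."""
    # NFKD splits accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Drop the combining marks, keeping only the base letters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


raw_distance = levenshtein("maria", "Mar\u00EDa")
normalized_distance = levenshtein(normalize("maria"), normalize("Mar\u00EDa"))
print(raw_distance, normalized_distance)
```

The normalized strings could just as well be passed to `edit_distance` from the cell above; the point is only that the comparison happens after normalization.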


**Exercise.** Rewrite the code for finding unique poems to lower-case all author names and poem titles before comparing them. Does this have an effect? *Extra -* Can you find the poems that are removed by one approach and not the other?

## One more organization scheme: XML

Let's see one more format of storing our text: XML, or e**X**tensible **M**arkup **L**anguage. XML is designed to allow you to explicitly structure fairly complex data as a tree of different elements. For instance, our tree could have a root element as the list of all our poems, then have branches - or sub-elements - for each separate poem. These branches can have their own sub-elements (like the title and author) as well as attributes (like their URL or the year they were written). If you've looked at HTML code before, the syntax might look familiar. Let's make an example with just our last ten poems:

In [None]:
import xml.etree.ElementTree as ET

# build an XML tree (ick)
poems_root = ET.Element('poems')
for poem_dict in last_ten_metadata:
    new_poem = ET.SubElement(poems_root, 'poem')
    
    # Set attributes
    for attribute in ['url', 'year', 'month', 'page']:
        new_poem.set(attribute, str(poem_dict[attribute]))
    
    # Add text to subelements
    title = ET.SubElement(new_poem, 'title')
    title.text = poem_dict['title']
    author = ET.SubElement(new_poem, 'author')
    author.text = poem_dict['author']

xml_data = ET.tostring(poems_root, encoding='utf-8')
print(xml_data.decode('utf-8'))

Well... that's definitely a file!

Being completely honest, I do not enjoy working with XML. While XML is powerful, XML parsing tends to be computationally expensive and a little sensitive, and working with the tree structure the way the Python `xml` library has it set up is less than fun, so I almost never write out an XML file if I'm using Python. However, I do sometimes need to parse them, as XML and other markup languages often exist in data archives. For XML and HTML websites, I usually do this using `BeautifulSoup` instead of the built-in functionality for a few reasons:
1. I find its functionality for printing parts of the parsed data more intuitive,
2. It's a lot better about handling malformed XML or HTML (which is not uncommon), and
3. It has nice [documentation and examples](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup).


In [None]:
from bs4 import BeautifulSoup

# The 'xml' parser relies on the lxml package (if it's missing: pip install lxml)
soup = BeautifulSoup(xml_data, 'xml')

In [None]:
print(soup.prettify())

BeautifulSoup can take in a raw string of text and parse it out as nicely-formatted XML (you can see it even added an XML declaration at the top). If we want to grab all the poems, we can just use `soup.find_all(tag)` to get a list of every element with that tag:

In [None]:
soup.find_all('poem')

If I want to get text out of fields, I usually use a list comprehension over the tag I'm interested in. For instance, if I want all the titles, I can do the following simple list comprehension:

In [None]:
titles = [t.text for t in soup.find_all('title')]
print(titles[:10])
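For comparison, here is a minimal sketch of pulling both attributes and sub-element text back out with the built-in `ElementTree` library. The `sample_xml` string is made up for this example (it only mirrors the shape of the tree we built above), and `records` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up sample with the same shape as the tree built earlier
sample_xml = """
<poems>
  <poem url="https://example.org/poem-1" year="1920" month="3" page="12">
    <title>SOLEDAD</title>
    <author>María</author>
  </poem>
  <poem url="https://example.org/poem-2" year="1921" month="7" page="4">
    <title>OTOÑO</title>
    <author>Juan</author>
  </poem>
</poems>
""".strip()

root = ET.fromstring(sample_xml)
records = []
for poem in root.findall('poem'):
    # Attributes come back as a dictionary of strings
    record = dict(poem.attrib)
    # findtext returns the text of the first matching sub-element
    record['title'] = poem.findtext('title')
    record['author'] = poem.findtext('author')
    records.append(record)
print(records[0])
```

Note that every attribute value comes back as a string, so a field like `year` would need to be converted with `int()` before doing any arithmetic or sorting on it.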

We've discussed several ways to organize text: in delimited files, JSON files, and markup files like XML. We've also talked about some simple things we might do to filter down a large collection, like detecting duplicates or downsampling authors who are common. All of this was done before we even got the full documents together, which in this case is useful for making sure we don't download more poems than we actually want to use (using up our space and time, as well as their server's bandwidth).

However, we still haven't been looking inside the documents themselves. In the next lesson, we'll look at how to normalize text in documents. 

___
[Proceed to next lesson: Text Curation 2/3 ->](./textcuration-2.ipynb)