<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Xanda Schofield](https://www.cs.hmc.edu/~xanda) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email xanda@cs.hmc.edu.<br />
____

# Text Data Curation 1

This is lesson 1 of 3 in the educational series on Text Data Curation. This notebook is intended to introduce the basics of treating text documents as data and how to store and filter those documents. 

**Audience:** `Learners` / `Researchers`

**Use case:** [`How-To`](https://constellate.org/docs/documentation-categories#howtoproblemoriented) 

**Difficulty:** `Intermediate`
This course is open to those with a basic level of proficiency in Python. Taking the Python Basics course the week before is sufficient.

**Completion time:** 90 minutes

**Knowledge Required:** 

* Python basics (variables, flow control, functions, lists, dictionaries)
* How Python libraries work (installation and imports)


**Knowledge Recommended:**

* Basic file operations (open, close, read, write)
* How text is stored on computers (encodings, file types)


**Learning Objectives:**
After this lesson, learners will be able to:

1. Parse and generate XML, JSON, and CSV files given raw text documents
2. Identify when text encodings affect the interpretation of text data
3. Use a lexicon to select relevant documents within a text collection
4. Use fuzzy matching to remove duplicate items from a collection

___

## Install Required Libraries

In [2]:
### Installs and Imports ##
!pip install beautifulsoup4

# Import libraries
import csv
import json
import os
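# import the `request` submodule explicitly so that
# `urllib.request.urlretrieve` is available below
import urllib.request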



# Required Data


**Data Format:** 
* delimited files (.csv, .tsv)
* structured files (.json, .xml)

**Data Source:**
* [Contemporary Spanish Poetry Metadata](https://cs.hmc.edu/~xanda/data/poemas_metadata.json): Spanish poem names scraped from [Poemas del Alma](https://www.poemas-del-alma.com/). Scraped by Xanda Schofield in June 2022 and encoded in UTF-8. Note that this does not contain the poems for redistribution reasons (though if you are interested in looking at the full body of poems for a project, let me know!)


## Download Required Data

In [10]:
### Retrieve multiple files using a list ###

download_urls = [
    'https://cs.hmc.edu/~xanda/data/poemas_metadata.json',
    'https://cs.hmc.edu/~xanda/data/poemas_metadata.tsv'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])

# Introduction

This is the first of three lessons on **text data curation**. What does the term mean? First, when I talk about *text data*, I mean text that has been serialized on a computer into some numerical representation. When I talk about curating that data, I think about the OED's [definition of curation](https://www.lexico.com/en/definition/curation): "The action or process of selecting, organizing, and looking after the items in a collection or exhibition." When we curate text data for computational text analysis, our goal is the selection, organization, and care of a dataset of text documents whose direct audience will be a computer program: we need to provide consistency and precision in our representation so that the program can run. However, it is also important that the output of the computer program speaks to the reasons we are interested in these texts, so our text must make sense both to a computer program and to us. While these two audiences aren't totally at odds, they need different things. This tutorial looks at how to navigate that tension, both with specific examples of ways we curate text in a dataset (through filtering, cleaning, and normalizing) and through a discussion of strategies for poking and prodding at these large collections in order to notice when something is amiss.

This tutorial is aimed at people who are acquainted with the basics of programming in Python and are interested in embarking on a quest to work with a large collection of text using computational tools, but who are still new to developing models of text or more complex natural language processing tasks. In this first lesson, we will look specifically at the ways we store text and collections of documents on computers and at tools we can use to filter those larger collections into something usable.

# Lesson


## When text is "data"
There are lots of ways to gain understanding of texts. We can read the texts themselves closely, read about the history of the texts' subject and of its authors, dig up related or contemporary texts, and grapple with the critical responses others have had to each. Depending on the questions we want to answer or our worldview about textual analysis in general, we might take a mixture of these different strategies together to analyze, answer, and argue on one topic.

Computational text analysis provides another set of strategies to do this, with a specific emphasis on what is (or isn't) a pattern across numerous texts. To do this, text collections are fed to computer programs that count, contrast, and correlate events that happen in texts, ranging from the basic frequencies of individual words and phrases to more complex inferences of what events occur for a particular character or what sentiments or moods are prevalent around mentions of a specific theme. Approaches to find these patterns range from direct programming (e.g. writing code in Python or R) to interacting with visual interfaces (like Voyant or Tableau) to options that merge the two (like building an extensive spreadsheet in Excel). In my experience, there's an amazing ecosystem of different tools and tutorials out there to train models, compute statistics, and visualize results around text. However, without text data curation, one cannot actually learn anything from these models.

If you take nothing else away from this week, there are two main things I want to revisit:
1. Text data curation requires making subjective decisions, and it is important to think about and document these choices.
2. Text data curation succeeds when we are curious and suspicious about the contexts of our text collection, so we should continue to think of ways to check that the text we have looks the way we think it does.

## TSVs and a basic filtering task

It's important to acknowledge before we embark on this project that the process of text data curation for these applications is not neutral: it's *subjective* and *destructive*. Let's think about a case study to understand how this looks.

Suppose I'm interested in studying dominant themes in contemporary Spanish language poetry. To start, I want to find a large collection of these poems. I find out that the website [Poemas del Alma ("Poems of the Soul")](https://www.poemas-del-alma.com/) has hundreds of thousands of user-submitted poems from 2009 onwards. I also find an existing script written in Python to scrape the literary poems on the website, which I could adapt to get the user poems as well.

Let's take a look at the *metadata* of these poems. This includes the title, author, year and month of publication, and URL for each poem. I've actually created several different versions of this metadata file: a TSV file, a CSV file, a JSON file, and an XML file. We'll start by loading in the first of these, the TSV file. I've added some helper functions below for reading TSVs into lists of dictionaries and for writing them back out:

In [25]:
import csv

def read_tsv_as_dicts(filename):
    """Load in a TSV, or tab-separated value file, using Python's 
    built-in library `csv` for parsing fixed delimiter files. Loads
    in each row as a dictionary."""
    # open the file in read mode
    # newline='' lets the csv module handle line endings itself
    with open(filename, encoding='utf-8', newline='') as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter='\t')
        row_dictionaries = [row for row in reader]
    return row_dictionaries

def write_tsv_from_dicts(rows, filename):
    """Given a list of dictionaries with consistent keys, writes
    out a tab-separated value file using Python's built-in library
    `csv` to interpret the rows.
    """
    # open a file in write mode
    # newline='' avoids extra blank lines between rows on Windows
    with open(filename, 'w', encoding='utf-8', newline='') as tsv_file:
        # we grab the list of column names from the keys of
        # one of the rows
        columns = list(rows[0].keys())
        writer = csv.DictWriter(tsv_file, columns, delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)

The first function `read_tsv_as_dicts` gives us a return variable `row_dictionaries`, a list of dictionaries that each contain the information of one row. In this case, each row will have the metadata of one poem. Because this is a TSV file, each of the poems has one line in the file, with a **T**ab character between each entry to delimit the different pieces of information into columns. (Because people don't really use tabs in their poem names, this is a pretty safe option to split our text.) We can see the names of the columns by looking at the first 10 lines of the file:

In [19]:
# We're storing these files with a utf-8 encoding
with open('poemas_metadata.tsv', encoding='utf8') as metadata_file:
    # print the first ten raw lines of the file
    for i in range(10):
        print(metadata_file.readline())

url	title	author	year	month	page

//www.poemas-del-alma.com/blog/mostrar-poema-36	Haiku.  Yo soy	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-49	"Poema Épico  ""Tecum Umán"""	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-56	-Sorsonete- “LA  LIBERTAD  DE  CRISEIDA”	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-58	Romance nuevo SIRENA CONSENTIDA	 Rafael Merida Cruz-Lascano	2009	1	0

//www.poemas-del-alma.com/blog/mostrar-poema-78	Vacia por dentro	 sha_nena	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-79	Sola en la realidad	 sha_nena	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-88	El destino de ser flor.	 Lizelizalde	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-89	Recuerdo de Infancia	 Angela	2009	2	0

//www.poemas-del-alma.com/blog/mostrar-poema-91	Al amar	 Oscar Raul Quiroz Cortejana	2009	2	0



In this file, we can see the 0th line (remember, Python starts counting at 0) gives us the names of each of the columns, and each of the lines after that describes a specific poem. (The blank lines in between appear because each line we read already ends in a newline character and `print` adds another.) This sort of file can get loaded into Excel if you want cleanly visible columns, but if we just read it as a plain text file, we can still roughly sort out what items should be present in what order. It's a little easier to read these once we have them loaded in as dictionaries in Python, so let's look at the first ten of those:

In [20]:
# read in the data from a TSV format as dictionaries
poetry_metadata = read_tsv_as_dicts('poemas_metadata.tsv')

for i in range(10):
    print(poetry_metadata[i])

{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-36', 'title': 'Haiku.  Yo soy', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-49', 'title': 'Poema Épico  "Tecum Umán"', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-56', 'title': '-Sorsonete- “LA  LIBERTAD  DE  CRISEIDA”', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-58', 'title': 'Romance nuevo SIRENA CONSENTIDA', 'author': ' Rafael Merida Cruz-Lascano', 'year': '2009', 'month': '1', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-78', 'title': 'Vacia por dentro', 'author': ' sha_nena', 'year': '2009', 'month': '2', 'page': '0'}
{'url': '//www.poemas-del-alma.com/blog/mostrar-poema-79', 'title': 'Sola en la realidad', 'author': ' sha_nena'

**Exercise.** Look at the first 10 documents and then the last 10 documents (i.e. starting at index `-10`) - what data do we know about each poem? Is there anything missing or unclear?

From our discussion, we might have some questions: are there trends in the number of poems being added over time? And how are our poems distributed across authors?

These are questions we can answer using the `Counter` class, which you may have encountered already in Intro to Python. Whether or not you have, the quick version is that it's a special kind of dictionary that is designed to count how many times each unique item shows up in a sequence. We can use it to sort out how many times each author shows up once we've loaded in our data:

In [47]:
from collections import Counter

# make a list of the author from each row of poetry metadata, then
# give that list to the Counter to count up
author_counter = Counter([row['author'] for row in poetry_metadata])

# list authors in decreasing order of how many poems they wrote.
# To just list the top K authors, you can say .most_common(K), e.g.
#    top_100_authors = author_counter.most_common(100)
top_authors = author_counter.most_common()
print("Total authors:", len(top_authors))
for author, count in top_authors:
    print(count, author)

Total authors: 18643
2121 Raul Gonzaga
1880 Edmundo Rodriguez
1830 EMYZAG
1720 El Hombre de la Rosa
1406 Hermes Antonio Varillas Labrador
1386 Yolanda Barry
1345 bonifacio
1275 Oscar Perez
1217 ALVARO J. MARQUEZ
1210 DAVID FERNANDEZ FIS
1170 la negra rodriguez
1086 Diaz Valero Alejandro José
1075 Rafael Escobar
1061 Violeta
1059 syglesias
1036 Esteban Mario Couceyro
987 Un Rincon Infantil
947 gaston campano
927 joanmoypra
912 Sergio Jacobo "el poeta irreverente"
902 sergiocabas
863 Rafael Merida Cruz-Lascano
846 pani
839 Vito_Angeli
833 Eco del alma
803 boris gold
742 nando_barra
716 Pyck05
680 Francisco 1987
662 Ramón Bonachí
651 joaquin Méndez
637 FELINA
634 alicia perez hernandez
628 linda abdul baki
623 Roberto V
613 Alberto Escobar
591 Raúl Daniel
590 Freddy Kalvo
585 huertero
555 jorgeluisotero
553 FERNANDO CARDONA
553 Trovador de Sueños ...y realidades.
548 RICARDO ALVAREZ
528 argantonio
528 jureme
524 benchy43
524 Jareth Cruz
514 Lincol
514 Alejandro Diaz Quero
508 Santiago Mir

33 kefma-2
33 Neileth Martinez
33 Montevidiosa Natalia Pias
33 Graciela Dantes
33 Lauren Duràn
33 Pachuco
33 mikezu
33 Logoterapeuta
33 Denielig
33 koki.jum
33 Poeta Traicionero
33 beauty_cris
33 Landy Torres Baños
33 C.A.R
33 Daylin
33 Nestor Varela
33 Emilio Contreras
33 Marco Antonio (El Gringo)
33 RobertoFerreira
33 tamypaloma
33 Alcionico
33 Luis Pizarro M.
33 pedro pablo mejia torres
33 Inmovil en blanco
33 Rafael Huertes Lacalle
32 polita
32 kristi
32 Cieroska Porras
32 Fco Peiro
32 tony soto flores
32 Fredi17
32 mirada
32 Isaias Medina Lopez
32 el ultimo suspiro
32 JOHNWWWW
32 atalayax
32 El Sapo Cancionero
32 rosa espina
32 Estefania Marel
32 Ever Cordero
32 osmarlene
32 Manuel Alarcon
32 Ana Maria Delgado
32 Marisol Andrade
32 Carmen Angelical
32 Maferc
32 devilmind
32 eliudl
32 Alejandrina
32 Hada del fuego
32 maestrairma
32 vanessie
32 wilson flores
32 Braian Donald
32 juan_ca7
32 gilhian
32 el poeta de la nota triste
32 El poeta de la soledad
32 VENDAVAL DE ILUSIÓN
32 Clau

14 Dogen
14 Luz DeLuna
14 Antonio Guerra Colón Elmagodelasletras
14 Rogervan
14 Carlos Cianciaruso
14 Dylan Smith
14 Auro im Quebar
14 Alex Omar Medrano Vasquez
14 Manugongue
14 J.J.Bucar
14 Ro Vercelli
14 Escritora Zoralda
14 PilarF
14 Moaa
14 elisain maldonado
14 Eddy Cárdenas García
14 Jannine
14 Vito Monrroe
14 Poeta del aire
14 Elena Nikkinen
14 Esdras Gamarra Ponte
14 Rounald Araica
14 Fernando?
14 Yamel Murillo
14 Reiniero
14 Lucia Rodriguez Lopez
14 mrx2000
14 LuisMG
14 Gabriela Fernández O
14 Laura Ontiveros Plaza
14 mariapdfoxa
14 Efraín Ramírez
14 AlejandraMonica
14 Claudio Ernesto Poeta
14 IldaC
14 Sir_Luzio
14 Draven Noré
14 Yo soy Araceli
14 Geja Alaras
14 lobo_estepario
14 Dailyn Arce
14 Kevin Hernandez
14 An Vite
14 IVAN DE NERVAL
14 José Esteban Chávarro Cuéllar.
14 Melisa 94
13 Boa-Gente
13 Flor de amor
13 Alfonsina G.
13 walberto campos
13 Eli-Momo
13 Shekina
13 Diego Somoza
13 PABLO DURAND
13 Goldenman
13 gea52
13 yustonnils
13 AQUILES PIMIENTO
13 siemprefiel
13 mel

8 Liiseth67
8 Enrique Garcia
8 Nicole Marie
8 tinta de pulpo
8 ROSICOLMENARES
8 lisy8621
8 Hansel Guro
8 aries du monte
8 la necedad
8 Carlos Vara
8 Adriana Sastre
8 Overmans Mendoza
8 Alberto Corral
8 Midnightfrases
8 PauCath08
8 Rob Aldrin
8 leonardo caro
8 Dante Lucrecio
8 Filipochka
8 Oli macnauj
8 Sinus Iridum
8 darwin santillan
8 Ann Ivana
8 🆃🅴🅹🅴🅳🅾🆁🅰 🅳🅴 🅿🅾🅴🅼🅰🆂
8 antonia camargo
8 Saul Vazquez
8 Dan en proceso
8 Liricoloco
8 Andres Ruizz
8 Nerezza
8 dany saenz
8 Julián Valdés Vásquez
8 Adrian Fernando Cano
8 mariangelgutierrez
8 sinnombre
8 Jesús Alejandro Escudero
8 cuervo.oc
8 Isaac Castillo
8 Xose López
8 Psyco_Girl
8 marialauzm
8 karekarenina
8 Poetageneraciondel17
8 carlos belmont
8 Cornelius Francisco Svelti
8 Nitsuki Ezequiel.
8 Manuelsp
8 jecsebell_perez
8 Leoner Lozano
8 alex91
8 JOSÉ GRIMALT
8 ÍNGRID
8 Laura Benavidez
8 Sa. J. Jalley
8 CoKe Fish
8 D\'Marco Salas
8 Fernando Pérez Licea
8 Venus negra
8 manuelelafrontera
8 Misael Gaston
8 Federico Joel
8 oliver_Garcia97
8 t

5 Jozz
5 G. Sekspyr
5 el eco de un poeta
5 Martin Guerra
5 calamardo
5 tatiana13
5 claudiooyarzun_
5 monibe
5 The last poet
5 Eimai
5 mafertorres
5 megadragito
5 Ricardo Ayestaran
5 Zel Rosenthal
5 Kiang94
5 DagoDeza
5 Poetalegas
5 Bryan Joshua
5 PoEtA De MiL PeNaS
5 Krouvaz
5 legodellimbo
5 jickso
5 rafael romero12
5 ElPoetaJuan
5 Kolifato2008
5 javive22
5 La Tinta Roja
5 LALUCHI
5 sanae
5 agurne
5 sentimar
5 Intensa
5 Divagacionesdeunapoetiza
5 corazon_de_metal
5 rosablanca
5 g.me
5 pablo barattini
5 Tony Pichs
5 S&amp;M
5 rikr2
5 RAUL ARROYO
5 Rooster
5 unahuellaenelinfinito
5 morocho delchat
5 Dreams
5 Isabel Espinoza
5 chachy
5 Sthian Eurata
5 IVAN VELEZ
5 Poemas de La Susodicha_
5 poemaria
5 estasalgoloca
5 Denny Sangiovanni
5 martinalorain
5 abimael diaz
5 caryprincess
5 franvinicio
5 poetadecalle
5 CRISTINA IGLESIAS
5 tachito
5 ojalapueda
5 Gemmei26
5 Renzo Camarena
5 rebecaelizaldeblasco
5 Johny Mentero
5 10235575
5 Carlos Siu
5 apasionadadelamor
5 Izra Dafenyn
5 kelvin07
5 Al

3 Carlos Arbaiza..
3 alexiita
3 patricio ariel gaete gaete
3 luma_911
3 rokeler
3 Love aLways
3 Tamara Vanessa
3 saharab
3 Majomovi
3 kamy2013
3 Alborada
3 JessiLove15
3 Tremy
3 soner
3 Venado Azul
3 chucho chaparro
3 miguel serna
3 Jonathan Rojas
3 eerick
3 zarita
3 Ark1993
3 sol.
3 Alejandra Jaramillo
3 alvaroboavista
3 iki
3 pearl1
3 lanima
3 Vademecum
3 Betillo
3 Mc fox
3 PatricioCortes
3 tozztaditaww
3 artimanayolvido
3 porrikitaz
3 flor da silva
3 BBz
3 THE ARTIST
3 thetimburto
3 lyss
3 mauricio sinchi
3 Mariana Ramirez
3 Devil woman
3 Cristhian Mero
3 C.G
3 AVILIX
3 Sergio Andrés Pastor
3 Elizabeth torres ruiz
3 Cecy Perez
3 Gabriela Alejandra Gaitan
3 MayFer
3 Paulino Ruiz Mendoza
3 Fanny Garcia
3 poeta emergente
3 Eterna Enamorada
3 Brenda Vi
3 Cedrick Dalla Torre Zamora
3 pronavens87
3 David ortiz
3 neptuno
3 elvilasa@gmail.com
3 Rule Ortiz M.
3 loannysolano
3 Carlos Gamissans
3 Raimundo Ramirez
3 Jejsav
3 jose may
3 Homunculus
3 cesar isidoro marmol
3 principios
3 Daniel-M-S

2 harck123
2 arellano_0320
2 poetadeamor
2 terrenal
2 Garóe
2 Martha888
2 issabela1987
2 FiorPz
2 loboestepario
2 cristf_9
2 alucard0412
2 mody doom
2 Maralni
2 gipsy merly
2 Adrian1274
2 FAGA
2 lumago
2 ALBERTO ROSAS
2 MIDORI2
2 Little eagle
2 beetty
2 kalenjose
2 Omar Nava
2 Julio_Lestat
2 Diangel
2 Said
2 BELIU AGUIRRE
2 dulce amargo
2 VERA ZOMI
2 the syke
2 Jesus_cch
2 Hernán Uvaldo Cortés
2 Versoflexia
2 josefina suarez-kaulitz
2 contracorriente
2 ELI1999
2 nepmir
2 poeta elias contreras
2 raulnavas
2 tu ciel0
2 YubalGM
2 Almagy
2 rana0sapo
2 docazy
2 frui
2 Nary Cebyas
2 Jeshy Page
2 Fabiola Salinas
2 gaviota andariega
2 Elloco
2 Hinoku
2 galldebrega
2 Lilith13
2 Leslie Urdanivia
2 EL MAGO MERLIN
2 Marlon Barrios
2 Orihime
2 fav
2 tohui
2 marco carvacho
2 joosefa.k
2 prudenciogf
2 MargoC
2 zhimek
2 Blueve
2 mariaelenasancho
2 Silvia Natalia
2 Ana Maria SG
2 vardodimagia
2 conquistador16
2 Zeldamaniatic
2 freddy chala
2 Sed
2 Dark Princess Gothic
2 Brandon bainilla
2 Lost Souls
2 

2 badaracco
2 YohelavaDL
2 Agre Ten Ten
2 E.Zaín
2 SongiKim
2 UnluckyRaven
2 Sebastián Rodríguez
2 Andrés Salas Aguilar
2 lunadesolas.
2 Andrea Thompson
2 Ayekan
2 Berta Licet
2 Angel16
2 ChukiA
2 jesus polo
2 Potro Estepario
2 Claudelle Henet
2 Christian Tevar
2 Eloy Moreno Sierra
2 osby
2 W.P. Lopez
2 A.Masero
2 RENE ARTURO CRUZ MAYORGA
2 gbarrachinaf
2 yaco.Qh
2 CarmenSoledad
2 Wiana Gara
2 Luis Mario 81
2 Marcos Pereira Cardoso
2 Walter Quiroz Bustamante
2 Lucas Saccone
2 Catano
2 Daniel Yoldi
2 Netito
2 Sultan1973
2 Jotaquil
2 PalabrasEnJuego
2 Clara Bustamante Calderón
2 reginaldo castela
2 Pablomiranda
2 Alberto Navarro.
2 Lady Bird
2 Bernabé Solano
2 Cristian Cerna Q.
2 Nata1719
2 Ivanna Villavicencio
2 Calixtheo
2 Bryan09
2 Lorenzo Benito
2 Josué Torres (Josh)
2 nosoyrobot
2 rising11
2 Rogelio I. Martinez
2 Eder lara aguilar
2 Bucle
2 Juan Carlos Oberst
2 Peco😘
2 LRaulArguellesMtz
2 Charedy Martinez Caro
2 Lozano
2 Joseph Aragón
2 Andres Garcia11
2 Armando Reyes
2 Farcar
2 Gis

1 psicoN
1 R.A.K.
1 hospa
1 aylen arguello
1 fhmoreno66
1 mencha43
1 luce_strella
1 dylan leo
1 Querataro
1 jessyka
1 krlita alarcon
1 MCV
1 RELTIH
1 Bella Luna
1 tnd1030
1 Juliett.
1 manuelmaoc
1 lexxytmoon
1 Lyzon
1 Jaime Antonio Guzman
1 Alonzo_BMTH
1 Campanitta
1 Leopol Do
1 akselett
1 el poeta sin corazon
1 Alfredo2012
1 juancvaldes
1 Anto Unlz
1 juny66
1 dhfabillo
1 Jvnior666
1 MALALCHE
1 nickfury
1 gloriajara
1 Carls E. Ricalde Peniche
1 lihani
1 nelmared
1 HommieVagalesOrozco
1 preguntatti
1 genius
1 manuel ballester
1 licha.
1 Gustavo Esparza
1 JCVR
1 noriega
1 peki17
1 vicctube1
1 Minds
1 gato-x
1 CaP
1 alexandernaie
1 Angela cesped
1 elsabaez
1 lieblichneumond
1 lucy70599
1 Belen_jaz
1 rajaman
1 April Rose
1 Debys
1 Zahde
1 lalo perrrea
1 Josefina Buckley
1 EzerquielCriscuolo
1 mechudo
1 Papaya
1 cristof
1 Gonsua
1 thepoetEdward
1 katherine-mooze
1 Maga2525
1 ReneVazquez
1 Rhapsody_34
1 sunpelican
1 arturohdez
1 CarlosMerk
1 Antonio Vieyra
1 IrmaO
1 Verano sin sol
1 Andres S

1 Edgaliano
1 frankcastel
1 Sarty Marty
1 Arion Varela
1 leiner lopez
1 Danae Armstrong
1 Manuel fernando
1 Pajarodemar
1 chava_3132
1 jes gandon
1 MCberg
1 Yoidee
1 Javier bueno
1 Ravenclaw
1 DANIELADARVE16
1 neela
1 CESAR ALDANA QUINTANA
1 CarlosDBM
1 MarCan
1 U-A
1 Carolina Rosero
1 Polo Rodriguez
1 DiosDeNada
1 Ubaldo Ugarte
1 CarlosEduardoRevilla
1 santirodriguez
1 PaolaArtigas
1 Ielgoldoll
1 josepheastman
1 LA CEGUERA DE LOS CUERVOS
1 JoseMoyano
1 MisKatonic
1 Emilio1291
1 maylo
1 elajedrecista777
1 Luna Alfaro
1 Daniel c. ortiz
1 R.Bulhosen
1 Eloy Munive
1 lemonroge
1 Johan Vergara
1 Emil Cioran
1 fanny gusman
1 Nicolas Gildardo
1 D-key
1 Camacho02
1 Sarai Salame
1 Eleazar Santizo
1 lionel
1 Antonio Breno
1 Neferneferura
1 Carlos E. Guevara Piminchumo
1 angel sterling
1 Meraki
1 davidlopez
1 Camilo Gutierrez
1 Juana Vidaurre
1 Jared-Martim
1 Melany Madeleyn
1 MarinaMadsen
1 michini
1 Anacreonte
1 Isaac Mercado
1 Muro Cadarso
1 o_catu
1 Azucena correa
1 Ginebra Lit
1 miguelhernan

1 leodeclamador
1 Jose Victor Sanchez Trujillo
1 LJ Navi
1 carlos penelas
1 belladvara
1 Jos45
1 clp
1 eduardochvzb
1 luisgarciacorre
1 Leo.R.G
1 Ricardo Bermudez
1 Alex Palacios
1 AleAlondra
1 lorenaEmm
1 Ale Martinez
1 Nory Sama
1 Joao
1 vanfortheworld
1 Ruben Gimenez
1 William Bermudez
1 mitificada
1 chiquilla_poeta
1 Centella fenix
1 Pablo Distinto
1 Daiteraria
1 MaiOn
1 Enrique Gines
1 Rolando Reinoso
1 magnolia13
1 Yehudi Collas
1 Adrian9090
1 Eulalio IV
1 watchtowerflores21@gmail.com
1 omar_cartas
1 Odracirch
1 HANAEL
1 8ositodegoma8
1 Danny CaT
1 cthaeh
1 ArmandoWW
1 cesar maldonado diaz
1 Baraal
1 Leonardo Siré
1 Alejandra Coy
1 Christian Teneda
1 AngelGray
1 Ivonne Cadena
1 Bosco Manol
1 José González Oliva
1 Steve Pacas
1 Adrian Jaramillo
1 8adeur2
1 Esteban85
1 rimas
1 Diana Osornio
1 Am. S. D
1 Angela Celeste
1 Cruz De Luna
1 Rakkun
1 Carlos D.
1 juanpichus
1 Lenuah
1 Lorena Lencinas
1 JAAB
1 Goid Caster
1 Jan Ces
1 Matias Salazar
1 Hector Jose Corredor Cuervo
1 LoveU
1 Pa

**Exercise**. Scroll through the output of this counter - what do you notice?
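The other question raised earlier, whether the number of poems added changes over time, can be approached with the same tool. Here is a minimal sketch that counts poems per year. To keep it self-contained, it uses a handful of made-up rows (`sample_rows` is hypothetical, not real data) in the same shape as the dictionaries in `poetry_metadata`; in the notebook you could try swapping in `poetry_metadata` itself.

```python
from collections import Counter

# Hypothetical sample rows in the same shape as `poetry_metadata`;
# the values here are invented for illustration only.
sample_rows = [
    {'title': 'A', 'author': ' ana', 'year': '2009', 'month': '1'},
    {'title': 'B', 'author': ' ana', 'year': '2009', 'month': '2'},
    {'title': 'C', 'author': ' luis', 'year': '2010', 'month': '2'},
    {'title': 'D', 'author': ' ana', 'year': '2010', 'month': '7'},
    {'title': 'E', 'author': ' mar', 'year': '2010', 'month': '7'},
]

# Count how many poems fall in each year. The years are stored as
# strings, so we convert them to integers to sort them numerically.
year_counter = Counter(int(row['year']) for row in sample_rows)

# sorted() on the items gives (year, count) pairs in time order,
# rather than the most-common-first order of .most_common()
poems_per_year = sorted(year_counter.items())
for year, count in poems_per_year:
    print(year, count)
```

Sorting by year rather than by count is a deliberate choice here: for a question about trends, we care about the order of time, not about which year was the busiest.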

If we wanted to track how authors changed over time, we could limit ourselves just to poets who had contributed more than X poems. However, most projects I work on that take on something like this are more interested in not overrepresenting significant contributors.

Let's look at a piece of code that samples no more than 10 poems from each author. This uses a special kind of Python dictionary, the `defaultdict`, that's a cousin of the Python `Counter`: it allows you to specify the type of the values in the dictionary so that when you access a key that hasn't been used before, it provides a default value of that type. For instance, a `defaultdict(int)` would have a default value of 0, while a `defaultdict(list)` defaults to an empty list. (You can specify more complicated functions if you want for these, but let's leave it at that for now.)

We'll use our `defaultdict` to build a list of entries as the value for each author key by appending each entry to the list for that author. Then, we'll use Python's built-in `random` library to grab a sample from any list that is too long.
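To make the default-value behavior concrete before applying it to the real metadata, here is a tiny sketch on made-up data. The names `word_lengths`, `titles_by_author`, and `pairs` are invented for this illustration.

```python
from collections import defaultdict

# defaultdict(int): a key we have never seen starts at 0,
# so we can add to it right away
word_lengths = defaultdict(int)
word_lengths['amor'] += 4

# defaultdict(list): a key we have never seen starts as an
# empty list, so we can append to it right away
titles_by_author = defaultdict(list)
pairs = [('ana', 'Uno'), ('luis', 'Dos'), ('ana', 'Tres')]
for author, title in pairs:
    titles_by_author[author].append(title)

# convert to a regular dict just to get a tidier printout
print(dict(titles_by_author))
```

One thing to watch for: merely looking up a missing key in a `defaultdict` adds that key with its default value, so checking `key in my_dict` is the safer way to test whether something is present.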

In [23]:
from collections import defaultdict
import random

# Using the poetry_metadata variable from a few cells ago,
# we'll make a list for each author
metadata_by_author = defaultdict(list)
for meta_dict in poetry_metadata:
    metadata_by_author[meta_dict['author']].append(meta_dict)

# Iterate through each of the keys (author names) in the dictionary
# and add up to max_per_author poems to our filtered list
max_per_author = 10
filtered_author_metadata = []
for author in metadata_by_author:
    if len(metadata_by_author[author]) > max_per_author:
        filtered_author_metadata += random.sample(metadata_by_author[author], max_per_author)
    else:
        filtered_author_metadata += metadata_by_author[author]
        
print("Length of original collection:", len(poetry_metadata))
print("Length of filtered collection:", len(filtered_author_metadata))

Length of original collection: 307396
Length of filtered collection: 87837


We've now filtered down our collection by making some choices: namely, deciding we don't want to overrepresent any one author, and more specifically, that we want no more than ten poems from each author. This limits the size of our collection, but it may also make it easier for us to make certain arguments about what is (or isn't) in there.

**Exercise** Add code to ensure that we also exclude authors who have contributed fewer than 5 poems. How much more does this limit the corpus size? When might this be worthwhile?

Once we have a filtered version of our collection, it's worth writing it out so that we don't have to regenerate it each time we run subsequent analyses. There are two reasons for this: first, because running this sort of processing can take a while on a larger collection, and second, because it helps us keep track of what version of the text we are using. In this case, since we're doing something random to select the text, it's extra-important - if we rerun this step, we'll get a different subset of the corpus!
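Saving the output is the main safeguard, but it can also help to make the random step itself repeatable. The sketch below is one possible approach, not something the original procedure used: it gives the sampler a fixed seed through a dedicated `random.Random` object. The helper name `sample_up_to`, the toy `poems` list, and the seed value are all assumptions made up for illustration.

```python
import random

def sample_up_to(items, max_items, seed):
    """Return at most `max_items` of `items`, chosen reproducibly."""
    # A dedicated random.Random instance keeps our seed from
    # affecting (or being affected by) other uses of `random`.
    rng = random.Random(seed)
    if len(items) > max_items:
        return rng.sample(items, max_items)
    # short lists are returned whole, as a copy
    return list(items)

# a made-up stand-in for one author's list of poems
poems = ['poem_%d' % i for i in range(50)]

first_run = sample_up_to(poems, 10, seed=2022)
second_run = sample_up_to(poems, 10, seed=2022)

# the same seed gives the same sample both times
print(first_run == second_run)
```

If you go this route, the seed becomes one more choice worth writing down alongside the output file, and the saved file is still the more reliable record, since the exact sample a seed produces is only guaranteed to match when the same code runs on the same Python version.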

In [26]:
write_tsv_from_dicts(filtered_author_metadata, "poemas_metadata_22_6_27-author_limit_10.tsv")

Importantly, when we write out a processed version of a collection, we always want to keep track of the changes we made from the original collection to help with reporting and recreating our process later. Using an informative filename can help that (almost all my processed data files include the date I made them as well as some keywords for what changed), but there's no replacement for having an ongoing document that keeps track of your intermediate versions of "cleaned" text collections.

**Warning**: Using something like a Jupyter notebook can make this problem feel like it's already solved, since you can add text around where you generated a particular file. However, **Jupyter notebooks only work as logs of your procedure if you don't go back and edit the script you used to generate the data!** If you think you might make a version 1 and then make some changes to your process for a version 2, 3, and so on, you should either make sure to record what you did in version 1 clearly in some place where that information won't get changed, or you should delete everything produced from the version 1 procedure so it can never accidentally be reused. Not doing this causes a *lot* of problems, especially if you have multiple people working on a project who might mistake an old file for the one to use.

We've now seen a short introduction to working with TSV data. We could also have stored our data in comma-separated value (CSV) files by taking the code above and omitting the `delimiter` keyword:


In [28]:
# This code should look familiar!
import csv

def read_csv_as_dicts(filename):
    """Load in a CSV, or comma-separated value file, using Python's 
    built-in library `csv` for parsing fixed delimiter files. Loads
    in each row as a dictionary."""
    # open the file in read mode
    # newline='' lets the csv module handle line endings itself
    with open(filename, encoding='utf-8', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        row_dictionaries = [row for row in reader]
    return row_dictionaries

def write_csv_from_dicts(rows, filename):
    """Given a list of dictionaries with consistent keys, writes
    out a comma-separated value file using Python's built-in library
    `csv` to interpret the rows.
    """
    # open a file in write mode
    with open(filename, 'w', encoding='utf-8', newline='') as csv_file:
        # we grab the list of column names from the keys of
        # one of the rows
        columns = list(rows[0].keys())
        # no `delimiter` argument here, so we get the default comma
        writer = csv.DictWriter(csv_file, columns)
        writer.writeheader()
        writer.writerows(rows)

If you want, you can try these out and explore the differences in how this renders files. CSV files rely on quote characters and commas to separate out fields, which comes with a bit of danger for text processing, since commas show up quite often in text. CSV files will typically use a double quote `"` as a *quote character* to mark the boundaries of a piece of text so that commas inside it aren't read as the end of a column. This, in turn, produces interesting quirks for how to render quotes: a literal double quote inside a quoted field is usually written as two double quotes in a row, which is why one of the titles in our raw TSV file showed up as `"Poema Épico  ""Tecum Umán"""`. If you use Python's `csv` library, it should take care of all of that for you, defaulting to the same behavior as what Microsoft Excel does, as specified by the default argument `dialect='excel'`. If you want to change this to another structure, I would recommend against trying to code it yourself, since it's easy to introduce errors, and instead check what dialect makes the most sense to use - you can even find Excel's official tab-separated value dialect!

In [29]:
csv.list_dialects()

['excel', 'excel-tab', 'unix']
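To see the quoting behavior in action without touching any files, here is a small sketch that writes one made-up row into an in-memory buffer using two of the dialects listed above. The `render` helper and the sample `row` are hypothetical names invented for this illustration.

```python
import csv
import io

# a made-up row whose title contains both a comma and double quotes
row = {'title': 'Hola, "mundo"', 'author': 'ana'}

def render(row, **kwargs):
    """Write a header and one row to a string instead of a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, list(row.keys()), **kwargs)
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()

# default dialect ('excel'): comma-delimited
csv_text = render(row)
# 'excel-tab' dialect: tab-delimited, same quoting rules
tsv_text = render(row, dialect='excel-tab')

print(csv_text)
print(tsv_text)

# Reading the text back recovers the original title, with the
# surrounding quotes and doubled quotes removed again.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
print(round_trip[0]['title'])
```

In both outputs the title is wrapped in double quotes and its inner quotes are doubled; in the CSV version that wrapping is also what protects the comma inside the title from being read as a column break.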

We've looked at storing and saving delimited files. Let's look at another format, the JSON file.

## JSON, standards, and encodings


JSON stands for **J**ava**S**cript **O**bject **N**otation. It's a syntax to describe structures of information based on how JavaScript makes objects, but because of its flexibility and clarity, it's also used in a variety of web applications and as a storage mechanism for some datasets.

We can use Python's built-in `json` library to see what it would look like to write out the last ten entries of our unfiltered metadata from before. The library has two functions for writing out data in JSON format: `json.dumps` returns a string, while `json.dump` writes to an open file. Since we want to see a string of the output, we'll use `dumps`:

In [30]:
import json

# use Python slicing and negative indexing to grab the
# last 10 elements of the list
last_ten_metadata = poetry_metadata[-10:]
last_ten_json = json.dumps(last_ten_metadata)
print(last_ten_json)

[{"url": "//www.poemas-del-alma.com/blog/mostrar-poema-663171", "title": "ESTACIONES", "author": " Seydel", "year": "2022", "month": "6", "page": "2"}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-663172", "title": "Me Perd\u00ed En Sue\u00f1os Ap\u00f3crifos...  \u00a0", "author": " alicia perez hernandez", "year": "2022", "month": "6", "page": "2"}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-663173", "title": "Seres Imperfectos", "author": " Isla \u2728", "year": "2022", "month": "6", "page": "2"}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-663174", "title": "Gracias por estar", "author": " lesly nadal", "year": "2022", "month": "6", "page": "2"}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-663175", "title": "UNA RAMA ENTRE GUIJARROS", "author": " mariapdfoxa", "year": "2022", "month": "6", "page": "2"}, {"url": "//www.poemas-del-alma.com/blog/mostrar-poema-663176", "title": "\u00a1...TRES REGALOS...!", "author": " Mauro Perez Arteaga", "year"

It looks like a list of dictionaries, each with the same six keys and its own values. We notice here that we haven't really done anything to indicate what's a number and what's a string, so everything is being written out as strings. We also might have a little trouble reading this the way it's printing right now - everything's running together in one long line. Finally, we can spot that we've actually had a leading space in front of all our author names - looks like the data wasn't cleaned up super well! So let's see if we can make some fixes.

First, let's turn our values for year, month, and page into actual integers! It's a basic for loop, but it's the sort of thing I end up doing all the time to help read in structured information by recasting the text of a number as an actual number. To get integers, I'll just use `int()`.

In [41]:
for meta_dict in poetry_metadata:
    for key in ["year", "month", "page"]:
        # replace the string with an integer from that string
        meta_dict[key] = int(meta_dict[key])

One step down. Now, to get rid of leading spaces (and trailing spaces if we have them), we'll use the `strip()` method built into strings:

In [42]:
for meta_dict in poetry_metadata:
    meta_dict["author"] = meta_dict["author"].strip()

Better - now let's try writing out the JSON format more legibly. One thing that would help is having some visible indentation to help us tell when something new is starting. The Python `json` library will let us do that with the keyword argument `indent` - every time a new dictionary or list starts, it will use an additional `indent` number of spaces to pad the start of the line. Here, with our list of ten items, it'll look like this:

In [43]:
# Pretty print our updated list of dictionaries
pretty_ten_json = json.dumps(last_ten_metadata, indent=2)
print(pretty_ten_json)

[
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-663171",
    "title": "ESTACIONES",
    "author": "Seydel",
    "year": 2022,
    "month": 6,
    "page": 2
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-663172",
    "title": "Me Perd\u00ed En Sue\u00f1os Ap\u00f3crifos...  \u00a0",
    "author": "alicia perez hernandez",
    "year": 2022,
    "month": 6,
    "page": 2
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-663173",
    "title": "Seres Imperfectos",
    "author": "Isla \u2728",
    "year": 2022,
    "month": 6,
    "page": 2
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-663174",
    "title": "Gracias por estar",
    "author": "lesly nadal",
    "year": 2022,
    "month": 6,
    "page": 2
  },
  {
    "url": "//www.poemas-del-alma.com/blog/mostrar-poema-663175",
    "title": "UNA RAMA ENTRE GUIJARROS",
    "author": "mariapdfoxa",
    "year": 2022,
    "month": 6,
    "page": 2
  },
  {
    "url": "//www.poe

**Exercise.** We made our changes to `poetry_metadata`, but saw those changes reflected in `last_ten_metadata`. Why? 

Okay, that's much easier to read - we can see each entry separately, and we can see that while there are quotes around the URL, title, and author, the year, month, and page don't have quotes - we're supposed to read them as a piece of code would, which is as raw numbers, not as characters in a sequence.
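To check that a piece of code really does read those unquoted values as numbers, here's a small sketch that writes one entry out with `json.dumps` and reads it back with `json.loads`. The entry is a hand-made stand-in for one of our metadata dictionaries, and I've deliberately left `month` as a string so the contrast shows up.

```python
import json

# a hand-made stand-in entry: year is an int, month is still a string
entry = {"title": "ESTACIONES", "author": "Seydel", "year": 2022, "month": "6"}

# write the entry out as JSON text, then parse that text again
entry_as_text = json.dumps(entry)
restored_entry = json.loads(entry_as_text)
print(entry_as_text)

# each value comes back with the type it had going in
for key, value in restored_entry.items():
    print(key, "->", repr(value), type(value).__name__)
```

The unquoted `2022` comes back as an `int`, while the quoted `"6"` stays a `str`, which is why recasting the values before writing matters.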

Importantly, these two versions of printing the text will have very different lengths and contents. We can also compare this to how much space the last ten lines would be in CSV form, which would just combine the fields in order. As mentioned before, we should usually let the `csv` library do the work of reading and writing these files, but for this example, I'll just turn everything back into a string myself and combine each line together with tabs using Python's string `join` method, which combines a list of strings into one using the calling string as a delimiter:

In [44]:
tsv_keys = list(last_ten_metadata[0].keys())

original_ten_tsv = ""
for metadata in last_ten_metadata:
    # combine all fields with tabs
    row_columns = [str(metadata[k]) for k in tsv_keys]
    line = "\t".join(row_columns)
    # add the combined line and newline character
    original_ten_tsv += line + "\n"

print(original_ten_tsv)

//www.poemas-del-alma.com/blog/mostrar-poema-663171	ESTACIONES	Seydel	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663172	Me Perdí En Sueños Apócrifos...   	alicia perez hernandez	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663173	Seres Imperfectos	Isla ✨	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663174	Gracias por estar	lesly nadal	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663175	UNA RAMA ENTRE GUIJARROS	mariapdfoxa	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663176	¡...TRES REGALOS...!	Mauro Perez Arteaga	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663177	LO QUE SE LLEVA AL VIENTO!	Omaris Redman	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663179	A un constructor de mundos posibles	JIEL	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663180	BALADA METAL	tulipan4922	2022	6	2
//www.poemas-del-alma.com/blog/mostrar-poema-663181	Altanera	Jose Gonzalez Bolon	2022	6	2



Now, let's look at the relationship between how easy it is for us to read these strings versus how much space they take:

In [45]:
print("Number of characters in ten metadata fields:")
print("Pretty JSON:", len(pretty_ten_json))
print("Raw JSON:", len(last_ten_json))
print("Raw TSV:", len(original_ten_tsv))

Number of characters in ten metadata fields:
Pretty JSON: 1937
Raw JSON: 1705
Raw TSV: 955


Surprised? Probably not - we spent extra space in our JSON representation to rewrite the names of our "columns" in every entry, and in the pretty printing, we also added bonus spaces. Of course, because of this format, JSON also has some flexibility we're not using: for instance, we can nest lists and objects inside other lists and objects, in the same way we could make a Python dictionary a value inside a Python dictionary. We can also choose to have some "keys" or attributes exist only some of the time; maybe if I had an author profile page for some authors but not others, I could add that attribute only where I need it in the JSON representation, but I would have to consistently have that column exist whether populated or not in the TSV representation. But it's worth talking about space because when we're representing data on computers in general, space adds up!
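Here's a rough sketch of that flexibility. The entries, the nested `author` object, the `tags` list, and the `profile_url` attribute are all made up for illustration - our real metadata doesn't have them. I'm also passing `ensure_ascii=False`, which asks `json.dumps` to keep non-ASCII characters as they are instead of writing `\uXXXX` escapes.

```python
import json

# made-up entries: the author is a nested object, and only the
# second entry has the optional "profile_url" and "tags" attributes
nested_poems = [
    {"title": "ESTACIONES", "author": {"name": "Seydel"}},
    {
        "title": "Me Perd\u00ed",
        "author": {
            "name": "Isla \u2728",
            "profile_url": "https://example.org/authors/isla",
        },
        "tags": ["amor", "ausencia"],
    },
]

nested_json = json.dumps(nested_poems, indent=2, ensure_ascii=False)
print(nested_json)

# when reading it back, .get() lets us cope with attributes
# that only exist some of the time
for poem in json.loads(nested_json):
    profile = poem["author"].get("profile_url", "(no profile page)")
    print(poem["title"], "-", profile)
```

A TSV version of the same data would need a `profile_url` column for every row, plus some convention of our own for squeezing the list of tags into a single field.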

## A digression - text encodings

To break this down a little more, let's review how strings store data. (I say review because I believe some of this is covered in the Intro to Python sequence for TAP.) A string is a sequence of characters, or symbols. Since computer memory is built to store information as binary numbers, a string is stored as a list of binary numbers, one for each character we see. We call the number assigned to a symbol its *code point*. For instance, for pretty much any computer you'll use, the letter `A` has code point 65 if we're counting in base ten, and `a` has code point 97.
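If you'd like to check those numbers yourself, Python's built-in `ord()` gives the code point of a single character and `chr()` goes the other way. This is just a quick sketch; the sample characters and the word "Hola" are my own picks.

```python
# look up the code points of a few characters
for symbol in ["A", "a", "7", " "]:
    print(repr(symbol), "has code point", ord(symbol))

# chr() turns a code point back into its character
print(chr(65), chr(97))

# a string is really one code point per character
code_points = [ord(character) for character in "Hola"]
print(code_points)

# and we can rebuild the string from those numbers
rebuilt = "".join(chr(number) for number in code_points)
print(rebuilt)
```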

A listing of these numbers is a *standard* - these numbers both come from the [ASCII standard](https://en.wikipedia.org/wiki/ASCII), or American Standard Code for Information Interchange, which dates back to the '60s. The ASCII standard is a product of its place and time: it goes from 0 to 127 and, fitting the expectations of popular characters for US English, it includes digits, Latin letters without accents, punctuation, spacing, and a series of special symbols meant to match up to different typewriter operations. (After all, back in the '60s, typewriters were a standard way to handle input and output for computers.)

The standard we interact with commonly online and on our phones is the [Unicode standard](https://en.wikipedia.org/wiki/Unicode). Unicode starts with the same 128 symbols as ASCII, but then extends well beyond that to include characters from other alphabets, diacritical marks, stylized symbols, and even emoji. It's also updated annually by the Unicode Consortium, a non-profit whose voting members comprise many well-known tech companies and research organizations. At the time of producing this tutorial, [Unicode 15.0 is about to be rolled out with over 149,000 characters](https://home.unicode.org/unicode-15-0-beta-review/).

Of course, this gets at the way we map symbols to numbers, but to actually encode text - that is, to write it into our computer files and memory - we need a way to turn those numbers into a consistent sequence of ones and zeros. We call this a text encoding. Unicode actually has several different ways to do this.

Let's give ourselves an example made-up username with some non-ASCII characters to see how this looks:

In [57]:
username = "Mar\u00EDa \u2615"
print(username)

María ☕


Python 3 will usually let us paste non-ASCII characters straight into a string, but to be explicit about exactly which characters we mean, we use the `\u####` format to specify the code point of the symbol we want in *hexadecimal*. (That format takes exactly four hexadecimal digits; code points too large for that use the eight-digit `\U########` form.) Hexadecimal is base 16, which has some more digits than the base-10 *decimal* counting system we're used to - it goes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10. The hexadecimal number 10 is equivalent to the decimal number sixteen. So, to write the decimal number forty-two, we'd use two sixteens and a ten, which we'd write as 2A (2\*16 + 10). If we wanted to write out sixteen squared in hexadecimal, we could just write 100, the same way 100 in our usual decimal counting system is ten squared. It's convenient for programming because each hexadecimal digit can be written using exactly four bits.

We wrote this using two hexadecimal numbers, one for our i-acute and one for our coffee emoji. We can even use Python to convert the numbers out of hexadecimal (or "hex") if we'd like:

In [59]:
print("\u00ED", "has hex code point 00ED and dec code point", int("00ED", 16))
print("\u2615", "has hex code point 2615 and dec code point", int("2615", 16))

í has hex code point 00ED and dec code point 237
☕ has hex code point 2615 and dec code point 9749
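Here's a short sketch that checks the hexadecimal arithmetic from above and goes in the other direction, from a character to its hex code point. I've added one symbol of my own, a grinning-face emoji, because its code point needs more than four hex digits and so uses the longer `\U` escape. The standard library's `unicodedata.name()` looks up the official Unicode name of each character.

```python
import unicodedata

# hexadecimal to decimal, matching the arithmetic in the text
forty_two = int("2A", 16)          # 2*16 + 10
sixteen_squared = int("100", 16)   # 1*16*16
print(forty_two, sixteen_squared)

# decimal to hexadecimal
print(hex(42))               # lowercase, with a 0x prefix
print(format(237, "04X"))    # uppercase, padded to four digits

# from a character to its code point and official name
symbols = ["\u00ED", "\u2615", "\U0001F600"]
for symbol in symbols:
    hex_code_point = format(ord(symbol), "04X")
    print(symbol, "U+" + hex_code_point, unicodedata.name(symbol))
```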


Now that we have Unicode code points in hand, let's think about exactly how much memory (not just how many characters) will get used by our different strings when we write them out using a particular encoding:

In [77]:
# get the text encoded using different encodings
for encoding in ['utf-8', 'utf-32', 'latin1']:
    byte_val = "Mar\u00EDa".encode(encoding)
    print(encoding, "needs", len(byte_val), "bytes")
    print("Hex code:", byte_val.hex())
    print()

utf-8 needs 6 bytes
Hex code: 4d6172c3ad61

utf-32 needs 24 bytes
Hex code: fffe00004d0000006100000072000000ed00000061000000

latin1 needs 5 bytes
Hex code: 4d6172ed61



What's going on? I've used three different encodings here for just the name María: two associated with Unicode and one, `latin1`, that predates Unicode. (Heads up, `latin1`, or `ISO-8859-1`, used to be a common encoding for text based on a Latin alphabet with accents, so you may run into it now and again. If reading a file as UTF-8 gives you a decoding error or a bunch of replacement characters instead of the accented letters you expect, try to read the file using `encoding='latin1'` and see if that fixes it. If you instead get a bunch of As with accents, like `Ã`, where the accented letters should be, the mix-up usually runs the other way: the file is UTF-8 but is being read as `latin1`.)

To break down what's going on here with each encoding:
* `utf-8` uses a variable-length encoding, where the leading bits of each byte say whether it is a complete character on its own, the start of a longer multi-byte character, or a continuation of one. This means that for the four letters contained in ASCII, it only uses one byte each, but for the accented í, it has to use two bytes, giving us 6 total bytes.
* `utf-32` is fixed-length, but it uses four bytes (32 bits) for each symbol, plus an additional symbol at the front, called the *byte order mark*, that tells it how it'll order the four bytes (since some programs read bytes front to back and others back to front). So, we get 5 * 4 + 4 bytes, or 24. If this was all emoji, those bytes would carry a lot more content, but since it's mostly ASCII, we see a lot of 0s in the extra bytes.
* `latin1` is explicitly designed to support some Latin characters with accents without needing a second byte, so it is able to write the acute í in one byte, using only five bytes. (However, if we needed to write the coffee emoji, it'd throw an error - try it!)
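The breakdown above can be poked at directly. This is a sketch rather than anything official: it prints the UTF-8 bytes in binary so the leading bits are visible, compares `utf-32` against `utf-32-le` (a variant that fixes the byte order and so skips the mark at the front), and then shows what happens when bytes written in one encoding are read back in another.

```python
name = "Mar\u00EDa"

# UTF-8: single-byte characters start with 0, the first byte of a
# two-byte character starts with 110, continuation bytes start with 10
for byte in name.encode("utf-8"):
    print(format(byte, "02x"), format(byte, "08b"))

# UTF-32 with and without the byte order mark at the front
utf32_with_mark = name.encode("utf-32")
utf32_little_endian = name.encode("utf-32-le")
print(len(utf32_with_mark), "bytes vs", len(utf32_little_endian), "bytes")

# latin1 has no way to write the coffee emoji
latin1_rejected_emoji = False
try:
    "\u2615".encode("latin1")
except UnicodeEncodeError as error:
    latin1_rejected_emoji = True
    print("latin1 can't encode it:", error)

# UTF-8 bytes misread as latin1 give the classic accented-A garbling
garbled = name.encode("utf-8").decode("latin1")
print(repr(garbled))
# reversing the mistake recovers the original text
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)

# latin1 bytes misread as UTF-8 fail outright instead
utf8_rejected_latin1 = False
try:
    name.encode("latin1").decode("utf-8")
except UnicodeDecodeError as error:
    utf8_rejected_latin1 = True
    print("utf-8 can't decode it:", error)
```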

## Finding duplicates (or near-duplicates)

We've now thought a bit about what's going on with our pieces of text. So let's check something - are all of the poems in our dataset unique? Let's go back to our original list of poems and see how unique our titles are.

In [63]:
title_counter = Counter([poem['title'] for poem in poetry_metadata])
print(title_counter.most_common(20))

[('Soledad', 190), ('SOLEDAD', 137), ('Amor', 129), ('Recuerdos', 127), ('Mujer', 118), ('TE AMO', 118), ('Eres', 116), ('Hoy', 115), ('MUJER', 114), ('Quiero', 110), ('Silencio', 104), ('Tú', 96), ('Ella', 96), ('AMOR', 92), ('Ausencia', 91), ('Quisiera', 91), ('QUIERO', 84), ('TE EXTRAÑO', 82), ('Despedida', 82), ('AUSENCIA', 81)]


Huh - looks like we have a lot of poems about solitude. This isn't a big surprise, that titles are getting used often for common themes, but we might be concerned if we have multiple copies of the same poem from the same author. Let's try including that information and see if that removes our duplicates.

I'll sometimes see people do this task by making some new special text field by gluing things together, e.g. by putting the title, a weird delimiter, and then the author into one string, like "SOLEDAD~+~María". However, we don't have to do this - Python is perfectly happy to use a tuple, e.g. `(title, author)`, in place of a single string or value as a key in a dictionary, and it saves us time and silliness trying to chop up and glue together pieces of information. So we'll just use that instead:

In [66]:
title_author_counter = Counter([(poem['title'], poem['author']) for poem in poetry_metadata])
print(title_author_counter.most_common(20))

[(('sin título', 'pablo beltran'), 42), (('TEMA SEMANAL.', 'CUARTEL DE POETAS LOCOS.'), 36), (('De: Sin Inspiracion', 'amdiosteza'), 19), (('REFLEXIONES...Y ALGO MÁS', 'boris gold'), 18), (('Tres versos...', 'PETALOS DE NOCHE'), 16), (('\\"OVILLEJOS\\"', 'Rafael Escobar'), 16), (('ALGUNOS HAIKUS', 'boris gold'), 15), (('Sin título', 'Ángel del Caos'), 15), (('a veces...', 'sergiocabas'), 14), (('¡¡¡ VERSOS DESDE CÓRDOBA !!!', 'El Hombre de la Rosa'), 14), (('GENESIS', 'FERNANDO CARDONA'), 11), (('RUMBO AL CIELO', 'fernandocardonakaro'), 11), (('Palabras', 'Esteban Mario Couceyro'), 8), (('El viento', 'Esteban Mario Couceyro'), 8), (('...', 'Roel Ybañez'), 8), (('Amor', 'Esteban Mario Couceyro'), 7), (('Recuerdos', 'Esteban Mario Couceyro'), 7), (('...', 'Joel Jaramillo'), 7), (('saber', 'Ταδεθσ'), 7), (('MUJER:', 'Abel Niquinga Ruiz'), 6)]


Wow, it looks like we have a much deeper problem than just common titles! Duplicates or near-duplicates are an inevitable part of text datasets. In this case, we probably want to make sure we don't include more than one copy of the same poem from the same author, so we might want to walk through our dataset and only include the *last* poem of a particular title by that author. Since our dataset is in chronological order, we can just grab the last poem using `defaultdict` again for this processing:

In [72]:
# Using the poetry_metadata variable again,
# we'll make a list for each title and author
metadata_by_title_author = defaultdict(list)
for poem in poetry_metadata:
    metadata_by_title_author[(poem['title'], poem['author'])].append(poem)

# Grab the latest poem in each list
unique_poems = []
for title_author in metadata_by_title_author:
    unique_poems.append(metadata_by_title_author[title_author][-1])

In [77]:
print("Removed", len(poetry_metadata) - len(unique_poems), "duplicate poems")
print("Percentage of poems remaining:", 100 * len(unique_poems) / len(poetry_metadata))

Removed 4067 duplicate poems
Percentage of poems remaining: 98.67695090372028
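For comparison, here's another way to get the same "keep the last one" behavior, sketched on a tiny made-up list instead of our real metadata. It relies on the fact that assigning to an existing dictionary key overwrites the old value, so a plain dictionary keyed by `(title, author)` ends up holding only the latest poem for each pair.

```python
# a made-up, chronologically ordered toy collection
toy_poems = [
    {"title": "Soledad", "author": "ana", "url": "/poema-1"},
    {"title": "Soledad", "author": "luis", "url": "/poema-2"},
    {"title": "Soledad", "author": "ana", "url": "/poema-3"},
    {"title": "Amor", "author": "ana", "url": "/poema-4"},
]

# later poems overwrite earlier ones that share a (title, author) key
latest_by_title_author = {}
for poem in toy_poems:
    latest_by_title_author[(poem["title"], poem["author"])] = poem

unique_toy_poems = list(latest_by_title_author.values())
for poem in unique_toy_poems:
    print(poem["title"], poem["author"], poem["url"])

print("Removed", len(toy_poems) - len(unique_toy_poems), "duplicate poem(s)")
```

The `defaultdict(list)` version keeps every copy around, which is handy if you want to inspect the duplicates afterward; this version throws the earlier copies away as it goes.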


We've gotten rid of several thousand duplicate entries, but this was assuming that duplicates had exactly the same orthography. Often, duplicate entries have slightly different information: for instance, the same article from the AP newswire may be printed with different titles in different newspapers, or a book's database entry may vary based on extra text in the title related to edition number or whether the author's middle initial was included. As a result, we usually need to do some kind of **fuzzy matching** to be sure we've caught duplicates with subtle variations. This can include counting words to see whether there's >95% overlap in the exact count of words for longer documents, or using something like NLTK's `nltk.metrics.distance.edit_distance` to measure the "distance" between strings in terms of how many single-character edits would be needed to get from one string to the other. (Usually, this "number of characters to change" metric goes by the name *Levenshtein distance*.)
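As a rough sketch of the word-counting idea, here's one way to score overlap between two documents with `collections.Counter`. The exact formula (shared word counts divided by the length of the longer document), the function name `word_count_overlap`, and the two example sentences are my own assumptions; there are plenty of other reasonable ways to define overlap.

```python
from collections import Counter


def word_count_overlap(text_a, text_b):
    """Return the fraction of word occurrences the two texts share,
    relative to the longer text (an illustrative definition)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # & keeps each word with the smaller of its two counts
    shared = sum((counts_a & counts_b).values())
    longest = max(sum(counts_a.values()), sum(counts_b.values()))
    return shared / longest if longest else 1.0


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
doc_c = "an entirely different sentence about poems"

overlap_ab = word_count_overlap(doc_a, doc_b)
overlap_ac = word_count_overlap(doc_a, doc_c)
print("a vs b:", round(overlap_ab, 3))
print("a vs c:", round(overlap_ac, 3))
```

With documents this short, one changed word already drops the score well below 95%, which is why a threshold like that is better suited to longer texts.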

Here's an example of using edit distance to compare a few pairs of strings, all with the same edit distance of 2:

In [81]:
from nltk.metrics.distance import edit_distance

pairs = [
    ("create", "creative"),
    ("cat", "cow"),
    ("maria", "Mar\u00EDa"),
    ("Assume", "Ass u me")
]
for w1, w2 in pairs:
    print(w1, "/", w2, "-", edit_distance(w1, w2))

create / creative - 2
cat / cow - 2
maria / María - 2
Assume / Ass u me - 2
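To see where those numbers come from, here's a plain-Python sketch of the standard dynamic-programming approach to Levenshtein distance, where each insertion, deletion, or substitution costs 1. This is my own illustrative version (the name `levenshtein` is an assumption), not NLTK's actual code, though it should agree with `edit_distance` under its default settings.

```python
def levenshtein(source, target):
    """Count the single-character insertions, deletions, and
    substitutions needed to turn source into target."""
    # distances from an empty prefix of source to each prefix of target
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        # turning the first i characters of source into "" takes i deletions
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete from source
                current_row[j - 1] + 1,                   # insert into source
                previous_row[j - 1] + substitution_cost,  # substitute or match
            ))
        previous_row = current_row
    return previous_row[-1]


pairs = [
    ("create", "creative"),
    ("cat", "cow"),
    ("maria", "Mar\u00EDa"),
    ("Assume", "Ass u me"),
]
for w1, w2 in pairs:
    print(w1, "/", w2, "-", levenshtein(w1, w2))
```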


This simple formulation of edit distance isn't terribly context-aware: it doesn't understand that changes in capitalization or the omission of an accent may be smaller changes in our eyes, nor does it understand that the length of a word may play a role. Depending on the setting you're in, you may want to make substitutions or lower-case text in advance of doing something like this, or look for a more sophisticated fuzzy matching tool.
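As one possible version of "making substitutions in advance", here's a sketch of a normalization function that lower-cases text, strips accents, and tidies whitespace before any comparison happens. The function name `normalize_for_matching` is made up, and the choices it makes are assumptions you'd want to revisit for your own data: in Spanish, for instance, dropping the tilde from ñ can turn one word into a different one.

```python
import unicodedata


def normalize_for_matching(text):
    """Lower-case, strip accents, and collapse whitespace
    (an illustrative choice of normalizations)."""
    # casefold() is a more aggressive lower() meant for comparisons
    folded = text.casefold()
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", folded)
    without_marks = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    # split/join trims the ends and collapses runs of whitespace
    return " ".join(without_marks.split())


examples = ["Mar\u00EDa", "maria", "  SOLEDAD ", "TE EXTRA\u00D1O"]
for example in examples:
    print(repr(example), "->", repr(normalize_for_matching(example)))
```

After this step, "maria" and "María" normalize to the same string, so they'd be at edit distance 0 rather than 2.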


**Exercise.** Rewrite the code for finding unique poems to lower-case all author names and poem titles before comparing them. Does this have an effect? *Extra -* Can you find the poems that are removed in one and not the other?

## One more organization scheme: XML

Let's see one more format of storing our text: XML, or e**X**tensible **M**arkup **L**anguage. XML is designed to allow you to explicitly structure fairly complex data as a tree of different elements. For instance, our tree could have a root element as the list of all our poems, then have branches - or sub-elements - for each separate poem. These branches can have their own sub-elements (like the title and author) as well as attributes (like their URL or the year they were written). If you've looked at HTML code before, the syntax might look familiar. Let's make an example with just our last ten poems:

In [56]:
import xml.etree.ElementTree as ET

# build an XML tree (ick)
poems_root = ET.Element('poems')
for poem_dict in last_ten_metadata:
    new_poem = ET.SubElement(poems_root, 'poem')
    
    # Set attributes
    for attribute in ['url', 'year', 'month', 'page']:
        new_poem.set(attribute, str(poem_dict[attribute]))
    
    # Add text to subelements
    title = ET.SubElement(new_poem, 'title')
    title.text = poem_dict['title']
    author = ET.SubElement(new_poem, 'author')
    author.text = poem_dict['author']

ET.indent(poems_root)
xml_data = ET.tostring(poems_root, encoding='utf-8')
print(xml_data.decode('utf-8'))

<poems>
  <poem url="//www.poemas-del-alma.com/blog/mostrar-poema-663171" year="2022" month="6" page="2">
    <title>ESTACIONES</title>
    <author>Seydel</author>
  </poem>
  <poem url="//www.poemas-del-alma.com/blog/mostrar-poema-663172" year="2022" month="6" page="2">
    <title>Me Perdí En Sueños Apócrifos...   </title>
    <author>alicia perez hernandez</author>
  </poem>
  <poem url="//www.poemas-del-alma.com/blog/mostrar-poema-663173" year="2022" month="6" page="2">
    <title>Seres Imperfectos</title>
    <author>Isla ✨</author>
  </poem>
  <poem url="//www.poemas-del-alma.com/blog/mostrar-poema-663174" year="2022" month="6" page="2">
    <title>Gracias por estar</title>
    <author>lesly nadal</author>
  </poem>
  <poem url="//www.poemas-del-alma.com/blog/mostrar-poema-663175" year="2022" month="6" page="2">
    <title>UNA RAMA ENTRE GUIJARROS</title>
    <author>mariapdfoxa</author>
  </poem>
  <poem url="//www.poemas-del-alma.com/blog/mostrar-poema-663176" year="2022" month=
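Going the other direction, here's a small sketch of reading XML like this back into a list of dictionaries using the same built-in `ElementTree` library. The two-poem XML string is a shortened stand-in I typed out by hand (with `example.org` URLs), not the real output above.

```python
import xml.etree.ElementTree as ET

# a hand-written stand-in with the same structure as our output
sample_xml = """<poems>
  <poem url="//example.org/poema-1" year="2022" month="6" page="2">
    <title>ESTACIONES</title>
    <author>Seydel</author>
  </poem>
  <poem url="//example.org/poema-2" year="2022" month="6" page="2">
    <title>Altanera</title>
    <author>Jose Gonzalez Bolon</author>
  </poem>
</poems>"""

root = ET.fromstring(sample_xml)

parsed_poems = []
for poem_element in root.findall("poem"):
    parsed_poems.append({
        # attributes always come back as strings
        "url": poem_element.get("url"),
        # findtext() returns the text inside the first matching sub-element
        "title": poem_element.findtext("title"),
        "author": poem_element.findtext("author"),
        # so numbers need recasting, just like before
        "year": int(poem_element.get("year")),
        "month": int(poem_element.get("month")),
        "page": int(poem_element.get("page")),
    })

for poem in parsed_poems:
    print(poem)
```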

I...do not enjoy writing code to parse XML in Python. While XML is powerful, XML parsing tends to be computationally expensive and a little sensitive, and working with the tree structure the way the XML library has it set up is less than fun, so I almost never write out an XML file if I'm using Python. However, I do sometimes need to parse them. For XML and HTML websites, I usually do this with `BeautifulSoup` for a few reasons:
1. I find its functionality for printing parts of the parsed data more intuitive,
2. It's a lot better about handling malformed XML or HTML (which is not uncommon), and
3. It has nice [documentation and examples](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup).


In [59]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_data, 'xml')

In [61]:
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<poems>
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663171" year="2022">
  <title>
   ESTACIONES
  </title>
  <author>
   Seydel
  </author>
 </poem>
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663172" year="2022">
  <title>
   Me Perdí En Sueños Apócrifos...
  </title>
  <author>
   alicia perez hernandez
  </author>
 </poem>
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663173" year="2022">
  <title>
   Seres Imperfectos
  </title>
  <author>
   Isla ✨
  </author>
 </poem>
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663174" year="2022">
  <title>
   Gracias por estar
  </title>
  <author>
   lesly nadal
  </author>
 </poem>
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663175" year="2022">
  <title>
   UNA RAMA ENTRE GUIJARROS
  </title>
  <author>
   mariapdfoxa
  </author>
 </poem>
 <poem m

BeautifulSoup can take in a raw string of text and parse it out as nicely-formatted XML (you can see it even added an XML declaration at the top). If we want to grab all the poems, we can just use `soup.find_all(tag)` to get a list of every element with that tag:

In [62]:
soup.find_all('poem')

[<poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663171" year="2022">
 <title>ESTACIONES</title>
 <author>Seydel</author>
 </poem>,
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663172" year="2022">
 <title>Me Perdí En Sueños Apócrifos...   </title>
 <author>alicia perez hernandez</author>
 </poem>,
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663173" year="2022">
 <title>Seres Imperfectos</title>
 <author>Isla ✨</author>
 </poem>,
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663174" year="2022">
 <title>Gracias por estar</title>
 <author>lesly nadal</author>
 </poem>,
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663175" year="2022">
 <title>UNA RAMA ENTRE GUIJARROS</title>
 <author>mariapdfoxa</author>
 </poem>,
 <poem month="6" page="2" url="//www.poemas-del-alma.com/blog/mostrar-poema-663176" year="2022">
 <title>¡...TRES REGALOS...!</

We've discussed several ways to organize text - in delimited files, JSON files, and XML markup files. We've also talked about some simple things we might do to filter down a large collection, like detecting duplicates or downsampling authors who are common. However, we still haven't been looking inside the documents themselves. In the next lesson, we'll look at how to normalize text in documents.

___
[Proceed to next lesson: Text Curation 2/3 ->](./textcuration-2.ipynb)