# Research Exercise 6: Anthologies as Data

As part of this week's research exercise, we're going to think about the work that anthologies do and the different ways we might approach them.

## Part 1: *The Norton Anthology of English Literature*

Between 1962 and 2018, W.W. Norton published ten editions of *The Norton Anthology of English Literature*. (Most editions were published in two volumes; starting with the 8th edition, they appeared in six volumes.)


Take a quick peek at the rough data on each of the anthologies in [this CSV file](https://github.com/sceckert/Data-and-Literary-Study-Spring2022/blob/main/_datasets/norton-anthologies-of-english-literature-metadata.csv). Then, [skim through the folder of tables of contents from the 2nd, 3rd, 4th, 8th, and 10th editions of the Norton Anthologies](https://princeton.instructure.com/courses/6331/files/folder/Norton-Anthology-of-English-Literature-tables-of-contents).

+ What patterns do you notice?

+ What questions might we ask of this dataset of anthologies?




Think about one of the questions above -- what kinds of choices would we have to make in encoding this data if we wanted to explore your question?


## ~~**~~ Your REFLECTION HERE ~**~~~

## Part 2: Anthologizing Australian Poetry

In this next part of the exercise, we're going to be working with the Australian National Poetry Anthologies, a dataset of 15 anthologies of Australian poetry published between 1946 and 2011, created by Jim Berryman and Caitlin Stone at the University of Melbourne. As Berryman and Stone describe it:

"This dataset is derived from the tables of contents of fifteen Australian ‘national’ poetry anthologies published between 1946 and 2011. Dataset contains all poems and poets included in the following anthologies

George Mackaness, *Poets of Australia: An Anthology of Australian Verse*, 1946

George Mackaness, *An Anthology of Australian Verse* (2nd ed.), 1952

Judith Wright, *A Book of Australian Verse*, 1956

John Thompson, Kenneth Slessor and R. G. Howarth, *The Penguin Book of Australian Verse*, 1958       

Judith Wright, *A Book of Australian Verse* (2nd ed.), 1968

Harry Heseltine, *The Penguin Book of Australian Verse*, 1972

Geoffrey Dutton, *Australian Verse from 1805: A Continuum*, 1975

Rodney Hall, *The Collins Book of Australian Poetry*, 1981

Geoffrey Dutton, *The Heritage of Australian Poetry*, 1984

Les Murray, *The New Oxford Book of Australian Verse*, 1986

Les Murray, *The New Oxford Book of Australian Verse* (2nd ed.), 1991

John Leonard, *Australian Verse: An Oxford Anthology*, 1998

John Leonard, *The Puncher & Wattmann Anthology of Australian Poetry*, 2009

John Kinsella, *The Penguin Anthology of Australian Poetry*, 2009

Geoffrey Lehmann and Robert Gray, *Australian Poetry Since 1788*, 2011"



For more on the dataset, read [the description of the dataset](https://figshare.com/articles/dataset/Australian_National_Poetry_Anthologies_Poems_and_Poets/4479590).

### What kind of questions can we ask with this dataset about the construction of Australian poetry?


Let's read in this dataset:

In [104]:
import pandas as pd

# Load the anthology dataset (one row per poem per anthology) into a DataFrame
australian_poetry_anthologies_df = pd.read_csv('../_datasets/Australian-National-Poetry-Anthologies-Dataset/australian-national-poetry-anthologies.csv', encoding='utf-8')

In [105]:
australian_poetry_anthologies_df

| Unnamed: 0 | Author | Poem | Anthology | Date |
| --- | --- | --- | --- | --- |
| 0 | Adams, Arthur Henry | The Australian | Mackaness 1946 | 1946.0 |
| 1 | Adams, Arthur Henry | The Dwellings of Our Dead | Mackaness 1946 | 1946.0 |
| 2 | Adamson, Bartlett | Wonder Everlasting | Mackaness 1946 | 1946.0 |
| 3 | Allan, James Alexander | Breaking | Mackaness 1946 | 1946.0 |
| 4 | Allan, James Alexander | Pavlova: A Dirge | Mackaness 1946 | 1946.0 |
| ... | ... | ... | ... | ... |
| 5504 | Wright, Judith | The Old Prison | Lehmann & Gray 2011 | 2011.0 |
| 5505 | Wright, Judith | The Unborn | Lehmann & Gray 2011 | 2011.0 |
| 5506 | Wright, Judith | Train Journey | Lehmann & Gray 2011 | 2011.0 |
| 5507 | Wright, Judith | Woman to Man | Lehmann & Gray 2011 | 2011.0 |
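Before asking questions of a new DataFrame, it can help to check its size, column names, and data types. The sketch below does this on five rows copied from the output above, read from an in-memory string rather than from the course CSV file. The names `sample_csv` and `sample_df` are illustrative only, and the empty first header cell is an assumption about how the real file is stored (pandas labels such a column `Unnamed: 0`).

```python
import io
import pandas as pd

# Five rows copied from the displayed output. ASSUMPTION: the header's first
# cell is empty in the real file, which pandas would name "Unnamed: 0".
sample_csv = '''\
,Author,Poem,Anthology,Date
0,"Adams, Arthur Henry",The Australian,Mackaness 1946,1946.0
1,"Adams, Arthur Henry",The Dwellings of Our Dead,Mackaness 1946,1946.0
2,"Adamson, Bartlett",Wonder Everlasting,Mackaness 1946,1946.0
5506,"Wright, Judith",Train Journey,Lehmann & Gray 2011,2011.0
5507,"Wright, Judith",Woman to Man,Lehmann & Gray 2011,2011.0
'''

# Read the sample exactly as we would read a file on disk
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)                 # (number of rows, number of columns)
print(sample_df.columns.tolist())      # column names as pandas sees them
print(sample_df.dtypes)                # Date is stored as a float here, not an integer
print(sample_df['Date'].isna().sum())  # how many rows are missing a date
```

Running the same four checks on `australian_poetry_anthologies_df` is a quick way to see whether any rows lack a date, which is one possible reason for years being displayed as `1946.0` rather than `1946`.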


### Question 2a.
How many poems were published in each anthology? 

Write out the code you would need to count the number of poems published in each anthology (and, if you'd like, try to plot this as a bar graph).

In [None]:
## Your code here

### Question 2b.
How many different authors are there in the dataset as a whole?

Write out the code you would need to output the number of unique author names in this dataset.

In [None]:
## Your code here

### Question 2c.
How many different authors appear in each anthology?

Write out the code you would need to output the number of authors that appear in each anthology.

In [None]:
## Your code here

### Question 2d.

Who were the top 20 *most* anthologized poets? This might seem like an obvious question, but it actually depends on how we define "most anthologized."

Let's start by defining "most anthologized" as the sheer number of times an author appears in our dataset (i.e., counting one appearance each time one of the author's works is included in one of the anthologies).

Type the code below to get the top 20 "most anthologized" authors (and the number of times they appear in this dataset).

> Hint: try outputting the value counts for each author name in this dataset, looking just at the 'Author' column.
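As a small illustration of the hint, `.value_counts()` tallies how often each value occurs in a column and sorts the tallies from largest to smallest. The sketch below uses a made-up DataFrame (`toy_df`, with invented poets, poems, and anthologies) rather than the real dataset, so adapting it to the exercise is still up to you.

```python
import pandas as pd

# Invented rows: one row per poem per anthology, mirroring the real dataset's layout
toy_df = pd.DataFrame({
    'Author':    ['Poet A', 'Poet A', 'Poet A', 'Poet B', 'Poet B', 'Poet C'],
    'Poem':      ['Poem 1', 'Poem 2', 'Poem 3', 'Poem 4', 'Poem 4', 'Poem 5'],
    'Anthology': ['Anth X', 'Anth X', 'Anth X', 'Anth X', 'Anth Y', 'Anth Y'],
})

# Count the rows for each author (largest count first) and keep the top 2
appearances = toy_df['Author'].value_counts().head(2)
print(appearances)
```

Here Poet A has the most appearances (three rows), even though all three poems sit in a single anthology.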

In [1]:
# your code here

But what if we defined "most anthologized" as the number of *distinct anthologies* that a given author appears in? Do we think that would change our results? And how would we do that?

We've already learned how to use the `.groupby()` function. One of the methods that we can use with `.groupby()` is `.nunique()`. This is a method for counting the number of unique values in each group of a GroupBy object; you can read more about it in the [`groupby()` page in the pandas user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation).
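To see what `.nunique()` adds, here is a minimal sketch on invented data (`toy_df` and its poets and anthologies are hypothetical, not drawn from the real dataset). It compares counting every row in a group with counting only the distinct values in that group.

```python
import pandas as pd

# Invented rows: Poet A has three poems in one anthology,
# Poet B has one poem in each of two anthologies
toy_df = pd.DataFrame({
    'Author':    ['Poet A', 'Poet A', 'Poet A', 'Poet B', 'Poet B', 'Poet C'],
    'Anthology': ['Anth X', 'Anth X', 'Anth X', 'Anth X', 'Anth Y', 'Anth Y'],
})

# Group the 'Anthology' column by author
by_author = toy_df.groupby('Author')['Anthology']

rows_per_author = by_author.count()           # every row counts, repeats included
anthologies_per_author = by_author.nunique()  # each anthology counts once per author

# Put the two measures side by side for comparison
print(pd.DataFrame({'rows': rows_per_author,
                    'distinct anthologies': anthologies_per_author}))
```

The two measures rank the toy poets differently: Poet A leads on rows, while Poet B leads on distinct anthologies.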



Before you run this code, explain, in plain English, what each part of the code below is doing:

`australian_poetry_anthologies_df.groupby(['Author'])['Anthology'].nunique().sort_values(ascending=False).head(20)`

### ~~**~~ Your explanation HERE ~**~~~

Now, let's look at our results:

In [100]:
print("Author Name:     Number of distinct anthologies this author's work appears in:")
australian_poetry_anthologies_df.groupby(['Author'])['Anthology'].nunique().sort_values(ascending=False).head(20)

Author Name:     Number of distinct anthologies this author's work appears in:


Author
Wright, Judith          15
Slessor, Kenneth        15
Hope, A.D.              14
Dobson, Rosemary        14
Blight, John            14
Gilmore, Mary           14
Lawson, Henry           14
Stewart, Douglas        13
Campbell, David         13
Neilson, John Shaw      13
Keesing, Nancy          12
Kendall, Henry          12
Webb, Francis           12
Harpur, Charles         12
Riddell, Elizabeth      12
Dutton, Geoffrey        12
Mudie, Ian              11
Manifold, J.S.          11
Brennan, Christopher    11
Paterson, A.B.          10
Name: Anthology, dtype: int64

Compare this answer to the first way of measuring "most anthologized". What do you notice?

## ~~**~~ Your REFLECTION HERE ~**~~~

### Question 2e.
What other questions might you want to ask about this dataset?

## ~~**~~ Your REFLECTION HERE ~**~~~

We'll continue this discussion in class!