# Homework 6: Reasoning about Entities

I'm hoping this is a fairly short and simple assignment.

We developed code in our lab on March 8 that can plot the frequency of a particular type of entity in individual books, using ```compositionyear``` as the horizontal axis.

In the process of exploring changes in different entities, you probably noticed that the frequencies of DATE and TIME entities are both going up across the timeline.

What could explain this?

Are people just getting more interested in time generally?

The British historian E. P. Thompson wrote a famous essay on ["Time, Work-Discipline, and Industrial Capitalism,"](https://www.sv.uio.no/sai/english/research/projects/anthropos-and-the-material/Intranet/economic-practices/reading-group/texts/thompson-time-work-discipline-and-industrial-capitalism.pdf) which argues that new forms of work associated with the industrial revolution tended to reorganize our conception of time around clocks and watches. Work was no longer organized by the task ("we'll work until the harvest's in") but by the clock ("eight to six, with a thirty-minute break for lunch").

If this explanation is right, we might expect the ```TIME``` entities to increase even more than the ```DATE``` entities do.

One way to test this would be to calculate the ratio of TIME references to DATE references for each work in our dataset, and graph the ratio on the vertical (y) axis, with date of composition on the horizontal (x) axis. We might need to use Laplace (add-one) smoothing to avoid division by zero.
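As a minimal sketch of the smoothing idea: the function name `smoothed_ratio` and the `alpha` parameter are hypothetical, and the counts are invented for illustration. Adding the same small constant to both counts keeps the ratio defined even when a book has no DATE references.

```python
def smoothed_ratio(time_count, date_count, alpha=1):
    """TIME/DATE ratio with add-alpha smoothing (alpha=1 is add-one)."""
    return (time_count + alpha) / (date_count + alpha)


# Toy counts, invented for illustration only.
print(smoothed_ratio(14, 6))  # (14 + 1) / (6 + 1)
print(smoothed_ratio(3, 0))   # 4.0 instead of a division-by-zero error
print(smoothed_ratio(0, 0))   # 1.0: no evidence in either direction
```

Note that smoothing pulls every ratio toward 1, and pulls hardest on books with very few references of either kind.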

Is there a relationship? If so, how strong is it? (Whether it's statistically significant is a harder question, which we'll set aside for now.)
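One rough way to describe the strength of a relationship like this is a Pearson correlation between year and ratio. The sketch below uses `scipy.stats.pearsonr` on a toy dataframe whose values are invented for illustration; the real ratios would have to be computed from the entity counts first. The p-value it returns assumes independent observations and a roughly linear relationship, so it is better read as a description than as a verdict.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy data, invented for illustration; not taken from book_entities.tsv.
toy = pd.DataFrame({
    'compositionyear': [1720, 1760, 1800, 1840, 1880, 1920],
    'ratio': [0.8, 1.1, 0.9, 1.4, 1.3, 1.7],
})

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p = pearsonr(toy['compositionyear'], toy['ratio'])
print(f"r = {r:.2f}, p = {p:.3f}")
```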

**Here's some code to get you started. But you don't necessarily have to use this if you prefer to create your own notebook.**

In [5]:
from collections import Counter
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

from scipy.stats import pearsonr

%matplotlib inline

In [3]:
bookents = pd.read_csv('../data/book_entities.tsv', sep = '\t')
bookents.head()

| Unnamed: 0 | book_id | entity_type | entity_text | count |
|---|---|---|---|---|
| 0 | 38020 | TIME | an hour | 8 |
| 1 | 38020 | TIME | hours | 3 |
| 2 | 38020 | TIME | that morning | 2 |
| 3 | 38020 | TIME | the night | 1 |
| 4 | 38020 | TIME | minutes | 1 |


In [4]:
meta = pd.read_csv('../data/entity_metadata.tsv', sep = '\t')
meta.head()

| Unnamed: 0 | book_id | author | authordate | title | compositionyear | hathidate | genre | audience | authgender | wordcount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6422 | Defoe, Daniel | 1661-1731 | The life, adventures, and pyracies, of the fam... | 1720 | 1720.0 | fic | | m | 593408 |
| 1 | 370 | Defoe, Daniel | 1661-1731 | The Fortunes and Misfortunes of the Famous Mol... | 1722 | 1765.0 | fic | | m | 704890 |
| 2 | 52603 | Defoe, Daniel | 1661-1731 | Life of Colonel Jack. | 1731 | 1810.0 | fic | | m | 759489 |
| 3 | 12259 | Defoe, Daniel | 1661-1731 | Memoirs of the Honourable Col. Andrew Newport,... | 1731 | 1792.0 | fic | | m | 588231 |
| 4 | 9611 | Fielding, Henry | 1707-1754 | The history of the adventures of Joseph Andrew... | 1742 | 1743.0 | fic | | m | 362496 |


Now you want to achieve these goals:

1) Calculate the ratio of TIME references to DATE references in every book. Since some works might have no references to DATE (or to TIME), and division by zero is undefined, you might want to use Laplace (add-one) smoothing, adding one to all the counts. (That's also a defensible way of moderating extremely high or low ratios.) There are several ways to achieve this. One is to add one dummy TIME reference and one dummy DATE reference for every work before grouping and summing the counts. In other words, you would create two new dataframes, each with one row per distinct ```book_id```: one with TIME in every row, the other with DATE in every row. Both would have the same dummy entity text (perhaps "Laplace"?) and a count of 1 in every row. Then you'd use ```pd.concat``` to join these to ```bookents``` before grouping and summing rows.

2) Join the grouped frames to certain columns of ```meta```, including ```genre```, so you can exclude biographies.

3) Plot the ratios on the y axis with ```compositionyear``` on the x axis. 

4) Write a short paragraph (in a cell in your homework notebook) reflecting on the relationship you observed. How does the TIME/DATE ratio change across historical time? What might that suggest about the history of fiction? For right now, don't worry about the question of *statistical* significance: we haven't discussed how to calculate the significance of a ratio yet, and it's a bit tricky in this case. Focus instead on the *size* of the change. If you want a way to estimate the size of the change that's a bit more reliable than an eyeball impression of a scatterplot, you might consider dividing the timeline into two halves, adding up the counts for all the books in each half, and comparing the overall ratio of TIME references to DATE references in each half.
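The goals above could be approached along the following lines. This is only a sketch under stated assumptions, not the expected solution: the two toy dataframes stand in for ```bookents``` and ```meta``` with invented values, the names `dummies`, `smoothed`, `totals`, `fiction`, and `pooled` are hypothetical, and the midpoint split is just one way to define the two halves of the timeline.

```python
import pandas as pd

# Toy stand-ins for bookents and meta (all values invented for illustration).
bookents = pd.DataFrame({
    'book_id': [1, 1, 1, 2, 3, 3],
    'entity_type': ['TIME', 'TIME', 'DATE', 'TIME', 'DATE', 'PERSON'],
    'entity_text': ['an hour', 'minutes', 'Tuesday', 'midnight', '1850', 'Jane'],
    'count': [8, 1, 3, 5, 2, 4],
})
meta = pd.DataFrame({
    'book_id': [1, 2, 3],
    'compositionyear': [1750, 1850, 1900],
    'genre': ['fic', 'fic', 'bio'],
})

# Goal 1: one dummy TIME row and one dummy DATE row per book (add-one smoothing).
book_ids = bookents['book_id'].unique()
dummies = [
    pd.DataFrame({'book_id': book_ids, 'entity_type': etype,
                  'entity_text': 'Laplace', 'count': 1})
    for etype in ('TIME', 'DATE')
]
smoothed = pd.concat([bookents] + dummies, ignore_index=True)

# Sum counts per book and entity type, then spread TIME and DATE into columns.
totals = (smoothed[smoothed['entity_type'].isin(['TIME', 'DATE'])]
          .groupby(['book_id', 'entity_type'])['count'].sum()
          .unstack('entity_type'))
totals['ratio'] = totals['TIME'] / totals['DATE']

# Goal 2: attach year and genre, then keep only fiction.
ratios = totals.reset_index().merge(
    meta[['book_id', 'compositionyear', 'genre']], on='book_id')
fiction = ratios[ratios['genre'] == 'fic']

# Goal 4: pooled TIME/DATE ratio in each half of the timeline.
# The pooled sums still include the dummy counts added above.
years = fiction['compositionyear']
midpoint = (years.min() + years.max()) / 2
half = years.gt(midpoint).map({False: 'early', True: 'late'}).rename('half')
pooled = fiction.groupby(half)[['TIME', 'DATE']].sum()
pooled['ratio'] = pooled['TIME'] / pooled['DATE']
print(pooled)
```

For goal 3, a frame shaped like `fiction` could be passed to something like `sns.scatterplot(data=fiction, x='compositionyear', y='ratio')`, or to the plotting code from the lab.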