# Third deduplication

A final standardizing pass on **workmeta**, followed by a temporal filter to produce **contemporaryworkmeta**.


In [1]:
import pandas as pd
import random

In [11]:
work = pd.read_csv('newworkmeta.tsv', sep = '\t', low_memory = False)

### Standardize names

In our second deduplication notebook, we wrote to file the pairs of author names that we discovered while deduplicating works. In each case the texts and titles of the works were similar enough to convince us that the two names refer to the same author.

However, we don't know which name in a pair is better. To decide, we need some rules, which the `name2prefer` function below provides.

In [12]:
authsets = pd.read_csv('authorsets.tsv', sep = '\t', header = None, names = ['name1', 'name2'])
authsets.head()

| | name1 | name2 |
|---|---|---|
| 0 | Robinson, Mary Stephens | Robinson, Mary S |
| 1 | Howes, Edith Annie | Howes, Edith |
| 2 | Williams, Elizabeth Whitney, Mrs | Williams, Elizabeth Whitney |
| 3 | Somerville, E. Œ. (Edith Œnone) | `Somerville, E. &#xbf;. (Edith &#xbf;none)` |
| 4 | Pearce, Donn | Pearce, Donald |


In [13]:
def name2prefer(row):
    ''' A bunch of rules we can use to make up our mind.
    Basically, diacritical marks are preferred to garbage
    equivalents; otherwise longer forms are usually preferred.
    '''
    option1 = row.name1
    option2 = row.name2
    if '??' in option1 or '#xbf;' in option1:
        return option2
    elif '??' in option2 or '#xbf;' in option2:
        return option1
    elif 'from old catalog' in option1:
        return option2
    elif 'from old catalog' in option2:
        return option1
    elif '?_' in option1:
        return option2
    elif '?_' in option2:
        return option1
    elif ' ̌' in option1:
        return option2
    elif ' ̌' in option2:
        return option1
    elif '1' in option1:
        return option2
    elif '1' in option2:
        return option1
    elif '  ' in option1:
        return option2
    elif '  ' in option2:
        return option1
    elif 'è' in option1 or 'é' in option1:
        return option1
    elif 'è' in option2 or 'é' in option2:
        return option2
    elif len(option1) > len(option2):
        return option1
    else:
        return option2

authsets = authsets.assign(normative = authsets.apply(name2prefer, axis = 1))
authsets.head()

| | name1 | name2 | normative |
|---|---|---|---|
| 0 | Robinson, Mary Stephens | Robinson, Mary S | Robinson, Mary Stephens |
| 1 | Howes, Edith Annie | Howes, Edith | Howes, Edith Annie |
| 2 | Williams, Elizabeth Whitney, Mrs | Williams, Elizabeth Whitney | Williams, Elizabeth Whitney, Mrs |
| 3 | Somerville, E. Œ. (Edith Œnone) | `Somerville, E. &#xbf;. (Edith &#xbf;none)` | Somerville, E. Œ. (Edith Œnone) |
| 4 | Pearce, Donn | Pearce, Donald | Pearce, Donald |
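
The priority order in `name2prefer` can also be written as data rather than as a long `elif` chain. What follows is only a sketch under assumptions: the names `BAD_TIERS`, `GOOD_MARKERS`, `has_any` and `prefer_name` are hypothetical, and the list covers a subset of the rules (the combining-caron check is left out). Each tier is tested on the first name, then on the second, in the same order as the original function.

```python
# Tiers of markers that signal a damaged or less desirable form,
# checked in priority order. This is a subset of the rules in name2prefer.
BAD_TIERS = [
    ('??', '#xbf;'),          # encoding garbage
    ('from old catalog',),    # cataloguing note embedded in the name
    ('?_',),
    ('1',),                   # a stray digit, probably part of a date
    ('  ',),                  # double space
]

# Markers that make a form more desirable (accented characters).
GOOD_MARKERS = ('è', 'é')

def has_any(name, markers):
    '''True if any of the marker substrings occurs in name.'''
    return any(marker in name for marker in markers)

def prefer_name(option1, option2):
    '''Return the preferred one of two equivalent author names.'''
    for tier in BAD_TIERS:
        if has_any(option1, tier):
            return option2
        if has_any(option2, tier):
            return option1
    if has_any(option1, GOOD_MARKERS):
        return option1
    if has_any(option2, GOOD_MARKERS):
        return option2
    # Otherwise the longer form wins; ties go to the second name.
    return option1 if len(option1) > len(option2) else option2

print(prefer_name('Pearce, Donn', 'Pearce, Donald'))
```

Keeping the markers in a list makes it easier to see, and to reorder, which rule outranks which.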


#### Now actually make the change

We're going to print out the names as we change them so I can spot-check the results and make sure I'm not doing something downright awful.

In [14]:
firstset = set(authsets.name1)
secondset = set(authsets.name2)
ctr = 0

def applynorm(name):
    global firstset, secondset, authsets, ctr
    
    if name in firstset:
        newname = str(authsets.loc[authsets.name1 == name, 'normative'].values[0])
        print(newname)
        ctr += 1
        return newname
    elif name in secondset:
        newname = str(authsets.loc[authsets.name2 == name, 'normative'].values[0])
        print(newname)
        ctr += 1
        return newname
    else:
        return name
    
work = work.assign(author = work.author.map(applynorm))  
print(ctr)

Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Reynolds, George W. M. (George William MacArthur)
Maurois, André
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Bal

Yonge, Charlotte M. (Charlotte Mary)
Child, Lydia Maria Francis
Griffin, Gerald
Hammond, S. H. (Samuel H.)
Dalton, James
Souvestre, Émile
Griffin, Gerald
Riley, H. H. (Henry Hiram)
Reynolds, George W. M. (George William MacArthur)
Yonge, Charlotte M. (Charlotte Mary)
Yonge, Charlotte M. (Charlotte Mary)
Reynolds, George W. M. (George William MacArthur)
Reynolds, George W. M. (George William MacArthur)
Griffin, Gerald
Richards, Thomas Addison
freifrau von, Tautphœus, Jemima (Montgomery)
freifrau von, Tautphœus, Jemima (Montgomery)
Reynolds, George W. M. (George William MacArthur)
Borrow, George Henry
Borrow, George Henry
Fullerton, Georgiana Charlotte Seveson-Gower, Lady
Fullerton, Georgiana Charlotte Seveson-Gower, Lady
Lytton, Rosina Bulwer Lytton, Baroness
Lytton, Rosina Bulwer Lytton, Baroness
Lytton, Rosina Bulwer Lytton, Baroness
Jerrold, Douglas William
Peterson, Charles J. (Charles Jacobs)
Yonge, Charlotte M. (Charlotte Mary)
Marryat, Florence R. M. Church Lean
Souvestre, Émile


Yonge, Charlotte M. (Charlotte Mary)
Yonge, Charlotte M. (Charlotte Mary)
Roe, E. R. (Edward Reynolds)
Laboulaye, Edouard
Reynolds, George W. M. (George William MacArthur)
Hale, Edward Everett, Sr
Miller, J. (Joaquin)
Landon, Melville D. (Melville De Lancey)
Turgenev, Ivan Sergi︠e︡evích
Reynolds, George W. M. (George William MacArthur)
Topelius, Zacharias
Thompson, Maurice
Aldrich, Thomas Bailey, Mrs
Turgenev, Ivan Sergi︠e︡evích
Yonge, Charlotte M. (Charlotte Mary)
Yonge, Charlotte M. (Charlotte Mary)
Yonge, Charlotte M. (Charlotte Mary)
Balzac, Honoré de
Hale, Edward Everett, Sr
Roe, Edward Payson
Yonge, Charlotte M. (Charlotte Mary)
Yonge, Charlotte M. (Charlotte Mary)
Miller, J. (Joaquin)
Yonge, Charlotte M. (Charlotte Mary)
Yonge, Charlotte M. (Charlotte Mary)
Wingfield, Lewis Strange
Wingfield, Lewis Strange
Wingfield, Lewis Strange
Erckmann, Emile
Yonge, Charlotte M. (Charlotte Mary)
Hale, Edward Everett, Sr
Roe, Edward Payson
Yonge, Charlotte M. (Charlotte Mary)
Miller, J. (Joaq

Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Hale, Edward Everett, Sr
Hale, Edward Everett, Sr
Marryat, Florence R. M. Church Lean
Turgenev, Ivan Sergi︠e︡evích
Turgenev, Ivan Sergi︠e︡evích
Trevor-Battye, Aubyn [Bernard Rochfort]
Townsend, Edward W. (Edward Waterman)
Baring-Gould, S. (Sabine)
Hepworth, George H. (George Hughes)
Marmontel, Jean François
Baring-Gould, S. (Sabine)
Roberts, Charles George Douglas, Sir
Yonge, Charlotte M. (Charlotte Mary)
Tennyson, Alfred Tennyson, Baron
Tennyson, Alfred Tennyson, Baron
Tennyson, Alfred Tennyson, Baron
Tennyson, Alfred Tennyson, Baron
Turgenev, Ivan Sergi︠e︡evích
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Landon, Melville D. (Melville De Lancey)
Gras, Félix
Balzac, Honoré de

Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Yonge, Charlotte M. (Charlotte Mary)
Turgenev, Ivan Sergi︠e︡evích
Balzac, Honoré de
Balzac, Honoré de
Jackson, Gabrielle E. (Gabrielle Emilie)
Marchmont, Arthur W. (Arthur Williams)
Thompson, Maurice
Bell, Lillian Lida
Woods, Margaret L. (Margaret Louisa)
Baring-Gould, S. (Sabine)
Palacio Valdés, Armando
Palacio Valdés, Armando
Hewlett, Maurice Henry
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Lagerlöf, Selma
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Balzac, Honoré de
Bell, Lillian Lida
Levett-Yeats, S. (Sidney Kilner)
Balzac, Honoré de
Balzac, Honoré de
Zitkala-S̈a
Palacio Valdés, Armando
Topelius, Zacharias
Baring-Gould, S. (Sabine)
Borrow, Geo

Snaith, J. C. (John Collis)
Copping, Arthur E. (Arthur Edward)
Orczy, Emmuska Orczy, Baroness
Jackson, Gabrielle E. (Gabrielle Emilie)
Hamilton, Cicely Mary
Makower, Stanley V
Benson, Robert Hugh, (Spirit)
Orczy, Emmuska Orczy, Baroness
Kirkman, Marshall M. (Marshall Monroe)
Hewlett, Maurice Henry
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Shoemaker, Henry W. (Henry Wharton)
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Meredith, George
Me

Macmillan, Cyrus
Snaith, J. C. (John Collis)
Grow, Malcolm Cummings
Lincoln, Natalie Sumner
Lincoln, Natalie Sumner
Rives, Amélie
Olcott, Frances Jenkins
Roger, Noëlle
Snaith, J. C. (John Collis)
Roberts, Charles George Douglas, Sir
Orczy, Emmuska Orczy, Baroness
Hamilton, Cicely Mary
Somerville, E. Œ. (Edith Œnone)
Bacheller, Irving Addison
Safroni-Middleton, A (Arnold)
Safroni-Middleton, A (Arnold)
Snaith, J. C. (John Collis)
Mathews, Basil Joseph
Olcott, Frances Jenkins
Orczy, Emmuska Orczy, Baroness
Roberts, Charles George Douglas, Sir
Robinson, Eliot H. (Eliot Harlow)
Wood, Michael
Maurois, André
Hewlett, Maurice Henry
Roberts, Charles George Douglas, Sir
Serao, Mathilde
Shoemaker, Henry W. (Henry Wharton)
Bacheller, Irving Addison
Aldrich, Thomas Bailey, Mrs
Hewlett, Maurice Henry
Wassermann, Jakob
Wassermann, Jakob
Przybyszewski, Stanisław
Turgenev, Ivan Sergi︠e︡evích
Turgenev, Ivan Sergi︠e︡evích
Stacpoole, Margaret (Robson), Mrs
Heydrick, Benjamin A. (Benjamin Alexander)
Anders

Stivens, Dallas
Guthrie, A. B., Jr. (Alfred Bertram)
Falkner, William C. (William Clark)
O'Brien, Fitz James
Turgenev, Ivan Sergi︠e︡evích
Blais, Marie-Claire
Richberg, Donald R. (Donald Randall)
Vignant, Jean François
Mphahlele, Ezekiel
Bulgakov, Mikhail Afanasʹevich
Petry, Ann Lane
Houston, James A
Peterson, Charles J. (Charles Jacobs)
West, Morris L
Tanizaki, Junʼichirō
Hardy, Frank Joseph
Stegner, Wallace Earle
Atwood, Margaret Eleanor
Van der Post, Laurons
Bulgakov, Mikhail Afanasʹevich
Ihimaera, Witi Tame
Blais, Marie-Claire
Elliott, Sumner Locke
Guthrie, A. B., Jr. (Alfred Bertram)
Erdman, Paul Emil
Ihimaera, Witi Tame
West, Morris L
Clutesi, George C
Blais, Marie-Claire
Roueché, Berton
Van der Post, Laurons
Turgenev, Ivan Sergi︠e︡evích
Blais, Marie-Claire
Erdman, Paul Emil
Turgenev, Ivan Sergi︠e︡evích
Stewart, George Rippey
Stewart, George Rippey
Bulgakov, Mikhail Afanasʹevich
Long, Catharine, Lady
Elliott, Sumner Locke
Fullerton, Georgiana Charlotte Seveson-Gower, Lady
Child, L

That change was made in 2899 rows.
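
`applynorm` filters the whole `authsets` frame once for every matching row in `work`. A possible alternative, sketched here with hypothetical helper names (`build_lookup`, `normalize`) and plain tuples standing in for the rows of `authsets`, is to build a dictionary once and look each name up in it. As in `applynorm`, a match on `name1` takes precedence over a match on `name2`, and the first matching pair wins.

```python
def build_lookup(pairs):
    '''Map every variant name to its normative form.

    pairs is an iterable of (name1, name2, normative) tuples.
    '''
    pairs = list(pairs)
    lookup = {}
    # name1 entries go in first so they take precedence over name2 entries;
    # setdefault keeps the first pair seen for any given name.
    for name1, _, normative in pairs:
        lookup.setdefault(name1, normative)
    for _, name2, normative in pairs:
        lookup.setdefault(name2, normative)
    return lookup

def normalize(names, lookup):
    '''Return the normalized names and the number of names found in lookup.'''
    normalized = []
    matched = 0
    for name in names:
        if name in lookup:
            matched += 1
        normalized.append(lookup.get(name, name))
    return normalized, matched

# Toy data: two pairs taken from the table above.
pairs = [
    ('Howes, Edith Annie', 'Howes, Edith', 'Howes, Edith Annie'),
    ('Pearce, Donn', 'Pearce, Donald', 'Pearce, Donald'),
]
lookup = build_lookup(pairs)
names, matched = normalize(['Howes, Edith', 'Pearce, Donn', 'Austen, Jane'], lookup)
print(names, matched)
```

In the notebook itself, the lookup could presumably be built with `build_lookup(authsets.itertuples(index = False, name = None))` and applied with `work.author.map(lambda n: lookup.get(n, n))`, though that has not been run against the real data here.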

### Update last date of composition

We have a column **latestcomp** that is supposed to contain an inference about the last possible date of composition, based on certain values of **datetype** and **enddate**, plus authors' death dates.

But the dates of death have been updated since **latestcomp** was created, so we may be able to improve that column a little before using it in the next stage of dedup.

In [26]:
ctr = 0

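def update(row):
    global ctr

    lastcomp = int(row.latestcomp)

    if pd.isnull(row.author) or row.author == 'nan':
        return float('nan')

    # Check for a missing value before converting to a string;
    # str(NaN) is 'nan', which pd.isnull would no longer catch.
    if pd.isnull(row.authordate):
        return lastcomp

    authordates = str(row.authordate).strip('.] ')

    # Strings long enough to hold a range such as '1810-1865'
    # are assumed to end in a four-digit death year.
    if authordates.startswith('.d') or len(authordates) > 7:
        try:
            death = int(authordates[-4 : ])
        except ValueError:
            death = 0
    else:
        death = 0

    # Only accept plausible death years that precede the current estimate.
    if (death > 1600 and death < 2010) and death < lastcomp:
        print(lastcomp, death, row.author, row.authordate)
        ctr += 1
        return death
    else:
        return lastcomp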

work = work.assign(updated = work.apply(update, axis = 1))
print(ctr)

2100 1967 McKenna, Stephen 1888-1967.
2100 1955 Sharp, Evelyn 1869-1955.
2100 2009 Middleton, Stanley 1919-2009.
1703 1691 Boyle, Robert 1627-1691.
1727 1660 Monsieur, Scarron 1610-1660.
1727 1660 Monsieur, Scarron 1610-1660.
1728 1616 Cervantes Saavedra, Miguel de 1547-1616.
1740 1616 Cervantes Saavedra, Miguel de 1547-1616.
1740 1616 Cervantes Saavedra, Miguel de 1547-1616.
1747 1616 Cervantes Saavedra, Miguel de 1547-1616.
1747 1616 Cervantes Saavedra, Miguel de 1547-1616.
1747 1616 Cervantes Saavedra, Miguel de 1547-1616.
1747 1616 Cervantes Saavedra, Miguel de 1547-1616.
1749 1616 Cervantes Saavedra, Miguel de 1547-1616.
1749 1616 Cervantes Saavedra, Miguel de 1547-1616.
1752 1740 W. G. (William Goodall) fl. 1740.
1752 1740 W. G. (William Goodall) fl. 1740.
1752 1740 W. G. (William Goodall) fl. 1740.
1755 1616 Cervantes Saavedra, Miguel de 1547-1616.
1755 1616 Cervantes Saavedra, Miguel de 1547-1616.
1759 1689 Behn, Aphra 1640-1689.
1759 1689 Behn, Aphra 1640-1689.
1764 1763 Shens

1856 1848 Marryat, Frederick 1792-1848.
1856 1854 Judson, Emily C. (Emily Chubbuck) 1817-1854.
1857 1840 Griffin, Gerald 1803-1840.
1857 1754 Fielding, Henry 1707-1754. 
1857 1754 Fielding, Henry 1707-1754. 
1857 1856 Hubbell, Martha Stone 1814-1856.
1857 1854 Souvestre, Émile 1806-1854.
1857 1833 More, Hannah 1745-1833.
1857 1832 Scott, Walter, Sir 1771-1832.
1857 1832 Scott, Walter, Sir 1771-1832.
1857 1835 Wilson, John Mackay 1804-1835.
1857 1835 Wilson, John Mackay 1804-1835.
1857 1835 Wilson, John Mackay 1804-1835.
1857 1835 Wilson, John Mackay 1804-1835.
1857 1832 Scott, Walter, Sir 1771-1832.
1857 1832 Scott, Walter, Sir 1771-1832.
1857 1848 Marryat, Frederick 1792-1848.
1858 1616 Cervantes Saavedra, Miguel de 1547-1616.
1858 1827 Hauff, Wilhelm 1802-1827.
1858 1633 Herbert, George 1593-1633.
1858 1633 Herbert, George 1593-1633.
1858 1633 Herbert, George 1593-1633.
1858 1857 Jerrold, Douglas William 1803-1857.
1859 1783 Brooke, Henry 1703?-1783.
1859 1783 Brooke, Henry 1703?-178

1884 1883 Turgenev, Ivan Sergi︠e︡evích 1818-1883.
1884 1880 Fleming, May Agnes 1840-1880.
1884 1880 Fleming, May Agnes 1840-1880.
1884 1856 Heine, Heinrich 1797-1856.
1884 1880 Kingston, William Henry Giles 1814-1880.
1884 1876 Lawrence, George Alfred 1827-1876.
1884 1882 Michener, Frances Lavinia 1866-1882.
1884 1864 Hawthorne, Nathaniel 1804-1864.
1885 1877 Quincy, Edmund 1808-1877.
1885 1856 Heine, Heinrich 1797-1856.
1885 1882 Longfellow, Henry Wadsworth 1807-1882.
1885 1832 Scott, Walter, Sir 1771-1832.
1885 1834 Lamb, Charles 1775-1834.
1885 1883 Reid, Mayne 1818-1883.
1885 1863 Thackeray, William Makepeace 1811-1863.
1885 1883 Reid, Mayne 1818-1883.
1886 1859 Irving, Washington 1783-1859.
1886 1811 Moore, George fl. 1797-1811.
1886 1884 Brame, Charlotte M 1836-1884.
1886 1719 Addison, Joseph 1672-1719.
1886 1834 Lamb, Charles 1775-1834.
1886 1880 Fleming, May Agnes 1840-1880.
1886 1882 Longfellow, Henry Wadsworth 1807-1882.
1886 1885 Ewing, Juliana Horatia Gatty 1841-1885.
1886 

1896 1616 Cervantes Saavedra, Miguel de 1547-1616.
1896 1616 Cervantes Saavedra, Miguel de 1547-1616.
1896 1616 Cervantes Saavedra, Miguel de 1547-1616.
1896 1616 Cervantes Saavedra, Miguel de 1547-1616.
1896 1881 Carlyle, Thomas 1795-1881.
1896 1869 Carleton, William 1794-1869.
1896 1887 Cobb, Sylvanus 1823-1887.
1896 1895 Field, Eugene 1850-1895.
1896 1892 Renan, Ernest 1823-1892.
1896 1867 Bulfinch, Thomas 1796-1867.
1896 1887 Cobb, Sylvanus 1823-1887.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1851 Cooper, James Fenimore 1789-1851.
1896 1895 Field, Eugene 1850-1895.
1896 1894 Stevenson, Robert Louis 1850-1894.
1896 1887 Jefferies, Richard 1848-1887.
1896 1695 La Fontaine, Jean de 1621-1695.
1896 1695 La Fontaine, Jean de 1621-1695.
1896 1891 Adams, W. H. Daven

1901 1881 Borrow, George Henry 1803-1881.
1901 1863 Thackeray, William Makepeace 1811-1863.
1901 1859 Irving, Washington 1783-1859.
1901 1811 Moore, George fl. 1797-1811.
1901 1811 Moore, George fl. 1797-1811.
1901 1616 Cervantes Saavedra, Miguel de 1547-1616.
1901 1899 Alger, Horatio 1832-1899.
1902 1754 Fielding, Henry 1707-1754. 
1902 1849 Poe, Edgar Allan 1809-1849.
1902 1625 Bacci, Pietro Giacomo fl. 1625.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas, Alexandre 1802-1870.
1902 1870 Dumas

1904 1849 Poe, Edgar Allan 1809-1849.
1904 1890 Collodi, Carlo 1826-1890.
1904 1898 Carroll, Lewis 1832-1898.
1905 1903 Gissing, George 1857-1903.
1905 1689 Behn, Aphra 1640-1689.
1905 1811 Moore, George fl. 1797-1811.
1905 1903 Gissing, George 1857-1903.
1905 1900 Robinson, Rowland Evans 1833-1900.
1905 1881 Borrow, George Henry 1803-1881.
1905 1896 Goncourt, Edmond de 1822-1896.
1905 1883 Laboulaye, Edouard 1811-1883.
1905 1899 Alger, Horatio 1832-1899.
1905 1850 Bernard, Charles de 1804-1850.
1905 1890 Feuillet, Octave 1821-1890.
1905 1903 Waltz, Elizabeth Cherry 1866-1903.
1905 1857 Musset, Alfred de 1810-1857.
1905 1895 Droz, Gustave 1832-1895.
1905 1861 Murger, Henri 1822-1861.
1905 1899 Alger, Horatio 1832-1899.
1905 1902 Norris, Frank 1870-1902.
1905 1870 Dickens, Charles 1812-1870.
1905 1899 Alger, Horatio 1832-1899.
1906 1897 George, Henry 1839-1897.
1906 1817 Austen, Jane 1775-1817.
1906 1817 Austen, Jane 1775-1817.
1906 1894 Stevenson, Robert Louis 1850-1894.
1906 1896 Dase

1914 1902 Norris, Frank 1870-1902.
1914 1904 Chekhov, Anton Pavlovich 1860-1904.
1914 1909 Finley, Martha 1828-1909.
1914 1896 Goncourt, Edmond de 1822-1896.
1914 1904 Hearn, Lafcadio 1850-1904.
1914 1837 Pushkin, Aleksandr Sergeevich 1799-1837.
1914 1912 Strindberg, August 1849-1912.
1914 1894 Stevenson, Robert Louis 1850-1894.
1914 1913 Warner, Anne 1869-1913.
1914 1883 Turgenev, Ivan Sergi︠e︡evích 1818-1883.
1914 1903 Merriman, Henry Seton 1862-1903.
1914 1912 Strindberg, August 1849-1912.
1914 1913 Janvier, Thomas A. (Thomas Allibone) 1849-1913.
1914 1688 Bunyan, John 1628-1688.
1914 1768 Sterne, Laurence 1713-1768.
1915 1908 Steinberg, Judah 1863-1908.
1913 1873 Gaboriau, Emile 1832-1873.
1915 1689 Behn, Aphra 1640-1689.
1915 1905 Dodge, Mary Mapes 1830-1905.
1915 1902 Stockton, Frank R. (Frank Richard) 1834-1902.
1915 1895 Boyesen, Hjalmar Hjorth 1848-1895.
1915 1908 De Amicis, Edmondo 1846-1908.
1915 1902 Harte, Bret 1836-1902.
1915 1902 Henty, G. A. (George Alfred) 1832-1902.
1

1943 1934 White, Edward Lucas 1866-1934.
1943 1932 Sousa, John Philip 1854-1932.
1944 1939 Riesenberg, Felix 1879-1939.
1946 1932 Sousa, John Philip 1854-1932.
1947 1646 Feng, Menglong 1574-1646.
1947 1688 Bunyan, John 1628-1688.
1948 1646 Feng, Menglong 1574-1646.
1949 1920 Bonnet, Theodore 1865-1920.
1950 1616 Cervantes Saavedra, Miguel de 1547-1616.
1952 1616 Cervantes Saavedra, Miguel de 1547-1616.
1953 1852 Fisher, William 1780-1852.
1953 1941 Walter, Eugene 1874-1941.
1956 1925 Baldwin, James 1841-1925.
1956 1677 Spinoza, Benedictus de 1632-1677.
1957 1931 Nielsen, Carl 1865-1931.
1958 1661 Chaloner, Thomas 1595-1661.
1958 1919 Fox, John 1863-1919.
1958 1957 Tilsley, Frank 1904-1957.
1958 1609 Croce, Giulio Cesare 1550-1609.
1958 1646 Feng, Menglong 1574-1646.
1959 1796 Burns, Robert 1759-1796.
1960 1924 Grant, Douglas 1883-1924.
1960 1616 Cervantes Saavedra, Miguel de 1547-1616.
1960 1906 Garnett, Richard 1835-1906.
1961 1616 Cervantes Saavedra, Miguel de 1547-1616.
1962 1916 Wi

2001 1942 Said, Kurban 1905-1942.
2002 1871 Kennedy, William 1799-1871.
2002 1918 Campbell, Helen 1839-1918.
2003 1966 Kelly, John 1913-1966.
2003 1974 Clarke, Austin 1896-1974.
2003 1941 Gordon, Neil 1895-1941.
2003 1960 Elsschot, Willem 1882-1960.
2003 1815 Murray, John 1741-1815.
2003 1932 Cobb, Thomas 1854-1932.
2004 1981 Krleža, Miroslav 1893-1981.
2004 1815 Murray, John 1741-1815.
2004 1981 Green, Paul 1894-1981.
2004 1835 Scott, Michael 1789-1835.
2004 1732 Gay, John 1685-1732.
2005 1856 Palmer, William 1824-1856.
2005 2000 Hwang, Sun-wŏn 1915-2000.
2005 1940 Simpson, Helen 1897-1940.
2005 1616 Cervantes Saavedra, Miguel de 1547-1616.
2005 1997 Pritchett, V. S. (Victor Sawdon) 1900-1997.
2005 1956 Anderson, Paul 1880-1956.
2005 1920 Reed, John 1887-1920.
2006 1815 Murray, John 1741-1815.
2006 1693 La Fayette, (Marie-Madeleine Pioche de La Vergne), Madame de 1634-1693.
2006 1884 Carroll, John 1809-1884.
2006 1862 White, James 1803-1862.
2007 1646 Feng, Menglong 1574-1646.
1933 18

In [27]:
work['latestcomp'] = work['updated']
work.drop(labels = ['updated'], axis = 1, inplace = True)
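
To make the date logic easier to check by hand, here is a simplified sketch of the same idea as two small functions that take plain values instead of a DataFrame row. The names `death_year` and `updated_latestcomp` are hypothetical, and the sketch keeps only the length check from `update`, not the `startswith` branch.

```python
import math

def death_year(authordate):
    '''Return a plausible death year parsed from an authordate string, or None.'''
    if authordate is None:
        return None
    if isinstance(authordate, float) and math.isnan(authordate):
        return None
    cleaned = str(authordate).strip('.] ')
    # Short strings such as '1919-' hold at most a birth year.
    if len(cleaned) <= 7:
        return None
    try:
        year = int(cleaned[-4:])
    except ValueError:
        return None
    # Same plausibility bounds as in update().
    return year if 1600 < year < 2010 else None

def updated_latestcomp(latestcomp, authordate):
    '''Lower latestcomp to the author's death year when that is earlier.'''
    death = death_year(authordate)
    if death is not None and death < latestcomp:
        return death
    return latestcomp

print(updated_latestcomp(2100, '1888-1967.'))
print(updated_latestcomp(1848, '1810-1865.'))
```

Note that a floruit date such as `fl. 1740.` passes the length check and is therefore treated like a death year, which matches the rows printed above.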

In [31]:
def whethertokeep(row):
    infdate = int(row.inferreddate)
    lastcomp = row.latestcomp
    if pd.isnull(lastcomp):
        return True
    else:
        lastcomp = int(lastcomp)
    
    if infdate > (lastcomp + 25):
        return False
    else:
        return True

work = work.assign(contemporary = work.apply(whethertokeep, axis = 1))
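
The rule in `whethertokeep` is that a volume counts as contemporary unless its inferred date is more than 25 years later than the latest possible date of composition; rows with no `latestcomp` are kept. A minimal standalone sketch of that rule, with a hypothetical function name and the 25-year window exposed as a parameter:

```python
import math

def is_contemporary(inferred_date, latest_comp, window = 25):
    '''True unless inferred_date is more than window years after latest_comp.'''
    # A missing latest_comp gives us no grounds for exclusion.
    if latest_comp is None or math.isnan(latest_comp):
        return True
    return int(inferred_date) <= int(latest_comp) + window

print(is_contemporary(1848, 1848.0))   # a volume dated to the composition year
print(is_contemporary(1896, 1616.0))   # a late reprint of a much older work
```

On the full table, the same condition could probably be computed without `apply`, for example as `work.latestcomp.isnull() | (work.inferreddate.astype(int) <= work.latestcomp + 25)`; treat that as an untested suggestion.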

In [30]:
work.loc[work.author.str.startswith('Gaskell, E', na = False), : ]

Unnamed: 0,docid,oldauthor,author,authordate,inferreddate,latestcomp,datetype,startdate,enddate,imprint,...,recordid,instances,allcopiesofwork,copiesin25yrs,enumcron,volnum,title,parttitle,shorttitle,tokeep
249,nyp.33433074857115,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,0,1865.0,n,uuuu,uuuu,New York|Dodd|n.d.,...,8665158,1,1,1,,,"Cranford, | with a memoir of the author; | $c:...",,"Cranford, with a memoir of the author;",True
8947,uiuo.ark+=13960=t9z03hc07,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1848,1848.0,s,1848,,London;Chapman and Hall;1848.,...,1419882,1,8,1,v.1,1.0,Mary Barton,,Mary Barton,True
10281,njp.32101019691409,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1853,1853.0,s,1853,,New York;Harper & brothers;1853.,...,1419868,1,35,2,,,"Cranford / | $c: by the author of ""Mary Barton...",,Cranford,True
10426,nyp.33433074857149,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1853,1853.0,s,1853,,Leipzig;B. Tauchnitz;1853.,...,8665214,1,6,6,v. 1,1.0,"Ruth : | a novel / | $c: by the author of ""Mar...",,Ruth : a novel,True
10427,nyp.33433074857131,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1853,1853.0,s,1853,,Leipzig;B. Tauchnitz;1853.,...,8665214,1,6,6,v. 2,2.0,"Ruth : | a novel / | $c: by the author of ""Mar...",,Ruth : a novel,True
11082,nyp.33433074857362,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1855,1855.0,s,1855,,New York;Harper & Bros.;1855.,...,8668329,1,13,4,,,"North and south, | $c: by the author of Mary B...",,North and south,True
12045,nyp.33433082340302,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1858,1858.0,s,1858,,New York;Appleton;1858.,...,8637302,1,1,1,v. 1-2,1.0,The life of Charlotte Brontë / | $c: by E. C....,,The life of Charlotte Brontë,True
12064,nyp.33433074857156,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1858,1858.0,s,1858,,New York;Harper;1858.,...,8665192,1,2,2,,,My Lady Ludlow : | a novel / | $c: by Mrs. Gas...,,My Lady Ludlow : a novel,True
12236,inu.30000007412517,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1859,1859.0,s,1859,,"London;S. Low, son & co.;1859.",...,6059946,1,3,1,v.1,1.0,"Round the sofa. | $c: By the author of ""Mary B...",,Round the sofa,True
12364,uc1.b3322505,"Gaskell, Elizabeth Cleghorn","Gaskell, Elizabeth Cleghorn",1810-1865.,1859,1859.0,s,1859,,London;S. Low;1859.,...,7915307,1,1,1,v.1,1.0,"Round the sofa / | $c: by the author of ""Mary ...",Round the sofa. My Lady Ludlow,Round the sofa. My Lady Ludlow,True


In [32]:
work.drop(labels = ['tokeep'], axis = 1, inplace = True)
sum(work.contemporary)

129249

In [34]:
work.to_csv('../workmeta.tsv', sep = '\t', index = False)