# Make gender-balanced subset

For most of the two centuries this project covers, more works of fiction were published by men than by women. This dataset experimentally levels that imbalance by randomly discarding works by men within each five-year period until the two counts match. In a few periods (1810-14, 1865-69, and 2005-09 in this sample), works by women outnumber works by men, and we reverse the strategy.

This approach means that coverage will be uneven across time; we'll have fewer works overall in periods where gender imbalance was severe.
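
To make the unevenness concrete, here is a minimal sketch of the arithmetic. The helper name `gendered_kept` is hypothetical (it is not part of the notebook's code); the example counts are taken from the per-period output printed further down.

```python
def gendered_kept(numwomen, nummen):
    """Works by women and men that survive balancing in one period.

    The larger group is downsampled to the size of the smaller one,
    so twice the smaller count remains. (Hypothetical helper, for
    illustration only.)
    """
    return 2 * min(numwomen, nummen)

# (women, men) counts for two five-year periods, from the notebook output
print(gendered_kept(29, 38))  # 1950-54, mild imbalance: 58 of 67 kept
print(gendered_kept(9, 46))   # 1965-69, severe imbalance: 18 of 55 kept
```

The second period ends up with less than a third as many works as the first, even though the two start from roughly comparable totals.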

Given unlimited time, we could go back and select a supplement to fill in those dips. But I don't know for sure that this dataset will be used by anyone outside the current project; it seems unwise to be a perfectionist before knowing the size of the audience.

The concept of "gender balance" implies a binaristic simplification of a more complex reality, but I want to avoid simplifying more than necessary. So I'm not going to discard works where gender is marked as 'u' or 'o.' Instead I'll downsample those works by the same fraction as the works marked 'f' or 'm,' so that they keep roughly the same share of each period.

Mostly this is going to affect collections, anonymous works, and works with multiple authors.
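
As a rough sketch of what "in proportion" means here: the 'u' and 'o' works in each period are cut by the same fraction as the 'f' and 'm' works combined. The function name `others_kept` is hypothetical and only for illustration; it mirrors the arithmetic in the main cell below, and it assumes the period contains at least one work marked 'f' or 'm' (otherwise the division would fail).

```python
def others_kept(numwomen, nummen, numothers):
    """Number of 'u'/'o' works to keep in one period (illustrative sketch)."""
    kept = 2 * min(numwomen, nummen)         # 'f' and 'm' works left after balancing
    reduction = kept / (numwomen + nummen)   # fraction of 'f' and 'm' works retained
    return int(numothers * reduction)        # int() truncates, so this rounds down

# 1800-04 in the notebook output: 26 women, 32 men, 12 others
print(others_kept(26, 32, 12))  # 52/58 of 12 is about 10.8, so 10 are kept
```

Because the result is truncated rather than rounded, the 'u' and 'o' works are reduced slightly more than strictly proportional in most periods.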

In [1]:
import pandas as pd

In [2]:
title = pd.read_csv('manual_title_subset.tsv', sep = '\t', index_col = 'docid')

In [5]:
numwomen = sum(title.gender == 'f')
nummen = sum(title.gender == 'm')
numother = len(title) - (numwomen + nummen)
print("START: ", numwomen, nummen, numother)
print()

frames = []

# Walk the timeline in five-year bins: 1800-04, 1805-09, ..., 2005-09.
for floor in range(1800, 2010, 5):
    fiveyrs = title.loc[(title.firstpub >= floor) & (title.firstpub < floor + 5), :]
    
    numwomen = sum(fiveyrs.gender == 'f')
    nummen = sum(fiveyrs.gender == 'm')
    numother = len(fiveyrs) - (numwomen + nummen)
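    # Flag the bins where works by women outnumber works by men.
    if numwomen > nummen:
        asterisk = "*"
    else:
        asterisk = ''

    # Columns: bin start, women, men, other, '*' if women outnumber men.
    print(floor, numwomen, nummen, numother, asterisk)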
    
    women = fiveyrs.loc[fiveyrs.gender == 'f', :]
    men = fiveyrs.loc[fiveyrs.gender == 'm', :]
    others = fiveyrs.loc[(fiveyrs.gender == 'u') | (fiveyrs.gender == 'o'), :]
    
    # Downsample the larger group to the size of the smaller one.
    # Sampling is unseeded, so the rows chosen differ from run to run.
    if numwomen < nummen:
        men = men.sample(n = numwomen)
    elif nummen < numwomen:
        women = women.sample(n = nummen)
    
    # Shrink the 'u'/'o' works by the same fraction as the 'f' and 'm' works.
    reduction = (len(men) + len(women)) / (nummen + numwomen)
    numothers = len(others)
    reduced = int(numothers * reduction)
    others = others.sample(n = reduced)
    
    newframe = pd.concat([women, men, others], sort = False)
    frames.append(newframe)
    
print()
newtitle = pd.concat(frames)
print(newtitle.shape)
numwomen = sum(newtitle.gender == 'f')
nummen = sum(newtitle.gender == 'm')
numother = len(newtitle) - (numwomen + nummen)
print("END: ", numwomen, nummen, numother)

START:  810 1526 394

1800 26 32 12 
1805 21 29 11 
1810 36 21 8 *
1815 25 29 12 
1820 14 26 18 
1825 16 33 13 
1830 16 38 12 
1835 20 44 13 
1840 17 36 9 
1845 11 30 19 
1850 15 38 14 
1855 23 35 9 
1860 26 41 6 
1865 33 28 5 *
1870 21 32 11 
1875 22 34 5 
1880 21 29 7 
1885 19 30 10 
1890 26 37 4 
1895 19 37 4 
1900 13 41 10 
1905 15 42 3 
1910 15 47 5 
1915 23 37 6 
1920 15 35 7 
1925 16 40 4 
1930 17 42 6 
1935 15 49 9 
1940 19 35 10 
1945 23 34 4 
1950 29 38 6 
1955 12 38 7 
1960 17 39 6 
1965 9 46 9 
1970 14 37 8 
1975 10 36 14 
1980 16 38 11 
1985 17 44 7 
1990 18 29 17 
1995 20 29 7 
2000 14 41 5 
2005 29 26 7 *

(1787, 18)
END:  780 780 227


In [6]:
newtitle.to_csv('gender_balanced_subset.tsv', sep = '\t', index_label = 'docid')