# Make gender-balanced subset

For most of the two centuries this project covers, more works of fiction were published by men than by women. This dataset experimentally levels that imbalance by randomly discarding works by men within each five-year period. In a few periods (those starting in 1810, 1865, and 2005 in the output below), our sample contains more works by women, and there we reverse the strategy and discard works by women.

This approach means that coverage will be uneven across time; we'll have fewer works overall in periods where gender imbalance was severe.

Given unlimited time, we could go back and select a supplement to fill in those dips. But I don't know for sure that this dataset will be used by anyone outside the current project; it seems unwise to be a perfectionist before knowing the size of the audience.

The concept of "gender balance" implies a binaristic simplification of a more complex reality, but I want to avoid simplifying more than necessary. So I'm not going to discard works where gender is marked as 'u' or 'o'. Instead I'll downsample those works by the same fraction that the combined set of works by men and women shrinks in each five-year period.

Mostly this is going to affect collections, anonymous works, and works with multiple authors.
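
To make the proportional downsampling concrete, here is a minimal sketch of the arithmetic for a single five-year period. The function name `kept_counts` is hypothetical, introduced only for illustration, and the guard against an empty period is an added assumption. The input counts are taken from the 1800 row of the output further down.

```python
def kept_counts(numwomen, nummen, numothers):
    """Return (women, men, others) kept after balancing one period.

    Hypothetical helper for illustration only; it mirrors the arithmetic
    of the sampling loop, not the random sampling itself.
    """
    balanced = min(numwomen, nummen)
    total = numwomen + nummen
    # Fraction of the combined men + women pool that survives balancing
    # (the zero guard is an assumption added for this sketch).
    reduction = (2 * balanced) / total if total else 0
    # 'u' and 'o' works shrink by the same fraction, rounded down.
    return balanced, balanced, int(numothers * reduction)

# 1800-1804: 20 works by women, 27 by men, 7 marked 'u' or 'o'
print(kept_counts(20, 27, 7))
```

With these inputs, 40 of the 47 works by men and women are kept, so the 7 other works are scaled by 40/47 and rounded down to 5.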

In [1]:
import pandas as pd

In [2]:
title = pd.read_csv('manual_title_subset.tsv', sep = '\t', index_col = 'docid')

In [4]:
numwomen = sum(title.gender == 'f')
nummen = sum(title.gender == 'm')
numother = len(title) - (numwomen + nummen)
print("START: ", numwomen, nummen, numother)
print()

frames = []

for floor in range(1800, 2010, 5):
    # Restrict to fiction first published in [floor, floor + 5)
    fiveyrs = title.loc[(title.firstpub >= floor) &
                        (title.firstpub < floor + 5) &
                        (title.category.isin({'longfiction', 'shortfiction'}))
                        , :]

    numwomen = sum(fiveyrs.gender == 'f')
    nummen = sum(fiveyrs.gender == 'm')
    numother = len(fiveyrs) - (numwomen + nummen)

    # Flag periods where works by women outnumber works by men
    if numwomen > nummen:
        asterisk = "*"
    else:
        asterisk = ''

    print(floor, numwomen, nummen, numother, asterisk)

    women = fiveyrs.loc[fiveyrs.gender == 'f', :]
    men = fiveyrs.loc[fiveyrs.gender == 'm', :]
    others = fiveyrs.loc[(fiveyrs.gender == 'u') | (fiveyrs.gender == 'o'), :]

    # Downsample whichever group is larger to the size of the smaller one
    if numwomen < nummen:
        men = men.sample(n = numwomen)
    elif nummen < numwomen:
        women = women.sample(n = nummen)

    # Fraction of works by men and women that survived balancing
    reduction = (len(men) + len(women)) / (nummen + numwomen)
    numothers = len(others)
    reduced = int(numothers * reduction)

    # Shrink the 'u' and 'o' works by the same fraction
    if reduced > 0:
        others = others.sample(n = reduced)
        newframe = pd.concat([women, men, others], sort = False)
    else:
        newframe = pd.concat([women, men], sort = False)

    frames.append(newframe)
    
print()
newtitle = pd.concat(frames)
print(newtitle.shape)
numwomen = sum(newtitle.gender == 'f')
nummen = sum(newtitle.gender == 'm')
numother = len(newtitle) - (numwomen + nummen)
print("END: ", numwomen, nummen, numother)

START:  810 1526 394

1800 20 27 7 
1805 21 25 7 
1810 34 19 7 *
1815 23 28 9 
1820 12 24 11 
1825 13 30 9 
1830 14 32 8 
1835 19 40 8 
1840 15 33 4 
1845 11 23 13 
1850 12 33 8 
1855 19 28 5 
1860 21 36 5 
1865 28 26 3 *
1870 19 28 9 
1875 19 33 3 
1880 19 25 4 
1885 17 30 6 
1890 24 34 0 
1895 18 34 1 
1900 11 37 5 
1905 11 38 1 
1910 11 39 2 
1915 18 31 1 
1920 15 34 2 
1925 14 40 1 
1930 14 39 4 
1935 15 45 2 
1940 17 31 6 
1945 20 32 3 
1950 20 35 3 
1955 8 34 4 
1960 14 36 1 
1965 8 44 5 
1970 13 34 5 
1975 9 34 11 
1980 15 37 8 
1985 16 43 5 
1990 18 29 13 
1995 19 28 5 
2000 14 41 5 
2005 29 25 3 *

(1501, 18)
END:  686 686 129
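
The counts above are deterministic, but which titles are kept is not, because `sample` draws differently on each run. If a repeatable subset is wanted, one option is to pass `random_state` to `DataFrame.sample`. The sketch below uses a toy frame and a hypothetical helper, `balance_period`; the seed value 42 is arbitrary and not something this project uses, and the 'u' and 'o' rows are left out for brevity.

```python
import pandas as pd

def balance_period(frame, seed):
    """Balance one period's rows by gender, repeatably (hypothetical helper)."""
    women = frame.loc[frame.gender == 'f', :]
    men = frame.loc[frame.gender == 'm', :]
    n = min(len(women), len(men))
    # random_state pins the draw, so the same rows are chosen on every run
    women = women.sample(n=n, random_state=seed)
    men = men.sample(n=n, random_state=seed)
    return pd.concat([women, men], sort=False)

# Toy data standing in for one five-year slice of the title table
toy = pd.DataFrame(
    {'gender': ['f', 'f', 'm', 'm', 'm', 'm', 'm']},
    index=pd.Index(['w1', 'w2', 'm1', 'm2', 'm3', 'm4', 'm5'], name='docid'),
)

first = balance_period(toy, seed=42)
second = balance_period(toy, seed=42)
print(first.index.equals(second.index))
```

Without a fixed seed, rerunning the notebook yields a subset of the same size but with different titles.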


In [5]:
newtitle.to_csv('gender_balanced_subset.tsv', sep = '\t', index_label = 'docid')
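
As a quick sanity check on a balanced table, the sketch below bins rows into five-year periods and confirms that the 'f' and 'm' counts match in every bin. It runs on a toy frame rather than the real file; `period_counts` is a hypothetical helper, while the `firstpub` and `gender` column names come from the notebook above.

```python
import pandas as pd

def period_counts(frame):
    """Count rows per five-year bin and gender (hypothetical helper)."""
    # Bins start at multiples of 5, matching range(1800, 2010, 5) in the loop
    floor = frame.firstpub // 5 * 5
    return pd.crosstab(floor, frame.gender)

# Toy data: two balanced periods, 1800-1804 and 1805-1809
toy = pd.DataFrame({
    'firstpub': [1800, 1803, 1804, 1801, 1806, 1809],
    'gender':   ['f',  'm',  'f',  'm',  'f',  'm'],
})

counts = period_counts(toy)
print(counts)
print((counts['f'] == counts['m']).all())
```

The row totals of the same table also show how coverage varies across time, since periods with a severe imbalance keep fewer works.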