# Parsing gendered text

Implement the `parse_gender` function as described on pp. 10-12 of the textbook. Run the function over the three texts indicated below and comment (briefly) on the results.

Starter code is included below. When finished, commit your code and issue a pull request to me.

In [23]:
# Imports
import nltk
import os
from   collections import Counter

# Variables
text_dir = os.path.join('..', 'data', 'texts') # Where are the texts?
texts = [
    'A-Alcott-Little_Women-1868-F.txt', # _Little Women_
    'A-Twain-Huck_Finn-1885-M.txt',     # _Huck Finn_
    'B-Eliot-Middlemarch-1869-F.txt'    # _Middlemarch_
]

In [24]:
# Word lists
MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'
BOTH = 'both'

MALE_WORDS = set([
    'guy','spokesman','chairman',"men's",'men','him',"he's",'his',
    'boy','boyfriend','boyfriends','boys','brother','brothers','dad',
    'dads','dude','father','fathers','fiance','gentleman','gentlemen',
    'god','grandfather','grandpa','grandson','groom','he','himself',
    'husband','husbands','king','male','man','mr','nephew','nephews',
    'priest','prince','son','sons','uncle','uncles','waiter','widower',
    'widowers'
])

FEMALE_WORDS = set([
    'heroine','spokeswoman','chairwoman',"women's",'actress','women',
    "she's",'her','aunt','aunts','bride','daughter','daughters','female',
    'fiancee','girl','girlfriend','girlfriends','girls','goddess',
    'granddaughter','grandma','grandmother','herself','ladies','lady',
    'mom','moms','mother','mothers','mrs','ms','niece','nieces',
    'priestess','princess','queens','she','sister','sisters','waitress',
    'widow','widows','wife','wives','woman'
])

In [34]:
# Your code here ...
'''
You might want to create your own short text sample for use in developing your code.
To be clear, it's fine to copy the textbook code. This exercise is mostly a shakedown to
check that your environment is working and that the GitHub Classroom submission system
works as intended.
'''
def genderize(words):

    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))

    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN


def count_gender(sentences):

    sents = Counter() # Counters are like dictionaries,
    words = Counter() # but missing keys default to 0

    for sentence in sentences:
        gender = genderize(sentence)
        sents[gender] += 1             # Number of sentences per gender
        words[gender] += len(sentence) # Number of words in the sentence
                                       # Note ALL words in sentence assigned to one gender

    return sents, words


def parse_gender(text):

    # List of lists. Inner items are tokenized words. Outer items are sentences.
    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    sents, words = count_gender(sentences)
    total = sum(words.values()) # Total text wordcount

    for gender, count in sorted(words.items()): # Each item is one gender
        pcent = (count / total) * 100
        nsents = sents[gender]
        print(
            "{:0.1f}% {} ({} sentences)".format(pcent, gender, nsents)
        )

    # Female/male ratio by word count.
    # Guard against texts that contain no male-only sentences.
    if words[MALE] > 0:
        print(f"{round(words[FEMALE] / words[MALE], 2)} female/male ratio")
    else:
        print("female/male ratio undefined (no male-only sentences)")
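
The starter comment suggests developing against a short text sample. A minimal sketch of that idea follows. It assumes hand-tokenized sentences (so no NLTK models are needed) and uses `TOY_MALE`, `TOY_FEMALE`, and `toy_genderize`, which are hypothetical stand-ins for the full word lists and `genderize` defined above.

```python
# Toy word lists: small, hypothetical subsets of MALE_WORDS / FEMALE_WORDS.
TOY_MALE = {'he', 'his', 'him', 'man'}
TOY_FEMALE = {'she', 'her', 'woman'}

def toy_genderize(words):
    """Apply the same decision rule as genderize() to the toy lists."""
    has_male = bool(TOY_MALE.intersection(words))
    has_female = bool(TOY_FEMALE.intersection(words))
    if has_male and has_female:
        return 'both'
    if has_male:
        return 'male'
    if has_female:
        return 'female'
    return 'unknown'

# Hand-tokenized, lowercased sample: one sentence per expected label.
sample = [
    ['he', 'lost', 'his', 'hat', '.'],        # male words only
    ['she', 'found', 'it', '.'],              # female words only
    ['she', 'gave', 'it', 'to', 'him', '.'],  # words from both lists
    ['it', 'rained', '.'],                    # no gendered words
]

labels = [toy_genderize(sentence) for sentence in sample]
print(labels)  # ['male', 'female', 'both', 'unknown']
```

A sample like this exercises all four branches of the classifier before the full novels are run.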

In [35]:
%%time
# Run and examine the output
for text in texts: # Loop over texts in corpus directory
    print(text)
    with open(os.path.join(text_dir, text), 'r') as f: # Open each text in turn
        parse_gender(f.read()) # Run the gender-parsing function
    print('\n**********\n')

A-Alcott-Little_Women-1868-F.txt
17.9% both (1010 sentences)
33.2% female (2504 sentences)
16.3% male (1393 sentences)
32.6% unknown (4539 sentences)
2.04 female/male ratio

**********

A-Twain-Huck_Finn-1885-M.txt
7.1% both (185 sentences)
9.6% female (415 sentences)
36.4% male (1650 sentences)
47.0% unknown (3576 sentences)
0.26 female/male ratio

**********

B-Eliot-Middlemarch-1869-F.txt
19.9% both (1880 sentences)
14.2% female (1917 sentences)
37.0% male (4558 sentences)
29.0% unknown (6528 sentences)
0.38 female/male ratio

**********

CPU times: user 6.43 s, sys: 40 ms, total: 6.47 s
Wall time: 6.48 s


## Discussion

Some very brief discussion here. What do you know about these books? What would you expect their gender breakdown to be? Did the program return results that matched your expectations? If not, what might you change?
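
One possible change, offered only as a sketch: count individual gendered tokens rather than assigning every word in a sentence to a single label. The names `MALE_SAMPLE`, `FEMALE_SAMPLE`, and `count_gendered_tokens` are hypothetical and not from the textbook. The abbreviated word lists are assumptions for illustration, and the notebook's full `MALE_WORDS` and `FEMALE_WORDS` could be substituted.

```python
from collections import Counter

# Hypothetical, abbreviated word lists for illustration only.
MALE_SAMPLE = {'he', 'his', 'him', 'mr'}
FEMALE_SAMPLE = {'she', 'her', 'mrs'}

def count_gendered_tokens(sentences):
    """Count each gendered token individually, ignoring all other words."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence:
            if word in MALE_SAMPLE:
                counts['male'] += 1
            elif word in FEMALE_SAMPLE:
                counts['female'] += 1
    return counts

# Hand-tokenized, lowercased sentences (an assumption; no NLTK required).
sentences = [
    ['she', 'told', 'him', 'that', 'her', 'sister', 'had', 'left', '.'],
    ['he', 'nodded', '.'],
]

token_counts = count_gendered_tokens(sentences)
print(token_counts['female'], token_counts['male'])  # 2 2
```

Under the sentence-level scheme used above, the first sentence would count as nine "both" words and the second as three "male" words, so the two approaches can give quite different pictures of the same passage.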