# GenderME: Find Gendered Terms

### What can our code currently do?

1. [x] Load json file from first part.
1. [x] Read our two resource csv files with (un)gendered terms.
1. [x] Untangle our data to make it useful for our project.
1. [x] Use csv file data to iterate over speeches and find (un)gendered terms.
1. [x] Save findings as csv file.

### What might it do in the future? (idea dump)
1. [ ] Detect automatically, without human help, whether gendered speech was used. (No idea how yet.)
1. [ ] Further analyses that depend on (1), such as how often a speaker used ungendered speech.

Load needed packages.

In [87]:
import json
import csv
import re
import string
from tqdm import tqdm

Read `csv file` with gendered terms.

In [88]:
gender_terms = set()

with open('res/gendered.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    next(reader) # skip header

    for row in reader:
        term = row[1].split()
        
        if len(term) == 1:
            gender_terms.add(term[0].lower())

In [89]:
print(len(gender_terms))

1334


Read `csv file` with even more terms. (Seriously, it's a _lot_ of terms).

In [90]:
all_careers = set()
career_lines = []

with open('res/berufe.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    
    for i in range(8): # skip header
        next(reader)
    
    for row in reader:
        career_lines.append(row[0])

In [91]:
len(career_lines)

27854

## Pipeline

The following steps handle the patterns found in the second `csv file`, which is long and quite messy because of the sheer number of terms it lists.

**Case 1**  
`Aalfischer/in` → Only add left part of the term. (Since the algorithm will automatically also find the female version if it occurs in a speech.)

In [92]:
new_career_lines = []

for career_line in career_lines:
    if career_line.count('/') == 1:
        careers = career_line.split('/')
        if careers[1].strip() == 'in':
            all_careers.add(careers[0]) # store
            continue
    new_career_lines.append(career_line)
career_lines = new_career_lines

In [93]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 13873
Extracted: 13980
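
Storing only the left part is enough because the search step later in this notebook compares the beginning of each token with each stored term. A minimal sketch of that idea; the helper name `starts_with_term` and the example tokens are made up for illustration:

```python
def starts_with_term(token, term):
    """Return True if the token begins with the stored term (prefix match)."""
    return token[:len(term)] == term


term = "Aalfischer"  # left part of 'Aalfischer/in'
tokens = ["Aalfischer", "Aalfischerin", "Aalfischerinnen", "Fischer"]  # made-up examples

# The female and plural forms are found as well, 'Fischer' is not.
matches = [token for token in tokens if starts_with_term(token, term)]
print(matches)
```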


**Case 2**  
`Master of Arts - American Studies` → Remove. (Master and Bachelor)

In [94]:
new_career_lines = []
degrees = ["Bachelor of", "Master of"]

for career_line in career_lines:
    if not any(degree in career_line for degree in degrees):
        new_career_lines.append(career_line)

career_lines = new_career_lines

In [95]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 13494
Extracted: 13980


**Case 3**  
`Abbeizer/in (Dekapierer/in)` → Add left part of lefthand word. Discard rest.

In [96]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) > 1:  # note: lines without '/' are neither stored nor kept
        right_side = ' '.join(careers[1:])
        if right_side[:4] == 'in (':
            all_careers.add(careers[0]) # store
        else:
            new_career_lines.append(career_line)

career_lines = new_career_lines

In [97]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 9478
Extracted: 15562


**Case 4**  
`Bautechniker/in - Baudenkmalpflege/Altbauerneuerung` → Add left part of lefthand word. Discard rest.

In [98]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) > 1:
        right_side = ' '.join(careers[1:])
        if right_side[:5] == 'in - ':
            all_careers.add(careers[0]) # store
        else:
            new_career_lines.append(career_line)

career_lines = new_career_lines

In [99]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 3568
Extracted: 15825


**Case 5**  
`Verfahrenstechnolog(e/in) - Mühlen-/Getreidew. - Müllerei` → Add male and female version of lefthand word. Discard rest.

In [100]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('(e/in)')
    if len(careers) > 1:
        all_careers.add(careers[0] + 'e') # store
        all_careers.add(careers[0] + 'in') # store
    else:
        careers = career_line.split('(er/in)')
        if len(careers) > 1:
            all_careers.add(careers[0] + 'er') # store
            all_careers.add(careers[0] + 'in') # store
        else:
            new_career_lines.append(career_line)

career_lines = new_career_lines

In [101]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 3196
Extracted: 16043
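
The Case 5 rule can be tried on a single line in isolation. The sketch below wraps the same two splits in a hypothetical helper, `expand_parenthesised`, which is not part of the notebook:

```python
def expand_parenthesised(career_line):
    """Return male and female forms for '(e/in)' or '(er/in)' lines, else an empty set."""
    for male_suffix in ('e', 'er'):
        marker = f'({male_suffix}/in)'
        if marker in career_line:
            base = career_line.split(marker)[0]  # everything before the marker
            return {base + male_suffix, base + 'in'}
    return set()


forms = expand_parenthesised('Verfahrenstechnolog(e/in) - Mühlen-/Getreidew. - Müllerei')
print(sorted(forms))
```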


**Case 6**  
`Androloge/Andrologin` → Add the stem of the term (the male form without its final `e`), e.g. `Androlog`.

In [102]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) == 2 and careers[0][-1] == 'e':
        if careers[1] == careers[0][:-1] + 'in':
            all_careers.add(careers[0][:-1]) # store
            continue
    
    new_career_lines.append(career_line)
        
career_lines = new_career_lines

In [103]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 3068
Extracted: 16171


**Case 7**  
`Anrichtegehilfe/-gehilfin` → Add the stem of the word (the male form without its final `e`), e.g. `Anrichtegehilf`.

In [104]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) == 2 and careers[0][-1] == 'e':
        if len(careers[1].split()) == 1 and careers[1][0] == '-' and careers[1][-2:] == 'in':
            all_careers.add(careers[0][:-1]) # store
            continue

    new_career_lines.append(career_line)
        
career_lines = new_career_lines

In [105]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 2745
Extracted: 16494
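
Cases 6 and 7 both store the male form without its final `e`, so that prefix matching covers the male and the female form. A sketch that combines the two checks in one hypothetical helper, `stem_of_pair` (the name is an assumption, the conditions follow the two cells above):

```python
def stem_of_pair(career_line):
    """Return the shared stem for 'Xe/Xin' or 'Xe/-Yin' pairs, else None."""
    parts = career_line.split('/')
    if len(parts) != 2 or not parts[0].endswith('e'):
        return None
    male, female = parts
    stem = male[:-1]  # drop the final 'e'
    # Case 6: full female form, e.g. 'Androloge/Andrologin'
    if female == stem + 'in':
        return stem
    # Case 7: abbreviated female form, e.g. 'Anrichtegehilfe/-gehilfin'
    if len(female.split()) == 1 and female.startswith('-') and female.endswith('in'):
        return stem
    return None


print(stem_of_pair('Androloge/Andrologin'))
print(stem_of_pair('Anrichtegehilfe/-gehilfin'))
```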


**Case 8**  
`Anwalt/Anwältin` → Add both terms.

In [106]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) == 2:
        if len(careers[0].split()) == 1 and len(careers[1].split()) == 1:
            if careers[1][0].isupper():
                all_careers.add(careers[0]) # store
                all_careers.add(careers[1]) # store
                continue
    new_career_lines.append(career_line)
        
career_lines = new_career_lines

In [107]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 2679
Extracted: 16615


**Case 9**  
`Politische/r Berater/in` → Only add righthand term, disregard lefthand term as well as `/in`.

In [108]:
new_career_lines = []

for career_line in career_lines:
    if '/r' in career_line and '/in' in career_line:
        found = False
        for word in career_line.split():
            if word[-3:] == '/in':
                all_careers.add(word[:-3]) # store
                found = True
                break
        if found:
            continue

    new_career_lines.append(career_line)

career_lines = new_career_lines

In [109]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 2165
Extracted: 16641
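
Applied to the example line, the Case 9 rule keeps only the noun. A small sketch with a hypothetical helper, `noun_from_adjective_phrase`, that mirrors the loop above:

```python
def noun_from_adjective_phrase(career_line):
    """Return the first word ending in '/in' (without that ending), else None."""
    if '/r' not in career_line or '/in' not in career_line:
        return None
    for word in career_line.split():
        if word.endswith('/in'):
            return word[:-3]  # strip the trailing '/in'
    return None


print(noun_from_adjective_phrase('Politische/r Berater/in'))
```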


**Case 10**  
`Alleinsteuermann/-frau` → Add both terms.

In [110]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) == 2:
        if careers[0][-4:] == 'mann' and careers[1][:5] == '-frau':
            all_careers.add(careers[0]) # store
            all_careers.add(careers[0][:-4] + 'frau') # store
            continue
    new_career_lines.append(career_line)

career_lines = new_career_lines

In [111]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 1724
Extracted: 17147


**Case 11**  
`Dipl. Ing.` → Remove all lines containing `Dipl.` or `Ing.`.

In [112]:
new_career_lines = []

for career_line in career_lines:
    if "Dipl." not in career_line and "Ing." not in career_line:
        new_career_lines.append(career_line)

career_lines = new_career_lines

In [113]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 1624
Extracted: 17147


**Case 12**  
`Stationsschwester/-pfleger` → Add both terms.

In [114]:
new_career_lines = []

for career_line in career_lines:
    careers = career_line.split('/')
    if len(careers) == 2:
        if "schwester" in careers[0] and careers[1] == "-pfleger":
            all_careers.add(careers[0]) # store
            all_careers.add(careers[0][:-9] + 'pfleger') # store
            continue
            
    new_career_lines.append(career_line)

career_lines = new_career_lines

In [115]:
print("Rest:", len(career_lines))
print("Extracted:", len(all_careers))

Rest: 1584
Extracted: 17224


We now have a total of 17,224 careers that we can find in a speech. For now, we disregard the 1,584 terms that are left.
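
Every case above follows the same pattern: try a rule on each remaining line, store what it extracts, and keep the rest. The following sketch shows how that loop could be factored out. `apply_rule` and the two rule functions are hypothetical names, and only Cases 1 and 2 are shown:

```python
def apply_rule(lines, rule, extracted):
    """Apply `rule` to each line and return the lines it did not handle.

    `rule` returns an iterable of terms to store, or None if the line
    does not match. Returning an empty list drops the line.
    """
    remaining = []
    for line in lines:
        terms = rule(line)
        if terms is None:
            remaining.append(line)
        else:
            extracted.update(terms)
    return remaining


def simple_in_rule(line):
    """Case 1: 'Aalfischer/in' -> ['Aalfischer']."""
    parts = line.split('/')
    if len(parts) == 2 and parts[1].strip() == 'in':
        return [parts[0]]
    return None


def degree_rule(line):
    """Case 2: drop 'Bachelor of' / 'Master of' lines without storing anything."""
    if 'Bachelor of' in line or 'Master of' in line:
        return []
    return None


lines = ['Aalfischer/in', 'Master of Arts - American Studies', 'Anwalt/Anwältin']
extracted = set()
for rule in (simple_in_rule, degree_rule):
    lines = apply_rule(lines, rule, extracted)

print("Rest:", lines)
print("Extracted:", extracted)
```

With this layout each new case becomes one small function, and the order of the rules stays visible in a single tuple.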

### Postprocessing

Joining both of our sets (with a total of 18,558 terms):

In [116]:
all_gender_terms = gender_terms.union(all_careers)
print(len(all_gender_terms))

18558


Get rid of all terms with a length of less than 4 characters (since they would likely lead to a lot of false positives).

In [117]:
all_gender_terms = {term for term in all_gender_terms if len(term) > 3}

Exclude some terms that have led to many errors in test runs. For example, the job of `Wirt` is never discussed in our political speeches, whereas `Wirtschaft` is used often and therefore produces a lot of false positives.

In [118]:
excluded_words = ["Wirt", "Volkswirt", "Erbe", "Prior", "Ober"]

for word in excluded_words:
    if word in all_gender_terms:
        all_gender_terms.remove(word)
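
To see why these terms cause trouble, the two comparisons used by the search step (prefix match, and suffix match against the lowercased term) can be tried on a few words. Apart from `Wirtschaft`, the example tokens are made up for illustration, and `matches` is a hypothetical helper:

```python
def matches(token, term):
    """Mirror of the prefix and suffix comparisons used in the search step."""
    return token[:len(term)] == term or token[-len(term):] == term.lower()


# excluded term -> an unrelated word that it would still match (illustrative)
examples = {
    "Wirt": "Wirtschaft",   # prefix match
    "Prior": "Priorität",   # prefix match
    "Erbe": "Gewerbe",      # suffix match on 'erbe'
    "Ober": "Oktober",      # suffix match on 'ober'
}

false_positives = {term: token for term, token in examples.items() if matches(token, term)}
print(false_positives)
```

All four terms are at least four characters long, so the length filter above does not remove them.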

Some words that are not found so far must be added separately.

In [119]:
include_words = set(["Kollege", "Kollegin", "Zuhörer", "Präsident"])
all_gender_terms = all_gender_terms.union(include_words)

Now, we search for the use of our terms in the speeches.

In [120]:
def terms_in_speech(speech_tokens):
    result_set = set()
    
    for i, token in enumerate(speech_tokens):
        if not token[0].isupper():
            continue

        for gender_term in all_gender_terms:
            found_token = None
            
            if token[:len(gender_term)] == gender_term:
                found_token = token
            elif token[-len(gender_term):] == gender_term.lower():
                found_token = token
            
            if found_token is not None:
                result_set.add((i, found_token))
                break
    
    return sorted(list(result_set))

In [121]:
def analyse_speech(speech):
    clean_speech = speech.translate(str.maketrans('', '', string.punctuation))
    speech_tokens = clean_speech.split()
    
    relevant_pos_terms = terms_in_speech(speech_tokens)
    
    terms_with_context = []
    
    for pos, term in relevant_pos_terms:
        result = '[' + term + '] ' + ' '.join(speech_tokens[max(0, pos-2):pos+3])
        terms_with_context.append(result)
    
    return terms_with_context
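
The context string built in `analyse_speech` is the hit in brackets followed by up to two tokens on either side. A sketch of only the windowing step, using a made-up sentence and a hypothetical helper `context_window`:

```python
import string


def context_window(tokens, pos, width=2):
    """Return '[hit] ' followed by up to `width` tokens on each side of the hit."""
    window = tokens[max(0, pos - width):pos + width + 1]
    return '[' + tokens[pos] + '] ' + ' '.join(window)


sentence = "Sehr geehrte Frau Präsidentin, liebe Kolleginnen und Kollegen!"  # made-up example
tokens = sentence.translate(str.maketrans('', '', string.punctuation)).split()

print(context_window(tokens, tokens.index("Präsidentin")))
print(context_window(tokens, 0))  # the window is clipped at the start of the speech
```

Note that `string.punctuation` only contains ASCII punctuation, so typographic quotation marks such as `„` stay attached to their tokens.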


## Read JSON File and convert to csv entries

In [130]:
file_name = '209'

In [131]:
with open(f'output/speakers_{file_name}.json', 'r', encoding='utf-8') as f:
    speeches = json.load(f)

len(speeches)

381

In [132]:
csv_lines = []

for i, speech in tqdm(enumerate(speeches, start=1)):
    gender_terms_with_context = analyse_speech(speech['text'])
    party = speech['party'] if 'party' in speech else ''
    csv_lines.append(
        (speech['speaker'], party, i, gender_terms_with_context)
    )

381it [17:22,  2.74s/it]
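
The run above took about 17 minutes because every capitalised token is compared with every term. Since `terms_in_speech` only records whether a token matches at all, one possible speed-up is to group the terms by length and use set lookups. This is an untested sketch, not a measured improvement; `build_index` and `token_matches` are hypothetical names:

```python
from collections import defaultdict


def build_index(terms):
    """Group terms (and their lowercased forms) by the length of the original term."""
    prefixes = defaultdict(set)
    suffixes = defaultdict(set)
    for term in terms:
        prefixes[len(term)].add(term)
        suffixes[len(term)].add(term.lower())
    return prefixes, suffixes


def token_matches(token, index):
    """Same test as the inner loop of terms_in_speech, one set lookup per term length."""
    prefixes, suffixes = index
    if not token[0].isupper():
        return False
    for n in prefixes:
        if token[:n] in prefixes[n] or token[-n:] in suffixes[n]:
            return True
    return False


index = build_index({"Kollege", "Arzt", "Präsident"})  # tiny illustrative term set
for token in ["Kollegen", "Zahnarzt", "wirtschaft", "Haus"]:
    print(token, token_matches(token, index))
```

The work per token then depends on the number of distinct term lengths rather than on the number of terms.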


In [133]:
OUTPUT_FILE = f'output/gender_terms_per_speech_{file_name}.csv'

with open(OUTPUT_FILE, 'w', encoding='utf-8') as f_out:
    f_out.write('Speaker|Party|Speech ID|Gendered Terms\n')
    for name, party, speech_id, terms in csv_lines:
        f_out.write(
            name + '|' + party + '|' + str(speech_id) + '|' + '; '.join(terms) + '\n'
        )
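
Writing the rows by string concatenation produces a broken line if a field ever contains `|`. As an alternative, the `csv` module can write the same layout and quote such fields. A sketch under that assumption; `write_rows` and the sample row are made up, and an in-memory buffer stands in for the output file:

```python
import csv
import io


def write_rows(f_out, csv_lines):
    """Write header and rows with '|' as delimiter; csv.writer quotes fields if needed."""
    writer = csv.writer(f_out, delimiter='|')
    writer.writerow(['Speaker', 'Party', 'Speech ID', 'Gendered Terms'])
    for name, party, speech_id, terms in csv_lines:
        writer.writerow([name, party, speech_id, '; '.join(terms)])


buffer = io.StringIO()  # stands in for open(OUTPUT_FILE, 'w', newline='', encoding='utf-8')
sample = [("Erika Mustermann", "", 1, ["[Kollegen] liebe Kolleginnen und Kollegen"])]
write_rows(buffer, sample)
print(buffer.getvalue())
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''`.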