# Formatting - GenderME

## What can the code currently do?
1. [x] Find approximate beginning of speeches. 
1. [x] Split into text parts consisting of president speech portion and following speech.
1. [x] Create table of political affiliation, e.g. Alice Weidel - AfD etc.
1. [x] Find *actual* beginning of speeches by politicians who are **not** (vice) president.
1. [x] Get rid of unnecessary text parts without speeches.
1. [x] Get rid of interjections.

**Notes on the .txt:** 
1. Several (vice) presidents announce speakers, which needs to be taken into account. (Searching for _präsident_ rather than for `Schäuble` is the better approach.)
2. Interjections from other politicians are marked with `(...)`; within the interjections, party affiliation is marked with `[...]`.
3. The beginning of a speech is marked with `:`; however, colons also appear within speeches. For regular speakers, party affiliation is indicated with `(...)`.
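
The three markers from the notes can be told apart with a few regular expressions. The following is only a minimal sketch: the `sample` line is made up to mirror the notation described above and is not taken from the protocol, and the variable names are hypothetical.

```python
import re

# Made-up sample line that mirrors the notation described in the notes.
sample = (
    "Dr. Alice Weidel (AfD): Das ist falsch. "
    "(Zuruf des Abg. Beispiel [SPD]: Unsinn!) "
    "Wir fordern: mehr Transparenz."
)

# Speaker marker: a round-bracket group immediately followed by a colon.
speaker_parties = re.findall(r'\(([^()]+)\):', sample)

# Party affiliation inside interjections: square brackets.
interjection_parties = re.findall(r'\[([^\[\]]+)\]', sample)

# Colons alone are ambiguous: only one of them starts the speech.
colon_count = sample.count(':')

print(speaker_parties)       # ['AfD']
print(interjection_parties)  # ['SPD']
print(colon_count)           # 3
```

This is why the pipeline below does not rely on the colon alone but combines it with the position of `präsident` and the party marker.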

## Pipeline

Load test file `plenartest.txt`.

Because the original .txt file is unformatted, our analysis has to cover many aspects itself, e.g. finding a useful division into paragraphs. We therefore split the whole document on whitespace, so that we can look at the individual words and their relation to each other rather than at the text as a whole.

In [1]:
import re
import json

In [2]:
plenar_test = ''

with open('res/plenartest.txt', 'r', encoding='utf-8') as file_in:
    plenar_test = file_in.read()

TODO: Remove the spaces in multi-word party names so that they can still be found later

In [3]:
plenar_test = plenar_test.replace('DIE LINKE', 'DIE_LINKE')
plenar_test = plenar_test.replace('BÜNDNIS 90/DIE GRÜNEN', 'BÜNDNIS_90/DIE_GRÜNEN')

TODO: Split the text into tokens

In [4]:
plenar_test = plenar_test.split()
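
To see why the party names are joined with underscores before splitting, compare both variants on a single line. This is a small sketch; the helper `protect_party_names` and the list `MULTI_WORD_PARTIES` are hypothetical names introduced here, and the sample line is modelled on the speaker notation in the protocol.

```python
# Party names that contain spaces (the same two that are replaced above).
MULTI_WORD_PARTIES = ['DIE LINKE', 'BÜNDNIS 90/DIE GRÜNEN']

def protect_party_names(text):
    """Replace the spaces inside multi-word party names with underscores."""
    for name in MULTI_WORD_PARTIES:
        text = text.replace(name, name.replace(' ', '_'))
    return text

line = "Dr. Dietmar Bartsch (DIE LINKE): Herr Präsident!"

naive = line.split()
protected = protect_party_names(line).split()

print(naive)      # ['Dr.', 'Dietmar', 'Bartsch', '(DIE', 'LINKE):', 'Herr', 'Präsident!']
print(protected)  # ['Dr.', 'Dietmar', 'Bartsch', '(DIE_LINKE):', 'Herr', 'Präsident!']
```

Without the replacement, the party marker is spread over two tokens and can no longer be matched as a single token.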

~~Find all instances of `Präsident Dr. Wolfgang Schäuble:` in the .txt file. This helps us:~~

Find all instances of words containing `präsident` in the .txt file. Then, find the first occurrence of a capitalised token followed by a colon. (This should be the last name of the president. To ensure this, the distance between the word containing `präsident` and the aforementioned token must not exceed a defined threshold.)

This helps us:

1. Find the beginning of the relevant text, allowing us to purge the irrelevant front and back matter that each plenary session .txt file contains.
2. Find the *approximate* beginning of the individual speeches, since presidents and vice presidents introduce all speakers.

Let us illustrate this using an example from the test .txt file:

[...] Der gesamte und damit endgültige Stenografische Bericht der 209. Sitzung wird am 16. Februar 2021 veröffentlicht. **Präsident** *Dr. Wolfgang* **Schäuble:** Guten Morgen, liebe Kolleginnen und Kollegen! [...]

In this instance, the distance between the word **Präsident** and **his last name (plus colon)** is *2*, i.e. two tokens lie in between. In the following code, we use a distance of 4 as the threshold to account for an academic title, first name and last name, plus some slack for people with several first names and similar cases.

In [5]:
president_spotted = False
president_position = None
president_positions = []

for i, token in enumerate(plenar_test):
    
    if "präsident" in token.lower():
        president_spotted = True
        president_position = i
    
    if president_spotted:
        president_dist = i - president_position - 1
        
        if president_dist > 4:
            president_spotted = False
        else:
            if token[0].isupper() and token[-1] == ':':
                president_positions.append(president_position)
                president_spotted = False  


In [6]:
len(president_positions)

173
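
The threshold logic can be checked on the example sentence from above. The sketch below wraps the same loop in a function; the name `find_president_positions` and the parameter `max_dist` are hypothetical, and the token list is a shortened excerpt, extended by a `Herr Präsident!` address to show a rejected candidate.

```python
def find_president_positions(tokens, max_dist=4):
    """Return the indices of 'präsident' tokens that introduce a speaker line."""
    positions = []
    candidate = None  # index of the most recent token containing 'präsident'

    for i, token in enumerate(tokens):
        if "präsident" in token.lower():
            candidate = i
        if candidate is None:
            continue
        # Number of tokens between the candidate and the current token.
        if i - candidate - 1 > max_dist:
            candidate = None  # too far away, give up on this candidate
        elif token[0].isupper() and token.endswith(':'):
            positions.append(candidate)
            candidate = None

    return positions

tokens = (
    "wird am 16. Februar 2021 veröffentlicht. "
    "Präsident Dr. Wolfgang Schäuble: Guten Morgen, liebe Kolleginnen und Kollegen! "
    "Herr Präsident! Meine Damen und Herren!"
).split()

positions = find_president_positions(tokens)
print(positions)                       # [6]
print(' '.join(tokens[6:10]))          # Präsident Dr. Wolfgang Schäuble:
```

The second `Präsident!` is not recorded, because no capitalised token ending in a colon follows it within the threshold.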

Following this step, we split the text into segments, each of which begins with one of the (vice) presidents speaking.

In [7]:
president_positions.append(len(plenar_test))

president_segments = []

for i in range(0, len(president_positions)-1):
    first_president_pos = president_positions[i]
    second_president_pos = president_positions[i+1]
    
    president_segment = plenar_test[first_president_pos:second_president_pos]
    president_segments.append(president_segment)

In [8]:
' '.join(president_segments[106][:50])

'Vizepräsident Wolfgang Kubicki: Vielen Dank, Frau Kollegin Hänsel. – Als nächstem Redner erteile ich das Wort dem Kollegen Omid Nouripour, Fraktion Bündnis 90/Die Grünen. (Beifall beim BÜNDNIS_90/DIE_GRÜNEN) Omid Nouripour BÜNDNIS_90/DIE_GRÜNEN Omid Nouripour (BÜNDNIS_90/DIE_GRÜNEN): Herr Präsident! Meine Damen und Herren! Der derzeitige Einsatz der Bundeswehr im Mittelmeer ist sinnvoll. Wir werden'
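
The segmentation step can also be written as a small function. This is a sketch with hypothetical names (`split_at`, `starts`) and a toy token list; unlike the cell above, it does not modify the list of positions, so re-running it does not append the end marker twice.

```python
def split_at(tokens, starts):
    """Cut tokens into segments that each begin at one of the start indices."""
    bounds = list(starts) + [len(tokens)]  # copy, so `starts` stays unchanged
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:])]

# Toy input: two preamble tokens, then two president turns.
tokens = "x x Präsident A: eins zwei Vizepräsident B: drei".split()
starts = [2, 6]

segments = split_at(tokens, starts)
print(segments)
# [['Präsident', 'A:', 'eins', 'zwei'], ['Vizepräsident', 'B:', 'drei']]
```

Everything before the first start index is dropped, which corresponds to purging the front matter of the protocol.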

Now, we try to find the *actual* beginning of the speeches by defining a list of likely markers, such as party affiliation (e.g. SPD) or political office (e.g. Bundeskanzlerin).

In [9]:
PARTIES  = ['CDU/CSU', 'DIE_LINKE', 'SPD', 'FDP', 'AfD', 'BÜNDNIS_90/DIE_GRÜNEN']

def get_party(token):
    for party in PARTIES:
        if token == '(' + party + '):':
            return party
    return ''

In [10]:
def get_segment_speaker(president_segment, start_pos, found_party):
    speaker_name_start = start_pos
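    # Look back at most 10 tokens, and never below index 0 (negative indices would wrap around).
    while speaker_name_start > max(start_pos - 10, -1):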
        if president_segment[speaker_name_start] == found_party:
            speaker = president_segment[speaker_name_start+1:start_pos+1]
            return ' '.join(speaker)
        speaker_name_start -= 1
    return 'No Name Found'

In [11]:
def get_segment_party(president_segment):
    for i, token in enumerate(president_segment):
        found_party = get_party(token)
        if found_party != '':
            speaker = get_segment_speaker(president_segment, i -1, found_party)
            return found_party, i, speaker
    return None, None, None

TODO: This lets us identify the speaker belonging to each segment, but so far only those with a party label

In [12]:
for i in range(10): 
    result = get_segment_party(president_segments[i])
    print(result)

(None, None, None)
('AfD', 31, 'Dr. Alice Weidel')
('SPD', 25, 'Dr. Rolf Mützenich')
('FDP', 24, 'Christian Lindner')
('CDU/CSU', 23, 'Ralph Brinkhaus')
('DIE_LINKE', 25, 'Dr. Dietmar Bartsch')
('BÜNDNIS_90/DIE_GRÜNEN', 24, 'Katrin Göring-Eckardt')
('SPD', 21, 'Bärbel Bas')
('AfD', 21, 'Sebastian Münzenmaier')
('CDU/CSU', 20, 'Alexander Dobrindt')
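
To make the lookup easier to follow, here is a condensed sketch of the same idea applied to the beginning of segment 106 shown earlier. The function name `find_speaker` and the parameter `lookback` are hypothetical. It relies on the observation from that segment that the bare party token directly precedes the speaker's name.

```python
PARTIES = ['CDU/CSU', 'DIE_LINKE', 'SPD', 'FDP', 'AfD', 'BÜNDNIS_90/DIE_GRÜNEN']

def find_speaker(segment, lookback=10):
    """Return (party, marker index, speaker name) for the first party marker."""
    for i, token in enumerate(segment):
        for party in PARTIES:
            if token != '(' + party + '):':
                continue
            # Walk backwards from the token before the marker to the bare
            # party token; the name lies between that token and the marker.
            for j in range(i - 1, max(i - 1 - lookback, -1), -1):
                if segment[j] == party:
                    return party, i, ' '.join(segment[j + 1:i])
            return party, i, 'No Name Found'
    return None, None, None

segment = (
    "Vizepräsident Wolfgang Kubicki: Vielen Dank, Frau Kollegin Hänsel. – "
    "Als nächstem Redner erteile ich das Wort dem Kollegen Omid Nouripour, "
    "Fraktion Bündnis 90/Die Grünen. (Beifall beim BÜNDNIS_90/DIE_GRÜNEN) "
    "Omid Nouripour BÜNDNIS_90/DIE_GRÜNEN Omid Nouripour "
    "(BÜNDNIS_90/DIE_GRÜNEN): Herr Präsident!"
).split()

result = find_speaker(segment)
print(result)  # ('BÜNDNIS_90/DIE_GRÜNEN', 32, 'Omid Nouripour')
```

The speech itself then starts at the token after the returned index.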


In [13]:
for i, president_segment in enumerate(president_segments):
    result = get_segment_party(president_segment)
    if result[0] is None:
        print(i, ' >> ', ' '.join(president_segment[:50]))
        print()

0  >>  Präsident Dr. Wolfgang Schäuble: Guten Morgen, liebe Kolleginnen und Kollegen! Bitte nehmen Sie Platz. Die Sitzung ist eröffnet. Vor Eintritt in die Tagesordnung gratuliere ich nachträglich der Kollegin Franziska Gminder zum 76. Geburtstag und dem Kollegen Dr. Ernst Dieter Rossmann zum 70. Geburtstag. Im Namen des ganzen Hauses wünsche ich

33  >>  Vizepräsident Wolfgang Kubicki: Vielen Dank, Herr Kollege Dr. Brandl. – Damit schließe ich die Aussprache. Interfraktionell wird Überweisung der Vorlage auf Drucksache 19/26549 an die in der Tagesordnung aufgeführten Ausschüsse vorgeschlagen. Gibt es weitere Überweisungsvorschläge? – Ich sehe, das ist nicht der Fall. Dann verfahren wir wie vorgeschlagen. Ich rufe

82  >>  Vizepräsident in Dagmar Ziegler: Vielen Dank. – Die Aktuelle Stunde ist beendet. Ich rufe den Tagesordnungspunkt 9 auf: Beratung des Antrags der Bundesregierung Fortsetzung der Beteiligung bewaffneter deutscher Streitkräfte an der Mission der Vereinten Nationen in de

In [14]:
speeches_list = []

speakers_dict = {}

for president_segment in president_segments:
    result = get_segment_party(president_segment)
    
    party = result[0]
    pos = result[1]
    speaker = result[2]
    
    if party is None:          # skip segments without a party-affiliated speaker (possible TODO)
        continue
    
    speech = president_segment[pos+1:]
    speech = ' '.join(speech)
    
    if speaker not in speakers_dict:
        speakers_dict[speaker] = [party, speech]
    else:
        speakers_dict[speaker].append(speech)

TODO: Print the number of speakers with a party affiliation

In [15]:
len(speakers_dict)

123

TODO: Remove interjections

In [16]:
interjection_pattern = r'\(.*?\)'  # raw string: non-greedy match of one bracketed interjection

In [17]:
interjections = []

for speaker in speakers_dict:
    for i, speech in enumerate(speakers_dict[speaker]):
        interjections.extend(
            re.findall(interjection_pattern, speech)
        )
        
        speakers_dict[speaker][i] = re.sub(interjection_pattern, '', speech)
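
The effect of the non-greedy pattern is easiest to see on a short example. The `speech` string below is made up for illustration. The final whitespace clean-up is an optional extra step that the cell above does not perform.

```python
import re

INTERJECTION = re.compile(r'\(.*?\)')

# Made-up speech fragment with two interjections.
speech = (
    "Das ist sinnvoll. (Beifall bei der SPD) Wir stimmen zu. "
    "(Zuruf von der AfD: Nein!) Danke."
)

found = INTERJECTION.findall(speech)
cleaned = INTERJECTION.sub('', speech)

# Optional: collapse the double spaces that the removal leaves behind.
cleaned = re.sub(r'\s{2,}', ' ', cleaned).strip()

print(found)    # ['(Beifall bei der SPD)', '(Zuruf von der AfD: Nein!)']
print(cleaned)  # Das ist sinnvoll. Wir stimmen zu. Danke.
```

With a greedy `.*` the pattern would match from the first opening bracket to the last closing bracket and delete the sentence in between as well.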

In [18]:
# for speaker in speakers_dict:
#     print(speaker)
#     print(speakers_dict[speaker])

In [19]:
clean_speakers_list = []

for speaker in speakers_dict:
    clean_speakers_list.append(
        {
            'speaker': speaker,
            'party': speakers_dict[speaker][0],
            'speeches': speakers_dict[speaker][1:],
        }
    )

In [20]:
with open('speakers.json', 'w', encoding='utf-8') as file_out:
    json.dump(clean_speakers_list, file_out, ensure_ascii=False)
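
As a quick sanity check, the file can be read back and compared with the original list. The sketch below uses a temporary directory and a single placeholder record (the speech text is invented), so it does not touch the real `speakers.json`.

```python
import json
import os
import tempfile

# Placeholder record in the same shape as the entries of clean_speakers_list.
records = [
    {
        'speaker': 'Dr. Alice Weidel',
        'party': 'AfD',
        'speeches': ['Beispielrede mit Umlauten: äöü'],
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'speakers.json')

    with open(path, 'w', encoding='utf-8') as file_out:
        json.dump(records, file_out, ensure_ascii=False)

    with open(path, 'r', encoding='utf-8') as file_in:
        raw = file_in.read()

loaded = json.loads(raw)

print(loaded == records)  # True
print('äöü' in raw)       # True, because ensure_ascii=False keeps umlauts readable
```

Opening the file with `encoding='utf-8'` on both sides matters here, since `ensure_ascii=False` writes the umlauts as literal characters rather than `\u` escapes.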