# GenderME

## What can the code currently do?
1. [x] Find approximate beginning of speeches. 
2. [x] Split into text parts consisting of president speech portion and following speech.
3. [ ] Create table of political affiliation, e.g. Alice Weidel - AfD etc.

## What do we still need to implement?
1. [ ] Find *actual* beginning of speeches by politicians who are **not** (vice) president.
2. [ ] Get rid of unnecessary text parts without speeches.
3. [ ] Get rid of interjections.
4. [ ] Find instances of gendered and ungendered speech in text.

**Notes on the .txt:** 
1. ~~Schäuble announces speakers. (Relevant for finding the beginning of new speeches?)~~ Actually, several (vice) presidents announce speakers, and this needs to be taken into account. Searching for _präsident_ instead of Schäuble is therefore the better approach.
2. Interjections from other politicians marked with `(...)` -- in the interjections, party affiliation is marked with `[...]`.
3. The beginning of a speech is marked with `:`; however, colons also appear within speeches. For normal speakers, party affiliation is indicated with `(...)`.
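
To make the two bracket conventions from the notes concrete, here is a minimal sketch in base R. The sample line is invented for illustration (it is not taken from the protocol), and all variable names are hypothetical.

```r
# Hypothetical sample line, invented for illustration only.
sample_line <- "Erika Mustermann (SPD): Vielen Dank. (Max Mustermann [FDP]: Sehr richtig!) Ich komme zum Schluss."

# Inside interjections, party affiliation is written in square brackets.
bracket_hits <- regmatches(sample_line, gregexpr("\\[[^]]+\\]", sample_line))[[1]]
interjection_parties <- gsub("^\\[|\\]$", "", bracket_hits)

# For regular speakers, the party is written in parentheses directly before the colon.
speaker_hits <- regmatches(sample_line, gregexpr("\\([^()]+\\):", sample_line))[[1]]
speaker_parties <- gsub("^\\(|\\):$", "", speaker_hits)

print(interjection_parties)
print(speaker_parties)
```

The second pattern requires the colon to follow the closing parenthesis directly, which is what separates a speaker's party marker from an interjection in this toy line.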

**Needed packages:**

In [1]:
options(warn=-1)               # hide warning messages to keep the cell output clean

install.packages("readr")
install.packages("tidyverse")

package 'readr' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Arilila\AppData\Local\Temp\RtmpqMM0yG\downloaded_packages
package 'tidyverse' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Arilila\AppData\Local\Temp\RtmpqMM0yG\downloaded_packages


In [2]:
library("readr")
library("tidyverse")

-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.3     v dplyr   1.0.4
v tibble  3.0.6     v stringr 1.4.0
v tidyr   1.1.2     v forcats 0.5.1
v purrr   0.3.4     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


Load test file `plenartest.txt`.

Because the original .txt file is unformatted, our analysis has to reconstruct its structure, e.g. by finding a useful division into paragraphs ourselves. We therefore split the whole document at spaces so that we can look at the individual words (tokens) and their relation to each other rather than at the text as a whole. Before splitting, multi-word party names are joined with underscores so that each party name stays a single token.

In [36]:
plenar_test <- read_file("res/plenartest.txt")

plenar_test <- str_replace_all(plenar_test, "DIE LINKE", "DIE_LINKE")
plenar_test <- str_replace_all(plenar_test, "BÜNDNIS 90/DIE GRÜNEN", "BÜNDNIS_90/DIE_GRÜNEN")

plenar_test_vec <- strsplit(plenar_test, " ")[[1]]
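
The effect of joining the party names before splitting can be seen on a toy string. This is only a sketch: the sentence is a shortened stand-in for the real protocol, and it uses base R `gsub` in place of `str_replace_all` so that it runs without extra packages.

```r
# Toy string; the real input is the plenary protocol loaded above.
toy_text <- "Heike Hänsel (DIE LINKE): Herr Präsident!"

# Without the replacement, the party name is split into two tokens.
tokens_raw <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# With the replacement, the party marker survives as a single token.
toy_joined <- gsub("DIE LINKE", "DIE_LINKE", toy_text, fixed = TRUE)
tokens_joined <- strsplit(toy_joined, " ", fixed = TRUE)[[1]]

print(tokens_raw)     # "(DIE" and "LINKE):" are separate tokens
print(tokens_joined)  # "(DIE_LINKE):" is one token
```

Keeping the marker in one token is what lets a later step compare a single token against `(DIE_LINKE):`.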

~~Find all instances of `Präsident Dr. Wolfgang Schäuble:` in the .txt file. This helps us:~~

Find all instances of words containing `präsident` in the .txt file. Then, find the first occurrence of a capitalised token followed by a colon. (This should be the last name of the president. To ensure this, the distance between the word containing `präsident` and the aforementioned token must not exceed a defined threshold.)

This helps us:

1. Find the beginning of the relevant text, allowing us to purge the irrelevant front and back matter that each plenary session .txt file contains.
2. Find the *approximate* beginning of the individual speeches, since presidents and vice presidents introduce all speakers.

Let us illustrate this using an example from the test .txt file:

[...] Der gesamte und damit endgültige Stenografische Bericht der 209. Sitzung wird am 16. Februar 2021 veröffentlicht. **Präsident** *Dr. Wolfgang* **Schäuble:** Guten Morgen, liebe Kolleginnen und Kollegen! [...]

In this instance, the distance between the word **Präsident** and **his last name (plus colon)** is *2*, i.e. two tokens (`Dr.` and `Wolfgang`) lie between them. In the following code, we use a distance of 4 as the threshold to account for a doctoral title and a first name, with room left for additional name parts such as a second first name.

In [37]:
first_letter_is_upper <- function(text){
    first_letter <- substring(text, 1, 1)
    return(grepl("^[[:upper:]]+$", first_letter))
}

last_letter <- function(text){
    return(
        substring(text, nchar(text))
    )
}

In [38]:
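len <- length(plenar_test_vec)

president_spotted <- FALSE      # TRUE while we are still within the threshold of a "präsident" token
president_position <- -1        # index of the most recent "präsident" token
president_positions_vec <- vector()

for(i in seq_len(len)) {
    token <- plenar_test_vec[i]
    token_lower <- tolower(token)
    
    # Remember the position of every token that contains "präsident"
    if(grepl("präsident", token_lower, fixed = TRUE)) {
        president_spotted <- TRUE
        president_position <- i
    } 
    
    # Number of tokens between the "präsident" token and the current token
    president_dist <- i-president_position-1
    
    if(president_spotted && president_dist<=4){
        
        # A capitalised token ending in ':' is taken to be the president's last name
        if(first_letter_is_upper(token) && last_letter(token)==':'){
            president_positions_vec <- c(president_positions_vec, president_position)
            president_spotted <- FALSE    
        }
    }
    else{
        # Threshold exceeded without finding a name: discard this candidate
        president_spotted <- FALSE
    }
}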
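
As a check of the distance rule, the same logic can be wrapped in a function and run on the example sentence from above. This is a sketch under assumptions: `find_president_positions` and `max_dist` are hypothetical names that do not appear in the notebook, and the sentence is shortened.

```r
# Hypothetical helper: same rule as the loop above, wrapped in a function.
find_president_positions <- function(tokens, max_dist = 4) {
  positions <- integer(0)
  president_position <- NA_integer_

  for (i in seq_along(tokens)) {
    token <- tokens[i]

    # Remember the most recent token that contains "präsident".
    if (grepl("präsident", tolower(token), fixed = TRUE)) {
      president_position <- i
    }
    if (is.na(president_position)) next

    # Number of tokens between the title and the current token.
    dist <- i - president_position - 1

    if (dist > max_dist) {
      # Too far away from the title: discard this candidate.
      president_position <- NA_integer_
    } else if (grepl("^[[:upper:]]", token) && endsWith(token, ":")) {
      # Capitalised token ending in a colon: treat it as the last name.
      positions <- c(positions, president_position)
      president_position <- NA_integer_
    }
  }
  positions
}

example_text <- "wird am 16. Februar 2021 veröffentlicht. Präsident Dr. Wolfgang Schäuble: Guten Morgen, liebe Kolleginnen und Kollegen!"
example_tokens <- strsplit(example_text, " ", fixed = TRUE)[[1]]
example_positions <- find_president_positions(example_tokens)
print(example_positions)
```

In the example, `Präsident` is the seventh token and `Schäuble:` follows at distance 2, so the position of the title is recorded as the start of a segment.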

In [39]:
length(president_positions_vec)

Following this step, we split the text into segments which are prefaced by one of the (vice) presidents speaking. 

In [40]:
text_segments <- list()
len2 <- length(president_positions_vec)

# Note: `1:len2-1` is parsed as `(1:len2)-1`, i.e. 0:(len2-1), so we use seq_len() instead
for(i in seq_len(len2 - 1)){
    first_president_pos <- president_positions_vec[i]
    second_president_pos <- president_positions_vec[i+1]

    # Tokens from this president's position up to (but excluding) the next one
    president_tokens_vec <- plenar_test_vec[first_president_pos:(second_president_pos - 1)]
    
    text_segments[[length(text_segments) + 1]] <- president_tokens_vec
}

last_president_pos <- president_positions_vec[length(president_positions_vec)]
president_tokens_vec <- plenar_test_vec[last_president_pos:length(plenar_test_vec)]

text_segments[[length(text_segments) + 1]]<- president_tokens_vec
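
The same segmentation can also be written without an explicit loop. The following sketch is an alternative, not the notebook's method: it uses `findInterval` and `split` from base R, and the tokens and positions are made up for illustration.

```r
# Toy tokens and start positions; both are made up for illustration.
toy_tokens <- c("intro", "Präsident", "A:", "x", "Vizepräsident", "B:", "y", "z")
toy_positions <- c(2, 5)

# findInterval() assigns every token index to the most recent start position
# (0 means "before the first president").
segment_id <- findInterval(seq_along(toy_tokens), toy_positions)

# Drop the preamble (id 0) and split the rest into one vector per segment.
keep <- segment_id > 0
toy_segments <- unname(split(toy_tokens[keep], segment_id[keep]))

print(toy_segments)
```

Each segment starts at a president's position and runs up to the token before the next one; the last segment runs to the end of the text, as in the loop above.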

In [41]:
print(paste(text_segments[[106]], collapse = ' '))

[1] "Vizepräsident Wolfgang Kubicki: Vielen Dank, Herr Kollege Sauter. – Die nachfolgende Rednerin ist die Kollegin Heike Hänsel, Fraktion Die Linke. (Beifall bei der LINKEN) Heike Hänsel DIE_LINKE Heike Hänsel (DIE_LINKE): Herr Präsident! Sehr geehrte Kolleginnen und Kollegen! Die Bundesregierung will erneut den NATO-Militäreinsatz Sea Guardian im Mittelmeer verlängern, um angeblich Terrorismus zu bekämpfen und den Waffenschmuggel per Schiff, zum Beispiel nach Libyen, zu stoppen. Wir haben doch aber alle erst letztes Jahr erlebt, dass türkische Schiffe auf dem Weg nach Libyen kontrolliert werden sollten und wie kläglich die NATO dabei gescheitert ist. Ein französisches Schiff wurde sogar von einem Kriegsschiff des NATO-Partners Türkei bei dem Versuch der Kontrolle bedroht. Frankreich hatte sich daraufhin aus dieser NATO-Mission erst mal zurückgezogen. Da frage ich mich schon, wie Sie eigentlich dazu kommen, Herr Tauber, von „erfolgreich“ zu sprechen. Was ist denn an dieser Mission eig

Now, we try to find the *actual* beginning of the speeches by defining a vector with likely buzzwords, such as party affiliation (e.g. SPD) or political office (e.g. Bundeskanzlerin).

In [42]:
PARTIES <- c("CDU/CSU", "DIE_LINKE", "SPD", "FDP", "AfD", "BÜNDNIS_90/DIE_GRÜNEN")

get_party <- function(token){
    for(party in PARTIES){
        if(token == paste("(", party, "):", sep='')){
            return(party)
        }
    }
    return("")
}
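
For a whole token vector, the same lookup can be done in one step. The following is a sketch of a vectorised variant using `match`; `party_markers` and `party_of_tokens` are hypothetical names, and the token vector is a toy example.

```r
PARTIES <- c("CDU/CSU", "DIE_LINKE", "SPD", "FDP", "AfD", "BÜNDNIS_90/DIE_GRÜNEN")

# Build the exact marker tokens, e.g. "(SPD):".
party_markers <- paste0("(", PARTIES, "):")

# Hypothetical vectorised counterpart of get_party():
# returns the party for marker tokens and "" for all other tokens.
party_of_tokens <- function(tokens) {
  idx <- match(tokens, party_markers)
  ifelse(is.na(idx), "", PARTIES[idx])
}

toy_tokens <- c("Heike", "(DIE_LINKE):", "Herr", "(SPD):")
toy_parties <- party_of_tokens(toy_tokens)
print(toy_parties)

# Position of the first party marker in the toy segment.
first_marker_pos <- which(nzchar(toy_parties))[1]
print(first_marker_pos)
```

Like `get_party`, this only recognises tokens that match a marker exactly, so a party name with a space in it must already have been joined with underscores.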

In [43]:
get_speaker <- function(text_segment_vec, start_pos, found_party){
    speaker_name_start <- start_pos
    
    # Look back at most 10 tokens, but never before the start of the segment
    while(speaker_name_start >= 1 && speaker_name_start > start_pos - 10){
        # The bare party name precedes the repeated speaker name
        if(text_segment_vec[speaker_name_start] == found_party){
            speaker <- text_segment_vec[(speaker_name_start+1):start_pos]
            return(
                paste(speaker, collapse = ' ')
            )
        }
        
        speaker_name_start <- speaker_name_start - 1
    }
    
    return("No Name Found")
}

In [44]:
get_party_for_segment <- function(text_segment_vec){
    for(i in 1:length(text_segment_vec)){
        token <- text_segment_vec[i]
        found_party <- get_party(token)
        if(found_party != ""){
            speaker <- get_speaker(text_segment_vec, i-1, found_party)
            return(
                list(party=found_party, pos=i, speaker=speaker)
            )
        }
        
    }
    return(
        list(party="-", pos=-1, speaker="")
    )
}

In [47]:
len3 <- length(text_segments)

for(i in 107:len3) {  
    result <- get_party_for_segment(text_segments[[i]])
    print(result)
    break
}

$party
[1] "BÜNDNIS_90/DIE_GRÜNEN"

$pos
[1] 33

$speaker
[1] "Omid Nouripour"



Extract interjections `(...)`. 

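After this step, `interjections` holds only the interjections, and `plenar_test` holds the remaining text without interjections, as a character vector of the pieces between them. Note that the pattern also matches party markers such as `(SPD):`, because a closing parenthesis followed by a colon satisfies it.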

In [None]:
parenthesis_pattern <- "(?=\\().*?(?<=\\))(?:[^-a-z0-9A-Z_]|$)"

# Compute the match positions once and reuse them for both extractions
parenthesis_matches <- gregexpr(parenthesis_pattern, plenar_test, perl = TRUE)

interjections <- regmatches(
    plenar_test, 
    parenthesis_matches,
    invert = FALSE
)[[1]]

plenar_test <- regmatches(
    plenar_test, 
    parenthesis_matches, 
    invert = TRUE
)[[1]]
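
What the two `regmatches` calls return is easiest to see on a toy string. This is a minimal sketch with an invented sentence; the step that pastes the pieces back together is an assumption about how the cleaned text might be used later and is not part of the notebook.

```r
parenthesis_pattern <- "(?=\\().*?(?<=\\))(?:[^-a-z0-9A-Z_]|$)"

# Invented sentence for illustration.
toy_text <- "Wir handeln jetzt. (Beifall bei der SPD) Das ist notwendig."

toy_matches <- gregexpr(parenthesis_pattern, toy_text, perl = TRUE)

# invert = FALSE (the default) returns the matched interjections ...
toy_interjections <- regmatches(toy_text, toy_matches)[[1]]

# ... and invert = TRUE returns the text pieces between them.
toy_pieces <- regmatches(toy_text, toy_matches, invert = TRUE)[[1]]

# Assumption: the pieces are pasted back together into one string.
toy_clean <- paste(toy_pieces, collapse = "")

print(toy_interjections)
print(toy_clean)
```

The match includes the character that follows the closing parenthesis (here a space), so pasting the pieces back together does not leave a double space behind.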

The following (currently commented-out) loop filters the interjections: it prints only those that contain a colon, i.e. interjections with dialogue elements. (This might be useful for later analysis.)

In [None]:
# for(interjection in interjections){
#     if(grepl(':', interjection, fixed = TRUE)){
#         print(interjection)
#     } 
# }
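
The same filter can be written without a loop by indexing with `grepl`. The interjections below are made up for illustration, and `with_dialogue` is a hypothetical name.

```r
# Made-up interjections for illustration.
toy_interjections <- c(
  "(Beifall bei der SPD)",
  "(Max Mustermann [FDP]: Sehr richtig!)",
  "(Heiterkeit)"
)

# Keep only interjections that contain a colon, i.e. a speaker and a remark.
with_dialogue <- toy_interjections[grepl(":", toy_interjections, fixed = TRUE)]

print(with_dialogue)
```

Storing the result in a vector instead of printing it keeps the dialogue interjections available for later analysis.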