# GenderME

## What can the code currently do?
1. Find approximate beginning of speeches. 
2. 
3. 
....

## What do we still need to implement?
1. Split the text into parts, each consisting of a president's announcement and the speech that follows it.
2. Find the *actual* beginning of speeches by politicians who are **not** (vice) president.
3. Create table of political affiliation, e.g. Alice Weidel - AfD etc.
4. Get rid of unnecessary text parts without speeches.
5. Get rid of interjections.
6. Find instances of gendered and ungendered speech in text.

**Notes on the .txt:** 
1. ~~Schäuble announces speakers. (Relevant for finding the beginning of new speeches?)~~ Actually, several (vice) presidents announce speakers, and this needs to be taken into account. Searching for _präsident_ instead of _Schäuble_ is the better approach.
2. Interjections from other politicians are marked with `(...)`; within the interjections, party affiliation is marked with `[...]`.
3. The beginning of a speech is marked with `:`; however, colons also appear within speeches. For normal speakers, party affiliation is indicated with `(...)`.
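To make the bracket convention from note 2 concrete, here is a minimal sketch that pulls the party tag out of a single interjection. The interjection string, the speaker name and the party label `ABC` are made up for illustration and do not come from the plenary file.

```r
# Made-up interjection in the format described in note 2 (hypothetical name and party)
interjection <- "(Erika Mustermann [ABC]: Zwischenruf!)"

# Find every "[...]" group inside the interjection
party_tags <- regmatches(interjection, gregexpr("\\[[^]]+\\]", interjection))[[1]]

# Strip the surrounding square brackets
parties <- gsub("^\\[|\\]$", "", party_tags)
print(parties)
```

Whether every interjection in the real protocols follows this exact layout is an assumption that still needs to be checked against the data.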

**Needed packages:**

In [1]:
options(warn=-1)               #hide warning messages; makes the code look nicer

install.packages("readr")
install.packages("tidyverse")

package 'readr' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Arilila\AppData\Local\Temp\RtmpCWvRzp\downloaded_packages
package 'tidyverse' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Arilila\AppData\Local\Temp\RtmpCWvRzp\downloaded_packages


In [2]:
library("readr")
library("tidyverse")

-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.3     v dplyr   1.0.4
v tibble  3.0.6     v stringr 1.4.0
v tidyr   1.1.2     v forcats 0.5.1
v purrr   0.3.4     
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


Load test file `plenartest.txt`.

Because the original .txt file is unformatted, our analysis has to do some of the structuring itself, e.g. finding a useful division into paragraphs. We therefore split the whole document at spaces, so that we can look at the individual words and how they relate to each other rather than at the text as a whole.

In [3]:
plenar_test <- read_file("res/plenartest.txt")
plenar_test_vec <- strsplit(plenar_test, " ")[[1]]
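One thing to keep in mind with this split: a single space as separator leaves line breaks inside tokens. The following sketch compares it with a split on any run of whitespace. The sample string, and in particular the line break after the colon, is an assumption made for illustration; the real file may look different.

```r
# Sample text with an assumed line break after the speaker's name
sample_text <- "Präsident Dr. Wolfgang Schäuble:\nGuten Morgen, liebe Kolleginnen und Kollegen!"

# Split at single spaces, as in the cell above
tokens_space <- strsplit(sample_text, " ")[[1]]

# Alternative: split at any run of whitespace (spaces, tabs, line breaks)
tokens_ws <- strsplit(sample_text, "[[:space:]]+")[[1]]

print(tokens_space[4])   # name and following word are glued together by the line break
print(tokens_ws[4])      # name with colon stands alone
```

If the file does contain line breaks right after names, the whitespace variant would keep tokens such as `Schäuble:` detectable by the last-letter check used further down.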

~~Find all instances of `Präsident Dr. Wolfgang Schäuble:` in the .txt file. This helps us:~~

Find all instances of words containing `präsident` in the .txt file. Then, find the first occurrence of a capitalised token followed by a colon, which should be the last name of the president. To ensure this, the distance between the word containing `präsident` and that token must not exceed a defined threshold.

This helps us:

1. Find the beginning of the relevant text, allowing us to purge the irrelevant material at the beginning and end of each plenary session .txt file.
2. Find the *approximate* beginning of the individual speeches, since presidents and vice presidents introduce all speakers.

Let us illustrate this using an example from the test .txt file:

[...] Der gesamte und damit endgültige Stenografische Bericht der 209. Sitzung wird am 16. Februar 2021 veröffentlicht. **Präsident** *Dr. Wolfgang* **Schäuble:** Guten Morgen, liebe Kolleginnen und Kollegen! [...]

In this instance, the distance between the word **Präsident** and **his last name (plus colon)** is *2*, i.e. two tokens lie between them. In the following code, we use a distance of 4 as the threshold to account for an academic title and a first name, and to leave room for people with several first names, et cetera.
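The distance from the example can be reproduced in a few lines. This is only a sketch on a hand-written token vector taken from the sentence above; the variable `name_position` is introduced here for illustration.

```r
# Tokens from the example sentence above
tokens <- c("Präsident", "Dr.", "Wolfgang", "Schäuble:", "Guten", "Morgen,")

# Position of the first token containing "präsident"
president_position <- which(grepl("präsident", tolower(tokens), fixed = TRUE))[1]

# Position of the first token ending in a colon
name_position <- which(grepl(":$", tokens))[1]

# Number of tokens lying between the two
president_dist <- name_position - president_position - 1
print(president_dist)   # 2
```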

In [60]:
# TRUE if the first character of `text` is an upper-case letter
first_letter_is_upper <- function(text){
    first_letter <- substring(text, 1, 1)
    return(grepl("^[[:upper:]]+$", first_letter))
}

In [49]:
# returns the last character of `text`
last_letter <- function(text){
    return(
        substring(text, nchar(text))
    )
}
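A short usage sketch for the two helpers. They are repeated here so that the block runs on its own; the token vector is made up for illustration. Both helpers are vectorised, so they can be applied to several tokens at once.

```r
# Helpers as defined above, repeated so this block is self-contained
first_letter_is_upper <- function(text){
    first_letter <- substring(text, 1, 1)
    return(grepl("^[[:upper:]]+$", first_letter))
}
last_letter <- function(text){
    return(substring(text, nchar(text)))
}

# Made-up tokens: a name, ordinary words, and two words followed by a colon
tokens <- c("Schäuble:", "Guten", "liebe", "Folgendes:", "sagen:")

# A token is a candidate if it is capitalised AND ends in a colon
is_candidate <- first_letter_is_upper(tokens) & last_letter(tokens) == ":"
print(tokens[is_candidate])
```

Note that `Folgendes:` passes both checks although it is not a name. This is why the distance to a `präsident` token is used as an additional condition.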

In [65]:
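len <- length(plenar_test_vec)
# len <- 5000                  # uncomment to test on a smaller portion

president_spotted <- FALSE     # TRUE while we are inside the search window
president_position <- -1       # index of the last token containing "präsident"

for(i in seq_len(len)) {
    token <- plenar_test_vec[i]
    token_lower <- tolower(token)
    
    # remember where the most recent "präsident" token was seen
    if(grepl("präsident", token_lower, fixed = TRUE)) {
        president_spotted <- TRUE
        president_position <- i
    }
    
    # number of tokens between the "präsident" token and the current one
    president_dist <- i - president_position - 1
    
    if(president_spotted && president_dist <= 4) {
        # a capitalised token ending in ':' is taken to be the last name
        if(first_letter_is_upper(token) && last_letter(token) == ':') {
            # print(token)
            president_spotted <- FALSE
        }
    }
    else {
        # threshold exceeded: give up on this "präsident" token
        president_spotted <- FALSE
    }
}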
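The loop above only inspects the tokens and does not store anything yet. As a sketch of how the positions could be collected for later steps, the hypothetical function `find_announcements()` below roughly mirrors the loop's logic and returns the indices of the matching last-name tokens. The function name, the `max_dist` parameter and the sample tokens are assumptions made for illustration, not part of the existing code.

```r
# Hypothetical helper: indices of capitalised tokens ending in ':' that
# appear at most `max_dist` tokens after a token containing "präsident"
find_announcements <- function(tokens, max_dist = 4) {
    president_pos <- which(grepl("präsident", tolower(tokens), fixed = TRUE))
    is_name <- grepl("^[[:upper:]]", tokens) & grepl(":$", tokens)
    
    hits <- integer(0)
    for (p in president_pos) {
        # search window: the "präsident" token itself plus the tokens up to the threshold
        window <- p:min(p + max_dist + 1, length(tokens))
        found <- window[is_name[window]]
        if (length(found) > 0) {
            hits <- c(hits, found[1])   # keep only the first match per window
        }
    }
    unique(hits)
}

# Sample tokens; the last one is a colon inside a speech, far from any "präsident"
tokens <- c("veröffentlicht.", "Präsident", "Dr.", "Wolfgang", "Schäuble:",
            "Guten", "Morgen,", "liebe", "Kolleginnen", "und", "Kollegen!", "Folgendes:")
print(find_announcements(tokens))
```

Unlike the loop, this version does not restart the window when a second `präsident` token appears inside it, so results may differ slightly in such cases.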

Now, we try to find the *actual* beginning of the speeches by defining a vector with likely buzzwords, such as party affiliation (e.g. SPD) or political office (e.g. Bundeskanzlerin).

In [66]:
#TODO
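Until this step is implemented, the following is a possible starting point rather than the final approach. It assumes that a speaker line ends in a colon and contains one of the buzzwords. The buzzword vector is deliberately incomplete, and the candidate lines are made up for illustration.

```r
# Illustrative, incomplete buzzword vector (parties and offices)
buzzwords <- c("SPD", "AfD", "Bundeskanzlerin")

# Combine the buzzwords into one alternation pattern: "SPD|AfD|Bundeskanzlerin"
buzzword_pattern <- paste(buzzwords, collapse = "|")

# Made-up candidate lines: two speaker lines and one colon inside a speech
candidates <- c("Alice Weidel (AfD):",
                "Ich sage Folgendes:",
                "Erika Mustermann, Bundeskanzlerin:")

# A line counts as a speaker line if it contains a buzzword and ends in ':'
is_speaker_line <- grepl(buzzword_pattern, candidates) & grepl(":$", candidates)
print(candidates[is_speaker_line])
```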

Extract interjections `(...)`. 

After this step, we have our variable `plenar_test` without interjections and a second variable named `interjections` with only the interjections.

In [None]:
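# matches a '(' ... ')' group that is followed by a non-word character or the end of the text
parenthesis_pattern <- "(?=\\().*?(?<=\\))(?:[^-a-z0-9A-Z_]|$)"

# all interjections, i.e. the matched parts
interjections <- regmatches(
    plenar_test, 
    gregexpr(parenthesis_pattern, plenar_test, perl = TRUE),
    invert = FALSE
)[[1]]

# the remaining text, i.e. everything between the matches
plenar_test <- regmatches(
    plenar_test, 
    gregexpr(parenthesis_pattern, plenar_test, perl = TRUE), 
    invert = TRUE
)[[1]]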

In [None]:
plenar_test <- paste(plenar_test, collapse = '')
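To see what the pattern does, here is a small sketch on a made-up sentence. The speaker name, the party label `ABC` and the text are hypothetical. It also shows a side effect worth knowing: a party affiliation in parentheses after a speaker's name matches the pattern as well, together with the colon that follows it.

```r
parenthesis_pattern <- "(?=\\().*?(?<=\\))(?:[^-a-z0-9A-Z_]|$)"

# Made-up text: a speaker line with party tag, followed by an interjection
sample_text <- "Erika Mustermann (ABC): Das ist wichtig. (Beifall bei der SPD) Wir machen weiter."

matches <- gregexpr(parenthesis_pattern, sample_text, perl = TRUE)

# Matched parts, including the character that follows the closing parenthesis
sample_interjections <- regmatches(sample_text, matches)[[1]]

# Everything between the matches, glued back together
sample_clean <- paste(regmatches(sample_text, matches, invert = TRUE)[[1]], collapse = "")

print(sample_interjections)
print(sample_clean)
```

If the party tags of regular speakers are needed later (for the affiliation table), they would have to be read out before this step or excluded from the pattern.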

The following loop (currently commented out) filters the interjections and prints those that contain a colon, i.e. interjections with dialogue elements. (This might be useful for later analysis.)

In [None]:
# for(interjection in interjections){
#     if(grepl(':', interjection, fixed = TRUE)){
#         print(interjection)
#     } 
# }
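The same filter can also be written without a loop, since `grepl()` is vectorised. The sketch below uses a made-up `interjections` vector with a hypothetical speaker name and party label, and stores the result instead of printing it.

```r
# Made-up interjections; only the second one contains a colon
interjections <- c("(Beifall bei der SPD)",
                   "(Erika Mustermann [ABC]: Zwischenruf!)",
                   "(Heiterkeit)")

# Keep only interjections that contain a colon, i.e. dialogue elements
dialogue_interjections <- interjections[grepl(":", interjections, fixed = TRUE)]
print(dialogue_interjections)
```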