# VACC

This notebook runs all preprocessing steps for creating the final VACC corpus, as well as the persistence tagging algorithm for the qualitative analysis.

As mentioned, the actual data used in the doctoral thesis needs to be requested from Ingo Siegert.

However, a dummy dataset is provided so that the code can be executed. This fictitious dataset has exactly the same structure as the actual data, but its content is highly repetitive: all four interactions of all four dummy participants contain the same turns.

Once the actual data is available, replace the dummy dataset with it while keeping the folder structure intact.

Refer to the relevant chapters in the doctoral thesis for further explanation of the steps below. 

## Preparations

In [None]:
#importing relevant modules
import pandas as pd, json, sys, os, warnings, re
from tqdm import tqdm
warnings.simplefilter(action='ignore', category=FutureWarning)

#informing Python about a custom code directory and importing some of the modules from there
sys.path.append("../Code/")
import preprocessing, persistence, visualisation

In [None]:
#defining corpus name 
#(some modules contain corpus-specific code, so a variable with the corpus name is needed to ensure the right code is executed)
which_corpus = "VACC"

## Data Preprocessing

### Creating one csv file

The following code creates *one* csv file containing all turns from all interactions.

In [None]:
#paths to folders with transcripts and speaker list (who made which utterance)
root_transcripts = "1_Corpus/Original_files/Fertige_Transkripte/"
root_speakers = "1_Corpus/Original_files/Fertige_Utterances/"
output_destination = f"1_Corpus/Corpus_{which_corpus}.csv"

preprocessing.file_creator_vacc(root_transcripts, root_speakers, output_destination)

### Merging same-speaker turns

The following code merges consecutive same-speaker turns.

In [None]:
preprocessing.turn_merger(f"1_Corpus/Corpus_{which_corpus}.csv", 
                          f"1_Corpus/Corpus_merged_turns_{which_corpus}.csv")

Note that, as explained in the thesis, manual unmerging was performed in certain cases. Hence, when this notebook is run with the actual data, the result will differ from the data used in the study, because this manual step and the manual lemma correction step mentioned below cannot be replicated here.
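
As a rough illustration of what merging consecutive same-speaker turns amounts to, here is a minimal sketch. It is not the code in `preprocessing.turn_merger`: the `(speaker, text)` tuple layout, the function name `merge_consecutive_turns` and the example turns are assumptions made for this example, and the sketch ignores interaction boundaries, which a real corpus would need to respect.

```python
from itertools import groupby


def merge_consecutive_turns(turns):
    """Merge adjacent turns by the same speaker into one turn.

    `turns` is a list of (speaker, text) tuples in chronological order
    (a hypothetical layout, not the corpus CSV format).
    """
    merged = []
    # groupby only groups *adjacent* items with the same key,
    # which is exactly the "consecutive same-speaker" condition.
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        merged.append((speaker, " ".join(text for _, text in group)))
    return merged


turns = [
    ("S", "Alexa"),
    ("S", "wie wird das Wetter"),
    ("A", "heute ist es sonnig"),
    ("S", "danke"),
]
merged_turns = merge_consecutive_turns(turns)
```

Here the first two "S" turns end up as a single turn, while the final "S" turn stays separate because an "A" turn intervenes.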

### Tokenising and lemmatising data 

Note that, as explained in the thesis, POS-tagging was initially also performed. As its output was not used, this step is disregarded here.

Lemmatisation was performed with both TreeTagger and RNNTagger. As explained in the thesis, the RNNTagger's output proved to be the more reliable of the two. To execute the following steps, you need to download the [RNNTagger](https://www.cis.lmu.de/~schmid/tools/RNNTagger/) and follow the installation steps indicated there.

The RNNTagger expects a simple file with one token per row. To make it possible to remap tagged tokens to their corresponding metadata (which turn they belong to, which speaker produced them, etc.), the tokenisation step below additionally returns a list (`tokens_for_remapping`) that contains the same tokens plus a turn boundary marker. Once the tokens have been tagged, this marker enables remapping them to their respective turn.
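
The idea of the boundary marker can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in the `preprocessing` module: the marker string `<TURN_BOUNDARY>`, the whitespace tokenisation, the helper names and the hand-written lemma list standing in for the tagger output are all hypothetical.

```python
BOUNDARY = "<TURN_BOUNDARY>"  # hypothetical marker, not necessarily the one used in the module


def tokenise_turns(turns):
    """Return (tokens_for_tagger, tokens_for_remapping).

    The first list holds one token per entry (what a tagger would read);
    the second holds the same tokens plus a boundary marker after each turn.
    """
    tokens_for_tagger, tokens_for_remapping = [], []
    for turn in turns:
        tokens = turn.split()  # naive whitespace tokenisation, for illustration only
        tokens_for_tagger.extend(tokens)
        tokens_for_remapping.extend(tokens + [BOUNDARY])
    return tokens_for_tagger, tokens_for_remapping


def remap_lemmas(tagged_lemmas, tokens_for_remapping):
    """Group the tagger's lemmas (one per token, in order) back into turns."""
    lemmas = iter(tagged_lemmas)
    turns, current = [], []
    for token in tokens_for_remapping:
        if token == BOUNDARY:
            # a marker closes the current turn
            turns.append(current)
            current = []
        else:
            # every non-marker token consumes exactly one tagged lemma
            current.append(next(lemmas))
    return turns


turns = ["wie wird das Wetter", "es wird sonnig"]
for_tagger, for_remapping = tokenise_turns(turns)
# Stand-in for the tagger output: one lemma per token, in the same order.
fake_lemmas = ["wie", "werden", "die", "Wetter", "es", "werden", "sonnig"]
lemmas_per_turn = remap_lemmas(fake_lemmas, for_remapping)
```

Because the tagger input contains no markers, the tagged output stays aligned one-to-one with the tokens, and the markers in the second list are enough to recover the turn boundaries.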

In [None]:
#tokenising using custom code for later token remapping
tokens_for_remapping = preprocessing.tokenise(f"1_Corpus/Corpus_merged_turns_{which_corpus}.csv", 
                                              "2_Preprocessed/Files/txt_file_for_tagger.txt")

Next, run the following lines in your command line inside the conda environment. Replace "rnntagger_path" with the absolute path to your RNNTagger directory, "txt_file_for_tagger.txt" with the absolute path to that file (created in the cell above), and "output" with the absolute path to a new file called "VACC/2_Preprocessed/Files/RNN_tagged.txt" (i.e., complete that path according to where you stored the repository on your computer). Execute the middle line only if permission errors occur.

In [None]:
#remapping tagged tokens to their respective turn
preprocessing.remap(f"1_Corpus/Corpus_merged_turns_{which_corpus}.csv", #corpus
                    f"2_Preprocessed/Files/RNN_tagged.txt", #tagger output
                    tokens_for_remapping, #needed for remapping tagger output to corpus
                    f"2_Preprocessed/RNN_{which_corpus}_unigrams.csv", #destination of remapped corpus
                    which_corpus)

Note that, as explained in the thesis, manual lemma correction was performed at this point.

### Creating lemma bi-, tri-, and quadrigrams

The following code creates lemma n-grams.

In [None]:
preprocessing.ngrammer(f"2_Preprocessed/RNN_{which_corpus}_unigrams.csv", which_corpus)
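
For readers unfamiliar with n-grams, the following minimal sketch shows how lemma bi-, tri-, and quadrigrams can be derived from a lemma sequence with a sliding window. It reuses the same slicing expression that this notebook applies to the instruction lemmata further below; whether `preprocessing.ngrammer` works identically (for instance, how it treats turn boundaries) is an assumption, and the helper name `lemma_ngrams` is hypothetical.

```python
def lemma_ngrams(lemmas, n):
    """Return all contiguous n-grams of a lemma list as space-joined strings."""
    # A window of size n can start at positions 0 .. len(lemmas) - n.
    return [" ".join(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)]


lemmas = ["wie", "werden", "die", "Wetter"]
bigrams = lemma_ngrams(lemmas, 2)
trigrams = lemma_ngrams(lemmas, 3)
quadrigrams = lemma_ngrams(lemmas, 4)
```

A sequence shorter than `n` simply yields no n-grams, since the range is then empty.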

Preprocessing of the corpus is now done. Note again that even if you run the code with the actual data, the result will differ to some extent from the data used in the thesis because of the non-replicable manual steps (see above).

For the quantitative analysis see separate notebooks in the directory "Quantitative_Analysis". This code is in separate notebooks as some quantitative analyses extend beyond one single corpus and are thus organised with regard to alternation sets. Also, a different programming language (R) is used.

## Persistence Tagging for the Qualitative Analysis

As mentioned in the thesis, the persistence tagging algorithm can be used to tag cases of persistence on multiple levels such as lemmata, POS-tags etc. However, the qualitative analysis in the thesis relied solely on lemma-based tagging. This is defined in the following cell, along with stop lemmata and lemmata from the instructions, both of which will be excluded from tagging. For simplicity, the actual instructions are excluded even when using dummy data.

First, "real" cases of persistence from the voice assistant to the human speaker are tagged; the qualitative analysis relied on these. Subsequently, instances of quasi-persistence from the human speaker to the voice assistant are tagged; these were used as a variable in the model for the DEZEMBER alternation. For the latter, the default values for "speaker_A" and "speaker_B" (see module code) are overwritten.
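
To give an intuition for the two tagging directions, here is a deliberately simplified sketch. It is not the algorithm in the `persistence` module (see the thesis and the module code for that). It assumes that a lemma counts as persistent when it recurs in the turn directly following a turn by the other speaker, and that excluded lemmata are simply ignored as sources. The function name `tag_persistence`, the data layout, the default speaker labels and the example turns are all hypothetical.

```python
def tag_persistence(turns, speaker_a="A", speaker_b="S", excluded=frozenset()):
    """List, for each speaker_b turn directly following a speaker_a turn,
    the lemmas repeated from that preceding turn.

    `turns` is a list of (speaker, lemmas) tuples in chronological order.
    """
    tagged = []
    for (prev_speaker, prev_lemmas), (cur_speaker, cur_lemmas) in zip(turns, turns[1:]):
        if prev_speaker == speaker_a and cur_speaker == speaker_b:
            # lemmas of the preceding turn that may act as a source of persistence
            primes = set(prev_lemmas) - set(excluded)
            tagged.append([lemma for lemma in cur_lemmas if lemma in primes])
    return tagged


turns = [
    ("S", ["wie", "werden", "Wetter", "morgen"]),
    ("A", ["morgen", "sein", "es", "sonnig"]),
    ("S", ["sein", "es", "auch", "sonnig", "in", "Berlin"]),
]
excluded_lemmas = {"sein", "es", "in"}  # a few stop lemmata, for illustration

# voice assistant -> human speaker
va_to_human = tag_persistence(turns, excluded=excluded_lemmas)
# human speaker -> voice assistant: same function, speaker roles swapped
human_to_va = tag_persistence(turns, speaker_a="S", speaker_b="A", excluded=excluded_lemmas)
```

Swapping the two speaker arguments is all that is needed to reverse the direction in this sketch, which parallels how the cells below overwrite "speaker_A" and "speaker_B".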

### Preparations

In [None]:
levels = ["lemma"] #defining level to tag on 

stoplemmas = ['an', 'der', 'ein', 'es', 'für', 'haben', 'ich', 'in', 'mit', 
              'nicht', 'oder', 'sein', 'um', 'und', 'von', 'werden', 'zu'] #defining stopwords to exclude from tagging

#reading and defining instructions for Quiz interactions with and without confederate as well as for Calendar interactions
with open("Instructions/Lemmata_in_instructions_with_confederate.txt") as f, open("Instructions/Lemmata_in_instructions_without_confederate.txt") as g, open("Instructions/Lemmata_in_visual_schedule.txt") as h:
    instructions_with_jannik, instructions_without_jannik, schedule = f.read().split(), g.read().split(), h.read().split()

instructions = [instructions_without_jannik, instructions_with_jannik, schedule] #creating list of instructions to pass below

### Tagging

#### From voice assistant to human speaker 

##### Unigrams

In [None]:
#reading unigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_unigrams.csv", sep=",", index_col=0, na_filter=False)

#passing unigram corpus to persistence tagger while specifying levels, output destination, instructions and stoplemmas
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Persistence_{which_corpus}_unigrams.csv", instructions, stoplemmas)

##### Bigrams

In [None]:
#creating bigrams from instructions (disregarding question boundaries for simplicity's sake; 'schedule' contains non-ordered lemmata, thus no n-grams are created)
instructions_bigrams = [[[" ".join(instructions_without_jannik[i:i+2]) for i in range(len(instructions_without_jannik)-2+1)]],
                       [[" ".join(instructions_with_jannik[i:i+2]) for i in range(len(instructions_with_jannik)-2+1)]], schedule]

#reading bigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_bigrams.csv", sep=",", index_col=0, na_filter=False)

#passing bigram corpus to persistence tagger while specifying levels, output destination and instructions 
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Persistence_{which_corpus}_bigrams.csv", instructions)

##### Trigrams

In [None]:
#creating trigrams from instructions (disregarding question boundaries for simplicity's sake; 'schedule' contains non-ordered lemmata, thus no n-grams are created)
instructions_trigrams = [[[" ".join(instructions_without_jannik[i:i+3]) for i in range(len(instructions_without_jannik)-3+1)]],
                        [[" ".join(instructions_with_jannik[i:i+3]) for i in range(len(instructions_with_jannik)-3+1)]], schedule]

#reading trigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_trigrams.csv", sep=",", index_col=0, na_filter=False)

#passing trigram corpus to persistence tagger while specifying levels, output destination and instructions 
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Persistence_{which_corpus}_trigrams.csv", instructions)

##### Quadrigrams

In [None]:
#creating quadrigrams from instructions (disregarding question boundaries for simplicity's sake; 'schedule' contains non-ordered lemmata, thus no n-grams are created)
instructions_quadrigrams = [[[" ".join(instructions_without_jannik[i:i+4]) for i in range(len(instructions_without_jannik)-4+1)]],
                           [[" ".join(instructions_with_jannik[i:i+4]) for i in range(len(instructions_with_jannik)-4+1)]], schedule]

#reading quadrigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_quadrigrams.csv", sep=",", index_col=0, na_filter=False)

#passing quadrigram corpus to persistence tagger while specifying levels, output destination and instructions 
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Persistence_{which_corpus}_quadrigrams.csv", instructions)

For simplicity, the following code combines all n-gram levels into one DataFrame.

In [None]:
persistence.combiner("3_Persistence_tagged/single_ngrams", f"3_Persistence_tagged/Persistence_{which_corpus}_all.csv", which_corpus)

#### From human speaker to voice assistant (quasi-persistence)

##### Unigrams

In [None]:
#reading unigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_unigrams.csv", sep=",", index_col=0, na_filter=False)

#passing unigram corpus to persistence tagger while specifying levels, output destination, instructions and stoplemmas,
#and, crucially, switching direction (now from human speaker to voice assistant) by overwriting default values
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Quasi_persistence_{which_corpus}_unigrams.csv", instructions, stoplemmas,
                  speaker_A="S", speaker_B="A")

##### Bigrams

In [None]:
#creating bigrams from instructions (disregarding question boundaries for simplicity's sake; 'schedule' contains non-ordered lemmata, thus no n-grams are created)
instructions_bigrams = [[[" ".join(instructions_without_jannik[i:i+2]) for i in range(len(instructions_without_jannik)-2+1)]],
                       [[" ".join(instructions_with_jannik[i:i+2]) for i in range(len(instructions_with_jannik)-2+1)]], schedule]

#reading bigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_bigrams.csv", sep=",", index_col=0, na_filter=False)

#passing bigram corpus to persistence tagger while specifying levels, output destination and instructions,
#and, crucially, switching direction (now from human speaker to voice assistant) by overwriting default values
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Quasi_persistence_{which_corpus}_bigrams.csv", instructions,
                  speaker_A="S", speaker_B="A")

##### Trigrams

In [None]:
#creating trigrams from instructions (disregarding question boundaries for simplicity's sake; 'schedule' contains non-ordered lemmata, thus no n-grams are created)
instructions_trigrams = [[[" ".join(instructions_without_jannik[i:i+3]) for i in range(len(instructions_without_jannik)-3+1)]],
                        [[" ".join(instructions_with_jannik[i:i+3]) for i in range(len(instructions_with_jannik)-3+1)]], schedule]

#reading trigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_trigrams.csv", sep=",", index_col=0, na_filter=False)

#passing trigram corpus to persistence tagger while specifying levels, output destination and instructions,
#and, crucially, switching direction (now from human speaker to voice assistant) by overwriting default values
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Quasi_persistence_{which_corpus}_trigrams.csv", instructions,
                  speaker_A="S", speaker_B="A")

##### Quadrigrams

In [None]:
#creating quadrigrams from instructions (disregarding question boundaries for simplicity's sake; 'schedule' contains non-ordered lemmata, thus no n-grams are created)
instructions_quadrigrams = [[[" ".join(instructions_without_jannik[i:i+4]) for i in range(len(instructions_without_jannik)-4+1)]],
                           [[" ".join(instructions_with_jannik[i:i+4]) for i in range(len(instructions_with_jannik)-4+1)]], schedule]

#reading quadrigram corpus
corpus = pd.read_csv(f"2_Preprocessed/RNN_{which_corpus}_quadrigrams.csv", sep=",", index_col=0, na_filter=False)

#passing quadrigram corpus to persistence tagger while specifying levels, output destination and instructions,
#and, crucially, switching direction (now from human speaker to voice assistant) by overwriting default values
persistence.tagger(corpus, which_corpus, levels, f"3_Persistence_tagged/single_ngrams/Quasi_persistence_{which_corpus}_quadrigrams.csv", instructions,
                  speaker_A="S", speaker_B="A")

For simplicity, the following code combines all n-gram levels into one DataFrame.

In [None]:
persistence.combiner("3_Persistence_tagged/single_ngrams", f"3_Persistence_tagged/Quasi_persistence_{which_corpus}_all.csv", which_corpus)

### Visualisation

The following code visualises tagged cases of persistence on all n-gram levels in HTML files.

In [None]:
visualisation.lemma(which_corpus, "3_Persistence_tagged", "3_Persistence_tagged/visualisation")

### Inspecting frequent cases of persistence

The following code outputs frequent cases of persistence from the voice assistant to the human speaker for each n-gram level. 

In [None]:
visualisation.inspect(levels = ["lemma"], #further levels such as POS-tags could be supplied if tagging was performed on that level
                      ngrams = ["unigrams", "bigrams", "trigrams", "quadrigrams"], 
                      threshold = 0, 
                      which_corpus =  which_corpus, 
                      path = "3_Persistence_tagged/single_ngrams")
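
As a minimal sketch of what inspecting frequent cases with a threshold might look like, the example below counts persistent items and keeps those above a given frequency. It is not the implementation of `visualisation.inspect`: the function name `frequent_cases`, the example items and the interpretation of `threshold` as "count must exceed this value" (so that 0 keeps everything) are assumptions.

```python
from collections import Counter


def frequent_cases(persistent_items, threshold=0):
    """Return (item, count) pairs whose count exceeds `threshold`, most frequent first."""
    counts = Counter(persistent_items)
    return [(item, count) for item, count in counts.most_common() if count > threshold]


# hypothetical persistent lemma unigrams collected from a tagged corpus
items = ["Wetter", "sonnig", "Wetter", "Termin", "Wetter", "sonnig"]
all_cases = frequent_cases(items)
repeated_cases = frequent_cases(items, threshold=1)
```

Raising the threshold is a quick way to hide one-off repetitions and focus on recurring cases.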