# VACC

This notebook runs all preprocessing steps for creating the final VACC corpus, as well as the persistence tagging algorithm for the qualitative analysis.

A dummy dataset is provided so that the code can be executed. Replace it with the actual data to replicate the corpus used in the doctoral thesis.

## Preparations

In [18]:
#import relevant modules

import pandas as pd, json, sys, os, warnings, re
from importlib import reload
from tqdm import tqdm
warnings.simplefilter(action='ignore', category=FutureWarning)

sys.path.append("../Code/")

if "run_before" not in locals():
    print("Importing modules for the first time.")
    import preprocessing, persistence, combination, visualisation
    run_before = "yes"
else:
    print("Reloading modules.")
    reload(preprocessing)
    reload(persistence)
    reload(combination)
    reload(visualisation)

Reloading modules.


In [3]:
#define corpus name
#(some modules contain corpus-specific code, so a variable with the corpus name is needed
#to ensure that the right code is executed)
which_corpus = "VACC"

## Data Preprocessing

### Creating one csv file

In [4]:
#paths to folders with transcripts and speaker list (who made which utterance)
root_transcripts = "1_Corpus/Original_files/Fertige_Transkripte/"
root_speakers = "1_Corpus/Original_files/Fertige_Utterances/"
output_destination = f"1_Corpus/Corpus_{which_corpus}.csv"

preprocessing.file_creator_vacc(root_transcripts, root_speakers, output_destination)

### Merging same-speaker turns

In [5]:
file = f"1_Corpus/Corpus_{which_corpus}.csv"
output_destination = f"1_Corpus/Corpus_merged_turns_{which_corpus}.csv"
    
preprocessing.turn_merger(file, output_destination)
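To illustrate what merging same-speaker turns means, here is a minimal sketch on toy data. It is not the implementation of `preprocessing.turn_merger`; the function name `merge_same_speaker_turns`, the `(speaker, utterance)` tuple layout, and the example utterances are assumptions made purely for illustration.

```python
from itertools import groupby

# Hypothetical toy data: (speaker, utterance) pairs in conversational order.
turns = [
    ("A", "Alexa"),
    ("A", "wie wird das Wetter"),
    ("B", "Heute scheint die Sonne"),
    ("A", "danke"),
]

def merge_same_speaker_turns(turns):
    """Join consecutive utterances by the same speaker into one turn."""
    merged = []
    # groupby only groups adjacent items, so a later turn by the same
    # speaker (after someone else has spoken) stays a separate turn.
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        text = " ".join(utterance for _, utterance in group)
        merged.append((speaker, text))
    return merged

merged_turns = merge_same_speaker_turns(turns)
```

In this sketch the first two utterances by speaker A are joined into one turn, while A's later "danke" remains a turn of its own because speaker B intervenes.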

Note that, as explained in the thesis, manual unmerging was performed in certain cases. Hence, even when this notebook is run with the actual data, the result will differ slightly from the data used in the study, because neither this manual step nor the manual lemma correction step mentioned below can be reproduced here.

### Tokenising and lemmatising data 

Note that, as explained in the thesis, POS-tagging was initially also performed, but as its output was not used, this step is omitted here.

This step was performed with both TreeTagger and RNNTagger. As explained in the thesis, the RNNTagger's output proved to be the more reliable of the two.

To execute the following steps you need to download the [RNNTagger](https://www.cis.lmu.de/~schmid/tools/RNNTagger/) and follow the installation steps indicated there. 

To be able to remap tagged tokens to the corresponding metadata (which turn they belong to, which speaker produced them, etc.), tokenisation is performed using custom code. The following function creates both a tokenised file for tagging and `tokens_for_remapping`, which contains the same tokens plus an additional turn boundary marker. This marker makes it possible to remap the tokens to their respective turns once they have been tagged.

In [7]:
tokens_for_remapping = preprocessing.tokenise(f"1_Corpus/Corpus_merged_turns_{which_corpus}.csv", 
                                    "2_Preprocessed/Files/txt_file_for_tagger.txt")

Next, run the following command in your command line, replacing "rnntagger_path" with the absolute path to your RNNTagger directory, "txt_file_for_tagger.txt" with the absolute path to that file (created in the cell above), and "output" with the absolute path to a new file called "VACC/2_Preprocessed/Files/RNN_tagged.txt" (i.e., complete that path according to where you stored the repository on your computer).

In [12]:
preprocessing.remap(f"1_Corpus/Corpus_merged_turns_{which_corpus}.csv", #corpus
                    "2_Preprocessed/Files/RNN_tagged.txt", #tagger output
                    tokens_for_remapping, #needed for remapping tagger output to corpus
                    f"2_Preprocessed/RNN_{which_corpus}_unigrams.csv", #destination of remapped corpus
                    which_corpus)
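The idea behind the turn boundary marker can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `preprocessing.tokenise` or `preprocessing.remap`: the marker string `TURN_BOUNDARY`, the regular expression used for tokenising, the helper names, and the fake "tagger output" (lower-cased tokens standing in for lemmas) are all hypothetical.

```python
import re

# Hypothetical marker; the one used by the actual preprocessing module may differ.
TURN_BOUNDARY = "<TURN_BOUNDARY>"

def tokenise_with_boundaries(turns):
    """Return a flat token list with a boundary marker after each turn."""
    tokens = []
    for turn in turns:
        # Words and single punctuation marks become separate tokens.
        tokens.extend(re.findall(r"\w+|[^\w\s]", turn))
        tokens.append(TURN_BOUNDARY)
    return tokens

def remap_lemmas(tokens_for_remapping, tagged):
    """Group (token, lemma) pairs back into turns using the boundary markers."""
    tagged_iter = iter(tagged)
    turns, current = [], []
    for token in tokens_for_remapping:
        if token == TURN_BOUNDARY:
            # A marker closes the current turn; it has no tagger counterpart.
            turns.append(current)
            current = []
            continue
        tagged_token, lemma = next(tagged_iter)
        if tagged_token != token:
            raise ValueError(f"Token mismatch: {token!r} vs {tagged_token!r}")
        current.append(lemma)
    return turns

turns = ["Wie wird das Wetter?", "Heute scheint die Sonne."]
tokens_for_remapping = tokenise_with_boundaries(turns)
# Stand-in for tagger output: one (token, lemma) pair per non-marker token.
# The "lemma" is just the lower-cased token, purely for illustration.
tagged = [(t, t.lower()) for t in tokens_for_remapping if t != TURN_BOUNDARY]
lemmas_per_turn = remap_lemmas(tokens_for_remapping, tagged)
```

Because the tagger only sees the plain tokens, the marker list is what carries the turn structure; the mismatch check guards against the tagger having split or merged tokens differently.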

Note that, as explained in the thesis, manual lemma correction was performed at this point. 

### Creating lemma bi-, tri-, and quadrigrams

In [20]:
preprocessing.ngrammer(f"2_Preprocessed/RNN_{which_corpus}_unigrams.csv", which_corpus)

bigrams
trigrams
quadrigrams
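As a rough illustration of what lemma bi-, tri-, and quadrigrams are, the sketch below builds contiguous n-grams from a single lemma sequence. It is not the implementation of `preprocessing.ngrammer`; the helper `lemma_ngrams`, the example lemmas, and the assumption that n-grams are built over one sequence at a time are hypothetical.

```python
def lemma_ngrams(lemmas, n):
    """Return all contiguous n-grams of a lemma sequence as tuples."""
    # zip over n shifted copies of the sequence; it stops at the shortest one,
    # so sequences shorter than n simply yield no n-grams.
    return list(zip(*(lemmas[i:] for i in range(n))))

# Hypothetical lemmatised turn.
lemmas = ["wie", "werden", "das", "Wetter", "heute"]

ngrams = {
    "bigrams": lemma_ngrams(lemmas, 2),
    "trigrams": lemma_ngrams(lemmas, 3),
    "quadrigrams": lemma_ngrams(lemmas, 4),
}
```

A sequence of five lemmas yields four bigrams, three trigrams, and two quadrigrams.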


Preprocessing of the corpus is now done. Note again that even if you run the code with the actual data, the result will differ slightly from the data used in the thesis due to non-replicable manual steps (see above).

For the quantitative analysis, see the separate notebooks in the directory "Quantitative_Analysis". That code is kept in separate notebooks because some quantitative analyses extend beyond a single corpus and are therefore organised by alternation set. It is also written in a different programming language (R).

## Persistence Tagging for Qualitative Analysis

In [None]:
# tagging (only for allo-persistence)
# visualising (change module name)
# inspect frequent persistent n-grams