# DEZEMBER

In this notebook, the preprocessed VACC corpus is annotated for the DEZEMBER alternation, i.e., all alternating instances of "dezember" and "zwölf" are tagged in the data. Subsequently, a dataset is prepared for modelling, i.e., all relevant variables (e.g., which variant was used in the previous slot?) from the tagged choice contexts are extracted or calculated. Further, the annotated instances are cross-tabulated and a switch rate plot is generated for the Descriptive Statistics part in the thesis.

The actual data used in the doctoral thesis must be requested from Ingo Siegert and then preprocessed using the "VACC" notebook in the corresponding folder.

However, the dummy dataset provided for the "VACC" notebook generates a dataset that can also be used to execute this very notebook. 

Refer to the relevant chapter in the doctoral thesis for further explanation of the steps below. 

In [None]:
#importing relevant modules
import pandas as pd, sys, os, shutil, warnings
warnings.filterwarnings('ignore') 

#informing Python about a custom code directory and importing some of the modules from there
sys.path.append("../../Code/")
import annotation, quantification, persistence

In [None]:
#defining name of the alternation set and establishing its variants
alternating = "DEZEMBER"
alternation_set = ["dezember", "zwölf"]

## Annotation

Annotation is typically done in multiple sessions. After each annotated instance, you may therefore decide to end the session, in which case everything tagged so far is kept. When starting the next session, you will only be assigned the remaining instances, i.e., those cases that have not yet been annotated.

### Preparations

The first step presupposes that the "VACC" notebook was executed, either using the actual data or the dummy dataset.

In [None]:
#copying the preprocessed corpus file into the folder "Quantitative_analysis/Annotated_datasets"
#a separate copy for annotation is deemed safer than modifying the persistence-tagged corpus 
source_file = "../../VACC/3_Persistence_tagged/Persistence_VACC_all.csv"
destination_directory = "../Annotated_datasets/"

destination_file = os.path.join(destination_directory, "VACC.csv")

if not os.path.exists(destination_file):
    shutil.copy2(source_file, destination_file)
    print("File copied.")
else:
    print("File already exists.")
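
The copy step above can be wrapped in a small helper. The following is only a minimal sketch: the function name `copy_if_absent` and the temporary files are hypothetical and not part of the project code. Unlike the cell above, it also creates the destination folder if it is missing, which is an assumption about what might be convenient rather than something the notebook requires.

```python
import os
import shutil
import tempfile

def copy_if_absent(source_file, destination_file):
    """Copy source_file to destination_file unless the target already exists.

    Returns True if a copy was made, False otherwise.
    """
    if os.path.exists(destination_file):
        return False
    # create the target folder first, in case it does not exist yet
    folder = os.path.dirname(destination_file)
    if folder:
        os.makedirs(folder, exist_ok=True)
    shutil.copy2(source_file, destination_file)
    return True

# demonstration in a temporary directory, so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.csv")
    target = os.path.join(tmp, "Annotated_datasets", "VACC.csv")
    with open(source, "w", encoding="utf-8") as handle:
        handle.write("lemma\ndezember\n")
    first_call = copy_if_absent(source, target)   # copies the file
    second_call = copy_if_absent(source, target)  # leaves the existing copy alone
```

Because an existing target is never overwritten, rerunning the cell cannot discard annotation decisions that were already saved.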

In [None]:
#reading in the (copied) corpus file
df = pd.read_csv("../Annotated_datasets/VACC.csv", sep=",", index_col=0)

#lower-casing the lemma column, as the code below only checks for lower-case variants
df.lemma = df.lemma.str.lower()

#creating a column for saving annotation decisions, if it does not already exist
if alternating not in df.columns:
    df[alternating] = pd.NA

#informing about how many alternating instances have already been tagged
print(f"Cases annotated as alternating: {len(df[df[alternating]=='yes'])}")
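
To see how many candidates are still open before starting a session, the remaining cases can be counted directly. This is a sketch on an invented toy DataFrame; it assumes that untagged cases are stored as missing values and that decisions are recorded as "yes" or "no". How `annotation.alternation_check` selects the remaining cases internally is not shown here.

```python
import pandas as pd

alternation_set = ["dezember", "zwölf"]

# toy data (invented): None marks a token without an annotation decision
toy = pd.DataFrame({
    "lemma": ["dezember", "im", "zwölf", "zwölf", "uhr"],
    "DEZEMBER": ["yes", None, "no", None, None],
})

# candidates are tokens whose lemma belongs to the alternation set
is_candidate = toy["lemma"].isin(alternation_set)
# a candidate is still open if no decision has been stored for it
is_open = is_candidate & toy["DEZEMBER"].isna()

n_candidates = int(is_candidate.sum())
n_open = int(is_open.sum())
print(f"{n_open} of {n_candidates} candidates still need a decision")
```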

### Annotation Tool

In short, the tool below 
- informs you about the annotation scheme
- tells you how many untagged instances you have got left
- provides you with the next case to annotate, i.e., a potentially alternating lemma including its immediate context
- displays an input field for deciding the current case according to the scheme
- prompts you to confirm your decision and/or gives you the option to end the current session
- searches for identical contexts prompting you whether the decision should be applied there as well

Decisions are stored in `df_updated`. After each session, this DataFrame still needs to be saved externally (see the cell after the tool). To begin a new session, first read in the current version of "../Annotated_datasets/VACC.csv" under "Preparations" above.

In [None]:
#annotating
df_updated = annotation.alternation_check(df, alternation_set, alternating)

In [None]:
#saving the updated DataFrame externally, overwriting the empty or part-annotated file
df_updated.to_csv("../Annotated_datasets/VACC.csv")

In [None]:
#once all potentially alternating cases have been annotated, all other tokens of the DataFrame are additionally tagged as non-alternating
if df_updated.loc[df_updated.lemma.isin(alternation_set), alternating].notna().all():
    df_updated[alternating] = df_updated[alternating].fillna("no")
    df_updated.to_csv("../Annotated_datasets/VACC.csv")
    print("Annotation is completed.")
else:
    print("Annotation is not yet completed, rerun the tool above.")

### Overview

See how many times each speaker used one of the alternating variants.

In [None]:
#reading in the annotated corpus, filtering, grouping and counting values
df = pd.read_csv("../Annotated_datasets/VACC.csv", sep=",", index_col=0, na_filter=False)
df[df[alternating] == "yes"].groupby("speaker").lemma.value_counts()
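
If a wide table (one row per speaker, one column per variant) is easier to read, the counts can be unstacked. The sketch below uses invented toy data with the same column names as the corpus.

```python
import pandas as pd

# toy data (invented), mirroring the columns used in the cell above
toy = pd.DataFrame({
    "speaker": ["A", "A", "A", "B", "B"],
    "lemma": ["dezember", "zwölf", "dezember", "zwölf", "im"],
    "DEZEMBER": ["yes", "yes", "yes", "yes", "no"],
})

# keep only tokens annotated as alternating, then count variants per speaker
alternating_tokens = toy[toy["DEZEMBER"] == "yes"]
counts = (
    alternating_tokens.groupby("speaker")["lemma"]
    .value_counts()
    .unstack(fill_value=0)  # variants a speaker never used are shown as 0
)
print(counts)
```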

## Preparing DataFrame for Modelling

In [None]:
#reading in the annotated corpus now including a column indicating where there was an opportunity ("yes") to choose a variant from the alternation set or not ("no")
df = pd.read_csv("../Annotated_datasets/VACC.csv", index_col=0, na_filter=False, sep=",")

Below, the function `prepare_data_for_modeling` extracts or calculates all relevant variables for each choice context. This code only works properly once annotation has been completed. The line that saves the resulting DataFrame externally is commented out, as it would replace the file that was used for modelling in the thesis. That file is shared in this repository given its abstract nature.

In the model for this alternation set, a variable for quasi-persistence on the part of the voice assistant was included. Before running `quantification.prepare_data_for_modeling`, information on instances of quasi-persistence needs to be extracted from the relevant files and added to `df`.

In [None]:
#adding information on lexical quasi-persistence to df, by reading in the combined file...
df_quasi_p = pd.read_csv("../../VACC/3_Persistence_tagged/Quasi_persistence_VACC_all.csv", na_filter=False, sep=",")
#... and summarising all kinds of lexical quasi-persistence, i.e., writing True in new column, if any lexical SPP was produced by the voice assistant
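quasi_p_columns = ["persistence_unigrams_lemma", "persistence_bigrams_lemma", "persistence_trigrams_lemma", "persistence_quadrigrams_lemma"]
#column-wise str.startswith is used instead of applymap, which is deprecated in recent pandas versions
df["quasi_persistence"] = df_quasi_p[quasi_p_columns].apply(lambda col: col.astype(str).str.startswith("SPP")).any(axis=1)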
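
The summarising step can be illustrated on toy data. The values below are invented; only the "SPP" prefix is taken from the code above. The second half of the sketch shows a general pandas behaviour that is worth checking here: assigning a Series to a DataFrame column aligns on index labels, not on row position. Whether the indexes of `df` and `df_quasi_p` match in the actual files is not stated in this notebook, so treat this as something to verify rather than as a known problem.

```python
import pandas as pd

# toy stand-in for the quasi-persistence file (invented values)
toy_quasi_p = pd.DataFrame({
    "persistence_unigrams_lemma": ["SPP_1", "", "x"],
    "persistence_bigrams_lemma": ["", "", "SPP_2"],
})

# True wherever a cell starts with "SPP", then True per row if any column matches
flags = toy_quasi_p.apply(lambda col: col.astype(str).str.startswith("SPP"))
any_spp = flags.any(axis=1)

# toy corpus whose index labels differ from those of toy_quasi_p
corpus = pd.DataFrame({"lemma": ["a", "b", "c"]}, index=[10, 11, 12])

# label-based alignment: no labels match, so every value is missing
aligned = corpus.assign(quasi_persistence=any_spp)
# positional assignment: values are taken row by row, ignoring labels
positional = corpus.assign(quasi_persistence=any_spp.to_numpy())
```

A quick check such as `df.index.equals(df_quasi_p.index)` before the assignment shows which of the two situations applies.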

In [None]:
#creating variation_sample, i.e., only annotated choice contexts along with relevant variables
variation_sample = quantification.prepare_data_for_modeling(df, alternating, include_quasi_p=True, restrict="yes", beta_variants=["zwölf", "dezember"])

#saving externally (not done, as it would overwrite the actual data used in the thesis which is shared in this repository due to its abstract nature)
#variation_sample.to_csv(f"{alternating}_for_modelling.csv")
variation_sample

## Descriptive Statistics

### Cross-Tabulation

The table below shows how often each variant in PREVIOUS was followed by the same or the other variant in CURRENT, using the data annotated and prepared above. Again, this code and the code that follows only work if annotation has been completed.

In [None]:
contingency_table = pd.crosstab(variation_sample.PREVIOUS, variation_sample.CURRENT)
contingency_table["Total in PREVIOUS"] = contingency_table.sum(axis=1)
contingency_table.loc["Total in CURRENT"] = contingency_table.sum(axis=0)
contingency_table
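
As an alternative to adding the totals by hand, `pd.crosstab` can compute them via its `margins` parameter. The sketch below uses invented toy data. Deriving PREVIOUS by shifting CURRENT within each speaker is an assumption made for illustration; how `prepare_data_for_modeling` actually determines PREVIOUS is not shown in this notebook.

```python
import pandas as pd

# toy choice contexts (invented), in chronological order per speaker
toy = pd.DataFrame({
    "speaker": ["A", "A", "A", "B", "B"],
    "CURRENT": ["dezember", "dezember", "zwölf", "zwölf", "zwölf"],
})

# hypothetical derivation: the previous choice of the same speaker;
# the first choice of each speaker has no predecessor and is dropped
toy["PREVIOUS"] = toy.groupby("speaker")["CURRENT"].shift()
pairs = toy.dropna(subset=["PREVIOUS"])

# margins=True adds row and column totals in one step
table = pd.crosstab(pairs["PREVIOUS"], pairs["CURRENT"],
                    margins=True, margins_name="Total")
print(table)
```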

### Switch Rate Plot

Finally, a switch rate plot is generated. Note that for calculating variant shares, the entire corpus is also needed. Make sure to combine the entire corpus and the variation sample of either the dummy dataset *or* the actual data, ensuring you do not mix the two.

Because the dummy dataset is small, only very few dots will be plotted when it is used.

In [None]:
#reading in entire corpus (needed for calculating variant shares); the variation sample created above is reused
df = pd.read_csv("../Annotated_datasets/VACC.csv", na_filter=False, sep=",", index_col=0)

#defining output path for the plot
path = "switch_rate_plot_DEZEMBER.png"

#generating plot
quantification.plot_switch_rate_over_variant_proportions(df, variation_sample, alternation_set, alternating, save_to=path, DEZEMBER=True)
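
The two quantities behind such a plot can be sketched on toy data. This assumes one plausible reading of "switch rate", namely the share of choice contexts per speaker in which CURRENT differs from PREVIOUS; the exact definition used by `plot_switch_rate_over_variant_proportions` is not shown in this notebook. For brevity, the toy example computes variant shares from the sample itself, whereas the notebook uses the entire corpus for that purpose.

```python
import pandas as pd

# toy variation sample (invented)
toy_sample = pd.DataFrame({
    "speaker": ["A", "A", "A", "A", "B", "B"],
    "PREVIOUS": ["dezember", "dezember", "zwölf", "zwölf", "zwölf", "zwölf"],
    "CURRENT": ["dezember", "zwölf", "zwölf", "dezember", "zwölf", "zwölf"],
})

# a switch occurs when the current variant differs from the previous one
toy_sample["switched"] = toy_sample["PREVIOUS"] != toy_sample["CURRENT"]
# the mean of a boolean column is the proportion of True values
switch_rate = toy_sample.groupby("speaker")["switched"].mean()

# share of each variant per speaker (simplified: based on the sample only)
variant_share = (
    toy_sample.groupby("speaker")["CURRENT"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)
```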