# Non-Agentivity / VACW

In this notebook, the preprocessed VACW corpus is annotated for the non-agentivity alternation: all alternating instances of passive constructions with "werden" and of the impersonal pronoun "man" are tagged in the data. A dataset is then prepared for modelling by extracting or calculating all relevant variables from the tagged choice contexts (e.g., which variant was used in the previous slot). The resulting dataset is combined with the data from the other two corpora and prepared for modelling in "non-agentivity_all_corpora.ipynb".

As mentioned, the actual data used in the doctoral thesis needs to be requested from Ingo Siegert and subsequently preprocessed using the "VACW" notebook in the corresponding folder.

Refer to the relevant chapter in the doctoral thesis for further explanation of the steps below. 

In [None]:
#importing relevant modules
import pandas as pd, sys, os, shutil

#informing Python about a custom code directory and importing some of the modules from there
sys.path.append("../../../Code/")
import annotation, quantification, persistence

In [None]:
#defining name of the alternation set and establishing its variants
alternating = "Non-agentivity"
alternation_set = ["man", "werden"]

## Annotation

Annotation is typically done in multiple sessions. After each annotated instance, you may therefore decide to end the session, in which case everything tagged so far is saved. When you start the next session, you are only assigned the remaining instances, i.e., those cases that have not yet been annotated.
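
As a rough illustration of what "remaining instances" means, the following sketch selects candidate tokens that have no decision yet. It runs on a small toy DataFrame; the filtering logic is an assumption about how such a selection could be done and not necessarily how `annotation.alternation_check` works internally.

```python
import pandas as pd

# Toy stand-in for the corpus; column names mirror those used in this notebook
toy = pd.DataFrame({
    "lemma": ["man", "gehen", "werden", "man", "werden"],
    "Non-agentivity": ["yes", pd.NA, "no", pd.NA, pd.NA],
})
variants = ["man", "werden"]

# Candidates are tokens whose lemma belongs to the alternation set;
# candidates without a decision are what a new session would start from
is_candidate = toy["lemma"].isin(variants)
is_untagged = toy["Non-agentivity"].isna()
remaining = toy[is_candidate & is_untagged]

print(f"Untagged candidates left: {len(remaining)}")
```

Tokens such as "gehen" are never candidates, so they stay untagged until annotation has been completed.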

### Preparations

The first step presupposes that the "VACW" notebook was executed.

In [None]:
#copying the preprocessed corpus file into the folder "Quantitative_analysis/Annotated_datasets"
#a separate copy for annotation is deemed safer than modifying the persistence-tagged corpus 
source_file = "../../../VACW/3_Persistence_tagged/Persistence_VACW_all.csv"
destination_directory = "../../Annotated_datasets/"

destination_file = os.path.join(destination_directory, "VACW.csv")

if not os.path.exists(destination_file):
    shutil.copy2(source_file, destination_file)
    print("File copied.")
else:
    print("File already exists.")

In [None]:
#reading in the (copied) corpus file
df = pd.read_csv("../../Annotated_datasets/VACW.csv", sep=",", index_col=0)

#lower-casing the lemma column, as the code below only checks for lower-case variants
df.lemma = df.lemma.str.lower()

#creating a column for saving annotation decisions, if it does not already exist
if alternating not in df.columns:
    df[alternating] = pd.NA

#informing about how many alternating instances have already been tagged
print(f"Cases annotated as alternating: {len(df[df[alternating]=='yes'])}")

### Annotation Tool

In short, the tool below:
- informs you about the annotation scheme
- tells you how many untagged instances you have got left
- provides you with the next case to annotate, i.e., a potentially alternating lemma including its immediate context
- displays an input field for deciding the current case according to the scheme
- prompts you to confirm your decision and/or gives you the option to end the current session
- searches for identical contexts prompting you whether the decision should be applied there as well

Decisions are saved in `df_updated`. After each session, this DataFrame still needs to be saved externally. To begin a new session, first read in the current version of "../../Annotated_datasets/VACW.csv" under "Preparations" above.
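
The last step of the tool, carrying a decision over to identical contexts, can be pictured with the minimal sketch below. The `context` column and the helper `apply_to_identical_contexts` are hypothetical and only serve this illustration. Unlike the actual tool, which prompts before applying a decision elsewhere, the sketch copies the decision unconditionally.

```python
import pandas as pd

# Toy data; the "context" column is a hypothetical stand-in for the
# immediate context that the tool displays for each candidate
toy = pd.DataFrame({
    "lemma": ["man", "man", "werden", "man"],
    "context": ["das macht man so", "das macht man so",
                "es wird gemacht", "man weiß nie"],
    "Non-agentivity": [pd.NA, pd.NA, pd.NA, pd.NA],
})

def apply_to_identical_contexts(frame, row_label, decision, column="Non-agentivity"):
    """Tag one row and copy the decision to untagged rows with the same lemma and context."""
    frame = frame.copy()
    frame.loc[row_label, column] = decision
    # Rows that share lemma and context with the tagged row and are still undecided
    same = (
        (frame["lemma"] == frame.loc[row_label, "lemma"])
        & (frame["context"] == frame.loc[row_label, "context"])
        & frame[column].isna()
    )
    frame.loc[same, column] = decision
    return frame, int(same.sum())

tagged, n_copied = apply_to_identical_contexts(toy, 0, "yes")
print(f"Decision copied to {n_copied} identical context(s).")
```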

In [None]:
#annotating
df_updated = annotation.alternation_check(df, alternation_set, alternating)

In [None]:
#saving the updated DataFrame externally, overwriting the empty or part-annotated file
df_updated.to_csv("../../Annotated_datasets/VACW.csv")

In [None]:
#once all potentially alternating cases have been annotated, all other tokens of the DataFrame are additionally tagged as non-alternating
if df_updated.loc[df_updated.lemma.isin(alternation_set), alternating].notna().all():
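    #assigning the result back instead of using inplace=True on a column selection, which may not modify the DataFrame in recent pandas versions
    df_updated[alternating] = df_updated[alternating].fillna("no")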
    df_updated.to_csv("../../Annotated_datasets/VACW.csv")
    print("Annotation is completed.")
else:
    print("Annotation is not yet completed, rerun the tool above.")

### Overview

See how many times each speaker used one of the alternating variants.

In [None]:
#reading in the annotated corpus, filtering, grouping and counting values
df = pd.read_csv("../../Annotated_datasets/VACW.csv", sep=",", index_col=0, na_filter=False)
df[df[alternating] == "yes"].groupby("speaker").lemma.value_counts()

## Preparing DataFrame for Modelling

In [None]:
#reading in the annotated corpus, which now includes a column indicating whether there was an opportunity to choose a variant from the alternation set ("yes") or not ("no")
df = pd.read_csv("../../Annotated_datasets/VACW.csv", index_col=0, na_filter=False, sep=",")

Below, the function `prepare_data_for_modeling` extracts or calculates all relevant variables for each choice context. This code only works properly once annotation has been completed. The line that saves the resulting DataFrame externally is commented out, as it would replace the file that was used for modelling in the thesis. As mentioned, said file is shared given its abstract nature.
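
To give an idea of what such a variable looks like, the sketch below derives the variant used in the previous slot for each choice context in a toy DataFrame. The column name `previous_variant` is hypothetical, and tracking the previous slot per speaker is an assumption made for this illustration; the actual logic of `prepare_data_for_modeling` may differ.

```python
import pandas as pd

# Toy annotated corpus; column names mirror those used in this notebook
toy = pd.DataFrame({
    "speaker": ["A", "A", "A", "B", "A", "B"],
    "lemma": ["man", "gehen", "werden", "man", "man", "werden"],
    "Non-agentivity": ["yes", "no", "yes", "yes", "yes", "yes"],
})

# Keep only the choice contexts, i.e., tokens annotated as alternating
choices = toy[toy["Non-agentivity"] == "yes"].copy()

# For each speaker, look one choice context back to find the previous variant;
# a speaker's first choice context has no predecessor and stays missing
choices["previous_variant"] = choices.groupby("speaker")["lemma"].shift(1)

print(choices)
```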

In [None]:
#creating variation_sample, i.e., only annotated choice contexts along with relevant variables
variation_sample = quantification.prepare_data_for_modeling(df, alternating, restrict="yes", beta_variants=["man", "werden"])
#variation_sample.to_csv(f"{alternating}_for_modelling_VACW.csv")
variation_sample