# Non-Agentivity / All Corpora

A dataset for the non-agentivity alternation has already been annotated and prepared for each of the three corpora separately. This notebook combines those datasets and prepares the result for modelling. In addition, the annotated instances are cross-tabulated and a switch rate plot is generated for the Descriptive Statistics part of the thesis.

In [None]:
#importing relevant modules
import pandas as pd, sys, warnings
warnings.filterwarnings('ignore') 

#informing Python about a custom code directory and importing some of the modules from there
sys.path.append("../../Code/")
import annotation, quantification, persistence

In [None]:
#defining name of the alternation set and establishing its variants
alternating = "NON-AGENTIVITY"
alternation_set = ["man", "werden"]

## Combination

The following code reads in all three annotated datasets and concatenates them into one DataFrame.

In [None]:
#reading all three datasets
vacc = pd.read_csv("VACC/NON-AGENTIVITY_for_modelling_VACC.csv", index_col=0)
vacw = pd.read_csv("VACW/NON-AGENTIVITY_for_modelling_VACW.csv", index_col=0)
rbc = pd.read_csv("RBC/NON-AGENTIVITY_for_modelling_RBC.csv", index_col=0)

In [None]:
#creating a column with the corpus name for assigning unique interaction ids below
vacc["CORPUS"] = "VACC"
vacw["CORPUS"] = "VACW"
rbc["CORPUS"] = "RBC"

In [None]:
#concatenating all three variation samples into one
variation_sample = pd.concat([vacc, vacw, rbc])

#assigning unique interaction ids
variation_sample.INTERACTION_ID = variation_sample.INTERACTION_ID.astype(str) + "_" + variation_sample.CORPUS

#saving externally
variation_sample.to_csv(f"{alternating}_for_modelling.csv")
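
Why the corpus name is appended to the interaction ids can be illustrated with a small sketch. It assumes that interaction ids are only unique within a corpus, so the same id can occur in more than one corpus. The frames `toy_a` and `toy_b` and all their values are invented for illustration and are not taken from the actual datasets.

```python
import pandas as pd

# Invented stand-ins for two per-corpus samples (hypothetical values)
toy_a = pd.DataFrame({"INTERACTION_ID": [1, 2], "CURRENT": ["man", "werden"]})
toy_b = pd.DataFrame({"INTERACTION_ID": [1, 2], "CURRENT": ["werden", "werden"]})
toy_a["CORPUS"] = "A"
toy_b["CORPUS"] = "B"

toy = pd.concat([toy_a, toy_b])

# Before suffixing, interaction 1 of corpus A and interaction 1 of corpus B share an id
ids_before = toy["INTERACTION_ID"].nunique()

# Appending the corpus name keeps interactions from different corpora apart
toy["INTERACTION_ID"] = toy["INTERACTION_ID"].astype(str) + "_" + toy["CORPUS"]
ids_after = toy["INTERACTION_ID"].nunique()

print(ids_before, ids_after)
```

Without the suffix, any later grouping by `INTERACTION_ID` would treat interactions from different corpora as one.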

## Descriptive Statistics

### Cross-Tabulation

The following code creates a table showing, per corpus, how often each variant in PREVIOUS was followed by the same or the other variant in CURRENT.

In [None]:
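#cross-tabulating PREVIOUS against CURRENT within each corpus (combinations that do not occur are counted as 0)
contingency_table = variation_sample.groupby("CORPUS").apply(lambda group: pd.crosstab(group.PREVIOUS, group.CURRENT)).fillna(0)
#adding row totals
contingency_table["Total in PREVIOUS"] = contingency_table.sum(axis=1)
#adding column totals (the last cell holds the grand total)
contingency_table.loc["Total in CURRENT"] = contingency_table.sum(axis=0)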
contingency_table
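
How to read such a table can be sketched on toy data. The example below passes both `CORPUS` and `PREVIOUS` as row keys to `pd.crosstab`, which is a different route from the `groupby(...).apply(...)` call in the cell above, although it should yield a table of the same shape. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Invented toy data: one row per instance, with the variant that preceded it
toy = pd.DataFrame({
    "CORPUS":   ["A", "A", "A", "B", "B"],
    "PREVIOUS": ["man", "man", "werden", "werden", "werden"],
    "CURRENT":  ["man", "werden", "werden", "werden", "man"],
})

# Rows are (CORPUS, PREVIOUS) pairs, columns are the variants in CURRENT
toy_table = pd.crosstab([toy["CORPUS"], toy["PREVIOUS"]], toy["CURRENT"])

print(toy_table)
```

In this table, the cell in row `("A", "man")` and column `werden` counts the instances in corpus A where `man` was followed by `werden`, i.e. a change of variant.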

### Switch Rate Plot

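Finally, a switch rate plot is generated from the combined variation sample and the three full corpora, which are needed to calculate the overall share of each variant.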
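
This notebook does not define the switch rate itself, so the following is only a minimal sketch under one assumption: a switch is counted whenever the variant in CURRENT differs from the variant in PREVIOUS. It is not the implementation used in the custom `quantification` module, and the frame `toy` and its values are invented.

```python
import pandas as pd

# Invented toy data (hypothetical values)
toy = pd.DataFrame({
    "CORPUS":   ["A", "A", "A", "A", "B", "B"],
    "PREVIOUS": ["man", "man", "werden", "werden", "werden", "man"],
    "CURRENT":  ["man", "werden", "werden", "werden", "man", "werden"],
})

# Assumption: a switch is an instance whose variant differs from the previous one
toy["SWITCH"] = toy["CURRENT"] != toy["PREVIOUS"]

# Share of switches per corpus and per previous variant
switch_rate = toy.groupby(["CORPUS", "PREVIOUS"])["SWITCH"].mean()

print(switch_rate)
```

Under this assumption, a rate of 0.5 for `("A", "man")` means that half of the instances following `man` in corpus A used `werden` instead.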

In [None]:
#reading all three entire corpora and combining them into one (needed for calculating variant shares for switch rate plots)
vacc = pd.read_csv("../Annotated_datasets/VACC.csv", index_col=0)
vacw = pd.read_csv("../Annotated_datasets/VACW.csv", index_col=0)
rbc = pd.read_csv("../Annotated_datasets/RBC.csv", index_col=0)

#creating a column with the corpus name for assigning unique interaction ids below
vacc["corpus"] = "VACC"
vacw["corpus"] = "VACW"
rbc["corpus"] = "RBC"

#concatenating all three corpora into one, ignoring irrelevant columns
columns = ["lemma", "interaction_id", alternating, "corpus"]
df = pd.concat([vacc[columns], vacw[columns], rbc[columns]])

#assigning unique interaction ids
df.interaction_id = df.interaction_id.astype(str) + "_" + df.corpus 

#defining output path for the plot
path = f"switch_rate_plot_{alternating}.png"

#generating plot
quantification.plot_switch_rate_over_variant_proportions(df, variation_sample, alternation_set, alternating, save_to=path)