# Assessing the Linguistic Complexity of German Abitur Texts from 1963–2013
## 3: Syntactic Complexity

Author: Matilda Schauf

This notebook analyzes the syntactic complexity of German Abitur texts from 1963–2013 using the GraphVar corpus (Berg et al., 2021).

## Table of Contents

* [Import Modules](#import)
* [Get Data](#getdata)
* [Syntactic Complexity Features](#syncompl)
* [Load Class for Measuring Syntactic Complexity](#loadclass)
* [Get Results and Analyze Data](#getresults)
* [Syntactic Complexity of Express and Zeit](#expresszeit)
* [References](#ref)

## Import Modules <a class="anchor" id="import"></a>

I will use `pandas` and `NumPy`, as well as two functions and a class from two modules of my own: `get_filenames` and `make_df_dict` from the module `functions`, and `SynComplMeas` from the module `calc_syn_complexity`.

In [1]:
import sys
import pandas as pd
import numpy as np

# Insert the path of modules folder 
sys.path.insert(0, "src")

from functions import get_filenames, make_df_dict
from calc_syn_complexity import SynComplMeas

## Get Data <a class="anchor" id="getdata"></a>
First, get the filenames of the development or test data with the imported function `get_filenames`.

Then read the files into data frames and store them in a dictionary with the imported function `make_df_dict`.

For more information about both functions, print their docstrings in the cells below.

In [2]:
print(get_filenames.__doc__)


    Function that takes .csv-File with the filenames that have been categorized into dev or test files and returns either the filenames for the development data or the filenames for the test data.
        
        Input:
            1. filenames_categorized (str): Filename of the file with the filenames and their categories
            2. test (bool): True when test filenames should be returned, False when development filenames should be returned
        Output:
            1. filenames (list): List of the wanted conllup filenames (str)


In [3]:
print(make_df_dict.__doc__)


    Function that takes a list of filenames, loads the conllup files into a data frame, and saves them in a dictionary with the key tuples (year, text number).

        Input:
            1. path (str): Path to the files on the computer
            2. filenames (list): List with the filenames of the conllup files (str)
        Output:
            1. dfs_dict (dict): Dictionary with key = tuple (year, text number) and value = data frame


In [4]:
# get filenames (leave exactly one of the three options uncommented)

# development data
#filenames = get_filenames("dataSplits.csv", test=False)

# test data
#filenames = get_filenames("dataSplits.csv", test=True)

# demo data
filenames = get_filenames("src/demo_dataSplits.csv", test=True)

In [5]:
# original GraphVar data
#path = "data/conll/graphvar_1963-2013_DE_conll/"

# demo data
path = "data/"
df_dict = make_df_dict(path, filenames)

## Syntactic Complexity Features <a class="anchor" id="syncompl"></a>
This section introduces the syntactic complexity features used in the analysis, together with their definitions and notes on the literature.

The variable names displayed as code are also the names of the attributes of the class that is loaded in a later cell.

- `sent_lens` = **Mean Sentence Length**
    - *Mean Sentence Length in Tokens* (Meyer et al., 2020); *Mean length of sentences* (Chen & Zechner, 2011)
        - \# Tokens / # Sentences
- `clauses_s` = **Clauses per Sentence**
    - *Sentence Coordination Ratio/Sentence Complexity Ratio/T-Unit Complexity Ratio* (Meyer et al., 2020); *number of clauses per sentence* (Chen & Zechner, 2011)
        - \# Clause (paratactic constructions, relative, simplex) tokens / # Sentences
- `subc_s` = **Subordinate Clauses per Sentence**
    - *Dependent Clauses per T-Unit* (Meyer et al., 2020)
        - \# C / # Sentences
- `clause_lens` = **Mean Clause Length** in Tokens
    - *Mean Length of Clause* (Meyer et al., 2020); *mean length of clauses* (Chen & Zechner, 2011)
        - \# Clause tokens / # Clauses

- `simpx_lens` = **Mean Simplex Clause Length** in Tokens
    - *Mean length of simple sentences* (Chen & Zechner, 2011)
        - \# SIMPX tokens / # SIMPX
- `relc_lens` = **Mean Relative Clause Length** in Tokens
    - not listed
        - \# R-SIMPX tokens / # R-SIMPX

- `simpx_c` = **Simplex Clause Ratio**
    - not listed
        - \# SIMPX / # Clauses
- `relc_c` = **Relative Clause Ratio**
    - not listed, but mentioned in paper (Meyer et al., 2020)
        - \# R-SIMPX / # Clauses
- `parac_c` = **Paratactic Clause Construction Ratio**
    - not listed, but mentioned in paper (Meyer et al., 2020)
        - \# P-SIMPX / # Clauses

- `vf_lens`, `mf_lens`, `nf_lens` = **Mean Prefield Length**, **Mean Middle Field Length**, **Mean Postfield Length** in Node Tags or Tokens
    - not listed
        - \# {VF | MF | NF} tokens / # {VF | MF | NF}
- `nx_lens`, `px_lens` = **Mean Noun Phrase Length**, **Mean Prepositional Phrase Length** in Node Tags
    - not listed
        - \# {NX | PX} tokens / # {NX | PX}

- `verbx_s` = **Verb Phrases per Sentence**
    - *Verb Phrases per T-Unit* (Meyer et al., 2020); *mean number of verbs per sentence* (Chen & Zechner, 2011)
        - \# VXFIN+VXINF tokens / # Sentences
- `nx_s` = **Noun Phrases per Sentence**
    - *Mean number of noun phrases (NP) per sentence* (Chen & Zechner, 2011)
        - \# NX tokens / # Sentences

- `tok_embeds` = **Mean Token Embedding Depth** in Node Tags
    - not listed
        - \# Nodes / # Tokens
- `max_sent_embeds` = **Mean Maximum Embedding Depth per Sentence** in Node Tags
    - not listed
        - SUM of maximum embedding depths per sentence / # Sentences

- `vv_nn` = **Verb/Noun Ratio**
    - not listed
        - \# Verbs (XPOS starts with 'VV') / # nouns (XPOS is 'NN')
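
As a minimal sketch of how two of the ratios above can be read, the following computes Mean Sentence Length and the Verb/Noun Ratio on a hypothetical toy annotation. The toy rows, and the choice to count punctuation as tokens, are assumptions made for illustration; this is not the implementation in `SynComplMeas`.

```python
import pandas as pd

# Hypothetical toy annotation: two sentences, one token per row.
toy = pd.DataFrame({
    "SENT_ID": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "FORM": ["Der", "Hund", "bellt", ".", "Die", "Katze", "sieht", "den", "Vogel", "."],
    "XPOS": ["ART", "NN", "VVFIN", "$.", "ART", "NN", "VVFIN", "ART", "NN", "$."],
})

n_tokens = len(toy)                    # assumption: punctuation counts as a token
n_sents = toy["SENT_ID"].nunique()

# Mean Sentence Length: # Tokens / # Sentences
mean_sent_len = n_tokens / n_sents

# Verb/Noun Ratio: # XPOS starting with 'VV' / # XPOS equal to 'NN'
n_verbs = toy["XPOS"].str.startswith("VV").sum()
n_nouns = (toy["XPOS"] == "NN").sum()
vv_nn = n_verbs / n_nouns

print("Mean sentence length:", mean_sent_len)
print("Verb/noun ratio:", round(vv_nn, 3))
```

The other length and ratio features follow the same numerator/denominator pattern, with the counts taken from the syntax tags instead of the part-of-speech tags.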

Features not listed here were left out of the evaluation.

## Load Class for Measuring Syntactic Complexity <a class="anchor" id="loadclass"></a>

The cell below instantiates the class `SynComplMeas`, which calculates the syntactic complexity features introduced above.

This takes about 30 seconds for the development data and about 2 minutes for the test data.

To see all attributes and methods of the class, print its docstring.

In [6]:
# load class
sc = SynComplMeas(name="Syntactical Complexity Measures", df_dict=df_dict)

In [7]:
# show docstring
print(sc.__doc__)


    A class to represent our syntactic complexity measures.

    Attributes
    ----------
    name : str
        Name for the class
    df_dict : dict
        Dictionary that contains data frames with corpus annotation data for several connlup files
    sent_lens : Pandas.DataFrame
        Results for feature "Mean Sentence Length in Tokens"
    tok_embeds : Pandas.DataFrame
        Results for feature "Mean Token Embedding Depth"
    max_sent_embeds : Pandas.DataFrame
        Results for feature "Mean Maximum Embedding Depth per Sentence"
    simpx_s : Pandas.DataFrame
        Results for feature "Simplex Clauses per Sentence"
    subc_s : Pandas.DataFrame
        Results for feature "Dependent Clauses per Sentence"
    relc_s : Pandas.DataFrame
        Results for feature "Relative Clauses per Sentence"
    parac_s : Pandas.DataFrame
        Results for feature "Paratactic Clause Constructions per Sentence"
    clauses_s : Pandas.DataFrame
        Results for feature "Clauses per S

## Get Results and Analyze Data <a class="anchor" id="getresults"></a>

The cell below saves the results as `.csv` files in a target directory. 

Each result data frame is available as an attribute named after its feature, for example `sc.sent_lens`.

In [8]:
# target directory
#target_dir = "results/3_syntax/dev_results/"
#target_dir = "results/3_syntax/test_results/"
target_dir = "results/3_syntax_demo/"

import os
os.makedirs(target_dir, exist_ok=True)

# save result data frames in list
result_dfs = [sc.sent_lens, sc.tok_embeds, sc.max_sent_embeds, sc.simpx_s, sc.subc_s, sc.relc_s, sc.parac_s, sc.clauses_s, sc.verbx_s, sc.vc_s, sc.nx_s, sc.simpx_c,
        sc.subc_c, sc.relc_c, sc.parac_c, sc.clause_lens, sc.simpx_lens, sc.relc_lens, sc.nx_lens, sc.px_lens, sc.vf_lens, sc.mf_lens, sc.nf_lens, sc.vv_nn]

# iterate over data frames + data frame names (str) and save results in .csv files
for df, df_name in zip(result_dfs, sc.feature_names):
    df.to_csv(target_dir + df_name + ".csv")

This cell can be used for displaying a particular result data frame (`sc.attribute`):

In [9]:
# show result df in this notebook
sc.vv_nn

| | YEAR | YEAR_VAL | STUDENT_VALS | STUDENT_STD | YEARS_MEAN | YEARS_STD |
|---|---|---|---|---|---|---|
| 0 | 1963 | 0.583824 | [0.6617647058823529, 0.5058823529411764] | 0.110225 | 0.553828 | 0.04242 |
| 1 | 2013 | 0.523832 | [0.48375451263537905, 0.5639097744360902] | 0.056678 | 0.553828 | 0.04242 |
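
The columns of this result appear to be related as follows: `YEAR_VAL` is the mean of the per-student values, `STUDENT_STD` their sample standard deviation, and `YEARS_MEAN` / `YEARS_STD` the mean and sample standard deviation of `YEAR_VAL` across years. This reading is inferred from the displayed numbers, not taken from the source code of `SynComplMeas`. A short sketch that reproduces the displayed values under that assumption:

```python
import pandas as pd

# Per-student values copied from the STUDENT_VALS column displayed for sc.vv_nn.
student_vals = {
    1963: [0.6617647058823529, 0.5058823529411764],
    2013: [0.48375451263537905, 0.5639097744360902],
}

rows = []
for year, vals in student_vals.items():
    s = pd.Series(vals)
    # Series.std() uses the sample standard deviation (ddof=1) by default.
    rows.append({"YEAR": year, "YEAR_VAL": s.mean(), "STUDENT_STD": s.std()})

summary = pd.DataFrame(rows)
# Assumed aggregation across years: mean and sample std of the yearly values.
summary["YEARS_MEAN"] = summary["YEAR_VAL"].mean()
summary["YEARS_STD"] = summary["YEAR_VAL"].std()

print(summary.round(6))
```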


This cell gives an overview of which years most often had the **maximum** or **minimum** value across the features:

In [10]:
# import function mode from module statistics
from statistics import mode

# only calculate for features where higher value implies higher complexity
dfs = [sc.sent_lens, sc.tok_embeds, sc.max_sent_embeds, sc.subc_s, sc.relc_s, sc.clauses_s, sc.verbx_s, sc.vc_s, sc.nx_s, sc.subc_c, sc.relc_c, sc.clause_lens, sc.simpx_lens, 
sc.relc_lens, sc.nx_lens, sc.px_lens, sc.vf_lens, sc.mf_lens, sc.nf_lens, sc.vv_nn]

# make list of years with the highest value for each feature
# (idxmax/idxmin return a single row label, so .loc yields one scalar year)
highest = [int(df.loc[df.YEAR_VAL.idxmax(), "YEAR"]) for df in dfs]
# make list of years with the lowest value for each feature
lowest = [int(df.loc[df.YEAR_VAL.idxmin(), "YEAR"]) for df in dfs]

# print years that most often had highest/lowest value
print(mode(highest))
print(mode(lowest))

2013
1963
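
`mode` reports only the single most frequent year (since Python 3.8, the first one encountered in case of a tie). As a hedged alternative, the sketch below tallies how often each year holds the maximum, using `collections.Counter` on hypothetical toy result frames; the frame contents are invented for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical result frames for three features (values invented for illustration).
toy_dfs = {
    "sent_lens": pd.DataFrame({"YEAR": [1963, 2013], "YEAR_VAL": [14.2, 16.8]}),
    "tok_embeds": pd.DataFrame({"YEAR": [1963, 2013], "YEAR_VAL": [3.1, 3.4]}),
    "vv_nn": pd.DataFrame({"YEAR": [1963, 2013], "YEAR_VAL": [0.58, 0.52]}),
}

# Year with the highest value per feature.
highest = {
    name: int(df.loc[df["YEAR_VAL"].idxmax(), "YEAR"])
    for name, df in toy_dfs.items()
}

# Full tally instead of only the mode.
counts = Counter(highest.values())
print(highest)
print(counts.most_common())
```

With more than two years in the data, the tally also shows how close the runner-up years are, which the mode alone hides.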


This cell prints the *mean student standard deviation* for each feature:

In [11]:
for df, name in zip(result_dfs, sc.feature_names):
    print(df.STUDENT_STD.mean(), name)

1.4794747333435074 sent_lens
0.1956032342631167 tok_embeds
0.19117579955012226 max_sent_embeds
0.4807085573855916 simpx_s
0.20158745954879642 subc_s
0.11085095028375574 relc_s
0.0449695102070392 parac_s
0.46001815219673714 clauses_s
0.4497837119427211 verbx_s
0.28860235417601315 vc_s
0.6539408577890626 nx_s
0.07152098826130787 simpx_c
0.02358680711941884 subc_c
0.050698651298429875 relc_c
0.020822336962878024 parac_c
0.569159569183477 clause_lens
0.4563037538765561 simpx_lens
1.8528816581091945 relc_lens
0.17795069938646124 nx_lens
0.22932140130699966 px_lens
0.486404509501187 vf_lens
0.4915980588724505 mf_lens
0.9733152736626002 nf_lens
0.08345189899954328 vv_nn


The next two cells check, for one result data frame, which rows have a `YEAR_VAL` or a `STUDENT_STD` above the respective column mean.

In [12]:
# check which rows have a value in YEAR_VAL that is higher than the mean of YEAR_VAL
sc.vv_nn.YEAR_VAL > sc.vv_nn.YEAR_VAL.mean()

0     True
1    False
Name: YEAR_VAL, dtype: bool

In [13]:
# check which rows have a value in STUDENT_STD that is higher than the mean of STUDENT_STD
sc.vv_nn.STUDENT_STD > sc.vv_nn.STUDENT_STD.mean()

0     True
1    False
Name: STUDENT_STD, dtype: bool

## Syntactic Complexity of Express and Zeit <a class="anchor" id="expresszeit"></a>

In this section, the syntactic complexity features will be calculated on texts from our *Express* and *Zeit* corpora.

First, get the filenames of the `.conllup` files with the annotation for the reference corpora (136 files from *Express* and 137 files from *Zeit*) and save them in a list.

In [15]:
# import function listdir from module os
from os import listdir

# define path to the files
#ez_path = "data/corpus/random_BIO/"
ez_path = "data/"

# save filenames in list
#ez_filenames = listdir(ez_path)

# demo data
ez_filenames = ['express_1.conllup', 'express_2.conllup', 'express_3.conllup', 
                'zeit_1.conllup', 'zeit_2.conllup', 'zeit_3.conllup'] 

Then, define a function that takes the list of filenames and makes a dictionary with key=`(corpus ID, text number)` and value=`DataFrame`.

It is similar to the function `make_df_dict` from the module `functions`, but adapted to the annotation of the reference corpora, which differs slightly from that of the Abitur texts.

In [16]:
def make_corp_dict(path: str, filenames: list) -> dict:
    """
    Function that takes a path and filenames and returns a dictionary with key = (corpus ID, text number) and value = data frame.
    The corpus ID for Express is 1 and the corpus ID for Zeit is 2.
    """

    df_dict = dict()

    for i, filename in enumerate(filenames):

        # open conllup file and use first line for saving the column names for data frame
        with open(path+filename, "r", encoding="UTF-8") as file:
            column_names = file.readline().replace("# global.columns =", "").strip().split()
        
        # load file into data frame
        df = pd.read_csv(path+filename, comment="#", sep="\t", quoting=3, header=None, names=column_names)

        # convert FORM column values into strings
        df["FORM"] = df["FORM"].astype(str)
        # delete rows with no words
        df = df[~df.FORM.str.contains("EMPTY")]
        # delete superfluous rows for only one word
        df = df[~df.FORM.str.contains(r"<[IE]->")]
        # reset index
        df.reset_index(drop=True, inplace=True)

        df["TREE"] = df["TREE"].astype(str)
        # make new column SYNTAX that is like the column TREE but without the PSEUDO tags
        df["SYNTAX"] = df.TREE.str.replace(r"[BIE]-PSEUDO\|?", '', regex=True)
        # ignore rows that have no (meaningful) syntax annotation
        df = df[df.SYNTAX.apply(len) > 1]
        # reset index
        df.reset_index(drop=True, inplace=True)

        # make sentence ID: the first remaining row always starts a sentence,
        # and every row with ID == 1 starts a new one
        df.loc[0, "ID"] = 1
        df["SENT_ID"] = df.ID.eq(1).cumsum() - 1

        # only keep columns that are needed
        keep = ["FORM", "SENT_ID", "SYNTAX", "XPOS"]
        df = df[keep]
        
        # instead of year numbers, use corpus IDs for the first part of the key tuple;
        # the text number is the file's position in `filenames`, so it keeps counting across corpora
        if filename.startswith("express"):
            df_dict[(1, i+1)] = df
        elif filename.startswith("zeit"):
            df_dict[(2, i+1)] = df
    
    return df_dict

In [17]:
# run function
corp_dict = make_corp_dict(ez_path, ez_filenames)

In [18]:
corp_dict.keys()

dict_keys([(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6)])
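
Two steps inside `make_corp_dict` are easy to miss: stripping the `PSEUDO` wrapper tags and deriving `SENT_ID` from the token counter `ID`. The sketch below replays both on a hypothetical toy frame; the `TREE` strings are invented and only assumed to resemble the real annotation.

```python
import pandas as pd

# Hypothetical token rows: ID restarts at 1 at the start of every sentence.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2],
    "FORM": ["Es", "regnet", ".", "Sie", "lacht"],
    "TREE": [
        "B-PSEUDO|SIMPX|VF|NX",
        "I-PSEUDO|SIMPX|LK|VXFIN",
        "E-PSEUDO",
        "SIMPX|VF|NX",
        "SIMPX|LK|VXFIN",
    ],
})

# Strip the PSEUDO wrapper tags (same pattern as in make_corp_dict).
toy["SYNTAX"] = toy["TREE"].str.replace(r"[BIE]-PSEUDO\|?", "", regex=True)

# Drop rows that have no syntax annotation left after stripping.
toy = toy[toy["SYNTAX"].str.len() > 1].reset_index(drop=True)

# Each ID == 1 marks a sentence start; the running count minus one
# gives a 0-based sentence ID.
toy["SENT_ID"] = toy["ID"].eq(1).cumsum() - 1

print(toy[["FORM", "SYNTAX", "SENT_ID"]])
```

One consequence of this scheme: if the row with `ID == 1` of a sentence is removed by one of the filters, the remaining tokens of that sentence are counted as part of the previous sentence. Whether this occurs in the reference corpora is not checked here.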

In [19]:
# uncomment to look at example dataframe from the dict
#corp_dict[(1, 1)]

Use the class `SynComplMeas` to calculate each syntactic complexity measure on the Zeit and Express corpora's texts. 

The result data frames have only two rows: one for the Express corpus and one for the Zeit corpus.

Here, `YEAR` always stands for the corpus ID (Express=`1`, Zeit=`2`).

In [20]:
# load class
sc2 = SynComplMeas(name="Express and Zeit Syntactic Complexity", df_dict=corp_dict)

In [21]:
# show dataframe for one attribute
sc2.vv_nn

| | YEAR | YEAR_VAL | STUDENT_VALS | STUDENT_STD | YEARS_MEAN | YEARS_STD |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.548979 | [0.47706422018348627, 0.6282051282051282, 0.54...] | 0.075835 | 0.478682 | 0.099414 |
| 1 | 2 | 0.408385 | [0.36607142857142855, 0.3879310344827586, 0.47...] | 0.055447 | 0.478682 | 0.099414 |


Because each result data frame has only two rows, it makes sense to concatenate them into a single data frame and add a column that holds the respective feature name.

In [23]:
# save result dfs in list
ez_results = [sc2.sent_lens, sc2.tok_embeds, sc2.max_sent_embeds, sc2.simpx_s, sc2.subc_s, sc2.relc_s, sc2.parac_s, sc2.clauses_s, sc2.verbx_s, sc2.vc_s, sc2.nx_s, sc2.simpx_c,
            sc2.subc_c, sc2.relc_c, sc2.parac_c, sc2.clause_lens, sc2.simpx_lens, sc2.relc_lens, sc2.nx_lens, sc2.px_lens, sc2.vf_lens, sc2.mf_lens, sc2.nf_lens, sc2.vv_nn]

# add a column with the feature name to a copy of each result data frame
# (the originals stay unchanged) and concatenate the copies
ez_df = pd.concat([df.assign(FEAT=df_name) for df, df_name in zip(ez_results, sc2.feature_names)])

# select and order the columns
ez_df = ez_df[["FEAT", "YEAR", "YEAR_VAL", "YEARS_MEAN", "YEARS_STD"]].copy()
# replace the corpus IDs with letters (plain assignment instead of inplace=True on a column selection)
ez_df["YEAR"] = ez_df["YEAR"].replace({1: "E", 2: "Z"})
ez_df = ez_df.rename(columns={"YEAR": "COR", "YEAR_VAL": "VAL", "YEARS_MEAN": "MEAN"})

# save data frame as .csv file
#ez_df.to_csv("results/3_syntax/expr_zeit/express_zeit.csv")
ez_df.to_csv("results/3_syntax_demo/express_zeit.csv")

In [24]:
# display data frame with relevant features in notebook (only rows with YEARS_STD > 0.1)
ez_df[ez_df.YEARS_STD > 0.1]

| | FEAT | COR | VAL | MEAN | YEARS_STD |
|---|---|---|---|---|---|
| 0 | sent_lens | E | 14.950202 | 19.099738 | 5.86833 |
| 1 | sent_lens | Z | 23.249274 | 19.099738 | 5.86833 |
| 0 | tok_embeds | E | 3.020896 | 3.249037 | 0.322641 |
| 1 | tok_embeds | Z | 3.477179 | 3.249037 | 0.322641 |
| 0 | max_sent_embeds | E | 4.324755 | 4.849601 | 0.742245 |
| 1 | max_sent_embeds | Z | 5.374447 | 4.849601 | 0.742245 |
| 0 | simpx_s | E | 1.614616 | 1.731427 | 0.165196 |
| 1 | simpx_s | Z | 1.848238 | 1.731427 | 0.165196 |
| 0 | subc_s | E | 0.44175 | 0.547987 | 0.150243 |
| 1 | subc_s | Z | 0.654225 | 0.547987 | 0.150243 |
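
For a side-by-side comparison of the two corpora, the long table can be pivoted into one row per feature. The following is a sketch on a hand-copied subset of the displayed rows; the column name `Z_MINUS_E` is introduced here for illustration only.

```python
import pandas as pd

# Subset of the displayed ez_df rows, copied by hand.
long_df = pd.DataFrame({
    "FEAT": ["sent_lens", "sent_lens", "tok_embeds", "tok_embeds", "subc_s", "subc_s"],
    "COR": ["E", "Z", "E", "Z", "E", "Z"],
    "VAL": [14.950202, 23.249274, 3.020896, 3.477179, 0.44175, 0.654225],
})

# One row per feature, one column per corpus.
wide = long_df.pivot(index="FEAT", columns="COR", values="VAL")

# Difference between the corpora (Zeit minus Express); hypothetical column name.
wide["Z_MINUS_E"] = wide["Z"] - wide["E"]

print(wide)
```

In the rows copied here, the Zeit value is higher than the Express value for every feature.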


## Extra: Two Example Sentences

(Note: this does not work with the demo data)

In [30]:
ex_df = df_dict[(2003, 18)]
print(ex_df)

KeyError: (2003, 18)

In [31]:
short_sent = ex_df[ex_df.SENT_ID==83]
short_sent[["SENT_ID", "FORM", "XPOS", "SYNTAX", "WEBANNO"]]

KeyError: "['WEBANNO', 'FORM'] not in index"

In [32]:
long_sent = ex_df[ex_df.SENT_ID==5]
long_sent[["SENT_ID", "FORM", "XPOS", "SYNTAX", "WEBANNO"]]

KeyError: "['WEBANNO', 'FORM'] not in index"

In [33]:
top_fields_RE = r"""(?x)    # flag verbose
                        \|          # leading pipe
                        (
                        ([BIE]-)?   # optional: B- or I- or E-
                         (
                         [VMN]FE?   # VF, MF or NF, each optionally followed by E
                         |
                         L[KV]      # LK or LV
                         |
                         F?KOORD    # FKOORD or KOORD
                         |
                         PARORD
                         |
                         V?CE?      # VC or VCE or C
                         |
                         FKONJ
                         )
                        )
                        (?= $|\|)   # lookahead: end of string or next pipe (checked, not consumed)
                        """

ex_sents = [short_sent, long_sent]

for sent in ex_sents:
    #print("Sentence:", " ".join(sent.FORM))
    print()
    print("Sentence Length:", len(sent))
    print("Clauses in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?[PR]?-?SIMPX"))
    print("Subordinate Clauses in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?C($|\|)"))
    print("Mean Clause Length:", sc.get_phrase_lens(sent, r"[PR]?-?SIMPX"))
    print("Mean Simplex Clause Length:", sc.get_phrase_lens(sent, r"SIMPX"))
    print("Mean Relative Clause Length:", sc.get_phrase_lens(sent, r"R-?SIMPX"))
    print("Simplex Clauses in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?SIMPX"))
    print("Relative Clauses in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?R-?SIMPX"))
    print("Paratactic Clauses in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?P-?SIMPX"))
    print("Mean Prefield Length:", sc.get_phrase_lens(sent, r"VF"))
    print("Mean Middle Field Length:", sc.get_phrase_lens(sent, r"MF"))
    print("Mean Postfield Length:", sc.get_phrase_lens(sent, r"NF"))
    print("Mean NP Length:", sc.get_phrase_lens(sent, r"NX"))
    print("Mean PP Length:", sc.get_phrase_lens(sent, r"PX"))
    print("Verbs in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?VXF?INF?"))
    print("NPs in Sentence:", sc.count_pattern(sent, r"(^|\|)(B-)?NX"))
    print("Verb/Noun Ratio:", sc.count_pattern(sent, r"VV.*")/sc.count_pattern(sent, r"NN"))
    print("Mean Token Embedding Depth:", sc.get_tok_embeds(sent, top_fields_RE))
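    # count the remaining node tags per token without adding a column to the slice
    # (assigning to the slice triggers a SettingWithCopyWarning)
    depths = sent.SYNTAX.str.replace(top_fields_RE, '', regex=True).str.split("|").apply(len)
    print("Maximum Embedding Depth:", depths.max())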
    print()


Sentence Length: 0
Clauses in Sentence: 0
Subordinate Clauses in Sentence: 0
Mean Clause Length: nan
Mean Simplex Clause Length: nan
Mean Relative Clause Length: nan
Simplex Clauses in Sentence: 0
Relative Clauses in Sentence: 0
Paratactic Clauses in Sentence: 0
Mean Prefield Length: nan
Mean Middle Field Length: nan
Mean Postfield Length: nan
Mean NP Length: nan
Mean PP Length: nan
Verbs in Sentence: 0
NPs in Sentence: 0
Verb/Noun Ratio: nan
Mean Token Embedding Depth: nan
Maximum Embedding Depth: nan


Sentence Length: 10
Clauses in Sentence: 1
Subordinate Clauses in Sentence: 0
Mean Clause Length: 9.0
Mean Simplex Clause Length: 9.0
Mean Relative Clause Length: nan
Simplex Clauses in Sentence: 1
Relative Clauses in Sentence: 0
Paratactic Clauses in Sentence: 0
Mean Prefield Length: 3.0
Mean Middle Field Length: 5.0
Mean Postfield Length: nan
Mean NP Length: 2.6666666666666665
Mean PP Length: nan
Verbs in Sentence: 1
NPs in Sentence: 3
Verb/Noun Ratio: 0.5
Mean Token Embedding Depth

  phrase_len = phrase_tag_count/phrase_count
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
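
The embedding depth used in the cell above can be read as: remove the topological field tags from a token's `SYNTAX` path, then count the node tags that remain. The sketch below shows this on hypothetical `SYNTAX` strings, using a compact, non-verbose variant of `top_fields_RE`; the helper name `embedding_depth` and the example strings are assumptions, and `sc.get_tok_embeds` may differ in detail.

```python
import re

# Compact variant of top_fields_RE (assumed to cover the same tag inventory).
FIELD_RE = re.compile(
    r"\|(?:[BIE]-)?(?:[VMN]FE?|L[KV]|F?KOORD|PARORD|V?CE?|FKONJ)(?=$|\|)"
)

def embedding_depth(syntax: str) -> int:
    """Count the node tags left after removing topological field tags."""
    return len(FIELD_RE.sub("", syntax).split("|"))

# Hypothetical SYNTAX values, one per token.
examples = ["SIMPX|VF|NX", "SIMPX|LK|VXFIN", "SIMPX|MF|PX|NX", "SIMPX|C"]
depths = [embedding_depth(s) for s in examples]

print(depths)                      # [2, 2, 3, 1]
print(sum(depths) / len(depths))   # mean token embedding depth: 2.0
print(max(depths))                 # maximum embedding depth: 3
```

Because the pattern requires a leading pipe, a field tag in first position of the path would not be removed; in the toy strings the path always starts with a clause node.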


## References <a class="anchor" id="ref"></a>

Kristian Berg, Jonas Romstadt, and Cedrek Neitzert. 2021. GraphVar – Korpusaufbau und Annotation. Version 1.0. Friedrich-Wilhelms-Universität Bonn, https://graphvar.uni-bonn.de/dokumentation.html.

Miao Chen and Klaus Zechner. 2011. Computing and evaluating syntactic complexity features for automated scoring of spontaneous non-native speech. In *Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies*, pages 722–731, Portland, Oregon, USA. Association for Computational Linguistics.

Jennifer Meyer, Torben Jansen, Johanna Fleckenstein, Stefan Keller, and Olaf Köller. 2020. Machine Learning im Bildungskontext: Evidenz für die Genauigkeit der automatisierten Beurteilung von Essays im Fach Englisch. *Zeitschrift für Pädagogische Psychologie*, 0:1–12.