# Assessing the Linguistic Complexity of German Abitur Texts from 1963–2013
## 1: Lexical Diversity

Author: Isabell Siem

## Structure

1. Importing Libraries and Functions
2. Functions
3. Computing All Measure Values
4. Computing Standard Deviations and Means
5. Creating and Displaying All Restults with Pandas
6. Computing the Lexical Diversity of the EXPRESS and ZEIT Corpora
7. References

# 1. Importing Libraries and Functions

- Cell \#1: Imported libraries: pandas, scipy.stats, numpy
- Cell \#2: Imported function: functions.py
    - Function that splits data into test- and devset
    - Author: Matilda Schauf

In [2]:
# Cell 0
#splits = "src/dataSplits.csv"
splits = "src/demo_dataSplits.csv"

In [3]:
# Cell 1
import pandas as pd 
import scipy.stats as sc 
import numpy as np
import sys 

In [4]:
# Cell 2

# Insert the path of modules folder 
sys.path.insert(0, "src")

# Import module function "get_filenames" from "functions.py"
import functions
from functions import get_filenames

# Get devset data
# dev_filenames = get_filenames("dataSplits.csv", test=False)
# dev_filenames = sorted(test_filenames)

# Get testset data
test_filenames = get_filenames(splits, test=True)
test_filenames = sorted(test_filenames)
print(test_filenames)

['1963_DE_K1_11_M_08P.conllup', '1963_DE_K3_02_M_05P.conllup', '2013_DE_GK1_02_M_08P.conllup', '2013_DE_LK1_02_W_10P.conllup']


# 2. Functions
MTLD measure
- Compute MTLD values according to McCarthy and Jarvis (2010)
- The higher MTLD value the more complex the text (calculates mean segment length)


- Cell \#3: Function for computing the forward MTLD value
- Cell \#4: Function for computing the partial MTLD
- Cell \#5: Function for computing the reverse MTLD value

In [5]:
# Cell 3
# Compute forward MTLD value
def compute_MTLD(lemma_list, year):
    """Function that computes MTLD value of a text
    Input: Lemmas (list)
    Output: factor count & remaining TTR value for computing the partial MTLD"""
    
    factor_count = 0
    tokens = 0
    wordtypes = {}
        
    # Compute factor count for forward run
    for lemma in lemma_list:
        temp_TTR = []
        tokens += 1
        
        # Count word types with dict (= len(dict))
        if lemma not in wordtypes:
            wordtypes[lemma] = 1
        else:
            wordtypes[lemma] = +1
        
        types_sum = len(wordtypes)
        
        # Compute TTR value
        TTR_value = (types_sum / tokens)
        
        temp_TTR.append(TTR_value)
        
        # End of a segment at first TTR < 0.72, reset all counters
        if TTR_value < 0.72:
            factor_count += 1
            
            tokens = 0
            TTR_value = 0
            wordtypes = {}
            temp_TTR =[]

    return factor_count, temp_TTR

In [6]:
# Cell 4
def partial_MTLD(temp_TTR, MTLD_value):
    """Function for computing partial MTLD factors
    Input: TTR value > 0.72 (list),
    Output: """

    # Partial factor berechnen
    if temp_TTR != []:
        part = 1 - float(temp_TTR[0])
        partial_factor = part / 0.28

        final_MTLD = float(MTLD_value) + partial_factor
        
    # If there is no partial factor to be calculated (empty list)
    else:
        final_MTLD = float(MTLD_value)

    return final_MTLD

In [7]:
# Cell 5
def compute_MTLD_reverse(lemma_list):
    """Function that computes MTLD reverse values
    Input: Lemmas of a text (list)
    Output: factor count, TTR value of tokens at the en of a text (no whole segment)"""
    
    factor_count_reverse = 0
    tokens = 0
    wordtypes = {}
        
    # Compute factor count for reverse run
    for lemma in reversed(lemma_list):
        temp_TTR_reverse = []
        tokens += 1
        
        # Count word types with dict (= len(dict))
        if lemma not in wordtypes:
            wordtypes[lemma] = 1
        else:
            wordtypes[lemma] = +1
        
        types_sum = len(wordtypes)
        
         # Compute TTR value
        TTR_value = (types_sum / tokens)

        temp_TTR_reverse.append(TTR_value)
        
        # End of a segment at first TTR < 0.72, reset all counters
        if TTR_value < 0.72:
            factor_count_reverse += 1

            tokens = 0
            TTR_value = 0
            wordtypes = {}
            temp_TTR_reverse = []

    return factor_count_reverse, temp_TTR_reverse

MSTTR

The MSTTR value we first considered using as a measure of lexical diversity, but when it comes to shorter texts the MSTTR has some disadvantages:
- The MSTTR has to discard the tokens at the end of the text that do not form a whole segment.
- Choice of segment size is troublesome, since the smaller the segment, the lower the sensitivity of the index. The TTR value in small segments will be high, regardless of its actual complexity, due to small segments allowing for many different word types. Larger segments require larger texts and when forming large segments, it is likely that more data will get discarded (McCarthy and Jarvis, 2010).


For the reasons meantioned above, we decided to refrain from using the MSTTR approach as a measure of lexical diversity.


- Cell \#6: Function for computing the MSTTR value

In [8]:
# Cell 6
def compute_MSTTR(lemma_list):
    """Function that computes MSTTR value of a text, segment size = 100 tokens
    Input: lemmas of a text (list)
    Output: MSTTR value"""

    tokens = 0
    wordtypes = {}
    TTRs = []
        
    # Compute MSTTR value
    for lemma in lemma_list:
        tokens += 1
        
        # Count word types with dict (= len(dict))
        if lemma not in wordtypes:
            wordtypes[lemma] = 1
        else:
            wordtypes[lemma] = +1
        
        types_sum = len(wordtypes)
        
        # Compute TTR value
        TTR_value = (types_sum / tokens)
        
        # End of segment, reset all counters
        if tokens == 100:
            TTRs.append(TTR_value)
            TTR_value = 0
            tokens = 0
            wordtypes = {}
    
    # Compute final TTR value
    TTR = (sum(TTRs)/len(TTRs))
        
    return TTR

HD-D

- Compute HD-D values according to McCarthy (2007).
- The higher the HD-D value, the more complex the text.


- Cell \#7: Function for computing the HD-D value

In [9]:
# Cell 7
def compute_HDD(lemma_list):
    """ Function that computes HD-D value a text, sample size = 42
    Input: lemmas of a text (list)
    Output: HD-D value """
    
    freqs = {}
    all_ps = []
    
    # Get word types and freqs
    for lemma in lemma_list:
        if lemma not in freqs:
            freqs[lemma] = 1
        else:
            freqs[lemma] += 1
    

    for lemma_type, lemma_freqs in freqs.items():
        
        # Compute probability for non-occurance
        hpd = sc.hypergeom(len(lemma_list), lemma_freqs, 42)
        p_non_occurance = hpd.pmf(0)

        # Compute probability for lemma occurance in a sample of 42 
        p = 1-p_non_occurance
        final_p = (1/42) * p
        all_ps.append(final_p)
    
    # Compute final HD-D value 
    HDD_val = sum(all_ps)
         
    return HDD_val
        

MATTR

- Compute MATTR values according to Covington and McFall (2010)
- The higher the MATTR value, the more complex the text
- The computation includes a sliding window and a window size of 500


- Cell \#8: Function for computing the MATTR value

In [10]:
# Cell 8
def compute_MATTR(lemma_list):
    """Function that computes MATTR value of a text, window size = 500
    Input: Lemma of a text (list)
    Output: MATTR value"""
    
    wordtypes = {}
    tokens = 0
    all_TTR = []
    
    # Get MATTR value (Abitur texts: -500, Express/Zeit: -499)
    for index in range(len(lemma_list[:-499])):

        TTR_val = 0
        wordtypes = {}
        
        lemmas = lemma_list[index:index+500]
        
        # Count word types with dict (= len(dict))
        for tok in lemmas:
            if tok not in wordtypes:
                wordtypes[tok] = 1
            else:
                wordtypes[tok] = +1
        
        TTR_val = len(wordtypes) / 500
       
        all_TTR.append(TTR_val)
    
    # Comute final MATTR value
    MATTR_val = (sum(all_TTR) / len(all_TTR))
    
    return MATTR_val   

# 3. Computing All Measure Values

- Cell \#9: Reading all files with pandas
    - Then computing all measure values
    - Saving all data in dictionary *all_data*
        - keys: year, measure, filename
        - value: measure values

In [11]:
# Cell 9
# Dictionary for saving the values of all measures 
all_data = {}

for filename in test_filenames:
    path = "data/" + filename
    #path = "d:/Hausarbeit/graphvar_1963-2018_DE_conll/" + filename
    year = filename[0:4]
            
    #  Dict "all_data" for saving all measure values  
    if year not in all_data:
        all_data[year] = {}
        
        all_data[year]["MTLD"] = {}
        all_data[year]["MSTTR"] = {}
        all_data[year]["HD-D"] = {}
        all_data[year]["MATTR"] = {}
        
    # Read all files with pandas
    with open(path, "r", encoding="UTF-8") as file:
        column_names = file.readline().replace("# global.columns =", "").strip().split()

        df = pd.read_csv(path, comment="#", sep="\t", quoting=3, header=None, names=column_names)
        df = df.astype({"UEBERSCHRIFT": str}, errors='raise')
        df = df[df.UEBERSCHRIFT == "0"]
        
        # Extract lemmas
        lemma_series = df.LEMMA
        lemma_list = list(lemma_series)
                       
        # Only count <B-> tags
        for index, lemma in enumerate(lemma_list):
            if "<I->" in lemma:
                lemma_list.pop(index)
            elif "<E->" in lemma:
                lemma_list.pop(index)
        
        # Call functions for computing MTLD value (forward & reverse run)
        MTLD_value, temp_TTR = compute_MTLD(lemma_list, year)
        final_MTLD1 = partial_MTLD(temp_TTR, MTLD_value)
        final_MTLD1 = len(lemma_list)/final_MTLD1
        
        MTLD_value_reverse, temp_TTR_reverse = compute_MTLD_reverse(lemma_list)
        final_MTLD = partial_MTLD(temp_TTR_reverse, MTLD_value_reverse)
        final_MTLD = len(lemma_list)/final_MTLD
        
        # Call function for computing final MTLD value (per text)
        MTLD_val = final_MTLD1 + final_MTLD /2
        
        # Call function for computing MSTTR value (per text)
        MSTTR_val = compute_MSTTR(lemma_list)
        
        # Call function for computing HD-D value (per text)
        HDD_val = compute_HDD(lemma_list)
        
        # Call function for computing MATTR value (per text)
        MATTR_val = compute_MATTR(lemma_list)
        
        # key: year, key: measure, key: filename, value: measure value
        all_data[year]["MTLD"][filename] = MTLD_val
        all_data[year]["MATTR"][filename] = MATTR_val
        all_data[year]["MSTTR"][filename] = MSTTR_val
        all_data[year]["HD-D"][filename] = HDD_val
        

# 4. Computing Standard Deviations and Means

- Cell \#10: Creating a dataframe for all data and computing the standard deviation and mean for each year

In [12]:
# Cell 10

# Use for test set with all data
"""
df_alldata=pd.DataFrame({'MTLD_63': all_data["1963"]["MTLD"],
                        'MSTTR_63': all_data["1963"]["MSTTR"],
                        'HD-D_63': all_data["1963"]["HD-D"],
                         'MATTR_63': all_data["1963"]["MATTR"],
                        'MTLD_68': all_data["1968"]["MTLD"],
                        'MSTTR_68': all_data["1968"]["MSTTR"],
                        'HD-D_68': all_data["1968"]["HD-D"],
                         'MATTR_68': all_data["1968"]["MATTR"],
                        'MTLD_74': all_data["1974"]["MTLD"],
                        'MSTTR_74': all_data["1974"]["MSTTR"],
                        'HD-D_74': all_data["1974"]["HD-D"],
                         'MATTR_74': all_data["1974"]["MATTR"],
                        'MTLD_78': all_data["1978"]["MTLD"],
                        'MSTTR_78': all_data["1978"]["MSTTR"],
                        'HD-D_78': all_data["1978"]["HD-D"],
                         'MATTR_78': all_data["1978"]["MATTR"],
                        'MTLD_83': all_data["1983"]["MTLD"],
                        'MSTTR_83': all_data["1983"]["MSTTR"],
                        'HD-D_83': all_data["1983"]["HD-D"],
                         'MATTR_83': all_data["1983"]["MATTR"],
                        'MTLD_88': all_data["1988"]["MTLD"],
                        'MSTTR_88': all_data["1988"]["MSTTR"],
                        'HD-D_88': all_data["1988"]["HD-D"],
                         'MATTR_88': all_data["1988"]["MATTR"],
                        'MTLD_93': all_data["1993"]["MTLD"],
                        'MSTTR_93': all_data["1993"]["MSTTR"],
                        'HD-D_93': all_data["1993"]["HD-D"],
                         'MATTR_93': all_data["1993"]["MATTR"],
                        'MTLD_98': all_data["1998"]["MTLD"],
                        'MSTTR_98': all_data["1998"]["MSTTR"],
                        'HD-D_98': all_data["1998"]["HD-D"],
                         'MATTR_98': all_data["1998"]["MATTR"],
                        'MTLD_03': all_data["2003"]["MTLD"],
                        'MSTTR_03': all_data["2003"]["MSTTR"],
                        'HD-D_03': all_data["2003"]["HD-D"],
                         'MATTR_03': all_data["2003"]["MATTR"],
                        'MTLD_08': all_data["2008"]["MTLD"],
                        'MSTTR_08': all_data["2008"]["MSTTR"],
                        'HD-D_08': all_data["2008"]["HD-D"],
                         'MATTR_08': all_data["2008"]["MATTR"],
                        'MTLD_13': all_data["2013"]["MTLD"],
                        'MSTTR_13': all_data["2013"]["MSTTR"],
                        'HD-D_13': all_data["2013"]["HD-D"],
                         'MATTR_13': all_data["2013"]["MATTR"]})"""

# Use for recreating study with data in git repository
df_alldata=pd.DataFrame({'MTLD_63': all_data["1963"]["MTLD"],
                        'MSTTR_63': all_data["1963"]["MSTTR"],
                        'HD-D_63': all_data["1963"]["HD-D"],
                         'MATTR_63': all_data["1963"]["MATTR"],
                        'MTLD_13': all_data["2013"]["MTLD"],
                        'MSTTR_13': all_data["2013"]["MSTTR"],
                        'HD-D_13': all_data["2013"]["HD-D"],
                         'MATTR_13': all_data["2013"]["MATTR"]})

# Compute standard deviation for each year
df_std = df_alldata.std(axis= 0)
std_list = df_std.tolist()

# Compute mean for each year
df_mean = df_alldata.mean(axis= 0)
mean_list = df_mean.tolist()

# 5. Creating and Displaying Results with Pandas

- Cell \#11: df_allstd
    - standard deviation for each year
- Cell \#12: df_allmeans
    - mean for each year
- Cell \#13: df_all
    - standard deviation for each year
    - mean for each year
    - all student values for each year (list)
- Cell \#14: shows means, all student values, and standard deviations for every year in different dataframes (per measure)

In [13]:
# Cell 11
# Lists for all values of standard deviation of all measures: MTLD/MSTTR/HD-D/MATTR_std
MTLD_std = std_list[0::4]
MSTTR_std = std_list[1::4]
HDD_std = std_list[2::4]
MATTR_std = std_list[3::4]

# Dataframe for all standard deviations
df_allstd = pd.DataFrame({'YEAR': all_data.keys(),
                        'MTLD_std': MTLD_std,
                        'HDD_std': HDD_std,
                         'MATTR_std': MATTR_std,
                         # results not used in test phase
                        'MSTTR_std': MSTTR_std,})
display(df_allstd)

Unnamed: 0,YEAR,MTLD_std,HDD_std,MATTR_std,MSTTR_std
0,1963,29.608442,0.050915,0.033456,0.075334
1,2013,6.005676,0.000964,0.01252,0.020873


In [14]:
# Cell 12
# Lists for all values of yearly means of all measures: MTLD/MSTTR/HD-D/MATTR
MTLD_mean = mean_list[0::4]
MSTTR_mean = mean_list[1::4]
HDD_mean = mean_list[2::4]
MATTR_mean = mean_list[3::4]

# Dataframe for all yearly means
df_allmeans = pd.DataFrame({'YEAR': all_data.keys(),
                        'MTLD_mean': MTLD_mean,
                        'HDD_mean': HDD_mean,
                        'MATTR_mean':MATTR_mean,
                         # results not used in test phase
                        'MSTTR_mean': MSTTR_mean,})
display(df_allmeans)

Unnamed: 0,YEAR,MTLD_mean,HDD_mean,MATTR_mean,MSTTR_mean
0,1963,61.330251,0.73659,0.376154,0.581731
1,2013,75.069186,0.774007,0.388931,0.61399


In [15]:
# Cell 13
# Empty list for every measure
all_MTLD =[]
all_MSTTR = []
all_HDD = []
all_MATTR = []

# Sort all measure values in corresponding list
for year in all_data:
    mtld = []
    msttr = []
    hdd = []
    mattr = []
    for mtld_data in all_data[year]["MTLD"].values():
        mtld.append(mtld_data)
    for msttr_data in all_data[year]["MSTTR"].values():
        msttr.append(msttr_data)
    for hdd_data in all_data[year]["HD-D"].values():
        hdd.append(hdd_data)
    for mattr_data in all_data[year]["MATTR"].values():
        mattr.append(mattr_data)
    
    # Append to list
    all_MTLD.append(mtld)
    all_MSTTR.append(msttr)
    all_HDD.append(hdd) 
    all_MATTR.append(mattr)
    
# Dataframe for all means/standard deviations/student values 
df_all = pd.DataFrame({'YEAR': all_data.keys(),
                      'MTLD_mean': MTLD_mean,
                      'HDD_mean': HDD_mean,
                      'MATTR_mean': MATTR_mean,
                      'MSTTR_mean': MSTTR_mean,
                       'MTLD_std': MTLD_std,
                       'HDD_std': HDD_std,
                       'MATTR_std': MATTR_std,
                       'MSTTR_std': MSTTR_std,
                      'all_MTLD': all_MTLD,
                      'all_HDD': all_HDD,
                      'all_MATTR': all_MATTR,
                      'all_MSTTR': all_MSTTR})

display(df_all)

Unnamed: 0,YEAR,MTLD_mean,HDD_mean,MATTR_mean,MSTTR_mean,MTLD_std,HDD_std,MATTR_std,MSTTR_std,all_MTLD,all_HDD,all_MATTR,all_MSTTR
0,1963,61.330251,0.73659,0.376154,0.581731,29.608442,0.050915,0.033456,0.075334,"[82.26658070115361, 40.39392054805006]","[0.7725926326423203, 0.7005882428590878]","[0.3998107344632748, 0.35249648711943804]","[0.6350000000000001, 0.5284615384615385]"
1,2013,75.069186,0.774007,0.388931,0.61399,6.005676,0.000964,0.01252,0.020873,"[79.31584066645584, 70.82253179890975]","[0.77332480200875, 0.7746883306356062]","[0.39778349600709917, 0.3800776882993835]","[0.6287500000000001, 0.5992307692307691]"


In [16]:
# Cell 14

# MTLD
df_MTLD = pd.DataFrame({'YEAR': all_data.keys(),
                      'YEAR_VAL': MTLD_mean,
                      'STUDENT_VALS': all_MTLD, 
                      'STUDENT_STD': MTLD_std})
# HD-D
df_HDD =pd.DataFrame({'YEAR': all_data.keys(),
                      'YEAR_VAL': HDD_mean,
                      'STUDENT_VALS': all_HDD, 
                      'STUDENT_STD': HDD_std})
# MATTR
df_MATTR = pd.DataFrame({'YEAR': all_data.keys(),
                      'YEAR_VAL': MATTR_mean,
                      'STUDENT_VALS': all_MATTR, 
                      'STUDENT_STD': MATTR_std})

# MSTTR
df_MSTTR = pd.DataFrame({'YEAR': all_data.keys(),
                      'YEAR_VAL': MSTTR_mean,
                      'STUDENT_VALS': all_MSTTR, 
                      'STUDENT_STD': MSTTR_std})

# Display all dataframes
print("MTLD:")
display(df_MTLD)
print("HD-D:")
display(df_HDD)
print("MATTR:")
display(df_MATTR)
print("MSTTR:")
display(df_MSTTR)

# Create .csv for each measure
#outpath = "results/1_lex/dev_results"
#outpath = "results/1_lex/test_results"
outpath = "results/1_lex_demo/"

import os
os.makedirs(outpath, exist_ok=True)  
    
# Create .csv for each measure 
df_MTLD.to_csv(outpath + "MTLD.csv", sep=",")
df_HDD.to_csv(outpath + "HDD.csv", sep=",")
df_MATTR.to_csv(outpath + "MATTR.csv", sep=",")

MTLD:


Unnamed: 0,YEAR,YEAR_VAL,STUDENT_VALS,STUDENT_STD
0,1963,61.330251,"[82.26658070115361, 40.39392054805006]",29.608442
1,2013,75.069186,"[79.31584066645584, 70.82253179890975]",6.005676


HD-D:


Unnamed: 0,YEAR,YEAR_VAL,STUDENT_VALS,STUDENT_STD
0,1963,0.73659,"[0.7725926326423203, 0.7005882428590878]",0.050915
1,2013,0.774007,"[0.77332480200875, 0.7746883306356062]",0.000964


MATTR:


Unnamed: 0,YEAR,YEAR_VAL,STUDENT_VALS,STUDENT_STD
0,1963,0.376154,"[0.3998107344632748, 0.35249648711943804]",0.033456
1,2013,0.388931,"[0.39778349600709917, 0.3800776882993835]",0.01252


MSTTR:


Unnamed: 0,YEAR,YEAR_VAL,STUDENT_VALS,STUDENT_STD
0,1963,0.581731,"[0.6350000000000001, 0.5284615384615385]",0.075334
1,2013,0.61399,"[0.6287500000000001, 0.5992307692307691]",0.020873


# 6. Computing the lexical diversity of the Express and Zeit corpora

- Cell \#15: Function for getting lemmas from the EXPRESS and ZEIT corpora
- Cell \#16: Computing the lexical diversity for the EXPRESS and ZEIT corpora
- Cell \#17: Sorting results
- Cell \#18: displaying results in pandas dataframe (per measure)
- Cell \#19: converting results to .csv file)

In [17]:
# Cell 15

def get_lemmas(filename):
    """Function for extracting Lemmas from the Express and Zeit articles
    Input: filename
    Output: Lemmas (list)"""
    
    article_lemmas = []
    lemma_list = []

    # Read and tag files
    with open(filename, "r", encoding="UTF-8") as file:
        infile = file.readlines()
        
        for t in infile:
            
            # differentiate between different articles
            if t != "\n":
                article_lemmas.append(t)
                
            # clear article_lemmas when new article starts    
            else:
                lemma_list.append(article_lemmas)
                article_lemmas = []

    # return value: A list of all lemmas of a file
    return lemma_list

In [18]:
# Cell 16

# Filenames
expr_zeit = ["zeit_all.wpl", "express_all.wpl"]
#expr_zeit = ["express1tokens.txt", "zeit1tokens_edited_corrected.txt"]

all_data = {}

ex_MTLD_std = []
ex_HDD_std = []
ex_MSTTR_std =[]
ex_MATTR_std = []

z_MTLD_std = []
z_HDD_std = []
z_MSTTR_std = []
z_MATTR_std = []

for filename in expr_zeit:
    # path for reading EXPRESS and ZEIT corpora
    #path = "d:/Hausarbeit/random_corpora_1208_replacement/" + filename
    
    path = "data/" + filename
    journal = filename[0:4]
    
    #  Dict "all_data" for saving all measure values  
    if journal not in all_data:
        all_data[journal] = {}
        
        all_data[journal]["MTLD"] = {}
        all_data[journal]["MSTTR"] = {}
        all_data[journal]["HD-D"] = {}
        all_data[journal]["MATTR"] = {}
    
    # Call function for extracting lemmas from articles
    lemma_list = get_lemmas(path)
    
    # Empty lists for storing measure values
    results_MTLD = []
    results_MSTTR = []
    results_HDD = []
    results_MATTR = []

    for article in lemma_list:
        # Call functions for computing MTLD value (forward & reverse run)
        MTLD_value, temp_TTR = compute_MTLD(article, journal)
        final_MTLD1 = partial_MTLD(temp_TTR, MTLD_value)
        final_MTLD1 = len(article)/final_MTLD1
        
        MTLD_value_reverse, temp_TTR_reverse = compute_MTLD_reverse(article)
        final_MTLD = partial_MTLD(temp_TTR_reverse, MTLD_value_reverse)
        final_MTLD = len(article)/final_MTLD

        # Computing final MTLD value
        MTLD_val = final_MTLD1 + final_MTLD /2
        
        # Call functions for computing MSTTR, HD-D, and MATTR value 
        MSTTR_val = compute_MSTTR(article)
        HDD_val = compute_HDD(article)
        MATTR_val = compute_MATTR(article)
        
        # Append values to list
        results_MTLD.append(MTLD_val)
        results_MSTTR.append(MSTTR_val)
        results_HDD.append(HDD_val)
        results_MATTR.append(MATTR_val)
        
        if "zeit" in filename:
            z_MTLD_std.append(MTLD_val)
            z_HDD_std.append(HDD_val)
            z_MSTTR_std.append(MSTTR_val)
            z_MATTR_std.append(MATTR_val)
        else:
            
            ex_MTLD_std.append(MTLD_val)
            ex_HDD_std.append(HDD_val)
            ex_MSTTR_std.append(MSTTR_val)
            ex_MATTR_std.append(MATTR_val)
        
    # Compute final values
    end_MTLD_val = sum(results_MTLD) / len(results_MTLD)
    end_MSTTR_val = sum(results_MSTTR) / len(results_MSTTR)
    end_HDD_val = sum(results_HDD) / len(results_HDD)
    end_MATTR_val = sum(results_MATTR) / len(results_MATTR)
    
    # Insert values in dictionary
    all_data[journal]["MTLD"][filename] = end_MTLD_val
    all_data[journal]["MSTTR"][filename] = end_MSTTR_val
    all_data[journal]["HD-D"][filename] = end_HDD_val
    all_data[journal]["MATTR"][filename] = end_MATTR_val

In [19]:
# Cell 17

keys_e = all_data['expr']['MTLD'].keys()
keys_t = all_data['zeit']['MTLD'].keys()
keys_e = list(keys_e)
keys_t = list(keys_t)
MTLD_filenames = keys_e+keys_t

# HD-D
keys_e = all_data['expr']['HD-D'].keys()
keys_t = all_data['zeit']['HD-D'].keys()
keys_e = list(keys_e)
keys_t = list(keys_t)
HDD_filenames = keys_e+keys_t

# MSTTR
keys_e = all_data['expr']['MSTTR'].keys()
keys_t = all_data['zeit']['MSTTR'].keys()
keys_e = list(keys_e)
keys_t = list(keys_t)
MSTTR_filenames = keys_e+keys_t

# MATTR
keys_e = all_data['expr']['MATTR'].keys()
keys_t = all_data['zeit']['MATTR'].keys()
keys_e = list(keys_e)
keys_t = list(keys_t)
MATTR_filenames = keys_e+keys_t

### VALUES ###
# MTLD
vals_e = all_data['expr']['MTLD'].values()
vals_t = all_data['zeit']['MTLD'].values()
vals_e = list(vals_e)
vals_t = list(vals_t)
MTLD_vals = vals_e+vals_t

# HD-D
vals_e = all_data['expr']['HD-D'].values()
vals_t = all_data['zeit']['HD-D'].values()
vals_e = list(vals_e)
vals_t = list(vals_t)
HDD_vals = vals_e+vals_t

# MSTTR
vals_e = all_data['expr']['MSTTR'].values()
vals_t = all_data['zeit']['MSTTR'].values()
vals_e = list(vals_e)
vals_t = list(vals_t)
MSTTR_vals = vals_e+vals_t

# MATTR
vals_e = all_data['expr']['MATTR'].values()
vals_t = all_data['zeit']['MATTR'].values()
vals_e = list(vals_e)
vals_t = list(vals_t)
MATTR_vals = vals_e+vals_t


In [20]:
# Cell 18
# MTLD
df_all_MTLD = pd.DataFrame({'MEASURE': ['MTLD', 'MTLD'],
                        'FILENAME': MTLD_filenames,
                        'ALL_VALS' : [ex_MTLD_std, z_MTLD_std],
                        'MEAN': MTLD_vals,
                         'STD': [np.std(ex_MTLD_std), np.std(z_MTLD_std)]})
print("MTLD:")
display(df_all_MTLD)

# HD-D
df_all_HDD = pd.DataFrame({'MEASURE': ['HD-D', 'HD-D'],
                        'FILENAME': HDD_filenames,
                           'ALL_VALS' : [ex_HDD_std, z_HDD_std],
                        'MEAN': HDD_vals,
                          'STD': [np.std(ex_HDD_std), np.std(z_HDD_std)]})
print("HD-D:")
display(df_all_HDD)

# MSTTR
df_all_MSTTR = pd.DataFrame({'MEASURE': ['MSTTR', 'MSTTR'],
                             'FILENAME': MSTTR_filenames,
                             'ALL_VALS' : [ex_MSTTR_std, z_MSTTR_std],
                            'MEAN': MSTTR_vals,
                            'STD': [np.std(ex_MSTTR_std), np.std(z_MSTTR_std)]})
print("MSTTR:")
display(df_all_MSTTR)

# MATTR
df_all_MATTR = pd.DataFrame({'MEASURE': ['MATTR', 'MATTR'],
                             'FILENAME': MATTR_filenames,
                             'ALL_VALS' : [ex_MATTR_std, z_MATTR_std],
                            'MEAN': MATTR_vals,
                            'STD': [np.std(ex_MATTR_std), np.std(z_MATTR_std)]})
print("MATTR:")
display(df_all_MATTR)

# concatinate all dataframes 
print("All:")
frames = [df_all_MTLD, df_all_HDD, df_all_MSTTR, df_all_MATTR]
res = pd.concat(frames)
display(res)


MTLD:


Unnamed: 0,MEASURE,FILENAME,ALL_VALS,MEAN,STD
0,MTLD,express_all.wpl,"[240.37760325423096, 154.98494623655915, 213.6...",203.019433,35.669448
1,MTLD,zeit_all.wpl,"[235.02889825406385, 225.39974327852536, 232.0...",230.835705,4.028111


HD-D:


Unnamed: 0,MEASURE,FILENAME,ALL_VALS,MEAN,STD
0,HD-D,express_all.wpl,"[0.8830422455991954, 0.8628436007334597, 0.872...",0.872871,0.008247
1,HD-D,zeit_all.wpl,"[0.8663459490127235, 0.8600092374671684, 0.856...",0.860912,0.004118


MSTTR:


Unnamed: 0,MEASURE,FILENAME,ALL_VALS,MEAN,STD
0,MSTTR,express_all.wpl,"[0.792, 0.7459999999999999, 0.772]",0.77,0.018833
1,MSTTR,zeit_all.wpl,"[0.776, 0.7779999999999999, 0.772]",0.775333,0.002494


MATTR:


Unnamed: 0,MEASURE,FILENAME,ALL_VALS,MEAN,STD
0,MATTR,express_all.wpl,"[0.608, 0.5258823529411765, 0.5853846153846154]",0.573089,0.034633
1,MATTR,zeit_all.wpl,"[0.5995999999999999, 0.6144444444444443, 0.602...",0.605348,0.006506


All:


Unnamed: 0,MEASURE,FILENAME,ALL_VALS,MEAN,STD
0,MTLD,express_all.wpl,"[240.37760325423096, 154.98494623655915, 213.6...",203.019433,35.669448
1,MTLD,zeit_all.wpl,"[235.02889825406385, 225.39974327852536, 232.0...",230.835705,4.028111
0,HD-D,express_all.wpl,"[0.8830422455991954, 0.8628436007334597, 0.872...",0.872871,0.008247
1,HD-D,zeit_all.wpl,"[0.8663459490127235, 0.8600092374671684, 0.856...",0.860912,0.004118
0,MSTTR,express_all.wpl,"[0.792, 0.7459999999999999, 0.772]",0.77,0.018833
1,MSTTR,zeit_all.wpl,"[0.776, 0.7779999999999999, 0.772]",0.775333,0.002494
0,MATTR,express_all.wpl,"[0.608, 0.5258823529411765, 0.5853846153846154]",0.573089,0.034633
1,MATTR,zeit_all.wpl,"[0.5995999999999999, 0.6144444444444443, 0.602...",0.605348,0.006506


In [21]:
# Cell 19

#outpath = "results/1_lex/expr_zeit"
outpath = "results/1_lex_demo/"

import os
os.makedirs(outpath, exist_ok=True)  
    
# Create .csv for each measure
df_all_MTLD.to_csv(outpath + "MTLD_expr_zeit.csv", sep=",")
df_all_HDD.to_csv(outpath + "HDD_expr_zeit.csv", sep=",")
df_all_MATTR.to_csv(outpath + "MATTR_expr_zeit.csv", sep=",")

# Create .csv for all measures
res.to_csv(outpath + "all_expr_zeit.csv", sep=',')

# 7. References

- Michael A. Covington and Joe D. McFall. 2010. Cutting the gordian knot: The moving-average type–token ratio (MATTR). *Journal of Quantitative Linguistics*, 17(2):94–100.

- Philip M. McCarthy and Scott Jarvis. 2007. vocd: A theoretical and empirical evaluation. *Language Testing*, 24(4):459–488.

- Philip M. McCarthy and Scott Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. *Behaviour Research Methods*, 42(2):381–392.
