# Corpus Processing – Annotation with spaCy
1. Why spaCy?
2. Which methods do we use?
   * Tokenisation
   * Lemmatisation
   * PoS-Tagging
3. How well do they work?

## 0. Load libraries for analysis

In [None]:
from pathlib import Path
from time import time
from collections import OrderedDict, Counter

from tqdm import tqdm
import pandas as pd
import numpy as np
import spacy

## 1. Read in the txt data

#### 1.1 Set path to corpus directory

In [None]:
corpus_dir = Path(r"../data/txt/")

#### 1.2 Read in the files from the directory

In [None]:
def read_corpus_linewise(corpus_dir: Path) -> OrderedDict[str, str]:
    """
    Reads all txt files from a given directory. Returns a dictionary with the
    file name as key and the full text of the file as value.
    :param Path corpus_dir: The directory in which the txt files are saved
    :return OrderedDict[str, str]: The file names as keys, the file contents as values
    """
    corpus = OrderedDict()
    # sort the paths so that the corpus order is the same on every run
    for filepath in sorted(corpus_dir.iterdir()):
        if filepath.suffix == ".txt":
            # assumes the files are UTF-8 encoded; adjust the encoding if they are not
            text = filepath.read_text(encoding="utf-8")
            corpus[filepath.name] = text
    return corpus

In [None]:
corpus = read_corpus_linewise(corpus_dir)

**Check**: How many files does the corpus include?

In [None]:
print(len(corpus))

## 2. Word frequencies with lazy tokenization

In [None]:
all_texts = " ".join(corpus.values())
words = all_texts.split()

**Check**: What do the word lists look like?

In [None]:
words[50:60]

How big is the corpus (number of words)?

In [None]:
len(words)

Which words occur how often?

In [None]:
word_frequencies = Counter(words)

In [None]:
chosen_word = input("Input a word for which the frequency will be shown: ")

In [None]:
word_frequencies[chosen_word]
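
Splitting on whitespace leaves punctuation attached to the neighbouring word, so one word can be spread over several keys in the counter. A minimal sketch of the effect, using an invented sample sentence rather than the corpus:

```python
from collections import Counter

# Invented sample text, not taken from the corpus
sample = "Der Hund bellt. Der Hund, der bellt, beißt nicht."

naive_tokens = sample.split()
naive_counts = Counter(naive_tokens)

# "Hund" and "Hund," are counted as two different words
print(naive_tokens)
print(naive_counts["Hund"], naive_counts["Hund,"])

# "bellt" never occurs on its own, only as "bellt." and "bellt,"
print(naive_counts["bellt"], naive_counts["bellt."], naive_counts["bellt,"])
```

This is why a frequency looked up in `word_frequencies` can be lower than the number of times the word actually appears in the texts.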

## 3. Annotation with spaCy
An overview of the available spaCy models can be found [here](https://spacy.io/models). \
Load a language-specific model (selection):
* German: 'de_core_news_sm'

### 3.1 Setting up the Pipeline

In [None]:
# Uncomment and run once if the model is not installed yet
# ! python -m spacy download de_core_news_sm

In [None]:
# Load language-specific model
nlp = spacy.load('de_core_news_sm')

In [None]:
# Exclude analysis components to improve the processing speed
disable_components = ['ner', 'morphologizer', 'attribute_ruler', 'sentencizer']
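
A name in the disable list can only save time if a component with that name is actually part of the loaded pipeline. The sketch below compares the two lists. `split_disable_list` is a hypothetical helper, and `example_pipe_names` is an assumed pipeline used only for illustration; in the notebook, the real names are available as `nlp.pipe_names`.

```python
def split_disable_list(pipe_names, disable):
    """Split the disable list into names that exist in the pipeline and names that do not."""
    present = [name for name in disable if name in pipe_names]
    unknown = [name for name in disable if name not in pipe_names]
    return present, unknown


# Assumed component names for illustration only; replace with nlp.pipe_names
example_pipe_names = ["tok2vec", "tagger", "morphologizer", "parser",
                      "lemmatizer", "attribute_ruler", "ner"]
disable = ["ner", "morphologizer", "attribute_ruler", "sentencizer"]

present, unknown = split_disable_list(example_pipe_names, disable)
print("Will be skipped:", present)
print("Not in the pipeline:", unknown)
```

Any name reported as not in the pipeline is worth double-checking for a typo or a component that the model does not ship with.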

### 3.2 Annotation of the Texts: Token, Lemma, PoS

In [None]:
def annotate_corpus(corpus: OrderedDict[str, str], disable_components: list[str]) -> dict[str, pd.DataFrame]:
    """
    Annotate a corpus (filename: text) with spacy. Collect the Token, PoS and Lemma information. 
    Save the annotation information as a pandas DataFrame. 
    :param OrderedDict[str, str] corpus: The file names as keys, the file content as value
    :param list[str] disable_components: spacy components to be disabled in the annotation process
    :return dict[str, pd.DataFrame]: The file name as keys, the annotated text as value
    """
    # list to collect how long the annotation runs take in seconds
    took_per_text = []

    # define result dict
    corpus_annotated = {}
    
    filename_list = list(corpus.keys())
    current = time()
    
    # iterate over all corpus values, annotate them with spacy
    # (total is passed so that tqdm can show a proper progress bar)
    for i, doc in tqdm(enumerate(nlp.pipe(corpus.values(), disable=disable_components)), total=len(corpus)):
        before = current
        current = time()
        took_per_text.append(current - before)

        # Save the token, PoS and Lemma information to a dictionary
        text_annotated = {}
        text_annotated['Token'] = [tok.text for tok in doc]
        text_annotated['Lemma'] = [tok.lemma_ for tok in doc]
        text_annotated['PoS'] = [tok.tag_ for tok in doc]  # tag_ is the fine-grained tag set by the tagger

        # Save the annotation as pandas DataFrame to the result dict
        # Key is the current filename
        corpus_annotated[filename_list[i]] = pd.DataFrame(text_annotated)

    # print corpus size and performance
    print(f"""Processed {len(corpus_annotated)} texts with spacy.
    Took {round(np.mean(took_per_text), 4)} seconds per text on average.
    Took {round(np.sum(took_per_text) / 60, 4)} minutes in total.""")

    return corpus_annotated

In [None]:
corpus_annotated = annotate_corpus(corpus, disable_components)
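
The timing in `annotate_corpus` measures the interval between two consecutive documents, which also includes the time spent building the DataFrame for the previous one. As an alternative, the following sketch times only the wait for each item of a generator. `timed` and `slow_squares` are hypothetical names, and the toy generator stands in for `nlp.pipe`. Since `nlp.pipe` works in batches, per-text times measured either way are likely to be uneven, so the total is the more reliable figure.

```python
from time import perf_counter


def timed(iterable):
    """Yield (item, seconds), where seconds is the time spent waiting for the item."""
    iterator = iter(iterable)
    while True:
        start = perf_counter()
        try:
            item = next(iterator)
        except StopIteration:
            return
        yield item, perf_counter() - start


def slow_squares(n):
    """Toy generator standing in for an expensive pipeline."""
    for i in range(n):
        yield i * i


results = list(timed(slow_squares(3)))
items = [item for item, _ in results]
seconds = [sec for _, sec in results]
print(items)
print(f"Total: {sum(seconds):.6f} seconds")
```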

### 3.3 Annotated Text as Table

**Check**: What do the annotations look like?

In [None]:
corpus_annotated[list(corpus_annotated.keys())[0]].head()

### 3.4 Word Frequencies with Real Tokenization   

In [None]:
all_words_tokenized = [word for text in corpus_annotated.values() for word in text.Token]
len(all_words_tokenized)

In [None]:
words_tokenized_frequencies = Counter(all_words_tokenized)
words_tokenized_frequencies[chosen_word]
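
To see where the two tokenizations disagree beyond a single chosen word, the two counters can be compared directly. A minimal sketch with invented toy counts; `largest_count_differences` is a hypothetical helper, and in the notebook the arguments would be `word_frequencies` and `words_tokenized_frequencies`.

```python
from collections import Counter


def largest_count_differences(counts_a, counts_b, n=5):
    """Return up to n (word, counts_b - counts_a) pairs with the largest absolute difference."""
    keys = set(counts_a) | set(counts_b)
    # A Counter returns 0 for missing keys, so no key checks are needed
    diffs = {key: counts_b[key] - counts_a[key]
             for key in keys if counts_a[key] != counts_b[key]}
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


# Invented toy counts for illustration
whitespace_counts = Counter(["Hund", "Hund,", "bellt."])
tokenized_counts = Counter(["Hund", "Hund", ",", "bellt", "."])

differences = largest_count_differences(whitespace_counts, tokenized_counts, n=10)
for word, diff in differences:
    print(f"{word!r}: {diff:+d}")
```

Positive values mark entries that gain occurrences once punctuation is split off; negative values mark whitespace "words" that no longer exist as tokens.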

## 4. Save the annotated corpus as conll files

In [None]:
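output_dir = Path(r"../data/conll")
# create the output directory if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)
for filename, text_annotated in corpus_annotated.items():
    output_path = output_dir / Path(filename).with_suffix(".conll")
    # CoNLL-style files are tab-separated, with one token per line
    text_annotated.to_csv(output_path, sep="\t", index=False)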
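
Tokens such as commas and quotation marks are the ones most likely to break a delimited file, so a quick round-trip check on a small table can be reassuring before the whole corpus is written. The sketch below uses the standard library `csv` module on an in-memory buffer instead of pandas and real files, only to stay dependency-free. The rows are invented for illustration.

```python
import csv
import io

# Invented rows for illustration: header plus three annotated tokens
rows = [
    ("Token", "Lemma", "PoS"),
    ("Der", "der", "ART"),
    (",", ",", "$,"),
    ('"', '"', "$("),
]

# Write the rows tab-separated into an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

# Read them back and compare with the original rows
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer, delimiter="\t")]
print(restored == rows)
```

If the comparison fails for a given choice of delimiter and quoting, the same problem can be expected when the files are read back later.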