# Enrichment of the Corpus

* Now that the corpus has been homogenized, we pre-process it further to prepare it for in-depth analyses.
* To do this, we:
  * tokenize
  * lemmatize
  * PoS-tag
* Several **libraries** can do this:
  * spaCy
  * Stanford CoreNLP
  * NLTK
* Pre-processing is language-specific.

## 1. Read in the txt data

In [None]:
from pathlib import Path

#### 1.1 Set path to corpus directory 
* Replace `../data/txt/` with the path to your data.

In [None]:
# Path to the directory containing the plain-text corpus files
corpus_dir = Path(r"../data/txt/")

#### 1.2 Read in the files from the directory

In [None]:
from collections import OrderedDict

In [None]:
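def read_corpus_linewise(corpus_dir: Path) -> OrderedDict[str, str]:
    """Read every file in corpus_dir and map its filename to its full text."""
    corpus = OrderedDict()
    # Sort the entries: iterdir() yields them in arbitrary order
    for filepath in sorted(corpus_dir.iterdir()):
        if filepath.is_file():
            # Assumption: the corpus files are UTF-8 encoded
            text = filepath.read_text(encoding="utf-8")
            # Optional: join all lines of a text into a single line
            # text = text.replace("\n", " ")
            corpus[filepath.name] = text
    return corpus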
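
The reading step can be tried out without the real corpus. The following is a minimal sketch, assuming UTF-8 encoded files; the function name `read_corpus_sorted` and the two demo files are hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path


def read_corpus_sorted(corpus_dir: Path, encoding: str = "utf-8") -> dict[str, str]:
    """Hypothetical variant: read every file in corpus_dir, sorted by name."""
    corpus = {}
    # iterdir() yields entries in arbitrary order, so sort for reproducibility
    for filepath in sorted(corpus_dir.iterdir()):
        if filepath.is_file():
            corpus[filepath.name] = filepath.read_text(encoding=encoding)
    return corpus


# Self-contained demo on a throwaway directory with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "b.txt").write_text("Zweiter Text", encoding="utf-8")
    (demo_dir / "a.txt").write_text("Erster Text mit Umlaut: ä", encoding="utf-8")
    demo_corpus = read_corpus_sorted(demo_dir)

print(list(demo_corpus))
```

Passing the encoding explicitly avoids depending on the platform default, which matters for German texts with umlauts.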

In [None]:
corpus = read_corpus_linewise(corpus_dir)

**Check**: How many files were read in?

In [None]:
print(len(corpus))

#### 1.3 Read in metadata 

In [None]:
import pandas as pd

In [None]:
metadata_dir = Path(r"../data/metadata/")

In [None]:
metadata_filepath = metadata_dir / "MVP-Test-Korpus_Metadata.csv"

In [None]:
metadata_df = pd.read_csv(metadata_filepath, sep=";")

**Check**: What do our metadata look like?

In [None]:
metadata_df.head()

## 2. Word frequency with lazy tokenization

In [None]:
all_texts = " ".join(corpus.values())

In [None]:
words = all_texts.split(" ")

**Check**: What does the word list look like?

In [None]:
words[50:60]

How many words are there in total?

In [None]:
len(words)

Which words occur how often?

In [None]:
from collections import Counter

In [None]:
word_frequencies = Counter(words)

In [None]:
chosen_word = input("Enter a word whose frequency should be displayed: ")

In [None]:
word_frequencies[chosen_word]
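
Splitting on single spaces leaves punctuation and line breaks attached to the words, so the counts above tend to be too low. The following sketch, on a made-up two-line sample, compares three simple splitting strategies; the variable names are hypothetical and none of this replaces the real tokenization in section 3.

```python
import re
from collections import Counter

# Made-up sample with a line break and punctuation attached to words
sample = "Das Haus ist alt.\nDas Haus, das alte Haus, steht noch."

naive = sample.split(" ")            # splits on single spaces only
whitespace = sample.split()          # splits on any run of whitespace
regex = re.findall(r"\w+", sample)   # keeps runs of letters, digits, underscore

for name, words_variant in [("naive", naive), ("whitespace", whitespace), ("regex", regex)]:
    print(name, Counter(words_variant)["Haus"], Counter(words_variant)["Das"])
```

With the naive split, `"alt.\nDas"` ends up as a single entry and `"Haus,"` is counted separately from `"Haus"`.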

## 3. Load NLP Library
* spaCy is assumed to be installed already; the command for downloading the language model is included, commented out, in section 3.1.

An overview of the available spaCy models is [here](https://spacy.io/models).

Load a language-specific model (selection):
* German: `de_core_news_sm`
* English: `en_core_web_sm`

### 3.1 Load library

In [None]:
import spacy

In [None]:
#! python -m spacy download de_core_news_sm

In [None]:
nlp = spacy.load('de_core_news_sm')

### 3.2 Setting up the pipeline

In [None]:
# Pipeline components that are not needed for token, lemma and PoS extraction
disable_components = ['ner', 'morphologizer', 'attribute_ruler']

### 3.3 Annotate texts and extract token, lemma, pos

In [None]:
from time import time

In [None]:
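took_per_text = []

corpus_annotated = {}
filename_list = list(corpus.keys())
current = time()
# nlp.pipe yields the annotated documents in the same order as the input texts
for i, doc in enumerate(nlp.pipe(corpus.values(), disable=disable_components)):
    # Time since the previous document was yielded; nlp.pipe works in batches,
    # so the individual values are only a rough per-text estimate
    before = current
    current = time()
    took_per_text.append(current - before)
    annotated_text = {}
    annotated_text['Token'] = [tok.text for tok in doc]
    annotated_text['Lemma'] = [tok.lemma_ for tok in doc]
    # tag_ is the fine-grained tag; pos_ would give the coarse universal PoS tag
    annotated_text['PoS'] = [tok.tag_ for tok in doc]

    # Assign each token the running index of the sentence it belongs to
    sentences = []
    sentence_idx = -1
    for token in doc:
        if token.is_sent_start:
            sentence_idx += 1
        sentences.append(sentence_idx)
    annotated_text['Sentence_idx'] = sentences

    corpus_annotated[filename_list[i]] = pd.DataFrame(annotated_text)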
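
The sentence numbering in the loop above is a running count of sentence starts. The following sketch reproduces that logic without spaCy, on a hypothetical token list with hand-set start flags, to show what ends up in the `Sentence_idx` column.

```python
from itertools import accumulate

# Hypothetical tokens with a flag marking the first token of each sentence
tokens = ["Wir", "gehen", ".", "Sie", "lachen", "laut", "."]
is_sent_start = [True, False, False, True, False, False, False]

# Running count of sentence starts, shifted so the first sentence has index 0
sentence_idx = [count - 1 for count in accumulate(int(flag) for flag in is_sent_start)]

print(list(zip(tokens, sentence_idx)))
```

Because the index starts at 0, the number of sentences is the last index plus one.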

#### How long did the annotation take?

In [None]:
import numpy as np

Average per text, in seconds:

In [None]:
np.mean(took_per_text)

All texts together, in seconds:

In [None]:
np.sum(took_per_text)
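
When a pipeline processes its input in batches, the time between two yielded results is uneven: the first item of a batch carries the cost of the whole batch. The following sketch illustrates this with a hypothetical stand-in generator (`batched_pipeline`, which simply sleeps); it does not call spaCy, and the durations are artificial.

```python
from statistics import mean
from time import perf_counter, sleep


def batched_pipeline(items, batch_size=3):
    """Hypothetical stand-in: process a whole batch, then yield its items."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        sleep(0.01 * len(batch))  # pretend the whole batch is processed at once
        yield from batch


texts = [f"text {i}" for i in range(6)]
gaps = []
previous = perf_counter()
for _ in batched_pipeline(texts):
    now = perf_counter()
    gaps.append(now - previous)
    previous = now

print([round(gap, 3) for gap in gaps])
print("mean per text:", round(mean(gaps), 3), "total:", round(sum(gaps), 3))
```

Under this assumption the total and the mean remain meaningful, while the individual per-text values should be read with care.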

**Check**: Is the annotated corpus the same length as the original corpus?

In [None]:
len(corpus_annotated)

**Check**: What does the annotation look like?

In [None]:
corpus_annotated[filename_list[0]].head()

### 3.4 Word frequency with real tokenization

In [None]:
all_words_tokenized = [word for text in corpus_annotated.values() for word in text.Token]

In [None]:
len(all_words_tokenized)

In [None]:
words_tokenized_frequencies = Counter(all_words_tokenized)

In [None]:
words_tokenized_frequencies[chosen_word]

## 4. Extend the metadata

### 4.1 Collect metadata
* Number of lemmas
* Number of unique lemmas
* Number of sentences
* Average sentence length

In [None]:
collected_metadata_extension = []
for filename, annotated_text in corpus_annotated.items():
    metadata_extension = {}
    metadata_extension['Filename'] = filename
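    # One row per token, so the row count is the number of lemma occurrences
    metadata_extension['Lemma_Count'] = len(annotated_text)
    metadata_extension['Lemma_Count_Unique'] = len(set(annotated_text.Lemma))
    # Sentence indices start at 0, so count the distinct indices
    metadata_extension['Sentence_Count'] = annotated_text.Sentence_idx.nunique()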
    metadata_extension['Sentence_Length_Avg'] = annotated_text.groupby('Sentence_idx').Lemma.count().mean()
    collected_metadata_extension.append(metadata_extension)
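
The four measures can be checked by hand on a tiny example. The following sketch uses plain Python lists instead of a DataFrame; the lemmas and sentence indices are made up for illustration. As in the notebook, punctuation counts as a lemma.

```python
from collections import Counter

# Hypothetical annotation of two short sentences
lemmas = ["der", "Hund", "bellen", ".", "der", "Katze", "schlafen", "lang", "."]
sentence_idx = [0, 0, 0, 0, 1, 1, 1, 1, 1]

lemma_count = len(lemmas)                    # every occurrence counts
lemma_count_unique = len(set(lemmas))        # each distinct lemma counts once
tokens_per_sentence = Counter(sentence_idx)  # sentence index -> number of tokens
sentence_count = len(tokens_per_sentence)    # equals last index + 1
sentence_length_avg = lemma_count / sentence_count

print(lemma_count, lemma_count_unique, sentence_count, sentence_length_avg)
```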

In [None]:
metadata_to_extend = pd.DataFrame(collected_metadata_extension)

### 4.2 Add the metadata

In [None]:
metadata_df

In [None]:
metadata_to_extend

In [None]:
# Rebuild each file name from identifier and date so it can serve as the merge key
metadata_df['Filename'] = metadata_df['Identifier'] + '-' + metadata_df['Date'].astype(str) + '-0-0-0-0.txt'

In [None]:
metadata_extendend_df = pd.merge(metadata_df, metadata_to_extend, on="Filename")
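
`pd.merge` performs an inner join by default, so rows whose `Filename` has no counterpart on the other side are dropped without a warning. Before merging, it can help to compare the two sets of keys. The following is a minimal sketch with made-up identifiers and dates that follow the file name pattern used above; all values are hypothetical.

```python
# Hypothetical (identifier, date) pairs standing in for the metadata rows
records = [("ABC", 19000101), ("ABC", 19000102), ("XYZ", 19000103)]
metadata_filenames = {f"{identifier}-{date}-0-0-0-0.txt" for identifier, date in records}

# Hypothetical file names standing in for the keys of the annotated corpus
corpus_filenames = {
    "ABC-19000101-0-0-0-0.txt",
    "ABC-19000102-0-0-0-0.txt",
    "QRS-19000104-0-0-0-0.txt",
}

only_in_metadata = sorted(metadata_filenames - corpus_filenames)
only_in_corpus = sorted(corpus_filenames - metadata_filenames)
matched = sorted(metadata_filenames & corpus_filenames)

print("matched:", len(matched))
print("only in metadata:", only_in_metadata)
print("only in corpus:", only_in_corpus)
```

If either of the two difference lists is non-empty, the merged table will have fewer rows than expected.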

## 5. Save the results

### 5.1 Save the annotated corpus

In [None]:
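result_dir = Path(r"../data/conll")
# Create the output directory if it does not exist yet
result_dir.mkdir(parents=True, exist_ok=True)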

In [None]:
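for filepath, annotated_text in corpus_annotated.items():
    filepath = Path(filepath)
    # Same file name as the input text, but with the .conll suffix
    output_path = result_dir / filepath.with_suffix(".conll")
    # Note: to_csv writes comma-separated values by default
    annotated_text.to_csv(output_path, index=False)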
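
The files written above are comma-separated tables with a `.conll` suffix. If a layout closer to common CoNLL-style files is wanted (one token per line, tab-separated columns, a blank line between sentences), it could be produced roughly as follows. This is a sketch under that assumption, not the format the notebook currently writes; the rows are made up, and tokens are assumed to contain no tab characters.

```python
import io

# Hypothetical rows: (token, lemma, PoS tag, sentence index)
rows = [
    ("Wir", "wir", "PPER", 0),
    ("gehen", "gehen", "VVFIN", 0),
    (".", ".", "$.", 0),
    ("Sie", "sie", "PPER", 1),
    ("lachen", "lachen", "VVFIN", 1),
    (".", ".", "$.", 1),
]

buffer = io.StringIO()  # stands in for an open output file
previous_sentence = None
for token, lemma, pos, sentence in rows:
    # A blank line marks the boundary between two sentences
    if previous_sentence is not None and sentence != previous_sentence:
        buffer.write("\n")
    buffer.write("\t".join([token, lemma, pos]) + "\n")
    previous_sentence = sentence

conll_text = buffer.getvalue()
print(conll_text)
```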

### 5.2 Save the extended metadata

In [None]:
metadata_extended_filename = Path(r"MVP-Test-Korpus_Metadata-v02.csv")
metadata_extendend_df.to_csv(metadata_dir / metadata_extended_filename, index=False)