# Enrichment of the Corpus

* now that we have homogenized, we prepare the corpus i.e. we pre-process more for in-depth analyses 
* for this we:
  * tokenize
  * lemmatize
  * pos-tag
* This can be done with multiple **libraries** (*what concepts are known, do we need the notion of this, I'd say yes!) 
  * spacy
  * stanford-core-nlp
  * nltk
* Pre-Processing is language specific

## 1. Read in the txt data

In [1]:
from pathlib import Path

#### 1.1 Set path to corpus directory 
* Replace ./data/txt/ with path to the data

In [2]:
# or point to standard here?
corpus_dir = Path(r"../data/txt/")

#### 1.2 Read in the files from the directory

In [3]:
from collections import OrderedDict

In [4]:
def read_corpus_linewise(corpus_dir: Path) -> OrderedDict[str,str]:
    corpus = OrderedDict()
    for filepath in corpus_dir.iterdir():
        if filepath.is_file():
            text = filepath.read_text()
            #text = text.replace("\n", " ")
            corpus[filepath.name] = text
    return corpus

In [5]:
corpus = read_corpus_linewise(corpus_dir)

**Prüfen**: Wie viele Dateien wurden eingelesen?

In [6]:
print(len(corpus))

103


#### 1.3 Read in metadata 

In [7]:
import pandas as pd

In [8]:
metadata_dir = Path(r"../data/metadata/")

In [9]:
metadata_filepath =  metadata_dir / Path("MVP-Test-Korpus_Metadata.csv")

In [10]:
metadata_df = pd.read_csv(metadata_filepath, sep=";")

**Prüfen**: Wie sehen unsere Metadaten aus?

In [11]:
metadata_df.head()

Unnamed: 0,Newspaper,Identifier,Date,Link,Filename
0,Vossische Zeitung,SNP27112366,19180101,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180101-0-0-0-0
1,Vossische Zeitung,SNP27112366,19180108,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180108-0-0-0-0
2,Vossische Zeitung,SNP27112366,19180115,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180115-0-0-0-0
3,Vossische Zeitung,SNP27112366,19180122,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180122-0-0-0-0
4,Vossische Zeitung,SNP27112366,19180129,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180129-0-0-0-0


## 2.Worthäufigkeit mit lazy tokenization

In [12]:
all_texts = " ".join(corpus.values())

In [13]:
words = all_texts.split(" ")

**Prüfen**: Wie sieht die Wortliste aus?

In [14]:
words[50:60]

['Scarpe.',
 'Veiverſeits',
 'von\n;me',
 'und',
 'nördlich',
 'der',
 'Somme',
 'hef-\ntige',
 'Kämpfe.',
 'Die']

Wie viele Wörter gibt es insgesamt?

In [15]:
len(words)

1783703

Welche Wörter kommen wie oft vor? 

In [16]:
from collections import Counter

In [17]:
word_frequencies = Counter(words)

In [18]:
chosen_word = input("Geben Sie ein Wort ein, für welches die Häufigkeit angezeigt wird: ")

In [19]:
word_frequencies[chosen_word]

1371

## 3. Load NLP Library
* Do we need to install first? Probably not – do we want to show how we would install?

Overview of spacy model available [here](https://spacy.io/models) \

Load language specific model (selection):
* German: 'de_core_news_sm'
* English: 'en_core_news_sm'

### 3.1 Load library

In [20]:
import spacy

In [21]:
! python -m spacy download de_core_news_sm

Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.6/14.6 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('de_core_news_sm')


In [22]:
nlp = spacy.load('de_core_news_sm')

### 3.2 Setting up the pipeline

In [23]:
disable_components = ['ner', 'morphologizer', 'attribute_ruler']

### 3.3 Annotate texts and extract token, lemma, pos

In [24]:
from time import time

In [25]:
took_per_text = []

corpus_annotated = {}
filename_list = list(corpus.keys())
current = time()
for i, doc in enumerate(nlp.pipe(corpus.values(), disable=disable_components)):
    before = current
    current = time()
    took_per_text.append(current - before)
    annotated_text = {}
    annotated_text['Token'] = [tok.text for tok in doc]
    annotated_text['Lemma'] = [tok.lemma_ for tok in doc]
    annotated_text['PoS'] = [tok.tag_ for tok in doc]
    
    sentences = []
    sentence_idx = -1
    for token in doc:
        if token.is_sent_start:
            sentence_idx += 1
        sentences.append(sentence_idx)
    annotated_text['Sentence_idx'] = sentences
    
    corpus_annotated[filename_list[i]] = pd.DataFrame(annotated_text)

#### Wie lange hat das Annotieren gedauert?

In [26]:
import numpy as np

Durschnittlich pro Text in Sekunden:

In [27]:
np.mean(took_per_text)

5.673608814628379

Alle Texte zusammen in Sekunden:

In [28]:
np.sum(took_per_text)

584.381707906723

**Prüfen**: Länge des annotierten Korpus gleich Länge des Originalkorpus?

In [29]:
len(corpus_annotated)

103

**Prüfen**: Wie sieht die Annotation aus?

In [30]:
corpus_annotated[filename_list[0]].head()

Unnamed: 0,Token,Lemma,PoS,Sentence_idx
0,Kennmmmmnlie,Kennmmmmnlie,NN,0
1,“,--,$(,0
2,HET,HET,NE,1
3,PU,PU,NE,2
4,884,884,NE,3


### 3.4 Worthäufigkeit mit echter Tokenization   

In [31]:
all_words_tokenized = [word for text in corpus_annotated.values() for word in text.Token]

In [32]:
len(all_words_tokenized)

3008370

In [33]:
words_tokenized_frequencies = Counter(all_words_tokenized)

In [34]:
words_tokenized_frequencies[chosen_word]

0

## 4. Metadaten ausweiten

### 4.1 Metadaten sammeln
* Anzahl Lemmata
* Anzahl unique Lemmata
* Anzahl Sätze
* Durschnittliche Satzlänge 

In [35]:
collected_metadata_extension = []
for filename, annotated_text in corpus_annotated.items():
    metadata_extension = {}
    metadata_extension['Filename'] = filename
    metadata_extension['Lemma_Count'] = len(annotated_text) - 1
    metadata_extension['Lemma_Count_Unique'] = len(set(annotated_text.Lemma))
    metadata_extension['Sentence_Count'] = annotated_text.Sentence_idx.iloc[-1]
    metadata_extension['Sentence_Length_Avg'] = annotated_text.groupby('Sentence_idx').Lemma.count().mean()
    collected_metadata_extension.append(metadata_extension)

In [36]:
metadata_to_extend = pd.DataFrame(collected_metadata_extension)

### 4.2 Metadaten hinzufügen

In [37]:
metadata_df

Unnamed: 0,Newspaper,Identifier,Date,Link,Filename
0,Vossische Zeitung,SNP27112366,19180101,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180101-0-0-0-0
1,Vossische Zeitung,SNP27112366,19180108,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180108-0-0-0-0
2,Vossische Zeitung,SNP27112366,19180115,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180115-0-0-0-0
3,Vossische Zeitung,SNP27112366,19180122,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180122-0-0-0-0
4,Vossische Zeitung,SNP27112366,19180129,https://content.staatsbibliothek-berlin.de/zef...,SNP27112366-19180129-0-0-0-0
...,...,...,...,...,...
99,Berliner Morgenpost,SNP2719372X,19181203,https://content.staatsbibliothek-berlin.de/zef...,SNP2719372X-19181203-0-0-0-0
100,Berliner Morgenpost,SNP2719372X,19181210,https://content.staatsbibliothek-berlin.de/zef...,SNP2719372X-19181210-0-0-0-0
101,Berliner Morgenpost,SNP2719372X,19181217,https://content.staatsbibliothek-berlin.de/zef...,SNP2719372X-19181217-0-0-0-0
102,Berliner Morgenpost,SNP2719372X,19181224,https://content.staatsbibliothek-berlin.de/zef...,SNP2719372X-19181224-0-0-0-0


In [38]:
metadata_to_extend

Unnamed: 0,Filename,Lemma_Count,Lemma_Count_Unique,Sentence_Count,Sentence_Length_Avg
0,SNP2719372X-19180827-0-0-0-0.txt,19127,6784,1407,13.585227
1,SNP27112366-19180604-0-0-0-0.txt,48260,14664,3535,13.648473
2,SNP27112366-19180716-0-0-0-0.txt,42060,12277,3320,12.665161
3,SNP27112366-19180618-0-0-0-0.txt,45297,14035,3068,14.759857
4,SNP2719372X-19180521-0-0-0-0.txt,7103,2745,443,16.000000
...,...,...,...,...,...
98,SNP2719372X-19180820-0-0-0-0.txt,20086,7203,1598,12.562226
99,SNP2719372X-19180611-0-0-0-0.txt,20521,7518,1782,11.509815
100,SNP27112366-19180528-0-0-0-0.txt,45784,14624,3203,14.289950
101,SNP2719372X-19180108-0-0-0-0.txt,10940,4338,954,11.456545


In [39]:
metadata_df['Filename'] = metadata_df['Identifier'] + '-' + metadata_df['Date'].astype(str) + '-0-0-0-0.txt' 

In [40]:
metadata_extendend_df = pd.merge(metadata_df, metadata_to_extend, on="Filename")

## 5. Ergebnisse speichern 

### 5.1 Annotiertes Korpus speichern

In [41]:
result_dir = Path(r"../data/conll")

In [42]:
for filepath, annotated_text in corpus_annotated.items():
    filepath = Path(filepath)
    output_path = result_dir / filepath.with_suffix(".conll")
    annotated_text.to_csv(output_path, index=False)

### 5.2 Erweiterte Metadaten speichern  

In [43]:
metadata_extended_filename = Path(r"MVP-Test-Korpus_Metadata-v02.csv")
metadata_extendend_df.to_csv(metadata_dir / metadata_extended_filename, index=False)