### Install Packages
`pip` is used to install and manage Python packages. The following cell installs the packages needed for this notebook.

In [98]:
!pip install wikipedia-api
!pip install wikipedia
!pip install treetaggerwrapper
!pip install spacy
!pip install HanTa
!pip install nltk
!pip install keybert
!pip install yake


### Import Packages
The installed packages must then be imported in order to use them.

In [99]:
import os
import sys
import string
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
from keybert import KeyBERT
import pandas as pd

import spacy
from spacy.lang.de import German

import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

from HanTa import HanoverTagger as ht
from pprint import pprint

import re

import csv
import wikipedia
import ast

import requests

from threading import Thread
import yake

import time
import random

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sophiabuehl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Static Variables / Config
This section defines the static variables (folder paths and CSV file names) in one place so that all following functions can access them. This keeps the code easier to follow.

In [100]:
# folder with the raw excellent articles and folder for the cleaned versions
TEXT_FOLDER = './exzellent/'
CLEANED_FOLDER = './cleaned_exzellent/'
# CSV files for the extracted keywords
KEYBERT_CSV = 'keybert.csv'
YAKE_CSV = 'yake.csv'

### Save Articles with Title
This function renames each file to the title of the Wikipedia article it contains. It only needs the path to the folder with the excellent articles, which were extracted in the other notebook.

In [120]:
def change_file_name(TEXT_FOLDER):
    # iterate over all files in the article folder
    for file_name in os.listdir(TEXT_FOLDER):
        file_path = os.path.join(TEXT_FOLDER, file_name)

        # open the file and read only the first line, which holds the title
        with open(file_path, "r", encoding="utf-8") as file:
            titel = file.readline().strip()

        # skip files whose first line is empty
        if not titel:
            continue

        # remove characters that are not allowed in file names
        chars = r'[<>:"/\\|*]'
        cleaned_titel = re.sub(chars, '', titel)

        # build the new file name
        new_file_name = cleaned_titel + ".txt"
        new_file_path = os.path.join(TEXT_FOLDER, new_file_name)

        # rename the file after it has been closed
        os.rename(file_path, new_file_path)


In [121]:
# call function to change file name 
change_file_name(TEXT_FOLDER)
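
The title clean-up used for the renaming can be tried in isolation. The following is a minimal sketch; `title_to_file_name` is a hypothetical helper that does not exist in the notebook, and it reuses the same character class as `change_file_name`.

```python
import re

# characters that are removed from a title before it is used as a file name
INVALID_CHARS = r'[<>:"/\\|*]'

def title_to_file_name(title):
    """Hypothetical helper: strip whitespace and invalid characters, append '.txt'."""
    return re.sub(INVALID_CHARS, '', title.strip()) + '.txt'

print(title_to_file_name('AC/DC\n'))
print(title_to_file_name('Carlton-Club-Treffen (1922)'))
```

Characters outside the class, such as parentheses, hyphens and spaces, stay in the file name.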

### Pre-Process the Excellent Articles
To pre-process the excellent articles, the content of each text file is read and split into tokens. The HanoverTagger is then used to filter out all word types that are not needed, so that only nouns, proper nouns and adjectives remain. The filtered text is written to a text file with the same name in a separate folder.

The HanoverTagger comes from the HanTa package (https://github.com/wartaal/HanTa), originally described at https://serwiss.bib.hs-hannover.de/frontdoor/index/index/docId/1527. Here it performs the part-of-speech (POS) tagging, which means that each word of the text is categorized as a noun, verb, adjective and so on.

In [102]:
tagger = ht.HanoverTagger('morphmodel_ger.pgz')

In [103]:
def clean_text(txt_file):
    # path to text file
    input_path = os.path.join(TEXT_FOLDER, txt_file)

    # read the txt file
    with open(input_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # tokenize the text of each file and define the language
    words = nltk.word_tokenize(text, language='german')

    # POS tagging with the HanoverTagger, which returns (word, lemma, tag) tuples
    tagged_words = tagger.tag_sent(words)
    pos_tags = [(word, tag) for word, _, tag in tagged_words]

    # keep only words tagged NN (nouns), NE (proper nouns) or ADJ (adjectives)
    filtered_words = [word for word, pos in pos_tags if pos.startswith(('NN', 'ADJ')) or pos == 'NE']

    # merge the filtered words into a cleaned text
    cleaned_text = ' '.join(filtered_words)

    # save the cleaned text under the same file name in the cleaned folder
    output_path = os.path.join(CLEANED_FOLDER, txt_file)
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(cleaned_text)
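
The POS filter can be illustrated without loading the tagger model. The sketch below applies the same rule to a hand-written sample in the `(word, lemma, tag)` shape; the sentence and its tags are assumptions made for illustration, not real tagger output.

```python
# hand-written sample in the (word, lemma, tag) shape; the tags are assumptions
tagged_words = [
    ('Der', 'der', 'ART'),
    ('kleine', 'klein', 'ADJ(A)'),
    ('Girlitz', 'Girlitz', 'NE'),
    ('singt', 'singen', 'VV(FIN)'),
    ('im', 'in', 'APPRART'),
    ('Garten', 'Garten', 'NN'),
]

# tag prefixes for nouns and adjectives; proper nouns (NE) are matched exactly
KEEP_PREFIXES = ('NN', 'ADJ')

def keep_content_words(tagged):
    """Return only the words tagged as noun, proper noun or adjective."""
    return [word for word, _, tag in tagged
            if tag.startswith(KEEP_PREFIXES) or tag == 'NE']

print(' '.join(keep_content_words(tagged_words)))
```

Articles, verbs and prepositions are dropped, so only the content-bearing words reach the keyword models.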

### Clean data and save
The following function runs the actual cleaning. It first creates the folder for the cleaned articles if it does not exist yet. Threads are used to speed up the process.

In [104]:
def clean_data(CLEANED_FOLDER, TEXT_FOLDER):
    # create the folder for the cleaned articles if it does not exist
    os.makedirs(CLEANED_FOLDER, exist_ok=True)

    # list with all the text files
    txt_files = [file for file in os.listdir(TEXT_FOLDER) if file.endswith('.txt')]
    thread_list = []

    # iterate over the file names
    for txt_file in txt_files:

        # clean each article in its own thread
        thread = Thread(target=clean_text, args=(txt_file,))
        thread.start()
        thread_list.append(thread)

    # wait until all threads have finished
    for t in thread_list:
        t.join()
    
    

877 articles processed
1558 articles processed
1919 articles processed
2793 articles processed
Done


In [None]:
clean_data(CLEANED_FOLDER, TEXT_FOLDER)
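
`clean_data` starts one thread per article, which means roughly 2793 threads for this data set. A possible alternative, shown here only as a sketch, is a bounded thread pool from the standard library. `process_all` and `fake_clean` are hypothetical names, and `fake_clean` merely stands in for `clean_text`.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(items, worker, max_workers=8):
    """Run worker on every item with at most max_workers threads at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map keeps the order of the input and waits for all tasks to finish
        return list(pool.map(worker, items))

def fake_clean(file_name):
    # stand-in for clean_text; only transforms the name for demonstration
    return file_name.upper()

results = process_all(['a.txt', 'b.txt', 'c.txt'], fake_clean, max_workers=2)
print(results)
```

The `with` block replaces the manual `start()` and `join()` calls, and `max_workers` caps how many files are open at once.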

### Get keywords with KeyBERT
This function generates the keywords with the KeyBERT model. Five keyphrases of up to two words each are extracted for every article. To do so, the 15 candidates most similar to the text are collected first, and the five that differ most from each other are selected, which gives a high diversity in the keywords.

The KeyBERT model was taken from https://github.com/MaartenGr/KeyBERT. KeyBERT extracts keywords or keyphrases from a text based on similarity: it embeds the document and the candidate n-grams with BERT embeddings and selects the candidates whose embeddings are most similar to the document embedding, i.e. those that describe the entire text best. KeyBERT is a simple, easy-to-use NLP model for keyword extraction.
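
The two ideas described here, ranking candidates by similarity to the document and then picking a diverse subset, can be sketched with toy vectors. This is a simplified illustration and not the library's implementation: the three-dimensional vectors, the candidate phrases and the helper names `rank` and `max_sum_select` are all assumptions, whereas a real run would use BERT embeddings.

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# toy embeddings (assumptions); real embeddings have hundreds of dimensions
doc_vec = [1.0, 1.0, 0.0]
candidates = {
    'blei motor': [1.0, 0.9, 0.0],
    'blei benzin': [1.0, 0.8, 0.1],
    'giftige verbindung': [0.6, 1.0, 0.5],
    'jahr 1924': [0.0, 0.1, 1.0],
}

def rank(doc, cands):
    """Order the candidate phrases by similarity to the document, best first."""
    return sorted(cands, key=lambda k: cosine(doc, cands[k]), reverse=True)

def max_sum_select(doc, cands, top_n, nr_candidates):
    """Among the nr_candidates best phrases, pick the top_n least similar to each other."""
    pool = rank(doc, cands)[:nr_candidates]

    def pairwise_similarity(combo):
        return sum(cosine(cands[a], cands[b]) for a, b in combinations(combo, 2))

    return min(combinations(pool, top_n), key=pairwise_similarity)

print(rank(doc_vec, candidates))
print(max_sum_select(doc_vec, candidates, top_n=2, nr_candidates=3))
```

In this toy example the two near-duplicate phrases about lead are not selected together, which mirrors the effect of `use_maxsum=True` with `top_n=5` and `nr_candidates=15`.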

In [105]:
def keybert_keywords(TEXT_FOLDER, CLEANED_FOLDER):
    # empty list for dataset
    data = []

    # load the KeyBERT model once and reuse it for all articles
    kw_model = KeyBERT()

    # get the files in the folder
    for filename in os.listdir(TEXT_FOLDER):

        # read the cleaned text file with the same name
        txt_path = os.path.join(CLEANED_FOLDER, filename)
        with open(txt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # extract five keyphrases of one or two words from 15 candidates
        keywords = kw_model.extract_keywords(content, keyphrase_ngram_range=(1, 2), use_maxsum=True, top_n=5, nr_candidates=15)

        # extract title out of filename
        title = os.path.splitext(filename)[0]

        # add dataset to list
        data.append({'Titel': title, 'Schlüsselwörter': keywords})

    # create dataframe with keywords and title
    df_keybert = pd.DataFrame(data)

    return df_keybert

                                  Titel  \
0                        Tetraethylblei   
1                   Division 1 Féminine   
2           Carlton-Club-Treffen (1922)   
3                          Morbus Fabry   
4                               Girlitz   
...                                 ...   
2788                            Expo 67   
2789  Spectravideo SV-318, SVI-318 MKII   
2790                     Rip Van Winkle   
2791                         Tibetfuchs   
2792                      Höllengebirge   

                                        Schlüsselwörter  
0     [(vermarktung tetraethylblei, 0.482), (corpora...  
1     [(mannschaften uefa, 0.4869), (serienmeister l...  
2     [(ort carlton, 0.5168), (forschungsgeschichte ...  
3     [(heterozygoten patienten, 0.4847), (fabry pro...  
4     [(siedlungsdichte girlitzes, 0.4835), (girlitz...  
...                                                 ...  
2788  [(futuristisch expo, 0.5466), (montreal expo, ...  
2789  [(margin left

In [None]:
df_keybert = keybert_keywords(TEXT_FOLDER, CLEANED_FOLDER)
df_keybert

### Get keywords with Yake and print them
To compare the keywords and evaluate them better, a second model is used. Yake is another widely used model for keyword extraction. The structure of the function is very similar to the KeyBERT function. The keywords are generated from statistical features of the text, and the five best-scoring ones are kept.

The Yake model was taken from https://github.com/LIAAD/yake.

In [106]:
def yake_keywords(TEXT_FOLDER, CLEANED_FOLDER):
    # empty list for dataset
    data = []

    # create the Yake extractor once and reuse it for all articles
    kw_extractor = yake.KeywordExtractor()

    # get the files in the folder
    for filename in os.listdir(TEXT_FOLDER):

        # read the cleaned text file with the same name
        txt_path = os.path.join(CLEANED_FOLDER, filename)
        with open(txt_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # extract (keyword, score) pairs with the Yake model
        keywords = kw_extractor.extract_keywords(content)

        # extract title out of filename
        title = os.path.splitext(filename)[0]

        # add dataset to list
        data.append({'Titel': title, 'Schlüsselwörter': keywords})

    # create dataframe with keywords and title
    df_yake = pd.DataFrame(data)

    # keep only the five best keywords and drop their scores
    df_yake['Schlüsselwörter'] = df_yake['Schlüsselwörter'].apply(lambda x: [kw[0] for kw in x[:5]])

    return df_yake

                                  Titel  \
0                        Tetraethylblei   
1                   Division 1 Féminine   
2           Carlton-Club-Treffen (1922)   
3                          Morbus Fabry   
4                               Girlitz   
...                                 ...   
2788                            Expo 67   
2789  Spectravideo SV-318, SVI-318 MKII   
2790                     Rip Van Winkle   
2791                         Tibetfuchs   
2792                      Höllengebirge   

                                        Schlüsselwörter  
0     [Blei Blei Tetraethylblei, Blei Herstellung Te...  
1     [Lyon Paris Lyon, Paris Lyon Paris, Lyon Lyon ...  
2     [Andrew Spottiswoode London, Minister Andrew S...  
3     [Journal Band Nummer, Band Nummer, Band Nummer...  
4     [Familie Finken Fringillidae, Girlitz Nordafri...  
...                                                 ...  
2788  [Pavillon Expo Montreal, Millionen Dollar Expo...  
2789  [Heimcomputer

In [None]:
df_yake = yake_keywords(TEXT_FOLDER, CLEANED_FOLDER)
df_yake
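
Yake returns `(keyword, score)` pairs in which a lower score marks a more relevant keyword. The sketch below shows the selection of the five best keywords on made-up pairs; the keywords, the scores and the helper `best_keywords` are assumptions for illustration.

```python
# made-up (keyword, score) pairs in the shape returned by extract_keywords
scored = [
    ('Nordafrika', 0.031),
    ('Familie Finken', 0.002),
    ('Jahr', 0.200),
    ('Girlitz', 0.010),
    ('Brutzeit', 0.050),
    ('Europa', 0.045),
]

def best_keywords(pairs, n=5):
    """Return the n keywords with the lowest (best) Yake score, without scores."""
    ordered = sorted(pairs, key=lambda kw: kw[1])
    return [keyword for keyword, _ in ordered[:n]]

print(best_keywords(scored))
```

Sorting explicitly makes the selection independent of the order in which the pairs arrive.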

### Save into CSV
The dataframes of the two models are saved as CSV files, because they are easier to work with in the following steps.

In [107]:
# save both dataframes as CSV files
df_keybert.to_csv(KEYBERT_CSV)
df_yake.to_csv(YAKE_CSV)

KeyBERT also returns a score for each keyword that states how similar it is to the text. This score is a useful first indicator of the performance of the KeyBERT model, but it is not needed for the rest of the code and has to be removed from the CSV file. The next two functions are defined for that purpose.

In [108]:
def clean_keywords(keyword_str):

    # parse the string representation of the keyword list into a Python list
    keyword_list = ast.literal_eval(keyword_str)

    # keep only the first element (the keyword) of each (keyword, score) tuple
    cleaned_keywords = [keyword[0] for keyword in keyword_list]

    return cleaned_keywords
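
A short example shows what happens to one CSV cell. The cell content below is an assumption modeled on the KeyBERT output: the first pair appears in the dataframe above, while the second score is made up. `strip_scores` is a hypothetical stand-in for `clean_keywords`.

```python
import ast

# one cell as it is stored in the CSV file: a list of tuples saved as text
cell = "[('vermarktung tetraethylblei', 0.482), ('tetraethylblei motor', 0.45)]"

def strip_scores(keyword_str):
    """Parse the stored text safely and keep only the keyword of each pair."""
    pairs = ast.literal_eval(keyword_str)
    return [keyword for keyword, _score in pairs]

print(strip_scores(cell))
```

`ast.literal_eval` only accepts Python literals, so unlike `eval` it cannot execute arbitrary code found in the file.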

In [109]:
def process_keywords_csv(input_file, output_file): 

    # opens csv-file   
    with open(input_file, 'r', encoding='utf-8') as file:

        # read csv-file as dictionary
        csv_reader = csv.DictReader(file)

        # save the rows separately
        rows = list(csv_reader)

        # clean the rows with the function above
        keywords = [(row['Titel'], clean_keywords(row['Schlüsselwörter'])) for row in rows]

    # open output file
    with open(output_file, 'w', newline='', encoding='utf-8') as file:

        # create csv writer
        writer = csv.writer(file)

        # define the column names and write it in output file
        writer.writerow(['Titel', 'Schlüsselwörter'])

        # write keywords in output file
        for keyword in keywords:
            writer.writerow(keyword)

In [110]:
process_keywords_csv('keybert.csv', 'key_output.csv')

### Evaluate the Keywords - Wikipedia API
Two models with different approaches were used to obtain keywords from the excellent German Wikipedia articles. Since there is no dataset with the 'right' keywords, an evaluation method had to be devised. The keywords of an article are passed to the search function of the Wikipedia API, and it is then checked whether the title of the first article found matches the title of the article the keywords were extracted from.

This is a good way to check whether the keywords are close to the text and contain the relevant information from the article. If the right article can be found with the identified keywords, this is a good indicator that the keywords are suitable. Each keyphrase consists of up to two words so that the grammar is not lost and the text is described as well as possible. This is also the reason why base forms of words (lemmatization) are not used: otherwise the relationship between the two words and the meaning behind them would be lost. Lemmatization was attempted with spaCy. However, because only few German lemmatizers are available, it was difficult to find a suitable one. In addition, many words were truncated incorrectly, so that the grammar and the meaning of the words were no longer correct.

In [111]:
def get_search_result(search_query):

    # language code to german
    language = 'de'

    # get only first result of the wikipedia search
    number_of_results = 1

    # empty dictionary for http header 
    headers = {}

    # url for API to search and create the full url
    api_url = 'https://api.wikimedia.org/core/v1/wikipedia/'
    endpoint = '/search/page'
    url = api_url + language + endpoint

    # set the parameters for the search
    parameters = {'q': search_query, 'limit': number_of_results}

    # send GET request with url, headers and parameters and parse the API response as JSON
    response = requests.get(url, headers=headers, params=parameters)
    response = response.json()

    # print response
    print(response)

    # if there is one result, return title
    if len(response['pages']) > 0:
        return response['pages'][0]['title']
    # if there is no result, return no title found
    else:
        print('No article found')
        return None
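
The request URL and the handling of the response can be checked without calling the API. The following sketch uses only the standard library; `build_search_url` and `first_title` are hypothetical helpers, and `sample` is a shortened version of the response shape printed by `get_search_result`.

```python
from urllib.parse import urlencode

API_URL = 'https://api.wikimedia.org/core/v1/wikipedia/'

def build_search_url(query, language='de', limit=1):
    """Assemble the search URL that requests.get would be called with."""
    return API_URL + language + '/search/page?' + urlencode({'q': query, 'limit': limit})

def first_title(response):
    """Return the title of the first hit, or None if there is no hit."""
    pages = response.get('pages', [])
    return pages[0]['title'] if pages else None

# shortened sample response; a real response contains more fields per page
sample = {'pages': [{'id': 259144, 'key': 'Tetraethylblei', 'title': 'Tetraethylblei'}]}

print(build_search_url('tetraethylblei motor'))
print(first_title(sample))
print(first_title({'pages': []}))
```

Using `response.get('pages', [])` also covers responses that carry no `pages` key at all, which is an assumption about how error responses could look.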
        

In [118]:
def validate_keywords(csv_file):

    # counter for the articles that were found again
    count_found = 0

    # open csv-file
    with open(csv_file, 'r', encoding='utf-8') as file:

        # create dictionary reader
        csv_reader = csv.DictReader(file)

        # iterate over the rows and keep the row index in idx
        for idx, row in enumerate(csv_reader):

            # extract title and keywords; literal_eval parses the stored list safely
            title = row['Titel']
            keywords = ast.literal_eval(row['Schlüsselwörter'])

            # search article with keywords
            search_results = get_search_result(','.join(keywords))

            # if there is a search result
            if search_results:
                found_title = search_results

                if found_title.lower() == title.lower():

                    # the found title matches the title in the csv-file
                    print(f"The article '{found_title}' was found for the title '{title}'.")
                    count_found += 1

                else:
                    # the found title differs from the title in the csv-file
                    print(f"The found article '{found_title}' does not match the title '{title}'.")
            else:
                # no article was found
                print(f"No article found for title '{title}'.")

            # wait between 7 and 10 seconds to avoid sending too many requests to the API
            time.sleep(random.randint(7,10))

    return count_found, idx

['vermarktung tetraethylblei', 'corporation tetraethylblei', 'produktionsmenge tetraethylblei', 'tetraethylblei motor', 'tetraethyllead gasoline']
{'pages': [{'id': 259144, 'key': 'Tetraethylblei', 'title': 'Tetraethylblei', 'excerpt': 'besaßen, gründeten daraufhin 1924 die Ethyl <span class="searchmatch">Corporation</span>, die <span class="searchmatch">Tetraethylblei</span> herstellte und vermarktete. <span class="searchmatch">Tetraethylblei</span> wirkt als Antiklopfmittel, da es die bei', 'matched_title': None, 'description': 'giftige metallorganische Verbindung, historischer Kraftstoffzusatz', 'thumbnail': {'mimetype': 'image/svg+xml', 'size': None, 'width': 60, 'height': 27, 'duration': None, 'url': '//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Tetraethylplumbane_200.svg/60px-Tetraethylplumbane_200.svg.png'}}]}
Tetraethylblei
The article 'Tetraethylblei' was found for the title 'Tetraethylblei'.
0
['mannschaften uefa', 'serienmeister lyon', 'mannschaften division', 'l

In [None]:
# print the results for the evaluation
count_found_keybert, idx_keybert = validate_keywords('key_output.csv')
count_found_yake, idx_yake = validate_keywords('yake.csv')
print("KeyBERT /// found:", count_found_keybert, 'of', idx_keybert + 1)
print("Yake /// found:", count_found_yake, 'of', idx_yake + 1)
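
The matching rule of the evaluation is a case-insensitive comparison between the expected title and the first search hit. The sketch below applies it to a few hand-made pairs; apart from the first pair, the search hits are assumptions, and `count_matches` is a hypothetical helper.

```python
def count_matches(pairs):
    """Count pairs where the found title equals the expected one, ignoring case."""
    return sum(
        1 for expected, found in pairs
        if found is not None and found.lower() == expected.lower()
    )

# (expected title, first search hit); None means that the search returned nothing
pairs = [
    ('Tetraethylblei', 'Tetraethylblei'),
    ('Girlitz', 'girlitz'),
    ('Expo 67', 'Montreal'),
    ('Tibetfuchs', None),
]

print(count_matches(pairs), 'of', len(pairs))
```

Only exact title matches count, so a search that returns a closely related article is still treated as a miss.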

### Evaluation Results
The results show that both models perform well in generating keywords and in finding the right article again with the Wikipedia search. This indicates that the keywords are close to the text and describe the articles well.

Two different models were used for comparison. **KeyBERT** performs better, as expected, because the similarity to the article is used to select the keywords. In addition, several candidates are generated and the most dissimilar ones are used, so the keywords are diverse and cover the relevant aspects of the text. With its keywords, **2409** of the 2793 excellent German Wikipedia articles were found again, which corresponds to an accuracy of **86.25%**. To improve the accuracy, more keywords could be generated and included in the search. Given the good keywords, the preprocessing of the text seems to be sufficient.

The **Yake** model still performs well, but worse than KeyBERT. It uses statistical features of the text to create the keywords. Its weaker performance may also be due to the preprocessing of the articles. In contrast to KeyBERT, only **2048** of the 2793 articles were found correctly with its keywords, which is 361 fewer. The accuracy for the Yake model is therefore only **73.33%**. Nevertheless, the result is good and shows that the model works, although KeyBERT achieves better keyword identification.
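
The reported percentages follow directly from the counts. A short calculation, with `hit_rate` as a hypothetical helper, reproduces them:

```python
TOTAL_ARTICLES = 2793
FOUND = {'KeyBERT': 2409, 'Yake': 2048}

def hit_rate(found, total):
    """Share of articles found again, in percent, rounded to two decimals."""
    return round(100 * found / total, 2)

for model, found in FOUND.items():
    print(f"{model}: {found} of {TOTAL_ARTICLES} = {hit_rate(found, TOTAL_ARTICLES)}%")

print('Difference in found articles:', FOUND['KeyBERT'] - FOUND['Yake'])
```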