
German Text Classification Tutorial Series

Binder

This project mainly consists of a German tutorial series on text classification with Python and German data. The tutorials are available as Jupyter notebooks in the tutorials folder. In addition, the project contains a tool for extracting articles from Wikipedia by category and a CSV post-processing tool. Prepared datasets are provided in the data folder; a smaller, more compact dataset is stored in the dl_dataset folder.

Installation

Required: Python 3.6+

pip install -r requirements.txt

Usage

For a detailed explanation of the following two tools, see the notebook tutorials/Zusatzkapitel - Wie baue ich mein eigenes Wikipediakorpus? (only available in German).

Wikiparser

$ python wikiparser.py --help
usage: path [-h] [--max_articles MAX_ARTICLES] [--save_date]

optional arguments:
-h, --help            show this help message and exit
--max_articles, -ma   sets the maximum number of articles per category
--save_date, -sd      indicates if the generation date of the corpus should be saved
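
An illustrative invocation that builds a corpus with at most 100 articles per category and records the generation date (the number is an example, not a default):

$ python wikiparser.py --max_articles 100 --save_date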

Data Postprocess

$ python data_postprocess.py --help
usage: path [-h] [--drop_duplicates] [--unify_articles_amount] [--tokenization] [--german_translation] [--nonlatin] [--no_umlauts] [--save_methods]

optional arguments:
-h, --help                      show this help message and exit
--drop_duplicates, -dd          indicates if the duplicates should be dropped
--unify_articles_amount, -uaa   indicates if the number of articles per category should be unified
--tokenization, -t              indicates if the articles should be tokenized or not
--german_translation, -gt       indicates if given German translations should replace the original phrases
--nonlatin, -nl                 indicates if non-latin characters should be removed
--no_umlauts, -nu               indicates if umlauts should be replaced
--save_methods, -sm             indicates if an abbreviation for the postprocessing methods should be added to the output name
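
An illustrative combination of these switches that drops duplicates, tokenizes the articles and replaces umlauts (any subset of the flags can be combined):

$ python data_postprocess.py --drop_duplicates --tokenization --no_umlauts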
    

Project Structure

data

Corpora with German Wikipedia articles and categories as CSV files (subfolder corpora), JSON files with German Wikipedia categories and subcategories, and a German dictionary file.
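
For a quick look at one of the corpora, the CSV files can be read with pandas; the file name below is a placeholder for any corpus in data/corpora, and the column names follow the description in the src section:

import pandas as pd

# Placeholder file name: any corpus CSV from data/corpora will do.
corpus = pd.read_csv("data/corpora/wikipedia_corpus.csv")

# Each row is one Wikipedia article with its category label.
print(corpus.columns.tolist())
print(corpus["category"].value_counts())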

dl_dataset

Dataset for the university course "Deep Learning" (Julius-Maximilians-Universität, summer semester 2019).

src

data_generation: Module that contains all the Python files for generating the data by parsing Wikipedia articles based on the categories stored in wikicategories.json. The articles are preprocessed (tokenization, removal of unnecessary parts, etc.) and written to a CSV file with the columns "category", "summary" and "text".
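
A minimal sketch of the general approach via the public MediaWiki API, not the repository's actual implementation; the category name, file name and limits are placeholders:

import csv
import requests

API_URL = "https://de.wikipedia.org/w/api.php"

def category_members(category, limit=50):
    """Return article titles belonging to a German Wikipedia category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Kategorie:{category}",
        "cmnamespace": 0,   # articles only, skip subcategories
        "cmlimit": limit,
        "format": "json",
    }
    members = requests.get(API_URL, params=params).json()["query"]["categorymembers"]
    return [m["title"] for m in members]

def plain_text(title):
    """Fetch the plain-text extract of a single article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API_URL, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

with open("corpus.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "summary", "text"])
    for category in ["Physik"]:                 # placeholder category
        for title in category_members(category):
            text = plain_text(title)
            summary = text.split("\n", 1)[0]    # crude summary: first paragraph
            writer.writerow([category, summary, text])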

tutorials

German tutorial series on text classification with machine learning, scikit-learn and deep learning, provided as Jupyter notebooks.
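
As a taste of what the notebooks cover, a minimal scikit-learn baseline on such a corpus might look like this (the file name is a placeholder, and the models and parameters used in the tutorials differ):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder path: any corpus CSV with "category" and "text" columns.
corpus = pd.read_csv("data/corpora/wikipedia_corpus.csv")

X_train, X_test, y_train, y_test = train_test_split(
    corpus["text"], corpus["category"], test_size=0.2, random_state=42
)

# TF-IDF features plus a linear classifier as a simple baseline.
model = make_pipeline(TfidfVectorizer(max_features=20000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")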
