Our approach for the GDI dialect classification task of VarDial 2017

simon-clematide/GDI-task-2017


CLUZH submission for VarDial 2017 Workshop Task "German Dialect Identification" (GDI)

This repository contains the code and data for reproducing our approach as described in our EACL 2017 VarDial Workshop Paper and Poster.

Prerequisites for running the code under Linux/MacOS:

Steps to reproduce our results

Execute all commands from the top-level directory of the GDI-task-2017 repository.

  • python3 lib/generate_cross_validation_data.py

    • populates the directory ./cv.d with data splits
    • creates the first global split into test.tsv and train.tsv used for the LSTM experiments (no cross-validation there)
    • creates the cross-validation data splits using train.tsv as input: test_N.tsv and train_N.tsv (0 <= N <= 9)
  • make -f run2.mk target

    • trains and evaluates all folds of CRF run 2 using wapiti
  • python3 lib/dataprep_runs1and3.py

    • populates the directory preprocessed_data.d
    • creates all the models and output for runs 1 (Naive Bayes) and 3 (NB, SVM, CRF ensemble), using the results of run 2
    • creates the data variants (augmented data, replacements) for the LSTM models
  • for var in model+charrep+augm model+charrep-augm model-charrep+augm model-charrep-augm ; do python3 lib/LSTM_models_emb.py $var ; done

    • trains 4 different LSTM models with character embeddings as described in the paper
    • takes several hours without a GPU
  • for var in model+charrep+augm model+charrep-augm model-charrep+augm model-charrep-augm ; do python3 lib/LSTM_models_no_emb.py $var ; done

    • trains 4 different LSTM models without character embeddings as described in the paper
    • NOTE: This script seems to have problems when run on a CPU, so a GPU is needed for now.
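
For orientation, the file layout produced by the first step (one global test.tsv/train.tsv split, then ten folds train_N.tsv/test_N.tsv built from train.tsv) can be sketched as follows. This is a minimal illustration, not the code in lib/generate_cross_validation_data.py: the function names, the 10% test fraction, the fixed seed, and the plain random shuffle are all assumptions, and the actual script may split differently (for example, stratified by dialect).

```python
import random
from pathlib import Path


def split_folds(items, k=10, seed=0):
    """Shuffle items and deal them into k roughly equal folds (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def write_cv_splits(lines, out_dir, test_fraction=0.1, k=10, seed=0):
    """Write test.tsv/train.tsv plus train_N.tsv/test_N.tsv (0 <= N < k) to out_dir.

    `lines` are newline-terminated TSV rows. test_fraction and seed are
    illustrative assumptions, not values taken from the repository.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Global split: used on its own for the LSTM experiments.
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    (out / "test.tsv").write_text("".join(test), encoding="utf-8")
    (out / "train.tsv").write_text("".join(train), encoding="utf-8")

    # Cross-validation folds are built from train.tsv only.
    folds = split_folds(train, k=k, seed=seed)
    for n, held_out in enumerate(folds):
        rest = [row for m, fold in enumerate(folds) if m != n for row in fold]
        (out / f"test_{n}.tsv").write_text("".join(held_out), encoding="utf-8")
        (out / f"train_{n}.tsv").write_text("".join(rest), encoding="utf-8")
    return train, test
```

Calling `write_cv_splits(open("all.tsv", encoding="utf-8").readlines(), "cv.d")` on a hypothetical input file `all.tsv` would yield the same file names that the later steps expect in ./cv.d.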
