The main goal of this project is to build a system to answer Jeopardy-like questions in Russian.
This work is conducted under the supervision of Simon Ostermann.
At the moment, there are two main contributions of this repository:
- Jeopardy-like QA dataset in Russian. This is a new, independently collected edition of the dataset introduced by Mikhalkova (2021).
- mT5-based neural model to solve this task. Please be aware that the current quality of the model is low: it answers only 2 of 2560 test-set questions correctly (about 0.08%). I am currently researching ways to improve the system.
For a more detailed description, please read the intermediate seminar report.
- data - YoGa-QA source data and parsed dataset
- meta - project reports + helper instructions (e.g. how to connect to GPU servers)
- src/scraper - Python code for data scraping and parsing
- src/models - Python code for training neural models (AllenNLP)
‘Svoya Igra’ (Own Game) is a Russian analogue of Jeopardy. A TV version of the game has been running since 1994. In addition, there is a large community of enthusiasts in Russian-speaking countries, where professional authors write their own questions and run championships. An initial version of the Own Game dataset, built from the official public database, was introduced by Mikhalkova (2021).
Note that Your Game is a less accurate translation of 'Svoya Igra' than Own Game, but I like the YoGa abbreviation ;)
The database of questions and answers for Own Game is freely available at https://db.chgk.info/. The data is published as HTML pages, so it can be extracted by scraping and parsing. There are also many unpublished tournaments, which are distributed as text files (pdf, docx, txt). However, this private data cannot be freely distributed because of authors' rights, so I focus on the public database.
The dataset was obtained by scraping web pages and parsing them. You can find details in the intermediate project report.
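As a rough illustration of the scrape-and-parse step, the sketch below pulls question and answer text out of an HTML fragment using only the Python standard library. The markup (the `question` and `answer` class names), the sample fragment, and the `QAExtractor` class are all hypothetical and made up for this example; the real pages on db.chgk.info need their own selectors, and the actual scraper lives in src/scraper.

```python
from html.parser import HTMLParser


class QAExtractor(HTMLParser):
    """Collect text from elements whose class is 'question' or 'answer'.

    The class names are an assumption for this sketch, not the real page markup.
    Nested tags inside a field are not handled: the first closing tag ends the field.
    """

    FIELDS = ("question", "answer")

    def __init__(self):
        super().__init__()
        self.records = []   # finished {"question": ..., "answer": ...} dicts
        self._current = {}  # record being filled in
        self._field = None  # field whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._field = css_class
            self._current[css_class] = ""

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._current[self._field] = self._current[self._field].strip()
        self._field = None
        # A record is complete once both fields have been seen.
        if all(name in self._current for name in self.FIELDS):
            self.records.append(self._current)
            self._current = {}


# Hypothetical page fragment, not copied from the real database.
SAMPLE_HTML = """
<div class="question">Океан, в отличие от своих братьев, не участвовал В НЕЙ.</div>
<div class="answer">титаномахия</div>
"""

parser = QAExtractor()
parser.feed(SAMPLE_HTML)
print(parser.records)
```

Downloading the pages themselves is left out on purpose, so the sketch runs offline.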
The data is protected by copyright (in Russian), with the underlying licences included:
Required fields: topic_name, question_value, question_text, answer.
Optional fields: extra_positives, hard_negatives, comment, source, author, tournament, date, source_url.
topic_name | question_value | question_text | answer |
---|---|---|---|
Океаны | 4 | Океан, в отличие от своих братьев, не участвовал В НЕЙ, благодаря чему сохранил свое положение, а не был низвергнут в Тартар. | титаномахия |
Oceans | 4 | The ocean, unlike its brothers, did not participate in IT, thanks to which it retained its position, and was not thrown into Tartarus. | Titanomachy |
extra_positives | hard_negatives | comment | source |
---|---|---|---|
битва титанов и богов | гигантомахия | - | https://ru.wikipedia.org/wiki/Океан_(мифология) |
Battle of titans and gods | Gigantomachy | - | English analogue: https://en.wikipedia.org/wiki/Oceanus |
author | tournament | date | source_url |
---|---|---|---|
Иделия Айзятулова, Андрей Мартыненко, Александр Рождествин | XII Кубок Европы по интеллектуальным играм среди студентов (Витебск). Своя игра | 2016-10-28 | https://db.chgk.info/txt/eu16stsv.txt |
Ideliya Aizyatulova, Andrey Martynenko, Alexander Rozhdestvin | XII European Student Intellectual Games Cup (Vitebsk). Own game | 2016-10-28 | https://db.chgk.info/txt/eu16stsv.txt |
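
The field layout above can be sketched as a small validation helper. This is a minimal illustration, assuming each parsed record is a Python dict keyed by the field names listed above; the `validate_record` function is hypothetical and not part of the repository, and the value types (for example an integer `question_value`) are guesses.

```python
REQUIRED_FIELDS = ("topic_name", "question_value", "question_text", "answer")
OPTIONAL_FIELDS = (
    "extra_positives", "hard_negatives", "comment", "source",
    "author", "tournament", "date", "source_url",
)


def validate_record(record):
    """Return a list of problems found in one record (an empty list means valid)."""
    problems = []
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            problems.append(f"missing required field: {name}")
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for name in sorted(set(record) - known):
        problems.append(f"unknown field: {name}")
    return problems


# Values taken from the example rows above; the types are assumptions.
example = {
    "topic_name": "Океаны",
    "question_value": 4,
    "question_text": "Океан, в отличие от своих братьев, не участвовал В НЕЙ, "
                     "благодаря чему сохранил свое положение, а не был низвергнут в Тартар.",
    "answer": "титаномахия",
    "extra_positives": "битва титанов и богов",
    "hard_negatives": "гигантомахия",
    "date": "2016-10-28",
    "source_url": "https://db.chgk.info/txt/eu16stsv.txt",
}

print(validate_record(example))                   # no problems for the example row
print(validate_record({"topic_name": "Океаны"}))  # three required fields are missing
```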
First, install the required packages with this command:
pip install -r model_requirements.txt
Then, set the current working directory to src/models:
cd src/models
bash allen_train.sh GPU_ID BATCH_SIZE LEARNING_RATE SEED DATA_FOLDER CONFIG_NAME
For example:
bash allen_train.sh 1 16 0.0001 42 'question_to_answer' configs/mT5.jsonnet
bash allen_train.sh 1 8 0.0001 42 'topic_and_question_to_answer' configs/mT5.jsonnet
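
Since allen_train.sh takes six positional arguments, it is easy to mix up their order. The sketch below wraps the call in a Python helper with named parameters and prints the commands for a few seeds. The `build_train_command` helper and the extra seed values are my own illustration, not part of the repository; only the argument order comes from the usage line above.

```python
import shlex


def build_train_command(gpu_id, batch_size, learning_rate, seed, data_folder, config_name):
    """Build the argument list for allen_train.sh in the positional order shown above."""
    return [
        "bash", "allen_train.sh",
        str(gpu_id), str(batch_size), str(learning_rate), str(seed),
        data_folder, config_name,
    ]


# Repeat the first example above with three seeds (the extra seed values are arbitrary).
# The learning rate is passed as a string so it is forwarded exactly as written.
for seed in (42, 43, 44):
    cmd = build_train_command(1, 16, "0.0001", seed, "question_to_answer", "configs/mT5.jsonnet")
    print(shlex.join(cmd))
    # To actually launch training from src/models, one could import subprocess and call:
    # subprocess.run(cmd, check=True)
```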
bash predict.sh PATH_TO_MODEL INPUT_JSONL_PATH OUTPUT_PATH
For example:
bash predict.sh archive/model.tar.gz data/question_to_answer/yoga_test.jsonl predictions/question_to_answer/full_yoga_test_predictions.txt
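
Once predictions are written, they can be scored against the gold answers. This README does not say how answers are scored, so the sketch below assumes a simple exact-match metric that also accepts any `extra_positives` entry. The `normalize` and `exact_match_accuracy` functions are hypothetical, as is the assumption that `extra_positives` is a list of strings.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, gold_records):
    """Share of predictions equal to the answer or to one of its extra_positives."""
    if len(predictions) != len(gold_records):
        raise ValueError("predictions and gold records must have the same length")
    correct = 0
    for predicted, gold in zip(predictions, gold_records):
        accepted = {normalize(gold["answer"])}
        # Assumption: extra_positives, when present, is a list of strings.
        accepted.update(normalize(alt) for alt in gold.get("extra_positives", []))
        if normalize(predicted) in accepted:
            correct += 1
    return correct / len(gold_records)


# Toy data built from the example rows above, not real model output.
gold = [
    {"answer": "титаномахия", "extra_positives": ["битва титанов и богов"]},
    {"answer": "титаномахия"},
]
predictions = ["Битва титанов и богов", "гигантомахия"]

print(exact_match_accuracy(predictions, gold))  # one of the two predictions is accepted
print(f"{2 / 2560:.4%}")                        # the 2/2560 test-set score as a percentage
```

A stricter or looser matching rule (for example, handling ё/е or punctuation) would change the score, so treat the normalization here as one option among several.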
Author: Tsimafei Prakapenka