Jeopardy-like QA dataset (in Russian) + neural models to solve this task

tsimafeip/yoga-qa

YoGa-QA - Multilingual Question Answering Project

The main goal of this project is to build a system to answer Jeopardy-like questions in Russian.
This work is conducted under the supervision of Simon Ostermann.

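At the moment, this repository makes two main contributions: the YoGa-QA dataset of Jeopardy-like questions in Russian, and neural models trained to answer such questions.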

For a more detailed description, please read the intermediate seminar report.

Repository Structure

- data - YoGa-QA source data and parsed dataset
- meta - project reports + helper instructions (e.g. how to connect to GPU servers)
- src/scraper - Python code for data scraping and parsing
- src/models - Python code for training neural models (AllenNLP)

Data Collection

‘Svoya Igra’ (Own Game) is a Russian analogue of Jeopardy. A TV version of the game has been running since 1994. In addition, there is a large community of enthusiasts in Russian-speaking countries. Professional authors write their own questions and run championships. An initial version of the Own Game dataset, built from the official public database, was introduced by Mikhalkova (2021).

Note that 'Your Game' is a less accurate translation of 'Svoya Igra' than 'Own Game', but I like the YoGa abbreviation ;)

Data Source

The database of questions and answers for Own Game is freely available at https://db.chgk.info/. The data is published as HTML pages, so it can be extracted by scraping and parsing. There are also many unpublished tournaments, which are distributed as text files (pdf, docx, txt). However, this private data cannot be freely distributed because of the authors' rights, so I focus on the public database.

The dataset was obtained by scraping the web pages of the public database and parsing them. You can find the details in the intermediate project report.
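
As a rough illustration of the parsing step, the sketch below pulls the visible text out of an HTML fragment using only the Python standard library. The markup in `sample_page` is made up for this example and the question text is abbreviated; the real pages of the public database are structured differently, and the actual scraper lives in src/scraper. `TextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string as a list of chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up markup for illustration only (assumption, not the real page layout).
sample_page = """
<html><body>
  <h2>Океаны</h2>
  <p><b>4.</b> Океан не участвовал В НЕЙ.</p>
  <p><i>Ответ:</i> титаномахия</p>
</body></html>
"""

chunks = extract_text(sample_page)
print(chunks)
```

A real parser would additionally have to map such chunks to topics, question values and answers, which depends on the actual page layout.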

Licence

The data is protected by copyright (in Russian), with the underlying licences included:

Database Schema

Required fields: topic_name, question_value, question_text, answer.
Optional fields: extra_positives, hard_negatives, comment, source, author, tournament, date, source_url.
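
The split into required and optional fields can be checked mechanically. The following is a minimal sketch, assuming each entry is available as a Python dict keyed by the field names above; `validate_entry` is a hypothetical helper and not part of the repository.

```python
REQUIRED_FIELDS = ("topic_name", "question_value", "question_text", "answer")
OPTIONAL_FIELDS = (
    "extra_positives", "hard_negatives", "comment", "source",
    "author", "tournament", "date", "source_url",
)


def validate_entry(entry):
    """Return a list of problems for one entry (an empty list means it is fine)."""
    problems = []
    # Required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if entry.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    # Anything outside the schema is reported rather than silently kept.
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for field in entry:
        if field not in known:
            problems.append(f"unknown field: {field}")
    return problems


print(validate_entry({"topic_name": "Океаны", "question_value": 4}))
```

Reporting unknown fields is a design choice of this sketch; a more permissive loader could simply ignore them.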

Sample Entry

| topic_name | question_value | question_text | answer |
| --- | --- | --- | --- |
| Океаны | 4 | Океан, в отличие от своих братьев, не участвовал В НЕЙ, благодаря чему сохранил свое положение, а не был низвергнут в Тартар. | титаномахия |
| Oceans | 4 | The ocean, unlike its brothers, did not participate in IT, thanks to which it retained its position, and was not thrown into Tartarus. | Titanomachy |

| extra_positives | hard_negatives | comment | source |
| --- | --- | --- | --- |
| битва титанов и богов | гигантомахия | - | https://ru.wikipedia.org/wiki/Океан_(мифология) |
| Battle of titans and gods | Gigantomachy | - | English analogue: https://en.wikipedia.org/wiki/Oceanus |

| author | tournament | date | source_url |
| --- | --- | --- | --- |
| Иделия Айзятулова, Андрей Мартыненко, Александр Рождествин | XII Кубок Европы по интеллектуальным играм среди студентов (Витебск). Своя игра | 2016-10-28 | https://db.chgk.info/txt/eu16stsv.txt |
| Ideliya Aizyatulova, Andrey Martynenko, Alexander Rozhdestvin | XII European Student Intellectual Games Cup (Vitebsk). Own game | 2016-10-28 | https://db.chgk.info/txt/eu16stsv.txt |
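
For illustration, the sample entry could be stored as one JSON object per line (JSONL). This is only a sketch: the value types (an integer `question_value`, lists for `extra_positives` and `hard_negatives`) and the helper names are assumptions, and the files under data may be laid out differently.

```python
import json

# The sample entry from above; value types are assumptions of this sketch.
entry = {
    "topic_name": "Океаны",
    "question_value": 4,
    "question_text": (
        "Океан, в отличие от своих братьев, не участвовал В НЕЙ, благодаря чему "
        "сохранил свое положение, а не был низвергнут в Тартар."
    ),
    "answer": "титаномахия",
    "extra_positives": ["битва титанов и богов"],
    "hard_negatives": ["гигантомахия"],
    "date": "2016-10-28",
    "source_url": "https://db.chgk.info/txt/eu16stsv.txt",
}


def to_jsonl_line(record):
    """Serialize one record to a single line; ensure_ascii=False keeps Cyrillic readable."""
    return json.dumps(record, ensure_ascii=False)


def from_jsonl_line(line):
    """Parse one JSONL line back into a dict."""
    return json.loads(line)


line = to_jsonl_line(entry)
restored = from_jsonl_line(line)
print(line)
```

Without `ensure_ascii=False`, the Russian text would be written as `\uXXXX` escapes, which is still valid JSON but hard to inspect by eye.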

Models

How to start

First, install the required packages: pip install -r model_requirements.txt
Then set the current working directory to src/models: cd src/models

How to train a model

bash allen_train.sh GPU_ID BATCH_SIZE LEARNING_RATE SEED DATA_FOLDER CONFIG_NAME

For example:
bash allen_train.sh 1 16 0.0001 42 'question_to_answer' configs/mT5.jsonnet
bash allen_train.sh 1 8 0.0001 42 'topic_and_question_to_answer' configs/mT5.jsonnet
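
When several runs are needed (for example, different seeds), the command line can be assembled programmatically. The sketch below only builds and prints commands in the argument order shown above. `build_train_command` is a hypothetical helper, and the seeds 43 and 44 are illustrative values, not settings used in this project.

```python
import shlex


def build_train_command(gpu_id, batch_size, learning_rate, seed, data_folder, config_name):
    """Build the allen_train.sh call in the documented argument order."""
    return [
        "bash", "allen_train.sh",
        str(gpu_id), str(batch_size), str(learning_rate), str(seed),
        data_folder, config_name,
    ]


# The learning rate is passed as a string so that it is forwarded verbatim
# (str(0.00001) would turn into '1e-05').
commands = [
    build_train_command(1, 16, "0.0001", seed, "question_to_answer", "configs/mT5.jsonnet")
    for seed in (42, 43, 44)
]

for command in commands:
    print(shlex.join(command))
```

To actually launch a run, each list could be passed to `subprocess.run(command, check=True)` from within src/models.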

How to make predictions with a trained model

bash predict.sh PATH_TO_MODEL INPUT_JSONL_PATH OUTPUT_PATH

For example:
bash predict.sh archive/model.tar.gz data/question_to_answer/yoga_test.jsonl predictions/question_to_answer/full_yoga_test_predictions.txt
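
Once predictions are available, one possible way to score them is a normalized exact match against `answer` and any `extra_positives`. This is a sketch under assumptions: it is not the evaluation used in the project reports, `extra_positives` is assumed to be a list of strings, and `normalize` and `is_correct` are hypothetical helpers.

```python
import string


def normalize(text):
    """Casefold, drop ASCII punctuation, and collapse whitespace."""
    cleaned = text.casefold().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def is_correct(prediction, answer, extra_positives=()):
    """True if the prediction matches the answer or any accepted alternative."""
    accepted = {normalize(answer)} | {normalize(p) for p in extra_positives}
    return normalize(prediction) in accepted


alternatives = ["битва титанов и богов"]
print(is_correct("Титаномахия.", "титаномахия", alternatives))
print(is_correct("гигантомахия", "титаномахия", alternatives))
```

Exact match is strict: it rejects inflected or paraphrased answers, so a softer metric may be more appropriate for free-form generation.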


Author: Tsimafei Prakapenka
