YoGa-QA - Multilingual Question Answering Project

The main goal of this project is to build a system that answers Jeopardy-like questions in Russian.
This work is conducted under the supervision of Simon Ostermann.

At the moment, this repository makes two main contributions:

- A Jeopardy-like QA dataset in Russian. This is a new, independently collected edition of the dataset introduced by Mikhalkova (2021).
- An mT5-based neural model for this task. Please be aware that the current quality of the model is low: 2/2560 correct answers on the test set. I am currently researching ways to improve the system.

For a more detailed description, please read the intermediate seminar report.

Repository Structure

- data - YoGa-QA source data and parsed dataset
- meta - project reports + helper instructions (e.g. how to connect to GPU servers)
- src/scraper - Python code for data scraping and parsing
- src/models - Python code for training neural models (AllenNLP)

Data Collection

‘Svoya Igra’ (Own Game) is a Russian analogue of Jeopardy. A TV version of the game has been running since 1994. In addition, there is a large community of enthusiasts in Russian-speaking countries. Professional authors write their own questions and run championships. An initial version of the Own Game dataset, built from the official public database, was introduced by Mikhalkova (2021).

Note: Your Game is a less accurate translation of 'Svoya Igra' than Own Game, but I like the YoGa abbreviation ;)

Data Source

The database of questions and answers for Own Game is freely available at https://db.chgk.info/. The data is published as HTML pages, so it can be extracted by scraping and parsing. There are also many unpublished tournaments, which are distributed as text files (pdf, docx, txt). However, this private data cannot be freely distributed because of authors' rights, so I focus on the public database.

The dataset was obtained by scraping the web pages and parsing them. You can find details in the intermediate project report.
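
The parsing step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code from src/scraper: the class name TextExtractor, the helper extract_text, and the sample markup are all hypothetical, and the real page layout of db.chgk.info is not described here. The sketch only shows the general idea of turning an HTML page into text chunks that a later parsing stage could split into topics, questions, and answers.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> list:
    """Return the non-empty text chunks of an HTML document, in page order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical markup, invented for illustration only.
sample_page = (
    "<html><body><h1>Океаны</h1>"
    "<script>var x = 1;</script>"
    "<p>Вопрос 1</p></body></html>"
)
chunks = extract_text(sample_page)
```

Downloading the pages is left out on purpose, so the sketch runs without network access.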

Licence

The data is protected by copyright (in Russian), with the underlying licences included:

Database Schema

Required fields: topic_name, question_value, question_text, answer.
Optional fields: extra_positives, hard_negatives, comment, source, author, tournament, date, source_url.
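
As a rough illustration of the schema above, an entry could be checked like this. The function validate_entry is a hypothetical helper, not part of the repository, and it assumes that an entry is represented as a Python dict keyed by the field names listed above.

```python
REQUIRED_FIELDS = ("topic_name", "question_value", "question_text", "answer")
OPTIONAL_FIELDS = (
    "extra_positives", "hard_negatives", "comment", "source",
    "author", "tournament", "date", "source_url",
)


def validate_entry(entry: dict) -> list:
    """Return a list of problems found in one entry (empty list if valid)."""
    problems = []
    # Every required field must be present and non-blank.
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if value is None or str(value).strip() == "":
            problems.append(f"missing required field: {name}")
    # Any field outside the schema is reported as unknown.
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for name in entry:
        if name not in known:
            problems.append(f"unknown field: {name}")
    return problems
```

For example, validate_entry({"topic_name": "Океаны"}) reports the three other required fields as missing.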

Sample Entry

topic_name	question_value	question_text	answer
Океаны	4	Океан, в отличие от своих братьев, не участвовал В НЕЙ, благодаря чему сохранил свое положение, а не был низвергнут в Тартар.	титаномахия
Oceans	4	The ocean, unlike its brothers, did not participate in IT, thanks to which it retained its position, and was not thrown into Tartarus.	Titanomachy

extra_positives	hard_negatives	comment	source
битва титанов и богов	гигантомахия	-	https://ru.wikipedia.org/wiki/Океан_(мифология)
Battle of titans and gods	Gigantomachy	-	English analogue: https://en.wikipedia.org/wiki/Oceanus

author	tournament	date	source_url
Иделия Айзятулова, Андрей Мартыненко, Александр Рождествин	XII Кубок Европы по интеллектуальным играм среди студентов (Витебск). Своя игра	2016-10-28	https://db.chgk.info/txt/eu16stsv.txt
Ideliya Aizyatulova, Andrey Martynenko, Alexander Rozhdestvin	XII European Student Intellectual Games Cup (Vitebsk). Own game	2016-10-28	https://db.chgk.info/txt/eu16stsv.txt

Models

How to start

First, install the required packages: pip install -r model_requirements.txt
Then, set the current working directory to src/models: cd src/models

How to train model

bash allen_train.sh GPU_ID BATCH_SIZE LEARNING_RATE SEED DATA_FOLDER CONFIG_NAME

For example:
bash allen_train.sh 1 16 0.0001 42 'question_to_answer' configs/mT5.jsonnet
bash allen_train.sh 1 8 0.0001 42 'topic_and_question_to_answer' configs/mT5.jsonnet
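
The two data folder names suggest two ways of building the model input: from the question text alone, or from the topic name plus the question text. The sketch below shows one way this could look. It is an assumption based only on the folder names. The functions build_source and to_jsonl_line, the ": " separator, and the JSON keys "source" and "target" are hypothetical and may differ from what the repository actually uses.

```python
import json


def build_source(entry: dict, mode: str) -> str:
    """Build the model input text for one dataset entry."""
    if mode == "question_to_answer":
        return entry["question_text"]
    if mode == "topic_and_question_to_answer":
        # The ": " separator is an arbitrary choice made for this sketch.
        return f"{entry['topic_name']}: {entry['question_text']}"
    raise ValueError(f"unknown mode: {mode}")


def to_jsonl_line(entry: dict, mode: str) -> str:
    """Serialize one entry as a JSON line; the key names are hypothetical."""
    record = {"source": build_source(entry, mode), "target": entry["answer"]}
    # ensure_ascii=False keeps Cyrillic text readable in the output file.
    return json.dumps(record, ensure_ascii=False)
```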

How to make predictions with trained model

bash predict.sh PATH_TO_MODEL INPUT_JSONL_PATH OUTPUT_PATH

For example:
bash predict.sh archive/model.tar.gz data/question_to_answer/yoga_test.jsonl predictions/question_to_answer/full_yoga_test_predictions.txt
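
Once predictions are written, they can be compared with the reference answers. The following is a minimal sketch of exact-match scoring, not necessarily the metric behind the numbers reported in this README. It assumes one prediction per entry, in the same order, and that extra_positives is available as a list of alternative accepted answers. The helpers normalize, is_correct, and exact_match_count are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase, treat 'ё' as 'е', and collapse whitespace."""
    return " ".join(text.lower().replace("ё", "е").split())


def is_correct(prediction: str, answer: str, extra_positives=()) -> bool:
    """True if the prediction matches the answer or any accepted alternative."""
    accepted = {normalize(answer)} | {normalize(p) for p in extra_positives}
    return normalize(prediction) in accepted


def exact_match_count(predictions, entries) -> int:
    """Count predictions that exactly match their entry after normalization."""
    return sum(
        is_correct(pred, entry["answer"], entry.get("extra_positives", []))
        for pred, entry in zip(predictions, entries)
    )
```

With the sample entry above, both "титаномахия" and "битва титанов и богов" would count as correct, while the hard negative "гигантомахия" would not.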

Author: Tsimafei Prakapenka
