Data Valuation

Data Valuation provides three main blocks as shown in the diagram below:

An inference model that takes as input the data
A feature extractor that will compute features based on the raw ones.
A CoDi (Confidence Discriminator) that will accept/reject the datapoints.

Features

A highly modular end-to-end framework:
- Easily customizable inference model : researchers can plug their own inference model or tune the one provided.
- Easily customizable confidence discriminator model : researchers can plug their own codi or tune the one provided.
- Any Inference model can be used (tested on images, NLP, speech)
- Customizable plug-n-play feature processing : can easily choose which features to keep, add or remove.
- Customizable datasets by dataset classes inheritence.

Speech Dataset

The LibriSpeech corpus is a large (1000 hour) corpus of English read speech derived from audiobooks in the LibriVox project, sampled at 16kHz. The accents are various and not marked, but the majority are US English. It is available for download for free at http://www.openslr.org/12/. It was prepared as a speech recognition corpus by Vassil Panayotov.

Directory Layout

|-- codi
   |-- train_codi.yaml
   |-- mlp_codi.yaml
   |-- mlp_codi.py
   |-- ...
|-- dataset
|-- env
|-- features
   |-- features_classes
   |-- features_tests
|-- inference_models
|-- main.py
|-- main_speech.py
|-- retraining_experiment.py
|-- preprocessing
|-- README.md
|-- speech_inference
   |-- local
   |-- run.sh
   |-- RESULTS
   |-- ...

Steps to reproduce for speech

Run image locally using following command and replace the files in egs/librispeech/s5 folder by the ones in speech_inference:

nvidia-docker run -it kaldiasr/kaldi:gpu-latest bash

Then run main_speech.py script.

Docker Installation

To build and run the image locally:

cd env
docker build -t dataval .
docker run -it dataval /bin/bash

How to run

To run the main pipeline, call the main function in main.py (or main_speech.py for the speech setup)
To run retraining experiments, call the main function in retraining_experiment.py

Requirements

Listed in 'env/requirements.txt'.

Team

Code written by:

Jean-Baptiste Membrado

Joëlle Hanna

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
codi		codi
dataset		dataset
env		env
features		features
inference_models		inference_models
speech_inference		speech_inference
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
main_speech.py		main_speech.py
n_step_experiment.py		n_step_experiment.py
pipeline_diagram.png		pipeline_diagram.png
readme.md		readme.md
retraining_experiment.py		retraining_experiment.py

License

swisscom/ai-research-data-valuation-repository

Folders and files

Latest commit

History

Repository files navigation

Data Valuation

Features

Speech Dataset

Directory Layout

Steps to reproduce for speech

Docker Installation

How to run

Requirements

Team

About

Resources

License

Stars

Watchers

Forks

Languages