This repository hosts the code and scripts for my Applied Data Science master's thesis.
The table below gives an overview of the files and what each one is for.

| File | Purpose |
|---|---|
| data_exploration.ipynb | Exploratory analysis of the dataset |
| dev_sampler.sh | Create a development subset |
| hypothesis_cleaner.py | Clean Whisper transcripts for use as JiWER hypotheses |
| test_sampler.sh | Create a test subset |
| transcript_converter.py | Clean ort transcripts and convert them to txt files |
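
As a rough illustration of why transcripts are cleaned before scoring, the sketch below normalizes a reference and a hypothesis and computes a word error rate with plain Python. It is not the code in `hypothesis_cleaner.py` and it does not call JiWER; the function names `normalize` and `word_error_rate` and the cleaning rules (lowercasing, stripping ASCII punctuation) are assumptions made for this example only.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace.

    Hypothetical cleaning rules; the real scripts may do more or less.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Without the normalization step, differences in casing or punctuation between the Whisper output and the reference would be counted as word errors, which is presumably what the cleaning scripts are meant to avoid.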