# Data Introduction

This notebook outlines the data we will be working with for this project.

## SOMOS Dataset

The SOMOS dataset comprises 20,000 synthetic utterances in WAV format, 100 natural utterances, and 374,955 human-assigned scores that rate the naturalness of the utterances on a scale from 1 to 5.

The synthetic utterances were generated by Tacotron-like acoustic models and an LPCNet vocoder trained on LJ Speech, a publicly available speech dataset that consists of recordings of a single speaker. The input text was 2,000 sentences selected from a variety of sources, including Blizzard Challenge texts from 2007 to 2016, the LJ Speech corpus, and general domain data from the Internet.

The ratings table contains over 250,000 rows. Each row is one rating (not one utterance) and is a unique combination of:
- a system ID
- an utterance ID
- a listener's choice
- a listener ID

The listener's choice is the listener's naturalness rating of the utterance and ranges from 1 to 5. A score of 1 means the listener found the utterance very unnatural, and a score of 5 means the listener found it completely natural.
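
As a rough illustration of how these four columns fit together, the sketch below averages the choices per system to get a mean opinion score (MOS). The rows and the utterance ID strings are invented toy values, not taken from the dataset, and `mean_opinion_scores` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented toy ratings: (systemId, utteranceId, choice, listenerId).
# The ID strings are placeholders, not the dataset's real ID format.
ratings = [
    ("000", "utt_a_000", 5, "listener_1"),
    ("000", "utt_a_000", 4, "listener_2"),
    ("001", "utt_a_001", 3, "listener_1"),
    ("001", "utt_a_001", 2, "listener_2"),
    ("001", "utt_b_001", 4, "listener_3"),
]


def mean_opinion_scores(rows):
    """Return {systemId: average choice} for rows shaped like `ratings`."""
    choices_by_system = defaultdict(list)
    for system_id, _utterance_id, choice, _listener_id in rows:
        # Choices are expected to lie on the 1-5 scale described above.
        if not 1 <= choice <= 5:
            raise ValueError(f"choice out of range: {choice}")
        choices_by_system[system_id].append(choice)
    return {
        system_id: sum(choices) / len(choices)
        for system_id, choices in choices_by_system.items()
    }


mos = mean_opinion_scores(ratings)
for system_id in sorted(mos):
    print(system_id, mos[system_id])
```

System IDs are kept as strings here so that leading zeros such as `000` are not lost.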

SOMOS stands for the Samsung Open MOS dataset. It was released by researchers at Samsung to support the evaluation of neural text-to-speech (TTS) synthesis, and it is used to develop and evaluate models that predict mean opinion scores (MOS) for synthetic speech.

### Additional Features

The SOMOS dataset includes several additional features beyond the four main columns that were outlined above.

- isNative: indicates if the listener reported being a native English speaker (1) or not (0)
- wrongValidation: quality check on the validation sample of a test page; 1 if the page passes, 0 if the listener assigned a wrong score to the validation sample
- lowNatural: quality check on the natural sample of a test page; 1 if the page passes, 0 if the listener scored the natural sample extremely low (1 or 2)
- sameScores: quality check on score variety; 1 if the page passes, 0 if all scores on the test page are identical
- highSynthetic: quality check on the synthetic samples; 1 if the page passes, 0 if the average score of the synthetic samples on the page is higher than or close to the natural sample's score
- clean: indicates if the test page passed all four quality checks (1) or not (0)
- listenerReliability: a continuous variable from 0 to 1 giving the fraction of the listener's submitted test pages that are clean
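
One way these flags could be used is to keep only ratings from clean pages and reasonably reliable listeners. The sketch below is only an illustration: the rows are invented, the 0.7 threshold is an arbitrary assumption rather than a recommended value, and `keep_trusted` is a hypothetical helper name.

```python
# Invented toy rows that reuse the column names described above.
rows = [
    {"listenerId": "A", "choice": 4, "isNative": 1, "clean": 1, "listenerReliability": 0.9},
    {"listenerId": "B", "choice": 5, "isNative": 1, "clean": 0, "listenerReliability": 0.4},
    {"listenerId": "C", "choice": 2, "isNative": 0, "clean": 1, "listenerReliability": 0.8},
    {"listenerId": "D", "choice": 3, "isNative": 1, "clean": 1, "listenerReliability": 0.5},
]

# Assumed cutoff for this sketch only; tune it for the actual analysis.
MIN_RELIABILITY = 0.7


def keep_trusted(rows, min_reliability=MIN_RELIABILITY, native_only=False):
    """Keep rows from clean pages whose listener meets the reliability cutoff."""
    kept = []
    for row in rows:
        if row["clean"] != 1:
            continue  # page failed at least one quality check
        if row["listenerReliability"] < min_reliability:
            continue  # listener has too few clean pages overall
        if native_only and row["isNative"] != 1:
            continue  # optionally restrict to self-reported native speakers
        kept.append(row)
    return kept


trusted = keep_trusted(rows)
print([row["listenerId"] for row in trusted])
```

Passing `native_only=True` additionally drops listeners who did not tick the native-speaker checkbox.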

### Data Dictionary (SOMOS Dataset)

| Column Name   | Description                                                                                                                                                                      | Data Type   |
|:---------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------|
| utteranceId   | Id of the utterance, composed from sentenceId and systemId.                                                                                                                      | string      |
| choice        | 1-5: choice of listener in a Likert-type scale from very unnatural (1) to completely natural (5).                                                                                | integer     |
| sentenceId    | Original Id of sentence text. Includes info on text domain.                                                                                                                      | string      |
| systemId      | 000-200: 000 is natural speech, 001-200 are TTS systems.                                                                                                                          | string      |
| modelId       | m0-m5: m0 is natural speech, m1-m5 are TTS models.                                                                                                                                | string      |
| testpageId    | 000-999: corresponds to HIT Id on Amazon Mechanical Turk.                                                                                                                        | string      |
| locale        | us (United States), gb (United Kingdom), ca (Canada): registered locale of listener on Amazon Mechanical Turk.                                                                   | string      |
| listenerId    | Anonymized AMT worker Id.                                                                                                                                                         | string      |
| isNative      | 0 (no), 1 (yes): Although only residents of the respective English locale, according to AMT’s qualification, were recruited in each test, and only native English speakers were asked to participate, a native/non-native annotation checkbox was included in each test page. | integer     |
| wrongValidation | 0 (fails to pass quality check), 1 (passes quality check): Wrong score has been assigned to the validation sample on test page. Validation utterances and respective choices have been removed from the dataset, but the page-level validation annotation has been propagated for every choice in the page. | integer |
| lowNatural    | 0 (fails to pass quality check), 1 (passes quality check): The score assigned to the natural sample on test page is extremely low (1 or 2).                                      | integer     |
| sameScores    | 0 (fails to pass quality check), 1 (passes quality check): All scores on test page are identical, ignoring the score of the validation utterance.                                     | integer     |
| highSynthetic | 0 (fails to pass quality check), 1 (passes quality check): The average score of the synthetic samples on the page is higher than the natural sample's score, or lower by no more than 0.1.                                                               | integer     |
| clean         | 0 (no), 1 (yes): Clean is the logical AND of the 4 quality checks (wrongValidation, lowNatural, sameScores, highSynthetic) that have been used in the dataset per test page in Flag 0/1 form. Thus, all test pages that have passed all 4 quality checks are considered clean. | integer     |
| listenerReliability | The percentage of test pages that the listener has submitted and are considered clean, expressed in the range [0, 1].                                                             | float       |
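
The definitions of `clean` and `listenerReliability` in the table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dataset authors' code: the page records are invented, and each record stands for one test page, whereas in the dataset the page-level flags are repeated on every rating row of that page. `is_clean` and `listener_reliability` are hypothetical helper names.

```python
from collections import defaultdict

# The four page-level quality checks; 1 means the page passed the check.
CHECKS = ("wrongValidation", "lowNatural", "sameScores", "highSynthetic")

# Invented toy records, one per submitted test page.
pages = [
    {"listenerId": "A", "testpageId": "001",
     "wrongValidation": 1, "lowNatural": 1, "sameScores": 1, "highSynthetic": 1},
    {"listenerId": "A", "testpageId": "002",
     "wrongValidation": 1, "lowNatural": 1, "sameScores": 1, "highSynthetic": 0},
    {"listenerId": "B", "testpageId": "003",
     "wrongValidation": 1, "lowNatural": 1, "sameScores": 1, "highSynthetic": 1},
]


def is_clean(page):
    """Logical AND of the four quality checks, returned as 0 or 1."""
    return int(all(page[check] == 1 for check in CHECKS))


def listener_reliability(pages):
    """Fraction of each listener's submitted pages that are clean, in [0, 1]."""
    clean_pages = defaultdict(int)
    total_pages = defaultdict(int)
    for page in pages:
        total_pages[page["listenerId"]] += 1
        clean_pages[page["listenerId"]] += is_clean(page)
    return {
        listener: clean_pages[listener] / total_pages[listener]
        for listener in total_pages
    }


reliability = listener_reliability(pages)
print(reliability)
```

A check like this can also serve as a sanity test after loading the data, by comparing the recomputed values with the `clean` and `listenerReliability` columns.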

### Resources

1. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
2. https://librosa.org/doc/main/index.html
3. https://www.sciencedirect.com/topics/engineering/zero-crossing-rate
4. https://medium.com/@LeonFedden/comparative-audio-analysis-with-wavenet-mfccs-umap-t-sne-and-pca-cb8237bfce2f
5. https://machinelearning.apple.com/research/mel-spectrogram
6. https://www.kaggle.com/code/shivamburnwal/speech-emotion-recognition
7. https://medium.com/heuristics/audio-signal-feature-extraction-and-clustering-935319d2225
8. https://www.soundjay.com/ambient-sounds.html


## To Do

1. Remove duplicates before normalizing.
2. Add the Mozilla data, then repeat step 1 and run librosa.