# Data Introductions

This notebook introduces the data we will be using for this project and the goal of our analysis. We have compiled resources from multiple data sources to create a diverse set of audio data. By including several different datasets, we aim to ensure that our model can generalize well to new data. Each dataset was carefully selected to provide a balanced input of audio files, including different speakers, accents, genders, recording equipment, and environments.

## Problem Statement

Speech synthesis has become increasingly sophisticated, with computer-generated voices that can sound remarkably human-like. However, distinguishing between synthetic speech and human speech can still be challenging, especially in scenarios where the quality of the audio signal may be poor or the characteristics of the speakers are highly variable. In this project, we aim to develop a machine learning model to classify speech signals as either human or synthetic, based on their acoustic features.

To accomplish this, we will use a dataset of voice recordings that includes samples of both human and synthetic speech. We will preprocess the audio signals to extract a set of relevant features, such as spectral density, pitch, and formant frequencies. We will then train a machine learning model, such as a random forest classifier or a support vector machine (SVM), to classify speech signals as either human or synthetic, based on these features.
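
As a rough illustration of what "acoustic features" means here, the sketch below computes two simple ones, zero-crossing rate and an autocorrelation-based pitch estimate, on a synthetic 200 Hz tone. It uses only the standard library; the function names, the tone, and the 50-500 Hz search range are assumptions made for this example, and a real pipeline would more likely rely on an audio library such as librosa applied to the actual recordings.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def estimate_pitch_hz(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate pitch as the lag with the largest raw autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(a * b for a, b in zip(samples, samples[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic stand-in for a voiced frame: a pure 200 Hz tone at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(2048)]

features = {
    "zcr": zero_crossing_rate(tone),
    "pitch_hz": estimate_pitch_hz(tone, sample_rate),
}
print(features)
```

Feature dictionaries like this one, computed per recording, would form the rows of the table passed to the classifier.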

The performance of the model will be evaluated using metrics such as accuracy, precision, recall, and F1 score. We will also conduct a thorough analysis of the model's performance, including a confusion matrix, to determine which classes are being misclassified and whether any particular features are driving the classification decision.
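
The metrics named above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch with made-up labels; treating "synthetic" as the positive class and the helper name `binary_metrics` are assumptions for illustration only.

```python
def binary_metrics(y_true, y_pred, positive="synthetic"):
    """Compute confusion-matrix counts and the derived metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up ground truth and predictions for eight clips.
y_true = ["human", "human", "human", "synthetic",
          "synthetic", "synthetic", "synthetic", "human"]
y_pred = ["human", "synthetic", "human", "synthetic",
          "synthetic", "human", "synthetic", "human"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there is one false positive (a human clip flagged as synthetic) and one false negative, which is the kind of breakdown the confusion matrix is meant to expose.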

Overall, the goal of this project is to develop a machine learning model that can accurately classify speech signals as either human or synthetic, and to gain insights into the underlying features that contribute to this classification. The resulting model could have applications in fields such as speech recognition, natural language processing, and voice authentication, among others.

## Dataset Introductions

#### SOMOS Dataset

The SOMOS dataset comprises 20,000 synthetic utterances in WAV file format, 100 natural utterances, and 374,955 human-assigned scores, ranging from 1 to 5, that rate the naturalness of the utterances.

The synthetic utterances are generated by training Tacotron-like acoustic models and an LPCNet vocoder on a publicly available speech dataset called LJ Speech, which consists of recordings of a single speaker. To generate the synthetic utterances, 2,000 text sentences were selected from a variety of sources, including Blizzard Challenge texts from 2007-2016, the LJ Speech corpus, and general domain data from the Internet.

The scores are stored as individual rating records rather than as utterances. Each record is a unique combination of:
- a system ID
- an utterance ID
- a listener's choice
- a listener ID

The listener's choice is the naturalness score that the listener assigned to the utterance, on a Likert-type scale from 1 (very unnatural) to 5 (completely natural).
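
Rating records of this shape are typically aggregated into a mean opinion score (MOS) per system. The sketch below shows one way to do that with the standard library; the records and the ID values in them are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rating records; IDs and scores are made up.
ratings = [
    {"systemId": "000", "utteranceId": "utt_a", "listenerId": "L1", "choice": 5},
    {"systemId": "000", "utteranceId": "utt_a", "listenerId": "L2", "choice": 4},
    {"systemId": "001", "utteranceId": "utt_b", "listenerId": "L1", "choice": 3},
    {"systemId": "001", "utteranceId": "utt_b", "listenerId": "L2", "choice": 2},
    {"systemId": "001", "utteranceId": "utt_c", "listenerId": "L3", "choice": 4},
    {"systemId": "002", "utteranceId": "utt_d", "listenerId": "L1", "choice": 1},
    {"systemId": "002", "utteranceId": "utt_d", "listenerId": "L3", "choice": 2},
]

# Group the listener choices by the system that produced the utterance.
choices_by_system = defaultdict(list)
for record in ratings:
    choices_by_system[record["systemId"]].append(record["choice"])

# Mean opinion score per system: the average of its listener choices.
mos_by_system = {
    system_id: mean(choices) for system_id, choices in choices_by_system.items()
}
print(mos_by_system)
```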

SOMOS (the Samsung Open MOS dataset) was released by researchers at Samsung to support the evaluation of neural text-to-speech systems, with listener scores crowdsourced on Amazon Mechanical Turk. It is primarily used to develop and evaluate models that predict the perceived naturalness of synthetic speech.

#### Additional Features

The SOMOS dataset includes several additional columns beyond the four main ones outlined above. Most of them are page-level quality checks; the data dictionary below gives the exact 0/1 encoding for each.

- isNative: whether the listener self-reported as a native English speaker (1) or not (0)
- wrongValidation: quality check on whether the listener assigned a wrong score to the validation sample on the test page
- lowNatural: quality check on whether the listener gave the natural sample on the test page an extremely low score (1 or 2)
- sameScores: quality check on whether all scores on the test page are identical
- highSynthetic: quality check on whether the average score of the synthetic samples on the page is higher than, or close to, the natural sample's score
- clean: whether the test page passed all four quality checks (1) or not (0)
- listenerReliability: the fraction of the listener's submitted test pages that are clean, from 0 to 1

#### Data Dictionary (SOMOS Dataset)

| Column Name   | Description                                                                                                                                                                      | Data Type   |
|:---------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------|
| utteranceId   | Id of the utterance, composed from sentenceId and systemId.                                                                                                                      | string      |
| choice        | 1-5: choice of listener in a Likert-type scale from very unnatural (1) to completely natural (5).                                                                                | integer     |
| sentenceId    | Original Id of sentence text. Includes info on text domain.                                                                                                                      | string      |
| systemId      | 000-200: 000 is natural speech, 001-200 are TTS systems.                                                                                                                          | string      |
| modelId       | m0-m5: m0 is natural speech, m1-m5 are TTS models.                                                                                                                                | string      |
| testpageId    | 000-999: corresponds to HIT Id on Amazon Mechanical Turk.                                                                                                                        | string      |
| locale        | us (United States), gb (United Kingdom), ca (Canada): registered locale of listener on Amazon Mechanical Turk.                                                                   | string      |
| listenerId    | Anonymized AMT worker Id.                                                                                                                                                         | string      |
| isNative      | 0 (no), 1 (yes): Although only residents of the respective English locale, according to AMT’s qualification, were recruited in each test, and only native English speakers were asked to participate, a native/non-native annotation checkbox was included in each test page. | integer     |
| wrongValidation | 0 (fails to pass quality check), 1 (passes quality check): Wrong score has been assigned to the validation sample on test page. Validation utterances and respective choices have been removed from the dataset, but the page-level validation annotation has been propagated for every choice in the page. | integer |
| lowNatural    | 0 (fails to pass quality check), 1 (passes quality check): The score assigned to the natural sample on test page is extremely low (1 or 2).                                      | integer     |
| sameScores    | 0 (fails to pass quality check), 1 (passes quality check): All scores on test page are identical ignoring the score of the validation utterance.                                     | integer     |
| highSynthetic | 0 (fails to pass quality check), 1 (passes quality check): The average score of synthetic samples on page is higher than, or close to (no more than 0.1 below), the natural sample's score.                                                               | integer     |
| clean         | 0 (no), 1 (yes): Clean is the logical AND of the 4 quality checks (wrongValidation, lowNatural, sameScores, highSynthetic) that have been used in the dataset per test page in Flag 0/1 form. Thus, all test pages that have passed all 4 quality checks are considered clean. | integer     |
| listenerReliability | The percentage of test pages that the listener has submitted and are considered clean, expressed in the range [0, 1].                                                             | float       |
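
The `clean` and `listenerReliability` columns are derived from the four quality checks, so they can be recomputed as a sanity check. The sketch below follows the encoding in the table above (1 means the page passed a check); the page records themselves are hypothetical.

```python
from collections import defaultdict

QUALITY_CHECKS = ("wrongValidation", "lowNatural", "sameScores", "highSynthetic")

# Hypothetical page-level records; 1 means the page passed that check.
pages = [
    {"listenerId": "A", "testpageId": "001",
     "wrongValidation": 1, "lowNatural": 1, "sameScores": 1, "highSynthetic": 1},
    {"listenerId": "A", "testpageId": "002",
     "wrongValidation": 1, "lowNatural": 1, "sameScores": 0, "highSynthetic": 1},
    {"listenerId": "B", "testpageId": "003",
     "wrongValidation": 1, "lowNatural": 1, "sameScores": 1, "highSynthetic": 1},
]

# clean is the logical AND of the four page-level quality checks.
for page in pages:
    page["clean"] = int(all(page[check] == 1 for check in QUALITY_CHECKS))

# listenerReliability is the share of a listener's pages that are clean.
clean_flags_by_listener = defaultdict(list)
for page in pages:
    clean_flags_by_listener[page["listenerId"]].append(page["clean"])

listener_reliability = {
    listener_id: sum(flags) / len(flags)
    for listener_id, flags in clean_flags_by_listener.items()
}
print(listener_reliability)
```

Filtering on `clean == 1`, or on a minimum `listenerReliability`, is one plausible way to discard low-quality ratings before analysis.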

#### Mozilla Common Voice Dataset

The Mozilla Common Voice dataset is a collection of recordings of human speech, with the goal of creating a high-quality, open-source dataset for use in training speech recognition systems. The dataset consists of recordings contributed by volunteers, and is available in a variety of languages.

As of September 2021, the dataset contains over 9,000 hours of speech from more than 9,000 unique speakers, across 60 languages. We'll be accessing a subset of this dataset that contains approximately 1,000 hours of speech.

Each recording in the dataset is accompanied by metadata such as the text of the spoken sentence, the language, and, where the contributor provided them, the speaker's age, gender, and accent. Acoustic features such as MFCCs and mel-spectrograms are not distributed with the corpus; they have to be computed from the audio files.

The Common Voice dataset has been widely used in the research community to train and evaluate speech recognition models, and has helped to advance the state of the art in the field.

#### Recordings Data Dictionary (Mozilla)
| Column Name | Data Type | Description |
| --- | --- | --- |
| `path` | string | Relative path to the audio file in the dataset |
| `sentence_id` | integer | ID of the corresponding sentence |
| `up_votes` | integer | Number of up votes for the recording |
| `down_votes` | integer | Number of down votes for the recording |
| `age` | integer | Age of the speaker who recorded the sentence |
| `gender` | string | Gender of the speaker who recorded the sentence |
| `accent` | string | Accent of the speaker who recorded the sentence |
| `locale` | string | Locale of the speaker who recorded the sentence |
| `segment` | string | Type of segment for the recording (e.g. validated, other) |
| `duration` | float | Duration of the audio file in seconds |
| `split` | string | Split of the data for training, validation, or testing |
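
The vote columns give a simple way to keep only recordings that reviewers agreed on. The sketch below reads a tiny inline TSV using the column names from the table above; the rows, the file names, and the rule "more up votes than down votes" are assumptions chosen for illustration, not the project's actual filter.

```python
import csv
import io

# Tiny inline stand-in for a recordings table; all rows are made up.
sample_tsv = (
    "path\tsentence_id\tup_votes\tdown_votes\tduration\n"
    "clip_0001.mp3\t101\t2\t0\t3.5\n"
    "clip_0002.mp3\t102\t1\t3\t4.0\n"
    "clip_0003.mp3\t103\t4\t1\t2.5\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")

# Keep clips whose up votes outnumber their down votes (assumed rule).
kept = [
    row for row in reader
    if int(row["up_votes"]) > int(row["down_votes"])
]

kept_paths = [row["path"] for row in kept]
kept_seconds = sum(float(row["duration"]) for row in kept)
print(kept_paths, kept_seconds)
```

Summing `duration` over the kept rows is also a convenient way to check how many hours of audio a given filter leaves.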

#### Speakers Table Data Dictionary (Mozilla)
| Column Name | Data Type | Description |
| --- | --- | --- |
| `client_id` | integer | ID of the speaker |
| `age` | integer | Age of the speaker |
| `gender` | string | Gender of the speaker |
| `accent` | string | Accent of the speaker |
| `locale` | string | Locale of the speaker |

#### Sentences Table Data Dictionary (Mozilla)
| Column Name | Data Type | Description |
| --- | --- | --- |
| `sentence_id` | integer | ID of the sentence |
| `sentence` | string | Text of the sentence |
| `up_votes` | integer | Number of up votes for the sentence |
| `down_votes` | integer | Number of down votes for the sentence |
| `split` | string | Split of the data for training, validation, or testing |

#### AptLy Labs Real-or-Fake Dataset

The Real-or-Fake dataset from AptLy Labs is a collection of speech utterances, each labeled as real or fake. The dataset consists of approximately 2,000 unique utterances, with roughly half being real and half being fake. The fake utterances were generated using a voice conversion algorithm that can transform the voice of one person to sound like another person. The dataset includes both male and female speakers, and the real utterances are drawn from a variety of sources, including the LibriSpeech and Common Voice datasets. The dataset is primarily intended for research purposes and can be used to train and evaluate machine learning models for detecting fake speech.

| Field name     | Description                                                                                                                     |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| audio_file     | The name of the audio file containing the speech utterance.                                                                      |
| transcribed    | A Boolean indicating whether or not the speech has been transcribed.                                                             |
| transcript     | The text transcription of the speech utterance.                                                                                  |
| label          | A Boolean indicating whether the speech is real or fake.                                                                         |
| speaker_id     | A unique identifier for the speaker.                                                                                            |
| recording_date | The date the speech was recorded.                                                                                                |
| language       | The language spoken in the speech utterance.                                                                                     |
| accent         | The accent of the speaker in the speech utterance.                                                                                |
| duration       | The duration of the speech utterance in seconds.                                                                                  |
| sample_rate    | The sampling rate of the audio file.                                                                                             |
| channels       | The number of audio channels in the audio file.                                                                                   |
| bit_depth      | The bit depth of the audio file.                                                                                                 |
| codec          | The audio codec used in the audio file.                                                                                           |
| notes          | Any additional notes about the speech utterance or dataset.                                                                      |

#### LibriTTS Dataset

The LibriTTS dataset is a large-scale corpus of approximately 585 hours of English read speech, sampled at 24 kHz and derived from the audiobooks of the LibriVox project. It contains speech from 2,456 speakers, read by volunteers rather than recorded in a professional studio, and covers a diverse set of topics and genres. The audio is annotated with speaker and book information, and each utterance is aligned with its corresponding text transcription. The dataset is intended for use in training and evaluating text-to-speech systems and other related applications in the field of speech synthesis.

For this model, we'll use roughly 1/3 of the total dataset.
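
How that third is selected is not specified here. One possible approach, sketched below, is to sample about one third of each speaker's utterances with a fixed seed, so that every speaker stays represented and the subset is reproducible. The records, the seed, and the per-speaker strategy are all assumptions for illustration.

```python
import math
import random
from collections import defaultdict

# Hypothetical utterance index: 3 speakers with 6 utterances each.
utterances = [
    {"id": f"spk{speaker}_utt{n}", "speaker_id": f"spk{speaker}"}
    for speaker in range(3)
    for n in range(6)
]

# Group utterances by speaker so that sampling happens within each speaker.
by_speaker = defaultdict(list)
for utt in utterances:
    by_speaker[utt["speaker_id"]].append(utt)

rng = random.Random(42)  # fixed seed so the subset is reproducible

subset = []
for speaker_id in sorted(by_speaker):
    speaker_utts = by_speaker[speaker_id]
    # Take roughly one third of this speaker's utterances, at least one.
    k = max(1, math.ceil(len(speaker_utts) / 3))
    subset.extend(rng.sample(speaker_utts, k))

print(len(subset), sorted({utt["speaker_id"] for utt in subset}))
```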

#### Data Dictionary (LibriTTS)

| Column Name   | Description                                                |
|---------------|------------------------------------------------------------|
| id            | Unique ID of the utterance in the dataset                  |
| text          | Text transcription of the spoken utterance                  |
| speaker_id    | Unique ID of the speaker who recorded the utterance         |
| chapter_id    | ID of the book chapter from which the utterance was derived |
| book_id       | ID of the book from which the utterance was derived         |
| dataset_id    | ID of the dataset from which the utterance was derived      |
| duration      | Duration of the audio file in seconds                       |
| audio_path    | File path to the audio file in WAV format                    |
| normalized    | Binary indicator of whether the audio has been normalized   |
| split         | Training, validation, or test split of the data              |

### Resources

1. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
2. https://librosa.org/doc/main/index.html
3. https://www.sciencedirect.com/topics/engineering/zero-crossing-rate
4. https://medium.com/@LeonFedden/comparative-audio-analysis-with-wavenet-mfccs-umap-t-sne-and-pca-cb8237bfce2f
5. https://machinelearning.apple.com/research/mel-spectrogram
6. https://www.kaggle.com/code/shivamburnwal/speech-emotion-recognition
7. https://medium.com/heuristics/audio-signal-feature-extraction-and-clustering-935319d2225
8. https://www.soundjay.com/ambient-sounds.html
9. https://www.openslr.org/60/
10. https://bil.eecs.yorku.ca/datasets/
11. https://github.com/jeffprosise/Deep-Learning/blob/master/Audio%20Classification%20(CNN).ipynb
12. https://learn.flucoma.org/reference/mfcc/
