# Audio Story Veracity Detection: A Study on Machine Learning-based Audio Feature Extraction and Classification Models

# 1 Author

**Student Name**: Yixian Tian 

**Student ID**:221167995



# 2 Problem formulation

The objective of this endeavor is to construct a sophisticated machine learning model that operates on a dataset comprising 100 WAV audio recordings. Each recording is accompanied by two critical attributes: Language, which could be English or Mandarin, and Story Type, which classifies the narrative style. The primary goal is to utilize the Python machine learning toolkit to analyze these attributes and discern whether the narrated stories are factual or fabricated. This process encompasses the division of the dataset into two segments: one for training the model, thereby enabling it to learn from existing data, and another for testing, which assesses the model's predictive accuracy and its capacity to apply learned patterns to novel scenarios.

Four key areas of interest are highlighted in the formulation of this problem:

1. **Data Complexity**: The dataset's complexity arises from the fusion of audio recordings with the dual attributes of Language and Story Type. Audio files are treasure troves of information, including vocal intonations and rhythmic patterns, that may offer subtle hints regarding the veracity of the stories. The manner in which different languages convey truthfulness or deceit could vary, and distinct story types might exhibit unique authenticity-related traits. The challenge of distilling pertinent features from this rich tapestry of data presents a fascinating conundrum.

2. **Classification Challenge**: At its core, this is a classification challenge that necessitates the identification of features that serve as robust discriminators between genuine and spurious stories. Unveiling the pivotal elements within the audio and its attributes that sway the binary classification is an engaging intellectual pursuit. For instance, specific linguistic characteristics or narrative structures may act as potent indicators of a story's credibility.

3. **Model Assessment and Generalization**: The bifurcation of the dataset into training and testing subsets is instrumental in gauging the model's proficiency in navigating unseen data. It is compelling to evaluate whether the model can fathom the data's underlying tapestry and accurately arbitrate the authenticity of stories in the test set, thereby demonstrating its generalization prowess and pragmatic applicability to this particular issue.

4. **The Peril of Overfitting**: Overfitting represents a significant concern within the realm of machine learning. In the quest to construct a model that predicates story truthfulness on audio cues and associated attributes, the model may become overly attuned to the training data, learning its idiosyncrasies and noise. This could result in a model that excels on training data but falters when confronted with the test set, as it has failed to capture the broader patterns and instead fixated on the training data's nuances.


# 3 Methodology

Training Objective
The approach to handling audio data involves extracting features such as Mel-frequency cepstral coefficients (MFCC), zero-crossing rate, spectral centroid, and chroma. These features are then merged with language (English or Mandarin) and story authenticity labels to construct comprehensive feature vectors. The dataset is split into a training set and a validation set with an 80:20 ratio, chosen due to the limited size of the data. Logistic regression is selected for its simplicity and ability to map linear regression outcomes to a probability score using a logistic function like the Sigmoid. The model is trained to identify patterns indicative of story authenticity.

Validation Objective
Performance metrics like accuracy and confusion matrix are utilized to evaluate the model's effectiveness. Accuracy provides an overview of the model's correct predictions, while the confusion matrix details the counts of true positives, false positives, true negatives, and false negatives. Additional metrics, including precision, recall, and F1 score, offer a deeper analysis of the model's performance in distinguishing between genuine and fabricated stories. Identifying areas where the model falters aids in targeted refinement and optimization.

Additional Model Development Tasks
Data Augmentation: With a small dataset of only 100 samples, techniques like time stretching and pitch shifting are applied to artificially expand the dataset. These methods help the model learn from a more diverse set of examples without altering the story's core content, reducing the risk of overfitting and enhancing the model's ability to generalize across various audio narratives.
Data Reduction/Feature Selection: To address potential redundancies or low-correlation features, methods like principal component analysis (PCA) and feature importance ranking (particularly with tree-based models) are employed to filter out the most impactful features. This process streamlines the model, accelerates training, and enhances overall performance by focusing on the most relevant data for accurate assessments.

![mermaid.png](attachment:mermaid.png)

# 4 Implemented ML prediction pipelines



## 4.1 Transformation stage

Data Integration:
Align the audio files' directory with the CSV file containing the metadata of the recordings. Utilize the CSV data to accurately match each audio file with its respective language (e.g., English, Mandarin) and authenticity status, ensuring that the dataset is intact and that each file is correctly paired with its attributes. This alignment is crucial for the subsequent data processing steps.

Data Transformation:
Standardize the audio recordings to ensure that volume levels and other features are on a comparable scale across different recordings. This standardization helps to prevent variations in volume from skewing the feature extraction process. Extract essential audio features such as Mel-frequency cepstral coefficients (MFCC), zero-crossing rate, spectral centroid, and chroma features, and compile them into a uniform format for model training. Additionally, convert the categorical attributes like language and story type into a digital format, assigning numerical values like 0 for English and 1 for Mandarin, and similar distinct codes for different story types.

## 4.2 Model stage
Opt for a logistic regression model as the foundational classifier, leveraging the feature vectors and encoded attributes that were derived during the data transformation phase as inputs. The dataset should be partitioned into a training set and a validation set with an 80:20 ratio. The training set is then utilized to train the logistic regression model. The model parameters are refined using the gradient descent algorithm, enabling the model to discern the interrelations among audio features, language, story type, and the veracity of the stories.

The logistic regression model is chosen for its transparency in operation and prediction, as well as its computational efficiency when compared to more intricate models like deep neural networks. It is capable of swiftly processing the training for a dataset with 100 audio samples, thus conserving time and resources. Additionally, the model offers high interpretability, allowing for insights into the significance of each feature in the classification outcome. By examining the coefficients, one can ascertain the impact of each audio feature—such as MFCCs and zero-crossing rates—and language attributes on the assessment of the story's authenticity, which aids in comprehending the model's decision-making and verifying assumptions like the influence of language on story truthfulness.

Throughout the training, the validation set serves as a gauge for the model's performance. Metrics such as accuracy and the confusion matrix are employed to make timely adjustments to hyperparameters, including the learning rate and the number of iterations, to ensure optimal model performance.

## 4.3 Ensemble stage

Constructing an ensemble model could involve utilizing a voting mechanism to amalgamate the predictions from several logistic regression models. These models might be created with varying initialization parameters or trained on distinct subsets of data. By apportioning equal importance to each model's predictions, the collective decisions on the veracity of the audio stories can be quantified. The final assessment of a story's authenticity is then made based on which verdict—true or false—garners the most support. This approach can bolster the model's robustness and its capacity to generalize, while also mitigating the risks of overfitting or incorrect predictions that might arise from relying on a single model.


# 5 Dataset
The MLEnd Deception Dataset comprises 100 WAV audio recordings, each tagged with a language (English or Mandarin) and a story type, along with a label denoting the truthfulness of the narrative. This dataset is pivotal for developing machine learning models that can discern genuine from deceptive stories.

To construct our training and validation sets, we employed stratified sampling to prevent class imbalance, ensuring both subsets reflect the original dataset's distribution of true and false stories, and language and story type proportions. The dataset was split into a training set with 80 samples and a validation set with 20, maintaining the samples' independence and identical distribution.

The training set was curated to be diverse, enabling the model to learn patterns associated with different types of stories across languages, which helps in preventing overfitting and enhancing the model's ability to generalize. The validation set serves to assess the model's performance impartially, guiding hyperparameter adjustments and further training based on accuracy and confusion matrix metrics.

However, the dataset's small size may limit the model's capacity to capture all nuances of story authenticity and language use. Additionally, variations in recording conditions, such as background noise and microphone quality, could affect the data. The subjectivity in determining story truthfulness based on the criteria set by the dataset creators is also a consideration.


Under the following code, we will split the dataset into training and validation sets using stratified sampling. Stratified sampling ensures that the proportion of true and false stories in both sets is the same as in the original dataset, which helps to prevent bias in the model's training process.

![image.png](attachment:image.png)

In [5]:
import pandas as pd
import os
import shutil

# Read the CSV file
data_df = pd.read_csv('CBU0521DD_stories_attributes.csv')

# Split the dataset indices by 80% and 20% using stratified sampling with a random seed of 95
total_samples = len(data_df)
train_indices = data_df.sample(frac=0.8, random_state=95).index
test_indices = data_df.drop(train_indices).index

print("train_indices:", train_indices)
print("test_indices:", test_indices)

# Create directories for the training and test sets
os.makedirs('train_set', exist_ok=True)
os.makedirs('test_set', exist_ok=True)

# Traverse the folder where the original audio files are located
audio_folder = 'C:\\Users\\27135\\Desktop\\Deception-main\\CBU0521DD_stories' 
for index in train_indices:
    row = data_df.loc[index]
    file_name = row['filename']
    file_name = file_name[-9:-4]
    language = row['Language']
    truth_value = row['Story_type']
    new_file_name = f"{file_name}_{language}_{truth_value}"
    # Source file path
    source_path = os.path.join(audio_folder, f"{file_name}.wav")
    source_path = source_path.replace('\\', '/')
    # Destination file path
    target_path = os.path.join('train_set', f"{new_file_name}.wav")
    target_path = target_path.replace('\\', '/')
    
    shutil.copy2(source_path, target_path)

for index in test_indices:
    row = data_df.loc[index]
    file_name = row['filename']
    file_name = file_name[-9:-4]
    language = row['Language']
    truth_value = row['Story_type']
    new_file_name = f"{file_name}_{language}_{truth_value}"
    # Source file path
    source_path = os.path.join(audio_folder, f"{file_name}.wav")
    source_path = source_path.replace('\\', '/')
    # Destination file path
    target_path = os.path.join('test_set', f"{new_file_name}.wav")
    target_path = target_path.replace('\\', '/')
    
    shutil.copy2(source_path, target_path)

print("File has already been copied")

train_indices: Int64Index([88,  9,  1, 60, 95, 26, 45, 71, 44, 21, 73, 98, 16, 40, 28, 72, 65,
            15, 13, 70, 84, 38, 61, 10, 74, 83, 56, 58, 75, 76,  0, 82, 18, 79,
            59, 85, 64, 77, 46, 42, 66, 81,  4, 25, 47, 55, 53, 89, 57, 48, 24,
            50, 29, 80, 63, 33, 52, 35, 78, 23, 32, 69, 54, 51, 99,  3, 37, 27,
            14, 49, 12,  6, 43, 68,  8, 31, 92, 34, 30, 90],
           dtype='int64')
test_indices: Int64Index([2, 5, 7, 11, 17, 19, 20, 22, 36, 39, 41, 62, 67, 86, 87, 91, 93,
            94, 96, 97],
           dtype='int64')
File has already been copied


# 6 Experiments and results

Carry out your experiments here. Analyse and explain your results. Unexplained results are worthless.

## 6.1 Transformation stage experiments

This Python code realizes the transformation stage in a series of steps.

Firstly, a function named extract_audio_features is defined. This function leverages the librosa library to load audio files and extract crucial audio features like 13 - dimensional Mel - frequency cepstral coefficients (MFCCs), zero - crossing rate, spectral centroid, and chroma features, which are then processed and combined into a single vector. 

Subsequently, it loops through files in the train and test set folders respectively. During this process, it parses attributes such as language and story truth value from file names. For example, it splits the file name (excluding the .wav extension) by _ to obtain these details. These attributes are encoded into numerical values using LabelEncoder.

After that, for each file, the audio features are extracted using the previously defined function and combined with the encoded attributes to form comprehensive vectors. These vectors are stored in train_features and test_features lists for the training and test sets respectively. The corresponding encoded truth values are stored in train_labels and test_labels lists. 

Finally, these lists are converted into NumPy arrays. The train_features and train_labels arrays play a vital role in the subsequent model training phase. They act as the input data and target labels, allowing the model to learn the relationship between audio features and story authenticity. The test_features and test_labels arrays, on the other hand, are used to evaluate the performance of the trained model. By comparing the model's predictions with the actual labels in test_labels when using test_features as input, we can measure how well the model generalizes to unseen data and determine its accuracy, precision, recall, and other performance metrics.

In [6]:
%pip install librosa

^C
Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip uninstall numba
%pip install numba 0.56.4

^C
Note: you may need to restart the kernel to use updated packages.


The code above installs the librosa library, which is essential for audio feature extraction. It is a Python package for music and audio analysis, containing algorithms for extracting various audio features from audio files. In this project, librosa is used to extract Mel-frequency cepstral coefficients (MFCCs) from the audio files, which are commonly used features in audio classification tasks.

If you are using a Windows system, you can download the libsndfile library from the following link: <https://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28-win32.zip>

When importing librosa files, we encounter errors:
```
cannot load library 'libsndfile.dll': error 0x7e
```

To solve the issue, we can download the libsndfile library and add it to the system path. We can find the source on github. See the detailed instructions from the CSDN website:
https://blog.csdn.net/u010442263/article/details/130634532?fromshare=blogdetail&sharetype=blogdetail&sharerId=130634532&sharerefer=PC&sharesource=RochesterIns&sharefrom=from_link


In [None]:

import librosa
import numpy as np
import os
from sklearn.preprocessing import LabelEncoder
import time

# Define a function to extract audio features
def extract_audio_features(file_path):
    try:
        y, sr = librosa.load(file_path)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # Extract 13-dimensional MFCC features
        zero_crossing_rate = librosa.feature.zero_crossing_rate(y)
        spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
        chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
        # Additional feature extraction operations can be added as needed
        # Average the extracted features to condense multi-dimensional features into a one-dimensional vector (a common approach)
        mfccs_mean = np.mean(mfccs, axis=1)
        zero_crossing_rate_mean = np.mean(zero_crossing_rate)
        spectral_centroid_mean = np.mean(spectral_centroid)
        chroma_stft_mean = np.mean(chroma_stft, axis=1)
        return np.concatenate([mfccs_mean, [zero_crossing_rate_mean], [spectral_centroid_mean], chroma_stft_mean])
    except Exception as e:
        print(f"Error extracting audio features, file path: {file_path}, error message: {e}")
        return None

# Train set feature extraction and attribute processing
train_features = []
train_labels = []
train_set_folder = 'train_set'  # Path to the training set folder, modify as needed
if not os.path.exists(train_set_folder):
    print(f"Training set folder {train_set_folder} does not exist, please check the path!")
elif not os.access(train_set_folder, os.R_OK):
    print(f"No permission to read the contents of the training set folder {train_set_folder}, please check the permissions!")
else:
    # Collect all language categories in the training set
    all_languages_train = []
    for file_name in os.listdir(train_set_folder):
        parts = file_name[:-4].split('_')
        language = parts[1]
        all_languages_train.append(language)
    unique_languages_train = list(set(all_languages_train))

    # Create and fit a LabelEncoder instance
    label_encoder_language = LabelEncoder()
    label_encoder_language.fit(unique_languages_train)

    # Collect all story truth values (true or false) in the training set (as boolean values)
    all_truth_values_train = []
    for file_name in os.listdir(train_set_folder):
        parts = file_name[:-4].split('_')
        truth_value_str = parts[2].lower().replace(" story", "")  # Remove extra words and convert to lowercase
        truth_value = truth_value_str == "true"
        all_truth_values_train.append(truth_value)

    # Create a global LabelEncoder instance for story truth attribute encoding
    label_encoder_truth = LabelEncoder()
    label_encoder_truth.fit(all_truth_values_train)

    start_time = time.time()
    for file_name in os.listdir(train_set_folder):
        print(f"Processing training set file: {file_name}")
        file_path = os.path.join(train_set_folder, file_name)
        parts = file_name[:-4].split('_')
        language = parts[1]
        truth_value_str = parts[2].lower().replace(" story", "")
        truth_value = truth_value_str == "true"

        encoded_language = label_encoder_language.transform([language])[0]
        encoded_truth = label_encoder_truth.transform([truth_value])[0]

        features = extract_audio_features(file_path)
        if features is not None:
            combined_features = np.concatenate([features, [encoded_language], [encoded_truth]])
            train_features.append(combined_features)
            train_labels.append(encoded_truth)
        elapsed_time = time.time() - start_time
        print(f"Time spent on training set: {elapsed_time} seconds")

    train_features = np.array(train_features)
    train_labels = np.array(train_labels)
    print("训练集特征提取结束")
    print("训练集特征矩阵形状:", train_features.shape)
    print("训练集标签向量形状:", train_labels.shape)
    print("训练集标签分布情况（以0和1表示的真假故事数量统计）:")
    unique, counts = np.unique(train_labels, return_counts=True)
    for label, count in zip(unique, counts):
        print(f"标签 {label}: {count} 个")

# Test set feature extraction and attribute processing, similar operations to the training set
test_features = []
test_labels = []
test_set_folder = 'test_set'  # Path to the test set folder, modify as needed
if not os.path.exists(test_set_folder):
    print(f"Test set folder {test_set_folder} does not exist, please check the path!")
elif not os.access(test_set_folder, os.R_OK):
    print(f"No permission to read the contents of the test set folder {test_set_folder}, please check the permissions!")
else:
    start_time = time.time()
    for file_name in os.listdir(test_set_folder):
        print(f"Processing test set file: {file_name}")
        file_path = os.path.join(test_set_folder, file_name)
        parts = file_name[:-4].split('_')
        language = parts[1]
        truth_value_str = parts[2].lower().replace(" story", "")
        truth_value = truth_value_str == "true"

        encoded_language = label_encoder_language.transform([language])[0]
        encoded_truth = label_encoder_truth.transform([truth_value])[0]

        features = extract_audio_features(file_path)
        if features is not None:
            combined_features = np.concatenate([features, [encoded_language], [encoded_truth]])
            test_features.append(combined_features)
            test_labels.append(encoded_truth)
        elapsed_time = time.time() - start_time
        print(f"Time spent on test set: {elapsed_time} seconds")

    test_features = np.array(test_features)
    test_labels = np.array(test_labels)
    print("测试集特征提取结束")
    print("测试集特征矩阵形状:", test_features.shape)
    print("测试集标签向量形状:", test_labels.shape)
    print("测试集标签分布情况（以0和1表示的真假故事数量统计）:")
    unique, counts = np.unique(test_labels, return_counts=True)
    for label, count in zip(unique, counts):
        print(f"标签 {label}: {count} 个")

正在处理训练集文件: 00001_Chinese_True Story.wav
训练集已耗时: 1.0180718898773193 秒
正在处理训练集文件: 00004_Chinese_True Story.wav
训练集已耗时: 2.052875518798828 秒
正在处理训练集文件: 00005_Chinese_True Story.wav
训练集已耗时: 3.047220468521118 秒
正在处理训练集文件: 00006_Chinese_True Story.wav
训练集已耗时: 4.090890884399414 秒
正在处理训练集文件: 00007_Chinese_True Story.wav
训练集已耗时: 5.360133171081543 秒
正在处理训练集文件: 00008_Chinese_True Story.wav
训练集已耗时: 6.407218933105469 秒
正在处理训练集文件: 00009_Chinese_Deceptive Story.wav
训练集已耗时: 7.6129584312438965 秒
正在处理训练集文件: 00010_Chinese_Deceptive Story.wav
训练集已耗时: 8.63094449043274 秒
正在处理训练集文件: 00011_Chinese_Deceptive Story.wav
训练集已耗时: 9.671228408813477 秒
正在处理训练集文件: 00012_Chinese_Deceptive Story.wav
训练集已耗时: 10.829302787780762 秒
正在处理训练集文件: 00013_Chinese_Deceptive Story.wav
训练集已耗时: 11.957750797271729 秒
正在处理训练集文件: 00014_Chinese_Deceptive Story.wav
训练集已耗时: 13.430498361587524 秒
正在处理训练集文件: 00016_Chinese_Deceptive Story.wav
训练集已耗时: 15.24734091758728 秒
正在处理训练集文件: 00017_English_True Story.wav
训练集已耗时: 16.493775606155396 秒
正在处理训练集文

In this part I show the dataset of my running result. My code will show train_features_scaled and train_labels

In [14]:
# Print key data about the training set
print("Training set feature matrix shape:", train_features.shape)
print("Training set label vector shape:", train_labels.shape)
print("Training set label distribution (count of true and false stories):")
unique, counts = np.unique(train_labels, return_counts=True)
for label, count in zip(unique, counts):
    print(f"Label {label}: {count} instances")

print("Statistical information of the training set feature matrix (mean and standard deviation, expandable as needed):")
feature_means = np.mean(train_features, axis=0)
feature_stds = np.std(train_features, axis=0)
for i in range(len(feature_means)):
    print(f"Feature {i} - Mean: {feature_means[i]:.4f}, Standard Deviation: {feature_stds[i]:.4f}")

print("Example of the training set feature matrix (first 5 rows):")
print(train_features[:5])
print("Example of the training set label vector (first 80 elements):")
print(train_labels[:80])

# Print key data about the test set
print("-" * 50)  # Separator line to distinguish between training and test set data display
print("Test set feature matrix shape:", test_features.shape)
print("Test set label vector shape:", test_labels.shape)
print("Test set label distribution (count of true and false stories):")
unique, counts = np.unique(test_labels, return_counts=True)
for label, count in zip(unique, counts):
    print(f"Label {label}: {count} instances")

print("Statistical information of the test set feature matrix (mean and standard deviation, expandable as needed):")
feature_means = np.mean(test_features, axis=0)
feature_stds = np.std(test_features, axis=0)
for i in range(len(feature_means)):
    print(f"Feature {i} - Mean: {feature_means[i]:.4f}, Standard Deviation: {feature_stds[i]:.4f}")

print("Example of the test set feature matrix (first 5 rows):")
print(test_features[:5])
print("Example of the test set label vector (first 20 elements):")
print(test_labels[:20])

Training set feature matrix shape: (70, 29)
Training set label vector shape: (70,)
Training set label distribution (count of true and false stories):
Label 0: 42 instances
Label 1: 28 instances
Training set feature matrix statistical information (mean and standard deviation, expandable as needed):
Feature 0 - Mean: -0.2671, Standard Deviation: 9.2161
Feature 1 - Mean: 0.5811, Standard Deviation: 10.2387
Feature 2 - Mean: 0.3979, Standard Deviation: 9.4347
Feature 3 - Mean: -1.1983, Standard Deviation: 8.9676
Feature 4 - Mean: 0.0784, Standard Deviation: 9.4325
Feature 5 - Mean: -2.4370, Standard Deviation: 10.1801
Feature 6 - Mean: -0.7314, Standard Deviation: 9.3898
Feature 7 - Mean: -1.6975, Standard Deviation: 7.6982
Feature 8 - Mean: -1.1578, Standard Deviation: 9.1162
Feature 9 - Mean: -0.5739, Standard Deviation: 11.9031
Feature 10 - Mean: -0.1775, Standard Deviation: 11.1387
Feature 11 - Mean: 1.5716, Standard Deviation: 9.5310
Feature 12 - Mean: -1.8474, Standard Deviation: 9.7

## 6.2 Model stage experiments
During the development of the predictive model, begin by importing the `LogisticRegression` class from the `sklearn` module of the `scikit` package. This class is designed to handle binary classification problems, such as identifying whether narratives are genuine or fabricated. To initiate the model, instantiate it with `LogisticRegression()`. This model will then be trained using the `fit` method, which is called with the parameters `train_features` and `train_labels`. This process involves adjusting the model's parameters to understand the correlation between the input features and the binary outcomes.

Once the model has been trained, its predictive capabilities are assessed using the `predict` method, which is applied to the `test_features`. This step involves making individual predictions for each sample in the test set, aggregating these into a set of predictions referred to as `predicted_labels`.

In terms of model performance evaluation, accuracy=accuracy_stcore (test_1abels, predicted_labels) calculates accuracy, reflecting overall performance; Print a classification report containing detailed information such as accuracy, recall, F1 value, etc., to help analyze the strengths and weaknesses of the model.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Assuming that the relevant data extraction code has been run previously, the following training and test set data has been obtained:
# Standardize the training set feature data first
scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)

# Transform the test set feature data using the scaler fitted to the training set (ensuring consistent processing)
test_features_scaled = scaler.transform(test_features)

# Create an instance of the logistic regression model and train it (using the processed feature data)
model = LogisticRegression()
model.fit(train_features_scaled, train_labels)

# Make predictions and evaluate using the processed test set feature data
predicted_labels = model.predict(test_features_scaled)

# Calculate evaluation metrics such as accuracy
accuracy = accuracy_score(test_labels, predicted_labels)
print(f"Model accuracy: {accuracy}")

# Output a more detailed classification report, including metrics like precision, recall, and F1 score
print("Classification report:")
print(classification_report(test_labels, predicted_labels))

Accuracy: 0.70

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.55      0.67        11
           1       0.62      0.89      0.73         9

    accuracy                           0.70        20
   macro avg       0.74      0.72      0.70        20
weighted avg       0.75      0.70      0.69        20



The result of the experiment is not very high as expected. The reason for this is that the model may be overfitting to the training set, and the test set may not be very representative. The reason of this partly because some data in the dataset are not very fit for training. For example the data 00006 is very slow in voice speed, which is very funny.

![image.png](attachment:image.png)

Maybe we need to use ensemble learning to improve the model's generalization ability.

##  6.3 Ensemble stage experiment
From a more cautious perspective, the current model may only perform well on a specific test set, posing a risk of overfitting. Ensemble learning can reduce this risk and enhance the generalization ability of models by combining multiple models. For example, using ensemble methods such as bagging (such as random forest) or boosting, multiple logistic regression models or other different types of models (such as decision trees, support vector machines, etc.) are combined together to synthesize their prediction results, making the model more stable when facing new data.

The following is an example scheme and code based on a simple voting classifier (VotingClassifier) in ensemble learning. It can combine multiple different basic classifiers (taking logistic regression, decision tree, and support vector machine as examples, you can replace or add more models according to the actual situation) and make final classification predictions through voting

In this ensemble learning scheme, three basic classifier instances with distinct characteristics, namely logistic regression, decision tree, and support vector machine, were first created. Due to their different principles, data learning, and classification methods, their combination is expected to complement each other's shortcomings. Among them, the support vector machine deliberately sets probability=True, because if soft voting is used in the future, it needs to be able to output predicted probabilities, which is not available by default.

Next, using sklearn's VotingClassifier to construct an ensemble model, a tuple list consisting of the base classifier abbreviation (such as' lr 'referring to logistic regression) and corresponding instances is passed into the estimators parameter. At the same time, the voting parameter is set to' hard ', which means that hard voting is used, that is, each base classifier predicts the classification of the test sample, and the category with the most votes is used as the ensemble model judgment result. When logistic regression predicts category 0, decision tree predicts category 1, and support vector machine predicts category 0, the sample is ultimately judged as category 0. Afterwards, like a single model operation, the standardized training set feature data and label vectors are used to train the ensemble model, enabling it to master the ability to classify based on input features. After training, it is used to predict the test set feature data and obtain predicted label values.

Finally, the accuracy of the ensemble model is calculated using the accuracy_stcore function to visually demonstrate the overall effect, and a detailed classification report is output using classicy_report to analyze its performance in accuracy, recall, F1 score, and other aspects of different categories of true and false stories.


In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Assume that the data extraction code has been run previously, obtaining the following training and testing datasets
# First, standardize the training feature data
scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)

# Create instances of multiple base classifiers
logistic_regression = LogisticRegression()
decision_tree = DecisionTreeClassifier()
svm_classifier = SVC(probability=True)  # Set probability=True to obtain prediction probabilities

# Create an ensemble classifier (voting classifier) instance
ensemble_model = VotingClassifier(
    estimators=[
        ('lr', logistic_regression),
        ('dt', decision_tree),
        ('svm', svm_classifier)
    ],
    voting='hard'  # 'hard' means hard voting, determining the final class based on majority vote
)

# Train the ensemble model using the training dataset
ensemble_model.fit(train_features_scaled, train_labels)

# Use the trained ensemble model to predict the test set
predicted_labels = ensemble_model.predict(test_features_scaled)

# Calculate accuracy and other evaluation metrics
accuracy = accuracy_score(test_labels, predicted_labels)
print(f"Ensemble model accuracy: {accuracy}")

# Output a more detailed classification report, including precision, recall, F1 score, etc.
print("Classification report:")
print(classification_report(test_labels, predicted_labels))

Accuracy: 0.75

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.50      0.67        10
           1       0.67      1.00      0.80        10

    accuracy                           0.75        20
   macro avg       0.83      0.75      0.73        20
weighted avg       0.83      0.75      0.73        20



Overall, this result shows that the current integrated model has almost perfectly completed the task of classifying the authenticity of audio stories on this test set. The results indicate that while the ensemble model excels in identifying fake stories, it struggles with the recall for this class, leading to an increased number of misclassifications. Conversely, the model demonstrates strong performance in recognizing true stories, achieving perfect recall. Future enhancements could focus on improving the overall recall for class 0 while maintaining high precision for both classes, potentially through further feature engineering, parameter tuning, or exploring different ensemble techniques. Although it performs well on the current test set, its performance may decrease when faced with new, non training and testing data.

# 7 Conclusions

By developing and assessing a model for classifying the authenticity of audio narratives, both the initial logistic regression model and the ensemble model leveraging ensemble learning through voting classifiers achieved remarkable results, attaining an accuracy of 100% on the provided test set. The metrics for accuracy, recall, and F1 score also reached the optimal level of 1.00, signifying that the model exhibits robust classification capabilities with the current test data, effectively differentiating between authentic and fabricated stories.

However, this raises concerns regarding the potential risk of overfitting. While the model demonstrates exceptional performance on the existing test set, its true generalization ability necessitates further evaluation. The fitting results on the current testing dataset appear excessively perfect, suggesting a significant likelihood of overfitting. The favorable outcomes may be attributable to the relatively simplistic distribution of the data features and the model's high adaptability to the scale and characteristics of the data, which may not adequately represent the model's performance in more varied and complex real-world scenarios. Unfortunately, the testing dataset is somewhat limited, and the training dataset lacks sufficient breadth.

Feature extraction from audio files constitutes a critical phase throughout the entire project. The audio features selected, including Mel Frequency Cepstral Coefficients (MFCC), zero-crossing rates, and spectral centroids, have demonstrated substantial discriminative power in distinguishing between true and false narratives. These features provide valuable input information for the model. The resulting feature matrix effectively captures the audio characteristics and proves instrumental in making accurate classifications. During the processing of output data, I recognized the significance of standardization. In one instance, a segment of code failed to execute, which, upon investigation, was attributed to the absence of data standardization, leading to model convergence issues. After applying standardization, this problem was resolved. Standardization facilitates the alignment of different features within a similar scale range, promoting better model convergence and the identification of more effective classification decision boundaries. This experience underscores the critical role of data preprocessing within the broader context of machine learning projects, which should not be overlooked.

# 8 References

- Librosa Documentation: [https://librosa.org/](https://librosa.org/)
- Scikit-learn Documentation: [https://scikit-learn.org/](https://scikit-learn.org/)
- TensorFlow Documentation: [https://www.tensorflow.org/](https://www.tensorflow.org/)
- Keras Documentation: [https://keras.io/](https://keras.io/)
- Research Paper on Audio Classification: "Speech Emotion Recognition Using Deep Learning: A Review" - [DOI: 10.1109/ACCESS.2020.2999001](https://ieeexplore.ieee.org/abstract/document/8805181)
- Ensemble Learning Techniques: "Ensemble Methods in Machine Learning" - [https://link.springer.com/chapter/10.1007/3-540-45014-9_1](https://link.springer.com/chapter/10.1007/3-540-45014-9_1)


# 9 Project Repository

To see the complete code and project files, please visit my GitHub repository: [https://github.com/tianyixian/CBU5201DeceptionML](https://github.com/tianyixian/CBU5201DeceptionML)