# Predicting Deceptive Stories: A Machine Learning Mini Project

# Table of Contents

- **1 Author**

- **2 Problem formulation**

- **3 Methodology**

- **4 Implemented ML prediction pipelines**

- **5 Dataset**

- **6 Experiments and Results**

- **7 Conclusions**

- **8 References**

# 1 Author

**Student Name**:  Tabitha Carnell

**Student ID**: 240945420


# 2 Problem formulation

The objective of this mini-project is to address the problem: **Whether a narrative story is 'true' or 'deceptive'**. Specifically, the machine learning model will be built to take a 30-second narrative audio recording (input) and predict whether the story is true or not (output).

Research on detecting deception in audio demonstrates how understanding speech patterns will improve insights into human communication (Hirschberg et al., 2005; Fernandes & Ullah, 2021; DePaulo et al., 1982). This is useful in fields such as forensics and security, for example in suspect interrogations, as it helps to identify inconsistencies or signs of deception.

This project will focus on building and testing predictive models to address this problem, while acknowledging the constraints of data quality and size, which will limit the extent to which the problem can be explored.










# 3 Methodology


**Overview**

In this project, 30-second audio recordings will be used to build machine learning models that classify narrative stories as either true or deceptive. The project involves preprocessing the raw audio data, extracting significant features and working with two distinct datasets built from different feature sets. Each dataset will be used to train and validate a number of prediction models, and their performance will be evaluated using classification metrics. Lastly, the best-performing dataset and model will be tested on a separate split of the data (test) to estimate its accuracy on unseen recordings.

**Training Task**

To build a reliable model, each classifier will be trained on the extracted and standardised features from 70% of the data. This split was chosen because the dataset is relatively small (100 recordings), so a larger training proportion gives the model more examples from which to capture meaningful patterns and reduces the risk of underfitting.

**Validation Task and Evaluation Metrics**

The validation task will assess each model's performance on data not seen during training and guide model selection and tuning. Validation will follow the validation set approach, whereby the data is split randomly; here 15% of the data is held out for validation, leaving an equal 15% for the final test evaluation.

The models will be assessed using the chosen evaluation metrics including: accuracy, the confusion matrix, and the F1-score. Accuracy provides a measure of overall performance, yet can fail to capture, for example, how well the model differentiates between classes. The confusion matrix and F1-score help by highlighting specific errors and balancing precision and recall, making them essential for assessing how reliably the model can distinguish between true and deceptive stories (Zhou et al., 2021).
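The three metrics named above can be sketched on a small made-up example. The labels below are hypothetical (they are not project results), and the integer encoding `1 = true story, 0 = deceptive story` is an assumption chosen only for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical validation labels (assumed encoding: 1 = true story, 0 = deceptive story)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1])

# Overall fraction of correct predictions
accuracy = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes, both in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# One F1-score per class, treating each class in turn as the "positive" one
f1_true = f1_score(y_true, y_pred, pos_label=1)
f1_deceptive = f1_score(y_true, y_pred, pos_label=0)

print("Accuracy:", accuracy)
print("Confusion matrix:\n", cm)
print("F1 (true):", round(f1_true, 2), "F1 (deceptive):", round(f1_deceptive, 2))
```

In this toy case the accuracy is 0.625, yet the per-class F1-scores differ (about 0.57 for true stories and 0.67 for deceptive ones), which is the kind of imbalance accuracy alone would hide.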

**Test Task**

The test task is the most important task of the three for evaluating the model's deployment quality in real-world scenarios. This project allocates 15% of the dataset for testing, against the data for training (70%) and validation (15%). The test dataset remains separate to prevent bias, creating an unbiased estimate of how reliably the model can distinguish between true and deceptive stories when deployed.

# 4 Implemented ML prediction pipelines

This section outlines the machine learning prediction pipelines designed to classify narrative stories as true or deceptive. The prediction pipeline is a sequence of operations beginning with raw audio files as input, which undergo transformations in the Transformation stage (4.1), such as feature extraction and scaling, to produce feature matrices. These are used in the Model stage (4.2), where multiple classification models take the feature matrices and generate predicted labels (true or deceptive) as output.

While not implemented in this project, the benefits of the Ensemble stage (4.3) will be discussed, such as how combining model predictions can enhance reliability and overall performance.

Each stage in the pipeline contributes to the overall prediction quality of this project and more detailed descriptions of the stages are provided in the following subsections.

## 4.1 Transformation stage

The Transformation stage prepares the raw audio data for input by converting it into numerical features and ensuring it is suitable for training.

**Feature Extraction**

Feature extraction will transform the 30-second raw audio recordings into numerical feature matrices. Each row of the matrix represents an individual recording and each column will correspond to a selected feature. Two feature sets were created:

Dataset A will focus on spectral and temporal features, capturing the overall shape, energy and frequency distribution of the audio signal to analyse speech characteristics linked to deception.

- [Mean of MFCCs (Mel-Frequency Cepstral Coefficients)](https://librosa.org/doc/main/generated/librosa.feature.mfcc.html) - summarises the overall sound of the audio by describing the shape of its short-term spectrum, which relates to how the voice is expressed in terms of vocal tone (timbre) and energy (Davis and Mermelstein, 1980; Gupta et al. 2013). MFCCs can help identify specific speech patterns such as changes in vocal emphasis, which might suggest whether a story is true or deceptive.

- [Zero-Crossing Rate](https://librosa.org/doc/main/generated/librosa.feature.zero_crossing_rate.html) - Tracks how often the audio signal crosses from positive to negative or vice versa (Badhe et al., 2019). A higher rate can suggest uneven speech, which might reflect nervousness often seen in deceptive stories.

- [Root Mean Square Energy](https://librosa.org/doc/0.10.2/generated/librosa.feature.rms.html) - represents the overall loudness or intensity of the audio (Badhe et al., 2019). Variations in energy might reflect changes in emotional emphasis, which could differ between true and deceptive stories.

- [Spectral Bandwidth](https://librosa.org/doc/main/generated/librosa.feature.spectral_bandwidth.html) - looks at the range of frequencies in the audio (Librosa). A wider range can show genuine speech seen in true stories, while a narrower range could indicate less emotional engagement, which may appear in deceptive speech.

Dataset B will focus on tonal and rhythmic features, capturing pitch, rhythm and voice clarity to analyse emotional patterns and variations linked to deception.

- [Pitch](https://librosa.org/doc/main/generated/librosa.piptrack.html) - measures how high or low a voice sounds. Changes in pitch can reflect emotions or stress; for example, an inconsistent pitch might highlight nervousness seen in deceptive speech, while a steady pitch might indicate a truthful story.

- [Tempo](https://librosa.org/doc/main/generated/librosa.feature.tempo.html) - estimated in beats per minute from the audio's onset pattern and used here as a proxy for how fast someone is speaking. A faster speech rate might suggest nervousness, which could be linked to deception. In contrast, a steady tempo often reflects calmness seen in truthful speech.

- [Spectral Contrast](https://librosa.org/doc/main/generated/librosa.feature.spectral_contrast.html) - captures the difference between loud and soft parts of the audio (Librosa). Clear and sharp contrasts may suggest a truthful story, while flatter contrasts might show a less engaging delivery seen in deceptive speech.

- Power - measures how strong or energetic the voice is. A more powerful voice can reflect confidence seen in truthful storytelling. In contrast, lower power might suggest a lack of confidence, which may indicate deception.

These features were selected based on their relevance to distinguishing between true and deceptive stories. The features were split into two distinct sets to make it easier to compare model performance, reduce complexity and streamline testing.
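To make two of the Dataset A features more concrete, the sketch below computes a zero-crossing rate and an RMS energy by hand with NumPy on synthetic sine waves. It is a simplified whole-signal version written for illustration; `librosa` computes these quantities frame by frame, and the helper names `zero_crossing_rate` and `rms_energy` are hypothetical, not part of the project code.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose sign differs (whole-signal simplification)."""
    signs = np.signbit(signal)
    return np.mean(signs[1:] != signs[:-1])

def rms_energy(signal):
    """Root mean square amplitude of the whole signal."""
    return np.sqrt(np.mean(signal ** 2))

sr = 8000                      # assumed sample rate for the synthetic example
t = np.arange(sr) / sr         # one second of time stamps
low = 0.5 * np.sin(2 * np.pi * 100 * t)    # 100 Hz tone
high = 0.5 * np.sin(2 * np.pi * 1000 * t)  # 1000 Hz tone, same amplitude

zcr_low, zcr_high = zero_crossing_rate(low), zero_crossing_rate(high)
rms_low, rms_quiet = rms_energy(low), rms_energy(0.2 * low)

print("ZCR 100 Hz:", zcr_low, "ZCR 1000 Hz:", zcr_high)
print("RMS full volume:", rms_low, "RMS at 20% volume:", rms_quiet)
```

The higher-frequency tone crosses zero far more often, while turning the volume down changes the RMS energy but leaves the zero-crossing rate untouched, so the two features capture different aspects of the signal.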

**Dimensionality Reduction**

No dimensionality reduction techniques (such as PCA) were applied in this project due to the time frame. Instead, the dimensionality of the data was managed through manual feature selection: focusing only on features that prior research suggests are likely to provide meaningful insights helps the datasets avoid redundancy and remain efficient.
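For reference, a minimal sketch of what a PCA step could look like is given below. It was not part of this project; the data are synthetic, with two of the four hypothetical features built as near-duplicates of the other two so that the redundancy is visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))

# Four hypothetical features: columns 2 and 3 are noisy copies of columns 0 and 1
X_demo = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.05 * rng.normal(size=100),
    base[:, 1] + 0.05 * rng.normal(size=100),
])

# PCA is sensitive to scale, so standardise first
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components keeps as many components as needed to explain that share of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo_scaled)

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

With redundant features like these, two components are enough to retain at least 95% of the variance; whether the real feature sets would compress similarly is not something this sketch can show.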

**Scaling**

The extracted features will be transformed using `StandardScaler` so that each feature has a mean of 0 and a standard deviation of 1. This step ensures that all features are on the same scale, which prevents those with larger ranges from dominating the training process. This allows for fair contributions from all features and improves the stability and performance of the models during training.
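The scaling step can be sketched as follows on synthetic data. The two hypothetical features are deliberately on very different scales, and the scaler is fitted on the training portion only so that no information from the validation data leaks into the statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (values are made up for illustration)
X_train_demo = np.column_stack([rng.normal(0.1, 0.02, 70), rng.normal(2000, 300, 70)])
X_val_demo = np.column_stack([rng.normal(0.1, 0.02, 15), rng.normal(2000, 300, 15)])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # mean and std learned from training data only
X_val_scaled = scaler_demo.transform(X_val_demo)          # the same statistics are reused, no refitting

print("Train means:", X_train_scaled.mean(axis=0).round(6))
print("Train stds: ", X_train_scaled.std(axis=0).round(6))
```

The validation columns will not have exactly zero mean and unit variance, because they are standardised with the training statistics; that is expected and mirrors how unseen data would be treated.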

## 4.2 Model stage

The Model stage involves applying machine learning models to the transformed feature matrices (input) to classify narrative stories as either true or deceptive. The models chosen for this project are as follows:

**Support Vector Machines (SVM)**

SVMs are useful for binary classification, making this model suitable for distinguishing between true and deceptive narratives. They handle small datasets well by relying on support vectors, which reduces overfitting. However, the limited interpretability of this model makes it difficult to understand which features contribute most to classifying stories as true or deceptive (Cervantes et al. 2020).

**Logistic Regression**

Logistic regression is another model which will be used in this project as it is widely used for binary classification. Its simplicity makes it a good baseline model for evaluating the likelihood of a story being true or deceptive. However, it is limited to exploring linear relationships, which may not capture more meaningful patterns within the data.

**Decision Tree**

Decision trees are interpretable models that split data into regions based on feature thresholds, which makes them suitable for identifying patterns in deceptive and true stories in this project. They also handle complex, non-linear relationships well (unlike logistic regression). However, the model can overfit without tuning. To address this, hyperparameters like maximum depth and minimum samples per leaf will be adjusted to improve generalisation and performance on unseen data (Gomes Mantovani et al., 2024).

These models were selected to provide a balance between complexity, interpretability and performance. By comparing their results, the most effective approach will be identified using the chosen evaluation metrics and the chosen model will be used for testing.
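As an illustration of the decision-tree tuning described above, the sketch below loops over a few values of `max_depth` and `min_samples_leaf` and records training and validation accuracy for each. The data come from `make_classification` as a synthetic stand-in for a 100-sample, 4-feature matrix, and the candidate values are assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the project data)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

results = {}
for max_depth in (2, 3, 5, None):          # None lets the tree grow until leaves are pure
    for min_samples_leaf in (1, 3, 5):
        tree = DecisionTreeClassifier(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=42
        )
        tree.fit(Xd_train, yd_train)
        results[(max_depth, min_samples_leaf)] = (
            accuracy_score(yd_train, tree.predict(Xd_train)),
            accuracy_score(yd_val, tree.predict(Xd_val)),
        )

# Pick the setting with the highest validation accuracy
best_params = max(results, key=lambda k: results[k][1])
print("Best (max_depth, min_samples_leaf):", best_params, "->", results[best_params])
```

The unconstrained tree fits the training split perfectly, which is the overfitting pattern that limiting depth and leaf size is meant to curb; a large gap between the two accuracies for a setting is the signal to look for.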

## 4.3 Ensemble stage

The Ensemble stage was not implemented in this project due to time constraints, yet it could have improved classification performance by combining the strengths of the chosen models. Random forests, for example, could have been used where the process of building multiple decision trees and aggregating their predictions would enhance robustness by reducing overfitting. Alternatively, techniques like voting ensembles (majority voting across SVM, Logistic Regression and Decision Tree) or stacking ensembles (using a meta-classifier to combine predictions) could have further improved accuracy.

These methods mitigate individual model weaknesses and leverage their complementary strengths, which could offer improved reliability of predictions for distinguishing true and deceptive stories.
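A hard-voting ensemble of the three model types could be sketched as below. This was not implemented in the project; the data are synthetic, the hyperparameters are placeholders, and the scaling pipelines around the SVM and logistic regression are an assumption made so that the example is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the project data)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Each model casts one vote per sample; the majority class wins
voter = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(C=1))),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression())),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ],
    voting="hard",
)
voter.fit(Xd_train, yd_train)

val_predictions = voter.predict(Xd_val)
val_accuracy = voter.score(Xd_val, yd_val)
print("Ensemble validation accuracy:", val_accuracy)
```

With three voters there is never a tie in a binary problem, which is one practical reason to combine an odd number of models.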

# 5 Dataset

**Dataset Description**

The dataset that will be used to assess the machine learning problem defined above is the `MLEndDD_stories_small` dataset, which was downloaded from GitHub:

https://github.com/MLEndDatasets/Deception/tree/main/MLEndDD_stories_small

The original dataset was initially created with 300 samples and, for the purpose of this project, was condensed to 100. Each audio recording is 2 minutes or longer. The first 30 seconds of each recording will be used in place of the full recording to directly address the machine learning problem.

Each audio recording is linked to a unique identifier and contains a narrative story (either in English or another language); 50 are true and 50 are deceptive. This metadata is held in the corresponding `MLEndDD_story_attributes_small.csv` file, also downloaded from GitHub:

https://github.com/MLEndDatasets/Deception/blob/main/MLEndDD_story_attributes_small.csv

This dataset is quite relevant to the project as it directly addresses the task of classifying narrative stories as true or deceptive, offering a clear foundation for training and evaluating machine learning models.

**Sub-Datasets**

Two sub-datasets (A and B) will be created which will each focus on different feature extractions to explore the effectiveness of various audio features in predicting whether a narrative story is true or deceptive.

The first sub-dataset (A) will include features such as the mean of the MFCCs (Mel-Frequency Cepstral Coefficients), Zero-Crossing Rate, RMS Energy and Spectral Bandwidth. These  are commonly used in audio analysis to capture the spectral and temporal characteristics of the audio recordings.

The second sub-dataset (B) will focus on features like Pitch (mean), Tempo, Spectral Contrast and Power, which provide insights into the tonal and rhythmic properties of the audio recordings.

Both sub-datasets will be used for training and validation to compare how these different sets of features contribute to model performance. After feature extraction, they will both be split into training, validation and testing sets and standardised.

**Limitations**

While the dataset is seen to provide a foundation for exploring the classification of true and deceptive stories, several limitations can be considered which could impact the generalisability of the models developed in this project:

- Small dataset: The dataset contains only 100 samples which limits the amount of data available for splits and increases the risk of overfitting.

- Limited data scope: Only the beginning 30-second clip will be used from each audio recording, potentially missing key contexts found later in the recordings.

- Language differences: The dataset includes multiple languages, but differences in mannerisms and vocal patterns across languages won't be considered when training due to the absence of the language label during modelling, which may affect accuracy.

- Background noise: Audio files will not be preprocessed to remove background noise, which may affect feature quality.

- Training and validation data: The same dataset will be used for training and validation splits rather than using entirely separate datasets, which could affect generalisability.

Addressing these limitations in future work could significantly improve the accuracy and applicability of the models before deployment.

## 5.1 MLEnd Deception Dataset

This section focuses on downloading and preparing the audio files and their associated metadata CSV for modeling, ensuring the data is ready for subsequent analysis and feature extraction.

### 5.1.1 Download Dataset

The relevant libraries are installed and the MLEnd deception dataset and CSV file are downloaded and stored in Google Drive for further processing.

In [None]:
#install library
!pip install mlend==1.0.0.4

In [None]:
#import libraries
from google.colab import drive

import numpy as np
import pandas as pd

import os, sys, re, pickle, glob
import urllib.request

import IPython.display as ipd
from tqdm import tqdm
import librosa

drive.mount('/content/drive')

The dataset is downloaded and stored in the `'MLEnd'` directory with the following structure:

- **MLEnd/deception/MLEndDD_stories_small/**:
   - Contains 100 narrative audio files named `00001.wav` to `00100.wav`.

- **MLEnd/deception/MLEndDD_story_attributes_small.csv**:
  - A CSV file containing metadata about the audio files.

In [None]:
#import library and functions
import mlend
from mlend import download_deception_small, deception_small_load

In [None]:
#lists files in the dataset folder
path = '/content/drive/MyDrive/Data/MLEnd'
print(os.listdir(path))

In [None]:
datadir = download_deception_small(save_to=path, subset={}, verbose=1, overwrite=False) #download small data

### 5.1.2 Audio Files (`MLEndDD_stories_small`)

The `MLEndDD_stories_small` dataset contains 100 audio files, each representing a narrated story. The files are:

- Named numerically (e.g., 00001.wav, 00002.wav)
- Roughly 2 minutes long (will be cut into 30-second audios)
- Associated with metadata extracted from the CSV file

These audio files (once cut) serve as the primary input for analysis and model training as they hold the raw data. The file paths are stored in **X_paths** within the TrainSet, allowing easy access for preprocessing and model development. The labels are also provided in raw string format **(Y)** and numeric format **(Y_encoded)** and are essential for model training and performance evaluation.

In [None]:
TrainSet, TestSet, MAPs = deception_small_load(datadir_main=datadir, train_test_split=None, verbose=1, encode_labels=True)
#read file paths

In [None]:
#X_paths are the paths to the audio files stored locally where each entry points to a .wav file in the dataset
print("Sample File Paths (X_paths):")
print(TrainSet['X_paths'][:5])

#Y are the original string labels for the stories indicating whether a story is a 'true_story' or 'deceptive_story'
print("\nSample Raw Labels (Y):")
print(TrainSet['Y'][:5])

#Y_encoded are the encoded numeric labels for the stories: 0 represents a 'true_story' and 1 represents a 'deceptive_story'
print("\nSample Numeric Labels (Y_encoded):")
print(TrainSet['Y_encoded'][:5])

Five random audio files can be played below, helping to understand their format and content for analysis and modelling.

In [None]:
#assign X_paths to a variable 'files'
files = TrainSet['X_paths']

#display 5 random audio files
print('\n 5 Audio Files at Random')
for _ in range(5):
  n = np.random.randint(len(files))
  display(ipd.Audio(files[n]))

The audio files are cut into 30-second clips (by using the first 30 seconds of each recording), which will be the input to the machine learning model. This directly addresses the stated problem of using 30-second audio recordings as the input.
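The clipping described above amounts to keeping the first 30 × sample-rate samples of each recording. The NumPy sketch below shows that arithmetic on placeholder arrays; the helper name `first_seconds` and the 8000 Hz sample rate are assumptions for illustration and are not part of the project code.

```python
import numpy as np

def first_seconds(signal, sr, seconds=30):
    """Return the first `seconds` of a 1-D signal sampled at `sr` Hz."""
    return signal[: int(seconds * sr)]

sr_demo = 8000                               # assumed sample rate
long_recording = np.zeros(45 * sr_demo)      # placeholder for a 45-second recording
short_recording = np.zeros(20 * sr_demo)     # placeholder for a 20-second recording

clip = first_seconds(long_recording, sr_demo)
short_clip = first_seconds(short_recording, sr_demo)

print("Clip length (s):", len(clip) / sr_demo)
print("Short clip length (s):", len(short_clip) / sr_demo)
```

A recording shorter than 30 seconds is returned unchanged, so clips are not guaranteed to be exactly 30 seconds long unless every source file is at least that long.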



In [None]:
!pip install pydub

In [None]:
from pydub import AudioSegment

#folder paths
input_folder = '/content/drive/MyDrive/Data/MLEnd/deception/MLEndDD_stories_small/'
output_folder = '/content/drive/MyDrive/Data/MLEnd/deception/cut_audio/'

#create the output directory if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

#iterate over all audio files in the input folder
for filename in os.listdir(input_folder):
  if filename.endswith('.wav'):
    filepath = os.path.join(input_folder, filename)

    #load the audio file
    audio = AudioSegment.from_wav(filepath)

    #extract the first 30 seconds
    clip = audio[:30000]

    #save clip
    output_filename = f"{os.path.splitext(filename)[0]}.wav"
    output_path = os.path.join(output_folder, output_filename)
    clip.export(output_path, format='wav')

#folder containing the new 30-second audio clips
cut_audio_folder = '/content/drive/MyDrive/Data/MLEnd/deception/cut_audio/'

#update files as the new 30-second audios and sort them numerically
files = sorted(
    [os.path.join(cut_audio_folder, f) for f in os.listdir(cut_audio_folder) if f.endswith('.wav')],
    key=lambda x: int(x.split('/')[-1].split('.')[0]))

#update 'X_paths' (all audio files are stored here)
TrainSet['X_paths'] = files

#display 5 random audio files to confirm 30-second clips
print('\n5 Audio Files at Random:')
for _ in range(5):
    n = np.random.randint(len(files))
    display(ipd.Audio(files[n]))

This ensures the input audio files, that will be later split, are the cut audios and not the original files.

In [None]:
#print first 5 file paths (to ensure the audios correlate to the 30-second not the full clip)
print(TrainSet['X_paths'][:5])

00001.wav is displayed for both the original and cut audio to confirm they match up to the same speaker and content.

In [None]:
from IPython.display import Audio

#display the original audio
print("Original Audio (Full):")
display(Audio('/content/drive/MyDrive/Data/MLEnd/deception/MLEndDD_stories_small/00001.wav'))

#display the 30-second audio
print("\n30-Second Audio:")
display(Audio('/content/drive/MyDrive/Data/MLEnd/deception/cut_audio/00001.wav'))

### 5.1.3 CSV file (`MLEndDD_story_attributes_small.csv`)

The CSV file is loaded into the **MLEnd_df** DataFrame, which provides metadata about the 100 audio files, each with 3 attributes:
- an audio file name
- the language it is spoken in
- a binary label (true/deceptive).

This transformation makes it easier to link filenames with their metadata (via index), allowing for simpler analysis and use alongside the separately stored audio files.

In [None]:
#load the CSV file into a dataframe and set filename as index
MLEnd_df = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/deception/MLEndDD_story_attributes_small.csv').set_index('filename')

#display the dataframe
print('\nMLEnd Story Attributes:')
display(MLEnd_df)

The audio WAV file names (after being cut into 30-second clips) are printed below and are seen to correspond to the filename entries in the `MLEnd_df` DataFrame, which confirms accurate mapping for the subsequent analysis.

In [None]:
#display WAV file names from cut_audio
for file in files:
  print(file.split('/')[-1])

There are 100 stories in both `MLEnd_df` and the `cut_audio` folder, with an equal number of deceptive and true stories.

In [None]:
print("\nTotal number of stories (attributes):", len(MLEnd_df))
print("\nTotal number of stories (.wav files):", len(files))

#check distribution (count) of true and deceptive stories
print("\nStory Type Distribution:")
print(MLEnd_df['Story_type'].value_counts())

## 5.2 Dataset A

This dataset was created with a focus on core **spectral** and **temporal** features and includes: MFCC (Mean), Zero-Crossing Rate, Root Mean Square Energy and Spectral Bandwidth. These features will be extracted, visualised and split in this section.

### 5.2.1 Feature Extraction

In [None]:
def dataset_a(files, labels_file, scale_audio=False):

    X, y = [], []

    for file in tqdm(files, desc="Extracting features for Dataset A"):

        filename = file.split('/')[-1]

        yi = labels_file.loc[filename]['Story_type'] == 'true_story'

        y_audio, sr = librosa.load(file, sr=None)

        if scale_audio:
            y_audio = y_audio / np.max(np.abs(y_audio))

        mfcc = np.mean(librosa.feature.mfcc(y=y_audio, sr=sr, n_mfcc=13)) #mfcc
        zcr = np.mean(librosa.feature.zero_crossing_rate(y=y_audio))  # Zero-Crossing Rate
        rms = np.mean(librosa.feature.rms(y=y_audio))  # RMS Energy
        spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=y_audio, sr=sr))  # Spectral Bandwidth

        xi = [zcr, rms, spectral_bandwidth, mfcc]

        X.append(xi)
        y.append(yi)

    return np.array(X), np.array(y)

# Generate the feature matrix (X) and labels (y)
X, y = dataset_a(files, labels_file=MLEnd_df, scale_audio=True)

### 5.2.2 Feature Visualisations

The distributions and relationships between features of dataset A are visualised using histograms and heatmaps.


In [None]:
import matplotlib.pyplot as plt

#create dataframe with features and labels
#column order must match xi in dataset_a: [zcr, rms, spectral_bandwidth, mfcc]
features = ['Zero-Crossing Rate', 'RMS Energy', 'Spectral Bandwidth', 'MFCC Mean']
df = pd.DataFrame(X, columns=features)
df['Label'] = y

#plot histograms for each feature and the label
df.hist(column=features + ['Label'], bins=10, figsize=(12, 10), alpha=0.7)
plt.suptitle('Dataset A Feature Distributions', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
import seaborn as sns

#plot the correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df[features].corr(), annot=True, fmt=".2f", cmap="coolwarm", cbar=True, square=True)
plt.title('Dataset A Feature Correlation Heatmap', fontsize=16)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The strongest relationship from the heatmap is between Zero-Crossing Rate and Spectral Bandwidth (0.63), suggesting that more frequent signal changes tend to occur in audio with a wider range of frequencies, potentially reflecting dynamic or varied speech patterns.
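Rather than reading the strongest pair off the heatmap by eye, it can be extracted from the correlation matrix directly. The sketch below does this on a toy DataFrame with hypothetical column names (`feat_a` to `feat_d`), where `feat_b` is constructed to correlate with `feat_a`; it is an illustration, not a reanalysis of Dataset A.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)

# Toy data with hypothetical column names
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": a + 0.5 * rng.normal(size=100),  # built to correlate with feat_a
    "feat_c": rng.normal(size=100),
    "feat_d": rng.normal(size=100),
})

corr = toy.corr()

# Keep only the upper triangle; k=1 drops the diagonal of self-correlations
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Strongest relationship by absolute value, so strong negative correlations count too
strongest_pair = pairs.abs().idxmax()
strongest_value = pairs[strongest_pair]

print("Strongest pair:", strongest_pair, "correlation:", round(strongest_value, 2))
```

Applied to a real feature DataFrame, the same lines would report the pair by its column names, so the result is only as trustworthy as the match between those names and the order of the extracted features.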

### 5.2.3 Split and Scaling

Dataset A is split into train (70%), validation (15%) and test (15%) sections and the data is standardised for modelling.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#split into train (70%), validation (15%) and test (15%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)

#scale the x's
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

#show dataset split shapes and balance

#train set
print("Train Set:", X_train.shape, y_train.shape)
print('  The number of true stories in Train Set:', np.count_nonzero(y_train))
print('  The number of deceptive stories in Train Set:', y_train.size - np.count_nonzero(y_train))

#validation set
print("Validation Set:", X_val.shape, y_val.shape)
print('  The number of true stories in Validation Set:', np.count_nonzero(y_val))
print('  The number of deceptive stories in Validation Set:', y_val.size - np.count_nonzero(y_val))

#test set
print("Test Set:", X_test.shape, y_test.shape)
print('  The number of true stories in Test Set:', np.count_nonzero(y_test))
print('  The number of deceptive stories in Test Set:', y_test.size - np.count_nonzero(y_test))

## 5.3 Dataset B

This dataset was created with a focus on **speech dynamics** and **vocal quality** with features including: Pitch (Mean), Tempo, Power and Spectral Contrast. These features will be extracted, visualised and split in this section.

### 5.3.1 Feature Extraction

In [None]:
def dataset_b(files, labels_file, scale_audio=False):

    X_b, y_b = [], []

    for file in tqdm(files, desc="Extracting features for Dataset B"):

        filename = file.split('/')[-1]

        yi = labels_file.loc[filename]['Story_type'] == 'true_story'

        y_audio, sr = librosa.load(file, sr=None)

        if scale_audio:
            y_audio = y_audio / np.max(np.abs(y_audio))

        pitches, magnitudes = librosa.piptrack(y=y_audio, sr=sr)
        pitch_mean = np.mean(pitches[pitches > 0]) if np.any(pitches > 0) else 0
        tempo = librosa.feature.tempo(y=y_audio, sr=sr)[0]
        spectral_contrast = np.mean(librosa.feature.spectral_contrast(y=y_audio, sr=sr))
        power = np.sum(y_audio**2)/len(y_audio)

        xi_b = [pitch_mean, tempo, spectral_contrast, power]

        X_b.append(xi_b)
        y_b.append(yi)

    return np.array(X_b), np.array(y_b)

# Generate the feature matrix (X) and labels (y) for Dataset B
X_b, y_b = dataset_b(files, labels_file=MLEnd_df, scale_audio=True)

### 5.3.2 Feature Visualisations

The distributions and relationships between features of dataset B are visualised using histograms and heatmaps.

In [None]:
#create dataframe with features and labels
features_b = ['Pitch Mean', 'Tempo', 'Spectral Contrast', 'Power']
df_b = pd.DataFrame(X_b, columns=features_b)
df_b['Label'] = y_b

#plot histograms for each feature and the label
df_b.hist(column=features_b + ['Label'], bins=10, figsize=(12, 10), alpha=0.7)
plt.suptitle('Dataset B Feature Distributions', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
#plot the correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df_b[features_b].corr(), annot=True, fmt=".2f", cmap="coolwarm", cbar=True, square=True)
plt.title('Dataset B Feature Correlation Heatmap', fontsize=16)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The heatmap for Dataset B shows minimal correlation between the features, with the strongest relationship being a small positive correlation (0.33) between Pitch Mean and Spectral Contrast. This suggests some shared information between these features, with the others appearing largely independent.

### 5.3.3 Split and Scaling

Dataset B is split into train (70%), validation (15%) and test (15%) sections and the data is standardised for modelling.

In [None]:
#split into train (70%), validation (15%) and test (15%)
Xb_train, Xb_temp, yb_train, yb_temp = train_test_split(X_b, y_b, test_size=0.30, random_state=42, stratify=y_b)
Xb_val, Xb_test, yb_val, yb_test = train_test_split(Xb_temp, yb_temp, test_size=0.50, random_state=42, stratify=yb_temp)

#scale the x's
scaler = StandardScaler()
Xb_train = scaler.fit_transform(Xb_train)
Xb_val = scaler.transform(Xb_val)
Xb_test = scaler.transform(Xb_test)

#show dataset split shapes and balance

#train set
print("Train Set:", Xb_train.shape, yb_train.shape)
print('  The number of true stories in Train Set:', np.count_nonzero(yb_train))
print('  The number of deceptive stories in Train Set:', yb_train.size - np.count_nonzero(yb_train))

#validation set
print("Validation Set:", Xb_val.shape, yb_val.shape)
print('  The number of true stories in Validation Set:', np.count_nonzero(yb_val))
print('  The number of deceptive stories in Validation Set:', yb_val.size - np.count_nonzero(yb_val))

#test set
print("Test Set:", Xb_test.shape, yb_test.shape)
print('  The number of true stories in Test Set:', np.count_nonzero(yb_test))
print('  The number of deceptive stories in Test Set:', yb_test.size - np.count_nonzero(yb_test))


# 6 Experiments and Results

## 6.1 Model 1: Support Vector Machines

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn import svm
from sklearn.metrics import accuracy_score

#svm model for dataset A
svm_a = svm.SVC(C=1)

#fit the model on the training set
svm_a.fit(X_train, y_train)

#predictions on training and validation
yt_svm_a = svm_a.predict(X_train)
yv_svm_a = svm_a.predict(X_val)

#svm model for dataset B
svm_b = svm.SVC(C=1)

#fit the model on the training set
svm_b.fit(Xb_train, yb_train)

#predictions on training and validation
yt_svm_b = svm_b.predict(Xb_train)
yv_svm_b = svm_b.predict(Xb_val)

#confusion matrix for dataset A (validation)
cm_a = confusion_matrix(y_val, yv_svm_a)
#labels are booleans (True = true story); confusion_matrix sorts them as [False, True]
class_labels = ['Deceptive Story', 'True Story']
disp_a = ConfusionMatrixDisplay(confusion_matrix=cm_a, display_labels=class_labels)
disp_a.plot(cmap="Blues")
plt.title("Confusion Matrix for Dataset A")
plt.show()

#f1-score extraction
report_a = classification_report(y_val, yv_svm_a, output_dict=True)
print('Dataset A F1-Scores:')
for key, values in report_a.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

print('\nDataset A Accuracy')
print('Training Accuracy:', accuracy_score(y_train, yt_svm_a))
print('Validation Accuracy:', accuracy_score(y_val, yv_svm_a))

#confusion matrix for dataset B (validation)
cm_b = confusion_matrix(yb_val, yv_svm_b)
disp_b = ConfusionMatrixDisplay(confusion_matrix=cm_b, display_labels=class_labels)
disp_b.plot(cmap="Blues")
plt.title("Confusion Matrix for Dataset B")
plt.show()

#f1-score extraction
report_b = classification_report(yb_val, yv_svm_b, output_dict=True)
print('Dataset B F1-Scores:')
for key, values in report_b.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

print('\nDataset B Accuracy')
print('Training Accuracy:', accuracy_score(yb_train, yt_svm_b))
print('Validation Accuracy:', accuracy_score(yb_val, yv_svm_b))

**Dataset A SVM Results**

The SVM model for Dataset A achieved 74% training and 67% validation accuracy, showing good generalisation with minimal overfitting. The confusion matrix shows better performance on deceptive stories, with 6 deceptive and 4 true stories correctly classified. The F1-scores of 0.62 (true) and 0.71 (deceptive) indicate fairly balanced performance across both classes.

**Dataset B SVM Results**

The SVM for Dataset B reaches 77% training accuracy but only 40% validation accuracy, which indicates significant overfitting. The confusion matrix shows poor differentiation, with only 4 deceptive and 2 true stories correctly classified. F1-scores of 0.31 (true) and 0.47 (deceptive) confirm that the model struggles on this dataset.

**Overall**

Dataset A demonstrates better feature representation under the SVM model and performs reasonably well on both accuracy and F1-scores, although there is room for improvement.
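The reported F1-scores can be cross-checked against the counts of correctly classified stories. The sketch below assumes a validation set of 15 recordings (15% of the 100-recording dataset) and relies on the fact that, in a two-class problem, every misclassification is a false negative for one class and a false positive for the other, so each class's F1-score equals `2*TP / (2*TP + errors)`. The function name `f1_from_counts` is illustrative and not part of the notebook.

```python
def f1_from_counts(true_positives, total_errors):
    """Binary-case F1 for one class: 2*TP / (2*TP + FP + FN), where FP + FN = total errors."""
    return 2 * true_positives / (2 * true_positives + total_errors)

N_VAL = 15  # assumed validation size: 15% of 100 recordings

# correctly classified validation stories reported above for each dataset
correct = {
    "A": {"true": 4, "deceptive": 6},
    "B": {"true": 2, "deceptive": 4},
}

summary = {}
for name, counts in correct.items():
    n_correct = sum(counts.values())
    errors = N_VAL - n_correct
    summary[name] = {
        "accuracy": round(n_correct / N_VAL, 2),
        "f1_true": round(f1_from_counts(counts["true"], errors), 2),
        "f1_deceptive": round(f1_from_counts(counts["deceptive"], errors), 2),
    }
    print(name, summary[name])
```

Under this assumption, the counts reproduce the reported validation accuracies and F1-scores for both datasets.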

## 6.2 Model 2: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

#logistic regression model for dataset A
model_a = LogisticRegression()
model_a.fit(X_train, y_train)

#predictions on training and validation sets
yt_reg_a = model_a.predict(X_train)
yv_reg_a = model_a.predict(X_val)

#accuracy
train_accuracy_a = accuracy_score(y_train, yt_reg_a)
val_accuracy_a = accuracy_score(y_val, yv_reg_a)

#logistic regression model for dataset B
model_b = LogisticRegression()
model_b.fit(Xb_train, yb_train)

#predictions on training and validation sets
yt_reg_b = model_b.predict(Xb_train)
yv_reg_b = model_b.predict(Xb_val)

#accuracy
train_accuracy_b = accuracy_score(yb_train, yt_reg_b)
val_accuracy_b = accuracy_score(yb_val, yv_reg_b)

#confusion matrix for dataset A (validation)
cm_a = confusion_matrix(y_val, yv_reg_a)
disp_a = ConfusionMatrixDisplay(confusion_matrix=cm_a, display_labels=class_labels)
disp_a.plot(cmap="Blues")
plt.title("Confusion Matrix for Dataset A")
plt.show()

#f1-score extraction
report_a = classification_report(y_val, yv_reg_a, output_dict=True)
print('Dataset A F1-Scores:')
for key, values in report_a.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

print('\nDataset A Accuracy')
print('Training Accuracy:', accuracy_score(y_train, yt_reg_a))
print('Validation Accuracy:', accuracy_score(y_val, yv_reg_a))

#confusion matrix for dataset B (validation)
cm_b = confusion_matrix(yb_val, yv_reg_b)
disp_b = ConfusionMatrixDisplay(confusion_matrix=cm_b, display_labels=class_labels)
disp_b.plot(cmap="Blues")
plt.title("Confusion Matrix for Dataset B")
plt.show()

#f1-score extraction
report_b = classification_report(yb_val, yv_reg_b, output_dict=True)
print('Dataset B F1-Scores:')
for key, values in report_b.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

print('\nDataset B Accuracy')
print('Training Accuracy:', accuracy_score(yb_train, yt_reg_b))
print('Validation Accuracy:', accuracy_score(yb_val, yv_reg_b));

**Dataset A Logistic Regression Results**

The Logistic Regression model for Dataset A achieved a training accuracy of 60% and a validation accuracy of 53%, showing moderate generalisation. The confusion matrix and F1-scores show better performance at predicting true stories (F1-score 0.63) than deceptive stories (F1-score 0.36), suggesting the model struggles to identify deceptive stories accurately.

**Dataset B Logistic Regression Results**

On Dataset B, Logistic Regression achieves a training accuracy of 58.6% and a validation accuracy of 46.7%, both lower than on Dataset A. The confusion matrix shows frequent misclassification of deceptive stories, which is reflected in the F1-scores (0.56 for true, 0.33 for deceptive).

**Overall**

Dataset A also demonstrates better feature representation under the logistic regression model. However, compared to the SVM model, logistic regression performs worse on both accuracy and F1-scores. This could be due to its reliance on a linear decision boundary, which may not fully capture the relationships needed for this classification problem.
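To illustrate what a linear decision boundary means, the sketch below shows that a logistic regression prediction depends only on the sign of a weighted sum of the features. The weights, bias and feature values are made-up numbers for illustration and are not taken from the fitted `model_a`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, weights, bias):
    """Weighted sum of the features plus the bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(x, weights, bias, threshold=0.5):
    """Predict class 1 when the modelled probability reaches the threshold."""
    return int(sigmoid(linear_score(x, weights, bias)) >= threshold)

# hypothetical weights for two standardised features
weights, bias = [1.5, -2.0], 0.25

samples = [[0.2, 0.1], [-1.0, 0.5], [0.0, 0.125], [2.0, 2.0]]
for x in samples:
    z = linear_score(x, weights, bias)
    # thresholding the probability at 0.5 is the same as checking z >= 0
    print(x, round(sigmoid(z), 3), predict(x, weights, bias), int(z >= 0))
```

The boundary is the line (or hyperplane) where the weighted sum equals zero, so classes that cannot be separated by such a boundary cannot be fitted well. scikit-learn's `SVC` uses an RBF kernel by default, which does not have this restriction.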

## 6.3 Model 3: Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

#decision tree model for dataset A
model_a = DecisionTreeClassifier(random_state=42)
model_a.fit(X_train, y_train)

yt_dt_a = model_a.predict(X_train)
yv_dt_a = model_a.predict(X_val)

train_accuracy_a = accuracy_score(y_train, yt_dt_a)
val_accuracy_a = accuracy_score(y_val, yv_dt_a)

print('Dataset A')
print('  Training Accuracy:', train_accuracy_a)
print('  Validation Accuracy:', val_accuracy_a)

#decision tree model for dataset B
model_b = DecisionTreeClassifier(random_state=42)
model_b.fit(Xb_train, yb_train)

yt_dt_b = model_b.predict(Xb_train)
yv_dt_b = model_b.predict(Xb_val)

train_accuracy_b = accuracy_score(yb_train, yt_dt_b)
val_accuracy_b = accuracy_score(yb_val, yv_dt_b)

print('\nDataset B')
print('  Training Accuracy:', train_accuracy_b)
print('  Validation Accuracy:', val_accuracy_b);

The decision tree shows clear overfitting, achieving perfect training accuracy (100%) but poor validation accuracy (53%). To address this, hyperparameters such as the maximum depth and the minimum number of samples per split will be tuned to reduce complexity, improve generalisation and focus the tree on meaningful patterns.
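One way to choose these hyperparameters more systematically is a cross-validated grid search on the training set. The sketch below is an assumption about how this could be done rather than the procedure used in this notebook: it uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the 70 training recordings (not the project data)
X_demo, y_demo = make_classification(n_samples=70, n_features=10, random_state=42)

# hypothetical candidate values for the two hyperparameters
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

Because the search only uses the training data, the validation set remains available for comparing the tuned tree against the other models.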

In [None]:
#decision tree model for dataset A with hyperparameter tuning
model_a = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
model_a.fit(X_train, y_train)

yt_dt_a = model_a.predict(X_train)
yv_dt_a = model_a.predict(X_val)

train_accuracy_a = accuracy_score(y_train, yt_dt_a)
val_accuracy_a = accuracy_score(y_val, yv_dt_a)

#decision tree model for dataset B with hyperparameter tuning
model_b = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
model_b.fit(Xb_train, yb_train)

yt_dt_b = model_b.predict(Xb_train)
yv_dt_b = model_b.predict(Xb_val)

train_accuracy_b = accuracy_score(yb_train, yt_dt_b)
val_accuracy_b = accuracy_score(yb_val, yv_dt_b)

#confusion matrix for dataset A (validation)
cm_a = confusion_matrix(y_val, yv_dt_a)
disp_a = ConfusionMatrixDisplay(confusion_matrix=cm_a, display_labels=class_labels)
disp_a.plot(cmap="Blues")
plt.title("Confusion Matrix for Dataset A")
plt.show()

#f1-score extraction
report_a = classification_report(y_val, yv_dt_a, output_dict=True)
print('Dataset A F1-Scores:')
for key, values in report_a.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

print('\nDataset A Accuracy')
print('Training Accuracy:', accuracy_score(y_train, yt_dt_a))
print('Validation Accuracy:', accuracy_score(y_val, yv_dt_a))

#confusion matrix for dataset B (validation)
cm_b = confusion_matrix(yb_val, yv_dt_b)
disp_b = ConfusionMatrixDisplay(confusion_matrix=cm_b, display_labels=class_labels)
disp_b.plot(cmap="Blues")
plt.title("Confusion Matrix for Dataset B")
plt.show()

#f1-score extraction
report_b = classification_report(yb_val, yv_dt_b, output_dict=True)
print('Dataset B F1-Scores:')
for key, values in report_b.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

print('\nDataset B Accuracy')
print('Training Accuracy:', accuracy_score(yb_train, yt_dt_b))
print('Validation Accuracy:', accuracy_score(yb_val, yv_dt_b));

**Dataset A Decision Tree (with Tuning) Results**

For Dataset A, the decision tree achieves a training accuracy of 71% and a validation accuracy of 53%, showing that it can generalise moderately well. The confusion matrix indicates the model performs better at identifying true stories. Performance is balanced across the classes, with an F1-score of 0.50, but it is not strong, leaving room for improvement even after tuning.

**Dataset B Decision Tree (with Tuning) Results**

For Dataset B, the model reaches a higher training accuracy of 77% but the validation accuracy remains low at 46%, suggesting continued overfitting despite the tuning. The confusion matrix highlights challenges in predicting both true and deceptive stories. The F1-score of 0.46 further emphasises the model's struggle to balance precision and recall.

**Overall**

The decision tree, even with hyperparameter tuning, performs moderately on Dataset A but struggles with Dataset B due to overfitting and feature noise. Compared to the SVM and logistic regression models, the decision tree shows weaker generalisation and less reliability overall.

## 6.4 Best Model and Dataset: Re-train and Final Test

Based on the analysis, Dataset A is the best choice for the testing stage because its features are more representative and it generalised more consistently across all models than Dataset B. In addition, the SVM is the most reliable model for the testing stage, as it achieved the highest validation accuracy (67%) and balanced F1-scores on Dataset A.

Together, Dataset A and the SVM model provide the best combination for estimating deployment performance and are the best suited to distinguishing between true and deceptive stories. Both will be used for re-training and the final test to measure how accurately this model addresses the problem.
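The selection can be expressed as choosing the model and dataset pair with the highest validation accuracy. The sketch below hard-codes the validation accuracies reported in sections 6.1 to 6.3 (for the decision tree, the tuned version); the dictionary layout is an illustrative choice.

```python
# validation accuracies reported above, keyed by (model, dataset)
val_accuracy = {
    ("SVM", "A"): 0.67,
    ("SVM", "B"): 0.40,
    ("Logistic Regression", "A"): 0.53,
    ("Logistic Regression", "B"): 0.467,
    ("Decision Tree (tuned)", "A"): 0.53,
    ("Decision Tree (tuned)", "B"): 0.46,
}

# rank every combination from best to worst validation accuracy
ranking = sorted(val_accuracy.items(), key=lambda item: item[1], reverse=True)
for (model, dataset), acc in ranking:
    print(f"{model:<22} Dataset {dataset}: {acc:.2f}")

best_model, best_dataset = ranking[0][0]
print("Selected:", best_model, "on Dataset", best_dataset)
```

Ranking on validation accuracy alone ignores the F1-scores and the gap between training and validation accuracy, both of which the discussion above also takes into account.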


### 6.4.1 Re-train Data

The selected SVM model is retrained on 85% of Dataset A, combining the training (70%) and validation (15%) sets. This maximises the use of labelled data to improve pattern learning while still leaving the test set untouched for unbiased evaluation.

In [None]:
#combine dataset a's training and validation sets to create new train (85%)
X_combined = np.concatenate((X_train, X_val), axis=0)
y_combined = np.concatenate((y_train, y_val), axis=0)

#check new shapes
print("Combined Training Set:", X_combined.shape, y_combined.shape)
print("Test Set:", X_test.shape, y_test.shape)

In [None]:
#retrain the model using the combined dataset
svm_a_retrained = svm.SVC(C=1)  # Adjust hyperparameters if needed
svm_a_retrained.fit(X_combined, y_combined)

#predictions on the training set (combined dataset)
y_train_pred = svm_a_retrained.predict(X_combined)

#evaluate the model on the training set
print("Combined training Set Accuracy:", accuracy_score(y_combined, y_train_pred))

The combined training accuracy of 73% is slightly lower than the original SVM training accuracy on Dataset A (74%). This small drop is expected, as adding the validation data introduces more diversity, which should help the model generalise better. By using a larger 85% training set, the model is now better prepared for testing and may perform more reliably on unseen data.

### 6.4.2 The Test

This section evaluates the model on unseen test data, providing a reliable measure of its real-world performance. It indicates how well the model generalises and whether it is ready for deployment.

In [None]:
#evaluate the retrained model on the test set
y_test_pred = svm_a_retrained.predict(X_test)

#plot confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred, display_labels=class_labels, cmap="Blues")
plt.title("Confusion Matrix for Test Set")
plt.show()

#f1-score extraction
report_test = classification_report(y_test, y_test_pred, output_dict=True)
print('F1-Scores:')
for key, values in report_test.items():
    if isinstance(values, dict):
        print(f"{key}: {values['f1-score']:.2f}")

#accuracy
print("\nTest Set Accuracy:", accuracy_score(y_test, y_test_pred))

The test results show that the retrained SVM model achieved an accuracy of 53.33%, indicating some ability to generalise but also highlighting limitations in predicting unseen data. The F1-scores suggest the model performs better at identifying deceptive stories (0.59) compared to true ones (0.46), with an overall balance reflected in a macro-average of 0.52.

The confusion matrix reveals that true stories are more often misclassified (3 correct vs. 5 incorrect), whereas deceptive stories are identified more accurately (5 correct vs. 2 incorrect). These results suggest the model could benefit from further feature refinement or tuning to improve its generalisation capabilities.
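The reported test metrics follow directly from the confusion matrix counts. As a worked check, the sketch below recomputes precision, recall and F1 for each class from the four counts quoted above; the helper name `class_metrics` and the nested-dictionary layout are illustrative.

```python
# test-set counts quoted above: outer keys are actual classes, inner keys are predicted classes
confusion = {
    "true":      {"true": 3, "deceptive": 5},
    "deceptive": {"true": 2, "deceptive": 5},
}

def class_metrics(confusion, label):
    """Precision, recall and F1 for one class of a nested-dict confusion matrix."""
    tp = confusion[label][label]
    fn = sum(n for pred, n in confusion[label].items() if pred != label)
    fp = sum(row[label] for actual, row in confusion.items() if actual != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in confusion) / total

f1_scores = {label: class_metrics(confusion, label)[2] for label in confusion}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print("Accuracy:", round(accuracy, 4))
for label, f1 in f1_scores.items():
    print(f"F1 ({label}): {f1:.2f}")
print(f"Macro-average F1: {macro_f1:.2f}")
```

The lower F1-score for true stories comes mainly from low recall: only 3 of the 8 true stories in the test set are recovered.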

# 7 Conclusions


**Datasets**

Between the two datasets, Dataset A outperformed Dataset B as it showed better generalisation and consistency throughout. Ultimately, the features in Dataset A proved more suitable for addressing the problem of determining whether a story is true or deceptive.

**Models**

Among the models tested, the SVM achieved the most reliable results. The logistic regression model struggled to capture non-linear relationships in the data, and the decision tree also achieved lower accuracy than the SVM. Overall, the SVM was the best model for this project, offering the best balance of validation accuracy and F1-scores.

**Final Test Performance**

The final test accuracy (53%) was far from perfect: the model correctly predicted whether a story was true or deceptive around half of the time, and it struggled more with true stories. It is important to note that, given the size of the dataset, the amount of data the model was trained and tested on may not have been sufficient. Moreover, the limited timeframe and scope meant that only a small number of features could be explored, and these were perhaps not the best for this type of problem. This problem would benefit from expanding the dataset, incorporating more meaningful features and experimenting with more models to achieve the best accuracy for deployment.
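The effect of the small test set can be made concrete with a confidence interval for the test accuracy. The sketch below applies the Wilson score interval to 8 correct predictions out of 15 (the counts from the test confusion matrix); the choice of the Wilson interval and the 95% level are assumptions made for illustration and were not part of the original analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 8 of the 15 test recordings were classified correctly
low, high = wilson_interval(successes=8, n=15)
print(f"Test accuracy: {8 / 15:.2f}, approximate 95% interval: {low:.2f} to {high:.2f}")
```

With only 15 test recordings the interval is wide and contains 50%, which supports the point that a larger dataset is needed before drawing firm conclusions about the model.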


# 8 References

* Badhe, S.S., Gulhane, S.R., and Shirbahadurkar, S.D. (2019) 'Analysis of Spectral Features for Speaker Clustering', *International Journal of Innovative Technology and Exploring Engineering (IJITEE)*, 8(9S3), pp. 1234-1238.

* Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L. and Lopez, A. (2020). 'A comprehensive survey on support vector machine classification: Applications, challenges and trends'. *Neurocomputing*, 408, pp.189-215.

* Davis, S. and Mermelstein, P. (1980). 'Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences'. *IEEE transactions on acoustics, speech, and signal processing*, 28(4), pp.357-366.

* DePaulo, B.M., Rosenthal, R., Rosenkrantz, J. and Green, C.R. (1982). 'Actual and perceived cues to deception: A closer look at speech'. *Basic and Applied Social Psychology*, 3(4), pp.291-312.

* Fernandes, S.V. and Ullah, M.S. (2021). 'Use of machine learning for deception detection from spectral and cepstral features of speech signals'. *IEEE Access*, 9, pp.78925-78935.

* Gomes Mantovani, R., Horváth, T., Rossi, A.L., Cerri, R., Barbon Junior, S., Vanschoren, J. and Carvalho, A.C.D. (2024). 'Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms'. *Data Mining and Knowledge Discovery*, pp.1-53.

* Gupta, S., Jaafar, J., Ahmad, W.W. and Bansal, A. (2013). 'Feature extraction using MFCC'. *Signal & Image Processing: An International Journal*, 4(4), pp.101-108.

* Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., del Río, J.F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T.E. (2020). Array programming with NumPy. *Nature*, 585(7825), pp.357-362.

* Hirschberg, J.B., Benus, S., Brenier, J.M., Enos, F., Friedman, S., Gilman, S., Girand, C., Graciarena, M., Kathol, A., Michaelis, L. and Pellom, B. (2005). 'Distinguishing deceptive from non-deceptive speech'. *Interspeech*, Lisbon, Portugal.

* Hunter, J.D. (2007). Matplotlib: A 2D graphics environment. *Computing in Science & Engineering*, 9(3), pp.90-95.

* McFee, B., Raffel, C., Liang, D., Ellis, D.P.W., McVicar, M., Battenberg, E. and Nieto, O. (2015). librosa: Audio and music signal analysis in Python. In *Proceedings of the 14th Python in Science Conference*.

* McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference*, 445, pp. 51-56.

* Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research*, 12, pp.2825-2830.

* Waskom, M. L. (2021). Seaborn: Statistical data visualization. *Journal of Open Source Software*, 6(60), 3021.

* Zhou, J., Gandomi, A.H., Chen, F. and Holzinger, A. (2021). 'Evaluating the quality of machine learning explanations: A survey on methods and metrics'. *Electronics*, 10(5), p.593.
