# Project 3

**Objective**

We are given two folders, `Normal` and `Asthma`. The first contains recordings of normal breathing; the second contains recordings of breathing with a wheezing sound. The goal is to train a binary classifier that separates the two.

$ \color{red}{\text{Grade: 5/5}} $

$ \color{red}{\text{Good}} $

## Useful packages for this project

In [1]:
# data wrangling
import numpy as np
import pandas as pd
from pathlib import Path

# hepml
from hepml.core import make_gravitational_waves, download_dataset

# tda magic
from gtda.homology import VietorisRipsPersistence, CubicalPersistence
from gtda.diagrams import PersistenceEntropy, Scaler
from gtda.plotting import plot_heatmap, plot_point_cloud, plot_diagram
from gtda.pipeline import Pipeline
from gtda.time_series import TakensEmbedding
from gtda.time_series import SingleTakensEmbedding
from sklearn.preprocessing import StandardScaler

# ml tools
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score

import os
import wave
from persim import plot_diagrams
from ripser import ripser
from sklearn.pipeline import Pipeline  # NOTE: replaces the gtda.pipeline.Pipeline name imported above
from gtda.diagrams import Amplitude
from gtda.diagrams import NumberOfPoints
from sklearn.pipeline import make_pipeline, make_union
from teaspoon.ML import feature_functions as Ff

## Import the combined, labelled dataset

**Create labels for the two folders: files from Normal are labelled 0 and files from Asthma are labelled 1.**

In [3]:
# Define folder paths
# base_path = 'drive'  # Change to your actual folder path
base_path = 'C:/Users/aime/Documents/AIMS_Docs/TDA/Project_3/Project_3'
normal_path = os.path.join(base_path, 'Normal')
asthma_path = os.path.join(base_path, 'Asthma')

# Function to get file names and assign labels
def get_files_with_labels(folder_path, label):
    file_list = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.wav'):  # Consider only .wav files
            file_list.append((file_name, label))
    return file_list

# Get files from both folders with respective labels
normal_files = get_files_with_labels(normal_path, 0)
asthma_files = get_files_with_labels(asthma_path, 1)

# Combine data
all_files = normal_files + asthma_files

# Create a DataFrame
df = pd.DataFrame(all_files, columns=['filename', 'label'])

# Save DataFrame to CSV
df.to_csv('labeled_data.csv', index=False)

**Read the CSV file**

In [4]:
dataset = pd.read_csv('labeled_data.csv')
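
Before extracting features it can help to check how many recordings fall into each class, since accuracy is harder to interpret on imbalanced data. The following is a minimal sketch using only the standard library; the file names in `sample_csv` are hypothetical stand-ins for the rows of `labeled_data.csv`, and `count_labels` is an illustrative helper, not part of the project code.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for labeled_data.csv (same two columns).
sample_csv = """filename,label
normal_01.wav,0
normal_02.wav,0
asthma_01.wav,1
"""


def count_labels(csv_file):
    """Count how many rows carry each label in a filename/label CSV."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["label"]) for row in reader)


label_counts = count_labels(io.StringIO(sample_csv))
print(label_counts)
```

With the real file, pass `open('labeled_data.csv', newline='')` instead of the in-memory string.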

We previously plotted the waveform and the persistence diagram for a single file. We now generalize to all files by combining the steps into functions.

## Useful functions

1- A function that applies a Takens embedding to a time series using the given embedder. It returns the embedded series and, when `verbose` is true, prints its shape together with the embedding dimension and time delay selected by the embedder.

In [5]:
def fit_embedder(embedder, y, verbose=True):
    y_embedded = embedder.fit_transform(y)

    if verbose:
        print(f"Shape of embedded time series: {y_embedded.shape}")
        print(f"Optimal embedding dimension is {embedder.dimension_} and time delay is {embedder.time_delay_}")

    return y_embedded
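
To make the embedding step concrete, here is a minimal sketch of a delay embedding with fixed parameters, written in plain Python. It is not giotto-tda's implementation: `SingleTakensEmbedding` in "search" mode also selects the dimension and delay, which this sketch takes as given. The helper name `delay_embed` and the toy signal are assumptions made for illustration.

```python
def delay_embed(series, dimension, time_delay, stride=1):
    """Return the delay vectors (x[i], x[i + tau], ..., x[i + (d - 1) * tau]).

    One vector is produced for every `stride`-th start index i for which
    the whole window still fits inside the series.
    """
    span = (dimension - 1) * time_delay
    if len(series) <= span:
        raise ValueError("series is too short for these parameters")
    return [
        [series[i + k * time_delay] for k in range(dimension)]
        for i in range(0, len(series) - span, stride)
    ]


# Toy signal 0, 1, ..., 9 so the indices are easy to follow.
signal = list(range(10))
points = delay_embed(signal, dimension=3, time_delay=2, stride=1)
strided = delay_embed(signal, dimension=3, time_delay=2, stride=3)
print(points[0], points[-1], len(points))
```

Each row of `points` is one point of the embedded cloud, so a longer delay or higher dimension leaves fewer points, and a larger stride thins the cloud further.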

2- A function that transforms a list of persistence diagrams into a single array in which each point is labelled with its homology dimension.

In [6]:
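def convert_dgm(dgm):
    Arr = dgm.copy()
    # Drop the last H0 row (assumed to be the bar with infinite death)
    # and label the remaining H0 points with dimension 0
    Arr[0] = Arr[0][:-1]
    col_a  = np.zeros(Arr[0].shape[0])
    Arr[0] = np.column_stack((Arr[0], col_a))

    # Label the H1 points with dimension 1
    col_b  = np.ones(Arr[1].shape[0], dtype=int)
    Arr[1] = np.column_stack((Arr[1], col_b))

    # Stack both diagrams into one (n_points, 3) array: birth, death, dimension
    return np.vstack((Arr[0], Arr[1]))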
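
The effect of `convert_dgm` can be illustrated on a tiny hand-made example. The sketch below uses plain tuples rather than NumPy arrays, and it filters out infinite bars explicitly instead of dropping the last H0 row, so it is a variation on the idea rather than the same code. The helper name `label_diagrams` and the diagram values are hypothetical.

```python
import math


def label_diagrams(diagrams):
    """Flatten per-dimension diagrams into (birth, death, dimension) triples.

    `diagrams[q]` is the list of (birth, death) pairs in homology
    dimension q. Bars that never die are skipped.
    """
    labelled = []
    for dimension, diagram in enumerate(diagrams):
        for birth, death in diagram:
            if math.isinf(death):
                continue
            labelled.append((birth, death, dimension))
    return labelled


# Hypothetical diagrams: three H0 bars (one infinite) and one H1 bar.
h0 = [(0.0, 0.5), (0.0, 1.2), (0.0, math.inf)]
h1 = [(0.8, 1.1)]
labelled = label_diagrams([h0, h1])
print(labelled)
```

The third column is what lets the later feature extractors treat H0 and H1 points separately while receiving a single array.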

3- A function that generates features from persistence entropy, Carlsson coordinates, and diagram amplitudes (bottleneck, Wasserstein, Betti, and heat).

This function takes an open WAV file as input and returns the concatenated features: the Carlsson coordinates followed by the features from the different metrics.

In [7]:
def generate_features(spf):
    # Extract raw audio from the WAV file as 16-bit integers
    signal = spf.readframes(-1)
    signal = np.frombuffer(signal, np.int16)
    fs = spf.getframerate()

    # Only mono recordings are supported
    if spf.getnchannels() == 2:
        raise ValueError("Just mono files")

    # Takens embedding; with parameters_type="search" the dimension and
    # time delay below act as upper bounds for the parameter search
    embedding_dimension = 30
    embedding_time_delay = 300
    stride = 10

    embedder = SingleTakensEmbedding(
        parameters_type="search",
        n_jobs=2,
        time_delay=embedding_time_delay,
        dimension=embedding_dimension,
        stride=stride,
    )

    y_noise_embedded = fit_embedder(embedder, signal)

    # Persistent homology on a greedy subsample of 700 embedded points
    res = ripser(y_noise_embedded, n_perm=700)
    dgms_sub = res['dgms']

    res = convert_dgm(dgms_sub)

    # H0 diagram without its last row, and the H1 diagram
    test = dgms_sub[0][:-1]
    test_1 = dgms_sub[1]

    # Compute the Carlsson-coordinate feature matrices
    FN = 5
    FeatureMatrix, TotalNumComb, CombList = Ff.F_CCoordinates(test[None,:,:], FN)
    X_cc_0 = FeatureMatrix[-8]

    FeatureMatrix, TotalNumComb, CombList = Ff.F_CCoordinates(test_1[None,:,:], FN)
    X_cc_1 = FeatureMatrix[-9]

    # List all metrics used to extract diagram amplitudes
    metrics = [
        {"metric": "bottleneck", "metric_params": {}},
        {"metric": "wasserstein", "metric_params": {"p": 2}},
        {"metric": "betti", "metric_params": {"p": 2, "n_bins": 100}},
        {"metric": "heat", "metric_params": {"p": 2, "sigma": 1.6, "n_bins": 100}},
        {"metric": "heat", "metric_params": {"p": 2, "sigma": 3.2, "n_bins": 100}},
    ]

    feature_union = make_union(
        PersistenceEntropy(normalize=True),
        NumberOfPoints(n_jobs=-1),
        *[Amplitude(**metric, n_jobs=-1) for metric in metrics]
    )

    data_metrics = feature_union.fit_transform(res[None, :, :])

    return np.hstack((X_cc_0, X_cc_1, data_metrics))
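
Two of the summaries used in `generate_features` are simple enough to write out by hand. The sketch below computes the persistence entropy and five Carlsson coordinates of a single diagram in plain Python. It follows the commonly cited definitions and is an assumption about what the libraries compute: giotto-tda and teaspoon may differ in details such as normalization, logarithm base, or how feature combinations are selected. The diagram values are hypothetical.

```python
import math


def persistence_entropy(diagram):
    """Shannon entropy (base 2) of the normalized bar lifetimes."""
    lifetimes = [death - birth for birth, death in diagram if death > birth]
    total = sum(lifetimes)
    if total == 0:
        return 0.0
    return -sum((l / total) * math.log2(l / total) for l in lifetimes)


def carlsson_coordinates(diagram):
    """Five polynomial summaries of a finite diagram of (birth, death) pairs."""
    d_max = max(death for _, death in diagram)
    bars = [(b, d, d - b) for b, d in diagram]
    f1 = sum(b * life for b, d, life in bars)
    f2 = sum((d_max - d) * life for b, d, life in bars)
    f3 = sum(b ** 2 * life ** 4 for b, d, life in bars)
    f4 = sum((d_max - d) ** 2 * life ** 4 for b, d, life in bars)
    f5 = max(life for b, d, life in bars)
    return (f1, f2, f3, f4, f5)


# Hypothetical diagram with three bars of equal lifetime.
diagram = [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]
entropy = persistence_entropy(diagram)
coords = carlsson_coordinates(diagram)
print(entropy, coords)
```

Because all three bars have the same lifetime, the entropy takes its maximum for three bars, log2(3). For an H0 diagram every birth is 0, so `f1` and `f3` vanish there.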

4- Calling the function for each file in the folder 

In [8]:
X = []
y = []

for index, row in dataset.iterrows():
    filename = row['filename'].strip()  # Clean filename
    label = row['label']

    # Determine the subfolder based on label
    subfolder = 'Normal' if label == 0 else 'Asthma'

    # Construct the full path to the WAV file
    wav_file_path = os.path.join(base_path, subfolder, filename)
    # Open the file in a context manager so it is closed after feature extraction
    with wave.open(wav_file_path, "rb") as spf:
        X.append(generate_features(spf))
    y.append(label)

X = np.array(X)
y = np.array(y)


Shape of embedded time series: (5371, 9)
Optimal embedding dimension is 9 and time delay is 63
Shape of embedded time series: (7170, 17)
Optimal embedding dimension is 17 and time delay is 283
Shape of embedded time series: (4955, 11)
Optimal embedding dimension is 11 and time delay is 185
Shape of embedded time series: (5027, 13)
Optimal embedding dimension is 13 and time delay is 222
Shape of embedded time series: (4841, 10)
Optimal embedding dimension is 10 and time delay is 246
Shape of embedded time series: (4815, 8)
Optimal embedding dimension is 8 and time delay is 62
Shape of embedded time series: (7796, 12)
Optimal embedding dimension is 12 and time delay is 215
Shape of embedded time series: (7234, 11)
Optimal embedding dimension is 11 and time delay is 107
Shape of embedded time series: (6035, 8)
Optimal embedding dimension is 8 and time delay is 221
Shape of embedded time series: (5116, 12)
Optimal embedding dimension is 12 and time delay is 208
...

Shape of embedded time series: (7764, 14)
Optimal embedding dimension is 14 and time delay is 128
Shape of embedded time series: (4508, 16)
Optimal embedding dimension is 16 and time delay is 182
Shape of embedded time series: (6986, 9)
Optimal embedding dimension is 9 and time delay is 284
Shape of embedded time series: (3463, 8)
Optimal embedding dimension is 8 and time delay is 274
Shape of embedded time series: (5219, 12)
Optimal embedding dimension is 12 and time delay is 184
Shape of embedded time series: (5232, 11)
Optimal embedding dimension is 11 and time delay is 189
Shape of embedded time series: (7788, 12)
Optimal embedding dimension is 12 and time delay is 129
Shape of embedded time series: (4885, 8)
Optimal embedding dimension is 8 and time delay is 181
Shape of embedded time series: (3246, 9)
Optimal embedding dimension is 9 and time delay is 287
Shape of embedded time series: (5248, 8)
Optimal embedding dimension is 8 and time delay is 248
...
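
Each iteration of the loop above opens a WAV file and hands it to `generate_features`, which decodes the raw frames into 16-bit integers. That decoding step can be tried in isolation with the standard library alone. The sketch below writes a short synthetic tone to an in-memory WAV file and reads it back; the helper name `read_mono_int16`, the 8000 Hz sample rate, and the tone itself are assumptions for illustration and are not taken from the dataset.

```python
import io
import math
import struct
import wave


def read_mono_int16(wav_source):
    """Return (sample_rate, samples) from a mono 16-bit PCM WAV file or path."""
    with wave.open(wav_source, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected a mono recording")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    # WAV PCM data is little-endian; "<h" decodes signed 16-bit integers.
    samples = list(struct.unpack(f"<{len(raw) // 2}h", raw))
    return rate, samples


# Build a short synthetic 440 Hz tone and store it as an in-memory WAV file.
tone = [int(1000 * math.sin(2 * math.pi * 440 * t / 8000)) for t in range(80)]
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(struct.pack(f"<{len(tone)}h", *tone))

buffer.seek(0)
rate, samples = read_mono_int16(buffer)
print(rate, len(samples))
```

Checking the channel count and sample width up front avoids silently misreading stereo or 8-bit files as a mono 16-bit signal.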

## Split the dataset into train and validation sets

In [9]:
# Reshape the feature array in place to (n_samples, n_features)
X.shape = (201, 20)

In [10]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
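
To see what sizes this split produces, the sketch below implements a shuffled hold-out split on indices in plain Python. It is not scikit-learn's implementation and will not reproduce the same permutation for `random_state=42`; it only assumes that the validation size is `test_size * n_samples` rounded up. The helper name `holdout_split` is hypothetical.

```python
import math
import random


def holdout_split(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, valid_indices) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_valid = math.ceil(test_size * n_samples)
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = holdout_split(201, test_size=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

Under that rounding assumption, 201 samples give 160 training and 41 validation examples, so each validation example moves the validation accuracy by about 2.4 percentage points.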

## Train model and check accuracy

We use a random forest classifier (`RandomForestClassifier`).

In [11]:
# Print train and validation accuracy, plus OOB accuracy when the model provides it

def print_scores(fitted_model):
    res = {
        "Accuracy on train:": accuracy_score(y_train, fitted_model.predict(X_train)),
        "Accuracy on valid:": accuracy_score(y_valid, fitted_model.predict(X_valid)),
    }
    if hasattr(fitted_model, "oob_score_"):
        res["OOB accuracy:"] = fitted_model.oob_score_

    for k, v in res.items():
        print(k, round(v, 3))

In [13]:
rf = RandomForestClassifier(random_state=29)
rf.fit(X_train, y_train)
print_scores(rf)

Accuracy on train: 0.988
Accuracy on valid: 0.659
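
Accuracy is the fraction of predictions that match the true labels. The sketch below spells that out in plain Python on hypothetical labels. As an illustration of scale, and assuming the validation set holds 41 recordings (20% of 201, rounded up), 27 correct predictions would print as 0.659. The gap between the train and validation scores above suggests the forest fits the training set much more closely than it generalizes.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: four of the five predictions are right.
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]
score = accuracy(y_true_demo, y_pred_demo)

# Illustrative arithmetic only: 27 correct out of an assumed 41 samples.
illustrative_valid = round(27 / 41, 3)
print(score, illustrative_valid)
```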
