## Imports

In [None]:
#basic imports
import pandas as pd
import numpy as np 
import itertools
from typing import List
import matplotlib.pyplot as plt
import os
from random import shuffle, sample
import warnings
from tqdm import tqdm

#smoothing and gapfilling
from scipy.signal import convolve
from scipy.signal.windows import hamming  # window functions live in scipy.signal.windows

#fourier features
from scipy import signal as sg
#wavelets features
import pywt
import scipy as sc
import scipy.stats  # explicit import so that sc.stats.pearsonr is available

#pca
from sklearn import decomposition , metrics

#classifiers
from sklearn.ensemble import RandomForestClassifier

## Cleaning the data

The first step of our project is to turn surgery records into patient records with usable data. During surgery, recording is not the priority, so some sensors are often not recording or are reporting false values. The main goal here is to combine all the pieces of data into understandable signals.

### Mapping

If your data is not recorded the same way as ours, you can change the column names in the following cell (lists below). In each list, the first sensor is the most reliable and the last is the least reliable.

For example, FC stands for heart frequency (fréquence cardiaque) and Pouls for the heart frequency computed by the pulse oximetry sensor.



In [None]:
blood_pressure_sensors = ["PAm", "PNIm", "PBm"]
temperature_sensors = ["Temp", "Toeso", "T1"]
heart_freq_sensors = ['FC', 'Pouls']
already_used = ['HEURE','DATE','LIT','NOM']  + blood_pressure_sensors + temperature_sensors + heart_freq_sensors

Many sensors are used during surgery, and some of them record the same quantity. The following functions gather these columns and give priority to the most reliable sensor for each feature (heart frequency, blood pressure, temperature).


In [None]:
def get_pulses(df):
    """Returns a pandas series filled with the heart frequency from df['FC'], or from df['Pouls'] where the former is unavailable.
    Args:
        - df : pandas.DataFrame containing the columns listed in heart_freq_sensors
    Returns:
        - pandas.Series
    """
    serie_pulses  = df[heart_freq_sensors[0]].copy()
    for sensor in heart_freq_sensors[1:]:
        serie_pulses.loc[serie_pulses.isna()] = df[sensor].loc[serie_pulses.isna()]

    return serie_pulses
    
def get_blood_pressure(df):
    """Returns a pandas series filled with blood pressure from df['PAm'], from df['PNIm'] where the former is unavailable, or from df['PBm'] as a last resort
    """
    
    serie_blood_pressure = df[blood_pressure_sensors[0]].copy()
    for sensor in blood_pressure_sensors[1:]:
        serie_blood_pressure.loc[serie_blood_pressure.isna()] = df[sensor].loc[serie_blood_pressure.isna()]

    return serie_blood_pressure


def get_temperature(df):
    """Returns a pandas series filled with temperature from df['Temp'], from df['Toeso'] where the former is unavailable, or from df['T1'] as a last resort
    """

    serie_temperature = df[temperature_sensors[0]].copy()
    for sensor in temperature_sensors[1:]:
        serie_temperature.loc[serie_temperature.isna()] = df[sensor].loc[serie_temperature.isna()]

    return serie_temperature
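
The three functions above share one pattern: start from the most reliable sensor and fill its missing rows from the next sensors in the list. Here is a minimal sketch of that pattern on toy data. The helper name coalesce_sensors and the readings are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def coalesce_sensors(df, sensors):
    """Hypothetical helper: first non-missing value per row, in priority order."""
    merged = df[sensors[0]].copy()
    for sensor in sensors[1:]:
        # combine_first keeps existing values and only fills the NaN rows
        merged = merged.combine_first(df[sensor])
    return merged


# Made-up readings: FC is missing on rows 1 and 2, Pouls on row 2 only.
toy = pd.DataFrame({
    "FC": [60.0, np.nan, np.nan],
    "Pouls": [61.0, 72.0, np.nan],
})

pulses = coalesce_sensors(toy, ["FC", "Pouls"])
print(pulses)
```

Row 0 keeps the FC value, row 1 falls back to Pouls, and row 2 stays missing because no sensor recorded anything.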

The mapping function creates the *clean* DataFrame of a patient. In our case, operating room number 4 is dedicated to heart surgery. For the other operating rooms, we chose to keep only heart frequency, blood pressure, body temperature, blood oxygenation and respiratory rate, because other vital signs were not recorded for most patients. Surgeries occurring in operating room number 4 are very different from the others and provide many more vital signs, which led us to handle this room differently in the mapping, and later to leave it out of the prediction process.

In [None]:
def mapping(df, bloc_4 = False):
    """Returns a Dataframe containing the time, heart frequency and blood pressure, plus the temperature, SpO2 and respiratory rate
    for every operating room except number 4, for which all the remaining sensor columns are kept instead.
    """
    df.replace('AP', np.nan, inplace = True)
    mapped_df = pd.DataFrame()
    mapped_df["seconde"] = df["seconde"].copy()
    mapped_df["Pouls"] = get_pulses(df)
    mapped_df["Pression"] = get_blood_pressure(df)
    if not bloc_4:
        mapped_df["Temperature"] = get_temperature(df)
        mapped_df["SpO2"] = df["SpO2"].copy()
        mapped_df["FR"] = df["FR"].copy()
    else :
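        # adding other sensors specific to bloc 4
        mapped_df = pd.concat([mapped_df, pd.concat([df[col] for col in df.columns.drop(already_used)], axis=1)], axis=1)

    # dropna returns a new DataFrame, so this call has no effect unless its result is assigned;
    # all-NaN columns are kept, and column_test_patient later relies on 'SpO2' and 'FR' being present.
    mapped_df.dropna(axis = 1, how = 'all')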
    
    return mapped_df.astype(float)
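
To illustrate the first and last steps of mapping, here is a small sketch of how a text marker such as 'AP' is turned into NaN so that the column can be converted to floats. The toy column is made up, and it assumes the numbers were read as text because of the marker.

```python
import numpy as np
import pandas as pd

# Toy column as it might come out of a csv file: numbers stored as text,
# plus the 'AP' marker (values are made up for the example).
raw = pd.DataFrame({"SpO2": ["98", "AP", "97"]})

# Replace the marker by NaN first; astype(float) would fail on 'AP' otherwise.
cleaned = raw.replace("AP", np.nan).astype(float)
print(cleaned)
```

Doing the replacement before the conversion is what allows the final astype(float) in mapping to succeed.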

### Splitting patients

The main idea here is to identify and separate each patient in every operation file, which often consists of 2 or 3 days of continuous recording. Of course, sensors might stop recording for several minutes in the middle of an operation or, on the contrary, record for a few seconds during a break.

<figure ><figcaption >Identifying different patients in the data:</figcaption><video src="split_video.mp4" width=80% autoplay loop>  </video> </figure>

The first step is to create a pandas Series that summarizes the heart rate, blood oxygenation, blood pressure and respiration rate, considering that a recording of any one of those vital signs means that there is a patient in the operating room. This Series has no medical meaning; it is a timeline of all the recorded data.

In [None]:
def column_test_patient(df):
    '''
    arg: df, DataFrame
    return: Series holding, for each row, the first available value among Pouls, SpO2, Pression and FR
    '''
    serie_patient = df['Pouls'].copy()
    serie_patient[serie_patient.isna()] = df['SpO2'].loc[serie_patient.isna()]
    serie_patient[serie_patient.isna()] = df['Pression'].loc[serie_patient.isna()]
    serie_patient[serie_patient.isna()] = df['FR'].loc[serie_patient.isna()]

    return serie_patient

We then turn this timeline into binary information (is data being recorded or not?).

In [None]:
def binar_list(df):
    '''
    arg: df, DataFrame
    return: binary mask (pandas Series) marking the presence of data with 1 and its absence with 0.
    '''
    patient_serie=column_test_patient(df)
    return patient_serie.notna().astype(int)
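
For the purpose of the mask, the two cells above amount to asking whether at least one of the four vital signs is present on each row. Here is a hedged sketch of that shortcut on made-up data (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up recording: one row per time step, NaN where a sensor is silent.
toy = pd.DataFrame({
    "Pouls":    [70.0,   np.nan, np.nan, np.nan],
    "SpO2":     [98.0,   97.0,   np.nan, np.nan],
    "Pression": [np.nan, np.nan, np.nan, np.nan],
    "FR":       [np.nan, np.nan, 14.0,   np.nan],
})

# 1 when any vital sign is recorded on the row, 0 when all of them are missing.
presence = toy.notna().any(axis=1).astype(int)
print(presence.tolist())  # [1, 1, 1, 0]
```

Only the last row, where every sensor is silent, is marked 0.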


The following function summarizes the previous timeline by giving the length of each run of identical values, which will enable us to decide whether a sequence of missing data can be interpreted as a change of patient.

In [None]:
def list_index_patients(binar_list):
    '''
    arg: binar_list
    return: list of tuple (key,length of sequence)
    '''
    return [
        (key, len(list(group))) 
        for key, group in itertools.groupby(binar_list)
    ]
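
As an illustration of this run-length encoding, here is the same itertools.groupby idiom applied to a short hand-made mask (the mask is made up for the example):

```python
import itertools

# 1 = data recorded, 0 = nothing recorded
mask = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# groupby yields one group per run of identical consecutive values
runs = [(key, len(list(group))) for key, group in itertools.groupby(mask)]
print(runs)  # [(1, 3), (0, 2), (1, 1), (0, 4)]
```

The run lengths always add up to the length of the mask, which is what later allows them to be converted back into row positions.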


We had some difficulty separating patients because of noise (sensors recording during a break). The following function handles this issue by deleting the noise, which in fact means merging the two break sequences that were artificially split by the spurious values.

In [None]:
def noises_treatment(list_index_patient, threshold):
    '''
    arg: list of tuples (key,length of sequence), threshold (duration of bloc cleaning)
    return: list of tuples (key,length of sequence) without noise
    '''
    new_index = [list_index_patient[0]]
    n = len(list_index_patient)
    i = 1
    while i < n:
        length = list_index_patient[i][1]
        if length < threshold:
            # A short run is noise: absorb it, together with the run that follows it
            # (which has the same key as the previous run), into the previous run.
            merged = new_index[-1][1] + length
            if i + 1 < n:
                merged += list_index_patient[i+1][1]
            new_index[-1] = (new_index[-1][0], merged)
            i += 2
        else :
            new_index.append(list_index_patient[i])
            i += 1
    # Each run is counted exactly once, so the total length is preserved.
    return new_index
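
The merging rule can be checked on a small hand-made timeline. The sketch below is a compact standalone version of the idea; the name merge_short_runs and the run lengths are illustrative and not part of the notebook.

```python
def merge_short_runs(runs, threshold):
    """Absorb every run shorter than threshold, and the run after it, into the previous run."""
    merged = [runs[0]]
    i = 1
    while i < len(runs):
        key, length = runs[i]
        if length < threshold:
            # the run after a short run has the same key as the run before it
            extra = length + (runs[i + 1][1] if i + 1 < len(runs) else 0)
            merged[-1] = (merged[-1][0], merged[-1][1] + extra)
            i += 2
        else:
            merged.append((key, length))
            i += 1
    return merged


# A surgery, then a break interrupted by 2 samples of noise, then another surgery.
runs = [(1, 50), (0, 30), (1, 2), (0, 40), (1, 60)]
cleaned = merge_short_runs(runs, threshold=10)
print(cleaned)  # [(1, 50), (0, 72), (1, 60)]
```

The two break fragments (30 and 40 samples) and the 2 samples of noise end up as a single break of 72 samples, and the total number of samples is unchanged.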

We then have to identify the beginning and the end of every surgery. 

In [None]:
def accumulate_list(list_len):
    return [0]+list(itertools.accumulate(list_len))

def break_finder(list_index_patients, threshold):
    '''
    arg: list of tuples (key,length of sequence) without noise
    return: list of lists [begin,end] which frame the sequences of cleaning
    '''
    list_len = [len1[1] for len1 in list_index_patients]
    accumulate_list1 = accumulate_list(list_len)
    breaks = []
    for index,len1 in enumerate(list_index_patients):
        value, length = len1
        begin, end = accumulate_list1[index] , accumulate_list1[index+1]
        if (value == 0) and (length > threshold):
            breaks.append([begin,end])
    return breaks
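
To make the role of the cumulative sum concrete, here is a short standalone sketch that converts run lengths into row positions and keeps the long breaks. The runs and the threshold are made up for the example.

```python
import itertools

runs = [(1, 50), (0, 72), (1, 60)]   # (key, length) pairs, noise already merged
threshold = 10

lengths = [length for _, length in runs]
# Start position of every run, plus the end of the last one.
offsets = [0] + list(itertools.accumulate(lengths))   # [0, 50, 122, 182]

breaks = [
    [offsets[i], offsets[i + 1]]
    for i, (key, length) in enumerate(runs)
    if key == 0 and length > threshold
]
print(breaks)  # [[50, 122]]
```

Here the only break starts at row 50 and ends just before row 122.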

Finally we can slice the data into patients.

In [None]:
def list_to_patients(list_break,df):
    """takes a list of lists [begin,end] and the corresponding dataframe and returns a list containing one Dataframe for each patient.
    Args:
        - list_break: list
        - df: pd.DataFrame
    Return
        patients: list of pd.Dataframes"""
    if not list_break:
        # No break found: the whole recording belongs to a single patient.
        return [df]
    # Potential first patient, before the first break
    patients = [df[0:list_break[0][0]]]
    # Patients located between two consecutive breaks
    patients += [df[list_break[i][1]:list_break[i+1][0]] for i in range(len(list_break)-1)]
    # Last patient, after the last break (up to and including the last row)
    patients.append(df[list_break[-1][1]:])
    return patients


def separation_patients(df, threshold):
    '''
    Args: 
        - df: Dataframe ; 
        - threshold: int (duration of bloc cleaning)
    returns: list of patients' DataFrames
    '''
    list_index_patients1 = list_index_patients(binar_list(df))
    list_index_pat_treated = noises_treatment(list_index_patients1, threshold)
    breaks = break_finder(list_index_pat_treated, threshold)

    return list_to_patients(breaks, df)
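
The slicing step can be pictured on a ten-row toy DataFrame with a single break. This is a sketch under made-up values, written independently of the functions above.

```python
import pandas as pd

df = pd.DataFrame({"Pouls": range(10)})
breaks = [[3, 6]]   # rows 3, 4 and 5 form a break between two patients

# Flatten the break boundaries and add both ends of the recording:
# every (even, odd) pair of edges then frames one patient.
edges = [0] + [bound for pair in breaks for bound in pair] + [len(df)]
patients = [df.iloc[edges[i]:edges[i + 1]] for i in range(0, len(edges), 2)]

print([len(p) for p in patients])  # [3, 4]
```

The first patient covers rows 0 to 2 and the second one rows 6 to 9; the rows of the break belong to nobody.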

### Cleaning the data
We then have to discard empty csv files and patients without a regular time series. The following cell deals with these cases.

In [None]:
def second(df):
    """returns a pandas series containing the time in seconds.
    Args: 
        - df: pandas.DataFrame 
    Return: 
        - pandas.Series """
    delta_serie = pd.to_datetime(df.HEURE) - pd.to_datetime(df.HEURE[0])
    delta_serie = delta_serie.apply(lambda delta : delta.total_seconds())
    
    while (delta_serie<0).any() :
        delta_serie.loc[delta_serie<0]+= 60*24*60
    
    return delta_serie

def df_is_empty(df):
    '''
    Returns True if the Dataframe "df" is empty or nearly empty (fewer than 5 rows), False if not.
    '''
    return len(df) < 5

def time_is_missing(df):
    '''
    Returns True if the Dataframe "df" has no column containing the time, False if not
    '''
    return 'HEURE' not in df.columns

def patient_is_empty(patient):
    '''
    Returns True if the surgery lasts less than 20 minutes (240 samples, one every 5 seconds), False if not.

    '''
    return len(patient)<4*60
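
The loop in second handles recordings that run past midnight: times after midnight come out negative once the first timestamp is subtracted, so a full day is added back. Here is a small sketch of that correction. It assumes the time column holds HH:MM:SS strings, which is an assumption about the file format rather than something stated in this notebook.

```python
import pandas as pd

# Made-up timestamps crossing midnight, one sample every 5 seconds.
times = pd.Series(["23:59:50", "23:59:55", "00:00:00", "00:00:05"])

stamps = pd.to_datetime(times, format="%H:%M:%S")
elapsed = (stamps - stamps.iloc[0]).dt.total_seconds()

# Times after midnight are negative at this point: shift them by one day.
elapsed = elapsed.where(elapsed >= 0, elapsed + 24 * 60 * 60)
print(elapsed.tolist())  # [0.0, 5.0, 10.0, 15.0]
```

A single shift is enough in this example; the notebook's while loop repeats it so that recordings spanning several days are covered as well.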

### Sorting the recording files by time

We used this function to sort our csv files by time. This can be useful when the recording of a single patient is split across two consecutive csv files. It is then necessary to detect this and merge the two parts of the surgery. This case is not handled in this notebook and can be added if needed.

In [None]:
def sort_folder(folder):
    """takes the path to the data as argument and returns a dictionary containing as keys the name of the operating rooms 
    and as values a list of csv files sorted by date.
    Args :
        - folder : string (path to the folder containing the data)
    Returns :
        - dic : dictionary 
    """
    sorted_list = [
        (filename[0:7] + filename[-8:-4] + filename[-12:-10], filename) 
        for filename in os.listdir(folder)
    ]
    sorted_list.sort()
    
    dic = {}
    for string, filename in sorted_list : 
        key = filename[0:6]
        if key not in dic:
            dic[key] = []
            
        dic[filename[0:6]].append(filename)
    return dic

### Operating room cleaning duration


This dictionary gives, for each operating room, the shortest time needed to switch patients, expressed in samples (one sample every 5 seconds). If data is missing for less than this time span, then we are sure that no patient switch took place during that time. This duration differs between operating rooms because some of them deal with emergencies (operating room number 1) and others do not.

This dictionary can be adapted to all time constraints.

In [None]:
thresholds = {
        'bloc_1':15*60//5,
        'bloc_2':20*60//5,
        'bloc_3':20*60//5,
        'bloc_4':20*60//5,
        'bloc_5':20*60//5,
        'bloc_6':20*60//5,
        'bloc_7':20*60//5,
        'bloc_8':20*60//5,
        'Bloc_1':15*60//5,
        'Bloc_2':20*60//5,
        'Bloc_3':20*60//5,
        'Bloc_4':20*60//5,
        'Bloc_5':20*60//5,
        'Bloc_6':20*60//5,
        'Bloc_7':20*60//5,
        'Bloc_8':20*60//5
    }
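
The expressions in this dictionary convert minutes into numbers of rows. A hedged sketch of how the same table could be generated follows. The 5-second sampling period is inferred from the "// 5" above, and the names minutes_to_samples, cleaning_minutes and thresholds_sketch are illustrative only.

```python
SAMPLE_PERIOD_S = 5  # assumed: one row every 5 seconds


def minutes_to_samples(minutes, period_s=SAMPLE_PERIOD_S):
    """Convert a duration in minutes into a number of rows."""
    return minutes * 60 // period_s


# Cleaning time in minutes per operating room (room 1 handles emergencies).
cleaning_minutes = {1: 15, **{room: 20 for room in range(2, 9)}}

# Both capitalizations of the room name are generated, as in the cell above.
thresholds_sketch = {
    f"{prefix}_{room}": minutes_to_samples(minutes)
    for room, minutes in cleaning_minutes.items()
    for prefix in ("bloc", "Bloc")
}
print(thresholds_sketch["bloc_1"], thresholds_sketch["Bloc_8"])  # 180 240
```

Keeping the durations in minutes in one place makes it easier to adapt the table to other time constraints.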

### Splitting function

It's now time to gather everything. This function writes a csv file for every detected patient, in a new folder named "patients". Of course, we keep only patients with usable data (where time is not missing for example). To use this function,  give the path to the folder containing the data (csv files with encoding latin-1).

In [None]:
def processing(folder):
    if not os.path.exists("patients"):
        os.makedirs("patients")
    dic = sort_folder(folder)
    for bloc, filelist in dic.items():
        for filename in tqdm(filelist, desc = bloc):
            df = pd.read_csv(f"{folder}/{filename}", encoding = 'latin-1', low_memory=False) # 'AP' entries in float columns trigger a warning; this case is handled in mapping ('AP' is replaced by NaN)
            if df_is_empty(df) or time_is_missing(df):
                continue
            df['seconde'] = second(df)
            df = mapping(df, bloc_4 = (bloc.lower() == "bloc_4")) # file names may start with "bloc_4" or "Bloc_4"
            for ind, patient in enumerate(separation_patients(df,thresholds[bloc])):
                patient_name = filename.replace(".csv",f"_{ind}.csv")
                patient_filename = "patients/" + patient_name
                if patient_is_empty(patient):
                    continue
                patient.to_csv(patient_filename)

Here is an example of one of the dataframes written by this function. 
<img src="data_frame_patient.png"> </img>

## Labelling


This step can't be done in this notebook, labelling has to be done on the user's end. For the next step, a new training folder containing folders named by labels is needed.

<img src="folder_train_example.png" width=80%>

The main idea is to identify whether the surgery went without problems, and if not to only keep the part before the heart attack, in order to train the algorithm not to recognize an attack but to anticipate it, as shown in the video below.

<video src="labelling_video.mp4" width=80% autoplay loop>  </video>

### Examples

To be clearer, here are some patients with the labels we gave them, based only on heart rate and blood oxygenation

<figure> <img src="clean.png" width=80%> </img> <figcaption> Clean Patient</figcaption></figure>

<figure> <img src="anomaly.png" width=80%> </img> <figcaption> Anomaly Patient</figcaption></figure>

<figure> <img src="attack.png" width=80%> </img> <figcaption> Attack Patient</figcaption></figure>

## Patient class

The aim of this class is to simplify the use of the patients' data. It allows for manipulations on the patient's information
that are needed before sending it to a classifier with one unique structure. 

### Cleaning the data
These functions are used in the Patient class to fill small gaps and remove outliers (sometimes due to wrong measurements or to the surgery itself).

In [None]:
def clear_col(file):
    # Lists the non-empty columns of the csv file
    res = []
    df = pd.read_csv(file, encoding= 'unicode_escape')
    for col in df.columns:
        if df[col].notna().sum() != 0:
            res.append(col)
    return res, df

def g(x):
    '''Auxiliary function used for sorting'''
    return 40/x if x > 30 else 2

def get_rolling_features(serie, length):
    return (
        serie.rolling(length).std(),
        serie.rolling(length).mean()
    )

def moy(serie, length):
    '''Running average'''
    win = hamming(length)
    res = serie.copy()
    res[2:len(res)-1] = pd.Series(convolve(serie, win/np.sum(win), mode = 'valid'))
    return res

def intermed(serie, length, index, rl_std, rl_mean, duree_min = 6, window = 13):
    '''Auxiliary function used for sorting'''
    a = 0 # counts consecutive points that stay within the tolerance band
    for i in range(2 * window):
        while abs(serie[min(len(serie) -1, index-window+i)] - rl_mean[i]) < g(rl_std[i])*rl_std[i]:
            a+=1
            i+=1
            if a >= duree_min: # duration from which the point is no longer considered an outlier
                return(serie[min(len(serie) -1, index-window+i)])
        a=0
    return np.nan

def tri(serie, length = 20, j = 2, max_std = 2): # Meant to remove odd values caused by measurement errors. It is not used either.
    """Gets rid of outliers of the series "series" given as an argument."""
    res = serie.copy()
    rl_std, rl_mean = get_rolling_features(serie, length)
    for i in range(13, len(serie)-13):
        if rl_std[i] <= max_std or abs(serie[i] - rl_mean[i]) < g(rl_std[i])*rl_std[i]:
            res[i] = serie[i]
        else:     
            res[i] = intermed(serie, length, i, rl_std, rl_mean)
    return res

def lissage(serie):
    '''Smooths the serie given as an argument.'''
    return moy(serie, 4)

def new_series(serie, outlier):
    '''A combination of the 2 previous functions. Gets rid of outliers only if "outlier" is set to True.'''
    if outlier:
        return lissage(tri(serie)).interpolate(method = "slinear")
    return lissage(serie).interpolate(method = "slinear")

def new_series_gap_filling(serie):
    '''Fills the gaps of the serie by linear interpolation between known values.'''
    return serie.interpolate(method="slinear")
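
To give an idea of what the Hamming-weighted running average does, here is a standalone sketch using NumPy only. It swaps scipy's convolve and hamming for np.convolve and np.hamming, and the helper name hamming_smooth and the signal are made up for the example.

```python
import numpy as np


def hamming_smooth(values, length=4):
    """Weighted running average with a normalized Hamming window."""
    win = np.hamming(length)
    # mode="valid" keeps only positions where the window fits entirely,
    # so the output has len(values) - length + 1 points
    return np.convolve(values, win / win.sum(), mode="valid")


# A steady heart rate with one isolated spike.
signal = np.array([70.0, 70.0, 70.0, 90.0, 70.0, 70.0, 70.0])
smoothed = hamming_smooth(signal)
print(len(smoothed), smoothed.max())
```

Because the window is normalized, a constant signal is left unchanged while an isolated spike is spread out and reduced.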


### PCA Feature
The PCA can be considered a classifier in itself, but we decided to use it as a feature. The following function returns the distance between the real data and the data reconstructed by the PCA. More details on the PCA can be found in our report.

In [None]:
def compute_transform_error1(pca, dataset):
    dataset_transformed = pca.inverse_transform(
        pca.transform(dataset)
    )

    return np.linalg.norm(
        dataset - dataset_transformed,
        axis = -1
    )
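
The idea behind this feature is that samples resembling the training data are reconstructed almost perfectly, whereas unusual samples are not. The sketch below illustrates this with a plain NumPy SVD standing in for the fitted sklearn PCA; the data, the number of components and the name reconstruction_error are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training points lying close to a single direction in 3-D.
t = rng.normal(size=(200, 1))
train = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(200, 3))

mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:1]   # keep one principal direction


def reconstruction_error(samples):
    """Distance between each sample and its projection on the kept components."""
    centered = samples - mean
    projected = centered @ components.T @ components
    return np.linalg.norm(centered - projected, axis=-1)


typical = reconstruction_error(train[:5])
unusual = reconstruction_error(np.array([[5.0, -5.0, 5.0]]))
print(typical.max(), unusual[0])
```

The point that does not follow the learned direction gets a much larger error than the training samples, which is what makes the error usable as a feature.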

Here is a description of the properties and methods that compose the Patient class.

### Classmethod from_file :  
This class method reads a csv file and turns it into a Patient object in which the gaps in the patient's data have been filled.

### Properties
The following properties facilitate access to a patient's data: bloc, pouls, spo2, temp, pression, fr, seconde, duree.

### Preparing the data for the features:
- gap_filling method: fills the gaps in a patient's data by interpolation.
- smoothing method: smooths a patient's data with a Hamming-weighted running average.
- slice method: selects the time span in which a patient's data will be analysed. It is essential because some features, such as the Fourier transform, require data of similar lengths, but not all operations have the same duration. It also makes it possible to choose whether gap filling is included in the patient's data.
- standard method: standardizes a patient's data (sets the mean to zero and the standard deviation to 1).
- center method: centers a patient's data (sets the mean to zero).
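
As an illustration of the gap-filling step, here is a sketch of interior-only interpolation followed by trimming of the edges. It uses pandas' plain linear method with limit_area="inside" rather than the notebook's own interpolation call, and the values are made up.

```python
import numpy as np
import pandas as pd

serie = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Only NaNs surrounded by valid values are filled; the edges stay missing.
filled = serie.interpolate(method="linear", limit_area="inside")

# The remaining edge NaNs are then dropped and the index restarts at zero.
trimmed = filled.dropna().reset_index(drop=True)
print(trimmed.tolist())  # [1.0, 2.0, 3.0]
```

Leaving the edges untouched avoids inventing data before the first measurement or after the last one.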

### Features
All the following methods return features:
- coefsPouls1_Spo2_0
- coefsPouls0_Spo2_0
- mcr_spo2_0
- mcr_pouls_1
- ondelettes_SpO2_Pouls
- energy_dwt
- moment
- corr_P_S_dwt
- mean
- std
- min
- max
- first quartile
- median
- third quartile
- fourier
- pca_error
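
The mcr features are mean crossing rates computed on wavelet coefficients. As a reminder of what this quantity measures, here is a sketch of one common definition applied to a plain array. The helper name and the values are illustrative, and the notebook's methods apply their own computation to wavelet coefficients rather than to raw values.

```python
import numpy as np


def mean_crossing_rate(values):
    """Fraction of samples at which the series crosses its own mean."""
    values = np.asarray(values, dtype=float)
    centered = values - np.nanmean(values)
    # a crossing is a sign change between two consecutive centered samples
    crossings = np.sum(np.diff(np.sign(centered)) != 0)
    return crossings / len(centered)


steady = [70, 71, 72, 73, 74, 75]     # crosses its mean once
jittery = [70, 75, 70, 75, 70, 75]    # crosses its mean at every step

print(mean_crossing_rate(steady), mean_crossing_rate(jittery))
```

A slowly drifting signal gives a low rate, whereas a signal oscillating around its mean gives a rate close to 1.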

### Getting features
This method helps us gather all the features that we need for a patient. It applies the preceding functions, which calculate features on a patient's data, with the chosen parameters. The results are stored in a pandas DataFrame in which each column is a feature.

In [None]:
samplerate =1/5


class Patient:
    
    def __init__(self,name,df, label = None):
        self.df = df
        self.name = name
        self.label = label

        
    @classmethod
    def from_file(cls, filename, path="", label=None):
        '''from a file name and a path, returns a patient to whom gap filling has been applied
        Args : 
            - filename : str -> name of the csv file
            - path : str -> path to the folder containing the file, ending with "/"
            - label : str -> patient's label
        Returns : 
            - Patient object
        '''
        df = pd.read_csv(f"{path}{filename}", encoding="ISO-8859-1")
        return cls(filename, df.drop("Unnamed: 0",axis=1), label).gap_filling()
    
    @property
    def bloc(self):
        listname = self.name.split("_") 
        return int(listname[1])
        
    @property
    def pouls(self):
        return (self.df["Pouls"]).to_numpy(copy=True)

    @property
    def spo2(self):
        return (self.df["SpO2"]).to_numpy(copy=True)

    @property
    def temp(self):
        return (self.df["Temperature"]).to_numpy(copy=True)

    @property
    def fr(self):
        return (self.df["FR"]).to_numpy(copy=True)

    @property
    def seconde(self):
        return (self.df["seconde"]).to_numpy(copy=True)

    @property
    def pression(self):
        return (self.df["Pression"]).to_numpy(copy=True)

    @property
    def duree(self):
        return self.seconde[-1]-self.seconde[0]
    
    def slice(self, begin, leng):
        '''returns the Dataframe of a patient within the chosen time span.
        Args : 
            - begin : int -> begining of the window
            - leng : int -> length of the window
        Return :
            - Patient object
        '''
        # reset_index returns a new DataFrame, so self.df is left untouched
        df_sliced = self.df.iloc[begin:begin + leng].reset_index(drop=True)
        return Patient(self.name, df_sliced, self.label)
    
    def standard(self):
        '''returns the normalized Dataframe of a patient.
        Args : 

        Returns : 
            - Patient object -> Patient whose DataFrame is centered and scaled to unit standard deviation
         '''
        df = self.df.copy()
        df = (df - df.mean()) / df.std()
        return Patient(self.name, df, self.label)

    def center(self):
        '''returns the centered Dataframe of a patient.
        Args : 

        Return : 
            - Patient object -> Patient whose DataFrame is centered. 
        '''
        df = self.df.copy()
        df -= df.mean()
        return Patient(self.name, df, self.label)
    
    def  smoothing(self, outlier):
        '''smooths the Dataframe of a patient.
        Args : 
            -outlier: bool -> remove the outliers in the dataframe if True

        Return : 
            - Patient object
        '''
        df_smoothed  = pd.concat(
            [
                new_series(self.df[col], outlier and self.bloc==2) 
                for col in self.df.columns
            ], 
            axis = 1
        )
        return type(self)(self.name, df_smoothed, self.label)

    def  gap_filling(self):
        ''' interpolates missing values, except the NaN values at the beginning and at the end of the surgery.
        Args : 
            
        Return : 
            - Patient object -> whose DataFrame no longer has NaNs in the Pouls and SpO2 columns
        '''
        df_filled  = pd.concat(
            [
                new_series_gap_filling(self.df[col])
                for col in self.df.columns
            ], 
            axis = 1
        )
        
        # The interpolation can leave NaNs at the edges of the DataFrame; this line removes those rows.
        df_filled = df_filled[df_filled["Pouls"].notna() & df_filled["SpO2"].notna()].copy()
        # Reset the index so that it starts at zero; drop=True avoids adding the old index as a new column.
        df_filled.reset_index(inplace = True, drop = True)
        return type(self)(self.name, df_filled, self.label)
    
    #Beginning of the features:

    def coefsPouls1_Spo2_0(self,begin,end):
        """
        Returns the correlation coefficient between the very-low-frequency SpO2 wavelet coefficients [0] and the low-frequency pulse wavelet coefficients [1]
        Args:
            - begin : number -> beginning of the window in minutes
            - end : number -> end of the window in minutes
        Return:
            - float -> coefficient
        """
        df = self.df.iloc[int(begin*60//5): int(end*60//5)]

        coefs_pouls = pywt.wavedec(df['Pouls'], level = 3, wavelet = 'db4')
        coefs_Spo2 = pywt.wavedec(df['SpO2'], level = 3, wavelet = 'db4')

        # A constant SpO2 yields a warning and a NaN; this case is handled in clean.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            r, _ = sc.stats.pearsonr(coefs_pouls[1],coefs_Spo2[0])
        return r
    
    def coefsPouls0_Spo2_0(self,begin,end):
        """
        Returns the correlation coefficient between the very-low-frequency SpO2 wavelet coefficients [0] and the very-low-frequency pulse wavelet coefficients [0]
        Args:
            - begin : number -> beginning of the window in minutes
            - end : number -> end of the window in minutes
        Return:
            - float -> coefficient
        """
        df = self.df.iloc[int(begin*60//5): int(end*60//5)]

        coefs_pouls = pywt.wavedec(df['Pouls'], level = 3, wavelet = 'db4')
        coefs_Spo2 = pywt.wavedec(df['SpO2'], level = 3, wavelet = 'db4')
        
        # A constant SpO2 yields a warning and a NaN; this case is handled in clean.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            r, _ = sc.stats.pearsonr(coefs_pouls[0],coefs_Spo2[0])
        return r
    
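    def mcr_spo2_0(self,begin,end):
        """Returns the mean crossing rate of the SpO2_0 series of the patient, i.e. the coefficients at index 0 of its wavelet decomposition
        Args:
            - begin : number -> beginning of the window in minutes
            - end : number -> end of the window in minutes
        Returns:
        float -> mean crossing rate"""
        # iloc returns a new Series: the result must be assigned for the window to be applied
        spo2 = self.df['SpO2'].iloc[int(begin*60//5): int(end*60//5)]
        coefs = pywt.wavedec(spo2, level = 3, wavelet= 'db4')
        spo2_0 = coefs[0]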

        # mean crossing rate: fraction of consecutive coefficients lying on opposite sides of the mean
        spo2_0 -= np.nanmean(spo2_0)
        m = np.sum(np.diff(np.sign(spo2_0)) != 0)

        return m / len(spo2_0)

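    def mcr_pouls_1(self,begin,end):
        """Returns the mean crossing rate of the Pulse1 series of the patient, i.e. the coefficients at index 1 of its wavelet decomposition
        Args:
            - begin : number -> beginning of the window in minutes
            - end : number -> end of the window in minutes
        Returns:
        float -> mean crossing rate"""
        # iloc returns a new Series: the result must be assigned for the window to be applied
        pouls = self.df['Pouls'].iloc[int(begin*60//5): int(end*60//5)]
        coefs = pywt.wavedec(pouls, level = 3, wavelet = 'db4')
        pouls1 = coefs[1]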

        # mean crossing rate: fraction of consecutive coefficients lying on opposite sides of the mean
        pouls1 -= np.nanmean(pouls1)
        m = np.sum(np.diff(np.sign(pouls1)) != 0)

        return m / len(pouls1)

    def ondelettes_SpO2_Pouls(self,begin,end):
        """Returns the number of times i for which there is a reaction of the Pulse to an action of the Spo2
        Args: 
        - self
        Returns:
        - L : int"""
        df = self.df
        df = df.iloc[int(begin*60//5) : int(end*60//5)]
        Pouls= df['Pouls'].interpolate(method = 'slinear')
        Spo2 = df['SpO2'].interpolate(method = 'slinear')

        mask1 = Pouls.notna()
        mask2 = Spo2.notna()
        mask3 = mask1 * mask2
        
        Pouls = Pouls[mask3]
        Spo2 = Spo2[mask3]

        if not len(Pouls):
            return 0
        
        coefs_pouls = pywt.wavedec(Pouls, level = 2, wavelet= 'db4')
        coefs_Spo2 = pywt.wavedec(Spo2, level = 2, wavelet= 'db4')

        Pouls_1 = coefs_pouls[2]
        SpO2_1 = coefs_Spo2[2]

        L = 0
        Pouls_1 = pd.Series(Pouls_1)
        SpO2_1 = pd.Series(SpO2_1)
        for i in range(3,len(Pouls_1)-3):
            if np.abs(SpO2_1[i]) > SpO2_1.std()*2 and (np.abs(Pouls_1[i:i+5]) > Pouls_1.std()*2).any():
                L+=1
        return L

        
    def energy_dwt(self, features_list, start=0, length=10, level=4, outlier = True):
        """
        Calculates the energy of the detail coefficients of the
         multilevel Discrete Wavelet Transform of the pulse.
         The size and position of the window are given as parameters.
        Args:
        features_list : list of str -> names of the energy features to return ("energy_1" to "energy_4")
        start : int -> beginning of the chosen window in minutes
        length : int -> length of the chosen window in minutes
        level : int -> number of decomposition levels (the returned dictionary assumes 4)
        outlier : bool -> passed to the smoothing method, which removes the outliers if True

        Returns:
        -dict -> maps each requested feature name to its energy value
        
        """
        patient = self.smoothing(outlier)
        data_pouls = patient.pouls[~np.isnan(patient.pouls) | ~np.isnan(patient.spo2)]
        
        start, length = int(start*60/5), int(length*60/5)
        waves = pywt.wavedec(data_pouls[start: start+length], 'db4', 'symmetric', level=level)
        
        energy = [
            np.nansum(np.abs(waves[lev]))/len(waves[lev])
            for lev in range(1,level+1)
        ]
        
        dict_wav = {
                "energy_1" : energy[3], 
                "energy_2" : energy[2], 
                "energy_3" : energy[1], 
                "energy_4" : energy[0]
        }
        
        return { 
            key: value 
            for key,value in dict_wav.items() 
            if key in features_list
        }


    def moment(patient, recordstype, order=1, center=False, scale=False):
        """
        Computes the 1st, 2nd, 3rd or 4th moment of a series, which can also be centered and/or scaled.
        
        Most relevant moments :
        - Skewness : order=3, center=True, scale=True
        - Non-normalized kurtosis : order=4, center=True, scale=True
        
        Args:
        -patient : Patient
        -recordstype : string
        -order : int, optional
        -center : bool, optional
        -scale : bool, optional

        Returns:
        -float -> value of the moment
        
        """
        tab=patient.df[recordstype].to_numpy()
        tab = tab[~np.isnan(tab)] # delete the NaNs (this also creates a copy, so the DataFrame is not modified)

        if center:
            tab -= np.mean(tab)
        
        if scale:
            # if the series is constant we get a warning and a NaN, this case is treated in clean
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                tab /= np.std(tab)
        
        return np.sum(tab ** order) / len(tab)
    
    def corr_P_S_dwt(self, start=0, length=20, level=3):
        """
        Computes the fraction of coarsest-level DWT detail coefficients for which the pulse coefficient
         is below kP times its std while the SpO2 coefficient is above kS times its std.
        Args :

        -start : int -> beginning of the window in minutes
        -length : int -> length of the window in minutes
        -level : int -> number of decomposition levels

        Returns:
        -float
        
        """
        kP, kS = .5, .5  # coeff std Pouls, SpO2
        
        start, length = int(start*60/5), int(length*60/5)
        
        wavesP = pywt.wavedec(self.df['Pouls'][start:start+length], 'db4', 'symmetric', level = level)
        wavesS = pywt.wavedec(self.df['SpO2'][start:start+length], 'db4', 'symmetric', level = level)
        
        stdP, stdS = kP * np.nanstd(wavesP[1]), kS * np.nanstd(wavesS[1])

        return np.nansum((np.abs(wavesP[1]) < stdP) & (np.abs(wavesS[1]) > stdS))/len(wavesP[1])

    def mean(patient, recordstype):
        return patient.df[recordstype].mean()
    
    def std(patient, recordstype):
        return patient.df[recordstype].std()
    
    def min(patient, recordstype):
        return patient.df[recordstype].min()
    
    def max(patient, recordstype):
        return patient.df[recordstype].max()
    
    def first_quartile(patient, recordstype):
        return patient.df[recordstype].quantile(0.25)
    
    def median(patient, recordstype):
        return patient.df[recordstype].quantile(0.5)
    
    def third_quartile(patient, recordstype):
        return patient.df[recordstype].quantile(0.75)
        
    def fourier(self, features_list, nperseg, begin, end, outlier=True) :
        
        """
        Calculates 6 features:
             - integral over frequency and time of abs(X(f)) for the pulse, then for the SpO2,
             - correlation coefficient between the frequency integrals of abs(X(f)) of the Pulse and of the SpO2
             - list of correlation coefficients between certain X(f) of the Pulse and of the SpO2
        ---Parameters---
        self = patient (smoothed and standardized inside the method)
        features_list = list of str, names of the features to return
        nperseg = int, window of stft in number of points (50 in general) 
        begin, end = int, chosen temporal window in number of points
        outlier = bool, if True, remove the outliers
        
        ---Returns---
        dictionary restricted to the names in features_list, built from:
        pw_pouls_abs = float
        pw_SpO2_abs = float
        coeff_pw_abs = float
        coeff_freq = list, dtype = float, len(list) = 3 (returned as coeff_freq1, coeff_freq2, coeff_freq3)
        """
        patient = self.smoothing(outlier).standard()
        data_pouls = patient.pouls[~np.isnan(patient.pouls) | ~np.isnan(patient.spo2)]
        data_SpO2 = patient.spo2[~np.isnan(patient.spo2) | ~np.isnan(patient.pouls)]

        fp, tp, spec_p = sg.stft(
            data_pouls[begin : end],
            samplerate,
            'blackman',
            nperseg = nperseg
        )

        fs, ts, spec_s = sg.stft(
            data_SpO2[begin : end],
            samplerate,
            'blackman',
            nperseg = nperseg
        )
        
        df_fft = fp[1] - fp[0]    
        pw_t_pouls_abs = [
            np.sum(np.abs(fft_t)) * df_fft 
            for fft_t in spec_p.T
        ]

        pw_t_SpO2_abs = [
            np.sum(np.abs(fft_t)) * df_fft 
            for fft_t in spec_s.T
        ]

        dt = tp[1] - tp[0]
        pw_pouls_abs = sum(pw_t_pouls_abs) * dt
        pw_SpO2_abs = sum(pw_t_SpO2_abs) * dt
        
        coeff_pw_abs = np.corrcoef(pw_t_pouls_abs, pw_t_SpO2_abs)[0,1]

        coeff_freq = [
            np.corrcoef(np.abs(spec_p[freq]), np.abs(spec_s[freq]))[0,1]
            for freq in ([6,13,15])
        ]
        
        dict_fourier= {
            "pw_pouls_abs" : pw_pouls_abs, 
            "pw_SpO2_abs" : pw_SpO2_abs, 
            "coeff_pw_abs" : coeff_pw_abs, 
            "coeff_freq1" : coeff_freq[0], 
            "coeff_freq2" : coeff_freq[1], 
            "coeff_freq3" : coeff_freq[2] 
        }
        return { 
            key: value 
            for key,value in dict_fourier.items() 
            if key in features_list
        }
    
    def pca_error(self,pca, begin=120, length=180):
        return float(
            compute_transform_error1(
                pca,
                [self.standard().slice(begin,length).pouls]
            )
        )

    


# Function which gathers the features:
def get_features(patient, parameters):
    ''' Builds the one-row DataFrame of a patient's features.
    Args : 
        - patient : Patient
        - parameters : dict -> key: feature name; value: dictionary mapping the parameter names of the associated method to their values
    Return : 
        - df_features : pd.DataFrame
    '''
    function_mapping = {
        "mean_pouls": patient.mean,
        "mean_SpO2": patient.mean,
        "mean_FR": patient.mean, 
        "mean_pression": patient.mean, 
        "mean_temperature": patient.mean,
        "std_pouls": patient.std,
        "std_SpO2": patient.std,
        "std_FR": patient.std, 
        "std_pression": patient.std, 
        "std_temperature": patient.std,
        "min_pouls": patient.min,
        "min_SpO2": patient.min,
        "min_FR": patient.min, 
        "min_pression": patient.min, 
        "min_temperature": patient.min,
        "max_pouls": patient.max,
        "max_SpO2": patient.max,
        "max_FR": patient.max, 
        "max_pression": patient.max, 
        "max_temperature": patient.max,
        "median_pouls": patient.median,
        "median_SpO2": patient.median,
        "median_FR": patient.median, 
        "median_pression": patient.median, 
        "median_temperature": patient.median,
        "first_quartile_pouls": patient.first_quartile,
        "first_quartile_SpO2": patient.first_quartile,
        "first_quartile_FR": patient.first_quartile, 
        "first_quartile_pression": patient.first_quartile, 
        "first_quartile_temperature": patient.first_quartile,
        "third_quartile_pouls": patient.third_quartile,
        "third_quartile_SpO2": patient.third_quartile,
        "third_quartile_FR": patient.third_quartile, 
        "third_quartile_pression": patient.third_quartile, 
        "third_quartile_temperature": patient.third_quartile,
        "moment3_SpO2":patient.moment,
        "moment3_Pouls":patient.moment,
        "moment4_SpO2":patient.moment,
        "moment4_Pouls":patient.moment,
        "fourier":patient.fourier,
        "pca": patient.pca_error, 
        "wavelet_corr_spo2_0_pouls_1": patient.coefsPouls1_Spo2_0, 
        "wavelet_corr_spo2_0_pouls_0": patient.coefsPouls0_Spo2_0, 
        "energy_dwt" : patient.energy_dwt, 
        "corr_P_S_dwt" : patient.corr_P_S_dwt, 
        "mcr_pouls_1" : patient.mcr_pouls_1, 
        "mcr_spo2_0" :patient.mcr_spo2_0, 
        "ondelettes_SpO2_Pouls" : patient.ondelettes_SpO2_Pouls
    }
    
    df_features = pd.DataFrame([patient.name], columns = ['name'] )
    df_features['label'] = patient.label
    
    for feature in parameters.keys():
        if feature == "fourier":
            dict_fourier = function_mapping[feature](**parameters[feature])
            for key, value in dict_fourier.items():
                df_features[key] = value
        elif feature == "energy_dwt":
            dict_wav = function_mapping[feature](**parameters[feature])
            for key, value in dict_wav.items():
                df_features[key] = value
        else: 
            df_features[feature] = [function_mapping[feature](**parameters[feature])]
    
    return df_features


## PatientsList Class

This class groups patients together and simplifies the bulk generation of features.
A PatientsList object has a single attribute, which is a list of Patient objects.
The idea of this class is to apply the methods of the Patient class directly to a group of patients.

### Getting a list of patients
The class method from_folder turns the csv files of patients contained in a folder (each file must correspond to exactly one operation) into a PatientsList object. The clean method (described below) is applied to this list, so that once the PatientsList object is built there are no out-of-range indexes or NaNs in the columns.

### Making the class easier to use
- magic method len : gives access to the length of a PatientsList object as if it were a regular list.
- magic method iter : makes PatientsList objects iterable.
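
As a minimal sketch of these two magic methods, the toy container below shows how `__len__` and `__iter__` let a wrapper behave like a list. The class name `Box` and its contents are hypothetical and only stand in for PatientsList.

```python
class Box:
    """Toy wrapper around a list (hypothetical stand-in for PatientsList)."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        # len(box) delegates to the wrapped list
        return len(self.items)

    def __iter__(self):
        # "for x in box" iterates over the wrapped list
        return iter(self.items)


box = Box(["patient_a", "patient_b", "patient_c"])
n_items = len(box)
names = [name for name in box]
```

With both methods defined, the wrapper can be passed to anything that only needs a length or an iteration, such as a list comprehension or tqdm.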

### Cleaning the list
method clean : builds a new PatientsList object keeping only the patients whose saturation and pulse columns are complete and whose record is long enough to be cut on a given window.

### Apply Patient class methods
The four following methods apply the corresponding methods of the Patient class to every patient of a PatientsList object.
- method slice : slices the patient's data for each patient in the list.
- method standard : standardizes the patient's data for each patient in the list. 
- method center: centers the patient's data for each patient in the list.
- method smoothing : smooths the patient's data for each patient in the list.

The same idea is available for any function through the following method:
- method apply_to_list_patient : applies a function that takes a patient as first argument to all of the patients of a PatientsList object.
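
The pattern behind apply_to_list_patient can be sketched on a toy container. The names `Box`, `apply` and `scale` are hypothetical; the only assumption taken from the class is that the function is called as `function(item, *args)` and that the results are wrapped in a new container.

```python
class Box:
    """Toy wrapper around a list (hypothetical stand-in for PatientsList)."""

    def __init__(self, items):
        self.items = list(items)

    def apply(self, function, args=()):
        # Call function(item, *args) on each item and wrap the results in a new Box
        return Box([function(item, *args) for item in self.items])


def scale(values, factor):
    # Hypothetical per-item transformation: multiply every value by factor
    return [v * factor for v in values]


boxes = Box([[1, 2], [3, 4]])
scaled = boxes.apply(scale, (10,))
```

Returning a new container rather than modifying the items in place keeps the original list usable for other transformations.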

### Preparing a train list
The following two methods select a subset of the patients:
- method bloc_selection : builds a new PatientsList object with only the patients operated on in a chosen bloc (operating room).
- method label_selection : builds a new PatientsList object with only the patients that have received a given label.

The next two split a list into two sets of patients: the first one can be the training set and the second one the testing set. This is useful to test a new classifier.
- method split_training : splits a PatientsList object in two: one PatientsList for the training set, on which the classifier is trained, and one PatientsList for the testing set, on which the classifier's performance is evaluated. This is essential because an algorithm cannot be tested on data it has already been trained on; this would introduce a huge bias. The split is purely random, so the proportion of labels in the two lists is not guaranteed to match that of the data.
- method split_homogene : similar to split_training, but the split is done label by label, so the proportion of labels in the two lists is the same as in the data.
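
The difference between the two splits can be illustrated with a small stratified split on plain tuples. This is only a sketch under stated assumptions: the function `stratified_split`, its `seed` parameter and the toy cohort are hypothetical and are not the code of split_homogene.

```python
from collections import defaultdict
from random import Random


def stratified_split(items, get_label, ratio_kept, seed=0):
    """Split items into (train, test), applying ratio_kept inside each label group."""
    rng = Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[get_label(item)].append(item)

    train, test = [], []
    for group in by_label.values():
        shuffled = rng.sample(group, len(group))  # shuffled copy of the group
        n = int(ratio_kept * len(shuffled))
        train.extend(shuffled[:n])
        test.extend(shuffled[n:])
    # Mix the labels so that neither list is ordered by label
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy cohort: 8 "clean" and 4 "attack" records, stored as (name, label) tuples
cohort = [(f"p{i}", "clean") for i in range(8)] + [(f"q{i}", "attack") for i in range(4)]
train, test = stratified_split(cohort, get_label=lambda rec: rec[1], ratio_kept=0.75)
```

Because the ratio is applied inside each label group, the 2:1 proportion of clean to attack records of the toy cohort is preserved in both lists.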


- method label : returns a DataFrame with the names of the patients of a PatientsList and their labels. It is necessary because the classifiers need access to the labels of the patients when they are trained, and also when they are tested.



### Getting the features: 
The get_features_list_patient method builds a DataFrame in which each column corresponds to a feature and each row corresponds to a patient
of the PatientsList object. Computing the features takes a long time, so storing them in a DataFrame saves a lot of time for the rest
of the process. Moreover, pandas DataFrames make the features easier to access.
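
The layout of such a table, and one possible way to avoid recomputing it, can be sketched as follows. The rows and their values are made up for illustration, and the CSV round trip through an in-memory buffer is an assumption about how the table could be cached, not something the notebook does.

```python
import io

import pandas as pd

# Hypothetical per-patient feature rows (values are made up for illustration)
rows = [
    {"name": "patient_a", "label": "clean", "mean_pouls": 72.0, "std_pouls": 3.1},
    {"name": "patient_b", "label": "attack", "mean_pouls": 95.5, "std_pouls": 9.8},
]

# One row per patient, one column per feature, indexed by patient name
df_features = pd.DataFrame(rows).set_index("name")

# Write the table once, then reload it instead of recomputing the features
buffer = io.StringIO()
df_features.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="name")

# Features of a single patient are then a simple lookup
mean_pouls_b = reloaded.loc["patient_b", "mean_pouls"]
```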


### PCA
The train_pca method is set apart because, as said before, the PCA is a model in itself and has to be fitted on a training set. We used it as a feature because we thought it was a good way to see whether the data we had was easy to summarize or not. This method fits the PCA.
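
The idea of using a PCA as a feature can be sketched with a reconstruction error on synthetic signals. This is an assumption about the general technique: the helper `reconstruction_error` and the signals below are hypothetical, and it is not claimed that compute_transform_error1 computes exactly this quantity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)

# Synthetic "clean" training signals: sinusoids with random amplitude and small noise
train = np.array([
    a * np.sin(2 * np.pi * t) + 0.01 * rng.standard_normal(50)
    for a in rng.uniform(0.5, 1.5, 30)
])

pca = PCA(n_components=2).fit(train)


def reconstruction_error(model, signal):
    """Mean squared error between a signal and its PCA reconstruction."""
    x = np.asarray(signal).reshape(1, -1)
    rebuilt = model.inverse_transform(model.transform(x))
    return float(np.mean((x - rebuilt) ** 2))


typical = np.sin(2 * np.pi * t)   # looks like the training signals
unusual = rng.standard_normal(50)  # white noise, unlike the training signals

err_typical = reconstruction_error(pca, typical)
err_unusual = reconstruction_error(pca, unusual)
```

A signal that resembles the training set is well summarized by a few components and gets a small error, while a signal that does not resemble it gets a large one. This is what makes the error usable as a single numeric feature.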

In [None]:
class PatientsList:
    """Represents a list of patients
    """
    def __init__(self, patient_list: List[Patient]):
        self.patientslist = patient_list

    @classmethod
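    def from_folder(cls, path_to_folder, label=True):
        """Returns a PatientsList instance built from a folder containing data.
        Args:
            -path_to_folder : string -> path to the folder (must end with "/" when label is True)
            -label : bool -> if True, the folder contains one subfolder per label
        Returns :
            -PatientsList object
        """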

        patient_list = []
        if label:
            for foldername in os.listdir(path_to_folder):
                for filename in os.listdir(f"{path_to_folder}{foldername}"):
                    patient = Patient.from_file(
                        filename, f"{path_to_folder}{foldername}/", foldername
                    )
                    patient_list.append(patient)
        else: 
            for filename in os.listdir(path_to_folder):
                patient = Patient.from_file(
                    filename, f"{path_to_folder}/"
                )
                patient_list.append(patient)
        return cls(patient_list).clean()

    def __len__(self):
        return len(self.patientslist)

    def __iter__(self):
        return iter(self.patientslist)

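    def clean(self, begin=120, length=180):
        """Keeps only the patients whose Pouls and SpO2 columns contain no NaN and whose record is longer than begin + length.
        Args:
            -begin: int -> beginning of the window, in number of points
            -length : int -> length of the window, in number of points
        Returns :
            -PatientsList object
        """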

        cleaned_list = [ 
            patient 
            for patient in self.patientslist 
            if (
                patient.df["Pouls"].notna().all() 
                and patient.df["SpO2"].notna().all() 
                and len(patient.pouls)>begin+length
            )
        ]
        return PatientsList(cleaned_list)

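    def slice(self, begin, length, to_the_end=False):
        """Slices the data of every patient of the list; patients whose record is too short are dropped.
        Args:
            -begin : int -> first point of the window
            -length : int -> number of points of the window (ignored if to_the_end is True)
            -to_the_end : bool -> if True, keep everything from begin to the end of the record
        Return:
            -PatientsList object
        """
        if to_the_end:
            slicedlist = [
                patient.slice(begin, len(patient.df)-begin)
                for patient in self.patientslist
                if (len(patient.df) >= begin)
            ]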
        else:
            slicedlist = [
                patient.slice(begin, length)
                for patient in self.patientslist
                if (len(patient.df.iloc[begin:]) >= length)
            ]
        return PatientsList(slicedlist)
    
    def standard(self):
        """Standardizes the data of each patient of the list.
        Returns:
            -PatientsList object
        """

        standardedlist = [
            patient.standard()
            for patient in self.patientslist
        ]
        return PatientsList(standardedlist)
    
    def center(self):
        """Returns the centered data of each patient of the list.
        Returns:
            -PatientsList object
        """

        centeredlist = [
            patient.center()
            for patient in self.patientslist
        ]
        return PatientsList(centeredlist)

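    def smoothing(self, outlier=True):
        """Smooths the data of each patient of the list (outliers are removed if outlier is True)."""
        smoothedlist = [patient.smoothing(outlier) for patient in self.patientslist]
        return PatientsList(smoothedlist)

    def apply_to_list_patient(self, fonction, args):
        """Applies a function to every patient of the list of patients.
        Args:
            -fonction : callable -> called as fonction(patient, *args)
            -args : list -> list of the extra arguments of the function
        """
        appliedlist = [fonction(patient, *args) for patient in self.patientslist]
        return PatientsList(appliedlist)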

    def bloc_selection(self, k):
        """Selects all the patient operated in the operating room number "k".
        Args :
            -k : int -> chosen operating room
        Returns:
            -PatientList
        """
        patientlist = [
            self.patientslist[i]
            for i in range(len(self.patientslist))
            if self.patientslist[i].bloc == k
        ]
        return PatientsList(patientlist)

    def label_selection(self, label):
        """Selects all the patients labeled as "label".
        Args :
            -label : string -> chosen label
        Returns:
            -PatientList
        """

        patientlist = [
            self.patientslist[i]
            for i in range(len(self.patientslist))
            if self.patientslist[i].label == label
        ]
        return PatientsList(patientlist)

    
    def split_training(self, ratio_kept):
        """Splits a list of patients into a train list and a test list.
        Args : 
            -ratio_kept : float -> fraction of the patients put in the train list (between 0 and 1).
        Returns :
            -Patientlist object tuple
        """

        list_to_split = sample(self.patientslist, len(self))
        n = int(ratio_kept * len(self.patientslist))
        return PatientsList(list_to_split[:n]), PatientsList(list_to_split[n:])
    
    def split_homogene(self, ratio_kept, labels_choisis):
        """Splits a list of patients into a train list and a test list where the proportion of labels are consistent with that of the cohort.

        Args : 
            -ratio_kept : float -> fraction of the patients of each label put in the train list (between 0 and 1).
            -labels_choisis : list -> list of the labels kept in the train and test lists
        Returns:
            - tuple of Patientlist objects
        """ 

        
        list_by_labels = [
            self.label_selection(label).split_training(ratio_kept)
            for label in labels_choisis
        ]
        list_by_labels_train = [el[0] for el in list_by_labels]
        list_by_labels_test = [el[1] for el in list_by_labels]
        list_train = list(itertools.chain(*list_by_labels_train))
        list_test = list(itertools.chain(*list_by_labels_test))
        
        shuffle(list_train)
        shuffle(list_test)
        
        return PatientsList(list_train), PatientsList(list_test)
        
        

    def label(self):
        """Returns a DataFrame containing the label of each patient, indexed by patient name.
        Returns:
            -Pandas.DataFrame
        """ 

        # DataFrame.append was removed in pandas 2.0: build the frame from a list of rows instead
        df = pd.DataFrame(
            [{"name" : patient.name, "label" : patient.label} for patient in self.patientslist],
            columns=["name","label"]
        )
        df = df.set_index("name")
        return df

    def get_features_list_patient(self, arguments):
        """Returns a Dataframe containing all of the features of each patient.
        Args :
            -arguments : dict -> dictionary containing as keys the names of the chosen features and as values dictionaries, containing as keys 
            the names of the parameters of the features and as values the chosen values.
        Returns :
            -Pandas.DataFrame
        """

        # DataFrame.append was removed in pandas 2.0: concatenate the one-row frames instead
        df_features = pd.concat(
            [get_features(patient, arguments) for patient in self.patientslist],
            ignore_index = True
        )
        df_features.set_index("name", inplace = True)
        for col in df_features.columns[2:]:
            # features based on a correlation or on a division by the standard deviation are NaN when the SpO2 is constant: replace them with the column mean
            df_features.loc[df_features[col].isna(),col] = df_features[col].mean()
        return df_features

    def train_pca(self, n_components=20,  label="clean", begin=120, length=180):
        '''Fits a PCA on the pulse signals of the patients of the list that have the given label ("clean" by default).
        Args :
            -n_components : int -> number of components.
            -label : string -> label of the patients used to fit the PCA.
            -begin : int -> beginning of the window on which the PCA will work.
            -length : int -> length of the window.
        Returns:
            -new_pca : sklearn.decomposition.PCA -> fitted PCA

        '''
        clean_train = [
            patient.pouls
            for patient in self.label_selection(label).standard().slice(begin, length).patientslist 
        ]
        new_pca = decomposition.PCA(n_components)
        new_pca.fit(clean_train)
        return new_pca





### Getting Features

The get_features function takes a dictionary as argument: its keys are the names of the features and its values are dictionaries mapping the argument names of the corresponding feature method to their values.

A feature can easily be added to the process (in this dictionary, in the get_features function and in the Patient class).
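
The dictionary-of-dictionaries convention can be sketched on toy data. The sample values, the functions `mean_of` and `count_above` and the feature name `high_pouls` are hypothetical; only the dispatch pattern with `**kwargs` mirrors the one used in get_features.

```python
import statistics

# Toy records (made-up values) indexed by record type
samples = {"Pouls": [60, 62, 64, 66], "SpO2": [97, 98, 99, 98]}


def mean_of(recordstype):
    return statistics.mean(samples[recordstype])


def count_above(recordstype, threshold):
    # Hypothetical feature: number of points strictly above a threshold
    return sum(v > threshold for v in samples[recordstype])


# Feature name -> function computing it
function_mapping = {
    "mean_pouls": mean_of,
    "mean_SpO2": mean_of,
    "high_pouls": count_above,
}

# Feature name -> keyword arguments of the associated function
parameters = {
    "mean_pouls": {"recordstype": "Pouls"},
    "mean_SpO2": {"recordstype": "SpO2"},
    "high_pouls": {"recordstype": "Pouls", "threshold": 63},
}

# Only the features listed in parameters are computed
features = {
    name: function_mapping[name](**kwargs)
    for name, kwargs in parameters.items()
}
```

Adding a feature then amounts to one new entry in each of the two dictionaries plus the function itself.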

<a id="all_features"></a>


In [None]:
feature_dict_all = {"fourier" : 
              {"features_list" : ["pw_pouls_abs","pw_SpO2_abs","coeff_pw_abs","coeff_freq1","coeff_freq2","coeff_freq3"], 
                "nperseg" : 50,
                "begin" : 0,
                "end" : 270},
                "energy_dwt" : 
              {"features_list" : ["energy_1","energy_2","energy_3","energy_4"], "start" : 0, "length" : 32.5},
              "ondelettes_SpO2_Pouls": {"begin" : 0, "end": 20},
              "moment3_Pouls" :
              {'recordstype' : 'Pouls', 'order' : 3 , 'centre' : True, 'reduit' : True },
              "moment3_SpO2" :
              {'recordstype' : 'SpO2', 'order' : 3 , 'centre' : True, 'reduit' : True },
              "moment4_Pouls" :
              {'recordstype' : 'Pouls', 'order' : 4 , 'centre' : True, 'reduit' : True },
              "moment4_SpO2" : 
              {'recordstype' : 'SpO2', 'order' : 4 , 'centre' : True, 'reduit' : True },
              "mean_pouls" : 
              {'recordstype' :'Pouls'},
              "mean_SpO2" : 
              {'recordstype' :'SpO2'},
              "std_pouls" :
              {'recordstype' : 'Pouls'},
              "std_SpO2" :
              {'recordstype' : 'SpO2'},
              "min_pouls" :
              {'recordstype' : 'Pouls'},
              "min_SpO2":
              {'recordstype' : 'SpO2'},
              "max_pouls":
              {'recordstype' : 'Pouls'},
              "max_SpO2":
              {'recordstype' : 'SpO2'},
              "first_quartile_pouls":
              {'recordstype' : 'Pouls'},
              "first_quartile_SpO2":
              {'recordstype' : 'SpO2'},
              "median_pouls":
              {'recordstype' : 'Pouls'},
              "median_SpO2":
              {'recordstype' : 'SpO2'},
              "third_quartile_pouls":
              {'recordstype' : 'Pouls'},
              "third_quartile_SpO2":
              {'recordstype' : 'SpO2'},
              "wavelet_corr_spo2_0_pouls_1" :
              {"begin" : 0, "end": 32.5},
              "wavelet_corr_spo2_0_pouls_0": 
              {"begin" : 0, "end": 32.5}, 
              "corr_P_S_dwt" : 
              {"start" : 0, "length" : 32.5}, 
              "mcr_pouls_1" :
              {"begin" : 0, "end": 32.5}, 
              "mcr_spo2_0" :
              {"begin" : 0, "end": 32.5}

}


Example of result of the get_features_list_patient method:
<img src="features.png" width=80%></img>

It is time to implement our final algorithm, which predicts whether a patient will suffer from a heart attack. The idea is to make a first statement after 20 or 30 minutes of surgery, assuming that something in this part of the data could warn us about an attack.

During the first part of the surgery (beginning of anaesthesia), the patient is very unstable and some features are not compatible with those variations. Since the goal is not to predict a heart attack during this part of the surgery, we cut it out (slice method) for every patient before calculating our features.

## Prediction


#### Features

Chosen features are defined below. [(see dictionary with all features here)](#all_features)

In [None]:
feature_dict_final = {"fourier" : 
              {"features_list" : ["pw_pouls_abs"], 
                "nperseg" : 50,
                "begin" : 0,
                "end" : 270},
              "mean_pouls" : 
              {'recordstype' :'Pouls'},
              "mean_SpO2" : 
              {'recordstype' :'SpO2'},
              "std_pouls" :
              {'recordstype' : 'Pouls'},
              "std_SpO2" :
              {'recordstype' : 'SpO2'}

}
used_features = ["pw_pouls_abs", "mean_SpO2", "mean_pouls", "std_SpO2", "std_pouls"]


### Classifier and training

To use this cell, create a folder containing one subfolder per label: clean, attack (and anomaly if you want).
Any classifier can be used.

In [None]:

classifier = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    class_weight='balanced'
)

patients_train = PatientsList.from_folder("train/")
df_train = patients_train.slice(150, -1, to_the_end=True).get_features_list_patient(feature_dict_final).loc[:, used_features]
label_train = patients_train.label()


classifier.fit(df_train, label_train['label'])


#this cell might run for several minutes


### Features and Classifier choice

We tested many combinations of features with several classifiers (with different parameters), such as a k-nearest-neighbors classifier or MLPClassifier. We use the RandomForestClassifier here, which gave us the best results (accuracy and recall) combined with the following set of features (see used_features and feature_dict_final). It is also a classifier that gives a lot of information on how it makes its decisions, which is essential for anesthetists.
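
One kind of information a random forest exposes is the importance of each feature. The sketch below shows it on synthetic data: the two feature names and the labelling rule are assumptions made for the illustration, not the features of the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the label depends only on the first column
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = np.where(informative > 0, "attack", "clean")

forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    class_weight="balanced",
    random_state=0,  # fixed seed so that the example is reproducible
)
forest.fit(X, y)

# Impurity-based importances, one value per column, summing to 1
importances = dict(zip(["informative", "noise"], forest.feature_importances_))
```

On this toy problem the column that drives the label gets most of the importance, which is the kind of reading that can be shown to a clinician next to a prediction.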

### Prediction function
After training the RandomForestClassifier, we just have to give the path to the csv files with the raw data (the function automatically splits the patients if needed). The function returns a dataframe with the predicted labels and the name of the patient, such as the dataframe shown below.


<img src="dataframe_prediction.png">

The process takes 7 minutes for 400 patients (splitting + getting the features + predicting).

In [None]:
def patientsprediction(foldername):
    processing(foldername)
    patientstopredict = PatientsList.from_folder("patients/",label=False)
    df_features = patientstopredict.get_features_list_patient(feature_dict_final)
    df_test = df_features.loc[:, used_features]
    
    prediction = classifier.predict(df_test)
    
    return pd.concat(
        [
            pd.Series(df_features.index, name="name"),
            pd.Series(prediction, name="prediction")
        ],
        axis = 1
    )

This last function allows for evaluation of the prediction of the previous classifier and returns the confusion matrix. It takes as argument the path to the folder containing your labelled data (containing folders named by label).

In [None]:
def comparaison_label_prediction(foldername):
    patientstopredict = PatientsList.from_folder(foldername)
    df_features = patientstopredict.get_features_list_patient(feature_dict_final)
    df_test = df_features.loc[:, used_features]
    prediction = classifier.predict(df_test)
    labels = patientstopredict.label()
    
    return metrics.confusion_matrix(
        labels["label"].to_numpy(),
        prediction,
        labels = ["clean", "attack"]
    )
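
As a small sketch of how to read the matrix returned by comparaison_label_prediction, the example below applies the same call to toy labels. The label lists are made up; the row and column order follows the `labels` argument, so "attack" is treated here as the positive class.

```python
from sklearn import metrics

# Toy ground truth and predictions (made up for illustration)
y_true = ["clean", "clean", "clean", "attack", "attack", "clean"]
y_pred = ["clean", "attack", "clean", "attack", "clean", "clean"]

# Rows are the true labels, columns the predicted labels, both in the order of `labels`
cm = metrics.confusion_matrix(y_true, y_pred, labels=["clean", "attack"])

# With "attack" as the positive class, the matrix unpacks as follows
tn, fp, fn, tp = cm.ravel()

recall_attack = tp / (tp + fn)   # share of real attacks that were detected
accuracy = (tp + tn) / cm.sum()  # share of correct predictions
```

Recall on the attack label is the figure to watch in this setting, since a missed attack (a false negative) is more costly than a false alarm.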