# Preprocessing data for generic HMM generation
    
In this notebook, we lay out how the initial data is preprocessed, in order to
enable a **stable machine-learning performance** later on.


### The red box shows the current position in the pipeline

![PipelineStep1](images/Pipeline_step1.png "View of the current state of the Pipeline")

### Inside the red box, the following chart models the workflow

![PreprocessingPipelineImage](images/PreprocessingPipeline.png "View of the Preprocessing Pipeline")

We start by receiving the data in CSV format, together with an .ini file conforming to the abstract syntax defined in
markerconfig.esl.

First, we delete all unnecessary data, which means dropping every column that has no section in the config file.
Then, we group the data by person (or, more generically, by experiment run). Thus, we obtain one time series per run.
These time series might not be consistent in frequency, so we have to enforce this consistency within the data.
This means inserting filler data points between observations, or dropping some of the data (this part is to be discussed in detail).
Finally, we generate a label encoder for our data and transform the data per marker, which makes it easier and faster to work with.

As a result, we obtain a table of time-consistent sequences (serialized at the end of this notebook) as well as a map from each marker to the LabelEncoder object used to transform the encoded values back to their original names.

In [1]:
import pandas as pd
import numpy as np
import configparser

In [2]:
path_to_config = "config.ini"
path_to_data = "../../data/csv_klinischeDaten/Konstante_u_fortlaufende_Daten_identifier-nur-ID/test.csv"

# read in the config file
cparser = configparser.ConfigParser()
cparser.read(path_to_config)

# read in the data and store it into a dataframe
df = pd.read_csv(path_to_data, delimiter='\t')

In [3]:
cparser.sections()

['markerconfig_metainfo', 'usubjid', 'visdat', 'siteid', 'visnam', 'subjstat']
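
The config file itself is not shown in this notebook. As a rough sketch, a file producing the sections above could look like the text below. The section names and the keys `groupby`, `dateinfo` and `dtype` are taken from the code in this notebook, but every value is an assumption made for illustration.

```python
import configparser

# Hypothetical config text: only the section and key names come from the notebook,
# all values are made up for this example.
EXAMPLE_CONFIG = """
[markerconfig_metainfo]
groupby = usubjid
dateinfo = ("visdat", "2021-08-01", "2021-08-02")

[usubjid]
dtype = str

[visdat]
dtype = date

[siteid]
dtype = linspace(0, 20, 11)

[visnam]
dtype = str

[subjstat]
dtype = str
"""

example_parser = configparser.ConfigParser()
example_parser.read_string(EXAMPLE_CONFIG)

print(example_parser.sections())
print(example_parser["markerconfig_metainfo"]["groupby"])
```

Any `dtype` other than `str`, `linspace(...)` or `date_range(...)` (here the made-up value `date`) is simply label-encoded as is by the encoding cell further down.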

In [4]:
df

|   | usubjid | siteid | subjstat | visnam | visdat |
| --- | --- | --- | --- | --- | --- |
| 0 | 237-034-721 | 13 | good | Logs | 2021-09-01 |
| 1 | 237-034-721 | 13 | good | Logs | 2021-09-03 |
| 2 | 237-034-721 | 13 | good | Logs | 2021-08-30 |
| 3 | 237-034-721 | 13 | good | Logs | 2021-08-21 |
| 4 | 237-034-721 | 13 | med | Logs | 2021-08-11 |
| 5 | 237-034-721 | 13 | med | Logs | 2021-08-01 |
| 6 | 237-034-722 | 13 | good | Logs | 2021-09-01 |
| 7 | 237-034-722 | 13 | good | Logs | 2021-09-01 |
| 8 | 237-034-722 | 13 | good | Logs | 2021-08-31 |
| 9 | 237-034-722 | 13 | med | Logs | 2021-08-21 |


## Delete unnecessary data

- delete any unwanted columns
- delete any excluded rows (?)

In [5]:
# drop any columns/sections that are not specified in the config.ini file.
valid_sections = cparser.sections()

if 'markerconfig_metainfo' in valid_sections:
    valid_sections.remove('markerconfig_metainfo')

df = df[valid_sections]
df

|   | usubjid | visdat | siteid | visnam | subjstat |
| --- | --- | --- | --- | --- | --- |
| 0 | 237-034-721 | 2021-09-01 | 13 | Logs | good |
| 1 | 237-034-721 | 2021-09-03 | 13 | Logs | good |
| 2 | 237-034-721 | 2021-08-30 | 13 | Logs | good |
| 3 | 237-034-721 | 2021-08-21 | 13 | Logs | good |
| 4 | 237-034-721 | 2021-08-11 | 13 | Logs | med |
| 5 | 237-034-721 | 2021-08-01 | 13 | Logs | med |
| 6 | 237-034-722 | 2021-09-01 | 13 | Logs | good |
| 7 | 237-034-722 | 2021-09-01 | 13 | Logs | good |
| 8 | 237-034-722 | 2021-08-31 | 13 | Logs | good |
| 9 | 237-034-722 | 2021-08-21 | 13 | Logs | med |
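
Selecting columns with `df[valid_sections]` raises a fairly terse `KeyError` when the config names a column that the data does not contain. A slightly more defensive variant could look like the sketch below. The helper `select_configured_columns` and the toy data are hypothetical and not part of the pipeline.

```python
import pandas as pd

def select_configured_columns(df: pd.DataFrame, sections: list) -> pd.DataFrame:
    """Keep only the columns that have a config section (hypothetical helper)."""
    # the metainfo section describes the file as a whole, not a column
    wanted = [s for s in sections if s != "markerconfig_metainfo"]

    # fail early with a readable message if the config and the data disagree
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"Columns configured but missing from the data: {missing}")

    # .copy() makes the result independent of the original frame
    return df[wanted].copy()

# toy data with one column ("comment") that has no config section
raw = pd.DataFrame({"usubjid": ["a", "b"], "siteid": [13, 13], "comment": ["x", "y"]})
slim = select_configured_columns(raw, ["markerconfig_metainfo", "usubjid", "siteid"])
print(list(slim.columns))
```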


## Group data by experiment

A CSV file may contain more than one experiment, or it may hold one big time sequence.
We query the 'markerconfig_metainfo' section of the config file
to find out whether we need to split the data into groups.

In [7]:
from typing import List

def group(df: pd.DataFrame, cparser: configparser.ConfigParser) -> List[pd.DataFrame]:
    '''
    df :        The DataFrame consisting of ungrouped, unsorted data points.
    cparser :   The parsed config, which may contain metainfo naming the column
                to group the data by.
    returns :   A list of DataFrames, one per value of the groupby column. If no groupby column
                is configured, the received DataFrame is returned ungrouped, as a single-element list.
    '''
    # check if we have metainfo; if we don't,
    # we have to treat the data as one big experiment
    if 'markerconfig_metainfo' not in cparser or 'groupby' not in cparser['markerconfig_metainfo']:
        return [df]

    # load the key by which we group the data
    key = cparser['markerconfig_metainfo']['groupby']

    if key not in df.columns:
        raise KeyError(f"The groupby key '{key}' specified inside the config file was not found in the data.")

    # iterating over a GroupBy yields (key, sub-frame) pairs, sorted by key
    return [group_df for _, group_df in df.groupby(key)]
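
To illustrate what the grouping step produces, here is a small self-contained sketch on made-up data. It uses the plain pandas `groupby` iteration rather than the `group` function itself, so it can be run on its own.

```python
import pandas as pd

# toy data: two subjects with interleaved observations (values are made up)
toy = pd.DataFrame({
    "usubjid": ["237-034-721", "237-034-722", "237-034-721"],
    "subjstat": ["good", "med", "med"],
})

# Iterating over a GroupBy yields (key, sub-frame) pairs, sorted by key.
runs = {key: sub for key, sub in toy.groupby("usubjid")}

for key, sub in runs.items():
    print(key, len(sub))
```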
    
    
  

## Enforce measurement interval consistency (if wanted)

In most cases, the time between measurements matters. In our case, state transitions are observed within
every experiment sequence, but not necessarily at fixed time intervals. Severe inconsistencies in observation intervals between experiments may lead to poor model performance. Thus, we query the metainfo for the requested time interval and, where necessary, insert filler data points or drop duplicate data points.



<span style="color:red">*This idea should be discussed, since this basically is a tradeoff between model accuracy and training data. Maybe we should just repeat the last state in the given time interval, until the new state is reached* </span>.
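
The alternative mentioned in the note (repeating the last state until a new one is observed) can be sketched with pandas' built-in resampling. This is only an illustration on made-up data, assuming a daily target interval. It is not the approach implemented in the cell below.

```python
import pandas as pd

# Hypothetical observations of one subject; the column names mirror the notebook's data.
obs = pd.DataFrame({
    "visdat": pd.to_datetime(["2021-08-01", "2021-08-04", "2021-08-05"]),
    "subjstat": ["med", "good", "good"],
})

# Put the observations on a regular daily grid and carry the last known state forward.
regular = (
    obs.set_index("visdat")
       .resample("1D")
       .ffill()
       .reset_index()
)
print(regular)
```

The two missing days (2021-08-02 and 2021-08-03) are filled with the last observed state `med`, so the sequence stays on an exact daily grid.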

In [8]:

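import ast
from typing import List

def enforce_timeinterval_consistency(dfs: List[pd.DataFrame], cparser: configparser.ConfigParser) -> List[pd.DataFrame]:
    # without metainfo about the dates there is nothing to enforce
    if 'markerconfig_metainfo' not in cparser or 'dateinfo' not in cparser['markerconfig_metainfo']:
        return dfs

    metainfo = cparser['markerconfig_metainfo']

    # dateinfo holds the name of the date column and two reference timestamps;
    # their distance is the requested interval between observations.
    # ast.literal_eval only accepts plain literals, unlike eval.
    dateid, time1, time2 = ast.literal_eval(metainfo['dateinfo'])
    groupby = metainfo.get('groupby')  # None if no groupby key is configured

    timedelta = abs(pd.Timestamp(time1) - pd.Timestamp(time2))

    return [make_df_timeinterval_consistent(df, timedelta, dateid, groupby) for df in dfs]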
        
def make_df_timeinterval_consistent(df : pd.DataFrame, timedelta : pd.Timedelta, dateid : str, groupby : str) -> pd.DataFrame:
    '''
    df :        The DataFrame we want to enforce time interval consistency on.
    timedelta : Timedelta object, the maximum time delta that two consecutive observations may be apart.
    dateid :    Name of the column holding the observation dates.
    groupby :   Name of the column the data was grouped by, or None.
    returns :   A time interval consistent DataFrame. The new DataFrame may contain filler rows (copies of the
                previous observation with interpolated dates, added when two observations are too far apart),
                or fewer rows than before (when observations are too close to each other, say all on the same
                day, we have to drop all but one).
    '''
    if dateid not in df.columns:
        raise KeyError(f"dateid-key {dateid}, which was specified inside the config.ini was not in the data.")
    
    if groupby is not None:
        group_handle = df[groupby].iloc[0]
    
    
    # 1) remove duplicate values
    df = df.drop_duplicates(subset=[dateid])
    
    
    # 2) map the dates to pd.Timestamp objects
    try:
        df[dateid] = df[dateid].map(lambda x: pd.Timestamp(x))
    except Exception as e:
        print('There has been a problem parsing the dates in your data!\n'
              f'{e}\n'
              'Please let your dates conform to the ISO 8601 standard YYYY-MM-DD hh:mm:ss')
        return pd.DataFrame()
    
    
    # sort the dataframe by dates
    df = df.sort_values(by=[dateid]).reset_index(drop=True)
    
    # 3) Allocate new rows inside a list. Fill in empty rows, in case we have a temporal gap
    new_rows = []
    dates = df[dateid]
    
    # always add the first date. In the following loop, we will add subsequent dates,
    # as well as filler rows.
    new_rows.append(df.iloc[0].to_dict())
    
    for i in range(1, len(dates)):
        current_date = dates.iloc[i-1]
        next_date = dates.iloc[i]

        if timedelta < next_date - current_date:

            cur_dict = df.iloc[i-1].to_dict()

            # add some filler rows
            periods = 2 + int(abs(next_date - current_date) / timedelta)

            times = pd.date_range(start=current_date, end=next_date, periods=periods)

            rows = [cur_dict.copy() for _ in times[1:-1]]

            for j, date in enumerate(times[1:-1]):
                rows[j][dateid] = date

            new_rows = new_rows + rows

            
        # append next row to list
        new_rows.append(df.iloc[i].to_dict())
        
            
    return pd.DataFrame(new_rows, columns=df.columns)
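
To make the filler arithmetic concrete, the sketch below replays the `periods` computation for a 10-day gap and a 1-day interval (both values are assumptions for the example). With `periods = 2 + int(gap / timedelta)` we get 12 evenly spaced timestamps, that is 10 fillers spaced 10/11 of a day apart, which is the 21:49:05 spacing visible in the grouped output further down. A possible alternative, shown only for comparison, is `1 + ceil(gap / timedelta)`, which lands exactly on the daily grid whenever the gap is a multiple of the interval.

```python
import math
import pandas as pd

current_date = pd.Timestamp("2021-08-01")
next_date = pd.Timestamp("2021-08-11")
timedelta = pd.Timedelta(days=1)

# computation used in make_df_timeinterval_consistent
periods = 2 + int((next_date - current_date) / timedelta)   # 12
times = pd.date_range(start=current_date, end=next_date, periods=periods)
fillers = times[1:-1]                                       # 10 filler timestamps
print(periods, len(fillers), fillers[0])

# alternative (not what the notebook does): fewest points that still keep
# every step at or below the requested interval
alt_periods = 1 + math.ceil((next_date - current_date) / timedelta)   # 11
alt_times = pd.date_range(start=current_date, end=next_date, periods=alt_periods)
print(alt_periods, len(alt_times[1:-1]), alt_times[1])
```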


In [9]:
grouped_dfs = group(df, cparser)
consistant_dfs = enforce_timeinterval_consistency(grouped_dfs, cparser)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[dateid] = df[dateid].map(lambda x: pd.Timestamp(x))
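
The message above is pandas' `SettingWithCopyWarning`: the column assignment runs on a frame that pandas tracks as derived from another frame. One common way to avoid it, sketched here on made-up data and assuming the extra memory for a copy is acceptable, is to take an explicit copy before assigning.

```python
import pandas as pd

frame = pd.DataFrame({
    "visdat": ["2021-09-01", "2021-09-01", "2021-08-30"],
    "subjstat": ["good", "good", "med"],
})

# Take an explicit copy so the assignment below targets an independent frame.
unique = frame.drop_duplicates(subset=["visdat"]).copy()

# pd.to_datetime converts the whole column at once
unique["visdat"] = pd.to_datetime(unique["visdat"])
print(unique.dtypes)
```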


## Encoding states

Since the time interval enforcement leaves us with a list of DataFrames, we first concatenate the grouped frames into a single DataFrame.

To encode the states, we read the information attached to every marker in the config. Some special cases need additional operations before the dataset is ready for use.

One such case is the dtype linspace(x,y,bins), which helps us with continuous values.
Here we extract the metainfo from the config file, construct the bin edges with np.linspace, and then transform
the data via np.digitize. The same goes for the date_range dtype.

In the end, we should be presented with multiple DataFrames, one per Layer.
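
The binning step described above can be sketched on made-up values as follows. The numeric part uses `np.linspace` and `np.digitize` as described. For dates, this sketch builds the edges with `pd.date_range` and compares integer nanosecond values, which is one possible way to bin timestamps and an assumption of this example.

```python
import numpy as np
import pandas as pd

# continuous values: 5 edges between 0 and 10 -> 0, 2.5, 5, 7.5, 10
values = np.array([0.5, 3.2, 7.9, 10.0])
bins = np.linspace(0.0, 10.0, 5)
indices = np.digitize(values, bins)
print(indices)   # [1 2 4 5]; values at or above the last edge get index len(bins)

# dates: 4 edges, 10 days apart -> 08-01, 08-11, 08-21, 08-31
dates = pd.to_datetime(["2021-08-01", "2021-08-20"])
edges = pd.date_range("2021-08-01", "2021-08-31", periods=4)

# convert both sides to integer nanoseconds so they share one unit
date_values = dates.to_numpy().astype("datetime64[ns]").astype("int64")
date_bins = edges.to_numpy().astype("datetime64[ns]").astype("int64")
date_indices = np.digitize(date_values, date_bins)
print(date_indices)   # [1 2]
```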



In [10]:
df = pd.concat(consistant_dfs).reset_index(drop=True)
tmpdf = df.groupby('usubjid')
groups = [tmpdf.get_group(g) for g in tmpdf.groups]
groups[3]


|   | usubjid | visdat | siteid | visnam | subjstat |
| --- | --- | --- | --- | --- | --- |
| 111 | 237-034-724 | 2022-08-01 00:00:00.000000000 | 13 | Logs | med |
| 112 | 237-034-724 | 2022-08-01 21:49:05.454545454 | 13 | Logs | med |
| 113 | 237-034-724 | 2022-08-02 19:38:10.909090909 | 13 | Logs | med |
| 114 | 237-034-724 | 2022-08-03 17:27:16.363636363 | 13 | Logs | med |
| 115 | 237-034-724 | 2022-08-04 15:16:21.818181818 | 13 | Logs | med |
| 116 | 237-034-724 | 2022-08-05 13:05:27.272727272 | 13 | Logs | med |
| 117 | 237-034-724 | 2022-08-06 10:54:32.727272727 | 13 | Logs | med |
| 118 | 237-034-724 | 2022-08-07 08:43:38.181818181 | 13 | Logs | med |
| 119 | 237-034-724 | 2022-08-08 06:32:43.636363636 | 13 | Logs | med |
| 120 | 237-034-724 | 2022-08-09 04:21:49.090909090 | 13 | Logs | med |


In [11]:
import ast
from sklearn.preprocessing import LabelEncoder

# Generate one LabelEncoder object per marker and encode each column.
label_encoders = {}

for section in valid_sections:
    dtype = cparser[section]['dtype']

    if 'linspace' in dtype:
        try:
            # dtype looks like "linspace(start, end, bins)"; parse the argument tuple
            start, end, number_bins = ast.literal_eval(dtype.replace('linspace', ''))
            bins = np.linspace(float(start), float(end), int(number_bins))

            # replace each continuous value by the index of the bin it falls into
            df[section] = np.digitize(df[section], bins)

        except Exception:
            print(f"dtype of section {section} was parsed as {dtype}, but we encountered an error.")
            raise

    elif 'date_range' in dtype:
        try:
            # dtype looks like "date_range(start, end, periods)"
            start, end, periods = ast.literal_eval(dtype.replace('date_range', ''))

            # build evenly spaced timestamp edges with pd.date_range (np.linspace is
            # meant for numeric input) and compare integer nanosecond values
            edges = pd.date_range(pd.Timestamp(start), pd.Timestamp(end), periods=int(periods))
            bins = edges.to_numpy().astype('datetime64[ns]').astype('int64')
            values = pd.to_datetime(df[section]).to_numpy().astype('datetime64[ns]').astype('int64')

            df[section] = np.digitize(values, bins)

        except Exception:
            print(f"dtype of section {section} was parsed as {dtype}, but we encountered an error.")
            raise

    elif dtype == 'str':
        # normalize strings: lower case, no surrounding whitespace, underscores instead of spaces
        df[section] = df[section].map(lambda x: x.lower().strip().replace(' ', '_'))

    # any other dtype is label-encoded as is

    # construct the label encoder, fit it and transform the column
    tmp_encoder = LabelEncoder()
    df[section] = tmp_encoder.fit_transform(df[section])

    # save label encoder
    label_encoders[section] = tmp_encoder

## Serialize the DataFrame and LabelEncoder Objects

In [12]:
import pickle

# save dataFrame
df.to_pickle('preprocessed.pkl')

# save LabelEncoders
with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)
    
df, label_encoders


(     usubjid  visdat  siteid  visnam  subjstat
 0          0       0       0       0         2
 1          0       1       0       0         2
 2          0       2       0       0         2
 3          0       3       0       0         2
 4          0       4       0       0         2
 ..       ...     ...     ...     ...       ...
 145        3     120       0       0         2
 146        3     121       0       0         1
 147        3     122       0       0         1
 148        3     123       0       0         1
 149        3     124       0       0         2
 
 [150 rows x 5 columns],
 {'usubjid': LabelEncoder(),
  'visdat': LabelEncoder(),
  'siteid': LabelEncoder(),
  'visnam': LabelEncoder(),
  'subjstat': LabelEncoder()})
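
A later pipeline step has to read both files back. The sketch below shows the round trip on made-up data, writing into a temporary directory so that it does not touch the real `preprocessed.pkl` and `label_encoders.pkl`. The toy frame and encoder are assumptions for the example.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy stand-ins for the real DataFrame and encoder map
encoder = LabelEncoder().fit(["good", "med", "bad"])
frame = pd.DataFrame({"subjstat": encoder.transform(["good", "med", "good"])})

with tempfile.TemporaryDirectory() as tmp:
    frame_path = os.path.join(tmp, "preprocessed.pkl")
    encoders_path = os.path.join(tmp, "label_encoders.pkl")

    # write both artifacts the same way as the cell above
    frame.to_pickle(frame_path)
    with open(encoders_path, "wb") as fh:
        pickle.dump({"subjstat": encoder}, fh)

    # read them back
    restored_frame = pd.read_pickle(frame_path)
    with open(encoders_path, "rb") as fh:
        restored_encoders = pickle.load(fh)

# the restored encoder still maps the codes back to the original labels
decoded = restored_encoders["subjstat"].inverse_transform(restored_frame["subjstat"])
print(list(decoded))
```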

### Let's try the inverse transformation

In [13]:

for section in df.columns:
    df[section] = label_encoders[section].inverse_transform(df[section])
    
df

|   | usubjid | visdat | siteid | visnam | subjstat |
| --- | --- | --- | --- | --- | --- |
| 0 | 237-034-721 | 2021-08-01 00:00:00.000000000 | 6 | logs | med |
| 1 | 237-034-721 | 2021-08-01 21:49:05.454545454 | 6 | logs | med |
| 2 | 237-034-721 | 2021-08-02 19:38:10.909090909 | 6 | logs | med |
| 3 | 237-034-721 | 2021-08-03 17:27:16.363636363 | 6 | logs | med |
| 4 | 237-034-721 | 2021-08-04 15:16:21.818181818 | 6 | logs | med |
| ... | ... | ... | ... | ... | ... |
| 145 | 237-034-724 | 2022-08-31 08:00:00.000000000 | 6 | logs | med |
| 146 | 237-034-724 | 2022-09-01 00:00:00.000000000 | 6 | logs | good |
| 147 | 237-034-724 | 2022-09-01 16:00:00.000000000 | 6 | logs | good |
| 148 | 237-034-724 | 2022-09-02 08:00:00.000000000 | 6 | logs | good |
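
Note that inverting the LabelEncoder only undoes the label encoding. For a column that was binned with np.digitize beforehand, the inverse yields bin indices, not the original values. This is one plausible explanation for `siteid` reading 6 here instead of 13, assuming `siteid` is configured with a linspace dtype (the config is not shown). A minimal sketch of mapping a bin index back to its interval, with made-up edges chosen purely for illustration:

```python
import numpy as np

# hypothetical edges: 0, 2.5, 5, ..., 25
bins = np.linspace(0.0, 25.0, 11)

# bin index of the value 13
index = int(np.digitize(13, bins))

# For 1 <= index < len(bins), np.digitize guarantees bins[index - 1] <= x < bins[index].
lower, upper = bins[index - 1], bins[index]
print(index, lower, upper)
```

With these edges the value 13 falls into bin 6, which covers the interval from 12.5 (inclusive) to 15.0 (exclusive). The exact original value cannot be recovered from the bin index alone.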


# This concludes the most fundamental preprocessing of the data.

The next step is to use the generated data to construct the transition matrix, as well as the emission matrix and starting state probability distribution for each marker.