wtte.pipelines.data_pipeline returns wrong seq_ids #61

michigann · 2019-06-09T13:10:01Z

Hi, I found a problem with the data preprocessing functions.

The problem shows up when you preprocess data with the library's data_pipeline function and then want to match the model's result for each sequence back to its id. The data_pipeline function in the wtte.pipelines module seems to return seq_ids in the wrong order, which breaks the seq_index-to-seq_id mapping.

The bug is in the df_to_array function, in its second statement: unique_ids = list(grouped.groups.keys()). The grouped sequences aren't ordered by their ids, so the padded feature array built from them can be in a different order than the seq_ids returned by data_pipeline. That's because data_pipeline returns sequences in the order of id_col in the passed pandas DataFrame, while df_to_array builds the feature sequences in pandas groupby order, which may be different, as in my case.

My suggested fix (the simplest one) is to change unique_ids = list(grouped.groups.keys()) to unique_ids = df[id_col].unique() in df_to_array.
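To illustrate the kind of mismatch described above, here is a minimal sketch that uses only pandas, not wtte itself. The toy frame and the column names "id" and "event" are made up for the example. It also uses a default groupby (sort=True), which may not match the exact call inside df_to_array. It shows only that ids taken in order of appearance and sequences built in groupby order can disagree.

```python
import pandas as pd

# Toy data; the column names "id" and "event" are illustrative only.
df = pd.DataFrame({
    "id": ["b", "b", "a", "c", "a"],
    "event": [1, 0, 0, 1, 0],
})

# Ids in order of first appearance in the frame.
ids_by_appearance = list(df["id"].unique())

# Ids in the order a default groupby (sort=True) yields its groups.
ids_by_groupby = [key for key, _ in df.groupby("id")]

# Per-id sequences, built by iterating over the same groupby.
sequences = [group["event"].tolist() for _, group in df.groupby("id")]

# Pairing the sequences with ids taken in a different order mislabels them.
mislabeled = dict(zip(ids_by_appearance, sequences))
aligned = dict(zip(ids_by_groupby, sequences))

print(ids_by_appearance)
print(ids_by_groupby)
print(mislabeled["b"], aligned["b"])
```

Whichever order is chosen, the ids and the sequences need to come from the same ordering. Collecting both inside one loop over the groupby is one way to guarantee that.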
