wtte.pipelines.data_pipeline returns wrong seq_ids #61

Open
michigann opened this issue Jun 9, 2019 · 0 comments

Hi, I found a problem with the data preprocessing functions.

The problem shows up when you preprocess data with the library's data_pipeline function and then want to match the model's output for each sequence back to its id. The data_pipeline function in the wtte.pipelines module seems to return seq_ids in the wrong order, which breaks the seq_index-to-seq_id mapping.

The bug is in the second statement of df_to_array: unique_ids = list(grouped.groups.keys()). The grouped sequences aren't ordered by their ids, so the padded feature array built from them can have a different order than the seq_ids returned from data_pipeline. This happens because data_pipeline returns seq_ids in the order of id_col in the passed pandas dataframe, while df_to_array builds the feature sequences in pandas groupby order, which may be different, as it was in my case.

My suggestion for the simplest fix is to change unique_ids = list(grouped.groups.keys()) to unique_ids = df[id_col].unique() in df_to_array.
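To illustrate the kind of mismatch described above, here is a minimal sketch in plain pandas. It does not call wtte. The toy dataframe and the pad_sequences helper are hypothetical and only stand in for what df_to_array does. Because the key order of grouped.groups can depend on the pandas version and the sort argument, the sketch compares default groupby iteration order (sorted keys) against df[id_col].unique() (order of first appearance) instead.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): ids appear in the order 3, 1, 2.
df = pd.DataFrame({
    "id": [3, 3, 1, 2, 2, 2],
    "t": [0, 1, 0, 0, 1, 2],
    "x": [30.0, 31.0, 10.0, 20.0, 21.0, 22.0],
})

# Order of first appearance in the dataframe.
appearance_ids = df["id"].unique().tolist()

# Default groupby (sort=True) iterates over groups in sorted key order.
group_iter_ids = [key for key, _ in df.groupby("id")]

print(appearance_ids)   # order of appearance
print(group_iter_ids)   # sorted order, differs from the line above


def pad_sequences(frame, ids, id_col="id", value_col="x"):
    """Hypothetical stand-in for df_to_array: one NaN-padded row per id,
    with rows laid out in exactly the order given by `ids`."""
    grouped = frame.groupby(id_col)
    max_len = int(grouped.size().max())
    out = np.full((len(ids), max_len), np.nan)
    for row, seq_id in enumerate(ids):
        values = grouped.get_group(seq_id)[value_col].to_numpy()
        out[row, :len(values)] = values
    return out


# Using one id list for both the array rows and the returned ids
# keeps the seq_index-to-seq_id mapping consistent.
padded = pad_sequences(df, appearance_ids)
print(padded)
```

The point of the sketch is only that the row order of the padded array and the id list handed back to the caller should come from the same source, which is what the proposed one-line change aims for.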
