How to prepare model input from my own data? #2

Closed
Jwenyi opened this issue Oct 13, 2021 · 13 comments

Comments

@Jwenyi

Jwenyi commented Oct 13, 2021

Hi Dr. Wang,
I'm a surgeon in China. I'm very interested in SurvTRACE and would like to apply it in my research to predict the prognosis of cancer patients. However, I only started learning Python recently. Could you show me how to prepare the model input from local files? For example, an m × n matrix in which each row is a patient ID and the columns contain overall survival time, event status, and the features for modeling.

@RyanWangZf
Owner

RyanWangZf commented Oct 13, 2021

Hi there,
You can refer to
https://github.com/RyanWangZf/SurvTRACE/blob/main/data/seer_processed.csv
for the standard input format of the data.

Once your data is in that format, you can refer to
https://github.com/RyanWangZf/SurvTRACE/blob/main/survtrace/dataset.py

especially the branch under

elif data == "seer":

where you set PATH_DATA, event_list, cols_categorical, cols_standardize, and config['num_event'] to fit your data.
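
A minimal sketch of getting a local table into that kind of layout, assuming pandas is available and the file has one row per patient. The input column names (`patient_id`, `os_time`, `os_event`, `age`, `stage`) and the inline CSV are hypothetical; `duration` and `event` follow the column names that appear in the snippets quoted later in this thread, and the two lists only illustrate what might be set in `dataset.py`.

```python
import io

import pandas as pd

# Stand-in for a local file; in practice pass a file path to pd.read_csv.
raw_csv = io.StringIO(
    "patient_id,os_time,os_event,age,stage\n"
    "P001,90,1,61,III\n"
    "P002,92,0,48,I\n"
    "P003,103,1,70,II\n"
)

# One row per patient, indexed by patient ID.
df = pd.read_csv(raw_csv, index_col="patient_id")

# Rename the survival columns (hypothetical source names) to duration/event.
df = df.rename(columns={"os_time": "duration", "os_event": "event"})

# Illustrative feature lists: numeric columns to standardize,
# categorical columns to encode.
cols_standardize = ["age"]
cols_categorical = ["stage"]

# Keep only the columns needed for modeling, in a fixed order.
df = df[["duration", "event"] + cols_standardize + cols_categorical]
print(df)
```

The resulting frame can be written out with `df.to_csv(...)` and its path used for PATH_DATA; the exact columns expected should be checked against seer_processed.csv.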

Moreover, since SurvTRACE is built on transformers, a GPU such as an RTX 3060 or similar is needed to train it efficiently.

Feel free to reach out if you have further questions. 😀

@Jwenyi
Author

Jwenyi commented Oct 14, 2021

Thanks for your response!! I'll try it :)

@Jwenyi
Author

Jwenyi commented Oct 14, 2021

Hi Zifeng,
I've built my own SurvTRACE model following your suggestion, thanks!! However, I have a new question: how do I predict the prognosis of a patient/sample?
I used to train a Cox regression model (a simple statistical model) or XGBoost, which provide a predicted score (or a survival function value?) for each patient, so we could use these scores to stratify patients. Is there a way to get a prediction for each sample and output a dataframe or matrix that includes these predictions? In other words, how do we use SurvTRACE to assign a predicted score to each patient?
P.S. I trained SurvTRACE without competing risks; each patient has only one event, "death or alive".

@RyanWangZf
Owner

RyanWangZf commented Oct 14, 2021

Hi, you can use these four functions to get the predicted hazard/risk/survival rate for patients, starting at

def predict_hazard(self, input_ids, batch_size=None):

and the functions below it. They output the hazard/risk/survival rate at each discrete time point corresponding to the time horizons we set:

'horizons': [.25, .5, .75], # the discrete intervals are cut at 0%, 25%, 50%, 75%, 100%

They can be used like this:

surv = model.predict_surv(df_test, batch_size=val_batch_size)
risk = 1 - surv

For more details, please refer to the evaluation class

class Evaluator:
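
A minimal sketch of turning such predictions into one score per patient for stratification. The `surv` array below is a made-up placeholder, not real model output: it assumes one row per patient and one column per time horizon, whereas the actual return type and shape of `predict_surv` (for example a tensor with a different number of columns) should be checked before reusing this. The choice of horizon and the median split are illustrative assumptions as well.

```python
import numpy as np

# Placeholder survival probabilities: rows are patients, columns are horizons.
surv = np.array([
    [0.95, 0.80, 0.60],
    [0.70, 0.45, 0.20],
    [0.90, 0.85, 0.75],
    [0.60, 0.30, 0.10],
])

# Risk is the complement of the survival probability, as in the reply above.
risk = 1.0 - surv

# Pick one horizon (here the middle column) as the per-patient score.
horizon_idx = 1
score = risk[:, horizon_idx]

# Split patients into high/low risk groups at the median score.
high_risk = score > np.median(score)

for i, (s, h) in enumerate(zip(score, high_risk)):
    print(f"patient {i}: score={s:.2f}, group={'high' if h else 'low'}")
```

The `score` vector can be placed in a dataframe next to the patient IDs, in the same way scores from a Cox model would be.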

@Jwenyi
Author

Jwenyi commented Oct 14, 2021

Thanks a lot!! It's really helpful for me 😀

@RyanWangZf
Owner

My pleasure~ You're welcome to star the project if it's helpful 😇

@Jwenyi
Author

Jwenyi commented Oct 15, 2021

Surely!!

@Jwenyi
Author

Jwenyi commented Oct 15, 2021

Hi Zifeng,
Sorry to disturb you again, but I ran into a new problem while training SurvTRACE. 😂
When I run the function load_data from dataset.py, it reports:
"UserWarning: Got event/censoring at start time. Should be removed! It is set s.t. it has no contribution to loss."
It comes from this line:
y = labtrans.transform(*get_target(df)) # y = (discrete duration, event indicator)
However, I checked my input data and found no censoring or event at the start time, and I'm sure that no patient has "duration 0, event 1". Maybe this problem is related to:

times = np.quantile(df["duration"][df["event"] == 1.0], horizons).tolist()
times
[389.500000125, 601.9999998, 1120.75]
In this code I see that the time intervals are set. However, I do have some patients whose "duration" is less than 389.5 and whose "event" is 1 (death). Does that cause the warning? If so, I noticed that even if I delete these patients, "times" will change too, and there will be new patients who do not meet the condition.

How should I solve this problem? Or does it not affect the performance of the model, so that it can be ignored? I am eagerly looking forward to your reply.

P.S. Part of my data is listed below, showing the patients with the shortest durations:

duration  event  AURKA.FGD6  AURKA.GABRP  CLDN9.IL27RA  DPYD.FANCI
90        1      0           1            1             0
92        0      0           0            0             0
100       0      0           0            1             1
100       0      0           1            1             1
103       1      1           1            0             0
108       0      1           1            0             1
112       0      1           1            1             1
120       0      0           1            1             0
126       1      1           1            0             0

@Jwenyi
Author

Jwenyi commented Oct 15, 2021

P.S. I figured that maybe the patient with the shortest duration should not be "Death"? So I also deleted this patient, but unfortunately I got the same warning again. 😂

@RyanWangZf
Owner

I checked the code where this warning is raised:

if idx_durations.min() == 0:
    warnings.warn("""Got event/censoring at start time. Should be removed! It is set s.t. it has no contribution to loss.""")
    t_frac[idx_durations == 0] = 0
    events[idx_durations == 0] = 0
idx_durations = idx_durations - 1
# get rid of -1
idx_durations[idx_durations < 0] = 0
return idx_durations.astype('int64'), events.astype('float32'), t_frac.astype('float32')

Before the operation on line 81, a patient with duration < 389.500000125 is actually assigned index 1 rather than 0. So this warning is raised because durations[idx_durations == 0] == 0.

Could you add a breakpoint there and print(durations[idx_durations == 0]) to show me the output?

@Jwenyi
Author

Jwenyi commented Oct 15, 2021

Thanks Zifeng! I checked there and found the output is "92.00000018".

@RyanWangZf
Owner

Do you mean there is only one output and it's not zero? That's weird 😂
I copied this transform code from pycox
https://github.com/havakv/pycox/blob/d384d4f0ac89ddd8458daabfd3fe271ff26542e3/pycox/preprocessing/label_transforms.py#L150

so I don't know what happened.

But if there is only one output, only this single sample will be excluded from the loss, and I guess it will not influence the results much 😇
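
One possible explanation, sketched with plain NumPy rather than the pycox code itself: if the first cut point happens to equal the smallest duration in the data (an assumption, not confirmed in this thread), then a discretization that rounds each duration up to the next cut point puts exactly that sample into bin 0, which would trigger the warning even though no duration is zero. The durations and cut points below are illustrative values, and `np.searchsorted` stands in for the actual binning step.

```python
import numpy as np

# Illustrative durations; 92.0 is the smallest one.
durations = np.array([92.0, 100.0, 103.0, 390.0, 700.0, 1500.0])

# Hypothetical cut points: smallest duration, three horizon times,
# largest duration.
cuts = np.array([92.0, 389.5, 602.0, 1120.75, 1500.0])

# Index of the first cut point that is >= each duration
# (i.e. each duration is rounded up to the next cut point).
idx_durations = np.searchsorted(cuts, durations, side="left")

# Samples that land in bin 0, which is what the warning checks for.
at_start = durations[idx_durations == 0]

print(idx_durations)  # bin index per sample
print(at_start)       # durations that fall into bin 0
```

Under these assumptions only the sample with the minimum duration lands in bin 0, which would be consistent with a single non-zero value being printed at the breakpoint.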

@Jwenyi
Author

Jwenyi commented Oct 17, 2021

Thanks! I checked my input data and the processed data and found that the sample size changed very little. 😀
