How to prepare model input from my own data? #2

Jwenyi · 2021-10-13T16:13:05Z

Hi Dr. Wang,
I'm a surgeon in China. I'm really interested in your SurvTRACE and I'd like to apply it in my research to predict the prognosis of cancer patients. However, I only started learning Python recently. Could you show me how to prepare the model input from local files? E.g., an m×n matrix in which each row is a patient ID and the columns contain overall survival time, event status, and the features for modeling.

RyanWangZf · 2021-10-13T19:09:44Z

Hi there,
You can refer to
https://github.com/RyanWangZf/SurvTRACE/blob/main/data/seer_processed.csv
for the standard input format of the data.

Once your data is in that format, see
https://github.com/RyanWangZf/SurvTRACE/blob/main/survtrace/dataset.py

especially the branch under

    elif data == "seer":

and set PATH_DATA, event_list, cols_categorical, cols_standardize, and config['num_event'] to match your data.

Also, since SurvTRACE is built on transformers, you will need a GPU (something like an RTX 3060) to train it efficiently.

Feel free to reach out if you have further questions. 😀
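
As a rough illustration of the formatting step described above, here is a minimal sketch that reshapes a local patient table into one row per patient with "duration" and "event" columns followed by the features. The raw column names (patient_id, os_days, status, stage, age) are hypothetical, and the exact column names SurvTRACE expects should be checked against seer_processed.csv and dataset.py.

```python
import pandas as pd

# Hypothetical local table, one row per patient. In practice this would come
# from something like: raw = pd.read_csv("my_cohort.csv")
raw = pd.DataFrame({
    "patient_id": ["P01", "P02", "P03", "P04"],
    "os_days": [90, 412, 1130, 655],               # overall survival time
    "status": ["dead", "alive", "dead", "alive"],  # vital status at last follow-up
    "stage": ["II", "III", "I", "II"],             # a categorical feature
    "age": [61, 55, 70, 48],                       # a numerical feature
})

# Feature lists, playing the role of cols_categorical / cols_standardize.
cols_categorical = ["stage"]
cols_standardize = ["age"]

# Survival targets: time to event or censoring, and a 0/1 event indicator.
targets = pd.DataFrame({
    "duration": raw["os_days"].astype(float),
    "event": (raw["status"] == "dead").astype(int),
})

# Targets first, then features, indexed by patient ID.
df = pd.concat([targets, raw[cols_categorical + cols_standardize]], axis=1)
df.index = raw["patient_id"]

print(df)
```

The resulting table can then be saved with df.to_csv(...) and pointed to by PATH_DATA.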

Jwenyi · 2021-10-14T07:35:04Z

Thanks for your response!! I'll try it :)

Jwenyi · 2021-10-14T16:01:15Z

Hi Zifeng,
I've built my own SurvTRACE model following your suggestion, thanks!! However, I have a new question: how do I predict the prognosis of an individual patient/sample?
I used to train Cox regression models (a simple statistical model) or XGBoost, which provide a predicted score (or a survival function value?) for each patient, so we could use these scores to stratify patients. Is there a way to get a prediction for each sample and output a DataFrame or matrix containing these predictions? In other words, how can we use SurvTRACE to assign a predicted score to each patient?
P.S. I trained SurvTRACE without competing risks; patients have only one event, "death or alive".

RyanWangZf · 2021-10-14T16:29:44Z

Hi, you can use these four functions to get the predicted hazard/risk/survival rate for patients, starting at

SurvTRACE/survtrace/model.py, line 277 in 0d40f37:

    def predict_hazard(self, input_ids, batch_size=None):

and the functions below it. They output the hazard/risk/survival rate at each discrete time point, corresponding to the time horizons set in

SurvTRACE/survtrace/config.py, line 7 in 0d40f37:

    'horizons': [.25, .5, .75], # the discrete intervals are cut at 0%, 25%, 50%, 75%, 100%

They can be used like this:

    surv = model.predict_surv(df_test, batch_size=val_batch_size)
    risk = 1 - surv

For more details, please refer to the evaluation function in

SurvTRACE/survtrace/evaluate_utils.py, line 6 in 0d40f37:

    class Evaluator:
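
To show how such predictions could be turned into a per-patient table for stratification, here is a minimal sketch. It does not call the model: the surv array is made-up data standing in for the output of predict_surv, assumed here to have been converted to a NumPy array with one column per horizon. The real output may be a tensor and may contain additional columns, so its shape should be checked first. The median split is just one illustrative choice of threshold.

```python
import numpy as np
import pandas as pd

# Made-up survival probabilities for 4 patients at 3 horizons
# (a stand-in for the model output; not real predictions).
surv = np.array([
    [0.95, 0.90, 0.80],
    [0.70, 0.50, 0.30],
    [0.99, 0.97, 0.95],
    [0.85, 0.60, 0.40],
])
patient_ids = ["P01", "P02", "P03", "P04"]  # hypothetical IDs
horizons = [0.25, 0.5, 0.75]

# Risk is the complement of survival, as in the snippet above.
risk = 1 - surv

# One row per patient, one column per horizon.
risk_df = pd.DataFrame(
    risk,
    index=patient_ids,
    columns=[f"risk_h{h}" for h in horizons],
)

# Use the risk at the last horizon as a score and split at the median.
score = risk_df.iloc[:, -1]
risk_df["group"] = np.where(score > score.median(), "high", "low")

print(risk_df)
```

The score column plays the same role as the linear predictor from a Cox model: higher values indicate worse predicted prognosis.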

Jwenyi · 2021-10-14T17:24:41Z

Thanks a lot!! It's really helpful for me 😀

RyanWangZf · 2021-10-14T21:39:05Z

It's my pleasure! Feel free to star the project if you find it helpful 😇

Jwenyi · 2021-10-15T08:27:38Z

Surely!!

Jwenyi · 2021-10-15T13:40:42Z

Hi Zifeng,
Sorry to bother you again, but I ran into a new problem while training SurvTRACE. 😂
When I run the function "load_data" from "dataset.py", it reports:
"UserWarning: Got event/censoring at start time. Should be removed! It is set s.t. it has no contribution to loss.
warnings.warn("""Got event/censoring at start time. Should be removed! It is set s.t. it has no contribution to loss.""")"
It comes from this line:
y = labtrans.transform(*get_target(df)) # y = (discrete duration, event indicator)
However, I checked my input data and found no censoring or event at the start time, and I'm sure that no patient has "duration 0, event 1". Maybe the problem is caused by:

times = np.quantile(df["duration"][df["event"] == 1.0], horizons).tolist()
times
[389.500000125, 601.9999998, 1120.75]
This code sets the time intervals, but I do have some patients whose "duration" is less than 389.5 and whose "event" is 1 (death). Does that cause the warning? If so, I noticed that even if I delete these patients, "times" will change as well, and there will be new patients who do not meet the condition.

How should I solve this problem? Or does it not affect the performance of the model, so that it can be ignored? I am eagerly looking forward to your reply.

P.S. Part of my data is listed below, showing the patients with the shortest durations:

duration	event	AURKA.FGD6	AURKA.GABRP	CLDN9.IL27RA	DPYD.FANCI
90	1	0	1	1	0
92	0	0	0	0	0
100	0	0	0	1	1
100	0	0	1	1	1
103	1	1	1	0	0
108	0	1	1	0	1
112	0	1	1	1	1
120	0	0	1	1	0
126	1	1	1	0	0

Jwenyi · 2021-10-15T14:02:47Z

P.S. I figured that maybe the patient with the shortest duration should not be a "death"? So I deleted this patient as well, but unfortunately I got the warning again. 😂

RyanWangZf · 2021-10-15T15:00:56Z

I checked the code where this warning is raised:

SurvTRACE/survtrace/utils.py, lines 77 to 84 in 0d40f37:

    if idx_durations.min() == 0:
        warnings.warn("""Got event/censoring at start time. Should be removed! It is set s.t. it has no contribution to loss.""")
        t_frac[idx_durations == 0] = 0
        events[idx_durations == 0] = 0
    idx_durations = idx_durations - 1
    # get rid of -1
    idx_durations[idx_durations < 0] = 0
    return idx_durations.astype('int64'), events.astype('float32'), t_frac.astype('float32')

Before the operation on line 81, a patient with duration < 389.500000125 is actually assigned index 1 rather than 0. So this warning is raised because durations[idx_durations == 0] == 0.

Could you add a breakpoint there and print(durations[idx_durations == 0]) to show me the output?
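
On the question of patients who die before the first cut point: since the cut points are quantiles of the event durations, roughly a quarter of the events fall below the first one by construction, so such patients are expected and do not need to be removed. A minimal sketch with a hypothetical toy cohort (the numbers are made up, not the data from this thread):

```python
import numpy as np
import pandas as pd

# Hypothetical toy cohort: duration in days, event = 1 (death) or 0 (censored).
df = pd.DataFrame({
    "duration": [90, 92, 100, 103, 126, 400, 650, 900, 1200, 1500, 2000, 3000],
    "event":    [1,  0,  0,   1,   1,   1,   1,   0,   1,    1,    0,    1],
})
horizons = [0.25, 0.5, 0.75]

# Same expression as in the thread: quantiles over event durations only.
event_durations = df["duration"][df["event"] == 1.0]
times = np.quantile(event_durations, horizons).tolist()

# Count the events that occur before the first cut point.
first_cut = times[0]
n_early_events = int((event_durations < first_cut).sum())
frac_early = n_early_events / len(event_durations)

print(times)
print(n_early_events, frac_early)
```

Deleting the early events only shifts the quantiles, which is why new patients keep appearing below the first cut point after each deletion.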

Jwenyi · 2021-10-15T15:14:54Z

Thanks Zifeng! I checked there and found the output is "92.00000018".

RyanWangZf · 2021-10-15T15:25:25Z

Do you mean there is only one output and it's not zero? That's weird 😂
I copied this transform code from pycox:
https://github.com/havakv/pycox/blob/d384d4f0ac89ddd8458daabfd3fe271ff26542e3/pycox/preprocessing/label_transforms.py#L150

so I don't know what happened.

But if there is only one output, only this single sample will be excluded from the loss, and I guess it will not influence the results much 😇
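
One possible explanation for a single non-zero duration landing in index 0, sketched under two assumptions that are not confirmed in this thread: (1) the cut points run from the minimum duration through the three quantiles to the maximum duration, as the config comment "cut at 0%, 25%, 50%, 75%, 100%" suggests, and (2) each duration is mapped to the index of the first cut point that is greater than or equal to it. Under those assumptions, only the patient sitting exactly at the minimum duration gets index 0. The short durations are taken from the table posted above (after dropping the 90-day patient); the longer ones are hypothetical.

```python
import numpy as np

# Shortest durations from the table in this thread, plus hypothetical longer
# ones so that every interval is populated.
durations = np.array(
    [92, 100, 100, 103, 108, 112, 120, 126, 400, 650, 1200, 3000],
    dtype=float,
)

# Quantile cut points reported in this thread.
times = [389.500000125, 601.9999998, 1120.75]

# Assumption (1): cuts go from the minimum to the maximum duration.
cuts = np.array([durations.min()] + times + [durations.max()])

# Assumption (2): index of the first cut point >= duration.
idx_durations = np.searchsorted(cuts, durations, side="left")

print(idx_durations)
print(durations[idx_durations == 0])
```

If this is what happens, the warning concerns one sample per dataset (the one at the minimum duration), which matches the single value printed above and the very small change in sample size.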

Jwenyi · 2021-10-17T09:55:54Z

Thanks! I compared my input data with the processed data and found that the sample size changed very little. 😀

RyanWangZf closed this as completed Mar 11, 2022
