
Immune_All_Low - model training question #104

Open
NikicaJEa opened this issue Jan 10, 2024 · 4 comments

Comments

@NikicaJEa

Hi, I was trying to replicate your Immune_All_Low model using the same training dataset you kindly provided (CellTypist_Immune_Reference_v2_count). I tested the two models (the original and the replicate) on an independent dataset. The final annotations differ quite a bit; in particular, the prediction scores from the original model are considerably higher (0-1) than those of the replicate (mostly around 0). Here is the model training code:

import scanpy as sc
import celltypist
from celltypist import models

# load the built-in Immune_All_Low model (assumes it has been downloaded, e.g. via models.download_models())
original_immune_all_low_classifier = models.Model.load("Immune_All_Low.pkl")

train_data = sc.read("./CellTypist_Immune_Reference_v2_count.h5ad")
gene_list = original_immune_all_low_classifier.features

# include only genes that are in train_data.var_names
valid_genes = [gene for gene in gene_list if gene in train_data.var_names]

sc.pp.normalize_total(train_data, target_sum=1e4)
sc.pp.log1p(train_data)

train_data = train_data[:, valid_genes]  # subset to markers used by the original model

classifier = celltypist.train(train_data, labels='label', n_jobs=16, feature_selection=False,
                              use_SGD=True, mini_batch=False, check_expression=False,
                              balance_cell_type=False)

I also tried normalizing after the gene subsetting to see if it would make a difference, but nothing changed much.
Thanks for the help!

@KatarinaLalatovic

I am also interested in this.

@ChuanXu1
Collaborator

@NikicaJEa, if you only select a subset of genes for training, SGD is not necessary - you can safely turn it off to enable a canonical logistic regression.
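A minimal sketch of that suggestion, assuming the same gene-subsetted train_data and labels as in the snippet above (variable names carried over from that post):

# canonical (non-SGD) logistic regression on the pre-subsetted genes;
# mini_batch and balance_cell_type are left at their defaults
classifier = celltypist.train(train_data, labels='label', n_jobs=16,
                              use_SGD=False, feature_selection=False,
                              check_expression=False)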

@NikicaJEa
Author

Thanks for your reply @ChuanXu1. I experimented with all the combinations I could think of: with/without SGD, mini-batching, balance_cell_type, without subsetting genes, and with the feature selection step. Unfortunately, none of these combinations yielded results comparable to the original Immune_All_Low model. I understand there is always some degree of randomness to be expected, but this is more than I would expect. It would be helpful to know the specific parameters with which the original model was trained.

@ChuanXu1
Collaborator

@NikicaJEa, to produce a model with performance comparable to the built-in models, you can use the same set of genes (which you had already done, plus check_expression = False) and increase the number of iterations (for example, max_iter = 1000), with all other parameters left at their defaults.
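A minimal sketch of those settings, again assuming the train_data subsetted to the original model's features as in the first post; the max_iter value follows the suggestion above and the output filename is illustrative:

# defaults everywhere except the pre-matched gene set, check_expression and max_iter
classifier = celltypist.train(train_data, labels='label', n_jobs=16,
                              check_expression=False, max_iter=1000)
classifier.write("Immune_All_Low_replicate.pkl")  # illustrative output path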
