Project proposal of team_Methylation-Badassays #3
Hi @STAT540-UBC/team-badassays Thank you for writing up the final proposal. There has been great progress from your initial proposal to the final proposal. However, there are still many parts that need more clarification, which I believe will be resolved as you study and learn more about your project and methodology and make progress. Also, I believe today's lecture was beneficial for your team, since it addressed key points in DNA methylation analysis :) @rbalshaw and I reviewed your proposal. What you need to think more about:
More guidelines from Rob:
We would be happy to meet with the team and discuss further :) Good luck with your interesting project! |
Thanks for the comments and helpful suggestions @farnushfarhadi and @rbalshaw ! We have some thoughts in response to your comments, which I will address inline:
That's a good point that we don't know the ethnicities in our second dataset, so they might have other genetic ancestries than our first dataset. However, maybe we can adjust our goal from "predict the ethnicities based on DNAm data" to asking the question "Are these samples from this other dataset more epigenetically 'Asian' or 'Caucasian'?". That is, maybe these samples aren't strictly 'Asian' or 'Caucasian', but can we use our identified CpGs to say which samples are more Asian or more Caucasian? The goal then is to build a tool that lets researchers estimate the ethnic heterogeneity in their dataset (not necessarily predict the exact ethnicity).
I'm not sure exactly what you mean by this. Could you please clarify why adding the second dataset to the first and then identifying the differentially methylated sites would be useful? Considering that the ethnicities are unknown in the second dataset, wouldn't those samples be useless for building a classifier that predicts ethnicity? Currently, we are still preprocessing the data. By the project proposal deadline we plan to accomplish the following:
In general I'm unsure about what the workflow looks like after this. How should we build the 'classifier'? I'm thinking this is what we could possibly do: Thanks, |
Associated with point 1: I think you're on track here. Your first data set allows you to assess if CpG sites can differentiate between self-reported Asian vs. Caucasian. There are several steps here. Then, cross-validation in this data set would allow you to assess how effectively your CpGs can do that (as measured by Sens and Spec, AUC, etc.). This is largely a supervised learning problem (supervised vs. unsupervised used as the machine learning folks use them...) where you are trying to predict which group each sample belongs to. Next, you take the CpGs and the classification rule you have developed in dataset 1 and check to see if the CpGs you have decided are relevant still appear relevant in this second dataset. This is trickier and will require a bit more imagination -- but I think that your goal here could be to demonstrate that data set 1 taught you which CpGs to look at (and actually proposed a way to combine them into an Asian vs. Caucasian "score"), and this should permit you to adjust for at least this type of heterogeneity in a data set where self-reported ethnicity is not collected.

Associated with point 2: I didn't intend that point 2 would involve data set 2 at all. You're right - you can't really confirm anything using this dataset, as you have no "known labels". Rather, as I suggested above, data set 2 could perhaps be used to demonstrate the potential usefulness of what you were able to learn from data set 1. But, as you've heard, one of the challenges in many of these studies is the risk of over-fitting -- too many possible parameters to estimate and not enough data. Cross-validation is one statistical technique that can be used to control for and assess the degree of over-fitting. Try googling "correcting for optimism in statistical modeling". (Statsgeek had a nice little example...)

Your "future workflow" looks heavily reliant on linear regression and then clustering.
Hope this helps. |
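A rough sketch of what this two-dataset workflow could look like in R with glmnet — `betas1` (samples × CpGs from dataset 1), `ethnicity` (factor with levels Asian/Caucasian), and `betas2` (the unlabelled second dataset on the same CpGs) are all placeholder names, not objects from the team's repository:

```r
library(glmnet)

# Dataset 1: elastic-net penalized logistic regression, with 10-fold
# cross-validation used to pick the penalty and to estimate AUC.
set.seed(540)
cvfit <- cv.glmnet(x = as.matrix(betas1), y = ethnicity,
                   family = "binomial", alpha = 0.5,
                   type.measure = "auc", nfolds = 10)
max(cvfit$cvm)                      # cross-validated AUC at the best lambda

# CpGs kept by the model (nonzero coefficients) = candidate ancestry markers
coefs <- coef(cvfit, s = "lambda.min")
rownames(coefs)[coefs[, 1] != 0]

# Dataset 2: the predicted probability acts as an "Asian vs. Caucasian" score
# for samples whose ethnicity was never recorded.
score2 <- predict(cvfit, newx = as.matrix(betas2),
                  s = "lambda.min", type = "response")
```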
@rbalshaw @farnushfarhadi Would you guys have a look at our draft analysis plan? Maybe we can discuss it tomorrow during seminar time. One approach we can think of for identifying CpG sites that are helpful in predicting ethnicity is to use classification methods like logistic regression:
Or, we can also experiment with unsupervised methods like PCA: merge both the training and test data sets, use PCA to visualize which of the PCs are good classifiers of ethnicity given the labels we have, then use the identified PC(s) as classifiers for samples without labels (i.e. samples in the test data set). Would you consider this to be a rigorous method? If we identify some PCs as classifiers, should we expect one of them to be similar to the logistic regression classifier? |
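A minimal sketch of the PCA idea, using the same placeholder objects (`betas1`, `ethnicity`, `betas2`) as above:

```r
# PCA on dataset 1 (samples in rows, CpGs in columns); in practice you would
# probably restrict to the most variable CpGs first for speed.
pca <- prcomp(betas1, center = TRUE, scale. = FALSE)

# Do any of the leading PCs separate the self-reported Asian vs. Caucasian samples?
plot(pca$x[, 1], pca$x[, 2], col = as.numeric(ethnicity), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(ethnicity),
       col = seq_along(levels(ethnicity)), pch = 19)

# If a PC tracks ethnicity, project the unlabelled samples onto the same PCs
scores2 <- predict(pca, newdata = betas2)[, 1:2]
```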
Hi @MingWan10 Would just like to add one thing. With regards to (2.), Rob mentioned, and I think it is a good idea, that at the beginning we randomly separate our first dataset (with known ethnicity) into 'training' and 'testing' subsets (we would probably need to keep the proportions of Asians vs. Caucasians the same, though). We can use the training subset to build the classifier and then test on our testing subset to get an idea of the accuracy of the classifier. Victor |
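One way to do the stratified split Victor describes is caret's createDataPartition, which samples within each class so the Asian/Caucasian proportions are preserved (placeholder object names again):

```r
library(caret)

set.seed(540)
# 75% training / 25% testing, sampled separately within each ethnicity level
train_idx <- createDataPartition(ethnicity, p = 0.75, list = FALSE)

x_train <- betas1[train_idx, ];  y_train <- ethnicity[train_idx]
x_test  <- betas1[-train_idx, ]; y_test  <- ethnicity[-train_idx]

table(y_train); table(y_test)   # check the class proportions are similar
```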
Sorry I cannot attend the seminar this afternoon. Your plans are sounding sensible. @wvictor14 describes what amounts to one "fold" in the cross-validation strategy that @MingWan10 is describing in his step 1.

As for including all 400k predictors in one run of a penalized regression, I'll have to leave that to you. But -- just be careful that if you do some form of "preselection" that uses the ethnicity, you must include that step in the cross-validation process. For example, say someone had done 100 single-predictor regressions and then used only the 15 that had p-values < 0.20 in a stepwise regression analysis. If they only cross-validate the stepwise regression part of their process, they will vastly overestimate how good their final model really is. They would need to cross-validate both the single-predictor regression analyses and the subsequent stepwise regression analysis to be sure that their estimates of performance are not overly contaminated by the optimism that you get by "testing on the training set". (Of course, I'm not recommending one-predictor regression followed by stepwise regression... just the cross-validation approach.) |
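To make the point concrete, here is a rough sketch of screening done inside each cross-validation fold, with a simple t-test filter standing in for whatever preselection step is actually used (all object names are placeholders):

```r
library(caret)

set.seed(540)
folds <- createFolds(ethnicity, k = 5)          # stratified held-out index sets

fold_auc <- sapply(folds, function(test_idx) {
  x_tr <- betas1[-test_idx, ]; y_tr <- ethnicity[-test_idx]
  x_te <- betas1[test_idx, ];  y_te <- ethnicity[test_idx]

  # The screening step is repeated INSIDE the fold, on training samples only,
  # so the held-out samples never influence which CpGs are preselected.
  pvals <- apply(x_tr, 2, function(cpg) t.test(cpg ~ y_tr)$p.value)
  keep  <- names(sort(pvals))[1:100]            # e.g. top 100 CpGs by p-value

  fit  <- glmnet::cv.glmnet(as.matrix(x_tr[, keep]), y_tr, family = "binomial")
  prob <- predict(fit, as.matrix(x_te[, keep]), s = "lambda.min", type = "response")
  as.numeric(pROC::auc(pROC::roc(y_te, as.vector(prob), quiet = TRUE)))
})

mean(fold_auc)   # performance estimate that includes the screening step
```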
FYI - I just replied to the GitHub issue. I have lost confidence that everyone will get that update. Rob |
Hi team, I am excited to know more about your project! You are the first group I will be talking to today! See ya |
@rbalshaw THANK YOU very much for your great and helpful comments. |
Thanks @rbalshaw for your suggestions! After today's lecture on CV, we also realized that initial screening of predictors isn't really a great idea if we go with logistic regression + regularization, so we will put in all predictors at once. There are other classification models, though; as @farnushfarhadi pointed out during our discussions today, we could also try KNN, SVM, or linear discriminant analysis, etc. Do you guys have insights on which method we should try out? (+ @singha53 ) |
@MingWan10 Good point re: which method you should try out. I will incorporate that for the next lecture after I teach regularization. I will compare the methods we have learned to date (e.g. KNN, penalized logistic regression and SVM, using the caret package). Note: you can include the screening of predictors (i.e., feature selection) to build your classifier as long as you also include it in the cross-validation folds. For now I recommend trying out the code I put in today's lecture on your dataset. |
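A sketch of how such a comparison might look with caret, fitting KNN, penalized logistic regression, and an SVM on identical cross-validation folds (placeholder objects `x_train`, `y_train` from the stratified split above; in practice any ethnicity-based feature screening would also have to live inside the folds, as discussed earlier):

```r
library(caret)

set.seed(540)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     index = createMultiFolds(y_train, k = 5, times = 3),
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_knn    <- train(x = x_train, y = y_train, method = "knn",
                    metric = "ROC", trControl = ctrl,
                    preProcess = c("center", "scale"))
fit_glmnet <- train(x = x_train, y = y_train, method = "glmnet",
                    metric = "ROC", trControl = ctrl)
fit_svm    <- train(x = x_train, y = y_train, method = "svmRadial",
                    metric = "ROC", trControl = ctrl)

# Cross-validated ROC / sensitivity / specificity, side by side
summary(resamples(list(knn = fit_knn, glmnet = fit_glmnet, svm = fit_svm)))
```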
@rbalshaw @farnushfarhadi
the last commit: 7ce221a
the link to the proposal: https://github.com/STAT540-UBC/team_Methylation-Badassays/blob/master/project_proposal.md