|
564 | 564 | "source": [
|
565 | 565 | "Your data will often deviate from a normal distribution (sometimes drastically, like Cadmium Chloride shown above).\n",
|
566 | 566 | "However, one of the assumptions of the model that we use in GWAS is that the residuals are normally distributed.\n",
|
567 |
| - "Violations of this assumption can result in model misspecification and thus biased parameter estimates." |
| 567 | + "Violations of this assumption can result in model misspecification and biased parameter estimates." |
568 | 568 | ]
|
569 | 569 | },
|
570 | 570 | {
|
|
578 | 578 | "cell_type": "markdown",
|
579 | 579 | "metadata": {},
|
580 | 580 | "source": [
|
581 |
| - "There are a wide variety of methods to stabilize variance and make data normally distributed. Here, we explore the usefulness of the Box-Cox transformation as well as a (non-parametric) rank-based transformation." |
| 581 | + "There are a wide variety of methods to stabilize variance and make data normally distributed. Here, we explore the Box-Cox transformation as well as a (non-parametric) rank-based transformation." |
582 | 582 | ]
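As a sketch of what the Box-Cox step might look like in code (the phenotype vector `y` here is simulated, not the notebook's data; Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

# Simulated right-skewed phenotype (hypothetical data).
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# scipy estimates the lambda that maximizes the log-likelihood of the
# transformed data under a normal model.
y_bc, lam = stats.boxcox(y)

# The transformed values should be closer to normal; compare skewness.
print(stats.skew(y), stats.skew(y_bc))
```

Because the data above are lognormal, the estimated `lam` should land near 0, where Box-Cox reduces to a log transform.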
|
583 | 583 | },
|
584 | 584 | {
|
|
615 | 615 | "cell_type": "markdown",
|
616 | 616 | "metadata": {},
|
617 | 617 | "source": [
|
618 |
| - "The rank transformation normalizes the data by converting the data to ranks and then transforming these ranks to the corresponding quantiles of a normal distribution. Because this transformation does not rely on a parameter (or actually one parameter per sample, namely the normal quantile), it is called non-parametric.\n", |
| 618 | + "The rank transformation normalizes the data by converting it to ranks and then transforming these ranks to the corresponding quantiles of a normal distribution. Because this transformation does not rely on or specify a parameter, it is considered non-parametric.\n", |
619 | 619 | "\n",
|
620 | 620 | "Before using a rank-based transformation, you should consider whether other models (e.g. the binomial model) are more appropriate for your data."
|
621 | 621 | ]
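The rank-based inverse normal transform described above can be sketched as follows (the input vector and the helper `rank_inverse_normal` are illustrative, not from the notebook; the offset `c = 3/8` is the Blom constant, one common convention for rescaling ranks):

```python
import numpy as np
from scipy import stats

# Hypothetical skewed phenotype vector.
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=200)

def rank_inverse_normal(x, c=3.0 / 8.0):
    """Rank the values, rescale ranks into (0, 1), and map them
    through the normal quantile function."""
    ranks = stats.rankdata(x)
    quantiles = (ranks - c) / (len(x) - 2.0 * c + 1.0)
    return stats.norm.ppf(quantiles)

y_rint = rank_inverse_normal(y)
# The result is approximately standard normal regardless of the
# input distribution.
print(stats.skew(y_rint))
```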
|
|
831 | 831 | "cell_type": "markdown",
|
832 | 832 | "metadata": {},
|
833 | 833 | "source": [
|
834 |
| - "Next, we convert the P-values to a pandas DataFrame:" |
| 834 | + "Next, we convert the P-values into a pandas DataFrame:" |
835 | 835 | ]
|
836 | 836 | },
|
837 | 837 | {
|
|
1532 | 1532 | "cell_type": "markdown",
|
1533 | 1533 | "metadata": {},
|
1534 | 1534 | "source": [
|
1535 |
| - "False discovery rates (FDR) give an idea of the expected type-1 error rate at a given P-value threshold. If we are testing millions of hypotheses, then we might be willing to accept type-1 errors at a given rate, if in return we get more discoveries.\n", |
| 1535 | + "False discovery rates (FDR) give an idea of the expected type-1 error rate at a given *P*-value threshold. This measure gives a useful alternative to traditional Bonferroni correction, which bounds the so-called family-wise error rate (FWER), namely the probability of having at least a single type-1 error.\n", |
1536 | 1536 | "\n",
|
1537 |
| - "This measure gives a useful alternative to traditional Bonferroni correction, which bounds the so-called family-wise error rate (FWER), namely the probability of having at least a single type 1 error." |
| 1537 | + "That is, a *P*-value threshold is the rate at which truly null hypotheses are called significant.\n", |
| 1538 | + "\n", |
| 1539 | + "The FDR is the rate at which significant results are truly null. So an FDR of 5% means that, among all of the features that are called significant, 5% are expected to be false positives." |
1538 | 1540 | ]
|
1539 | 1541 | },
|
1540 | 1542 | {
|
|
1548 | 1550 | "cell_type": "markdown",
|
1549 | 1551 | "metadata": {},
|
1550 | 1552 | "source": [
|
1551 |
| - "Definition: minimum false discovery rate threshold that would allow the variable to be significant." |
| 1553 | + "These are, like _P_ values, a measure of significance for a given test. \n", |
| 1554 | + "\n", |
| 1555 | + "In practice, if one is willing to accept results at a given $q$ value (e.g. 0.03), then among the results with a lower $q$ value, a fraction $q$ of those are expected to be false positives (3% in this example)." |
1552 | 1556 | ]
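A minimal sketch of the Benjamini-Hochberg step-up procedure that produces such FDR-adjusted values (the `bh_adjust` helper is hypothetical, not a function used in the notebook):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted P-values (step-up, monotone)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Raw BH values: p_(i) * m / i for the i-th smallest P-value.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest P-value downwards.
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# A feature is significant at FDR level q if its adjusted value is <= q.
print(bh_adjust([0.01, 0.02, 0.03, 0.5]))
```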
|
1553 | 1557 | },
|
1554 | 1558 | {
|
|
1638 | 1642 | {
|
1639 | 1643 | "data": {
|
1640 | 1644 | "text/plain": [
|
1641 |
| - "<matplotlib.legend.Legend at 0x1a24e668d0>" |
| 1645 | + "<matplotlib.legend.Legend at 0x1a23b30290>" |
1642 | 1646 | ]
|
1643 | 1647 | },
|
1644 | 1648 | "execution_count": 25,
|
|