small things/text

mahort · mahort · commit 5595f89a5642 · 2018-11-13T20:44:50.000+01:00
diff --git a/Lecture-1-Linear-Regression.ipynb b/Lecture-1-Linear-Regression.ipynb
@@ -22,7 +22,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Set up environment"
+    "### Set up the environment"
    ]
   },
   {
@@ -50,15 +50,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Download data"
+    "### Load the data"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We here load the yeast cross dataset.\n",
-    "The data used in this study have been preconverted into an hdf5 file. \n",
+    "Here, we load the yeast cross dataset.\n",
+    "The data used in this study have already been converted into an hdf5 file. \n",
     "To process your own data, please use the limix command line binary (see [here](http://nbviewer.jupyter.org/github/limix/limix-tutorials/blob/master/preprocessing_QC/loading_files.ipynb))."
    ]
   },
@@ -81,14 +81,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Set up data object"
+    "### Set up the data object"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Phenotypes and genotypes are stored inside the HDF5 file. Load them into a dataframe and select the first 3 phenotypes. "
+    "Both the phenotypes and the genotypes are stored inside an HDF5 file. Load the data into a dataframe; here, we focus on the first 3 phenotypes. "
    ]
   },
   {
@@ -462,14 +462,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Normal distributed phenotypes and phenotype transformations"
+    "## Check the model assumptions (are the data normal?)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To explore the phenotypic data, we create a histogram of the phenotype values."
+    "Here, we use histograms to look at the distributions of the phenotypes."
    ]
   },
   {
@@ -562,25 +562,23 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Some of the phenotypes deviate from a normal distribution.\n",
-    "One of the assumptions of the linear regression model we use for association testing is that the model residuals are normal distrbuted.\n",
-    "Violation of this assumption leads to biases in the analysis.\n",
-    "We only have access to the residuals after fitting the model.\n",
-    "Under the assumption that the model eplains only a small portion of phenotypic variation we can assess the phenotype values instead."
+    "Your data will often deviate from a normal distribution (sometimes drastically, like Cadmium Chloride shown above).\n",
+    "Unfortunately, one of the assumptions of the model that we use in GWAS is that the residuals are normally distributed.\n",
+    "Violations of this assumption can result in model misspecification and thus biased parameter estimates."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Transforming phenotypes"
+    "### Variance stabilizing transformations; standardizing the phenotypes"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To make the data look more normal distrbuted, we apply two different phenotype transformations, the Box-Cox transformation and a non-parametric rank-based transformation."
+    "There are a wide variety of methods to stabilize variance and make data normally distributed. Here, we explore the usefulness of the Box-Cox transformation as well as a (non-parametric) rank-based transformation."
    ]
   },
   {
@@ -594,7 +592,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The Box-Cox transformation makes the data \"more normal\" by fitting a power transformation with one parameter to the observed phenotypic data."
+    "The Box-Cox transformation makes the data \"more normal\" by fitting a power transformation ($y^{\\lambda}$, where $\\lambda$ is found using maximum likelihood) to the observed phenotypic data."
    ]
   },
   {
@@ -926,14 +924,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Manhattan plot"
+    "### Plotting the results"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "A common way to visualize the results of a GWAS is a so-called Manhattan plot, where the $-log_{10}$ P-values are plotted against the genomic position.\n",
+    "A common way to visualize the results from GWAS is by using a so-called Manhattan plot, where the $-log_{10}$ P-values are plotted against the genomic position.\n",
     "\n",
     "The LIMIX function for producing Manhattan plots is ``limix.plot.plot_manhattan`` (see [here][1]).\n",
     "\n",
@@ -1026,14 +1024,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "##### GWAS using linear regression on the transformed phenotypes:"
+    "### Conducting GWAS with the transformed phenotypes:"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "First we analyze the Box-Cox transformed phenotypes."
+    "First we perform GWAS with the Box-Cox transformed phenotypes."
    ]
   },
   {
@@ -1059,7 +1057,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Next, we analyze the phenotypes transformed by the rank-based transformation."
+    "Next, we investigate the rank-transformed phenotypes."
    ]
   },
   {
@@ -1085,7 +1083,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To compare the results of the various transformations, we plot the p-values against one another:"
+    "To compare the results of the transformations, we can plot the p-values against one another:"
    ]
   },
   {
@@ -1641,16 +1639,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 29,
+   "execution_count": 25,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<matplotlib.legend.Legend at 0x1a5bb55d10>"
+       "<matplotlib.legend.Legend at 0x1a24e668d0>"
       ]
      },
-     "execution_count": 29,
+     "execution_count": 25,
      "metadata": {},
      "output_type": "execute_result"
     },
@@ -1716,17 +1714,17 @@
     "covars_conditional= sp.concatenate((geno_df.loc[sample_idx].values[:,imax:imax+1], sp.ones((phenotype_vals.values.shape[0],1))),1)\n",
     "                                  \n",
     "\n",
-    "#run linear regression on each SNP\n",
+    "#run linear regression on each SNP, while conditioning on the top SNP as a covariate.\n",
     "lm_conditional = qtl_test_lm(snps=geno_df.loc[sample_idx].values,pheno=phenotype_vals.values,covs=covars_conditional)\n",
     "\n",
-    "#convert P-values to a DataFrame for nice output writing:\n",
+    "#convert P-values to a pandas DataFrame:\n",
     "pvalues_lm_conditional = pd.DataFrame(data=lm_conditional.pvalues.T,index=positions,\n",
     "                       columns=phenotype_ID)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 30,
+   "execution_count": 27,
    "metadata": {},
    "outputs": [
     {
diff --git a/Lecture-2-Linear-Mixed-Model.ipynb b/Lecture-2-Linear-Mixed-Model.ipynb
@@ -60,9 +60,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We here load the arabidopsis dataset.\n",
-    "The data used in this study have been pre-converted into an hdf5 file. \n",
-    "To process your own data, please use the limix command line binary (see [here](http://nbviewer.jupyter.org/github/limix/limix-tutorials/blob/master/preprocessing_QC/loading_files.ipynb))."
+    "Here, we load the arabidopsis data, which have already been converted into an hdf5 file. \n",
+    "\n",
+    "To process your own data, use the limix command line binary (see [here](http://nbviewer.jupyter.org/github/limix/limix-tutorials/blob/master/preprocessing_QC/loading_files.ipynb) for an example)."
    ]
   },
   {
@@ -87,7 +87,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The data object allows to query specific genotype or phenotype data"
+    "The HDF5 file holds both the genotype and phenotype data."
    ]
   },
   {
@@ -295,7 +295,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 35,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -307,6 +307,26 @@
     "                      dtype='float64')"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(1179, 21456)"
+      ]
+     },
+     "execution_count": 36,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "geno_df.shape"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
@@ -520,7 +540,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Explore flowering time phenotypes"
+    "### Let's work with flowering time data"
    ]
   },
   {
@@ -675,7 +695,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Note that when cov are not set (None), LIMIX fits an intercept (i.e., ``covs=sp.ones((N,1))``)."
+    "When you do not include covariates (None), LIMIX still fits an intercept (i.e., ``covs=sp.ones((N,1))``)."
    ]
   },
   {
@@ -817,7 +837,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Manhattan plot"
+    "### To get a quick idea of the results from GWAS, visualize the P-values with a Manhattan plot"
    ]
   },
   {
@@ -1052,7 +1072,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Here, we perform GWAS while using an increasing number of principal components (after PCA of the genotype matrix) in an attempt to correct for confounding due to population structure. Then, we use QQ-plots to check for (possible) P-value inflation."
+    "Here, we perform GWAS using principal components (after PCA of the genotype matrix) in an attempt to correct for confounding due to population structure. Like earlier, we use QQ-plots to check for (possible) P-value inflation."
    ]
   },
   {
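
Note on the transformations described in the Lecture 1 markdown cells above: the sketch below illustrates the two approaches the text names, a Box-Cox power transformation and a rank-based transformation. It is not the notebook's own code. It assumes SciPy's `scipy.stats.boxcox` (which picks $\lambda$ by maximum likelihood and applies $(y^{\lambda}-1)/\lambda$, or $\log y$ when $\lambda = 0$) and a common rank-to-normal-quantile mapping. The helper names `boxcox_transform` and `rank_gaussianize`, the shift to positive values, and the simulated phenotype are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats


def boxcox_transform(y):
    """Box-Cox transform of a 1-D phenotype vector (illustrative helper).

    Box-Cox needs strictly positive input, so the data are shifted so that
    the minimum becomes 1. The shift is an assumption of this sketch.
    """
    y = np.asarray(y, dtype=float)
    shifted = y - y.min() + 1.0
    # With lmbda=None, SciPy returns the transformed data and the
    # maximum-likelihood estimate of lambda.
    transformed, lam = stats.boxcox(shifted)
    return transformed, lam


def rank_gaussianize(y):
    """Rank-based transform: map ranks to standard normal quantiles."""
    y = np.asarray(y, dtype=float)
    ranks = stats.rankdata(y)  # ranks 1..n, ties receive the average rank
    return stats.norm.ppf((ranks - 0.5) / len(y))


# Simulated, strongly right-skewed phenotype (not the yeast data).
rng = np.random.default_rng(0)
pheno = rng.lognormal(size=500)

pheno_boxcox, lam = boxcox_transform(pheno)
pheno_rank = rank_gaussianize(pheno)

print("skewness raw:     ", stats.skew(pheno))
print("skewness Box-Cox: ", stats.skew(pheno_boxcox))
print("skewness rank:    ", stats.skew(pheno_rank))
```

The rank-based version discards the original spacing between values and keeps only their order, which is why it yields a symmetric distribution regardless of the input; Box-Cox keeps more of the original scale but can only correct skew that a power transformation is able to capture.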
