Final edits

commit 1fc5cedd38d6eafa78644d45f2904b1bea205ed5 1 parent 18ed663
@saksham authored
19 report/LyxReportStructure/FinalReport.lyx
@@ -217,12 +217,13 @@ Morphology of mitochondria is interesting for cell biologists because it
image from the stack of images of the worms obtained from the microscope.
This process is biased because the biologists working with the worm have
prior expectation on the morphology.
- Moreover, the human based image analysis is time intensive.
+ Moreover, the human based image analysis is time consuming.
There is, hence, a need for an unbiased metric for the morphology of mitochondr
ia, and possible automation of the morphometric analysis.
In this report, the authors suggest use of two features for classification
- of mitochondria and present their findings on how different machine learning
- algorithms MLE, logistic regression, SVM and neural networks.
+ of mitochondria and present their findings on the performance of the machine
+ learning algorithms MLE, logistic regression, SVM and neural networks on
+ the classification task.
\end_layout
\begin_layout Section
@@ -240,7 +241,7 @@ filename "background.lyx"
\end_layout
\begin_layout Section
-Data Acquisition, Processing and Feature Extraction (something like that)
+Data Pre-processing and Feature Extraction
\end_layout
\begin_layout Standard
@@ -250,7 +251,9 @@ name "sec:data-acquisition"
\end_inset
-
+This section describes how the images in the stack file are processed before
+ classification can be done.
+
\end_layout
\begin_layout Standard
@@ -276,9 +279,9 @@ name "sec:Classification"
\begin_layout Standard
Four machine learning algorithms: maximum likelihood estimation (MLE), logistic
- regression, support vector machine (SVM) and neural networks, were used
- for classifying the mitochondria from the sharpest image.
- Each of them is discussed in detail below.
+ regression, support vector machine (SVM) and neural networks (NN), were
+ used for classifying the mitochondria from the sharpest image.
+ Each of them is discussed in detail in this section.
\end_layout
\begin_layout Subsection
8 report/LyxReportStructure/background.lyx
@@ -106,17 +106,13 @@ slice
These are attributes like “circularity” (shape compared to a perfect circle),
length of one mitochondria, kind of clustering, etc.
These data are used to classify the morphology of the mitochondria into
- three classes:
+ two classes:
\shape italic
fragmented
\shape default
-,
-\shape italic
-tubular
-\shape default
and
\shape italic
-in between
+tubular
\shape default
.
As this process involves human based classification, it is prone to biases
165 report/LyxReportStructure/data_pre_processing.lyx
@@ -65,23 +65,25 @@ Data Pre-processing
\begin_layout Standard
To extract relevant features from the images preprocessing is necessary.
- To do so we created a preprocessing pipeline that consists of three steps.
+ To do so, a pre-processing pipeline comprising three steps was created.
\end_layout
\begin_layout Paragraph*
-Sharpest image:
+Sharpest Image:
\end_layout
\begin_layout Standard
-Each image stack consists between around 8 to 18 photography’s of the same
+Each image stack consists of around 10 to 20 photographs of the same
part of one worm.
- The images differ only in the focal plane that is shown.
- In the most cases only one of these images shows the mitochondria in a
- sharp and clear way that can be used to classify the mitochondria.
- The manual process is to look at each of the images, select the sharpest
- and proceed with the next classification tasks only with this one.
- If a program should become able to classify automatically, it is necessary
- to find a way to select the sharpest image automatically, too.
+ The images in a stack are taken at different focal planes of the
+ microscope.
+ In most of the cases only one of these images shows the mitochondria in
+ a sharp and clear way that can be used for classification.
+ The usual process is to look at each of the images in the stack, select
+ the sharpest and proceed with the next classification tasks only with the
+ sharpest image.
+ If a program is to be able to classify the mitochondria automatically,
+ it should be able to select the sharpest image automatically as well.
\end_layout
@@ -113,7 +115,7 @@ For this purpose a simple gradient based estimation function is used.
\end_inset
normalized by the number of gradients gives a good estimation on how sharp
- a photography actually is.
+ a photograph actually is.
\end_layout
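The gradient-based sharpness estimate described here can be sketched in a few lines. The report's implementation is presumably MATLAB; this Python version is only an illustrative sketch of the idea (function names are invented):

```python
import numpy as np

def sharpness(img):
    """Mean gradient magnitude of the image: the more fine detail is in
    focus, the larger the gradients and the sharper the image."""
    gy, gx = np.gradient(img.astype(float))
    return float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))

def sharpest_index(stack):
    """Index of the sharpest image in a stack of 2-D arrays."""
    return max(range(len(stack)), key=lambda i: sharpness(stack[i]))
```

Because only images of the same scene within one stack are compared, this simple score is enough to rank them.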
@@ -156,21 +158,23 @@ The higher the value
\end_inset
the sharper the image.
- A disadvantage of this simple method is that one usually can only compare
- images showing the same motive and so in real image processing software
- much better methods for estimating image sharpness are in use.
- Nevertheless it has shown to be a sufficient method for our purposes.
- Because we are only comparing images within on single stack and they all
- show the exact same microscope photography.
+ A disadvantage of this simple method is that one can usually only compare
+ images showing the same motif.
+ Hence, in real image processing software much better methods for estimating
+ image sharpness are in use.
+ Nevertheless, it has proved to be a sufficient method for the purpose
+ of classification presented in this report.
+ This is because only images within a single stack are compared, and they
+ all show the exact same scene.
\end_layout
\begin_layout Paragraph
-Histogram analysis:
+Histogram Analysis:
\end_layout
\begin_layout Standard
To adjust the optimal contrast of each image it is necessary to have a look
- on the histograms and how the different levels (0 to 255) of gray are distribut
+ at the histograms and how the different levels (0 to 255) of gray are distribut
ed all over one image.
\end_layout
@@ -256,11 +260,11 @@ The task is to find suitable values for
\begin_inset Formula $b_{l}$
\end_inset
- the lower bound below which everything goes to 0 (black) and
+, the lower bound, below which everything goes to 0 (black) and
\begin_inset Formula $b_{u}$
\end_inset
- the upper bound above which everything goes to 255 (white).
+, the upper bound, above which everything goes to 255 (white).
\end_layout
@@ -321,36 +325,37 @@ The gray scale levels in between are scaled to fit into the histogram of
the output image.
The result is a high contrast image in which mitochondria are much easier
to find.
- Because of all the images have different histograms it is not possible
- to use the same lower bound and upper bound for all of them.
- So we defined the variables
-\begin_inset Formula $b_{l}$
+ Because all the images have different histograms, it is not possible to
+ use the same lower bound and upper bound for all of them.
+ So the variables
+\begin_inset Formula $\alpha$
\end_inset
and
-\begin_inset Formula $b_{u}$
+\begin_inset Formula $\beta$
\end_inset
- corresponding to how many of the pixels should become 0 (black) or 255
- (white) after the adjustment.
- Using this procedure we are able to find the optimal bounds for each individual
- image.
- The performance of the later classification process will highly depend
- on the output quality of this contrast adjustment.
+ were defined, corresponding to how many of the pixels should have the grayscale
+ value of 0 (black) or 255 (white) after the adjustment.
+ Using this procedure the optimal bounds for each individual image were
+ found.
+ The performance of the later classification process depends highly on the
+ quality of output of this contrast adjustment.
In order to find the best values for
-\begin_inset Formula $b_{l}$
+\begin_inset Formula $\alpha$
\end_inset
and
-\begin_inset Formula $b_{u}$
+\begin_inset Formula $\beta$
\end_inset
- we used the optimization tools of MatLab.
- The error to minimize is the misclassification rate on the training set.
+ the optimization tools of MATLAB were used.
+ Misclassification rate on the training set is used as the optimization
+ objective.
\end_layout
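The contrast adjustment with the bounds $b_{l}$ and $b_{u}$ derived from the fractions $\alpha$ and $\beta$ can be sketched as a percentile-based linear stretch. This is a hedged Python approximation of what the MATLAB code presumably does (names and defaults are illustrative):

```python
import numpy as np

def stretch_contrast(img, alpha=0.05, beta=0.05):
    """Linear contrast stretch: roughly a fraction alpha of the pixels is
    clipped to 0 (black) and a fraction beta to 255 (white); grayscale
    levels in between are rescaled to the full 0..255 range."""
    b_l = np.percentile(img, 100 * alpha)          # lower bound b_l
    b_u = np.percentile(img, 100 * (1 - beta))     # upper bound b_u
    out = (img.astype(float) - b_l) * 255.0 / max(b_u - b_l, 1e-9)
    return np.clip(out, 0, 255).astype(np.uint8)
```

In the report, suitable values of alpha and beta were found by minimizing the misclassification rate on the training set.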
\begin_layout Paragraph
-Boundaries and blob detection:
+Boundaries and Blob Detection:
\end_layout
\begin_layout Standard
@@ -362,11 +367,11 @@ After improving the contrast of each image it can easily be converted to
bwboundaries
\shape default
\size default
-' it is possible to trace the exterior boundaries of the mitochondria in
- the binary image.
+' it was possible to trace the exterior boundaries of the mitochondria in the
+ binary image.
After filtering out objects that are too big or too small to be valid mitochond
-ria we have generated a set of vectors each containing the borders of one
- of the mitochondria.
+ria, a set of vectors was generated, each containing the border of one
+ mitochondrion.
\end_layout
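The blob detection and size filtering step can be approximated with connected-component labeling. The report uses MATLAB's `bwboundaries`; this Python sketch uses `scipy.ndimage.label` instead, and the area thresholds are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def find_blobs(binary, min_area=20, max_area=2000):
    """Label connected components in a binary image and keep only those
    whose area is plausible for a mitochondrion; returns one coordinate
    array (rows, cols) per blob."""
    labels, n = ndimage.label(binary)
    blobs = []
    for k in range(1, n + 1):
        coords = np.argwhere(labels == k)
        if min_area <= len(coords) <= max_area:
            blobs.append(coords)
    return blobs
```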
\begin_layout Standard
@@ -461,15 +466,19 @@ Circularity:
\begin_layout Standard
The circularity of mitochondria is one of the most important and reasonable
ways to compare mitochondria in tubular and fragmented cells.
- If the cell is classified tubular the mitochondria are far from being circularl
-y dirtibuted whereas in a fragmented classified cell almost all of the mitochond
-ria are indeed shaped like circles.
+ If the cell is classified as
+\shape italic
+tubular
+\shape default
+, the bounds of each mitochondrion are far from being circularly distributed
+ whereas in a fragmented classified cell almost all of the mitochondria
+ are indeed shaped like circles.
\end_layout
\begin_layout Standard
-Since we have already found the boundaries in an earlier step of the preprocessi
-ng we use FuzzyCShells to calculate the error that represents how the average
- of the points are away from the circle.
+Since the boundaries have already been found in an earlier pre-processing
+ step, the Fuzzy C-Shells method was used to calculate an error that represents
+ the average distance between the boundary points and the fitted circle.
The centre of the circle is the mean
\begin_inset Formula $\mu$
\end_inset
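A simplified stand-in for this circularity error fits a circle with centre at the mean of the boundary points and radius equal to the mean distance to that centre, then measures how far the points deviate from the radius. This is not the full Fuzzy C-Shells optimization, only an illustrative sketch:

```python
import numpy as np

def circularity_error(points):
    """Fit a circle (centre mu = mean of the boundary points, radius
    r = mean distance to mu) and return the mean absolute deviation of
    the point distances from r: near zero for circular blobs, large for
    tubular ones."""
    pts = np.asarray(points, dtype=float)
    mu = pts.mean(axis=0)
    d = np.linalg.norm(pts - mu, axis=1)
    return float(np.mean(np.abs(d - d.mean())))
```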
@@ -555,6 +564,7 @@ sideways false
status open
\begin_layout Plain Layout
+\align center
\begin_inset Graphics
filename images/boundary_circle_1.eps
special width=\columnwidth
@@ -568,7 +578,7 @@ status open
\begin_inset Caption
\begin_layout Plain Layout
-Blob boundary circular distributed
+Circularly distributed boundary of a blob
\end_layout
\end_inset
@@ -592,6 +602,7 @@ sideways false
status open
\begin_layout Plain Layout
+\align center
\begin_inset Graphics
filename images/boundary_line_2.eps
special width=\columnwidth
@@ -605,7 +616,7 @@ status open
\begin_inset Caption
\begin_layout Plain Layout
-Blob boundary line shaped distributed
+Tubular shaped boundary of a blob
\end_layout
\end_inset
@@ -619,14 +630,15 @@ Blob boundary line shaped distributed
\end_layout
\begin_layout Paragraph
-Covariance analysis:
+Covariance Analysis:
\end_layout
\begin_layout Standard
The covariance analysis is yet another way to generate a feature based on
the shape of a blob.
- If data points are distributed line shaped then the covariance of the points
- has two highly different values for the covariance's eigenvalues
+ If data points are distributed in a tubular fashion, then the covariance
+ matrix of the points has two very different eigenvalues
+
\begin_inset Formula $\lambda_{b}$
\end_inset
@@ -635,7 +647,7 @@ The covariance analysis is yet another way to generate a feature based on
\end_inset
, where
-\begin_inset Formula $\lambda_{b}>=\lambda_{s}$
+\begin_inset Formula $\lambda_{b}\geqslant\lambda_{s}$
\end_inset
.
@@ -645,7 +657,11 @@ The covariance analysis is yet another way to generate a feature based on
will belong to the eigenvector pointing in the direction along which the
original blob data is distributed.
- So the ratio
+ So the ratio of the two eigenvalues,
+\begin_inset Formula $e_{\lambda}$
+\end_inset
+
+, is much larger than one if the data is distributed as a line.
\end_layout
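The eigenvalue-ratio feature follows directly from the covariance matrix of the blob's points. Note that with $e_{\lambda}=\lambda_{b}/\lambda_{s}$ and $\lambda_{b}\geq\lambda_{s}$, elongated blobs give a large ratio (the inverse $\lambda_{s}/\lambda_{b}$ is what tends to zero). An illustrative Python sketch:

```python
import numpy as np

def eigenvalue_ratio(points):
    """e_lambda = lambda_b / lambda_s for the covariance of the blob
    points (lambda_b >= lambda_s): close to 1 for circular blobs, much
    larger than 1 for elongated, tubular ones."""
    pts = np.asarray(points, dtype=float)
    lam = np.linalg.eigvalsh(np.cov(pts.T))   # eigenvalues in ascending order
    return float(lam[-1] / max(lam[0], 1e-12))
```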
\begin_layout Standard
@@ -659,13 +675,13 @@ e_{\lambda}=\lambda_{b}/\lambda_{s}
\end_layout
-\begin_layout Standard
-is close to zero if the data has been line shaped distributed.
+\begin_layout Paragraph
+Thumbnail Features:
\end_layout
\begin_layout Standard
After blob detection, the ratio of the eigenvalues and the circularity measure
- were passed as input to various classification algorithms as shown in Figure
+ were passed as input to various classification algorithms as shown in figure
\begin_inset CommandInset ref
LatexCommand ref
@@ -674,13 +690,15 @@ reference "fig:classification-pipeline"
\end_inset
.
+ Note that the features for the neural network are slightly different from
+ those used by the other classification methods.
\end_layout
\begin_layout Standard
\begin_inset Float figure
wide false
sideways false
-status collapsed
+status open
\begin_layout Plain Layout
\align center
@@ -752,23 +770,28 @@ vskip
\end_layout
\begin_layout Standard
-For the Neural Network (only) the blobs have been used directly to train
- the model and make predictions.
- Despite now none of the extraction functions described above have been
- used still some image pre-processing is necessary.
-
-\end_layout
+For the neural network (only), the blobs were used directly to train the
+ model and make predictions.
+ Instead of computing
+\begin_inset Formula $e_{c}$
+\end_inset
-\begin_layout Standard
-First of all the blobs are resized to fit into an image of size 20
+ and
+\begin_inset Formula $e_{\lambda}$
+\end_inset
+
+ for each blob and using these values as features for classification, the
+ pixel values of the blob are used directly to make predictions.
+ After blobs are detected, each blob is resized to fit into a 20
\begin_inset Formula $\times$
\end_inset
-20.
- In this image, they are centered and the space between the boundaries is
- filled.
- Vectorizing these single blob images produces an 400 dimensional feature
- space that can be used to train and predict on an Neural Network.
+20px image.
+ The blobs are then centered in the image.
+ A binary image is created by filling the blob with a grayscale value of
+ 255 (white) and setting the rest of the image to 0 (black).
+ Vectorizing these single blob images produces a 400 dimensional feature
+ space that is then used to train a neural network and make predictions.
\end_layout
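The thumbnail feature construction (resize to 20×20, centre, binarize, vectorize to 400 dimensions) can be sketched as follows; this assumes the blob is passed as its full set of pixel coordinates, so painting those pixels already yields a filled binary shape (names are illustrative):

```python
import numpy as np

def blob_thumbnail(coords, size=20):
    """Scale a blob's pixel coordinates into a size x size image, centre
    it, paint the blob white (255) on black (0), and vectorize the result
    into a size*size-dimensional feature vector (400-D for size=20)."""
    pts = np.asarray(coords, dtype=float)
    pts -= pts.min(axis=0)                         # move blob to the origin
    idx = np.round(pts * (size - 1) / max(pts.max(), 1.0)).astype(int)
    off = (size - 1 - idx.max(axis=0)) // 2        # centre within thumbnail
    img = np.zeros((size, size), dtype=np.uint8)
    img[idx[:, 0] + off[0], idx[:, 1] + off[1]] = 255
    return img.ravel()
```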
\end_body
BIN  report/LyxReportStructure/images/Pipeline.png
BIN  report/LyxReportStructure/images/Pipeline.pptx
13 report/LyxReportStructure/logisticRegression.lyx
@@ -60,25 +60,25 @@
\begin_body
\begin_layout Standard
-Unegularized logistic regression with the objective function
+Unregularized logistic regression with the objective function
\begin_inset Formula $J(\theta)$
\end_inset
- and the gradient as shown in Equations
+ and the gradient as shown in equations
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:log-reg-cost-function"
\end_inset
-and
+ and
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:log-reg-gradient-function"
\end_inset
- were used for classifying data, where
+ were used for classifying the data, where
\begin_inset Formula $m$
\end_inset
@@ -132,7 +132,7 @@ represents the label for
.
The input for the logistic regression learning algorithm was the mean of
- the ratio of eigenvalues and the circularity measure.
+ the ratio of eigenvalues and the circularity measures.
Standard MATLAB optimization function
\shape italic
\size footnotesize
@@ -176,7 +176,8 @@ reference "fig:Classification-using-logistic"
\end_inset
shows decision boundary for the classification.
-
+ For the entire data set, logistic regression seems to perform classification
+ quite well.
\end_layout
\begin_layout Standard
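The unregularized cost $J(\theta)$ and gradient referenced here are the standard logistic regression expressions. A self-contained Python sketch, with plain gradient descent standing in for MATLAB's optimizer (function names and the learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Unregularized logistic regression cost J(theta) and its gradient,
    averaged over the m training samples."""
    m = len(y)
    h = sigmoid(X @ theta)
    hc = np.clip(h, 1e-12, 1 - 1e-12)        # numerical safety for the logs
    J = -np.mean(y * np.log(hc) + (1 - y) * np.log(1 - hc))
    grad = X.T @ (h - y) / m
    return J, grad

def fit(X, y, lr=0.5, steps=2000):
    """Minimize J(theta) by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * cost_and_grad(theta, X, y)[1]
    return theta
```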
68 report/LyxReportStructure/mle.lyx
@@ -60,7 +60,7 @@
\begin_body
\begin_layout Standard
-For the MLE classification it is assumed that the data is gaussian distributed
+For the MLE classification, it is assumed that the data is Gaussian distributed
for each of the classes.
\end_layout
@@ -76,14 +76,14 @@ For the MLE classification it is assumed that the data is gaussian distributed
\end_layout
\begin_layout Standard
-To fit a multivariate gaussian (equation
+To fit a multivariate Gaussian (equation
\begin_inset CommandInset ref
LatexCommand ref
reference "eq:mle1"
\end_inset
-) on the data of one class it is necessary to compute the maximum values
+) on the data of one class, it is necessary to compute the maximum likelihood estimates
for the mean
\begin_inset Formula $\mu$
\end_inset
@@ -93,13 +93,27 @@ reference "eq:mle1"
\end_inset
.
- This can be done solving for the closed forms
+ This can be done by solving the closed form equations
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "eq:mu-mle"
+
+\end_inset
+
+ and
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "eq:sigma-mle"
+
+\end_inset
+
+.
\end_layout
\begin_layout Standard
\begin_inset Formula
\begin{equation}
-\mu_{MLE}=\frac{1}{n}\sum_{i=1}^{n}x^{(i)}
+\mu_{MLE}=\frac{1}{n}\sum_{i=1}^{n}x^{(i)}\label{eq:mu-mle}
\end{equation}
\end_inset
@@ -110,7 +124,7 @@ reference "eq:mle1"
\begin_layout Standard
\begin_inset Formula
\begin{equation}
-\Sigma_{MLE}=\frac{1}{n}\sum_{i=1}^{n}(x^{(i)}-\mu)(x^{(i)}-\mu)^{T}
+\Sigma_{MLE}=\frac{1}{n}\sum_{i=1}^{n}(x^{(i)}-\mu)(x^{(i)}-\mu)^{T}\label{eq:sigma-mle}
\end{equation}
\end_inset
@@ -119,8 +133,16 @@ reference "eq:mle1"
\end_layout
\begin_layout Standard
-using MLE.
- The MLE optimization is repeated for each of the classes to evaluate
+The MLE optimization is repeated for each of the classes, i.e.
+
+\shape italic
+fragmented
+\shape default
+ and
+\shape italic
+tubular
+\shape default
+, to evaluate
\begin_inset Formula $\mu_{f}$
\end_inset
@@ -136,9 +158,9 @@ using MLE.
\begin_inset Formula $\Sigma_{t}$
\end_inset
- for the classes fragmented and tubular.
+.
In 2D the Gaussians can be shown as ellipses over the data points (see
- figur
+ figure
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:Multivariate-Gaussians-fitted"
@@ -146,8 +168,8 @@ reference "fig:Multivariate-Gaussians-fitted"
\end_inset
).
- To predict on this model for new samples the values for both Gaussians
-
+ To make predictions using this model for new samples, the values for both
+ Gaussians
\begin_inset Formula $p(x|\mu_{f},\Sigma_{f})$
\end_inset
@@ -155,16 +177,26 @@ reference "fig:Multivariate-Gaussians-fitted"
\begin_inset Formula $p(x|\mu_{t},\Sigma_{t})$
\end_inset
- have to be calculated.
- The new class will be that one with the highes value returned.
+ are calculated.
+ The blobs are assigned to the class whose Gaussian returns the highest
+ value.
\end_layout
\begin_layout Standard
-Despite the MLE classification ist one of the simplest methods, because
- there exists a closed form and no complicatet optimization algorithms are
- needed, it turned out to be the model with the best classification results
- for the mitochondria data.
+Figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:Multivariate-Gaussians-fitted"
+
+\end_inset
+
+ shows the ellipses corresponding to the Gaussians belonging to the two
+ classes and the decision boundary.
+ Despite the fact that MLE classification is one of the simplest methods,
+ because there exists a closed form and no complicated optimization algorithms
+ are needed, it turned out to be the model with the best classification
+ results for the mitochondria data.
\end_layout
\begin_layout Standard
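The closed-form estimates for $\mu_{MLE}$ and $\Sigma_{MLE}$ and the highest-density decision rule translate directly into code. A minimal Python sketch (standing in for the report's MATLAB; names are illustrative):

```python
import numpy as np

def fit_gaussian(X):
    """Closed-form MLE for one class: mu = sample mean,
    Sigma = (1/n) * sum of outer products of the deviations."""
    mu = X.mean(axis=0)
    d = X - mu
    return mu, d.T @ d / len(X)

def log_density(x, mu, Sigma):
    """Log of the multivariate Gaussian density at x."""
    d = x - mu
    return -0.5 * (d @ np.linalg.solve(Sigma, d)
                   + np.log(np.linalg.det(Sigma))
                   + len(mu) * np.log(2 * np.pi))

def classify(x, params_f, params_t):
    """Assign the class whose fitted Gaussian gives the higher density."""
    return ('fragmented'
            if log_density(x, *params_f) >= log_density(x, *params_t)
            else 'tubular')
```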
44 report/LyxReportStructure/svm.lyx
@@ -61,7 +61,14 @@
\begin_layout Standard
Support vector machines (SVM) belong to a group of classifiers called sparse
- kernel machines.
+ kernel machines
+\begin_inset CommandInset citation
+LatexCommand cite
+key "Bishop_Book"
+
+\end_inset
+
+.
A big advantage of these machines is that they have sparse solutions, so
that the predictions for new inputs depend only on the kernel function
evaluated at a subset of the training data points.
@@ -145,8 +152,8 @@ L_{P}=\frac{1}{2}\|w\|^{2}-\sum_{i}\alpha^{(i)}y^{(i)}(w\cdot x^{(i)})-\sum_{i}\
\end_layout
\begin_layout Standard
-Because the mitochondria data set has shown not to be perfectly linear seperable
- (see figure
+Because the mitochondria data set has shown not to be perfectly linearly
+ separable (see figure
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:Linear-SVM"
@@ -160,19 +167,19 @@ reference "fig:Linear-SVM"
has been introduced to control the trade-off between slack variables and
the margin.
The slack variables allow misclassification but a penalty increasing with
- the distance from the boundary will contribute to the optimization problem.
+ the distance from the boundary will contribute to the optimization objective.
\end_layout
\begin_layout Standard
-The Linear SVM has been used to classify the mitochondria data set.
+The linear SVM has been used to classify the mitochondria data set.
The value used for
\begin_inset Formula $C$
\end_inset
was 100.
- Because the data is almost linear seperable it produced good classifications
- for new samples, too (see table
+ Because the data is almost linearly separable it produced good classification
+ results for new samples, too (see table
\begin_inset CommandInset ref
LatexCommand ref
reference "table:accuracy-ml-methods"
@@ -180,6 +187,15 @@ reference "table:accuracy-ml-methods"
\end_inset
).
+ The linear decision boundary computed by SVM is almost identical to the
+ one found using logistic regression (figure
+\begin_inset CommandInset ref
+LatexCommand ref
+reference "fig:Classification-using-logistic"
+
+\end_inset
+
+).
\end_layout
\begin_layout Standard
@@ -269,17 +285,13 @@ reference "eq:constraintsvm3"
\end_inset
-like in the the primal formulation.
+ like in the primal formulation.
The kernel
\begin_inset Formula $K(x^{(i)},x^{(j)})$
\end_inset
- is substituded to the inner product.
-\end_layout
-
-\begin_layout Standard
-The kernel puts the decision problem to a higher dimension in which the
- data will become linear seperable.
+ is substituted for the inner product and transforms the decision problem
+ to a higher dimensional space in which the data becomes linearly separable.
For the mitochondria classification problem a Gaussian kernel (equation
\begin_inset CommandInset ref
@@ -304,7 +316,7 @@ K(a,b)=\exp\left(\frac{\|a-b\|^{2}}{2\sigma^{2}}\right)\label{eq:gaussiankernel}
\begin_layout Standard
The result for a decision boundary using the Gaussian kernel is shown in
- figur
+ figure
\begin_inset CommandInset ref
LatexCommand ref
reference "fig:Gaussian-Kernel-SVM"
@@ -323,7 +335,7 @@ reference "fig:Gaussian-Kernel-SVM"
to 0.2.
The SVM learned the shape of the data distribution perfectly, but there
is also a danger of overfitting on the possible outliers on the bottom
- of the figur.
+ of the figure.
\end_layout
\begin_layout Standard
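The Gaussian-kernel SVM with $C=100$ and $\sigma=0.2$ can be reproduced with an off-the-shelf implementation. This sketch uses scikit-learn rather than the report's own code; note that the Gaussian kernel conventionally carries a negative exponent, $K(a,b)=\exp(-\|a-b\|^{2}/2\sigma^{2})$, and scikit-learn parametrizes it as $\exp(-\gamma\|a-b\|^{2})$, so $\gamma=1/(2\sigma^{2})$. The toy data below is purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# C = 100 and sigma = 0.2 as in the report; gamma = 1 / (2 * sigma^2)
sigma = 0.2
clf = SVC(kernel='rbf', C=100.0, gamma=1.0 / (2.0 * sigma ** 2))

# Illustrative two-class 2-D data standing in for the mitochondria features
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
y = np.array([0, 0, 0, 1, 1, 1])
clf.fit(X, y)
```

A small gamma (large sigma) smooths the boundary and reduces the overfitting risk mentioned above; the chosen sigma = 0.2 fits the training data very tightly.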