
adding a few missing images/posts
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Jan 28, 2021
1 parent 0b0f34b commit b089cc1
Showing 7 changed files with 13 additions and 18 deletions.
15 changes: 5 additions & 10 deletions _posts/2013/2013-10-26-cluster-validation.md
@@ -12,13 +12,11 @@ tags:
---


-Unsupervised clustering can be tough in situations when we just don't know "the right number" of clusters.  In toy examples we can sometimes use domain knowledge to push us toward the right answer, but what happens when you really haven't a clue?  Forget about the number of clusters, how about when I don't know the algorithm to use?  We've talked about using metrics like the [Gap Statistic](http://www.vbmis.com/learn/?p=574), and [sparse K-Means](http://www.vbmis.com/learn/?p=454) that looks at the within cluster sum of squares for optimization.  These are pretty good.  However, I wanted more.  That's when I found a package for R called [clValid](http://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf).
+Unsupervised clustering can be tough in situations when we just don't know "the right number" of clusters.  In toy examples we can sometimes use domain knowledge to push us toward the right answer, but what happens when you really haven't a clue?  Forget about the number of clusters, how about when I don't know the algorithm to use?  We've talked about using metrics like the [Gap Statistic](https://vsoch.github.io/2013/the-gap-statistic/), and [sparse K-Means](https://vsoch.github.io/2013/sparse-k-means-clustering-sparcl/) that looks at the within cluster sum of squares for optimization.  These are pretty good.  However, I wanted more.  That's when I found a package for R called [clValid](http://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf).

Now that I'm post Quals, I can happily delve into work that is a little more applied than theoretical (thank figs!).  First let's talk generally about different kinds of evaluation metrics, and then let's try them with some data.  The kinds of evaluation metrics that we will use fall into three categories: internal, stability, and biological.  Actually, stability is a kind of internal, so I'll talk about it as a subset of internal.




## What is internal validation?

Internal validation is the introverted validation method.  We figure out how good the clustering is based on intrinsic properties that don't go outside of our particular dataset, like compactness, distance between clusters, and how well connected they are.  For example:
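The compactness-and-separation idea above can be made concrete with the Dunn index, one of the internal measures clValid reports. Below is a minimal sketch in Python on hypothetical toy points (the post itself works in R, so this is an illustration of the formula, not clValid's code): the index is the smallest between-cluster distance divided by the largest within-cluster diameter, so higher is better.

```python
from itertools import combinations
import math

def dunn_index(clusters):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter.
    Higher values indicate compact, well-separated clusters."""
    # Largest within-cluster pairwise distance (the cluster "diameter")
    diameters = [
        max((math.dist(a, b) for a, b in combinations(pts, 2)), default=0.0)
        for pts in clusters
    ]
    # Smallest between-cluster pairwise distance (the separation)
    separations = [
        min(math.dist(a, b) for a in c1 for b in c2)
        for c1, c2 in combinations(clusters, 2)
    ]
    return min(separations) / max(diameters)

# Two compact, well-separated toy clusters
tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
# Same idea, but with one cluster stretched toward the other
loose = [[(0, 0), (0, 1), (5, 5)], [(10, 10), (10, 11), (11, 10)]]

print(dunn_index(tight) > dunn_index(loose))  # True: tight clustering scores higher
```

The same intuition drives the other internal measures: reward small within-cluster spread and large between-cluster distance.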
@@ -62,13 +60,13 @@ To calculate the Biological Homogenity Index, the result of which will be a valu

I've gone over unsupervised algorithms in other posts, so here I will just list what the clValid package offers:

-- UPGMA: This is agglomerative [hierarchical clustering](http://www.vbmis.com/learn/?p=98 "Hierarchical Clustering"), bread and butter.
-- [K-Means](http://www.vbmis.com/learn/?p=94 "K-Means Clustering")! Also bread and butter
+- UPGMA: This is agglomerative [hierarchical clustering](https://vsoch.github.io/2013/hierarchical-clustering/), bread and butter.
+- [K-Means](https://vsoch.github.io/2013/k-means-clustering/)! Also bread and butter
- Diana and SOTA: also hierarchical clustering, but divisive (meaning instead of starting with points and merging, we start with a massive blob and split it)
- PAM "Partitioning around medoids" is basically K-means with distance metrics other than Euclidean.  Its sister package Clara runs PAM in a bootstrappy sort of way on subsets of data.
- Fanny: Fuzzy clustering!
-- SOM: [self organizing maps! ](http://www.vbmis.com/learn/?p=510 "Self Organizing Maps (SOM)")
-- Model Based: fit your data to some statistical distribution using the [EM (expectation maximization) algorithm](http://www.vbmis.com/learn/?p=345 "Expectation Maximization (EM) Algorithm").
+- SOM: [self organizing maps! ](https://vsoch.github.io/2013/self-organizing-maps-som/)
+- Model Based: fit your data to some statistical distribution using the [EM (expectation maximization) algorithm](https://vsoch.github.io/2013/expectation-maximization-em-algorithm/).
- SOTA: self organizing trees.
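Since K-means is the "bread and butter" of the list, a minimal Lloyd's-algorithm sketch may help ground it. This is illustrative toy Python, not what clValid calls under the hood: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: centroid = mean of its points (keep old one if empty)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious blobs of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

PAM follows the same assign/update loop, except the "centroid" must be an actual data point (a medoid) and any distance metric can be used.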


@@ -166,9 +164,6 @@ And as you see above, we can use the optimalScores() method instead of summary()
Remember that, for each of these, the result is a value between 0 and 1, with 1 indicating more biologically homogeneous (BHI) and more stable (BSI).  This says that, based on our labels, there isn't that much homogeneity, which we could have guessed from looking at the tree.  But it's not so terrible - I tried running this with a random shuffling of the labels, and got values pitifully close to zero.
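The core of the BHI idea fits in a few lines. The sketch below is a simplified Python stand-in for clValid's actual formula, with hypothetical sample IDs and labels: score each cluster by the fraction of its member pairs that share a label, then average. Label-pure clusters score near 1, and shuffled labels drive the score toward 0, which matches the behavior described above.

```python
from itertools import combinations

def homogeneity(clusters, labels):
    """Simplified BHI-style score in [0, 1]: the fraction of within-cluster
    pairs whose members share a label, averaged over clusters of size >= 2."""
    scores = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # singleton clusters carry no pair information
        agree = sum(labels[a] == labels[b] for a, b in pairs)
        scores.append(agree / len(pairs))
    return sum(scores) / len(scores)

labels = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
pure = [["s1", "s2"], ["s3", "s4"]]   # clusters agree with the labels
mixed = [["s1", "s3"], ["s2", "s4"]]  # clusters ignore the labels
print(homogeneity(pure, labels), homogeneity(mixed, labels))  # → 1.0 0.0
```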
 
## Summary
Firstly, learning this stuff is awesome.  I have a gut feeling that there is signal in this data, and it's on me to refine that signal.  What I haven't revealed is that this data is only a small sample of the number of morphometry metrics that I have derived, and the labels are sort of bad as well (I could go on a rant about this particular set of labels, but I won't :)).  Further, we have done absolutely no feature selection!  However, armed with the knowledge of how to evaluate these kinds of clustering, I can now dig deeper into my data - first doing feature selection across a much larger set of metrics, and then trying biological validation using many different sets of labels.  Most of my labels are more akin to expression values than binary objects, so I'll need to think about how to best represent them for this goal.  I have some other work to do today, so I'll leave this for next time.  Until then, stick a fork in me, I'm Dunn! :)
2 changes: 1 addition & 1 deletion _posts/2014/2014-1-04-star-wars-origami-and-photoshop.md
@@ -10,6 +10,6 @@ tags:

Pew, pew pew!

-[![millennium_falcon_reversed_backward2](http://www.vbmis.com/learn/wp-content/uploads/2013/12/millennium_falcon_reversed_backward21-785x371.png)](http://www.vbmis.com/learn/wp-content/uploads/2013/12/millennium_falcon_reversed_backward21.png)
+[![millennium_falcon_reversed_backward2]({{ site.baseurl }}/assets/images/posts/vbmis/millennium_falcon_reversed_backward21.png)]({{ site.baseurl }}/assets/images/posts/vbmis/millennium_falcon_reversed_backward21.png)


8 changes: 4 additions & 4 deletions _posts/2014/2014-1-04-why-are-journal-articles-so-boring.md
@@ -19,18 +19,18 @@ This is exactly the conundrum that went through my mind late one evening last we
##


-## [PubmedLib](http://www.vbmis.com/bmi/project/PUBMEDLib/)
+## [PubmedLib](https://github.com/vsoch/PUBMEDLib/)

This little page queries Pubmed for an abstract based on a search term,

-![search1](http://www.vbmis.com/learn/wp-content/uploads/2013/11/search1.png)
+![search1](https://raw.githubusercontent.com/vsoch/PUBMEDLib/main/img/1.png)

It then parses the abstract to determine parts of speech, and randomly selects words for you to fill in:

-![terms](http://www.vbmis.com/learn/wp-content/uploads/2013/11/terms.png)
+![terms](https://raw.githubusercontent.com/vsoch/PUBMEDLib/main/img/2.png)

Then, you get to see your silly creation, with words in red.  That’s pretty much it :o)

-[![result](http://www.vbmis.com/learn/wp-content/uploads/2013/11/result-785x442.png)](http://www.vbmis.com/learn/wp-content/uploads/2013/11/result.png)
+[![result](https://raw.githubusercontent.com/vsoch/PUBMEDLib/main/img/3.png)](https://raw.githubusercontent.com/vsoch/PUBMEDLib/main/img/3.png)

As you can see, completely useless, other than maybe some practice for me using Pubmed’s API.  I should probably do some *real* work now, hehe :)
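The fill-in step the post describes can be sketched quickly. Everything here is a hypothetical stand-in: the Pubmed query is omitted, and a naive "five or more letters" rule substitutes for real part-of-speech tagging, but it shows the mad-libs mechanic of randomly blanking candidate words from an abstract.

```python
import random
import re

def make_madlib(text, n_blanks, seed=0):
    """Blank out n randomly chosen 'content' words.  A word of 5+ letters
    naively stands in for real part-of-speech selection."""
    words = re.findall(r"[A-Za-z]+", text)
    candidates = sorted({w for w in words if len(w) >= 5})
    chosen = random.Random(seed).sample(candidates, n_blanks)
    blanked = text
    for w in chosen:
        # Replace whole-word occurrences with a blank for the player to fill
        blanked = re.sub(rf"\b{w}\b", "_____", blanked)
    return blanked, chosen

abstract = ("Unsupervised clustering can be tough when we do not know "
            "the right number of clusters.")
blanked, chosen = make_madlib(abstract, n_blanks=2)
print(blanked)
```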
2 changes: 1 addition & 1 deletion _posts/2014/2014-7-27-machine-learning-in-r.md
@@ -7,4 +7,4 @@ tags:
r
---

-<iframe class="iframe-class" frameborder="0" height="19420" scrolling="yes" src="http://vbmis.com/bmi/share/wall/BlueBeetle.html" width="100%"></iframe>
+<iframe class="iframe-class" frameborder="0" height="19420" scrolling="yes" src="https://vsoch.github.io/vbmis.com/media/R/BlueBeetle.html" width="100%"></iframe>
@@ -15,7 +15,7 @@ NeuroSynth is a database that maps functional activation coordinates to behavior

However, when I had the .Rda all saved up, I thought it would be easy (and fun!) to throw this into a quick web interface, and unite the two websites.  An hour and a half later – done!

-[![nsynth2ca](http://vsoch.com/blog/wp-content/uploads/2014/09/nsynth2ca.png)](http://www.vbmis.com/bmi/nsynth2ca/) [http://www.vbmis.com/bmi/nsynth2ca/](http://www.vbmis.com/bmi/nsynth2ca/)
+[nsynth2ca](https://vsoch.github.io/nsynth2ca/) and the code at [https://github.com/vsoch/nsynth2ca/](https://github.com/vsoch/nsynth2ca/)



2 changes: 1 addition & 1 deletion _posts/2016/2016-1-15-the-academic-software-developer.md
@@ -56,4 +56,4 @@ This link between local and web-based resource continues to be a challenge that
# The Academic Software Developer
The need for reproducible science has brought with it the emerging niche of the “academic software developer,” an individual who is a combined research scientist and full stack software developer, and is well suited to develop applications for specific domains of research. This is a space that exists between research and software development, and the niche that I choose to operate in. The Academic Software Developer, in the year 2016, is thinking very hard about how to integrate large data, analysis pipelines, and structured terminology and standards into web-friendly things. He or she is using modern web technology including streaming data, server side JavaScript, Python frameworks and cloud resources, and Virtual Machines. He or she is building and using Application Programming Interfaces, Continuous Integration, and version control. Speaking personally, I am continually experimenting with my own research, continually trying new things and trying to do a little bit better each time. I want to be a programming badass, and have ridiculous skill using [Docker](https://hub.docker.com/), [Neo4j](https://neo4j.com), [Web Components](https://www.polymer-project.org/1.0/), and building [Desktop applications](http://electron.atom.io/) to go along with the webby ones. It is not satisfying to stop learning, or to see such amazing technology being developed [down the street](http://googledevelopers.blogspot.com/) and have painful awareness that I might never be able to absorb it all. My work can always be better, and perhaps the biggest strength and burden of having such stubbornness is this feeling that efforts are never good enough. I think this can be an OK way to be, only given the understanding that it is ok for things to not work the first time. I find this process of learning, trying and failing, and trying again, to be exciting and essential for life fulfillment. There is no satisfaction at the end of the day if there are not many interesting, challenging problems to work on.

-I don't have an "ending" for this story, but I can tell you briefly what I am thinking about. Every paper should be associated with some kind of "reproducible repo." This could mean one (or more) of several things, depending on the abilities of the researcher and importance of the result. It may mean that I can deploy an entire analysis with the click of a button, akin to the recently published [MyConnectome Project](http://results.myconnectome.org). It may mean that a paper comes with a small web interface linking to a database and API to access methods and data, as I attempted even for my [first tiny publication](http://www.vbmis.com/bmi/noisecloud). It could be a simple [interactive web interface](http://vsoch.github.io/image-comparison-thresholding/) hosted with analysis code on a Github repo to explore a result. We could use [continuous integration](https://github.com/vsoch/reverse-inference-ci) outside of its scope to run an analysis, or [programmatically generate a visualization](https://github.com/vsoch/semantic-image-comparison-ci) using completely open source data and methods (APIs). A published result is almost useless if care is not taken to make it an actionable, implementable thing. I'm tired of static text being the output of years of work. As a researcher I want some kind of "reactive analysis" that is an assertion a researcher makes about a data input answering some hypothesis, and receiving notification about a change in results when the state of the world (data) changes. I want current "research culture" to be more open to business and industry practice of using data from unexpected places beyond Pubmed and limited self-report metrics that are somehow "more official" than someone writing about their life experience informally online.
-I am not convinced that the limited number of datasets that we pass around and protect, not sharing until we've squeezed out every last inference, are somehow better than the crapton of data that is sitting right in front of us in unexpected internet places. Outside of a shift in research culture, generation of tools toward this vision is by no means an easy thing to do. Such desires require intelligent methods and infrastructure that must be thought about carefully, and built. But we don't currently have these things, and we are already way fallen behind the standard in industry that probably comes by way of having more financial resources. What do we have? We have ourselves. We have our motivation, and skillset, and we can make a difference. My hope is that other graduate students have equivalent awareness to take responsibility for making things better. Work harder. Take risks, and do not be complacent. Take initiative to set the standard, even if you feel like you are just a little fish.
+I don't have an "ending" for this story, but I can tell you briefly what I am thinking about. Every paper should be associated with some kind of "reproducible repo." This could mean one (or more) of several things, depending on the abilities of the researcher and importance of the result. It may mean that I can deploy an entire analysis with the click of a button, akin to the recently published [MyConnectome Project](http://results.myconnectome.org). It may mean that a paper comes with a small web interface linking to a database and API to access methods and data, as I attempted even for my [first tiny publication](https://vsoch.github.io/noisecloud-www/). It could be a simple [interactive web interface](http://vsoch.github.io/image-comparison-thresholding/) hosted with analysis code on a Github repo to explore a result. We could use [continuous integration](https://github.com/vsoch/reverse-inference-ci) outside of its scope to run an analysis, or [programmatically generate a visualization](https://github.com/vsoch/semantic-image-comparison-ci) using completely open source data and methods (APIs). A published result is almost useless if care is not taken to make it an actionable, implementable thing. I'm tired of static text being the output of years of work. As a researcher I want some kind of "reactive analysis" that is an assertion a researcher makes about a data input answering some hypothesis, and receiving notification about a change in results when the state of the world (data) changes. I want current "research culture" to be more open to business and industry practice of using data from unexpected places beyond Pubmed and limited self-report metrics that are somehow "more official" than someone writing about their life experience informally online.
+I am not convinced that the limited number of datasets that we pass around and protect, not sharing until we've squeezed out every last inference, are somehow better than the crapton of data that is sitting right in front of us in unexpected internet places. Outside of a shift in research culture, generation of tools toward this vision is by no means an easy thing to do. Such desires require intelligent methods and infrastructure that must be thought about carefully, and built. But we don't currently have these things, and we are already way fallen behind the standard in industry that probably comes by way of having more financial resources. What do we have? We have ourselves. We have our motivation, and skillset, and we can make a difference. My hope is that other graduate students have equivalent awareness to take responsibility for making things better. Work harder. Take risks, and do not be complacent. Take initiative to set the standard, even if you feel like you are just a little fish.
