
Document and tighten up the demo

tdunning committed Mar 21, 2014
1 parent d615c3f commit e359c00a4cc2de9df31fd3580ee6100314048d00
@@ -1,21 +1,60 @@
This project provides implementations of common sparse coding algorithms.
### Anomaly Detection using Sub-sequence Clustering
The best illustration of the code so far is an anomaly detection demo. The idea is to use sub-sequence clustering
of an EKG signal to reconstruct the EKG. The difference between the original and the reconstruction can be used
to find anomalies in the original signal.
This project provides a demonstration of a simple time-series anomaly detector.
The idea is to use sub-sequence clustering of an EKG signal to reconstruct the EKG. The difference between
the original and the reconstruction can be used as a measure of how closely the signal resembles a prototypical
EKG. Poor reconstruction can thus be used to find anomalies in the original signal.
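The reconstruction idea can be sketched in a few lines of Java. This is a simplified illustration, not the code in com.tdunning.sparse.Learn: it assumes non-overlapping windows and a tiny hand-written dictionary, and the class and method names (`ReconstructionSketch`, `reconstruct`, `residual`) are hypothetical. Each window is replaced by the best least-squares scaled prototype, and the residual is what an anomaly detector would inspect.

```java
import java.util.Arrays;

public class ReconstructionSketch {
    // Rebuilds the signal one non-overlapping window at a time, using the
    // prototype that gives the smallest squared error after optimal scaling.
    // Samples after the last full window are left at zero.
    static double[] reconstruct(double[] signal, double[][] dict) {
        int w = dict[0].length;
        double[] out = new double[signal.length];
        for (int start = 0; start + w <= signal.length; start += w) {
            double bestErr = Double.POSITIVE_INFINITY;
            for (double[] proto : dict) {
                double dot = 0, norm = 0;
                for (int i = 0; i < w; i++) {
                    dot += signal[start + i] * proto[i];
                    norm += proto[i] * proto[i];
                }
                // least-squares scale factor for this prototype
                double scale = norm == 0 ? 0 : dot / norm;
                double err = 0;
                for (int i = 0; i < w; i++) {
                    double d = signal[start + i] - scale * proto[i];
                    err += d * d;
                }
                if (err < bestErr) {
                    bestErr = err;
                    for (int i = 0; i < w; i++) {
                        out[start + i] = scale * proto[i];
                    }
                }
            }
        }
        return out;
    }

    // Point-wise difference between the original and the reconstruction.
    static double[] residual(double[] signal, double[] recon) {
        double[] r = new double[signal.length];
        for (int i = 0; i < signal.length; i++) {
            r[i] = signal[i] - recon[i];
        }
        return r;
    }

    public static void main(String[] args) {
        double[][] dict = {{1, 2, 1}, {1, -1, 1}};          // made-up prototypes
        double[] signal = {2, 4, 2, 3, -3, 3, 0, 0, 5};     // last window fits poorly
        double[] err = residual(signal, reconstruct(signal, dict));
        System.out.println(Arrays.toString(err));
    }
}
```

The first two windows are exact multiples of a prototype, so their residual is zero; the last window matches neither prototype well and leaves a large residual, which is the signature of an anomaly.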
The data for this demo are taken from PhysioNet. See http://physionet.org/physiobank/database/#ecg-databases
The particular data used for this demo is the Apnea ECG database which can be found at
http://physionet.org/physiobank/database/apnea-ecg/
To run the demo, note that there is a data file included in the resources of this software (see src/main/resources/a02.dat).
You can find the original version of this file at
All necessary data for this demo is included as a resource in the source code (see src/main/resources/a02.dat).
You can find the original version of the training data at
http://physionet.org/physiobank/database/apnea-ecg/a02.dat
This file is 6.1MB in size and contains several hours of recorded EKG data from a patient in a sleep apnea study.
This file is 6.1MB in size and contains several hours of recorded EKG data from a patient in a sleep apnea study. This
file contains 3.2 million samples of which we use the first 200,000 for training.
### Installing and Running the Demo
The class com.tdunning.sparse.Learn goes through the steps required to read and process this data to produce a simple
anomaly detector.
anomaly detector. The output of this program consists of the clustering itself (in dict.tsv) as well as a reconstruction
of the test signal (in trace.tsv). These outputs can be visualized using the provided R script.
To compile and run the demo:
    mvn -q exec:java -Dexec.mainClass=com.tdunning.sparse.Learn
To produce the figures showing how the anomalies are detected:
    rm *.pdf ; Rscript figures.r
### What the Figures Show
Figure 1 shows how an ordinary, non-anomalous signal (top line) is reconstructed (middle line) with relatively small
errors (bottom line). Figures 2, 3 and 4 show magnified views of the error for three successive 5-second periods.
Looking at the distribution of the reconstruction error in Figure 5 shows that the error is distinctly not normally
distributed. Instead, the distribution of the error has longer tails than the normal distribution would have.
Figure 6 shows a histogram of the error. The standard deviation of the error is about 5, but nearly 2% of the
errors are larger in magnitude than 15 (3 standard deviations). This is implausibly many for a normal distribution,
under which fewer than 0.3% of the errors would be that large. Even more extreme, 50 samples per million are larger
than 20 standard deviations.
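The tail comparison above amounts to two small computations: the sample standard deviation of the error, and the fraction of errors whose magnitude exceeds a multiple of it. A minimal Java sketch follows; the class `TailSketch` and its methods are hypothetical names, and the data in `main` are made up purely to exercise the code.

```java
public class TailSketch {
    // Sample standard deviation (divides by n - 1, like R's sd()).
    static double sd(double[] x) {
        double mean = 0;
        for (double v : x) {
            mean += v;
        }
        mean /= x.length;
        double ss = 0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        return Math.sqrt(ss / (x.length - 1));
    }

    // Fraction of values whose magnitude exceeds the threshold.
    static double tailFraction(double[] x, double threshold) {
        int count = 0;
        for (double v : x) {
            if (Math.abs(v) > threshold) {
                count++;
            }
        }
        return (double) count / x.length;
    }

    public static void main(String[] args) {
        double[] error = new double[100];   // made-up data: one large outlier
        error[0] = 100;
        double s = sd(error);
        System.out.printf("sd = %.1f, P(|error| > 3 sd) = %.2f%%%n",
                s, 100 * tailFraction(error, 3 * s));
    }
}
```

For normally distributed errors the fraction beyond 3 standard deviations would be roughly 0.27%, so a measured value near 2% points to tails that are much heavier than normal.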
Scanning for errors greater than 100 takes us to a point 100 seconds into the recording where the error spikes sharply.
Figure 7 shows the error and Figure 8 shows the original and reconstructed signals for this 5-second period. The
reconstruction clearly isn't capturing the negative excursion of the original signal, but it isn't clear why. Figure 9
shows a magnified view of the 1 second right around the anomaly, and we can see that the problem is a double beat.
Scanning for more anomalies takes us to 240s into the trace where there is a clear signal acquisition malfunction as
shown in Figures 10 and 11.
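Scanning for anomalies in this way can be sketched as a threshold pass over the reconstruction error. The Java below is an illustration only: `AnomalyScanSketch` and `anomalyTimes` are hypothetical names, and the rate of 200 samples per second is an assumption taken from the time axis used in figures.r.

```java
import java.util.ArrayList;
import java.util.List;

public class AnomalyScanSketch {
    // Returns the start time, in seconds, of each run of samples whose
    // error magnitude exceeds the threshold.
    static List<Double> anomalyTimes(double[] error, double threshold, double samplesPerSecond) {
        List<Double> times = new ArrayList<>();
        boolean inside = false;
        for (int i = 0; i < error.length; i++) {
            boolean over = Math.abs(error[i]) > threshold;
            if (over && !inside) {
                // first sample of a new excursion
                times.add(i / samplesPerSecond);
            }
            inside = over;
        }
        return times;
    }

    public static void main(String[] args) {
        double[] error = new double[1000];  // made-up error trace
        error[400] = 150;
        error[401] = -120;                  // same excursion as sample 400
        error[800] = 101;
        System.out.println(anomalyTimes(error, 100, 200));
    }
}
```

Reporting only the first sample of each excursion keeps a single event such as a double beat from appearing as many separate hits.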
The 64 most commonly used sub-sequence clusters are shown in Figure 12. The left-most column shows how translations
of the same portion of the heartbeat show up as clusters in the signal dictionary. These patterns are scaled, shifted
and added to reconstruct the original signal.
@@ -0,0 +1,95 @@
figure = 1
ylim = c(-180, 180)
# load the reconstruction trace and the cluster dictionary once per session
if (!exists('xtrace')) {
  xtrace = read.delim("trace.tsv", header=FALSE)
  dict = read.delim("dict.tsv", header=FALSE)
}
# plots three graphs in a single figure. These graphs contain
# 1) the original signal
# 2) the reconstructed signal
# 3) the error
# The offset argument indicates how many samples the plots should be offset into the entire trace
threePlot = function (offset) {
  startPdf()
  # 5 x 3 layout: panels 1-3 stacked in the middle column, margins elsewhere
  xm = matrix(c(0,1,2,3,0), ncol=3, nrow=5)
  xm[,1] = 0
  xm[,3] = 0
  layout(xm, heights=c(0.2, 1, 1, 1.5, 0.02), widths = c(0.1, 1, 0.1))
  par(mar=c(1,1,0,0.2))
  # original signal
  plot(xtrace$V1[offset + 1:1000], type='l', ylim=ylim, ylab='', xaxt='n', yaxt='n')
  axis(side=2, at=c(-100, 0, 100))
  # reconstructed signal
  plot(xtrace$V2[offset + 1:1000], type='l', ylim=ylim, ylab='', xaxt='n', yaxt='n')
  axis(side=2, at=c(-100, 0, 100))
  par(mar=c(5,1,0,0.2))
  # reconstruction error, with time in seconds on the x axis
  plot((offset + 1:1000)/200, xtrace$V1[offset + 1:1000] - xtrace$V2[offset + 1:1000],
       type='l', ylim=ylim, ylab='', yaxt='n', xlab="Time (s)")
  axis(side=2, at=c(-100, 0, 100))
  dev.off()
}
# plots the reconstruction error for the 1000 samples following offset
errorChart = function(offset) {
  startPdf()
  plot((offset + 1:1000)/200, xtrace$V1[offset + 1:1000] - xtrace$V2[offset + 1:1000],
       type='l', xlab="Time (s)", ylab="mV", ylim=ylim,
       main=paste(offset/200, " to ", (offset+1000)/200, "seconds"), yaxt='n')
  axis(side=2, at=c(-100, 0, 100))
  dev.off()
}
# plots the original signal for the size samples following offset
originalChart = function(offset, size=1000) {
  startPdf()
  plot((offset + 1:size)/200, xtrace$V1[offset + 1:size],
       type='l', xlab="Time (s)", ylab="mV", ylim=ylim,
       main=paste(offset/200, " to ", (offset+size)/200, "seconds"), yaxt='n')
  axis(side=2, at=c(-100, 0, 100))
  dev.off()
}
# opens the next numbered PDF file and advances the global figure counter
startPdf = function () {
  pdf(sprintf("figure-%02d.pdf", figure), width=6, height=4, pointsize=11)
  assign("figure", figure + 1, envir = .GlobalEnv)
}
# these show normal behavior
threePlot(0)
errorChart(0)
errorChart(1000)
errorChart(2000)
# reconstruction error is clearly not normally distributed
error = (xtrace$V1[1:20000]-xtrace$V2[1:20000])
startPdf()
qqnorm(error, cex=0.3, main="Reconstruction error is not normally distributed")
dev.off()
# but it is very tightly constrained
startPdf()
hist(error, breaks=70, main = "Reconstruction Error", xlab="Error (mV)", freq=F)
text(-55, 0.09, bquote(sigma == .(sprintf("%.1f", sd(error)))), adj=c(0,0))
text(-55, 0.08, sprintf("P(error > %.0f) = %.1f%%", 3*sd(error), 100*mean(abs(error) > 3 * sd(error))), adj=c(0,0))
text(-55, 0.07, sprintf("P(error > 100) < %.0f ppm", 1e6/20000), adj=c(0,0))
dev.off()
# at 100 seconds in, we see an anomaly
errorChart(20000)
threePlot(20000)
originalChart(20200, 200)
errorChart(48000)
threePlot(48000)
startPdf()
layout(matrix(1:64, ncol=8))
par(mar=c(0,0,0,0))
ix = order(-table(xtrace$V3))
for (i in ix[1:64]) {
  ymax = max(abs(dict[i,]))
  plot(t(dict[i,]), type='l', xaxt='n', yaxt='n', ylim=c(-ymax,ymax))
}
dev.off()
153 pom.xml
@@ -4,10 +4,49 @@
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>SparseCoding</groupId>
<artifactId>SparseCoding</artifactId>
<groupId>com.tdunning</groupId>
<artifactId>AnomalyDetector</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<name>Anomaly Detection Demo</name>
<description>A demonstration of sub-sequence clustering for anomaly detection</description>
<url>https://github.com/tdunning/anomaly-detection</url>
<licenses>
<license>
<name>The Apache Software License, Version 2.0</name>
<url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
<distribution>repo</distribution>
</license>
</licenses>
<scm>
<connection>scm:git:https://github.com/tdunning/anomaly-detection.git</connection>
<developerConnection>scm:git:https://github.com/tdunning/anomaly-detection.git</developerConnection>
<tag>HEAD</tag>
<url>https://github.com/tdunning/anomaly-detection</url>
</scm>
<developers>
<developer>
<id>tdunning</id>
<name>Ted</name>
<email>ted.dunning@gmail.com</email>
<url>https://github.com/tdunning/anomaly-detection</url>
<roles>
<role>developer</role>
</roles>
<timezone>-8</timezone>
<properties>
<twitter>@ted_dunning</twitter>
</properties>
</developer>
</developers>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.mahout</groupId>
@@ -25,4 +64,114 @@
<version>4.11</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.16</version>
<configuration>
<parallel>methods</parallel>
<perCoreThreadCount>true</perCoreThreadCount>
<threadCount>1</threadCount>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<verbose>true</verbose>
<compilerVersion>1.7</compilerVersion>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
<version>2.2.1</version>
<executions>
<execution>
<id>attach-sources</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>2.9.1</version>
<executions>
<execution>
<id>attach-javadocs</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-release-plugin</artifactId>
<version>2.4.2</version>
<configuration>
<arguments>-Dgpg.keyname=${gpg.keyname} -Dgpg.passphrase=${gpg.passphrase}</arguments>
</configuration>
</plugin>
</plugins>
</build>
<distributionManagement>
<snapshotRepository>
<id>sonatype-nexus-snapshots</id>
<name>Sonatype Nexus snapshot repository</name>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
</snapshotRepository>
<repository>
<id>sonatype-nexus-staging</id>
<name>Sonatype Nexus release repository</name>
<url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
</repository>
</distributionManagement>
<profiles>
<profile>
<id>sign-artifacts</id>
<activation>
<property>
<name>performRelease</name>
<value>true</value>
</property>
</activation>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-gpg-plugin</artifactId>
<version>1.4</version>
<configuration>
<passphrase>${gpg.passphrase}</passphrase>
<keyname>${gpg.keyname}</keyname>
</configuration>
<executions>
<execution>
<id>sign-artifacts</id>
<phase>verify</phase>
<goals>
<goal>sign</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>