
Error running ICA on a local machine #172

Closed
vjlbym opened this issue Apr 19, 2015 · 18 comments

@vjlbym
Contributor

vjlbym commented Apr 19, 2015

Hi all,

I am posting an error log that I am getting when trying to run ICA on a recording of Ca2+ traces. There are about 50 cells in the field of view. So I set the number of ICs to 75, with 150 PCs.

The images at each time point are stored as .tif files. I loaded them in as a series and then normalized them using:

normdata = data.toTimeSeries().normalize(baseline='mean') #Normalize data by the global mean. (data-mean)/mean

normdata = data.toTimeSeries()

normdata.cache()
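For reference, the (data - mean) / mean baseline normalization described in the comment above can be sketched in plain NumPy (a standalone illustration, not Thunder's implementation; `normalize_by_mean` is a hypothetical helper name):

```python
import numpy as np

def normalize_by_mean(trace):
    # (data - mean) / mean: express each sample as a fractional
    # change relative to the trace's mean baseline.
    baseline = trace.mean()
    return (trace - baseline) / baseline

trace = np.array([8.0, 10.0, 12.0])  # mean baseline = 10
print(normalize_by_mean(trace))      # → [-0.2  0.   0.2]
```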

Thanks a lot for your help! And also, thanks a lot for Thunder :)


Py4JJavaError Traceback (most recent call last)
in ()
3 start_time = time.time()
4 from thunder import ICA
----> 5 modelICA = ICA(k=150,c=75).fit(normdata) # Run ICA on normalized data. k=#of principal components, c=#of ICs
6 sns.set_style('darkgrid')
7 plt.plot(modelICA.a);

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/factorization/ica.pyc in fit(self, data)
95
96 # reduce dimensionality
---> 97 svd = SVD(k=self.k, method=self.svdMethod).calc(data)
98
99 # whiten data

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/factorization/svd.pyc in calc(self, mat)
137
138 # compute (xx')^-1 through a map reduce
--> 139 xx = mat.times(cInv).gramian()
140 xxInv = inv(xx)
141

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/matrices.pyc in times(self, other)
191 newindex = arange(0, new_d)
192 return self._constructor(self.rdd.mapValues(lambda x: dot(x, other_b.value)),
--> 193 nrows=self._nrows, ncols=new_d, index=newindex).finalize(self)
194
195 def elementwise(self, other, op):

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/matrices.pyc in init(self, rdd, index, dims, dtype, nrows, ncols, nrecords)
52 elif ncols is not None:
53 index = arange(ncols)
---> 54 super(RowMatrix, self).init(rdd, nrecords=nrecs, dtype=dtype, dims=dims, index=index)
55
56 @Property

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in init(self, rdd, nrecords, dtype, index, dims)
48 self._index = None
49 if index is not None:
---> 50 self.index = index
51 if dims and not isinstance(dims, Dimensions):
52 try:

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in index(self, value)
65 def index(self, value):
66 # touches self.index to trigger automatic calculation from first record if self.index is not set
---> 67 lenSelf = len(self.index)
68 if type(value) is str:
69 value = [value]

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in index(self)
59 def index(self):
60 if self._index is None:
---> 61 self.populateParamsFromFirstRecord()
62 return self._index
63

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in populateParamsFromFirstRecord(self)
103 Returns the result of calling self.rdd.first().
104 """
--> 105 record = super(Series, self).populateParamsFromFirstRecord()
106 if self._index is None:
107 val = record[1]

/home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/data.pyc in populateParamsFromFirstRecord(self)
76 from numpy import asarray
77
---> 78 record = self.rdd.first()
79 self._dtype = str(asarray(record[1]).dtype)
80 return record

/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.pyc in first(self)
1165 2
1166 """
-> 1167 return self.take(1)[0]
1168
1169 def saveAsNewAPIHadoopDataset(self, conf, keyConverter=None, valueConverter=None):

/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.pyc in take(self, num)
1151 p = range(
1152 partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1153 res = self.context.runJob(self, takeUpToNumLeft, p, True)
1154
1155 items += res

/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/context.pyc in runJob(self, rdd, partitionFunc, partitions, allowLocal)
768 # SparkContext#runJob.
769 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 770 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
771 return list(mappedRDD._collect_iterator_through_file(it))
772

/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in call(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:

/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 12005, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 146, in _read_with_length
length = read_int(stream)
File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 464, in read_int
raise EOFError
EOFError

    org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
    org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
    org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
    org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
    org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

@GrantRVD

This section sticks out to me

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 12005, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 75, in main
command = pickleSer._read_with_length(infile)
File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 146, in _read_with_length
length = read_int(stream)
File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 464, in read_int
raise EOFError
EOFError

This indicates it encountered an EOF (end-of-file) marker where it didn't expect one, so perhaps the file ended prematurely. Could you give a little more info on the format of your data and how you're reading it into Python before calling Thunder?
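The `read_int` frame in the traceback follows a common pattern: read a fixed-size length header from a stream, and raise EOFError if the stream is already exhausted. A minimal sketch of that pattern (an illustration of the idea, not PySpark's exact code):

```python
import io
import struct

def read_int(stream):
    # Read a 4-byte big-endian length header, as the serializer does.
    # An empty read means the stream ended before the header arrived,
    # which is exactly the EOFError seen in the traceback above.
    raw = stream.read(4)
    if not raw:
        raise EOFError
    return struct.unpack("!i", raw)[0]

print(read_int(io.BytesIO(struct.pack("!i", 42))))  # → 42
```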

@vjlbym
Contributor Author

vjlbym commented Apr 19, 2015

Hey yeah,
I am loading the data directly into Thunder as

data=tsc.loadImagesAsSeries(dirpath,inputFormat='tif',startIdx=0,stopIdx=3000)

dirpath is the path to the directory containing the data. The data is a set of 8-bit .tif files corresponding to individual frames.

@vjlbym
Contributor Author

vjlbym commented Apr 19, 2015

Also, I tried the above with 16-bit .tif files, which is what the original images were; that didn't work. Since all the example files seem to be 8-bit .tif, I converted them to 8-bit.

@GrantRVD

Does the call to tsc.loadImagesAsSeries complete successfully? Can you try getting thunder/tsc to just display some of the images you input to confirm they're correct, before trying to run ICA?

@vjlbym
Contributor Author

vjlbym commented Apr 19, 2015

Yeah, the images load fine. data.first() looks fine too.

@GrantRVD

Then I'm lost, because it looks like there's a problem arising when thunder tries to serialize the data, but I can't pinpoint where the EOF is coming from. It would help if we could pin down which file is being read when the EOF is encountered.

@vjlbym
Contributor Author

vjlbym commented Apr 19, 2015

Weirdly, running with no start and stop index didn't give that error. But I am getting a Java out-of-memory error again. I think I set the heap size to 6 GB with JAVA_OPTS="-Xms6g -Xmx 6g". The data I am loading is only 100 MB, so I don't know what else to do to fix that. I am now attempting to run it on EC2. Haven't ever done that before, so it might take a bit.

@GrantRVD

Interesting. That could mean the indices you chose were handled weirdly by the serializer, or perhaps the system just ran out of memory before it could get to the EOFError. It's something, at least.

By the way, for your Java options, it doesn't make much sense to make your initial and maximum heap sizes the same. For 100 MB of data, 512m is a fine initial size; just make 6g (or 4096m) your maximum. That way the JVM won't reserve more memory than it needs up front but can still grow if necessary.

If even that doesn't fix your heap problems, you might want to dig into your program and understand more about where the problem is coming from. Try using some numpy or scipy tools to do your ICA manually and use IPython's memory profiler, or something similar, to track down what's eating up all of your memory.
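The suggestion above, running ICA locally outside of Spark to isolate the problem, could be sketched with scikit-learn's FastICA on a toy mixed-signal matrix (assuming scikit-learn is installed; the variable names and toy data here are illustrative only):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)

# Toy data: 3 independent source signals observed over 200 timepoints,
# linearly mixed into 10 observed channels (standing in for pixels).
t = np.linspace(0, 8, 200)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), np.cos(5 * t)]
mixing = rng.rand(10, 3)
observed = sources @ mixing.T  # shape (200, 10): timepoints x channels

# Recover estimated sources locally, with no Spark involved.
ica = FastICA(n_components=3, random_state=0)
recovered = ica.fit_transform(observed)  # shape (200, 3)
```

Running this under a memory profiler (e.g. IPython's %memit) on a subset of the real data would show whether the memory pressure comes from the decomposition itself or from Spark's overhead.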

@freeman-lab
Member

@vjlbym thanks for the detailed info on this issue! and thanks to @GrantRVD for helping out. Would it be possible to share these particular files? (even just a subset of them would be helpful) You could post to a public bucket on S3, or if you'd rather not share publicly, we could find another way to send them to me. That would let us do some tests and try to get to the bottom of it.

As I mentioned in the gitter, local usage is currently sub-optimal, especially when it comes to memory, but that's something we're looking to improve in future releases. Comparisons to memory use with other tools may, unfortunately, not be particularly informative, though profiling could be.

Trying to run on EC2 would definitely be worthwhile (and ultimately probably the way to go!). That said, we should be able to figure out the problem here.

And btw, both 16-bit and 8-bit tifs should in general work fine, I think that's probably not the particular problem here.

@vjlbym
Contributor Author

vjlbym commented Apr 19, 2015

@GrantRVD I am pretty sure it had to do with the indices, since it barely ran previously, whereas with no indices specified the first set of tasks completed for all 3000 images. I changed the heap sizes. I am pretty much new to Java, Thunder, Spark, and Python, so live and learn :) That didn't solve the memory issue, though.
@freeman-lab Thanks a lot, again! Since the data isn't mine, I am not comfortable posting it on a public space but I just emailed you a dropbox link to your gmail id that I found on your website. Thanks a lot for your help. I am setting up EC2 now.

@freeman-lab
Member

@vjlbym thanks for sharing the data! I've been playing with it and -- for better or for worse! -- can't seem to reproduce these problems.

I'm running Thunder 0.6.0.dev (the current master branch) and Spark 1.3.0, on Mac OS X 10.9.5, with no special Java memory settings.

With the data you sent, the following all works fine:

data = tsc.loadImages('/path/to/data/series1/', inputFormat='tif', npartitions=1)
data.count()
>> 3000
ts = data.toTimeSeries()
ts.cache()
ts.count()
>> 24939
from thunder import ICA
model = ICA(k=20, c=10).fit(ts)
import matplotlib.pyplot as plt
plt.plot(model.a)
>> pretty picture =)

Some notes / caveats:

  • this was Spark 1.3.0, and I noticed you were using Spark 1.1.0, so that could be part of the issue. I didn't test with 1.1.0, but I recommend at least trying 1.3.0.
  • this was on Mac OS X. it's possible that there are Spark + Windows + Java specific memory issues, which might be hard to debug =/
  • i tried several variants of loading with and without specifying indices and all worked
  • i repeated the above successfully with both the 8-bit and 16-bit data
  • most steps were reasonably fast but the ICA still took several minutes (though it definitely completed). it was worse with ~150 components, which is why i used fewer for testing. that's much slower than a purely local implementation could/should be. as mentioned, we're working on this!
  • also note the npartitions=1, that tells Spark to chunk the data into just 1 partition. this improved the speed of all operations with these data significantly. by default Thunder uses many more partitions, which is great on a cluster, but particularly inefficient locally. it's something we're going to make happen automatically in the future (see Better auto-partitioning during image loading #111). that said, without setting this to 1, everything ran, without OOM errors, just much slower.

So with all that, I'd say there's a chance switching to 1.3.0 will help; otherwise it's likely a Windows-specific memory issue, which may be hard to debug.

@vjlbym
Contributor Author

vjlbym commented Apr 20, 2015

@freeman-lab Thanks a lot for the detailed reply. One thing is that I have been running all the above in Ubuntu. I was trying to get Spark to run on Windows since people in my lab are more comfortable with it. It's good to know about npartitions=1. I'll switch to 1.3.0 and check whether that solves it. Also, how much RAM did you have on the computer running the program?

@freeman-lab
Member

Ahh, thanks for the clarification, in that case ignore everything I said about Windows =) Fingers crossed 1.3.0 will help then. And this was on a MacBook Air with 8 GB RAM.

@vjlbym
Contributor Author

vjlbym commented Apr 20, 2015

Oh. My desktop should be perfectly able to handle it then. Hope it's solved by 1.3.0. Thanks again! Will post if it worked first thing in the morning :)

@vjlbym
Contributor Author

vjlbym commented Apr 20, 2015

So it worked finally! 1.3.0 still gave me OOM errors, but the issue apparently was that Java runtime options in Ubuntu have to be set using _JAVA_OPTIONS, not JAVA_OPTS, even though this website says JAVA_OPTS is the way to go on Ubuntu: http://askubuntu.com/questions/107665/how-do-i-change-java-runtime-parameters

Changing it to _JAVA_OPTIONS, as @GrantRVD had suggested for Windows, worked on my Ubuntu machine. I hadn't tried it earlier since it seemed Windows-specific, but apparently not!
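For anyone hitting the same thing, the fix described above amounts to something like the following in the shell that launches Spark (the specific heap sizes here are illustrative, following the earlier advice of a small initial heap and a larger maximum):

```shell
# _JAVA_OPTIONS is read by the JVM itself at startup, so it applies
# regardless of how Spark launches Java; JAVA_OPTS is only honored by
# scripts that explicitly pass it along.
export _JAVA_OPTIONS="-Xms512m -Xmx6g"
echo "$_JAVA_OPTIONS"
```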

Thanks a lot, @GrantRVD and @freeman-lab!! I'll close this issue now.

@vjlbym vjlbym closed this as completed Apr 20, 2015
@freeman-lab
Member

@vjlbym that's great! I didn't know that about the naming conventions for JAVA_OPTS, very curious, great job figuring it out. It would be great if you could submit a pull request adding a note about this to the FAQ, something like "I'm getting out of memory errors during local usage", and then a description of how to set these java opts in different environments.

@freeman-lab
Member

Here's the source file you'd be adding to https://github.com/thunder-project/thunder/blob/master/python/doc/faq.rst

@vjlbym
Contributor Author

vjlbym commented Apr 20, 2015

@freeman-lab Just did. Apparently this isn't environment-specific; it's just a Java thing: https://community.oracle.com/message/6440415
