
TypeError from Theano when multiprocessing: numpy.array not aligned #2640

Closed
lorentzenchr opened this issue Oct 18, 2017 · 12 comments

@lorentzenchr commented Oct 18, 2017

Problem Description

If I try to sample with Metropolis in parallel (njobs>1), Theano throws an error indicating that a numpy array is not aligned.

Code that produces the error

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Create data: linear model with intercept 5 and slope -2
n = 1000000
X = np.concatenate((np.ones((n, 1)), np.random.rand(n, 1)), axis=1)
y = np.random.normal(loc=X.dot(np.array([5, -2])), scale=1)

model = pm.Model()
with model:
    beta = pm.Normal('beta', mu=0, sd=1e3, shape=X.shape[1])
    mu = tt.dot(X, beta)
    y_est = pm.Normal('y_est', mu=mu, observed=y)
    step = pm.Metropolis()
    trace = pm.sample(1, tune=0, step=step, njobs=2)

The same code runs fine with njobs=1, or with n=1000 instead of 1000000.

The error message:

Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "../lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "../lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "../3.6/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File ".../lib/python3.6/site-packages/joblib/pool.py", line 362, in get
    return recv()
  File "../lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File ".../lib/python3.6/site-packages/theano/compile/function_module.py", line 1049, in _constructor_Function
    f = maker.create(input_storage, trustme=True)
  File ".../lib/python3.6/site-packages/theano/compile/function_module.py", line 1661, in create
    input_storage=input_storage_lists, storage_map=storage_map)
  File ".../lib/python3.6/site-packages/theano/gof/link.py", line 699, in make_thunk
    storage_map=storage_map)[:3]
  File ".../lib/python3.6/site-packages/theano/gof/vm.py", line 1047, in make_all
    impl=impl))
  File ".../lib/python3.6/site-packages/theano/gof/op.py", line 935, in make_thunk
    no_recycling)
  File "...lib/python3.6/site-packages/theano/gof/op.py", line 830, in make_c_thunk
    e = FunctionGraph(node.inputs, node.outputs)
  File ".../lib/python3.6/site-packages/theano/gof/fg.py", line 142, in __init__
    inputs, outputs = graph.clone(inputs, outputs)
  File ".../lib/python3.6/site-packages/theano/gof/graph.py", line 818, in clone
    equiv = clone_get_equiv(i, o, copy_inputs)
  File "...lib/python3.6/site-packages/theano/gof/graph.py", line 851, in clone_get_equiv
    cpy = input.clone()
  File ".../lib/python3.6/site-packages/theano/gof/graph.py", line 577, in clone
    cp = self.__class__(self.type, self.data, self.name)
  File ".../lib/python3.6/site-packages/theano/tensor/var.py", line 927, in __init__
    Constant.__init__(self, type, data, name)
  File ".../lib/python3.6/site-packages/theano/gof/graph.py", line 549, in __init__
    self.data = type.filter(data)
  File ".../lib/python3.6/site-packages/theano/tensor/type.py", line 189, in filter
    "object dtype", data.dtype)
TypeError: ('The following error happened while compiling the node', CGemv{no_inplace}(TensorConstant{[ 2.174979...3341453 ]}, TensorConstant{-1.0}, TensorConstant{[[ 1.     ..37875534]]}, Subtensor{int64:int64:}.0, TensorConstant{1.0}), '\n', 'The numpy.ndarray object is not aligned. Theano C code does not support that.', 'object buffer<memory at 0x126d3cdc8>', 'object shape', (1000000,), 'object strides', (8,), 'object dtype', dtype('float64'))
Process ForkPoolWorker-2:
Traceback (most recent call last):
...

After that message the process seems dead and I have to terminate it manually.
The error occurs on two different computers with different operating systems.

Versions and main components

  • PyMC3 Version: 3.2
  • Theano Version: 0.9.0
  • Numpy Version: 1.13.3
  • Python Version: 3.6.1 and 3.6.3
  • Operating system: MacOS X and Linux
  • How did you install PyMC3: pip
@lorentzenchr (Author) commented Oct 19, 2017

The error also occurs with the NUTS sampler. Maybe it is caused by the way joblib handles numpy arrays and feeds them to Theano?

@junpenglao (Member) commented Oct 19, 2017

Oh, actually it is related to the size of your data. Maybe it doesn't fit into memory? The problem disappears when setting n=1000, for example.

@aseyboldt (Member) commented Oct 19, 2017

The cutoff is at n=125000; all smaller values work. At exactly this point y is 1e6 bytes (125000 values × 8 bytes), which is the threshold at which joblib starts to use memory mapping to transfer arrays to the worker processes. I don't think this is an accident. :-)
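The mechanism can be sketched with numpy alone (an illustrative reconstruction, not the exact joblib code path): a float64 array counts as aligned only when its data pointer is a multiple of the 8-byte itemsize, and a view that starts at an odd byte offset — much like a memmap whose file offset isn't a multiple of 8 — reports ALIGNED : False, which is exactly what Theano's C code rejects.

```python
import numpy as np

# CPython allocates bytearray buffers with at least 8-byte alignment,
# so a float64 view starting at the buffer start is aligned ...
buf = bytearray(24)
aligned = np.frombuffer(memoryview(buf)[:16], dtype=np.float64)

# ... while a view starting one byte in cannot be 8-byte aligned.
unaligned = np.frombuffer(memoryview(buf)[1:17], dtype=np.float64)

print(aligned.flags['ALIGNED'])    # True
print(unaligned.flags['ALIGNED'])  # False
```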

So it looks like this problem appears when an array is larger than a million bytes and the first Theano op that uses it requires an aligned array. I guess the easiest workaround would be to disable memory mapping in joblib by passing mmap_mode=None to Parallel. But that would mean making copies of arbitrarily large arrays.

@junpenglao (Member) commented Oct 19, 2017

For me it is around n=65500... I need a better computer...
Not sure what the best solution is here, cc @aseyboldt @nouiz @ferrine

@aseyboldt (Member) commented Oct 19, 2017

That's a bit strange, but I don't think this is about how good your computer is...

A short script to test the alignment:

import joblib
import numpy as np


def func(x):
    print(x.flags)
    print(type(x))

if __name__ == '__main__':
    n = 1024 ** 2 // 8 + 1
    x = np.random.randn(n)
    jobs = joblib.Parallel(2)
    print(x.nbytes)
    res = jobs([joblib.delayed(func)(x), joblib.delayed(func)(x)])

Could you use this script to figure out at which point you get a numpy.core.memmap.memmap with ALIGNED: False on your machine?

This whole thing kind of sounds like a joblib bug to me. I don't see a good reason why the memmap shouldn't be aligned...

@junpenglao (Member) commented Oct 19, 2017

1048584
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : False
  UPDATEIFCOPY : False
<class 'numpy.core.memmap.memmap'>
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : False
  UPDATEIFCOPY : False
<class 'numpy.core.memmap.memmap'>

My output (flags and type from the two workers).

@aseyboldt (Member) commented Oct 19, 2017

I reported this to joblib.

@lorentzenchr If you need this quickly, applying this patch should avoid the issue until this is fixed properly:

diff --git a/pymc3/sampling.py b/pymc3/sampling.py
index 338a054c..0fb9f678 100644
--- a/pymc3/sampling.py
+++ b/pymc3/sampling.py
@@ -546,11 +546,13 @@ def _mp_sample(**kwargs):
 
     chains = list(range(chain, chain + njobs))
     pbars = [kwargs.pop('progressbar')] + [False] * (njobs - 1)
-    traces = Parallel(n_jobs=njobs)(delayed(_sample)(chain=chains[i],
-                                                     progressbar=pbars[i],
-                                                     random_seed=rseed[i],
-                                                     start=start_vals[i],
-                                                     **kwargs) for i in range(njobs))
+    traces = Parallel(n_jobs=njobs, mmap_mode=None)(
+        delayed(_sample)(chain=chains[i],
+                         progressbar=pbars[i],
+                         random_seed=rseed[i],
+                         start=start_vals[i],
+                         **kwargs)
+        for i in range(njobs))
     return merge_traces(traces)
@lorentzenchr (Author) commented Oct 19, 2017

@aseyboldt Thank you very much for the issue on joblib and your quick fix, in case I need it.

@cmichelenstrofer commented Oct 28, 2017

Link to issue by @aseyboldt in joblib: joblib/joblib#563

Hope they resolve it quickly...

@jcbozonier commented Dec 15, 2017

Also experiencing this issue, analyzing a multi-level model with a ton of production data (millions of data points). Just letting you know someone else appreciates y'all looking at this! :)

@ColCarroll (Member) commented Dec 15, 2017

Looks like the joblib devs have a heroic pull request open that got mired in compatibility issues between pickle protocols, in case anyone wants to help that out: joblib/joblib#570

@twiecki (Member) commented Feb 26, 2018

@AustinRochford should we close this?
