Real matrix multiplication without automatic index matching #3344

Closed
lrq3000 opened this Issue Apr 13, 2013 · 17 comments

3 participants

@lrq3000

Automatic index matching is really great, but sometimes you need a real matrix multiplication, for example to compute the Covariance matrix (in my case using weights, hence .cov() is not usable).

Current workarounds include using numpy.outer() (which is still not a real matrix multiplication, because it always gives the outer product), or t = t[:, None], but I feel this last solution may hurt performance on big datasets.

It would be really nice to have a method that just does a "blind" matrix multiplication, multiplying the rows of the first matrix with the columns of the second one, without caring about index matching.

Edit: I'm here talking about DataFrame and Series, which could respectively be considered as matrix and vector.
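To make the distinction concrete, here is a minimal sketch of label-aligned versus positional multiplication. The Series `a` and `b` and their labels are made up for illustration. It also suggests that adding an axis with `None` need not be costly, since NumPy returns a view of the same memory rather than a copy.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the two Series share only the label "y".
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])

# pandas aligns on labels first, so unmatched labels become NaN.
aligned = a * b

# The underlying arrays multiply by position and ignore the labels.
arr_a = a.to_numpy()
arr_b = b.to_numpy()
positional = arr_a * arr_b

# Broadcasting a column against a row gives the full product table.
column = arr_a[:, None]          # shape (2, 1), a view of arr_a
table = column * arr_b[None, :]  # shape (2, 2)
```

Here `aligned` has the labels x, y, z with a value only at y, while `positional` and `table` depend purely on element order.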

@y-p

There's an issue for implementing a covariance matrix calculation in the package,
but I think matrix algebra falls somewhat outside the scope of pandas.
It's not that it isn't useful, it's just that since it's not the focus of pandas, it's better to use
a more focused package for BLAS-type operations (I may be in the minority view on this).

You do know you have access to the underlying numpy array through the .values
attribute? That should free you up to take data out of pandas, process it, and then
turn it back into a DataFrame, if you need to.

marking as someday for now.

@jseabold
Python for Data member

+1 for keeping stats that don't rely on an index out of pandas. That crosses a bit too much the fuzzy line I keep in my head for pandas vs. numerical/statistical packages.

@lrq3000

As I said, .cov() does not support weights, and it's easy to imagine other specific cases where manually computing the dot product of matrices would be necessary.
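As an illustration of the weighted case, the following is a rough sketch of a weighted covariance computed on the underlying arrays, with the column labels put back afterwards. The helper name `weighted_cov` and the sample data are hypothetical, and the normalization (dividing by the sum of the weights) is an assumption; other conventions exist.

```python
import numpy as np
import pandas as pd

def weighted_cov(df, weights):
    """Weighted covariance of the columns of df (hypothetical helper).

    Assumes one weight per row and normalizes by the sum of the weights.
    """
    x = df.to_numpy(dtype=float)
    w = np.asarray(weights, dtype=float)
    # Weighted mean of each column.
    mean = (w[:, None] * x).sum(axis=0) / w.sum()
    centered = x - mean
    # Plain positional matrix product: (columns x rows) @ (rows x columns).
    cov = (centered * w[:, None]).T @ centered / w.sum()
    # The result is labelled by the columns on both axes.
    return pd.DataFrame(cov, index=df.columns, columns=df.columns)

df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [2.0, 1.0, 5.0]})
w = [1.0, 2.0, 1.0]
result = weighted_cov(df, w)
```

With this normalization the values should agree with `numpy.cov(..., rowvar=False, aweights=w, ddof=0)`.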

Also, as I said, NumPy does not provide a way to compute the dot product manually (as far as I know, NumPy also tries to match in some way), so this would be an "innovative" feature of Pandas.

And I never said that the produced matrix couldn't have indexes in some way. For example, indexes make perfect sense in a covariance matrix (with the same keys as both indexes and columns), but with the current implementation based on index matching, it is just not possible.

The dot product is an important part of most matrix manipulations, and I feel Pandas is losing a big thing here by not implementing this possibility "just on principle".

PS: of course it's always possible to write a manual implementation in Python, but the performance would be just awful.

@y-p

If the current cov matrix isn't sophisticated enough for your needs, you really should
find a Python package that does that sort of thing well, with all the rest of the decompositions
and derivatives and eigenvalue magic you need.

Also, if numpy doesn't do "dot product" as you define it, I guess I'm not sure what you
mean by the term.

@lrq3000

Thanks, jseabold, this is indeed a good solution.

Anyway, as I said, this is only one example. Another example is computing the weighted median, the weighted deviation from the median and the weighted covariance matrix based on the median. In this case, there is still a need to compute the dot product.

And I haven't even mentioned feed-forward and back-propagation in neural nets...

@jseabold
Python for Data member

Neural networks are very far outside the scope of pandas.

I guess my point here is that these are issues that should be filed with statsmodels (help, you don't have pandas support here yet) or PyBrain or scikit-learn (if and when they get neural nets to work within their API) to support pandas objects at all.

If you want weighted medians, etc. file an enhancement ticket with statsmodels rather than trying to redefine the scope of pandas. That's my opinion at least and like y-p, I may be in the minority of pandas devs (of which I am not one).

@lrq3000

@y-p: You're right, sorry, I confused myself. I'm talking about matrix multiplication, not the dot product (sorry, English is not my first language and math terms are sometimes misleading...).

I will give more practical information below.

Link to my SO post with a practical example:
http://stackoverflow.com/questions/15889998/pandas-force-matrix-multiplication

More example code:

import pandas
import numpy
t = pandas.Series([1, 2])
print(t.values.T * t.values)

This prints [1 4] instead of the expected matrix:

[[1 2]
 [2 4]]

I could not find any way to get this result other than using numpy.outer(), and even then, it's not really in the dimension I specify (the "transpose" just doesn't affect anything for these operations), but always the outer product.

Maybe I am missing a very simple thing here, but really I have crawled through the whole documentation and even a bit of the code of Pandas, and couldn't find a way to do so.
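For reference, a short sketch of why the snippet above prints [1 4]: a Series exposes a 1-D array, and transposing a 1-D array changes nothing, so `*` stays elementwise. Giving the vector an explicit 2-D shape makes the transpose meaningful. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

t = pd.Series([1, 2])
v = t.to_numpy()               # 1-D array, shape (2,)

# Transposing a 1-D array is a no-op, so this is an elementwise product.
elementwise = v.T * v

# Reshape to an explicit column vector; now the orientation matters.
col = v.reshape(-1, 1)         # shape (2, 1)
outer = col @ col.T            # (2, 1) @ (1, 2) -> shape (2, 2)
inner = col.T @ col            # (1, 2) @ (2, 1) -> shape (1, 1)
```

With 2-D operands the result follows the usual matrix rule, so swapping the order switches between the outer and the inner product.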

@lrq3000

@jseabold: I don't want to reuse existing models (even if making such libraries available is a great initiative), but to make my own (not purely for fun, but because the existing models can't fit every use, and I don't intend to do stats but rather to implement models that are still in research).

Contrary to what you are describing, I am not talking about making pandas implement a lot of new functionality like weighted deviation from the median, but just matrix multiplication, which I feel is generic enough to be used for a lot of other applications (after all, this is a basic mathematical operation...).

@y-p

I think what both jseabold and I are saying, is that there's a substantial scientific/data
python ecosystem out there, and not all tools do or should do all things.
If you have focused packages that do only one thing well you end up with better
overall tools.
pandas is already a much loved pastiche of (hopefully well-chosen) concepts, but that
doesn't justify making it the kitchen sink.

You will probably find what you need in numpy, scipy, sklearn and other mega projects
that have a dedicated team of developers focusing on their area of expertise.

I suggest you consider joining the pydata mailing list,
to discuss your needs, where you may get helpful suggestions from users who have perhaps solved similar problems.

That said, having better integration with other parts of the pydata ecosystem is always a
discussion worth having, so if you come across pain points in that direction, that would
be a useful topic to bring up.

@lrq3000

Guys, I understand your point of view, and I agree that indexing is a major feature of Pandas and it pushes the concept so far that it is likely to become a paradigm in its own right.

But my point of view is that without matrix multiplication, which is a basic operation, you just cannot implement a lot of algorithms.

On a practical side, even if the principle of "index matching" couldn't be strictly followed during a matrix multiplication, Pandas could still try to set the right keys afterwards.

Example:

Matrix A = n x m
Matrix B = m x o
Matrix C = A x B = n x o

So of course, we lose the indexes that spanned over the m dimension, but we still do have the keys of the n dimension and o dimension. So it is perfectly possible to set in the resulting DataFrame (or Series) the indexes to be the indexes of A, and the columns keys to be the column keys of B.
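A minimal sketch of that labelling rule, using made-up frames whose inner labels deliberately do not match: the product is computed on the raw arrays, and the result takes its row labels from A and its column labels from B.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 and 3 x 2 frames; the inner labels (p, q, r) and (u, v, w) differ.
a = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                 index=["r0", "r1"], columns=["p", "q", "r"])
b = pd.DataFrame([[1, 0], [0, 1], [2, 2]],
                 index=["u", "v", "w"], columns=["c0", "c1"])

# Multiply by position; the labels along the shared dimension are dropped.
c = pd.DataFrame(a.to_numpy() @ b.to_numpy(),
                 index=a.index, columns=b.columns)
```

The result `c` is 2 x 2, indexed by r0, r1 and with columns c0, c1.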

@y-p : thank you for your suggestion, I will see if other solutions are currently available and post the results in here.

@jseabold
Python for Data member

Just use the underlying numpy arrays for whatever you want to do and slap your indices back on afterwards. If every dot product I had to compute had indexing overhead the cost would quickly outweigh the gains no matter how efficient the indexing code is. This would certainly be true for something like a back-propagation algorithm. Indeed, I suspect that pure python numpy arrays alone have more overhead than most people would be comfortable with when coding these algorithms.

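@y-p

# Imports assumed; the original snippet did not show them.
from numpy.random import randint
from pandas import DataFrame
from pandas.util.testing import makeCustomDataframe as mkdf

a = mkdf(3, 5, data_gen_f=lambda r, c: randint(1, 100))
b = mkdf(5, 3, data_gen_f=lambda r, c: randint(1, 100))
# Multiply the raw arrays by position, then reattach a's rows and b's columns.
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
print(a)
print(b)
print(c)
# Element [0, 0] is row 0 of a times column 0 of b, summed.
assert (a.iloc[0, :].values * b.iloc[:, 0].values).sum() == c.iloc[0, 0]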

C0       C_l0_g0  C_l0_g1  C_l0_g2  C_l0_g3  C_l0_g4
R0                                                  
R_l0_g0       39       87       88        2       65
R_l0_g1       59       14       76       10       65
R_l0_g2       93       69        4       29       58
C0       C_l0_g0  C_l0_g1  C_l0_g2
R0                                
R_l0_g0       76       88       11
R_l0_g1       66       73       47
R_l0_g2       78       69       15
R_l0_g3       47        3       40
R_l0_g4       54       31       31
C0       C_l0_g0  C_l0_g1  C_l0_g2
R0                                
R_l0_g0    19174    17876     7933
R_l0_g1    15316    13503     4862
R_l0_g2    16429    15382     7284
@lrq3000

@y-p exactly what I need! Though I wonder if this would work with DataFrame x Series and Series x Series?

Also, could you please explain to me what the assert does?
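On the Series question, one possible sketch is a small helper that multiplies by position and then reattaches whichever outer labels survive. The name `positional_matmul` is hypothetical, and the choice to return a plain scalar for Series x Series is an assumption about what is wanted.

```python
import numpy as np
import pandas as pd

def positional_matmul(left, right):
    """Multiply by position, then reattach outer labels (hypothetical helper)."""
    product = np.asarray(left) @ np.asarray(right)
    if product.ndim == 2:
        # DataFrame x DataFrame: rows of left, columns of right.
        return pd.DataFrame(product, index=left.index, columns=right.columns)
    if product.ndim == 1:
        # DataFrame x Series keeps left's rows; Series x DataFrame keeps right's columns.
        labels = left.index if isinstance(left, pd.DataFrame) else right.columns
        return pd.Series(product, index=labels)
    # Series x Series: a 1-D by 1-D product is the scalar inner product.
    return product

df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["x", "y"])
s = pd.Series([1, 1], index=["p", "q"])

df_times_s = positional_matmul(df, s)
s_times_df = positional_matmul(s, df)
s_times_s = positional_matmul(s, s)
```

An outer product of two Series would still need an explicit reshape to a column and a row, since 1-D operands carry no orientation.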

@y-p

Just wanted to make it explicit that it's really a matrix multiplication, without
you having to break out a calculator.

@lrq3000

OK, that's great! Thanks a lot for your time!

But I still think that a wrapper method that does this operation on its own would be just great! (You can't imagine how much time I spent crawling the web without ever finding such a simple and efficient implementation.)

Meanwhile, I'll use that in my own function and check that it's working alright with Series too.

@y-p

numpy broadcasting should serve you well, and you can explore the section about monkey patching
in the docs to roll your own dataframe methods easily.
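As a rough sketch of the monkey-patching idea, a function can be attached to the DataFrame class so that it is callable as a method. The method name `matmul_by_position` is made up for this example and is not part of pandas.

```python
import pandas as pd

def matmul_by_position(self, other):
    """Positional matrix product that keeps self's rows and other's columns."""
    values = self.to_numpy() @ other.to_numpy()
    return pd.DataFrame(values, index=self.index, columns=other.columns)

# Attach the function to the class; every DataFrame then exposes it as a method.
pd.DataFrame.matmul_by_position = matmul_by_position

a = pd.DataFrame([[1, 2], [3, 4]], index=["r0", "r1"], columns=["p", "q"])
b = pd.DataFrame([[0, 1], [1, 0]], index=["u", "v"], columns=["c0", "c1"])
c = a.matmul_by_position(b)
```

Patching a library class affects the whole process, so a plain function may be the safer choice in shared code.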

Closing now.

@y-p closed this Apr 17, 2013