P2: Vector-matrix multiplication #44

Closed
magsol opened this issue Dec 15, 2015 · 16 comments
@magsol
Member

magsol commented Dec 15, 2015

This Spark primitive is a little trickier than #20, because the matrix is row-distributed but vector-matrix multiplication operates along the matrix's columns.

Still, this can be done in a fairly straightforward manner (a PySpark sketch follows the steps below):

  1. As in P1, broadcast the array u to be multiplied, e.g. sc.broadcast(u).
  2. Run a flatMap over the RDD.
  3. Each flatMap worker multiplies its row of the matrix by the corresponding element of the broadcast vector u.
  4. Each value of the resulting scaled row is emitted, keyed by its column index (hence the need for flatMap instead of map).
  5. A reduceByKey then sums the values for each key, which correspond to the elements of the resulting vector.
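
A minimal PySpark sketch of these steps, assuming a toy row-distributed matrix; the names A, u, and scale_row are illustrative, not necessarily those used in R1DL_Pyspark.py:

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="vector-matrix-sketch")

# Row-distributed matrix A: each RDD element is (row_index, row_vector).
A = sc.parallelize([(i, np.random.rand(4)) for i in range(3)])

# Step 1: broadcast the vector u (one element per matrix row).
u = sc.broadcast(np.array([1.0, 2.0, 3.0]))

# Steps 2-4: each worker scales its row by u[row_index] and emits
# (column_index, value) pairs.
def scale_row(kv):
    row_index, row = kv
    scaled = u.value[row_index] * row
    return [(j, scaled[j]) for j in range(len(scaled))]

pairs = A.flatMap(scale_row)

# Step 5: sum the contributions for each column index to get v = u^T * A.
v = pairs.reduceByKey(lambda a, b: a + b).sortByKey().values().collect()
print(v)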
@magsol magsol added the todo label Dec 15, 2015
@magsol magsol added this to the Milestone 3: Spark Prototype milestone Dec 15, 2015
magsol added a commit that referenced this issue Dec 22, 2015
…, vector-matrix multiplication), and P4 (#46, deflation). Not yet tested, but this is what an initial proof of concept looks like.
@MOJTABAFA
Contributor

@magsol
Now I'm going to test the PySpark file in Thunder, but during execution the following error appears:

File "/home/targol/anaconda2/lib/python2.7/R1DL_Pyspark.py", line 216, in <module>
    S = S.apply(deflate, keepDType = True, keepIndex = True)
TypeError: apply() got an unexpected keyword argument 'keepDType'

The log file is as follows:
testreport.txt

@magsol
Member Author

magsol commented Dec 24, 2015

Can you figure out what it means?

iPhone'd


@MOJTABAFA
Contributor

@magsol
Is it in the deflate function? When we're calling it in S.apply(), shouldn't we pass the "raw" to that function?

@magsol
Member Author

magsol commented Dec 24, 2015

No. Read the error message carefully: it's complaining about an unrecognized parameter name. Check the Thunder documentation and see if you can figure out how to fix it.

iPhone'd


@MOJTABAFA
Contributor

Ok, let me check it.

@MOJTABAFA
Contributor

Actually the problem was just a spelling issue! In 'keepDType' the 'T' must be changed to a lowercase 't'. I'll correct it in the main file.

@MOJTABAFA
Contributor

I've now tested the code again on the small test1 pattern, and the z there is much better than our previous z file! The z.txt is now a sparse matrix. I'm going to test the bigger data set; the results for the first small data set are as follows:
Z.txt

@magsol
Member Author

magsol commented Dec 25, 2015

Excellent work!

@MOJTABAFA
Contributor

@magsol
After a long execution on my laptop, the z matrix for the big data set is now around 40MB, so it's not possible to post it here. However, the results are not similar to the small data set's: where the small data set answers were totally satisfactory, the big data result is suspicious. The parameters were m=100, n=0.07, e=0.01, and I didn't specify Row and Col. Part of the result is as follows:

-49.625731  -90.950085  -107.148851 -22.263390  -27.960949  -74.573206  -35.491131  -1.820312   106.864215  14.072931   171.595561  -93.018838  -3.851441   281.873055  212.104157  375.934794  -69.247916  -79.771974  -27.565432  335.760112  330.057942  255.645971  129.561707  23.689732   39.457266   338.431347  358.045253  -16.198390  211.919775  120.124855  66.542751   282.075863  378.395402  -94.307979  -2.779630   -11.584412  185.832728  279.141163  101.102970  -99.788754  -82.138987  99.249246   175.284746  101.319492  -94.943044  -29.128951  26.582609   -22.439812  16.184655   -30.774730  -42.659585  -28.481978  -76.469311  -137.889147 -69.109695  -74.959590  -93.705282  -121.603436 -149.070855 -55.650968  4.239743    -17.991413  -64.647887  -55.436329  -55.543341  -233.434969 -226.427454 -73.695304  -141.986671 -140.047461 -242.440411 -280.187721 -196.235706 89.043456   -22.907281  -11.296129  -80.976172  -138.241792 -352.324480 -125.427455 43.500121   -186.793748 -112.535951 -205.595161 -278.406738 -371.797682 -80.563537  48.026023   287.180729  178.378065  121.456420  87.679904   -109.481793 -114.439424 11.187516   282.435522  -78.271834  -78.662650  -222.487548 -393.253565 

@magsol
Member Author

magsol commented Dec 26, 2015

> After a long execution on my laptop, the z matrix for the big data set is now around 40MB, so it's not possible to post it here.

I'm not sure what that means.

> However, the results are not similar to the small data sets: where the small data set answers were totally satisfactory, the big data result is suspicious. The parameters were m=100, n=0.07, e=0.01, and I didn't specify Row and Col.

How does it compare to what we see in the milestone 2 output?

@MOJTABAFA
Contributor

@magsol

The answers to those two questions:

  • The Z file size is around 40MB, so I cannot drag and drop it here (the maximum acceptable attachment size for a repository ticket is around 10MB).
  • In milestone 2, the z output values for both the small and big data samples were similar element by element, although the dimensions were different. Now, in milestone 3, the answers for the small test sets are much better than the milestone 2 answers and extremely close to what Xiang mentioned as the ground-truth Z. For the big data, however, the Z dimensions are the same as in milestone 2 but the element values are different. It may be because of resource problems on my laptop or other reasons; as I told you before, my laptop is not suitable for testing now, and it takes a lot of time. Anyway, I'll try to test it again and will let you know.

@magsol
Member Author

magsol commented Dec 28, 2015

If the quality changes with the size (i.e. the results are better with small data than with large data) it may be a resource issue, although that still seems odd: Spark is a deterministic framework, so the quality of the results shouldn't degrade with data volume.

Still, we need more testing. I'm on the road again tomorrow, but I almost have a Spark cluster ready at UGA, hopefully in the next day or two. In the meantime, I'll run this on my office desktop; it has 32GB of memory and 8 cores, so it should scale reasonably well.

If you and Xiang could start working on unit tests, that would be great. Small ones are fine for now: take a fraction of the input we have, get an expected output, then have the program run it and test whether the two outputs are equal within a certain tolerance (e.g. 6 decimal places).
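
A minimal sketch of that kind of test, assuming plain numpy + unittest; the vector_matrix_multiply helper is a placeholder for whatever the Spark job actually exposes, not the project's real API:

import unittest
import numpy as np

def vector_matrix_multiply(u, A):
    # Placeholder: swap in a call to the Spark implementation under test.
    return u.dot(A)

class VectorMatrixMultiplyTest(unittest.TestCase):
    def test_small_input(self):
        rng = np.random.RandomState(0)
        A = rng.rand(10, 5)   # a small fraction of the real input
        u = rng.rand(10)
        expected = u.dot(A)   # known-good reference answer
        actual = vector_matrix_multiply(u, A)
        # equal within a tolerance of 6 decimal places
        np.testing.assert_array_almost_equal(actual, expected, decimal=6)

if __name__ == "__main__":
    unittest.main()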

iPhone'd

@MOJTABAFA
Contributor

@magsol
I already talked with Milad; tomorrow I'll go to the university and try to check the code on the lab server. Moreover, I checked the big data file, and there is one point I wanted to ask your opinion on:
The small file was tall and skinny, with dimensions of (100, 5), whereas the big data file is short and fat, with dimensions of (170, 39850) (I mean the number of rows is much smaller than the number of columns). Could that be a reason for the uncertain results? I read in a paper that Spark's answers in matrix multiplication are always better for tall-and-skinny matrices.

@magsol
Member Author

magsol commented Dec 29, 2015

Hmm, that's a good question. However, the fact that the very nature of the data has changed has me a little worried. I thought, in general, the number of rows (data points) would far exceed the number of columns (features)? It seems like, in these two datasets, they have roughly the same number of data points (100 vs 170) but hugely differing dimensions. Is that truly the case, or have the data been accidentally transposed?


iPhone'd

@MOJTABAFA
Contributor

@magsol
Actually, I don't know why, but when I use the transposed data the following error appears:

  File "/home/targol/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 2089, in 
<genexpr>
  File "/home/targol/anaconda2/lib/python2.7/R1DL_Pyspark.py", line 26, in <lambda>
    .map(lambda x: np.array(map(float, x.strip().split("\t")))) \
ValueError: could not convert string to float: 

Moreover, it's really difficult and time-consuming for me to test on my laptop because of the lack of resources.

@magsol
Member Author

magsol commented Dec 30, 2015

It looks like there's a non-float character that we're trying to cast to a float, e.g. float("?") or something like that.
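
One way to guard against that, as a sketch (assuming the culprit is a blank line or an empty field from a trailing tab; parse_line is an illustrative name, not something already in R1DL_Pyspark.py):

import numpy as np

def parse_line(line):
    # Split a tab-delimited line and cast to float, skipping empty fields.
    fields = [f for f in line.strip().split("\t") if f]
    return np.array([float(f) for f in fields])

# In the Spark script this would replace the .map(lambda x: ...) from the
# traceback, after filtering out blank lines:
#   raw.filter(lambda line: line.strip()).map(parse_line)
print(parse_line("1.0\t2.5\t\t3.0"))   # empty fields are ignored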

Nonetheless, I hear you loud and clear. I'm sorry I haven't had time to finish setting up my cluster, but that's still in progress. I should have some news for you today or tomorrow.

@magsol magsol closed this as completed Jan 5, 2016