
use SGVector instead of plain pointer in GMM #3859

Merged
merged 1 commit into shogun-toolbox:develop on Sep 13, 2017

Conversation

MikeLing
Contributor

While working on the global random removal, I found that we have a lot of plain pointers in GMM. I thought maybe we'd want this PR; please feel free to close it if not. :)

Thank you

@vigsterkr
Member

@MikeLing if you want to insist on these changes then let's also switch all the CMath methods and the other SGVector/SGMatrix methods to linalg, please! :) Let me know if you need help with which ones and how!

@MikeLing
Contributor Author

MikeLing commented Jun 23, 2017

Hi @vigsterkr, sure, please tell me more about that! But I'm not sure I can start working on it right away; let me finish the issues on my plate first :)

BTW, I just updated this PR because I found I had forgotten to remove some debug statements from it :P

@vigsterkr
Member

@MikeLing I don't see the point of doing things halfway in this case. If you started this then please do it properly; if not, then close this PR. Just changing malloc to SGVector is really not enough here... that could be done automatically with a regex script...

Member

@vigsterkr vigsterkr left a comment

here are some examples where one could use linalg... but there's plenty more...

SGVector<float64_t>::add(mean_sum, alpha.matrix[j*alpha.num_cols+i], v.vector, 1, mean_sum, v.vlen);
SGVector<float64_t>::add(
mean_sum.vector, alpha.matrix[j * alpha.num_cols + i], v.vector,
1, mean_sum.vector, v.vlen);
}

for (int32_t j=0; j<num_dim; j++)
mean_sum[j]/=alpha_sum;
Member

this is basically linalg::scale(mean, mean, 1/alpha_sum)
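
For illustration, here is how that suggestion maps onto the loop above (a minimal sketch, assuming the shogun linalg namespace from shogun/mathematics/linalg/LinalgNamespace.h is in scope; not the exact merged code):

    // The explicit per-dimension loop
    //     for (int32_t j = 0; j < num_dim; j++)
    //         mean_sum[j] /= alpha_sum;
    // collapses to a single vectorized call:
    linalg::scale(mean_sum, mean_sum, 1.0 / alpha_sum);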

cov_sum=SG_MALLOC(float64_t, 1);
cov_sum[0]=0;
cov_sum = SGMatrix<float64_t>(1, 1);
cov_sum.zero();
Member

linalg::zero(cov_sum)

}

for (int32_t j=0; j<alpha.num_rows; j++)
{
SGVector<float64_t> v=dotdata->get_computed_dot_feature_vector(j);
SGVector<float64_t>::add(v.vector, 1, v.vector, -1, mean_sum, v.vlen);

SGVector<float64_t>::add(
Member

linalg::add

case SPHERICAL:
float64_t temp = 0;

for (int32_t k = 0; k < num_dim; k++)
Member

this is basically linalg::dot(v,v)
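
A sketch of that replacement, using the names from the surrounding diff (an illustration, not the merged code):

    // The SPHERICAL branch accumulates the squared norm of v by hand:
    //     float64_t temp = 0;
    //     for (int32_t k = 0; k < num_dim; k++)
    //         temp += v.vector[k] * v.vector[k];
    // which is just the dot product of v with itself:
    float64_t temp = linalg::dot(v, v);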

@karlnapf
Member

Once this is done (very nice initiative), let's re-design the Gaussians and mixture models a bit. I can help with that; let me know once you are ready with this stuff.

}

for (int32_t j=0; j<num_dim; j++)
mean_sum[j]/=alpha_sum;
linalg::scale(mean_sum, mean_sum, 1 / alpha_sum);
Member

use 1.0/alpha_sum here (a floating-point literal)

}

for (int32_t j=0; j<alpha.num_rows; j++)
{
SGVector<float64_t> v=dotdata->get_computed_dot_feature_vector(j);
SGVector<float64_t>::add(v.vector, 1, v.vector, -1, mean_sum, v.vlen);
mean_sum.display_vector();
Member

debug ;)


linalg::add(v, mean_sum, v, float64_t(1), float64_t(-1));
mean_sum.display_vector();
Member

debug ;)

Contributor Author

:(

@MikeLing
Contributor Author

Hi @vigsterkr, I misapprehended your comment. I thought you were saying we need to add more mathematical functions to SGVector and SGMatrix :P I've updated the PR; please tell me if there's anything else to do in this PR. Thank you

cov_sum(0, k) += v.vector[k] * v.vector[k] *
alpha.matrix[j * alpha.num_cols + i];

cov_sum.display_matrix();
Member

debug


cov_sum(0, 0) +=
temp * alpha.matrix[j * alpha.num_cols + i];
cov_sum.display_matrix();
Member

debug

}

m_coefficients.vector[i]=alpha_sum;
alpha_sum_sum+=alpha_sum;
}

for (int32_t i=0; i<alpha.num_cols; i++)
m_coefficients.vector[i]/=alpha_sum_sum;
linalg::scale(m_coefficients, m_coefficients, 1 / alpha_sum_sum);
Member

use 1.0/alpha_sum_sum here (a floating-point literal)

CblasRowMajor, num_dim, num_dim,
alpha.matrix[j * alpha.num_cols + i], v.vector, 1,
v.vector, 1, (double*)cov_sum.matrix, num_dim);
cov_sum.display_matrix();
Member

debug


cov_sum[0]+=temp*alpha.matrix[j*alpha.num_cols+i];
break;
cblas_dger(
Member

is this the one and only function from LAPACK that we use in GMM?
and is this the reason why we require LAPACK?
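
For context: cblas_dger is a CBLAS routine that performs the rank-1 update A += alpha * x * y^T. A sketch of what the call above computes, written as plain loops over the surrounding variables (illustration only, not a proposed change):

    // Accumulate the weighted outer product of v with itself into the
    // row-major num_dim x num_dim matrix cov_sum:
    float64_t w = alpha.matrix[j * alpha.num_cols + i];
    for (int32_t r = 0; r < num_dim; r++)
        for (int32_t c = 0; c < num_dim; c++)
            cov_sum.matrix[r * num_dim + c] += w * v.vector[r] * v.vector[c];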

Contributor Author

@MikeLing MikeLing Jun 24, 2017

Member

oooh
old linalg can take care of it: https://github.com/shogun-toolbox/shogun/blob/develop/src/shogun/mathematics/linalg/eigsolver/DirectEigenSolver.h

but OK, it should be done if we have some time later... not part of this PR

@vigsterkr
Member

@MikeLing ok, there are some debug statements that should be removed; otherwise it's looking good. Once those debug things are removed and some of those minor problems addressed, we should merge this... later, as @karlnapf mentioned, we might want to do a refactor and then get rid of that LAPACK dependency...

@vigsterkr
Member

@MikeLing my only concern is that GMM is not tested anywhere automatically, only in a notebook :S

@MikeLing
Contributor Author

mmm, do I need to add some unit tests for it? Or could we test and merge it after micmn adds unit tests for GMM?

@vigsterkr
Member

@MikeLing I'm not aware of anybody planning to add unit tests for GMM ;)
One easy way is to manually run the ipython notebook locally and check that you get the same output as here: http://shogun.ml/notebook/latest/GMM.html

@MikeLing
Contributor Author

@vigsterkr oh, I see what you mean. We do actually have a meta test for GMM (integration_meta_cpp-clustering-gmm), but I'm not sure that's good enough to test it. Let me test GMM in the notebook anyway :)

@vigsterkr
Member

@MikeLing oh yeah, that should at least assure us that we still get the same output :)))

@vigsterkr
Member

@MikeLing ok, let me know when you're done with the notebook test; this seems to be OK, so I'm going to merge once you're done double-checking. Thanks

@MikeLing
Contributor Author

@vigsterkr Hi, sorry for the late reply. First of all, I would say yes, we still get the same output for GMM. However, the GMM notebook may already be broken; I sent you some screenshots on IRC. (Here is the output of the develop branch:)
screenshot_20170624_232557

@@ -267,7 +267,7 @@ float64_t CGaussian::compute_log_PDF(SGVector<float64_t> point)
return -0.5 * answer;
}

SGVector<float64_t> CGaussian::get_mean()
SGVector<float64_t>& CGaussian::get_mean()
Member

Why return a reference?

Contributor Author

Member

How is it related? :)

Member

it's not... there shouldn't be a need for doing this.

Contributor Author

@MikeLing MikeLing Jun 26, 2017

mmm, but I get an error message like
error: no matching function for call to 'add' if I don't return a reference. More of the error message is here: https://pastebin.mozilla.org/9025837

Contributor Author

Hi @vigsterkr and @lisitsyn, linalg's add needs references rather than values, as in add(SGVector<T>& a, SGVector<T>& b, SGVector<T>& result, T alpha = 1, T beta = 1). I just don't know how to make
linalg::add(components[1]->get_mean(), components[2]->get_mean(), components[1]->get_mean(), alpha1, alpha2); work if I don't return a reference here. Maybe I'm asking a dumb question, but I just don't know what I should do to make it work :(

Thank you

Member

What happens if you put the get_mean() output into a local variable first? It is a bit nicer to read anyway
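
A sketch of the suggested pattern (illustrative variable names; assumes SGVector copies are shallow, reference-counted handles, so writing through the local should update the component's mean in place):

    // Bind the returned vectors to named lvalues first; linalg::add
    // takes non-const references, so temporaries won't bind.
    SGVector<float64_t> mean1 = components[1]->get_mean();
    SGVector<float64_t> mean2 = components[2]->get_mean();
    // mean1 = alpha1 * mean1 + alpha2 * mean2
    linalg::add(mean1, mean2, mean1, alpha1, alpha2);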

@MikeLing
Contributor Author

#3866

@MikeLing
Contributor Author

MikeLing commented Jul 8, 2017

let's re-design the Gaussians and mixture models a bit
Hi @karlnapf, even though this hasn't been merged yet, could you tell me something about the Gaussian and mixture model redesign? Thank you!

@karlnapf
Member

karlnapf commented Jul 9, 2017

Hi @MikeLing
Let me know when you are done with the clean-up and I can help you re-designing the class.

@MikeLing MikeLing force-pushed the GMM_refactor branch 2 times, most recently from 0ab8441 to 9df2c13 Compare July 11, 2017 14:14
@codecov

codecov bot commented Jul 11, 2017

Codecov Report

Merging #3859 into develop will increase coverage by 0.03%.
The diff coverage is 67.81%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3859      +/-   ##
===========================================
+ Coverage    55.79%   55.82%   +0.03%     
===========================================
  Files         1356     1356              
  Lines        94000    93987      -13     
===========================================
+ Hits         52445    52466      +21     
+ Misses       41555    41521      -34
Impacted Files Coverage Δ
src/shogun/clustering/GMM.cpp 77.04% <67.81%> (-1.77%) ⬇️
src/shogun/io/streaming/InputParser.h 97.29% <0%> (-0.68%) ⬇️
src/shogun/lib/DataType.cpp 67.13% <0%> (-0.36%) ⬇️
src/shogun/optimization/liblinear/tron.cpp 87.32% <0%> (ø) ⬆️
src/shogun/machine/KernelMachine.cpp 79.03% <0%> (+0.32%) ⬆️
...ogun/features/streaming/StreamingDenseFeatures.cpp 72.78% <0%> (+0.59%) ⬆️
src/shogun/lib/List.h 85.18% <0%> (+0.61%) ⬆️
src/shogun/lib/external/shogun_libsvm.cpp 69.31% <0%> (+2.5%) ⬆️
src/shogun/distance/BrayCurtisDistance.cpp 100% <0%> (+3.44%) ⬆️
src/shogun/lib/SGVector.h 69.51% <0%> (+3.65%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e6c6db...cf51d05. Read the comment docs.

@MikeLing
Contributor Author

MikeLing commented Jul 12, 2017

@karlnapf asking for review on this PR :)

Thank you

@MikeLing MikeLing force-pushed the GMM_refactor branch 2 times, most recently from 037a675 to aa8b60d Compare July 13, 2017 03:01
@karlnapf
Member

Mmmh, the unit test thing is quite concerning...
I think for a simple refactoring we can merge without having tests, but generally we should add some:

notebook checks:
  • visually works
unit tests:
  • compare with sklearn on a simple toy example, with the same initialization as sklearn and only single iterations
domain/math tests:
  • free energy increases in every iteration of EM (this is a really good sanity check)
  • all probability vectors always sum to 1 (logsumexp is 0; see the sketch after this list)
  • when initialized with a good solution, it doesn't jump around but converges quickly to the same solution it was initialized with
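
A minimal sketch of the probability-sum check referenced in the list above (hypothetical test code, assuming googletest and a responsibility matrix alpha indexed as in the diffs, with rows = samples and columns = components):

    #include <gtest/gtest.h>
    #include <shogun/lib/SGMatrix.h>

    // Every row of the responsibility matrix must be a probability
    // vector, i.e. its entries must sum to 1.
    void expect_rows_sum_to_one(const shogun::SGMatrix<float64_t>& alpha)
    {
        for (int32_t j = 0; j < alpha.num_rows; j++)
        {
            float64_t row_sum = 0;
            for (int32_t i = 0; i < alpha.num_cols; i++)
                row_sum += alpha.matrix[j * alpha.num_cols + i];
            EXPECT_NEAR(row_sum, 1.0, 1e-9);
        }
    }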

@@ -275,8 +272,11 @@ float64_t CGMM::train_smem(int32_t max_iter, int32_t max_cand, float64_t min_cov
counter++;
}
}
CMath::qsort_backward_index(split_crit, split_ind, int32_t(m_components.size()));
CMath::qsort_backward_index(merge_crit, merge_ind, int32_t(m_components.size()*(m_components.size()-1)/2));
CMath::qsort_backward_index(
Member

can you please refactor these two methods as well?

@@ -385,8 +374,10 @@ void CGMM::partial_em(int32_t comp1, int32_t comp2, int32_t comp3, float64_t min
float64_t noise_mag=SGVector<float64_t>::twonorm(components[0]->get_mean().vector, dim_n)*0.1/
CMath::sqrt((float64_t)dim_n);

SGVector<float64_t>::add(components[1]->get_mean().vector, alpha1, components[1]->get_mean().vector, alpha2,
components[2]->get_mean().vector, dim_n);
SGVector<float64_t> temp_mean = components[2]->get_mean();
Member

auto please

components[2]->get_mean().vector, dim_n);
SGVector<float64_t> temp_mean = components[2]->get_mean();
SGVector<float64_t> temp_mean_result = components[1]->get_mean();
linalg::add(temp_mean_result, temp_mean, temp_mean_result, alpha1, alpha2);
Member

doesn't linalg have an in-place add? That would be cleaner here

Contributor Author

mmm, which in-place add of linalg do you mean?

Member

nevermind for now.
Let's move on with this. What is missing?


for (int32_t j=0; j<alpha.num_rows; j++)
{
alpha_sum+=alpha.matrix[j*alpha.num_cols+i];
SGVector<float64_t> v=dotdata->get_computed_dot_feature_vector(j);
SGVector<float64_t>::add(mean_sum, alpha.matrix[j*alpha.num_cols+i], v.vector, 1, mean_sum, v.vlen);
linalg::add(
Member

this seems to be a matrix product, and it would be better not to use a loop here

Contributor Author

@MikeLing MikeLing Jul 20, 2017

mmmm, I'm not sure why this doesn't work:

		alpha_sum = alpha_sum_v[i];
		SGMatrix<float64_t> v=dotdata->get_computed_dot_feature_matrix();
		auto column_vector = SGVector<float64_t>(alpha.get_column_vector(i), alpha.num_rows, false);
		linalg::matrix_prod(v, column_vector, mean_sum);

Maybe it's not a matrix product? Or am I misunderstanding something here?
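
For reference, the loop being replaced computes mean_sum += alpha(j, i) * x_j over all samples j, i.e. the feature matrix times the i-th weight column. A sketch of that mapping (assuming get_computed_dot_feature_matrix() lays the features out as num_dim x num_vectors):

    // mean_sum = X * w, where w[j] = alpha(j, i).
    // Note: the surrounding code indexes alpha as
    // alpha.matrix[j * alpha.num_cols + i], i.e. row-major, while
    // SGMatrix::get_column_vector assumes column-major storage; that
    // layout mismatch may be why the direct attempt above fails.
    SGMatrix<float64_t> X = dotdata->get_computed_dot_feature_matrix();
    SGVector<float64_t> w(alpha.num_rows);
    for (int32_t j = 0; j < alpha.num_rows; j++)
        w[j] = alpha.matrix[j * alpha.num_cols + i];
    linalg::matrix_prod(X, w, mean_sum);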


break;
case DIAG:
for (int32_t k = 0; k < num_dim; k++)
Member

matrix multiplication!

Contributor Author

alpha.matrix[j * alpha.num_cols + i] is an element rather than a vector or matrix, so I think it's something like vector^2 * R (R is a real number) rather than a matrix multiplication. Does that make sense to you? :)

Member

@karlnapf karlnapf left a comment

I am generally OK here.
But we cannot merge this before we have at least a working integration test (with the new random) that remains unchanged by the refactoring (we don't change any computation).

@vigsterkr
Member

@karlnapf this has nothing to do with the new random.

@karlnapf
Member

Ah ok.
So this PR then must not change the integration test results...

@karlnapf
Member

travis seems ok.
i think we can risk it then, if we do a visual check...

@vigsterkr
Member

@MikeLing can we please finish this one up by Friday? What's missing is a couple of matrix multiplications instead of for loops... ping me if you need help, please! Let's just have this finally merged!

@MikeLing MikeLing force-pushed the GMM_refactor branch 3 times, most recently from d015011 to eedc791 Compare July 23, 2017 13:38
@vigsterkr vigsterkr merged commit 0792f2d into shogun-toolbox:develop Sep 13, 2017