Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set lhs of underlying distance to cluster centers (#4388) #4510

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 2 additions & 2 deletions scripts/check_format.sh
Expand Up @@ -19,7 +19,7 @@ function check_shogun_style {
then
echo "Branch ${2:-} does not exists locally. Fetching it."
git fetch origin "${2:-}:${2:-}"
braceletboy marked this conversation as resolved.
Show resolved Hide resolved

fi

BASE_COMMIT=$(git rev-parse ${2:-})
Expand All @@ -31,7 +31,7 @@ function check_shogun_style {
echo "Running clang-format-3.8 against branch ${2:-}, with hash $BASE_COMMIT"

COMMIT_FILES=$(git diff --name-only $BASE_COMMIT)

# Use clang-format only on existent files
LIST=("")
for file in $COMMIT_FILES
Expand Down
90 changes: 85 additions & 5 deletions src/shogun/clustering/Hierarchical.cpp
@@ -1,17 +1,22 @@
/*
* This software is distributed under BSD 3-clause license (see LICENSE file).
*
* Authors: Soeren Sonnenburg, Sergey Lisitsyn, Heiko Strathmann,
* Giovanni De Toni, Viktor Gal, Evan Shelhamer
* Authors: Soeren Sonnenburg, Sergey Lisitsyn, Heiko Strathmann,
* Giovanni De Toni, Viktor Gal, Evan Shelhamer, Rukmangadh Sai Myana
*/

#include <iostream>
braceletboy marked this conversation as resolved.
Show resolved Hide resolved
#include <limits>
#include <shogun/base/Parallel.h>
#include <shogun/base/SGObject.h>
#include <shogun/base/progress.h>
#include <shogun/clustering/Hierarchical.h>
#include <shogun/distance/Distance.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/features/Features.h>
#include <shogun/labels/Labels.h>
#include <shogun/mathematics/Math.h>
#include <shogun/mathematics/linalg/LinalgNamespace.h>

using namespace shogun;

Expand Down Expand Up @@ -126,7 +131,7 @@ bool CHierarchical::train_machine(CFeatures* data)
auto pb = SG_PROGRESS(range(0, num_pairs - 1));
int32_t k=-1;
int32_t l=0;
for (; l<num && (num-l)>=merges && k<num_pairs-1; l++)
for (; l < num && l < merges && k < num_pairs - 1; l++)
{
while (k<num_pairs-1)
{
Expand Down Expand Up @@ -212,8 +217,83 @@ SGMatrix<int32_t> CHierarchical::get_cluster_pairs()
return SGMatrix<int32_t>(pairs,2,merges, false);
}


void CHierarchical::store_model_features()
{
/* TODO. Currently does nothing since apply methods are not implemented. */
CDenseFeatures<float64_t>* rhs =
(distance->get_rhs())->as<CDenseFeatures<float64_t>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need the parenthesis here.
Also, are we really sure that the features are float64 bit?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove the parenthesis.
I checked the KmeansBase file and used float64 bit. I felt that it made sense to make it float64 bit. What other type do you suggest?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KMeans is another algorithm. I just checked Hierarchical.cpp and it does not fix the features to be 64 bit float, but rather just calles the distance method of CDistance. So we will need to think a bit more here in order to not make the algorithm only work for those 64bit dense feautres. What about 32 bit? What about sparse features?` What about string features? But let's address the other points first, this one will take some time and effort

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I get your point. Thanks!


int32_t num_vectors = rhs->get_num_vectors();
int32_t num_features = rhs->get_num_features();

SGVector<int32_t> n_cluster_samples = SGVector<int32_t>(
num_vectors + merges);
shogun::linalg::set_const(n_cluster_samples, 0);

SGMatrix<float64_t> null_matrix = SGMatrix<float64_t>(
num_features, num_vectors + merges);
CDenseFeatures<float64_t>* temp_cluster_centers =
new CDenseFeatures<float64_t>(null_matrix);

for (int32_t i = 0; i < num_vectors; i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, makes sense.

{
bool dofree_c, dofree_rhs;
int32_t centerIdx = assignment[i];
SG_DEBUG("\n");
braceletboy marked this conversation as resolved.
Show resolved Hide resolved
SG_DEBUG("On %04i point, with assignment=%04i \n", i, centerIdx);
float64_t* center = temp_cluster_centers->get_feature_vector(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a SGVector based API for feature vectors that you could use here. This pointer based one is deprecated

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Thanks :)

centerIdx, num_features, dofree_c);
float64_t* sample_point =
rhs->get_feature_vector(i, num_features, dofree_rhs);
for (int32_t j = 0; j < num_features; j++)
{
if (n_cluster_samples[centerIdx] > 0)
{
center[j] = (center[j] * n_cluster_samples[centerIdx] +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should leave a comment on what type of algorithm/approach is used here to store/compute model features

sample_point[j]) /
(n_cluster_samples[centerIdx] + 1);
}
else
center[j] = sample_point[j];
SG_DEBUG(
"The %04i feature of the updated temp center is "
"%+06.6f \n",
j, center[j]);
}
n_cluster_samples[centerIdx]++;
rhs->free_feature_vector(sample_point, num_features, dofree_rhs);
temp_cluster_centers->free_feature_vector(
center, num_features, dofree_c);
}

CDenseFeatures<float64_t>* cluster_centers =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the code to compute something should be in here. This is just a method to store a "smaller" version of the features, namely only those needed to apply the model to new data. If you need to compute things in order to do that, it should happen in a helper method

new CDenseFeatures<float64_t>(null_matrix);

int32_t curr_index = 0;
for (int32_t i = 0; i < num_vectors + merges; i++)
{
if (n_cluster_samples[i] > 0)
{
bool dofree_temp, dofree;
float64_t* temp_center = temp_cluster_centers->get_feature_vector(
i, num_features, dofree_temp);
float64_t* center = cluster_centers->get_feature_vector(
curr_index, num_features, dofree);
SG_DEBUG("\n");
SG_DEBUG("The %04i cluster center:\n", curr_index);
for (int32_t j = 0; j < num_features; j++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls use std::copy or similar for those operations, it is faster

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Thank you!

{
center[j] = temp_center[j];
SG_DEBUG(
"The %04i feature of the center is %+06.6f \n", j,
center[j]);
}
curr_index++;
temp_cluster_centers->free_feature_vector(
temp_center, num_features, dofree_temp);
cluster_centers->free_feature_vector(center, num_features, dofree);
}
}
distance->init(cluster_centers, rhs);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should add a unit test that

  1. Trains the clustering
  2. Applies to test data
  3. Calls store_model_features, asserts that now the features are different
  4. Now applies to test data again and assert that results didnt change

SG_UNREF(temp_cluster_centers);
SG_UNREF(rhs);
}
10 changes: 5 additions & 5 deletions src/shogun/clustering/Hierarchical.h
@@ -1,8 +1,8 @@
/*
* This software is distributed under BSD 3-clause license (see LICENSE file).
*
* Authors: Soeren Sonnenburg, Sergey Lisitsyn, Heiko Strathmann, Yuyu Zhang,
* Bjoern Esser, Saurabh Goyal
* Authors: Soeren Sonnenburg, Sergey Lisitsyn, Heiko Strathmann, Yuyu Zhang,
* Bjoern Esser, Saurabh Goyal, Rukmangadh Sai Myana
*/

#ifndef _HIERARCHICAL_H__
Expand Down Expand Up @@ -119,9 +119,9 @@ class CHierarchical : public CDistanceMachine
*/
virtual bool train_machine(CFeatures* data=NULL);

/** TODO: Ensures cluster centers are in lhs of underlying distance
* Currently: does nothing.
* */
/** Ensures cluster centers are in lhs of underlying distance
*
*/
virtual void store_model_features();

private:
Expand Down