Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Hamming distance C++ example #39

Closed
TieSKey opened this issue Oct 11, 2019 · 12 comments
Closed

[Question] Hamming distance C++ example #39

TieSKey opened this issue Oct 11, 2019 · 12 comments

Comments

@TieSKey
Copy link

TieSKey commented Oct 11, 2019

Hi,
do you have any example of ngt usage on C++ with hamming distance?

I'm currently creating an index with this Properties for use with a 128 bits query:

NGT::Property property;
    property.setDefault();
    property.dimension = 8; // 4bytes per element x 8 elements 
    property.distanceType = NGT::Property::DistanceType::DistanceTypeHamming;
    property.indexType = NGT::Property::IndexType::GraphAndTree;
    property.objectType = NGT::Property::ObjectType::Uint8;

While it does compile and run, no result is found. I'm appending data as float which should be the same size as an uint but maybe that's incorrect.

const float *p = desc.ptr<float>(0);
        std::vector<float> obj(p, p + 8);

        // allocate query object.
        query = index->allocateObject(obj);

        NGT::SearchContainer sc(*query);     

Using L2 and floats works fine so i'm probably doing something wrong on my end, an example would really help me.

Thanks in advance.

@masajiro
Copy link
Member

I made an example for hamming distance. Since this uses the data in this repository, run this in the top directory of this repository. The number of dimensions must be 16 for 128 bits. I hope this example makes sense.

#include        "NGT/Index.h"

int
main(int argc, char **argv)
{
  string        indexFile       = "index";
  string        objectFile      = "./data/sift-dataset-5k.tsv";
  string        queryFile       = "./data/sift-query-3.tsv";
  uint32_t      bitSize         = 128;

  NGT::Property property;
  property.dimension = bitSize / 8;
  property.graphType = NGT::Property::GraphType::GraphTypeANNG;
  property.objectType = NGT::ObjectSpace::ObjectType::Uint8;
  property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeHamming;

  try {
    NGT::Index  index;
    index.createGraphAndTree(indexFile, property);
    index.open(indexFile);
    ifstream    is(objectFile);
    string line;
    while (getline(is, line)) {
      vector<string> tokens;
      NGT::Common::tokenize(line, tokens, " \t");      // split an object string into values by separators.                  
      vector<float> obj;
      for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
        obj.push_back(NGT::Common::strtol(*ti));
      }
      obj.resize(property.dimension);  // shrink vectors because original ones are too long for this sample.                 
      index.append(obj);
    }
    index.createIndex(16);
    index.saveIndex(indexFile);
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }

  try {
    NGT::Index  index(indexFile);
    ifstream    is(queryFile);
    string line;
    while (getline(is, line)) {
      NGT::Object *query = 0;
      {
        vector<string> tokens;
        NGT::Common::tokenize(line, tokens, " \t");
        vector<float> obj;
        for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
          obj.push_back(NGT::Common::strtol(*ti));
        }
        obj.resize(property.dimension);
        cout << "Query: ";
        for (size_t i = 0; i < obj.size(); i++) {
          cout << obj[i] << " ";
        }
        cout << endl << endl;
        query = index.allocateObject(obj);
      }
      NGT::SearchContainer sc(*query);
      NGT::ObjectDistances objects;
      sc.setResults(&objects);
      sc.setSize(10);
      sc.setEpsilon(0.2);

      index.search(sc);
      cout << "Rank\tID\tDistance" << endl;
      for (size_t i = 0; i < objects.size(); i++) {
        cout << i + 1 << "\t" << objects[i].id << "\t" << objects[i].distance << " : ";
        NGT::ObjectSpace &objectSpace = index.getObjectSpace();
        uint8_t *object = static_cast<uint8_t*>(objectSpace.getObject(objects[i].id));
        for (size_t i = 0; i < objectSpace.getDimension(); ++i) {
          cout << static_cast<int>(object[i]) << " ";
        }
        cout << endl;
      }
      cout << endl;
      index.deleteObject(query);
    }
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }
  return 0;
}

@TieSKey
Copy link
Author

TieSKey commented Oct 12, 2019

Fantastic. Arigatou gosaimasu Iwasaki-hakase.

So going by your example, the way we input data to the index (in chunks as floats, string, etc) doesn't affect the algorithm right? (which is perfectly logical but sometimes implementation details can introduce some oddities)
As part of my phd tesis I'm trying to interface this library with opencv and dlib. Working over this example to implement overloads of allocateObject (and related methods) to take vector as params seems to be working correctly but it needs more testing.
Edit: just noticed that Uint8 is just a def for unsigned char so no need to touch base code.

@masajiro
Copy link
Member

どういたしまして Dou itashimashite.

Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.

the way we input data to the index (in chunks as floats, string, etc) doesn't affect the algorithm right?

I am not sure how you want to apply hamming distance to your floats and string because hamming distance is a bitwise operator.

@TieSKey
Copy link
Author

TieSKey commented Oct 16, 2019

I am not sure how you want to apply hamming distance to your floats and string because hamming distance is a bitwise operator.

Indeed, I was wondering if maybe some memory alignment or input data separation was dependent on the elements count when adding data to the index. So having N/4 4byte elements instead of N 1byte elements could cause problems or unexpected behaviors. Well, in your example the number of elements is originally the same and the extra bytes are discarded later.

Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.

Oh, I overlooked this. An Uint8 overload would be a lot clearer, yes, and since the underlying function doing the memory allocation/copying is already templated, the required changes are minimal.

@masajiro
Copy link
Member

masajiro commented Nov 5, 2019

Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.

To solve the confusion, I updated Index::append() to be able to use with an uint8 argument. You can write as the example below.

vector<uint8_t> obj;
...
index.append(obj);

@masajiro
Copy link
Member

Below is a sample for hamming distance with the latest version for your reference.

#include        "NGT/Index.h"

using namespace std;

int
main(int argc, char **argv)
{
  string        indexFile       = "index";
  string        objectFile      = "./data/sift-dataset-5k.tsv";
  string        queryFile       = "./data/sift-query-3.tsv";
  size_t        bitSize         = 128;
  // index construction                                                                                                              
  try {
    NGT::Property       property;                                                              
    property.dimension          = bitSize / 8;                                                      
    property.objectType         = NGT::ObjectSpace::ObjectType::Uint8;                                              
    property.distanceType       = NGT::Index::Property::DistanceType::DistanceTypeHamming;
    NGT::Index  index(property);
    ifstream    is(objectFile);
    string      line;
    while (getline(is, line)) {
      vector<string>    tokens;
      NGT::Common::tokenize(line, tokens, " \t");      // split an object string into values by separators.                          
      vector<uint8_t>   obj;                                                                                              
      for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
        obj.push_back(NGT::Common::strtol(*ti));
      }
      obj.resize(property.dimension);  // cut off unnecessary data in the file.                                                       
      index.append(obj);
    }
    index.createIndex(16);
    index.saveIndex(indexFile);
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }

  // nearest neighbor search                                                                                                         
  try {
    NGT::Index          index(indexFile);
    NGT::Property       property;
    index.getProperty(property);
    ifstream            is(queryFile);
    string      line;
    while (getline(is, line)) {
      vector<uint8_t>   query;
      {
        vector<string>  tokens;
        NGT::Common::tokenize(line, tokens, " \t");
        for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
          query.push_back(NGT::Common::strtol(*ti));
        }
        query.resize(property.dimension);
        cout << "Query : ";
        for (size_t i = 0; i < query.size(); i++) {
          cout << std::bitset<8>(query[i]) << " ";                                                            
        }
      }

      NGT::SearchQuery          sc(query);
      NGT::ObjectDistances      objects;
      sc.setResults(&objects);
      sc.setSize(10);
      sc.setEpsilon(0.2);

      index.search(sc);
      cout << endl << "Rank\tID\tDistance" << std::showbase << endl;
      for (size_t i = 0; i < objects.size(); i++) {
        cout << i + 1 << "\t" << objects[i].id << "\t" << objects[i].distance << "\t: ";
        NGT::ObjectSpace &objectSpace = index.getObjectSpace();
        uint8_t *object = static_cast<uint8_t*>(objectSpace.getObject(objects[i].id));
        for (size_t i = 0; i < objectSpace.getDimension(); ++i) {
          cout << std::bitset<8>(object[i]) << " ";                                                                  
        }
        cout << endl;
      }
      cout << endl;
    }
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }

  return 0;
}

@TieSKey
Copy link
Author

TieSKey commented Nov 19, 2019

Looks good, thx again!

Btw, ONNG takes a really long time to build, around 15-20 mins on my computer, that's expected. But what about panng? does it build any faster?

@masajiro
Copy link
Member

PANNG can be built faster than ONNG with the recommended number of edges 100. But, first, you may want to reduce the number of edges of ONNG (ANNG) to build it faster, because 100 edges are too many for most datasets. The number of edges can be specified with NGT::Property::edgeSizeForCreation or -E of ngt create command.

@TieSKey
Copy link
Author

TieSKey commented Jan 3, 2020

Akemashite omedeto gosaimasu.

Running some more benchmarks, when I construct an ONNG index by setting:
property.graphType = NGT::Property::GraphType::GraphTypeONNG

it actually takes less time than if I construct an ANNG index with 10-20 edges and then run the optimizer over it, which if I understand correctly, constructs a PANNG:

 NGT::GraphOptimizer optimizer;
                optimizer.numOfOutgoingEdges = 4;
                optimizer.numOfIncomingEdges = 8;
                optimizer.execute(*index);

Is this correct? Am I misunderstanding something?

Thx as always for your assistance.

@masajiro
Copy link
Member

masajiro commented Jan 6, 2020

あけましておめでとうございます。今年もよろしくお願い致します。
Akemashite omedetou gozaimasu. Kotoshimo yoroshiku onegai itashimasu.

I experimentally added the type ONNG (NGT::Property::GraphType::GraphTypeONNG) to the index construction in order to directly construct an ONNG index without constructing an ANNG index. Unfortunately, the pseudo ONNG in this way did not achieve the same performance as that of an authentic ONNG index that is constructed from an ANNG index. Thus, I don't recommend you to use an pseudo ONNG index as an authentic ONNG index. However, you might able to use the pseudo ONNG index as an ANNG index, when you construct ONNG. It means that, as you mentioned, first construct an pseudo ONNG, then optimize it to construct ONNG. I expect that the ONNG index optimized from a pseudo ONNG index, which is not a PANNG index, achieves almost the same performance as that of the ONNG index from ANNG.

PANNG can be created with this command.

The index constructed by using the following manner might be almost the same as PANNG. But, I am not so sure.

NGT::GraphOptimizer optimizer;
                optimizer.numOfOutgoingEdges = 4;
                optimizer.numOfIncomingEdges = INT32_MAX;
                optimizer.execute(*index);

@TieSKey
Copy link
Author

TieSKey commented Jan 21, 2020

Thanks as always for your assistance.

I'm finishing up an academic publication regarding ANN algorithms and their use for Augmented Reality and wanted to publicly thank you for your assistance if u are ok with that.

@masajiro
Copy link
Member

I'm glad to help you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants