Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issues with SQL backend and Unique Constraint #26

Closed
pguridi opened this issue Jan 21, 2014 · 18 comments
Closed

issues with SQL backend and Unique Constraint #26

pguridi opened this issue Jan 21, 2014 · 18 comments

Comments

@pguridi
Copy link
Contributor

pguridi commented Jan 21, 2014

I'm having troubles with the Unique constraint..,or maybe I'm getting something wrong.. :S. here goes..
(talking about the plain SQL backend, not the ORM):
the schema dump looks like:

CREATE TABLE IF NOT EXISTS `fingerprints` (
  `hash` binary(10) NOT NULL,
  `song_id` mediumint(8) unsigned NOT NULL,
  `offset` int(10) unsigned NOT NULL,
  PRIMARY KEY (`hash`),
  UNIQUE KEY `song_id` (`song_id`,`offset`,`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

"hash" is a primary key, which means will be unique for each record regardless of the other columns. What happens is that I fingerprint 2 different files, and I get some hashes that are already in the DB but from the other file (with different offset). And because hash is unique, an IntegrityError is raised. Then, the hash of the new file is never saved.

Am I missing something?.

@Wessie
Copy link
Contributor

Wessie commented Jan 21, 2014

The primary key isn't the only constraint there, UNIQUE KEY song_id (song_id,offset,hash) tells the database that there can't ever be a duplicate row that contains the exact same song_id, offset and hash pairing.

@Wessie
Copy link
Contributor

Wessie commented Jan 21, 2014

On second look, no that is indeed an error in the schema. We should have a separate id as PRIMARY KEY and keep the UNIQUE constraint for the other fields.

This is a backwards incompatible change though, we need to check this out with @worldveil.

@pguridi
Copy link
Contributor Author

pguridi commented Jan 21, 2014

Yes, I can see that. But doesn't apply both constraints?.
I think, that with the current schema, is impossible to save a same hash, with different offset and song_id. Because the composite unique constraint will not apply, but the primary key on "hash" will.

@pguridi
Copy link
Contributor Author

pguridi commented Jan 21, 2014

because the insert query uses "INSERT IGNORE" , all the constraint errors will be silent, not only the ones from the UNIQUE KEY song_id (song_id,offset,hash), also the failing rows from the "hash primary key"..

@Wessie
Copy link
Contributor

Wessie commented Jan 21, 2014

@pguridi indeed, this is a bug. Although not one that will reduce accuracy most likely. Would still be good to fix it. You should use the correct method in the ORM.

@pguridi
Copy link
Contributor Author

pguridi commented Jan 21, 2014

yes. Ill add an id primary key field, remove the primary key from hash and keep the composite constraint.

@worldveil
Copy link
Owner

The primary key on hash is for performance reasons, why do you want to change it?

@Wessie
Copy link
Contributor

Wessie commented Jan 22, 2014

The change is needed because it is impossible to enter both (examples) (hash=10, offset=0, song_id=0) and (hash=10, offset=100, song_id=0). But this applies for anything where hash is the same but with a different offset or song_id combination.

The fix is to add an id column as new primary key and add an index to hash to keep the performance up.

@pguridi
Copy link
Contributor Author

pguridi commented Jan 22, 2014

indeed, with the primary key in the "hash" field, the composite unique constraint UNIQUE KEY song_id (song_id,offset,hash) would never be used. Because there will not be a single repeated hash in the whole table.

@worldveil
Copy link
Owner

Ah, yes I overlooked that. Is there any key/indexing configuration where we can avoid adding an id yet still have 1) index on hash, and 2) unique constraint on hash, offset, song_id? The inclusion of an id field will cause an increase in storage size.

@Wessie
Copy link
Contributor

Wessie commented Jan 22, 2014

@worldveil we could try using a composite primary key, but I'm not sure how well supported it is on older MySQL versions or when using ORMs.

@pguridi
Copy link
Contributor Author

pguridi commented Jan 23, 2014

@Wessie AFAIK, SQLAlchemy supports composite foreign keys. Also sqlite supports it. "Both single column and composite (multiple column) primary keys are supported." [1]

[1] http://www.sqlite.org/lang_createtable.html

@worldveil
Copy link
Owner

I'd say let's go for composite key if we can then. What would the syntax be for that? My SQL is a little rusty.

If that won't work then we can add an id but I'd like to avoid that as it adds a lot of disk space since we'd add one to each fingerprint.

@worldveil
Copy link
Owner

Should be a simple change to this:

CREATE TABLE IF NOT EXISTS `fingerprints` (
  `hash` binary(10) NOT NULL,
  `song_id` mediumint(8) unsigned NOT NULL,
  `offset` int(10) unsigned NOT NULL,
  INDEX (`hash`),
  UNIQUE KEY `uniq_constraint` (`song_id`,`offset`,`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

What's odd is as @pguridi says, MySQL for me allows different song_id, hash combinations even with the primary key. However, just using INDEX should be the correct way of doing it.

@Wessie
Copy link
Contributor

Wessie commented Jan 24, 2014

CREATE TABLE IF NOT EXISTS `fingerprints` (
  `hash` binary(10) NOT NULL,
  `song_id` mediumint(8) unsigned NOT NULL,
  `offset` int(10) unsigned NOT NULL,
  PRIMARY KEY (`hash`, `song_id`, `offset`)
  UNIQUE KEY `uniq_constraint` (`hash`, `song_id`, `offset`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Should be the correct version, please use utf8 standard as well.

@worldveil
Copy link
Owner

Yes good on utf8. We need the index on hash, though. INDEX doesn't enforce uniqueness, it's just for lookup speed. Also the primary and unique there are redundant?

@Wessie
Copy link
Contributor

Wessie commented Jan 24, 2014

I'm not sure the INDEX is required since it is also a primary key. But adding it shouldn't hurt.

@worldveil
Copy link
Owner

The SQL statement to retrieve hashes works on a WHERE hash = <value> query. Thus we need an index that hashes on only the hash field. The primary key here is redundant to the UNIQUE one, it only ensures a fast lookup on all three fields, which we don't need (and don't think we use). We need only two things:

  1. Fast lookup on hash field
  2. Unique constraint on (hash, song_id, offset)

Thus, the minimal amount of indexes to achieve that is:

CREATE TABLE IF NOT EXISTS `fingerprints` (
  `hash` binary(10) NOT NULL,
  `song_id` mediumint(8) unsigned NOT NULL,
  `offset` int(10) unsigned NOT NULL,
  INDEX (`hash`)
  UNIQUE KEY `uniq_constraint` (`hash`, `song_id`, `offset`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants