Duplicate samples and overlap between train and test #44

Closed
Britefury opened this Issue Aug 31, 2017 · 1 comment


Britefury commented Aug 31, 2017

I hope I have got this right, but it seems that 43 samples are duplicated in the training set (43 pairs of identical images) and 1 sample is duplicated in the test set. In addition, 10 samples in the training set also appear in the test set. I found these by comparing the samples at the byte level.

Here is a list of the duplicates:

Training set duplicates:
[601, 39865]
[831, 24228]
[1826, 23718]
[2024, 53883]
[4974, 6293]
[5520, 49165]
[5790, 11845]
[5822, 33399]
[6139, 37731]
[6280, 41036]
[8485, 31238]
[8841, 28184]
[12571, 56657]
[14096, 32343]
[14710, 22159]
[15587, 28635]
[19308, 20114]
[19668, 21571]
[19760, 39489]
[19888, 24443]
[21072, 32800]
[22852, 28789]
[23052, 57107]
[23413, 33731]
[24785, 46015]
[25297, 40077]
[25629, 49588]
[26314, 49351]
[27045, 40033]
[27421, 31627]
[32113, 38337]
[32300, 33730]
[32303, 56840]
[32888, 41918]
[32922, 54584]
[36634, 39841]
[38261, 41877]
[42756, 53842]
[46667, 57724]
[46782, 54829]
[47929, 54185]
[48480, 59607]
[48955, 51368]
Test set duplicates:
[6334, 8569]
Training set samples overlapping with test set:
Train samples [3763] overlap with test samples [7243]
Train samples [4944] overlap with test samples [7781]
Train samples [6168] overlap with test samples [9227]
Train samples [12404] overlap with test samples [4037]
Train samples [15943] overlap with test samples [6659]
Train samples [22403] overlap with test samples [7762]
Train samples [34617] overlap with test samples [4990]
Train samples [35772] overlap with test samples [7216]
Train samples [48228] overlap with test samples [5867]
Train samples [52205] overlap with test samples [9560]
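
If the goal is to keep the test set untouched, one possible way to act on the overlap list above is to mask out the ten listed training indices. This is only a sketch: the `keep_mask` helper and the `n_train = 60000` training-set size are assumptions for illustration, not something stated in this issue.

```python
import numpy as np

# Train indices reported above as byte-identical to some test sample.
TRAIN_TEST_OVERLAP = [3763, 4944, 6168, 12404, 15943,
                      22403, 34617, 35772, 48228, 52205]

def keep_mask(n_samples, drop_indices):
    """Return a boolean mask of length n_samples that is False at drop_indices."""
    mask = np.ones(n_samples, dtype=bool)
    mask[np.asarray(drop_indices, dtype=int)] = False
    return mask

# Assumed training-set size (hypothetical; adjust to len(train_X)).
n_train = 60000
mask = keep_mask(n_train, TRAIN_TEST_OVERLAP)
print(mask.sum())  # 59990 samples kept
```

The mask could then be applied to both images and labels (for example `train_X[mask]`) so the two stay aligned.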

The code required to generate the above output is as follows (assuming the input images are in the variables train_X and test_X):

def sample_bytes(x):
    # Serialize each sample to its raw bytes so identical samples compare equal.
    return [x[i].tobytes() for i in range(len(x))]

train_h = sample_bytes(train_X)
test_h = sample_bytes(test_X)

train_dict = {}
test_dict = {}
for i, h in enumerate(train_h):
    train_dict.setdefault(h, []).append(i)
for i, h in enumerate(test_h):
    test_dict.setdefault(h, []).append(i)

print('Training set duplicates:')
for k, v in train_dict.items():
    if len(v) > 1:
        # Sanity check: every index in the group holds the same pixel data.
        for j in range(1, len(v)):
            assert (train_X[v[0]] == train_X[v[j]]).all()
        print(v)

print('Test set duplicates:')
for k, v in test_dict.items():
    if len(v) > 1:
        for j in range(1, len(v)):
            assert (test_X[v[0]] == test_X[v[j]]).all()
        print(v)

print('Training set samples overlapping with test set:')
for k, v in train_dict.items():
    if k in test_dict:
        assert (train_X[v[0]] == test_X[test_dict[k][0]]).all()
        print('Train samples {} overlap with test samples {}'.format(v, test_dict[k]))

# Number of distinct samples shared by train and test.
overlap = set(train_h).intersection(set(test_h))
print(len(overlap))
# Raises AssertionError while any train/test overlap remains.
assert overlap == set()
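
For anyone who wants to try the approach without loading the dataset, here is a minimal self-contained sketch of the same byte-level grouping on toy arrays. The function names (`group_by_bytes`, `find_duplicates`, `find_overlap`) and the toy data are made up for illustration. Note that `tobytes()` ignores dtype and shape, so this assumes train and test samples share both.

```python
from collections import defaultdict

import numpy as np

def group_by_bytes(x):
    """Map each sample's raw bytes to the list of indices where it occurs."""
    groups = defaultdict(list)
    for i, sample in enumerate(x):
        groups[sample.tobytes()].append(i)
    return groups

def find_duplicates(x):
    """Return index groups of samples that are byte-identical within x."""
    return [idx for idx in group_by_bytes(x).values() if len(idx) > 1]

def find_overlap(train_x, test_x):
    """Return (train_indices, test_indices) pairs for samples present in both."""
    train_groups = group_by_bytes(train_x)
    test_groups = group_by_bytes(test_x)
    return [(train_groups[k], test_groups[k])
            for k in train_groups if k in test_groups]

# Toy data: train rows 0 and 2 are identical; train row 1 equals test row 0.
toy_train = np.array([[0, 1], [2, 3], [0, 1], [4, 5]], dtype=np.uint8)
toy_test = np.array([[2, 3], [6, 7]], dtype=np.uint8)

print(find_duplicates(toy_train))          # [[0, 2]]
print(find_overlap(toy_train, toy_test))   # [([1], [0])]
```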
Member

hanxiao commented Aug 31, 2017

🙇 Many thanks for finding this issue! Check my PR #45.
