Just a quick look at duplicates in the training data itself.

Roughly 10% of the entries share exactly the same 'X' features with at least one other entry and differ only in their 'ID' and 'y' fields. The 'y' values sometimes diverge quite a lot. A good example is rows 3070 and 3133.

Until now I have just ignored this, but it might be worth collapsing these duplicates to reduce "dimensionality" in the sample space, i.e. the number of rows rather than the number of features.
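
One possible way to do that reduction, sketched on a toy frame rather than on the competition data: collapse every group of identical feature rows into a single row whose 'y' is the group mean. The frame `toy`, its values and the column `n` are made up for illustration, and using the mean is an assumption; a median may be the safer choice when a group contains an outlier.

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical values): ID, y, then feature columns.
toy = pd.DataFrame({
    'ID':  [0, 1, 2, 3, 4],
    'y':   [100.0, 110.0, 90.0, 95.0, 120.0],
    'X0':  ['a', 'a', 'b', 'b', 'c'],
    'X10': [0, 0, 1, 1, 1],
})
toy_features = list(toy.columns[2:])

# One row per distinct feature combination: 'y' becomes the group mean and
# 'n' records how many original rows were merged into it.
collapsed = (toy.groupby(toy_features, as_index=False)
                .agg(y=('y', 'mean'), n=('y', 'size')))
print(collapsed)
```

Keeping the group size around makes it possible to use it later as a sample weight, so merged rows do not lose influence relative to unique ones.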

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

In [None]:
feature_columns = train_data.columns[2:]
feature_columns

In [None]:
# Columns stored as strings (object dtype), i.e. the categorical features
label_columns = [column for column, dtype in train_data.dtypes.items()
                 if dtype == object]
label_columns

In [None]:
# keep=False flags every member of a duplicate group, not only the repeats
train_duplicates = train_data[train_data.duplicated(subset=feature_columns, keep=False)]
print("{} duplicate entries in training, out of {}, a {:.2f} %".format(
    len(train_duplicates),
    len(train_data),
    100 * len(train_duplicates) / len(train_data)
    ))
train_duplicates.sort_values(by=label_columns)

Let's take a quick look at the standard deviation of the 'y' for each duplicate group.

In [None]:
# Spread ('std') and number of rows ('size') of 'y' within each duplicate group
duplicate_std = (train_data[train_data.duplicated(subset=feature_columns, keep=False)]
                 .groupby(list(feature_columns.values))['y']
                 .aggregate(['std', 'size'])
                 .reset_index(drop=True))

duplicate_std.sort_values(by='std', ascending=False)

In [None]:
# One row per distinct feature combination that occurs more than once
train_groups = (train_data[train_data.duplicated(subset=feature_columns, keep=False)]
                .groupby(list(feature_columns.values)).size().reset_index())
print("{} duplicate groups in training".format(len(train_groups)))

train_groups

Is this also the case with test_data?

In [None]:
test_duplicates = test_data[test_data.duplicated(subset=feature_columns, keep=False)]
print("{} duplicate entries in test, out of {}, a {:.2f} %".format(
    len(test_duplicates),
    len(test_data),
    100 * len(test_duplicates) / len(test_data)
    ))
# Number of duplicated test rows per combination of the categorical columns
# (the axis argument of groupby is deprecated in recent pandas, so it is omitted)
test_duplicates.groupby(label_columns).count()[['ID']]

In [None]:
test_groups = (test_data[test_data.duplicated(subset=feature_columns, keep=False)]
               .groupby(list(feature_columns.values)).size().reset_index())
print("{} duplicate groups in test".format(len(test_groups)))

test_groups

In [None]:
all_data = pd.concat((train_data.drop('y', axis=1), test_data))
all_duplicates = all_data[all_data.duplicated(subset=feature_columns, keep=False)]
print("{} duplicate entries in total, out of {}, a {:.2f} %".format(
    len(all_duplicates),
    len(all_data),
    100 * len(all_duplicates) / len(all_data)
    ))

all_groups = all_duplicates.groupby(list(feature_columns.values)).size().reset_index()
print("{} duplicate groups in total".format(len(all_groups)))

all_groups
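
The combined count does not say which duplicates cross the train/test boundary. A minimal sketch of how that could be checked, again on made-up toy frames: flag each test row whose feature values also appear in some training row. The names `toy_train`, `toy_test` and `seen_in_train` are hypothetical and not part of the analysis above.

```python
import pandas as pd

# Toy stand-ins for train_data and test_data (hypothetical values).
toy_train = pd.DataFrame({
    'ID':  [0, 1, 2],
    'y':   [100.0, 110.0, 90.0],
    'X0':  ['a', 'a', 'b'],
    'X10': [0, 0, 1],
})
toy_test = pd.DataFrame({
    'ID':  [3, 4],
    'X0':  ['a', 'c'],
    'X10': [0, 1],
})
toy_features = ['X0', 'X10']

# De-duplicate the training keys first so the left merge cannot multiply test rows.
train_keys = toy_train[toy_features].drop_duplicates()
flagged = toy_test.merge(train_keys, on=toy_features, how='left', indicator=True)

# '_merge' is 'both' when the feature combination also exists in training.
flagged['seen_in_train'] = flagged['_merge'] == 'both'
print(flagged[['ID', 'seen_in_train']])
```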

From this quick analysis we can see that the training data is noisy. Given how much 'y' can diverge for identical inputs (apart from the 'ID'), a similar situation should be expected in the test data. Even worse, the 'y' values in the test data could be the ones that diverge from what a model fitted on the training data predicts.
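
The noise can also be put into a single number. On the duplicated rows, a model that only sees the features can do no better (in squared error) than predict each group's mean 'y', so the within-group sum of squares gives an R² ceiling for those rows. The sketch below uses made-up toy values. Whether the ceiling carries over to the non-duplicated rows is an assumption that this notebook does not verify.

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical values).
toy = pd.DataFrame({
    'y':  [100.0, 110.0, 90.0, 94.0, 120.0],
    'X0': ['a', 'a', 'b', 'b', 'c'],
})
toy_features = ['X0']

# Keep only rows that belong to a duplicate group.
dupes = toy[toy.duplicated(subset=toy_features, keep=False)]

# Best possible feature-only prediction for each row: the mean 'y' of its group.
group_mean = dupes.groupby(toy_features)['y'].transform('mean')

ss_within = ((dupes['y'] - group_mean) ** 2).sum()        # unexplainable part
ss_total = ((dupes['y'] - dupes['y'].mean()) ** 2).sum()  # total variation
r2_ceiling = 1 - ss_within / ss_total
print("R^2 ceiling on duplicated rows: {:.3f}".format(r2_ceiling))
```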