It is interesting to me how non-random the Id column seems to be in the dataset.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import model_selection, neighbors

Let's make a simple plot of the joint distribution of the Id and Cover_Type columns.

In [None]:
df = pd.read_csv('../input/forest-cover-type-kernels-only/train.csv')
fig, ax = plt.subplots(figsize=(10, 9))
# Pass x and y by keyword: recent seaborn releases no longer accept them positionally
sns.stripplot(x='Cover_Type', y='Id', data=df, jitter=True, ax=ax, size=2)

It's clear that a huge amount of information about the Cover_Type is contained in the Id column. To confirm this, let's build a k-NN classifier on just the Id data and see how well it performs on our validation set.

In [None]:
X = np.array(df['Id']).reshape(-1, 1)
y = df['Cover_Type']

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

# Grid search to determine best k
accuracies=[]
for k in range(1, 31):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k, p=1, n_jobs=-1)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    accuracies.append(accuracy)
    
best_k = np.argmax(accuracies) + 1 # best k is consistently 1
print('Highest-performing k: {} (acc: {})'.format(best_k, max(accuracies)))

The 1-NN classifier achieves between 0.52 and 0.55 accuracy on the validation set, which is impressive given that it uses only the Id of each observation to make its predictions (with seven balanced classes, random guessing would yield only about 0.14). Of course, this classifier cannot be used on the test set, since the test set covers a different range of Ids. However, it is interesting to repeat the same analysis on the best-performing publicly available submission. At the time of writing, that is the one from this kernel: https://www.kaggle.com/codename007/forest-cover-type-eda-baseline-model

In [None]:
df = pd.read_csv('../input/forest-cover-type-eda-baseline-model/etc.csv')
fig, [ax1, ax2] = plt.subplots(1, 2, sharey=True, figsize=(18, 9))
ax1.set_title('Unshuffled')
ax2.set_title('Shuffled (for comparison)')
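sns.stripplot(x='Cover_Type', y='Id', data=df, jitter=True, ax=ax1, size=0.5)
# Assign a permuted copy instead of shuffling the Series in place,
# which is not guaranteed to update the DataFrame column
df['Id'] = np.random.permutation(df['Id'].values)
sns.stripplot(x='Cover_Type', y='Id', data=df, jitter=True, ax=ax2, size=0.5)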

Once again, there is clear structure in the joint distribution of cover types and Ids. This suggests that the dataset's Ids were not shuffled, which has implications for how private CV scores compare to the public leaderboard (LB).
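
The structure visible in the plots can also be summarized with a single number. The sketch below is only an illustration, not part of the original analysis: `adjacent_agreement` is a hypothetical helper that measures how often two rows with neighbouring Ids share the same label, and the data is a synthetic stand-in (contiguous runs of 50 Ids per label, roughly balanced over 7 classes), not the Kaggle file. If Ids carried no information, the agreement should sit near the chance level of 1/7.

```python
import random

def adjacent_agreement(ids, labels):
    """Fraction of Id-adjacent row pairs that share the same label."""
    # Sort labels by Id so that neighbouring positions are neighbouring Ids.
    ordered = [label for _, label in sorted(zip(ids, labels))]
    pairs = list(zip(ordered, ordered[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

# Synthetic stand-in for the real data (an assumption for illustration):
# labels 1..7 laid out in contiguous runs of 50 Ids each.
rng = random.Random(0)
n_classes, run_length, n_runs = 7, 50, 140
labels = []
for _ in range(n_runs):
    labels.extend([rng.randrange(1, n_classes + 1)] * run_length)
ids = list(range(1, len(labels) + 1))

structured = adjacent_agreement(ids, labels)

# Shuffling the Ids destroys the ordering, so agreement should fall
# to roughly the chance level.
shuffled_ids = ids[:]
rng.shuffle(shuffled_ids)
shuffled = adjacent_agreement(shuffled_ids, labels)

print('structured: {:.3f}, shuffled: {:.3f}'.format(structured, shuffled))
```

On the real data, the same helper could presumably be called as `adjacent_agreement(df['Id'], df['Cover_Type'])`, once on the original frame and once after permuting the Ids, to put a number on the difference between the two panels.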