Based on the [GLRC kernel](https://www.kaggle.com/the1owl/google-landmark-random-choice-glrc) by [the1owl](https://www.kaggle.com/the1owl), I have tested some other strategies for drawing random choices from the distribution of landmark frequencies in the train set:

In [1]:
import pandas as pd
import pylab as pl

train = pd.read_csv('../input/train.csv')
# test = pd.read_csv('../input/test.csv')
sub = pd.read_csv('../input/sample_submission.csv')
pl.seed(0)


# version 1
Draw `landmark_id` and probability values from the distribution in the train set:

In [None]:
probs = train.landmark_id.value_counts() / train.shape[0]
sub['landmark_id'] = train.landmark_id.iloc[pl.randint(0, train.shape[0], sub.shape[0])].values
sub['prob'] = probs[sub.landmark_id].values
sub['landmarks'] = sub.landmark_id.astype(str) + ' ' + sub.prob.astype(str)
sub[['id','landmarks']].to_csv('submission_1.csv', index=False)
sub[['id','landmarks']].head()
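
Picking a random row of `train` and reading its `landmark_id` amounts to sampling ids in proportion to their frequency. A minimal sketch of that equivalence, using a made-up toy column (the data, `rng`, and `n_draws` are assumptions for illustration, not the competition data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.landmark_id (hypothetical values).
toy = pd.Series([7, 7, 7, 7, 7, 7, 3, 3, 3, 9], name='landmark_id')
probs = toy.value_counts(normalize=True)  # same as value_counts() / len(toy)

rng = np.random.RandomState(0)
n_draws = 20000

# Strategy used in version 1: pick random rows, read their landmark_id.
by_row = toy.iloc[rng.randint(0, len(toy), n_draws)].values

# Equivalent strategy: sample the ids directly, weighted by frequency.
by_freq = rng.choice(probs.index.values, size=n_draws, p=probs.values)

# Both empirical distributions should be close to probs.
freq_row = pd.Series(by_row).value_counts(normalize=True)
freq_direct = pd.Series(by_freq).value_counts(normalize=True)
print(pd.DataFrame({'train': probs, 'by_row': freq_row, 'by_freq': freq_direct}))
```

The two sampled columns should agree with the `train` column up to sampling noise.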

# version 2
Use the most frequent landmark and its probability for all test images:

In [None]:
sub['landmarks'] = '%d %g' % (probs.index[0], probs[probs.index[0]])
sub[['id','landmarks']].to_csv('submission_2.csv', index=False)
sub[['id','landmarks']].head()

# version 3
Use only the 1000 most frequent landmarks in the train set to build the test set values:

In [None]:
N = 1000
probs = train.landmark_id.value_counts() / train.shape[0]
probs = probs.iloc[:N]
probs = pd.DataFrame({'landmark_id': probs.index,
                      'probability': probs.values}, index=pl.arange(N))
T = pd.merge(train, probs, on='landmark_id', how='inner')
inx = pl.randint(0, T.shape[0], sub.shape[0])
sub['landmark_id'] = T.landmark_id.iloc[inx].values
sub['prob'] = T.probability.iloc[inx].values
sub['landmarks'] = sub.landmark_id.astype(str) + ' ' + sub.prob.astype(str)
sub[['id','landmarks']].to_csv('submission_3.csv', index=False)
sub[['id','landmarks']].head()

# version 4

The same as version 3, but with one fixed confidence value (0.1) for every prediction:

(it might help hacking GAP)

In [None]:
N = 1000
probs = train.landmark_id.value_counts() / train.shape[0]
probs = probs.iloc[:N]
probs = pd.DataFrame({'landmark_id': probs.index,
                      'probability': 0.1}, index=pl.arange(N))
T = pd.merge(train, probs, on='landmark_id', how='inner')
inx = pl.randint(0, T.shape[0], sub.shape[0])
sub['landmark_id'] = T.landmark_id.iloc[inx].values
sub['prob'] = T.probability.iloc[inx].values
sub['landmarks'] = sub.landmark_id.astype(str) + ' ' + sub.prob.astype(str)
sub[['id','landmarks']].to_csv('submission_4.csv', index=False)
sub[['id','landmarks']].head()

# summary

To summarize, I show here how the GAP changes with the number of landmarks from the train set used in randomly guessing the landmarks in the test set:

(I use the GAP function defined by [David](https://www.kaggle.com/davidthaler) in [https://www.kaggle.com/davidthaler/gap-metric](https://www.kaggle.com/davidthaler/gap-metric))

In [4]:
import seaborn as sns

def GAP_vector(pred, conf, true):
    '''
    Compute Global Average Precision (aka micro AP), the metric for the
    Google Landmark Recognition competition.
    This function takes predictions, labels and confidence scores as vectors.
    In both predictions and ground-truth, use None/np.nan for "no label".

    Args:
        pred: vector of integer-coded predictions
        conf: vector of probability or confidence scores for pred
        true: vector of integer-coded labels for ground truth

    Returns:
        GAP score
    (https://www.kaggle.com/davidthaler/gap-metric)
    '''
    x = pd.DataFrame({'pred': pred, 'conf': conf, 'true': true})
    x.sort_values('conf', ascending=False, inplace=True, na_position='last')
    x['correct'] = (x.true == x.pred).astype(int)
    x['prec_k'] = x.correct.cumsum() / (pl.arange(len(x)) + 1)
    x['term'] = x.prec_k * x.correct
    gap = x.term.sum() / x.true.count()
    return gap


Ns = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 15000]
vlSize = 20000
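Ttr = train.copy()  # keep the full train set; `train` is reassigned inside the loop
res = pd.DataFrame([])
for r in range(30):
    # Sample validation rows (with replacement) and train on the remaining rows.
    vlInd = pl.randint(0, Ttr.shape[0], vlSize)
    val = Ttr.iloc[vlInd, :].copy()  # copy so new columns can be added without SettingWithCopyWarning
    train = Ttr.iloc[pl.setdiff1d(pl.arange(Ttr.shape[0]), vlInd)]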

    for ni, N in enumerate(Ns):
        probs = train.landmark_id.value_counts() / train.shape[0]
        probs = probs.iloc[:N]
        probs = pd.DataFrame({'landmark_id': probs.index,
                              'probability': probs.values}, index=pl.arange(probs.shape[0]))
        T = pd.merge(train, probs, on='landmark_id', how='inner')
        inx = pl.randint(0, T.shape[0], val.shape[0])
        val['landmark_pred'] = T.landmark_id.iloc[inx].values
        val['prob'] = T.probability.iloc[inx].values
        GAP1 = GAP_vector(val.landmark_pred, val.prob, val.landmark_id)
        GAP2 = GAP_vector(val.landmark_pred, [0.1 for i in range(len(val))], val.landmark_id)
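        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        res = pd.concat([res, pd.DataFrame({'n':[N, N], 'run':[r, r], 'GAP':[GAP1, GAP2], 'conf':['prob', 'fixed']})], ignore_index=True)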

pl.figure(figsize=(15,7))
ax = sns.boxplot(x='n', y='GAP', hue='conf', data=res)
ax.set_yscale('log')
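
To see why the confidence values matter for GAP at all, here is a small worked sketch. The `gap` function restates the computation of `GAP_vector` above so the snippet runs on its own; the toy labels and confidences are made up for illustration.

```python
import numpy as np
import pandas as pd

def gap(pred, conf, true):
    # Same computation as GAP_vector above, restated so the snippet runs alone.
    x = pd.DataFrame({'pred': pred, 'conf': conf, 'true': true})
    x = x.sort_values('conf', ascending=False, kind='mergesort')
    x['correct'] = (x.true == x.pred).astype(int)
    x['prec_k'] = x.correct.cumsum() / (np.arange(len(x)) + 1)
    return (x.prec_k * x.correct).sum() / x.true.count()

true = [1, 2, 3, 4]
pred = [1, 2, 9, 9]   # first two guesses right, last two wrong

# Correct guesses get the highest confidence, so they are ranked first.
good = gap(pred, [0.9, 0.8, 0.2, 0.1], true)
# Correct guesses get the lowest confidence, so they are ranked last.
bad = gap(pred, [0.1, 0.2, 0.8, 0.9], true)
# Constant confidence: the ranking carries no information, so the score
# depends only on how the tied rows happen to be ordered.
flat = gap(pred, [0.1] * 4, true)
print(good, bad, flat)
```

The predictions are identical in all three calls; only the ranking induced by the confidences changes. Confidences that rank wrong guesses first score lower than a constant confidence can, which is one way a fixed value may end up ahead of train-set frequencies.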

So the GAP score goes down as more of the most frequent landmarks are used. In some runs, using 2 landmarks with a fixed "confidence" value gives a higher GAP score than even N = 1. That may be because the probabilities in the train set do not match the probabilities in the test set. My next output file is therefore based on N = 2 with a fixed confidence:

In [7]:
train = pd.read_csv('../input/train.csv')
sub = pd.read_csv('../input/sample_submission.csv')

N = 2
probs = train.landmark_id.value_counts() / train.shape[0]
probs = probs.iloc[:N]
probs = pd.DataFrame({'landmark_id': probs.index,
                      'probability': 0.1}, index=pl.arange(N))
T = pd.merge(train, probs, on='landmark_id', how='inner')
inx = pl.randint(0, T.shape[0], sub.shape[0])
sub['landmark_id'] = T.landmark_id.iloc[inx].values
sub['prob'] = T.probability.iloc[inx].values
sub['landmarks'] = sub.landmark_id.astype(str) + ' ' + sub.prob.astype(str)
sub[['id','landmarks']].to_csv('submission_5.csv', index=False)
sub[['id','landmarks']].head()
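
Versions 3, 4 and 5 differ only in `N` and in whether the confidence is the train frequency or a fixed value. A possible way to fold them into one function is sketched below. `random_top_n_submission`, its parameters, and the toy data are hypothetical names introduced here. It filters with `isin` instead of `pd.merge`, which should give the same sampling distribution but not the same draws as the cells above.

```python
import numpy as np
import pandas as pd

def random_top_n_submission(train, sub, n_top, fixed_conf=None, seed=0):
    """Hypothetical helper: guess landmarks by sampling train rows restricted
    to the n_top most frequent landmark_ids."""
    rng = np.random.RandomState(seed)
    probs = train.landmark_id.value_counts(normalize=True).iloc[:n_top]
    # Keep only train rows whose landmark is among the n_top most frequent.
    top = train.loc[train.landmark_id.isin(probs.index), 'landmark_id']
    picked = top.iloc[rng.randint(0, len(top), len(sub))].values
    if fixed_conf is not None:
        conf = np.full(len(sub), fixed_conf)   # same confidence everywhere
    else:
        conf = probs.loc[picked].values        # train-set frequency of each guess
    out = sub[['id']].copy()
    out['landmarks'] = [f'{lid} {c}' for lid, c in zip(picked, conf)]
    return out

# Toy data standing in for train.csv and sample_submission.csv.
toy_train = pd.DataFrame({'landmark_id': [5] * 6 + [8] * 3 + [2]})
toy_sub = pd.DataFrame({'id': ['a', 'b', 'c', 'd']})

by_freq = random_top_n_submission(toy_train, toy_sub, n_top=2)                  # like version 3
by_fixed = random_top_n_submission(toy_train, toy_sub, n_top=2, fixed_conf=0.1)  # like versions 4 and 5
print(by_freq)
print(by_fixed)
```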