# Querying Firestore at random

### Initialize

In [None]:
from google.cloud import firestore

import pandas as pd
import itertools
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
db = firestore.Client.from_service_account_json("../credentials/stairway-firestore-key.json")

## At random

Firestore has no built-in random query, so you have to implement a randomizer yourself. There are two ways this could be done:
1. If you know the distribution of your document ids, draw randomly from there and query documents based on their id directly.
2. Choose a numeric feature of your documents, draw a random number from its distribution and use it as a threshold to subset the data. Then sort the result and retrieve the top k documents (or flip the filter and sort descending if nothing is found).
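
The second approach can be illustrated without touching Firestore. The snippet below is a minimal in-memory sketch, assuming documents are plain dicts with a numeric field; `pick_k` and the `score` field are hypothetical names used only for illustration.

```python
def pick_k(docs, field, threshold, k):
    """Return up to k docs at or just above `threshold`, with a fallback.

    Mirrors a ">= threshold, order ascending, limit k" query. If nothing
    matches, falls back to "<= threshold, order descending, limit k".
    """
    above = sorted(
        (d for d in docs if d[field] >= threshold),
        key=lambda d: d[field],
    )
    if above:
        return above[:k]

    # Fallback: nothing at or above the threshold, so take the docs closest below it.
    below = sorted(
        (d for d in docs if d[field] <= threshold),
        key=lambda d: d[field],
        reverse=True,
    )
    return below[:k]


# Hypothetical sample data, not taken from the destinations file.
sample_docs = [
    {"id": "a", "score": 1.0},
    {"id": "b", "score": 2.5},
    {"id": "c", "score": 4.0},
    {"id": "d", "score": 5.5},
]
```

For instance, `pick_k(sample_docs, "score", 2.0, 2)` returns documents `b` and `c`, while a threshold of `9.0` triggers the fallback and returns `d` and `c`.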


### Implementation of random 1

Possible showstopper: can Firestore query multiple documents at once based on document id? If not, a for loop is required to fetch all destinations (=> multiple queries instead of one).

TODO: investigate!
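
One possible direction, offered as a sketch rather than a verified result: the Python client has a `Client.get_all(references)` method that accepts an iterable of document references and returns their snapshots in one batched call. The helper below assumes the full set of document ids is known up front (passed in as `all_ids`); `fetch_random_docs` is a hypothetical name. Note that `get_all` does not guarantee the snapshots come back in the order the references were given.

```python
import random


def fetch_random_docs(db, collection_name, all_ids, k, seed=None):
    """Fetch k randomly chosen documents by id in a single get_all call.

    `db` is expected to be a firestore.Client; `all_ids` is any iterable
    of known document ids (assumption: ids are known client-side).
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(all_ids), k)  # draw k distinct ids

    # Firestore document ids are strings, so convert numeric ids.
    collection = db.collection(collection_name)
    refs = [collection.document(str(doc_id)) for doc_id in chosen]

    # get_all yields snapshots lazily; materialize them into a list.
    return list(db.get_all(refs))
```

In this notebook the call could look like `fetch_random_docs(db, "destinations", dest.index, 3)`, once `dest` has been loaded and provided the document ids really match the destination ids.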

### Implementation of random 2

For this, we need to select a feature first and look at its distribution to start drawing random numbers from.

In [None]:
dest = pd.read_csv("../data/destinations.csv")
dest = dest.loc[dest['succes'] == 1]  # only use OK destination data
dest.index = dest['id']  # set Firestore document id equal to stairway destination id
dest.shape

For example, `osp_importance`:

In [None]:
feature = 'osp_importance'

# Fit a normal distribution to the data:
mu, std = norm.fit(dest[feature])

# Plot the histogram.
plt.hist(dest[feature], density=True, alpha=0.6)

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

Now we can randomly draw a number from this fitted distribution with:

In [None]:
np.random.normal(mu, std)
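
A normal draw is unbounded, so it can land below the smallest observed value, in which case a `<=` filter matches nothing. One option, sketched here with the standard library's `random.gauss` standing in for `np.random.normal`, is to clamp the draw to the observed range. The helper name and the clamping choice are assumptions, not something the notebook prescribes.

```python
import random


def draw_threshold(mu, std, lo, hi, rng=None):
    """Draw from Normal(mu, std) and clamp the result to [lo, hi]."""
    rng = rng or random.Random()
    value = rng.gauss(mu, std)
    # Clamp so that a range filter on the feature always has candidates.
    return min(max(value, lo), hi)
```

With the data above this could be called as `draw_threshold(mu, std, dest[feature].min(), dest[feature].max())`. Clamping puts extra probability mass on the two boundary values, which may or may not be acceptable.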

This means the final Firestore query would look as follows.

**Note:** to combine the equality operator (`==`) with a range or array-contains clause (`<`, `<=`, `>`, `>=` or `array_contains`), make sure to create a composite index (one for each `continent` + `osp_importance` combination). Also, you cannot apply range filters to two different fields in the same query. See the query docs [here](https://firebase.google.com/docs/firestore/query-data/queries).

In [None]:
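# Draw the random threshold once so it can be inspected or logged.
threshold = float(np.random.normal(mu, std))

docs = (
    db
    .collection("destinations")
    .where('EU', '==', 1)
    .where('osp_importance', '<=', threshold)
    .order_by('osp_importance', direction=firestore.Query.DESCENDING)
    .limit(3)
    .stream()
)

# Print the first two of the (at most three) matching documents.
for doc in itertools.islice(docs, 2):
    print('{} => {}'.format(doc.id, doc.to_dict()['name']))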

Next steps:
- The distribution of `osp_importance` probably varies a lot per continent. Maybe fit a separate distribution for each continent?
- The `<=` filter combined with descending order probably means the destinations with the highest `osp_importance` are rarely or never selected, even though these may be the ones you want to select most often. Think about whether this is desired.
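
The first next step could be sketched as below. This is a standard-library illustration, assuming each row carries a single grouping key such as `continent` (hypothetical; the query above filters on an `EU` flag instead). It uses the mean and the population standard deviation, which are the maximum-likelihood estimates that `norm.fit` returns for a normal distribution.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def fit_per_group(rows, group_key, feature):
    """Return {group: (mu, std)} with a normal fit per group."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group_key]].append(row[feature])

    # pstdev divides by n (not n - 1), matching the MLE of a normal fit.
    return {
        group: (fmean(values), pstdev(values))
        for group, values in values_by_group.items()
    }
```

With a suitable column in place, the DataFrame could be passed in as `fit_per_group(dest.to_dict("records"), "continent", "osp_importance")`, and the resulting `(mu, std)` pair for a continent used when drawing the threshold for that continent's query.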

Done.