# Querying Firestore at random

### Initialize

In [None]:
from google.cloud import firestore

import pandas as pd
import itertools
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
db = firestore.Client.from_service_account_json("../credentials/stairway-firestore-key.json")

## At random

Firestore doesn't have random query functionality, so you will have to implement a randomizer yourself. There are two ways this could be done:
1. If you know the distribution of your document ids, draw ids at random from it and query the documents directly by id.
2. Choose a numeric field of your documents, draw a random number from its distribution and use it as a threshold to subset the data. Then sort the result and retrieve the top k documents (or sort in the opposite direction if nothing is found).
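To make option 2 concrete, here is a minimal in-memory sketch of the idea on plain dicts, without touching Firestore. The function name `pick_random_top_k`, the toy scores and the choice of `<=` with a descending sort are assumptions for illustration; the fallback simply takes the lowest values when nothing falls at or below the threshold.

```python
import random


def pick_random_top_k(docs, feature, k, threshold):
    """Mimic option 2: filter on a random threshold, sort, take k documents."""
    # Documents at or below the threshold, highest feature value first.
    below = sorted(
        (d for d in docs if d[feature] <= threshold),
        key=lambda d: d[feature],
        reverse=True,
    )
    if below:
        return below[:k]
    # Nothing at or below the threshold: fall back to the k lowest values.
    return sorted(docs, key=lambda d: d[feature])[:k]


# Toy data; ids and scores are made up for illustration.
docs = [
    {"id": i, "osp_importance": score}
    for i, score in enumerate([0.1, 0.4, 0.5, 0.8, 0.9])
]
threshold = random.gauss(0.5, 0.2)  # stand-in for a draw from a fitted distribution
print(pick_random_top_k(docs, "osp_importance", 2, threshold))
```

In Firestore the filter and the sort would run server-side, so only k documents travel over the wire.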


### Implementation of random 1

Showstopper: can Firestore query multiple documents at once based on document id? If not, a for loop is required to fetch all destinations (=> multiple queries instead of one).

TODO: investigate!
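As a starting point for that investigation: the Python client exposes `Client.get_all`, which takes an iterable of document references and yields the snapshots in one batched call. Below is a rough sketch built on it. The helper name `fetch_random_by_id` and its parameters are hypothetical, and as far as I know the snapshots are not guaranteed to come back in the requested order.

```python
import random


def fetch_random_by_id(db, collection, known_ids, k, rng=random):
    """Draw k ids at random and fetch the matching documents in one call.

    `db` is assumed to be a google.cloud.firestore.Client (or anything with
    the same `collection(...).document(...)` and `get_all(...)` methods).
    """
    chosen = rng.sample(list(known_ids), k)
    refs = [db.collection(collection).document(str(doc_id)) for doc_id in chosen]
    # get_all yields one snapshot per reference; snapshots of documents that
    # do not exist have `exists == False`, so those are dropped here.
    return [snap for snap in db.get_all(refs) if snap.exists]
```

With the client from the Initialize section this could be called as `fetch_random_by_id(db, "destinations", dest.index, k=3)`, assuming the document ids equal the destination ids.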

### Implementation of random 2

For this, we need to select a feature first and look at its distribution to start drawing random numbers from.

In [None]:
dest = pd.read_csv("../data/destinations.csv")
dest = dest.loc[dest['succes'] == 1]  # only use OK destination data
dest.index = dest['id']  # set Firestore document id equal to stairway destination id
dest.shape

For example, `osp_importance`:

In [None]:
feature = 'osp_importance'

# Fit a normal distribution to the data:
mu, std = norm.fit(dest[feature])

# Plot the histogram.
plt.hist(dest[feature], density=True, alpha=0.6)

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

Now we can draw a random number from this distribution with:

In [None]:
np.random.normal(mu, std)

This means the final Firestore query would look as follows.

**Note:** to combine the equality operator (`==`) with a range or array-contains clause (`<`, `<=`, `>`, `>=` or `array_contains`), make sure to create a composite index (one for each `continent` field combined with `osp_importance`). Also, you cannot apply range filters to two different fields in the same query. See the query docs [here](https://firebase.google.com/docs/firestore/query-data/queries).

In [None]:
query = (
    db
    .collection("destinations")
    .where('EU', '==', 1)
    .where('osp_importance', '<=', np.random.normal(mu, std))
    .order_by('osp_importance', direction=firestore.Query.DESCENDING)
    .limit(3)
    .get()
)

for doc in itertools.islice(query, 2):
    print(u'{} => {}'.format(doc.id, doc.to_dict()['name']))

Next steps: 
- The distribution of `osp_importance` probably varies a lot per continent. Maybe fit a distribution like this for each continent?
- With the `<=` filter in combination with the descending sort, the destinations with the highest `osp_importance` are only selected when the random draw lands at or above their value, so they probably come up rarely even though you might want these to be selected the most. Think about whether this is desired.
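A per-continent fit could look roughly like the sketch below. It uses only the standard library: for a normal distribution the maximum-likelihood fit is the sample mean and the population standard deviation, which is what `norm.fit` returns as well. The `continent` key and the toy rows are assumptions; the real data may encode continents differently (for example as separate flag fields like `EU`).

```python
from collections import defaultdict
from statistics import fmean, pstdev


def fit_normal_per_group(rows, group_key, feature):
    """Return {group: (mu, std)} with a normal fit of `feature` per group."""
    values = defaultdict(list)
    for row in rows:
        values[row[group_key]].append(row[feature])
    # Mean and population std (ddof=0) are the maximum-likelihood estimates.
    return {group: (fmean(v), pstdev(v)) for group, v in values.items()}


# Toy rows; the values are made up for illustration.
rows = [
    {"continent": "EU", "osp_importance": 0.2},
    {"continent": "EU", "osp_importance": 0.4},
    {"continent": "AS", "osp_importance": 0.5},
]
print(fit_normal_per_group(rows, "continent", "osp_importance"))
```

The random threshold for a query would then be drawn from the `(mu, std)` pair of the continent being filtered on.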

## Ultimate query - assessing Firestore's fit

### Example Airbnb

It is quite possible that we eventually end up with a similar query as Airbnb makes:

https://www.airbnb.com/s/Barcelona--Spain/homes?refinement_paths%5B%5D=%2Fhomes
- &current_tab_id=home_tab
- &selected_tab_id=home_tab
- &metadata_only=false
- &version=1.7.0
- &items_per_grid=18
- &screen_size=large
- &map_toggle=true
- &search_type=unknown
- &hide_dates_and_guests_filters=false
- &ne_lat=43.97363914475397&ne_lng=5.173845810128569&sw_lat=38.69043481932856&sw_lng=-0.5720037992464313
- &zoom=7
- &search_by_map=true
- &checkin=2020-05-09
- &checkout=2020-05-10
- &adults=16
- &amenities%5B%5D=25&amenities%5B%5D=9
- &property_type_id%5B%5D=2&property_type_id%5B%5D=1
- &price_min=83&price_max=451

From here, we can distill some simple equality queries (`adults=16`), some range queries (`price_min=` and `price_max=`) and some array filters (`amenities[]=`, repeated once per value) that could possibly be brought down to simple select queries.
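To illustrate that split, the sketch below parses part of the query string with `urllib.parse.parse_qs` and sorts the parameters into equality, range and array filters. The function name `classify_params` and the suffix conventions (`[]` for arrays, `_min`/`_max` for ranges) are assumptions read off this one URL, not a documented Airbnb contract.

```python
from urllib.parse import parse_qs


def classify_params(query_string):
    """Split URL parameters into equality, range and array filters."""
    filters = {"equality": {}, "range": {}, "array": {}}
    for key, values in parse_qs(query_string).items():
        if key.endswith("[]"):
            # Repeated `name[]=` parameters form an array filter.
            filters["array"][key[:-2]] = values
        elif key.endswith(("_min", "_max")):
            # `price_min` / `price_max` become one range filter on `price`.
            field, bound = key.rsplit("_", 1)
            filters["range"].setdefault(field, {})[bound] = values[0]
        else:
            filters["equality"][key] = values[0]
    return filters


qs = "adults=16&amenities%5B%5D=25&amenities%5B%5D=9&price_min=83&price_max=451"
print(classify_params(qs))
```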

### Own criteria

Eventually we will need to be able to query on quite some criteria ourselves:

* geolocation: `ne_lat`, `ne_lng`, `sw_lat`, `sw_lng`
    - range query based on lat & lon?
    - or, filter query based on geohash? 
* price
    - range query based on monetary amount: `budget_min`, `budget_max`
    - or, [in](https://firebase.google.com/docs/firestore/query-data/queries#in_and_array-contains-any) query to check if the destination's budget type (like average) is one of those requested by the user (average, and thus also cheap): `budget_class`?
* period of travel
    - in combination with other features like **weather**,
    - range query of travel dates `checkin`, `checkout`,
    - filter on `month` or `period` => possibly even select multiple months
* weather
    - given a period or all months selected:
        - range query based on temperature: `temp_min`, `temp_max`
        - filter query based on weather type (sunny, rainy, cloudy, snow...) `temp_type`?
* activities
    - [array contains](https://firebase.google.com/docs/firestore/query-data/queries#array_membership) query, check if user requested activities are in destination activities array: `activities`
    - or multiple single filters (like airbnb): `activity`
* passport
* safety requirements

Most of these will be 'hard' filters that **must** be applied.

However, within this selection, we also want to do an **ordering** of the destinations that are most likely to be a fit with the end-user.

### Recommendation queries in old code

For recommendation queries we need to rank destinations based on some ordering. Based on the user profile and/or request, this ordering will be different and calculations will need to be done. 

For example in our old code, sorting is done by summing feature scores for all selected features by the end-user and sorting on that:

The query for recommendations in [`dao/bm25Recommendations.py`](https://github.com/Braamling/project_travel/blob/master/REST_API/project_travel/dao/bm25Recommendations.py):

```python
def query(self, columns, user_id, limit, offset):
    dbHelper = DbHelpers()

    order_by = dbHelper.build_feature_score_filter(columns)

    query = "SELECT ds.destination_id FROM (`destination_scores` ds" +\
            " INNER JOIN `destinations` d ON succes > 0 AND " +\
            "ds.destination_id = d.id) LEFT JOIN `wishlist` w " +\
            "ON d.id = w.destination_id AND w.user_id = %s " +\
            "WHERE w.destination_id is NULL ORDER BY " +\
            "( " + order_by + " ) DESC LIMIT %s OFFSET %s"
```

where the `order_by` constructs the combined feature score in [`dao/dbHelpers.py`](https://github.com/Braamling/project_travel/blob/master/REST_API/project_travel/dao/dbHelpers.py):

```python
def build_feature_score_filter(self, features):
    valid_columns = self.get_column_names('destination_scores')
    order_by = "( "
    for feature in features[:-1]:
        if feature in valid_columns:
            order_by += feature + " + "

    if features[-1] in valid_columns:
        order_by += features[-1] + " )"
    else:
        order_by += " )"

    return order_by
```

This demands a bit of flexibility from your system: the ordering expression is built at request time from whatever features the end-user selected.
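If this ranking had to move out of SQL and into the application, it could look roughly like the sketch below: sum the selected feature scores per destination, skip wishlisted destinations, sort descending and apply limit and offset. The function name `rank_destinations` and the shape of `scores` (destination id mapped to a dict of feature scores) are assumptions for illustration.

```python
def rank_destinations(scores, features, wishlist_ids, limit, offset=0):
    """Rank destinations by the summed score of the selected features."""
    candidates = (
        # Features missing for a destination count as 0.
        (dest_id, sum(feats.get(f, 0) for f in features))
        for dest_id, feats in scores.items()
        if dest_id not in wishlist_ids  # mirrors the wishlist exclusion in the SQL
    )
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [dest_id for dest_id, _ in ranked[offset:offset + limit]]


# Toy scores; ids and values are made up for illustration.
scores = {
    1: {"beach": 2, "hiking": 1},
    2: {"beach": 3, "hiking": 3},
    3: {"beach": 5, "hiking": 5},
}
print(rank_destinations(scores, ["beach", "hiking"], wishlist_ids={3}, limit=2))
```

Note that this needs every candidate's scores in memory before it can sort, whereas the SQL version only returns the requested page.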

### Get requests in old code

Given that we have applied all of the above filters and got a destination id, this is how we used to retrieve the destination info: 

Joining and retrieving all info for one destination in [`classes/destination.py`](https://github.com/Braamling/project_travel/blob/master/REST_API/project_travel/classes/destination.py):

```python
# Join and recieve all information about a destination.
    query = "SELECT d.*, r.*, t.*, b.* FROM `destinations` d INNER JOIN" +\
            " `rain` r ON d.id = %s AND d.temperature_id = " +\
            " r.id INNER JOIN `temperatures` t ON d.temperature_id = t.id" +\
            " INNER JOIN `attributes` a ON a.id = d.attr_id " +\
            " INNER JOIN `budget` b on b.id = d.budget_id"

    cur.execute(query, (destination_id))
```

It gets a bit more complicated if you want to take into account which destinations the user has already looked at. For example, this is how we got visited destinations in [`dao/destinations.py`](https://github.com/Braamling/project_travel/blob/master/REST_API/project_travel/dao/destinations.py):

```python
def get_visited(self, user_id):
    cur = self.get_cursor()

    query = "SELECT d.id, d.name, d.country_name, d.longitude, d.latitude FROM `visited` v INNER JOIN " +\
            " `users` u ON u.id = %s AND u.id = v.user_id INNER JOIN " +\
            "`destinations` d ON d.id = v.destination_id WHERE v.status = 'VISITED'"

    cur.execute(query, (user_id))
```

This is obviously not the hardest part. But it will depend on where you store the user feedback.

### Conclusion: fit with Firestore

Firestore has a couple of limitations with regard to querying:
- You can use only one `in` or `array-contains-any` clause per query. You can't use both `in` and `array-contains-any` in the same query.
- You can combine `array-contains` with `in` but not with `array-contains-any`.
- You can only perform range comparisons (`<`, `<=`, `>`, `>=`) on a single field, and you can include at most one `array-contains` or `array-contains-any` clause in a compound query.
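To check a planned query against the limitations as listed here, a small validator can help. The sketch below encodes only those rules; the function name `check_filters` and the `(field, op, value)` tuple format are assumptions, and the operator strings follow the Python client's spelling (`array_contains`, `array_contains_any`).

```python
RANGE_OPS = {"<", "<=", ">", ">="}


def check_filters(filters):
    """Return a list of rule violations for a list of (field, op, value)."""
    ops = [op for _, op, _ in filters]
    n_in = ops.count("in")
    n_any = ops.count("array_contains_any")
    n_contains = ops.count("array_contains")
    problems = []
    if n_in + n_any > 1:
        problems.append("more than one `in` / `array_contains_any` clause")
    if n_contains + n_any > 1:
        problems.append("more than one `array_contains` / `array_contains_any` clause")
    # Range comparisons are only allowed on a single field.
    range_fields = {field for field, op, _ in filters if op in RANGE_OPS}
    if len(range_fields) > 1:
        problems.append("range comparisons on more than one field")
    return problems


# A range on both price and temperature breaks the single-field rule.
print(check_filters([("budget", ">=", 83), ("temp", "<=", 25)]))
```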

This means that if we want to continue with Firestore, we need to:
* Use **geohashing** for the 'hard' filtering based on location
* Assess whether retrieving all destinations that pass the 'hard' filter and calculating the smart stuff in the Flask app is good enough in terms of performance.
    - Compared to Bram's SQL code, which works with an offset, you will be retrieving significantly more data from the database each time Flask calls the DB. That can make it slow and expensive.
* Think about how to save user feedback (watched/likes/dislikes) and combine it in recommendation and bucketlist queries
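For the geohashing item, here is a minimal sketch of the standard geohash encoding: longitude and latitude bits are interleaved (longitude first) and every 5 bits become one base32 character. The function name `geohash_encode` and the default precision are my own choices; in practice an existing library would probably be preferable.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        bounds, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (bounds[0] + bounds[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            bounds[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            bounds[1] = mid  # keep the lower half of the interval
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(41.39, 2.17))  # roughly Barcelona
```

Nearby points share a prefix, so a bounding-box filter turns into a prefix match on one string field. Keep in mind that a prefix match is itself a range filter, so it would use up the single field on which range comparisons are allowed.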

Conclusion: it is probably wise to abandon Firestore and think of an alternative.

Done.