# Time travel
Using the rank cluster, we can run our collection of real search terms against the new mapping with the new query structure. We can then analyse the results according to the same set of metrics as we used for the data which was collected in real time. In other words, we can look at how search _would have_ performed if we had made these changes earlier. It's a time-travelling A/B test.

In [None]:
import os
import json
from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from scipy import stats

## Getting queries
Because the queries are written and tested in TypeScript, we need to export a JSON version of them before they can be used in these Python notebooks. Running `yarn getQueryJSON <query_name>` generates a `.json` version of the query alongside the `.ts` original.

We can then import the query as follows:

In [None]:
query_name = "works-with-search-fields"

In [None]:
with open(f"data/queries/{query_name}.json", "r") as f:
    query = json.load(f)

We can now open a connection to our rank cluster and run our query against it.

In [None]:
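# Drop the first and last character of each environment variable
# (presumably the quotes wrapped around the stored value).
secret = lambda name: os.environ[name][1:-1]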

es = Elasticsearch(
    cloud_id=secret("ES_RANK_CLOUD_ID"),
    http_auth=(secret("ES_RANK_USER"), secret("ES_RANK_PASSWORD")),
)

es.indices.exists(index=query_name)

In [None]:
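def format_query(search_term):
    # Strip quotes, then JSON-escape the term so that backslashes and control
    # characters cannot break the template when it is parsed back into a dict.
    cleaned = search_term.replace("'", "").replace('"', "")
    escaped = json.dumps(cleaned)[1:-1]
    return {"query": json.loads(json.dumps(query).replace("{{query}}", escaped))}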
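
To illustrate the substitution, here is a minimal, self-contained sketch. The `template` below is a hypothetical stand-in for the exported query JSON (the real query structure is not shown in this notebook), and `fill_template` is an illustrative name. It escapes the term with `json.dumps` before substituting, so a stray backslash cannot produce invalid JSON.

```python
import json

# Hypothetical template standing in for the exported query JSON.
template = {
    "multi_match": {
        "query": "{{query}}",
        "fields": ["title", "description"],
    }
}


def fill_template(template, search_term):
    # Strip single and double quotes from the raw term.
    cleaned = search_term.replace("'", "").replace('"', "")
    # json.dumps returns a quoted, escaped JSON string; drop the outer quotes.
    escaped = json.dumps(cleaned)[1:-1]
    # Substitute into the serialised template, then parse back into a dict.
    filled = json.dumps(template).replace("{{query}}", escaped)
    return {"query": json.loads(filled)}


example = fill_template(template, 'darwin\'s "origin" \\ species')
print(example["query"]["multi_match"]["query"])
```

The quotes are removed and the backslash survives the round trip through JSON intact.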

In [None]:
df = pd.read_csv("./searches.csv")

In [None]:
terms = df["search_terms"].unique()

In [None]:
n = 5000

In [None]:
result_totals = []

In [None]:
for term in tqdm(terms[:n]):
    try:
        response = es.search(index=query_name, body=format_query(term))
        result_totals.append(response["hits"]["total"]["value"])
    except Exception:
        # Skip terms that fail, but let KeyboardInterrupt stop the loop
        continue

In [None]:
pd.Series(result_totals).hist(bins=200);

In [None]:
count_2, division_2 = np.histogram(pd.Series(result_totals), bins=500)

Elastic limits the number of `totalResults`, which leads to a spike at 10,000 (the maximum value). Instead of trying to fit an exponential to that oddly shaped data, we just crop out the last bin from the histogram and fit to the data within the reliable range.
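
The effect of the cap, and of cropping it away, can be sketched on synthetic data. The numbers below (an exponential with scale 3,000, capped at 10,000) are assumptions chosen for illustration, not values taken from the real search logs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic totals: exponentially distributed, then capped at 10,000
# to mimic the upper limit on reported totals.
cap = 10_000
totals = np.minimum(rng.exponential(scale=3_000, size=5_000), cap)

counts, edges = np.histogram(totals, bins=500, range=(0, cap))
left_edges = edges[:-1]  # one left edge per bin

# Every capped value lands in the final bin, which produces the spike.
spike = counts[-1]

# Drop the final bin so that only the reliable range remains.
cropped_counts = counts[:-1]
cropped_left_edges = left_edges[:-1]

print(spike, cropped_counts.max())
```

In this sketch the final bin holds far more observations than any other bin, which is why it is removed before fitting.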

In [None]:
count_1, division_1 = np.histogram(df["n_results"], bins=division_2)

In [None]:
simple_result_totals = []
for term in tqdm(terms[:n]):
    try:
        response = es.search(
            index=query_name,
            body={
                "query": {
                    "simple_query_string": {
                        "query": term,
                        "fields": ["*"],
                        "default_operator": "or",
                    }
                }
            },
        )
        simple_result_totals.append(response["hits"]["total"]["value"])
    except Exception:
        # Skip terms that fail, but let KeyboardInterrupt stop the loop
        continue

In [None]:
count_3, division_3 = np.histogram(pd.Series(simple_result_totals), bins=division_2)

In [None]:
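data = pd.DataFrame()
# Index each count by its bin's left edge, then keep bins starting at or below
# 9,900 to crop the spike at the cap. `.loc` makes the label-based slice explicit.
data["old"] = pd.Series(count_1, index=division_1[:-1]).loc[:9900]
data["new"] = pd.Series(count_2, index=division_2[:-1]).loc[:9900]
data["oldest"] = pd.Series(count_3, index=division_3[:-1]).loc[:9900]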

In [None]:
data

In [None]:
data.to_csv("counts.csv")

In [None]:
from sklearn.preprocessing import MaxAbsScaler

In [None]:
data[["old", "new", "oldest"]] = MaxAbsScaler().fit_transform(data)

In [None]:
data

In [None]:
old_fit = stats.expon.fit(data["old"])
new_fit = stats.expon.fit(data["new"])
oldest_fit = stats.expon.fit(data["oldest"])

old_fit, new_fit, oldest_fit
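
Each fit is a `(loc, scale)` tuple: `loc` is the fitted lower bound of the distribution and `scale` is its mean above that bound (the reciprocal of the rate). As a sanity check on how to read those numbers, the sketch below fits synthetic samples drawn with a known scale; the sample size and scale are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw samples from an exponential distribution with a known scale of 2.0.
true_scale = 2.0
samples = rng.exponential(scale=true_scale, size=10_000)

# Maximum-likelihood fit; returns (loc, scale).
loc, scale = stats.expon.fit(samples)

print(loc, scale)
```

The recovered `scale` should sit close to the true value and `loc` close to zero, so a larger fitted `scale` indicates a distribution that decays more slowly.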

In [None]:
a = data.plot()
a.set_xlim(0, 750)