New reporting sources #3110

harrisonpim · 2018-11-30T15:22:14Z

What is this PR trying to achieve?

Adds further reporting channels from source data → kibana:

- sierra
- calm

Switches to the new-format miro VHS.

Who is this change for?

📄 📊

mklander · 2018-12-04T15:36:30Z

reporting/reporting_calm_transformer/src/reporting_calm_transformer.py

+es_password = ""
+es_url = ""
+path_to_calm_json = ""
+


maybe take these from the command line or load from a local . file?
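A minimal sketch of the command-line option, with environment variables as a fallback. The flag names and the `parse_settings` helper are hypothetical; the `ES_URL`, `ES_USER` and `ES_PASS` variable names are borrowed from `reporting/terraform/main.tf` in this PR, and it is an assumption that the Calm script would want the same ones.

```python
import argparse
import os


def parse_settings(argv=None, environ=None):
    """Read connection settings from flags, falling back to environment variables."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Send Calm data to Elasticsearch")
    # Flag names are hypothetical; env var names mirror reporting/terraform/main.tf.
    parser.add_argument("--es-url", default=environ.get("ES_URL"))
    parser.add_argument("--es-user", default=environ.get("ES_USER"))
    parser.add_argument("--es-pass", default=environ.get("ES_PASS"))
    parser.add_argument("--calm-json", required=True, help="path to the Calm JSON dump")
    return parser.parse_args(argv)
```

The script could then call `settings = parse_settings()` once at startup, so nothing secret has to be committed as a module-level constant.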

mklander · 2018-12-04T15:37:40Z

reporting/reporting_calm_transformer/src/test_date_transforms.py

+def test_cleans_simple_date():
+    test_dict = {"UserDate1": "01/01/1970"}
+    assert transform(test_dict) == {"UserDate1": "1970-01-01"}
+


make the month and day different so this tests the formatting is as you expect.
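To illustrate why: a rough sketch using `datetime.strptime` with explicit formats (not the `dateutil` parser the transformer uses; the `to_iso` helper is hypothetical). When day and month are equal, a swapped format cannot be detected.

```python
from datetime import datetime


def to_iso(date_string, fmt):
    """Parse date_string with an explicit format and return an ISO 8601 date."""
    return datetime.strptime(date_string, fmt).date().isoformat()


# With day == month, both readings agree, so a swapped format goes unnoticed.
assert to_iso("01/01/1970", "%d/%m/%Y") == to_iso("01/01/1970", "%m/%d/%Y")

# With day != month, the two readings differ, so the test pins one of them down.
assert to_iso("02/01/1970", "%d/%m/%Y") == "1970-01-02"
assert to_iso("02/01/1970", "%m/%d/%Y") == "1970-02-01"
```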

mklander · 2018-12-04T15:39:34Z

reporting/reporting_calm_transformer/src/test_transform.py

+        "single_element_list": ["single list item"],
+        "nan_field": math.nan,
+    }
+


maybe clearer to put these cases in the respective tests.

mklander · 2018-12-04T15:47:34Z

reporting/terraform/main.tf

+    ES_URL      = "${var.reporting_es_url}"
+    ES_USER     = "${var.reporting_es_user}"
+    ES_PASS     = "${var.reporting_es_pass}"
+    ES_INDEX    = "sierra"


I think we should move this to a more specific index and include the creation date to avoid confusion and name clashes.

alexwlchan

Quick skim between builds. Will review properly later.

alexwlchan · 2018-12-04T15:52:17Z

reporting/reporting_calm_transformer/src/reporting_calm_transformer.py

@@ -0,0 +1,33 @@
+"""
+Basic transformer, which cleans up the static calm data before sending it off
+to an elasticsearch indesx. The raw data can be downloaded by running:


Suggested change

-to an elasticsearch indesx. The raw data can be downloaded by running:
+to an Elasticsearch index. The raw data can be downloaded by running:

alexwlchan · 2018-12-04T15:53:16Z

reporting/reporting_calm_transformer/src/reporting_calm_transformer.py

+    python monitoring/scripts/download_oai_harvest.py
+
+from the root of this repo. The `path_to_calm_json` parameter below should be
+set to the relative path to that file from this directory. The `es_username`,


I'm not sure I understand what this parameter should be. It's probably also not necessary – you can work it out.

# Path to the root of the Git repo
subprocess.check_output(["git", "rev-parse", "--show-toplevel"]).strip().decode("utf8")

# Path to the current file
os.path.abspath(__file__)

alexwlchan · 2018-12-04T15:54:52Z

reporting/reporting_calm_transformer/src/reporting_calm_transformer.py

+df = pd.read_json(path_to_calm_json)
+es = Elasticsearch(es_url, http_auth=(es_username, es_password))
+
+for idx in tqdm(range(len(df))):


Would it be neater to iterate over the records directly? A quick Google suggests something like:

for _, record in tqdm(df.iterrows(), total=len(df)):

alexwlchan · 2018-12-04T15:55:30Z

reporting/reporting_calm_transformer/src/test_date_transforms.py

+
+def test_handles_badly_formatted_date():
+    test_dict = {"UserDate1": "a badly formatted date"}
+    assert transform(test_dict) == {"UserDate1": None}


Do you want to be throwing this data away? Some of the Calm dates are super funky – maybe stash the raw string in UserDate1.raw or something?

alexwlchan · 2018-12-04T15:56:18Z

reporting/reporting_calm_transformer/src/transform.py

+
+def convert_date_to_iso(date_string):
+    try:
+        return parse(date_string).date().isoformat()


You should check if the Calm date ever uses "day-first" like in the UK in a way that could be ambiguous.

See https://notebook.alexwlchan.net/2018/10/beware-ambiguous-dates-with-dateutil-parse/
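One possible way to check the data for this, as a sketch only: count how many values read as two different valid dates. This assumes the Calm dates are slash-separated strings like the ones in the tests, and uses `datetime.strptime` rather than `dateutil`; `is_ambiguous` is a hypothetical helper.

```python
from datetime import datetime


def is_ambiguous(date_string):
    """Return True if a slash-separated date reads as two different valid dates."""
    readings = set()
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            readings.add(datetime.strptime(date_string, fmt).date())
        except ValueError:
            # Not a valid date under this reading (e.g. month 13).
            pass
    return len(readings) > 1
```

Running this over the raw values before indexing would show whether day-first ambiguity is a real problem in the data or only a theoretical one.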

alexwlchan · 2018-12-04T15:56:32Z

reporting/reporting_calm_transformer/src/transform.py

+    return transformed_record
+
+
+keys_to_parse = {


👍 for set

alexwlchan · 2018-12-04T15:56:57Z

reporting/reporting_calm_transformer/src/transform.py

+            if new_value.startswith("'"):
+                new_value = new_value[1:]
+            if new_value.endswith("'"):
+                new_value = new_value[:-1]


What if a value is only partially quoted? Should that still be stripped?
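If partially quoted values should be left alone, a sketch along these lines might do it. The `strip_matching_quotes` name and the `quote` parameter are hypothetical, not something in this PR.

```python
def strip_matching_quotes(value, quote="'"):
    """Strip one leading and one trailing quote, but only if both are present."""
    if len(value) >= 2 and value.startswith(quote) and value.endswith(quote):
        return value[1:-1]
    return value
```

The length check keeps a lone `'` from being treated as both the opening and the closing quote.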

alexwlchan

A few more comments inline.

alexwlchan · 2018-12-06T13:57:10Z

reporting/reporting_calm_transformer/src/reporting_calm_transformer.py

+
+
+path_to_es_credentials = (
+    os.path.dirname(os.path.realpath(__file__)) + "/es_credentials.json"


tiny nit: In Python, it's generally preferable to join paths with os.path.join, which is more portable across different operating systems.
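Something like this, as a sketch (`path_next_to` is a hypothetical helper name):

```python
import os


def path_next_to(current_file, filename):
    """Build the path to `filename` in the same directory as `current_file`."""
    directory = os.path.dirname(os.path.realpath(current_file))
    return os.path.join(directory, filename)
```

In the transformer this would be called as `path_next_to(__file__, "es_credentials.json")`.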

alexwlchan · 2018-12-06T13:57:52Z

reporting/reporting_calm_transformer/src/test_date_transforms.py

+def test_handles_badly_formatted_date():
+    test_dict = {"UserDate1": "a badly formatted date"}
+    assert transform(test_dict) == {
+        "UserDate1_raw": "a badly formatted date",


alexwlchan · 2018-12-06T13:58:25Z

reporting/reporting_calm_transformer/src/test_transform.py

+    raw_data = {
+        "quote_before": "some text",
+        "quote_after": "some text",
+        "quote_both": "some text",


Should there be quotes in some of these strings?

alexwlchan · 2018-12-06T13:58:54Z

reporting/reporting_calm_transformer/src/transform.py

+            new_value = record[key][0]
+
+        if isinstance(new_value, str):
+            if new_value.startswith("'") and new_value.endswith("'"):


Do you care about double quotes?

haven't seen any instances of ugly doubles, so leaving them be for now

alexwlchan · 2018-12-06T13:59:24Z

reporting/reporting_miro_transformer/src/test_date_transforms.py

+
+def test_cleans_simple_date():
+    test_dict = {"all_amendment_date": "01/01/1970"}
+    assert transform(test_dict) == {"all_amendment_date": "1970-01-01"}


Ditto Mark's comment above – use different day/month.

Could some of this date parsing stuff be in a common lib?

Possibly, although I'm not sure it's worth it at this point. Will keep in mind though.

alexwlchan · 2018-12-06T14:00:15Z

reporting/reporting_miro_transformer/src/transform.py

@@ -40,7 +38,7 @@ def convert_date_to_iso(date_string):
    try:
        return parse(date_string).date().isoformat()
    except (ValueError, TypeError):
-        return date_string
+        return None


minor: Good Python practice is to throw an exception rather than return sentinel values, so you're not relying on the caller to check if foo is None
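A rough sketch of what that could look like. Assumptions: the `UnparseableDate` exception and `transform_date_field` helper are hypothetical, the `_raw` suffix follows the Calm test in this PR, and a fixed `%d/%m/%Y` format with `datetime.strptime` stands in for the `dateutil` parser.

```python
from datetime import datetime


class UnparseableDate(ValueError):
    """Raised when a date string cannot be parsed."""


def convert_date_to_iso(date_string):
    """Return the date in ISO 8601 form, or raise UnparseableDate."""
    try:
        return datetime.strptime(date_string, "%d/%m/%Y").date().isoformat()
    except (ValueError, TypeError) as err:
        raise UnparseableDate(repr(date_string)) from err


def transform_date_field(record, key):
    """Return a copy of record with key in ISO form, keeping the raw value on failure."""
    transformed = dict(record)
    try:
        transformed[key] = convert_date_to_iso(record[key])
    except UnparseableDate:
        # The caller decides what a failure means; here the raw string is stashed.
        transformed[key] = None
        transformed[key + "_raw"] = record[key]
    return transformed
```

The parsing function never returns a sentinel, so the decision about what to store for a bad date lives in one visible place in the caller.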

alexwlchan · 2018-12-06T14:01:06Z

reporting/reporting_sierra_transformer/src/test_transform.py

+        "empty_string": "",
+    }
+
+    if full:


Rather than an ambiguously named full parameter, how about splitting this into two methods?

def build_bib_record()
def build_sierra_transformable()
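As a sketch of the shape only: the field names and values below are made-up placeholders, not the real Sierra structures, and the `bib_record` wrapper key is an assumption for illustration.

```python
def build_bib_record():
    """Return a small, made-up bib record for use in tests."""
    # Placeholder fields; the real test data in this PR is richer.
    return {"title": "an example title", "empty_string": ""}


def build_sierra_transformable():
    """Return a made-up outer object that wraps the bib record."""
    # The "bib_record" key is hypothetical, chosen only to show the nesting.
    return {"bib_record": build_bib_record()}
```

Each test then asks for exactly the fixture it needs, and no call site has to explain what `full=True` means.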

alexwlchan · 2018-12-06T14:01:20Z

reporting/reporting_sierra_transformer/src/test_transform.py

+            {"a": "b", "x": "y"},
+            {"a": "c", "x": "z"},
+        ],
+        "orders_date": ["1970-01-01T00:00:00"],


As before, different day/month.

alexwlchan · 2018-12-06T14:02:58Z

reporting/reporting_sierra_transformer/src/transform.py

+    """
+    try:
+        data_to_unpack = deepcopy(view_record[field_name])
+        del edit_record[field_name]


I think these are the only two lines that can throw a KeyError? In general it's good practice to keep the body of your try small, so you don't catch and lose unexpected errors.
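For example, a sketch using `try`/`except`/`else` (the `pop_field` name and the `None` return on a missing field are assumptions, since the rest of the function is not shown here):

```python
from copy import deepcopy


def pop_field(view_record, edit_record, field_name):
    """Remove field_name from edit_record and return a copy of its value.

    Returns None if the field is missing from either record.
    """
    try:
        # Only the two lookups that can raise KeyError live inside the try.
        data_to_unpack = deepcopy(view_record[field_name])
        del edit_record[field_name]
    except KeyError:
        return None
    else:
        # Anything that fails from here on propagates to the caller.
        return data_to_unpack
```

Any further unpacking would go in the `else` branch or after the block, where an unexpected `KeyError` is no longer swallowed.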

harrisonpim added 10 commits November 27, 2018 12:09

switch miro index to read from new, complete VHS table

4587079

switch reindexer back to main miro table

0288972

remove references to data field - source data is handed to us directly now

9d9d5ff

add reporting_ prefix to lambdas

485d889

Merge branch 'master' of github.com:wellcometrust/platform into reporting_switch_to_miro_complete

c83e6af

add basic calm transformer

9ae397a

merge sierra transformer

c0a49bf

add calm reporting tests

53a968c

Merge branch 'master' of github.com:wellcometrust/platform into reporting_switch_to_miro_complete

fda1753

move sierra to reporting_sierra

b448519

harrisonpim self-assigned this Nov 30, 2018

harrisonpim requested a review from mklander November 30, 2018 15:22

Apply auto-formatting rules

0395307

harrisonpim changed the title ~~[WIP] Reporting switch to miro complete~~ [WIP] New reporting sources Nov 30, 2018

harrisonpim added 6 commits November 30, 2018 16:38

bump aws_utils to 2.2.1

40cb782

Merge branch 'reporting_switch_to_miro_complete' of github.com:wellcometrust/platform into reporting_switch_to_miro_complete

3b311a7

add reporting_ prefix to post_to_slack lambda

1f11062

Merge branch 'master' of github.com:wellcometrust/platform into reporting_switch_to_miro_complete

e1b85ab

rename reporting_sierra_transformer.py

1214b86

Merge branch 'master' of github.com:wellcometrust/platform into reporting_switch_to_miro_complete

260738a

harrisonpim changed the title ~~[WIP] New reporting sources~~ New reporting sources Dec 4, 2018

harrisonpim mentioned this pull request Dec 4, 2018

Reporting pipeline sierra #3028

Closed

mklander reviewed Dec 4, 2018

alexwlchan reviewed Dec 4, 2018

harrisonpim and others added 3 commits December 4, 2018 17:41

tidy up calm transformer

1070143

break instead of stopiteration

a958601

Apply auto-formatting rules

0d6733b

alexwlchan reviewed Dec 6, 2018

harrisonpim added 2 commits December 6, 2018 15:17

minor sierra updates

38d1be2

minor miro updates

1072b9b

harrisonpim and others added 3 commits December 6, 2018 15:18

minor calm updates

94d5072

Merge branch 'reporting_switch_to_miro_complete' of github.com:wellcometrust/platform into reporting_switch_to_miro_complete

909a141

Apply auto-formatting rules

cf889d5

mklander approved these changes Dec 7, 2018

harrisonpim merged commit bf7cc38 into master Dec 7, 2018

harrisonpim deleted the reporting_switch_to_miro_complete branch December 7, 2018 12:46
