Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New reporting sources #3110

Merged
merged 25 commits into from
Dec 7, 2018
Merged

New reporting sources #3110

merged 25 commits into from
Dec 7, 2018

Conversation

harrisonpim
Copy link
Contributor

What is this PR trying to achieve?

adds further reporting channels from source data → kibana

  • sierra
  • calm
  • switches to the new-format miro VHS

Who is this change for?

📄 📊

@harrisonpim harrisonpim self-assigned this Nov 30, 2018
@harrisonpim harrisonpim changed the title [WIP] Reporting switch to miro complete [WIP] New reporting sources Nov 30, 2018
@harrisonpim harrisonpim changed the title [WIP] New reporting sources New reporting sources Dec 4, 2018
es_password = ""
es_url = ""
path_to_calm_json = ""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe take these from the command line or load from a local . file?

def test_cleans_simple_date():
test_dict = {"UserDate1": "01/01/1970"}
assert transform(test_dict) == {"UserDate1": "1970-01-01"}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the month and day different so this tests the formatting is as you expect.

"single_element_list": ["single list item"],
"nan_field": math.nan,
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe clearer to put these cases in the respective tests.

ES_URL = "${var.reporting_es_url}"
ES_USER = "${var.reporting_es_user}"
ES_PASS = "${var.reporting_es_pass}"
ES_INDEX = "sierra"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move this to a more specific index and include the creation data to avoid confusion and name clashes.

Copy link
Contributor

@alexwlchan alexwlchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick skim between builds. Will review properly later.

@@ -0,0 +1,33 @@
"""
Basic transformer, which cleans up the static calm data before sending it off
to an elasticsearch indesx. The raw data can be downloaded by running:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to an elasticsearch indesx. The raw data can be downloaded by running:
to an Elasticsearch index. The raw data can be downloaded by running:

python monitoring/scripts/download_oai_harvest.py

from the root of this repo. The `path_to_calm_json` parameter below should be
set to the relative path to that file from this directory. The `es_username`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this parameter should be. It's probably also not necessary – you can work it out.

# Path to the root of the Git repo
subprocess.check_call([
    "git", "rev-parse", "--show-toplevel"]).strip().decode("utf8")

# Path to the current file
os.path.abspath(__file__)

df = pd.read_json(path_to_calm_json)
es = Elasticsearch(es_url, http_auth=(es_username, es_password))

for idx in tqdm(range(len(df))):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be neater to iterate over the records directly? A quick Google suggests something like:

for record in tqdm(df.iterrows(), total=len(df))


def test_handles_badly_formatted_date():
test_dict = {"UserDate1": "a badly formatted date"}
assert transform(test_dict) == {"UserDate1": None}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to be throwing this data away? Some of the Calm dates are super funky – maybe stash the raw string in UserDate1.raw or something?


def convert_date_to_iso(date_string):
try:
return parse(date_string).date().isoformat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should check if the Calm date ever uses "day-first" like in the UK in a way that could be ambiguous.

See https://notebook.alexwlchan.net/2018/10/beware-ambiguous-dates-with-dateutil-parse/

return transformed_record


keys_to_parse = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for set

if new_value.startswith("'"):
new_value = new_value[1:]
if new_value.endswith("'"):
new_value = new_value[:-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a value is only partially quoted? Should that still be stripped?

Copy link
Contributor

@alexwlchan alexwlchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments inline.



path_to_es_credentials = (
os.path.dirname(os.path.realpath(__file__)) + "/es_credentials.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny nit: In Python, it's generally preferable to join paths with os.path.join, which is more portable across different operating systems.

def test_handles_badly_formatted_date():
test_dict = {"UserDate1": "a badly formatted date"}
assert transform(test_dict) == {
"UserDate1_raw": "a badly formatted date",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

raw_data = {
"quote_before": "some text",
"quote_after": "some text",
"quote_both": "some text",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be quotes in some of these strings?

new_value = record[key][0]

if isinstance(new_value, str):
if new_value.startswith("'") and new_value.endswith("'"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you are about double quotes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't seen any instances of ugly doubles, so leaving them be for now


def test_cleans_simple_date():
test_dict = {"all_amendment_date": "01/01/1970"}
assert transform(test_dict) == {"all_amendment_date": "1970-01-01"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto Mark's comment above – use different day/month.

Could some of this date parsing stuff be in a common lib?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, although i'm not sure it's worth it at this point. Will keep in mind though

@@ -40,7 +38,7 @@ def convert_date_to_iso(date_string):
try:
return parse(date_string).date().isoformat()
except (ValueError, TypeError):
return date_string
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Good Python practice is to throw an exception rather than return sentinel values, so you're not relying on the caller to check if foo is None

"empty_string": "",
}

if full:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than an ambiguously named full parameter, how about splitting this into two methods?

def build_bib_record()

def build_sierra_transformable()

{"a": "b", "x": "y"},
{"a": "c", "x": "z"},
],
"orders_date": ["1970-01-01T00:00:00"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before, different day/month.

"""
try:
data_to_unpack = deepcopy(view_record[field_name])
del edit_record[field_name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are the only two lines that can throw a KeyError? In general it's good practice to keep the body of your try small, so you don't catch and lose unexpected errors.

@harrisonpim harrisonpim merged commit bf7cc38 into master Dec 7, 2018
@harrisonpim harrisonpim deleted the reporting_switch_to_miro_complete branch December 7, 2018 12:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants