-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New reporting sources #3110
New reporting sources #3110
Conversation
…ting_switch_to_miro_complete
…ting_switch_to_miro_complete
…metrust/platform into reporting_switch_to_miro_complete
…ting_switch_to_miro_complete
…ting_switch_to_miro_complete
es_password = "" | ||
es_url = "" | ||
path_to_calm_json = "" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe take these from the command line or load from a local . file?
def test_cleans_simple_date(): | ||
test_dict = {"UserDate1": "01/01/1970"} | ||
assert transform(test_dict) == {"UserDate1": "1970-01-01"} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make the month and day different so this tests the formatting is as you expect.
"single_element_list": ["single list item"], | ||
"nan_field": math.nan, | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe clearer to put these cases in the respective tests.
ES_URL = "${var.reporting_es_url}" | ||
ES_USER = "${var.reporting_es_user}" | ||
ES_PASS = "${var.reporting_es_pass}" | ||
ES_INDEX = "sierra" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move this to a more specific index and include the creation data to avoid confusion and name clashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick skim between builds. Will review properly later.
@@ -0,0 +1,33 @@ | |||
""" | |||
Basic transformer, which cleans up the static calm data before sending it off | |||
to an elasticsearch indesx. The raw data can be downloaded by running: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to an elasticsearch indesx. The raw data can be downloaded by running: | |
to an Elasticsearch index. The raw data can be downloaded by running: |
python monitoring/scripts/download_oai_harvest.py | ||
|
||
from the root of this repo. The `path_to_calm_json` parameter below should be | ||
set to the relative path to that file from this directory. The `es_username`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what this parameter should be. It's probably also not necessary – you can work it out.
# Path to the root of the Git repo
subprocess.check_call([
"git", "rev-parse", "--show-toplevel"]).strip().decode("utf8")
# Path to the current file
os.path.abspath(__file__)
df = pd.read_json(path_to_calm_json) | ||
es = Elasticsearch(es_url, http_auth=(es_username, es_password)) | ||
|
||
for idx in tqdm(range(len(df))): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be neater to iterate over the records directly? A quick Google suggests something like:
for record in tqdm(df.iterrows(), total=len(df))
|
||
def test_handles_badly_formatted_date(): | ||
test_dict = {"UserDate1": "a badly formatted date"} | ||
assert transform(test_dict) == {"UserDate1": None} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to be throwing this data away? Some of the Calm dates are super funky – maybe stash the raw string in UserDate1.raw
or something?
|
||
def convert_date_to_iso(date_string): | ||
try: | ||
return parse(date_string).date().isoformat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should check if the Calm date ever uses "day-first" like in the UK in a way that could be ambiguous.
See https://notebook.alexwlchan.net/2018/10/beware-ambiguous-dates-with-dateutil-parse/
return transformed_record | ||
|
||
|
||
keys_to_parse = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for set
if new_value.startswith("'"): | ||
new_value = new_value[1:] | ||
if new_value.endswith("'"): | ||
new_value = new_value[:-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if a value is only partially quoted? Should that still be stripped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments inline.
|
||
|
||
path_to_es_credentials = ( | ||
os.path.dirname(os.path.realpath(__file__)) + "/es_credentials.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nit: In Python, it's generally preferable to join paths with os.path.join
, which is more portable across different operating systems.
def test_handles_badly_formatted_date(): | ||
test_dict = {"UserDate1": "a badly formatted date"} | ||
assert transform(test_dict) == { | ||
"UserDate1_raw": "a badly formatted date", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
raw_data = { | ||
"quote_before": "some text", | ||
"quote_after": "some text", | ||
"quote_both": "some text", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be quotes in some of these strings?
new_value = record[key][0] | ||
|
||
if isinstance(new_value, str): | ||
if new_value.startswith("'") and new_value.endswith("'"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you are about double quotes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haven't seen any instances of ugly doubles, so leaving them be for now
|
||
def test_cleans_simple_date(): | ||
test_dict = {"all_amendment_date": "01/01/1970"} | ||
assert transform(test_dict) == {"all_amendment_date": "1970-01-01"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto Mark's comment above – use different day/month.
Could some of this date parsing stuff be in a common lib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, although i'm not sure it's worth it at this point. Will keep in mind though
@@ -40,7 +38,7 @@ def convert_date_to_iso(date_string): | |||
try: | |||
return parse(date_string).date().isoformat() | |||
except (ValueError, TypeError): | |||
return date_string | |||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Good Python practice is to throw an exception rather than return sentinel values, so you're not relying on the caller to check if foo is None
"empty_string": "", | ||
} | ||
|
||
if full: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than an ambiguously named full
parameter, how about splitting this into two methods?
def build_bib_record()
def build_sierra_transformable()
{"a": "b", "x": "y"}, | ||
{"a": "c", "x": "z"}, | ||
], | ||
"orders_date": ["1970-01-01T00:00:00"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As before, different day/month.
""" | ||
try: | ||
data_to_unpack = deepcopy(view_record[field_name]) | ||
del edit_record[field_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are the only two lines that can throw a KeyError
? In general it's good practice to keep the body of your try
small, so you don't catch and lose unexpected errors.
…metrust/platform into reporting_switch_to_miro_complete
What is this PR trying to achieve?
adds further reporting channels from source data → kibana
Who is this change for?
📄 📊