Add Ceph Upstream Prediction Models #1

chauhankaranraj · 2020-08-24T16:20:56Z

Hey @yaarith, I added the ceph upstream models to this repo, and updated the model.py script to use these by default.

Marking this PR as WIP because there's one piece still missing - vendor info. Is it possible to get the "vendor", "nvme_vendor" and "model_name" keys from the smartctl jsons as well? These aren't used as inputs to models, but are needed to decide which model to use (models are vendor based atm). Could you please take a look? Sorry for not mentioning this requirement earlier.

Edit: Addresses ceph#13

yaarith · 2020-08-26T14:04:34Z

@chauhankaranraj Great!!

I pushed the changes you asked:
4c12cc2

Please note that 'nvme_vendor' key was not added since it's the same as 'vendor'. That's because we retrieve the data from the database, where it was processed and filtered (as opposed to the raw SMART data in the client side, where 'vendor' and 'nvme_vendor' may contain different values).

Let me know if you need anything :-)

Signed-off-by: Karan Chauhan <kachau@redhat.com>

chauhankaranraj · 2020-08-26T17:11:20Z

Thanks for the changes, @yaarith, I have updated the predictor :)

yaarith · 2020-08-26T19:47:45Z

@chauhankaranraj Great, many thanks!!

I wish to test the integration, can you please add the updated requirements.txt?

Signed-off-by: Karan Chauhan <kachau@redhat.com>

chauhankaranraj · 2020-08-26T20:38:47Z

I wish to test the integration, can you please add the updated requirements.txt?

Of course, should be updated now.

yaarith · 2020-08-28T17:23:32Z

device_prediction/model.py

+        predictor.initialize("{}/models/{}".format(get_diskfailurepredictor_path(), args.predictor_model))
+
+        # make prediction
+        prediction_result = predictor.predict(device_data)


'device_data' is not defined.

I think you meant to add after line 305
device_data = json.load(inp_json)
?

Oops, looks like I accidentally deleted that after testing 😅
I'll update it to include device_data = json.loads(inp_json) (coz IIUC inp_json will be a string so loads should be used instead of load)

Thanks, @chauhankaranraj!

Yes, json.loads().

I still see several warnings and another error:

$ ./predict_device.py 1669 redhat ./model.py:189: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future. means = np.vstack(gen) ./model.py:194: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future. stds = np.vstack(gen) ./model.py:197: RuntimeWarning: invalid value encountered in true_divide cvs = stds / means /usr/local/lib64/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.preprocessing.data module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.preprocessing. Anything that cannot be imported from sklearn.preprocessing is now part of the private API. warnings.warn(message, FutureWarning) /usr/local/lib64/python3.6/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator RobustScaler from version 0.19.2 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk. UserWarning) Traceback (most recent call last): File "./model.py", line 326, in <module> sys.exit(main()) File "./model.py", line 319, in main prediction_result = predictor.predict(device_data) File "./model.py", line 264, in predict preprocessed_data = self.__preprocess(disk_days, manufacturer) File "./model.py", line 212, in __preprocess featurized = scaler.transform(featurized) File "/usr/local/lib64/python3.6/site-packages/sklearn/preprocessing/_data.py", line 1259, in transform X -= self.center_ ValueError: operands could not be broadcast together with shapes (161,70) (67,) (161,70)

@yaarith I think most of the warnings are expected future deprecation warnings. But the last UserWarning is something I haven't seen before - are the packages installed according to the requirements? As per requirements.txt, scikit-learn needs to be 0.19.2, but from the output it seems like 0.23.2 is being used?

Could you please lmk how I can replicate the last ValueError? I tried running ./predict_device.py 1669 redhat but I don't have the grafana.dsn file so I get an error. So then I ran ./model.py --predictor-model redhat but that didn't throw this error.

Note: for testing, I had commented out the stdin.read() and loaded the sample input as follows:

# inp_json = sys.stdin.read() # device_data = json.loads(inp_json) fname = 'input_samples/predict_1669.json' with open(fname, 'rb') as f: device_data = json.load(f)

I think most of the warnings are expected future deprecation warnings

These warnings will become errors soon, and I'd rather have this code stable for a while :-)

As per requirements.txt, scikit-learn needs to be 0.19.2, but from the output it seems like 0.23.2 is being used?

Indeed, 0.23.2 is used. I think this issue relates to the issues Ceph users are having with diskprediction local module. Users who still have version 0.19.2 will eventually upgrade scikit-learn and the module will be broken. Also, the latest version in Ceph is 0.23.2, so we want to keep it on par with that. Is it possible to make it work with the latest scikit-learn version?

Could you please lmk how I can replicate the last ValueError?

You can run:
$ cat ./input_samples/predict_1669.json | ./model.py --predictor-model redhat
which saves you the changes between testing and production versions (no need to add the open file code block).
When I ran the code like this I see the exact same error.

In this case I prefer we move on to refining the time and confidence granularity of the existing model /
training the new model. So we can use what we have now, then replace it with an improved model later.

Yep, that's exactly what I had in mind.

Regarding ProphetStor: since we will not retrain the model it should not be a problem.

Sounds good!

Please remove any logic from model.py which is not related to 'redhat' model; we will add different models in separate files later.

Cool, shall I remove the placeholder function simple_prediction as well then?

Yep, that's exactly what I had in mind.

Can we make sure we train using the latest versions of numpy, scikit-learn, etc.? These models serve Ceph upstream and we want to move forward to the latest releases.

Cool, shall I remove the placeholder function simple_prediction as well then?

Yes, you can remove anything in this file which is not related to 'redhat' model (so anything which was part of the original placeholder model). The idea is that each model will be in a separate file.

The output should go to stdout, please remove the comment from print(prediction_result) (line 303 in model.py).

Oh, I meant to uncomment the line :-)

Can we make sure we train using the latest versions of numpy, scikit-learn, etc.? These models serve Ceph upstream and we want to move forward to the latest releases.

Definitely. IIRC when we trained these models last year, the constraint sklearn==0.19 already existed, so we didn't want to break the existing models by updating it. But since that constraint is not relevant here, we'll use the latest and greatest now :)

Yes, you can remove anything in this file which is not related to 'redhat' model (so anything which was part of the original placeholder model). The idea is that each model will be in a separate file.

Oh, I meant to uncomment the line :-)

Pushed these changes 👍

Thanks!

But since that constraint is not relevant here, we'll use the latest and greatest now :)

Sounds good. We do need to understand how we integrate them with future Ceph releases, so the diskprediction module uses them and does not break.

Does model.py need the SMART reports to be consecutive?
So if we have 6 reports with SMART data, but there's a gap between them - will it affect the results in any way?

Does model.py need the SMART reports to be consecutive?
So if we have 6 reports with SMART data, but there's a gap between them - will it affect the results in any way?

It won't break, but the output may not be accurate in this case. The model has been trained to predict health based on exactly last 6 days of device behavior. So if the input is data from days other than the last 6 days, then we can't be sure that the result will be meaningful.

yaarith · 2020-09-09T16:57:25Z

device_prediction/model.py

+RHDiskFailurePredictor uses the models developed at the AICoE in the
+Office of the CTO at Red Hat. These models were built using the open
+source Backblaze SMART metrics dataset.
+PSDiskFailurePredictor uses the models developed by ProphetStor as an


Can also remove this ProphetStor reference

chauhankaranraj force-pushed the wip-device-prediction-integration branch from 1cd8922 to 13acba1 Compare August 26, 2020 14:46

chauhankaranraj and others added 2 commits August 26, 2020 13:08

Added upstream models. Updated model.py to use these by default.

4c3512b

Signed-off-by: Karan Chauhan <kachau@redhat.com>

Updated device health predictor for vendor,model inputs

36ef936

Signed-off-by: Karan Chauhan <kachau@redhat.com>

chauhankaranraj force-pushed the wip-device-prediction-integration branch from eb4ceab to 36ef936 Compare August 26, 2020 17:10

chauhankaranraj changed the title ~~[WIP][do-not-merge] Add Ceph Upstream Prediction Models~~ Add Ceph Upstream Prediction Models Aug 26, 2020

Updated requirements.txt for upstream models.

86991e6

Signed-off-by: Karan Chauhan <kachau@redhat.com>

yaarith reviewed Aug 28, 2020

View reviewed changes

Karanraj Chauhan added 5 commits August 28, 2020 14:59

Fixed missing json loading from cmd input

c69b6a9

Updated rolling stats calc to be compatible w/ latest numpy

03cff55

Fixed typo

199f947

Alternate method for division to avoid RuntimeWarning

000bc56

Removed prophetstor model related input. Remove commented out code

ca1dd07

yaarith reviewed Sep 9, 2020

View reviewed changes

Karanraj Chauhan added 2 commits September 9, 2020 13:09

Removed placeholder predictor and argparser. Added print result

13f73d1

Removed prophetstor reference

c8773b5

yaarith merged commit 779d858 into yaarith:wip-device-prediction-integration Sep 10, 2020

chauhankaranraj mentioned this pull request Dec 3, 2020

SMART Metrics Forecasting aicoe-aiops/ceph_drive_failure#11

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Ceph Upstream Prediction Models #1

Add Ceph Upstream Prediction Models #1

chauhankaranraj commented Aug 24, 2020 •

edited

Loading

yaarith commented Aug 26, 2020

chauhankaranraj commented Aug 26, 2020

yaarith commented Aug 26, 2020

chauhankaranraj commented Aug 26, 2020

yaarith Aug 28, 2020

chauhankaranraj Aug 28, 2020

yaarith Aug 31, 2020

chauhankaranraj Aug 31, 2020

yaarith Aug 31, 2020

chauhankaranraj Sep 9, 2020

yaarith Sep 9, 2020

chauhankaranraj Sep 9, 2020

yaarith Sep 10, 2020

chauhankaranraj Sep 10, 2020

yaarith Sep 9, 2020

Add Ceph Upstream Prediction Models #1

Add Ceph Upstream Prediction Models #1

Conversation

chauhankaranraj commented Aug 24, 2020 • edited Loading

yaarith commented Aug 26, 2020

chauhankaranraj commented Aug 26, 2020

yaarith commented Aug 26, 2020

chauhankaranraj commented Aug 26, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chauhankaranraj commented Aug 24, 2020 •

edited

Loading