<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

<h1 style="text-align: center;">Integrity Checks on Production Data</h1>

In this notebook, we will see how to use the UpTrain package to run data integrity checks, both on input data and on custom metrics.

### Install the required packages for this example

In [None]:
!pip install torch imgaug

In [1]:
import sys
import os
import subprocess
import zipfile
import numpy as np
import uptrain
sys.path.insert(0,'..')

from helper_files import read_json, write_json, KpsDataset
from helper_files import body_length_signal

import torch

Download the dataset from the remote server and keep a random subset of 1000 training examples.

In [2]:
data_dir = "data"
remote_url = "https://oodles-dev-training-data.s3.amazonaws.com/data.zip"
orig_training_file = 'data/training_data.json'
if not os.path.exists(data_dir):
    # Most Linux distributions have wget installed by default.
    # On macOS, try to install it with Homebrew. subprocess.call returns
    # the exit code instead of raising, so check it explicitly.
    wget_installed_ok = subprocess.call("brew install wget", shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
    if wget_installed_ok == 0:
        print("Successfully installed wget")
    try:
        if not os.path.exists("data.zip"):
            file_downloaded_ok = subprocess.call("wget " + remote_url, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
            if file_downloaded_ok != 0:
                raise RuntimeError("wget could not download " + remote_url)
            print("Data downloaded")
        with zipfile.ZipFile("data.zip", 'r') as zip_ref:
            zip_ref.extractall("./")
        full_training_data = read_json(orig_training_file)
        np.random.seed(1)
        np.random.shuffle(full_training_data)
        reduced_training_data = full_training_data[0:1000]
        write_json(orig_training_file, reduced_training_data)
        print("Prepared Example Dataset")
        os.remove("data.zip")
    except Exception as e:
        print(e)
        print("Could not load training data")
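If neither `wget` nor Homebrew is available, the same download-and-extract step can be done with the Python standard library alone. The sketch below is an alternative to the cell above, not part of the original example; the function name `download_and_extract` and its parameters are hypothetical.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url, extract_to=".", archive_name="data.zip"):
    """Download a zip archive with urllib, unpack it, and delete the archive.

    Hypothetical helper: a standard-library stand-in for the wget call above.
    """
    os.makedirs(extract_to, exist_ok=True)
    archive_path = os.path.join(extract_to, archive_name)
    # Skip the download if the archive is already on disk
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path, "r") as zip_ref:
        zip_ref.extractall(extract_to)
    os.remove(archive_path)
    return extract_to
```

With the variables defined in the cell above, a call would look like `download_and_extract(remote_url, extract_to="./")`.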

In [3]:
real_world_test_cases = 'data/real_world_testing_data.json'
golden_testing_file = 'data/golden_testing_data.json'

inference_batch_size = 16

Next, we train a deep neural network on the training data.

In [4]:
from helper_files import get_accuracy_torch, train_model_torch, BinaryClassification
train_model_torch('data/training_data.json', 'version_0')

Training on:  data/training_data.json  which has  1000  data-points
Trained model exists. Skipping training again.


Next, we evaluate the model on the golden testing dataset. The resulting accuracy is low due to misclassification of Pushup signals.

In [5]:
get_accuracy_torch(golden_testing_file, 'version_0')

Evaluating on  15731  data-points


0.9092873943169538
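The internals of `get_accuracy_torch` are not shown in this notebook. As a rough illustration of what such a score means for a binary classifier, the sketch below applies the same sigmoid-then-round step that the inference loop later in this notebook uses, and then compares the predictions with the labels. The function names and the toy logits and labels are made up for the example.

```python
import math


def sigmoid(logit):
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_accuracy(logits, labels):
    """Fraction of data points whose rounded probability matches the label."""
    preds = [round(sigmoid(z)) for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Toy values, not taken from the dataset
logits = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 0]
print(binary_accuracy(logits, labels))  # 0.75
```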

Let's define the UpTrain config to check data integrity. We define two checks:

1. Check that the input feature `kps` is not null.

2. Check that body length (a custom-defined metric) is greater than 50.

In [6]:
cfg = {
    "checks": [{
        'type': uptrain.Anomaly.DATA_INTEGRITY,
        'measurable_args': {
            'type': uptrain.MeasurableType.INPUT_FEATURE,
            'feature_name': 'kps'
        },
        'integrity_type': 'non_null'
    },
    {
        'type': uptrain.Anomaly.DATA_INTEGRITY,
        'measurable_args': {
            'type': uptrain.MeasurableType.CUSTOM,
            'signal_formulae': uptrain.Signal("body_length", body_length_signal),
        },
        "integrity_type": "greater_than",
        "threshold": 50
    },],
    "retraining_folder": "uptrain_smart_data_data_integrity",
    "tb_logging": True
}
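To make the two checks in the config above concrete, here is a plain-Python sketch of what a `non_null` and a `greater_than` check amount to conceptually. This is not UpTrain's implementation: the function names are hypothetical, the sample values are made up, and treating NaN as null and using a strict `>` comparison are assumptions.

```python
import math


def is_null(value):
    """Treat None and NaN as null; a vector is null if any entry is null."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, (list, tuple)):
        return any(is_null(v) for v in value)
    return False


def non_null_violations(values):
    """Indices of data points that fail a 'non_null' style check."""
    return [i for i, v in enumerate(values) if is_null(v)]


def greater_than_violations(values, threshold):
    """Indices of data points that fail a strict 'greater_than' style check."""
    return [i for i, v in enumerate(values) if not v > threshold]


# Made-up keypoint vectors and custom-signal values
kps_batch = [[0.1, 0.4, 0.9], [0.3, float("nan"), 0.2], [0.5, 0.5, 0.5]]
body_lengths = [72.0, 48.5, 50.0]

print(non_null_violations(kps_batch))             # [1]
print(greater_than_violations(body_lengths, 50))  # [1, 2]
```

In the actual config, the values fed to the second check come from `body_length_signal`, which is defined in `helper_files` and not reproduced here.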

In [7]:
framework_torch = uptrain.Framework(cfg)

model_dir = 'trained_models_torch/'
model_save_name = 'version_0'
real_world_dataset = KpsDataset(
    real_world_test_cases, batch_size=inference_batch_size, is_test=True
)
model = BinaryClassification()
model.load_state_dict(torch.load(model_dir + model_save_name))
model.eval()

for i,elem in enumerate(real_world_dataset):

    # Do model prediction
    inputs = {"data": {"kps": elem[0]["kps"]}, "id": elem[0]["id"]}
    x_test = torch.tensor(inputs["data"]["kps"]).type(torch.float)
    test_logits = model(x_test).squeeze() 
    preds = torch.round(torch.sigmoid(test_logits)).detach().numpy()

    # Log model inputs and outputs to the uptrain Framework to monitor data integrity
    idens = framework_torch.log(inputs=inputs, outputs=preds)

Deleting the folder:  uptrain_logs



NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.9.0 at http://localhost:6008/ (Press CTRL+C to quit)
