
Implement Machine Learning for Data Anomalies (and how) #3289

Closed
BrockCodes opened this issue Apr 12, 2023 · 2 comments

@BrockCodes

Problem

Finding and identifying anomalies in Realm's data can help prevent errors such as bad changesets. By integrating ML.NET into the C# SDK, you would give Realm a means to self-correct and heal itself natively.

This could cover client reset logic, anomaly detection, and more. You could even expose parameters that let developers call into ML.NET to perform additional functions for their own users.

ML.NET is a framework for building custom machine learning models in C#. While it is not directly tied to detecting anomalies in data or inserting the most recent document version in the C# Realm SDK, it can be used to create a custom model for anomaly detection.

To integrate ML.NET into the C# Realm SDK, you will first need to install the ML.NET NuGet packages (the DetectSpikeBySsa transform used below lives in the time-series package):

Install-Package Microsoft.ML
Install-Package Microsoft.ML.TimeSeries

Next, you can create a custom model using ML.NET to detect anomalies in your data. Here's an example of how you can train and use a simple anomaly detection model in C#:

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class AnomalyDetectionModel
{
    // Input: a single float value per time-series point.
    public class AnomalyData
    {
        [LoadColumn(0)]
        public float Value { get; set; }
    }

    // Output of DetectSpikeBySsa: [alert (0 or 1), raw score, p-value].
    public class AnomalyPrediction
    {
        [VectorType(3)]
        public double[] Prediction { get; set; }
    }

    private readonly PredictionEngine<AnomalyData, AnomalyPrediction> _engine;

    public AnomalyDetectionModel(string modelPath)
    {
        var context = new MLContext();
        var model = context.Model.Load(modelPath, out var schema);
        // Note: for stateful time-series prediction, Microsoft.ML.TimeSeries
        // also offers CreateTimeSeriesEngine; a plain PredictionEngine is
        // kept here for simplicity.
        _engine = context.Model.CreatePredictionEngine<AnomalyData, AnomalyPrediction>(model);
    }

    public bool IsAnomaly(float value)
    {
        var prediction = _engine.Predict(new AnomalyData { Value = value });
        // The first element of the prediction vector is the spike alert (1 = anomaly).
        return prediction.Prediction[0] == 1;
    }

    public static void TrainModel(string trainingDataPath, string modelPath)
    {
        var context = new MLContext();
        var data = context.Data.LoadFromTextFile<AnomalyData>(trainingDataPath, separatorChar: ',');
        var pipeline = context.Transforms.DetectSpikeBySsa(
            outputColumnName: "Prediction",
            inputColumnName: nameof(AnomalyData.Value),
            confidence: 95.0,
            pvalueHistoryLength: 30,
            trainingWindowSize: 90,
            seasonalityWindowSize: 30);
        var model = pipeline.Fit(data);
        context.Model.Save(model, data.Schema, modelPath);
    }
}

In this example, we define a custom AnomalyData class to represent our data, which has a single float value. We also define an AnomalyPrediction class to represent the output of our model: a vector with three values (the spike alert, the raw score, and the p-value).

We then create an AnomalyDetectionModel class, which loads a pre-trained model from a file and provides an IsAnomaly method to detect anomalies in new data. The IsAnomaly method takes a single float value and returns true if the value is an anomaly, or false otherwise.

Finally, we define a static TrainModel method that trains a new anomaly detection model using the ML.NET DetectSpikeBySsa transform. This method takes a path to a CSV file containing training data and a path to where the trained model should be saved.
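For illustration, here is a minimal usage sketch (the file paths are hypothetical placeholders):

// Train once on historical data, then reuse the saved model.
AnomalyDetectionModel.TrainModel("training-data.csv", "anomaly-model.zip");

var detector = new AnomalyDetectionModel("anomaly-model.zip");
bool flagged = detector.IsAnomaly(42.0f); // true if the value registers as a spike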

To integrate this with the C# Realm SDK, you can call the AnomalyDetectionModel.IsAnomaly method on new data as it is inserted into the database. You can also periodically retrain the model using the AnomalyDetectionModel.TrainModel method, using data from the Realm database as the training data.
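As a rough sketch of that integration, assuming the MyDataModel Realm object defined later in this issue and a model that has already been trained and saved, the check could run before each write (the handling of a flagged value is a placeholder assumption):

var detector = new AnomalyDetectionModel("anomaly-model.zip");

void InsertWithAnomalyCheck(Realm realm, MyDataModel data)
{
    // Flag the value before committing it to the database.
    if (detector.IsAnomaly(data.Value))
    {
        // Placeholder: log, quarantine, or reject the write as appropriate.
        Console.WriteLine($"Anomalous value detected: {data.Value}");
    }

    realm.Write(() => realm.Add(data, update: true));
}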

The time it takes to train the model and detect anomalies in new data depends on the size of the training data and the complexity of the model. The time complexity of the DetectSpikeBySsa transform used in this example is roughly O(N log N), where N is the length of the input time series; faster SSA variants can reduce this cost further.

It is important to note that the overall cost of DetectSpikeBySsa may be further affected by any preprocessing or postprocessing steps, but given Realm's relative size and speed, this overhead should be negligible.

Solution

The larger the training data and the more complex the model, the longer training and detection will take. It's important to strike a balance between model accuracy and the time spent training and detecting anomalies.

Once the model is trained and anomalies are detected in new data, the next step is to insert the most recent document version into the Realm database. This can be done using the C# Realm SDK, which provides an easy-to-use API for interacting with the database.

Here is an example of how to insert a new document version into a Realm database using the C# Realm SDK:

using Realms;

// Define a model for your data
public class MyDataModel : RealmObject
{
    [PrimaryKey]
    public int Id { get; set; }

    public string Name { get; set; }

    public int Value { get; set; }
}

// Create a new Realm instance
var realm = Realm.GetInstance();

// Create a new instance of your data model
var myData = new MyDataModel
{
    Id = 1,
    Name = "My Data",
    Value = 10
};

// Add the new data to the database
using (var trans = realm.BeginWrite())
{
    realm.Add(myData, update: true);
    trans.Commit();
}

This example defines a simple data model, creates a new instance of the model, and inserts it into the Realm database using a write transaction. The update: true argument passed to realm.Add() ensures that any existing data with the same primary key is updated rather than duplicated.

The time it takes to insert the new document version into the Realm database will depend on the size and complexity of the data being inserted, as well as the current state of the database. However, Realm's efficient data storage and indexing should help to minimize the time required for this step.
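For a one-off write like this, the SDK's Write convenience method is an equivalent, slightly terser alternative to BeginWrite/Commit:

// The SDK wraps the action in a single write transaction.
realm.Write(() =>
{
    realm.Add(myData, update: true);
});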

Alternatives

No response

How important is this improvement for you?

Would be a major improvement

Feature would mainly be used with

Atlas Device Sync

@BrockCodes
Author

I also want to add: if you implement ML.NET in Realm, you can extend it to a central Atlas cluster to run detections. This would enable Atlas-wide automation to find and fix problems before they even become problems.

I have also written up how to link an Atlas cluster to ML.NET, though you could of course come up with your own solution for the Atlas backend.

https://www.linkedin.com/pulse/use-mongodb-atlas-mlnet-detect-fraud-brock-leonard%3FtrackingId=28YZiws1Qjq%252BCIoO923UUQ%253D%253D/?trackingId=28YZiws1Qjq%2BCIoO923UUQ%3D%3D

@nirinchev
Member

Hi Brock, appreciate the suggestion and the extensive walkthrough. Unfortunately, we don't have immediate plans to add anomaly detection to the Realm database. For a bit of context:

  1. We feel this is a very heavy-handed approach to address a very narrow set of issues. Bad changeset errors are extremely rare and are virtually always caused by bugs in the sync algorithm. Attempting to use anomaly detection to hide those bugs is unreasonably expensive in terms of storage and CPU time, and not worth it compared to investing in broader fuzz testing and logging.
  2. As Realm is an embedded database, the SDK needs to be as lightweight as possible as users will be downloading it as part of their application. Taking on a dependency on Microsoft.ML goes counter to this goal.
  3. Any ML improvements to the Realm database need to happen in the Core codebase in order to benefit all users of all SDKs. Implementing general-purpose improvements at the SDK layer would lead to massive amounts of duplicated efforts and likely different outcomes for users of the different SDKs.

Again, appreciate the suggestion and we'll keep it in mind as we're thinking about the evolution of the Core database, but for the time being, I'll close this as not planned.

@nirinchev closed this as not planned on Apr 14, 2023
The github-actions bot locked this as resolved and limited conversation to collaborators on Mar 11, 2024