This page demonstrates usage of DeepSparse's key APIs: benchmarking, Engine, Pipeline, and Server.
The Server examples require DeepSparse Server to be installed (`pip install deepsparse[server]`).
DeepSparse's key feature is its performance on commodity CPUs. For dense unoptimized models, DeepSparse is competitive with other CPU runtimes like ONNX Runtime. However, when optimization techniques like pruning and quantization are applied to a model, DeepSparse can achieve an order-of-magnitude speedup.
As an example, let's compare DeepSparse's and ORT's performance on BERT, using a 90% pruned-quantized version from SparseZoo, on an AWS c6i.16xlarge instance (32 cores).
ORT achieves 18.5 items/second running BERT (make sure ORT is installed: `pip install onnxruntime`):
```bash
deepsparse.benchmark zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/base-none -b 64 -s sync -nstreams 1 -i [64,384] -e onnxruntime

>> Original Model Path: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/base-none
>> Batch Size: 64
>> Scenario: sync
>> Throughput (items/sec): 18.5742
```
DeepSparse achieves 226 items/second running the pruned-quantized version of BERT:
```bash
deepsparse.benchmark zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none -b 64 -s sync -nstreams 1 -i [64,384]

>> Original Model Path: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none
>> Batch Size: 64
>> Scenario: sync
>> Throughput (items/sec): 226.6340
```
DeepSparse achieves a 12x speedup over ORT!
Pro-Tip: In place of a SparseZoo stub, you can pass a local ONNX file to benchmark your own model.
Check out the Performance Benchmarking guide for more details.
Now that we have seen DeepSparse's performance gains, we can add DeepSparse to an application.
DeepSparse includes three deployment APIs:
- Engine is the lowest-level API. With Engine, you pass tensors and receive the raw logits.
- Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction.
- Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction.
The following are simple examples of each API to give a sense of how they are used. For the examples, we will use the sentiment analysis use case with a 90% pruned-quantized version of BERT.
Engine is the lowest-level API, allowing you to run inference directly on input tensors.
The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input.
```python
from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs, model_to_path

# download onnx, compile model
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
batch_size = 1
bert_model = compile_model(
    model=zoo_stub,         # sparsezoo stub or path/to/local/model.onnx
    batch_size=batch_size,  # default is batch 1
)

# run inference (input is raw numpy tensors, output is raw scores)
inputs = generate_random_inputs(model_to_path(zoo_stub), batch_size)
output = bert_model(inputs)
print(output)

# > [array([[-0.3380675 , 0.09602544]], dtype=float32)] << raw scores
```
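Because Engine returns raw logits, post-processing is left to you. As a minimal sketch (plain NumPy, not part of the DeepSparse API), a softmax converts the raw scores above into per-class probabilities:

```python
import numpy as np

# `output` is the list of numpy arrays returned by bert_model(inputs) above;
# output[0] holds the raw logits for the two sentiment classes
logits = output[0]

# numerically stable softmax over the class dimension
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
print(probs)  # e.g. [[0.39, 0.61]] -> one probability per class
```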
DeepSparse can accept ONNX models from two sources:
- SparseZoo Stubs: SparseZoo is Neural Magic's open-source repository of sparse models. You can pass a SparseZoo stub, a unique identifier for each model, to DeepSparse, which downloads the necessary ONNX files from the remote repository.
- Custom ONNX: DeepSparse allows you to use your own model in ONNX format. Check out the SparseML user guide for more details on exporting your sparse models to ONNX format. Here's a quick example using a custom ONNX file from the ONNX Model Zoo:
```bash
wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx

> Saving to: ‘mobilenetv2-7.onnx’
```
```python
from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

onnx_filepath = "mobilenetv2-7.onnx"
batch_size = 1

# Generate random sample input
inputs = generate_random_inputs(onnx_filepath, batch_size)

# Compile and run
engine = compile_model(onnx_filepath, batch_size)
outputs = engine.run(inputs)
```
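As a quick sanity check, you can inspect what `engine.run` returned: a list of NumPy arrays, one per model output. The expected shape below assumes the standard MobileNetV2 export with a single 1000-class output:

```python
# outputs is a list of numpy arrays, one per ONNX model output
print(type(outputs), len(outputs))
print(outputs[0].shape)  # expected to be (batch_size, 1000) for this model
```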
Pipeline is the default API for interacting with DeepSparse. Similar to Hugging Face Pipelines, DeepSparse Pipelines wrap Engine with pre- and post-processing (as well as other utilities), enabling you to send raw data to DeepSparse and receive the post-processed prediction.
The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data.
```python
from deepsparse import Pipeline

# download onnx, set up pipeline
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
batch_size = 1
sentiment_analysis_pipeline = Pipeline.create(
    task="sentiment-analysis",  # name of the task
    model_path=zoo_stub,        # zoo stub or path to local onnx file
    batch_size=batch_size,      # default is batch 1
)

# run inference (input is a sentence, output is the prediction)
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)

# > labels=['positive'] scores=[0.9954759478569031]
```
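Pipelines also handle batches of inputs. The sketch below assumes (as in DeepSparse's text-classification pipelines) that a list of sequences matching the configured batch size is accepted, and that the returned object exposes `labels` and `scores` fields as in the printed output above:

```python
# re-create the pipeline with a batch size of 2 (illustrative value)
batched_pipeline = Pipeline.create(
    task="sentiment-analysis",
    model_path=zoo_stub,
    batch_size=2,
)

# pass a list of sentences; one prediction is returned per sequence
batch_prediction = batched_pipeline(
    ["I love using DeepSparse Pipelines", "The weather is gloomy today"]
)
print(batch_prediction.labels, batch_prediction.scores)
```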
Check out the DeepSparse Pipeline guide for more details.
Server wraps Pipelines with a REST API, making it easy to stand up a model-serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions.
DeepSparse Server is launched from the command line, configured via arguments or a server configuration file.
The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint:
```bash
deepsparse.server \
  --task sentiment-analysis \
  --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
```
Alternatively, the following configuration file can be used to launch the Server:
```yaml
# config.yaml
endpoints:
  - task: sentiment-analysis
    route: /predict
    model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
```
Spinning up:
```bash
deepsparse.server \
  --config-file config.yaml
```
You should see Uvicorn report that it is running on port 5543. Navigating to the `/docs` endpoint will show the exposed routes as well as sample requests.
We can then send a request over HTTP. In this example, we use the Python `requests` package to format the HTTP request.
```python
import requests

url = "http://localhost:5543/predict"  # Server's port defaults to 5543
obj = {"sequences": "Snorlax loves my Tesla!"}

response = requests.post(url, json=obj)
print(response.text)

# {"labels":["positive"],"scores":[0.9965094327926636]}
```
Check out the DeepSparse Server guide for more details.
DeepSparse supports many CV and NLP use cases out of the box. Check out the Use Cases page for details on the task-specific APIs.