StreamSets DataOps Platform Tutorials

The following tutorials demonstrate features of StreamSets Data Collector, StreamSets Transformer, StreamSets Control Hub, and StreamSets SDK for Python.

StreamSets Data Collector -- Basic Tutorials

Log Shipping to Elasticsearch - Read weblog files from a local filesystem directory, decorate some of the fields (e.g. GeoIP Lookup), and write them to Elasticsearch.
Simple Kafka Enablement using StreamSets Data Collector
What’s the Biggest Lot in the City of San Francisco? - Read city lot data from JSON, calculate lot areas in JavaScript, and write them to Hive.
Ingesting Local Data into Azure Data Lake Store - Read records from a local CSV-formatted file, mask out PII (credit card numbers) and send them to a JSON-formatted file in Azure Data Lake Store.
Working with StreamSets Data Collector and Microsoft Azure - Integrate Azure Blob Storage, Apache Kafka on HDInsight, Azure SQL Data Warehouse and Apache Hive backed by Azure Blob Storage.
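
As a rough illustration of the PII-masking step mentioned in the Azure Data Lake Store tutorial, here is a minimal Python sketch. It is not the tutorial's pipeline configuration. The function name, the regular expression, and the choice to keep the last four digits are all assumptions made for this example.

```python
import re

# Assumed pattern for this sketch: 13 to 19 digits, optionally separated
# by single spaces or hyphens. Real card detection may need stricter rules.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def mask_card_numbers(text, keep_last=4):
    """Replace digits of card-like numbers with 'x', keeping the last keep_last digits."""

    def _mask(match):
        raw = match.group(0)
        total_digits = sum(ch.isdigit() for ch in raw)
        seen = 0
        out = []
        for ch in raw:
            if ch.isdigit():
                seen += 1
                # Keep only the trailing digits; separators are left untouched.
                out.append(ch if seen > total_digits - keep_last else "x")
            else:
                out.append(ch)
        return "".join(out)

    return CARD_PATTERN.sub(_mask, text)


if __name__ == "__main__":
    print(mask_card_numbers("paid with 4111-1111-1111-1111 today"))
```

Short digit runs such as order IDs fall below the 13-digit minimum and are left unchanged.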

StreamSets Data Collector -- Writing Custom Pipeline Stages

Creating a Custom StreamSets Origin - Build a simple custom origin that reads a Git repository's commit log and produces the corresponding records.
Creating a Custom Multithreaded StreamSets Origin - A more advanced tutorial focusing on building an origin that supports parallel execution, so the pipeline can run in multiple threads.
Creating a Custom StreamSets Processor - Build a simple custom processor that reads metadata tags from image files and writes them to the records as fields.
Creating a Custom StreamSets Destination - Build a simple custom destination that writes batches of records to a webhook.

Javadoc for the Data Collector API is available on request; please reach out to us if you need it.

StreamSets Data Collector -- Advanced Features

Ingesting Drifting Data into Hive and Impala - Build a pipeline that handles schema changes in MySQL, creating and altering Hive tables accordingly.
Creating a StreamSets Spark Transformer in Java - Build a simple Java Spark Transformer that computes a credit card's issuing network from its number.
Creating a StreamSets Spark Transformer in Scala - Build a simple Scala Spark Transformer that computes a credit card's issuing network from its number.
Creating a CRUD Microservice Pipeline - Build a microservice pipeline to implement a RESTful web service that reads from and writes to a database via JDBC.
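
The Spark Transformer tutorials compute a credit card's issuing network from its number. The idea can be sketched in a few lines of Python, shown here only to illustrate the logic; the tutorials themselves use Java and Scala. The function name and the simplified prefix table are assumptions for this example, and the tutorials may use a different list of prefixes.

```python
def issuing_network(card_number):
    """Guess the issuing network from the leading digits of a card number.

    Simplified prefix rules (an assumption for this sketch):
    Visa 4; American Express 34, 37; Mastercard 51-55 and 2221-2720;
    Discover 6011 and 65.
    """
    # Ignore spaces, hyphens, and any other non-digit characters.
    digits = "".join(ch for ch in card_number if ch.isdigit())
    if not digits:
        return "Unknown"
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"34", "37"}:
        return "American Express"
    if len(digits) >= 2 and 51 <= int(digits[:2]) <= 55:
        return "Mastercard"
    if len(digits) >= 4 and 2221 <= int(digits[:4]) <= 2720:
        return "Mastercard"
    if digits.startswith("6011") or digits.startswith("65"):
        return "Discover"
    return "Unknown"


if __name__ == "__main__":
    for number in ("4111 1111 1111 1111", "378282246310005", "9999"):
        print(number, "->", issuing_network(number))
```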

The Data Collector documentation also includes an extended tutorial that walks through basic Data Collector functionality, including creating, previewing and running a pipeline, and creating alerts.

StreamSets Data Collector -- Kubernetes-based Deployment

Kubernetes-based Deployment - Example configurations for Kubernetes-based deployments of StreamSets Data Collector.

StreamSets Control Hub

Creating Custom Data Protector Procedure - Create, build, and deploy your own custom data protector procedure that you can use as a protection method to apply to record fields.

StreamSets Transformer

Creating a Custom Processor for StreamSets Transformer - Create a simple custom processor, using Java and Scala, that will compute the type of a credit card from its number, and configure Transformer to use it.
Creating a Custom Scala Project For StreamSets Transformer - This tutorial explains how to create a custom Scala project and import the compiled jar into StreamSets Transformer.

StreamSets SDK for Python

Common

Find SDK methods and fields of an object available - Example objects include a pipeline, an SCH job, or a stage within a pipeline.
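
One generic way to discover the methods and fields of any Python object is standard introspection with `dir()` and `getattr()`. The sketch below uses only built-ins and may differ from the approach taken in the tutorial. `public_members` and `ExampleStage` are hypothetical names for this example; `ExampleStage` is a stand-in and not an SDK class.

```python
def public_members(obj):
    """Return (methods, fields): names of public callables and non-callables on obj."""
    methods, fields = [], []
    for name in dir(obj):  # dir() returns names in sorted order
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        try:
            value = getattr(obj, name)
        except Exception:
            continue  # some properties may raise when accessed
        (methods if callable(value) else fields).append(name)
    return methods, fields


class ExampleStage:
    """Hypothetical stand-in for an SDK object such as a pipeline stage."""

    def __init__(self):
        self.label = "Example stage"
        self.batch_size = 1000

    def describe(self):
        return f"{self.label} (batch size {self.batch_size})"


if __name__ == "__main__":
    methods, fields = public_members(ExampleStage())
    print("methods:", methods)
    print("fields:", fields)
```

The same helper can be pointed at a real SDK object in an interactive session to see what it exposes.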

Control Hub

Getting started with StreamSets SDK for Python - Design and publish a pipeline. Then create, start, and stop a job using StreamSets SDK for Python.
Jobs related tutorials
- Sample ways to fetch one or more jobs
- Start a job and monitor that specific job - Start a job and monitor that specific job using metrics and time series metrics.
- Move jobs from dev to prod using data_collector_labels - Move jobs from dev to prod by updating their Data Collector labels.
- Generate a report for a specific job - Generate a report for a specific job, then fetch and download it.
- See logs for a data-collector where a job is running - Get the Data Collector where a job is running, then view its logs.
Pipelines related tutorials
- Common pipeline methods - Common operations for StreamSets Control Hub pipelines, such as update, duplicate, import, and export.
- Loop over pipelines and stages and make an edit to stages - When there are many pipelines and stages that need an update, SDK for Python makes it easy to update them with just a few lines of code.
- Create CI CD pipeline used in demo - This covers the steps to create the CI CD pipeline used in the SCH CI CD demo. The steps include how to add stages like JDBC, some processors, and Kineticsearch, and how to set stage configurations. It also shows the use of runtime parameters.

License

StreamSets Data Collector and its tutorials are built on open source technologies; the tutorials and accompanying code are licensed under the Apache License 2.0.

Contributing Tutorials

We welcome contributors! Please check out our guidelines to get started.

