A fast and easy way to convert Mapping Data Flows from Azure Data Factory to Microsoft Fabric notebooks and Spark jobs.
You can also try converting an Azure Mapping Data Flow to a Fabric PySpark notebook using OpenAI; see mapping-dataflow-to-fabric-with-openai.
- Overview
- Extendability
- Installation
- Prerequisites
- Execution
- Usage
- Example
- Limitations
- Debugging
Mapping data flow to Microsoft Fabric notebook is a command-line tool that converts Mapping Data Flow script code to Spark Scala and PySpark code. The objective of the conversion tool is to produce Spark code in the following formats:
- PySpark Script
- Scala Spark Object
- Microsoft Fabric PySpark Notebook
- Microsoft Fabric Scala Spark Notebook
The tool is composed of the following logical components:
- Syntactical Parsers
- Dataflow Spark data structures (model classes)
- Code templates - to generate the Spark code in file format
- File and Notebook writer
The syntactical parsers parse the MDF (mapping data flow) script code into Spark code. These parsers work on an input stream (in this case, a mapping data flow transformation) and consume a collection of tokens.
These parsers are built using Scala Parser Combinators. Scala parser combinators allow you to compose higher-order functions to construct the mapping data flow transformation grammar structures. These higher-order functions define the transformation DSL in a purely functional way.
The Scala parser combinator library allows you to build primitive parsers and compose them functionally into larger ones. These parsers return a semantic model of a transformation; at the end of the complete parsing process, the semantic model is the AST. In Scala, an AST can be defined as a hierarchy of case classes (model classes) with an abstract class at the top. For example, the AST for the MDF row modifier
dimcity filter(StateProvince==$StateFilter1 || StateProvince==$StateFilter2) ~> cityfilter
will be:
DataFlowFilter(dimcity,ListExpressionCondition(List(ExpressionCondition(List(StateProvince),==,List(StateFilter1),||), ExpressionCondition(List(StateProvince),==,List(StateFilter2),))),cityfilter)
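As an illustration of this approach (a minimal sketch only: the case classes and the grammar below are simplified stand-ins for the tool's actual model classes, and the scala-parser-combinators module is assumed to be on the classpath):

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Simplified AST: an abstract class at the top, case classes underneath.
sealed abstract class DataFlowAst
case class ExpressionCondition(left: List[String], op: String, right: List[String], conjunction: String) extends DataFlowAst
case class DataFlowFilter(input: String, conditions: List[ExpressionCondition], output: String) extends DataFlowAst

// A minimal combinator grammar for "<input> filter(<col>==$<param>) ~> <output>".
object FilterParserSketch extends JavaTokenParsers {
  def condition: Parser[ExpressionCondition] =
    ident ~ "==" ~ "$" ~ ident ^^ {
      case col ~ op ~ _ ~ param => ExpressionCondition(List(col), op, List(param), "")
    }

  def filter: Parser[DataFlowFilter] =
    ident ~ "filter" ~ "(" ~ condition ~ ")" ~ "~>" ~ ident ^^ {
      case in ~ _ ~ _ ~ cond ~ _ ~ _ ~ out => DataFlowFilter(in, List(cond), out)
    }

  def main(args: Array[String]): Unit =
    // Prints the parse result wrapping a DataFlowFilter AST.
    println(parseAll(filter, "dimcity filter(StateProvince==$StateFilter1) ~> cityfilter"))
}
```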
Each transformation in the flow is parsed by a dedicated Spark parser, such as Source, Filter, or Join. Transformation parsers are loaded using the service loader pattern. The following diagram depicts the relationship of the parsers with the other combinators.
The mdf-to-spark execution flow is as follows:
A new parser class must extend BaseStandardTokenParser and be registered with the service loader (META-INF/services) so that it is picked up at runtime, as sketched below.
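The members required by BaseStandardTokenParser are not documented in this README, so the following is only a skeleton with a hypothetical class and package name:

```scala
package com.example.mdf.parser

import com.microsoft.azure.adf.dataflow.parser.syntactical.spark.BaseStandardTokenParser

// Hypothetical transformation parser: implement whatever parsing members
// BaseStandardTokenParser actually declares for your new transformation.
class MyTransformationParser extends BaseStandardTokenParser {
  // parsing rules for the new transformation go here
}
```

The class is then registered by adding its fully qualified name (here, com.example.mdf.parser.MyTransformationParser) to the matching provider-configuration file under META-INF/services.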
Code templates for Scala and PySpark are defined in resources/code/filetemplate; placeholders are filled from the template arguments array to generate the Spark code.
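The real templates live under resources/code/filetemplate and are not reproduced here; purely to illustrate the placeholder-plus-arguments idea, substitution could look like this (the template text and method below are hypothetical):

```scala
object TemplateSketch {
  // Hypothetical template with positional placeholders; the actual templates
  // under resources/code/filetemplate may use a different convention.
  private val template =
    """|object %s {
       |  def main(args: Array[String]): Unit = {
       |    val spark = org.apache.spark.sql.SparkSession.builder.appName("%s").getOrCreate()
       |    // generated transformation code goes here
       |  }
       |}""".stripMargin

  // The template arguments fill the placeholders in order.
  def render(className: String, appName: String): String =
    template.format(className, appName)
}
```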
The tool provides two output formats, produced by FileFormatCodeWriter and NbFormatWriter:
- File - Scala Spark and PySpark
- Notebook Format - Scala and PySpark
The output of the code template is the input for the file writer. The notebook metadata is configured via resources/code/notebookmetadata and written using the Python nbformat package.
There are multiple ways to provide input script code to the application:
- File - data flow script representation (com.microsoft.azure.adf.dataflow.source.ScriptCodeFileSource)
- Data Flow GET REST API (com.microsoft.azure.adf.dataflow.source.DataFlowRestGet)
The Data Flow GET REST API uses Azure Active Directory token authentication by default and requires the following environment variables (see the sketch after the list):
- AZURE_CLIENT_ID for Azure client ID.
- AZURE_TENANT_ID for Azure tenant ID.
- AZURE_CLIENT_SECRET or AZURE_CLIENT_CERTIFICATE_PATH for client secret or client certificate.
- Azure subscription ID can also be configured via the environment variable AZURE_SUBSCRIPTION_ID.
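These are the standard variables consumed by the Azure Identity environment-based credentials. The tool's internal credential code isn't shown in this document, but a minimal sketch of acquiring an Azure Resource Manager token from those variables with the azure-identity Java library (an assumption, not necessarily what the tool uses) looks like:

```scala
import com.azure.core.credential.TokenRequestContext
import com.azure.identity.DefaultAzureCredentialBuilder

object TokenSketch {
  def main(args: Array[String]): Unit = {
    // Picks up AZURE_CLIENT_ID, AZURE_TENANT_ID and AZURE_CLIENT_SECRET /
    // AZURE_CLIENT_CERTIFICATE_PATH from the environment.
    val credential = new DefaultAzureCredentialBuilder().build()
    val token = credential
      .getToken(new TokenRequestContext().addScopes("https://management.azure.com/.default"))
      .block()
    println(s"Token expires at ${token.getExpiresAt}")
  }
}
```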
You can add an additional source by extending com.microsoft.azure.adf.dataflow.source.MdfScriptCodeSource and registering it as a service in META-INF/services/com.microsoft.azure.adf.dataflow.source.MdfScriptCodeSource.
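Since the members of MdfScriptCodeSource aren't documented here, the sketch below only shows the service-loader wiring; MyBlobScriptCodeSource is a hypothetical implementation name:

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

import com.microsoft.azure.adf.dataflow.source.MdfScriptCodeSource

// A custom source (e.g. a hypothetical com.example.mdf.source.MyBlobScriptCodeSource
// extending MdfScriptCodeSource) is registered by listing its fully qualified name in
// META-INF/services/com.microsoft.azure.adf.dataflow.source.MdfScriptCodeSource.
// Anything registered that way is discoverable through the standard Java ServiceLoader:
object ListRegisteredSources {
  def main(args: Array[String]): Unit =
    ServiceLoader.load(classOf[MdfScriptCodeSource]).asScala
      .foreach(source => println(source.getClass.getName))
}
```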
Prerequisites
- Java 11
- SBT 1.9.3 (only if you want to build from source)
- Scala 2.12.11 (only if you want to build from source)
- Python 3.10.11
Nbformat Installation
The tool uses the nbformat package to create notebooks and jep to execute Python modules from Java. Run the following commands to install jep and nbformat:
pip install jep
pip install nbformat
The entry point for the tool is com.microsoft.azure.adf.tool.MdfToSpark; the tool can read MDF script code from a file or from the Data Flow GET REST API.
Converting script code from a file to Spark code requires the following program arguments:
- --inputpath - absolute script code path (including file name)
- --flowname - data flow name
- --className - output Spark and PySpark file name
- --appName - Spark job name
- --source file
java -Djava.library.path=C:\Python310\Lib\site-packages\jep -jar mdf-to-spark-assembly-0.1.jar --source file --flowname <flowname> --inputpath <scriptcode file path> --className <output scala class> --appName <spark job name>
Reading script code using the REST API and converting it to Spark code requires the following program arguments, plus the AZURE_SUBSCRIPTION_ID environment variable:
- --rg - resource group
- --factoryName - factory name
- --dataFlowName - data flow name
- --className - output Spark and PySpark file name
- --appName - Spark job name
- --source api
A few optional program arguments (for notebook generation) are:
- --lakeHouseId - existing target Microsoft Fabric lakehouse ID
- --lakeHouseName - existing target Microsoft Fabric lakehouse name
- --workSpaceId - existing target Microsoft Fabric workspace ID
java -Djava.library.path=<<path of python3>>\site-packages\jep -jar mdf-to-spark-assembly-<version>.jar --source (api or file) --rg <resource group> --factoryName <adf name> \
--dataFlowName <dataFlowName> --className <output scala class> --appName <spark job name> --lakeHouseId <lakeHouseId> --lakeHouseName <lakeHouseName> --workSpaceId <Fabric workspaceId>
The tool currently supports a limited set of properties for the following mapping data flow transformations:
- Source
- Select Column Mapping
- Join
- Union
- Filter
- Sink
- Sort
- 🆕 Flatten (unroll_by only, without unroll_root)
You can debug parser combinators extending com.microsoft.azure.adf.dataflow.parser.syntactical.spark.BaseStandardTokenParser by passing the Java system property -Ddebug=true.
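For example, adding the flag to the file-source command shown earlier:
java -Ddebug=true -Djava.library.path=C:\Python310\Lib\site-packages\jep -jar mdf-to-spark-assembly-0.1.jar --source file --flowname <flowname> --inputpath <scriptcode file path> --className <output scala class> --appName <spark job name>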