MLflow is an open source platform from Databricks for managing machine learning pipelines.<br>Apache Spark handles ETL and model building. MLflow has four components:
1. **Tracking**: APIs to record experiment runs and their results, along with the code version and configuration that produced them.
    - **MLflow Tracking** is a logging API that is agnostic to the libraries and environments that actually do the training. It is organized around the concept of runs, which are executions of data science code. Runs are aggregated into experiments where many runs can be a part of a given experiment and an MLflow Tracking Server can host many experiments. You can log to the tracking server using a notebook, local app, or cloud job, as shown in Figure 12-2.
    - Let’s examine the different things you can log to the tracking server:
        - **Parameters**: key-value inputs to your code (e.g., `num_trees` or `max_depth` in your random forest)
        - **Metrics**: numeric values that can update over time (e.g., RMSE or accuracy)
        - **Artifacts**: files, data, and models (e.g., matplotlib images, Parquet files)
        - **Tags and Notes**: information about a run (can update after the run)
        - **Source**: what code ran?
        - **Version**: what version of the code?
2. **Projects**: a simple, conventional format (e.g., with a Dockerfile) for packaging your project so that anyone can reproduce a run on any platform. This adds structure and reproducibility to your model experiments.
3. **Models**: like MLflow Projects, a conventional format, here for packaging your models for deployment to diverse execution environments: cloud, local machine, or containers.
4. **Registry**: a repository of named, versioned models with associated metadata (tags, comments, who created each version and when), providing an easy handoff to DevOps and CI/CD operations.
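
The items listed under Tracking map onto a handful of logging calls. The sketch below is a minimal illustration, not code from the source: `log_run_details` and `RecordingTracker` are hypothetical names, and the sample values are made up. The helper accepts any object exposing `log_params`, `log_metric`, and `set_tags`, which are the names the `mlflow` module uses, so the example runs without a tracking server.

```python
class RecordingTracker:
    """Hypothetical in-memory stand-in for the `mlflow` module (illustration only)."""

    def __init__(self):
        self.params = {}
        self.metrics = []  # (key, value, step) tuples, in logging order
        self.tags = {}

    def log_params(self, params):
        self.params.update(params)

    def log_metric(self, key, value, step=None):
        self.metrics.append((key, value, step))

    def set_tags(self, tags):
        self.tags.update(tags)


def log_run_details(tracker, params, metric_history, tags):
    """Log parameters once, each metric value with its step, then the tags."""
    tracker.log_params(params)
    for key, values in metric_history.items():
        # Metrics can update over time, so each value is logged with its step.
        for step, value in enumerate(values):
            tracker.log_metric(key, value, step=step)
    tracker.set_tags(tags)


# Illustrative values only; they do not come from a real training run.
tracker = RecordingTracker()
log_run_details(
    tracker,
    params={"num_trees": 100, "max_depth": 5},
    metric_history={"rmse": [0.9, 0.7, 0.6]},
    tags={"stage": "exploration"},
)
```

With MLflow installed, the same helper should work inside `with mlflow.start_run():` by passing the `mlflow` module itself as `tracker`.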

By default, MLflow records everything to the local filesystem, but you can specify a database as the backing store for faster querying of, for example, parameters and metrics.
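
As a hedged sketch of switching to a database, MLflow accepts SQLAlchemy-style URIs such as `sqlite:///mlflow.db` for its tracking store. The helper name `sqlite_tracking_uri` and the file name `mlflow.db` are assumptions chosen for illustration; the actual MLflow call is shown only as a comment so the block runs without MLflow installed.

```python
from pathlib import PurePosixPath


def sqlite_tracking_uri(db_path):
    """Build a SQLAlchemy-style SQLite URI for a database file path."""
    # A relative path yields three slashes; an absolute POSIX path yields four.
    return "sqlite:///" + str(PurePosixPath(db_path))


# Hypothetical database file name, used here only as an example.
uri = sqlite_tracking_uri("mlflow.db")
print(uri)

# With MLflow installed, the URI would be applied before starting any runs:
#   import mlflow
#   mlflow.set_tracking_uri(uri)
```

Runs started after `set_tracking_uri` should then record their parameters and metrics in the database file instead of the default local directory.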

    # In Python
    # Assumes rf, pipeline, trainDF, testDF, and regressionEvaluator
    # are already defined earlier in the notebook.
    import mlflow
    import mlflow.spark

    with mlflow.start_run(run_name="random-forest") as run:
        # Log params: Num Trees and Max Depth
        mlflow.log_param("num_trees", rf.getNumTrees())
        mlflow.log_param("max_depth", rf.getMaxDepth())

        # Log model
        pipelineModel = pipeline.fit(trainDF)
        mlflow.spark.log_model(pipelineModel, "model")

        # Log metrics: RMSE and R2 (set the metric name explicitly for each)
        predDF = pipelineModel.transform(testDF)
        rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
        r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
        mlflow.log_metrics({"rmse": rmse, "r2": r2})

        # Log artifact: Feature Importance Scores
        # (the file must exist locally before it is logged)
        ...
        mlflow.log_artifact("feature-importance.csv")
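
Once the run above finishes, the logged model can be referenced by a URI of the form `runs:/<run_id>/<artifact_path>`, which MLflow's model-loading and registration calls accept. The following is a minimal sketch: `run_model_uri` is a hypothetical helper, `abc123` is a placeholder run ID, and the registry name in the comment is invented for illustration. The artifact path `"model"` matches the name passed to `log_model`.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Return the URI of an artifact logged under the given run."""
    return f"runs:/{run_id}/{artifact_path}"


# Placeholder ID for illustration; in practice use run.info.run_id
# from the `with mlflow.start_run(...) as run:` block.
example_uri = run_model_uri("abc123")
print(example_uri)

# With MLflow and Spark available, the URI could then be used as:
#   loaded = mlflow.spark.load_model(example_uri)
#   mlflow.register_model(example_uri, "random-forest-model")  # hypothetical name
```

Registering the model this way is what would create a named version in the Registry component described earlier.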