# Understand the data science process

## Explore common machine learning models
The purpose of machine learning is to train models that can identify patterns in large amounts of data. You can then use the patterns to make predictions that provide you with new insights on which you can take actions.

The possibilities with machine learning may appear endless, so let's begin by understanding the four common types of machine learning models:

<img src="../images/01_Get started with Microsoft Fabric/09/machine-learning-tasks.png" alt="four common types of machine learning models" style="border: 2px solid black; border-radius: 10px;">

1. **Classification:** Predict a categorical value like whether a customer may churn.
2. **Regression:** Predict a numerical value like the price of a product.
3. **Clustering:** Group similar data points into clusters or groups.
4. **Forecasting:** Predict future numerical values based on time-series data like the expected sales for the coming month.

To decide which type of machine learning model you need to train, you first need to understand the business problem and the data available to you.
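
As a rough illustration of that decision, the sketch below maps a simplified description of a problem to one of the four model types. The function name `suggest_model_type` and its parameters are hypothetical, and the rules are only a rule of thumb. It assumes you turn to clustering when there's no known value to predict; real projects often need more nuance.

```python
def suggest_model_type(has_known_target: bool,
                       target_is_numeric: bool = False,
                       is_time_series: bool = False) -> str:
    """Map a rough description of the problem to one of the four model types."""
    if not has_known_target:
        # No value to predict: look for groups of similar data points instead.
        return "clustering"
    if not target_is_numeric:
        # Categorical target, such as whether a customer may churn.
        return "classification"
    if is_time_series:
        # Future numerical values based on time-ordered history.
        return "forecasting"
    # Numerical target without a time dimension, such as a product price.
    return "regression"


print(suggest_model_type(has_known_target=True))                          # classification
print(suggest_model_type(has_known_target=True, target_is_numeric=True))  # regression
print(suggest_model_type(has_known_target=False))                         # clustering
```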

## Understand the data science process
To train a machine learning model, the process commonly involves the following steps:

<img src="../images/01_Get started with Microsoft Fabric/09/data-science-process.png" alt="sequential steps in the data science process" style="border: 2px solid black; border-radius: 10px;">

1. **Define the problem:** Together with business users and analysts, decide on what the model should predict and when it's successful.
2. **Get the data:** Find data sources and make the data accessible by storing it in a lakehouse.
3. **Prepare the data:** Explore the data by reading it from the lakehouse into a notebook. Clean and transform the data based on the model's requirements.
4. **Train the model:** Choose an algorithm and hyperparameter values through trial and error, tracking your experiments with MLflow.
5. **Generate insights:** Use model batch scoring to generate the requested predictions.


As a data scientist, most of your time is spent on preparing the data and training the model. How you prepare the data and which algorithm you choose to train a model can influence your model's success.

You can prepare and train a model by using open-source libraries available for the language of your choice. For example, if you work with Python, you can prepare the data with pandas and NumPy, and train a model with libraries like [scikit-learn](https://scikit-learn.org/stable/), [PyTorch](https://pytorch.org/), or [SynapseML](https://microsoft.github.io/SynapseML/).

When experimenting, you want to keep an overview of all the different models you've trained. You want to understand how your choices influence the model's success. By tracking your experiments with MLflow in Microsoft Fabric, you're able to easily manage and deploy the models you've trained.

# Explore and process data with Microsoft Fabric

## Ingest your data into Microsoft Fabric
To work with data in Microsoft Fabric, you first need to ingest it. You can ingest data from multiple sources, both local and in the cloud. For example, you can ingest data from a CSV file stored on your local machine or in Azure Data Lake Storage Gen2.

After connecting to a data source, you can save the data into a Microsoft Fabric **lakehouse**. You can use the lakehouse as a central location to store any structured, semi-structured, and unstructured files. You can then easily connect to the lakehouse whenever you want to access your data for exploration or transformation.

## Explore and transform your data
As a data scientist, you may be most familiar with writing and executing code in **notebooks**. Microsoft Fabric offers a familiar notebook experience, powered by Spark compute.

**Apache Spark** is an open-source parallel processing framework for large-scale data processing and analytics.

Notebooks are automatically attached to Spark compute. When you run a cell in a notebook for the first time, a new Spark session starts. The session persists when you run subsequent cells. The Spark session will automatically stop after some time of inactivity to save costs. You can also manually stop the session.

When you're working in a notebook, you can choose the language you want to use. For data science workloads, you're likely to work with PySpark (Python) or SparkR (R).

<img src="../images/01_Get started with Microsoft Fabric/09/notebooks.png" alt="Screenshot of a notebook in Microsoft Fabric." style="border: 2px solid black; border-radius: 10px;">


Within the notebook, you can explore your data using your preferred library, or with any of the built-in visualization options. If necessary, you can transform your data and save the processed data by writing it back to the lakehouse.
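
The read, transform, and write-back flow can be sketched in PySpark as follows. This is a minimal sketch under several assumptions: the notebook provides a Spark session (passed in here as `spark`), a lakehouse is attached to the notebook, and the processed data is saved as a Delta table. The function name `clean_and_save`, the file path, and the table name are hypothetical, and dropping rows with missing values only stands in for whatever transformation your data needs.

```python
def clean_and_save(spark, source_path: str, table_name: str):
    """Read a CSV file from the lakehouse, drop incomplete rows, and save a table."""
    # Read the raw file; the first row is treated as the header.
    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(source_path)
    )

    # Example transformation: remove rows that contain missing values.
    cleaned = df.dropna()

    # Write the processed data back to the lakehouse as a table.
    cleaned.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return cleaned


# In a notebook (hypothetical path and table name):
# clean_and_save(spark, "Files/raw/sales.csv", "sales_clean")
```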

### Prepare your data with the Data Wrangler
To help you explore and transform your data more quickly, Microsoft Fabric offers the easy-to-use Data Wrangler.

After launching the Data Wrangler, you'll get a descriptive overview of the data you're working with. You can view the summary statistics of your data to find any issues like missing values.

To clean your data, you can choose any of the built-in data-cleaning operations. When you select an operation, a preview of the result and the associated code are automatically generated for you. When you have selected all necessary operations, you can export the transformations to code and run that code on your data.
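
To give an idea of what such exported code could look like, the sketch below applies two hypothetical operations (dropping rows with missing values and dropping duplicate rows) to a pandas DataFrame. It's an illustration written by hand, not actual Data Wrangler output, and the function name `clean_data` is an assumption.

```python
def clean_data(df):
    """Apply the selected cleaning operations to a pandas DataFrame."""
    # Drop rows that have a missing value in any column.
    df = df.dropna()
    # Drop rows that are exact duplicates of an earlier row.
    df = df.drop_duplicates()
    return df


# In a notebook, with a pandas DataFrame named df (hypothetical):
# df_clean = clean_data(df.copy())
```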

# Train and score models with Microsoft Fabric

## Understand experiments
Whenever you train a model in a notebook and want to track that work, you create an experiment in Microsoft Fabric.

An experiment can consist of multiple runs. Each run represents a task you executed in a notebook, like training a machine learning model.

For example, to train a machine learning model for sales forecasting, you can try different training datasets with the same algorithm. Each time you train a model with a different dataset, you create a new experiment run. Then, you can compare the experiment runs to determine the best performing model.

### Start tracking metrics
To compare experiment runs, you can track parameters, metrics, and artifacts for each run.

All parameters, metrics, and artifacts you track in an experiment run are shown in the experiments overview. You can view experiment runs individually in the Run details tab, or compare across runs with the Run list:

<img src="../images/01_Get started with Microsoft Fabric/09/experiment.png" alt="Screenshot of an experiment overview in Microsoft Fabric." style="border: 2px solid black; border-radius: 10px;">

By tracking your work with MLflow, you can compare model training iterations and decide which configuration resulted in the best model for your use case.
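
A minimal sketch of logging one run with the MLflow tracking API might look like this. The function name `track_run`, the experiment name, and the example values are hypothetical; the parameters and metrics you log depend on your own training code. Artifacts can be logged in a similar way but are left out to keep the sketch short.

```python
def track_run(experiment_name: str, params: dict, metrics: dict) -> str:
    """Log one experiment run with its parameters and metrics; return the run ID."""
    # Imported inside the function so the sketch only needs MLflow when it's called.
    import mlflow

    # Make the experiment active (MLflow creates it if it doesn't exist yet).
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        mlflow.log_params(params)    # for example, hyperparameter values
        mlflow.log_metrics(metrics)  # for example, evaluation scores
        return run.info.run_id


# In a notebook (hypothetical names and values):
# run_id = track_run("sales-forecast", {"n_estimators": 100}, {"rmse": 12.5})
```

Calling the function once per training attempt gives you one run per attempt, which you can then compare in the experiment overview.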

## Understand models
After you train a model, you want to use it for scoring. With scoring, you use the model on new data to generate predictions or insights. When you train and track a model with MLflow, artifacts are stored within the experiment run to represent your model and its metadata. You can save these artifacts in Microsoft Fabric as a model.

By saving your model artifacts as a registered model in Microsoft Fabric, you can easily manage your models. Anytime you train a new model and save it under the same name, you add a new version to the model.

<img src="../images/01_Get started with Microsoft Fabric/09/models.png" alt="Screenshot of the model overview in Microsoft Fabric." style="border: 2px solid black; border-radius: 10px;">
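
One way to save a tracked model under a name is MLflow's `register_model` function, sketched below. The sketch assumes the model was logged in the run under the artifact path `model`; the function name `register_run_model` and the example model name are hypothetical.

```python
def register_run_model(run_id: str, model_name: str, artifact_path: str = "model"):
    """Register the model logged in a run; reusing a name adds a new version."""
    # Imported inside the function so the sketch only needs MLflow when it's called.
    import mlflow

    # "runs:/<run_id>/<artifact_path>" points at the model artifacts of that run.
    model_uri = f"runs:/{run_id}/{artifact_path}"
    return mlflow.register_model(model_uri, model_name)


# In a notebook (hypothetical model name):
# register_run_model(run_id, "sales-forecast")
```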

### Use a model to generate insights
To use a model for generating predictions, you can use the PREDICT function in Microsoft Fabric. The PREDICT function is built to easily integrate with MLflow models and allows you to use the model for generating batch predictions.

For example, every week you receive sales data from several stores. Based on the historical data, you've trained a model that can predict the sales for the next week, based on the sales of the last few weeks. You tracked the model with MLflow and saved it in Microsoft Fabric. Whenever the new weekly sales data comes in, you use the PREDICT function to let the model generate the forecast for the next week. The forecasted sales data is stored as a table in a lakehouse, which is visualized in a Power BI report for business users to consume.
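
The weekly scoring step in this scenario could be sketched as follows. The sketch assumes that PREDICT is exposed in Fabric notebooks through the `MLFlowTransformer` class in `synapse.ml.predict`, with the parameters shown; check this against the Fabric documentation for your runtime. The function name `score_weekly_sales`, the column names, the model name, and the table name are hypothetical.

```python
def score_weekly_sales(df, feature_cols, model_name: str, model_version: int,
                       output_table: str):
    """Generate batch predictions with PREDICT and store them in a lakehouse table."""
    # Assumed to be available in Fabric notebooks; imported here so the
    # function can be defined in environments without SynapseML.
    from synapse.ml.predict import MLFlowTransformer

    model = MLFlowTransformer(
        inputCols=feature_cols,      # columns the model expects as input
        outputCol="predictions",     # new column that holds the model output
        modelName=model_name,        # name of the model saved in Fabric
        modelVersion=model_version,  # version of that model to use
    )

    # Add the predictions column to the new data and save the result.
    scored = model.transform(df)
    scored.write.format("delta").mode("overwrite").saveAsTable(output_table)
    return scored


# In a notebook, with a Spark DataFrame of new weekly sales (hypothetical names):
# score_weekly_sales(new_sales_df, ["week_1", "week_2"], "sales-forecast", 1, "sales_forecast")
```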