Module 02A - Register & Scan (ADLS Gen2)

📢 Introduction

To populate Microsoft Purview with assets for data discovery and understanding, you must register the sources that exist across your data estate so that you can use the out-of-the-box scanning capabilities. Scanning enables Microsoft Purview to extract technical metadata, such as the fully qualified name, schema, and data types, and to apply classifications by parsing a sample of the underlying data.

In this module, you'll walk through how to register and scan a data source. You'll create a new collection for your first data source, upload data, and configure a scan. By the end of this module, you'll have technical metadata, such as schema information, stored in Microsoft Purview. You can then start linking this metadata to business terms, making it easier for your team members to find data.

🤔 Prerequisites

🔨 Tools

🎯 Objectives

  • Create a collection.
  • Register and scan an Azure Data Lake Storage Gen2 account using the Microsoft Purview managed identity.

📑 Table of Contents

| # | Section | Role |
| --- | --- | --- |
| 1 | Grant the Microsoft Purview Managed Identity Access | Azure Administrator |
| 2 | Upload Data to Azure Data Lake Storage Gen2 Account | Azure Administrator |
| 3 | Create a Collection | Collection Administrator |
| 4 | Register a Source (ADLS Gen2) | Data Source Administrator |
| 5 | Scan a Source with the Microsoft Purview Managed Identity | Data Source Administrator |
| 6 | View Assets | Data Reader |

1. Grant the Microsoft Purview Managed Identity Access

💡 Did you know?

To scan a source, Microsoft Purview requires a set of credentials. For Azure Data Lake Storage Gen2, Microsoft Purview supports the following authentication methods.

  • System-assigned Managed Identity (recommended)
  • User-assigned Managed Identity
  • Service Principal
  • Account Key

In this module we will walk through how to grant the Microsoft Purview system-assigned managed identity the necessary access to successfully configure and run a scan.

  1. Navigate to your Azure Data Lake Storage Gen2 account (e.g. pvlab{randomId}adls) and select Access Control (IAM) from the left navigation menu.

  2. Click Add role assignment.

  3. Filter the list of roles by searching for Storage Blob Data Reader, click the row to select the role, and then click Next.

    Access Control Role

  4. Under Assign access to, select Managed identity and click + Select members. Select Microsoft Purview account from the Managed identity drop-down menu, select the managed identity for your Microsoft Purview account (e.g. pvlab-{randomId}-pv), and click Select. Finally, click Review + assign.

    Access Control Members

  5. Click Review + assign once more to perform the role assignment.

    Access Control Assign

  6. To confirm the role has been assigned, navigate to the Role assignments tab and filter the Scope to This resource. You should be able to see that the Microsoft Purview managed identity has been granted the Storage Blob Data Reader role.

    Role Assignment
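
If you prefer to script this step, the role assignment above can also be expressed as an Azure Resource Manager request. The snippet below is a minimal sketch and not part of the original lab: it only builds the URL and JSON body and sends nothing. The function name `build_role_assignment` and its parameters are hypothetical, the principal ID stands for the object ID of the Microsoft Purview managed identity, and the `api-version` value is an assumption you should check against the current Azure documentation.

```python
import uuid

# Built-in role definition ID for "Storage Blob Data Reader".
STORAGE_BLOB_DATA_READER = "2a2b9908-6ea1-4ae2-8e65-a410df84e7d1"


def build_role_assignment(subscription_id, resource_group, storage_account,
                          principal_id, assignment_name=None):
    """Return (url, body) for a role-assignment PUT scoped to one storage account."""
    # The scope limits the assignment to this storage account only.
    scope = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )
    # Each role assignment is identified by a GUID; generate one if none is given.
    name = assignment_name or str(uuid.uuid4())
    url = (
        f"https://management.azure.com{scope}"
        f"/providers/Microsoft.Authorization/roleAssignments/{name}"
        "?api-version=2022-04-01"  # assumption: verify the current version
    )
    body = {
        "properties": {
            "roleDefinitionId": (
                f"/subscriptions/{subscription_id}"
                "/providers/Microsoft.Authorization/roleDefinitions/"
                f"{STORAGE_BLOB_DATA_READER}"
            ),
            "principalId": principal_id,
            # Managed identities are represented as service principals.
            "principalType": "ServicePrincipal",
        }
    }
    return url, body
```

To actually apply the assignment, you would send `body` as JSON in an authenticated `PUT` to `url`, using an identity that is allowed to manage access on the storage account.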

2. Upload Data to Azure Data Lake Storage Gen2 Account

Before proceeding with the following steps, you will need to:

  • Download and install Azure Storage Explorer.
  • Open Azure Storage Explorer.
  • Sign in to Azure via View > Account Management > Add an account....

  1. Download a copy of the Bing Coronavirus Query Set to your local machine. Note: This data set was originally sourced from Microsoft Research Open Data.

  2. Locate the downloaded zip file via File Explorer and unzip the contents by right-clicking the file and selecting Extract All....

    Extract zip file

  3. Click Extract.

    Extract

  4. Open Azure Storage Explorer, click on the Toggle Explorer icon, and expand the Azure subscription to find your Azure Storage Account. Right-click on Blob Containers and select Create Blob Container. Name the container raw.

    Create Blob Container

  5. With the container name selected, click on the Upload button and select Upload Folder....

    Upload Folder

  6. Click on the ellipsis to select a folder.

    Browse

  7. Navigate to the extracted BingCoronavirusQuerySet folder (e.g. Downloads\BingCoronavirusQuerySet) and click Select Folder.

    Folder

  8. Click Upload.

    Upload

  9. Monitor the Activities until the transfer is complete.

    Transfer Complete
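
The extract and upload steps above can be pictured as a mapping from local files to blob names. The sketch below is an illustration only, using the Python standard library: it unzips an archive and lists the blob name each file would receive if the folder name is kept as a prefix inside the container. It does not contact Azure, and the helper names `extract_archive` and `plan_folder_upload` are hypothetical. Keeping the folder name as a prefix is an assumption about how a folder upload lays out blobs.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Unzip the downloaded archive into dest_dir and return the destination path."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


def plan_folder_upload(folder):
    """Map each local file under `folder` to a blob name prefixed with the folder name."""
    root = Path(folder)
    plan = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Blob names use forward slashes regardless of the local OS.
            relative = path.relative_to(root).as_posix()
            plan[path] = f"{root.name}/{relative}"
    return plan
```

Reviewing the plan before uploading is a cheap way to confirm that the paths the scan will later discover under the raw container look the way you expect.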

3. Create a Collection

💡 Did you know?

Collections in Microsoft Purview can be used to organize data sources, scans, and assets in a hierarchical model based on how your organization plans to use Microsoft Purview. The collection hierarchy also forms the security boundary for your metadata to ensure users don't have access to data they don't need (e.g. sensitive metadata).

For more information, check out Collection Architectures and Best Practices.

  1. Open the Microsoft Purview Governance Portal, navigate to Data Map > Collections, and click Add a collection.

    New Collection

  2. Give the collection a Name (e.g. Contoso) and click Create.

    New Collection
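
Collections can also be created programmatically. The following is a hedged sketch of what such a request might look like: it only assembles a URL and JSON body and sends nothing. The endpoint path, the `api-version`, the body field names, and the convention that the root collection shares the account name are assumptions based on the Microsoft Purview account REST API, so verify them against the current reference. The function name `build_collection_request` is hypothetical.

```python
def build_collection_request(account_name, collection_name, friendly_name,
                             parent_collection=None):
    """Return (url, body) for creating or updating a collection."""
    # Assumption: the root collection is named after the Purview account,
    # so a new top-level collection uses it as its parent.
    parent = parent_collection or account_name
    url = (
        f"https://{account_name}.purview.azure.com"
        f"/account/collections/{collection_name}"
        "?api-version=2019-11-01-preview"  # assumption: verify the current version
    )
    body = {
        "friendlyName": friendly_name,
        "parentCollection": {"referenceName": parent},
    }
    return url, body
```

Because every collection names its parent, a chain of such requests builds the hierarchy that also acts as the security boundary described above.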

4. Register a Source (ADLS Gen2)

  1. Open the Microsoft Purview Governance Portal, navigate to Data Map > Sources, and click on Register.

    Register

  2. Search for Data Lake, select Azure Data Lake Storage Gen2, and click Continue.

    Sources

  3. Select the Azure subscription, Storage account name, Collection, and click Register.

    💡 Did you know?

    At this point, we have simply registered a data source. Assets are not written to the catalog until after a scan has finished running.

    Source Properties
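
For reference, registering the same source through an API amounts to describing the storage endpoint and the target collection. The sketch below builds such a description without sending it. The scanning endpoint path, the `api-version`, the `AdlsGen2` kind, and the property names are assumptions drawn from the Microsoft Purview scanning REST API and should be checked before use. `build_adls_gen2_registration` is a hypothetical helper name.

```python
def build_adls_gen2_registration(account_name, source_name, storage_account,
                                 collection_name):
    """Return (url, body) describing an ADLS Gen2 data source registration."""
    url = (
        f"https://{account_name}.purview.azure.com"
        f"/scan/datasources/{source_name}"
        "?api-version=2022-02-01-preview"  # assumption: verify the current version
    )
    body = {
        "kind": "AdlsGen2",
        "properties": {
            # ADLS Gen2 accounts are addressed through the dfs endpoint.
            "endpoint": f"https://{storage_account}.dfs.core.windows.net/",
            # The collection the registered source will belong to.
            "collection": {
                "type": "CollectionReference",
                "referenceName": collection_name,
            },
        },
    }
    return url, body
```

Note that the request carries no asset information: as with the portal steps, registration only records where the source lives, and assets appear only after a scan.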

5. Scan a Source with the Microsoft Purview Managed Identity

  1. Open the Microsoft Purview Governance Portal, navigate to Data Map > Sources, and within the Azure Data Lake Storage Gen2 tile, click the New Scan button.

    New Scan

  2. Click Test connection to ensure the Microsoft Purview managed identity has the appropriate level of access to read the Azure Data Lake Storage Gen2 account. If successful, click Continue.

    Test Connection

  3. Expand the hierarchy to see which assets will be within the scan's scope, and click Continue.

    Scan Scope

  4. Select the system default scan rule set and click Continue.

    💡 Did you know?

    Scan Rule Sets determine which File Types and Classification Rules are in scope. If you want to include a custom file type or custom classification rule as part of a scan, a custom scan rule set will need to be created.

    Scan rule set

  5. Select Once and click Continue.

    Scan Trigger

  6. Click Save and Run.

    Run Scan

  7. To monitor the progress of the scan run, click View Details.

    View Details

  8. Click Refresh to periodically update the status of the scan. Note: The scan will take approximately 5 to 10 minutes to complete.

    Monitor Scan
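
Clicking Refresh until the scan finishes is a manual form of polling. If you automate scans, the same idea can be sketched as a small loop like the one below. This is an illustration under stated assumptions: `get_status` is a hypothetical callable that you would implement to return the current scan run status, and the terminal status labels are assumed rather than taken from this lab.

```python
import time

# Assumption: status labels that mean the scan run has stopped.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def wait_for_scan(get_status, interval_seconds=30.0, timeout_seconds=900.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal state is returned or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"scan still in state {status!r} after timeout")
        # Wait before asking again, like pausing between clicks on Refresh.
        sleep(interval_seconds)
```

The default timeout of 15 minutes leaves some headroom over the 5 to 10 minutes this lab's scan is expected to take. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.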

6. View Assets

  1. Navigate to the Microsoft Purview Governance Portal > Data catalog, and perform a wildcard search by typing an asterisk (*) into the search box and pressing Enter to submit the query.

  2. You should see a list of assets in the search results. These assets were added to the catalog by the scan.
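
The wildcard search above has a programmatic counterpart. The sketch below builds a search request with `*` as the keyword and summarizes a response by asset type. Nothing is sent over the network. The endpoint path, the `api-version`, the `keywords` and `limit` fields, and the `value` and `entityType` response fields are assumptions based on the Microsoft Purview catalog search REST API and should be verified. Both function names are hypothetical.

```python
from collections import Counter

SEARCH_API_VERSION = "2022-03-01-preview"  # assumption: verify the current version


def build_wildcard_search(account_name, limit=50):
    """Return (url, body) for a catalog search that matches every asset."""
    url = (
        f"https://{account_name}.purview.azure.com"
        f"/catalog/api/search/query?api-version={SEARCH_API_VERSION}"
    )
    body = {"keywords": "*", "limit": limit}
    return url, body


def count_by_entity_type(response):
    """Count the returned assets per entity type (assumed response layout)."""
    items = response.get("value", [])
    return Counter(item.get("entityType", "unknown") for item in items)
```

Grouping results by type gives a quick check that the scan produced the kinds of assets you expected from the uploaded folder.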

🎓 Knowledge Check

https://aka.ms/purviewlab/q02

  1. What type of object can help organize data sources into logical groups?

    A ) Buckets
    B ) Collections
    C ) Groups

  2. At which point does Microsoft Purview begin to populate the data map with assets?

    A ) After a Microsoft Purview account is created
    B ) After a Data Source has been registered
    C ) After a Data Source has been scanned

  3. Which of the following attributes is not automatically assigned to an asset as a result of the system-built scanning functionality?

    A ) Technical Metadata (e.g. Fully Qualified Name, Path, Schema, etc.)
    B ) Glossary Terms (e.g. column Sales Tax is tagged with the Sales Tax glossary term)
    C ) Classifications (e.g. column ccnum is tagged with the Credit Card Number classification)

🎉 Summary

This module provided an overview of how to create a collection, register a source, and trigger a scan.
