# Data Profile
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

A DataProfile collects summary statistics on the data produced by a Dataflow. `Dataflow.get_profile()` executes the Dataflow and returns a newly constructed DataProfile.

In [1]:
import azureml.dataprep as dprep

df = dprep.smart_read_file('data/crime0-10.csv')
profile = df.get_profile()
profile



Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
ID,FieldType.DECIMAL,1.01397e+07,1.01409e+07,10.0,0.0,10.0,0.0,0.0,0.0,10139700.0,10139700.0,10139700.0,10139800.0,10139800.0,10140400.0,10140900.0,10140900.0,10140900.0,10140100.0,409.806,167941.0,0.688352,-1.15364
Case Number,FieldType.STRING,HY329177,HY330421,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Date,FieldType.STRING,07/05/2015 10:10:00 PM,07/05/2015 11:50:00 PM,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Block,FieldType.STRING,011XX W MORSE AVE,121XX S FRONT AVE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
IUCR,FieldType.DECIMAL,460,1811,10.0,0.0,10.0,0.0,0.0,0.0,460.0,473.0,460.0,610.0,975.0,1320.0,1811.0,1811.0,1811.0,1008.7,435.056,189273.0,0.27388,-1.23243
Primary Type,FieldType.STRING,ARSON,THEFT,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Description,FieldType.STRING,$500 AND UNDER,TO VEHICLE,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Location Description,FieldType.STRING,ALLEY,VEHICLE NON-COMMERCIAL,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Arrest,FieldType.BOOLEAN,False,True,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,
Domestic,FieldType.BOOLEAN,False,True,10.0,0.0,10.0,0.0,0.0,0.0,,,,,,,,,,,,,,


In [2]:
print(profile)

ColumnProfile
    name: ID
    type: FieldType.DECIMAL

    min: 10139697.0
    max: 10140868.0
    count: 10.0
    missing_count: 0.0
    not_missing_count: 10.0
    percent_missing: 0.0
    error_count: 0.0
    empty_count: 0.0


    Quantiles:
         0.1%: 10139697.0
           1%: 10139709.5
           5%: 10139697.0
          25%: 10139762.0
          50%: 10139830.5
          75%: 10140379.0
          95%: 10140868.0
          99%: 10140868.0
        99.9%: 10140868.0

    mean: 10140062.299999999
    std: 409.80565854762824
    variance: 167940.67777765525
    skewness: 0.6883524570707924
    kurtosis: -1.1536428984684086 

ColumnProfile
    name: Case Number
    type: FieldType.STRING

    min: HY329177
    max: HY330421
    count: 10.0
    missing_count: 0.0
    not_missing_count: 10.0
    percent_missing: 0.0
    error_count: 0.0
    empty_count: 0.0


ColumnProfile
    name: Date
    type: FieldType.STRING

    min: 07/05/2015 10:10:00 PM
    max: 07/05/2015 11:50:00 PM
  

A DataProfile contains a collection of ColumnProfiles, indexed by column name. Each ColumnProfile has attributes for the calculated column statistics.

In [3]:
profile.columns['ID']

Unnamed: 0,Statistics
Type,FieldType.DECIMAL
Min,1.01397e+07
Max,1.01409e+07
Count,10
Missing Count,0
Not Missing Count,10
Percent missing,0
Error Count,0
Empty count,0
0.1% Quantile,1.01397e+07
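
Because the profile is keyed by column name, it is easy to scan every column for a single statistic. The following is a minimal sketch, not part of the library: `columns_with_missing` is a hypothetical helper, and it assumes each ColumnProfile exposes a `percent_missing` attribute matching the field name printed above. The stand-in objects and their values are made up so that the example runs without a Dataflow.

```python
from types import SimpleNamespace


def columns_with_missing(profile, threshold=0.0):
    """Return the names of columns whose percent_missing exceeds threshold.

    Assumes `profile.columns` maps column names to objects that have a
    `percent_missing` attribute (an assumption based on the printed output).
    """
    return [
        name
        for name, column in profile.columns.items()
        if column.percent_missing > threshold
    ]


# Hypothetical stand-in for a DataProfile; the values are illustrative only.
fake_profile = SimpleNamespace(
    columns={
        "complete_column": SimpleNamespace(percent_missing=0.0),
        "sparse_column": SimpleNamespace(percent_missing=10.0),
    }
)

flagged = columns_with_missing(fake_profile)
print(flagged)
```

With a real profile, the same helper could be called as `columns_with_missing(profile)`, provided the attribute name assumption holds.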


We can also extract a specific attribute across all columns by using a dict comprehension.

In [4]:
column_types = {c.name: c.type for c in profile.columns.values()}
column_types

{'ID': <FieldType.DECIMAL: 3>,
 'Case Number': <FieldType.STRING: 0>,
 'Date': <FieldType.STRING: 0>,
 'Block': <FieldType.STRING: 0>,
 'IUCR': <FieldType.DECIMAL: 3>,
 'Primary Type': <FieldType.STRING: 0>,
 'Description': <FieldType.STRING: 0>,
 'Location Description': <FieldType.STRING: 0>,
 'Arrest': <FieldType.BOOLEAN: 1>,
 'Domestic': <FieldType.BOOLEAN: 1>,
 'Beat': <FieldType.DECIMAL: 3>,
 'District': <FieldType.DECIMAL: 3>,
 'Ward': <FieldType.DECIMAL: 3>,
 'Community Area': <FieldType.DECIMAL: 3>,
 'FBI Code': <FieldType.STRING: 0>,
 'X Coordinate': <FieldType.DECIMAL: 3>,
 'Y Coordinate': <FieldType.DECIMAL: 3>,
 'Year': <FieldType.DECIMAL: 3>,
 'Updated On': <FieldType.STRING: 0>,
 'Latitude': <FieldType.DECIMAL: 3>,
 'Longitude': <FieldType.DECIMAL: 3>,
 'Location': <FieldType.STRING: 0>}
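
A mapping like this can be inverted to list the columns of each type, which helps when choosing which columns to inspect for numeric statistics. The sketch below is illustrative only: `group_by_type` is a hypothetical helper, and the `FieldType` enum defined here is a local stand-in that mirrors just the three members shown in the output above, so the code runs without `azureml.dataprep`.

```python
from collections import defaultdict
from enum import Enum


class FieldType(Enum):
    """Local stand-in mirroring the members printed above (not the real class)."""
    STRING = 0
    BOOLEAN = 1
    DECIMAL = 3


def group_by_type(column_types):
    """Invert a {column name: type} mapping into {type: [column names]}."""
    grouped = defaultdict(list)
    for name, field_type in column_types.items():
        grouped[field_type].append(name)
    return dict(grouped)


# A few entries copied from the output above.
sample_types = {
    "ID": FieldType.DECIMAL,
    "Case Number": FieldType.STRING,
    "Arrest": FieldType.BOOLEAN,
    "Domestic": FieldType.BOOLEAN,
}

columns_by_type = group_by_type(sample_types)
print(columns_by_type[FieldType.BOOLEAN])
```

Passing the `column_types` dict built in the cell above should work the same way, since the helper only relies on the values being hashable.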

A ColumnProfile may also contain a summary of values with their respective counts. (This is only available if the column has fewer than a thousand unique values.)

In [5]:
profile.columns['Primary Type'].value_counts

[ValueCountEntry(value='CRIMINAL DAMAGE', count=3),
 ValueCountEntry(value='BATTERY', count=2),
 ValueCountEntry(value='THEFT', count=1),
 ValueCountEntry(value='BURGLARY', count=1),
 ValueCountEntry(value='MOTOR VEHICLE THEFT', count=1),
 ValueCountEntry(value='ARSON', count=1),
 ValueCountEntry(value='NARCOTICS', count=1)]
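
Raw counts are often easier to compare as proportions. Here is a small sketch that converts a list of value counts into relative frequencies. The `ValueCountEntry` namedtuple below is a local stand-in with the same `value` and `count` fields as the entries printed above, and `relative_frequencies` is a hypothetical helper rather than a library function.

```python
from collections import namedtuple

# Local stand-in with the same field names as the printed entries.
ValueCountEntry = namedtuple("ValueCountEntry", ["value", "count"])


def relative_frequencies(value_counts):
    """Map each value to its share of the total count."""
    total = sum(entry.count for entry in value_counts)
    if total == 0:
        return {}
    return {entry.value: entry.count / total for entry in value_counts}


# Entries copied from the output above.
entries = [
    ValueCountEntry(value="CRIMINAL DAMAGE", count=3),
    ValueCountEntry(value="BATTERY", count=2),
    ValueCountEntry(value="THEFT", count=1),
    ValueCountEntry(value="BURGLARY", count=1),
    ValueCountEntry(value="MOTOR VEHICLE THEFT", count=1),
    ValueCountEntry(value="ARSON", count=1),
    ValueCountEntry(value="NARCOTICS", count=1),
]

frequencies = relative_frequencies(entries)
print(frequencies["CRIMINAL DAMAGE"])
```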

Numeric ColumnProfiles include an estimated histogram of the data.

In [6]:
profile.columns['District'].histogram

[HistogramBucket(lower_bound=5.0, upper_bound=6.9, count=1.1333333333333335),
 HistogramBucket(lower_bound=6.9, upper_bound=8.8, count=1.1666666666666672),
 HistogramBucket(lower_bound=8.8, upper_bound=10.7, count=1.549999999999999),
 HistogramBucket(lower_bound=10.7, upper_bound=12.6, count=0.8499999999999996),
 HistogramBucket(lower_bound=12.6, upper_bound=14.5, count=0.6333333333333337),
 HistogramBucket(lower_bound=14.5, upper_bound=16.4, count=1.2666666666666666),
 HistogramBucket(lower_bound=16.4, upper_bound=18.299999999999997, count=0.47499999999999964),
 HistogramBucket(lower_bound=18.299999999999997, upper_bound=20.2, count=0.4750000000000014),
 HistogramBucket(lower_bound=20.2, upper_bound=22.099999999999998, count=0.47499999999999787),
 HistogramBucket(lower_bound=22.099999999999998, upper_bound=24.0, count=0.9750000000000014)]
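
The bucket counts above are fractional because the histogram is an estimate rather than an exact tally. One way to use such buckets is to approximate how many values fall inside an arbitrary range. The sketch below does this under the assumption that values are spread uniformly within each bucket, which is a simplification introduced here and not something the library states. `HistogramBucket` is a local stand-in namedtuple with the field names shown above, and `estimated_count_between` is a hypothetical helper.

```python
from collections import namedtuple

# Local stand-in with the same field names as the printed buckets.
HistogramBucket = namedtuple(
    "HistogramBucket", ["lower_bound", "upper_bound", "count"]
)


def estimated_count_between(histogram, low, high):
    """Estimate the number of values in [low, high].

    Each bucket contributes its count scaled by the fraction of the bucket
    that overlaps the range (uniform-within-bucket assumption).
    """
    total = 0.0
    for bucket in histogram:
        width = bucket.upper_bound - bucket.lower_bound
        overlap = min(high, bucket.upper_bound) - max(low, bucket.lower_bound)
        if width > 0 and overlap > 0:
            total += bucket.count * overlap / width
    return total


# First three buckets copied from the output above.
buckets = [
    HistogramBucket(lower_bound=5.0, upper_bound=6.9, count=1.1333333333333335),
    HistogramBucket(lower_bound=6.9, upper_bound=8.8, count=1.1666666666666672),
    HistogramBucket(lower_bound=8.8, upper_bound=10.7, count=1.549999999999999),
]

first_bucket_estimate = estimated_count_between(buckets, 5.0, 6.9)
three_bucket_estimate = estimated_count_between(buckets, 5.0, 10.7)
print(first_bucket_estimate, three_bucket_estimate)
```

A range that matches the bucket edges simply sums the bucket counts, while a range that cuts through a bucket takes a proportional share of it.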

For columns containing data of mixed types, the ColumnProfile also contains a count of the values of each type.