<h1> Example of TabularDataset Categoricals Issue </h1>

Native pandas preserves categorical data types when writing to and reading from Parquet. Unfortunately, Azure's TabularDataset implementation does not preserve this behavior. This notebook demonstrates the difference.

Preserving categoricals is more of a nice-to-have than a requirement, but I could imagine cases, such as GBM models that accept categorical features, where this behavior prevents the use of TabularDatasets.

Code developed using Python 3.6 via AzureML.

<h2> Pure Pandas Behavior </h2>
This section demonstrates that pandas can write a dataframe containing categoricals to a Parquet file and then read the data back with the categorical information intact. I was using pandas 0.25.3.

In [1]:
import pandas as pd
pd.__version__

'0.25.3'

In [55]:
import numpy as np
np.__version__

'1.18.5'

In [14]:
# Directory into which you want to save the parquet files
import os

base_dir = os.getenv('HOME')

In [9]:
# Create an example dataframe with a categorical feature
# Note I scramble the category order to illustrate that category order is preserved during
# parquet read/write
cat_in_df = pd.DataFrame.from_dict({
    'x': list(range(100)),
    'y': pd.Categorical(['a', 'b', 'c', 'd', 'e'] * 20,
                        categories=['e', 'a', 'b', 'd', 'c']),
})
cat_in_df.dtypes

x       int64
y    category
dtype: object

In [12]:
# List the categories - note the deliberately scrambled order
cat_in_df['y'].cat.categories

Index(['e', 'a', 'b', 'd', 'c'], dtype='object')

In [15]:
# Save the data to parquet
cat_in_df.to_parquet(os.path.join(base_dir, 'cat_in_df.parquet'))

In [16]:
# Retrieve from parquet.  Verify the categorical feature remains unchanged
cat_out_df = pd.read_parquet(os.path.join(base_dir, 'cat_in_df.parquet'))
cat_out_df.dtypes

x       int64
y    category
dtype: object

In [17]:
# List the categories - note the scrambled order is preserved!
cat_out_df['y'].cat.categories

Index(['e', 'a', 'b', 'd', 'c'], dtype='object')

Everything is fine.  I am able to write categoricals to Parquet, then read the information back.
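
The visual comparison above can also be made programmatic. The following is a minimal sketch, assuming a hypothetical helper named `categoricals_match` (it is not part of pandas or of the original notebook), that checks whether two dataframes agree on which columns are categorical and on the order of their categories. It is demonstrated on in-memory frames so that it runs without a Parquet engine; in the notebook it could be called as `categoricals_match(cat_in_df, cat_out_df)`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def categorical_columns(df):
    """Return the names of the columns whose dtype is categorical."""
    return [col for col in df.columns
            if isinstance(df[col].dtype, CategoricalDtype)]


def categoricals_match(left, right):
    """Hypothetical helper: True if both frames have the same categorical
    columns with the same categories in the same order."""
    left_cols = categorical_columns(left)
    if left_cols != categorical_columns(right):
        return False
    return all(
        list(left[col].cat.categories) == list(right[col].cat.categories)
        for col in left_cols
    )


# Small stand-ins for the frames used in the notebook
original = pd.DataFrame({
    'x': list(range(10)),
    'y': pd.Categorical(['a', 'b', 'c', 'd', 'e'] * 2,
                        categories=['e', 'a', 'b', 'd', 'c']),
})
same = original.copy()
# Simulate a round trip that drops the categorical dtype
degraded = original.assign(y=original['y'].astype(object))

print(categoricals_match(original, same))
print(categoricals_match(original, degraded))
```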

<h2> Azure Setup </h2>
I want to show the same series of operations via Azure TabularDataset, but first I need to do some setup for the Azure Machine Learning environment.  You will need to use your own workspace, blob storage, etc. 

In [18]:
# Connect to workspace - use your own info here
from azureml.core.workspace import Workspace

ws = Workspace.get(name='YOUR-INFO-HERE',  # Put in your own info
                   subscription_id='YOUR-INFO-HERE',
                   resource_group='YOUR-INFO-HERE')

In [19]:
# connect to a datastore - use your own info here
from azureml.core import Datastore
output_datastore = Datastore.get(ws, 'YOUR-INFO-HERE')

In [20]:
# Set a target path on your datastore where you want to save files
output_path = 'YOUR-INFO-HERE'

<h2> TabularDataset Behavior </h2>
In this section I show that the categorical information is lost when the same set of operations is performed using the Azure TabularDataset wrapper.

In [21]:
# Move the parquet file to the datastore
output_datastore.upload_files(files=[os.path.join(base_dir, 'cat_in_df.parquet')],
                              target_path=output_path,
                              overwrite=True)

Uploading an estimated of 1 files
Uploading /home/azureuser/cat_in_df.parquet
Uploaded /home/azureuser/cat_in_df.parquet, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_46278f9661044f5aadd8b7e3f2508339

In [22]:
# Make the initial file into a dataset
from azureml.core import Dataset

cat_in_dset = Dataset.Tabular.from_parquet_files(
    path=[(output_datastore, '/'.join([output_path, 'cat_in_df.parquet']))])

In [23]:
# Read the dataset into Pandas
cat_in_df_2 = cat_in_dset.to_pandas_dataframe()

<h3> Here is the issue - Lost categorical information </h3>

In [24]:
# Note the categorical information has been lost!
cat_in_df_2.dtypes

x     int64
y    object
dtype: object
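
One possible workaround, sketched here under assumptions rather than offered as a fix for the underlying behavior, is to re-apply the categorical dtypes after loading, using the original dataframe (or any saved record of its dtypes) as a reference. The helper name `restore_categoricals` is hypothetical, and the `loaded` frame below only simulates the result of `to_pandas_dataframe()` so that the example runs without an Azure workspace.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def restore_categoricals(df, reference):
    """Hypothetical helper: cast columns of `df` back to the categorical
    dtypes found in `reference`, leaving all other columns untouched."""
    restored = df.copy()
    for col in reference.columns:
        dtype = reference[col].dtype
        if isinstance(dtype, CategoricalDtype) and col in restored.columns:
            # Casting to the reference dtype restores the category order too
            restored[col] = restored[col].astype(dtype)
    return restored


# Stand-in for the original frame written to Parquet
reference = pd.DataFrame({
    'x': list(range(10)),
    'y': pd.Categorical(['a', 'b', 'c', 'd', 'e'] * 2,
                        categories=['e', 'a', 'b', 'd', 'c']),
})
# Simulated TabularDataset output: same values, but 'y' is plain object
loaded = reference.assign(y=reference['y'].astype(object))

fixed = restore_categoricals(loaded, reference)
print(fixed.dtypes)
print(fixed['y'].cat.categories)
```

This approach requires keeping the category definitions somewhere outside the dataset, which is exactly the bookkeeping that native Parquet round-tripping in pandas avoids.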