# One Hot Encoding
Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

DataPrep can perform one hot encoding on a selected column using `one_hot_encode`. The resulting dataflow has a new binary column for each categorical label encountered in the selected column.

In [1]:
import azureml.dataprep as dprep
dataflow = dprep.read_csv(path='./data/crime0-10.csv')
dataflow.head(10)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,False,...,41,10,06,1129230.0,1933315.0,2015,07/12/2015 12:42:46 PM,41.973309466,-87.800174996,"(41.973309466, -87.800174996)"
1,10139776,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,False,True,...,49,1,08B,1167370.0,1946271.0,2015,07/12/2015 12:42:46 PM,42.008124017,-87.65955018,"(42.008124017, -87.65955018)"
2,10140270,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9,53,08B,,,2015,07/12/2015 12:42:46 PM,,,
3,10139885,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37,25,05,1141721.0,1907465.0,2015,07/12/2015 12:42:46 PM,41.902152027,-87.754883404,"(41.902152027, -87.754883404)"
4,10140379,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27,28,07,1168413.0,1901632.0,2015,07/12/2015 12:42:46 PM,41.885610142,-87.657008701,"(41.885610142, -87.657008701)"
5,10140868,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,False,False,...,34,53,14,1172409.0,1826485.0,2015,07/12/2015 12:42:46 PM,41.6793109,-87.644545209,"(41.6793109, -87.644545209)"
6,10139762,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020,ARSON,BY FIRE,VACANT LOT/LAND,False,False,...,12,58,09,1159436.0,1879658.0,2015,07/12/2015 12:42:46 PM,41.825500607,-87.690578042,"(41.825500607, -87.690578042)"
7,10139722,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,...,24,29,18,1152687.0,1891389.0,2015,07/12/2015 12:42:46 PM,41.857827814,-87.715028789,"(41.857827814, -87.715028789)"
8,10139774,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,46,3,14,1164821.0,1932394.0,2015,07/12/2015 12:42:46 PM,41.970099796,-87.669324377,"(41.970099796, -87.669324377)"
9,10139697,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,False,False,...,16,63,14,1160997.0,1865851.0,2015,07/12/2015 12:42:46 PM,41.787580282,-87.685233078,"(41.787580282, -87.685233078)"


To use `one_hot_encode` on a dataflow, specify the source column. `one_hot_encode` determines the distinct values (the categorical labels) in the source column from the current data and returns a new dataflow with a new binary column for each categorical label. Note that the categorical labels are remembered in the dataflow step.

In [2]:
result_dataflow = dataflow.one_hot_encode(source_column='Location Description')
result_dataflow.head(10)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Location Description_STREET,Location Description_ALLEY,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,1,0,...,41,10,06,1129230.0,1933315.0,2015,07/12/2015 12:42:46 PM,41.973309466,-87.800174996,"(41.973309466, -87.800174996)"
1,10139776,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,1,0,...,49,1,08B,1167370.0,1946271.0,2015,07/12/2015 12:42:46 PM,42.008124017,-87.65955018,"(42.008124017, -87.65955018)"
2,10140270,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,1,0,...,9,53,08B,,,2015,07/12/2015 12:42:46 PM,,,
3,10139885,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,0,0,...,37,25,05,1141721.0,1907465.0,2015,07/12/2015 12:42:46 PM,41.902152027,-87.754883404,"(41.902152027, -87.754883404)"
4,10140379,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,1,0,...,27,28,07,1168413.0,1901632.0,2015,07/12/2015 12:42:46 PM,41.885610142,-87.657008701,"(41.885610142, -87.657008701)"
5,10140868,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,0,0,...,34,53,14,1172409.0,1826485.0,2015,07/12/2015 12:42:46 PM,41.6793109,-87.644545209,"(41.6793109, -87.644545209)"
6,10139762,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020,ARSON,BY FIRE,VACANT LOT/LAND,0,0,...,12,58,09,1159436.0,1879658.0,2015,07/12/2015 12:42:46 PM,41.825500607,-87.690578042,"(41.825500607, -87.690578042)"
7,10139722,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,0,1,...,24,29,18,1152687.0,1891389.0,2015,07/12/2015 12:42:46 PM,41.857827814,-87.715028789,"(41.857827814, -87.715028789)"
8,10139774,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,0,0,...,46,3,14,1164821.0,1932394.0,2015,07/12/2015 12:42:46 PM,41.970099796,-87.669324377,"(41.970099796, -87.669324377)"
9,10139697,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,0,1,...,16,63,14,1160997.0,1865851.0,2015,07/12/2015 12:42:46 PM,41.787580282,-87.685233078,"(41.787580282, -87.685233078)"
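
The encoding shown above can be approximated in plain Python. This is only an illustrative sketch, not DataPrep's implementation: the helper names `learn_labels` and `one_hot_rows` are hypothetical, and collecting labels in first-seen order is an assumption made here for simplicity. It uses the ten `Location Description` values from the output above.

```python
def learn_labels(values):
    """Collect the distinct labels (first-seen order is an assumption of this sketch)."""
    labels = []
    for value in values:
        if value not in labels:
            labels.append(value)
    return labels


def one_hot_rows(values, labels):
    """Return one {label: 0 or 1} mapping per input value."""
    return [{label: int(value == label) for label in labels} for value in values]


# The 'Location Description' values of the ten rows shown above.
locations = [
    'STREET', 'STREET', 'STREET', 'SMALL RETAIL STORE', 'STREET',
    'VEHICLE NON-COMMERCIAL', 'VACANT LOT/LAND', 'ALLEY', 'APARTMENT', 'ALLEY',
]

labels = learn_labels(locations)
encoded = one_hot_rows(locations, labels)

# Pull out two of the binary columns to compare with the table above.
street_column = [row['STREET'] for row in encoded]
alley_column = [row['ALLEY'] for row in encoded]
print(street_column)
print(alley_column)
```

Each row ends up with exactly one column set to 1, namely the column for its own label.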


By default, the names of all the new columns are prefixed with the `source_column` name (for example, `Location Description_STREET`). To specify your own prefix, pass a `prefix` string as the second parameter.

In [3]:
result_dataflow = dataflow.one_hot_encode(source_column='Location Description', prefix='LOCATION_')
result_dataflow.head(10)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,LOCATION_STREET,LOCATION_ALLEY,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,1,0,...,41,10,06,1129230.0,1933315.0,2015,07/12/2015 12:42:46 PM,41.973309466,-87.800174996,"(41.973309466, -87.800174996)"
1,10139776,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,1,0,...,49,1,08B,1167370.0,1946271.0,2015,07/12/2015 12:42:46 PM,42.008124017,-87.65955018,"(42.008124017, -87.65955018)"
2,10140270,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,1,0,...,9,53,08B,,,2015,07/12/2015 12:42:46 PM,,,
3,10139885,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,0,0,...,37,25,05,1141721.0,1907465.0,2015,07/12/2015 12:42:46 PM,41.902152027,-87.754883404,"(41.902152027, -87.754883404)"
4,10140379,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,1,0,...,27,28,07,1168413.0,1901632.0,2015,07/12/2015 12:42:46 PM,41.885610142,-87.657008701,"(41.885610142, -87.657008701)"
5,10140868,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,0,0,...,34,53,14,1172409.0,1826485.0,2015,07/12/2015 12:42:46 PM,41.6793109,-87.644545209,"(41.6793109, -87.644545209)"
6,10139762,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020,ARSON,BY FIRE,VACANT LOT/LAND,0,0,...,12,58,09,1159436.0,1879658.0,2015,07/12/2015 12:42:46 PM,41.825500607,-87.690578042,"(41.825500607, -87.690578042)"
7,10139722,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,0,1,...,24,29,18,1152687.0,1891389.0,2015,07/12/2015 12:42:46 PM,41.857827814,-87.715028789,"(41.857827814, -87.715028789)"
8,10139774,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,0,0,...,46,3,14,1164821.0,1932394.0,2015,07/12/2015 12:42:46 PM,41.970099796,-87.669324377,"(41.970099796, -87.669324377)"
9,10139697,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,0,1,...,16,63,14,1160997.0,1865851.0,2015,07/12/2015 12:42:46 PM,41.787580282,-87.685233078,"(41.787580282, -87.685233078)"
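
The column names in the two outputs above suggest a simple naming rule, sketched below. The function `encoded_column_names` is hypothetical, and the rule itself (a default prefix of the source column name plus an underscore, and a custom prefix prepended verbatim) is inferred from those outputs rather than taken from the DataPrep documentation.

```python
def encoded_column_names(source_column, labels, prefix=None):
    """Build the names of the new binary columns.

    Assumption: when no prefix is given, the source column name followed by
    an underscore is used, as the default output above suggests.
    """
    if prefix is None:
        prefix = source_column + '_'
    # The prefix is prepended as-is, so it should carry its own separator.
    return [prefix + label for label in labels]


labels = ['STREET', 'ALLEY']
default_names = encoded_column_names('Location Description', labels)
custom_names = encoded_column_names('Location Description', labels, prefix='LOCATION_')
print(default_names)
print(custom_names)
```

Under this reading, a prefix such as `'LOCATION'` without the trailing underscore would produce names like `LOCATIONSTREET`, so it is worth including the separator in the prefix string.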


To have more control over the categorical labels, create a builder using `dataflow.builders.one_hot_encode`. The builder lets you preview and modify the categorical labels before generating a new dataflow with the results.

In [4]:
builder = dataflow.builders.one_hot_encode(source_column='Location Description', prefix='LOCATION_')

To generate the categorical labels, call the `learn` method on the builder object.

In [5]:
builder.learn()

To preview the categorical labels, access them through the `categorical_labels` property on the builder object.

In [6]:
builder.categorical_labels

['STREET',
 'ALLEY',
 'APARTMENT',
 'VACANT LOT/LAND',
 'VEHICLE NON-COMMERCIAL',
 'SMALL RETAIL STORE']

To modify the generated `categorical_labels`, assign a new value to `categorical_labels` or modify the existing list in place.  
The following example adds a label that does not appear in the sample data to `categorical_labels`.

In [7]:
builder.categorical_labels.append('TOWNHOUSE')
builder.categorical_labels

['STREET',
 'ALLEY',
 'APARTMENT',
 'VACANT LOT/LAND',
 'VEHICLE NON-COMMERCIAL',
 'SMALL RETAIL STORE',
 'TOWNHOUSE']

Once the labels look right, call `builder.to_dataflow` to get the new dataflow with the encoded labels.

In [8]:
dataflow = builder.to_dataflow()
dataflow.head(10)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,LOCATION_STREET,LOCATION_ALLEY,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,1,0,...,41,10,06,1129230.0,1933315.0,2015,07/12/2015 12:42:46 PM,41.973309466,-87.800174996,"(41.973309466, -87.800174996)"
1,10139776,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,1,0,...,49,1,08B,1167370.0,1946271.0,2015,07/12/2015 12:42:46 PM,42.008124017,-87.65955018,"(42.008124017, -87.65955018)"
2,10140270,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,1,0,...,9,53,08B,,,2015,07/12/2015 12:42:46 PM,,,
3,10139885,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,0,0,...,37,25,05,1141721.0,1907465.0,2015,07/12/2015 12:42:46 PM,41.902152027,-87.754883404,"(41.902152027, -87.754883404)"
4,10140379,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,1,0,...,27,28,07,1168413.0,1901632.0,2015,07/12/2015 12:42:46 PM,41.885610142,-87.657008701,"(41.885610142, -87.657008701)"
5,10140868,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,0,0,...,34,53,14,1172409.0,1826485.0,2015,07/12/2015 12:42:46 PM,41.6793109,-87.644545209,"(41.6793109, -87.644545209)"
6,10139762,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020,ARSON,BY FIRE,VACANT LOT/LAND,0,0,...,12,58,09,1159436.0,1879658.0,2015,07/12/2015 12:42:46 PM,41.825500607,-87.690578042,"(41.825500607, -87.690578042)"
7,10139722,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,0,1,...,24,29,18,1152687.0,1891389.0,2015,07/12/2015 12:42:46 PM,41.857827814,-87.715028789,"(41.857827814, -87.715028789)"
8,10139774,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,0,0,...,46,3,14,1164821.0,1932394.0,2015,07/12/2015 12:42:46 PM,41.970099796,-87.669324377,"(41.970099796, -87.669324377)"
9,10139697,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,0,1,...,16,63,14,1160997.0,1865851.0,2015,07/12/2015 12:42:46 PM,41.787580282,-87.685233078,"(41.787580282, -87.685233078)"
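
To see why adding a label that is missing from the sample can matter, the following plain-Python sketch encodes two batches of values against the same fixed label list. It is a conceptual illustration only: `encode_with_labels` and the `later_values` batch are hypothetical, and how DataPrep itself treats values outside the label list is not covered here.

```python
def encode_with_labels(values, labels, prefix=''):
    """One-hot encode values against a fixed label list (illustrative helper)."""
    return [
        {prefix + label: int(value == label) for label in labels}
        for value in values
    ]


# Labels learned from the sample, plus the manually appended 'TOWNHOUSE'.
labels = ['STREET', 'ALLEY', 'APARTMENT', 'VACANT LOT/LAND',
          'VEHICLE NON-COMMERCIAL', 'SMALL RETAIL STORE']
labels.append('TOWNHOUSE')

sample_values = ['STREET', 'ALLEY']     # values present in the sample
later_values = ['TOWNHOUSE', 'STREET']  # hypothetical data seen later

sample_rows = encode_with_labels(sample_values, labels, prefix='LOCATION_')
later_rows = encode_with_labels(later_values, labels, prefix='LOCATION_')

# The extra column is all zeros on the sample but is set for the later data.
print([row['LOCATION_TOWNHOUSE'] for row in sample_rows])
print([row['LOCATION_TOWNHOUSE'] for row in later_rows])
```

Because both batches are encoded against the same label list, they produce the same set of columns, which keeps the output schema stable when new data arrives.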
