# 1. Data Preparation with Azure ML Services



#### Note: Some features in this Notebook will _not_ work with the Private Preview version of the SDK; it assumes the Public Preview version.

Wondering how to make the most of the Azure ML Data Prep SDK? This "Getting Started" guide showcases a few highlights that make the SDK shine on big datasets where `pandas` and `dplyr` can fall short. Using the [Ford GoBike dataset](https://www.fordgobike.com/system-data) as an example, we'll build Dataflows that import, profile, transform, and validate data, then export the result as a package.


## 1.1 Environment information

In [1]:
import sys
sys.version

'3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) \n[GCC 7.2.0]'

In [2]:
!pip install azureml-sdk azureml-dataprep



In [3]:
import azureml.core
print("SDK Version:", azureml.core.VERSION)

SDK Version: 0.1.74


In [4]:
from IPython.display import display
from os import path
from tempfile import mkdtemp

import pandas as pd
import azureml.dataprep as dprep

## 2. Importing the data

Azure ML Data Prep supports many different file formats (e.g. CSV, Excel, Parquet), and can also infer column types automatically.

In [5]:
gobike = dprep.read_csv(path='https://dprepdata.blob.core.windows.net/demo/ford_gobike/2017-fordgobike-tripdata.csv')
gobike.head(11)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776434819204745,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,-122.4159292,96,Customer,1987.0,Male
1,78800,2017-12-31 15:56:34.8420,2018-01-01 13:49:55.6170,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,96,Dolores St at 15th St,37.7662102,-122.4266136,88,Customer,1965.0,Female
2,45768,2017-12-31 22:45:48.4110,2018-01-01 11:28:36.8830,245,Downtown Berkeley BART,37.8703477,-122.2677637,245,Downtown Berkeley BART,37.8703477,-122.2677637,1094,Customer,,
3,62172,2017-12-31 17:31:10.6360,2018-01-01 10:47:23.5310,60,8th St at Ringold St,37.77452040113685,-122.4094493687153,5,Powell St BART Station (Market St at 5th St),37.78389935708493,-122.4084448814392,2831,Customer,,
4,43603,2017-12-31 14:23:14.0010,2018-01-01 02:29:57.5710,239,Bancroft Way at Telegraph Ave,37.8688126,-122.258764,247,Fulton St at Bancroft Way,37.8677892,-122.2658964,3167,Subscriber,1997.0,Female
5,9226,2017-12-31 22:51:00.9180,2018-01-01 01:24:47.1660,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,1487,Customer,,
6,4507,2017-12-31 23:49:28.4220,2018-01-01 01:04:35.6190,259,Addison St at Fourth St,37.866249,-122.2993708,259,Addison St at Fourth St,37.866249,-122.2993708,3539,Customer,1991.0,Female
7,4334,2017-12-31 23:46:37.1960,2018-01-01 00:58:51.2110,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,1503,Customer,,
8,4150,2017-12-31 23:37:07.5480,2018-01-01 00:46:18.3080,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,3125,Customer,,
9,4238,2017-12-31 23:35:38.1450,2018-01-01 00:46:17.0530,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,2543,Customer,,
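
To give a feel for what inferring column types involves, here is a minimal plain-Python sketch. It is not how Azure ML Data Prep does it: `infer_type` is a hypothetical helper, and the sample rows are trimmed copies of the output above.

```python
import csv
import io

# Two rows copied from the output above, trimmed to four columns.
RAW = """duration_sec,start_station_name,start_station_latitude,member_birth_year
80110,Laguna St at Hayes St,37.776434819204745,1987.0
45768,Downtown Berkeley BART,37.8703477,
"""


def infer_type(values):
    """Return 'int', 'float' or 'str' for a column of raw CSV strings (hypothetical helper)."""
    present = [v for v in values if v != ""]  # empty cells do not vote
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name  # every non-empty value parsed with this cast
        except ValueError:
            continue
    return "str"


rows = list(csv.DictReader(io.StringIO(RAW)))
inferred = {col: infer_type([row[col] for row in rows]) for col in rows[0]}
print(inferred)
```

Without a step like this, every value read from a CSV file is just text, which matters as soon as we compare or aggregate numbers.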


In order to iterate more quickly, we can take a sample of our data. Later, we can then apply the same transformations to the entire dataset.

In [6]:
sampled_gobike = gobike.take_sample(probability=0.01, seed=5)
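
As a rough mental model of `probability` and `seed`, the sketch below keeps each row independently with the given probability. This is an assumption made for illustration in plain Python; the `take_sample` function defined here is a stand-in, not the SDK's implementation.

```python
import random


def take_sample(rows, probability, seed):
    """Keep each row independently with the given probability (illustrative stand-in)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < probability]


rows = list(range(100_000))
first = take_sample(rows, probability=0.01, seed=5)
second = take_sample(rows, probability=0.01, seed=5)

# Roughly 1% of the rows are kept, and the same seed gives the same sample.
print(len(first), first == second)
```

Fixing the seed is what lets us rerun the notebook and iterate on the same sample each time.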

## 3. Profiling the data

Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that help us glimpse into column types and column summary statistics.

In [7]:
# Profile the full dataset
#gobike.get_profile()

In [8]:
# Profile the sample
sampled_gobike.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Error Count,Lower Quartile,Median,Upper Quartile,Standard Deviation,Mean
duration_sec,FieldType.STRING,1000,997,5268.0,0.0,0.0,,,,,
start_time,FieldType.STRING,2017-06-28 13:48:12.3890,2017-12-31 23:55:09.6860,5268.0,0.0,0.0,,,,,
end_time,FieldType.STRING,2017-06-28 14:00:04.9250,2018-01-01 15:12:50.2450,5268.0,0.0,0.0,,,,,
start_station_id,FieldType.STRING,10,99,5268.0,0.0,0.0,,,,,
start_station_name,FieldType.STRING,10th Ave at E 15th St,Yerba Buena Center for the Arts (Howard St at ...,5268.0,0.0,0.0,,,,,
start_station_latitude,FieldType.STRING,37.3229796,37.88022244590679,5268.0,0.0,0.0,,,,,
start_station_longitude,FieldType.STRING,-121.8766132,-122.44429260492325,5268.0,0.0,0.0,,,,,
end_station_id,FieldType.STRING,10,99,5268.0,0.0,0.0,,,,,
end_station_name,FieldType.STRING,10th Ave at E 15th St,Yerba Buena Center for the Arts (Howard St at ...,5268.0,0.0,0.0,,,,,
end_station_latitude,FieldType.STRING,37.3229796,37.8740141,5268.0,0.0,0.0,,,,,


It appears that we have quite a few missing values in `member_birth_year`. We also immediately see that we have some empty strings in our `member_gender` column. With the data profiler, we can quickly do a sanity check on our dataset and see where we might need to start data cleaning.
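
One more thing the profile tells us: every column shown is typed `FieldType.STRING`, and `duration_sec` reports a Min of 1000 and a Max of 997. That is what we would expect if the values are being ordered as text rather than as numbers. A small plain-Python sketch of the effect, using a handful of made-up duration strings:

```python
# Durations as they would arrive from a CSV file with no type conversion.
durations = ["80110", "1397", "997", "1000", "204"]

# Strings are ordered character by character, so "1000" sorts before "204".
string_min, string_max = min(durations), max(durations)

# Converting to integers restores the numeric ordering.
numeric = [int(d) for d in durations]
numeric_min, numeric_max = min(numeric), max(numeric)

print(string_min, string_max)    # text ordering
print(numeric_min, numeric_max)  # numeric ordering
```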

## 4. Transforming data by example

Azure ML Data Prep comes with additional "smart" transforms created by Microsoft Research. Here, we'll look at how you can derive a new column by providing examples of input-output pairs. Rather than explicitly using regular expressions to extract dates or hours from datetimes, we can provide examples for Azure ML Data Prep to learn what the pattern is. In fact, these smart transformations can also handle more complex derivations like inferring the day of the week from datetimes.

In [9]:
sgb_derived = sampled_gobike\
    .to_string(
        columns=['start_time', 'end_time']
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='date',
        example_data=[('2017-12-31 16:57:39.6540', '2017-12-31'), ('2017-12-31 16:57:39', '2017-12-31')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='hour',
        example_data=[('2017-12-31 16:57:39.6540', '16')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='wday',
        example_data=[('2017-12-31 16:57:39.6540', 'Sunday')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='day',
        example_data=[('2017-12-31 16:57:39.6540', '31')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='month',
        example_data=[('2017-12-31 16:57:39.6540', '12')]
    )
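
For comparison, here is what the same five columns look like when written by hand with the standard library. This is only a sketch of the explicit approach that by-example derivation lets us skip; `derive_parts` and `WEEKDAYS` are hypothetical names, and the timestamp format is assumed from the sample rows.

```python
from datetime import datetime

# Spelled out explicitly so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def derive_parts(start_time):
    """Split a 'YYYY-MM-DD HH:MM:SS.ffff' string into the derived columns."""
    ts = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S.%f")
    return {
        "date": ts.strftime("%Y-%m-%d"),
        "hour": ts.strftime("%H"),
        "wday": WEEKDAYS[ts.weekday()],  # weekday() is 0 for Monday
        "day": ts.strftime("%d"),
        "month": ts.strftime("%m"),
    }


parts = derive_parts("2017-12-31 16:57:39.6540")
print(parts)
```

With `derive_column_by_example` we only supply the input-output pairs, and the format string is never written down.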

In [10]:
sgb_derived.filter(dprep.col('wday') != 'Sunday').head(11)

Unnamed: 0,duration_sec,start_time,month,day,wday,hour,date,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,572,2017-12-30 18:19:05.7310,12,30,Saturday,18,2017-12-30,2017-12-30 18:28:37.8250,323,Broadway at Kearny,37.79801364395978,-122.40595042705534,15,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3033,Customer,1996.0,Male
1,821,2017-12-30 18:13:05.1240,12,30,Saturday,18,2017-12-30,2017-12-30 18:26:46.3650,317,San Salvador St at 9th St,37.333955,-121.877349,304,Jackson St at 5th St,37.3487586867448,-121.89479783177376,2533,Subscriber,1970.0,Male
2,2789,2017-12-30 15:52:21.2220,12,30,Saturday,15,2017-12-30,2017-12-30 16:38:50.4270,121,Mission Playground,37.7592103,-122.4213392,120,Mission Dolores Park,37.7614205,-122.4264353,1837,Customer,,
3,1380,2017-12-30 16:07:44.1080,12,30,Saturday,16,2017-12-30,2017-12-30 16:30:44.9010,239,Bancroft Way at Telegraph Ave,37.8688126,-122.258764,241,Ashby BART Station,37.8524766,-122.2702132,1297,Customer,,
4,198,2017-12-30 16:23:49.0310,12,30,Saturday,16,2017-12-30,2017-12-30 16:27:07.7230,76,McCoppin St at Valencia St,37.77166246221617,-122.42242321372034,76,McCoppin St at Valencia St,37.77166246221617,-122.42242321372034,364,Customer,,
5,463,2017-12-30 15:43:11.5520,12,30,Saturday,15,2017-12-30,2017-12-30 15:50:55.0040,324,Union Square (Powell St at Post St),37.788299978150825,-122.40853071212769,44,Civic Center/UN Plaza BART Station (Market St ...,37.7810737,-122.4117382,762,Subscriber,1991.0,Female
6,1334,2017-12-30 14:21:05.3330,12,30,Saturday,14,2017-12-30,2017-12-30 14:43:20.0360,85,Church St at Duboce Ave,37.7700831,-122.4291557,17,Embarcadero BART Station (Beale St at Market St),37.792251,-122.397086,974,Customer,,
7,603,2017-12-30 14:15:40.5830,12,30,Saturday,14,2017-12-30,2017-12-30 14:25:43.7850,6,The Embarcadero at Sansome St,37.80477,-122.403234,323,Broadway at Kearny,37.79801364395978,-122.40595042705534,2359,Customer,,
8,858,2017-12-30 13:52:14.7550,12,30,Saturday,13,2017-12-30,2017-12-30 14:06:32.9160,60,8th St at Ringold St,37.77452040113685,-122.4094493687153,36,Folsom St at 3rd St,37.78383,-122.39887,276,Customer,,
9,1720,2017-12-30 12:13:24.1740,12,30,Saturday,12,2017-12-30,2017-12-30 12:42:04.3660,6,The Embarcadero at Sansome St,37.80477,-122.403234,99,Folsom St at 15th St,37.7670373,-122.4154425,617,Customer,1972.0,Female


We can also filter on other column types; let's take a peek at rides that lasted over 5 hours.

In [12]:
sgb_derived.filter(dprep.col('duration_sec') > (60 * 60 * 5)).head(11)

Unnamed: 0,duration_sec,start_time,month,day,wday,hour,date,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110,2017-12-31 16:57:39.6540,12,31,Sunday,16,2017-12-31,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776434819204745,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,-122.4159292,96,Customer,1987.0,Male
1,1397,2017-12-31 23:55:09.6860,12,31,Sunday,23,2017-12-31,2018-01-01 00:18:26.7210,78,Folsom St at 9th St,37.7737172,-122.4116467,15,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,1667,Customer,,
2,2511,2017-12-31 22:58:26.5240,12,31,Sunday,22,2017-12-31,2017-12-31 23:40:17.9870,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,6,The Embarcadero at Sansome St,37.80477,-122.403234,3329,Customer,,
3,901,2017-12-31 22:17:39.9100,12,31,Sunday,22,2017-12-31,2017-12-31 22:32:41.6300,66,3rd St at Townsend St,37.77874161153677,-122.39274082710836,23,The Embarcadero at Steuart St,37.791464,-122.391034,3569,Customer,,
4,853,2017-12-31 21:22:37.6430,12,31,Sunday,21,2017-12-31,2017-12-31 21:36:51.1790,14,Clay St at Battery St,37.795001,-122.39997,8,The Embarcadero at Vallejo St,37.799953,-122.398525,101,Customer,,
5,189,2017-12-31 20:48:37.4920,12,31,Sunday,20,2017-12-31,2017-12-31 20:51:46.6580,37,2nd St at Folsom St,37.78499972833808,-122.39593561749642,48,2nd St at S Park St,37.782411189735896,-122.39270595839116,1056,Customer,1988.0,Male
6,1208,2017-12-31 19:56:43.3940,12,31,Sunday,19,2017-12-31,2017-12-31 20:16:51.5510,13,Commercial St at Montgomery St,37.794231,-122.402923,6,The Embarcadero at Sansome St,37.80477,-122.403234,2736,Subscriber,1994.0,Male
7,426,2017-12-31 17:11:48.3310,12,31,Sunday,17,2017-12-31,2017-12-31 17:18:55.0010,123,Folsom St at 19th St,37.7605936,-122.4148171,134,Valencia St at 24th St,37.7524278,-122.4206278,2256,Subscriber,1991.0,Male
8,962,2017-12-31 16:57:20.9300,12,31,Sunday,16,2017-12-31,2017-12-31 17:13:23.5600,176,MacArthur BART Station,37.82840997305853,-122.26631462574004,231,14th St at Filbert St,37.80874983465997,-122.28328227996823,305,Customer,,
9,204,2017-12-31 15:26:41.6630,12,31,Sunday,15,2017-12-31,2017-12-31 15:30:06.1460,86,Market St at Dolores St,37.7693053,-122.4268256,95,Sanchez St at 15th St,37.7662185,-122.4310597,2183,Subscriber,1992.0,Male
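
The rows above include durations such as 1397 and 189 seconds, well under the 18000-second threshold. That is consistent with `duration_sec` having been read as a string, as the profile reported. Below is a minimal sketch of the intended numeric comparison in plain Python (not the SDK), assuming the values arrive as strings and are converted first:

```python
FIVE_HOURS = 60 * 60 * 5  # 18000 seconds

# duration_sec values copied from the first rows of the output above, as strings.
durations = ["80110", "1397", "2511", "901", "853"]

# Convert each value to an integer before comparing it with the threshold.
long_rides = [d for d in durations if int(d) > FIVE_HOURS]
print(long_rides)
```

Only the first ride, at 80110 seconds, actually lasted more than five hours.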


## 5. Transforming the data

In addition to "smart" transformations, Azure ML Data Prep also supports many common transforms familiar from other industry-standard data science libraries. Here, we'll explore `summarize` and `replace_na`. We'll also use `join` when we handle assertions.

#### Aggregation


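Azure ML Data Prep makes it easy to append the output of `summarize` to the original table: group by a column and set `join_back=True`. Note that the `duration_sec_mean` column in the output below holds `DataPrepError` values rather than numeric means, most likely because `duration_sec` was read as a string (see the profile above).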

In [14]:
sgb_appended = sgb_derived\
    .summarize(
        summary_columns=[
            dprep.SummaryColumnsValue(
                column_id='duration_sec',
                summary_column_name='duration_sec_mean',
                summary_function=dprep.SummaryFunction.MEAN
            )
        ],
        group_by_columns=['date'],
        join_back=True
    )
sgb_appended.head(11)

Unnamed: 0,duration_sec,start_time,month,day,wday,hour,date,end_time,start_station_id,start_station_name,...,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,duration_sec_mean
0,80110,2017-12-31 16:57:39.6540,12,31,Sunday,16,2017-12-31,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,...,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,-122.4159292,96,Customer,1987.0,Male,"azureml.dataprep.native.DataPrepError(""'Cannot..."
1,1397,2017-12-31 23:55:09.6860,12,31,Sunday,23,2017-12-31,2018-01-01 00:18:26.7210,78,Folsom St at 9th St,...,-122.4116467,15,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,1667,Customer,,,"azureml.dataprep.native.DataPrepError(""'Cannot..."
2,2511,2017-12-31 22:58:26.5240,12,31,Sunday,22,2017-12-31,2017-12-31 23:40:17.9870,30,San Francisco Caltrain (Townsend St at 4th St),...,-122.395282,6,The Embarcadero at Sansome St,37.80477,-122.403234,3329,Customer,,,"azureml.dataprep.native.DataPrepError(""'Cannot..."
3,901,2017-12-31 22:17:39.9100,12,31,Sunday,22,2017-12-31,2017-12-31 22:32:41.6300,66,3rd St at Townsend St,...,-122.39274082710836,23,The Embarcadero at Steuart St,37.791464,-122.391034,3569,Customer,,,"azureml.dataprep.native.DataPrepError(""'Cannot..."
4,853,2017-12-31 21:22:37.6430,12,31,Sunday,21,2017-12-31,2017-12-31 21:36:51.1790,14,Clay St at Battery St,...,-122.39997,8,The Embarcadero at Vallejo St,37.799953,-122.398525,101,Customer,,,"azureml.dataprep.native.DataPrepError(""'Cannot..."
5,189,2017-12-31 20:48:37.4920,12,31,Sunday,20,2017-12-31,2017-12-31 20:51:46.6580,37,2nd St at Folsom St,...,-122.39593561749642,48,2nd St at S Park St,37.782411189735896,-122.39270595839116,1056,Customer,1988.0,Male,"azureml.dataprep.native.DataPrepError(""'Cannot..."
6,1208,2017-12-31 19:56:43.3940,12,31,Sunday,19,2017-12-31,2017-12-31 20:16:51.5510,13,Commercial St at Montgomery St,...,-122.402923,6,The Embarcadero at Sansome St,37.80477,-122.403234,2736,Subscriber,1994.0,Male,"azureml.dataprep.native.DataPrepError(""'Cannot..."
7,426,2017-12-31 17:11:48.3310,12,31,Sunday,17,2017-12-31,2017-12-31 17:18:55.0010,123,Folsom St at 19th St,...,-122.4148171,134,Valencia St at 24th St,37.7524278,-122.4206278,2256,Subscriber,1991.0,Male,"azureml.dataprep.native.DataPrepError(""'Cannot..."
8,962,2017-12-31 16:57:20.9300,12,31,Sunday,16,2017-12-31,2017-12-31 17:13:23.5600,176,MacArthur BART Station,...,-122.26631462574004,231,14th St at Filbert St,37.80874983465997,-122.28328227996823,305,Customer,,,"azureml.dataprep.native.DataPrepError(""'Cannot..."
9,204,2017-12-31 15:26:41.6630,12,31,Sunday,15,2017-12-31,2017-12-31 15:30:06.1460,86,Market St at Dolores St,...,-122.4268256,95,Sanchez St at 15th St,37.7662185,-122.4310597,2183,Subscriber,1992.0,Male,"azureml.dataprep.native.DataPrepError(""'Cannot..."
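
To make the group-then-join-back idea concrete, here is a plain-Python sketch on four rides with numeric durations. The rows are illustrative, and the code shows the concept only, not how the SDK computes it.

```python
from collections import defaultdict

rides = [
    {"date": "2017-12-31", "duration_sec": 901},
    {"date": "2017-12-31", "duration_sec": 853},
    {"date": "2017-12-30", "duration_sec": 572},
    {"date": "2017-12-30", "duration_sec": 821},
]

# Step 1: summarize, i.e. accumulate [sum, count] of duration_sec per date.
totals = defaultdict(lambda: [0, 0])
for ride in rides:
    totals[ride["date"]][0] += ride["duration_sec"]
    totals[ride["date"]][1] += 1
means = {date: total / count for date, (total, count) in totals.items()}

# Step 2: join back, i.e. attach each group's mean to every row of that group.
appended = [{**ride, "duration_sec_mean": means[ride["date"]]} for ride in rides]
for row in appended:
    print(row)
```

Every row keeps its original columns and gains the mean of its own date group.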


#### Replacing missing values

Recall that our `member_gender` column had empty strings standing in for `None`. Let's use `replace_na` to properly recode them as `None`.

In [15]:
sgb_replaced = sampled_gobike.replace_na(columns=['member_gender'])
sgb_replaced.head(11)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776434819204745,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,-122.4159292,96,Customer,1987.0,Male
1,1397,2017-12-31 23:55:09.6860,2018-01-01 00:18:26.7210,78,Folsom St at 9th St,37.7737172,-122.4116467,15,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,1667,Customer,,
2,2511,2017-12-31 22:58:26.5240,2017-12-31 23:40:17.9870,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,6,The Embarcadero at Sansome St,37.80477,-122.403234,3329,Customer,,
3,901,2017-12-31 22:17:39.9100,2017-12-31 22:32:41.6300,66,3rd St at Townsend St,37.77874161153677,-122.39274082710836,23,The Embarcadero at Steuart St,37.791464,-122.391034,3569,Customer,,
4,853,2017-12-31 21:22:37.6430,2017-12-31 21:36:51.1790,14,Clay St at Battery St,37.795001,-122.39997,8,The Embarcadero at Vallejo St,37.799953,-122.398525,101,Customer,,
5,189,2017-12-31 20:48:37.4920,2017-12-31 20:51:46.6580,37,2nd St at Folsom St,37.78499972833808,-122.39593561749642,48,2nd St at S Park St,37.782411189735896,-122.39270595839116,1056,Customer,1988.0,Male
6,1208,2017-12-31 19:56:43.3940,2017-12-31 20:16:51.5510,13,Commercial St at Montgomery St,37.794231,-122.402923,6,The Embarcadero at Sansome St,37.80477,-122.403234,2736,Subscriber,1994.0,Male
7,426,2017-12-31 17:11:48.3310,2017-12-31 17:18:55.0010,123,Folsom St at 19th St,37.7605936,-122.4148171,134,Valencia St at 24th St,37.7524278,-122.4206278,2256,Subscriber,1991.0,Male
8,962,2017-12-31 16:57:20.9300,2017-12-31 17:13:23.5600,176,MacArthur BART Station,37.82840997305853,-122.26631462574004,231,14th St at Filbert St,37.80874983465997,-122.28328227996823,305,Customer,,
9,204,2017-12-31 15:26:41.6630,2017-12-31 15:30:06.1460,86,Market St at Dolores St,37.7693053,-122.4268256,95,Sanchez St at 15th St,37.7662185,-122.4310597,2183,Subscriber,1992.0,Male
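
Conceptually, the recoding looks like the plain-Python sketch below. The `replace_na` helper defined here is a hypothetical stand-in written for illustration, and treating only the empty string as missing is an assumption.

```python
def replace_na(rows, columns, na_values=("",)):
    """Return copies of the rows with placeholder values in `columns` set to None."""
    return [
        {
            key: (None if key in columns and value in na_values else value)
            for key, value in row.items()
        }
        for row in rows
    ]


rows = [
    {"user_type": "Customer", "member_gender": "Male"},
    {"user_type": "Customer", "member_gender": ""},
]
cleaned = replace_na(rows, columns=["member_gender"])
print(cleaned)
```

Recoding placeholders as real nulls means later null checks and missing-value counts see them.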


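#### Assert on invalid data

The assertion below passes a `member_birth_year` only when it is null or at most 1920; any other value is replaced with an error carrying the code `InvalidDate`.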

In [23]:
gb_asserted = gobike\
    .assert_value(
        columns='member_birth_year', 
        expression=dprep.f_or(dprep.value.is_null(), dprep.value <= 1920),
        error_code='InvalidDate'
    )
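
It helps to spell out which values this condition lets through. The plain-Python sketch below mimics the behaviour seen in the next output, where failing values are replaced by an error. `AssertionFailure`, `assert_value` and `passes` are hypothetical names, and the birth years are illustrative.

```python
class AssertionFailure:
    """Marker stored in place of a value that fails the assertion (hypothetical)."""

    def __init__(self, error_code, original):
        self.error_code = error_code
        self.original = original


def assert_value(values, predicate, error_code):
    """Keep values that satisfy the predicate; wrap the rest in a failure marker."""
    return [v if predicate(v) else AssertionFailure(error_code, v) for v in values]


def passes(value):
    # Same condition as the cell above: null, or less than or equal to 1920.
    return value is None or value <= 1920


birth_years = [1987.0, None, 1886.0]
asserted = assert_value(birth_years, passes, "InvalidDate")
errors = [v.original for v in asserted if isinstance(v, AssertionFailure)]
print(errors)
```

Under this condition 1987 is flagged while 1886 passes, so if the goal is to catch implausibly old birth years, the comparison needs to point the other way.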


Now, we can filter to see what caused the errors above:

In [24]:
gb_errors = gb_asserted.filter(dprep.col('member_birth_year').is_error())
gb_errors.head(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776434819204745,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,-122.4159292,96,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",Male
1,78800,2017-12-31 15:56:34.8420,2018-01-01 13:49:55.6170,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,96,Dolores St at 15th St,37.7662102,-122.4266136,88,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",Female
2,45768,2017-12-31 22:45:48.4110,2018-01-01 11:28:36.8830,245,Downtown Berkeley BART,37.8703477,-122.2677637,245,Downtown Berkeley BART,37.8703477,-122.2677637,1094,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",
3,62172,2017-12-31 17:31:10.6360,2018-01-01 10:47:23.5310,60,8th St at Ringold St,37.77452040113685,-122.4094493687153,5,Powell St BART Station (Market St at 5th St),37.78389935708493,-122.4084448814392,2831,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",
4,43603,2017-12-31 14:23:14.0010,2018-01-01 02:29:57.5710,239,Bancroft Way at Telegraph Ave,37.8688126,-122.258764,247,Fulton St at Bancroft Way,37.8677892,-122.2658964,3167,Subscriber,"azureml.dataprep.native.DataPrepError(""'Invali...",Female


#### Join
But what were the original values? Let's use `join` to find the values that caused our assert to raise an error.

In [25]:
gb_errors.join(
    left_dataflow=gb_errors,
    right_dataflow=gobike,
    join_key_pairs=[
        ('duration_sec', 'duration_sec'),
        ('start_station_id', 'start_station_id'),
        ('bike_id', 'bike_id')
    ]
).head(11)

Unnamed: 0,l_duration_sec,l_start_time,l_end_time,l_start_station_id,l_start_station_name,l_start_station_latitude,l_start_station_longitude,l_end_station_id,l_end_station_name,l_end_station_latitude,...,r_start_station_latitude,r_start_station_longitude,r_end_station_id,r_end_station_name,r_end_station_latitude,r_end_station_longitude,r_bike_id,r_user_type,r_member_birth_year,r_member_gender
0,80110,2017-12-31 16:57:39.6540,2018-01-01 15:12:50.2450,74,Laguna St at Hayes St,37.776434819204745,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,...,37.776434819204745,-122.42624402046204,43,San Francisco Public Library (Grove St at Hyde...,37.7787677,-122.4159292,96,Customer,1987.0,Male
1,78800,2017-12-31 15:56:34.8420,2018-01-01 13:49:55.6170,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,96,Dolores St at 15th St,37.7662102,...,37.78487208436062,-122.40087568759915,96,Dolores St at 15th St,37.7662102,-122.4266136,88,Customer,1965.0,Female
2,45768,2017-12-31 22:45:48.4110,2018-01-01 11:28:36.8830,245,Downtown Berkeley BART,37.8703477,-122.2677637,245,Downtown Berkeley BART,37.8703477,...,37.8703477,-122.2677637,245,Downtown Berkeley BART,37.8703477,-122.2677637,1094,Customer,,
3,62172,2017-12-31 17:31:10.6360,2018-01-01 10:47:23.5310,60,8th St at Ringold St,37.77452040113685,-122.4094493687153,5,Powell St BART Station (Market St at 5th St),37.78389935708493,...,37.77452040113685,-122.4094493687153,5,Powell St BART Station (Market St at 5th St),37.78389935708493,-122.4084448814392,2831,Customer,,
4,43603,2017-12-31 14:23:14.0010,2018-01-01 02:29:57.5710,239,Bancroft Way at Telegraph Ave,37.8688126,-122.258764,247,Fulton St at Bancroft Way,37.8677892,...,37.8688126,-122.258764,247,Fulton St at Bancroft Way,37.8677892,-122.2658964,3167,Subscriber,1997.0,Female
5,9226,2017-12-31 22:51:00.9180,2018-01-01 01:24:47.1660,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,...,37.776598,-122.395282,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,1487,Customer,,
6,4507,2017-12-31 23:49:28.4220,2018-01-01 01:04:35.6190,259,Addison St at Fourth St,37.866249,-122.2993708,259,Addison St at Fourth St,37.866249,...,37.866249,-122.2993708,259,Addison St at Fourth St,37.866249,-122.2993708,3539,Customer,1991.0,Female
7,4334,2017-12-31 23:46:37.1960,2018-01-01 00:58:51.2110,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,...,37.78487208436062,-122.40087568759915,284,Yerba Buena Center for the Arts (Howard St at ...,37.78487208436062,-122.40087568759915,1503,Customer,,
8,4150,2017-12-31 23:37:07.5480,2018-01-01 00:46:18.3080,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,...,37.7913,-122.399051,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,3125,Customer,,
9,4238,2017-12-31 23:35:38.1450,2018-01-01 00:46:17.0530,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,...,37.7913,-122.399051,20,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,2543,Customer,,
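
The join above can be pictured with the plain-Python sketch below. It assumes an inner join and reuses the `l_` and `r_` column prefixes visible in the output; `inner_join` is a hypothetical helper, and the rows are shortened to three columns.

```python
def inner_join(left, right, key_pairs, left_prefix="l_", right_prefix="r_"):
    """Pair left and right rows whose key columns match, prefixing column names."""
    # Index the right-hand rows by their key tuple for fast lookup.
    index = {}
    for row in right:
        key = tuple(row[right_key] for _, right_key in key_pairs)
        index.setdefault(key, []).append(row)

    joined = []
    for row in left:
        key = tuple(row[left_key] for left_key, _ in key_pairs)
        for match in index.get(key, []):
            merged = {left_prefix + k: v for k, v in row.items()}
            merged.update({right_prefix + k: v for k, v in match.items()})
            joined.append(merged)
    return joined


errors = [{"duration_sec": "80110", "bike_id": "96", "member_birth_year": "<error>"}]
original = [
    {"duration_sec": "80110", "bike_id": "96", "member_birth_year": "1987.0"},
    {"duration_sec": "78800", "bike_id": "88", "member_birth_year": "1965.0"},
]
result = inner_join(
    errors, original, [("duration_sec", "duration_sec"), ("bike_id", "bike_id")]
)
print(result)
```

The `l_` columns carry the error markers and the `r_` columns carry the untouched originals, which is what lets us read off the offending values.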


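If we look at `r_member_birth_year`, we can see the original values behind each error. In the rows shown here they are ordinary years such as 1987 and 1965, or missing values, rather than impossible birth years, so the assertion flags more than true outliers. Once the condition isolates genuine anomalies, we can clean the data however we like.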

## 6. Exporting a dprep file

One of the beautiful features of Azure ML Data Prep is that you only need to write your code once, then choose whether to scale up or out; it takes care of figuring out how. To do so, export the `.dprep` file you've written and tested on a smaller dataset, then run it on your larger dataset. Here, we show how to export your new package. For a more detailed example of how to execute it on Spark, check out our [New York Taxicab scenario](https://github.com/Microsoft/PendletonDocs/blob/master/Scenarios/NYTaxiCab/01.new_york_taxi.ipynb).

In [26]:
gobike = gobike.set_name(name="gobike")
package_path = path.join(mkdtemp(), "gobike.dprep")

print("Saving package to: {}".format(package_path))
package = dprep.Package(arg=gobike)
package.save(file_path=package_path)

Saving package to: /tmp/tmpdwtitvt3/gobike.dprep


Package
  name: None
  path: /tmp/tmpdwtitvt3/gobike.dprep
  dataflows: [
    Dataflow {
      name: gobike
      steps: 3
    },
  ]

## Want more information?

Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:

* Cache your Dataflow to speed up your iterations
* Add your custom Python transforms
* Impute missing values
* Sample your data
* Reference and link between Dataflows
* Apply your Dataflow to a new, larger data source