# 1. Data Preparation with Azure ML Service



#### Note: Some features in this Notebook will _not_ work with the Private Preview version of the SDK; it assumes the Public Preview version.

Wondering how to make the most of the Azure ML Data Prep SDK? In this "Getting Started" guide, we'll showcase a few highlights that make this SDK shine for big datasets where `pandas` and `dplyr` can fall short. Using the [Ford GoBike dataset](https://www.fordgobike.com/system-data) as an example, we'll cover how to build Dataflows that read, profile, transform, validate, and export data.


> https://docs.microsoft.com/fr-fr/python/api/overview/azure/dataprep/intro?view=azure-dataprep-py

## 1.1 Setup

In [6]:
!pip install --upgrade azureml-dataprep

Collecting azureml-dataprep
  Downloading https://files.pythonhosted.org/packages/8d/47/b0fb506f7812d64339bc5ec087c6f9ba017a71a82ddaf3e0a448676d55b8/azureml_dataprep-0.5.3-py3-none-any.whl (24.6MB)
    100% |████████████████████████████████| 24.6MB 164kB/s eta 0:00:01
Collecting dotnetcore2==2.1.7 (from azureml-dataprep)
  Downloading https://files.pythonhosted.org/packages/5e/ad/563e291f291ddae9967fc6347b5240e5ee5fdb173c33e0780c30ca8cd4c2/dotnetcore2-2.1.7-py3-none-manylinux1_x86_64.whl (28.7MB)
    100% |████████████████████████████████| 28.7MB 135kB/s eta 0:00:01
Installing collected packages: dotnetcore2, azureml-dataprep
  Found existing installation: dotnetcore2 2.1.6
    Uninstalling dotnetcore2-2.1.6:
      Successfully uninstalled dotnetcore2-2.1.6
  Found existing installation: azureml-dataprep 0.5.2
    Uninstalling azureml-dataprep-0.5.2:
      Successfully uninstalled azureml-dataprep-0

In [7]:
import azureml.core
print("SDK Version:", azureml.core.VERSION)

SDK Version: 1.0.2


In [8]:
from IPython.display import display
from os import path
from tempfile import mkdtemp

import pandas as pd
import azureml.dataprep as dprep

## 2. Importing the data

Azure ML Data Prep supports many different file formats (e.g. CSV, Excel, Parquet), and can also infer column types automatically.
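
To give a feel for what type inference involves, here is a minimal plain-Python sketch. It is not the SDK's algorithm: the helper `infer_type`, the three type labels, and the single date format are assumptions made for illustration, and the two sample rows are abbreviated from the GoBike file.

```python
import csv
import io
from datetime import datetime

def infer_type(values):
    """Return 'decimal', 'date', or 'string' for a list of raw strings.

    Empty strings are treated as missing and ignored. This is an
    illustrative heuristic, not the SDK's inference logic.
    """
    present = [v for v in values if v != ""]
    if not present:
        return "string"

    def all_parse(parser):
        # True only if every present value parses without a ValueError.
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "decimal"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S.%f")):
        return "date"
    return "string"

raw = io.StringIO(
    "duration_sec,start_time,user_type,member_birth_year\n"
    "80110,2017-12-31 16:57:39.654,Customer,1987\n"
    "45768,2017-12-31 22:45:48.411,Customer,\n"
)
rows = list(csv.DictReader(raw))
inferred = {name: infer_type([r[name] for r in rows]) for name in rows[0]}
print(inferred)
```

Trying the strictest parser first and falling back to `string` keeps the heuristic conservative: a column is only typed as numeric or date when every non-empty value agrees.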

In [12]:
gobike = dprep.auto_read_file(path="https://dprepdata.blob.core.windows.net/demo/ford_gobike/2017-fordgobike-tripdata.csv")

In [13]:
gobike.head(10)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110.0,2017-12-31 16:57:39.654,2018-01-01 15:12:50.245,74.0,Laguna St at Hayes St,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96.0,Customer,1987.0,Male
1,78800.0,2017-12-31 15:56:34.842,2018-01-01 13:49:55.617,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,96.0,Dolores St at 15th St,37.76621,-122.426614,88.0,Customer,1965.0,Female
2,45768.0,2017-12-31 22:45:48.411,2018-01-01 11:28:36.883,245.0,Downtown Berkeley BART,37.870348,-122.267764,245.0,Downtown Berkeley BART,37.870348,-122.267764,1094.0,Customer,,
3,62172.0,2017-12-31 17:31:10.636,2018-01-01 10:47:23.531,60.0,8th St at Ringold St,37.77452,-122.409449,5.0,Powell St BART Station (Market St at 5th St),37.783899,-122.408445,2831.0,Customer,,
4,43603.0,2017-12-31 14:23:14.001,2018-01-01 02:29:57.571,239.0,Bancroft Way at Telegraph Ave,37.868813,-122.258764,247.0,Fulton St at Bancroft Way,37.867789,-122.265896,3167.0,Subscriber,1997.0,Female
5,9226.0,2017-12-31 22:51:00.918,2018-01-01 01:24:47.166,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,1487.0,Customer,,
6,4507.0,2017-12-31 23:49:28.422,2018-01-01 01:04:35.619,259.0,Addison St at Fourth St,37.866249,-122.299371,259.0,Addison St at Fourth St,37.866249,-122.299371,3539.0,Customer,1991.0,Female
7,4334.0,2017-12-31 23:46:37.196,2018-01-01 00:58:51.211,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,1503.0,Customer,,
8,4150.0,2017-12-31 23:37:07.548,2018-01-01 00:46:18.308,20.0,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,20.0,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,3125.0,Customer,,
9,4238.0,2017-12-31 23:35:38.145,2018-01-01 00:46:17.053,20.0,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,20.0,Mechanics Monument Plaza (Market St at Bush St),37.7913,-122.399051,2543.0,Customer,,


In order to iterate more quickly, we can take a sample of our data. Later, we can then apply the same transformations to the entire dataset.
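
As a rough mental model of probability-based sampling with a seed, the sketch below keeps each row independently with a fixed probability. Whether the SDK samples exactly this way is an assumption; `bernoulli_sample` is a hypothetical helper, not an SDK function.

```python
import random

def bernoulli_sample(rows, probability, seed):
    """Keep each row independently with the given probability.

    A dedicated Random instance makes the result depend only on the seed.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < probability]

rows = list(range(100_000))
sample_a = bernoulli_sample(rows, probability=0.01, seed=5)
sample_b = bernoulli_sample(rows, probability=0.01, seed=5)

# The sample size is close to 1% of the input, and the same seed
# reproduces the same sample.
print(len(sample_a), sample_a == sample_b)
```

Fixing the seed is what makes iteration practical: every rerun of the notebook works on the same subset, so changes in the output come from changes in the transformations and not from the sample.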

In [14]:
sampled_gobike = gobike.take_sample(probability=0.01, seed=5)

## 3. Profiling the data

Let's understand what our data looks like. Azure ML Data Prep facilitates this process by offering data profiles that help us glimpse into column types and column summary statistics.

In [15]:
# Profile the full dataset
#gobike.get_profile()

In [16]:
# Profile the sample
sampled_gobike.get_profile()

Unnamed: 0,Type,Min,Max,Count,Missing Count,Not Missing Count,Percent missing,Error Count,Empty count,0.1% Quantile,1% Quantile,5% Quantile,25% Quantile,50% Quantile,75% Quantile,95% Quantile,99% Quantile,99.9% Quantile,Mean,Standard Deviation,Variance,Skewness,Kurtosis
duration_sec,FieldType.DECIMAL,61,81686,5268.0,0.0,5268.0,0.0,0.0,0.0,79.9623,256.069,250.957,382.902,595.069,936.276,2273.98,13244.8,56320.1,1096.7,3332.52,11105700.0,14.3611,267.013
start_time,FieldType.DATE,2017-06-28 13:48:12.389000+00:00,2017-12-31 23:55:09.686000+00:00,5268.0,0.0,5268.0,0.0,0.0,0.0,,,,,,,,,,,,,,
end_time,FieldType.DATE,2017-06-28 14:00:04.925000+00:00,2018-01-01 15:12:50.245000+00:00,5268.0,0.0,5268.0,0.0,0.0,0.0,,,,,,,,,,,,,,
start_station_id,FieldType.DECIMAL,3,340,5268.0,0.0,5268.0,0.0,0.0,0.0,3.0,11.2005,10.8108,23.4783,66.6362,141.088,288.461,323.68,324.759,94.8958,86.5886,7497.59,1.08634,0.269416
start_station_name,FieldType.STRING,10th Ave at E 15th St,Yerba Buena Center for the Arts (Howard St at ...,5268.0,0.0,5268.0,0.0,0.0,0.0,,,,,,,,,,,,,,
start_station_latitude,FieldType.DECIMAL,37.323,37.8802,5268.0,0.0,5268.0,0.0,0.0,0.0,37.3252,37.7612,37.7609,37.7738,37.7837,37.7954,37.8319,37.8692,37.8738,37.7715,0.0868063,0.00753533,-4.47498,19.8988
start_station_longitude,FieldType.DECIMAL,-122.444,-121.877,5268.0,0.0,5268.0,0.0,0.0,0.0,-122.444,-122.423,-122.423,-122.411,-122.399,-122.391,-122.25,-121.886,-121.877,-122.364,0.106098,0.0112567,3.21728,11.149
end_station_id,FieldType.DECIMAL,3,337,5268.0,0.0,5268.0,0.0,0.0,0.0,3.0,9.55321,9.0,22.8194,65.7664,137.809,290.327,323.878,324.759,93.3768,86.306,7448.72,1.11127,0.330681
end_station_name,FieldType.STRING,10th Ave at E 15th St,Yerba Buena Center for the Arts (Howard St at ...,5268.0,0.0,5268.0,0.0,0.0,0.0,,,,,,,,,,,,,,
end_station_latitude,FieldType.DECIMAL,37.323,37.874,5268.0,0.0,5268.0,0.0,0.0,0.0,37.3256,37.761,37.7606,37.7745,37.7834,37.7954,37.83,37.8685,37.8731,37.7713,0.0868062,0.00753531,-4.50761,20.0773


It appears that we have quite a few missing values in `member_birth_year`. We also immediately see that we have some empty strings in our `member_gender` column. With the data profiler, we can quickly do a sanity check on our dataset and see where we might need to start data cleaning.
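
The two counts that matter here, missing values and empty strings, are simple to compute by hand. The sketch below is a plain-Python approximation of those profile fields, not the SDK's profiler. `profile_column` is a hypothetical helper, and the values are the first five rows of `gobike.head(10)` above, assuming blanks in `member_birth_year` are nulls and blanks in `member_gender` are empty strings.

```python
def profile_column(values):
    """Count missing (None) and empty-string entries in one column."""
    missing = sum(1 for v in values if v is None)
    empty = sum(1 for v in values if v == "")
    return {
        "count": len(values),
        "missing_count": missing,
        "empty_count": empty,
        "percent_missing": 100.0 * missing / len(values) if values else 0.0,
    }

birth_years = [1987.0, 1965.0, None, None, 1997.0]
genders = ["Male", "Female", "", "", "Female"]

print(profile_column(birth_years))
print(profile_column(genders))
```

Keeping the two counts separate matters because an empty string is a present value as far as a null check is concerned, which is why `member_gender` needs its own cleaning step.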

## 4. Transforming data by example

Azure ML Data Prep comes with additional "smart" transforms created by Microsoft Research. Here, we'll look at how you can derive a new column by providing examples of input-output pairs. Rather than explicitly using regular expressions to extract dates or hours from datetimes, we can provide examples for Azure ML Data Prep to learn what the pattern is. In fact, these smart transformations can also handle more complex derivations like inferring the day of the week from datetimes.
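
For contrast, here is roughly what the explicit route looks like for the same kind of derived columns, using only Python's `datetime` module. The helper `derive_parts` and the `WEEKDAYS` list are hypothetical names for this sketch; it says nothing about how the by-example transform works internally.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def derive_parts(start_time):
    """Split a 'YYYY-MM-DD HH:MM:SS[.ffff]' string into derived fields."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in start_time else "%Y-%m-%d %H:%M:%S"
    ts = datetime.strptime(start_time, fmt)
    return {
        "date": ts.strftime("%Y-%m-%d"),
        "hour": str(ts.hour),
        "wday": WEEKDAYS[ts.weekday()],  # weekday(): Monday is 0
        "day": str(ts.day),
        "month": str(ts.month),
    }

parts = derive_parts("2017-12-31 16:57:39.6540")
print(parts)
```

Every field here needs its own format string or accessor, and the input format has to be known up front. Supplying an input-output pair sidesteps both.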

In [17]:
sgb_derived = sampled_gobike\
    .to_string(
        columns=['start_time', 'end_time']
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='date',
        example_data=[('2017-12-31 16:57:39.6540', '2017-12-31'), ('2017-12-31 16:57:39', '2017-12-31')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='hour',
        example_data=[('2017-12-31 16:57:39.6540', '16')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='wday',
        example_data=[('2017-12-31 16:57:39.6540', 'Sunday')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='day',
        example_data=[('2017-12-31 16:57:39.6540', '31')]
    )\
    .derive_column_by_example(
        source_columns='start_time',
        new_column_name='month',
        example_data=[('2017-12-31 16:57:39.6540', '12')]
    )

In [18]:
sgb_derived.filter(dprep.col('wday') != 'Sunday').head(11)

Unnamed: 0,duration_sec,start_time,month,day,wday,hour,date,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,572.0,2017-12-30 18:19:05.731000,12,30,Saturday,18,2017-12-30,2017-12-30 18:28:37.825000,323.0,Broadway at Kearny,37.798014,-122.40595,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3033.0,Customer,1996.0,Male
1,821.0,2017-12-30 18:13:05.124000,12,30,Saturday,18,2017-12-30,2017-12-30 18:26:46.365000,317.0,San Salvador St at 9th St,37.333955,-121.877349,304.0,Jackson St at 5th St,37.348759,-121.894798,2533.0,Subscriber,1970.0,Male
2,2789.0,2017-12-30 15:52:21.222000,12,30,Saturday,15,2017-12-30,2017-12-30 16:38:50.427000,121.0,Mission Playground,37.75921,-122.421339,120.0,Mission Dolores Park,37.76142,-122.426435,1837.0,Customer,,
3,1380.0,2017-12-30 16:07:44.108000,12,30,Saturday,16,2017-12-30,2017-12-30 16:30:44.901000,239.0,Bancroft Way at Telegraph Ave,37.868813,-122.258764,241.0,Ashby BART Station,37.852477,-122.270213,1297.0,Customer,,
4,198.0,2017-12-30 16:23:49.031000,12,30,Saturday,16,2017-12-30,2017-12-30 16:27:07.723000,76.0,McCoppin St at Valencia St,37.771662,-122.422423,76.0,McCoppin St at Valencia St,37.771662,-122.422423,364.0,Customer,,
5,463.0,2017-12-30 15:43:11.552000,12,30,Saturday,15,2017-12-30,2017-12-30 15:50:55.004000,324.0,Union Square (Powell St at Post St),37.7883,-122.408531,44.0,Civic Center/UN Plaza BART Station (Market St ...,37.781074,-122.411738,762.0,Subscriber,1991.0,Female
6,1334.0,2017-12-30 14:21:05.333000,12,30,Saturday,14,2017-12-30,2017-12-30 14:43:20.036000,85.0,Church St at Duboce Ave,37.770083,-122.429156,17.0,Embarcadero BART Station (Beale St at Market St),37.792251,-122.397086,974.0,Customer,,
7,603.0,2017-12-30 14:15:40.583000,12,30,Saturday,14,2017-12-30,2017-12-30 14:25:43.785000,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,323.0,Broadway at Kearny,37.798014,-122.40595,2359.0,Customer,,
8,858.0,2017-12-30 13:52:14.755000,12,30,Saturday,13,2017-12-30,2017-12-30 14:06:32.916000,60.0,8th St at Ringold St,37.77452,-122.409449,36.0,Folsom St at 3rd St,37.78383,-122.39887,276.0,Customer,,
9,1720.0,2017-12-30 12:13:24.174000,12,30,Saturday,12,2017-12-30,2017-12-30 12:42:04.366000,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,99.0,Folsom St at 15th St,37.767037,-122.415442,617.0,Customer,1972.0,Female


We can also filter on other column types; let's take a peek at rides that lasted over 5 hours.

In [19]:
sgb_derived.filter(dprep.col('duration_sec') > (60 * 60 * 5)).head(11)

Unnamed: 0,duration_sec,start_time,month,day,wday,hour,date,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110.0,2017-12-31 16:57:39.654000,12,31,Sunday,16,2017-12-31,2018-01-01 15:12:50.245000,74.0,Laguna St at Hayes St,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96.0,Customer,1987.0,Male
1,19738.0,2017-12-27 10:08:32.337000,12,27,Wednesday,10,2017-12-27,2017-12-27 15:37:31.208000,95.0,Sanchez St at 15th St,37.766219,-122.43106,31.0,Raymond Kimbell Playground,37.783813,-122.434559,931.0,Customer,,
2,22442.0,2017-12-15 12:24:44.370000,12,15,Friday,12,2017-12-15,2017-12-15 18:38:47.010000,275.0,Julian St at 6th St,37.342997,-121.888889,278.0,The Alameda at Bush St,37.331932,-121.904888,1872.0,Subscriber,1983.0,Male
3,27433.0,2017-11-06 08:33:58.415000,11,6,Monday,8,2017-11-06,2017-11-06 16:11:11.583000,75.0,Market St at Franklin St,37.773793,-122.421239,58.0,Market St at 10th St,37.776619,-122.417385,2291.0,Customer,,
4,28467.0,2017-10-25 22:59:12.266000,10,25,Wednesday,22,2017-10-25,2017-10-26 06:53:39.765000,108.0,16th St Mission BART,37.76471,-122.419957,119.0,18th St at Noe St,37.761047,-122.432642,2423.0,Subscriber,1973.0,Male
5,22532.0,2017-10-05 18:08:47.166000,10,5,Thursday,18,2017-10-05,2017-10-06 00:24:19.285000,323.0,Broadway at Kearny,37.798014,-122.40595,4.0,Cyril Magnin St at Ellis St,37.785881,-122.408915,307.0,Subscriber,1987.0,Male
6,21050.0,2017-10-03 08:59:55.957000,10,3,Tuesday,8,2017-10-03,2017-10-03 14:50:46.614000,24.0,Spear St at Folsom St,37.789677,-122.390428,24.0,Spear St at Folsom St,37.789677,-122.390428,2818.0,Customer,,
7,81686.0,2017-09-30 18:12:21.667000,9,30,Saturday,18,2017-09-30,2017-10-01 16:53:48.361000,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,81.0,Berry St at 4th St,37.77588,-122.39317,2371.0,Customer,,
8,29494.0,2017-09-23 11:08:56.665000,9,23,Saturday,11,2017-09-23,2017-09-23 19:20:31.620000,4.0,Cyril Magnin St at Ellis St,37.785881,-122.408915,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,2210.0,Customer,,
9,76987.0,2017-09-20 09:22:14.400000,9,20,Wednesday,9,2017-09-20,2017-09-21 06:45:21.472000,164.0,Isabella St at San Pablo Ave,37.814988,-122.274844,164.0,Isabella St at San Pablo Ave,37.814988,-122.274844,554.0,Customer,1977.0,Female


## 5. Transforming the data

In addition to "smart" transformations, Azure ML Data Prep also supports many common transforms familiar from other industry-standard data science libraries. Here, we'll explore `summarize` and `replace_na`. We'll also use `join` when we handle assertions.

#### Aggregation


`summarize` computes aggregates per group. With `join_back=True`, Azure ML Data Prep also appends the output of `summarize` to the original table, matched on the grouping variable.
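
Conceptually, a grouped mean that is joined back gives every row its group's average. The plain-Python sketch below illustrates that idea only; `mean_joined_back` is a hypothetical helper and the three rides are made-up values, not rows from the dataset.

```python
from collections import defaultdict

def mean_joined_back(rows, value_key, group_key, out_key):
    """Compute the per-group mean of value_key and attach it to every row."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        acc = totals[row[group_key]]
        acc[0] += row[value_key]
        acc[1] += 1
    means = {group: s / n for group, (s, n) in totals.items()}
    # Return new dicts so the input rows are left untouched.
    return [{**row, out_key: means[row[group_key]]} for row in rows]

rides = [
    {"date": "2017-12-31", "duration_sec": 900.0},
    {"date": "2017-12-31", "duration_sec": 300.0},
    {"date": "2017-12-30", "duration_sec": 500.0},
]
joined_back = mean_joined_back(rides, "duration_sec", "date", "duration_sec_mean")
for row in joined_back:
    print(row)
```

The row count is unchanged: each ride keeps its own `duration_sec` and gains the mean for its date, which makes per-row comparisons against the daily average straightforward.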

In [20]:
sgb_appended = sgb_derived\
    .summarize(
        summary_columns=[
            dprep\
                .SummaryColumnsValue(
                    column_id='duration_sec', 
                    summary_column_name='duration_sec_mean', 
                    summary_function=dprep.SummaryFunction.MEAN
                )
        ],
        group_by_columns=['date'],
        join_back=True
    )
sgb_appended.head(11)

Unnamed: 0,duration_sec,start_time,month,day,wday,hour,date,end_time,start_station_id,start_station_name,...,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,duration_sec_mean
0,80110.0,2017-12-31 16:57:39.654000,12,31,Sunday,16,2017-12-31,2018-01-01 15:12:50.245000,74.0,Laguna St at Hayes St,...,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96.0,Customer,1987.0,Male,5309.111111
1,1397.0,2017-12-31 23:55:09.686000,12,31,Sunday,23,2017-12-31,2018-01-01 00:18:26.721000,78.0,Folsom St at 9th St,...,-122.411647,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,1667.0,Customer,,,5309.111111
2,2511.0,2017-12-31 22:58:26.524000,12,31,Sunday,22,2017-12-31,2017-12-31 23:40:17.987000,30.0,San Francisco Caltrain (Townsend St at 4th St),...,-122.395282,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,3329.0,Customer,,,5309.111111
3,901.0,2017-12-31 22:17:39.910000,12,31,Sunday,22,2017-12-31,2017-12-31 22:32:41.630000,66.0,3rd St at Townsend St,...,-122.392741,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,3569.0,Customer,,,5309.111111
4,853.0,2017-12-31 21:22:37.643000,12,31,Sunday,21,2017-12-31,2017-12-31 21:36:51.179000,14.0,Clay St at Battery St,...,-122.39997,8.0,The Embarcadero at Vallejo St,37.799953,-122.398525,101.0,Customer,,,5309.111111
5,189.0,2017-12-31 20:48:37.492000,12,31,Sunday,20,2017-12-31,2017-12-31 20:51:46.658000,37.0,2nd St at Folsom St,...,-122.395936,48.0,2nd St at S Park St,37.782411,-122.392706,1056.0,Customer,1988.0,Male,5309.111111
6,1208.0,2017-12-31 19:56:43.394000,12,31,Sunday,19,2017-12-31,2017-12-31 20:16:51.551000,13.0,Commercial St at Montgomery St,...,-122.402923,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,2736.0,Subscriber,1994.0,Male,5309.111111
7,426.0,2017-12-31 17:11:48.331000,12,31,Sunday,17,2017-12-31,2017-12-31 17:18:55.001000,123.0,Folsom St at 19th St,...,-122.414817,134.0,Valencia St at 24th St,37.752428,-122.420628,2256.0,Subscriber,1991.0,Male,5309.111111
8,962.0,2017-12-31 16:57:20.930000,12,31,Sunday,16,2017-12-31,2017-12-31 17:13:23.560000,176.0,MacArthur BART Station,...,-122.266315,231.0,14th St at Filbert St,37.80875,-122.283282,305.0,Customer,,,5309.111111
9,204.0,2017-12-31 15:26:41.663000,12,31,Sunday,15,2017-12-31,2017-12-31 15:30:06.146000,86.0,Market St at Dolores St,...,-122.426826,95.0,Sanchez St at 15th St,37.766219,-122.43106,2183.0,Subscriber,1992.0,Male,5309.111111


#### Replacing missing values

Recall that our `member_gender` column had empty strings that stood in place of `None`. Let's use `replace_na` to properly recode them as `None`s.
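
The recoding itself is a simple mapping from NA-like strings to `None`. The sketch below shows the idea in plain Python; `blank_to_none` and its `na_values` parameter are hypothetical and do not mirror the SDK's signature or defaults.

```python
def blank_to_none(rows, columns, na_values=("",)):
    """Return copies of rows with NA-like strings in `columns` set to None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for col in columns:
            if new_row.get(col) in na_values:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned

riders = [
    {"user_type": "Customer", "member_gender": "Male"},
    {"user_type": "Customer", "member_gender": ""},
]
cleaned = blank_to_none(riders, columns=["member_gender"])
print(cleaned)
```

Restricting the replacement to named columns avoids touching columns where an empty string is a legitimate value.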

In [21]:
sgb_replaced = sampled_gobike.replace_na(columns=['member_gender'])
sgb_replaced.head(11)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110.0,2017-12-31 16:57:39.654,2018-01-01 15:12:50.245,74.0,Laguna St at Hayes St,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96.0,Customer,1987.0,Male
1,1397.0,2017-12-31 23:55:09.686,2018-01-01 00:18:26.721,78.0,Folsom St at 9th St,37.773717,-122.411647,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,1667.0,Customer,,
2,2511.0,2017-12-31 22:58:26.524,2017-12-31 23:40:17.987,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,3329.0,Customer,,
3,901.0,2017-12-31 22:17:39.910,2017-12-31 22:32:41.630,66.0,3rd St at Townsend St,37.778742,-122.392741,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,3569.0,Customer,,
4,853.0,2017-12-31 21:22:37.643,2017-12-31 21:36:51.179,14.0,Clay St at Battery St,37.795001,-122.39997,8.0,The Embarcadero at Vallejo St,37.799953,-122.398525,101.0,Customer,,
5,189.0,2017-12-31 20:48:37.492,2017-12-31 20:51:46.658,37.0,2nd St at Folsom St,37.785,-122.395936,48.0,2nd St at S Park St,37.782411,-122.392706,1056.0,Customer,1988.0,Male
6,1208.0,2017-12-31 19:56:43.394,2017-12-31 20:16:51.551,13.0,Commercial St at Montgomery St,37.794231,-122.402923,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,2736.0,Subscriber,1994.0,Male
7,426.0,2017-12-31 17:11:48.331,2017-12-31 17:18:55.001,123.0,Folsom St at 19th St,37.760594,-122.414817,134.0,Valencia St at 24th St,37.752428,-122.420628,2256.0,Subscriber,1991.0,Male
8,962.0,2017-12-31 16:57:20.930,2017-12-31 17:13:23.560,176.0,MacArthur BART Station,37.82841,-122.266315,231.0,14th St at Filbert St,37.80875,-122.283282,305.0,Customer,,
9,204.0,2017-12-31 15:26:41.663,2017-12-31 15:30:06.146,86.0,Market St at Dolores St,37.769305,-122.426826,95.0,Sanchez St at 15th St,37.766219,-122.43106,2183.0,Subscriber,1992.0,Male


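## Assert on invalid data

`assert_value` checks every value in a column against an expression; values that fail the check are replaced with error values carrying the given `error_code`.
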
In [22]:
gb_asserted = gobike\
    .assert_value(
        columns='member_birth_year', 
        expression=dprep.f_or(dprep.value.is_null(), dprep.value <= 1920),
        error_code='InvalidDate'
    )


Now, we can filter to see what caused the errors above:

In [23]:
gb_errors = gb_asserted.filter(dprep.col('member_birth_year').is_error())
gb_errors.head(5)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender
0,80110.0,2017-12-31 16:57:39.654,2018-01-01 15:12:50.245,74.0,Laguna St at Hayes St,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96.0,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",Male
1,78800.0,2017-12-31 15:56:34.842,2018-01-01 13:49:55.617,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,96.0,Dolores St at 15th St,37.76621,-122.426614,88.0,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",Female
2,43603.0,2017-12-31 14:23:14.001,2018-01-01 02:29:57.571,239.0,Bancroft Way at Telegraph Ave,37.868813,-122.258764,247.0,Fulton St at Bancroft Way,37.867789,-122.265896,3167.0,Subscriber,"azureml.dataprep.native.DataPrepError(""'Invali...",Female
3,4507.0,2017-12-31 23:49:28.422,2018-01-01 01:04:35.619,259.0,Addison St at Fourth St,37.866249,-122.299371,259.0,Addison St at Fourth St,37.866249,-122.299371,3539.0,Customer,"azureml.dataprep.native.DataPrepError(""'Invali...",Female
4,2183.0,2017-12-31 23:52:55.581,2018-01-01 00:29:18.743,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,24.0,Spear St at Folsom St,37.789677,-122.390428,2311.0,Subscriber,"azureml.dataprep.native.DataPrepError(""'Invali...",Male


#### Join
But what were the original values? Let's use `join` to figure out what the values were that caused our assert to throw an error. 
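
As a rough mental model of a multi-key join, the sketch below matches rows whose key columns are all equal and prefixes the output columns with `l_` and `r_`. It assumes inner-join semantics, which may differ from the SDK's default join type; `inner_join` is a hypothetical helper, and the rows are trimmed to a few columns.

```python
def inner_join(left, right, key_pairs, left_prefix="l_", right_prefix="r_"):
    """Inner-join two lists of dicts on (left_key, right_key) pairs."""
    # Index the right side by its composite key for fast lookup.
    index = {}
    for right_row in right:
        key = tuple(right_row[rk] for _, rk in key_pairs)
        index.setdefault(key, []).append(right_row)

    joined = []
    for left_row in left:
        key = tuple(left_row[lk] for lk, _ in key_pairs)
        for right_row in index.get(key, []):
            row = {left_prefix + k: v for k, v in left_row.items()}
            row.update({right_prefix + k: v for k, v in right_row.items()})
            joined.append(row)
    return joined

errors = [
    {"duration_sec": 80110.0, "start_station_id": 74.0, "bike_id": 96.0,
     "member_birth_year": "InvalidDate"},
]
original = [
    {"duration_sec": 80110.0, "start_station_id": 74.0, "bike_id": 96.0,
     "member_birth_year": 1987.0},
    {"duration_sec": 78800.0, "start_station_id": 284.0, "bike_id": 88.0,
     "member_birth_year": 1965.0},
]
pairs = [
    ("duration_sec", "duration_sec"),
    ("start_station_id", "start_station_id"),
    ("bike_id", "bike_id"),
]
result = inner_join(errors, original, pairs)
print(result[0]["l_member_birth_year"], result[0]["r_member_birth_year"])
```

Joining on several columns at once is what makes this usable without a unique ride ID: duration, start station, and bike together are specific enough to pair an error row with its original.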

In [24]:
gb_errors.join(
    left_dataflow=gb_errors,
    right_dataflow=gobike,
    join_key_pairs=[
        ('duration_sec', 'duration_sec'),
        ('start_station_id', 'start_station_id'),
        ('bike_id', 'bike_id')
    ]
).head(11)

Unnamed: 0,l_duration_sec,l_start_time,l_end_time,l_start_station_id,l_start_station_name,l_start_station_latitude,l_start_station_longitude,l_end_station_id,l_end_station_name,l_end_station_latitude,...,r_start_station_latitude,r_start_station_longitude,r_end_station_id,r_end_station_name,r_end_station_latitude,r_end_station_longitude,r_bike_id,r_user_type,r_member_birth_year,r_member_gender
0,80110.0,2017-12-31 16:57:39.654,2018-01-01 15:12:50.245,74.0,Laguna St at Hayes St,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,...,37.776435,-122.426244,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,96.0,Customer,1987.0,Male
1,78800.0,2017-12-31 15:56:34.842,2018-01-01 13:49:55.617,284.0,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,96.0,Dolores St at 15th St,37.76621,...,37.784872,-122.400876,96.0,Dolores St at 15th St,37.76621,-122.426614,88.0,Customer,1965.0,Female
2,43603.0,2017-12-31 14:23:14.001,2018-01-01 02:29:57.571,239.0,Bancroft Way at Telegraph Ave,37.868813,-122.258764,247.0,Fulton St at Bancroft Way,37.867789,...,37.868813,-122.258764,247.0,Fulton St at Bancroft Way,37.867789,-122.265896,3167.0,Subscriber,1997.0,Female
3,4507.0,2017-12-31 23:49:28.422,2018-01-01 01:04:35.619,259.0,Addison St at Fourth St,37.866249,-122.299371,259.0,Addison St at Fourth St,37.866249,...,37.866249,-122.299371,259.0,Addison St at Fourth St,37.866249,-122.299371,3539.0,Customer,1991.0,Female
4,2183.0,2017-12-31 23:52:55.581,2018-01-01 00:29:18.743,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,24.0,Spear St at Folsom St,37.789677,...,37.776639,-122.395526,24.0,Spear St at Folsom St,37.789677,-122.390428,2311.0,Subscriber,1990.0,Male
5,2170.0,2017-12-31 23:52:55.937,2018-01-01 00:29:06.924,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,24.0,Spear St at Folsom St,37.789677,...,37.776639,-122.395526,24.0,Spear St at Folsom St,37.789677,-122.390428,3717.0,Subscriber,1990.0,Male
6,1544.0,2017-12-31 23:53:38.943,2018-01-01 00:19:23.047,14.0,Clay St at Battery St,37.795001,-122.39997,27.0,Beale St at Harrison St,37.788059,...,37.795001,-122.39997,27.0,Beale St at Harrison St,37.788059,-122.391865,558.0,Subscriber,1980.0,Female
7,1474.0,2017-12-31 23:54:40.146,2018-01-01 00:19:14.351,14.0,Clay St at Battery St,37.795001,-122.39997,27.0,Beale St at Harrison St,37.788059,...,37.795001,-122.39997,27.0,Beale St at Harrison St,37.788059,-122.391865,3646.0,Subscriber,1979.0,Male
8,1532.0,2017-12-31 23:52:49.497,2018-01-01 00:18:21.953,78.0,Folsom St at 9th St,37.773717,-122.411647,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,...,37.773717,-122.411647,15.0,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3114.0,Subscriber,1988.0,Other
9,1216.0,2017-12-31 23:46:33.993,2018-01-01 00:06:50.058,4.0,Cyril Magnin St at Ellis St,37.785881,-122.408915,123.0,Folsom St at 19th St,37.760594,...,37.785881,-122.408915,123.0,Folsom St at 19th St,37.760594,-122.414817,1473.0,Subscriber,1971.0,Male


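If we look at `r_member_birth_year`, we can see the original values that triggered the assertion. In the rows shown these are ordinary birth years such as 1987 and 1965: as written, the expression passes nulls and values `<= 1920`, so every later year is marked as an error. To flag implausible birth years instead (for example, someone listed as born in 1886), the comparison needs to be reversed. Once outliers and anomalies are identified, we can clean the data however we like.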
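
As a plain-Python sketch of the assertion semantics, the snippet below shows which values each comparison direction turns into errors. The helpers `assert_values` and `is_error` and the tuple error marker are hypothetical, not the SDK's internals, and the sample values are illustrative.

```python
def assert_values(values, predicate, error_code):
    """Replace values that fail `predicate` with an error marker tuple."""
    return [v if predicate(v) else ("ERROR", error_code, v) for v in values]

def is_error(value):
    """True for the marker tuples produced by assert_values."""
    return isinstance(value, tuple) and len(value) == 3 and value[0] == "ERROR"

birth_years = [1987.0, 1965.0, None, 1886.0]

# Same shape as the notebook's expression: null OR value <= 1920.
as_written = assert_values(
    birth_years, lambda v: v is None or v <= 1920, "InvalidDate")

# Reversed comparison: null OR value > 1920.
reversed_cmp = assert_values(
    birth_years, lambda v: v is None or v > 1920, "InvalidDate")

print([v[2] for v in as_written if is_error(v)])
print([v[2] for v in reversed_cmp if is_error(v)])
```

The expression describes what a valid value looks like, so the values that become errors are the ones for which it is false. Nulls pass in both variants because of the explicit null check.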

## 6. Exporting a dprep file

One of the beautiful features of Azure ML Data Prep is that you only need to write your code once and choose whether to scale up or out; it takes care of figuring out how. To do so, you can export the `.dprep` file you've written and tested on a smaller dataset, then run it with your larger dataset. Here, we show how you can export your new package. For a more detailed example on how to execute it on Spark, check out our [New York Taxicab scenario](https://github.com/Microsoft/PendletonDocs/blob/master/Scenarios/NYTaxiCab/01.new_york_taxi.ipynb).

In [25]:
gobike = gobike.set_name(name="gobike")
package_path = path.join(mkdtemp(), "gobike.dprep")

print("Saving package to: {}".format(package_path))
package = dprep.Package(arg=gobike)
package.save(file_path=package_path)

Saving package to: /tmp/tmp54l1rjoe/gobike.dprep


Package
  name: None
  path: /tmp/tmp54l1rjoe/gobike.dprep
  dataflows: [
    Dataflow {
      name: gobike
      steps: 4
    },
  ]

Now we load the package:

In [28]:
package = dprep.Package.open(package_path)  # the path the package was saved to above
dataflow_list = package.dataflows

> Documentation
https://docs.microsoft.com/fr-fr/python/api/overview/azure/dataprep/intro?view=azure-dataprep-py

## Want more information?

Congratulations on finishing your introduction to the Azure ML Data Prep SDK! If you'd like more detailed tutorials on how to construct machine learning datasets or dive deeper into all of its functionality, you can find more information in our detailed notebooks [here](https://github.com/Microsoft/PendletonDocs). There, we cover topics including how to:

* Cache your Dataflow to speed up your iterations
* Add your custom Python transforms
* Impute missing values
* Sample your data
* Reference and link between Dataflows
* Apply your Dataflow to a new, larger data source