# 01 - Data Engineering

In this step, we download the data and manually register it as an Azure Machine Learning dataset. In a later step, we will automate this process so that it can be scheduled.


By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost and don't risk the integrity of your data sources. Datasets are also lazily evaluated, which helps workflow performance because data is read only when it is needed. You can create datasets from datastores, public URLs, and Azure Open Datasets.
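As a rough illustration of the reference-plus-lazy-evaluation idea, the sketch below defines a dataset from a public URL and reads records only when a preview is requested. It assumes the `azureml-core` (SDK v1) package is installed; the helper name `preview_csv_dataset` is hypothetical and not part of the original notebook.

```python
def preview_csv_dataset(url, rows=5):
    """Return the first `rows` records of a delimited file at `url` as a pandas DataFrame."""
    # Imported inside the function so this module loads even where azureml-core is absent.
    from azureml.core import Dataset

    # The dataset is a reference to the file at `url`; the data is not copied.
    dataset = Dataset.Tabular.from_delimited_files(path=url)

    # take() returns another dataset definition; the records are read
    # when to_pandas_dataframe() materializes them.
    return dataset.take(rows).to_pandas_dataframe()
```

For example, `preview_csv_dataset("https://example.com/reviews.csv")` (a placeholder URL) would return a five-row DataFrame if that file existed.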

For a low-code experience, you can create Azure Machine Learning datasets with Azure Machine Learning studio.

With Azure Machine Learning datasets, you can:

- Keep a single copy of data in your storage, referenced by datasets.

- Seamlessly access data during model training without worrying about connection strings or data paths. Learn more about how to train with datasets.

- Share data and collaborate with other users.

In this exercise, you learn how to create Azure Machine Learning datasets to access data for your local or 
remote experiments with the Azure Machine Learning Python SDK. To understand where datasets fit in Azure Machine 
Learning's overall data access workflow, see the [Securely access data](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#data-workflow) article.

![Data Engineering](./images/01-DataEngineering.jpg)

## Data Engineering 

**Input** : Raw Data 

**Output** : Registered Data Set (ProductReview)

In [4]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

## Download data from source

In [2]:
from tqdm import tqdm
import requests

url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Software_5.json.gz"
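response = requests.get(url, stream=True)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page

# Note: iter_content() defaults to chunk_size=1, so this loop writes one byte per iteration
with open("Software_5.json.gz", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)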

5339013it [00:53, 99494.44it/s] 
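
The loop above relies on `iter_content()`, whose default `chunk_size` is 1, so the 5339013 iterations in the progress bar correspond to single bytes. A minimal alternative sketch that reads larger chunks, using only the standard library's `urlopen`, might look like this. The helper name `download` and the 1 MiB default are illustrative choices, not part of the original notebook.

```python
from urllib.request import urlopen


def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream the resource at `url` into `dest_path` and return the number of bytes written."""
    written = 0
    with urlopen(url) as response, open(dest_path, "wb") as handle:
        while True:
            # Read at most chunk_size bytes so the whole file never sits in memory.
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            handle.write(chunk)
            written += len(chunk)
    return written
```

With the `url` defined in the cell above, the call would be `download(url, "Software_5.json.gz")`.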


## Process received data

In [5]:
# Load the review records (one JSON object per line)

data = []
with gzip.open('Software_5.json.gz') as f:
    for line in f:
        data.append(json.loads(line.strip()))

# Total length of the list, which equals the total number of reviews
print(len(data))

# first row of the list
print(data[0])

12805
{'overall': 4.0, 'verified': False, 'reviewTime': '10 20, 2010', 'reviewerID': 'A38NELQT98S4H8', 'asin': '0321719816', 'style': {'Format:': ' DVD-ROM'}, 'reviewerName': 'WB Halper', 'reviewText': "I've been using Dreamweaver (and it's predecessor Macromedia's UltraDev) for many years.  For someone who is an experienced web designer, this course is a high-level review of the CS5 version of Dreamweaver, but it doesn't go into a great enough level of detail to find it very useful.\n\nOn the other hand, this is a great tool for someone who is a relative novice at web design.  It starts off with a basic overview of HTML and continues through the concepts necessary to build a modern web site.  Someone who goes through this course should exit with enough knowledge to create something that does what you want it do do...within reason.  Don't expect to go off and build an entire e-commerce system with only this class under your belt.\n\nIt's important to note that there's a long gap from s
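
The file is in JSON Lines form (one JSON object per line), compressed with gzip. As a sketch of the same parsing step that does not require holding the whole file in a list, a generator can yield one record at a time. The function name `iter_json_lines` is hypothetical.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSON Lines file."""
    # Mode "rt" decodes the compressed bytes to text while reading.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing in json.loads
                yield json.loads(line)
```

Calling `list(iter_json_lines('Software_5.json.gz'))` would build the same kind of list as the cell above.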

In [6]:
# Convert the list of records into a pandas DataFrame

df = pd.DataFrame(data)

print(len(df))

12805


In [7]:
# Replace missing values with empty strings, then inspect one row

df3 = df.fillna('')
df3.iloc[2]

overall                                                           5
verified                                                      False
reviewTime                                              10 16, 2010
reviewerID                                            ACJT8MUC0LRF0
asin                                                     0321719816
style                                       {'Format:': ' DVD-ROM'}
reviewerName                                              D. Fowler
reviewText        If you've been wanting to learn how to create ...
summary           This is excellent software for those who want ...
unixReviewTime                                           1287187200
vote                                                              3
image                                                              
Name: 2, dtype: object
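
Calling `fillna('')` on the whole frame fills every column, and it can turn a numeric column that has gaps into an `object` column. A hedged alternative is to fill only the columns that are meant to hold text. The sample frame and the `text_columns` list below are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Made-up sample with the same kind of gaps as the review data.
sample = pd.DataFrame(
    {
        "overall": [4.0, 5.0],
        "reviewText": ["Good course", None],
        "vote": [None, "3"],
    }
)

# Hypothetical choice of columns that should hold text.
text_columns = ["reviewText", "vote"]

# Fill only the text columns; numeric columns keep their dtype.
cleaned = sample.copy()
cleaned[text_columns] = cleaned[text_columns].fillna("")

print(cleaned.dtypes)
```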

## Register processed data as Dataset

In [8]:
from azureml.core import Dataset, Datastore, Workspace
from azureml.data.datapath import DataPath

workspace = Workspace.from_config()
default_datastore = Datastore.get_default(workspace)

ds_name = 'ProductReview'
# Note: data_path is defined here but not passed to the call below, which targets the datastore itself
data_path = DataPath(datastore=default_datastore, path_on_datastore='product_review')

ds = Dataset.Tabular.register_pandas_dataframe(
    df3,
    default_datastore,
    ds_name,
    description=None,
    tags=None,
    show_progress=True,
)

Method register_pandas_dataframe: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/7e03b820-d14b-415b-b43e-bba2408163c7/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.
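
Once the dataset is registered, later steps can look it up by name instead of handling paths or connection strings. The sketch below shows one way this might be done with the SDK v1 `Dataset.get_by_name` call; the wrapper function `load_registered_dataset` is a hypothetical helper, and it assumes `azureml-core` is installed and that a `Workspace` object is available.

```python
def load_registered_dataset(workspace, name="ProductReview", version="latest"):
    """Return the registered tabular dataset `name` as a pandas DataFrame."""
    # Imported inside the function so this module loads even where azureml-core is absent.
    from azureml.core import Dataset

    # Look the dataset up by its registered name in the given workspace.
    dataset = Dataset.get_by_name(workspace, name=name, version=version)

    # Materialize the tabular dataset into memory.
    return dataset.to_pandas_dataframe()
```

With the `workspace` object from the cell above, `load_registered_dataset(workspace)` would return the review data as a DataFrame.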


## Show Dataset

The screenshot below shows the details of the **ProductReview** dataset:

![Dataset](./images/01-DataEngineeringOutput.jpb.jpg)