## Up and running with Azure Machine Learning

Azure Machine Learning is a fully managed cloud service for big data processing that includes a browser-based workbench for the data science workflow: authoring, evaluating, and publishing predictive models.

In this article we will look at how to get started with Azure Machine Learning in minutes and deploy a data science experiment on the cloud by uploading a dataset, creating a data preprocessing workflow, and finally creating, training, and evaluating a machine learning model.

### Azure Machine Learning Studio

Azure Machine Learning offers an integrated, drag-and-drop tool called the Azure Machine Learning Studio to build, test, and deploy predictive analytics solutions in a collaborative fashion.

Azure Machine Learning Studio gives you a largely self-contained environment, called an ML Workspace, to work in. Using this interactive, visual workspace, you drag and drop datasets and analysis modules onto a canvas and connect them together to form an experiment, which you run in Machine Learning Studio.

You create a model, train the model, and score and test the model. When you're ready, you can convert your training experiment to a predictive experiment, and then publish it as a web service so that your model can be accessed by others and easily be consumed by custom apps or Business Intelligence (BI) tools.

### Getting Started with Azure Machine Learning

Microsoft makes it really easy to get started with Azure Machine Learning! The simplest way to get up and running is to go to https://studio.azureml.net/. Click Sign In if you have already created an account. If you have not created an account yet, then Sign up for a free or paid account. 

![Sign In to Azure Machine Learning Studio](images/sign-in-to-azure-ml.png "Sign In to Azure Machine Learning Studio")

### Create an Experiment

When you log in to Azure Machine Learning Studio, you are presented with the Machine Learning Workspace where all your Machine Learning objects live.

First, you need to create an <b>Experiment</b> to create, train, score, and test your model. An experiment is simply a set of connected components; additional modules can be added to take care of data preprocessing, feature selection, data splitting, cross-validation, etc. You can find many predictive analytics experiments contributed by Microsoft and the data science community as starting points to develop your own solutions [here](https://gallery.cortanaintelligence.com/).

Creating an experiment is simple:  
1. Click <b>+NEW</b> at the bottom of the Machine Learning Studio page.
2. Select <b>EXPERIMENT</b>, and then select <b>Blank Experiment</b>.

![New Experiment](images/create-new-workspace.png "New Experiment")

A newly created experiment is given a default name, shown at the top of the page. You can select it and rename it to something meaningful. The name does not need to be unique.

On the right pane, you can enter a brief <b>Summary</b> and a detailed <b>Description</b> about your experiment.

![New Experiment Attributes](images/experiment-attributes.png "New Experiment Attributes")

### Setting up the Dataset

The most important step in machine learning is setting up the data. To the left of the experiment canvas is a palette of datasets and modules. There are several sample datasets included with Machine Learning Studio that you can use, or you can import data from many sources.

<b>Datasets</b> refer to data containers. To use your own data in Machine Learning Studio for developing and training a predictive analytics solution, you can:
- enter data manually by <b>typing values</b> for creating a small dataset
- upload data from a <b>local file</b> on your hard drive to create a dataset module in your workspace
- access data from one of several online data sources using the <b>Import Data</b> module while your experiment is running
- use data from another Azure Machine Learning experiment saved as a <b>dataset</b>
- use data from an on-premises <b>SQL Server database</b>

For this experiment, I will use an open dataset that has already been downloaded from [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Wine+Quality). There are two CSV files containing information about red and white variants of the Portuguese "Vinho Verde" wine. These datasets can be used to model wine quality based on physicochemical tests and can be viewed as classification or regression tasks. 

To import data from a local file on your hard drive:
1. Click <b>+NEW</b>  at the bottom of the Machine Learning Studio page.
2. Select <b>DATASET</b>, then select <b>FROM LOCAL FILE</b>.
3. In the <b>Upload a new dataset</b> dialog, browse to the file you want to upload.
4. Enter a name, identify the data type, and optionally enter a description. A description is recommended - it allows you to record any characteristics about the data that you want to remember when using the data in the future. Note that Azure ML Studio does a pretty good job at automatically detecting the type (in this case, CSV).
5. The checkbox <b>This is the new version of an existing dataset</b> allows you to update an existing dataset with new data. 

![New Dataset from Local File](images/from-local-file.png "New Dataset from Local File")

During upload, a message is displayed that your dataset is being uploaded. Upload time will depend on the size of your file and the speed of your connection. If the file takes a long time to load, you can do other things inside Machine Learning Studio or open another browser window while you wait. However, closing the browser will cause the file upload to fail. Once your data is uploaded, it's stored in a dataset module under <b>My Datasets</b>.

<b>Note:</b> Azure Machine Learning only accepts comma-separated (,), American-style CSV. If your CSV file is delimited by semicolons (;), you must convert it to comma-delimited before uploading it to Azure Machine Learning Studio. In Excel, go to <b>Data > Import External Data > Import Data</b>. Find and select your CSV file. When it opens in the <b>Text Import Wizard</b>, choose the delimiter that matches your file so the columns are parsed correctly, and then save the result as a comma-delimited CSV.
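If you would rather script the conversion than go through Excel, the sketch below shows one way to do it with Python's standard `csv` module. It is only an illustration: the function name `semicolon_to_comma` and the sample rows are my own, and it assumes the file uses a plain semicolon delimiter with ordinary double-quote quoting.

```python
import csv
import io

def semicolon_to_comma(src_text):
    """Rewrite semicolon-delimited CSV text as comma-delimited CSV text."""
    reader = csv.reader(io.StringIO(src_text), delimiter=";")
    out = io.StringIO()
    # The writer quotes a field only when it needs to (for example, if it contains a comma).
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

# Illustrative rows in the style of a semicolon-delimited file.
sample = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.2;5\n'
converted = semicolon_to_comma(sample)
print(converted)
```

To convert a file on disk, read it with `open(path, newline="")`, pass the text through the function, and write the result to a new file before uploading it.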

![My Datasets](images/datasets.png "My Datasets")

### Prepare and Visualize the Data

To begin modeling, drag and drop your datasets in the experiment canvas.

![Drag and Drop Datasets](images/drag-and-drop-datasets.png "Drag and Drop Datasets")

Datasets and modules have input and output ports represented by small circles - input ports at the top, output ports at the bottom. To create a flow of data through your experiment, you will be connecting an output port of one module to an input port of another. At any time, you can click the output port of a dataset or module to see what the data looks like at that point in the data flow.

Let's click on the output port of winequality-red dataset and select <b>Visualize</b> to view the data.

![Visualize the data](images/visualize-data.png "Visualize the data")

In this sample dataset, various physicochemical variables are associated with each red wine instance. Given the variables for a specific wine available as inputs, we will try to predict its quality as available in the far-right column.

![Sample dataset](images/data-visualization.png "Sample dataset")

You can also select a column to view summary statistics for that column.

Visualization also provides a convenient feature to compare two columns using a scatter plot. While viewing the statistics of a column, simply select <b>compare to</b> and choose another column from the dropdown to view the scatter plot.

![Visualize dataset statistics](images/visualize-statistics-plot.png "Visualize dataset statistics")
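As a sanity check, you can reproduce this kind of per-column summary outside of Studio. The following is a minimal sketch using Python's `statistics` module. The helper name `summarize` and the sample values are hypothetical, and I am assuming the sample standard deviation here, which may not be the exact formula Studio uses.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (assumption)
    }

# Made-up values standing in for one column of the dataset.
column = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(column)
print(summary)
```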

Let's close the visualization window by clicking the "<b>x</b>" in the upper-right corner and look at how we can preprocess the data by cleaning up missing values or excluding rows.

<b>Note:</b> A dataset usually requires some preprocessing before it can be analyzed. For example, you might want to clean up the missing values so that the model can analyze the data correctly. Also, a few of the columns may have highly skewed data or contain a large proportion of missing values, so you might want to exclude such columns from the model altogether.

<b>Note:</b> Cleaning the missing values from input data is a prerequisite for using most of the Azure Machine Learning modules.

First, we will exclude certain columns by using the <b>Select Columns in Dataset</b> module:
1. Type select columns in the Search box at the top of the module palette to find the <b>Select Columns in Dataset</b> module.
2. Drag the module to the experiment canvas and connect the output port of the winequality-red.csv dataset to the input port of the <b>Select Columns in Dataset</b> module.

![Add Select Columns in Dataset module](images/add-select-columns-in-dataset.png "Add Select Columns in Dataset module")

To include or exclude specific columns, click the <b>Select Columns in Dataset</b> module:
1. Click <b>Launch column selector</b> in the <b>Properties</b> pane on the right.
2. Select <b>WITH RULES</b> from the left pane of the Select columns dialog.
3. <b>ALL COLUMNS</b> directs <b>Select Columns in Dataset</b> to pass through all the columns, while <b>NO COLUMNS</b> directs <b>Select Columns in Dataset</b> to start with no columns included.
4. Since we want to <b>exclude</b> specific columns, we will select <b>ALL COLUMNS</b> under <b>Begin with</b>. Then, from the drop-downs, select <b>exclude</b> and <b>column names</b>.
5. Finally, click inside the text box and a list of columns is displayed. Select the columns to be excluded, and those are added to the text box.

![Select columns to exclude](images/select-columns.png "Select columns to exclude")

Click the check mark (OK) button on the lower-right to close the column selector. The properties pane will show that the selected column(s) are being excluded.
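Conceptually, the rule we just configured (begin with all columns, then exclude some by name) behaves like the small Python sketch below. This is only an illustration of the idea, not how Studio implements the module. The function name `exclude_columns` and the sample rows are hypothetical.

```python
def exclude_columns(rows, excluded):
    """Begin with all columns, then drop the named ones from every row."""
    excluded = set(excluded)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]

# Illustrative rows; each dict maps a column name to a value.
rows = [
    {"pH": 3.51, "sulphates": 0.56, "alcohol": 9.4, "quality": 5},
    {"pH": 3.20, "sulphates": 0.68, "alcohol": 9.8, "quality": 5},
]
selected = exclude_columns(rows, ["sulphates"])
print(selected)
```

Note that the function returns new rows and leaves the input untouched, much like a module that passes a new dataset to its output port.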

Next, we will clean up missing data by using the <b>Clean Missing Data</b> module:
1. Find the <b>Clean Missing Data</b> module, drag it to the experiment canvas, and connect the output port of the <b>Select Columns in Dataset</b> module to the input port of the <b>Clean Missing Data</b> module.
2. Since we want to clean the data by removing rows that have any missing values, select <b>Remove entire row</b> from the <b>Cleaning mode</b> dropdown in the <b>Properties</b> pane on the right.

![Clean missing data](images/clean-missing-data.png "Clean missing data")
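The effect of the <b>Remove entire row</b> cleaning mode can be pictured with a few lines of Python. This is a rough sketch under my own assumptions: the helper `remove_rows_with_missing` is hypothetical, and I treat `None` and the empty string as "missing", which may differ from what Studio considers a missing value.

```python
def remove_rows_with_missing(rows, missing=(None, "")):
    """Keep only the rows in which no value counts as missing."""
    return [row for row in rows if not any(v in missing for v in row.values())]

# Illustrative rows; the second and third each have one missing value.
rows = [
    {"pH": 3.51, "alcohol": 9.4, "quality": 5},
    {"pH": None, "alcohol": 9.8, "quality": 5},
    {"pH": 3.26, "alcohol": "", "quality": 6},
    {"pH": 3.16, "alcohol": 9.8, "quality": 6},
]
cleaned = remove_rows_with_missing(rows)
print(len(cleaned))
```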

<b>Note:</b> You can add a comment to a module by double-clicking the module and entering text. It is recommended to add a brief comment so that you can see at a glance what the module is doing in your experiment. 

Now that our data preprocessing workflow is complete, let's run the experiment by clicking <b>RUN</b> at the bottom of the page. When the experiment has finished running, all the modules have a green check mark to indicate that they finished successfully. If you want to view the cleaned dataset, click the left output port of the <b>Clean Missing Data</b> module and select <b>Visualize</b>. You'll find that the columns sulfur dioxide and sulphates from our sample dataset are no longer included, and there are no missing values.

![Cleaned dataset](images/cleaned-dataset.png "Cleaned dataset")

### Create a Model

Now that our data is ready, we want to construct a predictive model. We need some data to train the model and some to test it. Therefore, we will use the <b>Split Data</b> module to split the dataset into two separate datasets: one for training our model and one for testing it.

1. Find the <b>Split Data</b> module, drag it onto the canvas, and connect it to the <b>Clean Missing Data</b> module.
2. The <b>Split Data</b> module gives you two output ports, based on the <b>Fraction of rows</b> parameter. By default, the split ratio is 0.5 and the <b>Randomized split</b> parameter is set. This means that a random half of the data is output through one port of the <b>Split Data</b> module, and half through the other. You can adjust these parameters across several experiment runs and see what changes. For this example, I'll change the <b>Fraction of rows</b> to 0.7 so that 70% of the data is output through the left port, and leave the rest as-is.

![Split data](images/split-data.png "Split data")

We can use the outputs of the <b>Split Data</b> module however we like, but let's choose to use the left output as training data and the right output as testing data.
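To make the split concrete, here is a minimal Python sketch of a randomized 70/30 split. It only mirrors the idea behind the two parameters discussed above. The function `split_rows` and its `seed` argument are my own naming for illustration, not the module's actual interface.

```python
import random

def split_rows(rows, fraction=0.5, randomized=True, seed=0):
    """Return (first, second), where `first` holds roughly `fraction` of the rows."""
    rows = list(rows)  # copy so the caller's data is not shuffled in place
    if randomized:
        random.Random(seed).shuffle(rows)  # a fixed seed keeps the split repeatable
    cut = int(round(len(rows) * fraction))
    return rows[:cut], rows[cut:]

data = list(range(10))  # stand-in for ten dataset rows
train, test = split_rows(data, fraction=0.7)
print(len(train), len(test))
```

Every row ends up in exactly one of the two outputs, so the model is never tested on rows it was trained on.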

There are various models we could choose from. You can find all the available models under the <b>Initialize Model</b> menu in the <b>Machine Learning</b> node on your left pane. Since we are trying to predict wine quality, which is categorical data, I selected <b>Classification</b> and focused on the Multiclass models (Decision Forest, Decision Jungle, Logistic Regression and Neural Network).

For this example, let's go straight to <b>Multiclass Logistic Regression</b>. One of the benefits of using Azure Machine Learning Studio is the ability to try more than one type of model in a single experiment and compare the results, and in practice you would usually add multiple models to the same experiment.

In future posts, we will look at solving specific Machine Learning problems using the Azure Machine Learning modules.

Drag the <b>Multiclass Logistic Regression</b> module to the experiment canvas. Note that, when you drag the <b>Multiclass Logistic Regression</b> module into your experiment, you are only creating a model configuration object. It will contain all your model parameters, which you can always edit, even after running an experiment.

![Regression](images/regression.png "Regression")

In order to train a model, we need to:
1. Find and drag the <b>Train Model</b> module to the experiment canvas.
2. Connect the output of the <b>Multiclass Logistic Regression</b> module to the left input of the <b>Train Model</b> module.
3. Connect the training data output (left port) of the <b>Split Data</b> module to the right input of the <b>Train Model</b> module.
4. Select the <b>Train Model</b> module, click <b>Launch column selector</b> in the <b>Properties</b> pane on the right, and then select the column that our model is going to predict - in this case <i>quality</i>.

![Train model](images/train-model.png "Train model")
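To see why <b>Train Model</b> needs both an untrained model configuration and a training dataset with a label column, consider the toy sketch below. It trains a small softmax (multiclass logistic) regression with stochastic gradient descent in plain Python. This is not Studio's implementation. The names `train_model` and `predict`, the `epochs` and `learning_rate` settings, and the feature columns `a` and `b` are all hypothetical, and the tiny dataset is made up so that the classes are easy to separate.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, classes, features):
    """Linear score per class; the last weight of each class is the bias."""
    xb = list(features) + [1.0]
    return [sum(w * v for w, v in zip(weights[c], xb)) for c in classes]

def train_model(config, rows, label_column):
    """Fit one weight vector per class with stochastic gradient descent."""
    feature_columns = [c for c in rows[0] if c != label_column]
    classes = sorted({row[label_column] for row in rows})
    weights = {c: [0.0] * (len(feature_columns) + 1) for c in classes}
    for _ in range(config["epochs"]):
        for row in rows:
            x = [row[c] for c in feature_columns]
            probs = softmax(class_scores(weights, classes, x))
            for c, p in zip(classes, probs):
                # Gradient of the cross-entropy loss with respect to this class's score.
                error = p - (1.0 if c == row[label_column] else 0.0)
                for j, v in enumerate(x + [1.0]):
                    weights[c][j] -= config["learning_rate"] * error * v
    return {"weights": weights, "classes": classes, "features": feature_columns}

def predict(model, row):
    """Return the class with the highest score for one row."""
    x = [row[c] for c in model["features"]]
    scores = class_scores(model["weights"], model["classes"], x)
    return model["classes"][scores.index(max(scores))]

config = {"epochs": 50, "learning_rate": 0.1}  # untrained configuration only
training_rows = [
    {"a": -1.0, "b": -1.2, "quality": 3},
    {"a": -1.5, "b": -0.8, "quality": 3},
    {"a": -0.9, "b": -1.1, "quality": 3},
    {"a": 1.0, "b": 1.2, "quality": 7},
    {"a": 1.5, "b": 0.8, "quality": 7},
    {"a": 0.9, "b": 1.1, "quality": 7},
]
model = train_model(config, training_rows, "quality")
predictions = [predict(model, r) for r in training_rows]
print(predictions)
```

The configuration stays a plain set of parameters, while the trained model is a separate object produced from the configuration plus the data. Real features such as those in the wine dataset would normally need scaling before a simple gradient descent like this behaves well.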

Finally, we will add the <b>Score Model</b> module to verify our model using the remaining 30% of our dataset and see how well our model performs. 

Find and drag the <b>Score Model</b> module to the experiment, connect the output of the <b>Train Model</b> module to the left input port of <b>Score Model</b>, and connect the test data output (right port) of the <b>Split Data</b> module to the right input port of <b>Score Model</b>.

Once everything is configured, let's finally run our experiment by clicking on the <b>RUN</b> button at the bottom of the page. A spinning indicator on each module shows that it's running, and then a green check mark shows when the module is finished. When all the modules have a check mark, the experiment has finished running.

![Running experiment](images/running.png "Running experiment")

### Evaluate the Model

Once the experiment finishes running, we can quickly inspect the results by clicking the output port of our new <b>Score Model</b> module and then selecting <b>Visualize</b>. The output shows the predicted class for each of the rows, together with the known values from the test data, and the statistics about each column, including mean, median, min, max, standard deviation, and a graphical distribution of its values (histogram).

However, reading through all the rows manually is not an efficient way to evaluate our model. Instead, drag the <b>Evaluate Model</b> module to the experiment canvas and connect the output of the <b>Score Model</b> module to the left input of <b>Evaluate Model</b>.

<b>Note:</b> There are two input ports on the <b>Evaluate Model</b> module because it can be used to compare two models side by side. 

![Evaluate](images/evaluate.png "Evaluate")

Let's re-run the experiment, click the output port of the <b>Evaluate Model</b> module, and then select <b>Visualize</b>.

Here we find useful metrics about our model, including accuracy, precision, and recall. More intuitively, the Confusion Matrix shows graphically how the 6 possible classes were classified in the testing set.
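If these metrics are new to you, the following sketch shows how accuracy, per-class precision and recall, and a confusion matrix can be computed from known and predicted labels. It is a simplified illustration in plain Python: the function names and the label lists are made up, and Studio may aggregate the per-class numbers differently (for example, as averages across classes).

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def evaluate(actual, predicted):
    """Return overall accuracy, per-class precision/recall, and the matrix."""
    matrix = confusion_matrix(actual, predicted)
    classes = sorted(set(actual) | set(predicted))
    accuracy = sum(matrix[(c, c)] for c in classes) / len(actual)
    per_class = {}
    for c in classes:
        true_pos = matrix[(c, c)]
        predicted_c = sum(matrix[(a, c)] for a in classes)  # column total
        actual_c = sum(matrix[(c, p)] for p in classes)     # row total
        per_class[c] = {
            "precision": true_pos / predicted_c if predicted_c else 0.0,
            "recall": true_pos / actual_c if actual_c else 0.0,
        }
    return accuracy, per_class, matrix

# Made-up quality labels for eight test rows.
actual = [5, 5, 6, 6, 6, 7, 7, 5]
predicted = [5, 6, 6, 6, 5, 7, 6, 5]
accuracy, per_class, matrix = evaluate(actual, predicted)
print(accuracy)
print(per_class)
```

Precision answers "of the rows predicted as class X, how many really were X?", while recall answers "of the rows that really were X, how many did the model find?".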

By examining the various metrics produced by the <b>Evaluate Model</b> module, you can decide whether the model is close to, or too far from, giving you the results you desire. You can continue to improve the model by iterating on your experiment and changing parameter values in the different models. The science and art of interpreting these results and tuning the model performance is outside the scope of this post.

When you're satisfied with your model, you can deploy it as a web service.
