![air-paradis](https://drive.google.com/uc?id=1T26mpOAUvJP700W4m8bjfYCLmDYVcyJL)

# <font color=red><center>**AIR PARADIS**</center></font>

**Air Paradis** is an airline company that wants to use AI (*Artificial Intelligence*) to **detect Bad Buzz associated with its brand** in online public tweets.

**As AI engineers for Marketing Intelligence Consulting**, we will dive into **NLP** (*Natural Language Processing*) techniques to serve Air Paradis' purpose.

Indeed, NLP allows a machine to **understand and process human language**. It will help us to solve this **text classification task** and **detect the sentiment** (positive or negative) of these tweets.

We will deploy our best **DETECT SENTIMENT solution** through the <font color=salmon>**Microsoft Azure Machine Learning platform**</font> (***MS Azure ML***).

<br>

Therefore, we will structure the project as follows:

<br>

| **Services / Tools** | **Objective** | **Available notebook** |
| :-- | :-- | :-- |
| **Google Colab and Python libraries** | Improve data quality by pre-processing the tweet text | Notebook N°1 |
| **Google Colab / MS Azure Cognitive Services API** | Use Text Analytics > Sentiment API | Notebook N°2 |
| **Python Script / MS Azure ML Studio > Designer** | Use "Drag-and-Drop" pipeline with no code in Azure ML Studio| **<font color=green>Notebook N°3</font>** |
| **Tensorflow-Keras / Google Colab PRO with GPU/TPU** | Train and evaluate advanced models | Notebook N°4 |
| **MS Azure ML Cloud > Models** | Deploy the best solution in MS Azure WebService | Notebook N°5 |

<br>

This notebook is dedicated to the 3rd task: **test Azure Machine Learning Studio Designer to build and train a logistic regression model for sentiment classification**.

# <font color=brown><center>**NOTEBOOK 3<br>AZURE MACHINE LEARNING STUDIO<br>LOGISTIC REGRESSION PIPELINE**</center></font>

![aml-studio](https://drive.google.com/uc?id=1H-cir-dzjvvU8ggTfmVcILDtLmAD_mfL)

# <font color=salmon>PART 1 - AZURE MACHINE LEARNING STUDIO</font>

## <font color=green>P1.1 - Understand Azure ML Studio</font>

**Azure Machine Learning** is a cloud-based environment that can be used to **train, deploy, automate, manage, and track ML models**.

**Azure Machine Learning Studio** is a web portal in Azure Machine Learning that offers **low-code and no-code options** for model training, deployment, and asset management:
- **Notebooks**, directly integrated to write and run our own code in managed Jupyter Notebook servers;
- **Designer**, with a drag-and-drop interface to train and deploy machine learning models as a *pipeline*, with little or no code;
- **Experiment**, to diagnose errors and warnings, or track performance metrics of experiments scaled with a compute target;
- **Compute**, to manage a wide and customizable range of compute resources.

The Azure ML Studio portal can be found [here](https://ml.azure.com/).

![designer](https://drive.google.com/uc?id=1ZCh32XNfGvUIfqR8DNZNdhQCsODuWda7)

An **AML workspace** is necessary to interact with AML Studio.

This workspace is the centralized place that contains all the components that allow us to create, use and register any Azure Machine Learning resource.

The documentation about the workspace can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace).

## <font color=green>P1.2 - Understand Azure ML Designer</font>

As mentioned above, the **Designer** is a **drag-and-drop** tool used to create machine learning models with little or no code.

In the Designer interface, Azure Machine Learning Studio gives 2 options:
- **Create a new pipeline from scratch**;
- **Clone one of the available sample pipelines** and adapt it to our need.

We have chosen the 2nd option: 
- Clone the **Text Classification - Wikipedia SP 500 Dataset** sample;
- Adapt the easy-to-use prebuilt modules to our needs.

![create](https://drive.google.com/uc?id=1i813a14CVo9pkmesFV7a--SM3M-Dekll)![sample](https://drive.google.com/uc?id=16_8cWEXomO6qeT_9m8_6UF0efShv9WQq)

## <font color=green>P1.3 - Understand pipeline steps and modules</font>

The pipeline is a **series of steps called "modules", connected to each other in order to build a complete workflow of a machine learning task**.

A **module** is an algorithm that can be performed on the data, such as:
- **Data preparation**: dataset importation, sampling, data cleaning;
- **Data transformation**: hashing and n-grams features from text;
- **Data split in subsets**: train and test sets;
- **Model selection**: regression, classification;
- **Training configuration**: especially, compute resources;
- **Progress monitoring**: error and warning diagnosis, performance tracking, etc.

# <font color=salmon>PART 2 - WORK WITH CLONED PIPELINE</font>

## <font color=green>P2.1 - Use Sample dataframe</font>

First, we need to use our sample dataframe to replace the "Wikipedia SP 500 Dataset" in the pipeline.

As seen in our first notebook, the text column with the best performance is **cleaned_tweet**: the tweets have been pre-processed but keep their stopwords and are not lemmatized.

We will use it to keep all our models **comparable**, applying roughly the same variables, apart from the sample size and a few minor variations.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Let's get to AML Studio to load it.

## <font color=green>P2.2 - Clone existing pipeline</font>

When we click on **Designer** in the left menu, we now have a copy.

![amls-clone](https://drive.google.com/uc?id=1vS54I9EwnjorLi4srugtADTc077rCDtw)

## <font color=green>P2.3 - Register our sample dataset</font>

Then, we go to AML Studio again and choose **Datasets** in the left menu.

![dataset](https://drive.google.com/uc?id=1TmynispJorTdtqMnQmnvqvSfRj8nZZ7T)

## <font color=green>P2.4 - Adapt cloned pipeline</font>

### <font color=blue>P2.4.1 - 1st step: change the dataset</font>

We just need to click on the module of the old dataset (Wikipedia), delete it and drag-and-drop the new dataset.

If needed, we can change some parameters.

![change](https://drive.google.com/uc?id=1LLIAjUhMlyRWh8ZDu01U485xpJse-f6j)

### <font color=blue>P2.4.2 - Following steps: adjust the parameters of other modules</font>

The following actions have been done:
- **Preprocess Text** has been removed, as we want the model to predict directly on our already cleaned tweets;
- **Split Data** has been readjusted to 0.8 for the Train set and 0.2 for the Test set, with a random seed (42) and a *stratified split*;
- The model has been changed from *Multiclass Logistic Regression* to **Two-Class Logistic Regression**, with random seed = 42, as the classification algorithm.

In the cloned pipeline, 2 text transformation modules are used:
- Feature Hashing;
- Extract N-Gram Features from Text.

The only parameter we changed for both is *Target column*, to match our cleaned tweet column.

The rest was kept unchanged: **Train Model, Score Model, Evaluate Model**.
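To make the **Split Data** settings more concrete, here is a minimal pure-Python sketch of a stratified 0.8 / 0.2 split with a fixed seed. It is only an illustration of the idea, not the code that Azure ML Designer runs: the function `stratified_split`, its parameter names and the balanced toy sample are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_fraction=0.8, seed=42):
    """Split rows into train/test sets while keeping each label's proportion."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical balanced sample of 1600 (text, label) rows
sample = [(f"tweet {i}", i % 2) for i in range(1600)]
train_rows, test_rows = stratified_split(sample, label_of=lambda r: r[1])
print(len(train_rows), len(test_rows))  # 1280 320
```

With a stratified split, the share of positive and negative tweets is the same in both subsets, which keeps the evaluation metrics comparable from one run to the next.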

![pipeline](https://drive.google.com/uc?id=1OZVcDQBxCdJrl1aPF7vngLXLdkXADLeJ)

## <font color=green>P2.5 - Set up compute target</font>

Here again, we have 2 choices to create a compute target:
- Create one with the predefined configuration;
- Create a compute cluster ==> this is the option we took.

## <font color=green>P2.5 - Set up pipeline run and submit</font>

We just have to create the experiment by giving it a name.

<font color=red>And then we wait... BUT we ran into many hiccups and had to run 14 submissions before we got a result!</font>

This is not what I would call an **"easy drag-and-drop"** tool, but that is a personal opinion.

Besides, it requires a minimum understanding of how each algorithm works, in terms of input and output requirements and parameters, to be able to debug when a module run fails.

## <font color=green>P2.6 - Evaluate the result</font>


As we have split our 1600 rows into 80% Train and 20% Test, the evaluation was made on 320 rows.

### <font color=blue>P2.6.1 - Evaluate feature hashing (left port)</font>

**Feature Hashing** module transforms a stream of English text into a set of integer features.

It operates on the exact **strings** provided as input and does not perform any linguistic analysis or preprocessing.

Internally, the Feature Hashing module creates a dictionary of n-grams; the size of the n-grams can be controlled with the N-grams property.

After the dictionary is built, the Feature Hashing module converts the dictionary terms into hash values. It then computes whether a feature was used in each case. For each row of text data, the module outputs a set of columns, one column for each hashed feature.
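The idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the module's actual implementation: MD5 is used here as a stand-in hash function, occurrences are counted rather than flagged, and the names `hash_features`, `n_bits` and `max_n` are hypothetical.

```python
import hashlib

def ngrams(tokens, max_n):
    """Return all n-grams of size 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def hash_features(text, n_bits=4, max_n=2):
    """Map a text to a fixed-length vector of 2**n_bits hashed n-gram counts."""
    size = 2 ** n_bits
    vector = [0] * size
    for gram in ngrams(text.split(), max_n):
        # Assumption: MD5 stands in for whatever hash the real module uses
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1
    return vector

vec = hash_features("flight delayed again")
print(len(vec), sum(vec))  # 16 5
```

Whatever the vocabulary size, every tweet ends up with the same number of columns. The price is that two different n-grams may collide in the same column, which loses some information.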

#### **Visualize transformed dataset**

![dataset](https://drive.google.com/uc?id=143xei5Kr3k9gB9u-jHbJPWe14p0M0b-R)

#### **Visualize metrics and confusion matrix**

***Observations***:
- First, we note that the confusion matrix is not laid out as usual: the actual values are along the top (columns) and the predicted values down the left (rows). Besides, the positive class (1) comes first, so the True Positives are in the top-left box and the True Negatives are in the bottom-right box.

![left](https://drive.google.com/uc?id=18Dor4k9xMv47L7BjJQmYPLKil6YjKfX3)

<font color=red>**Note that the model reaches its best accuracy (0.634) at a classification threshold of 0.5.**</font>
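To clarify how this matrix layout and the classification threshold relate to accuracy, here is a small sketch. The labels and scores are toy values invented for the example, not the notebook's results, and the helper names are hypothetical.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Count TP, FP, FN, TN after thresholding scores into 0/1 predictions."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def designer_layout(tp, fp, fn, tn):
    """Rows = predicted (1 first), columns = actual (1 first)."""
    return [[tp, fp],
            [fn, tn]]

def accuracy(tp, fp, fn, tn):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical toy labels and scores, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3]

for threshold in (0.3, 0.5, 0.7):
    counts = confusion_counts(y_true, scores, threshold)
    print(threshold, designer_layout(*counts), accuracy(*counts))
```

Moving the threshold shifts rows between the four boxes, which is why the accuracy shown in the Studio changes when the threshold slider is moved.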

### <font color=blue>P2.6.2 - Evaluate n-gram extraction features from text (right port)</font>

The **Extract N-Gram Features from Text** module *featurizes* unstructured text data. 

It creates a new list of n-grams from the column of raw text to process: for example, if the N-Gram size is 2, both unigrams and bigrams will be created.

The <code>**weighting function**</code> was set to **TF-IDF Weight**, which means that a *term frequency/inverse document frequency (TF/IDF) score* is assigned to the extracted n-grams. The value for each n-gram is its TF score multiplied by its IDF score.
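As a rough illustration of this weighting, the sketch below extracts unigrams and bigrams and scores them with TF multiplied by IDF. It assumes a raw-count TF and an IDF defined as the log of the corpus size divided by the document frequency; the exact formula used by the module may differ. The function names and the three-tweet corpus are invented for the example.

```python
import math
from collections import Counter

def extract_ngrams(text, max_n=2):
    """Return the n-grams of size 1..max_n found in a lower-cased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(corpus, max_n=2):
    """Return one {n-gram: TF * IDF} dictionary per document."""
    docs = [Counter(extract_ngrams(text, max_n)) for text in corpus]
    # Document frequency: in how many documents each n-gram appears
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    # Assumption: TF is the raw count and IDF = log(n_docs / doc_freq)
    return [{gram: tf * math.log(n_docs / doc_freq[gram])
             for gram, tf in doc.items()}
            for doc in docs]

# Hypothetical mini-corpus
corpus = ["flight delayed again", "great flight", "delayed again and again"]
weights = tfidf(corpus)
print(extract_ngrams("great flight"))  # ['great', 'flight', 'great flight']
print(round(weights[1]["great"], 3))   # 1.099
```

An n-gram that appears in few tweets gets a high weight, while one that appears in every tweet gets a weight of 0 with this formula, since log(1) = 0.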

#### **Visualize transformed dataset and vocabulary**

![dataset](https://drive.google.com/uc?id=1ksxl6vmNd6JPqBSkPZA98fM37YRDym-U) ![vocab](https://drive.google.com/uc?id=1LNmFLhPjRndDG6kGv6HbzvU5egzfBqhW)

#### **Visualize metrics and confusion matrix**

![right](https://drive.google.com/uc?id=1Z6SduZEsmH898PTW2dq5L0K2_qQ-GnRa)

<font color=red>**Note that the model reaches its best accuracy (0.706) at a classification threshold of 0.5.**</font>

The result can be summarized as below.

In [None]:
import pandas as pd

# Collect one record per model
scores_model = []

# Feature hashing results
scores_model.append({'Model': 'Studio - Feat. hashing',
                     'Predict_time':'{:0.1f}'.format(60+21.45),
                     'AUC_Score':'{:0.3f}%'.format(70.2),
                     'Accuracy':'{:0.3f}%'.format(63.4)})

# N-grams extraction results
scores_model.append({'Model': 'Studio - N-grams extraction',
                     'Predict_time':'{:0.1f}'.format(60+43.43),
                     'AUC_Score':'{:0.3f}%'.format(78.4),
                     'Accuracy':'{:0.3f}%'.format(70.6)})

# Build the DataFrame in one go (DataFrame.append was removed in pandas 2.0)
model_results = pd.DataFrame.from_records(scores_model)

model_results

Unnamed: 0,Model,Predict_time,AUC_Score,Accuracy
0,Studio - Feat. hashing,81.5,70.200%,63.400%
1,Studio - N-grams extraction,103.4,78.400%,70.600%


In [None]:
from pathlib import Path
src_folder = Path('/content/drive/MyDrive/OC_IA/P07')

# Save to CSV file
model_results.to_csv(src_folder / 'p7_03_model_results.csv', index=False)

## <font color=green>P2.7 - Get experiment link for presentation</font>

In [None]:
from IPython.display import clear_output

# Install azure ml SDK
!pip install azureml-core
# !pip install azureml-pipeline
# !pip install azureml-pipeline-core

clear_output()

In [None]:
import azureml.core

# Check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.36.0


In [None]:
from azureml.core import Workspace

# Connect to workspace
ws = Workspace.from_config('/content/drive/MyDrive/OC_IA/P07/p7_03_ws_config.json')

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code RM4CT33UY to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


In [None]:
from azureml.core.experiment import Experiment

# Get a list of all Experiment object in a Workspace
list_experiments = Experiment.list(ws)
list_experiments

[Experiment(Name: OCIA_P07,
 Workspace: OC_IA_Tyson)]

In [None]:
# Retrieved from Azure Machine Learning web UI
run_id = '4ce36137-4bc1-48d6-8619-8bb9a7a30dc0'
exp = ws.experiments['OCIA_P07']
run = next(run for run in exp.get_runs() if run.id == run_id)
run

Experiment,Id,Type,Status,Details Page,Docs Page
OCIA_P07,4ce36137-4bc1-48d6-8619-8bb9a7a30dc0,azureml.PipelineRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


# <font color=salmon>CONCLUSION</font>

From our point of view, this tool does not keep its promises: it is supposed to be straightforward, BUT IT IS NOT AT ALL.

We have to admit that Microsoft lacks pedagogy and has trouble presenting its tutorials in a sequential manner that keeps in mind the context in which the tool is used.

In [None]:
# Load result Sentiment API
results1 = pd.read_csv(src_folder / 'p7_02_model_results.csv')
# results1.reset_index(drop=True, inplace=True)
results1

Unnamed: 0,Model,Predict_time,AUC_Score,Accuracy
0,Original tweets,186.1,74.349%,74.335%
1,Cleaned tweets,192.2,75.989%,75.949%
2,Lemmatized tweets,184.0,75.380%,75.319%


In [None]:
# Load result ML Studio
results2 = pd.read_csv(src_folder / 'p7_03_model_results.csv')
# results2.reset_index(drop=True, inplace=True)
results2

Unnamed: 0,Model,Predict_time,AUC_Score,Accuracy
0,Studio - Feat. hashing,81.5,70.200%,63.400%
1,Studio - N-grams extraction,103.4,78.400%,70.600%
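One possible way to read the two tables together is to concatenate them and sort by accuracy. The sketch below rebuilds both tables from the values displayed above so that it runs on its own; in the notebook, `results1` and `results2` could be passed to `pd.concat` directly. The variable names `api_results`, `studio_results` and `comparison` are only illustrative.

```python
import pandas as pd

# Values copied from the two result tables displayed above
api_results = pd.DataFrame({
    'Model': ['Original tweets', 'Cleaned tweets', 'Lemmatized tweets'],
    'Predict_time': [186.1, 192.2, 184.0],
    'AUC_Score': ['74.349%', '75.989%', '75.380%'],
    'Accuracy': ['74.335%', '75.949%', '75.319%'],
})
studio_results = pd.DataFrame({
    'Model': ['Studio - Feat. hashing', 'Studio - N-grams extraction'],
    'Predict_time': [81.5, 103.4],
    'AUC_Score': ['70.200%', '78.400%'],
    'Accuracy': ['63.400%', '70.600%'],
})

# Stack both tables into a single comparison table
comparison = pd.concat([api_results, studio_results], ignore_index=True)

# Strip the '%' suffix so that the scores can be sorted numerically
for column in ('AUC_Score', 'Accuracy'):
    comparison[column] = comparison[column].str.rstrip('%').astype(float)

comparison = (comparison
              .sort_values('Accuracy', ascending=False)
              .reset_index(drop=True))
comparison
```

On these figures, the Sentiment API on cleaned tweets has the highest accuracy (75.949%), while the Studio n-grams pipeline has the highest AUC (78.4%).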
