Data export

Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook first if you haven't already.


In [1]:
#Check core SDK version number
import azureml.core

print("SDK version", azureml.core.VERSION)

SDK version 1.51.0


In [2]:
!pip install azure-ai-ml

Collecting azure-ai-ml
  Downloading azure_ai_ml-1.12.1-py3-none-any.whl (8.3 MB)
Collecting azure-storage-file-datalake<13.0.0
  Downloading azure_storage_file_datalake-12.14.0-py3-none-any.whl (251 kB)
Collecting pydash<7.0.6,>=6.0.0
  Downloading pydash-7.0.5-py3-none-any.whl (109 kB)
Collecting azure-storage-file-share<13.0.0
  Downloading azure_storage_file_share-12.15.0-py3-none-any.whl (267 kB)
Collecting marshmallow<4.0.0,>=3.5
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting strictyaml<2.0.0
  Downloading strictyaml-1.7.3-py3-none-any.whl (123 kB)

## **Alternative method**

In [3]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# authenticate
credential = DefaultAzureCredential()
SUBSCRIPTION='2b635f7e-bdd3-49af-9129-9a0d8d7da646'
RESOURCE_GROUP='groupe_ressource_groupe2'
WS_NAME='workspace_nlp_groupe2'
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=SUBSCRIPTION,
    resource_group_name=RESOURCE_GROUP,
    workspace_name=WS_NAME,
)

In [4]:
# Verify that the handle works correctly.
# If you get an error here, modify your SUBSCRIPTION, RESOURCE_GROUP, and WS_NAME in the previous cell.
ws = ml_client.workspaces.get(WS_NAME)
print(ws.location,":", ws.resource_group)

westus3 : groupe_ressource_groupe2
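
The cells above hard-code the subscription ID, resource group, and workspace name. As a minimal sketch of an alternative, the same three values could be read from environment variables. The variable names below are hypothetical and not part of the original notebook; the function only builds a dictionary and does not contact Azure.

```python
import os
from typing import Mapping, Optional


def load_workspace_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the three workspace settings from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever fits your own setup.
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZURE_ML_WORKSPACE",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Example with an in-memory mapping instead of the real environment.
settings = load_workspace_settings(
    {
        "AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
        "AZURE_RESOURCE_GROUP": "my-resource-group",
        "AZURE_ML_WORKSPACE": "my-workspace",
    }
)
print(settings["workspace_name"])
```

The dictionary keys match the keyword arguments used in the cell above, so it could be passed as `MLClient(credential=credential, **settings)`.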


Create a job environment

In [5]:
import os
dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

In [6]:
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=1.0.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow==2.8.0
    - mlflow-skinny==2.8.0
    - azureml-mlflow==1.51.0
    - psutil>=5.8,<5.9
    - tqdm>=4.59,<4.60
    - ipykernel~=6.0
    - matplotlib
    - nltk==3.7
    - imblearn
    - seaborn

Writing ./dependencies/conda.yaml


In [7]:
from azure.ai.ml.entities import Environment
from sklearn.naive_bayes import GaussianNB

custom_env_name = "aml-scikit-learn"

custom_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Chichewa Defaults job",
    tags={"scikit-learn": "1.0.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)

Environment with name aml-scikit-learn is registered to workspace, the environment version is 1


Create training script

In [8]:
import os
train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)

In [9]:
%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
# TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
# Functions for evaluating a classification model:
# accuracy_score computes the fraction of predicted labels that match the true labels;
# classification_report lists precision, recall, F1 score, and support for each class;
# confusion_matrix counts the samples for each pair of true and predicted classes.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# train_test_split divides a dataset into training and test sets.
from sklearn.model_selection import train_test_split
# SGDClassifier implements linear classifiers trained with stochastic gradient descent (SGD).
from sklearn.linear_model import SGDClassifier
# LabelEncoder encodes categorical labels as integers.
from sklearn.preprocessing import LabelEncoder
# WordNetLemmatizer reduces words to their base (dictionary) form.
from nltk.stem import WordNetLemmatizer
# SMOTE (Synthetic Minority Over-sampling Technique) oversamples minority classes.
from imblearn.over_sampling import SMOTE
# matplotlib.pyplot provides functions for creating plots and charts.
import matplotlib.pyplot as plt
# Seaborn is a data visualization library built on Matplotlib.
import seaborn as sns
# NumPy is a scientific computing library for array manipulation and analysis.
import numpy as np
# re is the built-in module for working with regular expressions.
import re
# warnings controls warning messages; here all warnings are suppressed.
import warnings
warnings.filterwarnings("ignore")

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.3)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()
   
    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    
    news = pd.read_csv(args.data)
    print(news.columns)

    mlflow.log_metric("num_rows", news.shape[0])
    mlflow.log_metric("num_columns",news.shape[1])
    # Chichewa stop words removed during preprocessing
    chichewa = ['i', 'ine', 'wanga', 'inenso', 'ife', 'athu', 'athu', 'tokha', 'inu', 'ndinu','iwe ukhoza', 'wako','wekha','nokha','iye','wake','iyemwini','icho','ndi','zake','lokha','iwo','awo','iwowo','chiyani','amene', 'uyu', 'uyo', 'awa', "ndili", 'ndi', 'ali','anali','khalani','akhala','kukhala',' Khalani nawo','wakhala','anali','chitani','amachita','kuchita', 'a', 'an', 'pulogalamu ya', 'ndi', 'koma', 'ngati', 'kapena', 'chifukwa', 'monga', 'mpaka', 'pamene', 'wa', 'pa ',' by','chifukwa', 'ndi','pafupi','kutsutsana','pakati','kupyola','nthawi', 'nthawi','kale','pambuyo','pamwamba', 'pansipa', 'kuti', 'kuchokera', 'mmwamba', 'pansi', 'mu', 'kunja', 'kuyatsa', 'kuchoka', 'kutha', 'kachiwiri', 'kupitilira','kenako',' kamodzi','apa','apo','liti','pati','bwanji','onse','aliyense','onse','aliyense', 'ochepa', 'zambiri', 'ambiri', 'ena', 'otero', 'ayi', 'kapena', 'osati', 'okha', 'eni', 'omwewo', 'kotero',' kuposa','nawonso',' kwambiri','angathe','ndidzatero','basi','musatero', 'musachite',' muyenera', 'muyenera kukhala','tsopano', 'sali', 'sindinathe','sanachite','satero','analibe', 'sanatero','sanachite','sindinatero','ayi','si', 'ma', 'sizingatheke','mwina','sayenera', 'osowa','osafunikira', 'shan' , 'nenani', 'sayenera', 'sanali', 'anapambana', 'sangachite', 'sanakonde', 'sangatero']
    wn = WordNetLemmatizer()
    def text_preprocessing(review):
        review = re.sub('[^a-zA-Z]', ' ', review)
        review = review.lower()
        review = review.split()
        review = [wn.lemmatize(word) for word in review if word not in chichewa]
        review = ' '.join(review)
        return review
        
    news['Text'] = news['Text'].apply(text_preprocessing)

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(news['Text']).toarray()
    training = pd.DataFrame(X)

    
    X = training
    y = news['Label']

    label_encoder = LabelEncoder()
    y_label = label_encoder.fit_transform(y)

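    # Oversample minority classes with SMOTE.
    # Note: this runs before the train/test split, so synthetic samples derived
    # from rows that end up in the test set can also appear in the training set.
    smote = SMOTE()
    X, y_label = smote.fit_resample(X,y_label)
    print("Class counts after SMOTE:", np.bincount(y_label))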


    # Split train and test datasets, using --test_train_ratio as the test fraction
    X_train, X_test, y_train, y_test = train_test_split(X, y_label, test_size=args.test_train_ratio, random_state=0)
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################

    # convert the dataframe values to array
    X_train = X_train.values

    # convert the dataframe values to array
    X_test = X_test.values

    print(f"Training with data of shape {X_train.shape}")

    model = GaussianNB()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    print(classification_report(y_test, y_pred))

    
    test_pred = label_encoder.inverse_transform(y_pred)
    test_label = label_encoder.inverse_transform(y_test)
    # confusion_matrix expects the true labels first, then the predictions
    cf_matrix = confusion_matrix(test_label, test_pred)
    sns.heatmap(cf_matrix, annot=True)
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=model,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=model,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    
    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()

Writing ./src/main.py
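
To see what the `text_preprocessing` step in `main.py` does to a sentence, here is a minimal standalone sketch of the same cleaning logic. It assumes a tiny illustrative stop-word set and leaves out the WordNet lemmatization so that it runs without NLTK downloads; the sample sentence is made up.

```python
import re

# Tiny illustrative stop-word set; the script itself uses a much longer list.
STOP_WORDS = {"ndi", "wa", "kuti", "pa"}


def preprocess(text, stop_words=STOP_WORDS):
    """Keep ASCII letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)


cleaned = preprocess("MCP siidakhutire ndi kalembera, 2018!")
print(cleaned)

# Entries that contain a space can never equal a single token,
# so they have no effect on the result.
unchanged = preprocess("pulogalamu ya", stop_words={"pulogalamu ya"})
print(unchanged)
```

Because the text is split on whitespace before filtering, stop-word entries with leading, trailing, or inner spaces (for example `'pa '` or `'pulogalamu ya'` in the list above) are never matched.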


Configure the command

In [10]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "chichewa_defaults_model"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="Train.csv",
        ),
        test_train_ratio=0.2,
        learning_rate=0.25,
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --learning_rate ${{inputs.learning_rate}} --registered_model_name ${{inputs.registered_model_name}}",
    environment="aml-scikit-learn@latest",
    display_name="chichewa_default_prediction",
)
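
The `command` string refers to job inputs through `${{inputs.<name>}}` placeholders, so every placeholder name has to match a key in `inputs`. The sketch below illustrates that substitution with a regular expression. It is not how Azure ML resolves the placeholders internally, and the data path in the example is a hypothetical stand-in for whatever path the service provides at run time.

```python
import re


def render_command(template, inputs):
    """Replace each ${{inputs.<name>}} placeholder with the matching value."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"No input named {name!r}")
        return str(inputs[name])

    return re.sub(r"\$\{\{\s*inputs\.(\w+)\s*\}\}", substitute, template)


template = (
    "python main.py --data ${{inputs.data}} "
    "--test_train_ratio ${{inputs.test_train_ratio}}"
)
rendered = render_command(
    template,
    {"data": "/mnt/inputs/Train.csv", "test_train_ratio": 0.2},  # hypothetical path
)
print(rendered)
```

A misspelled placeholder raises `KeyError` in this sketch, which mirrors the kind of mismatch worth checking for before submitting the job.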

Submit the job

In [11]:
ml_client.create_or_update(job)

Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Uploading src (0.01 MBs): 100%

Experiment,Name,Type,Status,Details Page
projet_nlp_2023_groupe2,gentle_yak_7w951b305j,command,Starting,Link to Azure Machine Learning studio
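
The job is reported as `Starting` right after submission, and the model can only be registered once it finishes. One way to wait is a small polling loop, sketched here. The set of terminal status strings and the `get_status` callable are assumptions made for illustration; the example replays a fixed sequence of statuses instead of calling Azure.

```python
import time
from typing import Callable

# Status strings assumed to mark a finished job.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status: Callable[[], str], poll_seconds: float = 30.0,
                 max_polls: int = 120) -> str:
    """Call get_status until it returns a terminal state or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Job did not reach a terminal state in time")


# Stand-in for a real status lookup: replays a fixed sequence of statuses.
fake_statuses = iter(["Starting", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In the notebook, `get_status` could be something like `lambda: ml_client.jobs.get(returned_job.name).status`, assuming the return value of `ml_client.create_or_update(job)` was stored in a variable named `returned_job`.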


## **Deploy a model as an online endpoint**

Register the model

In [12]:
# Import the necessary libraries
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Provide the model details, including the
# path to the model files, if you've stored them locally.
mlflow_model = Model(
    path="./deploy/chichewa_defaults_model/",
    type=AssetTypes.MLFLOW_MODEL,
    name="chichewa_defaults_model",
    description="MLflow Model created from local files.",
)

# Register the model
ml_client.models.create_or_update(mlflow_model)

Uploading chichewa_defaults_model (0.12 MBs): 100%|██████████| 122845/122845 [00:00<00:00, 1134997.80it/s]



Model({'job_name': None, 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'chichewa_defaults_model', 'description': 'MLflow Model created from local files.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2b635f7e-bdd3-49af-9129-9a0d8d7da646/resourceGroups/groupe_ressource_groupe2/providers/Microsoft.MachineLearningServices/workspaces/workspace_nlp_groupe2/models/chichewa_defaults_model/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/clustergroupe2/code/Users/fama.coundoul/projet_nlp_2023_groupe2', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f5e9dda8880>, 'serialize': <msrest.serialization.Serializer object at 0x7f5e9ddd8550>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/2b635f7e-bdd3-49af-9129-9a0d8d7da646/resourceGroups/groupe_ressource_groupe2/workspaces/workspac

In [13]:
# Let's pick the latest version of the model
latest_model_version = max(
    [int(m.version) for m in ml_client.models.list(name=registered_model_name)]
)

print(latest_model_version)

1
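
The `int(...)` conversion in the cell above matters because model versions are returned as strings (the registration output shows `'version': '1'`). A short sketch with hypothetical version strings shows how comparing them as text picks the wrong one once there are ten or more versions:

```python
versions = ["1", "2", "9", "10"]  # hypothetical version strings

latest_as_text = max(versions)                     # compared character by character
latest_as_number = max(int(v) for v in versions)   # compared as integers

print(latest_as_text, latest_as_number)
```

Here the text comparison returns `"9"`, while the numeric comparison returns `10`.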


Endpoints and deployments

Create an online endpoint

In [14]:
import uuid

# Create a unique name for the endpoint
online_endpoint_name = "chichewa-endpoint-" + str(uuid.uuid4())[:8]

In [15]:
from azure.ai.ml.entities import ManagedOnlineEndpoint

# define an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is an online endpoint",
    auth_mode="key",
    tags={
        "training_dataset": "chichewa_defaults",
    },
)

In [16]:
# create the online endpoint
# expect endpoint creation to take approximately 2 minutes.

endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

In [17]:
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Endpoint "chichewa-endpoint-97ce6c6d" with provisioning state "Succeeded" is retrieved


Deployment using an MLflow model

Deploy the model to the endpoint

In [18]:
from azure.ai.ml.entities import ManagedOnlineDeployment

# Choose the latest version of our registered model for deployment
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# define an online deployment
# If you run into an out-of-quota error, change instance_type to a comparable VM size that is available.
# Learn more at https://azure.microsoft.com/en-us/pricing/details/machine-learning/.
chichewa_deployment = ManagedOnlineDeployment(
    name="chichewa",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_F4s_v2",
    instance_count=1,
)

In [19]:
# create the online deployment
# expect the deployment to take approximately 8 to 10 minutes.
chichewa_deployment = ml_client.online_deployments.begin_create_or_update(chichewa_deployment).result()

# route 100% of the endpoint traffic to the "chichewa" deployment
endpoint.traffic = {"chichewa": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Check: endpoint chichewa-endpoint-97ce6c6d exists
Readonly attribute principal_id will be ignored in class <class 'azure.ai.ml._restclient.v2022_05_01.models._models_py3.ManagedServiceIdentity'>
Readonly attribute tenant_id will be ignored in class <class 'azure.ai.ml._restclient.v2022_05_01.models._models_py3.ManagedServiceIdentity'>


............................................................................................

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://chichewa-endpoint-97ce6c6d.westus3.inference.ml.azure.com/score', 'openapi_uri': 'https://chichewa-endpoint-97ce6c6d.westus3.inference.ml.azure.com/swagger.json', 'name': 'chichewa-endpoint-97ce6c6d', 'description': 'this is an online endpoint', 'tags': {'training_dataset': 'chichewa_defaults'}, 'properties': {'azureml.onlineendpointid': '/subscriptions/2b635f7e-bdd3-49af-9129-9a0d8d7da646/resourcegroups/groupe_ressource_groupe2/providers/microsoft.machinelearningservices/workspaces/workspace_nlp_groupe2/onlineendpoints/chichewa-endpoint-97ce6c6d', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/2b635f7e-bdd3-49af-9129-9a0d8d7da646/providers/Microsoft.MachineLearningServices/locations/westus3/mfeOperationsStatus/oe:4bf28531-1e1d-4933-ae53-b53f06f404ff:ecc7df4c-18a3-40b1-88ce-b376868fe45a?api-version=2022-02-01-preview'}, 'print_as_yaml': True, 'id'

Check the status of the endpoint

In [20]:
# return an object that contains metadata for the endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# print a selection of the endpoint's metadata
print(
    f"Name: {endpoint.name}\nStatus: {endpoint.provisioning_state}\nDescription: {endpoint.description}"
)

Name: chichewa-endpoint-97ce6c6d
Status: Succeeded
Description: this is an online endpoint


In [21]:
# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

{'chichewa': 100}
https://chichewa-endpoint-97ce6c6d.westus3.inference.ml.azure.com/score
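
With the scoring URI in hand, a request can be prepared against the endpoint. The sketch below only builds the request and does not send it. The `input_data` payload layout with `columns`, `index`, and `data` is an assumption about what an MLflow deployment accepts, and the URL, key, and feature rows are placeholders. Note that `main.py` registers only the `GaussianNB` model and not the fitted `TfidfVectorizer`, so the rows sent for scoring would have to be TF-IDF vectors with the same number of features used in training, not raw text.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, api_key, rows):
    """Build (but do not send) a POST request for the scoring endpoint."""
    n_features = len(rows[0])
    # Assumed payload layout: a table described by columns, index, and data.
    payload = {
        "input_data": {
            "columns": list(range(n_features)),
            "index": list(range(len(rows))),
            "data": rows,
        }
    }
    return urllib.request.Request(
        scoring_uri,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/score",      # placeholder; use endpoint.scoring_uri
    "<endpoint-key>",                     # placeholder key
    [[0.0, 0.5, 0.0], [0.1, 0.0, 0.9]],   # made-up feature rows
)
body = json.loads(request.data)
print(request.get_method(), body["input_data"]["index"])
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, once the placeholders are replaced with the real URI, key, and feature vectors.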


In [22]:
import pandas as pd
news=pd.read_csv("Train.csv")
len(news['Text'])

1436

In [30]:
news.shape[0]

1436

In [33]:
news

Unnamed: 0,ID,Text,Label
1,ID_AGoFySzn,MCP siidakhutire ndi kalembera Chipani cha Ma...,POLITICS
2,ID_AGrrkBGP,Bungwe la MANEPO Lapempha Boma Liganizire Anth...,HEALTH
3,ID_AIJeigeG,Ndale zogawanitsa miyambo zanyanya Si zachile...,POLITICS
4,ID_APMprMbV,Nanga wapolisi ataphofomoka? Masiku ano sichi...,LAW/ORDER
5,ID_ARZdpinF,Kusamvetsetsana kwabuka ku Mwanza Kusamvetset...,LAW/ORDER
...,...,...,...
1431,ID_zmTmmEio,Eni Minibus Ati Ali ndi Ufulu Wokweza Mitengo ...,TRANSPORT
1432,ID_znOlIaGQ,Kachali apepesa: Kulankhula motumbwa kuthe An...,POLITICS
1433,ID_znracTjN,Mawu supports non-fiction writers The Malawi ...,EDUCATION
1434,ID_ztdsmmva,Tame Mwawa: Phwete ndiye kudya kwake Sewero l...,SOCIAL ISSUES
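
The `Text` column previewed above is what `main.py` turns into numeric features with `TfidfVectorizer`. As a rough illustration of the weighting, here is a pure-Python sketch that uses a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. It splits on whitespace only, which is simpler than the vectorizer's own tokenisation, and the two documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Two made-up documents for illustration.
vocab, rows = tfidf(["boma lapempha boma", "chipani cha boma"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

Each row has one entry per vocabulary word, which is why the feature matrix in `main.py` grows as wide as the number of distinct tokens in the corpus.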
