<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-🚀" data-toc-modified-id="Introduction-🚀-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction 🚀</a></span></li><li><span><a href="#Import-des-librairies" data-toc-modified-id="Import-des-librairies-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing the libraries</a></span></li><li><span><a href="#Entrainement" data-toc-modified-id="Entrainement-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Envoie-des-utterances" data-toc-modified-id="Envoie-des-utterances-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Sending the utterances</a></span></li><li><span><a href="#Entrainement-du-modèle" data-toc-modified-id="Entrainement-du-modèle-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Training the model</a></span></li><li><span><a href="#Publication-du-modèle" data-toc-modified-id="Publication-du-modèle-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Publishing the model</a></span></li><li><span><a href="#Prédiction" data-toc-modified-id="Prédiction-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Prediction</a></span></li><li><span><a href="#Formatage-des-données-prébuilt" data-toc-modified-id="Formatage-des-données-prébuilt-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Formatting the prebuilt data</a></span><ul class="toc-item"><li><span><a href="#Date" data-toc-modified-id="Date-3.5.1"><span class="toc-item-num">3.5.1&nbsp;&nbsp;</span>Date</a></span></li><li><span><a href="#Budget" data-toc-modified-id="Budget-3.5.2"><span class="toc-item-num">3.5.2&nbsp;&nbsp;</span>Budget</a></span></li><li><span><a href="#Geography" data-toc-modified-id="Geography-3.5.3"><span class="toc-item-num">3.5.3&nbsp;&nbsp;</span>Geography</a></span></li></ul></li><li><span><a href="#Scoring" data-toc-modified-id="Scoring-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Scoring</a></span><ul class="toc-item"><li><span><a href="#Optimisation" data-toc-modified-id="Optimisation-3.6.1"><span class="toc-item-num">3.6.1&nbsp;&nbsp;</span>Optimization</a></span></li></ul></li></ul></li><li><span><a href="#Conclusion-🏁" data-toc-modified-id="Conclusion-🏁-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion 🏁</a></span></li></ul></div>

# Introduction 🚀

In this notebook, we train our model on the full training set and use the test set to measure its performance.

We reproduce part of the "LUIS" notebook, where we made a first attempt at creating an application and training it on a single example. We start by creating the application.

Once the application is created, we train it on all the training data. Let's go! 🏁👩‍💻

# Importing the libraries

In [1]:
import json, time, uuid
import re

import os
from dotenv import load_dotenv

import pandas as pd
import numpy as np
from tqdm import tqdm

from dateutil.parser import parse
from price_parser import Price

from azure.cognitiveservices.language.luis.authoring import LUISAuthoringClient
from azure.cognitiveservices.language.luis.authoring.models import ApplicationCreateObject
from azure.cognitiveservices.language.luis.runtime import LUISRuntimeClient
from msrest.authentication import CognitiveServicesCredentials
from functools import reduce

In [2]:
# Load the variables defined in the local .env file into the environment
load_dotenv()

authoringKey = os.environ.get("APP_AUTHORING_KEY")
authoringEndpoint = os.environ.get("ENDPOINT_AUTHORING_URL")
predictionKey = os.environ.get("APP_PREDICTION_KEY")
predictionEndpoint = os.environ.get("ENDPOINT_PREDICTION_URL")

appName = 'LUIS-P10'
versionId = "0.1"
intentName = "Booking"
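
Because `os.environ.get` returns `None` when a variable is missing, a typo in the `.env` file only shows up later as an authentication error. The sketch below is one possible early check. `require_env` is a hypothetical helper (it is not part of the notebook or of any library), and the demonstration uses a plain dictionary with placeholder values rather than the real environment.

```python
import os


def require_env(names, env=None):
    """Return the requested variables, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demonstration with an explicit mapping (placeholder values, not real credentials)
demo_env = {
    "APP_AUTHORING_KEY": "xxx",
    "ENDPOINT_AUTHORING_URL": "https://example.invalid",
}
settings = require_env(["APP_AUTHORING_KEY", "ENDPOINT_AUTHORING_URL"], demo_env)
```

Called without the second argument, the helper reads from `os.environ`, so it could run right after the `.env` file has been loaded.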

In [3]:
client = LUISAuthoringClient(authoringEndpoint,
                             CognitiveServicesCredentials(authoringKey))

In [4]:
# define app basics
appDefinition = ApplicationCreateObject(name=appName,
                                        initial_version_id=versionId,
                                        culture='en-us')

# create app
app_id = client.apps.add(appDefinition)

# get app id - necessary for all other changes
print("Created LUIS app with ID {}".format(app_id))

Created LUIS app with ID 0fa21d7f-f2fb-4cd2-9835-25ddd194251d


In [5]:
# app_id = "e1365658-152c-43ef-b0b3-6d443ad954a9"

client.model.add_intent(app_id, versionId, intentName)

'df5415ec-224a-45f4-bad1-478a9c9785f3'

In [6]:
def get_grandchild_id(model, childName, grandChildName):

    theseChildren = next(
        filter((lambda child: child.name == childName), model.children))
    theseGrandchildren = next(
        filter((lambda child: child.name == grandChildName),
               theseChildren.children))

    grandChildId = theseGrandchildren.id

    return grandChildId

In [7]:
# Add Prebuilt entity
client.model.add_prebuilt(
    app_id,
    versionId,
    prebuilt_extractor_names=['datetimeV2', 'money', 'geographyV2'])

# define machine-learned entity
mlEntityDefinition = [{
    "name":
    "Fly",
    "children": [{
        "name": "or_city"
    }, {
        "name": "dst_city"
    }, {
        "name": "str_date"
    }, {
        "name": "end_date"
    }, {
        "name": "budget"
    }]
}]

# add entity to app
modelId = client.model.add_entity(app_id,
                                  versionId,
                                  name="FlyOrder",
                                  children=mlEntityDefinition)

# define phraselist - add phrases as significant vocabulary to app
# (gives extra weight to these words)
phraseList2 = {
    "enabledForAllModels": False,
    "isExchangeable": True,
    "name": "Phraselist",
    "phrases": "fly, destination, hotel"
}

# add phrase list to app
phraseListId2 = client.features.add_phrase_list(app_id, versionId, phraseList2)

# add phrase list as feature to subentity model
modelObject = client.model.get_entity(app_id, versionId, modelId)
FlyDst_cityId = get_grandchild_id(modelObject, "Fly", "dst_city")

phraseListFeatureDefinition = {
    "feature_name": "Phraselist",
    "model_name": None
}
client.features.add_entity_feature(app_id, versionId, FlyDst_cityId,
                                   phraseListFeatureDefinition)

<azure.cognitiveservices.language.luis.authoring.models._models_py3.OperationStatus at 0x154b5ed9030>

In [8]:
# Define labeled example
# endCharIndex is inclusive: it is the index of the last character of the label
labeledExampleUtteranceWithMLEntity = {
    "text":
    'IM IN TIJUANA FIND ME A FLIGHT TO CURITIBA AUG 27 TO SEPT 4 for a budget of 3500 dollars',
    "intentName": intentName,
    "entityLabels": [{
        "startCharIndex": 6,
        "endCharIndex": 87,
        "entityName": "FlyOrder",
        "children": [{
            "startCharIndex": 6,
            "endCharIndex": 87,
            "entityName": "Fly",
            "children": [{
                "startCharIndex": 6,
                "endCharIndex": 12,
                "entityName": "or_city"
            }, {
                "startCharIndex": 34,
                "endCharIndex": 41,
                "entityName": "dst_city"
            }, {
                "startCharIndex": 43,
                "endCharIndex": 48,
                "entityName": "str_date"
            }, {
                "startCharIndex": 53,
                "endCharIndex": 58,
                "entityName": "end_date"
            }, {
                "startCharIndex": 76,
                "endCharIndex": 87,
                "entityName": "budget"
            }]
        }]
    }]
}

client.examples.add(app_id, versionId, labeledExampleUtteranceWithMLEntity,
                    {"enableNestedChildren": True})

<azure.cognitiveservices.language.luis.authoring.models._models_py3.LabelExampleResponse at 0x154b5e2f070>
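
Character offsets counted by hand are easy to get wrong by one. As a sketch, the offsets can be computed from the phrase itself. `label_span` is a hypothetical helper, and it assumes that `endCharIndex` is inclusive, which is what the budget label above suggests: its end index, 87, is the last character of the 88-character utterance.

```python
def label_span(text, phrase, entity_name):
    """Build one entity label by locating the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in utterance")
    return {
        "startCharIndex": start,
        # Assumption: the end index points at the last character of the phrase
        "endCharIndex": start + len(phrase) - 1,
        "entityName": entity_name,
    }


utterance = ("IM IN TIJUANA FIND ME A FLIGHT TO CURITIBA AUG 27 TO SEPT 4 "
             "for a budget of 3500 dollars")
budget_label = label_span(utterance, "3500 dollars", "budget")
origin_label = label_span(utterance, "TIJUANA", "or_city")
```

Since only the first occurrence is located, an utterance that repeats the same phrase would need a more careful lookup.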

Everything works as expected and our application is now created. We can now train it on all the training data.

# Training

Let's load our JSON files and try to train the model.

In [9]:
# Context managers close each file as soon as it has been read
with open('data/train.json') as f:
    train = json.load(f)

with open('data/test.json') as f:
    test = json.load(f)

with open('data/val.json') as f:
    val = json.load(f)

## Sending the utterances

In [10]:
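%%time
# Send the labeled utterances one by one, printing progress every 300 examples
for n, utterance in enumerate(train, start=1):
    client.examples.add(app_id, versionId, utterance,
                        {"enableNestedChildren": True})
    if n % 300 == 0:
        print(n)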

300
600
900
CPU times: total: 4.92 s
Wall time: 5min 15s
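
Sending the utterances one request at a time took a little over five minutes here. One possible speed-up is to group them into batches. The sketch below only shows the grouping step: `chunked` is a hypothetical helper, and the batch size of 100 is an assumption about the service limit that should be checked against the LUIS documentation.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in data: 250 integers instead of labeled utterances
demo_utterances = list(range(250))
batches = list(chunked(demo_utterances, 100))
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be passed to the SDK's batch operation on `client.examples` instead of calling `client.examples.add` once per utterance. That call is left out here because its exact parameters should be confirmed for the installed SDK version.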


## Training the model

In [11]:
client.train.train_version(app_id, versionId)
waiting = True
while waiting:
    info = client.train.get_status(app_id, versionId)

    # get_status returns a list of training statuses, one for each model.
    # Keep waiting while any of them is still queued or in progress.
    waiting = any(x.details.status in ('Queued', 'InProgress') for x in info)
    if waiting:
        print("Waiting 10 seconds for training to complete...")
        time.sleep(10)
    else:
        print("trained")

Waiting 10 seconds for training to complete...
Waiting 10 seconds for training to complete...
trained
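
The polling loop above waits for as long as the service reports pending models. A variant with a timeout and an explicit failure check could look like the sketch below. `wait_for_training` is a hypothetical helper, the `"Fail"` status name is an assumption about what the service reports, and the demonstration replaces the real client with a fake sequence of statuses.

```python
import time


def wait_for_training(get_statuses, poll_seconds=10, timeout_seconds=600,
                      sleep=time.sleep):
    """Poll until no model is 'Queued' or 'InProgress'; return the final statuses."""
    waited = 0
    while True:
        statuses = list(get_statuses())
        if "Fail" in statuses:  # assumed name of the failure status
            raise RuntimeError("Training failed for at least one model")
        if not any(s in ("Queued", "InProgress") for s in statuses):
            return statuses
        if waited >= timeout_seconds:
            raise TimeoutError("Training did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake status source standing in for the authoring client
_fake_statuses = iter([
    ["Queued", "InProgress"],
    ["InProgress", "Success"],
    ["Success", "UpToDate"],
])
final = wait_for_training(lambda: next(_fake_statuses),
                          poll_seconds=0, sleep=lambda seconds: None)
```

With the real client, `get_statuses` could be `lambda: [m.details.status for m in client.train.get_status(app_id, versionId)]`.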


## Publishing the model

In [12]:
# Mark the app as public so we can query it using any prediction endpoint.
# Note: For production scenarios, you should instead assign the app to your own LUIS prediction endpoint. See:
# https://docs.microsoft.com/en-gb/azure/cognitive-services/luis/luis-how-to-azure-subscription#assign-a-resource-to-an-app
client.apps.update_settings(app_id, is_public=True)

responseEndpointInfo = client.apps.publish(app_id, versionId, is_staging=False)

## Prediction

In [13]:
runtimeCredentials = CognitiveServicesCredentials(predictionKey)
clientRuntime = LUISRuntimeClient(endpoint=predictionEndpoint,
                                  credentials=runtimeCredentials)

In [14]:
# Production == slot name
predictionRequest = {
    "query":
    'Hello I am from France and I need a flight to England from September 15 to October 28 with a budget of 3500'
}

predictionResponse = clientRuntime.prediction.get_slot_prediction(
    app_id, "Production", predictionRequest)
print("Top intent: {}".format(predictionResponse.prediction.top_intent))
print("Sentiment: {}".format(predictionResponse.prediction.sentiment))
print("Intents: ")

for intent in predictionResponse.prediction.intents:
    print("\t{}".format(json.dumps(intent)))
print("Entities: {}".format(predictionResponse.prediction.entities))

Top intent: Booking
Sentiment: None
Intents: 
	"Booking"
Entities: {'FlyOrder': [{'Fly': [{'or_city': ['France'], 'dst_city': ['England'], 'str_date': ['September 15'], 'end_date': ['October 28'], 'budget': ['3500']}]}], 'geographyV2': [{'value': 'England', 'type': 'state'}], 'datetimeV2': [{'type': 'daterange', 'values': [{'timex': '(XXXX-09-15,XXXX-10-28,P43D)', 'resolution': [{'start': '2022-09-15', 'end': '2022-10-28'}, {'start': '2023-09-15', 'end': '2023-10-28'}]}]}]}
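
The entities come back as nested lists and dictionaries. To get one value per slot, the `FlyOrder` > `Fly` children have to be flattened. The sketch below shows one way to do this on the response printed above (trimmed to the relevant keys). `extract_slots` is a hypothetical helper, not the function used in the rest of the notebook.

```python
SLOTS = ("or_city", "dst_city", "str_date", "end_date", "budget")


def extract_slots(entities):
    """Flatten the FlyOrder > Fly children into one value per slot (None if absent)."""
    fly_order = entities.get("FlyOrder") or [{}]
    fly = (fly_order[0].get("Fly") or [{}])[0]
    return {slot: (fly.get(slot) or [None])[0] for slot in SLOTS}


entities = {
    "FlyOrder": [{"Fly": [{
        "or_city": ["France"],
        "dst_city": ["England"],
        "str_date": ["September 15"],
        "end_date": ["October 28"],
        "budget": ["3500"],
    }]}],
    "geographyV2": [{"value": "England", "type": "state"}],
}
slots = extract_slots(entities)
empty_slots = extract_slots({"FlyOrder": [{}]})
```

When the machine-learned entity is missing or empty, every slot is `None`, which makes it easy to spot the utterances where only prebuilt entities were found.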


Let's now run the test utterances through the prediction endpoint and collect the results in a dataframe.

In [26]:
df_test = pd.read_csv('data/test_df.csv',
                      parse_dates=['str_date_formate', 'end_date_formate'])

In [239]:
df_test.head()

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget,str_date_formate,end_date_formate,budget_formate
0,Three words: Alexandria to Cencun. Please book...,Alexandria,Cencun,ASAP,Sep 6,,NaT,2022-09-06,
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,the 23rd,,,2022-10-23,NaT,
2,Need two tickets out of Buenos Aires!,Buenos Aires,,,,,NaT,NaT,
3,I need to get to Marseille from Dallas,Dallas,Marseille,,,,NaT,NaT,
4,"Hi there, I'm looking for a trip for 5 between...",Tofino,Dallas,Aug 25,Aug 28,,2022-08-25,2022-08-28,


In [233]:
df_test_prediction = pd.DataFrame(
    columns=['text', 'or_city', 'dst_city', 'str_date', 'end_date', 'budget'])
df_test_prediction

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget


In [234]:
%%time
rows = []
try:
    for texte in df_test['text']:
        predictionRequest = {"query": texte}

        predictionResponse = clientRuntime.prediction.get_slot_prediction(
            app_id, "Production", predictionRequest)

        prediction = predictionResponse.prediction.entities

        # When the ML entity was detected, keep only its children;
        # otherwise keep the raw entities (prebuilt ones included)
        fly_order = prediction.get('FlyOrder')
        if fly_order and fly_order[0]:
            prediction = fly_order[0].get("Fly")[0]

        # Inserting 'text' first keeps it as the first key of the dictionary
        rows.append({"text": texte, **prediction})
finally:
    # DataFrame.append was removed in pandas 2.0: build the frame once, even if
    # the loop is interrupted, with the declared columns first and the extra
    # (prebuilt) columns after them
    df_rows = pd.DataFrame(rows)
    extra_columns = [
        c for c in df_rows.columns if c not in df_test_prediction.columns
    ]
    df_test_prediction = df_rows.reindex(
        columns=list(df_test_prediction.columns) + extra_columns)


KeyboardInterrupt



In [235]:
# Unwrap single-item lists so that each cell holds the value itself
df_test_prediction = df_test_prediction.applymap(lambda x: x[0]
                                                 if isinstance(x, list) else x)
df_test_prediction

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget,FlyOrder,datetimeV2,geographyV2
0,Three words: Alexandria to Cencun. Please book...,,,,,,{},"{'type': 'datetime', 'values': [{'timex': 'FUT...",
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,August 24th,,,,,
2,Need two tickets out of Buenos Aires!,,,,,,{},,"{'value': 'Buenos Aires', 'type': 'city'}"
3,I need to get to Marseille from Dallas,Dallas,Marseille,,,,,,
4,"Hi there, I'm looking for a trip for 5 between...",,,,,,{},"{'type': 'daterange', 'values': [{'timex': '(X...",
5,"Hi i'm from Buenos Aires, and I want to book a...",Buenos Aires,,,,,,,
6,Hey! Yes! The 5 of us are looking to go somewh...,,,,,,{},"{'type': 'daterange', 'values': [{'timex': '(X...",
7,Can you get me to Kyoto,,Kyoto,,,,,,
8,"yeah, hey. i need to know if you operate out o...",Tel Aviv,,,,,,,
9,"I have investment meetings in Sao Paulo, the d...",,,,,,{},"{'type': 'datetime', 'values': [{'timex': 'FUT...","{'value': 'Sao Paulo', 'type': 'city'}"


The dataframe contains a few extra columns for the prebuilt entities. We may use them later if we want to take the prebuilt entities into account in the model's final score. Let's export the prediction dataframe.

In [3]:
# df_test_prediction.to_csv('data/df_prediction.csv', index=None)
df_test_prediction = pd.read_csv('data/df_prediction.csv')
df_test_prediction

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget,FlyOrder,datetimeV2,geographyV2,money
0,Three words: Alexandria to Cencun. Please book...,,,,,,{},"{'type': 'datetime', 'values': [{'timex': 'FUT...",,
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,August 24th,,,,,,
2,Need two tickets out of Buenos Aires!,,,,,,{},,"{'value': 'Buenos Aires', 'type': 'city'}",
3,I need to get to Marseille from Dallas,Dallas,Marseille,,,,,,,
4,"Hi there, I'm looking for a trip for 5 between...",,,,,,{},"{'type': 'daterange', 'values': [{'timex': '(X...",,
...,...,...,...,...,...,...,...,...,...,...
195,Hello. I am a deeply tormented children’s writ...,,,,,,{},"{'type': 'daterange', 'values': [{'timex': '20...","{'value': 'Minneapolis', 'type': 'city'}","{'number': 2000, 'units': 'Dollar'}"
196,I have a business trip coming up in Punta Cana...,,Punta Cana,,,,,,,
197,Melbourne please,,,,,,{},,"{'value': 'Melbourne', 'type': 'city'}",
198,Good morning! So I just won the lottery and de...,San Francisco,Sacramento,,,,,,,


In [4]:
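def parse_date(date):
    # Return the parsed datetime, or NaN when the value is missing or is not a date
    try:
        return parse(date, fuzzy_with_tokens=True)[0]
    except (ValueError, TypeError, OverflowError):
        return np.nan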

In [5]:
# Convert the predicted dates to datetime values
df_test_prediction['str_date'] = df_test_prediction['str_date'].apply(
    parse_date)
df_test_prediction['end_date'] = df_test_prediction['end_date'].apply(
    parse_date)

In [6]:
df_test_prediction

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget,FlyOrder,datetimeV2,geographyV2,money
0,Three words: Alexandria to Cencun. Please book...,,,NaT,NaT,,{},"{'type': 'datetime', 'values': [{'timex': 'FUT...",,
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,2022-08-24,NaT,,,,,
2,Need two tickets out of Buenos Aires!,,,NaT,NaT,,{},,"{'value': 'Buenos Aires', 'type': 'city'}",
3,I need to get to Marseille from Dallas,Dallas,Marseille,NaT,NaT,,,,,
4,"Hi there, I'm looking for a trip for 5 between...",,,NaT,NaT,,{},"{'type': 'daterange', 'values': [{'timex': '(X...",,
...,...,...,...,...,...,...,...,...,...,...
195,Hello. I am a deeply tormented children’s writ...,,,NaT,NaT,,{},"{'type': 'daterange', 'values': [{'timex': '20...","{'value': 'Minneapolis', 'type': 'city'}","{'number': 2000, 'units': 'Dollar'}"
196,I have a business trip coming up in Punta Cana...,,Punta Cana,NaT,NaT,,,,,
197,Melbourne please,,,NaT,NaT,,{},,"{'value': 'Melbourne', 'type': 'city'}",
198,Good morning! So I just won the lottery and de...,San Francisco,Sacramento,NaT,NaT,,,,,


## Formatting the prebuilt data

### Date

We now need to format the data predicted by the prebuilt entities. We will then be able to compare the score of the base model with the score obtained when the prebuilt predictions are added.

In [7]:
for i in df_test_prediction["datetimeV2"]:
    print(i)

{'type': 'datetime', 'values': [{'timex': 'FUTURE_REF', 'resolution': [{'value': '2022-10-28 17:00:44'}]}]}
nan
nan
nan
{'type': 'daterange', 'values': [{'timex': '(XXXX-08-25,XXXX-08-28,P3D)', 'resolution': [{'start': '2022-08-25', 'end': '2022-08-28'}, {'start': '2023-08-25', 'end': '2023-08-28'}]}]}
nan
{'type': 'daterange', 'values': [{'timex': '(XXXX-09-13,XXXX-09-22,P9D)', 'resolution': [{'start': '2022-09-13', 'end': '2022-09-22'}, {'start': '2023-09-13', 'end': '2023-09-22'}]}]}
nan
nan
{'type': 'datetime', 'values': [{'timex': 'FUTURE_REF', 'resolution': [{'value': '2022-10-28 17:00:47'}]}]}
nan
nan
{'type': 'duration', 'values': [{'timex': 'P9D', 'resolution': [{'value': '777600'}]}]}
nan
{'type': 'daterange', 'values': [{'timex': '(XXXX-09-01,XXXX-09-16,P15D)', 'resolution': [{'start': '2022-09-01', 'end': '2022-09-16'}, {'start': '2023-09-01', 'end': '2023-09-16'}]}]}
nan
{'type': 'daterange', 'values': [{'timex': '(XXXX-09-10,XXXX-09-11,P1D)', 'resolution': [{'start': '202
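
The values printed above come in three shapes: `datetime` (a single `value`), `daterange` (a `start` and an `end`) and `duration` (a `value` in seconds: `P9D` resolves to 777600, that is 9 × 86400). The sketch below parses one cell into a `(start, end, value)` tuple. `resolve_datetime` is a hypothetical helper, and dropping durations is a choice made for this sketch only, since a number of seconds is not a date.

```python
import ast


def resolve_datetime(raw):
    """Return (start, end, value) from one datetimeV2 cell; all None if unusable."""
    try:
        entity = raw if isinstance(raw, dict) else ast.literal_eval(raw)
        resolution = entity["values"][0]["resolution"][0]
    except (ValueError, SyntaxError, TypeError, KeyError, IndexError):
        return (None, None, None)
    if entity.get("type") == "duration":
        # A duration resolves to a number of seconds, not to a date
        return (None, None, None)
    return (resolution.get("start"), resolution.get("end"), resolution.get("value"))


range_cell = ("{'type': 'daterange', 'values': [{'timex': '(XXXX-08-25,XXXX-08-28,P3D)', "
              "'resolution': [{'start': '2022-08-25', 'end': '2022-08-28'}, "
              "{'start': '2023-08-25', 'end': '2023-08-28'}]}]}")
datetime_cell = ("{'type': 'datetime', 'values': [{'timex': 'FUTURE_REF', "
                 "'resolution': [{'value': '2022-10-28 17:00:44'}]}]}")
duration_cell = ("{'type': 'duration', 'values': [{'timex': 'P9D', "
                 "'resolution': [{'value': '777600'}]}]}")

parsed_range = resolve_datetime(range_cell)
parsed_datetime = resolve_datetime(datetime_cell)
parsed_duration = resolve_datetime(duration_cell)
```

`ast.literal_eval` only accepts Python literals, so it is a safer way than `eval` to read dictionaries that were stored as text in a CSV file.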

In [11]:
import ast

# Parse the stringified datetimeV2 dictionaries and extract their resolved dates
df_test_prediction_avec_prebuilt = df_test_prediction.copy()
for colonne in [
        "date_prebuilt_start", "date_prebuilt_end", "date_prebuilt_inconnu"
]:
    # object dtype so that date strings can be stored next to NaN
    df_test_prediction_avec_prebuilt[colonne] = pd.Series(
        np.nan, index=df_test_prediction_avec_prebuilt.index, dtype="object")

for i in df_test_prediction_avec_prebuilt.index:
    try:
        dict_sans_string = ast.literal_eval(
            df_test_prediction_avec_prebuilt.loc[i, "datetimeV2"])
        resolution = dict_sans_string.get("values")[0].get("resolution")[0]
    except (ValueError, SyntaxError, TypeError, AttributeError, IndexError):
        # NaN or malformed cell: nothing to extract
        continue
    # Date range: "start" and/or "end" are keys of the resolution dict
    # (the original test `"start" and "end" in resolution` only checked "end")
    if "start" in resolution or "end" in resolution:
        df_test_prediction_avec_prebuilt.loc[
            i, "date_prebuilt_start"] = resolution.get("start")
        df_test_prediction_avec_prebuilt.loc[
            i, "date_prebuilt_end"] = resolution.get("end")
    elif "value" in resolution:
        df_test_prediction_avec_prebuilt.loc[
            i, "date_prebuilt_inconnu"] = resolution.get("value")


In [12]:
df_test_prediction_avec_prebuilt.head(5)

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget,FlyOrder,datetimeV2,geographyV2,money,date_prebuilt_start,date_prebuilt_end,date_prebuilt_inconnu
0,Three words: Alexandria to Cencun. Please book...,,,NaT,NaT,,{},"{'type': 'datetime', 'values': [{'timex': 'FUT...",,,,,2022-10-28 17:00:44
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,2022-08-24,NaT,,,,,,,,
2,Need two tickets out of Buenos Aires!,,,NaT,NaT,,{},,"{'value': 'Buenos Aires', 'type': 'city'}",,,,
3,I need to get to Marseille from Dallas,Dallas,Marseille,NaT,NaT,,,,,,,,
4,"Hi there, I'm looking for a trip for 5 between...",,,NaT,NaT,,{},"{'type': 'daterange', 'values': [{'timex': '(X...",,,2022-08-25,2022-08-28,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,Hello. I am a deeply tormented children’s writ...,,,NaT,NaT,,{},"{'type': 'daterange', 'values': [{'timex': '20...","{'value': 'Minneapolis', 'type': 'city'}","{'number': 2000, 'units': 'Dollar'}",2000-01-01,2001-01-01,
196,I have a business trip coming up in Punta Cana...,,Punta Cana,NaT,NaT,,,,,,,,
197,Melbourne please,,,NaT,NaT,,{},,"{'value': 'Melbourne', 'type': 'city'}",,,,
198,Good morning! So I just won the lottery and de...,San Francisco,Sacramento,NaT,NaT,,,,,,,,


The dates returned by the prebuilt entity are now extracted into their own columns. We will decide later what to do with the dates that have neither "start" nor "end".

### Budget

Let's start formatting the money (budget) column.

In [13]:
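def parse_price(price):
    # Return the amount found in the string as a float, or NaN when the
    # value cannot be parsed (for example a missing value)
    try:
        price = Price.fromstring(price, decimal_separator=".")
        return price.amount_float
    except Exception:
        return np.nan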

In [14]:
df_test_prediction_avec_prebuilt["money"].dropna()

45       {'number': 3900, 'units': 'Dollar'}
51       {'number': 6200, 'units': 'Dollar'}
88          {'number': 0, 'units': 'Dollar'}
101    {'number': 400.14, 'units': 'Dollar'}
171      {'number': 6200, 'units': 'Dollar'}
195      {'number': 2000, 'units': 'Dollar'}
199      {'number': 4000, 'units': 'Dollar'}
Name: money, dtype: object

In [16]:
df_test_prediction_avec_prebuilt['money'] = df_test_prediction_avec_prebuilt[
    'money'].apply(parse_price)

In [17]:
df_test_prediction_avec_prebuilt['money']

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
        ...  
195    2000.0
196       NaN
197       NaN
198       NaN
199    4000.0
Name: money, Length: 200, dtype: float64
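
The money cells are stringified dictionaries whose amount is already stored under `number`. As an alternative sketch, the amount can be read directly from that key instead of being extracted from the text. `money_amount` is a hypothetical helper and is not the function used above.

```python
import ast


def money_amount(raw):
    """Return the 'number' field of one money cell as a float, or None if unusable."""
    try:
        entity = raw if isinstance(raw, dict) else ast.literal_eval(raw)
        return float(entity["number"])
    except (ValueError, SyntaxError, TypeError, KeyError):
        return None


amounts = [
    money_amount("{'number': 3900, 'units': 'Dollar'}"),
    money_amount("{'number': 400.14, 'units': 'Dollar'}"),
    money_amount("{'number': 0, 'units': 'Dollar'}"),
    money_amount(float("nan")),
]
```

Reading the key avoids any ambiguity if the string ever contains more than one number, at the cost of depending on the exact dictionary layout.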

The money prebuilt is now formatted; let's move on to geographyV2.

### Geography

In [18]:
df_test_prediction_avec_prebuilt["geographyV2"].dropna()

2       {'value': 'Buenos Aires', 'type': 'city'}
9          {'value': 'Sao Paulo', 'type': 'city'}
11        {'value': 'Queenstown', 'type': 'city'}
12     {'value': 'Ciudad Juarez', 'type': 'city'}
14          {'value': 'San Jose', 'type': 'city'}
                          ...                    
190         {'value': 'Curitiba', 'type': 'city'}
192       {'value': 'queenstown', 'type': 'city'}
195      {'value': 'Minneapolis', 'type': 'city'}
197        {'value': 'Melbourne', 'type': 'city'}
199         {'value': 'BRASILIA', 'type': 'city'}
Name: geographyV2, Length: 92, dtype: object

The problem here is that only one value is detected per utterance, so we do not know whether it refers to the destination or to the origin.

We will therefore simply format this column to keep only the place name, and decide later what to do with it.

In [19]:
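import ast


def pays_ville(localisation):
    # Parse the stringified geographyV2 dictionary and keep only the place name
    try:
        locat = ast.literal_eval(localisation)
        return locat.get("value")
    except (ValueError, SyntaxError, AttributeError, TypeError):
        return np.nan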

In [20]:
df_test_prediction_avec_prebuilt[
    "geographyV2"] = df_test_prediction_avec_prebuilt["geographyV2"].apply(
        pays_ville)

In [21]:
df_test_prediction_avec_prebuilt["geographyV2"].dropna()

2       Buenos Aires
9          Sao Paulo
11        Queenstown
12     Ciudad Juarez
14          San Jose
           ...      
190         Curitiba
192       queenstown
195      Minneapolis
197        Melbourne
199         BRASILIA
Name: geographyV2, Length: 92, dtype: object

In [22]:
df_test_prediction_avec_prebuilt[[
    "or_city", "dst_city", "geographyV2"
]][df_test_prediction_avec_prebuilt['geographyV2'].notna()]

Unnamed: 0,or_city,dst_city,geographyV2
2,,,Buenos Aires
9,,,Sao Paulo
11,,,Queenstown
12,,,Ciudad Juarez
14,,,San Jose
...,...,...,...
190,,,Curitiba
192,,,queenstown
195,,,Minneapolis
197,,,Melbourne


All our columns are now formatted; let's move on to scoring.

## Scoring

In [28]:
df_test_prediction_avec_prebuilt.head(5)

Unnamed: 0,text,or_city,dst_city,str_date,end_date,budget,FlyOrder,datetimeV2,geographyV2,money,date_prebuilt_start,date_prebuilt_end,date_prebuilt_inconnu
0,Three words: Alexandria to Cencun. Please book...,,,NaT,NaT,,{},"{'type': 'datetime', 'values': [{'timex': 'FUT...",,,,,2022-10-28 17:00:44
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,2022-08-24,NaT,,,,,,,,
2,Need two tickets out of Buenos Aires!,,,NaT,NaT,,{},,Buenos Aires,,,,
3,I need to get to Marseille from Dallas,Dallas,Marseille,NaT,NaT,,,,,,,,
4,"Hi there, I'm looking for a trip for 5 between...",,,NaT,NaT,,{},"{'type': 'daterange', 'values': [{'timex': '(X...",,,2022-08-25,2022-08-28,


In [29]:
# Merge the test and prediction dataframes
df_merged = pd.merge(df_test_prediction_avec_prebuilt[[
    "text", "or_city", "dst_city", "str_date", "end_date", "budget",
    "date_prebuilt_start", "date_prebuilt_end", "date_prebuilt_inconnu",
    "money", "geographyV2"
]],
                     df_test[[
                         "text", "or_city", "dst_city", "str_date", "end_date",
                         "budget", "str_date_formate", "end_date_formate",
                         "budget_formate"
                     ]],
                     on=['text'])
df_merged.head(5)

Unnamed: 0,text,or_city_x,dst_city_x,str_date_x,end_date_x,budget_x,date_prebuilt_start,date_prebuilt_end,date_prebuilt_inconnu,money,geographyV2,or_city_y,dst_city_y,str_date_y,end_date_y,budget_y,str_date_formate,end_date_formate,budget_formate
0,Three words: Alexandria to Cencun. Please book...,,,NaT,NaT,,,,2022-10-28 17:00:44,,,Alexandria,Cencun,ASAP,Sep 6,,NaT,2022-09-06,
1,"Hello. I'm just gonna be honest with you, here...",Busan,Melbourne,2022-08-24,NaT,,,,,,,Busan,Melbourne,the 23rd,,,2022-10-23,NaT,
2,Need two tickets out of Buenos Aires!,,,NaT,NaT,,,,,,Buenos Aires,Buenos Aires,,,,,NaT,NaT,
3,I need to get to Marseille from Dallas,Dallas,Marseille,NaT,NaT,,,,,,,Dallas,Marseille,,,,NaT,NaT,
4,"Hi there, I'm looking for a trip for 5 between...",,,NaT,NaT,,2022-08-25,2022-08-28,,,,Tofino,Dallas,Aug 25,Aug 28,,2022-08-25,2022-08-28,


We start the comparison with our base model, without the prebuilt entities. We keep only the columns that have been properly formatted; the others are not needed.

In [30]:
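# .copy() gives an independent dataframe, so later assignments
# do not target a slice of df_merged
df_score_sans_prebuilt = df_merged[[
    "or_city_x", "dst_city_x", "str_date_x", "end_date_x", "budget_x",
    "or_city_y", "dst_city_y", "str_date_formate", "end_date_formate",
    "budget_formate"
]].copy()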

In [31]:
df_score_sans_prebuilt.head(5)

Unnamed: 0,or_city_x,dst_city_x,str_date_x,end_date_x,budget_x,or_city_y,dst_city_y,str_date_formate,end_date_formate,budget_formate
0,,,NaT,NaT,,Alexandria,Cencun,NaT,2022-09-06,
1,Busan,Melbourne,2022-08-24,NaT,,Busan,Melbourne,2022-10-23,NaT,
2,,,NaT,NaT,,Buenos Aires,,NaT,NaT,
3,Dallas,Marseille,NaT,NaT,,Dallas,Marseille,NaT,NaT,
4,,,NaT,NaT,,Tofino,Dallas,2022-08-25,2022-08-28,


In [32]:
df_score_sans_prebuilt = df_score_sans_prebuilt.fillna("0")
df_score_sans_prebuilt.head(5)

Unnamed: 0,or_city_x,dst_city_x,str_date_x,end_date_x,budget_x,or_city_y,dst_city_y,str_date_formate,end_date_formate,budget_formate
0,0,0,0,0,0,Alexandria,Cencun,0,2022-09-06 00:00:00,0
1,Busan,Melbourne,2022-08-24 00:00:00,0,0,Busan,Melbourne,2022-10-23 00:00:00,0,0
2,0,0,0,0,0,Buenos Aires,0,0,0,0
3,Dallas,Marseille,0,0,0,Dallas,Marseille,0,0,0
4,0,0,0,0,0,Tofino,Dallas,2022-08-25 00:00:00,2022-08-28 00:00:00,0


We now compare, row by row, each **x (prediction)** column with its matching **y / formate (target)** column.

Each correct prediction adds 1 point to the "Score" column of that row (the "0" values, which stand for NaN, are of course not counted).
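
The scoring rule can be sketched on three rows taken from the table above (rows 0, 1 and 3), written as plain dictionaries with `None` for missing values. `score_row` is a hypothetical helper that mirrors the rule described here; it is not the notebook's implementation.

```python
SLOTS = ("or_city", "dst_city", "str_date", "end_date", "budget")


def score_row(pred, target):
    """Count the slots where a non-missing prediction equals the target."""
    return sum(
        1 for slot in SLOTS
        if pred.get(slot) is not None and pred.get(slot) == target.get(slot)
    )


rows = [
    # Row 0: nothing predicted, so nothing can score
    ({}, {"or_city": "Alexandria", "dst_city": "Cencun", "end_date": "2022-09-06"}),
    # Row 1: both cities match, the start date does not
    ({"or_city": "Busan", "dst_city": "Melbourne", "str_date": "2022-08-24"},
     {"or_city": "Busan", "dst_city": "Melbourne", "str_date": "2022-10-23"}),
    # Row 3: both cities match
    ({"or_city": "Dallas", "dst_city": "Marseille"},
     {"or_city": "Dallas", "dst_city": "Marseille"}),
]
scores = [score_row(pred, target) for pred, target in rows]
total = sum(scores)
```

A slot that is empty in both the prediction and the target earns no point, so rows with few labeled slots cannot reach a high score.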

In [34]:
list_columns_pred = [
    "or_city_x", "dst_city_x", "str_date_x", "end_date_x", "budget_x"
]
list_columns_target = [
    "or_city_y", "dst_city_y", "str_date_formate", "end_date_formate",
    "budget_formate"
]
# Work on an independent copy so that the assignments below
# are applied to this dataframe and not to a slice
df_score_sans_prebuilt = df_score_sans_prebuilt.copy()
df_score_sans_prebuilt['Score'] = 0
score_position = df_score_sans_prebuilt.columns.get_loc("Score")

for i in range(len(df_test)):
    for colonne_pred, colonne_target in zip(list_columns_pred,
                                            list_columns_target):
        valeur_pred = df_score_sans_prebuilt[colonne_pred].iloc[i]
        valeur_target = df_score_sans_prebuilt[colonne_target].iloc[i]
        # "0" marks a missing prediction and never counts as a correct answer
        if valeur_pred != "0" and valeur_pred == valeur_target:
            df_score_sans_prebuilt.iloc[i, score_position] += 1


In [35]:
quantite_bonne_reponse = df_score_sans_prebuilt["Score"].sum()
quantite_bonne_reponse

138

In [36]:
df_score_sans_prebuilt

Unnamed: 0,or_city_x,dst_city_x,str_date_x,end_date_x,budget_x,or_city_y,dst_city_y,str_date_formate,end_date_formate,budget_formate,Score
0,0,0,0,0,0,Alexandria,Cencun,0,2022-09-06 00:00:00,0,0
1,Busan,Melbourne,2022-08-24 00:00:00,0,0,Busan,Melbourne,2022-10-23 00:00:00,0,0,2
2,0,0,0,0,0,Buenos Aires,0,0,0,0,0
3,Dallas,Marseille,0,0,0,Dallas,Marseille,0,0,0,2
4,0,0,0,0,0,Tofino,Dallas,2022-08-25 00:00:00,2022-08-28 00:00:00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
195,0,0,0,0,0,Minneapolis,Lima,0,0,2000.0,0
196,0,Punta Cana,0,0,0,Punta Cana,Tofino,0,0,0,0
197,0,0,0,0,0,0,Melbourne,0,0,0,0
198,San Francisco,Sacramento,0,0,0,San Francisco,Sacramento,0,0,0,2


In [37]:
df_quantite_cell = df_test[[
    'or_city', 'dst_city', 'str_date', 'end_date', 'budget'
]].shape
df_quantite_cell

In [38]:
quantity_cell = df_quantite_cell[0] * df_quantite_cell[1]
quantity_cell

1000

In [39]:
score = quantite_bonne_reponse / quantity_cell * 100
score = str(score) + "%"
score

'13.8%'

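We obtain an accuracy of 13.8% (138 correct cells out of 1000). Note that the denominator counts every cell of the five target columns, including the empty ones, which can never score a point.

### Optimization

We now add the prebuilt data to the predictions and check whether this improves the score.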

In [90]:
# .copy() gives an independent dataframe that can be modified safely
df_avec_prebuilt = df_merged[[
    "text", "or_city_x", "dst_city_x", "str_date_x", "end_date_x", "budget_x",
    "geographyV2", "money", "date_prebuilt_start", "date_prebuilt_end",
    "date_prebuilt_inconnu"
]].copy()

In [91]:
df_avec_prebuilt

| | text | or_city_x | dst_city_x | str_date_x | end_date_x | budget_x | geographyV2 | money | date_prebuilt_start | date_prebuilt_end | date_prebuilt_inconnu |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Three words: Alexandria to Cencun. Please book... | | | NaT | NaT | | | | | | 2022-10-28 17:00:44 |
| 1 | Hello. I'm just gonna be honest with you, here... | Busan | Melbourne | 2022-08-24 | NaT | | | | | | |
| 2 | Need two tickets out of Buenos Aires! | | | NaT | NaT | | Buenos Aires | | | | |
| 3 | I need to get to Marseille from Dallas | Dallas | Marseille | NaT | NaT | | | | | | |
| 4 | Hi there, I'm looking for a trip for 5 between... | | | NaT | NaT | | | | 2022-08-25 | 2022-08-28 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | Hello. I am a deeply tormented children’s writ... | | | NaT | NaT | | Minneapolis | 2000.0 | 2000-01-01 | 2001-01-01 | |
| 196 | I have a business trip coming up in Punta Cana... | | Punta Cana | NaT | NaT | | | | | | |
| 197 | Melbourne please | | | NaT | NaT | | Melbourne | | | | |
| 198 | Good morning! So I just won the lottery and de... | San Francisco | Sacramento | NaT | NaT | | | | | | |


As mentioned earlier, we do not know whether the geographyV2 and date_prebuilt_inconnu columns refer to the origin or the destination (for dates, to the departure or the return).

We will assume they hold the destination city and the departure date: users are far more likely to state where they want to go than where they are leaving from, and to give their departure date rather than their return date.

In [92]:
# Fill the NaN values of each column with the values of the matching prebuilt column
df_avec_prebuilt["str_date_x"].fillna(df_avec_prebuilt["date_prebuilt_start"],
                                      inplace=True)
df_avec_prebuilt["str_date_x"].fillna(
    df_avec_prebuilt["date_prebuilt_inconnu"], inplace=True)

df_avec_prebuilt["end_date_x"].fillna(df_avec_prebuilt["date_prebuilt_end"],
                                      inplace=True)
df_avec_prebuilt["budget_x"].fillna(df_avec_prebuilt["money"], inplace=True)
df_avec_prebuilt["dst_city_x"].fillna(df_avec_prebuilt["geographyV2"],
                                      inplace=True)

del df_avec_prebuilt['date_prebuilt_start']
del df_avec_prebuilt['date_prebuilt_inconnu']
del df_avec_prebuilt['date_prebuilt_end']
del df_avec_prebuilt['money']
del df_avec_prebuilt['geographyV2']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_avec_prebuilt["str_date_x"].fillna(df_avec_prebuilt["date_prebuilt_start"], inplace=True)
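
The warning above appears because `df_avec_prebuilt` was created by selecting columns from `df_merged`, and `fillna(..., inplace=True)` is then called on a column of that selection. One common way to avoid it, sketched here on toy data, is to take an explicit `.copy()` and assign the filled column back. The frame `df_merged_demo` and its values are hypothetical; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_merged (hypothetical data, same column names).
df_merged_demo = pd.DataFrame({
    "dst_city_x": ["Melbourne", np.nan, np.nan],
    "geographyV2": [np.nan, "Buenos Aires", np.nan],
    "budget_x": [np.nan, np.nan, np.nan],
    "money": [np.nan, 2000.0, np.nan],
})

# .copy() makes the selection an independent DataFrame, so later writes
# do not target a slice of df_merged_demo.
df_demo = df_merged_demo[["dst_city_x", "geographyV2", "budget_x", "money"]].copy()

# Assign the filled column back instead of calling fillna(..., inplace=True)
# on a column selection.
df_demo["dst_city_x"] = df_demo["dst_city_x"].fillna(df_demo["geographyV2"])
df_demo["budget_x"] = df_demo["budget_x"].fillna(df_demo["money"])

# Drop the helper columns in one call.
df_demo = df_demo.drop(columns=["geographyV2", "money"])
print(df_demo)
```

Assigning the result back makes the write target explicit, which is the pattern the pandas indexing guide linked in the warning recommends.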

In [93]:
df_avec_prebuilt

| | text | or_city_x | dst_city_x | str_date_x | end_date_x | budget_x |
|---|---|---|---|---|---|---|
| 0 | Three words: Alexandria to Cencun. Please book... | | | 2022-10-28 17:00:44 | | |
| 1 | Hello. I'm just gonna be honest with you, here... | Busan | Melbourne | 2022-08-24 00:00:00 | | |
| 2 | Need two tickets out of Buenos Aires! | | Buenos Aires | | | |
| 3 | I need to get to Marseille from Dallas | Dallas | Marseille | | | |
| 4 | Hi there, I'm looking for a trip for 5 between... | | | 2022-08-25 | 2022-08-28 | |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | Hello. I am a deeply tormented children’s writ... | | Minneapolis | 2000-01-01 | 2001-01-01 | 2000.0 |
| 196 | I have a business trip coming up in Punta Cana... | | Punta Cana | | | |
| 197 | Melbourne please | | Melbourne | | | |
| 198 | Good morning! So I just won the lottery and de... | San Francisco | Sacramento | | | |


We can now rerun the scoring.

In [100]:
df_avec_prebuilt_all = pd.merge(df_avec_prebuilt[[
    "text", "or_city_x", "dst_city_x", "str_date_x", "end_date_x", "budget_x"
]],
                                df_test[[
                                    "text", "or_city", "dst_city",
                                    "str_date_formate", "end_date_formate",
                                    "budget_formate"
                                ]],
                                on=['text'])

df_avec_prebuilt_all.head(5)

| | text | or_city_x | dst_city_x | str_date_x | end_date_x | budget_x | or_city | dst_city | str_date_formate | end_date_formate | budget_formate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Three words: Alexandria to Cencun. Please book... | | | 2022-10-28 17:00:44 | | | Alexandria | Cencun | NaT | 2022-09-06 | |
| 1 | Hello. I'm just gonna be honest with you, here... | Busan | Melbourne | 2022-08-24 00:00:00 | | | Busan | Melbourne | 2022-10-23 | NaT | |
| 2 | Need two tickets out of Buenos Aires! | | Buenos Aires | | | | Buenos Aires | | NaT | NaT | |
| 3 | I need to get to Marseille from Dallas | Dallas | Marseille | | | | Dallas | Marseille | NaT | NaT | |
| 4 | Hi there, I'm looking for a trip for 5 between... | | | 2022-08-25 | 2022-08-28 | | Tofino | Dallas | 2022-08-25 | 2022-08-28 | |
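
Merging on `text` assumes that each utterance appears exactly once in both frames; a repeated utterance would multiply rows and distort the score. The sketch below shows one way this could be checked with the `validate` and `indicator` options of `pd.merge`. The two toy frames and their names are hypothetical, not the notebook's data.

```python
import pandas as pd

# Toy frames (hypothetical data) standing in for predictions and targets.
df_pred_demo = pd.DataFrame({
    "text": ["Melbourne please", "I need to get to Marseille from Dallas"],
    "dst_city_x": ["Melbourne", "Marseille"],
})
df_target_demo = pd.DataFrame({
    "text": ["Melbourne please", "I need to get to Marseille from Dallas"],
    "dst_city": ["Melbourne", "Marseille"],
})

# validate="one_to_one" raises pandas.errors.MergeError if a text is duplicated
# on either side; indicator=True adds a "_merge" column giving each row's origin.
df_joined_demo = pd.merge(df_pred_demo, df_target_demo, on="text",
                          how="outer", validate="one_to_one", indicator=True)

# Rows present on only one side would signal a text mismatch.
unmatched = df_joined_demo[df_joined_demo["_merge"] != "both"]
print(len(df_joined_demo), len(unmatched))
```

An outer join is used here only so that unmatched rows stay visible; the notebook's default inner join would silently drop them.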


In [101]:
df_avec_prebuilt_all = df_avec_prebuilt_all[[
    "or_city_x", "dst_city_x", "str_date_x", "end_date_x", "budget_x",
    "or_city", "dst_city", "str_date_formate", "end_date_formate",
    "budget_formate"
]]

In [102]:
list_columns_pred = [
    "or_city_x", "dst_city_x", "str_date_x", "end_date_x", "budget_x"
]
list_columns_target = [
    "or_city", "dst_city", "str_date_formate", "end_date_formate",
    "budget_formate"
]
# One point per predicted field that equals its target field
df_avec_prebuilt_all['Score'] = 0

for i in range(len(df_test)):
    df_avec_prebuilt_all["Score"].iloc[i] = 0
    for col_target, col_pred in zip(list_columns_target, list_columns_pred):
        # Skip predictions equal to the "0" placeholder
        if df_avec_prebuilt_all[col_pred].iloc[i] != "0":
            if df_avec_prebuilt_all[col_pred].iloc[i] == df_avec_prebuilt_all[
                    col_target].iloc[i]:
                df_avec_prebuilt_all["Score"].iloc[i] += 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_avec_prebuilt_all["Score"].iloc[i] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_avec_prebuilt_all["Score"].iloc[i] += 1
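
The row-by-row loop relies on chained indexing, which is what triggers the warning. A possible alternative, sketched here on toy data, compares each prediction column with its target column as a whole and sums the matches per row. The frame `df_score_demo` and its values are hypothetical, and this sketch is not guaranteed to reproduce the loop's count on the real data (for example where dates are stored with mixed types).

```python
import numpy as np
import pandas as pd

pred_cols = ["or_city_x", "dst_city_x"]
target_cols = ["or_city", "dst_city"]

# Toy data (hypothetical): NaN marks a missing prediction or target.
df_score_demo = pd.DataFrame({
    "or_city_x": ["Dallas", np.nan, "Busan"],
    "dst_city_x": ["Marseille", "Buenos Aires", "Lima"],
    "or_city": ["Dallas", "Buenos Aires", "Busan"],
    "dst_city": ["Marseille", np.nan, "Melbourne"],
})

# Accumulate one point per column pair where a non-missing prediction
# equals its target.
score = pd.Series(0, index=df_score_demo.index)
for pred_col, target_col in zip(pred_cols, target_cols):
    is_match = (df_score_demo[pred_col].notna()
                & df_score_demo[pred_col].eq(df_score_demo[target_col]))
    score += is_match.astype(int)

df_score_demo["Score"] = score
total_correct = int(df_score_demo["Score"].sum())
print(df_score_demo["Score"].tolist(), total_correct)
```

The explicit `notna()` mask makes sure a missing prediction never counts as a match, which plays the role of the `"0"` placeholder test in the loop.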


In [103]:
quantite_bonne_reponse = df_avec_prebuilt_all["Score"].sum()
quantite_bonne_reponse

186

In [104]:
score = quantite_bonne_reponse / 1000 * 100
score = str(score) + "%"
score

'18.6%'

We improved the score by 4.8 percentage points, from 13.8% to 18.6%.
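
The gain can be read either as an absolute difference in percentage points or as a change relative to the baseline. A short sketch of both computations, using the two scores reported in this notebook (the variable names are illustrative):

```python
baseline = 13.8   # score without prebuilt entities, in percent
optimized = 18.6  # score with prebuilt entities, in percent

# Absolute gain, expressed in percentage points.
gain_points = round(optimized - baseline, 1)

# Relative gain with respect to the baseline score.
gain_relative = round((optimized - baseline) / baseline * 100, 1)

print(f"+{gain_points} points, i.e. +{gain_relative}% relative to the baseline")
```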

This completes the training of our model.

# Conclusion 🏁

We accomplished a lot in this notebook! 🎉

We began by creating our LUIS application, then trained our model on the full training dataset. Once the model was trained, we put it to the test by using it to predict results on our test dataset.

We extracted these predictions, including those generated by the prebuilt models, and organized them in a dataframe for more detailed analysis. This step gave us a clear picture of how our model performs and where its strengths and weaknesses lie.

Finally, we formatted the data predicted by the prebuilt models. This helped us better understand how these models can affect the performance of our own model and how we could integrate them in the future to improve our predictions.