<div class="alert alert-success">
<b>
The goal of this notebook is to transform the MinoTour chat logs into a dataset with nearly the same format as the DSTC2 dataset, which the DeepPavlov framework uses to build goal-oriented chatbots.
</b>
</div>

<div class="alert alert-success">
<b>
The first step is to import all the CSV chat logs and concatenate them. We can then drop the columns we don't need and clean the dataset.
</b>
</div>

In [125]:
import random
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

<div class="alert alert-success">
<b>
Let's import all the CSV files.
</b>
</div>

In [126]:
data1 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2018-08.csv')
data2 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2018-09.csv')
data3 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2018-10.csv')
data4 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2018-11.csv')
data5 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2018-12.csv')
data6 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2019-01.csv')
data7 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2019-02.csv')
data8 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2019-03.csv')
data9 = pd.read_csv('C:\\Users\\Med Yasser\\Documents\\3A\\SPRING\\Project\\chat logs\\log_2019-04.csv')

<div class="alert alert-success">
<b>
We concatenate them along the index axis (all the csv files have the same columns)
</b>
</div>

In [127]:
concat_data = pd.concat([data1,data2,data3,data4,data5,data6,data7,data8,data9])
concat_data = concat_data[['user','intent','clean_message','response']]
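The nine explicit `read_csv` calls and the concatenation can also be written as one helper. The following is a minimal sketch, not the notebook's own code: the function name `load_logs` and the `log_*.csv` pattern are assumptions based on the file names used above.

```python
from pathlib import Path

import pandas as pd


def load_logs(folder, pattern="log_*.csv"):
    """Read every CSV matching `pattern` in `folder` and stack them row-wise."""
    # Sorting the paths keeps the monthly files in chronological order,
    # because the names embed the date as YYYY-MM.
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no file matching {pattern!r} in {folder}")
    frames = [pd.read_csv(p) for p in paths]
    # ignore_index=True gives the result a fresh 0..n-1 index instead of
    # repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)
```

Note that `ignore_index=True` is a design choice of this sketch: the cells above keep each file's original row numbers, so the index values shown in the tables of this notebook would differ.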

<div class="alert alert-success">
<b>
We keep the following columns: user, intent, clean_message, response.<br>
To make a dataset that looks like the DSTC2 dataset, we need to create a new column called act that represents the "act" of the chatbot in response to the intent of the user's sentence.
</b>
</div>

In [149]:
concat_data[:10]

Unnamed: 0,user,intent,clean_message,response,act
0,9cd817ac-4f43-f5c8-8311-986630b4eb63,Default Welcome Intent,"""Hi""","""Good day! Are you bored? My name is MinoTour ...",Default Welcome Intent
2,9cd817ac-4f43-f5c8-8311-986630b4eb63,abandon,"""stop""","""Ok how may I help you?""",abandon
3,1572e885-81bd-04e0-2901-82593cc20a28,find.event,"""find events in mocaco""","""What is the city?""",find.event
4,2e24a315-3212-8206-ff78-2ddeaa5e2bc1,Default Welcome Intent,"""Hi""","""Hey! This is nice to see you :) My name is Mi...",Default Welcome Intent
5,2e24a315-3212-8206-ff78-2ddeaa5e2bc1,Default Welcome Intent,"""hi""","""Hi! My name is MinoTour :) What do you want t...",Default Welcome Intent
6,1572e885-81bd-04e0-2901-82593cc20a28,find.event,"""find events in Monaco""","""What is the datetime?""",find.event
13,5c384e65-8ce5-c60f-cfc9-5f1c8533a916,Default Welcome Intent,"""Hi""","""Good day! Are you bored? My name is MinoTour ...",Default Welcome Intent
15,82f744c7-3b36-76c7-5956-a67e84ed2c77,Default Welcome Intent,"""Hi""","""Greetings! My name is MinoTour and I'm here t...",Default Welcome Intent
17,82f744c7-3b36-76c7-5956-a67e84ed2c77,abandon,"""quelle est la meilleure période pour skier à ...","""Bien sûr comment puis-je vous aider maintenan...",abandon
18,efa53fe0-010a-17d3-5b70-3c23dea37a9a,Default Welcome Intent,"""Hi""","""Greetings! My name is MinoTour and I'm here t...",Default Welcome Intent


In [129]:
print("We have %s different discussions." % len(concat_data.user.unique()))

We have 374 different discussions.


In [130]:
print("We have %s different intents." % len(concat_data.intent.unique()))

We have 28 different intents.


In [131]:
print("We have %s different clean_message." % len(concat_data.clean_message.unique()))

We have 2313 different clean_message.


In [132]:
print("We have %s different responses." % len(concat_data.response.unique()))

We have 143 different responses.
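The four counts above can also be obtained in a single call. This is a small sketch on a toy frame whose values are made up for illustration. It also shows one difference worth knowing: `nunique()` skips missing values by default, while `len(series.unique())` counts a missing value as one more distinct entry.

```python
import pandas as pd

# Toy frame standing in for concat_data (values are invented for illustration).
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "intent": ["find.event", "abandon", "find.event"],
    "clean_message": ["hi", "stop", None],
    "response": ["hello", "ok", "hello"],
})

# One call returns the number of distinct non-missing values per column.
counts = toy.nunique()
print(counts)

# unique() keeps the missing value, so this count is one higher here.
with_missing = len(toy["clean_message"].unique())
print(with_missing)
```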


<div class="alert alert-success">
<b>
The response column contains a value called "result_card". It appears to stand for a kind of API call to an external database, so to simplify things we delete every row whose response is "result_card".
</b>
</div>

In [133]:
concat_data = concat_data[concat_data.response != "<result_card>"]

In [147]:
concat_data[:10]

Unnamed: 0,user,intent,clean_message,response,act
0,9cd817ac-4f43-f5c8-8311-986630b4eb63,Default Welcome Intent,"""Hi""","""Good day! Are you bored? My name is MinoTour ...",Default Welcome Intent
2,9cd817ac-4f43-f5c8-8311-986630b4eb63,abandon,"""stop""","""Ok how may I help you?""",abandon
3,1572e885-81bd-04e0-2901-82593cc20a28,find.event,"""find events in mocaco""","""What is the city?""",find.event
4,2e24a315-3212-8206-ff78-2ddeaa5e2bc1,Default Welcome Intent,"""Hi""","""Hey! This is nice to see you :) My name is Mi...",Default Welcome Intent
5,2e24a315-3212-8206-ff78-2ddeaa5e2bc1,Default Welcome Intent,"""hi""","""Hi! My name is MinoTour :) What do you want t...",Default Welcome Intent
6,1572e885-81bd-04e0-2901-82593cc20a28,find.event,"""find events in Monaco""","""What is the datetime?""",find.event
13,5c384e65-8ce5-c60f-cfc9-5f1c8533a916,Default Welcome Intent,"""Hi""","""Good day! Are you bored? My name is MinoTour ...",Default Welcome Intent
15,82f744c7-3b36-76c7-5956-a67e84ed2c77,Default Welcome Intent,"""Hi""","""Greetings! My name is MinoTour and I'm here t...",Default Welcome Intent
17,82f744c7-3b36-76c7-5956-a67e84ed2c77,abandon,"""quelle est la meilleure période pour skier à ...","""Bien sûr comment puis-je vous aider maintenan...",abandon
18,efa53fe0-010a-17d3-5b70-3c23dea37a9a,Default Welcome Intent,"""Hi""","""Greetings! My name is MinoTour and I'm here t...",Default Welcome Intent


<div class="alert alert-success">
<b>
Now we create a new column called act that represents the "intent" of the response generated by the chatbot.
The act depends on the intent of the user message, so we will create a dictionary that maps each user intent to an act. For now, the act column starts out as a copy of the intent column.
</b>
</div>
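The intent-to-act dictionary described above could look like the following minimal sketch. The act names on the right-hand side are hypothetical and only illustrate the mechanism; the helper name `add_act_column` is also an assumption.

```python
import pandas as pd

# Hypothetical mapping: the act names are invented for illustration.
INTENT_TO_ACT = {
    "Default Welcome Intent": "welcome",
    "abandon": "offer_help",
    "find.event": "request_slot",
}


def add_act_column(df, mapping):
    """Return a copy of df with an 'act' column derived from 'intent'."""
    out = df.copy()
    # Intents missing from the mapping keep their own name as the act.
    out["act"] = out["intent"].map(mapping).fillna(out["intent"])
    return out
```

Returning a copy rather than modifying the input frame in place also avoids pandas' SettingWithCopyWarning when the input is a filtered slice.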

In [135]:
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised on the filtered slice
concat_data = concat_data.assign(act=concat_data["intent"])


In [155]:
concat_data

Unnamed: 0,user,intent,clean_message,response,act
0,9cd817ac-4f43-f5c8-8311-986630b4eb63,Default Welcome Intent,"""Hi""","""Good day! Are you bored? My name is MinoTour ...",Default Welcome Intent
2,9cd817ac-4f43-f5c8-8311-986630b4eb63,abandon,"""stop""","""Ok how may I help you?""",abandon
3,1572e885-81bd-04e0-2901-82593cc20a28,find.event,"""find events in mocaco""","""What is the city?""",find.event
4,2e24a315-3212-8206-ff78-2ddeaa5e2bc1,Default Welcome Intent,"""Hi""","""Hey! This is nice to see you :) My name is Mi...",Default Welcome Intent
5,2e24a315-3212-8206-ff78-2ddeaa5e2bc1,Default Welcome Intent,"""hi""","""Hi! My name is MinoTour :) What do you want t...",Default Welcome Intent
6,1572e885-81bd-04e0-2901-82593cc20a28,find.event,"""find events in Monaco""","""What is the datetime?""",find.event
13,5c384e65-8ce5-c60f-cfc9-5f1c8533a916,Default Welcome Intent,"""Hi""","""Good day! Are you bored? My name is MinoTour ...",Default Welcome Intent
15,82f744c7-3b36-76c7-5956-a67e84ed2c77,Default Welcome Intent,"""Hi""","""Greetings! My name is MinoTour and I'm here t...",Default Welcome Intent
17,82f744c7-3b36-76c7-5956-a67e84ed2c77,abandon,"""quelle est la meilleure période pour skier à ...","""Bien sûr comment puis-je vous aider maintenan...",abandon
18,efa53fe0-010a-17d3-5b70-3c23dea37a9a,Default Welcome Intent,"""Hi""","""Greetings! My name is MinoTour and I'm here t...",Default Welcome Intent


In [137]:
dict_line = {}
dict_line["user_message"] = "bonjour"
dict_line["bot_message"] = "hola"
print(dict_line)
# dicto_line = {}
dictoo_line= dict.fromkeys(["user_message","bot_message"],None)
print(dictoo_line)

{'user_message': 'bonjour', 'bot_message': 'hola'}
{'user_message': None, 'bot_message': None}


In [158]:
import json

def fill_dict(dico, t, i, t_b, a):
    """Fill one dialogue turn: user text, user intent, bot text, bot act."""
    dico["text"] = t
    dico["intents"] = i  # was misspelled "itents", which left "intents" set to None
    dico["text_bot"] = t_b
    dico["act"] = a

dialog_id = "9cd817ac-4f43-f5c8-8311-986630b4eb63"
# Open the file once, explicitly as UTF-8: with ensure_ascii=False the French
# messages cannot be encoded by the default Windows 'charmap' codec, which
# raised a UnicodeEncodeError.
with open('data10.json', 'a', encoding='utf-8') as file:
    for index, row in concat_data.iterrows():
        dict_line = dict.fromkeys(["text", "intents", "text_bot", "act"], None)
        fill_dict(dict_line, row['clean_message'], row['intent'], row['response'], row['act'])
        if row['user'] != dialog_id:
            # A new user id starts a new dialogue: separate it with a blank line
            dialog_id = row['user']
            file.write("\n")
        json.dump(dict_line, file, ensure_ascii=False)
        file.write("\n")
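The rows in the log are not grouped by user: in the tables above, rows 3 and 6 belong to the same user, while rows 4 and 5 belong to another one. A loop that starts a new dialogue whenever the user id changes would therefore split that conversation in two. The following is a minimal sketch of one way around this, assuming that all rows sharing a user id form a single dialogue. The function name `write_dialogs` is hypothetical, and the file is opened in write mode rather than append mode.

```python
import json

import pandas as pd


def write_dialogs(df, path):
    """Write one JSON object per turn, with a blank line between dialogues."""
    # Explicit UTF-8 so that non-ASCII messages can be written on any platform.
    with open(path, "w", encoding="utf-8") as f:
        # sort=False keeps users in order of first appearance; within each
        # group the rows keep their original order.
        for _, turns in df.groupby("user", sort=False):
            for _, row in turns.iterrows():
                turn = {
                    "text": row["clean_message"],
                    "intents": row["intent"],
                    "text_bot": row["response"],
                    "act": row["act"],
                }
                json.dump(turn, f, ensure_ascii=False)
                f.write("\n")
            f.write("\n")  # blank line closes the dialogue
```

Calling `write_dialogs(concat_data, 'data10.json')` would then produce one block of JSON lines per user.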