## Classification tutorial

### The goal of this notebook is to transform the Minotour chat logs into a dataset with exactly the same format as the SNIPS dataset, which is the one the DeepPavlov framework uses for intent classification.

<div class="alert alert-success">
<b>
The first step is to import all the CSV chat logs and concatenate them. We can then drop the columns we don't need and clean the dataset.
</b>
</div>

In [2]:
import random
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

<div class="alert alert-success">
<b>
Let's import all the CSV files.
</b>
</div>

In [3]:
data1 = pd.read_csv('chat logs/log_2018-08.csv')
data2 = pd.read_csv('chat logs/log_2019-01.csv')
data3 = pd.read_csv('chat logs/log_2019-02.csv')
data4 = pd.read_csv('chat logs/log_2019-03.csv')
data5 = pd.read_csv('chat logs/log_2019-04.csv')
data6 = pd.read_csv('chat logs/log_2018-09.csv')
data7 = pd.read_csv('chat logs/log_2018-10.csv')
data8 = pd.read_csv('chat logs/log_2018-11.csv')
data9 = pd.read_csv('chat logs/log_2018-12.csv')

<div class="alert alert-success">
<b>
We concatenate them along the index axis (all the CSV files have the same columns) and keep only the two columns we need: clean_message and intent.
</b>
</div>

In [4]:
concat_data = pd.concat([data1,data2,data3,data4,data5,data6,data7,data8,data9])
concat_data = concat_data[['clean_message','intent']]
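
The nine `read_csv` calls above can also be written as a loop over the folder. The following is only a sketch: the helper name `load_chat_logs`, the `log_*.csv` pattern, and the two demo files are assumptions made for illustration, not part of the original notebook. Note that `ignore_index=True` renumbers the rows, whereas the notebook keeps each file's own row numbers.

```python
import glob
import os
import tempfile

import pandas as pd


def load_chat_logs(folder, pattern="log_*.csv"):
    """Read every CSV in `folder` matching `pattern` and stack them row-wise.

    Hypothetical helper: the name and the default pattern are assumptions.
    """
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the result a fresh 0..n-1 index instead of
    # repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up log files in a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame(
        {"clean_message": ['"Hi"'], "intent": ["Default Welcome Intent"]}
    ).to_csv(os.path.join(demo_dir, "log_2018-08.csv"), index=False)
    pd.DataFrame(
        {"clean_message": ['"stop"'], "intent": ["abandon"]}
    ).to_csv(os.path.join(demo_dir, "log_2018-09.csv"), index=False)

    demo_logs = load_chat_logs(demo_dir)[["clean_message", "intent"]]

print(demo_logs.shape)
```

With the real data this would be called as `load_chat_logs('chat logs')`, assuming every file in that folder follows the `log_YYYY-MM.csv` naming seen above.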

In [5]:
concat_data.head()

Unnamed: 0,clean_message,intent
0,"""Hi""",Default Welcome Intent
1,"""suggest some events for tomorrow in Nice""",find.event
2,"""stop""",abandon
3,"""find events in mocaco""",find.event
4,"""Hi""",Default Welcome Intent


In [6]:
print("We have %s different intents." % len(concat_data.intent.unique()))

We have 28 different intents.


In [7]:
print("We have %s different clean_message." % len(concat_data.clean_message.unique()))

We have 2313 different clean_message.


In [8]:
# Strip the surrounding double quotes from each clean_message
concat_data['clean_message'] = concat_data['clean_message'].str[1:-1]
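
Slicing with `.str[1:-1]` removes the first and last character of every message whether or not they are quotes. A slightly safer variant, sketched below on made-up messages, is `.str.strip('"')`, which only removes double quotes found at either end. This is an alternative for illustration, not what the notebook runs.

```python
import pandas as pd

# Made-up sample: the last value has no surrounding quotes.
messages = pd.Series(['"Hi"', '"find events in Nice"', 'stop'])

# Slicing drops the first and last character unconditionally,
# so the unquoted value loses real letters.
sliced = messages.str[1:-1]

# str.strip('"') only removes double quotes at the start or end.
stripped = messages.str.strip('"')

print(sliced.tolist())    # ['Hi', 'find events in Nice', 'to']
print(stripped.tolist())  # ['Hi', 'find events in Nice', 'stop']
```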

In [9]:
concat_data.head()

Unnamed: 0,clean_message,intent
0,Hi,Default Welcome Intent
1,suggest some events for tomorrow in Nice,find.event
2,stop,abandon
3,find events in mocaco,find.event
4,Hi,Default Welcome Intent


In [10]:
print("We currently have", concat_data.shape[0], "lines")

We currently have 4159 lines


<div class="alert alert-success">
<b>
We know this dataset contains duplicates, so let's remove them.<br>
We deduplicate on the clean_message column so that the same message can never appear with two different intents. Note that keep=False drops every row whose message occurs more than once, instead of keeping one copy of it.
</b>
</div>

In [11]:
#REMOVE DUPLICATES
concat_data.drop_duplicates(subset="clean_message",keep=False,inplace=True)
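
The effect of `keep=False` in the call above can be seen on a toy dataframe (the rows below are made up for illustration). The sketch compares it with `keep="first"` and also lists the messages that carry more than one intent, which are the truly ambiguous ones.

```python
import pandas as pd

toy = pd.DataFrame({
    "clean_message": ["Hi", "stop", "Hi", "find events", "stop"],
    "intent": ["Default Welcome Intent", "abandon", "Default Welcome Intent",
               "find.event", "find.event"],
})

# keep=False: every message seen more than once disappears entirely.
dropped_all = toy.drop_duplicates(subset="clean_message", keep=False)

# keep="first": one copy of each message survives.
kept_first = toy.drop_duplicates(subset="clean_message", keep="first")

# Messages labelled with more than one distinct intent.
intents_per_message = toy.groupby("clean_message")["intent"].nunique()
conflicting = intents_per_message[intents_per_message > 1].index.tolist()

print(len(dropped_all), len(kept_first), conflicting)  # 1 3 ['stop']
```

With `keep=False`, frequent and unambiguous messages such as "Hi" are removed as well, so the choice between the two settings is a trade-off between dataset size and label consistency.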

In [12]:
print("We currently have", concat_data.shape[0], "lines")

We currently have 1906 lines


<div class="alert alert-success">
<b>
Our dataset is now much smaller: 2253 of the 4159 lines contained a message that occurs more than once and have been dropped.<br>
For comparison, the SNIPS dataset has 15884 lines.
</b>
</div>

In [13]:
#SAVE THE DATAFRAME IN A CSV FILE
# concat_data.to_csv("chat_log1.csv")

<div class="alert alert-success">
<b>
When we used the dataset without removing the duplicates, intent classification reached 56% accuracy.<br>
After removing the duplicates, the accuracy rose slightly to 59%.<br>
We also changed the train/test split from 80/20 to 90/10 to get a larger training set, but the accuracy did not change.<br>
We suspect the main reason for the low accuracy is the small size of the dataset.<br>
</b>
</div>
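
The notebook does not show how the 80/20 and 90/10 splits were produced. As one possible way to do it in pandas, the sketch below draws the test set separately inside each intent so that both parts keep the same class proportions. The helper `stratified_split` and the toy data are assumptions, and `groupby(...).sample` requires pandas 1.1 or later.

```python
import pandas as pd

# Made-up data: two intents with ten messages each.
toy = pd.DataFrame({
    "text": [f"message {i}" for i in range(20)],
    "intents": ["find.event"] * 10 + ["abandon"] * 10,
})


def stratified_split(df, label_col, test_frac, seed=0):
    """Hypothetical helper: sample `test_frac` of the rows of each label.

    Assumes the index of `df` is unique; call reset_index(drop=True)
    first if the frames were concatenated without ignore_index.
    """
    test = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


train_80, test_20 = stratified_split(toy, "intents", test_frac=0.2)
train_90, test_10 = stratified_split(toy, "intents", test_frac=0.1)

print(len(train_80), len(test_20))  # 16 4
print(len(train_90), len(test_10))  # 18 2
```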

<div class="alert alert-success">
<b>
To check whether the size of the dataset is the cause, let's add 25% of the SNIPS dataset to our dataframe and see whether the accuracy changes.
</b>
</div>

In [15]:
#IMPORT THE SNIPS DATASET
snips_data = pd.read_csv('train.csv')

In [18]:
snips_data.head()

Unnamed: 0,text,intents
0,Add another song to the Cita Romántica playli...,AddToPlaylist
1,add clem burke in my playlist Pre-Party R&B Jams,AddToPlaylist
2,Add Live from Aragon Ballroom to Trapeo,AddToPlaylist
3,add Unite and Win to my night out,AddToPlaylist
4,Add track to my Digster Future Hits,AddToPlaylist


In [22]:
#WE TAKE 25% OF THE DATASET
subset_snips = snips_data.sample(frac=0.25)
subset_snips.head()

Unnamed: 0,text,intents
10541,Where to get painting of The Man in the White ...,SearchCreativeWork
4570,What is the forecast for West Virginia will it...,GetWeather
4834,What is the weather forecast for Laos,GetWeather
15675,show me the schedule of The Loves of Letty in ...,SearchScreeningEvent
12459,I'm wondering when I can see Beating Heart at ...,SearchScreeningEvent


In [23]:
print("The subset of the snips dataset contains %s rows." % len(subset_snips.text))

The subset of the snips dataset contains 3971 rows.
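
`sample(frac=0.25)` draws a different subset on every run, so the experiment cannot be repeated exactly. Passing `random_state` fixes the draw. The sketch below uses made-up data, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up stand-in for the SNIPS training set.
toy_snips = pd.DataFrame({
    "text": [f"utterance {i}" for i in range(40)],
    "intents": ["AddToPlaylist"] * 20 + ["GetWeather"] * 20,
})

# Same seed, same rows: the subset can be regenerated later.
subset_a = toy_snips.sample(frac=0.25, random_state=42)
subset_b = toy_snips.sample(frac=0.25, random_state=42)

print(len(subset_a))              # 10
print(subset_a.equals(subset_b))  # True
```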


In [27]:
concat_data.rename(index=str, columns={"clean_message": "text", "intent": "intents"}, inplace=True)

In [29]:
#WE CONCATENATE THE TWO DATASETS
full_data = pd.concat([concat_data,subset_snips])
full_data.head()

Unnamed: 0,text,intents
3,find events in mocaco,find.event
8,how about nice,find.alternatives
11,I'm looking for kids activity in Monaco,find.activity
12,what can kids do in Nice,find.alternatives
14,in Nice next month an event to find,find.event


In [32]:
#REMOVE DUPLICATES
full_data.drop_duplicates(subset="text",keep=False,inplace=True)

In [33]:
print("The full dataset contains %s rows." % len(full_data.text))

The full dataset contains 5733 rows.
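
Before the duplicate removal the combined frame holds 1906 + 3971 = 5877 rows, so most of it comes from SNIPS. One way to keep track of this is to tag each row with its origin before concatenating. The `source` column and the toy rows below are assumptions for illustration; such a column would have to be dropped again before export to keep the SNIPS format.

```python
import pandas as pd

# Made-up stand-ins for the cleaned chat logs and the SNIPS subset.
chat = pd.DataFrame({
    "text": ["find events in Nice", "stop"],
    "intents": ["find.event", "abandon"],
})
snips = pd.DataFrame({
    "text": ["add this song", "weather in Laos", "play jazz"],
    "intents": ["AddToPlaylist", "GetWeather", "PlayMusic"],
})

# Tag every row with where it came from, then stack the two frames.
combined = pd.concat(
    [chat.assign(source="minotour"), snips.assign(source="snips")],
    ignore_index=True,
)

# Fraction of rows contributed by each source.
share = combined["source"].value_counts(normalize=True)
print(share.to_dict())

# Drop the helper column to return to the two-column SNIPS layout.
export_ready = combined.drop(columns="source")
```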


In [34]:
full_data.to_csv("full_data.csv")

<div class="alert alert-success">
<b>
After downloading the file "full_data.csv" and training the intent classification model on it, we got 82% accuracy.<br>
This strongly suggests that the main problem was the size of the dataset.<br>
</b>
</div>