<a href="https://colab.research.google.com/github/sjung-stat/Customer-Support-Chat-Intent-Classification/blob/main/Data%20Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the 'Exploratory Data Analysis' part, we found that the dataset is imbalanced. We address the imbalance in the text data by:


1. merging similar intents
      - There are too many categories relative to the size of the dataset. We therefore merge categories that have similar intents.

2.   augmenting text data that belong to minority classes
      - There are several methods to augment text data, including: 
          - back translation (e.g. English -> French -> English)
          - __paraphrasing__
      - We use paraphrasing because, in most cases, back translation returned a result almost identical (even word for word) to the original sentence. Paraphrasing, on the other hand, produces sentences that are similar to the originals but not the same. Sometimes, however, paraphrasing changes important keywords, so that the sentence no longer seems to belong to its category. For example, when we apply the technique to the 'contactless_not_working' category, the keyword 'contactless' is replaced by other words, so the sentences can no longer be classified into that category. In such cases, we either merge the class into similar categories or remove the minority category when there is not enough data. After data augmentation, we check each category for exactly identical sentences. If any are found, we keep only one of them.

3.   removing text data that have similar meaning within each majority class
      - There are multiple ways to calculate text similarity. This [blog post](https://medium.com/@adriensieg/text-similarities-da019229c894) introduces more than 10 of them.
      - Among these options, we use the __BERT embeddings + cosine similarity__ method.
        - BERT produces contextualized word embeddings: each word's embedding is generated from the context it appears in, rather than being a fixed representation of the word. This allows BERT to capture nuances of meaning that simpler word embedding models can miss.

        - To calculate similarity using BERT embeddings, you can compare two embeddings with cosine similarity. Cosine similarity is the cosine of the angle between two vectors and lies between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.
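
The filtering idea in step 3 can be sketched as follows. This is only a minimal illustration, not the notebook's final implementation: the function names, the greedy keep-first strategy, and the `threshold` value are assumptions, and toy 2-D vectors stand in for real BERT sentence embeddings.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

def drop_near_duplicates(sentences, embeddings, threshold=0.95):
    """Greedily keep a sentence only if it is not too similar to any kept one.

    `threshold` is a hypothetical cut-off and would need tuning on real data.
    """
    sims = cosine_similarity_matrix(embeddings)
    kept = []
    for i in range(len(sentences)):
        if all(sims[i, j] < threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in kept]

# Toy vectors stand in for BERT embeddings (assumption for illustration).
sentences = ["I lost my card", "I have lost my card", "Where is the nearest ATM?"]
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
filtered = drop_near_duplicates(sentences, toy_embeddings, threshold=0.95)
print(filtered)
```

With real data, the embeddings would come from a BERT model, and the filter would be applied separately within each majority class.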






In [None]:
"""
Below is the sample code ChatGPT returned when asked how to implement the 'BERT + cosine similarity' method above in Python.



import torch
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F

# Load the BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode two words to get their embeddings
word1 = "dog"
word2 = "cat"
encoded_input1 = tokenizer(word1, return_tensors='pt')
encoded_input2 = tokenizer(word2, return_tensors='pt')
with torch.no_grad():
    embeddings1 = model(**encoded_input1).pooler_output
    embeddings2 = model(**encoded_input2).pooler_output

# Calculate the cosine similarity between the embeddings
similarity = F.cosine_similarity(embeddings1, embeddings2)
print(f"The similarity between '{word1}' and '{word2}' is {similarity.item():.2f}")
"""


# Import Data

In [None]:
import pandas as pd

url_training = 'https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv'
url_testing = 'https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv'
df_training = pd.read_csv(url_training)
df_testing = pd.read_csv(url_testing)

# In order to preserve the original training data, we create a copy of the dataset. 
df_training_copy = df_training.copy()
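
Before merging or augmenting, it helps to list which intents count as minority classes. The sketch below shows one way to do that with `value_counts`. The helper name `minority_classes` and the `min_rows` cut-off are assumptions, and a toy frame replaces the real training data.

```python
import pandas as pd

def minority_classes(df, label_col="category", min_rows=100):
    """Return the row count of every class with fewer than `min_rows` rows.

    `min_rows` is a hypothetical threshold, not a value fixed by this notebook.
    """
    counts = df[label_col].value_counts()
    return counts[counts < min_rows].sort_values()

# Toy data for illustration only.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(9)],
    "category": ["a"] * 3 + ["b"] * 1 + ["c"] * 5,
})
small = minority_classes(toy, min_rows=3)
print(small)
```

On the real data, the call would be `minority_classes(df_training_copy)`, with a threshold chosen from the class distribution seen in the EDA.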

# Data Augmentation

In [None]:
# Paraphrasing and translating text data

! pip install -q sentence-splitter
! pip install -q transformers
! pip install -q SentencePiece
! pip install -q mtranslate

from mtranslate import translate
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from sentence_splitter import SentenceSplitter, split_text_into_sentences



In [None]:
# The model page on Hugging Face provides the following function, which paraphrases a sentence.
# https://huggingface.co/tuner007/pegasus_paraphrase

model_name = 'tuner007/pegasus_paraphrase'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def get_response(input_text,num_return_sequences):
  # Call the tokenizer directly; `prepare_seq2seq_batch` is deprecated in recent Transformers releases.
  batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)
  translated = model.generate(**batch,max_length=60,num_beams=10, num_return_sequences=num_return_sequences, temperature=1.5)
  tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
  return tgt_text

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/86.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

In [None]:
# get_response takes a single sentence, so we add a helper that accepts a whole paragraph.

def get_response_paragraph(input_text):
  # Split the paragraph into sentences
  splitter = SentenceSplitter(language='en')
  sentence_list = splitter.split(input_text)

  # Paraphrase each sentence separately (get_response returns a list of strings)
  paraphrased_sentences = [' '.join(get_response(sentence, 1)) for sentence in sentence_list]

  # Join the paraphrased sentences back into one paragraph.
  # A plain join avoids the quoting problems of str(list).strip(...) when a sentence contains an apostrophe.
  paraphrased_text = ' '.join(paraphrased_sentences)

  return paraphrased_text

In [None]:
text = "How do I know if I will get my card, or if it is lost?"
print(get_response(text, 1))

['How do I know if I get my card or not?']


### 1. ***contactless_not_working*** (row count: 35) --- DONE

> The keyword of this category is 'contactless'. If a sentence doesn't include this keyword, classifying it into this category can be ambiguous. So, we look at the rows in this category that don't include the keyword.



In [None]:
# Keyword lookup

word_to_exclude = 'contactless'
category_to_include = 'contactless_not_working'

mask = ~(df_training['text'].str.contains(word_to_exclude, case=False)) & (df_training['category'] == category_to_include)
result = df_training[mask]
result

Unnamed: 0,text,category
1782,Should i uninstall the app before i try it again?,contactless_not_working
1786,how many days processing new card?,contactless_not_working
1788,Would reinstalling the app solve the problem?,contactless_not_working
1791,The terminal I paid at wouldn't take my card. ...,contactless_not_working
1801,how to get new card?,contactless_not_working
1804,Any charges applicable for new card?,contactless_not_working
1807,How can i get new card?,contactless_not_working
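
This keyword lookup is repeated for several categories in this notebook, so it could be wrapped in a small helper. The sketch below is one possible version. `rows_missing_keyword` is a hypothetical name, and the toy frame only mimics the `text` and `category` columns of the training data.

```python
import pandas as pd

def rows_missing_keyword(df, keyword, category, text_col="text", label_col="category"):
    """Rows of `category` whose text does not contain `keyword` (case-insensitive)."""
    in_category = df[label_col] == category
    has_keyword = df[text_col].str.contains(keyword, case=False, regex=False, na=False)
    return df[in_category & ~has_keyword]

# Toy data for illustration only.
toy = pd.DataFrame({
    "text": [
        "My contactless payment failed",
        "how to get new card?",
        "Contactless is broken",
        "Where is the ATM?",
    ],
    "category": [
        "contactless_not_working",
        "contactless_not_working",
        "contactless_not_working",
        "atm_support",
    ],
})
missing = rows_missing_keyword(toy, "contactless", "contactless_not_working")
print(missing)
```

Passing `regex=False` treats the keyword as a literal string, which avoids surprises if a keyword ever contains regex metacharacters.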




> If we remove these rows, we end up with only 27 rows in this category, which is not enough even with data augmentation. The dataset contains two other categories with similar intents, 'card_not_working' and 'virtual_card_not_working', which have 112 and 41 rows respectively. So, we merge the three categories into 'card_not_working'. 




In [None]:
category_not_working = ['contactless_not_working', 'virtual_card_not_working']
mask = df_training_copy['category'].isin(category_not_working)
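# Use .loc to assign in place; chained indexing (df.category[mask] = ...) may not update the DataFrame
df_training_copy.loc[mask, 'category'] = 'card_not_working'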

In [None]:
df_training_copy[df_training_copy.category=='card_not_working'].count()

text        188
category    188
dtype: int64

--------------------------------------------------------------------------------

### 2. ***lost_or_stolen_card*** (row count: 82) --- DONE



> There are two similar intents related to loss and theft: 'lost_or_stolen_card' and 'lost_or_stolen_phone'. We inspect the data in each of them. If they share the same context and differ only in the keywords 'card' and 'phone', we merge them. 



In [None]:
mask = df_training_copy['category']=='lost_or_stolen_card'
df_training_copy[mask]

Unnamed: 0,text,category
1475,Has there been any activity on my card today?,lost_or_stolen_card
1476,I lost my wallet and all my cards were in it.,lost_or_stolen_card
1477,I'm panicking! I lost my card! Help!,lost_or_stolen_card
1478,I need to report a stolen card,lost_or_stolen_card
1479,How do I replace a stolen card?,lost_or_stolen_card
...,...,...
1552,Can you help me retrieve my card?,lost_or_stolen_card
1553,I lost my card. Can you help me?,lost_or_stolen_card
1554,What should I do? My card has been stolen!,lost_or_stolen_card
1555,My card was stolen last night.,lost_or_stolen_card


In [None]:
mask = df_training_copy['category']=='lost_or_stolen_phone'
df_training_copy[mask]

Unnamed: 0,text,category
5251,"My phone was stolen, what should I do first?",lost_or_stolen_phone
5252,"My phone was stolen, what should I do?",lost_or_stolen_phone
5253,"I lost my phone, what should I do?",lost_or_stolen_phone
5254,"Someone has stolen my phone, what should I do?",lost_or_stolen_phone
5255,I think I lost my phone. Is there a way to pr...,lost_or_stolen_phone
...,...,...
5367,Everything was stolen from me. I can't use th...,lost_or_stolen_phone
5368,My things were stolen and I need to know if I ...,lost_or_stolen_phone
5369,"lost my phone, what is account security?",lost_or_stolen_phone
5370,I lost my phone. What do I do to block someon...,lost_or_stolen_phone


> Based on the context of the data, we merge the two intents into a single, more general intent called 'lost_or_stolen'. The new intent has 203 rows.



In [None]:
category_lost_or_stolen = ['lost_or_stolen_card', 'lost_or_stolen_phone']
mask = df_training_copy['category'].isin(category_lost_or_stolen)

# Use .loc to assign in place; chained indexing may not update the DataFrame
df_training_copy.loc[mask, 'category'] = 'lost_or_stolen'
df_training_copy[df_training_copy.category=='lost_or_stolen'].count()

text        203
category    203
dtype: int64

--------------------------------------------------------------------------------

### 3. ***card_acceptance*** (row count: 59) -- DONE



> We take a look at this data, and decide if we should merge them into another intent, or augment this intent. 



In [None]:
mask = df_training_copy['category']=='card_acceptance'
df_training_copy[mask]

Unnamed: 0,text,category
3096,Is there anywhere I can't use my card?,card_acceptance
3097,In which stores can I shop with this card?,card_acceptance
3098,How do I know where I can use my card?,card_acceptance
3099,Can I tell what business will take this card?,card_acceptance
3100,Are there businesses that don't accept this card?,card_acceptance
3101,Is the card welcomed by everybody?,card_acceptance
3102,Do you have a list of businesses that accept t...,card_acceptance
3103,What businesses accept this card?,card_acceptance
3104,What retailers accept my card?,card_acceptance
3105,This card is accepted by what businesses?,card_acceptance


> Since no similar intents exist in this dataset, we augment this intent with the paraphrasing and back translation techniques mentioned above. 



In [None]:
# Paraphrase

# get_response returns a list, so take its single element to keep the column as plain strings
df_augment = df_training_copy[mask].text.apply(lambda x: get_response(x, 1)[0])
card_acceptance_augmented = df_augment.to_frame().assign(category="card_acceptance")
card_acceptance_augmented

KeyboardInterrupt: ignored

In [None]:
# Back translation: to avoid getting the input back unchanged, we chain the translation
# through two intermediate languages (en -> it -> ru -> en)

df_translate_1 = df_training_copy[mask].text.apply(lambda x: translate(x, "it"))
df_translate_2 = df_translate_1.apply(lambda x: translate(x, "ru"))
df_translate_3 = df_translate_2.apply(lambda x: translate(x, "en"))

card_acceptance_translated = df_translate_3.to_frame().assign(category="card_acceptance")
card_acceptance_translated

Unnamed: 0,text,category
3096,Somewhere I can't use my card?,card_acceptance
3097,In which stores can I shop with this card?,card_acceptance
3098,How do I know where I can use my card?,card_acceptance
3099,Can I tell which company will accept this card?,card_acceptance
3100,Are there companies that do not accept this card?,card_acceptance
3101,Is the map popular with everyone?,card_acceptance
3102,Do you have a list of businesses that accept t...,card_acceptance
3103,Which companies accept this card?,card_acceptance
3104,Which shops accept my card?,card_acceptance
3105,Which companies accept this card?,card_acceptance
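
The translated output above contains repeated sentences (rows 3103 and 3105) as well as sentences that came back identical to the original, so the exact-duplicate check described in the introduction applies here. A minimal sketch of that step follows. `combine_and_deduplicate` is a hypothetical helper name, and the frames hold only a few of the sentences shown above.

```python
import pandas as pd

def combine_and_deduplicate(original, augmented, text_col="text", label_col="category"):
    """Stack original and augmented rows and keep one copy of each exact sentence per category."""
    combined = pd.concat([original, augmented], ignore_index=True)
    return combined.drop_duplicates(subset=[text_col, label_col]).reset_index(drop=True)

original = pd.DataFrame({
    "text": [
        "In which stores can I shop with this card?",
        "What businesses accept this card?",
        "This card is accepted by what businesses?",
    ],
    "category": "card_acceptance",
})
augmented = pd.DataFrame({
    "text": [
        "In which stores can I shop with this card?",  # unchanged by back translation
        "Which companies accept this card?",
        "Which companies accept this card?",           # repeated translation
    ],
    "category": "card_acceptance",
})
deduplicated = combine_and_deduplicate(original, augmented)
print(deduplicated)
```

`drop_duplicates` keeps the first occurrence by default, so original sentences are preferred over identical augmented ones.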


--------------------------------------------------------------------------------

### 4. ***card_swallowed*** (row count: 61) & ***atm_support*** (row count: 87) --- DONE


> The 'card_swallowed' intent could be similar to 'atm_support'. In this case, the keyword is 'ATM', so we can potentially merge the two intents. 





In [None]:
# Keyword lookup

word_to_exclude = 'ATM'
category_to_include = 'card_swallowed'

mask = ~(df_training_copy['text'].str.contains(word_to_exclude, case=False)) & (df_training_copy['category'] == category_to_include)
result = df_training_copy[mask]
result

Unnamed: 0,text,category
6251,Your machine took my card. How do I get it back?,card_swallowed
6261,How do I get my card back quickly from a bank ...,card_swallowed
6264,I was getting cash and the card got stuck inside,card_swallowed
6275,I was removing a dollar amount from my account...,card_swallowed
6282,The bank machine didn't give my card back how ...,card_swallowed
6291,how can the money machine keep my card what do...,card_swallowed
6292,How do I get a new card as your machine swallo...,card_swallowed
6295,My card was eaten by the cash machine what do ...,card_swallowed
6298,Will you please help me get my card back?,card_swallowed
6303,What are the steps I should take to recover my...,card_swallowed


In [None]:
mask = df_training_copy['category']=='card_swallowed'
df_training_copy[mask]

Unnamed: 0,text,category
6245,What do I do if the ATM took my card?,card_swallowed
6246,What do I do now my credit card has been swall...,card_swallowed
6247,An ATM machine didn't give me back my card.,card_swallowed
6248,"My card got trapped inside an ATM, what should...",card_swallowed
6249,What do I do if I can't get my card out of the...,card_swallowed
...,...,...
6301,"My card is stuck in an ATM machine, how do I g...",card_swallowed
6302,atm ate my card,card_swallowed
6303,What are the steps I should take to recover my...,card_swallowed
6304,"My card is stuck inside the ATM, what am I sup...",card_swallowed


> Most of the texts contain the keyword 'ATM'. Even when they don't, you can infer that the card was most likely swallowed by an ATM. Hence, we merge this intent into 'atm_support' to reduce the number of intents. 

In [None]:
# Use .loc to assign in place; chained indexing may not update the DataFrame
df_training_copy.loc[mask, 'category'] = 'atm_support'
df_training_copy[df_training_copy.category=='atm_support'].count()

text        148
category    148
dtype: int64
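
The merges in sections 1, 2 and 4 could also be kept in a single mapping and applied in one pass, which keeps the merge decisions in one place. The sketch below assumes this layout. `MERGE_MAP` and `merge_intents` are hypothetical names, and the toy frame stands in for `df_training_copy`.

```python
import pandas as pd

# Old intent -> merged intent, following the decisions made in sections 1, 2 and 4.
MERGE_MAP = {
    "contactless_not_working": "card_not_working",
    "virtual_card_not_working": "card_not_working",
    "lost_or_stolen_card": "lost_or_stolen",
    "lost_or_stolen_phone": "lost_or_stolen",
    "card_swallowed": "atm_support",
}

def merge_intents(df, merge_map, label_col="category"):
    """Return a copy of `df` with the labels in `merge_map` replaced by their merged intent."""
    merged = df.copy()
    merged[label_col] = merged[label_col].replace(merge_map)
    return merged

# Toy data for illustration only.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(7)],
    "category": [
        "contactless_not_working", "virtual_card_not_working", "card_not_working",
        "lost_or_stolen_phone", "card_swallowed", "atm_support", "card_acceptance",
    ],
})
merged = merge_intents(toy, MERGE_MAP)
counts = merged["category"].value_counts()
print(counts)
```

Labels that are not in the mapping, such as 'card_acceptance', are left unchanged.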

--------------------------------------------------------------------------------

### 5. ***compromised_card*** (row count: 86) -- IN PROGRESS

In [None]:
mask = df_training_copy['category'] == 'compromised_card'
df_training_copy[mask]

Unnamed: 0,text,category
4459,I think someone is using my card without my pe...,compromised_card
4460,What do I do if I detect fraudulent use on my ...,compromised_card
4461,I think my account has been hacked there are c...,compromised_card
4462,I don't recognize some transactions. Is someo...,compromised_card
4463,I just got an email in confirming my purchase ...,compromised_card
...,...,...
4540,The card has suffered a security breach.,compromised_card
4541,My card data has been exposed.,compromised_card
4542,There are transactions that I did not make on ...,compromised_card
4543,I think someone may be using my card.,compromised_card




> NEED TO DO DATA AUGMENTATION (PARAPHRASING + BACK TRANSLATION)
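
One way the pending augmentation step might be organised is sketched below. It is only an assumption about the eventual design. `augment_category` is a hypothetical helper that takes any list of text-to-text functions, and the two stub functions merely imitate a paraphraser and a back translator so that the example runs without a model or network access.

```python
import pandas as pd

def augment_category(df, category, transforms, text_col="text", label_col="category"):
    """Apply each text -> text transform to one category and return only new, unique rows."""
    source = df.loc[df[label_col] == category, text_col]
    frames = [
        pd.DataFrame({text_col: source.map(fn), label_col: category})
        for fn in transforms
    ]
    augmented = pd.concat(frames, ignore_index=True)
    # Drop exact repeats, then drop sentences that came back unchanged.
    augmented = augmented.drop_duplicates(subset=[text_col])
    augmented = augmented[~augmented[text_col].isin(source)]
    return augmented.reset_index(drop=True)

# Stubs for illustration only; they are not real paraphrasing or translation.
def fake_paraphrase(sentence):
    return sentence.replace("I think", "I suspect")

def fake_back_translate(sentence):
    return sentence  # imitates a back translation that returns the input unchanged

toy = pd.DataFrame({
    "text": [
        "I think someone may be using my card.",
        "My card data has been exposed.",
        "Where is the nearest ATM?",
    ],
    "category": ["compromised_card", "compromised_card", "atm_support"],
})
new_rows = augment_category(toy, "compromised_card", [fake_paraphrase, fake_back_translate])
print(new_rows)
```

In this notebook, the stubs would presumably be replaced by a wrapper around `get_response` and by the chained `translate` calls used for 'card_acceptance'.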



--------------------------------------------------------------------------------

### 6. ***receiving_money*** (row count: 95) -- IN PROGRESS

--------------------------------------------------------------------------------

### 7. ***get_disposable_virtual_card*** (row count: 97) -- IN PROGRESS

> There are 4 different intents that are part of 'getting card':

*   get_disposable_virtual_card (row count: 97)
*   getting_virtual_card (row count: 98)
*   get_physical_card (row count: 106)
*   getting_spare_card (row count: 129)

> Check keywords such as 'disposable', 'virtual', 'physical', 'spare', etc.
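
The keyword check could be run for all four intents at once by computing, for each intent, the share of rows that mention each keyword. The sketch below is an illustrative assumption. `keyword_coverage` is a hypothetical helper and the sentences are toy data.

```python
import pandas as pd

def keyword_coverage(df, categories, keywords, text_col="text", label_col="category"):
    """Share of rows in each category that mention each keyword (case-insensitive)."""
    subset = df[df[label_col].isin(categories)]
    flags = pd.DataFrame({
        kw: subset[text_col].str.contains(kw, case=False, regex=False, na=False)
        for kw in keywords
    })
    # The mean of a boolean column is the fraction of rows that contain the keyword.
    return flags.groupby(subset[label_col]).mean()

# Toy data for illustration only.
toy = pd.DataFrame({
    "text": [
        "I want a disposable virtual card",
        "can i get a virtual card",
        "How do I order a virtual card?",
        "I need a spare card",
    ],
    "category": [
        "get_disposable_virtual_card",
        "get_disposable_virtual_card",
        "getting_virtual_card",
        "getting_spare_card",
    ],
})
categories = ["get_disposable_virtual_card", "getting_virtual_card",
              "get_physical_card", "getting_spare_card"]
coverage = keyword_coverage(toy, categories, ["disposable", "virtual", "physical", "spare"])
print(coverage)
```

A low share for an intent's own keyword, or a high share for another intent's keyword, would suggest that the intents overlap and may be candidates for merging.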

In [None]:
mask = df_training_copy['category'] == 'get_disposable_virtual_card'
df_training_copy[mask]

Unnamed: 0,text,category
8371,What do you use disposable cards on?,get_disposable_virtual_card
8372,How do I get a disposable virtual card as well?,get_disposable_virtual_card
8373,"I want a disposable virtual card, how do I do ...",get_disposable_virtual_card
8374,What are the disposable cards for?,get_disposable_virtual_card
8375,I need a disposable virtual card. Please tell ...,get_disposable_virtual_card
...,...,...
8463,"I need to deposit my virtual card, how do i do...",get_disposable_virtual_card
8464,What are the benefits of a disposable card?,get_disposable_virtual_card
8465,What is the disposable card for?,get_disposable_virtual_card
8466,I have heard about these disposable virtual ca...,get_disposable_virtual_card


In [None]:
# Keyword lookup

word_to_exclude = 'disposable'
category_to_include = 'get_disposable_virtual_card'

mask = ~(df_training_copy['text'].str.contains(word_to_exclude, case=False)) & (df_training_copy['category'] == category_to_include)
result = df_training_copy[mask]
result

Unnamed: 0,text,category
8376,What systems do you have in place for my secur...,get_disposable_virtual_card
8381,can i get a virtual card online,get_disposable_virtual_card
8412,how do the cards work?,get_disposable_virtual_card
8421,What is the procedure for depositing a virtual...,get_disposable_virtual_card
8422,can i get a virtual card,get_disposable_virtual_card
8444,How can I get a virtual card for a one-time tr...,get_disposable_virtual_card
8462,I need a single use card for shopping online,get_disposable_virtual_card
8463,"I need to deposit my virtual card, how do i do...",get_disposable_virtual_card
