<a href="https://colab.research.google.com/github/sjung-stat/Customer-Support-Chat-Intent-Classification/blob/main/Data%20Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the 'Exploratory Data Analysis' part, we found that the dataset is imbalanced. We address the imbalance in the text data by


1. merging similar intents
      - There are too many categories relative to the size of the dataset, so we merge categories that have similar intents. 

2.   augmenting text data that belongs to minority classes
      - There are several methods to augment text data, including: 
          - back translation (e.g. English -> French -> English)
          - __paraphrasing__
      - We use paraphrasing because, in most cases we looked at, back translation returns a result that is almost identical (even word for word) to the original. Paraphrasing, on the other hand, returns a sentence that is similar to the original but not the same. Sometimes, however, paraphrasing changes an important keyword, so that the sentence no longer seems to belong to its category. For example, when we apply the technique to the 'contactless_not_working' category, the keyword 'contactless' is replaced by other words, so the sentences can no longer be classified into that category. In such cases, we either merge the class into a similar category or remove the minority category when there is not enough data. After data augmentation, we check each category for exactly identical sentences. If identical sentences are found, we keep only one of them. 

3.   removing text data with similar meaning within each majority class. 
      - There are multiple ways to calculate text similarity. This [blog post](https://medium.com/@adriensieg/text-similarities-da019229c894) introduces more than ten of them.
      - Among these options, we use the __BERT embeddings + Cosine Similarity__ method. 
        - BERT produces contextualized word embeddings, meaning that each word's embedding is generated based on the context it appears in, rather than being a fixed representation of the word. This allows BERT to capture nuances of meaning that simpler word embedding models can miss.

        - To calculate the similarity of two texts (words or sentences) using BERT embeddings, you can compare their embeddings with cosine similarity. Cosine similarity is the cosine of the angle between two vectors and takes a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.
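
The filtering idea in step 3 can be sketched as follows. This is a minimal illustration and not the notebook's actual implementation: the toy 3-dimensional vectors stand in for BERT embeddings, and the function names and the 0.95 threshold are assumptions chosen for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def drop_near_duplicates(embeddings, threshold=0.95):
    """Greedy filter: keep an item only if its similarity to every
    previously kept item is below `threshold`. Returns kept indices."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy vectors standing in for sentence embeddings (assumption for illustration).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],    # orthogonal to the first vector
]
kept_indices = drop_near_duplicates(toy_embeddings, threshold=0.95)
print(kept_indices)
```

The greedy scan depends on row order: the first sentence of a near-duplicate group is the one that survives. The threshold would need to be tuned per dataset.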






In [1]:
"""
Below is the sample code ChatGPT returned when asked how to implement the 'BERT + cosine similarity' method above in Python.



import torch
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F

# Load the BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode two words to get their embeddings
word1 = "dog"
word2 = "cat"
encoded_input1 = tokenizer(word1, return_tensors='pt')
encoded_input2 = tokenizer(word2, return_tensors='pt')
with torch.no_grad():
    embeddings1 = model(**encoded_input1).pooler_output
    embeddings2 = model(**encoded_input2).pooler_output

# Calculate the cosine similarity between the embeddings
similarity = F.cosine_similarity(embeddings1, embeddings2)
print(f"The similarity between '{word1}' and '{word2}' is {similarity.item():.2f}")
"""

# Import Data

In [2]:
import pandas as pd
import numpy as np

url_training = 'https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv'
url_testing = 'https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv'
df_training = pd.read_csv(url_training)
df_testing = pd.read_csv(url_testing)

# In order to preserve the original training data, we create a copy of the dataset. 
df_training_copy = df_training.copy()
df_testing_copy = df_testing.copy()

# Data Augmentation

In [3]:
# Paraphrasing and translating text data 

! pip install -q sentence-splitter
! pip install -q transformers
! pip install -q SentencePiece
! pip install -q mtranslate

from mtranslate import translate
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from sentence_splitter import SentenceSplitter, split_text_into_sentences

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.0/45.0 KB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m32.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m18.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for mtranslate (setup.py) ... [?25l[?25hdone


In [4]:
# The Hugging Face model card for this model provides the following function, which paraphrases a sentence. 
# https://huggingface.co/tuner007/pegasus_paraphrase

model_name = 'tuner007/pegasus_paraphrase'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def get_response(input_text, num_return_sequences):
  # Call the tokenizer directly; `prepare_seq2seq_batch` is deprecated.
  batch = tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)
  # Beam search returns the `num_return_sequences` best paraphrases.
  translated = model.generate(**batch, max_length=60, num_beams=10, num_return_sequences=num_return_sequences, temperature=1.5)
  tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
  return tgt_text


In [5]:
# get_response takes a single sentence, so we add a function that takes a paragraph as input

def get_response_paragraph(input_text):
  splitter = SentenceSplitter(language='en')
  sentence_list = splitter.split(input_text)
  # Paraphrase each sentence separately (one paraphrase each), then rejoin them with spaces.
  paraphrased_sentences = [get_response(sentence, 1)[0] for sentence in sentence_list]
  return ' '.join(paraphrased_sentences)

In [6]:
text = "How do I know if I will get my card, or if it is lost?"
print(get_response(text, 1))

['How do I know if I get my card or not?']
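
Because a paraphrase can come back unchanged or lose the keyword that defines an intent, a simple guard can decide whether to keep it. The sketch below is an illustration under assumptions: `keep_paraphrase` is a hypothetical helper, not part of this notebook, and the example sentences are made up.

```python
def keep_paraphrase(original, candidate, keyword=None):
    """Decide whether a paraphrase is usable as an augmented example."""
    normalized_original = " ".join(original.lower().split())
    normalized_candidate = " ".join(candidate.lower().split())
    if normalized_candidate == normalized_original:
        return False  # identical to the source, adds nothing new
    if keyword is not None and keyword.lower() not in normalized_candidate:
        return False  # the intent-defining keyword was paraphrased away
    return True

# Made-up sentences for illustration only.
original = "My contactless payment is not working"
candidates = [
    "My contactless payment is not working",      # identical to the original
    "My tap-to-pay is not working",               # keyword lost
    "Contactless payments have stopped working",  # usable
]
usable = [c for c in candidates if keep_paraphrase(original, c, keyword="contactless")]
print(usable)
```

The keyword check is a plain substring test, so it only suits intents that are tied to a single distinctive word.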


### 1. ***contactless_not_working*** (row count: 35) --- DONE

> The keyword of this category is 'contactless'. If a sentence doesn't include this keyword, classifying it into this category could be ambiguous. So, we look at the rows in this category that don't include the keyword.



In [7]:
# Keyword lookup

word_to_exclude = 'contactless'
category_to_include = 'contactless_not_working'

mask = ~(df_training['text'].str.contains(word_to_exclude, case=False)) & (df_training['category'] == category_to_include)
result = df_training[mask]
result


Unnamed: 0,text,category
1782,Should i uninstall the app before i try it again?,contactless_not_working
1786,how many days processing new card?,contactless_not_working
1788,Would reinstalling the app solve the problem?,contactless_not_working
1791,The terminal I paid at wouldn't take my card. ...,contactless_not_working
1801,how to get new card?,contactless_not_working
1804,Any charges applicable for new card?,contactless_not_working
1807,How can i get new card?,contactless_not_working
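
The same keyword lookup is repeated for several intents in this notebook, so it could be wrapped in a small helper. The sketch below is one possible version: `rows_missing_keyword` and its `whole_word` option are assumptions added for illustration, and the toy DataFrame only mimics the `text` and `category` columns.

```python
import re
import pandas as pd

def rows_missing_keyword(df, category, keyword, whole_word=False):
    """Rows of `category` whose text does not mention `keyword` (case-insensitive)."""
    pattern = re.escape(keyword)
    if whole_word:
        # Match the keyword only as a separate word, e.g. 'pin' but not 'shopping'.
        pattern = rf"\b{pattern}\b"
    in_category = df["category"] == category
    has_keyword = df["text"].str.contains(pattern, case=False, regex=True)
    return df[in_category & ~has_keyword]

# Toy data for illustration only.
toy = pd.DataFrame({
    "text": [
        "My contactless card fails",
        "how to get new card?",
        "Contactless stopped working",
    ],
    "category": ["contactless_not_working"] * 3,
})
missing = rows_missing_keyword(toy, "contactless_not_working", "contactless")
print(missing)
```

The `whole_word` option matters for short keywords: a plain substring search for 'pin' also matches words such as 'shopping'.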




> If we removed these rows, only 27 rows would remain in this category, which is not enough even with data augmentation. The dataset has two other categories with similar intents, 'card_not_working' and 'virtual_card_not_working', which have 112 and 41 rows respectively. So, we merge the three categories into 'card_not_working'. 




In [8]:
category_not_working = ['contactless_not_working', 'virtual_card_not_working']
mask = df_training_copy['category'].isin(category_not_working)
# Assign through .loc to avoid chained assignment, which may not modify the DataFrame.
df_training_copy.loc[mask, 'category'] = 'card_not_working'



#-------------------------testing---------------------------
mask = df_testing_copy['category'].isin(category_not_working)
df_testing_copy.loc[mask, 'category'] = 'card_not_working'

In [9]:
df_training_copy[df_training_copy.category=='card_not_working'].count()

text        188
category    188
dtype: int64
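
Since the same relabeling has to be applied to both the training and the testing data, one option is to keep the merge decisions in a single mapping and apply it with a helper. This is a sketch under assumptions: `MERGE_MAP` and `merge_intents` are hypothetical names, and the toy DataFrame stands in for the real data.

```python
import pandas as pd

# Original intent -> merged intent, following the decision described above.
MERGE_MAP = {
    "contactless_not_working": "card_not_working",
    "virtual_card_not_working": "card_not_working",
}

def merge_intents(df, merge_map):
    """Return a copy of `df` with categories renamed according to `merge_map`."""
    merged = df.copy()
    merged["category"] = merged["category"].replace(merge_map)
    return merged

# Toy data for illustration only.
toy_train = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "category": [
        "card_not_working",
        "contactless_not_working",
        "virtual_card_not_working",
        "atm_support",
    ],
})
toy_merged = merge_intents(toy_train, MERGE_MAP)
print(toy_merged["category"].value_counts())
```

Calling the helper once per DataFrame with the same mapping keeps the label sets of the two splits consistent.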

--------------------------------------------------------------------------------

### 2. ***lost_or_stolen_card*** (row count: 82) --- DONE



> There are two similar intents related to lost or stolen items: 'lost_or_stolen_card' and 'lost_or_stolen_phone'. We inspect the data in each of them. If they share the same context and differ only in the keywords 'card' and 'phone', we merge them. 



In [10]:
mask = df_training_copy['category']=='lost_or_stolen_card'
df_training_copy[mask]

Unnamed: 0,text,category
1475,Has there been any activity on my card today?,lost_or_stolen_card
1476,I lost my wallet and all my cards were in it.,lost_or_stolen_card
1477,I'm panicking! I lost my card! Help!,lost_or_stolen_card
1478,I need to report a stolen card,lost_or_stolen_card
1479,How do I replace a stolen card?,lost_or_stolen_card
...,...,...
1552,Can you help me retrieve my card?,lost_or_stolen_card
1553,I lost my card. Can you help me?,lost_or_stolen_card
1554,What should I do? My card has been stolen!,lost_or_stolen_card
1555,My card was stolen last night.,lost_or_stolen_card


In [11]:
mask = df_training_copy['category']=='lost_or_stolen_phone'
df_training_copy[mask]

Unnamed: 0,text,category
5251,"My phone was stolen, what should I do first?",lost_or_stolen_phone
5252,"My phone was stolen, what should I do?",lost_or_stolen_phone
5253,"I lost my phone, what should I do?",lost_or_stolen_phone
5254,"Someone has stolen my phone, what should I do?",lost_or_stolen_phone
5255,I think I lost my phone. Is there a way to pr...,lost_or_stolen_phone
...,...,...
5367,Everything was stolen from me. I can't use th...,lost_or_stolen_phone
5368,My things were stolen and I need to know if I ...,lost_or_stolen_phone
5369,"lost my phone, what is account security?",lost_or_stolen_phone
5370,I lost my phone. What do I do to block someon...,lost_or_stolen_phone


> Based on the context of the data, we can merge the two intents into a single, more general intent called 'lost_or_stolen'. The new intent has 203 rows.



In [12]:
category_lost_or_stolen = ['lost_or_stolen_card', 'lost_or_stolen_phone']
mask = df_training_copy['category'].isin(category_lost_or_stolen)

# Assign through .loc to avoid chained assignment, which may not modify the DataFrame.
df_training_copy.loc[mask, 'category'] = 'lost_or_stolen'
df_training_copy[df_training_copy.category=='lost_or_stolen'].count()

#-------------------------testing---------------------------
mask = df_testing_copy['category'].isin(category_lost_or_stolen)
df_testing_copy.loc[mask, 'category'] = 'lost_or_stolen'




--------------------------------------------------------------------------------

### 3. ***card_acceptance*** (row count: 59) -- DONE



> We look at the data in this intent and decide whether to merge it into another intent or to augment it. 



In [13]:
mask = df_training_copy['category']=='card_acceptance'
df_training_copy[mask]

Unnamed: 0,text,category
3096,Is there anywhere I can't use my card?,card_acceptance
3097,In which stores can I shop with this card?,card_acceptance
3098,How do I know where I can use my card?,card_acceptance
3099,Can I tell what business will take this card?,card_acceptance
3100,Are there businesses that don't accept this card?,card_acceptance
3101,Is the card welcomed by everybody?,card_acceptance
3102,Do you have a list of businesses that accept t...,card_acceptance
3103,What businesses accept this card?,card_acceptance
3104,What retailers accept my card?,card_acceptance
3105,This card is accepted by what businesses?,card_acceptance


> Since the dataset has no similar intents, we augment this intent with the paraphrasing and back translation techniques mentioned above. 



In [14]:
# Paraphrase

# get_response returns a list with one paraphrase, so we take its first element.
df_training_augment = df_training_copy[mask].text.apply(lambda x: get_response(x, 1)[0])
card_acceptance_augmented_training = df_training_augment.to_frame().assign(category="card_acceptance")
card_acceptance_augmented_training


#-------------------------testing---------------------------
# The test set needs its own mask; the training mask is aligned to the training index.
mask_testing = df_testing_copy['category']=='card_acceptance'
df_augment_testing = df_testing_copy[mask_testing].text.apply(lambda x: get_response(x, 1)[0])
card_acceptance_augmented_testing = df_augment_testing.to_frame().assign(category="card_acceptance")
card_acceptance_augmented_testing


In [15]:
# Back translation (to avoid getting back the same output as the input, we chain several translations: en -> it -> ru -> en)

df_translate_1_training = df_training_copy[mask].text.apply(lambda x: translate(x, "it"))
df_translate_2_training = df_translate_1_training.apply(lambda x: translate(x, "ru"))
df_translate_3_training = df_translate_2_training.apply(lambda x: translate(x, "en"))

card_acceptance_translated_training = df_translate_3_training.to_frame().assign(category="card_acceptance")
card_acceptance_translated_training




#-------------------------testing---------------------------
# The test set needs its own mask; the training mask is aligned to the training index.
mask_testing = df_testing_copy['category']=='card_acceptance'
df_translate_1_testing = df_testing_copy[mask_testing].text.apply(lambda x: translate(x, "it"))
df_translate_2_testing = df_translate_1_testing.apply(lambda x: translate(x, "ru"))
df_translate_3_testing = df_translate_2_testing.apply(lambda x: translate(x, "en"))

card_acceptance_translated_testing = df_translate_3_testing.to_frame().assign(category="card_acceptance")
card_acceptance_translated_testing
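
After augmentation, the original and augmented rows still have to be combined, and exact duplicates within a category dropped. A minimal sketch of that step is shown below. The helper name `combine_and_deduplicate` and the toy rows are assumptions for illustration; the real inputs would be DataFrames with `text` and `category` columns like the ones built above.

```python
import pandas as pd

def combine_and_deduplicate(original, *augmented):
    """Stack original and augmented rows, keeping one copy of each (text, category) pair."""
    combined = pd.concat([original, *augmented], ignore_index=True)
    # Strip surrounding whitespace so trivially different strings compare equal.
    combined["text"] = combined["text"].str.strip()
    return combined.drop_duplicates(subset=["text", "category"]).reset_index(drop=True)

# Toy rows for illustration only.
toy_original = pd.DataFrame(
    {"text": ["What retailers accept my card?"], "category": ["card_acceptance"]}
)
toy_paraphrased = pd.DataFrame(
    {"text": ["Which retailers accept my card?"], "category": ["card_acceptance"]}
)
toy_translated = pd.DataFrame(
    {"text": ["What retailers accept my card? "], "category": ["card_acceptance"]}
)
result = combine_and_deduplicate(toy_original, toy_paraphrased, toy_translated)
print(len(result))
```

Here the back-translated row is identical to the original once whitespace is stripped, so only two distinct rows remain.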


--------------------------------------------------------------------------------

### 4. ***card_swallowed*** (row count: 61) & ***atm_support*** (row count: 87) --- DONE


> The 'card_swallowed' intent could be similar to 'atm_support'. In this case, the keyword is 'ATM'. If most 'card_swallowed' sentences refer to ATMs, we can merge the two intents. 





In [16]:
# Keyword lookup

word_to_exclude = 'ATM'
category_to_include = 'card_swallowed'

mask = ~(df_training_copy['text'].str.contains(word_to_exclude, case=False)) & (df_training_copy['category'] == category_to_include)
result = df_training_copy[mask]
result

Unnamed: 0,text,category
6251,Your machine took my card. How do I get it back?,card_swallowed
6261,How do I get my card back quickly from a bank ...,card_swallowed
6264,I was getting cash and the card got stuck inside,card_swallowed
6275,I was removing a dollar amount from my account...,card_swallowed
6282,The bank machine didn't give my card back how ...,card_swallowed
6291,how can the money machine keep my card what do...,card_swallowed
6292,How do I get a new card as your machine swallo...,card_swallowed
6295,My card was eaten by the cash machine what do ...,card_swallowed
6298,Will you please help me get my card back?,card_swallowed
6303,What are the steps I should take to recover my...,card_swallowed


In [17]:
mask = df_training_copy['category']=='card_swallowed'
df_training_copy[mask]

Unnamed: 0,text,category
6245,What do I do if the ATM took my card?,card_swallowed
6246,What do I do now my credit card has been swall...,card_swallowed
6247,An ATM machine didn't give me back my card.,card_swallowed
6248,"My card got trapped inside an ATM, what should...",card_swallowed
6249,What do I do if I can't get my card out of the...,card_swallowed
...,...,...
6301,"My card is stuck in an ATM machine, how do I g...",card_swallowed
6302,atm ate my card,card_swallowed
6303,What are the steps I should take to recover my...,card_swallowed
6304,"My card is stuck inside the ATM, what am I sup...",card_swallowed


> Most of the sentences contain the keyword 'ATM'. Even when they don't, you can infer that the card was most likely swallowed by an ATM. Hence, we merge this intent into 'atm_support' to reduce the number of intents. 

In [18]:
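# Assign through .loc to avoid chained assignment, which may not modify the DataFrame.
df_training_copy.loc[mask, 'category'] = 'atm_support'
df_training_copy[df_training_copy.category=='atm_support'].count()



#-------------------------testing---------------------------
# The test set needs its own mask; the training mask is aligned to the training index.
mask_testing = df_testing_copy['category']=='card_swallowed'
df_testing_copy.loc[mask_testing, 'category'] = 'atm_support'
df_testing_copy[df_testing_copy.category=='atm_support'].count()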

--------------------------------------------------------------------------------

### 5. ***compromised_card*** (row count: 86) -- IN PROGRESS

In [19]:
mask = df_training_copy['category'] == 'compromised_card'
df_training_copy[mask]

Unnamed: 0,text,category
4459,I think someone is using my card without my pe...,compromised_card
4460,What do I do if I detect fraudulent use on my ...,compromised_card
4461,I think my account has been hacked there are c...,compromised_card
4462,I don't recognize some transactions. Is someo...,compromised_card
4463,I just got an email in confirming my purchase ...,compromised_card
...,...,...
4540,The card has suffered a security breach.,compromised_card
4541,My card data has been exposed.,compromised_card
4542,There are transactions that I did not make on ...,compromised_card
4543,I think someone may be using my card.,compromised_card




> TODO: augment this intent (paraphrasing + back translation).



--------------------------------------------------------------------------------

### 6. ***receiving_money*** (row count: 95) -- IN PROGRESS

--------------------------------------------------------------------------------

### 7. ***get_disposable_virtual_card*** (row count: 97) -- IN PROGRESS

> There are 4 different intents that are part of 'getting card':

*   get_disposable_virtual_card (row count: 97)
*   getting_virtual_card (row count: 98)
*   get_physical_card (row count: 106)
*   getting_spare_card (row count: 129)

> Check keywords such as 'disposable', 'virtual', 'physical', 'spare', etc.
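
One way to run this keyword check across all four intents at once is to compute, for each intent, the share of rows that mention its expected keyword. The sketch below is an assumption-laden illustration: `keyword_coverage` is a hypothetical helper, the intent-to-keyword mapping is our reading of the note above, and the toy rows are made up.

```python
import pandas as pd

# Expected keyword per intent (assumed mapping based on the intent names).
EXPECTED_KEYWORDS = {
    "get_disposable_virtual_card": "disposable",
    "getting_virtual_card": "virtual",
    "get_physical_card": "physical",
    "getting_spare_card": "spare",
}

def keyword_coverage(df, expected_keywords):
    """For each intent, report the row count and the share of rows containing its keyword."""
    rows = []
    for category, keyword in expected_keywords.items():
        texts = df.loc[df["category"] == category, "text"]
        if len(texts) == 0:
            continue  # intent not present in this DataFrame
        share = texts.str.contains(keyword, case=False, regex=False).mean()
        rows.append({
            "category": category,
            "keyword": keyword,
            "n_rows": len(texts),
            "share_with_keyword": float(share),
        })
    return pd.DataFrame(rows)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "text": [
        "What is the disposable card for?",
        "can i get a virtual card",
        "How do I order a virtual card?",
    ],
    "category": [
        "get_disposable_virtual_card",
        "get_disposable_virtual_card",
        "getting_virtual_card",
    ],
})
coverage = keyword_coverage(toy, EXPECTED_KEYWORDS)
print(coverage)
```

A low share suggests that the intent name does not describe its sentences well, which is what the lookups below examine one intent at a time.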

In [20]:
mask = df_training_copy['category'] == 'get_disposable_virtual_card'
df_training_copy[mask]

Unnamed: 0,text,category
8371,What do you use disposable cards on?,get_disposable_virtual_card
8372,How do I get a disposable virtual card as well?,get_disposable_virtual_card
8373,"I want a disposable virtual card, how do I do ...",get_disposable_virtual_card
8374,What are the disposable cards for?,get_disposable_virtual_card
8375,I need a disposable virtual card. Please tell ...,get_disposable_virtual_card
...,...,...
8463,"I need to deposit my virtual card, how do i do...",get_disposable_virtual_card
8464,What are the benefits of a disposable card?,get_disposable_virtual_card
8465,What is the disposable card for?,get_disposable_virtual_card
8466,I have heard about these disposable virtual ca...,get_disposable_virtual_card


In [21]:
# Keyword lookup for "get_disposable_virtual_card"

word_to_exclude = 'disposable'
category_to_include = 'get_disposable_virtual_card'

mask = ~(df_training_copy['text'].str.contains(word_to_exclude, case=False)) & (df_training_copy['category'] == category_to_include)
result = df_training_copy[mask]
result

Unnamed: 0,text,category
8376,What systems do you have in place for my secur...,get_disposable_virtual_card
8381,can i get a virtual card online,get_disposable_virtual_card
8412,how do the cards work?,get_disposable_virtual_card
8421,What is the procedure for depositing a virtual...,get_disposable_virtual_card
8422,can i get a virtual card,get_disposable_virtual_card
8444,How can I get a virtual card for a one-time tr...,get_disposable_virtual_card
8462,I need a single use card for shopping online,get_disposable_virtual_card
8463,"I need to deposit my virtual card, how do i do...",get_disposable_virtual_card


> The table above lists the sentences that do not contain the keyword 'disposable'. Some of them have nothing to do with this category. For instance, it is impossible to infer from "how do the cards work?" that the sentence is asking about getting a disposable virtual card. We drop such vague sentences. 



In [22]:
# Keyword lookup for "getting_virtual_card"

word_to_exclude = 'virtual'
category_to_include = 'getting_virtual_card'

mask = ~(df_training_copy['text'].str.contains(word_to_exclude, case=False)) & (df_training_copy['category'] == category_to_include)
result = df_training_copy[mask]
result

Unnamed: 0,text,category
3092,Is there an alternative to a physical card?,getting_virtual_card


In [23]:
# Keyword lookup for "get_physical_card"

mask = df_training_copy['category'] == 'get_physical_card'
df_training_copy[mask]

Unnamed: 0,text,category
3994,"I'm not sure what to do about the PIN, because...",get_physical_card
3995,Is my PIN sent separably?,get_physical_card
3996,Where can I get my card PIN?,get_physical_card
3997,my pin hasn't arrived in the post! How do I ca...,get_physical_card
3998,Where is the card PIN located?,get_physical_card
...,...,...
4095,how to get card pin?,get_physical_card
4096,Where do I find my PIN for my new card?,get_physical_card
4097,Can you deliver the PIN separately?,get_physical_card
4098,is my pin the same as my passcode,get_physical_card


> Even though the intent is named "get_physical_card", most of the sentences are about PINs. We check which sentences don't include the word 'PIN'.



In [24]:
word_to_exclude = 'pin'
category_to_include = 'get_physical_card'

mask = ~(df_training_copy['text'].str.contains(word_to_exclude, case=False)) & (df_training_copy['category'] == category_to_include)
result = df_training_copy[mask]
result

Unnamed: 0,text,category
4091,How do I get started when I get my card?,get_physical_card


> Only one sentence doesn't include the word 'PIN', and we can infer that it is also about the PIN even though it does not mention the word explicitly. Hence, we can safely assume that 'pin_inquiry' is a more suitable name for this category. 



In [25]:
mask = df_training_copy['category'] == 'get_physical_card'
# Assign through .loc to avoid chained assignment, which may not modify the DataFrame.
df_training_copy.loc[mask, 'category'] = 'pin_inquiry'

In [26]:
# Keyword lookup for "getting_spare_card"

mask = df_training_copy['category'] == 'getting_spare_card'
df_training_copy[mask]

Unnamed: 0,text,category
7804,Are extra charges added for sending out additi...,getting_spare_card
7805,I'd like to order an additional card,getting_spare_card
7806,I want some extra physical cards.,getting_spare_card
7807,I would like open a second card for my daughte...,getting_spare_card
7808,Am I gonna be charged for sending out more cards?,getting_spare_card
...,...,...
7928,Can I have more than one card?,getting_spare_card
7929,I need another card for a member of my family....,getting_spare_card
7930,Is there any way I can get more physical cards...,getting_spare_card
7931,Can I give a second card for this account to m...,getting_spare_card


--------------------------------------------------------------------------------

### 8. ***receiving_money*** (row count: 95) -- IN PROGRESS

In [27]:
# Rows in "receiving_money"

mask = df_training_copy['category'] == 'receiving_money'
df_training_copy[mask]

Unnamed: 0,text,category
7333,Can my salary be received here?,receiving_money
7334,How can my boss pay me directly to the card?,receiving_money
7335,Salary in GBP has been received. Does it need ...,receiving_money
7336,I am paid by my employer in GBP; do I need to ...,receiving_money
7337,How can my friend give me money?,receiving_money
...,...,...
7423,Can I use a different currency with my salary?,receiving_money
7424,My salary is in GBP. Do I need to configure th...,receiving_money
7425,How can I configure what currency my salary is...,receiving_money
7426,Do you support direct deposits from my employer?,receiving_money



> We keep this intent as it is.



--------------------------------------------------------------------------------

### 9. ***TOP UPS*** -- maybe we can keep the original for the 'top-up's 



> There are 8 intents that are related to 'top-ups':


1.   pending_top_up (149)
2.   top_up_reverted (146)
3.   top_up_failed (145)
4.   top_up_by_bank_transfer_charge (111)
5.   topping_up_by_card (103)
6.   top_up_limits (97)
7.   verify_top_up (126)
8.   automatic_top_up (127)

In [28]:
# Keyword lookup for "top_up_limits"

mask = df_training_copy['category'] == 'top_up_limits'
df_training_copy[mask]

Unnamed: 0,text,category
2234,What is the max amount of top-ups?,top_up_limits
2235,Are top-ups unlimited?,top_up_limits
2236,Can I increase my top-up maximum?,top_up_limits
2237,Can I top-up any amount?,top_up_limits
2238,What's the limit to how much I can top up?,top_up_limits
...,...,...
2326,How much is the limit for top-ups?,top_up_limits
2327,How much can I top up on my card?,top_up_limits
2328,limits on top ups,top_up_limits
2329,What is the daily limit on my card?,top_up_limits


In [29]:
# Keyword lookup for "topping_up_by_card"

mask = df_training_copy['category'] == 'topping_up_by_card'
df_training_copy[mask]

Unnamed: 0,text,category
4235,My money I had was gone and I could not get gas!,topping_up_by_card
4236,i can not see my top up,topping_up_by_card
4237,I can't see my top up in my wallet!,topping_up_by_card
4238,I want to transfer money using my credit card.,topping_up_by_card
4239,"I tried topping up using my card, but the mone...",topping_up_by_card
...,...,...
4333,How can I use my credit card to transfer money?,topping_up_by_card
4334,How do I use my card to top up?,topping_up_by_card
4335,"I need to transfer some money, can I use my cr...",topping_up_by_card
4336,Why is my 'top up' not showing up in my wallet?,topping_up_by_card


In [30]:
# Keyword lookup for "top_up_by_bank_transfer_charge"

mask = df_training_copy['category'] == 'top_up_by_bank_transfer_charge'
df_training_copy[mask]

Unnamed: 0,text,category
1817,Is there a top up fee for transfer?,top_up_by_bank_transfer_charge
1818,Will there be a charge for topping up by accou...,top_up_by_bank_transfer_charge
1819,What are the charges for receiving a SEPA tran...,top_up_by_bank_transfer_charge
1820,Is there a charge for SEPA transfers?,top_up_by_bank_transfer_charge
1821,Will I be charged a fee for a SEPA transfer?,top_up_by_bank_transfer_charge
...,...,...
1923,Is there any fees associated with receiving mo...,top_up_by_bank_transfer_charge
1924,Will I be charged to top off my account?,top_up_by_bank_transfer_charge
1925,What kind of fee is there to top-up my account?,top_up_by_bank_transfer_charge
1926,Do you process SWIFT transfers?,top_up_by_bank_transfer_charge


In [31]:
# Keyword lookup for "top_up_failed"

mask = df_training_copy['category'] == 'top_up_failed'
df_training_copy[mask]

Unnamed: 0,text,category
8468,I think my top-up has failed.,top_up_failed
8469,Top-up is not working,top_up_failed
8470,My top up is not working,top_up_failed
8471,"My top-up hasn't gone through, what happened?",top_up_failed
8472,My top-up was rejected by an app.,top_up_failed
...,...,...
8608,There has been a red flag on my top up.,top_up_failed
8609,Was there a problem with topping up?,top_up_failed
8610,"Hi, I tried to top up my card today and it did...",top_up_failed
8611,My top-up failed to go through.,top_up_failed


In [32]:
# Keyword lookup for "top_up_reverted"

mask = df_training_copy['category'] == 'top_up_reverted'
df_training_copy[mask]

Unnamed: 0,text,category
3155,My top up did not show up as shown and my mone...,top_up_reverted
3156,Has my top-up been cancelled?,top_up_reverted
3157,I topped up recently and saw the money go thro...,top_up_reverted
3158,For what reason did my top-up get cancelled?,top_up_reverted
3159,I think my top up has been reverted,top_up_reverted
...,...,...
3296,"Hello, I have recharged topup but account is ...",top_up_reverted
3297,Reverted top up,top_up_reverted
3298,"Hae, I already completed my 3D secure authenti...",top_up_reverted
3299,Can you tell me why my top-up was canceled?,top_up_reverted


In [33]:
# Keyword lookup for "pending_top_up"

mask = df_training_copy['category'] == 'pending_top_up'
df_training_copy[mask]

Unnamed: 0,text,category
1928,How long does a top-up take to go through?,pending_top_up
1929,I am under the impression that my top up is st...,pending_top_up
1930,How long will it take for my money to be depos...,pending_top_up
1931,i put money on my card and i dont see it on th...,pending_top_up
1932,I have a top-up that's still pending and wante...,pending_top_up
...,...,...
2072,"Hello, I'm a brand new customer and tried topp...",pending_top_up
2073,I have a new card and I can't add money to it....,pending_top_up
2074,I'm not certain my top-up went through yet.,pending_top_up
2075,I've been waiting for over an hour for a top u...,pending_top_up


In [34]:
# Keyword lookup for "verify_top_up"

mask = df_training_copy['category'] == 'verify_top_up'
df_training_copy[mask]

Unnamed: 0,text,category
7678,Do you know how I can verify that I did a top-...,verify_top_up
7679,where is the code for verifying the top up card?,verify_top_up
7680,The top-up card is verified how?,verify_top_up
7681,Tell me about verifying top-up,verify_top_up
7682,What is the verification code for my top up card?,verify_top_up
...,...,...
7799,I need help finding the verification code on t...,verify_top_up
7800,What does top-up verification do?,verify_top_up
7801,I CAN NOT FIND THE TOP-UP VERIFICATION CODE.,verify_top_up
7802,How do I go about verifying the top-up card?,verify_top_up


In [35]:
# Keyword lookup for "automatic_top_up"

mask = df_training_copy['category'] == 'automatic_top_up'
df_training_copy[mask]

Unnamed: 0,text,category
1118,Can I add money automatically to my account wh...,automatic_top_up
1119,i need help finding the auto top up option.,automatic_top_up
1120,What are the maximum amount you can do for aut...,automatic_top_up
1121,I can't find the auto-top up option.,automatic_top_up
1122,Does the auto top-up have any limits?,automatic_top_up
...,...,...
1240,"My funds are low, can I auto top up?",automatic_top_up
1241,I know there's an auto-top option but I can no...,automatic_top_up
1242,Is it possible to top-up automatically?,automatic_top_up
1243,What is the auto top-up limit?,automatic_top_up


--------------------------------------------------------------------------------

### 10. ***Verifying identities*** -- DONE



> There are 3 different intents that are associated with 'verifying identities'


1.   unable_to_verify_identity (102)
2.   verify_my_identity (104)
3.   why_verify_identity (121)


In [36]:
# Keyword lookup for "unable_to_verify_identity"

mask = df_training_copy['category'] == 'unable_to_verify_identity'
df_training_copy[mask]

Unnamed: 0,text,category
3892,Can you help me with proving my identity?,unable_to_verify_identity
3893,What proof do you need for my identification?,unable_to_verify_identity
3894,Are there any reasons that my identity wouldn'...,unable_to_verify_identity
3895,I am having some difficulty verifying my id.,unable_to_verify_identity
3896,I am not able to verify my id. Why?,unable_to_verify_identity
...,...,...
3989,I'm still waiting on my identity verification.,unable_to_verify_identity
3990,Why can't my ID be verified?,unable_to_verify_identity
3991,Help my prove my identity.,unable_to_verify_identity
3992,i am under 18 and i am trying to verify my id....,unable_to_verify_identity


In [37]:
# Keyword lookup for "verify_my_identity"

mask = df_training_copy['category'] == 'verify_my_identity'
df_training_copy[mask]

Unnamed: 0,text,category
9770,What do you require for identity verification?,verify_my_identity
9771,How can I prove I am me?,verify_my_identity
9772,I need to verify my identity. How do I do that?,verify_my_identity
9773,How do I verify my identity?,verify_my_identity
9774,How do I verify my identity online,verify_my_identity
...,...,...
9869,What is needed for the identity verification?,verify_my_identity
9870,"For the identity check, what kind of documents...",verify_my_identity
9871,What form of identification is accepted?,verify_my_identity
9872,What is necessary for the identity verification?,verify_my_identity


> We can combine "verify_my_identity" and "unable_to_verify_identity" because both mostly ask how to verify an identity. When a user is unable to verify it ("unable_to_verify_identity"), we can simply tell them how to do it. The merged category is named "verify_identity".



In [38]:
category_verify_identity = ['verify_my_identity', 'unable_to_verify_identity']

# Relabel with .loc to avoid chained assignment (SettingWithCopyWarning)
mask = df_training_copy['category'].isin(category_verify_identity)
df_training_copy.loc[mask, 'category'] = 'verify_identity'
df_training_copy[df_training_copy.category=='verify_identity'].count()




#-------------------------testing---------------------------
mask = df_testing_copy['category'].isin(category_verify_identity)
df_testing_copy.loc[mask, 'category'] = 'verify_identity'
df_testing_copy[df_testing_copy.category=='verify_identity'].count()

text        80
category    80
dtype: int64
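
The same relabeling is applied to both the training and the testing frame, so it could be wrapped in a small helper. A minimal sketch, assuming a hypothetical `merge_categories` function and toy rows that stand in for `df_training_copy` (neither is part of the original notebook):

```python
import pandas as pd

def merge_categories(df, old_labels, new_label, column="category"):
    """Relabel every row whose `column` value is in `old_labels` as `new_label`.

    Modifies `df` in place and returns how many rows carry `new_label` afterwards.
    """
    mask = df[column].isin(old_labels)
    df.loc[mask, column] = new_label  # single .loc assignment, no chained indexing
    return int((df[column] == new_label).sum())

# Toy frame standing in for df_training_copy (illustrative rows only)
toy = pd.DataFrame({
    "text": ["How do I verify my identity?",
             "Why can't my ID be verified?",
             "Why do you have an identity check?"],
    "category": ["verify_my_identity",
                 "unable_to_verify_identity",
                 "why_verify_identity"],
})

n_merged = merge_categories(
    toy, ["verify_my_identity", "unable_to_verify_identity"], "verify_identity"
)
```

Calling the helper once per frame with the same label list keeps the training and testing categories in sync.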

In [39]:
# Keyword lookup for "why_verify_identity"

mask = df_training_copy['category'] == 'why_verify_identity'
df_training_copy[mask]

Unnamed: 0,text,category
3771,Why do you have an identity check?,why_verify_identity
3772,I do not feel comfortable verifying my identity.,why_verify_identity
3773,Why on earth do you need so much personal id i...,why_verify_identity
3774,DO you know the reason for the identity check?,why_verify_identity
3775,I answered so many questions about my identity...,why_verify_identity
...,...,...
3887,why and how soon do I have to have proof of id...,why_verify_identity
3888,Why do you need all these details about my ide...,why_verify_identity
3889,What is identity check?,why_verify_identity
3890,Can i make transactions before identity verifi...,why_verify_identity


--------------------------------------------------------------------------------

### 11. ***passcode_forgotten*** -- IN PROGRESS

In [40]:
# Keyword lookup for "passcode_forgotten"

mask = df_training_copy['category'] == 'passcode_forgotten'
df_training_copy[mask]

Unnamed: 0,text,category
4814,Help me! I don't know what my password is.,passcode_forgotten
4815,I thought I knew my password but I guess I was...,passcode_forgotten
4816,I am unable to access my app due to forgetting...,passcode_forgotten
4817,I can't recall my password.,passcode_forgotten
4818,What should I do if I don't know my password?,passcode_forgotten
...,...,...
4914,"I don't know what my passcode is, can you help?",passcode_forgotten
4915,How can i reset my passcode ?,passcode_forgotten
4916,What can I do if my passcode won't work?,passcode_forgotten
4917,"I can't remember my app code, can you reset it?",passcode_forgotten


In [41]:
# Keyword lookup for "pin_blocked"

mask = df_training_copy['category'] == 'pin_blocked'
df_training_copy[mask]

Unnamed: 0,text,category
1667,I have exceeded the number of PIN attempts,pin_blocked
1668,I mistook my pin and now I am locked. Can you...,pin_blocked
1669,Please help me unblock my pin which I put the ...,pin_blocked
1670,Help me unblock my account. I entered the PIN...,pin_blocked
1671,"I locked myself out of my account, how do I un...",pin_blocked
...,...,...
1777,I need a new PIN number.,pin_blocked
1778,Can you unblock my account? I entered the PIN...,pin_blocked
1779,I tried my PIN too many times,pin_blocked
1780,I punched in the wrong pin too many times and ...,pin_blocked


In [42]:
# Keyword lookup for "change_pin"

mask = df_training_copy['category'] == 'change_pin'
df_training_copy[mask]

Unnamed: 0,text,category
6883,Is it possible for me to change my PIN number?,change_pin
6884,What are the steps to change my PIN to somethi...,change_pin
6885,In what way can I change my PIN and where do I...,change_pin
6886,Can I change my PIN at any cash machines?,change_pin
6887,I need to make my card PIN a different number,change_pin
...,...,...
7000,Please tell me how to change my pin.,change_pin
7001,At what ATM can I change my PIN?,change_pin
7002,If I am not in the country and I need to chang...,change_pin
7003,What do I have to do to change my pin?,change_pin


--------------------------------------------------------------------------------

### 12. ***declined and failed transfer*** -- IN PROGRESS



> declined_transfer (133) & failed_transfer (137)



In [43]:
# Keyword lookup for "declined_transfer"

mask = df_training_copy['category'] == 'declined_transfer'
df_training_copy[mask]

Unnamed: 0,text,category
5541,"Transfer unable to be completed, states 'decli...",declined_transfer
5542,Why was my transfer request decline?,declined_transfer
5543,My transfer was rejected,declined_transfer
5544,I tried to buy something online yesterday and ...,declined_transfer
5545,"I am having trouble doing a transfer. A ""Decli...",declined_transfer
...,...,...
5669,I saw that my transfer was declined.,declined_transfer
5670,Why do you keep declining my transfers? it's a...,declined_transfer
5671,How can I find a way to transfer money without...,declined_transfer
5672,I have a problem with my account. I tried to t...,declined_transfer


In [44]:
# Keyword lookup for "failed_transfer"

mask = df_training_copy['category'] == 'failed_transfer'
df_training_copy[mask]

Unnamed: 0,text,category
7428,What is happening? I have tried to transfer m...,failed_transfer
7429,Is there a reason that my transfer failed?,failed_transfer
7430,Why didn't my transfer complete?,failed_transfer
7431,Why was I unable to finish this transfer?,failed_transfer
7432,I have made 5 attempts to make a very standard...,failed_transfer
...,...,...
7560,why do i have a failed transfer,failed_transfer
7561,I've tried completing a very standard transfer...,failed_transfer
7562,"When I attempted to make a transfer, it failed.",failed_transfer
7563,Why can't my transfer complete?,failed_transfer


# Preprocessing

In [45]:
!pip install -q tensorflow_text


In [46]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.preprocessing import LabelBinarizer

In [47]:
# preprocessing:
  # encoding: converting texts into numbers using BERT
    # download two models; one to perform preprocessing and the other for encoding
        # bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
        # bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")



''' To use BERT, text inputs must be converted into numeric token ids and organized into multiple Tensors. 
TensorFlow Hub offers a preprocessing model that corresponds to each of the discussed BERT models. 
These models use TF operations from the TF.text library to carry out the transformation, 
eliminating the need to run independent Python code outside of your TensorFlow model for preprocessing text. ''' 

# Note: the preprocessing model cannot take a pandas DataFrame as input,
# so the text column must be converted to a list first.
# Because this text preprocessor is itself a TensorFlow model, it can be included in the model directly.
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3") # Preprocessing model applied to our text data



In [48]:
# Extract text data as list (example)
inputs = df_training_copy.text[0:5].tolist()

In [49]:
text_preprocessed = bert_preprocess(inputs)

In [50]:
print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')     # Each input is padded or truncated to 128 tokens. The sequence length can be customized
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')   # input_type_ids is all zeros because each input is a single sentence. For a sentence-pair input, tokens of the second segment would be marked with 1.

Keys       : ['input_word_ids', 'input_mask', 'input_type_ids']
Shape      : (5, 128)
Word Ids   : [ 101 1045 2572 2145 3403 2006 2026 4003 1029  102    0    0]
Input Mask : [1 1 1 1 1 1 1 1 1 1 0 0]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]
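
To make the three tensors above easier to read, here is a toy re-creation of how a single-sentence input is packed. It is a sketch for illustration only, not the TF Hub preprocessor itself: `pack_single_sentence` is a hypothetical helper, and the token ids are copied from the printed `Word Ids` row rather than produced by a tokenizer.

```python
import numpy as np

# Special-token ids in the uncased BERT vocabulary: [CLS], [SEP], [PAD]
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pack_single_sentence(token_ids, seq_length=128):
    """Mimic the three preprocessing outputs for one single-segment input."""
    # Reserve two positions for [CLS] and [SEP], truncating if necessary
    ids = [CLS_ID] + list(token_ids)[: seq_length - 2] + [SEP_ID]
    n_real = len(ids)
    n_pad = seq_length - n_real
    return {
        "input_word_ids": np.array(ids + [PAD_ID] * n_pad),
        "input_mask": np.array([1] * n_real + [0] * n_pad),  # 1 = real token, 0 = padding
        "input_type_ids": np.zeros(seq_length, dtype=int),    # single segment, so all zeros
    }

# Token ids taken from the printed output above (without [CLS] and [SEP])
packed = pack_single_sentence([1045, 2572, 2145, 3403, 2006, 2026, 4003, 1029])
```

The first twelve entries of `packed["input_word_ids"]` and `packed["input_mask"]` match the printed rows: ten real positions followed by padding.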


# Model Building

In [51]:
bert_model = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1")

In [53]:
"""
bert_results = bert_model(text_preprocessed)

print(f'Loaded BERT: {"https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1"}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}') # represents each input sequence as a whole.
                                                                     # The shape is [batch_size, H]. You can think of this as an embedding for the entire customer message.
                                                                     # The classifier head is built on top of the pooled_output array.
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}') # represents each input token in context. The shape is [batch_size, seq_length, H].
                                                                         # You can think of this as a contextual embedding for every token in the customer message.
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :12]}')
"""

Loaded BERT: https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1
Pooled Outputs Shape:(5, 512)
Pooled Outputs Values:[ 0.2830166  -0.10704856 -0.99982864  0.32834923  0.75644654 -0.20272332
 -0.99827766 -0.06505853  0.37738743 -0.06368753 -0.7537793  -0.21071714]
Sequence Outputs Shape:(5, 128, 512)
Sequence Outputs Values:[[-0.8711817  -0.33320275  0.9165911  ... -0.2764066  -0.6449242
   0.0519006 ]
 [-0.52613     0.01936688  0.35743985 ... -0.8161681  -0.23002876
   0.56460106]
 [-0.8626529  -0.3867918  -0.2540034  ... -0.76943177 -0.03504736
   0.12754942]
 ...
 [-0.49495363 -0.4963697   1.2027649  ... -1.1381952  -0.12779602
   0.24443786]
 [-0.27515975 -0.16101809  0.4482491  ... -0.21568504 -0.01699192
  -0.03480853]
 [-0.3881903  -0.16248006  0.3746023  ... -0.14226384  0.22514021
   0.06008574]]


In [93]:
def intent_classification_bert():
  
  # Initializing the BERT layers
  input_text = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text') # define an input tensor --> the input data to the model will be a string of variable length
  text_preprocessed = bert_preprocess(input_text) # This layer converts the input string into a format that can be understood by the BERT model
  output_bert = bert_model(text_preprocessed) # This layer encodes the input text using the BERT model and returns the encoded outputs
                                              # After that, this will be fed into the neural network layers.

  # Initializing the neural network layers
  layer = tf.keras.layers.Dropout(0.1, name="dropout")(output_bert['pooled_output']) # Dropout helps prevent overfitting
                                                                                 # by randomly zeroing 10% of the inputs during training
  layer = tf.keras.layers.Dense(72, activation=None, name="classification")(layer)  # Output layer with 72 units, one per intent category.
                                                                                 # No activation is applied, so the layer returns raw logits;
                                                                                 # the loss function (from_logits=True) applies softmax internally.
  return tf.keras.Model(input_text, layer)

In [94]:
classifier_model = intent_classification_bert()

loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True) # Multi-class problem with one-hot labels; the model outputs logits (not probabilities),
                                                                 # so we use losses.CategoricalCrossentropy with from_logits=True.
metrics = tf.metrics.CategoricalAccuracy()
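
Because the output layer returns logits, predictions need a softmax step before they can be read as probabilities. A minimal NumPy sketch of that step, assuming hypothetical logits for three classes (the real model has 72) and a sorted class array like the one a label encoder would hold:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for 2 messages over 3 classes (illustrative values only)
classes = np.array(["change_pin", "pin_blocked", "verify_identity"])
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])

probs = softmax(logits)                      # each row sums to 1
predicted = classes[probs.argmax(axis=-1)]   # most probable class per message
```

Taking the argmax of the logits directly gives the same predicted class; the softmax is only needed when the probabilities themselves are of interest.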

In [95]:
epochs=5
optimizer=tf.keras.optimizers.Adam(1e-5)
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [96]:
from sklearn.model_selection import train_test_split
trainfeatures, validfeatures, trainlabels, validlabels = train_test_split(df_training_copy['text'],df_training_copy['category'], stratify=df_training_copy['category'], test_size=0.2)

testfeatures=df_testing_copy.copy()
testlabels=testfeatures.pop("category")

In [97]:
from sklearn.preprocessing import LabelBinarizer
binarizer=LabelBinarizer()  # One-Hot-Encoding of class-labels

trainlabels=binarizer.fit_transform(trainlabels.values)
validlabels=binarizer.transform(validlabels.values)
testlabels=binarizer.transform(testlabels.values)

In [109]:
trainfeatures = pd.DataFrame(trainfeatures)
validfeatures = pd.DataFrame(validfeatures)


In [110]:
print(trainfeatures.shape)
print(validfeatures.shape)
print(testfeatures.shape)



(8002, 1)
(2001, 1)
(3080, 1)


In [111]:
print(trainlabels.shape)
print(validlabels.shape)
print(testlabels.shape)



(8002, 72)
(2001, 72)
(3080, 72)


In [112]:
history = classifier_model.fit(x=trainfeatures, y=trainlabels,
                               validation_data=(validfeatures,validlabels),
                               batch_size=32,
                               epochs=epochs)

Epoch 1/5


ValueError: ignored
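
The traceback above is truncated, so the cause of the `ValueError` cannot be confirmed from the output. One plausible cause (an assumption) is a shape mismatch: the model's input layer is declared with `shape=()`, meaning a batch of scalar strings, while `trainfeatures` and `validfeatures` were converted to single-column DataFrames of shape `(n, 1)`. A minimal sketch of flattening the features to 1-D string arrays before calling `fit`; `to_text_array` is a hypothetical helper and the rows are toy data:

```python
import numpy as np
import pandas as pd

def to_text_array(features, column="text"):
    """Return a 1-D NumPy array of strings from a DataFrame or a Series."""
    if isinstance(features, pd.DataFrame):
        features = features[column]
    return np.asarray(features, dtype=str)

# Toy single-column frame standing in for trainfeatures
toy_features = pd.DataFrame({"text": ["My top-up failed to go through.",
                                      "How do I verify my identity?"]})
x_train = to_text_array(toy_features)
# x_train has shape (2,) rather than (2, 1), i.e. one scalar string per example
```

If this is indeed the cause, passing arrays prepared this way as `x` and in `validation_data` would be the thing to try first.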

In [None]:
########################################################## Handling class imbalance  ############################################### <-- do this after data preprocessing, because the labels might change

In [86]:
from sklearn.utils import class_weight

In [98]:
# Calculate class weights
class_weights = class_weight.compute_class_weight('balanced', classes = np.unique(df_training_copy['category']), y= df_training_copy['category'])

# Convert class weights to dictionary
class_weights_dict = dict(enumerate(class_weights))

# compute_class_weight returns the weights in the order of the sorted unique categories, so the dictionary keys are integer positions rather than category names.
# To see which category each position refers to, build a second dictionary mapping positions to categories.
category_map = dict(enumerate(np.unique(df_training_copy['category'])))

In [99]:
print(class_weights_dict)
print(category_map)

{0: 0.8575960219478738, 1: 0.8737770789657582, 2: 1.2630050505050505, 3: 1.1026234567901234, 4: 0.93871996996997, 5: 1.093941382327209, 6: 0.8124593892137751, 7: 0.767572130141191, 8: 0.8905804843304843, 9: 0.8849079971691437, 10: 1.07698105081826, 11: 2.3547551789077215, 12: 0.908042846768337, 13: 1.2404513888888888, 14: 0.9995003996802558, 15: 0.73899231678487, 16: 0.7429441473559121, 17: 0.8269675925925926, 18: 0.8319194943446441, 19: 0.7849183929692404, 20: 0.8683159722222222, 21: 1.1387750455373407, 22: 1.6154715762273901, 23: 1.07698105081826, 24: 0.908042846768337, 25: 0.8030667951188183, 26: 1.0445906432748537, 27: 0.7633547008547008, 28: 1.1481864095500458, 29: 1.1481864095500458, 30: 1.1481864095500458, 31: 1.2404513888888888, 32: 1.1773775894538607, 33: 0.8369310575635877, 34: 1.0140916463909164, 35: 1.1026234567901234, 36: 1.4322737686139748, 37: 1.07698105081826, 38: 1.4176587301587302, 39: 0.6843869731800766, 40: 1.1577546296296297, 41: 1.3231481481481482, 42: 0.873777078

In [89]:
# Match classes with class weights

class_indices = [list(binarizer.classes_).index(x) for x in np.unique(df_training_copy['category'])]
class_weights = class_weights[class_indices]

In [91]:
class_weights

array([0.85759602, 0.87377708, 1.26300505, 1.10262346, 0.93871997,
       1.09394138, 0.81245939, 0.76757213, 0.89058048, 0.884908  ,
       1.07698105, 2.35475518, 0.90804285, 1.24045139, 0.9995004 ,
       0.73899232, 0.74294415, 0.82696759, 0.83191949, 0.78491839,
       0.86831597, 1.13877505, 1.61547158, 1.07698105, 0.90804285,
       0.8030668 , 1.04459064, 0.7633547 , 1.14818641, 1.14818641,
       1.14818641, 1.24045139, 1.17737759, 0.83693106, 1.01409165,
       1.10262346, 1.43227377, 1.07698105, 1.41765873, 0.68438697,
       1.15775463, 1.32314815, 0.87377708, 0.97154235, 0.93241984,
       0.93871997, 1.20809179, 1.31066562, 1.4624269 , 0.82207429,
       0.86292271, 1.07698105, 1.28639403, 1.25162663, 1.21868908,
       1.21868908, 0.95814176, 1.43227377, 0.95157915, 1.34884035,
       0.79388889, 0.80773579, 1.22947394, 0.81245939, 1.08539497,
       0.67442017, 1.22947394, 1.10262346, 1.02911523, 1.14818641,
       0.77183642, 0.8523347 ])
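
For reference, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. A minimal NumPy sketch of that formula on toy labels; `balanced_class_weights` is a hypothetical helper written for illustration, not the scikit-learn function used above:

```python
import numpy as np

def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * count) per class.

    Returns the sorted class names and a dict keyed by class position.
    """
    classes, counts = np.unique(labels, return_counts=True)  # classes come back sorted
    weights = len(labels) / (len(classes) * counts)
    return classes, dict(enumerate(weights))

# Toy labels: four examples of one class and one of another
toy_labels = np.array(["top_up_failed"] * 4 + ["verify_identity"])
classes, weights = balanced_class_weights(toy_labels)
# weights[0] = 5 / (2 * 4) = 0.625 and weights[1] = 5 / (2 * 1) = 2.5
```

A dictionary of this shape is what Keras `Model.fit` accepts through its `class_weight` argument, provided the integer keys line up with the column order of the one-hot labels.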

In [None]:
##########################################################################################################################################