<a href="https://colab.research.google.com/github/reban87/ML-Projects/blob/main/Task2_MCC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Task 2: Multiclass Text Classification | RPA Labs | Rebanta Aryal | 1st April 2022

[```ktrain```](https://amaiya.github.io/ktrain/index.html) is a lightweight wrapper library for TensorFlow and Keras that can also be used to fine-tune Hugging Face Transformers models.

In [None]:
!pip install ktrain

Collecting ktrain
  Downloading ktrain-0.30.0.tar.gz (25.3 MB)
Collecting scikit-learn==0.24.2
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting syntok==1.3.3
  Downloading syntok-1.3.3-py3-none-any.whl (22 kB)
Collecting transformers==4.10.3
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

To restrict the run to a particular set of GPU devices, the ```CUDA_VISIBLE_DEVICES``` environment variable is set before TensorFlow is loaded.

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
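import os
# Order devices by PCI bus ID so the indices match those shown by nvidia-smi.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Make only GPU 0 visible to TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"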
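As a minimal sketch of the same idea for more than one device, the helper below builds the comma-separated value that ```CUDA_VISIBLE_DEVICES``` expects. The name `select_gpus` is hypothetical and not part of ktrain or TensorFlow. The safe choice is to call it before TensorFlow is imported, so the setting is in place when the GPU is first initialized.

```python
import os

def select_gpus(device_ids):
    """Expose only the given GPU indices to CUDA-based libraries.

    `select_gpus` is a hypothetical helper used for illustration only.
    """
    # Number devices by PCI bus order so indices match `nvidia-smi`.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # The variable takes a comma-separated list of device indices.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]

print(select_gpus([0]))  # prints "0", the same setting as the cell above
```

Passing `[0, 1]` would produce `"0,1"` and expose the first two devices.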

We use the [20 Newsgroups](https://www.kaggle.com/datasets/crawford/20-newsgroups) dataset, loaded through scikit-learn.

In [None]:
import ktrain
from ktrain import text
from sklearn.datasets import fetch_20newsgroups

The 20 Newsgroups dataset has 20 classes. We keep only five of them, since training on the full dataset would be computationally expensive.

In [None]:
categories=['comp.graphics','rec.motorcycles','sci.med','rec.sport.baseball','sci.electronics']

The training and test sets are loaded with ```fetch_20newsgroups()```.


In [None]:
train=fetch_20newsgroups(
    subset='train',
    categories=categories,
    shuffle=True,
    random_state=0
)

In [None]:
test=fetch_20newsgroups(
    subset='test',
    categories=categories,
    shuffle=True,
    random_state=0
)

Our task is multiclass classification; with the selection above there are 5 classes. Each entry of ```target``` is an integer index into ```target_names```.

In [None]:
test.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [None]:
test.target

array([4, 3, 3, ..., 1, 1, 2])

In [None]:
test.target_names

['comp.graphics',
 'rec.motorcycles',
 'rec.sport.baseball',
 'sci.electronics',
 'sci.med']
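To make the link between the integer targets and the class names concrete, here is a small sketch that decodes a few labels by indexing into the list printed above. The helper name `decode_labels` is hypothetical; the sample values are the first and last three entries of the `test.target` output.

```python
# Class names exactly as printed by `test.target_names` above.
target_names = [
    'comp.graphics',
    'rec.motorcycles',
    'rec.sport.baseball',
    'sci.electronics',
    'sci.med',
]

def decode_labels(targets, names):
    """Map integer class indices to their class names (hypothetical helper)."""
    return [names[i] for i in targets]

# First and last three values of the `test.target` array shown above.
sample = [4, 3, 3, 1, 1, 2]
print(decode_labels(sample, target_names))
# ['sci.med', 'sci.electronics', 'sci.electronics',
#  'rec.motorcycles', 'rec.motorcycles', 'rec.sport.baseball']
```

Note that index 4 is `sci.med` and index 2 is `rec.sport.baseball`: the names are in alphabetical order, not in the order in which `categories` was written.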

In [None]:
X_train=train.data
y_train=train.target

X_test=test.data
y_test=test.target

In [None]:
len(X_train)

2964

In [None]:
len(X_test)

1973

Build the model using ```distilbert-base-uncased```, as it is small and trains quickly.

In [None]:
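model_name = 'distilbert-base-uncased'
# maxlen=512 is the longest input DistilBERT accepts; longer documents are truncated.
# class_names are matched to the integer labels by position.
trans = text.Transformer(model_name, maxlen=512, class_names=categories)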

In [None]:
train_data=trans.preprocess_train(X_train,y_train)
test_data=trans.preprocess_test(X_test,y_test)

preprocessing train...
language: en
train sequence lengths:
	mean : 233
	95percentile : 559
	99percentile : 1150


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 251
	95percentile : 618
	99percentile : 1425
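The statistics above show that the 95th percentile of document length (559 for train, 618 for test) is above `maxlen=512`, so a share of the documents is truncated. The sketch below estimates that share. It counts whitespace-separated words, which is an assumption and may differ from how ktrain measures sequence length; `share_over_limit` is a hypothetical helper and the documents are toy examples.

```python
def share_over_limit(texts, maxlen):
    """Return the fraction of texts with more than `maxlen` whitespace-separated words."""
    lengths = [len(t.split()) for t in texts]
    over = sum(1 for n in lengths if n > maxlen)
    return over / len(lengths)

# Toy documents: only the second one (600 words) exceeds the limit.
docs = ["short text", "word " * 600, "another short one", "word " * 10]
print(share_over_limit(docs, maxlen=512))  # 0.25
```

With the data from this notebook the call would be `share_over_limit(X_train, 512)`.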


In [None]:
model=trans.get_classifier()


In [None]:
learner=ktrain.get_learner(model,train_data=train_data,val_data=test_data,batch_size=16)

Finding the best learning rate for a model is rarely straightforward. Based on several articles on text classification, a learning rate of ```0.0001``` gives fair results. Because training for more than 2 epochs takes considerably longer, the model is trained for a single epoch here.
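To give a feel for what a one-cycle schedule does with this maximum learning rate, the sketch below computes a triangular curve: the rate rises linearly to the maximum over the first half of training and falls back over the second half. The starting rate of one tenth of the maximum and the function name `triangular_lr` are assumptions for illustration; ktrain's internal schedule may differ in its details. The step count uses the 2964 training documents and the batch size of 16 from above.

```python
import math

def triangular_lr(step, total_steps, max_lr, start_factor=0.1):
    """Learning rate at `step` for a simple triangular one-cycle schedule.

    Hypothetical helper: starts at max_lr * start_factor, peaks at max_lr
    halfway through, then returns to the starting rate.
    """
    base_lr = max_lr * start_factor
    half = total_steps / 2
    # Distance from the midpoint scaled to [0, 1]: 1 at both ends, 0 at the peak.
    distance = abs(step - half) / half
    return base_lr + (max_lr - base_lr) * (1 - distance)

# One epoch over 2964 documents with a batch size of 16.
steps_per_epoch = math.ceil(2964 / 16)
print(steps_per_epoch)  # 186

# Learning rate at the start, the midpoint and the end of the epoch.
for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, triangular_lr(step, steps_per_epoch, max_lr=1e-4))
```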

In [None]:
learner.fit_onecycle(1e-4,1)



begin training using onecycle policy with max lr of 0.0001...


<keras.callbacks.History at 0x7f454844cb50>

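Even after a single epoch, the model reaches a training accuracy of 85.95% and a validation accuracy of 92.90%. The confusion matrix below agrees with this, and precision, recall and F1 score are at or above 0.88 for every class. Note that the row labels in the report follow the order of ```categories```, while the integer labels follow ```test.target_names```, so the last three rows correspond to rec.sport.baseball, sci.electronics and sci.med, in that order.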

In [None]:
learner.validate(class_names=categories)

                    precision    recall  f1-score   support

     comp.graphics       0.94      0.89      0.91       389
   rec.motorcycles       0.92      0.94      0.93       398
           sci.med       0.99      0.95      0.97       397
rec.sport.baseball       0.88      0.90      0.89       393
   sci.electronics       0.92      0.96      0.94       396

          accuracy                           0.93      1973
         macro avg       0.93      0.93      0.93      1973
      weighted avg       0.93      0.93      0.93      1973



array([[347,   5,   1,  26,  10],
       [  2, 374,   1,  12,   9],
       [  4,   7, 379,   0,   7],
       [ 13,  21,   0, 352,   7],
       [  5,   1,   0,   9, 381]])
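The figures in the report can be recomputed from the confusion matrix itself. The sketch below assumes rows are true classes and columns are predicted classes (the scikit-learn convention), which the row sums matching the support column bear out. `per_class_metrics` is a hypothetical helper; classes are referred to by index only.

```python
# Confusion matrix copied from the output above.
cm = [
    [347,   5,   1,  26,  10],
    [  2, 374,   1,  12,   9],
    [  4,   7, 379,   0,   7],
    [ 13,  21,   0, 352,   7],
    [  5,   1,   0,   9, 381],
]

def per_class_metrics(cm):
    """Return (precision, recall, support) for each class index."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # row sum: true members of class k
        predicted = sum(cm[r][k] for r in range(n))  # column sum: predicted as class k
        results.append((tp / predicted, tp / support, support))
    return results

# Accuracy is the diagonal sum divided by the total number of documents.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(round(accuracy, 4))  # 0.929

for k, (precision, recall, support) in enumerate(per_class_metrics(cm)):
    print(k, f"{precision:.2f}", f"{recall:.2f}", support)
```

The printed precision, recall and support values match the rows of the classification report.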

In [None]:
predict_data=ktrain.get_predictor(learner.model,preproc=trans)

In [None]:
x='I have a friend name Rebanta and he has a fever'

In [None]:
predict_data.predict(x)

'sci.electronics'

In [None]:
y='digital multimeter can be used to find the voltage and current'

In [None]:
predict_data.predict(y)

'rec.sport.baseball'

In [None]:
z="I have a 42 years old friend, misdiagnosed as having osteopporosos for two years"

In [None]:
predict_data.predict(z)

'sci.electronics'

In [None]:
m="Bike riding in very high traffic highway is dangerous"

In [None]:
predict_data.predict(m)

'rec.motorcycles'
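Two of the predictions above look wrong at first sight: the fever sentence is labeled `sci.electronics` and the multimeter sentence `rec.sport.baseball`. A likely explanation is the order of the class names rather than the model: `class_names=categories` assigns names by position, while the integer labels follow the alphabetical order printed by `test.target_names`. The sketch below translates each displayed label into the class it stands for under that reading; the name `shown_to_actual` is hypothetical.

```python
# Order in which the names were passed as `class_names`.
categories = ['comp.graphics', 'rec.motorcycles', 'sci.med',
              'rec.sport.baseball', 'sci.electronics']
# Order of the integer labels, as printed by `test.target_names`.
target_names = ['comp.graphics', 'rec.motorcycles', 'rec.sport.baseball',
                'sci.electronics', 'sci.med']

# Position i is displayed as categories[i] but stands for target_names[i].
shown_to_actual = dict(zip(categories, target_names))

# Labels returned for x, y, z and m above.
for shown in ['sci.electronics', 'rec.sport.baseball', 'sci.electronics', 'rec.motorcycles']:
    print(shown, '->', shown_to_actual[shown])
# sci.electronics -> sci.med
# rec.sport.baseball -> sci.electronics
# sci.electronics -> sci.med
# rec.motorcycles -> rec.motorcycles
```

Read this way, all four predictions are sensible. Passing `train.target_names` instead of `categories` as `class_names` would keep names and indices aligned, so that no translation is needed.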