## Problem - Multi-class classification

### We have the title and abstract of various projects, classified into the following topics:

1) Computer Science
2) Physics
3) Mathematics
4) Statistics
5) Quantitative Biology
6) Quantitative Finance

## Using BERT for the task

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Installing transformers and upgrading it so that it is compatible with simpletransformers.
### Then, installing simpletransformers.

In [None]:
!pip install --upgrade transformers
!pip install simpletransformers

In [None]:
# Symlink nvidia-smi into /usr/bin so that GPU monitoring tools can find it
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi

In [None]:
# GPUtil is a Python module for getting the GPU status from NVIDIA GPUs using nvidia-smi.
!pip install gputil

# psutil is a cross-platform library for process and system monitoring in Python.
!pip install psutil

In [None]:
# Import the remaining packages and the simpletransformers ClassificationModel used for BERT
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')

from simpletransformers.classification import ClassificationModel
from sklearn.preprocessing import LabelEncoder

import torch
from scipy.special import softmax

In [None]:
train = pd.read_csv('../input/train-full/train.csv')
test = pd.read_csv('../input/independence-data-av/test.csv')
sample_sub = pd.read_csv('../input/independence-data-av/sample_submission.csv')

In [None]:
# Keep an untouched copy of the raw training data for reference
train_copy = train.copy()
train_copy.head()

In [None]:
# Join title and abstract with a space so the two fields do not run together
train["text"] = train["TITLE"] + " " + train["ABSTRACT"]
test["text"] = test["TITLE"] + " " + test["ABSTRACT"]
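
The two columns are merged into a single `text` field before cleaning. A small plain-Python sketch (the sample strings are made up, not taken from the data) shows what happens at the boundary with and without a separator:

```python
# Hypothetical sample row; the real values come from train.csv
title = "Deep learning for galaxies"
abstract = "We classify galaxy images."

without_sep = title + abstract        # last title word fuses with the first abstract word
with_sep = title + " " + abstract     # the two fields stay separate tokens

print(without_sep)
print(with_sep)
```

Whether the raw abstracts already begin with whitespace depends on the dataset, so it is worth inspecting one combined row before training.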

### Cleaning text using clean-text

In [None]:
!pip install clean-text[gpl]
from cleantext import clean
def text_cleaning(text):
    """Normalize raw text with clean-text before it is passed to the model."""
    return clean(
        text,
        fix_unicode=True,              # fix various unicode errors
        to_ascii=True,                 # transliterate to closest ASCII representation
        lower=True,                    # lowercase text
        no_line_breaks=True,           # fully strip line breaks as opposed to only normalizing them
        no_urls=True,                  # replace all URLs with a special token
        no_emails=True,                # replace all email addresses with a special token
        no_phone_numbers=True,         # replace all phone numbers with a special token
        no_numbers=True,               # replace all numbers with a special token
        no_digits=True,                # replace all digits with a special token
        no_currency_symbols=True,      # replace all currency symbols with a special token
        no_punct=True,                 # fully remove punctuation
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUMBER>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en",                     # set to 'de' for German special handling
    )

In [None]:
# Apply the cleaning function to the whole column at once; this avoids
# chained indexing (train['text'].iloc[i] = ...), which is slow and may not write back
train['text'] = train['text'].apply(text_cleaning)
test['text'] = test['text'].apply(text_cleaning)

In [None]:
train['text'].iloc[0]

In [None]:
test['text'].iloc[0]

### Using LabelEncoder to convert each combination of class flags (as a string) into an integer

In [None]:
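target_classes = ["Computer Science", "Physics", "Mathematics", "Statistics", "Quantitative Biology", "Quantitative Finance"]

# Combine the six class columns into one list per row, e.g. [1, 0, 0, 1, 0, 0]
train['label'] = train[target_classes].values.tolist()

# Encode each distinct combination (as a string) as a single integer class
le = LabelEncoder()
train['label'] = le.fit_transform(train['label'].astype(str))
train = train[["text", "label"]]

# Take a copy so that adding a column does not trigger SettingWithCopyWarning
test = test[["text"]].copy()
# Placeholder labels: the test set has no ground truth, this only fills the column
test["label"] = 1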
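
The encoding step above treats every distinct combination of the six topic flags as one class (often called a label-powerset transformation). The plain-Python sketch below mimics that idea on made-up rows. The names `rows`, `classes` and `to_id` are illustrative only, and the sorted-string ordering is meant to mirror how `LabelEncoder` orders its classes.

```python
# Made-up rows: one 0/1 flag per topic, in the order of target_classes
rows = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

# Each flag list becomes a string such as "[1, 0, 0, 0, 0, 0]"
as_strings = [str(r) for r in rows]

# Sorted unique strings play the role of LabelEncoder.classes_
classes = sorted(set(as_strings))
to_id = {c: i for i, c in enumerate(classes)}

labels = [to_id[s] for s in as_strings]
num_labels = len(classes)  # number of classes a classifier would have to predict

print(labels)      # [1, 0, 2, 1]
print(num_labels)  # 3
```

In the notebook, the same count is available as `len(le.classes_)` after fitting, which is a useful check against the `num_labels` value passed to the model.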

### Running the ClassificationModel and training

In [None]:
# Training configuration passed to simpletransformers
model_args = {
    'train_batch_size': 32,
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'fp16': False,
    'do_lower_case': False,
    'num_train_epochs': 2,
    'max_seq_length': 256,
    'regression': False,
    'manual_seed': 2,
    'learning_rate': 4e-5,
    'weight_decay': 0.0,
    'save_eval_checkpoints': False,
    'save_model_every_epoch': False,
    'silent': False,
}

# num_labels must cover every integer class produced by the LabelEncoder above
model = ClassificationModel('bert', 'bert-base-uncased', num_labels=24,
                            use_cuda=True, args=model_args)

model.train_model(train)

### Getting the model outputs for the test set

In [None]:
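# The test labels are placeholders, so the reported metrics and "wrong predictions"
# are not meaningful here; only the raw model outputs are used afterwards
test_result, test_model_outputs, test_wrong_predictions = model.eval_model(test)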

In [None]:
# Convert raw outputs to per-class probabilities, then pick the most likely class per row
predictions = softmax(test_model_outputs, axis=1)
final_pred = np.argmax(predictions, axis=1)
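
Softmax turns each row of raw model outputs into probabilities, and the predicted class is the index of the largest value. Because softmax preserves the ordering within a row, taking the argmax of the raw outputs gives the same class. A dependency-free sketch with made-up scores (the helper names are illustrative):

```python
import math

def softmax_row(logits):
    """Numerically stable softmax for one row of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up raw outputs for two samples and three classes
logits = [[2.0, -1.0, 0.5], [0.1, 0.2, 3.0]]

probs = [softmax_row(row) for row in logits]
pred_from_probs = [argmax(p) for p in probs]
pred_from_logits = [argmax(row) for row in logits]

print(pred_from_probs)   # [0, 2]
print(pred_from_logits)  # [0, 2]
```

The probabilities are still worth computing when a confidence value per prediction is needed, for example to inspect uncertain rows.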

In [None]:
final_pred

### Processing and converting integer classes back to string classes

In [None]:
from ast import literal_eval

sub1 = sample_sub.copy()
# Map integer classes back to their string form, e.g. "[1, 0, 0, 1, 0, 0]"
sub1['target'] = le.inverse_transform(final_pred)
# Parse each string into a real list and spread it over the six class columns
sub1['target'] = sub1['target'].apply(literal_eval)
sub1[target_classes] = pd.DataFrame(sub1['target'].tolist(), index=sub1.index)
sub1.drop("target", axis=1, inplace=True)
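
Decoding reverses the earlier encoding: each predicted integer maps back to a string such as `"[1, 0, 0, 1, 0, 0]"`, which `literal_eval` parses into a list that is then spread over the six topic columns. A minimal sketch with made-up values, where `classes` stands in for `le.classes_`:

```python
from ast import literal_eval

target_classes = ["Computer Science", "Physics", "Mathematics",
                  "Statistics", "Quantitative Biology", "Quantitative Finance"]

# Stand-in for le.classes_: index i holds the string for encoded class i
classes = ["[0, 1, 0, 0, 0, 0]", "[1, 0, 0, 0, 0, 0]", "[1, 0, 0, 1, 0, 0]"]
final_pred = [2, 0]  # made-up predicted class ids

decoded = []
for class_id in final_pred:
    flags = literal_eval(classes[class_id])            # string -> list of ints
    decoded.append(dict(zip(target_classes, flags)))   # one flag per topic

print(decoded[0])
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for turning these strings back into lists.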

In [None]:
sub1

In [None]:
sub1.to_csv('sub_new1.csv',index = False)