<a href="https://colab.research.google.com/github/vmavis/colab/blob/main/label_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing Data**

The data comes from Kaggle and can be found [here](https://www.kaggle.com/datasets/vetrirah/janatahack-independence-day-2020-ml-hackathon). To access it, our Kaggle API token (kaggle.json) is first uploaded to session storage, and the dataset is then downloaded there. As it arrives as a zip file containing three CSV files, it has to be unzipped.

In [None]:
import numpy as np
import pandas as pd

In [None]:
!pip install kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!kaggle datasets download -d vetrirah/janatahack-independence-day-2020-ml-hackathon

Downloading janatahack-independence-day-2020-ml-hackathon.zip to /content
  0% 0.00/11.4M [00:00<?, ?B/s] 79% 9.00M/11.4M [00:00<00:00, 92.7MB/s]
100% 11.4M/11.4M [00:00<00:00, 104MB/s] 


In [None]:
!unzip janatahack-independence-day-2020-ml-hackathon.zip

Archive:  janatahack-independence-day-2020-ml-hackathon.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


The required modules to train our models are installed first. The scikit-multilearn module is used to create a Label Powerset model and the simpletransformers module is used to create a RoBERTa model.

In [None]:
!pip install scikit-multilearn
!pip install simpletransformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-multilearn
  Downloading scikit_multilearn-0.2.0-py3-none-any.whl (89 kB)
Installing collected packages: scikit-multilearn
Successfully installed scikit-multilearn-0.2.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting simpletransformers
  Downloading simpletransformers-0.63.11-py3-none-any.whl (250 kB)
Collecting transformers>=4.6.0 (from simpletransformers)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting datasets (from si

All the necessary libraries and functions are imported first. Any further libraries or functions are imported separately where they are needed.

In [None]:
import re
import spacy
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from skmultilearn.problem_transform import LabelPowerset
from simpletransformers.classification import MultiLabelClassificationModel
from sklearn.metrics import classification_report, accuracy_score

# **Data Analysis**

The training and testing datasets are read into separate dataframes using read_csv, which reads a CSV (comma-separated values) file.

In [None]:
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

A glimpse of the training data is shown using head. If the number of rows is not specified, the first five rows will automatically be printed. We check the number of rows and columns using shape. We check the number of null values and the data type of each variable using info.

In [None]:
train.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


In [None]:
train.shape

(20972, 9)

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20972 entries, 0 to 20971
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   ID                    20972 non-null  int64 
 1   TITLE                 20972 non-null  object
 2   ABSTRACT              20972 non-null  object
 3   Computer Science      20972 non-null  int64 
 4   Physics               20972 non-null  int64 
 5   Mathematics           20972 non-null  int64 
 6   Statistics            20972 non-null  int64 
 7   Quantitative Biology  20972 non-null  int64 
 8   Quantitative Finance  20972 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 1.4+ MB


To check the number of articles for each label, we cast every value to boolean and sum each column, which counts the non-zero entries. We can see that over 8000 articles are labelled with computer science, while only around 250 articles are labelled with quantitative finance.

In [None]:
train.astype(bool).sum(axis=0)

ID                      20972
TITLE                   20972
ABSTRACT                20972
Computer Science         8594
Physics                  6013
Mathematics              5618
Statistics               5206
Quantitative Biology      587
Quantitative Finance      249
dtype: int64
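
The imbalance between labels is easier to judge as percentages. The sketch below reuses the counts printed above; the names `label_counts` and `label_share` are only illustrative and do not appear elsewhere in this notebook.

```python
# Counts copied from the output above; 20972 is the number of training rows.
label_counts = {
    'Computer Science': 8594,
    'Physics': 6013,
    'Mathematics': 5618,
    'Statistics': 5206,
    'Quantitative Biology': 587,
    'Quantitative Finance': 249,
}
n_rows = 20972

# Share of training articles carrying each label, in percent.
# The shares add up to more than 100 because an article can have several labels.
label_share = {label: round(100 * count / n_rows, 1)
               for label, count in label_counts.items()}

for label, share in label_share.items():
    print(f"{label}: {share}%")
```
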

The previous steps are also applied to the testing data.

In [None]:
test.head()

Unnamed: 0,ID,TITLE,ABSTRACT
0,20973,Closed-form Marginal Likelihood in Gamma-Poiss...,We present novel understandings of the Gamma...
1,20974,Laboratory mid-IR spectra of equilibrated and ...,Meteorites contain minerals from Solar Syste...
2,20975,Case For Static AMSDU Aggregation in WLANs,Frame aggregation is a mechanism by which mu...
3,20976,The $Gaia$-ESO Survey: the inner disk intermed...,Milky Way open clusters are very diverse in ...
4,20977,Witness-Functions versus Interpretation-Functi...,Proving that a cryptographic protocol is cor...


In [None]:
test.shape

(8989, 3)

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8989 entries, 0 to 8988
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   ID        8989 non-null   int64 
 1   TITLE     8989 non-null   object
 2   ABSTRACT  8989 non-null   object
dtypes: int64(1), object(2)
memory usage: 210.8+ KB


# **Data Preprocessing**

As we have a large amount of training data, a total of 10,000 samples are taken to be used in this code.

In [None]:
train = train.sample(10000)

A function to cleanse our data is defined below. All letters are first converted to lowercase. Certain patterns are then either removed or replaced with a white space:
- re.sub(r'\d+', '', i ): removing one or more digit characters
- re.sub(r'[^\w]', ' ', i): replacing every character that is not a letter, digit or underscore with a white space
- re.sub(r'https', '', i): removing https
- re.sub(r'com', '', i): removing com wherever it occurs, including inside longer words
- re.sub(r'((?<=^)|(?<= )).((?=$)|(?= ))', '', i): removing single character words
- re.sub(r'\s+', ' ', i): replacing one or more white spaces with a single white space
- re.sub(r'\s$', '', i): removing a trailing white space

In [None]:
def cleansing(df):
    df_clean = df.str.lower()
    df_clean = [re.sub(r'\d+', '', i ) for i in df_clean]
    df_clean = [re.sub(r'[^\w]', ' ', i) for i in df_clean]
    df_clean = [re.sub(r'https', '', i) for i in df_clean]
    df_clean = [re.sub(r'com', '', i) for i in df_clean]
    df_clean = [re.sub(r'((?<=^)|(?<= )).((?=$)|(?= ))', '', i) for i in df_clean]
    df_clean = [re.sub(r'\s+', ' ', i) for i in df_clean]
    df_clean = [re.sub(r'\s$', '', i) for i in df_clean]
    return df_clean
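
To see what these substitutions do to a single string, here is a minimal sketch that applies the same patterns in the same order. The helper `cleanse_one` and the sample title are hypothetical and are not part of the notebook.

```python
import re

def cleanse_one(text):
    """Apply the cleansing steps above to one string (illustrative helper)."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)        # drop digits
    text = re.sub(r'[^\w]', ' ', text)     # non-word characters become spaces
    text = re.sub(r'https', '', text)      # drop the substring "https"
    text = re.sub(r'com', '', text)        # drop the substring "com" anywhere
    text = re.sub(r'((?<=^)|(?<= )).((?=$)|(?= ))', '', text)  # single-character words
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
    text = re.sub(r'\s$', '', text)        # drop one trailing whitespace character
    return text

sample = "Computing 2 Graphs: A Survey!"   # made-up title for illustration
print(cleanse_one(sample))                 # prints: puting graphs survey
```

Note that "computing" becomes "puting": the `com` pattern is not limited to whole words, so it also removes those letters from inside longer words.
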

The columns that contain texts are combined into a single column and stored in a new column named 'text'. It is then cleansed using the function defined previously. A glimpse of the clean text is shown below.

In [None]:
train['text'] = train['TITLE'] + train['ABSTRACT']
train['clean_text'] = cleansing(train['text'])
train['clean_text'].head()

1419     construction of directed graphs we study the p...
5290     temporally identity aware ssd with attentional...
14665    harpo mev gamma ray beam validation of high an...
462      cost effective seed selection in online social...
7133     asymptotics for small nonlinear price impact p...
Name: clean_text, dtype: object

We combine all label values into a single vector. They are stored in a new column named 'label'.

In [None]:
train['label'] = train.apply(lambda x: [x['Computer Science'], x['Physics'], x['Mathematics'], x['Statistics'], x['Quantitative Biology'], x['Quantitative Finance']], axis=1)
train['label'].head()

1419     [1, 0, 0, 0, 0, 0]
5290     [1, 0, 0, 0, 0, 0]
14665    [0, 1, 0, 0, 0, 0]
462      [1, 0, 0, 0, 0, 0]
7133     [0, 0, 0, 0, 0, 1]
Name: label, dtype: object

The previous steps are also applied to the testing data.

In [None]:
test['text'] = test['TITLE'] + test['ABSTRACT']
test['clean_text'] = cleansing(test['text'])
test['clean_text'].head()

0    closed form marginal likelihood in gamma poiss...
1    laboratory mid ir spectra of equilibrated and ...
2    case for static amsdu aggregation in wlans fra...
3    the gaia eso survey the inner disk intermediat...
4    witness functions versus interpretation functi...
Name: clean_text, dtype: object

All English stop words are loaded from the spacy library. Words in the clean text that match those stop words are removed.

In [None]:
nlp = spacy.load("en_core_web_sm")
train['clean_text'] = train['clean_text'].apply(lambda x: ' '.join([word for word in x.split() if not nlp.vocab[word].is_stop]))
test['clean_text'] = test['clean_text'].apply(lambda x: ' '.join([word for word in x.split() if not nlp.vocab[word].is_stop]))
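
The filtering step itself is a simple membership test. The sketch below shows the idea with a small hand-written stop word set. `STOP_WORDS` and `remove_stop_words` are assumptions made for illustration; spaCy's actual stop word list is much longer.

```python
# Illustrative subset only; this is NOT spaCy's stop word list.
STOP_WORDS = {"the", "of", "we", "in", "and", "a", "to", "is"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated words that are not stop words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("construction of directed graphs we study the problem"))
```
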

The clean text of the training data is set as our x variable and the labels are set as our y variable. A glimpse of each is shown below.

In [None]:
x = train['clean_text']
x.head()

1419     construction directed graphs study problem con...
5290     temporally identity aware ssd attentional lstm...
14665    harpo mev gamma ray beam validation high angul...
462      cost effective seed selection online social ne...
7133     asymptotics small nonlinear price impact pde a...
Name: clean_text, dtype: object

In [None]:
y = train[['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']]
y.head()

Unnamed: 0,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
1419,1,0,0,0,0,0
5290,1,0,0,0,0,0
14665,0,1,0,0,0,0
462,1,0,0,0,0,0
7133,0,0,0,0,0,1


The training data is split into two. 70% is allocated for the training set and the rest is allocated for the validation set.

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size = 0.3, random_state = 42)

The clean text of the testing data is set as our x test variable.

In [None]:
x_test = test['clean_text']
x_test.head()

0    closed form marginal likelihood gamma poisson ...
1    laboratory mid ir spectra equilibrated igneous...
2    case static amsdu aggregation wlans frame aggr...
3    gaia eso survey inner disk intermediate age op...
4    witness functions versus interpretation functi...
Name: clean_text, dtype: object

The clean text and label columns of the training data are selected for our RoBERTa model. 70% is allocated to the training set and 30% to the validation set.

In [None]:
rb_data = train[['clean_text', 'label']]
rb_data.head()

Unnamed: 0,clean_text,label
1419,construction directed graphs study problem con...,"[1, 0, 0, 0, 0, 0]"
5290,temporally identity aware ssd attentional lstm...,"[1, 0, 0, 0, 0, 0]"
14665,harpo mev gamma ray beam validation high angul...,"[0, 1, 0, 0, 0, 0]"
462,cost effective seed selection online social ne...,"[1, 0, 0, 0, 0, 0]"
7133,asymptotics small nonlinear price impact pde a...,"[0, 0, 0, 0, 0, 1]"


In [None]:
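rb_train = rb_data.sample(frac=0.7, random_state=42)
# Use the rows that were not drawn for training. Sampling again with the
# same random_state would return rows that are already in rb_train.
rb_val = rb_data.drop(rb_train.index)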
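
A validation set is only informative if it shares no rows with the training set. As a minimal sketch of a disjoint 70/30 split, the snippet below shuffles row positions once and cuts the shuffled list in two. It uses only the standard library, and `split_indices` is a hypothetical helper rather than anything from pandas or scikit-learn.

```python
import random

def split_indices(n_rows, train_frac=0.7, seed=42):
    """Shuffle row positions once, then cut the list into two disjoint parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_frac)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(10)
print(len(train_idx), len(valid_idx))       # sizes of the two parts
print(set(train_idx) & set(valid_idx))      # empty set: no shared rows
```
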

# **TFIDF Vectorizer**

TF-IDF (term frequency-inverse document frequency) is used to show how important a word is to a document in a collection or corpus. TF (term frequency) measures how often a term appears in a document. IDF (inverse document frequency) measures how rare a term is across all documents in the corpus, so terms that appear in nearly every document receive a low weight.

To weight each term in our data, TfidfVectorizer is used. It converts a collection of raw documents into a matrix of TF-IDF features. The fit_transform function learns the vocabulary and IDF weights from the training text and returns its document-term matrix. The transform function reuses that learned vocabulary and those weights to convert new text into the same feature space. The vectorizer is not fitted on the validation and test sets, as doing so would leak information from those sets into the features.

In [None]:
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(x_train)
valid_tfidf = vectorizer.transform(x_valid)
test_tfidf = vectorizer.transform(x_test)
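
The difference between fitting and transforming can be sketched in plain Python. The code below uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is what scikit-learn documents as the TfidfVectorizer defaults. Tokenisation is reduced to whitespace splitting, and the function names `fit_idf` and `tfidf_transform` as well as the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn the vocabulary and IDF weights from the training documents only."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def tfidf_transform(docs, vocab, idf):
    """Convert documents to L2-normalised TF-IDF rows using learned weights."""
    rows = []
    for doc in docs:
        tf = Counter(w for w in doc.split() if w in idf)  # unseen words are ignored
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows

train_docs = ["graph theory proof", "neural network training", "graph neural network"]
valid_docs = ["graph kernel proof"]   # "kernel" never appears in train_docs

vocab, idf = fit_idf(train_docs)                         # fit on training text only
train_rows = tfidf_transform(train_docs, vocab, idf)
valid_rows = tfidf_transform(valid_docs, vocab, idf)     # reuse the learned weights
print(vocab)
print(valid_rows[0])
```

The validation document gets non-zero weights only for "graph" and "proof". The word "kernel" is dropped because it was never part of the fitted vocabulary.
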

# **Label Powerset Model**

Label Powerset is a problem transformation method for multi-label data. It turns a multi-label problem into a multi-class problem in which each distinct label set found in the training data becomes one class. Because each combination of labels is treated as a unique class, the model takes possible correlations between labels into account.
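
The transformation can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the scikit-multilearn implementation, and the names `fit_powerset`, `class_of` and `labelset_of` are made up for this example.

```python
def fit_powerset(label_rows):
    """Give every distinct label set its own class id, in order of appearance."""
    mapping = {}
    for row in label_rows:
        mapping.setdefault(tuple(row), len(mapping))
    return mapping

# Toy label matrix using the six labels of this notebook.
y_toy = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

class_of = fit_powerset(y_toy)
classes = [class_of[tuple(row)] for row in y_toy]   # multi-class targets
labelset_of = {cls: list(labels) for labels, cls in class_of.items()}

print(classes)           # one class id per row
print(labelset_of[2])    # a predicted class id maps back to a label set
```

A multi-class classifier is then trained on `classes`, and each predicted class id is mapped back to its label set through `labelset_of`.
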

Advantages
- This method usually yields higher accuracy compared to other methods.
- It is highly efficient and takes less time to train.
- As it considers possible correlations between class labels, it is suitable for training data containing relevant label combinations.

Disadvantages
- The number of classes can grow exponentially with the number of labels (up to 2^n label sets for n labels), so many label sets end up with very few training examples. This results in a higher model complexity, which may lead to a lower accuracy.
- It can only predict label sets that appear in the training data, so a label combination that is new in the testing data is never predicted.

In [None]:
lp = LabelPowerset(LogisticRegression())
lp.fit(train_tfidf, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
predict_valid = lp.predict(valid_tfidf)

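To see how well the model performs against the validation set, we print the classification report and accuracy. The accuracy here is subset accuracy, where a prediction counts as correct only if all six labels match. It comes to about 0.67, with a micro-averaged F1 score of 0.80. The four frequent labels are predicted reasonably well, but recall is very low for Quantitative Biology (0.03) and Quantitative Finance (0.09), the two rarest labels.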

In [None]:
print(classification_report(y_valid, predict_valid))

              precision    recall  f1-score   support

           0       0.78      0.87      0.83      1198
           1       0.91      0.85      0.88       875
           2       0.84      0.76      0.80       809
           3       0.82      0.59      0.69       726
           4       1.00      0.03      0.05        80
           5       1.00      0.09      0.17        44

   micro avg       0.83      0.76      0.80      3732
   macro avg       0.89      0.53      0.57      3732
weighted avg       0.84      0.76      0.78      3732
 samples avg       0.85      0.80      0.81      3732



In [None]:
accuracy_score(y_valid, predict_valid)

0.6706666666666666
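
For multi-label targets, accuracy_score counts a sample as correct only when every label matches, which is stricter than scoring each label on its own. The toy sketch below contrasts the two views. The data is made up and the variable names are illustrative.

```python
# Toy ground truth and predictions with three labels per sample.
y_true_toy = [[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]]
y_pred_toy = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]

# Subset accuracy: the whole label vector has to match.
subset_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

# Per-label accuracy: every individual label is scored separately.
pairs = [(a, b) for t, p in zip(y_true_toy, y_pred_toy) for a, b in zip(t, p)]
per_label_accuracy = sum(a == b for a, b in pairs) / len(pairs)

print(subset_accuracy)      # 2 of 4 samples match exactly
print(per_label_accuracy)   # 10 of 12 individual labels match
```
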

# **RoBERTa Model**

RoBERTa, which stands for Robustly Optimized BERT Pretraining Approach, is a modification of BERT developed by researchers at Facebook AI. BERT itself stands for Bidirectional Encoder Representations from Transformers, a transformer model that uses self-attention to process input sequences and generate contextual representations of the words in a sentence.

Advantages
- This method removes the NSP (next sentence prediction) objective, which allows for an improvement in downstream task performance.
- It is trained over a longer period of time and with more data, which results in a higher accuracy and better results.

Disadvantages
- This method takes a long time to run, hence the small number of epochs set in this model.
- A longer run time also means a larger carbon footprint.

In [None]:
rb = MultiLabelClassificationModel("roberta", "roberta-base", num_labels = 6, args={"reprocess_input_data": True, "overwrite_output_dir": True, "num_train_epochs": 3},)
rb.train_model(rb_train)

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForMultiLabelSequenceClassification: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultiLabelSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'roberta.pooler.dense.bia

Downloading (…)olve/main/vocab.json: 0.00B [00:00, ?B/s]

Downloading (…)olve/main/merges.txt: 0.00B [00:00, ?B/s]



  0%|          | 0/7000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/875 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/875 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/875 [00:00<?, ?it/s]

(2625, 0.19583024549058506)

To see how well the model performs against the validation set, we print the LRAP (label ranking average precision) score and the binary cross entropy loss. The LRAP of about 0.97 is close to its maximum of 1 and the loss of about 0.12 is low, which suggests that the model ranks the correct labels highly on this set.

In [None]:
rb_result, rb_model_outputs, rb_wrong_predictions = rb.eval_model(rb_val)
print(rb_result)



  0%|          | 0/3000 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/375 [00:00<?, ?it/s]

{'LRAP': 0.9682000000000012, 'eval_loss': 0.11522358655681213}
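
LRAP asks, for each true label of a sample, what fraction of the labels ranked at or above it are also true, and then averages over labels and samples. The sketch below follows that standard definition in plain Python. It is not the code simpletransformers runs internally, and it assumes that every sample has at least one true label.

```python
def lrap(y_true, y_score):
    """Label ranking average precision (assumes >= 1 true label per sample)."""
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, flag in enumerate(truth) if flag]
        precisions = []
        for j in relevant:
            # Rank of label j: number of labels scored at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those highly scored labels are true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            precisions.append(hits / rank)
        per_sample.append(sum(precisions) / len(precisions))
    return sum(per_sample) / len(per_sample)

# Toy example: the true label is ranked 2nd in the first row and 3rd in the second.
toy_truth = [[1, 0, 0], [0, 0, 1]]
toy_scores = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
print(lrap(toy_truth, toy_scores))   # (1/2 + 1/3) / 2
```

A score of 1 means that every true label is ranked above every false label for every sample.
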


In [None]:
rb_x = x_test.tolist()
rb_predict_test, m = rb.predict(rb_x)
rb_predict_test

  0%|          | 0/8989 [00:00<?, ?it/s]

  0%|          | 0/1124 [00:00<?, ?it/s]

[[0, 0, 1, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 1, 0, 0],
 [1, 0, 1, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 1, 1, 0, 0],
 [1, 0, 0, 1, 0, 0],
 [1, 0, 1, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [0, 0, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0],
 [1, 0, 0, 0,

# **Label Prediction**

The labels of the testing data are predicted using RoBERTa. The result is shown below.

In [None]:
test['predictions'] = rb_predict_test
test[['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']] = pd.DataFrame(test['predictions'].tolist(), index = test.index)

In [None]:
sub = test[['ID', 'Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance']]
sub.head()

Unnamed: 0,ID,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,20973,0,0,1,1,0,0
1,20974,0,1,0,0,0,0
2,20975,1,0,0,0,0,0
3,20976,0,1,0,0,0,0
4,20977,1,0,0,0,0,0
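
Each prediction is a 0/1 vector in the same order as the label columns. As a small sketch, a vector can be turned back into readable label names as follows. `LABELS` and `label_names` are illustrative names, and the sample vector is the prediction for the first test row shown above.

```python
# Same column order as the label columns used throughout this notebook.
LABELS = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
          'Quantitative Biology', 'Quantitative Finance']

def label_names(prediction):
    """Return the names of the labels whose flag is set in a 0/1 vector."""
    return [name for name, flag in zip(LABELS, prediction) if flag]

print(label_names([0, 0, 1, 1, 0, 0]))   # prediction for ID 20973
```
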


# **Conclusion**

If the main focus is to find the probability of which label is assigned to which text, Label Powerset is a good option. It is quick to train and yields fairly good accuracy. If the main focus is to predict the labels of each text, RoBERTa is a good option. It is highly accurate and easy to understand.