### Name:- Sudhanshu Vishwakarma
### Batch:- B-2 Hons
### Subject:- Computational Linguistics Lab Exam

**Problem Statement**:- Download the above dataset and build a text classifier model that can predict the subject areas given paper abstracts and titles using latest NLP techniques taught to you. Show the use of word2vec and BERT especially here.Extra 1 hour is given to write comments in your code and upload it. Proper commenting after each function or wherever seems fit should be done. Upload the pdf of your code here. Plag should not be more than 10%.



# Word2Vec Implementation

importing libraries for word2vec implementation

In [32]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import LSTM, Bidirectional
from tensorflow.keras.layers import Embedding
from gensim.parsing.preprocessing import preprocess_string
from tensorflow.keras.preprocessing.sequence import pad_sequences




importing dataset

In [33]:
# Load the dataset
df = pd.read_csv('arxiv_data.csv')



In [34]:
df.head()

Unnamed: 0,titles,summaries,terms
0,Survey on Semantic Stereo Matching / Semantic ...,Stereo matching is one of the widely used tech...,"['cs.CV', 'cs.LG']"
1,FUTURE-AI: Guiding Principles and Consensus Re...,The recent advancements in artificial intellig...,"['cs.CV', 'cs.AI', 'cs.LG']"
2,Enforcing Mutual Consistency of Hard Regions f...,"In this paper, we proposed a novel mutual cons...","['cs.CV', 'cs.AI']"
3,Parameter Decoupling Strategy for Semi-supervi...,Consistency training has proven to be an advan...,['cs.CV']
4,Background-Foreground Segmentation for Interio...,"To ensure safety in automated driving, the cor...","['cs.CV', 'cs.LG']"


Making a Word2Vec Tokenizer function(available online) for preprocessing

In [35]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams
from collections import Counter

class Word2VecTokenizer(Tokenizer):
    def __init__(self, texts, vocab_size=5000, window_size=5):
        super().__init__(num_words=vocab_size)
        self.tokenizer = Tokenizer(num_words=vocab_size)
        self.vocab_size = vocab_size
        self.window_size = window_size
        
        self.fit_on_texts(texts)
        self.word2id = self.word_index
        self.id2word = {v: k for k, v in self.word2id.items()}
        self.vocab = set(self.word2id.keys())
        self.word_counts = Counter(self.word_counts)
        
        self.generate_pairs(texts)
        
    def generate_pairs(self, texts):
        data = []
        for text in texts:
            seq = self.tokenizer.texts_to_sequences([str(text)])[0] # explicitly convert to string
            pairs, _ = skipgrams(sequence=seq, vocabulary_size=self.vocab_size, window_size=self.window_size, negative_samples=0)
            for pair in pairs:
                data.append([pair[0], pair[1]])
        self.data = np.array(data)

Splitting the dataset and tuning to fit into the word2vec model

In [36]:
# Split the data into training and validation sets
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)

# Preprocessing the text data
tokenizer = Word2VecTokenizer(train_data, 'summaries')
tokenizer.fit_on_texts(train_data['summaries'])

# Convert text data into sequence of integers
X_train_seq = tokenizer.tokenizer.texts_to_sequences(train_data['summaries'])
X_val_seq = tokenizer.tokenizer.texts_to_sequences(val_data['summaries'])

# Pad the sequence with zeros to have a fixed length
max_seq_len = 500
X_train_padded = pad_sequences(X_train_seq, maxlen=max_seq_len, padding='post')
X_val_padded = pad_sequences(X_val_seq, maxlen=max_seq_len, padding='post')

# Convert labels to binary representation
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train_data['terms'])
y_val = mlb.transform(val_data['terms'])
num_labels = len(mlb.classes_)




Defining the model and running it for 5 epochs

In [37]:
# Define the model architecture
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, 128, input_length=max_seq_len))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(num_labels, activation='sigmoid'))

# Compiling the model for fitting the data into the model and running it
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Training the model by fitting the preprocessed data
model.fit(X_train_padded, y_train, batch_size=128, epochs=5, validation_data=(X_val_padded, y_val))

Epoch 1/5
 35/324 [==>...........................] - ETA: 43:36 - loss: 0.2929 - accuracy: 0.0000e+00

In [None]:
# Evaluation of the model made based on classification report that is precision,recall and f1 score
#also calculating the accuracy
y_pred = model.predict(X_val_padded)
y_pred_labels = mlb.inverse_transform(y_pred > 0.5)
y_val_labels = mlb.inverse_transform(y_val)
print(classification_report(y_val_labels, y_pred_labels))

As you can see that the model is running but will take atleast an hour to train so time boundation is there

### BERT IMPLEMENTATION On this dataset

Importing Libraries for BERT implementation

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
from keras.models import Model
from keras.layers import Input, Dense, Dropout

Converting the terms into singular terms rather than a set of terms

In [None]:
new_data = []
for i, row in df.iterrows():
    for term in row['terms']:
        new_item = {
            'titles': row['titles'],
            'summaries': row['summaries'],
            'terms': [str(term)],
            'label': row['label']
        }
        new_data.append(new_item)

new_df = pd.DataFrame(new_data)

importing BERT and making train_test_split of the data

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
X_train_abstract, X_val_abstract, X_train_title, X_val_title, y_train, y_val = train_test_split(
    abstract_tensors, title_tensors, label_tensors, test_size=0.2, random_state=42)
