<a href="https://colab.research.google.com/github/tobias-hoepfl/Digital-Organizations-SE/blob/main/assignments/assignment_5_Hoepfl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts the subject area an abstract comes from. Next week we will discuss what you learned from the theory part, so you are relatively free in how you fill your learning portfolio on this new topic. In two weeks we will discuss your solutions for the classification model.

# Theory part (filling your Learning Portfolio, May 10)

In preparation for the practical part, I ask you to familiarize yourself with the following resources in the next week:

1) Please watch the following video:

https://course.fast.ai/Lessons/lesson4.html

If you like the video, you are also welcome to work through the accompanying Kaggle notebook.

2) In addition to the video, I recommend reading the first chapters of the course:

https://huggingface.co/learn/nlp-course/chapter1/1


Try to understand the main principles and processes and log them in your learning portfolio! A few suggestions: What is a pre-trained NLP model? How do I load one? What is tokenization? What does fine-tuning mean? What types of NLP models are there? What can I do with the Transformers package? etc.

# Practical part (Assignment, May 17)

1) Preprocessing: The data, which I provide as a zip file in Olat, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

2) We need a training dataset and a test dataset. I suggest that for each research field we use the first 5700 lines for the training dataset and the last 300 lines for the test dataset. Please stick to this split so that we can compare our models more easily!

3) Please use a pre-trained model from Hugging Face to build a classification model that tries to predict the correct research field out of the 26. Please calculate the accuracy for each research field and the overall accuracy across all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also check whether the model improves if you include keywords and titles.

Some links, which can help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!
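The two numbers asked for in step 3 can be computed from plain label lists. The following is a minimal sketch, assuming the true and predicted research fields are available as two equal-length lists of field codes; the function name `accuracy_report` and the toy labels are hypothetical and not part of the assignment data.

```python
from collections import Counter

def accuracy_report(true_fields, predicted_fields):
    """Return (overall accuracy, {field: accuracy}) for two equal-length label lists."""
    if len(true_fields) != len(predicted_fields):
        raise ValueError("label lists must have the same length")
    # How many examples each field has, and how many of them were predicted correctly.
    total = Counter(true_fields)
    correct = Counter(t for t, p in zip(true_fields, predicted_fields) if t == p)
    per_field = {field: correct[field] / n for field, n in total.items()}
    overall = sum(correct.values()) / len(true_fields)
    return overall, per_field

# Hypothetical toy labels, not taken from the assignment data.
y_true = ["MATH", "MATH", "ECON", "ENVI", "ENVI", "ENVI"]
y_pred = ["MATH", "ECON", "ECON", "ENVI", "ENVI", "MATH"]
overall, per_field = accuracy_report(y_true, y_pred)
print(overall)
print(per_field)
```

Because every field contributes the same 300 test lines under the suggested split, the overall accuracy on the full test set would equal the mean of the per-field accuracies; on an uneven sample the two can differ.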

## Preparation

In [1]:
!pip install --upgrade pip
!pip install transformers==4.28.0
!pip install datasets
!pip install sentencepiece
!pip install git+https://github.com/huggingface/accelerate
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/accelerate
  Cloning https://github.com/huggingface/accelerate to /tmp/pip-req-build-acv7g5u5
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate /tmp/pip-req-build-acv7g5u5
  Resolved https://github.com/huggingface/accelerate to commit 612ecef7b8f8e7a405cfba61261f599c47f59ca1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (py

In [2]:
from google.colab import drive
drive.mount('/content/drive')
zip_file = "/content/drive/MyDrive/DigitalOrganizations/data.zip"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Import files

In [3]:
import glob
import zipfile
import pandas as pd

#load files and change format to Keywords | Title | Abstract | Research Field

zf = zipfile.ZipFile(zip_file)
zf.namelist()

train_df = pd.DataFrame()
test_df = pd.DataFrame()

for f in zf.namelist():
  try:
    ndf = (pd.read_csv(zf.open(f)))[['Index Keywords', 'Abstract', 'Title']]
  except pd.errors.ParserError:
    #'data/MATH_1991-2000.csv' contains one faulty line that we will drop
    #Otherwise: Error tokenizing data. C error: Expected 24 fields in line 1061, saw 39
    ndf = (pd.read_csv(zf.open(f), skiprows=range(1060, 1061)))[['Index Keywords', 'Abstract', 'Title']]
  ndf['Research Field'] = ((f.split('_')[0]).split('/')[1])

  #put the last 300 rows of each '2011-2020' file in test_df, everything else in train_df
  if '2011-2020' in f:
    train_ndf = ndf.iloc[:-300]
    test_ndf = ndf.iloc[-300:]
    test_df = pd.concat([test_df, test_ndf], ignore_index=True)
    train_df = pd.concat([train_df, train_ndf], ignore_index=True)
  else:
    train_df = pd.concat([train_df, ndf], ignore_index=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ndf['Research Field'] = ((f.split('_')[0]).split('/')[1])
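
The two core steps of the import cell, deriving the research field from the file name and holding out the last 300 rows, can be sketched without pandas. This is only an illustration: the helper names `research_field` and `split_rows` are hypothetical, and the 6000-row stand-in is an assumption based on the 5700 + 300 split suggested in the assignment text.

```python
from pathlib import PurePosixPath

def research_field(member_name):
    """'data/MATH_1991-2000.csv' -> 'MATH' (file name part before the first underscore)."""
    return PurePosixPath(member_name).name.split("_")[0]

def split_rows(rows, n_test=300):
    """Keep the last n_test rows for testing and everything before them for training."""
    return rows[:-n_test], rows[-n_test:]

# Stand-in for the rows of one research field (hypothetical size).
rows = list(range(6000))
train_rows, test_rows = split_rows(rows)
print(research_field("data/MATH_1991-2000.csv"), len(train_rows), len(test_rows))
```

Taking the file name with `PurePosixPath(...).name` avoids depending on the exact folder depth inside the zip archive, which the chained `split('/')[1]` in the cell above relies on.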


In [4]:
import builtins
labels = builtins.list(set((test_df['Research Field']).tolist()))

In [5]:
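# map each research field name to an integer id (avoid shadowing the built-in dict)
label2id = {}
for i in range(len(labels)):
  label2id[labels[i]] = i

In [6]:
train_df['label'] = train_df['Research Field'].map(label2id)
test_df['label'] = test_df['Research Field'].map(label2id)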
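
One caveat with building the label list from a `set`: the iteration order of a set of strings can differ between Python processes, so the integer ids may change when the notebook is restarted, which matters once a trained model is saved and reloaded. A minimal sketch of a stable alternative, assuming alphabetical order is acceptable; `build_label_maps` is a hypothetical helper and the field codes below are only a small illustrative subset.

```python
def build_label_maps(fields):
    """Map each distinct research field to an integer id in alphabetical order."""
    names = sorted(set(fields))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

# Hypothetical subset of the 26 field codes, for illustration only.
label2id, id2label = build_label_maps(["ENVI", "DENT", "ECON", "ENVI", "EART"])
print(label2id)
print(id2label[3])
```

Keeping the inverse mapping `id2label` around also makes it easy to translate predicted ids back into field names when inspecting results.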

In [7]:
#Take a smaller sample
#When using the whole training set, it was projected to take more than 80 hours (despite using GPU)

train_df = train_df.sample(frac = 0.02, ignore_index=True)
test_df = test_df.sample(frac = 0.02, ignore_index=True)
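
With a 2% random sample, some research fields can end up with very few test rows, and the sample changes on every run. The sketch below shows one possible way to draw a reproducible sample with the same fraction from every field, using only the standard library; `stratified_sample` and the toy rows are hypothetical. In pandas, passing `random_state` to `DataFrame.sample` is the simplest way to at least make the draw repeatable.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed=0):
    """Draw about frac of the rows from every group defined by key(row), reproducibly."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    rng = random.Random(seed)  # fixed seed, so repeated calls give the same sample
    sample = []
    for name in sorted(groups):
        members = groups[name]
        k = max(1, round(len(members) * frac))  # at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical rows: (research field, row number).
rows = [("MATH", i) for i in range(300)] + [("ECON", i) for i in range(300)]
picked = stratified_sample(rows, key=lambda r: r[0], frac=0.02)
print(len(picked))
```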

In [8]:
train_df.head()

| | Index Keywords | Abstract | Title | Research Field | label |
|---|---|---|---|---|---|
| 0 | aquatic environment; bacterium; carbon cycle; ... | Heterotrophic bacteria perform two major funct... | Bacterial growth efficiency in natural aquatic... | ENVI | 9 |
| 1 | article; comparative study; computer assisted ... | Introduction: Cone-beam computed tomography (C... | Accuracy of cone-beam computed tomography imag... | DENT | 2 |
| 2 | Geographic information systems; Population sta... | This paper presents an integrated study of urb... | Evaluating urban expansion and land use change... | ENVI | 9 |
| 3 | convergent margin; earthquake magnitude; geode... | We document geodetic strain across the Nepal H... | Convergence rate across the Nepal Himalaya and... | EART | 4 |
| 4 | | The aim of this paper is to investigate the as... | Asymmetric exchange rate pass-through: Evidenc... | ECON | 22 |


## Fine-tune model

In [9]:
from datasets import Dataset,DatasetDict

train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)

In [10]:
model_name = 'bert-base-uncased'

In [11]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [12]:
train_tok_ds = train_ds.map(lambda x: tokenizer(x['Abstract'], padding='max_length', truncation=True, max_length=512))
test_tok_ds = test_ds.map(lambda x: tokenizer(x['Abstract'], padding='max_length', truncation=True, max_length=512))

Map:   0%|          | 0/2936 [00:00<?, ? examples/s]

Map:   0%|          | 0/156 [00:00<?, ? examples/s]
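
Padding every abstract to `max_length=512` means the model processes 512 token positions per example, even when the abstract is much shorter, which is one plausible contributor to the long training time. The sketch below compares that with padding each batch only to its longest member (often called dynamic padding; in Transformers this is what `DataCollatorWithPadding` is intended for). The function `padded_tokens` and the token lengths are hypothetical, not measured on the assignment data.

```python
def padded_tokens(lengths, batch_size, max_length=None):
    """Count token positions processed when padding to max_length,
    or, if max_length is None, when padding each batch to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = max_length if max_length is not None else max(batch)
        total += width * len(batch)
    return total

# Hypothetical abstract lengths in tokens.
lengths = [180, 220, 250, 300, 210, 190, 260, 240]
fixed = padded_tokens(lengths, batch_size=8, max_length=512)
dynamic = padded_tokens(lengths, batch_size=8)
print(fixed, dynamic)
```

For these assumed lengths, fixed padding processes 4096 positions per batch and dynamic padding 2400; the real saving depends on the actual length distribution of the abstracts.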

In [13]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "test-trainer",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8, 
    per_device_eval_batch_size=8, 
    num_train_epochs=3
    )

In [14]:
import evaluate
accuracy = evaluate.load("accuracy")

import numpy as np

def compute_metrics_func(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [15]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=26)
trainer = Trainer(
    model, 
    args, 
    train_dataset=train_tok_ds, 
    eval_dataset=test_tok_ds,
    compute_metrics=compute_metrics_func
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [16]:
trainer.train()
#Accuracy on the sampled test set reaches about 0.78 after three epochs



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.22807 | 0.666667 |
| 2 | 1.770700 | 0.883072 | 0.75641 |
| 3 | 0.732800 | 0.843106 | 0.775641 |


TrainOutput(global_step=1101, training_loss=1.1928750193194841, metrics={'train_runtime': 888.4799, 'train_samples_per_second': 9.914, 'train_steps_per_second': 1.239, 'total_flos': 2317981561307136.0, 'train_loss': 1.1928750193194841, 'epoch': 3.0})
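
The step count in the training output can be reproduced by hand, which also explains the "No log" entry for the first epoch. This is a small worked calculation, assuming a single device, no gradient accumulation, and the default `logging_steps` of 500 (the cell above does not set it).

```python
import math

n_train = 2936        # examples reported by the Map progress bar
batch_size = 8        # per_device_train_batch_size
epochs = 3            # num_train_epochs
logging_steps = 500   # assumed default, not set explicitly in the notebook

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
# Step number at which each epoch ends, and how many loss logs exist by then.
epoch_ends = [steps_per_epoch * e for e in range(1, epochs + 1)]
logs_by_epoch_end = [end // logging_steps for end in epoch_ends]
print(steps_per_epoch, total_steps, epoch_ends, logs_by_epoch_end)
```

The first epoch ends at step 367, before the first training-loss log at step 500, so there is nothing to report yet. The total of 1101 steps matches `global_step=1101` in the output.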

In [17]:
trainer.save_model("/content/drive/My Drive/DigitalOrganizations/model")

## Top 3 accuracy

Addition: Accuracy (top-1) checks whether the research field with the highest predicted score matches the target. With 26 research fields, it is also informative to check whether the correct field is at least among the three highest-scoring predictions (top-3 accuracy). For example:

$\begin{pmatrix} A \\ B \\ C \\ D \\ E \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.95 \\ 0.5 \\ 0.2 \\ 0.3 \end{pmatrix} \rightarrow \text{Choice}_1 = \{B\}, \quad \text{Choice}_3 = \{B, C, E\}$
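
The same example can be checked in a few lines of Python. This is a minimal sketch; `top_k` is a hypothetical helper, and the scores are the ones from the formula.

```python
def top_k(scores, k):
    """Return the k labels with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

scores = {"A": 0.1, "B": 0.95, "C": 0.5, "D": 0.2, "E": 0.3}
choice_1 = top_k(scores, 1)
choice_3 = top_k(scores, 3)
print(choice_1, choice_3)
```

If the true label were E, the prediction would count as wrong for top-1 accuracy but as correct for top-3 accuracy.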

In [18]:
prediction_output = trainer.predict(test_tok_ds)

In [19]:
pred_value = prediction_output.predictions
pred_label = prediction_output.label_ids

In [20]:
counter_right_predictions = 0

for i in range(len(pred_label)):
  lab = pred_label[i]
  #indices of the three highest scores for example i
  top_3 = np.argsort(pred_value[i])[-3:]

  #check whether the actual label is contained in the top 3
  if lab in top_3:
    counter_right_predictions += 1

#top-3 accuracy is higher than the top-1 accuracy of about 0.78 reported during training
print(counter_right_predictions / len(pred_label))

0.9358974358974359
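
A note on the scores used here: for a sequence-classification model, the values in `prediction_output.predictions` are, as far as I can tell, raw logits rather than probabilities. A softmax turns them into probabilities that sum to 1 without changing their order, so the top-3 result is the same either way. The sketch below illustrates this with made-up scores; `softmax` is written out by hand only for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable form)."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(values):
    """Indices of the values, ordered from highest to lowest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

logits = [0.1, 0.95, 0.5, 0.2, 0.3]  # hypothetical raw scores for five classes
probs = softmax(logits)
print(round(sum(probs), 6), ranking(logits)[:3], ranking(probs)[:3])
```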


**Observations and learnings:**

- Improvement flattens out quickly: accuracy rises from 0.67 after epoch 1 to 0.76 after epoch 2, but only to 0.78 after epoch 3. In epoch 3 the training loss (0.73) also falls below the validation loss (0.84), which may indicate the onset of overfitting.

- The result is still quite impressive, considering that only 2% of the given data was used.

- When the top 3 predictions are taken into account, accuracy rises from about 0.78 to about 0.94.

- I ran into many issues while doing this assignment, most of which I could resolve. The performance issue remained despite using a GPU (training on the full training set was projected to take more than 80 hours), which is why I used only a fraction of the given dataset for training and testing.