# BERT for Sentence Classification

This notebook is part of our Media Framing Analysis on a joint dataset from BBC and Aljazeera news articles from 7 October to 31 December 2023 on Palestine-Israel conflict.

- 1. We first created a dataset df_death.json composed of sentences containing keywords: ['kill', 'murder', 'slaughter', "massacre", "casualties", "fatalit", "die", "dead"], you can see Media_Framing_in_News.ipynb for more details.

- 2. In an attempt to develop a text classification model able to detect whether a given sentence is reporting Palestinian fatalities or Israeli fatalities in our dataset.

- 3. We want our text classification algorithm to assign one of these 3 labels to the sentences:
    - _Palestinians_ for sentences reporting Palestine's war casualties
    - _Israelis_ for sentences reporting Israel's war casualties
    - _Not available_ for sentences in which the party suffering the losses is not detectable, or is not either Israel or Palestine.

- 4. We trained our model on AI-generated and manually augmented dataset to be able to evaluate model performance in the prediction of labels.

## Install and Import packages

In [1]:
#!pip install transformers datasets

In [2]:
#pip install transformers[torch]

In [None]:
import re
import pandas as pd
import numpy as np

## Generation of Labeled Data

1. Using various prompts (sometimes asking for numbered reportings, sometimes specifying the object (Palestinian people),and sometimes specifying the subject (Israeli army) we first asked ChatGPT to generate us several sentences mentioning Palestinian fatalities.

In [None]:
palestinian_sentences = [
    "Following an onslaught, scores of Palestinian lives were tragically lost.",
"A devastating assault claimed the lives of numerous Palestinian civilians.",
"The onslaught resulted in the tragic demise of many Palestinians.",
"The assault led to the untimely demise of a multitude of Palestinian individuals.",
"A brutal attack resulted in the loss of Palestinian lives.",
"Palestinian casualties soared as a result of the violent attack.",
"The offensive resulted in the loss of countless Palestinian lives.",
"An aggressive attack took a toll on the lives of numerous Palestinians.",
"A relentless assault claimed the lives of several Palestinian victims.",
"The attack resulted in the tragic passing of many Palestinian civilians.",
"A forceful assault led to the tragic demise of a multitude of Palestinian souls.",
"Many Palestinian lives were lost in the wake of the brutal attack.",
"Palestinian casualties mounted following the violent assault.",
"The offensive led to the unfortunate loss of countless Palestinian lives.",
"An intense attack resulted in the tragic loss of numerous Palestinian individuals.",
"A merciless assault took the lives of several Palestinian victims.",
"The attack caused the untimely passing of many Palestinian civilians.",
"A severe assault claimed the lives of a multitude of Palestinian souls.",
"The loss of Palestinian lives was profound in the wake of the brutal attack.",
"Numerous Palestinian casualties were reported following the violent assault.",
"Amid escalating tensions, 20 Palestinians were killed in a devastating airstrike.",
"Reports indicate that 15 Palestinians lost their lives in the conflict-ridden region.",
"In a tragic turn of events, the death toll rose to 30 Palestinians in the conflict zone.",
"A war-torn area witnessed the demise of 25 Palestinian civilians amid ongoing clashes.",
"Violent clashes have claimed the lives of 18 Palestinians in a military offensive.",
"Reports from the strife-torn region confirm 22 Palestinian casualties in recent clashes.",
"Amidst escalating violence, the death toll has surged to 28 Palestinians.",
"The conflict has left a devastating toll, with 16 Palestinians reported dead.",
"In a harrowing development, the death toll among Palestinians rose to 21.",
"Tensions escalated, resulting in the tragic death of 23 Palestinian civilians.",
"A recent attack has claimed the lives of 14 Palestinians in the region.",
"The conflict zone witnessed the tragic demise of 19 Palestinian individuals.",
"A devastating offensive has led to the death of 26 Palestinians, including civilians.",
"Recent clashes have resulted in the untimely demise of 17 Palestinians.",
"In a distressing turn of events, the death toll rose to 29 Palestinians.",
"Amid ongoing hostilities, 24 Palestinians lost their lives in the conflict zone.",
"The conflict has led to the tragic death of 31 Palestinian civilians.",
"Reports confirm that 13 Palestinians were killed by Israel in the recent spate of violence.",
"The region remains fraught with conflict, claiming the lives of 27 Palestinians.",
"A recent attack has resulted in the tragic loss of 20 Palestinian lives.",
"The death toll has tragically risen to 32 Palestinians amid escalating tensions.",
"In a heart-wrenching turn of events, 18 Palestinians were reported dead.",
"Tensions in the area have led to the untimely demise of 25 Palestinians.",
"Amidst ongoing conflict, the death toll among Palestinians reached 30.",
"Recent clashes have claimed the lives of 16 Palestinians in the strife-torn region.",
"The region remains embroiled in conflict, resulting in the death of 21 Palestinians.",
"The conflict zone witnessed the tragic passing of 29 Palestinian civilians.",
"In a tragic development, 14 Palestinians lost their lives in the conflict.",
"Amid escalating violence, the death toll surged to 23 Palestinian casualties.",
"Recent hostilities led to the untimely demise of 19 Palestinians in the area.",
"The death toll tragically rose to 26 Palestinians amidst ongoing clashes.",
"Tensions escalated, claiming the lives of 18 Palestinians in the conflict zone.",
"A recent offensive led to the tragic loss of 22 Palestinian lives.",
"The conflict has resulted in the untimely demise of 28 Palestinians, including civilians.",
"Recent clashes reported the death of 15 Palestinians in the strife-torn region.",
"Amidst escalating hostilities, the death toll surged to 31 Palestinians.",
"Reports confirm the tragic loss of 20 Palestinian lives in the conflict.",
"In a distressing turn of events, 17 Palestinians were reported dead.",
"Tensions in the area claimed the lives of 24 Palestinians in recent clashes.",
"The conflict has tragically claimed the lives of 29 Palestinian civilians.",
"The strife-torn region reported the death of 13 Palestinians in recent attacks.",
"Amid escalating tensions, 26 Palestinians lost their lives in the conflict zone.",
"Recent clashes have led to the untimely demise of 21 Palestinians.",
"The death toll tragically rose to 30 Palestinians amidst ongoing hostilities.",
"Tensions escalated, claiming the lives of 19 Palestinians in the strife-torn area.",
"A recent offensive led to the tragic loss of 23 Palestinian lives.",
"The conflict has resulted in the untimely demise of 27 Palestinians, including civilians.",
"Recent clashes reported the death of 16 Palestinians in the conflict zone.",
"Amidst escalating hostilities, the death toll surged to 32 Palestinians.",
"Reports confirm the tragic loss of 18 Palestinian lives in the ongoing conflict.",
    "A devastating airstrike claimed the lives of 20 Palestinians in the war-torn region.",
"Reports indicate that 15 Palestinians were murdered amidst escalating tensions.",
"In a tragic turn of events, the death toll rose to 30 Palestinians slaughtered in the conflict.",
"A conflict-ridden area witnessed the demise of 25 Palestinian civilians amid ongoing clashes.",
"Violent clashes have resulted in the deaths of 18 Palestinians in a military offensive.",
"Reports from the strife-torn region confirm 22 Palestinian casualties in recent clashes.",
"Amidst escalating violence, the death toll has surged to 28 Palestinians killed.",
"The conflict has left a devastating toll, with 16 Palestinians reported murdered.",
"In a harrowing development, the death toll among Palestinians rose to 21.",
"Tensions escalated, resulting in the tragic death of 23 Palestinian civilians.",
"A recent attack has led to the murder of 14 Palestinians in the region.",
"The conflict zone witnessed the tragic demise of 19 Palestinian individuals.",
"A devastating offensive has led to the slaughter of 26 Palestinians, including civilians.",
"Recent clashes have resulted in the killing of 17 Palestinians.",
"In a distressing turn of events, the death toll rose to 29 Palestinians slaughtered.",
"Amid ongoing hostilities, 24 Palestinians lost their lives in the conflict zone.",
"The conflict has led to the tragic deaths of 31 Palestinian civilians.",
"Reports confirm that 13 Palestinians were killed in the recent spate of violence.",
"The region remains fraught with conflict, claiming the lives of 27 Palestinians.",
"A recent attack has resulted in the tragic killing of 20 Palestinian lives.",
"The death toll has tragically risen to 32 Palestinians amidst escalating tensions.",
"In a heart-wrenching turn of events, 18 Palestinians were reported slaughtered.",
"Tensions in the area have led to the untimely deaths of 25 Palestinians.",
"Amidst ongoing conflict, the death toll among Palestinians reached 30.",
"Recent clashes have claimed the lives of 16 Palestinians in the strife-torn region.",
"The region remains embroiled in conflict, resulting in the killing of 21 Palestinians.",
"The conflict zone witnessed the tragic deaths of 29 Palestinian civilians.",
"In a tragic development, 14 Palestinians lost their lives in the conflict.",
"Amid escalating violence, the death toll surged to 23 Palestinian casualties.",
"Recent hostilities led to the untimely deaths of 19 Palestinians in the area.",
"The death toll tragically rose to 26 Palestinians amidst ongoing clashes.",
"Tensions escalated, resulting in the killing of 18 Palestinians in the conflict zone.",
"A recent offensive led to the slaughter of 22 Palestinian lives.",
"The conflict has resulted in the untimely deaths of 28 Palestinians, including civilians.",
"Recent clashes reported the killing of 15 Palestinians in the strife-torn region.",
"Amidst escalating hostilities, the death toll surged to 31 Palestinians.",
"Reports confirm the tragic deaths of 20 Palestinian lives in the conflict.",
"In a distressing turn of events, 17 Palestinians were reported slaughtered.",
"Tensions in the area claimed the lives of 24 Palestinians in recent clashes.",
"The conflict has tragically claimed the lives of 29 Palestinian civilians.",
"The strife-torn region reported the deaths of 13 Palestinians in recent attacks.",
"Amid escalating tensions, 26 Palestinians were killed in the conflict zone.",
"Recent clashes have led to the untimely deaths of 21 Palestinians.",
"The death toll tragically rose to 30 Palestinians amidst ongoing hostilities.",
"Tensions escalated, resulting in the killing of 19 Palestinians in the strife-torn area.",
"A recent offensive led to the slaughter of 23 Palestinian lives.",
"The conflict has resulted in the untimely deaths of 27 Palestinians, including civilians.",
"Recent clashes reported the killing of 16 Palestinians in the conflict zone.",
"Amidst escalating hostilities, the death toll surged to 32 Palestinians.",
"Reports confirm the tragic deaths of 18 Palestinian lives in the ongoing conflict.",
"Israeli airstrikes decimated a neighborhood, leaving 25 civilians dead.",
"Reports confirm an Israeli offensive resulting in the deaths of 30 people.",
"In a devastating turn, Israeli military operations led to the demise of 20 civilians.",
"Israeli attacks in the region resulted in the tragic deaths of 35 individuals.",
"Amidst Israeli airstrikes, 22 civilians lost their lives in the conflict zone.",
"Reports indicate that Israeli assaults claimed the lives of 28 civilians.",
"An Israeli offensive left 33 people dead, sparking international concern.",
"The death toll rose to 26 amidst Israeli attacks on civilian infrastructure.",
"Israeli military operations resulted in the tragic demise of 24 individuals.",
"In a tragic turn, 31 civilians were killed in Israeli airstrikes.",
"Reports confirm that Israeli actions resulted in the deaths of 27 individuals.",
"Israeli assaults led to the tragic deaths of 29 civilians in the region.",
"A recent Israeli offensive resulted in the demise of 23 civilians.",
"In a distressing development, Israeli attacks claimed the lives of 34 civilians.",
"Israeli airstrikes left 32 individuals dead in the strife-torn area.",
"Reports from the conflict zone confirm 21 casualties in Israeli attacks.",
"Amidst Israeli military operations, 25 civilians tragically lost their lives.",
"Israeli assaults resulted in the tragic deaths of 37 civilians.",
"The death toll surged to 40 amidst ongoing Israeli attacks.",
"In a devastating turn, Israeli airstrikes claimed the lives of 38 individuals.",
"Reports confirm that Israeli actions led to the deaths of 36 people.",
"Israeli assaults in the region resulted in the tragic demise of 41 individuals.",
"A recent Israeli offensive claimed the lives of 39 people.",
"In a distressing development, 43 individuals were killed by Israeli attacks.",
"Israeli airstrikes resulted in the tragic deaths of 42 civilians.",
"Reports confirm that Israeli actions led to the demise of 45 individuals.",
"Israeli assaults in the region claimed the lives of 44 civilians.",
"A recent offensive by Israel resulted in the tragic demise of 47 individuals.",
"In a devastating turn, 46 people lost their lives in Israeli attacks.",
"Israeli airstrikes claimed the lives of 49 civilians in the strife-torn area.",
"Reports confirm that Israeli actions led to the deaths of 48 individuals.",
"Israeli assaults in the region resulted in the tragic demise of 50 civilians.",
"A recent offensive by Israel claimed the lives of 53 individuals.",
"In a distressing development, 52 individuals were killed by Israelis.",
"Israeli airstrikes resulted in the tragic deaths of 55 civilians.",
"Reports confirm that Israeli actions led to the demise of 54 individuals.",
"Israeli assaults in the region claimed the lives of 57 civilians.",
"A recent offensive by Israel resulted in the tragic demise of 56 individuals.",
"In a devastating turn, 59 people lost their lives in Israeli attacks.",
"Israeli airstrikes claimed the lives of 58 civilians in the strife-torn area.",
"Reports confirm that Israeli actions led to the deaths of 61 individuals.",
"Israeli assaults in the region resulted in the tragic demise of 60 civilians.",
"A recent offensive by Israel claimed the lives of 63 individuals.",
"In a distressing development, 62 Palestinians were killed by Israeli attacks.",
"Israeli airstrikes resulted in the tragic deaths of 65 civilians.",
"Reports confirm that Israeli actions led to the demise of 64 individuals.",
"Israeli assaults in the region claimed the lives of 67 civilians.",
"A recent offensive by Israel resulted in the tragic demise of 66 individuals.",
"In a devastating turn, 69 individuals lost their lives in Israeli attacks.",
"Israeli airstrikes claimed the lives of Palestinian 68 civilians in the strife-torn area.",
"Israeli forces killed 30 Palestinian civilians trapped in a collapsed building."


            ]



2. As a further step, to augment the data volume, we replaced several words from the generated data with some other words we frequently encountered in our BBC&Aljazeera articles but which had not been used by ChatGPT.

In [None]:
bomb = []
for sentence in palestinian_sentences:
  if 'attack' in sentence:
    sentence = re.sub(r'attack', 'bombardment', sentence)
    sentence = re.sub(r'assault', 'massacre', sentence)
    sentence = re.sub(r'were killed', 'died', sentence)
    bomb.append(sentence)
palestinian_sentences += bomb

3. We then replaced the strings:
- 'Israeli', 'Israel' with 'Hamas' so as to modify the party who attacks as it appears in the news, and
- 'Palestinian' with 'Israeli', to modify the party who suffers the attacks

Notice that performing the replacements in this order is important, if you had started by converting the string Palestinian to Israeli, then the second step will end up re-converting it to Hamas and there won't be any reference to Israeli civilians.

In [None]:
israeli_sentences = []
for sentence in palestinian_sentences:

    if 'Israeli' in sentence:
        sentence = re.sub(r'Israeli', 'Hamas', sentence)
   
    elif 'Israel' in sentence:
        sentence = re.sub(r'Israel', 'Hamas fighters', sentence)
        
    elif 'Palestinian' in sentence:
        sentence = re.sub(r'Palestinian', 'Israeli', sentence) 
 


    israeli_sentences.append(sentence)




4. With a similar logic, we also generated our NA-tagged sentences:

In [None]:
na_sentences = []
for sentence in palestinian_sentences:
    if 'Palestinians' in sentence:
        sentence = re.sub(r'Palestinians', '', sentence)
    elif 'Palestinian' in sentence:
        sentence = re.sub(r'Palestinian', '', sentence)
    elif 'Israeli' in sentence:
        sentence = re.sub(r'Israeli', '', sentence)
    elif 'Israel' in sentence:
        sentence = re.sub(r'Israel', '', sentence)
    na_sentences.append(sentence)



5. Having noticed our dataset contained references to massacre and slaughter of animals, we augmented our NA dataset so that the algorithm will not return any of the parties if the sentence reports the animal casualties.

In [None]:
na_sentences_2 = []
for sentence in palestinian_sentences:
    if 'Palestinians' in sentence:
        sentence = re.sub(r'Palestinians', 'sheeps', sentence)
    elif 'Palestinian civilians' in sentence:
        sentence = re.sub(r'Palestinian civilians', 'animals', sentence)
    elif 'Palestinian people' in sentence:
        sentence = re.sub(r'Palestinian people', 'cows', sentence)
    elif 'Palestinian' in sentence:
        sentence = re.sub(r'Palestinian', 'animal', sentence)
    elif 'individuals' in sentence:
        sentence = re.sub(r'individuals', 'animals', sentence)
    elif 'civilians' in sentence:
        sentence = re.sub(r'civilians', 'cows', sentence)
    elif 'people' in sentence:
        sentence = re.sub(r'people', 'animals', sentence)


    na_sentences_2.append(sentence)

na_sentences += na_sentences_2

6. Similarly, we did not want our model to confuse the sentences reporting the death of Hamas combatants with the death of civilians. Therefore, we wanted to train it in a way that it would return Not Available also for the sentences referring to the death of non-civilians.

In [None]:
na_sentences_3 = []
for sentence in palestinian_sentences:
    if 'Palestinians' in sentence:
        sentence = re.sub(r'Palestinians', 'Hamas fighters', sentence)
    elif 'Palestinian civilians' in sentence:
        sentence = re.sub(r'Palestinian civilians', 'Hamas fighters', sentence)
    elif 'Palestinian people' in sentence:
        sentence = re.sub(r'Palestinian people', 'Hamas members', sentence)
    elif 'individuals' in sentence:
        sentence = re.sub(r'individuals', 'Hamas gunmen', sentence)
    elif 'civilians' in sentence:
        sentence = re.sub(r'civilians', 'Hamas fighters', sentence)
    elif 'people' in sentence:
        sentence = re.sub(r'people', 'Hamas combatants', sentence)


    na_sentences_3.append(sentence)

na_sentences += na_sentences_3

7. Another common thing we noticed, and turned out to be misleading for the algorithm in our attempts was when the sentence was referring to death tolls from another war. The most common references in our news were either on Russia-Ukraine War or on Israel's war with Lebanon. We included some examples of such references too in the NA dataset.

In [None]:
na_sentences_4 = []
for sentence in palestinian_sentences:
    if 'Palestinians' in sentence:
        sentence = re.sub(r'Palestinians', 'Ukranians', sentence)
    elif 'Palestinian' in sentence:
        sentence = re.sub(r'Palestinian', 'Russian', sentence)
    elif 'Israelis' in sentence:
        sentence = re.sub(r'Israelis', 'Russians', sentence)
    elif 'Israeli' in sentence:
        sentence = re.sub(r'Israeli', 'Russian', sentence)



    na_sentences_4.append(sentence)

na_sentences += na_sentences_4

In [None]:
na_sentences_5 = []
for sentence in palestinian_sentences:
    if 'Palestinians' in sentence:
        sentence = re.sub(r'Palestinians', 'Lebanon', sentence)
    elif 'Palestinian' in sentence:
        sentence = re.sub(r'Palestinian', 'Lebanese', sentence)
    elif 'Israelis' in sentence:
        sentence = re.sub(r'Israelis', 'Egyptians', sentence)
    elif 'Israeli' in sentence:
        sentence = re.sub(r'Israeli', 'Egypt', sentence)



    na_sentences_5.append(sentence)

na_sentences += na_sentences_5

8. As a final step for NA labels, we asked ChatGPT to generate some generic sentences, without reference to any party, that reports fatalities in war-like scenarios.

In [None]:
na_sentences_6 = [
    "Amidst the conflict, soldiers fallen in battle are remembered for their bravery.",
"The war-torn region witnessed numerous casualties in the ongoing conflict.",
"Reports indicate increasing fatalities as the conflict escalates.",
"Scores of soldiers lost their lives in the recent skirmish along the border.",
"Civilians caught in the crossfire perish tragically as violence persists.",
"The nation mourns fallen heroes in the aftermath of a fierce battle.",
"Families grieve as casualties mount in the ongoing war zone.",
"Airstrikes claim many lives in the war-ravaged territory.",
"The conflict zone witnesses a surge in fatalities as tensions heighten.",
"Soldiers laid to rest in solemn ceremonies after the latest clash.",
"The war effort takes its toll with a rise in casualties reported.",
"Civilians bear the brunt of the escalating conflict, resulting in tragic deaths.",
"Young soldiers lost their lives defending the country's borders.",
"The humanitarian crisis deepens as civilian deaths continue to rise.",
"Survivors recount horror stories of losing loved ones in the war.",
"Bodies of fallen soldiers repatriated amidst solemn ceremonies.",
"Rescue efforts strained as casualties surge in the conflict zone.",
"The conflict claims many innocent lives in indiscriminate attacks.",
"Officials confirm heavy casualties in the latest offensive.",
"The war's devastating toll continues to mount as clashes persist.",
"Amidst the chaos, families mourn lost loved ones in the ongoing conflict.",
"The nation observes a day of mourning for war victims.",
"The conflict's impact is seen in increasing death tolls reported daily.",
"Survivors struggle to cope with the trauma of losing family members.",
"The war-torn region faces an influx of casualties due to ongoing strife.",
"Civilian deaths draw international concern over the conflict's toll.",
"Tensions escalate with rising fatalities on both sides of the conflict.",
"The conflict claims more lives as hostilities continue unabated.",
"Families brace for tragic news as casualties are confirmed in the conflict.",
"War-related deaths prompt calls for immediate peace talks.",
"Humanitarian aid reaches war-affected regions grappling with casualties.",
"Survivors recount the painful loss of friends and family in the war zone.",
"The conflict zone witnesses numerous casualties in recent clashes.",
"Civilians caught in the conflict pay the ultimate price amidst chaos.",
"The war's impact is seen in the loss of innocent lives across the region.",
"Tragic casualties highlight the toll of ongoing conflict on civilians.",
"The conflict zone is rife with heartbreaking stories of lives lost.",
"Military fatalities continue to mount as battles persist.",
"The war's devastating toll severely impacts civilian populations.",
"Families mourn the loss of those who perished in the conflict.",
"Tensions rise as casualties escalate in the war-torn area.",
"Civilians bear witness to horrific scenes of war-related deaths.",
"The conflict's tragic outcome is seen in mounting death tolls.",
"Amidst the conflict, the loss of innocent lives draws condemnation.",
"Soldiers killed in action are honored for their sacrifice.",
"The war's cost is evident in increasing casualties reported daily.",
"The ongoing conflict claims more victims as violence rages on.",
"Families plead for an end to bloodshed as casualties mount.",
"The conflict's toll on civilian lives raises humanitarian concerns.",
"The somber reality of war-related deaths grips the nation.",
]

na_sentences += na_sentences_6

9. We noticed the frequent reference of Palestinians as "civilians in Gaza" rather than mentioning their nationality. So we made sure our data for Palestinian fatalities would have enough examples of those instances as well. Though "In Israel" is not a confusing use as the cases of Gaza vs Palestinians, we covered enough examples of that version as well.

In [None]:
pal_sentences = []
for sentence in palestinian_sentences:
    if 'Palestinians' in sentence:
        sentence = re.sub(r'Palestinians', 'people in Gaza', sentence)
    elif 'Palestinian civilians' in sentence:
        sentence = re.sub(r'Palestinian civilians', 'civilians', sentence)
        sentence = re.sub(r'.', 'in Gaza.', sentence)
    elif 'Palestinian people' in sentence:
        sentence = re.sub(r'Palestinian people', 'people', sentence)
        sentence = re.sub(r'.', 'in Gaza.', sentence)

    pal_sentences.append(sentence)


In [None]:
isr_sentences = []
for sentence in israeli_sentences:
    if 'Israelis' in sentence:
        sentence = re.sub(r'Israelis', 'people', sentence)
    elif 'Israeli' in sentence:
        sentence = re.sub(r'Israeli', '', sentence)
    sentence = re.sub(r'.', 'in Israel', sentence)

    isr_sentences.append(sentence)


10. Finally, we wanted to include several examples of indirect reporting of death counts for both Israel and Palestine casualties into correspoinding dataset. For that reason, we added the source-citing clauses to the end of each sentence for palestinian_sentences and israeli_sentences dataset.

In [None]:
pal_sentences_2 = []
for sentence in palestinian_sentences:
    sentence = re.sub(r'.', 'according to the health ministry in Gaza.', sentence)

    pal_sentences_2.append(sentence)



We noticed that it was quite common to cite the source of Palestinian fatalities as Hamas, as much as Gaza or Palestine, therefore we made sure these cases got enough representation in our data of Palestinian fatalities, otherwise the replacement we performed at step 3 could risk misleading the algorithm to assign label "Israeli fatality" everytime it sees the word Hamas. Therefore, we have to counterbalance it with enough examples in which word "Hamas" is used in a sentence reporting Palestinian deaths.

In [None]:
pal_sentences_3 = []
for sentence in pal_sentences:
    sentence = re.sub(r'.', 'according to the Hamas.', sentence)


    pal_sentences_3.append(sentence)


In [None]:
pal_sentences_4 =[]
for sentence in palestinian_sentences:
    sentence = 'The Hamas-run health ministry said' + sentence

    pal_sentences_4.append(sentence)

palestinian_sentences += pal_sentences
palestinian_sentences += pal_sentences_2
palestinian_sentences += pal_sentences_3
palestinian_sentences += pal_sentences_4




In [None]:
isr_sentences_2 = []
for sentence in israeli_sentences:
    sentence = re.sub(r'.', 'according to Israel.', sentence)

    isr_sentences_2.append(sentence)

israeli_sentences += isr_sentences
israeli_sentences+= isr_sentences_2


In [None]:
isr_sentences_3 = []
for sentence in na_sentences_6:
    sentence = re.sub(r'civilian', 'Israeli', sentence)
    sentence = re.sub(r'soldiers', 'Israelis', sentence)

    isr_sentences_3.append(sentence)


israeli_sentences+= isr_sentences_3

### Assign Labels to each Data Frame

In [None]:
pal = {}
pal['sentence'] = palestinian_sentences
pal['party_who_dies'] = []
for i in range(len(palestinian_sentences)):
    pal['party_who_dies'].append("Palestinians")
pal_df = pd.DataFrame(pal)

In [None]:
isr = {}
isr['sentence'] = israeli_sentences
isr['party_who_dies'] = []
for i in range(len(israeli_sentences)):
    isr['party_who_dies'].append("Israelis")
isr_df = pd.DataFrame(isr)

In [None]:
na = {}
na['sentence'] = na_sentences
na['party_who_dies'] = []
for i in range(len(na_sentences)):
    na['party_who_dies'].append("Not available")
na_df = pd.DataFrame(na)

### Concatenate the DataFrames and Save in csv format

In [None]:
df = pd.concat([pal_df, isr_df, na_df], ignore_index=True)

In [None]:
len(df)

2635

In [None]:
label_map = {'Israelis': 0, 'Palestinians': 1, 'Not available':2}
df['label'] = df['party_who_dies'].map(label_map)
df.to_csv('data.csv', index=None)

## Building the Model

Fine-tuning the pre-trained BERT model, we will assess the model's performance on our artificial sentence dataset.

### Train-test split
- We will first load our dataset using load_dataset module to get a DatasetDict object (the object format suitable for transformers.
- Then we will split it into train and test sets.

In [None]:
from datasets import load_dataset
raw_dataset = load_dataset('csv', data_files='data.csv')

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'party_who_dies', 'label'],
        num_rows: 2635
    })
})

In [None]:
split = raw_dataset['train'].train_test_split(test_size=0.2, seed=42)
split

DatasetDict({
    train: Dataset({
        features: ['sentence', 'party_who_dies', 'label'],
        num_rows: 2108
    })
    test: Dataset({
        features: ['sentence', 'party_who_dies', 'label'],
        num_rows: 527
    })
})

### Import the pre-trained model
We will be using the Tokenizer and pre_traiend weights from "bert-base-uncased" model.

In [None]:
checkpoint = "bert-base-uncased"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    return tokenizer(batch['sentence'], truncation=True)
    #include truncation but not padding
    #padding will be automatically done by trainer

tokenized_datasets = split.map(tokenize_fn, batched=True) #rememer split was the name of splitted dataset


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/2108 [00:00<?, ? examples/s]

Map:   0%|          | 0/527 [00:00<?, ? examples/s]

In [None]:
from transformers import AutoModelForSequenceClassification, \
  Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=3) #we have 3 labels : Palestinians, Israelis and Not available

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Define Training Arguments

In [None]:
training_args = TrainingArguments(
  output_dir='training_dir', 
  evaluation_strategy='epoch',
  save_strategy='epoch',
  num_train_epochs=6, 
  per_device_train_batch_size=16, 
  per_device_eval_batch_size=64,
)

In [None]:
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
def compute_metrics(logits_and_labels):
  logits, labels = logits_and_labels
  predictions = np.argmax(logits, axis=-1)
  acc = np.mean(predictions == labels)
  f1 = f1_score(labels, predictions, average='macro')
  return {'accuracy': acc, 'f1': f1}

### Train the Model
We will train our model on 6 epochs on the tokenized dataset, while using the test set for validation and fine-tuning of the weights.

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.182746,0.941176,0.941084
2,No log,0.100306,0.948767,0.950522
3,No log,0.135512,0.948767,0.950497
4,0.160900,0.127409,0.944972,0.945102
5,0.160900,0.12448,0.937381,0.938215
6,0.160900,0.156739,0.935484,0.936525


TrainOutput(global_step=792, training_loss=0.12638628603232027, metrics={'train_runtime': 1280.7783, 'train_samples_per_second': 9.875, 'train_steps_per_second': 0.618, 'total_flos': 3205243543615344.0, 'train_loss': 0.12638628603232027, 'epoch': 6.0})

### Model Selection

Models from all six epochs performed quite similarly, we decided to continue with the model in the 2nd epoch as it achieved the highest accuracy and F1 score. We print down the training directory to access to the weights corresponding the epoch 2 model:

In [None]:
!ls training_dir

checkpoint-132	checkpoint-396	checkpoint-660	runs
checkpoint-264	checkpoint-528	checkpoint-792


In [None]:
from transformers import pipeline
savedmodel = pipeline('text-classification',
                      model='training_dir/checkpoint-396',
                      device=0)

### Saving the Model for future use (if needed)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
checkpoint_number = 396
trainer.save_model(f'/content/drive/My Drive/Empirical_methods/BERT_model/checkpoint-{checkpoint_number}')

In [None]:
savedmodel.save_pretrained('/content/drive/My Drive/Empirical_methods')

## Predictions on Unlabelled Dataset

Now we are ready to use the model to classify sentences from our real-news dataset stored in df_death.json

In [16]:
import json
with open('df_death.json', 'r') as file:
    data = json.load(file)

df_real = pd.DataFrame(data)

In [17]:
df = {}
df['source'] = []
df['sentence'] = []
for i in range(len(df_real)):
  for sentence in df_real['mention_death'][i]:
    df['sentence'].append(sentence)
    df['source'].append(df_real['source'][i])

data = pd.DataFrame(df)

In [18]:
len(data)

5025

In [19]:
data.head()

Unnamed: 0,source,sentence
0,Aljazeera,"The US helicopters sunk three of the boats, ki..."
1,Aljazeera,He said the latest clash marked serious escala...
2,Aljazeera,The unrest in the Red Sea comes as anger grows...
3,Aljazeera,The war began when Hamas carried out shock cro...
4,Aljazeera,An estimated 70 percent of respondents said th...


In [None]:
map2label = {'LABEL_0':'Israelis', 'LABEL_1':'Palestinians', 'LABEL_2':'Not available'}

In [None]:
predictions = savedmodel(data['sentence'].tolist())
data['party_who_dies_pred'] = [map2label[pred['label']]for pred in predictions]


In [None]:
data.head()

Unnamed: 0,source,sentence,party_who_dies_pred
0,Aljazeera,"The US helicopters sunk three of the boats, ki...",Not available
1,Aljazeera,He said the latest clash marked serious escala...,Not available
2,Aljazeera,The unrest in the Red Sea comes as anger grows...,Palestinians
3,Aljazeera,The war began when Hamas carried out shock cro...,Israelis
4,Aljazeera,An estimated 70 percent of respondents said th...,Israelis


In [None]:
file_path = 'df_death_pred.json'

# Save the DataFrame to a JSON file
data.to_json(file_path)