## Inference on Dataset for topic "Health"

This notebook is meant to be run on a Google Colab instance rather than locally. A T4, V100, or A100 GPU should be sufficient.

### Install relevant packages

In [1]:
!pip install accelerate -U
!pip install transformers[sentencepiece]
!pip install datasets
!pip install --force-reinstall -v "openpyxl==3.0.10"
!pip install xformers

Collecting accelerate
  Downloading accelerate-0.29.2-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.4/297.4 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

In [2]:
## Load general packages
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable
from sklearn.model_selection import train_test_split
from google.colab import drive
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
import os
import tqdm

### Connect to drive

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


### Load Model from Drive

In [4]:


model_name_custom = "deberta-base-health_final_20240321_9596"
model_custom_path = "/content/drive/MyDrive/unga_health/" + model_name_custom
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU

model = AutoModelForSequenceClassification.from_pretrained(model_custom_path)
tokenizer = AutoTokenizer.from_pretrained(model_custom_path, use_fast=True, model_max_length=512)

# documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
pipe_classifier = pipeline(
    "text-classification",
    model=model,  # fine-tuned model loaded from Drive above
    tokenizer=tokenizer,
    framework="pt",
    device=device,
    batch_size=24
)

In [5]:
# Free cached GPU memory; this call does not need a no_grad context
torch.cuda.empty_cache()

### Download Data

In [6]:
oos_test = pd.read_csv('https://nextcloud.swp-berlin.org/s/REDACTED/download')
oos_test = oos_test[['text', 'detail_vote_number', 'group_id_alt']].copy()  # copy so later column assignments do not act on a slice
oos_test


Unnamed: 0,text,detail_vote_number,group_id_alt
0,"The General Assembly,",4033262,1.0
1,Recalling its resolution 78/160 of 19 December...,4033262,2.0
2,Reaffirming its resolutions 53/199 of 15 Decem...,4033262,3.0
3,Recognizing the importance of creating synergi...,4033262,4.0
4,"Stressing the important role of science, techn...",4033262,5.0
...,...,...,...
309468,Requests the Secretary-General to continue to ...,284003,11.0
309469,Requests all States and international organiza...,284003,12.0
309470,Also requests the Secretary-General to report ...,284003,13.0
309471,,284001,


In [7]:
oos_test['text_length'] = oos_test['text'].str.len()
# Get the top 100 texts by length
top_100_texts_by_length = oos_test.nlargest(100, 'text_length')
top_100_texts_by_length

Unnamed: 0,text,detail_vote_number,group_id_alt,text_length
114978,processes and the decisions taken therein. We ...,809145,59.0,36400.0
197514,Resolves that the scale of assessments for the...,590472,20.0,33148.0
139353,Resolves that the scale of assessments for the...,750363,24.0,28913.0
170626,Resolves that the scale of assessments for the...,675488,19.0,28773.0
224985,Resolves that the scale of assessments for the...,512129,8.0,28643.0
...,...,...,...,...
3519,Taking note of the declaration adopted in Flor...,4030825,27.0,2079.0
20255,Urges Member States to adopt a climate- and en...,3998705,34.0,2072.0
3026,Urges Member States to adopt a climate- and en...,4030818,37.0,2071.0
246240,Also encourages all States members of the Agen...,454728,24.0,2065.0


In [8]:
## Drop rows with missing text. The dataset lists all resolutions,
## including those whose PDF is faulty or for which we don't have a link.
oos_test = oos_test.dropna(subset=['text']).copy()

## Some paragraphs are simply too long for the classifier: for texts longer
## than 25,000 characters, keep only the first 6,000 characters.
oos_test['text'] = oos_test['text'].apply(lambda text: text[:6000] if len(text) > 25000 else text)

oos_test



Unnamed: 0,text,detail_vote_number,group_id_alt,text_length
0,"The General Assembly,",4033262,1.0,21.0
1,Recalling its resolution 78/160 of 19 December...,4033262,2.0,160.0
2,Reaffirming its resolutions 53/199 of 15 Decem...,4033262,3.0,238.0
3,Recognizing the importance of creating synergi...,4033262,4.0,446.0
4,"Stressing the important role of science, techn...",4033262,5.0,399.0
...,...,...,...,...
309466,Notes with satisfaction the national efforts o...,284003,9.0,115.0
309467,"Commends the international community, includin...",284003,10.0,167.0
309468,Requests the Secretary-General to continue to ...,284003,11.0,166.0
309469,Requests all States and international organiza...,284003,12.0,280.0
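
The truncation rule in the cell above can be isolated as a small helper, which makes the threshold and the kept length explicit. This is only a minimal sketch: the function name `truncate_long` and its parameter names are hypothetical, while the values 25000 and 6000 are taken from the cell.

```python
def truncate_long(text, threshold=25000, keep=6000):
    """Return the first `keep` characters if `text` is longer than `threshold`."""
    return text[:keep] if len(text) > threshold else text

short = "The General Assembly,"
long_text = "x" * 30000  # toy stand-in for an overly long paragraph

print(len(truncate_long(short)))      # 21, unchanged
print(len(truncate_long(long_text)))  # 6000, truncated
```

With pandas, the helper could be applied as `oos_test['text'].map(truncate_long)`. Note that under this rule texts between 6,000 and 25,000 characters pass through unchanged.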


In [9]:
dir_path = '/content/drive/MyDrive/unga_health/preds_20240416'
os.makedirs(dir_path, exist_ok=True)  # create the output folder if it does not exist yet

In [11]:
text_lst = oos_test["text"].tolist()
chunk_size = 1000  # number of texts per saved chunk

# In principle it would be better to use a Dataset object from the datasets library.
# However, that does not allow for saving intermediate results, and since there is a lot
# of text, Google Colab sometimes crashes or gets disconnected. This way it probably takes
# about half an hour longer, but that doesn't matter.
def save_output(output, chunk_idx):
    """Save intermediate results to a CSV file."""
    file_path = f"{dir_path}/predictions_eval_set_chunk_{chunk_idx}.csv"
    # each pipeline call returns a one-element list holding a label/score dict
    df_temp = pd.DataFrame([data[0] for data in output])
    chunk_rows = oos_test.iloc[chunk_idx * chunk_size:(chunk_idx + 1) * chunk_size].reset_index(drop=True)
    # ignore_index=True replaces the column names with 0, 1, 2, ...
    eval_temp = pd.concat([chunk_rows, df_temp], axis=1, ignore_index=True)
    eval_temp.to_csv(file_path, index=False)

def process_texts(texts):
    """Process a list of texts in chunks and save outputs."""
    num_chunks = (len(texts) + chunk_size - 1) // chunk_size  # ceiling division
    torch.cuda.empty_cache()

    with tqdm.tqdm(total=len(texts)) as pbar:
        for chunk_idx in range(num_chunks):
            file_path = f"{dir_path}/predictions_eval_set_chunk_{chunk_idx}.csv"
            start_idx = chunk_idx * chunk_size
            chunk = texts[start_idx:start_idx + chunk_size]

            # Skip chunks that were already saved in an earlier run
            if os.path.exists(file_path):
                pbar.update(len(chunk))
                continue

            current_chunk_output = [pipe_classifier(text) for text in chunk]

            save_output(current_chunk_output, chunk_idx)
            pbar.update(len(chunk))  # the last chunk may hold fewer texts

process_texts(text_lst)

310000it [00:00, 5861373.03it/s]          
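
The chunk-and-resume pattern used above can be tried without a GPU by swapping in a stand-in classifier. The sketch below is illustrative only: `fake_classify` and `run_chunks` are hypothetical names, the output file names are simplified, and a temporary directory stands in for the Drive folder.

```python
import csv
import math
import os
import tempfile

def fake_classify(text):
    """Hypothetical stand-in for the pipeline: returns one label/score dict."""
    label = "health" if "health" in text.lower() else "not health"
    return {"label": label, "score": 1.0}

def run_chunks(texts, out_dir, chunk_size=1000):
    """Classify texts chunk by chunk, skipping chunks whose CSV already exists."""
    num_chunks = math.ceil(len(texts) / chunk_size)
    processed = 0
    for chunk_idx in range(num_chunks):
        path = os.path.join(out_dir, f"chunk_{chunk_idx}.csv")
        if os.path.exists(path):
            continue  # this chunk was finished in an earlier run
        chunk = texts[chunk_idx * chunk_size:(chunk_idx + 1) * chunk_size]
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text", "label", "score"])
            for text in chunk:
                result = fake_classify(text)
                writer.writerow([text, result["label"], result["score"]])
        processed += len(chunk)
    return num_chunks, processed

texts = [f"paragraph {i} on global health" for i in range(2500)]
with tempfile.TemporaryDirectory() as tmp:
    first = run_chunks(texts, tmp)   # processes all three chunks
    second = run_chunks(texts, tmp)  # finds all files and skips them
print(first, second)  # (3, 2500) (3, 0)
```

One caveat of checking only for file existence is that a chunk file left half-written by a crash would be treated as finished. Writing to a temporary name and renaming afterwards would be one way to guard against that.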


In [None]:
# Set working directory
os.chdir('/content/drive/MyDrive/unga_health/preds_20240321')

# List all csv files in the directory
files = [f for f in os.listdir() if f.endswith('.csv')]

# Read and combine all csv files
predictions = pd.concat([pd.read_csv(file) for file in files], ignore_index=True)

predictions.columns = ['text', 'detail_vote_number', 'group_id_alt', 'index_alt', 'label', 'prob']
predictions = predictions.drop(columns=['index_alt'])

# Get the 200 rows with the lowest probabilities
low_prob = predictions.nsmallest(200, 'prob')
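
`os.listdir()` returns file names in arbitrary order, so the concatenated predictions are not guaranteed to follow the original row order. If that order matters, the files could be sorted by their chunk index first. The sketch below assumes the files follow the `predictions_eval_set_chunk_{chunk_idx}.csv` naming used when saving; the helper name `chunk_index` is hypothetical.

```python
import re

CHUNK_PATTERN = re.compile(r"predictions_eval_set_chunk_(\d+)\.csv$")

def chunk_index(filename):
    """Extract the integer chunk index from a prediction file name."""
    match = CHUNK_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))

names = [
    "predictions_eval_set_chunk_10.csv",
    "predictions_eval_set_chunk_2.csv",
    "predictions_eval_set_chunk_0.csv",
]
# Sort numerically; a plain string sort would place chunk 10 before chunk 2
ordered = sorted(names, key=chunk_index)
print(ordered)
```

Passing `sorted(files, key=chunk_index)` to the list comprehension that reads the CSV files would then keep the chunks in order.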



In [None]:
# Filter rows where text contains "health" and label is "not health"
health = predictions[predictions['text'].str.contains('health', case=False) & (predictions['label'] == 'not health')]

# Get the 200 rows with the highest probabilities
high_prob = predictions.nlargest(200, 'prob')

# Draw a random sample of 25 paragraphs per label for each coder
coders = ["daniel", "rebecca", "paul"]
sample_list = []

for coder in coders:
    for label in predictions['label'].unique():
        sample_data = predictions[predictions['label'] == label].sample(n=25)
        sample_data['coder'] = coder
        sample_list.append(sample_data)

sample = pd.concat(sample_list, ignore_index=True)
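
The sampling cell calls `.sample(n=25)` without a seed, so each run hands the coders different paragraphs. If a reproducible draw is wanted, a seed can be passed through `random_state`. The following is a sketch on toy data, not the notebook's actual procedure: the function name `sample_per_coder`, the seed scheme, and the coder names are all hypothetical.

```python
import pandas as pd

def sample_per_coder(df, coders, n_per_label, seed=0):
    """Draw n_per_label rows per label for each coder, reproducibly."""
    parts = []
    for i, coder in enumerate(coders):
        # a different but fixed seed per coder, so coders get different draws
        drawn = df.groupby("label").sample(n=n_per_label, random_state=seed + i)
        parts.append(drawn.assign(coder=coder))
    return pd.concat(parts, ignore_index=True)

toy = pd.DataFrame({
    "text": [f"paragraph {i}" for i in range(20)],
    "label": ["health"] * 10 + ["not health"] * 10,
})
a = sample_per_coder(toy, ["coder_a", "coder_b"], n_per_label=3)
b = sample_per_coder(toy, ["coder_a", "coder_b"], n_per_label=3)
print(a.equals(b), len(a))
```

As in the original loop, the draws for different coders are independent, so the same paragraph can be assigned to more than one coder.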

In [None]:
sample


Unnamed: 0,text,detail_vote_number,group_id_alt,label,prob,coder
0,"Further decides that, in accordance with the p...",638162,30.0,not health,0.989855,daniel
1,Taking note of the report of the independent i...,827185,9.0,not health,0.995008,daniel
2,Taking note of the discussions on munitions ma...,1327179,10.0,not health,0.994794,daniel
3,Taking note of the communiqué issued by the ro...,505276,5.0,not health,0.990829,daniel
4,Welcomes the report of the Committee on Confer...,3897086,9.0,not health,0.994927,daniel
...,...,...,...,...,...,...
145,Also calls upon States to provide the necessar...,509174,44.0,health,0.997962,paul
146,Encourages Member States to adopt best practic...,3894270,46.0,health,0.997631,paul
147,Supports an inclusive consultation process for...,724901,54.0,health,0.969559,paul
148,"Recognizing that bullying, including cyberbull...",858080,8.0,health,0.996584,paul


In [None]:
low_prob

Unnamed: 0,text,detail_vote_number,group_id_alt,label,prob
39190,Also recognizes the importance of innovation i...,3941456,76.0,not health,0.500859
52478,Also recognizes the importance of innovation i...,3883588,76.0,not health,0.500859
65172,Also recognizes the importance of innovation i...,3828656,72.0,not health,0.500859
78161,Also recognizes the importance of innovation i...,1642600,70.0,not health,0.500859
155878,Concerned about the challenges that the financ...,697981,15.0,not health,0.501362
...,...,...,...,...,...
225706,(f) To take into account the outcome of the sp...,482484,60.0,not health,0.563332
43841,Requests the Secretary-General to submit to th...,3896667,82.0,health,0.564744
17759,(m) To promote quality education and lifelong ...,3998742,69.0,health,0.566178
117552,Calls upon States to promote and protect the r...,789269,20.0,not health,0.566370
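
The `prob` column appears to hold the score of the predicted label, which would explain why the lowest values sit just above 0.5. Assuming a two-class softmax output (an assumption, not something the notebook states), label and score can be folded into a single probability for the `health` class. The helper name `p_health` is hypothetical; the example rows are taken from the outputs above.

```python
def p_health(label, prob):
    """Probability of the 'health' class, assuming a two-class softmax output."""
    return prob if label == "health" else 1.0 - prob

rows = [
    ("not health", 0.500859),
    ("health", 0.564744),
    ("not health", 0.989855),
]
for label, prob in rows:
    print(label, round(p_health(label, prob), 6))
```

On this scale, borderline paragraphs cluster around 0.5 regardless of which label was assigned, which can make it easier to pick cases for manual review.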


In [None]:
health

Unnamed: 0,text,detail_vote_number,group_id_alt,label,prob
3146,Recalling its resolution 76/300 of 28 July 202...,4030820,14.0,not health,0.881413
4082,Urges Member States to continue to meet their ...,4030840,20.0,not health,0.99314
4745,Recognizing its resolution 76/300 of 28 July 2...,4030850,9.0,not health,0.968829
5826,Recalling further its resolution 76/300 of 28 ...,4030872,20.0,not health,0.512971
7797,Recalling further its resolution 76/300 of 28 ...,4029891,7.0,not health,0.824453
8369,Firmly convinced that the use of space science...,4029479,13.0,not health,0.978253
9859,Also notes with concern the approximately 30 p...,4029432,260.0,not health,0.647533
12634,Recognizing the key role the Pacific Islands F...,4019749,4.0,not health,0.527304
13709,Reiterates its call for the reinforcement of c...,4009707,14.0,not health,0.851309
13988,Recalling its resolution 77/165 of 14 December...,4008496,3.0,not health,0.990269
