<a href="https://colab.research.google.com/github/rubac/open_survey/blob/main/BERT_Coding_of_Willingness_Open_Q.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Transformer for Coding Responses to Open-Ended Questions

[Simple Transformers](https://github.com/ThilinaRajapakse/simpletransformers), an NLP library built on the [Transformers](https://github.com/huggingface/transformers) library by HuggingFace, is used to automatically code each response to an open-ended survey question into one of 8 categories.

# Install Simple Transformers library 

In [1]:
%%capture
# install simpletransformers
!pip install simpletransformers
!pip freeze | grep simpletransformers
# simpletransformers==0.28.2

# Load the dataset

In [2]:
import pandas as pd
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Load manually coded training data (three separate datasets)
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data/rs_500_1_ID.xlsx')
df_2 = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data/rs_500_2_ID.xlsx')
df_3 = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data/rs_500_3_ID.xlsx')

DFS = [df, df_2, df_3]

In [4]:
# Derive the category code and the response text for each dataset
for x in DFS:
  x['Kategorie'] = x['Kategorie1'].str[:1]  # keep only the first character of the label
  print(x['Kategorie'].value_counts())
  x['Antwort'] = x.iloc[:,0]  # copy the first column as the response text

# Put them all into one DF
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
DF = pd.concat([x[['Antwort', 'Kategorie']] for x in DFS], ignore_index=True)
print(DF.shape)

1    142
4    133
2    101
7     55
6     42
5     36
8     18
3      1
Name: Kategorie, dtype: int64
4    166
1    105
2     75
5     67
7     62
6     42
3     10
8      4
Name: Kategorie, dtype: int64
4    206
2     67
7     67
1     64
6     50
5     39
3     13
8     12
Name: Kategorie, dtype: int64
(1577, 2)


In [5]:
# Convert the category codes to integers and shift them from 1-8 to 0-7
DF['Kategorie'] = DF['Kategorie'].astype(int)
DF['Kategorie'] = DF['Kategorie'] - 1

print(DF.shape)
print(DF['Kategorie'].value_counts())
print(DF.head())

(1577, 2)
3    505
0    311
1    243
6    184
4    142
5    134
7     34
2     24
Name: Kategorie, dtype: int64
                                             Antwort  Kategorie
0  Ich habe kein Problem damit, dass meine Daten ...          0
1  Ich habe nichts zu verbergen, ich helfe der Al...          0
2                                     Keine Bedenken          0
3        Meiner Meinung nach spricht nichts dagegen.          0
4  Stimme der Weitergabe meiner Daten grundsätzli...          0

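The categories are shifted from the original codes 1-8 to 0-7, which matches the zero-based integer labels that Simple Transformers expects by default. Predictions therefore also come back on the 0-7 scale. A minimal sketch of the round trip, using made-up values rather than the survey data (the column names `label` and `code` are hypothetical):

```python
import pandas as pd

# Toy stand-in for the coded data; the values are made up for illustration
toy = pd.DataFrame({"Kategorie": ["1", "4", "8"]})

# Shift the original codes 1-8 to the zero-based labels 0-7
toy["label"] = toy["Kategorie"].astype(int) - 1

# Map zero-based labels (or model predictions) back to the original codes
toy["code"] = toy["label"] + 1

print(toy)
```

Adding 1 to the saved predictions in the same way would restore the codes used by the human coders.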

In [6]:
from sklearn.model_selection import train_test_split

# Set aside 20% of the data as the test set
train_df, test_df = train_test_split(DF,
    test_size=0.2, shuffle = True, random_state = 8)

# Split the remaining data again to get the validation set
train_df, val_df = train_test_split(train_df,
    test_size=0.25, random_state= 8) # 0.25 x 0.8 = 0.2 of the full data

print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)
print('val shape: ',val_df.shape)

print(test_df.head())
print(val_df.head())

train shape:  (945, 2)
test shape:  (316, 2)
val shape:  (316, 2)
                                                Antwort  Kategorie
117                                         datenschutz          0
193   Bin mir nicht sicher , dass meine Gesundheitsd...          1
1514                                               Nein          6
933   Um das Gesundheitsamt dabei zu unterstützen, d...          3
551                              Bin mir da unschlüssig          5
                                                Antwort  Kategorie
537                                              Anonym          5
1542                           Habe ich bereits gemacht          6
714   Gesundheitsdaten sind eine sehr persönliche An...          0
1039  Wenn ich Empfehlungen brauche, vertraue ich au...          3
293        wegen der Studien, zur Allgemeinheit dienend          3

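The split above is purely random, so the two rarest categories (24 and 34 responses in total) may be spread unevenly across the train, validation, and test sets. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. A minimal sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with one frequent and one rare class (made up for illustration)
toy = pd.DataFrame({
    "Antwort": [f"text {i}" for i in range(50)],
    "Kategorie": [0] * 40 + [1] * 10,
})

# stratify keeps the class proportions (80% / 20%) the same in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=8, stratify=toy["Kategorie"]
)

print(toy_test["Kategorie"].value_counts())
```

Switching to a stratified split would change which responses end up in each set, so the results reported in this notebook would no longer be reproduced exactly.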

In [7]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Define hyperparameters
model_args = ClassificationArgs()
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.evaluate_during_training = True
model_args.manual_seed = 4
model_args.use_multiprocessing = True
model_args.train_batch_size = 16
model_args.eval_batch_size = 8
model_args.labels_list = [0,1,2,3,4,5,6,7]
model_args.num_train_epochs = 5
model_args.learning_rate = 0.00012131339286506642

# Create a ClassificationModel
model = ClassificationModel(
        "bert", "bert-base-german-cased",
        args=model_args,
        num_labels=8,
)

# Train the model
model.train_model(
    train_df,
    eval_df=val_df
)

# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df)


Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoi

In [8]:
# First look at test set classification performance
from sklearn.metrics import f1_score, accuracy_score


def f1_multiclass(labels, preds):
    # Micro-averaged F1; for single-label multiclass data this equals accuracy
    return f1_score(labels, preds, average='micro')
result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

result



{'mcc': 0.6620155252054196,
 'f1': 0.7278481012658227,
 'acc': 0.7278481012658228,
 'eval_loss': 1.141813975572586}

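The `f1` and `acc` values above are identical because micro-averaged F1 equals accuracy when every response has exactly one label. A macro average instead weights each category equally and is more sensitive to the rare categories. A small sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels and predictions for illustration
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 1]

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")  # same value as accuracy
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1

print(acc, micro, macro)
```

Passing a macro-averaged function to `eval_model` in the same way as `f1_multiclass` would report both views side by side.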
In [9]:
# Predict for all responses used in training / eval / test
antwort_list = DF['Antwort'].values.tolist()
predictions, raw_outputs = model.predict(antwort_list)

In [10]:
# Attach the predictions to the data and save
DF['predicted'] = predictions
DF.head(50)
DF.to_csv("/content/drive/MyDrive/Colab Notebooks/data/coded_results.csv")

In [11]:
# Look at classification performance in the test set
from sklearn.metrics import classification_report
test_antwort_list = test_df['Antwort'].values.tolist()
predictions, raw_outputs = model.predict(test_antwort_list)
test_predictions = predictions
print(classification_report(test_df['Kategorie'], test_predictions, target_names=["0","1","2","3","4","5","6","7"]))
# Work on a copy so that adding a column does not trigger a SettingWithCopyWarning
test_df = test_df.copy()
test_df['Predicted'] = test_predictions
# Save the classified test set for later use
test_df.to_csv("/content/drive/MyDrive/Colab Notebooks/data/testset_coded_results.csv")

              precision    recall  f1-score   support

           0       0.77      0.76      0.77        71
           1       0.67      0.73      0.70        41
           2       0.25      0.25      0.25         4
           3       0.84      0.86      0.85        98
           4       0.59      0.77      0.67        30
           5       0.55      0.50      0.52        32
           6       0.81      0.62      0.70        34
           7       0.33      0.17      0.22         6

    accuracy                           0.73       316
   macro avg       0.60      0.58      0.58       316
weighted avg       0.73      0.73      0.72       316

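The report shows how well each category is recovered, but not which categories are confused with each other. A confusion matrix can fill that gap. The sketch below uses made-up labels; with the notebook's data, `test_df['Kategorie']` and `test_predictions` would presumably take the place of `y_true` and `y_pred`:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for illustration
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2]

labels = [0, 1, 2]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true categories, columns are predicted categories
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```

Off-diagonal cells with large counts point to category pairs that the model, or the coding scheme itself, has trouble separating.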


In [12]:
# Load all responses
DF_complete = pd.read_stata('/content/drive/MyDrive/Colab Notebooks/data/GIP_W59_A01_open_clean.dta')
DF_complete.head()

# Classify all responses
full_antwort_list = DF_complete['openanswer'].values.tolist()
full_predictions, full_raw_outputs = model.predict(full_antwort_list)
DF_complete['predicted'] = full_predictions
DF_complete.head(50)
DF_complete.to_csv("/content/drive/MyDrive/Colab Notebooks/data/coded_all_responses.csv")

  0%|          | 0/3900 [00:00<?, ?it/s]

  0%|          | 0/488 [00:00<?, ?it/s]
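
`model.predict` returns raw model outputs alongside the predicted labels. Assuming these are unnormalized scores (logits) with one row per response, a softmax turns them into per-category probabilities that could be used to flag uncertain predictions for manual coding. The sketch below uses made-up logits, and the threshold of 0.5 is a hypothetical choice, not a value from this notebook:

```python
import numpy as np

# Made-up logits for two responses and three categories; in the notebook,
# the analogous array would be full_raw_outputs (an assumption)
logits = np.array([[2.0, 0.5, 0.1],
                   [0.4, 0.5, 0.3]])

# Numerically stable softmax over the category axis
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

confidence = probs.max(axis=1)    # probability of the predicted category
predicted = probs.argmax(axis=1)  # same as the argmax of the logits

# Hypothetical threshold for flagging responses for manual review
needs_review = confidence < 0.5
print(predicted, confidence.round(2), needs_review)
```

Softmax probabilities from a fine-tuned model are often overconfident, so any threshold would need to be checked against the manually coded test set.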