# XLNet for HMTC - Hierarchical Multi-label classification of Privacy Policies.

Code for reproducing results in the paper "A Combined Rule-Based and Machine Learning Approach for Automated GDPR Compliance Checking" 

*Revised on May 10, 2021*

In this notebook I'll show you how to use XLNet with the Hugging Face PyTorch library and the simpletransformers library to quickly and efficiently fine-tune a model to near state-of-the-art performance in privacy policy classification.

# Contents

See "Table of contents" in the sidebar to the left.

# Introduction


## History

2018 was a breakthrough year in NLP. Transfer learning, particularly models like Allen AI's ELMo, OpenAI's GPT, and Google's BERT, allowed researchers to smash multiple benchmarks with minimal task-specific fine-tuning. It also provided the rest of the NLP community with pretrained models that could easily (with less data and less compute time) be fine-tuned to produce state-of-the-art results. Unfortunately, for many starting out in NLP and even for some experienced practitioners, the theory and practical application of these powerful models is still not well understood.


## What is XLNet?
XLNet changed the way a language modeling problem is approached. It is an autoregressive language model that borrows the segment-level recurrence mechanism of Transformer-XL and is pretrained with a permutation language modeling objective: rather than always predicting a token from the words to its left, it maximizes the expected likelihood of a sequence over all permutations of the factorization order, so each token learns to use context from both sides. At the time of its release, XLNet, pretrained on a large corpus, reported state-of-the-art results on several benchmarks, including GLUE.
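As a toy illustration of what conditioning on permuted factorization orders means, the sketch below enumerates every order of a three-token sentence and collects the sets of tokens that can precede one target token. The helper name `contexts_for_target` and the example tokens are made up for this illustration; this is not how XLNet is implemented, which samples orders and uses attention masks instead of enumerating them.

```python
from itertools import permutations

def contexts_for_target(num_tokens, target):
    """Collect every set of positions that can precede `target`
    across all factorization orders of `num_tokens` positions."""
    contexts = set()
    for order in permutations(range(num_tokens)):
        cut = order.index(target)            # where the target sits in this order
        contexts.add(frozenset(order[:cut])) # everything predicted before it
    return contexts

tokens = ["privacy", "policy", "update"]     # hypothetical example sentence
found = contexts_for_target(len(tokens), 1)
for ctx in sorted(found, key=lambda c: (len(c), sorted(c))):
    print([tokens[i] for i in sorted(ctx)], "-> policy")
```

The target word ends up being predicted from no context, from its left neighbour only, from its right neighbour only, and from both, which is the sense in which the model sees bidirectional context while remaining autoregressive.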


## Advantages of Fine-Tuning



In this notebook, we will use XLNet to train a text classifier. Specifically, we will take the pre-trained XLNet model, add an untrained layer of neurons on the end, and train the new model for our classification task. Why do this rather than train a specific deep learning model (a CNN, BiLSTM, etc.) that is well suited for the specific NLP task you need?

1. **Quicker Development**

    * First, the pre-trained XLNet model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model - it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our classification task. In practice, fine-tuning on a specific NLP task usually needs only a few epochs (compared to the hundreds of GPU hours needed to train the original XLNet model or an LSTM from scratch!).

2. **Less Data**

    * In addition and perhaps just as important, because of the pre-trained weights this method allows us to fine-tune our task on a much smaller dataset than would be required in a model that is built from scratch. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset in order to train our network to reasonable accuracy, meaning a lot of time and energy had to be put into dataset creation. By fine-tuning XLNet, we are now able to get away with training a model to good performance on a much smaller amount of training data.

3. **Better Results**

    * Finally, this simple fine-tuning procedure (typically adding one fully-connected layer on top of XLNet and training for a few epochs) was shown to achieve state-of-the-art results with minimal task-specific adjustments for a wide variety of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes-obscure architectures shown to work well on a specific task, simply fine-tuning XLNet is shown to be a better (or at least equal) alternative.



## A Shift in NLP

This shift to transfer learning parallels the same shift that took place in computer vision a few years ago. Creating a good deep learning network for computer vision tasks can take millions of parameters and be very expensive to train. Researchers discovered that deep networks learn hierarchical feature representations (simple features like edges at the lowest layers with gradually more complex features at higher layers). Rather than training a new network from scratch each time, the lower layers of a trained network with generalized image features could be copied and transferred for use in another network with a different task. It soon became common practice to download a pre-trained deep network and quickly retrain it for the new task or add additional layers on top - vastly preferable to the expensive process of training a network from scratch. For many, the introduction of deep pre-trained language models in 2018 (ELMo, BERT, ULMFiT, XLNet, OpenAI's GPT, etc.) signals the same shift to transfer learning in NLP that computer vision saw.

Let's get started!

# 1. Setup

## 1.1. Using Colab GPU for Training



Google Colab offers free GPUs and TPUs! Since we'll be training a large neural network it's best to take advantage of this (in this case we'll attach a GPU), otherwise training will take a very long time.

A GPU can be added by going to the menu and selecting:

`Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)`

Then run the following cell to confirm that the GPU is detected.

In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device. 

In [1]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla V100-PCIE-32GB


## 1.2. Installing the Hugging Face Library



Next, let's install [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers), a library built on top of the [Transformers](https://github.com/huggingface/transformers) package from Hugging Face. Simple Transformers lets you quickly train and evaluate Transformer models, while the underlying Hugging Face library gives us a PyTorch interface for working with XLNet. (It also contains interfaces for other pretrained language models such as BERT and OpenAI's GPT and GPT-2.)


At the moment, the Hugging Face library seems to be the most widely accepted and powerful PyTorch interface for working with pretrained Transformer models. In addition to supporting a variety of different pre-trained transformer models, the library also includes pre-built modifications of these models suited to your specific task.

The library also includes task-specific classes for token classification, question answering, next sentence prediction, etc. Using these pre-built classes simplifies the process of modifying Transformers for your purposes.


In [2]:
!pip install -q simpletransformers==0.48.15
!pip install -q transformers==3.4.0



# 2. Loading OPP-115 (115 Online Privacy Policy) Dataset


For  our  task  we  make  use  of  the  [Usable  Privacy  Policy  Project](https://usableprivacy.org)’s  Online  Privacy  Policies ([OPP-115](https://usableprivacy.org/static/data/OPP-115_v1_0.zip)) corpus, which contains detailed annotations made by Subject Matter Experts (SMEs) for the data practices described in a set of 115 website privacy policies.

OPP-115 consists of 115 privacy policies, manually annotated on a paragraph level, resulting in 3 792 paragraphs, 10 high-level classes and 22 distinct attributes.


## 2.1. Download & Extract

We'll use the `wget` package to download the dataset to the Colab instance's file system. 

In [3]:
!pip install -q wget



The dataset is hosted on the Usable Privacy Policy Project website: https://usableprivacy.org/static/data/OPP-115_v1_0.zip

In [4]:
import wget
import os

print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://usableprivacy.org/static/data/OPP-115_v1_0.zip'

# Download the file (if we haven't already)
if not os.path.exists('./OPP-115_v1_0.zip'):
    wget.download(url, './OPP-115_v1_0.zip')

Downloading dataset...


Unzip the dataset to the file system. You can browse the file system of the Colab instance in the sidebar on the left.

In [5]:
# Unzip the dataset (if we haven't already)
if not os.path.exists('./OPP-115'):
    !unzip -q OPP-115_v1_0.zip
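
Outside Colab, the `!unzip` shell command may not be available. A minimal sketch of the same "extract only once" idea with the standard-library `zipfile` module is shown below. The helper `extract_once` and the throwaway archive it is demonstrated on are assumptions made for this example, not part of the original notebook.

```python
import os
import tempfile
import zipfile

def extract_once(zip_path, target_dir):
    """Extract zip_path into target_dir unless target_dir already exists.
    Returns True if extraction happened, False if it was skipped."""
    if os.path.exists(target_dir):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return True

# Demonstrate on a throwaway archive rather than the real corpus.
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo/placeholder.txt", "placeholder")

out_dir = os.path.join(workdir, "out")
print(extract_once(demo_zip, out_dir))  # first call extracts
print(extract_once(demo_zip, out_dir))  # second call is skipped
```

For the real corpus you would pass the downloaded `OPP-115_v1_0.zip` and a target directory of your choice.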

## 2.2. Build clean dataset from OPP-115

We can see from the documentation file that both `annotations` and `sanitized_policies` folders are the ones we need to play with in order to extract the policies' segments and the corresponding annotations. 

In [6]:
#import necessary libraries

import pandas as pd
import numpy as np
import itertools
import os
import re
import json
from pathlib import Path
import csv
import pickle

We use the following helper functions to build the clean dataset that we will be using throughout the notebook to perform our HMTC task.

In [7]:
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)
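
To see what this helper does in the pipeline, here is a small sketch that splits a policy string on the `|||` segment separator used by the sanitized policies and strips the tags from each piece. The input string is invented for illustration.

```python
import re

def remove_html(text):
    # Same pattern as the helper above: drop anything between '<' and '>'.
    return re.sub(r'<.*?>', '', text)

# Hypothetical two-segment policy; real files are much longer.
raw = "<strong>Privacy Policy</strong>|||We collect <br>your email address."
segments = [remove_html(s) for s in raw.split("|||")]
print(segments)
```

Note that the pattern is a plain regular expression, so it removes tags but does not decode HTML entities such as `&amp;`.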

In [8]:
pol_count = 0
policies_dict = {"segments": []}
annotated_segments = [] # list of dicts w annotated segments
DATA_PATH = "./OPP-115"

for filename in os.listdir(DATA_PATH+"/sanitized_policies"):
    policy_segments = []
    pol_count += 1
    annotations_path = DATA_PATH+ "/annotations/"+filename.replace(".html", ".csv")
    segments_path = DATA_PATH+"/sanitized_policies/"+filename
    with open(segments_path, 'r') as segs_file:
        segments = segs_file.read().split("|||")
        segments = list(map(remove_html, segments))
        annotations_df = pd.read_csv(annotations_path, names=["annotation_id", "batch_id","annotator_id", "policy_id", "segment_id", "category", "attributes", "date","url"])
        annotation_index = 0
        for (segment_index, segment) in enumerate(segments):
            segment_obj = {"filename":filename.replace(".html", ""),"Text": segment, "Category": []}
            while(annotation_index < len(annotations_df) and annotations_df.iloc[annotation_index]["segment_id"] == segment_index):
                annotation = annotations_df.iloc[annotation_index]
                attributes = json.loads(annotation["attributes"])
                if(annotation["category"] == "Other"):
                    # all attribute labels but other become category
                    for key, value in attributes.items():
                        if(key != "Other Type"):
                            print("key is not other type: ", key)
                            continue
                        if ("value" in value and value["value"] != "Other"):
                            if value["value"] not in segment_obj["Category"]:
                                segment_obj["Category"].append(value["value"])
                else:
                    if annotation["category"] not in segment_obj["Category"]:
                        segment_obj["Category"].append(annotation["category"])
                    for key, value in attributes.items():
                        if (key not in segment_obj):
                            segment_obj[key] = []
                        if ("value" in value):
                            # Normalise casing first so the duplicate check
                            # compares against the label that is stored.
                            label = value["value"]
                            if label == "User Profile":
                                label = "User profile"
                            elif label == "Service Operation and Security":
                                label = "Service operation and security"
                            if label not in segment_obj[key]:
                                segment_obj[key].append(label)
                        else:
                            print("value not in value for key: ", key)
                annotation_index +=1
            policy_segments.append(segment_obj)
    policies_dict["segments"].append(policy_segments)
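
The core of the loop above is merging one annotation row into the record of its segment: the category goes into `Category`, and each attribute's `value` goes into a list keyed by the attribute name. The sketch below isolates that step under simplifying assumptions. The function `merge_annotation` is hypothetical, the JSON payload is invented and far smaller than real OPP-115 rows, and the special handling of the "Other" category is left out.

```python
import json

def merge_annotation(segment, category, attributes_json):
    """Add one annotation (category + JSON attributes) to a segment record,
    skipping labels that are already present."""
    attributes = json.loads(attributes_json)
    if category not in segment["Category"]:
        segment["Category"].append(category)
    for name, payload in attributes.items():
        values = segment.setdefault(name, [])  # one list per attribute
        label = payload.get("value")
        if label is not None and label not in values:
            values.append(label)
    return segment

segment = {"Text": "We share your email with partners.", "Category": []}
row = json.dumps({
    "Personal Information Type": {"value": "Contact"},
    "Purpose": {"value": "Marketing"},
})

# Two annotators labelling the same segment identically add nothing new.
merge_annotation(segment, "Third Party Sharing/Collection", row)
merge_annotation(segment, "Third Party Sharing/Collection", row)
print(segment)
```

Because several annotators label each segment, de-duplicating at this step keeps each label list a set-like collection, which is what the multi-label encoder expects later on.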

In [9]:
policies_df = pd.DataFrame(policies_dict)
policies_df.head()

Unnamed: 0,segments
0,"[{'filename': '701_tangeroutlet.com', 'Text': ..."
1,"[{'filename': '1099_enthusiastnetwork.com', 'T..."
2,"[{'filename': '862_disinfo.com', 'Text': ' Pri..."
3,"[{'filename': '1468_rockstargames.com', 'Text'..."
4,"[{'filename': '331_tgifridays.com', 'Text': 'T..."


## 2.3 Split dataframes at policy level

We split the OPP-115 dataset on a policy-document level into 3 sets: 50 policies are used for training, 15 for validation and 50 policies are kept as a testing set.
We use the same splits for the local classifiers approach and the text-to-text approach to have a solid base of comparison between the two approaches.

In [10]:
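import pathlib
# gdown is needed for the Google Drive download below
# (install it with `!pip install -q gdown` if it is missing).
import gdown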

print('Downloading policies split ...')

prefix = 'https://drive.google.com/uc?id='
split = "1L2UGnahjxS5IJj4eCXrbqHF1IYPkXIe2"

# Download the file (if we haven't already)
if not os.path.exists('./opp115_splits.json'):
    gdown.download(url=prefix + split,
                    output='./opp115_splits.json',
                    quiet=False)

Downloading policies split ...


In [11]:
with open('./opp115_splits.json') as json_file:
    opp115_splits = json.load(json_file)
    
train_policies = opp115_splits['train']
validation_policies = opp115_splits['validation']
test_policies = opp115_splits['test']

print('{:>5,} training policies'.format(len(train_policies)))
print('{:>5,} validation policies'.format(len(validation_policies)))
print('{:>5,} test policies'.format(len(test_policies)))

   50 training policies
   15 validation policies
   50 test policies
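
Since the split is made at policy level, no policy should appear in more than one set; otherwise segments from the same document could leak between training and evaluation. A quick sanity check along these lines might look as follows. The function `check_disjoint` and the toy policy names are assumptions for illustration; with the real data you would pass the `opp115_splits` dictionary loaded above.

```python
def check_disjoint(splits):
    """Raise ValueError if any policy name appears in more than one split.
    Returns the total number of distinct policies."""
    seen = {}
    for split_name, policies in splits.items():
        for policy in policies:
            if policy in seen:
                raise ValueError(
                    f"{policy} is in both {seen[policy]} and {split_name}")
            seen[policy] = split_name
    return len(seen)

# Toy stand-in for the real split file.
toy_splits = {
    "train": ["a.com", "b.com"],
    "validation": ["c.com"],
    "test": ["d.com"],
}
print(check_disjoint(toy_splits))
```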


In [12]:
def expand_segments_df(segments_series):
    seg_list = segments_series.to_list()
    total_list = []
    for pol_segs in seg_list:
        for seg in pol_segs:
            total_list.append(seg)
    return pd.DataFrame(total_list)

In [13]:
policies = expand_segments_df(policies_df["segments"]).rename(columns={'Does/Does Not': 'Does or Does Not'})

In [14]:
policies.head()

Unnamed: 0,filename,Text,Category,Security Measure,Collection Mode,Choice Scope,Action First-Party,Personal Information Type,Choice Type,Identifiability,...,Action Third Party,Retention Period,Retention Purpose,Access Type,Access Scope,Audience Type,Change Type,User Choice,Notification Type,Do Not Track policy
0,701_tangeroutlet.com,Privacy Policy,[Introductory/Generic],,,,,,,,...,,,,,,,,,,
1,701_tangeroutlet.com,TangerOutlets is committed to keeping your per...,"[Data Security, First Party Collection/Use, Th...",[Generic],"[not-selected, Explicit]","[not-selected, Unspecified]",[Unspecified],[Generic personal information],"[not-selected, Unspecified]",[Identifiable],...,"[Receive/Shared with, Other]",,,,,,,,,
2,701_tangeroutlet.com,If at any time you want your email information...,[User Choice/Control],,,"[First party collection, Unspecified, First pa...",,"[Contact, Unspecified]","[Opt-out link, Opt-out via contacting company]",,...,,,,,,,,,,
3,1099_enthusiastnetwork.com,TEN: The Enthusiast Network Privacy Policy T...,[Introductory/Generic],,,,,,,,...,,,,,,,,,,
4,1099_enthusiastnetwork.com,I. WHAT INFORMATION DO WE COLLECT ABOUT YOU? ...,[First Party Collection/Use],,[not-selected],[Unspecified],"[Unspecified, Other, Collect on website]","[Contact, Unspecified, Demographic]",[Unspecified],[Identifiable],...,,,,,,,,,,


## 2.4. Training, Validation and Test Split


In [15]:
train_df = policies[policies.filename.isin(train_policies)]
eval_df = policies[policies.filename.isin(validation_policies)]
test_df = policies[policies.filename.isin(test_policies)]
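
The filtering pattern used here, and the `dropna(subset=[...])` calls applied later before training each attribute classifier, can be tried on a toy frame. The column values and file names below are invented; only the `isin` and `dropna` calls mirror the notebook.

```python
import pandas as pd

# Toy segments table: two policies in "train", one elsewhere.
toy = pd.DataFrame({
    "filename": ["a.com", "a.com", "b.com", "c.com"],
    "Text": ["s1", "s2", "s3", "s4"],
    "Purpose": [["Marketing"], None, ["Advertising"], None],
})
train_names = ["a.com", "c.com"]

# Keep every segment whose policy belongs to the training split.
train_toy = toy[toy.filename.isin(train_names)]

# For an attribute classifier, keep only segments annotated with that attribute.
purpose_train = train_toy.dropna(subset=["Purpose"])

print(len(train_toy), len(purpose_train))
```

This is why the attribute classifiers further down report smaller training set sizes than the category classifier: segments without the attribute are dropped first.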

In [16]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1789 entries, 129 to 3670
Data columns (total 24 columns):
filename                     1789 non-null object
Text                         1789 non-null object
Category                     1789 non-null object
Security Measure             138 non-null object
Collection Mode              679 non-null object
Choice Scope                 1150 non-null object
Action First-Party           679 non-null object
Personal Information Type    1167 non-null object
Choice Type                  1150 non-null object
Identifiability              1002 non-null object
Does or Does Not             1002 non-null object
User Type                    1204 non-null object
Purpose                      1150 non-null object
Third Party Entity           527 non-null object
Action Third Party           527 non-null object
Retention Period             62 non-null object
Retention Purpose            62 non-null object
Access Type                  89 non-null object
Ac

In [17]:
eval_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 276 entries, 0 to 3231
Data columns (total 24 columns):
filename                     276 non-null object
Text                         276 non-null object
Category                     276 non-null object
Security Measure             28 non-null object
Collection Mode              89 non-null object
Choice Scope                 171 non-null object
Action First-Party           89 non-null object
Personal Information Type    173 non-null object
Choice Type                  171 non-null object
Identifiability              136 non-null object
Does or Does Not             136 non-null object
User Type                    174 non-null object
Purpose                      171 non-null object
Third Party Entity           68 non-null object
Action Third Party           68 non-null object
Retention Period             8 non-null object
Retention Purpose            8 non-null object
Access Type                  7 non-null object
Access Scope           

In [18]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1727 entries, 3 to 3791
Data columns (total 24 columns):
filename                     1727 non-null object
Text                         1727 non-null object
Category                     1727 non-null object
Security Measure             134 non-null object
Collection Mode              621 non-null object
Choice Scope                 1042 non-null object
Action First-Party           621 non-null object
Personal Information Type    1063 non-null object
Choice Type                  1042 non-null object
Identifiability              890 non-null object
Does or Does Not             890 non-null object
User Type                    1106 non-null object
Purpose                      1042 non-null object
Third Party Entity           423 non-null object
Action Third Party           423 non-null object
Retention Period             50 non-null object
Retention Purpose            50 non-null object
Access Type                  93 non-null object
Access

We will save the resulting datasets

In [19]:
pathlib.Path('./data').mkdir(parents=True, exist_ok=True)

train_df.drop(columns=['filename']).to_pickle("./data/train_df.pkl")
eval_df.drop(columns=['filename']).to_pickle("./data/validation_df.pkl")
test_df.drop(columns=['filename']).to_pickle("./data/test_df.pkl")

# 4. Train Our Classification Model

In [20]:
#import libraries

from simpletransformers.classification import MultiLabelClassificationModel
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, f1_score
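
`MultiLabelBinarizer` turns each segment's list of labels into a fixed-length row of 0s and 1s, one column per class. The plain-Python sketch below shows the idea without scikit-learn. The helpers `multi_hot` and `decode` are written for this illustration and are not the library's implementation.

```python
def multi_hot(label_lists, classes):
    """Encode each list of labels as a 0/1 row aligned with `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows

def decode(rows, classes):
    """Inverse mapping: recover the label tuple for each 0/1 row."""
    return [tuple(c for c, bit in zip(classes, row) if bit) for row in rows]

classes = ["Data Retention", "Data Security", "Policy Change"]
encoded = multi_hot(
    [["Data Security"], ["Policy Change", "Data Retention"], []], classes)
print(encoded)
print(decode(encoded, classes))
```

The number of columns is the `num_labels` value the classifier needs, and the decoding step is what turns model predictions back into readable label names.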



## 4.1 Training Model

In [24]:
class HTMC_classifier():
    def __init__(self, classes ,epochs=1, model_name="classifier", model_type='xlnet', model_path='xlnet-base-cased', output_path=None):
        self.one_hot = MultiLabelBinarizer(classes=classes)
        self.use_cuda = torch.cuda.is_available()
        self.epochs = epochs
        self.model_type = model_type
        self.model_path = model_path
        self.output_path = output_path
        self.load_model()
        print("self.use_cuda: ", self.use_cuda)

    def one_hot_encoder(self, labels, fit=False):
        if(fit):
          return self.one_hot.fit_transform(labels)
        return self.one_hot.transform(labels)


    def one_hot_to_text(self, one_hot_matrix):
      return self.one_hot.inverse_transform(one_hot_matrix)
    
    
    def fit(self, text_series, labels,eval_text_series, eval_labels):
        labels = self.one_hot_encoder(labels, True)
        eval_labels = self.one_hot_encoder(eval_labels)
        print("fitting model, artifacts will be saved to: ", self.output_path)
        self.model = MultiLabelClassificationModel(self.model_type, self.model_path, num_labels=np.array(labels).shape[1],use_cuda=self.use_cuda,
                                      args={'reprocess_input_data': True,
                                            'output_dir':self.output_path ,
                                            'overwrite_output_dir': True,
                                            'eval_batch_size' : 8,
                                            'evaluate_during_training':True,
                                            'evaluate_during_training_silent':False,
                                            'evaluate_during_training_verbose': True,
                                            'evaluate_during_training_steps':-1,
                                            'save_eval_checkpoints':False,
                                            'save_model_every_epoch':False,
                                            'max_seq_length':256,
                                            'train_batch_size':16,
                                            'num_train_epochs': self.epochs,
                                            'save_steps':0,
                                            'use_early_stopping': True,
                                            'early_stopping_patience': 2,
                                            'early_stopping_delta': 0,
                                            'early_stopping_metric': 'f1_avg_score',
                                            'early_stopping_metric_minimize': False,
                                            'no_cache':False,
                                            'no_save':True,
                                            'manual_seed':42         
                                            })

        train_df = pd.DataFrame({"text": text_series, "labels": labels.tolist()})
        eval_df = pd.DataFrame({"text": eval_text_series, "labels": eval_labels.tolist()})
        categories = self.one_hot.classes_
        self.model.train_model(train_df,eval_df=eval_df,output_dir=self.output_path,
                               f1_avg_score=lambda truth, predictions,
                               target_names=categories: f1_score(truth, [np.round(p) for p in predictions],
                                                                 average='micro')
                               )
        self.save_encoder()
        return self.output_path 

    def save_encoder(self):
        # Serialize the fitted label binarizer to disk.
        encoder_filename = self.output_path +"/encoder.pkl"
        with open(encoder_filename, 'wb') as f:
            pickle.dump((self.one_hot), f)
        print("model saved to: ", encoder_filename)
      
    def load_model(self):
        # Hydrate the serialized objects.
        encoder_filename = self.model_path +"/encoder.pkl"
        print("loading model from: ", self.model_path)
        try:
          with open(encoder_filename, 'rb') as f:
              self.one_hot = pickle.load(f)
          # Use the configured model type rather than a hard-coded one.
          self.model = MultiLabelClassificationModel(self.model_type, self.model_path, use_cuda=self.use_cuda)
        except Exception:
          print("couldn't load models from disk")
        
    def predict(self, text_series):
        text_list = text_series.tolist()
        print("text_list: ", len(text_list))
        predictions, _ = self.model.predict(text_list)
        return self.one_hot_to_text(np.array(predictions))

    def evaluate(self, text_series, golden_labels,attribute_name):
        predictions = self.predict(text_series)
        predictions_oh = self.one_hot_encoder(predictions)
        golden_labels_oh = self.one_hot_encoder(golden_labels)
        categories = self.one_hot.classes_
        print(classification_report(golden_labels_oh, predictions_oh, target_names=categories))
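
The early-stopping metric passed to `train_model` in `fit` rounds each predicted probability to 0 or 1 and then computes a micro-averaged F1 score, which pools true positives, false positives and false negatives over all labels before taking the ratio. A minimal plain-Python sketch of that computation follows. The function `micro_f1` and the toy numbers are assumptions for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def micro_f1(truth, probabilities, threshold=0.5):
    """Micro-averaged F1 over multi-hot rows.
    A label counts as predicted when its probability exceeds `threshold`."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(truth, probabilities):
        for t, p in zip(true_row, prob_row):
            pred = 1 if p > threshold else 0
            if pred and t:
                tp += 1
            elif pred and not t:
                fp += 1
            elif t and not pred:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

truth = [[1, 0, 1], [0, 1, 0]]
probabilities = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]]
score = micro_f1(truth, probabilities)
print(round(score, 3))
```

Here two labels are predicted correctly, one is missed and one is spurious, so precision and recall are both 2/3. Micro averaging weights frequent labels more heavily than rare ones, which is worth keeping in mind when reading the per-class reports below.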

## 4.2 Define training loop

In [25]:
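def train_model(attribute_name, df=[], epochs=1, lr=4e-5,base_model_type="xlnet", base_model_name="xlnet-base-cased",train_df=[],eval_df=[],test_df=[]):
    # NOTE: `df` and `lr` are currently unused; `lr` is not forwarded to the
    # classifier, so the simpletransformers default learning rate applies.
    model_name = "classifier_"+"_".join(attribute_name.split())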
    output_path = DATA_PATH +"/models/"+model_name
    classes = list(set(itertools.chain.from_iterable(pd.concat([train_df[attribute_name],eval_df[attribute_name],test_df[attribute_name]],axis=0))))
    classifier = HTMC_classifier(classes,epochs=epochs, model_type=base_model_type, model_path=base_model_name, output_path=output_path)
    print("training set size: {} -- eval set size: {} -- test set size: {}".format(len(train_df[attribute_name]), len(eval_df[attribute_name]),len(test_df[attribute_name])) )
    classifier.fit(train_df["Text"], train_df[attribute_name],eval_df["Text"], eval_df[attribute_name])
    golden_labels = test_df[attribute_name]
    classifier.evaluate(test_df["Text"], golden_labels,attribute_name)
    return classifier
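
Two small steps in `train_model` are easy to overlook: the label vocabulary is the union of all labels seen in the three splits, and the output folder name is derived from the attribute name. The sketch below reproduces both on toy data. The helpers `collect_classes` and `model_name_for` are hypothetical, and the sketch sorts the vocabulary, which the notebook does not do: `list(set(...))` can return the classes in a different order from run to run, which is why the rows of the classification reports appear in no particular order.

```python
import itertools

def collect_classes(*label_columns):
    """Union of all labels across the given columns (each a list of label
    lists), sorted so the class order is reproducible."""
    all_label_lists = itertools.chain.from_iterable(label_columns)
    return sorted(set(itertools.chain.from_iterable(all_label_lists)))

def model_name_for(attribute_name):
    """Build the folder-friendly classifier name used for saved artifacts."""
    return "classifier_" + "_".join(attribute_name.split())

# Toy label columns standing in for train and test splits.
train_labels = [["Contact"], ["Location", "Contact"]]
test_labels = [["Financial"]]

print(collect_classes(train_labels, test_labels))
print(model_name_for("Personal Information Type"))
```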

## 4.3 Training categories classifier

In [23]:
train_model("Category",epochs=5,lr=4e-5,base_model_type="xlnet", base_model_name="xlnet-base-cased",train_df=train_df.dropna(subset=["Category"]), eval_df=eval_df.dropna(subset=["Category"]),test_df=test_df.dropna(subset=["Category"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 1789 -- eval set size: 276 -- test set size: 1727
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Category



This config doesn't use attention memories, a core feature of XLNet. Consider setting `mem_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForMultiLabelSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly i

  0%|          | 0/1789 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/112 [00:02<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/112 [00:02<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/112 [00:02<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/112 [00:02<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/112 [00:02<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Category/encoder.pkl
text_list:  1727


  0%|          | 0/1727 [00:00<?, ?it/s]

  0%|          | 0/216 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.


Recall and F-score are ill-defined and being set to 0.0 in samples with no true labels. Use `zero_division` parameter to control this behavior.



                                      precision    recall  f1-score   support

          First Party Collection/Use       0.83      0.85      0.84       621
                       Data Security       0.81      0.65      0.72       134
                       Policy Change       0.79      0.70      0.74        69
International and Specific Audiences       0.86      0.82      0.84       137
                Practice not covered       0.49      0.38      0.43       153
                Introductory/Generic       0.77      0.54      0.64       338
      Third Party Sharing/Collection       0.77      0.87      0.82       423
      User Access, Edit and Deletion       0.81      0.65      0.72        93
                        Do Not Track       1.00      0.46      0.63        13
                      Data Retention       0.68      0.42      0.52        50
                 User Choice/Control       0.65      0.63      0.64       221
         Privacy contact information       0.87      0.66      

<__main__.HTMC_classifier at 0x7fa3a786df50>

## 4.4 Training attributes classifiers

### 4.4.1 Personal Information Type Classifier



In [26]:
train_model("Personal Information Type",epochs=5,lr=4e-5,base_model_type="xlnet", base_model_name="xlnet-base-cased",train_df=train_df.dropna(subset=["Personal Information Type"]), eval_df=eval_df.dropna(subset=["Personal Information Type"]),test_df=test_df.dropna(subset=["Personal Information Type"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 1167 -- eval set size: 173 -- test set size: 1063
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Personal_Information_Type




  0%|          | 0/1167 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/73 [00:01<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/73 [00:01<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/73 [00:01<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/73 [00:01<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/73 [00:01<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Personal_Information_Type/encoder.pkl
text_list:  1063


  0%|          | 0/1063 [00:00<?, ?it/s]

  0%|          | 0/133 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



                               precision    recall  f1-score   support

          Personal identifier       0.00      0.00      0.00        30
                  Demographic       0.80      0.60      0.69        60
       User online activities       0.73      0.64      0.69       152
                 User profile       0.50      0.04      0.07        50
                     Location       0.74      0.37      0.49        71
Cookies and tracking elements       0.89      0.91      0.90       171
    IP address and device IDs       0.80      0.92      0.86        78
                  Unspecified       0.75      0.61      0.68       499
                        Other       0.37      0.06      0.10       123
                  Survey data       0.00      0.00      0.00        15
                       Health       0.67      0.25      0.36         8
                      Contact       0.82      0.78      0.80       210
                    Financial       0.77      0.69      0.73        68
     

<__main__.HTMC_classifier at 0x7f5084b80310>
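
The two `zero_division` warnings in the output above come from scikit-learn's metric functions. Precision is TP / (TP + FP), so when a label is never predicted the denominator is zero and the score is reported as 0.0; some of the 0.00 rows in the report may be cases of this. The following is a minimal pure-Python sketch of that fallback. `precision_with_fallback` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def precision_with_fallback(y_true, y_pred, zero_division=0.0):
    """Precision for one binary label.

    Returns `zero_division` when the label is never predicted,
    because TP / (TP + FP) would otherwise divide by zero.
    """
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    if predicted_pos == 0:
        return zero_division
    return true_pos / predicted_pos


y_true = [1, 0, 1, 1]
never_predicted = [0, 0, 0, 0]      # the model never outputs this label
sometimes_predicted = [1, 1, 0, 0]  # one correct and one wrong prediction

print(precision_with_fallback(y_true, never_predicted))      # 0.0
print(precision_with_fallback(y_true, sometimes_predicted))  # 0.5
```

In scikit-learn itself, passing `zero_division=0` to `classification_report` keeps the 0.0 value and silences the warning.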

### 4.4.2 Purpose Classifier


In [23]:
train_model("Purpose", epochs=5, lr=4e-5,
            base_model_type="xlnet", base_model_name="xlnet-base-cased",
            train_df=train_df.dropna(subset=["Purpose"]),
            eval_df=eval_df.dropna(subset=["Purpose"]),
            test_df=test_df.dropna(subset=["Purpose"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 1150 -- eval set size: 171 -- test set size: 1042
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Purpose



This config doesn't use attention memories, a core feature of XLNet. Consider setting `mem_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForMultiLabelSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly i

  0%|          | 0/1150 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/72 [00:01<?, ?it/s]

  0%|          | 0/171 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/72 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/72 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/72 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/72 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/22 [00:00<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Purpose/encoder.pkl
text_list:  1042


  0%|          | 0/1042 [00:00<?, ?it/s]

  0%|          | 0/131 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



                                precision    recall  f1-score   support

         Basic service/feature       0.66      0.62      0.64       265
            Analytics/Research       0.82      0.81      0.82       178
 Personalization/Customization       0.72      0.72      0.72       101
             Legal requirement       0.82      0.58      0.68        71
Service operation and security       0.70      0.55      0.62       129
                   Advertising       0.84      0.86      0.85       177
                     Marketing       0.70      0.73      0.72       208
                         Other       1.00      0.01      0.02        81
                   Unspecified       0.73      0.71      0.72       411
            Merger/Acquisition       1.00      0.80      0.89        30
    Additional service/feature       0.61      0.48      0.54       198

                     micro avg       0.73      0.65      0.69      1849
                     macro avg       0.78      0.63      0.66 

<__main__.HTMC_classifier at 0x7f8d46181490>
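
The progress-bar totals in the log above can be related to the dataset sizes with a quick calculation. The batch sizes below are an assumption inferred from the logged numbers (they are not stated in this section): a training batch size of 16 and an evaluation batch size of 8 reproduce the totals 72, 22 and 131 shown for the Purpose classifier.

```python
import math

# Assumed values, inferred from the log; not confirmed by the notebook text.
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 8


def num_batches(num_examples, batch_size):
    """Number of batches per pass; the last batch may be partially filled."""
    return math.ceil(num_examples / batch_size)


print(num_batches(1150, TRAIN_BATCH_SIZE))  # 72 training steps per epoch
print(num_batches(171, EVAL_BATCH_SIZE))    # 22 evaluation batches
print(num_batches(1042, EVAL_BATCH_SIZE))   # 131 test batches
```

The same arithmetic matches the totals logged for the other classifiers in this section, for example 679 training examples giving 43 steps per epoch.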

### 4.4.3 Action First-Party Classifier


In [29]:
train_model("Action First-Party", epochs=5, lr=4e-5,
            base_model_type="xlnet", base_model_name="xlnet-base-cased",
            train_df=train_df.dropna(subset=["Action First-Party"]),
            eval_df=eval_df.dropna(subset=["Action First-Party"]),
            test_df=test_df.dropna(subset=["Action First-Party"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 679 -- eval set size: 89 -- test set size: 621
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Action_First-Party


Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForMultiLabelSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForMultiLabelSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.weight', 'logits_proj.bias']
You should probably TRAIN this model on a down-stream tas

  0%|          | 0/679 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

  0%|          | 0/89 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Action_First-Party/encoder.pkl
text_list:  621


  0%|          | 0/621 [00:00<?, ?it/s]

  0%|          | 0/78 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



                                                  precision    recall  f1-score   support

                                           Other       0.00      0.00      0.00       109
                           Collect in mobile app       0.71      0.55      0.62        22
Receive from other service/third-party (unnamed)       0.50      0.31      0.38        39
                       Collect on mobile website       0.00      0.00      0.00         8
                              Collect on website       0.63      0.69      0.66       295
                    Track user on other websites       0.00      0.00      0.00        14
  Receive from other parts of company/affiliates       0.00      0.00      0.00         9
                                     Unspecified       0.75      0.75      0.75       365
             Collect from user on other websites       0.00      0.00      0.00         5
  Receive from other service/third-party (named)       0.00      0.00      0.00        33

        

<__main__.HTMC_classifier at 0x7f44b519e710>

### 4.4.4 Action Third Party Classifier

In [30]:
train_model("Action Third Party", epochs=5, lr=4e-5,
            base_model_type="xlnet", base_model_name="xlnet-base-cased",
            train_df=train_df.dropna(subset=["Action Third Party"]),
            eval_df=eval_df.dropna(subset=["Action Third Party"]),
            test_df=test_df.dropna(subset=["Action Third Party"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 527 -- eval set size: 68 -- test set size: 423
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Action_Third_Party



This config doesn't use attention memories, a core feature of XLNet. Consider setting `mem_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForMultiLabelSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly i

  0%|          | 0/527 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/33 [00:01<?, ?it/s]


Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate



  0%|          | 0/68 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/9 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/33 [00:02<?, ?it/s]

Running Evaluation:   0%|          | 0/9 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/33 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/9 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/33 [00:02<?, ?it/s]

Running Evaluation:   0%|          | 0/9 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/33 [00:02<?, ?it/s]

Running Evaluation:   0%|          | 0/9 [00:00<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Action_Third_Party/encoder.pkl
text_list:  423


  0%|          | 0/423 [00:00<?, ?it/s]

  0%|          | 0/53 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



                                    precision    recall  f1-score   support

                               See       0.90      0.32      0.47        28
Collect on first party website/app       0.52      0.49      0.50        61
                             Other       0.00      0.00      0.00        46
               Receive/Shared with       0.91      0.95      0.93       296
                       Unspecified       0.83      0.10      0.19        48
  Track on first party website/app       0.41      0.70      0.51        37

                         micro avg       0.78      0.68      0.73       516
                         macro avg       0.59      0.43      0.43       516
                      weighted avg       0.74      0.68      0.67       516
                       samples avg       0.76      0.74      0.73       516



<__main__.HTMC_classifier at 0x7f469098b590>
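
As a sanity check on the report above, the micro-averaged F1 score is the harmonic mean of the micro-averaged precision and recall. The short sketch below recomputes it from the rounded values in the table (0.78 and 0.68); the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


micro_precision = 0.78  # "micro avg" row of the Action Third Party report
micro_recall = 0.68

print(f"{f1_from(micro_precision, micro_recall):.2f}")  # 0.73
```

The result agrees with the 0.73 micro-averaged F1 reported for this classifier.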

### 4.4.5 Collection Mode Classifier

In [24]:
train_model("Collection Mode", epochs=5, lr=4e-5,
            base_model_type="xlnet", base_model_name="xlnet-base-cased",
            train_df=train_df.dropna(subset=["Collection Mode"]),
            eval_df=eval_df.dropna(subset=["Collection Mode"]),
            test_df=test_df.dropna(subset=["Collection Mode"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 679 -- eval set size: 89 -- test set size: 621
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Collection_Mode



This config doesn't use attention memories, a core feature of XLNet. Consider setting `mem_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForMultiLabelSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly i

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/43 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Collection_Mode/encoder.pkl
text_list:  621


  0%|          | 0/621 [00:00<?, ?it/s]

  0%|          | 0/78 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



              precision    recall  f1-score   support

    Explicit       0.08      0.11      0.09       178
 Unspecified       0.31      0.21      0.25       277
not-selected       0.44      0.32      0.37       252
    Implicit       0.19      0.23      0.21       169

   micro avg       0.24      0.23      0.23       876
   macro avg       0.26      0.22      0.23       876
weighted avg       0.28      0.23      0.25       876
 samples avg       0.23      0.21      0.20       876



<__main__.HTMC_classifier at 0x7f8b7c148c50>

### 4.4.6 Access Type Classifier

In [25]:
train_model("Access Type", epochs=5, lr=4e-5,
            base_model_type="xlnet", base_model_name="xlnet-base-cased",
            train_df=train_df.dropna(subset=["Access Type"]),
            eval_df=eval_df.dropna(subset=["Access Type"]),
            test_df=test_df.dropna(subset=["Access Type"]))

loading model from:  xlnet-base-cased
couldn't load models from disk
self.use_cuda:  True
training set size: 89 -- eval set size: 7 -- test set size: 93
fitting model, artifacts will be saved to:  ./OPP-115/models/classifier_Access_Type



This config doesn't use attention memories, a core feature of XLNet. Consider setting `mem_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in https://arxiv.org/abs/1906.08237

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForMultiLabelSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing XLNetForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly i

  0%|          | 0/89 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/6 [00:01<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/6 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/6 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/6 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/6 [00:01<?, ?it/s]

Running Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

model saved to:  ./OPP-115/models/classifier_Access_Type/encoder.pkl
text_list:  93


  0%|          | 0/93 [00:00<?, ?it/s]

  0%|          | 0/12 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.



                          precision    recall  f1-score   support

      Deactivate account       0.00      0.00      0.00         5
                    View       0.00      0.00      0.00        25
        Edit information       0.73      0.94      0.82        65
                   Other       0.00      0.00      0.00        21
             Unspecified       0.00      0.00      0.00         7
                  Export       0.00      0.00      0.00         1
Delete account (partial)       0.00      0.00      0.00        18
   Delete account (full)       0.00      0.00      0.00         5

               micro avg       0.73      0.41      0.53       147
               macro avg       0.09      0.12      0.10       147
            weighted avg       0.32      0.41      0.36       147
             samples avg       0.66      0.47      0.52       147



<__main__.HTMC_classifier at 0x7f8c569e6c10>
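
The large gap between the micro and macro averages in the Access Type report follows directly from how the macro average is defined: it is the unweighted mean of the per-label scores, so the seven labels that score 0.00 pull it down regardless of their support. The sketch below recomputes the macro row from the rounded per-label values in the table; it is an illustration of the arithmetic, not the code that produced the report.

```python
# (label, precision, recall, f1) copied from the Access Type report above.
per_label = [
    ("Deactivate account", 0.00, 0.00, 0.00),
    ("View", 0.00, 0.00, 0.00),
    ("Edit information", 0.73, 0.94, 0.82),
    ("Other", 0.00, 0.00, 0.00),
    ("Unspecified", 0.00, 0.00, 0.00),
    ("Export", 0.00, 0.00, 0.00),
    ("Delete account (partial)", 0.00, 0.00, 0.00),
    ("Delete account (full)", 0.00, 0.00, 0.00),
]


def macro_average(rows, column):
    """Unweighted mean of one metric column over all labels."""
    return sum(row[column] for row in rows) / len(rows)


macro_precision = macro_average(per_label, 1)
macro_recall = macro_average(per_label, 2)
macro_f1 = macro_average(per_label, 3)

print(f"{macro_precision:.2f} {macro_recall:.2f} {macro_f1:.2f}")  # 0.09 0.12 0.10
```

These values match the `macro avg` row, which shows that only "Edit information" contributes to the score. With 89 training examples spread over eight labels, the other labels are effectively never predicted.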

## 4.5 Training all classifiers

In [None]:
train_model("Category", epochs=5, lr=4e-5,
            base_model_type="xlnet", base_model_name="xlnet-base-cased",
            train_df=train_df.dropna(subset=["Category"]),
            eval_df=eval_df.dropna(subset=["Category"]),
            test_df=test_df.dropna(subset=["Category"]))

In [None]:
import pickle

training_status_filename = DATA_PATH + "/output/training_status"

# Resume from a previous run if a status file exists; otherwise start fresh.
# "Text" and "Category" are skipped because they are not attribute columns.
try:
    with open(training_status_filename, "rb") as f:
        skip_columns = pickle.load(f)
    print("loaded skip_columns: ", skip_columns)
except (OSError, EOFError, pickle.UnpicklingError):
    print("couldn't load training status file, will create a new one")
    skip_columns = ["Text", "Category"]

# Train one classifier per remaining attribute column.
for attribute_name in train_df.drop(columns=skip_columns):
    print("start training for attribute: " + attribute_name)
    train_model(attribute_name, epochs=5, lr=4e-5,
                base_model_type="xlnet", base_model_name="xlnet-base-cased",
                train_df=train_df.dropna(subset=[attribute_name]),
                eval_df=eval_df.dropna(subset=[attribute_name]),
                test_df=test_df.dropna(subset=[attribute_name]))
    print("finished training for attribute: " + attribute_name)

    # Record progress after every attribute so an interrupted run can resume.
    skip_columns.append(attribute_name)
    with open(training_status_filename, "wb") as f:
        pickle.dump(skip_columns, f)

print("Training all classifiers finished")

# Revision History

TBA