# CNN for HMTC - Hierarchical Multi-label classification of Privacy Policies.

Code for reproducing the results in the paper "A Combined Rule-Based and Machine Learning Approach for Automated GDPR Compliance Checking".

We build a local multi-label classifier for the higher level to predict categories and one local multi-label classifier per attribute to predict their values. In total, one classifier is trained for the first level, and 20 classifiers are trained for the lower level.

This approach is inspired by [Polisis](https://arxiv.org/abs/1802.02561), whose authors built a multi-label classifier for the first level of the hierarchy to predict the categories of data practices in an input segment. Once the categories of a segment are predicted, the second step predicts the values of the attributes that are children of the predicted categories. For example, if the first-level classifier predicts the categories *Data Retention* and *Data Security*, then only the local classifiers for the attributes *Retention Period, Retention Purpose, Personal Information Type, Security Measure* are used to predict values.
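
The two-step prediction described above can be sketched as follows. This is a minimal illustration, not the repository's inference code: `predict_hierarchy`, the stub classifiers, and the assignment of the four attributes to the two categories in `CATEGORY_TO_ATTRIBUTES` are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from first-level categories to child attributes.
# Only the two categories named in the text are listed; which attribute
# belongs to which category is an assumption made for this sketch.
CATEGORY_TO_ATTRIBUTES: Dict[str, List[str]] = {
    "Data Retention": ["Retention Period", "Retention Purpose", "Personal Information Type"],
    "Data Security": ["Security Measure"],
}

Classifier = Callable[[str], List[str]]


def predict_hierarchy(segment: str,
                      category_clf: Classifier,
                      attribute_clfs: Dict[str, Classifier],
                      category_to_attributes: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Run the first-level classifier, then only the attribute classifiers
    whose parent category was predicted."""
    predictions: Dict[str, List[str]] = {}
    for category in category_clf(segment):
        for attribute in category_to_attributes.get(category, []):
            # An attribute may be shared by several categories; evaluate it once.
            if attribute not in predictions:
                predictions[attribute] = attribute_clfs[attribute](segment)
    return predictions


# Stub classifiers standing in for the trained CNNs.
def stub_category_clf(segment: str) -> List[str]:
    return ["Data Retention", "Data Security"]


stub_attribute_clfs: Dict[str, Classifier] = {}
for attrs in CATEGORY_TO_ATTRIBUTES.values():
    for name in attrs:
        stub_attribute_clfs[name] = lambda segment, name=name: [f"<value of {name}>"]

result = predict_hierarchy("We keep your data for 30 days.",
                           stub_category_clf, stub_attribute_clfs,
                           CATEGORY_TO_ATTRIBUTES)
print(sorted(result))
```

Attribute classifiers whose parent category is not predicted are never called, which is the point of the local-classifier-per-node design.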

The authors of [Polisis](https://arxiv.org/abs/1802.02561) use the same base classifier for all the multi-label classifiers. In this paper we reproduce their work, using a CNN as the base classifier with the same architecture and hyperparameters. The CNN is composed of one convolutional layer with a ReLU activation, followed by a dense layer with a ReLU activation. The last layer is a dense layer with a sigmoid activation. We tokenize segments using the Penn Treebank tokenizer in NLTK. Tokens are mapped into a k-dimensional space via an embedding layer. We used FastText to train word embeddings* on 130,326 privacy policies.
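
To make the layer sequence concrete, the sketch below traces the tensor shape of a single segment through the network. The default sizes mirror values used in this notebook (`emb_dim=300`, `Ks=[3]`, `Co=200`, `Hu=[100]`, 12 categories). The max-over-time pooling step, the absence of padding, and the function name `cnn_shape_trace` are assumptions; the repository's `ParentCategoryModel` may differ.

```python
def cnn_shape_trace(seq_len, emb_dim=300, kernel_size=3, num_filters=200,
                    hidden_units=100, num_classes=12):
    """Return (layer name, output shape) pairs for one tokenized segment."""
    if seq_len < kernel_size:
        raise ValueError("segment must contain at least kernel_size tokens (or be padded)")
    # Valid 1-D convolution with stride 1 (assumed: no padding).
    conv_len = seq_len - kernel_size + 1
    return [
        ("embedding", (seq_len, emb_dim)),
        ("conv + ReLU", (conv_len, num_filters)),
        # Assumed pooling step that collapses the time axis to a fixed size.
        ("max-over-time pooling (assumed)", (num_filters,)),
        ("dense + ReLU", (hidden_units,)),
        ("dense + sigmoid", (num_classes,)),
    ]


for name, shape in cnn_shape_trace(seq_len=50):
    print(f"{name:<34}{shape}")
```

The final sigmoid yields one independent probability per label, which is what makes the classifier multi-label rather than multi-class.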

\* *Important note: by default the code uses the in-domain FastText embeddings. Due to licensing, these embeddings can be provided only upon request. However, GloVe embeddings can also be used to run this code.*
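
For readers who go the GloVe route, the following sketch parses embeddings stored in the plain-text format GloVe is distributed in (one token per line, followed by its space-separated vector components). The helper name `load_text_embeddings` is hypothetical, and whether the repository's `get_pretrained_embs` accepts such a file directly is not confirmed here.

```python
import io


def load_text_embeddings(lines, emb_dim):
    """Parse 'word v1 v2 ... vN' lines into a dict of word -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != emb_dim + 1:
            continue  # skip blank, malformed, or header lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Tiny in-memory stand-in for a GloVe file; real files have far more dimensions.
sample = io.StringIO("privacy 0.1 0.2 0.3\npolicy 0.4 0.5 0.6\n")
glove = load_text_embeddings(sample, emb_dim=3)
print(glove["policy"])
```

With a real file, the same function can be called on an open file handle, with `emb_dim` set to the dimensionality of the downloaded vectors.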

## Import necessary libraries

If you do not have `fastcore` or `fastprogress` installed, uncomment and run the following cell.

In [39]:
# Install fastcore and fastprogress
# !pip install fastcore
# !pip install fastprogress

In [1]:
# IN THE MAIN PATH
%reload_ext autoreload
%autoreload 2

# PyTorch
import torch

# Standard library
import sys
import os
import glob
from collections import OrderedDict
from pathlib import Path

# Third-party packages
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')

#Imports from the repository
from data_utils.labels import labels_dict
from data_utils.data_processing import get_vocab, tokenize_sentence, process_dataset, create_multi_labels, stack_segments
from data_utils.data_extraction import load_data, label_to_vector, load_attr_cat_data, load_category_data, load_obj, save_obj,unpickle_dataset
from data_utils.data_collator import DataCollator
from data_utils.ppd_dataset import PrivacyPoliciesDataset
from model_nn.model_utils import get_emb_weights, extract_from_model_file, get_pretrained_embs, init_weights
from model_nn.model_nn import HierarchicalModel
from model_nn.model_nn import ParentCategoryModel
from model_trainer import Trainer
from evaluation.f1_score import f1_score, f1_score_per_label
from evaluation.show_results import print_results, print_results_best_t

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Let's first define paths and attributes

In [2]:
path = Path('.')

# ATTRS
emb_dim=300

# Whether to use parent information or not
parent_information=True

if parent_information:
    attr_data_path = path/'data/child_parent'
else:
    attr_data_path = path/'data/child_only'
    
raw_data_path = path/'data/category'
fasttext_model = path/'fasttext/fastText-0.1.0/corpus_vectors_default_300d'

## Data Extraction

Extract the vocab and data from the CSV data

In [12]:
train_category_df, train_categories_dict = load_category_data(raw_data_path / "train")
validation_category_df, validation_categories_dict = load_category_data(raw_data_path / "validation")
test_category_df, test_categories_dict = load_category_data(raw_data_path /"test")
    
categories = list(train_categories_dict.keys())
category_df = pd.concat([train_category_df,validation_category_df,test_category_df])
print(f'{category_df.category.nunique()} unique categories found')

categories_dict = train_categories_dict

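# Note: drop_duplicates returns a new DataFrame; the result is not assigned here,
# so category_df still contains one row per (segment, category) pair
category_df.drop_duplicates(subset=['segment'])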
category_df.head()

12 unique categories found


| Unnamed: 0 | idx | segment | category | label |
|---|---|---|---|---|
| 0 | 0 | Published: January 1, 2015 | Introductory/Generic | [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... |
| 1 | 1 | The WP Company LLC (The Washington Post) recog... | Introductory/Generic | [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... |
| 2 | 1 | The WP Company LLC (The Washington Post) recog... | Practice not covered | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... |
| 3 | 2 | This Privacy Policy covers the following: Ho... | Introductory/Generic | [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ... |
| 4 | 3 | Information We Collect We may collect pers... | First Party Collection/Use | [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


In [13]:
categories_dict

{'Data Retention': 0,
 'Data Security': 1,
 'Do Not Track': 2,
 'First Party Collection/Use': 3,
 'International and Specific Audiences': 4,
 'Introductory/Generic': 5,
 'Policy Change': 6,
 'Practice not covered': 7,
 'Privacy contact information': 8,
 'Third Party Sharing/Collection': 9,
 'User Access, Edit and Deletion': 10,
 'User Choice/Control': 11}

### Create Multilabel labels column

In [14]:
def multilabel_df(category_df):
    parent =  category_df.groupby('segment')['category'].unique()
    cat_comp_df = pd.DataFrame({'segment':parent.index.values, 'category':parent})
    cat_comp_df.reset_index(inplace=True, drop=True)

    labels_ls=[]
    for r in cat_comp_df.category:
        idx_ls=[]
        for c in r:
            idx_ls.append(categories_dict[c])   

        target = [1] * len(idx_ls)
        labels = [0] * len(categories)
        for x,y in zip(idx_ls,target):
            labels[x] = y
        labels_ls.append(labels)

    cat_comp_df['label'] = labels_ls
    return cat_comp_df

In [15]:
train_category_df_comp = multilabel_df(train_category_df)
validation_category_df_comp = multilabel_df(validation_category_df)
test_category_df_comp  = multilabel_df(test_category_df)
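
The label vectors built by `multilabel_df` are multi-hot encodings: one position per category, set to 1 for every category attached to the segment. The toy example below illustrates the idea in isolation; `to_multi_hot` and `toy_dict` are hypothetical names used only for this sketch.

```python
def to_multi_hot(categories, categories_dict):
    """Return a 0/1 vector with a 1 at the index of every category present."""
    vector = [0] * len(categories_dict)
    for c in categories:
        vector[categories_dict[c]] = 1
    return vector


toy_dict = {"Data Retention": 0, "Data Security": 1, "Do Not Track": 2}
print(to_multi_hot(["Data Security", "Data Retention"], toy_dict))  # [1, 1, 0]
```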

### Extract from Attribute_Category Data

In [16]:
from data_utils.data_extraction import load_attr_cat_data

# Get attribute folders (keep only the sub-directories of attr_data_path)
attribute_folders = [name for name in os.listdir(attr_data_path) if os.path.isdir(attr_data_path / name)]

attr_cat_df, child_labels_dict = load_attr_cat_data(attr_data_path, attribute_folders=attribute_folders,
                                                    categories_dict=categories_dict, include_parent=parent_information,
                                                   model=False)
# Reverse mapping: label index -> label name
rev_child_labels_dict = {v: k for k, v in child_labels_dict.items()}

child_labels = list(child_labels_dict.keys())

#print(len(attr_cat_df))
if parent_information:
    print(f'{attr_cat_df.parent_label.nunique()} unique categories found')

extracting from Action First-Party ...


extracting from Choice Scope ...
extracting from Purpose ...
extracting from Action Third Party ...
extracting from Does or Does Not ...
extracting from Notification Type ...
extracting from Personal Information Type ...
extracting from Audience Type ...
extracting from Collection Mode ...
extracting from User Type ...
extracting from Third Party Entity ...
extracting from Access Scope ...
extracting from User Choice ...
extracting from Identifiability ...
extracting from Retention Purpose ...
extracting from Change Type ...
extracting from Security Measure ...
extracting from Access Type ...
extracting from Choice Type ...
extracting from Do Not Track policy ...
extracting from Retention Period ...
9 unique categories found


## Vocab Generation
Get Vocab from all files in "raw_data" and merge with vocab from each attribute folder

In [17]:
# RAW VOCAB
category_vocab = get_vocab(raw_data_path)
len(category_vocab), max(category_vocab.values())

Generating vocab ...


(6819, 6818)

Extract the vocab from every attribute folder and merge it with the master vocab.

In [18]:
master_vocab = category_vocab

# VOCAB EXTRACTION FROM attr_data_path (keep only sub-directories)
attribute_folders = [name for name in os.listdir(attr_data_path) if os.path.isdir(attr_data_path / name)]

for folder in attribute_folders:
    sub_data_path = attr_data_path/folder
    vocab = get_vocab(sub_data_path)

    # Give every token not yet in the master vocab the next free index
    idx = max(master_vocab.values()) + 1
    for k in vocab.keys():
        if k not in master_vocab:
            master_vocab[k] = idx
            idx += 1

# Sanity check: size, largest index, number of distinct indices, number of distinct tokens
len(master_vocab), max(master_vocab.values()), len(set(master_vocab.values())), len(set(master_vocab.keys()))

Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...
Generating vocab ...


(6819, 6818, 6819, 6819)
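
The merge performed above can be illustrated on toy dictionaries. In this sketch, `merge_vocab` is a hypothetical helper (not part of the repository) that appends unseen tokens with consecutive indices and leaves existing entries untouched.

```python
def merge_vocab(master, other):
    """Add tokens from `other` that are missing in `master`, using fresh indices."""
    next_idx = max(master.values(), default=-1) + 1
    for token in other:
        if token not in master:
            master[token] = next_idx
            next_idx += 1
    return master


master = {"we": 0, "collect": 1}
merge_vocab(master, {"collect": 0, "cookies": 1, "data": 2})
print(master)  # {'we': 0, 'collect': 1, 'cookies': 2, 'data': 3}
```

Because indices are only ever appended, the size of the vocabulary stays equal to the largest index plus one, which is what the tuple printed above checks.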

Saving the resulting vocabulary.

In [19]:
save_obj(master_vocab, 'models/master_vocab')

## Pretrained Embeddings

In [20]:
# MODEL UTILS
fasttext_embeddings = get_pretrained_embs(fasttext_model, emb_dim=emb_dim)
emb_weights = get_emb_weights(embeddings=fasttext_embeddings, vocab=master_vocab, emb_dim=emb_dim, oov_random=True)
emb_weights = torch.tensor(emb_weights).float()

Extracting pretrained embeddings ...
Generating embedding matrix ...
Some words were missing in the pretrained embedding. 1366 words were not found.
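
The message above reports that 1366 vocabulary tokens have no pretrained vector. One common way to handle such tokens is to give them small random vectors, sketched below. This is an assumption about what `oov_random=True` does; `build_embedding_matrix`, the uniform range, and the seed are illustrative choices, not the repository's implementation.

```python
import random


def build_embedding_matrix(vocab, embeddings, emb_dim, seed=0):
    """Build a row-per-index matrix; tokens without a pretrained vector
    receive a small random vector (assumed out-of-vocabulary strategy)."""
    rng = random.Random(seed)
    matrix = [[0.0] * emb_dim for _ in range(max(vocab.values()) + 1)]
    missing = 0
    for token, idx in vocab.items():
        if token in embeddings:
            matrix[idx] = list(embeddings[token])
        else:
            # Illustrative range; the actual initialisation is not documented here.
            matrix[idx] = [rng.uniform(-0.25, 0.25) for _ in range(emb_dim)]
            missing += 1
    return matrix, missing


vocab = {"privacy": 0, "policy": 1, "foobar": 2}
pretrained = {"privacy": [0.1, 0.2], "policy": [0.3, 0.4]}
matrix, missing = build_embedding_matrix(vocab, pretrained, emb_dim=2)
print(len(matrix), missing)  # 3 1
```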


In [21]:
if parent_information:
    # Map each parent category to the unique attributes annotated under it
    grouped_attr = attr_cat_df.groupby('parent_label')['attribute'].unique()
    parent_attr_dict = dict(zip(grouped_attr.index, grouped_attr.values))
    save_obj(parent_attr_dict, 'models/parent_attr_dict')
else:
    parent_attr_dict = None
    print('parent_attr_dict was not created because parent_information is False: your attribute data does not contain a parent column. '
          'You will need to load a saved version or create this dict manually to be able to do inference.')

### Train Parent Category Model

In [22]:
# DATA PROCESSING
train_cat_tokens, train_cat_parent_labels_arr = process_dataset(df=train_category_df_comp, 
                                                    vocab=master_vocab, 
                                                    include_parent=False, 
                                                    attr_model=False)

validation_cat_tokens, validation_cat_parent_labels_arr = process_dataset(df=validation_category_df_comp, 
                                                    vocab=master_vocab, 
                                                    include_parent=False, 
                                                    attr_model=False)
test_cat_tokens, test_cat_parent_labels_arr = process_dataset(df=test_category_df_comp, 
                                                    vocab=master_vocab, 
                                                    include_parent=False, 
                                                    attr_model=False)

# DATASET
train_dataset = PrivacyPoliciesDataset(train_cat_tokens, train_cat_parent_labels_arr, train_categories_dict.keys(),
                                 parent_information=None)

validation_dataset = PrivacyPoliciesDataset(validation_cat_tokens, validation_cat_parent_labels_arr, validation_categories_dict.keys(),
                                 parent_information=None)

test_dataset = PrivacyPoliciesDataset(test_cat_tokens, test_cat_parent_labels_arr, test_categories_dict.keys(),
                                 parent_information=None)

# Load model
parent_model = ParentCategoryModel(weights_matrix=emb_weights, num_classes=len(categories),
                                   drop=0.2, Co=200, Hu=[100], Ks=[3])

# Initialise our Trainer
trainer = Trainer()

# Initialise the data collator
data_collator = DataCollator()

# Load model and datasets and start training
results = trainer.train(parent_model, epochs_num=300, batch_size=40, lr=0.01, weight_decay=0.2, 
                            train_dataset=train_dataset, validation_dataset=validation_dataset, 
                            data_collator=data_collator, evaluate_steps=100, verbose=False, 
                            has_parent=parent_information)

print_results(parent_model, data_collator, train_dataset, validation_dataset, threshold=0.5)
print('\n')
print_results(parent_model, data_collator, train_dataset, test_dataset, threshold=0.5)
print('\n')
print('***'*33)
print('***'*33)

Processing dataset ...
Num of unique segments: 1738
Processing dataset ...
Num of unique segments: 955
Processing dataset ...
Num of unique segments: 1028


last epoch finished: 300 -- progress: 100.0% -- time: 0.23468099636499884 mins
Training completed. Total training time: 6.42 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

2977.0 Train Labels
1738 Train Segments
1587.0 Validation Labels
955 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Data Retention                                  0.61      0.86      0.57      87        48        
Data Security                                   0.83      0.94      0.77      1
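
The per-label scores above are obtained by turning each sigmoid output into a 0/1 decision at the 0.5 threshold. The sketch below shows that step and the standard precision, recall, and F1 formulas for a single label. It is illustrative only: `precision_recall_f1` is a hypothetical helper, the `>=` comparison is an assumption, and the repository's `evaluation.f1_score` may use a different averaging scheme, so its numbers need not match this formula.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at `threshold` and score one label."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]
p, r, f = precision_recall_f1(y_true, y_prob)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```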

In [26]:
def multilabel_df(category_df, categories_dict):
    # Collapse rows so that each unique segment carries the list of its categories
    parent = category_df.groupby('segment')['category'].unique()
    cat_comp_df = pd.DataFrame({'segment': parent.index.values, 'category': parent})
    cat_comp_df.reset_index(inplace=True, drop=True)

    # Register categories missing from categories_dict before building the vectors,
    # so that every label vector has the same length
    for r in cat_comp_df.category:
        for c in r:
            if c not in categories_dict:
                categories_dict[c] = len(categories_dict)

    # Build one multi-hot vector per segment
    labels_ls = []
    for r in cat_comp_df.category:
        labels = [0] * len(categories_dict)
        for c in r:
            labels[categories_dict[c]] = 1
        labels_ls.append(labels)

    cat_comp_df['label'] = labels_ls
    return cat_comp_df, categories_dict

### Attribute models training

In [29]:
def train_attribute_model(attr, i, epochs_num=30, batch_size=32, lr=0.005):
    # Train and evaluate one multi-label classifier for the values of attribute `attr`
    model_name = f'{attr}_attr_value_model'
    print(f'STARTING {i}: {attr} ATTRIBUTE MODEL TRAINING')
    raw_data_path = path/'data/child_only'

    train_category_df, train_categories_dict = load_category_data(raw_data_path / attr / "train")
    validation_category_df, validation_categories_dict = load_category_data(raw_data_path / attr / "validation")
    test_category_df, test_categories_dict = load_category_data(raw_data_path / attr/"test")

    categories = list(train_categories_dict.keys())
    category_df = pd.concat([train_category_df,validation_category_df,test_category_df])
    print(f'{category_df.category.nunique()} unique categories found')

    train_category_df_comp,train_categories_dict = multilabel_df(train_category_df,train_categories_dict)
    validation_category_df_comp ,validation_categories_dict= multilabel_df(validation_category_df,validation_categories_dict)
    test_category_df_comp,test_categories_dict  = multilabel_df(test_category_df,test_categories_dict)


    # DATA PROCESSING
    train_cat_tokens, train_cat_parent_labels_arr = process_dataset(df=train_category_df_comp, 
                                                        vocab=master_vocab, 
                                                        include_parent=False, 
                                                        attr_model=False)

    validation_cat_tokens, validation_cat_parent_labels_arr = process_dataset(df=validation_category_df_comp, 
                                                        vocab=master_vocab, 
                                                        include_parent=False, 
                                                        attr_model=False)
    test_cat_tokens, test_cat_parent_labels_arr = process_dataset(df=test_category_df_comp, 
                                                        vocab=master_vocab, 
                                                        include_parent=False, 
                                                        attr_model=False)

    # DATASET
    train_dataset = PrivacyPoliciesDataset(train_cat_tokens, train_cat_parent_labels_arr, train_categories_dict.keys(),
                                     parent_information=None)

    validation_dataset = PrivacyPoliciesDataset(validation_cat_tokens, validation_cat_parent_labels_arr, validation_categories_dict.keys(),
                                     parent_information=None)

    test_dataset = PrivacyPoliciesDataset(test_cat_tokens, test_cat_parent_labels_arr, test_categories_dict.keys(),
                                     parent_information=None)

    # Load model
    parent_model = ParentCategoryModel(weights_matrix=emb_weights, num_classes=len(categories),
                                       drop=0.2, Co=200, Hu=[100], Ks=[3])

    # Initialise our Trainer
    trainer = Trainer()

    # Initialise the data collator
    data_collator = DataCollator()

    # Load model and datasets and start training (use the function arguments, not hard-coded values)
    results = trainer.train(parent_model, epochs_num=epochs_num, batch_size=batch_size, lr=lr, weight_decay=0.2, 
                                train_dataset=train_dataset, validation_dataset=validation_dataset, 
                                data_collator=data_collator, evaluate_steps=100, verbose=False, 
                                has_parent=parent_information)

    print_results(parent_model, data_collator, train_dataset, validation_dataset, threshold=0.5)
    print('\n')
    print_results(parent_model, data_collator, train_dataset, test_dataset, threshold=0.5)
    print('\n')
    print('***'*33)
    print('***'*33)

In [34]:
attribute_folders = ["Personal Information Type","Purpose","Does or Does Not","Action First-Party","Action Third Party","Third Party Entity","Retention Period","Access Scope"]
for idx, attr in enumerate(attribute_folders):
    train_attribute_model(attr,idx,epochs_num=300,batch_size=40,lr=0.01)

STARTING 0: Personal Information Type ATTRIBUTE MODEL TRAINING
15 unique categories found
Processing dataset ...
Num of unique segments: 1023
Processing dataset ...
Num of unique segments: 541
Processing dataset ...
Num of unique segments: 603


last epoch finished: 300 -- progress: 100.0% -- time: 0.10372482209723174 mins
Training completed. Total training time: 2.83 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

1922.0 Train Labels
1023 Train Segments
1032.0 Validation Labels
541 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Computer information                            0.87      0.86      0.89      93        49        
Contact                                         0.81      0.84      0.79      2

last epoch finished: 300 -- progress: 100.0% -- time: 0.09269034202127946 mins
Training completed. Total training time: 2.53 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

1949.0 Train Labels
940 Train Segments
972.0 Validation Labels
491 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Additional service/feature                      0.7       0.74      0.68      292       163       
Advertising                                     0.92      0.94      0.9       217

last epoch finished: 300 -- progress: 100.0% -- time: 0.1012974057816832 mins
Training completed. Total training time: 2.77 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

1131.0 Train Labels
1039 Train Segments
579.0 Validation Labels
528 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Does                                            0.7       0.82      0.64      998       513       
Does Not                                        0.76      0.93      0.7       14

last epoch finished: 300 -- progress: 100.0% -- time: 0.04597504951181074 mins
Training completed. Total training time: 1.25 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

809.0 Train Labels
572 Train Segments
404.0 Validation Labels
285 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Collect from user on other websites             0.5       0.5       0.5       29        6         
Collect in mobile app                           0.63      0.86      0.59      58 

last epoch finished: 300 -- progress: 100.0% -- time: 0.04116091551201489 mins
Training completed. Total training time: 1.13 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

689.0 Train Labels
540 Train Segments
342.0 Validation Labels
249 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Collect on first party website/app              0.73      0.77      0.71      87        55        
Other                                           0.56      0.58      0.55      81 

last epoch finished: 300 -- progress: 100.0% -- time: 0.040960848745167665 mins
Training completed. Total training time: 1.11 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

803.0 Train Labels
535 Train Segments
374.0 Validation Labels
249 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Named third party                               0.75      0.75      0.75      257       123       
Other                                           0.49      0.48      0.5       43 

last epoch finished: 300 -- progress: 100.0% -- time: 0.004331473823252835 mins
Training completed. Total training time: 0.12 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

53.0 Train Labels
41 Train Segments
32.0 Validation Labels
24 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Indefinitely                                    0.47      0.44      0.5       11        4         
Limited                                         0.47      0.57      0.53      26    

last epoch finished: 300 -- progress: 100.0% -- time: 0.006439677003128826 mins
Training completed. Total training time: 0.18 mins
Setting model to eval mode ...
Extracting data ...
Evaluating on train set ...
Evaluating on validation set ...
calculating scores ...

113.0 Train Labels
77 Train Segments
81.0 Validation Labels
56 Validation Segments
---------------------------------------------------------------------------------------------------------

Score per label with 0.5 threshold
---------------------------------------------------------------------------------------------------------
Label                                           F1        Precision Recall    C.Train   C.Validation
---------------------------------------------------------------------------------------------------------
Other                                           0.53      0.66      0.54      21        12        
Other data about user                           0.47      0.44      0.5       20    