***

# Zero-shot learning trials

- By [Zachary Kilhoffer](https://zkilhoffer.github.io/)
- Updated 2024-06-17

***

## Description
- The code is part of the pipelines described in: 
  - Kilhoffer, Z. et al. (2024 in press). "Cloud Privacy Beyond Legal Compliance: An NLP analysis of certifiable privacy and security standards".
- The paper will be released on 2024-06-28 at the [IEEE Cloud Summit](https://www.ieeecloudsummit.org/2024-program).

***

- The goal of this exploratory script is to match privacy and security controls with helpful, insightful labels.
- The code performs zero-shot learning to match (A) our control texts with (B) pre-defined sets of topics.
- A nice feature of the zero-shot learning shown here is that it allows documents to be matched to multiple categories, with a probability assigned to each.
  - That can help you look for conceptual overlaps between topics.
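
To make the point about per-category probabilities concrete, here is a minimal sketch of two ways an NLI-based zero-shot classifier can turn logits into label scores. The labels and logits are made up for illustration, and `softmax` is a local helper, not a library call. As far as I understand the Hugging Face pipeline, its default mode normalizes across labels so the scores sum to 1, while passing `multi_label=True` scores each label independently; the second mode is the one that lets a single control score highly on several topics.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical (contradiction, entailment) logits for one control text
# scored against three made-up candidate labels.
logits = {
    "access control": (-1.0, 3.0),
    "data minimization": (-0.5, 2.5),
    "incident response": (2.0, -2.0),
}
labels = list(logits)

# Single-label mode: softmax over the entailment logits of all labels,
# so the scores compete with each other and sum to 1.
single = dict(zip(labels, softmax([logits[l][1] for l in labels])))

# Multi-label mode: per label, softmax over (contradiction, entailment),
# so each score is independent of the other labels.
multi = {l: softmax(list(logits[l]))[1] for l in labels}

for l in labels:
    print(f"{l:20s} single={single[l]:.3f} multi={multi[l]:.3f}")
```

In the single-label column the two plausible topics split the probability mass, whereas in the multi-label column both can score close to 1, which makes overlaps easier to spot.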

***

### Input files:
 - ...
 - ...

### Output files:
- ....
- ...

***


TO DO:
Implement Yuanye's comments:
- The strongest contribution is to comment on how the documents differ based on the analysis.
- Some documents are aimed more at engineers, others at managers, data processors, or data controllers.
- Go deep in discussing the differences between documents, and how those differences matter for the implementation of privacy.
- Spend time on this especially in the discussion, and to an extent in the results.


In [1]:
import os, re, warnings, random
import pandas as pd
import numpy as np
import torch
import openpyxl
from transformers import BertTokenizer, BertModel, pipeline

from bertopic import BERTopic
import nltk
from nltk.corpus import words
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import matplotlib
import matplotlib.pyplot as plt


# Setup

In [None]:
# display tweaks
pd.set_option("display.max_colwidth", 200)  # max characters shown per cell
pd.set_option("display.max_columns", None)  # None = no limit on displayed columns
pd.set_option("display.max_rows", None)  # None = no limit on displayed rows
warnings.filterwarnings("ignore")

In [None]:
# load data
df = pd.read_excel('../data/df_embeddings_publicdomain.xlsx')

# Load classification schema

In [None]:
# c2p2
with open(r'F:\Dropbox\Documents\Education\PhD\Cisco\Cloud Certifications Evaluation V2\Analysis\cisco_privacy_controls_2023-11-02\data\c2p2_criteria.txt', 'r', encoding='utf8') as f:
    c2p2 = [x.strip().lower() for x in f.readlines()]

In [None]:
# extended fipps
with open(r'F:\Dropbox\Documents\Education\PhD\Cisco\Cloud Certifications Evaluation V2\Analysis\cisco_privacy_controls_2023-11-02\data\extended_FIPPs.txt', 'r', encoding='utf8') as f:
    fipps_extended = [x.strip() for x in f.readlines()]

In [None]:
# nist privacy framework
with open(r'F:\Dropbox\Documents\Education\PhD\Cisco\Cloud Certifications Evaluation V2\Analysis\cisco_privacy_controls_2023-11-02\data\nist_privacy_framework.txt', 'r', encoding='utf8') as f:
    nist_framework = [x.strip() for x in f.readlines()]

# Strip control codes like (ID.IM-P), keeping the text before, after, or both.

# Name only: drop " (CODE): description"
pattern1 = r' \(([^)]+)\):[^:]+'
nist_framework_name = [re.sub(pattern1, '', x).lower() for x in nist_framework]
# de-duplicate while preserving file order (set() ordering is not stable across runs)
nist_framework_name = list(dict.fromkeys(nist_framework_name))

# Name and description: replace " (CODE): " with ": "
pattern2 = r' \(([^)]+)\):[^:]'
nist_framework_full = [re.sub(pattern2, ': ', x).lower() for x in nist_framework]

# Description only: everything after "(CODE): "
pattern3 = r'\(([^)]+)\):[^:](.*)'
nist_framework_description = [re.search(pattern3, x).group(2).lower() for x in nist_framework]
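
The three patterns above can be checked on a single line. The sample line below assumes the file stores one entry per line as `Name (CODE): description`; the real file is not shown here, so both the format and the sample text are assumptions made for illustration.

```python
import re

# Assumed line format: "Name (CODE): description" (hypothetical sample)
line = "Inventory and Mapping (ID.IM-P): Data processing by systems is understood."

pattern1 = r' \(([^)]+)\):[^:]+'    # matches " (CODE): description"
pattern2 = r' \(([^)]+)\):[^:]'     # matches " (CODE):" plus the one character after it
pattern3 = r'\(([^)]+)\):[^:](.*)'  # group(2) captures the description

name = re.sub(pattern1, '', line).lower()
full = re.sub(pattern2, ': ', line).lower()
description = re.search(pattern3, line).group(2).lower()

print(name)
print(full)
print(description)
```

Note that `pattern1` stops at the next colon, so a description that itself contains a colon would leave a trailing fragment in the name. Also, `re.search` returns `None` for a line without a `(CODE):` part, so a blank or malformed line would raise an `AttributeError` at `.group(2)`.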

# Testing

In [None]:
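# returns the full text of one randomly chosen control
def rando_control():
    # randrange excludes the upper bound (randint includes it and could index past the end)
    return df['full_control_text'].iloc[random.randrange(len(df))]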

## Bart large mnli

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [None]:
# comparing the top-5 labels from each classification schema (extended FIPPs, C2P2, NIST framework names)

candidate_labels = [fipps_extended, c2p2, nist_framework_name]
sequence_to_classify = rando_control()
print(sequence_to_classify)
print('***'*10)

for labels in candidate_labels:
    temp = classifier(sequence_to_classify, labels)
    for x, y in zip(temp['labels'][:5], temp['scores'][:5]):
        print(x[:50], f'{y:.3}')
    print('***'*10)
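
The loop above relies on the pipeline returning `labels` and `scores` in matching order, sorted from highest to lowest score. A small helper can make the top-k slicing reusable. This is only a sketch: `top_k` is a hypothetical name, and the sample `result` dict is made up to mimic the `labels`/`scores` keys used above. It sorts explicitly, so it does not depend on the input order.

```python
def top_k(result, k=5, max_len=50):
    """Return the k highest-scoring (label, score) pairs, truncating long labels."""
    pairs = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(label[:max_len], score) for label, score in pairs[:k]]

# Hypothetical result with made-up labels and scores
result = {
    "sequence": "Access to personal data is restricted to authorized staff.",
    "labels": ["security", "transparency", "accountability"],
    "scores": [0.71, 0.11, 0.18],
}

for label, score in top_k(result, k=2):
    print(label, f"{score:.3}")
```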


## DeBERTa-v3-base-mnli-fever-anli

In [None]:
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")

## DeBERTa-v3-base-tasksource-nli

In [None]:
classifier = pipeline("zero-shot-classification", model="sileod/deberta-v3-base-tasksource-nli")

## mDeBERTa-v3-base-mnli-xnli

In [None]:
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

# Save results

In [None]:
# create a small test df: up to 3 randomly sampled controls per document
df_smol = df.groupby('document', group_keys=False).apply(lambda x: x.sample(min(len(x), 3)))
df_smol.shape

In [None]:
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")

df_smol['zeroshot_fipps_mDeBERTa'] = df_smol['full_control_text'].apply(lambda x: classifier(x, fipps_extended))
df_smol['zeroshot_c2p2_mDeBERTa'] = df_smol['full_control_text'].apply(lambda x: classifier(x, c2p2))
df_smol['zeroshot_nist_mDeBERTa'] = df_smol['full_control_text'].apply(lambda x: classifier(x, nist_framework_name))

# same label sets, scored with facebook/bart-large-mnli
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

df_smol['zeroshot_fipps_bart'] = df_smol['full_control_text'].apply(lambda x: classifier(x, fipps_extended))
df_smol['zeroshot_c2p2_bart'] = df_smol['full_control_text'].apply(lambda x: classifier(x, c2p2))
df_smol['zeroshot_nist_bart'] = df_smol['full_control_text'].apply(lambda x: classifier(x, nist_framework_name))

# same label sets, scored with sileod/deberta-v3-base-tasksource-nli
classifier = pipeline("zero-shot-classification", model="sileod/deberta-v3-base-tasksource-nli")

df_smol['zeroshot_fipps_DeBERTa'] = df_smol['full_control_text'].apply(lambda x: classifier(x, fipps_extended))
df_smol['zeroshot_c2p2_DeBERTa'] = df_smol['full_control_text'].apply(lambda x: classifier(x, c2p2))
df_smol['zeroshot_nist_DeBERTa'] = df_smol['full_control_text'].apply(lambda x: classifier(x, nist_framework_name))
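
Each new column above holds a full result dict per row, which may not save cleanly to spreadsheet or CSV formats. One option, sketched here under the assumption that only the labels and scores need to be kept, is to serialize each result to a compact JSON string before saving. `result_to_json` is a hypothetical helper, and the sample result is made up.

```python
import json

def result_to_json(result, ndigits=4):
    """Serialize a zero-shot result to a JSON string mapping label -> rounded score."""
    scores = {
        label: round(score, ndigits)
        for label, score in zip(result["labels"], result["scores"])
    }
    return json.dumps(scores, ensure_ascii=False)

# Hypothetical result with made-up labels and scores
result = {
    "sequence": "Users are informed before their data is collected.",
    "labels": ["notice", "consent"],
    "scores": [0.81234, 0.18766],
}

encoded = result_to_json(result)
print(encoded)

# The string can be decoded back into a dict when the file is reloaded.
decoded = json.loads(encoded)
```

With the DataFrame above, the helper could be applied column by column, for example `df_smol['zeroshot_fipps_bart'].apply(result_to_json)`, before writing to disk.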