### NLP Workshop Part 3

#### Unsupervised LM

- 3.1: Use a pre-trained network to get embeddings, then run a clustering algorithm on them

- 3.2: Fine-tune a pre-trained network on relevant data

#### Section 3.1

In [None]:
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

In [None]:
# DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations.

tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans

newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)

X_train, y_train, y_train_names = newsgroups_train['data'], newsgroups_train['target'], newsgroups_train['target_names'] 
X_test, y_test, y_test_names = newsgroups_test['data'], newsgroups_test['target'], newsgroups_test['target_names']

In [None]:
data = X_train[:10]

In [None]:
inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt")

In [None]:
with torch.no_grad():
    sequence_output = model(**inputs)[0]

In [None]:
# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

# embeddings shape is [num_samples, 768]
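
The pooling expression above divides the masked sum of token vectors by the number of real (non-padding) tokens. A minimal NumPy sketch of the same idea on a toy batch is given here; `masked_mean_pool`, `toy_tokens`, and `toy_mask` are illustrative names that are not part of the notebook, and NumPy stands in for the torch tensors used in the cell. The first toy sequence has one padded slot, so its masked mean differs from a plain mean over all slots.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)  # [N, T, 1]
    summed = (token_embeddings * mask).sum(axis=1)                         # [N, D]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                         # [N, 1]
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, 2-dimensional embeddings.
toy_tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(toy_tokens, toy_mask)  # padding ignored
naive = toy_tokens.mean(axis=1)                  # padding averaged in

print(pooled)
print(naive)
```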

In [None]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
# Get label assignment (centroid assignment) from KMeans
print("Labels:", kmeans.labels_)
# Get the cluster centers
print("Cluster centers:", kmeans.cluster_centers_)
# Infer label for other data based on its proximity to a certain cluster
print(kmeans.predict(embeddings[:2]))
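
KMeans groups points by Euclidean distance, while sentence embeddings are often compared by cosine similarity. One common option, which is an assumption here and not something the notebook does, is to L2-normalize the embeddings before clustering: for unit vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so the two notions of closeness agree. The sketch below checks that identity on random toy vectors; `l2_normalize` and `toy_embeddings` are hypothetical names.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length (eps guards against zero rows)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 8))  # stand-in for real sentence embeddings
unit = l2_normalize(toy_embeddings)

# Pairwise cosine similarity and pairwise squared Euclidean distance.
cosine_sim = unit @ unit.T
diff = unit[:, None, :] - unit[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(sq_dist, 2.0 - 2.0 * cosine_sim))
```

In the notebook this would amount to normalizing `embeddings` before passing them to `KMeans(...).fit(...)`.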

In [None]:
# COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

# Tokenizer gives HTTP error:
from cocolm.modeling_cocolm import COCOLMModel
from cocolm.configuration_cocolm import COCOLMConfig
from cocolm.tokenization_cocolm import COCOLMTokenizer

In [None]:
# Both the base and large checkpoints have a maximum sequence length of 512
model_name = "microsoft/cocolm-base"
config = COCOLMConfig.from_pretrained(model_name)

In [None]:
config

In [None]:
model = COCOLMModel.from_pretrained(model_name, config=config)

In [None]:
# Pass the checkpoint name (a string) here, not the loaded model object
tokenizer = COCOLMTokenizer.from_pretrained(model_name)

In [None]:
inputs = []

for x in data:
    encoded = tokenizer.encode(x)
    if len(encoded) <= 512:
        inputs.append(encoded)

In [None]:
outputs = []
with torch.no_grad():
    for x in inputs:
        embedding = model(torch.tensor([x]))[0]  # 1 x seq_len x 768
        embedding = torch.mean(embedding[0], dim=0)  # average over tokens -> 768
        outputs.append(embedding)
outputs = torch.stack(outputs)

In [None]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(outputs)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
print(kmeans.predict(outputs[:2]))

In [None]:
# MPNet

from transformers import MPNetTokenizer, MPNetModel

tokenizer = MPNetTokenizer.from_pretrained('microsoft/mpnet-base')
model = MPNetModel.from_pretrained('microsoft/mpnet-base')

In [None]:
inputs = tokenizer(data, padding=True, truncation=True, return_tensors="pt")

In [None]:
inputs.keys()

In [None]:
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state # N x seq_len x 768, with seq_len <= 512 after truncation

In [None]:
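# Mean pool over real tokens only; a plain mean would also average in the padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
last_hidden_states = torch.sum(last_hidden_states * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)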

In [None]:
last_hidden_states.shape

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(last_hidden_states)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
print(kmeans.predict(last_hidden_states[:2]))

#### Section 3.2: Train a sentence transformer

#### How?

We need to know which pairs of texts should be close together in the embedding space and which should be far apart.

Since we don't have labels, we need a heuristic for this.
For instance, our dataset will contain a mapping between the BQs being evaluated, the prompts, and the responses.

As a result, we can try to bring responses to the same prompt closer together in the embedding space.
Responses to different prompts should lie farther apart in the embedding space.
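
A minimal sketch of this heuristic, under stated assumptions: the data is available as a mapping from prompt to its responses (the `responses_by_prompt` dictionary below is hypothetical), and the target labels are simply 1.0 for same-prompt pairs and 0.0 for cross-prompt pairs. Those label values are placeholders, not a recommendation. Each resulting tuple could then be wrapped as `InputExample(texts=[a, b], label=label)` for training.

```python
from itertools import combinations

# Hypothetical data: prompt id -> responses written for that prompt.
responses_by_prompt = {
    "prompt_a": ["response a1", "response a2", "response a3"],
    "prompt_b": ["response b1", "response b2"],
}

def build_pairs(responses_by_prompt, pos_label=1.0, neg_label=0.0):
    """Return (text1, text2, label) tuples from a prompt -> responses mapping."""
    pairs = []
    prompts = list(responses_by_prompt)
    # Positive pairs: two different responses to the same prompt.
    for prompt in prompts:
        for r1, r2 in combinations(responses_by_prompt[prompt], 2):
            pairs.append((r1, r2, pos_label))
    # Negative pairs: one response from each of two different prompts.
    for p1, p2 in combinations(prompts, 2):
        for r1 in responses_by_prompt[p1]:
            for r2 in responses_by_prompt[p2]:
                pairs.append((r1, r2, neg_label))
    return pairs

pairs = build_pairs(responses_by_prompt)
print(len(pairs))  # 4 positive + 6 negative = 10
```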

__Open question: how should labels be assigned when training the model? For example, should prompts
belonging to the same BQ receive higher similarity scores?__

In [None]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers import evaluation
from torch.utils.data import DataLoader

In [None]:
#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens') # more models : https://www.sbert.net/docs/training/overview.html

#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

In [None]:
train_loss = losses.CosineSimilarityLoss(model)
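
As documented for sentence-transformers, `CosineSimilarityLoss` by default compares the cosine similarity of the two sentence embeddings with the target label using mean squared error. The sketch below reproduces that arithmetic in NumPy on toy 2-D vectors that stand in for encoder outputs; `cosine_similarity_loss`, `toy_pairs`, and `toy_labels` are illustrative names, and this is not the library's implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    sims = np.array([cosine_similarity(u, v) for u, v in pairs])
    return float(np.mean((sims - np.asarray(labels)) ** 2))

# Toy 2-D "embeddings" standing in for the encoder outputs.
toy_pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.0])),  # identical direction, cos = 1
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # orthogonal, cos = 0
]
toy_labels = [0.8, 0.3]  # same target values as the training examples above

loss = cosine_similarity_loss(toy_pairs, toy_labels)
print(round(loss, 3))  # ((1 - 0.8)^2 + (0 - 0.3)^2) / 2 = 0.065
```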

In [None]:
sentences1 = ['This list contains the first column', 'With your sentences', 'You want your model to evaluate on']
sentences2 = ['Sentences contains the other column', 'The evaluator matches sentences1[i] with sentences2[i]', 'Compute the cosine similarity and compares it to scores[i]']
scores = [0.3, 0.6, 0.2]

evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
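
According to the sentence-transformers documentation, `EmbeddingSimilarityEvaluator` scores the model by correlating the predicted similarities with the gold scores (Pearson and Spearman). The following is a simplified NumPy sketch of a Spearman rank correlation, assuming no tied values; `predicted_sims` holds hypothetical cosine similarities, not real model output.

```python
import numpy as np

def ranks(values):
    """Rank positions (0 = smallest); this simplified version assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

gold_scores = [0.3, 0.6, 0.2]        # same values as `scores` above
predicted_sims = [0.41, 0.77, 0.12]  # hypothetical cosine similarities

rho = spearman(predicted_sims, gold_scores)
print(rho)
```

Because only the ordering matters, the predicted values do not need to match the gold scores numerically to reach a high correlation.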

In [None]:
#Tune the model
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=2,
    evaluator=evaluator,
    evaluation_steps=5,
    output_path="./sentence_transformer")

In [None]:
# load model

model = SentenceTransformer('./sentence_transformer')

In [None]:
# Train BERT on custom MLM:

# https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb
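
For the masked language modeling (MLM) option mentioned above, here is a toy sketch of BERT-style input masking in plain Python: roughly 15% of tokens are selected, and of those, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name `mask_tokens`, the whitespace tokenization, and the tiny vocabulary are assumptions for illustration only; in practice a library data collator would handle this step.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=1)
print(masked)
print(labels)
```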

#### Other Verbal Analytics Techniques 

Many other verbal analytics techniques exist. Some useful links for further exploration:

- __Semantic Matching__       
    - __Idea 1__: We could explore transformers that capture sentence-level semantics, i.e., networks pre-trained on tasks such as question answering, paraphrasing, and summarization. The following models have shown strong performance on these tasks:
        * [paraphrase-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2)
        * [MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)
        * [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
        * Useful resource: HuggingFace's official page on [Semantic Matching](https://huggingface.co/tasks/sentence-similarity)
          
    - __Idea 2__: While sentence transformers compare two sentences semantically, we can also compare them syntactically, i.e., by their surface wording. Several traditional ML-based approaches compare two sentences this way, such as:
        * TF-IDF [What is it?](https://monkeylearn.com/blog/what-is-tf-idf/)  |  [Sklearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
        * Count Vectorizer [What is it?](https://towardsdatascience.com/basics-of-countvectorizer-e26677900f9c) | [Sklearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
        * FuzzyWuzzy [String matching Python package](https://pypi.org/project/fuzzywuzzy/)
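
To make Idea 2 concrete, here is a minimal standard-library sketch of two surface-level comparisons: cosine similarity over word-count vectors (the same idea a count vectorizer supports) and a character-level similarity ratio. It deliberately uses `collections.Counter` and `difflib.SequenceMatcher` as stand-ins and does not call scikit-learn or FuzzyWuzzy; the function names and example sentences are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def count_vector(text):
    """Bag-of-words counts using simple lowercase whitespace tokenization."""
    return Counter(text.lower().split())

def count_cosine(a, b):
    """Cosine similarity between the word-count vectors of two sentences."""
    va, vb = count_vector(a), count_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def char_ratio(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

s1 = "the cat sat on the mat"
s2 = "the cat lay on the rug"

print(count_cosine(s1, s2))  # shared words: the (x2), cat, on -> 6 / 8 = 0.75
print(char_ratio(s1, s2))
```

Neither score sees meaning: two paraphrases with no words in common would score near zero here, which is where the sentence-transformer models from Idea 1 help.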