# Machine Learning Project - Inappropriate Language Classification - DistilBERT - Zero Shot

The goal is to use DistilBERT for zero-shot classification, so the model does not need to be trained on our data.<br>
The model is downloaded and run with Hugging Face's `transformers` library.
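
As a rough illustration of what zero-shot classification does with an NLI model: each candidate label is turned into a hypothesis about the text, the model scores how strongly the text entails each hypothesis, and those scores are normalized across labels. The sketch below mimics that step with hand-picked numbers. The helper names and the logits are hypothetical, and the template string is assumed to match the pipeline's default.

```python
import math

# Assumed default-style hypothesis template: each candidate label is slotted in.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(labels):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in labels]


def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits (single-label case)."""
    peak = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["appropriate", "inappropriate"]
hypotheses = build_hypotheses(labels)

# Made-up entailment logits standing in for the NLI model's output.
fake_logits = [0.3, 2.1]
scores = zero_shot_scores(fake_logits)

# Rank labels by score, highest first; the first entry is the prediction.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(hypotheses)
print(ranked[0][0])
```

No real model is loaded here; the point is only the label-to-hypothesis step and the normalization.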

## Get data

In [None]:
from experiment_baseplate import load_split_data

X_train, y_train, X_validate, y_validate, X_test, y_test = load_split_data()

### Reduce dataset size

In [None]:
X_test = X_test[:10000]
y_test = y_test[:10000]

## Choose Model

### DistilBERT

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('typeform/distilbert-base-uncased-mnli')

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('typeform/distilbert-base-uncased-mnli')

In [None]:
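# Alternative model (MobileBERT): running this cell overwrites the
# DistilBERT `tokenizer` and `model` loaded in the previous cell.
from transformers import MobileBertTokenizerFast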
tokenizer = MobileBertTokenizerFast.from_pretrained('typeform/mobilebert-uncased-mnli', model_max_length=512)

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('typeform/mobilebert-uncased-mnli')

## Build Model

In [None]:
from transformers import ZeroShotClassificationPipeline

classifier = ZeroShotClassificationPipeline(model=model, tokenizer=tokenizer)

### Run predictions - No threading

Timings measured on a 12-core machine:
- no threading: 45 s for 10 sentences
- threading (see CPU Acceleration below): 25 s for 10 sentences
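
To put the timings above in perspective, here is a small back-of-the-envelope extrapolation to the 10,000-sentence test set. It assumes the cost per sentence stays constant, which is only an approximation; the `estimated_hours` helper is just for illustration.

```python
# Timings reported above: seconds per 10 sentences.
SECONDS_PER_10 = {"no threading": 45, "threading": 25}
N_SENTENCES = 10_000  # size of the reduced test set


def estimated_hours(seconds_per_10, n_sentences):
    """Linear extrapolation: assumes cost per sentence stays constant."""
    return seconds_per_10 / 10 * n_sentences / 3600


estimates = {
    name: estimated_hours(seconds, N_SENTENCES)
    for name, seconds in SECONDS_PER_10.items()
}
speedup = SECONDS_PER_10["no threading"] / SECONDS_PER_10["threading"]

for name, hours in estimates.items():
    print(f"{name}: ~{hours:.1f} h")
print(f"speedup: {speedup:.1f}x")
```

Under that assumption the full run takes roughly half a day without threading and about 7 hours with it, so the speedup on 12 cores is well below 12x.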

In [None]:
# Quick sanity check: replace list(X_test) with
# ["you are a good person", "you are in the shit", "you are shit"]
predictions = classifier(
    sequences=list(X_test),
    candidate_labels=["appropriate", "inappropriate"],
)

## CPU Acceleration

In [None]:
import psutil
import ray

num_cpus = psutil.cpu_count(logical=True)
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)

In [None]:
classifier_id = ray.put(classifier)

In [None]:
@ray.remote
def predict(pipeline, text_data, label_names):
    return pipeline(text_data, label_names)

### Run predictions

In [None]:
predictions = ray.get([predict.remote(classifier_id, text, ["appropriate", "inappropriate"]) for text in X_test])
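
The cell above submits one Ray task per sentence. A possible variation is to send small batches instead, so there are fewer, larger tasks. I have not measured whether this is faster here. The sketch below only shows the chunking and flattening helpers, which are hypothetical names and not part of Ray or `transformers`; the Ray calls are left as comments.

```python
def chunked(items, chunk_size):
    """Split a sequence into consecutive lists of at most chunk_size items."""
    items = list(items)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def flatten(list_of_lists):
    """Concatenate per-batch result lists back into one flat list."""
    return [item for sub in list_of_lists for item in sub]


sentences = [f"sentence {i}" for i in range(10)]
batches = chunked(sentences, 4)
print([len(batch) for batch in batches])

# With Ray this would look roughly like the following (not executed here).
# The batch size of 64 is an arbitrary choice.
# futures = [
#     predict.remote(classifier_id, batch, ["appropriate", "inappropriate"])
#     for batch in chunked(X_test, 64)
# ]
# predictions = flatten(ray.get(futures))
```

One caveat: depending on the `transformers` version, a batch containing a single sentence may come back as a bare dict rather than a list, so that case would need handling before flattening.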

### Stop CPU Acceleration

In [None]:
ray.shutdown()

### Get scores

In [None]:
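# The pipeline returns labels sorted by descending score,
# so e['labels'][0] is the predicted label.
# Encode it as one-hot: [1, 0] = appropriate, [0, 1] = inappropriate.
predictions = [
    [1, 0] if e['labels'][0] == 'appropriate' else [0, 1]
    for e in predictions
]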
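
For reference, here is the conversion above applied to a couple of made-up results in the shape the zero-shot pipeline returns (a dict with `sequence`, `labels` and `scores`, labels sorted by descending score). The scores are invented for illustration and `to_one_hot` is a hypothetical helper.

```python
# Made-up results in the pipeline's output shape; scores are not real.
sample_predictions = [
    {
        "sequence": "you are a good person",
        "labels": ["appropriate", "inappropriate"],
        "scores": [0.9, 0.1],
    },
    {
        "sequence": "you are shit",
        "labels": ["inappropriate", "appropriate"],
        "scores": [0.8, 0.2],
    },
]

# One-hot encoding used in this notebook.
ONE_HOT = {"appropriate": [1, 0], "inappropriate": [0, 1]}


def to_one_hot(results):
    """Map each result's top-ranked label to its one-hot vector."""
    return [ONE_HOT[result["labels"][0]] for result in results]


encoded = to_one_hot(sample_predictions)
print(encoded)
```

A dict lookup like this also fails loudly with a `KeyError` if an unexpected label shows up, instead of silently counting it as inappropriate.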

In [None]:
from experiment_baseplate import score
import numpy as np

print("Distil Bert Model")
# Passing score() as a separate argument works whether or not it returns a string.
print("Test values ->", score(np.array(predictions), y_test))