# Machine Learning Project - Inappropriate Language Classification - DistilBert - Zero Shot

The goal here is to use DistilBERT for zero-shot classification, so that we don't have to train a model ourselves.<br>
We use Hugging Face's `transformers` library to download and run the model.
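
As a rough illustration of what zero-shot classification does under the hood: the Hugging Face pipeline turns each candidate label into an NLI hypothesis (its default template is `"This example is {}."`) and, in single-label mode, applies a softmax over the per-label entailment logits. The sketch below reproduces only that last step in plain Python. The helper names and the logit values are made up for illustration and do not come from the model.

```python
import math

def labels_to_hypotheses(labels, template="This example is {}."):
    """Build one NLI hypothesis per candidate label."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by descending score."""
    shift = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return {"labels": [l for l, _ in ranked], "scores": [s for _, s in ranked]}

candidate_labels = ["appropriate", "inappropriate"]
hypotheses = labels_to_hypotheses(candidate_labels)

# Hypothetical entailment logits, one per hypothesis (NOT real model output).
hypothetical_logits = [0.3, 1.9]
result = rank_labels(candidate_labels, hypothetical_logits)

print(hypotheses)
print(result["labels"][0])
```

The returned dict mirrors the shape the notebook relies on later: `labels` sorted from most to least likely, with matching `scores`.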

## Get data

In [12]:
from experiment_baseplate import load_split_data

X_train, y_train, X_validate, y_validate, X_test, y_test = load_split_data()

In [15]:
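# Rough runtime estimate for the full test set:
# 7.2 (measured time per 1000 sentences) scaled to len(X_test), then divided by 60.
len(X_test) / 1000 * 7.2 / 60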

12.464879999999999
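
The expression above reads as a back-of-the-envelope runtime estimate. A minimal sketch of the same arithmetic, assuming the 7.2 is a cost in minutes per 1,000 sentences (the notebook does not state the unit), so that the result is in hours. `estimate_hours` is a hypothetical helper, and 103,874 is the test-set size implied by the printed value.

```python
def estimate_hours(n_samples, minutes_per_1000=7.2):
    """Scale a per-1000-sentence cost (assumed to be minutes) to n_samples, in hours."""
    return n_samples / 1000 * minutes_per_1000 / 60

full_hours = estimate_hours(103_874)      # full test set, size implied by the output above
subset_minutes = estimate_hours(1_000) * 60  # the 1000-sentence subset used below

print(f"full set: {full_hours:.2f} h, subset: {subset_minutes:.1f} min")
```

Under that assumption the full set would take roughly half a day, which motivates cutting the test set down to 1,000 samples.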

### Reduce dataset size

In [2]:
X_test = X_test[:1000]
y_test = y_test[:1000]

## Choose Model

### DistilBERT

In [None]:
from transformers import AutoModelForSequenceClassification, DistilBertTokenizerFast

# DistilBERT checkpoint fine-tuned on MNLI, which makes it usable for zero-shot classification.
tokenizer = DistilBertTokenizerFast.from_pretrained('typeform/distilbert-base-uncased-mnli')
model = AutoModelForSequenceClassification.from_pretrained('typeform/distilbert-base-uncased-mnli')

In [3]:
from transformers import AutoModelForSequenceClassification, MobileBertTokenizerFast

# MobileBERT MNLI checkpoint. This cell rebinds `tokenizer` and `model`,
# replacing the DistilBERT objects from the cell above if that cell was run.
tokenizer = MobileBertTokenizerFast.from_pretrained('typeform/mobilebert-uncased-mnli', model_max_length=512)
model = AutoModelForSequenceClassification.from_pretrained('typeform/mobilebert-uncased-mnli')


## Build Model

In [4]:
from transformers import ZeroShotClassificationPipeline

classifier = ZeroShotClassificationPipeline(model = model, tokenizer = tokenizer)

### Run predictions - No threading

Timings on a 12-core machine, for 10 sentences:
- without threading: 45 s
- with threading: 25 s
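
The two timings above translate into a speedup and a per-sentence cost as follows. This is a small arithmetic sketch, assuming both runs used the same 10 sentences and that the threaded run had all 12 cores available; the variable names are illustrative.

```python
n_sentences = 10
serial_seconds = 45.0    # without threading
parallel_seconds = 25.0  # with threading
n_cores = 12

speedup = serial_seconds / parallel_seconds   # how many times faster the threaded run is
efficiency = speedup / n_cores                # fraction of the ideal 12x speedup achieved
serial_per_sentence = serial_seconds / n_sentences
parallel_per_sentence = parallel_seconds / n_sentences

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
print(f"per sentence: {serial_per_sentence:.1f} s vs {parallel_per_sentence:.1f} s")
```

A speedup well below the core count suggests that fixed costs (task scheduling, moving the model to workers) dominate at this small sample size.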

In [7]:
# For a quick smoke test, replace list(X_test) with a few literal sentences,
# e.g. ["you are a good person", "you are in the shit", "you are shit"].
predictions = classifier(sequences=list(X_test),
                         candidate_labels=["appropriate", "inappropriate"])

## CPU Acceleration

In [5]:
import psutil
import ray

num_cpus = psutil.cpu_count(logical=True)
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)

2023-04-04 21:26:46,095	INFO worker.py:1553 -- Started a local Ray instance.


Python version: 3.8.0
Ray version: 2.3.1


In [6]:
classifier_id = ray.put(classifier)

In [7]:
@ray.remote
def predict(pipeline, text_data, label_names):
    # Ray resolves the ObjectRef passed as `pipeline` to the stored classifier before the task runs.
    return pipeline(text_data, label_names)

### Run predictions

In [8]:
predictions = ray.get([predict.remote(classifier_id, text, ["appropriate", "inappropriate"]) for text in X_test])
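
The comprehension above submits one Ray task per sentence. If per-task overhead becomes noticeable, one option is to send sentences in batches instead. The sketch below shows only the batching step in plain Python; `chunked` and the batch size of 4 are hypothetical and not part of the original notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in data; in the notebook this would be list(X_test).
sentences = [f"sentence {i}" for i in range(10)]
batches = list(chunked(sentences, 4))

# Flattening the batches restores the original order, which matters
# because predictions are later compared row by row with y_test.
flattened = [s for batch in batches for s in batch]

print([len(b) for b in batches])
```

Each batch could then be passed to `predict.remote(classifier_id, batch, [...])`. Since the pipeline generally returns a list of results for a list input, the per-batch outputs would need to be flattened in the same way before scoring.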

### Stop CPU Acceleration

In [9]:
ray.shutdown()

### Get scores

In [10]:
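# Labels are sorted by descending score, so labels[0] is the predicted class.
# Encode as one-hot rows: [1, 0] = appropriate, [0, 1] = inappropriate.
predictions = [[1, 0] if e['labels'][0] == 'appropriate' else [0, 1] for e in predictions]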
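
For reference, each pipeline result is a dict with `sequence`, `labels`, and `scores`, where the labels are sorted by descending score. A sketch of the conversion on hand-written results; the sentences and scores below are invented for illustration and are not model output.

```python
# Hand-written stand-ins for zero-shot pipeline results (NOT real predictions).
example_results = [
    {"sequence": "first example",
     "labels": ["appropriate", "inappropriate"],
     "scores": [0.9, 0.1]},
    {"sequence": "second example",
     "labels": ["inappropriate", "appropriate"],
     "scores": [0.7, 0.3]},
]

# The top-ranked label decides the one-hot row:
# [1, 0] = appropriate, [0, 1] = inappropriate.
one_hot = [
    [1, 0] if result["labels"][0] == "appropriate" else [0, 1]
    for result in example_results
]

print(one_hot)  # [[1, 0], [0, 1]]
```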

In [11]:
from experiment_baseplate import score
import numpy as np

print("Distil Bert Model")
print("Test values -> " + score( np.array(predictions) , y_test))

Distil Bert Model
Test values -> accuracy : 0.671 | precision : 0.3044982698961938 | recall : 0.4074074074074074
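
The `score` helper lives in `experiment_baseplate` and is not shown here. As a rough cross-check, assuming it uses the standard binary definitions of accuracy, precision, and recall, the reported figures on 1,000 samples are consistent with 88 true positives, 201 false positives, 128 false negatives, and 583 true negatives. Which class counts as positive is not stated in the notebook, and `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Counts inferred from the printed scores, not taken from the notebook itself.
accuracy, precision, recall = metrics_from_counts(tp=88, fp=201, fn=128, tn=583)

print(f"accuracy : {accuracy:.3f} | precision : {precision:.4f} | recall : {recall:.4f}")
```

Read this way, about 70% of the sentences flagged as positive were false alarms, and roughly 59% of the truly positive sentences were missed.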
