<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

<h1 style="text-align: center;">Fine-tuning a Large-Language Model [WIP]</h1>

### Install Required packages
- [PyTorch](https://pytorch.org/get-started/locally/): Deep learning framework.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/installation): Pretrained state-of-the-art models.
- [Hugging Face Datasets](https://pypi.org/project/datasets/): Public Hugging Face datasets.
- [IPywidgets](https://ipywidgets.readthedocs.io/en/stable/user_install.html): Interactive notebook widgets.

In [1]:
# !pip install torch transformers[torch] datasets ipywidgets nltk uptrain

In [10]:
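import json

import pandas as pd  # used below to count the filtered samples
# VADER needs its lexicon; run nltk.download('vader_lexicon') once if it is missing
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import AutoModelForMaskedLM, AutoTokenizer
import uptrain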

from model_constants import *
from model_train import retrain_model
from helper_funcs import *

Define a few test cases to compare the model's performance before and after retraining.

In [11]:
testing_texts = [
    "Nike shoes are very [MASK]."
]

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
original_model_outputs = [test_model(model, x) for x in testing_texts]
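
`test_model` comes from `helper_funcs` and is not shown in this notebook. Judging by the outputs printed at the end, it returns the five most likely fill-ins for `[MASK]`. The selection step can be sketched in plain Python, assuming the model's scores for the masked position are already available as a token-to-score mapping. The name `top_k_fill_ins` and the example scores are hypothetical, not part of the helper module:

```python
import heapq


def top_k_fill_ins(scores, k=5):
    """Return the k tokens with the highest scores, best first."""
    # scores: hypothetical mapping from candidate token to model score (logit)
    return heapq.nlargest(k, scores, key=scores.get)


# Made-up scores for illustration only, not real model output
example_scores = {
    "popular": 4.1,
    "expensive": 3.9,
    "durable": 3.2,
    "rare": 0.5,
    "blue": 0.1,
    "common": 2.8,
    "comfortable": 2.7,
}
print(top_k_fill_ins(example_scores))
```

In the real helper the scores would come from the model's logits at the `[MASK]` position; only the ranking step is shown here.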

Let's use Nike online-store customer reviews from Kaggle, filtered with UpTrain signals, as the data to retrain our model on. Please download the data from the [link](https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download) and unzip it here.
  

In [13]:
# Create Nike review training dataset
nike_attrs = {
    "version": "0.1.0",
    'source': "nike review dataset",
    'url': 'https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download',
}
# Download the dataset from the URL, unzip it, and copy the CSV file here
nike_reviews_dataset = create_dataset_from_csv("web_scraped.csv", "Content", "nike_reviews_data.json")
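
`create_dataset_from_csv` also lives in `helper_funcs` and is not shown. From the way its result is used later (opened as a path, then read as `all_data['data']` with a `'text'` field per sample), a minimal stand-in could look like the sketch below. The function name `csv_column_to_json` is hypothetical, and skipping empty cells is an assumption:

```python
import csv
import json


def csv_column_to_json(csv_path, column, json_path):
    """Copy one CSV column into {"data": [{"text": ...}, ...]} and return json_path."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Assumption: rows with an empty cell in this column are skipped
        texts = [row[column] for row in csv.DictReader(f) if row.get(column)]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"data": [{"text": t} for t in texts]}, f)
    return json_path
```

Returning the output path matches how `nike_reviews_dataset` is passed straight to `open(...)` further down.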

In [14]:
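def nike_positive_sentiment_func(inputs, outputs, gts=None, extra_args=None):
    # Signal: True for reviews that read as positive, False otherwise
    sia = SentimentIntensityAnalyzer()  # build the analyzer once, not per review
    is_positives = []
    for text in inputs["text"]:
        txt = text.lower()
        score = sia.polarity_scores(txt)

        # Not positive if VADER's positive share is low...
        is_negative = score['pos'] < 0.25
        # ...or if the review contains any of these adjectives (substring match)
        for neg_adj in ['expensive', 'worn', 'cheap', 'inexpensive', 'dirty', 'bad']:
            if neg_adj in txt:
                is_negative = True

        is_positives.append(not is_negative)
    return is_positives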

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": uptrain.Signal("Nike Positive Sentiment", nike_positive_sentiment_func)
    }],

    # Define where to save the retraining dataset
    'retraining_folder': "uptrain_smart_data",
    
    # Define when to retrain, define a large number because we are using UpTrain just to create retraining dataset
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg)

Deleting the folder:  uptrain_smart_data
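
The keyword check above uses substring matching (`neg_adj in txt`), so `'bad'` also fires on a word such as `'badge'`, and `'expensive'` already covers `'inexpensive'`. If whole-word matching is preferred, one option is a regular expression with word boundaries. This is an alternative sketch, not what the notebook runs, and `contains_negative_word` is a hypothetical helper:

```python
import re

NEGATIVE_WORDS = ['expensive', 'worn', 'cheap', 'inexpensive', 'dirty', 'bad']

# \b anchors each alternative to word boundaries, so 'bad' does not match 'badge'
_NEGATIVE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in NEGATIVE_WORDS) + r")\b"
)


def contains_negative_word(text):
    """Return True if any listed adjective appears as a whole word."""
    return _NEGATIVE_RE.search(text.lower()) is not None
```

Swapping this in would change which reviews are kept, so the filtered sample counts would no longer match the ones printed in this notebook.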


In [15]:
with open(nike_reviews_dataset) as f:
    all_data = json.load(f)

for sample in all_data['data']:
    inputs = {'text': [sample['text']]}
    framework.log(inputs = inputs, outputs = None)

50  edge cases identified out of  135  total samples


In [16]:
print("Number of samples filtered for retraining: ", len(pd.read_csv("uptrain_smart_data/1/smart_data.csv")))
retraining_dataset = create_dataset_from_csv("uptrain_smart_data/1/smart_data.csv", "text", "retrain_dataset.json", min_samples=1000)

Number of samples filtered for retraining:  82


In [17]:
retrain_model(model, retraining_dataset)
retrained_model_outputs = [test_model(model, x) for x in testing_texts]

Using custom data configuration default-e1ecfc1614a13790


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading and preparing dataset json/default to /Users/sourabhagrawal/.cache/huggingface/datasets/json/default-e1ecfc1614a13790/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...



Dataset json downloaded and prepared to /Users/sourabhagrawal/.cache/huggingface/datasets/json/default-e1ecfc1614a13790/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.



The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 23
  Batch size = 64
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 204
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 12
  Number of trainable parameters = 66985530


>>>Before training, Perplexity: 66.00


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 3.84342         |
| 2     | No log        | 3.37683         |
| 3     | No log        | 3.453454        |


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 23
  Batch size = 64


Training completed. Do not forget to shar

>>>After training, Perplexity: 27.02
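
Two figures in the log above can be checked by hand. The step count follows from the logged sizes: 204 training examples at a batch size of 64 give 4 batches per epoch (the last one partial), and 3 epochs give 12 steps. Perplexity is conventionally the exponential of the mean cross-entropy loss; assuming `retrain_model` follows that convention (its source is not shown here), the reported perplexities correspond to losses of roughly 4.19 and 3.30. A small sketch, with hypothetical function names and a gradient accumulation of 1 as the log reports:

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps when every batch triggers one update (no accumulation)."""
    return math.ceil(num_examples / batch_size) * num_epochs


def perplexity(mean_cross_entropy):
    """Conventional definition: exp of the mean cross-entropy loss."""
    return math.exp(mean_cross_entropy)


print(total_optimization_steps(204, 64, 3))  # 12
# Losses implied by the reported perplexities, under the assumption above
print(round(math.log(66.00), 2), round(math.log(27.02), 2))  # 4.19 3.3
```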


In [18]:
[original_model_outputs, retrained_model_outputs]

[[['popular', 'expensive', 'durable', 'common', 'comfortable']],
 [['popular', 'expensive', 'durable', 'comfortable', 'common']]]
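
The two top-5 lists contain the same words; only the last two positions swap. With more test sentences, a small helper makes this kind of comparison explicit. `rank_changes` is a hypothetical name, and the lists below are copied from the output above:

```python
def rank_changes(before, after):
    """Map each token whose position changed to (old_rank, new_rank); None if absent."""
    changes = {}
    for token in dict.fromkeys(before + after):  # ordered, de-duplicated union
        old = before.index(token) if token in before else None
        new = after.index(token) if token in after else None
        if old != new:
            changes[token] = (old, new)
    return changes


before = ['popular', 'expensive', 'durable', 'common', 'comfortable']
after = ['popular', 'expensive', 'durable', 'comfortable', 'common']
print(rank_changes(before, after))
# {'common': (3, 4), 'comfortable': (4, 3)}
```

Ranks are zero-based, so `(3, 4)` means the token moved from fourth to fifth place.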