<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

<h1 style="text-align: center;">Fine-tuning a Large Language Model</h1>

### Install Required Packages
- [PyTorch](https://pytorch.org/get-started/locally/): Deep learning framework.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/installation): Pretrained state-of-the-art models.
- [Hugging Face Datasets](https://pypi.org/project/datasets/): Public Hugging Face datasets.
- [IPywidgets](https://ipywidgets.readthedocs.io/en/stable/user_install.html): Interactive notebook widgets.

In [1]:
!pip install torch transformers[torch] datasets ipywidgets

In [2]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
from model_constants import *
from model_train import retrain_model
from helper_funcs import *
import json
import uptrain

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
testing_text = "Nike shoes are very [MASK]."
original_model_outputs = test_model(model, testing_text)

In [4]:
def nike_text_present_func(inputs, outputs, gts=None, extra_args=None):
    # Flag every input text that mentions "nike" (case-insensitive)
    is_present = []
    for text in inputs["text"]:
        is_present.append("nike" in text.lower())
    return is_present

uptrain_save_fold_name = "uptrain_smart_data_bert"
nike_text_present = uptrain.Signal("Nike Text Present", nike_text_present_func)

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": nike_text_present
    }],

    # Folder where the retraining dataset is saved
    'retraining_folder': uptrain_save_fold_name,

    # When to retrain; set to a very large number because we
    # are not retraining yet
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg)

Deleting the folder:  uptrain_smart_data_bert
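
The signal function can be checked on its own before it is wired into the framework. The snippet below is a minimal sketch that does not import `uptrain`; it assumes the function receives a dictionary with a `"text"` list, as in the cell above, and the sample sentences are made up for illustration.

```python
def nike_text_present_func(inputs, outputs, gts=None, extra_args=None):
    """Return one boolean per input text: True if it mentions "nike"."""
    return ["nike" in text.lower() for text in inputs["text"]]


# Hypothetical batch, shaped like the dictionary the signal reads from
batch = {
    "text": [
        "Nike shoes are very comfortable.",
        "These sandals run small.",
        "I love my NIKE trainers.",
    ]
}

flags = nike_text_present_func(batch, outputs=None)
print(flags)  # [True, False, True]
```

Only the texts flagged `True` would count as edge cases under this signal, so the match is deliberately case-insensitive.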


In [5]:
raw_dataset = create_sample_dataset("data.json")
with open(raw_dataset) as f:
    all_data = json.load(f)

for sample in all_data['data']:
    inputs = {'data': {'text': [sample['text']]}}
    framework.log(inputs=inputs, outputs=None)

retraining_dataset = create_dataset_from_csv(uptrain_save_fold_name + "/1/smart_data.csv", "text", "retrain_dataset.json")

50  edge cases identified out of  197  total samples
100  edge cases identified out of  397  total samples
150  edge cases identified out of  597  total samples
200  edge cases identified out of  797  total samples
250  edge cases identified out of  997  total samples
300  edge cases identified out of  1197  total samples
350  edge cases identified out of  1397  total samples
400  edge cases identified out of  1597  total samples
450  edge cases identified out of  1797  total samples
500  edge cases identified out of  1997  total samples
550  edge cases identified out of  2197  total samples
600  edge cases identified out of  2397  total samples
650  edge cases identified out of  2597  total samples
700  edge cases identified out of  2797  total samples
750  edge cases identified out of  2997  total samples
800  edge cases identified out of  3197  total samples
850  edge cases identified out of  3397  total samples
900  edge cases identified out of  3597  total samples
950  edge cases 

In [6]:
retrain_model(model, retraining_dataset)
retrained_model_outputs = test_model(model, testing_text)

Using custom data configuration default-2370e3cf0f5387dd


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading and preparing dataset json/default to /Users/vipul/.cache/huggingface/datasets/json/default-2370e3cf0f5387dd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /Users/vipul/.cache/huggingface/datasets/json/default-2370e3cf0f5387dd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 64
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 92
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 6
  Number of trainable parameters = 66985530


>>>Before training, Perplexity: 10.48


Epoch,Training Loss,Validation Loss
1,1.6319,0.936632
2,1.2835,1.153624
3,1.1519,0.998965


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 64
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 64
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 11
  Batch size = 64


Training completed. Do not forget to shar

>>>After training, Perplexity: 2.63
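
Two numbers in the log above can be sanity-checked with a few lines of arithmetic. This is a sketch under two assumptions: that the trainer takes one optimizer step per batch and keeps the final partial batch (consistent with `Gradient Accumulation steps = 1`), and that the helper reports perplexity as the exponential of the mean cross-entropy loss, which is the usual convention for masked language models. The function names are illustrative and are not part of `helper_funcs`.

```python
import math


def optimization_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps when each batch triggers one update and the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as exp(mean cross-entropy loss in nats)."""
    return math.exp(mean_cross_entropy)


# 92 examples, batch size 64, 3 epochs, as reported in the training log
print(optimization_steps(92, 64, 3))  # 6

# Mean losses implied by the reported perplexities (10.48 before, 2.63 after)
loss_before = math.log(10.48)
loss_after = math.log(2.63)
print(f"{loss_before:.2f} -> {loss_after:.2f}")
```

Under these assumptions, the drop in perplexity from 10.48 to 2.63 corresponds to the mean loss on the evaluation set falling from roughly 2.35 to roughly 0.97.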


In [7]:
# print([original_model_outputs, retrained_model_outputs])

# # Create Nike review training dataset
# nike_attrs = {
#     "version": "0.1.0",
#     'source': "nike review dataset",
#     'url': 'https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download',
# }
# # Download the dataset from the URL, unzip it, and copy the CSV file here
# raw_nike_reviews_dataset = create_dataset_from_csv("web_scrapped.csv", "Content", "raw_nike_reviews_data.json")