In [77]:
!nvidia-smi

Tue Jan  2 10:28:11 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0    34W /  70W |   8836MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|       

##### BERT is particularly designed for tasks like question answering or chatbot. This is because it can understand contextual information of input text. Understanding context is most important to answer certain question from a given text document or paragraph.

The steps involved in fine tuning a BERT mmode include:
- Prepare training data and map labels
- Load pretrained BERT model and tokenizer
- Define training arguments and trainer
- Fine-tune model on training data
- Evaluate on validation data

# DATA PREPARATION

There are many data annotator tools available to make training data for custom Question Answering model for BERT. Some are:
- Haystack Deepset
- Doccano
- Prodigy Data Labeler
- Label Studio
- Amazon Sagemaker ground truth

They are made for large projects. 
In this limited project meant for illustration purposes, we will handcraft our training data using python.

1. fetch raw data. I have used beautiful soup to scrap product descriptions of 12 products. And now I am pasting them simply as context.. The idea is not to go into web scraping too much.
2. Our product descriptions will be the context. 

In [21]:
train_contexts = [
    "Nokia C12 Android 12 (Go Edition) Smartphone, All-Day Battery, 4GB RAM (2GB RAM + 2GB Virtual RAM) + 64GB Capacity | Light Mint",
    "Nokia G21 Android Smartphone, Dual SIM, 3-Day Battery Life, 6GB RAM + 128GB Storage, 50MP Triple AI Camera | Nordic Blue",
    "realme narzo 50i Prime (Dark Blue 4GB RAM+64GB Storage) Octa-core Processor | 5000 mAh Battery",
    "realme narzo N53 (Feather Gold, 4GB+64GB) 33W Segment Fastest Charging | Slimmest Phone in Segment | 90 Hz Smooth Display",
    "realme narzo N55 (Prime Blue, 4GB+64GB) 33W Segment Fastest Charging | Super High-res 64MP Primary AI Camera",
    "Redmi 9A Sport (Carbon Black, 2GB RAM, 32GB Storage) | 2GHz Octa-core Helio G25 Processor | 5000 mAh Battery",
    "Redmi 11 Prime 5G (Thunder Black, 4GB RAM, 64GB Storage) | Prime Design | MTK Dimensity 700 | 50 MP Dual Cam | 5000mAh | 7 Band 5G",
    "Redmi 12C (Royal Blue, 4GB RAM, 64GB Storage) | High Performance Mediatek Helio G85 | Big 17cm(6.71) HD+ Display with 5000mAh(typ) Battery",
    "Redmi A1 (Light Green, 2GB RAM 32GB ROM) | Segment Best AI Dual Cam | 5000mAh Battery | Leather Texture Design | Android 12",
    "Samsung Galaxy M04 Light Green, 4GB RAM, 64GB Storage | Upto 8GB RAM with RAM Plus | MediaTek Helio P35 Octa-core Processor | 5000 mAh Battery | 13MP Dual Camera",
    "Samsung Galaxy M13 (Midnight Blue, 4GB, 64GB Storage) | 6000mAh Battery | Upto 8GB RAM with RAM Plus",
    "Tecno Camon 20 (Serenity Blue, 8GB RAM,256GB Storage)|16GB Expandable RAM | 64MP RGBW Rear Camera|6.67 FHD+ Big AMOLED with in-Display Fingerprint Sensor",
]

In [22]:
len(train_contexts)

12

#### Define Questions and Answers
1. Now we will define a list of question answers for each context.
2. key point to remember. The answer must be within the context text. And it should be written in the same way. For example, if in product description, it is written 4 GB, then in the define QAs we cannot write 4Gbs.
3. For each context, we will give an id that's actually the index position in the list of contexts, so we can map properly. 
4. Every context should be given as many question-answer pairs as possible.

In [23]:
train_questions_answers = [
    {
        "context_index": 0,
        "question": "What is the operating system of the Nokia C12 smartphone?",
        "answer": "Android 12 (Go Edition)"
    },
    {
        "context_index": 0,
        "question": "How much RAM does the Nokia C12 have?",
        "answer": "4GB"
    },
    {
        "context_index": 0,
        "question": "Does the Nokia C12 have virtual RAM?",
        "answer": "(2GB RAM + 2GB Virtual RAM)"
    },
    {
        "context_index": 0,
        "question": "What is the total capacity of the Nokia C12?",
        "answer": "64GB"
    },
    {
        "context_index": 0,
        "question": "What is the color option available for the Nokia C12?",
        "answer": "Light Mint"
    },
    {
        "context_index": 1,
        "question": "What is the model name of the Nokia smartphone?",
        "answer": "Nokia G21"
    },
    {
        "context_index": 1,
        "question": "What is the operating system of the Nokia G21?",
        "answer": "Android"
    },
    {
        "context_index": 1,
        "question": "How many SIM cards does the Nokia G21 support?",
        "answer": "Dual SIM"
    },
    {
        "context_index": 1,
        "question": "How long is the battery life of the Nokia G21?",
        "answer": "3-Day Battery Life"
    },
    {
        "context_index": 1,
        "question": "How much RAM does the Nokia G21 have?",
        "answer": "6GB"
    },
    {
        "context_index": 1,
        "question": "What is the storage capacity of the Nokia G21?",
        "answer": "128GB"
    },
    {
        "context_index": 1,
        "question": "What is the resolution of the main camera on the Nokia G21?",
        "answer": "50MP"
    },
    {
        "context_index": 1,
        "question": "How many AI cameras does the Nokia G21 have?",
        "answer": "Triple AI Camera"
    },
    {
        "context_index": 1,
        "question": "What is the color option available for the Nokia G21?",
        "answer": "Nordic Blue"
    },
    {
        "context_index": 2,
        "question": "What is the model name of the Realme smartphone?",
        "answer": "Realme narzo 50i Prime"
    },
    {
        "context_index": 2,
        "question": "What is the color option available for the Realme narzo 50i Prime?",
        "answer": "Dark Blue"
    },
    {
        "context_index": 2,
        "question": "How much RAM does the Realme narzo 50i Prime have?",
        "answer": "4GB"
    },
    {
        "context_index": 2,
        "question": "What is the storage capacity of the Realme narzo 50i Prime?",
        "answer": "64GB"
    },
    {
        "context_index": 2,
        "question": "What type of processor does the Realme narzo 50i Prime have?",
        "answer": "Octa-core Processor"
    },
    {
        "context_index": 2,
        "question": "What is the battery capacity of the Realme narzo 50i Prime?",
        "answer": "5000 mAh"
    },
    {
        "context_index": 3,
        "question": "What is the model name of the Realme smartphone?",
        "answer": "Realme narzo N53"
    },
    {
        "context_index": 3,
        "question": "What is the color option available for the Realme narzo N53?",
        "answer": "Feather Gold"
    },
    {
        "context_index": 3,
        "question": "How much RAM does the Realme narzo N53 have?",
        "answer": "4GB"
    },
    {
        "context_index": 3,
        "question": "What is the storage capacity of the Realme narzo N53?",
        "answer": "64GB"
    },
    {
        "context_index": 3,
        "question": "What is the charging speed of the Realme narzo N53?",
        "answer": "33W Segment Fastest Charging"
    },
    {
        "context_index": 3,
        "question": "What is the special feature of the Realme narzo N53 in terms of phone thickness?",
        "answer": "Slimmest Phone in Segment"
    },
    {
        "context_index": 3,
        "question": "What is the refresh rate of the display on the Realme narzo N53?",
        "answer": "90 Hz"
    },
    {
        "context_index": 4,
        "question": "What is the model name of the Realme smartphone?",
        "answer": "Realme narzo N55"
    },
    {
        "context_index": 4,
        "question": "What is the color option available for the Realme narzo N55?",
        "answer": "Prime Blue"
    },
    {
        "context_index": 4,
        "question": "How much RAM does the Realme narzo N55 have?",
        "answer": "4GB"
    },
    {
        "context_index": 4,
        "question": "What is the storage capacity of the Realme narzo N55?",
        "answer": "64GB"
    },
    {
        "context_index": 4,
        "question": "What is the charging speed of the Realme narzo N55?",
        "answer": "33W Segment Fastest Charging"
    },
    {
        "context_index": 4,
        "question": "What is the resolution of the primary AI camera on the Realme narzo N55?",
        "answer": "Super High-res 64MP"
    },
    {
        "context_index": 5,
        "question": "What is the model name of the Redmi smartphone?",
        "answer": "Redmi 9A Sport"
    },
    {
        "context_index": 5,
        "question": "What is the color option available for the Redmi 9A Sport?",
        "answer": "Carbon Black"
    },
    {
        "context_index": 5,
        "question": "How much RAM does the Redmi 9A Sport have?",
        "answer": "2GB"
    },
    {
        "context_index": 5,
        "question": "What is the storage capacity of the Redmi 9A Sport?",
        "answer": "32GB"
    },
    {
        "context_index": 5,
        "question": "What is the processor of the Redmi 9A Sport?",
        "answer": "2GHz Octa-core Helio G25 Processor"
    },
    {
        "context_index": 5,
        "question": "What is the battery capacity of the Redmi 9A Sport?",
        "answer": "5000 mAh"
    },
    {
        "context_index": 6,
        "question": "What is the model name of the Redmi smartphone?",
        "answer": "Redmi 11 Prime 5G"
    },
    {
        "context_index": 6,
        "question": "What is the color option available for the Redmi 11 Prime 5G?",
        "answer": "Thunder Black"
    },
    {
        "context_index": 6,
        "question": "How much RAM does the Redmi 11 Prime 5G have?",
        "answer": "4GB"
    },
    {
        "context_index": 6,
        "question": "What is the storage capacity of the Redmi 11 Prime 5G?",
        "answer": "64GB"
    },
    {
        "context_index": 6,
        "question": "What is the special feature of the Redmi 11 Prime 5G in terms of design?",
        "answer": "Prime Design"
    },
    {
        "context_index": 6,
        "question": "What is the processor of the Redmi 11 Prime 5G?",
        "answer": "MTK Dimensity 700"
    },
    {
        "context_index": 6,
        "question": "What is the resolution of the dual camera on the Redmi 11 Prime 5G?",
        "answer": "50 MP"
    },
    {
        "context_index": 6,
        "question": "What is the battery capacity of the Redmi 11 Prime 5G?",
        "answer": "5000mAh"
    },
    {
        "context_index": 6,
        "question": "How many 5G bands does the Redmi 11 Prime 5G support?",
        "answer": "7 Band 5G"
    },
    {
        "context_index": 7,
        "question": "What is the model name of the Redmi smartphone?",
        "answer": "Redmi 12C"
    },
    {
        "context_index": 7,
        "question": "What is the color option available for the Redmi 12C?",
        "answer": "Royal Blue"
    },
    {
        "context_index": 7,
        "question": "How much RAM does the Redmi 12C have?",
        "answer": "4GB"
    },
    {
        "context_index": 7,
        "question": "What is the storage capacity of the Redmi 12C?",
        "answer": "64GB"
    },
    {
        "context_index": 7,
        "question": "What is the processor of the Redmi 12C?",
        "answer": "High Performance Mediatek Helio G85"
    },
    {
        "context_index": 7,
        "question": "What is the size of the display on the Redmi 12C?",
        "answer": "Big 17cm(6.71) HD+ Display"
    },
    {
        "context_index": 7,
        "question": "What is the battery capacity of the Redmi 12C?",
        "answer": "5000mAh(typ) Battery"
    },
    {
        "context_index": 8,
        "question": "What is the model name of the Redmi smartphone?",
        "answer": "Redmi A1"
    },
    {
        "context_index": 8,
        "question": "What is the color option available for the Redmi A1?",
        "answer": "Light Green"
    },
    {
        "context_index": 8,
        "question": "How much RAM does the Redmi A1 have?",
        "answer": "2GB"
    },
    {
        "context_index": 8,
        "question": "What is the storage capacity of the Redmi A1?",
        "answer": "32GB"
    },
    {
        "context_index": 8,
        "question": "What is the special feature of the camera on the Redmi A1?",
        "answer": "Segment Best AI Dual Cam"
    },
    {
        "context_index": 8,
        "question": "What is the battery capacity of the Redmi A1?",
        "answer": "5000mAh Battery"
    },
    {
        "context_index": 8,
        "question": "What is the design feature of the Redmi A1?",
        "answer": "Leather Texture Design"
    },
    {
        "context_index": 8,
        "question": "What is the operating system of the Redmi A1?",
        "answer": "Android 12"
    },
    {
        "context_index": 9,
        "question": "What is the model name of the Samsung smartphone?",
        "answer": "Samsung Galaxy M04"
    },
    {
        "context_index": 9,
        "question": "What is the color option available for the Samsung Galaxy M04?",
        "answer": "Light Green"
    },
    {
        "context_index": 9,
        "question": "How much RAM does the Samsung Galaxy M04 have?",
        "answer": "4GB"
    },
    {
        "context_index": 9,
        "question": "What is the storage capacity of the Samsung Galaxy M04?",
        "answer": "64GB"
    },
    {
        "context_index": 9,
        "question": "How much RAM can the Samsung Galaxy M04 have with RAM Plus?",
        "answer": "Upto 8GB RAM with RAM Plus"
    },
    {
        "context_index": 9,
        "question": "What is the processor of the Samsung Galaxy M04?",
        "answer": "MediaTek Helio P35 Octa-core Processor"
    },
    {
        "context_index": 9,
        "question": "What is the battery capacity of the Samsung Galaxy M04?",
        "answer": "5000 mAh"
    },
    {
        "context_index": 9,
        "question": "What is the resolution of the dual camera on the Samsung Galaxy M04?",
        "answer": "13MP"
    },
    {
        "context_index": 10,
        "question": "What is the model name of the Samsung smartphone?",
        "answer": "Samsung Galaxy M13"
    },
    {
        "context_index": 10,
        "question": "What is the color option available for the Samsung Galaxy M13?",
        "answer": "Midnight Blue"
    },
    {
        "context_index": 10,
        "question": "How much RAM does the Samsung Galaxy M13 have?",
        "answer": "4GB"
    },
    {
        "context_index": 10,
        "question": "What is the storage capacity of the Samsung Galaxy M13?",
        "answer": "64GB"
    },
    {
        "context_index": 10,
        "question": "What is the battery capacity of the Samsung Galaxy M13?",
        "answer": "6000mAh"
    },
    {
        "context_index": 10,
        "question": "How much RAM can the Samsung Galaxy M13 have with RAM Plus?",
        "answer": "Upto 8GB RAM with RAM Plus"
    },
    {
        "context_index": 11,
        "question": "What is the model name of the Tecno smartphone?",
        "answer": "Tecno Camon 20"
    },
    {
        "context_index": 11,
        "question": "What is the color option available for the Tecno Camon 20?",
        "answer": "Serenity Blue"
    },
    {
        "context_index": 11,
        "question": "How much RAM does the Tecno Camon 20 have?",
        "answer": "8GB"
    },
    {
        "context_index": 11,
        "question": "What is the storage capacity of the Tecno Camon 20?",
        "answer": "256GB"
    },
    {
        "context_index": 11,
        "question": "How much expandable RAM can the Tecno Camon 20 have?",
        "answer": "16GB"
    },
    {
        "context_index": 11,
        "question": "What is the resolution of the rear camera on the Tecno Camon 20?",
        "answer": "64MP RGBW"
    },
    {
        "context_index": 11,
        "question": "What is the size of the display on the Tecno Camon 20?",
        "answer": "6.67 FHD+"
    },
    {
        "context_index": 11,
        "question": "What type of display does the Tecno Camon 20 have?",
        "answer": "Big AMOLED"
    },
    {
        "context_index": 11,
        "question": "What feature is integrated into the display of the Tecno Camon 20?",
        "answer": "In-Display Fingerprint Sensor"
    }
]

#### Data format conversion

- Now, we will transform the training data into the format required by SimpleTransformers, which will be using to train the BERT model for our qa task
- After trasnsformation with loops on each context, their questions and anwers, we will dump the final formated train data to a json file for reuse. 

In [24]:
train_data = []
train_contexts_data = []
 
for i, context in enumerate(train_contexts):
    qas = []
    for qa in train_questions_answers:
        if qa["context_index"] == i:
            answer_start = context.find(qa["answer"])
            if answer_start != -1:
                qas.append({
                    "id": str(len(qas) + 1).zfill(5),
                    "is_impossible": False,
                    "question": qa["question"],
                    "answers": [
                        {
                            "text": qa["answer"],
                            "answer_start": answer_start,
                        }
                    ],
                })
    train_contexts_data.append({
        "context": context,
        "qas": qas,
    })
 
train_data.extend(train_contexts_data)

In [25]:
train_data[0]

{'context': 'Nokia C12 Android 12 (Go Edition) Smartphone, All-Day Battery, 4GB RAM (2GB RAM + 2GB Virtual RAM) + 64GB Capacity | Light Mint',
 'qas': [{'id': '00001',
   'is_impossible': False,
   'question': 'What is the operating system of the Nokia C12 smartphone?',
   'answers': [{'text': 'Android 12 (Go Edition)', 'answer_start': 10}]},
  {'id': '00002',
   'is_impossible': False,
   'question': 'How much RAM does the Nokia C12 have?',
   'answers': [{'text': '4GB', 'answer_start': 63}]},
  {'id': '00003',
   'is_impossible': False,
   'question': 'Does the Nokia C12 have virtual RAM?',
   'answers': [{'text': '(2GB RAM + 2GB Virtual RAM)', 'answer_start': 71}]},
  {'id': '00004',
   'is_impossible': False,
   'question': 'What is the total capacity of the Nokia C12?',
   'answers': [{'text': '64GB', 'answer_start': 101}]},
  {'id': '00005',
   'is_impossible': False,
   'question': 'What is the color option available for the Nokia C12?',
   'answers': [{'text': 'Light Mint', 'an

- The above example gives us the training data in suitable format for index id = 0
- Now we will save the training data into a json file named 'amazon_data_train'

In [26]:
import json
 
with open('amazon_data_train.json', 'w', encoding='utf-8') as f:
    json.dump(train_data, f, ensure_ascii=False, indent=4)

### Setting up testing data

Now, we will use a different set of contexts, with their ground truth question answers.

In [27]:
test_contexts = [
    "Redmi Note 11 (Space Black, 4GB RAM, 64GB Storage)|90Hz FHD+ AMOLED Display | Qualcomm® Snapdragon™ 680-6nm | 33W Charger Included",
    "Redmi Note 10S (Deep Sea Blue, 6GB RAM, 64GB Storage) - Super Amoled Display | 64 MP Quad Camera | 6 Month Free Screen Replacement (Prime only) |33W Charger Included",
    "Lava Blaze 5G (Glass Green, 6GB RAM, UFS 2.2 128GB Storage) | 5G Ready | 50MP AI Triple Camera | Upto 11GB Expandable RAM | Charger Included | Clean Android (No Bloatware)",
    "Oppo A78 5G (Glowing Black, 8GB RAM, 128 Storage) | 5000 mAh Battery with 33W SUPERVOOC Charger| 50MP AI Camera | 90Hz Refresh Rate | with No Cost EMI/Additional Exchange Offers"
]
 
test_questions_answers = [
    {
        "context_index": 0,
        "question": "What is the model name of the Redmi smartphone?",
        "answer": "Redmi Note 11"
    },
    {
        "context_index": 0,
        "question": "What is the color option available for the Redmi Note 11?",
        "answer": "Space Black"
    },
    {
        "context_index": 0,
        "question": "How much RAM does the Redmi Note 11 have?",
        "answer": "4GB"
    },
    {
        "context_index": 0,
        "question": "What is the storage capacity of the Redmi Note 11?",
        "answer": "64GB"
    },
    {
        "context_index": 0,
        "question": "What is the display feature of the Redmi Note 11?",
        "answer": "90Hz FHD+ AMOLED Display"
    },
    {
        "context_index": 0,
        "question": "What is the processor of the Redmi Note 11?",
        "answer": "Qualcomm Snapdragon 680-6nm"
    },
    {
        "context_index": 0,
        "question": "What is included in the package of the Redmi Note 11?",
        "answer": "33W Charger Included"
    },
    {
        "context_index": 1,
        "question": "What is the model name of the Redmi smartphone?",
        "answer": "Redmi Note 10S"
    },
    {
        "context_index": 1,
        "question": "What is the color option available for the Redmi Note 10S?",
        "answer": "Deep Sea Blue"
    },
    {
        "context_index": 1,
        "question": "How much RAM does the Redmi Note 10S have?",
        "answer": "6GB"
    },
    {
        "context_index": 1,
        "question": "What is the storage capacity of the Redmi Note 10S?",
        "answer": "64GB"
    },
    {
        "context_index": 1,
        "question": "What type of display does the Redmi Note 10S have?",
        "answer": "Super Amoled Display"
    },
    {
        "context_index": 1,
        "question": "What is the resolution of the camera on the Redmi Note 10S?",
        "answer": "64 MP Quad Camera"
    },
    {
        "context_index": 1,
        "question": "What is the special offer for screen replacement on the Redmi Note 10S?",
        "answer": "6 Month Free Screen Replacement (Prime only)"
    },
    {
        "context_index": 1,
        "question": "What is included in the package of the Redmi Note 10S?",
        "answer": "33W Charger Included"
    },
    {
        "context_index": 2,
        "question": "What is the model name of the Lava smartphone?",
        "answer": "Lava Blaze 5G"
    },
    {
        "context_index": 2,
        "question": "What is the color option available for the Lava Blaze 5G?",
        "answer": "Glass Green"
    },
    {
        "context_index": 2,
        "question": "How much RAM does the Lava Blaze 5G have?",
        "answer": "6GB"
    },
    {
        "context_index": 2,
        "question": "What is the storage capacity of the Lava Blaze 5G?",
        "answer": "UFS 2.2 128GB"
    },
    {
        "context_index": 2,
        "question": "Is the Lava Blaze 5G compatible with 5G networks?",
        "answer": "5G Ready"
    },
    {
        "context_index": 2,
        "question": "What is the resolution of the camera on the Lava Blaze 5G?",
        "answer": "50MP AI Triple Camera"
    },
    {
        "context_index": 2,
        "question": "How much expandable RAM does the Lava Blaze 5G support?",
        "answer": "Upto 11GB Expandable RAM"
    },
    {
        "context_index": 2,
        "question": "What is included in the package of the Lava Blaze 5G?",
        "answer": "Charger Included"
    },
    {
        "context_index": 2,
        "question": "What operating system does the Lava Blaze 5G use?",
        "answer": "Clean Android (No Bloatware)"
    },
    {
        "context_index": 3,
        "question": "What is the model name of the Oppo smartphone?",
        "answer": "Oppo A78 5G"
    },
    {
        "context_index": 3,
        "question": "What is the color option available for the Oppo A78 5G?",
        "answer": "Glowing Black"
    },
    {
        "context_index": 3,
        "question": "How much RAM does the Oppo A78 5G have?",
        "answer": "8GB"
    },
    {
        "context_index": 3,
        "question": "What is the storage capacity of the Oppo A78 5G?",
        "answer": "128GB"
    },
    {
        "context_index": 3,
        "question": "What is the battery capacity of the Oppo A78 5G?",
        "answer": "5000 mAh"
    },
    {
        "context_index": 3,
        "question": "What is the charging speed of the Oppo A78 5G?",
        "answer": "33W SUPERVOOC Charger"
    },
    {
        "context_index": 3,
        "question": "What is the resolution of the camera on the Oppo A78 5G?",
        "answer": "50MP AI Camera"
    },
    {
        "context_index": 3,
        "question": "What is the refresh rate of the display on the Oppo A78 5G?",
        "answer": "90Hz Refresh Rate"
    },
    {
        "context_index": 3,
        "question": "Are there any additional offers available for the Oppo A78 5G?",
        "answer": "with No Cost EMI/Additional Exchange Offers"
    }
]

In [28]:
test_data = []
test_contexts_data = []
 
for i, context in enumerate(test_contexts):
    qas = []
    for qa in test_questions_answers:
        if qa["context_index"] == i:
            answer_start = context.find(qa["answer"])
            if answer_start != -1:
                qas.append({
                    "id": str(len(qas) + 1).zfill(5),
                    "is_impossible": False,
                    "question": qa["question"],
                    "answers": [
                        {
                            "text": qa["answer"],
                            "answer_start": answer_start,
                        }
                    ],
                })
    test_contexts_data.append({
        "context": context,
        "qas": qas,
    })
 
test_data.extend(test_contexts_data)

In [29]:
with open('amazon_data_test.json', 'w', encoding='utf-8') as f:
    json.dump(test_data, f, ensure_ascii=False, indent=4)

- Now we have two datasets in proper format as json files= amazon_data_train.json and amazon_data_test.json. 
- We will use them to fine tune our BERT model

# TRAINING FOR FINE TUNING

###### To finetune BERT or any other transaformer based popular models, we just need to install one package.

In [31]:
#pip install simpletransformers

Now, we will have our train and test datasets we created earlier.

In [32]:
import json

with open(r"amazon_data_train.json", "r") as read_file:
    train = json.load(read_file)
 
with open(r"amazon_data_test.json", "r") as read_file:
    test = json.load(read_file)

In [33]:
train[0]

{'context': 'Nokia C12 Android 12 (Go Edition) Smartphone, All-Day Battery, 4GB RAM (2GB RAM + 2GB Virtual RAM) + 64GB Capacity | Light Mint',
 'qas': [{'id': '00001',
   'is_impossible': False,
   'question': 'What is the operating system of the Nokia C12 smartphone?',
   'answers': [{'text': 'Android 12 (Go Edition)', 'answer_start': 10}]},
  {'id': '00002',
   'is_impossible': False,
   'question': 'How much RAM does the Nokia C12 have?',
   'answers': [{'text': '4GB', 'answer_start': 63}]},
  {'id': '00003',
   'is_impossible': False,
   'question': 'Does the Nokia C12 have virtual RAM?',
   'answers': [{'text': '(2GB RAM + 2GB Virtual RAM)', 'answer_start': 71}]},
  {'id': '00004',
   'is_impossible': False,
   'question': 'What is the total capacity of the Nokia C12?',
   'answers': [{'text': '64GB', 'answer_start': 101}]},
  {'id': '00005',
   'is_impossible': False,
   'question': 'What is the color option available for the Nokia C12?',
   'answers': [{'text': 'Light Mint', 'an

In [34]:
test[0]

{'context': 'Redmi Note 11 (Space Black, 4GB RAM, 64GB Storage)|90Hz FHD+ AMOLED Display | Qualcomm® Snapdragon™ 680-6nm | 33W Charger Included',
 'qas': [{'id': '00001',
   'is_impossible': False,
   'question': 'What is the model name of the Redmi smartphone?',
   'answers': [{'text': 'Redmi Note 11', 'answer_start': 0}]},
  {'id': '00002',
   'is_impossible': False,
   'question': 'What is the color option available for the Redmi Note 11?',
   'answers': [{'text': 'Space Black', 'answer_start': 15}]},
  {'id': '00003',
   'is_impossible': False,
   'question': 'How much RAM does the Redmi Note 11 have?',
   'answers': [{'text': '4GB', 'answer_start': 28}]},
  {'id': '00004',
   'is_impossible': False,
   'question': 'What is the storage capacity of the Redmi Note 11?',
   'answers': [{'text': '64GB', 'answer_start': 37}]},
  {'id': '00005',
   'is_impossible': False,
   'question': 'What is the display feature of the Redmi Note 11?',
   'answers': [{'text': '90Hz FHD+ AMOLED Display

In [54]:
import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

We will use 340M parameter bert-large-uncased. 

I tried 10 epochs in the beginning, but had to settle with 25 epochs

In [68]:
#train_args are the parameters the QuestionAnswerringModel will use 
train_args = {
    'overwrite_output_dir': True,
    "evaluate_during_training": True,
    "max_seq_length": 128,
    "num_train_epochs": 25, #25, after experimentations
    "evaluate_during_training_steps": 500,
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "n_best_size":16, #batch_size is another important argument
    "train_batch_size": 16,
    "eval_batch_size": 16
}

In [70]:
model = QuestionAnsweringModel("bert",
                               "bert-large-cased", 
                               args = train_args,
                               use_cuda=True) # I will use GPU for faster performance

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-large-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

With everything set, will now train the model

In [71]:
model.train_model(train, eval_data=test)

convert squad examples to features: 100%|██████████| 82/82 [00:00<00:00, 416.17it/s]
add example index and unique id: 100%|██████████| 82/82 [00:00<00:00, 407503.47it/s]


Epoch:   0%|          | 0/25 [00:00<?, ?it/s]

Running Epoch 0 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 315.42it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 221504.98it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 1 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 336.76it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 184169.16it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 2 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 357.78it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 223407.95it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 3 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 344.22it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 176902.62it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 4 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 344.65it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 237702.79it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 5 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 344.49it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 200344.26it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 6 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 327.34it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 187353.64it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 7 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features:   0%|          | 0/31 [00:00<?, ?it/s][A
convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 297.31it/s]A

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 190371.05it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 8 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 342.35it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 160522.75it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 9 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 354.73it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 189815.22it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 10 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 349.81it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 156654.73it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 11 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 345.72it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 175707.33it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 12 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 345.36it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 191492.52it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 13 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 335.11it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 171083.45it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 14 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 335.75it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 191492.52it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 15 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 342.21it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 244865.21it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 16 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 326.28it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 219263.78it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 17 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 367.13it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 107724.46it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 18 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 349.85it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 217067.49it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 19 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 344.33it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 167125.22it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 20 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 342.46it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 234699.32it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 21 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 348.11it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 86165.29it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 22 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 350.43it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 197005.19it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 23 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 337.46it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 160721.17it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 24 of 25:   0%|          | 0/6 [00:00<?, ?it/s]


convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 349.16it/s]

add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 194645.84it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

(150,
 {'global_step': [6,
   12,
   18,
   24,
   30,
   36,
   42,
   48,
   54,
   60,
   66,
   72,
   78,
   84,
   90,
   96,
   102,
   108,
   114,
   120,
   126,
   132,
   138,
   144,
   150],
  'correct': [1,
   1,
   3,
   5,
   5,
   4,
   6,
   6,
   5,
   3,
   5,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6,
   6],
  'similar': [0,
   0,
   2,
   2,
   2,
   2,
   3,
   1,
   3,
   4,
   2,
   3,
   3,
   3,
   3,
   3,
   2,
   3,
   3,
   3,
   3,
   3,
   3,
   3,
   3],
  'incorrect': [8,
   8,
   4,
   2,
   2,
   3,
   0,
   2,
   1,
   2,
   2,
   0,
   0,
   0,
   0,
   0,
   1,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  'train_loss': [4.4453125,
   1.96484375,
   1.43798828125,
   0.38897705078125,
   0.848052978515625,
   0.4465751647949219,
   0.4720191955566406,
   0.20631980895996094,
   0.21841812133789062,
   0.05857372283935547,
   0.008152961730957031,
   0.025991439819335938,
   0.03287458419799805,
   0.192285

Now we will evaluate the model

In [72]:
# Evaluate the model
result, texts = model.eval_model(test)

convert squad examples to features: 100%|██████████| 31/31 [00:00<00:00, 337.11it/s]
add example index and unique id: 100%|██████████| 31/31 [00:00<00:00, 173828.11it/s]


Running Evaluation:   0%|          | 0/2 [00:00<?, ?it/s]

In [73]:
print(result)

{'correct': 6, 'similar': 3, 'incorrect': 0, 'eval_loss': -7.96875}


### MODEL INFERENCE

LET'S TEST OUR  BEST MODEL WITH THE QUESTION: What is the model name of the Samsung smartphone?

In [76]:
# Load model from training checkpoint
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
 
model = QuestionAnsweringModel("bert", "/kaggle/working/outputs/best_model")
 
 
# Make predictions with the model
to_predict = [
    {
        "context": "Samsung Galaxy M14 5G (Smoky Teal, 6GB, 128GB Storage) | 50MP Triple Cam | 6000 mAh Battery | 5nm Octa-Core Processor | 12GB RAM with RAM Plus | Android 13 | Without Charger",
        "qas": [
            {
                "question": "What is the model name of the Samsung smartphone?",
                "id": "0",
            }
        ],
    }
]
 
answers, probabilities = model.predict(to_predict, n_best_size=2)
print(answers)

convert squad examples to features: 100%|██████████| 1/1 [00:00<00:00, 127.75it/s]
add example index and unique id: 100%|██████████| 1/1 [00:00<00:00, 9383.23it/s]


Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

[{'id': '0', 'answer': ['Samsung Galaxy M14 5G', 'Samsung Galaxy M14']}]


So, our model is giving the correct answer. Yay!