Generative LLMs:
- LLAMA
- Mistral
- GPT-3.5 Turbo

Non-generative LLMs:
- BERT
- DistilBERT
- RoBERTa
- ALBERT
- XLNet

In [15]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# From huggingface
models = {
    'BERT': 'google-bert/bert-base-uncased',
    'DistilBERT': 'distilbert/distilbert-base-uncased',
    'ALBERT': 'albert/albert-base-v2',
    'RoBERTa': 'deepset/roberta-base-squad2',  # already fine-tuned on SQuAD 2.0, unlike the base checkpoints here
    'T5': 'google-t5/t5-base',
    'XLNet': 'xlnet/xlnet-base-cased'
}
tokenizer = AutoTokenizer.from_pretrained(models['BERT'])  # load the tokenizer itself, not just its name

In [16]:
# Introduction to pipeline from the transformers library
from transformers import pipeline

sentences = ['Kant was a remarkable figure',
             'Water is made of hydrogen and oxygen',
             'The capital of Norway is Oslo',
             'We hope you don\'t hate it']

model = models['BERT']

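# Sentiment analysis
# Note: bert-base-uncased has no fine-tuned classification head, so its
# predictions here are not meaningful (see the warning in the output below).
classifier = pipeline('sentiment-analysis', model=model)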
'''for sentence in sentences:
    print(sentence, classifier(sentence))'''


# Zero-shot classification
classifier = pipeline('zero-shot-classification')
'''for sentence in sentences:
    print(sentence, classifier(sentence, candidate_labels=['education', 'business', 'science', 'politics']))'''

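# Generation
# Note: BERT is not a causal language model, hence the `is_decoder` warning below;
# a decoder checkpoint such as 'gpt2' is the usual choice for this task.
generator = pipeline('text-generation', model=model)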
'''generator(
    'I am a god. You are merely', 
    max_length=40, 
    num_return_sequences=5
)'''

# Fill masks
sentence = 'The SMRs face a common issue, as <mask> they will be expensive, but the market is wide enough to trigger manufacturing processes to lower the <mask>'
unmasker = pipeline('fill-mask', top_k=3)
#unmasker(sentence)

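# NER
# `aggregation_strategy='simple'` replaces the deprecated `grouped_entities=True`.
# Note: bert-base-uncased is not fine-tuned for NER, so the tags are not meaningful.
ner = pipeline('ner', aggregation_strategy='simple', model=model)
ner('My name is Tiril Mageli and I am an MSc student at Imperial College London in the department of Computing, which is in London')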

# Question Answering
qa = pipeline('question-answering')
'''qa(
    question='whats the weather like today?',
    context='My name is Tiril Mageli and I am an MSc student at Imperial College London in the department of Computing, which is in London'
)'''

# Summariser
text = ["The six small modular reactor (SMR) developers shortlisted in Great British Nuclear’s (GBN’s) competition now have an extra two weeks to submit documentation due to the General Election.",
    "The deadline for submitting project documentation has been pushed back from 24 June to 8 July. GBN said no further details were able to be shared due to restrictions on government communications during the pre-election period.",
    "Energy business publication",
    "reported that a request for the delay had come from one of the four US-based prospective SMR firms.",
    "in October 2023 for government support to deliver a new wave of nuclear reactors are EDF Energy, GE-Hitachi Nuclear Energy International, Holtec Britain, NuScale Power, Rolls-Royce SMR and Westinghouse Electric Company UK.",
    "Of those, GE-Hitachi Nuclear Energy International LLC, Holtec Britain Limited, NuScale Power and Westinghouse Electric Company UK Limited have American parent or partner companies.",
    "The competition winner will receive government backing to deploy a fleet of SMRs in the UK. At the time of the competition announcement, GBN chief executive Gwyn Parry-Jones said parties would be “aiming for a final contract agreement in the summer”.",
    "SMRs are nuclear power stations that have lower capacity than large-scale nuclear plants, usually in the 300MW to 500MW range. They are in theory quicker and cheaper to deploy and their construction will be easily repeatable thanks to their modular design, which will see parts created in a factory.",
    "Even if not successful in GBN’s competition, many of the shortlisted firms have signalled intent to deliver SMRs in the UK.",
    "announced a prototype module testing facility at the University of Sheffield",
    "Holtec has shortlisted four UK sites for its SMR module factory",
    ". Westinghouse has plans to",
    "deploy the first privately funded SMRs in North Teesside by the 2030s",
    "At the time of the GBN competition announcement, energy security secretary Claire Coutinho said: “Small modular reactors will help the UK rapidly expand nuclear power; deliver cheaper, cleaner and more secure energy for British families and businesses; create well-paid, high-skilled jobs; and grow the economy.",
    "“This competition has attracted designs from around the world and puts the UK at the front of the global race to develop this exciting, cutting-edge technology and cement our position as a world leader in nuclear innovation.”",
    "There have been some doubts cast, with",
    "the Environmental Audit Committee claiming that SMRs will not be able to help the UK decarbonise by 2035",
    "Additionally, US think tank Institute for Energy Economics and Financial Analysis (IEEFA) has said that SMRs are “too expensive, too slow, and too risky”."]
full_text = ' '.join(text)

# Note: BERT is an encoder-only model; summarization expects a sequence-to-sequence checkpoint.
summariser = pipeline('summarization', model=model)
#summariser('My name is Tiril and I am happy')

# Translation
#translator = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en', max_length=100)
#translator('Je suis une banane')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias'

In [None]:
# Introduction to AutoModel from transformers library
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'albert/albert-base-v2'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
#classifier('I hate mondays')

from transformers import AutoModel, AutoConfig

# Loading models
bert_model = AutoModel.from_pretrained('bert-base-cased')
print(type(bert_model))

gpt_model = AutoModel.from_pretrained('gpt2')
#print(type(gpt_model))

bart_model = AutoModel.from_pretrained('facebook/bart-base')
#print(type(bart_model))

# It is also possible to download only the configuration
# (a model built from a config alone starts with random weights)
bert_config = AutoConfig.from_pretrained('bert-base-cased')
print(type(bert_config))

# Or load the same configuration through the model-specific config class
from transformers import BertConfig
bert_config = BertConfig.from_pretrained('bert-base-cased')
print(type(bert_config))
print(bert_config) # we can change any of these if we want

# E.g. 10 layers instead of 12
from transformers import BertModel
bert_config = BertConfig.from_pretrained('bert-base-cased', num_hidden_layers=10)
bert_model = BertModel(bert_config)
print(bert_model)

# Saving a model
bert_model.save_pretrained('my-bert-model')  # writes the config and weights to ./my-bert-model

In [None]:
'''
The AutoModelForSequenceClassification and AutoTokenizer classes work together
to power the pipeline() used above. An AutoClass is a shortcut that automatically
retrieves the architecture of a pretrained model from its name or path.

A tokenizer preprocesses text into an array of numbers that serve as inputs
to a model, including the rules for how text is split (words, subwords or
characters). Important: make sure you use the SAME tokenizer that the model was
pretrained with to get good results.
'''

# Loading tokenizer
from transformers import AutoTokenizer
import json

model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoding = tokenizer('I hate Mondays')
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
print(encoding)
print('Tokens:', tokens)

{'input_ids': [101, 151, 39487, 39618, 10107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
Tokens: ['[CLS]', 'i', 'hate', 'monday', '##s', '[SEP]']
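
The `##s` piece in the output above is typical of WordPiece, the subword scheme used by BERT tokenizers: each word is split greedily into the longest pieces found in the vocabulary, and continuation pieces are prefixed with `##`. The sketch below illustrates that idea only. It assumes a tiny hand-made vocabulary (`toy_vocab` is hypothetical; the real vocabulary ships with the checkpoint) and skips the normalisation and punctuation handling of the real tokenizer.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one lower-cased word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry '##'
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {'i', 'hate', 'monday', '##s'}


def toy_tokenize(text, vocab=toy_vocab):
    """Lower-case, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ['[CLS]']
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens + ['[SEP]']


print(toy_tokenize('I hate Mondays'))
```

With this toy vocabulary the call reproduces the token list printed above; a word with no matching pieces collapses to `[UNK]`.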


In [None]:
# Get SQuAD data (NOT NECESSARY)
import requests
import json 

url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'
response = requests.get(url)
data = response.json()
with open('squad-dev-v2.0.json', 'w') as file:
    json.dump(data, file, indent=4)
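
The downloaded file nests questions inside paragraphs inside articles. The sketch below flattens that layout into one record per question. It runs on a small hand-made sample (the content of `sample` is made up) that mimics the v2.0 structure, so it does not need the real file; `flatten_squad` is an illustrative helper, not part of any library.

```python
def flatten_squad(squad_json):
    """Turn the nested SQuAD layout into a flat list of question records."""
    rows = []
    for article in squad_json['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                rows.append({
                    'id': qa['id'],
                    'title': article['title'],
                    'context': paragraph['context'],
                    'question': qa['question'],
                    # Unanswerable questions carry an empty answer list
                    'answers': [a['text'] for a in qa['answers']],
                    'is_impossible': qa.get('is_impossible', False),
                })
    return rows


# Hand-made sample in the same shape as the SQuAD v2.0 JSON
sample = {
    'version': 'v2.0',
    'data': [{
        'title': 'Oslo',
        'paragraphs': [{
            'context': 'Oslo is the capital of Norway.',
            'qas': [
                {'id': 'q1', 'question': 'What is the capital of Norway?',
                 'answers': [{'text': 'Oslo', 'answer_start': 0}],
                 'is_impossible': False},
                {'id': 'q2', 'question': 'Who is the mayor of Oslo?',
                 'answers': [], 'is_impossible': True},
            ],
        }],
    }],
}

rows = flatten_squad(sample)
for row in rows:
    print(row['id'], row['answers'], row['is_impossible'])
```

The same function should work on the `data` dictionary loaded above, provided the file follows this layout.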

In [None]:
# SQuAD test
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from datasets import load_dataset
import difflib
import json  # used when writing the results file at the end of this cell
import time

squad_models = {
    'BERT': 'google-bert/bert-large-uncased-whole-word-masking-finetuned-squad',
    'DistilBERT': 'distilbert/distilbert-base-cased-distilled-squad',
    'RoBERTa': 'deepset/roberta-base-squad2',
    #'TinyLLAMA': 'TinyLlama/TinyLlama-1.1B-step-50K-105b',
    #'Mistral': 'mistralai/Mistral-7B-v0.1'
}

squad = load_dataset('squad')

def is_close_match(predicted, gold_answers, threshold=0.99):
    for gold in gold_answers:
        similarity = difflib.SequenceMatcher(None, predicted, gold).ratio()
        if similarity >= threshold:
            return True
    return False

results = {name: {} for name in squad_models}
for name in squad_models:
    print('Currently testing:', name)

    model_name = squad_models[name]
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)
    print(qa_pipeline)

    correct = 0
    total = 0
    start = time.time()
    for example in squad['validation']:
        question = example['question']
        context = example['context']
        result = qa_pipeline(question=question, context=context)
        pred = result['answer'].strip().lower()
        gold = [g.strip().lower() for g in example['answers']['text']]

        if is_close_match(pred, gold):
            correct += 1
        total += 1
    
    end = time.time()
    accuracy = correct/total
    length = end-start

    results[name]['Accuracy'] = accuracy
    results[name]['Time'] = length

with open('squad_results.json', 'w') as f:
    json.dump(results, f, indent=4)


Currently testing: BERT


Some weights of the model checkpoint at google-bert/bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<transformers.pipelines.question_answering.QuestionAnsweringPipeline object at 0xa948c06a0>
Currently testing: DistilBERT
<transformers.pipelines.question_answering.QuestionAnsweringPipeline object at 0xa97f0fa30>
Currently testing: RoBERTa
<transformers.pipelines.question_answering.QuestionAnsweringPipeline object at 0xa948c06a0>
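
The loop above scores predictions with a `difflib` similarity threshold in `is_close_match`. SQuAD results are more commonly reported as exact match and token-level F1 after normalising the answers. The sketch below follows that general recipe (lower-casing, stripping punctuation and English articles); it is a simplified re-implementation for illustration, not the official evaluation script, and the function names are my own.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lower-case, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalised prediction equals any normalised gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match, one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, gold_answers):
    """Score against every gold answer and keep the best F1."""
    return max(token_f1(prediction, gold) for gold in gold_answers)


print(exact_match('The Eiffel Tower.', ['eiffel tower']))
print(best_f1('in Paris France', ['Paris', 'Paris, France']))
```

Unlike a single similarity threshold, F1 gives partial credit when the predicted span is longer or shorter than the gold answer.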


In [29]:
from transformers import AutoTokenizer, pipeline
from datasets import load_dataset
import time


wikiann = load_dataset('wikiann', 'en')
data = wikiann['test'][:]
models = {
    'BERT': 'dslim/bert-base-NER',
    'DistilBERT': 'dslim/distilbert-NER',
    'RoBERTa': 'Jean-Baptiste/roberta-large-ner-english'
}

In [71]:
print(data.keys())
i = 1022
tokens = data['tokens'][i]
spans = data['spans'][i]
print('Tokens:', tokens)
print('Tags:', spans)

dict_keys(['tokens', 'ner_tags', 'langs', 'spans'])
Tokens: ['**', 'Michael', 'Somare', ',', 'Prime', 'Minister', 'of', 'Papua', 'New', 'Guinea', '(', '1982–1985', ')']
Tags: ['PER: Michael Somare', 'ORG: Prime Minister of Papua New Guinea']
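
The `spans` field shown above can be derived from `ner_tags`. The sketch below shows one way to do it, assuming the tag ids follow the order that the `label_map` defined further down relies on (0 for outside, then B-/I- pairs for PER, ORG and LOC). The `example_tags` list is reconstructed by hand from the printed spans, not copied from the dataset.

```python
# Assumed id-to-name order for the integer tags
TAG_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def tags_to_spans(tokens, tag_ids):
    """Group BIO-tagged tokens into 'TYPE: text' strings."""
    spans = []
    entity_type, entity_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = TAG_NAMES[tag_id]
        if tag.startswith('I-') and tag[2:] == entity_type:
            entity_tokens.append(token)  # continue the open entity
            continue
        # Any other tag closes the entity that was open, if there was one
        if entity_tokens:
            spans.append(f"{entity_type}: {' '.join(entity_tokens)}")
        if tag == 'O':
            entity_type, entity_tokens = None, []
        else:
            # 'B-X' (or a stray 'I-X') starts a new entity of type X
            entity_type, entity_tokens = tag[2:], [token]
    if entity_tokens:
        spans.append(f"{entity_type}: {' '.join(entity_tokens)}")
    return spans


example_tokens = ['**', 'Michael', 'Somare', ',', 'Prime', 'Minister', 'of',
                  'Papua', 'New', 'Guinea', '(', '1982–1985', ')']
# Hand-reconstructed tag ids for the example above (an assumption)
example_tags = [0, 1, 2, 0, 3, 4, 4, 4, 4, 4, 0, 0, 0]

print(tags_to_spans(example_tokens, example_tags))
```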


In [163]:
# Helper functions
'''def join_tokens(tokens):
    string = ''
    for token in tokens:
        if token.isalnum():
            string = string + ' ' + token
        else:
            string = string + token
    string = string.strip()
    return string'''

def join_tokens(tokens):
    return ' '.join(tokens)


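label_map = {
    'O': 0,  # the "outside" tag is the letter O, not the digit 0
    'B-PER': 1,
    'I-PER': 2,
    'B-ORG': 3,
    'I-ORG': 4,
    'B-LOC': 5,
    'I-LOC': 6,

    # Bare labels (no B-/I- prefix), mapped to the B- tag ids
    'PER': 1,
    'ORG': 3,
    'LOC': 5
}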

def merge_result(entities):
    merged_entities = []
    current = None

    for entity in entities:
        if current is None:
            current = entity
        else:
            if entity['word'].startswith('##'):
                current['word'] += entity['word'][2:]
                current['end'] = entity['end']
                current['score'] = min(current['score'], entity['score'])
            else:
                merged_entities.append(current)
                current = entity
    
    if current is not None:
        merged_entities.append(current)
    
    return merged_entities

def find_word_indices(word, tokens):
    indices = []
    for i, token in enumerate(tokens):
        if token == word:
            indices.append(i)
    return indices

def get_predicted_tags(results, tokens):
    predicted_tags = [0 for _ in range(len(tokens))]
    for result in results:
        entity = result['entity']
        if entity in label_map.keys():
            word = result['word']
            indices = find_word_indices(word, tokens)
            for index in indices:
                predicted_tags[index] = label_map[entity]
    
    return predicted_tags

def calculate_metrics(tags_pred, tags_gold):
    # Token-level counts; tags are integer ids, with 0 meaning "outside any entity"
    tp, fp, fn = 0, 0, 0

    for pred, gold in zip(tags_pred, tags_gold):
        if pred == gold and gold != 0:
            tp += 1
        elif gold != 0 and pred != gold:
            fn += 1
        elif gold == 0 and pred != 0:
            fp += 1

    return tp, fp, fn
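
`get_predicted_tags` aligns predictions to tokens by comparing word strings, so a word that occurs twice in a sentence receives the same tag at both positions. One alternative is to align on the character offsets that the pipeline reports (`start` and `end`). The sketch below assumes the text was built with `' '.join(tokens)`, as `join_tokens` does; the `sample_results` list is hand-written in the shape of pipeline output rather than produced by a model, and the helper names are my own.

```python
def token_offsets(tokens):
    """Character (start, end) of each token inside ' '.join(tokens)."""
    offsets, position = [], 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # +1 for the joining space
    return offsets


def tags_from_offsets(results, tokens, label_map):
    """Assign each predicted label to the tokens its character span overlaps."""
    offsets = token_offsets(tokens)
    tags = [0] * len(tokens)
    for result in results:
        label = label_map.get(result['entity'])
        if label is None:
            continue  # label outside the evaluated set
        for i, (start, end) in enumerate(offsets):
            overlaps = result['start'] < end and result['end'] > start
            if overlaps and tags[i] == 0:
                tags[i] = label  # the first piece of a token decides its tag
    return tags


# Reduced copy of the label map, enough for this example
small_label_map = {'B-PER': 1, 'I-PER': 2, 'B-LOC': 5}

sample_tokens = ['Paris', 'Hilton', 'visited', 'Paris']
# Hand-written stand-in for pipeline output
sample_results = [
    {'entity': 'B-PER', 'word': 'Paris', 'start': 0, 'end': 5},
    {'entity': 'I-PER', 'word': 'Hilton', 'start': 6, 'end': 12},
    {'entity': 'B-LOC', 'word': 'Paris', 'start': 21, 'end': 26},
]

print(tags_from_offsets(sample_results, sample_tokens, small_label_map))
```

Here the two occurrences of "Paris" keep different tags, whereas string matching would give both the tag of whichever prediction was processed last.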

In [166]:
def test_model(model_name, wiki_dataset, length=-1):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ner_pipeline = pipeline('ner', model=model_name, tokenizer=tokenizer)
    
    print('\nCurrently testing:', model_name)
    start_time = time.time()

    tokens_all = wiki_dataset['tokens']
    tags_all = wiki_dataset['ner_tags']

    iterations = len(tokens_all)
    if length > 0:
        iterations = length

    TP, FP, FN = 0, 0, 0
    for i in range(iterations):
        tokens = tokens_all[i]
        true_tags = tags_all[i]
        text = join_tokens(tokens)
        result = ner_pipeline(text)
        merged_result = merge_result(result)
        predicted_tags = get_predicted_tags(merged_result, tokens)

        tp, fp, fn = calculate_metrics(predicted_tags, true_tags)
        TP += tp
        FP += fp
        FN += fn

    end_time = time.time()
    duration = end_time - start_time

    precision = TP/(TP + FP) if (TP + FP) > 0 else 0
    recall = TP/(TP + FN) if (TP + FN) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    print('Time:', duration)
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
    print(f'F1 Score: {f1_score:.4f}')
    

In [None]:
test_model(models['RoBERTa'], data, length=10)