# Fine-tuning

LLM: 
- https://towardsdatascience.com/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32

Sentence Transformers:
- https://huggingface.co/blog/how-to-train-sentence-transformers
- https://huggingface.co/datasets/snli


Installations

In [1]:
%%capture

# openai is pinned below 1.0 because the code uses the legacy openai.Completion API
!pip install "openai<1" transformers sentence-transformers

Imports

In [2]:
import re
import sqlite3

import numpy as np
import pandas as pd
from scipy import spatial

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

from sentence_transformers import SentenceTransformer
import openai

Using OpenAI

In [3]:
# If running on Kaggle, add your OpenAI API key to the secrets
from kaggle_secrets import UserSecretsClient
openai.api_key = UserSecretsClient().get_secret("OPENAI_API_KEY")

# # if running locally, use this instead
# import os
# openai.api_key = os.getenv("OPENAI_API_KEY")

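# Text completion via the OpenAI API (legacy Completions endpoint, openai<1.0)
# Note: text-davinci-003 has since been retired by OpenAI; pass a currently
# available completion model via llm_model.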
def openai_generate(prompt: str, llm_model: str = "text-davinci-003"):
    res = openai.Completion.create(
        model=llm_model,
        prompt=prompt,
        temperature=0,
        max_tokens=1024,
    )
    
    choice = res.choices[0]
    if choice.finish_reason != "stop":
        raise Exception(f"finish reason: {choice.finish_reason}")
    return choice.text

# Get the arguments from the prompt
# e.g. Sum up all {statement}s and {fact}s -> ["statement", "fact"]
def get_keys(s: str): 
    res = re.findall(r"\{\S+?\}", s)
    res = [re.sub(r"[\{\}]", '', item) for item in res]
    return res
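
To make the placeholder extraction concrete, here is a small standalone sketch. It re-declares `get_keys` as defined above and runs it on the example from the comment. The second call uses a made-up prompt to illustrate one limitation: because the pattern uses `\S`, a placeholder that contains whitespace is not picked up.

```python
import re

def get_keys(s: str):
    # Same logic as above: find "{name}" placeholders, then strip the braces
    res = re.findall(r"\{\S+?\}", s)
    return [re.sub(r"[\{\}]", "", item) for item in res]

print(get_keys("Sum up all {statement}s and {fact}s"))  # ['statement', 'fact']
print(get_keys("Greet {first name}"))                   # [] (whitespace inside braces)
```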

Named Entity Recognition (NER) using BERT Transformers

In [4]:
ner_options = dict(
    tokenizer = "dslim/bert-base-NER",
    model = "dslim/bert-base-NER",
)

ner_tokenizer = AutoTokenizer.from_pretrained(ner_options["tokenizer"])
ner_model = AutoModelForTokenClassification.from_pretrained(ner_options["model"])
ner = pipeline("ner", model=ner_model, tokenizer=ner_tokenizer)

# get important topics / tags of a sentence
def topics(sentence, ner = ner):
    raw_entities = {}
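    for token in ner(sentence):
        # Skip WordPiece continuation tokens ("##...") and non-entity tokens.
        # Caveat: skipping continuations can truncate a name
        # (e.g. "David Letterman" comes out as "David Letter").
        if '#' in token["word"] or token["entity"] == "O":
            continue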
            
        [b_or_i, entity_type] = token["entity"].split("-")
        if entity_type not in raw_entities:
            raw_entities[entity_type] = [token]
            continue 
            
        if b_or_i == "B":
            raw_entities[entity_type].append(token)
        elif b_or_i == "I":
            raw_entities[entity_type][-1]["end"] = token["end"]
                
    get_token = lambda token: sentence[token['start']:token['end']]
        
    entities = set()
    for entity_type in raw_entities:
        for entity in map(get_token, raw_entities[entity_type]):
            entities.add(entity + " (" + entity_type +  ")")
        
    return entities
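
The merging of `B-` (begin) and `I-` (inside) tags in `topics` can be tried without downloading the model. The sketch below is an illustration only: `merge_bio` is a hypothetical helper that mirrors the loop above, and the token dictionaries are hand-written stand-ins for what the `ner` pipeline would return (only the `entity`, `start` and `end` fields are used).

```python
def merge_bio(sentence, tokens):
    """Merge B-/I- tagged tokens into 'text (TYPE)' strings, mirroring topics()."""
    spans = {}  # entity type -> list of [start, end] character spans
    for token in tokens:
        prefix, entity_type = token["entity"].split("-")
        if prefix == "B" or entity_type not in spans:
            # a new entity starts here
            spans.setdefault(entity_type, []).append([token["start"], token["end"]])
        elif prefix == "I":
            # continuation: stretch the most recent span of this type
            spans[entity_type][-1][1] = token["end"]
    return {
        f"{sentence[start:end]} ({entity_type})"
        for entity_type, type_spans in spans.items()
        for start, end in type_spans
    }

sentence = "Donald Trump visited New York"
# Hand-written tokens (assumed shape, not real model output)
tokens = [
    {"entity": "B-PER", "start": 0, "end": 6},    # Donald
    {"entity": "I-PER", "start": 7, "end": 12},   # Trump
    {"entity": "B-LOC", "start": 21, "end": 24},  # New
    {"entity": "I-LOC", "start": 25, "end": 29},  # York
]
entities = merge_bio(sentence, tokens)
print(sorted(entities))  # ['Donald Trump (PER)', 'New York (LOC)']
```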


Using Sentence Transformers

In [5]:
# Load a pre-trained similarity model.
# A larger alternative (1024-dimensional embeddings):
# enc_options = dict(
#     model = "sentence-transformers/stsb-roberta-large",
#     dim = 1024,
# )
enc_options = dict(
    model = "sentence-transformers/all-MiniLM-L6-v2",
    dim = 384,
)

enc_model = SentenceTransformer(enc_options["model"])
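
The vector database defined later ranks rows by Euclidean distance between these embeddings (through `scipy.spatial.KDTree`). As a rough illustration of that ranking, the sketch below does a brute-force search over made-up 3-dimensional vectors. The helper `nearest` and the toy vectors are assumptions for demonstration, not real model output.

```python
import math

def nearest(query, vectors, k):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

# Hypothetical embeddings; real ones from all-MiniLM-L6-v2 have 384 dimensions
toy_vectors = [
    [0.9, 0.1, 0.0],  # sentence 0
    [0.0, 1.0, 0.0],  # sentence 1
    [0.8, 0.2, 0.1],  # sentence 2
]
toy_query = [1.0, 0.0, 0.0]

print(nearest(toy_query, toy_vectors, k=2))  # [0, 2]
```

A KD-tree returns the same neighbors as this linear scan; it only changes how the search is organized.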

Downloading (…)e9125/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)7e55de9125/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)55de9125/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)125/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)e9125/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)9125/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading (…)7e55de9125/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)5de9125/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

Vector Database functionality

In [6]:
class VecDBQuery:
    def __init__(self, data):
        self.data = data
    
    def result(self):
        return self.data
    
    def generate(
        self, 
        mapper: str = None,
        reducer: str = None,
    ):
        if not len(self.data):
            return self
        
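        # mapper and reducer default to None, and get_keys() needs a string
        mapper_args = get_keys(mapper) if mapper else []
        reducer_args = get_keys(reducer) if reducer else []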
        
        for i in range(len(self.data)):
            self.data[i].update({"_additional": {"generate": {
                "singleResult": None,
                "groupedResult": None, 
            }}})
        
        if mapper and len(mapper_args):
            for i, item in enumerate(self.data):
                single_prompt = f"{mapper}\n\n" 
                for arg in mapper_args:
                    single_prompt += f"{arg}: {item[arg]}\n"
                single_prompt_res = openai_generate(f"{single_prompt}\n")
                self.data[i]["_additional"]["generate"]["singleResult"] = single_prompt_res
        
        if reducer and len(reducer_args):
            grouped_prompt = f"{reducer}\n\n"
            for i, item in enumerate(self.data):
                for arg in reducer_args:
                    grouped_prompt += f"{arg}[{i}]: {item[arg]}\n"
            grouped_prompt_res = openai_generate(f"{grouped_prompt}\n")
            self.data[0]["_additional"]["generate"]["groupedResult"] = grouped_prompt_res

        return self
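
To see what text `generate` sends to the model, the prompt assembly can be reproduced without any API call. This is a sketch under assumptions: `build_prompts` is a hypothetical helper that mirrors the string building above, and the two items and prompts are invented for the example.

```python
import re

def get_keys(s: str):
    # Same placeholder extraction as earlier in the notebook
    return [re.sub(r"[\{\}]", "", item) for item in re.findall(r"\{\S+?\}", s)]

def build_prompts(items, mapper, reducer):
    """Return (one prompt per item, one prompt over all items), as generate() builds them."""
    single_prompts = []
    for item in items:
        prompt = f"{mapper}\n\n"
        for arg in get_keys(mapper):
            prompt += f"{arg}: {item[arg]}\n"
        single_prompts.append(f"{prompt}\n")

    grouped = f"{reducer}\n\n"
    for i, item in enumerate(items):
        for arg in get_keys(reducer):
            grouped += f"{arg}[{i}]: {item[arg]}\n"
    return single_prompts, f"{grouped}\n"

items = [{"statement": "A"}, {"statement": "B"}]
singles, grouped = build_prompts(
    items,
    mapper="Extract facts from {statement}.",
    reducer="Compare the {statement}s.",
)
print(singles[0])
print(grouped)
```

The mapper yields one prompt per row, while the reducer yields a single prompt in which every row is indexed (`statement[0]`, `statement[1]`, and so on) so that the model can refer to rows by position.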

class VecDB:
    def __init__(
        self, 
        conn: sqlite3.Cursor,
        class_name: str,
        keys: list[str],
        vectorizer: dict[str, any],
        maxchar: int = 1024,
    ):
        self.conn = conn
        self.class_name = class_name
        self.keys = list(sorted(keys))
        self.indexed_keys = ["row_num", *self.keys]
        self.maxchar = maxchar
        
        # create a table with given attributes
        # all of which are string with specified max. length
        columns = [f'{k} nvarchar({maxchar})' for k in self.keys]
        self.conn.execute(f"CREATE TABLE {class_name} ({' ,'.join(['row_num integer', *columns])})")

        # assign vector options
        assert vectorizer is not None, "vectorizer must be specified"
        
        self.vectorizer_fn = vectorizer["encoder"]
        self.vectorized_key = vectorizer["key"]
        self.vectorizer_dim = vectorizer["dim"]
        
        # create a vector database
        vec_cols = [f'vec{i} float' for i in range(self.vectorizer_dim)]
        self.conn.execute(f"CREATE TABLE vectors ({' ,'.join(vec_cols)})")
    
    def insert_data(
        self,
        data: list[dict[str, any]],
    ):     
        
        # add placeholders for adding values
        # then add each row of data
        insert_query = f"INSERT INTO {self.class_name} ({', '.join(self.indexed_keys)}) VALUES ({', '.join(['?']*len(self.indexed_keys))})"
        curr_i = self.conn.execute(f"SELECT COUNT(row_num) FROM {self.class_name}").fetchone()
        for i, d in enumerate(data):
            row_index = curr_i[0] + i
            row_values = [row_index] + [d.get(k, '') for k in self.keys]
            self.conn.execute(insert_query, row_values)
        
        # vectorize each data point and add to vector database
        new_vectors = self.vectorizer_fn([d[self.vectorized_key] for d in data])
        new_vector_tuples = [f"({', '.join([str(n) for n in vector])})" for vector in new_vectors]
        self.conn.execute(f"INSERT INTO vectors VALUES {', '.join(new_vector_tuples)}")
        
        return self
    
    # WHERE filters follow a Weaviate-style shape:
    # path: if data looks like {"a": {"b": {"c": ...}}}, path is set to ["a", "b", "c"]
    # operator: only "ContainsAny" is implemented here; other operators are ignored
    # valueText: list of substrings to look for
    def query_data(
        self,
        keys: list[str] = None, 
        near_text: list[str] = None, 
        where: list[any] = None, 
        limit: int = None,
    ):      
        # default to all stored keys; keep the same sorted order used in SELECT
        keys = sorted(keys) if keys else self.keys

        # get all vectors
        vectors = np.array(self.conn.execute("SELECT * FROM vectors").fetchall())
        
        select_query = f"SELECT {', '.join(['row_num', *keys])} FROM {self.class_name}"
        
        # build where clauses with bound parameters instead of string interpolation
        where_queries, where_params = [], []
        for where_clause in where or []:
            path = where_clause["path"][0]
            operator = where_clause["operator"]
            value_text = where_clause["valueText"]
            if operator == "ContainsAny" and value_text:
                patterns = [f"{path} LIKE ?" for _ in value_text]
                where_queries.append("(" + " OR ".join(patterns) + ")")
                where_params.extend(f"%{val}%" for val in value_text)
        
        # combine all clauses under a single WHERE and perform select
        if where_queries:
            select_query += " WHERE " + " AND ".join(where_queries)
        vals = self.conn.execute(select_query, where_params).fetchall()

        if not len(vals):
            return VecDBQuery(vals)
        
        # keep only the vectors of the selected rows
        vectors = vectors[[val[0] for val in vals]]
        
        # limit is optional; without it, return every matching row
        min_len = len(vals) if limit is None else min(limit, len(vals))
        
        # nearest neighbor search if near_text is specified
        if near_text is not None:
            near_vector = self.vectorizer_fn(", ".join(near_text))
            searchtree = spatial.KDTree(vectors)
                
            # KDTree.query returns a scalar index when k == 1, so normalize to 1-D
            _, vec_ind = searchtree.query(near_vector, k=min_len)
            vec_ind = np.atleast_1d(vec_ind)
            
            vals = [vals[i] for i in vec_ind]
        else:
            vals = vals[:min_len]
        
        # pair values with the keys that were actually selected
        vals = [{k: v for k, v in zip(keys, val[1:])} for val in vals]
        return VecDBQuery(vals)
    
    def drop(self):
        # drop both tables
        self.conn.execute("DROP TABLE vectors")
        self.conn.execute(f"DROP TABLE {self.class_name}")
        return self
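
The `ContainsAny` filter in `query_data` boils down to a chain of `LIKE` patterns joined with `OR`. The sketch below shows that idea in isolation on an in-memory SQLite table, using bound parameters for the search values. The helper `contains_any` and the three toy rows are assumptions made for this example.

```python
import sqlite3

def contains_any(cursor, table, column, values):
    """Select rows whose `column` contains any of `values` as a substring."""
    # Table and column names cannot be bound as parameters, so they are
    # interpolated; only pass trusted identifiers here.
    clause = " OR ".join(f"{column} LIKE ?" for _ in values)
    params = [f"%{v}%" for v in values]
    query = f"SELECT row_num, {column} FROM {table} WHERE {clause} ORDER BY row_num"
    return cursor.execute(query, params).fetchall()

cursor = sqlite3.connect(":memory:").cursor()
cursor.execute("CREATE TABLE statements (row_num integer, entities nvarchar(1024))")
cursor.executemany(
    "INSERT INTO statements VALUES (?, ?)",
    [
        (0, "Donald Trump (PER)"),
        (1, "David Letterman (PER)"),
        (2, "Donald J. Trump (PER)"),
    ],
)

matches = contains_any(cursor, "statements", "entities", ["Donald J. Trump", "Donald Trump"])
print(matches)  # [(0, 'Donald Trump (PER)'), (2, 'Donald J. Trump (PER)')]
```

Binding the values with `?` keeps quotes or other special characters in a search term from breaking the SQL statement.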

Local SQL connection

In [7]:
# ":memory:" opens an in-memory database; sqlite3 would treat "sqlite://" as a file path
connection = sqlite3.connect(":memory:")
conn = connection.cursor()

Collect and preprocess data

In [8]:
class_name = "statements"

tts = pd.read_csv("../input/trump-tweets/trumptweets.csv",usecols=["content"])
sentences = list(tts["content"].values)[:100]

data = [
    dict(
        statement=s, 
        entities=', '.join(topics(s)),
    ) for s in sentences
]

Initialize a vector database and insert data into it

In [9]:
client = VecDB(
    conn,
    class_name, 
    keys = ["statement", "entities"],
    vectorizer = dict(
        encoder = enc_model.encode,
        key = "statement",
        dim = enc_options["dim"],
    ),
)

client.insert_data(data)

<__main__.VecDB at 0x7f8c9ada9900>

Insert more data into the vector database

In [10]:
client.insert_data([
    {
        "statement": "Donald Trump didn't build any wall in Mexican borders. He built margins.",
        "entities": "Donald Trump (PER)"
    },
    {
        "statement": "Donald Trump seems to be an inspiring character, but I can assure it's the opposite. He doesn't want you to know that he is betraying the US politics. #AmericanDream",
        "entities": "Donald Trump (PER)"
    }
])

<__main__.VecDB at 0x7f8c9ada9900>

Query the data with `query_data`, then generate additional output with OpenAI via `generate`

In [11]:
client\
    .query_data(
        keys = ["statement", "entities"],
        near_text = ["wall", "politics"],
        where = [dict(
            path = ["entities"],
            operator = "ContainsAny",
            valueText = ["Donald J. Trump", "Donald Trump"],
        )],
        limit = 10,
    )\
    .generate(
        mapper = "Extract the facts out of {statement}, also take away the human factor. Results have to be returned in a list of sentences.",
        reducer = "You are a natural language inference engine. Given many {statement}s, find the conflicting statements (i, j) and return those pairs in a Python list (otherwise return []).",
    )\
    .result()

[{'entities': 'Donald J. Trump (PER)',
  'statement': '"My persona will never be that of a wallflower - I’d rather build walls than cling to them" --Donald J. Trump',
  '_additional': {'generate': {'singleResult': '\n\n1. Building walls is preferable to clinging to them.\n2. It is not desirable to be a wallflower.',
    'groupedResult': '\nAnswer: []'}}},
 {'entities': 'Donald Trump (PER)',
  'statement': "Donald Trump didn't build any wall in Mexican borders. He built margins.",
  '_additional': {'generate': {'singleResult': '\n1. A wall was not built in Mexican borders.\n2. Margins were built.',
    'groupedResult': None}}},
 {'entities': 'David Letter (PER), Donald Trump (PER), The Late Show (MISC)',
  'statement': "-- Watch Donald Trump's recent appearance on The Late Show with David Letterman: http://tinyurl.com/klts6b",
  '_additional': {'generate': {'singleResult': '\n\n1. Donald Trump appeared on The Late Show with David Letterman.\n2. A link to the appearance is http://tinyurl

Drop both tables

In [12]:
client.drop()

<__main__.VecDB at 0x7f8c9ada9900>