## Dinozor NLP 

Our goal is to demonstrate an old NLP task with old NLP methodologies, to understand what the newer methods are trying to do better. For this, I found a Turkish SMS spam detection dataset from [Onur Karasoy et al.](https://github.com/onrkrsy/TurkishSMS-Collection)

In [None]:
import pandas as pd
from pathlib import Path

dataset_path = Path("TurkishSMS-Collection/TurkishSMSCollection.csv")
df = pd.read_csv(dataset_path, sep=';')

This is what the data looks like

In [None]:
df

The classes are balanced, which is nice

In [None]:
df.Group.value_counts()

I am turning classes into 0 and 1 for convenience

In [None]:
df["Group"] = df["Group"].replace(2, 0)

### Good Old Feature Engineering

Please look at some samples and try to come up with features that distinguish spam from ham in this dataset. Then a bunch of classifiers will take those as inputs to generate scores. It is not about the results but about the process of applying ancient ML methodologies to real-life NLP problems.

In [None]:
pd.set_option('display.max_colwidth', None)

Take a look at random samples from each class and try to come up with features that differentiate one from the other

In [None]:
df[df["Group"] == 1].sample(5)

In [None]:
df[df["Group"] == 0].sample(5)

Below here, engineer some features and append them to the original dataframe like:

```python
def get_my_feature(text):
    # calculate your galaxy brain feature
    text = text.do_stuff()
    return text

df["my_feature"] = df["Message"].apply(get_my_feature)
```
or in any way you like

In [None]:
import re

def clean_text(text):
    """Remove special characters and standardize the text."""
    # Remove special characters except alphanumeric and spaces
    text = re.sub(r'[^a-zA-Z0-9\sİıÜüÖöŞşÇçĞğ]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    # Make everything lowercase, mapping Turkish I/İ to ı/i first
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Strip leading/trailing whitespace
    return text.strip()

def get_caps_proportion(text: str) -> float:
    """Return the proportion (0 to 1) of uppercase characters in the given string."""
    if not text:  # avoid ZeroDivisionError on empty messages
        return 0.0
    caps_count = sum(1 for char in text if char.isupper())
    return caps_count / len(text)

def count_urls(text: str) -> int:
    """Return the number of URLs found in the given string."""
    url_pattern = r'https?://\S+'
    return len(re.findall(url_pattern, text))

def count_numbers(text: str) -> int:
    """Return the number of numbers found in the given string."""
    number_pattern = r'\d+'
    return len(re.findall(number_pattern, text))

def numeric_char_proportion(text: str) -> float:
    """Return the proportion (0 to 1) of numeric characters in the given string."""
    if not text:  # avoid ZeroDivisionError on empty messages
        return 0.0
    return sum(1 for char in text if char.isdigit()) / len(text)

In [None]:
from tqdm import tqdm
tqdm.pandas()

In [None]:
df["uppercase_proportion"] = df["Message"].progress_apply(get_caps_proportion)
df["url_count"] = df["Message"].progress_apply(count_urls)
df["number_count"] = df["Message"].progress_apply(count_numbers)
df["numeric_char_proportion"] = df["Message"].progress_apply(numeric_char_proportion)

Take a look at your newly engineered features

In [None]:
df

### Lazy classifier

Grinding through different models to hit a higher score is automatable. What we are doing here is only meaningful for benchmarking; productionizing our solutions would bring up different concerns.

In [None]:
def train_test_split(df: pd.DataFrame, target: str, ratio: float=0.3): # i know i didn't need to write this
    X = df.drop(target, axis=1)
    Y = df[[target]]
    split = round(len(df)*ratio)
    X_test = X.iloc[:split]
    X_train = X.iloc[split:]
    y_test = Y.iloc[:split]
    y_train = Y.iloc[split:]
    return X_train, X_test, y_train, y_test

I am using a library called lazy predict, which basically goes around and tries every sklearn classifier on your data, so that we can ignore model selection and hyperparameter tuning and just focus on the data

In [None]:
from lazypredict.Supervised import LazyClassifier
from sklearn.metrics import precision_score

## write here your features along with the Group column
features = df[["uppercase_proportion", "url_count", "number_count", "numeric_char_proportion", "Group"]]

X_train, X_test, y_train, y_test = train_test_split(features, target="Group", ratio=.3)

clf = LazyClassifier(verbose=0, ignore_warnings=False, predictions=True, random_state=42, classifiers="all", custom_metric=precision_score)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
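
Conceptually, what lazy predict automates is a loop like the one below. This is only a rough sketch of the idea, not the library's actual implementation: the three classifiers, the synthetic data from `make_classification` and all `toy_` names are my own assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 4 numeric features, like our hand-made features.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=42)
X_tr, X_te, y_tr, y_te = sk_train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# A small, hand-picked subset of classifiers (lazy predict tries many more).
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
}

# Fit every candidate with default settings and record its precision.
toy_scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    toy_scores[name] = precision_score(y_te, estimator.predict(X_te))

# Highest precision first.
for name, score in sorted(toy_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The point is that nothing here is tuned: every model gets the same features and default hyperparameters, so differences in score mostly tell us about the features.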

Let's make up a business rule and say that our spam filtering system must avoid flagging our loved ones' SMSs as spam.

So let's compare classifiers by precision
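
Precision fits this rule because it is the share of messages flagged as spam that really are spam, so every false alarm on a loved one's message lowers it. Here is a minimal hand-rolled sketch with made-up labels, assuming 1 means spam and 0 means ham (the `precision` helper is written just for illustration):

```python
def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    false_positives = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t != positive
    )
    predicted_positives = true_positives + false_positives
    # If nothing was flagged as positive, report 0.0 instead of dividing by zero.
    return true_positives / predicted_positives if predicted_positives else 0.0

# Made-up labels: 1 = spam, 0 = ham.
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0]

# Three messages were flagged, two of them were really spam.
print(precision(y_true, y_pred))
```

Note that the missed spam message (the last one) does not hurt precision at all; that is what recall would measure.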

In [None]:
models.sort_values("precision_score", ascending=False)

___

Here is a simple error analysis view so you can see what sort of examples are predicted wrong and develop features based on your hypotheses

In [None]:
predictions["Group"] = y_test
highest_precision_classifier = models.sort_values("precision_score", ascending=False).index[0]

# the examples that were not spam but the best classifier decided otherwise
indices = predictions[(predictions["Group"] == 0) & (predictions[highest_precision_classifier] == 1)].index
df.iloc[indices]

What do you think could be improved?

### Term document matrix

Since the dawn of time, the goal of NLP research has been to somehow represent language units with numbers, because only then can we do data science with them.

![xkcd](https://imgs.xkcd.com/comics/assigning_numbers.png)

One of the older ways to represent words (or tokens) and documents was to create a term-document matrix. We can assume that in such a matrix the rows are words, the columns are documents, and each cell is a function of the two. The most basic function to use might be the frequency of that word in that document. With that, we represent each document as a vocabulary-sized sparse vector. It is also called a co-occurrence matrix. Words that co-occur are represented by vectors that are closer.  
For example the words "volkan" and "konak" might occur together more often than "volkan" and "şemsiye"; therefore distance("volkan", "konak") < distance("volkan", "şemsiye")
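
To make the distance claim concrete, here is a tiny hand-built version on a made-up three-document corpus (the sentences are invented for illustration, not taken from the SMS dataset). Rows are words, columns are documents, and I picked Euclidean distance as one possible choice:

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "volkan konak konser",
    "volkan konak cerrahpaşa",
    "yağmur şemsiye",
]

# Count how often each word occurs in each document.
doc_counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set(word for counts in doc_counts for word in counts))

# Rows are words, columns are documents, cells are raw frequencies.
term_document = {
    word: [counts[word] for counts in doc_counts] for word in vocabulary
}

def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between the document vectors of two words."""
    return math.dist(term_document[word_a], term_document[word_b])

print(term_document["volkan"], term_document["konak"], term_document["şemsiye"])
print(distance("volkan", "konak"))
print(distance("volkan", "şemsiye"))
```

"volkan" and "konak" appear in exactly the same documents, so their vectors coincide, while "şemsiye" lives in a different document and ends up farther away.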

The assumption we are making is: **The meaning of a document is a function of the words it contains**  
Let's build that!

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample texts
texts = df["Message"].values

# Create term-document matrix
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(texts)

# Convert to DataFrame for better visualization
td_matrix = pd.DataFrame(matrix.toarray(), 
                 columns=vectorizer.get_feature_names_out(),
                 index=[f'Doc{i+1}' for i in range(len(texts))])

print("Term-document matrix:")
td_matrix

Very simple. Note that here the rows are documents and the columns are 'words', which is the transpose of the layout described above. We have word representations based on which documents they occur in (is this language modelling??) and we have document representations based on how many of each word they contain.

Let's see the most frequent words

In [None]:
td_matrix.sum().sort_values(ascending=False).head(10)

We can infer the words that co-occur most with a given word via vector similarity

In [None]:
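def get_most_cooccurances(word, matrix_df, top_k=10):
    """Return the top_k words whose document vectors have the largest dot product with `word`."""
    if word not in matrix_df.columns:
        raise ValueError(f"{word} does not exist in the vocabulary")
    # Use the DataFrame that was passed in instead of the global td_matrix / matrix
    vec = matrix_df[word].values
    similarities = vec.dot(matrix_df.values)
    top_k_indices = (-similarities).argsort()[:top_k]
    return [matrix_df.columns[i] for i in top_k_indices]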

Go ahead and discover which words co-occur the most. You can filter by spamness to see how co-occurrence differs between the two classes.

In [None]:
get_most_cooccurances("cumalar", td_matrix)

It sort of makes sense, but why is it so ugly?  
The words are extracted naively. There are different vectors for "düşünmek", "düşünüyorum", "düşündüler" etc.
Our assumption was that document meaning is a function of its words. We can also assume these words would contribute similar meanings to a document, so treating them as separate creates noise.  
Also, words like "ve", "veya", "şöyle", "böyle" should contribute very little to the meaning. Getting rid of those should also remove some of the noise in the matrix

### Exercise! Go ahead and write a preprocessing step to mitigate the problems stated above. You might remember terms like stop words, stemming and so on. Then use your new and beautiful term occurrences as features for the classifiers above and see how well they perform compared to your initial feature engineering.

# NLP Tasks

## Token Classification

In [10]:
from tqdm import tqdm
tqdm.pandas()

A tokenizer will be useful

In [11]:
from transformers import AutoTokenizer
model_name = "dbmdz/distilbert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)



### NER

Token classification is about classifying the parts (words, subwords...) of a text.

The best-known application is Named Entity Recognition:

- [ "My", "name", "is", "Ahmet", "." ]
- [ "O", "O", "O", "PERSON", "O" ]  

Named entity recognition finds the special entities in a text, such as "person", "location", "date".

It is a type of token classification, classes being, for example, "O", "PERSON", "LOC", "DATE".
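
In practice, datasets usually mark entity boundaries with `B-` (begin) and `I-` (inside) prefixes, as the label list later in this notebook does. As a rough sketch of how such tags map back to entities (the helper `bio_to_entities` is my own illustration, not part of any library; the tokens and tags are the start of one of the dataset rows previewed below):

```python
def bio_to_entities(tokens, tags):
    """Group word-level BIO tags into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(token)  # continue the open entity
            continue
        if current_type is not None:  # close the open entity, if any
            entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if tag != "O":  # "B-X" (or a stray "I-X") opens a new entity
            current_type, current_words = tag[2:], [token]
    if current_type is not None:  # flush an entity that runs to the end
        entities.append((current_type, " ".join(current_words)))
    return entities

tokens = ["11", "Temmuz", "1927", "tarihinde", "Filistin'de"]
tags = ["B-DATE", "I-DATE", "I-DATE", "I-DATE", "B-GPE"]
print(bio_to_entities(tokens, tags))
```

The `B-` prefix is what lets two adjacent entities of the same type be told apart, which a plain "PERSON"/"O" labeling cannot do.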

#### What does the NER data look like?

[turkish-nlp-suite/turkish-wikiNER](https://huggingface.co/datasets/turkish-nlp-suite/turkish-wikiNER)  
[GitHub link for the same dataset](https://github.com/turkish-nlp-suite/Turkish-Wiki-NER-Dataset/)


I am reading the same data both as a pandas DataFrame and as a Hugging Face Dataset to understand what Datasets has to offer and how the two differ

In [12]:
# Loading dataset via pandas
import pandas as pd

splits = {'train': 'dataset/train.json', 'validation': 'dataset/valid.json', 'test': 'dataset/test.json'}
df = pd.read_json("hf://datasets/turkish-nlp-suite/turkish-wikiNER/" + splits["train"], lines=True)

These are the classes represented in the dataset

In [13]:
label_list = ['O',
'B-CARDINAL',
'I-CARDINAL',
'B-DATE',
'I-DATE',
'B-EVENT',
'I-EVENT',
'B-FAC',
'I-FAC',
'B-GPE',
'I-GPE',
'B-LANGUAGE',
'I-LANGUAGE',
'B-LAW',
'I-LAW',
'B-LOC',
'I-LOC',
'B-MONEY',
'I-MONEY',
'B-NORP',
'I-NORP',
'B-ORDINAL',
'I-ORDINAL',
'B-ORG',
'I-ORG',
'B-PERCENT',
'I-PERCENT',
'B-PERSON',
'I-PERSON',
'B-PRODUCT',
'I-PRODUCT',
'B-QUANTITY',
'I-QUANTITY',
'B-TIME',
'I-TIME',
'B-TITLE',
'I-TITLE',
'B-WORK_OF_ART',
'I-WORK_OF_ART']

Let's take a look at what we are dealing with

In [14]:
df

Unnamed: 0,tokens,tags
0,"[Orda, Spike, ,, First'ün, etkisiyle, Buffy'ye...","[O, B-PERSON, O, B-PERSON, O, B-PERSON, O, O, ..."
1,"["", Macera, edebiyatın, ilk, günlerinden, beri...","[O, O, O, B-ORDINAL, O, O, O, O, O, O, O]"
2,"[Günümüzde, Adana'da, 514, okul, öncesi, eğiti...","[O, B-GPE, B-CARDINAL, O, O, O, O, B-CARDINAL,..."
3,"[11, Temmuz, 1927, tarihinde, Filistin'de, mey...","[B-DATE, I-DATE, I-DATE, I-DATE, B-GPE, B-EVEN..."
4,"[Refrain, Lys, Assia, tarafından, söylenen, ve...","[B-PERSON, I-PERSON, I-PERSON, O, O, O, B-EVEN..."
...,...,...
17962,"[Çoğu, ale, nebatî, lezzetin, kaynağı, olan, m...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
17963,"[Amancio, önce, Real, Madrid'in, alt, takımı, ...","[B-PERSON, O, B-ORG, I-ORG, O, O, O, B-ORG, I-..."
17964,"[Avrupa, Komisyonu, :, Parlamentoya, ve, Konse...","[B-ORG, I-ORG, O, O, O, O, O, O, O, O, O, O, O..."
17965,"[En, eski, cinsinin, adı, Teckel'dir, ., 19., ...","[O, O, O, O, O, O, B-DATE, I-DATE, I-DATE, O, ..."


Here we see the labels are given for each word. But most modern approaches don't use word tokenization. We also will be using a model with subword tokenization. Subword tokenization is very beneficial with morphologically rich languages like Turkish.

In the function below we are aligning the labels with the actual tokens that our model will use.  
Feel free to dissect it

Here we set the labels of all special tokens to -100 (the index that is ignored by PyTorch) and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word. For more info check the [original notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=DIba90p4rvU_)
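
The two strategies can be compared on a toy input without any tokenizer. In this sketch the `word_ids` list is hand-written to mimic what a fast tokenizer reports (`None` for special tokens, otherwise the index of the source word); the example values and the helper name `align_labels` are assumptions for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Spread word-level label ids over subword tokens."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:  # special token such as [CLS] / [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:  # first subword of a word
            aligned.append(word_labels[word_id])
        else:  # later subwords of the same word
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous_word_id = word_id
    return aligned

# Hypothetical example: 3 words, the second one split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 27, 0]

print(align_labels(word_ids, word_labels, label_all_tokens=True))
print(align_labels(word_ids, word_labels, label_all_tokens=False))
```

With `label_all_tokens=True` the second subword repeats the word's label; with `False` it gets -100 and is left out of the loss.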

In [15]:
# How would the code change if we assumed we always want to label all tokens?

label_all_tokens=True
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"],
                                 truncation=True,
                                 is_split_into_words=True)

    word_ids = tokenized_inputs.word_ids()
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        # Special tokens have a word id that is None. We set the label to -100 so they are automatically
        # ignored in the loss function.
        if word_idx is None:
            label_ids.append(-100)
        # We set the label for the first token of each word.
        elif word_idx != previous_word_idx:
            label_ids.append(label_list.index(examples["tags"][word_idx]))
        # For the other tokens in a word, we set the label to either the current label or -100, depending on
        # the label_all_tokens flag.
        else:
            label_ids.append(label_list.index(examples["tags"][word_idx]) if label_all_tokens else -100)
        previous_word_idx = word_idx

    tokenized_inputs["labels"] = label_ids
    #import pdb; pdb.set_trace()
    return tokenized_inputs

In [16]:
tmp_df = df.progress_apply(tokenize_and_align_labels, axis=1)

100%|███████████████████████████████████| 17967/17967 [00:05<00:00, 3491.15it/s]


In [17]:
tokenized_df = pd.DataFrame(tmp_df.tolist()) # TODO: do this some other way

This is what the tokenized labels look like

In [18]:
# we can also add the decoded input_ids to peep into the tokenization of the actual text
tokenized_df

Unnamed: 0,attention_mask,input_ids,labels
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 2757, 1986, 6639, 15453, 16, 9379, 2033, 1...","[-100, 0, 0, 27, 27, 0, 27, 27, 27, 27, 0, 27,..."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 6, 6496, 5004, 25409, 1009, 2411, 14955, 1...","[-100, 0, 0, 0, 0, 0, 21, 0, 0, 0, 0, 0, 0, 0,..."
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 9236, 6280, 11, 2054, 8236, 1119, 3441, 58...","[-100, 0, 9, 9, 9, 1, 1, 0, 0, 0, 0, 0, 1, 1, ..."
3,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 2974, 4214, 8337, 1106, 3863, 7152, 11, 20...","[-100, 3, 4, 4, 4, 4, 9, 9, 9, 5, 6, 6, 6, 0, ..."
4,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 11123, 2079, 1973, 29764, 1022, 3001, 2740...","[-100, 27, 27, 27, 28, 28, 28, 28, 0, 0, 0, 5,..."
...,...,...,...
17962,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 12077, 2088, 1025, 22454, 2001, 1089, 2620...","[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
17963,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 16079, 2329, 1033, 2478, 13314, 12926, 11,...","[-100, 27, 27, 27, 0, 23, 24, 24, 24, 0, 0, 0,..."
17964,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 3070, 7653, 30, 26431, 2029, 1992, 18799, ...","[-100, 23, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0..."
17965,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 2654, 3275, 21815, 6717, 3668, 15806, 6848...","[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 4, ..."


#### Finetuning NER

In [19]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

In [51]:
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at dbmdz/distilbert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
from transformers import DataCollatorForTokenClassification
# Data collator that will dynamically pad the inputs received, as well as the labels.
data_collator = DataCollatorForTokenClassification(tokenizer)

Using the datasets library with transformers is a very convenient abstraction. Feel free to check how it differs from a pandas DataFrame.

In [60]:
import datasets
split = round(len(tokenized_df)*0.3)
print(split)

dataset = datasets.DatasetDict(
    {
        "train": datasets.Dataset.from_pandas(tokenized_df[split:]),
        "test": datasets.Dataset.from_pandas(tokenized_df[:split]),
    }
)

5390


In [61]:
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

The last thing to define for our Trainer is how to compute the metrics from the predictions. Here we use the seqeval metrics (commonly used to evaluate results on the CoNLL dataset), imported directly from the `seqeval` package.

So we will need to do a bit of post-processing on our predictions:
- select the predicted index (with the maximum logit) for each token
- convert it to its string label
- ignore everywhere we set a label of -100

The following function does all this post-processing on the result of `Trainer.evaluate` (which is a namedtuple containing predictions and labels) before applying the metric:

In [62]:
import numpy as np
from seqeval.metrics import f1_score, accuracy_score, precision_score, recall_score

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(y_true=true_labels, y_pred=true_predictions),
        "recall": recall_score(y_true=true_labels, y_pred=true_predictions),
        "f1": f1_score(y_true=true_labels, y_pred=true_predictions),
        "accuracy": accuracy_score(y_true=true_labels, y_pred=true_predictions)
    }
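
To see the post-processing steps in isolation, here is the same argmax-and-filter logic on a toy batch. The three-class label set and the logits are made up, and all `toy_` names are just for illustration.

```python
import numpy as np

toy_label_list = ["O", "B-PERSON", "I-PERSON"]  # hypothetical 3-class label set

# Logits for 1 sentence, 4 tokens, 3 classes: shape (batch, tokens, classes).
toy_logits = np.array([[[2.0, 0.1, 0.1],
                        [0.2, 3.0, 0.1],
                        [0.1, 0.3, 1.5],
                        [1.0, 0.2, 0.1]]])
toy_labels = np.array([[-100, 1, 2, -100]])  # first and last are special tokens

# Step 1: pick the class with the highest logit for every token.
toy_pred_ids = toy_logits.argmax(axis=2)

# Steps 2 and 3: map ids to strings and drop positions labelled -100.
kept_predictions = [
    [toy_label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(toy_pred_ids, toy_labels)
]
kept_labels = [
    [toy_label_list[l] for l in label_row if l != -100]
    for label_row in toy_labels
]
print(kept_predictions)
print(kept_labels)
```

The predictions for the two special-token positions are thrown away even though the model produced logits for them, because there is no gold label to compare against.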

In [63]:
trainer = Trainer(
    model,
    args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



I hope the next cell does not start your PC fan immediately

In [64]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.032189,0.146341,0.272727,0.190476,0.707692
2,No log,1.023219,0.170732,0.318182,0.222222,0.715385
3,No log,1.009742,0.166667,0.318182,0.21875,0.715385


TrainOutput(global_step=12, training_loss=1.222005049387614, metrics={'train_runtime': 6.7247, 'train_samples_per_second': 22.306, 'train_steps_per_second': 1.784, 'total_flos': 1662345126540.0, 'train_loss': 1.222005049387614, 'epoch': 3.0})

Let's see how we did on the test set

In [65]:
trainer.evaluate()

{'eval_loss': 1.009742021560669,
 'eval_precision': 0.16666666666666666,
 'eval_recall': 0.3181818181818182,
 'eval_f1': 0.21874999999999997,
 'eval_accuracy': 0.7153846153846154,
 'eval_runtime': 0.0469,
 'eval_samples_per_second': 106.601,
 'eval_steps_per_second': 21.32,
 'epoch': 3.0}

In [67]:
def compute_test_results():
    predictions, labels, _ = trainer.predict(dataset["test"])
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    return true_predictions, true_labels


In [68]:
pred, label = compute_test_results()

We can see example-wise accuracy scores

In [95]:
for p, l in zip(pred, label):
    a = [pp==ll for pp,ll in zip(p,l)]
    print(sum(a)/len(a))

1.0
0.5555555555555556
0.8888888888888888
0.5909090909090909
0.6216216216216216


#### NER Inference

Feel free to play with your examples to see what the model is good and bad at

In [98]:
example_sentence = "Inzva'nın Taksim binasını Yağız hiç görmemiş."

In [99]:
inputs = tokenizer(example_sentence, return_tensors="pt", add_special_tokens=True)
inputs["input_ids"] = inputs["input_ids"].to(device=model.device)
inputs["attention_mask"] = inputs["attention_mask"].to(device=model.device)


In [100]:
outputs = model(**inputs)

In [101]:
predicted_classes = outputs['logits'].argmax(axis=2).cpu().numpy()[0]

In [102]:
tokens = tokenizer.convert_ids_to_tokens(ids=inputs["input_ids"].cpu().numpy()[0], skip_special_tokens=False)

In [103]:
for i, p in enumerate(predicted_classes):
    if tokens[i] in [tokenizer.sep_token, tokenizer.cls_token]:
        continue
    print(f"{tokens[i]} ----> {label_list[p]}")

In ----> B-PERSON
##z ----> B-PERSON
##va ----> B-GPE
' ----> B-GPE
nın ----> B-GPE
Taksim ----> B-GPE
binası ----> O
##nı ----> O
Yağı ----> B-GPE
##z ----> O
hiç ----> O
görmemiş ----> O
. ----> O
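
The predictions above are per subword, which is hard to read. One simple option (an assumption on my part, there are other strategies) is to glue the `##` continuation pieces back together and keep the label of the first piece. The tokens and labels below are copied from the output above; `merge_wordpieces` is a hypothetical helper.

```python
def merge_wordpieces(tokens, labels):
    """Merge '##' continuation pieces and keep the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))

tokens = ["In", "##z", "##va", "'", "nın", "Taksim", "binası", "##nı", "Yağı", "##z"]
labels = ["B-PERSON", "B-PERSON", "B-GPE", "B-GPE", "B-GPE", "B-GPE", "O", "O", "B-GPE", "O"]

for word, label in merge_wordpieces(tokens, labels):
    print(f"{word} ----> {label}")
```

This only merges pieces marked with `##`; the apostrophe and the suffix after it stay separate tokens, just as the tokenizer produced them.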


### Extractive QA

Extractive QA can also be formulated as a token classification problem. Here extractive means that the answer is a span inside the given context. So we can train a model to predict, for each token, how likely it is to be the start token or the end token of the answer.

This is what the SQuAD data format looks like. It is quite a common standard dataset and format in the QA literature (a bit outdated imo).

In [1]:
example_qa = {
                "data": [
                    {
                        "title": "Example",
                        "paragraphs": [
                            {
                                "context": "The quick brown fox jumps over the lazy dog.",
                                "qas": [
                                    {
                                        "question": "What does the fox jump over?",
                                        "id": "q1",
                                        "answers": [
                                            {
                                                "text": "the lazy dog",
                                                "answer_start": 31
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                ],
                "version": "2.0"
            }
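
Since `answer_start` is a character offset into `context`, it is easy to get off by one. A quick sanity check like the sketch below catches that (the helper name `answer_is_consistent` is mine; the offset is computed with `str.index` instead of being typed by hand):

```python
def answer_is_consistent(context: str, answer: dict) -> bool:
    """Check that answer_start really points at the answer text inside the context."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end] == answer["text"]

context = "The quick brown fox jumps over the lazy dog."
answer_text = "the lazy dog"

# str.index is case-sensitive, so it skips the capitalised "The" at the beginning.
answer = {"text": answer_text, "answer_start": context.index(answer_text)}

print(answer["answer_start"], answer_is_consistent(context, answer))
```

Running the same check over a whole dataset is a cheap way to find broken annotations before training.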

We will be demonstrating the extractive QA task with a translated SQuAD dataset from our friends at Boun-tabilab:
[boun-tabi/squad_tr](https://huggingface.co/datasets/boun-tabi/squad_tr)

In [2]:
import gzip
import json

with gzip.open("SQuAD-TR/data/squad-tr-train-v1.0.0.json.gz", "r") as f:
   qa_data = json.loads(f.read().decode('utf-8'))

This time we are directly jumping into the HF datasets format

In [3]:
from datasets import Dataset
from tqdm import tqdm

def json_to_dataset(data):
    datalist = []
    for title in tqdm(data):
        for paragraph in title["paragraphs"]:
            for qa in paragraph["qas"]:
                if len(qa["answers"]) == 0: # these could actually be included in the setup as well
                    continue
                example = {'id': qa['id'], 'title': title["title"], 'context': paragraph['context'], 'question': qa['question'], 'answers': qa['answers'][0]}
                datalist.append(example)
    
    return Dataset.from_list(datalist)



In [4]:
squad_tr = json_to_dataset(qa_data["data"][:1]) # I am limiting the data to the first title for faster computations

100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 754.64it/s]






In [5]:
squad_tr

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 515
})

Split the dataset's `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [6]:
squad_tr = squad_tr.train_test_split(test_size=0.2)

In [7]:
squad_tr

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 412
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 103
    })
})

In [8]:
squad_tr["train"][0]

{'id': '56bea5f23aeaaa14008c91a2',
 'title': 'Beyonce',
 'context': "Beyoncé'nin ilk olarak Jay Z ile “'03 Bonnie & Clyde” adlı işbirliğinin ardından, yedinci albümü The Blueprint 2: The Gift & The Curse (2002) 'de yer aldığı düşünülmektedir. Beyoncé, şarkıdaki klipte Jay Z'nin kız arkadaşı olarak göründü ve bu da ilişkilerinin spekülasyonlarını daha da artıracak. 4 Nisan 2008'de Beyoncé ve Jay Z tanıtım olmadan evlendi. Nisan 2014 itibariyle çift birlikte 300 milyon rekoru sattı. Çift, son yıllarda daha rahat görünmesine rağmen, özel ilişkileri ile tanınıyor. Beyoncé 2010 veya 2011'de düşük yaptı ve bunu “şimdiye kadar dayandığı en üzücü şey” olarak nitelendirdi. Kaybıyla başa çıkmak için stüdyoya döndü ve müzik yazdı. Nisan 2011'de Beyoncé ve Jay Z, albüm kapağını çekmek için Paris'e gittiler ve beklenmedik bir şekilde Paris'te hamile kaldılar.",
 'question': 'Beyonce nerede hamile kaldı?',
 'answers': {'answer_start': 720, 'text': 'Paris'}}

There are several important fields here:

- `answers`: the starting character offset of the answer in the context and the answer text.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.


#### Preprocessing

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/distilbert-base-turkish-cased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting
   `return_offsets_mapping=True`.
3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the [sequence_ids](https://huggingface.co/docs/tokenizers/main/en/api/encoding#tokenizers.Encoding.sequence_ids) method to
   find which part of the offset corresponds to the `question` and which corresponds to the `context`.
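
The third step can be sketched on a toy example. Here a simple regex stands in for the real tokenizer (an assumption for illustration only; the actual subword offsets will differ), producing `(start, end)` character offsets like the ones an offset mapping contains.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to the (first, last) token indices covering it."""
    start_token = end_token = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if start_token is None and tok_end > start_char:
            start_token = i  # first token that reaches past start_char
        if tok_start < end_char:
            end_token = i  # last token that begins before end_char
    return start_token, end_token

context = "Beyonce Paris'te hamile kaldı."

# Stand-in tokenization: runs of word characters, or single punctuation marks.
toy_offsets = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", context)]
toy_tokens = [context[s:e] for s, e in toy_offsets]

# Character span of the answer, as a SQuAD-style answer_start would give it.
start_char = context.index("Paris")
end_char = start_char + len("Paris")

first, last = char_span_to_token_span(toy_offsets, start_char, end_char)
print(toy_tokens)
print(first, last, toy_tokens[first:last + 1])
```

The real preprocessing additionally has to skip the question tokens and handle answers that were truncated away, which is what `sequence_ids` is used for.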

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

I recommend checking the videos [here](https://huggingface.co/docs/transformers/tasks/question_answering) to grasp the data format for extractive QA. I based most of this section of the notebook on that tutorial.

In [10]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        #start_char = answer["answer_start"][0]
        #end_char = answer["answer_start"][0] + len(answer["text"][0])
        start_char = answer["answer_start"]
        end_char = answer["answer_start"] + len(answer["text"])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [11]:
#tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
tokenized_squad_tr = squad_tr.map(preprocess_function, batched=True, remove_columns=squad_tr["train"].column_names)

Map: 100%|███████████████████████████| 412/412 [00:00<00:00, 2689.44 examples/s]
Map: 100%|███████████████████████████| 103/103 [00:00<00:00, 2198.97 examples/s]






Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in 🤗 Transformers, the [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator) does not apply any additional preprocessing such as padding.

In [12]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

#### Training

In [13]:
from transformers import AutoModel, AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("dbmdz/distilbert-base-turkish-cased", device_map="cpu")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at dbmdz/distilbert-base-turkish-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As a side note, let's see what Hugging Face means by a "model for question answering". Can you spot the difference from the same checkpoint loaded as a base model?

In [14]:
model

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
     

In [15]:
base_model = AutoModel.from_pretrained("dbmdz/distilbert-base-turkish-cased")

In [16]:
base_model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L
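
The difference is the small head named in the loading warning above: `qa_outputs`, a linear layer that maps each token's 768-dimensional hidden state to two numbers, a start logit and an end logit. A minimal pure-Python sketch of that idea follows; the function name `span_head` and the tiny 3-dimensional vectors are made up for illustration, and the real layer operates on tensors.

```python
def span_head(hidden_states, weight, bias):
    """Map each token vector to a (start, end) logit pair with one linear layer.

    hidden_states: list of token vectors, weight: 2 rows (start, end), bias: 2 values.
    """
    def dot(row, vector):
        return sum(w * x for w, x in zip(row, vector))

    start_logits = [dot(weight[0], h) + bias[0] for h in hidden_states]
    end_logits = [dot(weight[1], h) + bias[1] for h in hidden_states]
    return start_logits, end_logits


# Toy example: 4 tokens with hidden size 3 (the real model uses 768)
hidden_states = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
weight = [[1.0, 0.0, 0.0],   # row producing start logits
          [0.0, 0.0, 1.0]]   # row producing end logits
bias = [0.0, 0.5]

start_logits, end_logits = span_head(hidden_states, weight, bias)
print(start_logits)
print(end_logits)
```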

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. To push the model to the Hub, you could additionally set `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model); the cell below does not, so the model stays local.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [17]:
training_args = TrainingArguments(
    output_dir="test-squad-tr",
    eval_strategy="epoch",  # `evaluation_strategy` is the deprecated name of this argument
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad_tr["train"],
    eval_dataset=tokenized_squad_tr["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)



Let's write a function to peek at what our data looks like at this stage

In [18]:
def input_data_viewer(data):
    """Print the decoded tokens of one example, marking the answer span with <<< >>>."""
    tokens = data["input_ids"]
    # Drop the padding, if the example has any
    if tokenizer.pad_token_id in tokens:
        tokens = tokens[:tokens.index(tokenizer.pad_token_id)]

    # Token indices of the answer span
    start = data["start_positions"]
    end = data["end_positions"]

    for idx, token in enumerate(tokens):
        if idx == start:
            print("<<<", end=" ")
        print(tokenizer.decode(token), end=" ")
        if idx == end:
            print(">>>", end=" ")

In [19]:
input_data_viewer(tokenized_squad_tr["train"][2])

[CLS] Bebek hangi hastanede teslim edildi ? [SEP] 7 Ocak 2012 ' de Bey ##on ##c ##é , ağır güvenlik altında New York ' taki <<< Len ##ox Hill Hastanesi >>> ' nde Blue I ##v ##y Car ##ter ' ı doğur ##du . İki gün sonra Ja ##y Z , çocuklarına adan ##mış bir şarkı olan “ G ##lor ##y ” yi Life ##and ##tim ##es . com ' da yayınladı . Şarkı , Bey ##on ##c ##é ' nin Blue I ##v ##y ' e hamile kalmadan önce uğradığı düşük de dahil olmak üzere çiftin hamilelik mücadele ##lerini ayrıntı ##landırdı . Blue I ##v ##y ' nin çığlık ##ları şarkının sonuna dahil edildi ve resmi olarak “ B . I . C . ” olarak kabul edildi . “ G ##lor ##y ” Hot R & B / Hip - Hop Son ##gs listesine girdiğinde iki günlük olarak Bill ##board listesine giren en genç kişi oldu . [SEP] 

In [20]:
tokenized_squad_tr

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 412
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 103
    })
})

In [21]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,5.73289
2,No log,5.756026
3,No log,5.746717


TrainOutput(global_step=78, training_loss=5.684173583984375, metrics={'train_runtime': 285.7818, 'train_samples_per_second': 4.325, 'train_steps_per_second': 0.273, 'total_flos': 121115423729664.0, 'train_loss': 5.684173583984375, 'epoch': 3.0})
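
The numbers in this output can be checked by hand. With 412 training examples and a batch size of 16, one epoch takes ceil(412 / 16) = 26 optimizer steps, so 3 epochs give the 78 steps reported as `global_step`. The "No log" entries are most likely there because `Trainer` reports the training loss every `logging_steps` steps, which defaults to 500 when it is not set, more than the whole run. The sketch assumes a single device and no gradient accumulation, as in the arguments above.

```python
import math

num_train_examples = 412   # rows in tokenized_squad_tr["train"]
batch_size = 16            # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
logging_steps = 500        # assumed library default, not set in this notebook

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)      # 26 78
print(total_steps >= logging_steps)      # False, so no training loss is logged
```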

#### Inference

In [21]:
question = "SQuAD veriseti ne zaman yayınlandı?"
context = "The Stanford Question Answering Dataset yani SQuAD veriseti 2016 yılında akademik bir kıyaslama veriseti olarak yayınlandı ancak içerdiği basit örnekler eleştirilere sebep oldu"

Tokenize the text and return PyTorch tensors:

In [22]:
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [23]:
import torch

with torch.no_grad():
    outputs = model(**inputs)

Get the highest probability from the model output for the start and end positions:

In [24]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

In [25]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'##uAD veriseti ne zaman yayınlandı? [SEP] The Stanford Question Answerin'
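
The decoded span above starts inside the question and runs across the `[SEP]` token, which is what an undertrained model plus two independent `argmax` calls can produce: nothing forces the end to come after the start or the span to stay short. A common post-processing step is to score every valid (start, end) pair and keep the best one. Below is a minimal pure-Python sketch; `best_span` and `max_answer_len` are illustrative names, not part of the notebook or the library.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best_pair, best_score = None, float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last_end = min(len(end_logits), start + max_answer_len)
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_pair, best_score = (start, end), score
    return best_pair


# Independent argmax would pick start=1 and end=0, which is not a valid span
start_logits = [0.1, 2.0, 0.3]
end_logits = [1.5, 0.2, 1.0]
print(best_span(start_logits, end_logits))  # (1, 2)
```

With the notebook's tensors this could be called as `best_span(outputs.start_logits[0].tolist(), outputs.end_logits[0].tolist())`. Restricting the search to context tokens would additionally keep the answer out of the question.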

## Sequence Classification

The first example of this notebook was about classification. Sentiment analysis is one of the most popular sequence classification tasks. Do you think we can formulate a question answering problem as a sequence classification task?

### Sentiment analysis

[winvoker/turkish-sentiment-analysis-dataset](https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset)

Looking at the sequence classification class of BERT models gives us an idea of how the problem we tried to solve with ancient methods can be solved with language models

In [26]:
from transformers import BertForSequenceClassification

In [27]:
sc_model = BertForSequenceClassification.from_pretrained("dbmdz/bert-base-turkish-cased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dbmdz/bert-base-turkish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
sc_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## Language Modeling

### Encoder Models

Modern encoder models take a natural language input and return a contextualised representation of it.
BERT is (still) the most popular and influential encoder model.

In [34]:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")

Let's see what happens to our text input when passed through an encoder model

In [35]:
%%time
s = "Bir zamanlar BERTten büyük dil model diye bahsedilirdi..."
inputs = tokenizer(s, return_tensors="pt")
outputs = model(**inputs)

CPU times: user 464 ms, sys: 287 ms, total: 752 ms
Wall time: 338 ms


The inputs are familiar at this point

In [36]:
inputs

{'input_ids': tensor([[   2, 2281, 7476,   38, 2864, 1070, 2324, 2368, 3004, 3424, 2636, 7808,
         3061, 2016,   18,   18,   18,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Let's see what the outputs have to offer

In [38]:
outputs.__dict__.keys()

dict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'past_key_values', 'attentions', 'cross_attentions'])

Let's dive into what these are and what they are used for

In [47]:
outputs.last_hidden_state.shape

torch.Size([1, 18, 768])

The last hidden state of BERT has shape [batch_size, num_tokens, hidden_size], so it holds one embedding vector per input token. These are the vectors we used for token classification tasks before.

In [48]:
outputs.pooler_output.shape

torch.Size([1, 768])

The pooler output is the [CLS] token embedding passed through a linear layer and a tanh activation (implementations may differ between BERT variants). It is mostly used as a sentence embedding.
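
In other words, the pooler computes tanh(W · h_CLS + b), where h_CLS is the first row of the last hidden state. That is why every value in the pooled vector lies between -1 and 1. Here is a toy pure-Python sketch with made-up 2-dimensional weights; the function name `pool_cls` is hypothetical, and the real layer uses learned 768 × 768 weights.

```python
import math

def pool_cls(last_hidden_state, weight, bias):
    """Apply a linear layer followed by tanh to the first ([CLS]) token vector."""
    cls_vector = last_hidden_state[0]
    pooled = []
    for row, b in zip(weight, bias):
        pre_activation = sum(w * x for w, x in zip(row, cls_vector)) + b
        pooled.append(math.tanh(pre_activation))
    return pooled


# One sequence of 3 tokens with hidden size 2; only the first token is used
last_hidden_state = [[4.0, -4.0], [0.3, 0.1], [0.7, 0.2]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pooled = pool_cls(last_hidden_state, weight, bias)
print(pooled)  # values close to 1 and -1, since tanh saturates
```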

In [49]:
outputs.pooler_output

tensor([[-3.3975e-02,  9.9932e-01,  8.3639e-02,  1.0548e-04,  9.9713e-01,
          1.0797e-01, -9.9961e-01, -9.9946e-01,  8.3098e-01, -2.0310e-01,
         -1.7842e-01, -2.4855e-01,  6.7193e-01,  7.0742e-02,  1.8804e-01,
         -5.2339e-02, -9.9978e-01, -1.4373e-02,  9.9947e-01,  9.9994e-01,
          2.1633e-01,  2.1890e-03, -9.9988e-01, -2.3523e-01,  2.0793e-01,
         -9.9994e-01,  9.8377e-01, -9.9876e-01,  1.0000e+00, -9.9643e-01,
         -9.9996e-01, -4.0036e-02,  9.8537e-01, -9.7859e-01,  9.9999e-01,
         -9.9997e-01, -9.9908e-01,  3.7955e-02, -1.2385e-01,  9.9600e-01,
          5.3146e-01, -9.4441e-03,  9.9473e-01,  9.9999e-01,  1.0000e+00,
          9.5748e-01,  6.1106e-02,  1.4361e-01, -9.8359e-01, -4.2956e-01,
         -9.8889e-01, -9.2030e-01, -4.9239e-02,  9.9492e-01, -1.1915e-01,
         -4.8750e-02,  1.0000e+00, -1.4786e-03,  5.3366e-03,  1.1981e-01,
         -6.2311e-02, -9.9991e-01,  9.9936e-01,  1.0583e-01,  5.7853e-02,
         -2.5122e-01,  9.9771e-01,  9.

**This is basically a 768-dimensional feature vector. You can use it for the very first problem in this notebook and see how it compares!**

### Encoder - Decoder Models

Encoder-decoder models are mostly used for sequence-to-sequence NLP problems such as translation, summarization, and generative question answering.

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = AutoModelForSeq2SeqLM.from_pretrained("ahmetbagci/bert2bert-turkish-paraphrase-generation")

Config of the encoder: <class 'transformers.models.bert.modeling_bert.BertModel'> is overwritten by shared encoder config: BertConfig {
  "_name_or_path": "dbmdz/bert-base-turkish-cased",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

Config of the decoder: <class 'transformers.models.bert.modeling_bert.BertLMHeadModel'> is overwritten by shared decoder config: BertConfig {
  "_name_or_path": "dbmdz/bert-base-turkish-cased",
  "add_cross_attention": true,
  "attention

In [6]:
model

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

In [8]:
text = "beni benden alırsan seni sana bırakmam"
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

beni benden alırsa seni bırakmam mümkün mü?


### Decoder Models

Decoder models have been all the rage since ChatGPT. Let's look into how they work

In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [10]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")

We are going to look into instruction tuning.

In [18]:
from datasets import load_dataset

ds = load_dataset("BrewInteractive/alpaca-tr")

We know that decoder-only models are autoregressive next-token predictors. Their task is also called "document completion" because they continue writing whatever the input document was.  
But how do models that just produce more of the input acquire dialog capabilities?
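
As a toy sketch of what "autoregressive document completion" means: predict one token, append it to the input, and repeat. The lookup table below is a stand-in for a real language model, and all names (`NEXT_TOKEN`, `complete`) are made up for illustration. A real model conditions on the whole prefix rather than only on the last token.

```python
# Hypothetical "model": for each token, the single most likely next token
NEXT_TOKEN = {
    "once": "upon",
    "upon": "a",
    "a": "time",
    "time": "<eos>",
}

def complete(tokens, max_new_tokens=10):
    """Greedy autoregressive completion: the output of each step is fed back as input."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(complete(["once"]))  # ['once', 'upon', 'a', 'time']
```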

In [13]:
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

def batched_form_prompts(examples):
    """Render every example of a batch as a chat and tokenize it (for `ds.map(..., batched=True)`)."""
    prompts, input_ids = [], []
    for instruction, context, output in zip(examples["instruction"], examples["input"], examples["output"]):
        messages = [{"role": "user", "content": instruction}]
        # Not every example comes with an additional input
        if context:
            messages.append({"role": "context", "content": context})
        messages.append({"role": "assistant", "content": output})
        prompts.append(tokenizer.apply_chat_template(messages, tokenize=False))
        input_ids.append(tokenizer.apply_chat_template(messages, tokenize=True, truncation=True))
    return {"prompts": prompts, "input_ids": input_ids}

In [22]:
ds = ds.map(batched_form_prompts, remove_columns=ds["train"].column_names, batched=True)

Map: 100%|███████████████████████| 45331/45331 [00:35<00:00, 1269.92 examples/s]


So yes, it is still document completion, but the document is written in a very specific format

In [24]:
ds = ds["train"].train_test_split(test_size=0.2)

In [25]:
ds

DatasetDict({
    train: Dataset({
        features: ['prompts', 'input_ids'],
        num_rows: 36264
    })
    test: Dataset({
        features: ['prompts', 'input_ids'],
        num_rows: 9067
    })
})

In [26]:
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

print(tokenizer.apply_chat_template(chat, tokenize=False))


<|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing great. How can I help you today?<|im_end|>
<|im_start|>user
I'd like to show off how chat templating works!<|im_end|>



Every dialog with an instruction-tuned model is serialized into a single string in the background
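
To see that there is no magic in the template, the same string can be produced with a few lines of plain Python. This is a hand-written equivalent of the Jinja template set on the tokenizer above, under the assumption that each message has only `role` and `content`; `render_chat` is a hypothetical helper, not a library function.

```python
def render_chat(messages, add_generation_prompt=False):
    """Serialize a list of {"role", "content"} messages into one string."""
    text = ""
    for message in messages:
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it
        text += "<|im_start|>assistant\n"
    return text


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
print(render_chat(chat, add_generation_prompt=False))
```

At inference time, `add_generation_prompt=True` ends the string with an open assistant turn, so "answering" is again just completing the document.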

**Extras:** What is LoRA, how does it work, and why does it work?

In [27]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [None]:
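from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# GPT-2 packs the query, key and value projections into a single `c_attn` layer
target_modules = ["c_attn"]
config = LoraConfig(
    r=1,
    lora_alpha=16,
    target_modules=target_modules,
    lora_dropout=0.1,
    bias="none",
    fan_in_fan_out=True,  # GPT-2's Conv1D stores weights as (fan_in, fan_out)
    task_type="CAUSAL_LM"
)
# Note: this config is not used below; it only takes effect when passed to from_pretrained(..., quantization_config=...)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Freeze the base weights, then wrap the model with trainable LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_model = get_peft_model(model, config)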

In [None]:
lora_model.print_trainable_parameters()
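
As a rough sanity check on what this call should report, the number of LoRA parameters can be worked out from the shapes in the model printout. LoRA keeps the original weight frozen and learns a low-rank update B·A next to it, where A has shape r × in_features and B has shape out_features × r. The sketch below assumes adapters are attached to the `c_attn` layer (768 → 2304) of all 6 blocks and that no bias terms are trained, as configured above; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


in_features, out_features = 768, 2304   # c_attn: Conv1D(nf=2304, nx=768)
num_blocks = 6                          # (0-5): 6 x GPT2Block
rank = 1                                # r=1 in LoraConfig

per_layer = lora_param_count(in_features, out_features, rank)
frozen_per_layer = in_features * out_features  # weight matrix only, bias ignored

print(per_layer)               # 3072
print(per_layer * num_blocks)  # 18432 trainable parameters in total
print(frozen_per_layer)        # 1769472 frozen weights per c_attn layer
```

So each adapted layer trains 3,072 numbers next to about 1.77 million frozen ones, which is the whole point of the method.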

In [None]:
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2_alpaca_tr",
    eval_strategy="no",
    learning_rate=2e-5,
    weight_decay=0.01,
    use_cpu=True
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=ds["train"], # check what the data needs to look like
    eval_dataset=ds["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)


In [None]:
ds["train"][0]

In [None]:
lora_model.device

In [None]:
trainer.train()

### Inference

In [28]:
prompt = "Somatic hypermutation allows the immune system to"

# Pass the attention mask and pad token explicitly so generate() does not have to guess them
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95, pad_token_id=tokenizer.eos_token_id)


In [29]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Somatic hypermutation allows the immune system to take its place and treat symptoms caused by mutations, but this treatment fails to take into account the role of a genetic mutation, a phenomenon which often leads to severe adverse effects that can have a detrimental impact on the immune system.\n\n\n\n\nThe current study focused on the effect of a combination of genetic modification and an immunomodulatory system on mice with a normal immune system, with an in vitro test to determine the effects on human immune system function in mice of multiple strains and a mouse'