<a href="https://colab.research.google.com/github/stanleykywu/ml-intro/blob/main/Primitive_ChatGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating your own Chat-GPT(ish)

## Dependencies

Run the installation code block ONLY if you are running this notebook in Google Colab. Running it locally is harmless, but you shouldn't need to, since all packages should already be installed.

In [None]:
%pip install torch
%pip install transformers

### Import necessary functions

In [6]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

## Training Dataset/Textual Corpus

Define a textual corpus to help your model answer a specific question. In general, the more relevant text you provide, the better your model will perform. For comparison, something like ChatGPT is reportedly trained on upwards of 45TB of textual data.

In [20]:
corpus = r"""
A transformer is a deep learning model that adopts the
mechanism of attention, differentially weighing the
significance of each part of the input data. It is used
primarily in the field of natural language processing
(NLP) and in computer vision (CV).

Like recurrent neural networks (RNNs), transformers are 
designed to handle sequential input data, such as natural 
language, for tasks such as translation and text 
summarization. However, unlike RNNs, transformers do not
necessarily process the data in order. Rather, the 
attention mechanism provides context for any position in 
the input sequence. 
"""
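
One practical caveat on "more text": BERT checkpoints like the one used in this notebook accept at most 512 tokens per input, so a much longer corpus would have to be split into pieces and each piece asked separately. The sketch below shows one way to cut a text into overlapping windows. It is a rough illustration, not part of the original notebook: `split_into_windows` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for real tokens.

```python
def split_into_windows(text, window_size=100, stride=50):
    """Split text into overlapping windows of words.

    Assumption: words are only a rough proxy for the tokens a real
    tokenizer would produce, so leave headroom below the model limit.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    words = text.split()
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(" ".join(words[start:start + window_size]))
        # Stop once the current window reaches the end of the text
        if start + window_size >= len(words):
            break
        start += stride
    return windows


# Tiny demonstration with ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_into_windows(sample, window_size=4, stride=2)
for chunk in chunks:
    print(chunk)
```

The overlap (a stride smaller than the window) keeps an answer that straddles a boundary intact in at least one window.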

## Importing a Fine-Tuned Tokenizer and Model

Rather than training a model from scratch, we load an existing language model: BERT, from Google, in a version that has already been fine-tuned on the SQuAD question-answering dataset. That makes this particular checkpoint good at answering question prompts.

In [21]:
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

## Asking our own ChatGPT a question

In [None]:
question = "How do transformers work?"
print(f"The question:\n{question}")

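Here, we package our textual corpus and question together and feed them to our model. The model outputs a start score and an end score for every token; we take the highest-scoring start and end positions, then decode the tokens between them and return the result as text.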

In [23]:
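# Tokenize the question and corpus together as a single input pair
inputs = tokenizer(question, corpus, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# One score per token for where the answer starts and where it ends
ans_start_scores = outputs.start_logits
ans_end_scores = outputs.end_logits

# Pick the most likely start and end tokens (+1 because slicing excludes the end)
ans_start = torch.argmax(ans_start_scores).item()
ans_end = torch.argmax(ans_end_scores).item() + 1

# Convert the selected token ids back into a readable string
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[ans_start:ans_end]))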
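
To make the "packaging" step in the cell above more concrete, here is a toy sketch of the layout a BERT tokenizer produces for a question and corpus pair: `[CLS] question [SEP] corpus [SEP]`, with segment ids marking which part each token belongs to. This is an illustration only. `pack_question_and_context` is a hypothetical helper, and the whitespace splitting is a stand-in for the WordPiece tokenization the real tokenizer performs.

```python
def pack_question_and_context(question_tokens, context_tokens):
    """Lay out tokens the way BERT expects a sentence pair.

    Returns the token list plus segment ids: 0 for [CLS], the question
    and its [SEP]; 1 for the context and the final [SEP].
    """
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


# Assumption: lowercase whitespace splitting stands in for real tokenization
toy_question = "how do transformers work?".lower().split()
toy_context = "attention provides context".lower().split()

packed_tokens, packed_segments = pack_question_and_context(toy_question, toy_context)
for token, segment in zip(packed_tokens, packed_segments):
    print(f"{segment}  {token}")
```

Because the question and corpus share one sequence, the answer positions the model predicts are indices into this combined token list, which is why the notebook slices `input_ids` directly.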

In [None]:
print(f"The question:\n-------------------------\n{question}\n")
print(f"The answer:\n-------------------------\n{answer}")
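
The notebook picks the start and end positions independently with `argmax`, which is simple but can occasionally select an end that comes before the start and so yield an empty answer. The sketch below shows a common alternative, not what the notebook itself does: score every valid span (start no later than end, bounded length) and keep the best one. It uses made-up scores on plain Python lists rather than real model output, and `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair, end inclusive, with the highest
    combined score among spans where start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up tokens and scores, chosen so the independent argmax picks go wrong
toy_tokens = ["[CLS]", "what", "?", "[SEP]", "attention", "provides", "context", "[SEP]"]
toy_start = [0.1, 0.0, 0.0, 0.0, 2.0, 0.5, 5.0, 0.0]
toy_end = [0.0, 0.0, 0.0, 0.0, 0.2, 4.0, 3.0, 0.0]

# Independent argmax: the end (5) lands before the start (6), giving an empty slice
naive_start = toy_start.index(max(toy_start))
naive_end = toy_end.index(max(toy_end)) + 1
naive_answer = toy_tokens[naive_start:naive_end]

# Joint search only considers spans where start <= end
span_start, span_end = best_span(toy_start, toy_end)
joint_answer = toy_tokens[span_start:span_end + 1]

print("independent argmax:", naive_answer)
print("joint search:", joint_answer)
```

The nested loop is quadratic in the sequence length, which is fine for a few hundred tokens; the length cap keeps the search from returning implausibly long answers.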