# Transfer Learning with GPT2 Experiments

I'm going to track some initial experiments in applying transfer learning to a pretrained GPT-2 model and using its output for Automatic Short Answer Grading. In a meeting on 4/28, Mingyu and I played with the nanoGPT repo and got a good sense of how it works, so the changes we need to make seem fairly clear. The steps I am going to try to complete in this notebook are as follows:

1. Figure out how to load a pretrained model from a checkpoint file. Mingyu trained one for about 3 days on a server he has access to, so that is what we want to start with.
2. Make sure the weights of the existing layers are frozen, so that we don't have to accumulate gradients and make updates on the ~10 million (I think?) weights in the nanoGPT model.
3. Download the edited Mohler dataset and split it into training, validation, and testing sets.
4. Modify the network to prepare for transfer learning. The first step is to add one new layer to the network; I'll start with a simple MLP, built the same way Karpathy does it. The current forward function then needs one extra step to pass the output of the last transformer layer through this extra MLP. We are going to attempt the rest in two separate ways and compare which works best:
    a. In the forward function, split the input into two separate chunks, one for the desired response and one for the student response. Feed each response separately into the transformer, and save the final hidden state that results for each. Use a PyTorch comparison or vector norm operation to measure the similarity of the two vectors, and use this similarity metric to produce the actual output.
    b. Concatenate the question, the desired answer, and the student answer, with separator tokens in between, into one single input tensor. Feed this input into the forward function, which will be modified again to feed the output of the last transformer layer into our new MLP.
5. We want to compare against the papers we were looking at, so we want to compute the root mean squared error (RMSE).
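
Steps 1 and 2, plus the new-layer part of step 4, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the model used in this notebook: `TinyBackbone` is a hypothetical stand-in for the pretrained transformer, `ScoreHead` is an assumed name for the extra MLP, and the checkpoint is assumed to be a dict whose `"model"` entry holds the state dict. A fake checkpoint is written first so the example runs without the real file.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for the pretrained transformer."""

    def __init__(self, vocab_size=50257, n_embd=16):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, idx):
        # (batch, time) token ids -> (batch, time, n_embd) hidden states
        return self.proj(self.wte(idx))


class ScoreHead(nn.Module):
    """Small MLP (Linear -> GELU -> Linear) mapping a hidden state to one score."""

    def __init__(self, n_embd):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * n_embd, 1)

    def forward(self, h):
        return self.c_proj(self.gelu(self.c_fc(h))).squeeze(-1)


# Write a fake checkpoint (assumed layout: {"model": state_dict}).
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
torch.save({"model": TinyBackbone().state_dict()}, ckpt_path)

# Step 1: load the checkpoint and copy the weights into a fresh model.
checkpoint = torch.load(ckpt_path, map_location="cpu")
backbone = TinyBackbone()
backbone.load_state_dict(checkpoint["model"])

# Step 2: freeze every pretrained weight so no gradients are kept for them.
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

# Step 4: only the new head is handed to the optimizer.
head = ScoreHead(n_embd=16)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

idx = torch.randint(0, 50257, (2, 8))  # dummy batch: 2 sequences of 8 tokens
hidden = backbone(idx)                 # shape (2, 8, 16)
pred = head(hidden[:, -1, :])          # last position's hidden state -> shape (2,)
print(pred.shape)
```

Passing only `head.parameters()` to the optimizer is a second safeguard on top of `requires_grad = False`: even if a backbone weight were left unfrozen by mistake, it would not be updated.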

In [72]:
import pandas as pd
import numpy as np

data = pd.read_csv("mohler_dataset_edited.csv")

print(f"the size of the full dataset is {data.shape}")

# Split the DataFrame into training, validation, and testing sets
train, validate, test = np.split(data.sample(frac=1, random_state=42), [int(.7*len(data)), int(.9*len(data))])

# Print the sizes of the resulting sets
print("Training set size: ", len(train))
print("Validation set size: ", len(validate))
print("Testing set size: ", len(test))

print(train.head())

the size of the full dataset is (2273, 7)
Training set size:  1591
Validation set size:  454
Testing set size:  228
        id                                           question   
2204  12.8        What is the Euler tour traversal of a tree?  \
1320   8.3       How can you implement a stack with an array?   
859    5.3  What is the number of operations for insertion...   
408    2.7                 What is the role of a header-file?   
629    3.7  What are the similarities between iteration an...   

                                         desired_answer   
2204  A walk around the tree, starting with the root...  \
1320  Keep the top of the stack toward the end of th...   
859   N (the length of the array) operations achieve...   
408   To store a class interface, including data mem...   
629   They both involve repetition; they both have t...   

                                         student_answer  score_me   
2204  it starts node on the left of the root and the...       2.5  \


In [39]:
import tiktoken

# We want to transform this data into (X, Y) pairs
# Each X will be the tuple (question, desired_answer, student_answer)
# Y will be the corresponding score_avg
# Define a list of column names to select
selected_cols = ['question', 'desired_answer', 'student_answer']

# Same process function as in nanoGPT's prepare.py for the openwebtext dataset; I'm guessing we want the format to be the same.
# This version takes in the text directly
# and outputs the encoded token ids
enc = tiktoken.get_encoding("gpt2")
def process(text):
    ids = enc.encode_ordinary(text) # encode_ordinary ignores any special tokens
    ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
    # note: I think eot should be prepended not appended... hmm. it's called "eot" though...
    return ids

# Select the columns we care about in the dataframe;
# the process function is applied to each element of these columns in the next cell
x_df = train[selected_cols]
print(x_df.head())

                                               question   
2204        What is the Euler tour traversal of a tree?  \
1320       How can you implement a stack with an array?   
859   What is the number of operations for insertion...   
408                  What is the role of a header-file?   
629   What are the similarities between iteration an...   

                                         desired_answer   
2204  A walk around the tree, starting with the root...  \
1320  Keep the top of the stack toward the end of th...   
859   N (the length of the array) operations achieve...   
408   To store a class interface, including data mem...   
629   They both involve repetition; they both have t...   

                                         student_answer  
2204  it starts node on the left of the root and the...  
1320  Make an array, make the bottom at spot 0, make...  
859   theta(n) the best case senario is that everyth...  
408   Allow compiler to recognize the classes when u...  
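
For option 4b, the three encoded columns have to be joined into one sequence. The sketch below shows one way to do that on plain lists of token ids. Several things here are assumptions rather than decisions made in this notebook: `build_input` is a hypothetical helper, the end-of-text id 50256 is reused as the separator, the limit of 1024 tokens is the GPT-2 context length, and the ids in the example are short illustrative lists.

```python
EOT = 50256        # GPT-2 end-of-text id, reused here as the separator (an assumption)
BLOCK_SIZE = 1024  # GPT-2 context length


def build_input(question_ids, desired_ids, student_ids, block_size=BLOCK_SIZE):
    """Join three token-id lists as: question EOT desired EOT student EOT."""
    ids = []
    for part in (question_ids, desired_ids, student_ids):
        # Drop a trailing EOT if one was already appended, then add exactly one.
        if part and part[-1] == EOT:
            part = part[:-1]
        ids.extend(part)
        ids.append(EOT)
    if len(ids) > block_size:
        raise ValueError(f"sequence of {len(ids)} tokens exceeds block size {block_size}")
    return ids


example = build_input([2061, 318, 30, EOT], [32, 2513, EOT], [270, 4940])
print(example)
# [2061, 318, 30, 50256, 32, 2513, 50256, 270, 4940, 50256]
```

Raising on over-long inputs, rather than silently truncating, keeps a cut-off student answer from being graded as if it were complete.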


In [76]:
# prepare.py has some fancy concatenation step that writes everything to a file; it's a little confusing.
# Our dataset is small, so I'm not going to worry about that and will create tensors directly.
import torch

first_row = x_df.iloc[0]
print(first_row)
# encode every cell of the three text columns into a list of token ids
encoded_dataframe = x_df.applymap(process)  # note: pandas >= 2.1 prefers DataFrame.map

first_row = encoded_dataframe.iloc[0]
print(first_row)

X_tuples = []
for index, row in encoded_dataframe.iterrows():
    # convert each encoded column to an int64 tensor
    question_tensor = torch.tensor(row['question'], dtype=torch.int64)
    desired_answer_tensor = torch.tensor(row['desired_answer'], dtype=torch.int64)
    student_answer_tensor = torch.tensor(row['student_answer'], dtype=torch.int64)
    X_tuples.append((question_tensor, desired_answer_tensor, student_answer_tensor))

# sanity check to make sure this worked correctly
# the three elements should be the encoded question, desired_answer, and student answer. Each should end with the end token
print(f"First element of first tuple of x tuples: {X_tuples[0][0]}")
print(f"Second element of first tuple of x tuples: {X_tuples[0][1]}")
print(f"Third element of first tuple of x tuples: {X_tuples[0][2]}")

print(f"In total we have {len(X_tuples)} tuples in the training set")

question                What is the Euler tour traversal of a tree?
desired_answer    A walk around the tree, starting with the root...
student_answer    it starts node on the left of the root and the...
Name: 2204, dtype: object
question          [2061, 318, 262, 412, 18173, 4205, 33038, 282,...
desired_answer    [32, 2513, 1088, 262, 5509, 11, 3599, 351, 262...
student_answer    [270, 4940, 10139, 319, 262, 1364, 286, 262, 6...
Name: 2204, dtype: object
First element of first tuple of x tuples: tensor([ 2061,   318,   262,   412, 18173,  4205, 33038,   282,   286,   257,
         5509,    30, 50256])
Second element of first tuple of x tuples: tensor([   32,  2513,  1088,   262,  5509,    11,  3599,   351,   262,  6808,
           11,   810,  1123, 10139,   318,  1775,  1115,  1661,    25,   422,
          262,  1364,    11,   422,  2174,    11,   422,   262,   826,    13,
        50256])
Third element of first tuple of x tuples: tensor([  270,  4940, 10139,   319,   262,  1364,   286
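
Because the tuples keep the desired answer and the student answer as separate tensors, option 4a can be sketched directly on top of them. This is only a sketch under stated assumptions: `encoder` is a hypothetical frozen embedding table standing in for the transformer, the sequence vector is a mean of embeddings rather than a real final hidden state, cosine similarity is one possible choice of comparison, and the 0 to 5 score range is assumed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
MAX_SCORE = 5.0  # assumed grading scale of 0 to 5

# Hypothetical frozen encoder standing in for the pretrained transformer.
encoder = torch.nn.Embedding(50257, 16)
encoder.requires_grad_(False)


def sequence_vector(ids):
    """Summarize a 1-D tensor of token ids as one vector (mean of embeddings here)."""
    return encoder(ids).mean(dim=0)


def similarity_score(desired_ids, student_ids):
    """Map cosine similarity in [-1, 1] linearly onto [0, MAX_SCORE]."""
    cos = F.cosine_similarity(
        sequence_vector(desired_ids), sequence_vector(student_ids), dim=0
    )
    return (cos + 1.0) / 2.0 * MAX_SCORE


desired = torch.tensor([32, 2513, 1088, 262, 5509, 50256])
student = torch.tensor([270, 4940, 10139, 319, 262, 50256])
score = similarity_score(desired, student)
same = similarity_score(desired, desired)  # identical answers give the top score
print(float(score), float(same))
```

In the real model, the similarity could also be fed into the new MLP instead of being rescaled by hand, so that the mapping from similarity to score is learned.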

In [74]:
# now get the average scores, which will be our y values
Y = np.array(train['score_avg'])
Y.shape

(1591,)
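
With the targets in `Y`, the RMSE from step 5 is the square root of the mean squared difference between predicted and true scores. A minimal sketch follows; `rmse` is a hypothetical helper, and the arrays are made-up values for illustration, not results from this notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays of scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up scores: the errors are -1, 0, 2, 0, so the mean squared error is 1.25.
y_true = np.array([2.5, 5.0, 4.0, 3.0])
y_pred = np.array([3.5, 5.0, 2.0, 3.0])
print(rmse(y_true, y_pred))  # approximately 1.118
```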