PART I: Extracting the data

In [2]:
import json
import pandas as pd
from ast import literal_eval
import glob
import csv
import copy
import os

Load the data from the fsm-data folder. The manual analysis does not currently cover the games DidYouCatchTheBall, TheDice, and TangramsRace, so we exclude them.

> Note: adjust the dir variable to match your local folder layout



In [15]:
data = []
dir = "../fsm-data/*.json"
old_files = sorted(glob.glob(dir, recursive=False))
files = []
# Skip the games that have no manual analysis; comparing base names works with both / and \ separators
excluded = {"TheDice_US.json", "DidYouCatchTheBall_US.json", "TangramsRace.json"}
for file in old_files:
  if os.path.basename(file) not in excluded:
    files.append(file)
print(len(files))
for single_file in files:
    with open(single_file, 'r') as f:
        json_file = json.load(f)
        new_string = json.dumps(json_file, indent = 2)
        # print(new_string)
        data.append(json_file)

6


Extract all of the text in each game. The text we consider comes from the 'displayText' attribute of each state in the JSON file.

In [16]:
c = []
for single_file in data:
    temp = single_file['states']
    d = []
    #print('------------------------------')
    for i in temp:
        att = list(i.keys())
        for key in att:
            if (key == 'displayText'):
                s = ''
                text = list(i[key].keys())
                for j in text:
                    s += i[key][j]
                if (not (s == '')):
                    d.append(s)
    c.append(d)
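
To make the expected file structure concrete, here is a minimal sketch on a made-up game. The state names and texts are hypothetical; the only assumption carried over from the loop above is that each file has a 'states' list whose entries may contain a 'displayText' dictionary of strings.

```python
# Hypothetical miniature of one game file; real files contain more attributes.
sample_game = {
    "states": [
        {"name": "start", "displayText": {"line1": "Welcome", "line2": " players"}},
        {"name": "wait"},                      # no displayText: skipped
        {"name": "empty", "displayText": {}},  # empty displayText: skipped
        {"name": "end", "displayText": {"line1": "Game over"}},
    ]
}

def extract_display_text(game):
    """Return one concatenated string per state that has non-empty displayText."""
    texts = []
    for state in game["states"]:
        joined = "".join(state.get("displayText", {}).values())
        if joined:
            texts.append(joined)
    return texts

print(extract_display_text(sample_game))
```

States without display text contribute nothing, so the resulting list can be shorter than the list of states.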

Extract the data from the CSV files in the manual-analysis folder and turn them into pandas DataFrames.


> Note: Each CSV file must have the same name as its JSON file so that both lists sort in the same order (the right input goes with the right output). The number of input files and output files must also be the same.

We also drop some unnecessary columns (criteria that seem unrelated to NLP).



In [22]:
dir_csv = '../manual-analysis/*.csv'
csv_files = sorted(glob.glob(dir_csv, recursive=False))
# Build a new list instead of removing items while iterating, which can skip elements
csv_files = [file for file in csv_files if os.path.basename(file) != "Mortal_Gorilla_Sheet1.csv"]
df_list = (pd.read_csv(file) for file in csv_files)
labels = list(df_list)
for y in labels:
   y.drop(columns=["NAME","Day","Targeted Grade Level","Presence of Teams", "Team Dynamics No Teams", "Team Size", "Number of Teams", "Team Dynamics Between Teams", "Team Dynamics Within Teams", "Drawing Components [Rules]", "Drawing Components [Physical Objects]", "Drawing Components [Physical Space]", "Drawing Components [Timing]", "Drawing Components [Physicality]", "FSMD Components [Rules]", "FSMD Components [Physical Objects]", "FSMD Components [Physical Space]", "FSMD Components [Timing]", "FSMD Components [Physicality]", "Presence of Finite State Machine Diagram", "Output State Representation", "Transition State Representation", "Finite State Machine Diagram Consistency with Specified Rules", "State Consistency (Boxes)", "Transition Consistency (Arrows)", "Finite State Machine Diagram Completion", "States/Boxes", "Transitions/Arrows", "Numbered States", "Evidence of Programming Language Knowledge [Arrow(s) that loop to a previous state]"], inplace = True)
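
Since the pairing of inputs and outputs relies only on sort order, a small sanity check can catch mismatches early. The sketch below is an optional helper, not part of the original pipeline; `check_pairing` and the example file names are hypothetical.

```python
import os

def check_pairing(json_paths, csv_paths):
    """Raise if the counts differ; otherwise return (json_name, csv_name) pairs in sorted order."""
    if len(json_paths) != len(csv_paths):
        raise ValueError(
            f"{len(json_paths)} JSON files but {len(csv_paths)} CSV files"
        )
    pairs = []
    for json_path, csv_path in zip(sorted(json_paths), sorted(csv_paths)):
        pairs.append((os.path.basename(json_path), os.path.basename(csv_path)))
    return pairs

# Hypothetical file names, for illustration only.
pairs = check_pairing(
    ["data/GameB.json", "data/GameA.json"],
    ["labels/GameA.csv", "labels/GameB.csv"],
)
for json_name, csv_name in pairs:
    print(json_name, "<->", csv_name)
```

In this notebook the call would be `check_pairing(files, csv_files)`; printing the pairs makes it easy to see whether each game lines up with its label sheet.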


Keep the input and output as a single DataFrame (may be deleted later)

In [23]:
li = []
for names in csv_files:
  dft = pd.read_csv(names, index_col=None, header=0)
  dft.drop(columns=["NAME","Day","Targeted Grade Level","Presence of Teams", "Team Dynamics No Teams", "Team Size", "Number of Teams", "Team Dynamics Between Teams", "Team Dynamics Within Teams", "Drawing Components [Rules]", "Drawing Components [Physical Objects]", "Drawing Components [Physical Space]", "Drawing Components [Timing]", "Drawing Components [Physicality]", "FSMD Components [Rules]", "FSMD Components [Physical Objects]", "FSMD Components [Physical Space]", "FSMD Components [Timing]", "FSMD Components [Physicality]", "Presence of Finite State Machine Diagram", "Output State Representation", "Transition State Representation", "Finite State Machine Diagram Consistency with Specified Rules", "State Consistency (Boxes)", "Transition Consistency (Arrows)", "Finite State Machine Diagram Completion", "States/Boxes", "Transitions/Arrows", "Numbered States", "Evidence of Programming Language Knowledge [Arrow(s) that loop to a previous state]"], inplace = True)
  li.append(dft)
frame = pd.concat(li, axis=0, ignore_index=True)
x = []
for col in frame.columns:
  if (col != 'Input'):
    x.append(col)

print(x)

['Game Descriptor', 'Content [Counting and Cardinality ]', 'Content [Operations and Algebraic Thinking ]', 'Content [Number and Operations in Base Ten]', 'Content [Number and Operations with Fractions]', 'Content [Measurement and Data]', 'Content [Geometry]', 'Content [Ratio and Proportions]', 'Content [The Number System]', 'Content [Expressions and Equations]', 'Content [Functions]', 'Content [Statistics and Probability ]', 'Progressive Levels', 'Content Adaptability', 'Game Facilitator', 'End-Goal', 'Technological Incorporation', 'Technological Dependency', 'Player Competition (No Teams)', 'Player Collaboration', 'Team Competition', 'Team Collaboration', 'Facilitator Competition', 'Facilitator Collaboration', 'Physicality', 'Physicality Option', 'Sweat Factor', 'Physical Contact', 'Style of Physical Contact', 'Physical Space Diagram', 'Physical Environment', 'If you selected 0 (Unspecified), select one of the following codes related to the size of the environment for gameplay based o

In [24]:
# pip install transformers

In [25]:
import pandas as pd
import torch
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel



Set the hyperparameters, including the maximum input length (in tokens) fed to the model

In [26]:
MAX_LEN = 133
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2 
EPOCHS = 3
LEARNING_RATE = 1e-05
DEVICE = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(DEVICE)

cpu


The following two cells create a single DataFrame that holds the input text and the labels

In [27]:
input = []
for arr in c:
  temp = ''
  for i in arr:
    i+='.'
    temp += i
  input.append(temp)

In [28]:
tempf = []
for u in labels:
  tempt = []
  for k in x:
    t = (u[k].values)[0]
    tempt.append(t)
  tempf.append(tempt)

dataf = {'Input': input, 'Output': tempf}
train_data = pd.DataFrame(dataf)
# print(df)

PART II: Building the model

We define a PyTorch Dataset class called MultiLabelDataset that is used to preprocess text data for multi-label text classification tasks using the DistilBERT model.

We pass our DataFrame to the class, which tokenizes the text in the form BERT expects, generates the attention mask, and returns everything as tensors.

More about the tokenization process: BERT needs the input broken down into smaller tokens, with every input padded (or truncated) to the same length. Each sequence is also wrapped in the special tokens '[CLS]' at the start and '[SEP]' at the end.
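
The idea can be illustrated with a toy encoder. This is a deliberate simplification: it splits on whitespace, whereas the real DistilBERT tokenizer uses WordPiece sub-word tokens and returns integer ids. The function `toy_encode` is hypothetical and only shows the special tokens, padding, and attention mask.

```python
def toy_encode(text, max_len, pad_token="[PAD]"):
    """Whitespace-split toy encoder: add [CLS]/[SEP], truncate, pad, build mask."""
    tokens = text.lower().split()
    tokens = tokens[: max_len - 2]            # leave room for the two special tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    mask = [1] * len(tokens)                  # 1 = real token
    padding = max_len - len(tokens)
    tokens += [pad_token] * padding
    mask += [0] * padding                     # 0 = padding, ignored by attention
    return tokens, mask

tokens, mask = toy_encode("Roll the dice", max_len=8)
print(tokens)
print(mask)
```

Every encoded input has exactly `max_len` entries, which is what allows several inputs to be stacked into one batch tensor.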

In [29]:
class MultiLabelDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len, new_data=False):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.Input
        self.new_data = new_data
        
        if not new_data:
            self.targets = self.data.Output
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

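        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',   # replaces the deprecated pad_to_max_length=True
            truncation=True,        # cut longer texts so every item has exactly max_len tokens
            return_token_type_ids=True
        )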
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        out = {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
        }
        
        if not self.new_data:
            out['targets'] = torch.tensor(self.targets[index], dtype=torch.float)

        return out

Split our data into a training set and a validation set

In [30]:
train_size = 0.7

train_df = train_data.sample(frac=train_size, random_state=123)
val_df = train_data.drop(train_df.index).reset_index(drop=True)
train_df = train_df.reset_index(drop=True)


print("Orig Dataset: {}".format(train_data.shape))
print("Training Dataset: {}".format(train_df.shape))
print("Validation Dataset: {}".format(val_df.shape))

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', truncation=True, do_lower_case=True)
training_set = MultiLabelDataset(train_df, tokenizer, MAX_LEN)
val_set = MultiLabelDataset(val_df, tokenizer, MAX_LEN)

Orig Dataset: (6, 2)
Training Dataset: (4, 2)
Validation Dataset: (2, 2)




In [31]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 8
                }

val_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': False,
               'num_workers': 8
                }

training_loader = DataLoader(training_set, **train_params)

This model is built on top of DistilBertModel, a lighter and faster version of the original BERT model.

When forward() is called, the model turns the inputs into hidden states: the internal representation of a piece of text (a sentence or a document) that the network has learned.

In DistilBERT, every layer produces a hidden state for each input token. Each layer takes the output of the previous layer and produces a new hidden state that captures increasingly complex and abstract features of the text. The hidden state of the last layer at the [CLS] position (index 0 of the sequence) is typically used as a summary of the whole input for classification tasks, and that is what forward() selects with hidden_state[:, 0].

After obtaining the hidden states, the model runs the classifier, a sequence of four modules: a fully connected layer (768 to 768), a ReLU activation, dropout (p = 0.1), and a final fully connected layer with 768 inputs and 54 outputs (one per class). The outputs of this classifier are raw scores (logits) used to predict the labels of the input text.
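
As a rough sketch of how the classifier output could be turned into labels: the notebook does not show a prediction step, so the code below assumes each of the 54 outputs is an independent yes/no label (which is what the binary cross-entropy loss used later suggests). The 0.5 threshold and the helper name `logits_to_labels` are assumptions for illustration.

```python
import math

def logits_to_labels(logits, threshold=0.5):
    """Apply a sigmoid to each logit and mark the label active if it reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p >= threshold else 0 for p in probs]

# Four made-up logits standing in for the 54 outputs of the classifier.
example_logits = [2.0, -1.0, 0.0, -3.5]
print(logits_to_labels(example_logits))
```

Because each label gets its own sigmoid, any number of labels can be active for the same input, unlike a softmax, which would force a single choice.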

Having multiple layers of transformers allows the model to capture increasingly complex and abstract features of the input text, as each layer can build on the representations learned by the previous layer. However, increasing the number of layers can also make the model more prone to overfitting, as it may start to memorize the training data instead of learning general patterns.

In [32]:
class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(768, 768),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.1),
            torch.nn.Linear(768, 54)
        )

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        out = hidden_state[:, 0]
        out = self.classifier(out)
        return out

model = DistilBERTClass()
model.to(DEVICE)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBERTClass(
  (bert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(

The optimizer is the algorithm that adjusts the model's weights and biases during training, based on the gradients of the loss function. Its goal is to minimize the loss and thereby improve the model's predictions.
Adam is a popular optimizer in deep learning. It extends stochastic gradient descent (SGD) by keeping running averages of each parameter's gradient and squared gradient, which gives every parameter its own adaptive step size and often leads to fast convergence, even for large models.
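
To illustrate the kind of update Adam performs, here is a minimal sketch for a single scalar parameter. It follows the standard Adam update rule with the default coefficients of torch.optim.Adam (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) and the learning rate used in this notebook. The function `adam_step` is a hypothetical teaching aid, not how PyTorch implements the optimizer internally.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=2.0, m=m, v=v, t=1)
print(param)
```

On the very first step the update is roughly the learning rate times the sign of the gradient, because the corrected mean and the square root of the corrected squared mean have almost the same magnitude.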



In [33]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

During one epoch, the model receives the entire training dataset, processes it forward and backward through the network, and updates the model parameters. 
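
With the numbers from this notebook the arithmetic is small enough to check by hand. The sketch below assumes the DataLoader default of keeping the last, possibly smaller batch, and plugs in the training split size, TRAIN_BATCH_SIZE, and EPOCHS from above.

```python
import math

n_train = 4            # rows in the training split reported above
batch_size = 4         # TRAIN_BATCH_SIZE
epochs = 3             # EPOCHS

# Number of batches per epoch when the last partial batch is kept
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```

So each epoch here consists of a single batch, and the whole run performs only a handful of parameter updates.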

The train function is called for each epoch and sets the model to training mode using model.train(). It then iterates over the batches in the training_loader, loads the batch data onto the DEVICE, and passes it through the model to obtain the outputs.

The optimizer's gradients are set to zero with optimizer.zero_grad() to prevent accumulation of gradients from previous batches. The loss is computed using the binary_cross_entropy_with_logits function from torch.nn.functional. This function computes the binary cross-entropy loss between the outputs and the targets.

loss.backward() computes the gradients of the binary cross-entropy loss with respect to each model parameter; optimizer.step() then uses these gradients to update the parameters.
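
To show what the loss measures, here is a plain-Python sketch of binary cross-entropy computed directly from logits, averaged over all elements (matching the default 'mean' reduction of the PyTorch function). The helper `bce_with_logits` is hypothetical and written for readability; the training code below uses the PyTorch implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits (numerically stable form)."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# A logit of 0 means "50/50", so the loss is log(2) whatever the target is.
print(bce_with_logits([0.0, 0.0], [1.0, 0.0]))
```

Confident correct predictions (a large positive logit for target 1, a large negative logit for target 0) drive the loss toward zero, while confident wrong predictions are penalized roughly linearly in the logit.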

In [34]:
def train(epoch):
    model.train()
    # Wrap the loader (not the enumerate object) so tqdm knows the number of batches
    for step, data in enumerate(tqdm(training_loader)):
        ids = data['ids'].to(DEVICE, dtype=torch.long)
        mask = data['mask'].to(DEVICE, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(DEVICE, dtype=torch.long)
        targets = data['targets'].to(DEVICE, dtype=torch.float)

        outputs = model(ids, mask, token_type_ids)
        print(outputs)  # debug: raw logits for this batch
        optimizer.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(outputs, targets)
        
        # Log the loss on the first batch and every 5000 batches after that
        if step % 5000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        loss.backward()
        optimizer.step()

for epoch in range(EPOCHS):
    train(epoch)    