- Experimented with generative GPT models (GPT-2 here) to compute the probabilities the model assigns to candidate words when completing idiomatic sentences
- Concluded that the model handled the majority of the simplest idioms and completed them meaningfully

In [2]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

In [3]:
import numpy as np

In [4]:
# Load pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = TFGPT2LMHeadModel.from_pretrained(model_name)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [5]:

# Input text for prediction
input_text = "fine if you are so clever why dont you tell me what to"
input_ids = tokenizer.encode(input_text, return_tensors="tf")

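# Run the model; logits has shape (batch, sequence_length, vocab_size)
logits = model(input_ids)[0]
# Greedy choice: the highest-scoring token at the last position
predicted_token_id = tf.argmax(logits[0, -1, :])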

# Decode the predicted token
predicted_word = tokenizer.decode(predicted_token_id.numpy())

print("Predicted Next Word:", predicted_word)

Predicted Next Word:  do
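
The cell above keeps only the single most likely token. A minimal sketch of how the top few candidates and their probabilities could be listed instead is shown here. It is written against a plain NumPy vector so that it runs on its own; `top_k_next_tokens` and `toy_logits` are hypothetical names, and in the notebook the vector would presumably be `logits[0, -1, :].numpy()`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def top_k_next_tokens(logits, k=5):
    """Return the k most probable (token_id, probability) pairs, best first."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    top_ids = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in top_ids]

# Toy logits standing in for the last-position logits of the model (assumption).
toy_logits = np.array([2.0, 0.5, 3.0, -1.0])
for token_id, prob in top_k_next_tokens(toy_logits, k=2):
    print(token_id, round(prob, 3))
```

With real model output, each returned token ID could be passed to `tokenizer.decode([token_id])` to see the candidate text.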


In [6]:
input_sentence = "its raining"
input_ids = tokenizer.encode(input_sentence, return_tensors="tf")

# Generate next word predictions
logits = model(input_ids)[0]

# Get the predicted probabilities for the last token
predicted_probs = tf.nn.softmax(logits[:, -1, :], axis=-1)[0]

# Get sorted list of token IDs and their corresponding probabilities
sorted_token_probs = sorted(enumerate(predicted_probs), key=lambda x: x[1], reverse=True)

In [7]:
len(sorted_token_probs)

50257

In [8]:
target_word = " down"
for token_id, prob in sorted_token_probs:
    predicted_word = tokenizer.decode([token_id])
    if predicted_word == target_word:
        print(f"Word: {predicted_word} | Probability: {prob}")
        break

Word:  down | Probability: 0.4680662751197815
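
Because the probability vector is indexed by token ID, the probability of a single-token word can also be read off directly, without sorting and decoding all 50257 entries. The sketch below illustrates this on a toy distribution and also reports the token's rank. `prob_and_rank` and `toy_probs` are hypothetical names. In the notebook the ID would presumably come from `tokenizer.encode(" down")[0]`, assuming the word maps to exactly one token.

```python
import numpy as np

def prob_and_rank(probs, token_id):
    """Return (probability, rank) of token_id; rank 1 is the most likely token."""
    probs = np.asarray(probs)
    p = float(probs[token_id])
    # Rank = number of tokens with strictly higher probability, plus one.
    rank = int(np.sum(probs > p)) + 1
    return p, rank

# Toy distribution over a 4-token vocabulary (assumption, not model output).
toy_probs = np.array([0.1, 0.6, 0.05, 0.25])
p, rank = prob_and_rank(toy_probs, token_id=3)
print(p, rank)
```

The rank is a useful companion to the raw probability: a word can have a small probability and still be among the model's top few choices.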


In [10]:
input_sentence = "its raining cats and "
input_word = " dogs,"  # The word or phrase you want to check

# Tokenize the input sentence and the word separately
input_ids = tokenizer.encode(input_sentence, return_tensors="tf")
input_word_ids = tokenizer.encode(input_word, return_tensors="tf")

# Append the word's tokens to the sentence and run the model on the combined text.
# Note: the logits at the last position now describe the token that FOLLOWS
# the appended word, not the word itself.
input_ids = tf.concat([input_ids, input_word_ids], axis=-1)
logits = model(input_ids)[0]

# Get the predicted probabilities at the last position
predicted_probs = tf.nn.softmax(logits[:, -1, :], axis=-1)[0]

# Token ID of the first token of the input word (the word may span several tokens)
input_word_token_id = input_word_ids[0, 0]

# Probability of that first token at the last position, i.e. after sentence + word
input_word_prob = predicted_probs[input_word_token_id].numpy()

print("Input Sentence:", input_sentence)
print("Input Word:", input_word)
print("Probability of the Input Word:", input_word_prob)


Input Sentence: its raining cats and 
Input Word:  dogs,
Probability of the Input Word: 0.003573629
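
The value above is read from the position after the word has already been appended, so it is not the probability of the word as a continuation of the sentence. One common way to score a continuation that may span several tokens is the chain rule: score each token against the distribution predicted before it is appended, then sum the log-probabilities. The sketch below is a minimal illustration under stated assumptions. `continuation_log_prob`, `toy_next_logits`, and the bigram table are hypothetical; with the notebook's model, `next_logits_fn` could wrap a call such as `model(tf.constant([ids]))[0][0, -1, :].numpy()`.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def continuation_log_prob(next_logits_fn, context_ids, continuation_ids):
    """Sum log P(token | everything before it) over the continuation tokens."""
    ids = list(context_ids)
    total = 0.0
    for token_id in continuation_ids:
        # Distribution predicted BEFORE the token is appended to the context.
        log_probs = log_softmax(next_logits_fn(ids))
        total += float(log_probs[token_id])
        ids.append(token_id)  # extend the context only after scoring
    return total

# Hypothetical toy model over a 3-token vocabulary: the next-token
# distribution depends only on the last token in the context.
TOY_BIGRAM_LOGITS = np.log(np.array([
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
]))

def toy_next_logits(ids):
    return TOY_BIGRAM_LOGITS[ids[-1]]

log_p = continuation_log_prob(toy_next_logits, context_ids=[0], continuation_ids=[1, 2])
print(np.exp(log_p))  # P(1 | 0) * P(2 | 1) for the toy table
```

When applying this to GPT-2, it may also help to drop the trailing space from the sentence, since the candidate word already carries its own leading space.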
