<a href="https://colab.research.google.com/github/rakibulhaque9954/Machine_Learning_Translation/blob/main/Machine_translation_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Acknowledgement

**Based on research by members of Google Brain, Google Research, and the University of Toronto**<br>
Paper Link: https://arxiv.org/pdf/1706.03762.pdf


# Imports

In [2]:
import tensorflow as tf  # models
import numpy as np  # math computations
import matplotlib.pyplot as plt  # plotting
import sklearn  # machine learning library
import cv2  # image processing
from sklearn.metrics import confusion_matrix, roc_curve  # metrics
import seaborn as sns  # visualizations
import datetime
import pathlib
import io
import os
import re
import string
import time
from numpy import random
import tensorflow_datasets as tfds
import tensorflow_probability as tfp
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import (Dense,Flatten,SimpleRNN,InputLayer,Conv1D,Bidirectional,GRU,LSTM,BatchNormalization,Dropout,Input, Embedding,TextVectorization)
from tensorflow.keras.losses import BinaryCrossentropy,CategoricalCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Accuracy,TopKCategoricalAccuracy, CategoricalAccuracy, SparseCategoricalAccuracy
from tensorflow.keras.optimizers import Adam
from google.colab import drive
from google.colab import files
from tensorboard.plugins import projector

# Data Preparation

## Dataset Download

In [None]:
!wget https://www.manythings.org/anki/fra-eng.zip

--2023-10-22 07:43:06--  https://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7757635 (7.4M) [application/zip]
Saving to: ‘fra-eng.zip’


2023-10-22 07:43:09 (4.27 MB/s) - ‘fra-eng.zip’ saved [7757635/7757635]



In [None]:
!unzip "/content/fra-eng.zip" -d "/content/dataset/"

Archive:  /content/fra-eng.zip
  inflating: /content/dataset/_about.txt  
  inflating: /content/dataset/fra.txt  


## Data Preprocessing

In [None]:
text_dataset = tf.data.TextLineDataset("/content/dataset/fra.txt")

In [5]:
VOCAB_SIZE = 20000
ENGLISH_SEQUENCE_LENGTH = 64
FRENCH_SEQUENCE_LENGTH = 64
EMBEDDING_DIM = 300
BATCH_SIZE = 64

In [None]:
english_vectorize_layer = TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=ENGLISH_SEQUENCE_LENGTH
)

In [None]:
french_vectorize_layer = TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=FRENCH_SEQUENCE_LENGTH
)

In [None]:
def selector(input_text):
  split_text = tf.strings.split(input_text,'\t')
  return {'input_1':split_text[0:1],'input_2':'starttoken '+split_text[1:2]},split_text[1:2]+' endtoken'

In [None]:
split_dataset = text_dataset.map(selector)

In [None]:
def separator(input_text):
  split_text = tf.strings.split(input_text,'\t')
  return split_text[0:1],'starttoken '+split_text[1:2]+' endtoken'

In [None]:
init_dataset = text_dataset.map(separator)

In [None]:
for i in split_dataset.take(3):
  print(i)

({'input_1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, 'input_2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken Va !'], dtype=object)>}, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Va ! endtoken'], dtype=object)>)
({'input_1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, 'input_2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken Marche.'], dtype=object)>}, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Marche. endtoken'], dtype=object)>)
({'input_1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, 'input_2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken En route !'], dtype=object)>}, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'En route ! endtoken'], dtype=object)>)
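The output above shows the teacher-forcing layout produced by `selector`: the decoder input is the French sentence prefixed with `starttoken`, and the target is the same sentence followed by `endtoken`, so the target is the decoder input shifted by one word. A minimal plain-Python sketch of the same split follows; `make_training_example` is a hypothetical helper name, and any fields after the first two are simply ignored.

```python
def make_training_example(line):
    """Split one tab-separated line into encoder input, decoder input, and target."""
    fields = line.split('\t')
    english, french = fields[0], fields[1]  # later fields, if any, are ignored
    inputs = {
        'input_1': english,                  # encoder input
        'input_2': 'starttoken ' + french,   # decoder input (shifted target)
    }
    target = french + ' endtoken'            # what the decoder should predict
    return inputs, target


inputs, target = make_training_example('Go.\tVa !')
print(inputs)
print(target)
```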


### Vocabulary Creation

In [None]:
english_training_data=init_dataset.map(lambda x,y:x) # input x,y and output x
english_vectorize_layer.adapt(english_training_data) # adapt the vectorize_layer to the training data

french_training_data=init_dataset.map(lambda x,y:y) # input x,y and output y
french_vectorize_layer.adapt(french_training_data) # adapt the vectorize_layer to the training data
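To make the vocabulary step less opaque, here is a rough plain-Python approximation of what a `TextVectorization` layer does in `int` mode: lowercase, strip punctuation, split on whitespace, and map words to integer ids with 0 reserved for padding and 1 for out-of-vocabulary words. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and `string.punctuation` is only an approximation of the layer's own punctuation rule.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip punctuation (approximates 'lower_and_strip_punctuation')."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))


def build_vocab(sentences, max_tokens):
    """Map the most frequent words to ids starting at 2 (0 = padding, 1 = unknown)."""
    counts = Counter(word for s in sentences for word in standardize(s).split())
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(words)}


def vectorize(text, vocab, length):
    """Convert text to a fixed-length list of ids, truncating or zero-padding."""
    ids = [vocab.get(word, 1) for word in standardize(text).split()][:length]
    return ids + [0] * (length - len(ids))


vocab = build_vocab(['starttoken Va ! endtoken', 'starttoken Marche. endtoken'], max_tokens=20000)
print(vectorize('Va, marche... inconnu!', vocab, length=6))
```

The unknown word maps to id 1 and the tail is filled with zeros, which is the same padding pattern visible in the vectorized dataset output.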

### Grouping and Vectorizing for training

In [None]:
def vectorizer(inputs,output):
  return {'input_1':english_vectorize_layer(inputs['input_1']),
          'input_2':french_vectorize_layer(inputs['input_2'])},french_vectorize_layer(output)

In [None]:
split_dataset

<_MapDataset element_spec=({'input_1': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'input_2': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, TensorSpec(shape=(None,), dtype=tf.string, name=None))>

In [None]:
dataset=split_dataset.map(vectorizer)

In [None]:
for i in split_dataset.take(3):
  print(i)

({'input_1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, 'input_2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken Va !'], dtype=object)>}, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Va ! endtoken'], dtype=object)>)
({'input_1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, 'input_2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken Marche.'], dtype=object)>}, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Marche. endtoken'], dtype=object)>)
({'input_1': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Go.'], dtype=object)>, 'input_2': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'starttoken En route !'], dtype=object)>}, <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'En route ! endtoken'], dtype=object)>)


In [None]:
for i in dataset.take(1):
  print(i)

({'input_1': <tf.Tensor: shape=(1, 64), dtype=int64, numpy=
array([[44,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])>, 'input_2': <tf.Tensor: shape=(1, 64), dtype=int64, numpy=
array([[  2, 103,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]])>}, <tf.Tensor: shape=(1, 64), dtype=int64, numpy=
array([[103,   3,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,

In [None]:
dataset

<_MapDataset element_spec=({'input_1': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None), 'input_2': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None)}, TensorSpec(shape=(None, 64), dtype=tf.int64, name=None))>

In [None]:
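# reshuffle_each_iteration=False keeps the order fixed across epochs, so the
# take/skip split below always sends the same examples to train and validation
dataset = dataset.shuffle(2048, reshuffle_each_iteration=False).unbatch().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)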


In [None]:
dataset

<_PrefetchDataset element_spec=({'input_1': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None), 'input_2': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None)}, TensorSpec(shape=(None, 64), dtype=tf.int64, name=None))>

In [None]:
NUM_BATCHES = int(200000/BATCH_SIZE)

### Dataset Split

In [None]:
train_dataset = dataset.take(int(0.9*NUM_BATCHES))
val_dataset = dataset.skip(int(0.9*NUM_BATCHES))

In [None]:
train_dataset

<_TakeDataset element_spec=({'input_1': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None), 'input_2': TensorSpec(shape=(None, 64), dtype=tf.int64, name=None)}, TensorSpec(shape=(None, 64), dtype=tf.int64, name=None))>

# Modeling

<hr>
<h4>Model Architecture</h4>
<hr>
<img src='https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png'>

***Step Wise Explanation:***
- Input Embedding: The process begins by encoding the input language (e.g., an English sequence) into numerical vectors. Each word or token is transformed into a high-dimensional vector.
- Multi-Head Self-Attention: This is the heart of a transformer. For each word in the input sentence, the model assigns different levels of importance to the other words in the sentence. Multiple attention heads allow the model to focus on different aspects of the sentence simultaneously.
- Positional Encoding: Since transformers have no inherent sense of word order, a positional encoding is added to the word embeddings so the model knows each word's position in the sentence.
- Encoder-Decoder Architecture: In translation tasks, there are typically two parts: the encoder and the decoder. The encoder takes the input sentence and processes it, while the decoder generates the translated output.
- Decoder Self-Attention: The decoder also uses multi-head self-attention, but it is masked so that a position cannot look ahead at later words in the output sentence. Without the mask, the model could simply copy the words it is supposed to predict during training.
- Attention Output: Attention scores determine how much each word in the input sentence contributes to each word in the output sentence. The output of an attention layer is the corresponding weighted sum of value vectors.
- Position-wise Feedforward Networks: After attention, the model passes the data through feedforward neural networks to further process and refine the information.
- Output Layer: The final layer in the decoder produces probabilities for each word in the target language vocabulary, allowing the model to predict the next word in the translation.
- Training and Optimization: Transformers are trained on large parallel corpora of source and target language sentences. They learn to minimize the difference between predicted translations and the actual translations in the training data.
- Repeat for Each Token: This process is repeated for each word in the output sentence, where the previously generated words are used as context for generating the next word.
- Beam Search or Greedy Decoding: During inference, the model generates translations one word at a time. Greedy decoding picks the most likely next word at every step, while beam search keeps several candidate translations and picks the best one at the end.
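The attention step described in the list can be sketched in a few lines of NumPy. This is a single-head illustration of scaled dot-product attention from the paper, not the Keras `MultiHeadAttention` layer used later in the notebook; the function names and the toy input are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (T_q, d), k: (T_k, d), v: (T_k, d_v); mask: (T_q, T_k), 1 = may attend."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask.astype(bool), scores, -1e9)  # block masked positions
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ v, weights                     # weighted sum of values


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 toy tokens, 8 dimensions each
causal = np.tril(np.ones((4, 4)))                   # position i sees positions <= i
context, weights = scaled_dot_product_attention(x, x, x, mask=causal)
print(context.shape)
print(np.round(weights, 2))
```

With the causal mask, the first token can only attend to itself, which is the "no looking ahead" behaviour of decoder self-attention.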

<h4>Inside Attention Layer</h4>
<img src='https://production-media.paperswithcode.com/methods/35184258-10f5-4cd0-8de3-bd9bc8f88dc3.png'>

An easy-to-understand explanation:

Let's break it down and relate it to the components and processes in a transformer model:

- School and Students: Think of the school as the entire context, and the students as the individual tokens in a sequence.
- Vectorization and Tokenization: Converting students into tokens and vectorizing them represents the initial preprocessing steps, where text is split into individual words or tokens and then converted into numerical vector representations.
- Vocabulary: The vocabulary of the school is the set of unique tokens (students) collected from the training data. These tokens are used to represent words in the sequences.
- Intra-Attention (Self-Attention):
  - Each student's interaction with their classmates represents the intra-attention mechanism, where relationships, influences, and context between tokens (students) are captured.
  - Each student becomes a query (Q), and their classmates become keys (K) and values (V).
  - Attention scores are calculated to determine how much weight each student should give to their classmates.
  - Softmax normalization of attention scores can be thought of as grading each student's relationships and influence on others.
  - Concatenating the information from different teachers (heads) captures diverse insights.
- Inter-Attention (Cross-Attention):
  - When one class (the decoder) wants to consult the students (tokens) of another class (the encoder), this is cross-attention between two parts of the model.
  - A student from the decoder class becomes a query (Q), and the students from the encoder class become keys (K) and values (V).
  - The process is similar to intra-attention but operates across different classes.
- Linear Layer: The linear layer represents the post-attention processing step that combines and refines information before producing the final output.

This is the essence of how attention mechanisms work in transformers: tokens (students) attend to each other, calculate their influence, and produce context vectors (mark sheets) for each other. In cross-attention, the encoder's context vectors are then consulted by the decoder's tokens, ultimately leading to the model's final output.
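The roles of queries, keys, values, heads, and the final linear layer in the analogy can be made concrete with a small NumPy sketch. It is an illustration under assumptions (random weights, no biases, no masking, hypothetical function names), not the implementation inside Keras. The same function covers self-attention (both arguments are the same sequence) and cross-attention (queries come from the decoder, keys and values from the encoder).

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def multi_head_attention(query_seq, context_seq, w_q, w_k, w_v, w_o, num_heads):
    """query_seq: (T_q, d_model); context_seq: (T_k, d_model)."""
    t_q, d_model = query_seq.shape
    t_k = context_seq.shape[0]
    head_dim = d_model // num_heads
    # Project, then split the feature axis into heads: (num_heads, T, head_dim)
    q = (query_seq @ w_q).reshape(t_q, num_heads, head_dim).transpose(1, 0, 2)
    k = (context_seq @ w_k).reshape(t_k, num_heads, head_dim).transpose(1, 0, 2)
    v = (context_seq @ w_v).reshape(t_k, num_heads, head_dim).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)    # (heads, T_q, T_k)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                       # (heads, T_q, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(t_q, d_model)   # concatenate the heads
    return concat @ w_o, weights                              # final linear layer


rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
encoder_tokens = rng.normal(size=(5, d_model))   # 5 source tokens
decoder_tokens = rng.normal(size=(3, d_model))   # 3 target tokens so far

self_out, self_w = multi_head_attention(
    encoder_tokens, encoder_tokens, w_q, w_k, w_v, w_o, num_heads)
cross_out, cross_w = multi_head_attention(
    decoder_tokens, encoder_tokens, w_q, w_k, w_v, w_o, num_heads)
print(self_out.shape, cross_out.shape, cross_w.shape)
```

In cross-attention the weight tensor has one row per decoder token and one column per encoder token, for every head.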

***Encoder's Role (Intra-Attention in Encoder):***
- The encoder processes the input sequence and performs intra-attention.
- It produces context vectors (contextual representations) for each word in the input sequence.
- These context vectors capture how each word relates to the others within the input sequence.

***Signaling the Decoder:***
- The decoder is signaled to start generating the output sequence.
- Typically, this is done by giving the decoder an initial input, often a special start token (e.g., `<START>` or `<SOS>`; this notebook uses `starttoken`).

***Generating the First Word:***

For the first word in the output sequence, the decoder combines the following:
- The start token as the initial query.
- The encoder's context vectors, which represent the input sequence.
- The decoder's own self-attention context for the output sequence, which at this point covers only the start token.

These components are used to predict the first word in the output sequence.

***Subsequent Word Predictions:***

For generating subsequent words in the output sequence, the following process occurs:
- The shifted target (the words generated so far) provides the queries.
- The encoder's context vectors, representing the input sequence, are used for context through cross-attention.
- The context vectors for the target words (which include context from the encoder) are also considered.
- The representation of the most recent position, obtained from the decoder's masked self-attention (intra-attention), is used to predict the next word.

These components collectively contribute to the prediction of each subsequent word in the output sequence.

***Iterative Token Generation:***
- The decoder repeats the process of generating tokens one by one, considering context from both the encoder's input sequence and its own generated sequence.
- At each step, the decoder calculates a probability distribution over the vocabulary for the next token and, with greedy decoding, selects the token with the highest probability.

***Ending the Sequence:***
- The process continues until the model generates an end token (e.g., `<END>` or `<EOS>`; this notebook uses `endtoken`) or reaches a predefined maximum sequence length.
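The generation loop described above can be sketched as a greedy decoder in plain Python. Everything here is illustrative: `next_token_scores` stands in for a trained model, `toy_scores` is a hypothetical scripted stand-in, and the ids 2 and 3 for the start and end tokens are an assumption that mirrors the vectorized output shown earlier in this notebook.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Generate token ids one at a time, always taking the highest-scoring token."""
    generated = [start_id]
    for _ in range(max_length - 1):
        scores = next_token_scores(generated)          # one score per vocabulary id
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == end_id:                          # stop at the end token
            break
    return generated


# Hypothetical stand-in for a trained model: a fixed script of "next token" choices.
TOY_TRANSITIONS = {2: 4, 4: 5, 5: 3}


def toy_scores(generated, vocab_size=6):
    scores = [0.0] * vocab_size
    scores[TOY_TRANSITIONS[generated[-1]]] = 1.0
    return scores


print(greedy_decode(toy_scores, start_id=2, end_id=3, max_length=64))
# → [2, 4, 5, 3]
```

Beam search would replace the single `generated` list with several candidate lists and keep the highest-scoring ones at each step.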

## Transformers Architecture

<img src="https://www.mihaileric.com/static/feedforward_layer_and_normalization-dfdcfbd00009f7f99eca73ae29f2dfb7-4ec3a.png">

### Positional Encoding

In [4]:
def positional_encoding(model_size, SEQUENCE_LENGTH): # model_size is d_model in the paper
  output = []
  for pos in range(SEQUENCE_LENGTH):
    PE = np.zeros((model_size)) # initializing with zeros
    for i in range(model_size):
      if i % 2 == 0: # even dimensions use the sine formula from the paper
        PE[i] = np.sin(pos/(10000**(i/model_size)))
      else: # odd dimensions use the cosine formula, sharing the exponent of the preceding even dimension
        PE[i] = np.cos(pos/(10000**((i-1)/model_size)))
    output.append(tf.expand_dims(PE, axis = 0)) # shape (1, model_size) for this position

  out = tf.concat(output, axis=0) # (SEQUENCE_LENGTH, model_size)
  out = tf.expand_dims(out, axis=0) # add a batch axis: (1, SEQUENCE_LENGTH, model_size)
  return tf.cast(out, dtype=tf.float32)

In [14]:
print(positional_encoding(256, 32).shape)

(1, 32, 256)
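The nested Python loops above are easy to read but slow for long sequences. As a possible alternative, the same table can be computed with NumPy broadcasting. This is a sketch that returns a NumPy array rather than a TensorFlow tensor, and `positional_encoding_vectorized` is a hypothetical name, not part of the notebook.

```python
import numpy as np


def positional_encoding_vectorized(model_size, sequence_length):
    """Sinusoidal positional encoding with shape (1, sequence_length, model_size)."""
    positions = np.arange(sequence_length)[:, np.newaxis]   # (seq, 1)
    dims = np.arange(model_size)[np.newaxis, :]             # (1, d)
    # Each sin/cos pair (2k, 2k+1) shares the exponent 2k / model_size.
    exponents = (2 * (dims // 2)) / model_size
    angles = positions / np.power(10000.0, exponents)       # (seq, d)
    pe = np.zeros((sequence_length, model_size), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions
    return pe[np.newaxis, ...]


print(positional_encoding_vectorized(256, 32).shape)
# → (1, 32, 256)
```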


### Input Embeddings

In [3]:
class Embeddings(Layer):
  def __init__(self, sequence_length, vocab_size, embedding_dim):
    super(Embeddings, self).__init__()
    self.token_embeddings = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
    self.sequence_length = sequence_length
    self.vocab_size = vocab_size
    self.embedding_dim = embedding_dim

  def call(self, inputs):
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = positional_encoding(self.embedding_dim, self.sequence_length) # PE adding here
    return embedded_tokens + embedded_positions # final output for inputs

  def compute_mask(self, inputs, mask=None):
    return tf.math.not_equal(inputs, 0) # masking function for checking if there are pad tokens(0)



In [5]:
# testing

test_input = tf.constant([[1, 2, 3, 4, 0, 0, 0]])
embeddings_layer = Embeddings(sequence_length=7, vocab_size=20000, embedding_dim=256)
output_embed = embeddings_layer(test_input)
print(output_embed.shape)
mask = embeddings_layer.compute_mask(test_input)
print(mask)

# output shape: [batch, sequence_length, embedding_dim]
# every input token is mapped to a vector of embedding dimension 256
# the mask is False wherever the input is 0 (a pad token), so those positions can be ignored

(1, 7, 256)
tf.Tensor([[ True  True  True  True False False False]], shape=(1, 7), dtype=bool)


In [6]:
padding_mask = tf.cast(
    tf.repeat(mask,repeats=tf.shape(mask)[1],axis=0),
    dtype=tf.int32)
print(padding_mask)

tf.Tensor(
[[1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0]], shape=(7, 7), dtype=int32)


In [7]:
print(tf.linalg.band_part(
        tf.ones([1,8, 8],dtype=tf.int32),-1,0))




tf.Tensor(
[[[1 0 0 0 0 0 0 0]
  [1 1 0 0 0 0 0 0]
  [1 1 1 0 0 0 0 0]
  [1 1 1 1 0 0 0 0]
  [1 1 1 1 1 0 0 0]
  [1 1 1 1 1 1 0 0]
  [1 1 1 1 1 1 1 0]
  [1 1 1 1 1 1 1 1]]], shape=(1, 8, 8), dtype=int32)
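The two masks printed above are typically combined in a decoder: a position may attend to another only if that position is not padding and is not in the future. A minimal NumPy sketch of the combination follows; `decoder_attention_mask` is a hypothetical helper, and `np.tril` plays the role of `tf.linalg.band_part(..., -1, 0)`.

```python
import numpy as np


def decoder_attention_mask(token_ids, pad_id=0):
    """Return a (batch, T, T) mask where 1 means 'query i may attend to key j'."""
    token_ids = np.asarray(token_ids)
    t = token_ids.shape[1]
    padding = (token_ids != pad_id).astype(np.int32)[:, np.newaxis, :]    # (batch, 1, T)
    causal = np.tril(np.ones((t, t), dtype=np.int32))[np.newaxis, :, :]   # (1, T, T)
    return np.minimum(padding, causal)   # 1 only where both masks allow attention


mask = decoder_attention_mask([[1, 2, 3, 4, 0, 0, 0]])
print(mask[0])
```

Each row is the causal row truncated at the first pad token, so the last three rows all read `[1 1 1 1 0 0 0]`.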


### Encoder Layer

In [16]:
class TransfomerEncoder(Layer):
  def __init__(self, embedding_dims, dense_dims, num_heads):
    super(TransfomerEncoder, self).__init__()
    self.embedding_dims = embedding_dims
    self.dense_dims = dense_dims
    self.num_heads = num_heads
    self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dims)

    self.dense_proj = tf.keras.Sequential([
        Dense(self.dense_dims, activation="relu"),
        Dense(self.embedding_dims),
    ])
    self.layernorm_1 = tf.keras.layers.LayerNormalization()
    self.layernorm_2 = tf.keras.layers.LayerNormalization()
    self.supports_masking = True

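  def call(self, inputs, mask=None):
    print(mask) # debug: mask propagated from the Embeddings layer
    padding_mask = None # no masking when the inputs carry no mask
    if mask is not None:
      mask = tf.cast(mask[:, tf.newaxis, :], dtype='int32') # [batch, 1, T]
      print(mask)
      T = tf.shape(mask)[2]
      padding_mask = tf.repeat(mask, T, axis=1) # [batch, T, T]
      print(padding_mask)

    attention_output = self.attention(query=inputs, value=inputs, key=inputs, attention_mask=padding_mask)

    proj_input = self.layernorm_1(inputs + attention_output) # residual connection + normalization
    proj_output = self.dense_proj(proj_input) # position-wise feedforward network
    return self.layernorm_2(proj_input + proj_output) # second residual connection + normalization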

In [17]:
# test_input = tf.random.uniform((1, 10, 256))
# test_mask = tf.cast(tf.random.uniform((1, 10)) > 0.5, tf.int32)
encoder = TransfomerEncoder(embedding_dims=256, dense_dims=512, num_heads=8)(output_embed)
print(encoder.shape)


tf.Tensor([[ True  True  True  True False False False]], shape=(1, 7), dtype=bool)
tf.Tensor([[[1 1 1 1 0 0 0]]], shape=(1, 1, 7), dtype=int32)
tf.Tensor(
[[[1 1 1 1 0 0 0]
  [1 1 1 1 0 0 0]
  [1 1 1 1 0 0 0]
  [1 1 1 1 0 0 0]
  [1 1 1 1 0 0 0]
  [1 1 1 1 0 0 0]
  [1 1 1 1 0 0 0]]], shape=(1, 7, 7), dtype=int32)
(1, 7, 256)
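The second half of the encoder layer (feedforward projection, residual connection, layer normalization) can be illustrated without Keras. The NumPy sketch below uses random weights and a simplified `layer_norm` that omits the learned scale and offset that `LayerNormalization` has; the function names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward_sublayer(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # like Dense(dense_dims, activation="relu")
    projected = hidden @ w2 + b2            # like Dense(embedding_dims)
    return layer_norm(x + projected)        # residual connection, then normalization


rng = np.random.default_rng(0)
embedding_dims, dense_dims = 8, 16
x = rng.normal(size=(7, embedding_dims))    # 7 tokens
w1 = rng.normal(scale=0.1, size=(embedding_dims, dense_dims))
w2 = rng.normal(scale=0.1, size=(dense_dims, embedding_dims))
out = feed_forward_sublayer(x, w1, np.zeros(dense_dims), w2, np.zeros(embedding_dims))
print(out.shape)
```

The output keeps the input shape, which is what lets encoder layers be stacked.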


### Decoder

(128, 64, 20000)
(128, 64, 1)


### Full Model

In [None]:
### ENCODER ###
input = Input(shape=(ENGLISH_SEQUENCE_LENGTH,), dtype='int64', name='input_1')
encoder = Encoder(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_UNITS)
encoder_output = encoder(input)

### DECODER ###
shifted_target = Input(shape=(FRENCH_SEQUENCE_LENGTH,), dtype='int64', name='input_2')
decoder = Decoder(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_UNITS, FRENCH_SEQUENCE_LENGTH) # initializing initial state of decoder
decoder_output, attention_weightss = decoder(encoder_output, tf.zeros([1, HIDDEN_UNITS]), shifted_target)

### OUTPUT ###
bahdanau_model = Model(inputs=[input, shifted_target], outputs=decoder_output)
bahdanau_model.summary()



Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 64)]                 0         []                            
                                                                                                  
 encoder_8 (Encoder)         (None, 64, 256)              5645312   ['input_1[0][0]']             
                                                                                                  
 input_2 (InputLayer)        [(None, 64)]                 0         []                            
                                                                                                  
 decoder_15 (Decoder)        ((None, 64, 20000),          1078659   ['encoder_8[0][0]',           
                              (None, 64, 1))              3          'input_2[0][0]']       

### BLEU Metric

In [None]:
class BLEU(tf.keras.metrics.Metric):
    def __init__(self,name='bleu_score'):
        super(BLEU,self).__init__(name=name) # pass the name through so Keras logs use it
        self.bleu_score=0

    def update_state(self,y_true,y_pred,sample_weight=None):
      y_pred=tf.argmax(y_pred,-1)
      self.bleu_score=0
      for i,j in zip(y_pred,y_true):
        tf.autograph.experimental.set_loop_options()

        total_words=tf.math.count_nonzero(i)
        total_matches=0
        for word in i:
          if word==0:
            break
          for q in range(len(j)):
            if j[q]==0:
              break
            if word==j[q]:
              total_matches+=1
              j=tf.boolean_mask(j,[False if y==q else True for y in range(len(j))])
              break

        self.bleu_score+=total_matches/total_words

    def result(self):
        return self.bleu_score/BATCH_SIZE
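For reference, the per-sentence quantity that `update_state` accumulates is a clipped unigram precision: each predicted word counts as a match at most as many times as it occurs in the reference. Full BLEU additionally uses higher-order n-grams and a brevity penalty. A plain-Python sketch of the same per-sentence score follows, assuming padding (id 0) only appears at the end of a sequence; `unigram_precision` is a hypothetical helper name.

```python
from collections import Counter


def unigram_precision(predicted_ids, reference_ids, pad_id=0):
    """Fraction of predicted tokens that also occur in the reference, with clipping."""
    predicted = [t for t in predicted_ids if t != pad_id]
    reference = [t for t in reference_ids if t != pad_id]
    if not predicted:
        return 0.0
    ref_counts = Counter(reference)
    # A predicted token is credited at most as often as it appears in the reference.
    matches = sum(min(count, ref_counts[token])
                  for token, count in Counter(predicted).items())
    return matches / len(predicted)


print(unigram_precision([103, 3, 0, 0], [103, 3, 0, 0]))   # perfect match
print(unigram_precision([7, 7, 3, 0], [7, 3, 0, 0]))       # repeated 7 is clipped
```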

In [None]:
bahdanau_model.compile(
    optimizer=Adam(1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),)
    # metrics=[BLEU()],
    # run_eagerly=True)



In [None]:
history = bahdanau_model.fit(
    train_dataset,
    epochs=15,
    validation_data=val_dataset)

Epoch 1/15
Epoch 2/15
 381/2812 [===>..........................] - ETA: 10:09 - loss: 0.3672

# Testing and Evaluation

In [1]:
!pip install transformers tensorflow


In [11]:
import tensorflow as tf


In [14]:
import torch


In [3]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize the GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Encode the user input and create an attention mask
input_text = input('user: ')
input_ids = tokenizer.encode(input_text, return_tensors='pt')
attention_mask = torch.ones(input_ids.shape)

# Generate a response
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=50,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print the generated text
response = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Response:", response)

user: hello
Generated Response: hello, world!

The first thing you'll notice is that the code is a bit more verbose than the previous examples. This is because we're using a lot of the same code, but we're using it in a different order.


In [7]:
# Encode the user input and create an attention mask
input_text = input('user: ')
input_ids = tokenizer.encode(input_text, return_tensors='pt')
attention_mask = torch.ones(input_ids.shape)

# Generate a response
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=50,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print the generated text
response = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Response:", response)

user: fuck ypu
Generated Response: fuck ypu.

I'm not sure if you've noticed, but the last few months have been a bit of a whirlwind for me. I've been working on a lot of different projects, and I've been spending a lot of time
