<a href="https://colab.research.google.com/github/sadrireza/Neural-Networks/blob/main/Synonym%20Suggestion%3A%20T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Synonym Suggestion: T5

#### Section 1: Install Necessary Libraries


In [None]:
!pip install transformers
!pip install nltk
!pip install sentencepiece



#### Section 2: Import Libraries and Download NLTK Data


In [None]:
import nltk
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from google.colab import files
from nltk.corpus import wordnet
import textwrap

nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

#### Section 3: Upload Text File


In [None]:
uploaded = files.upload()

Saving Sample1.txt to Sample1.txt


#### Section 4: Read the Uploaded Text


In [None]:
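# files.upload() returns a dict of {filename: bytes}; take the first uploaded file
filename = next(iter(uploaded))
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()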

#### Section 5: Mask Words in the Text


In [None]:
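import random

def mask_random_words(text, p=0.2):
    """Mask each token independently with probability p.

    Returns the masked text and the list of original tokens that were masked.
    """
    nltk.download('punkt')  # Section 2 already fetched punkt_tab, so this may be redundant
    words = nltk.word_tokenize(text)
    masked_indices = [i for i in range(len(words)) if random.random() < p]
    masked_words = [words[i] for i in masked_indices]
    # Every masked position gets the same sentinel. T5 was pre-trained with a
    # distinct sentinel per masked span (<extra_id_0>, <extra_id_1>, ...).
    for i in masked_indices:
        words[i] = "<extra_id_0>"
    # Joining on spaces leaves punctuation as separate tokens (e.g. "whole ,")
    masked_text = ' '.join(words)
    return masked_text, masked_words

masked_text, masked_words = mask_random_words(text)
print("Masked Text:", masked_text)
print("Masked Words:", masked_words)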

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Masked Text: Human beings are <extra_id_0> of a <extra_id_0> , <extra_id_0> creation <extra_id_0> one <extra_id_0> and <extra_id_0> . If one <extra_id_0> is afflicted with <extra_id_0> , Other <extra_id_0> uneasy will <extra_id_0> <extra_id_0> If you have <extra_id_0> sympathy for human pain , The <extra_id_0> of <extra_id_0> you can not retain .
Masked Words: ['members', 'whole', 'In', 'of', 'essence', 'soul', 'member', 'pain', 'members', 'remain', '.', 'no', 'name', 'human']
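
T5's span-corruption pre-training marks each masked span with its own sentinel (`<extra_id_0>`, `<extra_id_1>`, ...), whereas the cell above reuses `<extra_id_0>` for every mask. The following is a minimal sketch of a variant that numbers the sentinels. The function name `mask_with_unique_sentinels` and its `seed` parameter are hypothetical, not part of this notebook, and the function takes an already tokenized word list so that it runs without NLTK data.

```python
import random

def mask_with_unique_sentinels(words, p=0.2, seed=None):
    """Replace each word with probability p by a numbered T5 sentinel.

    Hypothetical helper for illustration; returns (masked_text, original_words).
    """
    rng = random.Random(seed)
    masked, originals = [], []
    for word in words:
        # The stock T5 tokenizer defines 100 sentinels, <extra_id_0> .. <extra_id_99>,
        # so stop masking once they are used up.
        if rng.random() < p and len(originals) < 100:
            masked.append(f"<extra_id_{len(originals)}>")
            originals.append(word)
        else:
            masked.append(word)
    return " ".join(masked), originals

# p=1.0 masks every word, which makes the numbering easy to see
masked, originals = mask_with_unique_sentinels("Human beings are members".split(), p=1.0)
print(masked)     # <extra_id_0> <extra_id_1> <extra_id_2> <extra_id_3>
print(originals)  # ['Human', 'beings', 'are', 'members']
```

With numbered sentinels, each prediction can later be matched to the position it belongs to, which the single-sentinel version cannot do.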


#### Section 6: Load T5 Model and Predict Synonyms

In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

inputs = tokenizer(masked_text, return_tensors='pt')
output_sequences = model.generate(
    inputs['input_ids'],
    max_length=512,
    num_beams=5,       # Using Beam Search for better prediction
    early_stopping=True
)

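# skip_special_tokens=True also strips the sentinel tokens, so the decoded
# string no longer marks which prediction belongs to which mask
predicted_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
# Whitespace splitting yields single words, not one fill per mask
predicted_words = predicted_text.split()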
print("Predicted Text:", predicted_text)

Predicted Text: of kind of a kind of a kind you can not retain . If you have sympathy for human pain , Other people will not understand you . . . . . . . of . . .
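
When the input uses numbered sentinels and the output is decoded with `skip_special_tokens=False`, the sentinels stay in the decoded string and can be used to recover one span per mask. The sketch below shows one way to parse such a string. The name `parse_sentinel_output` is hypothetical, and the sample string is invented for illustration; it is not output from the model run above.

```python
import re

def parse_sentinel_output(decoded):
    """Map sentinel index -> predicted span for a T5 output string.

    Hypothetical helper; assumes the string was decoded with
    skip_special_tokens=False so that <extra_id_N> markers are present.
    """
    # Drop the padding and end-of-sequence markers
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # With a capturing group, re.split returns
    # [prefix, index_0, span_0, index_1, span_1, ...]
    parts = re.split(r"<extra_id_(\d+)>", cleaned)
    fills = {}
    for idx, span in zip(parts[1::2], parts[2::2]):
        span = span.strip()
        # Keep the first non-empty span seen for each sentinel
        if span and int(idx) not in fills:
            fills[int(idx)] = span
    return fills

# Invented example string in the sentinel format
sample = "<pad> <extra_id_0> members<extra_id_1> whole<extra_id_2></s>"
print(parse_sentinel_output(sample))  # {0: 'members', 1: 'whole'}
```

A span may contain several words, so this keeps each prediction intact instead of splitting it on whitespace.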


#### Section 7: Replace Masked Words with Predicted Synonyms and Display Result

In [None]:
output_text = masked_text
for synonym in predicted_words:
    # Predictions are consumed in order, one word per mask: each pass replaces
    # the first remaining mask with the current word
    output_text = output_text.replace("<extra_id_0>", synonym, 1)

def print_wrapped(text, width):
    for line in textwrap.wrap(text, width=width):
        print(line)

print("Original Text:")
print_wrapped(text, 50)
print("\nModified Text:")
print_wrapped(output_text, 50)

Original Text:
Human beings are members of a whole, In creation
of one essence and soul. If one member is
afflicted with pain, Other members uneasy will
remain. If you have no sympathy for human pain,
The name of human you cannot retain.

Modified Text:
Human beings are of of a kind , of creation a one
kind and of . If one a is afflicted with kind ,
Other you uneasy will can not If you have retain
sympathy for human pain , The . of If you can not
retain .
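
The modified text above is scrambled because predicted words are assigned to masks purely by order. If the masks were numbered and the predictions were available as a mapping from sentinel index to span, substitution could be done by index instead. The sketch below assumes such a mapping; `fill_sentinels` and the `fills` dictionary are hypothetical and do not appear elsewhere in this notebook.

```python
import re

def fill_sentinels(masked_text, fills):
    """Replace each <extra_id_N> in masked_text with fills[N].

    Hypothetical helper; sentinels without a prediction are left in place
    so that missing fills remain visible.
    """
    def substitute(match):
        idx = int(match.group(1))
        return fills.get(idx, match.group(0))
    return re.sub(r"<extra_id_(\d+)>", substitute, masked_text)

# Invented inputs for illustration
masked = "Human beings are <extra_id_0> of a <extra_id_1> ."
fills = {0: "members"}
print(fill_sentinels(masked, fills))  # Human beings are members of a <extra_id_1> .
```

Leaving unmatched sentinels untouched is a design choice for debugging; falling back to the original masked word would be another reasonable option.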
