
# 1. About the Project

This project introduces students to AI-driven **text generation** using pre-trained models from the **Transformers library**. Working in Google Colab, students **generate** coherent, contextually relevant text from a prompt.

# 2. Transformers Library

The Transformers library, developed by Hugging Face, is a **popular open-source library** for natural language processing (NLP). It provides easy access to a wide range of **pre-trained models** built on the Transformer architecture. Transformer models are powerful because they use attention mechanisms to process and **generate text**: each token can weigh every other token in the input, which lets the model handle long-range dependencies and context effectively. The library supports various NLP tasks, such as text generation, translation, and summarization.
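To make the idea of attention more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a teaching toy, not how the library implements it: real Transformer models add learned projection matrices, multiple attention heads, and (for GPT-2) a causal mask. The function names `softmax` and `attention` and the toy vectors are assumptions made for illustration.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy input: three 2-dimensional "token" vectors attending to one another
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
print(result)
```

Each output vector mixes information from all three input vectors, with larger weights on the inputs most similar to the query. That mixing is what lets a token draw on context far away in the sequence.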

In [None]:
!pip install transformers



# 3. Tokenization

Tokenization is a crucial step in natural language processing that involves **breaking down text** into smaller units called tokens. These tokens can be words, subwords, or characters. Each token is then mapped to an integer ID, which is the format the model actually works with. For example, the sentence "Hello, world!" might be tokenized into ["Hello", ",", "world", "!"]. In the Transformers library, each pre-trained model comes with its own matching tokenizer; GPT-2, used below, splits text into subword units with byte-level byte-pair encoding.
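The "Hello, world!" example can be reproduced with a small word-and-punctuation tokenizer. This is only a sketch of the idea: `simple_tokenize` is a hypothetical helper, and the toy vocabulary built here is not the one GPT-2 uses.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']

# Map each distinct token to an integer ID, as a model vocabulary would
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(ids)  # [2, 1, 3, 0]
```

A real tokenizer works the same way in principle, but its vocabulary is fixed in advance and learned from a large corpus, so unfamiliar words are broken into smaller known pieces instead of being added as new entries.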

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch


# 4. Load Pre-trained Model and Tokenizer

In [None]:
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)



In [None]:
def generate_text(prompt, max_length=100):
    # GPT-2 has no padding token of its own, so reuse the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token

    # Encode the prompt without padding: right-padding a single prompt would make
    # GPT-2 continue from the pad tokens. Prompts longer than 50 tokens are truncated.
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=50)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Generate a continuation; max_length counts prompt tokens plus generated tokens
    outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=max_length, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)

    # Decode the generated token IDs back into text
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text


# 5. Text Generation

In [None]:
prompt = input("Text you want to continue: ")
generated_text = generate_text(prompt)
print(generated_text)


Text you want to continue: once upon a time
once upon a time

The world is a place of great beauty

And the world is a place of great fear

And the world is a place of great fear

And the world is a place of great fear

And the world is a
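The repeated lines in the output above are typical of greedy decoding, where the model always picks the single most likely next token and can get stuck in a loop. One common remedy is to sample from the model's predictions instead. The sketch below shows one possible variant; `generate_sampled` and `SAMPLING_KWARGS` are hypothetical names, and the specific values are illustrative starting points rather than tuned settings.

```python
# Decoding settings passed to model.generate (values are assumptions, not tuned)
SAMPLING_KWARGS = dict(
    do_sample=True,          # sample from the distribution instead of taking the top token
    top_k=50,                # consider only the 50 most likely next tokens
    top_p=0.95,              # ...further restricted to the smallest set covering 95% probability
    no_repeat_ngram_size=2,  # never produce the same pair of tokens twice
)

def generate_sampled(model, tokenizer, prompt, max_new_tokens=60):
    # Encode the prompt; the tokenizer returns input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens counts only generated tokens, not the prompt
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        **SAMPLING_KWARGS,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the model and tokenizer loaded earlier, this would be called as `generate_sampled(model, tokenizer, "once upon a time")`. Because sampling is random, each run produces a different continuation.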
