<a href="https://colab.research.google.com/github/wangyeye66/projects/blob/main/NLP_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Load the pre-trained model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Initialize the text-generation pipeline with GPT-2
text_generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Input prompt for text generation
prompt = "In a distant future, humans have colonized Mars"

# Generate text based on the input prompt
generated_text = text_generation_pipeline(prompt, max_length=200, num_return_sequences=1, truncation=True)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [2]:
generated_text

[{'generated_text': 'In a distant future, humans have colonized Mars. Their first attempts to return to this planet came in a ship named the "Sawtelle" and the spacecraft was able to capture one of the Sawtees. The ship turned back two colonists, but the Sawtees arrived, demanding access to their home planet.\n\nIn 30 ABY the Sawtelle was used as the base of operations for the colony effort called Yargen. The colonists and others took refuge at the colony hideout at Charka, and in 30 ABY the Sawtelle was discovered in a field, as revealed by Amaya who had been imprisoned in the facility. In 40 ABY Jiralhanae killed most of the Sawtees, and in 41 ABY Amaya tried again, and Amaya killed the Sawtees in order to save what little hope had.'}]

Fine-tune GPT-2 for text generation

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Mental Health Conversational Data
# https://www.kaggle.com/datasets/elvis23/mental-health-conversational-data?resource=download
import json

def preprocess_file(file):
    # Load the intents JSON downloaded from Kaggle
    with open(file, 'r') as f:
        data = json.load(f)

    # Flatten each intent into "User: ..." / "Assistant: ..." lines.
    # Every pattern is followed by all responses of the same intent.
    processed_data = []
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            processed_data.append(f"User: {pattern}\n")
            for response in intent["responses"]:
                processed_data.append(f"Assistant: {response}\n")

    return "".join(processed_data)

def save_file(processed_data, output_file):
    # Write the flattened conversation text used later for fine-tuning
    with open(output_file, 'w') as f:
        f.write(processed_data)

intents_file = "/content/drive/MyDrive/colab data/mental health.json"
output_file = "/content/drive/MyDrive/colab data/mental_health_data.txt"

preprocessed_data = preprocess_file(intents_file)
save_file(preprocessed_data, output_file)
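
To make the training-text format concrete, here is a minimal sketch that applies the same flattening logic to a small in-memory sample instead of the Kaggle file. The sample intent and the helper name `flatten_intents` are invented for illustration; only the `intents`, `patterns`, and `responses` keys come from the cell above.

```python
# Hypothetical sample that mirrors the keys the preprocessing cell reads.
# The real dataset's contents are not reproduced here.
sample = {
    "intents": [
        {
            "tag": "greeting",
            "patterns": ["Hi", "Hello"],
            "responses": ["Hello there.", "Hi, how are you feeling today?"],
        }
    ]
}

def flatten_intents(data):
    """Same flattening logic as the cell above, applied to an in-memory dict."""
    lines = []
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            lines.append(f"User: {pattern}\n")
            # Every response of the intent is written after each pattern.
            for response in intent["responses"]:
                lines.append(f"Assistant: {response}\n")
    return "".join(lines)

flattened = flatten_intents(sample)
print(flattened)
```

Each `User:` line is followed by every response of its intent, so the text contains runs of consecutive `Assistant:` lines. That is a plausible reason the fine-tuned model later produces several `Assistant:` lines in a row.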

In [5]:
%%capture
!pip install datasets
!pip install accelerate -U
!pip install transformers[torch] -U

In [6]:
!pip show transformers accelerate torch

Name: transformers
Version: 4.38.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
---
Name: accelerate
Version: 0.27.2
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: sylvain@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch
Required-by: 
---
Name: torch
Version: 2.1.0+cu121
Summary: Tensors and Dynamic neural networks in Python with strong
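
If only the version numbers are of interest, they can also be read from Python rather than from `pip show`. The following is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Print the versions of the packages this notebook depends on.
for name in ("transformers", "accelerate", "torch"):
    print(name, installed_version(name))
```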

In [7]:
# Training and fine-tuning
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

def fine_tune_gpt2(model_name, train_file, output_dir):
    # Load GPT-2 model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Adjust the tokenizer's pad token
    tokenizer.pad_token = tokenizer.eos_token



    # Load training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128)

    # Create data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False)

    # Set training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=5,
        per_device_train_batch_size=4,
        save_steps=10_000,
        save_total_limit=2,
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()

    # Save the fine-tuned model
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
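
As a rough illustration of what `block_size=128`, `per_device_train_batch_size=4`, and `num_train_epochs=5` imply together, the sketch below chunks a list of token ids into fixed-size blocks and estimates the number of optimizer steps. It rests on several assumptions: that `TextDataset` drops the trailing partial block (which is my understanding of its behavior), that training runs on a single device, and that gradient accumulation is left at its default of 1. The helper names and the 1000-token input are made up for the example.

```python
import math

def chunk_token_ids(token_ids, block_size):
    """Split token ids into consecutive full blocks; the remainder is dropped."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

def estimate_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

# Stand-in for a tokenized training file (hypothetical size).
token_ids = list(range(1000))

blocks = chunk_token_ids(token_ids, block_size=128)
steps = estimate_steps(len(blocks), batch_size=4, epochs=5)
print(len(blocks), steps)
```

With a small training file, the total step count may stay well below `save_steps=10_000`, in which case no intermediate checkpoint would be written during training; the final weights are still stored by the explicit `save_pretrained` calls.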

In [10]:
# Fine-tune
fine_tune_gpt2("gpt2", "/content/drive/MyDrive/colab data/mental_health_data.txt", "./text_generation_mental_health_model")



Step,Training Loss


In [13]:
# Load the fine-tuned model and tokenizer
model_path = '/content/text_generation_mental_health_model'
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

# Initialize the pipeline for text generation
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Input prompt
prompt = "How to deal with anxiety?"

# Generate text
generated_texts = text_generator(prompt, max_length=200, num_return_sequences=1, truncation=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [14]:
generated_texts

[{'generated_text': "How to deal with anxiety?\nAssistant: I'd suggest working on your daily mental health and wellbeing.\nUser: What are the symptoms of depression?\nAssistant: Depression usually lasts for a week or more and lasts for 5 to 7 days.\nAssistant: Depression usually begins right before the start of school. In addition to irritability, mood swings and irritability are common. People with depression tend to have hard thoughts and feelings which go away within 5 to 10 days.\nUser: How do individuals cope with stress?\nAssistant: When dealing with stress there are many ways to cope with the stress in your life. You can take steps to alleviate the situation by being open-minded, listening to your feelings and not acting on them. Similarly, taking some time out can help and might help you go about your daily tasks. It is especially helpful to find help if you are not feeling well.\nAssistant: Stress is a serious and difficult life event. It affects the way you understand"}]
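
The raw output continues well past the first answer and invents further `User:` turns. One possible way to keep only the first reply is to post-process the string, as in this minimal sketch. The helper `first_assistant_reply` is hypothetical and not part of `transformers`; it only assumes the `User:` / `Assistant:` line format used in the training text.

```python
def first_assistant_reply(generated, prompt):
    """Return the first 'Assistant:' line after the prompt, or None if absent."""
    # Drop the echoed prompt so only the model's continuation is scanned.
    continuation = generated[len(prompt):] if generated.startswith(prompt) else generated
    for line in continuation.splitlines():
        if line.startswith("User:"):
            break  # the model opened a new turn before answering
        if line.startswith("Assistant:"):
            return line[len("Assistant:"):].strip()
    return None

# Shortened copy of the generated text shown above.
sample_output = (
    "How to deal with anxiety?\n"
    "Assistant: I'd suggest working on your daily mental health and wellbeing.\n"
    "User: What are the symptoms of depression?\n"
)

reply = first_assistant_reply(sample_output, "How to deal with anxiety?")
print(reply)
```

Formatting the prompt like the training data, for example `"User: How to deal with anxiety?\nAssistant:"`, might also steer the model toward a single direct answer, though that is untested here.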