# Fine-Tuning a GPT-2 Model for a Soccer Data Chatbot

This notebook demonstrates how to fine-tune a GPT-2 model on a custom soccer dataset using the Hugging Face `transformers` library.


In [1]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling


## 1. Load Pre-trained Model and Tokenizer
First, we load the pre-trained GPT-2 model and its tokenizer (the `openai-community/gpt2` checkpoint) with the Hugging Face `transformers` library.


In [2]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")


## 2. Prepare the Dataset
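Next, we prepare the training data. `TextDataset` reads the dataset file, tokenizes the full text, and splits the resulting token ids into fixed-size blocks of `block_size` tokens. Note that `TextDataset` is deprecated in recent `transformers` releases in favor of the `datasets` library.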


In [3]:
# Define a function to load and tokenize the dataset
def load_dataset(file_path, tokenizer, block_size=128):
    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size,
    )
    return dataset

# Path to your dataset file
file_path = "soccer_data"

# Load dataset
train_dataset = load_dataset(file_path, tokenizer)


Token indices sequence length is longer than the specified maximum sequence length for this model (6465 > 1024). Running this sequence through the model will result in indexing errors
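
The warning above appears because the whole file is tokenized in one call (6,465 tokens here, more than GPT-2's 1,024-token limit) before it is cut into `block_size` chunks, so no sequence of that length is actually passed to the model during training. The following is a minimal pure-Python sketch of that chunking step, not the library's own code; `chunk_token_ids` is a hypothetical helper, and dropping the trailing partial block is an assumption about the behavior.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a flat list of token ids into consecutive blocks of exactly block_size."""
    # Starts beyond this point would produce a block shorter than block_size
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]

# Stand-in for the 6,465 token ids reported in the warning above
token_ids = list(range(6465))
blocks = chunk_token_ids(token_ids)
print(len(blocks), len(blocks[0]))  # 50 128
```

Under these assumptions the file yields 50 training examples of 128 tokens each, and the last 65 tokens are discarded.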


## 3. Data Collator for Language Modeling
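We use a data collator to assemble training batches. With `mlm=False` it prepares batches for causal language modeling, using a copy of the input ids as the labels. Because every block produced in the previous step has the same length, no padding is needed here.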


In [4]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)
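
To make the effect of `mlm=False` concrete: the labels are a copy of the input ids, and the model shifts them by one position internally when it computes the next-token loss. Below is a rough pure-Python sketch of that idea. `make_causal_lm_batch` is a hypothetical helper written for illustration, not part of `transformers`; the value -100 is the index that the loss function ignores.

```python
def make_causal_lm_batch(examples, pad_token_id=None):
    """Stack equal-length token-id lists and build labels for causal LM training."""
    if len({len(ex) for ex in examples}) != 1:
        raise ValueError("this sketch expects equal-length examples (fixed-size blocks)")
    input_ids = [list(ex) for ex in examples]
    # Labels start as a copy of the inputs; padded positions, if any, are masked with -100
    labels = [
        [-100 if pad_token_id is not None and tok == pad_token_id else tok for tok in ex]
        for ex in examples
    ]
    return {"input_ids": input_ids, "labels": labels}

batch = make_causal_lm_batch([[10, 11, 12], [20, 21, 22]])
print(batch["labels"])  # [[10, 11, 12], [20, 21, 22]]
```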


## 4. Define Training Arguments
We need to specify the training arguments, such as the output directory, number of training epochs, batch size, and save steps.


In [5]:
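# Training arguments
training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints and logs are written
    overwrite_output_dir=True,       # reuse the directory if it already exists
    num_train_epochs=3,              # passes over the training set
    per_device_train_batch_size=2,   # examples per device per step
    save_steps=10_000,               # checkpoint interval; this run ends at step 75, so none are written
    save_total_limit=2,              # keep at most two checkpoints on disk
    prediction_loss_only=True,       # return only the loss during evaluation/prediction
)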
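
Before launching a run, it can help to estimate how many optimizer steps these settings produce, since `save_steps` only has an effect if the run reaches that many steps. The following is a minimal sketch of the arithmetic, assuming a single device and no gradient accumulation. `total_optimizer_steps` is a hypothetical helper, and the figure of 50 training examples is an assumption (roughly 6,465 tokens split into 128-token blocks).

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate the number of optimizer steps for a single-device training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(num_examples=50, batch_size=2, num_epochs=3)
print(steps)            # 75
print(steps >= 10_000)  # False: the save_steps interval is never reached
```

With so few steps, a smaller `save_steps` value would be needed to get intermediate checkpoints; otherwise the model has to be saved explicitly after training.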


## 5. Initialize Trainer
We will initialize the `Trainer` class with the model, training arguments, data collator, and training dataset.


In [6]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)


## 6. Fine-tune the Model
Now, we will start the training process to fine-tune the GPT-2 model on our custom dataset.


In [7]:
# Fine-tune the model
trainer.train()



TrainOutput(global_step=75, training_loss=1.4893733723958333, metrics={'train_runtime': 328.7756, 'train_samples_per_second': 0.456, 'train_steps_per_second': 0.228, 'total_flos': 9798451200000.0, 'train_loss': 1.4893733723958333, 'epoch': 3.0})
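
A cross-entropy loss is often easier to interpret as perplexity, which is the exponential of the loss. The short sketch below converts the `training_loss` reported above. Note that this value is averaged over the training run rather than measured on held-out data, so it says little about how well the model generalizes.

```python
import math

training_loss = 1.4893733723958333  # value copied from the TrainOutput above
perplexity = math.exp(training_loss)
print(f"{perplexity:.2f}")  # about 4.43
```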

## 7. Save the Fine-tuned Model
Finally, we save the fine-tuned model and tokenizer to `./fine_tuned_model`.


In [8]:
# Save the model and tokenizer
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")


('./fine_tuned_model\\tokenizer_config.json',
 './fine_tuned_model\\special_tokens_map.json',
 './fine_tuned_model\\vocab.json',
 './fine_tuned_model\\merges.txt',
 './fine_tuned_model\\added_tokens.json',
 './fine_tuned_model\\tokenizer.json')

# Using the Fine-Tuned GPT-2 Model

In this section, we load the fine-tuned model and tokenizer and use them to generate responses to input questions.


In [9]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_model")
model = AutoModelForCausalLM.from_pretrained("./fine_tuned_model")


## Generate Response from the Model
We will define a function to generate a response from the model given an input question.


In [15]:
# Define a function to generate a response from the model
def generate_response(question, model, tokenizer, max_length=50):
    # Tokenize the question; this returns both input_ids and attention_mask
    inputs = tokenizer(question, return_tensors="pt")
    # Generate a continuation; max_length counts the prompt tokens as well
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
    # Decode the full sequence (prompt plus continuation) back to text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Example question
question = "Who won the last World Cup?"

# Generate response
response = generate_response(question, model, tokenizer)



In [24]:
# The response is plain text, not valid JSON, so print it as-is rather than parsing it
print(response)

Who won the last World Cup? Who won the last World Cup?"},
   {"input_text": "Who is the most successful manager in the history of the Premier League?", "response": "The most successful manager in the history of
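
The output above starts by repeating the question, because `generate` returns the prompt followed by the continuation, and it then drifts into what looks like the record format of the training file. One way to tidy this up is a small post-processing step. The sketch below is only an illustration: `extract_answer` is a hypothetical helper, and the stop markers are guesses based on the sample output, not a documented format.

```python
def extract_answer(response, question, stop_markers=('"}', "\n")):
    """Drop the echoed prompt, then cut the text at the earliest stop marker."""
    text = response[len(question):] if response.startswith(question) else response
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()

# Shortened copy of the output printed above
sample = 'Who won the last World Cup? Who won the last World Cup?"},\n   {"input_text": "Who is'
print(extract_answer(sample, "Who won the last World Cup?"))
# Who won the last World Cup?
```

In this case the cleaned text is just the question again, which suggests that the model has learned the surface format of the file more than the answers themselves.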
