# Step 1: Install Libraries

This step installs the necessary libraries for fine-tuning a language model. This includes `unsloth` for potentially faster training, `datasets` for handling the dataset, `transformers` for model loading and training, and `accelerate` for distributed training and mixed precision. The code then verifies that `unsloth` and `datasets` are installed.

In [None]:
# Step 1: Install necessary libraries including Unsloth

!pip install --quiet unsloth
!pip install --quiet datasets transformers accelerate

# Step 2: Check if the installation was successful
try:
    import unsloth
    print("Unsloth installed successfully.")
except ImportError:
    print("Unsloth installation failed.")

# Verify that the datasets library is installed
from datasets import load_dataset
print("Datasets library installed successfully.")

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.1/47.1 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m278.6/278.6 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m27.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m375.8/375.8 kB[0m [31m23.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m152.4/152.4 kB[0m [31m12.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.1/162.1 kB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.5/31.5 MB[0m [31m28.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m865.2/865.2 MB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━

# Step 2: Check System Compatibility

This step verifies that the necessary libraries and hardware are available and compatible for fine-tuning.

## Checks Performed:

*   **CUDA Availability:** Checks if a CUDA-enabled GPU is available.
*   **CUDA Version:** If CUDA is available, it displays the CUDA version.
*   **PyTorch Version:** Displays the installed PyTorch version.
*   **Transformers Version:** Displays the installed Hugging Face Transformers library version.
*   **CUDA Toolkit:** Attempts to verify the presence of the CUDA toolkit using `nvcc`.

> Ensuring compatibility of these components is crucial for successful GPU-accelerated training.

If any of these checks indicate an issue, you may need to install or update the respective libraries or ensure you have a suitable runtime environment with GPU support.

In [None]:
import torch
import transformers
import subprocess

# Check CUDA availability and version
cuda_available = torch.cuda.is_available()
cuda_version = torch.version.cuda if cuda_available else "CUDA not available"

# Check PyTorch version
pytorch_version = torch.__version__

# Check Transformers version
transformers_version = transformers.__version__


# Output compatibility check
print(f"CUDA Available: {cuda_available}")
print(f"CUDA Version: {cuda_version}")
print(f"PyTorch Version: {pytorch_version}")
print(f"Transformers Version: {transformers_version}")
# print(f"Unsloth Installed: {unsloth_installed}")

# Additional step to verify system libraries
print("\nVerifying system libraries...")
try:
    subprocess.check_output(['nvcc', '--version'])
    print("CUDA toolkit is installed.")
except subprocess.CalledProcessError:
    print("CUDA toolkit not found, please install it for GPU support.")

CUDA Available: True
CUDA Version: 12.6
PyTorch Version: 2.7.0+cu126
Transformers Version: 4.52.4

Verifying system libraries...
CUDA toolkit is installed.


# Step 3: Load and preprocess the dataset from the uploaded JSON file

This step loads the dataset from a specified JSON file, processes it into a format suitable for training with the Hugging Face `datasets` library, and splits it into training and validation sets.

## Steps Performed:

*   **Load JSON Data:** Reads the data from the JSON file located at the specified `dataset_path`.
*   **Convert to Hugging Face Dataset:** Transforms the loaded JSON data into a `Dataset` object, mapping the 'complex\_data\_science\_result' and 'human\_storytelling\_narrative' fields.
*   **Display Sample Entries:** Prints the first two entries of the created dataset to verify correct loading and structure.
*   **Split Dataset:** Divides the dataset into training (80%) and validation (20%) sets for model training and evaluation.
*   **Display Dataset Sizes:** Prints the number of examples in both the training and validation sets.

This prepared dataset is now ready for tokenization and use in training a language model.

In [None]:
# Step 3: Load and preprocess the dataset from the uploaded JSON file

from datasets import Dataset
import json

# Path to the uploaded JSON file in Google Colab
dataset_path = "/content/drive/MyDrive/A_Manual_Dataset/DataScience_results_with_humanly.json"

# Load the dataset from the JSON file
with open(dataset_path, 'r') as f:
    data = json.load(f)

# Convert the loaded data into the format that Hugging Face Dataset expects
dataset = Dataset.from_dict({
    'complex_data_science_result': [entry['complex_data_science_result'] for entry in data],
    'human_storytelling_narrative': [entry['human_storytelling_narrative'] for entry in data]
})

# Display first 2 entries to confirm correct loading
print("First 2 entries of the dataset:")
print(dataset[:2])

# Split the dataset into training and validation sets (80% training, 20% validation)
train_dataset = dataset.train_test_split(test_size=0.2)['train']
val_dataset = dataset.train_test_split(test_size=0.2)['test']

print("\nTraining and Validation Dataset Sizes:")
print(f"Training Dataset: {len(train_dataset)}")
print(f"Validation Dataset: {len(val_dataset)}")


First 2 entries of the dataset:

Training and Validation Dataset Sizes:
Training Dataset: 87
Validation Dataset: 22


# Step 4: Load Pre-trained Model and Tokenizer

This step loads a pre-trained language model and its corresponding tokenizer from the Hugging Face Transformers library.

## Model Selection:

*   **Model Name:** We are using the "google/flan-t5-base" model, which is a good general-purpose sequence-to-sequence model suitable for fine-tuning on various text generation tasks. The "base" version is chosen for potentially better performance on limited datasets.

## Loading Process:

1.  **Tokenizer:** The `AutoTokenizer.from_pretrained()` method is used to load the tokenizer associated with the specified model. The tokenizer is essential for converting text data into a format that the model can understand (tokens).
2.  **Model:** The `AutoModelForSeq2SeqLM.from_pretrained()` method is used to load the pre-trained model weights. `AutoModelForSeq2SeqLM` is specifically designed for sequence-to-sequence tasks like text generation.

After this step, the pre-trained model and tokenizer are ready to be used for further processing and fine-tuning.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a Flan-T5 base model for potentially better performance on limited data
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

print("Model and tokenizer loaded successfully.")

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Model and tokenizer loaded successfully.


# Step 5: Tokenize the Dataset and Create Labels

This step prepares the loaded and split dataset for model training by tokenizing the text data and creating the necessary input and output formats.

## Steps Performed:

1.  **Define Tokenization Function:** A function `tokenize_function` is defined to apply the tokenizer to both the input (`complex_data_science_result`) and target (`human_storytelling_narrative`) columns of the dataset.
2.  **Tokenize Inputs and Targets:** Inside the function, the tokenizer is used to convert the text in both columns into numerical tokens. `truncation=True` and `padding="max_length"` ensure that all sequences have the same length, which is required for batch processing.
3.  **Create 'labels' Column:** The tokenized target sequences (`targets['input_ids']`) are assigned to a new column named 'labels'. This is the standard format expected by Hugging Face Transformers models for training, where the model learns to predict these target tokens.
4.  **Apply Tokenization:** The `tokenize_function` is applied to both the training and validation datasets using the `.map()` method with `batched=True` for efficiency.
5.  **Set Dataset Format:** The format of the datasets is set to 'torch', specifying the columns that should be included in the PyTorch tensors used for training (`input_ids`, `attention_mask`, and `labels`).

After this step, the datasets are in a format suitable for feeding into the language model for fine-tuning.

In [None]:
# Tokenize the dataset and create 'labels' column

# Tokenize the dataset (training and validation sets)
def tokenize_function(examples):
    # Tokenize both the input and target columns (complex data science result and human narrative)
    inputs = tokenizer(examples['complex_data_science_result'], truncation=True, padding="max_length", max_length=512)
    targets = tokenizer(examples['human_storytelling_narrative'], truncation=True, padding="max_length", max_length=512)

    # Return the tokenized inputs and the target labels
    inputs['labels'] = targets['input_ids']
    return inputs

# Apply tokenization to the training and validation datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

print("Dataset tokenized and labels created successfully.")


Map:   0%|          | 0/87 [00:00<?, ? examples/s]

Map:   0%|          | 0/22 [00:00<?, ? examples/s]

Dataset tokenized and labels created successfully.


# Installing Unsloth from GitHub (Alternative Method)

Although `unsloth` was initially installed in Step 1 using a standard `pip install`, sometimes installing directly from the GitHub repository can be beneficial, especially in environments like Google Colab.

This command specifically installs `unsloth` from the main branch of the official GitHub repository:

In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-xxppt5hh/unsloth_98f8e2fe922c4fc0aeec0043c0a746d1
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-xxppt5hh/unsloth_98f8e2fe922c4fc0aeec0043c0a746d1
  Resolved https://github.com/unslothai/unsloth.git to commit f14d7e1542b14c347815c353fed99b1982285930
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


# Configure LoRA for efficient training using Hugging Face and Peft

This step configures Low-Rank Adaptation (LoRA) for efficient fine-tuning of the language model. LoRA is a technique that allows for training only a small number of additional parameters, significantly reducing memory usage and training time while achieving comparable performance to fine-tuning the entire model.

## Steps Performed:

1.  **Import Libraries:** Imports necessary classes from `transformers` and `peft` for model handling and LoRA configuration.
2.  **LoRA Configuration:** Defines a `LoraConfig` object with specific parameters:
    *   `target_modules`: Specifies which layers of the model to apply LoRA to (common linear layers like 'q', 'k', and 'v' in attention mechanisms).
    *   `r`: The rank of the update matrices, controlling the number of trainable parameters.
    *   `lora_alpha`: A scaling factor for the LoRA updates.
    *   `lora_dropout`: The dropout rate applied to the LoRA layers.
    *   `bias`: Specifies whether to train bias parameters (set to "none" here).
    *   `task_type`: Specifies the task type, which is "SEQ\_2\_SEQ\_LM" for sequence-to-sequence models like Flan-T5.
3.  **Apply LoRA:** Wraps the base model with the `PeftModelForSeq2SeqLM` class and the defined `lora_config`. This creates a PEFT model where only the LoRA parameters will be trained.

After this step, the model is ready for efficient fine-tuning using the LoRA technique.

In [None]:
# Step 3: Configure LoRA for efficient training using Hugging Face and Peft

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, PeftModelForSeq2SeqLM

# The model is already loaded in a previous cell (cell dzH-Eqgy0yYX)

# LoRA configuration
lora_config = LoraConfig(
    target_modules=['q', 'k', 'v'], # These are common target modules for T5/Flan-T5
    r=8,  # Rank for the low-rank decomposition
    lora_alpha=16,  # Scaling factor
    lora_dropout=0.1,  # Dropout to prevent overfitting
    bias="none",
    task_type="SEQ_2_SEQ_LM" # Specify task type for Seq2Seq models like T5/Flan-T5
)

# Apply LoRA to the model using PeftModelForSeq2SeqLM
peft_model = PeftModelForSeq2SeqLM(model, lora_config)

print("LoRA configuration completed successfully using Hugging Face and Peft.")

LoRA configuration completed successfully using Hugging Face and Peft.


# Step 6: Define Training Arguments

This step defines the configuration and hyperparameters for the training process using the `TrainingArguments` class from the Hugging Face Transformers library. These arguments control various aspects of how the model will be trained and evaluated.

## Key Arguments:

*   **`output_dir`**: Specifies the directory where the training results, including saved models and logs, will be stored.
*   **`eval_strategy`**: Determines when evaluation is performed during training (e.g., after every epoch).
*   **`learning_rate`**: Sets the learning rate for the optimizer, controlling the step size during model updates.
*   **`per_device_train_batch_size`**: Defines the batch size for training on each device (e.g., GPU). This is reduced here for potentially larger models or limited memory.
*   **`per_device_eval_batch_size`**: Defines the batch size for evaluation on each device, also reduced.
*   **`num_train_epochs`**: Sets the total number of training epochs. This is increased significantly here, likely due to the small dataset size, to allow the model more time to learn.
*   **`weight_decay`**: Applies weight decay to the optimizer, a regularization technique to prevent overfitting.
*   **`save_strategy`**: Determines when the model checkpoints are saved during training (e.g., after every epoch).
*   **`logging_steps`**: Specifies how often training progress (like loss) is logged.
*   **`load_best_model_at_end`**: If set to `True`, the best performing model based on the specified metric will be loaded at the end of training.
*   **`metric_for_best_model`**: Defines the metric used to determine the "best" model (e.g., the lowest evaluation loss).

These arguments are crucial for customizing and controlling the fine-tuning process to achieve desired results.

In [None]:
# Define the training arguments

from transformers import TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",              # Directory to store results
    eval_strategy="epoch",         # Evaluate after every epoch
    learning_rate=1e-4,                  # Learning rate
    per_device_train_batch_size=2,       # Batch size for training - Reduced for larger model/memory
    per_device_eval_batch_size=2,        # Batch size for evaluation - Reduced for larger model/memory
    num_train_epochs=50,                 # Number of epochs to train - Increased significantly due to small dataset
    weight_decay=0.01,                   # Weight decay for optimization
    save_strategy="epoch",               # Save the model after every epoch
    logging_steps=10,                    # Log every 10 steps
    load_best_model_at_end=True,         # Load the best model after training
    metric_for_best_model="eval_loss",   # Metric to evaluate the best model
)

print("Training arguments set up successfully.")

Training arguments set up successfully.


# Initialize the Trainer and Start Training

This step initializes the Hugging Face `Trainer` and starts the fine-tuning process for the language model using the configured PEFT model, training arguments, and datasets.

## Steps Performed:

1.  **Disable PyTorch Compilation:** Disables PyTorch compilation, which can sometimes cause issues or be unnecessary in certain environments.
2.  **Initialize Data Collator:** Creates a `DataCollatorForSeq2Seq` which is responsible for batching and preparing the tokenized data for the sequence-to-sequence model during training.
3.  **Initialize Trainer:** Instantiates the `Trainer` class, providing it with:
    *   The PEFT model (`peft_model`) with LoRA adapters.
    *   The defined training arguments (`training_args`).
    *   The training dataset (`train_dataset`).
    *   The validation dataset (`val_dataset`).
    *   The data collator (`data_collator`).
4.  **Start Training:** Calls the `trainer.train()` method to begin the fine-tuning process. This will iterate through the training epochs, perform forward and backward passes, update model weights (only the LoRA adapters), and evaluate the model periodically based on the `training_args`.

After this step, the model will be fine-tuned on the provided dataset, and the training progress will be displayed.

In [None]:
# Initialize the Trainer and Start Training

from transformers import Trainer, DataCollatorForSeq2Seq
import torch

# Disable PyTorch compilation
torch._dynamo.config.disable = True

# Initialize Data Collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=peft_model)


# Initialize Trainer
trainer = Trainer(
    model=peft_model,  # Using the model with LoRA
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator, # Use the data collator
)

# Start training
trainer.train()

print("\nTraining complete. The fine-tuned model is saved.")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 87 | Num Epochs = 50 | Total steps = 2,200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,327,104/248,904,960 (0.53% trained)


Epoch,Training Loss,Validation Loss
1,37.0178,34.434879
2,26.7726,23.680811
3,17.046,12.94425
4,6.645,4.980194
5,4.6337,4.52558
6,4.3888,4.106547
7,3.7739,2.861538
8,3.2765,2.223405
9,2.5559,1.627771
10,2.3238,1.420379



Training complete. The fine-tuned model is saved.


# Save the fine-tuned model

This step saves the fine-tuned model's LoRA adapters and the corresponding tokenizer to a local directory. Saving the adapters separately from the base model is a key advantage of using PEFT and LoRA, as it results in much smaller files that are easier to share and load.

## Steps Performed:

1.  **Save Model Adapters:** The `peft_model.save_pretrained("./fine_tuned_model")` command saves the trained LoRA adapter weights to the specified directory.
2.  **Save Tokenizer:** The `tokenizer.save_pretrained("./fine_tuned_model")` command saves the tokenizer configuration and vocabulary files to the same directory. This ensures that you use the correct tokenizer when loading and using the fine-tuned model for inference.

Saving the model and tokenizer allows you to reuse the fine-tuned model later without needing to retrain it.

In [None]:
# Save the fine-tuned model

# Save the fine-tuned model (LoRA adapters) and tokenizer
peft_model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

print("Model and tokenizer saved successfully.")

Model and tokenizer saved successfully.


# Load the fine-tuned model

This step loads the fine-tuned model and its corresponding tokenizer from the saved directory, preparing it for inference or further use.

## Steps Performed:

1.  **Load Base Model:** The original base model ("google/flan-t5-base") is loaded first. This is necessary because the saved fine-tuned model only contains the LoRA adapters, not the full model weights.
2.  **Load LoRA Adapters:** The `PeftModelForSeq2SeqLM.from_pretrained()` method is used to load the saved LoRA adapters from the specified `model_path` and apply them to the base model. This creates the complete fine-tuned model.
3.  **Load Tokenizer:** The tokenizer is loaded from the same saved directory using `AutoTokenizer.from_pretrained()`. It's important to use the tokenizer saved with the fine-tuned model to ensure consistency in tokenization.

After this step, the `fine_tuned_model` object contains the base model with the applied LoRA adapters, ready for generating text based on new inputs.

In [None]:
# Load the fine-tuned model and tokenizer for testing

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModelForSeq2SeqLM

# Path to the saved fine-tuned model
model_path = "./fine_tuned_model"

# Load the base model - must match the model the adapters were trained on
model_name = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the LoRA adapters onto the base model
fine_tuned_model = PeftModelForSeq2SeqLM.from_pretrained(model, model_path)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Fine-tuned model and tokenizer loaded successfully.")

Fine-tuned model and tokenizer loaded successfully.


# Prepare Test Data and Generate Predictions

This step demonstrates how to use the fine-tuned model to generate human-readable narratives from new data science results.

## Steps Performed:

1.  **Prepare Test Data:** Defines a list of example data science results (`test_data`) that the fine-tuned model will process.
2.  **Tokenize Test Data:** The test data is tokenized using the same tokenizer that was used during training. `return_tensors="pt"` prepares the data as PyTorch tensors, and `padding=True` and `truncation=True` ensure consistent input length.
3.  **Move to Device:** Moves the tokenized input tensors to the same device as the model (GPU if available) for efficient processing.
4.  **Generate Predictions:** Uses the `fine_tuned_model.generate()` method to produce text outputs based on the tokenized inputs. Key parameters like `max_length`, `num_beams`, `do_sample`, `top_k`, and `temperature` control the text generation process.
5.  **Decode Predictions:** The generated token IDs are decoded back into human-readable text strings using the tokenizer.
6.  **Display Results:** Prints the original input data science results and the corresponding generated human narratives.

This step shows the fine-tuned model in action, transforming complex data science findings into understandable narratives.

In [None]:
# Prepare test data and generate predictions

# Example test data (replace with your actual test data)
test_data = [
    "The regression model showed a positive correlation (r=0.7) between advertising spend and sales, with a p-value of 0.005.",
    "Customer segmentation using K-means clustering identified three distinct groups with varying purchasing behaviors: high-value, medium-value, and low-value customers."
]

# Tokenize the test data
inputs = tokenizer(test_data, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Move inputs to the same device as the model (GPU if available)
if torch.cuda.is_available():
    inputs = {name: tensor.to("cuda") for name, tensor in inputs.items()}
    fine_tuned_model.to("cuda")


# Generate predictions
fine_tuned_model.eval() # Set the model to evaluation mode
with torch.no_grad(): # Disable gradient calculation for inference
    outputs = fine_tuned_model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=1024, # Increased max length for generated output
        num_beams=1, # Use greedy decoding to try and get output
        # Removed early_stopping to encourage generation
        do_sample=True,  # Enable sampling
        top_k=50,        # Sample from top 50 tokens
        temperature=0.7, # Control randomness (lower is less random)
    )

# Decode the generated tokens back to text
predictions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

# Display the results
for i, (input_text, generated_text) in enumerate(zip(test_data, predictions)):
    print(f"Input Data Science Result {i+1}: {input_text}")
    print(f"Generated Human Narrative {i+1}: {generated_text}\n")

Input Data Science Result 1: The regression model showed a positive correlation (r=0.7) between advertising spend and sales, with a p-value of 0.005.
Generated Human Narrative 1: ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ® a theory that gathered data from a p-based research-prognosing system that showed a positive correlation of 25% of advertising spend and sales in advertising.

Input Data Science Result 2: Customer segmentation using K-means clustering identified three distinct groups with varying purchasing behaviors: high-value, medium-value, and low-value customers.
Generated Human Narrative 2: at the center of the center of the center of the center of the center of the center of the center of the center of the center of the center of the center of the center of 

# Summary and Future Work

Through this notebook, I have learned about the process of fine-tuning a language model for a specific text generation task. Key concepts explored include:

*   **Hugging Face Transformers:** Loading pre-trained models and tokenizers.
*   **Hugging Face `datasets` library:** Loading, processing, and splitting datasets.
*   **Parameter-Efficient Fine-Tuning (PEFT) with LoRA:** An efficient method for fine-tuning large models by training only a small number of additional parameters.
*   **Training with the `Trainer` class:** Configuring and executing the fine-tuning process.
*   **Model Saving and Loading:** Saving the fine-tuned adapters and tokenizer and loading them for inference.
*   **Text Generation:** Using the fine-tuned model to generate new text based on input data.

While the current results show some output, the quality is limited by the small size of the dataset used for fine-tuning. The model's ability to generate coherent and relevant human narratives is directly impacted by the amount and diversity of the training data it has seen.

In the future, I plan to:

*   Obtain and utilize a larger and more diverse dataset of data science results and corresponding human narratives.
*   Experiment with different fine-tuning approaches and hyperparameters to optimize performance.
*   Potentially explore more advanced text generation techniques to improve the quality and fluency of the generated narratives.

This project serves as a valuable foundation for further work in fine-tuning language models for domain-specific text generation tasks.

# Task
Create a markdown cell summarizing the learning experience, acknowledging the limitations due to the small dataset, outlining future plans for improvement with a larger dataset and better approaches, and providing instructions on how to push the current Colab notebook to the GitHub repository "https://github.com/waleedarif1011/Fine-Tuning-Flan-T5.git".

## Install git

### Subtask:
Install Git in the Colab environment.


**Reasoning**:
Install Git by updating the package list and then installing Git.



In [1]:
!sudo apt-get update
!sudo apt-get install git -y
!git --version

0% [Working]            Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,751 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:12 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [3,036 kB]
Get:13 https://r2u.stat.illinois.edu/ubuntu jam

## Configure git

### Subtask:
Configure your Git username and email.


**Reasoning**:
Configure the Git username and email using the git config command.



In [2]:
!git config --global user.name "Waleed Arif"
!git config --global user.email "waleedarif1011@gmail.com"

## Clone the repository

### Subtask:
Clone the specified GitHub repository to your Colab environment.


**Reasoning**:
Clone the specified GitHub repository using the `git clone` command.



In [3]:
!git clone "https://github.com/waleedarif1011/Fine-Tuning-Flan-T5.git"

Cloning into 'Fine-Tuning-Flan-T5'...
remote: Enumerating objects: 3, done.[K
remote: Counting objects: 100% (3/3), done.[K
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)[K
Receiving objects: 100% (3/3), done.


## Copy notebook

### Subtask:
Copy the current Colab notebook file into the cloned repository directory.


**Reasoning**:
Find the path of the current notebook file and copy it to the cloned repository directory.



In [4]:
import os

# Find the current notebook path (this is a common pattern in Colab)
notebook_path = '/content/' + os.path.basename(__file__)

# Define the path to the cloned repository
repo_path = "/content/Fine-Tuning-Flan-T5"

# Define the destination path for the notebook file within the repository
destination_path = os.path.join(repo_path, os.path.basename(notebook_path))

# Copy the notebook file
!cp "$notebook_path" "$destination_path"

print(f"Notebook copied to: {destination_path}")

NameError: name '__file__' is not defined

**Reasoning**:
The previous command failed because `__file__` is not defined in a Jupyter Notebook environment. Need to find the notebook path differently in Colab and then copy the notebook file to the cloned repository directory.



In [5]:
import os
import glob

# Find the current notebook path by looking for .ipynb files in /content/
# Assumes there is only one .ipynb file in /content/ which is common for Colab notebooks.
notebook_files = glob.glob('/content/fine_tune_flan-t5-base.ipynb')

if len(notebook_files) == 1:
    notebook_path = notebook_files[0]
else:
    print("Could not determine the current notebook path. Please manually specify the path.")
    notebook_path = None # Or raise an error

if notebook_path:
    # Define the path to the cloned repository
    repo_path = "/content/Fine-Tuning-Flan-T5"

    # Define the destination path for the notebook file within the repository
    destination_path = os.path.join(repo_path, os.path.basename(notebook_path))

    # Copy the notebook file
    !cp "$notebook_path" "$destination_path"

    print(f"Notebook copied to: {destination_path}")
else:
    print("Notebook not copied due to path issue.")

Could not determine the current notebook path. Please manually specify the path.
Notebook not copied due to path issue.


**Reasoning**:
The automatic method to find the notebook path failed. Manually specify the most likely path for the notebook in Colab and then copy the notebook file to the cloned repository directory.



In [None]:
import os

# Manually specify the most likely path for the notebook in Colab
# This is a common default name, but might need adjustment if the user renamed the notebook
notebook_path = '/content/Fine-Tuning-Flan-T5.ipynb'

# Define the path to the cloned repository
repo_path = "/content/Fine-Tuning-Flan-T5"

# Define the destination path for the notebook file within the repository
destination_path = os.path.join(repo_path, os.path.basename(notebook_path))

# Copy the notebook file
# Use -f to force overwrite if the file already exists
!cp -f "$notebook_path" "$destination_path"

print(f"Notebook copied to: {destination_path}")