**Generating inspirational quotes using a pre-trained GPT-2 model and saving them to a file**

1. **Check for GPU and Load Model:**
   - The code checks whether a GPU is available and sets the device to "cuda" if so, falling back to "cpu" otherwise. It then loads the tokenizer and model for "noelmathewisaac/inspirational-quotes-distilgpt2" (a DistilGPT-2 checkpoint) and moves the model to the selected device.

2. **Load Preprocessed Dataset:**
   - `dataset_path` points to the preprocessed CSV dataset, which is loaded into a DataFrame with pandas. The dataset contains text data, including a `first_3_words` column holding the first 3 words of each entry.

3. **Generated File Path:**
   - `generated_file_path` is the path where the generated quotes are saved. It must be writable, so it points into Kaggle's working directory (`/kaggle/working`).

4. **Generate Quotes for Each Row:**
   - The function `generate_quotes_for_each_row` generates one quote for each row in the dataset.
   - If the generated file already exists, the function counts the quotes in it and uses that count as the starting row index, so an interrupted run resumes where it stopped.
   - It then iterates through the remaining rows, uses the first 3 words of each entry as seed text, and generates a quote with the GPT-2 model.
   - Each generated quote is stored, along with its source row number, in the `generated_quotes` list.
   - Every 100 quotes (controlled by `save_every`), the list is appended to the file rather than overwriting it, and the list is then cleared.

5. **Append Quotes to the Existing File:**
   - Whenever the `generated_quotes` list is flushed (after every 100 quotes, and once more at the end for any remainder), its contents are written with pandas in append mode. This saves the new quotes without overwriting the ones already in the file.

6. **Generate Quotes for All Rows:**
   - The code calls the `generate_quotes_for_each_row` function to generate quotes for all rows in the preprocessed dataset.

7. **Copy the Generated File:**
   - If the generated quotes file was written somewhere other than the working directory, it is copied there with the `shutil.copy` function so that it is easy to access within Kaggle's environment. The code below already writes to `/kaggle/working`, so no copy is needed in that case.
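
The resume-and-append logic in steps 4 and 5 can be tried without loading a model. The following is a minimal sketch, not the notebook's own code: `fake_generate` is a stand-in for the model call, the `quote` column name and the toy quotes are hypothetical, and the way `first_3_words` is derived here is only an assumption about how the preprocessed dataset was built.

```python
import os
import tempfile

import pandas as pd


def fake_generate(seed_text):
    # Stand-in for model.generate + tokenizer.decode (assumption: no model is loaded here).
    return f"{seed_text} ... and keep going."


def generate_with_resume(dataframe, out_path, save_every=2):
    # Rows already written tell us where to resume; the file has no header row.
    if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
        start = len(pd.read_csv(out_path, sep="\t", header=None))
    else:
        start = 0

    buffer = []
    for index in range(start, len(dataframe)):
        quote = fake_generate(dataframe.iloc[index]["first_3_words"])
        buffer.append({"Generated_Quote": quote, "Source_Row": f"Row {index + 1}"})
        if len(buffer) == save_every:
            # Append the full batch, then clear it so nothing is saved twice.
            pd.DataFrame(buffer).to_csv(out_path, mode="a", header=False, index=False, sep="\t")
            buffer = []

    # Flush whatever is left after the last full batch.
    if buffer:
        pd.DataFrame(buffer).to_csv(out_path, mode="a", header=False, index=False, sep="\t")
    return start


# Toy data; the "quote" column name is hypothetical.
toy = pd.DataFrame({"quote": [
    "Believe in yourself and all that you are",
    "The best way out is always through",
    "Dream big and dare to fail",
]})
toy["first_3_words"] = toy["quote"].str.split().str[:3].str.join(" ")

out_path = os.path.join(tempfile.mkdtemp(), "generated_quotes.csv")
first_start = generate_with_resume(toy.iloc[:2], out_path)  # simulates an interrupted run
second_start = generate_with_resume(toy, out_path)          # resumes after the saved rows
saved = pd.read_csv(out_path, sep="\t", header=None, names=["Generated_Quote", "Source_Row"])
print(first_start, second_start, len(saved))
```

The first call starts at row 0 and saves two rows; the second call finds those two rows, starts at index 2, and appends only the third, so the file ends up with three rows and no duplicates.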



In [None]:
!pip install transformers
!pip install --upgrade tensorflow




In [3]:
!pip install --upgrade tensorflow==2.5.0 tensorflow-probability==0.12.2


ERROR: Could not find a version that satisfies the requirement tensorflow==2.5.0 (from versions: 2.8.0rc0, 2.8.0rc1, 2.8.0, 2.8.1, 2.8.2, 2.8.3, 2.8.4, 2.9.0rc0, 2.9.0rc1, 2.9.0rc2, 2.9.0, 2.9.1, 2.9.2, 2.9.3, 2.10.0rc0, 2.10.0rc1, 2.10.0rc2, 2.10.0rc3, 2.10.0, 2.10.1, 2.11.0rc0, 2.11.0rc1, 2.11.0rc2, 2.11.0, 2.11.1, 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0, 2.13.1, 2.14.0rc0, 2.14.0rc1, 2.14.0, 2.15.0rc0)
ERROR: No matching distribution found for tensorflow==2.5.0

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="noelmathewisaac/inspirational-quotes-distilgpt2")

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("noelmathewisaac/inspirational-quotes-distilgpt2")
model = AutoModelForCausalLM.from_pretrained("noelmathewisaac/inspirational-quotes-distilgpt2")

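# Optional reset (ignore unless you want to start over): clear the generated file.
# The path was left commented out, so the whole reset is disabled; uncomment every line to use it.
# file_path = '/content/drive/My Drive/prepro/generated_quotes.csv'
# Opening a file in write mode already truncates it, so no explicit truncate(0) is needed.
# with open(file_path, 'w'):
#     pass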


In [None]:
import random
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
import os

# Check if a GPU is available and use it
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("noelmathewisaac/inspirational-quotes-distilgpt2")
model = AutoModelForCausalLM.from_pretrained("noelmathewisaac/inspirational-quotes-distilgpt2").to(device)

# Define the path to your dataset in Kaggle
dataset_path = '/kaggle/input/sammyyy/preprocessed_quotes_no_quote_author_category (1).csv'

# Load the preprocessed dataset
preprocessed_data = pd.read_csv(dataset_path)

# Define the path to save the generated file in Kaggle's working directory
generated_file_path = '/kaggle/working/generated_quotes (1).csv'

# Function to generate quotes for each row without overwriting existing quotes
def generate_quotes_for_each_row(dataframe, max_rows, generated_file_path, max_length=100, save_every=100):
    generated_quotes = []

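    if os.path.exists(generated_file_path) and os.path.getsize(generated_file_path) > 0:
        # Count the quotes already saved to determine the starting index in the dataset.
        # The file is written without a header row, so read it with header=None;
        # otherwise the first quote is consumed as a header and one row is regenerated.
        existing_quotes_data = pd.read_csv(generated_file_path, delimiter='\t', header=None)
        num_existing_quotes = len(existing_quotes_data)
    else:
        num_existing_quotes = 0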

    for index in range(num_existing_quotes, max_rows):
        entry = dataframe.iloc[index]["first_3_words"]
        seed_text = entry
        input_ids = tokenizer.encode(seed_text, return_tensors="pt").to(device)
        attention_mask = torch.ones(input_ids.shape, device=device)
        # Note: top_k only affects sampling; without do_sample=True, generate() decodes greedily.
        output = model.generate(input_ids, max_length=max_length, no_repeat_ngram_size=20, top_k=50, pad_token_id=model.config.eos_token_id, attention_mask=attention_mask)
        quote = tokenizer.decode(output[0], skip_special_tokens=True)

        generated_quotes.append({"Generated_Quote": quote, "Source_Row": f"Row {index + 1}"})

        if len(generated_quotes) % save_every == 0:
            # Append newly generated quotes to the existing file every `save_every` quotes (100 by default)
            generated_quotes_data = pd.DataFrame(generated_quotes)
            generated_quotes_data.to_csv(generated_file_path, mode='a', header=False, index=False, sep='\t')
            generated_quotes = []  # Clear the list to avoid saving the same quotes multiple times
            print(f"Yay {index + 1} saved")

    # Append any remaining newly generated quotes
    if generated_quotes:
        generated_quotes_data = pd.DataFrame(generated_quotes)
        generated_quotes_data.to_csv(generated_file_path, mode='a', header=False, index=False, sep='\t')
        print(f"Yay {max_rows} saved")

# Generate one quote per row for every row in the dataset
max_rows_to_generate = len(preprocessed_data)
generate_quotes_for_each_row(preprocessed_data, max_rows_to_generate, generated_file_path)
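
Because the quotes are saved as tab-separated text without a header row, reading them back needs the column names supplied explicitly. The sketch below shows one way to do that, followed by the `shutil.copy` call described in step 7. It is an illustration under assumptions: the file contents, the temporary directory, and the copy destination are hypothetical stand-ins for the real Kaggle paths.

```python
import os
import shutil
import tempfile

import pandas as pd

# Stand-in for the generated file: tab-separated, no header row (hypothetical contents).
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "generated_quotes.csv")
with open(src, "w", encoding="utf-8") as f:
    f.write("Believe in yourself today.\tRow 1\n")
    f.write("The best way forward is through.\tRow 2\n")

# Supply the column names explicitly because the file has no header.
quotes = pd.read_csv(src, sep="\t", header=None, names=["Generated_Quote", "Source_Row"])

# Copy the file to another (hypothetical) location, as in step 7.
dst = os.path.join(work_dir, "generated_quotes_copy.csv")
shutil.copy(src, dst)
print(len(quotes), os.path.exists(dst))
```

In the notebook itself, `src` would be `generated_file_path`; a copy is only useful when that path lies outside the directory you want to download from.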
