**Data Preprocessing and Text Generation with GPT-2**


This notebook walks through data preprocessing and text generation with the GPT-2 model. It shows how to process text data and fine-tune GPT-2 to generate creative text that can later serve as prompts for image-generation models such as Stable Diffusion XL (available through Hugging Face), DALL-E, and Midjourney. The notebook covers the following key steps:

1. **Data Loading:** The process begins with loading a dataset from a tab-delimited CSV file. The dataset contains text descriptions, and the goal is to refine this data and use it to fine-tune the GPT-2 model.

2. **Text Cleaning:** A custom function is employed for text cleaning. This function eliminates special characters, numbers, and superfluous whitespaces, converts text to lowercase, tokenizes it, removes stopwords, and applies stemming. This meticulous cleaning process results in refined text data ready for analysis.

3. **Attribute Selection:** In this step, the notebook focuses on structured data attributes. It checks the existence of specific columns in the dataset and fills missing values with zeros. Then, it scales the values using Min-Max scaling to standardize the data for further analysis.

4. **Text Generation with GPT-2:** The notebook introduces the use of the GPT-2 model for text generation. It employs the Transformers library to fine-tune the GPT-2 model on a dataset of text prompts, so that the model can generate creative and contextually relevant text based on the input prompts.

5. **Training and Saving the Model:** The notebook outlines the training process, including specifying training arguments and data collation for language modeling. It also provides the steps to fine-tune the GPT-2 model and save the fine-tuned model for future use.


**Acknowledgments:**
We express our gratitude to the open-source libraries and datasets used in this notebook, including Pandas, NLTK, Transformers, and the GPT-2 model, and above all to Hugging Face :) So much love.

In [None]:
import pandas as pd

# Load your dataset into a DataFrame
df = pd.read_csv('/kaggle/input/prompt-unfiltered-dataset/dataset_final.csv', delimiter='\t')


In [None]:
!pip install nltk


In [None]:
import nltk
nltk.download('punkt')


In [None]:
import nltk
nltk.download('stopwords')


In [None]:
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

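# Download the tokenizer data ('punkt_tab' is what newer NLTK releases look for)
nltk.download('punkt')
nltk.download('punkt_tab')

# Create a small sample DataFrame (this replaces the df loaded earlier) to demonstrate the cleaning function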
data = {'photo_description': ["This is a sample photo description.", "Another description here!"],
        'ai_description': ["AI analysis of the first photo.", "AI description of the second one."]}
df = pd.DataFrame(data)

# Build these once rather than on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Define a function to clean and preprocess text data
def preprocess_text(text):
    # Missing values (NaN) are not strings; treat them as empty text
    if not isinstance(text, str):
        return ''

    # Remove special characters and numbers, then collapse extra whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = ' '.join(text.split())

    # Convert text to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Apply stemming
    tokens = [stemmer.stem(word) for word in tokens]

    # Join the tokens back into a single string
    cleaned_text = ' '.join(tokens)

    return cleaned_text


df['photo_description'] = df['photo_description'].apply(preprocess_text)
df['ai_description'] = df['ai_description'].apply(preprocess_text)

# Now, df contains the preprocessed text in the specified columns
print(df)
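
The cleaning function above depends on NLTK resources being downloaded. As a rough, dependency-free sketch of the same order of operations (strip non-letters, collapse whitespace, lowercase, drop stopwords), the snippet below uses a tiny hand-picked stopword set and plain `str.split` in place of `word_tokenize`. Both substitutions, and the names `TOY_STOPWORDS` and `simple_clean`, are assumptions made for illustration. Stemming is left out, so the output will not match `preprocess_text` exactly.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"this", "is", "a", "of", "the", "here"}

def simple_clean(text):
    """Approximate the cleaning pipeline without NLTK (no stemming)."""
    # Keep only ASCII letters and whitespace, like the regex in preprocess_text.
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse runs of whitespace and lowercase the result.
    text = " ".join(text.split()).lower()
    # Whitespace tokenization stands in for word_tokenize.
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

print(simple_clean("This is a sample photo description."))  # sample photo description
print(simple_clean("Photo #2:   taken   in 2019!"))         # photo taken in
```

Note that digits are removed entirely, so values such as years or camera model numbers do not survive this kind of cleaning.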


In [None]:
print(df.columns)

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load your dataset into a DataFrame
df = pd.read_csv('/kaggle/input/prompt-unfiltered-dataset/dataset_final.csv', delimiter='\t')

# Check the column names in the DataFrame
print(df.columns)

# Define the column names based on the provided structure
structured_columns = ['photo_width', 'photo_height', 'exif_iso', 'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'stats_views', 'stats_downloads', 'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude', 'ai_primary_landmark_confidence', 'ai_service_1_confidence', 'ai_service_2_confidence', 'red', 'green', 'blue']

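# Check that every structured column exists before processing
missing_columns = [column for column in structured_columns if column not in df.columns]
if missing_columns:
    print(f"Columns not found in the DataFrame: {missing_columns}")
else:
    # Fill missing values with zeros (assigning back avoids chained in-place updates)
    df[structured_columns] = df[structured_columns].fillna(0)

    # Scale every structured column to the [0, 1] range
    scaler = MinMaxScaler()
    df[structured_columns] = scaler.fit_transform(df[structured_columns])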
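
Min-Max scaling maps each column to the range [0, 1] with x' = (x - min) / (max - min). The pure-Python sketch below spells that formula out for a single column. `min_max_scale` is a hypothetical helper written for illustration, not part of scikit-learn, and sending a constant column to 0.0 is a choice made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: avoid dividing by zero and map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

widths = [1000, 4000, 0, 2500]  # the 0 could be a filled-in missing value
print(min_max_scale(widths))    # [0.25, 1.0, 0.0, 0.625]
```

The example also shows a side effect of filling missing values with zeros before scaling: the filled-in 0 becomes the column minimum, so it shifts where every real value lands in the scaled range.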

   

In [None]:
# Save the preprocessed DataFrame to a new CSV file
df.to_csv('/kaggle/working/preprocessed_dataset_final.csv', index=False)


In [None]:
!pip install transformers



In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

In [None]:
import pandas as pd

# Load your dataset into a DataFrame with the first row as headers
df = pd.read_csv('/kaggle/input/prompt-unfiltered-dataset/dataset_final.csv', delimiter='\t', header=0)

# Now, when you run df.head(), the first row will be treated as column headers
print(df.head())


In [None]:
import pandas as pd

# Define the column names as a list of strings 
column_names = [
    'photo_id', 'photo_image_url', 'photo_width', 'photo_height', 'photo_aspect_ratio',
    'photo_description', 'exif_camera_make', 'exif_camera_model', 'exif_iso',
    'exif_aperture_value', 'exif_focal_length', 'exif_exposure_time', 'stats_views',
    'stats_downloads', 'ai_description', 'ai_primary_landmark_name',
    'ai_primary_landmark_latitude', 'ai_primary_landmark_longitude',
    'ai_primary_landmark_confidence', 'blur_hash', 'keyword_x',
    'ai_service_1_confidence', 'ai_service_2_confidence', 'conversion_country',
    'keyword_y', 'hex', 'red', 'green', 'blue', 'keyword', 'ai_coverage',
    'ai_score', 'collection_title'
]

# Load your dataset with the specified column names; header=0 replaces the file's own header row instead of reading it as data
df = pd.read_csv('/kaggle/input/prompt-unfiltered-dataset/dataset_final.csv', delimiter='\t', header=0, names=column_names)


print(df.columns)


In [None]:
df['image_prompt'] = (
    'A captivating photograph (' + df['photo_id'].astype(str) + ') with dimensions ' +
    df['photo_width'].astype(str) + 'x' + df['photo_height'].astype(str) + ', '
)

df['image_prompt'] += (
    'captured with a ' + df['exif_camera_make'].astype(str) + ' ' +
    df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'Shot at ' + df['ai_primary_landmark_name'].astype(str) + ' in ' +
    df['conversion_country'].astype(str) + ', this photo ' +
    'received ' + df['stats_views'].astype(str) + ' views and ' +
    df['stats_downloads'].astype(str) + ' downloads. '
)

df['image_prompt'] += (
    'The image is associated with the collection "' +
    df['collection_title'].astype(str) + '". '
)

df['image_prompt'] += (
    'AI analysis detected ' + df['ai_description'].astype(str) + ' with a ' +
    'confidence score of ' + df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark is located at (' +
    df['ai_primary_landmark_latitude'].astype(str) + ', ' +
    df['ai_primary_landmark_longitude'].astype(str) + ') with a ' +
    'landmark confidence of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The dominant colors in the image are R=' + df['red'].astype(str) +
    ', G=' + df['green'].astype(str) + ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'The image is labeled with keywords: ' + df['keyword_x'].astype(str) +
    ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'Additional information: ' + df['ai_coverage'].astype(str) +
    ' coverage, ' + df['ai_service_1_confidence'].astype(str) +
    ' service 1 confidence, ' + df['ai_service_2_confidence'].astype(str) +
    ' service 2 confidence. '
)

df['image_prompt'] += (
    'The photo has a unique blur hash: ' + df['blur_hash'].astype(str) + '. '
)

df['image_prompt'] += (
    'The image showcases vivid colors with R=' + df['red'].astype(str) +
    ', G=' + df['green'].astype(str) + ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'It was converted in ' + df['conversion_country'].astype(str) +
    ' and is associated with the keyword ' + df['keyword'].astype(str) + '. '
)

df['image_prompt'] += (
    'The photo is available at ' + df['photo_image_url'].astype(str) + '. '
)

df['image_prompt'] += (
    'It has an aspect ratio of ' + df['photo_aspect_ratio'].astype(str) +
    ' and dimensions ' + df['photo_width'].astype(str) + 'x' + df['photo_height'].astype(str) + '. '
)

df['image_prompt'] += (
    'Taken with ' + df['exif_iso'].astype(str) + ' ISO, an aperture value of ' +
    df['exif_aperture_value'].astype(str) + ', and a focal length of ' + df['exif_focal_length'].astype(str) + 'mm. '
)

df['image_prompt'] += (
    'The exposure time was ' + df['exif_exposure_time'].astype(str) + ' seconds. '
)

df['image_prompt'] += (
    'A piece of art captured with a ' + df['exif_camera_make'].astype(str) + ' ' +
    df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'This image was taken with an ISO setting of ' + df['exif_iso'].astype(str) + 
    ' and an aperture value of ' + df['exif_aperture_value'].astype(str) + '. '
)

df['image_prompt'] += (
    'The focal length was ' + df['exif_focal_length'].astype(str) + 
    ' mm, and the exposure time was ' + df['exif_exposure_time'].astype(str) + ' seconds. '
)

df['image_prompt'] += (
    'It has been viewed ' + df['stats_views'].astype(str) + ' times and downloaded ' + 
    df['stats_downloads'].astype(str) + ' times. '
)

df['image_prompt'] += (
    'The AI analysis detected ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark, ' + df['ai_primary_landmark_name'].astype(str) + ', is situated at (' + 
    df['ai_primary_landmark_latitude'].astype(str) + ', ' + df['ai_primary_landmark_longitude'].astype(str) + ') '
)

df['image_prompt'] += (
    'with a confidence score of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The dominant colors in the image are R=' + df['red'].astype(str) + ', G=' + df['green'].astype(str) + 
    ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This photo is part of the collection "' + df['collection_title'].astype(str) + '" '
)

df['image_prompt'] += (
    'with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'The image has an aspect ratio of ' + df['photo_aspect_ratio'].astype(str) + 
    ' and dimensions ' + df['photo_width'].astype(str) + 'x' + df['photo_height'].astype(str) + '. '
)

df['image_prompt'] += (
    'Captured using the ' + df['exif_camera_make'].astype(str) + ' ' + 
    df['exif_camera_model'].astype(str) + ' at ' + df['exif_iso'].astype(str) + ' ISO. '
)

df['image_prompt'] += (
    'Focal length ' + df['exif_focal_length'].astype(str) + 'mm, aperture value ' + 
    df['exif_aperture_value'].astype(str) + ', exposure time ' + df['exif_exposure_time'].astype(str) + 's. '
)

df['image_prompt'] += (
    'A scene that looks like it came straight out of a dream. '
)

df['image_prompt'] += (
    'A perfect blend of colors and elements that capture the essence of the moment. '
)

df['image_prompt'] += (
    'An image that tells a story with every pixel. '
)

df['image_prompt'] += (
    'An incredible display of creativity and vision. '
)

# NOTE: '=' (not '+=') discards the prompt text built so far and starts a new template
df['image_prompt'] = (
    'This remarkable photograph was captured using a ' + df['exif_camera_make'].astype(str) + 
    ' ' + df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'With an ISO of ' + df['exif_iso'].astype(str) + ' and an aperture value of ' + 
    df['exif_aperture_value'].astype(str) + ', it beautifully frames the subject. '
)

df['image_prompt'] += (
    'The focal length of ' + df['exif_focal_length'].astype(str) + 'mm and an exposure time of ' + 
    df['exif_exposure_time'].astype(str) + ' seconds bring life to this image. '
)

df['image_prompt'] += (
    'This captivating image has been viewed ' + df['stats_views'].astype(str) + ' times and downloaded ' + 
    df['stats_downloads'].astype(str) + ' times, making it truly remarkable. '
)

df['image_prompt'] += (
    'AI analysis detected ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark is ' + df['ai_primary_landmark_name'].astype(str) + ' located at (' + 
    df['ai_primary_landmark_latitude'].astype(str) + ', ' + df['ai_primary_landmark_longitude'].astype(str) + ') '
)

df['image_prompt'] += (
    'with a confidence score of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The prominent colors in the image are R=' + df['red'].astype(str) + ', G=' + df['green'].astype(str) + 
    ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This breathtaking image belongs to the collection "' + df['collection_title'].astype(str) + '" '
)

df['image_prompt'] += (
    'with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'The image has an aspect ratio of ' + df['photo_aspect_ratio'].astype(str) + 
    ' and dimensions ' + df['photo_width'].astype(str) + 'x' + df['photo_height'].astype(str) + '. '
)

df['image_prompt'] += (
    'Taken using the ' + df['exif_camera_make'].astype(str) + ' ' + 
    df['exif_camera_model'].astype(str) + ' at ' + df['exif_iso'].astype(str) + ' ISO. '
)

df['image_prompt'] += (
    'Focal length ' + df['exif_focal_length'].astype(str) + 'mm, aperture value ' + 
    df['exif_aperture_value'].astype(str) + ', exposure time ' + df['exif_exposure_time'].astype(str) + 's. '
)

df['image_prompt'] += (
    'A scene that looks like it came straight out of a dream. '
)

df['image_prompt'] += (
    'A perfect blend of colors and elements that capture the essence of the moment. '
)

df['image_prompt'] += (
    'An image that tells a story with every pixel. '
)

df['image_prompt'] += (
    'An incredible display of creativity and vision. '
)

# NOTE: '=' (not '+=') discards the prompt text built so far and starts a new template
df['image_prompt'] = (
    'A stunning photograph taken with a ' + df['exif_camera_make'].astype(str) + ' ' + 
    df['exif_camera_model'].astype(str) + ' camera, featuring '
)

df['image_prompt'] += (
    'an ISO of ' + df['exif_iso'].astype(str) + ', an aperture value of f/' + 
    df['exif_aperture_value'].astype(str) + ', and a focal length of ' + df['exif_focal_length'].astype(str) + 'mm. '
)

df['image_prompt'] += (
    'The exposure time was ' + df['exif_exposure_time'].astype(str) + ' seconds. '
)

df['image_prompt'] += (
    'With ' + df['stats_views'].astype(str) + ' views and ' + df['stats_downloads'].astype(str) + ' downloads, this image has garnered much attention. '
)

df['image_prompt'] += (
    'It was taken at (' + df['ai_primary_landmark_latitude'].astype(str) + ', ' + 
    df['ai_primary_landmark_longitude'].astype(str) + ') with ' + df['ai_primary_landmark_confidence'].astype(str) + ' landmark confidence. '
)

df['image_prompt'] += (
    'The image showcases dominant colors - R=' + df['red'].astype(str) + 
    ', G=' + df['green'].astype(str) + ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This image belongs to the collection "' + df['collection_title'].astype(str) + 
    '" and is associated with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'AI analysis detected ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The photograph was converted in ' + df['conversion_country'].astype(str) + '. '
)

# NOTE: '=' (not '+=') discards the prompt text built so far and starts a new template
df['image_prompt'] = (
    'This extraordinary photograph (' + df['photo_id'].astype(str) + ') showcases a ' + 
    df['ai_description'].astype(str) + ' captured with a ' + df['exif_camera_make'].astype(str) + ' ' + 
    df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'The image was taken at ' + df['ai_primary_landmark_name'].astype(str) + ' with ' + 
    df['ai_primary_landmark_confidence'].astype(str) + ' landmark confidence. '
)

df['image_prompt'] += (
    'With dimensions ' + df['photo_width'].astype(str) + 'x' + df['photo_height'].astype(str) + ' and an aspect ratio of ' + 
    df['photo_aspect_ratio'].astype(str) + ', it offers a captivating visual experience. '
)

df['image_prompt'] += (
    'The photograph was captured using an ISO of ' + df['exif_iso'].astype(str) + ' and an aperture value of ' + 
    df['exif_aperture_value'].astype(str) + '. '
)

df['image_prompt'] += (
    'With a focal length of ' + df['exif_focal_length'].astype(str) + 'mm and an exposure time of ' + 
    df['exif_exposure_time'].astype(str) + ' seconds, it captures the essence of the moment. '
)

df['image_prompt'] += (
    'This image has been viewed ' + df['stats_views'].astype(str) + ' times and downloaded ' + 
    df['stats_downloads'].astype(str) + ' times, demonstrating its widespread appeal. '
)

df['image_prompt'] += (
    'The dominant colors in the photo are R=' + df['red'].astype(str) + 
    ', G=' + df['green'].astype(str) + ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'Part of the collection "' + df['collection_title'].astype(str) + '", this image is enriched with keywords: ' + 
    df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'AI analysis confidently identified ' + df['ai_description'].astype(str) + ' with a score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The image underwent conversion in ' + df['conversion_country'].astype(str) + '. '
)

df['image_prompt'] += (
    'An image that tells a unique story through visual elegance. '
)

df['image_prompt'] += (
    'This photograph combines the magic of ' + df['ai_description'].astype(str) + ' with the artistry of photography. '
)

# NOTE: '=' (not '+=') discards the prompt text built so far and starts a new template
df['image_prompt'] = (
    'A mesmerizing photograph captured with a ' + df['exif_camera_make'].astype(str) + ' ' + 
    df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'Featuring an ISO setting of ' + df['exif_iso'].astype(str) + ' and an aperture value of ' + 
    df['exif_aperture_value'].astype(str) + ', it beautifully frames the subject. '
)

df['image_prompt'] += (
    'The focal length of ' + df['exif_focal_length'].astype(str) + 'mm and an exposure time of ' + 
    df['exif_exposure_time'].astype(str) + ' seconds bring life to this image. '
)

df['image_prompt'] += (
    'AI analysis detected ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark is ' + df['ai_primary_landmark_name'].astype(str) + ' located at (' + 
    df['ai_primary_landmark_latitude'].astype(str) + ', ' + df['ai_primary_landmark_longitude'].astype(str) + ') '
)

df['image_prompt'] += (
    'with a confidence score of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The prominent colors in the image are R=' + df['red'].astype(str) + ', G=' + df['green'].astype(str) + 
    ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This breathtaking image belongs to the collection "' + df['collection_title'].astype(str) + '" '
)

df['image_prompt'] += (
    'with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'A visual journey that tells a captivating story. '
)

df['image_prompt'] += (
    'An exquisite blend of creativity, colors, and artistry that captivates the viewer. '
)

df['image_prompt'] += (
    'A photographic masterpiece that draws you into the moment it captures. '
)

df['image_prompt'] += (
    'An image that inspires curiosity and imagination with every pixel. '
)

# NOTE: '=' (not '+=') discards the prompt text built so far and starts a new template
df['image_prompt'] = (
    'This remarkable photograph was captured using a ' + df['exif_camera_make'].astype(str) + 
    ' ' + df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'With an ISO of ' + df['exif_iso'].astype(str) + ' and an aperture value of ' + 
    df['exif_aperture_value'].astype(str) + ', it beautifully frames the subject. '
)

df['image_prompt'] += (
    'The focal length of ' + df['exif_focal_length'].astype(str) + 'mm and an exposure time of ' + 
    df['exif_exposure_time'].astype(str) + ' seconds bring life to this image. '
)

df['image_prompt'] += (
    'AI analysis detected ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark is ' + df['ai_primary_landmark_name'].astype(str) + ' located at (' + 
    df['ai_primary_landmark_latitude'].astype(str) + ', ' + df['ai_primary_landmark_longitude'].astype(str) + ') '
)

df['image_prompt'] += (
    'with a confidence score of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The dominant colors in the photo are R=' + df['red'].astype(str) + ', G=' + df['green'].astype(str) + 
    ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This photo is part of the collection "' + df['collection_title'].astype(str) + '" '
)

df['image_prompt'] += (
    'with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'An image that captures the essence of the moment with every detail. '
)

df['image_prompt'] += (
    'A masterpiece that speaks volumes through the interplay of colors and elements. '
)

df['image_prompt'] += (
    'A visual story that transports the viewer to another world. '
)

df['image_prompt'] += (
    'An exceptional composition that sparks curiosity and emotion. '
)

# NOTE: '=' (not '+=') discards the prompt text built so far and starts a new template
df['image_prompt'] = (
    'This incredible photograph (' + df['photo_id'].astype(str) + ') was taken with ' + 
    df['exif_camera_make'].astype(str) + ' ' + df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'With an ISO setting of ' + df['exif_iso'].astype(str) + ' and an aperture value of ' + 
    df['exif_aperture_value'].astype(str) + ', the image captures the beauty of the moment. '
)

df['image_prompt'] += (
    'The focal length is ' + df['exif_focal_length'].astype(str) + 'mm, and the exposure time is ' + 
    df['exif_exposure_time'].astype(str) + ' seconds. '
)

df['image_prompt'] += (
    'This captivating image has been viewed ' + df['stats_views'].astype(str) + ' times and downloaded ' + 
    df['stats_downloads'].astype(str) + ' times, making it a truly remarkable photo. '
)

df['image_prompt'] += (
    'The AI analysis identifies the subject as ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark in the image is ' + df['ai_primary_landmark_name'].astype(str) + ', located at (' + 
    df['ai_primary_landmark_latitude'].astype(str) + ', ' + df['ai_primary_landmark_longitude'].astype(str) + '), '
)

df['image_prompt'] += (
    'with a landmark confidence of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The prominent colors in this photo are R=' + df['red'].astype(str) + ', G=' + df['green'].astype(str) + 
    ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This captivating image is part of the collection "' + df['collection_title'].astype(str) + '", '
)

df['image_prompt'] += (
    'enriched with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'A visual masterpiece that speaks of creativity and artistry. '
)

df['image_prompt'] += (
    'An image that captures a moment in time and tells a compelling story. '
)

df['image_prompt'] += (
    'An exquisite blend of colors and composition that draws you in. '
)

df['image_prompt'] += (
    'A visual journey that sparks curiosity and emotion. '
)

# NOTE: '=' starts over once more; this last template is the one that ends up in 'image_prompt'
df['image_prompt'] = (
    'A captivating photo taken with a ' + df['exif_camera_make'].astype(str) + ' ' +
    df['exif_camera_model'].astype(str) + ' camera. '
)

df['image_prompt'] += (
    'With an ISO setting of ' + df['exif_iso'].astype(str) + ' and an aperture value of ' + 
    df['exif_aperture_value'].astype(str) + ', the image beautifully captures the moment. '
)

df['image_prompt'] += (
    'The focal length is ' + df['exif_focal_length'].astype(str) + 'mm, and the exposure time is ' + 
    df['exif_exposure_time'].astype(str) + ' seconds. '
)

df['image_prompt'] += (
    'This captivating image has been viewed ' + df['stats_views'].astype(str) + ' times and downloaded ' + 
    df['stats_downloads'].astype(str) + ' times, making it truly remarkable. '
)

df['image_prompt'] += (
    'AI analysis identified ' + df['ai_description'].astype(str) + ' with a confidence score of ' + 
    df['ai_score'].astype(str) + '. '
)

df['image_prompt'] += (
    'The primary landmark in the image is ' + df['ai_primary_landmark_name'].astype(str) + ', located at (' + 
    df['ai_primary_landmark_latitude'].astype(str) + ', ' + df['ai_primary_landmark_longitude'].astype(str) + '), '
)

df['image_prompt'] += (
    'with a landmark confidence of ' + df['ai_primary_landmark_confidence'].astype(str) + '. '
)

df['image_prompt'] += (
    'The prominent colors in this photo are R=' + df['red'].astype(str) + ', G=' + df['green'].astype(str) + 
    ', B=' + df['blue'].astype(str) + '. '
)

df['image_prompt'] += (
    'This captivating image is part of the collection "' + df['collection_title'].astype(str) + '", '
)

df['image_prompt'] += (
    'enriched with keywords: ' + df['keyword_x'].astype(str) + ', ' + df['keyword_y'].astype(str) + '. '
)

df['image_prompt'] += (
    'A visual masterpiece that speaks of creativity and artistry. '
)

df['image_prompt'] += (
    'An image that captures a moment in time and tells a compelling story. '
)

df['image_prompt'] += (
    'An exquisite blend of colors and composition that draws you in. '
)

df['image_prompt'] += (
    'A visual journey that sparks curiosity and emotion. '
)

# Unique additions:
df['image_prompt'] += (
    'An image that showcases the world in a different light, highlighting its beauty. '
)

df['image_prompt'] += (
    'A photographic masterpiece that evokes a sense of wonder and curiosity. '
)

df['image_prompt'] += (
    "This image is more than just pixels; it's a story waiting to be explored. "
)

df['image_prompt'] += (
    'A perfect fusion of technology and art, creating a captivating visual experience. '
)

df['image_prompt'] += (
    'An image that invites the viewer to step into the scene and become a part of the story. '
)

df['image_prompt'] += (
    'This photograph captures the essence of the moment, preserving it for eternity. '
)

df['image_prompt'] += (
    'A visual composition that transports the viewer to another world filled with beauty and wonder. '
)
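
Several blocks in the cell above assign to `df['image_prompt']` with `=`, so only the last template survives. One way to keep every variant is to store each template as a format string and pick one per row. The sketch below illustrates that idea with plain dictionaries. `TEMPLATES` and `build_prompt` are hypothetical names, the two templates are shortened versions of the text above, and the sample row is made up for illustration.

```python
# Hypothetical templates; the field names match columns used in the cell above.
TEMPLATES = [
    "A captivating photo taken with a {exif_camera_make} {exif_camera_model} camera. "
    "The focal length is {exif_focal_length}mm, and the exposure time is "
    "{exif_exposure_time} seconds.",
    "A stunning photograph featuring an ISO of {exif_iso} and an aperture value of "
    "f/{exif_aperture_value}. AI analysis detected {ai_description}.",
]

def build_prompt(row, index):
    """Fill one template for a row, rotating through the templates by index."""
    template = TEMPLATES[index % len(TEMPLATES)]
    return template.format(**row)

# Made-up sample row for illustration only.
row = {
    "exif_camera_make": "Canon",
    "exif_camera_model": "EOS 80D",
    "exif_iso": 100,
    "exif_aperture_value": 2.8,
    "exif_focal_length": 50.0,
    "exif_exposure_time": "1/200",
    "ai_description": "a lighthouse on a rocky shore",
}

print(build_prompt(row, 0))
print(build_prompt(row, 1))
```

With a DataFrame that has a default integer index, something like `df.apply(lambda r: build_prompt(r.to_dict(), r.name), axis=1)` would apply the same idea row by row.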



In [None]:
# Select relevant columns
df = df[['photo_id', 'image_prompt']]


In [None]:

df.to_csv('/kaggle/working/image_prompts.csv', index=False)


In [None]:
!pip install transformers
!pip install accelerate


In [None]:
import pandas as pd
data = pd.read_csv('/kaggle/working/image_prompts.csv')  # Load the CSV file using pandas


In [None]:
data = pd.read_csv('/kaggle/working/image_prompts.csv', encoding='utf-8')



In [None]:
pip install --upgrade transformers torch


In [None]:
import os

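# Create a Conda environment (-y skips the interactive confirmation prompt)
os.system('conda create -y -n myenv python=3.8')

# 'conda activate' inside os.system only affects that short-lived subshell,
# so each command below targets the environment by name instead

# Install PyTorch and torchvision into the environment
os.system('conda install -y -n myenv pytorch=1.8.1 torchvision=0.9.1 -c pytorch')

# Install Python packages into the environment
os.system('conda run -n myenv pip install transformers pandas')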


In [None]:
# Set the pad_token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token  # You can use a different token if needed

# Tokenize the "image_prompt" column
tokenized_datasets = data["image_prompt"].apply(lambda x: tokenizer(x, padding="max_length", truncation=True))
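
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch on plain lists of token ids. `pad_or_truncate` is a hypothetical helper, not a Transformers function. It only mimics the shape of what the tokenizer returns (`input_ids` plus an `attention_mask`), assuming padding is added on the right.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks real tokens
    padding = max_length - len(ids)       # number of pad positions needed
    return ids + [pad_id] * padding, mask + [0] * padding

PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in the cell above

print(pad_or_truncate([11, 22, 33], 5, PAD_ID))
# ([11, 22, 33, 50256, 50256], [1, 1, 1, 0, 0])
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5, PAD_ID))
# ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])
```

When no `max_length` is passed, the tokenizer falls back to the model's maximum length (1024 tokens for GPT-2), so every row is padded to that size. Passing an explicit, smaller `max_length` keeps memory use down.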



In [None]:
pip install accelerate -U

In [None]:
pip install transformers[torch]

In [None]:
pip install accelerate




In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments
import pandas as pd

# Load the pre-trained GPT-2 model
model_name = "gpt2-medium"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Add a new pad token and grow the embedding matrix so the new token id is valid
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# Load your image prompts dataset
dataset_path = "/kaggle/working/image_prompts.csv"  # Update with your dataset path
data = pd.read_csv(dataset_path)  # Load the CSV file using pandas

# Tokenize the "image_prompt" column with a smaller max_length
tokenized_datasets = data["image_prompt"].apply(lambda x: tokenizer(x, padding="max_length", truncation=True, max_length=64))

# Define data collator (mlm=False selects causal language modeling)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define training arguments; evaluation settings are omitted because no eval dataset is passed to the Trainer
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    learning_rate=2e-5,
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets.tolist(),  # Convert to list
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model and its tokenizer to output_dir
trainer.save_model()
tokenizer.save_pretrained(training_args.output_dir)
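
Once training has finished, the saved model can be used to generate new prompt text. The following is a minimal sketch, assuming the weights were written to `./gpt2-finetuned` as in the cell above and that the tokenizer is rebuilt from `gpt2-medium` with the same `[PAD]` token. `generation_settings` and `generate_prompt` are hypothetical helper names, and the sampling values are illustrative rather than tuned.

```python
def generation_settings(max_new_tokens=60, temperature=0.9, top_p=0.95):
    """Collect sampling options for model.generate (values are illustrative)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,        # sample instead of greedy decoding
        "temperature": temperature,
        "top_p": top_p,
    }

def generate_prompt(seed_text, model_dir="./gpt2-finetuned", base_name="gpt2-medium"):
    """Continue seed_text with the fine-tuned model stored in model_dir."""
    # Imported lazily so the settings helper works without transformers installed.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(base_name)
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # same pad token as in training
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(seed_text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
        **generation_settings(),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_prompt("A captivating photo taken with a")` should return the seed text followed by a sampled continuation. Because sampling is enabled, the output differs from run to run.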


