
**Data Preprocessing and Fusion for Unsplash Image Dataset**

**Introduction:**
This notebook preprocesses and merges the tables of the Unsplash Image Dataset (photos, keywords, conversions, colors, and collections, read here from the `unsplash-dataset-lite` input). The goal is a single consolidated CSV with one row per photo, ready for further analysis and modeling.

**Data Preprocessing Steps:**

1. **Data Import and Initial Filtering:**
   - We load the photos table (`photos.tsv000`), which holds the images and their attributes.
   - Rows with a missing value in the 'ai_description' column are dropped.

2. **Attribute Selection:**
   - We keep only the columns relevant to our analysis: image URL and dimensions, EXIF data, location information, view and download statistics, the AI description and landmark fields, and the blur hash.

3. **Keyword Integration:**
   - We left-merge the filtered dataset with the keywords dataset on the 'photo_id' column.
   - This adds each keyword and its confidence scores from the two AI services. A photo with several keywords appears once per keyword at this stage.

4. **Data Type Conversion and Cleaning:**
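   - The 'exif_iso' column is converted to float; values that cannot be parsed become NaN. Rows with NaN in 'exif_iso' or 'exif_aperture_value' are then removed.
   - Duplicate rows are dropped by 'photo_id', keeping the first occurrence, so only the first keyword per photo is retained.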

5. **Conversions Integration:**
   - We left-merge the dataset with the conversions dataset, adding the conversion country and the keyword associated with each conversion. Only the first matching conversion per photo is kept.

6. **Colors Integration:**
   - We left-merge the colors dataset, adding the hex value, RGB values, color keyword, AI coverage, and AI score. Only the first matching color per photo is kept.

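7. **Collections Integration:**
   - We left-merge the collections dataset, adding a collection title for each image. Only the first matching collection per photo is kept.

8. **Final Column Cleanup:**
   - The location columns ('photo_location_name', latitude, longitude, country, and city) are dropped, and the result is saved as `dataset_final.csv`.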
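
The integration steps above all follow the same pattern: a left merge on 'photo_id' followed by dropping duplicates, which keeps only the first matching row per photo. The sketch below illustrates that effect on made-up toy data and shows one possible alternative, collapsing all keywords into a list per photo before merging. The `photos` and `keywords` frames and the `keywords` list column are illustrative assumptions, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for the photos and keywords tables (made-up values).
photos = pd.DataFrame({"photo_id": ["a", "b"], "photo_width": [4000, 3000]})
keywords = pd.DataFrame({
    "photo_id": ["a", "a", "b"],
    "keyword": ["beach", "sea", "forest"],
})

# Pattern used in this notebook: left merge, then keep the first row per photo.
first_only = (
    photos.merge(keywords, on="photo_id", how="left")
    .drop_duplicates(subset="photo_id", keep="first")
    .reset_index(drop=True)
)

# Possible alternative: aggregate keywords per photo first, so none are discarded.
keyword_lists = (
    keywords.groupby("photo_id")["keyword"]
    .agg(list)
    .rename("keywords")
    .reset_index()
)
all_keywords = photos.merge(keyword_lists, on="photo_id", how="left")

print(first_only)
print(all_keywords)
```

Which variant is preferable depends on the downstream task: the first keeps a flat one-value-per-column table, while the second preserves every keyword at the cost of a list-valued column.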

**Conclusion:**
This Kaggle notebook walks through preprocessing and integrating the Unsplash Image Dataset. The result, `dataset_final.csv`, has one row per photo and combines image attributes with the first matching keyword, conversion, color, and collection for that photo. It can serve as a starting point for further analysis and modeling.

**Acknowledgments:**
We thank Unsplash for releasing the dataset used in this project.

In [None]:
import pandas as pd

# Define the file paths for input and output (replace with your actual paths)
input_file_path = '/kaggle/input/unsplash-dataset-lite/photos.tsv000'
output_file_path = '/kaggle/working/filtered_dataset.csv'  # Include the desired file name

# Load your dataset into a pandas DataFrame
df = pd.read_csv(input_file_path, sep='\t')  # Specify the tab separator for TSV

# Filter the DataFrame to keep rows where 'ai_description' is not null
df_filtered = df.dropna(subset=['ai_description'])

# Save the filtered dataset to the specified output CSV file
df_filtered.to_csv(output_file_path, index=False)

In [None]:
import pandas as pd

# Define the file paths for input and output (replace with your actual paths)
input_file_path = '/kaggle/working/filtered_dataset.csv'
output_file_path = '/kaggle/working/new_filtered_dataset.csv'  # Include the desired file name

# Load the dataset from the input file
df = pd.read_csv(input_file_path)

# Define the list of relevant attributes to keep
relevant_attributes = [
    'photo_id',
    'photo_image_url',
    'photo_width',
    'photo_height',
    'photo_aspect_ratio',
    'photo_description',
    'exif_camera_make',
    'exif_camera_model',
    'exif_iso',
    'exif_aperture_value',
    'exif_focal_length',
    'exif_exposure_time',
    'photo_location_name',
    'photo_location_latitude',
    'photo_location_longitude',
    'photo_location_country',
    'photo_location_city',
    'stats_views',
    'stats_downloads',
    'ai_description',
    'ai_primary_landmark_name',
    'ai_primary_landmark_latitude',
    'ai_primary_landmark_longitude',
    'ai_primary_landmark_confidence',
    'blur_hash'
]

# Create a new DataFrame with only the relevant attributes
df_filtered = df[relevant_attributes]

# Save the filtered dataset to the specified output CSV file
df_filtered.to_csv(output_file_path, index=False)
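
As a possible shortcut, the first two cells (filtering on 'ai_description' and selecting columns) could be combined at read time with the `usecols` argument of `pd.read_csv`. The sketch below is a minimal illustration on an in-memory, made-up TSV with only a handful of columns; it is an assumption about how one might restructure the step, not what the notebook does.

```python
import io

import pandas as pd

# Tiny made-up TSV standing in for photos.tsv000 (the real file has many more columns).
tsv = (
    "photo_id\tphoto_width\tphoto_featured\tai_description\n"
    "abc\t4000\tt\ta beach at sunset\n"
    "xyz\t3000\tf\t\n"
)

# Shortened, illustrative version of the relevant_attributes list.
relevant_attributes = ["photo_id", "photo_width", "ai_description"]

# Read only the columns of interest, then drop rows without an AI description.
df = pd.read_csv(io.StringIO(tsv), sep="\t", usecols=relevant_attributes)
df_filtered = df.dropna(subset=["ai_description"])

print(df_filtered)
```

Reading fewer columns reduces memory use, but it skips the intermediate `filtered_dataset.csv` file that the cells above write.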

In [None]:
import pandas as pd

# Define the file paths for the filtered dataset and keywords dataset
filtered_dataset_path = '/kaggle/working/new_filtered_dataset.csv'
keywords_dataset_path = '/kaggle/input/unsplash-dataset-lite/keywords.tsv000'

# Load the filtered dataset
df_filtered = pd.read_csv(filtered_dataset_path)

# Load the keywords dataset with tab separation
df_keywords = pd.read_csv(keywords_dataset_path, sep='\t')

# Merge the two datasets based on the 'photo_id' column
merged_df = pd.merge(df_filtered, df_keywords[['photo_id', 'keyword', 'ai_service_1_confidence', 'ai_service_2_confidence']], on='photo_id', how='left')

# Save the merged dataset to a new CSV file
output_file_path = '/kaggle/working/merged_dataset.csv'  # Specify the desired output file path
merged_df.to_csv(output_file_path, index=False)
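
Because a photo can have many keywords, the left merge above can return more rows than the filtered dataset had. The sketch below, on made-up toy data, shows how the `validate` and `indicator` arguments of `pd.merge` can make that relationship explicit and count photos with no keyword match. The frame contents are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the filtered photos and the keywords table (made-up values).
photos = pd.DataFrame({"photo_id": ["a", "b", "c"]})
keywords = pd.DataFrame({
    "photo_id": ["a", "a", "b"],
    "keyword": ["beach", "sea", "forest"],
})

# validate="one_to_many" raises MergeError if photo_id is not unique on the left side;
# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(
    photos,
    keywords,
    on="photo_id",
    how="left",
    validate="one_to_many",
    indicator=True,
)

rows_per_photo = merged.groupby("photo_id").size()
unmatched = int((merged["_merge"] == "left_only").sum())

print(rows_per_photo)
print("photos without keywords:", unmatched)
```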

In [None]:
import pandas as pd

# Define the file path for the merged dataset
merged_dataset_path = '/kaggle/working/merged_dataset.csv'

# Load the merged dataset; low_memory=False parses the file in one pass so that
# columns with mixed content get a single inferred dtype
df_merged = pd.read_csv(merged_dataset_path, low_memory=False)

# Define a function to handle the conversion of 'exif_iso' column values
def convert_exif_iso(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return None

# Apply the custom conversion function to the 'exif_iso' column
df_merged['exif_iso'] = df_merged['exif_iso'].apply(convert_exif_iso)

# Remove rows with invalid values (NaN) in the 'exif_iso' and 'exif_aperture_value' columns
df_merged = df_merged.dropna(subset=['exif_iso', 'exif_aperture_value'])

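# Remove duplicate rows based on the 'photo_id' column.
# The keywords merge produced one row per (photo, keyword); keep='first' retains
# only the first keyword row for each photo.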
df_merged_no_duplicates = df_merged.drop_duplicates(subset='photo_id', keep='first')

# Save the dataset with duplicate rows removed to a new CSV file
output_file_path = '/kaggle/working/merged_dataset_no_duplicates_1.csv'  # Specify the desired output file path
df_merged_no_duplicates.to_csv(output_file_path, index=False)
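
The row-by-row `convert_exif_iso` function above could likely be replaced by a vectorized call. The sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN; the toy values are made up, and the two approaches may differ on unusual edge cases, so treat this as an approximation rather than a drop-in replacement.

```python
import pandas as pd

# Made-up rows: one valid ISO, one unparseable string, one missing value.
df = pd.DataFrame({
    "photo_id": ["a", "b", "c"],
    "exif_iso": ["100", "not available", None],
    "exif_aperture_value": ["2.8", "4.0", "1.8"],
})

# Vectorized conversion: anything that is not a number becomes NaN.
df["exif_iso"] = pd.to_numeric(df["exif_iso"], errors="coerce")

# Same cleaning rule as in the cell above: drop rows missing either EXIF field.
cleaned = df.dropna(subset=["exif_iso", "exif_aperture_value"])

print(cleaned)
```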

In [None]:
import pandas as pd

# Define the file paths for the datasets
conversions_file_path = '/kaggle/input/unsplash-dataset-lite/conversions.tsv000'
existing_dataset_path = '/kaggle/working/merged_dataset_no_duplicates_1.csv'
output_file_path = '/kaggle/working/merged_dataset_with_conversions.csv'

# Load the conversions dataset
conversions_df = pd.read_csv(conversions_file_path, sep='\t')  # Assuming it's a TSV file

# Load the existing dataset
existing_df = pd.read_csv(existing_dataset_path)

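# Merge the datasets using the 'photo_id' column.
# conversions.tsv000 also has a 'keyword' column; rename it so it does not collide
# with a 'keyword' column already present in the dataset.
conversions_subset = conversions_df[['photo_id', 'conversion_country', 'keyword']].rename(
    columns={'keyword': 'conversion_keyword'}
)
merged_df = existing_df.merge(conversions_subset, on='photo_id', how='left')

# A photo can have many conversions; keep only the first one per 'photo_id'
merged_df_no_duplicates = merged_df.drop_duplicates(subset='photo_id', keep='first')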

# Save the merged dataset to a new CSV file
merged_df_no_duplicates.to_csv(output_file_path, index=False)
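
The keywords, conversions, and colors files each contain a column named 'keyword'. When two frames share a non-key column name, `merge` appends the default suffixes `_x` and `_y`, which hides where each column came from. The sketch below, on made-up single-row frames, contrasts the default with an explicit `suffixes` argument; the suffix `_conversion` is a hypothetical choice for illustration.

```python
import pandas as pd

# Made-up single-row frames that share the non-key column 'keyword'.
existing = pd.DataFrame({"photo_id": ["a"], "keyword": ["beach"]})
conversions = pd.DataFrame({
    "photo_id": ["a"],
    "conversion_country": ["US"],
    "keyword": ["ocean"],
})

# Default behaviour: the overlapping columns become 'keyword_x' and 'keyword_y'.
default = existing.merge(conversions, on="photo_id", how="left")

# Explicit suffixes: keep the left name as-is and tag only the right-hand column.
explicit = existing.merge(
    conversions, on="photo_id", how="left", suffixes=("", "_conversion")
)

print(list(default.columns))
print(list(explicit.columns))
```

Renaming the column before merging achieves the same result and can be easier to read when several tables are merged in sequence.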

In [None]:
import pandas as pd

# Define the file paths for the datasets
colors_file_path = '/kaggle/input/unsplash-dataset-lite/colors.tsv000'
existing_dataset_path = '/kaggle/working/merged_dataset_with_conversions.csv'
output_file_path = '/kaggle/working/merged_dataset_with_colors.csv'

# Load the colors dataset
colors_df = pd.read_csv(colors_file_path, sep='\t')  # Assuming it's a TSV file

# Load the existing dataset
existing_df = pd.read_csv(existing_dataset_path)

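# Merge the datasets using the 'photo_id' column.
# colors.tsv000 also has a 'keyword' column; rename it so it cannot be confused
# with keyword columns already present in the dataset.
colors_subset = colors_df[
    ['photo_id', 'hex', 'red', 'green', 'blue', 'keyword', 'ai_coverage', 'ai_score']
].rename(columns={'keyword': 'color_keyword'})
merged_df = existing_df.merge(colors_subset, on='photo_id', how='left')

# A photo can have several color rows; keep only the first one per 'photo_id'
merged_df_no_duplicates = merged_df.drop_duplicates(subset='photo_id', keep='first')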

# Save the merged dataset to a new CSV file
merged_df_no_duplicates.to_csv(output_file_path, index=False)

In [None]:
import pandas as pd

# Define the file paths for the datasets
collections_file_path = '/kaggle/input/unsplash-dataset-lite/collections.tsv000'
existing_dataset_path = '/kaggle/working/merged_dataset_with_colors.csv'
output_file_path = '/kaggle/working/merged_dataset_with_collections.csv'

# Load the collections dataset
collections_df = pd.read_csv(collections_file_path, sep='\t')  # Assuming it's a TSV file

# Load the existing dataset
existing_df = pd.read_csv(existing_dataset_path)

# Merge the datasets using the 'photo_id' column
merged_df = existing_df.merge(
    collections_df[['photo_id', 'collection_title']],
    on='photo_id',
    how='left'
)

# A photo can belong to several collections; keep only the first one per 'photo_id'
merged_df_no_duplicates = merged_df.drop_duplicates(subset='photo_id', keep='first')

# Save the merged dataset to a new CSV file
merged_df_no_duplicates.to_csv(output_file_path, index=False)

In [None]:
import pandas as pd

# Define the file paths for the existing dataset and the output file
existing_dataset_path = '/kaggle/working/merged_dataset_with_collections.csv'
output_file_path = '/kaggle/working/dataset_final.csv'

# Load the existing dataset
existing_df = pd.read_csv(existing_dataset_path)

# List of columns to remove
columns_to_remove = [
    'photo_location_name',
    'photo_location_latitude',
    'photo_location_longitude',
    'photo_location_country',
    'photo_location_city'
]

# Drop the specified columns from the dataset
filtered_df = existing_df.drop(columns=columns_to_remove)

# Safety check: the previous step already left one row per 'photo_id',
# so this should not remove anything
filtered_df_no_duplicates = filtered_df.drop_duplicates(subset='photo_id', keep='first')

# Save the filtered dataset to a new CSV file
filtered_df_no_duplicates.to_csv(output_file_path, index=False)