[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/Eng-Dan/kaggle-happywhale-competition/blob/master/happywhale-dataframes.ipynb)

# Context
This notebook generates two .csv files to use as input data for the Happywhale 2022 competition:
* simplified_train.csv
* simplified_test.csv

The files contain the pixel arrays read from the resized train and test images in the dataset [Happywhale 2022 competition - Images 256 by 256](https://www.kaggle.com/datasets/engdan/happywhale-images-256-by-256).

The images are loaded as grayscale in order to reduce the size of the files. If your model needs color images, consider using the original images from the competition instead.
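
As a rough illustration of the size reduction, the arithmetic below compares the raw in-memory size of one 256 by 256 image stored as a single grayscale channel and as three color channels. It assumes 8-bit pixel values; the constant names are illustrative and not part of the notebook.

```python
# Raw size of one 256 x 256 image, assuming 1 byte (uint8) per channel value
HEIGHT, WIDTH = 256, 256
BYTES_PER_VALUE = 1

gray_bytes = HEIGHT * WIDTH * 1 * BYTES_PER_VALUE   # single grayscale channel
color_bytes = HEIGHT * WIDTH * 3 * BYTES_PER_VALUE  # three color channels

print(gray_bytes)                 # 65536
print(color_bytes)                # 196608
print(color_bytes // gray_bytes)  # 3
```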



# Packages and libraries

In [None]:
import os
import cv2
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Directory path variables

In [None]:
TRAIN_IMAGES_DIR = '../input/happywhale-images-256-by-256/resized_train_images'
TEST_IMAGES_DIR = '../input/happywhale-images-256-by-256/resized_test_images'

# Working dataframes

In [None]:
train_dataset_csv = '../input/happy-whale-and-dolphin/train.csv'
sample_submission_csv = '../input/happy-whale-and-dolphin/sample_submission.csv'

In [None]:
train_df = pd.read_csv(train_dataset_csv)
test_df = pd.read_csv(sample_submission_csv)

In [None]:
train_df.head(5)

In [None]:
test_df.head(5)

# Retrieving the images array

The function below adds an `image_array` column to the given dataframe, modifying it in place.

In [None]:
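def add_image_array_column(dataframe, images_source_dir):
    """Read every image listed in dataframe['image'] as grayscale and
    store the arrays in a new 'image_array' column (in place)."""
    images_array = []
    num_images = dataframe['image'].size
    print('Process started...')
    for index, image_name in enumerate(dataframe['image']):
        # cv2.imread returns None (it does not raise) if the file is missing or unreadable
        images_array.append(cv2.imread(os.path.join(images_source_dir, image_name),
                                       cv2.IMREAD_GRAYSCALE))

        # Report progress every 2000 images
        if (index % 2000) == 0:
            print('Images array added:', index, 'of', num_images)

    dataframe['image_array'] = images_array
    print('Process finished.')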
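
Because `cv2.imread` returns `None` for a missing or undecodable file instead of raising an error, a wrong path would be stored silently in `image_array`. A minimal pre-check using only the standard library could look like the sketch below. The helper name `find_missing_images` is hypothetical and not part of the notebook.

```python
import os

def find_missing_images(image_names, images_source_dir):
    """Return the file names that do not exist in images_source_dir."""
    return [name for name in image_names
            if not os.path.isfile(os.path.join(images_source_dir, name))]
```

In this notebook it might be called as `find_missing_images(train_df['image'], TRAIN_IMAGES_DIR)` before loading; an empty list means every listed file is present.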

First, let's apply the function to `train_df`.

In [None]:
add_image_array_column(train_df, TRAIN_IMAGES_DIR)

Now, let's check that everything worked as expected.

In [None]:
train_df.head(5)

In [None]:
image_check = train_df.iloc[47696]
print(image_check.image)
print(image_check.species)
print(image_check.individual_id, '\n')
plt.imshow(image_check.image_array, cmap='gray')

Finally, let's apply the same process to `test_df` (loaded from the `sample_submission.csv` file).

In [None]:
add_image_array_column(test_df, TEST_IMAGES_DIR)

# CSV files generation

In [None]:
train_df.to_csv('simplified_train.csv', index=False)

In [None]:
test_df.to_csv('simplified_test.csv', index=False)
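
One caveat worth checking before relying on these files: `to_csv` writes each NumPy array as its string form, and NumPy abbreviates arrays with more than 1000 elements using `...` by default, so a 256 by 256 array would probably not survive a round trip through CSV as written. The sketch below shows the effect on a toy dataframe, along with one possible workaround that serializes each array to a JSON list first. The `image_json` column name and the toy data are assumptions for illustration, not part of the notebook.

```python
import io
import json

import numpy as np
import pandas as pd

# Toy stand-in for one row of train_df: a uniform 256 x 256 grayscale image
toy_df = pd.DataFrame({'image': ['example.jpg']})
toy_df['image_array'] = [np.full((256, 256), 7, dtype=np.uint8)]

# Written as-is, the array is stored as its abbreviated string form
raw_csv = toy_df.to_csv(index=False)
print('...' in raw_csv)  # True: most pixel values are elided

# Possible workaround: store the pixels as a JSON list instead
toy_df['image_json'] = toy_df['image_array'].apply(lambda a: json.dumps(a.tolist()))
safe_csv = toy_df.drop(columns='image_array').to_csv(index=False)

# Reading back: parse the JSON text and rebuild the uint8 array
restored_df = pd.read_csv(io.StringIO(safe_csv))
restored = np.array(json.loads(restored_df['image_json'].iloc[0]), dtype=np.uint8)
print(restored.shape)  # (256, 256)
```

JSON keeps everything in a single CSV file at the cost of a larger text payload; whether that trade-off suits your pipeline is a judgment call.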