#### Data Preparation - Test and Train

This notebook does the following:

- Iterates through all folders in the frames/ directory and collects every image filename in a list.

- Defines a helper function that checks whether each row in the DataFrame corresponds to a sampled image (matched on video_id and frame), so the DataFrame can be aligned with the list of sampled frames.

- Splits the filtered dataset into training and testing sets (80/20 split) and saves them to CSV files.

In [None]:
import os
import pandas as pd

# Path to the frames folder
frames_folder = "frames"

# Load the combined DataFrame (expected to contain 'video_id' and 'frame' columns)
df = pd.read_csv("data/combined-df.csv")

# List all video folders
video_folders = os.listdir(frames_folder)

# Create a list of all sampled image paths
sampled_images = []
for video in video_folders:
    video_path = os.path.join(frames_folder, video)
    if os.path.isdir(video_path):
        for image in os.listdir(video_path):
            sampled_images.append(os.path.join(video, image))

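# Check whether a DataFrame row corresponds to one of the sampled images
def is_image_in_sample(row):
    video_id = row['video_id']
    frame = row['frame']
    # Sampled frames are expected to be named "<video_id>_frame<frame>.jpg"
    image_name = f"{video_id}_frame{frame}.jpg"
    # Substring match against every sampled path (video_folder/image_file)
    return any(image_name in img for img in sampled_images)

# Keep only the rows with a sampled image
# (assumes the DataFrame has columns 'video_id' and 'frame')
filtered_df = df[df.apply(is_image_in_sample, axis=1)].reset_index(drop=True)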
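
The helper above scans the whole list of sampled paths for every row and uses a substring test, so a shorter video_id could match inside a longer filename. As a hedged alternative, the sketch below compares exact filenames through a set. The function name `filter_to_sampled` and the toy data are illustrative assumptions, not part of the original notebook, and it relies on the same `<video_id>_frame<frame>.jpg` naming convention.

```python
import os
import pandas as pd

def filter_to_sampled(df, sampled_images):
    """Keep rows whose expected frame filename is among the sampled images."""
    # Exact basenames in a set give constant-time membership tests
    sampled_names = {os.path.basename(p) for p in sampled_images}
    # Same naming assumption as the helper above: "<video_id>_frame<frame>.jpg"
    expected = (
        df["video_id"].astype(str) + "_frame" + df["frame"].astype(str) + ".jpg"
    )
    return df[expected.isin(sampled_names)].reset_index(drop=True)

# Toy data (hypothetical): only CT2 frame 1 was sampled
toy_df = pd.DataFrame({"video_id": ["CT2", "CT2", "T2"], "frame": [1, 2, 1]})
toy_images = [os.path.join("CT2", "CT2_frame1.jpg")]
toy_filtered = filter_to_sampled(toy_df, toy_images)
print(toy_filtered)
```

In the toy data, the row with video_id "T2" is dropped here, whereas a substring test would accept it because "T2_frame1.jpg" occurs inside "CT2_frame1.jpg".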


In [None]:
filtered_df.to_csv("data/filtered-sampled-df.csv", index=False)

Group the filtered DataFrame by video_id and count the number of frames for each video.

In [18]:
frame_counts = filtered_df.groupby('video_id').size().reset_index(name='frame_count')
pd.set_option('display.max_rows', None)

# Print the frame_counts DataFrame
print(frame_counts)

   video_id  frame_count
0      CT11           20
1      CT12           20
2      CT13           20
3      CT14           20
4      CT15           20
5      CT18           20
6      CT19           20
7       CT2           20
8      CT22           20
9      CT25           20
10     CT26           20
11     CT27           20
12     CT28           20
13     CT29           20
14      CT3           20
15      CT5           20
16      CT6           20
17      CT8           20
18   Cric11           20
19   Cric12           20
20   Cric13           20
21   Cric14           20
22   Cric15           17
23   Cric18           20
24   Cric21           20
25   Cric22           20
26   Cric23           20
27   Cric25           20
28   Cric26           20
29   Cric27           20
30   Cric28           20
31   Cric33           20
32   Cric34           20
33   Cric35           20
34   Cric37           20
35    Cric5           20
36    Cric6           20
37    Cric8           20
38     IO11           20
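
Almost every video in the table has 20 frames, but Cric15 has only 17. A minimal sketch for flagging such videos automatically follows. The threshold `EXPECTED_FRAMES = 20` and the helper name `find_short_videos` are assumptions made for illustration, and the toy table is a small excerpt of the output above.

```python
import pandas as pd

EXPECTED_FRAMES = 20  # assumption: intended number of sampled frames per video

def find_short_videos(frame_counts, expected=EXPECTED_FRAMES):
    """Return the rows of frame_counts with fewer frames than expected."""
    return frame_counts[frame_counts["frame_count"] < expected]

# Excerpt of the frame_counts table printed above
toy_counts = pd.DataFrame(
    {"video_id": ["Cric14", "Cric15", "Cric18"], "frame_count": [20, 17, 20]}
)
short = find_short_videos(toy_counts)
print(short)
```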


Split the filtered dataset into training and testing sets (80/20 split, rows shuffled with a fixed random seed) and save both sets to CSV files.

In [None]:
from sklearn.model_selection import train_test_split

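# Split the filtered DataFrame into train (80%) and test (20%) sets;
# random_state fixes the shuffle so the split is reproducible
train_df, test_df = train_test_split(filtered_df, test_size=0.2, random_state=42)

# Save both splits to CSV (the ../data_csv directory must already exist)
train_df.to_csv("../data_csv/train_data.csv", index=False)
test_df.to_csv("../data_csv/test_data.csv", index=False)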
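
The split above is row-level, so frames from the same video can land in both the training and the test set. If that overlap matters for the downstream task, one possible alternative (not what this notebook does) is a group-aware split with scikit-learn's `GroupShuffleSplit`, which keeps all frames of a video on one side. The helper name `split_by_video` and the toy DataFrame are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_video(df, test_size=0.2, random_state=42):
    """Split df so that every video_id appears in only one of the two sets."""
    splitter = GroupShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # test_size applies to the number of groups (videos), not rows
    train_idx, test_idx = next(splitter.split(df, groups=df["video_id"]))
    return df.iloc[train_idx], df.iloc[test_idx]

# Toy data (hypothetical): 10 videos with 4 frames each
toy = pd.DataFrame(
    {
        "video_id": [f"V{i}" for i in range(10) for _ in range(4)],
        "frame": list(range(4)) * 10,
    }
)
toy_train, toy_test = split_by_video(toy)

# No video should appear on both sides of the split
overlap = set(toy_train["video_id"]) & set(toy_test["video_id"])
print(len(toy_train), len(toy_test), overlap)
```

Because the test fraction is measured in videos, the row-level ratio can drift from 80/20 when videos have different frame counts.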