# pNeuma Vision transfer from Zenodo to Hugging Face

In this notebook we upload the pNeuma Vision dataset from Zenodo to the Hugging Face Hub in order to increase its accessibility. We convert the dataset to Parquet and push it to the Hub, which allows the dataset to be streamed for further tasks.


## Downloading the Zenodo dataset

For this task we will use two Linux command-line tools, wget and unzip, to fetch and unpack one of the archives:

https://zenodo.org/records/7426506/files/20181029_D4_0900_0930.zip?download=1

In [27]:
%%bash

# Quote the URL and the file name so the shell does not interpret the "?"
wget "https://zenodo.org/records/7426506/files/20181029_D4_0900_0930.zip?download=1"
unzip "20181029_D4_0900_0930.zip?download=1"

--2024-09-07 08:53:59--  https://zenodo.org/records/7426506/files/20181029_D4_0900_0930.zip?download=1
Resolving zenodo.org (zenodo.org)... 188.184.103.159, 188.184.98.238, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.103.159|:443... connected.


HTTP request sent, awaiting response... 200 OK
Length: 4729528134 (4.4G) [application/octet-stream]
Saving to: ‘20181029_D4_0900_0930.zip?download=1’

     0K .......... .......... .......... .......... ..........  0% 2.45M 30m39s
    50K .......... .......... .......... .......... ..........  0% 10.5M 18m53s
   100K .......... .......... .......... .......... ..........  0% 11.2M 14m50s
   150K .......... .......... .......... .......... ..........  0% 14.0M 12m29s
   200K .......... .......... .......... .......... ..........  0% 15.2M 10m58s
   250K .......... .......... .......... .......... ..........  0% 7.38M 10m50s
   300K .......... .......... .......... .......... ..........  0% 8.71M 10m31s
   350K .......... .......... .......... .......... ..........  0% 20.9M 9m39s
   400K .......... .......... .......... .......... ..........  0% 78.2M 8m41s
   450K .......... .......... .......... .......... ..........  0% 7.19M 8m52s
   500K .......... .......... .......... .......... 

Archive:  20181029_D4_0900_0930.zip?download=1
   creating: 20181029_D4_0900_0930/
   creating: 20181029_D4_0900_0930/Frames/
  inflating: 20181029_D4_0900_0930/Frames/00001.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00002.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00003.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00004.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00005.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00006.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00007.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00008.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00009.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00010.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00011.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00012.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00013.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00014.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00015.jpg  
  inflating: 20181029_D4_0900_0930/Frames/00016.jpg  
  inflatin
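
The same download step can also be done from Python, which makes it easy to save the archive under a clean name without the `?download=1` suffix. The following is a minimal sketch using only the standard library; `filename_from_url` and `download_and_extract` are hypothetical helper names, not part of the original notebook.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring the query string."""
    return PurePosixPath(urlparse(url).path).name


def download_and_extract(url: str, dest_dir: str = ".") -> Path:
    """Download a zip archive into dest_dir and unpack it there (hypothetical helper)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / filename_from_url(url)
    # Copy the response to disk in chunks so the archive is never held in memory
    with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
        shutil.copyfileobj(response, out)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return archive


URL = "https://zenodo.org/records/7426506/files/20181029_D4_0900_0930.zip?download=1"
print(filename_from_url(URL))  # 20181029_D4_0900_0930.zip
```

Calling `download_and_extract(URL, "/app/data")` would then fetch and unpack the archive; the call is left out of the block because the file is about 4.4 GB.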

## Parquet creation

Before uploading the dataset we first need to transform the data into tabular form. To do so, we encode each frame's annotations as JSON and store the result in a column next to the image. This makes it easier to work with each image, since its annotations are in the same row. At this level we don't expect users to access individual annotations directly; for that use case, please refer to the next notebook.

Disclaimer: For this proof of concept we will consider only a few frames related to one of the compressed files.

In [1]:
datafolder = "/app/data/20181029_D4_0900_0930"

sample_id = "20181029_D4_0900_0930"
date = "2018-10-29"
drone = "D4"
timestamp_start = "09:00"
timestamp_end =  "09:30"
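
The metadata values above are typed by hand, but they mirror the archive name. As a sketch, and assuming every archive follows the `<YYYYMMDD>_<drone>_<HHMM>_<HHMM>` pattern seen in this one example (an assumption, not something the notebook confirms), they could be derived from `sample_id`. `parse_sample_id` is a hypothetical helper.

```python
from datetime import datetime


def parse_sample_id(sample_id: str) -> dict:
    """Split an archive name such as '20181029_D4_0900_0930' into metadata fields.

    Assumes the pattern <YYYYMMDD>_<drone>_<HHMM start>_<HHMM end>.
    """
    date_raw, drone, start_raw, end_raw = sample_id.split("_")
    return {
        "id": sample_id,
        # strptime also validates that the first part is a real calendar date
        "Date": datetime.strptime(date_raw, "%Y%m%d").strftime("%Y-%m-%d"),
        "Drone": drone,
        "Timestamp_start": f"{start_raw[:2]}:{start_raw[2:]}",
        "Timestamp_end": f"{end_raw[:2]}:{end_raw[2:]}",
    }


print(parse_sample_id("20181029_D4_0900_0930"))
```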

In [5]:
# Build a Dataset with one row per frame: the image plus its annotations as JSON

import os
from pathlib import Path

import pandas as pd
from datasets import Dataset, Features, Image, Value
from PIL import Image as PILImage

# Define working directories
workdir = Path(datafolder)  # Update this to your data directory
annotations_dir = workdir / "Annotations"
frames_dir = workdir / "Frames"
#out_path = workdir / "my_data.parquet"

# List all annotation files
annotation_files = sorted(annotations_dir.glob("*.csv"))

# Define your features
features = Features({
    'id': Value(dtype='string'),
    'Date': Value(dtype='string'),
    'Drone': Value(dtype='string'),
    'Timestamp_start': Value(dtype='string'),
    'Timestamp_end': Value(dtype='string'),
    'Frame': Value(dtype='string'),
    'Image': Image(decode=True),
    'Annotation_json': Value(dtype='string')
})

data = []

# A single Image feature instance is enough to encode every frame
image_feature = Image()

for annotation_file in annotation_files:
    # Annotation and frame share a base name, e.g. 00001.csv <-> 00001.jpg
    frame_number = annotation_file.stem
    image_path = frames_dir / f"{frame_number}.jpg"

    # Serialize all annotations of this frame into one JSON string
    annotation_df = pd.read_csv(annotation_file)
    annotation_json = annotation_df.to_json(orient='columns', date_format='iso', double_precision=2, force_ascii=True, default_handler=str)

    # Encode the frame while the file is open, then close it again
    with PILImage.open(image_path) as img:
        encoded_image = image_feature.encode_example(img)

    row_data = {
        'id': sample_id,
        'Date': date,
        'Drone': drone,
        'Timestamp_start': timestamp_start,
        'Timestamp_end': timestamp_end,
        'Frame': frame_number,
        'Image': encoded_image,
        'Annotation_json': annotation_json
    }

    # Append row data to your data structure
    data.append(row_data)

# Convert the collected data into a DataFrame
df = pd.DataFrame(data)

# Convert the DataFrame to a Dataset
dataset = Dataset.from_pandas(df, features=features)
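
The loop above assumes that every annotation CSV has a matching JPEG frame with the same base name. A small pre-check, sketched below with the standard library only, pairs the files up front and reports annotations without a frame instead of failing halfway through. `pair_annotations_with_frames` is a hypothetical helper; the `Annotations` and `Frames` folder names are taken from the cell above.

```python
from pathlib import Path


def pair_annotations_with_frames(workdir):
    """Return (pairs, missing): matched (csv, jpg) paths and CSV names without a frame."""
    workdir = Path(workdir)
    pairs, missing = [], []
    for csv_path in sorted((workdir / "Annotations").glob("*.csv")):
        jpg_path = workdir / "Frames" / f"{csv_path.stem}.jpg"
        if jpg_path.exists():
            pairs.append((csv_path, jpg_path))
        else:
            missing.append(csv_path.name)
    return pairs, missing
```

With `pairs, missing = pair_annotations_with_frames(datafolder)`, the main loop can iterate over `pairs` and `missing` can be logged or inspected separately.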




In [3]:
dataset

Dataset({
    features: ['id', 'Date', 'Drone', 'Timestamp_start', 'Timestamp_end', 'Frame', 'Image', 'Annotation_json'],
    num_rows: 3
})
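
Since `Annotation_json` is produced with `to_json(orient='columns')`, each value has the shape `{column: {row_index: value}}`, with the row indices stored as strings. The sketch below turns such a string back into one dictionary per annotated vehicle using only the `json` module. `annotation_records` is a hypothetical helper and the sample string contains made-up values for illustration only.

```python
import json


def annotation_records(annotation_json: str) -> list:
    """Convert column-oriented JSON ({column: {row: value}}) into a list of row dicts."""
    columns = json.loads(annotation_json)
    if not columns:
        return []
    # Row keys are strings such as "0", "1", ...; sort them numerically
    row_keys = sorted(next(iter(columns.values())), key=int)
    return [
        {name: values[key] for name, values in columns.items()}
        for key in row_keys
    ]


# Made-up example values, only to show the shape of the data
sample = (
    '{"ID":{"0":1,"1":2},"Type":{"0":"Car","1":"Taxi"},'
    '"x_img [px]":{"0":100,"1":250},"y_img [px]":{"0":40,"1":60}}'
)
for record in annotation_records(sample):
    print(record)
```

If pandas is available, `pd.read_json(io.StringIO(annotation_json), orient='columns')` is an alternative that returns a DataFrame directly.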

## Upload of dataset to Hugging Face

Previously we created a dataset repository on Hugging Face that can be accessed here: https://huggingface.co/datasets/katospiegel/pneuma-vision-parquet

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Or run huggingface-cli login in the terminal and enter your token

In [6]:
from datasets import load_dataset

dataset.push_to_hub("katospiegel/pneuma-vision-parquet")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]



README.md:   0%|          | 0.00/588 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/katospiegel/pneuma-vision-parquet/commit/eeba4c3657631e9b262539e05b2700d54c51b43a', commit_message='Upload dataset', commit_description='', oid='eeba4c3657631e9b262539e05b2700d54c51b43a', pr_url=None, pr_revision=None, pr_num=None)

And finally we can preview the resulting dataset on Hugging Face.

In [8]:
%%html

<iframe
  src="https://huggingface.co/datasets/katospiegel/pneuma-vision-parquet/embed/viewer/default/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

In [15]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
from PIL import Image

def image_to_bytes(image_path):
    with Image.open(image_path) as img:
        return img.tobytes()


# Define working directories
workdir = Path(datafolder)  # Update this to your data directory
annotations_dir = workdir / "Annotations"
frames_dir = workdir / "Frames"
out_path = workdir / "my_data.parquet"

# List all annotation files
annotation_files = sorted(annotations_dir.glob("*.csv"))

data = {
    "images": [],
    "metadata": []
}

# Loop through all annotation files and load the corresponding images
for annotation_file in annotation_files:
    # Extract the base filename without extension (e.g., '00001' from '00001.csv')
    base_name = annotation_file.stem
    
    # Define the corresponding image file path
    image_path = frames_dir / f"{base_name}.jpg"
    
    # Read the CSV metadata
    metadata = pd.read_csv(annotation_file)
    
    # Convert the image to bytes
    image_bytes = image_to_bytes(image_path)
    
    # Append data to the list
    data["images"].append(image_bytes)
    data["metadata"].append(metadata.to_dict(orient="records"))  # Convert metadata to a list of dictionaries

# Convert the collected data into a DataFrame
df = pd.DataFrame(data)

# Convert the DataFrame to Arrow Table
table = pa.Table.from_pandas(df)

# Write the table to a Parquet file
pq.write_table(table, out_path, row_group_size=10, compression='gzip')

print(f"Parquet file saved to {out_path}")

Parquet file saved to /app/data/20181029_D4_0900_0930/my_data.parquet
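
One caveat about the cell above: Pillow's `Image.tobytes()` returns the raw decoded pixel data, without the image size or mode, so the stored column is much larger than the JPEG files and cannot be turned back into an image unless the dimensions are kept elsewhere. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to store the encoded file bytes unchanged. The `{"bytes", "path"}` layout is what the `datasets` `Image` feature uses for storage; `encoded_image_cell` and `looks_like_jpeg` are hypothetical helpers.

```python
from pathlib import Path

# Every JPEG file starts with these three bytes
JPEG_MAGIC = b"\xff\xd8\xff"


def encoded_image_cell(image_path) -> dict:
    """Return the file's encoded bytes and name, without decoding the image."""
    path = Path(image_path)
    return {"bytes": path.read_bytes(), "path": path.name}


def looks_like_jpeg(data: bytes) -> bool:
    """Cheap sanity check that the bytes are an encoded JPEG, not raw pixels."""
    return data.startswith(JPEG_MAGIC)
```

In the loop, `data["images"].append(encoded_image_cell(image_path))` would replace the `image_to_bytes` call.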


In [16]:
from datasets import load_dataset

dataset = load_dataset("parquet", data_files="/app/data/20181029_D4_0900_0930/my_data.parquet")
dataset.push_to_hub("katospiegel/pneuma-vision-parquet")

Generating train split: 0 examples [00:00, ? examples/s]

Uploading the dataset shards:   0%|          | 0/3 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/609 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/katospiegel/pneuma-vision-parquet/commit/921574907541e907351f6c28a91dfe7454440205', commit_message='Upload dataset', commit_description='', oid='921574907541e907351f6c28a91dfe7454440205', pr_url=None, pr_revision=None, pr_num=None)

In [2]:
from datasets import load_dataset

# dataset = load_dataset("stevhliu/demo")
# # dataset = dataset.map(...)  # do all your processing here
# dataset.push_to_hub("stevhliu/processed_demo")

In [13]:
dataset = load_dataset("katospiegel/pneuma-vision-parquet")

Downloading readme:   0%|          | 0.00/609 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/381M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/383M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/362M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50 [00:00<?, ? examples/s]
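
As the progress output shows, `load_dataset` downloads every shard before the first row is available. Passing `streaming=True` instead returns an iterable dataset that fetches rows on demand. The sketch below assumes the repository exposes a `train` split, as the output above suggests; `first_rows` and `stream_first_rows` are hypothetical helpers, and `datasets` is imported lazily so that `first_rows` works on its own.

```python
from itertools import islice


def first_rows(rows, n=3):
    """Take the first n items from any iterable without consuming the rest."""
    return list(islice(rows, n))


def stream_first_rows(repo_id="katospiegel/pneuma-vision-parquet", n=3):
    """Fetch only the first n rows of the train split from the Hub."""
    # Imported here so the helper above does not require the datasets package
    from datasets import load_dataset

    # streaming=True yields rows lazily instead of downloading all shards
    stream = load_dataset(repo_id, split="train", streaming=True)
    return first_rows(stream, n)
```

Calling `stream_first_rows(n=2)` needs network access and returns a list of two row dictionaries.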

## Bounding box

In [24]:
import os
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import Dataset, Features, Image, Value
from PIL import Image as PILImage

def crop_image(image_path, x, y, box_size=80):
    with PILImage.open(image_path) as img:
        half_box_size = box_size // 2
        left = max(x - half_box_size, 0)
        upper = max(y - half_box_size, 0)
        right = left + box_size
        lower = upper + box_size
        cropped_img = img.crop((left, upper, right, lower))
        return cropped_img

# Define your features
features = Features({
    'Time [s]': Value(dtype='float32'),
    'ID': Value(dtype='int32'),
    'Type': Value(dtype='string'),
    'x_img [px]': Value(dtype='int32'),
    'y_img [px]': Value(dtype='int32'),
    'Angle_img [rad]': Value(dtype='float32'),
    'Frame': Value(dtype='string'),
    'image': Image(decode=True),
})

data = []  # Initialize your data structure

for annotation_file in annotation_files:
    base_name = os.path.splitext(os.path.basename(annotation_file))[0]
    frame_number = base_name  # Assuming the frame number is the base name of the file
    image_path = os.path.join(frames_dir, f"{base_name}.jpg")
    
    metadata = pd.read_csv(annotation_file)
    
    for index, row in metadata.iterrows():
        x, y = row["x_img [px]"], row["y_img [px]"]
        cropped_img = crop_image(image_path, x, y)
        
        feature = Image()
        # Prepare row data with all columns and the PIL image
        row_data = {
            "Time [s]": row["Time [s]"],
            "ID": row["ID"],
            "Type": row["Type"],
            "x_img [px]": x,
            "y_img [px]": y,
            "Angle_img [rad]": row["Angle_img [rad]"],
            "Frame": frame_number,
            "image": feature.encode_example(cropped_img)
        }
        
        # Append row data to your data structure
        data.append(row_data)

# Convert the collected data into a DataFrame
df = pd.DataFrame(data)

# Convert the DataFrame to a Dataset
dataset = Dataset.from_pandas(df, features=features)


# # Storing locally
# # Convert the DataFrame to Arrow Table
# table = pa.Table.from_pandas(df)

# # Write the table to a Parquet file
# pq.write_table(table, out_path, row_group_size=10, compression='gzip')

# print(f"Parquet file saved to {out_path}")
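
The `crop_image` helper above clamps only the left and upper edges, so for vehicles near the right or bottom border the crop box extends past the image and Pillow fills the outside area with black. If that is not wanted, the box can be shifted back inside the frame. The following is a sketch of that arithmetic in plain Python; `crop_box` is a hypothetical helper, and the image `width` and `height` must be supplied by the caller (for example from `img.size`).

```python
def crop_box(x, y, width, height, box_size=80):
    """Return (left, upper, right, lower) centred on (x, y) and kept inside the image."""
    half = box_size // 2
    # Clamp the top-left corner so that the whole box fits whenever possible
    left = min(max(int(x) - half, 0), max(width - box_size, 0))
    upper = min(max(int(y) - half, 0), max(height - box_size, 0))
    # If the image is smaller than the box, stop at the image border
    right = min(left + box_size, width)
    lower = min(upper + box_size, height)
    return left, upper, right, lower


# Hypothetical 1000x1000 image: centre, top-left corner, bottom-right corner
print(crop_box(100, 100, 1000, 1000))  # (60, 60, 140, 140)
print(crop_box(10, 10, 1000, 1000))    # (0, 0, 80, 80)
print(crop_box(990, 995, 1000, 1000))  # (920, 920, 1000, 1000)
```

The returned tuple can be passed straight to `img.crop(...)`.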

In [25]:
dataset.push_to_hub("katospiegel/pneuma-vision-parquet")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Map:   0%|          | 0/7001 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/71 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/581 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/katospiegel/pneuma-vision-parquet/commit/23d1a0935e893f3b1b8f7ba87ebfba93bbac167f', commit_message='Upload dataset', commit_description='', oid='23d1a0935e893f3b1b8f7ba87ebfba93bbac167f', pr_url=None, pr_revision=None, pr_num=None)

In [18]:
from datasets import load_dataset

dataset = load_dataset("parquet", data_files="/app/data/20181029_D4_0900_0930/mybbox_data.parquet")
dataset.push_to_hub("katospiegel/pneuma-vision-parquet")

Generating train split: 0 examples [00:00, ? examples/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/71 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/609 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/katospiegel/pneuma-vision-parquet/commit/984f89bcfc3d56848fea5adbe2df45984c22b8a4', commit_message='Upload dataset', commit_description='', oid='984f89bcfc3d56848fea5adbe2df45984c22b8a4', pr_url=None, pr_revision=None, pr_num=None)