<a href="https://colab.research.google.com/github/raz0208/City-Person-Dataset-EDA/blob/main/CityPersonDatasetEDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CityPerson Dataset EDA (Exploratory Data Analysis)**
A complete exploratory data analysis (EDA) of the CityPerson dataset.

## Datasets Used:


*   gtFine_trainvaltest
*   gtFinePanopticParts_trainval

## Step 1: Extract and read the datasets

In [1]:
# Import required libraries
import zipfile
import os
import random

In [2]:
# Paths to the dataset zip files on Google Drive (assumes Drive is mounted at /content/drive)
gtFine = '/content/drive/MyDrive/CityPersonDataset/gtFine_trainvaltest.zip'
gtFinePanopticParts = '/content/drive/MyDrive/CityPersonDataset/gtFinePanopticParts_trainval.zip'

gtFine_ExtPath = '/content/CityPersonDataset/gtFine_trainvaltest'
gtFinePano_ExtPath = '/content/CityPersonDataset/gtFinePanopticParts_trainval'

In [3]:
# Extract a zip archive into extract_path, creating the directory if needed
def extract_zip(file_path, extract_path):
    os.makedirs(extract_path, exist_ok=True)
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_path)

# Extract both dataset zip files
extract_zip(gtFine, gtFine_ExtPath)
extract_zip(gtFinePanopticParts, gtFinePano_ExtPath)
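
Before extracting a large archive, it can help to check what it contains. The following is a minimal sketch, not part of the original notebook: `summarize_zip` is a hypothetical helper that counts the file entries of an archive by top-level folder, using only the standard library.

```python
import zipfile
from collections import Counter


def summarize_zip(file_path):
    """Count the file entries in a zip archive, grouped by top-level name.

    Hypothetical helper for illustration (not part of the original notebook).
    """
    counts = Counter()
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        for info in zip_ref.infolist():
            if info.is_dir():
                continue  # skip directory entries, count files only
            # The first path component is the top-level folder (or a root-level file)
            top_level = info.filename.split('/', 1)[0]
            counts[top_level] += 1
    return dict(counts)
```

Calling `summarize_zip(gtFine)` reads only the archive index, so it should be much cheaper than a full extraction.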

## Dataset Structure:

1. gtFine_trainvaltest:
    * Contains PNG and JSON files.
    * Organized into train, validation, and test folders.
    * File types:
      * `*_color.png`: Color-coded images for segmentation.
      * `*_instanceIds.png`: Encoded image masks in which each pedestrian is represented by a unique ID.
      * `*_labelIds.png`: Encoded image masks for class labels.
      * `*_polygons.json`: Polygon annotations for semantic and instance segmentation.
2. gtFinePanopticParts_trainval:
    * Contains TIF files.
    * Organized similarly to gtFine_trainvaltest.
    * File type:
      * `*_gtFinePanopticParts.tif`: Panoptic segmentation with part-level annotations (e.g., parts of a pedestrian such as arms or legs).
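
To make the "unique ID" encoding of `*_instanceIds.png` more concrete, here is a small sketch. It assumes the standard Cityscapes convention, which this notebook does not state: pixels of classes annotated per instance store `labelId * 1000 + instance index`, all other pixels store the plain `labelId`, and `person` has label ID 24. `PERSON_LABEL_ID`, `decode_instance_id`, and `is_person` are illustrative names.

```python
PERSON_LABEL_ID = 24  # assumption: Cityscapes label ID for 'person'


def decode_instance_id(pixel_value):
    """Split an instanceIds pixel value into (label_id, instance_index).

    Assumes the Cityscapes convention: values >= 1000 encode
    label_id * 1000 + instance_index; smaller values are a plain label ID
    without instance information (instance_index is None).
    """
    if pixel_value >= 1000:
        label_id, instance_index = divmod(pixel_value, 1000)
        return label_id, instance_index
    return pixel_value, None


def is_person(pixel_value):
    """Return True if the pixel value belongs to the 'person' class."""
    label_id, _ = decode_instance_id(pixel_value)
    return label_id == PERSON_LABEL_ID
```

Under these assumptions, the distinct pixel values of a mask for which `is_person` is true would correspond to the individual pedestrians in that image.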

In [4]:
# List the extracted content from both datasets
gtFine_Files = os.listdir(gtFine_ExtPath)
gtFinepano_Files = os.listdir(gtFinePano_ExtPath)

gtFine_Files, gtFinepano_Files

(['license.txt', 'README', 'gtFine'],
 ['license.txt', 'README_panopticParts.md', 'gtFinePanopticParts'])

In [5]:
# Path to core folder
gtFine_CorePath = os.path.join(gtFine_ExtPath, 'gtFine')
gtFinePano_CorePath = os.path.join(gtFinePano_ExtPath, 'gtFinePanopticParts')

# List directories inside core folders
gtFine_Dirs = os.listdir(gtFine_CorePath) if os.path.exists(gtFine_CorePath) else []
gtFinePano_Dirs = os.listdir(gtFinePano_CorePath) if os.path.exists(gtFinePano_CorePath) else []

gtFine_Dirs, gtFinePano_Dirs

(['test', 'train', 'val'], ['train', 'val'])

In [6]:
##### SIMPLE IMPLEMENTATION

# # Listing the sample files from the 'train' directory if it exists in both datasets
# gtfine_trainSample = os.listdir(os.path.join(gtFine_CorePath, 'train')) if 'train' in gtFine_Dirs else []
# gtFinePano_trainSample = os.listdir(os.path.join(gtFinePano_CorePath, 'train')) if 'train' in gtFinePano_Dirs else []

# gtFine_Dirs, gtfine_trainSample[:], gtFinePano_Dirs, gtFinePano_trainSample[:]

######

# Define the subdirectories
subdirs = gtFine_Dirs #["train", "val", "test"]

# Initialize dictionaries to store samples from each subdirectory
gtFine_CityFolders = {}
gtFinePano_CityFolders = {}

# Process each subdirectory
for subdir in subdirs:
    gtFine_CityFolders[subdir] = os.listdir(os.path.join(gtFine_CorePath, subdir)) if subdir in gtFine_Dirs else []
    gtFinePano_CityFolders[subdir] = os.listdir(os.path.join(gtFinePano_CorePath, subdir)) if subdir in gtFinePano_Dirs else []

# Copy the full list of city folders for each subdirectory (value[:] keeps every entry)
gtFine_CityFolders_Preview = {key: value[:] for key, value in gtFine_CityFolders.items()}
gtFinePano_CityFolders_Preview = {key: value[:] for key, value in gtFinePano_CityFolders.items()}

gtFine_CityFolders_Preview, gtFinePano_CityFolders_Preview

({'test': ['bonn', 'leverkusen', 'mainz', 'munich', 'berlin', 'bielefeld'],
  'train': ['stuttgart',
   'hamburg',
   'hanover',
   'bremen',
   'tubingen',
   'zurich',
   'weimar',
   'krefeld',
   'bochum',
   'ulm',
   'erfurt',
   'darmstadt',
   'monchengladbach',
   'jena',
   'aachen',
   'dusseldorf',
   'cologne',
   'strasbourg'],
  'val': ['lindau', 'munster', 'frankfurt']},
 {'test': [],
  'train': ['stuttgart',
   'hamburg',
   'hanover',
   'bremen',
   'tubingen',
   'zurich',
   'weimar',
   'krefeld',
   'bochum',
   'ulm',
   'erfurt',
   'darmstadt',
   'monchengladbach',
   'jena',
   'aachen',
   'dusseldorf',
   'cologne',
   'strasbourg'],
  'val': ['lindau', 'munster', 'frankfurt']})

## File structure

For example, the contents of the "bochum" city directory confirm the expected file formats and the relationship between the two datasets:

1. gtFine Dataset (bochum):
    - Files include:
       - Color-coded images (e.g., bochum_000000_000313_gtFine_color.png).
       - Instance masks (e.g., bochum_000000_000313_gtFine_instanceIds.png).
       - Label masks (e.g., bochum_000000_000313_gtFine_labelIds.png).
       - Polygon annotations (e.g., bochum_000000_000313_gtFine_polygons.json).

2. gtFinePanopticParts Dataset (bochum):
    - Files include:
      - Panoptic segmentation with part-level detail (e.g., bochum_000000_000313_gtFinePanopticParts.tif).
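
One way to check this structure for any city is to count files per annotation suffix. The sketch below is illustrative only: `KNOWN_SUFFIXES` and `count_by_suffix` are hypothetical names, and the suffix list is taken from the file names shown above.

```python
import os
from collections import Counter

# Suffixes taken from the example file names listed above
KNOWN_SUFFIXES = (
    '_gtFine_color.png',
    '_gtFine_instanceIds.png',
    '_gtFine_labelIds.png',
    '_gtFine_polygons.json',
    '_gtFinePanopticParts.tif',
)


def count_by_suffix(city_dir):
    """Count the files in a city directory, grouped by annotation suffix.

    Files that match none of the known suffixes are counted under 'other'.
    """
    counts = Counter()
    for name in os.listdir(city_dir):
        suffix = next((s for s in KNOWN_SUFFIXES if name.endswith(s)), 'other')
        counts[suffix] += 1
    return dict(counts)
```

For instance, `count_by_suffix(os.path.join(gtFine_CorePath, 'train', 'bochum'))` could be used to check that the four gtFine file types occur equally often, which is what one would expect if every frame has a full set of annotations.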

## Observed Relationship:
- The filenames share the same prefix across both datasets, so each panoptic-parts file can be paired with its gtFine annotations.
  - For example: bochum_000000_000313_gtFinePanopticParts.tif aligns with the gtFine color, instance, label, and polygon files that start with bochum_000000_000313.
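
The alignment can also be checked programmatically rather than by eye. The following is a minimal sketch under the assumption that the part of a file name before the dataset suffix identifies the frame; `frame_ids` and `compare_city` are hypothetical helper names, and the label masks are used as the gtFine reference only by choice.

```python
import os

GTFINE_SUFFIX = '_gtFine_labelIds.png'       # reference file type in gtFine
PANO_SUFFIX = '_gtFinePanopticParts.tif'     # file type in gtFinePanopticParts


def frame_ids(city_dir, suffix):
    """Return the set of file name prefixes for files ending with `suffix`."""
    return {
        name[:-len(suffix)]
        for name in os.listdir(city_dir)
        if name.endswith(suffix)
    }


def compare_city(gtfine_city_dir, pano_city_dir):
    """Compare the frames found in one city folder of each dataset."""
    gt = frame_ids(gtfine_city_dir, GTFINE_SUFFIX)
    pano = frame_ids(pano_city_dir, PANO_SUFFIX)
    return {
        'matched': len(gt & pano),            # frames present in both datasets
        'only_gtFine': sorted(gt - pano),     # frames missing from panoptic parts
        'only_panoptic': sorted(pano - gt),   # frames missing from gtFine
    }
```

If the two datasets are fully aligned for a city, both `only_gtFine` and `only_panoptic` should come back empty.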

In [26]:
# Randomly select 3 city names from the 'train' folder to check their files
selected_cities = random.sample(gtFine_CityFolders['train'], 3)

# List files for each selected city
gtFine_city_files = {}
gtFinePano_city_files = {}

for city in selected_cities:
    gtFine_city_path = os.path.join(gtFine_CorePath, "train", city)
    gtFinePano_city_path = os.path.join(gtFinePano_CorePath, "train", city)

    # os.listdir returns entries in arbitrary order, so this takes 5 arbitrary files and sorts them for display
    gtFine_city_files[city] = sorted(os.listdir(gtFine_city_path)[0:5] if os.path.exists(gtFine_city_path) else [])
    gtFinePano_city_files[city] = sorted(os.listdir(gtFinePano_city_path)[0:5] if os.path.exists(gtFinePano_city_path) else [])

# Preview sample files for each city
gtFine_city_files, gtFinePano_city_files

({'bochum': ['bochum_000000_007150_gtFine_polygons.json',
   'bochum_000000_011255_gtFine_instanceIds.png',
   'bochum_000000_025833_gtFine_color.png',
   'bochum_000000_032169_gtFine_color.png',
   'bochum_000000_038150_gtFine_polygons.json'],
  'erfurt': ['erfurt_000000_000019_gtFine_labelIds.png',
   'erfurt_000025_000019_gtFine_color.png',
   'erfurt_000073_000019_gtFine_labelIds.png',
   'erfurt_000080_000019_gtFine_polygons.json',
   'erfurt_000107_000019_gtFine_labelIds.png'],
  'krefeld': ['krefeld_000000_000316_gtFine_color.png',
   'krefeld_000000_008239_gtFine_color.png',
   'krefeld_000000_023698_gtFine_color.png',
   'krefeld_000000_030560_gtFine_instanceIds.png',
   'krefeld_000000_030701_gtFine_labelIds.png']},
 {'bochum': ['bochum_000000_001828_gtFinePanopticParts.tif',
   'bochum_000000_008804_gtFinePanopticParts.tif',
   'bochum_000000_021606_gtFinePanopticParts.tif',
   'bochum_000000_023435_gtFinePanopticParts.tif',
   'bochum_000000_030913_gtFinePanopticParts.tif']