# Label Preprocessing


*Last updated: 08/18/2021*

---


This notebook preprocesses the annotation file provided by Kaggle.

The data and the labels can be downloaded here: [Galaxy Zoo](https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/data)

## Import libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from PIL import Image

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils

from skimage import io

%matplotlib inline

## Load the original Kaggle annotations

The original annotations are stored in the file `training_solutions_rev1.csv`.

We load the CSV file with `pd.read_csv()`.
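Before loading the whole table, it can be useful to look at the header alone. The sketch below assumes a small in-memory stand-in for the Kaggle file (`sample_csv`, truncated to three class columns, is made up for illustration) and uses the `nrows` parameter of `pd.read_csv()` to limit how much is parsed.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first lines of training_solutions_rev1.csv,
# truncated to three class columns for brevity.
sample_csv = io.StringIO(
    "GalaxyID,Class1.1,Class1.2,Class1.3\n"
    "100008,0.383147,0.616853,0.0\n"
    "100023,0.327001,0.663777,0.009222\n"
)

# nrows=0 parses only the header row, so no data rows are loaded.
header = pd.read_csv(sample_csv, nrows=0)
column_names = list(header.columns)
print(column_names)

# Rewind the buffer and read a single data row as a preview.
sample_csv.seek(0)
preview = pd.read_csv(sample_csv, nrows=1)
print(preview)
```

With the real file, the `io.StringIO` object would be replaced by the path to `training_solutions_rev1.csv`.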

In [4]:
# load the original label csv provided by Kaggle
df = pd.read_csv('training_solutions_rev1.csv')
print("df.shape = {}".format(df.shape))
df.head()

df.shape = (61578, 38)


Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,Class5.1,Class5.2,Class5.3,Class5.4,Class6.1,Class6.2,Class7.1,Class7.2,Class7.3,Class8.1,Class8.2,Class8.3,Class8.4,Class8.5,Class8.6,Class8.7,Class9.1,Class9.2,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.0,0.0,0.616853,0.038452,0.578401,0.418398,0.198455,0.0,0.104752,0.512101,0.0,0.054453,0.945547,0.201463,0.181684,0.0,0.0,0.027226,0.0,0.027226,0.0,0.0,0.0,0.0,0.0,0.0,0.279952,0.138445,0.0,0.0,0.092886,0.0,0.0,0.0,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.46737,0.165229,0.591328,0.041271,0.0,0.236781,0.160941,0.234877,0.189149,0.810851,0.0,0.135082,0.191919,0.0,0.0,0.140353,0.0,0.048796,0.0,0.0,0.012414,0.0,0.018764,0.0,0.131378,0.45995,0.0,0.591328,0.0,0.0,0.0,0.0
2,100053,0.765717,0.177352,0.056931,0.0,0.177352,0.0,0.177352,0.0,0.177352,0.0,0.11779,0.059562,0.0,0.0,1.0,0.0,0.741864,0.023853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,100078,0.693377,0.238564,0.068059,0.0,0.238564,0.109493,0.129071,0.189098,0.049466,0.0,0.0,0.113284,0.12528,0.320398,0.679602,0.408599,0.284778,0.0,0.0,0.0,0.096119,0.096119,0.0,0.128159,0.0,0.0,0.0,0.0,0.094549,0.0,0.094549,0.189098,0.0,0.0,0.0,0.0,0.0
4,100090,0.933839,0.0,0.066161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029383,0.970617,0.494587,0.439252,0.0,0.0,0.0,0.0,0.0,0.0,0.029383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


There are 38 columns. The first column is `GalaxyID`, a unique ID for each image.

The remaining 37 columns give the relative probability of each of the **37 classes** that an image is thought to belong to.

There are **61,578 images** in total.
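A quick way to check that the class columns behave like probabilities is to test their range. The sketch below uses a toy frame copied from the first two rows shown above, not the full file. It also checks that the three `Class1.x` values sum to about 1, which holds for the five rows displayed; whether it holds for every row is an assumption to verify on the real data.

```python
import pandas as pd

# Toy stand-in: the first two galaxies, Class1.x columns only.
toy = pd.DataFrame({
    "GalaxyID": [100008, 100023],
    "Class1.1": [0.383147, 0.327001],
    "Class1.2": [0.616853, 0.663777],
    "Class1.3": [0.000000, 0.009222],
})

class_columns = [c for c in toy.columns if c.startswith("Class")]

# Every value should lie in the closed interval [0, 1].
values = toy[class_columns]
in_range = bool(((values >= 0.0) & (values <= 1.0)).all().all())

# The three Class1.x values should sum to roughly 1 for each galaxy.
q1_sum = toy[["Class1.1", "Class1.2", "Class1.3"]].sum(axis=1)
q1_ok = bool(((q1_sum - 1.0).abs() < 1e-4).all())

print("values in [0, 1]:", in_range)
print("Class1.x sums to 1:", q1_ok)
```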

The cell below prints the names of all 37 classes.

In [9]:
# see the column names
print("Columns: {}".format(df.columns))
print("\n{} classes in total".format(len(df.columns)-1)) # we exclude the GalaxyID column

Columns: Index(['GalaxyID', 'Class1.1', 'Class1.2', 'Class1.3', 'Class2.1', 'Class2.2',
       'Class3.1', 'Class3.2', 'Class4.1', 'Class4.2', 'Class5.1', 'Class5.2',
       'Class5.3', 'Class5.4', 'Class6.1', 'Class6.2', 'Class7.1', 'Class7.2',
       'Class7.3', 'Class8.1', 'Class8.2', 'Class8.3', 'Class8.4', 'Class8.5',
       'Class8.6', 'Class8.7', 'Class9.1', 'Class9.2', 'Class9.3', 'Class10.1',
       'Class10.2', 'Class10.3', 'Class11.1', 'Class11.2', 'Class11.3',
       'Class11.4', 'Class11.5', 'Class11.6'],
      dtype='object')

37 classes in total


**Renaming classes**

We rename the class columns to the integers 0 through 36.

The mapping from each original class name to its new name is printed below.

In [10]:
## Rename the column names
# create an empty dict
column_rename_dict = dict()

# class label starts from 0
class_label = 0

# iterate from the second column (first column is GalaxyID, which is not a class)
for original_column_name in df.iloc[:, 1:].columns:
    column_rename_dict[original_column_name] = class_label
    print("{} -> {}".format(original_column_name, class_label))
    class_label += 1

Class1.1 -> 0
Class1.2 -> 1
Class1.3 -> 2
Class2.1 -> 3
Class2.2 -> 4
Class3.1 -> 5
Class3.2 -> 6
Class4.1 -> 7
Class4.2 -> 8
Class5.1 -> 9
Class5.2 -> 10
Class5.3 -> 11
Class5.4 -> 12
Class6.1 -> 13
Class6.2 -> 14
Class7.1 -> 15
Class7.2 -> 16
Class7.3 -> 17
Class8.1 -> 18
Class8.2 -> 19
Class8.3 -> 20
Class8.4 -> 21
Class8.5 -> 22
Class8.6 -> 23
Class8.7 -> 24
Class9.1 -> 25
Class9.2 -> 26
Class9.3 -> 27
Class10.1 -> 28
Class10.2 -> 29
Class10.3 -> 30
Class11.1 -> 31
Class11.2 -> 32
Class11.3 -> 33
Class11.4 -> 34
Class11.5 -> 35
Class11.6 -> 36
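The same mapping can be built more compactly with `enumerate`. This is only a sketch on a toy frame with four class columns; `label_to_name` is a hypothetical helper, not part of the original notebook, for turning an integer label back into its Kaggle column name.

```python
import pandas as pd

# Toy frame with the same column layout as the Kaggle table (truncated).
toy = pd.DataFrame(
    columns=["GalaxyID", "Class1.1", "Class1.2", "Class1.3", "Class2.1"]
)

# Skip the first column (GalaxyID) and number the remaining ones from 0.
column_rename_dict = {name: idx for idx, name in enumerate(toy.columns[1:])}

# Inverse mapping (hypothetical helper): integer label -> original class name.
label_to_name = {idx: name for name, idx in column_rename_dict.items()}

print(column_rename_dict)
print(label_to_name[3])
```

Keeping the inverse mapping around makes it easier to interpret predictions later, since the integer labels alone no longer carry the question and answer numbers.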


In [11]:
## Rename the column names in the DataFrame
df = df.rename(columns=column_rename_dict)
df.head()

Unnamed: 0,GalaxyID,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36
0,100008,0.383147,0.616853,0.0,0.0,0.616853,0.038452,0.578401,0.418398,0.198455,0.0,0.104752,0.512101,0.0,0.054453,0.945547,0.201463,0.181684,0.0,0.0,0.027226,0.0,0.027226,0.0,0.0,0.0,0.0,0.0,0.0,0.279952,0.138445,0.0,0.0,0.092886,0.0,0.0,0.0,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.46737,0.165229,0.591328,0.041271,0.0,0.236781,0.160941,0.234877,0.189149,0.810851,0.0,0.135082,0.191919,0.0,0.0,0.140353,0.0,0.048796,0.0,0.0,0.012414,0.0,0.018764,0.0,0.131378,0.45995,0.0,0.591328,0.0,0.0,0.0,0.0
2,100053,0.765717,0.177352,0.056931,0.0,0.177352,0.0,0.177352,0.0,0.177352,0.0,0.11779,0.059562,0.0,0.0,1.0,0.0,0.741864,0.023853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,100078,0.693377,0.238564,0.068059,0.0,0.238564,0.109493,0.129071,0.189098,0.049466,0.0,0.0,0.113284,0.12528,0.320398,0.679602,0.408599,0.284778,0.0,0.0,0.0,0.096119,0.096119,0.0,0.128159,0.0,0.0,0.0,0.0,0.094549,0.0,0.094549,0.189098,0.0,0.0,0.0,0.0,0.0
4,100090,0.933839,0.0,0.066161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029383,0.970617,0.494587,0.439252,0.0,0.0,0.0,0.0,0.0,0.0,0.029383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**True Label**

To infer the true label of each galaxy, we apply `idxmax` across all 37 class columns of each row, which returns the class with the highest value for that image.

We append an additional column named `label` to indicate the true label of each image.

In [12]:
## Add a new column 'label', which takes the argmax along all 37 classes
df['label'] = df.iloc[:, 1:].idxmax(axis="columns")
df.head()

Unnamed: 0,GalaxyID,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,label
0,100008,0.383147,0.616853,0.0,0.0,0.616853,0.038452,0.578401,0.418398,0.198455,0.0,0.104752,0.512101,0.0,0.054453,0.945547,0.201463,0.181684,0.0,0.0,0.027226,0.0,0.027226,0.0,0.0,0.0,0.0,0.0,0.0,0.279952,0.138445,0.0,0.0,0.092886,0.0,0.0,0.0,0.325512,14
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.46737,0.165229,0.591328,0.041271,0.0,0.236781,0.160941,0.234877,0.189149,0.810851,0.0,0.135082,0.191919,0.0,0.0,0.140353,0.0,0.048796,0.0,0.0,0.012414,0.0,0.018764,0.0,0.131378,0.45995,0.0,0.591328,0.0,0.0,0.0,0.0,14
2,100053,0.765717,0.177352,0.056931,0.0,0.177352,0.0,0.177352,0.0,0.177352,0.0,0.11779,0.059562,0.0,0.0,1.0,0.0,0.741864,0.023853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14
3,100078,0.693377,0.238564,0.068059,0.0,0.238564,0.109493,0.129071,0.189098,0.049466,0.0,0.0,0.113284,0.12528,0.320398,0.679602,0.408599,0.284778,0.0,0.0,0.0,0.096119,0.096119,0.0,0.128159,0.0,0.0,0.0,0.0,0.094549,0.0,0.094549,0.189098,0.0,0.0,0.0,0.0,0.0,0
4,100090,0.933839,0.0,0.066161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029383,0.970617,0.494587,0.439252,0.0,0.0,0.0,0.0,0.0,0.0,0.029383,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14
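To make the behaviour of `idxmax` concrete, here is a minimal sketch on a made-up three-class frame. It selects the class columns by name rather than by position, an illustrative choice that keeps a previously added `label` column out of the argmax if the cell is run twice. Note that when two classes tie, `idxmax` returns the first one.

```python
import pandas as pd

# Made-up data: three galaxies, three classes already renamed to 0, 1, 2.
toy = pd.DataFrame({
    "GalaxyID": [1, 2, 3],
    0: [0.7, 0.1, 0.5],
    1: [0.2, 0.8, 0.5],
    2: [0.1, 0.1, 0.0],
})

# Select the class columns explicitly so extra columns are never included.
class_columns = [c for c in toy.columns if c != "GalaxyID"]

# idxmax returns the column name holding the row-wise maximum.
toy["label"] = toy[class_columns].idxmax(axis="columns")

# Keep the winning value too, to see how confident each label is.
toy["max_value"] = toy[class_columns].max(axis="columns")

print(toy[["GalaxyID", "label", "max_value"]])
# Galaxy 3 has a tie between classes 0 and 1; idxmax picks class 0.
```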


**Removing unwanted information**

We only need two columns, `GalaxyID` and `label`, for our machine learning project.

In [13]:
## Create a new DataFrame with 'GalaxyID' and 'label' columns
class_label_df = df[['GalaxyID', 'label']].copy()
class_label_df.head()

Unnamed: 0,GalaxyID,label
0,100008,14
1,100023,14
2,100053,14
3,100078,0
4,100090,14
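A few cheap checks on the reduced table can catch problems before it is saved. The sketch below rebuilds the five rows shown above as a stand-in for the full `class_label_df`. The specific checks (unique IDs, labels within 0 to 36, per-class counts) are suggestions rather than part of the original notebook.

```python
import pandas as pd

# Stand-in for class_label_df, using the five rows displayed above.
class_label_df = pd.DataFrame({
    "GalaxyID": [100008, 100023, 100053, 100078, 100090],
    "label": [14, 14, 14, 0, 14],
})

# Each image should appear exactly once.
ids_unique = class_label_df["GalaxyID"].is_unique

# Labels should fall in the renamed class range 0..36.
labels_valid = bool(class_label_df["label"].between(0, 36).all())

# Number of images per label, most frequent first.
counts = class_label_df["label"].value_counts()

print(ids_unique, labels_valid)
print(counts)
```

On the full data, `counts` shows how imbalanced the argmax labels are, which matters when choosing a loss function or sampling strategy later.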


## Save CSV file

In [14]:
## Save the class label csv
class_label_df.to_csv('class_labels.csv', index=False)

## Check the file

The cell below reads back the CSV file we just saved as a sanity check.

In [15]:
temp = pd.read_csv('class_labels.csv')
temp.head()

Unnamed: 0,GalaxyID,label
0,100008,14
1,100023,14
2,100053,14
3,100078,0
4,100090,14
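Beyond eyeballing the first rows, the round trip can be verified programmatically. The sketch below assumes a five-row stand-in for `class_label_df` and writes to an in-memory buffer instead of `class_labels.csv`, so it runs without touching the disk.

```python
import io
import pandas as pd

# Stand-in for class_label_df, using the five rows displayed above.
class_label_df = pd.DataFrame({
    "GalaxyID": [100008, 100023, 100053, 100078, 100090],
    "label": [14, 14, 14, 0, 14],
})

# Write to an in-memory buffer; index=False keeps the row index out of the file.
buffer = io.StringIO()
class_label_df.to_csv(buffer, index=False)

# Rewind and read the CSV text back.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, column names, or dtypes differ.
pd.testing.assert_frame_equal(class_label_df, reloaded)
print("round trip OK:", reloaded.shape)
```

With the real file, the same comparison applies after `pd.read_csv('class_labels.csv')`.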
