# Data Preprocessing

This notebook splits the complete Kaggle `images_training_rev1` image set into training, validation, and testing datasets.

[Galaxy Zoo datasets (Kaggle)](https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/data)

## Import libraries

In [24]:
import pandas as pd
import numpy as np

## DataFrame pre-processing

In [4]:
# load the csv file
df = pd.read_csv('class_labels.csv')
print("df.shape = {}".format(df.shape))
df.head()

df.shape = (61578, 2)


   GalaxyID  label
0    100008     14
1    100023     14
2    100053     14
3    100078      0
4    100090     14


In [7]:
# number of images per label
df['label'].value_counts()

14    36706
1     15940
0      5449
13     3470
2        13
Name: label, dtype: int64

## Renaming labels

The labels take only the values 0, 1, 2, 13, and 14.

We rename 13 to 3 and 14 to 4, so that the labels run consecutively from 0 to 4.

In [9]:
# rename label 13 to 3 and label 14 to 4
df['label'] = df['label'].replace({13 : 3, 14 : 4})
df.head()

   GalaxyID  label
0    100008      4
1    100023      4
2    100053      4
3    100078      0
4    100090      4


In [10]:
# number of images per label after renaming
df['label'].value_counts()

4    36706
1    15940
0     5449
3     3470
2       13
Name: label, dtype: int64

## Data splitting

There are 61,578 images in total.

We first split the images into 49,262 training+validation images and 12,316 test images.

We then split the 49,262 training+validation images into 39,410 training images and 9,852 validation images.
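The sizes above are consistent with holding out 20% of the images twice, rounding to the nearest integer each time. The notebook does not state the ratio, so the 0.2 fraction and the name `HOLDOUT_FRACTION` below are assumptions inferred from the numbers. A minimal sketch of the arithmetic:

```python
# Assumption: each held-out set is 20% of its parent set, rounded to the
# nearest integer. The notebook only gives the resulting sizes.
HOLDOUT_FRACTION = 0.2  # hypothetical name, not from the notebook

num_images = 61578

# first split: test vs. training+validation
num_test = round(HOLDOUT_FRACTION * num_images)
num_train_valid = num_images - num_test

# second split: validation vs. training
num_valid = round(HOLDOUT_FRACTION * num_train_valid)
num_train = num_train_valid - num_valid

print(num_train, num_valid, num_test)
```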

In [13]:
num_images = df.shape[0]
print("total number of images = {}".format(num_images))

total number of images = 61578


In [55]:
# fix the random seed for reproducibility
np.random.seed(3)

# randomly select 12,316 indices for the test dataset
test_idx = np.sort(np.random.choice(61578, 12316, replace=False))

# the remaining indices (training + validation)
remain_idx = np.setdiff1d(np.arange(61578), test_idx)

# randomly select 9,852 of the remaining indices for validation
valid_idx = np.sort(np.random.choice(remain_idx, 9852, replace=False))

# use the rest for training
train_idx = np.setdiff1d(remain_idx, valid_idx)

# print the indices and the size of each split
print("training indices: {}".format(train_idx))
print("validation indices: {}".format(valid_idx))
print("testing indices: {}".format(test_idx))

print("\ntraining size: {}".format(len(train_idx)))
print("validation size: {}".format(len(valid_idx)))
print("testing size: {}".format(len(test_idx)))

training indices: [    0     1     2 ... 61574 61575 61576]
validation indices: [    5     6    10 ... 61564 61569 61577]
testing indices: [    3     9    14 ... 61553 61561 61572]

training size: 39410
validation size: 9852
testing size: 12316
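Because each index array is built with `np.setdiff1d`, the three sets should be pairwise disjoint and together cover every row. A quick sanity check is sketched here by rebuilding the split with the same seed and calls as the cell above; the assertions are an addition, not part of the original notebook.

```python
import numpy as np

num_images, num_test, num_valid = 61578, 12316, 9852

# rebuild the split with the same seed and calls as the cell above
np.random.seed(3)
test_idx = np.sort(np.random.choice(num_images, num_test, replace=False))
remain_idx = np.setdiff1d(np.arange(num_images), test_idx)
valid_idx = np.sort(np.random.choice(remain_idx, num_valid, replace=False))
train_idx = np.setdiff1d(remain_idx, valid_idx)

# no index may appear in more than one split
assert np.intersect1d(train_idx, valid_idx).size == 0
assert np.intersect1d(train_idx, test_idx).size == 0
assert np.intersect1d(valid_idx, test_idx).size == 0

# together the splits must cover every row exactly once
all_idx = np.concatenate([train_idx, valid_idx, test_idx])
assert np.array_equal(np.sort(all_idx), np.arange(num_images))
```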


In [79]:
# split the dataframe and reset the index
train_df = df.iloc[train_idx, :].copy().reset_index(drop=True)
valid_df = df.iloc[valid_idx, :].copy().reset_index(drop=True)
test_df = df.iloc[test_idx, :].copy().reset_index(drop=True)

In [91]:
print("training")
print(train_df.head())
print("\nvalidation")
print(valid_df.head())
print("\ntesting")
print(test_df.head())

training
   GalaxyID  label
0    100008      4
1    100023      4
2    100053      4
3    100090      4
4    100128      0

validation
   GalaxyID  label
0    100122      4
1    100123      3
2    100150      4
3    100157      4
4    100322      4

testing
   GalaxyID  label
0    100078      0
1    100143      1
2    100237      4
3    100259      1
4    100382      4


## Save as csv files

In [83]:
# save training labels
train_df.to_csv('labels_train.csv', index=False)
# save validation labels
valid_df.to_csv('labels_valid.csv', index=False)
# save testing labels
test_df.to_csv('labels_test.csv', index=False)

## Class distribution

In [74]:
# compare class distribution across the three splits
# name each count Series directly, so the column names do not depend on
# how the installed pandas version names the value_counts() result
train_class_counts = train_df['label'].value_counts().rename('train')
valid_class_counts = valid_df['label'].value_counts().rename('valid')
test_class_counts = test_df['label'].value_counts().rename('test')

class_counts_df = pd.concat([train_class_counts, valid_class_counts, test_class_counts],
                           axis=1).sort_index()
class_counts_df.head()

   train  valid  test
0   3501    880  1068
1  10136   2597  3207
2      7      3     3
3   2229    562   679
4  23537   5810  7359


**Class distributions among training, validation, and testing data**

We used the chi-square test to test the null hypothesis that the training, validation, and test datasets are drawn from populations with identical galaxy class frequency distributions.

With 5 classes and 3 datasets, the chi-square distribution has (5 - 1) × (3 - 1) = **8** degrees of freedom. The test statistic we calculated is **3.87294**, with **p-value = 0.8684**.

Therefore, the null hypothesis is not rejected at any conventional significance level.
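The notebook does not show how the test was run. One way to carry it out, assuming SciPy is installed, is `scipy.stats.chi2_contingency` applied to the counts from the class-distribution table above. This is a sketch of that approach, not necessarily the author's method:

```python
import numpy as np
from scipy.stats import chi2_contingency

# observed counts copied from the class-distribution table above
# rows: labels 0-4, columns: train, valid, test
observed = np.array([
    [3501, 880, 1068],
    [10136, 2597, 3207],
    [7, 3, 3],
    [2229, 562, 679],
    [23537, 5810, 7359],
])

# Pearson chi-square test of homogeneity across the three splits
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("chi-square = {:.5f}".format(chi2_stat))
print("degrees of freedom = {}".format(dof))
print("p-value = {:.4f}".format(p_value))
```

With these counts the call should reproduce the statistic and p-value quoted above. Two of the expected counts for label 2 fall below 5, so the chi-square approximation is only rough for that row.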

## Relocate images (optional) 

The code below creates three folders for the training, validation, and testing images, then moves each image into the folder for its split.

In [112]:
import os
import shutil

In [116]:
%cd images_training_rev1

C:\Songmao\ML_datasets\Kaggle_galaxyZoo\images_training_rev1


In [117]:
# new folder names
train_folder = 'images_train'
valid_folder = 'images_valid'
test_folder = 'images_test'

# create the folders if they do not already exist
for folder in (train_folder, valid_folder, test_folder):
    os.makedirs(folder, exist_ok=True)

In [119]:
# image filenames for each split, one per GalaxyID
train_fnames = [str(x)+'.jpg' for x in train_df['GalaxyID'].values]
valid_fnames = [str(x)+'.jpg' for x in valid_df['GalaxyID'].values]
test_fnames = [str(x)+'.jpg' for x in test_df['GalaxyID'].values]

In [120]:
# relocate training images
for src_path in train_fnames:
    shutil.move(src_path, train_folder)

In [121]:
# relocate validation images
for src_path in valid_fnames:
    shutil.move(src_path, valid_folder)

In [122]:
# relocate test images
for src_path in test_fnames:
    shutil.move(src_path, test_folder)
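`shutil.move` raises an error if a source file is missing, for example when a cell is re-run after the images have already been moved. A minimal sketch of a re-runnable variant follows. `move_images` is a hypothetical helper, not part of the notebook, and the demonstration uses a temporary directory with empty placeholder files rather than the real images.

```python
import shutil
import tempfile
from pathlib import Path


def move_images(fnames, dest_folder, src_folder="."):
    """Move each file in fnames from src_folder to dest_folder.

    Files missing from src_folder are skipped, so the call can be repeated.
    Returns the number of files moved.
    """
    src_folder, dest_folder = Path(src_folder), Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    moved = 0
    for fname in fnames:
        src_path = src_folder / fname
        if src_path.is_file():
            shutil.move(str(src_path), str(dest_folder / fname))
            moved += 1
    return moved


# demonstration on placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    fnames = ["100008.jpg", "100023.jpg", "100053.jpg"]
    for fname in fnames:
        (Path(tmp) / fname).touch()

    dest = Path(tmp) / "images_train"
    first = move_images(fnames, dest, src_folder=tmp)
    second = move_images(fnames, dest, src_folder=tmp)  # nothing left to move
    print(first, second)
```

In the notebook's setting the equivalent call would be `move_images(train_fnames, train_folder)`, and likewise for the validation and test lists.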