# pre-processing data
data source: https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge/data?select=images_test_rev1.zip

Downloaded data summary:
* 61578 jpg images in the images_training_rev1 folder. 
* 79975 jpg images in the images_test_rev1 folder. 
* training_solutions_rev1.csv contains the labels for the training images: one row per GalaxyID with 37 Class columns.

What we need:

This project has a different and simpler goal than the original Kaggle competition.
We are going to use this dataset to train a NN to identify spiral arms only. The final goal
is to let the NN tell us whether an image contains spiral arms. We do not care about the rest of the galaxy morphology. 

Hence our data processing plan is:
1. Select galaxies that have spiral arms for our training and test sets. We can use Class4.1 >= 0.5 in training_solutions_rev1.csv as the criterion. The CSV covers only images_training_rev1, so both sets are drawn from that folder.
2. Put galaxies with Class4.1 >= 0.5 into a new folder called spirals.
3. Put galaxies with 0 < Class4.1 < 0.5 into a new folder called nonspirals.
4. [optional later step] Put galaxies with Class4.1 = 0 into a new folder called rounds.
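
The thresholds in the plan can be written as one small function. This is a minimal sketch, not code from the notebook: `folder_for` is a hypothetical helper name, the folder names follow the paths used in the copy cells, and it assumes Class4.1 is a fraction between 0 and 1.

```python
def folder_for(class4_1: float) -> str:
    """Map a Class4.1 value to the target folder name from the plan."""
    if class4_1 >= 0.5:
        return "spirals"
    if class4_1 > 0:
        return "nonspirals"
    return "rounds"  # optional later step: Class4.1 == 0
```

Note that a value of exactly 0.5 goes to spirals, which matches the `>=` used in the selection cell.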




### load training data meta information

In [4]:
import pandas as pd
meta_info = pd.read_csv('training_solutions_rev1.csv')
display(meta_info.head())

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.0,0.0,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.0,0.279952,0.138445,0.0,0.0,0.092886,0.0,0.0,0.0,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.46737,0.165229,0.591328,0.041271,...,0.018764,0.0,0.131378,0.45995,0.0,0.591328,0.0,0.0,0.0,0.0
2,100053,0.765717,0.177352,0.056931,0.0,0.177352,0.0,0.177352,0.0,0.177352,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,100078,0.693377,0.238564,0.068059,0.0,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.0,0.094549,0.0,0.094549,0.189098,0.0,0.0,0.0,0.0,0.0
4,100090,0.933839,0.0,0.066161,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
# check datatypes of each column
# info() prints its report directly and returns None, so no print() is needed
meta_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61578 entries, 0 to 61577
Data columns (total 38 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   GalaxyID   61578 non-null  int64  
 1   Class1.1   61578 non-null  float64
 2   Class1.2   61578 non-null  float64
 3   Class1.3   61578 non-null  float64
 4   Class2.1   61578 non-null  float64
 5   Class2.2   61578 non-null  float64
 6   Class3.1   61578 non-null  float64
 7   Class3.2   61578 non-null  float64
 8   Class4.1   61578 non-null  float64
 9   Class4.2   61578 non-null  float64
 10  Class5.1   61578 non-null  float64
 11  Class5.2   61578 non-null  float64
 12  Class5.3   61578 non-null  float64
 13  Class5.4   61578 non-null  float64
 14  Class6.1   61578 non-null  float64
 15  Class6.2   61578 non-null  float64
 16  Class7.1   61578 non-null  float64
 17  Class7.2   61578 non-null  float64
 18  Class7.3   61578 non-null  float64
 19  Class8.1   61578 non-null  float64
 20  Class8

In [15]:
# select the GalaxyID values of spiral galaxies (Class4.1 >= 0.5)
meta_use_spiral = meta_info[meta_info['Class4.1'] >= 0.5]['GalaxyID'].values

In [16]:
print(type(meta_use_spiral))
print(meta_use_spiral.shape)
print(meta_use_spiral)

<class 'numpy.ndarray'>
(10397,)
[100023 100134 100380 ... 999795 999875 999964]


In [28]:
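import os
import shutil
import numpy as np

path_root = './dataset/images_training_rev1/'
write_dir_name = './dataset/training/spirals/'
# make sure the target folder exists before copying
os.makedirs(write_dir_name, exist_ok=True)

# copy every spiral image that exists on disk into the spirals folder
for num in meta_use_spiral:
    file = path_root + str(num) + '.jpg'
    if os.path.isfile(file):
        shutil.copy(file, write_dir_name, follow_symlinks=True)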

now we have 10397 spirals in the training/spirals folder
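
The same copy loop is repeated for every target folder, so it could be wrapped in a helper. The following is only a sketch under assumptions: `copy_images` is a hypothetical name that does not appear in the notebook, and it assumes each image is stored as `<GalaxyID>.jpg` as in the cell above. It also reports which IDs had no image on disk, which the original loop skips silently.

```python
import os
import shutil


def copy_images(ids, src_dir, dst_dir):
    """Copy <id>.jpg for each ID from src_dir to dst_dir.

    Returns (copied, missing): the number of files copied and the list of
    IDs whose image was not found in src_dir.
    """
    os.makedirs(dst_dir, exist_ok=True)  # create the target folder if needed
    copied, missing = 0, []
    for galaxy_id in ids:
        src = os.path.join(src_dir, f"{galaxy_id}.jpg")
        if os.path.isfile(src):
            shutil.copy(src, dst_dir)
            copied += 1
        else:
            missing.append(galaxy_id)
    return copied, missing
```

With this helper, the cell above would reduce to something like `copy_images(meta_use_spiral, path_root, write_dir_name)`, and the returned count could replace the hand-written totals.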

In [34]:
cond = (meta_info['Class4.1'] != 0) & (meta_info['Class4.1'] < 0.5)
meta_use_nonspiral = meta_info[cond].iloc[:, 0].values

In [36]:
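write_dir_name = './dataset/training/nonspirals/'
# make sure the target folder exists before copying
os.makedirs(write_dir_name, exist_ok=True)

# Non-spirals outnumber spirals (about 5 times more). To save time and keep the
# classes balanced, copy only about as many non-spirals as spirals.
for num in meta_use_nonspiral[0: 10400]:
    file = path_root + str(num) + '.jpg'
    if os.path.isfile(file):
        shutil.copy(file, write_dir_name, follow_symlinks=True)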

now we have 10400 nonspirals in the training/nonspirals folder


In [39]:
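# Test set spirals
# NOTE: meta_use_spiral[0:5000] was already copied to training/spirals above,
# so these test spirals also appear in the training set.
path_root = './dataset/images_training_rev1/'
write_dir_name = './dataset/test/spirals/'
os.makedirs(write_dir_name, exist_ok=True)
for num in meta_use_spiral[0:5000]:
    file = path_root + str(num) + '.jpg'
    if os.path.isfile(file):
        shutil.copy(file, write_dir_name, follow_symlinks=True)

# Test set non spirals
# The slice [10400:15400] does not overlap the [0:10400] slice used for training.
path_root = './dataset/images_training_rev1/'
write_dir_name = './dataset/test/nonspirals/'
os.makedirs(write_dir_name, exist_ok=True)
for num in meta_use_nonspiral[10400: 15400]:
    file = path_root + str(num) + '.jpg'
    if os.path.isfile(file):
        shutil.copy(file, write_dir_name, follow_symlinks=True)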


now we have 5000 spirals in the test/spirals folder
now we have 5000 nonspirals in the test/nonspirals folder
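
In the cell above, the test spirals are taken from `meta_use_spiral[0:5000]`, and all of `meta_use_spiral` was also copied into training/spirals, so the two spiral sets overlap. One possible way to avoid this is to shuffle the IDs once and split them into two disjoint lists before copying. This is a hedged sketch, not the notebook's method: `split_ids`, `n_test`, and `seed` are hypothetical names introduced for illustration.

```python
import random


def split_ids(ids, n_test, seed=0):
    """Shuffle IDs reproducibly and split them into disjoint (train, test) lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `train_ids, test_ids = split_ids(meta_use_spiral, 5000)` would leave 5397 spirals for training and 5000 for testing, with no ID in both. The trade-off is a smaller training set than the 10397 spirals used above.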


In [40]:
print('now we have {} spirals in the training/spirals folder'.format(10397))
print('now we have {} nonspirals in the training/nonspirals folder'.format(10400))
print('now we have {} spirals in the test/spirals folder'.format(5000))
print('now we have {} nonspirals in the test/nonspirals folder'.format(5000))

now we have 10397 spirals in the training/spirals folder
now we have 10400 nonspirals in the training/nonspirals folder
now we have 5000 spirals in the test/spirals folder
now we have 5000 nonspirals in the test/nonspirals folder
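
The totals printed above are typed in by hand, so they can drift from what is actually on disk. A small check that counts the files directly might look like the sketch below. `count_jpgs` is a hypothetical helper, and the folder paths are the ones used in the copy cells.

```python
import os


def count_jpgs(folder):
    """Return the number of .jpg files directly inside folder (0 if it is missing)."""
    if not os.path.isdir(folder):
        return 0
    return sum(1 for name in os.listdir(folder) if name.lower().endswith(".jpg"))


if __name__ == "__main__":
    for folder in (
        "./dataset/training/spirals/",
        "./dataset/training/nonspirals/",
        "./dataset/test/spirals/",
        "./dataset/test/nonspirals/",
    ):
        print("{}: {} images".format(folder, count_jpgs(folder)))
```

If a count is lower than expected, some GalaxyID values had no matching image, because the copy loops skip files that `os.path.isfile` does not find.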
