# Classifying into 3 types

In this notebook I classify astronomical images into three types: stars, spiral galaxies and elliptical galaxies. These are the three most common types of astronomical objects.

In [1]:
#standard libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
import matplotlib.cm as cm
import sys
import os
import time
import random
from astropy.io import fits
#torch functions
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm
#sklearn helper functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score,f1_score, log_loss
#xgboost for comparison
from xgboost import XGBClassifier
#logistic regression for comparison 
from sklearn.linear_model import LogisticRegression
import pickle
from functions_ml import *

First I get the galaxy data, in the same way as for the galaxy classification.
It is produced by the program get_zoo_galaxies.py. The program is applied 43 times and thus gets data from 43 fields. These fields cover the area from 310 (-50) degrees to 60 degrees in right ascension, with a height from -1.26 to +1.26 degrees in declination.

The data consist of two pieces. The first piece is the images, which were saved as 4-dimensional numpy arrays (first dimension: x of the image, second: y of the image, third: channels, fourth: index of the image, or batch in torch language), because torch needs 4D arrays even when, as here, only a single channel exists. The images are rdeep images from http://research.iac.es/proyecto/stripe82/pages/data.php
This channel is the channel combination with the highest signal-to-noise ratio. The second piece is the information on each image, which is loaded as a data frame. It contains in particular the classes spiral and elliptical, which are boolean and mutually exclusive here. These are citizen classifications from the zoo projects. I included only rather certain ones, but there is no 100% certainty.
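
The axis order described above can be illustrated with a small synthetic array. This is only a sketch: the array `cube` and its size of 5 images are made up for illustration, and only the (x, y, channel, image) order is taken from the text. Reversing the axes gives the batch-first layout (image, channel, y, x) that torch convolution layers work with.

```python
import numpy as np

# Synthetic stand-in for the image cube: 43 x 43 pixels, 1 channel, 5 images.
# The axis order (x, y, channel, image) follows the description in the text.
cube = np.zeros((43, 43, 1, 5), dtype=np.float32)
cube[2, 7, 0, 3] = 1.0  # mark one pixel: x=2, y=7 in image number 3

# .T reverses all axes, giving (image, channel, y, x).
batch_first = cube.T

print(cube.shape)                # (43, 43, 1, 5)
print(batch_first.shape)         # (5, 1, 43, 43)
print(batch_first[3, 0, 7, 2])   # 1.0, the same pixel in the new layout
```

Note that `.T` returns a view, so no image data is copied at this step.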

In [2]:
#getting the list of images
myPath='/home/tobias/ml-testing/astr-images'
list_images=[f for f in os.listdir(myPath) 
    if f.endswith('_ell_spiral_im.npy') ]
list_images.sort()
print(len(list_images))
#getting the list of tables 
list_tables=[f for f in os.listdir(myPath) 
    if f.endswith('_ell_spiral_table.csv')]
list_tables.sort()
print(len(list_tables))

43
43


Next I combine the images.

In [4]:
cutouts_gal=comb_nump_4d(list_images)
print(cutouts_gal.shape)

(43, 43, 1, 7875)
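
The helper `comb_nump_4d` comes from `functions_ml` and is not shown in this notebook. A minimal sketch of what such a combination step could look like is given below, assuming it loads every file and joins the cubes along the last (image) axis. The name `combine_cubes` and the file names are hypothetical, and the real helper may differ.

```python
import os
import tempfile
import numpy as np

def combine_cubes(paths):
    """Load each .npy cube and join them along the last (image) axis.

    Hypothetical stand-in for comb_nump_4d; the real helper may differ.
    """
    return np.concatenate([np.load(p) for p in paths], axis=3)

# Small demonstration with two synthetic fields holding 3 and 4 images.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, n_images in [("field_01.npy", 3), ("field_02.npy", 4)]:
        path = os.path.join(tmp, name)
        np.save(path, np.zeros((43, 43, 1, n_images)))
        paths.append(path)
    combined = combine_cubes(paths)

print(combined.shape)  # (43, 43, 1, 7)
```

The image counts of the individual fields add up along the fourth axis, which is consistent with the shape printed above.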


Next I combine the tables.

In [6]:
#reading each table, the path is joined with the directory that was listed
list_df_gal=[pd.read_csv(os.path.join(myPath,f)) for f in list_tables]
print(f"number of tables is {len(list_df_gal)}")    
df_gal=pd.concat(list_df_gal,ignore_index=True)
print(f"shape of combined data frame {df_gal.shape}")
print(f"shape of image file is {cutouts_gal.shape}")

number of tables is 43
shape of combined data frame (7875, 51)
shape of image file is (43, 43, 1, 7875)


The images and the classification data have the same length. Now I look at the classes.

In [7]:
print(df_gal.spiral.value_counts())

1    5766
0    2109
Name: spiral, dtype: int64


The classes are somewhat imbalanced. The objects which are not spirals are ellipticals.
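
To put a number on the imbalance, the class fractions can be computed from the counts printed above. The sketch below rebuilds a label column from those two counts only, so it runs without the data files. The series `spiral` here is a stand-in for `df_gal.spiral`.

```python
import pandas as pd

# Rebuild the label column from the printed counts (1 = spiral, 0 = elliptical).
spiral = pd.Series([1] * 5766 + [0] * 2109, name="spiral")

# normalize=True returns fractions instead of counts
fractions = spiral.value_counts(normalize=True)
print(fractions.round(3))

# number of spirals per elliptical
ratio = (spiral == 1).sum() / (spiral == 0).sum()
print(f"spirals per elliptical: {ratio:.2f}")
```

About 73% of the galaxies are spirals, so a classifier that always answers spiral would already reach that accuracy on the galaxy sample.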

Next I get the stars. The images have the same size as those of the galaxies, since that is needed for the three-way classification, even though stars can also be distinguished with a smaller window.

In [13]:
#getting the list of star images
list_images_star=[f for f in os.listdir(myPath) 
    if f.endswith('_stars_im.npy') ]
list_images_star.sort()
print(list_images)
#getting the list of star tables 
list_tables_star=[f for f in os.listdir(myPath) 
    if f.endswith('_stars_table.csv')]
list_tables_star.sort()
print(list_tables_star)

['stripe82_01_ell_spiral_im.npy', 'stripe82_02_ell_spiral_im.npy', 'stripe82_03_ell_spiral_im.npy', 'stripe82_04_ell_spiral_im.npy', 'stripe82_05_ell_spiral_im.npy', 'stripe82_06_ell_spiral_im.npy', 'stripe82_07_ell_spiral_im.npy', 'stripe82_08_ell_spiral_im.npy', 'stripe82_09_ell_spiral_im.npy', 'stripe82_10_ell_spiral_im.npy', 'stripe82_11_ell_spiral_im.npy', 'stripe82_12_ell_spiral_im.npy', 'stripe82_13_ell_spiral_im.npy', 'stripe82_14_ell_spiral_im.npy', 'stripe82_15_ell_spiral_im.npy', 'stripe82_16_ell_spiral_im.npy', 'stripe82_17_ell_spiral_im.npy', 'stripe82_18_ell_spiral_im.npy', 'stripe82_19_ell_spiral_im.npy', 'stripe82_20_ell_spiral_im.npy', 'stripe82_21_ell_spiral_im.npy', 'stripe82_22_ell_spiral_im.npy', 'stripe82_23_ell_spiral_im.npy', 'stripe82_24_ell_spiral_im.npy', 'stripe82_25_ell_spiral_im.npy', 'stripe82_26_ell_spiral_im.npy', 'stripe82_27_ell_spiral_im.npy', 'stripe82_28_ell_spiral_im.npy', 'stripe82_29_ell_spiral_im.npy', 'stripe82_30_ell_spiral_im.npy', 'stripe82

Combining them now.

In [15]:
cutouts_star=comb_nump_4d(list_images_star)
print(cutouts_star.shape)
#reading each table, the path is joined with the directory that was listed
list_df_star=[pd.read_csv(os.path.join(myPath,f)) for f in list_tables_star]
df_star=pd.concat(list_df_star,ignore_index=True)
print(f"shape of combined data frame {df_star.shape}")
print(f"shape of image file is {cutouts_star.shape}")

(43, 43, 1, 31530)
shape of combined data frame (31530, 48)
shape of image file is (43, 43, 1, 31530)


That is more than 30000 stars, clearly more than the number of galaxies. Adding so many of them likely mainly adds computing time. Therefore I now only use 20% of them. That might be increased at some point.

In [18]:
#images and df split
im_star_used, im_star_other,df_star_used, df_star_other = train_test_split(cutouts_star.T,df_star,train_size=0.20, shuffle=True, random_state=1)
print("shape of used star image data is")
print(im_star_used.shape)

shape of used star image data is
(6306, 1, 43, 43)
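
The split above relies on `train_test_split` shuffling the images and the table rows in the same way. The following sketch checks this pairing on synthetic data. The arrays, the column `star_id` and the size of 50 stars are made up for illustration, and only the call itself mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_stars = 50
cube = np.zeros((43, 43, 1, n_stars), dtype=np.float32)
# tag each image with its own index so the pairing can be checked afterwards
cube[0, 0, 0, :] = np.arange(n_stars)
table = pd.DataFrame({"star_id": np.arange(n_stars)})

# same call as in the notebook: images transposed to batch-first, 20% kept
im_used, im_other, df_used, df_other = train_test_split(
    cube.T, table, train_size=0.20, shuffle=True, random_state=1)

print(im_used.shape)  # (10, 1, 43, 43)
# after .T the tagged pixel sits at [n, 0, 0, 0]
paired = np.array_equal(im_used[:, 0, 0, 0], df_used["star_id"].to_numpy())
print(paired)  # True
```

Because the first axis after `.T` is the image index, the function selects whole images, and each kept image stays aligned with its row in the data frame.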
