
# **Galaxy Zoo 2 decision tree logistic prediction notebook**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Deyht/AI_astro_ED_AAIF/blob/main/codes/CNN/classification/gz2_logistic_pred/gz2_logistic_pred.ipynb)

---


### **CIANNA BETA DEV installation**

/!\ WARNING /!\
This beta version is not suited for general use and has been modified for the specific case covered in this notebook. Some functions might behave differently than expected.
Do not use it outside this notebook!

#### Query GPU allocation and properties

If nvidia-smi fails, you probably launched the Colab session without a GPU reservation.  
To change the type of reservation, go to "Runtime"->"Change runtime type" and select "GPU" as your hardware accelerator.
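
The same check can be scripted from Python before running the heavier cells. This is a minimal sketch that only assumes `nvidia-smi` is on the `PATH` when a GPU runtime is attached; the helper name `gpu_runtime_available` is hypothetical and not part of the notebook.

```python
import shutil
import subprocess


def gpu_runtime_available():
    """Return True if nvidia-smi exists and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        # Binary missing: most likely a CPU-only runtime.
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0
```

If the function returns `False`, change the runtime type as described above before going further.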

In [None]:
%%shell

nvidia-smi

cd /content/

git clone https://github.com/NVIDIA/cuda-samples/

cd /content/cuda-samples/Samples/1_Utilities/deviceQuery/

cmake CMakeLists.txt

make SMS="50 60 70 80"

./deviceQuery | grep Capability | cut -c50- > ~/cuda_infos.txt
./deviceQuery | grep "CUDA Driver Version / Runtime Version" | cut -c57- >> ~/cuda_infos.txt

cd ~/

If you are granted a GPU that supports high FP16 compute scaling (e.g., the Tesla T4), it is advised to set the mixed_precision parameter in the last cell to "FP16C_FP32A".  
See the detailed description of mixed precision support in CIANNA on the [System Requirements](https://github.com/Deyht/CIANNA/wiki/1\)-System-Requirements) wiki page.

#### Clone CIANNA git repository

In [None]:
%%shell

cd /content/

git clone https://github.com/Deyht/CIANNA

cd CIANNA

#### Compiling CIANNA for the allocated GPU generation

There is no guaranteed forward or backward compatibility between Nvidia GPU generations, and some capabilities are generation-specific. For these reasons, CIANNA must be given the platform GPU generation at compile time.
The following cell will automatically update all the necessary files based on the detected GPU, and compile CIANNA.
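
The flag selection done by the `awk` one-liners in the cell below can be restated in Python to make the thresholds easier to read. This is only an illustrative sketch: the function name `cuda_compile_flags` is hypothetical, and the thresholds (sm 70, sm 80, CUDA 11.1) are copied from the shell cell, not from CIANNA's documentation.

```python
def cuda_compile_flags(comp_cap, cuda_vers, old_limit=11.1):
    """Mirror the awk arithmetic of the compile cell.

    comp_cap:  compute capability string, e.g. "7.5"
    cuda_vers: CUDA runtime version string, e.g. "12.2"
    """
    # sm value: compute capability times 10 (7.5 -> 75)
    sm_val = int(round(float(comp_cap) * 10))

    # Generation flag, same thresholds as the awk one-liner
    if sm_val >= 80:
        gen_flag = "-D GEN_AMPERE"
    elif sm_val >= 70:
        gen_flag = "-D GEN_VOLTA"
    else:
        gen_flag = ""

    # Toolkits older than the limit get an extra define
    old_flag = "-D CUDA_OLD" if float(cuda_vers) < old_limit else ""

    return sm_val, gen_flag, old_flag
```

For a Tesla T4 (compute capability 7.5), this yields `sm_75` together with the Volta generation flag.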

In [None]:
%%shell

cd /content/CIANNA

mult="10"
cat ~/cuda_infos.txt
comp_cap="$(sed '1!d' ~/cuda_infos.txt)"
cuda_vers="$(sed '2!d' ~/cuda_infos.txt)"

lim="11.1"
old_arg=$(awk '{if ($1 < $2) print "-D CUDA_OLD";}' <<<"${cuda_vers} ${lim}")

sm_val=$(awk '{print $1*$2}' <<<"${mult} ${comp_cap}")

gen_val=$(awk '{if ($1 >= 80) print "-D GEN_AMPERE"; else if($1 >= 70) print "-D GEN_VOLTA";}' <<<"${sm_val}")

sed -i "s/.*arch=sm.*/\\t\tcuda_arg=\"\$cuda_arg -D CUDA -D comp_CUDA -lcublas -lcudart -arch=sm_$sm_val $old_arg $gen_val\"/g" compile.cp
sed -i "s/\/cuda-[0-9][0-9].[0-9]/\/cuda-$cuda_vers/g" compile.cp
sed -i "s/\/cuda-[0-9][0-9].[0-9]/\/cuda-$cuda_vers/g" src/python_module_setup.py

./compile.cp CUDA PY_INTERF

mv src/build/lib.linux-x86_64-* src/build/lib.linux-x86_64

#### Testing CIANNA installation

**IMPORTANT NOTE**   
CIANNA is mainly used in a script fashion and was not designed to run in notebooks. Every code cell that directly invokes CIANNA functions must be run as a script to avoid possible errors.  
To do so, the cell must have the following structure:

```
%%shell

cd /content/CIANNA

python3 - <<EOF

[... your python code ...]

EOF
```

This syntax makes it easy to edit Python code in the notebook while running the cell as a script. Note that notebook variables cannot be accessed from the cell in this context.
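
The isolation can be reproduced with the standard library alone. The sketch below is an assumption-level illustration, not CIANNA code: it sends a one-line program to a fresh interpreter through stdin, which is what `python3 - <<EOF` does, and shows that a variable defined in the parent process is unknown to the child.

```python
import subprocess
import sys

notebook_variable = 42  # lives only in the parent (notebook) process

# Code sent to a fresh interpreter, like the body of the heredoc
script = "print(notebook_variable)"

result = subprocess.run(
    [sys.executable, "-"],  # "-" makes Python read the program from stdin
    input=script,
    capture_output=True,
    text=True,
)

# The child fails with a NameError because it never saw notebook_variable
print(result.returncode)
print(result.stderr.strip().splitlines()[-1])
```

Any value the script needs therefore has to be defined inside the heredoc, or passed through a file or an environment variable.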


### **Galaxy Zoo 2 classification**

In the original Galaxy Zoo project, volunteers classified images of Sloan Digital Sky Survey galaxies as belonging to one of six categories: elliptical, clockwise spiral, anticlockwise spiral, edge-on, star/don't know, or merger. GZ2 extends the original Galaxy Zoo classifications for a subsample of the brightest and largest galaxies in the Legacy release, measuring more detailed morphological features. This includes galactic bars, spiral arms and pitch angle, bulges, edge-on galaxies, relative ellipticities, and many others.

There are 243,434 images in total, all resized to a 424x424 resolution. Images are composed so the main object is centered and a part of the environment is visible. This implies that the FoV of each image is different.
For simplicity, we will use cropped images that are zoomed in toward the object and resized to either a 64x64 or a 128x128 resolution.
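
To make the crop-and-resize step concrete, here is a minimal NumPy sketch. It is not the preprocessing used by `aux_fct`: the function name `center_crop_and_shrink`, the 256-pixel crop size, and the block-averaging downsampling are all assumptions chosen for illustration.

```python
import numpy as np


def center_crop_and_shrink(image, crop_size=256, out_size=64):
    """Center-crop a square region, then downsample by block averaging.

    image: array of shape (H, W, C)
    crop_size must be a multiple of out_size.
    """
    h, w = image.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    crop = image[top:top + crop_size, left:left + crop_size]

    factor = crop_size // out_size
    # Split each spatial axis into (out_size, factor) and average the factor axes
    blocks = crop.reshape(out_size, factor, out_size, factor, -1)
    return blocks.mean(axis=(1, 3))


# Synthetic stand-in for a 424x424 RGB image
fake_image = np.random.default_rng(0).random((424, 424, 3))
small = center_crop_and_shrink(fake_image, crop_size=256, out_size=64)
print(small.shape)
```

Calling the function with `out_size=128` gives the higher-resolution variant from the same crop.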

Details on the classification process can be found in [Hart et al. 2016](https://academic.oup.com/mnras/article/461/4/3663/2608720?login=true).


#### Downloading and visualizing the data

In [None]:
%%shell

cd /content/

#Manually upload the directory to github if not yet opened
git clone https://github.com/Deyht/AI_astro_ED_AAIF

In [None]:
%%shell

cd /content/AI_astro_ED_AAIF/codes/CNN/classification/gz2_logistic_pred/

python3 - <<EOF

#Will download the dataset at the first call
from aux_fct import *

create_train_batch(visual=1)

EOF

In [None]:
%cd /content/AI_astro_ED_AAIF/codes/CNN/classification/gz2_logistic_pred/

from PIL import Image
import matplotlib.pyplot as plt

im = Image.open("training_set_example.jpg")
plt.figure(figsize=(5,4), dpi=250)
plt.imshow(im)
plt.gca().axis('off')
plt.show()


#### Train the classifier
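
The network defined in the cell below stacks five 2x2 max-pooling stages, so the spatial size shrinks by a factor of 32 before the dense layers. The sketch below works out that arithmetic; it assumes a convolution stride of 1 (the cell does not set one), and the `LAYERS` list and `feature_map_shape` helper are hypothetical names that only transcribe the `f_size`, `padding`, `nb_filters` and `p_size` values from the cell.

```python
# ("conv", f_size, padding, nb_filters) or ("pool", p_size), in network order.
# Normalization layers are omitted because they do not change the shape.
LAYERS = [
    ("conv", 5, 2, 16), ("pool", 2),
    ("conv", 3, 1, 32), ("pool", 2),
    ("conv", 3, 1, 64), ("pool", 2),
    ("conv", 3, 1, 128), ("conv", 1, 0, 64), ("conv", 3, 1, 128), ("pool", 2),
    ("conv", 3, 1, 192), ("conv", 1, 0, 96), ("conv", 3, 1, 192), ("pool", 2),
]


def feature_map_shape(image_size, layers=LAYERS):
    """Return (height, width, channels) after the last pooling layer."""
    size, channels = image_size, None
    for layer in layers:
        if layer[0] == "conv":
            _, f_size, padding, nb_filters = layer
            size = size + 2 * padding - f_size + 1  # stride 1 assumed
            channels = nb_filters
        else:
            size = size // layer[1]
    return size, size, channels


for image_size in (64, 128):
    h, w, c = feature_map_shape(image_size)
    print(image_size, (h, w, c), h * w * c)
```

Under these assumptions, every convolution preserves the spatial size, and only the pooling layers reduce it.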

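The training loop in the cell below prepares the next augmented batch in a background thread while the current one is used for training, then swaps the two buffers. The standard-library sketch below illustrates that pattern only; `prepare_next`, `train_one_epoch`, and the `buffers` dictionary are hypothetical stand-ins and do not call CIANNA.

```python
import time
from threading import Thread

buffers = {"active": "batch_0", "pending": None}


def prepare_next(epoch):
    """Stand-in for data augmentation: fills the pending buffer."""
    time.sleep(0.01)
    buffers["pending"] = "batch_%d" % (epoch + 1)


def train_one_epoch():
    """Stand-in for the training step: reads only the active buffer."""
    time.sleep(0.01)
    return buffers["active"]


used = []
for epoch in range(3):
    worker = Thread(target=prepare_next, args=(epoch,))
    worker.start()                   # augmentation runs in the background
    used.append(train_one_epoch())   # training overlaps with it
    worker.join()                    # wait until the next batch is ready
    # Swap: the freshly prepared batch becomes the active one
    buffers["active"], buffers["pending"] = buffers["pending"], None

print(used)
```

The `join()` before the swap matters: swapping earlier could expose a half-filled buffer to the next training step.
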
In [None]:
%%shell

cd /content/AI_astro_ED_AAIF/codes/CNN/classification/gz2_logistic_pred/

python3 - <<EOF


import time
import locale
import matplotlib.pyplot as plt
from scipy import signal
from threading import Thread

from aux_fct import *
import numpy as np

#Comment out the path insertion below to use a system-wide install
import sys, glob
sys.path.insert(0,glob.glob('/content/CIANNA/src/build/lib.*/')[-1])
import CIANNA as cnn


def i_ar(int_list):
	return np.array(int_list, dtype="int")

def f_ar(float_list):
	return np.array(float_list, dtype="float32")

def data_augm():

	data_augm, targets_augm = create_train_batch()
	cnn.delete_dataset("TRAIN_buf", silent=1)
	cnn.create_dataset("TRAIN_buf", nb_im_train, data_augm, targets_augm, silent=1)
	return


data_train, target_train = create_train_batch()
data_test, target_test = create_test_batch()

cnn.init(in_dim=i_ar([image_size,image_size]), in_nb_ch=im_depth+1, out_dim=nb_class, \
		bias=0.1, b_size=8, comp_meth="C_CUDA", dynamic_load=1, mixed_precision="FP16C_FP32A") #Change to C_BLAS or C_NAI

cnn.create_dataset("TRAIN", size=nb_im_train, input=data_train, target=target_train)
#cnn.create_dataset("VALID", size=nb_im_test , input=data_test , target=target_test)
cnn.create_dataset("TEST", size=nb_im_test , input=data_test , target=target_test)


load_epoch = 0
if (len(sys.argv) > 1):
	load_epoch = int(sys.argv[1])
if(load_epoch > 0):
	cnn.load("net_save/net0_s%04d.dat"%load_epoch,load_epoch,bin=1)
else:

	cnn.conv(f_size=i_ar([5,5]), nb_filters=16  , padding=i_ar([2,2]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.norm(group_size=1, activation="LIN")

	cnn.conv(f_size=i_ar([3,3]), nb_filters=32  , padding=i_ar([1,1]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.norm(group_size=1, activation="LIN")

	cnn.conv(f_size=i_ar([3,3]), nb_filters=64	, padding=i_ar([1,1]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.norm(group_size=1, activation="LIN")

	cnn.conv(f_size=i_ar([3,3]), nb_filters=128 , padding=i_ar([1,1]), activation="RELU")
	cnn.conv(f_size=i_ar([1,1]), nb_filters=64  , padding=i_ar([0,0]), activation="RELU")
	cnn.conv(f_size=i_ar([3,3]), nb_filters=128 , padding=i_ar([1,1]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.norm(group_size=1, activation="LIN")

	cnn.conv(f_size=i_ar([3,3]), nb_filters=192 , padding=i_ar([1,1]), activation="RELU")
	cnn.conv(f_size=i_ar([1,1]), nb_filters=96  , padding=i_ar([0,0]), activation="RELU")
	cnn.conv(f_size=i_ar([3,3]), nb_filters=192 , padding=i_ar([1,1]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.norm(group_size=1, activation="LIN")

	cnn.dense(nb_neurons=512, drop_rate=0.4, activation="RELU")
	cnn.dense(nb_neurons=nb_class, strict_size=1, activation="LOGI")



for c_iter in range(load_epoch,2000):
	t = Thread(target=data_augm)
	t.start()

	cnn.train(nb_iter=1, learning_rate=0.005, end_learning_rate=0.0005, lr_decay=0.001, momentum=0.0, weight_decay=0.00002,
		confmat=0, control_interv=0, save_every=0, save_bin=1, TC_scale_factor=8.0, shuffle_every=0)

	if((c_iter+1)%10 == 0):
		cnn.forward(saving=2)
		pred_raw = np.fromfile("fwd_res/net0_%04d.dat"%(c_iter+1), dtype="float32")
		pred = np.reshape(pred_raw, (nb_im_test,-1))
		pred = pred[:,:nb_class]

		if(0):
			for k in range(0, nb_class):
				c_precision = 0
				c_recall = 0
				#Recall = TP/(TP+FN) and precision = TP/(TP+FP)
				nb_current_class = np.shape(np.where((target_test[:,k] >= 0.99))[0])[0]
				nb_true = np.shape(np.where((target_test[:,k] >= 0.99) & (pred[:,k] >= 0.5))[0])[0]
				nb_false = np.shape(np.where((target_test[:,k] < 0.99) & (pred[:,k] >= 0.5))[0])[0]
				if(nb_current_class > 0):
					c_recall = nb_true/nb_current_class
				if(nb_true+nb_false > 0):
					c_precision = nb_true/(nb_true+nb_false)
				print ("C%02d - R = %0.3f - P = %0.3f - N_targ = %5d - N_True = %5d - N_False = %5d"%(k, c_recall, c_precision, nb_current_class, nb_true, nb_false))
		else:
			for master_class in master_classes:
				subset_index = np.where(gz2_catalog_header[5:].astype("<U3") == master_class)[0]
				targ_subset = target_test[:,subset_index]
				real_class = np.argmax(targ_subset, axis=1)
				nb_sub_class = np.shape(targ_subset)[1]

				pred_subset = pred[:,subset_index]
				#Convert the logistic outputs back to logits, then apply a softmax over the answers of this question
				pred_subset = -np.log(1.0/np.clip(pred_subset,0.01,0.99) - 1.0)
				pred_subset = np.exp(pred_subset)
				pred_subset /= np.sum(pred_subset, axis=1, keepdims=True)
				pred_class = np.argmax(pred_subset, axis=1)

				class_conf_mat = np.zeros((nb_sub_class+1, nb_sub_class+1), dtype="float")
				for i in range(0, nb_sub_class):
					for j in range(0, nb_sub_class):
						index = np.where((real_class == i) & (pred_class == j))[0]
						class_conf_mat[i,j] += np.shape(index)[0]

				sum_good = 0
				for i in range(0, nb_sub_class):
					class_conf_mat[i,nb_sub_class] = class_conf_mat[i,i]/max(1,np.sum(class_conf_mat[i,:]))
					class_conf_mat[nb_sub_class,i] = class_conf_mat[i,i]/max(1,np.sum(class_conf_mat[:,i]))
					sum_good += class_conf_mat[i,i]
				class_conf_mat[nb_sub_class,nb_sub_class] = sum_good/np.sum(class_conf_mat[:-1,:-1])

				print("%s confmat"%master_class)
				for i in range(0, nb_sub_class):
					for j in range(0, nb_sub_class):
						print("%5d "%(int(class_conf_mat[i,j])), end="")
					print("%0.3f "%(class_conf_mat[i,nb_sub_class]))
				for j in range(0, nb_sub_class+1):
					print("%0.3f "%(class_conf_mat[nb_sub_class,j]), end="")
				print("")


	t.join()
	cnn.swap_data_buffers("TRAIN")


EOF
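
The evaluation in the cell above renormalizes the independent logistic outputs within each question before building a confusion matrix. The sketch below reproduces those two steps on toy data. The helper names `renormalize` and `confusion_matrix` and the toy arrays are hypothetical; only the clipping bounds (0.01, 0.99) and the logit-then-softmax logic are taken from the cell.

```python
import numpy as np


def renormalize(pred_subset, eps=0.01):
    """Turn independent logistic outputs into per-question probabilities."""
    p = np.clip(pred_subset, eps, 1.0 - eps)
    odds = p / (1.0 - p)  # equal to exp(logit(p))
    return odds / odds.sum(axis=1, keepdims=True)


def confusion_matrix(real_class, pred_class, nb_sub_class):
    """Rows are true classes, columns are predicted classes."""
    mat = np.zeros((nb_sub_class, nb_sub_class), dtype=int)
    np.add.at(mat, (real_class, pred_class), 1)
    return mat


# Toy data: 4 objects, one question with 3 possible answers
pred = np.array([[0.9, 0.2, 0.1],
                 [0.3, 0.8, 0.4],
                 [0.2, 0.1, 0.7],
                 [0.6, 0.7, 0.1]])
targ = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0]])

prob = renormalize(pred)
mat = confusion_matrix(np.argmax(targ, axis=1), np.argmax(prob, axis=1), 3)
accuracy = np.trace(mat) / mat.sum()
print(mat)
print(accuracy)
```

Because the odds `p/(1-p)` increase monotonically with `p`, the renormalization does not change which answer has the highest score; it only makes the values within one question sum to 1.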