# Sloan Digital Sky Survey

## Space bodies classification

***Francesco Pudda, 21/10/2020***

## Introduction

The Sloan Digital Sky Survey, or SDSS, is a multi-spectral imaging and spectroscopic redshift survey carried out with a dedicated telescope at Apache Point Observatory in New Mexico. It has produced the most detailed three-dimensional map of the universe ever made, with multi-color images of one third of the sky and spectra for more than three million astronomical objects. The data are available on its website and can be accessed via <i>SQL</i> queries [1][sdss] [2][sdss_wiki].

The data used in this project are the result of a query that joins two tables: <i>PhotoObj</i>, which contains photometric data, and <i>SpecObj</i>, which contains spectral data. They can be retrieved with the SkyServer SQL Search tool using the command provided in the project description. The query's SELECT clause includes the columns needed to upload the results to the SAS (Science Archive Server) for FITS file retrieval [3][kaggle].

[sdss]: www.sdss.org
[sdss_wiki]: en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey
[kaggle]: www.kaggle.com/muhakabartay/sloan-digital-sky-survey-dr16

## Data loading

Let's first set up the libraries and load the data.

In [None]:
import pandas as pd
import seaborn as sb
import numpy as np
import xgboost
import matplotlib.pyplot as plt
from sklearn import feature_selection, metrics, model_selection, preprocessing

In [None]:
df = pd.read_csv("../input/sloan-digital-sky-survey-dr16/Skyserver_12_30_2019 4_49_58 PM.csv")

In [None]:
df.head()

We can see that there are many physical features that will probably be used for classification, but also some features concerning the process of acquisition which might be discarded.

Let's first do some renaming, as a matter of personal preference.

In [None]:
df = df.rename(columns={'ra': 'r_ascension', 'dec': 'declination', 'u': 'u_band',
                        'g': 'g_band', 'r': 'r_band', 'i': 'i_band', 'z': 'z_band',
                        'camcol': 'camera_col', 'class': 'label'})

df = df.replace(to_replace="QSO", value="QUASAR")

print(df.label.value_counts())

Note that there are two different ID columns, a result of the <i>join</i> operation used to create this table.

In [None]:
print("Distinct objid: %d" %len(df.objid.unique()))
print("Distinct specobjid: %d" %len(df.specobjid.unique()))

df = df.drop(columns=["objid", "specobjid"])

Some IDs in the first table are not distinct, whereas the IDs in the second are all unique. The reason for this is not given, but I am assuming that every row is a different and unique sample, so I can discard those columns without any problem.
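One possible explanation, offered here only as an assumption, is that the join is one-to-many: an object with more than one spectrum would appear in several rows that share an `objid` but have different `specobjid` values. The sketch below illustrates this with two made-up toy tables; the column layout and all values are hypothetical and are not taken from SDSS.

```python
import pandas as pd

# Toy stand-ins for the photometric and spectroscopic tables (all values made up).
photo = pd.DataFrame({"objid": [1, 2, 3], "u": [18.2, 19.1, 17.5]})
spec = pd.DataFrame({
    "specobjid": [10, 11, 12, 13],
    "objid": [1, 2, 2, 3],  # object 2 has two spectra
    "class": ["STAR", "GALAXY", "GALAXY", "QSO"],
})

# Inner join: every spectrum is paired with its photometric object.
joined = photo.merge(spec, on="objid", how="inner")

print("Rows: %d" % len(joined))
print("Distinct objid: %d" % joined["objid"].nunique())
print("Distinct specobjid: %d" % joined["specobjid"].nunique())
```

In this toy case the joined table has more rows than distinct `objid` values, while every `specobjid` stays unique, which is the same pattern as the counts printed above.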

Then convert the label column to categorical and check for any missing values.

In [None]:
df['label'] = pd.Categorical(df['label'],
                             categories = ['STAR', 'GALAXY', 'QUASAR'])

In [None]:
summary = pd.DataFrame()
summary['Name'] = df.columns
summary['Type'] = df.dtypes.values
summary['NA'] = df.isna().sum().values
print(summary.to_string(index=False))

## Exploratory data analysis

Now it is time to visually inspect the data. I am going to plot the distribution of each column, grouped by class label.

In [None]:
melted = df.melt(id_vars=['label'],
                 value_vars=['r_ascension','declination','u_band','g_band','r_band','i_band','z_band','redshift'])
g = sb.FacetGrid(melted, col='variable', col_wrap=4, hue='label',
                 margin_titles=True, sharex=False, sharey=False)
g = g.map(sb.kdeplot, 'value', fill=True)
g = g.add_legend()

In [None]:
g = sb.FacetGrid(df, col='label', hue='label', margin_titles=True, sharex=False, sharey=False)
g = g.map(sb.kdeplot, 'redshift', fill=True)

In [None]:
melted = df.melt(id_vars=['label'],
                 value_vars=['run','rerun','camera_col','field','plate','mjd','fiberid'])
g = sb.FacetGrid(melted, col='variable', col_wrap=4, hue='label',
                 margin_titles=True, sharex=False, sharey=False)
g = g.map(sb.kdeplot, 'value', fill=True)
g = g.add_legend()

As far as physical features are concerned, the distributions of <i>r_ascension</i> and <i>declination</i> are very similar across the classes. Every other feature, on the other hand, shows distinct per-class distributions, with the <i>quasar</i> class in particular showing very distinctive patterns. <i>redshift</i> is a special case: the three classes had to be displayed separately because of the sharply peaked distribution of the <i>star</i> class. All in all, I would guess that <i>r_ascension</i> and <i>declination</i> will not be very significant for classification, which is logical since these two features describe an object's location in the sky rather than its physical properties. The <i>band</i> features should be more important, especially for discriminating <i>quasar</i> from <i>not quasar</i>, and <i>redshift</i> should be the most significant of all.

As for the acquisition features, they all show much the same pattern across the classes, except for <i>mjd</i>, the acquisition date, which has no logical bearing on classification. In addition, <i>rerun</i> raised warnings saying that the data must have a variance, meaning that it takes the same value in every sample. All in all, even if some of these features may show some correlation with the label, I am not going to keep them: they are not physical properties, and any discriminative power they have is probably due to statistical fluctuations. Keeping them might even give better results on this dataset, but the model would be biased towards it and might not generalise to unseen data.
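A zero-variance column such as <i>rerun</i> can also be detected directly, without relying on a plotting warning. The following is a minimal sketch on a made-up toy frame (the values are hypothetical); on the real data the same list comprehension could be applied to `df`.

```python
import pandas as pd

# Toy frame in which 'rerun' is constant; all values are made up.
toy = pd.DataFrame({
    "run": [756, 752, 1345, 756],
    "rerun": [301, 301, 301, 301],
    "field": [267, 268, 110, 269],
})

# A column with a single distinct value has zero variance and carries no information.
constant_columns = [name for name in toy.columns if toy[name].nunique() == 1]
print(constant_columns)
```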

In [None]:
df = df.drop(columns=["run", "rerun", "camera_col", "field", "plate", "mjd", "fiberid"])

In [None]:
columns = ['r_ascension','declination','u_band','g_band','r_band','i_band','z_band','redshift']

## Feature engineering

Now it is necessary to select the best features for classification. I could arbitrarily choose the ones I consider most promising, but I prefer to use statistical tools to help me decide. I will start by plotting the correlation matrix to get a general idea of how the features correlate, then move on to univariate filter selection methods and a recursive feature elimination algorithm.

In [None]:
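# Correlate the numeric columns only; the categorical label is excluded.
corr = df.select_dtypes(include='number').corr()
# Mask the upper triangle so each pair of features appears only once.
cormap = sb.heatmap(corr, mask=np.triu(np.ones_like(corr, dtype=bool)), cmap=sb.diverging_palette(220, 10, as_cmap=True))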

In [None]:
x = df.loc[:, df.columns != 'label']
y = df.loc[:,'label'].to_numpy()

In [None]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.1, random_state=1, stratify=y)

In [None]:
univariate_selection = feature_selection.SelectKBest(k='all').fit(x_train, y_train)
univariate_scores = dict(zip(columns, univariate_selection.scores_))
sorted(univariate_scores.items(), key=lambda t: t[1])

As expected, the two results are consistent with my earlier predictions: <i>r_ascension</i> and <i>declination</i> are not very useful for classification because their distributions are the same across classes.
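Recursive feature elimination is mentioned above, so here is a minimal sketch of how it could look with scikit-learn's `RFE`. The synthetic data, the decision-tree estimator and the choice of keeping six features are all assumptions made for illustration; this is not necessarily the setup used for the results discussed here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the eight candidate features.
x_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_redundant=0, n_classes=3, random_state=1)

# Repeatedly fit the estimator and drop the least important feature until six remain.
selector = RFE(DecisionTreeClassifier(random_state=1), n_features_to_select=6, step=1)
selector.fit(x_demo, y_demo)

print("Kept features:", selector.support_)
print("Ranking (1 = kept):", selector.ranking_)
```

On the real data, `x_demo` and `y_demo` would be replaced by `x_train` and `y_train`, and the boolean mask in `support_` could be used to pick the column names.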

In [None]:
columns = ['u_band', 'g_band', 'r_band', 'i_band', 'z_band', 'redshift']
x_train = x_train.loc[:, columns]
x_test = x_test.loc[:, columns]

As a final step, the columns are standardised (zero mean, unit variance), since many classification algorithms work better when all columns are on the same scale.

In [None]:
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
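To make the scaling step concrete, the sketch below checks on made-up numbers that `StandardScaler` applies the z-score `(x - mean) / std`, with the mean and standard deviation taken from the training set only. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two columns on very different scales (values are made up).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# The scaler learns its statistics from the training set only.
scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and population standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler.transform(test))
print(np.allclose(scaler.transform(test), manual))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.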

## Model selection

I ran many cross-validation experiments on my machine with different algorithms, which I won't show here for performance reasons. All the algorithms performed really well (accuracy greater than 97%), but the one that outclassed them all was XGBoost. I also tried voting ensembles, but the overall model performed worse than XGBoost alone. The reason is probably that most entries are classified correctly by every model, while the remaining ones are misclassified by the majority of the other estimators, so the vote goes against XGBoost on exactly those entries.
I finally ran an extensive hyperparameter search to get the most out of XGBoost.
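Since the comparison itself is not shown, the following is only a rough sketch of what such an experiment might look like. The three candidate models are arbitrary stand-ins (the algorithms actually tried are not listed), the data are synthetic, and XGBoost is left out so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data; the real notebook would use x_train and y_train instead.
x_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_redundant=0, n_classes=3, random_state=1)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "forest": RandomForestClassifier(n_estimators=50, random_state=1),
}
# Hard voting: each base model casts one vote and the majority class wins.
candidates["voting"] = VotingClassifier(estimators=list(candidates.items()), voting="hard")

# Mean 5-fold cross-validated accuracy for every candidate, ensemble included.
scores = {}
for name, estimator in candidates.items():
    scores[name] = cross_val_score(estimator, x_demo, y_demo, cv=5, scoring="accuracy").mean()
    print("%s: %.3f" % (name, scores[name]))
```

With hard voting, a strong model is outvoted whenever two weaker models agree on a wrong class, which is the mechanism suggested above for the ensemble's lower score.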

## Final model evaluation

In [None]:
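# Encode the string labels as integers, which recent XGBoost versions require.
label_encoder = preprocessing.LabelEncoder().fit(y_train)
model = xgboost.XGBClassifier(booster='gbtree', max_depth=10,
                              learning_rate=0.6, reg_lambda=2,
                              n_estimators=400).fit(x_train, label_encoder.transform(y_train))
# Map the integer predictions back to the original class names.
predictions = label_encoder.inverse_transform(model.predict(x_test))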

In [None]:
def plot_confusion_matrix(cm, labels):
    display_labels = labels
    display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                             display_labels=display_labels)
    return display.plot(include_values=True,
                        cmap='viridis', ax=None, xticks_rotation='horizontal',
                        values_format=None)

In [None]:
print("f1-score:  %.3f" %metrics.f1_score(y_test, predictions, average='weighted'))
print("Balanced accuracy:  %.3f" %metrics.balanced_accuracy_score(y_test, predictions))

cm = plot_confusion_matrix(metrics.confusion_matrix(y_test, predictions), labels=["GALAXY","QUASAR","STAR"])
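To show what the two scores above measure, here is a small worked example on made-up labels (not the model's real predictions). Balanced accuracy is the unweighted mean of the per-class recalls, and the weighted F1-score is the mean of the per-class F1-scores weighted by class support.

```python
import numpy as np
from sklearn import metrics

# Made-up ground truth and predictions for ten objects.
y_true = np.array(["GALAXY"] * 4 + ["QUASAR"] * 2 + ["STAR"] * 4)
y_pred = np.array(["GALAXY", "GALAXY", "GALAXY", "STAR",
                   "QUASAR", "GALAXY",
                   "STAR", "STAR", "STAR", "STAR"])

# Per-class scores, in alphabetical label order: GALAXY, QUASAR, STAR.
recalls = metrics.recall_score(y_true, y_pred, average=None)
f1_per_class = metrics.f1_score(y_true, y_pred, average=None)
support = np.array([4, 2, 4])  # number of true samples in each class

balanced = metrics.balanced_accuracy_score(y_true, y_pred)
weighted_f1 = metrics.f1_score(y_true, y_pred, average="weighted")

print("Recalls:", recalls)
print("Balanced accuracy equals mean recall:", np.isclose(balanced, recalls.mean()))
print("Weighted F1 equals support-weighted mean:",
      np.isclose(weighted_f1, np.average(f1_per_class, weights=support)))
```

Because balanced accuracy gives every class the same weight, it is not inflated by good performance on the largest class alone.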

In [None]:
ax = xgboost.plot_importance(model)
order = [int(i.get_text()[1:]) for i in ax.get_yticklabels()]
ax.set_yticklabels(np.array(columns)[order])

plt.show()