In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import scikit-learn libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Goal

The goal of this notebook is to classify objects observed by the Sloan Digital Sky Survey as galaxies, stars, or quasars using logistic regression. This is a work in progress, and I would like to keep implementing new algorithms for this classification task.

# Exploratory Data Analysis

In [None]:
sdss = pd.read_csv('/kaggle/input/sloan-digital-sky-survey-dr16/Skyserver_12_30_2019 4_49_58 PM.csv')
sdss

Now that we have it loaded up, let's look at all the different features.

In [None]:
sdss.info()

In [None]:
print('Number of NaN values for each feature:\n',sdss.isnull().sum())

In [None]:
print('Number of unique values for each feature:\n',sdss.nunique())

It also appears that the object ID (objid) is not unique for every row, which raises the question: are there duplicate objects? According to the Glossary of SDSS Terminology (https://www.sdss.org/dr12/help/glossary/#O), objects are enumerated within a given image field, so the same number can show up again in a different field. Looking at specobjid, we can see that there are 100,000 unique spectra, which means there should be 100,000 unique objects.
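
A uniqueness check like this can be sketched with pandas. The table below is a made-up stand-in (the column names match the dataset, the values do not), so treat it as an illustration rather than the notebook's actual result:

```python
import pandas as pd

# Tiny stand-in table; the values are invented for illustration only.
toy = pd.DataFrame({
    "objid": [1, 2, 2, 3],
    "specobjid": [101, 102, 103, 104],
    "class": ["STAR", "GALAXY", "QSO", "GALAXY"],
})

# An identifier column is a usable key only if no value repeats.
objid_is_unique = toy["objid"].is_unique
specobjid_is_unique = toy["specobjid"].is_unique

# Rows whose objid appears more than once (keep=False marks every copy).
repeated_objid_rows = toy[toy["objid"].duplicated(keep=False)]

print(objid_is_unique, specobjid_is_unique)
print(repeated_objid_rows)
```

On the real data the same calls would be applied to `sdss` instead of `toy`.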

In [None]:
sdss.head()

Some of these columns, such as objid and specobjid, are here mostly for keeping track of the images. Objid and specobjid are built from the other features (camcol, field, mjd, plate, fiberid, run, and rerun), which are packed as bit fields into a 64-bit ID. These features do not describe the star, galaxy, or quasar itself, which means we can drop all of them.
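
To make the idea of packing several fields into one integer ID concrete, here is a minimal sketch. The field names are borrowed from the dataset, but the bit widths and their order are hypothetical and are not the real SDSS layout; the SDSS documentation should be consulted for the actual one.

```python
# Hypothetical field layout (name, bit width), most significant field first.
# This is NOT the real SDSS layout; it only illustrates the packing idea.
LAYOUT = [("run", 16), ("camcol", 3), ("field", 12), ("obj", 16)]


def pack(values):
    """Pack the named integer fields into a single integer ID."""
    packed = 0
    for name, width in LAYOUT:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        # Make room for this field, then write it into the low bits.
        packed = (packed << width) | value
    return packed


def unpack(packed):
    """Recover the named fields from a packed ID using masks and shifts."""
    values = {}
    for name, width in reversed(LAYOUT):
        values[name] = packed & ((1 << width) - 1)  # mask the low bits
        packed >>= width                             # drop them
    return values


example = {"run": 756, "camcol": 3, "field": 267, "obj": 42}
packed_id = pack(example)
print(packed_id, unpack(packed_id))
```

Because the ID is fully determined by the bookkeeping fields, it carries no extra physical information about the object.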

I am also going to train my logistic regression classifier without the RA and DEC since I don't want my model to be more inclined to classify an object based on where it is in the sky. RA and DEC are just coordinates of the star/galaxy/quasar which means they are not actual features of these objects.

I aim to predict the object based on the photometric/spectral qualities of the objects which means I will only use u, g, r, i, z, and redshift for my predictions.

In [None]:
sdss_features = sdss.drop(columns=['objid', 'ra','dec', 'run', 'rerun', 'camcol', 'field','specobjid', 'plate', 'mjd', 'fiberid'])
sdss_features

Now, let's take a deeper look into these features. 

In [None]:
sdss_features.describe()

The numbers make sense: the magnitude tends to go up (the object is dimmer) as the redshift goes up (the object is farther away). The mean redshift is 0.17, which can be used to calculate the distance to the objects. The calculation is involved, so I used a cosmology calculator (http://www.astro.ucla.edu/~wright/CosmoCalc.html) to get an approximate distance. The distance corresponding to the mean redshift turns out to be 668 Mpc, or 2.06e22 km. The furthest object is 2.21e23 km away and the closest is 5.30e20 km. These distances just give us an idea of the range these objects span. We will now look at the values for each specific class.
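
The unit conversion from megaparsecs to kilometres is simple arithmetic. A small sketch, assuming the usual value of about 3.0857e13 km per parsec (the helper name `mpc_to_km` is just for illustration):

```python
# 1 parsec is about 3.0857e13 km, so 1 megaparsec is about 3.0857e19 km.
KM_PER_MPC = 3.0857e19


def mpc_to_km(distance_mpc):
    """Convert a distance in megaparsecs to kilometres."""
    return distance_mpc * KM_PER_MPC


mean_distance_km = mpc_to_km(668)
print(f"{mean_distance_km:.2e} km")
```

This only converts units; going from redshift to distance still needs a cosmological model, which is what the calculator handles.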

# Data Visualization

In [None]:
#Filter each class
stars = sdss_features[sdss_features['class'] == 'STAR']
quasars = sdss_features[sdss_features['class'] == 'QSO']
galaxies = sdss_features[sdss_features['class'] == 'GALAXY']

In [None]:
color_palette = 'GnBu_d'
sns.set()
fig = plt.gcf()
fig.set_size_inches(13,9)
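# Pass the column as x explicitly; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x=sdss_features['class'], palette=color_palette)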
plt.show()

We can see that we have 4-5x more galaxies and stars than quasars. This means that our model may be a little less accurate at predicting quasars because of the lower number of samples.
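
The imbalance can also be read off as numbers instead of from the plot. A minimal sketch with invented labels (not the SDSS counts), using `value_counts`:

```python
import pandas as pd

# Invented labels, only to show the calls; these are not the SDSS counts.
labels = pd.Series(["GALAXY"] * 5 + ["STAR"] * 4 + ["QSO"] * 1, name="class")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction of all rows per class

# Ratio of the largest class to the smallest one.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```

In the notebook the same calls would be applied to `sdss_features['class']`.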

In [None]:
sns.set(style='darkgrid')
# Set the figure size when creating the subplots instead of resizing afterwards
fig, axs = plt.subplots(nrows=3, figsize=(13, 9))
plt.subplots_adjust(hspace=0.8)
# Pass the values as x explicitly so each box is drawn horizontally on its own axis
sns.boxplot(x=stars['redshift'], palette=color_palette, ax=axs[0]).set_title('Stars')
sns.boxplot(x=galaxies['redshift'], palette=color_palette, ax=axs[1]).set_title('Galaxies')
sns.boxplot(x=quasars['redshift'], palette=color_palette, ax=axs[2]).set_title('Quasars')
plt.show()

These plots show the distribution of redshifts for each class and give us insight into the distance to each class. Stars are the closest, galaxies are further away, and quasars are the furthest. Since the distribution of redshifts differs between classes, redshift is likely to be an important feature for classifying each object.
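
The same comparison can be made numerically with a per-class summary. The sketch below uses a handful of invented redshift values purely to show the `groupby` call; the real numbers come from `sdss_features`:

```python
import pandas as pd

# Invented values for illustration; not taken from the SDSS data.
toy = pd.DataFrame({
    "class": ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"],
    "redshift": [0.0001, -0.0002, 0.08, 0.12, 1.5, 2.1],
})

# One row per class, with the spread of its redshift values.
summary = toy.groupby("class")["redshift"].agg(["min", "median", "max"])
print(summary)
```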

In [None]:
sns.set(style='darkgrid')
sns.pairplot(sdss_features, hue='class')
plt.show()

The pairplot shows that many of the features are linearly correlated within each class. This is to be expected, since the magnitude measures how bright the object is: as the magnitude increases in one wavelength band, it should also increase in the other bands. The distribution of each magnitude is also similar across classes, which means it might be harder to separate them based on the magnitudes alone.

In [None]:
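# Drop the string-valued 'class' column first; correlations are only defined for numeric columns
sdss_features_corr = sdss_features.drop(columns='class').corr()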
fig = plt.gcf()
fig.set_size_inches(13,9)
sns.heatmap(sdss_features_corr, annot=True)
plt.show()

It looks like the u, g, r, i, and z features are all highly correlated, which points to multicollinearity. Logistic regression assumes little or no multicollinearity among the features. However, each of these features provides information on a different wavelength band. Instead of dropping some of the bands, which may hold important information about each example, I will use L2 (ridge-style) regularization. It penalizes large coefficients and shrinks them, which should reduce the effect of multicollinearity. L2 is also the default penalty of scikit-learn's LogisticRegression.
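
To see what the L2 penalty does with correlated inputs, here is a minimal sketch on synthetic data (two nearly identical features, invented for this example). It relies on scikit-learn's default L2 penalty and only varies `C`, the inverse of the regularization strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: two almost identical (highly correlated) features.
base = rng.normal(size=500)
X = np.column_stack([base, base + 0.01 * rng.normal(size=500)])
y = (base + 0.5 * rng.normal(size=500) > 0).astype(int)

# In scikit-learn, C is the inverse of the regularization strength:
# a smaller C means a stronger L2 penalty on the coefficients.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)
```

The strongly regularized model ends up with the smaller coefficient norm, which is the shrinkage effect described above. The notebook itself keeps the default `C=1.0`.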

# Feature Engineering

So now that we've had a closer look into our data it's time to use logistic regression to classify. First, we will need to split the data into a training set and a test set.

In [None]:
sdss_data = sdss_features[['u','g','r','i','z','redshift']]

# Convert the class names to integer labels so they can be used in the model.
# pd.factorize returns (codes, uniques); only the codes are needed here.
sdss_target = pd.factorize(sdss_features['class'])[0]


# Split the data 70/30; random_state=0 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(sdss_data, sdss_target, test_size=0.30, random_state=0)
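
Since the classes are imbalanced, one option (not used in this notebook) is a stratified split, which keeps the class proportions the same in the training and test sets. A minimal sketch on invented labels, using the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 samples with a 40/40/20 class mix.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 40 + [1] * 40 + [2] * 20)

# stratify=y asks for the same class mix in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```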

Next we will want to scale our features, since the redshift is much smaller than all of the other features. I will use the RobustScaler because many of the quasar redshifts are high compared to those of the stars or galaxies, so they may behave like outliers.

In [None]:
robust_scaler = RobustScaler()

#fit_transform first fits the scaler (computes its parameters) on the training data, then applies the transform
x_train = robust_scaler.fit_transform(x_train)

#only transform the test set, reusing the parameters fitted on the training data
x_test = robust_scaler.transform(x_test)
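
The difference between the two scalers is easiest to see on a tiny column with one extreme value. The numbers below are invented for illustration. With default settings, MinMaxScaler divides by the full range (max minus min), while RobustScaler subtracts the median and divides by the interquartile range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One feature with a single extreme value (invented numbers).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min); the outlier sets the range.
minmax_scaled = MinMaxScaler().fit_transform(values).ravel()

# Robust scaling: (x - median) / IQR; the outlier barely affects the scale.
robust_scaled = RobustScaler().fit_transform(values).ravel()

print(minmax_scaled)
print(robust_scaled)
```

After min-max scaling, the four ordinary values are squeezed into a narrow band near 0, whereas robust scaling keeps them spread out and leaves the extreme value far away.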

# Apply Logistic Regression

In [None]:
logRegression = LogisticRegression(max_iter=350)

logRegression.fit(x_train, y_train)
predictions = logRegression.predict(x_test)

accuracy = logRegression.score(x_test, y_test)

print('Classification Test Score:', accuracy ,'\n')
print('Classification Performance:\n', classification_report(y_test, predictions),'\n')
print('Train Score:', logRegression.score(x_train,y_train))

cm = confusion_matrix(y_test, predictions)

fig = plt.gcf()
fig.set_size_inches(13,9)
sns.heatmap(cm, annot=True).set_title('Accuracy Score: {}'.format(accuracy))
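# confusion_matrix puts the true labels on the rows (y-axis) and the predictions on the columns (x-axis)
plt.xlabel('Predicted Class')
plt.ylabel('Actual Class')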

plt.show()

In this matrix 0 corresponds to stars, 1 to galaxies, and 2 to quasars.
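
Note that `pd.factorize` numbers the classes in order of first appearance, so the mapping depends on the row order of the data. A minimal sketch (with invented labels) of how the mapping can be read back from the second return value:

```python
import pandas as pd

# Invented labels; the codes follow the order of first appearance.
labels = pd.Series(["GALAXY", "QSO", "GALAXY", "STAR"])

codes, uniques = pd.factorize(labels)

# Position i in `uniques` is the class that received code i.
code_to_class = dict(enumerate(uniques))
print(codes, code_to_class)
```

Keeping the full result of `pd.factorize(sdss_features['class'])` would allow the same lookup for the real labels.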

The confusion matrix shows that a lot of the galaxies were misclassified as stars and a lot of the quasars were misclassified as galaxies. The misclassification of the quasars as galaxies may come from the overlap in their redshift distributions. The misclassification of the galaxies may come from the overlapping values of the u, g, r, i, and z features.
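
When reading a confusion matrix from scikit-learn, entry `[i, j]` counts the samples whose true class is `i` and whose predicted class is `j`. A small sketch with invented labels, which also derives the per-class recall from the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Recall per class: correct predictions divided by the row total.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```

Off-diagonal entries in a row show which classes that true class was mistaken for.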

# Conclusion

In conclusion, our model achieved an accuracy of 93% on the test data, which is not bad. Unfortunately, some objects were misclassified, likely due to overlaps in the redshift distributions and the multicollinearity of the magnitude features. I also found that using the MinMaxScaler resulted in an accuracy of 83%, and switching to the RobustScaler increased it to 93%. I believe this is because of the difference in redshifts between the quasars and the galaxies/stars: dividing by the full max-min range of the redshift may have compressed most values too much, while the RobustScaler avoids this by scaling with the interquartile range instead.