# About the data:

### From the Overview section,

"The data consists of 10,000 observations of space taken by the SDSS. Every observation is described by 17 feature columns and 1 class column which identifies it to be either a star, galaxy or quasar."

In this notebook, 30% of the data is used for testing and 70% for training.

# Overview of Data
### Labels

So what exactly are stars, galaxies, and quasars? Had you asked me before I started this project, I would not have been able to answer (shame on me). Fortunately, Faraz’s notebook succinctly summarises what they are:


*     A GALAXY is a gravitationally bound system of stars, stellar remnants, interstellar gas, dust, and dark matter. Galaxies are categorised according to their visual morphology as elliptical, spiral, or irregular. Many galaxies are thought to have supermassive black holes at their active centers.
*     A STAR is a type of astronomical object consisting of a luminous spheroid of plasma held together by its own gravity. The nearest star to Earth is the Sun.
*     A QUASAR, also known as a quasi-stellar object, is an extremely luminous active galactic nucleus (AGN). The power radiated by quasars is enormous. The most powerful quasars have luminosities exceeding $10^{41}$ watts, thousands of times greater than an ordinary large galaxy such as the Milky Way.

# Features

### A summary of the more important features:

*     ra, dec — right ascension and declination respectively
*     u, g, r, i, z — filter bands (a.k.a. photometric system or astronomical magnitudes)
*     run, rerun, camcol, field — descriptors of fields (i.e. 2048 x 1489 pixels) within image
*     redshift — increase in wavelength due to motion of astronomical object
*     plate — plate number
*     mjd — modified Julian date of observation
*     fiberid — optic fiber ID
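
The redshift feature can be made concrete with its standard definition. As a sketch, write $\lambda_{\mathrm{emit}}$ for the wavelength at which light is emitted and $\lambda_{\mathrm{obs}}$ for the wavelength at which it is observed (these symbols are introduced here for illustration and are not columns of the dataset):

$$
z = \frac{\lambda_{\mathrm{obs}} - \lambda_{\mathrm{emit}}}{\lambda_{\mathrm{emit}}}
$$

A positive $z$ means the observed wavelength is longer than the emitted one.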

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.



#### First, we import the packages we need: NumPy and pandas for data manipulation, Matplotlib and seaborn for plotting, and the scikit-learn modules for modelling.


In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from tensorflow import keras
%matplotlib inline

# Exploratory Analysis

#### Loading Data

In [None]:
data = pd.read_csv("/kaggle/input/sloan-digital-sky-survey/Skyserver_SQL2_27_2018 6_51_39 PM.csv")

In [None]:
data.head()

In [None]:
data.shape

#### The object ID columns are of no use in the analysis, so we drop them from the dataset.

In [None]:


# drop the object id columns, they are of no use in the analysis
data.drop(['objid','specobjid'], axis=1, inplace=True)



Data after dropping the columns

In [None]:
data.head(20)

In [None]:
data.shape

In [None]:
data.describe()

Check for null values

In [None]:
data.info()

No missing data 
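
`data.info()` reports non-null counts per column, but the check can also be made explicit. The following is a minimal sketch on a toy frame (the name `toy` and its values are made up for illustration; in the notebook the same calls would be applied to `data`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "u": [19.4, 18.6, np.nan],
    "g": [17.0, 17.2, 18.1],
    "redshift": [0.0001, 0.12, 1.5],
})

# Count missing values per column; all zeros would mean no missing data.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single boolean answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real dataset, `data.isnull().sum()` should show zero for every column if there is indeed no missing data.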

### The target is the object class: star, galaxy, or quasar. The class column therefore has 3 categories, and we need to convert them into numeric labels.

In [None]:
le = LabelEncoder().fit(data['class'])
data['class'] = le.transform(data['class'])

# The result

In [None]:
data.head(20)

After label encoding, galaxy has been replaced by 0, quasar by 1, and star by 2.
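
The integer codes come from the sorted order of the class labels, which can be inspected rather than guessed. The sketch below assumes the label strings are `STAR`, `GALAXY`, and `QSO`; that is an assumption, so check `data['class'].unique()` before encoding to see the actual strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings; verify them against the real class column.
labels = ["STAR", "GALAXY", "QSO", "GALAXY", "STAR"]

encoder = LabelEncoder().fit(labels)

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original strings.
decoded = list(encoder.inverse_transform([0, 1, 2]))
print(decoded)
```

In the notebook, the fitted encoder is `le`, so `le.classes_` and `le.inverse_transform(...)` give the same information for the real data.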

In [None]:
data.info()

Separate the features from the target before the train/test split

In [None]:


X = data.drop('class', axis=1)
y = data['class']



# Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X = scaler.fit_transform(X)

In [None]:
#Show data
X[:20]

In [None]:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=128)
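
The notebook fits the scaler on the full feature matrix before splitting, so test-set statistics influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the names `X_raw`, `y_raw`, and `train_scaler` are hypothetical and the numbers are randomly generated, not taken from the SDSS dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (illustration only).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_raw = rng.integers(0, 3, size=200)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=128
)

# Learn the mean and standard deviation from the training rows only.
train_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = train_scaler.transform(X_tr)
X_te_scaled = train_scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have zero mean and unit variance by construction, while the test features are only approximately standardized, which mirrors how unseen data would be treated.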



# Density Distribution Plots

In [None]:
sns.countplot(x=data['class'])

After label encoding, the classes are:
*     0 = Galaxy
*     1 = Quasar
*     2 = Star

### Some information about the filters:
*     "U" stands for ultraviolet.
*     "G" stands for green.
*     "R" stands for red.
*     "I" stands for infrared.

In [None]:


sns.pairplot(data[['u','g','r','i','class']])



# Machine Learning Models (Classification Models)

In [None]:
# Decision tree classifier
dtClassifer = DecisionTreeClassifier(max_leaf_nodes=15, max_depth=3)
#------------------------------------------------------------------
# Linear classifiers:
# 1 - Logistic regression
LRClassifer = LogisticRegression()
# 2 - Naive Bayes classifier (left disabled: MultinomialNB rejects negative
#     inputs such as the standardized features used here)
# NBClassifer = MultinomialNB()
#-------------------------------------------------------------------
# Nearest neighbor classifier
NeNeClassifier = KNeighborsClassifier(n_neighbors=3)
#-------------------------------------------------------------------
# Support vector machine classifier
SVCModel = SVC()




In [None]:
dtClassifer.fit(X_train, y_train)
LRClassifer.fit(X_train, y_train)
#NBClassifer.fit(X_train, y_train)
NeNeClassifier.fit(X_train, y_train)
SVCModel.fit(X_train, y_train)

In [None]:
y_preds = dtClassifer.predict(X_test)
y_predsLR = LRClassifer.predict(X_test)
#y_predsNB = NBClassifer.predict(X_test)
y_predsNeNe = NeNeClassifier.predict(X_test)
y_predsSVC = SVCModel.predict(X_test)

In [None]:
print(y_preds[:10],'\n',y_test[:10])
print("*******************************************************")
print(y_predsLR[:10],'\n',y_test[:10])
print("*******************************************************")
#print(y_predsNB[:10],'\n',y_test[:10])
#print("*******************************************************")
print(y_predsNeNe[:10],'\n',y_test[:10])
print("*******************************************************")
print(y_predsSVC[:10],'\n',y_test[:10])


### Measure accuracy of the classifier


In [None]:
print('accuracy_score by Decision Tree Classifier:',accuracy_score(y_true=y_test, y_pred=y_preds))
print('accuracy_score by LR Classifier:',accuracy_score(y_true=y_test, y_pred=y_predsLR))
#print('accuracy_score by Naive Bayes Classifier:',accuracy_score(y_true=y_test, y_pred=y_predsNB))
print('accuracy_score by Nearest Neighbor Classifier:',accuracy_score(y_true=y_test, y_pred=y_predsNeNe))
print('accuracy_score by SVM Classifier:',accuracy_score(y_true=y_test, y_pred=y_predsSVC))
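
Accuracy is a single number and hides which classes get confused with each other. As a sketch, a confusion matrix can be computed alongside it; the labels below are made up for illustration (0 = galaxy, 1 = quasar, 2 = star) and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 0 = galaxy, 1 = quasar, 2 = star.
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 0, 2, 2, 2]

# Fraction of samples whose predicted label equals the true label.
acc = accuracy_score(y_true_toy, y_pred_toy)
print(acc)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)
```

In the notebook, the same call would take `y_test` and any of the prediction arrays, for example `confusion_matrix(y_test, y_preds)`.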

The Decision Tree Classifier has the highest score.

You can apply neural networks to this data, but it is unlikely that you will get good results because the training set is small.