## **Creating and training models with and without PCA using [RandomForest, K-Nearest Neighbors, SVC] classifiers on the Stellar Classification Dataset**

#### Dataset Source: https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17

#### Description:

The data consists of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Each observation is described by 17 feature columns and 1 class column that identifies it as a star, a galaxy, or a quasar.

#### Attributes:
<pre>
obj_ID          = Object Identifier, the unique value that identifies the 
                  object in the image catalog used by the CAS<br>
alpha           = Right Ascension angle (at J2000 epoch)<br>
delta           = Declination angle (at J2000 epoch)<br>
u               = Ultraviolet filter in the photometric system<br>
g               = Green filter in the photometric system<br>
r               = Red filter in the photometric system<br>
i               = Near Infrared filter in the photometric system<br>
z               = Infrared filter in the photometric system<br>
run_ID          = Run Number used to identify the specific scan<br>
rerun_ID        = Rerun Number to specify how the image was processed<br>
cam_col         = Camera column to identify the scanline within the run<br>
field_ID        = Field number to identify each field<br>
spec_obj_ID     = Unique ID used for optical spectroscopic objects 
                  (this means that 2 different observations with the same 
                  spec_obj_ID must share the output class)<br>
class           = object class (galaxy, star or quasar object)<br>
redshift        = redshift value based on the increase in wavelength<br>
plate           = plate ID, identifies each plate in SDSS<br>
MJD             = Modified Julian Date, used to indicate when a given piece 
                  of SDSS data was taken<br>
fiber_ID        = fiber ID that identifies the fiber that pointed the light 
                  at the focal plane in each observation<br>
</pre>


In [2]:
import numpy as np
import pandas as pd

In [3]:
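attrib = ["obj_ID","alpha","delta","u","g","r","i","z","run_ID","rerun_ID","cam_col","field_ID","spec_obj_ID","class","redshift","plate","MJD","fiber_ID"]
# header=0 replaces the file's own header row with these names instead of reading it as a data row
dataset = pd.read_csv("star_classification.csv", names=attrib, header=0)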
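
When `names=` is passed for a CSV that already has a header row, pandas keeps that header row as the first data record unless `header=0` is given as well. The sketch below illustrates the difference on a tiny in-memory CSV; the three columns and their values are made up for the illustration and are not SDSS data.

```python
import io

import pandas as pd

# Hypothetical three-column CSV with its own header row (not the real dataset).
csv_text = "obj_ID,alpha,class\n1,135.6,GALAXY\n2,144.8,QSO\n"
cols = ["obj_ID", "alpha", "class"]

# names= alone: the header line is read as an ordinary data row.
header_as_data = pd.read_csv(io.StringIO(csv_text), names=cols)

# names= with header=0: the header line is consumed and replaced by cols.
header_replaced = pd.read_csv(io.StringIO(csv_text), names=cols, header=0)

print(len(header_as_data), header_as_data["alpha"].dtype)
print(len(header_replaced), header_replaced["alpha"].dtype)
```

In the first frame the numeric column is stored as strings because the header text is mixed in with the values, which would break scaling later on.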

In [4]:
dataset.head()

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,class,redshift,plate,MJD,fiber_ID
0,1.237661e+18,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,GALAXY,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,GALAXY,0.779136,10445,58158,427
2,1.237661e+18,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.1522e+18,GALAXY,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,4192,301,3,214,1.030107e+19,GALAXY,0.932346,9149,58039,775
4,1.23768e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,GALAXY,0.116123,6121,56187,842


In [None]:
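# Splitting into features (X) and target (y)

# The positional axis argument drop('class', 1) is rejected by pandas 2.x; use columns=
X = dataset.drop(columns='class')
y = dataset['class']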

In [7]:
X.head()

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,redshift,plate,MJD,fiber_ID
0,1.237661e+18,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,0.779136,10445,58158,427
2,1.237661e+18,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.1522e+18,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,4192,301,3,214,1.030107e+19,0.932346,9149,58039,775
4,1.23768e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,0.116123,6121,56187,842


In [8]:
y.head()

0    GALAXY
1    GALAXY
2    GALAXY
3    GALAXY
4    GALAXY
Name: class, dtype: object

In [9]:
# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

# Training data: 80000 ; Test data: 20000

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [10]:
X_train.shape

(80000, 17)

In [13]:
X_test.shape

(20000, 17)

**Data Preprocessing**

In [14]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
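
The scaler is fitted on the training split only, and the test split is transformed with the training mean and standard deviation. A minimal sketch of what that implies, using random stand-in data (the `*_demo` arrays are hypothetical, not the SDSS features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in for the feature matrix (hypothetical values).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(800, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean/std from the training rows
X_test_scaled = scaler.transform(X_test_demo)        # reuses the training mean/std

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("train mean:", train_means.round(6), "train std:", train_stds.round(6))
print("test mean:", test_means.round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction, while the test columns are only close to that, because no statistics are taken from the test rows.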

In [15]:
X_train.shape

(80000, 17)

In [16]:
X_test.shape

(20000, 17)

**Model using - Random Forest classifier & Raw data**

In [17]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

Performance:

    Time: ~39sec

    [[11679   138    34]
    [  263  3571     1]
    [    8     0  4306]]

    Accuracy: 0.9778
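
The notebook does not show how the times were measured. One possible way to time fitting plus prediction is sketched below; the helper `timed_fit_predict` and the synthetic data from `make_classification` are assumptions made for the illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small synthetic 3-class problem with 17 features (stand-in, not the SDSS data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=17, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)


def timed_fit_predict(model, X_fit, y_fit, X_eval):
    """Hypothetical helper: fit, predict, and return (predictions, seconds)."""
    start = time.perf_counter()
    model.fit(X_fit, y_fit)
    predictions = model.predict(X_eval)
    return predictions, time.perf_counter() - start


y_hat, seconds = timed_fit_predict(RandomForestClassifier(random_state=0), X_tr, y_tr, X_te)
print(f"fit + predict took {seconds:.2f} s")
```

In a notebook, the `%%time` cell magic gives a similar wall-clock figure without a helper.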

**Model using - K-Nearest Neighbor classifier & Raw data**

In [141]:
from sklearn.neighbors import KNeighborsClassifier

reg_knn = KNeighborsClassifier()
reg_knn.fit(X_train, y_train)
y_pred = reg_knn.predict(X_test)

Performance:

    Time: ~13sec

    [[11424   122   305]
    [  670  3127    38]
    [  899     9  3406]]

    Accuracy: 0.89785

**Model using - SVC classifier & Raw data**

In [147]:
from sklearn.svm import SVC

reg_svc = SVC()
reg_svc.fit(X_train, y_train)
y_pred = reg_svc.predict(X_test)

Performance:

    Time: ~2min

    [[11469   135   247]
    [  393  3432    10]
    [   92     0  4222]]

    Accuracy: 0.95615

In [18]:
y_pred

array(['GALAXY', 'QSO', 'GALAXY', ..., 'GALAXY', 'GALAXY', 'STAR'],
      dtype=object)

**Performance Evaluation using confusion matrix and accuracy score**

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(f'\nAccuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%')
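
`confusion_matrix` puts true classes on the rows and predicted classes on the columns, with labels in sorted order, which for this dataset means GALAXY, QSO, STAR. As a worked example, the sketch below takes the Random Forest matrix reported above for the raw data and derives accuracy and per-class recall from it; the variable names are chosen for the illustration.

```python
import numpy as np
import pandas as pd

labels = ["GALAXY", "QSO", "STAR"]  # sorted label order used by confusion_matrix

# Random Forest confusion matrix on the raw data, copied from the results above.
cm_rf = np.array([
    [11679, 138, 34],
    [263, 3571, 1],
    [8, 0, 4306],
])

cm_table = pd.DataFrame(
    cm_rf,
    index=[f"true {name}" for name in labels],
    columns=[f"pred {name}" for name in labels],
)

# Accuracy is the diagonal (correct predictions) over all test samples.
accuracy = np.trace(cm_rf) / cm_rf.sum()

# Recall per class is the diagonal entry over its row total.
recall_per_class = pd.Series(np.diag(cm_rf) / cm_rf.sum(axis=1), index=labels)

print(cm_table)
print(f"accuracy: {accuracy:.4f}")
print(recall_per_class.round(4))
```

Read this way, most of the Random Forest errors are quasars predicted as galaxies (263) and galaxies predicted as quasars (138).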

**Apply PCA and reduce Dimensionality**

In [189]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
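
How much information survives the reduction depends on how much variance the kept components explain. A minimal sketch of checking this with `explained_variance_ratio_` follows; the correlated synthetic data and the 95% threshold are assumptions for the illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed columns driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to inspect the variance spectrum.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print("components needed for 95% variance:", n_for_95)
```

Running the same check on the scaled SDSS features would show how much variance the chosen `n_components` actually retains.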

In [175]:
X_train.shape

(80000, 2)

In [155]:
X_test.shape

(20000, 2)

**Model using - Random Forest classifier & Dimension reduced data**

In [198]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

Performance:

    Time: ~8sec

    [[11543    11   297]
    [ 1636  1918   281]
    [ 2839     6  1469]]

    Accuracy: 0.7465

**Model using - K-Nearest Neighbor classifier & Dimension reduced data**

In [196]:
from sklearn.neighbors import KNeighborsClassifier

reg_knn = KNeighborsClassifier()
reg_knn.fit(X_train, y_train)
y_pred = reg_knn.predict(X_test)

Performance:

    Time: ~11sec

    [[10536   945   370]
    [ 2036  1537   262]
    [ 1725   558  2031]]

    Accuracy: 0.7052

**Model using - SVC classifier & Dimension reduced data**

In [194]:
from sklearn.svm import SVC

reg_svc = SVC()
reg_svc.fit(X_train, y_train)
y_pred = reg_svc.predict(X_test)

Performance:

    Time: ~11min 30sec

    [[11603   248     0]
    [ 3394   441     0]
    [ 4179   135     0]]

    Accuracy: 0.6022

**Performance Evaluation on Dimension reduced data**

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(f'\nAccuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%')
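
Because the PCA step overwrites `X_train` and `X_test`, the raw-data models cannot be rerun afterwards without repeating the split and scaling. One possible alternative, sketched here on synthetic data, is to wrap each variant in a scikit-learn pipeline so both can be compared from the same untouched split. This is not how the notebook was run; the data from `make_classification` and the names `raw_model` and `pca_model` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in with 17 features (not the SDSS data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=17, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Each pipeline owns its preprocessing, so X_tr and X_te are never overwritten.
raw_model = make_pipeline(StandardScaler(), KNeighborsClassifier())
pca_model = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())

scores = {}
for name, model in {"raw": raw_model, "pca": pca_model}.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # mean accuracy on the held-out split

print(scores)
```

The same pattern would work with `RandomForestClassifier` or `SVC` as the final step.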