## Overview

*Star Type Classification / NASA*

With this dataset, we'll be classifying the type of each star.
In particular, we want to predict whether a star should be labeled as a Red Dwarf, Hyper Giant, etc.

As there are more than two possible outputs, we should note that this is a multiclass classification problem.

In [None]:
# Install necessary libraries
!pip install pydotplus
!pip install lazypredict
!pip install pandas -U

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
import altair as alt

from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from six import StringIO  
import pydotplus
from sklearn.metrics import accuracy_score

%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
stars = pd.read_csv('/kaggle/input/star-type-classification/Stars.csv')

In [None]:
stars.sample(n = 10, random_state = 42)

In [None]:
# info() prints its summary directly and returns None, so call it on its own
stars.info()
print('\n', stars.isna().sum())

In [None]:
stars.describe()

In [None]:
stars.Type.value_counts()

At face value, our target labels appear to be evenly distributed across the classes.
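One way to put a number on how balanced the target is would be to compare class shares. The snippet below is a minimal sketch on toy labels (hypothetical values standing in for `stars['Type']`), not the real data; `balance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Toy labels standing in for stars['Type'] (hypothetical values, not the real data)
labels = pd.Series([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], name="Type")

# Share of rows per class; a balanced target gives equal shares
class_share = labels.value_counts(normalize=True).sort_index()

# Ratio of the rarest to the most common class: 1.0 means perfectly balanced
balance_ratio = class_share.min() / class_share.max()

print(class_share)
print("Balance ratio:", balance_ratio)
```

A ratio well below 1.0 would suggest that plain accuracy may be a misleading metric for the problem.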

## Data Cleaning

In [None]:
# Collect the unique color labels and sort them so similar values sit together
color_list = sorted(stars['Color'].unique())

color_list

We can see some values that are similar but differ in capitalization or punctuation, such as 'Blue White', 'Blue white', and 'Blue-white'.

Let's see how the data is distributed and go from there.

In [None]:
alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('Color:N', sort = '-x')
    )

We can infer that the colors were probably recorded by direct observation.

Data on the metallicity of the stars would allow for the use of something like a B-V color index.

Instead, we'll relabel the colors to something a bit more appropriate for our test.

In [None]:
color_map = {'Orange-Red' : 'Orange-Red', 
             'Pale yellow orange' : 'Yellow-Orange',
             'Blue-white' : 'Blue-White', 
             'Blue White' : 'Blue-White',
             'Blue white' : 'Blue-White',
             'Blue-White' : 'Blue-White', 
             'yellow-white' : 'Yellow-White',
             'Yellowish White' : 'Yellow-White',
             'White-Yellow' : 'Yellow-White',
             'yellowish' : 'Yellow',
             'Yellowish' : 'Yellow',   
             'White' : 'White',
             'white' : 'White',
             'Whitish' : 'White',
             'Orange' : 'Orange', 
             'Red' : 'Red', 
             'Blue' : 'Blue'
            }   

stars.Color = stars.Color.map(color_map).astype('category')
stars.Spectral_Class = stars.Spectral_Class.astype('category')
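As a possible complement to the hand-written map above, variants that differ only in case, hyphenation, or surrounding whitespace could be collapsed programmatically. This is a sketch under that assumption; `normalize_color` is a hypothetical helper, and the sample values are taken from the keys of `color_map`.

```python
import pandas as pd

def normalize_color(label: str) -> str:
    """Lowercase, trim, and unify separators so spelling variants collapse."""
    cleaned = label.strip().lower().replace("-", " ")
    # split() drops repeated whitespace; re-join the capitalized words with hyphens
    return "-".join(word.capitalize() for word in cleaned.split())

raw = pd.Series(["Blue-white", "Blue White", "Blue white", "Blue-White", "white", "White"])
normalized = raw.map(normalize_color)

print(normalized.unique())
```

Synonyms such as 'Yellowish White' and 'White-Yellow' would still need an explicit mapping, since they differ in wording rather than formatting.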

In [None]:
alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('Color', sort = '-x')
    )

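We'll also label encode the spectral class and color columns so the models can consume them. Note that `LabelEncoder` assigns integer codes in sorted (alphabetical) label order, so the codes do not follow the physical ordering of these otherwise ordinal features.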

In [None]:
le = LabelEncoder()

In [None]:
# Map original labels for future reference
le.fit(stars['Spectral_Class'])
le_name_mapping_spectral_class = dict(zip(le.classes_, le.transform(le.classes_)))
le.fit(stars['Color'])
le_name_mapping_color = dict(zip(le.classes_, le.transform(le.classes_)))
print('Spectral Classes Mapping: ', le_name_mapping_spectral_class, 
      '\n\nColor Mapping: ', le_name_mapping_color)

In [None]:
# Apply transformations
stars['Color'] = le.fit_transform(stars['Color'])
stars['Spectral_Class'] = le.fit_transform(stars['Spectral_Class'])
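If the integer codes should follow the conventional hot-to-cool spectral sequence (O, B, A, F, G, K, M), one option is an ordered categorical applied to the column before it is label encoded. The snippet is a sketch on toy values, not the notebook's actual encoding step; `spectral_order` and the sample series are assumptions introduced for illustration.

```python
import pandas as pd

# Conventional hot-to-cool ordering of the main spectral classes
spectral_order = ["O", "B", "A", "F", "G", "K", "M"]

# Toy values standing in for stars['Spectral_Class'] before encoding
spectral = pd.Series(["M", "B", "A", "O", "K", "G", "F", "M"])

# An ordered categorical keeps the physical order in its integer codes
ordered = pd.Categorical(spectral, categories=spectral_order, ordered=True)
codes = pd.Series(ordered.codes, index=spectral.index)

print(dict(zip(spectral, codes)))
```

Any label missing from `spectral_order` would receive the code -1, which makes unexpected values easy to spot.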

In [None]:
sc_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('Spectral_Class', sort = '-x')
).properties(
    height = 100,
    width = 100
)

r_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('R:Q', bin = True)
).properties(
    height = 100,
    width = 100
)

l_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('L:Q', bin = True)
).properties(
    height = 100,
    width = 100
)

temperature_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('Temperature:Q', bin = True)
).properties(
    height = 100,
    width = 100
)

sc_chart | r_chart | l_chart | temperature_chart

The radius and luminosity features have heavily skewed distributions.

We'll apply a log transformation to both.
The main reason is that, for star classification, relative differences in radius and luminosity
seem to matter more than absolute ones.

Before applying the transformation, let's see what the data looks like.

In [None]:
alt.Chart(stars).mark_point().encode(
    alt.X(alt.repeat('column'), type = 'quantitative'),
    alt.Y(alt.repeat('row'), type = 'quantitative'),
    color = 'Type:N'
).properties(
    width = 200,
    height = 200
).repeat(
    row = ['L', 'R'],
    column = ['Temperature', 'A_M', 'Spectral_Class', 'Color']
)

In [None]:
stars['L'] = np.log(stars.L).astype(float)
stars['R'] = np.log(stars.R).astype(float)
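To illustrate why a log transform helps with values that span many orders of magnitude, the sketch below compares skewness before and after on toy luminosities. The numbers are hypothetical, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy luminosities spanning many orders of magnitude (hypothetical values)
lum = pd.Series([0.0005, 0.002, 0.08, 1.0, 30.0, 1500.0, 200000.0, 850000.0])

# np.log is only defined for strictly positive inputs
log_lum = np.log(lum)

# Skewness near 0 indicates a roughly symmetric distribution
print("Skew before:", lum.skew())
print("Skew after: ", log_lum.skew())
```

Because the logarithm is undefined at zero and below, it is worth checking that the columns are strictly positive before transforming them.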

In [None]:
sc_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('Spectral_Class', sort = '-x')
).properties(
    height = 100,
    width = 100
)

r_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('R:Q', bin = True)
).properties(
    height = 100,
    width = 100
)

l_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('L:Q', bin = True)
).properties(
    height = 100,
    width = 100
)

temperature_chart = alt.Chart(stars).mark_bar().encode(
    x = 'count()',
    y = alt.Y('Temperature:Q', bin = True)
).properties(
    height = 100,
    width = 100
)

sc_chart | r_chart | l_chart | temperature_chart

In [None]:
alt.Chart(stars).mark_point().encode(
    alt.X(alt.repeat('column'), type = 'quantitative'),
    alt.Y(alt.repeat('row'), type = 'quantitative'),
    color = 'Type:N'
).properties(
    width = 200,
    height = 200
).repeat(
    row = ['L', 'R'],
    column = ['Temperature', 'A_M', 'Spectral_Class', 'Color']
)

## Analysis Part I

We'll take an overall peek at the data and take a deeper dive if we need to.

In [None]:
alt.Chart(stars).mark_point().encode(
    alt.X(alt.repeat('column'), type = 'quantitative'),
    alt.Y(alt.repeat('row'), type = 'quantitative'),
    color = 'Type:N'
).properties(
    width = 200,
    height = 200
).repeat(
    row = ['Temperature', 'L', 'R', 'A_M', 'Spectral_Class', 'Color'],
    column = ['Temperature', 'L', 'R', 'A_M', 'Spectral_Class', 'Color']
)

Here are a few observations from our plots:

    1. Radius, luminosity, and absolute magnitude seem to play a role in the type of star
    2. Radius and absolute magnitude seem to separate the types best
    3. Radius appears to play an integral role in determining the type of star
    4. Main sequence stars seem to have the widest spread

Points 2 and 3 make sense, especially if we refer to the Hertzsprung-Russell diagram.


In particular, the type of star may have some relationship to the equation:

$$L = \text{Area} \times \text{Flux} = 4\pi R^2 \sigma_{SB} T^4$$
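Dividing the relation by its solar counterpart cancels the constants, leaving L/L_sun = (R/R_sun)^2 (T/T_sun)^4. The sketch below works through that ratio; it assumes radius is given in solar units and temperature in kelvin, and `luminosity_solar` is a hypothetical helper, not something used elsewhere in this notebook.

```python
SOLAR_TEMPERATURE_K = 5772.0  # nominal solar effective temperature

def luminosity_solar(radius_solar: float, temperature_k: float) -> float:
    """L/L_sun = (R/R_sun)^2 * (T/T_sun)^4, from L = 4*pi*R^2*sigma*T^4."""
    return radius_solar ** 2 * (temperature_k / SOLAR_TEMPERATURE_K) ** 4

print(luminosity_solar(1.0, 5772.0))      # a Sun-like star
print(luminosity_solar(10.0, 5772.0))     # ten times the radius
print(luminosity_solar(1.0, 2 * 5772.0))  # twice the temperature
```

The quadratic dependence on radius and quartic dependence on temperature is one reason radius and luminosity carry overlapping information about the type of star.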
   
Let's dig a little deeper before creating a model.

In [None]:
heatmap = alt.Chart(stars).mark_rect().encode(
    alt.X('A_M:Q', bin = True),
    alt.Y('Temperature:Q', bin = True),
    alt.Color('count()', scale = alt.Scale(scheme = 'greenblue'))
)

points = alt.Chart(stars).mark_circle(
    color = 'black',
    size = 5,
).encode(
    x = 'A_M:Q',
    y = 'Temperature:Q',
)

heatmap + points

In [None]:
chart = alt.Chart(stars).mark_circle().encode(
    x = 'L:Q',
    y = 'R:Q',
).properties(
    height = 300,
    width = 300
)

chart + chart.transform_regression('L', 'R', method = 'poly').mark_line()

## Modeling

In [None]:
X = stars.drop(['Type'], axis = 1)
y = stars['Type']

features = X.columns
target = 'Type'

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)
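With a small dataset, a purely random split can leave some classes under-represented in the test set. One option, shown here as a sketch on toy arrays rather than the notebook's `X` and `y`, is to pass `stratify` to `train_test_split` so class proportions are preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 60 rows, 6 balanced classes (stand-ins for X and y)
X_toy = np.arange(120).reshape(60, 2)
y_toy = np.repeat(np.arange(6), 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification every class contributes the same number of test rows
print(np.bincount(y_te))
```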

Instead of testing each classifier by hand, we'll use LazyPredict.
If a model seems promising, we might take a deeper dive into related models that
LazyPredict does not cover.

In [None]:
from lazypredict.Supervised import LazyClassifier

In [None]:
models = LazyClassifier(verbose = 0, ignore_warnings = True, custom_metric = None, predictions = True)
model, predictions = models.fit(X_train, X_test, y_train, y_test)

In [None]:
model.head(10)

Our top 10 models have near-100% accuracy, but this could be because the dataset is small, among other factors.
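One way to check whether a high score depends on a single lucky split is stratified k-fold cross-validation. The sketch below uses scikit-learn's bundled iris data purely as a stand-in; the dataset and variable names are assumptions, and the notebook's `X` and `y` could be swapped in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Iris is used purely as a stand-in dataset; swap in X and y from the notebook
X_demo, y_demo = load_iris(return_X_y=True)

# Shuffle before splitting so each fold sees every class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=cv)

print("Fold accuracies:", scores)
print("Mean / std:", scores.mean(), scores.std())
```

A small spread across folds would suggest the score is not an artifact of one particular split.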

For now, we'll go with a decision tree classifier for the following reasons:

    1. Short runtime
    2. Trees are less sensitive to the arbitrary integer ordering that label encoding introduces

In [None]:
dt = DecisionTreeClassifier(random_state = 42)

In [None]:
dt.fit(X_train, y_train)

In [None]:
importance = dt.feature_importances_

In [None]:
# Pair each feature name with its importance for plotting
importance = pd.DataFrame({'Features': features, 'Importance': importance})

In [None]:
alt.Chart(importance).mark_bar().encode(
    x = 'Features',
    y = 'Importance',
    ).properties(width = 200, height = 200)

This is consistent with the graphs above, where radius and absolute magnitude seemed to play a role in classification.

Surprisingly, luminosity did not carry more weight in the classification.
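Impurity-based importances from a single tree can understate a feature that is correlated with another one the tree happened to split on first. Permutation importance on held-out data is one possible cross-check. The sketch below again uses iris as a stand-in dataset, which is an assumption and not the notebook's data.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with the notebook's X and y to reproduce on the star data
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in test accuracy
result = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=42)

for name, mean in sorted(zip(X_demo.columns, result.importances_mean), key=lambda p: -p[1]):
    print(f"{name}: {mean:.3f}")
```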

In [None]:
text_representation = tree.export_text(dt)
print(text_representation)

In [None]:
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)

In [None]:
plt.figure(figsize = (30, 30))
# Pass a plain list; some scikit-learn versions reject a pandas Index for feature_names
tree.plot_tree(dt, feature_names = list(features),
               class_names = ['0', '1', '2', '3', '4', '5'], filled = True)
plt.savefig('decision_tree_visualization.png')

In [None]:
yhat_test = dt.predict(X_test)
acc = accuracy_score(y_test, yhat_test)
print('Accuracy Score: ', acc)
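A single accuracy figure hides which classes, if any, get confused with each other. A confusion matrix breaks the score down per class. The example below is a sketch on toy labels (hypothetical values), not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: one class-1 sample is misclassified as class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In the notebook, the same call could be made with `y_test` and `yhat_test`.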

## Conclusion

For now, we can conclude that absolute magnitude and radius are integral to classifying the type of star. Given how small the dataset is, an accuracy score of 100% seems adequate for our purposes.

While we can certainly do feature selection and "improve" upon the decision tree, it does not seem necessary at this time.

#### Other Thoughts
I think it would have been nice to see the colors expressed as a B-V color index. It would also be interesting to see whether metallicity would play a more integral role in the classification of stars.