# Star type classifier

This classifier uses the star type data from https://www.kaggle.com/brsdincer/star-type-classification

Data description:

Temperature -- K

L -- L/Lo - relative luminosity (in the model renamed to L/Lo -lumin)

R -- R/Ro - relative radius (in the model renamed to R/Ro -rad)

A_M -- Mv - absolute magnitude (in the model renamed to Mv -magn)

Color -- General Color of Spectrum

Spectral_Class -- O,B,A,F,G,K,M / SMASS - https://en.wikipedia.org/wiki/Asteroid_spectral_types

Type -- Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, Hyper Giants


TARGET:

Type, encoded as an integer from 0 to 5:

Red Dwarf - 0

Brown Dwarf - 1

White Dwarf - 2

Main Sequence - 3

Super Giants - 4

Hyper Giants - 5
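
To decode predictions later, the mapping above can be kept as a small lookup table. This is a minimal sketch; the names `TYPE_NAMES` and `type_name` are illustrative and do not appear in the original notebook.

```python
# Integer target codes and their star types, copied from the list above
TYPE_NAMES = {
    0: 'Red Dwarf',
    1: 'Brown Dwarf',
    2: 'White Dwarf',
    3: 'Main Sequence',
    4: 'Super Giants',
    5: 'Hyper Giants',
}

def type_name(code):
    """Return the star type name for an integer target code (hypothetical helper)."""
    return TYPE_NAMES[int(code)]

print(type_name(3))
```

The `int(...)` cast lets the helper accept NumPy integers, which is what a classifier's `predict` would typically return.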


MATH:

Lo = 3.828 x 10^26 Watts (Avg Luminosity of Sun)

Ro = 6.9551 x 10^8 m (Avg Radius of Sun)
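
Because L and R are stored relative to the Sun, absolute values follow from a single multiplication by the constants above. The sketch below assumes a made-up star with 0.5 solar luminosities and 2 solar radii; the function name `to_absolute` is illustrative only.

```python
# Solar reference values quoted above
L_SUN_W = 3.828e26   # average luminosity of the Sun, in watts
R_SUN_M = 6.9551e8   # average radius of the Sun, in meters

def to_absolute(l_rel, r_rel):
    """Convert relative luminosity (L/Lo) and radius (R/Ro) to SI units."""
    return l_rel * L_SUN_W, r_rel * R_SUN_M

# hypothetical star, not a row from the dataset
lum_w, rad_m = to_absolute(0.5, 2.0)
print(f"{lum_w:.3e} W, {rad_m:.4e} m")
```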

## Loading libraries and dataset

In [None]:
# Let's load the libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_validate, cross_val_score, train_test_split

import pickle

In [None]:
# loading dataset
sdata = pd.read_csv('../input/star-type-classification/Stars.csv')

## Exploratory data analysis

### Common data

In [None]:
# rename columns (because I like full names :)
sdata.rename(columns = {'L':'L/Lo -lumin','R':'R/Ro -rad','A_M':'Mv -magn'}, inplace = True)
sdata.info()

In [None]:
sdata.head(5)

In [None]:
# visualize missing data
sns.heatmap(sdata.isnull(),yticklabels=False,cbar=False,cmap='viridis')
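
The heatmap is a visual check; a numeric count per column can complement it. Here is a small sketch on a made-up stand-in frame (`demo` is hypothetical; in the notebook the same call would be applied to `sdata`).

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for sdata, with one missing temperature
demo = pd.DataFrame({
    'Temperature': [3068, 3042, np.nan],
    'Color': ['Red', 'Red', 'Blue'],
})

# number of missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```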

### Numeric data

In [None]:
# pairplots for numeric data
sns.pairplot(sdata, kind = 'reg')

#### Type seems to correlate with absolute magnitude

In [None]:
# let's calculate pairwise correlations and build a heatmap (code taken from Kaggle-user ChrisX, https://www.kaggle.com/docxian/star-type-classification)

features_num = ['Temperature', 'L/Lo -lumin', 'R/Ro -rad', 'Mv -magn']

# calc correlation matrices
corr_pearson = sdata[features_num].corr(method='pearson')         # Pearson's corr - shows the linear relationship 
corr_spearman = sdata[features_num].corr(method='spearman')       # Spearman's corr - shows monotonic relationship

# and plot side by side
plt.figure(figsize=(15,5))
ax1 = plt.subplot(1,2,1)
sns.heatmap(corr_pearson, annot=True, cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Pearson Correlation')

ax2 = plt.subplot(1,2,2, sharex=ax1)
sns.heatmap(corr_spearman, annot=True, cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Spearman Correlation')
plt.show()
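
To see why the two coefficients can disagree, consider a relationship that is monotonic but not linear. The sketch below uses synthetic data (an exponential curve, not the star dataset): Spearman should report a perfect monotonic relationship, while Pearson comes out noticeably lower.

```python
import numpy as np
import pandas as pd

# synthetic example: y grows monotonically with x, but not linearly
x = np.linspace(0, 5, 50)
demo = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = demo.corr(method='pearson').loc['x', 'y']
spearman = demo.corr(method='spearman').loc['x', 'y']

print('Pearson : {:.3f}'.format(pearson))
print('Spearman: {:.3f}'.format(spearman))
```

A large gap between the two heatmaps for a pair of features is therefore a hint that the relationship is nonlinear, which is plausible for quantities such as luminosity that span many orders of magnitude.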

### Categorical data

In [None]:
g = sns.catplot(data=sdata, x='Color', y="Type")
g.fig.set_figwidth(7)
g.fig.set_figheight(5)
g.set_xticklabels(rotation = 90)

#### Colors are a bit messy. We will replace them with numerical columns later

In [None]:
# Spectral class categorical plot
g = sns.catplot(data=sdata, x='Spectral_Class', y="Type")
g.fig.set_figwidth(7)
g.fig.set_figheight(5)
g.set_xticklabels(rotation = 0)

### Categorical values to numerical

In [None]:
sdata.info()

In [None]:
# These are the basic color categories; we will make a separate 0/1 column for each of them.
# The transformation below is a debatable decision: it reduces the total number of colors
# and redistributes the compound labels into these simple basic categories. The redistribution is not
# fully physically grounded, because some labels (like "Pale") have no specific spectral definition.

# a list (not a set) keeps the order of the new columns the same on every run
basic_colors = ['RED','ORANGE','YELLOW','GREEN','BLUE','WHITE','PALE']

for col in basic_colors:
    sdata[col] = 0


# I am always lazy, so copy-pasting is my love
# (note: none of the assignments below sets GREEN, so that column stays all zeros)
sdata.loc[sdata.Color == 'Red','RED'] = 1
sdata.loc[sdata.Color == 'White',['WHITE']] = 1
sdata.loc[sdata.Color == 'Blue White',['BLUE','WHITE']] = 1
sdata.loc[sdata.Color == 'Yellowish White',['YELLOW','WHITE']] = 1
sdata.loc[sdata.Color == 'Blue white',['BLUE','WHITE']] = 1
sdata.loc[sdata.Color == 'Pale yellow orange',['PALE','YELLOW','ORANGE']] = 1
sdata.loc[sdata.Color == 'Blue',['BLUE']] = 1
sdata.loc[sdata.Color == 'Blue-white',['BLUE','WHITE']] = 1
sdata.loc[sdata.Color == 'Whitish',['WHITE']] = 1
sdata.loc[sdata.Color == 'yellow-white',['YELLOW','WHITE']] = 1
sdata.loc[sdata.Color == 'Orange',['ORANGE']] = 1
sdata.loc[sdata.Color == 'White-Yellow',['WHITE','YELLOW']] = 1
sdata.loc[sdata.Color == 'white',['WHITE']] = 1
sdata.loc[sdata.Color == 'yellowish',['YELLOW']] = 1
sdata.loc[sdata.Color == 'Yellowish',['YELLOW']] = 1
sdata.loc[sdata.Color == 'Orange-Red',['ORANGE','RED']] = 1
sdata.loc[sdata.Color == 'Blue-White',['WHITE','BLUE']] = 1

sdata
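
The copy-pasted assignments above can also be expressed as one keyword rule applied to a normalized label. This is only a sketch of an alternative, not the notebook's method: `KEYWORDS` and `encode_color` are hypothetical names, and the rule assumes that a substring match on the lowercased label is an acceptable way to detect each basic color.

```python
BASIC_COLORS = ['RED', 'ORANGE', 'YELLOW', 'GREEN', 'BLUE', 'WHITE', 'PALE']

# substring that signals each basic color; 'whit' also matches 'Whitish'
KEYWORDS = {
    'RED': 'red', 'ORANGE': 'orange', 'YELLOW': 'yellow', 'GREEN': 'green',
    'BLUE': 'blue', 'WHITE': 'whit', 'PALE': 'pale',
}

def encode_color(label):
    """Return a dict of 0/1 flags, one per basic color, for a raw color label."""
    text = label.strip().lower()
    return {col: int(KEYWORDS[col] in text) for col in BASIC_COLORS}

print(encode_color('Pale yellow orange'))
print(encode_color('Blue-white'))
```

In the notebook, the flags could then be built in one step with something like `pd.DataFrame([encode_color(c) for c in sdata['Color']], index=sdata.index)`. For the labels listed above this should give the same flags as the explicit assignments, but it is worth comparing the two on the real data before switching.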

In [None]:
# replace Spectral_Class cat to numerical
s_class = pd.get_dummies(sdata['Spectral_Class'], drop_first = True)
sdata = pd.concat([sdata,s_class], axis = 1)
sdata
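
The effect of `drop_first = True` is easy to miss: one category (the first in sorted order) gets no column of its own and is represented by all of the remaining dummy columns being zero. Here is a small sketch on a made-up series of spectral classes, not the real column.

```python
import pandas as pd

# hypothetical sample of spectral classes
classes = pd.Series(['O', 'B', 'A', 'M', 'B'], name='Spectral_Class')

full = pd.get_dummies(classes)                      # one column per class
reduced = pd.get_dummies(classes, drop_first=True)  # first class (sorted) is dropped

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column avoids a redundant feature. For tree-based models such as gradient boosting this matters little, so keeping all columns would also be a defensible choice.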

In [None]:
# make a copy of our dataset so we don't have to reload the original if something goes wrong :)))
tdata = sdata.drop(['Color','Spectral_Class'], axis = 1).copy()

## Model fitting

In [None]:
# train data and target separation
x_data = tdata.drop('Type', axis = 1)
y_data = tdata['Type']

In [None]:
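# model creation
# we will use the gradient boosting classifier

# `loss` is left at its default (the log-loss, formerly called 'deviance'): the name 'deviance'
# was deprecated in scikit-learn 1.1 in favor of 'log_loss' and later removed
gbc = GradientBoostingClassifier(max_depth=3, n_estimators=400, learning_rate = 0.085,
                                 min_samples_leaf = 1, max_features = 'log2')  # GradientBoosting model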


In [None]:
# our dataset is not big, so as I understand it, there is no reason to divide it into more than 5 folds in k-fold cross-validation
# we can play with this number to see how the accuracy changes

model = gbc
folds_n = 5
cv_results = cross_val_score(model, x_data, y_data, cv = folds_n, scoring="accuracy",n_jobs=-1)
print('min accuracy= {v}'.format(v = np.min(cv_results)))
print('avg accuracy= {v}'.format(v = np.mean(cv_results)))
print('max accuracy= {v}'.format(v = np.max(cv_results)))


#### The result looks too optimistic (and stays so across different runs). Even if we change the number of folds to 3 (bigger test set, smaller train set), the accuracy stays the same. I suspected that `fit` could use the DataFrame index as a feature, which would be a data leak because the initial data is sorted, but it seems it does not (see https://stackoverflow.com/questions/58635398/does-sklearn-use-pandas-index-as-a-feature)
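
One way to rule out any effect of the row order is to pass an explicit, shuffled `StratifiedKFold` instead of an integer to `cv`. The sketch below runs on synthetic data from `make_classification`, deliberately sorted by class, as a stand-in for `x_data` and `y_data`; the sample sizes and hyperparameters are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for x_data / y_data
X, y = make_classification(n_samples=240, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# sort the rows by class to mimic a dataset that is stored in sorted order
order = np.argsort(y)
X, y = X[order], y[order]

# shuffled, stratified folds: every fold keeps the class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
print('min / avg / max accuracy:', scores.min(), scores.mean(), scores.max())
```

For a classifier, an integer `cv` already uses stratified folds without shuffling, so the scores in the notebook are unlikely to be an artifact of the sorting. The explicit splitter mainly makes that assumption visible and reproducible.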

#### So let's train the model on the full dataset and save it for future generations :)

In [None]:
model.fit(x_data,y_data)

# save the model to disk  (see https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/)
filename = 'star_classifier.sav'
with open(filename, 'wb') as f:   # the context manager closes the file after writing
    pickle.dump(model, f)
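
Loading the saved model back is the mirror operation. The sketch below is a self-contained round trip on a small synthetic model written to a temporary directory, so it does not touch the real `star_classifier.sav`; the data and hyperparameters are placeholders.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# small synthetic model standing in for the trained star classifier
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# write the model to a temporary file, then read it back
path = os.path.join(tempfile.mkdtemp(), 'star_classifier.sav')
with open(path, 'wb') as f:
    pickle.dump(clf, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)

# the restored model should give the same predictions as the original
same = bool((restored.predict(X) == clf.predict(X)).all())
print('predictions identical:', same)
```

A pickled model should only be loaded from a trusted source and, ideally, with the same scikit-learn version that wrote it.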

#### Now let's look at the feature importance in the trained model (https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html)

In [None]:
feature_importance = model.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
fig = plt.figure(figsize=(12, 6))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, np.array(x_data.columns)[sorted_idx])
plt.title('Feature Importance (MDI)')
plt.show()
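
MDI importances are computed from the training data, so a held-out check with `permutation_importance` (imported at the top of the notebook but not used) can be a useful complement. This is a sketch on synthetic data; the train/test split and `n_repeats=10` are assumptions made for the illustration, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic stand-in for x_data / y_data
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# shuffle each feature in turn on the held-out set and measure the score drop
result = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                random_state=0)

# print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print('feature {}: {:.3f} +/- {:.3f}'.format(
        i, result.importances_mean[i], result.importances_std[i]))
```

Features that rank high in both MDI and permutation importance are the ones the model most plausibly relies on; a feature that is high only in MDI deserves a second look.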

# Thank you for your attention :) Please judge me, but not too strictly, I am still learning :)