In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Welcome to my first kernel! 
# I hope the workflow is detailed enough for anyone to understand and I hope you can learn along with my discoveries on this dataset. Any feedback is greatly appreciated!

# Starting with some metrics about the dataset

In [1]:
data=pd.read_csv("/kaggle/input/star-dataset/6 class csv.csv")
data.head()

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import math as m

In [1]:
data.describe()
#Strange results for Luminosity and Radius: the values jump by several orders of magnitude from one quartile to the next

In [1]:
data.info()

In [1]:
sns.pairplot(data)

# Two observations can be derived from the pairplot:
* There is a possible relation between Temperature and Absolute Magnitude.
* Both normalized parameters (Luminosity and Radius, given relative to solar values) have a lot of observations concentrated around 0

# First Step: Searching for correlation between Temperature and Absolute Magnitude

In [1]:
#checking out the relation of Log(T) with all variables
temp_log=data["Temperature (K)"]
temp_log=temp_log.apply(m.log10)
data2=data  #note: this is an alias, not a copy, so the next line also overwrites the column in `data`
data2["Temperature (K)"]=temp_log
sns.pairplot(data2)
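
A side note on the cell above: plain assignment of a DataFrame does not create a copy. The sketch below uses a made-up two-row frame (not the star data) to show the difference between an alias and `.copy()`. Whether the in-place change is wanted here is a judgment call, and the sketch only illustrates the mechanism.

```python
import pandas as pd

# Made-up values, for illustration only.
original = pd.DataFrame({"Temperature (K)": [1000.0, 10000.0]})

alias = original  # same object, no copy is made
alias["Temperature (K)"] = [3.0, 4.0]
print(original["Temperature (K)"].tolist())  # the original changed too

fresh = pd.DataFrame({"Temperature (K)": [1000.0, 10000.0]})
independent = fresh.copy()  # separate object with its own data
independent["Temperature (K)"] = [3.0, 4.0]
print(fresh["Temperature (K)"].tolist())  # unchanged
```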

In both cases (raw and log-scaled temperature) a relationship can be seen, but given the distributions no firm conclusion can be drawn. Because a wide range of mathematical relations between these variables is possible, this correlation will be checked during model building and assessment by adding and removing temperature from the predictors.
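
One way to put a number on the visual impression is a correlation coefficient. The sketch below is only an illustration on made-up values: the column name `Absolute magnitude(Mv)` is an assumption about the dataset, and Spearman's coefficient is computed as Pearson's on the ranks so that no extra library is needed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are invented for illustration.
toy = pd.DataFrame({
    "Temperature (K)": [3000, 3500, 5800, 9000, 15000, 30000],
    "Absolute magnitude(Mv)": [16.0, 12.5, 4.8, 1.5, -2.0, -6.0],  # assumed column name
})

log_t = np.log10(toy["Temperature (K)"])
mag = toy["Absolute magnitude(Mv)"]

# Pearson measures linear association between the two columns.
pearson = log_t.corr(mag)
# Spearman only assumes a monotonic relation: Pearson applied to the ranks.
spearman = log_t.rank().corr(mag.rank())

print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")
```

A large gap between the two coefficients would hint at a monotonic but non-linear relation, which is the situation where rescaling a variable tends to help.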

# Second Step: Normalized Variables
In physics, it is very common for normalized variables (quantities expressed relative to a reference value) to have logarithmic relations with other parameters. A well-known example is the equation for sound intensity level:
L = 10 log10(I/I_0) (L given in decibels, I_0 being the reference intensity)

It may seem a long shot to assume this kind of relationship in this dataset on that basis alone, but two arguments support the hypothesis:
* The wide range of values both variables assume: as mentioned earlier, the high concentration of values around 0 seemed strange, and a logarithmic rescaling could improve the look of those distributions;
* If you look into the dataset description, you can see in the last figure that Luminosity is shown on a logarithmic scale. This alone supports the hypothesis, but the first point is what is usually observable when dealing with data from a field of study you are not very familiar with.

We will now check whether these assumptions improve the subsequent model fitting.
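
To make the argument concrete, here is a small sketch on made-up luminosities (not taken from the dataset) showing how a base-10 logarithm turns a ratio of several orders of magnitude into a span of a few units. Note that `log10` is only defined for strictly positive values, which is assumed to hold for both columns.

```python
import numpy as np
import pandas as pd

# Invented luminosities in solar units, spanning many orders of magnitude.
lum = pd.Series([0.0005, 0.02, 1.0, 150.0, 200000.0], name="Luminosity(L/Lo)")

# Vectorized logarithm; gives the same values as lum.apply(math.log10).
log_lum = np.log10(lum)

raw_span = lum.max() / lum.min()          # ratio between the extremes
log_span = log_lum.max() - log_lum.min()  # the same ratio, counted in decades

print(f"max/min ratio: {raw_span:.3g}")
print(f"span after log10: {log_span:.2f} decades")
```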

In [1]:
#Rescaling Luminosity and Radius
rad_log=data["Radius(R/Ro)"]
lum_log=data["Luminosity(L/Lo)"]
rad_log=rad_log.apply(m.log10)
lum_log=lum_log.apply(m.log10)
data4=data  #alias, not a copy: `data` (and the earlier `data2`) are modified as well
data4["Radius(R/Ro)"]=rad_log
data4["Luminosity(L/Lo)"]=lum_log
sns.pairplot(data4)


# Changing the scale made the data look a lot better after all, especially when looking at the new relation of both parameters with the Star Type variable.
With this in mind, the model fitting will be done using the data4 dataframe.

# As this is my very first attempt at model selection and fitting on Kaggle, I will try several different models and compare them on held-out data to select the best performing one

In [1]:
data4["log(L/Lo)"]=data4["Luminosity(L/Lo)"]
data4["log(R/Ro)"]=data4["Radius(R/Ro)"]
data4.drop(["Luminosity(L/Lo)","Radius(R/Ro)"], axis=1, inplace=True)
data4.describe()
#Luminosity and Radius look better now

Up to this point all numeric parameters have been treated, but the categorical features (Star Color and Spectral Class) have not. Both of them have an ordinal relation: light color is related to light frequency, and Spectral Class has an order according to the Hertzsprung-Russell Diagram given in the dataset description.

* I'm not sure that the difference between two adjacent Spectral Classes can be accurately described by a unit step (i.e. making O=0, B=1...), but given the lack of supplemental info it's the approach I'm going to take
* The light color can be associated with its frequency, but since the color descriptions are not about isolated wavelengths but about the perceived light color, this approach becomes unfeasible

Keeping this in mind, I've decided to make 2 models for Spectral Class: one with dummy variables (model 1) and another with the ordinal encoding (model 2). The light color feature will be treated only as dummy variables.

In [1]:
#from now on the datax dataset refers to the data that uses model x for spectral class
data1=data4.drop(["Star color","Spectral Class"], axis=1)
data2=data4.drop("Star color", axis=1)

In [1]:
#Checking all categories in both features
data4["Spectral Class"].unique()

In [1]:
data4["Star color"].unique()
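
If the output of `unique()` shows labels that differ only in capitalization, spacing or hyphenation, they would each become a separate dummy column. The sketch below shows one possible clean-up on made-up labels. `normalize_color` is a hypothetical helper, not part of the notebook, and whether the real column needs it depends on what `unique()` returns.

```python
import pandas as pd


def normalize_color(label: str) -> str:
    """Lower-case, trim and unify separators in a color label (hypothetical helper)."""
    cleaned = label.strip().lower().replace("-", " ")
    # Collapse any run of whitespace into a single space.
    return " ".join(cleaned.split())


# Invented labels, for illustration only.
toy = pd.Series(["Blue White", "Blue-white", "blue white ", "Red", "red"])
normalized = toy.map(normalize_color)

print(normalized.unique().tolist())
```

Applying such a step before `pd.get_dummies` keeps the number of indicator columns equal to the number of distinct colors.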

In [1]:
#For data2 I'll use the following dictionary
sp_class={"O":0,"B":1,"A":2,"F":3,"G":4,"K":5,"M":6}
data2["Spectral Class"]=data2["Spectral Class"].map(sp_class)
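
A caveat about `Series.map` with a dictionary: any label missing from the dictionary silently becomes `NaN`. The sketch below, on made-up labels, shows a quick check that could be run right after the mapping. It reuses the same `sp_class` dictionary, and everything else in it is illustrative.

```python
import pandas as pd

sp_class = {"O": 0, "B": 1, "A": 2, "F": 3, "G": 4, "K": 5, "M": 6}

# Invented labels; the last two are deliberately malformed.
toy = pd.Series(["O", "B", "M", "m", "K "], name="Spectral Class")
mapped = toy.map(sp_class)

# Labels that were not found in the dictionary end up as NaN.
unmapped = toy[mapped.isna()].unique().tolist()
print("Labels without a mapping:", unmapped)
```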

In [1]:
#making dummies for light color and for spectral class
dumm_light=pd.get_dummies(data4["Star color"], drop_first=True)
dumm_class=pd.get_dummies(data4["Spectral Class"], drop_first=True)

In [1]:
data1=data1.join(dumm_light)
data1=data1.join(dumm_class)
data2=data2.join(dumm_light)

In [1]:
#checking datasets
data1.head()

In [1]:
data2.head()

# We can now proceed to model comparison

In [1]:
#sklearn imports
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Obs: Over the many runs of this notebook, a few splits gave a 1.00 score on many of the metrics for several models, so I chose to re-run the train_test_split cell in order to keep these possibly over-optimistic splits out of the result analysis. Only rf2 kept scoring 1.00 in all scenarios

In [1]:
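#first organize the train/test splits
#note: no random_state is set, so the two calls below draw different rows and every run gives a different split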
X1=data1.drop("Star type", axis=1)
y1=data1["Star type"]
X2=data2.drop("Star type", axis=1)
y2=data2["Star type"]
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3)
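
Since the scores depend on which rows land in the test set, an alternative to re-running the split is k-fold cross-validation, which averages over several splits. The sketch below is a minimal illustration on synthetic data (6 classes, 4 numeric features, invented values), not on the star table. Passing the same `cv` object to both feature sets would also ensure that model 1 and model 2 are compared on identical folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 classes with 40 rows each and a class-dependent shift.
y = np.repeat(np.arange(6), 40)
X = rng.normal(size=(240, 4)) + y[:, None]

# Stratified folds keep the class proportions equal in every fold;
# the fixed random_state makes the folds reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```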

# Model 1: Logistic Regression

In [1]:
lr1=LogisticRegression(max_iter=10000)
lr1.fit(X1_train,y1_train)
result_lr1=lr1.predict(X1_test)
print(confusion_matrix(y1_test,result_lr1))
print("\n")
print(classification_report(y1_test,result_lr1))

In [1]:
lr2=LogisticRegression(max_iter=10000)
lr2.fit(X2_train,y2_train)
result_lr2=lr2.predict(X2_test)
print(confusion_matrix(y2_test,result_lr2))
print("\n")
print(classification_report(y2_test,result_lr2))

# Model 2: Linear Discriminant Analysis

In [1]:
lda1=LinearDiscriminantAnalysis()
lda1.fit(X1_train,y1_train)
result_lda1=lda1.predict(X1_test)
print(confusion_matrix(y1_test,result_lda1))
print("\n")
print(classification_report(y1_test,result_lda1))

In [1]:
lda2=LinearDiscriminantAnalysis()
lda2.fit(X2_train,y2_train)
result_lda2=lda2.predict(X2_test)
print(confusion_matrix(y2_test,result_lda2))
print("\n")
print(classification_report(y2_test,result_lda2))

# Model 3: K Nearest Neighbors

In [1]:
#Decision of K value
#note: K is picked by its error on the test set, so the KNN test scores reported below are optimistic
error_rate1=[]
error_rate2=[]
for i in range(1,50):
    knn1=KNeighborsClassifier(n_neighbors=i)
    knn2=KNeighborsClassifier(n_neighbors=i)
    knn1.fit(X1_train,y1_train)
    knn2.fit(X2_train,y2_train)
    res_knn1_i=knn1.predict(X1_test)
    res_knn2_i=knn2.predict(X2_test)
    error_rate1.append(np.mean(res_knn1_i!=y1_test))
    error_rate2.append(np.mean(res_knn2_i!=y2_test))

In [1]:
n_values=range(1,50)
plt.figure(figsize=(12,20))
plt.plot(n_values,error_rate1)
plt.ylabel("Error rate")
plt.xlabel("K value")
neighbors1=error_rate1.index(min(error_rate1))+1

In [1]:
n_values=range(1,50)
plt.plot(n_values,error_rate2)
plt.ylabel("Error rate")
plt.xlabel("K value")
neighbors2=error_rate2.index(min(error_rate2))+1
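
Picking K by its error on the test set lets the test data influence the model. A common alternative is to choose K by cross-validation on the training set only, and to standardize the features first, since KNN relies on distances. The sketch below illustrates this on synthetic data with invented values. The pipeline, the grid of K values and the variable names are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 classes, 4 features, one of them on a much larger scale.
y = np.repeat(np.arange(6), 40)
X = rng.normal(size=(240, 4)) + y[:, None]
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)  # test set is used only once, at the end
print(best_k, round(test_accuracy, 3))
```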

In [1]:
knn1=KNeighborsClassifier(n_neighbors=neighbors1)
knn1.fit(X1_train,y1_train)
result_knn1=knn1.predict(X1_test)
print(confusion_matrix(y1_test,result_knn1))
print("\n")
print(classification_report(y1_test,result_knn1))

In [1]:
knn2=KNeighborsClassifier(n_neighbors=neighbors2)
knn2.fit(X2_train,y2_train)
result_knn2=knn2.predict(X2_test)
print(confusion_matrix(y2_test,result_knn2))
print("\n")
print(classification_report(y2_test,result_knn2))

# Model 4: Random Forest

In [1]:
rf1=RandomForestClassifier(n_estimators=300)
rf1.fit(X1_train,y1_train)
result_rf1=rf1.predict(X1_test)
print(confusion_matrix(y1_test,result_rf1))
print("\n")
print(classification_report(y1_test,result_rf1))

In [1]:
rf2=RandomForestClassifier(n_estimators=300)
rf2.fit(X2_train,y2_train)
result_rf2=rf2.predict(X2_test)
print(confusion_matrix(y2_test,result_rf2))
print("\n")
print(classification_report(y2_test,result_rf2))

# Model selection allowed us to see many important facts about the data:

* The results suggest that the suspected correlation between Temperature and Absolute Magnitude is probably not real, or at least not strong enough to matter, given that all models performed very well on the dataset
* Logistic regression is usually not very well suited to more complex datasets, yet it performed very well on this data, probably as a result of the logarithmic rescaling of Luminosity and Radius. (I chose not to explore this possibility; if anyone would like to try, I would appreciate the feedback, although I strongly believe that the values will exceed Python's mathematical limit due to the size of the parameters)
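* The KNN model had the worst performance of all models studied. KNN is a supervised, distance-based method, so a likely explanation is that the features were not standardized before distances were computed, letting the columns with the largest ranges dominate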
* The ordinal encoding of Spectral Class improved model performance regardless of the model chosen, showing that the order indicated on the Hertzsprung-Russell Diagram had a very positive effect on the data. A similar encoding could have been built for light color, but the subjectiveness of the labels and the need to select a single frequency value for a whole spectrum made this task difficult and prone to error
* The linear models had some trouble with Star Types 3 and 4, probably because of overlapping values in some parameters such as Temperature and Luminosity. If these models were used, extra care should be taken when dealing with these types
* The random forest 2 classifier outperformed all models studied, but when an easier scientific interpretation of the model matters, the logistic regression 2 model could be used instead: the loss of performance is not significant compared to the large gain in interpretability
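
As an illustration of the interpretability point, the coefficients of a fitted logistic regression can be laid out as one row per class and one column per feature. The sketch below uses synthetic data with invented values and only borrows the two log-scaled column names. It shows the mechanics, not results for the star dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 classes, 2 features shifted by class.
features = ["log(L/Lo)", "log(R/Ro)"]
y = np.repeat(np.arange(3), 30)
X = pd.DataFrame(rng.normal(size=(90, 2)) + y[:, None], columns=features)

model = LogisticRegression(max_iter=10000).fit(X, y)

# One row per class, one column per feature: the sign and size of each
# coefficient show how a feature pushes the prediction towards that class.
coef_table = pd.DataFrame(model.coef_, index=model.classes_, columns=features)
print(coef_table.round(2))
```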

# Thanks for reading my kernel!