# Introduction

Today, our goal is to apply machine learning methods in Python to classify stars into types using multiple features.

# Understanding the Data

In [None]:
#importing dataset and libraries
import numpy as np 
import pandas as pd 

df = pd.read_csv("/kaggle/input/star-dataset/6 class csv.csv")

Let's take a look at our dataset:

In [None]:
df.shape

In [None]:
df.head()

The features we have here are a few properties of each star:

* Absolute Temperature (in K)
Temperature of the star in kelvin.
* Relative Luminosity (L/Lo)
The ratio of the star's luminosity (brightness) to the luminosity of the Sun.
* Relative Radius (R/Ro)
The ratio of the star's radius to the radius of the Sun.
* Absolute Magnitude (Mv)
A measure of the luminosity of a celestial object on an inverse logarithmic magnitude scale: the brighter the object, the lower the value.
* Star Color (white, Red, Blue, Yellow, yellow-orange, etc.)
Self-explanatory. It is the color of the star as it appears in the sky.
* Spectral Class (O, B, A, F, G, K, M)
The class a star falls into based on its spectrum, which is closely tied to its temperature: O stars are the hottest and M stars the coolest.
![HRmetrics.jpg](attachment:HRmetrics.jpg)
* Star Type (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Supergiant, Hypergiant)
The type of star. This is the value we'll try to predict from the other features.
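
Temperature, radius and luminosity are not independent of each other. As a rough sketch, assuming a star radiates like a blackbody, the Stefan-Boltzmann law gives L/Lo = (R/Ro)^2 * (T/To)^4, and a magnitude follows from the luminosity on an inverse logarithmic scale. The solar constants below are assumptions I plugged in for illustration, and the magnitude computed here is bolometric, so it will not match the dataset's Mv column exactly.

```python
import math

T_SUN_K = 5772.0   # assumed nominal solar effective temperature in kelvin
M_BOL_SUN = 4.74   # assumed absolute bolometric magnitude of the Sun

def relative_luminosity(rel_radius, temperature_k):
    """L/Lo from R/Ro and temperature, via the Stefan-Boltzmann law."""
    return rel_radius ** 2 * (temperature_k / T_SUN_K) ** 4

def bolometric_magnitude(rel_luminosity):
    """Absolute bolometric magnitude; brighter stars get smaller values."""
    return M_BOL_SUN - 2.5 * math.log10(rel_luminosity)

sun_like = relative_luminosity(1.0, 5772.0)
hot_giant = relative_luminosity(10.0, 11544.0)  # 10x the radius, 2x the temperature

print(sun_like, bolometric_magnitude(sun_like))
print(hot_giant, bolometric_magnitude(hot_giant))
```

A star with 10 times the Sun's radius and twice its temperature comes out 1600 times more luminous, which is why a correlation heatmap of these columns tends to show strong relationships.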

In [None]:
#checking for missing values
df.isnull().sum()

In [None]:
#different star types and Spectral Classes
df['Star type'].value_counts() , df['Spectral Class'].value_counts()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
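#checking correlation between the numeric variables before PCA
#(text columns are excluded because corr() cannot handle them)
sns.heatmap(data = df.select_dtypes(include="number").corr(), annot = True)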

# PCA and Normalizing 

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

df['Star_color'] = labelencoder.fit_transform(df['Star color'])
df['Spectral_Class'] = labelencoder.fit_transform(df['Spectral Class'])
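
The color labels in the feature list appear with mixed capitalization, so the same color may be spelled in more than one way, and `LabelEncoder` would treat each spelling as a separate category. It also assigns integers in alphabetical order, which implies an ordering between colors that has no physical meaning. Here is a minimal sketch of one way to handle both issues. The sample values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values; the real column may contain other spellings.
colors = pd.Series(
    ["Red", "Blue White", "Blue-white", "yellow-orange", "white", "White "]
)

# Normalize case, surrounding whitespace and hyphens so equivalent labels merge.
clean = (
    colors.str.strip()
          .str.lower()
          .str.replace("-", " ", regex=False)
)

# One-hot encoding creates one indicator column per color,
# so no order between colors is implied.
one_hot = pd.get_dummies(clean, prefix="color")

print(colors.nunique(), "distinct labels before cleaning")
print(clean.nunique(), "distinct labels after cleaning")
print(one_hot.columns.tolist())
```

Whether one-hot columns or integer codes work better depends on the model: tree-based models usually cope with integer codes, while linear models and PCA are more sensitive to the arbitrary ordering.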

In [None]:
features = df.drop(['Star type','Star color','Spectral Class'], axis = 1)
labels = df['Star type']

In [None]:
#standardizing the features (zero mean, unit variance) before PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_train_features = scaler.fit_transform(features)

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(scaled_train_features)
exp_variance = pca.explained_variance_ratio_

In [None]:
fig, ax = plt.subplots()
ax.bar(range(pca.n_components_), exp_variance)
ax.set_xlabel('Principal Component number')

In [None]:
cum_exp_variance = np.cumsum(exp_variance)

fig, ax = plt.subplots()
ax.plot(cum_exp_variance)
ax.axhline(y=0.85, linestyle=':')

The first two principal components together explain about 85% of the variance, so we perform PCA with the number of components set to 2.
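
Instead of reading the number of components off the plot, we can also compute it. The sketch below runs on synthetic data as a stand-in for the scaled star features (an assumption, so the resulting number will differ from ours). It finds the smallest number of components whose cumulative explained variance reaches the threshold, and shows that passing a float to `PCA` makes scikit-learn do the same selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled star features (assumption, not the real data).
X, _ = make_classification(n_samples=240, n_features=6, n_informative=4,
                           n_redundant=2, random_state=10)
X_scaled = StandardScaler().fit_transform(X)

threshold = 0.85
cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold.
n_needed = int(np.argmax(cum_var >= threshold) + 1)

# A float in (0, 1) asks PCA to pick the number of components itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)

print(n_needed, pca_auto.n_components_)
```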

In [None]:
n_component = 2

pca = PCA(n_component, random_state=10)
pca.fit(scaled_train_features)
pca_projection = pca.transform(scaled_train_features)

# Model Building

**Splitting the Dataset**

In [None]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(pca_projection, labels, random_state=10)
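
With six classes and a small dataset, a purely random split can leave some star types under-represented in the test set. A possible refinement is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data with six balanced classes of 40 samples each, which is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 balanced classes with 40 samples each (assumption).
rng = np.random.default_rng(10)
X = rng.normal(size=(240, 2))
y = np.repeat(np.arange(6), 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=10
)

# With stratify=y every class contributes the same share to the test set.
print(np.bincount(y_test))
```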

**Decision Tree Classifier**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=10)
dt.fit(train_features, train_labels)
pred_labels_tree = dt.predict(test_features)

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)

**Random Forest Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=10)
clf.fit(train_features, train_labels)
pred_labels_forest = clf.predict(test_features)
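
Predictions on the held-out split only become useful once we compare them with the true labels. Here is a minimal sketch of how that could look with `accuracy_score` and `confusion_matrix`, using synthetic data in place of the PCA projection and star-type labels (an assumption, so the numbers say nothing about the star dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PCA projection and star-type labels (assumption).
X, y = make_classification(n_samples=240, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=10
)

forest = RandomForestClassifier(random_state=10).fit(X_train, y_train)
pred = forest.predict(X_test)

# Accuracy is the fraction of correct predictions; the confusion matrix
# shows which classes get mistaken for which.
acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)

print(f"accuracy: {acc:.3f}")
print(cm)
```

The same two calls work for the decision tree and logistic regression predictions, which makes a quick side-by-side comparison possible before cross-validating.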

# Validation

We'll use 10-fold cross-validation (KFold) to validate and compare all three of our models.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kf = KFold(n_splits=10)

tree = DecisionTreeClassifier()
logreg = LogisticRegression()
clf = RandomForestClassifier()

tree_score = cross_val_score(tree, pca_projection, labels, cv=kf)
logit_score = cross_val_score(logreg, pca_projection, labels, cv=kf)
rt_score = cross_val_score(clf,pca_projection, labels, cv=kf)

# Mean of all the score arrays
print("Decision Tree:", np.mean(tree_score),"Logistic Regression:", np.mean(logit_score),"Random Forest:",np.mean(rt_score))
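
One caveat with the scores above: the scaler and PCA were fit on the whole dataset before cross-validation, so every held-out fold has already influenced the preprocessing. Also, `KFold` without shuffling can produce folds that miss some classes if the rows happen to be ordered by star type. A possible alternative, sketched here on synthetic data (an assumption, not our star features), wraps the preprocessing and the model in a pipeline and uses shuffled, stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the raw star features and labels (assumption).
X, y = make_classification(n_samples=240, n_features=6, n_informative=4,
                           n_redundant=2, n_classes=3, n_clusters_per_class=1,
                           random_state=10)

# Scaling and PCA are refit on the training part of every fold,
# so the held-out fold never influences the preprocessing.
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      LogisticRegression(max_iter=1000))

# Shuffled, stratified folds keep class proportions similar in each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
scores = cross_val_score(model, X, y, cv=cv)

print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Swapping `LogisticRegression` for `DecisionTreeClassifier` or `RandomForestClassifier` in the pipeline gives comparable scores for the other two models.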

# Conclusion

I would like to mention an amazing reference that I read and learned from:

https://dataphrase.github.io/songclfy/

Thanks for reading! Any suggestions and improvements are welcome.