In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
stars = pd.read_csv("stars_data.csv")
stars

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type,Star color,Spectral Class
0,3068,0.002400,0.1700,16.12,0,Red,M
1,3042,0.000500,0.1542,16.60,0,Red,M
2,2600,0.000300,0.1020,18.70,0,Red,M
3,2800,0.000200,0.1600,16.65,0,Red,M
4,1939,0.000138,0.1030,20.06,0,Red,M
...,...,...,...,...,...,...,...
235,38940,374830.000000,1356.0000,-9.93,5,Blue,O
236,30839,834042.000000,1194.0000,-10.63,5,Blue,O
237,8829,537493.000000,1423.0000,-10.73,5,White,A
238,9235,404940.000000,1112.0000,-11.23,5,White,A


In [3]:
stars["Star type"].unique()

array([0, 1, 2, 3, 4, 5], dtype=int64)

In [5]:
stars["Spectral Class"].unique()

array(['M', 'B', 'A', 'F', 'O', 'K', 'G'], dtype=object)

In [6]:
stars["Star color"].unique()

array(['Red', 'Blue White', 'White', 'Yellowish White', 'Blue white',
       'Pale yellow orange', 'Blue', 'Blue-white', 'Whitish',
       'yellow-white', 'Orange', 'White-Yellow', 'white', 'Blue ',
       'yellowish', 'Yellowish', 'Orange-Red', 'Blue white ',
       'Blue-White'], dtype=object)
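The output above lists several labels that differ only in case, hyphenation, or trailing whitespace (for example 'Blue White', 'Blue white', 'Blue-white', and 'Blue white '). Encoding them as-is gives each spelling its own dummy column. One possible cleanup, shown here on a small hand-copied sample rather than on the full column, is sketched below. The names `raw_colors` and `clean_colors` are illustrative, and the notebook itself continues with the raw labels.

```python
import pandas as pd

# A hand-copied sample of the raw labels from the output above.
raw_colors = pd.Series(
    ["Blue White", "Blue white", "Blue-white", "Blue-White", "Blue white ",
     "Blue ", "white", "White", "yellowish", "Yellowish"]
)

# Strip surrounding whitespace, lowercase, and treat hyphens as spaces
# so that spelling variants collapse into a single label.
clean_colors = (
    raw_colors.str.strip()
    .str.lower()
    .str.replace("-", " ", regex=False)
)

print(sorted(clean_colors.unique()))
# ['blue', 'blue white', 'white', 'yellowish']
```

Whether labels such as 'Yellowish White' and 'White-Yellow' should also be merged is a judgment call that this sketch does not make.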

"Spectral Class" and "Star color" are categorical columns that hold text, so they must be encoded numerically before we train models. A simple way to do this is to create dummy variables: below, we take "Spectral Class" and "Star color" and expand each into multiple binary variables, one per distinct value of the feature. For context, Spectral Class groups stars by their spectrum, which tracks surface temperature, as summarized in the table below:

Source: https://lweb.cfa.harvard.edu/~pberlind/atlas/htmls/note.html

In [7]:
pd.read_html("https://lweb.cfa.harvard.edu/~pberlind/atlas/htmls/note.html")[0].rename(columns={"Spectral Type": "Spectral Class"})

Unnamed: 0,Spectral Class,Surface Temperature,Distinguishing Features
0,O,"> 25,000K",H; HeI; HeII
1,B,"10,000-25,000K",H; HeI; HeII absent
2,A,"7,500-10,000K",H; CaII; HeI and HeII absent
3,F,"6,000-7,500K","H; metals (CaII, Fe, etc)"
4,G,"5,000-6,000K",H; metals; some molecular species
5,K,"3,500-5,000K",metals; some molecular species
6,M,"< 3,500K",metals; molecular species (TiO!)
7,C,"< 3,500K",metals; molecular species (C2!)

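As a rough sanity check, the temperature ranges in the table can be turned into bins and compared with the "Spectral Class" values in the data. The sketch below is an illustration only and makes several assumptions: the names `bins`, `labels`, `temps`, and `expected_class` are hypothetical, class C is left out because it shares the "< 3,500K" range with M, and values that fall exactly on a boundary go to the cooler class (the default right-closed behaviour of `pd.cut`), which the table does not specify.

```python
import numpy as np
import pandas as pd

# Bin edges in kelvin, read off the reference table above.
# Assumption: boundary values belong to the cooler class (right-closed bins).
bins = [0, 3500, 5000, 6000, 7500, 10000, 25000, np.inf]
labels = ["M", "K", "G", "F", "A", "B", "O"]

# Temperatures copied from rows 0, 235, 236, 237 and 238 of the stars preview.
temps = pd.Series([3068, 38940, 30839, 8829, 9235])

expected_class = pd.cut(temps, bins=bins, labels=labels)
print(expected_class.tolist())
# ['M', 'O', 'O', 'A', 'A']
```

For these five rows the binned class agrees with the "Spectral Class" column shown in the preview (M, O, O, A, A).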

Instead of keeping a single feature called "Spectral Class", we split it into one binary feature per value that occurs in the data, each indicating whether an observation belongs to that class (1) or not (0). The reference table lists 8 classes, but only 7 of them appear in this dataset (there are no class C stars), so we get 7 columns: "Spectral Class_A", "Spectral Class_B", "Spectral Class_F", and so on. If an observation's "Spectral Class" is F, then "Spectral Class_F" is 1 and the other six are 0. The same transformation is applied to "Star color".
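To make the transformation concrete, here is a minimal sketch on a four-row toy frame (the frame `toy` and the result `encoded` are made up for illustration and are not part of the stars data). Passing `dtype=int` asks for 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data: four observations with three distinct spectral classes.
toy = pd.DataFrame({"Spectral Class": ["M", "F", "A", "F"]})

# One binary column is created per distinct value, named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=["Spectral Class"], dtype=int)

print(list(encoded.columns))
# ['Spectral Class_A', 'Spectral Class_F', 'Spectral Class_M']
print(encoded.loc[1].tolist())  # the second row has class F
# [0, 1, 0]
```

Each row has exactly one 1 across the new columns, and the original "Spectral Class" column is dropped from the result.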

In [11]:
stars_dummy_df = pd.get_dummies(stars, columns = ['Spectral Class', 'Star color'] )
stars_dummy_df

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type,Spectral Class_A,Spectral Class_B,Spectral Class_F,Spectral Class_G,Spectral Class_K,...,Star color_Pale yellow orange,Star color_Red,Star color_White,Star color_White-Yellow,Star color_Whitish,Star color_Yellowish,Star color_Yellowish White,Star color_white,Star color_yellow-white,Star color_yellowish
0,3068,0.002400,0.1700,16.12,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,3042,0.000500,0.1542,16.60,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,2600,0.000300,0.1020,18.70,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,2800,0.000200,0.1600,16.65,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,1939,0.000138,0.1030,20.06,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,38940,374830.000000,1356.0000,-9.93,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
236,30839,834042.000000,1194.0000,-10.63,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
237,8829,537493.000000,1423.0000,-10.73,5,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
238,9235,404940.000000,1112.0000,-11.23,5,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [12]:
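# numpy and pandas were already imported in the first cell.
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

# Features: every column except the "Star type" label.
X = stars_dummy_df.drop(columns=["Star type"])
# Labels: "Star type", kept as a one-column DataFrame.
y = stars_dummy_df[["Star type"]]
y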

Unnamed: 0,Star type
0,0
1,0
2,0
3,0
4,0
...,...
235,5
236,5
237,5
238,5
