# Star Classification: Detailed EDA - Machine Learning
Welcome to this kernel. Here we'll explore the star classification dataset with pandas and visualization tools, then try several machine learning algorithms and pick the best-performing one.



# Table of Contents
1. Preparing Environment
1. Data Overview
    * Checking Missing Data
    * Class Distribution
    * Temperature - Type Relation
    * L - Type Relation
    * R - Type Relation
    * A_M - Type Relation
1. Preparing Data
    * One Hot Encoding
    * Scaling Between 0 and 1
    * Train Test Splitting
1. Machine Learning
1. Conclusion

# 1. Preparing Environment
In this section we import the libraries and load the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC,LinearSVC
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB

In [None]:
import warnings as wrn
# Silence library warnings to keep the notebook output readable.
wrn.filterwarnings("ignore")

In [None]:
data = pd.read_csv('../input/star-type-classification/Stars.csv')
data.head()

# 2. Data Overview
In this section we take a look at the data in order to understand it properly. Before starting, I'll define a few helper functions to make the job easier.

In [None]:
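import random
def getRandomColor():
    # Return a random RGB tuple with integer channels in the range 0-255.
    # Note: matplotlib expects channels in [0, 1], so divide by 255 before plotting.
    R,G,B = random.randint(0,255),random.randint(0,255),random.randint(0,255)
    return (R,G,B)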

In [None]:
def getRandomPalette():
    palettes = ['Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 
                'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG',
                'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu',
                'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 
                'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cividis', 'cividis_r',
                'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'crest', 'crest_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'flare', 'flare_r', 'gist_earth', 'gist_earth_r', 'gist_gray',
                'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r',
                'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 
                'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 
                'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'turbo', 'turbo_r', 'twilight', 'twilight_r', 'twilight_shifted', 'twilight_shifted_r', 'viridis', 'viridis_r', 
                'vlag', 'vlag_r', 'winter', 'winter_r']
    return random.choice(palettes)

### Are there any problems in the data?

In [None]:
data.head()

In [None]:
data.info()

* There is no missing data.
* All data types look correct, so we can move on and check the class distribution.
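The check above relies on reading the `data.info()` output by eye. A more explicit check could look like the following minimal sketch. The `toy` frame and its values are made up for illustration and stand in for the real dataframe.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the star data; the values are made up.
toy = pd.DataFrame({
    "Temperature": [3068, 3042, np.nan, 2800],
    "Color": ["Red", "Red", "Blue", None],
})

# Count the missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

Running the same two calls on `data` should report zero missing entries for every column if the reading of `data.info()` is right.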

### Class Distribution

In [None]:
plt.subplots(figsize=(6,4))
plt.title("Class Distribution")
sns.countplot(x="Type",data=data,palette="twilight")
plt.show()

* We have 6 classes with 40 samples each.
* The data is perfectly balanced.
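The balance can also be confirmed numerically rather than from the plot alone. Here is a small sketch that uses a made-up `labels` series in place of `data["Type"]`:

```python
import pandas as pd

# Made-up labels standing in for data["Type"].
labels = pd.Series([0, 0, 1, 1, 2, 2], name="Type")

# Number of samples per class, ordered by class label.
counts = labels.value_counts().sort_index()
print(counts)

# The classes are perfectly balanced when every class has the same count.
is_balanced = counts.nunique() == 1
print(is_balanced)
```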


### Temperature - Type (Target) Relation

In [None]:
temp_by_class = data.groupby("Type")["Temperature"].mean()
temp_by_class

In [None]:
plt.subplots(figsize=(6,4))
plt.title("Temperature - Type Relation")
sns.barplot(x=temp_by_class.index,y=temp_by_class.values,palette=getRandomPalette())
plt.show()

* As we can see, the mean temperature differs between the classes, so this feature might be useful for our classification problem.

### L - Type (Target) Relation


In [None]:
l_by_class = data.groupby("Type")["L"].mean()
l_by_class

In [None]:
plt.subplots(figsize=(6,4))
plt.title("L - Type Relation")
sns.barplot(x=l_by_class.index,y=l_by_class.values,palette=getRandomPalette())
plt.show()

* Classes 0, 1 and 2 have really small values; class 3 is also small, but larger than classes 0, 1 and 2.
* Classes 4 and 5 have really big values.

So this feature might also be useful for building a model for this task.

### R - Type (Target) Relation


In [None]:
r_by_class = data.groupby("Type")["R"].mean()
r_by_class

In [None]:
plt.subplots(figsize=(6,4))
plt.title("R - Type Relation")
sns.barplot(x=r_by_class.index,y=r_by_class.values,palette=getRandomPalette())
plt.show()

* Classes 0, 1, 2 and 3 have small values.
* Class 4 also has a small value, but it is larger than those of classes 0 to 3.
* Class 5 has a really big value, so this feature might be descriptive.

### A_M - Type (Target) Relation

In [None]:
am_by_class = data.groupby("Type")["A_M"].mean()
am_by_class

In [None]:
plt.subplots(figsize=(6,4))
plt.title("A_M - Type Relation")
sns.barplot(x=am_by_class.index,y=am_by_class.values,palette=getRandomPalette())
plt.show()

* Although the differences are not big, this feature might be descriptive as well, so we won't drop it either.
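The four comparisons above each look at one per-class mean at a time. As a possible shortcut, several features can be summarized in one `groupby` call, and adding the median helps when a few extreme stars pull the mean. This is only a sketch on a made-up `toy` frame, not the notebook's actual data.

```python
import pandas as pd

# Made-up values; the real frame has 240 rows and more features.
toy = pd.DataFrame({
    "Type": [0, 0, 1, 1],
    "Temperature": [3000, 3200, 9000, 11000],
    "L": [0.001, 0.003, 50.0, 150.0],
})

# Mean and median of each feature, one row per class.
summary = toy.groupby("Type")[["Temperature", "L"]].agg(["mean", "median"])
print(summary)
```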

# 3. Preparing Data
In this section we prepare the data for use in a machine learning model. Let's take another look at the dataframe to see what needs to be done.

In [None]:
data.head()

* As you see we have two categorical features: Color and Spectral_Class. 
* So we should one hot encode those features.

In [None]:
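# Encode the two categorical columns; "Type" is the target and stays as it is.
# The second positional argument of get_dummies is `prefix`, so name `columns` explicitly.
data = pd.get_dummies(data,columns=["Color","Spectral_Class"],dtype=int)
data.head()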
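To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up three-row frame. The column names mirror the dataset, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows; only the column names follow the star dataset.
toy = pd.DataFrame({
    "Temperature": [3068, 25000, 5800],
    "Color": ["Red", "Blue", "Red"],
    "Spectral_Class": ["M", "B", "G"],
})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["Color", "Spectral_Class"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Numeric columns such as `Temperature` pass through unchanged, while each categorical column is replaced by one indicator column per distinct value.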

* As we can see, our features need to be scaled because their ranges differ greatly.

* We'll use this formula to scale our data between 0 and 1:

**(value - min(data)) / (max(data) - min(data))**

In [None]:
# Use the column-wise min and max so every feature is scaled independently.
data_scaled = (data - data.min()) / (data.max() - data.min())
data_scaled.head()
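As a quick worked example of the min-max formula, the sketch below scales a small array and a small frame. The numbers are made up and chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# One feature: min is 2, max is 10, so 4 maps to (4 - 2) / (10 - 2) = 0.25.
values = np.array([2.0, 4.0, 10.0])
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)

# On a frame, min() and max() work column by column,
# so each feature gets its own 0-1 range.
frame = pd.DataFrame({"a": [2.0, 4.0, 10.0], "b": [100.0, 300.0, 200.0]})
frame_scaled = (frame - frame.min()) / (frame.max() - frame.min())
print(frame_scaled)
```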

* Now let's separate the target from the descriptive features.

In [None]:
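X = np.asarray(data_scaled.drop("Type",axis=1))
# Take the labels from the unscaled frame: the scaled "Type" column holds
# fractions in [0, 1], and casting those to int would collapse classes 0-4 into 0.
Y = np.asarray(data["Type"],dtype=int)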
print(X.shape)
print(Y.shape)

* And now we can split our dataset into train and test sets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.25,random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
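With only 40 samples per class, a plain random split can leave the test set slightly unbalanced, and scaling before splitting lets test-set statistics influence the training features. The sketch below shows one possible alternative, not the approach used in this kernel: a stratified split followed by a scaler fitted on the training rows only. `X_toy` and `y_toy` are synthetic stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 6 classes, 4 samples each, 3 features on different scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(24, 3)) * [1.0, 100.0, 10000.0]
y_toy = np.repeat(np.arange(6), 4)

# stratify=y_toy keeps the class proportions equal in both subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = MinMaxScaler().fit(x_tr)
x_tr_scaled = scaler.transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(np.bincount(y_tr), np.bincount(y_te))
```

Test rows may fall slightly outside the 0-1 range after this transform, which is expected when the scaler has not seen them.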


# 4. Machine Learning
In this section we fit several machine learning models on our small dataset and compare them.

In [None]:
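def compareModels(classifiers,data,test_data):
    """Fit each classifier on data=(X, y) and return its accuracy on test_data=(X, y)."""
    result_dict = {}
    for clf in classifiers:
        clf.fit(data[0],data[1])
        # Key by the short class name, e.g. "SVC", for readable results.
        result_dict[type(clf).__name__] = clf.score(test_data[0],test_data[1])
    return result_dict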

In [None]:
results = compareModels([SVC(),
                         LinearSVC(),
                         GaussianNB(),
                         BernoulliNB(),
                         AdaBoostClassifier(DecisionTreeClassifier()),
                         RandomForestClassifier()
                        ],
                        (x_train,y_train),
                        (x_test,y_test))

In [None]:
results

* We have a small dataset, so the results of a single train/test split might be misleading, but even so the models score well.
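Because a single split of a small dataset can be misleading, cross-validation is a common way to get a steadier estimate. The following is a hedged sketch only: it uses scikit-learn's bundled iris data as a stand-in, since the star CSV is not available here, and the choice of five folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data; in the notebook this would be X and Y.
X_demo, y_demo = load_iris(return_X_y=True)

# Five stratified folds: every sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X_demo, y_demo, cv=cv)

# The spread across folds shows how much a single split could vary.
print(scores.mean(), scores.std())
```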

# 5. Conclusion
Thanks for your attention. If you have a question, please ask in the comment section and mention me. I'll get back to you as soon as possible.
