# Introduction

The purpose of this investigation is to create models that make accurate predictions from a data set.
The data set I have based my investigation on is star classification. Using the star-type-classification data set, I hope to accurately predict the type of a star. I think it will be cool to predict star types from their features alone and, who knows, maybe it will be useful in real-life situations in the future.
## Hypothesis:
I believe the best model will be a classifier. The predictions I want are categorical rather than numerical, so I'll write some code to produce them, and a classifier can make categorical predictions whereas a regressor cannot. I hope I'll be able to make some pretty accurate predictions.

# Setup:
The code below contains the steps needed to set up our machine learning environment. Key features are described in the comments.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.tree import DecisionTreeClassifier, plot_tree # Our model
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Gathering and exploring data
The data we're dealing with here is mostly numerical, with some categorical columns. Each column describes one of a star's key features.
- Temperature = Kelvin
- L = Luminosity/Lo (Lo = Avg luminosity of sun)
- R = Radius/Ro (Ro = Avg radius of sun)
- A_M = Absolute magnitude
- Color = General Color of Spectrum
- Spectral_Class = O,B,A,F,G,K,M / SMASS
- Type = Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, Hyper Giants

The aim is to predict the type of star from the given data. By using the temperature, luminosity, radius, etc., we'll be able to tell whether a star is a red dwarf, brown dwarf, white dwarf, main sequence star, super giant or hyper giant.

In our data set, the colours are spelled in many inconsistent ways, which would make them hard to encode. To solve this we reuse code from *devchauhan1* that merges the variants into one colour each: for example, yellow-white, whitish and yellowish white all become plain white. This is the link to that code:
https://www.kaggle.com/devchauhan1/star-type-classification-nasa/data

In [None]:
train_file_path = '../input/star-type-classification/Stars.csv'

# Create a new Pandas DataFrame with our training data
star_train_data = pd.read_csv(train_file_path)

# Spelling variants that are merged into "White"
x=["Blue-white","Blue White","yellow-white","Blue white","Yellowish White","Blue-White","White-Yellow","Whitish","white"]
for i in x:
    star_train_data.loc[star_train_data["Color"]==i,"Color"]= "White"
# Variants merged into "Yellow"
for i in ["yellowish","Yellowish"]:
    star_train_data.loc[star_train_data["Color"]==i,"Color"]="Yellow"
# Variants merged into "Orange"
for i in ["Orange-Red","Pale yellow orange"]:
    star_train_data.loc[star_train_data["Color"]==i,"Color"]="Orange"


#star_train_data.columns
star_train_data.describe(include='all')
#star_train_data.head()
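
The colour clean-up in the loops above can also be written as a single lookup table. The following is only a sketch on a small made-up Series (the real values come from Stars.csv), and `color_map` is a hypothetical name that covers just a few of the variants:

```python
import pandas as pd

# Toy stand-in for the 'Color' column; not the real data
colors = pd.Series(["Blue-white", "Blue White", "yellowish", "Orange-Red", "Red", "white"])

# Hypothetical lookup table mirroring part of the loops above
color_map = {
    "Blue-white": "White",
    "Blue White": "White",
    "white": "White",
    "yellowish": "Yellow",
    "Orange-Red": "Orange",
}

# Series.replace swaps each key for its value and leaves other labels untouched
cleaned = colors.replace(color_map)

# value_counts shows which colour labels remain after merging
print(cleaned.value_counts())
```

Checking `value_counts()` after the clean-up is a quick way to spot any spelling variant that was missed.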


# Prepare the data

Fortunately, every column in this data set is filled in, so there is no need to drop any rows for missing values. Each feature describes a property that could help identify the type of star we're classifying, so we keep all of them as inputs.
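
A claim like "no missing values" is easy to check directly. Here is a minimal sketch on a toy frame with one deliberately missing entry; in the notebook the same call would be made on `star_train_data`, where every count should come out as zero if the columns really are complete:

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing temperature; not the real data
toy = pd.DataFrame({
    "Temperature": [3068, 3042, np.nan],
    "Color": ["Red", "Red", "White"],
})

# isna() marks missing cells, sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```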

In [None]:

selected_columns = ['Temperature', 'L', 'R', 'A_M', 'Color', 'Spectral_Class', 'Type']


prepared_data = star_train_data[selected_columns]


prepared_data.describe(include='all')

The only thing we drop from our features (X) is the actual value we're trying to predict, which in our case is the star type, 'Type' (y).

In [None]:
y = prepared_data.Type


X = prepared_data.drop('Type', axis=1)

y = y.replace({0: 'Red Dwarf', 1: 'Brown Dwarf', 2: 'White Dwarf', 3: 'Main Sequence', 4: 'Super Giant', 5: 'Hyper Giant'})


X.head()

With the get_dummies function from pandas, we can turn categorical data into numerical data. This matters because two of our features, colour and spectral class, are categorical, and the scikit-learn classifiers we fit later need numerical features. get_dummies replaces each categorical column with one indicator column per category.

In [None]:
X = pd.get_dummies(X)

X.head()
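
To see what this encoding does, here is a small sketch on a two-row toy frame (made-up values, not the real data). The numeric column passes through unchanged, while the text column is split into one indicator column per category:

```python
import pandas as pd

# Toy frame: one numeric feature and one categorical feature
toy = pd.DataFrame({
    "Temperature": [3068, 25000],
    "Spectral_Class": ["M", "B"],
})

# Only the text column is expanded; new columns are named <column>_<category>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```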

# Choosing and training models

Now that we have prepared our data, we split it into a training set and a validation set. By default, train_test_split holds out 25% of the rows for validation.

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
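
The sketch below shows how the split behaves on stand-in data with six balanced classes (the arrays are invented for illustration). It also shows the optional `stratify` argument, which the notebook does not use but which keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 48 rows, six classes with 8 rows each
X_toy = np.arange(96).reshape(48, 2)
y_toy = np.repeat(["a", "b", "c", "d", "e", "f"], 8)

# Default split: 25% of the rows are held out
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)
print(len(tr_X), len(va_X))

# Stratified split: each class appears in the same proportion in both parts
s_tr_X, s_va_X, s_tr_y, s_va_y = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)
labels, counts = np.unique(s_va_y, return_counts=True)
print(dict(zip(labels, counts)))
```

With only a few rows per class, a stratified split may be worth considering so that no star type is missing from the validation set.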

To make the model a little easier to visualize, we fit a decision tree on the training set and plot it.

In [None]:
star_model = DecisionTreeClassifier(max_depth = 100)
star_model.fit(train_X, train_y)


plt.figure(figsize = (50,40))
# class_names must follow the model's own (sorted) class order, so read it from classes_
plot_tree(star_model,
          feature_names=list(train_X.columns),
          class_names=list(star_model.classes_),
          filled=True)
plt.show()
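
A large plotted tree can be hard to read, so a fitted tree can also be printed as text rules with `export_text`. This is a sketch on scikit-learn's bundled iris data, used here only as a stand-in for the star data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data set bundled with scikit-learn
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# One line per split or leaf, indented by depth
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# classes_ holds the class labels in the order the model uses internally
print(tree.classes_)
```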

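Here is an example of what the fitted tree predicts for rows of its own training data, shown next to the true star types.

In [None]:
# Predict on the same rows the tree was trained on
pred = star_model.predict(train_X)

print("The predictions are:")

# Work on a copy so the feature matrix X is left unchanged
preview = train_X.copy()
preview['Star_Type'] = train_y
preview['Predicted_Type'] = pred

preview.head()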

## Models
Here is where we fit the models. The models I've chosen are a decision tree classifier and a random forest classifier. By using two models we can compare their accuracy and find out which one does a better job of predicting the star types on the validation set.
We use classifiers instead of regressors because regressors predict numerical values, whereas our predictions are categorical.
The decision tree classifier builds a single decision tree, whereas the random forest builds multiple decision trees.

In [None]:
star_predictor = DecisionTreeClassifier(max_depth = 5)

star_predictor.fit(train_X, train_y)

classi_val_predictions =  star_predictor.predict(val_X)
accuracy_score(val_y, classi_val_predictions)

In [None]:
star_forest_predictor = RandomForestClassifier(random_state=1, max_depth = 3)

star_forest_predictor.fit(train_X, train_y)

forr_val_preds = star_forest_predictor.predict(val_X)
accuracy_score(val_y, forr_val_preds)    
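
A single train/validation split on a small data set can give a score that depends on which rows happen to land in the validation set. One common way to get a steadier comparison is cross-validation. The sketch below uses the bundled iris data as a stand-in, with the same depth settings as the two models above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be X and y
X_demo, y_demo = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(max_depth=3, random_state=1),
}

# 5-fold cross-validation: each model is trained and scored five times
scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    scores[name] = fold_scores
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Comparing the mean and spread across folds gives a better sense of whether one model is really ahead of the other.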

# Evaluating and comparing predictions

From our results we can conclude that the random forest classifier is more accurate than the decision tree classifier on the validation set, scoring 100%! The decision tree classifier is also very successful, with 98% accuracy.
The random forest likely did better because it combines the predictions of many decision trees, whereas the decision tree classifier relies on a single tree.
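
To make the "many trees" idea concrete, the sketch below fits a small forest on the bundled iris data (a stand-in, with `n_estimators=25` chosen arbitrarily) and checks that the forest's class probabilities are the average of its individual trees' probabilities:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data set bundled with scikit-learn
X_demo, y_demo = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=1)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([t.predict_proba(X_demo) for t in forest.estimators_])

# Average the per-tree class probabilities across the 25 trees
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should match this average
same = np.allclose(averaged, forest.predict_proba(X_demo))
print(len(forest.estimators_), same)
```

Because each tree sees a different bootstrap sample and random subset of features, their individual mistakes tend to differ and partly cancel out in the average.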

# Hyperparameter tuning
To get the most accurate predictions from our models we have to tweak the hyperparameters a bit. For the decision tree classifier, the best validation accuracy needs a max depth of 5 or more. For the random forest, reaching 100% accuracy needs a max depth of 3 or more.
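
Depth thresholds like these can be found by sweeping `max_depth` and recording the validation accuracy at each value. The following is a sketch of that loop on the bundled iris data (a stand-in, so the numbers it prints say nothing about the star data):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be train_X, val_X, train_y, val_y
X_demo, y_demo = load_iris(return_X_y=True)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# Fit one tree per depth and store its validation accuracy
depth_scores = {}
for depth in range(1, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(tr_X, tr_y)
    depth_scores[depth] = accuracy_score(va_y, model.predict(va_X))

for depth, score in depth_scores.items():
    print(depth, round(score, 3))

# Smallest depth that reaches the best observed accuracy
best_depth = max(depth_scores, key=depth_scores.get)
print("best depth:", best_depth)
```

Fixing `random_state` on the tree keeps the sweep repeatable, so a change in accuracy can be attributed to the depth rather than to chance.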

# Conclusion
The purpose of this investigation was to predict star types from the given data, which included radius, colour, luminosity, etc. After working through the steps of machine learning we were able to create predictions based on the given data set. After testing the accuracy of the predictions we found some interesting results for the models. The decision tree classifier had an amazing accuracy score of 98%; however, the real mystery came from the random forest classifier, with 100% accuracy.
Getting 100% accuracy on the validation data is a real anomaly. I thought it would be a mistake in the code, but as I was changing the hyperparameters I noticed that the accuracy could go below 1.00, so maybe it really was 100% accurate? Maybe the different star types are so distinct from each other that, given the data for a star, its type is almost certain. This may explain the 100% accuracy, but who knows? To solve this mystery, further research is likely required.
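
One possible first step for that further research is to look past the single accuracy number: a confusion matrix shows how many validation rows there are per class and exactly which classes, if any, get mixed up. This is a sketch on the bundled iris data as a stand-in, with the same forest settings as above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data; in the notebook this would be the star features and types
X_demo, y_demo = load_iris(return_X_y=True)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(random_state=1, max_depth=3)
forest.fit(tr_X, tr_y)
preds = forest.predict(va_X)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(va_y, preds, labels=forest.classes_)
print(cm)

# The number of validation rows shows how much evidence the score rests on
print("validation rows:", len(va_y))
```

A perfect score shows up as a matrix with non-zero values only on the diagonal, and the row totals show how few or how many stars of each type the score was measured on.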