**The purpose of this notebook is to predict a star's type from its physical features.**

![](https://i.la-croix.com/729x486/smart/2019/08/02/1201038901/Cette-pratique-essorenviron-5-selon-lassociation-francaise-dastronomie-AFA-consiste-choisir-destination-vacances-fonction-proprete_0.jpg)

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/star-dataset/6 class csv.csv')
df.head()

In [None]:
df.columns = df.columns.str.replace(' ', '_') 
df.head()

In [None]:
df.info()

In [None]:
df.describe()

**OK, there is no missing data.**
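The absence of missing values can also be checked explicitly rather than read off `df.info()`. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in, not the star dataset) to show what the check reports when gaps do exist; in the notebook the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the star table; in the notebook, use df directly.
toy = pd.DataFrame({
    "Temperature_(K)": [3068, 3042, np.nan, 2800],
    "Star_color": ["Red", "Red", "Blue", None],
    "Star_type": [0, 0, 5, 1],
})

# Count missing values per column; a column of zeros means nothing to impute.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

If every count is zero, as the notebook reports for the star data, no imputation step is needed.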

In [None]:
df['Star_type'].value_counts()

**The classes appear well balanced. Now, let's look at how each feature relates to the star type.**

In [None]:
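# Correlations are only defined for numeric columns, so exclude the text columns first
sns.heatmap(data=df.select_dtypes(include='number').corr(), annot=True)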

In [None]:
sns.boxplot(x="Star_type", y="Temperature_(K)",
            palette=["darkorange", "red"],
            data=df)

In [None]:
sns.boxplot(x="Star_type", y="Luminosity(L/Lo)",
            palette=["lightyellow", "yellow"],
            data=df)

In [None]:
sns.boxplot(x="Star_type", y="Radius(R/Ro)",
            palette=["paleturquoise", "lightseagreen"],
            data=df)

In [None]:
sns.boxplot(x="Star_type", y="Absolute_magnitude(Mv)",
            palette=["aqua", "silver"],
            data=df)

**Each feature appears to vary with the star type, so we keep them all. We can now move on to machine learning.**
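The boxplots give a visual impression of feature relevance. One way to put a number on it is the impurity-based importance of a random forest. The sketch below is only an illustration on made-up data: the column names `signal` and `noise` and the `toy_X`/`toy_y` variables are hypothetical, chosen so that one feature determines the class and the other does not.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(18)
n = 300

# Hypothetical data: "signal" determines the class, "noise" is unrelated to it.
signal = rng.uniform(0, 3, size=n)
toy_X = pd.DataFrame({"signal": signal, "noise": rng.normal(size=n)})
toy_y = signal.astype(int)  # classes 0, 1, 2

forest = RandomForestClassifier(n_estimators=100, random_state=18).fit(toy_X, toy_y)

# Importances sum to 1; larger values mean the feature was more useful for splitting.
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the star data, the same two lines after fitting (with `X.columns` as the index) would rank the real features. Impurity-based importances can be biased toward features with many distinct values, so they are best read as a rough guide.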

In [None]:
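from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Encode every text column (e.g. star colour, spectral class) as integer codes.
# The encoder is refitted on each column, so the codes are column-specific.
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = le.fit_transform(df[col])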

In [None]:
from sklearn.model_selection import train_test_split

X=df.drop('Star_type',axis=1)
y=df['Star_type']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=18)

In [None]:
X

In [None]:
y

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

**LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression

lr= LogisticRegression()

lr.fit(rescaledX_train,y_train)

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = lr.predict(rescaledX_test)
print("Accuracy of logistic regression classifier: ", lr.score(rescaledX_test,y_test))
confusion_matrix(y_test,y_pred)

**RANDOM FOREST CLASSIFIER**

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()

In [None]:
rfc.fit(rescaledX_train,y_train)

In [None]:
y_pred = rfc.predict(rescaledX_test)

print("Accuracy of random forest classifier: ", rfc.score(rescaledX_test,y_test))

confusion_matrix(y_test,y_pred)

In [None]:
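from sklearn.model_selection import cross_val_score

# Cross-validate a fresh, unfitted forest on the training set.
# Tree-based models do not depend on feature scaling, so the unscaled X_train is fine here.
model = RandomForestClassifier()
cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')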

**The random forest classifier performs satisfactorily, so we select it.**
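To back up a choice between the two models, both can be scored on the same cross-validation folds. The sketch below is a minimal illustration on synthetic data from `make_classification` (the names `X_demo`, `y_demo` and `candidates` are hypothetical, and the scores say nothing about the star dataset). The logistic regression is wrapped in a pipeline so that the scaler is refitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 6-class problem standing in for the star data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=240, n_features=6, n_informative=4, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=18,
)

# Same stratified folds for both models, so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=18)

candidates = {
    "logistic regression": make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(random_state=18),
}

cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In the notebook, `X_train` and `y_train` would replace the synthetic arrays. Comparing mean and spread across folds is less sensitive to one lucky split than a single test-set accuracy.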