# sklearn module

[scikit-learn.org](https://scikit-learn.org/stable/index.html)

*scikit-learn*

[Model Evaluation](https://www.geeksforgeeks.org/machine-learning-model-evaluation/?ref=ml_lbp)

The selected kernel is the virtual environment (upper-right corner of the notebook).

Getting `scikit-learn` installed

In [None]:
# %pip install seaborn
%pip install scikit-learn

Importing the dataset and splitting it into features (`X`) and target (`y`).

In [None]:
import pandas as pd

filepath = '../datasets/dataset_train.csv'
df = pd.read_csv(filepath)
y = df['Hogwarts House']

excluded_features = ["Arithmancy",
                     "Defense Against the Dark Arts",
                     "Care of Magical Creatures"]
X = df[df.columns[6:]].drop(excluded_features, axis=1)
X.head()

Splitting our dataset into a training set and a test set.
With `test_size=0.20`, 20% of the rows are randomly selected for testing (prediction),
while the remaining 80% are used for training.

20% of 1600 rows = 320 rows

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=20,
                                                    test_size=0.20)
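As a quick sanity check of the split sizes, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made-up stand-ins with 1600 rows, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1600 rows, 10 features, 4 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1600, 10))
y_demo = rng.integers(0, 4, size=1600)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=20, test_size=0.20
)

# 80% of the rows go to training, 20% to testing.
print(X_tr.shape, X_te.shape)
```

`train_test_split` also accepts `stratify=y`, which keeps the class proportions the same in both parts; whether that is needed here depends on how balanced the houses are.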

## Decision Tree Classifier model

Training a Decision Tree Classifier model on the training dataset 

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Named tree_model so it does not shadow the sklearn.tree module.
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

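TP, TN: true positives, true negatives
FP, FN: false positives, false negatives

Accuracy = correct predictions / total number of predictions
Accuracy = (TP+TN)/(TP+TN+FP+FN)

Drawback: accuracy can be misleading when the class labels are imbalanced, because a model that always predicts the majority class still scores high.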

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

print("Accuracy:", accuracy_score(y_test, 
                                  y_pred)) 
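The drawback of accuracy on imbalanced labels can be illustrated with a small sketch. The labels below are hypothetical (90 samples of class "A", 10 of class "B") and the "model" simply predicts "A" every time.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90 x "A", 10 x "B".
y_true = ["A"] * 90 + ["B"] * 10
# A useless model that always predicts the majority class.
y_always_a = ["A"] * 100

acc = accuracy_score(y_true, y_always_a)
recall_b = recall_score(y_true, y_always_a, pos_label="B")

print("Accuracy:", acc)               # high, although class "B" is never found
print("Recall for class B:", recall_b)
```

Accuracy alone hides the fact that no "B" sample is ever predicted, which is why precision, recall and F1 are reported as well.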

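Precision: true positives / (true positives + false positives), i.e. the share of predicted positives that are correct.
Precision = TP/(TP+FP)

With more than two classes, `average="weighted"` computes the score for each class and averages the results, weighted by the number of true samples in each class.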

In [None]:
print("Precision:", precision_score(y_test, 
                                    y_pred, 
                                    average="weighted")) 

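Recall: true positives / (true positives + false negatives), i.e. the share of actual positives that are found.
Recall = TP/(TP+FN)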

In [None]:
print('Recall:', recall_score(y_test, 
                              y_pred, 
                              average="weighted")) 

The F1 score is the harmonic mean of precision and recall.
F1 score = (2×Precision×Recall)/(Precision+Recall)

In [None]:
print('F1 score:', f1_score(y_test, y_pred, 
                            average="weighted")) 
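To make `average="weighted"` concrete, the sketch below recomputes the weighted precision by hand on a tiny made-up example (the labels "G", "H", "S" are illustrative only) and compares it with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Tiny hypothetical multiclass example.
y_true = ["G", "G", "G", "H", "H", "S"]
y_hat = ["G", "G", "H", "H", "S", "S"]

# Per-class precision (average=None), classes in sorted order: G, H, S.
per_class_precision = precision_score(y_true, y_hat, average=None)

# Support = number of true samples per class, in the same sorted order.
classes, support = np.unique(y_true, return_counts=True)

# Weighted average computed by hand vs. scikit-learn's built-in option.
manual_weighted = np.average(per_class_precision, weights=support)
sk_weighted = precision_score(y_true, y_hat, average="weighted")

weighted_recall = recall_score(y_true, y_hat, average="weighted")
weighted_f1 = f1_score(y_true, y_hat, average="weighted")

print(per_class_precision, support)
print(manual_weighted, sk_weighted)
print(weighted_recall, weighted_f1)
```

Note that the weighted F1 is the support-weighted mean of the per-class F1 scores, so it generally differs from the harmonic mean of the weighted precision and the weighted recall.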

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matx = confusion_matrix(y_test, y_pred) 
confusion_matx
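A raw confusion matrix is easier to read with class names attached. The sketch below uses a handful of made-up labels, fixes the class order with the `labels` argument, and wraps the result in a DataFrame; rows are true classes and columns are predicted classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]

# Hypothetical true and predicted labels, for illustration only.
y_true = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin", "Ravenclaw", "Hufflepuff"]
y_hat = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin", "Hufflepuff", "Hufflepuff"]

# labels= fixes the row/column order explicitly.
cm = confusion_matrix(y_true, y_hat, labels=houses)
cm_df = pd.DataFrame(cm, index=houses, columns=houses)
print(cm_df)

# Per-class recall: diagonal (correct predictions) divided by row sums (true counts).
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```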


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Class distribution of the test labels (pass the Series as x, not as data).
sns.countplot(x=y_test)
plt.show()

In [None]:
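import seaborn as sns
import matplotlib.pyplot as plt

# confusion_matrix sorts the class labels, which matches this order
# (assuming all four houses occur in the test labels).
houses = { 0: 'Gryffindor', 1: 'Hufflepuff', 2: 'Ravenclaw', 3: 'Slytherin'}
labels = list(houses.values())
heat = sns.heatmap(data=confusion_matx, annot=True, fmt="d", cmap="plasma",
                   xticklabels=labels, yticklabels=labels)
heat.set(xlabel="Predicted house", ylabel="True house")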

## Logistic regression


[sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score 

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)
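`LogisticRegression` does not accept missing values and tends to converge more slowly on unscaled features. If the feature columns contain NaNs or very different scales (an assumption; the notebook does not say), one option is to wrap the model in a pipeline. The following is a sketch on synthetic data, with mean imputation and `max_iter=1000` chosen arbitrarily for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 100.0, 0.01]
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[::10, 1] = np.nan  # inject missing values into one column

# Impute NaNs with the column mean, standardize, then fit the classifier.
clf = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_demo, y_demo)

train_acc = clf.score(X_demo, y_demo)
print("Training accuracy:", train_acc)
```

Because the imputer and scaler are fitted inside the pipeline, calling `clf.predict` on new data applies the same statistics learned from the training data.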

In [None]:
logistic_model.predict_proba(X_train)
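`predict_proba` returns one row per sample and one column per class, in the order given by the model's `classes_` attribute. A minimal sketch on synthetic data (random features and three made-up house labels) shows how to read the output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 120 samples, 2 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2))
names = np.array(["Gryffindor", "Ravenclaw", "Slytherin"])
y_demo = names[rng.integers(0, 3, size=120)]

model = LogisticRegression().fit(X_demo, y_demo)
proba = model.predict_proba(X_demo[:5])

print(model.classes_)      # column order of the probability matrix
print(proba.sum(axis=1))   # each row sums to 1

# The predicted class is the column with the highest probability.
pred_from_proba = model.classes_[proba.argmax(axis=1)]
print(pred_from_proba)
```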

In [None]:
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Accuracy: {accuracy:.2f}%")

confusion_mat = confusion_matrix(y_test, y_pred)
confusion_mat