# My first `scikit-learn` notebook

In [None]:
import pandas as pd
import numpy as np
from random import choices
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

Load a dataset

In [None]:
forecast = pd.read_csv('data/Forecast.csv')
forecast.head()

Set up the `numpy` arrays used to train the classifiers

In [None]:
y = forecast.pop('Go-Out').values  # target feature
X = forecast.values                # training data
type(X),type(y)

Train a *k*-NN classifier

In [None]:
kNN = KNeighborsClassifier(n_neighbors=3) 
kNN.fit(X,y)


Set up sample test data and use it for prediction

In [None]:
X_test = np.array([[8,70,11],
                   [8,69,15]])
kNN.predict(X_test)
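
To make the prediction step concrete, here is a minimal sketch of what a 3-nearest-neighbour classifier does: measure the distance from the query to every training point, keep the 3 closest, and take a majority vote. The toy data and the helper `knn_predict_one` are invented for illustration (they are not part of the Forecast dataset or of `sklearn`), and plain Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Toy data invented for illustration (not the Forecast dataset)
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [7.0, 7.5], [6.5, 5.0]])
y_toy = np.array(['no', 'no', 'no', 'yes', 'yes', 'yes'])

def knn_predict_one(X_train, y_train, query, k=3):
    """Hypothetical helper: majority vote among the k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest])                     # count the labels among those neighbours
    return votes.most_common(1)[0][0]

query_points = np.array([[1.2, 1.4], [6.2, 6.1]])
manual = [knn_predict_one(X_toy, y_toy, q) for q in query_points]

# The same query answered by scikit-learn's classifier
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
print(manual)
print(toy_knn.predict(query_points))
```

On this toy data both approaches should agree, since each query point sits clearly inside one of the two clusters.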

All `sklearn` classifiers implement the `Estimator` API, so each one is trained with `fit` and queried with `predict` in the same way.

In [None]:
tree = DecisionTreeClassifier()
tree.fit(X,y)
tree.predict(X_test)

In [None]:
lr = LogisticRegression()
lr.fit(X,y)
lr.predict(X_test)

Swapping between classifiers (Estimators) makes model selection easy.  
Note that the classifiers can give different predictions for the same test examples.

In [None]:
cfrs = [kNN,tree,lr]
for cfr in cfrs:
    cfr.fit(X,y)
    print(cfr.predict(X_test))
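
Because every classifier shares the same interface, the loop above extends naturally to scoring each model. The sketch below is one possible approach, not the notebook's own method: it uses `cross_val_score` with 5 folds on synthetic data from `make_classification`, which stands in for a real dataset. The names `X_demo`, `y_demo`, `candidates` and `mean_scores` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

candidates = {
    'kNN': KNeighborsClassifier(n_neighbors=3),
    'tree': DecisionTreeClassifier(random_state=0),
    'lr': LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, model in candidates.items():
    # cross_val_score fits a fresh copy of the model on each of the 5 train/validation splits
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

Averaging over several folds gives a steadier comparison than looking at predictions for two hand-written test rows.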

## Preprocessing
All preprocessing modules implement the `Transformer` API: `fit` learns parameters from the training data and `transform` applies them to any data.

In [None]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)   # standardise to zero mean and unit variance
X_scaled = scaler.transform(X)
X_test_scaled = scaler.transform(X_test)
X_test_scaled

In [None]:
mm_scaler = preprocessing.MinMaxScaler()        # rescale each feature to the range [0, 1]
mm_scaler.fit(X)
X_scaled = mm_scaler.transform(X)
X_test_scaled = mm_scaler.transform(X_test)
X_test_scaled
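
To show what the two scalers compute, this sketch reproduces them by hand on a tiny made-up array and checks the result against `sklearn`. `X_small` and `X_new` are invented values, not taken from the Forecast data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up training rows and one unseen row
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])
X_new = np.array([[4.0, 15.0]])

# StandardScaler: subtract the training mean, divide by the training standard deviation
std_scaler = StandardScaler().fit(X_small)
manual_std = (X_new - X_small.mean(axis=0)) / X_small.std(axis=0)  # np.std uses ddof=0 by default

# MinMaxScaler: map the training minimum to 0 and the training maximum to 1
mm = MinMaxScaler().fit(X_small)
col_min, col_max = X_small.min(axis=0), X_small.max(axis=0)
manual_mm = (X_new - col_min) / (col_max - col_min)

print(np.allclose(std_scaler.transform(X_new), manual_std))
print(np.allclose(mm.transform(X_new), manual_mm))
print(mm.transform(X_new))
```

The last line prints a first value of 1.5: the scaler is fitted on the training data only, so unseen rows can land outside [0, 1].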

# Try It Yourself

Using the `penguin_size` dataset, experiment with some of the different models available in *scikit-learn*. Some models you can try are:

* [Decision Trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [KNN Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

You can try each of the algorithms with and without scalers, and explore the parameters outlined in the scikit-learn documentation for each to see what impact they have on the results.
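
As a starting point, here is a hedged sketch of one way to run such a comparison. It is not tied to the penguin data: it uses synthetic data from `make_classification`, with one feature deliberately inflated so that scaling has something to do. `GaussianNB` is one of the Naive Bayes variants in `sklearn`, and `make_pipeline` chains a scaler and a classifier into a single estimator. The variable names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=4, random_state=1)
X_syn[:, 0] *= 1000  # exaggerate one feature's scale
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)

models = {
    'GaussianNB': GaussianNB(),
    'kNN (unscaled)': KNeighborsClassifier(n_neighbors=3),
    # the pipeline fits the scaler on the training data and reuses it at predict time
    'kNN (scaled)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.3f}")
```

Wrapping the scaler in a pipeline means the same preprocessing is applied to training and test data automatically.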


In [None]:
from sklearn.metrics import accuracy_score

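def encode_features(df):
    """
    Some models (such as the decision tree, for example) don't work with categorical (string) data. This function
    goes through each non-numeric column in the dataframe and uses a label encoder to convert the categories to integers.
    For example, `Gentoo`, `Emperor`, `Chinstrap` as penguin species would get replaced with 2, 1, 0
    (codes follow the sorted order of the labels). Numeric columns are left unchanged.

    Each call fits its own encoders, so the training and test data get matching codes only if they
    contain the same set of categories.

    We'll talk more about label encoding and other things to watch out for as the module progresses.
    """
    for col in df.select_dtypes(exclude='number').columns:
        # use a fresh encoder for each column so the mappings stay independent
        df[col] = preprocessing.LabelEncoder().fit_transform(df[col])
    return df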

penguins_train = pd.read_csv('data/penguins_train.csv')
penguins_test = pd.read_csv('data/penguins_test.csv')


# Preprocessing goes here. Make sure that any preprocessing done to the training data is also done to the test data

penguins_train = encode_features(penguins_train)
penguins_test = encode_features(penguins_test)


y_train = penguins_train.pop('species')
X_train = penguins_train.values

y_test = penguins_test.pop('species')
X_test = penguins_test.values

# Create a classifier here.
# Make sure you `fit` the classifier on the training data before you try to predict.

# Placeholder: replace this with the array of predictions returned by your classifier's predict(X_test)
y_pred = []


# A handy way to measure the accuracy of your classifier which compares actual targets against predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy is {accuracy}")
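
The reminder to apply the same preprocessing to training and test data matters for label encoding in particular. The sketch below uses two small made-up frames (the `island` values are for illustration only) to show how codes can drift when an encoder is fitted separately on each set, and how fitting once on the training data avoids that.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames invented for illustration; the test set lacks one category seen in training
train = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Torgersen', 'Dream']})
test = pd.DataFrame({'island': ['Torgersen', 'Dream']})

# Fitting separately: codes depend on which categories each frame happens to contain
separate_train = LabelEncoder().fit_transform(train['island'])
separate_test = LabelEncoder().fit_transform(test['island'])

# Fitting once on the training data and reusing the encoder keeps the codes aligned
encoder = LabelEncoder().fit(train['island'])
shared_train = encoder.transform(train['island'])
shared_test = encoder.transform(test['island'])

print(list(encoder.classes_))
print(list(separate_test))   # 'Torgersen' -> 1, 'Dream' -> 0
print(list(shared_test))     # 'Torgersen' -> 2, 'Dream' -> 1, matching the training codes
```

Note that `encoder.transform` raises a `ValueError` if the test data contains a label that was not present when the encoder was fitted.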