##### <img src="../SDSS-Logo.png" style="display:inline; width:500px" />


## Learning Objectives
1. Learn about the scikit-learn Python library



### scikit-learn is a library for Python that provides a large number of functions for machine learning.
### scikit-learn is built on top of NumPy and takes advantage of NumPy's fast array math.

### scikit-learn provides solutions in a number of machine learning areas


|Area|
|----------|
|Classification|
|Regression|
|Clustering|
|Dimensionality Reduction|
|Model selection|
|Pre-processing|
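
As a rough orientation, each area in the table corresponds to one or more scikit-learn modules. The sketch below picks one representative class per area; the particular choices (and the `examples` dictionary) are illustrative assumptions, not a recommendation from this lesson.

```python
# One representative class per area (illustrative picks, not an exhaustive list)
from sklearn.linear_model import LogisticRegression   # Classification
from sklearn.linear_model import LinearRegression     # Regression
from sklearn.cluster import KMeans                    # Clustering
from sklearn.decomposition import PCA                 # Dimensionality reduction
from sklearn.model_selection import GridSearchCV      # Model selection
from sklearn.preprocessing import StandardScaler      # Pre-processing

examples = {
    "Classification": LogisticRegression,
    "Regression": LinearRegression,
    "Clustering": KMeans,
    "Dimensionality Reduction": PCA,
    "Model selection": GridSearchCV,
    "Pre-processing": StandardScaler,
}

# Print the public submodule and class name for each area
for area, cls in examples.items():
    submodule = cls.__module__.split(".")[1]
    print(f"{area:26s} sklearn.{submodule}.{cls.__name__}")
```

All of these classes follow the same pattern: create the object, then call `fit` on the data.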



### scikit-learn example
### Use a random forest model on the Wisconsin breast cancer dataset

* This lesson draws heavily from [Sebastian Raschka's book](https://sebastianraschka.com/blog/2022/ml-pytorch-book.html) and [James, Witten, Hastie and Tibshirani's book](https://www.statlearning.com/).
* The [API reference](https://scikit-learn.org/stable/modules/classes.html#api-ref) for sklearn is also very useful for seeing the list of available functions.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Load the Wisconsin breast cancer dataset.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
bc_data = load_breast_cancer()
type(bc_data)

### Look into the `Bunch` object to see what it contains.
* scikit-learn documentation for the [`Bunch` object](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html).

In [None]:
dir(bc_data)

In [None]:
bc_data.feature_names

In [None]:
list(bc_data.target_names)
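
To see these pieces together, the `Bunch` fields can be assembled into a pandas DataFrame. This is a minimal sketch; the names `df`, `counts`, and the `diagnosis` column are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

bc_data = load_breast_cancer()

# One row per sample, one column per feature
df = pd.DataFrame(bc_data.data, columns=bc_data.feature_names)

# Map the integer targets (0/1) to their class names
df["diagnosis"] = bc_data.target_names[bc_data.target]

print(df.shape)
counts = df["diagnosis"].value_counts()
print(counts)
```

Note that in this dataset the target value 0 means malignant and 1 means benign.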

### Divide the dataset into a train and test split with 20% test data.
### Make sure to maintain the positive/negative ratio in the train and test sets (`stratify=y`).

In [None]:
X = bc_data.data
y = bc_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1, stratify=y)
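
A quick way to confirm that `stratify=y` preserved the class balance is to compare the fraction of each class in the two splits. A minimal sketch, repeating the split so that it runs on its own (`train_frac` and `test_frac` are names introduced here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# np.bincount counts the samples in each class (0 and 1)
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)

print("train:", train_frac.round(3))
print("test: ", test_frac.round(3))
```

The two rows should be nearly identical; without `stratify`, the fractions can drift apart, especially on small datasets.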

### Use the sklearn `MinMaxScaler` to scale the features to be between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
mm.fit(X_train)
X_train_std_mm = mm.transform(X_train)
# Scale the test data with the scaler fitted on the training data,
# so both sets share the same per-feature min and max
X_test_std_mm = mm.transform(X_test)
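
For reference, min-max scaling maps each feature with x' = (x - min) / (max - min), where min and max are taken per column from the data passed to `fit`. The sketch below checks this by hand on a small made-up array (`X_toy` is not part of the lesson data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration): 3 samples, 2 features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 40.0]])

scaler = MinMaxScaler().fit(X_toy)
X_scaled = scaler.transform(X_toy)

# Same computation by hand, column by column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

The fitted scaler stores the per-feature minimum and maximum in `data_min_` and `data_max_`, and `transform` reuses those stored values on whatever data it is given.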

### Create a Random Forest Classifier using [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier) to classify the dataset

In [None]:
from sklearn.ensemble import RandomForestClassifier

feat_labels = bc_data.feature_names

forest = RandomForestClassifier(n_estimators=100,
                                criterion="entropy",
                                random_state=1)

forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
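
One possible follow-up on the fitted forest is to check its accuracy and rank the features by importance. This is a sketch under the same settings as above, repeated so that it runs on its own; `train_acc`, `test_acc`, `importances`, and `order` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

bc_data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc_data.data, bc_data.target, test_size=0.2, random_state=1,
    stratify=bc_data.target)

forest = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                random_state=1)
forest.fit(X_train, y_train)

# Fraction of correctly classified samples in each split
train_acc = accuracy_score(y_train, forest.predict(X_train))
test_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")

# Impurity-based importances, one per feature, normalized to sum to 1
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]  # most important first
for i in order[:5]:
    print(f"{bc_data.feature_names[i]:25s} {importances[i]:.3f}")
```

A large gap between training and test accuracy would suggest overfitting. Tree-based models split on thresholds of individual features, so they do not depend on feature scaling.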


### Compute the confusion matrix for this data.
* [`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

In [None]:
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_true=y_test, y_pred=y_test_pred)
print(confmat)
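
To read the matrix: rows are true labels and columns are predicted labels, both in sorted label order. The sketch below unpacks the four cells and derives precision and recall from them, assuming class 1 (benign in this dataset) is treated as the positive class. It repeats the setup so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

forest = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                random_state=1)
forest.fit(X_train, y_train)
y_test_pred = forest.predict(X_test)

confmat = confusion_matrix(y_true=y_test, y_pred=y_test_pred)

# For binary labels 0/1, ravel() yields the cells in this order
tn, fp, fn, tp = confmat.ravel()

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many were found
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
```

The same numbers are available from `precision_score` and `recall_score`; passing `pos_label=0` to those functions treats malignant as the positive class instead.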