<a href="https://colab.research.google.com/github/redframelbx/datascience/blob/main/machine%20learning/ML4DS_SvML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><img src="https://lh3.googleusercontent.com/drive-viewer/AITFw-w3wIpyrbycg-wmuThEMA0kKfsNLaRX59iSQvjQawHZtoXIO3DfiR3GDf8YpHNjvBRnbQhMmgmIlzbbQB8QuZWRfvgkVw=s1600" width=800px></center>


# Supervised Machine Learning (Classification)
<p>Classification is the process of finding a model that can describe and distinguish data classes and concepts. The process typically involves training and testing with labelled data (data whose class is known).</p>

<center><img src="https://lh5.googleusercontent.com/aGi5M4wtDWkQFGvw3ukvWmo5TEvd4kXOjjAE3hILhlYpiGKo0bgvTM96GfaMWr57hZo=w2400" width=600px></center>
<br>

In this notebook, we will use the `scikit-learn` Python library to
- implement different machine learning methods for classification,
- generate performance measurement reports,
- load built-in datasets,
- perform cross-validation and train-test data splits.

Make sure the library is installed. If it is not, install it using the following command: <br>

`pip install scikit-learn`

<hr>

## <span style='color:blue'>Case: Classification of Iris flower species</span>
<img src='https://storage.googleapis.com/kaggle-datasets-images/17860/23404/efadfebe925588a27d94d61be1d376d3/dataset-cover.jpg?t=2018-03-22-16-10-55'>

The Iris dataset is bundled with the `scikit-learn` library.

### <span style='color:blue'>To import the dataset:</span>
1. Load the `datasets` module from the `scikit-learn` library using the command: `from sklearn import datasets`
2. Load the iris data set using the `load_iris()` function and assign to a variable called `iris`


In [2]:
# your code here to load the iris dataset according to the instruction above
from sklearn import datasets
iris = datasets.load_iris()


### <span style='color:blue'>To explore the dataset:</span>
1. You can use:
<br>`dir(iris)` to list the attributes of the iris dataset.<br> `iris.data.shape` to show the shape of the data (number of samples, number of features).<br>
`iris.target_names` to show the classes that we want to predict.<br>
`iris.feature_names` to show the names of the features used for training.<br>
`iris.data` to access the actual data values.<br>
`iris.target` to access the target label of each data sample.

In [3]:
# your code here
dir(iris)



['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [4]:
iris.data.shape

(150, 4)

In [5]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [6]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [7]:
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [8]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### <span style='color:blue'>To prepare the data</span>
1. Create a variable `data` to store the iris data.
2. Create a variable `target` to store the iris target labels.

In [None]:
# your code here



In [9]:
data = iris.data
target = iris.target

### <span style='color:blue'>To do data splitting</span>
1. Load the `model_selection` module from the `scikit-learn` library
2. Then split the data into train and test sets using the `train_test_split()` function. This function returns 4 variables: train data, test data, train labels and test labels, in that order. Call it as follows: `model_selection.train_test_split(data, target, test_size=0.2, random_state=1)`, where `test_size` is the proportion of the data held out for testing (0.2 --> 20% test, 80% train) and `random_state` fixes the shuffling so that the split is reproducible.

In [15]:
# your code here
from sklearn import model_selection
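
The split described above can be sketched as follows. The variable names `train_data`, `test_data`, `train_labels` and `test_labels` are assumptions chosen for illustration; the notebook does not prescribe them.

```python
from sklearn import datasets, model_selection

# Load the data again so that this cell runs on its own
iris = datasets.load_iris()
data = iris.data
target = iris.target

# Hold out 20% of the samples for testing; random_state makes the split repeatable
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
    data, target, test_size=0.2, random_state=1
)

print(train_data.shape, test_data.shape)
print(train_labels.shape, test_labels.shape)
```

With 150 samples and `test_size=0.2`, 120 samples go to the train set and 30 to the test set.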


Now that the dataset is ready, we will go through a few supervised machine learning methods and create classification models to classify iris species.

<hr>

## <span style='color:darkred'>K-Nearest Neighbours (KNN)</span>

Steps:
1. Import the K-Nearest Neighbours classifier using the command: `from sklearn.neighbors import KNeighborsClassifier`


In [None]:
## Your code here



2. Initialize the model with the number of neighbors set to 3 and assign it to a variable. An odd value larger than 1 is a common choice because it reduces the chance of tied votes, although ties can still occur when there are more than two classes. Example: `knn_model = KNeighborsClassifier(n_neighbors=3)`

In [None]:
## Your code here


3. Train the model on the training set by calling the `fit()` function with the train data and train labels.

In [None]:
## Your code here


4. Test the trained model on the testing set by calling `predict()` with the test data. Store the prediction results in a variable.

In [None]:
## Your code here


5. Evaluate the predictions using the `metrics` module in the `scikit-learn` library. First load the module using `from sklearn import metrics`
6. Then use<br> `metrics.confusion_matrix(test_labels, prediction)` to generate the confusion matrix from the true test labels and the predictions. scikit-learn expects the true labels first, so rows correspond to true classes and columns to predicted classes. <br> `metrics.accuracy_score(test_labels, prediction)` to get the accuracy of the predictions.<br> ** Many other metrics are available; see https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
## Your code here
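
Steps 1 to 6 above can be put together as in the sketch below. It reloads and splits the data so that it runs on its own; the names `train_data`, `test_data`, `train_labels`, `test_labels`, `knn_prediction`, `knn_cm` and `knn_accuracy` are assumptions chosen for illustration.

```python
from sklearn import datasets, metrics, model_selection
from sklearn.neighbors import KNeighborsClassifier

# Prepare the data (same split settings as in the data splitting section)
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Initialize the model with 3 neighbors and train it on the training set
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(train_data, train_labels)

# Predict the classes of the unseen test samples
knn_prediction = knn_model.predict(test_data)

# Evaluate: true labels first, predictions second
knn_cm = metrics.confusion_matrix(test_labels, knn_prediction)
knn_accuracy = metrics.accuracy_score(test_labels, knn_prediction)
print(knn_cm)
print(knn_accuracy)
```

The diagonal of the confusion matrix counts the correctly classified samples, so the accuracy equals the sum of the diagonal divided by the number of test samples.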


<hr>

## <span style='color:darkred'>Decision Trees</span>

Steps:
1. Import the Decision Tree classifier using the command: `from sklearn.tree import DecisionTreeClassifier`
2. Initialize the model. Example: `dt_model = DecisionTreeClassifier(criterion='entropy', random_state=123)`
3. Train the model on the training set by calling the `fit()` function with the train data and train labels.
4. Test the trained model on the testing set by calling `predict()` with the test data. Store the prediction results in a variable.
5. Evaluate the predictions using the `metrics` module in the `scikit-learn` library. First load the module using `from sklearn import metrics`
6. Then use<br> `metrics.confusion_matrix(test_labels, prediction)` to generate the confusion matrix from the true test labels and the predictions. <br> `metrics.accuracy_score(test_labels, prediction)` to get the accuracy of the predictions.<br> ** Many other metrics are available; see https://scikit-learn.org/stable/modules/model_evaluation.html

| Parameters | Default | Description |
| -------- | -------- | -------- |
| `criterion` | 'gini' | Function used to measure the quality of a split. The 'entropy' option (used in the example above) is based on information theory, which quantifies the information in a message. <br>It quantifies the information in the data to decide how to split a node. |
| `min_samples_leaf` | 1 | Minimum number of samples required at a leaf node |
| `min_samples_split` | 2 | Minimum number of samples required to split an internal node |
| `splitter` | 'best' | Strategy used to choose the split at each node. 'best' chooses the best split, while 'random' chooses the best random split |
| `random_state` | `None` | Seed for the random number generator used by the model. Setting it to a fixed integer makes the results reproducible |


In [None]:
## Your code here
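
The Decision Tree steps above can be sketched as follows. The block reloads and splits the data so that it runs on its own; the names `train_data`, `test_data`, `train_labels`, `test_labels`, `dt_prediction`, `dt_cm` and `dt_accuracy` are assumptions chosen for illustration.

```python
from sklearn import datasets, metrics, model_selection
from sklearn.tree import DecisionTreeClassifier

# Prepare the data (same split settings as in the data splitting section)
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Initialize the tree with the entropy criterion and a fixed seed, then train it
dt_model = DecisionTreeClassifier(criterion="entropy", random_state=123)
dt_model.fit(train_data, train_labels)

# Predict and evaluate (true labels first, predictions second)
dt_prediction = dt_model.predict(test_data)
dt_cm = metrics.confusion_matrix(test_labels, dt_prediction)
dt_accuracy = metrics.accuracy_score(test_labels, dt_prediction)
print(dt_cm)
print(dt_accuracy)
```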



#### Visualizing the decision tree
A decision tree is one of the simplest models to visualize. The `plot_tree` function in `sklearn.tree`, together with the `matplotlib` library, draws the fitted tree.

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

def view_tree(classifier):
    fig, axes = plt.subplots(nrows=1,ncols=1,figsize=(4,4), dpi=150) #change dpi to resize image
    tree_view = plot_tree(classifier, feature_names=iris.feature_names,
              class_names=iris.target_names, ax=axes, filled=True)

In [None]:
## Try to use the view_tree function above with your decision tree model
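
As a complement to the plot, the sketch below prints the same kind of tree as plain text using `export_text` from `sklearn.tree`, which can be handy when no plotting backend is available. This is an alternative view, not a replacement for the `view_tree` exercise; fitting on the full dataset and the name `tree_text` are assumptions made to keep the example short.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# Fit a tree on the full dataset purely to have something to display
dt_model = DecisionTreeClassifier(criterion="entropy", random_state=123)
dt_model.fit(iris.data, iris.target)

# Render the split rules as indented text, one line per node
tree_text = export_text(dt_model, feature_names=iris.feature_names)
print(tree_text)
```

Each indented line shows a threshold test on one feature, and the leaf lines show the predicted class index.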



<hr>

## <span style='color:darkred'>Random Forest</span>
Steps:
1. Import the Random Forest classifier using the command: `from sklearn.ensemble import RandomForestClassifier`
2. Initialize the model. Example: `rf_model = RandomForestClassifier(n_estimators=100)`
3. Train the model on the training set by calling the `fit()` function with the train data and train labels.
4. Test the trained model on the testing set by calling `predict()` with the test data. Store the prediction results in a variable.
5. Evaluate the predictions using the `metrics` module in the `scikit-learn` library. First load the module using `from sklearn import metrics`
6. Then use<br> `metrics.confusion_matrix(test_labels, prediction)` to generate the confusion matrix from the true test labels and the predictions. <br> `metrics.accuracy_score(test_labels, prediction)` to get the accuracy of the predictions.<br> ** Many other metrics are available; see https://scikit-learn.org/stable/modules/model_evaluation.html


| Parameters | Default | Description |
| -------- | -------- | -------- |
| `bootstrap` | `True` | Whether bootstrap samples are used when building the trees. If `False`, the whole dataset is used to build each tree |
| `max_features` | 'sqrt' | Number of features to consider when looking for the best split. Older scikit-learn releases used 'auto' as the default, which meant the same as 'sqrt' for classifiers |
| `min_samples_leaf` | 1 | Minimum number of samples required at a leaf node |
| `min_samples_split` | 2 | Minimum number of samples required to split an internal node |
| `n_estimators` | 100 | Number of trees in the forest. The default was 10 before scikit-learn 0.22 |
| `verbose` | 0 | Controls how much progress information is printed when fitting and predicting |


In [None]:
## Your code here
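
The Random Forest steps above can be sketched as follows. The block reloads and splits the data so that it runs on its own. The `random_state=123` argument is an addition for reproducibility that the steps do not require, and the names `train_data`, `test_data`, `train_labels`, `test_labels`, `rf_prediction`, `rf_cm` and `rf_accuracy` are assumptions chosen for illustration.

```python
from sklearn import datasets, metrics, model_selection
from sklearn.ensemble import RandomForestClassifier

# Prepare the data (same split settings as in the data splitting section)
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Build a forest of 100 trees; random_state is an assumption added for repeatable results
rf_model = RandomForestClassifier(n_estimators=100, random_state=123)
rf_model.fit(train_data, train_labels)

# Predict and evaluate (true labels first, predictions second)
rf_prediction = rf_model.predict(test_data)
rf_cm = metrics.confusion_matrix(test_labels, rf_prediction)
rf_accuracy = metrics.accuracy_score(test_labels, rf_prediction)
print(rf_cm)
print(rf_accuracy)
```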



<hr>

## <span style='color:darkred'>Support Vector Machine</span>
Steps:
1. Import the SVM classifier using the command: `from sklearn.svm import SVC`
2. Initialize the model. Example: `svm_model = SVC(kernel='linear', gamma='auto')`. Note that `gamma` only affects the 'rbf', 'poly' and 'sigmoid' kernels, so it is ignored with a linear kernel.
3. Train the model on the training set by calling the `fit()` function with the train data and train labels.
4. Test the trained model on the testing set by calling `predict()` with the test data. Store the prediction results in a variable.
5. Evaluate the predictions using the `metrics` module in the `scikit-learn` library. First load the module using `from sklearn import metrics`
6. Then use<br> `metrics.confusion_matrix(test_labels, prediction)` to generate the confusion matrix from the true test labels and the predictions. <br> `metrics.accuracy_score(test_labels, prediction)` to get the accuracy of the predictions.<br> ** Many other metrics are available; see https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
## Your code here
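
The SVM steps above can be sketched as follows. The block reloads and splits the data so that it runs on its own; the names `train_data`, `test_data`, `train_labels`, `test_labels`, `svm_prediction`, `svm_cm` and `svm_accuracy` are assumptions chosen for illustration.

```python
from sklearn import datasets, metrics, model_selection
from sklearn.svm import SVC

# Prepare the data (same split settings as in the data splitting section)
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Linear-kernel SVM, initialized as in the example from step 2
svm_model = SVC(kernel="linear", gamma="auto")
svm_model.fit(train_data, train_labels)

# Predict and evaluate (true labels first, predictions second)
svm_prediction = svm_model.predict(test_data)
svm_cm = metrics.confusion_matrix(test_labels, svm_prediction)
svm_accuracy = metrics.accuracy_score(test_labels, svm_prediction)
print(svm_cm)
print(svm_accuracy)
```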



SVM can use the 'kernel trick' to handle high-dimensional and non-linear data. Several kernel types are available:
- Linear kernel with `kernel='linear'`
- Radial basis function (RBF) kernel with `kernel='rbf'`
- Sigmoid kernel with `kernel='sigmoid'`
- Polynomial kernel with `kernel='poly'`

In [None]:
## Try train model with different kernel
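
One possible way to compare the kernels listed above is to train one model per kernel in a loop and record the test accuracy of each. This is a minimal sketch: all other `SVC` parameters are left at their defaults, and the names `train_data`, `test_data`, `train_labels`, `test_labels` and `kernel_accuracy` are assumptions chosen for illustration.

```python
from sklearn import datasets, model_selection
from sklearn.svm import SVC

# Prepare the data (same split settings as in the data splitting section)
iris = datasets.load_iris()
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Train one SVM per kernel and store its accuracy on the test set
kernel_accuracy = {}
for kernel in ["linear", "rbf", "sigmoid", "poly"]:
    svm_model = SVC(kernel=kernel)
    svm_model.fit(train_data, train_labels)
    # score() returns the mean accuracy on the given data and labels
    kernel_accuracy[kernel] = svm_model.score(test_data, test_labels)

for kernel, accuracy in kernel_accuracy.items():
    print(f"{kernel}: {accuracy:.3f}")
```

A single train-test split gives only a rough comparison, so differences between kernels on a dataset this small should be read with caution.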



<center><span style="color:#510104">© 2023 UTM Big Data Centre. All Rights Reserved</span></center>