# 6 Supervised Learning with categorical output

- A typical classification problem has 2, 3, 4 (or more) output classes.
- Most of the time the output is binary (male/female, spam/no spam, yes/no).
- Sometimes there are more than two output classes: dog/cat/mouse, red/green/yellow.

In this category, we are going to use 2 existing datasets from [sklearn](https://scikit-learn.org/stable/datasets.html):
- [Breast Cancer Wisconsin](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset) data for binary output
- [Iris plant](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset) data for multiple (3) outputs.


## 6.1 Logistic Regression for binary output

- Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).
- Typical binary classification: True/False, Yes/No, Pass/Fail, Spam/No Spam, Male/Female
- Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.
- The standard logistic function has the form:

![image](https://user-images.githubusercontent.com/43855029/114233181-f7dcbb80-994a-11eb-9c89-58d7802d6b49.png)

![image](https://user-images.githubusercontent.com/43855029/114233189-fb704280-994a-11eb-9019-8355f5337b37.png)
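
As a minimal sketch of the logistic function, the snippet below evaluates `1 / (1 + exp(-z))` for a few inputs. The function name `logistic` and the sample inputs are chosen here for illustration only.

```python
import math

def logistic(z):
    """Standard logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# The output always lies strictly between 0 and 1 and equals 0.5 at z = 0,
# which is why it can be read as the probability of the positive class.
for z in (-4, 0, 4):
    print(z, round(logistic(z), 3))
```

Large negative inputs map close to 0 and large positive inputs map close to 1, so a threshold of 0.5 on the output corresponds to a threshold of 0 on the linear predictor.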

In this example, we load a sample dataset called [Breast Cancer Wisconsin](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset).


### Load Breast Cancer Wisconsin data

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
y = data.target
print("There are", X.shape[1], " Predictors: ", data.feature_names)
print("The output has 2 values: ", data.target_names)
print("Total size of data is ", X.shape[0], " rows")

We can see that there are 30 predictors describing the shape and size of 569 tumours.
Based on these, each tumour is classified as _malignant_ or _benign_ (encoded as 0 or 1, respectively).

### Partitioning Data to train/test:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=123)

### Train model using Logistic Regression
For simplicity, we will use all of the predictors for the regression:

In [None]:
from sklearn.linear_model import LogisticRegression
model_LogReg = LogisticRegression(solver='newton-cg').fit(X_train, y_train)

### Evaluate model output:

In [None]:
y_pred = model_LogReg.predict(X_test)

from sklearn import metrics
print("The accuracy score is %1.3f" % metrics.accuracy_score(y_test,y_pred))

We obtain an **accuracy of 0.965** using all predictors.

### Compute AUC-ROC and plot curve

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import numpy as np

lr_probs = model_LogReg.predict_proba(X_test)
# generate a no skill prediction (majority class)
ns_probs = np.zeros(len(y_test))

In [None]:
# calculate scores
ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs[:,1])
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))

In [None]:
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs[:,1])
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()

![image](https://user-images.githubusercontent.com/43855029/153662934-d4c5929f-72cf-43b8-8b1f-085d315022e7.png)
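
To see what `roc_curve` computes, the sketch below derives single points of the ROC curve by hand: for one threshold, it counts true/false positives and negatives and returns the false positive rate and true positive rate. The helper `roc_point` and the toy labels and scores are hypothetical and not part of sklearn.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Sweeping the threshold from high to low traces the ROC curve
for threshold in (1.0, 0.5, 0.0):
    print(threshold, roc_point(y_true, scores, threshold))
```

A model whose scores carry no information about the labels traces the diagonal, which corresponds to AUC = 0.5 (the "No Skill" line in the plot).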




An alternative way to plot the AUC-ROC curve is to use the additional toolbox ["scikit-plot"](https://scikit-plot.readthedocs.io/en/stable/), which can be installed with:
```
pip install scikit-plot
```

With this library the code is shorter:

In [None]:
import scikitplot as skplt
skplt.metrics.plot_roc(y_test, lr_probs)
plt.show()

![image](https://user-images.githubusercontent.com/43855029/153663219-f27aad2b-b76d-4abf-a093-0a433e79bd28.png)


## 6.2 Classification problem with more than 2 outputs

Here we use the [Iris plant](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset) data, which has multiple (3) output classes.

### Import data

In [None]:
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target

print("There are", X.shape[1], " Predictors: ", data.feature_names)
print("The output has 3 values: ", data.target_names)
print("Total size of data is ", X.shape[0], " rows")

- We can see that there are 4 predictors representing the petal/sepal width and length of 3 different kinds of iris flowers.
- Based on these, the iris plants can be classified as 'setosa', 'versicolor' or 'virginica'.

### Partitioning Data to train/test:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=123)

### Train model using Linear Discriminant Analysis (LDA)
For simplicity, we will use all of the predictors for the model:

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model_LDA = LinearDiscriminantAnalysis().fit(X_train,y_train)

### Evaluate model output:

In [None]:
print("The accuracy score is %1.3f" % model_LDA.score(X_test,y_test))


### LDA can be used for both binary and multi-class output

Exercise: create an LDA model to predict the Breast Cancer Wisconsin data

## 6.3 Other Algorithms

There are many other algorithms that work well for both classification and regression problems, such as Decision Tree, Random Forest, Bagging/Boosting.
Similar to chapter 5, the corresponding classifier classes can be imported as follows:


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

Exercise: create a Random Forest model to predict the iris flower data using the same method.

# 7 Principal Component Analysis
- Handy with large data sets that have many variables
- When many variables correlate with one another, they all contribute strongly to the same principal component
- Each principal component accounts for a certain percentage of the total variation in the data set
- The more principal components are kept, the more of the original data set's variation is retained

## 7.1 PCA formulation
- For example, we have 3 variables: `X, Y, Z`
- We need to compute the covariance matrix **M** of the 3 variables:

![image](https://user-images.githubusercontent.com/43855029/114459677-d67c0980-9bae-11eb-85b2-758a98f0cd29.png)

in which the covariance between 2 variables can be computed as:

![image](https://user-images.githubusercontent.com/43855029/114459740-ea277000-9bae-11eb-9259-8ef1b233c0fa.png)

- For the Covariance matrix **M**, we will find **m** eigenvectors and **m** eigenvalues

```
- Given an m x m covariance matrix (which is symmetric), we can find m eigenvalues and m orthogonal eigenvectors
- Eigenvectors can only be found for square matrices.
    - Not every square matrix has real eigenvectors (for example, a 2D rotation matrix)
- A square matrix A and its transpose have the same eigenvalues but, in general, different eigenvectors
- The eigenvalues of a diagonal or triangular matrix are its diagonal elements.
- Eigenvectors of a matrix A that belong to distinct eigenvalues are linearly independent.
```

**The eigenvector with the largest eigenvalue forms the first principal component of the data set, the eigenvector with the second largest eigenvalue forms the second, and so on.**
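
The formulation above can be sketched with NumPy: compute the covariance matrix, take its eigenvectors and eigenvalues, and sort them by decreasing eigenvalue. The synthetic data (3 variables, the second strongly correlated with the first) and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 3 variables (X, Y, Z); Y is correlated with X
x = rng.normal(size=200)
data = np.column_stack([x,
                        2 * x + rng.normal(scale=0.1, size=200),
                        rng.normal(size=200)])

# Center the data and compute the 3x3 covariance matrix M
centered = data - data.mean(axis=0)
M = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(M)
order = np.argsort(eigenvalues)[::-1]        # largest eigenvalue first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each principal component
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)

# Project the centered data onto the eigenvectors to get the principal components
scores = centered @ eigenvectors
```

The ratios computed here play the same role as `explained_variance_ratio_` in sklearn's `PCA`, although sklearn computes the decomposition differently (via SVD).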


## 7.2 Implementation

Here we use the Breast Cancer Wisconsin data set:

In [None]:
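from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.6,random_state=123)

# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)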

### 7.2.1 Compute PCA using sklearn:

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
PCs = pca.fit_transform(X_train_scaled)
PCs.shape

We can see that the shape of `PCs` is (341, 30): 341 training samples and 30 principal components, the same number as the predictors in the original data.

### 7.2.2 Explained Variance

The explained variance tells you how much information (variance) can be attributed to each of the principal components.

In [None]:
pca.explained_variance_ratio_
print("The first 4 components represent %1.3f" % pca.explained_variance_ratio_[0:4].sum(), " total variance")

Since the first 4 PCs alone capture most of the total variance of the 30 original predictors, we use these 4 PCs to construct the ML model using K-Nearest Neighbors:

### 7.2.3 Application of PCA model in Machine Learning:

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score as acc_score
pca = PCA(n_components=4) #We choose number of principal components to be 4

X_train_pca = pd.DataFrame(pca.fit_transform(X_train_scaled))
X_test_pca = pd.DataFrame(pca.transform(X_test_scaled))
X_train_pca.columns = ['PC1','PC2','PC3','PC4']
X_test_pca.columns  = ['PC1','PC2','PC3','PC4']

In [None]:
# Use K-Nearest Neighbors to train the model on the 4 principal components
model_KNN = KNeighborsClassifier().fit(X_train_pca, y_train)
y_pred_KNN = model_KNN.predict(X_test_pca)
print("The accuracy score is %1.3f" % acc_score(y_test,y_pred_KNN))

Plot the test-set predictions and mark the wrong predictions:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

ax = plt.gca()

targets = np.unique(y_pred_KNN)
colors = ['r', 'g']

# Plot each predicted class in its own color
for target, color in zip(targets,colors):
    indp = y_pred_KNN == target
    ax.scatter(X_test_pca.loc[indp, 'PC1'], X_test_pca.loc[indp, 'PC2'],c = color)

# Plotting the wrong predictions
ind = y_pred_KNN!=np.array(y_test)
ax.scatter(X_test_pca.loc[ind, 'PC1'],X_test_pca.loc[ind, 'PC2'],c = 'black')

#axis control
ax.legend(['malignant','benign','Wrong Prediction'])
ax.set_title("Testing set from KNN using PCA 4 components")
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')

plt.show()

![image](https://user-images.githubusercontent.com/43855029/153672409-2bcefb86-5bf2-497f-b1ca-00af35b776d1.png)

As seen in the figure above, 4 points were wrongly classified.

# 8 Neural Network

## 8.1 The Neural Network of a brain

- A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.
- A neuron is the basic unit of the nervous system and the most important component of the brain.
- Each neuron has a cell body (node), dendrites (input signals) and an axon (output signal to other neurons).
- If a neuron receives enough signal, it is activated and transmits the signal to other neurons.

![image](https://user-images.githubusercontent.com/43855029/114472746-da188c00-9bc0-11eb-913c-9dcd14f872ac.png)


## 8.2 Neural Network in Machine Learning:

![image](https://user-images.githubusercontent.com/43855029/114472756-dd137c80-9bc0-11eb-863d-7c4d054efa89.png)

## 8.3 Formulation of Neural Network:


![image](https://user-images.githubusercontent.com/43855029/114472776-e997d500-9bc0-11eb-9f70-450389c912df.png)

Here:
- x1, x2, ..., xn are the input variables.
- w1, w2, ..., wn are the weights of the respective inputs.
- b is the bias, which is summed with the weighted inputs to form the net input.

In which:
- Bias and weights are both adjustable parameters of the neuron.
- Parameters are adjusted using some learning rules.
- The net input of a neuron can range from -inf to +inf. Since the neuron itself does not bound this value, we need a mechanism that maps the net input to the output of the neuron. This mapping is known as the activation function.

**Activation functions:**

![image](https://user-images.githubusercontent.com/43855029/114575672-6752f380-9c48-11eb-8d53-c78d052cdf17.png)
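
As a minimal sketch of the formulation in section 8.3, the snippet below computes the output of a single neuron: the weighted sum of the inputs plus the bias, passed through an activation function. The function names and the input, weight and bias values are hypothetical and chosen for illustration only.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: keeps positive values, clips negatives to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Net input = sum(w_i * x_i) + b, then apply the activation function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)

# Hypothetical inputs, weights and bias
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

# Net input: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
for name, activation in (("sigmoid", sigmoid), ("tanh", math.tanh), ("relu", relu)):
    print(name, neuron(x, w, b, activation))
```

The same net input gives a different output depending on the activation function, which is why the choice of activation is a model hyperparameter.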

## 8.4 Multi-Layer Perceptron (MLP)

**Multi-layer Perceptron (MLP)** is a supervised learning algorithm.
Given a set of features `X = x1, x2, ... xm`, and target `y`, MLP can learn a non-linear function approximator for either classification or regression.

Between the input and the output layer, there can be one or more non-linear layers, called hidden layers. The figure below shows an MLP with one hidden layer and a scalar output.

![image](https://user-images.githubusercontent.com/43855029/114472972-51e6b680-9bc1-11eb-9e78-90ec739844ee.png)

![image](https://user-images.githubusercontent.com/43855029/114575549-48546180-9c48-11eb-8c9c-c5eac3180df1.png)
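
The forward pass of a one-hidden-layer MLP with scalar output can be sketched in a few lines of NumPy. This is an illustration only: the layer sizes, the random weights and the choice of `tanh` as hidden activation are assumptions, and no training (backpropagation) is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3                  # hypothetical layer sizes

# Randomly initialised parameters (training would adjust these)
W1 = rng.normal(size=(n_features, n_hidden))  # input -> hidden weights
b1 = np.zeros(n_hidden)                       # hidden biases
W2 = rng.normal(size=(n_hidden, 1))           # hidden -> output weights
b2 = np.zeros(1)                              # output bias

def forward(X):
    """Hidden layer with tanh activation, output layer with identity activation."""
    hidden = np.tanh(X @ W1 + b1)
    return hidden @ W2 + b2

X = rng.normal(size=(5, n_features))          # 5 hypothetical samples
y_hat = forward(X)
print(y_hat.shape)                            # one scalar output per sample
```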

**The advantages of Multi-layer Perceptron:**
- Capability to learn non-linear models.
- Capability to learn models in real-time (on-line learning) using partial_fit.

**The disadvantages of Multi-layer Perceptron:**
- An MLP with hidden layers has a non-convex loss function with more than one local minimum. Therefore, different random weight initializations can lead to different validation accuracy.
- MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.
- MLP is sensitive to feature scaling.

## 8.5 Types of Multi-Layer Perceptron in sklearn
Similar to the previous Machine Learning models, there are 2 main types of MLP in sklearn, depending on the model output:
- MLPClassifier: for classification problems
- MLPRegressor: for regression problems

## 8.6 Implementation with Classification problem

Here we use the **Breast Cancer Wisconsin** data for a classification problem:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_breast_cancer

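data = load_breast_cancer()
X = data.data
y = data.target

# Split the data (same settings as in section 7.2) before scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=123)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)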

The class **MLPClassifier** implements a multi-layer perceptron (MLP) algorithm that trains using backpropagation.
There are many parameters in MLPClassifier, including:
- **hidden_layer_sizes**: the number of hidden layers and the number of neurons in each layer. Default=`(100,)`
    - for example, `hidden_layer_sizes=(100,)` means 1 hidden layer with 100 neurons.
    - for example, `hidden_layer_sizes=(50,20)` means 2 hidden layers: the first has 50 neurons and the second has 20 neurons.
- **solver**: `lbfgs, sgd, adam`. Default=`adam`
- **activation**: `identity, logistic, tanh, relu`. Default=`relu`

More information can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
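
To illustrate how `hidden_layer_sizes` determines the network layout, the sketch below fits a small classifier and prints the shape of each weight matrix stored in `coefs_`. It scales and fits the whole data set without a train/test split, purely to keep the example short; this is an assumption of the sketch, not a recommended workflow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

# 2 hidden layers: 50 neurons, then 20 neurons
model = MLPClassifier(hidden_layer_sizes=(50, 20), solver='lbfgs',
                      random_state=123).fit(X_scaled, y)

# One weight matrix per connection between consecutive layers:
# 30 inputs -> 50 hidden -> 20 hidden -> 1 output unit (binary problem)
shapes = [w.shape for w in model.coefs_]
print(shapes)
```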


In [None]:
from sklearn.neural_network import MLPClassifier
model_NN = MLPClassifier(hidden_layer_sizes = (50,20),solver='lbfgs',activation='relu',random_state=123).fit(X_train_scaled, y_train)
model_NN.score(X_test_scaled,y_test)

## 8.7 Implementation with Regression problem
- Class **MLPRegressor** implements a multi-layer perceptron (MLP) that trains using backpropagation with no activation function in the output layer, which can also be seen as using the identity function as activation function.
- Therefore, it uses the square error as the loss function, and the output is a set of continuous values.

Here we use the **California housing** data from the Regression episode:

In [None]:
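import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()

# Predictors/Input:
X = pd.DataFrame(data.data,columns=data.feature_names)

# Predictand/output:
y = pd.DataFrame(data.target,columns=data.target_names)

# Split the housing data so that X_train/y_train no longer refer to the breast cancer data
# (split settings assumed to match the earlier sections)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=123)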

Fit **MLPRegressor** model

In [None]:
from sklearn.neural_network import MLPRegressor
model_NN = MLPRegressor(hidden_layer_sizes = (10,5),solver='lbfgs',activation='tanh',max_iter=1000).fit(X_train,y_train)
model_NN.score(X_test,y_test)

## 8.8 Tips on using MLP
- Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data.
- Empirically, **L-BFGS** converges faster and with better solutions on small datasets. For relatively large datasets, however, **Adam** is very robust. It usually converges quickly and gives pretty good performance. **SGD** with momentum or Nesterov's momentum, on the other hand, can perform better than those two algorithms if the learning rate is correctly tuned.
- Since backpropagation has a high time complexity, it is advisable to start with a smaller number of hidden neurons and few hidden layers for training.
- The loss function for the Classifier is **Cross-Entropy**, while for the Regressor it is **Squared Error**
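
The two loss functions named in the last tip can be sketched in their standard textbook form. The function names and sample values below are hypothetical; sklearn's internal loss may additionally include a regularization term and constant factors.

```python
import math

def squared_error(y_true, y_pred):
    """Mean squared error, the usual loss for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """Average cross-entropy for binary labels (0/1) and predicted probabilities."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)

# Regression: errors are 0 and 2, so the mean squared error is (0 + 4) / 2
print(squared_error([1.0, 2.0], [1.0, 4.0]))

# Classification: confident correct predictions give a small loss,
# confident wrong predictions give a large loss
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```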

## 8.9. Notes
- There are many other NN algorithms which will be introduced in the Deep Learning class

# 9 Unsupervised Learning

- No labels are given to the learning algorithm leaving it on its own to find structure in its input.
- Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
- Used when no labeled output data is available
- Often used for clustering data

![image](https://user-images.githubusercontent.com/43855029/114584282-82c1fc80-9c50-11eb-9342-41e5592e7b67.png)

![image](https://user-images.githubusercontent.com/43855029/114584314-89507400-9c50-11eb-9c54-5a589075fd48.png)

**Typical method:**

```
K-means clustering
Hierarchical clustering
Ward clustering
Partitioning Around Medoids (PAM)
```


## 9.1 K-means clustering

### 9.1.1 Explanation of K-means clustering method:
- Given a set of data, we choose to split it into K=2 clusters:

![image](https://user-images.githubusercontent.com/43855029/114584415-a5ecac00-9c50-11eb-8919-807f83ddf23a.png)

- First select 2 random centroids (denoted as red and blue X)

![image](https://user-images.githubusercontent.com/43855029/114584573-d16f9680-9c50-11eb-9dc4-8d918919f565.png)

- Compute the distance between each point and the 2 centroids (red X and blue X), for instance using the Euclidean distance, and assign each point to its nearest centroid. This creates 2 groups:

![image](https://user-images.githubusercontent.com/43855029/114584860-0bd93380-9c51-11eb-9afc-3bb9510e9c34.png)

- Now recompute the **new** centroids of the 2 groups (using mean value of all points in the same groups):

![image](https://user-images.githubusercontent.com/43855029/114585002-34f9c400-9c51-11eb-83e0-b5769abf6cd3.png)

- Compute the distance between the 2 **new** centroids and all the points, and reassign each point to its nearest centroid. We have 2 new groups:

![image](https://user-images.githubusercontent.com/43855029/114585030-3b883b80-9c51-11eb-8f69-29f6e406e215.png)

- Repeat the last 2 steps until the **centroids no longer change**. The model has then reached equilibrium:

![image](https://user-images.githubusercontent.com/43855029/114585223-6b374380-9c51-11eb-8663-27474956ec61.png)
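
The steps above can be sketched as a short NumPy function. This is a simplified illustration, not sklearn's implementation: the function name `kmeans`, its parameters and the toy data are hypothetical, and the initial centroids are simply drawn at random from the data points.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroid = mean of the points in each group
        # (an empty group keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 3 points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids, so only the grouping itself is meaningful.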


### 9.1.2 Example with K=3
![image](https://user-images.githubusercontent.com/43855029/114585361-8e61f300-9c51-11eb-965e-dc4d57e9c0eb.png)

![image](https://user-images.githubusercontent.com/43855029/114585502-b81b1a00-9c51-11eb-8015-973216b450ce.png)


### 9.1.3. Implementation
Here we use the iris data set with only predictors

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
iris = load_iris()
X = iris.data

Apply K-means and plot the result:

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

model_KMeans = KMeans(n_clusters=3)
model_KMeans.fit(X)

plt.scatter(X[:,2],X[:,3],c=model_KMeans.labels_)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.title('KMeans clustering with 3 clusters')
plt.show()

![image](https://user-images.githubusercontent.com/43855029/115735833-c99ea900-a358-11eb-87d8-774efc7fa459.png)

### 9.1.4 How to find optimal K values:

#### 9.1.4.1 Elbow approach
- Similar to the KNN method for supervised learning, we can use the Elbow approach to find the optimal K value for K-means.
- The Elbow approach uses the Within-Cluster Sum of Squares (WSS) to measure the compactness of the clusters:
![image](https://user-images.githubusercontent.com/43855029/114587068-4d6ade00-9c53-11eb-932d-0de0c9edef83.png)
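
As a minimal sketch, the WSS can be computed by hand as the sum of squared distances between each point and the centroid of its own cluster. The helper `wss` and the toy data are hypothetical; for a fitted sklearn `KMeans` model the same quantity is stored in the `inertia_` attribute.

```python
import numpy as np

def wss(X, labels, centroids):
    """Within-cluster sum of squares: squared distance of each point to its own centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# Hypothetical data: 4 points forming 2 obvious clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: a single centroid at the overall mean (5, 1)
one_cluster = wss(X, np.zeros(4, dtype=int), X.mean(axis=0, keepdims=True))

# K = 2: one centroid per cluster; every point is at distance 1 from its centroid
two_clusters = wss(X, np.array([0, 0, 1, 1]),
                   np.array([[0.0, 1.0], [10.0, 1.0]]))

print(one_cluster, two_clusters)  # 104.0 4.0
```

The sharp drop from K = 1 to K = 2 is the kind of "elbow" the approach looks for.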

The optimal K value is found at the elbow of the WSS curve. In sklearn, the WSS of a fitted `KMeans` model is available as its `inertia_` attribute:

In [None]:
wss = []
for k in range(1,10):
    model = KMeans(n_clusters=k).fit(X)
    wss.append(model.inertia_)

plt.scatter(range(1,10),wss)
plt.plot(range(1,10),wss)
plt.xlabel("Number of Clusters k")
plt.ylabel("Within Sum of Square")
plt.title("Optimal number of clusters based on WSS Method")
plt.show()

![image](https://user-images.githubusercontent.com/43855029/115737965-9b21cd80-a35a-11eb-9bcd-0d63e685ec0f.png)

#### 9.1.4.2 Gap-Statistics approach
- Developed by Prof. Tibshirani et al. at Stanford
- Can be applied to any clustering method (K-means, Hierarchical)
- The optimal K maximizes the Gap function:

![image](https://user-images.githubusercontent.com/43855029/114586376-95d5cc00-9c52-11eb-9b71-ed330cfc50bc.png)

`E*n`: the expectation under a sample of size n from the reference distribution
![image](https://user-images.githubusercontent.com/43855029/114586396-9b331680-9c52-11eb-9b83-955aa256e623.png)

![image](https://user-images.githubusercontent.com/43855029/114586456-af771380-9c52-11eb-9fdb-99cc8df854fb.png)

**Installation:**

This version of Gap Statistics is not official. At the time of writing, no official Gap Statistics implementation has been released for Python.
We use the version from [milesgranger's github](https://github.com/milesgranger/gap_statistic):
```
pip install git+https://github.com/milesgranger/gap_statistic.git
pip install gapstat-rs
```
Implement Gap-Statistics:

In [None]:
from gap_statistic import OptimalK

optimalK = OptimalK(n_jobs=1) # No parallel
n_clusters = optimalK(X[:,1:4], cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

Plot Gap-Statistics:

In [None]:
import matplotlib.pyplot as plt
plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
            optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

![image](https://user-images.githubusercontent.com/43855029/115745658-a298a500-a361-11eb-8071-6af68f7eb428.png)

## 9.2 Comparison between different clustering methods in sklearn:
- This is an example from [sklearn](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html)
- The source code for image below can be found [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-download-auto-examples-cluster-plot-cluster-comparison-py)

![image](https://user-images.githubusercontent.com/43855029/115748324-0f14a380-a364-11eb-8a06-6d073b4d99c4.png)