# Machine Learning lab

Welcome to the world of machine learning. In this notebook we explore what machine learning is and apply it to solve a classification problem.

## GOAL of this lab

You will learn:
1. What machine learning is
2. The different types of machine learning problems
3. How to understand the data and output used for machine learning, using a sample dataset
4. Different machine learning algorithms
5. How to use a machine learning algorithm to solve a classification problem


In addition, you will use:
1. python 3: Python is an interpreted, high-level, general-purpose programming language.
2. scipy: SciPy is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
3. numpy: NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
4. matplotlib: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
5. pandas: Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
6. sklearn: Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

These Python libraries are installed in your virtual environment using pip and requirements.txt.

### What is machine learning?
Machine learning (ML) describes algorithms that learn from data together with the correct output for that data.
An ML program tries to learn the relationship between the input data and the output.

A machine learning program:
1. Extracts the features of the dataset (e.g., shape or color) for a given problem (e.g., separating apples and oranges)
2. Uses the observed features and correct output (also called 'labels') to learn the results (e.g., it is an apple or it is an orange).
3. Saves the observed results as a machine learning model for later use.
4. Uses the saved results in the machine learning model to predict the output for a new data instance (is this an apple or an orange?)
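
The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how a real library implements learning: the fruit measurements are made-up toy values, the "model" is simply the average feature vector (centroid) of each label, and the names `train` and `predict` are hypothetical.

```python
# Toy data: each fruit is described by two features (weight in grams, redness 0-1).
# These values are invented for illustration only.
features = [(150, 0.9), (170, 0.8), (160, 0.85),   # apples
            (140, 0.3), (130, 0.2), (135, 0.25)]   # oranges
labels = ["apple", "apple", "apple", "orange", "orange", "orange"]


def train(features, labels):
    """Learn one average feature vector (centroid) per label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        total = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            total[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: distance(model[y]))


model = train(features, labels)        # steps 1-3: features + labels -> saved model
print(predict(model, (165, 0.82)))     # step 4: predict the label of a new instance
```

The dictionary returned by `train` plays the role of the saved model: once it exists, `predict` no longer needs the original training data.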

### Click and run the code cell below to learn more about machine learning

In [2]:
REPO_PATH = ''
import sys
import os

from IPython.display import display
import ipywidgets as widgets
import widget_utils

widget_utils.machine_learning()

*Interactive quiz: Does machine learning always require output accompanying the data to learn? Options: Yes, No.*

### What is a feature in machine learning?

A feature is an individual measurable property or characteristic of a phenomenon being observed. For example, an apple is red, green, or pink.  A banana is yellow.  An apple is round.  A banana is oblong and tapered at the ends. 
[Reference](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=2ahUKEwjY9KX3pKLeAhUN7p8KHTbwAMUQFjABegQIChAB&url=https%3A%2F%2Fwww.springer.com%2Fus%2Fbook%2F9780387310732&usg=AOvVaw19ZfKiiH_u4x4lnbqIQ-Zp)

Let's consider the task of separating green apples from oranges in a box that contains both. A few properties, such as shape, weight, and color, could be used to separate them. These properties are the features for the machine learning task. Let's see what makes a good feature below.

### Click and run the code cell below to learn more about features

In [4]:
widget_utils.feature_selection()

*Interactive quiz: select a good feature to distinguish between green apples and oranges. Options: shape, weight, color.*

There are 2 types of supervised machine learning:
1. Classification
2. Regression

The problem presented above (separating green apples vs oranges) is a classification task.

An example of a regression task is determining the <b>price</b> of an apple using all the appropriate features that contribute to the price.  A machine learning program will use features like weight, quality, source, season, price of alternatives, availability, etc. to determine an approximate price.
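
To make the contrast with classification concrete, here is a small regression sketch in plain Python. It fits a straight line from a single feature (weight) to a price using ordinary least squares. The weights and prices are invented toy values, and `fit_line` is a hypothetical helper, not part of any library used in this lab.

```python
# Toy data (invented for illustration): apple weight in grams -> price in dollars.
weights = [100.0, 150.0, 200.0, 250.0]
prices = [0.50, 0.75, 1.00, 1.25]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(weights, prices)
predicted_price = slope * 180.0 + intercept   # the output is a number, not a category
print(round(predicted_price, 2))
```

The key difference from classification is the type of output: a regression model returns a continuous value, while a classifier returns one label from a fixed set.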

This lab demonstrates a classification task using the Iris flower dataset. The dataset consists of 50 samples from each of three species of Iris flower (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features we will develop a machine learning model to distinguish the species from each other.

### Import libraries

Before we start let's import the libraries necessary for this lab.

In [5]:
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

All the libraries should load without any error. If you see a Module Not Found error, that library is missing from your virtual environment and you need to install it.

### Load Dataset

In [6]:
# Load dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
dataset = pandas.read_csv('data/iris_dataset.csv', skiprows=1, names=names)

The dataset should load without any errors.

Now let's look at the data.

In this step we are going to take a look at the data in a few different ways:

1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.

#### Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

In [7]:
# shape
print(dataset.shape)

(150, 5)


Above you can see that the dataset has 150 instances and 5 attributes. Next, let's look at the first few rows of data and distinguish between the features and the output class.

In [8]:
# head
print(dataset.head(20))

    sepal-length  sepal-width  petal-length  petal-width species
0            5.1          3.5           1.4          0.2  setosa
1            4.9          3.0           1.4          0.2  setosa
2            4.7          3.2           1.3          0.2  setosa
3            4.6          3.1           1.5          0.2  setosa
4            5.0          3.6           1.4          0.2  setosa
5            5.4          3.9           1.7          0.4  setosa
6            4.6          3.4           1.4          0.3  setosa
7            5.0          3.4           1.5          0.2  setosa
8            4.4          2.9           1.4          0.2  setosa
9            4.9          3.1           1.5          0.1  setosa
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8          3.0           1.4          0.1  setosa
13           4.3          3.0           1.1          0.1  setosa
14           5.8         

#### Statistical Summary

Now we can take a look at a summary of each attribute.
This includes the count, mean, the min and max values as well as some percentiles.

In [9]:
# descriptions
print(dataset.describe())

       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.
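
As a rough sketch of where such summary numbers come from, the snippet below computes a few of the same statistics with Python's standard `statistics` module for the first five sepal-length values shown in the preview above. It is an illustration only; pandas computes its summary over all 150 rows and also reports the 25% and 75% percentiles.

```python
import statistics

# First five sepal-length values from the dataset preview above (cm).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

summary = {
    "count": len(sepal_length),
    "mean": statistics.mean(sepal_length),
    "std": statistics.stdev(sepal_length),    # sample standard deviation
    "min": min(sepal_length),
    "50%": statistics.median(sepal_length),   # the median is the 50th percentile
    "max": max(sepal_length),
}
for name, value in summary.items():
    print(f"{name:>5}: {value:.4f}")
```

Like `statistics.stdev`, the `std` row printed by `describe()` is the sample standard deviation (it divides by n - 1).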

In [11]:
widget_utils.iris_features()

*Interactive quiz: click all the features used in this dataset to distinguish different Iris flowers. Options: sepal-length, species, sepal-width, petal-length, petal-width.*

#### Class Distribution
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

In [12]:
# class distribution
print(dataset.groupby('species').size())

species
setosa        50
versicolor    50
virginica     50
dtype: int64
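
Counting rows per class is simply a tally of label values. A minimal sketch with the standard library's `collections.Counter` is shown here; the short `species` list is a stand-in for the real column, not data from the file.

```python
from collections import Counter

# A handful of labels standing in for the 'species' column.
species = ["setosa", "versicolor", "setosa", "virginica", "versicolor", "setosa"]

class_counts = Counter(species)
for label, count in sorted(class_counts.items()):
    print(label, count)
```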


### Let's use a few Machine learning algorithms to solve our problem

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

1. Separate out a validation dataset.
2. Set up the test harness to use 10-fold cross validation.
3. Build 2 different models to predict species from flower measurements.
4. Select the best model.

### Create a Validation Dataset

We need to know whether the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

In [13]:
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

You now have training data in X_train and Y_train for preparing models, and the X_validation and Y_validation sets that we can use later for evaluation.
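
To see what an 80/20 split does conceptually, the sketch below shuffles row indices with a fixed seed and cuts them into two parts. It is an illustration in plain Python, not scikit-learn's actual procedure, so it will not reproduce the exact rows chosen by `train_test_split`; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples, validation_fraction, seed):
    """Shuffle row indices reproducibly and cut them into train/validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle every run
    n_validation = round(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


train_idx, validation_idx = split_indices(150, 0.20, seed=7)
print(len(train_idx), len(validation_idx))
```

With 150 rows this yields 120 training rows and 30 validation rows, and no row appears in both parts.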

During training we need a way to check how well the models are learning, i.e. we need to be able to calculate their accuracy on data they were not trained on. To do this we can use k-fold cross validation, explained below.

We will use 10-fold cross validation to estimate accuracy.

This will split our training set into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits during training.
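
The fold mechanics can be sketched in plain Python. The generator below cuts the row indices into contiguous folds and uses each fold once as the test set; `k_fold_indices` is a hypothetical helper written for illustration, assuming no shuffling.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists; each contiguous fold is the test set once."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):     # spread any remainder over the first folds
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(120, 10))         # 120 = number of rows in the training set
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

For the 120 training rows in this lab, each of the 10 rounds trains on 108 rows and tests on the remaining 12, and every row is used for testing exactly once.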

In [14]:
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

The specific random seed above does not matter. To learn more about pseudorandom number generators, see:

Introduction to Random Number Generators for Machine Learning in Python

We are using the metric of 'accuracy' to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset. scikit-learn reports it as a fraction between 0 and 1 (e.g. 0.95), which corresponds to a percentage when multiplied by 100 (e.g. 95% accurate). We will use the scoring variable when we build and evaluate each model next.
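
The accuracy calculation itself is short enough to write by hand. The sketch below uses invented toy labels, and `accuracy` is a hypothetical helper that mirrors the definition above rather than scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels, invented for illustration: 4 of 5 predictions are right.
y_true = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]
score = accuracy(y_true, y_pred)
print(f"{score:.2f} ({score * 100:.0f}%)")
```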

### Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 2 different algorithms:

1. Logistic Regression (LR)
2. K-Nearest Neighbors (KNN)

This is a good mixture of a simple linear algorithm (LR) and a nonlinear one (KNN). We set up the cross-validation splitter identically for each algorithm, so each one is evaluated on exactly the same data splits and the results are directly comparable.
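
For intuition about what KNN does, here is a minimal from-scratch sketch: a new point receives the majority label of its k closest training points. The training points are invented toy values, `knn_predict` is a hypothetical helper, and k=3 is chosen only because the toy set is tiny (scikit-learn's `KNeighborsClassifier` defaults to 5 neighbors).

```python
from collections import Counter
import math

# Toy training points (petal-length, petal-width) with labels, invented for illustration.
train_points = [((1.4, 0.2), "setosa"), ((1.5, 0.3), "setosa"), ((1.3, 0.2), "setosa"),
                ((4.5, 1.5), "versicolor"), ((4.7, 1.4), "versicolor"), ((4.4, 1.3), "versicolor")]


def knn_predict(train_points, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    nearest = sorted(train_points, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(train_points, (1.6, 0.25)))
print(knn_predict(train_points, (4.2, 1.2)))
```

Because the prediction depends only on nearby points, the decision boundary can bend around the data, which is why KNN counts as a nonlinear method.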

Let’s build and evaluate our models:

In [15]:
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('KNN', KNeighborsClassifier()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
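    # Recent scikit-learn versions raise an error when random_state is set without shuffle=True.
    # Without shuffling, the folds are deterministic, so every model sees the same splits.
    kfold = model_selection.KFold(n_splits=10)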
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

LR: 0.966667 (0.040825)
KNN: 0.983333 (0.033333)


### Select Best Model

We now have 2 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Comparing the cross-validation results above, KNN has the highest estimated accuracy, about 98%, compared with about 97% for LR.

### Make Predictions

The KNN algorithm is very simple and was an accurate model based on our tests. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.

In [16]:
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         7
  versicolor       0.85      0.92      0.88        12
   virginica       0.90      0.82      0.86        11

   micro avg       0.90      0.90      0.90        30
   macro avg       0.92      0.91      0.91        30
weighted avg       0.90      0.90      0.90        30



We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).
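
The numbers in the classification report can be derived by hand from the confusion matrix. The sketch below does that in plain Python for the KNN matrix printed above, where rows are the true classes and columns are the predicted classes; the variable names are illustrative.

```python
# Confusion matrix from the KNN output above: rows = true class, columns = predicted class.
labels = ["setosa", "versicolor", "virginica"]
matrix = [[7, 0, 0],
          [0, 11, 1],
          [0, 2, 9]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))   # diagonal = correct predictions
print("accuracy:", correct / total, "errors:", total - correct)

report = {}
for i, label in enumerate(labels):
    true_positive = matrix[i][i]
    predicted_as_label = sum(row[i] for row in matrix)    # column sum
    actually_label = sum(matrix[i])                       # row sum (the "support")
    precision = true_positive / predicted_as_label
    recall = true_positive / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), actually_label)
    print(label, report[label])
```

Precision answers "of the flowers predicted as this species, how many really were?", while recall answers "of the flowers that really are this species, how many did the model find?".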

In [17]:
lr = LogisticRegression(solver='liblinear', multi_class='ovr')
lr.fit(X_train, Y_train)
predictions = lr.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.8
[[ 7  0  0]
 [ 0  7  5]
 [ 0  1 10]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         7
  versicolor       0.88      0.58      0.70        12
   virginica       0.67      0.91      0.77        11

   micro avg       0.80      0.80      0.80        30
   macro avg       0.85      0.83      0.82        30
weighted avg       0.83      0.80      0.80        30

