Your First Machine Learning Project in Python Step-By-Step

https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
by Jason Brownlee on February 10, 2019 in Python Machine Learning

Here is an overview of what we are going to cover:

-- Installing the Python and SciPy platform.
-- Loading the dataset.
-- Summarizing the dataset.
-- Visualizing the dataset.
-- Evaluating some algorithms.
-- Making some predictions.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

scipy
numpy
matplotlib
pandas
sklearn
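
Before going further, it can help to confirm that all five libraries can be imported. The following is a minimal sketch using only the standard library; the helper name check_libraries is hypothetical and not part of the original tutorial. Note that the import name sklearn corresponds to the pip package scikit-learn.

```python
import importlib.util

# The five libraries listed above (import names, not pip package names).
REQUIRED = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]

def check_libraries(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_libraries(REQUIRED)
for name, found in status.items():
    print("{}: {}".format(name, "installed" if found else "MISSING"))
```

Any library reported as MISSING needs to be installed before the later cells will run.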

# 1.2 Start Python and Check Versions

Open a command line and start the Python interpreter:

In [None]:
# You can skip this step if you are using Jupyter Notebook,
# which already runs Python for you.
# python

Check the versions of libraries

In [1]:
# Check the versions of libraries
 
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

Python: 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
scipy: 1.3.0
numpy: 1.16.4
matplotlib: 3.1.0
pandas: 0.24.2
sklearn: 0.21.2


Compare with Jason's output below.
Ideally, your versions should match or be more recent.
Python: 3.6.8 (default, Dec 30 2018, 13:01:55) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.1.0
numpy: 1.15.4
matplotlib: 3.0.2
pandas: 0.23.4
sklearn: 0.20.2

# 2. Load The Data

We are using the iris flowers dataset.
You can learn more about this dataset on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set

2.1 Import libraries

In [2]:
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

2.2 Load Dataset

We are using pandas to load the data. We will also use pandas next
to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading
the data. This will help later when we explore the data.

In [3]:
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
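
The column names are passed in because the CSV file has no header row. As a rough illustration of the same idea without a network connection, the sketch below reads three rows (copied from the sample output in section 3.2) with the standard-library csv module. The fieldnames argument of csv.DictReader plays a role similar to names in pandas.read_csv; this is an analogy, not what pandas does internally.

```python
import csv
import io

# Three rows taken from the sample output in section 3.2; the real file has 150.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# fieldnames labels the columns because the data itself has no header row.
rows = list(csv.DictReader(io.StringIO(sample), fieldnames=names))
for row in rows:
    print(row['sepal-length'], row['class'])
```

Unlike pandas, csv.DictReader leaves every value as a string, so numeric columns would still need converting.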

# 3. Summarize the Dataset

We are going to take a look at the data in a few different ways:
1 - Dimensions of the dataset.
2 - Peek at the data itself.
3 - Statistical summary of all attributes.
4 - Breakdown of the data by the class variable.

3.1 Dimensions of Dataset

In [None]:
# shape
print(dataset.shape)

-- output
(150, 5)

3.2 Peek at the Data

In [None]:
# head
print(dataset.head(20))

should resemble (only the first 3 of the 20 rows are shown here)...
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa

3.3 Statistical Summary

In [None]:
# descriptions
print(dataset.describe())

should resemble...
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
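
To see where the numbers in this summary come from, here is a small sketch that computes the same statistics by hand for one column. It uses only the three sepal-length values from the sample output in section 3.2, so the results will not match the full-dataset table above. It assumes Python 3.8 or later for statistics.quantiles.

```python
import statistics

# Three sepal-length values from the sample output in section 3.2.
values = [5.1, 4.9, 4.7]

# method='inclusive' interpolates linearly between data points,
# which is also the default behaviour of pandas quantiles.
q25, q50, q75 = statistics.quantiles(values, n=4, method='inclusive')

summary = {
    'count': len(values),
    'mean': statistics.mean(values),
    'std': statistics.stdev(values),   # sample standard deviation (divides by n - 1)
    'min': min(values),
    '25%': q25,
    '50%': q50,
    '75%': q75,
    'max': max(values),
}
for key, value in summary.items():
    print("{:<6}{:.6f}".format(key, value))
```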

3.4 Class Distribution

In [None]:
# class distribution
print(dataset.groupby('class').size())

should resemble...
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50

# 4. Data Visualization

We will look at two types of plots:

1 -- Univariate plots to better understand each attribute.
2 -- Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

In [None]:
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

In [None]:
# histograms
dataset.hist()
plt.show()

4.2 Multivariate Plots

In [None]:
# scatter plot matrix
scatter_matrix(dataset)
plt.show()

Note the diagonal grouping of some pairs of attributes. 
This suggests a high correlation and a predictable relationship.

# 5. Evaluate Some Algorithms

Here is what we are going to cover in this step:

1 -- Separate out a validation dataset.
2 -- Set-up the test harness to use 10-fold cross validation.
3 -- Build 6 different models to predict species from flower measurements.
4 -- Select the best model.

5.1 Create a Validation Dataset

Later, we will use statistical methods to estimate the accuracy of the models
that we create on unseen data. We also want a more concrete estimate of the accuracy
of the best model on unseen data by evaluating it on actual unseen data.

We are going to hold back some data that the algorithms will not get to see, 
and we will use this data to get a second, independent idea of how accurate the best model
might actually be. The cell below keeps 80% of the rows for training and holds back 20% for validation.

In [None]:
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

You now have training data in X_train and Y_train for preparing models, and 
X_validation and Y_validation sets that we can use later.
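
For intuition about what a hold-out split does, here is a minimal pure-Python sketch. The function holdout_split is hypothetical: it shuffles row indices with a seeded random generator and reserves a fraction of them for validation. It does not reproduce the exact shuffling that train_test_split performs, so the rows it selects will differ.

```python
import random

def holdout_split(n_samples, test_size, seed):
    """Shuffle row indices and reserve a fraction of them for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

# Same sizes as the tutorial: 150 rows, 20% held back.
train_idx, validation_idx = holdout_split(n_samples=150, test_size=0.20, seed=7)
print(len(train_idx), len(validation_idx))
```

With 150 rows and a validation size of 0.20, this leaves 120 rows for training and 30 for validation.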

Notice that we used a Python slice to select the columns in the NumPy array. If this is new to you, you might want to check out this post:

How to Index, Slice and Reshape NumPy Arrays for Machine Learning in Python
https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/

5.2 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our training data into 10 parts, train on 9 and test on 1, 
and repeat until each part has been used once for testing.
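
To make the fold mechanics concrete, here is a minimal pure-Python sketch of unshuffled k-fold splitting. The generator kfold_indices is hypothetical and only illustrates the idea; the tutorial itself relies on model_selection.KFold. It is run on 120 samples, the size of the training set after 20% of the 150 rows are held back.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(n_samples=120, n_splits=10))
for i, (train, test) in enumerate(folds, start=1):
    print("fold {}: train={} test={}".format(i, len(train), len(test)))
```

Each sample lands in exactly one test fold, so every row is used for testing once and for training nine times.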

In [None]:
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use.
We get an idea from the plots that some of the classes are partially linearly separable 
in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN).
- Classification and Regression Trees (CART).
- Gaussian Naive Bayes (NB).
- Support Vector Machines (SVM).

This is a good mixture of simple linear (LR and LDA) 
and nonlinear (KNN, CART, NB and SVM) algorithms.



In [None]:
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
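	# The folds are not shuffled, so random_state is omitted; recent scikit-learn
	# versions raise an error if it is set without shuffle=True.
	kfold = model_selection.KFold(n_splits=10)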
	cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)

5.4 Select Best Model

Output from the model evaluation cell above:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
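
Each line shows the mean of the 10 fold accuracies followed by their standard deviation in parentheses. The sketch below shows how such a line is formed. The fold scores are hypothetical: they are one set of values consistent with the KNN line (12 test rows per fold, a single error in two of the folds), not the actual per-fold results.

```python
import statistics

# Hypothetical fold scores: 12 test rows per fold, one error in two of the folds.
fold_scores = [1.0] * 8 + [11 / 12] * 2

mean_score = statistics.mean(fold_scores)
# Population standard deviation (divides by n), matching the default of
# the NumPy std() call used on cv_results in the cell above.
std_score = statistics.pstdev(fold_scores)

msg = "%s: %f (%f)" % ("KNN", mean_score, std_score)
print(msg)
```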

In [None]:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

# 6. Make Predictions

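In the cross-validation results above, SVM scored highest, with KNN close behind. 
Here we run the KNN model directly on the validation set and summarize 
the results as a final accuracy score, a confusion matrix and a classification report.
For background on the confusion matrix, see:
http://machinelearningmastery.com/confusion-matrix-machine-learning/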

In [None]:
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

should resemble...
0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

      micro avg       0.90      0.90      0.90        30
      macro avg       0.92      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30
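
The accuracy and the per-class figures in the report can all be derived from the confusion matrix. The sketch below recomputes them by hand from the matrix printed above, where rows are the true classes and columns are the predicted classes. The variable names are illustrative only.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes.
matrix = [[7, 0, 0],
          [0, 11, 1],
          [0, 2, 9]]

total = sum(sum(row) for row in matrix)
# Accuracy is the share of predictions on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    predicted = sum(row[i] for row in matrix)   # column sum: times this class was predicted
    actual = sum(matrix[i])                     # row sum: true members of the class (support)
    precision = matrix[i][i] / predicted
    recall = matrix[i][i] / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actual)

print("accuracy: %.2f" % accuracy)
for label, (p, r, f1, support) in report.items():
    print("%-16s %.2f %.2f %.2f %d" % (label, p, r, f1, support))
```

For example, Iris-versicolor was predicted 13 times and 11 of those were correct, which gives a precision of 11/13, or about 0.85.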

How to Make Predictions with scikit-learn

https://machinelearningmastery.com/make-predictions-scikit-learn/