
# Spot-Check Classification Algorithms
You cannot know which algorithms are best suited to your problem beforehand. You
must trial a number of methods and focus attention on those that prove themselves the most
promising.

The question is not: <span class="girk">What algorithm should I use on my dataset? Instead it is: What
algorithms should I spot-check on my dataset?</span>

I recommend trying a mixture of algorithms to see which are good at picking out the structure in your data. Below are some suggestions for spot-checking algorithms on your dataset:

- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
- Try a mixture of modeling types (e.g. linear and nonlinear functions or parametric and nonparametric).
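
The suggestions above can be sketched as one loop that scores several kinds of model on the same folds. This is a minimal sketch, not a definitive harness: the synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `candidates`, and `spot_scores` are assumptions made for illustration, so substitute your own dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with your own inputs and labels).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)

# A mixture of representations and model types: linear, instance-based, and tree-based.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=7),
}

# Score every candidate on the same folds so the comparison is like for like.
folds = KFold(n_splits=10, shuffle=True, random_state=7)
spot_scores = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=folds).mean()
    for name, estimator in candidates.items()
}

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(spot_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Sharing a single `KFold` object across the candidates keeps the train/test splits identical, so differences in score reflect the algorithms rather than the splits.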


# Algorithms Overview
We are going to take a look at six classification algorithms that you can spot-check on your
dataset, starting with two linear machine learning algorithms:

- Logistic Regression.
- Linear Discriminant Analysis.

Then we look at four nonlinear machine learning algorithms:

- k-Nearest Neighbors.
- Naive Bayes.
- Classification and Regression Trees.
- Support Vector Machines.

# Linear Machine Learning Algorithms
This section demonstrates minimal recipes for how to use two linear machine learning algorithms:
logistic regression and linear discriminant analysis.

### Logistic Regression
Logistic regression models the <span class="mark">log-odds of the class as a linear function of the numeric input variables</span> and can
model binary classification problems. It does not strictly assume a Gaussian distribution for the inputs.
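
To make the binary prediction concrete: logistic regression passes a weighted sum of the inputs through the logistic (sigmoid) function and thresholds the resulting probability. The sketch below uses hypothetical coefficients (`weights`, `bias`) and a hypothetical input `x_example`. None of these values come from the diabetes dataset.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients and a single input row, chosen only for illustration.
weights = np.array([0.8, -0.5])
bias = 0.1
x_example = np.array([2.0, 1.0])

# Linear score (the log-odds), then the probability of the positive class.
log_odds = bias + weights @ x_example
probability = sigmoid(log_odds)

# Predict the positive class when the probability reaches 0.5.
predicted_class = int(probability >= 0.5)
print(log_odds, probability, predicted_class)
```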

In [1]:
# Pima Indians Diabetes Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [2]:
# Load the dataset
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('pima-indians-diabetes.data', names=names)

# separate array into input and output components
X = df.drop('class',axis='columns')
Y = df['class']

In [14]:
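# 10 consecutive folds; random_state is omitted because it only applies when
# shuffle=True, and recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
# collect the mean accuracy of each model for the final comparison
final = [results.mean()]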

0.76951469583


### Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is a statistical technique <span class="mark">for binary and multiclass
classification</span>. It assumes a <span class="mark">Gaussian distribution for the numerical input variables</span> within each class.
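
The diabetes data has only two classes, so the multiclass case does not show up in this notebook. As a hedged illustration, the sketch below runs LDA on scikit-learn's bundled three-class Iris data. The choice of dataset and the names `X_iris`, `y_iris`, and `iris_scores` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

# Iris has three classes, so this exercises the multiclass case.
X_iris, y_iris = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()
# Iris rows are ordered by class, so the folds are shuffled here.
iris_folds = KFold(n_splits=10, shuffle=True, random_state=7)
iris_scores = cross_val_score(lda, X_iris, y_iris, cv=iris_folds)
print(iris_scores.mean())
```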

In [4]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [16]:
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
final.append(results.mean())

0.773462064252


# Nonlinear Machine Learning Algorithms
### k-Nearest Neighbors
The k-Nearest Neighbors algorithm (or KNN) <span class="mark">uses a distance metric</span> to find the k most similar
instances in the training data for a new instance and takes the <span class="mark">most common class among the neighbors
as the prediction</span> (the mean outcome is used for regression).
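
The idea can be sketched in a few lines of NumPy. In classification the neighbors' outcomes are usually combined by majority vote, which is what this sketch does. It assumes Euclidean distance and a hypothetical toy dataset (`knn_X`, `knn_y`), and it is not how `KNeighborsClassifier` is implemented internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from the new instance to every training instance.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training instances.
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors is the prediction.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
knn_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
knn_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(knn_X, knn_y, np.array([1.1, 1.0])))
print(knn_predict(knn_X, knn_y, np.array([5.1, 5.0])))
```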

In [6]:
from sklearn.neighbors import KNeighborsClassifier

In [17]:
model = KNeighborsClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
final.append(results.mean())

0.726555023923


### Naive Bayes
- Naive Bayes calculates the probability of each class and the conditional probability of each input value given each class.
- These probabilities are estimated for new data and multiplied together, <span class="mark">assuming that they are all independent</span> (a simple or naive assumption).
- When working with <span class="mark">real-valued data</span>, a <span class="mark">Gaussian distribution</span> is assumed <span class="girk">to easily estimate the probabilities</span> for input variables <span class="mark">using the Gaussian Probability Density Function.</span>
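
The three points above can be sketched by hand for a single new row. The class priors, means, and standard deviations below are hypothetical numbers chosen for illustration, not statistics estimated from the diabetes data. `GaussianNB` estimates the corresponding quantities from the training set.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Gaussian probability density function, evaluated element-wise."""
    exponent = np.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (np.sqrt(2 * np.pi) * std)


# Hypothetical per-class statistics for two input variables.
class_stats = {
    0: {"prior": 0.65, "mean": np.array([110.0, 30.0]), "std": np.array([25.0, 7.0])},
    1: {"prior": 0.35, "mean": np.array([140.0, 35.0]), "std": np.array([30.0, 7.0])},
}
x_new = np.array([150.0, 36.0])

# Multiply the class prior by the per-variable densities (the naive independence assumption).
nb_scores = {}
for label, stats in class_stats.items():
    likelihoods = gaussian_pdf(x_new, stats["mean"], stats["std"])
    nb_scores[label] = stats["prior"] * np.prod(likelihoods)

# Normalize so the scores sum to 1, then pick the most probable class.
total = sum(nb_scores.values())
posteriors = {label: score / total for label, score in nb_scores.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```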

In [8]:
from sklearn.naive_bayes import GaussianNB

In [18]:
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
final.append(results.mean())

0.75517771702


### Classification and Regression Trees
Classification and Regression Trees (CART or just decision trees) <span class="mark">construct a binary tree</span> from
the training data. <span class="girk">Split points are chosen greedily</span> by evaluating each attribute and each value
of each attribute in the training data <span class="girk">in order to minimize a cost function</span> (like the Gini index).
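
A greedy search for a single split point can be sketched as follows. It tries every attribute and every observed value, and keeps the split with the lowest weighted Gini index. The helper names (`gini`, `best_split`) and the toy data are assumptions for illustration. `DecisionTreeClassifier` applies the same idea recursively with a more efficient implementation.

```python
import numpy as np


def gini(labels):
    """Gini index of a group of class labels (0.0 means the group is pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


def best_split(X, y):
    """Evaluate every attribute and every observed value; keep the lowest-cost split."""
    best = (None, None, float("inf"))
    n_samples, n_features = X.shape
    for feature in range(n_features):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            # Cost is the Gini index of each side, weighted by its share of the rows.
            cost = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if cost < best[2]:
                best = (feature, threshold, cost)
    return best


# Hypothetical toy data: the first attribute separates the classes cleanly.
split_X = np.array([[2.0, 7.0], [3.0, 1.0], [4.0, 6.0],
                    [7.0, 2.0], [8.0, 5.0], [9.0, 3.0]])
split_y = np.array([0, 0, 0, 1, 1, 1])

feature, threshold, cost = best_split(split_X, split_y)
print(feature, threshold, cost)
```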

In [10]:
from sklearn.tree import DecisionTreeClassifier

In [19]:
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
final.append(results.mean())

0.692634996582


### Support Vector Machines
Support Vector Machines (or SVM) seek <span class="mark">a line that best separates two classes</span>. Those data instances that are <span class="mark">closest to the line</span> that best separates the classes <span class="girk">are called support vectors</span> and influence where the line is placed. <span class="girk">SVM</span> has been <span class="mark">extended to support multiple classes</span>. Of <span class="girk">particular importance</span> is the <span class="mark">use of different kernel functions</span> via the kernel parameter. A powerful <span class="girk">Radial Basis Function is used by default</span>. You can construct an SVM model using the SVC class.
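
As a hedged sketch of the `kernel` parameter, the loop below cross-validates `SVC` with each of its built-in kernel names. It then fits a default model to confirm the default kernel and inspect the support vectors. The synthetic data and the names `X_svm`, `y_svm`, and `kernel_scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with your own inputs and labels).
X_svm, y_svm = make_classification(n_samples=300, n_features=8, random_state=7)
svm_folds = KFold(n_splits=5, shuffle=True, random_state=7)

# Compare the built-in kernels on the same folds.
kernel_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svm_model = SVC(kernel=kernel)
    kernel_scores[kernel] = cross_val_score(svm_model, X_svm, y_svm, cv=svm_folds).mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel}: {score:.3f}")

# A default SVC uses the RBF kernel; the fitted model exposes its support vectors.
fitted = SVC().fit(X_svm, y_svm)
print(fitted.kernel)
print(fitted.support_vectors_.shape)
```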

In [12]:
from sklearn.svm import SVC

In [20]:
model = SVC()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
final.append(results.mean())

0.651025290499


In [21]:
final

[0.76951469583048526,
 0.77346206425153796,
 0.72655502392344506,
 0.75517771701982228,
 0.69263499658236494,
 0.65102529049897473]

In [22]:
modelsList = ['Logistic regression','lda','knn','naive','decisiontree','svm']

In [24]:
resdict = pd.DataFrame.from_dict(dict(zip(modelsList,final)),orient='index')
resdict.sort_values(0,ascending=False)

| | 0 |
|---|---|
| lda | 0.773462 |
| Logistic regression | 0.769515 |
| naive | 0.755178 |
| knn | 0.726555 |
| decisiontree | 0.692635 |
| svm | 0.651025 |


In [25]:
# The linear algorithms scored best here, but only on this dataset. What happens if there are categorical variables?