# Fundamentals of machine learning using Python 
## Classification models

***
<br>

## Statistical classification

* Classification is the problem of identifying which of a set of categories (sub-populations) an observation (or observations) belongs to.
* An example of classification is assigning a diagnosis to a patient based on observed characteristics (sex, blood pressure, presence or absence of certain symptoms, etc.).
* An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.

## Different types of classifiers in scikit-learn

* Scikit-Learn provides easy access to numerous different classification algorithms.
* Among these classifiers are:
    * K-Nearest Neighbors
    * Support Vector Machines
    * Decision Tree Classifiers/Random Forests
    * Naive Bayes
    * Linear Discriminant Analysis
    * Logistic Regression

#### K-Nearest Neighbors (kNN)

* kNN classifies a test example by computing its distance to every example in the training set.
* The k closest training examples (the nearest neighbors) are selected, and the test example is assigned the class that is most common among them.

<img src="img/classifier-knn.png" style="width:400px">
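
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the helper `knn_predict` and the toy arrays `X_toy` and `y_toy` are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data (made up for illustration): two small groups of 2-D points
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # point near the first group
print(knn_predict(X_toy, y_toy, np.array([3.9, 4.1])))  # point near the second group
```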

#### Decision Trees

* A Decision Tree Classifier functions by breaking down a dataset into smaller and smaller subsets based on different criteria.
* At each node a splitting criterion (a test on a feature value) divides the data, so the number of examples gets smaller with every division.
* Once the tree reaches a leaf, where no further split is made, the example is assigned the class associated with that leaf.
* When many decision trees, each trained on a random sample of the data and features, are combined into an ensemble, the result is called a Random Forest Classifier.

<img src="img/classifier-decissiontree.png" style="width:500px">
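
To see what the splitting criteria look like in practice, the sketch below fits a shallow tree on the `iris` dataset and prints its rules with `export_text`, then builds a random forest from 100 trees. The values `max_depth=2`, `n_estimators=100` and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# a shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# print the learned splitting rules as text
print(export_text(tree, feature_names=list(iris.feature_names)))

# a random forest: an ensemble of trees, each trained on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)
print(len(forest.estimators_))  # number of trees in the forest
```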

#### Naive Bayes

* A Naive Bayes Classifier uses Bayes' theorem to estimate the probability that an example belongs to each class given its observed feature values, and selects the most probable class.
* The calculation is "naive" because it assumes that the predictors are independent of one another given the class, so each one contributes to the outcome separately.

<img src="img/classifier-naivebayes.png" style="width:500px">
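
A short sketch of the class probabilities, assuming the Gaussian variant (`GaussianNB`), which suits continuous features such as the `iris` measurements. Other variants exist for other feature types.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

nb_model = GaussianNB()
nb_model.fit(iris.data, iris.target)

# class probabilities for the first flower in the dataset (one value per class)
probabilities = nb_model.predict_proba(iris.data[:1])
print(probabilities.round(3))

# the predicted class is the one with the highest probability
print(nb_model.predict(iris.data[:1]))
```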

#### Linear Discriminant Analysis

* Linear Discriminant Analysis works by reducing the dimensionality of the dataset, projecting the data points onto a lower-dimensional space (for two classes, a line) chosen so that the classes are separated as well as possible.
* A point is then assigned to a class based on its distance to the class centroids (class means) in the projected space.
* As the name suggests, it is a linear classification algorithm and works best when the classes can be separated by a linear boundary.

<img src="img/classifier-lda.png" style="width:500px">
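
The projection can be observed directly. The sketch below uses the two-class `breast cancer` dataset, chosen here only because two classes give exactly one projection axis; it reduces 30 features to a single value per sample.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

cancer = load_breast_cancer()

lda = LinearDiscriminantAnalysis()
lda.fit(cancer.data, cancer.target)

# with two classes the data can be projected onto at most one axis (a line)
projected = lda.transform(cancer.data)
print(cancer.data.shape, "->", projected.shape)

# LDA can also be used directly as a classifier
print(lda.predict(cancer.data[:5]))
```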

#### Support Vector Machines (SVM)

* SVMs work by drawing a line (in more than two dimensions, a hyperplane) between the groups of data points that belong to different classes.
* Points on one side of the line belong to one class and points on the other side belong to the other class.
* The classifier tries to maximize the margin, that is, the distance between the line and the nearest points on either side of it, to increase its confidence in which points belong to which class.
* A test point is assigned the class of the side of the line it falls on.

<img src="img/classifier-svm.jpg" style="width:500px">
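
A minimal sketch of the separating line on made-up data: `X_toy`, `y_toy` and `new_points` are illustrative values, not part of the course datasets. For a linear kernel, the fitted model exposes the line's coefficients, and the sign of `decision_function` shows on which side a point lies.

```python
import numpy as np
from sklearn.svm import SVC

# toy data (made up for illustration): two linearly separable groups
X_toy = np.array([[1.0, 1.0], [1.5, 0.5], [1.0, 0.0],
                  [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm_model = SVC(kernel="linear")
svm_model.fit(X_toy, y_toy)

# the separating line is w . x + b = 0
w = svm_model.coef_[0]
b = svm_model.intercept_[0]
print("w =", w, "b =", b)

# training points closest to the line (the support vectors)
print(svm_model.support_vectors_)

# the sign of the decision function tells which side of the line a point is on
new_points = np.array([[0.5, 0.5], [4.5, 4.5]])
print(svm_model.decision_function(new_points))
print(svm_model.predict(new_points))
```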

#### Logistic Regression

* Logistic Regression estimates the probability, a value between 0 and 1, that a test data point belongs to class 1.
* If the estimated probability is 0.5 or above, the point is classified as belonging to class 1; if it is below 0.5, it is classified as belonging to class 0.

<img src="img/classifier-logisticregression.png" style="width:500px">
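
The 0.5 threshold can be checked by hand. The sketch below uses made-up one-feature data (`X_toy`, `y_toy`, `new_points` are illustrative only) and compares thresholding the output of `predict_proba` with what `predict` returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data (made up for illustration): one feature, two classes
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

logreg = LogisticRegression()
logreg.fit(X_toy, y_toy)

new_points = np.array([[1.0], [2.8], [4.0]])

# estimated probability of class 1 for each new point
proba_class_1 = logreg.predict_proba(new_points)[:, 1]
print(proba_class_1.round(3))

# applying the 0.5 threshold by hand gives the same labels as predict()
manual_labels = (proba_class_1 >= 0.5).astype(int)
print(manual_labels)
print(logreg.predict(new_points))
```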

## The use of classifiers

* All scikit-learn classification models follow a similar workflow and share the same interface.
* The `fit(train_data, train_target)` method prepares the classifier for operation; this process is called training (learning) the model.
* The `predict(test_data)` method is used to assign (predict) new objects to classes.
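
Because the interface is shared, different classifiers can be swapped in without changing the surrounding code. A minimal sketch on the `iris` dataset; the choice of three models and the split parameters (`test_size=0.2`, `random_state=0`) are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# three different algorithms, one shared fit/predict interface
models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # training
    y_pred = model.predict(X_test)   # prediction
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, scores[name])
```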

## Example 1: Construction of kNN model for classifying objects from the `iris` dataset

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# load iris dataset
iris_dataset = datasets.load_iris()

# split the data into training and testing sets
# test_size specifies the fraction of the data to set aside for the testing set
# random_state is the random seed; fixing its value makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(iris_dataset.data, iris_dataset.target, test_size=0.20, random_state=27)

# creating a model that uses the 5 nearest neighbours to make decisions
knn_model = KNeighborsClassifier(n_neighbors=5)

# fit the classifier on training data
knn_model.fit(X_train, y_train)

# predict and store the prediction in a variable
prediction = knn_model.predict(X_test)

# evaluate how the classifier performed (true labels first, then predictions)
print(accuracy_score(y_test, prediction))

0.9666666666666667


##### We obtained a high classification accuracy of about 96.7% (29 of the 30 test examples). This suggests that the kNN model is a good choice as a classifier of iris flowers.
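
A single train/test split gives only one estimate of accuracy. As an optional check, the sketch below repeats the evaluation with 5-fold cross-validation using `cross_val_score`; the choice of `cv=5` is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# 5-fold cross-validation: five different train/test splits instead of one
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), iris.data, iris.target, cv=5
)
print(cv_scores)
print("mean accuracy:", cv_scores.mean())
```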

## Example 2: Construction of SVM model for classifying objects from the `breast cancer` dataset

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

In [3]:
# load and explore the dataset
cancer_data = datasets.load_breast_cancer()

print("Features: ", cancer_data.feature_names)
print("Labels: ", cancer_data.target_names)
print("Data size: ", cancer_data.data.shape)

Features:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels:  ['malignant' 'benign']
Data size:  (569, 30)


In [4]:
# split dataset into training set and test set
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target, test_size=0.3, random_state=109)

print(X_train.shape, X_test.shape)

(398, 30) (171, 30)


In [5]:
# create an SVM classifier with a linear kernel
clf_svm = svm.SVC(kernel='linear')

# train the model using the training sets
clf_svm.fit(X_train, y_train)

# predict the response for test dataset
y_pred = clf_svm.predict(X_test)

# model accuracy: how often is the classifier correct?
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9649122807017544


##### Again, we obtained a high classification accuracy of about 96.5% (165 of the 171 test examples). This suggests that the SVM model could be a good choice to support the assessment of new patients.
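
Accuracy alone does not show which kind of mistakes the model makes, and in a diagnostic setting that distinction may matter. The sketch below repeats the split and model from the cells above and prints a confusion matrix and a per-class report; it is an optional addition, not part of the original example.

```python
from sklearn import datasets, svm
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

cancer_data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target, test_size=0.3, random_state=109
)

clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(X_train, y_train)
y_pred = clf_svm.predict(X_test)

# rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# precision and recall for each class
print(classification_report(y_test, y_pred, target_names=list(cancer_data.target_names)))
```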

## --- Exercise ---

Build a logistic regression model, train and evaluate it on `data/pima-indians-diabetes.csv` data.

In [6]:
# dataset loading

import pandas as pd

dataset = pd.read_csv("data/pima-indians-diabetes.csv")
dataset.head()

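| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |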


In [7]:
# selecting features and targets

data = dataset.iloc[:,:8].values
target = dataset.iloc[:,8]

data.shape, target.shape

((768, 8), (768,))

In [None]:
from sklearn.linear_model import LogisticRegression

# Write your code here