# <font color = #4854E8> Supervised Learning </font>

Supervised learning algorithms can be further divided into two types of problems: regression and classification.

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification1.png)

# <font color = #4854E8> How Does Supervised Learning Work? </font>

In supervised learning, models are trained on a labelled dataset, from which the model learns the relationship between the input data and its label. Once training is complete, the model is evaluated on test data (a held-out part of the labelled data that was not used for training), and it can then predict the output for new inputs.

How supervised learning works can be understood from the following example and diagram:

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification2.png)

Suppose we have a dataset of different shapes, including squares, rectangles, triangles, and other polygons such as hexagons. The first step is to train the model on labelled examples of each shape:

- If the given shape has four sides and all sides are equal, it is labelled as a square.
- If the given shape has three sides, it is labelled as a triangle.
- If the given shape has six equal sides, it is labelled as a hexagon.

After training, we test the model on the test set, where its task is to identify each shape.

Because the model has been trained on all of these shape types, when it encounters a new shape it classifies it on the basis of its number of sides and predicts the output.
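The train-then-predict workflow described above can be sketched in a few lines. This is only an illustration: the toy feature encoding (number of sides, whether all sides are equal) and the choice of a decision tree are assumptions made for the example, not part of the original material.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 1], [3, 0], [3, 1], [6, 1], [6, 1]]
y_train = ["square", "square", "triangle", "triangle", "hexagon", "hexagon"]

# Training: the model learns the mapping from features to labels
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction: the trained model assigns a label to each new shape
X_new = [[3, 0], [6, 1], [4, 1]]
predictions = model.predict(X_new)
print(predictions)
```

Any other classifier from the list later in this section could be swapped in; the fit/predict pattern stays the same.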

# <font color = #4854E8> What Is Classification? </font>

Classification is the process of recognizing, understanding, and grouping ideas and objects into preset categories or “sub-populations.” Using pre-categorized training datasets, machine learning programs use a variety of algorithms to classify future datasets into categories.

Classification algorithms in machine learning use input training data to predict the likelihood that subsequent data will fall into one of the predetermined categories. One of the most common uses of classification is filtering emails into “spam” or “non-spam.” 

In short, classification is a form of “pattern recognition,” with classification algorithms applied to the training data to find the same pattern (similar words or sentiments, number sequences, etc.) in future sets of data.

## Popular Classification Algorithms:
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Support Vector Machines
- Naive Bayes
- etc

## Types of Logistic Regression
1. Binary Logistic Regression: the categorical response has only two possible outcomes. Example: spam or not spam.
2. Multinomial Logistic Regression: three or more categories without ordering. Example: predicting which food is preferred (Veg, Non-Veg, Vegan).
3. Ordinal Logistic Regression: three or more categories with ordering. Example: movie rating from 1 to 5.

# <font color = #4854E8> Binary Logistic Regression </font>
 

Logistic regression is a statistical model used to predict a binary outcome: either something happens or it does not. This can be expressed as Yes/No, Pass/Fail, Alive/Dead, etc.

Independent variables are analyzed to determine the binary outcome with the results falling into one of two categories. The independent variables can be categorical or numeric, but the dependent variable is always categorical. 

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/c6fbaf6182755b528d3232d60408a2414b6d76c1)

or

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/da88d81e6a0169ec2da7d7bc0e0f4efef4b9e2c7)

where the base b is usually taken to be e, so that b^z = exp(z).
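As a small numerical illustration, assuming the usual form in which the log-odds are a linear function of the predictors and the base is e, the sketch below converts a linear predictor into a probability and back. The coefficient and input values are hypothetical and chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p to its log-odds (natural logarithm, base e)."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients and inputs, chosen only for illustration
beta0, beta1, beta2 = -1.0, 0.5, 2.0
x1, x2 = 2.0, 0.25

log_odds = beta0 + beta1 * x1 + beta2 * x2  # linear predictor
p = sigmoid(log_odds)                       # probability of the positive class
print(log_odds, p)
```

Applying `logit` to `p` recovers the linear predictor, which is why the two displayed forms are equivalent.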

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification3.png)

## Estimation of coefficients

The regression coefficients are usually estimated using maximum likelihood estimation.

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image1.jpeg)

Unlike linear regression, there is no closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead, for example:
- Newton's method 
- gradient descent 

This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.
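To make the idea of "revise until converged" concrete, here is a minimal one-dimensional sketch that minimizes the hypothetical function f(x) = exp(x) - 2x (whose minimum is at x = ln 2) with both a Newton step and a gradient descent step. The function, the step size of 0.1, and the tolerance are assumptions made for the illustration.

```python
import math

def grad(x):
    return math.exp(x) - 2.0   # f'(x) for f(x) = exp(x) - 2x

def hess(x):
    return math.exp(x)         # f''(x)

def minimize(step, x0=0.0, tol=1e-10, max_iter=10_000):
    """Repeat x <- x - step(x) until |f'(x)| <= tol; return (x, iterations)."""
    x, n_iter = x0, 0
    while abs(grad(x)) > tol and n_iter < max_iter:
        x -= step(x)
        n_iter += 1
    return x, n_iter

# Newton's method: step = f'(x) / f''(x)
x_newton, n_newton = minimize(lambda x: grad(x) / hess(x))
# Gradient descent: step = learning_rate * f'(x), with learning_rate = 0.1
x_gd, n_gd = minimize(lambda x: 0.1 * grad(x))

print(x_newton, n_newton)
print(x_gd, n_gd)
```

Both runs stop near ln 2, but Newton's method uses curvature information and needs far fewer iterations, which mirrors the iteration counts reported for the two approaches later in this section.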

### Newton-Raphson

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.5.jpeg)

### Gradient Descent

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.4.jpeg)

## f(β), ∇f(β), and ∇²f(β)

f(β) = the negative log-likelihood function (the quantity minimized in the code below)

∇f(β) = the first derivative (gradient of f(β))

∇²f(β) = the second derivative (Hessian matrix of f(β))

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.1.jpeg)
![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.2.jpeg)
![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.3.jpeg)
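For reference, the three quantities can be written out as follows. This is a transcription of what the functions `f`, `deriv1_f`, and `deriv2_f` in the code cells of this section compute (the negative log-likelihood and its derivatives), under the assumption that the rows of X are the vectors x_i with a leading 1 and that y_i is 0 or 1.

```latex
f(\beta) = \sum_{i=1}^{n} \left[ \log\left(1 + e^{x_i^\top \beta}\right) - y_i \, x_i^\top \beta \right]

\nabla f(\beta) = X^\top (p - y), \qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}

\nabla^2 f(\beta) = X^\top W X, \qquad W = \operatorname{diag}\big( p_i (1 - p_i) \big)
```

Note that p_i (1 - p_i) equals exp(x_i^T β) / (1 + exp(x_i^T β))², which is the form used in `deriv2_f`.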


## Data Set Information:
Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.


#### Attribute Information:
1. variance of Wavelet Transformed image (continuous) (X1)
2. skewness of Wavelet Transformed image (continuous) (X2)
3. kurtosis of Wavelet Transformed image (continuous) (X3)
4. entropy of image (continuous) (X4)
5. class (integer) (Y)

#### Source:
Owner of database: Volker Lohweg (University of Applied Sciences, Ostwestfalen-Lippe, volker.lohweg '@' hs-owl.de)
Donor of database: Helene Dörksen (University of Applied Sciences, Ostwestfalen-Lippe, helene.doerksen '@' hs-owl.de)
Date received: August, 2012


## Imports

The first step is to import the packages pandas and numpy and to load the banknote authentication dataset:

In [109]:
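# Libraries
import pandas as pd
import numpy as np

# Load the UCI banknote authentication data (comma-separated, no header row).
# Note: header=0 makes pandas treat the first record as a header and discard it,
# so 1371 of the 1372 records are loaded; header=None would keep all of them.
mydata = pd.read_table("https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt",
        delimiter=",", header=0, names=["X1","X2","X3","X4","Y"])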

In [110]:
mydata.shape

(1371, 5)

In [111]:
# get input "X1","X2","X3","X4"
inputX = mydata[["X1","X2","X3","X4"]].to_numpy()

In [112]:
# define input X
X_1 = np.matrix(np.ones(mydata.shape[0])).T
X = np.append(X_1,inputX,axis=1)

# define output Y
Y = mydata[["Y"]].to_numpy()

In [7]:
X[:5,:]

matrix([[ 1.     ,  4.5459 ,  8.1674 , -2.4586 , -1.4621 ],
        [ 1.     ,  3.866  , -2.6383 ,  1.9242 ,  0.10645],
        [ 1.     ,  3.4566 ,  9.5228 , -4.0112 , -3.5944 ],
        [ 1.     ,  0.32924, -4.4552 ,  4.5718 , -0.9888 ],
        [ 1.     ,  4.3684 ,  9.6718 , -3.9606 , -3.1625 ]])

## Logistic Regression: Newton Raphson Approach

The Newton-Raphson method (also known as Newton's method) is a way to quickly find a good approximation of a root of a real-valued function, that is, a solution of f(x) = 0. It uses the idea that a continuous and differentiable function can be approximated by a straight line tangent to it. Here it is applied to the gradient, solving ∇f(β) = 0.
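The tangent-line idea can be illustrated with a minimal root-finding sketch. The helper `newton_root` and the example function x² - 2 (whose positive root is √2) are hypothetical and serve only to show the update x ← x - f(x)/f'(x).

```python
def newton_root(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Approximate a root of f by repeatedly following the tangent line."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        # The tangent at x crosses zero at x - f(x) / f'(x)
        x -= fx / f_prime(x)
    return x

root = newton_root(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

For logistic regression the same update is used in several dimensions, with the gradient in place of f and the Hessian in place of f'.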

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.5.jpeg)

## Define f(β), ∇f(β), and ∇²f(β)

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.2.jpeg)
![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.3.jpeg)

In [32]:
# f(β): negative log-likelihood (the function being minimized)
def f(beta):
    return np.ravel(np.ones(len(Y))*(np.log(1+np.exp(X*beta)))-Y.T*X*beta)[0]

# ∇f(β)
def deriv1_f(beta):
    return X.T*(1/(1+1/np.exp(X*beta))-Y)

# ∇²f(β)
def deriv2_f(beta):
    return X.T*(np.diag(np.ravel(np.exp(X*beta)/np.power(1+np.exp(X*beta),2)))*X)

In [33]:
# Newton-Raphson method: beta <- beta - (Hessian)^(-1) * gradient
beta = np.matrix(np.zeros(X.shape[1])).T
TOL = np.power(10.,-10)
counter = 0

while np.linalg.norm(deriv1_f(beta)) > TOL:
  counter += 1
  beta -= np.linalg.inv(deriv2_f(beta))*deriv1_f(beta)
  
print('iter =',counter)
print(beta)
print('norm =',np.linalg.norm(deriv1_f(beta)))

iter = 13
[[ 7.32180471]
 [-7.85933049]
 [-4.19096321]
 [-5.28743068]
 [-0.60531897]]
norm = 1.6992576342907546e-13
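As a possible variation, the same Newton iteration can be written with plain NumPy arrays and `np.linalg.solve`, which avoids forming the inverse of the Hessian explicitly. The sketch below runs on synthetic data generated for the illustration (an assumption, not the banknote dataset), and the names `X_demo`, `y_demo`, and `beta_hat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the banknote dataset)
n = 500
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
p_true = 1.0 / (1.0 + np.exp(-X_demo @ beta_true))
y_demo = (rng.random(n) < p_true).astype(float)

def gradient(beta):
    p = 1.0 / (1.0 + np.exp(-X_demo @ beta))
    return X_demo.T @ (p - y_demo)

def hessian(beta):
    p = 1.0 / (1.0 + np.exp(-X_demo @ beta))
    # X^T W X with W = diag(p * (1 - p)), without building the n x n matrix
    return X_demo.T @ (X_demo * (p * (1.0 - p))[:, None])

beta_hat = np.zeros(X_demo.shape[1])
for n_iter in range(1, 51):
    # Solve H d = g for the Newton step instead of computing inv(H)
    beta_hat -= np.linalg.solve(hessian(beta_hat), gradient(beta_hat))
    if np.linalg.norm(gradient(beta_hat)) < 1e-10:
        break

print(n_iter, beta_hat)
```

Solving the linear system is generally cheaper and numerically safer than inverting the Hessian, and scaling the rows of X by the weights avoids the large diagonal matrix built with `np.diag`.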


## Logistic Regression: Gradient Descent Approach

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.1.jpeg)
![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.2.jpeg)
![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image2.3.jpeg)

beta = beta - learning_rate * ∇f(β) (the gradient, i.e. the first derivative of the loss function)

In [34]:
# Gradient Descent
beta = np.matrix(np.zeros(X.shape[1])).T
TOL = np.power(10.,-10)
lam = 0.001 # learning_rate
counter = 0

while np.linalg.norm(deriv1_f(beta)) > TOL:
  counter += 1
  beta -= lam*deriv1_f(beta)
    
print('iter =',counter)
print(beta)
print('norm =',np.linalg.norm(deriv1_f(beta)))

iter = 164906
[[ 7.32180471]
 [-7.85933049]
 [-4.19096321]
 [-5.28743068]
 [-0.60531897]]
norm = 9.998779425131155e-11


In [10]:
# Classification Process
Xtest = [[1, 0.4,0.5,1.0,1.5]] # X1=0.4, X2=0.5, X3=1.0, X4=1.5
p = (np.exp(np.dot(Xtest, beta)) / (1 + np.exp(np.dot(Xtest, beta))))
p

matrix([[0.01609821]])
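The probability above still has to be turned into a class label. A minimal sketch, using the coefficients printed in the output above and assuming the common 0.5 decision threshold (the threshold is an assumption, not something fixed by the model):

```python
import numpy as np

# Coefficients copied from the fitted output shown above
beta_fit = np.array([7.32180471, -7.85933049, -4.19096321, -5.28743068, -0.60531897])

# Leading 1 is the intercept term, then X1=0.4, X2=0.5, X3=1.0, X4=1.5
x_new = np.array([1.0, 0.4, 0.5, 1.0, 1.5])

p_new = 1.0 / (1.0 + np.exp(-x_new @ beta_fit))  # estimated P(Y = 1 | x_new)
label = int(p_new >= 0.5)                        # 0.5 threshold assumed here
print(p_new, label)
```

Since the estimated probability of class 1 is about 0.016, the observation is assigned to class 0.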

## Logistic Regression: Scikit-Learn

In [115]:
# Library
from sklearn.linear_model import LogisticRegression

# Newton-Conjugate Gradient
# (scikit-learn >= 1.2 expects penalty=None; older releases used the string "none")
clf1 = LogisticRegression(penalty=None,solver='newton-cg',fit_intercept=False)
clf1.fit(X,np.ravel(Y))
print(clf1.coef_)

[[ 7.32180341 -7.85932911 -4.19096249 -5.28742975 -0.60531885]]


In [119]:
# Stochastic Average Gradient
# (scikit-learn >= 1.2 expects penalty=None; older releases used the string "none")
clf2 = LogisticRegression(penalty=None,solver='sag',fit_intercept=False,max_iter=1000000)
clf2.fit(X,np.ravel(Y))
print(clf2.coef_)

[[ 6.77654642 -7.23563953 -3.87113226 -4.87185514 -0.54036622]]




## Classification Process

In [22]:
clf1.predict([[1, 0.4,0.5,1.0,1.5]])

array([0], dtype=int64)

# <font color = #4854E8> K-Nearest Neighbors </font>

K-nearest neighbors (kNN) is a simple algorithm used for classification. It stores all available cases and classifies a new case by a majority vote of its k nearest neighbors: the new case is assigned to the class most common among its K nearest neighbors, as measured by a distance function (Euclidean, Manhattan, Minkowski, or Hamming).

The first three distance functions are used for continuous variables, whereas the Hamming distance is used for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. Choosing K can be a challenge when building a kNN model.

## Distance metrics

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/image3.JPG)
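As a rough illustration of the four distance functions named above, the following sketch implements them directly. The function names are hypothetical, and the Hamming distance is taken here as the count of differing positions (some libraries report the fraction instead).

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def euclidean(a, b):
    return minkowski(a, b, p=2)   # special case p = 2

def manhattan(a, b):
    return minkowski(a, b, p=1)   # special case p = 1

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(hamming(["red", "S", "yes"], ["red", "M", "no"]))
```

Euclidean and Manhattan distance are the p = 2 and p = 1 cases of the Minkowski distance, which is why `KNeighborsClassifier(metric='minkowski', p=2)` corresponds to Euclidean distance.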

## Properties

- K-Nearest Neighbors is one of the simplest machine learning algorithms based on the supervised learning technique.
- The K-NN algorithm assumes similarity between the new case and the available cases and puts the new case into the category it is most similar to.
- The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category.
- K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data distribution.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation at classification time.
- In the training phase, the KNN algorithm just stores the dataset; when it receives new data, it classifies that data into the category most similar to it.
- Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. We can use the KNN algorithm for this identification, as it works on a similarity measure. The KNN model compares the features of the new image with those of the cat and dog images and, based on the most similar features, puts it in either the cat or the dog category.

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification4.png)

## Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories does this data point lie? To solve this type of problem, we can use the K-NN algorithm, which lets us identify the category or class of a particular data point. Consider the diagram below:

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification5.png)

## How does K-NN work?

The working of K-NN can be explained by the following algorithm:

- Step-1: Select the number K of neighbors.
- Step-2: Calculate the Euclidean distance from the new data point to every training data point.
- Step-3: Take the K nearest neighbors according to the calculated Euclidean distance.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category with the largest number of neighbors.
- Step-6: The assigned category is the prediction.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification6.png)

- First, we choose the number of neighbors; here we choose k=5.
- Next, we calculate the Euclidean distance between the new data point and the existing data points. The Euclidean distance is the straight-line distance between two points, familiar from geometry. It can be calculated as:

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification7.png)

- By calculating the Euclidean distance, we find the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the image below:

![](https://raw.githubusercontent.com/rauzansumara/introduction-to-machine-learning/master/Images/classification8.png)

- Since 3 of the 5 nearest neighbors are from category A, the majority vote assigns the new data point to category A.
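The steps and the k=5 example above can be sketched on a small set of hypothetical 2-D points, arranged so that three of the five nearest neighbors belong to category A and two to category B. The coordinates are invented for the illustration.

```python
from collections import Counter
import math

# Hypothetical labelled points (x, y, category)
training_points = [
    (1.0, 0.0, "A"), (0.0, 1.0, "A"), (-1.0, 0.0, "A"), (5.0, 5.0, "A"),
    (1.0, 1.0, "B"), (-1.0, -1.0, "B"), (4.0, 4.0, "B"),
]
new_point = (0.0, 0.0)
k = 5  # Step 1: choose the number of neighbors

# Step 2: Euclidean distance from the new point to every training point
distances = [(math.dist(new_point, (x, y)), label) for x, y, label in training_points]

# Step 3: keep the k nearest neighbors
nearest = sorted(distances)[:k]

# Step 4: count the neighbors in each category
votes = Counter(label for _, label in nearest)

# Step 5: assign the category with the most votes
prediction = votes.most_common(1)[0][0]
print(votes, prediction)
```

Choosing an odd k for a two-class problem, as here, avoids tied votes.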

## Things to Consider Before Selecting KNN:
- KNN is computationally expensive at prediction time, because distances to all stored training points must be computed.
- Variables should be normalized; otherwise variables with larger ranges can dominate the distance.
- More pre-processing, such as outlier and noise removal, is needed before applying kNN.
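The effect of normalization can be seen in a small sketch with two hypothetical features on very different scales (age in years and income in currency units). The data values are invented for the illustration.

```python
import numpy as np

# Hypothetical data: columns are [age in years, income in currency units]
train = np.array([[25.0, 20000.0],
                  [31.0, 52000.0],
                  [70.0, 50500.0],
                  [60.0, 90000.0]])
query = np.array([30.0, 50000.0])

def nearest_index(points, q):
    """Index of the row in points closest to q (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Min-max scaling using the training minimum and range of each column
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
train_scaled = (train - col_min) / col_range
query_scaled = (query - col_min) / col_range

idx_raw = nearest_index(train, query)                    # income dominates
idx_scaled = nearest_index(train_scaled, query_scaled)   # both features count
print(idx_raw, idx_scaled)
```

Without scaling, the nearest neighbor is the 70-year-old with a similar income; after scaling, it is the 31-year-old, who is close in both age and income.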

In [100]:
# library
from scipy import stats

df = pd.read_table("https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt",
        delimiter=",", header=0, names=["X1","X2","X3","X4","Y"])

X = df[["X1","X2","X3","X4"]].to_numpy()
Y = df[["Y"]].to_numpy()

In [41]:
# Feature Scaling (MinMax Normalize) 
def NormalizeData(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))

# Normalize dataset
scale_X = pd.DataFrame(X).apply(NormalizeData, axis=0).to_numpy()

In [42]:
scale_X

array([[0.83565902, 0.82098209, 0.12180412, 0.64432563],
       [0.78662859, 0.41664827, 0.31060805, 0.78695091],
       [0.75710505, 0.87169921, 0.05492063, 0.45043964],
       ...,
       [0.23738543, 0.01176814, 0.98560321, 0.52475518],
       [0.25084193, 0.20170105, 0.76158701, 0.6606745 ],
       [0.32452819, 0.49074676, 0.34334762, 0.88594888]])

In [46]:
# Generate values
nr = np.arange(scale_X.shape[0])

# shuffle datasets
np.random.shuffle(nr)
datX = scale_X[nr] # shuffle X
datY = Y[nr] # shuffle Y

# set 80% train and 20% test sets
training_ratio = 0.8
nr_split = np.round(datX.shape[0]*training_ratio, 0).astype(int)

# Divide data into train and test sets
X_train = datX[:nr_split, :]
y_train = datY[:nr_split]
X_test = datX[nr_split:, :]
y_test = datY[nr_split:]

In [47]:
X_train.shape

(1097, 4)

In [89]:
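# Create kNN function
def knn_classifier(X_train, y_train, X_test, k):
    # Broadcast to pairwise differences of shape (n_test, n_train, n_features)
    A = np.expand_dims(X_test, axis=1)
    B = np.expand_dims(X_train, axis=0)
    X_tensor = A - B

    # Euclidean distance between every test point and every training point
    D = np.sqrt(np.sum(np.power(X_tensor, 2), axis=2))
    # Training indices sorted from nearest to farthest for each test point
    nr = np.argsort(D, axis = 1)
    # Majority vote among the labels of the k nearest neighbors
    return stats.mode(y_train[nr][:,:k], axis = 1, keepdims=False)[0]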

# call k-NN function
y_test_est = knn_classifier(X_train, y_train, X_test, k = 3)

In [99]:
# result
y_test_est[:10]

array([[0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0]], dtype=int64)

## K-Nearest Neighbors: Scikit-Learn

In [105]:
# Library
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Normalize Data 
stdX = MinMaxScaler().fit_transform(X)

# Splitting the dataset into training and test set.  
X_train, X_test, y_train, y_test = train_test_split(stdX, Y, test_size= 0.20, random_state=44)  

In [106]:
# Fitting K-NN classifier to the training set  
classifier = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)  
classifier.fit(X_train, np.ravel(y_train))

In [107]:
# Predicting the test set  
y_pred = classifier.predict(X_test)

In [108]:
y_pred[:10]

array([0, 1, 0, 1, 1, 1, 0, 1, 1, 0], dtype=int64)