# Classification models practice

## General notes:

### Weights:
A weight sets how strongly a particular input influences a node's output.

### Bias:
A bias value is added to the weighted sum, which shifts the activation function curve left or right along the input axis.

In [2]:
import pandas as pd
import numpy as np
import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score

## Naive Bayes

#### Basic math behind it:

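Bayes' theorem:
### $\begin{equation*}P(A|B) = \frac{P(B|A)P(A)}{P(B)}\end{equation*}$

-> For classification, the aim is to find the class $y$ with the highest posterior probability given the features $x_1, ..., x_n$ (the maximum a posteriori, or MAP, estimate). The "naive" assumption is that the features are independent given the class, so the likelihood factorises into a product.

### $\begin{equation*} \hat{y}_{MAP} = \underset{y}{\arg\max} \, \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}\end{equation*}$

-> Since the aim is only to find the most probable class, the normalising term (the denominator, which plays the role of $P(B)$ in Bayes' theorem) can be dropped, because it is the same for every class.
### $\begin{equation*} \hat{y}_{MAP} = \underset{y}{\arg\max} \, P(y)\prod_{i=1}^{n}P(x_i|y)\end{equation*}$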
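The MAP rule can be sketched in a few lines of plain Python. This is only an illustration: the priors and likelihoods below are made-up numbers, not values estimated from spambase, and the sum of logs is used instead of the product to avoid multiplying many small probabilities.

```python
import math

# Hypothetical numbers, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}            # P(y)
likelihoods = {                                # P(x_i | y)
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.05, "meeting": 0.20},
}

def map_class(features):
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            score += math.log(likelihoods[label][feature])
        scores[label] = score
    return max(scores, key=scores.get), scores

print(map_class(["free"])[0])             # spam (0.4 * 0.30 > 0.6 * 0.05)
print(map_class(["free", "meeting"])[0])  # ham  (0.0024 < 0.006)
```

Because the normalising term is the same for both classes, leaving it out does not change which class wins.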
      

In [4]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

dataset = np.loadtxt('spambase.data', delimiter=',')
# print(dataset[0])

df = pd.DataFrame(dataset, columns=['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology', 'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'spam-ham'])
df.shape
# df

(4601, 58)

In [3]:
df.describe()

statistic,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam-ham
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [4]:
train, test = train_test_split(df, test_size=0.2)
# train.describe()
# test.describe()

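# Features are the first 57 columns; the label is the last column ('spam-ham').
X_train = train.iloc[:, :57]
# Select the label as a 1-D Series. A one-column DataFrame (57:) makes
# scikit-learn emit a DataConversionWarning in fit().
y_train = train.iloc[:, 57]

X_test = test.iloc[:, :57]
y_test = test.iloc[:, 57]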

## Naive Bayes: Gaussian method

-> Assumes each feature is normally distributed within each class, and plugs the per-class mean and SD into the Probability Density Function.
    
 ### PDF : $\begin{equation*} f(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}e^{-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}}\end{equation*}$
  
  where:  
      $\mu$ --> Mean given by $\begin{equation*} \mu = \frac{\sum x}{n}\end{equation*}$        
      $\sigma$ --> Standard Deviation given by $\begin{equation*} \sigma = \sqrt{\frac{\sum(x-\mu)^2}{n}} \end{equation*}$  
      $\sigma^2$ --> Variance
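As a rough check of the formulas, the mean, standard deviation, and density can be computed by hand. The feature values below are hypothetical, and the helper names are only for this sketch; it is not how `GaussianNB` is implemented internally.

```python
import math

def mean(values):
    """Arithmetic mean: sum(x) / n."""
    return sum(values) / len(values)

def std(values):
    """Population standard deviation (divides by n)."""
    mu = mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

def gaussian_pdf(x, mu, sigma):
    """Gaussian density f(x | y) for one feature with class mean mu and SD sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical values of one feature for the training emails of one class.
class_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = mean(class_values), std(class_values)
print(mu, sigma)                     # 5.0 2.0
print(gaussian_pdf(5.0, mu, sigma))  # about 0.1995, the peak of the curve
print(gaussian_pdf(9.0, mu, sigma))  # smaller: 9.0 is two SDs from the mean
```

In the classifier, one such density is evaluated per feature and per class, and the results take the place of $P(x_i|y)$ in the MAP rule.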

In [5]:

model = GaussianNB()

model.fit(X_train, y_train)

prediction_G = model.predict(X_test)





In [6]:
print('Accuracy', accuracy_score(y_test, prediction_G))

Accuracy 0.8023887079261672


## Naive Bayes: Bernoulli method

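Similar to Multinomial, but intended for binary (present/absent) features rather than counts.

For example, whether a word occurs in a document or not. scikit-learn's `BernoulliNB` thresholds each feature with its `binarize` parameter (0.0 by default), so the frequencies here are treated as "occurs" or "does not occur".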
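The "occurs or not" idea can be sketched as follows. The frequencies and per-class probabilities are hypothetical, and the helper names are made up for this example; the point is that, with binary features, a word being absent also contributes to the likelihood.

```python
def binarize(row, threshold=0.0):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

def bernoulli_likelihood(binary_row, p_present):
    """P(x | y) = prod_i p_i^x_i * (1 - p_i)^(1 - x_i) for binary features."""
    likelihood = 1.0
    for x, p in zip(binary_row, p_present):
        likelihood *= p if x == 1 else (1.0 - p)
    return likelihood

frequencies = [0.0, 0.42, 1.3]   # hypothetical word frequencies for one email
binary = binarize(frequencies)
print(binary)                    # [0, 1, 1]

# Hypothetical P(word present | spam) for the same three words.
p_present_spam = [0.2, 0.5, 0.9]
print(bernoulli_likelihood(binary, p_present_spam))  # 0.8 * 0.5 * 0.9
```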

In [7]:
model_B = BernoulliNB()

model_B.fit(X_train, y_train)

prediction_B = model_B.predict(X_test)



In [8]:
print('BernoulliNB Accuracy:', accuracy_score(y_test, prediction_B))

BernoulliNB Accuracy: 0.8773072747014115


## Naive Bayes: Multinomial

Feature vectors hold the frequencies with which events were generated by a *multinomial distribution*, which makes this method a good fit for tasks like document classification.

For example, the features can be the frequencies of occurrence of different words in a document.
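A small count-based sketch of the idea, using a made-up four-document corpus. Add-one (Laplace) smoothing with `alpha=1.0` is an assumption made here so that unseen words do not produce a zero probability; the function names are hypothetical.

```python
import math
from collections import Counter

# Tiny hypothetical corpus: (words in the document, label).
training = [
    (["free", "money", "free"], "spam"),
    (["free", "offer"], "spam"),
    (["meeting", "project"], "ham"),
    (["project", "report", "meeting"], "ham"),
]

vocab = sorted({word for words, _ in training for word in words})
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for words, label in training:
    word_counts[label].update(words)
    doc_counts[label] += 1

def log_posterior(words, label, alpha=1.0):
    """log P(y) + sum over words of log P(word | y), with additive smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(training))
    for word in words:
        smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        score += math.log(smoothed)
    return score

def classify(words):
    """Pick the label with the highest (unnormalised) log posterior."""
    return max(word_counts, key=lambda label: log_posterior(words, label))

print(classify(["free", "money"]))       # spam
print(classify(["project", "meeting"]))  # ham
```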


In [9]:
model_Multi = MultinomialNB()

model_Multi.fit(X_train, y_train)

prediction_Multi = model_Multi.predict(X_test)



In [10]:
print('MultinomialNB Accuracy:', accuracy_score(y_test, prediction_Multi))

MultinomialNB Accuracy: 0.7730727470141151


## Stochastic Gradient Descent

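Stochastic gradient descent (SGD) is gradient descent in which the loss is minimised by updating the weights after each single training example (a batch size of 1), rather than after computing the error over the whole training set. With `loss="hinge"`, `SGDClassifier` fits a linear SVM.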
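The one-example-per-step update can be sketched in plain Python with the hinge loss. This is a simplified illustration on made-up data: it has no regularisation term and a fixed learning rate, whereas `SGDClassifier` applies an L2 penalty and a learning-rate schedule by default, so it will not reproduce scikit-learn's numbers.

```python
import random

def sgd_hinge(X, y, lr=0.1, epochs=20, seed=0):
    """Fit w, b for a linear classifier with hinge loss, one example per update.

    Labels in y must be -1 or +1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            if y[i] * score < 1:
                # Subgradient step for max(0, 1 - y * score) on this one example.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    """Sign of the linear score, as -1 or +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = sgd_hinge(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, -1, -1]
```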

In [11]:
from sklearn.linear_model import SGDClassifier

sgd_model = SGDClassifier(loss="hinge", max_iter=5, tol=None)
sgd_model.fit(X_train, y_train)

sgd_pred = sgd_model.predict(X_test)



In [12]:
print('SGD Accuracy:', accuracy_score(y_test, sgd_pred))

SGD Accuracy: 0.749185667752443


## Random Forest

It builds multiple decision trees and combines their predictions to get a more accurate and stable prediction than any single tree.

A decision tree chooses the features that partition the data as effectively as possible, so the features that split the data best end up higher in the tree.

A random forest considers only a random subset of the features at each split. This makes the trees less alike and helps the forest generalise better.
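The building blocks can be sketched without training any trees. The bootstrap sampling step is standard bagging and is an assumption added here (the notes above only mention the random feature subsets); the subset size of 7 is roughly the square root of the 57 features, a common choice. All function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the data one tree is trained on."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at a single split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(tree_predictions):
    """Combine the trees' predictions for one sample into a single label."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                         # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)           # some rows repeat, some are left out
features = random_feature_subset(57, 7, rng)   # 57 features, as in spambase
print(sample)
print(features)
print(majority_vote([1, 0, 1, 1, 0]))          # 1
```

scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the vote here is a simplification.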


In [13]:
from sklearn.ensemble import RandomForestClassifier

randForest_model = RandomForestClassifier()
randForest_model.fit(X_train, y_train)

randForest_pred = randForest_model.predict(X_test)



In [14]:
print("Random Forest Accuracy:", accuracy_score(y_test, randForest_pred))

Random Forest Accuracy: 0.9381107491856677


## Logistic Regression

-> Used when the output labels are categorical.

-> Predicts a probability, since its output always lies between 0 and 1; thresholding that probability (commonly at 0.5) gives the class.

-> Uses the logistic (sigmoid) function.

   ### $\begin{equation*} sigmoid(x) = \frac{1}{1+e^{-x}} \end{equation*}$
   
   ### Logistic regression equation: 
   ### $\begin{equation*} y = \frac{1}{1+e^{-(\beta_1 x + \beta_0)}} \end{equation*}$
   
   where:
   
   $\beta_1$ -> weight (slope)
   
   $\beta_0$ -> bias (y intercept)
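The equation can be turned into a prediction in a few lines. The coefficient values below are hypothetical, not the ones fitted on spambase, and the 0.5 threshold is the usual convention assumed for this sketch.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta_1, beta_0):
    """Probability of the positive class for a single feature x."""
    return sigmoid(beta_1 * x + beta_0)

def predict_label(x, beta_1, beta_0, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if predict_proba(x, beta_1, beta_0) >= threshold else 0

beta_1, beta_0 = 1.5, -3.0  # hypothetical weight and bias

print(predict_proba(2.0, beta_1, beta_0))  # 0.5, since 1.5 * 2.0 - 3.0 = 0
print(predict_label(4.0, beta_1, beta_0))  # 1
print(predict_label(0.0, beta_1, beta_0))  # 0
```

With more than one feature, the term $\beta_1 x$ becomes a sum of one weight times one feature for each input.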


In [15]:
from sklearn.linear_model import LogisticRegression

LogReg_model = LogisticRegression()
LogReg_model.fit(X_train, y_train)

LogReg_pred = LogReg_model.predict(X_test)



In [16]:
print('Logistic regression Accuracy:', accuracy_score(y_test, LogReg_pred))

Logistic regression Accuracy: 0.9218241042345277


## KNN

The K-Nearest Neighbours algorithm classifies an unknown point by looking at the already-labelled points closest to it. Because it relies on labelled data, it is a supervised learning algorithm.

To classify an unknown point, calculate the Euclidean distance ( $ \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} $ in two dimensions) between the unknown point and every labelled point.

Sort the distances in ascending order and take the k points with the smallest distances.

The label that occurs most often among those k points is the classification for the unknown point.
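The steps above can be sketched from scratch in plain Python. The points and labels are made up, and this brute-force version is only an illustration; `KNeighborsClassifier` can use faster neighbour-search structures.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest neighbours."""
    # Distance from the query to every labelled point, smallest first.
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical labelled points forming two groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(points, labels, (1.2, 1.4)))  # ham
print(knn_predict(points, labels, (6.4, 6.2)))  # spam
```

Because the vote depends on raw distances, features with large ranges (such as `capital_run_length_total` here) can dominate unless the features are scaled first.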

In [17]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

knn_pred = knn_model.predict(X_test)



In [18]:
print('K-Nearest Neighbors Accuracy:', accuracy_score(y_test, knn_pred))

K-Nearest Neighbors Accuracy: 0.8056460369163952


## Perceptron

It is the simplest form of a neural network (a single-layer network): a single artificial neuron that accepts multiple inputs and gives a single output, which makes it a binary classifier.

### How it works:

-> Multiplies each input $x_i$ by its weight $w_i$ and sums the products; this is called the weighted sum.   
 ###   $ \begin{equation*} \sum_{i=1}^n{x_i w_i} \end{equation*} $
 
-> The weighted sum is then passed through an activation function, here the step function. The activation function maps the sum onto the required output values, such as {0, 1} or {-1, 1}.

### Unit Step function:

### $ \begin{equation*} f(x) = \begin{cases} 0 & \text{if } x < 0  \\ 
1 & \text{if } x \ge 0 \end{cases} \end{equation*} $
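A minimal perceptron can be sketched with the weighted sum and the unit step function, here trained on the logical AND of two inputs. The bias term `b` (see the general notes) and the classic error-driven update rule are assumptions of this sketch; the learning rate of 1.0 is chosen only to keep the arithmetic exact.

```python
def step(z):
    """Unit step function: 0 for negative input, 1 otherwise."""
    return 1 if z >= 0 else 0

def predict(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return step(sum(xi * wi for xi, wi in zip(x, w)) + b)

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Nudge weights and bias towards the target whenever a prediction is wrong."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - predict(xi, w, b)  # -1, 0 or +1
            w = [wj + lr * error * xj for wj, xj in zip(w, xi)]
            b += lr * error
    return w, b

# Logical AND: the output is 1 only when both inputs are 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

w, b = train_perceptron(X, y)
print([predict(xi, w, b) for xi in X])  # [0, 0, 0, 1]
```

This works because AND is linearly separable; a single perceptron cannot learn a function such as XOR, which no straight line separates.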

In [19]:
from sklearn.linear_model import Perceptron
perceptron_model = Perceptron(max_iter=50)
perceptron_model.fit(X_train, y_train)

perceptron_pred = perceptron_model.predict(X_test)



In [20]:
print('Perceptron Accuracy:', accuracy_score(y_test, perceptron_pred))

Perceptron Accuracy: 0.6004343105320304


## SVC

Here, each data item is plotted as a point in n-dimensional space, where 'n' is the number of features.

The goal is to find the hyper-plane that separates the classes best.

Support vectors are the data points that lie closest to the separating hyper-plane.

For each candidate hyper-plane, take the distances from the data points to the hyper-plane; the smallest of these distances is that hyper-plane's margin.

Then, across all candidate hyper-planes, take the one with the largest margin.

That hyper-plane is the optimal one: it classifies the data points while staying as far as possible from the closest points of each class.
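The distance from a point $x$ to the hyper-plane $w \cdot x + b = 0$ is $|w \cdot x + b| / \|w\|$. The sketch below applies it to two hand-picked candidate hyper-planes on made-up points and keeps the one with the larger minimum distance. It assumes both candidates already separate the classes, and it is only an illustration: a real SVM solves an optimisation problem rather than comparing a list of candidates.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyper-plane w.x + b = 0."""
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(points, w, b):
    """Smallest distance from any point to the hyper-plane."""
    return min(distance_to_hyperplane(p, w, b) for p in points)

# Hypothetical 2-D points: the first two belong to one class, the last two to the other.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]

# Two candidate separating lines, given as (w, b).
candidates = {
    "A": ((1.0, 1.0), -5.0),  # x + y - 5 = 0
    "B": ((1.0, 0.0), -2.5),  # x - 2.5 = 0
}

for name, (w, b) in candidates.items():
    print(name, margin(points, w, b))

best = max(candidates, key=lambda name: margin(points, *candidates[name]))
print(best)  # A
```

Line B passes only 0.5 away from the point (2, 0), while line A keeps every point about 2.12 away, so A has the larger margin.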

In [1]:
# TODO how distance is calculated

from sklearn.svm import LinearSVC
linearSVC_Model = LinearSVC()
linearSVC_Model.fit(X_train, y_train)

linearSVC_pred = linearSVC_Model.predict(X_test)


In [22]:
print('Linear SVC Accuracy:', accuracy_score(y_test, linearSVC_pred))

Linear SVC Accuracy: 0.8034744842562432


## Decision Tree

At each node, splits the data points into two sets based on the value of a feature.

Repeated splits produce many branches, which together form a decision tree. To classify a data point, the tree follows the branch matching the point's feature values at each node until it reaches a leaf.

TODO
Information gain
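As a starting point for the TODO, information gain can be sketched as the drop in entropy produced by a split. The labels below are made up, and the function names are only for this example. Note that `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion="entropy"` selects the entropy-based measure.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A split that separates the classes perfectly removes all uncertainty.
print(information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0]))  # 1.0
# A split that leaves both children as mixed as the parent gains nothing.
print(information_gain(parent, [1, 1, 0, 0], [1, 1, 0, 0]))  # 0.0
```

A tree built this way would evaluate each candidate split and keep the one with the highest gain.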

In [24]:
from sklearn.tree import DecisionTreeClassifier
DecisionTree_Model = DecisionTreeClassifier()
DecisionTree_Model.fit(X_train, y_train)

DecisionTree_Pred = DecisionTree_Model.predict(X_test)

In [25]:
print('Decision Tree Accuracy:', accuracy_score(y_test, DecisionTree_Pred))

Decision Tree Accuracy: 0.9142236699239956
